Skip to content

Fix fast-cache-load rank synchronization guard#5398

Open
sandyhouse wants to merge 1 commit into
NVIDIA:mainfrom
sandyhouse:bugfix/dataset-fast-load
Open

Fix fast-cache-load rank synchronization guard#5398
sandyhouse wants to merge 1 commit into
NVIDIA:mainfrom
sandyhouse:bugfix/dataset-fast-load

Conversation

@sandyhouse

@sandyhouse sandyhouse commented Jun 18, 2026

Copy link
Copy Markdown
  • I, the PR author, have personally reviewed every line of this PR.

What does this PR do ?

Fix the --dataloader-fast-cache-load rank synchronization guard in BlendedMegatronDatasetBuilder.

The existing check used the dataset class object when testing for GPTDatasetConfig, which means the guard could fail to disable rank synchronization in some builder paths. This updates the check to inspect self.config, which is the actual dataset config instance.

Motivation

--dataloader-fast-cache-load is intended to skip the rank-0-first synchronization path when loading dataset indices from an existing cache. However, the inner _build_megatron_dataset_splits() guard checked:

isinstance(self.cls, GPTDatasetConfig)

self.cls is the dataset class, such as GPTDataset, not the config object. As a result, this condition is false even when self.config.fast_cache_load is enabled.

Some multi-prefix fast-cache cases may still work because higher-level builder paths already bypass synchronization, but the inner guard remains incorrect and can affect single-prefix paths, direct calls, or future refactors.
Changes

Use self.config instead of self.cls when checking for GPTDatasetConfig. Ensure synchronize_ranks is forced to False when --dataloader-fast-cache-load is enabled for GPT datasets.

Copilot AI review requested due to automatic review settings June 18, 2026 03:16
@sandyhouse sandyhouse requested review from a team as code owners June 18, 2026 03:16
@copy-pr-bot

copy-pr-bot Bot commented Jun 18, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

Copy link
Copy Markdown
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the rank-synchronization guard for --dataloader-fast-cache-load in BlendedMegatronDatasetBuilder by checking the actual dataset config instance (self.config) instead of the dataset class (self.cls), ensuring GPT fast-cache-load correctly forces synchronize_ranks=False.

Changes:

  • Fix fast-cache-load guard in _build_megatron_dataset_splits() to use isinstance(self.config, GPTDatasetConfig).
  • Add a unit test intended to assert that fast-cache-load disables rank synchronization for GPT datasets.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
megatron/core/datasets/blended_megatron_dataset_builder.py Corrects the fast-cache-load type check to apply to the config instance so the synchronization override triggers as intended.
tests/unit_tests/data/test_builder.py Adds a test covering the fast-cache-load synchronization behavior (currently needs a small fix to pass a valid split matrix).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/unit_tests/data/test_builder.py
Signed-off-by: Long Li <lilong@shopee.com>
@sandyhouse sandyhouse force-pushed the bugfix/dataset-fast-load branch from aec42d3 to aaa09dc Compare June 18, 2026 03:27
@sandyhouse sandyhouse marked this pull request as ready for review June 18, 2026 03:37
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team June 18, 2026 03:37
@asolergi-nv

asolergi-nv commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Hey! Thanks for catching this issue. Can you drop the tests? Thanks!

@asolergi-nv asolergi-nv self-assigned this Jun 18, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-request waiting-on-customer Waiting on the original author to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants