Fix fast-cache-load rank synchronization guard#5398
Conversation
|
This PR has been automatically converted to draft because all PRs must start as drafts. When you are ready for review, click Ready for Review to begin the review process. This will:
See the contribution guide for more details. |
There was a problem hiding this comment.
Pull request overview
This PR fixes the rank-synchronization guard for --dataloader-fast-cache-load in BlendedMegatronDatasetBuilder by checking the actual dataset config instance (self.config) instead of the dataset class (self.cls), ensuring GPT fast-cache-load correctly forces synchronize_ranks=False.
Changes:
- Fix fast-cache-load guard in
_build_megatron_dataset_splits()to useisinstance(self.config, GPTDatasetConfig). - Add a unit test intended to assert that fast-cache-load disables rank synchronization for GPT datasets.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
megatron/core/datasets/blended_megatron_dataset_builder.py |
Corrects the fast-cache-load type check to apply to the config instance so the synchronization override triggers as intended. |
tests/unit_tests/data/test_builder.py |
Adds a test covering the fast-cache-load synchronization behavior (currently needs a small fix to pass a valid split matrix). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Long Li <lilong@shopee.com>
aec42d3 to
aaa09dc
Compare
|
Hey! Thanks for catching this issue. Can you drop the tests? Thanks! |
What does this PR do ?
Fix the
--dataloader-fast-cache-loadrank synchronization guard inBlendedMegatronDatasetBuilder.The existing check used the dataset class object when testing for
GPTDatasetConfig, which means the guard could fail to disable rank synchronization in some builder paths. This updates the check to inspectself.config, which is the actual dataset config instance.Motivation
--dataloader-fast-cache-loadis intended to skip the rank-0-first synchronization path when loading dataset indices from an existing cache. However, the inner_build_megatron_dataset_splits()guard checked:self.cls is the dataset class, such as GPTDataset, not the config object. As a result, this condition is false even when self.config.fast_cache_load is enabled.
Some multi-prefix fast-cache cases may still work because higher-level builder paths already bypass synchronization, but the inner guard remains incorrect and can affect single-prefix paths, direct calls, or future refactors.
Changes
Use self.config instead of self.cls when checking for GPTDatasetConfig. Ensure synchronize_ranks is forced to False when --dataloader-fast-cache-load is enabled for GPT datasets.