[dev] Add experimental decoupled compact LayerWise DDP layout for Muon#5388
Open
Wohox wants to merge 2 commits into
Decouple the persistent DDP param/grad buffer layout from LayerWise whole-param ownership.

There is no dedicated flag: this reuses the existing --no-use-layer-wise-param-layout. With a Muon layer-wise distributed optimizer under --use-distributed-optimizer, disabling the shard-aligned padded LayerWise layout (use_layer_wise_param_layout=False) selects the compact decoupled path.

In this mode use_distributed_optimizer becomes a per-buffer property: LayerWise-managed (Muon 2D matrix) buffers use a compact no-padding DDP layout and locally disable DistributedOptimizer semantics (all-reduce gradients, legacy whole-param ping-pong ownership, allgather_params param sync), while sibling non-LayerWise buffers (embeddings, biases, layernorm) keep the standard byte-level DistributedOptimizer layout. This removes the persistent dp_size * max(shard_load) LayerWise padding from the long-lived buffers.

The effective flag is computed per _ParamAndGradBuffer / _ParamAndGradBucketGroup from ddp_config.use_layer_wise_param_layout (mirrored onto OptimizerConfig and DistributedDataParallelConfig, auto-populated by field name, default True = padded). partition_buckets splits a force-single bucket group (disable_bucketing / non-first VPP chunks) by the effective per-bucket use_distributed_optimizer so Muon (all-reduce) and sibling (reduce-scatter) buckets never share a group; when all buckets agree this collapses to a single group, identical to the prior behavior. The default padded LayerWise layout is unchanged.

The LayerWise wiring in get_model / wrap_model_chunks_with_ddp is a single `if use_layer_wise_distributed_optimizer:` branch: both the padded and compact cases force ddp_config.use_distributed_optimizer=True, tag params for buffer routing, and compute the LayerWise full_param_layout. The padded-vs-compact decision lives entirely in compute_full_param_layout / _ParamAndGradBuffer, which read ddp_config.use_layer_wise_param_layout.
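The per-buffer flag and the bucket-group split described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `DDPConfig`, `Bucket`, `is_layer_wise`, `effective_use_distributed_optimizer`, and `split_force_single_group` are hypothetical stand-ins, and the real `_ParamAndGradBuffer` / `partition_buckets` logic may differ.

```python
from dataclasses import dataclass


@dataclass
class DDPConfig:
    # Hypothetical stand-in carrying only the two config fields named in the commit message.
    use_distributed_optimizer: bool = True
    use_layer_wise_param_layout: bool = True  # default True = padded layout


@dataclass
class Bucket:
    # Hypothetical stand-in for a bucket; is_layer_wise marks Muon 2D-matrix params.
    name: str
    is_layer_wise: bool


def effective_use_distributed_optimizer(cfg: DDPConfig, bucket: Bucket) -> bool:
    """Per-buffer flag: compact LayerWise buffers opt out of DistributedOptimizer semantics."""
    compact = bucket.is_layer_wise and not cfg.use_layer_wise_param_layout
    return cfg.use_distributed_optimizer and not compact


def split_force_single_group(cfg: DDPConfig, buckets: list) -> list:
    """Split one forced group so all-reduce and reduce-scatter buckets never share a group."""
    groups = {}
    for bucket in buckets:
        key = effective_use_distributed_optimizer(cfg, bucket)
        groups.setdefault(key, []).append(bucket)
    # When every bucket agrees on the flag, this is a single group, as before.
    return list(groups.values())
```

With `use_layer_wise_param_layout=False`, a mixed list of Muon and sibling buckets comes back as two groups; with the default config it comes back as one.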
Blockwise/MXFP8 compute with fp8_param_gather=False (params persist in bf16) is supported. FP8/FP4 parameter gather is rejected at arg-validation for any layer-wise distributed optimizer: it requires DistributedOptimizer param buffers the layer-wise path does not provide.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
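To make the padding saving concrete, here is a small worked example. It assumes the padded layout reserves `dp_size * max(shard_load)` elements, as the commit message states, and that the compact layout stores only the elements actually owned. The per-rank loads are invented for illustration, and `padded_numel` / `compact_numel` are hypothetical helper names.

```python
def padded_numel(shard_loads):
    """Shard-aligned layout: every rank's shard is padded up to the largest load."""
    return len(shard_loads) * max(shard_loads)


def compact_numel(shard_loads):
    """Compact layout (assumed): only the elements actually owned, with no padding."""
    return sum(shard_loads)


# Hypothetical whole-param loads (in elements) for 4 data-parallel ranks.
shard_loads = [4096, 1024, 3072, 2048]
padding = padded_numel(shard_loads) - compact_numel(shard_loads)
print(padding)  # 6144 elements of persistent padding avoided in this example
```

The more uneven the whole-param assignment across ranks, the larger the gap between the two layouts.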
a2a4b68 to
6e726d3
Compare
…ayerwise-compact-buffers # Conflicts: # megatron/training/training.py
/ok to test f615b2d
What does this PR do?
Add an experimental compact, decoupled LayerWise DDP buffer layout for the Muon (layer-wise) distributed optimizer that removes the persistent `dp_size * max(shard_load)` padding from the long-lived param/grad buffers.

There is no new flag: this reuses the existing `--no-use-layer-wise-param-layout`. With a Muon layer-wise distributed optimizer under `--use-distributed-optimizer`, disabling the shard-aligned padded LayerWise layout (`use_layer_wise_param_layout=False`) selects the compact decoupled path. `use_layer_wise_param_layout` is mirrored onto both `DistributedDataParallelConfig` and `OptimizerConfig` (auto-populated by field name, default `True` = padded layout), so the configs read it directly, with no derived switch.

In this mode `use_distributed_optimizer` becomes a per-buffer property:
- LayerWise-managed (Muon 2D matrix) buffers use a compact no-padding DDP layout and locally disable `DistributedOptimizer` semantics: all-reduce gradients, legacy whole-param ping-pong ownership, and `allgather_params` param sync.
- Sibling non-LayerWise buffers (embeddings, biases, layernorm) keep the standard byte-level `DistributedOptimizer` layout.

The effective flag is computed per `_ParamAndGradBuffer` / `_ParamAndGradBucketGroup`. `partition_buckets` splits a force-single bucket group (disable_bucketing / non-first VPP chunks) by the effective per-bucket `use_distributed_optimizer`, so Muon (all-reduce) and sibling (reduce-scatter) buckets never share a group; when all buckets agree this collapses to a single group, identical to the prior behavior.

`get_model` and `wrap_model_chunks_with_ddp` share a single `if use_layer_wise_distributed_optimizer:` branch: both the padded and compact cases force `ddp_config.use_distributed_optimizer=True`, tag params for buffer routing, and compute the LayerWise `full_param_layout`. The padded-vs-compact decision lives entirely in `compute_full_param_layout` / `_ParamAndGradBuffer`, which read `ddp_config.use_layer_wise_param_layout`.

Compatibility
- The default padded LayerWise layout is unchanged (`use_layer_wise_param_layout=True`).
- Blockwise/MXFP8 compute with `fp8_param_gather=False` (params persist in bf16) is supported.
- FP8/FP4 parameter gather is rejected at arg-validation for any layer-wise distributed optimizer: it requires `DistributedOptimizer` param buffers that the layer-wise path does not provide.
- Requires `num_distributed_optimizer_instances == 1` (the non-DistOpt Muon buffers only all-reduce within a single optimizer instance).

Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
`.github/CODEOWNERS`.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the `Final Review` label is applied automatically and final reviewers are assigned.
For PRs outside `megatron/core`, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the `Approved` label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.