[dev] Add experimental decoupled compact LayerWise DDP layout for Muon by Wohox · Pull Request #5388 · NVIDIA/Megatron-LM

Wohox · 2026-06-17T05:01:45Z

I, the PR author, have personally reviewed every line of this PR.

What does this PR do?

This is the dev counterpart of #5391, which targets main: the same change, implemented on top of dev. See #5391 for the main-branch version.

Add an experimental compact, decoupled LayerWise DDP buffer layout for the Muon (layer-wise) distributed optimizer that removes the persistent dp_size * max(shard_load) padding from the long-lived param/grad buffers.
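To make the size of the removed padding concrete, here is a minimal sketch of the arithmetic. It assumes whole parameters are assigned greedily to the least-loaded data-parallel rank; that assignment policy, the helper names, and the example sizes are illustrative assumptions and are not taken from the PR.

```python
from typing import List


def assign_whole_params(param_sizes: List[int], dp_size: int) -> List[int]:
    """Hypothetical greedy assignment: give each whole parameter to the
    currently least-loaded rank and return the per-rank element counts."""
    loads = [0] * dp_size
    for size in sorted(param_sizes, reverse=True):
        rank = loads.index(min(loads))
        loads[rank] += size
    return loads


def padded_buffer_numel(shard_loads: List[int]) -> int:
    # Padded layout: every rank's shard is padded up to the largest shard,
    # so the long-lived buffer holds dp_size * max(shard_load) elements.
    return len(shard_loads) * max(shard_loads)


def compact_buffer_numel(param_sizes: List[int]) -> int:
    # Compact layout: parameters are stored back to back with no padding.
    return sum(param_sizes)


# Illustrative 2D matrix sizes (element counts), not measured from a model.
param_sizes = [4096 * 11008, 11008 * 4096, 4096 * 4096, 4096 * 4096, 4096 * 1024]
shard_loads = assign_whole_params(param_sizes, dp_size=4)
padded = padded_buffer_numel(shard_loads)
compact = compact_buffer_numel(param_sizes)
print(f"padded={padded} compact={compact} saved={padded - compact}")
```

The gap between the two counts grows when whole-parameter ownership leaves the ranks unevenly loaded, which is the case the compact layout targets.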

There is no new flag: this PR reuses the existing --no-use-layer-wise-param-layout. With a Muon layer-wise distributed optimizer under --use-distributed-optimizer, disabling the shard-aligned padded LayerWise layout (use_layer_wise_param_layout=False) selects the compact decoupled path. use_layer_wise_param_layout is mirrored onto both DistributedDataParallelConfig and OptimizerConfig (auto-populated by field name; the default, True, selects the padded layout), so the configs read it directly, with no derived switch.
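The "auto-populated by field name" mirroring can be pictured with a small stand-alone sketch. The two dataclasses and the `populate_by_field_name` helper below are hypothetical stand-ins, not the real DistributedDataParallelConfig or OptimizerConfig, and the way the parser declares the flag is an assumption; only the flag and field names come from the description above.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class SketchDDPConfig:  # hypothetical stand-in for DistributedDataParallelConfig
    use_distributed_optimizer: bool = False
    use_layer_wise_param_layout: bool = True  # True = padded layout


@dataclass
class SketchOptimizerConfig:  # hypothetical stand-in for OptimizerConfig
    use_layer_wise_param_layout: bool = True


def populate_by_field_name(config_cls, args: argparse.Namespace):
    """Copy every parsed argument whose name matches a dataclass field."""
    kwargs = {
        f.name: getattr(args, f.name)
        for f in fields(config_cls)
        if hasattr(args, f.name)
    }
    return config_cls(**kwargs)


parser = argparse.ArgumentParser()
parser.add_argument("--use-distributed-optimizer", action="store_true")
# Assumed declaration: a negative flag that stores False into the shared field.
parser.add_argument(
    "--no-use-layer-wise-param-layout",
    dest="use_layer_wise_param_layout",
    action="store_false",
)

args = parser.parse_args(
    ["--use-distributed-optimizer", "--no-use-layer-wise-param-layout"]
)
ddp_config = populate_by_field_name(SketchDDPConfig, args)
opt_config = populate_by_field_name(SketchOptimizerConfig, args)
default_ddp_config = populate_by_field_name(SketchDDPConfig, parser.parse_args([]))
```

Because both configs carry a field with the same name, neither needs a separately derived switch to know which layout is active.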

In this mode use_distributed_optimizer becomes a per-buffer property:

- LayerWise-managed (Muon 2D matrix) buffers use a compact no-padding DDP layout and locally disable DistributedOptimizer semantics, falling back to all-reduce gradients, legacy whole-param ping-pong ownership, and allgather_params param sync.
- Sibling non-LayerWise buffers (embeddings, biases, layernorm) keep the standard byte-level DistributedOptimizer layout.
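The per-buffer rule described in the two items above can be sketched as a small pure function. `SketchBuffer`, `effective_use_distributed_optimizer`, and `grad_collective` are hypothetical names chosen for illustration; the real logic lives in _ParamAndGradBuffer and may differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchBuffer:  # hypothetical stand-in for a param/grad buffer
    name: str
    is_layer_wise_managed: bool  # True for Muon 2D matrix buffers


def effective_use_distributed_optimizer(
    use_distributed_optimizer: bool,
    use_layer_wise_param_layout: bool,
    buffer: SketchBuffer,
) -> bool:
    """Assumed rule: only LayerWise-managed buffers in compact mode
    (use_layer_wise_param_layout=False) locally drop DistOpt semantics."""
    if not use_distributed_optimizer:
        return False
    if buffer.is_layer_wise_managed and not use_layer_wise_param_layout:
        return False
    return True


def grad_collective(effective_flag: bool) -> str:
    # DistOpt buffers reduce-scatter gradients; the others all-reduce them.
    return "reduce-scatter" if effective_flag else "all-reduce"


muon = SketchBuffer("muon_matrices", is_layer_wise_managed=True)
sibling = SketchBuffer("embeddings_biases_layernorm", is_layer_wise_managed=False)

for buf in (muon, sibling):
    flag = effective_use_distributed_optimizer(True, False, buf)
    print(buf.name, flag, grad_collective(flag))
```

With the default padded layout (use_layer_wise_param_layout=True), the sketch returns True for both buffers, matching the statement that the default behavior is unchanged.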

The effective flag is computed per _ParamAndGradBuffer / _ParamAndGradBucketGroup. partition_buckets splits a force-single bucket group (disable_bucketing / non-first VPP chunks) by the effective per-bucket use_distributed_optimizer, so Muon (all-reduce) and sibling (reduce-scatter) buckets never share a group; when all buckets agree this collapses to a single group, identical to the prior behavior.
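A rough sketch of the splitting behavior follows. `SketchBucket` and `split_force_single_group` are hypothetical names; the real partition_buckets takes more inputs and handles the normal bucketed case as well, so treat this only as an illustration of the "split by effective flag" step.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SketchBucket:  # hypothetical stand-in for a DDP bucket
    name: str
    use_distributed_optimizer: bool  # effective per-bucket flag


def split_force_single_group(buckets: List[SketchBucket]) -> List[List[SketchBucket]]:
    """Split what would have been one forced group so that all-reduce and
    reduce-scatter buckets never share a group. Order is preserved."""
    groups: Dict[bool, List[SketchBucket]] = {}
    for bucket in buckets:
        groups.setdefault(bucket.use_distributed_optimizer, []).append(bucket)
    return list(groups.values())


mixed = [
    SketchBucket("muon_0", use_distributed_optimizer=False),
    SketchBucket("sibling_0", use_distributed_optimizer=True),
    SketchBucket("muon_1", use_distributed_optimizer=False),
]
uniform = [
    SketchBucket("sibling_0", use_distributed_optimizer=True),
    SketchBucket("sibling_1", use_distributed_optimizer=True),
]

mixed_groups = split_force_single_group(mixed)
uniform_groups = split_force_single_group(uniform)
```

When every bucket carries the same flag, the function returns exactly one group, which mirrors the "collapses to a single group" guarantee.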

get_model and wrap_model_chunks_with_ddp share a single if use_layer_wise_distributed_optimizer: branch. Both the padded and compact cases force ddp_config.use_distributed_optimizer=True, tag params for buffer routing, and compute the LayerWise full_param_layout. The padded-vs-compact decision lives entirely in compute_full_param_layout / _ParamAndGradBuffer, which read ddp_config.use_layer_wise_param_layout.

Compatibility

The default padded LayerWise layout is unchanged (use_layer_wise_param_layout=True).
Blockwise/MXFP8 compute with fp8_param_gather=False (params persist in bf16) is supported.
FP8/FP4 parameter gather is rejected at arg-validation for any layer-wise distributed optimizer: it requires DistributedOptimizer param buffers that the layer-wise path does not provide.
The compact path requires num_distributed_optimizer_instances == 1 (the non-DistOpt Muon buffers only all-reduce within a single optimizer instance).
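The compatibility rules above could be checked along the following lines. This is a sketch under stated assumptions: `validate_layer_wise_args` and the `low_precision_param_gather` parameter are hypothetical, the sketch assumes --use-distributed-optimizer is already set, and the actual arg-validation code in the PR may be organized differently.

```python
def validate_layer_wise_args(
    use_layer_wise_distributed_optimizer: bool,
    use_layer_wise_param_layout: bool,
    low_precision_param_gather: bool,  # hypothetical: FP8 or FP4 param gather on
    num_distributed_optimizer_instances: int,
) -> str:
    """Return which layout would be selected, or raise on an unsupported combo."""
    if not use_layer_wise_distributed_optimizer:
        return "not-layer-wise"
    # Rejected for any layer-wise distributed optimizer, padded or compact.
    if low_precision_param_gather:
        raise ValueError(
            "FP8/FP4 parameter gather needs DistributedOptimizer param buffers, "
            "which the layer-wise path does not provide"
        )
    if use_layer_wise_param_layout:
        return "padded"
    # Compact path: non-DistOpt Muon buffers only all-reduce within one instance.
    if num_distributed_optimizer_instances != 1:
        raise ValueError(
            "compact layout requires num_distributed_optimizer_instances == 1"
        )
    return "compact"


def raises_value_error(**kwargs) -> bool:
    """Small helper for exercising the rejected combinations."""
    try:
        validate_layer_wise_args(**kwargs)
    except ValueError:
        return True
    return False
```

Blockwise/MXFP8 compute with fp8_param_gather=False corresponds to `low_precision_param_gather=False` here, so it passes through to either layout.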

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code Typing guidelines
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

copy-pr-bot · 2026-06-17T05:01:48Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Decouple the persistent DDP param/grad buffer layout from LayerWise whole-param ownership.

There is no dedicated flag: this reuses the existing --no-use-layer-wise-param-layout. With a Muon layer-wise distributed optimizer under --use-distributed-optimizer, disabling the shard-aligned padded LayerWise layout (use_layer_wise_param_layout=False) selects the compact decoupled path.

In this mode use_distributed_optimizer becomes a per-buffer property: LayerWise-managed (Muon 2D matrix) buffers use a compact no-padding DDP layout and locally disable DistributedOptimizer semantics (all-reduce gradients, legacy whole-param ping-pong ownership, allgather_params param sync), while sibling non-LayerWise buffers (embeddings, biases, layernorm) keep the standard byte-level DistributedOptimizer layout. This removes the persistent dp_size * max(shard_load) LayerWise padding from the long-lived buffers.

The effective flag is computed per _ParamAndGradBuffer / _ParamAndGradBucketGroup from ddp_config.use_layer_wise_param_layout (mirrored onto OptimizerConfig and DistributedDataParallelConfig, auto-populated by field name, default True = padded). partition_buckets splits a force-single bucket group (disable_bucketing / non-first VPP chunks) by the effective per-bucket use_distributed_optimizer so Muon (all-reduce) and sibling (reduce-scatter) buckets never share a group; when all buckets agree this collapses to a single group, identical to the prior behavior. The default padded LayerWise layout is unchanged.

The LayerWise wiring in get_model / wrap_model_chunks_with_ddp is a single `if use_layer_wise_distributed_optimizer:` branch: both the padded and compact cases force ddp_config.use_distributed_optimizer=True, tag params for buffer routing, and compute the LayerWise full_param_layout. The padded-vs-compact decision lives entirely in compute_full_param_layout / _ParamAndGradBuffer, which read ddp_config.use_layer_wise_param_layout.

Blockwise/MXFP8 compute with fp8_param_gather=False (params persist in bf16) is supported. FP8/FP4 parameter gather is rejected at arg-validation for any layer-wise distributed optimizer: it requires DistributedOptimizer param buffers the layer-wise path does not provide.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ayerwise-compact-buffers # Conflicts: # megatron/training/training.py

Wohox · 2026-06-17T09:52:26Z

/ok to test f615b2d

Wohox force-pushed the pingtian/claude/muon-layerwise-compact-buffers branch from a2a4b68 to 6e726d3 June 17, 2026 09:37

Merge remote-tracking branch 'origin/dev' into pingtian/claude/muon-l…

f615b2d

…ayerwise-compact-buffers # Conflicts: # megatron/training/training.py

Wohox changed the title ~~Add experimental decoupled compact LayerWise DDP layout for Muon~~ [dev] Add experimental decoupled compact LayerWise DDP layout for Muon Jun 17, 2026

Wohox marked this pull request as ready for review June 17, 2026 09:50

Wohox requested review from a team as code owners June 17, 2026 09:50

svcnvidia-nemo-ci added the complexity: medium label Jun 17, 2026

copy-pr-bot Bot temporarily deployed to public June 17, 2026 09:53 Inactive

copy-pr-bot Bot temporarily deployed to public June 17, 2026 09:56 Inactive

copy-pr-bot Bot temporarily deployed to public June 17, 2026 09:57 Inactive

Wohox mentioned this pull request Jun 17, 2026

Add experimental decoupled compact LayerWise DDP layout for Muon (main) #5391

Draft

6 tasks

copy-pr-bot Bot temporarily deployed to public June 17, 2026 10:06 Inactive

Wohox wants to merge 2 commits into NVIDIA:dev from Wohox:pingtian/claude/muon-layerwise-compact-buffers


2 participants
