[dev] Add experimental decoupled compact LayerWise DDP layout for Muon#5388

Open
Wohox wants to merge 2 commits into NVIDIA:dev from Wohox:pingtian/claude/muon-layerwise-compact-buffers
Conversation

@Wohox Wohox commented Jun 17, 2026

Contributor
  • I, the PR author, have personally reviewed every line of this PR.

What does this PR do?

This is the dev counterpart of #5391, which targets main: the same change, implemented on top of dev.

Add an experimental compact, decoupled LayerWise DDP buffer layout for the Muon (layer-wise) distributed optimizer that removes the persistent dp_size * max(shard_load) padding from the long-lived param/grad buffers.
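As a rough illustration of what the padding costs, the sketch below compares the padded size, dp_size * max(shard_load), with a compact size that simply sums the per-rank loads. The function names and the example loads are hypothetical, and the compact figure ignores any alignment padding the real buffers may still apply.

```python
def padded_numel(shard_loads: list[int]) -> int:
    """Elements in a shard-aligned padded buffer: every rank's shard is
    sized to the largest shard, so the total is dp_size * max(shard_load)."""
    dp_size = len(shard_loads)
    return dp_size * max(shard_loads)


def compact_numel(shard_loads: list[int]) -> int:
    """Elements in a no-padding buffer (assumption: just the sum of loads)."""
    return sum(shard_loads)


# Hypothetical whole-param loads (in elements) on 4 data-parallel ranks.
shard_loads = [4096, 4096, 2048, 1024]

padded = padded_numel(shard_loads)
compact = compact_numel(shard_loads)
wasted = padded - compact

print(f"padded={padded} compact={compact} wasted={wasted}")
```

The more uneven the whole-param assignment across ranks, the larger the gap between the two totals.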

There is no new flag: this PR reuses the existing --no-use-layer-wise-param-layout. With a Muon layer-wise distributed optimizer under --use-distributed-optimizer, disabling the shard-aligned padded LayerWise layout (use_layer_wise_param_layout=False) selects the compact decoupled path. use_layer_wise_param_layout is mirrored onto both DistributedDataParallelConfig and OptimizerConfig (auto-populated by field name; the default True selects the padded layout), so both configs read it directly and no derived switch is needed.
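A minimal sketch of what "auto-populated by field name" can look like, assuming plain dataclasses and a parsed-args namespace. The *Sketch classes and populate_by_field_name are hypothetical stand-ins, not the real DistributedDataParallelConfig / OptimizerConfig or their construction code.

```python
from dataclasses import dataclass, fields
from types import SimpleNamespace


@dataclass
class DDPConfigSketch:  # hypothetical stand-in for DistributedDataParallelConfig
    use_distributed_optimizer: bool = False
    use_layer_wise_param_layout: bool = True  # True = padded layout


@dataclass
class OptimizerConfigSketch:  # hypothetical stand-in for OptimizerConfig
    use_layer_wise_param_layout: bool = True


def populate_by_field_name(config_cls, args):
    """Copy every attribute of `args` whose name matches a dataclass field."""
    kwargs = {
        f.name: getattr(args, f.name)
        for f in fields(config_cls)
        if hasattr(args, f.name)
    }
    return config_cls(**kwargs)


# Parsed arguments as they might look with --use-distributed-optimizer and
# --no-use-layer-wise-param-layout on the command line.
args = SimpleNamespace(use_distributed_optimizer=True, use_layer_wise_param_layout=False)

ddp_config = populate_by_field_name(DDPConfigSketch, args)
opt_config = populate_by_field_name(OptimizerConfigSketch, args)
```

Because both configs carry the same field name, one parsed value reaches both without a separate derived switch.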

In this mode use_distributed_optimizer becomes a per-buffer property:

  • LayerWise-managed (Muon 2D matrix) buffers use a compact no-padding DDP layout and locally disable DistributedOptimizer semantics. They instead use all-reduce gradients, legacy whole-param ping-pong ownership, and allgather_params param sync.
  • Sibling non-LayerWise buffers (embeddings, biases, layernorm) keep the standard byte-level DistributedOptimizer layout.
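The per-buffer decision described above could be sketched as follows. The function names and the buffer_is_layer_wise argument are hypothetical, and the sketch assumes that the padded layout leaves the flag enabled on every buffer.

```python
def effective_use_distributed_optimizer(
    ddp_use_distributed_optimizer: bool,
    use_layer_wise_param_layout: bool,
    buffer_is_layer_wise: bool,  # hypothetical: True for Muon 2D-matrix buffers
) -> bool:
    """Return the per-buffer flag rather than the global config value."""
    if not ddp_use_distributed_optimizer:
        return False
    # Compact path: LayerWise buffers opt out locally; sibling buffers do not.
    if buffer_is_layer_wise and not use_layer_wise_param_layout:
        return False
    return True


def grad_collective(effective_flag: bool) -> str:
    """Gradient collective implied by the effective flag."""
    return "reduce-scatter" if effective_flag else "all-reduce"


# Compact mode: the global flag is forced on, the padded layout is disabled.
muon_flag = effective_use_distributed_optimizer(True, False, buffer_is_layer_wise=True)
sibling_flag = effective_use_distributed_optimizer(True, False, buffer_is_layer_wise=False)

print(grad_collective(muon_flag), grad_collective(sibling_flag))
```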

The effective flag is computed per _ParamAndGradBuffer / _ParamAndGradBucketGroup. partition_buckets splits a force-single bucket group (disable_bucketing / non-first VPP chunks) by the effective per-bucket use_distributed_optimizer, so Muon (all-reduce) and sibling (reduce-scatter) buckets never share a group; when all buckets agree this collapses to a single group, identical to the prior behavior.
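One way to picture the split of a force-single group is shown below. BucketSketch and split_force_single_group are hypothetical names, and this is an illustration under those assumptions rather than the actual partition_buckets code.

```python
from dataclasses import dataclass


@dataclass
class BucketSketch:  # hypothetical stand-in for a DDP bucket
    name: str
    use_distributed_optimizer: bool  # effective per-bucket flag


def split_force_single_group(buckets: list[BucketSketch]) -> list[list[BucketSketch]]:
    """Group buckets by their effective flag so that all-reduce and
    reduce-scatter buckets never land in the same group."""
    groups: dict[bool, list[BucketSketch]] = {}
    for bucket in buckets:
        groups.setdefault(bucket.use_distributed_optimizer, []).append(bucket)
    return list(groups.values())


mixed = [
    BucketSketch("muon_qkv", False),
    BucketSketch("embedding", True),
    BucketSketch("muon_mlp", False),
]
uniform = [BucketSketch("embedding", True), BucketSketch("layernorm", True)]

mixed_groups = split_force_single_group(mixed)
uniform_groups = split_force_single_group(uniform)
```

When every bucket carries the same flag, the dictionary has a single key and the result is one group, which matches the "collapses to a single group" behavior described above.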

get_model and wrap_model_chunks_with_ddp share a single if use_layer_wise_distributed_optimizer: branch. Both the padded and compact cases force ddp_config.use_distributed_optimizer=True, tag params for buffer routing, and compute the LayerWise full_param_layout. The padded-vs-compact decision lives entirely in compute_full_param_layout / _ParamAndGradBuffer, which read ddp_config.use_layer_wise_param_layout.

Compatibility

  • The default padded LayerWise layout is unchanged (use_layer_wise_param_layout=True).
  • Blockwise/MXFP8 compute with fp8_param_gather=False (params persist in bf16) is supported.
  • FP8/FP4 parameter gather is rejected at arg-validation for any layer-wise distributed optimizer: it requires DistributedOptimizer param buffers that the layer-wise path does not provide.
  • The compact path requires num_distributed_optimizer_instances == 1 (the non-DistOpt Muon buffers only all-reduce within a single optimizer instance).
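The two rejection rules above could be expressed as a small validation helper. This is a hedged sketch: validate_layer_wise_args is a hypothetical function, fp4_param_gather is an assumed argument name, and the real arg-validation may be structured differently.

```python
def validate_layer_wise_args(
    use_layer_wise_distributed_optimizer: bool,
    use_layer_wise_param_layout: bool,
    fp8_param_gather: bool = False,
    fp4_param_gather: bool = False,  # hypothetical argument name
    num_distributed_optimizer_instances: int = 1,
) -> None:
    """Raise ValueError for combinations the description lists as unsupported."""
    if not use_layer_wise_distributed_optimizer:
        return
    # FP8/FP4 parameter gather needs DistributedOptimizer param buffers,
    # which the layer-wise path does not provide (padded or compact).
    if fp8_param_gather or fp4_param_gather:
        raise ValueError(
            "FP8/FP4 parameter gather is not supported with a layer-wise "
            "distributed optimizer"
        )
    # The compact path all-reduces only within a single optimizer instance.
    if not use_layer_wise_param_layout and num_distributed_optimizer_instances != 1:
        raise ValueError(
            "the compact LayerWise layout requires "
            "num_distributed_optimizer_instances == 1"
        )


# Supported: compact path, params kept in bf16, one optimizer instance.
validate_layer_wise_args(
    use_layer_wise_distributed_optimizer=True,
    use_layer_wise_param_layout=False,
)
```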

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

copy-pr-bot Bot commented Jun 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Decouple the persistent DDP param/grad buffer layout from LayerWise
whole-param ownership. There is no dedicated flag: this reuses the existing
--no-use-layer-wise-param-layout. With a Muon layer-wise distributed optimizer
under --use-distributed-optimizer, disabling the shard-aligned padded LayerWise
layout (use_layer_wise_param_layout=False) selects the compact decoupled path.

In this mode use_distributed_optimizer becomes a per-buffer property:
LayerWise-managed (Muon 2D matrix) buffers use a compact no-padding DDP layout
and locally disable DistributedOptimizer semantics (all-reduce gradients,
legacy whole-param ping-pong ownership, allgather_params param sync), while
sibling non-LayerWise buffers (embeddings, biases, layernorm) keep the standard
byte-level DistributedOptimizer layout. This removes the persistent
dp_size * max(shard_load) LayerWise padding from the long-lived buffers. The
effective flag is computed per _ParamAndGradBuffer / _ParamAndGradBucketGroup
from ddp_config.use_layer_wise_param_layout (mirrored onto OptimizerConfig and
DistributedDataParallelConfig, auto-populated by field name, default True =
padded). partition_buckets splits a force-single bucket group
(disable_bucketing / non-first VPP chunks) by the effective per-bucket
use_distributed_optimizer so Muon (all-reduce) and sibling (reduce-scatter)
buckets never share a group; when all buckets agree this collapses to a single
group, identical to the prior behavior. The default padded LayerWise layout is
unchanged.

The LayerWise wiring in get_model / wrap_model_chunks_with_ddp is a single
`if use_layer_wise_distributed_optimizer:` branch: both the padded and compact
cases force ddp_config.use_distributed_optimizer=True, tag params for buffer
routing, and compute the LayerWise full_param_layout. The padded-vs-compact
decision lives entirely in compute_full_param_layout / _ParamAndGradBuffer,
which read ddp_config.use_layer_wise_param_layout.

Blockwise/MXFP8 compute with fp8_param_gather=False (params persist in bf16) is
supported. FP8/FP4 parameter gather is rejected at arg-validation for any
layer-wise distributed optimizer: it requires DistributedOptimizer param
buffers the layer-wise path does not provide.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@Wohox Wohox force-pushed the pingtian/claude/muon-layerwise-compact-buffers branch from a2a4b68 to 6e726d3 June 17, 2026 09:37
…ayerwise-compact-buffers

# Conflicts:
#	megatron/training/training.py
@Wohox Wohox changed the title from "Add experimental decoupled compact LayerWise DDP layout for Muon" to "[dev] Add experimental decoupled compact LayerWise DDP layout for Muon" Jun 17, 2026
@Wohox Wohox marked this pull request as ready for review June 17, 2026 09:50
@Wohox Wohox requested review from a team as code owners June 17, 2026 09:50

Wohox commented Jun 17, 2026

Contributor Author

/ok to test f615b2d
