Add experimental Megatron-FSDP fully_shard implementation by wujingyue · Pull Request #5387 · NVIDIA/Megatron-LM

wujingyue · 2026-06-17T04:28:50Z

Recovered replacement for #4976, which GitHub closed after its base ref pull-request/4835 was deleted. This branch is now rebased directly on main.

What does this PR do ?

Adds an experimental per-module Megatron-FSDP fully_shard path that uses DBuffer primitives to shard parameters, materialize full weights for compute, and reduce gradients back into sharded optimizer state.
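The shard, unshard, and reduce cycle described above can be illustrated with a single-process simulation. This is only a sketch of the general FSDP data flow, not the PR's DBuffer or fully_shard API: `shard`, `all_gather`, and `reduce_scatter_avg` are hypothetical helpers, the "ranks" are plain Python lists, and zero-padding to an even shard size is an assumption.

```python
from typing import List


def shard(flat: List[float], world_size: int) -> List[List[float]]:
    """Zero-pad a flat buffer to a multiple of world_size and split it evenly."""
    padded_len = -(-len(flat) // world_size) * world_size  # ceiling to a multiple
    padded = flat + [0.0] * (padded_len - len(flat))
    size = padded_len // world_size
    return [padded[r * size:(r + 1) * size] for r in range(world_size)]


def all_gather(shards: List[List[float]], numel: int) -> List[float]:
    """Concatenate every rank's shard and drop the padding (the "unshard" step)."""
    full = [x for s in shards for x in s]
    return full[:numel]


def reduce_scatter_avg(per_rank_grads: List[List[float]], world_size: int) -> List[List[float]]:
    """Average full gradients across ranks, then give each rank only its shard."""
    numel = len(per_rank_grads[0])
    avg = [sum(g[i] for g in per_rank_grads) / world_size for i in range(numel)]
    return shard(avg, world_size)


world_size = 2
weights = [1.0, 2.0, 3.0, 4.0, 5.0]

# Each rank persistently stores only its shard of the weights.
weight_shards = shard(weights, world_size)

# Before compute, the full weights are materialized from all shards.
full_weights = all_gather(weight_shards, len(weights))

# Each rank produces a full-size gradient from its own microbatch (made-up values).
per_rank_grads = [[1.0] * 5, [3.0] * 5]

# After backward, gradients are averaged and scattered back into sharded state.
grad_shards = reduce_scatter_avg(per_rank_grads, world_size)
```

In a real run the gather and reduce steps would be collective communication calls, and the full weights would be released again after use (the reshard step), which is where the memory saving comes from.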

Meta-parameter materialization is intentionally split out to the follow-up draft PR at #5369.

Issue tracking

Linked issue: N/A

Summary

- Add the experimental fully_shard(...) entry point plus FsdpModule, FsdpParameterGroup, and Placements runtime state.
- Group parameters by dtype and requires_grad, manage sharded main/model weight buffers, and sync optimizer-updated weights into compute weights.
- Allocate model/gradient buffers from DBuffer.device instead of reaching through local_buffer.device.
- Install forward/backward hooks for unshard, reshard, and parameter-completion-based gradient reduction, including accumulation across backward calls.
- Document storage-lifetime choices for persistent main gradients, autograd-saved unsharded parameter views, and the shared reshard storage-release path.
- Extend distributed coverage for DBuffer layout/redistribution and the experimental FSDP path, including nested ownership, frozen parameters, optimizer-step visibility, CPU-initialized parameter sharding, autograd storage reuse, loss parity, and peak-memory reduction.
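As a rough illustration of the grouping step in the summary, parameters can be bucketed by a `(dtype, requires_grad)` key so that each bucket can share one buffer. The sketch below is an assumption about the general idea only: `ParamInfo` and `group_parameters` are hypothetical names, dtypes are plain strings, and nothing here reflects the actual FsdpParameterGroup implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ParamInfo:
    """Hypothetical stand-in for a module parameter's metadata."""
    name: str
    dtype: str
    requires_grad: bool


def group_parameters(params: List[ParamInfo]) -> Dict[Tuple[str, bool], List[str]]:
    """Bucket parameter names by (dtype, requires_grad), preserving input order."""
    groups: Dict[Tuple[str, bool], List[str]] = defaultdict(list)
    for p in params:
        groups[(p.dtype, p.requires_grad)].append(p.name)
    return dict(groups)


params = [
    ParamInfo("embed.weight", "bfloat16", True),
    ParamInfo("norm.weight", "float32", True),
    ParamInfo("proj.weight", "bfloat16", True),
    ParamInfo("frozen.bias", "bfloat16", False),  # frozen: needs no gradient buffer
]
groups = group_parameters(params)
```

Keeping frozen parameters in their own group means a gradient buffer only has to be allocated for groups whose key has `requires_grad=True`.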

Testing

git diff --check origin/main..HEAD
git diff --check for the latest comment-only update
uv run --no-sync python -m torch.distributed.run --nproc-per-node 2 -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_dbuffer.py tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py (27 passed, 6 skipped)

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Summarize today's FSDP work:
- Add DBuffer storage release/reallocate support and an in-place fully_allgather_into path for materializing replicated buffers.
- Simplify ParameterGroup and FsdpModule around sharded DTensor parameters, reused unsharded Parameters, meta materialization, and default-stream unshard/reshard/reduce behavior.
- Remove unused optimizer/offload/state/helper surface area from the minimal path and keep version-counter preservation scoped to unsharded model-weight materialization.
- Expand DBuffer and experimental FSDP tests for layouts, storage lifecycle, DTensor contracts, meta reset, nested ownership, train-step parity, and peak-memory reduction.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Add out= support to DBuffer redistribution and primitive communication ops, keeping axis inference in redistribute only. Use preallocated model and gradient buffers in the minimal FSDP path where possible, including direct first-gradient reduce-scatter into main_grad. Update DBuffer and experimental FSDP tests for AVG reductions, explicit primitive axes, storage reuse, and gradient accumulation behavior.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
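One way to picture the "direct first-gradient reduce-scatter into main_grad" behavior with accumulation is sketched below. It is a single-process simulation under stated assumptions: `reduce_scatter_avg` and `MainGradAccumulator` are hypothetical names, the inputs are each rank's contribution to this rank's shard, and routing later reductions through a scratch buffer before adding is a guess at one plausible accumulation scheme, not the PR's confirmed behavior.

```python
from typing import List, Sequence


def reduce_scatter_avg(inputs: Sequence[Sequence[float]], out: List[float]) -> None:
    """Average the per-rank contributions elementwise and write into `out` in place."""
    world_size = len(inputs)
    for i in range(len(out)):
        out[i] = sum(rank_grad[i] for rank_grad in inputs) / world_size


class MainGradAccumulator:
    """Hypothetical holder for one rank's preallocated main-gradient shard."""

    def __init__(self, shard_numel: int) -> None:
        self.main_grad = [0.0] * shard_numel  # preallocated and persistent
        self._scratch = [0.0] * shard_numel   # used only when accumulating
        self._has_grad = False

    def reduce(self, inputs: Sequence[Sequence[float]]) -> None:
        if not self._has_grad:
            # First backward: reduce straight into main_grad, overwriting stale data.
            reduce_scatter_avg(inputs, out=self.main_grad)
            self._has_grad = True
        else:
            # Later backwards: reduce into scratch, then accumulate.
            reduce_scatter_avg(inputs, out=self._scratch)
            for i, value in enumerate(self._scratch):
                self.main_grad[i] += value

    def zero_grad(self) -> None:
        # No memset needed: the next first reduction overwrites main_grad.
        self._has_grad = False


acc = MainGradAccumulator(shard_numel=2)
acc.reduce([[2.0, 4.0], [4.0, 8.0]])  # first microbatch
after_first = list(acc.main_grad)
acc.reduce([[1.0, 1.0], [1.0, 1.0]])  # second microbatch accumulates
after_second = list(acc.main_grad)
acc.zero_grad()
acc.reduce([[0.0, 2.0], [2.0, 0.0]])  # next step overwrites the old values
after_reset = list(acc.main_grad)
```

Writing the first reduction directly into the destination avoids both a zero-fill and an extra add for the common single-backward case.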

Use ordered parameter tuples for FSDP parameter swapping and keep the sharded data path aligned with main_weight storage. Set grad_dtype for FSDP-managed parameters so BF16 main gradients can be reduced without pre-reduce casts, and update tests to verify sharded parameter data and grad backing buffers. Clean up hook naming, local gradient accumulation handling, and memory/test assertions for the minimal experimental path.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Add DBuffer casting for dtype conversion before model-weight sync, refresh model weights from main weights before unshard, and cover the next-forward optimizer update path with FP32 main weights and default BF16 main grads on SGD's non-foreach path.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
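The reason for refreshing lower-precision model weights from FP32 main weights can be shown with a small standalone sketch. It emulates BF16 rounding with the standard library instead of using PyTorch; `to_bf16` is a hypothetical helper (round-to-nearest-even, no NaN handling), Python floats stand in for FP32 storage, and the learning rate and gradients are made-up values. None of this is the PR's DBuffer casting code.

```python
import struct
from typing import List


def to_bf16(x: float) -> float:
    """Round a value to the nearest bfloat16 (ties to even) and return it as a float.

    Assumes a finite input; NaN is not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)          # ties go to the even mantissa
    bits = (bits + rounding_bias) & 0xFFFF0000           # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sgd_step(weights: List[float], grads: List[float], lr: float) -> List[float]:
    """Plain SGD update applied to the high-precision main weights."""
    return [w - lr * g for w, g in zip(weights, grads)]


lr = 0.001
grads = [1.0, 1.0]

# Main weights stay in high precision and receive every optimizer update.
main_weights = [1.0, -2.0]
for _ in range(4):
    main_weights = sgd_step(main_weights, grads, lr)

# Before the next forward, model weights are refreshed by casting the main weights.
model_weights = [to_bf16(w) for w in main_weights]

# For contrast: updating a BF16-only weight loses a small step to rounding.
bf16_only = to_bf16(to_bf16(1.0) - lr * 1.0)
```

Here the first weight's four small updates survive in the main copy and show up in the cast value, while a single update applied directly in BF16 rounds back to the starting value.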

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

copy-pr-bot · 2026-06-17T04:28:54Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

wujingyue added 13 commits June 17, 2026 04:25

WIP: add experimental minimal FSDP path

844c625

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Require matching FSDP main grad dtype

14e807b

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Reuse FSDP model weights for matching main weights

680f17b

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Preserve autograd grad dtype in minimal FSDP

09909a2

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Compare minimal FSDP loss curve with baseline

c4d13a8

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Adapt minimal FSDP to split DBuffer API

f1c2e57

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Split minimal FSDP runtime modules

195b4ca

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Split minimal FSDP module mixin

13055d8

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Remove experimental FSDP meta parameter support

8759750

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

wujingyue mentioned this pull request Jun 17, 2026

Add experimental Megatron-FSDP fully_shard implementation #4976

Closed

wujingyue marked this pull request as ready for review June 17, 2026 05:09

wujingyue requested review from a team as code owners June 17, 2026 05:09

copy-pr-bot Bot temporarily deployed to public June 17, 2026 05:10 Inactive

svcnvidia-nemo-ci added the complexity: medium label Jun 17, 2026

shjwudp approved these changes Jun 17, 2026


svcnvidia-nemo-ci added the Final Review label (PR is in the "final review" stage) Jun 17, 2026

copy-pr-bot Bot temporarily deployed to public June 17, 2026 05:13 Inactive

copy-pr-bot Bot temporarily deployed to public June 17, 2026 05:22 Inactive

copy-pr-bot Bot temporarily deployed to public June 17, 2026 06:45 Inactive

copy-pr-bot Bot temporarily deployed to public June 17, 2026 07:39 Inactive

copy-pr-bot Bot temporarily deployed to public June 17, 2026 07:48 Inactive

wujingyue added 2 commits June 21, 2026 16:24

Remove experimental FSDP distributed pytest markers

99b1261

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Remove DBuffer distributed pytest markers

c12e418

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

copy-pr-bot Bot temporarily deployed to public June 21, 2026 16:26 Inactive

copy-pr-bot Bot had a problem deploying to test June 21, 2026 16:27 Error

copy-pr-bot Bot temporarily deployed to public June 21, 2026 16:29 Inactive

copy-pr-bot Bot temporarily deployed to test June 21, 2026 16:29 Inactive

copy-pr-bot Bot temporarily deployed to public June 21, 2026 16:33 Inactive

copy-pr-bot Bot temporarily deployed to public June 21, 2026 16:34 Inactive

copy-pr-bot Bot temporarily deployed to public June 21, 2026 16:42 Inactive

Remove experimental FSDP CUDA skip guards

f959479

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

copy-pr-bot Bot temporarily deployed to public June 21, 2026 22:50 Inactive

copy-pr-bot Bot had a problem deploying to test June 21, 2026 22:51 Error

Document experimental fully_shard mixin attachment

adb4e12

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

copy-pr-bot Bot temporarily deployed to public June 21, 2026 22:54 Inactive

copy-pr-bot Bot had a problem deploying to test June 21, 2026 22:54 Error

Document fully_shard mixin behavior

06286bc

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

copy-pr-bot Bot temporarily deployed to public June 21, 2026 22:57 Inactive

copy-pr-bot Bot temporarily deployed to test June 21, 2026 22:57 Inactive

copy-pr-bot Bot temporarily deployed to public June 21, 2026 23:00 Inactive

wujingyue mentioned this pull request Jun 21, 2026

Add fully_shard_optimizer for mixed-precision FSDP #5411

Open

Phlip79 approved these changes Jun 22, 2026


Cover FSDP loss parity with microbatches

0dfb83b

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>


wujingyue wants to merge 23 commits into NVIDIA:main from wujingyue:fsdp/minimal


4 participants
