Add experimental FSDP microbatch context helper by wujingyue · Pull Request #5378 · NVIDIA/Megatron-LM

wujingyue · 2026-06-16T20:28:27Z

Summary

Add FsdpContext next to FsdpModule, with lazy root-subtree context creation that follows the structure of #5124 (Add FSDP stream context delayed release).
Add megatron_fsdp.experimental.microbatch(module, is_first) alongside the experimental fully_shard(...) API in fully_shard.py.
Implement microbatch() as a scoped DFS helper that lazy-inits discovered FSDP root contexts, sets is_first_microbatch, and restores prior values on exit.
Keep default behavior correctness-first: forwards outside microbatch() still sync main weights before forward.
Skip main-weight-to-model-weight sync after the first microbatch while still unsharding before every forward.
Add a local Megatron-FSDP NVTX range around sync_model_weight_from_main_weight() without importing megatron.core.utils.
Cover the unwrapped-parent case with child modules wrapped by fully_shard() in an isolated is_first=False context-state test.
Cover sync cadence with a normal top-level FSDP model training loop that verifies weight sync runs once per FSDP group per minibatch, not once per microbatch.
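
As a rough illustration of the scoping behavior listed above, the sketch below models microbatch() as a context manager over plain Python objects. Only the names microbatch, FsdpContext, is_first_microbatch, and sync_model_weight_from_main_weight come from this PR description. ToyModule, its sync_count counter, and the choice to stop the DFS at the outermost wrapped module are assumptions made for the example, not the PR's implementation.

```python
from contextlib import contextmanager


class FsdpContext:
    """Hypothetical stand-in for the per-root context."""

    def __init__(self):
        # Correctness-first default: every forward counts as a first microbatch.
        self.is_first_microbatch = True


class ToyModule:
    """Hypothetical stand-in for a module tree (not torch.nn.Module)."""

    def __init__(self, *children, is_fsdp_root=False):
        self.children = list(children)
        self.is_fsdp_root = is_fsdp_root
        self.context = None  # created lazily by microbatch()
        self.sync_count = 0

    def forward(self):
        if self.is_fsdp_root:
            ctx = self.context
            if ctx is None or ctx.is_first_microbatch:
                # Stands in for sync_model_weight_from_main_weight().
                self.sync_count += 1
            # Unsharding would still happen here on every forward.
        for child in self.children:
            child.forward()


def _fsdp_roots(module):
    """DFS that yields the outermost wrapped modules (assumed stopping rule)."""
    if module.is_fsdp_root:
        yield module
        return
    for child in module.children:
        yield from _fsdp_roots(child)


@contextmanager
def microbatch(module, is_first):
    saved = []
    for root in _fsdp_roots(module):
        if root.context is None:
            root.context = FsdpContext()  # lazy init on first discovery
        saved.append((root.context, root.context.is_first_microbatch))
        root.context.is_first_microbatch = is_first
    try:
        yield
    finally:
        # Restore the prior values even if the body raised.
        for ctx, previous in saved:
            ctx.is_first_microbatch = previous


# Unwrapped parent with two wrapped children, as in the first test case.
child_a = ToyModule(is_fsdp_root=True)
child_b = ToyModule(is_fsdp_root=True)
parent = ToyModule(child_a, child_b)

for _minibatch in range(2):
    for i in range(4):  # four microbatches per minibatch
        with microbatch(parent, is_first=(i == 0)):
            parent.forward()

print(child_a.sync_count, child_b.sync_count)  # 2 2
```

In this model each wrapped child syncs once per minibatch rather than once per microbatch, and a forward issued after the with block exits sees the restored default and syncs again.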

Stack

Stacked on #4976 with base pull-request/4976.

Testing

uv run isort ... could not run because uv attempted to modify /opt/venv and hit a permission error; reran as uv run --no-sync isort ... successfully.
git diff --check passed.
uv run --no-sync python -m torch.distributed.run --nproc-per-node 1 -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py::test_microbatch_false_scopes_unwrapped_parent_child_contexts tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py::test_microbatch_training_syncs_once_per_minibatch passed: 2 passed.
uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py could not run on this host: only 2 CUDA devices are visible, and rank 3 failed with CUDA error: invalid device ordinal.
uv run --no-sync python -m torch.distributed.run --nproc-per-node 2 -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py passed: 10 passed.
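
The 8-process failure above is what one would expect if each local rank binds to the CUDA device with the same index, which is a common convention but an assumption about this test suite. Under that assumption, a small helper (hypothetical, not part of the repository) shows which ranks cannot get a device and what process count fits the host. In a real run the visible-device count would come from torch.cuda.device_count().

```python
def ranks_without_device(nproc_per_node, visible_devices):
    """Local ranks whose device index would be an invalid ordinal.

    Assumes local rank r uses CUDA device r.
    """
    return list(range(visible_devices, nproc_per_node))


def usable_nproc(requested, visible_devices):
    """Largest process count that does not exceed the visible devices."""
    if visible_devices < 1:
        raise ValueError("no CUDA devices visible")
    return min(requested, visible_devices)


print(ranks_without_device(8, 2))  # [2, 3, 4, 5, 6, 7]
print(usable_nproc(8, 2))          # 2
```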

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Summarize today's FSDP work:
- Add DBuffer storage release/reallocate support and an in-place fully_allgather_into path for materializing replicated buffers.
- Simplify ParameterGroup and FsdpModule around sharded DTensor parameters, reused unsharded Parameters, meta materialization, and default-stream unshard/reshard/reduce behavior.
- Remove unused optimizer/offload/state/helper surface area from the minimal path and keep version-counter preservation scoped to unsharded model-weight materialization.
- Expand DBuffer and experimental FSDP tests for layouts, storage lifecycle, DTensor contracts, meta reset, nested ownership, train-step parity, and peak-memory reduction.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Add out= support to DBuffer redistribution and primitive communication ops, keeping axis inference in redistribute only. Use preallocated model and gradient buffers in the minimal FSDP path where possible, including direct first-gradient reduce-scatter into main_grad. Update DBuffer and experimental FSDP tests for AVG reductions, explicit primitive axes, storage reuse, and gradient accumulation behavior. Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
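
To make the out= and first-gradient behavior described above concrete, here is a single-process simulation using Python lists. It is only a sketch: reduce_scatter_avg and accumulate_microbatch are hypothetical names, no real communication happens, and the assumption that later microbatches reduce into a temporary and then add into main_grad is mine rather than something this description states.

```python
def reduce_scatter_avg(per_rank_grads, rank, out=None):
    """Simulate an AVG reduce-scatter and return the shard owned by `rank`.

    per_rank_grads holds one full-length gradient list per rank. When `out`
    is given, the result is written into it instead of a new list.
    """
    world_size = len(per_rank_grads)
    length = len(per_rank_grads[0])
    assert length % world_size == 0, "sketch assumes an evenly divisible buffer"
    shard = length // world_size
    start = rank * shard
    if out is None:
        out = [0.0] * shard
    assert len(out) == shard, "out must match the local shard size"
    for i in range(shard):
        out[i] = sum(g[start + i] for g in per_rank_grads) / world_size
    return out


def accumulate_microbatch(main_grad, per_rank_grads, rank, is_first):
    if is_first:
        # First gradient: reduce directly into main_grad, overwriting stale data.
        reduce_scatter_avg(per_rank_grads, rank, out=main_grad)
    else:
        # Assumed path for later microbatches: reduce to a temp, then add.
        tmp = reduce_scatter_avg(per_rank_grads, rank)
        for i, value in enumerate(tmp):
            main_grad[i] += value


grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]  # two ranks
main_grad = [99.0, 99.0]  # stale contents from a previous step
accumulate_microbatch(main_grad, grads, rank=0, is_first=True)
print(main_grad)  # [2.0, 3.0]
accumulate_microbatch(main_grad, grads, rank=0, is_first=False)
print(main_grad)  # [4.0, 6.0]
```

Writing the first result straight into the preallocated buffer avoids both a zero-fill and a temporary for that microbatch.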

copy-pr-bot · 2026-06-16T20:28:32Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot · 2026-06-16T20:47:18Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

wujingyue added 30 commits June 10, 2026 21:51

Add minimal DBuffer prototype

3a5e674

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Move DBuffer to experimental package

7edbaf3

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Add DBuffer output reuse helpers

9881070

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Clarify DBuffer layout documentation

2c43e21

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Move DBuffer placements to placement module

5700012

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Add DBuffer LCM padding gap coverage

066d62e

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Clarify DBuffer layout packing algorithm

8f15a3c

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Use out variable in DBuffer output paths

7fe5053

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Improve DBuffer out validation errors

de6960c

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Enforce integer DBuffer mesh axes

c064d18

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Avoid full input transfers in DBuffer distribution

365b321

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Assert scatter shares DBuffer storage

138342d

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Rename DBuffer local tensor accessor

c045ce1

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Rename DBuffer test close helper

208cc71

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Clarify DBuffer allgather placement handling

29d106c

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Document DBuffer placement transitions

a333c3d

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Require contiguous DBuffer local storage

0d9f011

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Keep DBuffer tests from resetting process group

baf4a86

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Avoid private DeviceMesh test cleanup

8e4603a

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Use shared distributed fixture in DBuffer tests

e01886a

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Extract DBuffer owned range calculation

80660a2

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Clarify DBuffer chunk size layout comments

28ad3c4

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Validate DBuffer global layout invariants

33d91af

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Move DBuffer global layout to layout module

2fcfbf9

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Remove DBuffer 0D rejection test

b643856

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Document DBuffer tensor range helper

7864d95

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Keep DBuffer placement validation local

c01bf24

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

WIP: add experimental minimal FSDP path

5af01d8

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

wujingyue added 2 commits June 11, 2026 04:25

Split minimal FSDP runtime modules

c88800e

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

Split minimal FSDP module mixin

1de42fe

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

wujingyue changed the title ~~[codex] Add experimental FSDP microbatch context~~ Add experimental FSDP microbatch context Jun 16, 2026

wujingyue force-pushed the fsdp/microbatch-context branch 2 times, most recently from 6dac196 to a8f39bf Compare June 16, 2026 20:47

wujingyue force-pushed the fsdp/microbatch-context branch 6 times, most recently from 8ef58cd to 4b0fa79 Compare June 16, 2026 21:16

Remove experimental FSDP meta parameter support

86f98dd

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>

copy-pr-bot Bot force-pushed the pull-request/4976 branch from b5c805e to 86f98dd Compare June 16, 2026 21:22

wujingyue force-pushed the fsdp/microbatch-context branch 7 times, most recently from 09ab0e2 to 8930e9a Compare June 16, 2026 22:08

Add experimental FSDP microbatch context

a55efcf

wujingyue force-pushed the fsdp/microbatch-context branch from 8930e9a to a55efcf Compare June 16, 2026 22:14

wujingyue changed the title ~~Add experimental FSDP microbatch context~~ Add experimental FSDP microbatch context helper Jun 16, 2026

wujingyue mentioned this pull request Jun 16, 2026

Add experimental Megatron-FSDP fully_shard implementation #4976

Closed

copy-pr-bot Bot force-pushed the pull-request/4976 branch from 86f98dd to 19a814e Compare June 16, 2026 22:25

copy-pr-bot Bot deleted the branch NVIDIA:pull-request/4976 June 17, 2026 04:18

copy-pr-bot Bot closed this Jun 17, 2026

wujingyue added the MFSDPv2 label Jun 17, 2026

wujingyue wants to merge 41 commits into NVIDIA:pull-request/4976 from wujingyue:fsdp/microbatch-context