Add experimental FSDP microbatch context helper #5378

Closed
wujingyue wants to merge 41 commits into NVIDIA:pull-request/4976 from wujingyue:fsdp/microbatch-context

Conversation

@wujingyue

@wujingyue wujingyue commented Jun 16, 2026

Contributor

Summary

  • Add FsdpContext next to FsdpModule, with lazy root-subtree context creation that follows the structure of #5124 (Add FSDP stream context delayed release).
  • Add megatron_fsdp.experimental.microbatch(module, is_first) alongside the experimental fully_shard(...) API in fully_shard.py.
  • Implement microbatch() as a scoped DFS helper that lazy-inits discovered FSDP root contexts, sets is_first_microbatch, and restores prior values on exit.
  • Keep default behavior correctness-first: forwards outside microbatch() still sync main weights before forward.
  • Skip main-weight-to-model-weight sync after the first microbatch while still unsharding before every forward.
  • Add a local Megatron-FSDP NVTX range around sync_model_weight_from_main_weight() without importing megatron.core.utils.
  • Cover the unwrapped-parent case with child modules wrapped by fully_shard() in an isolated is_first=False context-state test.
  • Cover sync cadence with a normal top-level FSDP model training loop that verifies weight sync runs once per FSDP group per minibatch, not once per microbatch.
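
The scoped behavior described in the bullets above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: `ToyModule`, `ToyContext`, and `lazy_context` are hypothetical stand-ins (no torch, no real FsdpModule or FsdpContext), every wrapped module is assumed to own its own context, and a counter stands in for sync_model_weight_from_main_weight(). Only the name microbatch(module, is_first) and the is_first_microbatch flag come from the description.

```python
from contextlib import contextmanager
from typing import Iterator, List, Optional, Tuple


class ToyContext:
    """Hypothetical stand-in for FsdpContext; it only tracks the flag."""

    def __init__(self) -> None:
        # Correctness-first default: forwards outside microbatch() still sync.
        self.is_first_microbatch = True


class ToyModule:
    """Hypothetical module tree node; not torch.nn.Module or FsdpModule."""

    def __init__(self, name: str, children=(), wrapped: bool = False) -> None:
        self.name = name
        self.children: List["ToyModule"] = list(children)
        self.wrapped = wrapped  # True if "wrapped by fully_shard()" in this toy
        self.context: Optional[ToyContext] = None  # created lazily
        self.sync_count = 0  # counts simulated main-weight syncs

    def forward(self) -> None:
        if self.wrapped:
            if lazy_context(self).is_first_microbatch:
                self.sync_count += 1  # stands in for the main-weight sync
            # A real implementation would unshard here on every forward.
        for child in self.children:
            child.forward()


def lazy_context(module: ToyModule) -> ToyContext:
    """Create the context on first use and reuse it afterwards."""
    if module.context is None:
        module.context = ToyContext()
    return module.context


@contextmanager
def microbatch(module: ToyModule, is_first: bool) -> Iterator[None]:
    """Set is_first_microbatch on every wrapped module found by DFS,
    then restore the prior values on exit, even if the body raises."""
    saved: List[Tuple[ToyContext, bool]] = []
    stack = [module]
    while stack:
        node = stack.pop()
        if node.wrapped:
            ctx = lazy_context(node)
            saved.append((ctx, ctx.is_first_microbatch))
            ctx.is_first_microbatch = is_first
        stack.extend(node.children)
    try:
        yield
    finally:
        for ctx, prior in reversed(saved):
            ctx.is_first_microbatch = prior


if __name__ == "__main__":
    # Unwrapped parent with two wrapped children, as in the unwrapped-parent test case.
    a, b = ToyModule("a", wrapped=True), ToyModule("b", wrapped=True)
    parent = ToyModule("parent", children=[a, b])
    for _minibatch in range(2):
        for i in range(4):
            with microbatch(parent, is_first=(i == 0)):
                parent.forward()
    print(a.sync_count, b.sync_count)
```

In this toy, each wrapped child syncs once per minibatch rather than once per microbatch, and the flag returns to its default after the `with` block, so a later forward outside microbatch() syncs again.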

Stack

Stacked on #4976 with base pull-request/4976.

Testing

  • uv run isort ... could not run because uv attempted to modify /opt/venv and hit a permission error; reran as uv run --no-sync isort ... successfully.
  • git diff --check passed.
  • uv run --no-sync python -m torch.distributed.run --nproc-per-node 1 -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py::test_microbatch_false_scopes_unwrapped_parent_child_contexts tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py::test_microbatch_training_syncs_once_per_minibatch passed: 2 passed.
  • uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py could not run on this host: only 2 CUDA devices are visible, and rank 3 failed with CUDA error: invalid device ordinal.
  • uv run --no-sync python -m torch.distributed.run --nproc-per-node 2 -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py passed: 10 passed.

wujingyue added 30 commits June 10, 2026 21:51
Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
Summarize today's FSDP work:

- Add DBuffer storage release/reallocate support and an in-place fully_allgather_into path for materializing replicated buffers.

- Simplify ParameterGroup and FsdpModule around sharded DTensor parameters, reused unsharded Parameters, meta materialization, and default-stream unshard/reshard/reduce behavior.

- Remove unused optimizer/offload/state/helper surface area from the minimal path and keep version-counter preservation scoped to unsharded model-weight materialization.

- Expand DBuffer and experimental FSDP tests for layouts, storage lifecycle, DTensor contracts, meta reset, nested ownership, train-step parity, and peak-memory reduction.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
Add out= support to DBuffer redistribution and primitive communication ops, keeping axis inference in redistribute only.

Use preallocated model and gradient buffers in the minimal FSDP path where possible, including direct first-gradient reduce-scatter into main_grad.

Update DBuffer and experimental FSDP tests for AVG reductions, explicit primitive axes, storage reuse, and gradient accumulation behavior.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wujingyue wujingyue changed the title [codex] Add experimental FSDP microbatch context Add experimental FSDP microbatch context Jun 16, 2026
@wujingyue wujingyue force-pushed the fsdp/microbatch-context branch 2 times, most recently from 6dac196 to a8f39bf Compare June 16, 2026 20:47
@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@wujingyue wujingyue force-pushed the fsdp/microbatch-context branch 6 times, most recently from 8ef58cd to 4b0fa79 Compare June 16, 2026 21:16
Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
@copy-pr-bot copy-pr-bot Bot force-pushed the pull-request/4976 branch from b5c805e to 86f98dd Compare June 16, 2026 21:22
@wujingyue wujingyue force-pushed the fsdp/microbatch-context branch 7 times, most recently from 09ab0e2 to 8930e9a Compare June 16, 2026 22:08
@wujingyue wujingyue force-pushed the fsdp/microbatch-context branch from 8930e9a to a55efcf Compare June 16, 2026 22:14
@wujingyue wujingyue changed the title Add experimental FSDP microbatch context Add experimental FSDP microbatch context helper Jun 16, 2026
@copy-pr-bot copy-pr-bot Bot force-pushed the pull-request/4976 branch from 86f98dd to 19a814e Compare June 16, 2026 22:25
@copy-pr-bot copy-pr-bot Bot deleted the branch NVIDIA:pull-request/4976 June 17, 2026 04:18
@copy-pr-bot copy-pr-bot Bot closed this Jun 17, 2026