Add FSDP stream context delayed release #5124

Closed
wujingyue wants to merge 43 commits into NVIDIA:pull-request/4976 from wujingyue:doublebuffer

@wujingyue

Contributor

Summary

  • Based on #4976, "Add experimental Megatron-FSDP fully_shard implementation" (fsdp/minimal), which adds the minimal experimental FSDP path.
  • Adds an FsdpContext to own rank-local FSDP communication stream state and delayed-release scheduling.
  • Runs parameter all-gathers on the context communication stream while compute stays on the default stream.
  • Uses delayed release of unsharded storage so child-unit all-gathers can overlap default-stream GEMMs under a shared root context.
  • Keeps reduce-scatter on the default stream for now; overlapping it needs a follow-up design.
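
For readers who want a feel for the scheduling described above, here is a minimal pure-Python sketch of delayed release. It is not the code in this PR: `ToyUnit`, `ToyContext`, `run_forward`, the `delay` parameter, and the prefetch order are all hypothetical, and streams are only simulated by a log. The assumed idea is that a unit's unsharded storage is queued for release rather than freed right after its compute, so the next unit's all-gather can be issued without waiting on the free.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyUnit:
    """Hypothetical stand-in for one FSDP unit; tracks only storage liveness."""
    name: str
    unsharded: bool = False


@dataclass
class ToyContext:
    """Hypothetical rank-local context shared by every unit under one root."""
    delay: int = 1  # assumed knob: how many releases may stay queued
    pending: deque = field(default_factory=deque)
    log: list = field(default_factory=list)
    live: int = 0
    peak_live: int = 0

    def unshard(self, unit):
        # Stands in for launching an all-gather on the communication stream.
        unit.unsharded = True
        self.live += 1
        self.peak_live = max(self.peak_live, self.live)
        self.log.append(f"allgather {unit.name}")

    def compute(self, unit):
        # Stands in for GEMMs on the default stream.
        assert unit.unsharded, "compute needs unsharded storage"
        self.log.append(f"compute {unit.name}")

    def release_later(self, unit):
        # Queue the release; only free units that have waited long enough.
        self.pending.append(unit)
        while len(self.pending) > self.delay:
            self._release(self.pending.popleft())

    def flush(self):
        # Free everything still queued, e.g. at the end of the forward pass.
        while self.pending:
            self._release(self.pending.popleft())

    def _release(self, unit):
        unit.unsharded = False
        self.live -= 1
        self.log.append(f"release {unit.name}")


def run_forward(names, delay=1):
    """Simulate one forward pass with next-unit prefetch (an assumption)."""
    ctx = ToyContext(delay=delay)
    units = [ToyUnit(n) for n in names]
    ctx.unshard(units[0])
    for i, unit in enumerate(units):
        if i + 1 < len(units):
            ctx.unshard(units[i + 1])  # issue the next all-gather first
        ctx.compute(unit)
        ctx.release_later(unit)
    ctx.flush()
    return ctx, units
```

In this toy model, `run_forward(["a", "b", "c", "d"], delay=1)` keeps at most three units unsharded at once, while `delay=0` keeps at most two. That is the memory-for-overlap trade-off the sketch is meant to show; the actual policy in the PR may differ.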

Testing

  • BASE_REF=main CHECK_ONLY=true SKIP_DOCS=false uv run bash tools/autoformat.sh
  • uv run python -m torch.distributed.run --nproc-per-node 2 -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_dbuffer.py tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py --tb=short --disable-warnings -rN

Notes

  • The profiler test covers a fully-sharded parent root with four child FSDP units and asserts NCCL all-gather kernels run on a communication stream that overlaps GEMM/CUTLASS compute on the default stream.
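
The overlap assertion described in the note can be pictured as an interval check over profiler kernel records. The sketch below is only an illustration of that idea, not the test in this PR: the `KernelEvent` record, the name-matching rules, and the timestamps are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelEvent:
    """Hypothetical, simplified view of one kernel record from a profiler trace."""
    name: str
    stream: int
    start_us: float
    end_us: float


def overlaps(a, b):
    # Two half-open intervals overlap when each starts before the other ends.
    return a.start_us < b.end_us and b.start_us < a.end_us


def allgather_overlaps_compute(events, default_stream):
    """True if some all-gather kernel on a non-default stream runs concurrently
    with some GEMM/CUTLASS kernel on the default stream."""
    gathers = [
        e for e in events
        if "allgather" in e.name.lower() and e.stream != default_stream
    ]
    gemms = [
        e for e in events
        if e.stream == default_stream
        and ("gemm" in e.name.lower() or "cutlass" in e.name.lower())
    ]
    return any(overlaps(g, c) for g in gathers for c in gemms)


# Made-up trace: the all-gather on stream 13 runs while a GEMM is on stream 7.
overlapping_trace = [
    KernelEvent("cutlass_gemm_example", 7, 0.0, 100.0),
    KernelEvent("nccl_AllGather_example", 13, 40.0, 90.0),
]
# Made-up trace: same kernels, but strictly one after the other.
serialized_trace = [
    KernelEvent("cutlass_gemm_example", 7, 0.0, 100.0),
    KernelEvent("nccl_AllGather_example", 13, 100.0, 150.0),
]
```

With these records, `allgather_overlaps_compute(overlapping_trace, default_stream=7)` is true and the serialized trace gives false, which is the distinction the profiler test is described as asserting.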

@copy-pr-bot

copy-pr-bot Bot commented Jun 2, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

wujingyue added 28 commits June 10, 2026 21:51
Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
wujingyue added 11 commits June 11, 2026 04:25
Summarize today's FSDP work:

- Add DBuffer storage release/reallocate support and an in-place fully_allgather_into path for materializing replicated buffers.

- Simplify ParameterGroup and FsdpModule around sharded DTensor parameters, reused unsharded Parameters, meta materialization, and default-stream unshard/reshard/reduce behavior.

- Remove unused optimizer/offload/state/helper surface area from the minimal path and keep version-counter preservation scoped to unsharded model-weight materialization.

- Expand DBuffer and experimental FSDP tests for layouts, storage lifecycle, DTensor contracts, meta reset, nested ownership, train-step parity, and peak-memory reduction.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
Add out= support to DBuffer redistribution and primitive communication ops, keeping axis inference in redistribute only.

Use preallocated model and gradient buffers in the minimal FSDP path where possible, including direct first-gradient reduce-scatter into main_grad.

Update DBuffer and experimental FSDP tests for AVG reductions, explicit primitive axes, storage reuse, and gradient accumulation behavior.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
Use ordered parameter tuples for FSDP parameter swapping and keep the sharded data path aligned with main_weight storage.

Set grad_dtype for FSDP-managed parameters so BF16 main gradients can be reduced without pre-reduce casts, and update tests to verify sharded parameter data and grad backing buffers.

Clean up hook naming, local gradient accumulation handling, and memory/test assertions for the minimal experimental path.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
Add DBuffer casting for dtype conversion before model-weight sync, refresh model weights from main weights before unshard, and cover the next-forward optimizer update path with FP32 main weights and default BF16 main grads on SGD's non-foreach path.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
@wujingyue wujingyue changed the base branch from main to pull-request/4976 June 12, 2026 18:56
@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@copy-pr-bot copy-pr-bot Bot force-pushed the pull-request/4976 branch from 86f98dd to 19a814e Compare June 16, 2026 22:25
@copy-pr-bot copy-pr-bot Bot deleted the branch NVIDIA:pull-request/4976 June 17, 2026 04:18
@copy-pr-bot copy-pr-bot Bot closed this Jun 17, 2026