Add experimental FSDP microbatch context helper #5378
Closed
wujingyue wants to merge 41 commits into
Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
Summarize today's FSDP work:

- Add DBuffer storage release/reallocate support and an in-place fully_allgather_into path for materializing replicated buffers.
- Simplify ParameterGroup and FsdpModule around sharded DTensor parameters, reused unsharded Parameters, meta materialization, and default-stream unshard/reshard/reduce behavior.
- Remove unused optimizer/offload/state/helper surface area from the minimal path and keep version-counter preservation scoped to unsharded model-weight materialization.
- Expand DBuffer and experimental FSDP tests for layouts, storage lifecycle, DTensor contracts, meta reset, nested ownership, train-step parity, and peak-memory reduction.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
Add out= support to DBuffer redistribution and primitive communication ops, keeping axis inference in redistribute only. Use preallocated model and gradient buffers in the minimal FSDP path where possible, including direct first-gradient reduce-scatter into main_grad. Update DBuffer and experimental FSDP tests for AVG reductions, explicit primitive axes, storage reuse, and gradient accumulation behavior.

Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
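One way to picture "direct first-gradient reduce-scatter into main_grad" is the pure-Python simulation below. It is only a sketch under stated assumptions: `reduce_scatter_avg` and `reduce_grad` are hypothetical names, plain lists stand in for tensors, the collective is simulated in-process, and the accumulate-through-a-temporary path for later gradients is an assumption rather than something this commit message confirms.

```python
def reduce_scatter_avg(per_rank_grads, rank, out=None):
    """Simulated AVG reduce-scatter: return this rank's shard of the averaged gradient.

    Hypothetical helper, not the DBuffer API. If `out` is given, the result is
    written into it in place, mirroring the idea of an `out=` argument.
    """
    world_size = len(per_rank_grads)
    shard_len = len(per_rank_grads[0]) // world_size
    start = rank * shard_len
    if out is None:
        out = [0.0] * shard_len
    for i in range(shard_len):
        out[i] = sum(g[start + i] for g in per_rank_grads) / world_size
    return out


def reduce_grad(per_rank_grads, rank, main_grad, is_first):
    """First gradient: reduce straight into main_grad. Later ones: accumulate (assumed)."""
    if is_first:
        # Overwrites whatever main_grad held, so no zero-fill or temporary is needed.
        reduce_scatter_avg(per_rank_grads, rank, out=main_grad)
    else:
        tmp = reduce_scatter_avg(per_rank_grads, rank)
        for i, value in enumerate(tmp):
            main_grad[i] += value
    return main_grad


# Two simulated ranks, each holding a full (unsharded) gradient of length 4.
per_rank_grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
main_grad = [99.0, 99.0]  # stale contents, overwritten by the first reduction
reduce_grad(per_rank_grads, rank=0, main_grad=main_grad, is_first=True)
after_first = list(main_grad)
reduce_grad(per_rank_grads, rank=0, main_grad=main_grad, is_first=False)
after_second = list(main_grad)
```

In this simulation the first call reuses the preallocated `main_grad` storage, and only the accumulation step allocates a temporary shard.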
Summary
- `FsdpContext` next to `FsdpModule`, with lazy root-subtree context creation following the structure of #5124 (Add FSDP stream context delayed release).
- `megatron_fsdp.experimental.microbatch(module, is_first)` alongside the experimental `fully_shard(...)` API in `fully_shard.py`.
- `microbatch()` as a scoped DFS helper that lazy-inits discovered FSDP root contexts, sets `is_first_microbatch`, and restores prior values on exit.
- … `microbatch()` still sync main weights before forward.
- … `sync_model_weight_from_main_weight()` without importing `megatron.core.utils`.
- … `fully_shard()` in an isolated `is_first=False` context-state test.

Stack
Stacked on #4976 with base `pull-request/4976`.

Testing
- `uv run isort ...` could not run because uv attempted to modify `/opt/venv` and hit a permission error; reran as `uv run --no-sync isort ...` successfully.
- `git diff --check` passed.
- `uv run --no-sync python -m torch.distributed.run --nproc-per-node 1 -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py::test_microbatch_false_scopes_unwrapped_parent_child_contexts tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py::test_microbatch_training_syncs_once_per_minibatch` passed: 2 passed.
- `uv run python -m torch.distributed.run --nproc-per-node 8 -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py` could not run on this host: only 2 CUDA devices are visible, and rank 3 failed with `CUDA error: invalid device ordinal`.
- `uv run --no-sync python -m torch.distributed.run --nproc-per-node 2 -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_experimental_fully_shard.py` passed: 10 passed.
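For reviewers who want a mental model of the scoped behavior described in the Summary (DFS discovery of FSDP roots, lazy context creation, setting `is_first_microbatch`, restoring prior values on exit), here is a minimal pure-Python sketch. It is not the code in this PR: `Node`, `_iter_fsdp_roots`, and this `FsdpContext` are hypothetical stand-ins, the default of `True` for `is_first_microbatch` is an assumption, and so is the choice to stop descending once a root is found.

```python
from contextlib import contextmanager


class FsdpContext:
    """Hypothetical stand-in holding per-root state."""

    def __init__(self):
        self.is_first_microbatch = True  # assumed default


class Node:
    """Hypothetical module-tree node; not the real FsdpModule."""

    def __init__(self, name, is_fsdp_root=False, children=()):
        self.name = name
        self.is_fsdp_root = is_fsdp_root
        self.children = list(children)
        self.context = None  # created lazily by microbatch()


def _iter_fsdp_roots(node):
    """Depth-first walk yielding FSDP roots; assumes a root owns its whole subtree."""
    if node.is_fsdp_root:
        yield node
        return
    for child in node.children:
        yield from _iter_fsdp_roots(child)


@contextmanager
def microbatch(module, is_first):
    """Set is_first_microbatch on every discovered root, restoring old values on exit."""
    saved = []
    for root in _iter_fsdp_roots(module):
        if root.context is None:
            root.context = FsdpContext()  # lazy init on first discovery
        saved.append((root.context, root.context.is_first_microbatch))
        root.context.is_first_microbatch = is_first
    try:
        yield module
    finally:
        # Restore in reverse order so nested scopes unwind cleanly.
        for context, previous in reversed(saved):
            context.is_first_microbatch = previous


# An unwrapped parent with two FSDP-root children.
model = Node("parent", children=[Node("a", is_fsdp_root=True), Node("b", is_fsdp_root=True)])
with microbatch(model, is_first=False):
    flags_inside = [child.context.is_first_microbatch for child in model.children]
flags_after = [child.context.is_first_microbatch for child in model.children]
```

Because the restore runs in a `finally` block, the prior values come back even if the forward or backward pass inside the scope raises.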