[AutoSP] (Sequence Parallelism) support for Multimodal Models (ViT + LLM) #7984
delock merged 35 commits into deepspeedai:master
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7475f39b05
# (query, key, value) tensors separately. For HuggingFace-style
# attention that receives hidden_states, use scatter_idx=2 / gather_idx=0
# defaults which match the typical [bs, seq, heads, dim] layout.
wrapped = DistributedAttention(local_attention=module, sequence_process_group=process_group)
Gate LLM wrapping on DistributedAttention-compatible signatures
This replaces every detected LLM attention module with DistributedAttention, but DistributedAttention.forward expects (query, key, value, batch_dim_idx) and then calls the wrapped module with Q/K/V tensors (deepspeed/sequence/layer.py). The HF-style classes listed in _LLM_ATTN_CLASSNAMES are still invoked by their decoder blocks with hidden_states, so the first wrapped decoder forward will fail (or run with invalid semantics) because no call-site adapter is added here. The new integration tests also avoid running the LLM branch due to this interface mismatch, which confirms this auto-wrap path is not executable end-to-end.
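A minimal sketch of such a gate, assuming the auto-wrap loop can inspect each candidate module's forward signature (the helper name and the exact q/k/v parameter-name check are illustrative, not the PR's API):

import inspect

from deepspeed.sequence.layer import DistributedAttention


def maybe_wrap_llm_attention(module, process_group):
    # Only wrap callables that already take separate q/k/v tensors, which is
    # what DistributedAttention.forward passes to the wrapped module.
    params = list(inspect.signature(module.forward).parameters)
    if params[:3] != ["query", "key", "value"]:
        # HF-style decoder attention receives hidden_states; wrapping it here
        # would break (or silently misbehave) on the first forward call.
        return module
    return DistributedAttention(local_attention=module,
                                sequence_process_group=process_group)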
# Wrap ViT encoder attention layers
# ------------------------------------------------------------------
for name, module in info.vit_attn_modules:
    wrapped = UlyssesSPViTAttention(module, process_group)
Configure CLS handling when wrapping ViT attention modules
Auto-wrapping always constructs UlyssesSPViTAttention with its default has_cls_token=True, which is incorrect for patch-only vision encoders (including classes newly registered here such as InternVisionAttention and Qwen2VLVisionAttention). In those models, token 0 is a real patch, so forcing CLS mode causes gather/scatter to treat a patch as replicated CLS and mis-partition the sequence. The benchmark and integration tests must manually flip has_cls_token=False after wrapping, so the advertised one-call wrapping is incorrect by default for those targets.
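A small sketch of class-aware CLS handling, assuming the registry of patch-only encoder classes is known at wrap time (the set name and helper are illustrative; the wrapper class is passed in to avoid guessing its import path):

# Vision attention classes whose token 0 is a real patch, not a CLS token
# (per the review above); extend as new encoders are registered.
_PATCH_ONLY_VIT_CLASSES = {"InternVisionAttention", "Qwen2VLVisionAttention"}


def wrap_vit_attention(module, process_group, wrapper_cls):
    """wrapper_cls is expected to be UlyssesSPViTAttention; it is passed in
    here only to keep this sketch independent of the final module layout."""
    has_cls = type(module).__name__ not in _PATCH_ONLY_VIT_CLASSES
    return wrapper_cls(module, process_group, has_cls_token=has_cls)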
Commits:
…splice; fix: fix some format issue by pre-commit; fix: fix some format err by tool
…parallelism
fix: fix some format errs by tool
fix: fix some format err by tool
fix: delete get_accelerator for not use.
Force-pushed from 7475f39 to c21fe99.
# DeepSpeed Team
"""
ModalityFusionSPAdapter — Phase 2
If these adapters are intended to be applied in the model script, the documentation should be updated to reflect that usage.
Usage documentation has been added (lines 24-63 in autosp_fusion.py)
across the sequence dimension. Each rank appends its local patches to the
same ``cls`` token before calling the wrapped attention.

Padding: when ``num_patches % world_size != 0``, we pad patches with zeros
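For concreteness, a tiny arithmetic sketch of the padding behaviour this docstring describes (variable names are illustrative):

import math

num_patches, world_size = 10, 4
max_local_len = math.ceil(num_patches / world_size)          # 3 patches per rank
pad_len = max_local_len * world_size - num_patches           # 2 zero patches appended
local_lens = [min(max_local_len, num_patches - r * max_local_len)
              for r in range(world_size)]                    # [3, 3, 3, 1]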
Is there a test covering num_patches % world_size != 0?
Test coverage has been added in test_noneven_patches, which exercises num_patches % world_size != 0.
Hi @nathon-lee, thanks for your contribution! I have left my review comments, thanks!
Hi @delock, thank you for the thorough review!
I'll push the updates shortly. Thanks again for your time!
…cs and tests
[AutoSP] Fix ViT CLS handling and skip incompatible HF LLM wrapping
    for _ in range(self.world_size)
]
dist.all_gather(gathered, local_patches_padded.contiguous(), group=self.process_group)
full_patches = torch.cat(gathered, dim=1)  # [bs, world_size * max_local_len, hidden_dim]
Hi @nathon-lee, thanks for the fix for unevenly divided patches. I have a follow-up question: the padding added will appear in full_patches and therefore also in full_input, so during the attention computation over full_input the softmax might be affected by these padding patches. Should we 'unpad' the all_gather result before computing full_input, or mask the padding during the attention computation?
Thanks for catching this, @delock — much appreciated. You're right that the zero-padded tokens in full_patches would participate in the softmax and lead to divergence from single-device execution.
Fixed by de-padding before calling attention rather than masking. After all_gather, each shard is trimmed to its true length using per-rank lengths collected via a preceding scalar all_gather.
De-pad gathered shards before calling attention so that dummy zero-tokens never enter the softmax computation:
- all_gather each rank's exact local_patch_len into all_lens
- strip per-rank padding from gathered buffers before torch.cat
- update the scatter offset to sum(all_lens[:rank]) instead of rank * max_local_len
All 6 TestViTSPEquivalence tests pass (including test_noneven_patches). (Follow-up: fix some format err by tool.)
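A condensed sketch of that de-padding flow (variable names follow the commit message above; this is an illustration, not the literal patch):

import torch
import torch.distributed as dist


def gather_unpadded_patches(local_patches_padded, local_patch_len, process_group, world_size):
    # 1) exchange the exact per-rank lengths
    len_t = torch.tensor([local_patch_len], device=local_patches_padded.device)
    all_lens = [torch.zeros_like(len_t) for _ in range(world_size)]
    dist.all_gather(all_lens, len_t, group=process_group)
    all_lens = [int(l.item()) for l in all_lens]

    # 2) all_gather the zero-padded shards (all share the same max_local_len)
    gathered = [torch.empty_like(local_patches_padded) for _ in range(world_size)]
    dist.all_gather(gathered, local_patches_padded.contiguous(), group=process_group)

    # 3) strip per-rank padding before concatenation so no zero-tokens reach attention
    full_patches = torch.cat([g[:, :n] for g, n in zip(gathered, all_lens)], dim=1)

    # 4) this rank's scatter offset becomes sum(all_lens[:rank])
    rank = dist.get_rank(group=process_group)
    offset = sum(all_lens[:rank])
    return full_patches, offset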
[AutoSP] Fix padding-before-attention bug in UlyssesSPViTAttention
Force-pushed from 02262e4 to 462f169.
Description
Hello DeepSpeed Team! 👋
This PR directly addresses the "Multimodal model support" goal outlined in the DeepSpeed Roadmap Q2 2026 (#7861).
It introduces AutoSP (Sequence Parallelism) support for Multimodal Models (ViT + LLM) out of the box. As noted in the roadmap, multimodal models handle significantly longer sequence lengths, making SP critical. This PR automates the injection of DeepSpeed Ulysses-based sequence parallelism into multimodal architectures, removing the need for manual and error-prone engineering efforts.
This is a consolidated PR of several incremental features developed and thoroughly tested in my fork.
🎯 Related Issue
🌟 Key Features & Contributions
- AutoSP Scaffolding & Detector (auto_wrap_model_for_sp): detects the ViT and LLM attention modules in a multimodal model and auto-wraps the detected LLM attention with DistributedAttention.
- ViT Sequence Parallelism (UlyssesSPViTAttention): a Gather-Compute-Scatter sequence parallel wrapper tailored for non-causal ViT attention layers.
- Cross-Modal Fusion Adapters (Phase 2):
  - LLaVA (LlavaFusionAdapter): Visual token splice replacing image placeholders.
  - InternVL (InternVLFusionAdapter): IMG_CONTEXT token splice.
  - Qwen2-VL (Qwen2VLFusionAdapter): Vision_start/end bounded splice.
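A minimal usage sketch of the one-call wrapping, assuming auto_wrap_model_for_sp takes the model plus a sequence-parallel process group (the keyword name and import location are assumptions; the model id is only an example):

import torch.distributed as dist
from transformers import LlavaForConditionalGeneration

# from deepspeed... import auto_wrap_model_for_sp  # import path is defined by this PR

dist.init_process_group(backend="nccl")
sp_group = dist.new_group(ranks=list(range(dist.get_world_size())))

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = auto_wrap_model_for_sp(model, sequence_process_group=sp_group)  # assumed signature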
🧪 Testing & Validation
To ensure this PR does not break any existing functionality and is numerically sound, comprehensive tests have been added:
- Equivalence tests (tests/unit/sequence_parallelism/test_autosp_equivalence.py) verifying that the SP-wrapped path across N ranks produces the exact same numerical results as the equivalent single-device (non-SP) computation.
- A benchmark script (benchmarks/autosp/bench_multimodal_sp.py) to easily verify throughput scaling and peak GPU memory reduction.
(All tests pass cleanly on 2 GPUs with NCCL_P2P_DISABLE=1.)
🚧 Known Limitations & Future Work
To be fully transparent, there are a few limitations in the current design that I plan to improve in follow-up iterations (or would love guidance on from the team):
- Phase 2 cross-modal fusion currently requires manually applying ModalityFusionSPAdapter due to varying HF model implementations. Fully automating Phase 2 is a logical next step.
- UlyssesSPViTAttention uses a Gather-Compute-Scatter approach. While it successfully reduces FFN memory by …
- When fused_len % world_size != 0, zero-padding is applied. Currently, the global attention_mask is not automatically intercepted and patched, which might require user attention during inference.
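As a rough illustration of the manual fix-up the last item alludes to (assuming the HF convention of 1 = attend, 0 = ignore; the helper name is hypothetical):

import torch
import torch.nn.functional as F


def pad_attention_mask(attention_mask: torch.Tensor, world_size: int) -> torch.Tensor:
    """Extend a [bs, fused_len] mask with zeros so the zero-padded positions
    introduced when fused_len % world_size != 0 are ignored by attention."""
    fused_len = attention_mask.shape[-1]
    pad_len = (-fused_len) % world_size
    return F.pad(attention_mask, (0, pad_len), value=0) if pad_len else attention_mask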
I would deeply appreciate any feedback or suggestions from the maintainers! I am more than happy to make any required adjustments, refactorings, or add further test cases to get this perfectly aligned with the Q2 roadmap and DeepSpeed's standards.
Thank you for your time reviewing this! 🚀