padding-free / packed-sequence support for Qwen3.5 #186
Merged
meichangsu1 merged 7 commits into modelscope:main on May 7, 2026
Conversation
The `is_packed` flag was ambiguous and only inferred from position IDs. Now `padding_free` is explicitly passed as input, making the intent clearer and enabling early validation of attention backend compatibility.
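As a rough illustration of the idea (the helper name `validate_padding_free` and the backend set are assumptions, not this repository's API), passing `padding_free` explicitly lets the caller fail fast instead of inferring packing from position IDs deep inside the forward pass:

```python
# Hypothetical helper: fail fast when padding-free inputs meet an incompatible backend.
SUPPORTED_PADDING_FREE_BACKENDS = {'flash_attention_2', 'flash_attention_3'}  # assumed set

def validate_padding_free(padding_free: bool, attn_implementation: str) -> None:
    if padding_free and attn_implementation not in SUPPORTED_PADDING_FREE_BACKENDS:
        raise ValueError(
            f'padding_free=True requires one of {sorted(SUPPORTED_PADDING_FREE_BACKENDS)}, '
            f'but attn_implementation={attn_implementation!r} was configured.')
```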
Simplify the logic for returning logits in the `forward` and `forward_only` methods by removing the redundant `_outputs` copy and the `logits` variable. The new logic modifies `outputs` directly and creates a single copy for the return value, reducing code complexity and the potential for bugs.
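A minimal sketch of that simplified return path, assuming a thin wrapper around a Hugging Face-style model output; the in-place postprocessing step shown here is illustrative, not the PR's actual code:

```python
import copy

def forward_only(model, inputs):
    outputs = model(**inputs)
    # Modify logits in place on the original output object ...
    outputs.logits = outputs.logits.float()  # illustrative postprocessing
    # ... and hand back a single shallow copy, instead of maintaining a separate
    # `_outputs` copy plus a standalone `logits` variable that must stay in sync.
    return copy.copy(outputs)
```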
Contributor
Code Review
This pull request introduces support for padding-free and packed sequence inputs for Qwen 3.5 models, specifically targeting GatedDeltaNet and linear attention within a sequence parallel context. Key changes include a new patching mechanism for Qwen 3.5, refactored attention logic to handle variable sequence lengths without padding, and fallback implementations for linear attention kernels when specialized libraries are missing. Feedback highlights a regression in how sequence boundaries are determined in the attention strategy and identifies inconsistencies in the return types and activation handling within the new torch-based fallback for causal convolution.
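For context, a torch-based causal-convolution fallback along these lines (the function name, tensor shapes, and activation handling are assumptions, not the PR's code) could look like:

```python
import torch
import torch.nn.functional as F

def causal_conv1d_torch(x: torch.Tensor, weight: torch.Tensor, bias=None, activation=None):
    """x: (batch, dim, seqlen); weight: (dim, kernel_size) depthwise filter."""
    dim, kernel_size = weight.shape
    x = F.pad(x, (kernel_size - 1, 0))                    # left-pad so the convolution stays causal
    out = F.conv1d(x, weight.unsqueeze(1), bias=bias, groups=dim)
    return F.silu(out) if activation == 'silu' else out   # explicit, consistent activation handling
```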
…helpers and renaming function: Remove the `_get_real_position_ids` and `_is_packed_position_ids` helper functions that are no longer used. Inline the availability check into `_get_flash_linear_attention_kernels` instead of keeping a separate function. Rename `_run_with_gdn_conv_and_delta_rule_cu_seqlens` to `_patch_gdn_kernels_for_cu_seqlens` for clarity.
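An inlined availability check might look roughly like the sketch below; the import follows the `flash-linear-attention` package layout, while the returned structure is an assumption rather than the actual diff:

```python
def _get_flash_linear_attention_kernels():
    # Inlined availability check: try the optional dependency and signal absence with None.
    try:
        from fla.ops.gated_delta_rule import chunk_gated_delta_rule  # flash-linear-attention
    except ImportError:
        return None  # callers switch to the torch fallback path
    return {'chunk_gated_delta_rule': chunk_gated_delta_rule}
```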
… and rely on explicit position_ids: The automatic derivation of cu_seq_lens_q from position_ids in `_update_packed_varlen_metadata` was removed to simplify the codebase and avoid potential inconsistencies. Packed-sequence metadata must now be provided explicitly via valid position_ids or other means, with clearer error messages when it is missing.
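For reference, the kind of derivation that was removed looks roughly like the following (a sketch, not the deleted code): each position where `position_ids` resets to zero marks the start of a new packed sequence, and the cumulative boundaries form `cu_seq_lens`. Passing the metadata explicitly avoids relying on this inference, which is what the commit message means by avoiding potential inconsistencies.

```python
import torch

def cu_seq_lens_from_position_ids(position_ids: torch.Tensor) -> torch.Tensor:
    """position_ids: shape (total_tokens,) for a packed, padding-free batch."""
    starts = torch.nonzero(position_ids == 0, as_tuple=False).flatten()
    total = torch.tensor([position_ids.numel()], device=position_ids.device)
    return torch.cat([starts, total]).to(torch.int32)

# Two packed sequences of lengths 3 and 2:
# cu_seq_lens_from_position_ids(torch.tensor([0, 1, 2, 0, 1])) -> tensor([0, 3, 5], dtype=torch.int32)
```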
tpx818 reviewed May 7, 2026
The patch class and attribute were renamed from Qwen35-specific names to generic GatedDeltaNet names to reflect that the padding-free optimization is not limited to Qwen3.5 models.
ceb8983 to 4daf906
tastelikefeet approved these changes May 7, 2026
PR type
PR information
This PR focuses on three sequence-parallel / padding-free fixes for Qwen3.5 in the Transformers backend.
Main changes:
- Provide packed varlen metadata (`cu_seq_lens_q`, `cu_seq_lens_k`, `max_length_q`, `max_length_k`) for packed Qwen3.5 inputs.
- Patch the GatedDeltaNet / linear-attention kernels to use `cu_seqlens` when padding-free is enabled.
- Fix `gather_loss_tensors` to remove sequence-parallel padding before loss computation, keeping `logps` and `labels` aligned.
- Fall back to a torch implementation when `flash-linear-attention` is unavailable.

Experiment results