
padding-free / packed-sequence support for Qwen3.5 #186

Merged
meichangsu1 merged 7 commits into modelscope:main from meichangsu1:qwen3_5_padding_free_ljl
May 7, 2026
Conversation

@meichangsu1
Collaborator

@meichangsu1 meichangsu1 commented Apr 30, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

This PR focuses on three sequence-parallel / padding-free fixes for Qwen3.5 in the Transformers backend.

Main changes:

  • Add Qwen3.5 support for padding-free / packed inputs.
    • Introduce a dedicated Qwen3.5 (Qwen3.6) GatedDeltaNet padding-free patch.
    • Pass explicit packed-sequence metadata (cu_seq_lens_q, cu_seq_lens_k, max_length_q, max_length_k) for packed Qwen3.5 inputs.
    • Make Qwen3.5 linear attention use flash-linear-attention kernels with packed cu_seqlens when padding-free is enabled.
  • Fix gather_loss_tensors to remove sequence-parallel padding before loss computation.
    • Trim SP/RP-added padding from gathered logps and labels.
    • Ensure packed/padding-free loss computation uses only real tokens after gather.
  • Add a non-padding-free fallback path for Qwen3.5 GatedDeltaNet when flash-linear-attention is unavailable.
    • If padding-free is not enabled, fall back to torch-native GatedDeltaNet computation.
    • This keeps non-packed Qwen3.5 sequence-parallel training usable without requiring FLA.
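As a rough illustration of the packed-sequence metadata listed above, the four fields can be built from per-sequence lengths. The helper name below is hypothetical and not from the repository; it only shows the cu_seqlens format that FlashAttention-style varlen kernels consume:

```python
def build_packed_metadata(seq_lens):
    """Build varlen attention metadata for a batch packed into one row.

    A sketch only: the real code paths pass tensors through the
    Transformers attention interface, not plain lists.
    """
    # Cumulative boundaries [0, l0, l0+l1, ...]: the cu_seqlens layout
    # varlen attention kernels expect for a packed (padding-free) batch.
    cu = [0]
    for n in seq_lens:
        cu.append(cu[-1] + n)
    max_len = max(seq_lens)
    # Self-attention over a single packed stream: q and k share the metadata.
    return {
        "cu_seq_lens_q": cu,
        "cu_seq_lens_k": cu,
        "max_length_q": max_len,
        "max_length_k": max_len,
    }

meta = build_packed_metadata([3, 5, 2])
# meta["cu_seq_lens_q"] -> [0, 3, 8, 10]; meta["max_length_q"] -> 5
```

With this metadata each packed sub-sequence attends only within its own boundaries, so no padding tokens are needed.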

Experiment results

  • The `is_packed` flag was ambiguous and only inferred from position IDs. `padding_free` is now passed explicitly as an input, making the intent clearer and enabling early validation of attention-backend compatibility.
  • Simplified the logits-return logic in the `forward` and `forward_only` methods by removing the redundant `_outputs` copy and `logits` variable. The new logic modifies `outputs` directly and creates a single copy for the return value, reducing code complexity and the potential for bugs.
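The early validation enabled by an explicit `padding_free` flag could look roughly like this. The function name and the set of supported backends are assumptions for illustration, not taken from the repository:

```python
def validate_padding_free(padding_free, attn_implementation):
    # Hypothetical fail-fast check: with an explicit padding_free flag we
    # can reject incompatible attention backends up front, instead of
    # inferring the packed layout from position_ids deep in the forward.
    supported = {"flash_attention_2", "flash_attention_3"}  # assumed set
    if padding_free and attn_implementation not in supported:
        raise ValueError(
            f"padding_free=True requires one of {sorted(supported)}, "
            f"got {attn_implementation!r}"
        )

validate_padding_free(True, "flash_attention_2")  # passes silently
```

Failing at configuration time surfaces the incompatibility before any training step runs.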
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for padding-free and packed sequence inputs for Qwen 3.5 models, specifically targeting GatedDeltaNet and linear attention within a sequence parallel context. Key changes include a new patching mechanism for Qwen 3.5, refactored attention logic to handle variable sequence lengths without padding, and fallback implementations for linear attention kernels when specialized libraries are missing. Feedback highlights a regression in how sequence boundaries are determined in the attention strategy and identifies inconsistencies in the return types and activation handling within the new torch-based fallback for causal convolution.

Comment thread src/twinkle/model/transformers/strategy/sequence_parallel/linear_attention_sp.py Outdated
…helpers and renaming function

Remove `_get_real_position_ids` and `_is_packed_position_ids` helper functions that are no longer used. Inline the availability check into `_get_flash_linear_attention_kernels` instead of a separate function. Rename `_run_with_gdn_conv_and_delta_rule_cu_seqlens` to `_patch_gdn_kernels_for_cu_seqlens` for clarity.
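An inlined availability check of this shape is a common pattern for optional kernel dependencies; the sketch below is an assumption about what `_get_flash_linear_attention_kernels` might look like, and the `fla` import path is based on the flash-linear-attention package layout rather than this repository:

```python
def _get_flash_linear_attention_kernels():
    # Hypothetical sketch: prefer the flash-linear-attention (FLA) kernel
    # for the gated delta rule, and signal unavailability with None so the
    # caller can take the torch-native fallback path described in the PR.
    try:
        from fla.ops.gated_delta_rule import chunk_gated_delta_rule
        return chunk_gated_delta_rule
    except ImportError:
        return None

kernel = _get_flash_linear_attention_kernels()
```

Returning `None` (rather than raising) lets the non-padding-free fallback keep Qwen3.5 sequence-parallel training usable without FLA installed.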
… and rely on explicit position_ids

The automatic derivation of cu_seq_lens_q from position_ids in `_update_packed_varlen_metadata` was removed to simplify the codebase and avoid potential inconsistencies. Now, packed sequence metadata must be provided explicitly via valid position_ids or other means, with clearer error messages when missing.
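For context, the removed automatic derivation amounted to something like the following simplified sketch (plain lists instead of tensors; the real helper also handled edge cases). A position id resetting to 0 marks the start of a new packed sequence:

```python
def cu_seqlens_from_position_ids(position_ids):
    # Each reset to 0 starts a new packed sequence; the boundaries are
    # those start offsets plus the total packed length. This inference is
    # exactly what the PR replaces with explicitly passed metadata.
    starts = [i for i, p in enumerate(position_ids) if p == 0]
    return starts + [len(position_ids)]

bounds = cu_seqlens_from_position_ids([0, 1, 2, 0, 1, 0, 1, 2, 3])
# bounds -> [0, 3, 5, 9]
```

The fragility is visible here: any non-standard position ids (e.g. offset or rope-scaled ones) silently produce wrong boundaries, which motivates passing the metadata explicitly.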
Comment thread src/twinkle/patch/gdn_padding_free.py
Comment thread src/twinkle/patch/qwen35_gdn_padding_free.py Outdated
The patch class and attribute were renamed from Qwen35-specific names to generic GatedDeltaNet names to reflect that the padding-free optimization is not limited to Qwen3.5 models.
@meichangsu1 meichangsu1 force-pushed the qwen3_5_padding_free_ljl branch from ceb8983 to 4daf906 on May 7, 2026 at 07:48
@meichangsu1 meichangsu1 merged commit 6921342 into modelscope:main May 7, 2026
1 of 3 checks passed

3 participants