fix(pd): transfer MiniMax-M3 sparse indexer-key cache in disaggregation #1368

Merged: valarLip merged 1 commit into main from fix/m3-pd-sparse-index-cache-transfer on Jun 29, 2026

@Jasen2201

Contributor

MiniMax-M3 sparse attention reuses the unified KV cache and kv_scale for K/V, so the fp8 per-token scales already travel with the KV blocks. It keeps one extra per-token buffer, runner.sparse_attention_index_cache, which holds the indexer keys used for top-k block selection at decode time. get_kv_transfer_tensors() never registered that buffer, so under PD disaggregation the decode node ran top-k against a zeroed or stale index for the prefilled tokens and attended to the wrong KV blocks. The bug is masked for short prompts (the init+local+topk window already covers every block, so selection is moot) but corrupts output once the context exceeds that window.
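Why the bug only shows up on long contexts can be illustrated with a toy block selector. This is a minimal sketch, not the M3 kernel: `select_blocks`, its parameters, and the scores are hypothetical, and it assumes "init+local+topk" means the first few blocks and the last few blocks are always kept, with top-k chosen among the rest by an indexer-derived score.

```python
def select_blocks(scores, n_init, n_local, top_k):
    """Return the sorted block ids a query attends to (toy model).

    scores: one relevance score per KV block, standing in for what the
    indexer keys would produce. The first n_init and last n_local blocks
    are always kept; top_k more are picked from the remainder by score.
    """
    n = len(scores)
    keep = set(range(min(n_init, n))) | set(range(max(0, n - n_local), n))
    rest = [b for b in range(n) if b not in keep]
    # Ties break toward the lower block id so the result is deterministic.
    rest.sort(key=lambda b: (-scores[b], b))
    keep.update(rest[:top_k])
    return sorted(keep)


# Short context: 4 blocks, and init(1) + local(1) + top_k(2) covers all of them.
short_real = select_blocks([0.1, 0.9, 0.2, 0.3], n_init=1, n_local=1, top_k=2)
short_zero = select_blocks([0.0] * 4, n_init=1, n_local=1, top_k=2)

# Long context: 8 blocks, so the scores now decide which middle blocks are read.
real_scores = [0.1, 0.2, 0.1, 0.9, 0.8, 0.1, 0.2, 0.3]
long_real = select_blocks(real_scores, n_init=1, n_local=1, top_k=2)
long_zero = select_blocks([0.0] * 8, n_init=1, n_local=1, top_k=2)

print(short_real == short_zero)  # selection is identical on the short context
print(long_real, long_zero)      # selection diverges on the long context
```

With a zeroed score vector standing in for the untransferred index cache, the short case picks the same blocks either way, while the long case picks different middle blocks.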

Register the indexer-key cache as block-indexed transfer regions (one per sparse layer, same physical-block striding as the KV cache), guarded by getattr so non-sparse models and bf16 paths are unaffected.
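The shape of that registration might look roughly like the sketch below. It is illustrative only and is not the code in this PR: `BlockRegion`, `collect_transfer_regions`, `kv_regions`, `num_blocks`, and the per-layer `base_offset`/`block_bytes` fields are all hypothetical stand-ins (the real buffer is presumably a tensor, and real regions would carry device addresses). Only the attribute name `sparse_attention_index_cache` and the getattr guard come from the description above.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class BlockRegion:
    """Hypothetical descriptor for one block-indexed transfer region."""
    name: str
    base_offset: int  # offset of physical block 0 (stand-in for a device address)
    block_bytes: int  # stride between consecutive physical blocks
    num_blocks: int

    def block_range(self, block_id: int) -> tuple[int, int]:
        """Byte range [start, end) covered by one physical block."""
        if not 0 <= block_id < self.num_blocks:
            raise IndexError(block_id)
        start = self.base_offset + block_id * self.block_bytes
        return start, start + self.block_bytes


def collect_transfer_regions(runner) -> list[BlockRegion]:
    """Gather KV regions, plus one index-cache region per sparse layer."""
    regions = list(runner.kv_regions)
    # getattr guard: runners without the buffer fall through unchanged.
    index_cache = getattr(runner, "sparse_attention_index_cache", None)
    if index_cache is not None:
        for layer_id, layer in enumerate(index_cache):
            regions.append(BlockRegion(
                name=f"sparse_index.layer{layer_id}",
                base_offset=layer.base_offset,
                block_bytes=layer.block_bytes,
                num_blocks=runner.num_blocks,  # same block ids as the KV cache
            ))
    return regions


kv = [BlockRegion("kv.layer0", 0, 4096, 8)]
dense_runner = SimpleNamespace(kv_regions=kv, num_blocks=8)
sparse_runner = SimpleNamespace(
    kv_regions=kv,
    num_blocks=8,
    sparse_attention_index_cache=[
        SimpleNamespace(base_offset=100_000, block_bytes=512),
        SimpleNamespace(base_offset=200_000, block_bytes=512),
    ],
)
dense_regions = collect_transfer_regions(dense_runner)
sparse_regions = collect_transfer_regions(sparse_runner)
```

Because each index-cache region is addressed by the same physical block id as the KV cache, a transfer that ships a set of KV blocks can ship the matching index-cache slices using the same block list.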

Tested (latest image, 1P+1D TP4, fp8 KV via Triton attention): GSM8K 5-shot = 0.9401, i.e. no regression for M3 fp8 PD. Short-prompt GSM8K does not exercise the long-context top-k path that the buffer affects; that path is covered by review, not by this run.

Copilot AI review requested due to automatic review settings June 26, 2026 07:39

Copilot AI left a comment

Pull request overview

Fixes PD disaggregation correctness for MiniMax-M3 sparse attention by ensuring the per-token sparse indexer-key cache is included in the KV RDMA transfer set. This prevents decode workers from running top‑k block selection against stale/zero index-cache data for prefilled tokens, which can mis-select KV blocks once context grows beyond the init/local/top‑k coverage window.

Changes:

  • Register runner.sparse_attention_index_cache as additional block-indexed transfer regions in get_kv_transfer_tensors().
  • Guard the new transfer registration via getattr(...) so non-sparse models (and runners without the buffer) are unaffected.

@zufayu zufayu requested a review from ZhangLirong-amd June 26, 2026 13:58
@valarLip valarLip merged commit 7486551 into main Jun 29, 2026
28 of 34 checks passed
@valarLip valarLip deleted the fix/m3-pd-sparse-index-cache-transfer branch June 29, 2026 03:04