[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5, with working MTP #1337

Open
carlushuang wants to merge 42 commits into carhuang/support_gfx1151_qwen36 from carhuang/gfx1151_int8_qwen36

Conversation

@carlushuang

Collaborator

[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5 (Strix Halo), with working MTP

Builds on the gfx1151 BF16 enablement (#1314) to add online INT8 W8A8 (Quark-style, no offline quant step) for the Qwen3.6 dense (27B) and MoE (35B-A3B) architectures, plus the fixes needed to make MTP speculative decoding work on the MoE draft. RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the right quantization target for this iGPU.

Precision split (chosen for quality): A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and the MoE experts; BF16 for the Gated-DeltaNet linear-attention (recurrent, quant-sensitive — int8 there produces garbage), the MoE router gate, lm_head, and embeddings; KV cache BF16. This keeps gsm8k at BF16-equivalent quality while halving weight bytes (the decode bottleneck on a bandwidth-bound iGPU).
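
For readers unfamiliar with A8W8, here is a minimal pure-Python sketch of the arithmetic: int8 weights with one scale per output channel, and activations quantized dynamically with one scale per token. It assumes symmetric quantization clamped to ±127. The function names are hypothetical and this is not the aiter kernel, only an illustration of the numerics.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_rows(x: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one fp scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in x:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def gemm_a8w8(x: Matrix, w: Matrix) -> Matrix:
    """Compute x @ w.T with int8 operands and floating-point rescaling.

    x is [tokens, k]; w is [out_channels, k].
    """
    xq, xs = quantize_rows(x)  # dynamic: recomputed per token at run time
    wq, ws = quantize_rows(w)  # per output channel: done once at load time
    out = []
    for xi, sx in zip(xq, xs):
        row = []
        for wj, sw in zip(wq, ws):
            acc = sum(a * b for a, b in zip(xi, wj))  # integer accumulation
            row.append(acc * sx * sw)                 # dequantize the result
        out.append(row)
    return out
```

The weights are stored at one byte per element instead of two for BF16, which is the halving of weight bytes referred to above.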

What this enables

  • 35B-A3B runs at all: there is no BF16 MoE kernel on gfx1151 (the asm fused-MoE is gfx9-only; ATOM's Triton MoE is MXFP4-weight-only). The int8 path uses aiter's moe_gemm_int8_smoothquant, which is the only int8-W8A8 grouped-GEMM that runs on RDNA3.5.
  • 35B-A3B runs ~6× faster than the BF16 27B baseline (3B active params × int8 + MTP).

Changes

  • model_ops/linear.py: per_Token int8 branch routes to aiter Triton gemm_a8w8 on non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op (torch.ops.aiter.atom_gemm_a8w8_triton) so it is HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
  • model_ops/moe.py: new Int8MoEMethod (int8 w13/w2 + per-channel fp32 scales) and an int8 branch in FusedMoE._online_quant.
  • model_ops/fused_moe_triton.py: triton_kernel_int8_moe_forward runs matmul-ogs routing → per-token int8 quant → moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13 columns) → per-token int8 quant → gemm2 with scatter/combine.
  • models/qwen3_5_mtp.py + model_loader/loader.py: MTP-MoE drafter fix. The draft's fused expert weights (experts.gate_up_proj/down_proj) were silently dropped at load → 0% draft acceptance → MTP was pure overhead. Add the draft's fused-expert mapping (detect_fused_expert_format / get_fused_expert_mapping / load_fused_expert_weights), fix get_expert_mapping to use num_experts, and let the loader resolve load_fused_expert_weights_fn from the model. After the fix: acceptance 0 → 0.83.
  • entrypoints/openai/tool_parser.py: unique tool-call ids (call_<uuid> instead of a per-response call_0). Non-unique ids made agentic clients (qwen-code) dedupe every tool call after the first → endless tool-call loop. Extends #1319 (feat(openai): Qwen3 (qwen3_coder/qwen3_xml) tool-call support).
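
To illustrate the tool-call id item, a small sketch of why per-response ids break a deduplicating client. It assumes the id is built from uuid4 (the description only says call_<uuid>), and dedupe_by_id is a hypothetical stand-in for the client's behaviour, not qwen-code's actual logic.

```python
import uuid


def make_tool_call_id() -> str:
    """Return an id of the form call_<uuid>, unique across responses."""
    return f"call_{uuid.uuid4().hex}"


def dedupe_by_id(calls):
    """Hypothetical client: keep only the first call seen for each id."""
    seen, kept = set(), []
    for call in calls:
        if call["id"] not in seen:
            seen.add(call["id"])
            kept.append(call)
    return kept


# Three responses that each restart numbering at call_0 collapse to one call.
old_style = [{"id": "call_0", "name": f"tool_{i}"} for i in range(3)]
# With unique ids, all three calls survive deduplication.
new_style = [{"id": make_tool_call_id(), "name": f"tool_{i}"} for i in range(3)]
```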

Quality (gsm8k, 5-shot-equivalent, chat + thinking, greedy)

  • 35B-A3B INT8 W8A8 = 0.84 — BF16-equivalent (int8 is faithful). MTP is lossless (accepts a draft token only when it matches the target's greedy argmax), so the MTP build has identical quality.
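
The losslessness argument can be made concrete with a generic sketch of greedy speculative verification. This is the textbook rule, not ATOM's rejection sampler; verify_greedy and its argument layout are assumptions for illustration.

```python
from typing import List


def verify_greedy(draft_tokens: List[int], target_argmax: List[int]) -> List[int]:
    """Return the tokens emitted by one speculative step under greedy decoding.

    target_argmax has len(draft_tokens) + 1 entries: the target's greedy choice
    at each draft position, plus one bonus position after the last draft token.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_argmax[i]:
            break  # first mismatch: discard this and all later draft tokens
        emitted.append(tok)
    # The target's own token at the first mismatch (or at the bonus position)
    # is always emitted, so every step yields at least one token.
    emitted.append(target_argmax[len(emitted)])
    return emitted
```

Every emitted token equals the target's greedy argmax for its position, so the output matches non-speculative greedy decoding and only the number of target forward passes changes.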

Performance (gfx1151 / Radeon 8060S, bs=1)

Decode (single-stream, short context):

| Model | Config | Decode tok/s |
|---|---|---|
| 27B dense | INT8 W8A8 | 6.0 |
| 27B dense | INT8 W8A8 + MTP-1 | 9.4 |
| 35B-A3B | INT8 W8A8 + HIP graph | 24.8 |
| 35B-A3B | INT8 W8A8 + MTP-1 + HIP graph | ~35 |

Long-context (35B-A3B INT8 W8A8 + MTP-1, bs=1):

| Context | Prefill TTFT | Prefill tok/s | Decode (output) tok/s | Total tok/s |
|---|---|---|---|---|
| 64K (60,016 tok) | 85.4 s | 703 | 23.4 | 661 |
| 128K (119,071 tok) | 191.8 s | 621 | 17.3 | 598 |

Decode tok/s falls with context (each step reads the growing KV); prefill is compute-bound (one-time prompt-ingestion cost). The hybrid model's KV is cheap (only the interleaved full-attn layers cache KV), so 128K fits easily — at gpu-memory-utilization 0.9 the KV pool holds ~2.1M tokens; the limit is --max-model-len, not memory.
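
As a sanity check on the long-context numbers, the prefill throughput column can be reproduced from the prompt length and TTFT. This assumes TTFT is dominated by prompt ingestion; the helper name is illustrative.

```python
def prefill_tok_per_s(prompt_tokens: int, ttft_s: float) -> float:
    """Prefill throughput, treating TTFT as the prompt-ingestion time."""
    return prompt_tokens / ttft_s


# (prompt tokens, TTFT in seconds) taken from the long-context results.
rows = {"64K": (60_016, 85.4), "128K": (119_071, 191.8)}
for name, (tokens, ttft) in rows.items():
    print(name, round(prefill_tok_per_s(tokens, ttft)))  # 703, then 621
```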

Serve

ATOM_USE_UNIFIED_ATTN=1 \
python -m atom.entrypoints.openai_server --model Qwen/Qwen3.6-35B-A3B \
  --trust-remote-code -tp 1 --kv_cache_dtype bf16 --block-size 64 \
  --max-model-len 131072 --max-num-seqs 2 --gpu-memory-utilization 0.9 \
  --method mtp --num-speculative-tokens 1 \
  --online_quant_config '{"global_quant_config":"ptpc_i8","exclude_layer":["*linear_attn*","*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'

(Drop --method mtp ... for 35B if you don't want MTP; for the dense 27B MTP is a ~1.6× lossless win.)
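
A sketch of how the exclude_layer patterns in the command above might select layers. It assumes the patterns are shell-style globs matched against the full module name, which the PR does not state, and the layer names below are hypothetical examples rather than names read from the checkpoint.

```python
import json
from fnmatch import fnmatchcase

config = json.loads(
    '{"global_quant_config":"ptpc_i8","exclude_layer":'
    '["*linear_attn*","*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'
)


def is_excluded(layer_name: str, patterns) -> bool:
    """True if any glob pattern matches the whole layer name (assumed semantics)."""
    return any(fnmatchcase(layer_name, p) for p in patterns)


patterns = config["exclude_layer"]
for name in [
    "model.layers.0.linear_attn.in_proj",      # hypothetical name: stays BF16
    "model.layers.3.mlp.gate",                 # hypothetical router gate: stays BF16
    "model.layers.3.mlp.experts.0.down_proj",  # hypothetical expert: quantized
]:
    print(name, "excluded" if is_excluded(name, patterns) else "int8")
```

Under these assumed semantics, "*mlp.gate" has no trailing wildcard, so it matches the router gate but not names such as "...mlp.gate_up_proj".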

Dependency

gyohuangxin and others added 13 commits June 22, 2026 23:55
* [ATOM SGL]Add dsv4 ci

Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(minimax_m3): add MXFP4 native support

Introduce the minimal MiniMax-M3 MXFP4 native ATOM path without BF16, MXFP8, EAGLE, or unified-attention support.

* fix(minimax_m3): align FP4 Triton paths with BF16 branch

Keep the split MXFP4 PR aligned with the BF16 branch for shared Triton kernel paths while removing the extra package marker file.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(minimax_m3): add MXFP4 native support

Introduce the minimal MiniMax-M3 MXFP4 native ATOM path without BF16, MXFP8, EAGLE, or unified-attention support.

* fix(attn): use Triton for unsupported GQA decode

Route the block-size 128 GQA decode shape used by MiniMax-M3 away from generic PA ASM, which has no matching AITER kernel in the validation image.

* chore(minimax_m3): trim FP4 split cleanup

Remove the extra MiniMax-M3 module docstring note and keep Triton attention selection controlled by the existing environment flag.

* docs(minimax_m3): force Triton attention in MXFP4 recipe

Document the required ATOM_FORCE_ATTN_TRITON flag for the MXFP4 TP4 launch path.

* chore(minimax_m3): fix Black formatting and trim comments

Remove the extra blank lines flagged by Black and keep MiniMax-M3 sparse attention comments focused on ATOM's FP4 path.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(minimax_m3): simplify text config normalization

Copy MiniMax-M3 text config attributes generically so the FP4 path keeps required root config fields without maintaining a long field allowlist.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(attn): generalize sparse block-size handling

Read sparse attention block-size requirements from the HF sparse attention config instead of hard-coding the MiniMax-M3 sparse attention constant in the shared AITER metadata builder.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(attn): use generic sparse metadata naming

Keep MiniMax-M3 sparse metadata construction local to the sparse attention path while exposing it through generic attention metadata fields in the shared AITER builder.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(attn): generalize indexed sparse marker

Use a model-agnostic marker for indexed sparse attention modules so the shared AITER cache binding path no longer checks a MiniMax-M3-specific attribute name.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(attn): generalize sparse cache names

Use generic indexed sparse cache and metadata helper names in the shared AITER attention path while keeping the MiniMax-M3 sparse implementation module unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor attention code

* keep ATOM_USE_UNIFIED_ATTN path

---------

Co-authored-by: xytpai <xytpai@foxmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
* mla

* fix

* fix

---------

Co-authored-by: HaonanWang98 <hwang@amd.com>
Co-authored-by: feifei14119 <carlus.huang@amd.com>
* Move gpt-oss and kimi2.5 CI from mi355 to mi350

* Move deepseek-v4-flash and qwen3.5 vllm CI from mi355 to mi350

* Add runner user info

* Add clean up containers function for runner atom-mi35x-8gpu-oot-acc

* CI: tolerate missing Docker config on OOT runners

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
… on MoE

RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the quantization
target for this iGPU. A8W8 (int8 weight + dynamic per-token int8 activation) for
all dense GEMMs and MoE experts; BF16 for the GDN linear-attn (recurrent,
quant-sensitive), router gate, lm_head, embeddings; KV cache BF16.

- model_ops/linear.py: per_Token int8 branch -> aiter Triton gemm_a8w8 on
  non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op so it is
  HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
- model_ops/moe.py + model_ops/fused_moe_triton.py: Int8MoEMethod + int8 branch
  in FusedMoE._online_quant, and triton_kernel_int8_moe_forward using aiter
  moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13).
  Enables 35B-A3B, which has no BF16 MoE kernel on gfx1151.
- models/qwen3_5_mtp.py + model_loader/loader.py: fix the MTP-MoE drafter so the
  draft's fused expert weights load (add detect_fused_expert_format /
  get_fused_expert_mapping / load_fused_expert_weights; get_expert_mapping uses
  num_experts; loader resolves load_fused_expert_weights_fn from the model).
  Draft acceptance 0 -> 0.83; MTP now a net win on the MoE model.
- model_ops/topK.py: keep the shared expert as a separate MLP on non-gfx9 so the
  routed MoE uses the portable Triton path.
k50112113 and others added 16 commits June 24, 2026 18:24
* add m3 mxfp8 support

* add mxfp8 recipe

* wip

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>

* revert dequant fp8 back to bf16 for linear layers

* update m3 recipe

* format

* remove hard code dtype

---------

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
Co-authored-by: Haoyang Li <lihaoyang0109@gmail.com>
Co-authored-by: ganyi <ygan@amd.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: wuhuikx <hattie.wu@amd.com>
* [fix](qwen): fix qwen3.5 accuracy

* [fix](attn): delete extra code

* [fix](attn): add kv cache to mutate args

* [fix](qwen): remove quick allreduce in qwen3.5

---------

Co-authored-by: perzhang <perzhang@amd.com>
)

* Add NUMA-aware CPU/memory binding

* Add glm-5-2-fp8 benchmark dispatch checkbox
… AAC machine (#1346)

* Modify atom-sgl-accuracy workflow to adapt it for AAC machine
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
* docs: revise M3 fp8/gluon port plan for first-class framework compat

Replace the env-gated bolt-on approach with one driven by main's existing
attention-framework contracts: fp8 selected by config.kv_cache_dtype,
scales returned via KVCacheTensor, binding through build_kv_cache_tensor/
bind_kv_cache, insert via the quantized hook, metadata via make_sparse_*
factories, frozen custom-op signature, CUDAGraph-safe scratch, byte
accounting. Adds a 9-point contract checklist mapped to each task.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(attn): SparseMHAPagedAttentionImpl skeleton + Attention impl_cls override

Task 0 of the MiniMax-M3 fp8 KV cache + gluon PA port. Adds the subclass
scaffold (SparseMHAPagedAttentionImpl extends PagedAttentionImpl, overriding
only rope_cache + dispatch_backend via delegation for now) and an optional
impl_cls kwarg on Attention.__init__ so a model can plug in a specialized impl
while reusing the backend's metadata builder. Indexer state lives on the impl.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(minimax_m3): page-16 constants + fused SHUFFLE KV-insert kernel

Task 1 of the M3 fp8/gluon port. Adds ASM_PAGE_SIZE=16 / PAGES_PER_SPARSE_BLOCK=8
and grafts the Triton fused Gemma-RMSNorm + partial-NeoX-RoPE + page-16 SHUFFLE
KV-insert kernel (+ host wrapper) from origin/ganyi/shuffle_kv_cache_fp8_eagle.
GPU round-trip test validates q_out/index_q_out vs PyTorch ref and K/V/index
cache scatter at each token slot.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(minimax_m3): page-16 sparse block-table builders + fused topk EMIT_SPARSE_BT

Task 2 of the M3 fp8/gluon port. Grafts the decode + prefill page-16 sparse
block-table builders into sparse_attn.py (each selected logical 128-block expands
to 8 contiguous physical 16-pages, partial tail packed last, exact context_lens),
and replaces index_topk.py wholesale with the source-branch version that adds the
fused EMIT_SPARSE_BT block-table emission and MAX_Q spec-decode causal support
(both opt-in via defaulted kwargs, so existing decode callers are unaffected).

Tests: x8 expansion + tail-last packing + ctx lengths for the standalone builder;
fused EMIT path matches the standalone builder bit-for-bit (num_kv_heads==1).

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(minimax_m3): gluon PA decode + prefill runners over page-16 SHUFFLE cache

Task 3 of the M3 fp8/gluon port. Grafts minimax_m3_sparse_attn_decode_asm,
minimax_m3_sparse_attn_prefill_asm, and the shared _run_prefill_fp8_gluon helper
from the source branch: index top-k -> page-16 sparse block table -> AITER gluon
split-KV paged-attention (run_pa_decode_gluon), with fp8 vs bf16 compute_type and
per-token scales selected by the KV cache dtype. Adds `import aiter` (used for
aiter.dtypes.fp8). Parity test (gluon vs Triton split-K decode reference) for
gqa 8/16; validated further by the existing asm/fp8/prefill oracle tests.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(attn): implement SparseMHAPagedAttentionImpl.rope_cache override

Task 4 of the M3 fp8/gluon port. The override runs MiniMax-M3's fused
qk-norm + partial-NeoX-RoPE + page-16 SHUFFLE KV insert + indexer-key insert via
aiter.fused_qknorm_idxrqknorm (consuming the packed qkv), reading the SHUFFLE
K/V + scale + index caches off the bound layer. It returns the parent's 7-tuple
(query rotated) and stashes the rotated indexer query on self._index_q for
dispatch_backend. fp8 vs bf16 selected by kv_cache_dtype; fp8 writes per-token
dequant scales into k_scale/v_scale. Adds the _minimax_m3_cos_sin_cache helper.

Test (bf16 + fp8): override returns the 7-tuple, populates _index_q with correct
shape, and mutates the KV/index caches (+ fp8 scales).

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(attn): implement SparseMHAPagedAttentionImpl.dispatch_backend override

Task 5 of the M3 fp8/gluon port. dispatch_backend returns the M3 sparse
prefill/decode backend callable (parent contract
fn(q,k,v,k_cache,v_cache,k_scale,v_scale,fwd_ctx)). Both paths select per-token
top-k index blocks with the fused page-16 sparse block-table emit, then run the
gluon split-KV paged-attention over the SHUFFLE cache; fp8 vs bf16 follows the
cache dtype inside the runners. Prefill uses the sync-free on-device metadata
fallback (query_req_id/abs_pos/qo_indptr=None). Consumes self._index_q from
rope_cache and clears it afterward.

Note: index_cache is page-128 3D [num_logical, 128, idx_head_dim], indexed by the
logical block_table in index-topk (distinct from the page-16 SHUFFLE KV cache).
Test (bf16+fp8): dispatch returns the decode callable; running it yields finite
[tokens, nh, hd] output and clears _index_q.

Co-Authored-By: Claude <noreply@anthropic.com>

* first version of refactor

Signed-off-by: ganyi <ygan@amd.com>

* remove unnecessary files

Signed-off-by: ganyi <ygan@amd.com>

* runnable and can respond with reasonable output

Signed-off-by: ganyi <ygan@amd.com>

* acc right

Signed-off-by: ganyi <ygan@amd.com>

* reuse mha's allocation for main cache,  view at use time

Signed-off-by: ganyi <ygan@amd.com>

* remove prepare mtp metadata

Signed-off-by: ganyi <ygan@amd.com>

* format

Signed-off-by: ganyi <ygan@amd.com>

* format

Signed-off-by: ganyi <ygan@amd.com>

* resolve comments

Signed-off-by: ganyi <ygan@amd.com>

---------

Signed-off-by: ganyi <ygan@amd.com>
Co-authored-by: Claude <noreply@anthropic.com>
…k_size' (#1348)

Co-authored-by: junxiaguo <JunXia.Guo@amd.com>
…bort (#1322) (#1339)

During CUDAGraph capture, MiniMax-M3's autotuned _topk_index_partial_kernel
discards candidate CompiledKernels. A gen-0 GC firing inside the stream-capture
region runs CompiledKernel.__del__ -> hipModuleUnload, which HIP forbids while a
stream is capturing (HIP 900), corrupting the capture and aborting the
custom_all_reduce IPC handshake (SIGABRT). gc.freeze() did not help because the
discarded kernels are created mid-loop. Disable GC for the whole capture window
and restore via try/finally.
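
The pattern described above (disable GC for the capture window, restore via try/finally) can be sketched as a context manager. This is a generic illustration with a hypothetical name, not the code in the commit.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_disabled():
    """Disable the cyclic GC for the block and restore the prior state after it."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore even if the wrapped block raises.
        if was_enabled:
            gc.enable()
```

Usage would look like `with gc_disabled(): capture_graph()`, where capture_graph stands in for the graph-capture call. Remembering the previous state avoids re-enabling GC for a caller that had already turned it off.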
* feat: RTPLLM plugin GLM5 integration

* feat: RTPLLM GLM5 enable cuda graph

* fix: RTP glm5 qwen35 cuda graph conflict

* fix: RTP crash when long input_len > 16384

* fix:[RTP] making GLM5 run true Sparse MLA

* refactor: RTP glm5 code

* feat: RTP glm5 optimize sparse decode path

* refactor: RTP remove redundant envs

* refactor: [RTP] unify GLM5 MLA on sparse path, drop dead dense backend

* fix: RTP GLM5 prefill reuse Sparse MLA metadata

* fix: RTP GLM5 enable FP8 MLA path

* feat: RTP GLM5 conflict issue after rebase

* fix: RTP plugin imports conflict after rebase main

* refactor: RTP GLM5 tests merge

* refactor: cleanup GLM5 RTP sparse MLA backend

* refactor: RTP remove redundant labels

* refactor: RTP GLM5 remove redundant code

* refactor: RTP GLM5 remove mla redundant code

* fix: RTP Qwen35 use prewarmed req id buffer for RTP CUDA graphs

* fix: RTP remove redundant qwen35 code
* feat(minimax-m3): split index cache projection

Route MiniMax-M3 index Q/K through a separate projection and thread it through the attention stack so cached top-k layers can skip indexer work while preserving the non-cache path.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(minimax-m3): keep indexer qk packed

Keep MiniMax-M3 index Q/K in the packed QKV projection so index-cache support only skips top-k work and does not require a separate aiter input ABI.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(minimax-m3): drop leftover formatting noise

Remove residual formatting-only changes from the packed index-cache refactor so the branch only carries functional sparse-attention updates.

Co-authored-by: Cursor <cursoragent@cursor.com>

* code format

* chore(minimax-m3): remove index cache debug logging

Drop temporary hit/miss logging and counters from the MiniMax-M3 top-k cache path now that the packed index-cache flow is settled.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
On gfx1250 with ATOM_USE_UNIFIED_ATTN, a prefix-cache hit during prefill
fell back to the Triton unified_attention path instead of the sink ASM
varlen kernel, because _can_attempt_prefill_sink_asm bailed on has_cached
and on max_seqlen_q != max_seqlen_k.

The gfx1250 sink varlen ASM kernel (fmha_fwd_with_sink_varlen_asm) actually
handles bottom-right causal for sq < sk (chunked-prefill), and cu_seqlens_q/
cu_seqlens_k already carry the per-request new-token vs cached+new lengths.
Verified on gfx1250 against a bottom-right causal + per-head sink reference
(single/multi-batch, GQA, sq=1) within bf16 tolerance, and end-to-end on
gpt-oss-120b (full-attention layers take the ASM path on a cache hit; the
forced-Triton path never gathers).

Changes:
- _can_attempt_prefill_sink_asm: drop the has_cached and
  max_seqlen_q == max_seqlen_k gates.
- prefill_attention: gather the cached+new KV into a dense packed tensor here,
  where the ASM varlen kernel consumes it. Each prefill backend now prepares
  its own KV: the ASM path gathers; the Triton path reads the paged cache
  directly via block_table and never gathers.
- rope_cache: no longer gathers, so dispatch_backend sees q/k with matching
  token counts (sq == sk) and _can_use_prefill_sink_asm's shape check stays
  valid.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* [m3 eagle] migrate draft-side EAGLE3 optimizations (Phase 1)

Bring the model-agnostic / draft-side MiniMax-M3 EAGLE3 work from
wuhuikx/atom-m3-bf16-to-main (2f1c385). These files' pre-eagle base is
byte-identical to current main, so they port as-is:

- eagle3_llama.py / eagle3_deepseek_mla.py: draft fusions (fused dual-RMSNorm
  +concat, fused group-RMSNorm aux, AR+RMSNorm fusion), compute_draft_token,
  replicated-embed option.
- fused_aux_rmsnorm.py (new): the fused RMSNorm kernels for the draft.
- lm_head_argmax.py (new) + embed_head.py: distributed greedy argmax (all-gather
  [N,2] per-rank maxima instead of full [N,vocab] logits).
- spec_decode/eagle.py: draft loop with distributed-argmax fast path, no-pre-concat
  aux, and Eagle3 MHA draft KV-cache transfer for PD disaggregation (from #1331).
- envs.py: ATOM_EAGLE_REPLICATE_EMBED.
- tests/test_lm_head_argmax.py (new, importorskip(aiter) for the no-aiter CI).
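
The distributed greedy argmax idea (exchange one value/index pair per rank instead of the full logits) can be sketched for a single token in plain Python. The shards list simulates what each tensor-parallel rank holds; the function names and the tie-breaking rule are assumptions, not the lm_head_argmax.py implementation.

```python
from typing import List, Tuple


def local_max(logits_shard: List[float], vocab_offset: int) -> Tuple[float, int]:
    """Per-rank step: best logit in the shard and its global vocab index."""
    idx = max(range(len(logits_shard)), key=logits_shard.__getitem__)
    return logits_shard[idx], vocab_offset + idx


def distributed_argmax(shards: List[List[float]]) -> int:
    """Combine per-rank (value, index) pairs, as an all-gather would deliver them."""
    gathered, offset = [], 0
    for shard in shards:
        gathered.append(local_max(shard, offset))
        offset += len(shard)
    # Highest value wins; ties go to the lowest global index (assumed rule).
    return max(gathered, key=lambda p: (p[0], -p[1]))[1]
```

Each rank contributes two numbers per token, so the gathered payload is [N, 2] per rank rather than [N, vocab].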

Target-side enablement (aux-hidden capture in minimax_m3, q>1 spec-verify
metadata, prepare_mtp_decode) follows in Phase 2; note eagle.py now references
attn_metadata_builder.prepare_mtp_decode which Phase 2 adds.

Mocked suite: 437 passed / 38 pre-existing failures / +1 new skip — no regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [m3 eagle] target-side enablement on main's M3 API (Phase 2)

Enable MiniMax-M3 EAGLE3 on current main's (Triton-sparse) M3 base, adapting the
target side to main's API instead of wuhuikx's asm/gluon infra (absent on main).

aiter_attention.py:
- Add the generic block-paged MHA Eagle3 draft metadata: _mtp_prepare_decode_
  metadata_kernel + prepare_mtp_decode + fuse_mtp_decode_position_update (used by
  the migrated eagle.py for both Kimi and M3 drafts; not M3-sparse coupled).
- Replace the two "speculative decode not supported" NotImplementedError sites:
  route q>1 spec-verify through the sparse PREFILL path (make_sparse_prefill_
  metadata; per-query causal via cu_seqlens_q, which is now filled uniformly for
  q>1). prefix_lens is bound to a new persistent sparse_prefix_lens buffer so the
  CUDAGraph-captured sparse indexer reads live causal lengths on each replay.

minimax_m3.py: Eagle3 aux-hidden-state capture (Dynamo-safe, mirrors deepseek_v2):
aux_hidden_state_layers, in-layer residual.clone() after the fused-allreduce norm,
model forward returns (hidden, aux) tuple, set/get_eagle3_aux_hidden_state_layers
on the ForCausalLM + VL-wrapper delegation.

model_runner.py: extend KV transfer regions with the Eagle3 draft pool for PD
disaggregation (#1331).

scheduler.py: trim emitted spec tokens past the stop position (rejection sampler
emits past EOS) so flexible-extract doesn't pick up leaked trailing tokens.

recipes/MiniMax-M3.md: full EAGLE3 section (with a note that the ASM-PA/fp8/MXFP8
specifics reflect the fully-optimized variant, not this Triton-sparse base).

Drop tests/test_lm_head_argmax.py (per request).

Note: the q>1 sparse-verify path is new on main and CUDAGraph-sensitive — needs
GPU validation (GSM8K + accept on TP4/TP8; confirm Kimi eagle unaffected).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* update recipe
make lint happy

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* remove fp8 attn related command

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* [m3 eagle] recipe: set ATOM_FORCE_ATTN_TRITON=1 in EAGLE launch

main's MiniMaxM3Attention (dense layers) does not set force_triton_attn in code
and attention_mha has no block-128 guard, so on this base the dense attention is
routed to Triton only via ATOM_FORCE_ATTN_TRITON=1 (the MXFP4 base section already
sets it). The EAGLE section migrated from wuhuikx omitted it (wuhuikx set
force_triton_attn=True in code instead), so the spec-verify dense attention
(q=num_spec+1) fell into paged_attention_asm and aborted in get_heuristic_kernel
(no bf16 block-128 ASM-PA kernel). Add the env to the EAGLE launch and drop the
stale MXFP8 model_path line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* remove ATOM_FORCE_ATTN_TRITON

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* update recipe

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* update recipe

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* update the recipe with the perf

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* refine the comment

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* [atom CI/Nightly/Benchmark] Add MiniMax-M3 and Eagle into atom infra

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* remove minimax m2.7 case

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
@zufayu zufayu requested review from ZhangLirong-amd and removed request for ZhangLirong-amd June 26, 2026 06:12
…ing the target top-k (#1362)

`_should_skip_index_topk` force-skips the DSA indexer top-k for the MTP
layer (layer_id >= num_hidden_layers) whenever `index_share_for_mtp_iteration`
is set, making the MTP block reuse the *target* model's top-k. But the MTP
block ships its OWN indexer weights (indexer.wk / wq_b / weights_proj /
k_norm at layer num_hidden_layers in the checkpoint) and is meant to compute
its own top-k for the drafted position. Reusing the target's top-k feeds the
draft a wrong attention context at all sequence lengths.

This is non-standard: neither vLLM upstream (deepseek_mtp.py allocates a
dedicated topk_indices_buffer + Indexer for the MTP block) nor the ATOM
sglang plugin reuses the target index; both compute the MTP top-k
independently. `index_share_for_mtp_iteration` should at most share across
multiple MTP draft steps (num_speculative_tokens > 1), never reuse the
target model's index.

Fix: drop the MTP special-case so the MTP layer computes its own top-k with
its own (loaded) indexer weights, matching vLLM upstream and sglang.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@zufayu zufayu removed the request for review from ZhangLirong-amd June 26, 2026 08:24
carlushuang and others added 2 commits June 26, 2026 08:27
* wip

* disable debug code for now

* fix(mla): prevent NaN in chunked cached-prefix attention LSE merge

In _forward_prefill_cached_chunked, a seq with no cached tokens in the
first chunk gets lse=-inf from flash_attn. Seeding the running accumulator
with that -inf and later merging against another -inf suffix computes
-inf-(-inf)=NaN in merge_attn_states, permanently poisoning that seq's
output. Only triggers with multiple seqs chunked together at high
concurrency (total_kv > attn_prefill_chunk_size). Sanitize the seed lse
with a large finite sentinel so an absent seq carries ~zero weight without
producing NaN.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* fix(v4): add per-token causal cap to HCA prefill visibility

HCA prefill used the per-seq committed count (ctx_end//128) for every
token, missing the (pos+1)//128 per-token causal cap that CSA already has
(and that the reference get_compress_topk_idxs applies). Under chunked
prefill ctx_end is the chunk's end, so the same logical token saw a
different number of HCA compressed groups depending on which chunk
computed it -> chunked != single-shot -> ~0.02 GSM8K drop.

Cap HCA per-token visibility to min((pos+1)//128, n_committed_hca) in the
indptr build, the prefill-indices kernel (new HCA_RATIO constexpr), and
the reference impl. Decode is unaffected (decode token is at seq end, the
cap is a no-op).
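
A minimal sketch of the per-token cap, using the formula from the message above. The function and parameter names are illustrative (ratio mirrors the HCA_RATIO constexpr); the real change lives in the indptr build and the prefill-indices kernel.

```python
def hca_visible_groups(pos: int, n_committed_hca: int, ratio: int = 128) -> int:
    """Compressed groups visible to the token at absolute position pos."""
    return min((pos + 1) // ratio, n_committed_hca)


# The result depends only on the token's own position, not on where the chunk
# computing it happens to end, so chunked and single-shot prefill agree.
for pos in (126, 127, 300):
    print(pos, hca_visible_groups(pos, n_committed_hca=5))  # 0, 1, 2
```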

Verified GSM8K (V4-Pro, num_concurrent=4, fp8): chunked 0.93 -> 0.9507,
single-shot 0.9515 (no regression).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* remove debug code

* fix(mla): handle both-empty merge in the kernel, drop call-site workaround

The earlier MLA NaN fix sanitized the seed lse at the call site
(nan_to_num on suf_lse). Now that merge_attn_states lives in ATOM
(triton_merge_attn_states.py, sole caller is the MLA chunked path —
plugin/vllm imports vllm's own copy), fix it at the root instead.

When a token's prefix AND suffix are both empty (max_lse == -inf), the
kernel computed -inf-(-inf)=NaN and a 0/0 scale that poisoned the output.
This is reachable in ATOM's global-axis chunked prefill: a short seq can
fall entirely outside a chunk. Guard both_empty: force a finite 0/0-split
so out=0 (correct for empty attention) and keep lse=-inf. The call-site
nan_to_num is now redundant and reverts to chunked_lse = suf_lse.

Verified GSM8K (R1-MXFP4, tp4, fp8, num_concurrent=64, long-prefill 512):
0.9431 — same as the call-site workaround, no regression.
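
The both-empty guard can be illustrated with a scalar sketch of a log-sum-exp merge. This is a plain-Python rendering of the idea, not the Triton kernel; the signature is an assumption.

```python
import math
from typing import List, Tuple

NEG_INF = float("-inf")


def merge_attn_states(
    out_a: List[float], lse_a: float, out_b: List[float], lse_b: float
) -> Tuple[List[float], float]:
    """Merge two partial attention outputs weighted by their log-sum-exp."""
    max_lse = max(lse_a, lse_b)
    if max_lse == NEG_INF:
        # Both partitions are empty: skip the -inf - (-inf) subtraction,
        # return a zero output and keep lse at -inf.
        return [0.0] * len(out_a), NEG_INF
    w_a = math.exp(lse_a - max_lse)  # exp(-inf) is 0.0 for an empty side
    w_b = math.exp(lse_b - max_lse)
    total = w_a + w_b
    merged = [(w_a * a + w_b * b) / total for a, b in zip(out_a, out_b)]
    return merged, max_lse + math.log(total)
```

Without the guard, `NEG_INF - NEG_INF` evaluates to NaN and the NaN propagates into every later merge for that sequence.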

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* rm log

* commit

* fix format

* fix(v4): correct SWA on prefix-cache hit via tail re-forward

V4's sliding-window state is a per-request ring (not shared across blocks),
so a prefix-cache hit left the new request's SWA ring empty; non-block-aligned
prompts then read garbage where a tail token's window reached back into the
cached region. Roll the forward start back by ceil(win_with_spec/block_size)
blocks so those tokens are re-forwarded, repopulating the ring. Compressed-KV
sharing is unaffected (context_lens = cached + scheduled is invariant).
Verified token-identical vs no-cache baseline (non-MTP, MTP1).
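
The rollback arithmetic can be sketched as follows. It assumes the cached prefix is counted in whole blocks and that the restart point is clamped at zero; the function name and these details are illustrative, not taken from the patch.

```python
import math


def rolled_back_start(num_cached_tokens: int, win_with_spec: int, block_size: int) -> int:
    """Token index at which the forward pass restarts after a prefix-cache hit."""
    rollback_blocks = math.ceil(win_with_spec / block_size)
    cached_blocks = num_cached_tokens // block_size
    # Re-forward the last rollback_blocks cached blocks to refill the SWA ring.
    return max(cached_blocks - rollback_blocks, 0) * block_size
```

For example, with 1024 cached tokens, a 128-token window and 64-token blocks, the forward pass would restart at token 896 instead of 1024.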

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* fix(v4): add per-token HCA causal cap to plugin prefill indptr

The native path caps HCA prefill visibility per token
(n_hca = min((pos+1)//128, committed)) to match write_v4_paged_prefill_indices,
but the vLLM and SGLang bridges built the HCA indptr from the uncapped
per-seq committed count. The kernel then writes only the capped number of
entries while the indptr reserves the full committed count, leaving
uninitialized torch.empty garbage in the HCA tail of every token whose
context exceeds 128 -> wrong reads / OOB. CSA already had the cap; HCA was
missed in the three bridge prefill builders. Mirror CSA's cap.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

* fix(scheduler): repair test mocks and partial-prefill finish detection

Two test-blocking regressions:

1. Scheduler.__init__ reads config.hf_config (V4 SWA-warmup detection) but the
   test MockConfig had no hf_config, so every Scheduler-constructing test
   errored at construction. Guard the access with getattr and give MockConfig a
   non-V4 hf_config stub.

2. postprocess flagged a prefill as partial via `num_cached_tokens < num_tokens`.
   Once a completion/EOS token is appended, num_tokens exceeds the prompt length,
   so a finished prefill stayed flagged partial and the EOS/finish loop skipped
   it, leaking the finished seq into the next batch. Compare against
   num_prompt_tokens instead.
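
A reduced sketch of the second fix. The Seq dataclass is a hypothetical stand-in with only the three counters named above, and the comments describe their assumed meaning.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    num_prompt_tokens: int
    num_tokens: int         # prompt tokens plus any appended completion tokens
    num_cached_tokens: int  # tokens processed so far (assumed meaning)


def is_partial_prefill_old(seq: Seq) -> bool:
    """Previous check: breaks once a completion token has been appended."""
    return seq.num_cached_tokens < seq.num_tokens


def is_partial_prefill(seq: Seq) -> bool:
    """Fixed check: partial only while part of the prompt is unprocessed."""
    return seq.num_cached_tokens < seq.num_prompt_tokens


# A finished 10-token prefill with one completion token already appended.
finished = Seq(num_prompt_tokens=10, num_tokens=11, num_cached_tokens=10)
```

With the old comparison, `finished` is still reported as partial, which is how a finished sequence could leak into the next batch.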

tests/test_scheduler.py: 15 failed / 18 errored -> 38 passed.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
@zufayu zufayu requested a review from yhl-amd June 26, 2026 13:58
valarLip and others added 10 commits June 26, 2026 22:32
…dge GitHub's 256 limit (#1374)

* ci(benchmark): shard matrix by (variant×scenario) + concurrency via reusable workflow

The flat per-cell benchmark matrix (278 cells) exceeded GitHub's hard
256-jobs-per-matrix limit, so the benchmark job was never scheduled and
summarize failed with empty results ('[]').

Replace it with a two-level fan-out (mirrors InferenceX run-sweep):
- catalog.build_cell_configs groups cells by (variant × scenario), folding
  concurrency into a JSON list; build_benchmark_matrix emits configs_json.
- New reusable workflow benchmark-tmpl.yml runs one config across its
  concurrency matrix (per-cell body unchanged).
- The benchmark caller job matrixes over configs and calls the template.

Both matrices stay well under 256 (42 configs, <=7 conc each) while every
(config x conc) cell still runs as its own parallel job. Artifact naming,
summarize, and dashboard upload are unchanged. Also refresh the 4 stale
catalog tests (21 variants, glm-5-2-fp8, -dpa-tbo 512 band).
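
The grouping step can be sketched roughly as follows. This is an illustration only: the cell field names (`variant`, `scenario`, `conc`) and the output shape are assumptions, and the real `catalog.build_cell_configs` may differ.

```python
import json
from collections import defaultdict


def build_cell_configs(cells):
    """Group flat cells by (variant, scenario); fold concurrency into a JSON list."""
    grouped = defaultdict(list)
    for cell in cells:
        grouped[(cell["variant"], cell["scenario"])].append(cell["conc"])
    return [
        {"variant": v, "scenario": s, "conc": json.dumps(sorted(set(concs)))}
        for (v, s), concs in grouped.items()
    ]


# Five flat cells collapse into two configs (outer matrix); each config's
# concurrency list becomes the inner matrix of the reusable workflow.
cells = [
    {"variant": "gpt-oss-120b", "scenario": "1k1k", "conc": c} for c in (1, 8, 64)
] + [
    {"variant": "gpt-oss-120b", "scenario": "8k1k", "conc": c} for c in (1, 8)
]
configs = build_cell_configs(cells)
configs_json = json.dumps(configs)
```

Each level is bounded separately, so neither matrix approaches the 256-job limit even though the total number of cells is unchanged.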

* ci(benchmark): shorten template job name to c=<conc>

The reusable template job is nested under the caller job, which already
shows model + scenario; drop the redundant display/isl/osl so the UI reads
'gpt-oss-120b 1k1k / c=8' instead of repeating the model name.

* docs(benchmark): sync matrix-builder docstrings to configs_json + drop redundant build_cells

- build_benchmark_matrix.py: module docstring said it emits cells_json, but
  it now emits configs_json (variant×scenario configs); fix the description.
- main() called build_cells directly and again inside build_cell_configs;
  drop the redundant call and derive the cell/model counts from configs.
- catalog.py: module docstring still called a cell the single matrix
  dimension; configs are now that dimension. Document build_cell_configs.
* fix online quant

* update comment

* format

---------

Co-authored-by: ganyi <ygan@amd.com>
…ig from JSON file (#1190)

* [atom-vllm nightly acc] remove config in workflow file and fetch config from JSON file

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* remove term name

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
* Modify Qwen3.5-35B-A3B-FP8 runner

* Replace jq with python3

* [fix](ci): mv plugin test to new node

* Modify jq with python3 for vllm-test and add runner name in actionlint

* [fix](ci): mv qwen3.5 test to mi355 node

* Add model cache mount path for vllm-test

* Add model cache mount for sglang-test

* Adapt model cache mount path for new runner

* Use host network

* Remove deepseek-r1-fp8-tp4 from sglang-test

* Align Kimi K2.5 PR CI with nightly settings

Co-authored-by: Cursor <cursoragent@cursor.com>

* Restore DeepSeek R1 FP8 TP4 SGLang CI

Co-authored-by: Cursor <cursoragent@cursor.com>

* Lower Kimi K2.5 PR accuracy threshold

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: perzhang <perzhang@amd.com>
Co-authored-by: wuhuikx <hattie.wu@amd.com>
Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…embed param naming (#1378)

* perf(server): cut event-loop work in streaming hot path

- Reuse engine-computed num_prompt_tokens in the stream response
  generators instead of re-encoding the prompt on the event loop at
  stream start (drops a redundant per-request tokenize).
- Run multimodal input prep (image download + HF processor) in a worker
  thread instead of synchronously on the event loop.
- Batch-decode a whole step's buffered stream chunks with one
  tokenizer.batch_decode in flush_stream_batch instead of one decode per
  seq on the output thread (one GIL-released call instead of N).
- Coalesce each request's finalization SSE messages (content/finish +
  usage + [DONE]) into a single send to cut socket-write syscalls when
  many requests finish simultaneously.
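
As an illustration of the worker-thread bullet above, the sketch below offloads a blocking prep function with `asyncio.to_thread` while a heartbeat task keeps running on the loop. `prepare_multimodal_inputs`, `heartbeat`, and `handle_request` are hypothetical names, and the `time.sleep` only stands in for image download and HF processing; the actual server code may use a different mechanism.

```python
import asyncio
import time


def prepare_multimodal_inputs(urls):
    # Hypothetical blocking prep (stands in for download + HF processor).
    time.sleep(0.05)
    return [f"processed:{u}" for u in urls]


async def heartbeat(ticks):
    # Records each time the event loop gets to run while prep is in flight.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.005)


async def handle_request(urls):
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    # The blocking call runs in a worker thread; the loop stays responsive.
    result = await asyncio.to_thread(prepare_multimodal_inputs, urls)
    hb.cancel()
    return result, len(ticks)


result, ticks = asyncio.run(handle_request(["a.png", "b.png"]))
```

Calling `prepare_multimodal_inputs` directly inside the coroutine would instead stall every other stream for the duration of the prep.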

* perf(server): enable uvloop event loop; fix gpt-oss embed param naming

uvloop:
- Run uvicorn on uvloop (libuv) instead of the stdlib asyncio selector
  loop, with graceful fallback to the default loop if uvloop is absent.
  Under high streaming concurrency this cuts the event-loop cost of SSE
  socket I/O (sock.send / selector register-unregister): steady-state
  TPOT P99 8.50ms -> 8.18ms and frontend loop-scheduling delay roughly
  halved. Adds uvloop to dependencies.
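
One way to express the graceful fallback, assuming the choice is handed to uvicorn through its `loop` setting. `pick_uvicorn_loop` is a hypothetical helper, and the commit does not say whether ATOM selects the loop this way.

```python
import importlib.util


def pick_uvicorn_loop() -> str:
    """Return "uvloop" when the package is importable, else "asyncio"."""
    if importlib.util.find_spec("uvloop") is not None:
        return "uvloop"
    return "asyncio"


loop_choice = pick_uvicorn_loop()
# Assumed usage at the server entry point (app is the ASGI application):
#   uvicorn.run(app, host="0.0.0.0", port=8000, loop=loop_choice)
```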

gpt-oss:
- Register `embed_tokens` first (with `embedding` as the shared-storage
  alias) so it stays the primary, non-deduped name in named_parameters().
  The checkpoint stores `model.embed_tokens.weight`; with `embedding` as
  the primary name the load-completeness check falsely flagged
  `model.embedding.weight` as unloaded even though the weight is loaded
  via the alias. Byte-identical weights (GSM8K 0.8832, unchanged); the
  spurious "parameters were NOT loaded" warning is gone.
* feat(openai): support Qwen3 (qwen3_coder/qwen3_xml) tool-call format

ATOM's OpenAI/Anthropic servers previously only parsed the Kimi-K2 tool-call
token format (<|tool_calls_section_begin|>...), so Qwen3.5/Qwen3.6 tool calls --
emitted as qwen3_coder XML (<tool_call><function=NAME><parameter=...>) -- were
returned as plain text and never surfaced as structured tool_calls. Agent
frontends (qwen-code, OpenCode, etc.) therefore could not drive tools.

Add Qwen3 XML parsing alongside the Kimi format, auto-detected:

- tool_parser.py: parse <tool_call>/<function=>/<parameter=> into OpenAI
  tool_calls, with JSON-Schema type coercion of parameter values from the
  request's tools (the XML is typeless). Non-streaming + streaming (stream
  content, then buffer+parse the tool-call block -- robust against the
  partial-XML streaming edge cases seen in vLLM/SGLang). Kimi path unchanged.
- protocol.py: deserialize tool_calls[].function.arguments (a JSON string in
  OpenAI requests) to a mapping in to_template_dict, so multi-turn chat
  templates that iterate arguments.items() (Qwen, Hermes) render tool history
  instead of raising "Can only get item pairs from a mapping".
- serving_chat.py / api_server.py: thread the request's tools into the parsers
  for type coercion (default None preserves existing behavior).
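
A minimal non-streaming sketch of the parsing and type coercion described above. It is not the code in tool_parser.py: the regexes, `coerce`, and `parse_tool_calls` are illustrative, and the exact whitespace handling of the real parser is an assumption.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def coerce(value, schema_type):
    """Coerce a raw XML string using the JSON-Schema type from the request's tools."""
    if schema_type == "integer":
        return int(value)
    if schema_type == "number":
        return float(value)
    if schema_type == "boolean":
        return value.strip().lower() == "true"
    if schema_type in ("object", "array"):
        return json.loads(value)
    return value  # "string" or unknown type: keep the text as-is


def parse_tool_calls(text, tools=None):
    # Map function name -> parameter schemas; empty when no tools are given.
    schemas = {
        t["function"]["name"]: t["function"].get("parameters", {}).get("properties", {})
        for t in (tools or [])
    }
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        match = FUNCTION_RE.search(block)
        if match is None:
            continue
        name, body = match.group(1), match.group(2)
        props = schemas.get(name, {})
        args = {}
        for key, raw in PARAM_RE.findall(body):
            raw = raw.strip("\n")
            try:
                args[key] = coerce(raw, props.get(key, {}).get("type"))
            except ValueError:  # also covers json.JSONDecodeError
                args[key] = raw
        calls.append(
            {"type": "function",
             "function": {"name": name, "arguments": json.dumps(args)}}
        )
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, calls


tools = [{"type": "function", "function": {"name": "add", "parameters": {
    "type": "object",
    "properties": {"a": {"type": "integer"}, "note": {"type": "string"}}}}}]
text = ("Calling.\n<tool_call>\n<function=add>\n<parameter=a>\n3\n</parameter>\n"
        "<parameter=note>\nhi\n</parameter>\n</function>\n</tool_call>")
content, calls = parse_tool_calls(text, tools)
untyped_content, untyped_calls = parse_tool_calls(text)
```

Without `tools`, every parameter stays a string, which matches the "default None preserves existing behavior" note; with `tools`, `a` is emitted as an integer.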

Verified: Qwen3.6-27B BF16 served by ATOM drives qwen-code end-to-end on
gfx1151 -- write_file + run-shell tool calls execute and the agent reports the
program output.

* fix(openai): don't pass tools to the /v1/completions stream path

The previous commit's threading of request.tools matched the
stream_completion_response / stream_completion_response_fanout calls in the
/v1/completions handler too. CompletionRequest has no `tools` field, so
/v1/completions raised "AttributeError: 'CompletionRequest' object has no
attribute 'tools'" (HTTP 500). Tool calling only applies to chat; drop tools
from the text-completion stream calls.

* fix(openai): make tool-call ids unique across the conversation

The parser generated ids from a per-response index (call_0, call_1, ...), so the
first tool call in every assistant turn was call_0. OpenAI tool-call ids must be
unique across the whole conversation; agentic clients (e.g. qwen-code) dedupe by
id and silently ignore every repeat -> the tool never executes and the model
retries forever (endless tool-call loop on any multi-tool task). Use a random
call_<uuid> id at both the non-streaming and streaming emit sites.
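
A sketch of the id scheme, assuming the hex form of a random UUID is used after the `call_` prefix. `make_tool_call_id` is a hypothetical helper name; the commit only states that the id is `call_<uuid>`.

```python
import uuid


def make_tool_call_id() -> str:
    # Random per-call id, so ids do not repeat across assistant turns.
    return f"call_{uuid.uuid4().hex}"


ids = [make_tool_call_id() for _ in range(1000)]
```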
…n concurrency (#1381)

The conc=1000 accuracy job intermittently failed: the server exhausted its
per-process open-file limit while accepting ~1000 concurrent connections
(plus the engine's DP-rank ZMQ and shared-memory fds), hitting EMFILE on
accept(). The default soft RLIMIT_NOFILE (~1024) is simply too low for that
connection count.

Root cause is that ATOM never raised its own fd soft limit. vLLM and SGLang
both call set_ulimit() at process startup for exactly this reason, and ATOM's
own mesh launch scripts already pass `--ulimit nofile=65536:524288` to docker
-- but plain `python -m atom.entrypoints.openai_server` launches (CI, ad-hoc)
inherit the daemon default and never bump it.

Add a set_ulimit() helper (raise soft -> min(65535, hard)) and call it at the
server entry point before the engine-core subprocesses are spawned, so the
raised limit is inherited. No-op when the soft limit is already high enough.

This is independent of the event-loop choice; it removes the fd ceiling that
turned ordinary high-concurrency load into dropped connections.
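
A minimal sketch of such a helper using the standard-library `resource` module (Unix only). The 65535 target follows the description above; the error handling and return value are assumptions, not ATOM's actual implementation.

```python
import resource


def set_ulimit(target: int = 65535) -> int:
    """Raise the soft RLIMIT_NOFILE to min(target, hard); return the soft limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # No-op when the soft limit is unlimited or already high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        # Could not raise the limit; keep running with the current one.
        return soft
    return target


# Call once at the entry point, before child processes are spawned,
# so they inherit the raised limit.
effective_soft = set_ulimit()
```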