[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5, with working MTP by carlushuang · Pull Request #1337 · ROCm/ATOM

carlushuang · 2026-06-24T06:45:59Z

[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5 (Strix Halo), with working MTP

Builds on the gfx1151 BF16 enablement (#1314) to add online INT8 W8A8 (Quark-style, no offline quant step) for the Qwen3.6 dense (27B) and MoE (35B-A3B) architectures, plus the fixes needed to make MTP speculative decoding work on the MoE draft. RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the right quantization target for this iGPU.

Precision split (chosen for quality): A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and the MoE experts; BF16 for the Gated-DeltaNet linear-attention (recurrent, quant-sensitive — int8 there produces garbage), the MoE router gate, lm_head, and embeddings; KV cache BF16. This keeps gsm8k at BF16-equivalent quality while halving weight bytes (the decode bottleneck on a bandwidth-bound iGPU).
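To make the A8W8 split above concrete, here is a minimal pure-Python sketch of symmetric dynamic per-token int8 activation quantization combined with a per-channel weight scale. It is only an illustration, not the aiter kernel: the helper names (`quantize_row_int8`, `int8_dot`) are hypothetical, and the symmetric [-127, 127] range is an assumption.

```python
def quantize_row_int8(row):
    """Symmetric int8 quantization of one row (one token, or one weight channel).

    Returns (int values in [-127, 127], float scale). The scale is computed
    from the row itself, which is what "dynamic per-token" means here.
    """
    amax = max(abs(x) for x in row)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in row]
    return q, scale


def int8_dot(q_act, act_scale, q_weight, weight_scale):
    """Integer accumulate, then apply both scales once at the end."""
    acc = sum(a * w for a, w in zip(q_act, q_weight))
    return acc * act_scale * weight_scale


activation = [0.5, -1.0, 0.25]      # one token's activations (made-up values)
weight_col = [0.2, 0.4, -0.8]       # one output channel's weights (made-up values)

q_a, s_a = quantize_row_int8(activation)
q_w, s_w = quantize_row_int8(weight_col)

exact = sum(a * w for a, w in zip(activation, weight_col))
approx = int8_dot(q_a, s_a, q_w, s_w)
```

The weight is quantized once at load time (the "online" step), while the activation scale is recomputed for every token, so only the weight bytes are halved relative to BF16.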

What this enables

35B-A3B runs at all: there is no BF16 MoE kernel on gfx1151 (the asm fused-MoE is gfx9-only; ATOM's Triton MoE is MXFP4-weight-only). The int8 path uses aiter's moe_gemm_int8_smoothquant, which is the only int8-W8A8 grouped-GEMM that runs on RDNA3.5.
35B-A3B decodes ~6× faster than the BF16 27B baseline (3B active params × int8 + MTP).

Changes

model_ops/linear.py — per_Token int8 branch routes to aiter Triton gemm_a8w8 on non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op (torch.ops.aiter.atom_gemm_a8w8_triton) so it is HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
model_ops/moe.py — new Int8MoEMethod (int8 w13/w2 + per-channel fp32 scales) and an int8 branch in FusedMoE._online_quant.
model_ops/fused_moe_triton.py — triton_kernel_int8_moe_forward: matmul-ogs routing → per-token int8 quant → moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13 columns) → per-token int8 quant → gemm2 with scatter/combine.
models/qwen3_5_mtp.py + model_loader/loader.py — MTP-MoE drafter fix: the draft's fused expert weights (experts.gate_up_proj/down_proj) were silently dropped at load → 0% draft acceptance → MTP was pure overhead. Add the draft's fused-expert mapping (detect_fused_expert_format / get_fused_expert_mapping / load_fused_expert_weights), fix get_expert_mapping to use num_experts, and let the loader resolve load_fused_expert_weights_fn from the model. After the fix: acceptance 0 → 0.83.
entrypoints/openai/tool_parser.py — unique tool-call ids (call_<uuid> instead of a per-response call_0). Non-unique ids made agentic clients (qwen-code) dedupe every tool call after the first → endless tool-call loop. Extends feat(openai): Qwen3 (qwen3_coder/qwen3_xml) tool-call support #1319.
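The tool-call id problem in the last item can be reproduced with a small sketch. This is not the ATOM parser code: `make_tool_call_id` and `dedupe_by_id` are hypothetical helpers, the 32-character hex form of the uuid is an assumption, and the client behaviour is a simplified model of "keep the first call seen per id".

```python
import uuid


def make_tool_call_id() -> str:
    # "call_<uuid>" as described in the PR; the hex formatting is an assumption.
    return f"call_{uuid.uuid4().hex}"


def dedupe_by_id(tool_calls):
    """Simplified model of a client that keeps only the first call per id."""
    seen, kept = set(), []
    for call in tool_calls:
        if call["id"] not in seen:
            seen.add(call["id"])
            kept.append(call)
    return kept


# Two responses that each restart numbering at call_0 collide on the id.
old_style = [
    {"id": "call_0", "name": "read_file"},
    {"id": "call_0", "name": "write_file"},
]
# Globally unique ids survive the same dedupe step.
new_style = [{"id": make_tool_call_id(), "name": n} for n in ("read_file", "write_file")]
```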

Quality (gsm8k, 5-shot-equivalent, chat + thinking, greedy)

35B-A3B INT8 W8A8 = 0.84 — BF16-equivalent (int8 is faithful). MTP is lossless (accepts a draft token only when it matches the target's greedy argmax), so the MTP build has identical quality.
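The losslessness argument can be sketched in a few lines. Assuming greedy decoding, the verifier keeps a draft token only if it equals the target model's argmax at that position, so the emitted sequence is always a prefix of what the target alone would have produced. The function name `verify_greedy` and the token ids are hypothetical.

```python
def verify_greedy(draft_tokens, target_argmax):
    """Greedy speculative verification.

    draft_tokens:  tokens proposed by the drafter (length k; k=1 for MTP-1).
    target_argmax: the target model's greedy choice at each of the k+1
                   positions it scored in the single verification pass.
    Returns the tokens actually emitted this step.
    """
    emitted = []
    for i, draft in enumerate(draft_tokens):
        if draft != target_argmax[i]:
            # Mismatch: fall back to the target's own token and stop.
            emitted.append(target_argmax[i])
            return emitted
        emitted.append(draft)
    # Every draft matched: the target's next prediction comes for free.
    emitted.append(target_argmax[len(draft_tokens)])
    return emitted


accepted = verify_greedy([7], [7, 9])   # draft matches: two tokens this step
rejected = verify_greedy([7], [8, 9])   # draft rejected: one token this step
```

With an acceptance rate of 0 the second case is always taken, which is why the broken drafter made MTP pure overhead.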

Performance (gfx1151 / Radeon 8060S, bs=1)

Decode (single-stream, short context):

Model	Config	Decode tok/s
27B dense	INT8 W8A8	6.0
27B dense	INT8 W8A8 + MTP-1	9.4
35B-A3B	INT8 W8A8 + HIP graph	24.8
35B-A3B	INT8 W8A8 + MTP-1 + HIP graph	~35

Long-context (35B-A3B INT8 W8A8 + MTP-1, bs=1):

Context	Prefill TTFT	Prefill tok/s	Decode (output) tok/s	Total tok/s
64K (60,016 tok)	85.4 s	703	23.4	661
128K (119,071 tok)	191.8 s	621	17.3	598

Decode tok/s falls with context (each step reads the growing KV); prefill is compute-bound (one-time prompt-ingestion cost). The hybrid model's KV is cheap (only the interleaved full-attn layers cache KV), so 128K fits easily — at gpu-memory-utilization 0.9 the KV pool holds ~2.1M tokens; the limit is --max-model-len, not memory.
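As a quick consistency check on the tables above, prefill throughput is simply prompt tokens divided by TTFT, and the MTP gain on the dense 27B is the ratio of the two decode rates. The numbers below are copied from the tables; the variable names are only illustrative.

```python
# (label, prompt tokens, prefill TTFT in seconds) from the long-context table.
long_context_rows = [
    ("64K", 60_016, 85.4),
    ("128K", 119_071, 191.8),
]

prefill_tok_per_s = {
    label: tokens / ttft_s for label, tokens, ttft_s in long_context_rows
}

# Decode tok/s for the 27B dense model without and with MTP-1.
dense_no_mtp, dense_mtp1 = 6.0, 9.4
dense_mtp_speedup = dense_mtp1 / dense_no_mtp

for label, rate in prefill_tok_per_s.items():
    print(f"{label}: {rate:.0f} prefill tok/s")
print(f"27B MTP-1 speedup: {dense_mtp_speedup:.2f}x")
```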

Serve

ATOM_USE_UNIFIED_ATTN=1 \
python -m atom.entrypoints.openai_server --model Qwen/Qwen3.6-35B-A3B \
  --trust-remote-code -tp 1 --kv_cache_dtype bf16 --block-size 64 \
  --max-model-len 131072 --max-num-seqs 2 --gpu-memory-utilization 0.9 \
  --method mtp --num-speculative-tokens 1 \
  --online_quant_config '{"global_quant_config":"ptpc_i8","exclude_layer":["*linear_attn*","*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'

(Drop --method mtp ... for 35B if you don't want MTP; for the dense 27B MTP is a ~1.6× lossless win.)
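The `exclude_layer` entries read like shell-style globs. Assuming that is how they are matched (ATOM's actual matching logic is not shown in this PR text), the following sketch shows which layers the patterns would keep in BF16. The layer names are hypothetical examples, not taken from the checkpoint.

```python
import json
from fnmatch import fnmatchcase

# The JSON string passed to --online_quant_config in the command above.
ONLINE_QUANT_CONFIG = (
    '{"global_quant_config":"ptpc_i8","exclude_layer":'
    '["*linear_attn*","*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'
)


def is_excluded(layer_name, patterns):
    """True if any glob pattern matches the full layer name (assumed semantics)."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


config = json.loads(ONLINE_QUANT_CONFIG)
patterns = config["exclude_layer"]

# Hypothetical layer names for illustration only.
layers = [
    "model.layers.0.linear_attn.in_proj",    # GDN linear attention: stays BF16
    "model.layers.0.mlp.gate",               # MoE router gate: stays BF16
    "model.layers.0.mlp.experts.gate_up_proj",
    "model.layers.3.self_attn.q_proj",
]
kept_bf16 = [name for name in layers if is_excluded(name, patterns)]
```

Under this reading, `*mlp.gate` has no trailing wildcard so that it matches the router gate but not expert projections such as `gate_up_proj`.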

Dependencies

[OPUS]: arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151) aiter#3860 (carhuang/gfx1151_opus_fp8_guard) — arch-guard the gfx9-only fp8/bf8-cvt builtins. Required (shared with [gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention #1314). The int8 GEMM/MoE kernels (gemm_a8w8 Triton, moe_gemm_int8_smoothquant, per_token_quant_hip) are already upstream in aiter; no new aiter code is needed for the int8 path.
[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention #1314 (carhuang/support_gfx1151_qwen36) — the gfx1151 BF16 base enablement (arch gate, native Triton attention, GDN block_tables). Prerequisite.
feat(openai): Qwen3 (qwen3_coder/qwen3_xml) tool-call support #1319 (carhuang/qwen3_xml_tool_parser) — qwen3_xml tool-call parsing; the unique-tool-call-id fix here extends it.

* [ATOM SGL]Add dsv4 ci Co-authored-by: Cursor <cursoragent@cursor.com>

* fix * use func

* feat(minimax_m3): add MXFP4 native support Introduce the minimal MiniMax-M3 MXFP4 native ATOM path without BF16, MXFP8, EAGLE, or unified-attention support. * fix(minimax_m3): align FP4 Triton paths with BF16 branch Keep the split MXFP4 PR aligned with the BF16 branch for shared Triton kernel paths while removing the extra package marker file. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(minimax_m3): add MXFP4 native support Introduce the minimal MiniMax-M3 MXFP4 native ATOM path without BF16, MXFP8, EAGLE, or unified-attention support. * fix(attn): use Triton for unsupported GQA decode Route the block-size 128 GQA decode shape used by MiniMax-M3 away from generic PA ASM, which has no matching AITER kernel in the validation image. * chore(minimax_m3): trim FP4 split cleanup Remove the extra MiniMax-M3 module docstring note and keep Triton attention selection controlled by the existing environment flag. * docs(minimax_m3): force Triton attention in MXFP4 recipe Document the required ATOM_FORCE_ATTN_TRITON flag for the MXFP4 TP4 launch path. * chore(minimax_m3): fix Black formatting and trim comments Remove the extra blank lines flagged by Black and keep MiniMax-M3 sparse attention comments focused on ATOM's FP4 path. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(minimax_m3): simplify text config normalization Copy MiniMax-M3 text config attributes generically so the FP4 path keeps required root config fields without maintaining a long field allowlist. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(attn): generalize sparse block-size handling Read sparse attention block-size requirements from the HF sparse attention config instead of hard-coding the MiniMax-M3 sparse attention constant in the shared AITER metadata builder. 
Co-authored-by: Cursor <cursoragent@cursor.com> * chore(attn): use generic sparse metadata naming Keep MiniMax-M3 sparse metadata construction local to the sparse attention path while exposing it through generic attention metadata fields in the shared AITER builder. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(attn): generalize indexed sparse marker Use a model-agnostic marker for indexed sparse attention modules so the shared AITER cache binding path no longer checks a MiniMax-M3-specific attribute name. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(attn): generalize sparse cache names Use generic indexed sparse cache and metadata helper names in the shared AITER attention path while keeping the MiniMax-M3 sparse implementation module unchanged. Co-authored-by: Cursor <cursoragent@cursor.com> * refact attention code * keep ATOM_USE_UNIFIED_ATTN path --------- Co-authored-by: xytpai <xytpai@foxmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>

…#1328) * Add model cache mount for MI308 sglang benchmark

* mla * fix * fix --------- Co-authored-by: HaonanWang98 <hwang@amd.com> Co-authored-by: feifei14119 <carlus.huang@amd.com>

* Move gpt-oss and kimi2.5 CI from mi355 to mi350
* Move deepseek-v4-flash and qwen3.5 vllm CI from mi355 to mi350
* Add runner user info
* Add clean up containers function for runner atom-mi35x-8gpu-oot-acc
* CI: tolerate missing Docker config on OOT runners
---------
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

… on MoE

RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the quantization target for this iGPU. A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and MoE experts; BF16 for the GDN linear-attn (recurrent, quant-sensitive), router gate, lm_head, embeddings; KV cache BF16.

- model_ops/linear.py: per_Token int8 branch -> aiter Triton gemm_a8w8 on non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op so it is HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
- model_ops/moe.py + model_ops/fused_moe_triton.py: Int8MoEMethod + int8 branch in FusedMoE._online_quant, and triton_kernel_int8_moe_forward using aiter moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13). Enables 35B-A3B, which has no BF16 MoE kernel on gfx1151.
- models/qwen3_5_mtp.py + model_loader/loader.py: fix the MTP-MoE drafter so the draft's fused expert weights load (add detect_fused_expert_format / get_fused_expert_mapping / load_fused_expert_weights; get_expert_mapping uses num_experts; loader resolves load_fused_expert_weights_fn from the model). Draft acceptance 0 -> 0.83; MTP now a net win on the MoE model.
- model_ops/topK.py: keep the shared expert as a separate MLP on non-gfx9 so the routed MoE uses the portable Triton path.

* replace einsum with bmm * fix

* add m3 mxfp8 support * add mxfp8 recipe * wip Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com> * revert dequant fp8 back to bf16 for linear layers * update m3 recipe * format * remove hard code dtype --------- Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com> Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com> Co-authored-by: Haoyang Li <lihaoyang0109@gmail.com> Co-authored-by: ganyi <ygan@amd.com> Co-authored-by: Guanbao Yu <gyu@amd.com> Co-authored-by: wuhuikx <hattie.wu@amd.com>

* [fix](qwen): fix qwen3.5 accuracy * [fix](attn): delete extra code * [fix](attn): add kv cache to mutate args * [fix](qwen): remove quick allreduce in qwen3.5 --------- Co-authored-by: perzhang <perzhang@amd.com>

) * Add NUMA-aware CPU/memory binding * Add glm-5-2-fp8 benchmark dispatch checkbox

… AAC machine (#1346) * Modify atom-sgl-accuracy workflow to adapt it for AAC machine

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>


* docs: revise M3 fp8/gluon port plan for first-class framework compat Replace the env-gated bolt-on approach with one driven by main's existing attention-framework contracts: fp8 selected by config.kv_cache_dtype, scales returned via KVCacheTensor, binding through build_kv_cache_tensor/ bind_kv_cache, insert via the quantized hook, metadata via make_sparse_* factories, frozen custom-op signature, CUDAGraph-safe scratch, byte accounting. Adds a 9-point contract checklist mapped to each task. Co-Authored-By: Claude <noreply@anthropic.com> * feat(attn): SparseMHAPagedAttentionImpl skeleton + Attention impl_cls override Task 0 of the MiniMax-M3 fp8 KV cache + gluon PA port. Adds the subclass scaffold (SparseMHAPagedAttentionImpl extends PagedAttentionImpl, overriding only rope_cache + dispatch_backend via delegation for now) and an optional impl_cls kwarg on Attention.__init__ so a model can plug in a specialized impl while reusing the backend's metadata builder. Indexer state lives on the impl. Co-Authored-By: Claude <noreply@anthropic.com> * feat(minimax_m3): page-16 constants + fused SHUFFLE KV-insert kernel Task 1 of the M3 fp8/gluon port. Adds ASM_PAGE_SIZE=16 / PAGES_PER_SPARSE_BLOCK=8 and grafts the Triton fused Gemma-RMSNorm + partial-NeoX-RoPE + page-16 SHUFFLE KV-insert kernel (+ host wrapper) from origin/ganyi/shuffle_kv_cache_fp8_eagle. GPU round-trip test validates q_out/index_q_out vs PyTorch ref and K/V/index cache scatter at each token slot. Co-Authored-By: Claude <noreply@anthropic.com> * feat(minimax_m3): page-16 sparse block-table builders + fused topk EMIT_SPARSE_BT Task 2 of the M3 fp8/gluon port. 
Grafts the decode + prefill page-16 sparse block-table builders into sparse_attn.py (each selected logical 128-block expands to 8 contiguous physical 16-pages, partial tail packed last, exact context_lens), and replaces index_topk.py wholesale with the source-branch version that adds the fused EMIT_SPARSE_BT block-table emission and MAX_Q spec-decode causal support (both opt-in via defaulted kwargs, so existing decode callers are unaffected). Tests: x8 expansion + tail-last packing + ctx lengths for the standalone builder; fused EMIT path matches the standalone builder bit-for-bit (num_kv_heads==1). Co-Authored-By: Claude <noreply@anthropic.com> * feat(minimax_m3): gluon PA decode + prefill runners over page-16 SHUFFLE cache Task 3 of the M3 fp8/gluon port. Grafts minimax_m3_sparse_attn_decode_asm, minimax_m3_sparse_attn_prefill_asm, and the shared _run_prefill_fp8_gluon helper from the source branch: index top-k -> page-16 sparse block table -> AITER gluon split-KV paged-attention (run_pa_decode_gluon), with fp8 vs bf16 compute_type and per-token scales selected by the KV cache dtype. Adds `import aiter` (used for aiter.dtypes.fp8). Parity test (gluon vs Triton split-K decode reference) for gqa 8/16; validated further by the existing asm/fp8/prefill oracle tests. Co-Authored-By: Claude <noreply@anthropic.com> * feat(attn): implement SparseMHAPagedAttentionImpl.rope_cache override Task 4 of the M3 fp8/gluon port. The override runs MiniMax-M3's fused qk-norm + partial-NeoX-RoPE + page-16 SHUFFLE KV insert + indexer-key insert via aiter.fused_qknorm_idxrqknorm (consuming the packed qkv), reading the SHUFFLE K/V + scale + index caches off the bound layer. It returns the parent's 7-tuple (query rotated) and stashes the rotated indexer query on self._index_q for dispatch_backend. fp8 vs bf16 selected by kv_cache_dtype; fp8 writes per-token dequant scales into k_scale/v_scale. Adds the _minimax_m3_cos_sin_cache helper. 
Test (bf16 + fp8): override returns the 7-tuple, populates _index_q with correct shape, and mutates the KV/index caches (+ fp8 scales). Co-Authored-By: Claude <noreply@anthropic.com>
* feat(attn): implement SparseMHAPagedAttentionImpl.dispatch_backend override Task 5 of the M3 fp8/gluon port. dispatch_backend returns the M3 sparse prefill/decode backend callable (parent contract fn(q,k,v,k_cache,v_cache,k_scale,v_scale,fwd_ctx)). Both paths select per-token top-k index blocks with the fused page-16 sparse block-table emit, then run the gluon split-KV paged-attention over the SHUFFLE cache; fp8 vs bf16 follows the cache dtype inside the runners. Prefill uses the sync-free on-device metadata fallback (query_req_id/abs_pos/qo_indptr=None). Consumes self._index_q from rope_cache and clears it afterward. Note: index_cache is page-128 3D [num_logical, 128, idx_head_dim], indexed by the logical block_table in index-topk (distinct from the page-16 SHUFFLE KV cache). Test (bf16+fp8): dispatch returns the decode callable; running it yields finite [tokens, nh, hd] output and clears _index_q. Co-Authored-By: Claude <noreply@anthropic.com>
* first version of refactor Signed-off-by: ganyi <ygan@amd.com>
* remove unnecessary files Signed-off-by: ganyi <ygan@amd.com>
* runnable and can respond with reasonable output Signed-off-by: ganyi <ygan@amd.com>
* acc right Signed-off-by: ganyi <ygan@amd.com>
* reuse mha's allocation for main cache, view at use time Signed-off-by: ganyi <ygan@amd.com>
* remove prepare mtp metadata Signed-off-by: ganyi <ygan@amd.com>
* format Signed-off-by: ganyi <ygan@amd.com>
* format Signed-off-by: ganyi <ygan@amd.com>
* resolve comments Signed-off-by: ganyi <ygan@amd.com>
---------
Signed-off-by: ganyi <ygan@amd.com> Co-authored-by: Claude <noreply@anthropic.com>

…k_size' (#1348) Co-authored-by: junxiaguo <JunXia.Guo@amd.com>

…bort (#1322) (#1339) During CUDAGraph capture, MiniMax-M3's autotuned _topk_index_partial_kernel discards candidate CompiledKernels. A gen-0 GC firing inside the stream-capture region runs CompiledKernel.__del__ -> hipModuleUnload, which HIP forbids while a stream is capturing (HIP 900), corrupting the capture and aborting the custom_all_reduce IPC handshake (SIGABRT). gc.freeze() did not help because the discarded kernels are created mid-loop. Disable GC for the whole capture window and restore via try/finally.
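The "disable GC for the whole capture window and restore via try/finally" pattern can be sketched with the standard `gc` module. This is an illustration rather than the ATOM change itself: `gc_disabled` is a hypothetical helper and the body of the `with` block stands in for the graph-capture call.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_disabled():
    """Turn the cyclic garbage collector off for the duration of the block.

    The previous state is restored in `finally`, so an exception raised
    during capture cannot leave GC permanently disabled.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


with gc_disabled():
    # Placeholder for the capture region: no automatic collection can run
    # here, so no finalizer fires while the stream is capturing.
    inside_state = gc.isenabled()
```

Unlike `gc.freeze()`, which only protects objects that already exist, disabling collection also covers objects created and discarded inside the window.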

* feat: RTPLLM plugin GLM5 integration
* feat: RTPLLM GLM5 enable cuda graph
* fix: RTP glm5 qwen35 cuda graph conflict
* fix: RTP crash when long input_len > 16384
* fix: [RTP] making GLM5 run true Sparse MLA
* refactor: RTP glm5 code
* feat: RTP glm5 optimize sparse decode path
* refactor: RTP remove redundant envs
* refactor: [RTP] unify GLM5 MLA on sparse path, drop dead dense backend
* fix: RTP GLM5 prefill reuse Sparse MLA metadata
* fix: RTP GLM5 enable FP8 MLA path
* feat: RTP GLM5 conflict issue after rebase
* fix: RTP plugin imports conflict after rebase main
* refactor: RTP GLM5 tests merge
* refactor: cleanup GLM5 RTP sparse MLA backend
* refactor: RTP remove redundant labels
* refactor: RTP GLM5 remove redundant code
* refactor: RTP GLM5 remove mla redundant code
* fix: RTP Qwen35 use prewarmed req id buffer for RTP CUDA graphs
* fix: RTP remove redundant qwen35 code

* feat(minimax-m3): split index cache projection Route MiniMax-M3 index Q/K through a separate projection and thread it through the attention stack so cached top-k layers can skip indexer work while preserving the non-cache path. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(minimax-m3): keep indexer qk packed Keep MiniMax-M3 index Q/K in the packed QKV projection so index-cache support only skips top-k work and does not require a separate aiter input ABI. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(minimax-m3): drop leftover formatting noise Remove residual formatting-only changes from the packed index-cache refactor so the branch only carries functional sparse-attention updates. Co-authored-by: Cursor <cursoragent@cursor.com> * code format * chore(minimax-m3): remove index cache debug logging Drop temporary hit/miss logging and counters from the MiniMax-M3 top-k cache path now that the packed index-cache flow is settled. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>

On gfx1250 with ATOM_USE_UNIFIED_ATTN, a prefix-cache hit during prefill fell back to the Triton unified_attention path instead of the sink ASM varlen kernel, because _can_attempt_prefill_sink_asm bailed on has_cached and on max_seqlen_q != max_seqlen_k. The gfx1250 sink varlen ASM kernel (fmha_fwd_with_sink_varlen_asm) actually handles bottom-right causal for sq < sk (chunked-prefill), and cu_seqlens_q/cu_seqlens_k already carry the per-request new-token vs cached+new lengths.

Verified on gfx1250 against a bottom-right causal + per-head sink reference (single/multi-batch, GQA, sq=1) within bf16 tolerance, and end-to-end on gpt-oss-120b (full-attention layers take the ASM path on a cache hit; the forced-Triton path never gathers).

Changes:
- _can_attempt_prefill_sink_asm: drop the has_cached and max_seqlen_q == max_seqlen_k gates.
- prefill_attention: gather the cached+new KV into a dense packed tensor here, where the ASM varlen kernel consumes it. Each prefill backend now prepares its own KV: the ASM path gathers; the Triton path reads the paged cache directly via block_table and never gathers.
- rope_cache: no longer gathers, so dispatch_backend sees q/k with matching token counts (sq == sk) and _can_use_prefill_sink_asm's shape check stays valid.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
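For readers unfamiliar with the term, "bottom-right causal for sq < sk" means the causal diagonal is aligned to the bottom-right corner of the score matrix, so the new query tokens see all cached keys plus the new keys up to their own position. A minimal reference mask, using the hypothetical helper name `bottom_right_causal_mask`, could look like this:

```python
def bottom_right_causal_mask(sq, sk):
    """mask[i][j] is True when query i may attend to key j.

    The last query (i = sq - 1) is aligned with the last key (j = sk - 1),
    so query i sits at absolute position i + (sk - sq).
    """
    offset = sk - sq
    return [[j <= i + offset for j in range(sk)] for i in range(sq)]


# Two new tokens on top of two cached tokens (sq=2, sk=4).
mask = bottom_right_causal_mask(2, 4)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

With `sq == sk` this reduces to the ordinary lower-triangular causal mask, and with `sq == 1` the single query sees every key, which is the decode case.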

* [m3 eagle] migrate draft-side EAGLE3 optimizations (Phase 1) Bring the model-agnostic / draft-side MiniMax-M3 EAGLE3 work from wuhuikx/atom-m3-bf16-to-main (2f1c385). These files' pre-eagle base is byte-identical to current main, so they port as-is: - eagle3_llama.py / eagle3_deepseek_mla.py: draft fusions (fused dual-RMSNorm +concat, fused group-RMSNorm aux, AR+RMSNorm fusion), compute_draft_token, replicated-embed option. - fused_aux_rmsnorm.py (new): the fused RMSNorm kernels for the draft. - lm_head_argmax.py (new) + embed_head.py: distributed greedy argmax (all-gather [N,2] per-rank maxima instead of full [N,vocab] logits). - spec_decode/eagle.py: draft loop with distributed-argmax fast path, no-pre-concat aux, and Eagle3 MHA draft KV-cache transfer for PD disaggregation (from #1331). - envs.py: ATOM_EAGLE_REPLICATE_EMBED. - tests/test_lm_head_argmax.py (new, importorskip(aiter) for the no-aiter CI). Target-side enablement (aux-hidden capture in minimax_m3, q>1 spec-verify metadata, prepare_mtp_decode) follows in Phase 2; note eagle.py now references attn_metadata_builder.prepare_mtp_decode which Phase 2 adds. Mocked suite: 437 passed / 38 pre-existing failures / +1 new skip — no regression. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [m3 eagle] target-side enablement on main's M3 API (Phase 2) Enable MiniMax-M3 EAGLE3 on current main's (Triton-sparse) M3 base, adapting the target side to main's API instead of wuhuikx's asm/gluon infra (absent on main). aiter_attention.py: - Add the generic block-paged MHA Eagle3 draft metadata: _mtp_prepare_decode_ metadata_kernel + prepare_mtp_decode + fuse_mtp_decode_position_update (used by the migrated eagle.py for both Kimi and M3 drafts; not M3-sparse coupled). 
- Replace the two "speculative decode not supported" NotImplementedError sites: route q>1 spec-verify through the sparse PREFILL path (make_sparse_prefill_ metadata; per-query causal via cu_seqlens_q, which is now filled uniformly for q>1). prefix_lens is bound to a new persistent sparse_prefix_lens buffer so the CUDAGraph-captured sparse indexer reads live causal lengths on each replay. minimax_m3.py: Eagle3 aux-hidden-state capture (Dynamo-safe, mirrors deepseek_v2): aux_hidden_state_layers, in-layer residual.clone() after the fused-allreduce norm, model forward returns (hidden, aux) tuple, set/get_eagle3_aux_hidden_state_layers on the ForCausalLM + VL-wrapper delegation. model_runner.py: extend KV transfer regions with the Eagle3 draft pool for PD disaggregation (#1331). scheduler.py: trim emitted spec tokens past the stop position (rejection sampler emits past EOS) so flexible-extract doesn't pick up leaked trailing tokens. recipes/MiniMax-M3.md: full EAGLE3 section (with a note that the ASM-PA/fp8/MXFP8 specifics reflect the fully-optimized variant, not this Triton-sparse base). Drop tests/test_lm_head_argmax.py (per request). Note: the q>1 sparse-verify path is new on main and CUDAGraph-sensitive — needs GPU validation (GSM8K + accept on TP4/TP8; confirm Kimi eagle unaffected). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * update recipe make lint happy Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * remove fp8 attn related command Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * [m3 eagle] recipe: set ATOM_FORCE_ATTN_TRITON=1 in EAGLE launch main's MiniMaxM3Attention (dense layers) does not set force_triton_attn in code and attention_mha has no block-128 guard, so on this base the dense attention is routed to Triton only via ATOM_FORCE_ATTN_TRITON=1 (the MXFP4 base section already sets it). 
The EAGLE section migrated from wuhuikx omitted it (wuhuikx set force_triton_attn=True in code instead), so the spec-verify dense attention (q=num_spec+1) fell into paged_attention_asm and aborted in get_heuristic_kernel (no bf16 block-128 ASM-PA kernel). Add the env to the EAGLE launch and drop the stale MXFP8 model_path line. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * remove ATOM_FORCE_ATTN_TRITON Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * update recipe Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * update recipe Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * update the recipe with the perf Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * refine the comment Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
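The distributed greedy argmax mentioned in Phase 1 (all-gather per-rank maxima instead of full logits) can be illustrated for a single token without any GPU code. In this sketch each "rank" is just a list holding its vocabulary shard; `local_max` and `distributed_argmax` are hypothetical names and the gather step is simulated with a Python list.

```python
def local_max(logits_shard, vocab_offset):
    """Per-rank step: return (best value, global token id) for this shard."""
    best = max(range(len(logits_shard)), key=logits_shard.__getitem__)
    return logits_shard[best], vocab_offset + best


def distributed_argmax(shards):
    """Simulated all-gather of one (value, index) pair per rank, then a final max.

    Only 2 numbers per rank are exchanged for each token, instead of the
    rank's whole slice of the vocabulary.
    """
    gathered, offset = [], 0
    for shard in shards:
        gathered.append(local_max(shard, offset))
        offset += len(shard)
    return max(gathered, key=lambda pair: pair[0])[1]


# Vocabulary of 6 tokens split across 3 ranks (made-up logits).
shards = [[0.1, 2.0], [3.5, -1.0], [0.7, 3.4]]
token_id = distributed_argmax(shards)
```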

* [atom CI/Nightly/Benchmark] Add MiniMax-M3 and Eagle into atom infra Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * remove minimax m2.7 case Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* Modify model cache mount

…ing the target top-k (#1362) `_should_skip_index_topk` force-skips the DSA indexer top-k for the MTP layer (layer_id >= num_hidden_layers) whenever `index_share_for_mtp_iteration` is set, making the MTP block reuse the *target* model's top-k. But the MTP block ships its OWN indexer weights (indexer.wk / wq_b / weights_proj / k_norm at layer num_hidden_layers in the checkpoint) and is meant to compute its own top-k for the drafted position. Reusing the target's top-k feeds the draft a wrong attention context at all sequence lengths. This is non-standard: neither vLLM upstream (deepseek_mtp.py allocates a dedicated topk_indices_buffer + Indexer for the MTP block) nor the ATOM sglang plugin reuses the target index; both compute the MTP top-k independently. `index_share_for_mtp_iteration` should at most share across multiple MTP draft steps (num_speculative_tokens > 1), never reuse the target model's index. Fix: drop the MTP special-case so the MTP layer computes its own top-k with its own (loaded) indexer weights, matching vLLM upstream and sglang. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* wip * disable debug code for now * fix(mla): prevent NaN in chunked cached-prefix attention LSE merge In _forward_prefill_cached_chunked, a seq with no cached tokens in the first chunk gets lse=-inf from flash_attn. Seeding the running accumulator with that -inf and later merging against another -inf suffix computes -inf-(-inf)=NaN in merge_attn_states, permanently poisoning that seq's output. Only triggers with multiple seqs chunked together at high concurrency (total_kv > attn_prefill_chunk_size). Sanitize the seed lse with a large finite sentinel so an absent seq carries ~zero weight without producing NaN. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * fix(v4): add per-token causal cap to HCA prefill visibility HCA prefill used the per-seq committed count (ctx_end//128) for every token, missing the (pos+1)//128 per-token causal cap that CSA already has (and that the reference get_compress_topk_idxs applies). Under chunked prefill ctx_end is the chunk's end, so the same logical token saw a different number of HCA compressed groups depending on which chunk computed it -> chunked != single-shot -> ~0.02 GSM8K drop. Cap HCA per-token visibility to min((pos+1)//128, n_committed_hca) in the indptr build, the prefill-indices kernel (new HCA_RATIO constexpr), and the reference impl. Decode is unaffected (decode token is at seq end, the cap is a no-op). Verified GSM8K (V4-Pro, num_concurrent=4, fp8): chunked 0.93 -> 0.9507, single-shot 0.9515 (no regression). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * remove debug code * fix(mla): handle both-empty merge in the kernel, drop call-site workaround The earlier MLA NaN fix sanitized the seed lse at the call site (nan_to_num on suf_lse). Now that merge_attn_states lives in ATOM (triton_merge_attn_states.py, sole caller is the MLA chunked path — plugin/vllm imports vllm's own copy), fix it at the root instead. 
When a token's prefix AND suffix are both empty (max_lse == -inf), the kernel computed -inf-(-inf)=NaN and a 0/0 scale that poisoned the output. This is reachable in ATOM's global-axis chunked prefill: a short seq can fall entirely outside a chunk. Guard both_empty: force a finite 0/0-split so out=0 (correct for empty attention) and keep lse=-inf. The call-site nan_to_num is now redundant and reverts to chunked_lse = suf_lse. Verified GSM8K (R1-MXFP4, tp4, fp8, num_concurrent=64, long-prefill 512): 0.9431 — same as the call-site workaround, no regression. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * rm log * commit * fix format * fix(v4): correct SWA on prefix-cache hit via tail re-forward V4's sliding-window state is a per-request ring (not shared across blocks), so a prefix-cache hit left the new request's SWA ring empty; non-block-aligned prompts then read garbage where a tail token's window reached back into the cached region. Roll the forward start back by ceil(win_with_spec/block_size) blocks so those tokens are re-forwarded, repopulating the ring. Compressed-KV sharing is unaffected (context_lens = cached + scheduled is invariant). Verified token-identical vs no-cache baseline (non-MTP, MTP1). Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * fix(v4): add per-token HCA causal cap to plugin prefill indptr The native path caps HCA prefill visibility per token (n_hca = min((pos+1)//128, committed)) to match write_v4_paged_prefill_indices, but the vLLM and SGLang bridges built the HCA indptr from the uncapped per-seq committed count. The kernel then writes only the capped number of entries while the indptr reserves the full committed count, leaving uninitialized torch.empty garbage in the HCA tail of every token whose context exceeds 128 -> wrong reads / OOB. CSA already had the cap; HCA was missed in the three bridge prefill builders. Mirror CSA's cap. 
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> * fix(scheduler): repair test mocks and partial-prefill finish detection Two test-blocking regressions: 1. Scheduler.__init__ reads config.hf_config (V4 SWA-warmup detection) but the test MockConfig had no hf_config, so every Scheduler-constructing test errored at construction. Guard the access with getattr and give MockConfig a non-V4 hf_config stub. 2. postprocess flagged a prefill as partial via `num_cached_tokens < num_tokens`. Once a completion/EOS token is appended, num_tokens exceeds the prompt length, so a finished prefill stayed flagged partial and the EOS/finish loop skipped it, leaking the finished seq into the next batch. Compare against num_prompt_tokens instead. tests/test_scheduler.py: 15 failed / 18 errored -> 38 passed. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
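The both-empty case from the fix(mla) commits above is easy to reproduce in scalar form. The sketch below merges two partial attention results by their log-sum-exp values and guards the case where both are empty; it is a plain-Python reference under that description, not the Triton kernel, and `merge_two_states` is a hypothetical name.

```python
import math


def merge_two_states(out_a, lse_a, out_b, lse_b):
    """Merge two partial attention outputs weighted by exp(lse).

    An empty partial result is represented by lse = -inf. Without the guard,
    two empty inputs give -inf - (-inf) = NaN in the exponent.
    """
    max_lse = max(lse_a, lse_b)
    if max_lse == -math.inf:
        # Both sides empty: attention over nothing is defined as zero output.
        return [0.0] * len(out_a), -math.inf
    w_a = math.exp(lse_a - max_lse)
    w_b = math.exp(lse_b - max_lse)
    total = w_a + w_b
    merged = [(w_a * a + w_b * b) / total for a, b in zip(out_a, out_b)]
    return merged, max_lse + math.log(total)


both_empty = merge_two_states([5.0, 5.0], -math.inf, [7.0, 7.0], -math.inf)
one_empty = merge_two_states([5.0, 5.0], -math.inf, [7.0, 7.0], 0.0)
equal_weight = merge_two_states([1.0], 0.0, [3.0], 0.0)
```

When only one side is empty its weight is exp(-inf) = 0, so the other side passes through unchanged and no special case is needed there.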

…dge GitHub's 256 limit (#1374) * ci(benchmark): shard matrix by (variant×scenario) + concurrency via reusable workflow The flat per-cell benchmark matrix (278 cells) exceeded GitHub's hard 256-jobs-per-matrix limit, so the benchmark job was never scheduled and summarize failed with empty results ('[]'). Replace it with a two-level fan-out (mirrors InferenceX run-sweep): - catalog.build_cell_configs groups cells by (variant × scenario), folding concurrency into a JSON list; build_benchmark_matrix emits configs_json. - New reusable workflow benchmark-tmpl.yml runs one config across its concurrency matrix (per-cell body unchanged). - The benchmark caller job matrixes over configs and calls the template. Both matrices stay well under 256 (42 configs, <=7 conc each) while every (config x conc) cell still runs as its own parallel job. Artifact naming, summarize, and dashboard upload are unchanged. Also refresh the 4 stale catalog tests (21 variants, glm-5-2-fp8, -dpa-tbo 512 band). * ci(benchmark): shorten template job name to c=<conc> The reusable template job is nested under the caller job, which already shows model + scenario; drop the redundant display/isl/osl so the UI reads 'gpt-oss-120b 1k1k / c=8' instead of repeating the model name. * docs(benchmark): sync matrix-builder docstrings to configs_json + drop redundant build_cells - build_benchmark_matrix.py: module docstring said it emits cells_json, but it now emits configs_json (variant×scenario configs); fix the description. - main() called build_cells directly and again inside build_cell_configs; drop the redundant call and derive the cell/model counts from configs. - catalog.py: module docstring still called a cell the single matrix dimension; configs are now that dimension. Document build_cell_configs.
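The two-level fan-out described above amounts to grouping flat cells by (variant, scenario) and folding concurrency into a list. The following sketch shows that grouping on made-up data; the function name `group_cells_into_configs` and the cell field names are hypothetical and do not reproduce the real `catalog.build_cell_configs`.

```python
from collections import defaultdict

GITHUB_MATRIX_LIMIT = 256  # jobs per matrix, as stated in the commit message


def group_cells_into_configs(cells):
    """Fold flat (variant, scenario, concurrency) cells into one config per
    (variant, scenario) pair, each carrying its list of concurrency levels."""
    grouped = defaultdict(list)
    for cell in cells:
        grouped[(cell["variant"], cell["scenario"])].append(cell["concurrency"])
    return [
        {"variant": variant, "scenario": scenario, "concurrency": sorted(levels)}
        for (variant, scenario), levels in grouped.items()
    ]


# Made-up cells for illustration only.
cells = [
    {"variant": "gpt-oss-120b", "scenario": "1k1k", "concurrency": 8},
    {"variant": "gpt-oss-120b", "scenario": "1k1k", "concurrency": 64},
    {"variant": "gpt-oss-120b", "scenario": "8k1k", "concurrency": 8},
]
configs = group_cells_into_configs(cells)
outer_matrix_size = len(configs)
largest_inner_matrix = max(len(c["concurrency"]) for c in configs)
```

Each level is then checked against the limit separately, which is why 278 cells can still run as individual jobs once they are split into 42 configs of at most 7 concurrency levels each.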

* fix online quant * update comment * format --------- Co-authored-by: ganyi <ygan@amd.com>

…ig from JSON file (#1190) * [atom-vllm nightly acc] remove config in workflow file and fetch config from JSON file Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * remove term name Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* Modify Qwen3.5-35B-A3B-FP8 runner * Replace jq with python3 * [fix](ci): mv plugin test to new node * Modify jq with python3 for vllm-test and add runner name in actionlint * [fix](ci): mv qwen3.5 test to mi355 node * Add model cache mount path for vllm-test * Add model cache mount for sglang-test * Adapt model cache mount path for new runner * Use host network * Remove deepseek-r1-fp8-tp4 from sglang-test * Align Kimi K2.5 PR CI with nightly settings Co-authored-by: Cursor <cursoragent@cursor.com> * Restore DeepSeek R1 FP8 TP4 SGLang CI Co-authored-by: Cursor <cursoragent@cursor.com> * Lower Kimi K2.5 PR accuracy threshold Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: perzhang <perzhang@amd.com> Co-authored-by: wuhuikx <hattie.wu@amd.com> Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>

…embed param naming (#1378)

* perf(server): cut event-loop work in streaming hot path

- Reuse engine-computed num_prompt_tokens in the stream response generators instead of re-encoding the prompt on the event loop at stream start (drops a redundant per-request tokenize).
- Run multimodal input prep (image download + HF processor) in a worker thread instead of synchronously on the event loop.
- Batch-decode a whole step's buffered stream chunks with one tokenizer.batch_decode in flush_stream_batch instead of one decode per seq on the output thread (one GIL-released call instead of N).
- Coalesce each request's finalization SSE messages (content/finish + usage + [DONE]) into a single send to cut socket-write syscalls when many requests finish simultaneously.

* perf(server): enable uvloop event loop; fix gpt-oss embed param naming

uvloop:
- Run uvicorn on uvloop (libuv) instead of the stdlib asyncio selector loop, with graceful fallback to the default loop if uvloop is absent. Under high streaming concurrency this cuts the event-loop cost of SSE socket I/O (sock.send / selector register-unregister): steady-state TPOT P99 8.50ms -> 8.18ms and frontend loop-scheduling delay roughly halved. Adds uvloop to dependencies.

gpt-oss:
- Register `embed_tokens` first (with `embedding` as the shared-storage alias) so it stays the primary, non-deduped name in named_parameters(). The checkpoint stores `model.embed_tokens.weight`; with `embedding` as the primary name the load-completeness check falsely flagged `model.embedding.weight` as unloaded even though the weight is loaded via the alias. Byte-identical weights (GSM8K 0.8832, unchanged); the spurious "parameters were NOT loaded" warning is gone.
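The "graceful fallback" for uvloop can be as small as checking whether the package is importable before choosing uvicorn's `loop` setting. The helper name `pick_event_loop` is hypothetical, and the commented `uvicorn.run` call is only a sketch of where the value would be used, not the code from this commit.

```python
import importlib.util


def pick_event_loop() -> str:
    """Return a uvicorn `loop` value: "uvloop" if importable, else "asyncio"."""
    if importlib.util.find_spec("uvloop") is not None:
        return "uvloop"
    return "asyncio"


loop_choice = pick_event_loop()
print(f"event loop: {loop_choice}")

# Hypothetical use at the server entry point (needs uvicorn and an ASGI `app`):
#   uvicorn.run(app, host="0.0.0.0", port=8000, loop=loop_choice)
```

Probing with `find_spec` avoids importing uvloop just to test for it, so an environment without the package starts on the stdlib loop with no error.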

* feat(openai): support Qwen3 (qwen3_coder/qwen3_xml) tool-call format

ATOM's OpenAI/Anthropic servers previously only parsed the Kimi-K2 tool-call token format (<|tool_calls_section_begin|>...), so Qwen3.5/Qwen3.6 tool calls -- emitted as qwen3_coder XML (<tool_call><function=NAME><parameter=...>) -- were returned as plain text and never surfaced as structured tool_calls. Agent frontends (qwen-code, OpenCode, etc.) therefore could not drive tools.

Add Qwen3 XML parsing alongside the Kimi format, auto-detected:

- tool_parser.py: parse <tool_call>/<function=>/<parameter=> into OpenAI tool_calls, with JSON-Schema type coercion of parameter values from the request's tools (the XML is typeless). Non-streaming + streaming (stream content, then buffer+parse the tool-call block -- robust against the partial-XML streaming edge cases seen in vLLM/SGLang). Kimi path unchanged.
- protocol.py: deserialize tool_calls[].function.arguments (a JSON string in OpenAI requests) to a mapping in to_template_dict, so multi-turn chat templates that iterate arguments.items() (Qwen, Hermes) render tool history instead of raising "Can only get item pairs from a mapping".
- serving_chat.py / api_server.py: thread the request's tools into the parsers for type coercion (default None preserves existing behavior).

Verified: Qwen3.6-27B BF16 served by ATOM drives qwen-code end-to-end on gfx1151 -- write_file + run-shell tool calls execute and the agent reports the program output.

* fix(openai): don't pass tools to the /v1/completions stream path

The previous commit's threading of request.tools matched the stream_completion_response / stream_completion_response_fanout calls in the /v1/completions handler too. CompletionRequest has no `tools` field, so /v1/completions raised "AttributeError: 'CompletionRequest' object has no attribute 'tools'" (HTTP 500). Tool calling only applies to chat; drop tools from the text-completion stream calls.

* fix(openai): make tool-call ids unique across the conversation

The parser generated ids from a per-response index (call_0, call_1, ...), so the first tool call in every assistant turn was call_0. OpenAI tool-call ids must be unique across the whole conversation; agentic clients (e.g. qwen-code) dedupe by id and silently ignore every repeat -> the tool never executes and the model retries forever (endless tool-call loop on any multi-tool task). Use a random call_<uuid> id at both the non-streaming and streaming emit sites.
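The three commits above combine into one non-streaming flow: find the `<tool_call>` blocks, coerce the typeless parameter text using the request's JSON Schema, and emit OpenAI-style `tool_calls` with random ids. The sketch below is a simplified illustration under those assumptions, not the code in `tool_parser.py`. The function names, the regexes, and the example `get_weather` tool are all hypothetical, and streaming is not covered.

```python
import json
import re
import uuid

TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\s]+)>(.*?)</function>\s*</tool_call>", re.DOTALL
)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def coerce(raw, schema):
    """Coerce typeless XML text to the JSON-Schema type; fall back to a string."""
    kind = (schema or {}).get("type", "string")
    text = raw.strip()
    try:
        if kind == "integer":
            return int(text)
        if kind == "number":
            return float(text)
        if kind == "boolean" and text.lower() in ("true", "false"):
            return text.lower() == "true"
        if kind in ("object", "array"):
            return json.loads(text)
    except ValueError:  # also covers json.JSONDecodeError
        pass
    return text


def parse_qwen3_tool_calls(text, tools=None):
    """Return (plain_content, tool_calls) for a complete model response."""
    schemas = {}
    for tool in tools or []:
        fn = tool.get("function", {})
        schemas[fn.get("name")] = fn.get("parameters", {}).get("properties", {})
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        props = schemas.get(name, {})
        args = {k: coerce(v, props.get(k)) for k, v in PARAM_RE.findall(body)}
        calls.append({
            # Random id, so it stays unique across the whole conversation.
            "id": f"call_{uuid.uuid4().hex}",
            "type": "function",
            "function": {"name": name, "arguments": json.dumps(args)},
        })
    return TOOL_CALL_RE.sub("", text).strip(), calls


response = (
    "Let me check.\n<tool_call>\n<function=get_weather>\n"
    "<parameter=city>\nParis\n</parameter>\n"
    "<parameter=days>\n3\n</parameter>\n</function>\n</tool_call>"
)
tools = [{"type": "function", "function": {"name": "get_weather", "parameters": {
    "type": "object",
    "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
}}}]
content, tool_calls = parse_qwen3_tool_calls(response, tools)
print(content, tool_calls[0]["function"])
```

Passing `tools=None` leaves every parameter as a string, which matches the commit's note that the default preserves existing behavior.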

…n concurrency (#1381) The conc=1000 accuracy job intermittently failed: the server exhausted its per-process open-file limit while accepting ~1000 concurrent connections (plus the engine's DP-rank ZMQ and shared-memory fds), hitting EMFILE on accept(). The default soft RLIMIT_NOFILE (~1024) is simply too low for that connection count. Root cause is that ATOM never raised its own fd soft limit. vLLM and SGLang both call set_ulimit() at process startup for exactly this reason, and ATOM's own mesh launch scripts already pass `--ulimit nofile=65536:524288` to docker -- but plain `python -m atom.entrypoints.openai_server` launches (CI, ad-hoc) inherit the daemon default and never bump it. Add a set_ulimit() helper (raise soft -> min(65535, hard)) and call it at the server entry point before the engine-core subprocesses are spawned, so the raised limit is inherited. No-op when the soft limit is already high enough. This is independent of the event-loop choice; it removes the fd ceiling that turned ordinary high-concurrency load into dropped connections.
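A helper with the behavior described above (raise soft -> min(65535, hard), no-op when already high enough) can be sketched with the standard `resource` module. This is an illustration rather than the helper added in the PR: `target_soft_limit` is a hypothetical name used to keep the arithmetic testable, and the error handling is an assumption.

```python
import resource


def target_soft_limit(soft: int, hard: int, wanted: int = 65535) -> int:
    """Soft limit to request: min(wanted, hard), never lowering the current one."""
    inf = resource.RLIM_INFINITY
    if soft == inf:
        return soft  # already unlimited, nothing to raise
    cap = wanted if hard == inf else min(wanted, hard)
    return max(soft, cap)


def set_ulimit(wanted: int = 65535) -> int:
    """Raise the soft RLIMIT_NOFILE for this process; return the limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = target_soft_limit(soft, hard, wanted)
    if target != soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            return soft  # keep the inherited limit if the OS refuses
    return target


# Call before spawning worker subprocesses so they inherit the raised limit.
print("soft RLIMIT_NOFILE:", set_ulimit())
```

Child processes inherit resource limits at creation time, which is why the commit places the call ahead of the engine-core subprocess spawn.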

gyohuangxin and others added 13 commits June 22, 2026 23:55

CI: start Docker release at 21:48 Beijing time (#1313)

345d6a5

[ATOM SGL] Add dsv4 ci (#1224)

cc80cd1

* [ATOM SGL]Add dsv4 ci Co-authored-by: Cursor <cursoragent@cursor.com>

CI: use host network for ATOM test container (#1315)

73f168a

Set sink to fp32 for ps decode asm (#1309)

9c751e1

* fix * use func

fix(sglang): skip sparse MLA fast metadata for unsupported heads (#1252)

f05f3ab

[atom-sgl-benchmark] Add model cache mount for MI308 sglang benchmark (#1328)

f126a50

* Add model cache mount for MI308 sglang benchmark

[atom-sgl-accuracy] Modify sglang accuracy runner for mi355 (#1329)

9ca76d6

[atom-vllm benchmark] Add host network to start container (#1325)

b45e3c6

mlatest (#1301)

ef44603

* mla * fix * fix --------- Co-authored-by: HaonanWang98 <hwang@amd.com> Co-authored-by: feifei14119 <carlus.huang@amd.com>

Support PD disaggregation on Single node (#1308)

4577dcc

carlushuang mentioned this pull request Jun 24, 2026

[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention #1314

Open

k50112113 and others added 16 commits June 24, 2026 18:24

[Triton] DSV4 replace einsum with Triton BMM (#1270)

c4ae045

* replace einsum with bmm * fix

[fix](qwen): fix qwen3.5 accuracy (#1321)

908cdaf

* [fix](qwen): fix qwen3.5 accuracy * [fix](attn): delete extra code * [fix](attn): add kv cache to mutate args * [fix](qwen): remove quick allreduce in qwen3.5 --------- Co-authored-by: perzhang <perzhang@amd.com>

Add NUMA-aware CPU/memory binding for PD Single Node optimization (#1340)

feb5ce5

* Add NUMA-aware CPU/memory binding * Add glm-5-2-fp8 benchmark dispatch checkbox

[atom-sgl-accuracy] Modify atom-sgl-accuracy workflow to adapt it for AAC machine (#1346)

ab9eb78

* Modify atom-sgl-accuracy workflow to adapt it for AAC machine

fix prefill swa write (#1343)

fc4d766

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

Make profile stop timeout configurable with ATOM_PROFILER_TIMEOUT (#1332)

083551e

fix AttributeError: 'AttentionMetaData' object has no attribute 'block_size' (#1348)

b9cff14

Co-authored-by: junxiaguo <JunXia.Guo@amd.com>

[atom-sgl-accuracy] Modify model cache mount (#1352)

c451c40

* Modify model cache mount

zufayu requested review from ZhangLirong-amd and removed request for ZhangLirong-amd June 26, 2026 06:12

zufayu removed the request for review from ZhangLirong-amd June 26, 2026 08:24

carlushuang and others added 2 commits June 26, 2026 08:27

Merge remote-tracking branch 'origin/main' into gfx1151_int8_qwen36

60f24dd

zufayu requested a review from yhl-amd June 26, 2026 13:58

valarLip and others added 10 commits June 26, 2026 22:32

support online quant for quark models (#1370)

e97d631

* fix online quant * update comment * format --------- Co-authored-by: ganyi <ygan@amd.com>

fix(server): batch stream-chunk dispatch (#1367)

4b40ede

fix: preserve FP8 MoE weight bytes during load (#1375)

2ad20b6

Merge remote-tracking branch 'origin/main' into gfx1151_int8_qwen36

d11987f

[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5, with working MTP #1337

carlushuang wants to merge 42 commits into carhuang/support_gfx1151_qwen36 from carhuang/gfx1151_int8_qwen36

carlushuang commented Jun 24, 2026


20 participants
