gpt-oss gfx1250 ATOM-parity perf patches by dllehr-amd · Pull Request #1035 · ROCm/vllm

dllehr-amd · 2026-06-29T20:39:07Z

Summary

gpt-oss / gfx1250 perf patches extracted from the vLLM installed in the dllehr_gpt container
(0.9.2rc2.dev10360+gaa0be2a89.d20260626, a dirty build over aa0be2a89) and rebased onto
455_wip via a 3-way merge. All changes target gfx1250 and bring it to "ATOM" parity.

Changes (10 files)

v1/attention/backends/rocm_aiter_unified_attn.py — "shuffle KV-cache" path (gated on
rocm_aiter_ops.is_shuffle_kv_cache_enabled()): K/V-outermost (2, num_blocks, …) layout,
no-copy 5D reshape views, fused-RoPE shuffle writer, and reshape_and_cache_shuffle_triton
for the gfx1250 gluon unified-attn-3d kernel.
v1/sample/sampler.py — PATCH6_AITER_RAW_SAMPLE: fused raw-logits Gumbel-max sampler via
aiter mixed_sample_outer_exponential for unconstrained all-random batches (env-gated
VLLM_AITER_TEMP_SAMPLE=1).
model_executor/models/gpt_oss.py — custom op gpt_oss_rmsnorm_pad_add (aiter
fused_add_rmsnorm_pad) fusing add+rmsnorm+zero-pad to the MoE hidden dim.
v1/sample/ops/topk_topp_sampler.py — PATCH5_AITER_TEMP_SAMPLE: fused temperature Gumbel
sampling for the no-filter case (top-k and top-p both None).
model_executor/layers/quantization/utils/mxfp4_utils.py — apply aiter
swizzle_scales_gfx1250 MX-scale swizzle on gfx1250.
model_executor/layers/utils.py — route all bf16 dense GEMMs (incl. large prefill and
lm_head with K%256≠0) through aiter gluon gemm_a16w16 on gfx1250 instead of rocBLAS.
model_executor/layers/attention/attention.py — VLLM_DISABLE_QUERY_QUANT=1 to skip
per-layer fp8 query quant (keep query in bf16).
triton_utils/jit_monitor.py — disable the diagnostic JIT compile hook (crashes on Gluon
PaddedSharedLayout constexprs in upstream triton).
model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py &
aiter_mxfp4_w4a8_moe.py — re-enable in-kernel gather on gfx1250 + GFX1250_SCALE swizzle.
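
For readers unfamiliar with the sampling trick named in the sampler.py and topk_topp_sampler.py entries, the sketch below shows temperature Gumbel-max sampling in plain Python. It is an unfused, illustrative reference only: the function name gumbel_max_sample is hypothetical, and nothing here reflects the signature or internals of aiter's mixed_sample_outer_exponential. It assumes the textbook formulation, in which each logit is divided by the temperature, perturbed with independent Gumbel noise derived from an Exp(1) draw, and the argmax is taken, which matches sampling from softmax(logits / temperature) in distribution.

```python
import math
import random


def gumbel_max_sample(logits, temperature, rng):
    """Draw one index distributed as softmax(logits / temperature).

    Hypothetical reference helper for illustration; not the aiter kernel.
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    best_index = -1
    best_score = -math.inf
    for index, logit in enumerate(logits):
        # Uniform draw, clamped away from 0 so the logarithm stays finite.
        u = max(rng.random(), 1e-300)
        # -log(u) is an Exp(1) draw; minus its log is standard Gumbel noise.
        exp_draw = -math.log(u)
        score = logit / temperature - math.log(exp_draw)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Usage: sample one token index from three raw logits.
rng = random.Random(0)
token = gumbel_max_sample([2.0, 1.0, 0.5], temperature=0.8, rng=rng)
```

Working on raw logits this way avoids materializing a softmax before sampling, which is presumably what makes the fused path attractive when no top-k/top-p filtering is requested.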

New env flags

VLLM_DISABLE_QUERY_QUANT, VLLM_AITER_TEMP_SAMPLE, VLLM_PATCH6_DEBUG.
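
The description only states that VLLM_DISABLE_QUERY_QUANT and VLLM_AITER_TEMP_SAMPLE are switched on by setting them to 1; how the patched code parses them, and what VLLM_PATCH6_DEBUG accepts, is not shown. As a rough sketch of the gating pattern, assuming a flag counts as enabled only when its value is "1" after stripping whitespace (the helper env_flag_enabled is hypothetical, not a vLLM API):

```python
import os


def env_flag_enabled(name, environ=None):
    """Return True only when the variable is set to "1" (assumed convention)."""
    env = os.environ if environ is None else environ
    return env.get(name, "0").strip() == "1"


# Opt in to the two flags whose "=1" form appears in the description.
# Set them before vLLM is imported or launched, since the point at which
# the patched code reads them is not stated.
os.environ["VLLM_DISABLE_QUERY_QUANT"] = "1"
os.environ["VLLM_AITER_TEMP_SAMPLE"] = "1"
```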

Notes

Base: aa0be2a89 (ancestor of 455_wip). Rebased via 3-way merge; the two overlapping upstream
files (utils.py, gpt_oss_triton_kernels_moe.py) auto-merged with no conflicts.
Not included: the gitignored build-vendored vllm/third_party/triton_kernels/ (populated by the
Docker build) and the empty vllm-rs/.

🤖 Generated with Claude Code

(cherry picked from commit c16f522a089f11578cd39e1403e40fffea20c9fb)

dllehr_gpt: gpt-oss gfx1250 ATOM-parity perf patches

725c412

dllehr-amd requested a review from AndreasKaratzas as a code owner June 29, 2026 20:39

dllehr-amd wants to merge 1 commit into ROCm:455_wip from dllehr-amd:dllehr/gptoss-gfx1250-atom-455wip
