gpt-oss gfx1250 ATOM-parity perf patches #1035

Open
dllehr-amd wants to merge 1 commit into ROCm:455_wip from dllehr-amd:dllehr/gptoss-gfx1250-atom-455wip

@dllehr-amd

Collaborator

Summary

gpt-oss / gfx1250 perf patches extracted from the dllehr_gpt container's installed vLLM
(0.9.2rc2.dev10360+gaa0be2a89.d20260626, a dirty build over aa0be2a89), 3-way rebased onto
455_wip. All changes target gfx1250 and bring it to "ATOM" parity.

Changes (10 files)

  • v1/attention/backends/rocm_aiter_unified_attn.py — "shuffle KV-cache" path (gated on
    rocm_aiter_ops.is_shuffle_kv_cache_enabled()): K/V-outermost (2, num_blocks, …) layout,
    no-copy 5D reshape views, fused-RoPE shuffle writer, and reshape_and_cache_shuffle_triton
    for the gfx1250 gluon unified-attn-3d kernel.
  • v1/sample/sampler.py — PATCH6_AITER_RAW_SAMPLE: fused raw-logits Gumbel-max sampler via
    aiter mixed_sample_outer_exponential for unconstrained all-random batches (env-gated
    VLLM_AITER_TEMP_SAMPLE=1).
  • model_executor/models/gpt_oss.py — custom op gpt_oss_rmsnorm_pad_add (aiter
    fused_add_rmsnorm_pad) fusing add+rmsnorm+zero-pad to the MoE hidden dim.
  • v1/sample/ops/topk_topp_sampler.py — PATCH5_AITER_TEMP_SAMPLE: fused temperature Gumbel
    sampling for the no-filter (k/p=None) case.
  • model_executor/layers/quantization/utils/mxfp4_utils.py — apply aiter
    swizzle_scales_gfx1250 MX-scale swizzle on gfx1250.
  • model_executor/layers/utils.py — route all bf16 dense GEMMs (incl. large prefill and
    lm_head with K%256≠0) through aiter gluon gemm_a16w16 on gfx1250 instead of rocBLAS.
  • model_executor/layers/attention/attention.py — VLLM_DISABLE_QUERY_QUANT=1 to skip
    per-layer fp8 query quant (keep query in bf16).
  • triton_utils/jit_monitor.py — disable the diagnostic JIT compile hook (crashes on Gluon
    PaddedSharedLayout constexprs in upstream triton).
  • model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py &
    aiter_mxfp4_w4a8_moe.py — re-enable in-kernel gather on gfx1250 + GFX1250_SCALE swizzle.
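
For reviewers unfamiliar with the Gumbel-max trick that the two sampler patches rely on, here is a minimal pure-Python sketch of the technique. It is not the aiter kernel: `gumbel_max_sample` and `empirical_distribution` are hypothetical names used only for illustration, and the internals of `mixed_sample_outer_exponential` are not shown in this PR description. The sketch assumes the noise is derived from exponential draws, which the kernel name suggests but does not confirm.

```python
import math
import random


def gumbel_max_sample(logits, temperature, rng):
    """Return an index drawn from softmax(logits / temperature).

    Gumbel-max trick: add independent Gumbel(0, 1) noise to each scaled
    logit and take the argmax, so no softmax or cumulative sum is needed.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    best_index, best_score = -1, -math.inf
    for index, logit in enumerate(logits):
        # If E ~ Exp(1), then -log(E) is a standard Gumbel sample.
        # Clamp away from 0 so math.log never sees an exact zero.
        noise = max(rng.expovariate(1.0), 1e-300)
        score = logit / temperature - math.log(noise)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


def empirical_distribution(logits, temperature, num_samples, seed=0):
    """Estimate the sampling distribution by repeated draws (illustration only)."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(num_samples):
        counts[gumbel_max_sample(logits, temperature, rng)] += 1
    return [count / num_samples for count in counts]
```

Because the argmax operates directly on the scaled logits, the temperature scaling, noise generation, and selection can in principle be fused into a single pass, which is the property the fused sampler paths exploit for the no-filter (k/p=None) case.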

New env flags

VLLM_DISABLE_QUERY_QUANT, VLLM_AITER_TEMP_SAMPLE, VLLM_PATCH6_DEBUG.
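
As a rough illustration of how such flags might be checked, the sketch below treats a flag as enabled only when it is set to `1`, matching the `=1` usage shown for `VLLM_DISABLE_QUERY_QUANT` and `VLLM_AITER_TEMP_SAMPLE` above. The helpers `env_flag_enabled` and `describe_flags` are hypothetical and not part of this PR; the parsing the patches actually use, and the accepted values for `VLLM_PATCH6_DEBUG`, are assumptions here.

```python
import os


def env_flag_enabled(name, environ=None):
    """Hypothetical helper: True only when the variable is set to "1"."""
    env = os.environ if environ is None else environ
    return env.get(name, "0").strip() == "1"


def describe_flags(environ=None):
    """Report the assumed on/off state of the three flags named in this PR."""
    names = (
        "VLLM_DISABLE_QUERY_QUANT",
        "VLLM_AITER_TEMP_SAMPLE",
        "VLLM_PATCH6_DEBUG",
    )
    return {name: env_flag_enabled(name, environ) for name in names}
```

Passing an explicit mapping instead of reading `os.environ` directly makes it easy to check which paths a given launch configuration would enable without touching the process environment.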

Notes

  • Base: aa0be2a89 (ancestor of 455_wip), rebased via 3-way merge. The two overlapping upstream
    files (utils.py, gpt_oss_triton_kernels_moe.py) auto-merged with no conflicts.
  • Not included: the gitignored build-vendored vllm/third_party/triton_kernels/ (populated by the
    Docker build) and the empty vllm-rs/.

🤖 Generated with Claude Code

(cherry picked from commit c16f522a089f11578cd39e1403e40fffea20c9fb)