gpt-oss gfx1250 ATOM-parity perf patches #1035
Open
dllehr-amd wants to merge 1 commit into
(cherry picked from commit c16f522a089f11578cd39e1403e40fffea20c9fb)
Summary
gpt-oss / gfx1250 perf patches extracted from the dllehr_gpt container's installed vLLM (0.9.2rc2.dev10360+gaa0be2a89.d20260626, a dirty build over aa0be2a89), 3-way rebased onto 455_wip. All changes target gfx1250 and bring it to "ATOM" parity.

Changes (10 files)
- v1/attention/backends/rocm_aiter_unified_attn.py: "shuffle KV-cache" path (gated on rocm_aiter_ops.is_shuffle_kv_cache_enabled()) with a K/V-outermost (2, num_blocks, …) layout, no-copy 5D reshape views, a fused-RoPE shuffle writer, and reshape_and_cache_shuffle_triton for the gfx1250 gluon unified-attn-3d kernel.
- v1/sample/sampler.py: PATCH6_AITER_RAW_SAMPLE, a fused raw-logits Gumbel-max sampler via aiter mixed_sample_outer_exponential for unconstrained all-random batches (env-gated by VLLM_AITER_TEMP_SAMPLE=1).
- model_executor/models/gpt_oss.py: custom op gpt_oss_rmsnorm_pad_add (aiter fused_add_rmsnorm_pad) fusing add + rmsnorm + zero-pad to the MoE hidden dim.
- v1/sample/ops/topk_topp_sampler.py: PATCH5_AITER_TEMP_SAMPLE, fused temperature Gumbel sampling for the no-filter (k/p=None) case.
- model_executor/layers/quantization/utils/mxfp4_utils.py: apply the aiter swizzle_scales_gfx1250 MX-scale swizzle on gfx1250.
- model_executor/layers/utils.py: route all bf16 dense GEMMs (incl. large prefill and lm_head with K%256≠0) through aiter gluon gemm_a16w16 on gfx1250 instead of rocBLAS.
- model_executor/layers/attention/attention.py: VLLM_DISABLE_QUERY_QUANT=1 to skip per-layer fp8 query quant (keep query in bf16).
- triton_utils/jit_monitor.py: disable the diagnostic JIT compile hook (crashes on Gluon PaddedSharedLayout constexprs in upstream triton).
- model_executor/layers/fused_moe/experts/gpt_oss_triton_kernels_moe.py & aiter_mxfp4_w4a8_moe.py: re-enable in-kernel gather on gfx1250 + GFX1250_SCALE swizzle.

New env flags
VLLM_DISABLE_QUERY_QUANT, VLLM_AITER_TEMP_SAMPLE, VLLM_PATCH6_DEBUG.

Notes
- aa0be2a89 (ancestor of 455_wip); rebased via 3-way merge; the two overlapping upstream files (utils.py, gpt_oss_triton_kernels_moe.py) auto-merged with no conflicts.
- vllm/third_party/triton_kernels/ (populated by the Docker build) and the empty vllm-rs/.

🤖 Generated with Claude Code
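
Reference sketch for reviewers, not taken from the patch: the sampler changes above describe Gumbel-max sampling on raw logits. The pure-Python code below shows the textbook form of that trick, assuming the exponential-noise formulation that the kernel name mixed_sample_outer_exponential suggests. The function names here (gumbel_max_sample, empirical_distribution) are hypothetical and do not reflect the aiter API or its actual numerics.

```python
import math
import random


def gumbel_max_sample(logits, temperature, rng):
    """Draw one token id distributed as softmax(logits / temperature).

    Identity used: if q ~ Exp(1), then -log(q) ~ Gumbel(0, 1), and
    argmax_i(logits[i] / T - log(q_i)) follows softmax(logits / T).
    No softmax (and no normalization) is computed.
    """
    best_id, best_score = -1, -math.inf
    for token_id, logit in enumerate(logits):
        # Exponential noise; clamp away from 0 so log() is always defined.
        q = max(rng.expovariate(1.0), 1e-300)
        score = logit / temperature - math.log(q)
        if score > best_score:
            best_id, best_score = token_id, score
    return best_id


def softmax(logits, temperature):
    """Numerically stable softmax of logits / temperature (reference only)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def empirical_distribution(logits, temperature, num_draws, seed=0):
    """Frequency of each token id over num_draws Gumbel-max samples."""
    rng = random.Random(seed)
    counts = [0] * len(logits)
    for _ in range(num_draws):
        counts[gumbel_max_sample(logits, temperature, rng)] += 1
    return [c / num_draws for c in counts]
```

Comparing empirical_distribution(logits, T, n) against softmax(logits, T) for a small logits vector is one way to sanity-check a fused sampler against the unfused path, under the assumption that both target the same distribution.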