Add PA_PS 8-wave kernel for MI308 with co-execution #2630

Open
quintinwang5 wants to merge 2 commits into main from qiwan/pa_ps_8w
@quintinwang5

Motivation

Performance optimization for PA_PS on MI308 with co-execution enabled. To use the 8-wave kernel, pass --wave_per_tg 8:

python op_tests/test_pa_ps.py -b 80 -n 10,1 -q 4 --block_size 1024 --quant_type per_1024x128 --wave_per_tg 8

Technical Details

The original 4-wave implementation cannot benefit from MFMA co-execution.
The default remains wave_per_tg=4 (same as before).
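
To compare the two settings side by side, the example command from the Motivation section can be generated for both values of --wave_per_tg. The following is a minimal sketch: `build_cmd` is a hypothetical helper that is not part of this PR, and it assumes that passing --wave_per_tg 4 explicitly behaves the same as omitting the flag, as the stated default suggests.

```python
import shlex

# Flags copied from the example command in this PR; only --wave_per_tg varies.
BASE_ARGS = [
    "python", "op_tests/test_pa_ps.py",
    "-b", "80", "-n", "10,1", "-q", "4",
    "--block_size", "1024",
    "--quant_type", "per_1024x128",
]


def build_cmd(wave_per_tg):
    """Hypothetical helper: argument list for one wave_per_tg value."""
    return BASE_ARGS + ["--wave_per_tg", str(wave_per_tg)]


# One command per setting: 4 (the default) and 8 (the new kernel).
commands = {w: build_cmd(w) for w in (4, 8)}

if __name__ == "__main__":
    for cmd in commands.values():
        # Print shell-ready command lines; pass them to subprocess.run to execute.
        print(shlex.join(cmd))
```

The script only prints the command lines, so it can be inspected on a machine without an MI308.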

Test Plan

  1. Correctness for gqa_ratio=10,16 and q_len_with_mtp=1,2,3,4 (the 8-wave kernel only supports cases where q_len * gqa_ratio <= 48)
  2. Performance gains compared with the 4-wave implementation.
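
The 8-wave constraint in item 1 can be checked against the tested combinations with simple arithmetic. This is a sketch only: `supported_by_8_wave` and the constant name are hypothetical, and the only fact taken from the PR is the limit q_len * gqa_ratio <= 48.

```python
from itertools import product

# Limit stated in the test plan for the 8-wave kernel.
MAX_QLEN_TIMES_GQA = 48


def supported_by_8_wave(q_len, gqa_ratio):
    """Hypothetical check: True if the shape fits the 8-wave limit."""
    return q_len * gqa_ratio <= MAX_QLEN_TIMES_GQA


GQA_RATIOS = (10, 16)
Q_LENS = (1, 2, 3, 4)

# Every (gqa_ratio, q_len) pair listed in the test plan.
cases = list(product(GQA_RATIOS, Q_LENS))
unsupported = [(g, q) for g, q in cases if not supported_by_8_wave(q, g)]

if __name__ == "__main__":
    for g, q in cases:
        status = "8-wave ok" if supported_by_8_wave(q, g) else "4-wave only"
        print(f"gqa_ratio={g:2d} q_len={q} product={g * q:2d} -> {status}")
```

Of the eight listed combinations, only gqa_ratio=16 with q_len=4 (product 64) exceeds the limit; gqa_ratio=16 with q_len=3 sits exactly at 48.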

Test Result

  1. Passed.
  2. Best-performing shape: batch=80, kv_seq_len=10240 ~ 16384 (not all kernels are well optimized yet).

Submission Checklist

@quintinwang5 quintinwang5 requested a review from a team April 7, 2026 02:18
@github-actions
Contributor

github-actions bot commented Apr 7, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-355 | Run Triton tests on MI355 in addition to MI325 |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or gh pr edit 2630 --add-label <label>
