
[3/N] Add RDNA4 PA decode FP8 kernel, refactor shared reduce code #357

Closed
vivienfanghuagood wants to merge 1 commit into ROCm:main from vivienfanghuagood:rdna4-pa-decode-fp8

Conversation


@vivienfanghuagood vivienfanghuagood commented Apr 7, 2026

Motivation

Add RDNA4 (gfx120x) paged-attention FP8 decode kernel to FlyDSL, enabling LLM inference decode on consumer RDNA GPUs. The existing CDNA kernel uses MFMA wave64 instructions and cannot run on RDNA hardware.

Technical Details

  • kernels/pa_common.py (new): shared constants, stride computation, and reduce kernels extracted from pa_decode_fp8.py to eliminate duplication between CDNA and RDNA paths
  • kernels/rdna_pa_decode_fp8.py (new): RDNA4 decode dot kernel using wmma_f32_16x16x16_fp8_fp8 with wave32 (8 warps × 32 lanes), softmax P staged through LDS as f32
  • kernels/pa_decode_fp8.py (modified): CDNA kernel now imports shared code from pa_common, removing ~360 duplicate lines
  • tests/kernels/test_pa.py (modified): unified test for both CDNA and RDNA — arch-aware kernel selection, aiter made optional (RDNA validates against torch reference; CDNA validates against Gluon when aiter is available)
  • tests/arch_compat.py (modified): test_pa.py removed from CDNA_ONLY_TESTS since it now self-manages arch dispatch
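As a rough sketch of the arch-aware kernel selection described above (the function names and return values here are illustrative, not test_pa.py's actual API):

```python
# Hedged sketch of arch-aware dispatch; names are illustrative assumptions,
# not the real test_pa.py interface.
def is_rdna4(arch_name: str) -> bool:
    """RDNA4 parts report gfx120x arch names (e.g. gfx1200, gfx1201)."""
    return arch_name.startswith("gfx120")

def select_pa_kernel(arch_name: str) -> str:
    # RDNA4 takes the wave32 WMMA path; CDNA (gfx9xx) keeps the MFMA
    # wave64 kernel.
    if is_rdna4(arch_name):
        return "kernels/rdna_pa_decode_fp8.py"
    return "kernels/pa_decode_fp8.py"
```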

Test Plan

  • RDNA correctness: python tests/kernels/test_pa.py on gfx1201 — all PASS (cos_sim > 0.999 vs torch reference)
  • RDNA performance: 2.5–146× faster than PyTorch SDPA (bf16) on gfx1201
  • CDNA regression: existing run_single() path unchanged, needs gfx9xx CI validation
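A minimal sketch of the cos_sim acceptance check the test plan relies on (pure Python for illustration; the real test presumably flattens torch tensors):

```python
import math

# Cosine similarity between a kernel output and the torch reference,
# accepted when cos_sim > 0.999 as in the test plan above.
def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def passes(out, ref, threshold=0.999):
    return cos_sim(out, ref) > threshold
```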

Test Result

gfx1201 (RX 9070 XT, ROCm 7.1, PyTorch 2.9.1):

batch  ctx    kernel (us)  status
1      128    6.7          PASS
1      256    7.6          PASS
4      4096   7.7          PASS
32     4096   35.1         PASS

Submission Checklist


vivienfanghuagood commented Apr 7, 2026

@coderfeli Hi Felix, we propose adding an attention kernel for RDNA, reusing most of the code from the CDNA implementation. Could you help review our submission? Thanks a lot!

@coderfeli

@vivienfanghuagood our PA is in the middle of a major refactor and perf tuning. Could you wait a few days for that?

@vivienfanghuagood

> @vivienfanghuagood our PA is in the middle of a major refactor and perf tuning. Could you wait a few days for that?

Sure, that's fine. May I ask whether other kernels have similar refactoring plans? If so, we can avoid touching those kernels while developing.


coderfeli commented Apr 7, 2026

MoE as well, @vivienfanghuagood, but that one is a code-style change only. The current PA has many functional and perf issues.

Add RDNA4 paged-attention FP8 decode kernel using WMMA wave32, refactor
shared reduce/stride code into pa_common.py, and unify test_pa.py to
handle both CDNA and RDNA architectures.

New files:
- kernels/pa_common.py: shared constants, compute_pa_strides(),
  build_ps_reduce_kernel(), build_v2_reduce_kernel()
- kernels/rdna_pa_decode_fp8.py: RDNA4 WMMA dot kernel (383 lines)

Modified:
- kernels/pa_decode_fp8.py: imports shared code from pa_common
- tests/kernels/test_pa.py: unified CDNA+RDNA test with arch-aware
  kernel selection (aiter optional for RDNA path)
- tests/arch_compat.py: test_pa.py removed from CDNA_ONLY_TESTS
  (now self-manages via IS_RDNA/HAS_AITER guards)

Removed:
- tests/kernels/test_rdna_pa.py: merged into test_pa.py

RDNA4 architecture: 8 warps x 32 lanes, WMMA f32_16x16x16_fp8_fp8,
P staged as f32 in LDS (~16.5 KB). Correctness cos_sim > 0.999,
performance 2.5-146x vs PyTorch SDPA on gfx1201.

Made-with: Cursor
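A hypothetical sketch of what a `compute_pa_strides()`-style helper might compute for a paged KV cache laid out as [num_blocks, num_kv_heads, head_size, block_size]; the layout and return order here are assumptions for illustration, not FlyDSL's actual contract:

```python
# Row-major strides (in elements) for an assumed paged KV cache layout of
# [num_blocks, num_kv_heads, head_size, block_size]. All names here are
# illustrative, not the real pa_common.py API.
def compute_pa_strides(num_kv_heads: int, head_size: int, block_size: int):
    stride_elem = block_size                   # step between head_size rows
    stride_head = head_size * stride_elem      # step between KV heads
    stride_block = num_kv_heads * stride_head  # step between cache blocks
    return stride_block, stride_head, stride_elem
```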
