Pr/a16wi4 group by yadaish · Pull Request #370 · ROCm/FlyDSL

yadaish · 2026-04-09T08:51:13Z

Summary

bf16 scales use a packed (E, G//2, N, 2) layout: the two adjacent groups for the same N position share one dword, halving scale memory loads (16 buffer_load_dword instead of 32).
Deferred bf16 extraction: the raw i32 dwords are carried as scf.for loop state and unpacked in the compute phase via extract_bf16_scale(). This eliminates 32 v_mov_b32 register copies and keeps VGPR usage at 140 (3 waves), matching the occupancy of the f32 path.
Zero-cost bf16->f32 conversion: for even ku, v_lshlrev_b32 v, 16, v; for odd ku, v_and_b32 v, 0xffff0000, v.
shuffle_scale_for_int4() in tests/utils.py automatically handles both the f32 (E, G, N) and bf16 (E, G//2, N, 2) layouts.
Compile-time scale_is_bf16 flag on compile_moe_gemm1 / compile_moe_gemm2.
Fix hgemm_splitk.py: update the rocdl.mfma_f32_16x16x32_bf16/f16 calls to the new (result_type, operands_list) API.
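
The packing and extraction described in the summary can be illustrated on the host side. The following is a minimal sketch, not the kernel code: the helper names (f32_to_bf16_bits, pack_pair, pack_column, extract_scale) are hypothetical, the bf16 conversion truncates instead of rounding (an assumption), and the even group is placed in the low half of the dword because that is what the shift/mask pair in the summary implies.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Return the upper 16 bits of the f32 encoding (bf16 by truncation, an assumption)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def pack_pair(scale_even: float, scale_odd: float) -> int:
    """Pack two adjacent-group scales into one dword: even group low, odd group high."""
    return f32_to_bf16_bits(scale_even) | (f32_to_bf16_bits(scale_odd) << 16)


def pack_column(scales):
    """Pack G per-group scales for one (expert, n) position into G//2 dwords."""
    if len(scales) % 2 != 0:
        raise ValueError("number of groups must be even")
    return [pack_pair(scales[g], scales[g + 1]) for g in range(0, len(scales), 2)]


def extract_scale(dword: int, ku: int) -> float:
    """Recover the f32 scale for group index ku from its packed dword."""
    if ku % 2 == 0:
        bits = (dword << 16) & 0xFFFFFFFF  # mirrors v_lshlrev_b32 v, 16, v
    else:
        bits = dword & 0xFFFF0000  # mirrors v_and_b32 v, 0xffff0000, v
    (value,) = struct.unpack("<f", struct.pack("<I", bits))
    return value


# Four groups collapse into two dwords; values are chosen to be exact in bf16.
column = [0.5, -1.5, 0.25, 2.0]
packed = pack_column(column)
unpacked = [extract_scale(packed[ku // 2], ku) for ku in range(len(column))]
```

Because a bf16 value is the upper half of an f32, moving the low half up (even ku) or masking off the low half (odd ku) already yields a valid f32 bit pattern, so no separate conversion step is needed.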

Performance (Kimi 2.5 TP8, tokens=128, model_dim=7168, inter_dim=256, E=384, topk=8)

Stage1 (gate+up GEMM, tile 16x128x128):

Scale	Time	TB/s
f32	201.9 us	4.375
bf16	190.4 us	4.176
Speedup	+6.0%

Stage2 (reduce GEMM, tile 16x128x256):

Scale	Time	TB/s
f32	101.1 us	4.378
bf16	93.8 us	4.251
Speedup	+7.8%
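
The speedup rows can be reproduced from the times in the tables above. The sketch below also derives the bytes implied by each row, assuming the TB/s column is total bytes moved divided by kernel time (an assumption; the PR does not define it). Under that assumption the bf16 rows move fewer bytes, which would be consistent with the smaller scale tensor and would explain why TB/s drops while time improves. The function names are hypothetical.

```python
def speedup_pct(t_base_us: float, t_new_us: float) -> float:
    """Relative speedup of the new time over the baseline, in percent."""
    return (t_base_us / t_new_us - 1.0) * 100.0


def implied_megabytes(time_us: float, tb_per_s: float) -> float:
    """Bytes implied by time and bandwidth: 1e-6 s * 1e12 B/s = 1e6 B, i.e. MB."""
    return time_us * tb_per_s


stage1 = speedup_pct(201.9, 190.4)  # rounds to 6.0
stage2 = speedup_pct(101.1, 93.8)   # rounds to 7.8

stage1_f32_mb = implied_megabytes(201.9, 4.375)
stage1_bf16_mb = implied_megabytes(190.4, 4.176)
```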

Test plan

test_moe_gemm_w4a16_groupwise_scale[scale_f32] -- PASSED
test_moe_gemm_w4a16_groupwise_scale[scale_bf16] -- PASSED
bench_kimi25_w4a16.sh with SCALE_DTYPE=bf16 and SCALE_DTYPE=f32 -- correctness verified
test_hgemm_splitk -- all 16 tests PASSED

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

…duler The hot_loop_scheduler() in both gemm1 and gemm2 was missing the rocdl.sched_barrier(0); return guard at the top, causing all scheduler hints (dead code on dev branch) to actually execute and degrade stage1 from ~204us to ~221us. Also fix ISA dump indices in bench script. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
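
The guard described in this commit message can be sketched with a stand-in for the rocdl module. Everything here is illustrative: RecordingRocdl is a hypothetical recorder that logs calls instead of emitting IR, the hints_enabled flag is an assumption added for the demonstration (the commit describes an unconditional early return), and the specific hint arguments are made up.

```python
class RecordingRocdl:
    """Hypothetical stand-in for rocdl: records calls instead of emitting IR."""

    def __init__(self):
        self.calls = []

    def sched_barrier(self, mask):
        self.calls.append(("sched_barrier", mask))

    def sched_group_barrier(self, mask, size, sync_id):
        self.calls.append(("sched_group_barrier", mask, size, sync_id))


def hot_loop_scheduler(rocdl, hints_enabled=False):
    """Emit only a scheduling barrier unless the hints are explicitly enabled."""
    if not hints_enabled:
        rocdl.sched_barrier(0)
        return  # the guard: everything below stays dead code
    # Illustrative hints only; the real ones are not shown in the PR text.
    rocdl.sched_group_barrier(0x008, 1, 0)
    rocdl.sched_group_barrier(0x020, 2, 0)
    rocdl.sched_barrier(0)


guarded = RecordingRocdl()
hot_loop_scheduler(guarded)

unguarded = RecordingRocdl()
hot_loop_scheduler(unguarded, hints_enabled=True)
```

With the guard in place only the single barrier is recorded; without it the extra hints are emitted too, which is the situation the commit reports as slowing stage1 from ~204us to ~221us.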

…i4-group

coderfeli · 2026-04-10T01:09:02Z

python/flydsl/expr/rocdl/__init__.py

               _to_ir(soffset), _to_ir(offset), _to_ir(aux), **kw)
+
+
+def cvt_off_f32_i4(src_i32, byte_sel=None):


Does this need inline asm?

I am not sure. For mfma, we also import from the MLIR dialect package, as in:

_ods_mfma_f32_16x16x32_f16 = globals().get("mfma_f32_16x16x32_f16", None)
_ods_mfma_f32_16x16x32_bf16 = globals().get("mfma_f32_16x16x32_bf16", None)

Could you give some advice on how to deal with them?
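
For reference, the optional-lookup pattern quoted in this reply can be sketched in isolation. This is not FlyDSL code: resolve_optional_op and fake_namespace are hypothetical, and a plain dict stands in for the module globals that the generated dialect bindings would populate.

```python
def resolve_optional_op(namespace, name):
    """Return namespace[name] if present, else a stub that fails loudly when called."""
    op = namespace.get(name, None)
    if op is not None:
        return op

    def _missing(*args, **kwargs):
        raise NotImplementedError(f"{name} is not provided by this dialect build")

    return _missing


# Hypothetical namespace in which only the f16 op was generated.
fake_namespace = {
    "mfma_f32_16x16x32_f16": lambda result_type, operands: ("f16", result_type, tuple(operands)),
}

mfma_f16 = resolve_optional_op(fake_namespace, "mfma_f32_16x16x32_f16")
mfma_bf16 = resolve_optional_op(fake_namespace, "mfma_f32_16x16x32_bf16")
```

Resolving the symbol once and substituting a stub keeps the import from failing on builds that lack the op, while still giving a clear error at the call site.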

coderfeli · 2026-04-10T01:10:19Z

scripts/run_benchmark.sh

+    if [ -n "${dt_s2}" ] && [ -n "${tf_s2}" ] && [ -n "${tb_s2}" ]; then
+      _emit_row "moe_w4a16_s2" "${shape_moe}" "${dt_s2}" "${tb_s2}" "${tf_s2}"
+    fi
+  done


The benchmark script keeps getting bigger. Add a TODO item to define the benchmark configs through pytest instead of putting everything in the script.

OK, I will improve it later.

coderfeli · 2026-04-10T01:10:50Z

bench_kimi25_w4a16.sh

+
+SCALE_DTYPE=${SCALE_DTYPE:-f32}
+
+python -c "


Is this script needed?

No. I will remove it later.

coderfeli · 2026-04-11T12:03:44Z

python/flydsl/expr/rocdl/__init__.py

+    from ..._mlir.dialects import llvm as _llvm
+    from ..._mlir import ir
+
+    if byte_sel is not None:


This should not be put in __init__.py.

root and others added 9 commits April 9, 2026 08:32

clean code

afbeac2

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

update

eb4477f

update

6d27488

update

7ebef5c

Merge branch 'main' into pr/a16wi4-group

09ec0fd

support bf16 with layout [ e, num_group//2, n]

fd9b7e2

Merge branch 'pr/a16wi4-group' of github.com:ROCm/FlyDSL into pr/a16w…

a5bbde6

…i4-group

update

3a3fb68

yadaish marked this pull request as ready for review April 9, 2026 17:52

update

b8b6f74

coderfeli reviewed Apr 10, 2026


yadaish added 3 commits April 10, 2026 01:47

update

2877c18

update

1339750

improve testcase

060b1e4

coderfeli reviewed Apr 11, 2026


yadaish wants to merge 13 commits into main from pr/a16wi4-group

2 participants