
Pr/a16wi4 group #370

Open
yadaish wants to merge 13 commits into main from pr/a16wi4-group

Collaborator

yadaish commented Apr 9, 2026

Summary

  • bf16 scale with (E, G//2, N, 2) packed layout: two adjacent groups for the same N position packed into one dword, halving scale memory loads (16 vs 32 buffer_load_dword)
  • Deferred bf16 extraction: raw i32 dwords carried as scf.for loop state, extracted at compute phase via extract_bf16_scale(). Eliminates 32 v_mov_b32 register copies, keeps VGPRs at 140 (3 waves) matching f32 occupancy
  • Zero-cost bf16->f32: even ku uses v_lshlrev_b32 v, 16, v; odd ku uses v_and_b32 v, 0xffff0000, v
  • shuffle_scale_for_int4() in tests/utils.py auto-handles both f32 (E,G,N) and bf16 (E,G//2,N,2) layouts
  • Compile-time scale_is_bf16 flag on compile_moe_gemm1 / compile_moe_gemm2
  • Fix hgemm_splitk.py: update rocdl.mfma_f32_16x16x32_bf16/f16 to new (result_type, operands_list) API
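
The packed layout and the shift/mask extraction described in the bullets above can be illustrated on the host side. This is a minimal pure-Python sketch, not the kernel code: `pack_group_pairs` and `extract_scale` are hypothetical names, the expert dimension E is dropped for brevity, and it assumes the even group sits in the low 16 bits of each dword and the odd group in the high 16 bits (which is what the two instructions in the summary imply).

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Return the bf16 bit pattern of x by truncating its f32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16  # keep sign, exponent, and the top 7 mantissa bits


def bits_to_f32(bits: int) -> float:
    """Reinterpret the low 32 bits of an integer as an f32 value."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))
    return value


def pack_group_pairs(scales):
    """Repack scales[G][N] (G even) into dwords[G//2][N].

    Assumption: even group in the low half, odd group in the high half.
    """
    num_groups = len(scales)
    assert num_groups % 2 == 0, "G must be even to pair adjacent groups"
    packed = []
    for g in range(0, num_groups, 2):
        packed.append([
            f32_to_bf16_bits(lo) | (f32_to_bf16_bits(hi) << 16)
            for lo, hi in zip(scales[g], scales[g + 1])
        ])
    return packed


def extract_scale(dword: int, ku: int) -> float:
    """Widen one bf16 half of a packed dword to f32 with one shift or mask."""
    if ku % 2 == 0:
        return bits_to_f32(dword << 16)      # mirrors v_lshlrev_b32 v, 16, v
    return bits_to_f32(dword & 0xFFFF0000)   # mirrors v_and_b32 v, 0xffff0000, v


# G=4 groups, N=2 columns; every value is exactly representable in bf16.
scales = [[0.5, 2.0], [-1.25, 0.75], [3.0, 1.0], [0.125, -4.0]]
packed = pack_group_pairs(scales)
roundtrip = [
    [extract_scale(packed[g // 2][n], g) for n in range(2)]
    for g in range(4)
]
```

Because a bf16 value is the upper half of an f32, moving the 16 bits into the upper half of a register is all the conversion requires, so each extraction is a single ALU instruction with no separate convert step.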

Performance (Kimi 2.5 TP8, tokens=128, model_dim=7168, inter_dim=256, E=384, topk=8)

Stage1 (gate+up GEMM, tile 16x128x128):

| Scale | Time | TB/s |
|---|---|---|
| f32 | 201.9 us | 4.375 |
| bf16 | 190.4 us | 4.176 |
| Speedup | +6.0% | |

Stage2 (reduce GEMM, tile 16x128x256):

| Scale | Time | TB/s |
|---|---|---|
| f32 | 101.1 us | 4.378 |
| bf16 | 93.8 us | 4.251 |
| Speedup | +7.8% | |
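
The speedup figures can be reproduced from the reported times. The sketch below also shows one way to read the lower bf16 TB/s numbers, under the assumption (not stated in the PR) that TB/s is total bytes moved divided by kernel time: the bf16 runs move fewer bytes, so bandwidth can drop slightly while the kernel still finishes sooner. The function names are illustrative only.

```python
def speedup_pct(baseline_us: float, new_us: float) -> float:
    """Percentage speedup of new_us relative to baseline_us."""
    return (baseline_us / new_us - 1.0) * 100.0


def implied_megabytes(tb_per_s: float, time_us: float) -> float:
    """Bytes moved implied by bandwidth * time.

    TB/s * us = 1e12 * 1e-6 bytes = 1e6 bytes, so the product is in MB.
    Assumes the reported TB/s is bytes moved divided by kernel time.
    """
    return tb_per_s * time_us


stage1_speedup = speedup_pct(201.9, 190.4)
stage2_speedup = speedup_pct(101.1, 93.8)

stage1_f32_mb = implied_megabytes(4.375, 201.9)
stage1_bf16_mb = implied_megabytes(4.176, 190.4)

print(f"stage1: +{stage1_speedup:.1f}%, stage2: +{stage2_speedup:.1f}%")
print(f"stage1 implied traffic: f32 {stage1_f32_mb:.1f} MB, bf16 {stage1_bf16_mb:.1f} MB")
```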

Test plan

  • test_moe_gemm_w4a16_groupwise_scale[scale_f32] -- PASSED
  • test_moe_gemm_w4a16_groupwise_scale[scale_bf16] -- PASSED
  • bench_kimi25_w4a16.sh with SCALE_DTYPE=bf16 and SCALE_DTYPE=f32 -- correctness verified
  • test_hgemm_splitk -- all 16 tests PASSED

root and others added 9 commits April 9, 2026 08:32
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…duler

The hot_loop_scheduler() in both gemm1 and gemm2 was missing the
rocdl.sched_barrier(0); return guard at the top, causing all scheduler
hints (dead code on dev branch) to actually execute and degrade stage1
from ~204us to ~221us. Also fix ISA dump indices in bench script.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
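
The effect of the guard described in this commit message can be pictured with a toy model. Only `hot_loop_scheduler` and `rocdl.sched_barrier(0)` come from the commit message; `RecordingRocdl`, `scheduling_hint`, and the `guard_present` flag are hypothetical stand-ins that exist only so both paths can be compared without a GPU toolchain.

```python
class RecordingRocdl:
    """Stand-in for the real rocdl module: it only records what was emitted."""

    def __init__(self):
        self.calls = []

    def sched_barrier(self, mask):
        self.calls.append(("sched_barrier", mask))

    def scheduling_hint(self, name):
        # Hypothetical placeholder for any scheduler hint the real code emits.
        self.calls.append(("hint", name))


def hot_loop_scheduler(rocdl, guard_present: bool):
    """Emit scheduler hints unless the early-return guard is in place."""
    if guard_present:
        # The guard: emit a single barrier and skip everything below.
        rocdl.sched_barrier(0)
        return
    # Without the guard, these hints are no longer dead code.
    for name in ("mfma", "vmem_read", "ds_read", "valu"):
        rocdl.scheduling_hint(name)
    rocdl.sched_barrier(0)


with_guard = RecordingRocdl()
hot_loop_scheduler(with_guard, guard_present=True)

without_guard = RecordingRocdl()
hot_loop_scheduler(without_guard, guard_present=False)
```

With the guard, only the barrier is recorded; without it, every hint below the guard is emitted as well, which is the behavior the commit message blames for the stage1 slowdown.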
@yadaish yadaish marked this pull request as ready for review April 9, 2026 17:52
_to_ir(soffset), _to_ir(offset), _to_ir(aux), **kw)


def cvt_off_f32_i4(src_i32, byte_sel=None):
Collaborator

Does this need inline asm?

Collaborator Author

I am not sure. For mfma, we also import from the MLIR dialect package as

_ods_mfma_f32_16x16x32_f16 = globals().get("mfma_f32_16x16x32_f16", None)
_ods_mfma_f32_16x16x32_bf16 = globals().get("mfma_f32_16x16x32_bf16", None)

Could you give some advice on how to deal with them?
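
For context, the `globals().get(name, None)` lookup quoted above is an optional-symbol pattern: the name resolves to the generated op when the MLIR build provides it and to `None` otherwise. One possible way to make the missing case explicit is sketched below. `resolve_optional_op` and the fake namespace are hypothetical, and the `(result_type, operands)` call shape only mirrors the "new (result_type, operands_list) API" wording in the summary, not a verified signature.

```python
def resolve_optional_op(namespace: dict, name: str):
    """Return namespace[name] if present, else a stub that fails loudly."""
    op = namespace.get(name, None)
    if op is not None:
        return op

    def _missing(*args, **kwargs):
        raise NotImplementedError(
            f"{name} is not provided by this MLIR build; "
            "a fallback such as inline asm would be needed"
        )

    return _missing


# Fake namespace standing in for a module's globals().
fake_globals = {
    "mfma_f32_16x16x32_f16": lambda result_type, operands: (result_type, len(operands)),
}

mfma_f16 = resolve_optional_op(fake_globals, "mfma_f32_16x16x32_f16")
mfma_bf16 = resolve_optional_op(fake_globals, "mfma_f32_16x16x32_bf16")

available = mfma_f16("vector<4xf32>", ["a", "b", "c"])
try:
    mfma_bf16("vector<4xf32>", ["a", "b", "c"])
    bf16_error = None
except NotImplementedError as exc:
    bf16_error = str(exc)
```

Failing at call time rather than at import time keeps the module importable on builds that lack the op while still pointing at the exact symbol that is missing.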

if [ -n "${dt_s2}" ] && [ -n "${tf_s2}" ] && [ -n "${tb_s2}" ]; then
_emit_row "moe_w4a16_s2" "${shape_moe}" "${dt_s2}" "${tb_s2}" "${tf_s2}"
fi
done
Collaborator

Benchmark.sh keeps getting bigger. Add a TODO item to define the benchmark configs in pytest instead of keeping everything in the script.

Collaborator Author

OK, I will improve it later.
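
One possible shape for the follow-up suggested in this thread is to hold the benchmark matrix as data rather than shell branches. The sketch below is an assumption about how that could look, not part of this PR: `MoeBenchConfig`, `CONFIGS`, and `CASE_IDS` are hypothetical names, and the shape values are the ones quoted in the Performance section.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MoeBenchConfig:
    """One benchmark case; field names are illustrative."""

    tokens: int
    model_dim: int
    inter_dim: int
    experts: int
    topk: int
    scale_dtype: str
    stage: int

    @property
    def case_id(self) -> str:
        """Short, readable identifier for reports and test ids."""
        return f"s{self.stage}-t{self.tokens}-{self.scale_dtype}"


# Kimi 2.5 TP8 shape from the PR description, crossed with stage and scale dtype.
CONFIGS = [
    MoeBenchConfig(
        tokens=128, model_dim=7168, inter_dim=256, experts=384, topk=8,
        scale_dtype=scale_dtype, stage=stage,
    )
    for stage, scale_dtype in product((1, 2), ("f32", "bf16"))
]
CASE_IDS = [cfg.case_id for cfg in CONFIGS]
```

A list like this could then be passed to `pytest.mark.parametrize("cfg", CONFIGS, ids=CASE_IDS)`, so adding a shape becomes a one-line data change instead of another block in the shell script.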


SCALE_DTYPE=${SCALE_DTYPE:-f32}

python -c "
Collaborator

Is this script needed?

Collaborator Author

No. I will remove it later.

from ..._mlir.dialects import llvm as _llvm
from ..._mlir import ir

if byte_sel is not None:
Collaborator

This should not be put in __init__.py.
