Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
…duler The hot_loop_scheduler() in both gemm1 and gemm2 was missing the rocdl.sched_barrier(0); return guard at the top, causing all scheduler hints (dead code on dev branch) to actually execute and degrade stage1 from ~204us to ~221us. Also fix ISA dump indices in bench script. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
coderfeli
reviewed
Apr 10, 2026
| _to_ir(soffset), _to_ir(offset), _to_ir(aux), **kw) | ||
|
|
||
|
|
||
| def cvt_off_f32_i4(src_i32, byte_sel=None): |
Collaborator
Author
I am not sure. For mfma, we also import from the MLIR dialect package as
_ods_mfma_f32_16x16x32_f16 = globals().get("mfma_f32_16x16x32_f16", None)
_ods_mfma_f32_16x16x32_bf16 = globals().get("mfma_f32_16x16x32_bf16", None)
Could you give some advice on how to deal with them?
coderfeli
reviewed
Apr 10, 2026
| if [ -n "${dt_s2}" ] && [ -n "${tf_s2}" ] && [ -n "${tb_s2}" ]; then | ||
| _emit_row "moe_w4a16_s2" "${shape_moe}" "${dt_s2}" "${tb_s2}" "${tf_s2}" | ||
| fi | ||
| done |
Collaborator
Benchmark.sh keeps getting bigger. Add a TODO item to move the benchmark configs into pytest instead of keeping everything in the script.
Collaborator
Author
OK, I will improve it later.
coderfeli
reviewed
Apr 10, 2026
bench_kimi25_w4a16.sh
Outdated
|
|
||
| SCALE_DTYPE=${SCALE_DTYPE:-f32} | ||
|
|
||
| python -c " |
Collaborator
Author
No. I will remove it later.
coderfeli
reviewed
Apr 11, 2026
| from ..._mlir.dialects import llvm as _llvm | ||
| from ..._mlir import ir | ||
|
|
||
| if byte_sel is not None: |
Collaborator
This should not be put in __init__.py.
Summary
- `(E, G//2, N, 2)` packed layout: two adjacent groups for the same N position packed into one dword, halving scale memory loads (16 vs 32 `buffer_load_dword`)
- `scf.for` loop state, extracted at compute phase via `extract_bf16_scale()`. Eliminates 32 `v_mov_b32` register copies, keeps VGPRs at 140 (3 waves) matching f32 occupancy
- `v_lshlrev_b32 v, 16, v`; odd ku `v_and_b32 v, 0xffff0000, v`
- `shuffle_scale_for_int4()` in `tests/utils.py` auto-handles both f32 `(E,G,N)` and bf16 `(E,G//2,N,2)` layouts
- `scale_is_bf16` flag on `compile_moe_gemm1`/`compile_moe_gemm2`
- `hgemm_splitk.py`: update `rocdl.mfma_f32_16x16x32_bf16/f16` to new `(result_type, operands_list)` API

Performance (Kimi 2.5 TP8, tokens=128, model_dim=7168, inter_dim=256, E=384, topk=8)
Stage1 (gate+up GEMM, tile 16x128x128):
Stage2 (reduce GEMM, tile 16x128x256):
Test plan
- `test_moe_gemm_w4a16_groupwise_scale[scale_f32]` -- PASSED
- `test_moe_gemm_w4a16_groupwise_scale[scale_bf16]` -- PASSED
- `bench_kimi25_w4a16.sh` with `SCALE_DTYPE=bf16` and `SCALE_DTYPE=f32` -- correctness verified
- `test_hgemm_splitk` -- all 16 tests PASSED
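
For readers unfamiliar with the packed bf16 scale layout described in the Summary, the following is a minimal host-side sketch in plain Python, not the kernel code from this PR. It assumes the even group sits in the low 16 bits of each dword and the odd group in the high 16 bits, which is consistent with the shift-by-16 and `0xffff0000` mask instructions quoted in the Summary but is not confirmed by it. The helper names `f32_to_bf16_bits`, `bits_to_f32`, `pack_scales`, and `extract_scale` are hypothetical and exist only for illustration.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Round an f32 value to bf16 (round-to-nearest-even) and return its 16 bits.

    NaN handling is omitted to keep the sketch short.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def bits_to_f32(bits: int) -> float:
    """Reinterpret the low 32 bits of an integer as an f32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def pack_scales(scales):
    """Pack scales[g][n] (G even) into packed[g // 2][n], one dword per group pair.

    Assumption: even group -> low half, odd group -> high half.
    """
    packed = []
    for g in range(0, len(scales), 2):
        row = [
            f32_to_bf16_bits(even) | (f32_to_bf16_bits(odd) << 16)
            for even, odd in zip(scales[g], scales[g + 1])
        ]
        packed.append(row)
    return packed


def extract_scale(dword: int, group: int) -> float:
    """Recover the f32 scale for one group from a packed dword."""
    if group % 2 == 0:
        # Low half: shift left by 16 (the v_lshlrev_b32 case).
        return bits_to_f32(dword << 16)
    # High half: mask with 0xffff0000 (the v_and_b32 case).
    return bits_to_f32(dword & 0xFFFF0000)


# One expert, G=2 groups, N=2 columns.
scales = [[0.5, 1.0], [-2.0, 0.25]]
packed = pack_scales(scales)
```

Because a bf16 value is the upper 16 bits of the corresponding f32, each extraction is a single integer operation followed by a reinterpretation, and one loaded dword serves two groups.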