w16ai4 group v1 #366

Closed
yadaish wants to merge 17 commits into main from dev/yadai_w16ai4_group_v1


Collaborator

yadaish commented Apr 9, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

ClementLinCF and others added 17 commits March 12, 2026 23:45
…tions

- Extract _unpack_int4_to_int8_pair(): shared 7-op int4->int8 bit
  manipulation used by unpack_b_w4a16, unpack_b_w4a8, unpack_b_w4a_fp8,
  and load_b_pack_k32 (was copy-pasted in 4 places)
- Extract _pack_i32_pair_to_i64(): shared (even, odd) -> i64 packing
- Extract _load_groupwise_scale(): shared scale address calculation and
  buffer_load for W4A16 and W4A8 groupwise paths
- Have load_b_raw_w4a8_groupwise_k64 delegate weight load to
  load_b_raw_w4a8_k64 (matching W4A16 groupwise pattern)
- Replace ir.IntegerType.get_signless(32) / ir.F32Type.get() with
  T.i32 / T.f32 to follow project conventions
- Replace arith.constant(..., index=True) with fx.Index(...) throughout
- Add 'bf16' to out_dtype parametrization (was only f16/f32)
- Fix run_moe_stage2 to accept bf16 output dtype
- Fix bytes_moved calculation to treat bf16 as 2-byte (like f16)
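
For readers unfamiliar with the int4 helpers named in the list above, the idea can be sketched in plain Python. This is only an illustration, not the kernel's actual 7-op sequence: the nibble order (low nibble first), the `(x ^ 8) - 8` sign extension, and the placement of `even` in the low 32 bits are all assumptions, and the function names below are hypothetical stand-ins rather than the project's `_unpack_int4_to_int8_pair()` and `_pack_i32_pair_to_i64()`.

```python
from typing import Tuple


def unpack_int4_pair(byte: int) -> Tuple[int, int]:
    """Split one byte into two signed 4-bit values.

    Assumption: the low nibble holds the first value and the high
    nibble holds the second. The real kernel layout may differ.
    """
    lo = byte & 0xF
    hi = (byte >> 4) & 0xF
    # Sign-extend 4-bit two's complement: flip the sign bit, then subtract it.
    return (lo ^ 8) - 8, (hi ^ 8) - 8


def pack_i32_pair_to_i64(even: int, odd: int) -> int:
    """Pack two 32-bit values into one 64-bit word.

    Assumption: `even` occupies the low 32 bits and `odd` the high 32 bits.
    """
    return ((odd & 0xFFFFFFFF) << 32) | (even & 0xFFFFFFFF)


if __name__ == "__main__":
    print(unpack_int4_pair(0xF7))            # low nibble 0x7, high nibble 0xF
    print(hex(pack_i32_pair_to_i64(1, 2)))   # 1 in the low half, 2 in the high half
```

Masking with `0xFFFFFFFF` before shifting keeps negative Python integers from spilling into the other half of the packed word.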

The stage2 kernel (compile_moe_gemm2) already supports out_dtype='bf16'
using bf16 global atomics on gfx94+/gfx95+, but the test harness
blocked it. Verified all 24 new test cases pass on MI355X (gfx950).
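
The bytes_moved fix described above comes down to sizing bf16 output the same as f16. A minimal sketch of that accounting follows, assuming a simple per-dtype size table; `DTYPE_BYTES` and `output_bytes` are hypothetical names and are not taken from the test harness.

```python
# Hypothetical size table: bf16 is a 2-byte type, the same width as f16.
DTYPE_BYTES = {"f32": 4, "f16": 2, "bf16": 2}


def output_bytes(num_elements: int, out_dtype: str) -> int:
    """Return the bytes written for an output tensor of the given dtype."""
    if out_dtype not in DTYPE_BYTES:
        raise ValueError(f"unsupported out_dtype: {out_dtype!r}")
    return num_elements * DTYPE_BYTES[out_dtype]


if __name__ == "__main__":
    for dtype in ("f16", "bf16", "f32"):
        print(dtype, output_bytes(1024, dtype))
```

Raising on an unknown dtype, rather than falling back to a default size, makes a missing table entry fail loudly instead of silently skewing bandwidth numbers.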

yadaish closed this Apr 13, 2026
5 participants