Add INT4 GEMV ESIMD kernel with per-group scale (group_size=128)#352

Open
liu-shaojun wants to merge 4 commits into main from dev/int4-esimd-gemv

Conversation

@liu-shaojun
Contributor

Implement symmetric INT4 GEMV for Qwen3.5-122B-A10B decode (M=1) on XPU. Two kernels: esimd_gemv_int4 (single) and esimd_gemv_int4_fused2 (fused 2-GEMV). Phase 1: VL=128 fixed, K_SPLIT=1/2/4/8, FP32 accumulation.

  • csrc/xpu/esimd_kernels/int4_GEMV.h: kernel implementation
    • int4_dequant: byte-level unpack (low/high nibble) + sign extend
    • Dual accumulator (even/odd) for ILP
    • select_vl_ks_int4: VL/K_SPLIT auto-selection with group alignment
  • csrc/xpu/esimd_kernel.sycl: SYCL wrapper functions
  • csrc/xpu/torch_extension.cc: op registration
  • include/kernel_ops.h: C++ declarations
  • python/ops.py + __init__.py: Python bindings
  • tests/test_gemv_int4.py: correctness + performance test suite

All tests pass. Kernel correctness <0.04% relative error.
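For reviewers without an XPU at hand, the dequant and GEMV described above can be sketched in plain NumPy. This is a hypothetical scalar reference, not the ESIMD kernel: the low-nibble-first packing order, the sign-extension step, and the even/odd dual-accumulator split mirror the bullet points above, but function names and shapes are assumptions for illustration.

```python
import numpy as np

def int4_dequant_row(packed, scales, group_size=128):
    """packed: (K//2,) uint8; scales: (K//group_size,) float32 -> (K,) float32."""
    q = np.empty(packed.size * 2, dtype=np.int8)
    q[0::2] = packed & 0x0F          # low nibble first (assumed packing order)
    q[1::2] = packed >> 4            # high nibble
    q = np.where(q >= 8, q - 16, q)  # sign-extend 4-bit two's complement
    # each group of group_size contiguous weights shares one scale
    return (q.reshape(-1, group_size) * scales[:, None]).ravel()

def gemv_int4_ref(x, w_packed, w_scales, group_size=128):
    """M=1 decode GEMV: y[n] = sum_k x[k] * dequant(w)[n, k]."""
    y = np.empty(w_packed.shape[0], dtype=np.float32)
    for n in range(w_packed.shape[0]):
        w = int4_dequant_row(w_packed[n], w_scales[n], group_size)
        # accumulate even and odd lanes separately, mirroring the
        # kernel's dual-accumulator ILP trick, then combine
        y[n] = w[0::2] @ x[0::2] + w[1::2] @ x[1::2]
    return y
```

The actual kernel does this vectorized over VL lanes with K_SPLIT-way work splitting and FP32 accumulators; the reference only fixes the numerics.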

liu-shaojun and others added 4 commits April 13, 2026 09:27
Implement symmetric INT4 GEMV for Qwen3.5-122B-A10B decode (M=1) on XPU.

- int4_GEMV.h: change dequant from two's complement (val-16 if val>=8)
  to GGML q4_0 unsigned convention (uint4 - 8). Scale can be negative.
- test_gemv_int4.py: update Python reference to match GGML q4_0 format
  (scale = max/(-8), quantize with zero_point=8).
  Add GGML C library compatibility tests:
  - test_ggml_packing_format: verify nibble order and scale semantics
  - test_ggml_vs_python_quantize: Python vs C library byte-level match
  - test_ggml_kernel_e2e: C library quantize -> kernel -> compare ref
  Add benchmark_vs_ipex: ESIMD vs IPEX performance comparison.
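The q4_0 convention change above (scale = max/(-8), quantize with zero_point=8, dequant as (uint4 - 8) * scale) can be sketched as a small NumPy reference. Function names and the group size are assumptions for illustration; GGML's q4_0 uses blocks of 32, while this PR uses per-group scales with group_size=128.

```python
import numpy as np

GROUP_SIZE = 128  # this PR's group size; GGML q4_0 itself uses 32

def quantize_q4(x):
    """x: (K,) float32, K a multiple of GROUP_SIZE -> (uint8 codes, float32 scales)."""
    xg = x.reshape(-1, GROUP_SIZE)
    # pick the signed value with the largest magnitude in each group;
    # scale = max / -8 maps that extreme value to code 0 (may be negative)
    idx = np.argmax(np.abs(xg), axis=1)
    maxv = xg[np.arange(xg.shape[0]), idx]
    scale = maxv / -8.0
    scale[scale == 0] = 1.0  # avoid division by zero for all-zero groups
    q = np.clip(np.round(xg / scale[:, None]) + 8, 0, 15).astype(np.uint8)
    return q, scale.astype(np.float32)

def dequantize_q4(q, scale):
    # GGML q4_0 unsigned convention: (uint4 - 8) * scale
    return ((q.astype(np.float32) - 8.0) * scale[:, None]).ravel()
```

Because the scale carries the sign, codes stay in [0, 15] and no sign extension is needed at dequant time, unlike the original two's-complement scheme.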
…lysis

Merge test_correctness_unit_scale, test_correctness_with_scale,
test_correctness_large_k, and test_quantization_error_analysis into
a single test_correctness_detailed() that shows for each shape:
  - First 5 values of kernel output, dequant ref, and fp16 ref
  - vs dequant weight: kernel computation error (expected <0.1%)
  - vs original fp16: quantization loss (expected ~10%)
Add benchmark_int4_vs_fp8() to verify INT4 kernel has no performance
regression relative to FP8 on identical (N, K) shapes. Reports latency,
bandwidth, and speedup for all Qwen3.5-122B TP4 decode shapes with
cache-busting to measure true DRAM bandwidth.
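The cache-busting idea in the benchmark above can be illustrated with a host-side sketch: rotate over several independent copies of the operand so that successive timed iterations cannot hit in cache, and the measured time reflects DRAM traffic. This is a hypothetical CPU/NumPy analogue, not the XPU benchmark in tests/test_gemv_int4.py.

```python
import time
import numpy as np

def bench_bandwidth(n_bytes, iters=50, n_copies=8):
    """Estimate read bandwidth in GB/s while defeating cache reuse."""
    # n_copies independent buffers: iteration i touches buffer i % n_copies,
    # so no buffer is re-read before the others have evicted it
    bufs = [np.random.rand(n_bytes // 8) for _ in range(n_copies)]
    sink = 0.0
    t0 = time.perf_counter()
    for i in range(iters):
        sink += bufs[i % n_copies].sum()  # streams the whole buffer once
    t1 = time.perf_counter()
    return n_bytes * iters / (t1 - t0) / 1e9
```

With n_copies=1 the same buffer stays cache-resident and the figure inflates toward cache bandwidth; a total footprint larger than the last-level cache is what makes the number a DRAM measurement.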

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@liu-shaojun liu-shaojun marked this pull request as ready for review April 14, 2026 04:54
2 participants