Add INT4 GEMV ESIMD kernel with per-group scale (group_size=128) #352
Open
liu-shaojun wants to merge 4 commits into main
Conversation
Implement symmetric INT4 GEMV for Qwen3.5-122B-A10B decode (M=1) on XPU. Two kernels: `esimd_gemv_int4` (single) and `esimd_gemv_int4_fused2` (fused 2-GEMV). Phase 1: VL=128 fixed, K_SPLIT=1/2/4/8, FP32 accumulation.

- csrc/xpu/esimd_kernels/int4_GEMV.h: kernel implementation
  - `int4_dequant`: byte-level unpack (low/high nibble) + sign extend
  - Dual accumulator (even/odd) for ILP
  - `select_vl_ks_int4`: VL/K_SPLIT auto-selection with group alignment
- csrc/xpu/esimd_kernel.sycl: SYCL wrapper functions
- csrc/xpu/torch_extension.cc: op registration
- include/kernel_ops.h: C++ declarations
- python/ops.py + __init__.py: Python bindings
- tests/test_gemv_int4.py: correctness + performance test suite

All tests pass. Kernel correctness <0.04% relative error.
- int4_GEMV.h: change dequant from two's complement (val - 16 if val >= 8) to the GGML q4_0 unsigned convention (uint4 - 8). Scale can be negative.
- test_gemv_int4.py: update the Python reference to match the GGML q4_0 format (scale = max / (-8), quantize with zero_point = 8).

Add GGML C library compatibility tests:
- `test_ggml_packing_format`: verify nibble order and scale semantics
- `test_ggml_vs_python_quantize`: Python vs. C library byte-level match
- `test_ggml_kernel_e2e`: C library quantize -> kernel -> compare against reference

Add `benchmark_vs_ipex`: ESIMD vs. IPEX performance comparison.
…lysis

Merge test_correctness_unit_scale, test_correctness_with_scale, test_correctness_large_k, and test_quantization_error_analysis into a single `test_correctness_detailed()` that shows for each shape:
- First 5 values of kernel output, dequant ref, and fp16 ref
- vs. dequant weight: kernel computation error (expected <0.1%)
- vs. original fp16: quantization loss (expected ~10%)
Add `benchmark_int4_vs_fp8()` to verify that the INT4 kernel has no performance regression relative to FP8 on identical (N, K) shapes. Reports latency, bandwidth, and speedup for all Qwen3.5-122B TP4 decode shapes, with cache-busting to measure true DRAM bandwidth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>