Add INT4 GEMV ESIMD kernel with per-group scale (group_size=128) #352
Open
liu-shaojun wants to merge 4 commits into main
Conversation
Implement symmetric INT4 GEMV for Qwen3.5-122B-A10B decode (M=1) on XPU. Two kernels: `esimd_gemv_int4` (single) and `esimd_gemv_int4_fused2` (fused 2-GEMV). Phase 1: VL=128 fixed, K_SPLIT=1/2/4/8, FP32 accumulation.

- csrc/xpu/esimd_kernels/int4_GEMV.h: kernel implementation
  - `int4_dequant`: byte-level unpack (low/high nibble) + sign extend
  - Dual accumulator (even/odd) for ILP
  - `select_vl_ks_int4`: VL/K_SPLIT auto-selection with group alignment
- csrc/xpu/esimd_kernel.sycl: SYCL wrapper functions
- csrc/xpu/torch_extension.cc: op registration
- include/kernel_ops.h: C++ declarations
- python/ops.py + __init__.py: Python bindings
- tests/test_gemv_int4.py: correctness + performance test suite

All tests pass. Kernel correctness <0.04% relative error.
- int4_GEMV.h: change dequant from two's complement (val - 16 if val >= 8) to the GGML q4_0 unsigned convention (uint4 - 8). Scale can be negative.
- test_gemv_int4.py: update the Python reference to match the GGML q4_0 format (scale = max / (-8), quantize with zero_point = 8).

Add GGML C library compatibility tests:
- `test_ggml_packing_format`: verify nibble order and scale semantics
- `test_ggml_vs_python_quantize`: Python vs. C library byte-level match
- `test_ggml_kernel_e2e`: C library quantize -> kernel -> compare against reference

Add `benchmark_vs_ipex`: ESIMD vs. IPEX performance comparison.
…lysis

Merge test_correctness_unit_scale, test_correctness_with_scale, test_correctness_large_k, and test_quantization_error_analysis into a single `test_correctness_detailed()` that shows for each shape:
- First 5 values of kernel output, dequant ref, and fp16 ref
- vs. dequant weight: kernel computation error (expected <0.1%)
- vs. original fp16: quantization loss (expected ~10%)
Add `benchmark_int4_vs_fp8()` to verify that the INT4 kernel has no performance regression relative to FP8 on identical (N, K) shapes. Reports latency, bandwidth, and speedup for all Qwen3.5-122B TP4 decode shapes, with cache-busting to measure true DRAM bandwidth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>