## Benchmark data (aprender-bench-compute)

| Operation | Size | Throughput | vs memcpy |
|---|---|---|---|
| Q8_0 dequant | 262K | 1.22 Gelem/s (4.88 GB/s) | 5.5x below |
| Q4_0 dequant | 262K | 1.25 Gelem/s (5.00 GB/s) | 5.4x below |
| Q8_0 quantize | 262K | 302 Melem/s (1.21 GB/s) | 22x below |
| Q4_0 quantize | 262K | 284 Melem/s (1.14 GB/s) | 24x below |
| memcpy (f32 clone) | 262K | 25 GiB/s (26.8 GB/s) | baseline |
## Key observations

- Q4_0 and Q8_0 dequant run at nearly identical speed (~1.2 Gelem/s). Q4_0 reads half the data, so it should be ~2x faster if memory-bandwidth-limited; equal speed means both are compute-limited on the unpacking logic.
- Quantize is ~4x slower than dequant. Quantize must find per-block scales (a reduction over each block), but 4x overhead is high.
- Dequant produces output at 4.88 GB/s vs 26.8 GB/s for memcpy, a 5.5x gap. With AVX2 SIMD unpacking, Q8_0 dequant should approach memcpy speed (just multiply int8 × scale). Q4_0 needs nibble extraction but should still be 3-4x faster with SIMD.
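To make the compute-limited point concrete, here is a minimal scalar Q4_0 dequant sketch, assuming a GGML-style block layout (32 elements per block: one f32 scale plus 16 packed bytes, low nibbles holding the first 16 elements); aprender's actual types and field names may differ. Each output element costs a mask or shift, a bias subtract, an int-to-float convert, and a multiply, which is why the scalar kernel saturates on ALU work rather than memory bandwidth.

```rust
const QK4_0: usize = 32; // elements per Q4_0 block

// Assumed block layout (hypothetical names, GGML-style): aprender's real
// struct may store the scale as f16 or order fields differently.
#[allow(non_camel_case_types)]
struct BlockQ4_0 {
    scale: f32,          // per-block scale
    qs: [u8; QK4_0 / 2], // 32 4-bit values packed two per byte
}

fn dequant_q4_0_scalar(block: &BlockQ4_0, out: &mut [f32; QK4_0]) {
    for i in 0..QK4_0 / 2 {
        let byte = block.qs[i];
        // Per output element: mask/shift + bias subtract + convert + mul,
        // ~4-5 scalar ops vs. a single load/store pair for memcpy. This is
        // the unpacking logic the observation above calls compute-limited.
        let lo = (byte & 0x0F) as i8 - 8; // low nibble, biased to [-8, 7]
        let hi = (byte >> 4) as i8 - 8;   // high nibble
        out[i] = lo as f32 * block.scale;
        out[i + QK4_0 / 2] = hi as f32 * block.scale;
    }
}

fn main() {
    // Every packed byte 0x99: both nibbles decode to 9 - 8 = 1, scaled by 0.5.
    let block = BlockQ4_0 { scale: 0.5, qs: [0x99; QK4_0 / 2] };
    let mut out = [0.0f32; QK4_0];
    dequant_q4_0_scalar(&block, &mut out);
    assert!(out.iter().all(|&v| (v - 0.5).abs() < 1e-6));
    println!("first elements: {:?}", &out[..4]);
}
```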
## Impact

For fused dequant+matvec in realizar, dequant overhead is a significant fraction. A 4096×11008 Q4_0 matrix holds ~45M Q4 values (22.5 MB packed) → ~36 ms dequant time at the current 1.25 Gelem/s. With SIMD closing the 5.4x gap: ~6.7 ms.
## Suggested fix

- AVX2/AVX-512 vectorized Q8_0 dequant: `_mm256_cvtepi8_epi32` + `_mm256_mul_ps`
- AVX2 Q4_0 dequant: nibble extraction via shift+mask, then cvt + mul
- Process 32 elements per iteration (one Q4_0/Q8_0 block = 32 elements)
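A hedged sketch of the first suggested kernel: AVX2 Q8_0 dequant using `_mm256_cvtepi8_epi32` to sign-extend int8 lanes and `_mm256_mul_ps` to apply the block scale, with a scalar reference for verification. The block layout (32 int8 values plus an f32 scale) and function names are assumptions, not aprender's actual API.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

const QK8_0: usize = 32; // elements per Q8_0 block

fn dequant_q8_0_scalar(scale: f32, qs: &[i8; QK8_0], out: &mut [f32; QK8_0]) {
    for i in 0..QK8_0 {
        out[i] = qs[i] as f32 * scale;
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dequant_q8_0_avx2(scale: f32, qs: &[i8; QK8_0], out: &mut [f32; QK8_0]) {
    let vscale = _mm256_set1_ps(scale);
    // One Q8_0 block = 32 elements = 4 iterations of 8 lanes.
    for i in (0..QK8_0).step_by(8) {
        // Load 8 int8 values into the low half of an XMM register.
        let bytes = _mm_loadl_epi64(qs.as_ptr().add(i) as *const __m128i);
        let ints = _mm256_cvtepi8_epi32(bytes); // sign-extend i8 -> i32
        let floats = _mm256_cvtepi32_ps(ints);  // i32 -> f32
        let scaled = _mm256_mul_ps(floats, vscale);
        _mm256_storeu_ps(out.as_mut_ptr().add(i), scaled);
    }
}

fn main() {
    let qs: [i8; QK8_0] = std::array::from_fn(|i| i as i8 - 16);
    let scale = 0.25f32;

    let mut expected = [0.0f32; QK8_0];
    dequant_q8_0_scalar(scale, &qs, &mut expected);

    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        let mut out = [0.0f32; QK8_0];
        // Safety: AVX2 availability was checked at runtime above.
        unsafe { dequant_q8_0_avx2(scale, &qs, &mut out) };
        assert_eq!(out, expected);
        println!("avx2 matches scalar reference");
        return;
    }
    println!("avx2 unavailable; scalar result[0] = {}", expected[0]);
}
```

The Q4_0 variant would add a shift+mask nibble-extraction step before the same cvt + mul tail, which is where the remaining gap to memcpy comes from.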
## Reproduce

```
cargo bench -p aprender-bench-compute --bench quantization
```