
perf: Q4_0/Q8_0 dequant throughput 5x below memcpy ceiling — missing SIMD vectorization #386

@noahgift

Description

Benchmark data (aprender-bench-compute)

| Operation | Size | Throughput | vs memcpy |
| --- | --- | --- | --- |
| Q8_0 dequant | 262K | 1.22 Gelem/s (4.88 GB/s) | 5.5x below |
| Q4_0 dequant | 262K | 1.25 Gelem/s (5.00 GB/s) | 5.4x below |
| Q8_0 quantize | 262K | 302 Melem/s (1.21 GB/s) | 22x below |
| Q4_0 quantize | 262K | 284 Melem/s (1.14 GB/s) | 24x below |
| memcpy (f32 clone) | 262K | 25 GiB/s (26.8 GB/s) | baseline |

Key observations

  1. Q4_0 and Q8_0 dequant are nearly identical speed (~1.2 Gelem/s). Q4_0 reads half the data, so it should be ~2x faster if memory-bandwidth-limited. Equal speed means compute-limited on the unpacking logic.

  2. Quantize is ~4x slower than dequant. Quantize does need a per-block reduction to find each scale, but that alone does not explain a 4x overhead.

  3. Dequant at 4.88 GB/s output vs 26.8 GB/s memcpy — 5.5x gap. With AVX2 SIMD unpacking, Q8_0 dequant should approach memcpy speed (just multiply int8 × scale). Q4_0 needs nibble extraction but should still be 3-4x faster with SIMD.
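The equal Q4_0/Q8_0 speeds point at unpack cost; a scalar sketch of one Q4_0 block makes the per-element work visible. This assumes the common GGML-style Q4_0 layout (32 elements per block, one scale, 16 bytes of packed nibbles, values biased by 8) — aprender's actual structs and names may differ:

```rust
// Scalar Q4_0 dequant for one 32-element block (hypothetical layout,
// modeled on the GGML Q4_0 format; scale shown as f32 for simplicity).
// Each byte packs two 4-bit values: low nibbles hold elements 0..16,
// high nibbles hold elements 16..32. Unpacking costs a mask, a shift,
// a bias subtract, a convert, and a multiply per element -- the
// compute-bound inner loop the benchmark exposes.
fn dequant_q4_0_block(scale: f32, qs: &[u8; 16], out: &mut [f32; 32]) {
    for (i, &byte) in qs.iter().enumerate() {
        let lo = (byte & 0x0F) as i8 - 8; // low nibble  -> element i
        let hi = (byte >> 4) as i8 - 8;   // high nibble -> element i + 16
        out[i] = lo as f32 * scale;
        out[i + 16] = hi as f32 * scale;
    }
}
```

Note that a scalar compiler rarely auto-vectorizes this loop well because of the strided low/high output pattern, which is consistent with both formats bottoming out at the same ~1.2 Gelem/s.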

Impact

For fused dequant+matvec in realizar, dequant overhead is a significant fraction. A 4096×11008 Q4_0 matrix holds ~45.1M Q4 values (~22.5 MB packed); at the current 1.25 Gelem/s that is ~36 ms per full dequant. With a ~5x SIMD speedup it drops to ~7 ms.

Suggested fix

  • AVX2/AVX-512 vectorized Q8_0 dequant: _mm256_cvtepi8_epi32 → _mm256_cvtepi32_ps → _mm256_mul_ps
  • AVX2 Q4_0 dequant: nibble extraction via shift+mask, then cvt + mul
  • Process 32 elements per iteration (one Q4_0/Q8_0 block = 32 elements)
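The first bullet can be sketched as follows. Hedged: the block layout (one f32 scale plus 32 int8 values) and the function names are illustrative assumptions, not aprender's actual API; a runtime feature check keeps a scalar fallback so the code stays portable:

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dequant_q8_0_block_avx2(scale: f32, qs: &[i8; 32], out: &mut [f32; 32]) {
    use std::arch::x86_64::*;
    let d = _mm256_set1_ps(scale);
    for i in (0..32).step_by(8) {
        // 8 int8 lanes -> sign-extend to int32 -> f32 -> multiply by block scale.
        let q8 = _mm_loadl_epi64(qs.as_ptr().add(i) as *const __m128i);
        let f = _mm256_cvtepi32_ps(_mm256_cvtepi8_epi32(q8));
        _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_mul_ps(f, d));
    }
}

/// Dispatch wrapper: AVX2 when available, scalar otherwise.
fn dequant_q8_0_block(scale: f32, qs: &[i8; 32], out: &mut [f32; 32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 availability was just checked at runtime.
            unsafe { dequant_q8_0_block_avx2(scale, qs, out) };
            return;
        }
    }
    // Scalar fallback: same math, one element at a time.
    for (o, &q) in out.iter_mut().zip(qs.iter()) {
        *o = q as f32 * scale;
    }
}
```

One block per iteration matches the format's 32-element granularity; the Q4_0 variant would add a `_mm_and_si128`/`_mm_srli_epi16` nibble-extraction step before the same convert-and-multiply tail.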

Reproduce

cargo bench -p aprender-bench-compute --bench quantization

Metadata


Labels: P2 (Medium priority), performance (Performance optimization)
