llama-quant : overlap compute and write with double buffering #21507

Open
nuri-yoo wants to merge 1 commit into ggml-org:master from nuri-yoo:nuri-yoo/quantize-pipeline

Conversation


@nuri-yoo nuri-yoo commented Apr 6, 2026

Overview

This PR pipelines the main quantization loop so that the previous tensor's disk write overlaps with the current tensor's compute, using a background writer thread and double-buffered work areas.

On macOS (where mmap is disabled), this avoids blocking the main thread on sequential writes. The output is bit-identical to the original.

Benchmark (Qwen3-1.7B Q5_K_M → Q4_0, M4 Pro, macOS):

         Wall time
Before   ~1.86 s
After    ~1.30 s

The improvement should scale with model size since larger tensors have proportionally more I/O to overlap with computation.

Testing

  • Output is bit-identical (MD5 match) to the original
  • --dry-run mode works correctly
  • Exception safety: requantize error exits cleanly without crash
  • test-quantize-fns passes
  • test-quantize-perf passes

Additional information

Closes #20829


@nuri-yoo nuri-yoo requested a review from ggerganov as a code owner April 6, 2026 09:04


Merging this pull request may close issue #20829: Refactor: make llama-quantize faster and more efficient
