llama-quant : overlap compute and write with double buffering #21507

Open
nuri-yoo wants to merge 1 commit into ggml-org:master from nuri-yoo:nuri-yoo/quantize-pipeline

Conversation


@nuri-yoo nuri-yoo commented Apr 6, 2026

Overview

This PR pipelines the main quantization loop so that the previous tensor's disk write overlaps with the current tensor's compute, using a background writer thread and double-buffered work areas.

On macOS (where mmap is disabled), this avoids blocking the main thread on sequential writes. The output is bit-identical to the original.

Benchmark (Qwen3-1.7B Q5_K_M → Q4_0, M4 Pro, macOS):

         Wall time
Before   ~1.86 s
After    ~1.30 s

The improvement should scale with model size since larger tensors have proportionally more I/O to overlap with computation.

Testing

  • Output is bit-identical (MD5 match) to the original
  • --dry-run mode works correctly
  • Exception safety: requantize error exits cleanly without crash
  • test-quantize-fns passes
  • test-quantize-perf passes

Additional information

Closes #20829


@nuri-yoo nuri-yoo requested a review from ggerganov as a code owner April 6, 2026 09:04


Merging this pull request may close issue #20829: Refactor: make llama-quantize faster and more efficient
