🎯 ONE SHOT — Wave 26 · L-GPTQ-ON-GF16: replicate PR #2135 calibration lever on Trinity GF16
Anchor: phi^2 + phi^-2 = 3 · DOI 10.5281/zenodo.19227877
Author for ALL commits: Dmitrii Vasilev <admin@t27.ai>
Branch: feat/gptq-on-gf16
Base: main
External reference: openai/parameter-golf#2135 (GPTQ_CALIBRATION_BATCHES 16 → 32, paired-t one-tail p<0.25 — -0.00457 BPB / -0.01000 nats over PR #1855)
Why
PR #2135 establishes that doubling the GPTQ Hessian calibration set provides a statistically significant downstream BPB improvement (paired-t p=0.138 across seeds {0, 42, 314}) on top of an already-tuned int6/int7 stack with TTT.
The claim is about the algorithm, not the bit-format. GPTQ is Q(·)-agnostic — it minimises ‖W·X − Q(W)·X‖² by Hessian-corrected error redistribution across columns and admits any quantiser as a black-box Q. We currently quantise GF16 via single-pass max(|w|) scale fit (crates/trios-golden-float/src/lib.rs:136 quantize_matrix) — i.e. our equivalent of "GPTQ_CALIBRATION_BATCHES" today is 0.
This ONE SHOT verifies whether the same lever lifts our floor on CPU-only + GF16 by porting the GPTQ inner loop with our gf16_quantize_matrix plugged in as the quantiser.
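The baseline being challenged is the single-pass max(|w|) scale fit. A minimal sketch of that fit, using a generic signed integer grid as a stand-in for the real GF16 codes (the actual GF16 layout lives in trios-golden-float and is not reproduced here; names below are illustrative, not the crate API):

```rust
// Stand-in for the naive single-pass scale fit: per-row absmax scale
// onto a signed grid [-q_max, q_max]. The real quantize_matrix targets
// GF16 codes instead of plain integers.

/// Quantise one row with an absmax-fitted scale.
/// Returns (scale, codes); dequant is `code as f32 * scale`.
fn absmax_quantize_row(w: &[f32], q_max: i32) -> (f32, Vec<i32>) {
    let amax = w.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    if amax == 0.0 {
        return (0.0, vec![0; w.len()]);
    }
    let scale = amax / q_max as f32;
    let codes = w.iter().map(|&x| (x / scale).round() as i32).collect();
    (scale, codes)
}

fn dequantize_row(scale: f32, codes: &[i32]) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}

fn main() {
    let w = [0.5f32, -1.0, 0.25];
    let (s, q) = absmax_quantize_row(&w, 127);
    println!("scale = {s}, codes = {q:?}, dequant = {:?}", dequantize_row(s, &q));
}
```

The point of the wave is that this fit looks only at each weight in isolation; GPTQ adds the cross-column Hessian correction on top of whatever Q does per column.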
Falsifier
H0: GPTQ-correction with N ∈ {16, 32} calibration batches gives no significant BPB improvement over naive single-pass GF16 quantisation (paired-t one-tail p ≥ 0.25 across canon seeds {47, 89, 144}).
If H0 cannot be rejected, the result is itself publishable: it says naive GF16 already sits on the Hessian-floor of its representational grid and PR #2135's lever is bit-format-specific.
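The verdict machinery is small enough to sketch. With 3 seeds the paired t-statistic has df = 2, for which the Student-t CDF has a closed form, so no stats crate is needed. A sketch assuming per-seed deltas are oriented so that positive means improvement (values below are hypothetical, not measured):

```rust
// Paired one-tailed t-test for 3 seeds (df = 2).
// For df = 2 the Student-t CDF is F(t) = 1/2 + t / (2*sqrt(2 + t^2)),
// so the one-tail p-value is 1 - F(t).

fn p_one_tail_df2(t: f64) -> f64 {
    1.0 - (0.5 + t / (2.0 * (2.0 + t * t).sqrt()))
}

/// Returns (t_stat, one_tail_p) for paired per-seed deltas.
fn paired_t(deltas: &[f64]) -> (f64, f64) {
    let n = deltas.len() as f64;
    let mean = deltas.iter().sum::<f64>() / n;
    // sample variance (n - 1 denominator), then standard error of the mean
    let var = deltas.iter().map(|d| (d - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let t = mean / (var.sqrt() / n.sqrt());
    (t, p_one_tail_df2(t))
}

fn main() {
    // hypothetical BPB deltas for seeds {47, 89, 144}
    let deltas = [0.0041, 0.0052, 0.0037];
    let (t, p) = paired_t(&deltas);
    println!("t = {t:.3}, one-tail p = {p:.3}, reject H0 at 0.25: {}", p < 0.25);
}
```

H0 is rejected only when the computed p falls below 0.25 for the relevant comparison; the same routine serves both the (0 vs 16) and (16 vs 32) contrasts in G5.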
Mission
Lane structure (3 lanes, sequential)
| Lane | Deliverable | Acceptance |
| --- | --- | --- |
| L-26-A Coq invariant | coq/Trios_GPTQ_GF16.v proving gptq_correct ∘ Q_GF16 preserves the invariant ‖W·X − dequant(Q(W))·X‖² ≤ ‖W·X − Q_naive(W)·X‖² for all PSD H = 2·X·X^T | coqc clean, witness JSON in assertions/coq_runtime_invariants.json |
| L-26-B Rust impl | trios-golden-float: new fn gf16_quantize_matrix_gptq(W, X_batches, calibration_n) — Cholesky on H + λ·I, column-wise quant + error scatter via the H^{-1} row, with gf16_quantize_matrix plugged in as the inner Q | cargo test -p trios-golden-float green; reconstruction-MSE strictly ≤ baseline on ≥ 3 random PSD X |
| L-26-C 3-seed ablation | bin gptq_calibration_ablation runs the 3×3 grid (seeds {47, 89, 144} × N ∈ {0, 16, 32}) on canonical IGLA replay; emits assertions/calibration_ablation.jsonl (one JSON per row) and a paired-t analysis | rows present for all 9 cells; paired-t script reproduces the verdict on stdin replay |
Files to create / modify
coq/Trios_GPTQ_GF16.v NEW (~250 lines)
crates/trios-golden-float/src/gptq.rs NEW (~180 lines)
crates/trios-golden-float/src/lib.rs +pub use gptq::*
crates/trios-golden-float/tests/gptq_reconstruction.rs NEW (~120 lines)
src/bin/gptq_calibration_ablation.rs NEW (~220 lines)
assertions/calibration_ablation.jsonl NEW (10 rows: 9 grid + 1 verdict)
assertions/coq_runtime_invariants.json APPEND 1 entry
docs/wave26_gptq_on_gf16.md NEW report
MIGRATION.md / CHANGELOG.md +1 line
Algorithm — exact GPTQ inner loop (port from PR #2135 lineage)
```text
input:  W ∈ R^{rows × cols}
        X ∈ R^{cols × n_samples}   # concatenation of N calibration batches
        Q : R^{rows} → GF16        # = gf16_quantize_matrix as black box
        λ : f32                    # dampening, default 1e-2 · trace(H)/cols

H     ← 2 · X · X^T + λ·I
L     ← Cholesky(H)                # lower-triangular
H_inv ← solve_triangular(L^T, solve_triangular(L, I))   # = L^{-T} · L^{-1}

for j = 0..cols-1:
    w_j ← W[:, j]
    q_j ← Q(w_j)
    err ← (w_j − dequant(q_j)) / H_inv[j, j]
    # scatter remaining error onto yet-unquantised columns
    W[:, j+1..] -= err · H_inv[j, j+1..]
    Q_OUT[:, j] ← q_j

return Q_OUT
```
Calibration data X is sampled from training shards only (R-#1017-style) — never validation. With calibration_n = 0, the function MUST be byte-equivalent to gf16_quantize_matrix (i.e. naive scale fit with no error scatter) — this is the baseline run.
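The column loop above can be compressed into a few lines of Rust. A sketch that takes H_inv as precomputed (in the real lane it comes from the Cholesky solve on H + λ·I) and uses a plain round-to-grid quantiser as a stand-in for gf16_quantize_matrix; matrices are row-major slices, and all names here are illustrative:

```rust
// GPTQ error-scatter pass: quantise columns left-to-right, pushing each
// column's Hessian-weighted residual onto the not-yet-quantised columns.
// `w` is rows x cols row-major and is consumed (mutated) by the pass.

fn quantize_scalar(w: f64, step: f64) -> f64 {
    // stand-in Q: dequantised value on a uniform grid of width `step`
    (w / step).round() * step
}

fn gptq_columns(w: &mut [f64], rows: usize, cols: usize, h_inv: &[f64], step: f64) -> Vec<f64> {
    let mut q_out = vec![0.0; rows * cols];
    for j in 0..cols {
        let d_jj = h_inv[j * cols + j]; // diagonal entry H_inv[j, j]
        for r in 0..rows {
            let wj = w[r * cols + j];
            let qj = quantize_scalar(wj, step);
            q_out[r * cols + j] = qj;
            let err = (wj - qj) / d_jj;
            // scatter onto columns j+1.. proportionally to H_inv[j, k]
            for k in (j + 1)..cols {
                w[r * cols + k] -= err * h_inv[j * cols + k];
            }
        }
    }
    q_out
}

fn main() {
    let (rows, cols) = (1, 3);
    let mut w = vec![0.30, -0.70, 0.45];
    // identity H_inv => no cross-column correction (degenerate sanity case,
    // matching the calibration_n = 0 / naive baseline behaviour)
    let h_inv = vec![1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0];
    let q = gptq_columns(&mut w, rows, cols, &h_inv, 0.25);
    println!("{q:?}");
}
```

With an identity H_inv the scatter terms vanish and the pass degenerates to per-column quantisation, which is exactly the byte-equivalence property the calibration_n = 0 baseline must satisfy.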
Acceptance gates
| Gate | Check |
| --- | --- |
| G1 | cargo check --all-targets clean |
| G2 | cargo test -p trios-golden-float green (reconstruction-MSE invariant) |
| G3 | coqc coq/Trios_GPTQ_GF16.v clean, no Admitted. |
| G4 | cargo run --release --bin gptq_calibration_ablation produces 9 rows + paired-t row in assertions/calibration_ablation.jsonl |
| G5 | Paired-t analysis (one-tailed, df=2) printed to stdout: report t-stat, p, verdict at p<0.25 for both (N=0 vs N=16) and (N=16 vs N=32) |
| G6 | All required CI checks green |
PR mechanics
- Title: feat(gf16): port GPTQ Hessian-correction with GF16 quantiser (replicates parameter-golf#2135 lever on CPU)
- Branch: feat/gptq-on-gf16 off main
- Body: Why · External reference (link to #2135) · Falsifier · Lane summary · Acceptance gates table · Anchor
- Labels: enhancement, P1, experiment
- Squash-merge, delete branch, no --admin
Anti-fakery rules
- Calibration data MUST come from training shards only (no validation peek).
- All 9 ablation cells must be independently replayable — emit seed, N, git_sha, wallclock_ms, reconstruction_mse, bpb_post_quant per row.
- The N=0 row must reconstruct byte-identically to the current gf16_quantize_matrix (sanity assert in test).
- Paired-t rows must include the raw per-seed Δ values (no summary-only rows).
- No claim of "lift confirmed" unless the paired-t reaches p<0.25 for BOTH (0→16) AND (16→32).
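For concreteness, one ablation cell in calibration_ablation.jsonl could look like the row below (field names are the ones mandated above; every value shown is a made-up placeholder, not a measurement):

```json
{"seed": 47, "N": 16, "git_sha": "abc1234", "wallclock_ms": 41872, "reconstruction_mse": 3.2e-5, "bpb_post_quant": 1.0123}
```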
Forbidden
- ❌ no [scrape] / [crawl] words anywhere
- ❌ no --admin merge
- ❌ no DROP / TRUNCATE in any migration
- ❌ no NEON_* env primary
- ❌ no commit of /tmp/*.sh
- ❌ no calibration data sourced from validation chunks
- ❌ no fake green: failing the falsifier IS a valid result, document it honestly per R5
R-discipline
R1 Rust-only (no Python in src) · R3 PR-only · R4 trace (every cell timestamped + sha-pinned) · R5 honest (falsifier explicitly stated) · R7 witness (assertions/jsonl) · R8 falsifier · R10 atomic (single-purpose PR per lane) · R12 reversible (calibration_n=0 path preserved as default).
Battle cry
phi^2 + phi^-2 = 3 · TRINITY · PORT THE LEVER · PROVE OR FALSIFY ON GF16