
v1.7.5 sequential-learning adapter contract + multi-seed eval (eval STOP)#22

Merged
kitfunso merged 6 commits into master from feat/v1.7.5-sl-adapter
May 7, 2026


@kitfunso kitfunso commented May 7, 2026

Summary

Sequential-learning benchmark infrastructure release. Closes the v0.39 B3 follow-up that gated public exercise of the dlPFC goal-stack mechanism. The eval ran but stopped at the pre-registered sanity gate because of a floor effect; the hypothesis remains open.

  • Task 1: Adapter contract. Adds pushGoal / completeGoal hooks to interface.mjs. hippo.mjs implements both via the existing hippo goal push|complete CLI, with HIPPO_HOME / XDG_DATA_HOME isolation. A tag fix on memory store ([task.trapCategory, ...category.tags, 'error']) lets the boost match; without it the eval would have RETRACTED a working mechanism.
  • Task 2: Multi-seed eval harness with category-to-slot variance (seeded mulberry32, hash-derived sub-seeds per shape group). Permutation CI for paired Δ (exact sign-flip, 10k resamples). Adds --seed, --n-seeds, and --eval-strict flags.
  • Task 3: Pre-registered protocol + claim inventory committed BEFORE eval. 4 conditions × 20 seeds × eval-strict. Sanity gate fired (C2 late = 0% << headline 14%); STOP applied per prereg. Floor effect: both C2 and C3 saturate at 0% late-phase. ΔLate = 0pp. Mechanism shipped, hypothesis open.
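The seeded category-to-slot assignment described in Task 2 could look roughly like the sketch below. mulberry32 is the well-known 32-bit PRNG named above; everything else is an assumption made for illustration: the FNV-1a hash used to derive sub-seeds, the `seed:shapeGroup` key format, and the helper names `subSeed` and `assignCategoriesToSlots` are hypothetical and not taken from the harness.

```javascript
// mulberry32: small seeded PRNG, returns floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// FNV-1a 32-bit string hash (assumed here; the PR only says "hash-derived").
function fnv1a32(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Hypothetical helper: one independent sub-seed per (seed, shape group).
function subSeed(seed, shapeGroup) {
  return fnv1a32(`${seed}:${shapeGroup}`);
}

// Fisher-Yates shuffle driven by the seeded generator; returns a new array.
function assignCategoriesToSlots(categories, seed, shapeGroup) {
  const rng = mulberry32(subSeed(seed, shapeGroup));
  const slots = categories.slice();
  for (let i = slots.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [slots[i], slots[j]] = [slots[j], slots[i]];
  }
  return slots;
}

// Same seed and shape group always yield the same assignment.
console.log(assignCategoriesToSlots(['a', 'b', 'c', 'd'], 7, 'shape-1'));
```

Deriving a sub-seed per shape group keeps each group's shuffle reproducible without one group's draws shifting another's stream.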

Outside-voice review payoff

Plan v1 drew 13 P0/P1 findings from senior-code-reviewer + Codex CLI; plan v2 addressed all of them. Biggest catches: tag mismatch (the boost would have matched zero memories); the within-thirds shuffle was statistical theatre; the paired t-test is fragile for bounded discrete rates, so it was replaced with a permutation CI; HIPPO_HOME / XDG_DATA_HOME isolation; hard-fail in eval-strict mode.
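For readers unfamiliar with the sign-flip approach that replaced the paired t-test, here is a minimal sketch of a Monte Carlo sign-flip test on per-seed paired differences. It is an illustration under stated assumptions, not the harness code: the function name `signFlipTest`, the mean as test statistic, and the add-one p-value are assumptions, and it shows only the null distribution and a two-sided p-value, not how the PR derives its interval.

```javascript
// Seeded PRNG so the resampling is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Hypothetical helper: under H0 each paired difference is equally likely to
// carry either sign, so randomly flip signs and compare |mean| to the observed.
function signFlipTest(diffs, { resamples = 10000, seed = 1 } = {}) {
  if (diffs.length === 0) throw new Error('need at least one paired difference');
  const rng = mulberry32(seed);
  const observed = mean(diffs);
  let asExtreme = 0;
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (const d of diffs) sum += rng() < 0.5 ? -d : d;
    // Small tolerance guards against floating-point ties.
    if (Math.abs(sum / diffs.length) >= Math.abs(observed) - 1e-12) asExtreme++;
  }
  // Add-one correction keeps the Monte Carlo p-value strictly positive.
  return { observed, pValue: (asExtreme + 1) / (resamples + 1) };
}

// With every paired difference at 0 (a floor effect), nothing can be detected.
console.log(signFlipTest(new Array(20).fill(0)));
```

Unlike a t-test, this makes no normality assumption, which matters when the rates are bounded and take only a few discrete values.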

Test plan

  • npx vitest run → 1445 passed (+27 new), 0 failures
  • npx tsc --noEmit → 0 errors
  • npm run build → clean
  • Pre-eval sanity: tag-fix verified to fire boost (Task 1 integration test)
  • Eval ran cleanly: zero hook failures across all 4 conditions × 20 seeds in eval-strict mode
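The hard-fail behaviour of eval-strict mode can be pictured with the sketch below. It is an assumption about the shape of the idea only: `runHook`, its options, and the synchronous call style are hypothetical, and the real adapter hooks (which shell out to the hippo CLI) may well be asynchronous.

```javascript
// Hypothetical wrapper: in strict mode any hook failure aborts the run;
// otherwise the failure is recorded and the eval continues.
function runHook(adapter, hookName, args, { strict = false, failures = [] } = {}) {
  try {
    if (typeof adapter[hookName] !== 'function') {
      throw new Error(`adapter does not implement ${hookName}`);
    }
    return adapter[hookName](...args);
  } catch (err) {
    if (strict) throw err; // hard-fail: never silently score a broken run
    failures.push({ hookName, message: err.message });
    return undefined;
  }
}

// Usage with a stub adapter that only implements pushGoal.
const failures = [];
const stubAdapter = { pushGoal: (goal) => `pushed:${goal}` };
console.log(runHook(stubAdapter, 'pushGoal', ['fix-build'], { failures }));
runHook(stubAdapter, 'completeGoal', ['fix-build'], { failures });
console.log(failures.length);
```

Failing loudly in strict mode means a zero-failure run, as reported above, is evidence the hooks actually executed rather than being skipped.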

Eval result

STOP per pre-registered sanity gate. C2 (hippo-base) late-phase = 0.0% across all 20 seeds, outside the pre-registered band [4%, 24%]. Both C2 and C3 saturate at 0% late, so a floor effect prevents H1/H0 discrimination. The −10pp hypothesis remains untested on a discriminating workload. Documented in docs/evals/2026-05-07-v1.7.5-goal-stack-eval-result.md with full investigation. A future eval needs a harder benchmark variant (smaller --budget, adversarial categories, or a restricted late-phase window).
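The gate logic amounts to a band check on the baseline before any comparison is interpreted. A minimal sketch, assuming the gate is applied to the mean late-phase rate across seeds; the function name `sanityGate` and the return shape are hypothetical, while the [4%, 24%] band comes from the pre-registration quoted above:

```javascript
// Hypothetical gate: STOP unless the baseline's mean late-phase rate
// falls inside the pre-registered band.
function sanityGate(lateRates, band = { lo: 0.04, hi: 0.24 }) {
  if (lateRates.length === 0) throw new Error('no seeds to evaluate');
  const meanRate = lateRates.reduce((sum, r) => sum + r, 0) / lateRates.length;
  const inBand = meanRate >= band.lo && meanRate <= band.hi;
  return { meanRate, inBand, decision: inBand ? 'PROCEED' : 'STOP' };
}

// C2 late-phase rate of 0.0% on all 20 seeds falls below the band.
console.log(sanityGate(new Array(20).fill(0)).decision); // prints "STOP"
```

Committing the band before the run is what makes the STOP binding: an out-of-band baseline means the benchmark cannot discriminate, regardless of what ΔLate looks like.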

Plan: docs/plans/2026-05-07-v1.7.5-sequential-learning-adapter.md.

Closes the v0.39 B3 follow-up "sequential-learning adapter contract" in TODOS.md. The "−10pp hypothesis" line stays open as a v1.7.6+ follow-up.

@kitfunso kitfunso merged commit 2253a8f into master May 7, 2026
1 check passed