v1.7.5 sequential-learning adapter contract + multi-seed eval (eval STOP) by kitfunso · Pull Request #22 · kitfunso/hippo-memory

kitfunso · 2026-05-07T12:27:19Z

Summary

Sequential-learning benchmark infrastructure release. Closes the v0.39 B3 follow-up that gated public exercising of the dlPFC goal-stack mechanism. The eval ran but was stopped by the pre-registered sanity gate because of a floor effect; the hypothesis remains open.

Task 1: Adapter contract. Adds pushGoal / completeGoal hooks to interface.mjs. hippo.mjs implements both via the existing hippo goal push|complete CLI with HIPPO_HOME / XDG_DATA_HOME isolation. Tag fix on memory store ([task.trapCategory, ...category.tags, 'error']) so the boost can match; without it the eval would have RETRACTED a working mechanism.
Task 2: Multi-seed eval harness with category-to-slot variance (seeded mulberry32, hash-derived sub-seeds per shape group). Permutation CI for paired Δ (exact sign-flip, 10k resamples). Adds the --seed, --n-seeds, and --eval-strict flags.
Task 3: Pre-registered protocol + claim inventory committed BEFORE the eval. 4 conditions × 20 seeds × eval-strict. The sanity gate fired (C2 late = 0% << headline 14%), so STOP was applied per the prereg. Floor effect: both C2 and C3 saturate at 0% late-phase, so ΔLate = 0pp. Mechanism shipped, hypothesis open.
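
For readers unfamiliar with the seeding scheme named in Task 2, here is a minimal sketch of a seeded mulberry32 generator with hash-derived sub-seeds. It is not the harness code from this PR: the FNV-1a hash, the `subSeed` key format, and the `assignSlots` helper are assumptions made for illustration only.

```javascript
// mulberry32: a small seeded 32-bit PRNG that returns floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// FNV-1a string hash (assumption: the real harness may hash differently).
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Derive an independent sub-seed per shape group from the run seed.
// The "seed:group" key format is hypothetical.
function subSeed(seed, group) {
  return fnv1a(`${seed}:${group}`);
}

// Fisher-Yates shuffle on a copy, driven by the supplied PRNG.
function shuffle(items, rand) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Hypothetical helper: assign categories to slots for one shape group.
function assignSlots(categories, seed, group) {
  return shuffle(categories, mulberry32(subSeed(seed, group)));
}
```

Because each group draws from its own sub-seeded stream, changing one group's category list does not perturb the slot order of the others, and rerunning with the same seed reproduces the same assignment.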

Outside-voice review payoff

Plan v1 drew 13 P0/P1 findings from senior-code-reviewer and Codex CLI; plan v2 addressed all of them. Biggest catches: the tag mismatch (the boost would have matched zero memories); the within-thirds shuffle was statistical theatre; the paired t-test is fragile for bounded discrete rates, so it was replaced with a permutation CI; HIPPO_HOME / XDG_DATA_HOME isolation; and hard-fail in eval-strict mode.
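
To illustrate why a sign-flip permutation approach suits paired, bounded rates, the sketch below computes a Monte Carlo sign-flip p-value for the mean paired difference. It is an assumption-laden illustration, not the harness from this PR: the function name, the default seed, and the choice to report a p-value (rather than invert the test into a CI) are all hypothetical.

```javascript
// Seeded PRNG so the resampling is reproducible (mulberry32).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two-sided sign-flip test on paired differences (one value per seed).
// Under H0 each difference is equally likely to be +d or -d, so we
// randomly flip signs and count how often the resampled |mean| is at
// least as large as the observed |mean|.
function signFlipPValue(diffs, resamples = 10000, seed = 1) {
  const rand = mulberry32(seed);
  const n = diffs.length;
  const observed = Math.abs(diffs.reduce((s, d) => s + d, 0) / n);
  let extreme = 0;
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (const d of diffs) sum += rand() < 0.5 ? d : -d;
    if (Math.abs(sum / n) >= observed - 1e-12) extreme++;
  }
  // Add-one correction keeps the estimate strictly positive.
  return (extreme + 1) / (resamples + 1);
}

// Floor-effect case: every paired difference is zero, so every
// resample ties the observed statistic and the p-value is 1.
const allZero = new Array(20).fill(0);
const pFloor = signFlipPValue(allZero);
```

The test makes no normality assumption, which matters when per-seed rates are discrete and pile up at a bound such as 0%.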

Test plan

npx vitest run → 1445 passed (+27 new), 0 failures
npx tsc --noEmit → 0 errors
npm run build → clean
Pre-eval sanity: tag-fix verified to fire boost (Task 1 integration test)
Eval ran cleanly: zero hook failures across all 4 conditions × 20 seeds in eval-strict mode
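
The tag fix, the environment isolation, and the strict-mode behaviour checked above can be sketched roughly as follows. Only the tag expression `[task.trapCategory, ...category.tags, 'error']` and the variable names HIPPO_HOME / XDG_DATA_HOME come from this PR; the helper names, the choice to point both variables at one directory, and the synchronous hook wrapper are assumptions for illustration.

```javascript
// Tags stored with each memory so the goal-stack boost has something
// to match on (expression taken from the PR description).
function buildMemoryTags(task, category) {
  return [task.trapCategory, ...category.tags, 'error'];
}

// Build a child-process environment that keeps eval runs away from the
// user's real store. Assumption: both variables point at one sandbox dir.
function isolatedEnv(baseEnv, sandboxDir) {
  return { ...baseEnv, HIPPO_HOME: sandboxDir, XDG_DATA_HOME: sandboxDir };
}

// Hypothetical hook wrapper: in strict mode a failing hook aborts the
// run; otherwise the failure is recorded and the eval continues.
function runHook(name, hook, { strict, failures }) {
  try {
    return hook();
  } catch (err) {
    if (strict) {
      throw new Error(`hook ${name} failed in eval-strict mode: ${err.message}`);
    }
    failures.push(name);
    return undefined;
  }
}

// Usage with made-up task and category values.
const tags = buildMemoryTags(
  { trapCategory: 'off-by-one' },
  { tags: ['loops', 'indexing'] }
);
// tags is ['off-by-one', 'loops', 'indexing', 'error']
```

Hard-failing in strict mode means a silently broken hook cannot masquerade as a null result for the mechanism under test.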

Eval result

STOP per pre-registered sanity gate. C2 (hippo-base) late-phase = 0.0% across all 20 seeds, outside the pre-registered band [4%, 24%]. Both C2 and C3 saturate at 0% late; this floor effect prevents H1/H0 discrimination. The −10pp hypothesis remains untested on a discriminating workload. Documented in docs/evals/2026-05-07-v1.7.5-goal-stack-eval-result.md with full investigation. A future eval needs a harder benchmark variant (smaller --budget, adversarial categories, or a restricted late-phase window).
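
The gate decision described above amounts to a band check on the baseline condition. A minimal sketch follows, assuming the gate is applied to the mean late-phase rate across seeds; the function name and the 'PROCEED' label are hypothetical, while the band [4%, 24%] and the observed 0.0% come from this PR.

```javascript
// Pre-registered sanity gate: the baseline late-phase rate must fall
// inside the band before any between-condition comparison is read.
// Assumption: the gate uses the mean across seeds.
function sanityGate(lateRates, band) {
  const mean = lateRates.reduce((s, r) => s + r, 0) / lateRates.length;
  const inBand = mean >= band.lo && mean <= band.hi;
  return { mean, decision: inBand ? 'PROCEED' : 'STOP' };
}

// C2 (hippo-base) late-phase rate was 0.0% on all 20 seeds.
const c2Late = new Array(20).fill(0);
const gate = sanityGate(c2Late, { lo: 0.04, hi: 0.24 });
// gate.decision is 'STOP' because 0 lies below the lower bound 0.04
```

Fixing the band before the run is what makes the STOP informative: a baseline at the floor leaves no room for the treatment to show a reduction, so ΔLate = 0pp says nothing about the mechanism.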

Plan: docs/plans/2026-05-07-v1.7.5-sequential-learning-adapter.md.

Closes the v0.39 B3 follow-up "sequential-learning adapter contract" in TODOS.md. The "−10pp hypothesis" line stays open as a v1.7.6+ follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Kit and others added 6 commits May 7, 2026 12:46

b2b23b5 docs: v1.7.5 sequential-learning adapter implementation plan (post-review v2)
64ad38b feat: sequential-learning adapter pushGoal/completeGoal hooks (v1.7.5 task 1)
e2e565e feat: multi-seed eval harness for sequential-learning benchmark (v1.7.5 task 2)
14625b7 docs: pre-registered protocol + claim inventory for v1.7.5 goal-stack eval
d5c4981 docs: v1.7.5 goal-stack eval result (STOP per sanity gate, floor effect)
c485aeb chore: bump to v1.7.5, update changelog and readme


kitfunso merged commit 2253a8f into master May 7, 2026
1 check passed

kitfunso merged 6 commits into master from feat/v1.7.5-sl-adapter
