kitfunso · May 7, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,25 @@
 # Changelog
 
+## 1.7.5 (2026-05-07)
+
+Sequential-learning benchmark infrastructure release. Closes the v0.39 B3 follow-up that gated public exercising of the dlPFC goal-stack mechanism. Eval ran but stopped per the pre-registered sanity gate due to a floor effect; hypothesis remains open.
+
+### Added
+
+- **Sequential-learning adapter contract — `pushGoal` / `completeGoal` hooks.** `benchmarks/sequential-learning/adapters/interface.mjs` accepts optional paired hooks. `hippo.mjs` adapter implements both via the existing `hippo goal push|complete` CLI commands, with `HIPPO_HOME` / `XDG_DATA_HOME` isolation so the eval can't contaminate the user's real store.
+- **Multi-seed eval harness with meaningful seed-driven variance.** `--seed N`, `--n-seeds N`, `--eval-strict` flags on `run.mjs`. `benchmarks/sequential-learning/aggregate.mjs` provides `mean` / `stdDev` / `ciHalfWidth95` (returns 0 for n<5) / `aggregatePhases` / `pairedPermutationCI` (10k resamples, no t-test). `traps.mjs::generateTasks(seed)` randomly assigns categories to position-slots within phase shape groups, preserving total trap count, per-category encounter count, and the early-then-later structure.
+- **`--use-goal-stack` runner flag.** When set AND adapter supplies the hooks, the simulator wraps each trap task in goal-push / goal-complete so the dlPFC boost activates. Eval-strict mode hard-fails on hook errors so silent fallback can't masquerade as a null result.
+- **Tag-fix on memory store.** Stored memories now include `task.trapCategory` (the category id) as the first tag so the goal-stack boost — which keys on `goalsByTag.has(memoryTag)` — can match. Pre-fix the boost would have matched zero memories regardless of mechanism truth.
+
+### Eval
+
+- **Goal-stack lift on sequential-learning benchmark — STOPPED per pre-registered sanity gate.** 4-condition × 20-seed paired eval ran cleanly (zero hook failures, eval-strict mode). Sanity gate fired before the decision rule could apply: hippo-base (C2) measured 0.0% late-phase trap rate (pre-registered band: [4%, 24%] around the README headline 14%). Both C2 and hippo+goal-stack (C3) saturate at 0% late-phase across all 20 seeds — floor effect, no headroom for the goal-stack mechanism to demonstrate further improvement. The −10pp hypothesis remains untested on a discriminating workload. Future eval needs a harder benchmark variant (smaller `--budget`, adversarial categories, or restricted late-phase window). Pre-registration: `docs/evals/2026-05-07-v1.7.5-goal-stack-eval-prereg.md`. Claim inventory: `docs/evals/2026-05-07-v1.7.5-claim-inventory.md`. Full result + investigation: `docs/evals/2026-05-07-v1.7.5-goal-stack-eval-result.md`. Re-derive numbers: `node benchmarks/sequential-learning/analyze-v1.7.5.mjs`.
+
+### Deferred to future release
+
+- Discriminating workload variant for the goal-stack hypothesis (v1.7.6+): reduce store budget, add adversarial trap categories, OR restrict the late-phase metric to the last 4 trap encounters.
+- vlPFC interference suppression (v1.8.0): real feature work per RESEARCH.md, separate plan.
+
 ## 1.7.4 (2026-05-07)
 
 Internal hygiene release closing 3 of the 5 B3 dlPFC follow-ups deferred from v0.39.0. Adds optional `RecallOpts.sessionId` (and `RecallOpts.goalTag`) so MCP `hippo_recall` and HTTP `GET /v1/memories` callers get the dlPFC goal-stack boost — previously CLI-only. Adds `--no-propagate` flag on `goal complete`. Refactors `enforceDepthCapWithinTx` helper.
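
Review note: the sanity gate in the 1.7.5 Eval entry can be read as a band check on the baseline condition that runs before any decision rule. The sketch below is a minimal illustration under assumptions: `sanityGate` and its return shape are hypothetical names, not the code in `analyze-v1.7.5.mjs`. Only the band [4%, 24%] and the measured 0.0% come from the changelog.

```javascript
// Hypothetical sketch of a pre-registered sanity gate (names are assumptions).
// The band and the measured rate are taken from the 1.7.5 changelog entry.
function sanityGate(baselineLateRate, band) {
  const inBand = baselineLateRate >= band.low && baselineLateRate <= band.high;
  return {
    pass: inBand,
    reason: inBand
      ? 'baseline inside pre-registered band'
      : baselineLateRate < band.low
        ? 'floor effect: baseline below band, no headroom to improve'
        : 'baseline above band',
  };
}

const band = { low: 0.04, high: 0.24 }; // [4%, 24%] around the 14% headline
const gate = sanityGate(0.0, band);     // hippo-base (C2) late-phase trap rate
console.log(gate.pass, gate.reason);
```

Checking the baseline first is what keeps a saturated workload from being reported as a null result for the mechanism itself.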

diff --git a/README.md b/README.md
@@ -85,6 +85,12 @@ hippo recall "data pipeline issues" --budget 2000
 
 ---
 
+### What's new in v1.7.5
+
+- **Sequential-learning benchmark gains `pushGoal`/`completeGoal` hooks** + a multi-seed eval harness with seeded category-to-slot variance, a Monte Carlo paired sign-flip permutation CI (10k resamples), and `--eval-strict` mode. The dlPFC goal-stack mechanism is now exercisable on the public benchmark.
+- **Tag-fix on memory store** so the goal-stack boost can actually match. Pre-fix the boost would have matched zero memories.
+- **Eval ran but stopped per pre-registered sanity gate.** Both hippo-base and hippo+goal-stack hit 0% late-phase trap rate across 20 seeds — floor effect prevents H1/H0 discrimination. The −10pp hypothesis remains untested on a discriminating workload. Mechanism shipped, hypothesis open. Pre-reg + result in `docs/evals/`.
+
 ### What's new in v1.7.4
 
 - **Goal-stack boost on MCP + HTTP.** Set `RecallOpts.sessionId` (or HTTP `?session_id=...`, or MCP `hippo_recall { session_id }`) and the dlPFC goal-stack boost — previously CLI-only — applies on MCP and HTTP too. Both `api.recall` (primary BM25 band, before fresh-tail / summary appendix) AND MCP's separate `physicsSearch`/`hybridSearch` path are boosted. New `RecallOpts.goalTag` lets callers opt out per-call.
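
Review note: the "seeded category-to-slot variance" mentioned for v1.7.5 can be pictured as a deterministic shuffle. The sketch below is an assumption-laden illustration, not the real `traps.mjs::generateTasks`: `shuffleCategories` and the placeholder category ids are hypothetical, while `mulberry32` is copied from the `aggregate.mjs` added in this change.

```javascript
// mulberry32 as added in aggregate.mjs by this change (deterministic PRNG).
function mulberry32(seed) {
  let s = seed >>> 0;
  return function () {
    s = (s + 0x6D2B79F5) >>> 0;
    let t = s;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: Fisher-Yates shuffle driven by the seeded RNG.
// Permuting slots keeps every category's encounter count unchanged.
function shuffleCategories(categories, seed) {
  const rng = mulberry32(seed);
  const out = categories.slice(); // leave the caller's array untouched
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

const slots = ['a', 'a', 'b', 'b', 'c', 'c']; // placeholder category ids
const run1 = shuffleCategories(slots, 7);
const run2 = shuffleCategories(slots, 7);
console.log(run1.join(','), run1.join(',') === run2.join(','));
```

Because the shuffle only permutes positions, the same seed reproduces the same task order and the per-category counts stay fixed across seeds.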

diff --git a/benchmarks/sequential-learning/adapters/hippo.mjs b/benchmarks/sequential-learning/adapters/hippo.mjs
@@ -2,10 +2,19 @@
  * Hippo Memory Adapter
  *
  * Uses the hippo CLI via subprocess calls for full isolation.
- * Creates a temporary directory, sets HOME/USERPROFILE to it so the
- * global store (~/.hippo) is also isolated from the user's real store.
+ * Creates a temporary directory, sets HOME/USERPROFILE/HIPPO_HOME to it so the
+ * global store (~/.hippo) is also isolated from the user's real store, and
+ * clears XDG_DATA_HOME so it can never leak the real user store either
+ * (precedence in src/shared.ts:getGlobalRoot is HIPPO_HOME > XDG_DATA_HOME > HOME/.hippo).
  *
- * Requires: hippo CLI on PATH (npm link or global install)
+ * v1.7.5 -- adds optional B3 dlPFC goal-stack hooks (pushGoal / completeGoal).
+ * pushGoal generates a session id, sets HIPPO_SESSION_ID for the rest of the
+ * task lifespan, and parses the printed `g_<16hex>` id from stdout. completeGoal
+ * closes the goal with an outcome (1.0 trap avoided, 0.0 trap hit) and clears
+ * the session so cross-task state cannot leak. Both methods hard-fail (no swallow)
+ * so the simulator's eval-strict mode can detect a broken mechanism.
+ *
+ * Requires: hippo CLI on PATH (npm link or global install).
  */
 
 import { execSync } from 'node:child_process';
@@ -14,20 +23,32 @@ import { join } from 'node:path';
 import { tmpdir } from 'node:os';
 import { createAdapter } from './interface.mjs';
 
+// v1.7.5 -- session id stays stable across the task lifespan. Set via env on
+// every hippo exec so goal push/complete and recall share state.
+let _sessionId = null;
+let _pushedCount = 0;
+let _completedCount = 0;
+
 /**
  * Run a hippo CLI command in the temp store directory.
- * HOME/USERPROFILE are overridden to isolate from the user's global store.
- * Returns stdout as a string, or null on failure.
+ * HOME/USERPROFILE/HIPPO_HOME are overridden to isolate from the user's
+ * global store, and XDG_DATA_HOME is blanked so it cannot win the
+ * getGlobalRoot precedence race. Returns stdout as a string, or null on failure.
  */
 function hippoExec(storeDir, args) {
   try {
     const result = execSync(`hippo ${args}`, {
       cwd: storeDir,
       env: {
         ...process.env,
-        // Override home directory so getGlobalRoot() points into our temp dir
+        // v1.7.5 codex P0 isolation -- HIPPO_HOME wins over HOME in
+        // getGlobalRoot, and we blank XDG_DATA_HOME so neither it nor the
+        // user's real ~/.hippo can leak in.
+        HIPPO_HOME: storeDir,
         HOME: storeDir,
         USERPROFILE: storeDir,
+        XDG_DATA_HOME: '',
+        ...(_sessionId ? { HIPPO_SESSION_ID: _sessionId } : {}),
       },
       encoding: 'utf-8',
       timeout: 15_000,
@@ -90,10 +111,43 @@ export default createAdapter({
     hippoExec(this._storeDir, `outcome ${good ? '--good' : '--bad'}`);
   },
 
+  // v1.7.5 -- B3 goal-stack hooks.
+  async pushGoal(name) {
+    _sessionId = `bench-${Date.now()}-${Math.floor(Math.random() * 1e6)}`;
+    const out = hippoExec(this._storeDir, `goal push ${JSON.stringify(name)}`); // quote so names with spaces stay one argument
+    if (!out) {
+      _sessionId = null;
+      // v1.7.5 codex P1 -- HARD FAIL. Do not swallow. Eval-strict mode
+      // wants the run to abort if the mechanism cannot fire.
+      throw new Error(`hippo goal push failed for name='${name}'`);
+    }
+    const match = out.match(/g_[0-9a-f]{16}/);
+    if (!match) {
+      _sessionId = null;
+      throw new Error(`hippo goal push output did not contain a goal id: '${out}'`);
+    }
+    _pushedCount++;
+    return match[0];
+  },
+
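+  async completeGoal(id, good) {
+    const out = hippoExec(this._storeDir, `goal complete ${id} --outcome ${good ? '1.0' : '0.0'}`);
+    _sessionId = null; // clear before any throw so session state cannot leak across tasks
+    if (out === null) throw new Error(`hippo goal complete failed for id='${id}'`); // hard-fail like pushGoal
+    _completedCount++;
+  },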
+
   async cleanup() {
+    // v1.7.5 codex P1 -- always clear cross-task state, even after exceptions.
+    _sessionId = null;
     if (this._storeDir && existsSync(this._storeDir)) {
       rmSync(this._storeDir, { recursive: true, force: true });
     }
     this._storeDir = null;
   },
+
+  // Expose counters so the runner / tests can assert non-zero in eval mode.
+  _stats() {
+    return { pushed: _pushedCount, completed: _completedCount };
+  },
 });
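
Review note: the adapter hooks above are meant to bracket each trap task when `--use-goal-stack` is set. The runner is not part of this diff, so the sketch below is a guess at the shape of that wrapper. `runTrapTask`, the `strict` option, and `fakeAdapter` are hypothetical names used only for illustration.

```javascript
// Hypothetical runner-side wrapper (the real simulator is not in this diff).
// With hooks present, each trap task is bracketed by pushGoal / completeGoal.
async function runTrapTask(adapter, task, { useGoalStack, strict }) {
  const hooked = useGoalStack && typeof adapter.pushGoal === 'function';
  let goalId = null;
  try {
    if (hooked) goalId = await adapter.pushGoal(task.trapCategory);
  } catch (err) {
    if (strict) throw err; // eval-strict: abort rather than fall back silently
  }
  const avoided = await task.run(adapter);
  if (goalId !== null) {
    try {
      await adapter.completeGoal(goalId, avoided);
    } catch (err) {
      if (strict) throw err;
    }
  }
  return { avoided, goalStackUsed: goalId !== null };
}

// In-memory stand-in adapter so the sketch runs without the hippo CLI.
const calls = [];
const fakeAdapter = {
  async pushGoal(name) { calls.push(`push:${name}`); return 'g_0123456789abcdef'; },
  async completeGoal(id, good) { calls.push(`complete:${id}:${good}`); },
};
const task = { trapCategory: 'example-category', run: async () => true };

const done = runTrapTask(fakeAdapter, task, { useGoalStack: true, strict: true })
  .then((result) => { console.log(result, calls); return result; });
```

In non-strict mode a failed push leaves `goalId` as `null` and the task still runs, which is exactly the silent fallback that strict mode is there to rule out.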
diff --git a/benchmarks/sequential-learning/adapters/interface.mjs b/benchmarks/sequential-learning/adapters/interface.mjs
@@ -38,11 +38,21 @@
  * @property {(query: string) => Promise<RecallResult[]>} recall              - Retrieve memories relevant to a query (return top-5)
  * @property {(good: boolean) => Promise<void>} outcome                       - Feedback on the last recall (did it help?)
  * @property {() => Promise<void>} cleanup                                    - Tear down the memory store (remove temp dirs)
+ *
+ * v1.7.5 -- optional B3 dlPFC goal-stack hooks. Either supply BOTH or NEITHER.
+ * @property {((name: string) => Promise<string>)} [pushGoal]                 - Push an active goal at task start. Returns the goal id
+ *                                                                              (opaque string, e.g. "g_<16hex>"). Adapter must thread the
+ *                                                                              goal's session id into subsequent recall calls so the
+ *                                                                              goal-stack boost activates.
+ * @property {((id: string, good: boolean) => Promise<void>)} [completeGoal]  - Complete the goal at task end. `good` is the outcome
+ *                                                                              (true = trap avoided, false = trap hit). Adapter MAY
+ *                                                                              propagate strength multipliers per its own contract.
  */
 
 /**
  * Create a validated adapter from a plain object.
- * Throws if any required method is missing.
+ * Throws if any required method is missing, or if the optional v1.7.5
+ * goal-stack hooks are supplied unpaired (must be both or neither).
  *
  * @param {MemoryAdapter} adapter
  * @returns {MemoryAdapter}
@@ -59,5 +69,21 @@ export function createAdapter(adapter) {
       throw new Error(`Adapter.${key} must be a function`);
     }
   }
+  // v1.7.5 -- pushGoal / completeGoal must be supplied as a pair.
+  const hasPush = 'pushGoal' in adapter;
+  const hasComplete = 'completeGoal' in adapter;
+  if (hasPush !== hasComplete) {
+    throw new Error(
+      hasPush
+        ? 'Adapter supplies pushGoal but missing completeGoal -- the v1.7.5 B3 hooks are paired'
+        : 'Adapter supplies completeGoal but missing pushGoal -- the v1.7.5 B3 hooks are paired',
+    );
+  }
+  if (hasPush && typeof adapter.pushGoal !== 'function') {
+    throw new Error('Adapter.pushGoal must be a function');
+  }
+  if (hasComplete && typeof adapter.completeGoal !== 'function') {
+    throw new Error('Adapter.completeGoal must be a function');
+  }
   return adapter;
 }
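
Review note: the pairing rule is easiest to see on a few adapter shapes. The sketch below mirrors only the paired-hook check from `createAdapter`. `checkPairedHooks` and `throwsOn` are hypothetical stand-ins so the example runs without the rest of `interface.mjs`, and the error text is shortened.

```javascript
// Stand-in that mirrors only the paired-hook check added to createAdapter.
function checkPairedHooks(adapter) {
  const hasPush = 'pushGoal' in adapter;
  const hasComplete = 'completeGoal' in adapter;
  if (hasPush !== hasComplete) {
    throw new Error('pushGoal / completeGoal must be supplied as a pair');
  }
  for (const key of ['pushGoal', 'completeGoal']) {
    if (key in adapter && typeof adapter[key] !== 'function') {
      throw new Error(`Adapter.${key} must be a function`);
    }
  }
  return adapter;
}

// Helper for the demo: true when validation rejects the adapter.
function throwsOn(adapter) {
  try { checkPairedHooks(adapter); return false; } catch { return true; }
}

const neither = {};
const both = { pushGoal: async () => 'g_0123456789abcdef', completeGoal: async () => {} };
const onlyPush = { pushGoal: async () => 'g_0123456789abcdef' };
const explicitUndefined = { pushGoal: undefined, completeGoal: undefined };

console.log(throwsOn(neither), throwsOn(both), throwsOn(onlyPush), throwsOn(explicitUndefined));
// prints: false false true true
```

Because the check uses the `in` operator, a key that is present but set to `undefined` counts as supplied and is rejected by the function-type check, so adapters that want to opt out should omit both keys entirely.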
diff --git a/benchmarks/sequential-learning/aggregate.mjs b/benchmarks/sequential-learning/aggregate.mjs
@@ -0,0 +1,138 @@
+// benchmarks/sequential-learning/aggregate.mjs
+// v1.7.5 -- pure aggregation helpers for multi-seed runs. Zero npm deps; only
+// Node 22+ built-ins. Keep this file dependency-free so the benchmark runs on
+// a vanilla Node install with `node run.mjs`.
+
+/**
+ * Sample mean. Returns 0 for empty arrays.
+ * @param {number[]} xs
+ * @returns {number}
+ */
+export function mean(xs) {
+  if (xs.length === 0) return 0;
+  return xs.reduce((a, b) => a + b, 0) / xs.length;
+}
+
+/**
+ * Sample standard deviation (n-1 denominator). Returns 0 for n<2.
+ * @param {number[]} xs
+ * @returns {number}
+ */
+export function stdDev(xs) {
+  if (xs.length < 2) return 0;
+  const m = mean(xs);
+  const sumSq = xs.reduce((a, b) => a + (b - m) ** 2, 0);
+  return Math.sqrt(sumSq / (xs.length - 1));
+}
+
+// t-distribution critical values for 95% two-sided, df = n - 1.
+// Hardcoded for n in [5..30]; falls back to 1.960 (z) for n > 30.
+const T_CRIT_95 = {
+  5: 2.776, 6: 2.571, 7: 2.447, 8: 2.365, 9: 2.306, 10: 2.262,
+  11: 2.228, 12: 2.201, 13: 2.179, 14: 2.160, 15: 2.145, 16: 2.131,
+  17: 2.120, 18: 2.110, 19: 2.101, 20: 2.093, 21: 2.086, 22: 2.080,
+  23: 2.074, 24: 2.069, 25: 2.064, 26: 2.060, 27: 2.056, 28: 2.052,
+  29: 2.048, 30: 2.045,
+};
+
+/**
+ * 95% CI half-width via t-distribution. v1.7.5 codex P2 -- reject n<5 by
+ * returning 0. n=2 t-crit=12.706 gives nonsense CIs; eval requires n>=10
+ * anyway. Smoke runs labelled exploratory, no CI reported.
+ *
+ * @param {number[]} xs
+ * @returns {number}
+ */
+export function ciHalfWidth95(xs) {
+  if (xs.length < 5) return 0;
+  const t = T_CRIT_95[xs.length] ?? 1.960;
+  return (t * stdDev(xs)) / Math.sqrt(xs.length);
+}
+
+/**
+ * mulberry32 -- deterministic PRNG, dep-free. Exported so traps.mjs can reuse
+ * the same RNG implementation for seeded category-to-slot assignment.
+ *
+ * @param {number} seed integer seed (uint32 coerced)
+ * @returns {() => number} function returning a uniform float in [0, 1)
+ */
+export function mulberry32(seed) {
+  let s = seed >>> 0;
+  return function () {
+    s = (s + 0x6D2B79F5) >>> 0;
+    let t = s;
+    t = Math.imul(t ^ (t >>> 15), t | 1);
+    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
+    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
+  };
+}
+
+/**
+ * Monte Carlo paired sign-flip permutation CI for mean(xsA - xsB).
+ *
+ * v1.7.5 -- used in Task 3 instead of a paired t-test: phase rates are bounded
+ * and binomial-like, and the t-test is fragile at n=20. Random sign-flip
+ * resampling (not exhaustive enumeration) makes no normality assumption.
+ *
+ * The CI is built around the observed mean by recentring the resampled-mean
+ * distribution: ciLow = observed + (sortedResamples[loIdx] - mean(resamples)).
+ * This gives a percentile interval recentred on the observed mean.
+ *
+ * Internally seeded with mulberry32(0x9E3779B9) for determinism.
+ *
+ * @param {number[]} xsA per-seed metric for condition A
+ * @param {number[]} xsB per-seed metric for condition B (same seeds, paired)
+ * @param {number} [alpha=0.05] two-sided alpha for 95% CI
+ * @param {number} [nResamples=10000] permutation resample count
+ * @returns {{deltaMean: number, ciLow: number, ciHigh: number}}
+ */
+export function pairedPermutationCI(xsA, xsB, alpha = 0.05, nResamples = 10_000) {
+  if (xsA.length !== xsB.length) {
+    throw new Error('pairedPermutationCI: lengths differ');
+  }
+  const n = xsA.length;
+  if (n < 5) throw new Error('pairedPermutationCI: n<5');
+
+  const diffs = xsA.map((a, i) => a - xsB[i]);
+  const observed = mean(diffs);
+
+  const rng = mulberry32(0x9E3779B9);
+  const resampledMeans = new Array(nResamples);
+  for (let r = 0; r < nResamples; r++) {
+    let s = 0;
+    for (let i = 0; i < n; i++) {
+      s += diffs[i] * (rng() < 0.5 ? -1 : 1);
+    }
+    resampledMeans[r] = s / n;
+  }
+  resampledMeans.sort((a, b) => a - b);
+  const resampleMean = mean(resampledMeans);
+  const loIdx = Math.floor((alpha / 2) * nResamples);
+  const hiIdx = Math.ceil((1 - alpha / 2) * nResamples) - 1;
+  const ciLow = observed + (resampledMeans[loIdx] - resampleMean);
+  const ciHigh = observed + (resampledMeans[hiIdx] - resampleMean);
+  return { deltaMean: observed, ciLow, ciHigh };
+}
+
+/**
+ * Aggregate per-seed phase rates into mean/std/ci95 per phase.
+ *
+ * @param {Array<{early: number, mid: number, late: number}>} seedResults
+ * @returns {{
+ *   early: {mean: number, std: number, ci95: number},
+ *   mid:   {mean: number, std: number, ci95: number},
+ *   late:  {mean: number, std: number, ci95: number},
+ * }}
+ */
+export function aggregatePhases(seedResults) {
+  const result = {};
+  for (const phase of ['early', 'mid', 'late']) {
+    const xs = seedResults.map((r) => r[phase]);
+    result[phase] = {
+      mean: mean(xs),
+      std: stdDev(xs),
+      ci95: ciHalfWidth95(xs),
+    };
+  }
+  return result;
+}
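
Review note: a short worked example helps sanity-check the t-based half-width and shows what the floor effect looks like numerically. The sketch below uses compact copies of `mean`, `stdDev`, and `ciHalfWidth95` limited to n = 5 and n = 20. The `spread` values are made up for the arithmetic and are not eval data. The all-zero array mirrors the 0% late-phase rate across 20 seeds reported in the changelog.

```javascript
// Compact copies of the helpers added in aggregate.mjs, limited to n = 5 and n = 20.
const T_CRIT_95 = { 5: 2.776, 20: 2.093 }; // two-sided 95%, df = n - 1
const mean = (xs) => (xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length);
function stdDev(xs) {
  if (xs.length < 2) return 0;
  const m = mean(xs);
  return Math.sqrt(xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1));
}
function ciHalfWidth95(xs) {
  if (xs.length < 5) return 0; // too few seeds: no CI reported
  return ((T_CRIT_95[xs.length] ?? 1.96) * stdDev(xs)) / Math.sqrt(xs.length);
}

// Illustrative rates (made up for the arithmetic, not eval data).
const spread = [0.10, 0.12, 0.14, 0.16, 0.18];
// Floor-effect shape reported in the changelog: 0% late-phase on all 20 seeds.
const floor = new Array(20).fill(0);

console.log(mean(spread).toFixed(2), ciHalfWidth95(spread).toFixed(4)); // 0.14 0.0393
console.log(mean(floor), stdDev(floor), ciHalfWidth95(floor));          // 0 0 0
console.log(ciHalfWidth95([0.1, 0.2, 0.3]));                            // 0 (n < 5)
```

For `spread` the sample variance is 0.004 / 4 = 0.001, so the half-width is 2.776 × √0.001 / √5 ≈ 0.0393. A returned 0 is ambiguous on its own: it can mean "no variance" or "n < 5", so callers should check the seed count before reporting an interval.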