Merged
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,25 @@
# Changelog

## 1.7.5 (2026-05-07)

Sequential-learning benchmark infrastructure release. Closes the v0.39 B3 follow-up that gated public exercising of the dlPFC goal-stack mechanism. Eval ran but stopped per the pre-registered sanity gate due to a floor effect; hypothesis remains open.

### Added

- **Sequential-learning adapter contract — `pushGoal` / `completeGoal` hooks.** `benchmarks/sequential-learning/adapters/interface.mjs` accepts optional paired hooks. `hippo.mjs` adapter implements both via the existing `hippo goal push|complete` CLI commands, with `HIPPO_HOME` / `XDG_DATA_HOME` isolation so the eval can't contaminate the user's real store.
- **Multi-seed eval harness with meaningful seed-driven variance.** `--seed N`, `--n-seeds N`, `--eval-strict` flags on `run.mjs`. `benchmarks/sequential-learning/aggregate.mjs` provides `mean` / `stdDev` / `ciHalfWidth95` (returns 0 for n<5) / `aggregatePhases` / `pairedPermutationCI` (10k resamples, no t-test). `traps.mjs::generateTasks(seed)` randomly assigns categories to position-slots within phase shape groups, preserving total trap count, per-category encounter count, and the early-then-later structure.
- **`--use-goal-stack` runner flag.** When set AND adapter supplies the hooks, the simulator wraps each trap task in goal-push / goal-complete so the dlPFC boost activates. Eval-strict mode hard-fails on hook errors so silent fallback can't masquerade as a null result.
- **Tag-fix on memory store.** Stored memories now include `task.trapCategory` (the category id) as the first tag so that the goal-stack boost, which keys on `goalsByTag.has(memoryTag)`, can match. Before the fix, the boost would have matched zero memories whether or not the mechanism itself works.
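
The seeded category-to-slot assignment can be pictured as a Fisher-Yates shuffle driven by a deterministic PRNG. The sketch below is an illustration only: `traps.mjs` is not part of this diff, so `assignCategoriesToSlots` and the category ids are hypothetical, and the real `generateTasks(seed)` may work differently. It reuses the `mulberry32` generator that `aggregate.mjs` exports.

```javascript
// Deterministic PRNG, copied from aggregate.mjs so the sketch is self-contained.
function mulberry32(seed) {
  let s = seed >>> 0;
  return function () {
    s = (s + 0x6D2B79F5) >>> 0;
    let t = s;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper (not the real traps.mjs code): shuffle the categories
// of one shape group across that group's position slots.
function assignCategoriesToSlots(categories, seed) {
  const rng = mulberry32(seed);
  const slots = categories.slice(); // copy so the input is not mutated
  for (let i = slots.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1)); // uniform index in [0, i]
    [slots[i], slots[j]] = [slots[j], slots[i]];
  }
  return slots;
}

// Hypothetical category ids, for illustration only.
const categories = ['cat-a', 'cat-b', 'cat-c', 'cat-d'];
const runOne = assignCategoriesToSlots(categories, 7);
const runTwo = assignCategoriesToSlots(categories, 7);
console.log(runOne.join(',') === runTwo.join(',')); // true: same seed, same order
```

Because the shuffle only permutes the input, the per-category encounter count within a group is preserved by construction, which is the invariant the changelog entry calls out.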

### Eval

- **Goal-stack lift on sequential-learning benchmark — STOPPED per pre-registered sanity gate.** 4-condition × 20-seed paired eval ran cleanly (zero hook failures, eval-strict mode). Sanity gate fired before the decision rule could apply: hippo-base (C2) measured 0.0% late-phase trap rate (pre-registered band: [4%, 24%] around the README headline 14%). Both C2 and hippo+goal-stack (C3) saturate at 0% late-phase across all 20 seeds — floor effect, no headroom for the goal-stack mechanism to demonstrate further improvement. The −10pp hypothesis remains untested on a discriminating workload. Future eval needs a harder benchmark variant (smaller `--budget`, adversarial categories, or restricted late-phase window). Pre-registration: `docs/evals/2026-05-07-v1.7.5-goal-stack-eval-prereg.md`. Claim inventory: `docs/evals/2026-05-07-v1.7.5-claim-inventory.md`. Full result + investigation: `docs/evals/2026-05-07-v1.7.5-goal-stack-eval-result.md`. Re-derive numbers: `node benchmarks/sequential-learning/analyze-v1.7.5.mjs`.
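
The sanity gate in this entry reduces to a band check on the baseline's late-phase trap rate. A minimal sketch, assuming the gate is a plain inclusive interval test: `checkSanityGate` and its return shape are hypothetical, and the authoritative rule lives in the pre-registration document.

```javascript
// Hypothetical sketch of the pre-registered sanity gate: the baseline (C2)
// late-phase trap rate must fall inside the band before the decision rule runs.
function checkSanityGate(lateRate, band) {
  const [low, high] = band;
  const inBand = lateRate >= low && lateRate <= high;
  return {
    inBand,
    // A rate below the band leaves no headroom for an improvement to show.
    reason: inBand ? 'ok' : lateRate < low ? 'floor' : 'ceiling',
  };
}

const band = [0.04, 0.24];               // the [4%, 24%] band from this entry
const gate = checkSanityGate(0.0, band); // C2 measured 0.0% late-phase
console.log(gate); // { inBand: false, reason: 'floor' }
```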

### Deferred to future release

- Discriminating workload variant for the goal-stack hypothesis (v1.7.6+): reduce store budget, add adversarial trap categories, OR restrict the late-phase metric to last 4 trap encounters.
- vlPFC interference suppression (v1.8.0): real feature work per RESEARCH.md, separate plan.

## 1.7.4 (2026-05-07)

Internal hygiene release closing 3 of the 5 B3 dlPFC follow-ups deferred from v0.39.0. Adds optional `RecallOpts.sessionId` (and `RecallOpts.goalTag`) so MCP `hippo_recall` and HTTP `GET /v1/memories` callers get the dlPFC goal-stack boost — previously CLI-only. Adds `--no-propagate` flag on `goal complete`. Refactors `enforceDepthCapWithinTx` helper.
6 changes: 6 additions & 0 deletions README.md
@@ -85,6 +85,12 @@ hippo recall "data pipeline issues" --budget 2000

---

### What's new in v1.7.5

- **Sequential-learning benchmark gains `pushGoal`/`completeGoal` hooks** + a multi-seed eval harness with seeded category-to-slot variance, a paired sign-flip permutation CI (10k seeded resamples, not an exact enumeration), and `--eval-strict` mode. The dlPFC goal-stack mechanism is now exercisable on the public benchmark.
- **Tag-fix on memory store** so the goal-stack boost can actually match. Pre-fix the boost would have matched zero memories.
- **Eval ran but stopped per pre-registered sanity gate.** Both hippo-base and hippo+goal-stack hit 0% late-phase trap rate across 20 seeds — floor effect prevents H1/H0 discrimination. The −10pp hypothesis remains untested on a discriminating workload. Mechanism shipped, hypothesis open. Pre-reg + result in `docs/evals/`.
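
To see why the tag fix matters, consider how a boost keyed on `goalsByTag.has(memoryTag)` behaves. The sketch below is an illustration under stated assumptions: the shape of `goalsByTag`, the memory objects, the tag values, and the `matchesActiveGoal` helper are all hypothetical; only the `has` lookup comes from the changelog.

```javascript
// Hypothetical: active goals indexed by tag. The changelog only tells us the
// boost checks goalsByTag.has(memoryTag); a Map is assumed here.
const goalsByTag = new Map([['cat-a', { id: 'g_0123456789abcdef' }]]);

// Hypothetical helper: a memory is eligible for the boost if any tag matches.
function matchesActiveGoal(memory) {
  return memory.tags.some((tag) => goalsByTag.has(tag));
}

// Pre-fix: the trap category id is not among the stored tags.
const preFix = { tags: ['benchmark'] };
// Post-fix: task.trapCategory is stored as the first tag.
const postFix = { tags: ['cat-a', 'benchmark'] };

console.log(matchesActiveGoal(preFix));  // false
console.log(matchesActiveGoal(postFix)); // true
```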

### What's new in v1.7.4

- **Goal-stack boost on MCP + HTTP.** Set `RecallOpts.sessionId` (or HTTP `?session_id=...`, or MCP `hippo_recall { session_id }`) and the dlPFC goal-stack boost — previously CLI-only — applies on MCP and HTTP too. Both `api.recall` (primary BM25 band, before fresh-tail / summary appendix) AND MCP's separate `physicsSearch`/`hybridSearch` path are boosted. New `RecallOpts.goalTag` lets callers opt out per-call.
66 changes: 60 additions & 6 deletions benchmarks/sequential-learning/adapters/hippo.mjs
@@ -2,10 +2,19 @@
* Hippo Memory Adapter
*
* Uses the hippo CLI via subprocess calls for full isolation.
* Creates a temporary directory, sets HOME/USERPROFILE to it so the
* global store (~/.hippo) is also isolated from the user's real store.
* Creates a temporary directory, sets HOME/USERPROFILE/HIPPO_HOME to it so the
* global store (~/.hippo) is also isolated from the user's real store, and
* clears XDG_DATA_HOME so it can never leak the real user store either
* (precedence in src/shared.ts:getGlobalRoot is HIPPO_HOME > XDG_DATA_HOME > HOME/.hippo).
*
* Requires: hippo CLI on PATH (npm link or global install)
* v1.7.5 -- adds optional B3 dlPFC goal-stack hooks (pushGoal / completeGoal).
* pushGoal generates a session id, sets HIPPO_SESSION_ID for the rest of the
* task lifespan, and parses the printed `g_<16hex>` id from stdout. completeGoal
* closes the goal with an outcome (1.0 trap avoided, 0.0 trap hit) and clears
* the session so cross-task state cannot leak. Both methods hard-fail (no swallow)
* so the simulator's eval-strict mode can detect a broken mechanism.
*
* Requires: hippo CLI on PATH (npm link or global install).
*/

import { execSync } from 'node:child_process';
@@ -14,20 +23,32 @@ import { join } from 'node:path';
import { tmpdir } from 'node:os';
import { createAdapter } from './interface.mjs';

// v1.7.5 -- session id stays stable across the task lifespan. Set via env on
// every hippo exec so goal push/complete and recall share state.
let _sessionId = null;
let _pushedCount = 0;
let _completedCount = 0;

/**
* Run a hippo CLI command in the temp store directory.
* HOME/USERPROFILE are overridden to isolate from the user's global store.
* Returns stdout as a string, or null on failure.
* HOME/USERPROFILE/HIPPO_HOME are overridden to isolate from the user's
* global store, and XDG_DATA_HOME is blanked so it cannot win the
* getGlobalRoot precedence race. Returns stdout as a string, or null on failure.
*/
function hippoExec(storeDir, args) {
try {
const result = execSync(`hippo ${args}`, {
cwd: storeDir,
env: {
...process.env,
// Override home directory so getGlobalRoot() points into our temp dir
// v1.7.5 codex P0 isolation -- HIPPO_HOME wins over HOME in
// getGlobalRoot, and we blank XDG_DATA_HOME so neither it nor the
// user's real ~/.hippo can leak in.
HIPPO_HOME: storeDir,
HOME: storeDir,
USERPROFILE: storeDir,
XDG_DATA_HOME: '',
...(_sessionId ? { HIPPO_SESSION_ID: _sessionId } : {}),
},
encoding: 'utf-8',
timeout: 15_000,
@@ -90,10 +111,43 @@ export default createAdapter({
hippoExec(this._storeDir, `outcome ${good ? '--good' : '--bad'}`);
},

// v1.7.5 -- B3 goal-stack hooks.
async pushGoal(name) {
_sessionId = `bench-${Date.now()}-${Math.floor(Math.random() * 1e6)}`;
const out = hippoExec(this._storeDir, `goal push ${name}`);
if (!out) {
_sessionId = null;
// v1.7.5 codex P1 -- HARD FAIL. Do not swallow. Eval-strict mode
// wants the run to abort if the mechanism cannot fire.
throw new Error(`hippo goal push failed for name='${name}'`);
}
const match = out.match(/g_[0-9a-f]{16}/);
if (!match) {
_sessionId = null;
throw new Error(`hippo goal push output did not contain a goal id: '${out}'`);
}
_pushedCount++;
return match[0];
},

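async completeGoal(id, good) {
const outcome = good ? '1.0' : '0.0';
const out = hippoExec(this._storeDir, `goal complete ${id} --outcome ${outcome}`);
// Clear the session before any throw so cross-task state cannot leak.
_sessionId = null;
if (out === null) {
// Hard fail, matching pushGoal and the header comment: eval-strict mode
// must see a failed completion rather than a silent no-op.
throw new Error(`hippo goal complete failed for id='${id}'`);
}
_completedCount++;
},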

async cleanup() {
// v1.7.5 codex P1 -- always clear cross-task state, even after exceptions.
_sessionId = null;
if (this._storeDir && existsSync(this._storeDir)) {
rmSync(this._storeDir, { recursive: true, force: true });
}
this._storeDir = null;
},

// Expose counters so the runner / tests can assert non-zero in eval mode.
_stats() {
return { pushed: _pushedCount, completed: _completedCount };
},
});
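
The id handling in `pushGoal` can be exercised without the CLI. A minimal sketch, assuming only the `g_<16hex>` format named in the header comment: `parseGoalId` and the sample stdout strings are hypothetical, since the real `hippo goal push` output is not shown in this diff.

```javascript
// Hypothetical helper mirroring pushGoal's parsing step.
function parseGoalId(stdout) {
  const match = stdout ? stdout.match(/g_[0-9a-f]{16}/) : null;
  if (!match) {
    // Hard fail, as the adapter does, so a broken mechanism cannot pass silently.
    throw new Error(`no goal id found in output: '${stdout}'`);
  }
  return match[0];
}

// Sample output strings are invented for illustration.
const goalId = parseGoalId('pushed goal g_0123456789abcdef\n');
console.log(goalId); // g_0123456789abcdef

let failed = false;
try {
  parseGoalId('error: store locked');
} catch (err) {
  failed = true;
}
console.log(failed); // true
```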
28 changes: 27 additions & 1 deletion benchmarks/sequential-learning/adapters/interface.mjs
@@ -38,11 +38,21 @@
* @property {(query: string) => Promise<RecallResult[]>} recall - Retrieve memories relevant to a query (return top-5)
* @property {(good: boolean) => Promise<void>} outcome - Feedback on the last recall (did it help?)
* @property {() => Promise<void>} cleanup - Tear down the memory store (remove temp dirs)
*
* v1.7.5 -- optional B3 dlPFC goal-stack hooks. Either supply BOTH or NEITHER.
* @property {((name: string) => Promise<string>)} [pushGoal] - Push an active goal at task start. Returns the goal id
* (opaque string, e.g. "g_<16hex>"). Adapter must thread the
* goal's session id into subsequent recall calls so the
* goal-stack boost activates.
* @property {((id: string, good: boolean) => Promise<void>)} [completeGoal] - Complete the goal at task end. `good` is the outcome
* (true = trap avoided, false = trap hit). Adapter MAY
* propagate strength multipliers per its own contract.
*/

/**
* Create a validated adapter from a plain object.
* Throws if any required method is missing.
* Throws if any required method is missing, or if the optional v1.7.5
* goal-stack hooks are supplied unpaired (must be both or neither).
*
* @param {MemoryAdapter} adapter
* @returns {MemoryAdapter}
@@ -59,5 +69,21 @@ export function createAdapter(adapter) {
throw new Error(`Adapter.${key} must be a function`);
}
}
// v1.7.5 -- pushGoal / completeGoal must be supplied as a pair.
const hasPush = 'pushGoal' in adapter;
const hasComplete = 'completeGoal' in adapter;
if (hasPush !== hasComplete) {
throw new Error(
hasPush
? 'Adapter supplies pushGoal but missing completeGoal -- the v1.7.5 B3 hooks are paired'
: 'Adapter supplies completeGoal but missing pushGoal -- the v1.7.5 B3 hooks are paired',
);
}
if (hasPush && typeof adapter.pushGoal !== 'function') {
throw new Error('Adapter.pushGoal must be a function');
}
if (hasComplete && typeof adapter.completeGoal !== 'function') {
throw new Error('Adapter.completeGoal must be a function');
}
return adapter;
}
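
The pairing rule is easiest to see with a few adapter shapes. The sketch below inlines a trimmed copy of the check so that it runs standalone: `checkGoalHookPairing` and `isRejected` are hypothetical names, the error text is shortened, and the real `createAdapter` also validates the required methods.

```javascript
// Trimmed, standalone copy of the pairing check from createAdapter.
function checkGoalHookPairing(adapter) {
  const hasPush = 'pushGoal' in adapter;
  const hasComplete = 'completeGoal' in adapter;
  if (hasPush !== hasComplete) {
    throw new Error('pushGoal and completeGoal must be supplied as a pair');
  }
  for (const key of ['pushGoal', 'completeGoal']) {
    if (key in adapter && typeof adapter[key] !== 'function') {
      throw new Error(`Adapter.${key} must be a function`);
    }
  }
  return adapter;
}

// Returns true if the adapter shape is rejected by the check.
function isRejected(adapter) {
  try {
    checkGoalHookPairing(adapter);
    return false;
  } catch (err) {
    return true;
  }
}

const neither = {};
const both = {
  pushGoal: async () => 'g_0123456789abcdef',
  completeGoal: async () => {},
};
const pushOnly = { pushGoal: async () => 'g_0123456789abcdef' };

console.log(isRejected(neither));  // false
console.log(isRejected(both));     // false
console.log(isRejected(pushOnly)); // true
```

Note that the `in` operator treats a key set to `undefined` as present, so `{ pushGoal: undefined, completeGoal: undefined }` fails the function check rather than counting as "neither".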
138 changes: 138 additions & 0 deletions benchmarks/sequential-learning/aggregate.mjs
@@ -0,0 +1,138 @@
// benchmarks/sequential-learning/aggregate.mjs
// v1.7.5 -- pure aggregation helpers for multi-seed runs. Zero npm deps; only
// Node 22+ built-ins. Keep this file dependency-free so the benchmark runs on
// a vanilla Node install with `node run.mjs`.

/**
* Sample mean. Returns 0 for empty arrays.
* @param {number[]} xs
* @returns {number}
*/
export function mean(xs) {
if (xs.length === 0) return 0;
return xs.reduce((a, b) => a + b, 0) / xs.length;
}

/**
* Sample standard deviation (n-1 denominator). Returns 0 for n<2.
* @param {number[]} xs
* @returns {number}
*/
export function stdDev(xs) {
if (xs.length < 2) return 0;
const m = mean(xs);
const sumSq = xs.reduce((a, b) => a + (b - m) ** 2, 0);
return Math.sqrt(sumSq / (xs.length - 1));
}

// t-distribution critical values for 95% two-sided, df = n - 1.
// Hardcoded for n in [5..30]; falls back to 1.960 (z) for n > 30.
const T_CRIT_95 = {
5: 2.776, 6: 2.571, 7: 2.447, 8: 2.365, 9: 2.306, 10: 2.262,
11: 2.228, 12: 2.201, 13: 2.179, 14: 2.160, 15: 2.145, 16: 2.131,
17: 2.120, 18: 2.110, 19: 2.101, 20: 2.093, 21: 2.086, 22: 2.080,
23: 2.074, 24: 2.069, 25: 2.064, 26: 2.060, 27: 2.056, 28: 2.052,
29: 2.048, 30: 2.045,
};

/**
* 95% CI half-width via t-distribution. v1.7.5 codex P2 -- reject n<5 by
* returning 0. n=2 t-crit=12.706 gives nonsense CIs; eval requires n>=10
* anyway. Smoke runs labelled exploratory, no CI reported.
*
* @param {number[]} xs
* @returns {number}
*/
export function ciHalfWidth95(xs) {
if (xs.length < 5) return 0;
const t = T_CRIT_95[xs.length] ?? 1.960;
return (t * stdDev(xs)) / Math.sqrt(xs.length);
}

/**
* mulberry32 -- deterministic PRNG, dep-free. Exported so traps.mjs can reuse
* the same RNG implementation for seeded category-to-slot assignment.
*
* @param {number} seed integer seed (uint32 coerced)
* @returns {() => number} function returning a uniform float in [0, 1)
*/
export function mulberry32(seed) {
let s = seed >>> 0;
return function () {
s = (s + 0x6D2B79F5) >>> 0;
let t = s;
t = Math.imul(t ^ (t >>> 15), t | 1);
t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
};
}

/**
* Monte Carlo paired sign-flip permutation CI for mean(xsA - xsB). The sign
* flips are sampled (nResamples draws), not enumerated over all 2^n patterns.
*
* v1.7.5 -- used in Task 3 instead of paired t-test: phase rates are bounded
* binomial-like and t-test is fragile at n=20. Permutation makes no normality
* assumption.
*
* The CI is built around the observed mean by recentring the resampled-mean
* distribution: ciLow = observed + (sortedResamples[loIdx] - mean(resamples)).
* This yields a recentred percentile interval: the sign-flip null distribution
* shifted so that it is centred on the observed mean.
*
* Internally seeded with mulberry32(0x9E3779B9) for determinism.
*
* @param {number[]} xsA per-seed metric for condition A
* @param {number[]} xsB per-seed metric for condition B (same seeds, paired)
* @param {number} [alpha=0.05] two-sided alpha for 95% CI
* @param {number} [nResamples=10000] permutation resample count
* @returns {{deltaMean: number, ciLow: number, ciHigh: number}}
*/
export function pairedPermutationCI(xsA, xsB, alpha = 0.05, nResamples = 10_000) {
if (xsA.length !== xsB.length) {
throw new Error('pairedPermutationCI: lengths differ');
}
const n = xsA.length;
if (n < 5) throw new Error('pairedPermutationCI: n<5');

const diffs = xsA.map((a, i) => a - xsB[i]);
const observed = mean(diffs);

const rng = mulberry32(0x9E3779B9);
const resampledMeans = new Array(nResamples);
for (let r = 0; r < nResamples; r++) {
let s = 0;
for (let i = 0; i < n; i++) {
s += diffs[i] * (rng() < 0.5 ? -1 : 1);
}
resampledMeans[r] = s / n;
}
resampledMeans.sort((a, b) => a - b);
const resampleMean = mean(resampledMeans);
const loIdx = Math.floor((alpha / 2) * nResamples);
const hiIdx = Math.ceil((1 - alpha / 2) * nResamples) - 1;
const ciLow = observed + (resampledMeans[loIdx] - resampleMean);
const ciHigh = observed + (resampledMeans[hiIdx] - resampleMean);
return { deltaMean: observed, ciLow, ciHigh };
}

/**
* Aggregate per-seed phase rates into mean/std/ci95 per phase.
*
* @param {Array<{early: number, mid: number, late: number}>} seedResults
* @returns {{
* early: {mean: number, std: number, ci95: number},
* mid: {mean: number, std: number, ci95: number},
* late: {mean: number, std: number, ci95: number},
* }}
*/
export function aggregatePhases(seedResults) {
const result = {};
for (const phase of ['early', 'mid', 'late']) {
const xs = seedResults.map((r) => r[phase]);
result[phase] = {
mean: mean(xs),
std: stdDev(xs),
ci95: ciHalfWidth95(xs),
};
}
return result;
}
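
The floor effect reported for this release has a direct consequence for the interval: when every paired difference is zero, every sign-flipped mean is zero and the CI collapses to a point. The sketch below is a compact, standalone variant of `pairedPermutationCI` (hypothetical name `signFlipCI`, same recentring and seed, no input validation) that shows both the degenerate case and an ordinary one. The input arrays are made-up values, not eval results.

```javascript
// Deterministic PRNG, copied from this file so the sketch is self-contained.
function mulberry32(seed) {
  let s = seed >>> 0;
  return function () {
    s = (s + 0x6D2B79F5) >>> 0;
    let t = s;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Compact variant of pairedPermutationCI, for illustration only.
function signFlipCI(xsA, xsB, alpha = 0.05, nResamples = 10000) {
  const diffs = xsA.map((a, i) => a - xsB[i]);
  const n = diffs.length;
  const observed = diffs.reduce((a, b) => a + b, 0) / n;
  const rng = mulberry32(0x9E3779B9);
  const means = [];
  for (let r = 0; r < nResamples; r++) {
    let s = 0;
    // Randomly flip the sign of each paired difference.
    for (let i = 0; i < n; i++) s += diffs[i] * (rng() < 0.5 ? -1 : 1);
    means.push(s / n);
  }
  means.sort((a, b) => a - b);
  const centre = means.reduce((a, b) => a + b, 0) / nResamples;
  const lo = means[Math.floor((alpha / 2) * nResamples)];
  const hi = means[Math.ceil((1 - alpha / 2) * nResamples) - 1];
  // Recentre the resampled distribution on the observed mean.
  return { deltaMean: observed, ciLow: observed + (lo - centre), ciHigh: observed + (hi - centre) };
}

// Degenerate case: both conditions saturate at 0 on all 20 seeds.
const zeros = new Array(20).fill(0);
const floorCase = signFlipCI(zeros, zeros);
console.log(floorCase.ciLow === 0 && floorCase.ciHigh === 0); // true

// Ordinary case with made-up rates: the interval brackets the observed delta.
const condA = new Array(20).fill(0.05);
const condB = new Array(20).fill(0.15);
const shifted = signFlipCI(condA, condB);
console.log(shifted.ciLow < shifted.deltaMean && shifted.deltaMean < shifted.ciHigh);
```

A zero-width interval here says nothing about the mechanism; it only reflects that the workload left no variance to resample, which is why the entry above treats the hypothesis as untested.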