feat: add experiment loop for metric-driven auto-research iterations

codexstar69 · codexstar69 · commit b73627a06f4f · 2026-03-14T12:55:47.000+05:30
Adds experiment-loop.cjs with init/run/log/check-continue/status/stop
CLI commands, experiment.schema.json, comprehensive test suite (61 tests),
and updated loop.md with coverage-driven scan orchestration.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,30 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [3.0.9] - 2026-03-13
+
+### Added
+- `scripts/experiment-loop.cjs` — autonomous experiment loop engine inspired by pi-autoresearch. Provides metric-driven iteration with baseline + delta tracking, append-only JSONL persistence, segmented sessions, and full state reconstruction from log alone.
+- `schemas/experiment.schema.json` — JSON schema for experiment JSONL entries (config, result, resume types)
+- `check-continue` command — single gateway that checks all loop conditions (stop file, iteration cap, consecutive crash breaker, resume cooldown) before each iteration
+- Hard iteration cap (default: 10, configurable via `--max-iterations`) prevents runaway loops
+- Consecutive crash breaker (3 in a row) auto-stops to prevent token waste
+- Stop-file cancellation (`experiment-loop.cjs stop` or `touch .bug-hunter/experiment.stop`) for easy user interruption
+- Auto-resume with 5-minute cooldown for graceful recovery after agent context limits
+- Secondary metric consistency enforcement — locks metric names after first result in a segment
+- Backpressure checks — optional `experiment.checks.sh` script gates keep/discard decisions
+- 40 new tests covering all experiment-loop commands, guardrails, and edge cases (including negative metrics, zero/negative max-iterations, --duration-ms)
+
+### Changed
+- **Experiment tracking is now active by default** when `LOOP_MODE=true` — no `--experiment` flag needed
+- `SKILL.md` now auto-initializes `experiment-loop.cjs` during loop setup (init + check-continue wiring)
+- `modes/loop.md` updated with full experiment tracking integration, per-iteration workflow, and documentation of all stop mechanisms (user-initiated vs automatic)
+- `scripts/schema-runtime.cjs` registers the new `experiment` schema
+- `schemas/experiment.schema.json` cleaned: removed unused `command` and `passed` fields, added `maxIterations` field
+- `scripts/experiment-loop.cjs` `log` command now accepts `--duration-ms` flag to persist actual iteration duration (was hardcoded to 0)
+- `llms.txt` and `llms-full.txt` updated with experiment loop capabilities
+- Test suite expanded from 61 to **101 tests** (0 failures)
+
 ## [3.0.8] - 2026-03-13
 
 ### Highlights
@@ -239,7 +263,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Coverage enforcement - partial audits produce explicit warnings
 - Large codebase strategy with domain-first tiered scanning
 
-[Unreleased]: https://github.com/codexstar69/bug-hunter/compare/v3.0.8...HEAD
+[Unreleased]: https://github.com/codexstar69/bug-hunter/compare/v3.0.9...HEAD
+[3.0.9]: https://github.com/codexstar69/bug-hunter/compare/v3.0.8...v3.0.9
 [3.0.8]: https://github.com/codexstar69/bug-hunter/compare/v3.0.7...v3.0.8
 [3.0.7]: https://github.com/codexstar69/bug-hunter/compare/v3.0.5...v3.0.7
 [3.0.5]: https://github.com/codexstar69/bug-hunter/compare/v3.0.4...v3.0.5
diff --git a/SKILL.md b/SKILL.md
@@ -471,10 +471,28 @@ Read the corresponding mode file using `STRATEGY` from the triage JSON:
 
 **Backend override for local-sequential:** If `AGENT_BACKEND = "local-sequential"`, read `SKILL_DIR/modes/local-sequential.md` instead of the size-based mode file. The local-sequential mode handles all sizes internally with its own chunking logic.
 
-If LOOP_MODE=true, also read:
+If LOOP_MODE=true, also read (loop.md includes experiment tracking with iteration caps, stop-file safety, and auto-resume):
 - `SKILL_DIR/modes/fix-loop.md` when FIX_MODE=true
 - `SKILL_DIR/modes/loop.md` otherwise
 
+**CRITICAL — experiment tracking initialization:** When `LOOP_MODE=true`, initialize experiment tracking BEFORE the first pipeline iteration by running:
+```bash
+node "$SKILL_DIR/scripts/experiment-loop.cjs" init \
+  .bug-hunter/experiment.jsonl \
+  "bug-hunt-$(date +%Y%m%d)" \
+  bugs_confirmed \
+  higher \
+  count \
+  --max-iterations "$MAX_LOOP_ITERATIONS"
+```
+Then before each iteration, call `check-continue`:
+```bash
+node "$SKILL_DIR/scripts/experiment-loop.cjs" check-continue \
+  .bug-hunter/experiment.jsonl \
+  --stop-file .bug-hunter/experiment.stop
+```
+If `continue` is false, stop the loop immediately. After each iteration, log the result with `log`. This is active by default — no `--experiment` flag needed.
+
 **CRITICAL — ralph-loop integration:** When `LOOP_MODE=true`, you MUST call the `ralph_start` tool before running the first pipeline iteration. The loop mode files (`loop.md` / `fix-loop.md`) contain the exact `ralph_start` call to make, including the `taskContent` and `maxIterations` parameters. Without calling `ralph_start`, the loop will NOT iterate — it will run once and stop. After each iteration, call `ralph_done` to continue, or output `<promise>COMPLETE</promise>` when done.
 
 Report the chosen mode to the user.
diff --git a/llms-full.txt b/llms-full.txt
@@ -110,6 +110,7 @@ Critical and High security findings receive CVSS 3.1 base scores with attack vec
 | `--deps` | Include dependency CVE scan |
 | `--autonomous` | No-intervention auto-fix run |
 | `--no-loop` | Single pass (disable iterative coverage) |
+| `--max-iterations <n>` | Hard cap on loop iterations (default: 10) |
 
 ## Output Files
 
@@ -127,6 +128,7 @@ Critical and High security findings receive CVSS 3.1 base scores with attack vec
 | `.bug-hunter/fix-report.json` | Fix results |
 | `.bug-hunter/coverage.json` | Loop coverage state |
 | `.bug-hunter/coverage.md` | Coverage summary |
+| `.bug-hunter/experiment.jsonl` | Experiment loop log (append-only, metric tracking) |
 | `.bug-hunter/threat-model.md` | STRIDE threat model |
 | `.bug-hunter/dep-findings.json` | Dependency CVE results |
 
@@ -162,11 +164,12 @@ bug-hunter/
 ├── prompts/examples/      # Calibration examples for Hunter/Skeptic
 ├── schemas/               # JSON Schema contracts for all artifacts
 ├── scripts/               # Node.js helpers (zero AI tokens)
-│   ├── run-bug-hunter.cjs # Main orchestrator script
-│   ├── doc-lookup.cjs     # Context Hub + Context7 doc lookup
-│   ├── context7-api.cjs   # Context7 standalone fallback
-│   ├── prepublish-guard.cjs # Publish safety net
-│   └── tests/             # Test suite (61 tests)
+│   ├── run-bug-hunter.cjs    # Main orchestrator script
+│   ├── experiment-loop.cjs   # Autonomous experiment loop with metrics + stop-file
+│   ├── doc-lookup.cjs        # Context Hub + Context7 doc lookup
+│   ├── context7-api.cjs      # Context7 standalone fallback
+│   ├── prepublish-guard.cjs  # Publish safety net
+│   └── tests/                # Test suite (101 tests)
 ├── templates/             # Subagent launch template
 └── test-fixture/          # 6 planted bugs for validation
 ```
diff --git a/llms.txt b/llms.txt
@@ -13,6 +13,7 @@ Bug Hunter is an automated adversarial code auditing skill for AI coding agents.
 - Security classification: STRIDE threat categories, CWE weakness IDs, CVSS 3.1 scoring
 - Documentation verification: checks claims against official library docs via Context Hub + Context7
 - Safe auto-fix: git-branched fixes with worktree isolation, canary rollout, test verification, and automatic rollback
+- Experiment loop: autonomous iteration with metric tracking, hard iteration caps, stop-file cancellation, and auto-resume on context limits
 - Dependency CVE scanning: lockfile-aware audits for npm, pnpm, yarn, bun (including Bun 1.2+ text-format lockfiles)
 - PR review: first-class `--pr` workflow for reviewing current, recent, or numbered PRs
 - Enterprise security pack: bundled STRIDE threat modeling, vulnerability validation, and security review skills
diff --git a/modes/loop.md b/modes/loop.md
@@ -120,6 +120,182 @@ Each iteration after the first:
 
 - Max iterations should scale with the queue size so autonomous runs do not stop early
 - Each iteration only scans NEW files — no re-scanning already-DONE files
-- User can stop anytime with ESC or `/ralph-stop`
+- User can stop anytime with ESC, `/ralph-stop`, or `experiment-loop.cjs stop`
 - Canonical state is in `.bug-hunter/coverage.json`; `coverage.md` is derived
   and fully resumable from that JSON
+
+---
+
+## Experiment Tracking (autoresearch integration)
+
+When `LOOP_MODE=true`, each loop iteration is automatically tracked as an **experiment** using the append-only JSONL experiment log. This is active by default — no extra flags needed. It provides metric-driven optimization with baseline comparison, auto-resume, and user-interruptible stop files.
+
+### Setup (first iteration only)
+
+Before the first pipeline iteration, initialize the experiment session:
+
+```bash
+node scripts/experiment-loop.cjs init \
+  .bug-hunter/experiment.jsonl \
+  "bug-hunt-$(date +%Y%m%d)" \
+  bugs_confirmed \
+  higher \
+  count \
+  --max-iterations 10
+```
+
+The `--max-iterations` flag sets the **hard iteration cap** for the session (default: 10). The loop will automatically stop when this cap is reached — no runaway loops. Each subsequent `init` call starts a **new segment** with its own baseline and counter reset.
+
+### Per-iteration workflow
+
+Each iteration follows the **check-continue → run → log** pattern:
+
+1. **Check continue** — the single gateway before every iteration:
+   ```bash
+   node scripts/experiment-loop.cjs check-continue \
+     .bug-hunter/experiment.jsonl \
+     --stop-file .bug-hunter/experiment.stop
+   ```
+
+   This checks ALL conditions in one call and returns a clear yes/no:
+   - `{ "continue": true, "iteration": 3, "remaining": 7 }` — safe to proceed
+   - `{ "continue": false, "reason": "user-stopped" }` — user requested stop
+   - `{ "continue": false, "reason": "max-iterations-reached" }` — hit the cap
+   - `{ "continue": false, "reason": "consecutive-crashes" }` — 3 crashes in a row
+   - `{ "continue": false, "reason": "resume-cooldown" }` — auto-resume too soon
+
+   **If `continue` is false, the loop MUST stop.** Do not override.
+
+2. **Run experiment** — execute the pipeline and measure:
+   ```bash
+   node scripts/experiment-loop.cjs run \
+     .bug-hunter/experiment.jsonl \
+     "node scripts/run-bug-hunter.cjs run --files-json .bug-hunter/triage-files.json" \
+     --stop-file .bug-hunter/experiment.stop \
+     --checks-script .bug-hunter/experiment.checks.sh
+   ```
+
+   The run command:
+   - Checks the stop file before executing (GUARDRAIL)
+   - Times wall-clock duration
+   - Captures stdout/stderr
+   - Parses `METRIC name=value` lines from stdout
+   - Runs the optional checks script if the command passes (backpressure GUARDRAIL)
+
+3. **Log result** — record the outcome:
+   ```bash
+   # Status is one of: keep, discard, crash, checks_failed.
+   # The numeric argument is the primary metric value (e.g., bugs confirmed).
+   node scripts/experiment-loop.cjs log \
+     .bug-hunter/experiment.jsonl \
+     keep \
+     12 \
+     --description "Iteration 3: scanned auth + payments modules" \
+     --secondary '{"false_positives":2,"files_scanned":15,"fix_success_rate":85}'
+   ```
+
+   The log command:
+   - Validates secondary metric consistency (GUARDRAIL — rejects missing/new metrics unless `--force true`)
+   - Auto-commits on `keep` status (configurable via `--auto-commit false`)
+   - Computes delta from baseline (% improvement)
+   - Returns whether this is the new best result
+
+4. **Check status** — see cumulative progress:
+   ```bash
+   node scripts/experiment-loop.cjs status .bug-hunter/experiment.jsonl
+   ```
+
+### Stopping the loop
+
+#### User-initiated stop (easy, immediate)
+
+The user can cancel the loop at any time. Any of these methods works; they differ only in when they take effect:
+
+| Method | How | When it takes effect |
+|--------|-----|---------------------|
+| **ESC key** | Press ESC in the terminal | Immediate — kills current iteration |
+| **Ctrl+C** | Terminal interrupt | Immediate |
+| **`/ralph-stop`** | Type in the CLI | End of current iteration |
+| **Stop file** | `node scripts/experiment-loop.cjs stop` | Before next iteration |
+| **Touch file** | `touch .bug-hunter/experiment.stop` | Before next iteration |
+
+The `check-continue` and `run` commands both check the stop file, so the loop will halt gracefully at the next natural checkpoint.
+
+> **Interaction with ralph-loop:** ESC and Ctrl+C kill the process immediately (ralph-loop handles cleanup). The stop file is a softer mechanism — it lets the current operation finish, then halts before the next iteration. Both work independently. If a stale stop file is left behind from a previous run, `check-continue` will detect it and refuse to proceed — so always clean up.
+
+To resume after a user stop:
+
+```bash
+node scripts/experiment-loop.cjs clear-stop
+```
+
+#### Automatic stop (system-initiated)
+
+The system will automatically stop the loop when ANY of these conditions are met — no user action required:
+
+| Condition | Default | Why |
+|-----------|---------|-----|
+| **Iteration cap reached** | 10 iterations | Prevents runaway loops. Configurable via `--max-iterations`. |
+| **3 consecutive crashes** | 3 in a row | Something is broken — don't waste tokens. Fix and re-init. |
+| **Resume cooldown** | 5 minutes | Prevents rapid-fire auto-resumes when agent hits context limits. |
+
+These are the same checks that `check-continue` evaluates. The agent MUST call `check-continue` before every iteration and obey the result.
+
+### Auto-resume on agent context limit (pi-autoresearch pattern)
+
+When the agent's context window fills up mid-loop, it dies. On restart:
+
+1. The agent reads `.bug-hunter/experiment.jsonl` — all state is reconstructable from this file alone
+2. It reads `.bug-hunter/coverage.json` — knows which files were already scanned
+3. It calls `check-continue` — which verifies the 5-minute cooldown has passed
+4. If `check-continue` returns `{ "continue": true }`, it calls `record-resume` and picks up where it left off
+5. If the cooldown hasn't elapsed, it waits or asks the user
+
+```bash
+# Record a resume event (resets the cooldown timer) — call this after check-continue passes
+node scripts/experiment-loop.cjs record-resume .bug-hunter/experiment.jsonl
+```
+
+> **Note:** `can-resume` and `record-resume` are low-level primitives. In normal operation, always use `check-continue` as the primary gateway — it already includes the resume cooldown check along with all other conditions. Use `can-resume` only for diagnostic purposes.
+
+This is distinct from user-initiated stop: the agent auto-resumes after context limits, but respects user stop files.
+
+### Metrics tracked
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `bugs_confirmed` | Primary | Number of bugs surviving the full adversarial pipeline |
+| `false_positives` | Secondary | Findings killed by Skeptic + Referee |
+| `files_scanned` | Secondary | Files processed this iteration |
+| `fix_success_rate` | Secondary | % of fixes that passed verification |
+
+Secondary metrics are **locked after the first result** in a segment. All subsequent results must provide the same set of secondary metrics (or use `--force true` to change them). This prevents inconsistent tracking.
+
+### JSONL file format
+
+The experiment log at `.bug-hunter/experiment.jsonl` is append-only. Each line is one of:
+
+**Config header** (segment boundary):
+```json
+{"type":"config","segment":0,"timestamp":1710000000000,"name":"bug-hunt-20260313","metric":{"name":"bugs_confirmed","unit":"count","direction":"higher"}}
+```
+
+**Result entry**:
+```json
+{"type":"result","segment":0,"timestamp":1710000060000,"value":12,"secondaryMetrics":{"false_positives":2,"files_scanned":15},"status":"keep","description":"Iteration 1","commit":"abc1234","durationMs":45000}
+```
+
+**Resume marker**:
+```json
+{"type":"resume","timestamp":1710000360000}
+```
+
+### Guardrails summary
+
+| Guardrail | Source | Implementation |
+|-----------|--------|----------------|
+| Stop file checked before every run | pi-autoresearch | `run` command exits immediately if `.bug-hunter/experiment.stop` exists |
+| JSONL is append-only | pi-autoresearch | Never modify existing entries; full state reconstructable from log |
+| Secondary metric consistency | pi-autoresearch | Rejects missing/new metrics unless `--force` is used |
+| Auto-resume rate limiting | pi-autoresearch | 5-minute cooldown between resume events |
+| Backpressure checks | pi-autoresearch | Optional `experiment.checks.sh` must pass before `keep` |
+| Segment boundaries | pi-autoresearch | Each `init` starts fresh baseline; old segments preserved |
+| Output truncation | pi-autoresearch | stdout/stderr capped at 50KB to prevent memory bloat |
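
The `check-continue` gateway documented in the `modes/loop.md` hunk above can be sketched roughly as follows. The contents of `scripts/experiment-loop.cjs` are not shown in this diff, so this is an illustration under stated assumptions, not the shipped implementation. The function name `checkContinue`, its options, the order of the checks, the way `iteration` and `remaining` are counted, and the rule that the cooldown only applies while a resume marker is the last log entry are all hypothetical.

```javascript
// Hypothetical defaults mirroring the values documented in loop.md.
const DEFAULT_MAX_ITERATIONS = 10;
const MAX_CONSECUTIVE_CRASHES = 3;
const RESUME_COOLDOWN_MS = 5 * 60 * 1000;

// entries: parsed JSONL objects in file order.
// stopFileExists: whether .bug-hunter/experiment.stop is present.
function checkContinue(entries, { stopFileExists = false, now = Date.now() } = {}) {
  if (stopFileExists) return { continue: false, reason: 'user-stopped' };

  // The current segment is everything after the most recent config header.
  let configIndex = -1;
  for (let i = entries.length - 1; i >= 0; i--) {
    if (entries[i].type === 'config') { configIndex = i; break; }
  }
  const config = configIndex >= 0 ? entries[configIndex] : {};
  const results = entries
    .slice(configIndex + 1)
    .filter((entry) => entry.type === 'result');

  // Assumption: every logged result counts toward the cap, whatever its status.
  const max = config.maxIterations || DEFAULT_MAX_ITERATIONS;
  if (results.length >= max) {
    return { continue: false, reason: 'max-iterations-reached' };
  }

  // Count the trailing run of crashed results in this segment.
  let crashes = 0;
  for (let i = results.length - 1; i >= 0 && results[i].status === 'crash'; i--) {
    crashes++;
  }
  if (crashes >= MAX_CONSECUTIVE_CRASHES) {
    return { continue: false, reason: 'consecutive-crashes' };
  }

  // Assumption: the cooldown applies only while a resume marker is the last entry.
  const last = entries[entries.length - 1];
  if (last && last.type === 'resume' && now - last.timestamp < RESUME_COOLDOWN_MS) {
    return { continue: false, reason: 'resume-cooldown' };
  }

  const iteration = results.length + 1; // the iteration about to start
  return { continue: true, iteration, remaining: max - iteration };
}
```

Because the decision is a pure function of the log entries plus the stop-file flag, each of the documented stop reasons can be exercised without touching the filesystem.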
diff --git a/package.json b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@codexstar/bug-hunter",
-  "version": "3.0.8",
+  "version": "3.0.9",
   "description": "Adversarial AI bug hunter — multi-agent pipeline finds security vulnerabilities, logic errors, and runtime bugs, then fixes them autonomously. Works with Claude Code, Cursor, Codex CLI, Copilot, Kiro, and more.",
   "license": "MIT",
   "main": "bin/bug-hunter",
diff --git a/schemas/experiment.schema.json b/schemas/experiment.schema.json
@@ -0,0 +1,61 @@
+{
+  "$schema": "http://json-schema.org/draft-07/schema#",
+  "schemaVersion": 1,
+  "artifact": "experiment",
+  "title": "Bug Hunter Experiment Log Entry",
+  "description": "Schema for individual JSONL experiment log entries. Config headers, result entries, and resume markers share this envelope.",
+  "type": "object",
+  "required": ["type"],
+  "properties": {
+    "type": {
+      "type": "string",
+      "enum": ["config", "result", "resume"]
+    },
+    "segment": {
+      "type": "integer",
+      "minimum": 0
+    },
+    "timestamp": {
+      "type": "number"
+    },
+    "name": {
+      "type": "string",
+      "minLength": 1
+    },
+    "metric": {
+      "type": "object",
+      "properties": {
+        "name": { "type": "string", "minLength": 1 },
+        "unit": { "type": "string" },
+        "direction": { "type": "string", "enum": ["lower", "higher"] }
+      },
+      "required": ["name", "direction"]
+    },
+    "value": {
+      "type": "number"
+    },
+    "secondaryMetrics": {
+      "type": "object",
+      "additionalProperties": { "type": "number" }
+    },
+    "status": {
+      "type": "string",
+      "enum": ["keep", "discard", "crash", "checks_failed"]
+    },
+    "description": {
+      "type": "string"
+    },
+    "commit": {
+      "type": "string"
+    },
+    "durationMs": {
+      "type": "number",
+      "minimum": 0
+    },
+    "maxIterations": {
+      "type": "integer",
+      "minimum": 1
+    }
+  },
+  "additionalProperties": false
+}
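
The changelog states that full state can be reconstructed from the log alone. A minimal sketch of that idea, using only the fields defined in the schema above, might look like this. The function name `summarizeLog`, the choice of the first `keep` result as the segment baseline, and the delta formula are assumptions made for illustration; the actual logic in `scripts/experiment-loop.cjs` is not shown in this diff.

```javascript
// Rebuild the current segment's summary from raw JSONL text.
function summarizeLog(jsonlText) {
  const entries = jsonlText
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));

  let config = null;
  let results = [];
  for (const entry of entries) {
    if (entry.type === 'config') {
      config = entry; // a config header opens a new segment
      results = [];   // older segments stay in the file but are ignored here
    } else if (entry.type === 'result') {
      results.push(entry);
    }
  }
  if (!config) return null;

  // Assumption: only `keep` results define the baseline and the best value.
  const kept = results.filter((r) => r.status === 'keep');
  const higherIsBetter = config.metric.direction === 'higher';
  const baseline = kept.length > 0 ? kept[0].value : null;
  let best = baseline;
  for (const r of kept) {
    if (higherIsBetter ? r.value > best : r.value < best) best = r.value;
  }

  // Percentage change of the best value relative to the baseline.
  // Math.abs keeps the sign meaningful when the baseline is negative.
  const deltaPct = baseline === null || baseline === 0
    ? null
    : ((best - baseline) / Math.abs(baseline)) * 100;

  return {
    segment: config.segment,
    metric: config.metric.name,
    iterations: results.length,
    baseline,
    best,
    deltaPct,
  };
}
```

For a `lower` metric, a negative `deltaPct` in this sketch indicates improvement, since the best value is then smaller than the baseline.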
diff --git a/scripts/experiment-loop.cjs b/scripts/experiment-loop.cjs
diff --git a/scripts/schema-runtime.cjs b/scripts/schema-runtime.cjs
diff --git a/scripts/tests/experiment-loop.test.cjs b/scripts/tests/experiment-loop.test.cjs
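
The script and test hunks are empty in this view, so the two behaviours that `modes/loop.md` attributes to them (parsing `METRIC name=value` lines from stdout, and rejecting missing or new secondary metrics) are sketched below for orientation only. The helper names `parseMetrics` and `checkSecondaryConsistency`, the accepted metric-name characters, and the decision to skip non-numeric values are assumptions, not the committed code.

```javascript
// Extract `METRIC name=value` lines from captured stdout.
function parseMetrics(stdout) {
  const metrics = {};
  for (const line of stdout.split(/\r?\n/)) {
    const match = /^METRIC\s+([A-Za-z_][\w.-]*)=(\S+)$/.exec(line.trim());
    if (!match) continue;
    const value = Number(match[2]);
    // Assumption: values that are not finite numbers are silently skipped.
    if (Number.isFinite(value)) metrics[match[1]] = value;
  }
  return metrics;
}

// Compare a result's secondary metrics against the names locked by the
// first result in the segment.
function checkSecondaryConsistency(lockedNames, secondary) {
  const given = Object.keys(secondary);
  const missing = lockedNames.filter((name) => !given.includes(name));
  const added = given.filter((name) => !lockedNames.includes(name));
  return { ok: missing.length === 0 && added.length === 0, missing, added };
}
```

A caller that honours `--force true` would presumably skip the consistency check or relock the metric names; that behaviour is not modelled here.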
