
Commit b73627a

feat: add experiment loop for metric-driven auto-research iterations
Adds experiment-loop.cjs with init/run/log/check-continue/status/stop CLI commands, experiment.schema.json, comprehensive test suite (61 tests), and updated loop.md with coverage-driven scan orchestration.
1 parent d01b83b commit b73627a

10 files changed

Lines changed: 1967 additions & 9 deletions

File tree

CHANGELOG.md

Lines changed: 26 additions & 1 deletion
@@ -5,6 +5,30 @@ All notable changes to this project will be documented in this file.
55
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
66
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
77

8+
## [3.0.9] - 2026-03-13
9+
10+
### Added
11+
- `scripts/experiment-loop.cjs` — autonomous experiment loop engine inspired by pi-autoresearch. Provides metric-driven iteration with baseline + delta tracking, append-only JSONL persistence, segmented sessions, and full state reconstruction from log alone.
12+
- `schemas/experiment.schema.json` — JSON schema for experiment JSONL entries (config, result, resume types)
13+
- `check-continue` command — single gateway that checks all loop conditions (stop file, iteration cap, consecutive crash breaker, resume cooldown) before each iteration
14+
- Hard iteration cap (default: 10, configurable via `--max-iterations`) prevents runaway loops
15+
- Consecutive crash breaker (3 in a row) auto-stops to prevent token waste
16+
- Stop-file cancellation (`experiment-loop.cjs stop` or `touch .bug-hunter/experiment.stop`) for easy user interruption
17+
- Auto-resume with 5-minute cooldown for graceful recovery after agent context limits
18+
- Secondary metric consistency enforcement — locks metric names after first result in a segment
19+
- Backpressure checks — optional `experiment.checks.sh` script gates keep/discard decisions
20+
- 40 new tests covering all experiment-loop commands, guardrails, and edge cases (including negative metrics, zero/negative max-iterations, --duration-ms)
21+
22+
### Changed
23+
- **Experiment tracking is now active by default** when `LOOP_MODE=true` — no `--experiment` flag needed
24+
- `SKILL.md` now auto-initializes `experiment-loop.cjs` during loop setup (init + check-continue wiring)
25+
- `modes/loop.md` updated with full experiment tracking integration, per-iteration workflow, and documentation of all stop mechanisms (user-initiated vs automatic)
26+
- `scripts/schema-runtime.cjs` registers the new `experiment` schema
27+
- `schemas/experiment.schema.json` cleaned: removed unused `command` and `passed` fields, added `maxIterations` field
28+
- `scripts/experiment-loop.cjs` `log` command now accepts `--duration-ms` flag to persist actual iteration duration (was hardcoded to 0)
29+
- `llms.txt` and `llms-full.txt` updated with experiment loop capabilities
30+
- Test suite expanded from 61 to **101 tests** (0 failures)
31+
832
## [3.0.8] - 2026-03-13
933

1034
### Highlights
@@ -239,7 +263,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
239263
- Coverage enforcement - partial audits produce explicit warnings
240264
- Large codebase strategy with domain-first tiered scanning
241265

242-
[Unreleased]: https://github.com/codexstar69/bug-hunter/compare/v3.0.8...HEAD
266+
[Unreleased]: https://github.com/codexstar69/bug-hunter/compare/v3.0.9...HEAD
267+
[3.0.9]: https://github.com/codexstar69/bug-hunter/compare/v3.0.8...v3.0.9
243268
[3.0.8]: https://github.com/codexstar69/bug-hunter/compare/v3.0.7...v3.0.8
244269
[3.0.7]: https://github.com/codexstar69/bug-hunter/compare/v3.0.5...v3.0.7
245270
[3.0.5]: https://github.com/codexstar69/bug-hunter/compare/v3.0.4...v3.0.5

SKILL.md

Lines changed: 19 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -471,10 +471,28 @@ Read the corresponding mode file using `STRATEGY` from the triage JSON:
471471

472472
**Backend override for local-sequential:** If `AGENT_BACKEND = "local-sequential"`, read `SKILL_DIR/modes/local-sequential.md` instead of the size-based mode file. The local-sequential mode handles all sizes internally with its own chunking logic.
473473

474-
If LOOP_MODE=true, also read:
474+
If LOOP_MODE=true, also read (loop.md includes experiment tracking with iteration caps, stop-file safety, and auto-resume):
475475
- `SKILL_DIR/modes/fix-loop.md` when FIX_MODE=true
476476
- `SKILL_DIR/modes/loop.md` otherwise
477477

478+
**CRITICAL — experiment tracking initialization:** When `LOOP_MODE=true`, initialize experiment tracking BEFORE the first pipeline iteration by running:
479+
```bash
node "$SKILL_DIR/scripts/experiment-loop.cjs" init \
  .bug-hunter/experiment.jsonl \
  "bug-hunt-$(date +%Y%m%d)" \
  bugs_confirmed \
  higher \
  count \
  --max-iterations "$MAX_LOOP_ITERATIONS"
```
488+
Then before each iteration, call `check-continue`:
489+
```bash
node "$SKILL_DIR/scripts/experiment-loop.cjs" check-continue \
  .bug-hunter/experiment.jsonl \
  --stop-file .bug-hunter/experiment.stop
```
494+
If `continue` is false, stop the loop immediately. After each iteration, log the result with `log`. This is active by default — no `--experiment` flag needed.
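The control flow described here (gate, run, log, repeat) can be sketched as a small driver. This is only an illustration under stated assumptions: `runLoop` and its three callbacks are hypothetical names standing in for calls to `check-continue`, the pipeline, and `log`, and logging a crash with value `0` is a placeholder choice, not documented behavior.

```javascript
// Hypothetical driver for the check-continue -> run -> log pattern.
// The callbacks stand in for the real CLI calls; their shapes are assumptions.
function runLoop({ checkContinue, runIteration, logResult }) {
  const history = [];
  for (;;) {
    const gate = checkContinue();
    // Obey the gate: never start an iteration when `continue` is false.
    if (!gate.continue) return { stoppedBecause: gate.reason, history };

    let outcome;
    try {
      outcome = { status: "keep", value: runIteration(gate.iteration) };
    } catch (err) {
      // Placeholder value for a crashed iteration (assumption).
      outcome = { status: "crash", value: 0, error: String(err) };
    }
    logResult(outcome);
    history.push(outcome);
  }
}

// In-memory stand-ins so the sketch runs without the real CLI.
const maxIterations = 3;
let completed = 0;
const result = runLoop({
  checkContinue: () =>
    completed >= maxIterations
      ? { continue: false, reason: "max-iterations-reached" }
      : { continue: true, iteration: completed + 1 },
  runIteration: (iteration) => iteration * 2, // pretend metric value
  logResult: () => {
    completed += 1;
  },
});
console.log(result.stoppedBecause, result.history.length);
```

The gate is consulted at the top of every pass, so a stop condition that appears mid-iteration takes effect before the next iteration rather than interrupting the current one.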
495+
478496
**CRITICAL — ralph-loop integration:** When `LOOP_MODE=true`, you MUST call the `ralph_start` tool before running the first pipeline iteration. The loop mode files (`loop.md` / `fix-loop.md`) contain the exact `ralph_start` call to make, including the `taskContent` and `maxIterations` parameters. Without calling `ralph_start`, the loop will NOT iterate — it will run once and stop. After each iteration, call `ralph_done` to continue, or output `<promise>COMPLETE</promise>` when done.
479497

480498
Report the chosen mode to the user.

llms-full.txt

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -110,6 +110,7 @@ Critical and High security findings receive CVSS 3.1 base scores with attack vec
110110
| `--deps` | Include dependency CVE scan |
111111
| `--autonomous` | No-intervention auto-fix run |
112112
| `--no-loop` | Single pass (disable iterative coverage) |
113+
| `--max-iterations <n>` | Hard cap on loop iterations (default: 10) |
113114

114115
## Output Files
115116

@@ -127,6 +128,7 @@ Critical and High security findings receive CVSS 3.1 base scores with attack vec
127128
| `.bug-hunter/fix-report.json` | Fix results |
128129
| `.bug-hunter/coverage.json` | Loop coverage state |
129130
| `.bug-hunter/coverage.md` | Coverage summary |
131+
| `.bug-hunter/experiment.jsonl` | Experiment loop log (append-only, metric tracking) |
130132
| `.bug-hunter/threat-model.md` | STRIDE threat model |
131133
| `.bug-hunter/dep-findings.json` | Dependency CVE results |
132134

@@ -162,11 +164,12 @@ bug-hunter/
162164
├── prompts/examples/ # Calibration examples for Hunter/Skeptic
163165
├── schemas/ # JSON Schema contracts for all artifacts
164166
├── scripts/ # Node.js helpers (zero AI tokens)
165-
│ ├── run-bug-hunter.cjs # Main orchestrator script
166-
│ ├── doc-lookup.cjs # Context Hub + Context7 doc lookup
167-
│ ├── context7-api.cjs # Context7 standalone fallback
168-
│ ├── prepublish-guard.cjs # Publish safety net
169-
│ └── tests/ # Test suite (61 tests)
167+
│ ├── run-bug-hunter.cjs # Main orchestrator script
168+
│ ├── experiment-loop.cjs # Autonomous experiment loop with metrics + stop-file
169+
│ ├── doc-lookup.cjs # Context Hub + Context7 doc lookup
170+
│ ├── context7-api.cjs # Context7 standalone fallback
171+
│ ├── prepublish-guard.cjs # Publish safety net
172+
│ └── tests/ # Test suite (101 tests)
170173
├── templates/ # Subagent launch template
171174
└── test-fixture/ # 6 planted bugs for validation
172175
```

llms.txt

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@ Bug Hunter is an automated adversarial code auditing skill for AI coding agents.
1313
- Security classification: STRIDE threat categories, CWE weakness IDs, CVSS 3.1 scoring
1414
- Documentation verification: checks claims against official library docs via Context Hub + Context7
1515
- Safe auto-fix: git-branched fixes with worktree isolation, canary rollout, test verification, and automatic rollback
16+
- Experiment loop: autonomous iteration with metric tracking, hard iteration caps, stop-file cancellation, and auto-resume on context limits
1617
- Dependency CVE scanning: lockfile-aware audits for npm, pnpm, yarn, bun (including Bun 1.2+ text-format lockfiles)
1718
- PR review: first-class `--pr` workflow for reviewing current, recent, or numbered PRs
1819
- Enterprise security pack: bundled STRIDE threat modeling, vulnerability validation, and security review skills

modes/loop.md

Lines changed: 177 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -120,6 +120,182 @@ Each iteration after the first:
120120

121121
- Max iterations should scale with the queue size so autonomous runs do not stop early
122122
- Each iteration only scans NEW files — no re-scanning already-DONE files
123-
- User can stop anytime with ESC or `/ralph-stop`
123+
- User can stop anytime with ESC, `/ralph-stop`, or `experiment-loop.cjs stop`
124124
- Canonical state is in `.bug-hunter/coverage.json`; `coverage.md` is derived
125125
and fully resumable from that JSON
126+
127+
---
128+
129+
## Experiment Tracking (autoresearch integration)
130+
131+
When `LOOP_MODE=true`, each loop iteration is automatically tracked as an **experiment** using the append-only JSONL experiment log. This is active by default — no extra flags needed. It provides metric-driven optimization with baseline comparison, auto-resume, and user-interruptible stop files.
132+
133+
### Setup (first iteration only)
134+
135+
Before the first pipeline iteration, initialize the experiment session:
136+
137+
```bash
node scripts/experiment-loop.cjs init \
  .bug-hunter/experiment.jsonl \
  "bug-hunt-$(date +%Y%m%d)" \
  bugs_confirmed \
  higher \
  count \
  --max-iterations 10
```
146+
147+
The `--max-iterations` flag sets the **hard iteration cap** for the session (default: 10). The loop will automatically stop when this cap is reached — no runaway loops. Each subsequent `init` call starts a **new segment** with its own baseline and counter reset.
148+
149+
### Per-iteration workflow
150+
151+
Each iteration follows the **check-continue → run → log** pattern:
152+
153+
1. **Check continue** — the single gateway before every iteration:
154+
```bash
node scripts/experiment-loop.cjs check-continue \
  .bug-hunter/experiment.jsonl \
  --stop-file .bug-hunter/experiment.stop
```
159+
160+
This checks ALL conditions in one call and returns a clear yes/no:
161+
- `{ "continue": true, "iteration": 3, "remaining": 7 }` — safe to proceed
162+
- `{ "continue": false, "reason": "user-stopped" }` — user requested stop
163+
- `{ "continue": false, "reason": "max-iterations-reached" }` — hit the cap
164+
- `{ "continue": false, "reason": "consecutive-crashes" }` — 3 crashes in a row
165+
- `{ "continue": false, "reason": "resume-cooldown" }` — auto-resume too soon
166+
167+
**If `continue` is false, the loop MUST stop.** Do not override.
168+
169+
2. **Run experiment** — execute the pipeline and measure:
170+
```bash
node scripts/experiment-loop.cjs run \
  .bug-hunter/experiment.jsonl \
  "node scripts/run-bug-hunter.cjs run --files-json .bug-hunter/triage-files.json" \
  --stop-file .bug-hunter/experiment.stop \
  --checks-script .bug-hunter/experiment.checks.sh
```
177+
178+
The run command:
179+
- Checks the stop file before executing (GUARDRAIL)
180+
- Times wall-clock duration
181+
- Captures stdout/stderr
182+
- Parses `METRIC name=value` lines from stdout
183+
- Runs the optional checks script if the command passes (backpressure GUARDRAIL)
184+
185+
3. **Log result** — record the outcome:
186+
```bash
# Status (third argument) is one of: keep, discard, crash, checks_failed.
# The fourth argument is the primary metric value (e.g., bugs confirmed).
# Comments must not follow a trailing backslash, or the line continuation breaks.
node scripts/experiment-loop.cjs log \
  .bug-hunter/experiment.jsonl \
  keep \
  12 \
  --description "Iteration 3: scanned auth + payments modules" \
  --secondary '{"false_positives":2,"files_scanned":15,"fix_success_rate":85}'
```
194+
195+
The log command:
196+
- Validates secondary metric consistency (GUARDRAIL — rejects missing/new metrics unless `--force true`)
197+
- Auto-commits on `keep` status (configurable via `--auto-commit false`)
198+
- Computes delta from baseline (% improvement)
199+
- Returns whether this is the new best result
200+
201+
4. **Check status** — see cumulative progress:
202+
```bash
node scripts/experiment-loop.cjs status .bug-hunter/experiment.jsonl
```
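The `run` step is described as parsing `METRIC name=value` lines from stdout. A minimal sketch of that parsing follows; the helper name `parseMetricLines` and the exact grammar (one metric per line, numeric value, non-numeric values ignored) are assumptions, since the diff does not show how `experiment-loop.cjs` implements it.

```javascript
// Hypothetical helper: extract `METRIC name=value` lines from captured stdout.
// The grammar accepted by the real script is an assumption here.
function parseMetricLines(stdout) {
  const metrics = {};
  for (const line of stdout.split(/\r?\n/)) {
    const match = /^METRIC\s+([A-Za-z_][\w.-]*)=(\S+)$/.exec(line.trim());
    if (!match) continue;
    const value = Number(match[2]);
    // Skip values that are not finite numbers instead of recording NaN.
    if (Number.isFinite(value)) metrics[match[1]] = value;
  }
  return metrics;
}

const sample = [
  "scanning 15 files...",
  "METRIC bugs_confirmed=12",
  "METRIC false_positives=2",
  "METRIC broken=abc",
  "done",
].join("\n");

console.log(parseMetricLines(sample));
```

Lines that do not match the pattern are ignored, so ordinary pipeline output can be interleaved with metric lines.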
205+
206+
### Stopping the loop
207+
208+
#### User-initiated stop (easy, immediate)
209+
210+
The user can cancel the loop at any time. The methods below all stop the loop, but they differ in when they take effect:
211+
212+
| Method | How | When it takes effect |
213+
|--------|-----|---------------------|
214+
| **ESC key** | Press ESC in the terminal | Immediate — kills current iteration |
215+
| **Ctrl+C** | Terminal interrupt | Immediate |
216+
| **`/ralph-stop`** | Type in the CLI | End of current iteration |
217+
| **Stop file** | `node scripts/experiment-loop.cjs stop` | Before next iteration |
218+
| **Touch file** | `touch .bug-hunter/experiment.stop` | Before next iteration |
219+
220+
The `check-continue` and `run` commands both check the stop file, so the loop will halt gracefully at the next natural checkpoint.
221+
222+
> **Interaction with ralph-loop:** ESC and Ctrl+C kill the process immediately (ralph-loop handles cleanup). The stop file is a softer mechanism — it lets the current operation finish, then halts before the next iteration. Both work independently. If a stale stop file is left behind from a previous run, `check-continue` will detect it and refuse to proceed — so always clean up.
223+
224+
To resume after a user stop:
225+
226+
```bash
node scripts/experiment-loop.cjs clear-stop
```
229+
230+
#### Automatic stop (system-initiated)
231+
232+
The system will automatically stop the loop when ANY of these conditions are met — no user action required:
233+
234+
| Condition | Default | Why |
235+
|-----------|---------|-----|
236+
| **Iteration cap reached** | 10 iterations | Prevents runaway loops. Configurable via `--max-iterations`. |
237+
| **3 consecutive crashes** | 3 in a row | Something is broken — don't waste tokens. Fix and re-init. |
238+
| **Resume cooldown** | 5 minutes | Prevents rapid-fire auto-resumes when agent hits context limits. |
239+
240+
These are the same checks that `check-continue` evaluates. The agent MUST call `check-continue` before every iteration and obey the result.
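To make the gate concrete, here is a sketch of the decision as a pure function. It is an illustration, not the script's actual code: the `state` field names, the order in which conditions are tested, and the rule that a resume marker newer than the cooldown blocks the loop are all assumptions. The reason strings and the 3-crash and 5-minute values come from the tables above.

```javascript
const COOLDOWN_MS = 5 * 60 * 1000; // 5-minute resume cooldown
const CRASH_LIMIT = 3; // consecutive crashes before auto-stop

// Hypothetical pure function; field names on `state` are illustrative.
function checkContinue(state, now = Date.now()) {
  if (state.stopFileExists) {
    return { continue: false, reason: "user-stopped" };
  }
  if (state.completed >= state.maxIterations) {
    return { continue: false, reason: "max-iterations-reached" };
  }
  if (state.consecutiveCrashes >= CRASH_LIMIT) {
    return { continue: false, reason: "consecutive-crashes" };
  }
  // Assumption: a recent resume marker blocks until the cooldown elapses.
  if (state.lastResumeAt != null && now - state.lastResumeAt < COOLDOWN_MS) {
    return { continue: false, reason: "resume-cooldown" };
  }
  // `iteration` is the one about to run; `remaining` counts those after it.
  const iteration = state.completed + 1;
  return { continue: true, iteration, remaining: state.maxIterations - iteration };
}

console.log(
  checkContinue({
    stopFileExists: false,
    completed: 2,
    maxIterations: 10,
    consecutiveCrashes: 0,
    lastResumeAt: null,
  })
);
```

With two completed iterations and a cap of 10, this sketch reports iteration 3 with 7 remaining, which matches the sample `check-continue` response shown earlier in this file.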
241+
242+
### Auto-resume on agent context limit (pi-autoresearch pattern)
243+
244+
When the agent's context window fills up mid-loop, it dies. On restart:
245+
246+
1. The agent reads `.bug-hunter/experiment.jsonl` — all state is reconstructable from this file alone
247+
2. It reads `.bug-hunter/coverage.json` — knows which files were already scanned
248+
3. It calls `check-continue` — which verifies the 5-minute cooldown has passed
249+
4. If `check-continue` returns `{ "continue": true }`, it calls `record-resume` and picks up where it left off
250+
5. If the cooldown hasn't elapsed, it waits or asks the user
251+
252+
```bash
# Record a resume event (resets the cooldown timer); call this after check-continue passes
node scripts/experiment-loop.cjs record-resume .bug-hunter/experiment.jsonl
```
256+
257+
> **Note:** `can-resume` and `record-resume` are low-level primitives. In normal operation, always use `check-continue` as the primary gateway — it already includes the resume cooldown check along with all other conditions. Use `can-resume` only for diagnostic purposes.
258+
259+
This is distinct from user-initiated stop: the agent auto-resumes after context limits, but respects user stop files.
260+
261+
### Metrics tracked
262+
263+
| Metric | Type | Description |
264+
|--------|------|-------------|
265+
| `bugs_confirmed` | Primary | Number of bugs surviving the full adversarial pipeline |
266+
| `false_positives` | Secondary | Findings killed by Skeptic + Referee |
267+
| `files_scanned` | Secondary | Files processed this iteration |
268+
| `fix_success_rate` | Secondary | % of fixes that passed verification |
269+
270+
Secondary metrics are **locked after the first result** in a segment. All subsequent results must provide the same set of secondary metrics (or use `--force true` to change them). This prevents inconsistent tracking.
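The locking rule can be sketched as a comparison between the locked metric names and the names in an incoming result. The function name `checkSecondaryMetrics`, its return shape, and the choice to re-lock to the new set when forced are assumptions made for illustration; the diff states only that mismatches are rejected unless `--force true` is passed.

```javascript
// Hypothetical check: `lockedNames` is null until the first result in a segment.
function checkSecondaryMetrics(lockedNames, incoming, force = false) {
  const incomingNames = Object.keys(incoming).sort();
  if (lockedNames === null) {
    // First result in the segment: lock whatever names it provides.
    return { ok: true, locked: incomingNames, missing: [], unexpected: [] };
  }
  const missing = lockedNames.filter((name) => !incomingNames.includes(name));
  const unexpected = incomingNames.filter((name) => !lockedNames.includes(name));
  const consistent = missing.length === 0 && unexpected.length === 0;
  return {
    ok: consistent || force,
    // Assumption: a forced mismatch re-locks to the new set of names.
    locked: consistent || !force ? lockedNames : incomingNames,
    missing,
    unexpected,
  };
}

const first = checkSecondaryMetrics(null, { false_positives: 2, files_scanned: 15 });
const second = checkSecondaryMetrics(first.locked, { false_positives: 1 });
console.log(second.ok, second.missing);
```

Reporting `missing` and `unexpected` separately lets a caller explain exactly why a result was rejected.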
271+
272+
### JSONL file format
273+
274+
The experiment log at `.bug-hunter/experiment.jsonl` is append-only. Each line is one of:
275+
276+
**Config header** (segment boundary):
277+
```json
{"type":"config","segment":0,"timestamp":1710000000000,"name":"bug-hunt-20260313","metric":{"name":"bugs_confirmed","unit":"count","direction":"higher"}}
```
280+
281+
**Result entry**:
282+
```json
{"type":"result","segment":0,"timestamp":1710000060000,"value":12,"secondaryMetrics":{"false_positives":2,"files_scanned":15},"status":"keep","description":"Iteration 1","commit":"abc1234","durationMs":45000}
```
285+
286+
**Resume marker**:
287+
```json
{"type":"resume","timestamp":1710000360000}
```
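Because the log is append-only, the current segment's state can be rebuilt by replaying it from the top. The sketch below shows one way to do that. The function name `reconstructState`, treating the first result in a segment as the baseline, picking the best value only among `keep` results, and the percentage formula are assumptions; the diff says only that state is reconstructable from the log and that a delta from baseline is computed.

```javascript
// Hypothetical replay of an experiment JSONL log into current-segment state.
function reconstructState(jsonlText) {
  let config = null;
  let results = [];
  for (const line of jsonlText.split(/\r?\n/)) {
    if (line.trim() === "") continue;
    const entry = JSON.parse(line);
    if (entry.type === "config") {
      config = entry; // a config header opens a new segment
      results = [];
    } else if (entry.type === "result" && config !== null) {
      results.push(entry);
    }
    // Resume markers carry no metric data and are skipped here.
  }
  if (config === null) return null;

  const higherIsBetter = config.metric.direction === "higher";
  // Assumption: the first result of the segment is the baseline.
  const baseline = results.length > 0 ? results[0].value : null;
  let best = null;
  for (const r of results) {
    if (r.status !== "keep") continue;
    if (best === null || (higherIsBetter ? r.value > best : r.value < best)) best = r.value;
  }
  // Count crashes at the tail of the log for the consecutive-crash breaker.
  let consecutiveCrashes = 0;
  for (let i = results.length - 1; i >= 0 && results[i].status === "crash"; i--) {
    consecutiveCrashes++;
  }
  // Percentage change of the best value against the baseline (null if undefined).
  const deltaPct =
    best !== null && baseline ? ((best - baseline) * 100) / Math.abs(baseline) : null;
  return { segment: config.segment, iterations: results.length, baseline, best, deltaPct, consecutiveCrashes };
}

const log = [
  '{"type":"config","segment":0,"timestamp":1,"name":"demo","metric":{"name":"bugs_confirmed","unit":"count","direction":"higher"}}',
  '{"type":"result","segment":0,"timestamp":2,"value":10,"status":"keep"}',
  '{"type":"result","segment":0,"timestamp":3,"value":12,"status":"keep"}',
  '{"type":"resume","timestamp":4}',
  '{"type":"result","segment":0,"timestamp":5,"value":8,"status":"discard"}',
].join("\n");

console.log(reconstructState(log));
```

Nothing in the replay mutates earlier entries, which is what allows a restarted agent to recover its position from the file alone.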
290+
291+
### Guardrails summary
292+
293+
| Guardrail | Source | Implementation |
294+
|-----------|--------|----------------|
295+
| Stop file checked before every run | pi-autoresearch | `run` command exits immediately if `.bug-hunter/experiment.stop` exists |
296+
| JSONL is append-only | pi-autoresearch | Never modify existing entries; full state reconstructable from log |
297+
| Secondary metric consistency | pi-autoresearch | Rejects missing/new metrics unless `--force` is used |
298+
| Auto-resume rate limiting | pi-autoresearch | 5-minute cooldown between resume events |
299+
| Backpressure checks | pi-autoresearch | Optional `experiment.checks.sh` must pass before `keep` |
300+
| Segment boundaries | pi-autoresearch | Each `init` starts fresh baseline; old segments preserved |
301+
| Output truncation | pi-autoresearch | stdout/stderr capped at 50KB to prevent memory bloat |

package.json

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
{
22
"name": "@codexstar/bug-hunter",
3-
"version": "3.0.8",
3+
"version": "3.0.9",
44
"description": "Adversarial AI bug hunter — multi-agent pipeline finds security vulnerabilities, logic errors, and runtime bugs, then fixes them autonomously. Works with Claude Code, Cursor, Codex CLI, Copilot, Kiro, and more.",
55
"license": "MIT",
66
"main": "bin/bug-hunter",

schemas/experiment.schema.json

Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,61 @@
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "schemaVersion": 1,
  "artifact": "experiment",
  "title": "Bug Hunter Experiment Log Entry",
  "description": "Schema for individual JSONL experiment log entries. Config headers and result entries share this envelope.",
  "type": "object",
  "required": ["type"],
  "properties": {
    "type": {
      "type": "string",
      "enum": ["config", "result", "resume"]
    },
    "segment": {
      "type": "integer",
      "minimum": 0
    },
    "timestamp": {
      "type": "number"
    },
    "name": {
      "type": "string",
      "minLength": 1
    },
    "metric": {
      "type": "object",
      "properties": {
        "name": { "type": "string", "minLength": 1 },
        "unit": { "type": "string" },
        "direction": { "type": "string", "enum": ["lower", "higher"] }
      },
      "required": ["name", "direction"]
    },
    "value": {
      "type": "number"
    },
    "secondaryMetrics": {
      "type": "object",
      "additionalProperties": { "type": "number" }
    },
    "status": {
      "type": "string",
      "enum": ["keep", "discard", "crash", "checks_failed"]
    },
    "description": {
      "type": "string"
    },
    "commit": {
      "type": "string"
    },
    "durationMs": {
      "type": "number",
      "minimum": 0
    },
    "maxIterations": {
      "type": "integer",
      "minimum": 1
    }
  },
  "additionalProperties": false
}
