Skip to content

Commit 1405a64

Browse files
jcasimirclaude
andauthored
feat(ce-run,plan): wrapped-phase recognition hook + Erin phase-isolation plan (#9)
* docs(erin): brainstorm + plan + depth-2 dispatch spike Three-pass refinement cycle for the Erin phase-isolation feature: brainstorm → plan → plan-review → Unit 1 spike. The spike caught the load-bearing platform constraint (subagents lack the Agent tool, so direct depth-2 dispatch fails) before any code shipped. Two workarounds verified in the findings doc: A. Constrain wrapped /ce:work to inline-only execution + no Agent-dispatching sub-skills. Sufficient for v1. B. Subprocess via `claude -p` to bypass the constraint when nested dispatch is genuinely needed. ~16s overhead, streaming fidelity regression, but unlocks arbitrary nesting. Plan v1 status: active, proceeding with Workaround A. Units 2–4 (ce-run hook, erin.md update, dogfood) still ahead. Compound artifact: tests/spikes/depth-2-dispatch.md is a durable regression harness — re-run on any future Claude Code update; if it ever produces 5/5 successful dispatches the platform constraint has lifted and the original architecture becomes available. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ce-run,plan): wrapped-phase recognition hook + plan revisions ce-run/SKILL.md gains a small Step 5 branch: when a phase has `wrapped: true`, yield to the orchestrator persona's wrapped-phase behavior instead of invoking the skill in-thread. ce-run does not define a generic primitive — it just recognizes the flag. Plan revisions bake in the Unit 1 spike's Workaround A: the wrapped /ce:work invocation is pinned to Inline execution strategy and forbidden from invoking /ce:review or other Agent-dispatching skills, since subagents lack the Agent tool. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent e3ee5bc commit 1405a64

5 files changed

Lines changed: 674 additions & 4 deletions

File tree

Lines changed: 168 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,168 @@
1+
---
2+
date: 2026-05-07
3+
topic: erin-phase-isolation
4+
---
5+
6+
# Erin Phase Isolation: Subagent Wrapping for Long-Running Phases
7+
8+
## Problem Frame
9+
10+
Two distinct pains drive this work, and conflating them obscures the right design:
11+
12+
1. **Context bloat.** When Erin runs a long phase (`/ce:work` especially) as an in-thread skill invocation, every tool call, every file read, every reviewer panel output accumulates in the main conversation. Long features burn through the context window before Erin even reaches `compound`. Jeff has to manually `/compact` mid-workflow.
13+
14+
2. **Memory loss across `/compact`.** When Jeff does `/compact` mid-workflow, Erin loses the thread of which phase is current, what decisions were made, what the user just said. He copy-pastes a re-priming reminder to keep going.
15+
16+
These are different problems with different fixes. Pain #2 is solved by Erin writing a small `run-state.md` after each phase and reading it on resume — no subagent isolation needed. Pain #1 is the part that actually requires dispatching long phases out of the main thread so their tool churn doesn't accumulate.
17+
18+
Two rounds of document review tightened the design. The first pass cut a "Tina persona" framing, generic primitive in `ce-run`, JSONL event log, and 4-of-5 wrapped phases. The second pass surfaced epistemic gaps: spike thresholds were unfalsifiable, "observable evidence" was self-reported by the entity being checked, and the `needs-input` round-trip path was destructive (not just wasteful) for a phase like `/ce:work` that already mutates filesystem and git state. v1 now ships: a spike with numeric pass/fail thresholds, the `work` phase wrapped, Erin-verified evidence via direct `git diff` checks, and a hard halt on any subagent that needs user input mid-phase (no dialogue-relay protocol; the user re-runs `/ce:run erin` with the answer in args).
19+
20+
## User Flow
21+
22+
```mermaid
23+
flowchart TB
24+
Jeff[Jeff main session] -->|/ce:run erin feature| Erin[Erin Opus, main thread]
25+
Erin -->|brainstorm + plan in-thread| Plan[Plan doc]
26+
Erin -->|writes run-state.md PRE-dispatch| State[(run-state.md)]
27+
Erin -->|wrapped: true, Agent dispatch| Sub[Work subagent<br/>Opus, isolated context]
28+
Sub -->|streams to Jeff terminal| Jeff
29+
Sub -->|writes handoff + artifacts| Disk[(handoff + plan + commits)]
30+
Sub -->|compact return ≤200 words| Erin
31+
Erin -->|git diff --stat verification| GitCheck{Evidence<br/>matches?}
32+
GitCheck -->|yes| Erin2[Erin updates run-state POST-return]
33+
GitCheck -->|no| Retry[Re-spawn or surface]
34+
Erin2 -->|review + compound in-thread| Done[Done]
35+
36+
JeffCompact[Jeff /compact mid-run] -.->|context reset| Erin
37+
Erin -.->|reads run-state + scans handoffs/ for any newer| State
38+
```
39+
40+
## Requirements
41+
42+
**Phase 0 — Platform Spike (v0 prerequisite, hard gate)**
43+
44+
- R1. Before any other implementation, run a measurement spike that dispatches an Opus subagent which itself dispatches two parallel Sonnet subagents. The spike MUST capture, with explicit numeric pass/fail thresholds:
45+
- **(a) Main-thread token accumulation.** Measure parent-context token growth attributable to the wrapped phase. **Pass:** parent grows by less than 20% of the total tokens consumed by the leaf subagents (i.e., isolation captures >80%). **Fail:** parent grows by more than 50% of leaf token cost.
46+
- **(b) Streaming fidelity.** Capture what surfaces in the user's terminal during the wrapped phase. **Pass:** at minimum, each leaf tool call's name and abbreviated input is visible in real time. **Fail:** terminal shows nothing until subagent returns ("black box for N minutes" UX).
47+
- **(c) Reliability.** Run the dispatch at least 5 times in succession. **Pass:** all 5 succeed end-to-end. **Fail:** any run fails outright; investigate before proceeding.
48+
- **(d) Per-phase baseline.** As part of the spike, also capture per-phase token shares from the most recent 3 real `/ce:run erin` workflows on this codebase. This produces the baseline against which v1's success criterion (≥30% main-thread reduction) is measured AND validates the assumption that `work` is the largest phase.
49+
- R2. The spike's findings document MUST live at `docs/solutions/2026-05-XX-agent-tool-depth-2-spike.md` and include the four measurements above with pass/fail status against the thresholds in R1. The rest of v1 is gated on (a), (b), and (c) all passing. (d) informs v1 scope but is not a pass/fail gate.
50+
- R3. If any of (a)–(c) fail at the thresholds in R1, the rest of v1 is redesigned (or abandoned). If (d) shows `work` is not the largest source of bloat, v1 scope re-considers which phase to wrap first before proceeding.
51+
52+
**Phase Wrapping Primitive (post-spike)**
53+
54+
- R4. Erin's orchestrator file (`erin.md` in `ce-reviewers-jsl`) MUST support a `wrapped: true` flag on individual phase entries.
55+
- R5. When a phase has `wrapped: true`, Erin dispatches that phase via the `Agent` tool to a fresh Opus subagent. The subagent runs the phase's `skill:` (e.g., `ce:work`) in its isolated context, persists artifacts to disk, and returns a compact summary.
56+
- R6. The wrapping logic lives in Erin's behavior (in `erin.md` itself), NOT as a generic primitive in `ce-run`. `ce-run` is unchanged. When orchestrator #2 needs the same pattern, the abstraction extracts itself.
57+
- R7. Phase args (`$ARGUMENTS`, `$PLAN_PATH`) MUST pass through to the subagent prompt unchanged.
58+
- R8. Subagent prompt scope: in v1, the wrapped subagent receives the skill invocation + args ONLY. It does NOT receive Erin's review-preferences, persona definition, or other orchestrator-level prose. (Moot in v1 because review isn't wrapped; revisit when v2 wraps `review` or `plan-review`.)
59+
- R9. Model resolution: subagent runs on Opus by default (judgment-bearing). Workers it spawns follow per-skill defaults (typically Sonnet for reviewers, Haiku for research).
60+
61+
**v1 Wrapped Phases (intentionally narrow)**
62+
63+
- R10. v1 wraps **only the `work` phase**. This is the single largest source of context bloat in the working hypothesis (validated in spike R1(d)). Other phases stay in-thread.
64+
- R11. Phases NOT wrapped in v1: `brainstorm`, `plan`, `plan-review`, `user-plan-review`, `review`, `user-scenarios` (all stages), `everyday-usability`, `todo-resolve`, `test-browser`, `feature-video`, `compound`. Some are interactive; some are short enough that wrapping adds ceremony beyond savings.
65+
- R12. v2 candidates for wrapping: `review` and `plan-review`. Adding them depends on (a) the spike confirming depth-2 streaming and (b) v1's `work` wrap demonstrating real context savings on a real workflow.
66+
67+
**Phase Handoff Contract**
68+
69+
- R13. The wrapped phase's subagent MUST produce a handoff document at `docs/plans/<plan-filename-stem>/handoffs/<phase>.md`, where `<plan-filename-stem>` is the plan's filename minus the `.md` extension. (Stable `plan_id` in plan frontmatter is deferred to a future iteration if rename collisions become a real problem in practice.)
70+
- R14. Handoff frontmatter required fields: `phase`, `plan_filename`, `started`, `completed`, `status`, `claimed_files_modified`, `claimed_lines_changed_delta`. The last two are the subagent's *self-reported* numbers; Erin verifies them independently (R18).
71+
- R15. Handoff body required sections: `## Outcome` (one sentence), `## Artifacts` (paths, commit SHAs), `## Recommended Next Phase Action`. Optional: `## Open Questions`. If the subagent has anything to flag for Erin's between-phase judgment, include `## Judgment Calls For Erin`; omit the section entirely when nothing applies (no stub `None.`).
72+
- R16. `status` enum: `success | partial | failed | needs-input`. `partial` means the phase made forward progress but couldn't complete (e.g., test failures the subagent couldn't resolve). `needs-input` means the underlying skill required user dialogue — see R23.
73+
- R17. The subagent's return value to Erin MUST be a compact summary (≤200 words) containing: `status`, claimed evidence numbers, one-sentence outcome, judgment-call count (if any), recommended next action, handoff doc path.
74+
75+
**Erin-Verified Evidence (Independent Ground Truth)**
76+
77+
- R18. Before dispatching a wrapped phase, Erin captures a pre-dispatch snapshot: the current git HEAD SHA. After the subagent returns, Erin runs `git diff --stat <pre-sha>..HEAD` and `git log <pre-sha>..HEAD --oneline` in the main thread to obtain *independent* counts of files modified and lines changed. These are the **verified** numbers; the subagent's `claimed_*` fields are claims to compare against.
78+
- R19. Sanity-check rule: if the subagent returned `success` OR `partial` AND the verified evidence shows zero files modified AND zero lines changed, treat the result as suspicious. Trigger a re-spawn (R20) on first occurrence; surface to user on second.
79+
- R20. If the subagent's `claimed_*` fields disagree with verified counts by more than a tolerance band (e.g., claimed says 50 files, verified says 5), this is also a sanity-check failure — the subagent is hallucinating progress. Trigger re-spawn with corrective prompt naming the discrepancy.
80+
81+
**Failure Recovery**
82+
83+
- R21. On detectable subagent failure (Agent tool error, malformed handoff frontmatter, missing required artifact paths, evidence sanity-check failure per R19/R20), Erin re-spawns once with a corrective prompt that includes the prior return value verbatim plus the specific complaint.
84+
- R22. If the second attempt also fails, Erin surfaces to Jeff with: phase name, both return values, the specific complaint, and four options — retry, fall back to in-thread skill execution, edit the handoff manually, or abandon.
85+
- R23. **`needs-input` is a hard halt in v1, not a round-trip.** When the subagent encounters a question that requires user dialogue, it captures the question(s) in `## Judgment Calls For Erin` and returns `status: needs-input`. Erin reads the questions, surfaces them to Jeff, and stops. Jeff resolves the questions and re-invokes `/ce:run erin <args-with-answers-folded-in>` for a fresh workflow. **There is no in-flight resume.** This is a deliberate non-goal: re-spawning a partially-completed `/ce:work` is destructive (duplicate commits, conflicting state, silent skips), and supporting safe resume requires a skill-level redesign that's out of v1 scope.
86+
87+
**Erin Disagrees Response**
88+
89+
- R24. After reading a successful handoff and verifying evidence, Erin may judge the subagent's `Recommended Next Phase Action` differently. Three responses are available — Erin uses judgment, not a protocol, to choose:
90+
- **Override:** Erin proceeds with a different next-phase decision based on her between-phase view.
91+
- **Re-dispatch:** Erin re-spawns the subagent with a corrective prompt naming the specific judgment Erin disagrees with.
92+
- **Escalate:** Erin surfaces the disagreement to Jeff with both views.
93+
- R25. Run-state.md (R26) records every phase transition regardless of whether Erin agreed or overrode. This avoids the asymmetric-logging pathology (logging only divergence biases toward override).
94+
95+
**Run-State for `/compact` Survival**
96+
97+
- R26. Erin MUST maintain `docs/plans/<plan-filename-stem>/run-state.md` as the durable workflow thread. The file's content includes: phases complete, current phase, key decisions made by Erin, recent user input, recommended next action.
98+
- R27. Write ordering for wrapped phases:
99+
- **Pre-dispatch:** Erin updates run-state.md with `current_phase: <name>`, `current_phase_status: dispatched`, `pre_dispatch_sha: <git HEAD>`, then dispatches the subagent.
100+
- **Post-return:** Erin updates run-state.md with `current_phase_status: completed` (or `failed`/`needs-input`), records the verified evidence, and notes the next action.
101+
- The pre-dispatch write ensures that a `/compact` between dispatch and return doesn't leave run-state.md silently stale.
102+
- R28. On `/ce:run erin` resume after `/compact`: Erin's first action is to (a) read `run-state.md`, (b) scan `docs/plans/<plan-filename-stem>/handoffs/` for any handoff file with mtime newer than run-state.md's last write — if found, the handoff returned during a `/compact` window and Erin reconciles it before continuing.
103+
- R29. For pre-plan phases (brainstorm, plan-creation), there is no plan-filename-stem yet. Run-state during these phases lives at `docs/runs/<YYYY-MM-DD-HH-MM-SS>-<topic-slug>/run-state.md`. Once a plan file exists, Erin migrates run-state.md into the plan's directory and links forward from the run-state location.
104+
105+
**Explicit Non-Goals (v1 scope)**
106+
107+
- R30. v1 does not build a dialogue-relay protocol for wrapped phases. Interactive phases stay in-thread; wrapped phases that hit `needs-input` halt and require a fresh `/ce:run`. This may be revisited if real-world usage shows `needs-input` halts are common AND a safe resume mechanism is feasible.
108+
- R31. v1 does not support parallel wrapped phases. The wrapped subagent itself may spawn parallel reviewer/worker subagents (`/ce:work` already does), but Erin doesn't run multiple wrapped phases concurrently.
109+
- R32. v1 does not thread persistence across separate `/ce:run` invocations. `run-state.md` covers within a single workflow only.
110+
111+
## Success Criteria
112+
113+
- **Spike clears the gate.** R1 passes (a)/(b)/(c) at the numeric thresholds. R1(d) baseline data is captured and confirms `work` is the right phase to wrap first; if not, v1 scope adjusts.
114+
- **Measurable context savings.** Running `/ce:run erin <a representative feature>` with `work` wrapped consumes meaningfully less main-thread context than the same workflow today. Concrete target: ≥30% reduction in main-thread tokens at the point of entering `compound`, measured against the baseline captured in R1(d). If R1(d) shows `work` is, e.g., 70% of main-thread tokens today, the actual savings should approach that number minus subagent-summary overhead.
115+
- **No regressions.** Existing orchestrators behave exactly as today. Only `erin.md` changes (and only the `work` phase entry).
116+
- **`/compact` survives mid-run.** Jeff can `/compact` after any phase completes; Erin reads `run-state.md` on resume, scans for newer handoffs, and continues without losing thread. Pre-dispatch write ordering ensures the `/compact`-between-dispatch-and-return window doesn't corrupt state.
117+
- **Sanity checks fire on synthetic tests.** Test cases: (a) subagent returns `success` with verified zero files modified → re-spawn fires; (b) subagent's `claimed_files_modified` substantially exceeds verified count → re-spawn fires; (c) `partial` with zero verified evidence → suspicious-success path triggers.
118+
- **Failure recovery exercised.** Test cases for malformed handoff (R21), double failure (R22), and `needs-input` halt (R23) all behave as specified — particularly R23: the workflow stops cleanly and Jeff sees the captured questions.
119+
120+
## Scope Boundaries
121+
122+
- **Out of scope: a `tina.md` persona file.** v1 treats wrapping as infrastructure. The wrapped subagent is "the work-phase subagent," not a named character.
123+
- **Out of scope: cross-repo persona work.** v1 only changes `erin.md` (in `ce-reviewers-jsl`). No new files in either repo.
124+
- **Out of scope: generic `wrapped: <persona>` primitive in `ce-run`.** Erin handles dispatch herself.
125+
- **Out of scope: JSONL event log.** Foreground subagent terminal streaming is sufficient visibility for v1.
126+
- **Out of scope: wrapping more than `work` in v1.** `review`, `plan-review` are v2 contingent on v1 evidence.
127+
- **Out of scope: dialogue-relay for `needs-input`.** See R23 / R30.
128+
- **Out of scope: stable `plan_id` in plan frontmatter.** v1 uses filename-stem; introduce when rename collisions are observed in practice.
129+
- **Out of scope: parallel wrapped phases.** See R31.
130+
- **Out of scope: cross-`/ce:run` continuity.** See R32.
131+
132+
## Key Decisions
133+
134+
- **Spike first with numeric thresholds.** Document review surfaced that the prior spike spec was a vibes-check ("token delta," "reliable"). v1 specifies pass/fail numbers per axis (R1) so the gate is testable, not arguable.
135+
- **Erin-verified evidence via `git diff`, not subagent self-report.** The first sanity-check spec relied on the subagent reporting its own work counts — same entity as the one reporting `success`, so a hallucinating subagent fools both checks identically. v1 has Erin run `git diff --stat` against a pre-dispatch snapshot in the main thread for independent ground truth.
136+
- **`needs-input` is a hard halt in v1 (C-forbid path).** Re-running `/ce:work` from scratch with augmented args is destructive (duplicate commits, conflicts, silent skips). Building safe resume is a skill-level redesign out of v1 scope. v1 halts cleanly; Jeff re-runs.
137+
- **Erin-specific dispatch, not a generic primitive.** Reversed under document review (premature framework). Extracts to `ce-run` later when orchestrator #2 has stated need.
138+
- **Wrap one phase in v1, not five.** `work` is the working hypothesis for largest bloat source; spike R1(d) validates. `review`, `plan-review` are v2.
139+
- **Infrastructure, not persona.** `wrapped: true`, no `tina.md`. The wrapper has no judgment; it's plumbing.
140+
- **Handoff includes claimed evidence; Erin verifies independently.** Two-source check catches subagent hallucination, not just honest-zero-work.
141+
- **`run-state.md` separate from `phase-handoff.md`.** Different readers, different writers, different lifecycles. Pre-dispatch write ordering prevents the `/compact`-corruption window.
142+
- **Erin's three responses to disagreement are judgment, not protocol.** Original draft over-formalized this with audit logging; review flagged the asymmetry pathology. v1: Erin uses judgment; run-state logs all phase transitions equally.
143+
- **Filename-stem for handoff dir, not stable `plan_id`.** Premature infrastructure for v1; legacy fallback already covers what `plan_id` was meant to fix.
144+
- **Drop `tool_calls_count` from handoff fields.** Redundant for the v1 sanity check.
145+
146+
## Dependencies / Assumptions
147+
148+
- Spike (R1) confirms depth-2 Agent dispatch behavior. **Hard prerequisite — failure triggers redesign.**
149+
- The `Agent` tool reliably passes a custom system prompt / persona definition to subagents. (Already used at depth-1 by reviewer panels; spike validates depth-2.)
150+
- Erin can shell out to `git diff --stat` and `git log` in the main thread for evidence verification. (Bash tool is available; this is essentially free.)
151+
- Plans created by `/ce:plan` have stable filenames during a single `/ce:run` workflow. (Today's convention; rename mid-workflow is rare and out of v1 scope.)
152+
153+
## Outstanding Questions
154+
155+
### Resolve Before Planning
156+
*(none — strategic decisions are locked)*
157+
158+
### Deferred to Planning
159+
160+
- [Affects R1][Needs research] Spike implementation form: ad-hoc Bash invocation, or written-to-disk regression test we keep around? Recommend the latter — durable artifact, runnable on every Claude Code platform update.
161+
- [Affects R20][Technical] Tolerance band for evidence-discrepancy sanity check. "Substantially exceeds" needs a number. 2× ratio? Absolute delta? Lock during planning.
162+
- [Affects R8, R9][Technical] If/when v2 wraps `review`, the subagent will need access to Erin's review-preferences (named-persona team composition). Plan for this explicitly when v2 lands; v1 ducks it because review isn't wrapped.
163+
- [Affects R29][Technical] Migration of run-state when a plan file is created mid-workflow (brainstorm → plan transition). Linking forward + leaving a pointer at the original location is the obvious shape; lock specifics during planning.
164+
- [Affects R27][Technical] Atomic-write semantics for run-state.md. If Erin crashes between writes, the file shouldn't be left half-written. Probably write-to-temp + rename. Lock during planning.
165+
166+
## Next Steps
167+
168+
`/ce:plan` for structured implementation planning. The plan MUST schedule R1 (the spike) as the first implementation unit, with a hard gate that the rest of the plan only executes if R1's pass/fail thresholds are met. The plan MUST also explicitly include capturing R1(d) baseline data before declaring v1 success criteria measurable.

0 commit comments

Comments
 (0)