Sequential task calls fail ~50% with "codex app-server connection closed": broker reused after single-turn exit, shouldRetryDirect misses clean close (#402)

## Summary

Sequential `codex-companion task` calls fail **~50% of the time** with
`codex app-server connection closed`, in a near-perfect **pass / fail / pass / fail**
alternation. This is **single-session, non-concurrent** — one call at a time — and is
**not project-specific** (reproduces from any cwd, including `$HOME`).

Auth/`/codex:setup` is healthy, and a directly-spawned `codex app-server`
(`initialize` + `thread/start`) is 100% reliable, so the app-server itself is fine — the
failure is in broker reuse + the direct-retry fallback.

**Env:** plugin `openai/codex-plugin-cc` v1.0.5 (latest) · codex-cli 0.142.1 ·
Node v22.22.3 · macOS arm64.

## Steps to reproduce

```bash
cd ~   # any directory; no project .codex needed
for i in $(seq 1 6); do
  node <plugin>/1.0.5/scripts/codex-companion.mjs task --fresh --effort low "Reply: $i"
done
# => ~3/6 fail with "codex app-server connection closed", alternating ok/FAIL/ok/FAIL/...
```

Forcing a fresh broker on each run (deleting `state/<slug>-<hash>/broker.json` before
each call) gives **6/6 passes**, which isolates the cause to broker **reuse**.

## Root cause

`task` connects through the broker: `CodexAppServerClient.connect()` →
`ensureBrokerSession()`. The broker process shuts down after serving a **single turn**,
but its `broker.json` and unix socket linger briefly, long enough for the next call to
find them:

- **Run N** — no live broker → spawn fresh broker → turn succeeds. Broker lingers.
- **Run N+1** — `isBrokerEndpointReady()` still connects/initializes against that broker,
  so it is **reused** → the broker drops the **second** turn →
  `AppServerClientBase.handleExit(null)` (`scripts/lib/app-server.mjs:172`) →
  `Error("codex app-server connection closed.")` with **no `rpcCode` and no `code`**.
  This failure tears down the broker.
- **Run N+2** — broker gone → fresh spawn → succeeds. → the alternation.
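
The sequence above can be reproduced with a toy model. This is only a sketch of the described behaviour, not the plugin's code: `ToyBroker` and `simulate` are hypothetical names, and the "one turn per broker" rule is the assumption taken from this report.

```js
// Toy model of the alternation; none of these names exist in the plugin.
class ToyBroker {
  constructor() {
    this.turnsServed = 0;
  }

  // Assumption from this report: a broker serves one turn, then drops the next.
  serveTurn() {
    if (this.turnsServed >= 1) {
      throw new Error("codex app-server connection closed.");
    }
    this.turnsServed += 1;
    return "ok";
  }
}

function simulate(runs) {
  let broker = null; // stands in for the lingering broker.json + unix socket
  const results = [];
  for (let i = 0; i < runs; i += 1) {
    if (broker === null) {
      broker = new ToyBroker(); // no live broker: spawn a fresh one
    }
    try {
      results.push(broker.serveTurn());
    } catch (error) {
      results.push("FAIL");
      broker = null; // the failed turn tears the broker down
    }
  }
  return results;
}

console.log(simulate(6).join(" / "));
// prints "ok / FAIL / ok / FAIL / ok / FAIL"
```

Resetting `broker` to `null` before every iteration (the equivalent of deleting `broker.json`) makes every run in the model pass, matching the 6/6 observation.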

The existing direct-mode fallback in `withAppServer()`
(`scripts/lib/codex.mjs:620-633`) does **not** catch this. `shouldRetryDirect` only
triggers on `BROKER_BUSY_RPC_CODE`, `ENOENT`, or `ECONNREFUSED` — never on a clean
"connection closed" mid-turn — so the error is thrown to the caller instead of retrying
direct:

```js
const shouldRetryDirect =
  (client?.transport === "broker" && error?.rpcCode === BROKER_BUSY_RPC_CODE) ||
  (brokerRequested && (error?.code === "ENOENT" || error?.code === "ECONNREFUSED"));
```

## Suggested fix (verified locally)

Extend `shouldRetryDirect` to also retry direct when a broker connection drops:

```js
const brokerConnectionDropped =
  client?.transport === "broker" &&
  (error?.code === "ECONNRESET" ||
    error?.code === "EPIPE" ||
    /connection closed|exited unexpectedly/i.test(error?.message ?? ""));
const shouldRetryDirect =
  (client?.transport === "broker" && error?.rpcCode === BROKER_BUSY_RPC_CODE) ||
  brokerConnectionDropped ||
  (brokerRequested && (error?.code === "ENOENT" || error?.code === "ECONNREFUSED"));
```
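
To make the difference concrete, here is a standalone sketch that evaluates the original and the patched predicate against the error shape produced by `handleExit(null)`. The function wrappers and the numeric value of `BROKER_BUSY_RPC_CODE` are placeholders for illustration; the real constant lives in the plugin.

```js
// Placeholder value, assumed for this sketch only.
const BROKER_BUSY_RPC_CODE = -32001;

// Original predicate, wrapped in a function so it can be exercised in isolation.
function shouldRetryDirectOriginal({ client, error, brokerRequested }) {
  return Boolean(
    (client?.transport === "broker" && error?.rpcCode === BROKER_BUSY_RPC_CODE) ||
      (brokerRequested && (error?.code === "ENOENT" || error?.code === "ECONNREFUSED"))
  );
}

// Patched predicate: also retries when a broker connection drops mid-turn.
function shouldRetryDirectPatched({ client, error, brokerRequested }) {
  const brokerConnectionDropped =
    client?.transport === "broker" &&
    (error?.code === "ECONNRESET" ||
      error?.code === "EPIPE" ||
      /connection closed|exited unexpectedly/i.test(error?.message ?? ""));
  return Boolean(
    (client?.transport === "broker" && error?.rpcCode === BROKER_BUSY_RPC_CODE) ||
      brokerConnectionDropped ||
      (brokerRequested && (error?.code === "ENOENT" || error?.code === "ECONNREFUSED"))
  );
}

// Error shape from the report: message only, no rpcCode and no code.
const cleanClose = new Error("codex app-server connection closed.");
const viaBroker = { client: { transport: "broker" }, error: cleanClose, brokerRequested: true };
const viaDirect = { client: { transport: "direct" }, error: cleanClose, brokerRequested: false };

console.log(shouldRetryDirectOriginal(viaBroker)); // false: error reaches the caller
console.log(shouldRetryDirectPatched(viaBroker)); // true: falls back to direct
console.log(shouldRetryDirectPatched(viaDirect)); // false: no retry loop in direct mode
```

The last case matters: because the new clause is gated on `client?.transport === "broker"`, a close on a direct connection is still surfaced rather than retried.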

After this patch, with **no** broker clearing, consecutive runs pass 6/6 and 5/5 across
two different cwds. The patch only restores the fallback's own intent. A more thorough
fix could additionally avoid reusing a broker that is shutting down (e.g. a liveness/age
check in `isBrokerEndpointReady`/`ensureBrokerSession`). However, the broker is alive
enough to accept a connection and `initialize` at reuse time, so such a check may not
catch it; the direct retry is what actually unblocks the call.

## Related but distinct

- #286 (Race 3) touches `ensureBrokerSession`/`broker.json`, but is about **concurrent**
  same-cwd invocations. This report is purely **sequential** — no race.
- #108 / #380 are about broker **cleanup/orphans** on session exit; this is about reuse
  **dropping the next turn** during normal use.
- #202 is a zombie **job** record (`jobs.json`, "Task is still running"), a different
  failure path and error message.

