
Sequential task calls fail ~50% with "codex app-server connection closed" — broker reused after single-turn exit, shouldRetryDirect misses clean close #402

Description

@Sami-Ke

Summary

Sequential codex-companion task calls fail ~50% of the time with
codex app-server connection closed, in a near-perfect pass / fail / pass / fail
alternation. This is single-session, non-concurrent — one call at a time — and is
not project-specific (reproduces from any cwd, including $HOME).

Auth (/codex:setup) is healthy, and a directly-spawned codex app-server
(initialize + thread/start) is 100% reliable, so the app-server itself is fine — the
failure is in broker reuse + the direct-retry fallback.

Env: plugin openai/codex-plugin-cc v1.0.5 (latest) · codex-cli 0.142.1 ·
Node v22.22.3 · macOS arm64.

Steps to reproduce

cd ~   # any directory; no project .codex needed
for i in $(seq 1 6); do
  node <plugin>/1.0.5/scripts/codex-companion.mjs task --fresh --effort low "Reply: $i"
done
# => ~3/6 fail with "codex app-server connection closed", alternating ok/FAIL/ok/FAIL/...

Forcing a fresh broker each run (delete state/<slug>-<hash>/broker.json before each
call) → 6/6 pass. That isolates the cause to broker reuse.

Root cause

task connects through the broker: CodexAppServerClient.connect() →
ensureBrokerSession(). The broker process exits after serving a single turn, but
its broker.json + unix socket linger briefly:

  • Run N — no live broker → spawn fresh broker → turn succeeds. Broker lingers.
  • Run N+1 — isBrokerEndpointReady() still connects/initializes against that broker,
    so it is reused → the broker drops the second turn →
    AppServerClientBase.handleExit(null) (scripts/lib/app-server.mjs:172) →
    Error("codex app-server connection closed.") with no rpcCode and no code.
    This failure tears down the broker.
  • Run N+2 — broker gone → fresh spawn → succeeds. Hence the alternation.
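
The lifecycle in the list above can be reproduced with a toy model. This is only a sketch of the described behaviour, not plugin code: simulateSequentialRuns and lingeringBroker are made-up names, and the model assumes the stale endpoint always passes the readiness probe.

```javascript
// Toy model of the broker lifecycle described above.
// Nothing here is plugin code; all names are hypothetical.
function simulateSequentialRuns(runCount) {
  // Stands in for broker.json + the unix socket left behind by a broker.
  let lingeringBroker = null;
  const outcomes = [];

  for (let run = 1; run <= runCount; run++) {
    if (lingeringBroker === null) {
      // No endpoint on disk: spawn a fresh broker, which serves one turn.
      lingeringBroker = { turnsServed: 1 };
      outcomes.push("ok");
    } else {
      // Assumption: the stale endpoint still passes the readiness probe,
      // so it is reused, and the broker then drops this second turn.
      outcomes.push("FAIL");
      // The failure tears the broker down, so the next run starts clean.
      lingeringBroker = null;
    }
  }
  return outcomes;
}

console.log(simulateSequentialRuns(6).join(" / "));
// prints "ok / FAIL / ok / FAIL / ok / FAIL"
```

The model fails exactly every second run, which matches the ~3/6 failure rate in the repro and explains why deleting broker.json before each call gives 6/6.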

The existing direct-mode fallback in withAppServer()
(scripts/lib/codex.mjs:620-633) does not catch this. shouldRetryDirect only
triggers on BROKER_BUSY_RPC_CODE, ENOENT, or ECONNREFUSED — never on a clean
"connection closed" mid-turn — so the error is thrown to the caller instead of retrying
direct:

const shouldRetryDirect =
  (client?.transport === "broker" && error?.rpcCode === BROKER_BUSY_RPC_CODE) ||
  (brokerRequested && (error?.code === "ENOENT" || error?.code === "ECONNREFUSED"));
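
To make the gap concrete, here is the same condition wrapped in a standalone function and evaluated against an error shaped like the one from handleExit(null). This is a sketch: the wrapper name currentShouldRetryDirect and the numeric value of BROKER_BUSY_RPC_CODE are placeholders, not taken from the plugin.

```javascript
// Placeholder value; the real constant lives in the plugin source.
const BROKER_BUSY_RPC_CODE = -32001;

// The existing condition, copied into a hypothetical standalone function.
// Boolean() only normalizes the result for printing.
function currentShouldRetryDirect({ client, brokerRequested, error }) {
  return Boolean(
    (client?.transport === "broker" && error?.rpcCode === BROKER_BUSY_RPC_CODE) ||
    (brokerRequested && (error?.code === "ENOENT" || error?.code === "ECONNREFUSED"))
  );
}

// Shaped like the error from handleExit(null): message only, no rpcCode, no code.
const cleanCloseError = new Error("codex app-server connection closed.");

console.log(
  currentShouldRetryDirect({
    client: { transport: "broker" },
    brokerRequested: true,
    error: cleanCloseError,
  })
);
// prints false
```

Because the error carries neither rpcCode nor code, both branches evaluate to false and the error propagates to the caller.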

Suggested fix (verified locally)

Extend shouldRetryDirect to also retry direct when a broker connection drops:

const brokerConnectionDropped =
  client?.transport === "broker" &&
  (error?.code === "ECONNRESET" ||
    error?.code === "EPIPE" ||
    /connection closed|exited unexpectedly/i.test(error?.message ?? ""));
const shouldRetryDirect =
  (client?.transport === "broker" && error?.rpcCode === BROKER_BUSY_RPC_CODE) ||
  brokerConnectionDropped ||
  (brokerRequested && (error?.code === "ENOENT" || error?.code === "ECONNREFUSED"));
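
The patched condition can be checked in isolation against a few error shapes. This is a sketch under assumptions: patchedShouldRetryDirect is a hypothetical wrapper around the expression above, and the BROKER_BUSY_RPC_CODE value is a placeholder.

```javascript
// Placeholder value; the real constant lives in the plugin source.
const BROKER_BUSY_RPC_CODE = -32001;

// The patched condition, wrapped in a hypothetical standalone function.
// Boolean() only normalizes the result to true/false.
function patchedShouldRetryDirect({ client, brokerRequested, error }) {
  const brokerConnectionDropped =
    client?.transport === "broker" &&
    (error?.code === "ECONNRESET" ||
      error?.code === "EPIPE" ||
      /connection closed|exited unexpectedly/i.test(error?.message ?? ""));
  return Boolean(
    (client?.transport === "broker" && error?.rpcCode === BROKER_BUSY_RPC_CODE) ||
    brokerConnectionDropped ||
    (brokerRequested && (error?.code === "ENOENT" || error?.code === "ECONNREFUSED"))
  );
}

const closed = new Error("codex app-server connection closed.");
const reset = Object.assign(new Error("read ECONNRESET"), { code: "ECONNRESET" });

// Clean close on a broker transport: now retried direct.
console.log(patchedShouldRetryDirect({ client: { transport: "broker" }, brokerRequested: true, error: closed }));
// Socket reset on a broker transport: retried direct.
console.log(patchedShouldRetryDirect({ client: { transport: "broker" }, brokerRequested: true, error: reset }));
// Same message on a direct transport: not retried, so there is no retry loop.
console.log(patchedShouldRetryDirect({ client: { transport: "direct" }, brokerRequested: false, error: closed }));
// prints true, true, false
```

The transport check matters here: a direct client that hits the same error is not retried again.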

After this patch, with no broker clearing: 6/6 and 5/5 consecutive runs pass across
two different cwds. This just restores the fallback's own intent. A more thorough fix
could additionally avoid reusing a broker that is shutting down (e.g. liveness/age check
in isBrokerEndpointReady/ensureBrokerSession), but the broker is alive enough to
accept+initialize at reuse time, so the direct-retry is what actually unblocks it.
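
For readers unfamiliar with the fallback, the control flow it is meant to provide looks roughly like the following. This is a simplified sketch, not the actual withAppServer() implementation: withDirectFallback, connectBroker, and connectDirect are hypothetical names, and client cleanup is omitted.

```javascript
// Simplified sketch of a broker-first call with a direct retry.
// All names are hypothetical; the real logic lives in withAppServer().
async function withDirectFallback({ connectBroker, connectDirect, shouldRetryDirect }, fn) {
  let client = null;
  try {
    client = await connectBroker();
    return await fn(client);
  } catch (error) {
    // Only fall back when the failure is attributable to the broker.
    if (!shouldRetryDirect(client, error)) throw error;
    const directClient = await connectDirect();
    return await fn(directClient);
  }
}

// Usage with fakes: the broker turn fails with a clean close, the direct turn succeeds.
withDirectFallback(
  {
    connectBroker: async () => ({ transport: "broker" }),
    connectDirect: async () => ({ transport: "direct" }),
    shouldRetryDirect: (client, error) =>
      client?.transport === "broker" && /connection closed/i.test(error?.message ?? ""),
  },
  async (client) => {
    if (client.transport === "broker") {
      throw new Error("codex app-server connection closed.");
    }
    return `ok via ${client.transport}`;
  }
).then((result) => console.log(result));
// prints "ok via direct"
```

With a predicate that recognizes the clean close, the dropped broker turn is invisible to the caller, which is consistent with the 6/6 result reported above.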

Related but distinct
