
CLI silently hangs on stalled HTTPS sockets to api.github.com — no timeout, no log/event output #3371

@jay-tau

Describe the bug

The copilot CLI process can silently hang for many minutes (potentially indefinitely) with data stuck in the TCP send buffers of its connections to GitHub API endpoints, while writing nothing to either its process log or the session's events.jsonl. The user-facing TUI shows no indication that anything is wrong (no spinner change, no error, no "reconnecting…" message); the process simply stops making progress.

The underlying network condition appears to be a stale TCP connection, most likely left behind by a network or route change on a multi-homed host (toggling a VPN such as Tailscale, a Wi-Fi → Ethernet handoff, etc.). The kernel keeps the socket in ESTAB and nothing above it gives up sooner: there is no TCP keepalive and no application-level write deadline, so the bytes sit in Send-Q and only the kernel's own, very slow retransmission limit can ever clear them. The CLI never times out, never retries on a fresh connection, and never surfaces a diagnostic to the user.
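A minimal sketch of the two safeguards described as missing, assuming the HTTPS connections are ordinary Node `net.Socket`/`tls.TLSSocket` objects (the CLI's actual HTTP stack is not visible from outside). `hardenSocket` and the `SocketDeadlines` fields are hypothetical names and the values are illustrative. One caveat: on Linux, keepalive probes are only sent on idle connections, so a socket with unacknowledged bytes in Send-Q (like the ones in the signature below) is governed by the retransmission timer instead; in this incident the inactivity deadline is the part that would actually fire.

```typescript
import * as net from "node:net";

// Hypothetical tuning knobs; the values used by callers are illustrative only.
interface SocketDeadlines {
  keepAliveInitialDelayMs: number; // idle time before the first TCP keepalive probe
  idleTimeoutMs: number; // no read/write activity for this long => treat the path as dead
}

/**
 * Arms SO_KEEPALIVE and an application-level inactivity deadline on a socket.
 * Works for plain net.Socket and for tls.TLSSocket (a net.Socket subclass).
 */
function hardenSocket(socket: net.Socket, d: SocketDeadlines): net.Socket {
  // Enable TCP keepalive; Node uses the delay as the idle time before probing.
  socket.setKeepAlive(true, d.keepAliveInitialDelayMs);

  // Node emits 'timeout' after idleTimeoutMs of inactivity but does NOT close
  // the socket itself, so destroy it explicitly and let the caller's 'error'
  // handler trigger a retry on a fresh connection.
  socket.setTimeout(d.idleTimeoutMs);
  socket.once("timeout", () => {
    socket.destroy(
      new Error(`no socket activity for ${d.idleTimeoutMs} ms; assuming a dead path`),
    );
  });
  return socket;
}
```

Because the deadline destroys the socket with an error, a connection pool built on top of it would discard the dead FD instead of reusing it.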

Observed signature

$ ss -tnpi | grep "pid=<copilot-pid>"
ESTAB 0  1481886  10.83.27.238:36816  140.82.112.22:443    users:(("MainThread",pid=92622,fd=129))
ESTAB 0  3154     10.83.27.238:40390  140.82.114.21:443    users:(("MainThread",pid=92622,fd=28))
ESTAB 0  2510     10.83.27.238:40396  140.82.114.21:443    users:(("MainThread",pid=92622,fd=39))
ESTAB 0  3042     10.83.27.238:47374  140.82.113.21:443    users:(("MainThread",pid=92622,fd=45))

All peer IPs reverse-resolve to lb-140-82-*-*-iad.github.com (GitHub's IAD load balancers). The non-zero Send-Q values never drain: from GitHub's side these connections appear to be dead, but the local kernel still considers them established.
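The "never drains" part of the signature needs two samples, not one. A minimal sketch that parses `ss -tnpi` text and reports the sockets of a given PID whose Send-Q is non-zero and did not shrink between samples; it assumes ss's default column layout as in the output above, and `findBackedUpSockets` / `notDraining` are hypothetical names.

```typescript
// One backed-up socket as reported by `ss -tnpi`; the field names are this sketch's own.
interface BackedUpSocket {
  local: string;
  peer: string;
  sendQ: number;
  fd: number | null;
}

/**
 * Returns the ESTAB sockets owned by `pid` with a non-zero Send-Q.
 * Assumes ss's default column order: State Recv-Q Send-Q Local Peer Process.
 * The header and the extra TCP-info lines printed by `-i` are skipped.
 */
function findBackedUpSockets(ssOutput: string, pid: number): BackedUpSocket[] {
  const found: BackedUpSocket[] = [];
  for (const line of ssOutput.split("\n")) {
    const cols = line.trim().split(/\s+/);
    if (cols[0] !== "ESTAB" || cols.length < 6) continue;
    const owner = cols.slice(5).join(" ");
    // Match "pid=92622," including the comma so that pid=926220 does not count.
    if (!owner.includes(`pid=${pid},`)) continue;
    const sendQ = Number(cols[2]);
    if (!(sendQ > 0)) continue;
    const fd = /fd=(\d+)/.exec(owner);
    found.push({ local: cols[3], peer: cols[4], sendQ, fd: fd ? Number(fd[1]) : null });
  }
  return found;
}

/** Sockets whose Send-Q did not shrink between two samples taken some seconds apart. */
function notDraining(earlier: BackedUpSocket[], later: BackedUpSocket[]): BackedUpSocket[] {
  const key = (s: BackedUpSocket) => `${s.local} ${s.peer}`;
  const before = new Map(earlier.map((s) => [key(s), s.sendQ] as const));
  return later.filter((s) => {
    const prev = before.get(key(s));
    return prev !== undefined && s.sendQ >= prev;
  });
}
```

In practice the input would come from running `ss -tnpi` twice a few seconds apart, for example via `execFileSync("ss", ["-tnpi"], { encoding: "utf8" })` from `node:child_process`.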

Meanwhile:

$ ls -la --time-style=full-iso ~/.copilot/logs/process-*-92622.log \
                               ~/.copilot/session-state/<sid>/events.jsonl
-rw-r--r-- ... 21:00:43 ...   process-1779115412438-92622.log
-rw------- ... 20:48:45 ...   <sid>/events.jsonl

…no writes for ~5 and ~17 minutes respectively, while the user sits at the prompt with no feedback. The process state is S (sleeping); /proc/<pid>/wchan shows it blocked in epoll_wait, presumably waiting for the dead sockets to become writable (EPOLLOUT).
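The silence on disk can be checked the same way `ls`/`stat` do it, by comparing mtimes. A sketch of a hypothetical "wedged" heuristic that combines this with a socket check; the file layout is the one from the listing above, while `copilotActivityFiles`, `looksWedged` and the `quietMs` threshold are assumptions, and `hasBackedUpSockets` is a boolean the caller computes from `ss`.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

/** The files from the report: ~/.copilot/logs/process-*-<pid>.log and the session's events.jsonl. */
function copilotActivityFiles(pid: number, sessionId: string, home: string = os.homedir()): string[] {
  const logDir = path.join(home, ".copilot", "logs");
  const logs = fs
    .readdirSync(logDir)
    .filter((name) => name.startsWith("process-") && name.endsWith(`-${pid}.log`))
    .map((name) => path.join(logDir, name));
  return [...logs, path.join(home, ".copilot", "session-state", sessionId, "events.jsonl")];
}

/** Milliseconds since `file` was last modified, relative to `nowMs`. */
function msSinceLastWrite(file: string, nowMs: number): number {
  return nowMs - fs.statSync(file).mtimeMs;
}

/**
 * Hypothetical heuristic: sockets are backed up AND none of the activity
 * files has been written for at least `quietMs`.
 */
function looksWedged(
  files: string[],
  hasBackedUpSockets: boolean,
  quietMs: number,
  nowMs: number = Date.now(),
): boolean {
  if (!hasBackedUpSockets || files.length === 0) return false;
  return files.every((f) => msSinceLastWrite(f, nowMs) >= quietMs);
}
```

Requiring both signals keeps a long but healthy model turn (quiet files, draining sockets) from being reported as a hang.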

Affected version

GitHub Copilot CLI 1.0.48

(@github/copilot npm package, Linux x64 binary.)

Steps to reproduce the behavior

Hard to reproduce on demand, but a reliable recipe on a multi-homed Linux host:

  1. Start a copilot session over a network interface that has a default route to GitHub (e.g. a VPN such as Tailscale providing the 10.83.27.238 address above).
  2. Issue any request that causes the CLI to open HTTPS connections to api.github.com / *.github.com.
  3. Break that path without any RST reaching the local kernel: take the interface down, change the default route, reconnect the VPN, or suspend/resume (a NAT box reboot or an ISP blip has the same effect).
  4. Issue another request, or simply wait until the CLI tries to use one of the pooled HTTPS connections.
  5. The CLI hangs. No log lines, no events.jsonl entries, no TUI feedback. ss -tnpi on the PID shows non-zero Send-Q to GitHub IPs that never decreases.

It also reproduces "naturally" during ordinary daily use on machines that toggle between VPN/no-VPN or Wi-Fi/Ethernet.
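For an on-demand reproduction without a VPN, a loopback peer that accepts the connection and never reads should give the same `ss` signature (a Send-Q that fills and never drains). It is a stand-in rather than the real failure: here the peer is alive but refuses to read, instead of the path silently vanishing. The sketch also shows the remedy in miniature, waiting for Node's `'drain'` event with a deadline rather than indefinitely. `reproduceStall` is a hypothetical helper; the 1 MiB chunk size and 256 MiB cap are arbitrary.

```typescript
import * as net from "node:net";

interface StallResult {
  stalled: boolean; // true if 'drain' did not arrive within the deadline
  bytesQueued: number; // bytes handed to the socket before giving up
}

/**
 * A loopback peer accepts the connection but never reads (pauseOnConnect),
 * so the client's kernel Send-Q fills and stays full. The writer waits for
 * 'drain' for at most drainDeadlineMs instead of forever.
 */
function reproduceStall(drainDeadlineMs: number, maxBytes = 256 * 1024 * 1024): Promise<StallResult> {
  return new Promise((resolve, reject) => {
    const accepted: net.Socket[] = [];
    const server = net.createServer({ pauseOnConnect: true }, (peer) => {
      peer.on("error", () => {}); // the peer only exists to never read
      accepted.push(peer);
    });

    server.listen(0, "127.0.0.1", () => {
      const { port } = server.address() as net.AddressInfo;
      const client = net.connect(port, "127.0.0.1");
      const chunk = Buffer.alloc(1024 * 1024);
      let bytesQueued = 0;

      const finish = (stalled: boolean) => {
        client.destroy();
        for (const peer of accepted) peer.destroy();
        server.close();
        resolve({ stalled, bytesQueued });
      };

      const pump = () => {
        while (bytesQueued < maxBytes) {
          bytesQueued += chunk.length;
          if (!client.write(chunk)) {
            // Backpressure: wait for 'drain', but only for drainDeadlineMs.
            const timer = setTimeout(() => {
              client.off("drain", onDrain);
              finish(true);
            }, drainDeadlineMs);
            const onDrain = () => {
              clearTimeout(timer);
              pump();
            };
            client.once("drain", onDrain);
            return;
          }
        }
        finish(false); // the peer absorbed maxBytes: no stall observed
      };

      client.on("error", reject);
      client.once("connect", pump);
    });
  });
}
```

With a long deadline (for example `reproduceStall(60_000)`), `ss -tn` run during the wait should show the 127.0.0.1 client socket with a large Send-Q that does not change.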

Expected behavior

The CLI should not hang silently on a dead socket. At minimum, one or more of:

  • Per-request write/read timeouts on the HTTP client used for GitHub API traffic, so a stalled socket is detected within a bounded time (e.g. tens of seconds) instead of relying on the kernel's default TCP retransmission timeout (which can be 15+ minutes).
  • TCP keepalive (SO_KEEPALIVE with sensible TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT) on long-lived sockets so the kernel itself tears down dead connections faster.
  • Connection-pool health checks that proactively close and replace sockets that have been idle across a known network change (or just replace on any I/O error / timeout rather than retrying on the same dead FD).
  • User-visible feedback in the TUI when an outbound request has been pending longer than a short threshold — even just a "still waiting on api.github.com…" line — so the user can distinguish "the model is thinking" from "we are wedged on a dead socket".
  • Either an automatic retry on a fresh connection, or a clear error surfaced to the user with guidance (e.g. /reconnect or similar).
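A minimal sketch of how several of these could fit together, assuming the GitHub API traffic goes through Node's `https` module (an assumption; if the CLI uses `fetch`/undici instead, `AbortSignal.timeout()` would play the deadline role). `makeGithubAgent`, `escalate` and the threshold constants are hypothetical, and the numbers are illustrative.

```typescript
import * as https from "node:https";

// Illustrative thresholds; the issue only asks for "tens of seconds".
const STILL_WAITING_AFTER_MS = 10_000;
const GIVE_UP_AFTER_MS = 45_000;
const MAX_FRESH_CONNECTION_RETRIES = 1;

/** Pooled agent for GitHub API traffic: keepalive on, plus a socket inactivity timeout. */
function makeGithubAgent(): https.Agent {
  return new https.Agent({
    keepAlive: true, // pool connections and enable TCP keepalive on them
    keepAliveMsecs: 15_000, // initial delay before the first keepalive probe
    // Applied to each new socket. Node then emits 'timeout' but does not
    // abort the request; the caller must call req.destroy() in that handler.
    timeout: GIVE_UP_AFTER_MS,
  });
}

type Escalation =
  | { kind: "wait" }
  | { kind: "notify"; message: string }
  | { kind: "retry-on-fresh-connection" }
  | { kind: "surface-error"; message: string };

/**
 * What to do about one request that has been pending for `elapsedMs`.
 * `retriesSoFar` counts earlier retries on a fresh connection. Meant to be
 * re-evaluated periodically (e.g. once per second) by whatever drives the TUI.
 */
function escalate(host: string, elapsedMs: number, retriesSoFar: number): Escalation {
  if (elapsedMs < STILL_WAITING_AFTER_MS) return { kind: "wait" };
  const seconds = Math.round(elapsedMs / 1000);
  if (elapsedMs < GIVE_UP_AFTER_MS) {
    return { kind: "notify", message: `still waiting on ${host}… (${seconds} s)` };
  }
  if (retriesSoFar < MAX_FRESH_CONNECTION_RETRIES) {
    // Caller: destroy the stuck request, discard the pooled agent
    // (agent.destroy()), and resend through a new makeGithubAgent().
    return { kind: "retry-on-fresh-connection" };
  }
  return {
    kind: "surface-error",
    message: `no response from ${host} after ${seconds} s; the connection looks dead`,
  };
}
```

Keeping `escalate` a pure function of elapsed time and retry count makes the policy easy to test and leaves the TUI free to decide how to render the "still waiting" line.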

Additional context

  • OS: Linux (CachyOS, kernel 7.0.6-1-cachyos, x86_64)
  • Node: v26.1.0
  • Install: @github/copilot global npm install, binary at /usr/lib/node_modules/@github/copilot/node_modules/@github/copilot-linux-x64/copilot
  • Terminal / shell: standard pty, bash
  • Network: multi-homed — local LAN (192.168.1.12) + Tailscale (10.83.27.238) + IPv6. The stuck sockets in this incident were all bound to the Tailscale address; concurrently-healthy sockets on the same PID were on IPv6 and on the LAN address.

Diagnostic commands that help confirm the bug

# 1. Stuck send queues on copilot's sockets
ss -tnpi | awk 'NR>1 && /pid=<COPILOT_PID>/ && $3+0>0'

# 2. Process is parked in epoll_wait, not crashed
cat /proc/<COPILOT_PID>/wchan; echo
cat /proc/<COPILOT_PID>/status | grep -E '^(State|Threads)'

# 3. No log/event writes despite the process being "running"
stat -c '%y %n' ~/.copilot/logs/process-*-<COPILOT_PID>.log
stat -c '%y %n' ~/.copilot/session-state/<SID>/events.jsonl

Workaround

  • sudo ss -K dst <github-ip> dport = 443 to forcibly destroy the stuck sockets (requires CAP_NET_ADMIN and a kernel built with CONFIG_INET_DIAG_DESTROY).
  • Alternatively, kill the copilot process and restart the session.

Either way, no in-product recovery is offered.

Related (not duplicates)

Metadata


    Labels

    area:networking (Proxy, SSL/TLS, certificates, corporate environments, and connectivity issues)
