fix: close stale relay connection on ErrConnAlreadyExists to recover …#5866

Open
fpenezic wants to merge 2 commits into netbirdio:main from fpenezic:fix/stale-relay-conn-recovery

Conversation

@fpenezic fpenezic commented Apr 12, 2026

Summary

When a peer behind NAT undergoes a public IP change (e.g. PPPoE reconnect with IP rotation), the remote peer's relay client retains a stale connection entry in its conns map. Subsequent calls to OpenConn() return ErrConnAlreadyExists and the code previously returned without establishing a new connection — leaving WireGuard sending handshake packets into a dead relay pipe. The tunnel never recovers.

This PR fixes two related issues:

  1. Stale relay connection (worker_relay.go, client.go, manager.go): When OpenConn() returns ErrConnAlreadyExists, close the existing (stale) connection via new CloseConnByPeerKey() method and retry. This ensures the relay pipe is always live when processing a new offer.

  2. Stale ICE session ID (worker_ice.go, conn.go): When a WireGuard handshake times out on a relay connection, the ICE worker is not closed, so its session ID is never rotated. The next offer carries the same session ID, causing the remote peer to skip ICE agent recreation and reuse stale candidates. Added ResetSessionID() to force a fresh session ID so the remote peer creates a new ICE agent with current candidates.
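The first fix can be sketched as follows. This is a minimal, self-contained model and not the PR's actual code: `relayManager`, `fakeManager`, and `openWithRecovery` are hypothetical names, connections are represented as plain strings, and only `OpenConn`, `CloseConnByPeerKey`, and `ErrConnAlreadyExists` take their names from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConnAlreadyExists is a stand-in for the relay client's error of the same name.
var ErrConnAlreadyExists = errors.New("connection already exists")

// relayManager is a hypothetical, reduced view of the relay manager API.
type relayManager interface {
	OpenConn(peerKey string) (string, error)
	CloseConnByPeerKey(peerKey string)
}

// fakeManager keeps one entry per peer key, like the conns map described above.
type fakeManager struct {
	conns map[string]string
	next  int
}

func (m *fakeManager) OpenConn(peerKey string) (string, error) {
	if _, ok := m.conns[peerKey]; ok {
		return "", ErrConnAlreadyExists
	}
	m.next++
	id := fmt.Sprintf("conn-%d", m.next)
	m.conns[peerKey] = id
	return id, nil
}

func (m *fakeManager) CloseConnByPeerKey(peerKey string) {
	delete(m.conns, peerKey)
}

// openWithRecovery evicts a stale entry and retries exactly once.
// Any other error, or a second failure, is returned to the caller.
func openWithRecovery(m relayManager, peerKey string) (string, error) {
	conn, err := m.OpenConn(peerKey)
	if err == nil {
		return conn, nil
	}
	if !errors.Is(err, ErrConnAlreadyExists) {
		return "", err
	}
	m.CloseConnByPeerKey(peerKey)
	return m.OpenConn(peerKey)
}

func main() {
	m := &fakeManager{conns: map[string]string{"peerA": "conn-stale"}}
	conn, err := openWithRecovery(m, "peerA")
	fmt.Println(conn, err) // prints "conn-1 <nil>"
}
```

Retrying only once keeps the failure mode bounded: if the second `OpenConn` also fails, the error surfaces instead of looping.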

Reproduction scenario

  • Peer A: behind PPPoE NAT (no port forwarding)
  • Peer B: public IP
  • ISP forces periodic PPPoE reconnect, assigning Peer A a new public IP

Before fix: After IP change, Peer B holds dead relay conn → WG handshake stuck at 0001-01-01 00:00:00 indefinitely. Tunnel never recovers without manual restart.

After fix: Tunnel recovers automatically within ~15 seconds.

Changes

  • shared/relay/client/client.go: Add CloseConnByPeerKey() — force-close relay connection by peer key, remove from internal map
  • shared/relay/client/manager.go: Add CloseConnByPeerKey() proxy on Manager
  • client/internal/peer/worker_relay.go: On ErrConnAlreadyExists → close stale conn → retry OpenConn()
  • client/internal/peer/worker_ice.go: Add ResetSessionID() — rotate ICE session ID without closing agent
  • client/internal/peer/conn.go: Call ResetSessionID() in onWGDisconnected relay case
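The session ID rotation can be illustrated with a small sketch. It assumes the session ID is a random hex string guarded by a mutex and that the remote peer recreates its ICE agent only when the offered ID differs from the one it already knows. The `workerICE` type, the ID format, and `needsNewAgent` are assumptions for illustration, not the PR's implementation.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
)

// workerICE is a hypothetical, reduced stand-in for the ICE worker.
type workerICE struct {
	mu        sync.Mutex
	sessionID string
}

// newSessionID returns 8 random bytes as a 16-character hex string (assumed format).
func newSessionID() (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// ResetSessionID rotates the session ID without touching the agent itself.
func (w *workerICE) ResetSessionID() error {
	id, err := newSessionID()
	if err != nil {
		return err
	}
	w.mu.Lock()
	w.sessionID = id
	w.mu.Unlock()
	return nil
}

// SessionID returns the ID that the next offer would carry.
func (w *workerICE) SessionID() string {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.sessionID
}

// needsNewAgent models the remote side: an unchanged ID means the existing
// agent, and its possibly stale candidates, would be reused.
func needsNewAgent(knownID, offeredID string) bool {
	return knownID != offeredID
}

func main() {
	w := &workerICE{sessionID: "old-session"}
	known := w.SessionID()
	fmt.Println("before reset:", needsNewAgent(known, w.SessionID()))
	if err := w.ResetSessionID(); err != nil {
		panic(err)
	}
	fmt.Println("after reset:", needsNewAgent(known, w.SessionID()))
}
```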

Test plan

  • Tested with real PPPoE reconnect (IP rotation) — tunnel recovers automatically
  • Tested PPPoE reconnect without IP change — tunnel recovers
  • Tested across multiple reconnect cycles over 48h
  • Existing unit tests pass

Summary by CodeRabbit

  • Bug Fixes
    • Relay disconnects now trigger a full ICE session reset to avoid stale session state and improve reconnection reliability.
    • Duplicate relay-connection errors are handled by closing stale connections and retrying, reducing failed-open cases.
    • Added targeted forced-close for relay connections to ensure proper cleanup and unsubscribe before retrying.

CLAassistant commented Apr 12, 2026

CLA assistant check
All committers have signed the CLA.

coderabbitai bot commented Apr 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a214f246-0539-4674-b362-990ab374305c

📥 Commits

Reviewing files that changed from the base of the PR and between d2a2ff5 and a9783e0.

📒 Files selected for processing (2)
  • client/internal/peer/worker_relay.go
  • shared/relay/client/manager.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • shared/relay/client/manager.go

Walkthrough

The PR forces a new ICE session ID when a relay connection is closed, adds WorkerICE.ResetSessionID() and CloseConnByPeerKey() APIs, and changes relay open logic to detect existing connections, evict stale entries, and retry opening the relay connection.

Changes

| Cohort / File(s) | Summary |
|---|---|
| ICE Session Management: client/internal/peer/conn.go, client/internal/peer/worker_ice.go | Added WorkerICE.ResetSessionID() and call site: Conn.onWGDisconnected() now calls workerICE.ResetSessionID() when the active connection is conntype.Relay (nil-guarded). |
| Relay Client Eviction & Retry: client/internal/peer/worker_relay.go, shared/relay/client/client.go, shared/relay/client/manager.go | Added CloseConnByPeerKey() at client and manager layers; OpenConn handling now treats ErrConnAlreadyExists by logging, force-closing the stale conn via CloseConnByPeerKey(), and retrying OpenConn once before failing. |

Sequence Diagram(s)

sequenceDiagram
  participant Conn
  participant WorkerRelay
  participant RelayClient
  participant WorkerICE

  Conn->>WorkerRelay: detect WG disconnected (active=Relay)
  WorkerRelay->>RelayClient: Close relay connection
  RelayClient-->>WorkerRelay: closed
  WorkerRelay->>WorkerICE: ResetSessionID()  -- new ICE session ID
  WorkerICE-->>WorkerRelay: ack
  WorkerRelay->>Conn: continue disconnect handling

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • lixmal
  • mlsmaycon

Poem

🐰 When relays fall and tunnels sigh,
I hop and make a brand new try,
A fresh ICE name to start the race,
I nudge the links and mend the place,
Hoppity hop — reconnect with grace!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the primary change: closing stale relay connections on ErrConnAlreadyExists to enable tunnel recovery. |
| Description check | ✅ Passed | The description provides comprehensive context including reproduction scenario, specific changes, and testing methodology; however, required template sections are unfilled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 150-160: The CloseConnByPeerKey method currently always calls
m.relayClient.CloseConnByPeerKey(peerKey) which misses stale connections stored
under a specific server entry; modify Manager.CloseConnByPeerKey to
accept/lookup the serverAddress (use m.relayClients[srv].relayClient when
present) and call that relayClient.CloseConnByPeerKey(peerKey) instead of always
using m.relayClient; ensure the logic still handles the nil/default
m.relayClient case and preserves the mutex usage around m.relayClients access;
also update the caller in client/internal/peer/worker_relay.go to pass the srv
value into CloseConnByPeerKey so cross-relay ErrConnAlreadyExists recovery finds
and removes the stale entry created in openConnVia/OpenConn.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dcc48e08-b6cd-48da-ad60-56304d5f8378

📥 Commits

Reviewing files that changed from the base of the PR and between 5259e5d and d2a2ff5.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go

Comment on lines +150 to +160
// CloseConnByPeerKey closes an existing relay connection for the given peer key
// so that a subsequent OpenConn can create a fresh one.
func (m *Manager) CloseConnByPeerKey(peerKey string) {
	m.relayClientMu.RLock()
	defer m.relayClientMu.RUnlock()

	if m.relayClient == nil {
		return
	}
	m.relayClient.CloseConnByPeerKey(peerKey)
}

⚠️ Potential issue | 🟠 Major

Route stale-connection cleanup to the relay client for serverAddress.

Line 159 always closes m.relayClient, but OpenConn() can return ErrConnAlreadyExists from a foreign relay client created in openConnVia(). In the cross-relay path, the stale entry lives in m.relayClients[srv].relayClient, so the retry from client/internal/peer/worker_relay.go will hit the same error again and recovery still fails there.

Suggested direction
-// CloseConnByPeerKey closes an existing relay connection for the given peer key
-// so that a subsequent OpenConn can create a fresh one.
-func (m *Manager) CloseConnByPeerKey(peerKey string) {
-	m.relayClientMu.RLock()
-	defer m.relayClientMu.RUnlock()
-
-	if m.relayClient == nil {
-		return
-	}
-	m.relayClient.CloseConnByPeerKey(peerKey)
+// CloseConnByPeerKey closes an existing relay connection on the relay client
+// associated with serverAddress so that a subsequent OpenConn can create a fresh one.
+func (m *Manager) CloseConnByPeerKey(serverAddress, peerKey string) {
+	m.relayClientMu.RLock()
+	homeClient := m.relayClient
+	m.relayClientMu.RUnlock()
+
+	if homeClient == nil {
+		return
+	}
+
+	homeAddr, err := homeClient.ServerInstanceURL()
+	if err == nil && homeAddr == serverAddress {
+		homeClient.CloseConnByPeerKey(peerKey)
+		return
+	}
+
+	m.relayClientsMutex.RLock()
+	rt := m.relayClients[serverAddress]
+	m.relayClientsMutex.RUnlock()
+	if rt != nil && rt.relayClient != nil {
+		rt.relayClient.CloseConnByPeerKey(peerKey)
+	}
 }

And update the caller in client/internal/peer/worker_relay.go to pass srv.
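
The cross-relay gap described in this comment can be reproduced in a reduced model. The sketch below is only an illustration under stated assumptions: `client`, `manager`, `closeHomeOnly`, `closeByServer`, and the relay addresses are hypothetical, and locking is omitted for brevity.

```go
package main

import "fmt"

// client is a hypothetical relay client holding one entry per peer key.
type client struct{ conns map[string]bool }

func (c *client) CloseConnByPeerKey(peerKey string) { delete(c.conns, peerKey) }

// manager models a home relay client plus foreign clients keyed by server address.
type manager struct {
	homeAddr string
	home     *client
	foreign  map[string]*client
}

// closeHomeOnly mirrors the reviewed behavior: only the home client is touched.
func (m *manager) closeHomeOnly(peerKey string) {
	if m.home != nil {
		m.home.CloseConnByPeerKey(peerKey)
	}
}

// closeByServer mirrors the suggested direction: route by server address.
func (m *manager) closeByServer(serverAddress, peerKey string) {
	if m.home == nil {
		return
	}
	if serverAddress == m.homeAddr {
		m.home.CloseConnByPeerKey(peerKey)
		return
	}
	if c := m.foreign[serverAddress]; c != nil {
		c.CloseConnByPeerKey(peerKey)
	}
}

// newDemoManager places a stale entry for peerA on a foreign relay.
func newDemoManager() *manager {
	return &manager{
		homeAddr: "relay-a",
		home:     &client{conns: map[string]bool{}},
		foreign: map[string]*client{
			"relay-b": {conns: map[string]bool{"peerA": true}},
		},
	}
}

func main() {
	m := newDemoManager()
	m.closeHomeOnly("peerA")
	fmt.Println("stale after home-only close:", m.foreign["relay-b"].conns["peerA"])
	m.closeByServer("relay-b", "peerA")
	fmt.Println("stale after routed close:", m.foreign["relay-b"].conns["peerA"])
}
```

In this model the home-only close leaves the foreign entry in place, so a retry against that relay would hit the same error again; the routed close removes it.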

@pappz pappz self-requested a review April 13, 2026 07:53