fix: close stale relay connection on ErrConnAlreadyExists to recover tunnel after NAT IP change #5866

fpenezic wants to merge 2 commits into netbirdio:main

Conversation
Walkthrough
The PR forces a new ICE session ID when a relay connection is closed and adds a way to close a stale relay connection by peer key.

Sequence Diagram
sequenceDiagram
participant Conn
participant WorkerRelay
participant RelayClient
participant WorkerICE
Conn->>WorkerRelay: detect WG disconnected (active=Relay)
WorkerRelay->>RelayClient: Close relay connection
RelayClient-->>WorkerRelay: closed
WorkerRelay->>WorkerICE: ResetSessionID() -- new ICE session ID
WorkerICE-->>WorkerRelay: ack
WorkerRelay->>Conn: continue disconnect handling
Estimated code review effort: 3 (Moderate), ~25 minutes
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 150-160: The CloseConnByPeerKey method currently always calls
m.relayClient.CloseConnByPeerKey(peerKey) which misses stale connections stored
under a specific server entry; modify Manager.CloseConnByPeerKey to
accept/lookup the serverAddress (use m.relayClients[srv].relayClient when
present) and call that relayClient.CloseConnByPeerKey(peerKey) instead of always
using m.relayClient; ensure the logic still handles the nil/default
m.relayClient case and preserves the mutex usage around m.relayClients access;
also update the caller in client/internal/peer/worker_relay.go to pass the srv
value into CloseConnByPeerKey so cross-relay ErrConnAlreadyExists recovery finds
and removes the stale entry created in openConnVia/OpenConn.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dcc48e08-b6cd-48da-ad60-56304d5f8378
📒 Files selected for processing (5)
client/internal/peer/conn.go
client/internal/peer/worker_ice.go
client/internal/peer/worker_relay.go
shared/relay/client/client.go
shared/relay/client/manager.go
// CloseConnByPeerKey closes an existing relay connection for the given peer key
// so that a subsequent OpenConn can create a fresh one.
func (m *Manager) CloseConnByPeerKey(peerKey string) {
	m.relayClientMu.RLock()
	defer m.relayClientMu.RUnlock()

	if m.relayClient == nil {
		return
	}
	m.relayClient.CloseConnByPeerKey(peerKey)
}
Route stale-connection cleanup to the relay client for serverAddress.
Line 159 always closes m.relayClient, but OpenConn() can return ErrConnAlreadyExists from a foreign relay client created in openConnVia(). In the cross-relay path, the stale entry lives in m.relayClients[srv].relayClient, so the retry from client/internal/peer/worker_relay.go will hit the same error again and recovery still fails there.
Suggested direction
-// CloseConnByPeerKey closes an existing relay connection for the given peer key
-// so that a subsequent OpenConn can create a fresh one.
-func (m *Manager) CloseConnByPeerKey(peerKey string) {
- m.relayClientMu.RLock()
- defer m.relayClientMu.RUnlock()
-
- if m.relayClient == nil {
- return
- }
- m.relayClient.CloseConnByPeerKey(peerKey)
+// CloseConnByPeerKey closes an existing relay connection on the relay client
+// associated with serverAddress so that a subsequent OpenConn can create a fresh one.
+func (m *Manager) CloseConnByPeerKey(serverAddress, peerKey string) {
+ m.relayClientMu.RLock()
+ homeClient := m.relayClient
+ m.relayClientMu.RUnlock()
+
+ if homeClient == nil {
+ return
+ }
+
+ homeAddr, err := homeClient.ServerInstanceURL()
+ if err == nil && homeAddr == serverAddress {
+ homeClient.CloseConnByPeerKey(peerKey)
+ return
+ }
+
+ m.relayClientsMutex.RLock()
+ rt := m.relayClients[serverAddress]
+ m.relayClientsMutex.RUnlock()
+ if rt != nil && rt.relayClient != nil {
+ rt.relayClient.CloseConnByPeerKey(peerKey)
+ }
+}

And update the caller in client/internal/peer/worker_relay.go to pass srv.
Summary
When a peer behind NAT undergoes a public IP change (e.g. a PPPoE reconnect with IP rotation), the remote peer's relay client retains a stale connection entry in its conns map. Subsequent calls to OpenConn() return ErrConnAlreadyExists, and the code previously returned without establishing a new connection, leaving WireGuard sending handshake packets into a dead relay pipe. The tunnel never recovers.

This PR fixes two related issues:
1. Stale relay connection (worker_relay.go, client.go, manager.go): When OpenConn() returns ErrConnAlreadyExists, close the existing (stale) connection via the new CloseConnByPeerKey() method and retry. This ensures the relay pipe is always live when processing a new offer.

2. Stale ICE session ID (worker_ice.go, conn.go): When a WireGuard handshake times out on a relay connection, the ICE worker is not closed, so its session ID is never rotated. The next offer carries the same session ID, causing the remote peer to skip ICE agent recreation and reuse stale candidates. Added ResetSessionID() to force a fresh session ID so the remote peer creates a new ICE agent with current candidates.

Reproduction scenario

Before fix: After an IP change, Peer B holds a dead relay connection; the WG handshake stays stuck at 0001-01-01 00:00:00 indefinitely. The tunnel never recovers without a manual restart.

After fix: The tunnel recovers automatically within ~15 seconds.
Changes
- shared/relay/client/client.go: Add CloseConnByPeerKey(): force-close a relay connection by peer key and remove it from the internal map
- shared/relay/client/manager.go: Add a CloseConnByPeerKey() proxy on Manager
- client/internal/peer/worker_relay.go: On ErrConnAlreadyExists, close the stale connection and retry OpenConn()
- client/internal/peer/worker_ice.go: Add ResetSessionID(): rotate the ICE session ID without closing the agent
- client/internal/peer/conn.go: Call ResetSessionID() in the onWGDisconnected relay case

Test plan