fix: close stale relay connection on ErrConnAlreadyExists to recover … by fpenezic · Pull Request #5866 · netbirdio/netbird

fpenezic · 2026-04-12T18:44:58Z

Summary

When a peer behind NAT undergoes a public IP change (e.g. PPPoE reconnect with IP rotation), the remote peer's relay client retains a stale connection entry in its conns map. Subsequent calls to OpenConn() return ErrConnAlreadyExists and the code previously returned without establishing a new connection — leaving WireGuard sending handshake packets into a dead relay pipe. The tunnel never recovers.

This PR fixes two related issues:

Stale relay connection (worker_relay.go, client.go, manager.go): When OpenConn() returns ErrConnAlreadyExists, close the existing (stale) connection via new CloseConnByPeerKey() method and retry. This ensures the relay pipe is always live when processing a new offer.
Stale ICE session ID (worker_ice.go, conn.go): When a WireGuard handshake times out on a relay connection, the ICE worker is not closed, so its session ID is never rotated. The next offer carries the same session ID, causing the remote peer to skip ICE agent recreation and reuse stale candidates. Added ResetSessionID() to force a fresh session ID so the remote peer creates a new ICE agent with current candidates.
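A minimal sketch of the close-and-retry flow from the first item, assuming a hypothetical `relayOpener` interface and an in-memory `fakeRelay`. Only `ErrConnAlreadyExists`, `OpenConn` and `CloseConnByPeerKey` are names from this PR; the signatures here are simplified for illustration and are not the real relay client API:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConnAlreadyExists mirrors the error named in this PR. Everything else in
// this file is a hypothetical stand-in, not the real relay client.
var ErrConnAlreadyExists = errors.New("connection already exists")

// relayOpener is an assumed minimal interface used only for this sketch.
type relayOpener interface {
	OpenConn(peerKey string) (string, error)
	CloseConnByPeerKey(peerKey string)
}

// fakeRelay keeps one entry per peer key, like the conns map described above.
type fakeRelay struct {
	conns map[string]string
	next  int
}

func (f *fakeRelay) OpenConn(peerKey string) (string, error) {
	if _, ok := f.conns[peerKey]; ok {
		return "", ErrConnAlreadyExists
	}
	f.next++
	id := fmt.Sprintf("conn-%d", f.next)
	f.conns[peerKey] = id
	return id, nil
}

func (f *fakeRelay) CloseConnByPeerKey(peerKey string) {
	delete(f.conns, peerKey)
}

// openFresh opens a relay connection. If an entry already exists it is
// treated as stale, evicted, and the open is retried exactly once.
func openFresh(r relayOpener, peerKey string) (string, error) {
	conn, err := r.OpenConn(peerKey)
	if err == nil {
		return conn, nil
	}
	if !errors.Is(err, ErrConnAlreadyExists) {
		return "", err
	}
	r.CloseConnByPeerKey(peerKey)
	return r.OpenConn(peerKey)
}

func main() {
	// Start with a stale entry for peerA, as after a NAT IP change.
	r := &fakeRelay{conns: map[string]string{"peerA": "stale"}}
	conn, err := openFresh(r, "peerA")
	fmt.Println(conn, err) // prints "conn-1 <nil>"
}
```

Retrying only once keeps a persistent failure from looping; any other error is returned unchanged.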

Reproduction scenario

Peer A: behind PPPoE NAT (no port forwarding)
Peer B: public IP
ISP forces periodic PPPoE reconnect, assigning Peer A a new public IP

Before fix: After the IP change, Peer B holds a dead relay connection, so the WireGuard handshake timestamp stays at the zero value (0001-01-01 00:00:00) indefinitely. The tunnel never recovers without a manual restart.

After fix: Tunnel recovers automatically within ~15 seconds.

Changes

shared/relay/client/client.go: Add CloseConnByPeerKey() — force-close relay connection by peer key, remove from internal map
shared/relay/client/manager.go: Add CloseConnByPeerKey() proxy on Manager
client/internal/peer/worker_relay.go: On ErrConnAlreadyExists → close stale conn → retry OpenConn()
client/internal/peer/worker_ice.go: Add ResetSessionID() — rotate ICE session ID without closing agent
client/internal/peer/conn.go: Call ResetSessionID() in onWGDisconnected relay case
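The client-level change can be pictured roughly as follows. This is a sketch under assumptions: the `client` and `conn` types, the mutex layout, and the boolean return value are hypothetical and only illustrate "remove from internal map, then close"; the real method in `shared/relay/client/client.go` may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for a relay connection.
type conn struct {
	closed bool
}

func (c *conn) Close() { c.closed = true }

// client is a hypothetical stand-in that tracks one conn per peer key.
type client struct {
	mu    sync.Mutex
	conns map[string]*conn
}

// CloseConnByPeerKey removes the entry for peerKey from the map and closes it.
// It reports whether an entry existed (the return value is an assumption made
// for this sketch so the behavior is observable).
func (c *client) CloseConnByPeerKey(peerKey string) bool {
	c.mu.Lock()
	existing, ok := c.conns[peerKey]
	if ok {
		delete(c.conns, peerKey)
	}
	c.mu.Unlock()

	if !ok {
		return false
	}
	// Close outside the lock so a slow Close cannot block other map users.
	existing.Close()
	return true
}

func main() {
	c := &client{conns: map[string]*conn{"peerA": {}}}
	fmt.Println(c.CloseConnByPeerKey("peerA")) // true: entry evicted and closed
	fmt.Println(c.CloseConnByPeerKey("peerA")) // false: nothing left to close
}
```

Deleting the map entry before closing means a concurrent `OpenConn` for the same peer key no longer sees the stale entry.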

Test plan

Tested with real PPPoE reconnect (IP rotation) — tunnel recovers automatically
Tested PPPoE reconnect without IP change — tunnel recovers
Tested across multiple reconnect cycles over 48h
Existing unit tests pass

Summary by CodeRabbit

Bug Fixes
- Relay disconnects now trigger a full ICE session reset to avoid stale session state and improve reconnection reliability.
- Duplicate relay-connection errors are handled by closing stale connections and retrying, reducing failed-open cases.
- Added targeted forced-close for relay connections to ensure proper cleanup and unsubscribe before retrying.


CLAassistant · 2026-04-12T18:45:10Z

All committers have signed the CLA.

coderabbitai · 2026-04-12T18:45:18Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a214f246-0539-4674-b362-990ab374305c

📥 Commits

Reviewing files that changed from the base of the PR and between d2a2ff5 and a9783e0.

📒 Files selected for processing (2)

client/internal/peer/worker_relay.go
shared/relay/client/manager.go

🚧 Files skipped from review as they are similar to previous changes (1)

shared/relay/client/manager.go

📝 Walkthrough


The PR forces a new ICE session ID when a relay connection is closed, adds WorkerICE.ResetSessionID() and CloseConnByPeerKey() APIs, and changes relay open logic to detect existing connections, evict stale entries, and retry opening the relay connection.
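To illustrate why rotating the session ID matters, here is a hedged sketch. The `iceWorker` and `remotePeer` types, the ID format, and the error return on `ResetSessionID` are all assumptions for illustration; only the method name and the "same ID means the remote side skips agent recreation" behavior come from the PR description.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
)

// iceWorker is a hypothetical stand-in for the ICE worker.
type iceWorker struct {
	mu        sync.Mutex
	sessionID string
}

// newSessionID returns a random 16-character hex string (assumed format).
func newSessionID() (string, error) {
	b := make([]byte, 8)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// ResetSessionID rotates the session ID without touching any agent state.
func (w *iceWorker) ResetSessionID() error {
	id, err := newSessionID()
	if err != nil {
		return err
	}
	w.mu.Lock()
	w.sessionID = id
	w.mu.Unlock()
	return nil
}

// remotePeer models the receiving side: it recreates its ICE agent only when
// the offer carries a session ID it has not seen last.
type remotePeer struct {
	lastSessionID string
	agentsCreated int
}

// onOffer reports whether a new agent was created for this offer.
func (p *remotePeer) onOffer(sessionID string) bool {
	if sessionID == p.lastSessionID {
		return false // same session: stale candidates are reused
	}
	p.lastSessionID = sessionID
	p.agentsCreated++
	return true
}

func main() {
	w := &iceWorker{sessionID: "old"}
	remote := &remotePeer{}

	fmt.Println(remote.onOffer(w.sessionID)) // true: first offer
	fmt.Println(remote.onOffer(w.sessionID)) // false: same ID, agent reused

	if err := w.ResetSessionID(); err != nil {
		panic(err)
	}
	fmt.Println(remote.onOffer(w.sessionID)) // true: rotated ID forces a new agent
}
```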

Changes

| Cohort / File(s) | Summary |
|---|---|
| ICE Session Management: `client/internal/peer/conn.go`, `client/internal/peer/worker_ice.go` | Added `WorkerICE.ResetSessionID()` and call site: `Conn.onWGDisconnected()` now calls `workerICE.ResetSessionID()` when the active connection is `conntype.Relay` (nil-guarded). |
| Relay Client Eviction & Retry: `client/internal/peer/worker_relay.go`, `shared/relay/client/client.go`, `shared/relay/client/manager.go` | Added `CloseConnByPeerKey()` at client and manager layers; `OpenConn` handling now treats `ErrConnAlreadyExists` by logging, force-closing the stale conn via `CloseConnByPeerKey()`, and retrying `OpenConn` once before failing. |

Sequence Diagram(s)

sequenceDiagram
  participant Conn
  participant WorkerRelay
  participant RelayClient
  participant WorkerICE

  Conn->>WorkerRelay: detect WG disconnected (active=Relay)
  WorkerRelay->>RelayClient: Close relay connection
  RelayClient-->>WorkerRelay: closed
  WorkerRelay->>WorkerICE: ResetSessionID()  -- new ICE session ID
  WorkerICE-->>WorkerRelay: ack
  WorkerRelay->>Conn: continue disconnect handling

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

[client] Reset WireGuard endpoint on ICE session change during relay fallback #5283: Related changes to ICE session handling and propagation of session changes affecting endpoint resets.
[client] Extend WG watcher for ICE connection too #5133: Earlier modifications to Conn.onWGDisconnected() and WG lifecycle that this PR extends.

Suggested reviewers

lixmal
mlsmaycon

Poem

🐰 When relays fall and tunnels sigh,
I hop and make a brand new try,
A fresh ICE name to start the race,
I nudge the links and mend the place,
Hoppity hop — reconnect with grace!

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the primary change: closing stale relay connections on ErrConnAlreadyExists to enable tunnel recovery. |
| Description check | ✅ Passed | The description provides comprehensive context including reproduction scenario, specific changes, and testing methodology; however, required template sections are unfilled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 150-160: The CloseConnByPeerKey method currently always calls
m.relayClient.CloseConnByPeerKey(peerKey) which misses stale connections stored
under a specific server entry; modify Manager.CloseConnByPeerKey to
accept/lookup the serverAddress (use m.relayClients[srv].relayClient when
present) and call that relayClient.CloseConnByPeerKey(peerKey) instead of always
using m.relayClient; ensure the logic still handles the nil/default
m.relayClient case and preserves the mutex usage around m.relayClients access;
also update the caller in client/internal/peer/worker_relay.go to pass the srv
value into CloseConnByPeerKey so cross-relay ErrConnAlreadyExists recovery finds
and removes the stale entry created in openConnVia/OpenConn.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dcc48e08-b6cd-48da-ad60-56304d5f8378

📥 Commits

Reviewing files that changed from the base of the PR and between 5259e5d and d2a2ff5.

📒 Files selected for processing (5)

client/internal/peer/conn.go
client/internal/peer/worker_ice.go
client/internal/peer/worker_relay.go
shared/relay/client/client.go
shared/relay/client/manager.go

coderabbitai · 2026-04-12T18:51:14Z

+// CloseConnByPeerKey closes an existing relay connection for the given peer key
+// so that a subsequent OpenConn can create a fresh one.
+func (m *Manager) CloseConnByPeerKey(peerKey string) {
+	m.relayClientMu.RLock()
+	defer m.relayClientMu.RUnlock()
+
+	if m.relayClient == nil {
+		return
+	}
+	m.relayClient.CloseConnByPeerKey(peerKey)
+}


⚠️ Potential issue | 🟠 Major

Route stale-connection cleanup to the relay client for serverAddress.

Line 159 always closes m.relayClient, but OpenConn() can return ErrConnAlreadyExists from a foreign relay client created in openConnVia(). In the cross-relay path, the stale entry lives in m.relayClients[srv].relayClient, so the retry from client/internal/peer/worker_relay.go will hit the same error again and recovery still fails there.

Suggested direction

-// CloseConnByPeerKey closes an existing relay connection for the given peer key
-// so that a subsequent OpenConn can create a fresh one.
-func (m *Manager) CloseConnByPeerKey(peerKey string) {
-	m.relayClientMu.RLock()
-	defer m.relayClientMu.RUnlock()
-
-	if m.relayClient == nil {
-		return
-	}
-	m.relayClient.CloseConnByPeerKey(peerKey)
+// CloseConnByPeerKey closes an existing relay connection on the relay client
+// associated with serverAddress so that a subsequent OpenConn can create a fresh one.
+func (m *Manager) CloseConnByPeerKey(serverAddress, peerKey string) {
+	m.relayClientMu.RLock()
+	homeClient := m.relayClient
+	m.relayClientMu.RUnlock()
+
+	if homeClient == nil {
+		return
+	}
+
+	homeAddr, err := homeClient.ServerInstanceURL()
+	if err == nil && homeAddr == serverAddress {
+		homeClient.CloseConnByPeerKey(peerKey)
+		return
+	}
+
+	m.relayClientsMutex.RLock()
+	rt := m.relayClients[serverAddress]
+	m.relayClientsMutex.RUnlock()
+	if rt != nil && rt.relayClient != nil {
+		rt.relayClient.CloseConnByPeerKey(peerKey)
+	}
 }

And update the caller in client/internal/peer/worker_relay.go to pass srv.



sonarqubecloud · 2026-04-12T19:02:56Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

fix: close stale relay connection on ErrConnAlreadyExists to recover …

d2a2ff5

…tunnel after NAT IP change

coderabbitai bot reviewed Apr 12, 2026


fix: route CloseConnByPeerKey to correct relay client for foreign ser…

a9783e0

…vers

pappz self-requested a review April 13, 2026 07:53

fpenezic wants to merge 2 commits into netbirdio:main from fpenezic:fix/stale-relay-conn-recovery
