v1.5.1: optimise network node status api#4076
Conversation
|
Review Complete Files Reviewed: 4 By Severity:
Two high-severity bugs: GetMetricsForNodeIDs silently drops cache misses without DB fallback, causing degraded node status computation; and peerSummaryStatus always returns OnlineSt when metrics are absent, defeating peer-connectivity filtering. Files Reviewed (4 files) |
There was a problem hiding this comment.
Risk: 🟠 High (78/100) — 2 high findings · 295 LOC across 4 files
Summary
PR #4076 modifies node status and metrics logic across 4 files in the Netmaker mesh VPN platform. Two high-severity bugs were identified that directly impact network node health assessment.
Finding-001: Cache miss black hole in pro/logic/metrics.go
The batch function GetMetricsForNodeIDs (lines 92-98) returns immediately after the cache-read loop without a DB fallback for cache misses. This diverges from the single-node GetMetrics function (lines 125-130) which correctly falls through to database.FetchRecord. Impact: nodes with valid DB metrics but absent from cache receive nil metrics, causing skipped peer-connectivity analysis and incorrect static-node status (UnKnown).
Finding-002: Dead return path in pro/logic/status.go
The function peerSummaryStatus (lines 188-193) has both branches of the if-else returning OnlineSt unconditionally, making the time.Since check dead code. Peers without metrics always appear healthy, defeating the filtering in checkPeerConnectivityWithContext (lines 243-246) that should exclude Error/Warning peers from the unconnected count — inflating peerNotConnectedCnt and producing false Warning/Error status.
Risk Assessment
Both bugs degrade the core node-status pipeline: one silently drops data, the other produces uniformly optimistic results. Together they can mask real connectivity problems in production meshes. Both are fixable with targeted one-line or small-block changes.
| if servercfg.CacheEnabled() { | ||
| for id := range want { | ||
| if m, ok := getMetricsFromCache(id); ok { | ||
| result[id] = metricsReadCopy(&m) | ||
| } | ||
| } | ||
| return result |
There was a problem hiding this comment.
🟠 GetMetricsForNodeIDs cache path silently drops cache misses without DB fallback — diverges from GetMetrics and causes degraded status computation (bug)
In pro/logic/metrics.go, the batch function GetMetricsForNodeIDs (lines 82-120) when cache is enabled (lines 92-98) iterates the wanted IDs, looks them up in the in-memory cache, and returns immediately — with no fallback to the database for IDs not found in cache. By contrast, the single-node GetMetrics function (lines 122-146, specifically lines 125-130) correctly falls through to database.FetchRecord on cache miss. This inconsistency has two concrete impacts: (1) Nodes whose metrics exist in the DB but are absent from cache (e.g., startup races, cache eviction, newly created nodes) receive nil metrics, causing computeNodeStatusWithContext to skip peer-connectivity analysis and fall back to simple check-in heuristics. (2) Static nodes whose ingress gateway metrics are missing from cache get UnKnown status instead of the correct online/offline determination. Additionally, the bulk status API path and individual GetNodeStatus calls may return different status values for the same node when cache is enabled but has missing entries.
💡 Suggestion: After the cache-read loop (line 97), collect IDs not found in cache and query them via GetMetrics (which itself does cache→DB fallback) or batch-fetch from DB, then merge into the result map before returning.
📋 Prompt for AI Agents
In pro/logic/metrics.go at lines 92-98, after the cache-read loop, collect any wanted IDs that were not found in the cache. For those missed IDs, call GetMetrics(id) (which has its own cache→DB fallback) or perform a batch DB query. Merge the results into the result map before returning, ensuring the batch function is semantically equivalent to calling GetMetrics for each ID individually. This aligns the behavior with the single-node GetMetrics function at lines 122-146.
| func peerSummaryStatus(node models.Node, metrics *models.Metrics, ctx *networkStatusContext) models.NodeStatus { | ||
| if metrics == nil || metrics.Connectivity == nil { | ||
| if time.Since(node.LastCheckIn) < models.LastCheckInThreshold { | ||
| return models.OnlineSt | ||
| } | ||
| return models.OnlineSt |
There was a problem hiding this comment.
🟠 peerSummaryStatus always returns OnlineSt when metrics are absent; both branches identical (bug)
In pro/logic/status.go, the peerSummaryStatus function (lines 188-222) contains a logic error at lines 189-193. When metrics == nil || metrics.Connectivity == nil, both the if-branch and the fallback return models.OnlineSt. This means peers without metrics always appear healthy, defeating the filtering logic in checkPeerConnectivityWithContext (line 243-246) that is meant to skip peers with Error/Warning status when counting unconnected peers.
💡 Suggestion: Change line 193 from return models.OnlineSt to return models.OfflineSt so that peers without metrics who have exceeded the check-in threshold are correctly reported as offline rather than online.
📋 Prompt for AI Agents
In pro/logic/status.go at line 193, change return models.OnlineSt to return models.OfflineSt. This fixes a logic bug in peerSummaryStatus where the fallback branch incorrectly returns OnlineSt instead of OfflineSt when a node's metrics are absent and the check-in threshold has been exceeded.
Describe your changes
Provide Issue ticket number if applicable/not in title
Provide testing steps
Checklist before requesting a review