Skip to content

v1.5.1: optimise network node status api#4076

Open
abhishek9686 wants to merge 1 commit into
release-v1.5.1from
v1.5.1-optimise-node-status
Open

v1.5.1: optimise network node status api#4076
abhishek9686 wants to merge 1 commit into
release-v1.5.1from
v1.5.1-optimise-node-status

Conversation

@abhishek9686

Copy link
Copy Markdown
Member

Describe your changes

Provide Issue ticket number if applicable/not in title

Provide testing steps

Checklist before requesting a review

  • My changes affect only 10 files or less.
  • I have performed a self-review of my code and tested it.
  • If it is a new feature, I have added thorough tests, my code is <= 1450 lines.
  • If it is a bugfix, my code is <= 200 lines.
  • My functions are <= 80 lines.
  • I have had my code reviewed by a peer.
  • My unit tests pass locally.
  • Netmaker is awesome.

@tenki-reviewer

tenki-reviewer Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review Complete

Files Reviewed: 4
Findings: 2

By Severity:

  • 🟠 High: 2

Two high-severity bugs: GetMetricsForNodeIDs silently drops cache misses without DB fallback, causing degraded node status computation; and peerSummaryStatus always returns OnlineSt when metrics are absent, defeating peer-connectivity filtering.

Files Reviewed (4 files)
logic/nodes.go
pro/initialize.go
pro/logic/metrics.go
pro/logic/status.go

@tenki-reviewer tenki-reviewer Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Risk: 🟠 High (78/100) — 2 high findings · 295 LOC across 4 files


Summary

PR #4076 modifies node status and metrics logic across 4 files in the Netmaker mesh VPN platform. Two high-severity bugs were identified that directly impact network node health assessment.

Finding-001: Cache miss black hole in pro/logic/metrics.go

The batch function GetMetricsForNodeIDs (lines 92-98) returns immediately after the cache-read loop without a DB fallback for cache misses. This diverges from the single-node GetMetrics function (lines 125-130) which correctly falls through to database.FetchRecord. Impact: nodes with valid DB metrics but absent from cache receive nil metrics, causing skipped peer-connectivity analysis and incorrect static-node status (UnKnown).

Finding-002: Dead return path in pro/logic/status.go

The function peerSummaryStatus (lines 188-193) has both branches of the if-else returning OnlineSt unconditionally, making the time.Since check dead code. Peers without metrics always appear healthy, defeating the filtering in checkPeerConnectivityWithContext (lines 243-246) that should exclude Error/Warning peers from the unconnected count — inflating peerNotConnectedCnt and producing false Warning/Error status.

Risk Assessment

Both bugs degrade the core node-status pipeline: one silently drops data, the other produces uniformly optimistic results. Together they can mask real connectivity problems in production meshes. Both are fixable with targeted one-line or small-block changes.

Comment thread pro/logic/metrics.go
Comment on lines +92 to +98
if servercfg.CacheEnabled() {
for id := range want {
if m, ok := getMetricsFromCache(id); ok {
result[id] = metricsReadCopy(&m)
}
}
return result

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 GetMetricsForNodeIDs cache path silently drops cache misses without DB fallback — diverges from GetMetrics and causes degraded status computation (bug)

In pro/logic/metrics.go, the batch function GetMetricsForNodeIDs (lines 82-120) when cache is enabled (lines 92-98) iterates the wanted IDs, looks them up in the in-memory cache, and returns immediately — with no fallback to the database for IDs not found in cache. By contrast, the single-node GetMetrics function (lines 122-146, specifically lines 125-130) correctly falls through to database.FetchRecord on cache miss. This inconsistency has two concrete impacts: (1) Nodes whose metrics exist in the DB but are absent from cache (e.g., startup races, cache eviction, newly created nodes) receive nil metrics, causing computeNodeStatusWithContext to skip peer-connectivity analysis and fall back to simple check-in heuristics. (2) Static nodes whose ingress gateway metrics are missing from cache get UnKnown status instead of the correct online/offline determination. Additionally, the bulk status API path and individual GetNodeStatus calls may return different status values for the same node when cache is enabled but has missing entries.

💡 Suggestion: After the cache-read loop (line 97), collect IDs not found in cache and query them via GetMetrics (which itself does cache→DB fallback) or batch-fetch from DB, then merge into the result map before returning.

📋 Prompt for AI Agents

In pro/logic/metrics.go at lines 92-98, after the cache-read loop, collect any wanted IDs that were not found in the cache. For those missed IDs, call GetMetrics(id) (which has its own cache→DB fallback) or perform a batch DB query. Merge the results into the result map before returning, ensuring the batch function is semantically equivalent to calling GetMetrics for each ID individually. This aligns the behavior with the single-node GetMetrics function at lines 122-146.

Comment thread pro/logic/status.go
Comment on lines +188 to +193
func peerSummaryStatus(node models.Node, metrics *models.Metrics, ctx *networkStatusContext) models.NodeStatus {
if metrics == nil || metrics.Connectivity == nil {
if time.Since(node.LastCheckIn) < models.LastCheckInThreshold {
return models.OnlineSt
}
return models.OnlineSt

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 peerSummaryStatus always returns OnlineSt when metrics are absent; both branches identical (bug)

In pro/logic/status.go, the peerSummaryStatus function (lines 188-222) contains a logic error at lines 189-193. When metrics == nil || metrics.Connectivity == nil, both the if-branch and the fallback return models.OnlineSt. This means peers without metrics always appear healthy, defeating the filtering logic in checkPeerConnectivityWithContext (line 243-246) that is meant to skip peers with Error/Warning status when counting unconnected peers.

💡 Suggestion: Change line 193 from return models.OnlineSt to return models.OfflineSt so that peers without metrics who have exceeded the check-in threshold are correctly reported as offline rather than online.

📋 Prompt for AI Agents

In pro/logic/status.go at line 193, change return models.OnlineSt to return models.OfflineSt. This fixes a logic bug in peerSummaryStatus where the fallback branch incorrectly returns OnlineSt instead of OfflineSt when a node's metrics are absent and the check-in threshold has been exceeded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant