
[Performance][Critical] Session cache is ineffective — cache key uses current block height instead of session start height, causing full node overload #509

@jorgecuesta

Problem Statement

Every relay request triggers gRPC GetSession calls to the full node despite sessions lasting ~50 blocks. The in-memory cache (SturdyC) is effectively bypassed because the cache key changes on every new block, turning the cache into a write-only buffer.

Production Impact (PNF Mainnet)

Metric                          Value
seed-one restarts (2 days)      29
seed-two restarts (2 days)      28
seed-three restarts (2 days)    36
Supplier failure rate           100% for affected services
Failure point                   Pre-routing (session fetch)

All suppliers show 100% relay failure because every BuildHTTPRequestContextForEndpoint call fails when GetSession returns errors from the overwhelmed full node.

Error Signatures

Full node restarted, not yet in check/finalize state:

rpc error: code = Unknown desc = codespace sdk code 26: invalid height:
context did not contain latest block height in either check state or finalize block state (653615)

Connection reset by overwhelmed node:

rpc error: code = Unavailable desc = error reading from server:
read tcp 10.42.12.164:46910->10.43.16.84:9090: read: connection reset by peer

Node crashed mid-stream:

rpc error: code = Unavailable desc = error reading from server: EOF

Root Cause

1. Cache key includes current block height, not session start height

fullnode_cache.go:230-237:

```go
height, err := cfn.GetCurrentBlockHeight(ctx)  // latest block height
// ...
sessionKey := getSessionCacheKey(serviceID, appAddr, height)  // key changes every block
```

The key function (fullnode_cache.go:352-355):

```go
func getSessionCacheKey(serviceID protocol.ServiceID, appAddr string, height int64) string {
    return fmt.Sprintf("%s:%s:%s:%d", sessionCacheKeyPrefix, serviceID, appAddr, height)
}
```

But lazyFullNode.GetSession() queries with height=0 (latest session), so the returned session is identical across all ~50 blocks within a session window. Every new block → new key → cache miss → redundant gRPC call for the same session. The 30s TTL is irrelevant because the key changes before expiry.

2. getActiveGatewaySessions() called per-endpoint with no protocol-level cache

protocol.go:677 calls getActiveGatewaySessions() fresh on every BuildHTTPRequestContextForEndpoint invocation. This is triggered from 6 code paths:

Call Site                     File                                      Line
Initial endpoint selection    http_request_context.go                   534
Retry (path 1)                http_request_context_handle_request.go    405
Fallback endpoint loop        http_request_context_handle_request.go    963
Retry (path 2)                http_request_context_handle_request.go    1298
Hedge request                 hedge.go                                  163
Health check                  health_check_executor.go                  963

A single relay with hedging plus one retry triggers getActiveGatewaySessions 3 times (initial selection, hedge, retry); with N owned apps, each call fans out to N GetSession gRPC calls — 30 calls for one user request with 10 apps.

3. No cross-replica session sharing

Each PATH replica has an independent in-memory SturdyC cache. With 8 mainnet replicas, all session fetches are multiplied 8×. Cold starts (deploys, restarts) cause a thundering herd.

Compounding this, Sauron routes all GetSession gRPC calls to whichever seed node has the highest block height, concentrating 100% of the session query load on a single node:

{"msg":"gRPC routing decision made","selected_node":"seed-mainnet-three",
 "method":"/pocket.session.Query/GetSession"}

This explains why seed-three had the most restarts (36 vs 28-29).


Suggested Fix Direction

The core fix is straightforward: use session start height instead of current block height in the cache key.

```go
params, err := cfn.GetSharedParams(ctx)  // already cached with 2min TTL
sessionStartHeight := height - (height % int64(params.NumBlocksPerSession))
sessionKey := getSessionCacheKey(serviceID, appAddr, sessionStartHeight)
```

This makes the cache key stable for the entire session window (~50 blocks). SharedParams is already cached — no additional gRPC calls needed.

Further improvements (protocol-level caching, Redis cross-replica sharing, proactive prefetch at session boundaries) can build on this incrementally.

Expected Impact

  • Current: ~1 GetSession call per block per (serviceID, appAddr) per replica
  • After fix: ~1 GetSession call per session rotation per (serviceID, appAddr) per replica
  • ~98% reduction in GetSession gRPC calls to full nodes
