## Problem Statement
Every relay request triggers a gRPC `GetSession` call to the full node, despite sessions lasting ~50 blocks. The in-memory cache (SturdyC) is effectively bypassed because the cache key changes on every new block, turning the cache into a write-only buffer.
## Production Impact (PNF Mainnet)
| Metric | Value |
|---|---|
| seed-one restarts (2 days) | 29 |
| seed-two restarts (2 days) | 28 |
| seed-three restarts (2 days) | 36 |
| Supplier failure rate | 100% for affected services |
| Failure point | Pre-routing (session fetch) |
All suppliers show 100% relay failure because every `BuildHTTPRequestContextForEndpoint` call fails when `GetSession` returns errors from the overwhelmed full node.
## Error Signatures
Full node restarted, not yet in check/finalize state:

```
rpc error: code = Unknown desc = codespace sdk code 26: invalid height:
context did not contain latest block height in either check state or finalize block state (653615)
```

Connection reset by overwhelmed node:

```
rpc error: code = Unavailable desc = error reading from server:
read tcp 10.42.12.164:46910->10.43.16.84:9090: read: connection reset by peer
```

Node crashed mid-stream:

```
rpc error: code = Unavailable desc = error reading from server: EOF
```
## Root Cause
### 1. Cache key includes current block height, not session start height
`fullnode_cache.go:230-237`:

```go
height, err := cfn.GetCurrentBlockHeight(ctx) // latest block height
// ...
sessionKey := getSessionCacheKey(serviceID, appAddr, height) // key changes every block
```

The key function (`fullnode_cache.go:352-355`):

```go
func getSessionCacheKey(serviceID protocol.ServiceID, appAddr string, height int64) string {
	return fmt.Sprintf("%s:%s:%s:%d", sessionCacheKeyPrefix, serviceID, appAddr, height)
}
```
But `lazyFullNode.GetSession()` queries with `height=0` (latest session), so the returned session is identical across all ~50 blocks within a session window. Every new block → new key → cache miss → redundant gRPC call for the same session. The 30s TTL is irrelevant because the key changes before expiry.
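The churn is easy to quantify: with a current-height key, one ~50-block session window yields ~50 distinct cache keys, i.e. a miss on every block for the same session. A minimal sketch, where the `"session"` prefix and plain string types are stand-ins for the real `sessionCacheKeyPrefix` and `protocol.ServiceID`:

```go
package main

import "fmt"

// Stand-in for the real key function; "session" approximates sessionCacheKeyPrefix.
func getSessionCacheKey(serviceID, appAddr string, height int64) string {
	return fmt.Sprintf("session:%s:%s:%d", serviceID, appAddr, height)
}

// distinctKeys counts the unique cache keys generated across one session
// window when the key embeds the latest block height.
func distinctKeys(sessionStart, blocksPerSession int64) int {
	seen := map[string]bool{}
	for h := sessionStart; h < sessionStart+blocksPerSession; h++ {
		seen[getSessionCacheKey("eth", "pokt1app", h)] = true
	}
	return len(seen)
}

func main() {
	// One session, fifty blocks, fifty keys: every lookup is a miss.
	fmt.Println(distinctKeys(653600, 50)) // 50
}
```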
### 2. `getActiveGatewaySessions()` called per-endpoint with no protocol-level cache
`protocol.go:677` calls `getActiveGatewaySessions()` fresh on every `BuildHTTPRequestContextForEndpoint` invocation. This is triggered from 6 code paths:
| Call Site | File | Line |
|---|---|---|
| Initial endpoint selection | http_request_context.go | 534 |
| Retry (path 1) | http_request_context_handle_request.go | 405 |
| Fallback endpoint loop | http_request_context_handle_request.go | 963 |
| Retry (path 2) | http_request_context_handle_request.go | 1298 |
| Hedge request | hedge.go | 163 |
| Health check | health_check_executor.go | 963 |
A single relay with hedging + 1 retry = 3 `BuildHTTPRequestContextForEndpoint` calls × N owned apps. With 10 apps, that is 30 `GetSession` gRPC calls for one user request.
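A protocol-level cache need not be elaborate: memoizing the session fetch once per request context would collapse all six call paths onto a single lookup. A sketch under assumed types (`Session` and the fetcher signature are hypothetical; the real shapes in protocol.go will differ):

```go
package main

import (
	"fmt"
	"sync"
)

// Session is a hypothetical stand-in for the protocol's session type.
type Session struct{ ID string }

// requestSessions memoizes one getActiveGatewaySessions-style result for the
// lifetime of a single relay request, so retries, hedges, and fallback loops
// reuse the first fetch instead of re-querying the full node.
type requestSessions struct {
	once     sync.Once
	sessions []Session
	err      error
}

func (r *requestSessions) Get(fetch func() ([]Session, error)) ([]Session, error) {
	r.once.Do(func() { r.sessions, r.err = fetch() })
	return r.sessions, r.err
}

func main() {
	calls := 0
	fetch := func() ([]Session, error) {
		calls++ // would be one GetSession round-trip per owned app
		return []Session{{ID: "653600"}}, nil
	}

	rs := &requestSessions{}
	// Initial selection, a retry, and a hedge all share one result.
	for i := 0; i < 3; i++ {
		rs.Get(fetch)
	}
	fmt.Println(calls) // fetch ran once, not three times
}
```

`sync.Once` also makes the memoization safe if hedge and retry paths run concurrently.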
### 3. No cross-replica session sharing
Each PATH replica has an independent in-memory SturdyC cache. With 8 mainnet replicas, all session fetches are multiplied 8×. Cold starts (deploys, restarts) cause a thundering herd.
Compounding this, Sauron routes all `GetSession` gRPC traffic to whichever seed node has the highest block height, concentrating 100% of session query load on a single node:
```json
{"msg":"gRPC routing decision made","selected_node":"seed-mainnet-three",
 "method":"/pocket.session.Query/GetSession"}
```
This explains why seed-three had the most restarts (36 vs 28-29).
## Suggested Fix Direction
The core fix is straightforward: use session start height instead of current block height in the cache key.
```go
params, err := cfn.GetSharedParams(ctx) // already cached with 2min TTL
sessionStartHeight := height - (height % int64(params.NumBlocksPerSession))
sessionKey := getSessionCacheKey(serviceID, appAddr, sessionStartHeight)
```

This makes the cache key stable for the entire session window (~50 blocks). `SharedParams` is already cached, so no additional gRPC calls are needed.
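For concreteness, a sketch of the derivation using the height from the error signature above and an assumed 50-block session length (in practice `NumBlocksPerSession` comes from the cached `SharedParams`):

```go
package main

import "fmt"

// sessionStartHeight floors a block height to the start of its session
// window, per the suggested fix: height - (height % NumBlocksPerSession).
func sessionStartHeight(height, numBlocksPerSession int64) int64 {
	return height - (height % numBlocksPerSession)
}

func main() {
	const blocksPerSession = 50 // assumed; read from SharedParams in practice

	// Every height in the window maps to the same session start, so the
	// cache key is stable for the whole session.
	fmt.Println(sessionStartHeight(653615, blocksPerSession)) // 653600
	fmt.Println(sessionStartHeight(653649, blocksPerSession)) // 653600 (last block of the window)
	fmt.Println(sessionStartHeight(653650, blocksPerSession)) // 653650 (next session, new key)
}
```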
Further improvements (protocol-level caching, Redis cross-replica sharing, proactive prefetch at session boundaries) can build on this incrementally.
## Expected Impact
- Current: ~1 `GetSession` call per block per (serviceID, appAddr) per replica
- After fix: ~1 `GetSession` call per session rotation per (serviceID, appAddr) per replica
- Result: ~98% reduction in `GetSession` gRPC calls to full nodes
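The ~98% figure follows directly from the assumed ~50-block session length: roughly 50 misses per (serviceID, appAddr) per window collapse to one.

```go
package main

import "fmt"

// reductionPct computes the percentage drop from before to after call counts.
func reductionPct(before, after int) int {
	return (before - after) * 100 / before
}

func main() {
	const blocksPerSession = 50 // assumed ~50-block session length

	// Before: one GetSession per block (key churns every block).
	// After: one GetSession per session rotation.
	fmt.Println(reductionPct(blocksPerSession, 1)) // 98
}
```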