
[Performance][Critical] Session cache is ineffective — cache key uses current block height instead of session start height, causing full node overload #509

@jorgecuesta

Problem Statement

Every relay request triggers gRPC GetSession calls to the full node despite sessions lasting ~50 blocks. The in-memory cache (SturdyC) is effectively bypassed because the cache key changes on every new block, turning the cache into a write-only buffer.

Production Impact (PNF Mainnet)

Metric                          Value
seed-one restarts (2 days)      29
seed-two restarts (2 days)      28
seed-three restarts (2 days)    36
Supplier failure rate           100% for affected services
Failure point                   Pre-routing (session fetch)

All suppliers show 100% relay failure because every BuildHTTPRequestContextForEndpoint call fails when GetSession returns errors from the overwhelmed full node.

Error Signatures

Full node restarted, not yet in check/finalize state:

rpc error: code = Unknown desc = codespace sdk code 26: invalid height:
context did not contain latest block height in either check state or finalize block state (653615)

Connection reset by overwhelmed node:

rpc error: code = Unavailable desc = error reading from server:
read tcp 10.42.12.164:46910->10.43.16.84:9090: read: connection reset by peer

Node crashed mid-stream:

rpc error: code = Unavailable desc = error reading from server: EOF

Root Cause

1. Cache key includes current block height, not session start height

fullnode_cache.go:230-237:

```go
height, err := cfn.GetCurrentBlockHeight(ctx)  // latest block height
// ...
sessionKey := getSessionCacheKey(serviceID, appAddr, height)  // key changes every block
```

The key function (fullnode_cache.go:352-355):

```go
func getSessionCacheKey(serviceID protocol.ServiceID, appAddr string, height int64) string {
    return fmt.Sprintf("%s:%s:%s:%d", sessionCacheKeyPrefix, serviceID, appAddr, height)
}
```

But lazyFullNode.GetSession() queries with height=0 (latest session), so the returned session is identical across all ~50 blocks within a session window. Every new block → new key → cache miss → redundant gRPC call for the same session. The 30s TTL is irrelevant because the key changes before expiry.

2. getActiveGatewaySessions() called per-endpoint with no protocol-level cache

protocol.go:677 calls getActiveGatewaySessions() fresh on every BuildHTTPRequestContextForEndpoint invocation. This is triggered from 6 code paths:

Call Site                     File                                      Line
Initial endpoint selection    http_request_context.go                   534
Retry (path 1)                http_request_context_handle_request.go    405
Fallback endpoint loop        http_request_context_handle_request.go    963
Retry (path 2)                http_request_context_handle_request.go    1298
Hedge request                 hedge.go                                  163
Health check                  health_check_executor.go                  963

A single relay with hedging plus one retry triggers getActiveGatewaySessions 3 times (initial selection, hedge, retry); with N owned apps, each call fans out to N GetSession gRPC calls — 30 calls for one user request with 10 apps.

3. No cross-replica session sharing

Each PATH replica has an independent in-memory SturdyC cache. With 8 mainnet replicas, all session fetches are multiplied 8×. Cold starts (deploys, restarts) cause a thundering herd.

Compounding this, Sauron routes all GetSession gRPC calls to whichever seed node has the highest block height, concentrating 100% of the session query load on a single node:

{"msg":"gRPC routing decision made","selected_node":"seed-mainnet-three",
 "method":"/pocket.session.Query/GetSession"}

This explains why seed-three had the most restarts (36 vs 28-29).


Suggested Fix Direction

The core fix is straightforward: use session start height instead of current block height in the cache key.

```go
params, err := cfn.GetSharedParams(ctx)  // already cached with 2min TTL
sessionStartHeight := height - (height % int64(params.NumBlocksPerSession))
sessionKey := getSessionCacheKey(serviceID, appAddr, sessionStartHeight)
```

This makes the cache key stable for the entire session window (~50 blocks). SharedParams is already cached — no additional gRPC calls needed.

Further improvements (protocol-level caching, Redis cross-replica sharing, proactive prefetch at session boundaries) can build on this incrementally.

Expected Impact

  • Current: ~1 GetSession call per block per (serviceID, appAddr) per replica
  • After fix: ~1 GetSession call per session rotation per (serviceID, appAddr) per replica
  • ~98% reduction in GetSession gRPC calls to full nodes
