Feat/hot path optimization with external rpc validation #508

Merged
oten91 merged 71 commits into feat/hot-path-optimization from
feat/hot-path-optimization-with-external-rpc-validation
Mar 11, 2026

Conversation


@oten91 oten91 commented Feb 17, 2026

- Update to Go 1.25
- Add optional external RPC validation

When all suppliers in a session are behind the real chain tip, internal
consensus considers everything "in sync" because it only compares
endpoints against each other. Add optional per-service external RPC
endpoints that periodically fetch the real block height as a floor for
perceivedBlockNumber, correctly filtering stale suppliers.

Supports multiple sources per service for redundancy (max wins). Handles
EVM hex, Cosmos decimal, and Solana numeric response formats. Failure-safe:
unreachable sources log a warning and are skipped. External heights are
written to Redis for cross-replica benefit.
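The parsing across the three response formats, plus the "max wins" rule across redundant sources, can be sketched roughly as below. This is a minimal illustration assuming standard JSON-RPC-shaped responses; the function names (`extractHeight`, `maxHeight`) are illustrative, not the actual PATH code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// extractHeight parses a block height from the three formats the PR
// describes: EVM hex strings ("0x153d216"), Cosmos decimal strings
// ("12345678"), and Solana bare JSON numbers.
func extractHeight(body []byte) (uint64, error) {
	var resp struct {
		Result json.RawMessage `json:"result"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, err
	}
	// Solana: {"result": 250000000} — a bare number.
	var num uint64
	if err := json.Unmarshal(resp.Result, &num); err == nil {
		return num, nil
	}
	var s string
	if err := json.Unmarshal(resp.Result, &s); err != nil {
		return 0, fmt.Errorf("unsupported result format: %s", resp.Result)
	}
	// EVM: hex string with 0x prefix.
	if strings.HasPrefix(s, "0x") {
		return strconv.ParseUint(strings.TrimPrefix(s, "0x"), 16, 64)
	}
	// Cosmos: decimal string.
	return strconv.ParseUint(s, 10, 64)
}

// maxHeight implements the "max wins" rule across redundant sources.
func maxHeight(heights []uint64) uint64 {
	var max uint64
	for _, h := range heights {
		if h > max {
			max = h
		}
	}
	return max
}

func main() {
	h, _ := extractHeight([]byte(`{"result":"0x10"}`))
	fmt.Println(h)                                 // 16
	fmt.Println(maxHeight([]uint64{100, 250, 90})) // 250
}
```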

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oten91 oten91 requested a review from jorgecuesta February 17, 2026 22:49
@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch 2 times, most recently from 113320b to ff9606e on February 17, 2026 23:06
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch from ff9606e to e932bd2 on February 17, 2026 23:19

@jorgecuesta jorgecuesta left a comment


LGTM, just a minor question about the assumption that all endpoints are JSON-RPC compatible, plus a possible optimization.

@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch from d1bf40c to 1cb8631 on February 18, 2026 13:50
… mode to external block fetcher

- Replace docker/cli exclude with direct require of v29.2.1 which uses
  moby/moby packages with proper Go modules, fixing the range-over-function
  compilation error in E2E tests.
- Pin docker/docker to v28.5.1 and dockertest to v3.11.0 to fix broken pipe
  error when streaming Docker build context via the API in E2E tests.
- Add type field ("jsonrpc" default or "rest") to ExternalBlockSource config
  so Cosmos REST endpoints (GET /status) work alongside JSON-RPC POST.
- Expand .dockerignore to exclude non-essential directories (docusaurus, docs,
  research, etc.) reducing build context size.
- Add integration tests against real public RPC endpoints (ETH, Solana,
  Osmosis JSON-RPC, Osmosis REST) to validate all chain type parsing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch from 1cb8631 to 25be89b on February 18, 2026 14:10
oten91 and others added 20 commits February 18, 2026 15:20
Prevents cold-start filtering where the external block source reports the
real chain tip before suppliers have reported their heights, causing all
endpoints to appear "behind" and get filtered out.

Default grace period: 60s. Configurable per-service via `grace_period`
in `external_block_sources` config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds ConsumeExternalBlockHeight to Cosmos and Solana QoS instances so
external_block_sources works for all chain types, not just EVM.

Also fixes EVM: skip Redis writes during grace period to prevent other
replicas from picking up the external height before suppliers report.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Docker image build + container startup can take 5+ minutes in CI,
which was consuming the entire 5-minute context timeout before tests
even started. This caused all Vegeta test methods to immediately
report "canceled" with 0 requests.

Move context creation to after getGatewayURLForTestMode() returns,
so the 5-minute timeout only governs actual test execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CI workflow already builds the Docker image in Phase 1 and passes
it as an artifact to Phase 2. But force_rebuild_image was still true
in the default config, causing a redundant ~5 minute rebuild from
scratch (with NoCache) before every E2E test run.

Set force_rebuild_image=false in the CI config script so the test
reuses the pre-built image from Phase 1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All 5 matrix jobs (3 HTTP + 2 WebSocket) were downloading the Docker
image artifact simultaneously, triggering GitHub's secondary rate
limit (403 Forbidden). Stagger downloads at 0/5/10/15/20s intervals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…o QoS instances

QoS instances are created at startup before external health check rules are
loaded, so syncAllowance starts at 0 (disabled). This means stale endpoints
were never filtered even when external rules specified a sync_allowance.

Fix by using atomic.Uint64 for syncAllowance in both EVM and Cosmos QoS
configs, and propagating the value from the health check executor when
external rules are loaded/refreshed.
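The atomic propagation pattern can be sketched as follows. This is a simplified illustration of the approach (an `atomic.Uint64` read on every check, stored later by the health check executor), not the actual QoS config struct:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// qosConfig starts with syncAllowance=0 (filtering disabled); the health
// check executor stores the real value once external rules load, with no
// lock needed on the hot path.
type qosConfig struct {
	syncAllowance atomic.Uint64
}

func (c *qosConfig) SetSyncAllowance(v uint64) { c.syncAllowance.Store(v) }

// isStale reads the current allowance on every call, so a late update from
// external rules takes effect immediately.
func (c *qosConfig) isStale(endpointBlock, perceivedBlock uint64) bool {
	allowance := c.syncAllowance.Load()
	if allowance == 0 {
		return false // rules not loaded yet: don't filter
	}
	return endpointBlock+allowance < perceivedBlock
}

func main() {
	var cfg qosConfig
	fmt.Println(cfg.isStale(900, 1000)) // false: allowance not set yet
	cfg.SetSyncAllowance(20)
	fmt.Println(cfg.isStale(900, 1000)) // true: 900+20 < 1000
}
```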

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eneric QoS

Services using generic/NoOp QoS (near, sui, tron) previously had no block
height tracking, meaning stale endpoints were never filtered even when
external block sources were configured. This converts NoOpQoS from a
stateless value type to a stateful pointer type with per-endpoint block
height tracking, sync-allowance-based filtering, and external block height
consumption — matching the pattern used by Cosmos and Solana QoS.

Default behavior is unchanged: when syncAllowance=0 (default), endpoint
selection remains purely random.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In multi-replica deployments, each replica updates perceivedBlockHeight
locally only. This means Replica A may see block 1000 while Replica B
stays at 950, causing B to select stale endpoints. EVM already had full
Redis sync; this extends the same pattern to the remaining QoS types.

Adds SetReputationService, StartBackgroundSync (periodic Redis reads),
and async Redis writes in UpdateFromExtractedData and
ConsumeExternalBlockHeight for Solana, Cosmos, and NoOp QoS. Uses simple
max semantics (if Redis > local, update) rather than EVM's consensus
mechanism. No changes to cmd/main.go needed — existing duck-typed
interface checks auto-detect the new methods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…onfigured

NewSimpleQoSInstance (used for all EVM services) sets getEVMChainID()
to "". However, isChainIDValid() still ran unconditionally, rejecting
every observed endpoint — either because it had no chain ID observation
yet (errNoChainIDObs) or because the observed chain ID (e.g. "0x38")
didn't match "". This caused a flood of warn logs and forced fallback
to random endpoint selection.

Skip chain ID validation and synthetic eth_chainId checks entirely when
no expected chain ID is configured, consistent with the constructor's
intent to delegate validation to active health checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…a stale supplier filtering

Health checks only run on the leader pod, so non-leader replicas had empty
per-endpoint block heights and could not filter stale suppliers. This adds
Redis HSET/HGETALL sync for per-endpoint block heights across all 4 QoS types
(EVM, Solana, Cosmos, NoOp), enabling all replicas to filter stale endpoints.

Write side: async HSET from UpdateFromExtractedData after each health check.
Read side: HGETALL on the existing 5s background sync ticker, updating only
existing local endpoint entries where Redis has a higher block height.
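The read-side merge rule, as described in this commit (update only entries that already exist locally, and only upward), can be sketched with the HGETALL result modeled as a plain map. Names are illustrative:

```go
package main

import "fmt"

// mergeEndpointHeights applies the HGETALL result from Redis
// (endpoint key -> block height) to the local store: only entries that
// already exist locally are touched, and only when Redis is higher.
func mergeEndpointHeights(local, fromRedis map[string]uint64) {
	for key, redisHeight := range fromRedis {
		localHeight, exists := local[key]
		if !exists {
			continue // don't create unknown endpoints (behavior at this commit)
		}
		if redisHeight > localHeight {
			local[key] = redisHeight
		}
	}
}

func main() {
	local := map[string]uint64{"pokt1abc-https://node-a": 950}
	mergeEndpointHeights(local, map[string]uint64{
		"pokt1abc-https://node-a": 1000, // higher: taken
		"pokt1xyz-https://node-b": 800,  // unknown locally: skipped
	})
	fmt.Println(local["pokt1abc-https://node-a"]) // 1000
}
```

A later commit in this PR relaxes the "don't create unknown endpoints" rule for non-leader replicas; the sketch above reflects the behavior introduced here.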

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests cover both write side (UpdateFromExtractedData writes per-endpoint
block heights to Redis) and read side (StartBackgroundSync reads from Redis
and updates local endpoint store) for all 4 QoS types: EVM, Solana, Cosmos,
and NoOp. Verifies that:
- Higher Redis values update local endpoints
- Lower Redis values do NOT downgrade local endpoints
- Non-existent local endpoints are NOT created from Redis data
- Periodic sync picks up new Redis data after startup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Non-leader replicas had empty endpoint stores at startup because health
checks only run on the leader pod. The syncEndpointBlocksFromRedis
closure would fetch per-endpoint block heights from Redis but silently
skip all entries since none existed locally. This meant stale suppliers
were never filtered on non-leader pods.

Now, EVM/Cosmos/NoOp QoS types create new endpoint entries from Redis
data when the endpoint doesn't exist in the local store, enabling
block height filtering from the first request. Solana is excluded
because its validateBasic() requires health+epoch data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tractBlockHeight

Extends extractBlockHeight to handle non-EVM response formats for external
block sources on generic chains (near, sui, tron):
- Decimal strings without 0x prefix parsed as base-10 (Sui)
- Numeric latest_block_height in sync_info objects (NEAR)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e period

Add mdbx_panic to the heuristic error indicator patterns so that Erigon
nodes with corrupted/full MDBX databases trigger retries on a different
supplier instead of returning the error directly to users. Also reduce
the external block grace period from 60s to 30s to shorten the cold
start window where stale suppliers can leak through.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… comet_bft unsupported

CometBFT methods (status, block, validators, etc.) sent via JSON-RPC POST
were rejected with "unsupported RPC type" when suppliers hadn't staked
comet_bft endpoints. Since CometBFT nodes handle both REST and JSON-RPC
interfaces on the same endpoint, these requests can safely route to json_rpc
endpoints instead. This fallback is a no-op when comet_bft is properly staked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ng UNKNOWN_RPC

NoOp QoS was discarding the RPC type detected by the gateway and always
setting UNKNOWN_RPC on service payloads. This caused all generic services
(tron, sui, near) to report unknown_rpc in metrics/observations. Now the
detected type (json_rpc, rest, etc.) flows through to the payload.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Endpoint block height entries accumulate indefinitely as sessions rotate,
causing non-leader replicas to recreate stale entries from Redis on every
sync cycle. This adds a lastSeen timestamp to each endpoint entry, updated
during filtering when endpoints appear in active sessions. A periodic sweep
(every 5s tick) removes entries older than 2h TTL from both local stores
and Redis via HDEL.

Changes across all QoS types (EVM, Cosmos, Solana, NoOp):
- Add lastSeen field to endpoint structs
- Add touchEndpoints() to update lastSeen during filtering (separate WLock)
- Add sweepStaleEndpoints() to remove entries past TTL
- Call sweep in StartBackgroundSync tick loop
- Add RemoveEndpointBlockHeights to ReputationService/Storage interfaces
- Implement HDEL-based cleanup in Redis storage, delete in memory storage
- Update all 6 test mocks with the new interface method

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ove them

Entries created by syncEndpointBlocksFromRedis had lastSeen=zero, which
the sweep intentionally skips. This meant Redis-synced entries were never
cleaned up. Set lastSeen=time.Now() on new entries so they get swept after
the 2h TTL if they never appear in an active session.

Also fix NoOp UpdateFromExtractedData to preserve existing lastSeen instead
of overwriting the entire endpointState struct.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Broken/lazy suppliers return empty responses like "result":[],
"result":{}, or bare {}. This adds two layers of defense:

Layer 1 — Heuristic detection (retry on different supplier):
- REST bare {} → rest_empty_object (confidence 0.80, retry)
- JSON-RPC "result":{} → confidence 0.95, MajorErrorSignal (never valid
  for any JSON-RPC method)
- JSON-RPC "result":[] → confidence 0.75, MinorErrorSignal (valid for
  some methods like eth_getLogs, retry is benign)

Layer 2 — Block height filter (prevent future requests):
- Add ErrInvalidBlockHeightResult sentinel error for extractors to
  signal "response had a result but it was unparseable"
- EVM extractor: returns sentinel when eth_blockNumber gets non-string
  result ([], {}, number, boolean) or unparseable hex
- Cosmos extractor: returns sentinel when CometBFT JSON-RPC response
  has empty container result ({} or [])
- UpdateFromExtractedData (EVM/Cosmos/Solana): stores block height 0
  when InvalidBlockHeight flag is set, causing the endpoint to be
  filtered (0 < perceived - sync_allowance)
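The Layer 1 classification can be sketched as a confidence-weighted signal: an empty object result is never valid JSON-RPC (high confidence, major signal), while an empty array is legitimate for some methods, so it only earns a minor signal. The signal names, thresholds, and string matching below are illustrative, not the actual heuristic engine:

```go
package main

import (
	"fmt"
	"strings"
)

type signal int

const (
	noSignal signal = iota
	minorErrorSignal
	majorErrorSignal
)

// classifyEmptyResult flags suspicious empty JSON-RPC results with a
// confidence score, mirroring the two cases described above.
func classifyEmptyResult(body string) (signal, float64) {
	compact := strings.ReplaceAll(body, " ", "")
	switch {
	case strings.Contains(compact, `"result":{}`):
		return majorErrorSignal, 0.95 // never valid for any JSON-RPC method
	case strings.Contains(compact, `"result":[]`):
		return minorErrorSignal, 0.75 // valid for e.g. eth_getLogs
	default:
		return noSignal, 0
	}
}

func main() {
	sig, conf := classifyEmptyResult(`{"jsonrpc":"2.0","id":1,"result":{}}`)
	fmt.Println(sig == majorErrorSignal, conf) // true 0.95
}
```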

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…infrastructure

When a relay fails, retries now exclude ALL endpoints behind the same
domain (host), not just the exact endpoint address. This prevents wasting
retry budget on different supplier endpoints that share the same broken
backend (e.g., multiple suppliers behind rel.spacebelt.xyz all returning
empty responses).

Also stops retrying entirely when all available endpoints are from tried
domains, instead of resetting and cycling through the same broken
infrastructure with backoff.
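Domain-level exclusion amounts to filtering retry candidates by host rather than by exact endpoint address. A minimal sketch using the standard library's `net/url` (names illustrative):

```go
package main

import (
	"fmt"
	"net/url"
)

// filterTriedDomains drops every endpoint whose URL resolves to an
// already-tried host, so retries don't burn budget on different supplier
// addresses that front the same broken backend.
func filterTriedDomains(endpoints []string, triedHosts map[string]bool) []string {
	var out []string
	for _, ep := range endpoints {
		u, err := url.Parse(ep)
		if err != nil || triedHosts[u.Hostname()] {
			continue
		}
		out = append(out, ep)
	}
	return out
}

func main() {
	eps := []string{
		"https://rel.spacebelt.xyz/v1/a",
		"https://rel.spacebelt.xyz/v1/b", // same broken backend, different supplier
		"https://other-provider.example/v1",
	}
	left := filterTriedDomains(eps, map[string]bool{"rel.spacebelt.xyz": true})
	fmt.Println(left) // [https://other-provider.example/v1]
}
```

When `left` comes back empty, the commit above stops retrying rather than cycling through the same broken infrastructure.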

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
oten91 and others added 24 commits March 6, 2026 21:51
Add POST /admin/circuit-breaker/clear/{serviceId} endpoint that clears
both in-memory and Redis circuit breaker state. Redis DEL alone is
insufficient because refreshFromRedis merges local entries back —
this endpoint clears local cache first, then Redis.

Also updates pnf_path_rules.yaml sync_allowance values to match
external configuration (eth: 2, poly/arb-one/base: 20).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…are registered

InitExternalConfig runs before SetQoSInstances, so the initial
SetSyncAllowance calls in refreshExternalConfig find an empty
qosInstances map and silently fail. This leaves sync_allowance=0
(disabled) for all services until the first periodic refresh (5min),
creating a window where block height filtering is completely bypassed.

Fix: SetQoSInstances now re-applies sync_allowance from any
already-loaded external configs when QoS instances are registered.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HTTP 4xx responses indicate the client sent a bad request, not that
the domain is broken. Don't punish domains for correctly rejecting
malformed requests (e.g., NEAR backends returning PARSE_ERROR for
non-JSON-RPC payloads). Only 5xx, transport failures, and heuristic-
detected supplier errors should trigger circuit breaks.
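In status-code terms, the rule reduces to a simple gate (transport failures and heuristic detections are handled by separate paths). A trivial sketch, with an illustrative function name:

```go
package main

import "fmt"

// shouldBreakForStatus: 4xx means the client sent a bad request, so only
// 5xx counts against the domain's circuit breaker.
func shouldBreakForStatus(status int) bool {
	return status >= 500
}

func main() {
	fmt.Println(shouldBreakForStatus(400)) // false: e.g. NEAR PARSE_ERROR
	fmt.Println(shouldBreakForStatus(502)) // true: backend failure
}
```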

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log HTTP method, URL path, content-type, user-agent, and first 512
bytes of request body at error level when RPC type detection fails.
Helps diagnose sources of bad traffic (e.g., malformed NEAR POSTs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Shannon protocol layer's heuristic check only used payload.JSONRPCMethod,
which is empty for REST requests. This meant path-aware validation (e.g., the
Tron /wallet/ whitelist for valid {} responses) never received the request path,
causing false positive rest_empty_object detections and circuit breaker lockouts.

Add the same payload.Path fallback that gateway-level checks already have.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… breaker reason

- Promote protocol-layer heuristic detection log from Debug to Error for
  production visibility (needed to identify which JSON-RPC method triggers
  false positive empty array detection on sei)
- Include jsonrpc_method and request_path in the heuristic error log
- Append (method=...) to the error string propagated to circuit breaker
- Increase Redis circuit breaker reason truncation from 200 to 500 chars
  so method names aren't cut off in diagnostic queries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ethod

Two false positive sources found via improved diagnostics:

1. Cosmos REST /cosmos/* paths return {} for non-existent entities (e.g.,
   /cosmos/slashing/v1beta1/signing_infos/{addr} for validators with no
   slashing record). Add /cosmos/ to restEmptyObjectValidPathPrefixes.

2. Sei exposes EVM methods with sei_ prefix (sei_getLogs, sei_getFilterLogs,
   etc.) that behave identically to eth_* equivalents. Add to
   emptyArrayValidMethods whitelist so "result":[] is not flagged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…kout via hedge path

When the protocol layer detects an archival error (e.g., "historical state
not available", "missing trie node") and returns it as an error, the hedge_failed
code path loses the structured heuristicResult. shouldCircuitBreak then sees
heuristicResult=nil and treats it as a transport failure → circuit breaks the
domain → locks out ALL traffic including non-archival requests.

Fix: add error string parameter to shouldCircuitBreak. When heuristicResult is
nil but an error is available, fall back to substring matching against archival
patterns. This prevents domains from being locked out for serving pruned-state
errors while remaining healthy for current-block requests.

Affected chains: arb-one (118K hits easy2stake), base (45K hits nodefleet),
bsc, and other EVM chains with archival traffic.
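The substring fallback can be sketched as below. The two patterns are the examples quoted in the commit message; the real pattern list and `shouldCircuitBreak` signature live in the gateway code:

```go
package main

import (
	"fmt"
	"strings"
)

// archivalPatterns holds pruned-state error fragments; matching any of them
// means the domain is healthy for current-block traffic and should not be
// circuit-broken.
var archivalPatterns = []string{
	"historical state not available",
	"missing trie node",
}

// isArchivalError is the fallback used when no structured heuristicResult
// survived the hedge path: classify by error string instead.
func isArchivalError(errStr string) bool {
	lower := strings.ToLower(errStr)
	for _, p := range archivalPatterns {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isArchivalError("rpc error: missing trie node abc123")) // true
	fmt.Println(isArchivalError("connection refused"))                  // false
}
```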

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… lockout

Lite fullnodes (e.g., Tron) return plain text like "this API is closed
because this node is a lite fullnode" for unsupported endpoints. Previously
this was classified as generic non_json_response, triggering domain-level
circuit breaking even though the domain works fine for other requests.

Now the structural analyzer detects capability limitation patterns in
non-JSON responses and returns a specific reason (non_json_capability_limitation)
that shouldCircuitBreak recognizes as non-circuit-breakable — same treatment
as archival errors. The request still retries on a different supplier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When shouldCircuitBreak returns false in the batch processing path,
the else branch accessed checkResult.HeuristicResult.Reason without
a nil check. Moved heuristicReason extraction before the if/else
so it's available in both branches with proper nil safety.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CometBFT endpoints always return JSON-RPC formatted responses (e.g.,
{"jsonrpc":"2.0","id":-1,"result":{}}) even for GET requests. This was
causing two types of false positives:

1. analyzeREST flagged CometBFT GET responses as rest_protocol_mismatch
   because it saw "jsonrpc" in a REST response — but CometBFT paths like
   /health, /status, /block always return JSON-RPC format by design.

2. analyzeJSONRPC flagged {"result":{}} as jsonrpc_empty_object_result
   when rpcType arrived as JSON_RPC instead of COMET_BFT (due to
   rpc_type_fallbacks routing CometBFT through JSON_RPC endpoints).

Fix: Add method-aware and path-aware CometBFT detection so the heuristic
recognizes valid CometBFT responses regardless of how they were routed.
These checks become redundant safety nets once suppliers can stake
comet_bft endpoints directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tions

Supplier addresses rotate across sessions but relay miner URLs stay the
same. Block heights stored under session N's supplier address didn't match
session N+1's endpoint keys, causing sync allowance checks to be skipped
and stale endpoints (50K+ blocks behind) to be served to users.

Add a URL-only fallback lookup in syncEndpointBlocksFromRedis: after the
direct key match pass, populate session endpoints that have no block height
by matching on URL alone. This is safe because GetEndpointBlockHeights is
already scoped by serviceID.

Applied to EVM, Cosmos, and NoOp QoS types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Multiple suppliers can stake against the same backend infrastructure
(e.g., pokt1abc-https://easy2stake.com and pokt1def-https://easy2stake.com).
The hedge racer only excluded the exact primary endpoint, so the hedge
request could race against the same domain — wasting both the hedge and
retry budget on identical infrastructure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… exists

Endpoints in the store with no block number observation were allowed
through the sync_allowance check, even when the perceived block was
known. This let severely stale endpoints (e.g., 58K blocks behind)
serve data to clients indefinitely — their block height was never
recorded despite hundreds of successful relays.

Now treats unknown block height as potentially stale when we have
chain state (perceivedBlock > 0). Fresh endpoints (!data.found) and
cold start (perceivedBlock == 0) are unaffected. Non-leader replicas
are safe because syncEndpointBlocksFromRedis creates store entries
with block heights from Redis before serving requests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New supplier addresses staked against the same stale backend get a
"free pass" through the fresh endpoint bypass (!data.found), serving
stale data before any observation is recorded. Multiple supplier
addresses behind the same URL (e.g., qspider relayminer domains)
each get one free stale response per session rotation.

During endpoint filtering, build a URL→blockHeight map from all
existing store entries. Fresh endpoints are checked against this map
— if another supplier for the same URL has a known stale block
height outside sync_allowance, the fresh endpoint is rejected.
Recovery is automatic: when the backend catches up, the URL block
height updates and fresh endpoints are allowed again.
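The free-pass check reduces to a URL lookup against heights already observed for other suppliers on the same backend. A minimal sketch with illustrative names:

```go
package main

import "fmt"

// rejectFreshStaleURL decides whether a fresh endpoint (no observation of
// its own yet) should be rejected because another supplier on the same URL
// is known to be outside the sync allowance.
func rejectFreshStaleURL(urlHeights map[string]uint64, url string, perceived, syncAllowance uint64) bool {
	h, known := urlHeights[url]
	if !known {
		return false // truly unknown backend: keep the fresh-endpoint bypass
	}
	return h+syncAllowance < perceived
}

func main() {
	heights := map[string]uint64{"https://stale.example": 960}
	// Another supplier on the same URL is 40 blocks behind, allowance 20.
	fmt.Println(rejectFreshStaleURL(heights, "https://stale.example", 1000, 20)) // true
	fmt.Println(rejectFreshStaleURL(heights, "https://new.example", 1000, 20))   // false
}
```

Recovery falls out of the map itself: once the backend catches up, its URL entry rises and the check stops rejecting.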

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When all endpoints fail validation, both fallback paths
(STANDARD_FALLBACK_RANDOM and SelectWithMetadata) selected randomly
from the entire unfiltered endpoint list — including endpoints behind
known-stale URLs. This was the last bypass path allowing severely
behind endpoints (33K+ blocks) to serve data to clients.

Add filterStaleURLEndpoints helper that checks the endpoint store for
known block heights by URL and excludes endpoints outside
sync_allowance. Applied to both fallback paths. Falls back to the
full unfiltered list only if filtering removes ALL endpoints (total
cold start).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ering

Add structured logging to filterStaleURLEndpoints so operators can see
when fallback paths are triggered and which stale URLs are being removed.
Logs at warn level when sync_allowance/perceived_block is zero (filter
bypassed), when no block height data exists in the endpoint store, and
when specific stale URLs are removed with blocks_behind detail. Logs at
error level when ALL endpoints are stale and the filter returns unfiltered.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The hedge race, retry, and batch code paths were using raw protocol
endpoints without QoS block height validation, allowing stale endpoints
(e.g. 65K blocks behind) to serve responses ~0.33% of the time. Changed
all three paths to call SelectMultipleWithArchival unconditionally
(not just for archival requests), which runs full QoS validation
including block height, chain ID, and archival checks.

Added regression test TestSelectMultipleWithArchival_FiltersStaleEndpoints
with 5 cases covering stale filtering, boundary conditions, and mixed pools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dpoints during session rotation

During session rotation, new supplier addresses appear for the same backend
infrastructure. These "fresh" endpoints aren't in the local endpointStore yet,
so URL-based block height checks had no data to validate against — allowing
stale endpoints through the primary QoS path.

This fix caches URL→blockHeight mappings from Redis (updated every 5s by
syncEndpointBlocksFromRedis) and merges them into both filterValidEndpointsWithDetails
and filterStaleURLEndpoints as fallback when local store data is missing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ts are stale

Previously, filterStaleURLEndpoints returned the full unfiltered list when ALL
endpoints had stale URLs, as a safety valve to "avoid total failure". This allowed
sessions where every endpoint is backed by stale infrastructure (e.g., qspider
39K blocks behind) to serve stale responses to clients.

Now returns an empty list, and both callers (SelectMultipleWithArchival and
SelectWithMetadata) return an error. Returning an error to the client is
preferable to silently serving data that is tens of thousands of blocks behind.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t stale-only selection

Probation routing (3% of traffic) previously returned ONLY probation endpoints,
which could be entirely stale (e.g., qspider 39K blocks behind on base). QoS
validation rejected them all, then the fallback served stale data anyway.

Now includes highest-tier endpoints alongside probation endpoints. QoS validates
both pools — probation endpoints that pass validation still get recovery traffic,
but if all probation endpoints fail (stale block height), QoS selects from the
healthy tier endpoints instead. Recovery still works because health checks
(not relay traffic) update block heights.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… inclusion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oten91 oten91 merged commit b43d349 into feat/hot-path-optimization Mar 11, 2026
1 check passed
@oten91 oten91 deleted the feat/hot-path-optimization-with-external-rpc-validation branch March 11, 2026 23:08
oten91 added a commit that referenced this pull request Mar 11, 2026
…onsensus (#507)

## Summary

This PR implements a comprehensive set of optimizations and reliability
improvements for PATH:

1. **Hot Path Optimization** - Raw byte passthrough eliminates JSON
parsing from the critical request path
2. **Shared State via Redis** - Archival status, perceived block height,
and reputation scores are now shared across all replicas
3. **Block Height Consensus** - Median-anchored algorithm protects
against malicious endpoints reporting extreme block heights
4. **Health Check-Based Archival Detection** - Archival capability is
now determined by actual historical queries during health checks
5. **Enhanced Observability** - `/ready` endpoint now provides detailed
endpoint info including block consensus stats

## Changes

### Hot Path Optimization
- Store and return raw response bytes without JSON parsing on the
request path
- Defer heavy JSON parsing to async observation queue
- Convert all extractors (EVM, Cosmos, Solana) to use gjson for reduced
allocations
- Remove unused response parsing code (~1000 lines removed)

### Shared State via Redis
- **Archival status**: Stored with TTL, read-through on cache miss for
non-leader replicas
- **Perceived block number**: Atomic max semantics via Lua script for
cross-replica consistency
- **Reputation scores**: Extended with archival fields (IsArchival,
ArchivalExpiresAt)
- **Background sync**: Every 5 seconds with immediate sync on startup

### Block Height Consensus Mechanism
- Median-anchored algorithm protects against malicious/misconfigured
endpoints
- Outlier rejection: blocks > `median + (syncAllowance × 3)` are
filtered
- Self-adjusting using existing `sync_allowance` config (no new
configuration needed)
- 2-minute sliding window with up to 1000 observations
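The outlier rule above can be sketched directly: sort the window, take the median, and drop anything above `median + syncAllowance*3`. A simplified illustration (the real implementation works over the sliding observation window):

```go
package main

import (
	"fmt"
	"sort"
)

// filterOutliers rejects block height observations above
// median + syncAllowance*3, so a single malicious or misconfigured
// endpoint cannot drag the perceived height upward.
func filterOutliers(heights []uint64, syncAllowance uint64) []uint64 {
	if len(heights) == 0 {
		return nil
	}
	sorted := append([]uint64(nil), heights...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	median := sorted[len(sorted)/2]
	limit := median + syncAllowance*3
	var kept []uint64
	for _, h := range heights {
		if h <= limit {
			kept = append(kept, h)
		}
	}
	return kept
}

func main() {
	// One endpoint reports an absurd height; median anchoring discards it.
	fmt.Println(filterOutliers([]uint64{1000, 1001, 1002, 9999999}, 20))
	// [1000 1001 1002]
}
```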

### Health Check Improvements
- Move archival detection from synthetic QoS checks to health checks
- EVM extractor only evaluates archival-related methods (eth_getBalance,
eth_getStorageAt)
- Health check executor marks endpoints as archival after all
validations pass
- Add archival health checks for all EVM chains aligned with E2E config
- Scale health check worker pool dynamically based on service/endpoint
count

### Enhanced `/ready` Endpoint
- Add `?detailed=true` query parameter for comprehensive endpoint info
- Includes: reputation scores, archival status, tier classification,
cooldown status
- New fields: `perceived_block_height`, `median_block_height`,
`block_observations`

### Bug Fixes
- Strip trailing newlines from JSON responses (hot path optimization
artifact)
- Return error response instead of empty body for failed requests
- Preserve backend error responses through error handling chain
- Handle endpoint unavailability race condition with proper re-selection

Includes #508 fixes

## Test Plan

- [x] Unit tests pass (`make test_unit`)
- [x] E2E tests pass for eth, pocket, xrplevm (`make e2e_test eth,pocket,xrplevm`)
- [x] Block consensus tests verify outlier rejection
- [x] Shared state test script verifies cross-replica consistency
(`scripts/test_shared_state.sh`)
- [x] `go fmt`, `go vet`, `golangci-lint` pass

---------

Co-authored-by: Otto V <ottoevargas@gmail.com>
oten91 added a commit that referenced this pull request Mar 11, 2026
This PR combines and extends the work from #505 with additional bug
fixes and improvements for hedge racing and retry reliability.

Closes #505

## Features (from #505)

### Protocol Error Propagation
- Add `SetProtocolError` to `RequestQoSContext` interface for specific
error messages
- Replace generic "no endpoint responses received" with specific errors
like "no valid endpoints available for service"

### Hedge Racing (New Feature)
- Spawn parallel "hedge" request after configurable delay if primary
hasn't responded
- First successful response wins; the other is cancelled
- Configurable via `retry_config.hedge_delay` and
`retry_config.connect_timeout`
- Track outcomes via `X-Hedge-Result` header

### Retry Enhancements
- **Time Budget**: `max_retry_latency` skips retries when failed request
already took too long
- **Endpoint Rotation**: Each retry attempt uses a different endpoint
- **Heuristic Detection**: Retry on JSON-RPC errors hidden in HTTP 200
responses
- **Observability**: Track via `X-Retry-Count` and `X-Suppliers-Tried`
headers

### Heuristic Response Analysis
- Detect errors in response payloads despite HTTP 200 status
- Identify: JSON-RPC errors, HTML error pages, empty responses,
malformed JSON
- Record correcting reputation signals for detected failures

### Response Metadata Headers
| Header | Description |
|---|---|
| `X-Retry-Count` | Number of retry attempts (0 = first attempt succeeded) |
| `X-Suppliers-Tried` | Comma-separated list of attempted supplier addresses |
| `X-Hedge-Result` | Hedge racing outcome: `primary_only`, `primary_won`, `hedge_won`, `both_failed` |
| `X-App-Address` | Application address used for the relay |
| `X-Supplier-Address` | Supplier address of the responding endpoint |
| `X-Session-ID` | Session ID for the relay |

### Health Check & Sync Check
- **Sync check validation**: Health checks now validate endpoint block
height against QoS perceived block number using `sync_allowance` config
- Consolidated block height validation directly into health check
executor (removed standalone `BlockHeightValidator`,
`BlockHeightReferenceCache`)
- Simplified health check config structure
- Fix defer pattern in solana.go for mutex unlock
- Add nil map initialization safety check in solana.go

## Bug Fixes (this PR)

- **X-Suppliers-Tried header**: Pre-register both primary and hedge
suppliers when racing starts
- **selectTopRankedEndpoint**: Return original endpoint address instead
of reputation key (fixes 'endpoint not available' errors)
- **Retry blockchain errors**: Detect and retry node-specific errors
(missing trie node, unhealthy node) even in valid JSON-RPC responses
- **Health check refactor**: Simplify block height validation and
consolidate into health checks

## Contributions from @oten91 

- Prioritized endpoint inclusion during reputation filtering (mitigates
race conditions)
- Request-awareness for data extraction methods
- Enhanced JSON-RPC response analysis with stricter error classification
- Heuristic-based error classification with unit tests
- Improved supplier tracking and debugging
- JSON-RPC error handling to prevent retries for valid client errors

## Configuration

```yaml
services:
  - service_id: eth
    retry_config:
      enabled: true
      max_retries: 2
      hedge_delay: 500ms
      connect_timeout: 200ms
      max_retry_latency: 5s
      retry_on_5xx: true
      retry_on_timeout: true
      retry_on_connection: true
```

Includes #508 and #507

### Testing
- [x] Unit tests
- [x] E2E tests (eth service 74.33% success rate)
- [x] Local hedge testing verified with `scripts/test_hedge.sh`

---------

Co-authored-by: Otto V <ottoevargas@gmail.com>