Feat/hot path optimization with external rpc validation #508

Merged
oten91 merged 71 commits into feat/hot-path-optimization from
feat/hot-path-optimization-with-external-rpc-validation
Mar 11, 2026

Conversation


@oten91 oten91 commented Feb 17, 2026

- Update to Go 1.25
- Add optional external RPC validation

When all suppliers in a session are behind the real chain tip, internal
consensus considers everything "in sync" because it only compares
endpoints against each other. Add optional per-service external RPC
endpoints that periodically fetch the real block height as a floor for
perceivedBlockNumber, correctly filtering stale suppliers.

Supports multiple sources per service for redundancy (max wins). Handles
EVM hex, Cosmos decimal, and Solana numeric response formats. Failure-safe:
unreachable sources log a warning and are skipped. External heights are
written to Redis for cross-replica benefit.
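The parsing across the three response formats, plus the "max wins" rule across redundant sources, can be sketched roughly as below. This is a minimal illustration assuming standard JSON-RPC-shaped responses; the function names (`extractHeight`, `maxHeight`) are illustrative, not the actual PATH code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// extractHeight parses a block height from the three formats the PR
// describes: EVM hex strings ("0x153d216"), Cosmos decimal strings
// ("12345678"), and Solana bare JSON numbers.
func extractHeight(body []byte) (uint64, error) {
	var resp struct {
		Result json.RawMessage `json:"result"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, err
	}
	// Solana: {"result": 250000000} — a bare number.
	var num uint64
	if err := json.Unmarshal(resp.Result, &num); err == nil {
		return num, nil
	}
	var s string
	if err := json.Unmarshal(resp.Result, &s); err != nil {
		return 0, fmt.Errorf("unsupported result format: %s", resp.Result)
	}
	// EVM: hex string with 0x prefix.
	if strings.HasPrefix(s, "0x") {
		return strconv.ParseUint(strings.TrimPrefix(s, "0x"), 16, 64)
	}
	// Cosmos: decimal string.
	return strconv.ParseUint(s, 10, 64)
}

// maxHeight implements the "max wins" rule across redundant sources.
func maxHeight(heights []uint64) uint64 {
	var max uint64
	for _, h := range heights {
		if h > max {
			max = h
		}
	}
	return max
}

func main() {
	h, _ := extractHeight([]byte(`{"result":"0x10"}`))
	fmt.Println(h)                                 // 16
	fmt.Println(maxHeight([]uint64{100, 250, 90})) // 250
}
```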

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oten91 oten91 requested a review from jorgecuesta February 17, 2026 22:49
@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch 2 times, most recently from 113320b to ff9606e on February 17, 2026 23:06
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch from ff9606e to e932bd2 on February 17, 2026 23:19

@jorgecuesta jorgecuesta left a comment


LGTM, just a minor question about the assumption that all endpoints are JSON-RPC compatible, plus a possible optimization.

@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch from d1bf40c to 1cb8631 on February 18, 2026 13:50
… mode to external block fetcher

- Replace docker/cli exclude with direct require of v29.2.1 which uses
  moby/moby packages with proper Go modules, fixing the range-over-function
  compilation error in E2E tests.
- Pin docker/docker to v28.5.1 and dockertest to v3.11.0 to fix broken pipe
  error when streaming Docker build context via the API in E2E tests.
- Add type field ("jsonrpc" default or "rest") to ExternalBlockSource config
  so Cosmos REST endpoints (GET /status) work alongside JSON-RPC POST.
- Expand .dockerignore to exclude non-essential directories (docusaurus, docs,
  research, etc.) reducing build context size.
- Add integration tests against real public RPC endpoints (ETH, Solana,
  Osmosis JSON-RPC, Osmosis REST) to validate all chain type parsing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oten91 oten91 force-pushed the feat/hot-path-optimization-with-external-rpc-validation branch from 1cb8631 to 25be89b on February 18, 2026 14:10
oten91 and others added 20 commits February 18, 2026 15:20
Prevents cold-start filtering where the external block source reports the
real chain tip before suppliers have reported their heights, causing all
endpoints to appear "behind" and get filtered out.

Default grace period: 60s. Configurable per-service via `grace_period`
in `external_block_sources` config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds ConsumeExternalBlockHeight to Cosmos and Solana QoS instances so
external_block_sources works for all chain types, not just EVM.

Also fixes EVM: skip Redis writes during grace period to prevent other
replicas from picking up the external height before suppliers report.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Docker image build + container startup can take 5+ minutes in CI,
which was consuming the entire 5-minute context timeout before tests
even started. This caused all Vegeta test methods to immediately
report "canceled" with 0 requests.

Move context creation to after getGatewayURLForTestMode() returns,
so the 5-minute timeout only governs actual test execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CI workflow already builds the Docker image in Phase 1 and passes
it as an artifact to Phase 2. But force_rebuild_image was still true
in the default config, causing a redundant ~5 minute rebuild from
scratch (with NoCache) before every E2E test run.

Set force_rebuild_image=false in the CI config script so the test
reuses the pre-built image from Phase 1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All 5 matrix jobs (3 HTTP + 2 WebSocket) were downloading the Docker
image artifact simultaneously, triggering GitHub's secondary rate
limit (403 Forbidden). Stagger downloads at 0/5/10/15/20s intervals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…o QoS instances

QoS instances are created at startup before external health check rules are
loaded, so syncAllowance starts at 0 (disabled). This means stale endpoints
were never filtered even when external rules specified a sync_allowance.

Fix by using atomic.Uint64 for syncAllowance in both EVM and Cosmos QoS
configs, and propagating the value from the health check executor when
external rules are loaded/refreshed.
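The atomic propagation pattern can be sketched as follows. This is a simplified illustration of the approach (an `atomic.Uint64` read on every check, stored later by the health check executor), not the actual QoS config struct:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// qosConfig starts with syncAllowance=0 (filtering disabled); the health
// check executor stores the real value once external rules load, with no
// lock needed on the hot path.
type qosConfig struct {
	syncAllowance atomic.Uint64
}

func (c *qosConfig) SetSyncAllowance(v uint64) { c.syncAllowance.Store(v) }

// isStale reads the current allowance on every call, so a late update from
// external rules takes effect immediately.
func (c *qosConfig) isStale(endpointBlock, perceivedBlock uint64) bool {
	allowance := c.syncAllowance.Load()
	if allowance == 0 {
		return false // rules not loaded yet: don't filter
	}
	return endpointBlock+allowance < perceivedBlock
}

func main() {
	var cfg qosConfig
	fmt.Println(cfg.isStale(900, 1000)) // false: allowance not set yet
	cfg.SetSyncAllowance(20)
	fmt.Println(cfg.isStale(900, 1000)) // true: 900+20 < 1000
}
```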

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eneric QoS

Services using generic/NoOp QoS (near, sui, tron) previously had no block
height tracking, meaning stale endpoints were never filtered even when
external block sources were configured. This converts NoOpQoS from a
stateless value type to a stateful pointer type with per-endpoint block
height tracking, sync-allowance-based filtering, and external block height
consumption — matching the pattern used by Cosmos and Solana QoS.

Default behavior is unchanged: when syncAllowance=0 (default), endpoint
selection remains purely random.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In multi-replica deployments, each replica updates perceivedBlockHeight
locally only. This means Replica A may see block 1000 while Replica B
stays at 950, causing B to select stale endpoints. EVM already had full
Redis sync; this extends the same pattern to the remaining QoS types.

Adds SetReputationService, StartBackgroundSync (periodic Redis reads),
and async Redis writes in UpdateFromExtractedData and
ConsumeExternalBlockHeight for Solana, Cosmos, and NoOp QoS. Uses simple
max semantics (if Redis > local, update) rather than EVM's consensus
mechanism. No changes to cmd/main.go needed — existing duck-typed
interface checks auto-detect the new methods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…onfigured

NewSimpleQoSInstance (used for all EVM services) sets getEVMChainID()
to "". However, isChainIDValid() still ran unconditionally, rejecting
every observed endpoint — either because it had no chain ID observation
yet (errNoChainIDObs) or because the observed chain ID (e.g. "0x38")
didn't match "". This caused a flood of warn logs and forced fallback
to random endpoint selection.

Skip chain ID validation and synthetic eth_chainId checks entirely when
no expected chain ID is configured, consistent with the constructor's
intent to delegate validation to active health checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…a stale supplier filtering

Health checks only run on the leader pod, so non-leader replicas had empty
per-endpoint block heights and could not filter stale suppliers. This adds
Redis HSET/HGETALL sync for per-endpoint block heights across all 4 QoS types
(EVM, Solana, Cosmos, NoOp), enabling all replicas to filter stale endpoints.

Write side: async HSET from UpdateFromExtractedData after each health check.
Read side: HGETALL on the existing 5s background sync ticker, updating only
existing local endpoint entries where Redis has a higher block height.
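The read-side merge rule, as described in this commit (update only entries that already exist locally, and only upward), can be sketched with the HGETALL result modeled as a plain map. Names are illustrative:

```go
package main

import "fmt"

// mergeEndpointHeights applies the HGETALL result from Redis
// (endpoint key -> block height) to the local store: only entries that
// already exist locally are touched, and only when Redis is higher.
func mergeEndpointHeights(local, fromRedis map[string]uint64) {
	for key, redisHeight := range fromRedis {
		localHeight, exists := local[key]
		if !exists {
			continue // don't create unknown endpoints (behavior at this commit)
		}
		if redisHeight > localHeight {
			local[key] = redisHeight
		}
	}
}

func main() {
	local := map[string]uint64{"pokt1abc-https://node-a": 950}
	mergeEndpointHeights(local, map[string]uint64{
		"pokt1abc-https://node-a": 1000, // higher: taken
		"pokt1xyz-https://node-b": 800,  // unknown locally: skipped
	})
	fmt.Println(local["pokt1abc-https://node-a"]) // 1000
}
```

A later commit in this PR relaxes the "don't create unknown endpoints" rule for non-leader replicas; the sketch above reflects the behavior introduced here.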

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests cover both write side (UpdateFromExtractedData writes per-endpoint
block heights to Redis) and read side (StartBackgroundSync reads from Redis
and updates local endpoint store) for all 4 QoS types: EVM, Solana, Cosmos,
and NoOp. Verifies that:
- Higher Redis values update local endpoints
- Lower Redis values do NOT downgrade local endpoints
- Non-existent local endpoints are NOT created from Redis data
- Periodic sync picks up new Redis data after startup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Non-leader replicas had empty endpoint stores at startup because health
checks only run on the leader pod. The syncEndpointBlocksFromRedis
closure would fetch per-endpoint block heights from Redis but silently
skip all entries since none existed locally. This meant stale suppliers
were never filtered on non-leader pods.

Now, EVM/Cosmos/NoOp QoS types create new endpoint entries from Redis
data when the endpoint doesn't exist in the local store, enabling
block height filtering from the first request. Solana is excluded
because its validateBasic() requires health+epoch data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tractBlockHeight

Extends extractBlockHeight to handle non-EVM response formats for external
block sources on generic chains (near, sui, tron):
- Decimal strings without 0x prefix parsed as base-10 (Sui)
- Numeric latest_block_height in sync_info objects (NEAR)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e period

Add mdbx_panic to the heuristic error indicator patterns so that Erigon
nodes with corrupted/full MDBX databases trigger retries on a different
supplier instead of returning the error directly to users. Also reduce
the external block grace period from 60s to 30s to shorten the cold
start window where stale suppliers can leak through.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… comet_bft unsupported

CometBFT methods (status, block, validators, etc.) sent via JSON-RPC POST
were rejected with "unsupported RPC type" when suppliers hadn't staked
comet_bft endpoints. Since CometBFT nodes handle both REST and JSON-RPC
interfaces on the same endpoint, these requests can safely route to json_rpc
endpoints instead. This fallback is a no-op when comet_bft is properly staked.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ng UNKNOWN_RPC

NoOp QoS was discarding the RPC type detected by the gateway and always
setting UNKNOWN_RPC on service payloads. This caused all generic services
(tron, sui, near) to report unknown_rpc in metrics/observations. Now the
detected type (json_rpc, rest, etc.) flows through to the payload.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Endpoint block height entries accumulate indefinitely as sessions rotate,
causing non-leader replicas to recreate stale entries from Redis on every
sync cycle. This adds a lastSeen timestamp to each endpoint entry, updated
during filtering when endpoints appear in active sessions. A periodic sweep
(every 5s tick) removes entries older than 2h TTL from both local stores
and Redis via HDEL.

Changes across all QoS types (EVM, Cosmos, Solana, NoOp):
- Add lastSeen field to endpoint structs
- Add touchEndpoints() to update lastSeen during filtering (separate WLock)
- Add sweepStaleEndpoints() to remove entries past TTL
- Call sweep in StartBackgroundSync tick loop
- Add RemoveEndpointBlockHeights to ReputationService/Storage interfaces
- Implement HDEL-based cleanup in Redis storage, delete in memory storage
- Update all 6 test mocks with the new interface method

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ove them

Entries created by syncEndpointBlocksFromRedis had lastSeen=zero, which
the sweep intentionally skips. This meant Redis-synced entries were never
cleaned up. Set lastSeen=time.Now() on new entries so they get swept after
the 2h TTL if they never appear in an active session.

Also fix NoOp UpdateFromExtractedData to preserve existing lastSeen instead
of overwriting the entire endpointState struct.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Broken/lazy suppliers return empty responses like "result":[],
"result":{}, or bare {}. This adds two layers of defense:

Layer 1 — Heuristic detection (retry on different supplier):
- REST bare {} → rest_empty_object (confidence 0.80, retry)
- JSON-RPC "result":{} → confidence 0.95, MajorErrorSignal (never valid
  for any JSON-RPC method)
- JSON-RPC "result":[] → confidence 0.75, MinorErrorSignal (valid for
  some methods like eth_getLogs, retry is benign)

Layer 2 — Block height filter (prevent future requests):
- Add ErrInvalidBlockHeightResult sentinel error for extractors to
  signal "response had a result but it was unparseable"
- EVM extractor: returns sentinel when eth_blockNumber gets non-string
  result ([], {}, number, boolean) or unparseable hex
- Cosmos extractor: returns sentinel when CometBFT JSON-RPC response
  has empty container result ({} or [])
- UpdateFromExtractedData (EVM/Cosmos/Solana): stores block height 0
  when InvalidBlockHeight flag is set, causing the endpoint to be
  filtered (0 < perceived - sync_allowance)
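The Layer 1 classification can be sketched as a confidence-weighted signal: an empty object result is never valid JSON-RPC (high confidence, major signal), while an empty array is legitimate for some methods, so it only earns a minor signal. The signal names, thresholds, and string matching below are illustrative, not the actual heuristic engine:

```go
package main

import (
	"fmt"
	"strings"
)

type signal int

const (
	noSignal signal = iota
	minorErrorSignal
	majorErrorSignal
)

// classifyEmptyResult flags suspicious empty JSON-RPC results with a
// confidence score, mirroring the two cases described above.
func classifyEmptyResult(body string) (signal, float64) {
	compact := strings.ReplaceAll(body, " ", "")
	switch {
	case strings.Contains(compact, `"result":{}`):
		return majorErrorSignal, 0.95 // never valid for any JSON-RPC method
	case strings.Contains(compact, `"result":[]`):
		return minorErrorSignal, 0.75 // valid for e.g. eth_getLogs
	default:
		return noSignal, 0
	}
}

func main() {
	sig, conf := classifyEmptyResult(`{"jsonrpc":"2.0","id":1,"result":{}}`)
	fmt.Println(sig == majorErrorSignal, conf) // true 0.95
}
```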

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…infrastructure

When a relay fails, retries now exclude ALL endpoints behind the same
domain (host), not just the exact endpoint address. This prevents wasting
retry budget on different supplier endpoints that share the same broken
backend (e.g., multiple suppliers behind rel.spacebelt.xyz all returning
empty responses).

Also stops retrying entirely when all available endpoints are from tried
domains, instead of resetting and cycling through the same broken
infrastructure with backoff.
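Domain-level exclusion amounts to filtering retry candidates by host rather than by exact endpoint address. A minimal sketch using the standard library's `net/url` (names illustrative):

```go
package main

import (
	"fmt"
	"net/url"
)

// filterTriedDomains drops every endpoint whose URL resolves to an
// already-tried host, so retries don't burn budget on different supplier
// addresses that front the same broken backend.
func filterTriedDomains(endpoints []string, triedHosts map[string]bool) []string {
	var out []string
	for _, ep := range endpoints {
		u, err := url.Parse(ep)
		if err != nil || triedHosts[u.Hostname()] {
			continue
		}
		out = append(out, ep)
	}
	return out
}

func main() {
	eps := []string{
		"https://rel.spacebelt.xyz/v1/a",
		"https://rel.spacebelt.xyz/v1/b", // same broken backend, different supplier
		"https://other-provider.example/v1",
	}
	left := filterTriedDomains(eps, map[string]bool{"rel.spacebelt.xyz": true})
	fmt.Println(left) // [https://other-provider.example/v1]
}
```

When `left` comes back empty, the commit above stops retrying rather than cycling through the same broken infrastructure.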

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
oten91 and others added 24 commits March 6, 2026 21:51
Add POST /admin/circuit-breaker/clear/{serviceId} endpoint that clears
both in-memory and Redis circuit breaker state. Redis DEL alone is
insufficient because refreshFromRedis merges local entries back —
this endpoint clears local cache first, then Redis.

Also updates pnf_path_rules.yaml sync_allowance values to match
external configuration (eth: 2, poly/arb-one/base: 20).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…are registered

InitExternalConfig runs before SetQoSInstances, so the initial
SetSyncAllowance calls in refreshExternalConfig find an empty
qosInstances map and silently fail. This leaves sync_allowance=0
(disabled) for all services until the first periodic refresh (5min),
creating a window where block height filtering is completely bypassed.

Fix: SetQoSInstances now re-applies sync_allowance from any
already-loaded external configs when QoS instances are registered.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HTTP 4xx responses indicate the client sent a bad request, not that
the domain is broken. Don't punish domains for correctly rejecting
malformed requests (e.g., NEAR backends returning PARSE_ERROR for
non-JSON-RPC payloads). Only 5xx, transport failures, and heuristic-
detected supplier errors should trigger circuit breaks.
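In status-code terms, the rule reduces to a simple gate (transport failures and heuristic detections are handled by separate paths). A trivial sketch, with an illustrative function name:

```go
package main

import "fmt"

// shouldBreakForStatus: 4xx means the client sent a bad request, so only
// 5xx counts against the domain's circuit breaker.
func shouldBreakForStatus(status int) bool {
	return status >= 500
}

func main() {
	fmt.Println(shouldBreakForStatus(400)) // false: e.g. NEAR PARSE_ERROR
	fmt.Println(shouldBreakForStatus(502)) // true: backend failure
}
```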

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Log HTTP method, URL path, content-type, user-agent, and first 512
bytes of request body at error level when RPC type detection fails.
Helps diagnose sources of bad traffic (e.g., malformed NEAR POSTs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Shannon protocol layer's heuristic check only used payload.JSONRPCMethod,
which is empty for REST requests. This meant path-aware validation (e.g., the
Tron /wallet/ whitelist for valid {} responses) never received the request path,
causing false positive rest_empty_object detections and circuit breaker lockouts.

Add the same payload.Path fallback that gateway-level checks already have.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… breaker reason

- Promote protocol-layer heuristic detection log from Debug to Error for
  production visibility (needed to identify which JSON-RPC method triggers
  false positive empty array detection on sei)
- Include jsonrpc_method and request_path in the heuristic error log
- Append (method=...) to the error string propagated to circuit breaker
- Increase Redis circuit breaker reason truncation from 200 to 500 chars
  so method names aren't cut off in diagnostic queries

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ethod

Two false positive sources found via improved diagnostics:

1. Cosmos REST /cosmos/* paths return {} for non-existent entities (e.g.,
   /cosmos/slashing/v1beta1/signing_infos/{addr} for validators with no
   slashing record). Add /cosmos/ to restEmptyObjectValidPathPrefixes.

2. Sei exposes EVM methods with sei_ prefix (sei_getLogs, sei_getFilterLogs,
   etc.) that behave identically to eth_* equivalents. Add to
   emptyArrayValidMethods whitelist so "result":[] is not flagged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…kout via hedge path

When the protocol layer detects an archival error (e.g., "historical state
not available", "missing trie node") and returns it as an error, the hedge_failed
code path loses the structured heuristicResult. shouldCircuitBreak then sees
heuristicResult=nil and treats it as a transport failure → circuit breaks the
domain → locks out ALL traffic including non-archival requests.

Fix: add error string parameter to shouldCircuitBreak. When heuristicResult is
nil but an error is available, fall back to substring matching against archival
patterns. This prevents domains from being locked out for serving pruned-state
errors while remaining healthy for current-block requests.

Affected chains: arb-one (118K hits easy2stake), base (45K hits nodefleet),
bsc, and other EVM chains with archival traffic.
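The substring fallback can be sketched as below. The two patterns are the examples quoted in the commit message; the real pattern list and `shouldCircuitBreak` signature live in the gateway code:

```go
package main

import (
	"fmt"
	"strings"
)

// archivalPatterns holds pruned-state error fragments; matching any of them
// means the domain is healthy for current-block traffic and should not be
// circuit-broken.
var archivalPatterns = []string{
	"historical state not available",
	"missing trie node",
}

// isArchivalError is the fallback used when no structured heuristicResult
// survived the hedge path: classify by error string instead.
func isArchivalError(errStr string) bool {
	lower := strings.ToLower(errStr)
	for _, p := range archivalPatterns {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isArchivalError("rpc error: missing trie node abc123")) // true
	fmt.Println(isArchivalError("connection refused"))                  // false
}
```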

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… lockout

Lite fullnodes (e.g., Tron) return plain text like "this API is closed
because this node is a lite fullnode" for unsupported endpoints. Previously
this was classified as generic non_json_response, triggering domain-level
circuit breaking even though the domain works fine for other requests.

Now the structural analyzer detects capability limitation patterns in
non-JSON responses and returns a specific reason (non_json_capability_limitation)
that shouldCircuitBreak recognizes as non-circuit-breakable — same treatment
as archival errors. The request still retries on a different supplier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When shouldCircuitBreak returns false in the batch processing path,
the else branch accessed checkResult.HeuristicResult.Reason without
a nil check. Moved heuristicReason extraction before the if/else
so it's available in both branches with proper nil safety.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CometBFT endpoints always return JSON-RPC formatted responses (e.g.,
{"jsonrpc":"2.0","id":-1,"result":{}}) even for GET requests. This was
causing two types of false positives:

1. analyzeREST flagged CometBFT GET responses as rest_protocol_mismatch
   because it saw "jsonrpc" in a REST response — but CometBFT paths like
   /health, /status, /block always return JSON-RPC format by design.

2. analyzeJSONRPC flagged {"result":{}} as jsonrpc_empty_object_result
   when rpcType arrived as JSON_RPC instead of COMET_BFT (due to
   rpc_type_fallbacks routing CometBFT through JSON_RPC endpoints).

Fix: Add method-aware and path-aware CometBFT detection so the heuristic
recognizes valid CometBFT responses regardless of how they were routed.
These checks become redundant safety nets once suppliers can stake
comet_bft endpoints directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tions

Supplier addresses rotate across sessions but relay miner URLs stay the
same. Block heights stored under session N's supplier address didn't match
session N+1's endpoint keys, causing sync allowance checks to be skipped
and stale endpoints (50K+ blocks behind) to be served to users.

Add a URL-only fallback lookup in syncEndpointBlocksFromRedis: after the
direct key match pass, populate session endpoints that have no block height
by matching on URL alone. This is safe because GetEndpointBlockHeights is
already scoped by serviceID.

Applied to EVM, Cosmos, and NoOp QoS types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Multiple suppliers can stake against the same backend infrastructure
(e.g., pokt1abc-https://easy2stake.com and pokt1def-https://easy2stake.com).
The hedge racer only excluded the exact primary endpoint, so the hedge
request could race against the same domain — wasting both the hedge and
retry budget on identical infrastructure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… exists

Endpoints in the store with no block number observation were allowed
through the sync_allowance check, even when the perceived block was
known. This let severely stale endpoints (e.g., 58K blocks behind)
serve data to clients indefinitely — their block height was never
recorded despite hundreds of successful relays.

Now treats unknown block height as potentially stale when we have
chain state (perceivedBlock > 0). Fresh endpoints (!data.found) and
cold start (perceivedBlock == 0) are unaffected. Non-leader replicas
are safe because syncEndpointBlocksFromRedis creates store entries
with block heights from Redis before serving requests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New supplier addresses staked against the same stale backend get a
"free pass" through the fresh endpoint bypass (!data.found), serving
stale data before any observation is recorded. Multiple supplier
addresses behind the same URL (e.g., qspider relayminer domains)
each get one free stale response per session rotation.

During endpoint filtering, build a URL→blockHeight map from all
existing store entries. Fresh endpoints are checked against this map
— if another supplier for the same URL has a known stale block
height outside sync_allowance, the fresh endpoint is rejected.
Recovery is automatic: when the backend catches up, the URL block
height updates and fresh endpoints are allowed again.
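The free-pass check reduces to a URL lookup against heights already observed for other suppliers on the same backend. A minimal sketch with illustrative names:

```go
package main

import "fmt"

// rejectFreshStaleURL decides whether a fresh endpoint (no observation of
// its own yet) should be rejected because another supplier on the same URL
// is known to be outside the sync allowance.
func rejectFreshStaleURL(urlHeights map[string]uint64, url string, perceived, syncAllowance uint64) bool {
	h, known := urlHeights[url]
	if !known {
		return false // truly unknown backend: keep the fresh-endpoint bypass
	}
	return h+syncAllowance < perceived
}

func main() {
	heights := map[string]uint64{"https://stale.example": 960}
	// Another supplier on the same URL is 40 blocks behind, allowance 20.
	fmt.Println(rejectFreshStaleURL(heights, "https://stale.example", 1000, 20)) // true
	fmt.Println(rejectFreshStaleURL(heights, "https://new.example", 1000, 20))   // false
}
```

Recovery falls out of the map itself: once the backend catches up, its URL entry rises and the check stops rejecting.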

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When all endpoints fail validation, both fallback paths
(STANDARD_FALLBACK_RANDOM and SelectWithMetadata) selected randomly
from the entire unfiltered endpoint list — including endpoints behind
known-stale URLs. This was the last bypass path allowing severely
behind endpoints (33K+ blocks) to serve data to clients.

Add filterStaleURLEndpoints helper that checks the endpoint store for
known block heights by URL and excludes endpoints outside
sync_allowance. Applied to both fallback paths. Falls back to the
full unfiltered list only if filtering removes ALL endpoints (total
cold start).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ering

Add structured logging to filterStaleURLEndpoints so operators can see
when fallback paths are triggered and which stale URLs are being removed.
Logs at warn level when sync_allowance/perceived_block is zero (filter
bypassed), when no block height data exists in the endpoint store, and
when specific stale URLs are removed with blocks_behind detail. Logs at
error level when ALL endpoints are stale and the filter returns unfiltered.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The hedge race, retry, and batch code paths were using raw protocol
endpoints without QoS block height validation, allowing stale endpoints
(e.g. 65K blocks behind) to serve responses ~0.33% of the time. Changed
all three paths to call SelectMultipleWithArchival unconditionally
(not just for archival requests), which runs full QoS validation
including block height, chain ID, and archival checks.

Added regression test TestSelectMultipleWithArchival_FiltersStaleEndpoints
with 5 cases covering stale filtering, boundary conditions, and mixed pools.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dpoints during session rotation

During session rotation, new supplier addresses appear for the same backend
infrastructure. These "fresh" endpoints aren't in the local endpointStore yet,
so URL-based block height checks had no data to validate against — allowing
stale endpoints through the primary QoS path.

This fix caches URL→blockHeight mappings from Redis (updated every 5s by
syncEndpointBlocksFromRedis) and merges them into both filterValidEndpointsWithDetails
and filterStaleURLEndpoints as fallback when local store data is missing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ts are stale

Previously, filterStaleURLEndpoints returned the full unfiltered list when ALL
endpoints had stale URLs, as a safety valve to "avoid total failure". This allowed
sessions where every endpoint is backed by stale infrastructure (e.g., qspider
39K blocks behind) to serve stale responses to clients.

Now returns an empty list, and both callers (SelectMultipleWithArchival and
SelectWithMetadata) return an error. Returning an error to the client is
preferable to silently serving data that is tens of thousands of blocks behind.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t stale-only selection

Probation routing (3% of traffic) previously returned ONLY probation endpoints,
which could be entirely stale (e.g., qspider 39K blocks behind on base). QoS
validation rejected them all, then the fallback served stale data anyway.

Now includes highest-tier endpoints alongside probation endpoints. QoS validates
both pools — probation endpoints that pass validation still get recovery traffic,
but if all probation endpoints fail (stale block height), QoS selects from the
healthy tier endpoints instead. Recovery still works because health checks
(not relay traffic) update block heights.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… inclusion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@oten91 oten91 merged commit b43d349 into feat/hot-path-optimization Mar 11, 2026
1 check passed
@oten91 oten91 deleted the feat/hot-path-optimization-with-external-rpc-validation branch March 11, 2026 23:08
oten91 added a commit that referenced this pull request Mar 11, 2026
…onsensus (#507)

## Summary

This PR implements a comprehensive set of optimizations and reliability
improvements for PATH:

1. **Hot Path Optimization** - Raw byte passthrough eliminates JSON
parsing from the critical request path
2. **Shared State via Redis** - Archival status, perceived block height,
and reputation scores are now shared across all replicas
3. **Block Height Consensus** - Median-anchored algorithm protects
against malicious endpoints reporting extreme block heights
4. **Health Check-Based Archival Detection** - Archival capability is
now determined by actual historical queries during health checks
5. **Enhanced Observability** - `/ready` endpoint now provides detailed
endpoint info including block consensus stats

## Changes

### Hot Path Optimization
- Store and return raw response bytes without JSON parsing on the
request path
- Defer heavy JSON parsing to async observation queue
- Convert all extractors (EVM, Cosmos, Solana) to use gjson for reduced
allocations
- Remove unused response parsing code (~1000 lines removed)

### Shared State via Redis
- **Archival status**: Stored with TTL, read-through on cache miss for
non-leader replicas
- **Perceived block number**: Atomic max semantics via Lua script for
cross-replica consistency
- **Reputation scores**: Extended with archival fields (IsArchival,
ArchivalExpiresAt)
- **Background sync**: Every 5 seconds with immediate sync on startup

### Block Height Consensus Mechanism
- Median-anchored algorithm protects against malicious/misconfigured
endpoints
- Outlier rejection: blocks > `median + (syncAllowance × 3)` are
filtered
- Self-adjusting using existing `sync_allowance` config (no new
configuration needed)
- 2-minute sliding window with up to 1000 observations
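The outlier rule above can be sketched directly: sort the window, take the median, and drop anything above `median + syncAllowance*3`. A simplified illustration (the real implementation works over the sliding observation window):

```go
package main

import (
	"fmt"
	"sort"
)

// filterOutliers rejects block height observations above
// median + syncAllowance*3, so a single malicious or misconfigured
// endpoint cannot drag the perceived height upward.
func filterOutliers(heights []uint64, syncAllowance uint64) []uint64 {
	if len(heights) == 0 {
		return nil
	}
	sorted := append([]uint64(nil), heights...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	median := sorted[len(sorted)/2]
	limit := median + syncAllowance*3
	var kept []uint64
	for _, h := range heights {
		if h <= limit {
			kept = append(kept, h)
		}
	}
	return kept
}

func main() {
	// One endpoint reports an absurd height; median anchoring discards it.
	fmt.Println(filterOutliers([]uint64{1000, 1001, 1002, 9999999}, 20))
	// [1000 1001 1002]
}
```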

### Health Check Improvements
- Move archival detection from synthetic QoS checks to health checks
- EVM extractor only evaluates archival-related methods (eth_getBalance,
eth_getStorageAt)
- Health check executor marks endpoints as archival after all
validations pass
- Add archival health checks for all EVM chains aligned with E2E config
- Scale health check worker pool dynamically based on service/endpoint
count

### Enhanced `/ready` Endpoint
- Add `?detailed=true` query parameter for comprehensive endpoint info
- Includes: reputation scores, archival status, tier classification,
cooldown status
- New fields: `perceived_block_height`, `median_block_height`,
`block_observations`

### Bug Fixes
- Strip trailing newlines from JSON responses (hot path optimization
artifact)
- Return error response instead of empty body for failed requests
- Preserve backend error responses through error handling chain
- Handle endpoint unavailability race condition with proper re-selection

Includes #508 fixes

## Test Plan

- [x] Unit tests pass (`make test_unit`)
- [x] E2E tests pass for eth, pocket, xrplevm (`make e2e_test eth,pocket,xrplevm`)
- [x] Block consensus tests verify outlier rejection
- [x] Shared state test script verifies cross-replica consistency
(`scripts/test_shared_state.sh`)
- [x] `go fmt`, `go vet`, `golangci-lint` pass

---------

Co-authored-by: Otto V <ottoevargas@gmail.com>
oten91 added a commit that referenced this pull request Mar 11, 2026
This PR combines and extends the work from #505 with additional bug
fixes and improvements for hedge racing and retry reliability.

Closes #505

## Features (from #505)

### Protocol Error Propagation
- Add `SetProtocolError` to `RequestQoSContext` interface for specific
error messages
- Replace generic "no endpoint responses received" with specific errors
like "no valid endpoints available for service"

### Hedge Racing (New Feature)
- Spawn parallel "hedge" request after configurable delay if primary
hasn't responded
- First successful response wins; the other is cancelled
- Configurable via `retry_config.hedge_delay` and
`retry_config.connect_timeout`
- Track outcomes via `X-Hedge-Result` header

### Retry Enhancements
- **Time Budget**: `max_retry_latency` skips retries when failed request
already took too long
- **Endpoint Rotation**: Each retry attempt uses a different endpoint
- **Heuristic Detection**: Retry on JSON-RPC errors hidden in HTTP 200
responses
- **Observability**: Track via `X-Retry-Count` and `X-Suppliers-Tried`
headers

### Heuristic Response Analysis
- Detect errors in response payloads despite HTTP 200 status
- Identify: JSON-RPC errors, HTML error pages, empty responses,
malformed JSON
- Record correcting reputation signals for detected failures

### Response Metadata Headers
| Header | Description |
|---|---|
| `X-Retry-Count` | Number of retry attempts (0 = first attempt succeeded) |
| `X-Suppliers-Tried` | Comma-separated list of attempted supplier addresses |
| `X-Hedge-Result` | Hedge racing outcome: `primary_only`, `primary_won`, `hedge_won`, `both_failed` |
| `X-App-Address` | Application address used for the relay |
| `X-Supplier-Address` | Supplier address of the responding endpoint |
| `X-Session-ID` | Session ID for the relay |

### Health Check & Sync Check
- **Sync check validation**: Health checks now validate endpoint block
height against QoS perceived block number using `sync_allowance` config
- Consolidated block height validation directly into health check
executor (removed standalone `BlockHeightValidator`,
`BlockHeightReferenceCache`)
- Simplified health check config structure
- Fix defer pattern in solana.go for mutex unlock
- Add nil map initialization safety check in solana.go

## Bug Fixes (this PR)

- **X-Suppliers-Tried header**: Pre-register both primary and hedge
suppliers when racing starts
- **selectTopRankedEndpoint**: Return original endpoint address instead
of reputation key (fixes 'endpoint not available' errors)
- **Retry blockchain errors**: Detect and retry node-specific errors
(missing trie node, unhealthy node) even in valid JSON-RPC responses
- **Health check refactor**: Simplify block height validation and
consolidate into health checks

## Contributions from @oten91 

- Prioritized endpoint inclusion during reputation filtering (mitigates
race conditions)
- Request-awareness for data extraction methods
- Enhanced JSON-RPC response analysis with stricter error classification
- Heuristic-based error classification with unit tests
- Improved supplier tracking and debugging
- JSON-RPC error handling to prevent retries for valid client errors

## Configuration

```yaml
services:
  - service_id: eth
    retry_config:
      enabled: true
      max_retries: 2
      hedge_delay: 500ms
      connect_timeout: 200ms
      max_retry_latency: 5s
      retry_on_5xx: true
      retry_on_timeout: true
      retry_on_connection: true
```

Includes #508 and #507

### Testing
- [x] Unit tests
- [x] E2E tests (eth service 74.33% success rate)
- [x] Local hedge testing verified with `scripts/test_hedge.sh`

---------

Co-authored-by: Otto V <ottoevargas@gmail.com>