Add observability: Prometheus metrics, structured logging, and health checks #3

@tac0turtle

Summary

The backend (API + indexer) currently has only plain-text logs to stdout via tracing. There are no metrics, no distributed tracing, no structured log output, and minimal health checks. This issue tracks adding proper observability instrumentation.

Current State

  • Logging: tracing + tracing-subscriber with fmt::layer() (text). The json feature is compiled but unused.
  • Metrics: None. No Prometheus, no /metrics endpoint.
  • Distributed tracing: None. No OpenTelemetry, no span propagation.
  • Health checks: GET /health returns "OK" with no dependency verification. Indexer has no health endpoint at all.
  • #[instrument]: Not used — no request-scoped spans, no correlation IDs.
  • tower_http::TraceLayer: Configured, but effectively silent at the production log level (info), since its default spans and events are emitted at DEBUG.

Proposed Work

1. Prometheus Metrics (/metrics endpoint)

API server:

  • Request count by route, method, status code
  • Request latency histogram by route
  • Active connections gauge
  • Database query latency histogram
  • Database connection pool stats (active, idle, max)
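For the latency histograms above, the following is a minimal std-only sketch of how a Prometheus-style histogram accumulates observations into cumulative `le` buckets. The `Histogram` type and the bucket bounds are hypothetical and only illustrate the data model; in practice one of the candidate metrics crates would likely own this logic.

```rust
/// Hypothetical histogram with fixed upper bounds (in seconds).
pub struct Histogram {
    /// Upper bounds, ascending. An implicit +Inf bucket follows the last one.
    bounds: Vec<f64>,
    /// Per-bucket (non-cumulative) counts; the last slot is the +Inf bucket.
    counts: Vec<u64>,
    /// Sum of all observed values.
    pub sum: f64,
    /// Total number of observations.
    pub count: u64,
}

impl Histogram {
    pub fn new(bounds: &[f64]) -> Self {
        Histogram {
            bounds: bounds.to_vec(),
            counts: vec![0; bounds.len() + 1],
            sum: 0.0,
            count: 0,
        }
    }

    /// Record one observation in the first bucket whose bound is >= value.
    pub fn observe(&mut self, value: f64) {
        let idx = self
            .bounds
            .iter()
            .position(|b| value <= *b)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
        self.sum += value;
        self.count += 1;
    }

    /// Cumulative counts, one per bound plus +Inf, as Prometheus exposes them.
    pub fn cumulative(&self) -> Vec<u64> {
        let mut running = 0;
        self.counts
            .iter()
            .map(|c| {
                running += *c;
                running
            })
            .collect()
    }
}
```

A request handler would call `observe` with the elapsed time in seconds. The last element of `cumulative()` always equals `count`.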

Indexer:

  • blocks_indexed_total counter
  • blocks_per_second gauge
  • batch_duration_seconds histogram
  • failed_blocks_total counter
  • rpc_requests_total counter (by status: success/failure)
  • rpc_request_duration_seconds histogram
  • indexer_head_block gauge (current indexed height)
  • chain_head_block gauge (latest chain height)
  • indexer_lag_blocks gauge (chain head - indexed head)
  • db_insert_duration_seconds histogram

Crate candidates: metrics + metrics-exporter-prometheus or prometheus-client.
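To make the indexer gauges concrete, here is a rough std-only sketch that derives `indexer_lag_blocks` from the two head gauges and renders a few of the proposed metrics in the Prometheus text exposition format. `IndexerSnapshot` and its methods are hypothetical. A real implementation would presumably delegate registration and rendering to one of the candidate crates rather than format the output by hand.

```rust
use std::fmt::Write;

/// Hypothetical snapshot of indexer progress. Field names mirror the
/// metric names proposed in this issue.
pub struct IndexerSnapshot {
    pub indexer_head_block: u64,
    pub chain_head_block: u64,
    pub blocks_indexed_total: u64,
    pub failed_blocks_total: u64,
}

impl IndexerSnapshot {
    /// Lag = chain head - indexed head, clamped at zero so a stale
    /// chain-head reading cannot underflow.
    pub fn lag_blocks(&self) -> u64 {
        self.chain_head_block.saturating_sub(self.indexer_head_block)
    }

    /// Render the snapshot as Prometheus text exposition lines.
    pub fn render(&self) -> String {
        let metrics: [(&str, &str, u64); 5] = [
            ("blocks_indexed_total", "counter", self.blocks_indexed_total),
            ("failed_blocks_total", "counter", self.failed_blocks_total),
            ("indexer_head_block", "gauge", self.indexer_head_block),
            ("chain_head_block", "gauge", self.chain_head_block),
            ("indexer_lag_blocks", "gauge", self.lag_blocks()),
        ];
        let mut out = String::new();
        for &(name, kind, value) in metrics.iter() {
            // Writing to a String cannot fail, so unwrap is safe here.
            writeln!(out, "# TYPE {} {}", name, kind).unwrap();
            writeln!(out, "{} {}", name, value).unwrap();
        }
        out
    }
}
```

Computing the lag in the indexer keeps alert rules simple. The same value could also be derived at query time from the two head gauges.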

2. Structured JSON Logging

  • Activate tracing-subscriber's json formatter behind a config flag (e.g., LOG_FORMAT=json)
  • Ensure batch-complete stats are emitted as named tracing fields, not embedded in format strings
  • Add #[instrument] to API handler functions and key indexer methods for automatic span context
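The config flag could be parsed roughly as follows. This is a std-only sketch: the `LogFormat` enum and its methods are hypothetical, and falling back to text on unknown values is an assumption. The `Json` arm would then select tracing-subscriber's JSON formatter when the subscriber is built.

```rust
/// Hypothetical log format selector driven by the LOG_FORMAT variable.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogFormat {
    Text,
    Json,
}

impl LogFormat {
    /// Parse a raw flag value. Anything other than "json"
    /// (case-insensitive, surrounding whitespace ignored) falls back to
    /// text, so a typo never prevents the service from starting.
    pub fn parse(raw: Option<&str>) -> LogFormat {
        match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
            Some("json") => LogFormat::Json,
            _ => LogFormat::Text,
        }
    }

    /// Read LOG_FORMAT from the process environment.
    pub fn from_env() -> LogFormat {
        LogFormat::parse(std::env::var("LOG_FORMAT").ok().as_deref())
    }
}
```

Keeping text as the default preserves today's output for local development, while deployments opt in with `LOG_FORMAT=json`.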

3. Improved Health Checks

API:

  • GET /health (liveness) — keep as-is
  • GET /health/ready (readiness) — verify DB connectivity + indexer_state freshness (e.g., last update < 5 min)
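The readiness decision itself can be kept as a pure function, separate from the HTTP handler and the database query. The sketch below assumes the handler has already determined whether the DB responded and how old the latest indexer_state update is. The names `Readiness`, `MAX_STATE_AGE`, and `readiness` are hypothetical, and the 200/503 mapping is an assumed convention rather than something this issue specifies.

```rust
use std::time::Duration;

/// Hypothetical outcome of the readiness probe.
#[derive(Debug, PartialEq, Eq)]
pub enum Readiness {
    Ready,
    NotReady(&'static str),
}

/// Freshness threshold from the proposal: last update under 5 minutes.
pub const MAX_STATE_AGE: Duration = Duration::from_secs(5 * 60);

/// `state_age` is None when indexer_state has never been written.
pub fn readiness(db_reachable: bool, state_age: Option<Duration>) -> Readiness {
    if !db_reachable {
        return Readiness::NotReady("database unreachable");
    }
    match state_age {
        None => Readiness::NotReady("indexer_state is empty"),
        Some(age) if age >= MAX_STATE_AGE => Readiness::NotReady("indexer_state is stale"),
        Some(_) => Readiness::Ready,
    }
}

impl Readiness {
    /// Assumed mapping: 200 when ready, 503 otherwise.
    pub fn status_code(&self) -> u16 {
        match self {
            Readiness::Ready => 200,
            Readiness::NotReady(_) => 503,
        }
    }
}
```

Separating the decision from I/O makes the threshold easy to unit test without a database.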

Indexer:

  • Add a lightweight HTTP server (separate port) with /health that reports:
    • Process is alive
    • Last successful block indexed + timestamp
    • Current lag from chain head
    • failed_blocks table row count
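As a rough illustration of the indexer health payload, the sketch below builds a JSON body from the four items listed and wraps it in a bare HTTP/1.1 response. Everything here is hypothetical, including the `IndexerHealth` struct, the JSON field names, and the hand-written response. A real implementation would more likely use the project's existing HTTP stack and a serializer.

```rust
/// Hypothetical health report for the indexer's /health endpoint.
pub struct IndexerHealth {
    pub last_indexed_block: u64,
    /// Unix timestamp (seconds) of the last successful block.
    pub last_indexed_unix_secs: u64,
    pub chain_head_block: u64,
    /// Row count of the failed_blocks table.
    pub failed_blocks: u64,
}

impl IndexerHealth {
    /// Current lag from chain head, clamped at zero.
    pub fn lag_blocks(&self) -> u64 {
        self.chain_head_block.saturating_sub(self.last_indexed_block)
    }

    /// Hand-rolled JSON. All values are numeric, so no escaping is needed.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"alive\":true,\"last_indexed_block\":{},\"last_indexed_at\":{},\"lag_blocks\":{},\"failed_blocks\":{}}}",
            self.last_indexed_block,
            self.last_indexed_unix_secs,
            self.lag_blocks(),
            self.failed_blocks
        )
    }
}

/// Wrap a JSON body in a minimal HTTP/1.1 200 response.
pub fn http_response(body: &str) -> String {
    format!(
        "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        body.len(),
        body
    )
}
```

Serving this on a separate port keeps the probe independent of the API process, as proposed above.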

4. OpenTelemetry Integration (optional / future)

  • Wire tracing spans to OTLP exporter via tracing-opentelemetry
  • Propagate trace context through RPC calls
  • Export to Jaeger/Tempo/etc.
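For context on what propagation over RPC involves: assuming the W3C Trace Context `traceparent` header is used (this issue does not name a format), the sketch below formats and parses that header by hand. `TraceContext` is a hypothetical type, and only version `00` is accepted. In practice an OpenTelemetry propagator would handle this rather than custom code.

```rust
/// Hypothetical trace context carried on outgoing RPC requests.
#[derive(Debug, PartialEq, Eq)]
pub struct TraceContext {
    pub trace_id: u128,
    pub span_id: u64,
    pub sampled: bool,
}

/// True if `s` is exactly `len` lowercase hex digits.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

impl TraceContext {
    /// Format as `00-<32 hex trace id>-<16 hex span id>-<2 hex flags>`.
    pub fn to_traceparent(&self) -> String {
        format!(
            "00-{:032x}-{:016x}-{:02x}",
            self.trace_id, self.span_id, self.sampled as u8
        )
    }

    /// Parse a version-00 traceparent value. Returns None on any
    /// malformed field or an all-zero trace or span id.
    pub fn parse(header: &str) -> Option<TraceContext> {
        let mut parts = header.trim().split('-');
        let version = parts.next()?;
        let trace = parts.next()?;
        let span = parts.next()?;
        let flags = parts.next()?;
        if parts.next().is_some() || version != "00" {
            return None;
        }
        if !is_lower_hex(trace, 32) || !is_lower_hex(span, 16) || !is_lower_hex(flags, 2) {
            return None;
        }
        let trace_id = u128::from_str_radix(trace, 16).ok()?;
        let span_id = u64::from_str_radix(span, 16).ok()?;
        let flags = u8::from_str_radix(flags, 16).ok()?;
        if trace_id == 0 || span_id == 0 {
            return None;
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        Some(TraceContext { trace_id, span_id, sampled: flags & 1 == 1 })
    }
}
```

The indexer would attach the header to each outgoing RPC request so downstream spans join the same trace.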

Priority

Prometheus metrics and structured logging are the highest priority — they unblock dashboards, alerting, and log aggregation. Health check improvements are a close second. OTEL tracing is a nice-to-have for later.
