Add observability: Prometheus metrics, structured logging, and health checks

## Summary

The backend (API + indexer) currently has only plain-text logs to stdout via `tracing`. There are no metrics, no distributed tracing, no structured log output, and minimal health checks. This issue tracks adding proper observability instrumentation.

## Current State

- **Logging**: `tracing` + `tracing-subscriber` with `fmt::layer()` (text). The `json` feature is compiled but unused.
- **Metrics**: None. No Prometheus, no `/metrics` endpoint.
- **Distributed tracing**: None. No OpenTelemetry, no span propagation.
- **Health checks**: `GET /health` returns `"OK"` with no dependency verification. Indexer has no health endpoint at all.
- **`#[instrument]`**: Not used — no request-scoped spans, no correlation IDs.
- **`tower_http::TraceLayer`**: Configured but effectively silent at the production log level (`info`); the layer's default request/response events are emitted at `DEBUG`.

## Proposed Work

### 1. Prometheus Metrics (`/metrics` endpoint)

**API server:**
- Request count by route, method, status code
- Request latency histogram by route
- Active connections gauge
- Database query latency histogram
- Database connection pool stats (active, idle, max)

**Indexer:**
- `blocks_indexed_total` counter
- `blocks_per_second` gauge
- `batch_duration_seconds` histogram
- `failed_blocks_total` counter
- `rpc_requests_total` counter (by status: success/failure)
- `rpc_request_duration_seconds` histogram
- `indexer_head_block` gauge (current indexed height)
- `chain_head_block` gauge (latest chain height)
- `indexer_lag_blocks` gauge (chain head - indexed head)
- `db_insert_duration_seconds` histogram

Crate candidates: `metrics` + `metrics-exporter-prometheus` or `prometheus-client`.
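
To make the indexer gauges concrete, here is a minimal std-only sketch of how a few of the metrics above relate to each other and what a `/metrics` response body could look like in the Prometheus text format. It deliberately uses neither candidate crate; the `IndexerMetrics` type and its method names are hypothetical, and a real implementation would most likely delegate storage and rendering to whichever crate is chosen.

```rust
use std::fmt::Write as _;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical holder for a subset of the indexer metrics listed above.
#[derive(Default)]
pub struct IndexerMetrics {
    blocks_indexed_total: AtomicU64,
    failed_blocks_total: AtomicU64,
    indexer_head_block: AtomicU64,
    chain_head_block: AtomicU64,
}

impl IndexerMetrics {
    /// Count one successfully indexed block and update the indexed height.
    pub fn record_indexed(&self, height: u64) {
        self.blocks_indexed_total.fetch_add(1, Ordering::Relaxed);
        self.indexer_head_block.store(height, Ordering::Relaxed);
    }

    /// Count one block that failed to index.
    pub fn record_failed(&self) {
        self.failed_blocks_total.fetch_add(1, Ordering::Relaxed);
    }

    /// Record the latest chain height reported by the RPC node.
    pub fn set_chain_head(&self, height: u64) {
        self.chain_head_block.store(height, Ordering::Relaxed);
    }

    /// Lag is chain head minus indexed head, clamped at zero so a stale
    /// chain-head reading never underflows.
    pub fn lag_blocks(&self) -> u64 {
        let chain = self.chain_head_block.load(Ordering::Relaxed);
        let indexed = self.indexer_head_block.load(Ordering::Relaxed);
        chain.saturating_sub(indexed)
    }

    /// Render the metrics in the Prometheus text exposition format.
    pub fn render(&self) -> String {
        let mut out = String::new();
        let mut emit = |name: &str, kind: &str, value: u64| {
            // Writing to a String cannot fail, so the results are ignored.
            let _ = writeln!(out, "# TYPE {} {}", name, kind);
            let _ = writeln!(out, "{} {}", name, value);
        };
        emit("blocks_indexed_total", "counter",
             self.blocks_indexed_total.load(Ordering::Relaxed));
        emit("failed_blocks_total", "counter",
             self.failed_blocks_total.load(Ordering::Relaxed));
        emit("indexer_head_block", "gauge",
             self.indexer_head_block.load(Ordering::Relaxed));
        emit("chain_head_block", "gauge",
             self.chain_head_block.load(Ordering::Relaxed));
        emit("indexer_lag_blocks", "gauge", self.lag_blocks());
        out
    }
}
```

In this sketch the indexer would call `record_indexed` after each committed block and `set_chain_head` after each chain-height poll, and the `/metrics` handler would return `render()` as its body. Histograms are omitted because bucket handling is exactly what the candidate crates provide.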

### 2. Structured JSON Logging

- Activate `tracing-subscriber`'s `json` formatter behind a config flag (e.g., `LOG_FORMAT=json`)
- Ensure batch-complete stats are emitted as named `tracing` fields, not embedded in format strings
- Add `#[instrument]` to API handler functions and key indexer methods for automatic span context
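
The config flag could be parsed along these lines. This is a std-only sketch: the `LogFormat` type is hypothetical, and falling back to text for unset or unrecognized values is an assumption rather than a decided behavior.

```rust
/// Output format for the log subscriber (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogFormat {
    Text,
    Json,
}

impl LogFormat {
    /// Parse a raw `LOG_FORMAT` value. Matching is case-insensitive and
    /// anything other than "json" (including an unset variable) falls back
    /// to the current plain-text output.
    pub fn parse(raw: Option<&str>) -> LogFormat {
        match raw.map(|s| s.trim().to_ascii_lowercase()).as_deref() {
            Some("json") => LogFormat::Json,
            _ => LogFormat::Text,
        }
    }

    /// Read the format from the process environment.
    pub fn from_env() -> LogFormat {
        let raw = std::env::var("LOG_FORMAT").ok();
        LogFormat::parse(raw.as_deref())
    }
}
```

At startup, the result of `LogFormat::from_env()` would decide whether the subscriber is built with `tracing-subscriber`'s JSON formatter or with the existing text layer. Defaulting to text keeps local development output unchanged.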

### 3. Improved Health Checks

**API:**
- `GET /health` (liveness): keep as-is
- `GET /health/ready` (readiness): verify DB connectivity and `indexer_state` freshness (e.g., last update less than 5 minutes ago)
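
The readiness decision itself can be kept as a pure function so it is easy to test. The sketch below assumes the handler has already run a DB ping and loaded the last `indexer_state` update time; the `Readiness` type, the function names, and the 200/503 mapping are illustrative assumptions, not an agreed design.

```rust
use std::time::{Duration, SystemTime};

/// Outcome of a readiness probe (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Readiness {
    Ready,
    DbUnreachable,
    IndexerStale { age_secs: u64 },
}

/// Decide readiness from a DB ping result and the last `indexer_state`
/// update time. `max_age` would be 5 minutes under the example threshold.
pub fn check_readiness(
    db_ok: bool,
    last_update: SystemTime,
    now: SystemTime,
    max_age: Duration,
) -> Readiness {
    if !db_ok {
        return Readiness::DbUnreachable;
    }
    // If `last_update` is ahead of `now` (clock skew), treat the age as zero.
    let age = now.duration_since(last_update).unwrap_or(Duration::ZERO);
    if age < max_age {
        Readiness::Ready
    } else {
        Readiness::IndexerStale { age_secs: age.as_secs() }
    }
}

/// Map the outcome to an HTTP status: 200 when ready, 503 otherwise.
pub fn http_status(readiness: &Readiness) -> u16 {
    match readiness {
        Readiness::Ready => 200,
        _ => 503,
    }
}
```

Passing `now` in as a parameter, rather than calling `SystemTime::now()` inside the function, lets the staleness threshold be tested without sleeping.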

**Indexer:**
- Add a lightweight HTTP server (separate port) with `/health` that reports:
  - Process is alive
  - Last successful block indexed + timestamp
  - Current lag from chain head
  - `failed_blocks` table row count
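
One possible shape for the indexer's `/health` response body is sketched below. The field names and the JSON layout are assumptions for discussion, and the JSON is hand-formatted only to keep the example dependency-free; a real implementation would more likely use a serialization crate.

```rust
/// Snapshot reported by the indexer's `/health` endpoint (hypothetical type).
pub struct IndexerHealth {
    /// Height of the last successfully indexed block.
    pub last_indexed_block: u64,
    /// When that block was indexed, as Unix seconds.
    pub last_indexed_at_unix: u64,
    /// Latest known chain height.
    pub chain_head_block: u64,
    /// Row count of the `failed_blocks` table.
    pub failed_blocks: u64,
}

impl IndexerHealth {
    /// Current lag from chain head, clamped at zero.
    pub fn lag_blocks(&self) -> u64 {
        self.chain_head_block.saturating_sub(self.last_indexed_block)
    }

    /// Render the snapshot as a JSON object. All values are numeric, so no
    /// string escaping is needed here.
    pub fn to_json(&self) -> String {
        format!(
            "{{\"status\":\"alive\",\"last_indexed_block\":{},\"last_indexed_at\":{},\"lag_blocks\":{},\"failed_blocks\":{}}}",
            self.last_indexed_block,
            self.last_indexed_at_unix,
            self.lag_blocks(),
            self.failed_blocks,
        )
    }
}
```

The fact that the endpoint answers at all covers "process is alive"; the remaining fields map one-to-one onto the bullets above.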

### 4. OpenTelemetry Integration (optional / future)

- Wire `tracing` spans to OTLP exporter via `tracing-opentelemetry`
- Propagate trace context through RPC calls
- Export to Jaeger/Tempo/etc.

## Priority

Prometheus metrics and structured logging are the highest priority — they unblock dashboards, alerting, and log aggregation. Health check improvements are a close second. OTEL tracing is a nice-to-have for later.

## References

- [`metrics-rs`](https://github.com/metrics-rs/metrics)
- [`tracing-opentelemetry`](https://github.com/tokio-rs/tracing-opentelemetry)
- [`tower-http` TraceLayer docs](https://docs.rs/tower-http/latest/tower_http/trace/index.html)
