Monitoring, backup and restore, retention, cluster operations, and background services.
HyperbyteDB exposes a Prometheus-compatible metrics endpoint at GET /metrics on the same port as the API (default 8086). There is no separate metrics port.
Key metrics:
| Metric | Type | Description |
|---|---|---|
| hyperbytedb_write_requests_total | counter | Total write requests received |
| hyperbytedb_write_errors_total | counter | Failed write requests |
| hyperbytedb_write_payload_bytes | histogram | Raw payload size in bytes |
| hyperbytedb_write_duration_seconds | histogram | Write handler latency |
| hyperbytedb_query_requests_total | counter | Total query requests received |
| hyperbytedb_query_errors_total | counter | Failed queries |
| hyperbytedb_query_duration_seconds | histogram | Query execution latency |
| hyperbytedb_ingestion_points_total | counter | Total points ingested |
| hyperbytedb_flush_runs_total | counter | Flush cycles completed |
| hyperbytedb_flush_errors_total | counter | Failed flush cycles |
| hyperbytedb_flush_points_total | counter | Points flushed to chDB |
| hyperbytedb_flush_duration_seconds | histogram | Flush cycle duration |
| hyperbytedb_wal_last_sequence | gauge | Last flushed WAL sequence |
Cluster-specific metrics:
| Metric | Type | Description |
|---|---|---|
| hyperbytedb_replication_writes_total | counter | Write replication attempts |
| hyperbytedb_replication_errors_total | counter | Failed write replications |
| hyperbytedb_replication_duration_seconds | histogram | Replication latency |
| hyperbytedb_cluster_node_state | gauge | Node state (0=Joining through 5=Leaving) |
| hyperbytedb_cluster_peers_active | gauge | Number of active peers |
| hyperbytedb_uptime_seconds | gauge | Node uptime |
```yaml
scrape_configs:
  - job_name: 'hyperbytedb'
    static_configs:
      - targets: ['hyperbytedb:8086']
    metrics_path: /metrics
    scrape_interval: 15s
```

For clusters, scrape each node individually.
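For a quick check without a Prometheus server, the counters listed above can be read straight from the text exposition returned by GET /metrics. The following is a minimal sketch, not part of HyperbyteDB: `parse_samples` and `error_ratio` are hypothetical helper names, the sample payload is illustrative, and the parser assumes the counters are exposed without labels.

```python
from typing import Dict


def parse_samples(exposition: str) -> Dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples: Dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        # Labelled series (name{...}) are ignored in this sketch.
        if len(parts) < 2 or "{" in parts[0]:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return samples


def error_ratio(samples: Dict[str, float], total: str, errors: str) -> float:
    """Return errors / total over the process lifetime, or 0.0 with no traffic."""
    requests = samples.get(total, 0.0)
    if requests == 0:
        return 0.0
    return samples.get(errors, 0.0) / requests


# Illustrative payload only; real output will contain many more series.
SAMPLE = """\
# TYPE hyperbytedb_write_requests_total counter
hyperbytedb_write_requests_total 200
# TYPE hyperbytedb_write_errors_total counter
hyperbytedb_write_errors_total 5
"""

if __name__ == "__main__":
    parsed = parse_samples(SAMPLE)
    ratio = error_ratio(
        parsed,
        "hyperbytedb_write_requests_total",
        "hyperbytedb_write_errors_total",
    )
    print(f"write error ratio: {ratio:.3f}")
```

This ratio covers the whole lifetime of the process. For alerting, a windowed rate computed in Prometheus itself is usually more useful.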
Logs are written to stderr. Control verbosity with the [logging] config section:
| Level | Use case |
|---|---|
| error | Production: errors only |
| warn | Production: errors + warnings |
| info | Default: startup, shutdown, periodic summaries |
| debug | Development: query details, flush activity |
| trace | Deep debugging: all internal operations |
Set format = "json" for structured output compatible with log aggregation (Loki, Elasticsearch, and similar).
Environment variable equivalents:
```bash
HYPERBYTEDB__LOGGING__LEVEL=info
HYPERBYTEDB__LOGGING__FORMAT=json
```

When statement_summary.enabled = true, recently executed TimeseriesQL statements are available at GET /api/v1/statements. Each entry includes the normalized query text, digest, execution time, and error status.
GET /health returns:
```json
{"status": "pass", "message": "ready for queries and writes"}
```

The endpoint always returns 200 as long as the HTTP server is running. In cluster mode, a node in Draining or Leaving state still responds to /health but rejects writes, so a passing health check alone does not mean the node accepts writes.
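Because /health keeps passing while a node drains, a readiness probe for writers needs the node state as well. The sketch below shows one way to combine the two signals. `health_ok` and `accepts_writes` are hypothetical helpers, and how the state string is obtained (for example from GET /cluster/metrics) is left to the caller, since the response schema is not described here.

```python
import json


def health_ok(status_code: int, body: str) -> bool:
    """Return True when /health answered 200 with status "pass"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "pass"


def accepts_writes(health_passes: bool, node_state: str) -> bool:
    """Return True when the node is healthy and not draining or leaving.

    node_state is supplied by the caller; only the Draining and Leaving
    states are documented as rejecting writes.
    """
    return health_passes and node_state not in {"Draining", "Leaving"}


if __name__ == "__main__":
    body = '{"status": "pass", "message": "ready for queries and writes"}'
    passing = health_ok(200, body)
    print(accepts_writes(passing, "Draining"))
```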
```bash
hyperbytedb backup --output /backups/hyperbytedb-$(date +%Y%m%d)
```

The backup directory contains:
| Directory | Contents |
|---|---|
| wal/ | RocksDB checkpoint of the WAL |
| meta/ | RocksDB checkpoint of metadata |
| data/ | Copy of the chDB session data directory (chdb.session_data_path) |
| manifest.json | Timestamp, WAL sequence, engine data paths |
Backups can run while HyperbyteDB is serving traffic. RocksDB checkpoints are consistent point-in-time snapshots. For off-node copies, use your operator backup CRD or object storage tooling.
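Before shipping a backup off-node, it can be worth confirming that the directory has the layout listed above. The following sketch checks only for the presence of wal/, meta/, data/, and a parseable manifest.json. `check_backup_layout` is a hypothetical helper, not a HyperbyteDB command, and it does not validate the manifest's fields because their exact names are not specified here.

```python
import json
from pathlib import Path
from typing import List

# Subdirectories the backup is documented to contain.
EXPECTED_DIRS = ("wal", "meta", "data")


def check_backup_layout(backup_dir: Path) -> List[str]:
    """Return a list of problems found; an empty list means the layout looks complete."""
    problems: List[str] = []
    for name in EXPECTED_DIRS:
        if not (backup_dir / name).is_dir():
            problems.append(f"missing directory: {name}/")

    manifest = backup_dir / "manifest.json"
    if not manifest.is_file():
        problems.append("missing file: manifest.json")
    else:
        try:
            json.loads(manifest.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"manifest.json is not valid JSON: {exc}")
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_backup.py /backups/hyperbytedb-YYYYMMDD
    target = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    issues = check_backup_layout(target)
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```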
```bash
# 1. Stop HyperbyteDB
# 2. Restore (overwrites configured directories)
hyperbytedb restore --input /backups/hyperbytedb-20240115
# 3. Start HyperbyteDB
hyperbytedb serve
```

Restore overwrites the configured wal_dir, meta_dir, and chDB session data directory.
Retention policies are enforced by a background loop that runs ALTER TABLE … DELETE against expired rows in each measurement's MergeTree table. Tune frequency with [retention].interval in config. See Configuration.
Use the built-in HTTP endpoints for on-call inspection:
| Endpoint | Description |
|---|---|
| GET /cluster/metrics | Node id, state, membership version, peer counts |
| GET /cluster/nodes | All nodes with health and addresses |
| GET /internal/sync/manifest | WAL watermark and measurement catalog used for sync |
| GET /metrics | Prometheus metrics |
```bash
curl -s http://node1:8086/cluster/metrics | jq .
curl -s http://node2:8086/internal/sync/manifest | jq .
```

Compare manifests across nodes to spot replication lag or catalog drift.
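The manifest comparison can be scripted once the JSON from each node has been fetched. The sketch below is illustrative only: the field names `wal_sequence` and `measurements` are assumptions made for the example, not the documented manifest schema, so adjust them to match what /internal/sync/manifest actually returns.

```python
from typing import Any, Dict


def compare_manifests(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize WAL lag and catalog drift between two sync manifests.

    Assumes (hypothetically) that each manifest carries an integer
    "wal_sequence" and a list of measurement names under "measurements".
    """
    seq_a = int(a.get("wal_sequence", 0))
    seq_b = int(b.get("wal_sequence", 0))
    meas_a = set(a.get("measurements", []))
    meas_b = set(b.get("measurements", []))
    return {
        # Positive: node A is ahead of node B; negative: node A is behind.
        "wal_lag": seq_a - seq_b,
        "only_in_a": sorted(meas_a - meas_b),
        "only_in_b": sorted(meas_b - meas_a),
    }


if __name__ == "__main__":
    # Made-up manifests standing in for two nodes' responses.
    node1 = {"wal_sequence": 1200, "measurements": ["cpu", "mem", "disk"]}
    node2 = {"wal_sequence": 1150, "measurements": ["cpu", "mem"]}
    print(compare_manifests(node1, node2))
```

A non-zero `wal_lag` that keeps growing between checks suggests replication lag, while non-empty `only_in_a` or `only_in_b` lists point to catalog drift.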
To remove a node from the cluster without data loss:
```bash
curl -sS -XPOST 'http://node-to-remove:8086/internal/drain'
```

The drain procedure:
- Sets node state to Draining (rejects new writes with 503).
- Flushes all WAL entries into chDB MergeTree tables.
- Waits for replication acks from all peers (up to 60 seconds).
- Notifies peers of departure.
- Sets state to Leaving.
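The ordering of the drain steps matters: writes are refused before the final flush, and peers are told about the departure only after the ack wait. The sketch below models that ordering with injected callables. It is not HyperbyteDB's implementation, every function and parameter name is hypothetical, and it assumes the procedure continues after the 60-second ack wait even if some peers have not acknowledged.

```python
from typing import Callable

# Upper bound on the replication ack wait, taken from the drain description.
ACK_TIMEOUT_SECS = 60.0


def drain(
    set_state: Callable[[str], None],
    flush_wal: Callable[[], None],
    wait_for_acks: Callable[[float], bool],
    notify_peers: Callable[[], None],
) -> bool:
    """Run the drain steps in order; return whether all peers acked in time."""
    set_state("Draining")                    # new writes are rejected from here on
    flush_wal()                              # push remaining WAL entries to storage
    acked = wait_for_acks(ACK_TIMEOUT_SECS)  # bounded wait for replication acks
    notify_peers()                           # announce departure to the cluster
    set_state("Leaving")
    return acked


if __name__ == "__main__":
    steps = []

    def fake_wait(timeout: float) -> bool:
        steps.append(f"wait_for_acks({timeout})")
        return True

    ok = drain(
        set_state=lambda s: steps.append(f"state={s}"),
        flush_wal=lambda: steps.append("flush_wal"),
        wait_for_acks=fake_wait,
        notify_peers=lambda: steps.append("notify_peers"),
    )
    print(ok, steps)
```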
In cluster mode, the Raft leader periodically compares /internal/sync/manifest responses from peers and may mark peers as needing sync. New nodes pull metadata and WAL deltas via the sync APIs. See Deep Dive: Clustering.
HyperbyteDB runs several background services as Tokio tasks:
| Service | Interval | Purpose |
|---|---|---|
| Flush | flush.interval_secs (10s) | WAL → chDB MergeTree |
| Retention | retention.interval (12h) | ALTER TABLE … DELETE for expired rows |
| Continuous Query | 10s (fixed) | Execute CQ schedules |
| Heartbeat | heartbeat_interval_secs (2s, cluster) | Peer liveness detection |
| Leader sync monitor | 30s (cluster) | Compare peer manifests and trigger sync when needed |
All services shut down gracefully on ctrl+c: the flush service performs a final flush, then all service handles are awaited.
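The shutdown pattern described here (periodic work, one final flush on stop, then awaiting every task handle) can be illustrated with Python's asyncio as a rough analogue of the Tokio tasks. This is a sketch of the pattern only, not HyperbyteDB code: the names are hypothetical, the intervals are shortened, and a timer stands in for the ctrl+c signal.

```python
import asyncio
from typing import List


async def flush_service(stop: asyncio.Event, interval: float, log: List[str]) -> None:
    """Flush every `interval` seconds until stopped, then flush one last time."""
    while not stop.is_set():
        try:
            # Wake early if a stop is requested, otherwise time out and flush.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            log.append("flush")
    log.append("final flush")


async def run_until_stopped(run_for: float = 0.05) -> List[str]:
    """Start the service, simulate a shutdown signal, and await the task handle."""
    stop = asyncio.Event()
    log: List[str] = []
    task = asyncio.create_task(flush_service(stop, 0.01, log))

    await asyncio.sleep(run_for)  # stands in for waiting on ctrl+c
    stop.set()                    # request graceful shutdown
    await asyncio.gather(task)    # await all service handles
    log.append("all services stopped")
    return log


if __name__ == "__main__":
    print(asyncio.run(run_until_stopped()))
```

Using an event rather than cancelling the task lets the service finish its final flush before the handle resolves.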
- Configuration — Full reference for all tuning parameters
- Troubleshooting — Diagnosing common issues
- Common workflows — Backup procedures, monitoring setup