Aligned with src/domain/cluster/replication_wire.rs, src/config.rs ([cluster.replication]), src/adapters/cluster/peer_client.rs, and src/adapters/http/peer_handlers.rs (/internal/replicate, /internal/replicate-mutation).
Write replication uses Content-Type: application/vnd.hyperbytedb.replicate+line.v1 with an Influx line protocol body (same encoding as POST /write). Database, retention policy, and optional precision are carried in X-Hyperbytedb-DB, X-Hyperbytedb-RP, and X-Hyperbytedb-Precision. Constants and the hinted-handoff binary envelope live in replication_wire.rs.
There is no JSON body for data replication; metadata mutations still use JSON on /internal/replicate-mutation as a separate API.
Application services fan out writes through the ReplicationPort trait; PeerClient is the production adapter.
Each node has a coordinator-side replication mode controlled by [cluster.replication] in config.toml:
| Mode | Behavior on accepted client write |
|---|---|
async (default) |
Local WAL append → fire-and-forget HTTP fan-out → return 204 to client. Failures trigger hinted handoff and retries; convergence is eventual. |
sync_quorum |
Local WAL append → fan-out with X-Hyperbytedb-Sync: true → await W-of-N peer acks → return to client. On timeout returns 504 and unacked peers fall back to hinted handoff in the background. |
min_acks = "majority" resolves at request time against current active_peers().len(), so the required count auto-adjusts during membership changes (rolling restart, peer crash, scale-out). The local node is never counted toward the quorum — the local WAL append is always done first, so self-durability is implicit.
The mode controls only the coordinator side. Every node always serves both styles from the same /internal/replicate endpoint:
- Header absent or
false→ enqueue and return204immediately (today's behavior, byte-for-byte). - Header
true→ enqueue, await the WAL apply, return200 OKwith{"ok":true,"ack_seq":<u64>}.
This makes any combination of per-node modes safe at any moment, including during a rolling restart where some nodes have flipped to sync_quorum and others have not. A sync_quorum coordinator talking to a peer running async (or an older binary that ignores the header) still works — the peer 200s on success and the coordinator counts that as one ack.
- Header:
X-Hyperbytedb-Sync: true(seeHTTP_HEADER_SYNCinreplication_wire.rs). - Receiver awaits the existing
ReplicationApplyQueueoneshot before responding; the apply queue itself is unchanged and still bounded bycluster.replicate_receiver_queue_depth. - Coordinator counts successes via a
select_allloop and returns oncerequiredpeers ack. The remaining per-peer tasks are NOT cancelled — they continue retrying so all peers eventually persist (and trip hinted handoff on failure).
Next step: opaque WAL / framed log shipping (single encode on the writer, followers append bytes).
Longer term: Raft-style or single-writer log if master–master cost is too high.
Hints are stored as CFh1 binary payloads (ReplicationHintPayload). Older JSON entries in RocksDB are discarded on drain (invalid magic).
See flush_service: per-peer ack watermarks and optional stale peer exclusion via replication_truncate_stale_peer_multiplier in cluster config.
[cluster] in config.rs: replication_queue_depth, replication_max_inflight_batches, replication_max_coalesce_body_bytes, replicate_receiver_queue_depth, replication_truncate_stale_peer_multiplier, etc.
- Outbound: bounded queue, WAL-sequence coalescing, semaphore on fan-out rounds.
- Inbound:
ReplicationApplyQueue; 503 when full.
Startup and reconnect sync exchange metadata snapshots and WAL tail entries via /internal/sync/{manifest,metadata,wal}. Each peer builds local MergeTree tables by flushing replicated WAL entries.
See Deep dive: clustering for bootstrap, drain, and internal endpoints.