A production-grade distributed KV store written in Go, implementing:
- Consistent hashing
- Quorum-based replication
- Vector clocks
- WAL-backed persistence
- Read repair
# Install dependencies
go mod tidy
# Start a single node
go run ./cmd/server --id node1 --addr :8080 --data-dir /tmp/kvstore
# Start a 3-node cluster (3 separate terminals)
go run ./cmd/server --id node1 --addr :8080 --data-dir /tmp/kv \
--peers node2=localhost:8081,node3=localhost:8082 --n 3 --w 2 --r 2
go run ./cmd/server --id node2 --addr :8081 --data-dir /tmp/kv \
--peers node1=localhost:8080,node3=localhost:8082 --n 3 --w 2 --r 2
go run ./cmd/server --id node3 --addr :8082 --data-dir /tmp/kv \
--peers node1=localhost:8080,node2=localhost:8081 --n 3 --w 2 --r 2
# Use the CLI
go run ./cmd/client put hello "world" --server http://localhost:8080
go run ./cmd/client get hello --server http://localhost:8080
go run ./cmd/client delete hello --server http://localhost:8080
go run ./cmd/client cluster nodes --server http://localhost:8080Client (CLI / Go library)
│ HTTP
▼
Coordinator Node
├── Gin Router
├── Replicator (quorum + repair)
└── Store (WAL + snapshot)
│
▼
Replica Nodes
Each request is handled by a coordinator node, which:
- Determines replica nodes using consistent hashing.
- Executes quorum read/write logic.
- Stores data locally using WAL-backed persistence.
- Performs read repair when inconsistencies are detected.
kvstore/
├── README.md
├── go.mod
│
├── cmd/
│ ├── server/
│ │ └── main.go # Node entrypoint, flags, graceful shutdown
│ └── client/
│ └── main.go # Cobra CLI (put / get / delete / cluster)
│
└── internal/
├── store/
│ ├── store.go # In-memory map, Put/Get/Delete, snapshot logic
│ ├── wal.go # Write-Ahead Log (append-only NDJSON)
│ └── vector_clock.go # Vector clock comparison & merge
│
├── cluster/
│ ├── ring.go # Consistent hash ring with virtual nodes
│ ├── membership.go # Node join/leave, replica node lookup
│ └── replicator.go # Quorum writes/reads, read repair, backoff
│
├── api/
│ ├── handlers.go # Gin HTTP handlers (public + internal routes)
│ └── middleware.go # Request logger, panic recovery
│
└── client/
├── client.go # Typed Go client library (Put/Get/Delete)
└── raw.go # Raw HTTP helper for misc endpoints
Every write is appended to an on-disk log before being applied to the in-memory map. If the process crashes mid-write, on restart we replay the WAL from top to bottom to recover exactly the state that existed before the crash.
- Entries are newline-delimited JSON (easy to inspect, easy to parse).
- Each append calls
fsyncto force OS buffers to physical media. - Snapshots compress history: after a snapshot the WAL is truncated.
Key interview point: WAL entries must be idempotent. Re-applying a PUT
twice should produce the same result. Our vector clock comparison in
ApplyRemote handles this: an older entry won't overwrite a newer one.
Keys are mapped to nodes using a hash ring:
- Each physical node is placed at
vnodes(default 150) positions on a 2³² ring by hashing"nodeID#0","nodeID#1", … . - A key is owned by the first node clockwise from
hash(key). - For replication factor N, we walk clockwise and pick the first N distinct physical nodes.
Adding/removing a node only remaps keys between that node and its predecessor (≈ 1/N of all keys), versus naive modular hashing which remaps nearly everything.
Virtual nodes solve the load balance problem: without them, unlucky hash placement could give one node 50% of the ring. 150 vnodes per node gives a standard deviation of ~10% in load.
A vector clock is map[nodeID]→counter. Every write increments the
writer's own counter. When two replicas hold different values for the
same key, we compare their clocks:
| Relationship | Meaning |
|---|---|
| A < B | A happened-before B; discard A |
| A > B | B happened-before A; discard B |
| A ∥ B | Concurrent writes — true conflict |
For conflicts we fall back to wall-clock last-write-wins (pragmatic; used by Cassandra/Riak). A production system could instead surface the conflict to the application.
With N replicas, write quorum W, and read quorum R, strong consistency holds when W + R > N. Classic choice: N=3, W=2, R=2.
Write path:
- Coordinator writes locally (counts as 1 ack).
- Fans out to all N-1 peers in parallel goroutines.
- Returns success once W total acks received.
- Remaining peers catch up asynchronously.
Read path:
- Coordinator asks R replicas for their version.
- Reconciles using vector clocks to pick the winner.
- Any stale replica is updated via read repair (async write-back).
This tolerates up to N-W write failures and N-R read failures.
When a read discovers that some replicas have a stale version, the coordinator asynchronously writes the authoritative version back to those replicas. This is how Cassandra achieves eventual consistency without a background job.
The repair is fire-and-forget (best effort) — if the repair fails, the stale replica will get corrected on the next successful read.
Without snapshots, recovering from a crash requires replaying the entire WAL — unbounded and slow. Snapshots:
- Serialize the in-memory map to
snapshot.jsonvia an atomic write (write to.tmp, thenos.Rename— crash-safe). - Truncate the WAL (everything is captured in the snapshot).
- On startup: load snapshot, then replay only WAL entries written after the snapshot.
Snapshots are taken automatically every 60 seconds in the background goroutine
in cmd/server/main.go, and also on graceful shutdown.
| Method | Path | Description |
|---|---|---|
GET |
/kv/:key |
Read a value (quorum read) |
PUT |
/kv/:key |
Write a value (quorum write). Body: {"value":"…"} |
DELETE |
/kv/:key |
Delete a value (tombstone + replicate) |
GET |
/cluster/nodes |
List all cluster members |
POST |
/cluster/join |
Add a node. Body: {"id":"…","address":"…"} |
POST |
/cluster/leave |
Remove a node. Body: {"id":"…"} |
GET |
/health |
Health check |
POST |
/internal/replicate |
Peer replication endpoint |
GET |
/internal/fetch/:key |
Peer raw-fetch endpoint (for read repair) |