Coordination Services (ZooKeeper / etcd)
What They Provide
Built on consensus (ZAB / Raft), offering strongly consistent primitives that are hard to build correctly from scratch:
| Feature | Description |
|---|---|
| Linearizable atomic operations | Compare-and-set, test-and-set — exactly-once leader election |
| Total ordering of operations | Fencing tokens (monotonically increasing zxid / revision) |
| Failure detection | Sessions + heartbeats; ephemeral nodes/leases deleted on session expiry |
| Change notifications | Clients notified when a key changes (watch) |
| Service discovery | Register service instances; watch for changes |
| Work allocation | Distribute tasks across workers; detect dead workers |
Data Models
| ZooKeeper (ZAB) | etcd (Raft) | |
|---|---|---|
| Model | Hierarchical znodes (like a filesystem) | Flat key-value store |
| Node types | Persistent, Ephemeral, Sequential, Ephemeral-Sequential | Keys with optional TTL lease |
| Watch model | One-shot per watch — fires once, must re-register | Continuous stream — single watch streams all future changes |
| Transactions | multi() — atomic batch of ops |
Txn — if/then/else CAS batch |
| Versioning | version (update count), czxid/mzxid (creation/modify zxid) |
Global revision (monotonically increasing per cluster write) |
Feature 1: Leader Election
ZooKeeper (ZAB)
All candidates try to create: /election/leader (EPHEMERAL node)
Only one CREATE succeeds (linearizable — ZAB ensures one writer)
→ Winner = leader
Losers: watch /election/leader for NodeDeleted event
→ On leader crash: ephemeral node auto-deleted (session expires ~30s after crash)
→ All watchers notified → race to CREATE again → new leader elected
Fencing: leader's session ID acts as epoch; any write from old session ID rejected
after session expiry (zombie leader cannot re-create an ephemeral node)
etcd (Raft)
Use etcd election API (built on leases):
1. Leader creates lease: PUT /election/leader value=nodeA lease=12345 (TTL=10s)
- Succeeds only if key doesn't exist (prevExist=false) → CAS
2. Leader heartbeats lease: KeepAlive(lease=12345) every 3s → extends TTL
3. Leader crashes: lease expires after 10s → key auto-deleted
4. Candidates watch /election/leader → on delete → retry PUT with prevExist=false
→ One wins (Raft serializes all puts) → new leader
Fencing token: lease ID (monotonically increasing) → storage backends check lease ID
to reject writes from expired (zombie) leaders
Key difference: ZooKeeper uses session expiry (TCP keepalives); etcd uses explicit leases with TTL. etcd TTL is more predictable; ZooKeeper session timeout depends on network conditions.
Feature 2: Distributed Lock
ZooKeeper (ZAB) — Herd-Avoiding Lock
Acquire:
1. Create EPHEMERAL SEQUENTIAL node: /locks/lock-0000000042
2. List /locks/ children → get all sequence numbers
3. If my node has the lowest number → I hold the lock
4. Else → watch the predecessor node (lock-0000000041), not all nodes
(avoids herd effect — only 1 waiter woken per unlock)
Release:
- Delete my ephemeral sequential node
- Predecessor's watcher fires → that client checks if it now has lowest → acquires lock
Crash safety:
- Client crashes → session expires → ephemeral node auto-deleted → next waiter woken
- Fencing token = sequence number (monotonically increasing across all locks)
etcd (Raft) — Lease-Based Lock
Acquire:
1. Create lease: leaseid = Grant(TTL=30s)
2. CAS: Txn{ if /locks/mylock doesn't exist → PUT /locks/mylock=clientA with leaseid }
3. If Txn fails → watch /locks/mylock → retry on deletion
Heartbeat:
- KeepAlive(leaseid) every 10s while holding lock
Release:
- DELETE /locks/mylock OR let lease expire (crash recovery)
Fencing:
- Pass leaseid to downstream services; they verify lease still valid via etcd before acting
- Storage: `if request.leaseID == current_lock_leaseID → accept write`
Why fencing tokens matter:
T1: Client A holds lock, leaseid=5. A gets GC pause for 40s.
T2: Lock expires. Client B acquires lock, leaseid=6.
T3: A wakes up, still thinks it holds lock → tries to write with leaseid=5
Storage sees leaseid=5 < current=6 → REJECT → split-brain prevented
Feature 3: Failure Detection / Ephemeral Nodes
ZooKeeper (ZAB)
Client opens session → negotiates sessionTimeout (e.g., 30s)
Client sends heartbeats every sessionTimeout/3 (~10s)
If ZooKeeper doesn't hear from client for sessionTimeout:
→ Session declared expired
→ All EPHEMERAL nodes created by that session → auto-deleted
→ All watches on those nodes → fire NodeDeleted event
Use case: service registry
Service A creates /services/api/A (EPHEMERAL)
A crashes → after 30s, /services/api/A deleted → load balancer watch fires → removes A
etcd (Raft) — Leases
Client creates lease: leaseid = Grant(TTL=15s)
Client attaches keys to lease: PUT /services/api/A lease=leaseid
Client sends KeepAlive every 5s
If KeepAlive stops:
→ After TTL expires, lease revoked
→ All keys attached to lease → auto-deleted
→ Watch on those keys → fires DELETE event
Advantage over ZooKeeper: lease TTL is explicit and tunable per key;
ZooKeeper session timeout is per-connection (all ephemeral nodes share one timeout)
Feature 4: Watch / Change Notifications
ZooKeeper (ZAB) — One-Shot Watch
Client: getData("/config/timeout", watch=true)
Server: returns current value + registers one-shot watch
Config changes:
→ Server sends WatchedEvent(NodeDataChanged, /config/timeout) to client
→ Watch is consumed (one-shot) → client must re-register watch
Gotcha: between event delivery and re-registration, changes can be missed
→ Client must re-read data + re-register atomically:
getData("/config/timeout", watch=true) ← atomic read+register
etcd (Raft) — Continuous Stream Watch
Client: Watch("/config/", prefix=true) ← watches entire subtree
Server: returns a stream; keeps sending events indefinitely
Event types: PUT (create/update), DELETE
Each event carries: key, value, prev_value, revision
Client processes events sequentially; no re-registration needed
On disconnect: resume with Watch("/config/", start_revision=lastSeen+1)
→ no events missed (etcd keeps compacted history)
Used by Kubernetes: controller-manager watches /registry/pods/ stream
→ reacts to every pod create/update/delete in real time
Feature 5: Service Discovery
ZooKeeper (ZAB)
Registration (on service start):
Create EPHEMERAL node: /services/payments/10.0.1.5:8080
Node data: {"version":"2.1","region":"us-east","weight":100}
Discovery (load balancer / client):
getChildren("/services/payments", watch=true)
→ Returns ["10.0.1.5:8080", "10.0.1.6:8080"]
→ On NodeChildrenChanged event → refresh list
Deregistration:
Explicit: delete node on graceful shutdown
Implicit: crash → session expires → ephemeral node deleted → watch fires
Used by: Kafka (broker list in /brokers/ids/), HBase (RegionServer list)
etcd (Raft)
Registration:
leaseid = Grant(TTL=15s)
PUT /services/payments/10.0.1.5:8080 value=<metadata> lease=leaseid
KeepAlive(leaseid) every 5s
Discovery:
Watch("/services/payments/", prefix=true)
→ Stream of PUT/DELETE events as instances come and go
Used by: Kubernetes endpoints controller, Consul (uses Raft internally),
CoreDNS (reads service records from etcd)
Feature 6: Configuration Management
ZooKeeper (ZAB)
Store config at: /config/feature_flags (PERSISTENT node)
All services: getData("/config/feature_flags", watch=true)
→ cache value locally
→ on NodeDataChanged → re-read → update local cache
Update: setData("/config/feature_flags", newValue)
→ ZAB broadcasts to all ZooKeeper nodes → consistent across cluster
→ all watchers notified
Versioning: each setData increments node's `version` field
CAS update: setData(path, value, expectedVersion) → fails if version mismatch
etcd (Raft)
Store: PUT /config/feature_flags value=<json>
All services: Watch("/config/feature_flags")
→ stream of updates; each carries revision number
CAS update:
Txn{
if /config/feature_flags.mod_revision == 42 → PUT /config/feature_flags = newValue
else → no-op (someone else updated it)
}
History: etcd retains old revisions until compacted
GET /config/feature_flags?revision=100 → value as of revision 100
Useful for: audit log, rollback, debugging config drift
ZooKeeper vs etcd — Detailed Comparison
| ZooKeeper (ZAB) | etcd (Raft) | |
|---|---|---|
| Consensus | ZAB (epoch-based, explicit sync phase) | Raft (term-based, log-first) |
| API | Custom binary protocol | gRPC + HTTP/2 |
| Data model | Hierarchical znodes (like inode tree) | Flat KV with range queries |
| Watch | One-shot, re-register required | Continuous stream, resume by revision |
| Lease/Session | Per-connection session (TCP keepalive) | Explicit per-key lease (TTL) |
| Transactions | multi() — atomic, no conditions |
Txn — if/then/else with CAS |
| Read linearizability | sync() call required for linearizable read |
serializable vs linearizable read option |
| Write throughput | ~10k–50k ops/s (Java GC impact) | ~10k–100k ops/s (Go, lower GC jitter) |
| Typical cluster size | 3 or 5 nodes | 3 or 5 nodes |
| Used by | Kafka, HBase, Storm, Hadoop HDFS HA | Kubernetes, CockroachDB, TiKV (PD), Consul |
| Operational complexity | Higher (JVM tuning, GC pauses affect sessions) | Lower (single Go binary) |
Usually Used Indirectly
| System | Uses | For |
|---|---|---|
| Kafka (pre-KRaft) | ZooKeeper | Controller election, broker registry, topic metadata |
| Kafka KRaft | Built-in Raft | Replaced ZooKeeper entirely (Kafka 3.3+) |
| Kubernetes | etcd | All cluster state: pods, services, configmaps, secrets |
| HBase | ZooKeeper | Master election, RegionServer liveness, root region location |
| HDFS HA | ZooKeeper | NameNode active/standby failover |
| TiKV / TiDB | etcd (via PD) | Region metadata, TSO, placement scheduling |
| CockroachDB | Built-in Raft | Per-range consensus; no external coordinator needed |
Key Principle
Use ZooKeeper/etcd for coordination — not user data. They are optimized for small, highly consistent metadata.
- Typical payload: bytes to KB per key (not MB)
- Write throughput: ~10k–100k ops/s (consensus cost per write)
- Read throughput: high (followers or cached); writes serialized through leader
- Cluster size: always odd (3, 5, 7) — majority quorum required