Coordination Services (ZooKeeper / etcd)

What They Provide

Built on consensus (ZAB / Raft), offering strongly consistent primitives that are hard to build correctly from scratch:

• Linearizable atomic operations: compare-and-set, test-and-set; at most one candidate wins leader election
• Total ordering of operations: fencing tokens (monotonically increasing zxid / revision)
• Failure detection: sessions + heartbeats; ephemeral nodes/leases deleted on session expiry
• Change notifications: clients notified when a key changes (watch)
• Service discovery: register service instances; watch for changes
• Work allocation: distribute tasks across workers; detect dead workers

Data Models

             | ZooKeeper (ZAB)                                          | etcd (Raft)
Model        | hierarchical znodes (like a filesystem)                  | flat key-value store
Node types   | persistent, ephemeral, sequential, ephemeral-sequential  | keys with optional TTL lease
Watch model  | one-shot per watch: fires once, must re-register         | continuous stream: one watch streams all future changes
Transactions | multi(): atomic batch of ops                             | Txn: if/then/else CAS batch
Versioning   | version (update count), czxid/mzxid (create/modify zxid) | global revision (monotonically increasing per cluster write)

Feature 1: Leader Election

ZooKeeper (ZAB)

All candidates try to create: /election/leader  (EPHEMERAL node)

Only one CREATE succeeds (linearizable — ZAB ensures one writer)
  → Winner = leader

Losers: watch /election/leader for NodeDeleted event
  → On leader crash: ephemeral node auto-deleted (session expires ~30s after crash)
  → All watchers notified → race to CREATE again → new leader elected

Fencing: leader's session ID acts as epoch; any write from old session ID rejected
         after session expiry (zombie leader cannot re-create an ephemeral node)

etcd (Raft)

Use etcd election API (built on leases):

1. Leader creates lease:    PUT /election/leader  value=nodeA  lease=12345  (TTL=10s)
   - Succeeds only if key doesn't exist (Txn compare: create_revision == 0) → CAS

2. Leader heartbeats lease: KeepAlive(lease=12345) every 3s → extends TTL

3. Leader crashes: lease expires after 10s → key auto-deleted

4. Candidates watch /election/leader → on delete → retry the create-if-absent Txn
   → One wins (Raft serializes all puts) → new leader

Fencing token: the lock key's revision (monotonically increasing) → storage backends check
               the token to reject writes from expired (zombie) leaders
               (note: etcd lease IDs themselves are not guaranteed to increase monotonically)

Key difference: ZooKeeper ties liveness to a per-connection session kept alive by client heartbeats; etcd uses explicit per-key leases with a TTL. etcd's TTL is explicit and predictable; ZooKeeper's effective session expiry depends on heartbeat delivery over the network.
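The create-if-absent race above can be sketched against a toy in-memory store. `TinyKV` and its method names are illustrative stand-ins, not the etcd client API; the only property borrowed from etcd is a single monotonically increasing revision counter:

```python
import itertools

class TinyKV:
    """Toy linearizable KV store: create-if-absent mimics the election CAS.
    Operations are assumed serialized, as a Raft leader would serialize them."""
    def __init__(self):
        self.data = {}
        self.revision = itertools.count(1)  # monotonically increasing, like etcd's revision

    def create_if_absent(self, key, value):
        """Returns (won, revision). Only the first caller for a key wins."""
        if key in self.data:
            return False, None
        rev = next(self.revision)
        self.data[key] = (value, rev)
        return True, rev

    def delete(self, key):
        self.data.pop(key, None)  # simulates lease expiry removing the key

kv = TinyKV()
won_a, rev_a = kv.create_if_absent("/election/leader", "nodeA")
won_b, rev_b = kv.create_if_absent("/election/leader", "nodeB")
print(won_a, won_b)            # exactly one candidate wins the create
kv.delete("/election/leader")  # old leader's lease expires
won_b2, rev_b2 = kv.create_if_absent("/election/leader", "nodeB")
print(rev_b2 > rev_a)          # each new leader's revision is strictly higher
```

Because the revision only ever grows, the winner's revision doubles as a fencing token: any later leader holds a strictly larger one.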


Feature 2: Distributed Lock

ZooKeeper (ZAB) — Herd-Avoiding Lock

Acquire:
  1. Create EPHEMERAL SEQUENTIAL node: /locks/lock-0000000042
  2. List /locks/ children → get all sequence numbers
  3. If my node has the lowest number → I hold the lock
  4. Else → watch the predecessor node (lock-0000000041), not all nodes
             (avoids herd effect — only 1 waiter woken per unlock)

Release:
  - Delete my ephemeral sequential node
  - Predecessor's watcher fires → that client checks if it now has lowest → acquires lock

Crash safety:
  - Client crashes → session expires → ephemeral node auto-deleted → next waiter woken
  - Fencing token = sequence number (monotonically increasing across all locks)
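The acquire/release steps above can be simulated in a few lines. `LockQueue` is a hypothetical in-memory model, not the ZooKeeper client; "session expiry" is just a call to release:

```python
class LockQueue:
    """Simulates /locks/ with ephemeral-sequential nodes: each client gets an
    increasing sequence number; a waiter watches only its predecessor."""
    def __init__(self):
        self.next_seq = 0
        self.waiting = {}   # sequence number -> client name

    def acquire(self, client):
        seq = self.next_seq
        self.next_seq += 1
        self.waiting[seq] = client
        return seq

    def holder(self):
        """Lock holder = client with the lowest sequence number."""
        return self.waiting[min(self.waiting)] if self.waiting else None

    def watch_target(self, my_seq):
        """The single node this waiter watches: its immediate predecessor."""
        preds = [s for s in self.waiting if s < my_seq]
        return max(preds) if preds else None  # None: I already hold the lock

    def release(self, seq):
        del self.waiting[seq]  # also models session expiry deleting the node

q = LockQueue()
a, b, c = q.acquire("A"), q.acquire("B"), q.acquire("C")
print(q.holder())          # A holds the lock (lowest sequence)
print(q.watch_target(c))   # C watches only B's node, not A's: no herd effect
q.release(a)
print(q.holder())          # B, the single woken waiter, now holds the lock
```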

etcd (Raft) — Lease-Based Lock

Acquire:
  1. Create lease: leaseid = Grant(TTL=30s)
  2. CAS: Txn{ if /locks/mylock doesn't exist → PUT /locks/mylock=clientA with leaseid }
  3. If Txn fails → watch /locks/mylock → retry on deletion

Heartbeat:
  - KeepAlive(leaseid) every 10s while holding lock

Release:
  - DELETE /locks/mylock  OR  let lease expire (crash recovery)

Fencing:
  - Pass leaseid to downstream services; they verify lease still valid via etcd before acting
  - Storage: `if request.leaseID == current_lock_leaseID → accept write`

Why fencing tokens matter:

T1: Client A holds the lock with fencing token 5. A hits a 40s GC pause.
T2: The lock expires. Client B acquires it with fencing token 6.
T3: A wakes up, still believing it holds the lock → writes with token 5
    Storage sees token 5 < current 6 → REJECT → split-brain prevented
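The timeline above reduces to a one-comparison check on the storage side. A sketch, with `FencedStorage` as a hypothetical downstream store that trusts only monotonically increasing tokens:

```python
class FencedStorage:
    """Downstream store that only accepts writes carrying a token at least as
    high as the highest it has seen: stale (zombie) writers are rejected."""
    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            return False          # zombie: token predates the current lock holder
        self.highest_token = token
        self.value = value
        return True

store = FencedStorage()
print(store.write(5, "from A"))   # A writes while holding the lock → accepted
print(store.write(6, "from B"))   # B acquired the lock after A's lease expired
print(store.write(5, "late A"))   # A wakes from its GC pause → rejected
print(store.value)                # "from B": split-brain prevented
```

The essential requirement is that the storage layer participates: a fencing token is useless if downstream systems do not check it.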

Feature 3: Failure Detection / Ephemeral Nodes

ZooKeeper (ZAB)

Client opens session → negotiates sessionTimeout (e.g., 30s)
Client sends heartbeats every sessionTimeout/3 (~10s)

If ZooKeeper doesn't hear from client for sessionTimeout:
  → Session declared expired
  → All EPHEMERAL nodes created by that session → auto-deleted
  → All watches on those nodes → fire NodeDeleted event

Use case: service registry
  Service A creates /services/api/A (EPHEMERAL)
  A crashes → after 30s, /services/api/A deleted → load balancer watch fires → removes A

etcd (Raft) — Leases

Client creates lease: leaseid = Grant(TTL=15s)
Client attaches keys to lease: PUT /services/api/A  lease=leaseid
Client sends KeepAlive every 5s

If KeepAlive stops:
  → After TTL expires, lease revoked
  → All keys attached to lease → auto-deleted
  → Watch on those keys → fires DELETE event

Advantage over ZooKeeper: lease TTL is explicit and tunable per key;
ZooKeeper session timeout is per-connection (all ephemeral nodes share one timeout)
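The grant/keep-alive/expire lifecycle can be modeled with a manual clock. `LeaseTable` is an illustrative simulation, not the etcd lease API; `tick` stands in for the server's expiry sweep:

```python
class LeaseTable:
    """Leases with TTLs against a manual clock; keys attached to an expired
    lease are removed, mimicking etcd's revocation of attached keys."""
    def __init__(self):
        self.now = 0.0
        self.leases = {}   # lease_id -> expiry time
        self.keys = {}     # key -> lease_id

    def grant(self, lease_id, ttl):
        self.leases[lease_id] = self.now + ttl

    def keep_alive(self, lease_id, ttl):
        self.leases[lease_id] = self.now + ttl  # heartbeat extends expiry

    def put(self, key, lease_id):
        self.keys[key] = lease_id

    def tick(self, dt):
        self.now += dt
        expired = {l for l, exp in self.leases.items() if exp <= self.now}
        for l in expired:
            del self.leases[l]
        self.keys = {k: l for k, l in self.keys.items() if l not in expired}

t = LeaseTable()
t.grant(1, ttl=15)
t.put("/services/api/A", lease_id=1)
t.tick(10); t.keep_alive(1, ttl=15)   # heartbeat arrives in time
t.tick(10)
print("/services/api/A" in t.keys)    # True: lease was extended
t.tick(20)                            # heartbeats stop (client crashed)
print("/services/api/A" in t.keys)    # False: lease expired, key auto-deleted
```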

Feature 4: Watch / Change Notifications

ZooKeeper (ZAB) — One-Shot Watch

Client: getData("/config/timeout", watch=true)
Server: returns current value + registers one-shot watch

Config changes:
  → Server sends WatchedEvent(NodeDataChanged, /config/timeout) to client
  → Watch is consumed (one-shot) → client must re-register watch

Gotcha: between event delivery and re-registration, changes can be missed
  → Client must re-read data + re-register atomically:
    getData("/config/timeout", watch=true)  ← atomic read+register

etcd (Raft) — Continuous Stream Watch

Client: Watch("/config/", prefix=true)  ← watches entire subtree
Server: returns a stream; keeps sending events indefinitely

Event types: PUT (create/update), DELETE
Each event carries: key, value, prev_value, revision

Client processes events sequentially; no re-registration needed
On disconnect: resume with Watch("/config/", start_revision=lastSeen+1)
               → no events missed, as long as lastSeen+1 has not been compacted away
                 (etcd retains MVCC history until compaction)
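Resume-by-revision can be sketched with a plain event log. `WatchLog` is a toy model, not the etcd watch API; the point is only that a global revision lets a reconnecting watcher replay exactly what it missed:

```python
class WatchLog:
    """Event log with monotonically increasing revisions; a disconnected
    watcher resumes from its last seen revision without missing events."""
    def __init__(self):
        self.events = []   # list of (revision, key, value)
        self.rev = 0

    def put(self, key, value):
        self.rev += 1
        self.events.append((self.rev, key, value))

    def watch(self, start_revision):
        """Return all events at or after start_revision, in order."""
        return [e for e in self.events if e[0] >= start_revision]

log = WatchLog()
log.put("/config/a", "1")
seen = log.watch(start_revision=1)
last_seen = seen[-1][0]
# ... client disconnects; writes keep happening ...
log.put("/config/b", "2")
log.put("/config/a", "3")
resumed = log.watch(start_revision=last_seen + 1)
print([e[1] for e in resumed])   # both missed events are replayed, in order
```

Contrast with a one-shot watch: there is no window between "event delivered" and "watch re-registered" in which a change can slip through, because the log itself is the source of truth.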

Used by Kubernetes: controller-manager watches /registry/pods/ stream
  → reacts to every pod create/update/delete in real time

Feature 5: Service Discovery

ZooKeeper (ZAB)

Registration (on service start):
  Create EPHEMERAL node: /services/payments/10.0.1.5:8080
  Node data: {"version":"2.1","region":"us-east","weight":100}

Discovery (load balancer / client):
  getChildren("/services/payments", watch=true)
  → Returns ["10.0.1.5:8080", "10.0.1.6:8080"]
  → On NodeChildrenChanged event → refresh list

Deregistration:
  Explicit: delete node on graceful shutdown
  Implicit: crash → session expires → ephemeral node deleted → watch fires

Used by: Kafka (broker list in /brokers/ids/), HBase (RegionServer list)
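The register/expire/notify cycle can be simulated in a few lines. `Registry` is a hypothetical in-memory model, not a ZooKeeper client; the listener callback plays the role of a load balancer's children watch:

```python
class Registry:
    """Service registry where each instance's entry is tied to a session;
    session expiry removes the entry and notifies list watchers."""
    def __init__(self):
        self.instances = {}   # address -> session_id
        self.listeners = []   # callbacks invoked with the fresh instance list

    def register(self, address, session_id):
        self.instances[address] = session_id
        self.notify()

    def expire_session(self, session_id):
        dead = [a for a, s in self.instances.items() if s == session_id]
        for a in dead:
            del self.instances[a]
        if dead:
            self.notify()

    def notify(self):
        snapshot = sorted(self.instances)
        for cb in self.listeners:
            cb(snapshot)

backend_lists = []
r = Registry()
r.listeners.append(backend_lists.append)   # "load balancer" records each refresh
r.register("10.0.1.5:8080", session_id=101)
r.register("10.0.1.6:8080", session_id=102)
r.expire_session(101)                      # instance A crashes, session times out
print(backend_lists[-1])                   # only the live instance remains
```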

etcd (Raft)

Registration:
  leaseid = Grant(TTL=15s)
  PUT /services/payments/10.0.1.5:8080  value=<metadata>  lease=leaseid
  KeepAlive(leaseid) every 5s

Discovery:
  Watch("/services/payments/", prefix=true)
  → Stream of PUT/DELETE events as instances come and go

Used by: Kubernetes endpoints controller, Consul (uses Raft internally),
         CoreDNS (reads service records from etcd)

Feature 6: Configuration Management

ZooKeeper (ZAB)

Store config at: /config/feature_flags  (PERSISTENT node)

All services: getData("/config/feature_flags", watch=true)
  → cache value locally
  → on NodeDataChanged → re-read → update local cache

Update: setData("/config/feature_flags", newValue)
  → ZAB broadcasts to all ZooKeeper nodes → consistent across cluster
  → all watchers notified

Versioning: each setData increments node's `version` field
            CAS update: setData(path, value, expectedVersion) → fails if version mismatch

etcd (Raft)

Store: PUT /config/feature_flags  value=<json>

All services: Watch("/config/feature_flags")
  → stream of updates; each carries revision number

CAS update:
  Txn{
    if /config/feature_flags.mod_revision == 42  →  PUT /config/feature_flags = newValue
    else → no-op (someone else updated it)
  }

History: etcd retains old revisions until compacted
  GET /config/feature_flags?revision=100  → value as of revision 100
  Useful for: audit log, rollback, debugging config drift
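The mod_revision guard above is classic optimistic concurrency control. A toy sketch; `ConfigStore` and `txn_put_if_revision` are illustrative names, not the etcd Txn API:

```python
class ConfigStore:
    """KV with per-key mod_revision; an update commits only if the caller's
    expected revision still matches (optimistic concurrency control)."""
    def __init__(self):
        self.rev = 0
        self.data = {}   # key -> (value, mod_revision)

    def put(self, key, value):
        self.rev += 1
        self.data[key] = (value, self.rev)
        return self.rev

    def txn_put_if_revision(self, key, expected_rev, value):
        _, mod_rev = self.data[key]
        if mod_rev != expected_rev:
            return False   # someone else updated it first → no-op
        self.put(key, value)
        return True

s = ConfigStore()
rev = s.put("/config/feature_flags", '{"dark_mode": false}')
# two writers race to update from the same snapshot:
print(s.txn_put_if_revision("/config/feature_flags", rev, '{"dark_mode": true}'))
print(s.txn_put_if_revision("/config/feature_flags", rev, '{"beta": true}'))
print(s.data["/config/feature_flags"][0])   # first writer's value won
```

The losing writer re-reads the key (getting the new revision) and retries, so concurrent updates serialize without locks.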

ZooKeeper vs etcd — Detailed Comparison

                       | ZooKeeper (ZAB)                                | etcd (Raft)
Consensus              | ZAB (epoch-based, explicit sync phase)         | Raft (term-based, log-first)
API                    | custom binary protocol                         | gRPC + HTTP/2
Data model             | hierarchical znodes (inode-like tree)          | flat KV with range queries
Watch                  | one-shot, re-register required                 | continuous stream, resume by revision
Lease/Session          | per-connection session (client heartbeats)     | explicit per-key lease (TTL)
Transactions           | multi(): atomic, no conditions                 | Txn: if/then/else with CAS
Read linearizability   | sync() call required for linearizable read     | serializable vs linearizable read option
Write throughput       | ~10k–50k ops/s (Java GC impact)                | ~10k–100k ops/s (Go, lower GC jitter)
Typical cluster size   | 3 or 5 nodes                                   | 3 or 5 nodes
Used by                | Kafka, HBase, Storm, Hadoop HDFS HA            | Kubernetes, CockroachDB, TiKV (PD), Consul
Operational complexity | higher (JVM tuning; GC pauses affect sessions) | lower (single Go binary)

Usually Used Indirectly

System            | Uses          | For
Kafka (pre-KRaft) | ZooKeeper     | controller election, broker registry, topic metadata
Kafka (KRaft)     | built-in Raft | replaced ZooKeeper entirely (Kafka 3.3+)
Kubernetes        | etcd          | all cluster state: pods, services, configmaps, secrets
HBase             | ZooKeeper     | master election, RegionServer liveness, root region location
HDFS HA           | ZooKeeper     | NameNode active/standby failover
TiKV / TiDB       | etcd (via PD) | region metadata, TSO, placement scheduling
CockroachDB       | built-in Raft | per-range consensus; no external coordinator needed

Key Principle

Use ZooKeeper/etcd for coordination — not user data. They are optimized for small, highly consistent metadata.

  • Typical payload: bytes to KB per key (not MB)
  • Write throughput: ~10k–100k ops/s (consensus cost per write)
  • Read throughput: high (followers or cached); writes serialized through leader
  • Cluster size: always odd (3, 5, 7) — majority quorum required
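The odd-cluster-size rule is just majority arithmetic, worth making explicit:

```python
def quorum(n):
    """Majority quorum size for an n-node cluster."""
    return n // 2 + 1

def tolerated_failures(n):
    """Nodes that can fail while a majority remains reachable."""
    return n - quorum(n)

for n in (3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# 3 nodes tolerate 1 failure; 4 nodes still tolerate only 1
# (why even sizes are avoided); 5 nodes tolerate 2
```

Adding a fourth node raises the quorum to 3 without raising fault tolerance, so it only adds replication cost.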