Replication
Why Replicate?
- Fault tolerance (survive node failures)
- Scalability (spread read load)
- Reduce latency (serve from nearby replica)
Replication Algorithm Cost (by message count)
| Messages | Algorithm |
|---|---|
| 1n | Async primary/backup |
| 2n | Sync primary/backup |
| 4n | 2-phase commit, Multi-Paxos |
| 6n | 3-phase commit, Paxos with leader election |
Leader-Based (Primary/Backup)
- All writes go through leader → forwarded to followers
- Reads from any replica (including leader)
Sync vs. Async
| Mode | Guarantee | Risk |
|---|---|---|
| Synchronous | Follower has up-to-date copy | Blocks if any follower unresponsive |
| Asynchronous | Low latency | Data loss on leader failure |
| Semi-sync | One follower sync, rest async | Practical balance |
Replication Log Formats
| Type | How | Pros | Cons |
|---|---|---|---|
| Statement-based | SQL sent as-is | Simple | Non-determinism (NOW(), RAND()) |
| Write-ahead log | Append-only byte log | Accurate | Coupled to storage engine version |
| Logical log | Row-level change records | Decoupled | More bytes |
| Trigger-based | App code decides | Flexible | Overhead, bug-prone |
Adding a Follower
- Snapshot leader DB
- Copy snapshot to new follower
- Follower requests all changes since snapshot (via log sequence number)
- Follower catches up → serves requests
Leader Failover Process
1. Detect failure — heartbeat timeout (typically 30s); avoid false positives from GC pauses
2. Elect new leader — replica with most up-to-date replication offset wins
(Raft: candidate requests votes; needs majority)
(Sentinel: ≥ quorum Sentinels agree)
3. Reconfigure clients— old leader connections rejected; clients redirect to new leader
4. Old leader rejoins — must be demoted to follower; must discard any writes not replicated
| Method | How | Used By |
|---|---|---|
| Raft election | Followers time out → become candidate → request votes → majority = leader | etcd, CockroachDB, TiKV |
| Sentinel | External watchers monitor leader; quorum vote to promote replica | Redis Sentinel |
| Manual failover | Operator promotes replica | PostgreSQL (default), MySQL |
Failover Risks & Mitigations
| Risk | What happens | Mitigation |
|---|---|---|
| Data loss | New leader lacks writes not yet replicated from old leader | Semi-sync replication — wait for ≥1 follower ACK before returning success |
| Split brain | Old leader revives, both accept writes | Fencing: epoch/generation number; old leader rejects writes if it sees higher epoch; STONITH (shoot the other node) |
| False positive | High load / GC pause looks like failure → unnecessary failover | Tunable timeout + hysteresis (require N consecutive misses, not just 1) |
| Replica too far behind | Promoted replica diverges significantly → data loss even with election | Set max replication lag threshold; don't elect replicas lagging >N bytes/ms |
Replication Lag Problems
| Problem | Definition | Fix |
|---|---|---|
| Read-after-write | User can't see own submitted writes | Route own-data reads to leader; track logical timestamp |
| Monotonic reads | User sees data go "backwards in time" | Same replica per user session |
| Consistent prefix | Causally related writes seen out of order | Same partition for causally related writes |
Rule: If you can't tolerate minutes of lag, use stronger consistency guarantees.
Multi-Leader Replication
One leader per datacenter.
Pros: hidden inter-DC latency, independent per-DC operation, tolerates network issues Cons: write conflicts, breaks auto-increment keys/triggers/constraints
Avoid if possible — dangerous and complex.
Conflict Resolution
| Strategy | Trade-off |
|---|---|
| Last write wins (LWW) | Simple, data loss risk |
| Higher replica ID wins | Deterministic, data loss |
| Merge values | No loss, app complexity |
| Explicit conflict record | Lossless, requires app handling |
Topologies
| Topology | Risk |
|---|---|
| All-to-all | Causality issues (some links faster) |
| Star / Circular | Single point of failure |
Leaderless Replication (Dynamo-style)
Client sends writes/reads to multiple replicas in parallel. No single leader.
Quorum Formula
w + r > n
n = replicas, w = write acks required, r = replicas read
| Config (n=3) | Write | Read | Trade-off |
|---|---|---|---|
| w=3, r=1 | Strong durability | Fast read | Slow writes |
| w=2, r=2 | Balanced | Balanced | Default |
| w=1, r=3 | Fast write | Strong read | Risky writes |
Sloppy quorum: Dynamo allows writes on replacement nodes during failures → breaks quorum overlap → not truly consistent — only probabilistic.
Conflict Detection on Read
- Read R nodes → compare versions
- Discard strictly older (via vector clock)
- If concurrent versions (incomparable clocks) → return all → app resolves
Read Repair + Anti-Entropy
- Read repair: on read, detect stale replica → write newest value back
- Anti-entropy: background process constantly compares and syncs differences
Leaderless "Failover" — No Election Needed
Leaderless systems have no leader to fail over from. Instead, availability is structural:
Normal (n=3, w=2, r=2):
Client writes to all 3 → 2 ACK → success
Node A goes down → write to B + C → still meets w=2
Node A recovers:
Stale → read repair catches up A on next read
Or: anti-entropy background sync catches up A
| Scenario | What happens | Recovery |
|---|---|---|
| 1 of 3 nodes down | Writes/reads still meet quorum (w=2, r=2) | Node rejoins → read repair + anti-entropy syncs missed writes |
| 2 of 3 nodes down | Quorum unmet → writes/reads fail (CP behavior) | Wait for nodes to recover; or relax to sloppy quorum |
| Network partition | Two sides each below quorum → reject | Each side rejects; no split-brain writes (unlike multi-leader) |
| Sloppy quorum | Accept writes on available nodes even if not the "home" N | Hinted handoff: writes stored temporarily on substitute; forwarded when target recovers |
Hinted handoff: substitute node stores write with a hint "this belongs to node A"; when A recovers, substitute forwards the write. Improves availability but breaks quorum overlap guarantees — only best-effort durability.
Anti-entropy with Merkle trees: each node builds a Merkle tree of its key range; nodes exchange tree hashes to find differing subtrees efficiently → sync only divergent keys (O(diff) not O(total)).
Weak Consistency / Eventual Consistency (Dynamo-style Systems)
Strong consistency is expensive — information travels at light speed, majority coordination per op is costly.
Two Categories of Eventual Consistency
| Category | Guarantee | Example |
|---|---|---|
| Probabilistic | May see anomalies during normal ops; detects conflicts later | Amazon Dynamo |
| Strong eventual | Convergence to equivalent sequential execution; no anomalies | CRDTs |
Probabilistically Bounded Staleness (PBS)
Research by Bailis et al. (2012): eventually consistent stores are often consistent within tens to hundreds of milliseconds in practice.
Example (Cassandra): R=1/W=1 → inconsistency window ~1352ms; R=2/W=1 → ~202ms.
CRDTs (Convergent Replicated Data Types)
Operations that are commutative + associative + idempotent converge regardless of order — no coordination needed.
Three Required Properties
- Associative:
(a+(b+c)) = (a+b)+c - Commutative:
a+b = b+a - Idempotent:
a+a = a
These form a join semilattice — any data type expressible as a semilattice can guarantee convergence.
CRDT Types
| Type | Merge Rule | Notes |
|---|---|---|
| G-Counter | Sum per-node counters | Grow-only |
| PN-Counter | Increments − decrements | Two G-counters |
| G-Set | Union | No removal |
| 2P-Set | Add-set ∪ Remove-set | Removes win |
| OR-Set | Track add/remove pairs | Removes only win if they saw the add |
| LWW-Register | max(timestamp) wins | Data loss possible |
| MV-Register | Keep concurrent versions | App merges |
Real Systems
| System | Usage |
|---|---|
| Redis Enterprise (Active-Active) | CRDT counters for geo-distributed deployments |
| Riak | OR-Set for shopping cart, PN-Counter |
| Apple iCloud Notes | Text CRDT for offline merge |
| Figma | CRDT-like OT for concurrent shape edits |
CALM Theorem
"Logically monotonic programs are guaranteed to be eventually consistent."
- Monotonic: new information never invalidates prior conclusions — safe without coordination
- Non-monotonic: new knowledge can invalidate conclusions — requires coordination
Monotonic (no coordination needed)
Selection, projection, join, union, recursive Datalog
Non-monotonic (coordination required)
Negation, set difference, aggregation, universal quantification
Key insight: coordination protocols ARE aggregations — 2PC requires unanimous votes, Paxos requires majority votes. Aggregation = non-monotonic = requires coordination.
Practical: many computations can run coordination-free and only need coordination when passing results to external systems.
Replica Synchronization Methods
| Method | How | Trade-off |
|---|---|---|
| Gossip | Random peer selection every t seconds | Scalable, no SPOF, probabilistic only |
| Merkle trees | Hash content at multiple granularities; exchange only diff | Efficient key comparison across nodes |
Summary: Replication Model Selection
| Model | Best For | Watch Out |
|---|---|---|
| Single-leader (async) | Read-heavy, tolerate lag | Data loss on leader failure |
| Single-leader (sync) | Strong durability | Blocks on slow followers |
| Multi-leader | Multi-datacenter writes | Write conflicts, complexity |
| Leaderless | High availability, write-intensive | Eventual consistency, conflicts |
| CRDTs | Auto-merging without coordination | Limited to specific data shapes |