Multi-Raft

Problem: One Raft Group Doesn't Scale

Single Raft group → single leader → all writes serialized through one node → throughput ceiling.

Multi-Raft

Run many independent Raft groups in parallel — one per data shard (Range in CockroachDB, Region in TiKV).

Node 1          Node 2          Node 3
┌──────────┐   ┌──────────┐   ┌──────────┐
│ Region A │   │ Region A │   │ Region A │  ← Raft group A
│ Region B │   │ Region B │   │ Region B │  ← Raft group B
│ Region C │   │ Region C │   │ Region C │  ← Raft group C
└──────────┘   └──────────┘   └──────────┘
Each node hosts hundreds of Raft groups simultaneously

Each region/range = its own independent Raft consensus group (typically 3 replicas)
Writes to different regions proceed in parallel — no global serialization
Each region has its own leader (may be on different nodes)
Adding nodes → more regions distributed → linear write scalability

Challenges

1. Heartbeat Storm

Naive: every Raft group sends heartbeats independently → O(regions × nodes) messages/sec
1000 regions × 3 nodes = 3000 heartbeats/sec just for liveness

Fix — Coalesced Heartbeats (CockroachDB):

Instead of: Region1→Node2, Region2→Node2, Region3→Node2 (separate msgs)
Send:       Node1→Node2 { region1: ok, region2: ok, region3: ok } (one msg)

Reduces heartbeat traffic from O(regions) to O(nodes).

2. Leader Placement

Leaders should be co-located with the node serving client requests
Leaders scattered randomly → extra hops

Fix: Placement Driver (TiKV) / Rebalancer (CockroachDB) actively moves leaders to balance load and minimize cross-node RPCs.

3. Snapshot Overhead

New replica or lagging replica needs a full snapshot
Naive: snapshot per region separately → bursty I/O

Fix: Rate-limit snapshot transfers; use RocksDB IngestExternalFile for fast atomic snapshot application (TiKV).

4. Linearizable Reads Without Log Writes

Naive: every read goes through Raft log → write amplification just for reads

Fix — Read Index (Raft optimization):

Leader records current commit index
Sends heartbeat to confirm it's still the leader (majority ACK)
Once responses received, serve read locally — no log entry needed
Guarantees: read sees all writes committed before the read started

Region / Range Sizing

System	Unit	Default Size	Split Trigger
TiKV	Region	96 MB	Size exceeds threshold
CockroachDB	Range	512 MB	Size OR QPS threshold

Range-based (not hash-based):

Range scans stay within one or few contiguous regions
Split/merge is purely metadata — no data movement
Tradeoff: sequential writes (e.g., monotonic IDs) can hot-spot one region

Region Split / Merge

Split

Leader detects region too large or too hot
Finds split key (middle key or hot boundary)
Proposes split via Raft
Parent region → two child regions; update metadata (PD/meta ranges)
Two independent Raft groups from here on

Merge

Two adjacent small/cold regions identified
Source region's Raft group dissolved
Key range absorbed by target region's Raft group
Metadata updated

Placement Driver (TiKV) / Rebalancer (CockroachDB)

Centralized scheduler that orchestrates Multi-Raft cluster:

Responsibility	How
Balance replica count per node	Move replicas from over-full to under-full nodes
Balance leader distribution	Transfer Raft leadership
Trigger region split/merge	Based on size + QPS metrics reported via heartbeat
Enforce placement policies	Zone/rack awareness, geo-replication constraints
Track GC safe point	Oldest active transaction for MVCC cleanup

TiKV PD is itself a 3-node Raft group — single point of truth with HA.