Kafka Replication — ISR, Leader Election, and KRaft
A Kafka partition is just an append-only log. Replication is what turns that log into something that survives broker failures. Kafka's replication design is deliberately simple: one leader, N-1 followers, and a tight protocol that keeps them in sync. But the edge cases — what happens when followers fall behind, when leaders crash, when the network partitions — reveal a set of trade-offs that every Kafka operator needs to understand.
This post covers the ISR protocol, leader election, the follower fetch loop, unclean election, and Kafka's move from ZooKeeper to KRaft for consensus.
The Basics: Leader and Followers
Every partition has one leader and zero or more followers, each on a different broker. By default the leader handles all reads and writes (since Kafka 2.4, consumers can optionally be configured to fetch from followers). Followers replicate by fetching from the leader, much as a consumer would.
Broker 0 (leader) ←── writes from producers
Broker 1 (follower) ←── fetches from broker 0
Broker 2 (follower) ←── fetches from broker 0
The replication factor is set per-topic:
replication.factor=3
With replication.factor=3, the data survives the loss of any 2 of the 3 brokers — as long as the data was replicated before the failures.
The In-Sync Replica Set (ISR)
Not all followers are always fully caught up. Kafka tracks which followers are close enough to the leader in a set called the In-Sync Replica set (ISR).
A follower is considered in-sync if it has fetched all messages up to the leader's log end offset within replica.lag.time.max.ms (default 30 seconds). If a follower stops fetching or falls too far behind, it is removed from the ISR.
ISR = {0, 1, 2} ← all three replicas are in sync
ISR = {0, 1} ← broker 2 fell behind or crashed; removed from ISR
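As a rough illustration of the lag rule, here is a minimal Python sketch. The function name and the last_caught_up_ms bookkeeping are assumptions made for this example, not Kafka's internal API; it only models the time-based check described above.

```python
def in_sync_replicas(leader_id, last_caught_up_ms, now_ms, max_lag_ms=30_000):
    """Return the ISR as a set of broker ids.

    last_caught_up_ms maps each follower id to the last time (in ms) it was
    fully caught up to the leader's log end offset. max_lag_ms plays the role
    of replica.lag.time.max.ms.
    """
    isr = {leader_id}  # the leader is always part of its own ISR
    for broker_id, caught_up_at in last_caught_up_ms.items():
        if now_ms - caught_up_at <= max_lag_ms:
            isr.add(broker_id)
    return isr


# Broker 1 caught up 2 s ago; broker 2 has not caught up for 45 s.
isr = in_sync_replicas(0, {1: 98_000, 2: 55_000}, now_ms=100_000)
print(sorted(isr))
```

Here broker 2 drops out of the set because its last caught-up time is older than the 30 second limit.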
The ISR is the safety gate for acks=all. The leader only acknowledges a write to the producer after all current ISR members have confirmed they have written it to their local log.
High Watermark
The high watermark (HW) is the offset up to which all ISR members have confirmed replication. Consumers can only read up to the high watermark — records above it may not be on all replicas yet, so they are not visible.
Leader log: [0, 1, 2, 3, 4, 5] ← log end offset = 6
ISR acked: [0, 1, 2, 3] ← high watermark = 4
Consumer sees: offsets 0–3 only
This is why you sometimes see a small lag between production and consumption even with a fast consumer — the high watermark lags behind the log end offset until followers catch up.
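A small sketch of how the high watermark falls out of per-replica log end offsets, assuming it is simply the minimum across the ISR. The function name and the dict layout are illustrative choices, not Kafka code.

```python
def high_watermark(log_end_offsets, isr):
    """Smallest log end offset among ISR members.

    Every offset below this value is present on all in-sync replicas.
    """
    return min(log_end_offsets[broker] for broker in isr)


# Leader (broker 0) has offsets 0..5; the followers are slightly behind.
log_end_offsets = {0: 6, 1: 4, 2: 5}
hw = high_watermark(log_end_offsets, isr={0, 1, 2})
visible_to_consumers = list(range(hw))
print(hw, visible_to_consumers)
```

If broker 1 were removed from the ISR, the same call with isr={0, 2} would let the high watermark advance to 5.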
The Follower Fetch Loop
A follower replicates by running a continuous fetch loop, identical to a consumer:
- Send a FetchRequest to the leader with the next expected offset
- Receive the response and append the records to its local log
- Update its fetch position and send the next request
The leader tracks the fetch position of each follower. When a follower's fetch offset reaches the leader's log end offset, it is (re)admitted to the ISR.
The parameter replica.fetch.max.bytes (default 1 MB) caps how much data the follower tries to fetch per partition in each request. For high-throughput topics, increasing it reduces the number of round trips needed to keep followers in sync.
replica.fetch.max.bytes=5242880 # 5 MB
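The loop can be simulated in a few lines of Python. This is a toy model under stated assumptions: logs are plain lists, and the fetch limit is counted in records rather than bytes so the arithmetic stays easy to follow. The replicate function is hypothetical.

```python
def replicate(leader_log, follower_log, max_records_per_fetch):
    """Run the follower fetch loop until the follower is caught up.

    Returns the number of fetch round trips that were needed.
    """
    if max_records_per_fetch <= 0:
        raise ValueError("max_records_per_fetch must be positive")
    round_trips = 0
    while len(follower_log) < len(leader_log):
        fetch_offset = len(follower_log)  # next expected offset
        batch = leader_log[fetch_offset:fetch_offset + max_records_per_fetch]
        follower_log.extend(batch)        # append to the local log
        round_trips += 1
    return round_trips


leader_log = list(range(10))
print(replicate(leader_log, [], max_records_per_fetch=3))
print(replicate(leader_log, [], max_records_per_fetch=5))
```

A larger per-fetch limit means fewer round trips for the same backlog, which is the effect of raising replica.fetch.max.bytes.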
min.insync.replicas — The Safety Floor
acks=all means "all current ISR members must confirm". But if the ISR shrinks to just the leader (because all followers crashed or fell behind), acks=all is satisfied by the leader alone — which offers no more durability than acks=1.
min.insync.replicas (set on the topic or broker) prevents this:
min.insync.replicas=2
With this set, the leader rejects acks=all writes whenever the ISR has fewer than 2 members, and producers receive a NotEnoughReplicasException. This is the correct behaviour: better to block writes than to silently degrade to no replication.
The recommended setup for durable topics:
replication.factor=3
min.insync.replicas=2
acks=all ← on the producer
This tolerates one broker failure while still committing writes: 2 of the 3 replicas must confirm, and any one can fail.
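The interaction between these three settings can be sketched as follows. The exception class is a stand-in for Kafka's NotEnoughReplicasException, and both functions are hypothetical helpers written for this example.

```python
class NotEnoughReplicas(Exception):
    """Stand-in for Kafka's NotEnoughReplicasException."""


def check_write(isr_size, min_insync_replicas, acks):
    """Return True if the leader would accept the write, else raise."""
    # min.insync.replicas is only enforced for acks=all producers.
    if acks == "all" and isr_size < min_insync_replicas:
        raise NotEnoughReplicas(
            f"ISR size {isr_size} is below min.insync.replicas={min_insync_replicas}")
    return True


def tolerated_failures(replication_factor, min_insync_replicas):
    """Brokers that can fail while acks=all writes are still accepted."""
    return replication_factor - min_insync_replicas


print(tolerated_failures(3, 2))
print(check_write(isr_size=2, min_insync_replicas=2, acks="all"))
```

With replication.factor=3 and min.insync.replicas=3 the same arithmetic gives zero tolerated failures, which is why that combination is rarely a good idea.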
Leader Election
When a leader fails, a new leader must be elected from the ISR. The ISR is key: only members of the ISR are eligible, because only they are guaranteed to have all committed data.
The process (simplified):
- The controller (a special broker role) detects the leader failure
- It picks the first live broker in the partition's replica list that is also in the ISR as the new leader
- It updates the partition metadata in ZooKeeper (or KRaft — see below)
- It sends a LeaderAndIsrRequest to all brokers with the new leader's identity
- Producers and consumers discover the new leader via metadata refresh
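The selection step can be sketched like this, assuming the controller walks the partition's replica list in order and takes the first broker that is both alive and in the ISR. The function and argument names are illustrative.

```python
def elect_leader(replicas, isr, live_brokers):
    """Pick a new leader for one partition.

    replicas: assigned replica list, preferred replica first.
    isr: set of broker ids currently in sync.
    live_brokers: set of broker ids the controller considers alive.
    Returns the new leader's id, or None if no eligible replica exists.
    """
    for broker in replicas:
        if broker in isr and broker in live_brokers:
            return broker
    return None  # the partition goes offline


# Broker 0 (the old leader) has crashed; brokers 1 and 2 are in sync.
print(elect_leader([0, 1, 2], isr={0, 1, 2}, live_brokers={1, 2}))
```

Returning None corresponds to the offline-partition case, where no in-sync replica is available to take over.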
Preferred Replica
The first broker in the replica list for a partition is its preferred replica — the one intended to be the leader under normal conditions. Over time, leadership can drift (e.g., after a rolling restart). Running kafka-leader-election.sh or enabling auto.leader.rebalance.enable=true (default) restores preferred leadership periodically.
auto.leader.rebalance.enable=true
leader.imbalance.check.interval.seconds=300 # check every 5 minutes
leader.imbalance.per.broker.percentage=10 # trigger rebalance if >10% of partitions are off preferred
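One way to picture the imbalance check is sketched below. It assumes the ratio for a broker is the share of partitions that prefer it as leader but are currently led elsewhere; the helper name and data layout are made up for the example.

```python
def imbalance_percentage(broker, partitions):
    """Percentage of partitions preferring `broker` that it does not lead.

    partitions: list of (preferred_leader, current_leader) pairs.
    """
    preferred = [(pref, cur) for pref, cur in partitions if pref == broker]
    if not preferred:
        return 0.0
    off_preferred = sum(1 for _, cur in preferred if cur != broker)
    return 100.0 * off_preferred / len(preferred)


# Broker 0 is preferred for four partitions but leads only three of them.
partitions = [(0, 0), (0, 1), (0, 0), (0, 0), (1, 1)]
ratio = imbalance_percentage(0, partitions)
print(ratio, ratio > 10)  # compare against leader.imbalance.per.broker.percentage
```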
Unclean Leader Election
If all ISR members are unavailable (all crashed, or all too far behind before failing), no eligible leader exists. Kafka has two options:
Wait (default, unclean.leader.election.enable=false): the partition stays offline until an ISR member comes back. No data loss, but availability is sacrificed.
Elect an out-of-ISR replica (unclean.leader.election.enable=true): a follower that was behind is promoted. It may be missing records that were acknowledged to producers — those records are permanently lost and any records written after the lag point are overwritten.
This is the classic CAP theorem trade-off in practice:
- false → choose consistency (no data loss, possible unavailability)
- true → choose availability (topic stays online, possible data loss)
For financial or transactional data: keep unclean.leader.election.enable=false. For metrics or log pipelines where occasional loss is acceptable: true may be justified.
unclean.leader.election.enable=false # default; recommended for durable topics
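To make the cost of each choice concrete, here is a hedged sketch of the decision when no ISR member is available. It assumes the first live out-of-ISR replica in replica-list order is promoted; the function name and the tuple layout are illustrative.

```python
def elect_without_isr(live_replicas, high_watermark, unclean_enabled):
    """Decide what happens when every ISR member is unavailable.

    live_replicas: list of (broker_id, log_end_offset) for live replicas
    that are outside the ISR, in replica-list order.
    Returns (new_leader_or_None, committed_records_lost).
    """
    if not unclean_enabled or not live_replicas:
        return None, 0  # partition stays offline, nothing is lost
    broker_id, log_end_offset = live_replicas[0]
    lost = max(0, high_watermark - log_end_offset)
    return broker_id, lost


# Committed data reached offset 100; the only live replica stopped at 90.
print(elect_without_isr([(2, 90)], high_watermark=100, unclean_enabled=False))
print(elect_without_isr([(2, 90)], high_watermark=100, unclean_enabled=True))
```

With the flag off the partition has no leader; with it on, broker 2 takes over and the ten committed records it never received are gone.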
Leader Epoch
Before Kafka 0.11, replicas truncated their logs to the high watermark after a restart or leader change. Because the high watermark reaches followers with a delay, a rapid sequence of failures could drop committed records or leave replicas with diverging logs. Kafka 0.11 fixed this by introducing the leader epoch, a monotonically increasing counter incremented on every leader change.
Each RecordBatch now carries the epoch of the leader that wrote it. When a follower reconnects, it checks whether its log is consistent with the current leader's epoch. If there is a divergence, it truncates its log to the point of consistency before resuming replication.
This eliminates the "zombie leader" split-brain scenario.
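The truncation step can be sketched as below. This is a simplified model of the idea, not the wire protocol: the leader's epoch history is a list of (epoch, start_offset) pairs, and both helper functions are hypothetical.

```python
def end_offset_for_epoch(leader_epoch_starts, leader_log_end, epoch):
    """Leader side: where does `epoch` end in the leader's log?

    leader_epoch_starts: (epoch, start_offset) pairs sorted by epoch.
    An epoch ends where the next higher epoch starts, or at the log end
    offset if it is the most recent one.
    """
    for e, start_offset in leader_epoch_starts:
        if e > epoch:
            return start_offset
    return leader_log_end


def truncation_offset(follower_log_end, follower_last_epoch,
                      leader_epoch_starts, leader_log_end):
    """Follower side: offset to truncate to before resuming fetches."""
    leader_end = end_offset_for_epoch(
        leader_epoch_starts, leader_log_end, follower_last_epoch)
    return min(follower_log_end, leader_end)


# The old leader wrote up to offset 10 in epoch 5, but only offsets 0..7
# were replicated. The new leader began epoch 6 at offset 8.
print(truncation_offset(10, 5, [(5, 0), (6, 8)], leader_log_end=12))
```

The returning replica discards its unreplicated offsets 8 and 9 and then fetches the new leader's records from offset 8 onwards.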
ZooKeeper vs KRaft
For years, Kafka relied on ZooKeeper for two things:
- Storing cluster metadata (broker list, topic configuration, partition leadership, ISR)
- Electing the controller — the broker responsible for managing partition leadership changes
ZooKeeper was an operational burden: a separate cluster to deploy, monitor, and upgrade; a different consistency model; a bottleneck for metadata updates at large scale.
KRaft (Kafka Raft Metadata — introduced in Kafka 2.8, production-ready in 3.3, ZooKeeper removed in 4.0) replaces ZooKeeper with Raft consensus running inside the Kafka cluster itself.
What KRaft Fixes
| Problem with ZooKeeper | KRaft solution |
|---|---|
| Separate cluster to operate | Controllers run inside Kafka itself |
| Slow metadata propagation (ZooKeeper watches) | Brokers pull from metadata log; low-latency fan-out |
| Controller failover took 10–30 seconds | Raft election is sub-second |
| ZooKeeper limited to ~200k partitions | KRaft supports millions of partitions |
| Dual write (ZK + controller cache) | Single metadata log; no dual-write inconsistency window |
KRaft Architecture
A KRaft cluster has two roles. A node can play one or both (configured through process.roles):
- Controller — participates in the Raft quorum; votes in elections; replicates the metadata log. Typically 3 or 5 dedicated controller nodes.
- Broker — handles producer/consumer traffic; fetches metadata updates from the active controller. Does not vote.
┌──────────────────────────────────────────────────────┐
│                Raft Controller Quorum                │
│                                                      │
│ ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│ │ Controller 0 │  │ Controller 1 │  │ Controller 2 │ │
│ │ (Raft leader)│  │  (follower)  │  │  (follower)  │ │
│ └──────┬───────┘  └──────┬───────┘  └──────┬───────┘ │
└────────┼─────────────────┼─────────────────┼─────────┘
         │     metadata log replication      │
         ▼                 ▼                 ▼
     Broker 3          Broker 4          Broker 5
    (observer)        (observer)        (observer)
The Raft leader is the active controller. It is the sole writer to the metadata log. All partition leadership decisions, ISR updates, topic creation/deletion — everything that was previously coordinated through ZooKeeper — are now records in this log.
The __cluster_metadata Log
The metadata log is a single Kafka partition stored in a dedicated log directory (metadata.log.dir). Its records use strongly-typed schemas (not arbitrary bytes) representing events such as:
| Record type | What it encodes |
|---|---|
| TopicRecord | A new topic and its UUID |
| PartitionRecord | Partition assignment, replicas, ISR, leader, leader epoch |
| RegisterBrokerRecord | A broker registering with the cluster |
| PartitionChangeRecord | A change to a partition's ISR, leader, or replicas |
| ConfigRecord | Topic or broker configuration update |
| ProducerIdsRecord | Block of producer ID assignments |
Every change to cluster state is appended to this log. Any broker can reconstruct the full cluster state by replaying it from offset 0 — the log is the single source of truth.
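Replaying the log amounts to folding records into an in-memory image. The sketch below uses plain dicts with made-up field names in place of Kafka's generated record classes, and handles only three of the record types.

```python
def replay(records):
    """Fold a sequence of metadata records into an in-memory image."""
    image = {"topics": {}, "partitions": {}, "configs": {}}
    for record in records:
        kind = record["type"]
        if kind == "TopicRecord":
            image["topics"][record["name"]] = record["topic_id"]
        elif kind == "PartitionRecord":
            key = (record["topic_id"], record["partition"])
            image["partitions"][key] = {
                "leader": record["leader"],
                "isr": list(record["isr"]),
            }
        elif kind == "ConfigRecord":
            # A later record for the same key overrides the earlier one.
            image["configs"][(record["resource"], record["name"])] = record["value"]
    return image


log = [
    {"type": "TopicRecord", "name": "orders", "topic_id": "t-1"},
    {"type": "PartitionRecord", "topic_id": "t-1", "partition": 0,
     "leader": 0, "isr": [0, 1, 2]},
    {"type": "ConfigRecord", "resource": "orders",
     "name": "min.insync.replicas", "value": "2"},
    {"type": "ConfigRecord", "resource": "orders",
     "name": "min.insync.replicas", "value": "1"},
]
print(replay(log)["configs"])
```

Because later records win, two nodes that replay the same prefix of the log always arrive at the same image.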
Raft Consensus in KRaft
Raft is a consensus algorithm designed to be understandable. It solves the same problem as Paxos but with a clearer separation of concerns: leader election and log replication are handled as distinct phases with well-defined state machines.
Terms (Epochs in KRaft)
Time in Raft is divided into terms: monotonically increasing integers. KRaft calls them epochs, in line with Kafka's existing leader epoch terminology, but the concept is identical.
Each epoch begins with an election. At most one leader can exist per epoch. If a node sees a message with a higher epoch than its own, it immediately updates its epoch and reverts to follower state.
Epochs serve as a logical clock: a message from an old leader (stale epoch) is rejected, which prevents split-brain.
Epoch 1: Broker 0 is leader
↓ Broker 0 crashes
Epoch 2: Election begins
↓ Broker 1 wins
Epoch 2: Broker 1 is leader
↓ Broker 0 comes back, sees epoch=2, becomes follower
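The epoch rule is small enough to sketch directly. The Node class below is a hypothetical model of a single quorum member, covering only the epoch comparison and ignoring everything else a real node tracks.

```python
from dataclasses import dataclass


@dataclass
class Node:
    epoch: int
    role: str = "follower"

    def observe(self, message_epoch):
        """Process the epoch carried by an incoming message.

        Returns True if the message may be handled, False if it is stale.
        """
        if message_epoch < self.epoch:
            return False  # sender is behind, e.g. an old leader
        if message_epoch > self.epoch:
            self.epoch = message_epoch
            self.role = "follower"  # step down on seeing a newer epoch
        return True


# Broker 0 was the leader in epoch 1 and comes back after epoch 2 began.
broker0 = Node(epoch=1, role="leader")
print(broker0.observe(2), broker0)
print(broker0.observe(1))
```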
Phase 1: Leader Election
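When a follower controller fails to fetch from the active controller within controller.quorum.fetch.timeout.ms (default 2000 ms), it transitions from follower to candidate and starts an election for epoch N+1. If that election is not decided within controller.quorum.election.timeout.ms (default 1000 ms), the candidate backs off and starts a new one.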
The candidate:
- Increments its epoch to N+1
- Votes for itself
- Broadcasts a VoteRequest to all other controllers:
VoteRequest {
  epoch: N+1
  candidateId: self
  lastOffset: last log offset this node has
  lastOffsetEpoch: epoch of that log entry
}
Each voter grants a vote if and only if both conditions hold:
- The request's epoch > the voter's current epoch (i.e., it is a new election), or the epochs are equal and the voter has not yet voted for another candidate in that epoch
- The candidate's log is at least as up-to-date as the voter's log
Log completeness is checked by comparing (lastOffsetEpoch, lastOffset) lexicographically. A higher epoch wins; if epochs are equal, a higher offset wins. This ensures the new leader has all entries that were committed in any previous epoch.
Winning the election: A candidate wins when it receives votes from a majority of voters (⌊N/2⌋ + 1 for N controllers). With 3 controllers, 2 votes win. With 5, 3 votes win.
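The vote check and the majority rule can be sketched as follows. This follows the simplified two-condition rule and leaves out the bookkeeping of who a voter has already voted for; all function names are illustrative.

```python
def log_is_up_to_date(candidate_log, voter_log):
    """Compare (last_offset_epoch, last_offset) pairs.

    Python compares tuples lexicographically, so a higher epoch wins and
    the offset only breaks ties between equal epochs.
    """
    return candidate_log >= voter_log


def grants_vote(request_epoch, candidate_log, voter_epoch, voter_log):
    """True if the voter grants its vote to the candidate."""
    return (request_epoch > voter_epoch
            and log_is_up_to_date(candidate_log, voter_log))


def majority(voter_count):
    """Votes needed to win an election among voter_count controllers."""
    return voter_count // 2 + 1


# The candidate's log ends at (epoch 3, offset 120); the voter's at (3, 100).
print(grants_vote(4, (3, 120), voter_epoch=3, voter_log=(3, 100)))
# A longer log from an older epoch still loses the comparison.
print(grants_vote(4, (2, 500), voter_epoch=3, voter_log=(3, 100)))
print(majority(3), majority(5))
```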
On winning, the new leader immediately appends a LeaderChangeMessage control record to the metadata log and starts sending BeginQuorumEpoch requests to all followers, establishing itself as the authority for epoch N+1.
Election timeline (3 controllers, previous epoch 3):
t=0 Leader 0 crashes. Controllers 1 and 2 stop receiving fetch responses.
t=1 Controller 1 times out first, increments its epoch to 4, votes for itself, sends VoteRequest(epoch=4).
t=2 Controller 2 (still a follower in epoch 3) receives the request; Controller 1's log is at least as up-to-date as its own, so it grants its vote.
Controller 1 has 2/3 votes → wins election.
t=3 Controller 1 appends LeaderChangeMessage(epoch=4) to metadata log.
t=4 Controller 2 receives BeginQuorumEpoch from Controller 1 and follows the new leader.
Phase 2: Log Replication
Once elected, the leader replicates the metadata log to followers. KRaft does not use Raft's standard AppendEntries RPC. Instead, it reuses Kafka's own Fetch protocol — followers pull new records from the leader, exactly like a regular Kafka consumer fetching from a partition leader.
Why fetch instead of push?
Kafka already has a highly optimised, battle-tested fetch path. Reusing it for metadata replication means the same code handles both data replication and consensus replication, and the leader is never blocked waiting for slow followers — it just tracks their fetch offsets.
The replication loop:
- A follower sends a
FetchRequestto the active controller with its current log end offset - The leader responds with new metadata log records
- The follower appends them to its local metadata log
- The follower's next
FetchRequestincludes its updated log end offset, which the leader uses to track replication progress
The High Watermark in Consensus
The metadata log has its own high watermark — the committed offset — which advances when a majority of voters have acknowledged replication.
Leader log end offset: 100
Follower 1 fetch offset: 98
Follower 2 fetch offset: 95
Majority (leader + follower 1, 2 of 3) have replicated up to offset 98 → committed offset = 98
Only records up to the committed offset are applied to the in-memory cluster state. The leader includes the current committed offset in every FetchResponse, so followers know which records are safe to apply.
This mirrors exactly how the ISR high watermark works for data partitions — the same pattern, applied at the metadata layer.
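A compact way to compute the committed offset is to sort the voters' log end offsets and take the one held by a majority. The sketch assumes the leader's own log end offset counts as one of the voters; the function name is made up for the example.

```python
def committed_offset(voter_log_end_offsets):
    """Highest offset replicated on a majority of voters.

    voter_log_end_offsets: one entry per voter, leader included.
    """
    ordered = sorted(voter_log_end_offsets, reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th largest value is held by at least `majority` voters.
    return ordered[majority - 1]


# Leader at 100, followers at 98 and 95.
print(committed_offset([100, 98, 95]))
# Five voters: three of them have reached at least offset 97.
print(committed_offset([100, 99, 97, 90, 80]))
```

Note that one slow follower does not hold back the committed offset as long as a majority keeps up.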
Safety Guarantee
Raft guarantees that at most one leader exists per epoch, and that a new leader always has every record that was committed in a previous epoch. The vote's log-completeness check is what enforces this: a node that is missing committed records cannot win an election because a majority of voters must have those records and will reject a vote from a less complete log.
KRaft vs Standard Raft
KRaft adapts Raft in a few ways specific to Kafka:
| Standard Raft | KRaft |
|---|---|
| AppendEntries RPC (leader → follower push) | Fetch RPC (follower → leader pull) |
| "term" | "epoch" (same concept, different name) |
| Log entries are opaque bytes | Strongly-typed metadata records |
| Leader sends heartbeats to prevent elections | Leader sends FetchResponse; no separate heartbeat |
| Snapshot = arbitrary application state | Snapshot = serialised in-memory metadata image |
The fetch-based pull model is the most significant departure. In standard Raft, the leader pushes AppendEntries to all followers and retries on failure. In KRaft, followers pull, and if a follower is slow, the leader does not block — it just waits for the follower's next fetch. This keeps the leader's hot path free of slow-follower back-pressure.
Metadata Snapshots
The metadata log grows indefinitely if left alone. KRaft periodically writes a snapshot — a complete serialised image of the in-memory cluster state at a given committed offset. Once a snapshot is written, all metadata log records up to that offset can be deleted.
metadata log:
[offset 0 ──────── offset 5000] ← captured in snapshot at offset 5000
[offset 5000 ──── offset 8000] ← live log tail
After snapshot: only [5000 – 8000] needs to be retained
A new controller joining the cluster (or one that has been offline for a long time) fetches the latest snapshot from the active controller first, then fetches the log tail from the snapshot offset onwards. This avoids replaying the entire history of the cluster.
Snapshot frequency is controlled by:
metadata.log.max.record.bytes.between.snapshots=20971520 # 20 MB (default)
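Catching up from a snapshot plus the log tail can be sketched like this. The example treats the metadata image as a flat key/value dict and the snapshot offset as exclusive (the snapshot covers every offset below it); both are simplifying assumptions, and restore_state is a hypothetical helper.

```python
def restore_state(snapshot, snapshot_end_offset, log_tail):
    """Rebuild state from a snapshot and the records that follow it.

    snapshot: dict image covering all offsets below snapshot_end_offset.
    log_tail: iterable of (offset, key, value) records.
    Returns (state, number_of_records_applied).
    """
    state = dict(snapshot)
    applied = 0
    for offset, key, value in log_tail:
        if offset < snapshot_end_offset:
            continue  # already reflected in the snapshot
        state[key] = value
        applied += 1
    return state, applied


snapshot = {"orders-0.leader": 1}
tail = [
    (4999, "orders-0.leader", 1),
    (5000, "orders-0.leader", 2),
    (5001, "payments-0.leader", 0),
]
print(restore_state(snapshot, 5000, tail))
```

Only the two records at or after offset 5000 are applied, so the cost of catching up depends on the length of the tail rather than on the full history.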
Observer Brokers
Regular brokers (non-controllers) are observers in the Raft sense: they fetch from the metadata log but do not participate in elections or vote. This is an explicit design choice — including all N brokers in the Raft quorum would make elections slower (more votes needed) and replication latency higher (leader must wait for a majority of all brokers, not just 3 controllers).
The quorum size is fixed at the controller count (typically 3 or 5), keeping election and commit latency independent of how many broker nodes are in the cluster.
Partition Reassignment
When a broker is added or removed, partitions can be reassigned to rebalance load. The partition reassignment process:
- The controller adds the new broker as a replica for the chosen partitions
- The new broker fetches data from the current leader (throttled by leader.replication.throttled.rate and follower.replication.throttled.rate)
- Once caught up and added to the ISR, the partition metadata is updated
- The old broker is removed from the replica set
Throttling is critical to avoid overwhelming the network during reassignment:
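# 50 MB/s throttle; the broker address and JSON file name are placeholders
kafka-reassign-partitions.sh \
  --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json \
  --throttle 50000000 \
  --execute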
Without throttling, reassignment can saturate inter-broker network bandwidth and cause consumer lag or producer timeout.
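When picking a throttle it helps to estimate how long the move will take. The sketch below is back-of-the-envelope arithmetic only: it assumes the throttle is the bottleneck and ignores new writes arriving during the move. The function name and the 500 GB figure are illustrative.

```python
def reassignment_seconds(bytes_to_move, throttle_bytes_per_sec):
    """Lower bound on reassignment time when the throttle is the limit."""
    if throttle_bytes_per_sec <= 0:
        raise ValueError("throttle must be positive")
    return bytes_to_move / throttle_bytes_per_sec


# Moving 500 GB of replica data at a 50 MB/s throttle.
seconds = reassignment_seconds(500 * 10**9, 50_000_000)
print(seconds, round(seconds / 3600, 1))
```

A throttle that is too low can keep a reassignment running for hours, so the value is a balance between protecting client traffic and finishing in reasonable time.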
Summary
Kafka's replication model is built on a few core ideas:
- ISR: only fully-caught-up replicas can satisfy acks=all; the high watermark protects consumers from uncommitted data
- min.insync.replicas: the safety floor; prevents acks=all from being vacuously true when followers are down
- Leader election from ISR: guarantees no committed data is lost, at the cost of possible unavailability
- Unclean election: the explicit trade-off; choose availability and risk loss, or choose safety and accept downtime
- Leader epoch: prevents split-brain log divergence after leader changes
- KRaft: removes ZooKeeper; epochs prevent split-brain; fetch-based pull keeps the leader's hot path back-pressure free; the __cluster_metadata log is the single source of truth; snapshots bound replay time; observer brokers keep quorum size fixed at 3–5 regardless of cluster scale
The replication layer is where Kafka's durability guarantees are actually enforced. Getting replication.factor, min.insync.replicas, and acks right — and understanding what happens when the ISR shrinks — is the most important operational knowledge for anyone running Kafka in production.