Concepts and Terminology — Offsets, Segments, LEO, ISR, Raft, and Serf
Before writing any code, understand the concepts and algorithms that underpin a distributed log. This page covers two areas: the Kafka-style terminology you need for storage and replication (offsets, segments, LEO, HW, ISR), and the distributed systems algorithms you will use (Raft consensus, Serf gossip).
Part 1: Storage and replication concepts
Append-only storage
The log is stored as append-only files. You never update or delete in place — only append at the end. This gives you:
- Sequential writes — disk throughput is highest for sequential I/O. No random seeks for writes.
- Crash safety — a partial write at the tail can be detected and truncated on recovery. No mid-file corruption.
- Simplicity — no compaction or garbage collection at the storage layer (retention policies handle cleanup separately).
Segments
A single append-only file would grow unbounded. Instead, the log is split into segments — bounded chunks of contiguous offsets:
data/my-topic/
├── 00000000000000000000.log ← offsets 0–999
├── 00000000000000000000.idx ← sparse index for segment 0
├── 00000000000000001000.log ← offsets 1000–1999
├── 00000000000000001000.idx ← sparse index for segment 1000
├── 00000000000000002000.log ← active segment (offsets 2000–...)
└── 00000000000000002000.idx
Each segment has:
- A base offset — the first offset in the segment (used as the filename).
- A log file (
.log) — the append-only sequence of encoded records. - An index file (
.idx) — a sparse index mapping offset → byte position.
When the active segment reaches a size limit (e.g., 1 MB), it is sealed (becomes read-only) and a new segment is created. Old segments can be deleted by a retention policy.
Sparse index
Indexing every record would waste space. Instead, we write an index entry every N bytes (e.g., every 4 KB). The index is small enough to memory-map (mmap) and binary-search:
Index entries (every ~4KB of log data):
┌──────────────────┬────────────────────┐
│ Relative Offset │ Byte Position │
│ (4 bytes) │ (8 bytes) │
├──────────────────┼────────────────────┤
│ 0 │ 0 │ ← first record
│ 47 │ 4096 │ ← ~4KB into .log file
│ 93 │ 8192 │ ← ~8KB into .log file
│ 140 │ 12288 │
└──────────────────┴────────────────────┘
Lookup works in two steps:
- Binary search the index for the largest entry with
relative_offset <= target. This gives a starting byte position. - Linear scan forward in the
.logfile from that position, decoding records one by one, until you find the target offset.
The relative offset (offset minus the segment's base offset) fits in 4 bytes instead of 8, saving space.
Offsets
An offset is a monotonically increasing position number assigned to each record:
- Logical, not physical — offset 42 is "the 42nd record," not "byte position 42."
- Stable — once assigned, an offset never changes. Consumers use offsets as bookmarks.
- Per-partition — in a multi-partition topic, each partition has its own independent offset sequence.
The leader assigns offsets when it appends records. Producers receive the assigned offset in the response; consumers use offsets in fetch requests.
Producer
A producer is a client that appends records to the log:
- Discover the leader for the target topic (via any broker).
- Send a produce request with the record value (and optionally a key and headers).
- The leader appends to its local log, assigns an offset.
- Depending on the ack mode, the leader may wait for replication before responding.
Ack modes control the durability guarantee:
| Ack mode | Behavior | Durability | Latency |
|---|---|---|---|
| AckNone (0) | Leader responds immediately after local append. No replication wait. | Lowest — data can be lost if the leader crashes. | Lowest |
| AckLeader (1) | Leader responds after local append and fsync. | Medium — survives leader disk failure but not leader crash before replication. | Medium |
| AckAll (2) | Leader responds only after all ISR replicas have replicated the data. | Highest — data survives any single node failure. | Highest |
Consumer
A consumer reads records from the log by offset:
- Discover the leader for the topic.
- Send a fetch request: "give me records starting at offset X."
- The server reads from the log (up to the high watermark) and returns records.
- The consumer processes the records and advances its offset.
Consumers track their position — the last offset they successfully processed. This offset can be:
- Stored on the server (committed offset) so the consumer can resume after a restart.
- Tracked in-memory if the consumer handles its own checkpointing.
LEO — Log End Offset
LEO is the next offset to be written on a given replica:
Records: [0] [1] [2] [3] [4] [5] [6]
↑
LEO = 7
- On the leader: LEO advances after every successful append.
- On a follower: LEO advances after the follower applies replicated records.
LEO tells you "how far this replica has written." It is not the visibility boundary for consumers — that is the high watermark.
HW — High Watermark
The high watermark (HW) is the highest offset that has been replicated to all in-sync replicas:
Offset: 0 1 2 3 4 5 6 7 8
│ │ │ │ │ │ │ │ │
└───────────────────┘
committed (HW=5) uncommitted (only on leader)
LEO=9
- Consumers read only up to HW. Records beyond HW might be lost if the leader fails before replication completes.
- HW = min(LEO) across all ISR replicas. When every in-sync replica has offset 5, HW advances to 5.
- The leader piggybacks HW on replication responses so followers know the commit point.
The gap between HW and LEO is the "replication lag" — records that exist on the leader but have not yet been confirmed on all replicas.
ISR — In-Sync Replicas
ISR is the set of replicas that are caught up with the leader within an acceptable lag threshold:
Leader LEO = 1000
Replica A: LEO = 998 → in ISR (within threshold of 100)
Replica B: LEO = 950 → in ISR
Replica C: LEO = 700 → NOT in ISR (too far behind)
Why ISR matters:
- HW is computed from ISR replicas only. A slow replica outside the ISR does not block HW advancement.
- Leader election chooses from ISR members to avoid data loss.
- AckAll waits only for ISR replicas, not all replicas.
A replica that falls behind is removed from ISR (via a Raft metadata event). When it catches up, it is added back. This keeps the cluster making progress even when one node is slow.
Replication model
Our distributed log uses pull-based replication (similar to Kafka):
- The leader accepts writes and appends to its local log.
- Each follower runs a replication thread that periodically fetches new records from the leader.
- The leader serves these fetch requests using
ReadUncommitted(reads up to LEO, not just HW) and records the follower's LEO. - The leader computes ISR based on how close each follower's LEO is to its own.
- When all ISR replicas have caught up to a given offset, the leader advances HW.
- Consumers see the new data.
This is different from Raft's push-based replication. In our system, Raft handles metadata (topic creation, leader changes, ISR updates) while data replication uses this pull-based mechanism for throughput.
Part 2: Distributed systems algorithms
Raft consensus
Raft is a consensus algorithm that lets a cluster of nodes agree on a replicated log of entries. Once an entry is committed (replicated to a majority), every correct node applies it in the same order.
Why we need Raft: The cluster must agree on metadata — which topics exist, who leads each topic, which replicas are in-sync. Without consensus, nodes could have conflicting views. Raft gives us a single source of truth for cluster state.
Roles
| Role | Behavior |
|---|---|
| Leader | Accepts new entries, replicates to followers, decides when entries are committed. One leader per term. |
| Follower | Receives entries from the leader, appends to its log, responds to leader. Votes in elections. |
| Candidate | Transient role during leader election. Requests votes from peers. |
Leader election
- A follower that does not hear from the leader within the election timeout becomes a candidate.
- The candidate increments its term (a monotonic epoch number) and sends RequestVote RPCs.
- Each node votes for at most one candidate per term (first-come-first-served, with an "up-to-date" check).
- A candidate that receives votes from a majority becomes leader for that term.
- The leader sends periodic heartbeats (empty AppendEntries) to prevent new elections.
Log replication in Raft
- The leader appends a new entry to its log with the current term and the next index.
- The leader sends AppendEntries RPCs to all followers with the new entries.
- Followers append entries if they are consistent with the leader's log (matching previous term and index).
- When a majority of nodes has the entry, the leader marks it committed.
- The leader notifies followers of the new commit index; all nodes apply committed entries to their state machine.
What we store in the Raft log
In our system, Raft entries contain metadata events, not record data:
| Event type | Purpose |
|---|---|
| CreateTopic | Register a new topic with leader and replica assignments |
| DeleteTopic | Remove a topic |
| LeaderChange | Change the leader for a topic (e.g., after a node failure) |
| IsrUpdate | Update a replica's ISR status (add or remove from ISR) |
| AddNode / RemoveNode | Track cluster membership changes |
The actual record data is replicated by the pull-based replication protocol described above. This separation keeps Raft's log small (metadata only) while letting the data path handle high throughput.
HashiCorp Raft library
We use github.com/hashicorp/raft — a production-grade Go implementation. You implement three interfaces:
| Interface | What you provide |
|---|---|
| FSM (Finite State Machine) | Apply(log) — process a committed entry (deserialize the metadata event, update in-memory state). Snapshot() / Restore() — serialize/deserialize the full state for snapshots. |
| LogStore | Persistent storage for Raft log entries. We adapt our own log (segments + index) as the backing store. |
| StableStore | Key-value store for Raft's internal state (current term, voted-for). BoltDB or a simple file. |
The library handles leader election, log replication, snapshotting, and membership changes. You focus on what happens when an entry is committed (the FSM) and where entries are stored (the LogStore).
Serf and gossip protocol
HashiCorp Serf provides decentralized cluster membership using a gossip protocol:
- Each node periodically sends its state to a few random peers.
- Those peers forward the information to other peers.
- Within a few rounds, every node knows about every other node.
Why we need Serf: Nodes need to discover each other without a static configuration file. When a new node joins, existing nodes must learn its RPC address. When a node fails, the cluster must detect it and reassign its topic leaders.
How Serf works
| Concept | Description |
|---|---|
| Join | A new node contacts one or more seed nodes. Serf gossips the new member to the cluster. |
| Leave | A node announces it is leaving. Serf removes it from the member list. |
| Failure detection | Serf periodically probes members. If a node does not respond after retries, it is marked failed and eventually removed. |
| Tags | Each node attaches key-value metadata (e.g., rpc_addr=10.0.0.1:8400). Tags are spread with membership information. |
How we use Serf
- Each node starts Serf with a unique name, bind address, and tags containing its RPC and Raft addresses.
- Nodes join the cluster by contacting seed nodes.
- When a new member joins, the Serf event handler calls
coordinator.Join()to add the node to Raft (if it is the Raft leader). - When a member leaves or fails, the event handler calls
coordinator.Leave()to remove it from the cluster and trigger leader reassignment for affected topics. - Any node can call
Members()to get the current list of alive peers and their RPC addresses.
How the pieces fit together
┌─────────────────────────────────────────────────────────────┐
│ Client (Producer/Consumer) │
│ │
│ 1. FindTopicLeader(topic) → get leader's RPC address │
│ 2. Produce/Fetch to leader │
└──────────────────────────┬──────────────────────────────────┘
│ TCP (produce / fetch / admin)
▼
┌──────────────┐ gossip ┌──────────────┐ gossip ┌──────────────┐
│ Node 1 │◄────────►│ Node 2 │◄────────►│ Node 3 │
│ (Leader) │ │ (Follower) │ │ (Follower) │
│ │ │ │ │ │
│ ┌──────────┐ │ Raft │ ┌──────────┐ │ Raft │ ┌──────────┐ │
│ │Coordinator│◄├─────────►│Coordinator│◄├─────────►│Coordinator │ │
│ │(Raft+FSM)│ │ │ │(Raft+FSM)│ │ │ │(Raft+FSM)│ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
│ │ │ │ │ │
│ ┌──────────┐ │ fetch │ ┌──────────┐ │ fetch │ ┌──────────┐ │
│ │ Topic │ │◄─────────┤ │ Topic │ │◄─────────┤ │ Topic │ │
│ │ Manager │ │ (replica)│ │ Manager │ │ (replica)│ │ Manager │ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
│ │ │ │ │ │
│ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │
│ │ Log │ │ │ │ Log │ │ │ │ Log │ │
│ │(segments)│ │ │ │(segments)│ │ │ │(segments)│ │
│ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │
└──────────────┘ └──────────────┘ └──────────────┘
Data flow summary:
- Raft replicates metadata events so all nodes agree on cluster state.
- Serf discovers members and their addresses; triggers Raft membership changes.
- TopicManager uses Raft-committed state to know which topics exist and who leads them.
- Producers write to the leader; the leader appends to its local log.
- Followers pull records from the leader; the leader tracks their LEO and computes ISR/HW.
- Consumers read from the leader up to the high watermark.
Summary
| Concept | Role in the system |
|---|---|
| Segment | Bounded append-only file + sparse index. Unit of storage. |
| Sparse index | Memory-mapped index with binary search. Fast offset lookups without indexing every record. |
| Offset | Monotonic record position. Assigned by leader, used by consumers. |
| LEO | Next offset to write on a replica. Tracks "how far we have written." |
| HW | Highest committed offset (min LEO across ISR). Consumer visibility boundary. |
| ISR | Set of replicas caught up with the leader. Defines commit and HW. |
| Raft | Consensus for metadata: topic creation, leader changes, ISR updates. |
| Serf | Gossip-based cluster membership and discovery. |
| Pull-based replication | Followers fetch from leader. Leader tracks replica LEO and computes ISR/HW. |
| Ack modes | None (fire-and-forget), Leader (local fsync), All (wait for ISR replication). |
With these concepts in mind, the next page sets up the project structure, and then you start building the storage layer.