Storage Engines
Two dominant storage engine architectures: B+ Tree (read-optimized) and LSM Tree (write-optimized).
B+ Tree
How It Works
A balanced N-ary tree. Internal nodes hold keys as routing guides; all data lives in leaf nodes. Leaf nodes are linked as a doubly-linked list for range scans.
[ 30 | 60 ] ← internal node (keys only)
/ | \
[10|20] [40|50] [70|80] ← internal nodes
/ | \ / | \ / | \
[.] [.] [.] [.] [.] [.] [.] [.] [.] ← leaf nodes (key + value/ptr)
↔ ↔ ↔ ↔ ↔ ↔ ↔ ↔ ↔ ↔ ← linked list for range scan
- Page size = 4–16 KB (aligned to OS page / block device sector).
- Branching factor = 100–1000 keys per internal node → tree height 3–4 for billions of rows.
- Height 3 tree, factor 1000: 1000^3 = 1 billion leaf entries reachable in 3 I/Os.
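The fan-out arithmetic above can be checked directly. A minimal sketch; the branching factor of 1000 is the illustrative figure from the bullets, not a fixed property of any engine:

```python
# Check the fan-out arithmetic: branching_factor^height leaf entries are
# reachable with `height` page reads below the root.
def entries_reachable(branching_factor: int, height: int) -> int:
    """Leaf entries reachable with `height` page reads below the root."""
    return branching_factor ** height

def height_needed(branching_factor: int, n_entries: int) -> int:
    """Page reads needed before the tree covers n_entries leaf entries."""
    height, reach = 0, 1
    while reach < n_entries:
        reach *= branching_factor
        height += 1
    return height

assert entries_reachable(1000, 3) == 1_000_000_000   # 1000^3 = 1 billion
assert height_needed(1000, 1_000_000_000) == 3       # billions of rows, 3 hops
```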
DDIA p.80: "The underlying write operation of B-trees is to overwrite a page on disk with new data... In contrast, log-structured indexes only append to files."
Write Path (B+ Tree)
INSERT key=42
│
├─ 1. WAL append (crash safety) ──────────────────► disk (sequential)
│
├─ 2. Root-to-leaf traversal (3–4 page reads)
│ └─ each page: check buffer pool → cache hit (RAM) or cache miss (disk read)
│
├─ 3. Modify leaf page in buffer pool (in-memory)
│
├─ 4. If page full → SPLIT:
│ promote median key to parent → may cascade up to root
│ allocate new page → write both pages to disk
│
└─ 5. Mark pages dirty → background flusher writes to disk
Write amplification: 1 logical write → potentially many page reads + rewrites (splits, parent updates). Random I/O to disk.
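Step 4 (the split) can be sketched with a toy in-memory "page". A minimal sketch, assuming a tiny capacity of 4 keys for readability; real pages hold hundreds:

```python
import bisect

CAPACITY = 4   # real pages hold hundreds of keys; tiny here for readability

def insert_with_split(leaf, key):
    """Insert into a sorted leaf page; on overflow, split it and return
    (separator_key, new_right_page) for promotion into the parent node."""
    bisect.insort(leaf, key)
    if len(leaf) <= CAPACITY:
        return None                  # page had room: no split, no promotion
    mid = len(leaf) // 2
    right = leaf[mid:]               # new page takes the upper half
    del leaf[mid:]                   # original page keeps the lower half
    return right[0], right           # separator key = first key of right page

leaf = [10, 20, 30, 40]              # page already full
promoted = insert_with_split(leaf, 25)
assert leaf == [10, 20]
assert promoted == (25, [25, 30, 40])
```

The returned separator is what "promote median key to parent" refers to: the parent inserts it the same way, which is how a split can cascade up to the root.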
Read Path (B+ Tree)
SELECT WHERE key=42
│
├─ 1. Root page → check buffer pool
│ HIT → read from RAM (nanoseconds)
│ MISS → read from disk into buffer pool (milliseconds)
│ └─ OS read-ahead: prefetch N sequential pages speculatively
│
├─ 2. Walk internal nodes (height 3–4), each step = 1 page
│
├─ 3. Leaf page found → return row / pointer to heap
│
└─ 4. Range scan: follow leaf linked list until end of range
Buffer pool hit rate is critical. 99% hit rate = most reads served from RAM. Miss = physical disk read (random I/O, ~50–100µs NVMe SSD, ~5–10ms HDD).
OS Page Alignment
Disk block size: 512B or 4KB (physical sector)
OS page size: 4KB
DB page size: 4KB–16KB (multiple of OS page)
Aligned read: request aligns to page boundary → single syscall, no wasted I/O
Unaligned read: spans two pages → two reads, extra copy → wasted I/O
B+ tree pages are deliberately sized and aligned to OS pages. PostgreSQL uses 8KB pages; InnoDB uses 16KB.
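The aligned-vs-unaligned cost above is simple arithmetic. A minimal sketch, assuming the 4 KB OS page size from the list:

```python
PAGE = 4096   # OS page size

def pages_touched(offset: int, size: int) -> int:
    """How many OS pages a read of `size` bytes starting at `offset` spans."""
    first = offset // PAGE
    last = (offset + size - 1) // PAGE
    return last - first + 1

assert pages_touched(0, 4096) == 1     # aligned: one page, one I/O
assert pages_touched(100, 4096) == 2   # unaligned: straddles two pages
```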
Examples
| DB | Engine | Notes |
|---|---|---|
| PostgreSQL | Heap + B+ tree indexes | Heap stores rows; indexes point to heap TIDs |
| MySQL InnoDB | Clustered B+ tree | PK is the table (clustered); secondary indexes point to PK |
| SQLite | B+ tree | Single file; one B-tree per table and per index |
| Oracle | B+ tree | Industry standard; bitmap indexes also available |
LSM Tree (Log-Structured Merge-Tree)
DDIA p.78: "Log-structured storage engines... turn random-access writes into sequential writes on disk, which enables higher write throughput."
How It Works
Never overwrites in place. All writes are appends. Periodically merge sorted files.
Components:
- WAL — append-only crash recovery log on disk.
- MemTable — in-memory sorted structure (red-black tree or skip list). Active write buffer.
- Immutable MemTable — MemTable being flushed; still readable.
- SSTables — immutable sorted files on disk, organized in levels.
- Bloom Filter — per-SSTable: "does this key definitely NOT exist here?" — skips unnecessary disk reads. False positives possible; false negatives impossible.
- Sparse Index — per-SSTable: maps key ranges to byte offsets. One entry per block (~4KB). Binary search → seek to block → scan within block.
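A Bloom filter is small enough to sketch end to end. This is a minimal illustration (bit array + k salted hash probes), not RocksDB's actual implementation:

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, k=3):
        self.bits = bytearray(n_bits // 8)
        self.n_bits, self.k = n_bits, k

    def _probes(self, key):
        # k independent bit positions derived from a salted hash of the key
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, key):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False → definitely absent (skip this SSTable); True → maybe present
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))

bf = BloomFilter()
bf.add("alice")
assert bf.might_contain("alice")   # false negatives are impossible
```

A false positive happens when another key's probes happen to have set all k bits; that only costs a wasted disk read, never a wrong answer.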
Write Path (LSM Tree)
PUT key=42, value=X
│
├─ 1. Append to WAL ──────────────────────────────► disk (sequential, fast)
│
├─ 2. Insert into MemTable (in-memory sorted map)
│ └─ return OK to client ← write complete from client's perspective
│
├─ 3. MemTable reaches threshold (e.g. 64 MB) → become Immutable MemTable
│ New MemTable created for incoming writes
│
└─ 4. Background: flush Immutable MemTable → SSTable on disk (Level 0)
└─ SSTable = sorted key-value pairs + sparse index + bloom filter
Levels on disk:
L0: [SST_a][SST_b][SST_c] ← freshly flushed, may overlap, ~4 files
L1: [─────────────SST──────────] ← compacted, non-overlapping, 10× larger
L2: [────────────────────────────] ← 10× larger than L1
L3: [────────────────────────────]
Compaction: Merge SSTables from L0 → L1 → L2. Deduplicates keys (keep newest version), removes tombstones. Keeps each level sorted and non-overlapping (except L0).
Write amplification: Each byte may be rewritten O(L) times across levels (typically 10–30×). Trade: sequential I/O vs B+ tree's random I/O → still much faster writes.
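The write path above can be sketched with plain containers standing in for the real structures. All names and the 4-entry flush threshold are illustrative:

```python
FLUSH_THRESHOLD = 4     # real engines flush at tens of MB, not 4 entries

wal = []                # stands in for the append-only log file
memtable = {}           # stands in for a sorted skip list / red-black tree
sstables = []           # newest first; each is a sorted list of (key, value)

def put(key, value):
    wal.append((key, value))    # 1. sequential append for crash safety
    memtable[key] = value       # 2. in-memory write → client sees OK here
    if len(memtable) >= FLUSH_THRESHOLD:
        flush()

def flush():
    # 3-4. freeze the memtable and write it out as a sorted, immutable run
    sstables.insert(0, sorted(memtable.items()))
    memtable.clear()
    wal.clear()                 # flushed entries no longer need the WAL

for i in range(5):
    put(f"k{i}", i)
# one SSTable flushed (k0..k3); k4 sits in the fresh memtable
```

Note that the client's write finishes at step 2; the flush (and later compaction) happens in the background, which is exactly why the foreground path is all sequential I/O.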
Read Path (LSM Tree)
GET key=42
│
├─ 1. Check MemTable ────────────────────── HIT → return (nanoseconds)
│
├─ 2. Check Immutable MemTable ─────────── HIT → return
│
├─ 3. For each SSTable, newest first (L0 → L1 → ...):
│ a. Bloom Filter
│ NEGATIVE → key definitely absent → SKIP (save disk I/O)
│ POSITIVE → maybe present → continue
│ b. Binary search sparse index → find block offset
│ c. Check block cache (buffer pool)
│ HIT → read from RAM
│ MISS → physical disk read (see path below)
│ d. Scan block for key
│ FOUND → return value (check if tombstone)
│ NOT FOUND → continue to next SSTable
│
└─ 4. Not found anywhere → key does not exist
Worst case read: touches all SSTables across all levels. Bloom filters and block cache make this rare in practice.
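The read path above, minus bloom filters and the sparse index, reduces to a newest-first scan that honors tombstones. A minimal sketch with dicts standing in for the sorted runs:

```python
TOMBSTONE = object()   # marker value written by deletes

def get(key, memtable, sstables):
    """sstables: list of dicts, newest first (a dict stands in for one run)."""
    for source in [memtable, *sstables]:
        if key in source:
            value = source[key]
            return None if value is TOMBSTONE else value   # deleted vs found
    return None                                            # checked everything

memtable = {"a": 1}
sstables = [{"b": TOMBSTONE}, {"b": 2, "c": 3}]   # newer run shadows older
assert get("a", memtable, sstables) == 1     # served from memtable
assert get("b", memtable, sstables) is None  # tombstone in newer run wins
assert get("c", memtable, sstables) == 3     # found in the oldest run
```

The newest-first order is what makes both overwrites and deletes correct: the first source that mentions the key is authoritative, even if older runs still contain stale versions.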
Physical Read Path (Cache Miss Detail)
Application: get(key)
│
▼
DB Block Cache (LRU in-memory, e.g. 8GB RocksDB block cache)
├─ HIT → return block ────────────────────────────── ~100ns
└─ MISS
│
▼
OS Page Cache (kernel buffer cache, typically GBs of RAM)
├─ PAGE HIT → copy to user-space ──────────── ~1µs
└─ PAGE MISS → issue block device request
│
▼
Block Device (SSD / NVMe / HDD)
│ read 4KB aligned block from physical location
│ NVMe SSD → ~50–100µs
│ HDD → ~5–10ms (seek + rotational latency)
│
│ OS read-ahead: prefetch N sequential blocks speculatively
│ (helpful for range scans; wasteful for random point lookups)
│
▼
Data returns up the stack:
disk → OS page cache → DB block cache → storage engine → app
Cache hit ratio matters above all else.
- 99% hit → 1 in 100 requests hits disk → fast
- 95% hit → 5 in 100 requests hit disk → 5× more I/O → noticeable degradation
- 80% hit → catastrophic for a latency-sensitive system
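The degradation is easy to quantify with a weighted average. A back-of-envelope sketch, assuming ~100 ns for a cache hit and ~100 µs for an NVMe miss (illustrative numbers):

```python
HIT_NS, MISS_NS = 100, 100_000   # assumed: RAM hit vs NVMe cache miss

def avg_latency_ns(hit_rate):
    """Expected read latency as a hit-rate-weighted average."""
    return hit_rate * HIT_NS + (1 - hit_rate) * MISS_NS

assert round(avg_latency_ns(0.99)) == 1_099    # ~1.1 µs average
assert round(avg_latency_ns(0.80)) == 20_080   # ~20 µs: roughly 18× worse
```

Because a miss is ~1000× slower than a hit, the miss term dominates: going from 99% to 80% hit rate multiplies average latency by roughly the ratio of miss rates (20× more misses).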
LSM Compaction Strategies
| Strategy | How | Best for |
|---|---|---|
| Size-tiered (default Cassandra) | Merge SSTables of similar size together | Write-heavy; fewer compactions |
| Leveled (RocksDB default) | Strict level sizes; L(n+1) = 10× L(n) | Read-heavy; bounded read amplification |
| FIFO | Evict oldest SSTables when size limit hit | Time-series, logs; no merging |
| Tiered + Leveled | Tiered at top levels, leveled at bottom | Hybrid workloads (RocksDB universal) |
Examples
| DB | Notes |
|---|---|
| RocksDB | Facebook; underlying engine for TiKV, CockroachDB, MyRocks |
| Cassandra | LSM per node; pluggable compaction |
| LevelDB | Google's reference implementation |
| HBase | LSM on top of HDFS |
| ScyllaDB | C++ reimplementation of Cassandra |
Indexing Strategies
Clustered Index
Data rows physically stored in index order. The index IS the table.
Clustered index on user_id:
Leaf page 1: [id=1, name=Alice, ...] [id=2, name=Bob, ...] [id=3, ...]
Leaf page 2: [id=4, ...] ...
Range scan WHERE user_id BETWEEN 1 AND 100:
→ sequential leaf page reads → very fast (data is contiguous on disk)
- Only one clustered index per table (data can only be in one physical order).
- InnoDB: clustered on primary key by default.
- PostgreSQL: heap-based (unordered); the CLUSTER command reorders the table once, but the order is not maintained afterwards.
Secondary Index
A separate B+ tree whose leaf nodes store a pointer back to the primary row.
Secondary index on email:
Leaf: [email="alice@x.com" → PK=42] [email="bob@x.com" → PK=7] ...
Query: SELECT * WHERE email="alice@x.com"
1. Lookup secondary index → get PK=42
2. Lookup clustered index with PK=42 → fetch full row ← "key lookup" / double read
- Index-only scan / covering index: if query only needs columns inside the index, skip step 2.
- In LSM (Cassandra): secondary indexes are a separate local SSTable structure per node.
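The "key lookup" double read can be sketched with two dicts standing in for the two B+ trees. Structures and data are illustrative:

```python
# Clustered index: PK → full row. Secondary index: email → PK (not the row).
clustered = {42: {"id": 42, "email": "alice@x.com", "name": "Alice"}}
secondary = {"alice@x.com": 42}

def find_by_email(email):
    pk = secondary.get(email)     # read 1: secondary index lookup
    if pk is None:
        return None
    return clustered.get(pk)      # read 2: clustered index ("key lookup")

assert find_by_email("alice@x.com")["name"] == "Alice"
```

A covering index removes read 2 by storing the needed columns directly in the secondary leaf, trading index size for one less lookup.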
Hash Index
In-memory hash map: key → byte offset on disk. Used by Bitcask (Riak's engine).
hash("alice") → offset 4096
hash("bob") → offset 8192
- O(1) point lookup.
- No range queries — hash has no ordering.
- Entire hash map must fit in RAM.
- Used by: Bitcask, Redis (main keyspace is one big hash table), InnoDB adaptive hash index (auto-built for hot B+ tree pages).
DDIA p.73: "The hash table must fit in memory... Also, range queries are not efficient."
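A Bitcask-style engine fits in a few lines: an append-only data file plus an in-memory dict mapping each key to the byte offset of its latest record. A minimal sketch (an in-memory buffer stands in for the file; the record format is illustrative):

```python
import io

log = io.BytesIO()   # stands in for the append-only data file on disk
index = {}           # key → byte offset of the latest version of that key

def put(key, value):
    log.seek(0, io.SEEK_END)          # always append at the end of the file
    index[key] = log.tell()           # remember where this version starts
    log.write(f"{key},{value}\n".encode())   # old versions stay behind

def get(key):
    offset = index.get(key)
    if offset is None:
        return None                   # O(1) negative lookup, no disk touch
    log.seek(offset)                  # one seek + one read per lookup
    _, value = log.readline().decode().rstrip("\n").split(",", 1)
    return value

put("alice", "v1")
put("alice", "v2")                    # overwrite = append + index update
assert get("alice") == "v2"
```

Superseded records ("alice,v1" here) are reclaimed by periodic compaction of the log file, which is also why the whole index must fit in RAM: it is the only map from key to location.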
Composite Index
Index on multiple columns (a, b, c) — left-prefix rule:
Effective for: (a), (a,b), (a,b,c)
Useless for: (b), (c), (b,c) ← can't skip leading column
Column order: most selective or most frequently filtered column leftmost.
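The left-prefix rule falls out of the physical layout: a composite index is one list of tuples sorted column by column, so binary search only works when the leading column is fixed. A minimal sketch:

```python
import bisect

# index on (a, b): one sorted list, ordered by a first, then b
index = sorted([(1, "x"), (1, "y"), (2, "x"), (3, "z")])

# WHERE a = 2 → the matching rows are contiguous; binary search works
lo = bisect.bisect_left(index, (2,))
hi = bisect.bisect_left(index, (3,))
assert index[lo:hi] == [(2, "x")]

# WHERE b = "x" → matches are scattered across the sort order, so the
# engine has nothing to binary search and must scan the whole index
matches = [row for row in index if row[1] == "x"]
assert matches == [(1, "x"), (2, "x")]
```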
LSM Indexing: Local vs Global (Distributed)
Relevant in distributed LSM stores (Cassandra, HBase, DynamoDB).
Local (Secondary) Index
Each node indexes only the data it owns.
Node 1 owns keys A–M → local secondary index covers A–M only
Node 2 owns keys N–Z → local secondary index covers N–Z only
Write on Node 1: update local index only → fast, no coordination
Read on secondary key "email":
→ scatter request to ALL nodes
→ each checks its local index
→ coordinator merges results
- Write: fast (local, no coordination).
- Read: scatter-gather → fan-out to every node → higher latency, more load.
- Used by: Cassandra secondary indexes.
Global Index
A separate index structure, itself partitioned across nodes, independent of data partitioning.
Data partitioned by user_id (hash).
Global index on email, partitioned by hash(email).
Write user_id=42, email="alice@x.com":
→ write data to Node A (hash(42))
→ write index entry to Node B (hash("alice@x.com")) ← async or 2-phase
Read WHERE email="alice@x.com":
1. Route to index node for hash("alice@x.com") → get user_id=42 ← 1 hop
2. Route to data node for hash(42) → return row ← 1 hop
- Write: must update data node + index node → distributed write (2PC or async, risk of inconsistency).
- Read: 2 targeted hops → faster than scatter-gather.
- Used by: DynamoDB GSI, Google Spanner, CockroachDB.
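The two-hop pattern can be sketched with lists of dicts standing in for partitions. All structures and the crc32 partitioner are illustrative:

```python
import zlib

N = 4
data_nodes = [dict() for _ in range(N)]    # partitioned by hash(user_id)
index_nodes = [dict() for _ in range(N)]   # partitioned by hash(email)

def node_for(key):
    # deterministic stand-in for the cluster's partitioner
    return zlib.crc32(str(key).encode()) % N

def write(user_id, email, row):
    data_nodes[node_for(user_id)][user_id] = row    # write 1: data partition
    index_nodes[node_for(email)][email] = user_id   # write 2: index partition

def read_by_email(email):
    user_id = index_nodes[node_for(email)].get(email)   # hop 1: index lookup
    if user_id is None:
        return None
    return data_nodes[node_for(user_id)].get(user_id)   # hop 2: data lookup

write(42, "alice@x.com", {"name": "Alice"})
assert read_by_email("alice@x.com") == {"name": "Alice"}
```

The sketch does both writes synchronously; real systems make that a distributed transaction or an async update, which is exactly where the inconsistency risk mentioned above comes from.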
Range Queries in LSM
WHERE key BETWEEN a AND b:
1. MemTable: iterate from a to b (skip list → O(log n) to find a, then scan)
2. For each SSTable:
a. Bloom filter useless (only checks point existence)
b. Sparse index: find first block where key ≥ a
c. Scan blocks sequentially until key > b
3. K-way merge across all sources, deduplicate by version timestamp
Range queries are worse in LSM than B+ tree because:
- L0 SSTables may overlap → must check all of them.
- K-way merge overhead across levels.
- After compaction, L1+ are sorted + non-overlapping → range scans are efficient there.
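Step 3 of the range-scan procedure (the k-way merge with version deduplication) can be sketched with heapq. Record shapes are illustrative; each source must already be sorted by key:

```python
import heapq

def range_scan(sources, lo, hi):
    """sources: lists of (key, version, value) records, each sorted by key.
    Returns the newest version of every key in [lo, hi]."""
    # merge into one stream ordered by key, newest version first per key
    merged = heapq.merge(*sources, key=lambda rec: (rec[0], -rec[1]))
    result, last_key = [], None
    for key, version, value in merged:
        if key < lo or key > hi or key == last_key:
            continue                  # out of range, or an older duplicate
        result.append((key, value))   # first record per key = newest version
        last_key = key
    return result

memtable = [("b", 3, "B-new")]
sstable1 = [("a", 1, "A"), ("b", 1, "B-old")]
assert range_scan([memtable, sstable1], "a", "b") == [("a", "A"), ("b", "B-new")]
```

Sorting by `(key, -version)` is what makes deduplication a one-pass "keep the first record per key" rule instead of buffering all versions.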
MVCC (Multi-Version Concurrency Control)
DDIA p.239: "A key principle of snapshot isolation is readers never block writers, and writers never block readers."
The Problem
Without MVCC: reader takes a shared lock → blocks writer. Writer takes exclusive lock → blocks reader. Serialized access kills throughput.
How It Works
Every write creates a new version of a row tagged with a transaction ID (txn_id). Old versions are kept until no active transaction needs them.
Initial state:
key="balance" → [txn_id=1, value=100]
T2 starts read (takes snapshot at txn_id=2):
sees value=100
T3 writes balance=200 (txn_id=3):
key="balance" → [txn_id=3, value=200] ← new version
[txn_id=1, value=100] ← old version kept
T2 still reading:
its snapshot = txn_id=2 → ignores txn_id=3 version → sees 100 ← consistent!
T2 commits → txn_id=1 version now eligible for GC (VACUUM in PostgreSQL)
Visibility Rule
Transaction with snapshot_id = S sees version V if:
V.txn_id ≤ S (the version was committed before the snapshot), and
V is not superseded by another version with txn_id ≤ S.
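In other words, a reader with snapshot S sees the newest committed version whose txn_id is ≤ S. A minimal sketch with versions as illustrative (txn_id, value) tuples:

```python
def visible_version(versions, snapshot_id):
    """versions: list of (txn_id, value); returns the value visible at
    snapshot_id, or None if the row did not exist yet."""
    eligible = [(txn, val) for txn, val in versions if txn <= snapshot_id]
    if not eligible:
        return None                   # every version is newer than the snapshot
    return max(eligible)[1]           # newest version not after the snapshot

versions = [(1, 100), (3, 200)]       # T3 wrote 200 over T1's 100
assert visible_version(versions, 2) == 100   # T2's snapshot ignores txn_id=3
assert visible_version(versions, 3) == 200   # T3 and later see the new value
```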
MVCC in B+ Tree (PostgreSQL)
Heap page stores multiple tuples (versions) for the same logical row:
[xmin=1, xmax=3, value=100] ← visible to txns 1–2; dead after T3
[xmin=3, xmax=∞, value=200] ← visible to txns ≥ 3; current
Index still points to both heap tuples.
VACUUM: removes dead tuples, reclaims space, updates index pointers.
HOT update (Heap-Only Tuple): if updated columns are not in any index →
new tuple on same page → no index update needed → faster writes.
MVCC in LSM (RocksDB / Cassandra)
LSM is naturally multi-versioned — every write is an append:
key="balance" @ts=3 → 200 ← newest
key="balance" @ts=1 → 100 ← older version
Read at snapshot ts=2:
scan SSTables newest-first, find latest version with ts ≤ 2 → returns 100
Compaction = GC:
when no active snapshot needs ts=1 → drop it during compaction
Isolation Levels via MVCC
| Level | What you see | Anomalies prevented |
|---|---|---|
| Read Uncommitted | Latest (even uncommitted) | Nothing |
| Read Committed | Latest committed per statement | Dirty reads |
| Repeatable Read | Snapshot at txn start | Dirty + non-repeatable reads |
| Serializable | As if serial execution | All anomalies (phantom reads etc.) |
MVCC naturally provides Snapshot Isolation (Repeatable Read). True Serializable requires SSI — Serializable Snapshot Isolation (PostgreSQL uses this since 9.1).
B+ Tree vs LSM Tree — Summary
| Dimension | B+ Tree | LSM Tree |
|---|---|---|
| Write I/O | Random (in-place page updates) | Sequential (append WAL + flush) |
| Read I/O | Low (tree height = 3–4 pages) | Medium–High (multi-level SSTable scan) |
| Write amplification | Medium | High (compaction rewrites data L× times) |
| Read amplification | Low | Medium (bloom filters + cache reduce it) |
| Space amplification | Low | Medium (stale versions until compaction) |
| Range scans | Excellent (leaf linked list) | Good at lower levels; worse at L0 |
| Compaction overhead | None | Background CPU + I/O (can spike latency) |
| Best for | Read-heavy OLTP, mixed workloads | Write-heavy: logs, time-series, events |
| Examples | PostgreSQL, MySQL, SQLite, Oracle | Cassandra, RocksDB, HBase, LevelDB |
DDIA p.84: "As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads. Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction."
Read/Write Path Diagrams
B+ Tree Write Path
Client write
│
▼
WAL append ──────────────────────────────────────► Disk (sequential)
│
▼
Buffer Pool (in-memory pages)
┌─ page already in pool? ─┐
│ YES │ NO
▼ ▼
dirty-mark read page from disk into pool (random I/O)
│ │
└──────────┬──────────────┘
▼
Modify leaf page in pool
│
▼
Page full? ──YES──► SPLIT: allocate new page, write both ──► Disk (random)
│
NO
▼
Background page flusher ──────────────────────────────────► Disk (random)
LSM Tree Write Path
Client write
│
▼
WAL append ──────────────────────────────────────► Disk (sequential, fast)
│
▼
MemTable (sorted skip list, in-memory)
return OK to client ← write done
│
▼ (64MB threshold)
Immutable MemTable ──background flush──► SSTable L0 ──► Disk (sequential)
│
▼ (L0 has ≥4 files)
Compaction: L0 → L1 ──► Disk (sequential)
Compaction: L1 → L2 ──► Disk
LSM Tree Read Path
Client read (GET key)
│
▼
MemTable ──HIT──► return (~100ns)
│
MISS
▼
Immutable MemTable ──HIT──► return
│
MISS
▼
SSTable L0 (newest first, may overlap → check all):
│ Bloom Filter ──NEGATIVE──► SKIP
└─ POSITIVE
│ Sparse Index → block offset
▼
DB Block Cache ──HIT──► return (~100ns)
│
MISS
▼
OS Page Cache ──HIT──► return (~1µs)
│
MISS
▼
Disk (NVMe ~50µs, HDD ~10ms)
read-ahead: prefetch adjacent blocks
│
▼
populate OS page cache → DB block cache → return value
NOT FOUND in L0? → repeat for L1 → L2 → ... → key does not exist