Database Indexes Deep Dive — From B-Trees to LSM Trees

Database Indexes Deep Dive — From B-Trees to LSM Trees

Indexes are the single biggest lever you have over query performance. Pick the wrong one and a query that should take milliseconds scans millions of rows. Pick the right one and you can serve results directly from the index without touching the table at all.

This post is split into two parts. Part 1 covers every major B-Tree index type in MySQL and PostgreSQL. Part 2 covers LSM-based engines — how Cassandra and CockroachDB index data differently, and what that means for schema design.


Part 1: B-Tree Based Indexes


1. Hash Index

A hash index applies a hash function to the indexed column and stores hash(key) → row location in a hash map.

key     hash(key)   location
─────────────────────────────
"alice"  0x3f2a  →  page 7, slot 3
"bob"    0x1c9d  →  page 2, slot 1

Strengths

  • O(1) exact-match lookups — as fast as a lookup can get
  • Simple structure, low per-entry overhead

Weaknesses

  • Range queries are impossible — WHERE age > 30 has to scan everything because hashing destroys order
  • The entire hash map must fit in memory; on-disk hash maps suffer heavy random I/O with no data locality
  • No support for ORDER BY or prefix queries

Where you see it

  • PostgreSQL supports CREATE INDEX ... USING HASH but recommends B-Tree for almost everything
  • MySQL's MEMORY engine uses hash indexes; InnoDB has an adaptive hash index that builds itself automatically for frequently accessed B-Tree pages
  • Redis, Memcached — their entire data model is a hash map

When to use it Only when you exclusively need exact-match equality on a high-read, in-memory table and you are certain range queries will never be needed.


2. B-Tree Index

B-Trees (and their variant B+ Trees, which databases almost universally use) are the default index structure for a reason: they balance read performance, write performance, and disk efficiency simultaneously.

                    [30 | 60]
                   /    |    \
           [10|20]   [40|50]   [70|80]
           /  |  \   /  |  \   /  |  \
          ...leaf nodes with pointers to rows...

In a B+ Tree, all data lives in the leaf nodes, and leaves are linked in a doubly-linked list — enabling efficient range scans by following the chain without returning to the root.

Complexity

  • Search, insert, delete: O(log n)
  • Range scan after initial seek: O(k) where k is the number of results

Why B-Trees work well on disk

  • Tree nodes are sized to match disk page boundaries (typically 8 KB or 16 KB), so one I/O fetches one full node
  • Branching factor of hundreds means a table with millions of rows needs only 3–4 levels, so 3–4 disk reads per lookup
  • The OS page cache efficiently caches the upper levels (root and first few levels) — in practice most lookups touch disk only at the leaf

Good for

  • WHERE id = 42 — equality
  • WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31' — range
  • ORDER BY last_name — sorted scan
  • WHERE status = 'active' ORDER BY created_at — combined equality + sort

Not ideal for

  • Write-heavy workloads on large tables: every insert may require a page split and rebalance
  • Highly random primary keys (UUIDs): cause frequent page splits because new keys land in arbitrary positions rather than at the end

3. Clustered vs Non-Clustered Index

This distinction is about where the actual row data lives relative to the index.

Clustered Index

The table's rows are physically stored sorted by the clustered index key. The leaf nodes of the B-Tree are the rows.

Leaf node:
┌─────────────────────────────────────────────┐
│ key=42 │ name="Alice" │ email="..." │ age=29 │  ← full row
└─────────────────────────────────────────────┘
  • InnoDB (MySQL): always uses a clustered primary key. If you don't define one, InnoDB silently creates a hidden 6-byte rowid column.
  • SQL Server: the clustered index key is usually the primary key by default.
  • PostgreSQL: does NOT cluster automatically. You can run CLUSTER table USING index but the table drifts back to heap order as rows are inserted/updated.

UUID primary keys and clustered indexes Because UUIDs have no natural order, each insert lands at a random position in the B-Tree, triggering frequent page splits and bloating the index. On high-write tables this causes measurable write amplification. Solutions:

  • Use BIGINT AUTO_INCREMENT (MySQL) or BIGSERIAL (PostgreSQL)
  • Use UUIDv7 (time-ordered UUID) if you need a UUID
  • Use gen_random_uuid() in PostgreSQL and accept the overhead if global uniqueness matters more than write speed

Non-Clustered (Secondary) Index

The leaf nodes store the indexed column(s) plus a pointer to the actual row.

MySQL InnoDB secondary indexes The pointer is the primary key value, not a physical disk offset. This means every secondary index lookup performs a double lookup:

Query:  WHERE email = 'alice@example.com'

Step 1: secondary index → finds primary key = 42
Step 2: primary (clustered) index → fetches full row using pk=42

The upside: when a row moves (due to page splits), secondary indexes don't need updating — they still point to the same primary key.

PostgreSQL heap indexes Every index (including the primary key) stores a ctid (physical tuple ID: page + slot). A single index lookup goes directly to the heap page.

Query:  WHERE email = 'alice@example.com'

Step 1: index → finds ctid = (page=7, slot=3)
Step 2: heap → reads row at that physical location

The downside: PostgreSQL's MVCC creates a new row version on every UPDATE. All indexes pointing to that row must be updated to the new ctid. This is called index bloat and is why VACUUM is so important in PostgreSQL.


4. Covering Index

A covering index includes all columns a query needs, so the query engine can answer it entirely from the index without touching the table.

-- PostgreSQL / MySQL 8+
CREATE INDEX idx_orders_user_status ON orders(user_id, status) INCLUDE (total_amount, created_at);

-- MySQL older syntax
CREATE INDEX idx_orders_user_status ON orders(user_id, status, total_amount, created_at);

For the query:

SELECT total_amount, created_at FROM orders WHERE user_id = 5 AND status = 'shipped';

The index has everything needed — the engine never reads the table. This is called an index-only scan (PostgreSQL) or covering index scan (MySQL).

Rules of thumb

  • Add frequently-queried columns with INCLUDE rather than as key columns (they don't need to be sorted)
  • Don't add large TEXT, JSON, or BLOB columns — they bloat the index and negate the benefit
  • Check query plans for Index Only Scan (PostgreSQL) or Using index (MySQL EXPLAIN)

5. Composite Index

A composite (multi-column) index concatenates multiple column values into a single index key, sorted left-to-right.

CREATE INDEX idx_users_name ON users(last_name, first_name);

Internally the keys look like:

[Brown | Alice]
[Brown | Bob  ]
[Smith | Alice]
[Smith | Bob  ]

The left-prefix rule: the index can be used for any query that filters on a left-aligned subset of the columns in order.

WHERE last_name = 'Smith'                            ✓  left-prefix
WHERE last_name = 'Smith' AND first_name = 'Alice'  ✓  full key
ORDER BY last_name, first_name                       ✓  sorted scan

WHERE first_name = 'Alice'                           ✗  not left-prefix
WHERE first_name = 'Alice' AND last_name = 'Smith'  ✓  optimizer can reorder equality filters

Column ordering strategy

  • Put the highest-cardinality column first if queries filter on it alone
  • Put equality filter columns before range filter columns (a range breaks further index use)
  • WHERE status = 'active' AND created_at > '2024-01-01' — index on (status, created_at) works; (created_at, status) would use created_at as a range and skip the status filter

6. Partial Index

A partial index covers only a subset of rows matching a predicate. Smaller index, faster scans for the targeted query pattern.

-- PostgreSQL
CREATE INDEX idx_orders_pending ON orders(created_at) WHERE status = 'pending';

-- MySQL 8+ (using expression)
CREATE INDEX idx_active_users ON users(email) WHERE deleted_at IS NULL;

The index for idx_orders_pending only contains rows where status = 'pending', which might be 1% of the table. Queries filtering on pending orders will scan a dramatically smaller index.

Use cases

  • Soft-delete patterns: index only non-deleted rows
  • Status queues: index only unprocessed/active records
  • Sparse columns: index only rows where an optional column is not null

Part 2: LSM-Based Indexes

Traditional B-Tree indexes optimize for reads. LSM (Log-Structured Merge-tree) flips this: optimize writes first, accept some read overhead. This is the foundation of Cassandra, RocksDB (used by CockroachDB, TiKV, and others).

1. How LSM Works

Write path:
  client write
      │
      ▼
  WAL (Write-Ahead Log) ─── crash recovery
      │
      ▼
  MemTable (in-memory sorted structure, usually a skip-list or red-black tree)
      │  when full (e.g., 64 MB)
      ▼
  SSTable L0 (Sorted String Table, immutable, written sequentially)
      │  compaction
      ▼
  SSTable L1 → L2 → L3 ... (larger, fewer files per level)

Every write is sequential: WAL append + MemTable insert. No random I/O on the write path, which is why LSM engines achieve high write throughput on spinning disks and even SSD (less write amplification from random seeks).

Read path challenge: a key may exist in the MemTable, or any of the SSTable levels. A naive read must check all of them.

Bloom filters — each SSTable carries a bloom filter: a probabilistic bit-array that answers "is this key definitely NOT in this SSTable?" with zero false negatives. In practice, a read that misses the MemTable checks the bloom filter of each SSTable; most SSTables respond "definitely not here" in microseconds without a disk read.

Compaction merges SSTables, removes deleted/overwritten versions, and keeps the number of files bounded. This is where LSM's read amplification is paid down.


2. Cassandra Indexes

Cassandra's data model is built around LSM trees (via its own implementation, and later with RocksDB-backed tables in some forks).

Primary Key = Partition Key + Clustering Key

The partition key determines which node holds the data (consistent hashing on the ring). The clustering key determines the sort order of rows within a partition.

CREATE TABLE orders (
  user_id   UUID,
  order_id  TIMEUUID,
  status    TEXT,
  total     DECIMAL,
  PRIMARY KEY (user_id, order_id)
) WITH CLUSTERING ORDER BY (order_id DESC);
  • user_id = partition key → all of a user's orders land on the same node(s)
  • order_id = clustering key → rows are physically sorted by order_id DESC within each partition

Efficient queries:

-- Uses partition key to route to the right node, clustering key for range
SELECT * FROM orders WHERE user_id = ? AND order_id > ? LIMIT 20;

Inefficient query (full cluster scan):

-- No partition key → must ask every node
SELECT * FROM orders WHERE status = 'pending';

Secondary Index (Local Index)

Cassandra's built-in secondary index creates a hidden table on each node, indexing only the rows that node holds locally.

CREATE INDEX ON orders(status);
SELECT * FROM orders WHERE user_id = ? AND status = 'pending';  -- fast (partition key + secondary)
SELECT * FROM orders WHERE status = 'pending';                   -- slow (coordinator queries all nodes)

Secondary indexes without a partition key restriction cause scatter-gather across the entire cluster. Avoid them for high-cardinality columns (e.g., email) or columns with very low cardinality that return huge result sets (e.g., country).

SASI Index (SSTable Attached Secondary Index)

SASI is stored alongside SSTables rather than in a separate hidden table. It supports:

  • Prefix queries: WHERE name LIKE 'Alice%'
  • Range queries on non-primary-key columns
  • Inequality queries: WHERE age > 30
CREATE CUSTOM INDEX ON users(name) USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {'mode': 'PREFIX'};

SELECT * FROM users WHERE user_id = ? AND name LIKE 'Ali%';

Materialized Views

For query patterns that require a different partition key, a materialized view maintains a separate denormalized copy sorted by the new key.

CREATE MATERIALIZED VIEW orders_by_status AS
  SELECT * FROM orders WHERE status IS NOT NULL AND user_id IS NOT NULL AND order_id IS NOT NULL
  PRIMARY KEY (status, user_id, order_id);

Writes to orders automatically propagate to the view. Reads on status now have a partition key.


3. CockroachDB Indexes

CockroachDB presents a PostgreSQL-compatible SQL interface but its storage engine is Pebble (a Go implementation of RocksDB), an LSM-based engine. The index semantics are SQL-standard B-Tree but the on-disk format is LSM.

How SQL Indexes Map to LSM Keys

CockroachDB encodes every SQL row as a key-value pair in the underlying KV store. The key encodes the table ID, index ID, and index column values in a lexicographically sortable encoding. Because LSM SSTables are sorted by key, a range scan on an indexed column becomes a sequential read in the KV layer — exactly as efficient as a B-Tree range scan.

Index key encoding (simplified):
  /TableID/IndexID/col1_value/col2_value/... → row data

Primary and Secondary Indexes

CREATE TABLE orders (
  id         UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  user_id    INT,
  status     STRING,
  created_at TIMESTAMP DEFAULT now()
);

CREATE INDEX idx_orders_user ON orders(user_id, created_at DESC);
CREATE INDEX idx_orders_status ON orders(status) WHERE status != 'completed';  -- partial index

The primary index stores the full row. Secondary indexes store the indexed columns plus the primary key, using the double-lookup pattern (similar to InnoDB) to fetch remaining columns.

LSM Advantages in CockroachDB

  • Distributed writes: each node writes to its local RocksDB/Pebble store. LSM's sequential write path means individual node write throughput is high.
  • MVCC via LSM: CockroachDB implements MVCC by encoding the timestamp into the key. Multiple versions of the same row coexist as separate KV entries. LSM naturally accumulates these versions; compaction garbage-collects old ones past the GC window.
  • Bloom filters: each SSTable level carries bloom filters. Point lookups (WHERE id = ?) check bloom filters before disk reads, achieving O(1) amortized lookup even across multiple LSM levels.

Compaction and Query Performance

LSM levels in Pebble/RocksDB:
  L0 (4 files, recently flushed MemTables, may overlap)
  L1 (10 MB, no overlap within level)
  L2 (100 MB)
  L3 (1 GB)
  ...

Read amplification without bloom filter: check every level
Read amplification with bloom filter:    ~1 file (bloom eliminates the rest)

Range scans are more expensive than point lookups in LSM because bloom filters only help with exact-key checks. CockroachDB mitigates this with iterator caching and block cache (similar to the OS page cache for B-Trees), keeping hot SSTable blocks in memory.



Other Index Types

B-Tree covers the vast majority of application indexing needs. For specific use cases, databases provide purpose-built index types — reach for these only when the use case clearly fits.

Full-Text Index — for keyword search inside text columns. LIKE '%keyword%' cannot use a B-Tree; a full-text index uses an inverted list (maps each word to the rows containing it) to answer these queries efficiently.

-- PostgreSQL: GIN index on a tsvector
CREATE INDEX idx_articles_fts ON articles USING GIN(to_tsvector('english', body));
SELECT * FROM articles WHERE to_tsvector('english', body) @@ to_tsquery('database & index');

-- MySQL: FULLTEXT index
ALTER TABLE articles ADD FULLTEXT INDEX idx_fts (title, body);
SELECT * FROM articles WHERE MATCH(title, body) AGAINST('+database -slow' IN BOOLEAN MODE);

Spatial Index — for geographic proximity queries. A B-Tree cannot answer "find all stores within 10 km" because it has no notion of 2D distance. Spatial indexes (R-Tree in MySQL, GiST+PostGIS in PostgreSQL) store minimum bounding rectangles at each node and prune entire subtrees that can't possibly be within range.

-- MySQL
CREATE SPATIAL INDEX idx_location ON stores(location);
SELECT * FROM stores WHERE ST_Distance_Sphere(location, ST_GeomFromText('POINT(40.7 -74.0)')) < 10000;

-- PostgreSQL + PostGIS
CREATE INDEX idx_stores_geom ON stores USING GIST(geom);
SELECT * FROM stores WHERE ST_DWithin(geom, ST_MakePoint(-74.0, 40.7)::geography, 10000);

Choosing the Right Index — Quick Reference

Pattern Index type Engine
Exact equality B-Tree or Hash All
Range queries, ORDER BY B-Tree All
Sparse column (few non-null rows) Partial B-Tree PostgreSQL / MySQL 8+
All-column queries from index Covering / INCLUDE All
Keyword search in text Full-Text (GIN / FULLTEXT) PostgreSQL / MySQL
Geospatial proximity Spatial (GiST / R-Tree) PostgreSQL / MySQL
Distributed write-heavy workload LSM (partition key design) Cassandra
SQL on distributed LSM CockroachDB indexes CockroachDB

Key Takeaways

  • B-Tree is the default for relational databases. It handles equality, range, sort, and prefix queries with O(log n) performance and plays nicely with disk page caches.
  • Clustered indexes store rows in index order. MySQL InnoDB always clusters on the primary key; PostgreSQL does not cluster automatically.
  • Covering indexes eliminate table lookups entirely — the cheapest read is one that never touches the table.
  • Composite indexes follow the left-prefix rule: put equality columns first, range columns last.
  • LSM engines (Cassandra, CockroachDB) achieve high write throughput through sequential writes. Bloom filters make point reads efficient. Range reads require careful key design or secondary indexes.
  • Partition key design in Cassandra is your index. Getting it wrong means full cluster scans on every query.