Data Structures in Databases
Skip List
Structure: multiple layers of linked lists; each layer skips over more elements.
L3: 1 ─────────────────────── 50 ─────────── 100
L2: 1 ──────── 20 ─────────── 50 ─── 75 ──── 100
L1: 1 ── 10 ── 20 ── 30 ──── 50 ── 60 ── 75 ─ 100
L0: 1 ─ 5 ─ 10 ─ 15 ─ 20 ─ 25 ─ 30 ... (all elements)
- Insert/search/delete: O(log n) average
- Lock-free concurrent variant: compare-and-swap on each level pointer
- Ordered → supports range scans naturally
- Simpler to implement than B-trees for in-memory use
Used in: RocksDB/LevelDB MemTable, Redis sorted sets (ZSET), CockroachDB/TiKV in-memory write buffer, Java ConcurrentSkipListMap
B+ Tree
- All data in leaf nodes; internal nodes are routing keys only
- Leaf nodes linked as doubly-linked list → range scans in O(1) seek + O(k) scan
- Height 3–4 for billions of rows (branching factor 100–1000)
- In-place updates: read page → modify → write page (random I/O)
Used in: PostgreSQL, MySQL InnoDB (clustered), SQLite, Oracle — default index structure for relational DBs
LSM Tree (Log-Structured Merge-Tree)
- MemTable (skip list) → immutable MemTable → SSTables on disk
- Writes always append → sequential I/O
- Reads merge across multiple levels; bloom filters skip irrelevant SSTables
- Compaction: merge-sort SSTables periodically to reclaim space
Used in: RocksDB, LevelDB, Cassandra, HBase, ScyllaDB, TiKV (CF_DEFAULT), InfluxDB
Bloom Filter
Structure: bit array + k hash functions. Test set membership.
- Insert: set k bits at positions
h1(x), h2(x), ..., hk(x) - Query: check all k positions — all set → probably present; any unset → definitely absent
- False positives possible; false negatives impossible
- Space: ~10 bits/element for 1% false positive rate
Used in: RocksDB/Cassandra (skip SSTable reads), BigTable, Akamai CDN cache, Google Chrome safe-browsing, distributed caches, Ethereum (transaction log filtering)
Hash Table
- O(1) average insert/lookup via hash function
- Open addressing (linear/quadratic probing) or chaining
- No ordering → no range queries
- Must fit in memory (or use extendible/linear hashing for disk)
Used in: Redis all data structures, PostgreSQL hash joins, MySQL hash indexes, in-memory join hash tables in any RDBMS, HashMap in query execution engines
Inverted Index
Structure: term → sorted list of document IDs (posting list)
"database" → [doc1, doc5, doc12, doc99]
"engine" → [doc1, doc3, doc12]
- Intersection of posting lists = AND query
- Union = OR query
- TF-IDF / BM25 scores stored alongside doc IDs for ranking
- Compressed posting lists (delta encoding + varint)
Used in: Elasticsearch, Apache Lucene, Solr, PostgreSQL GIN index (full-text search), Sphinx, Meilisearch
R-Tree (Spatial Index)
- Hierarchical bounding rectangles around spatial objects
- Each node = minimum bounding rectangle (MBR) of all children
- Query: intersect query rectangle with MBRs, recurse only into overlapping children
- Supports: point-in-polygon, nearest neighbor, range queries on 2D/3D data
- Variants: R*-tree (better splits), Hilbert R-tree (packed)
Used in: PostGIS (PostgreSQL), MySQL spatial indexes, MongoDB 2d/2dsphere, SQLite R-tree extension, GIS systems (ESRI, QGIS)
Trie / Radix Tree (Patricia Tree)
- Character-by-character key tree; common prefixes share nodes
- Radix tree = compressed trie (skip single-child nodes)
- Lookup: O(m) where m = key length
- Memory-efficient for string keys with shared prefixes
- Supports: prefix search, autocomplete, IP prefix matching
Used in: Linux kernel routing table (LPC-trie), Nginx routing, Redis prefix commands, database dictionary compression, IP CIDR routing (BGP)
Roaring Bitmap
- Compressed bitmap for sets of 32-bit integers
- Key space divided into 2^16 containers; each container is:
- Array (< 4096 elements) → sorted uint16 array
- Bitmap (≥ 4096 elements) → 8KB bitset
- Run-length encoded (long runs)
- Intersection/union: O(n/64) with SIMD
- 2–10× more compact than plain bitmaps
Used in: Elasticsearch (doc filtering), Apache Druid/Spark/Pinot (column bitmaps), ClickHouse, Pilosa, Lucene posting lists
HyperLogLog (HLL)
- Approximate count-distinct with O(log log n) memory
- Error: ~1.6% with 1.5 KB; ~0.8% with 12 KB
- Algorithm: hash each element, observe max leading zeros in hash → estimate cardinality
- Mergeable across shards
Used in: Redis PFADD/PFCOUNT, PostgreSQL pg_hyperloglog, BigQuery COUNT(DISTINCT), Snowflake approximate_count_distinct, Presto/Trino, distributed analytics
Count-Min Sketch
- Approximate frequency counter; 2D array of counters + multiple hash functions
- Insert: increment d positions (one per hash function row)
- Query: return min of d positions → upper bound of true frequency
- Fixed memory regardless of distinct elements; no deletion
Used in: Apache Storm (top-K heavy hitters), network traffic analysis, Apache Flink, database query optimizers (frequency estimation for join order), spam filters
Merkle Tree
- Binary tree where each node = hash of its children
- Leaf nodes = hash of data blocks
- Root hash = fingerprint of entire dataset
- Detect difference between two trees: O(log n) by comparing root, then drilling down
Used in: Git (object store), Bitcoin/Ethereum (transaction trees), ZooKeeper (data verification), Cassandra anti-entropy (comparing replica state), IPFS, Dynamo (replica reconciliation)
T-Digest
- Approximate quantile estimation (median, p99, p999)
- Maintains sorted list of centroids (mean + count); merges nearby centroids
- More accurate at extremes (tails) where precision matters most
- Mergeable across nodes
Used in: Elasticsearch percentile aggregations, Prometheus (histogram_quantile approximation), Netflix Atlas, DataDog, Kibana
Consistent Hash Ring
- Nodes placed on virtual ring (0 to 2^32)
- Key → hash → clockwise walk → first node = owner
- Virtual nodes: each physical node has multiple positions → even distribution
- Node add/remove: only neighboring keys move
Used in: Amazon DynamoDB, Apache Cassandra, Redis Cluster, Memcached (ketama), Akamai CDN routing
Fenwick Tree (Binary Indexed Tree)
- O(log n) prefix sum queries and point updates
- More cache-friendly than segment tree
- Classic use: order statistics, rank computation
Used in: database internal sort statistics, columnar analytics (cumulative distribution), competitive DB benchmarking tools
Summary Table
| Data Structure | Lookup | Range Scan | Space | Used For |
|---|---|---|---|---|
| B+ Tree | O(log n) | O(log n + k) | Medium | Disk indexes, RDBMS |
| Skip List | O(log n) | O(log n + k) | Medium | In-memory write buffer |
| Hash Table | O(1) | ✗ | Low | Point lookup, joins |
| LSM Tree | O(log n)* | O(log n + k)* | + overhead | Write-heavy storage |
| Inverted Index | O(1) lookup | Merge lists | Large | Full-text search |
| R-Tree | O(log n) | O(log n + k) | Medium | Spatial/geo queries |
| Trie/Radix | O(m) | O(m + k) | Low-Med | Prefix, routing |
| Bloom Filter | O(k) | ✗ | Very low | Existence check |
| HyperLogLog | O(1) | ✗ | ~1 KB | Count distinct |
| Count-Min | O(k) | ✗ | Fixed | Frequency estimate |
| Merkle Tree | O(log n) | ✗ | O(n) | Diff/verification |
| Roaring Bitmap | O(1) | O(n/64) | Very low | Set intersection |
*with bloom filter optimization