Algorithms in Databases & Distributed Systems
Sorting
External Merge Sort
- Sort data too large to fit in RAM
- Phase 1: load chunks, sort in-memory → write sorted runs to disk
- Phase 2: k-way merge of runs using min-heap
- I/O: O(N log N / B) where B = buffer size
- Used in: PostgreSQL sort node, MySQL ORDER BY without index, Hadoop MapReduce shuffle, Spark sort-based shuffle, database bulk load
TimSort
- Hybrid merge sort + insertion sort
- Detects and exploits natural runs in data
- O(n log n) worst, O(n) best (already sorted)
- Used in: Python
sorted(), JavaArrays.sort(), Pandas — any in-memory sort in query engines
Introsort
- Hybrid: quicksort + heapsort (fallback on deep recursion) + insertion sort (small arrays)
- O(n log n) guaranteed; cache-friendly in practice
- Used in: C++
std::sort, .NET, query engine in-memory sorts
Searching
Binary Search
- O(log n) on sorted array
- Used everywhere: SSTable sparse index lookup, B-tree key search within page, range boundary detection, PostgreSQL binary search on sorted pages
Interpolation Search
- O(log log n) for uniformly distributed data
- Estimate position:
pos = lo + (key - arr[lo]) / (arr[hi] - arr[lo]) * (hi - lo) - Used in: database index statistics, histogram-based query optimizer estimates
Ternary Search
- Find max/min of unimodal function
- Used in: query plan cost optimization, gradient descent in ML-based query optimizers
Hashing
MurmurHash / xxHash
- Non-cryptographic; fast, good distribution
- MurmurHash3: 128-bit output, ~1 GB/s throughput
- xxHash3: faster (~8–10 GB/s), SIMD-optimized
- Used in: partitioning (Cassandra, Kafka), Bloom filters, hash joins, in-memory hash tables
Consistent Hashing
- Map keys + nodes to a ring; key goes to first clockwise node
- Virtual nodes: each physical node has V positions → even distribution
- Node join/leave: O(K/N) keys move (K=keys, N=nodes)
- Used in: DynamoDB, Cassandra, Redis Cluster, load balancers
Rendezvous Hashing (Highest Random Weight)
- Each key independently scores all nodes:
score(key, node) = hash(key + node) - Key → node with highest score
- No ring needed; naturally handles node add/remove
- Used in: CDN cache routing, Nginx upstream selection, distributed caches
String / Pattern Matching
Aho-Corasick
- Multi-pattern string matching in O(n + m + z) where z = matches
- Build failure links (like KMP but for multiple patterns)
- Used in: intrusion detection (Snort), full-text multi-keyword search, spam filters, grep
-F(multiple fixed strings)
Knuth-Morris-Pratt (KMP)
- Single pattern matching in O(n + m) — no backtracking
- Prefix function avoids redundant comparisons
- Used in:
strstr()implementations, LIKE queries in DBs, grep
Boyer-Moore
- Pattern matching from right to left; skips large portions of text
- O(n/m) best case — faster than KMP in practice for long patterns
- Used in: PostgreSQL LIKE with
pg_trgm, grep, text editors find/replace
Trigram Index
- Index all 3-character substrings of text
- LIKE
%term%→ find trigrams ofterm→ intersect posting lists - Used in: PostgreSQL
pg_trgmextension, Elasticsearch n-gram tokenizer
Graph Traversal
BFS (Breadth-First Search)
- Level-by-level traversal using a queue
- O(V + E), finds shortest path in unweighted graphs
- Used in: Neo4j shortest path, social network friend-of-friend queries, network topology discovery, web crawlers
DFS (Depth-First Search)
- Recursive/stack-based; explores as deep as possible
- O(V + E)
- Used in: cycle detection, topological sort (dependency resolution), tree serialization, CockroachDB range metadata traversal
Dijkstra's Algorithm
- Shortest path in weighted graphs (non-negative weights)
- O((V + E) log V) with binary heap
- Used in: routing tables (OSPF), mapping/navigation (Google Maps), CDN routing, network distance estimation
A* Search
- Dijkstra + heuristic function h(n) to guide search toward goal
- Optimal if h(n) is admissible (never overestimates)
- Used in: game pathfinding, GPS navigation, robot motion planning
PageRank
- Iterative: score(v) = sum of score(u)/out_degree(u) for all u → v
- Converges to stationary distribution of random walk
- O(iterations × E)
- Used in: Google search ranking, LinkedIn network ranking, Apache Spark GraphX, citation analysis
Topological Sort (Kahn's / DFS)
- Linear ordering of DAG vertices
- O(V + E)
- Used in: build system dependency resolution (Bazel, Make), SQL query plan ordering, schema migration dependency ordering, task scheduling
Compression
LZ4
- Byte-oriented LZ compression; extremely fast decode (~1–2 GB/s)
- Lower compression ratio than Zstd but minimal CPU
- Used in: RocksDB block compression (default option), Kafka message compression, ClickHouse, network packet compression where CPU is scarce
Snappy (Google)
- Similar to LZ4; ~250–500 MB/s compress, ~500 MB/s decompress
- Used in: LevelDB/RocksDB, Cassandra, Hadoop, Parquet column pages
Zstandard (Zstd)
- Much better ratio than LZ4/Snappy; fast decode with dictionary
- Tunable: level 1 = near LZ4 speed, level 19 = near gzip ratio
- Used in: PostgreSQL 15+ WAL compression, Zookeeper snapshots, Linux kernel, Facebook's internal storage
Delta + RLE Encoding (Columnar)
- Delta: store differences between consecutive values → small ints → better compression
- RLE: runs of same value stored as (value, count)
- Used in: Parquet, ORC, ClickHouse MergeTree (sort key columns), Apache Druid (timestamp columns)
Gorilla Encoding (Time-Series)
- XOR consecutive floats; leading/trailing zeros encoded in 1–2 bytes
- ~1.37 bytes/point on typical metrics data
- Used in: Facebook Gorilla TSDB, Prometheus (Cortex), VictoriaMetrics, InfluxDB IOx
Cache Eviction
LRU (Least Recently Used)
- Evict the item not accessed for the longest time
- Implementation: doubly-linked list + hash map → O(1) access + eviction
- Used in: OS page cache, PostgreSQL buffer pool, Redis (approximation), CDN edge caches
LFU (Least Frequently Used)
- Evict the item accessed fewest times
- Implementation: frequency buckets + min-heap → O(log n) or O(1) with frequency list trick
- Better than LRU for stable hot-key workloads
- Used in: Redis (LFU policy), CPU L2/L3 caches, CDN static asset caches
Clock (Second Chance)
- Circular buffer; each page has a reference bit
- On eviction: scan clockwise; if bit=1 → clear and skip; if bit=0 → evict
- O(1) amortized; approximates LRU with less memory
- Used in: Linux page cache (not pure Clock but similar), OS virtual memory managers
TinyLFU / W-TinyLFU
- Window (small LRU) + Protected (frequent LRU) + Probationary segments
- Count-Min sketch for frequency tracking
- Near-optimal hit rate; O(1) operations
- Used in: Caffeine (Java cache library), used by Spring, Hadoop, Cassandra read cache
Query Optimization
Cost-Based Optimizer (CBO)
- Enumerate possible plans (join order, index choice, scan method)
- Estimate cost using statistics (row counts, cardinality, histograms)
- Choose minimum cost plan
- Used in: PostgreSQL planner, MySQL optimizer, CockroachDB, Spark Catalyst
Dynamic Programming for Join Order
- Optimal join ordering is NP-hard (all permutations)
- DP: memoize optimal plan for each subset of relations
- O(3^n) time and O(2^n) space — practical up to ~20 tables
- Used in: PostgreSQL join planner, Oracle query optimizer
Predicate Pushdown
- Move WHERE conditions as close to data source as possible
- Reduces rows early → less data through pipeline
- Used in: Spark Catalyst, PostgreSQL, Presto/Trino, Parquet reader (row group filtering)
Vectorized Execution (SIMD)
- Process a batch (vector) of rows per operator call instead of one row at a time
- SIMD instructions process 4–16 values in one CPU instruction
- 2–10× speedup on analytical queries
- Used in: DuckDB, ClickHouse, Snowflake, Apache Arrow, Velox
Concurrency Control
MVCC (Multi-Version Concurrency Control)
- Every write = new version with timestamp; readers see consistent snapshot
- Readers never block writers; writers never block readers
- GC removes old versions below oldest active transaction
- Used in: PostgreSQL, MySQL InnoDB, CockroachDB, TiKV, Oracle
Two-Phase Locking (2PL)
- Growing phase: acquire locks; shrinking phase: release locks
- Strict 2PL: hold write locks until commit (prevents cascading aborts)
- Deadlock detection via wait-for graph (cycle = deadlock)
- Used in: MySQL InnoDB (default), SQL Server, DB2
OCC (Optimistic Concurrency Control)
- Execute without locks; validate at commit time
- Abort if any read was modified by a committed transaction
- Low overhead under low contention; high abort rate under high contention
- Used in: CockroachDB SSI, TiKV Percolator, Hekaton (SQL Server in-memory)
Encoding / Serialization
Varint (Variable-length Integer)
- Small integers use fewer bytes: 0–127 = 1 byte, 128–16383 = 2 bytes
- Used in: Protocol Buffers, RocksDB keys, Lucene posting lists, Parquet
Zigzag Encoding
- Maps signed integers to unsigned: 0→0, -1→1, 1→2, -2→3
- Enables varint for negative numbers
- Used in: Protocol Buffers sint32/sint64, Avro, Thrift
Frame of Reference (FOR) / PFOR
- Subtract min value from all integers in a block; encode deltas with fewer bits
- PFOR: most values fit in b bits; outliers stored separately
- Used in: Lucene doc IDs, Elasticsearch, columnar OLAP engines
Distributed Algorithms
Gossip Protocol
- Each node periodically picks random peers and exchanges state
- Epidemic spreading: O(log N) rounds to reach all N nodes
- Scales to thousands of nodes; no SPOF
- Used in: Cassandra membership, Consul health checks, Bitcoin peer discovery, DynamoDB ring state
Phi Accrual Failure Detector
- Instead of binary alive/dead, outputs suspicion level φ
- φ computed from heartbeat inter-arrival distribution (exponential/normal)
- Threshold tunable: higher φ threshold = fewer false positives, slower detection
- Used in: Cassandra, Akka clustering
Raft Log Replication
- Leader appends entry → sends AppendEntries to followers → commits on majority ACK
- Safety: never commit entries from previous terms unless current-term entry also committed
- Used in: etcd, CockroachDB, TiKV, Consul
Consistent Snapshot (Chandy-Lamport)
- Distributed snapshot without stopping the system
- Each process records own state; markers propagate through channels
- Used in: Apache Flink checkpointing, distributed debuggers, global state capture
Summary: Algorithm → System Mapping
| Algorithm / DS | Systems |
|---|---|
| Skip list | RocksDB MemTable, Redis ZSET, TiKV write buffer |
| B+ Tree | PostgreSQL, MySQL, SQLite, Oracle |
| Bloom filter | RocksDB, Cassandra, BigTable, Chrome |
| Inverted index | Elasticsearch, Lucene, PostgreSQL GIN |
| R-Tree | PostGIS, MongoDB geo, MySQL spatial |
| Consistent hashing | DynamoDB, Cassandra, Redis Cluster |
| HyperLogLog | Redis, BigQuery, Snowflake |
| Roaring bitmap | Druid, ClickHouse, Elasticsearch |
| External merge sort | PostgreSQL, MySQL, Spark shuffle |
| Vectorized execution | DuckDB, ClickHouse, Snowflake |
| Gossip | Cassandra, Bitcoin, Consul |
| Raft | etcd, CockroachDB, TiKV |
| MVCC | PostgreSQL, MySQL, CockroachDB, TiKV |
| W-TinyLFU | Caffeine, Cassandra read cache |
| Gorilla encoding | Prometheus, VictoriaMetrics, InfluxDB |