Percolator Protocol (Distributed MVCC Transactions)
Background
Google Percolator (2010) — incremental processing on top of Bigtable. First practical implementation of lock-free distributed MVCC transactions without a dedicated transaction coordinator.
Used by: TiKV, Google Spanner (inspired), CockroachDB (similar approach).
Key Insight
Traditional 2PC requires a coordinator that holds locks. Percolator eliminates the coordinator by:
- Using a primary lock as the single source of truth for commit/abort
- Letting any node resolve a transaction by checking the primary lock's state
- Using MVCC timestamps (from TSO) to order operations without locks on reads
Storage Layout (TiKV Column Families)
| CF | Key | Value | Role |
|---|---|---|---|
CF_DEFAULT |
(key, start_ts) |
value bytes | Actual data versions |
CF_LOCK |
key |
lock info + primary key ref | In-flight tx locks |
CF_WRITE |
(key, commit_ts) |
{type, start_ts} |
Committed version index |
Timestamps stored in descending order → RocksDB forward scan = newest version first.
Transaction Protocol
Phase 1 — Prewrite
1. Get start_ts from TSO (Timestamp Oracle)
2. Choose one key as "primary" lock, rest are "secondary"
3. For each key to write:
a. Check CF_WRITE at (key, ts > start_ts)
→ if exists: write conflict → abort
b. Check CF_LOCK at key
→ if exists (any ts): lock conflict → abort or wait
c. Write value: CF_DEFAULT[(key, start_ts)] = value
d. Write lock: CF_LOCK[key] = {start_ts, primary_key, ttl, ...}
Phase 2 — Commit
1. Get commit_ts from TSO
2. Check primary lock still exists at CF_LOCK[primary_key]
→ if missing: transaction was rolled back by someone else → abort
3. Write to CF_WRITE[primary_key, commit_ts] = {Put, start_ts}
Delete CF_LOCK[primary_key]
← THIS IS THE COMMIT POINT — atomic, single write
4. Async: clean up secondary locks (commit secondaries)
For each secondary key:
Write CF_WRITE[(key, commit_ts)] = {Put, start_ts}
Delete CF_LOCK[key]
Correctness: commit is a single atomic write to CF_WRITE for the primary key. Secondary cleanup can happen lazily — readers will resolve it.
Read Protocol
Read key K at snapshot timestamp T:
1. Check CF_LOCK[K]
→ if lock.start_ts < T: another tx has an in-flight write visible to us
→ may need to wait or resolve (see below)
→ if lock.start_ts > T: ignore (started after our snapshot)
2. Scan CF_WRITE for (K, commit_ts ≤ T) newest first
→ get start_ts from write record
3. Fetch CF_DEFAULT[(K, start_ts)]
→ return value
Lock Conflict Resolution
When a reader encounters a lock at start_ts < T:
Check primary lock status:
Case 1: Primary committed (CF_WRITE has entry)
→ Secondary is dangling → commit it (write CF_WRITE, delete CF_LOCK)
→ Read the committed value
Case 2: Primary not committed + lock TTL expired
→ Transaction is dead → rollback it
→ Write CF_WRITE[(primary, start_ts)] = {Rollback}
→ Delete CF_LOCK[primary]
→ Clean up secondaries
Case 3: Primary lock still live (TTL not expired)
→ Transaction is still in progress → wait or retry
Any node can resolve any transaction — no dedicated coordinator needed.
Timestamp Oracle (TSO)
Central monotonically increasing timestamp generator.
64-bit timestamp = [46-bit physical ms | 18-bit logical counter]
- Pre-allocates ranges to disk; serves from memory (no disk I/O per request)
- Clients batch requests (one RPC per batch) to reduce overhead
- Single node — leader in a Raft group (PD in TiKV) for HA
- Failure: new leader resumes from a timestamp > last persisted batch → no duplicate timestamps
Bottleneck: all transactions need 2 TSO calls (start_ts + commit_ts). Mitigated by batching.
MVCC Garbage Collection
GC safe point = min(start_ts of all active transactions)
All versions with commit_ts < GC safe point can be deleted
→ Triggered during RocksDB compaction
→ PD tracks and broadcasts GC safe point to all TiKV nodes
Percolator vs. Traditional 2PC
| Traditional 2PC | Percolator | |
|---|---|---|
| Coordinator | Required, holds locks | None — primary lock is coordinator |
| Blocking on coordinator failure | Yes | No — any node resolves via primary lock |
| Lock visibility | Coordinator memory | CF_LOCK in storage (persistent) |
| Commit point | Coordinator decision log | Single CF_WRITE entry for primary |
| Read blocking | Readers wait for locks | Readers resolve stale locks lazily |
| Clock dependency | No | TSO required for ordering |
Percolator in CockroachDB (Similar Approach)
CockroachDB uses write intents (same concept as locks in CF_LOCK):
- Intents = uncommitted MVCC values flagged with transaction ID
- Transaction record = equivalent of primary lock
- Commit = single write to transaction record →
COMMITTED - Intent cleanup = async, correctness doesn't depend on it
- Uses HLC instead of TSO (no single-node timestamp bottleneck)