Transactions

ACID — What Each Property Actually Means

Property	Meaning	Who enforces it
Atomicity	All writes in a tx commit or none do. On fault → abort → client retries safely. NOT about concurrency — about all-or-nothing on crash.	Database (WAL + undo log)
Consistency	DB moves from one valid state to another (invariants hold: no negative balance, FK constraints intact).	Application defines the invariants; DB provides the tools
Isolation	Concurrently executing txs appear to run serially — no interference. Most DBs provide weaker guarantees by default.	Database (locks, MVCC)
Durability	Committed data survives crashes — flushed to disk (WAL fsync) or replicated to N nodes.	Database (storage + replication)

Single-Object vs Multi-Object Operations

Single-Object

One row/document; always atomic in any serious DB
INCR in Redis, findAndModify in MongoDB — atomic by design
Storage engines use WAL + crash recovery to guarantee atomicity per object

Multi-Object

Required for: foreign keys, denormalized copies, secondary index updates
Without a transaction: partial write leaves DB inconsistent (FK exists but referencing row absent)
NoSQL often skips this → app must handle partial failure
Retry pitfall: network failure after commit looks like failure → double-execute → need idempotency key

Isolation Levels (weakest → strongest)

Read Uncommitted  ←  allows dirty reads
Read Committed    ←  no dirty reads/writes; default in PostgreSQL, Oracle, SQL Server
Repeatable Read   ←  no non-repeatable reads; snapshot per tx; default in MySQL InnoDB
Snapshot Isolation←  consistent snapshot for entire tx (MVCC); no write skew prevention
Serializable      ←  full isolation; as if txs ran one at a time

Weak Isolation — Problems & Fixes

Dirty Reads

What: reading uncommitted data from another in-flight tx.

Tx A: UPDATE accounts SET balance = 0 WHERE id=1   -- not yet committed
Tx B: SELECT balance FROM accounts WHERE id=1       -- sees 0 (wrong!)
Tx A: ROLLBACK

Fix: Read Committed isolation — only read committed values (DB maintains two versions: before-update value served to readers).

Dirty Writes

What: overwriting another tx's uncommitted write.

Tx A: UPDATE listings SET buyer=Alice WHERE id=1    -- not committed
Tx B: UPDATE listings SET buyer=Bob   WHERE id=1    -- overwrites Alice (dirty write)
Tx A commits → Alice thinks she won; Bob's invoice also created

Fix: Row-level locks — writer acquires lock; other writers wait until commit/abort.

Read-After-Write (Non-Repeatable Read)

What: reading the same row twice in one tx returns different values (another tx committed in between).

Tx A: SELECT balance → 100
Tx B: UPDATE balance = 50; COMMIT
Tx A: SELECT balance → 50   -- different value, same tx!

Fix: Repeatable Read — tx reads from its own snapshot (MVCC). Same query always returns same result within the tx.

Snapshot Isolation (MVCC Implementation)

Each tx gets a consistent snapshot of the DB at start time. Reads never block writers; writers never block readers.

MVCC mechanics:

Each row has: created_by (tx_id), deleted_by (tx_id or null)

Tx 13 starts → snapshot = {committed txs with id < 13} + {txs already committed before 13 started}

On read: return row if:
  - created_by tx is committed AND created_by < current_tx_id
  - deleted_by is NULL OR deleted_by tx is not yet committed

Visibility rules:

Row created by a later tx → invisible (not in my snapshot)
Row deleted by a later tx → still visible (delete not in my snapshot)
Row deleted by an earlier committed tx → invisible

Used by: PostgreSQL, MySQL InnoDB, Oracle, CockroachDB, TiKV.

Phantom Reads

What: a query returns different rows on two executions within the same tx (another tx inserted/deleted matching rows).

Tx A: SELECT * FROM bookings WHERE room=101 AND checkin='2024-01-01' → 0 rows
Tx B: INSERT INTO bookings (room, checkin) VALUES (101, '2024-01-01'); COMMIT
Tx A: SELECT same query → 1 row   -- phantom appeared!
Tx A: INSERT booking thinking room is free → double booking

Fix options:

Predicate lock: lock all rows matching the WHERE clause (including not-yet-existing rows) — expensive
Index-range lock: lock the index range covering the query → cheaper; blocks insertions into that range
Serializable Snapshot Isolation: detects the conflict at commit time without locking

Write Skew

What: two txs read the same data, make decisions based on it, then write to different objects — breaking an invariant. Neither tx wrote dirty data, but the combination is invalid.

Invariant: at least 1 doctor must be on-call

Tx Alice: SELECT count(*) FROM oncall → 2  (both Alice and Bob)
Tx Bob:   SELECT count(*) → 2
Tx Alice: UPDATE oncall SET oncall=false WHERE name='Alice'  -- sees 2, thinks it's safe
Tx Bob:   UPDATE oncall SET oncall=false WHERE name='Bob'    -- sees 2, thinks it's safe
Both commit → 0 doctors on call ← invariant violated

Pattern: read → decide → write (where write is to a different object than what was read).

Fix: make writes conflict:

Materialize the conflict — create a lock row that both txs must write to
SELECT FOR UPDATE — explicitly lock the read rows
Serializable isolation (SSI or 2PL with predicate locks)

Real examples: meeting room double-booking, username uniqueness, balance ≥ 0 across accounts, claiming a unique item.

Preventing Concurrency Bugs

Compare-and-Set (CAS)

Atomic conditional write — only updates if current value matches expected.

UPDATE wiki SET content = 'new' WHERE id=1 AND content = 'old';
-- Returns 0 rows affected → someone else changed it → retry

Works only on single objects
Unsafe if DB reads from old snapshot (CAS check passes on stale read)

Explicit Locking (`SELECT FOR UPDATE`)

BEGIN;
SELECT * FROM bookings WHERE room=101 FOR UPDATE;  -- acquires row lock
-- safe to check and insert; other txs block on this row
INSERT INTO bookings ...;
COMMIT;

Prevents write skew on existing rows
Does NOT prevent phantom inserts (lock only covers rows that exist)

Materializing Conflicts

When there are no rows to lock (phantom problem), pre-create lock rows:

-- Pre-populate all room×timeslot combinations
INSERT INTO room_slots (room, date) VALUES (101, '2024-01-01'), ...;

-- Tx acquires lock on the slot row:
SELECT * FROM room_slots WHERE room=101 AND date='2024-01-01' FOR UPDATE;

Last resort — leaks concurrency control concern into data model.

Serializability Implementations

Actual Serial Execution

Single thread processes one tx at a time — eliminates all concurrency bugs
Feasible only with fast in-memory txs + stored procedures (no network round trips mid-tx)
Scale: partition data → each partition has own thread → cross-partition txs = expensive coordination
Used by: VoltDB, Redis (single-threaded), H-Store, Datomic

Two-Phase Locking (2PL)

Pessimistic approach — block before conflict can occur.

Phase 1 (growing): acquire locks; never release
Phase 2 (shrinking): release locks; never acquire new

Read  → shared lock    (multiple readers OK; blocks writers)
Write → exclusive lock (blocks all readers AND writers)

Deadlock: Tx A holds lock on row 1, wants row 2. Tx B holds row 2, wants row 1.

DB detects cycle in wait-for graph → aborts one tx → other proceeds
Application must retry aborted tx

Predicate locks: lock all rows matching a WHERE clause — prevents phantoms but expensive to check.

Index-range locks: lock a range in the index covering the predicate (broader but cheaper):

Query: WHERE room=101 AND checkin BETWEEN '2024-01' AND '2024-02'
→ Lock index entry for room=101 (or the full time range in that index)
→ Any INSERT matching room=101 must acquire compatible lock → blocks

Performance: significantly worse than snapshot isolation; writers block readers; unstable latencies.

SSI (Serializable Snapshot Isolation)

Optimistic approach — run without blocking; detect conflicts at commit.

Premise tracking:

Tx reads rows matching a condition → DB records the read premise: "I read X at time T"

At commit:
  Was X modified by another committed tx after T?
  YES → my read is stale → abort + retry
  NO  → commit succeeds

Two types of conflicts detected:

Stale MVCC read: tx read a value, but another tx committed a write to that value before this tx started (tx ignored that write because it wasn't in its snapshot)
Reads affected by a write: another tx commits a write that changes the result of a previously executed query

Used by: PostgreSQL (since 9.1), CockroachDB, FoundationDB, TiKV (Percolator-based)

Pessimistic vs Optimistic Concurrency

	Pessimistic (2PL)	Optimistic (SSI, OCC)
Assumption	Conflicts are likely — block early	Conflicts are rare — detect late
Blocking	Yes — readers/writers wait for locks	No — all txs run concurrently
Abort rate	Low (conflicts prevented)	High under contention (abort + retry)
Throughput	Lower (lock overhead + blocking)	Higher under low contention
Latency	Higher (wait for locks)	Lower (no waiting)
Best for	High contention, short critical sections	Low contention, read-heavy workloads
Starvation	Possible (long-held locks)	Possible (repeated aborts under contention)

Summary: Which Isolation Level Prevents What

Problem	Read Committed	Repeatable Read / Snapshot	Serializable
Dirty reads	✅	✅	✅
Dirty writes	✅	✅	✅
Non-repeatable reads	❌	✅	✅
Phantom reads	❌	❌ (MVCC only)	✅
Write skew	❌	❌	✅
Lost updates	❌	Some DBs ✅	✅

Transaction Isolation Decision Tree

Need serializability?
  ├─ Low contention, distributed → SSI (PostgreSQL, CockroachDB)
  ├─ Simple single-node OLTP    → Actual serial execution (VoltDB)
  └─ High contention, locks OK  → 2PL with index-range locks

Need snapshot reads without serializability?
  └─ Snapshot Isolation / Repeatable Read (MVCC)

Need just no dirty reads?
  └─ Read Committed (cheapest, PostgreSQL default)

Application can handle conflicts?
  └─ Optimistic (CAS, version field) + retry loop

Distributed Transactions

Multi-object txs across partitions/nodes — requires coordination.

Why NoSQL skipped them: 2PC latency + availability cost (coordinator is SPOF). When unavoidable: cross-shard FK, cross-service atomic write.

→ See consensus/two_phase_commit.md for 2PC details. → See consistency/percolator_protocol.md for 2PC without coordinator (TiKV).