Transactions
ACID — What Each Property Actually Means
| Property | Meaning | Who enforces it |
|---|---|---|
| Atomicity | All writes in a tx commit or none do. On fault → abort → client retries safely. NOT about concurrency — about all-or-nothing on crash. | Database (WAL + undo log) |
| Consistency | DB moves from one valid state to another (invariants hold: no negative balance, FK constraints intact). | Application defines the invariants; DB provides the tools |
| Isolation | Concurrently executing txs appear to run serially — no interference. Most DBs provide weaker guarantees by default. | Database (locks, MVCC) |
| Durability | Committed data survives crashes — flushed to disk (WAL fsync) or replicated to N nodes. | Database (storage + replication) |
Single-Object vs Multi-Object Operations
Single-Object
- One row/document; always atomic in any serious DB
INCRin Redis,findAndModifyin MongoDB — atomic by design- Storage engines use WAL + crash recovery to guarantee atomicity per object
Multi-Object
- Required for: foreign keys, denormalized copies, secondary index updates
- Without a transaction: partial write leaves DB inconsistent (FK exists but referencing row absent)
- NoSQL often skips this → app must handle partial failure
- Retry pitfall: network failure after commit looks like failure → double-execute → need idempotency key
Isolation Levels (weakest → strongest)
Read Uncommitted ← allows dirty reads
Read Committed ← no dirty reads/writes; default in PostgreSQL, Oracle, SQL Server
Repeatable Read ← no non-repeatable reads; snapshot per tx; default in MySQL InnoDB
Snapshot Isolation← consistent snapshot for entire tx (MVCC); no write skew prevention
Serializable ← full isolation; as if txs ran one at a time
Weak Isolation — Problems & Fixes
Dirty Reads
What: reading uncommitted data from another in-flight tx.
Tx A: UPDATE accounts SET balance = 0 WHERE id=1 -- not yet committed
Tx B: SELECT balance FROM accounts WHERE id=1 -- sees 0 (wrong!)
Tx A: ROLLBACK
Fix: Read Committed isolation — only read committed values (DB maintains two versions: before-update value served to readers).
Dirty Writes
What: overwriting another tx's uncommitted write.
Tx A: UPDATE listings SET buyer=Alice WHERE id=1 -- not committed
Tx B: UPDATE listings SET buyer=Bob WHERE id=1 -- overwrites Alice (dirty write)
Tx A commits → Alice thinks she won; Bob's invoice also created
Fix: Row-level locks — writer acquires lock; other writers wait until commit/abort.
Read-After-Write (Non-Repeatable Read)
What: reading the same row twice in one tx returns different values (another tx committed in between).
Tx A: SELECT balance → 100
Tx B: UPDATE balance = 50; COMMIT
Tx A: SELECT balance → 50 -- different value, same tx!
Fix: Repeatable Read — tx reads from its own snapshot (MVCC). Same query always returns same result within the tx.
Snapshot Isolation (MVCC Implementation)
Each tx gets a consistent snapshot of the DB at start time. Reads never block writers; writers never block readers.
MVCC mechanics:
Each row has: created_by (tx_id), deleted_by (tx_id or null)
Tx 13 starts → snapshot = {committed txs with id < 13} + {txs already committed before 13 started}
On read: return row if:
- created_by tx is committed AND created_by < current_tx_id
- deleted_by is NULL OR deleted_by tx is not yet committed
Visibility rules:
- Row created by a later tx → invisible (not in my snapshot)
- Row deleted by a later tx → still visible (delete not in my snapshot)
- Row deleted by an earlier committed tx → invisible
Used by: PostgreSQL, MySQL InnoDB, Oracle, CockroachDB, TiKV.
Phantom Reads
What: a query returns different rows on two executions within the same tx (another tx inserted/deleted matching rows).
Tx A: SELECT * FROM bookings WHERE room=101 AND checkin='2024-01-01' → 0 rows
Tx B: INSERT INTO bookings (room, checkin) VALUES (101, '2024-01-01'); COMMIT
Tx A: SELECT same query → 1 row -- phantom appeared!
Tx A: INSERT booking thinking room is free → double booking
Fix options:
- Predicate lock: lock all rows matching the WHERE clause (including not-yet-existing rows) — expensive
- Index-range lock: lock the index range covering the query → cheaper; blocks insertions into that range
- Serializable Snapshot Isolation: detects the conflict at commit time without locking
Write Skew
What: two txs read the same data, make decisions based on it, then write to different objects — breaking an invariant. Neither tx wrote dirty data, but the combination is invalid.
Invariant: at least 1 doctor must be on-call
Tx Alice: SELECT count(*) FROM oncall → 2 (both Alice and Bob)
Tx Bob: SELECT count(*) → 2
Tx Alice: UPDATE oncall SET oncall=false WHERE name='Alice' -- sees 2, thinks it's safe
Tx Bob: UPDATE oncall SET oncall=false WHERE name='Bob' -- sees 2, thinks it's safe
Both commit → 0 doctors on call ← invariant violated
Pattern: read → decide → write (where write is to a different object than what was read).
Fix: make writes conflict:
- Materialize the conflict — create a lock row that both txs must write to
SELECT FOR UPDATE— explicitly lock the read rows- Serializable isolation (SSI or 2PL with predicate locks)
Real examples: meeting room double-booking, username uniqueness, balance ≥ 0 across accounts, claiming a unique item.
Preventing Concurrency Bugs
Compare-and-Set (CAS)
Atomic conditional write — only updates if current value matches expected.
UPDATE wiki SET content = 'new' WHERE id=1 AND content = 'old';
-- Returns 0 rows affected → someone else changed it → retry
- Works only on single objects
- Unsafe if DB reads from old snapshot (CAS check passes on stale read)
Explicit Locking (SELECT FOR UPDATE)
BEGIN;
SELECT * FROM bookings WHERE room=101 FOR UPDATE; -- acquires row lock
-- safe to check and insert; other txs block on this row
INSERT INTO bookings ...;
COMMIT;
- Prevents write skew on existing rows
- Does NOT prevent phantom inserts (lock only covers rows that exist)
Materializing Conflicts
When there are no rows to lock (phantom problem), pre-create lock rows:
-- Pre-populate all room×timeslot combinations
INSERT INTO room_slots (room, date) VALUES (101, '2024-01-01'), ...;
-- Tx acquires lock on the slot row:
SELECT * FROM room_slots WHERE room=101 AND date='2024-01-01' FOR UPDATE;
Last resort — leaks concurrency control concern into data model.
Serializability Implementations
Actual Serial Execution
- Single thread processes one tx at a time — eliminates all concurrency bugs
- Feasible only with fast in-memory txs + stored procedures (no network round trips mid-tx)
- Scale: partition data → each partition has own thread → cross-partition txs = expensive coordination
- Used by: VoltDB, Redis (single-threaded), H-Store, Datomic
Two-Phase Locking (2PL)
Pessimistic approach — block before conflict can occur.
Phase 1 (growing): acquire locks; never release
Phase 2 (shrinking): release locks; never acquire new
Read → shared lock (multiple readers OK; blocks writers)
Write → exclusive lock (blocks all readers AND writers)
Deadlock: Tx A holds lock on row 1, wants row 2. Tx B holds row 2, wants row 1.
- DB detects cycle in wait-for graph → aborts one tx → other proceeds
- Application must retry aborted tx
Predicate locks: lock all rows matching a WHERE clause — prevents phantoms but expensive to check.
Index-range locks: lock a range in the index covering the predicate (broader but cheaper):
Query: WHERE room=101 AND checkin BETWEEN '2024-01' AND '2024-02'
→ Lock index entry for room=101 (or the full time range in that index)
→ Any INSERT matching room=101 must acquire compatible lock → blocks
Performance: significantly worse than snapshot isolation; writers block readers; unstable latencies.
SSI (Serializable Snapshot Isolation)
Optimistic approach — run without blocking; detect conflicts at commit.
Premise tracking:
Tx reads rows matching a condition → DB records the read premise: "I read X at time T"
At commit:
Was X modified by another committed tx after T?
YES → my read is stale → abort + retry
NO → commit succeeds
Two types of conflicts detected:
- Stale MVCC read: tx read a value, but another tx committed a write to that value before this tx started (tx ignored that write because it wasn't in its snapshot)
- Reads affected by a write: another tx commits a write that changes the result of a previously executed query
Used by: PostgreSQL (since 9.1), CockroachDB, FoundationDB, TiKV (Percolator-based)
Pessimistic vs Optimistic Concurrency
| Pessimistic (2PL) | Optimistic (SSI, OCC) | |
|---|---|---|
| Assumption | Conflicts are likely — block early | Conflicts are rare — detect late |
| Blocking | Yes — readers/writers wait for locks | No — all txs run concurrently |
| Abort rate | Low (conflicts prevented) | High under contention (abort + retry) |
| Throughput | Lower (lock overhead + blocking) | Higher under low contention |
| Latency | Higher (wait for locks) | Lower (no waiting) |
| Best for | High contention, short critical sections | Low contention, read-heavy workloads |
| Starvation | Possible (long-held locks) | Possible (repeated aborts under contention) |
Summary: Which Isolation Level Prevents What
| Problem | Read Committed | Repeatable Read / Snapshot | Serializable |
|---|---|---|---|
| Dirty reads | ✅ | ✅ | ✅ |
| Dirty writes | ✅ | ✅ | ✅ |
| Non-repeatable reads | ❌ | ✅ | ✅ |
| Phantom reads | ❌ | ❌ (MVCC only) | ✅ |
| Write skew | ❌ | ❌ | ✅ |
| Lost updates | ❌ | Some DBs ✅ | ✅ |
Transaction Isolation Decision Tree
Need serializability?
├─ Low contention, distributed → SSI (PostgreSQL, CockroachDB)
├─ Simple single-node OLTP → Actual serial execution (VoltDB)
└─ High contention, locks OK → 2PL with index-range locks
Need snapshot reads without serializability?
└─ Snapshot Isolation / Repeatable Read (MVCC)
Need just no dirty reads?
└─ Read Committed (cheapest, PostgreSQL default)
Application can handle conflicts?
└─ Optimistic (CAS, version field) + retry loop
Distributed Transactions
Multi-object txs across partitions/nodes — requires coordination.
Why NoSQL skipped them: 2PC latency + availability cost (coordinator is SPOF). When unavoidable: cross-shard FK, cross-service atomic write.
→ See consensus/two_phase_commit.md for 2PC details. → See consistency/percolator_protocol.md for 2PC without coordinator (TiKV).