Time and Ordering in Distributed Systems
Why Order Matters
- Single-machine programs have a natural total order (sequential execution)
- Distributed systems have partial order — some events are concurrent and incomparable
- Transforming a partial order into a total order requires communication and waiting, and restricts parallelism
Total vs. Partial Order
| Model | Description |
|---|---|
| Total order | Every element comparable — single-threaded execution |
| Partial order | Some elements incomparable — concurrent operations on different nodes |
Three Time Models
| Model | How | Limitation |
|---|---|---|
| Global Clock | Perfect sync — total order without communication | Only possible to limited accuracy (hardware + physics) |
| Local Clock | Each machine has own clock | Only orders events within a single node |
| No Clock (Logical Time) | Counters + communication track causality | Must communicate to establish order |
Physical Clocks
Time-of-Day Clock
- Returns current date/time
- Synchronized with NTP
- Can jump backward (NTP correction)
- Do NOT use for measuring elapsed time or ordering events across nodes
Monotonic Clock
- Always moves forward — suitable for measuring durations (timeouts, intervals)
- Absolute value is meaningless — only differences matter
- NTP may adjust the clock's rate (slewing), but does not jump its value
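The distinction is visible in Python's standard `time` module; `measure_elapsed` below is an illustrative helper, not a library function:

```python
import time

def measure_elapsed(fn):
    """Measure a duration with the monotonic clock, which never jumps."""
    start = time.monotonic()            # absolute value is meaningless
    fn()
    return time.monotonic() - start     # only the difference matters

# Wall clock: fine for displaying to humans, unsafe for durations/ordering
now_for_humans = time.time()            # may jump backward on NTP correction

elapsed = measure_elapsed(lambda: sum(range(100_000)))
```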
Clock Accuracy in Practice
- Best practical accuracy with NTP: tens of milliseconds
- GPS receivers + PTP: microsecond accuracy
- Never rely on wall-clock timestamps for event ordering across distributed nodes
Lamport Clocks (Logical Clocks)
Simple counter-based ordering — forces a total order.
Rules
- Increment counter on every local event
- Attach counter to every outgoing message
- On receive: counter = max(local, received) + 1
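The three rules can be sketched in a few lines of Python (class and method names are illustrative, not a standard API):

```python
class LamportClock:
    """Minimal Lamport clock sketch."""
    def __init__(self):
        self.counter = 0

    def tick(self):
        # Rule 1: increment on every local event
        self.counter += 1
        return self.counter

    def send(self):
        # Rule 2: increment, then attach the counter to the outgoing message
        self.counter += 1
        return self.counter

    def receive(self, received):
        # Rule 3: take the max of local and received, then advance
        self.counter = max(self.counter, received) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
msg = a.send()      # a.counter == 1
b.receive(msg)      # b.counter == max(0, 1) + 1 == 2
```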
Properties
- timestamp(a) < timestamp(b) → a may have preceded b (or they are concurrent)
- Cannot distinguish concurrent from causally dependent events
- Creates total order by breaking ties with node ID
Example
Client A write: timestamp=10, nodeID=1
Client B write: timestamp=10, nodeID=2
→ Tie-break by nodeID → A wins; B's write silently discarded
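The tie-break amounts to comparing (timestamp, nodeID) pairs lexicographically; a Python sketch, assuming the lower node ID wins:

```python
# Lamport timestamps extended to (timestamp, node_id) pairs for a total order
write_a = (10, 1)   # Client A: timestamp=10, nodeID=1
write_b = (10, 2)   # Client B: timestamp=10, nodeID=2

# Equal timestamps, so the node ID breaks the tie
winner = min(write_a, write_b)   # A wins; B's write is discarded
```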
Use Cases (when agreement > history)
- Distributed locks — just need one winner
- Total-order multicast — all nodes process in same sequence
- Leader election — pick one node without debate
Real Systems
| System | Usage |
|---|---|
| ZooKeeper | zxid — Lamport-style monotonic counter for total transaction ordering |
| Google Chubby | Sequencer tokens for lock acquisitions |
| Kafka | Log offsets — monotonic, total order within a partition |
Vector Clocks
Array of counters — one per node. Provides accurate causality tracking.
Rules
- Increment own slot on every local event
- Attach full vector to outgoing messages
- On receive: merge element-wise max, then increment own slot
Example
Node A writes: [A:1, B:0]
Node B writes: [A:0, B:1]
→ Neither dominates → concurrent writes → conflict detected
Comparison Rules
- V1 < V2 if every element of V1 ≤ V2 (and at least one strictly less) → V1 happened before V2
- Neither dominates → concurrent (potential conflict)
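A minimal Python sketch of the merge and comparison rules (function names are illustrative), representing a vector as a dict keyed by node name:

```python
def merge(local, received, own):
    """On receive: element-wise max of the two vectors, then bump own slot."""
    out = {n: max(local[n], received[n]) for n in local}
    out[own] += 1
    return out

def happened_before(v1, v2):
    """v1 < v2: every slot of v1 <= v2, and the vectors differ."""
    return all(v1[n] <= v2[n] for n in v1) and v1 != v2

def concurrent(v1, v2):
    """Neither vector dominates the other: a potential conflict."""
    return not happened_before(v1, v2) and not happened_before(v2, v1)

va = {"A": 1, "B": 0}           # Node A's write
vb = {"A": 0, "B": 1}           # Node B's write
conflict = concurrent(va, vb)   # True: neither dominates
```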
Properties
- Correctly identifies concurrent vs. causally dependent operations
- Cost scales linearly with node count
- Used as a conflict detector, not a conflict resolver
Real Systems
| System | Usage |
|---|---|
| Amazon DynamoDB | Tracks causality between replica versions; exposes conflicts to caller |
| Riak | Stores conflicting versions as siblings; client merges |
| Voldemort (LinkedIn) | Detects stale reads, surfaces conflicts |
Failure Detectors
Abstract timing assumptions using heartbeats + timeouts.
The Problem
In asynchronous systems, you cannot distinguish a failed node from a slow node — both look like "no response."
Chandra-Toueg Properties
| Property | Meaning |
|---|---|
| Completeness | Crashed processes are eventually suspected |
| Accuracy | Non-faulty processes are never falsely suspected (hard to guarantee) |
Key Insight
- Even weak failure detectors enable solving consensus in async systems
- Timeout value is critical (see DDIA ch8)
Timeout Trade-offs
| Timeout | Risk |
|---|---|
| Too short | False positives — declare live node dead → duplicate actions, cascading overload |
| Too long | Slow fault detection |
| Theoretical optimal | 2d + r (d = max one-way network delay, r = processing time) — neither is bounded in practice |
Best practice: continuously measure round-trip times, set timeouts experimentally.
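A heartbeat-plus-timeout detector can be sketched as follows (class and names are illustrative). Note it provides completeness, while accuracy hinges entirely on the timeout choice:

```python
import time

class HeartbeatDetector:
    """Timeout-based failure detector sketch.
    Complete: a crashed node stops heartbeating and is eventually suspected.
    Accuracy depends on the timeout: too short yields false positives."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node):
        # Use the monotonic clock, never the wall clock, for intervals
        self.last_seen[node] = time.monotonic()

    def suspected(self, node):
        last = self.last_seen.get(node)
        if last is None:
            return True   # never heard from it at all
        return time.monotonic() - last > self.timeout
```

In practice the fixed `timeout` would be replaced by a value derived from continuously measured round-trip times, per the best practice above.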
Process Pauses
A node cannot detect its own pause. Causes:
- GC stop-the-world (can be seconds)
- VM suspension / live migration
- OS context switches, disk I/O waits
- Swapping to disk
Consequence
A node that wakes up after a pause may act on stale assumptions — e.g., still believe it holds a lease that expired.
Fix: Fencing tokens — monotonically increasing token with each lock grant; storage rejects writes with token ≤ last seen.
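A storage-side sketch of the fencing check (class and names are illustrative), applying the rule above of rejecting any token ≤ the last one seen:

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""
    def __init__(self):
        self.max_token = 0   # highest token seen so far
        self.data = {}

    def write(self, token, key, value):
        if token <= self.max_token:
            # A paused client waking up with an expired lease lands here
            raise PermissionError(
                f"stale token {token}, already saw {self.max_token}")
        self.max_token = token
        self.data[key] = value

store = FencedStore()
store.write(33, "x", "old-holder")   # lease holder with token 33
store.write(34, "x", "new-holder")   # new holder fences out the old one
```

If the old holder now wakes from a GC pause and retries with token 33, the store raises instead of silently corrupting the data.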
Summary: Which Clock/Mechanism to Use
| Need | Use |
|---|---|
| Measure elapsed time (local) | Monotonic clock |
| Current time for humans | Time-of-day clock |
| Total ordering (agreement, no conflict history needed) | Lamport timestamps |
| Detect concurrent writes / causality | Vector clocks |
| Detect node failures | Failure detector + heartbeats |
| Prevent stale-lock writes | Fencing tokens |