Partitioning (Sharding)

Partitioning (Sharding)

Goal

  • Each record belongs to exactly one partition
  • Spread data and load evenly — avoid hot spots (skewed partitions)
  • Partitioning + replication are independent schemes used together

Partitioning Strategies

By Key Range

  • Sort keys, assign contiguous key boundaries per partition
  • Supports efficient range queries
  • Risk: hot spots when access is sequential (e.g., all writes go to today's time-series partition)
  • Requires continuous boundary adjustment as data grows

By Hash Key

  • Hash the key (MurmurHash, MD5), partition by hash range
  • More uniform distribution, avoids hot spots
  • Loses range query ability — scatter/gather required
  • Compound key trick: hash first column, range on rest → e.g., hash(user_id) + timestamp supports "all events for user X in time range"

Hot Spot Mitigation

  • Append random bytes to hot key → scatter writes across partitions
  • Trade-off: reads must query all partitions and merge results
  • Application must track and manage the suffix

Secondary Index Partitioning

Type How Read Write
Local (document-based) Each partition indexes its own data only Scatter/gather all partitions Fast — single partition
Global (term-based) Index partitioned by term value Single partition — efficient Slow — must update multiple partitions; usually async

Local Index (Document-Based Partitioning)

Each partition maintains its own independent index covering only the data in that partition.

Partition 0: docs {0,1,2}   → local index: color:red → [0,2], color:blue → [1]
Partition 1: docs {3,4,5}   → local index: color:red → [3],   color:blue → [4,5]

Query "color:red" → must ask ALL partitions → merge [0,2] + [3] = [0,2,3]
  • Write: only the partition owning the document is updated → single partition write, no coordination
  • Read: scatter/gather — query all N partitions in parallel, merge results → high read fan-out
  • Failure: if one partition is slow/down, the whole query waits or returns partial results
  • Used by: Elasticsearch, MongoDB, Cassandra (secondary index per node)

Global Index (Term-Based Partitioning)

A single global index is itself partitioned — not by the primary key, but by the index term (the value being indexed).

Term partition 0 (a–r): color:blue → [1,4,5],  color:red → [0,2,3]
Term partition 1 (s–z): color:yellow → [6,8],  color:white → [7]

Query "color:red" → go to term partition 0 only → get [0,2,3]  ← 1 partition read
  • Read: efficient — route directly to the term's partition; no scatter/gather
  • Write: expensive — one document write may update terms across multiple index partitions (e.g., inserting a doc with color=red, brand=Nike updates two different term partitions)
  • Consistency: writes to multiple index partitions are usually async → index may lag behind primary data by seconds
  • Used by: DynamoDB GSI (async), Elasticsearch cross-shard terms, Solr SolrCloud

Partition by Term — How Terms Are Assigned to Partitions

Two strategies for deciding which partition owns a term:

Strategy How Tradeoff
Hash of term partition = hash(term) % N Uniform distribution; range queries on terms impossible
Term range Alphabetical ranges per partition (a–r on p0, s–z on p1) Range queries on index terms work; risk of hot terms (e.g., status:active gets all traffic)

Example — DynamoDB GSI:

  • GSI is a global index partitioned by the GSI partition key (term)
  • Write to base table → async replication to GSI partition owning that term value
  • Read from GSI → single partition scan (if GSI PK supplied) or full GSI scan
  • GSI has its own WCU/RCU budget — if GSI partition is throttled, base table writes succeed but GSI lags

Rebalancing Partitions

Requirements: fair load after rebalance, DB stays available during rebalance, minimal data movement.

Strategy How Pros Cons
Fixed partitions More partitions than nodes; new node steals some Simple ops Partition size can grow too large/small
Dynamic partitioning Split when too large, merge when too small Adapts to data volume Cold start (starts with 1 partition)
Per-node proportion Fixed partitions per node; new node splits and takes half Balances with node count Random splits may be unfair

Always prefer human-in-the-loop for rebalancing — avoid fully automated (risks cascading moves during normal load spikes).


Request Routing (Service Discovery)

Problem: given key K, which node handles the request?

Approach Where routing lives
Each node knows all Node redirects request to correct peer
Routing tier Dedicated load balancer layer
Client-side Client tracks partition map directly

Coordination Services

  • ZooKeeper: keeps cluster metadata + partition assignments; nodes subscribe to changes
  • Gossip protocol (Cassandra, Riak): nodes propagate metadata; no external coordination service — clients may contact any node

Consistent Hashing

Used in leaderless/DHT systems (Dynamo, Cassandra).

Hash ring: 0 ─────────────────────────── 2^32
Nodes placed at positions on ring
Key → hash → walk clockwise → first node = owner
  • Adding/removing node: only adjacent keys move — minimizes data movement
  • Virtual nodes (vnodes): each physical node owns multiple positions → better load distribution + easier rebalancing

Summary

Use Case Partitioning Strategy
Uniform KV access Hash partitioning
Range queries / time-series Key-range partitioning
Secondary index reads Global (term-based) index
Secondary index writes Local (document-based) index
Dynamic data growth Dynamic partitioning
Large stable clusters Fixed or per-node partitions