Partitioning (Sharding)
Goal
- Each record belongs to exactly one partition
- Spread data and load evenly — avoid hot spots (skewed partitions)
- Partitioning + replication are independent schemes used together
Partitioning Strategies
By Key Range
- Sort keys, assign contiguous key boundaries per partition
- Supports efficient range queries
- Risk: hot spots when access is sequential (e.g., all writes go to today's time-series partition)
- Requires continuous boundary adjustment as data grows
By Hash Key
- Hash the key (MurmurHash, MD5), partition by hash range
- More uniform distribution, avoids hot spots
- Loses range query ability — scatter/gather required
- Compound key trick: hash first column, range on rest → e.g.,
hash(user_id) + timestampsupports "all events for user X in time range"
Hot Spot Mitigation
- Append random bytes to hot key → scatter writes across partitions
- Trade-off: reads must query all partitions and merge results
- Application must track and manage the suffix
Secondary Index Partitioning
| Type | How | Read | Write |
|---|---|---|---|
| Local (document-based) | Each partition indexes its own data only | Scatter/gather all partitions | Fast — single partition |
| Global (term-based) | Index partitioned by term value | Single partition — efficient | Slow — must update multiple partitions; usually async |
Local Index (Document-Based Partitioning)
Each partition maintains its own independent index covering only the data in that partition.
Partition 0: docs {0,1,2} → local index: color:red → [0,2], color:blue → [1]
Partition 1: docs {3,4,5} → local index: color:red → [3], color:blue → [4,5]
Query "color:red" → must ask ALL partitions → merge [0,2] + [3] = [0,2,3]
- Write: only the partition owning the document is updated → single partition write, no coordination
- Read: scatter/gather — query all N partitions in parallel, merge results → high read fan-out
- Failure: if one partition is slow/down, the whole query waits or returns partial results
- Used by: Elasticsearch, MongoDB, Cassandra (secondary index per node)
Global Index (Term-Based Partitioning)
A single global index is itself partitioned — not by the primary key, but by the index term (the value being indexed).
Term partition 0 (a–r): color:blue → [1,4,5], color:red → [0,2,3]
Term partition 1 (s–z): color:yellow → [6,8], color:white → [7]
Query "color:red" → go to term partition 0 only → get [0,2,3] ← 1 partition read
- Read: efficient — route directly to the term's partition; no scatter/gather
- Write: expensive — one document write may update terms across multiple index partitions (e.g., inserting a doc with
color=red, brand=Nikeupdates two different term partitions) - Consistency: writes to multiple index partitions are usually async → index may lag behind primary data by seconds
- Used by: DynamoDB GSI (async), Elasticsearch cross-shard terms, Solr SolrCloud
Partition by Term — How Terms Are Assigned to Partitions
Two strategies for deciding which partition owns a term:
| Strategy | How | Tradeoff |
|---|---|---|
| Hash of term | partition = hash(term) % N |
Uniform distribution; range queries on terms impossible |
| Term range | Alphabetical ranges per partition (a–r on p0, s–z on p1) | Range queries on index terms work; risk of hot terms (e.g., status:active gets all traffic) |
Example — DynamoDB GSI:
- GSI is a global index partitioned by the GSI partition key (term)
- Write to base table → async replication to GSI partition owning that term value
- Read from GSI → single partition scan (if GSI PK supplied) or full GSI scan
- GSI has its own WCU/RCU budget — if GSI partition is throttled, base table writes succeed but GSI lags
Rebalancing Partitions
Requirements: fair load after rebalance, DB stays available during rebalance, minimal data movement.
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Fixed partitions | More partitions than nodes; new node steals some | Simple ops | Partition size can grow too large/small |
| Dynamic partitioning | Split when too large, merge when too small | Adapts to data volume | Cold start (starts with 1 partition) |
| Per-node proportion | Fixed partitions per node; new node splits and takes half | Balances with node count | Random splits may be unfair |
Always prefer human-in-the-loop for rebalancing — avoid fully automated (risks cascading moves during normal load spikes).
Request Routing (Service Discovery)
Problem: given key K, which node handles the request?
| Approach | Where routing lives |
|---|---|
| Each node knows all | Node redirects request to correct peer |
| Routing tier | Dedicated load balancer layer |
| Client-side | Client tracks partition map directly |
Coordination Services
- ZooKeeper: keeps cluster metadata + partition assignments; nodes subscribe to changes
- Gossip protocol (Cassandra, Riak): nodes propagate metadata; no external coordination service — clients may contact any node
Consistent Hashing
Used in leaderless/DHT systems (Dynamo, Cassandra).
Hash ring: 0 ─────────────────────────── 2^32
Nodes placed at positions on ring
Key → hash → walk clockwise → first node = owner
- Adding/removing node: only adjacent keys move — minimizes data movement
- Virtual nodes (vnodes): each physical node owns multiple positions → better load distribution + easier rebalancing
Summary
| Use Case | Partitioning Strategy |
|---|---|
| Uniform KV access | Hash partitioning |
| Range queries / time-series | Key-range partitioning |
| Secondary index reads | Global (term-based) index |
| Secondary index writes | Local (document-based) index |
| Dynamic data growth | Dynamic partitioning |
| Large stable clusters | Fixed or per-node partitions |