Amazon S3
Scale (Public Numbers)
- 350 trillion+ objects stored (as of 2023)
- 100 million+ requests/sec at peak
- 11 nines durability (99.999999999%)
- 99.99% availability (Standard tier)
- Launched 2006 — one of AWS's oldest services
Functional Requirements
- PUT object (single + multipart)
- GET object (full + byte-range)
- DELETE object (versioned or permanent)
- HEAD object (metadata only)
- LIST objects in bucket (paginated, prefix/delimiter filter)
- Bucket operations: CREATE, DELETE, configure policy/ACL/versioning/lifecycle
- Pre-signed URLs (time-limited, capability-based access)
- Event notifications on object operations
- Replication (same-region, cross-region)
- Lifecycle transitions (tier down, expire)
Non-Functional Requirements
- Durability: 11 nines — store across ≥3 AZs; erasure-coded
- Availability: 99.99% reads, 99.9% writes
- Consistency: strong (since Dec 2020) for single object; eventual for cross-region replication
- Latency: first-byte < 200ms in-region (TTFB-optimized)
- Throughput: per-prefix up to 5,500 GET/s and 3,500 PUT/s (S3 scales per prefix)
- Scale: unlimited storage; auto-scale; no pre-provisioning
Capacity Estimation (80:20)
Objects stored: 350 trillion
Avg object size: 1 MB (blended — many small + few large)
Total raw data: 350T × 1 MB ≈ 350 EB raw → ~500 EB on disk with 1.5× erasure-coding overhead
Write QPS: ~3.5M PUT/s (PUTs are a small slice of the ~20M/s non-GET traffic; HEAD/LIST/DELETE make up the rest)
Read QPS: ~80M GET/s
Metadata per object: ~200 bytes
Total metadata: 350T × 200B ≈ 70 PB metadata
Daily new data: ~3.5M PUT/s × 1 MB ≈ 3.5 TB/s → hundreds of PB/day globally
Per-region write BW: TB/s scale (erasure-coded shard fan-out multiplies this on disk)
IOPS (per region estimate):
Write: 3.5M PUT/s × 1 sequential write per PUT = 3.5M sequential IOPS
Each object erasure-coded to 9 shards → 3.5M × 9 = 31.5M shard writes/s
NVMe SSD: 1M sequential IOPS → 32+ storage nodes minimum for writes
Read: 80M GET/s served mostly from memory (page cache of hot data)
Cache miss → random read; assuming 1% miss = 800k random IOPS
Per storage node (NVMe 500k IOPS): 2+ nodes just for random reads
In practice: thousands of storage nodes → reads spread across fleet
Server count (per region, estimated):
Frontend fleet: 1,000s of stateless servers (100M req/s ÷ 50k/server)
Storage nodes: 10,000s (petabytes of storage; throughput-driven)
Metadata nodes: 100s (purpose-built distributed KV, sharded heavily)
Replication workers: 100s (async CRR queue processors)
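The estimates above can be sanity-checked in a few lines. This restates the section's own assumptions (350T objects, 1 MB blended average, 200 B metadata/object, 9 shards per object, ~1M sequential IOPS per NVMe node) — illustrative inputs, not AWS-published internals:

```python
# Back-of-envelope checks for the capacity figures in this section.
OBJECTS = 350e12          # objects stored
AVG_OBJECT_B = 1e6        # 1 MB blended average object size
META_B = 200              # metadata bytes per object
SHARDS = 9                # erasure-coded shards per object (6 data + 3 parity)
PUT_QPS = 3.5e6           # assumed PUT rate
NODE_SEQ_IOPS = 1e6       # NVMe sequential IOPS per storage node

raw_eb = OBJECTS * AVG_OBJECT_B / 1e18       # raw data, exabytes
meta_pb = OBJECTS * META_B / 1e15            # metadata, petabytes
shard_writes = PUT_QPS * SHARDS              # shard writes per second
min_nodes = shard_writes / NODE_SEQ_IOPS     # floor on write-path node count

print(f"{raw_eb:.0f} EB raw, {meta_pb:.0f} PB metadata, "
      f"{shard_writes/1e6:.1f}M shard writes/s, {min_nodes:.0f}+ nodes")
```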
High-Level Architecture
┌─────────────────────────────────────────┐
│ S3 Service │
│ │
Client ──HTTPS──► Front-End Fleet (per-AZ) │
│ auth │ routing │ rate-limit │
│ │ │
┌──────┼────────┼─────────────────┐ │
│ ▼ ▼ │ │
│ Auth/IAM API Router │ │
│ Service (method dispatch) │ │
│ │ │ │ │
│ └────────┤ │ │
│ ▼ │ │
│ Metadata Service ────────┼─ Bucket DB │
│ (object index) │ │
│ │ │ │
│ ▼ │ │
│ Storage Service │ │
│ (chunk placement, │ │
│ read/write to nodes) │ │
│ │ │ │
│ ┌──────────┼──────────┐ │ │
│ ▼ ▼ ▼ │ │
│ AZ-1 AZ-2 AZ-3 │ │
│ Storage Storage Storage │ │
│ Nodes Nodes Nodes │ │
└─────────────────────────────────┘ │
│ │
┌──────┼──────────────────────────────────────┐ │
│ Background Services │ │
│ Replication | Lifecycle | GC | Inventory │ │
└─────────────────────────────────────────────┘ │
└─────────────────────────────────────────┘
Core Components (HLD)
| Component | Responsibility |
|---|---|
| Front-End Fleet | TLS termination, auth token parsing, request routing per method, rate limiting, request logging |
| IAM / Auth Service | Validate SigV4 signatures, evaluate IAM policies, bucket ACLs, pre-signed URL expiry |
| API Router | Dispatch to correct handler: PUT→storage path, GET→read path, LIST→index service |
| Metadata / Index Service | Persistent index: (bucket, key) → {size, ETag, version_id, storage_location, ACL, tags, ...} |
| Bucket Service | Bucket creation/deletion, versioning state, lifecycle rules, replication config, event notification config |
| Storage Service | Determine placement (which storage nodes), orchestrate write to ≥3 AZs, serve reads |
| Storage Nodes | Durable disk storage per AZ; serve individual chunk read/write; checksum every block |
| Replication Service | Async SRR (same-region) / CRR (cross-region); ordered per-object replication queue |
| Lifecycle Service | Periodically scans objects, applies transition (tier) or expiration rules |
| Garbage Collector | Reclaims orphaned chunks: deleted objects, failed writes, abandoned multipart parts |
| Event Bridge | Fan-out events (ObjectCreated, ObjectDeleted, etc.) to SNS/SQS/Lambda |
Low-Level Design
Object Naming and Partitioning
S3 presents a flat namespace but internally indexes hierarchically.
S3 key: "photos/2024/jan/img001.jpg"
Internal key: hash(bucket_name) + "/" + object_key
→ distributed across metadata shards by key hash prefix
Partition scheme (pre-2018): first characters of key determined partition
→ hot prefix problem: all keys starting with "img-" → same shard
Partition scheme (post-2018): S3 internally prefixes with a hash of your key
→ client key "2024-01-01/photo.jpg" → internal: "3a7f/2024-01-01/photo.jpg"
→ Transparent to clients; automatically distributes sequential keys
Throughput scaling: 3,500 PUT/s and 5,500 GET/s per prefix — add prefixes to parallelize.
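The post-2018 scheme above can be sketched as follows. The 4-hex-character prefix width and SHA-256 choice are illustrative assumptions — S3's real internal hashing is not public — but the effect is the same: sequential client keys land on different internal prefixes:

```python
import hashlib

def internal_key(bucket: str, key: str, prefix_chars: int = 4) -> str:
    # Transparent hash prefix: distributes sequential client keys
    # across metadata/storage partitions.
    h = hashlib.sha256(f"{bucket}/{key}".encode()).hexdigest()
    return f"{h[:prefix_chars]}/{key}"

# Two sequential date-based keys get unrelated internal prefixes:
print(internal_key("photos", "2024-01-01/a.jpg"))
print(internal_key("photos", "2024-01-02/a.jpg"))
```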
Metadata Service (Index Layer)
Key: (bucket_id, object_key, version_id)
Value: {
content_length : int64
content_type : string
etag              : MD5 of object bytes (single-part uploads; multipart ETags use a different format)
last_modified : timestamp
storage_class : STANDARD | IA | GLACIER | ...
server_side_enc : AES256 | aws:kms | null
replication_status: PENDING | COMPLETED | FAILED | REPLICA
acl : canned ACL or bucket policy reference
user_metadata : map<string, string> (x-amz-meta-* headers)
tags : map<string, string> (S3 Object Tagging)
chunks : [{chunk_id, az, node_id, offset, length}]
delete_marker : bool
is_latest : bool
part_list : null | [{part_no, etag, size}] (multipart)
}
Metadata store internals (inferred from AWS talks):
- Purpose-built distributed KV store (not off-the-shelf)
- Sharded by hash(bucket_id + object_key) across many nodes
- Each shard replicated across AZs for HA
- Strong consistency: single-writer per key shard (leader-based replication)
- Metadata write is the commit point — storage bytes written before metadata
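A minimal sketch of the sharding and single-leader idea above: a key hashes to one shard, and that shard's leader applies the write and serves every read, which is what makes the metadata write a commit point. Shard count and the in-memory dicts are illustrative assumptions, not S3 internals:

```python
import hashlib

NUM_SHARDS = 16
shards = [dict() for _ in range(NUM_SHARDS)]  # leader state per shard

def shard_for(bucket_id: str, object_key: str) -> int:
    # Route every operation on a key to the same shard (and thus same leader).
    digest = hashlib.md5(f"{bucket_id}+{object_key}".encode()).digest()
    return digest[0] % NUM_SHARDS

def commit_metadata(bucket_id, key, record):
    # The metadata write IS the commit point: once this returns,
    # any reader routed to the same leader sees the object.
    shards[shard_for(bucket_id, key)][(bucket_id, key)] = record

def lookup(bucket_id, key):
    return shards[shard_for(bucket_id, key)].get((bucket_id, key))

commit_metadata("b1", "photo.jpg", {"size": 123})
print(lookup("b1", "photo.jpg"))  # read-after-write on the same leader
```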
Write Path (PUT Object)
1. Client → PUT /bucket/key HTTP/1.1 (with SigV4 auth header)
2. Front-End:
a. Validate SigV4 signature (HMAC-SHA256 of canonical request)
b. IAM policy evaluation: does this identity have s3:PutObject?
c. Bucket exists? Bucket-level checks (versioning, object lock, ACL)
d. Assign internal request-id (for tracing, billing)
3. Storage Service:
a. Receive streaming bytes from front-end
b. Compute MD5 + SHA256 as bytes arrive
c. Select target nodes: 1 primary per AZ (3 AZs = 3 nodes minimum)
d. Stream bytes to primary node per AZ (parallel, not sequential)
e. Each storage node: write to local append-only storage file + checksum
4. Durability barrier:
a. Wait for all 3 AZ primaries to acknowledge (sync — before any response)
b. If any AZ fails: return error, trigger cleanup of partial writes
5. Metadata commit:
a. Write metadata record (with chunk locations) to Metadata Service
b. This write is the COMMIT POINT — object now visible to reads
c. Metadata write is strongly consistent (single leader per shard)
6. Response to client:
a. 200 OK with ETag (MD5 of content)
b. Optionally: x-amz-version-id if bucket is versioned
7. Background:
a. Erasure coding: recode chunks for storage efficiency
b. Cross-region replication (if configured): enqueue for async delivery
c. Event notification: publish ObjectCreated to configured destinations
Why metadata write = commit point:
- If bytes written but metadata not committed → partial write; GC will clean up orphaned chunks
- If metadata committed → object is visible; bytes guaranteed durable (3 AZs already ACKed)
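The numbered PUT flow above, condensed into a sketch: write to one primary per AZ, enforce the durability barrier, then commit metadata last. The names (AZ_NODES, METADATA) are illustrative; real S3 streams bytes and runs the AZ writes in parallel:

```python
import hashlib

AZ_NODES = {"az-1": {}, "az-2": {}, "az-3": {}}  # node stores, one per AZ
METADATA = {}                                     # (bucket, key) -> record

def put_object(bucket: str, key: str, body: bytes) -> str:
    etag = hashlib.md5(body).hexdigest()
    chunk_id = f"{bucket}/{key}#{etag}"
    acks = []
    for az, store in AZ_NODES.items():   # parallel in reality
        store[chunk_id] = body           # append-only write + checksum
        acks.append(az)
    if len(acks) < 3:                    # durability barrier: all AZs must ACK
        for az in acks:                  # cleanup partial writes on failure
            AZ_NODES[az].pop(chunk_id, None)
        raise IOError("durability barrier not met")
    # COMMIT POINT: object becomes visible only after this metadata write.
    METADATA[(bucket, key)] = {"etag": etag, "chunks": [chunk_id]}
    return etag

etag = put_object("b", "k", b"hello")
print(etag)  # MD5 of the body, returned as the ETag
```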
Read Path (GET Object)
1. Client → GET /bucket/key (optionally: Range: bytes=0-1048575)
2. Front-End:
a. Auth + IAM policy: s3:GetObject?
b. Check object ACL / bucket policy
c. Conditional request? Check ETag (If-None-Match) or date (If-Modified-Since)
→ 304 Not Modified if unchanged
3. Metadata lookup:
a. Fetch object metadata by (bucket_id, key)
b. Get chunk list: [{az, node_id, chunk_id, offset, length}]
c. Check storage class: if GLACIER → return error (restore required)
4. Storage read:
a. Select closest/fastest AZ (prefer same AZ as front-end for latency)
b. If Range header: map byte range → relevant chunk(s)
c. Fetch chunk(s) from storage node
d. Verify checksum on each chunk (detect bit rot)
e. Stream bytes to client as they arrive (no full buffer)
5. Response:
Content-Type, ETag, Last-Modified, Content-Length headers
If range request: 206 Partial Content + Content-Range header
Strong consistency (since Dec 2020):
- Achieved via serialization in the Metadata Service
- Every GET reads from the same strongly-consistent metadata shard
- No read-your-writes caching issues — metadata write is immediately visible to all readers
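Step 4b of the read path (mapping a Range header onto chunks) reduces to an interval-overlap test against the chunk list from metadata. The chunk layout here is an illustrative assumption:

```python
def chunks_for_range(chunks, start: int, end: int):
    """chunks: [{'chunk_id', 'offset', 'length'}]; end is inclusive (HTTP Range)."""
    hits = []
    for c in chunks:
        c_start = c["offset"]
        c_end = c["offset"] + c["length"] - 1
        if c_start <= end and c_end >= start:   # interval overlap
            hits.append(c["chunk_id"])
    return hits

# Five 100-byte chunks; Range: bytes=150-260 touches only the middle two.
chunks = [{"chunk_id": f"c{i}", "offset": i * 100, "length": 100} for i in range(5)]
print(chunks_for_range(chunks, 150, 260))  # → ['c1', 'c2']
```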
Multipart Upload — Internal Flow
Initiate:
POST /bucket/key?uploads
→ generate UploadId (UUID)
→ create "in-progress upload" record in metadata (state=PENDING)
Upload parts (parallel, any order):
PUT /bucket/key?partNumber=N&uploadId=X body=bytes
→ each part stored as independent object fragment
→ storage path: /internal/multipart/{UploadId}/{partNumber}
→ returns ETag for that part (MD5 of part bytes)
→ minimum part size: 5 MB (last part can be smaller)
→ max parts: 10,000; max part size: 5 GB; max object size capped at 5 TB (so parts average ≥ ~525 MB at the part-count limit)
Complete:
POST /bucket/key?uploadId=X
Body: <part list with ETags in order>
→ validate all parts present and ETag matches
→ concatenate chunk references into final object metadata
→ write final metadata record (state=ACTIVE) ← commit point
→ delete in-progress upload record
→ object now visible as single addressable key
Abort:
DELETE /bucket/key?uploadId=X
→ mark in-progress record ABORTED
→ GC reclaims part storage asynchronously
S3 Lifecycle rule for abandoned multipart:
{ "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 } }
Without this: parts accumulate, charging storage costs indefinitely.
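Multipart ETags follow a widely observed convention (not an official spec): MD5 of the concatenated binary part-MD5s, suffixed with the part count. A sketch, useful for verifying a local file against a completed multipart upload:

```python
import hashlib

def multipart_etag(data: bytes, part_size: int) -> str:
    # Split into parts, MD5 each, then MD5 the concatenated digests.
    part_md5s = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    if len(part_md5s) == 1:
        return hashlib.md5(data).hexdigest()   # single part: plain MD5
    return hashlib.md5(b"".join(part_md5s)).hexdigest() + f"-{len(part_md5s)}"

print(multipart_etag(b"x" * 10, 4))  # 3 parts → "<hex>-3"
```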
Versioning
Versioning OFF (default):
PUT key="photo.jpg" → single object, overwrite silently
DELETE key="photo.jpg" → permanent delete
Versioning ON (per bucket):
PUT key="photo.jpg" → new version, version_id="abc123"
PUT key="photo.jpg" → new version, version_id="def456" (abc123 still exists)
DELETE key="photo.jpg" → inserts delete marker (version_id="ghi789")
→ GET returns 404 (latest is delete marker)
→ GET?versionId=def456 → returns that version
DELETE key with versionId → permanently removes that specific version
Internal:
All versions stored with same key but different version_id
Metadata index: (bucket_id, key, version_id) → metadata
"is_latest" flag maintained on most-recent version
LIST only shows latest version unless ?versions param used
Erasure Coding (Storage Efficiency)
S3 does not replicate bytes 3× naively — uses erasure coding for efficiency.
Reed-Solomon (k=6 data shards, m=3 parity shards):
Object → split into 6 equal data shards
→ compute 3 parity shards (Reed-Solomon linear combinations over a Galois field, not plain XOR)
→ store 9 shards across 9 different storage nodes
(spread across 3+ AZs, different racks within AZs)
Reconstruct: any 6 of 9 shards → recover full object
Tolerate: loss of any 3 shards simultaneously
Storage overhead: 9/6 = 1.5× (vs 3× for naive replication)
11-nine durability: probability of losing 4+ shards simultaneously ≈ 10^-11
Read path with erasure coding:
Full read: fetch 6 data shards → assemble in order → stream to client
Byte-range: calculate which shard(s) contain requested range
→ fetch only those shards (not all 6)
→ faster for small range reads; S3 optimizes shard granularity
Degraded read: if 1-2 shards unavailable → fetch remaining + parity → reconstruct
→ slightly higher latency; transparent to client
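A toy illustration of the reconstruction idea: a single XOR parity shard lets you rebuild any one lost data shard. Real S3 uses Reed-Solomon with multiple parity shards (tolerating multiple simultaneous losses); XOR tolerates exactly one, but the recover-from-the-survivors mechanic is the same:

```python
def encode(data: bytes, k: int):
    shard_len = -(-len(data) // k)                     # ceil(len/k)
    shards = [data[i * shard_len:(i + 1) * shard_len].ljust(shard_len, b"\0")
              for i in range(k)]
    parity = bytes(shard_len)                          # all-zero start
    for s in shards:
        parity = bytes(a ^ b for a, b in zip(parity, s))
    return shards, parity

def reconstruct(shards, parity, lost: int):
    # XOR parity with all surviving shards to recover the lost one.
    recovered = parity
    for i, s in enumerate(shards):
        if i != lost:
            recovered = bytes(a ^ b for a, b in zip(recovered, s))
    return recovered

shards, parity = encode(b"hello world!", k=4)
print(reconstruct(shards, parity, lost=2) == shards[2])  # True
```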
Strong Consistency Implementation
S3 achieved strong consistency without a global lock by serializing operations through the metadata service.
Problem (pre-2020):
PUT object → write to metadata node A (replicated async)
GET object → hits metadata node B (may not have seen write yet)
→ stale read → eventual consistency
Solution (post-2020):
Metadata service uses single-leader per key-range shard
All reads and writes for a key go through the same leader
→ leader always has latest write
→ reads on same leader = always see latest committed write
→ no more cache inconsistency
→ cost: slightly higher read latency (must go to leader, not any replica)
Properties:
Read-after-write: guaranteed (same key, same region)
List-after-write: guaranteed (listing reflects latest PUTs/DELETEs)
Cross-region: still eventual (CRR is async)
Lifecycle Management
Background service that scans all objects and applies rules.
Lifecycle rule evaluation:
1. Periodic scanner: iterate all objects in bucket (batched by prefix)
2. Per object: evaluate all lifecycle rules in order
3. Match criteria: prefix, tag filter, age (days since creation/last modified)
4. Action: transition storage class | expire (delete) | abort multipart | expire noncurrent versions
Transition flow:
Object age reaches 30 days
→ Lifecycle service issues internal COPY to new storage class
→ Metadata updated: storage_class = STANDARD_IA
→ Old Standard storage freed
→ Client transparent: same key, same GET API
S3 storage class transitions (allowed direction):
Standard → IA → Glacier Instant → Glacier Flexible → Deep Archive
(cannot promote: Glacier → Standard requires manual Restore operation)
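The four-step rule evaluation above as a sketch — match prefix, tags, and age, then emit actions. Rule and record field names are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

def evaluate(obj, rules, now=None):
    now = now or datetime.now(timezone.utc)
    age_days = (now - obj["last_modified"]).days
    actions = []
    for rule in rules:                       # rules evaluated in order
        if not obj["key"].startswith(rule.get("prefix", "")):
            continue                         # prefix filter
        if rule.get("tags") and not rule["tags"].items() <= obj["tags"].items():
            continue                         # tag filter (subset match)
        if age_days >= rule["days"]:
            actions.append(rule["action"])   # e.g. transition or expire
    return actions

obj = {"key": "logs/2024/app.log", "tags": {},
       "last_modified": datetime.now(timezone.utc) - timedelta(days=45)}
rules = [{"prefix": "logs/", "days": 30, "action": "TRANSITION:STANDARD_IA"},
         {"prefix": "logs/", "days": 365, "action": "EXPIRE"}]
print(evaluate(obj, rules))  # → ['TRANSITION:STANDARD_IA']
```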
Pre-Signed URL — Internal Mechanism
Generation (server-side):
params = { bucket, key, method, expiry, credentials }
canonical_request = method + "\n" + bucket + "/" + key + "\n" + expiry + ...
signature = HMAC-SHA256(signing_key, canonical_request)
url = "https://s3.amazonaws.com/{bucket}/{key}
?X-Amz-Algorithm=AWS4-HMAC-SHA256
&X-Amz-Credential={access_key}/...
&X-Amz-Date={timestamp}
&X-Amz-Expires={seconds}
&X-Amz-Signature={signature}"
Validation (S3 front-end):
1. Parse query parameters from URL
2. Reconstruct canonical request identically
3. Recompute HMAC using stored secret key
4. Compare → if match: allow operation
5. Check X-Amz-Expires + X-Amz-Date → reject if expired
6. No session state stored server-side (stateless validation)
Key property: stateless — S3 stores nothing per pre-signed URL; validates on-the-fly.
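The stateless sign/verify cycle above, stripped to its core: sign (method, path, expiry) with a secret, then re-derive the same HMAC at validation time — no per-URL state. This is the shape of SigV4, not the full canonical-request algorithm, and the secret here is a stand-in for the derived signing key:

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret-key"  # stand-in for the derived SigV4 signing key

def presign(method: str, path: str, expires_at: int) -> str:
    msg = f"{method}\n{path}\n{expires_at}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def validate(method, path, expires_at, signature, now=None) -> bool:
    if (now or int(time.time())) > expires_at:
        return False                                    # expired URL
    expected = presign(method, path, expires_at)        # recompute HMAC
    return hmac.compare_digest(expected, signature)     # constant-time compare

exp = int(time.time()) + 3600
sig = presign("GET", "/bucket/key", exp)
print(validate("GET", "/bucket/key", exp, sig))   # True
print(validate("PUT", "/bucket/key", exp, sig))   # False — method mismatch
```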
Event Notification Pipeline
S3 operation completes (PUT, DELETE, COPY, multipart complete)
│
▼
EventBridge / Notification Service
- filters: event type, prefix, suffix, tags
- fan-out to multiple destinations:
→ SNS topic (fan-out to email, SQS, Lambda, HTTP)
→ SQS queue (durable, at-least-once delivery)
→ Lambda (direct invocation, at-least-once)
→ EventBridge (rich filtering, schema registry)
At-least-once delivery: events may duplicate → consumers must be idempotent
Order: not guaranteed across objects; per-object ordering best-effort
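Because delivery is at-least-once, consumers dedupe on a stable event identity. The `event_id` field and in-memory seen-set below are illustrative assumptions (real consumers typically key on bucket/key/sequencer and persist the set):

```python
seen = set()

def handle(event, process) -> bool:
    if event["event_id"] in seen:
        return False            # duplicate delivery — skip
    process(event)              # side effect runs exactly once per event id
    seen.add(event["event_id"])
    return True

results = []
e = {"event_id": "evt-1", "key": "photo.jpg"}
print(handle(e, results.append), handle(e, results.append), len(results))
# → True False 1
```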
Replication (CRR / SRR)
Write completes in source bucket
│
▼
Replication queue (per-bucket, per-destination, ordered by object)
│
▼
Replication worker:
1. GET object from source (metadata + bytes)
2. PUT to destination bucket (different region or same region)
3. Preserve: key, ETag, metadata, ACL, tags, storage class (configurable)
4. Mark source object replication_status = COMPLETED
5. On failure: retry with exponential backoff; alert via CloudWatch metric
Replication time:
S3 Replication Time Control (RTC): 99.99% of objects replicated within 15 min SLA
Standard CRR: best-effort (usually seconds, can be minutes under load)
What is NOT replicated by default:
- Objects uploaded before replication was enabled
- Delete markers (configurable opt-in)
- Objects in Glacier (must be restored first)
- Objects already replicated from another source (prevents loops)
S3 Intelligent-Tiering
Automatic tiering without lifecycle rules or retrieval fees.
Intelligent-Tiering monitors access patterns per object:
Frequent Access tier → objects accessed recently
Infrequent Access tier → not accessed for 30+ days → auto-moved
Archive Instant tier → not accessed for 90+ days → auto-moved
Archive tier → configurable 90–730 days threshold
Deep Archive tier → configurable 180–730 days threshold
Small monthly monitoring fee per object
No retrieval fee for tier transitions (unlike manual IA/Glacier)
Best for: unknown or unpredictable access patterns
S3 Express One Zone (2023) — Different Architecture
Purpose-built for single-digit millisecond latency.
| | S3 Standard | S3 Express One Zone |
|---|---|---|
| AZs | ≥3 | 1 (directory bucket) |
| Durability | 11 nines | 11-nines design, but single-AZ (no resilience to AZ loss) |
| Latency | 100–200ms TTFB | Single-digit-ms TTFB |
| Throughput | 5,500 GET/s per prefix | Hundreds of thousands of req/s per directory bucket |
| Auth | SigV4 | Session token (CreateSession → reuse) |
| Use case | General | ML training data, HPC, latency-sensitive |
| Pricing | Standard | Higher per-GB storage, lower per-request cost |
Session-based auth: CreateSession(bucket) → reuse session token for all requests in session → eliminates per-request IAM evaluation overhead.
Performance Optimization Patterns
| Pattern | How | Gain |
|---|---|---|
| Prefix randomization | Prefix keys with a hash: hash(key)/key | Distribute across partitions → bypass 3,500 PUT/s per-prefix limit |
| Byte-range GETs | Parallel Range: bytes=X-Y requests | Parallelize large object downloads |
| Multipart upload | Parallel part uploads | Parallelize large object uploads |
| S3 Transfer Acceleration | Route via CloudFront edge → AWS backbone | Reduce latency for global uploads |
| Batch Operations | S3 Batch → apply operation to billions of objects | Avoid iterating + calling per object |
| Requester Pays | Bucket owner pays storage; requester pays transfer | Public datasets (OpenData, AWS Registry) |
| S3 Select | SQL on individual objects (CSV, JSON, Parquet) | 80% less data transferred — server-side filter |
Key Design Decisions
| Decision | Choice | Why |
|---|---|---|
| Commit point | Metadata write (after all AZ bytes written) | Atomic visibility; GC handles orphaned bytes |
| Consistency | Single-leader per metadata shard | Strong consistency without global lock |
| Storage | Erasure coding (not 3× replication) | 1.5× overhead vs 3×; same 11-nine durability |
| Multipart | Up to 10k parts, parallel | Large object support + resumability |
| Pre-signed URLs | Stateless HMAC validation | No server state; scales to any QPS |
| Partition key | Hash prefix on object key (internal) | Avoid hot prefix, distribute sequential writes |
| Versioning | Delete markers (not physical delete) | Safe delete + restore; GC handles eventually |
| Lifecycle | Background scanner per bucket | Decouple from read/write hot path |
Interview Scenarios
"Hot prefix — single prefix gets 10k PUTs/sec (exceeds 3,500/s S3 limit)"
- S3 partitions by key prefix; sequential prefixes (dates, auto-increment) concentrate on 1 partition
- Fix: randomize prefix — hash(key)/original_key (e.g., a3f2/2024/01/photo.jpg)
- S3 auto-scales partitions when sustained traffic is detected; randomization just gets there faster
- For read-heavy: add CloudFront CDN in front — cache-hits don't reach S3 partition at all
"Upload a 5 TB database backup"
- S3 max single PUT: 5 GB; max object size: 5 TB
- Must use multipart (single PUT caps at 5 GB); the 10,000-part cap forces parts ≥ ~525 MB for a 5 TB object
- E.g., 1 GB parts → ~5,000 parts uploaded in parallel; one object, no need to split the backup
- Use S3 Transfer Acceleration for intercontinental uploads: route via CloudFront edge → AWS backbone
- Verify: after complete, GetObjectAttributes returns the part list; validate all part checksums match
"Need to process every uploaded object in real-time (virus scan, thumbnail generation)"
- S3 Event Notifications → SNS/SQS/Lambda on s3:ObjectCreated:*
- Lambda invoked per upload: runs scan/transform synchronously; writes result back or tags object
- For heavy processing: S3 → SQS → EC2 worker fleet (decouple ingestion from processing)
- S3 Object Lambda: transform object content on-the-fly during GET (resize image before serving)
"Objects must be immutable — no overwrites, no deletes for 7 years (compliance)"
- Enable S3 Object Lock with Compliance mode: RetainUntilDate = now + 7 years
- Compliance mode: not even the root account can delete/overwrite before the retention date
- Governance mode: admin can override with special permission (less strict, for internal audit)
- Versioning must be enabled — each PUT creates new version; locked version protected independently
"Cross-region disaster recovery — RPO < 1 minute"
- S3 Cross-Region Replication (CRR) with Replication Time Control (RTC): SLA of 99.99% of objects replicated within 15 minutes
- For RPO <1 min: not achievable with S3 alone → use active-active app writes to both regions simultaneously
- Failover: Route 53 health check + latency routing → point traffic to replica region in <60s
"Constraint: reduce GET costs for a public dataset (millions of requester-pays GETs)"
- Enable Requester Pays on the bucket: requesters pay transfer + GET costs, not the bucket owner
- Enable CloudFront distribution: requesters served from edge cache; S3 GETs collapse to cache misses only
- Compress objects at upload (gzip/zstd): smaller transfers → lower requester cost → more adoption
"Need to query data inside objects without downloading entire file"
- S3 Select: run SQL on CSV/JSON/Parquet objects; only matching rows returned → up to 80% less data transferred
- For analytics: store in Parquet columnar format → S3 Select only reads relevant columns; massive reduction
- For large-scale querying: AWS Athena (queries S3 via Glue catalog) — serverless SQL on S3 at scale