Amazon S3

Scale (Public Numbers)

  • 350 trillion+ objects stored (as of 2023)
  • 100 million+ requests/sec at peak
  • 11 nines durability (99.999999999%)
  • 99.99% availability (Standard tier)
  • Launched 2006 — one of AWS's oldest services

Functional Requirements

  • PUT object (single + multipart)
  • GET object (full + byte-range)
  • DELETE object (versioned or permanent)
  • HEAD object (metadata only)
  • LIST objects in bucket (paginated, prefix/delimiter filter)
  • Bucket operations: CREATE, DELETE, configure policy/ACL/versioning/lifecycle
  • Pre-signed URLs (time-limited, capability-based access)
  • Event notifications on object operations
  • Replication (same-region, cross-region)
  • Lifecycle transitions (tier down, expire)

Non-Functional Requirements

  • Durability: 11 nines — store across ≥3 AZs; erasure-coded
  • Availability: 99.99% reads, 99.9% writes
  • Consistency: strong (since Dec 2020) for single object; eventual for cross-region replication
  • Latency: < 200ms time to first byte (TTFB) for in-region requests
  • Throughput: per-prefix up to 5,500 GET/s and 3,500 PUT/s (S3 scales per prefix)
  • Scale: unlimited storage; auto-scale; no pre-provisioning

Capacity Estimation (80:20)

Objects stored:      350 trillion
Avg object size:     1 MB (blended — many small + few large)
Total raw data:      350T × 1 MB ≈ 350 EB raw; × ~1.5 erasure-coding overhead → ~500 EB stored
Write QPS:           ~20M PUT/s (20% of 100M req/s)
Read QPS:            ~80M GET/s

Metadata per object: ~200 bytes
Total metadata:      350T × 200B ≈ 70 PB metadata

Daily new data:      exabytes/day globally
Per-region write BW: 10s of TB/s

IOPS (per region estimate):
  Write: 20M PUT/s × 1 sequential write per PUT = 20M sequential IOPS
         Each object erasure-coded to 9 shards → 20M × 9 = 180M shard writes/s
         NVMe SSD: ~1M sequential IOPS → 180+ storage nodes minimum for writes
  Read:  80M GET/s served mostly from memory (page cache of hot data)
         Cache miss → random read; assuming 1% miss = 800k random IOPS
         Per storage node (NVMe 500k IOPS): 2+ nodes just for random reads
         In practice: thousands of storage nodes → reads spread across fleet

Server count (per region, estimated):
  Frontend fleet:      1,000s of stateless servers (100M req/s ÷ 50k/server)
  Storage nodes:       10,000s (petabytes of storage; throughput-driven)
  Metadata nodes:      100s (purpose-built distributed KV, sharded heavily)
  Replication workers: 100s (async CRR queue processors)
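The arithmetic above can be checked with a short script. The inputs are the same public/assumed figures used in this section, not measurements:

```python
# Back-of-envelope capacity check using the assumed figures above.
OBJECTS = 350e12          # 350 trillion objects (public number)
AVG_OBJECT_BYTES = 1e6    # 1 MB blended average (assumption)
METADATA_BYTES = 200      # per-object index record (assumption)
EC_OVERHEAD = 9 / 6       # erasure coding: 9 shards for 6 data shards

raw_bytes = OBJECTS * AVG_OBJECT_BYTES       # ~3.5e20 B  = 350 EB
stored_bytes = raw_bytes * EC_OVERHEAD       # ~5.25e20 B ≈ 500 EB
metadata_bytes = OBJECTS * METADATA_BYTES    # ~7e16 B    = 70 PB

print(f"raw data: {raw_bytes / 1e18:.0f} EB")
print(f"stored:   {stored_bytes / 1e18:.0f} EB")
print(f"metadata: {metadata_bytes / 1e15:.0f} PB")
```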

High-Level Architecture

                     ┌─────────────────────────────────────────┐
                     │              S3 Service                  │
                     │                                          │
Client ──HTTPS──►  Front-End Fleet (per-AZ)                    │
                     │   auth │ routing │ rate-limit            │
                     │        │                                 │
              ┌──────┼────────┼─────────────────┐              │
              │      ▼        ▼                 │              │
              │  Auth/IAM  API Router           │              │
              │  Service   (method dispatch)    │              │
              │      │        │                 │              │
              │      └────────┤                 │              │
              │               ▼                 │              │
              │        Metadata Service ────────┼─ Bucket DB   │
              │          (object index)         │              │
              │               │                 │              │
              │               ▼                 │              │
              │        Storage Service          │              │
              │      (chunk placement,          │              │
              │       read/write to nodes)      │              │
              │               │                 │              │
              │    ┌──────────┼──────────┐      │              │
              │    ▼          ▼          ▼      │              │
              │  AZ-1       AZ-2       AZ-3     │              │
              │  Storage    Storage    Storage  │              │
              │  Nodes      Nodes      Nodes    │              │
              └─────────────────────────────────┘              │
                     │                                          │
              ┌──────┼──────────────────────────────────────┐  │
              │   Background Services                        │  │
              │   Replication | Lifecycle | GC | Inventory   │  │
              └─────────────────────────────────────────────┘  │
                     └─────────────────────────────────────────┘

Core Components (HLD)

  • Front-End Fleet: TLS termination, auth token parsing, per-method request routing, rate limiting, request logging
  • IAM / Auth Service: validate SigV4 signatures; evaluate IAM policies, bucket ACLs, pre-signed URL expiry
  • API Router: dispatch to the correct handler: PUT→storage path, GET→read path, LIST→index service
  • Metadata / Index Service: persistent index: (bucket, key) → {size, ETag, version_id, storage_location, ACL, tags, ...}
  • Bucket Service: bucket creation/deletion, versioning state, lifecycle rules, replication config, event notification config
  • Storage Service: determine placement (which storage nodes), orchestrate writes to ≥3 AZs, serve reads
  • Storage Nodes: durable disk storage per AZ; serve individual chunk reads/writes; checksum every block
  • Replication Service: async SRR (same-region) / CRR (cross-region); ordered per-object replication queue
  • Lifecycle Service: periodically scans objects, applies transition (tier) or expiration rules
  • Garbage Collector: reclaims orphaned chunks: deleted objects, failed writes, abandoned multipart parts
  • Event Bridge: fans out events (ObjectCreated, ObjectDeleted, etc.) to SNS/SQS/Lambda

Low-Level Design

Object Naming and Partitioning

S3 presents a flat namespace but internally indexes hierarchically.

S3 key:  "photos/2024/jan/img001.jpg"

Internal key: hash(bucket_name) + "/" + object_key
→ distributed across metadata shards by key hash prefix

Partition scheme (pre-2018): first characters of key determined partition
→ hot prefix problem: all keys starting with "img-" → same shard

Partition scheme (post-2018): S3 internally prefixes with a hash of your key
→ client key "2024-01-01/photo.jpg" → internal: "3a7f/2024-01-01/photo.jpg"
→ Transparent to clients; automatically distributes sequential keys

Throughput scaling: 3,500 PUT/s and 5,500 GET/s per prefix — add prefixes to parallelize.
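The salting idea can be sketched in a few lines. The 4-character prefix length and the hash choice here are assumptions for illustration; S3's actual internal scheme is not public:

```python
import hashlib

def internal_key(bucket: str, client_key: str) -> str:
    """Salt a client key with a short hash prefix so that sequential
    client keys (dates, counters) land on different index shards."""
    salt = hashlib.sha256(f"{bucket}/{client_key}".encode()).hexdigest()[:4]
    return f"{salt}/{client_key}"

# Sequential client keys get spread by their salt prefix:
keys = [internal_key("photos", f"2024-01-{d:02d}/img.jpg") for d in (1, 2, 3)]
```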


Metadata Service (Index Layer)

Key:   (bucket_id, object_key, version_id)
Value: {
  content_length   : int64
  content_type     : string
  etag             : MD5 of object bytes
  last_modified    : timestamp
  storage_class    : STANDARD | IA | GLACIER | ...
  server_side_enc  : AES256 | aws:kms | null
  replication_status: PENDING | COMPLETED | FAILED | REPLICA
  acl              : canned ACL or bucket policy reference
  user_metadata    : map<string, string>  (x-amz-meta-* headers)
  tags             : map<string, string>  (S3 Object Tagging)
  chunks           : [{chunk_id, az, node_id, offset, length}]
  delete_marker    : bool
  is_latest        : bool
  part_list        : null | [{part_no, etag, size}]  (multipart)
}

Metadata store internals (inferred from AWS talks):

  • Purpose-built distributed KV store (not off-the-shelf)
  • Sharded by hash(bucket_id + object_key) across many nodes
  • Each shard replicated across AZs for HA
  • Strong consistency: single-writer per key shard (leader-based replication)
  • Metadata write is the commit point — storage bytes written before metadata

Write Path (PUT Object)

1. Client → PUT /bucket/key  HTTP/1.1 (with SigV4 auth header)

2. Front-End:
   a. Validate SigV4 signature (HMAC-SHA256 of canonical request)
   b. IAM policy evaluation: does this identity have s3:PutObject?
   c. Bucket exists? Bucket-level checks (versioning, object lock, ACL)
   d. Assign internal request-id (for tracing, billing)

3. Storage Service:
   a. Receive streaming bytes from front-end
   b. Compute MD5 + SHA256 as bytes arrive
   c. Select target nodes: 1 primary per AZ (3 AZs = 3 nodes minimum)
   d. Stream bytes to primary node per AZ (parallel, not sequential)
   e. Each storage node: write to local append-only storage file + checksum

4. Durability barrier:
   a. Wait for all 3 AZ primaries to acknowledge (sync — before any response)
   b. If any AZ fails: return error, trigger cleanup of partial writes

5. Metadata commit:
   a. Write metadata record (with chunk locations) to Metadata Service
   b. This write is the COMMIT POINT — object now visible to reads
   c. Metadata write is strongly consistent (single leader per shard)

6. Response to client:
   a. 200 OK with ETag (MD5 of content)
   b. Optionally: x-amz-version-id if bucket is versioned

7. Background:
   a. Erasure coding: recode chunks for storage efficiency
   b. Cross-region replication (if configured): enqueue for async delivery
   c. Event notification: publish ObjectCreated to configured destinations

Why metadata write = commit point:

  • If bytes written but metadata not committed → partial write; GC will clean up orphaned chunks
  • If metadata committed → object is visible; bytes guaranteed durable (3 AZs already ACKed)
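The ordering in steps 3–5 can be condensed into a toy in-memory model (illustrative names, not S3's actual code) showing why the metadata write is the commit point:

```python
import hashlib
import uuid

storage_nodes = {"az1": {}, "az2": {}, "az3": {}}   # az -> {chunk_id: bytes}
metadata_index = {}                                  # (bucket, key) -> record

def put_object(bucket: str, key: str, body: bytes) -> str:
    chunk_id = str(uuid.uuid4())
    # 1. Durability barrier: bytes must land in every AZ before any response.
    for az in storage_nodes:
        storage_nodes[az][chunk_id] = body           # simulated sync ACK
    # 2. Metadata write is the commit point: only now is the object visible.
    etag = hashlib.md5(body).hexdigest()
    metadata_index[(bucket, key)] = {
        "etag": etag,
        "size": len(body),
        "chunks": [{"chunk_id": chunk_id, "az": az} for az in storage_nodes],
    }
    return etag

def get_object(bucket: str, key: str) -> bytes:
    rec = metadata_index[(bucket, key)]              # strongly consistent read
    chunk = rec["chunks"][0]                         # any healthy AZ will do
    return storage_nodes[chunk["az"]][chunk["chunk_id"]]

etag = put_object("photos", "cat.jpg", b"hello")
```

If the process died between steps 1 and 2, the chunks would exist but no metadata would point at them; that is exactly the orphaned state the GC reclaims.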

Read Path (GET Object)

1. Client → GET /bucket/key  (optionally: Range: bytes=0-1048575)

2. Front-End:
   a. Auth + IAM policy: s3:GetObject?
   b. Check object ACL / bucket policy
   c. Conditional request? Check ETag (If-None-Match) or date (If-Modified-Since)
      → 304 Not Modified if unchanged

3. Metadata lookup:
   a. Fetch object metadata by (bucket_id, key)
   b. Get chunk list: [{az, node_id, chunk_id, offset, length}]
   c. Check storage class: if GLACIER → return error (restore required)

4. Storage read:
   a. Select closest/fastest AZ (prefer same AZ as front-end for latency)
   b. If Range header: map byte range → relevant chunk(s)
   c. Fetch chunk(s) from storage node
   d. Verify checksum on each chunk (detect bit rot)
   e. Stream bytes to client as they arrive (no full buffer)

5. Response:
   Content-Type, ETag, Last-Modified, Content-Length headers
   If range request: 206 Partial Content + Content-Range header

Strong consistency (since Dec 2020):

  • Achieved via serialization in the Metadata Service
  • Every GET reads from the same strongly-consistent metadata shard
  • No read-your-writes caching issues — metadata write is immediately visible to all readers
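Mapping a Range header onto chunks (step 4b of the read path) is plain interval arithmetic; a sketch assuming chunk records shaped like those in the metadata schema:

```python
def chunks_for_range(chunks, start: int, end: int):
    """Return (chunk, skip, take) triples covering bytes [start, end] of the
    object, given chunks listed in order as {offset, length} records."""
    plan = []
    for c in chunks:
        c_start, c_end = c["offset"], c["offset"] + c["length"] - 1
        if c_end < start or c_start > end:
            continue                          # chunk outside requested range
        skip = max(0, start - c_start)        # bytes to skip inside this chunk
        take = min(c_end, end) - max(c_start, start) + 1
        plan.append((c, skip, take))
    return plan

chunks = [{"offset": 0, "length": 4096}, {"offset": 4096, "length": 4096}]
plan = chunks_for_range(chunks, 4000, 4200)   # range straddles both chunks
```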

Multipart Upload — Internal Flow

Initiate:
  POST /bucket/key?uploads
  → generate UploadId (UUID)
  → create "in-progress upload" record in metadata (state=PENDING)

Upload parts (parallel, any order):
  PUT /bucket/key?partNumber=N&uploadId=X  body=bytes
  → each part stored as independent object fragment
  → storage path: /internal/multipart/{UploadId}/{partNumber}
  → returns ETag for that part (MD5 of part bytes)
  → minimum part size: 5 MB (last part can be smaller)
  → max parts: 10,000; max part size: 5 GB (S3 caps total object size at 5 TB)

Complete:
  POST /bucket/key?uploadId=X
  Body: <part list with ETags in order>
  → validate all parts present and ETag matches
  → concatenate chunk references into final object metadata
  → write final metadata record (state=ACTIVE)  ← commit point
  → delete in-progress upload record
  → object now visible as single addressable key

Abort:
  DELETE /bucket/key?uploadId=X
  → mark in-progress record ABORTED
  → GC reclaims part storage asynchronously

S3 Lifecycle rule for abandoned multipart:

{ "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 } }

Without this: parts accumulate, charging storage costs indefinitely.
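One observable consequence of this design: the ETag of a completed multipart object is not the MD5 of the body, but the MD5 of the concatenated binary part digests with a "-N" part-count suffix. It can be reproduced locally:

```python
import hashlib

def multipart_etag(parts: list[bytes]) -> str:
    """Compute the ETag S3 reports for a completed multipart upload:
    MD5 over the concatenated binary MD5 digests of each part, plus -N."""
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"

etag = multipart_etag([b"a" * 1024, b"b" * 1024])
```

This is why a multipart ETag cannot be compared directly against the MD5 of a local file unless you replay the same part boundaries.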


Versioning

Versioning OFF (default):
  PUT key="photo.jpg"   → single object, overwrite silently
  DELETE key="photo.jpg" → permanent delete

Versioning ON (per bucket):
  PUT key="photo.jpg"   → new version, version_id="abc123"
  PUT key="photo.jpg"   → new version, version_id="def456" (abc123 still exists)
  DELETE key="photo.jpg" → inserts delete marker (version_id="ghi789")
                           → GET returns 404 (latest is delete marker)
                           → GET?versionId=def456 → returns that version
  DELETE key with versionId → permanently removes that specific version

Internal:
  All versions stored with same key but different version_id
  Metadata index: (bucket_id, key, version_id) → metadata
  "is_latest" flag maintained on most-recent version
  LIST only shows latest version unless ?versions param used

Erasure Coding (Storage Efficiency)

S3 does not replicate bytes 3× naively — uses erasure coding for efficiency.

Reed-Solomon (k=6 data shards, m=3 parity shards):

Object → split into 6 equal data shards
       → compute 3 parity shards (Reed-Solomon arithmetic over a Galois field)
       → store 9 shards across 9 different storage nodes
          (spread across 3+ AZs, different racks within AZs)

Reconstruct: any 6 of 9 shards → recover full object
Tolerate:    loss of any 3 shards simultaneously

Storage overhead: 9/6 = 1.5× (vs 3× for naive replication)
11-nine durability: probability of losing 4+ shards simultaneously ≈ 10^-11

Read path with erasure coding:

Full read:    fetch 6 data shards → assemble in order → stream to client
Byte-range:   calculate which shard(s) contain requested range
              → fetch only those shards (not all 6)
              → faster for small range reads; S3 optimizes shard granularity
Degraded read: if 1-2 shards unavailable → fetch remaining + parity → reconstruct
              → slightly higher latency; transparent to client
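The "lose 4+ of 9 shards" durability claim can be framed as binomial math. The sketch below assumes independent shard failures with a hypothetical per-shard failure probability; the real durability model also accounts for detection and repair time:

```python
from math import comb

def p_data_loss(n: int, k: int, p_shard: float) -> float:
    """Probability that more than n-k of n shards fail at once, assuming
    independent shard failures (an optimistic simplification)."""
    max_lost = n - k                      # parity budget: 3 shards for 9/6
    return sum(comb(n, i) * p_shard**i * (1 - p_shard)**(n - i)
               for i in range(max_lost + 1, n + 1))

loss = p_data_loss(9, 6, 1e-3)            # hypothetical 0.1% shard failure rate
```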

Strong Consistency Implementation

S3 achieved strong consistency without a global lock by serializing operations through the metadata service.

Problem (pre-2020):
  PUT object   → write to metadata node A (replicated async)
  GET object   → hits metadata node B (may not have seen write yet)
  → stale read → eventual consistency

Solution (post-2020):
  Metadata service uses single-leader per key-range shard
  All reads and writes for a key go through the same leader
  → leader always has latest write
  → reads on same leader = always see latest committed write
  → no more cache inconsistency
  → cost: slightly higher read latency (must go to leader, not any replica)

Properties:
  Read-after-write:   guaranteed (same key, same region)
  List-after-write:   guaranteed (listing reflects latest PUTs/DELETEs)
  Cross-region:       still eventual (CRR is async)

Lifecycle Management

Background service that scans all objects and applies rules.

Lifecycle rule evaluation:
  1. Periodic scanner: iterate all objects in bucket (batched by prefix)
  2. Per object: evaluate all lifecycle rules in order
  3. Match criteria: prefix, tag filter, age (days since creation/last modified)
  4. Action: transition storage class | expire (delete) | abort multipart | expire noncurrent versions

Transition flow:
  Object age reaches 30 days
  → Lifecycle service issues internal COPY to new storage class
  → Metadata updated: storage_class = STANDARD_IA
  → Old Standard storage freed
  → Client transparent: same key, same GET API

S3 storage class transitions (allowed direction):
  Standard → IA → Glacier Instant → Glacier Flexible → Deep Archive
  (cannot promote: Glacier → Standard requires manual Restore operation)
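The scanner's per-object evaluation (steps 2–4 above) is essentially a pure function of the object's age and attributes. A simplified first-match sketch; the real service also applies precedence rules between conflicting actions:

```python
def evaluate_lifecycle(obj: dict, rules: list[dict]):
    """Return the action of the first matching rule, or None.
    obj: {key, age_days, tags}; rule: {prefix, min_age_days, action, tags?}."""
    for rule in rules:
        if not obj["key"].startswith(rule.get("prefix", "")):
            continue                              # prefix filter
        if rule.get("tags") and not rule["tags"].items() <= obj["tags"].items():
            continue                              # tag filter (subset match)
        if obj["age_days"] >= rule["min_age_days"]:
            return rule["action"]
    return None

rules = [
    {"prefix": "logs/", "min_age_days": 30, "action": "TRANSITION:STANDARD_IA"},
    {"prefix": "logs/", "min_age_days": 365, "action": "EXPIRE"},
]
action = evaluate_lifecycle({"key": "logs/app.log", "age_days": 45, "tags": {}}, rules)
```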

Pre-Signed URL — Internal Mechanism

Generation (server-side):
  params = { bucket, key, method, expiry, credentials }
  canonical_request = method + "\n" + bucket + "/" + key + "\n" + expiry + ...
  signature = HMAC-SHA256(signing_key, canonical_request)
  url = "https://s3.amazonaws.com/{bucket}/{key}
         ?X-Amz-Algorithm=AWS4-HMAC-SHA256
         &X-Amz-Credential={access_key}/...
         &X-Amz-Date={timestamp}
         &X-Amz-Expires={seconds}
         &X-Amz-Signature={signature}"

Validation (S3 front-end):
  1. Parse query parameters from URL
  2. Reconstruct canonical request identically
  3. Recompute HMAC using stored secret key
  4. Compare → if match: allow operation
  5. Check X-Amz-Expires + X-Amz-Date → reject if expired
  6. No session state stored server-side (stateless validation)

Key property: stateless — S3 stores nothing per pre-signed URL; validates on-the-fly.
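The stateless generate/validate round-trip can be shown with plain HMAC. This is a deliberately simplified sketch: real SigV4 derives the signing key through several chained HMAC stages and canonicalizes many more request fields:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"    # per-identity signing key (assumed)

def presign(method: str, bucket: str, key: str, expires_at: int) -> str:
    msg = f"{method}\n{bucket}/{key}\n{expires_at}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()

def validate(method, bucket, key, expires_at, signature, now=None) -> bool:
    now = int(time.time()) if now is None else now
    if now > expires_at:
        return False                             # expired: reject outright
    expected = presign(method, bucket, key, expires_at)
    return hmac.compare_digest(expected, signature)  # constant-time compare

sig = presign("GET", "bkt", "photo.jpg", 2_000_000_000)
```

Nothing is stored between `presign` and `validate`; any front-end holding the secret can verify any URL, which is what lets validation scale horizontally.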


Event Notification Pipeline

S3 operation completes (PUT, DELETE, COPY, multipart complete)
      │
      ▼
EventBridge / Notification Service
  - filters: event type, prefix, suffix, tags
  - fan-out to multiple destinations:
      → SNS topic  (fan-out to email, SQS, Lambda, HTTP)
      → SQS queue  (durable, at-least-once delivery)
      → Lambda     (direct invocation, at-least-once)
      → EventBridge (rich filtering, schema registry)

At-least-once delivery: events may duplicate → consumers must be idempotent
Order: not guaranteed across objects; per-object ordering best-effort
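Because delivery is at-least-once, consumers typically dedupe on a stable event identity; S3 event records include a per-key sequencer suited to this. A sketch (the event shape mirrors the real payload, trimmed to the fields used):

```python
seen = set()   # in production: a TTL'd store (e.g. DynamoDB), not a local set

def handle_event(event: dict) -> bool:
    """Process an S3 event at most once; return False on duplicate delivery."""
    rec = event["Records"][0]["s3"]
    dedup_key = (rec["bucket"]["name"], rec["object"]["key"],
                 rec["object"].get("sequencer"))
    if dedup_key in seen:
        return False               # duplicate: drop silently
    seen.add(dedup_key)
    # ... actual work (virus scan, thumbnail, index update) goes here ...
    return True

evt = {"Records": [{"s3": {"bucket": {"name": "b"},
                           "object": {"key": "k", "sequencer": "0055AED6"}}}]}
```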

Replication (CRR / SRR)

Write completes in source bucket
      │
      ▼
Replication queue (per-bucket, per-destination, ordered by object)
      │
      ▼
Replication worker:
  1. GET object from source (metadata + bytes)
  2. PUT to destination bucket (different region or same region)
  3. Preserve: key, ETag, metadata, ACL, tags, storage class (configurable)
  4. Mark source object replication_status = COMPLETED
  5. On failure: retry with exponential backoff; alert via CloudWatch metric

Replication time:
  S3 Replication Time Control (RTC): 99.99% of objects replicated within 15 min SLA
  Standard CRR: best-effort (usually seconds, can be minutes under load)

What is NOT replicated by default:
  - Objects uploaded before replication was enabled
  - Delete markers (configurable opt-in)
  - Objects in Glacier (must be restored first)
  - Objects already replicated from another source (prevents loops)

S3 Intelligent-Tiering

Automatic tiering without lifecycle rules or retrieval fees.

Intelligent-Tiering monitors access patterns per object:
  Frequent Access tier    → objects accessed recently
  Infrequent Access tier  → not accessed for 30+ days → auto-moved
  Archive Instant tier    → not accessed for 90+ days → auto-moved
  Archive tier            → configurable 90–730 days threshold
  Deep Archive tier       → configurable 180–730 days threshold

Small monthly monitoring fee per object
No retrieval fee for tier transitions (unlike manual IA/Glacier)
Best for: unknown or unpredictable access patterns
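The tier choice reduces to thresholds on days since last access. A simplified decision function (the real service distinguishes automatic tiers from the opt-in archive tiers; thresholds here are the documented defaults):

```python
def intelligent_tier(days_since_access: int,
                     archive_after: int = 90,
                     deep_archive_after: int = 180) -> str:
    """Pick an Intelligent-Tiering tier from time since last access.
    archive_after / deep_archive_after model the opt-in thresholds."""
    if days_since_access >= deep_archive_after:
        return "DEEP_ARCHIVE"
    if days_since_access >= archive_after:
        return "ARCHIVE_INSTANT"
    if days_since_access >= 30:
        return "INFREQUENT_ACCESS"
    return "FREQUENT_ACCESS"
```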

S3 Express One Zone (2023) — Different Architecture

Purpose-built for single-digit millisecond latency.

              S3 Standard              S3 Express One Zone
AZs           ≥3                       1 (directory bucket)
Durability    11 nines                 99.999999% (lower; single AZ)
Latency       100–200ms TTFB           10ms TTFB
Throughput    5,500 GET/s per prefix   100× higher per directory
Auth          SigV4 per request        Session token (CreateSession → reuse)
Use case      General purpose          ML training data, HPC, latency-sensitive
Pricing       Standard rates           ~65% higher storage, lower request cost

Session-based auth: CreateSession(bucket) → reuse session token for all requests in session → eliminates per-request IAM evaluation overhead.


Performance Optimization Patterns

  • Prefix randomization: prefix keys with a hash, e.g. hash(key)/key → spreads load across partitions, bypassing the 3,500 PUT/s per-prefix limit
  • Byte-range GETs: parallel Range: bytes=X-Y requests → parallelize large object downloads
  • Multipart upload: parallel part uploads → parallelize large object uploads
  • S3 Transfer Acceleration: route via CloudFront edge onto the AWS backbone → lower latency for global uploads
  • Batch Operations: S3 Batch applies one operation to billions of objects → avoids iterating and calling per object
  • Requester Pays: bucket owner pays storage; requester pays transfer/requests → viable for public datasets (OpenData, AWS Registry)
  • S3 Select: SQL on individual objects (CSV, JSON, Parquet) → up to 80% less data transferred via server-side filtering

Key Design Decisions

  • Commit point: metadata write after all AZ bytes are written → atomic visibility; GC handles orphaned bytes
  • Consistency: single leader per metadata shard → strong consistency without a global lock
  • Storage: erasure coding rather than 3× replication → 1.5× overhead instead of 3× at the same 11-nine durability
  • Multipart: up to 10k parts, uploaded in parallel → large-object support plus resumability
  • Pre-signed URLs: stateless HMAC validation → no server state; scales to any QPS
  • Partition key: internal hash prefix on the object key → avoids hot prefixes, distributes sequential writes
  • Versioning: delete markers rather than physical deletes → safe delete and restore; GC cleans up eventually
  • Lifecycle: background scanner per bucket → decoupled from the read/write hot path

Interview Scenarios

"Hot prefix — single prefix gets 10k PUTs/sec (exceeds 3,500/s S3 limit)"

  • S3 partitions by key prefix; sequential prefixes (dates, auto-increment) concentrate on 1 partition
  • Fix: randomize prefix — hash(key)/original_key (e.g., a3f2/2024/01/photo.jpg)
  • S3 auto-scales partitions when traffic detected; randomization just gets there faster
  • For read-heavy: add CloudFront CDN in front — cache-hits don't reach S3 partition at all

"Upload a 5 TB database backup"

  • S3 max single PUT: 5 GB; max object size: 5 TB
  • Must use multipart, but at 64 MB parts, 5 TB needs ~80,000 parts, far above the 10,000-part cap → part size must be ≥ ~525 MB (e.g., 1 GB parts → ~5,000 parts, uploaded in parallel)
  • Alternative: keep 64 MB parts and split the backup into 8 objects of 640 GB each (≤10,000 parts per object)
  • Use S3 Transfer Acceleration for intercontinental uploads: route via CloudFront edge → AWS backbone
  • Verify: after complete, GetObjectAttributes returns list of part ETags; validate all match
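The part-size arithmetic in this scenario can be checked directly:

```python
import math

MAX_PARTS = 10_000
MIN_PART = 5 * 1024**2            # 5 MiB minimum (last part exempt)

def plan_multipart(object_bytes: int, part_bytes: int) -> int:
    """Validate a part size against S3's limits; return the part count."""
    if part_bytes < MIN_PART:
        raise ValueError("part below the 5 MiB minimum")
    parts = math.ceil(object_bytes / part_bytes)
    if parts > MAX_PARTS:
        raise ValueError(f"{parts} parts exceeds the 10,000-part cap; "
                         f"need parts >= {math.ceil(object_bytes / MAX_PARTS)} bytes")
    return parts

five_tb = 5 * 1024**4
parts = plan_multipart(five_tb, 1024**3)   # 1 GiB parts → 5,120 parts
```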

"Need to process every uploaded object in real-time (virus scan, thumbnail generation)"

  • S3 Event Notifications → SNS/SQS/Lambda on s3:ObjectCreated:*
  • Lambda invoked per upload: runs scan/transform synchronously; writes result back or tags object
  • For heavy processing: S3 → SQS → EC2 worker fleet (decouple ingestion from processing)
  • S3 Object Lambda: transform object content on-the-fly during GET (resize image before serving)

"Objects must be immutable — no overwrites, no deletes for 7 years (compliance)"

  • Enable S3 Object Lock with Compliance mode: RetainUntilDate = now + 7 years
  • Compliance mode: not even root account can delete/overwrite before retention date
  • Governance mode: admin can override with special permission (less strict, for internal audit)
  • Versioning must be enabled — each PUT creates new version; locked version protected independently

"Cross-region disaster recovery — RPO < 1 minute"

  • S3 Cross-Region Replication (CRR) with Replication Time Control (RTC): SLA that 99.99% of objects replicate within 15 minutes
  • For RPO <1 min: not achievable with S3 alone → use active-active app writes to both regions simultaneously
  • Failover: Route 53 health check + latency routing → point traffic to replica region in <60s

"Constraint: reduce GET costs for a public dataset (millions of requester-pays GETs)"

  • Enable Requester Pays bucket: requesters pay transfer + GET costs, not bucket owner
  • Enable CloudFront distribution: requesters served from edge cache; S3 GETs collapse to cache misses only
  • Compress objects at upload (gzip/zstd): smaller transfers → lower requester cost → more adoption

"Need to query data inside objects without downloading entire file"

  • S3 Select: run SQL on CSV/JSON/Parquet objects; only matching rows returned → up to 80% less data transferred
  • For analytics: store in Parquet columnar format → S3 Select only reads relevant columns; massive reduction
  • For large-scale querying: AWS Athena (queries S3 via Glue catalog) — serverless SQL on S3 at scale