# Load Balancer

## What It Does
- Distributes incoming traffic across multiple backend servers
- Hides topology from clients (single VIP / DNS entry)
- Health checks — removes unhealthy backends automatically
- TLS termination, connection pooling, rate limiting (L7)
## L4 vs L7

| Aspect | L4 (Transport Layer) | L7 (Application Layer) |
|---|---|---|
| Works at | TCP / UDP | HTTP / HTTPS / gRPC / WebSocket |
| Sees | IP + port only | URL, headers, cookies, body |
| Routing basis | IP tuple hash | Path, host, header, method |
| TLS | Pass-through (no termination) | Terminates TLS, re-encrypts to backend |
| Latency | Very low (no parsing) | Slightly higher (parses HTTP) |
| Sticky sessions | IP-hash only | Cookie-based, header-based |
| Health check | TCP connect | HTTP GET /health — checks app logic |
| Use case | Raw TCP (DB proxies, SMTP), ultra-low latency | HTTP APIs, microservices, A/B testing |
| Examples | AWS NLB, HAProxy TCP mode, IPVS | AWS ALB, Nginx, Envoy, HAProxy HTTP mode |
## When to Use L4
- Non-HTTP protocols (MySQL proxy, Redis proxy, SMTP)
- Need absolute minimum latency overhead
- TLS passthrough required (client cert auth end-to-end)
- Very high connection rates (millions/sec)
## When to Use L7
- HTTP/HTTPS microservices (almost always)
- Need URL-based routing (/api → service A, /static → S3)
- Header-based routing (canary, A/B, multi-tenant)
- Rate limiting, auth, request rewriting at the edge
- WebSocket upgrades, gRPC (HTTP/2)
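The URL- and header-based routing above can be sketched as a tiny prefix router. This is an illustrative sketch, not any real LB's API; the route table and service names are made up, and production L7 LBs match whole path segments and combine host/header/method rules.

```python
# Hypothetical prefix route table: first matching prefix wins.
ROUTES = [("/api", "service-a"), ("/static", "s3-origin")]

def route(path: str, default: str = "default-pool") -> str:
    """Return the backend pool for a request path (longest-listed prefix wins)."""
    for prefix, target in ROUTES:
        if path.startswith(prefix):
            return target
    return default

print(route("/api/users"))     # service-a
print(route("/static/x.css"))  # s3-origin
print(route("/health"))        # default-pool
```

Note that naive `startswith` also matches `/apifoo`; real routers split on `/` before comparing.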
## Load Balancing Algorithms

### Stateless (no per-backend state)

| Algorithm | How | Best For |
|---|---|---|
| Round Robin | Requests 1, 2, 3... distributed in order | Homogeneous servers, uniform requests |
| Weighted Round Robin | Each server gets weight proportional to capacity | Heterogeneous servers (different CPU/RAM) |
| Random | Pick backend at random | Simple, no coordination, works at scale |
| IP Hash | hash(client_ip) % N → same server per client | Weak sticky sessions, stateful backends |
| URL Hash | hash(url) % N → same server per URL | CDN/cache servers — same content to same node |
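Two of the stateless algorithms above, sketched minimally (backend names and weights are illustrative; real LBs use smoother weighted schemes such as interleaved weighted round robin):

```python
import itertools
from hashlib import blake2b

def weighted_round_robin(backends: dict[str, int]):
    """Weighted round robin: repeat each backend by its weight, then cycle."""
    expanded = [name for name, w in backends.items() for _ in range(w)]
    return itertools.cycle(expanded)

def ip_hash(client_ip: str, backends: list[str]) -> str:
    """IP hash: a stable hash pins each client to one backend (until N changes)."""
    h = int.from_bytes(blake2b(client_ip.encode(), digest_size=8).digest(), "big")
    return backends[h % len(backends)]

rr = weighted_round_robin({"s1": 2, "s2": 1})
print([next(rr) for _ in range(6)])  # s1 appears twice as often as s2
print(ip_hash("203.0.113.7", ["s1", "s2", "s3"]))
```

When N changes, `% N` remaps almost every client, which is why cache-heavy setups prefer consistent hashing (below).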
### Stateful (tracks backend state)

| Algorithm | How | Best For |
|---|---|---|
| Least Connections | Route to backend with fewest active connections | Long-lived connections (WebSocket, file upload) |
| Weighted Least Connections | Least connections + capacity weight | Mixed-capacity fleet with long connections |
| Least Response Time | Route to fastest-responding backend | Latency-sensitive, heterogeneous response times |
| Resource-Based | Route based on CPU/memory reported by agents | Compute-intensive workloads |
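The selection step for the two least-connections variants is a one-liner over the tracked state. A sketch with hypothetical in-memory counters (a real LB increments/decrements these on connection accept/close):

```python
def least_connections(active: dict[str, int]) -> str:
    """Pick the backend with the fewest in-flight connections."""
    return min(active, key=active.get)

def weighted_least_connections(active: dict[str, int], weights: dict[str, int]) -> str:
    """Normalize by capacity weight so a 4x-capacity box may carry 4x connections."""
    return min(active, key=lambda b: active[b] / weights[b])

active = {"s1": 10, "s2": 4, "s3": 7}
print(least_connections(active))                                       # s2
print(weighted_least_connections(active, {"s1": 4, "s2": 1, "s3": 1}))  # s1 (10/4 = 2.5)
```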
### Consistent Hashing

- Backend servers placed on a virtual ring
- Request key hashed → clockwise walk → first node
- Adding/removing a server only remaps K/N keys (not all)
- Used when: session affinity, cache affinity (same key to same cache node)
- Virtual nodes per server → more even distribution

```
Ring: 0 ──────── ServerA ──── ServerB ──── ServerC ──── 2^32
Request hash → clockwise → first server = owner
```
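A minimal ring with virtual nodes, as a sketch (node names, the vnode count, and the hash choice are all assumptions; production rings such as ketama differ in detail):

```python
import bisect
from hashlib import blake2b

def _h(key: str) -> int:
    """Deterministic 64-bit hash for ring positions and request keys."""
    return int.from_bytes(blake2b(key.encode(), digest_size=8).digest(), "big")

class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        # Each physical node gets `vnodes` points on the ring for an even spread.
        self._ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def owner(self, key: str) -> str:
        # Clockwise walk: first ring point at or after the key's hash, wrapping at the end.
        i = bisect.bisect(self._keys, _h(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["A", "B", "C"])
print(ring.owner("user:42"))

# Removing a node remaps only the keys it owned; everything else stays put.
smaller = HashRing(["A", "B"])
moved = sum(ring.owner(f"k{i}") != smaller.owner(f"k{i}") for i in range(1000))
print(f"{moved}/1000 keys moved")  # roughly 1/3, not all 1000
```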
## Stateless vs Stateful Backends

### Stateless Backends (preferred)

- No session data stored on the app server
- Any request can go to any instance → true horizontal scale
- Session state stored externally (Redis, DB)
- LB algorithm: round robin / random — simple and effective

### Stateful Backends (harder to scale)

- Session data lives on a specific server
- Requires sticky sessions (session affinity)
- If that server dies → session lost (unless replicated)
### Sticky Sessions Implementation

| Method | How | Risk |
|---|---|---|
| Cookie-based | LB injects Set-Cookie: SERVERID=s1; routes by cookie | Cookie can be stripped; safe only over HTTPS |
| IP-hash | hash(client_ip) % N → always same server | Breaks with NAT (many users → same IP), CGNAT |
| Consistent hashing | Stable mapping via ring | Node failure remaps only adjacent keys |
Best practice: avoid sticky sessions — move state to Redis instead.
## Health Checks

| Type | How | Use |
|---|---|---|
| TCP check | Open TCP connection to port | L4 LBs; confirms port is listening |
| HTTP check | GET /health → expect 200 | L7 LBs; confirms app is alive |
| Deep check | /health checks DB + cache connectivity | Detects degraded (alive but broken) backends |
| Passive | Monitor error rate on real traffic | Detects degraded performance without extra probes |
- Unhealthy threshold: 2–3 consecutive failures → remove
- Healthy threshold: 2–3 consecutive successes → re-add
- Don't couple /health to flaky dependencies — DB slowness can cascade into the LB removing every backend at once
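The threshold logic above can be sketched as a small state machine (a sketch; the default thresholds mirror the 2–3 values above). Requiring consecutive results in both directions prevents flapping on a single bad or lucky probe:

```python
class HealthTracker:
    """Flip a backend between healthy/unhealthy only after N consecutive results."""

    def __init__(self, unhealthy_after: int = 3, healthy_after: int = 2):
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self.healthy = True
        self._streak = 0  # consecutive results disagreeing with the current state

    def record(self, ok: bool) -> bool:
        if ok == self.healthy:
            self._streak = 0  # probe agrees with current state; reset the streak
        else:
            self._streak += 1
            threshold = self.unhealthy_after if self.healthy else self.healthy_after
            if self._streak >= threshold:
                self.healthy = not self.healthy  # flip state after enough evidence
                self._streak = 0
        return self.healthy

t = HealthTracker()
print([t.record(ok) for ok in [True, False, False, False, True, True]])
# → [True, True, True, False, False, True]
```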
## Rate Limiting at the Load Balancer

### Where to Rate Limit

```
Client → [CDN rate limit] → [LB/API GW rate limit] → [App rate limit] → Backend
```

- CDN layer: block DDoS, per-IP limits, geographic blocks
- LB/API Gateway: per-client token bucket, per-route limits
- App layer: per-user business logic limits (X requests/hour per account)
### Rate Limiting Algorithms

| Algorithm | Behavior | Memory | Use |
|---|---|---|---|
| Token bucket | Refill at rate R; burst up to capacity B | O(1) per key | API gateways — allows controlled bursts |
| Leaky bucket | Queue requests; drain at constant rate | O(queue size) | Traffic shaping, smooth output |
| Fixed window | Counter per time window (1 min, 1 hr) | O(1) per key | Simple; edge spike at window boundary |
| Sliding window log | Log all request timestamps; count in window | O(requests) | Exact; memory-heavy at high QPS |
| Sliding window counter | Weighted interpolation of two fixed windows | O(1) per key | Approximate; production standard |
### Sliding Window Counter (most common)

```
current_window_count = prev_window_count × (1 - elapsed/window) + current_count

Example: window=60s, elapsed=45s, prev=80, curr=30
→ rate = 80 × (1 - 45/60) + 30 = 80 × 0.25 + 30 = 50 requests in window
```
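The interpolation is one line of arithmetic; this sketch reproduces the worked example:

```python
def sliding_window_count(prev: int, curr: int, elapsed: float, window: float) -> float:
    """Estimate requests in the sliding window by weighting the previous fixed
    window by the fraction of it that still overlaps the sliding window."""
    return prev * (1 - elapsed / window) + curr

print(sliding_window_count(prev=80, curr=30, elapsed=45, window=60))  # 50.0
```

The estimate assumes requests in the previous window were evenly spread, which is why the algorithm is approximate.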
### Distributed Rate Limiting

- Single server: in-process counter (fastest)
- Multi-node: Redis atomic counter (INCR + EXPIRE) or a Redis Lua script for atomicity
- Tradeoff: Redis adds ~1–2ms per check; acceptable for most API GWs
```lua
-- Redis atomic rate limit check (Lua): KEYS[1] = counter key, ARGV[1] = window seconds
-- (Redis Lua scripts receive inputs via KEYS/ARGV; plain globals are not allowed)
local count = redis.call('INCR', KEYS[1])
if count == 1 then redis.call('EXPIRE', KEYS[1], ARGV[1]) end
return count
```
## Connection Handling

### Connection Pooling (L7)
- LB maintains persistent connections to backends (keep-alive)
- Client opens new connection → LB reuses existing backend connection
- Avoids TCP handshake overhead per request
- Critical for DB proxies (PgBouncer, ProxySQL)
### SSL/TLS Termination

```
Client ──[TLS]──► LB ──[plain HTTP or re-encrypted]──► Backend
```

- Option A: Terminate at LB → plain HTTP to backend (faster, less secure internally)
- Option B: Terminate at LB → re-encrypt to backend (mTLS) — zero-trust
- Option C: TLS passthrough → backend handles TLS (L4 only, client cert auth)
### HTTP/2 and gRPC
- L7 LBs must support HTTP/2 to load balance gRPC (stream-level, not connection-level)
- HTTP/2 multiplexes many requests on one connection → naive L4 sends all to one backend
- Envoy/Nginx with HTTP/2 do proper per-request (stream) load balancing
## Global Load Balancing (GeoDNS / Anycast)

### DNS-Based (GeoDNS)
- DNS resolver returns different IPs based on client location
- Route US users → US region, EU users → EU region
- TTL = 30–60s; failover is slow (TTL must expire)
- Used by: AWS Route53 latency routing, Cloudflare, Akamai
### Anycast
- Same IP announced from multiple locations via BGP
- Network routing sends client to nearest PoP automatically
- Instant failover (BGP reconverges in ~seconds)
- Used by: Cloudflare (1.1.1.1), Google (8.8.8.8), CDN PoPs
### Active-Passive vs Active-Active

| Mode | Write | Read | Failover |
|---|---|---|---|
| Active-Passive | Primary only | Primary only | Promote passive (~30–60s) |
| Active-Active | Both regions | Both regions | Instant (no failover needed) |
| Active-Active w/ conflict | Both | Both | Requires CRDT / last-write-wins |
## Service Mesh (L7 in Sidecar)
- LB logic moves into a sidecar proxy (Envoy) next to every service instance
- No centralized LB — each pod has its own proxy
- Features: mTLS between services, retries, circuit breaking, distributed tracing, traffic splitting
```
Service A → Envoy sidecar → [mTLS] → Envoy sidecar → Service B
```

Used by: Istio, Linkerd, AWS App Mesh, Consul Connect
## Key Numbers

| Metric | Typical Value |
|---|---|
| L4 LB max connections | 1–10M concurrent (NLB) |
| L7 LB max RPS | 100k–1M RPS (ALB, Nginx) |
| Health check interval | 5–30s |
| Unhealthy threshold | 2–3 failures |
| Connection timeout | 30–60s (idle) |
| Rate limit Redis check overhead | 1–2ms |
| DNS TTL for failover | 30–60s |
| Anycast BGP failover | ~seconds |
## Summary: Pick the Right LB

| Need | Choice |
|---|---|
| HTTP microservices, TLS, routing | L7 (ALB / Nginx / Envoy) |
| Raw TCP, DB proxy, ultra-low latency | L4 (NLB / HAProxy TCP) |
| Stateful app, can't move to Redis | Sticky session via cookie |
| Cache/session locality | Consistent hashing |
| Heterogeneous fleet | Weighted round robin |
| Long-lived connections (WS, upload) | Least connections |
| Global traffic routing | GeoDNS + Anycast |
| Service-to-service (k8s) | Service mesh (Envoy sidecar) |
| Rate limiting distributed API | Sliding window counter + Redis |