Load Balancing
The traffic cop of the internet. How to distribute millions of requests across thousands of servers.
A load balancer is a reverse proxy that distributes incoming network traffic across multiple backend servers. Without one, a single server must handle all traffic — and when it fails, everything fails. With a load balancer, you get:
- Availability: If one server dies, the LB routes traffic to healthy servers. Users never see the failure.
- Scalability: Add more servers behind the LB to handle more traffic. No configuration changes needed for clients.
- Performance: Traffic is spread evenly, preventing any single server from becoming a hotspot.
- Abstraction: Clients connect to one IP address (the LB's VIP). They don't know or care how many servers exist behind it.
A load balancer can be hardware (F5 BIG-IP, Citrix ADC — expensive, high-performance appliances) or software (Nginx, HAProxy, Envoy, cloud-managed ALB/NLB). In the cloud era, software load balancers dominate.
Load balancers operate at different layers of the OSI model. The choice between L4 and L7 fundamentally affects what routing decisions are possible, performance, and complexity.
Layer 4 (Transport)
Operates at the TCP/UDP level. Routes based on source/destination IP and port numbers. Does NOT inspect packet payload — it can't see HTTP headers, URLs, or cookies.
- ✅ Extremely fast — microsecond-scale decisions, millions of connections/sec
- ✅ Protocol-agnostic (works for HTTP, gRPC, WebSocket, databases, anything TCP)
- ✅ Simpler to operate and debug
- ❌ Can't route by URL path, header, or cookie
- ❌ Can't perform connection multiplexing or caching
Products: AWS NLB, GCP TCP/UDP LB, HAProxy (TCP mode), LVS, IPVS
Layer 7 (Application)
Operates at the HTTP/HTTPS level. Can inspect and route based on URL paths, HTTP headers, cookies, query parameters.
- ✅ Content-based routing (/api → backend, /static → CDN)
- ✅ Can modify headers (add X-Request-ID, X-Forwarded-For)
- ✅ Can perform TLS termination, connection pooling, caching
- ✅ Supports A/B testing and canary routing via headers
- ❌ Higher latency (must parse HTTP), more CPU-intensive
- ❌ Only works for HTTP/HTTPS (some support gRPC, WebSocket)
Products: AWS ALB, GCP HTTPS LB, Nginx, HAProxy (HTTP mode), Envoy, Traefik
When to Use Which?
- Use L4 for: raw TCP throughput, database connection pooling (e.g. fronting PgBouncer), internal service-to-service traffic where routing by path isn't needed.
- Use L7 for: public-facing HTTP APIs, microservice routing, TLS termination, content-based traffic splitting (canary, A/B).
- Many architectures use both: an L4 NLB at the edge for DDoS absorption, forwarding to L7 Envoy/Nginx proxies for application routing.
The algorithm determines which backend server receives each request. Choosing the wrong one leads to uneven load distribution, hotspots, and poor utilization.
| Algorithm | Mechanism | Pros | Cons |
|---|---|---|---|
| Round Robin | 1 → 2 → 3 → 1 → 2 → 3... | Simple, zero overhead, O(1) | Ignores server load, slow servers get equal traffic |
| Weighted RR | Server A gets 3x, B gets 1x | Accounts for hardware differences | Static weights, doesn't adapt to runtime load |
| Least Connections | Route to server with fewest active connections | Adapts to variable request duration | Requires connection tracking, thundering herd to "least loaded" |
| Least Response Time | Route to server with fastest avg response | Optimizes for latency, not just fairness | Needs latency tracking, can oscillate |
| IP Hash | hash(client_ip) % N servers | Deterministic, good for session affinity | Adding/removing servers remaps most clients |
| Consistent Hashing | Hash ring with virtual nodes | Only K/N keys remap when N changes | More complex, needs virtual nodes for balance |
| Power of Two | Pick 2 random, choose lighter one | Avoids herd to "least loaded", O(1) | Slightly less optimal than global least-conn |
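The power-of-two rule from the table is simple enough to simulate. A hedged sketch in Python (server names and request counts are made up for illustration):

```python
import random

def pick_backend(active_conns: dict[str, int], rng: random.Random) -> str:
    """Power-of-two-choices: sample two backends at random and
    route to the one with fewer active connections."""
    a, b = rng.sample(list(active_conns), 2)
    return a if active_conns[a] <= active_conns[b] else b

# Simulate assigning 10,000 requests across 10 servers.
rng = random.Random(42)
conns = {f"srv{i}": 0 for i in range(10)}
for _ in range(10_000):
    chosen = pick_backend(conns, rng)
    conns[chosen] += 1
# The spread between busiest and idlest server stays tight,
# without tracking global least-connections state.
```

Because each decision compares only two sampled servers, there is no herd toward a single "least loaded" backend, yet the load gap stays within a small constant.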
Consistent Hashing: The Key to Elastic Scaling
With simple hash(key) % N, adding or removing a server remaps nearly every key. If you have 100 servers and add one, ~99% of clients get remapped — a catastrophe for cached sessions. Consistent hashing solves this: only about K/N keys (where K = total keys, N = number of servers) are remapped when a server is added or removed.
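This fragility is easy to demonstrate. A small Python experiment (key names and fleet sizes are illustrative) shows that growing a 100-server fleet by one remaps almost every key:

```python
import hashlib

def bucket(key: str, n: int) -> int:
    # Use a stable hash; Python's built-in hash() is salted per process.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"client-{i}" for i in range(10_000)]
before = {k: bucket(k, 100) for k in keys}   # 100 servers
after = {k: bucket(k, 101) for k in keys}    # add one server
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys remapped")  # ~99%
```

A key keeps its bucket only when its hash happens to land on the same server modulo both 100 and 101, which is roughly a 1-in-101 chance.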
The hash ring works by placing both servers and request keys on a circular hash space (0 to 2^32 − 1). Each request is routed to the nearest server clockwise on the ring. To ensure even distribution, each physical server is mapped to multiple virtual nodes (vnodes) on the ring — typically 100-200 per server.
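A minimal Python sketch of such a ring, assuming MD5 for hash points and 150 vnodes per server (both illustrative choices, not prescribed by the article):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (minimal sketch)."""

    def __init__(self, servers: list[str], vnodes: int = 150):
        self.vnodes = vnodes
        self.ring: list[tuple[int, str]] = []  # sorted (point, server)
        for s in servers:
            self.add(s)

    def _point(self, label: str) -> int:
        # Map a label onto the 0..2^32-1 circular hash space.
        return int(hashlib.md5(label.encode()).hexdigest(), 16) % (2**32)

    def add(self, server: str) -> None:
        # Place `vnodes` virtual nodes for this server on the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._point(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self.ring = [(p, s) for p, s in self.ring if s != server]

    def get(self, key: str) -> str:
        # Walk clockwise to the first vnode at or past the key's point.
        i = bisect.bisect(self.ring, (self._point(key),))
        return self.ring[i % len(self.ring)][1]  # wrap around the ring
```

Adding a fourth server to a three-server ring moves only roughly a quarter of the keys (the K/N bound above), versus ~99% with modulo hashing.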
Try It: Hashing Algorithms
Visualize the consistent hash ring, add and remove servers, and see how few keys are remapped.
HTTPS requires TLS encryption, which is CPU-intensive. Every connection requires a TLS handshake (key exchange, certificate verification), and every byte must be encrypted/decrypted. TLS termination moves this work from your application servers to the load balancer.
TLS Termination at LB
Client → LB is HTTPS (encrypted). LB → Backend is HTTP (plaintext). The LB handles all certificate management and TLS handshakes, relieving backend servers of crypto overhead. This is the most common pattern when the network between the LB and the backends is trusted.
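As a minimal sketch of this pattern (certificate paths, addresses, and pool name are illustrative, not from the article), an Nginx server block that terminates TLS and proxies plaintext HTTP to a backend pool might look like:

```nginx
upstream backend_pool {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}

server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/certs/example.pem;   # public cert
    ssl_certificate_key /etc/nginx/certs/example.key;

    location / {
        proxy_pass http://backend_pool;   # plaintext HTTP to backends
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
```

Note the X-Forwarded-* headers: once the LB terminates TLS, backends need them to recover the original client IP and protocol.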
TLS Passthrough
The LB forwards the encrypted connection directly to the backend without decrypting. Used when the backend must see the original TLS certificate (mutual TLS, mTLS). The LB can only do L4 routing since it can't read the encrypted HTTP content.
TLS Re-encryption (End-to-End)
Client → LB is HTTPS (encrypted with public cert). LB decrypts, inspects, routes, then re-encrypts LB → Backend with an internal cert. Provides both L7 routing and encrypted backend traffic. Used in high-security environments (financial, healthcare).
Explore: TLS & Security
Step through the TLS handshake, certificate validation, and understand how HTTPS secures traffic.
A load balancer periodically probes each backend to determine if it can receive traffic. Getting health checks wrong is one of the most common causes of outages.
Probe Types
| Type | How | What It Detects |
|---|---|---|
| TCP Check | Attempt TCP connection to port | Process is running and accepting connections |
| HTTP Check | GET /health → expect 200 | Application is responsive, basic function works |
| Deep Health | GET /ready → checks DB, cache | All dependencies are reachable and the server can serve real traffic |
If your health check queries the database, and the database is temporarily slow, all your servers will fail their health checks simultaneously. The LB removes all backends, causing a total outage even though your servers are healthy. Solution: Use TCP or shallow HTTP checks for liveness, and deep checks for readiness only. Never let a dependency failure cascade into a full fleet removal.
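The liveness/readiness split can be sketched in Python (handler names and signatures are illustrative, not a real framework API):

```python
# Sketch of the liveness vs readiness split described above.

def liveness() -> int:
    """Shallow check: the process is up and can answer HTTP.
    Deliberately makes no dependency calls, so a slow database
    can never fail this probe fleet-wide."""
    return 200

def readiness(db_ok: bool, cache_ok: bool) -> int:
    """Deep check: report ready only when dependencies are reachable.
    Use this to gate new traffic, not to eject live servers."""
    return 200 if (db_ok and cache_ok) else 503
```

With this split, a database outage makes every server "not ready" but still "live", so the fleet degrades instead of being removed wholesale.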
Configuration Best Practices
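A hedged example in HAProxy syntax (server names, addresses, and thresholds are illustrative): a shallow HTTP liveness check with conservative failure thresholds, so a single failed probe doesn't eject a server:

```haproxy
backend web_pool
    balance leastconn
    option httpchk GET /health              # shallow check, no dependency calls
    http-check expect status 200
    default-server inter 3s fall 3 rise 2   # probe every 3s; eject after 3 failures, restore after 2 passes
    server web1 10.0.0.11:8080 check
    server web2 10.0.0.12:8080 check
```

The fall/rise thresholds add hysteresis: a server must fail repeatedly before removal and pass repeatedly before reinstatement, which dampens flapping.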
When your application runs in multiple regions (us-east, eu-west, ap-south), you need a way to route users to the nearest datacenter. Global Server Load Balancing (GSLB) operates at the DNS level — the user's DNS resolver receives the IP of the closest healthy region.
Geolocation Routing
Route by the user's geographic location. European users go to eu-west, Asian users to ap-south. Simple and effective for most applications.
Latency-Based Routing
Route to the region with the lowest measured latency. More precise than geolocation because it accounts for network topology, not just physical distance.
Failover Routing
Active-passive: all traffic goes to the primary region. If health checks fail, DNS automatically switches to the secondary; failback to the primary is manual or timed.
Anycast is an alternative to DNS-based GSLB: the same IP address is announced from multiple locations via BGP. The internet's routing infrastructure automatically sends packets to the nearest announcement point. Used by CDNs (Cloudflare, Google) and critical infrastructure (DNS root servers).
Explore: Global Networking
See how DNS, BGP, and Anycast work together to route users to the nearest datacenter globally.
If the load balancer itself fails, everything is down. Making the load balancer highly available is critical.
Active-Passive (VRRP/Keepalived)
Two LB instances share a Virtual IP (VIP). The active instance handles all traffic. If it fails, the passive instance detects the failure via heartbeat and takes over the VIP within seconds. Used with HAProxy and Nginx on bare metal.
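An illustrative keepalived fragment for this setup (interface, router ID, priority, and VIP are all made-up values) might look like:

```
vrrp_instance VI_1 {
    state MASTER            # this node starts as the active instance
    interface eth0
    virtual_router_id 51
    priority 150            # the passive peer uses a lower priority
    advert_int 1            # heartbeat interval in seconds
    virtual_ipaddress {
        10.0.0.100/24       # the shared VIP clients connect to
    }
}
```

The passive node runs the same config with `state BACKUP` and a lower priority; when heartbeats stop arriving, it claims the VIP.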
Active-Active
Multiple LB instances handle traffic simultaneously. DNS round-robin or Anycast distributes across them. Higher throughput but requires shared state for features like session affinity and rate limiting.
Cloud-Managed (AWS ALB/NLB, GCP LB)
The cloud provider manages HA, scaling, and TLS termination. AWS NLB can handle millions of connections per second. You pay per hour + per GB processed. This is the modern default — don't run your own LB unless you have a specific reason.
Case Study: GitHub's LB Migration
GitHub migrated from hardware F5 load balancers to software-based GLB (GitHub Load Balancer) running on commodity servers. GLB uses ECMP (Equal-Cost Multi-Path) routing with consistent hashing at L4, then forwards to HAProxy instances for L7 routing. The migration enabled them to handle 10x more traffic at 1/10th the cost, and eliminated the F5 as a single point of failure.
Takeaway: Software LBs running on commodity hardware have surpassed hardware LBs in both cost and flexibility. The key architecture is L4 (ECMP/IPVS) → L7 (HAProxy/Envoy) two-tier design.
Case Study: Cloudflare's Unimog
Cloudflare's global L4 load balancer, Unimog, uses XDP/eBPF in the Linux kernel to achieve line-rate packet processing (100 Gbps+) without context switches. Packets are steered between servers within the same datacenter using a custom encapsulation protocol. This programmable data plane approach processes billions of packets per second across their 300+ PoPs.
Takeaway: At internet scale, the load balancer must be in the kernel. XDP/eBPF enables custom packet processing at wire speed without user-space overhead.
Case Study: Google's Maglev
Google's Maglev is a kernel-bypass network load balancer that handles Google's entire public traffic (Search, YouTube, Gmail). It uses a consistent hashing algorithm (Maglev hashing) that guarantees minimal disruption when backends change. Maglev runs on standard Linux servers, achieves 10M+ packets per second per machine, and is deployed in every Google PoP worldwide.
Takeaway: Google published the Maglev paper (NSDI 2016) showing that software LBs can match hardware performance. The Maglev hashing algorithm is now used in Envoy, Cilium, and other open-source projects.
- Maglev: A Fast and Reliable Software Network Load Balancer — Google (NSDI 2016) — The paper that proved software LBs can handle Google-scale traffic on commodity hardware.
- Unimog: Cloudflare's Edge Load Balancer — How XDP/eBPF enables line-rate L4 load balancing.
- GLB Director — GitHub Engineering — GitHub's open-source L4 load balancer using ECMP and consistent hashing.
- Designing Data-Intensive Applications by Martin Kleppmann — Chapter 5 (Replication) and Chapter 6 (Partitioning) cover the theory behind consistent hashing. (O'Reilly, 2017)
- HAProxy Configuration Basics — Official guide to configuring HAProxy algorithms and health checks.
- Envoy Load Balancing Docs — Envoy's implementation of Round Robin, Least Request, Ring Hash, and Maglev.
- The Practice of Cloud System Administration by Limoncelli, Hogan, Chalup — Chapters on load balancing and service reliability patterns. (Addison-Wesley, 2014)