Load Balancing
The traffic cop of the internet. How to distribute millions of requests across thousands of servers.
A load balancer is a reverse proxy that distributes incoming network traffic across multiple backend servers. Without one, a single server must handle all traffic — and when it fails, everything fails. With a load balancer, you get:
- Availability: If one server dies, the LB routes traffic to healthy servers. Users never see the failure.
- Scalability: Add more servers behind the LB to handle more traffic. No configuration changes needed for clients.
- Performance: Traffic is spread evenly, preventing any single server from becoming a hotspot.
- Abstraction: Clients connect to one IP address (the LB's VIP). They don't know or care how many servers exist behind it.
A load balancer can be hardware (F5 BIG-IP, Citrix ADC — expensive, high-performance appliances) or software (Nginx, HAProxy, Envoy, cloud-managed ALB/NLB). In the cloud era, software load balancers dominate.
Load balancers operate at different layers of the OSI model. The choice between L4 and L7 fundamentally affects what routing decisions are possible, performance, and complexity.
Layer 4 (Transport)
Operates at the TCP/UDP level. Routes based on source/destination IP and port numbers. Does NOT inspect packet payload — it can't see HTTP headers, URLs, or cookies.
- ✅ Extremely fast — microsecond-scale decisions, millions of connections/sec
- ✅ Protocol-agnostic (works for HTTP, gRPC, WebSocket, databases, anything TCP)
- ✅ Simpler to operate and debug
- ❌ Can't route by URL path, header, or cookie
- ❌ Can't perform connection multiplexing or caching
Products: AWS NLB, GCP TCP/UDP LB, HAProxy (TCP mode), LVS, IPVS
Layer 7 (Application)
Operates at the HTTP/HTTPS level. Can inspect and route based on URL paths, HTTP headers, cookies, query parameters.
- ✅ Content-based routing (/api → backend, /static → CDN)
- ✅ Can modify headers (add X-Request-ID, X-Forwarded-For)
- ✅ Can perform TLS termination, connection pooling, caching
- ✅ Supports A/B testing and canary routing via headers
- ❌ Higher latency (must parse HTTP), more CPU-intensive
- ❌ Only works for HTTP/HTTPS (some support gRPC, WebSocket)
Products: AWS ALB, GCP HTTPS LB, Nginx, HAProxy (HTTP mode), Envoy, Traefik
When to Use Which?
- Use L4 for: raw TCP throughput, database connection pooling (e.g. fronting PgBouncer), internal service-to-service traffic where routing by path isn't needed.
- Use L7 for: public-facing HTTP APIs, microservice routing, TLS termination, content-based traffic splitting (canary, A/B).
- Many architectures use both: an L4 NLB at the edge for DDoS absorption, forwarding to L7 Envoy/Nginx proxies for application routing.
The algorithm determines which backend server receives each request. Choosing the wrong one leads to uneven load distribution, hotspots, and poor utilization.
| Algorithm | Mechanism | Pros | Cons |
|---|---|---|---|
| Round Robin | 1 → 2 → 3 → 1 → 2 → 3... | Simple, zero overhead, O(1) | Ignores server load, slow servers get equal traffic |
| Weighted RR | Server A gets 3x, B gets 1x | Accounts for hardware differences | Static weights, doesn't adapt to runtime load |
| Least Connections | Route to server with fewest active connections | Adapts to variable request duration | Requires connection tracking, thundering herd to "least loaded" |
| Least Response Time | Route to server with fastest avg response | Optimizes for latency, not just fairness | Needs latency tracking, can oscillate |
| IP Hash | hash(client_ip) % N servers | Deterministic, good for session affinity | Adding/removing servers remaps most clients |
| Consistent Hashing | Hash ring with virtual nodes | Only K/N keys remap when N changes | More complex, needs virtual nodes for balance |
| Power of Two | Pick 2 random, choose lighter one | Avoids herd to "least loaded", O(1) | Slightly less optimal than global least-conn |
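The power-of-two rule from the table is simple enough to simulate. A hedged sketch in Python (server names and request counts are made up for illustration):

```python
import random

def pick_backend(active_conns: dict[str, int], rng: random.Random) -> str:
    """Power-of-two-choices: sample two backends at random and
    route to the one with fewer active connections."""
    a, b = rng.sample(list(active_conns), 2)
    return a if active_conns[a] <= active_conns[b] else b

# Simulate assigning 10,000 requests across 10 servers.
rng = random.Random(42)
conns = {f"srv{i}": 0 for i in range(10)}
for _ in range(10_000):
    chosen = pick_backend(conns, rng)
    conns[chosen] += 1
# The spread between busiest and idlest server stays tight,
# without tracking global least-connections state.
```

Because each decision compares only two sampled servers, there is no herd toward a single "least loaded" backend, yet the load gap stays within a small constant.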
Consistent Hashing: The Key to Elastic Scaling
With simple hash(key) % N, adding or removing a server remaps nearly every key. If you have 100 servers and add one, ~99% of clients get remapped — a catastrophe for cached sessions. Consistent hashing solves this: only about K/N keys (where K = total keys, N = number of servers) are remapped when a server is added or removed.
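This fragility is easy to demonstrate. A small Python experiment (key names and fleet sizes are illustrative) shows that growing a 100-server fleet by one remaps almost every key:

```python
import hashlib

def bucket(key: str, n: int) -> int:
    # Use a stable hash; Python's built-in hash() is salted per process.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"client-{i}" for i in range(10_000)]
before = {k: bucket(k, 100) for k in keys}   # 100 servers
after = {k: bucket(k, 101) for k in keys}    # add one server
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys remapped")  # ~99%
```

A key keeps its bucket only when its hash happens to land on the same server modulo both 100 and 101, which is roughly a 1-in-101 chance.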
The hash ring works by placing both servers and request keys on a circular hash space (0 to 2^32 − 1). Each request is routed to the nearest server clockwise on the ring. To ensure even distribution, each physical server is mapped to multiple virtual nodes (vnodes) on the ring — typically 100-200 per server.
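A minimal Python sketch of such a ring, assuming MD5 for hash points and 150 vnodes per server (both illustrative choices, not prescribed by the article):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (minimal sketch)."""

    def __init__(self, servers: list[str], vnodes: int = 150):
        self.vnodes = vnodes
        self.ring: list[tuple[int, str]] = []  # sorted (point, server)
        for s in servers:
            self.add(s)

    def _point(self, label: str) -> int:
        # Map a label onto the 0..2^32-1 circular hash space.
        return int(hashlib.md5(label.encode()).hexdigest(), 16) % (2**32)

    def add(self, server: str) -> None:
        # Place `vnodes` virtual nodes for this server on the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._point(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self.ring = [(p, s) for p, s in self.ring if s != server]

    def get(self, key: str) -> str:
        # Walk clockwise to the first vnode at or past the key's point.
        i = bisect.bisect(self.ring, (self._point(key),))
        return self.ring[i % len(self.ring)][1]  # wrap around the ring
```

Adding a fourth server to a three-server ring moves only roughly a quarter of the keys (the K/N bound above), versus ~99% with modulo hashing.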
Try It: Hashing Algorithms
Visualize the consistent hash ring, add and remove servers, and see how few keys are remapped.
HTTPS requires TLS encryption, which is CPU-intensive. Every connection requires a TLS handshake (key exchange, certificate verification), and every byte must be encrypted/decrypted. TLS termination moves this work from your application servers to the load balancer.
TLS Termination at LB
Client → LB is HTTPS (encrypted). LB → Backend is HTTP (plaintext). The LB handles all certificate management and TLS handshakes, relieving backend servers of crypto overhead. This is the most common pattern when the network between the LB and the backends is trusted.
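As a minimal sketch of this pattern (certificate paths, addresses, and pool name are illustrative, not from the article), an Nginx server block that terminates TLS and proxies plaintext HTTP to a backend pool might look like:

```nginx
upstream backend_pool {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}

server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/certs/example.pem;   # public cert
    ssl_certificate_key /etc/nginx/certs/example.key;

    location / {
        proxy_pass http://backend_pool;   # plaintext HTTP to backends
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
```

Note the X-Forwarded-* headers: once the LB terminates TLS, backends need them to recover the original client IP and protocol.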
TLS Passthrough
The LB forwards the encrypted connection directly to the backend without decrypting. Used when the backend must see the original TLS certificate (mutual TLS, mTLS). The LB can only do L4 routing since it can't read the encrypted HTTP content.
TLS Re-encryption (End-to-End)
Client → LB is HTTPS (encrypted with public cert). LB decrypts, inspects, routes, then re-encrypts LB → Backend with an internal cert. Provides both L7 routing and encrypted backend traffic. Used in high-security environments (financial, healthcare).
Explore: TLS & Security
Step through the TLS handshake, certificate validation, and understand how HTTPS secures traffic.
A load balancer periodically probes each backend to determine if it can receive traffic. Getting health checks wrong is one of the most common causes of outages.
Probe Types
| Type | How | What It Detects |
|---|---|---|
| TCP Check | Attempt TCP connection to port | Process is running and accepting connections |
| HTTP Check | GET /health → expect 200 | Application is responsive, basic function works |
| Deep Health | GET /ready → checks DB, cache | All dependencies are reachable and the server can serve real traffic |
If your health check queries the database, and the database is temporarily slow, all your servers will fail their health checks simultaneously. The LB removes all backends, causing a total outage even though your servers are healthy. Solution: Use TCP or shallow HTTP checks for liveness, and deep checks for readiness only. Never let a dependency failure cascade into a full fleet removal.
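The liveness/readiness split can be sketched in Python (handler names and signatures are illustrative, not a real framework API):

```python
# Sketch of the liveness vs readiness split described above.

def liveness() -> int:
    """Shallow check: the process is up and can answer HTTP.
    Deliberately makes no dependency calls, so a slow database
    can never fail this probe fleet-wide."""
    return 200

def readiness(db_ok: bool, cache_ok: bool) -> int:
    """Deep check: report ready only when dependencies are reachable.
    Use this to gate new traffic, not to eject live servers."""
    return 200 if (db_ok and cache_ok) else 503
```

With this split, a database outage makes every server "not ready" but still "live", so the fleet degrades instead of being removed wholesale.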
Configuration Best Practices
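A hedged example in HAProxy syntax (server names, addresses, and thresholds are illustrative): a shallow HTTP liveness check with conservative failure thresholds, so a single failed probe doesn't eject a server:

```haproxy
backend web_pool
    balance leastconn
    option httpchk GET /health              # shallow check, no dependency calls
    http-check expect status 200
    default-server inter 3s fall 3 rise 2   # probe every 3s; eject after 3 failures, restore after 2 passes
    server web1 10.0.0.11:8080 check
    server web2 10.0.0.12:8080 check
```

The fall/rise thresholds add hysteresis: a server must fail repeatedly before removal and pass repeatedly before reinstatement, which dampens flapping.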
When your application runs in multiple regions (us-east, eu-west, ap-south), you need a way to route users to the nearest datacenter. Global Server Load Balancing (GSLB) operates at the DNS level — the user's DNS resolver receives the IP of the closest healthy region.
Geolocation Routing
Route by the user's geographic location. European users go to eu-west, Asian users to ap-south. Simple and effective for most applications.
Latency-Based Routing
Route to the region with the lowest measured latency. More precise than geolocation because it accounts for network topology, not just physical distance.
Failover Routing
Active-passive: all traffic goes to the primary region. If health checks fail, DNS automatically switches to the secondary; failback to the primary is manual or timed.
Anycast is an alternative to DNS-based GSLB: the same IP address is announced from multiple locations via BGP. The internet's routing infrastructure automatically sends packets to the nearest announcement point. Used by CDNs (Cloudflare, Google) and critical infrastructure (DNS root servers).
Explore: Global Networking
See how DNS, BGP, and Anycast work together to route users to the nearest datacenter globally.
If the load balancer itself fails, everything is down. Making the load balancer highly available is critical.
Active-Passive (VRRP/Keepalived)
Two LB instances share a Virtual IP (VIP). The active instance handles all traffic. If it fails, the passive instance detects the failure via heartbeat and takes over the VIP within seconds. Used with HAProxy and Nginx on bare metal.
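An illustrative keepalived fragment for this setup (interface, router ID, priority, and VIP are all made-up values) might look like:

```
vrrp_instance VI_1 {
    state MASTER            # this node starts as the active instance
    interface eth0
    virtual_router_id 51
    priority 150            # the passive peer uses a lower priority
    advert_int 1            # heartbeat interval in seconds
    virtual_ipaddress {
        10.0.0.100/24       # the shared VIP clients connect to
    }
}
```

The passive node runs the same config with `state BACKUP` and a lower priority; when heartbeats stop arriving, it claims the VIP.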
Active-Active
Multiple LB instances handle traffic simultaneously. DNS round-robin or Anycast distributes across them. Higher throughput but requires shared state for features like session affinity and rate limiting.
Cloud-Managed (AWS ALB/NLB, GCP LB)
The cloud provider manages HA, scaling, and TLS termination. AWS NLB can handle millions of connections per second. You pay per hour + per GB processed. This is the modern default — don't run your own LB unless you have a specific reason.
Case Study: GitHub's LB Migration
GitHub migrated from hardware F5 load balancers to software-based GLB (GitHub Load Balancer) running on commodity servers. GLB uses ECMP (Equal-Cost Multi-Path) routing with consistent hashing at L4, then forwards to HAProxy instances for L7 routing. The migration enabled them to handle 10x more traffic at 1/10th the cost, and eliminated the F5 as a single point of failure.
Takeaway: Software LBs running on commodity hardware have surpassed hardware LBs in both cost and flexibility. The key architecture is L4 (ECMP/IPVS) → L7 (HAProxy/Envoy) two-tier design.
Case Study: Cloudflare's Unimog
Cloudflare's global L4 load balancer, Unimog, uses XDP/eBPF in the Linux kernel to achieve line-rate packet processing (100 Gbps+) without context switches. Packets are steered between servers within the same datacenter using a custom encapsulation protocol. This programmable data plane approach processes billions of packets per second across their 300+ PoPs.
Takeaway: At internet scale, the load balancer must be in the kernel. XDP/eBPF enables custom packet processing at wire speed without user-space overhead.
Case Study: Google's Maglev
Google's Maglev is a kernel-bypass network load balancer that handles Google's entire public traffic (Search, YouTube, Gmail). It uses a consistent hashing algorithm (Maglev hashing) that guarantees minimal disruption when backends change. Maglev runs on standard Linux servers, achieves 10M+ packets per second per machine, and is deployed in every Google PoP worldwide.
Takeaway: Google published the Maglev paper (NSDI 2016) showing that software LBs can match hardware performance. The Maglev hashing algorithm is now used in Envoy, Cilium, and other open-source projects.
- Maglev: A Fast and Reliable Software Network Load Balancer — Google (NSDI 2016) — The paper that proved software LBs can handle Google-scale traffic on commodity hardware.
- Unimog: Cloudflare's Edge Load Balancer — How XDP/eBPF enables line-rate L4 load balancing.
- GLB Director — GitHub Engineering — GitHub's open-source L4 load balancer using ECMP and consistent hashing.
- Designing Data-Intensive Applications by Martin Kleppmann — Chapter 5 (Replication) and Chapter 6 (Partitioning) cover the theory behind consistent hashing. (O'Reilly, 2017)
- HAProxy Configuration Basics — Official guide to configuring HAProxy algorithms and health checks.
- Envoy Load Balancing Docs — Envoy's implementation of Round Robin, Least Request, Ring Hash, and Maglev.
- The Practice of Cloud System Administration by Limoncelli, Hogan, Chalup — Chapters on load balancing and service reliability patterns. (Addison-Wesley, 2014)