Load Balancer Simulator
Visualize how different load balancing algorithms distribute traffic across servers. Runs in the browser.
How to Use
Choose an algorithm and generate traffic. You will see:
- Real-time request routing
- Server load distribution
- Algorithm behavior (e.g. Round Robin)
- The server pool, distribution stats, and a request log of routing decisions
The Definitive Guide to Load Balancing
Load balancing is the process of efficiently distributing incoming network traffic across a group of backend servers, also known as a server farm or server pool. In modern highly-available (HA) systems, the load balancer is the front door to your infrastructure. Without it, scaling beyond a single server is practically impossible.
As internet traffic scales from hundreds to millions of concurrent users, a single application server simply cannot keep up. A load balancer acts as the "traffic cop" sitting in front of your servers: it routes client requests across every server capable of fulfilling them, maximizing speed and capacity utilization while ensuring that no single server is overworked to the point of degraded performance. If a server goes down, the load balancer redirects traffic to the remaining online servers. When a new server is added to the group, the load balancer automatically starts sending requests to it.
1. Layer 4 vs. Layer 7 Load Balancing
Load balancers operate at different levels of the OSI (Open Systems Interconnection) reference model. The two most prominent types are Layer 4 (Transport) and Layer 7 (Application) load balancers. Choosing between them is one of the most critical architectural decisions when designing your traffic entrypoint.
Layer 4 (Transport Level)
Layer 4 load balancers act upon data found in network and transport layer protocols (IP, TCP, UDP). They are oblivious to the actual contents of the messages.
- Routing based on: Source/Destination IP and Port.
- Speed: Extremely fast. No decrypting/parsing required.
- State: Maintains TCP connection state tracking.
- Drawback: Cannot make smart decisions based on URL paths, HTTP headers, or cookies.
- Examples: AWS Network Load Balancer (NLB), HAProxy (TCP mode), Linux Virtual Server (IPVS).
Layer 7 (Application Level)
Layer 7 load balancers distribute requests based on data found in application layer protocols such as HTTP. They inspect the actual payload of the traffic.
- Routing based on: HTTP URIs, HTTP Headers, Cookies, Queries.
- Speed: CPU-intensive. Requires SSL decryption and HTTP parsing.
- State: Terminates the client connection and opens a new connection to the backend.
- Drawback: Slower packet processing compared to L4.
- Examples: AWS Application Load Balancer (ALB), NGINX, Envoy, HAProxy (HTTP mode).
Deep Dive: TCP Termination. In a Layer 4 LB (like Direct Routing or NAT), the TCP connection might actually be established directly between the client and the backend server (the LB just forwards packets). In a Layer 7 LB, the connection is strictly terminated at the LB. The client performs the TCP handshake with the LB, and the LB performs a separate TCP handshake with the selected backend. This allows the L7 LB to keep a pool of persistent, pre-warmed TCP connections natively multiplexed to backends, vastly reducing latency for subsequent requests (Connection Pooling).
2. Load Balancing Algorithms
The choice of algorithm dictates how the load balancer selects which healthy backend server receives the next request. Algorithms range from purely mathematical (static) to highly adaptive (dynamic).
Round Robin
The simplest algorithm. Requests are distributed across the group of servers sequentially. If you have servers A, B, and C, the first request goes to A, the second to B, the third to C, the fourth to A, and so on.
Best for: Clusters where all servers have identical specifications and all requests take roughly the same amount of computation time.
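The rotation described above can be sketched in a few lines of Python (the server names are placeholders):

```python
from itertools import cycle

def make_round_robin(pool):
    """Return a picker that walks the pool in strict rotation."""
    rotation = cycle(pool)
    return lambda: next(rotation)

# Hypothetical pool of three identical servers.
pick = make_round_robin(["A", "B", "C"])
print([pick() for _ in range(7)])  # ['A', 'B', 'C', 'A', 'B', 'C', 'A']
```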
Weighted Round Robin
Similar to Round Robin, but each server is assigned a mathematical weight indicating its processing capacity. A server with a weight of 5 will receive five times as many connections as a server with a weight of 1 within a single full cycle.
Best for: Heterogeneous clusters. For example, if you add new, powerful instances to an older cluster but don't want to overwork the legacy hardware.
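A naive sketch of weighted rotation is shown below; it simply repeats each server in proportion to its weight. Production balancers such as NGINX use a "smooth" interleaving variant instead, but the per-cycle proportions are the same.

```python
from itertools import cycle, islice

def weighted_round_robin(weights):
    """Yield servers in proportion to their weights: naive expansion,
    so a weight-3 server appears three times per full cycle."""
    expanded = [server for server, w in weights.items() for _ in range(w)]
    return cycle(expanded)

# Hypothetical pool: one powerful node (weight 3), one legacy node (weight 1).
rotation = weighted_round_robin({"big-box": 3, "legacy": 1})
print(list(islice(rotation, 8)))
```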
Least Connections
A dynamic algorithm that directs traffic to the server currently managing the fewest active, open client connections. This requires the load balancer to track a live connection count for every backend, updating it on each connect and disconnect.
Best for: Environments where requests have wildly varying completion times, such as long-lived WebSockets or database-heavy PDF generation, which would easily skew a purely round-robin approach.
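The selection step itself is simple; the hard part in a real proxy is keeping the counts accurate. A minimal sketch:

```python
def least_connections(active):
    """Pick the backend with the fewest open connections.
    `active` maps server -> live connection count, which the proxy
    must increment on connect and decrement on close."""
    return min(active, key=active.get)

# Hypothetical snapshot of live connection counts.
print(least_connections({"A": 12, "B": 3, "C": 7}))  # B
```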
IP Hash (or Source Hash)
A hash of the client's IP address determines which server the request is routed to. The key property is that a client with a given IP address will always reach the exact same backend server, as long as the pool of servers doesn't change.
Best for: Session persistence (sticky sessions) where a user's shopping cart state is pinned to the memory of a single backend pod. Note: Consistent Hashing is an evolution of this used by databases like Cassandra to minimize re-mapping during cluster scaling.
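A sketch of the idea, using MD5 purely for illustration (real balancers typically use faster non-cryptographic hashes such as CRC32 or MurmurHash):

```python
import hashlib

def ip_hash(client_ip, pool):
    """Map a client IP to a stable backend: same IP, same server,
    as long as the pool size does not change."""
    digest = hashlib.md5(client_ip.encode()).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]

pool = ["s1", "s2", "s3"]
# The same client IP always lands on the same backend.
assert ip_hash("203.0.113.7", pool) == ip_hash("203.0.113.7", pool)
```

Note the weakness the text alludes to: if the pool grows or shrinks, the modulo changes and most clients get remapped, which is exactly what Consistent Hashing was designed to avoid.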
Random with the "Power of Two Choices"
A remarkably effective algorithm popularized by NGINX and Envoy. Searching all 10,000 servers for the absolute "least connections" takes O(N) time and requires locking shared state. Instead, the LB randomly picks exactly two servers, checks which of the two has fewer connections, and routes to that one.
Best for: Extremely large, hyper-scale distributed systems where computational efficiency of the proxy itself becomes a bottleneck. Mathematically, choosing between just two random options avoids the "herd behavior" where multiple concurrent load balancers all dump traffic on the single least-connected node simultaneously.
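The two-sample trick is small enough to show in full:

```python
import random

def power_of_two_choices(active):
    """Sample two distinct servers at random and route to the one
    with fewer active connections: O(1) per decision instead of
    scanning the entire pool."""
    a, b = random.sample(list(active), 2)
    return a if active[a] <= active[b] else b

# Hypothetical connection counts across a pool.
conns = {"s1": 10, "s2": 2, "s3": 7, "s4": 4}
print(power_of_two_choices(conns))  # the less-loaded of the two sampled servers
```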
3. Advanced Traffic Management Features
SSL/TLS Termination
Decrypting traffic is extremely CPU intensive. Instead of burdening every backend API server with decrypting HTTPS traffic, the load balancer terminates TLS at the edge. The traffic is then sent unencrypted (or re-encrypted with lightweight internal TLS) to the backend. This centralized certificate management simplifies infrastructure drastically.
Connection Draining
When a server is scheduled for maintenance or scaling down, the Load Balancer immediately stops sending it new requests, but keeps connections open for existing in-flight requests so they can finish gracefully. A drain timeout (e.g. 300 seconds) is usually configured, after which the LB forcefully severs remaining connections and deregisters the node.
Health Checking and The Thundering Herd
Load balancers rely on continuous health checks to ensure backend pools are healthy.
- Passive Checks: The proxy observes real traffic. If target A returns 5xx errors for actual user requests 3 times in a row, it ejects the node temporarily.
- Active Checks: The LB synthetically probes an endpoint (e.g. /healthz) every 5 seconds. If it misses 2 consecutive probes, the node is marked unhealthy.
⚠️ Danger: The Thundering Herd Problem
If a database hiccup makes your entire 100-node fleet simultaneously report "unhealthy," the Load Balancer might eject all 100 nodes. When the database comes up, the LB sees 100 nodes become healthy instantly and unleashes millions of queued requests onto them, crashing them again. Modern proxies use Panic Routing—if >50% of nodes fail health checks, the proxy assumes its health-checks are flawed and just routes traffic to all nodes anyway, preferring partial failure over complete outage.
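The panic-routing rule can be sketched as a threshold check (the 50% threshold mirrors Envoy's default "panic threshold"; function and variable names here are illustrative):

```python
def routable_backends(pool, healthy, panic_threshold=0.5):
    """Panic routing sketch: if the healthy fraction drops below the
    threshold, distrust the health checks and route to every node,
    preferring partial failure over a complete outage."""
    up = [s for s in pool if healthy.get(s)]
    if len(up) < len(pool) * panic_threshold:
        return list(pool)
    return up

fleet = ["n1", "n2", "n3", "n4"]
# Only one node reports healthy -> panic mode: route to all four anyway.
print(routable_backends(fleet, {"n1": True}))  # ['n1', 'n2', 'n3', 'n4']
```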
4. The Titans of Software Load Balancing
While hardware load balancers (F5 Networks, Citrix NetScaler) used to dominate the enterprise data center, the cloud era is entirely defined by high-performance open-source software load balancers dynamically managed by orchestration engines like Kubernetes.
NGINX
Originally written in 2004 to solve the C10k problem (handling 10,000 concurrent connections). NGINX uses an asynchronous, event-driven architecture rather than creating new threads for each request. It is the most widely deployed web server and reverse proxy on the internet.
A typical NGINX upstream block combining least-connections balancing with weights and passive health checks:
upstream backend {
    least_conn;
    server 10.0.0.1 weight=5;
    server 10.0.0.2 max_fails=3 fail_timeout=30s;
}
HAProxy
The gold standard for pure, extreme-performance load balancing. Written in C, HAProxy is known for its ability to handle millions of connections with effectively zero latency overhead. It powers the edge infrastructure of sites like GitHub, Reddit, and StackOverflow. It is uncompromisingly focused purely on proxying and balancing, making it more efficient than NGINX at raw L4 TCP tasks.
Envoy Proxy (The Modern Era)
Developed by Lyft and written in C++11, Envoy is fundamentally designed for cloud-native microservice architectures. Unlike NGINX which requires reloading configuration files, Envoy is driven by xDS APIs—it pulls routing tables, SSL certs, and endpoints down dynamically via gRPC without pausing traffic. This dynamic nature is why Envoy acts as the foundation for modern Service Meshes like Istio, sitting as a sidecar proxy next to every single container in a cluster.
VMware Avi Load Balancer
Also known as NSX Advanced Load Balancer, Avi pioneered the modern Software-Defined Load Balancing approach by strictly separating the Control Plane from the Data Plane. Rather than managing individual hardware appliances, administrators define policies centrally via REST APIs on the Avi Controller. The Controller then automatically spins up, scales out, or scales in distributed Data Plane instances (Service Engines) across any on-premise data center or public cloud in real-time. It also features deep built-in telemetry, capturing granular analytics for every single transaction without requiring third-party monitoring agents.
5. Load Balancing the Load Balancers (GSLB)
If the load balancer goes down, your entire application goes down. Therefore, load balancers must themselves be highly available. This introduces a recursive problem: how do you balance traffic to the load balancers?
Active-Passive & VRRP
In traditional setups, two HAProxy instances sit at the edge. One is Active, handling traffic. The other is Passive (standby). They communicate using VRRP (Virtual Router Redundancy Protocol), typically via keepalived, and share a single floating Virtual IP (VIP). If the Active node stops broadcasting VRRP heartbeats, the Passive node immediately broadcasts gratuitous ARP packets claiming ownership of the VIP, and the upstream switches begin routing public traffic to the backup node.
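A minimal keepalived configuration for the active node might look like the sketch below; the interface name, router ID, priorities, and VIP are all illustrative values:

```
vrrp_instance VI_1 {
    state MASTER              # the standby node uses state BACKUP
    interface eth0            # NIC that carries the VIP
    virtual_router_id 51      # must match on both nodes
    priority 100              # standby gets a lower priority, e.g. 90
    advert_int 1              # VRRP heartbeat interval in seconds
    virtual_ipaddress {
        203.0.113.10/24       # the floating VIP both nodes can claim
    }
}
```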
Global Server Load Balancing (GSLB)
To distribute millions of users globally to the closest data center, we need mechanisms that operate above individual load balancers:
- DNS Load Balancing: When a user types netflix.com, the DNS server returns an IP address for the data center geographically closest to them. However, DNS caching (TTLs) makes failovers slow, sometimes taking hours for ISPs to flush dead IPs.
- Anycast Routing: The magical technology behind Cloudflare and Google. Hundreds of load balancers worldwide broadcast the exact same IP address via BGP (Border Gateway Protocol). The internet's core routers automatically send packets to whichever load balancer is topologically closest. If a data center goes offline, BGP routes automatically flow around it to the next closest center in milliseconds.
- Google's Maglev: Google's proprietary L4 load balancing system uses Consistent Hashing without a connection tracking table. Every packet is hashed independently (using a 5-tuple: Source IP, Source Port, Dest IP, Dest Port, Protocol) directly to a backend HTTP proxy in the cluster, avoiding any synchronized, stateful connection table that could become a bottleneck.
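The per-packet 5-tuple idea can be sketched as below. Note this is only the flow-to-backend mapping: real Maglev uses a purpose-built consistent-hashing lookup table to minimize remapping when backends change, whereas this sketch uses a simple modulo.

```python
import hashlib

def pick_backend(src_ip, src_port, dst_ip, dst_port, proto, backends):
    """Hash the connection 5-tuple so every packet of a given flow
    deterministically maps to the same backend, with no shared
    connection-tracking table."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return backends[h % len(backends)]

proxies = ["proxy-1", "proxy-2", "proxy-3"]
flow = ("10.0.0.1", 54321, "192.0.2.10", 443, "tcp")
# Every packet of this flow hashes to the same proxy.
assert pick_backend(*flow, proxies) == pick_backend(*flow, proxies)
```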