The Engineering of Load Balancing: Distributing the Internet
As a web application grows from hundreds of users to millions, a single server (no matter how powerful) will inevitably fail under the sheer volume of concurrent connections. The solution is horizontal scaling—adding more servers. However, this creates a routing problem: How do clients know which server to talk to? The answer is the Load Balancer, the invisible traffic cop of the internet.
Part 1: Layer 4 vs Layer 7 Load Balancing
Load balancers operate at different layers of the OSI model, trading off between raw speed and routing intelligence.
- Layer 4 (Transport/Network Layer): Operates on raw IP addresses and TCP/UDP ports. An L4 balancer (like AWS Network Load Balancer or HAProxy in TCP mode) does not look at the contents of the message. It inspects only the packet headers, performs Network Address Translation (NAT) to rewrite the destination IP to a backend server, and forwards the packets. This is blisteringly fast and can handle millions of requests per second with minimal CPU overhead, but it is entirely blind to HTTP headers or URL paths.
- Layer 7 (Application Layer): Operates on HTTP/HTTPS. An L7 balancer (like NGINX or AWS Application Load Balancer) fully terminates the TCP connection and decrypts the TLS traffic using its own certificate and private key. It reads the actual HTTP request (e.g., GET /api/users vs. GET /images/logo.png). Because it understands the application data, an L7 balancer can route API requests to a Node.js microservice cluster while routing image requests to an AWS S3 bucket. This requires significantly more CPU power but enables intelligent, path-based routing.
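The path-based routing an L7 balancer performs can be sketched as a simple dispatch on the request path. This is a minimal illustration, not a real balancer; the pool names and IP addresses are hypothetical.

```python
# Hypothetical backend pools keyed by what kind of traffic they serve.
API_POOL = ["10.0.1.10", "10.0.1.11"]   # e.g. a Node.js microservice cluster
STATIC_POOL = ["10.0.2.10"]             # e.g. servers fronting static assets

def route(path: str) -> list[str]:
    """Pick a backend pool based on the HTTP request path (L7 routing)."""
    if path.startswith("/api/"):
        return API_POOL
    if path.startswith("/images/"):
        return STATIC_POOL
    return STATIC_POOL  # default pool for everything else
```

An L4 balancer cannot do this, because the path only becomes visible after the TCP payload is parsed as HTTP.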
Part 2: The Routing Algorithms
When a request arrives, the Load Balancer must execute an algorithm to select exactly one backend server from the available pool.
- Round Robin: The simplest approach. It cycles through the list of servers sequentially (1, 2, 3, 1, 2, 3). It assumes all servers have equal capacity and all requests take an equal amount of time. In the real world, this is rarely true, often leading to uneven load distribution.
- Least Connections: The load balancer tracks how many active TCP connections are open to each backend server. When a new request arrives, it is routed to the server with the fewest active connections. This is far superior for workloads where some requests take 5 milliseconds and others take 5 seconds.
- IP Hash / Sticky Sessions: The balancer hashes the client's IP address (e.g., Hash(192.168.1.50) % 3) to pick a server. This guarantees that a specific user is always routed to the same backend server, which is critical for legacy applications that store shopping cart data in local server RAM rather than a shared Redis cache.
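The three algorithms above can be sketched in a few lines each. This is an illustrative model, not production code: the server names are hypothetical, and a real balancer would decrement the connection counts when connections close.

```python
import itertools
import zlib
from collections import defaultdict

SERVERS = ["srv-1", "srv-2", "srv-3"]  # hypothetical backend pool

# Round Robin: cycle through the server list sequentially (1, 2, 3, 1, 2, 3).
_cycle = itertools.cycle(SERVERS)

def round_robin() -> str:
    return next(_cycle)

# Least Connections: track open connections per server, pick the minimum.
active = defaultdict(int)  # server -> count of currently open connections

def least_connections() -> str:
    server = min(SERVERS, key=lambda s: active[s])
    active[server] += 1  # the caller decrements this when the connection closes
    return server

# IP Hash: a deterministic hash of the client IP pins each client to one server.
def ip_hash(client_ip: str) -> str:
    return SERVERS[zlib.crc32(client_ip.encode()) % len(SERVERS)]
```

Note the use of a deterministic hash (CRC32) for IP Hash: stickiness only works if the same input always maps to the same server, even across balancer restarts.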
Part 3: Health Checks and Self-Healing
A Load Balancer is useless if it blindly routes traffic to a server that has crashed or frozen. To prevent this, load balancers perform active Health Checks.
Every few seconds, the Load Balancer sends a request to a designated endpoint on every backend server (e.g., GET /health). If a server fails to respond with an HTTP 200 OK within the timeout period (e.g., 2 seconds) for a consecutive number of tries (the Unhealthy Threshold), the Load Balancer removes that server from the routing pool. Active connections to the failed server are severed (or drained, depending on configuration), and all new traffic is redistributed among the surviving, healthy servers. When the failed server recovers and begins passing health checks again, it is seamlessly reintegrated into the pool. This provides the foundation for high-availability infrastructure.
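The ejection/reintegration logic described above can be sketched as follows. This is a simplified model assuming a hypothetical /health endpoint and a threshold of 3 consecutive failures; real balancers also use a separate Healthy Threshold for reintegration.

```python
import urllib.request

UNHEALTHY_THRESHOLD = 3  # consecutive failures before ejection
TIMEOUT_SECONDS = 2.0

def check_health(base_url: str) -> bool:
    """Return True if GET /health answers 200 OK within the timeout."""
    try:
        with urllib.request.urlopen(base_url + "/health",
                                    timeout=TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, HTTP error, etc.
        return False

def update_pool(failures: dict, pool: set, server: str, healthy: bool) -> None:
    """Track consecutive failures; eject or reinstate the server accordingly."""
    if healthy:
        failures[server] = 0
        pool.add(server)          # reintegrate once it passes checks again
    else:
        failures[server] = failures.get(server, 0) + 1
        if failures[server] >= UNHEALTHY_THRESHOLD:
            pool.discard(server)  # eject from the routing pool
```

A scheduler would call check_health for each backend every few seconds and feed the result into update_pool.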
Part 4: SSL Termination and Offloading
Decrypting HTTPS traffic requires heavy cryptographic computation (RSA/ECDSA handshakes and symmetric decryption). If you have 50 backend web servers, forcing each one to decrypt its own traffic burns massive amounts of CPU that should be spent on application logic.
Instead, modern architectures use SSL Offloading. The public TLS certificate is installed directly on the Load Balancer. The LB accepts the encrypted HTTPS traffic from the public internet, decrypts it (often with hardware acceleration), and then forwards the plain, unencrypted HTTP traffic to the backend servers over a trusted private network such as a VPC. This frees the backends to spend their CPU on application work.
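The shape of SSL termination can be sketched with Python's standard ssl module: TLS ends at the balancer, and everything behind it is plaintext. The certificate paths and backend address below are hypothetical, and a real proxy would loop, stream, and parse HTTP properly.

```python
import socket
import ssl

# Hypothetical plain-HTTP backend on the private network.
BACKEND = ("10.0.1.10", 80)

def make_tls_context(cert_file: str, key_file: str) -> ssl.SSLContext:
    """Install the public certificate and private key on the balancer."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(cert_file, key_file)
    return ctx

def serve_once(listener: socket.socket, ctx: ssl.SSLContext) -> None:
    """Accept one HTTPS connection, decrypt it, and relay plaintext HTTP."""
    raw_sock, _addr = listener.accept()
    with ctx.wrap_socket(raw_sock, server_side=True) as tls:  # TLS ends here
        request = tls.recv(65536)                  # plaintext after handshake
        with socket.create_connection(BACKEND) as backend:
            backend.sendall(request)               # unencrypted HTTP internally
            tls.sendall(backend.recv(65536))       # re-encrypted for the client
```

The key point is the asymmetry: the client-facing socket is wrapped in TLS, while the backend-facing socket is a plain TCP connection.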
Conclusion: The Single Point of Contact
Load Balancing abstracts away the physical reality of the datacenter. To the user, Google appears to be a single, infinitely powerful computer sitting at google.com. In reality, the Load Balancer is acting as the ultimate facade: intercepting the request, evaluating server health, executing routing algorithms, terminating encryption, and silently distributing the load across thousands of individual servers in real time.