Scaling Out
Breaking the vertical ceiling and distributing traffic across clusters.
The "Scale Cube" (from The Art of Scalability by Abbott & Fisher) provides a mental model for three dimensions of scaling:
X-Axis: Cloning
Run N identical copies of your application behind a load balancer. Every instance handles every request type. Simplest form of scaling — this module's focus.
Y-Axis: Decomposition
Split by function into microservices. The payments service scales independently from the search service. Covered in the Strangler Fig section of Module 1.
Z-Axis: Sharding
Split by data. Users A-M go to Shard 1, N-Z to Shard 2. Each shard runs the same code but owns a subset of data. Covered in Module 7 (Database Scaling).
X-axis scaling (this module) is the prerequisite for everything else. If your application can't run as multiple identical clones, you can't decompose it into microservices or shard its data. The key requirement: your application must be stateless.
Horizontal scaling only works if your application servers are stateless — meaning no server holds any user-specific data in local memory. If a load balancer sends Request 1 from User A to Server 1 and Request 2 to Server 3, both servers must be able to serve User A identically.
What "state" do applications commonly hold in memory?
- Session data: Login status, shopping cart, form wizard progress
- File uploads: Temporary files saved to local disk
- In-process caches: Application-level data caches (handled in Module 3)
- WebSocket connections: Long-lived connections bound to a specific server
- Background job state: In-progress job tracking stored in memory
Each of these must be externalized to a shared store to enable stateless scaling.
Session Management Strategies
The most common stateful element is the user session. Here are three strategies for externalizing it, each with distinct tradeoffs:
1. Sticky Sessions (Session Affinity)
The load balancer remembers which server a user was routed to (via a cookie or IP hash) and sends all subsequent requests to the same server. Not truly stateless — more of a "patch" that avoids the real problem.
- Zero code changes required
- Works with any session implementation
- Server death = session loss for all its users
- Uneven load distribution (some servers get "heavy" users)
- Prevents auto-scaling (can't remove a server without killing sessions)
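The affinity mechanism itself is tiny. A minimal sketch of IP-hash routing, one of the two mechanisms mentioned above (server names are illustrative):

```python
import hashlib

def pick_server(client_ip: str, servers: list[str]) -> str:
    """Deterministically map a client IP to one server (IP-hash affinity)."""
    # Use a stable hash: Python's built-in hash() is salted per process,
    # so different LB nodes would disagree on the mapping.
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

servers = ["app-1", "app-2", "app-3"]
# The same client always lands on the same server...
assert pick_server("203.0.113.7", servers) == pick_server("203.0.113.7", servers)
```

Note the downside baked into the modulo: adding or removing a server changes `len(servers)` and reshuffles most users' affinity, which is exactly why sticky sessions fight auto-scaling.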
2. Centralized Session Store (Redis)
Move all session data to a fast, shared external store like Redis. Every application server reads from and writes to the same Redis cluster. The session is fully decoupled from the server.
- Truly stateless servers — any server can handle any request
- Server death doesn't affect user sessions
- Enables autoscaling, rolling deploys
- Redis becomes a dependency (must be HA)
- ~0.5ms latency per session lookup
- Redis failure = all sessions lost (unless persistent)
3. Client-Side Tokens (JWT)
Store the session data in the user's browser as a JSON Web Token. The server doesn't store anything — it just validates the cryptographic signature on each request. The token contains the user ID, roles, and expiry.
- Zero server-side storage — ultimate statelessness
- Works across services (microservice auth)
- No external dependency (no Redis needed)
- Can't revoke tokens before expiry (without a blocklist)
- Payload visible to the client (never put secrets in a JWT)
- Larger HTTP headers (~1KB per request)
Explore: Session & Auth
See how JWT tokens are created, verified, and how OAuth/OIDC flows delegate authentication to identity providers.
Auto-Scaling
Running a fixed number of servers wastes money during low traffic and underserves during peaks. Auto-scaling dynamically adjusts the number of instances based on demand.
Scaling Metrics
What metric should trigger scaling? The answer depends on your bottleneck:
| Metric | When to Use | Target |
|---|---|---|
| CPU Utilization | Compute-bound workloads (image processing, ML inference) | Scale out at 70%, scale in at 30% |
| Request count (RPS) | Web APIs with predictable per-request cost | N requests per instance (e.g., 500 RPS/instance) |
| Queue depth | Worker/consumer services processing from a queue | Scale when queue length > threshold for > 2 min |
| Response latency (P99) | Latency-sensitive APIs (real-time, gaming) | Scale when P99 > SLA target (e.g., 200ms) |
| Memory utilization | Memory-bound workloads (in-process caches, JVM heaps) | Scale out at 80% — leave headroom for GC |
Scaling Policies
- Reactive scaling responds to current metrics: "If CPU > 70% for 3 minutes, add 2 instances." Fast to implement but always lags behind demand.
- Predictive scaling uses historical patterns: "Every Monday at 9 AM, traffic increases 5x, so pre-scale at 8:55 AM." AWS Auto Scaling and GCP Autoscaler support this natively.
- Scheduled scaling uses cron-like rules: "Set min instances to 20 during business hours, 5 at night." Useful for known traffic patterns (e-commerce sales events, streaming premieres).
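A reactive policy with a cool-down can be sketched like this (the 70/30 thresholds, step sizes, and 300-second cool-down are illustrative):

```python
class ReactiveScaler:
    """Reactive policy: scale out at 70% CPU, in at 30%, with a cool-down
    between changes so the fleet doesn't oscillate."""

    def __init__(self, min_n: int = 2, max_n: int = 20, cooldown: float = 300):
        self.n = min_n
        self.min_n, self.max_n = min_n, max_n
        self.cooldown = cooldown
        self.last_change = 0.0

    def evaluate(self, cpu_pct: float, now: float) -> int:
        if now - self.last_change < self.cooldown:
            return self.n  # still in cool-down: no change either direction
        if cpu_pct > 70 and self.n < self.max_n:
            self.n += 2            # scale out aggressively
            self.last_change = now
        elif cpu_pct < 30 and self.n > self.min_n:
            self.n -= 1            # scale in cautiously, one at a time
            self.last_change = now
        return self.n

scaler = ReactiveScaler()
assert scaler.evaluate(85, now=1000) == 4   # burst: 2 -> 4
assert scaler.evaluate(85, now=1060) == 4   # within cool-down: no change
assert scaler.evaluate(20, now=1400) == 3   # quiet: drain one instance
```

The asymmetry (add two, remove one) reflects the reactive lag mentioned above: under-provisioning hurts users immediately, while over-provisioning only costs money.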
Scaling out is easy. Scaling in (removing instances) is dangerous. If you remove an instance that's processing requests, those connections are severed. Solutions: connection draining (finish in-flight requests before shutdown), deregistration delay (wait 30 seconds after removing from LB before terminating), and cool-down periods (don't scale in again for 5 minutes to avoid oscillation).
Explore: Auto-Scaling & Containers
See how Kubernetes Horizontal Pod Autoscaler and cloud provider auto-scaling groups work.
Service Discovery
When you add or remove servers dynamically (auto-scaling), how does the rest of the system know where to send traffic? Hardcoding IP addresses doesn't work when instances are ephemeral. Service Discovery solves this by maintaining a registry of available service instances.
Client-Side Discovery
The client queries a service registry (e.g., Consul, etcd, ZooKeeper) to get a list of healthy instances, then load-balances across them locally.
Used by: Netflix Eureka, Spring Cloud LoadBalancer
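A sketch of the client side, with a plain dict standing in for the registry (a real client would poll or watch Consul/etcd for membership changes; service name and addresses are illustrative):

```python
import itertools

# A dict stands in for a service registry like Consul or etcd.
registry = {
    "payments": ["10.0.0.5:8080", "10.0.0.6:8080", "10.0.0.7:8080"],
}

class DiscoveryClient:
    """Client-side discovery: fetch the instance list once, then
    load-balance locally with round-robin."""

    def __init__(self, service: str):
        self.service = service
        # In practice this list is refreshed when the registry changes.
        self._rr = itertools.cycle(registry[service])

    def next_instance(self) -> str:
        return next(self._rr)

client = DiscoveryClient("payments")
assert client.next_instance() == "10.0.0.5:8080"
assert client.next_instance() == "10.0.0.6:8080"
```

The tradeoff is visible here: the client gains full control over balancing, but every client now embeds registry-aware logic — the complexity server-side discovery moves into the load balancer.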
Server-Side Discovery
The client sends traffic to a load balancer, which queries the registry and routes to a healthy instance. The client doesn't need to know about the registry.
Used by: AWS ALB + ECS, Kubernetes Services, Consul + Envoy
DNS-Based Discovery
Services register as DNS records. Clients resolve the service name to a set of IP addresses. Simple but limited — DNS TTL caching causes delayed updates.
Used by: Kubernetes DNS (CoreDNS), AWS Route 53 Service Discovery
Service Mesh (Sidecar)
A sidecar proxy (Envoy, Linkerd) handles discovery, load balancing, retries, and observability transparently. The application doesn't know about any of it.
Used by: Istio, Linkerd, Consul Connect, AWS App Mesh
Explore: Discovery & Networking
Understand how services find each other in dynamic environments, and how Kubernetes networking connects pods.
Health Checks & Self-Healing
In a fleet of servers, instances fail constantly — hardware dies, processes crash, deployments roll out. The system must detect and react to failures automatically, without human intervention.
Health Check Types
- Liveness probe: "Is the process alive?" A simple HTTP 200 response (e.g., GET /healthz → 200 OK). If this fails, restart the instance.
- Readiness probe: "Can this instance serve traffic?" Checks database connectivity, cache availability, and warm-up status. If this fails, don't send traffic to this instance, but don't restart it.
- Startup probe: "Has the app finished starting?" Important for JVM applications with 30-60 second startup times. Prevents the liveness probe from killing the app before it's ready.
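The liveness/readiness distinction in miniature (the 200/503 status codes follow the usual convention; the dependency checks are illustrative):

```python
def liveness() -> int:
    """Liveness: can the process respond at all? Keep it dependency-free --
    checking the database here would restart the app on every DB blip."""
    return 200

def readiness(db_ok: bool, cache_ok: bool, warmed_up: bool) -> int:
    """Readiness: is it safe to send traffic? Failing this removes the
    instance from the load balancer but does NOT restart the process."""
    return 200 if (db_ok and cache_ok and warmed_up) else 503

assert liveness() == 200
assert readiness(db_ok=True, cache_ok=True, warmed_up=True) == 200
# DB down: stop routing traffic here, but don't kill the process.
assert readiness(db_ok=True, cache_ok=False, warmed_up=True) == 503
```

Wiring database checks into the liveness probe is a classic mistake: a transient dependency failure then triggers a fleet-wide restart storm instead of a quiet traffic drain.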
Graceful Shutdown
When an instance is removed (scaling in, deploying a new version), it must shut down gracefully:
- Stop accepting new connections — deregister from the load balancer.
- Drain in-flight requests — wait for active requests to complete (with a timeout, e.g., 30 seconds).
- Close resources — close database connections, flush logs, commit offsets.
- Exit — terminate the process with exit code 0.
In Kubernetes, this is orchestrated by the SIGTERM signal followed by a
configurable terminationGracePeriodSeconds (default: 30s). If the pod hasn't
exited by then, it receives SIGKILL.
Try It: Kubernetes Simulators
Simulate rolling deployments, pod eviction, and see how Kubernetes orchestrates zero-downtime updates.
Load Balancing Algorithms
Choosing the right LB algorithm determines how evenly traffic is distributed. No single algorithm is best for all workloads.
| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Requests go to server 1, 2, 3, 1, 2, 3... in order | Identical servers, uniform request cost |
| Weighted Round Robin | Proportional distribution (e.g., 3:1 ratio for fast:slow servers) | Mixed hardware, canary deploys (10% to new version) |
| Least Connections | Route to the server with the fewest active connections | Variable request durations (WebSockets, uploads) |
| Least Response Time | Route to the server with the fastest average response time | Latency-sensitive APIs |
| IP Hash | Hash the client's IP to deterministically pick a server | Cheap sticky sessions without cookies |
| Power of Two Random | Pick 2 servers at random, route to the one with fewer connections | Large-scale systems (avoids "herd" to the least-loaded) |
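Power-of-two-choices from the table above is short enough to sketch directly (connection counts are illustrative):

```python
import random

def power_of_two(conns: dict[str, int], rng=random) -> str:
    """Pick two servers at random, route to the one with fewer active
    connections -- near-optimal balance without scanning the whole fleet."""
    a, b = rng.sample(list(conns), 2)
    return a if conns[a] <= conns[b] else b

conns = {"s1": 12, "s2": 3, "s3": 40, "s4": 7}
choice = power_of_two(conns)
assert choice in conns
# The chosen server is never the more loaded of the sampled pair, so the
# hot spot ("s3") is avoided without any global coordination.
```

This is why it scales: pure "least connections" requires every LB node to track the whole fleet's state, while sampling two avoids both that cost and the thundering herd toward whichever server briefly reports the lowest count.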
L4 vs L7 Load Balancing
Layer 4 (Transport) LBs route based on IP:port — extremely fast, no HTTP parsing. Layer 7
(Application) LBs inspect HTTP headers, URL paths, and cookies — enabling content-based
routing (/api to backend servers, /static to CDN). Use L4 for raw throughput,
L7 for intelligent routing. Most cloud load balancers (AWS ALB, GCP HTTPS LB) are L7.
Try It: Load Balancing Simulator
Visualize Round Robin, Least Connections, and Weighted algorithms distributing requests in real-time.
Zero-Downtime Deployments
Horizontal scaling changes how you deploy. You can no longer "stop the server, update, restart." You need zero-downtime deployment strategies:
Rolling Update
Replace instances one at a time. At any point, the fleet has a mix of old and new
versions. Simple but causes brief inconsistency. Kubernetes default via RollingUpdate
strategy with maxSurge: 1, maxUnavailable: 0.
Blue-Green Deployment
Run two identical environments: "Blue" (current) and "Green" (new). Deploy to Green, test it, then switch the load balancer to point to Green. Instant rollback = switch back to Blue. Requires 2x infrastructure during the switch.
Canary Deployment
Route a small percentage (1-5%) of traffic to the new version. Monitor error rates and latency. If healthy, gradually increase to 10% → 50% → 100%. If errors spike, roll back instantly. Standard practice at large operators such as Google, Netflix, and Amazon.
Feature Flags
Deploy new code to 100% of servers but activate it for only selected users via configuration. Separates deployment (code running in production) from release (users seeing the feature). Tools: LaunchDarkly, Unleash, Flagsmith.
File Uploads → Object Storage
If users upload files (images, documents, videos), storing them on the local filesystem breaks with horizontal scaling — Server 1 has the file, but Server 2 doesn't. The solution: object storage (S3, GCS, Azure Blob). Upload directly to S3 using pre-signed URLs — the app server never touches the file, reducing CPU and memory load.
WebSocket Scaling
WebSocket connections are long-lived and stateful by nature — a user connects to Server 1, and all messages must flow through Server 1. This contradicts stateless scaling. Solutions:
- Redis Pub/Sub for fan-out: Each server subscribes to relevant Redis channels. When a message arrives for a user on Server 1, it's published to Redis, and Server 1's subscription delivers it. All servers can receive messages for any user.
- Sticky connections with session awareness: Use L7 LB cookie-based routing to keep WebSocket connections on the same server, combined with a Redis-backed mapping of user → server for targeted message delivery.
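The fan-out pattern in miniature, with an in-process bus standing in for Redis Pub/Sub (with redis-py, each server would call `pubsub.subscribe(channel)` and `publish(channel, message)`; channel names are illustrative):

```python
from collections import defaultdict

# An in-process bus stands in for Redis Pub/Sub so this sketch runs alone.
subscribers = defaultdict(list)  # channel -> list of delivery callbacks

def subscribe(channel: str, callback) -> None:
    subscribers[channel].append(callback)

def publish(channel: str, message: str) -> None:
    for cb in subscribers[channel]:
        cb(message)

# Server 1 holds the WebSocket for user-42, so only it subscribes.
delivered = []
subscribe("user:42", lambda msg: delivered.append(("server-1", msg)))

# Server 2 handles an HTTP request addressed to user-42 and publishes;
# the bus routes the message to whichever server holds the socket.
publish("user:42", "new direct message")
assert delivered == [("server-1", "new direct message")]
```

The app servers stay interchangeable for HTTP traffic: any of them can accept the "send a message" request, because delivery to the right socket is the bus's job, not the receiving server's.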
Explore: Real-Time Systems
See how WebSockets, Server-Sent Events, and real-time communication protocols work under the hood.
Case Study: The Billion-Dollar Health Check Bug
A major e-commerce platform had an outage where their load balancer's health checks were too aggressive. When a few nodes became slow under load, the LB marked them "Offline" and moved their traffic to remaining nodes. The remaining nodes, now overloaded, also became slow and were marked offline. Within 60 seconds, the entire 500-node cluster was "Offline" despite the hardware being perfectly healthy. The LB had created a cascading failure by being too eager to remove slow nodes.
Takeaway: Health checks need flapping protection (minimum 3 consecutive failures before removal), a minimum healthy percentage (never remove more than 30% of the fleet), and hysteresis (different thresholds for marking down vs. marking up).
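The flapping-protection and hysteresis guards can be sketched as a small per-node health tracker (the 3-down/2-up thresholds are illustrative; a fleet-level minimum-healthy-percentage guard would sit alongside this):

```python
class HealthTracker:
    """Flapping protection with hysteresis: it takes 3 consecutive failed
    checks to mark a node down, and 2 consecutive passes to mark it up."""
    DOWN_AFTER = 3
    UP_AFTER = 2

    def __init__(self):
        self.healthy = True
        self.fails = 0
        self.oks = 0

    def record(self, check_passed: bool) -> bool:
        if check_passed:
            self.fails, self.oks = 0, self.oks + 1
            if not self.healthy and self.oks >= self.UP_AFTER:
                self.healthy = True
        else:
            self.oks, self.fails = 0, self.fails + 1
            if self.healthy and self.fails >= self.DOWN_AFTER:
                self.healthy = False
        return self.healthy

node = HealthTracker()
assert node.record(False)      # one slow response: still in rotation
assert node.record(False)      # two: still in rotation (no flapping)
assert not node.record(False)  # three consecutive: now marked down
```

The asymmetric thresholds are the hysteresis: a node that is merely slow under load survives a couple of bad checks, and a node that was removed must prove itself more than once before taking traffic again.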
Case Study: Netflix's Chaos Monkey
Netflix deliberately kills random production instances during business hours using Chaos Monkey. The philosophy: if your system can't handle a server dying at 2 PM on a Tuesday, it certainly can't handle one dying at 2 AM on a Saturday. This forced every team to build truly stateless, fault-tolerant services. Any service that couldn't survive random instance termination was redesigned. The result: Netflix can survive an entire AWS Availability Zone going offline.
Takeaway: Don't wait for failures to test your scaling assumptions. Kill your servers in production regularly. If your system breaks, better to find out during business hours with engineers at their desks.
Case Study: Discord's 5M Concurrent Connections
Discord handles over 5 million concurrent WebSocket connections using a combination of consistent hashing (user → gateway server mapping), Redis Pub/Sub (cross-server message delivery), and Elixir/Erlang (millions of lightweight processes per node). Each gateway server handles ~1M connections. When a server is removed, its connections are rehashed and reconnected to other servers within seconds.
Takeaway: Stateful protocols (WebSockets) can scale horizontally by combining consistent hashing with a message bus. The key is making reconnection fast and transparent to users.
- The Art of Scalability by Martin Abbott & Michael Fisher — The Scale Cube (X/Y/Z axes) defines the three dimensions of scaling. (Addison-Wesley, 2015)
- Designing Data-Intensive Applications by Martin Kleppmann — Chapter 6 (Partitioning) and Chapter 8 (Trouble with Distributed Systems). (O'Reilly, 2017)
- The Netflix Simian Army — Netflix Tech Blog — How Chaos Monkey, Chaos Kong, and Latency Monkey validate resilience.
- How Discord Stores Trillions of Messages — Scaling real-time systems to millions of concurrent connections.
- Kubernetes HPA Documentation — Official guide to Horizontal Pod Autoscaler, including custom metrics.
- Canary Release — Martin Fowler — The original description of canary deployment strategy.
- Stateless Auth for Stateful Minds — Auth0 — JWT-based session management patterns and tradeoffs.
- Release It! by Michael Nygard — Stability patterns including circuit breakers, bulkheads, and timeouts for production scaling. (Pragmatic Bookshelf, 2018)
All Hands-on Resources
Reinforce these concepts with interactive simulators and visual deep-dives.
What's Next?
Load Balancing
Go deeper into load balancing algorithms, health probes, TLS termination, and the infrastructure that keeps your fleet alive.