Scaling Out

Breaking the vertical ceiling and distributing traffic across clusters.

Module 4: Scaling Out
Track 1: Foundations (0–2 YoE)
In Module 1, we hit the vertical ceiling. In Module 3, we learned caching to reduce load. But caching alone can't solve the fundamental problem: a single server is a single point of failure. Horizontal scaling — running multiple copies of your application behind a load balancer — is the foundation of every modern cloud-scale system. This module covers the stateless mandate, session externalization, auto-scaling, service discovery, and the operational challenges of running a fleet.
The Scale Cube

The "Scale Cube" (from The Art of Scalability by Abbott & Fisher) provides a mental model for three dimensions of scaling:

X-Axis: Cloning

Run N identical copies of your application behind a load balancer. Every instance handles every request type. Simplest form of scaling — this module's focus.

Y-Axis: Decomposition

Split by function into microservices. The payments service scales independently from the search service. Covered in the Strangler Fig section of Module 1.

Z-Axis: Sharding

Split by data. Users A-M go to Shard 1, N-Z to Shard 2. Each shard runs the same code but owns a subset of data. Covered in Module 7 (Database Scaling).

X-axis scaling (this module) is the prerequisite for everything else. If your application can't run as multiple identical clones, you can't decompose it into microservices or shard its data. The key requirement: your application must be stateless.


The Stateless Mandate

Horizontal scaling only works if your application servers are stateless — meaning no server holds any user-specific data in local memory. If a load balancer sends Request 1 from User A to Server 1 and Request 2 to Server 3, both servers must be able to serve User A identically.

What "state" do applications commonly hold in memory?

  • Session data: Login status, shopping cart, form wizard progress
  • File uploads: Temporary files saved to local disk
  • In-process caches: Application-level data caches (handled in Module 3)
  • WebSocket connections: Long-lived connections bound to a specific server
  • Background job state: In-progress job tracking stored in memory

Each of these must be externalized to a shared store to enable stateless scaling.

Session Management Strategies

The most common stateful element is the user session. Here are three strategies for externalizing it, each with distinct tradeoffs:

1. Sticky Sessions (Session Affinity)

The load balancer remembers which server a user was routed to (via a cookie or IP hash) and sends all subsequent requests to the same server. Not truly stateless — more of a "patch" that avoids the real problem.

Pros
  • Zero code changes required
  • Works with any session implementation
Cons
  • Server death = session loss for all its users
  • Uneven load distribution (some servers get "heavy" users)
  • Prevents auto-scaling (can't remove a server without killing sessions)

2. Centralized Session Store (Redis)

Move all session data to a fast, shared external store like Redis. Every application server reads from and writes to the same Redis cluster. The session is fully decoupled from the server.

# On login: store the session in Redis with a 1-hour TTL
session_id = uuid4().hex
redis.set(f"session:{session_id}", user_json, ex=3600)
response.set_cookie("sid", session_id, httponly=True)

# On each request: load the session from Redis
session_id = request.cookies.get("sid")
user_json = redis.get(f"session:{session_id}")
Pros
  • Truly stateless servers — any server can handle any request
  • Server death doesn't affect user sessions
  • Enables autoscaling, rolling deploys
Cons
  • Redis becomes a dependency (must be HA)
  • ~0.5ms latency per session lookup
  • Redis failure = all sessions lost (unless persistent)

3. Client-Side Tokens (JWT)

Store the session data in the user's browser as a JSON Web Token. The server doesn't store anything — it just validates the cryptographic signature on each request. The token contains the user ID, roles, and expiry.

// JWT structure: header.payload.signature
eyJhbGciOiJIUzI1NiJ9 // Header
.eyJ1c2VyIjoibmlsZXNoIiwiZXhwIjoxNzA5NTB9 // Payload
.dBjftJeZ4CVP-mB92K27uhbUJU1p1r_wW1gFWFOEjXk // Signature
Pros
  • Zero server-side storage — ultimate statelessness
  • Works across services (microservice auth)
  • No external dependency (no Redis needed)
Cons
  • Can't revoke tokens before expiry (without a blocklist)
  • Payload visible to the client (never put secrets in a JWT)
  • Larger HTTP headers (~1KB per request)

Explore: Session & Auth

See how JWT tokens are created, verified, and how OAuth/OIDC flows delegate authentication to identity providers.


Auto-Scaling

Running a fixed number of servers wastes money during low traffic and underserves during peaks. Auto-scaling dynamically adjusts the number of instances based on demand.

Scaling Metrics

What metric should trigger scaling? The answer depends on your bottleneck:

  • CPU utilization: compute-bound workloads (image processing, ML inference). Target: scale out at 70%, scale in at 30%.
  • Request count (RPS): web APIs with predictable per-request cost. Target: N requests per instance (e.g., 500 RPS/instance).
  • Queue depth: worker/consumer services processing from a queue. Target: scale when queue length > threshold for > 2 minutes.
  • Response latency (P99): latency-sensitive APIs (real-time, gaming). Target: scale when P99 exceeds the SLA target (e.g., 200ms).
  • Memory utilization: memory-bound workloads (in-process caches, JVM heaps). Target: scale out at 80%, leaving headroom for GC.

Scaling Policies

  • Reactive scaling responds to current metrics: "If CPU > 70% for 3 minutes, add 2 instances." Fast to implement but always lags behind demand.
  • Predictive scaling uses historical patterns: "Every Monday at 9 AM, traffic increases 5x, so pre-scale at 8:55 AM." AWS Auto Scaling and GCP Autoscaler support this natively.
  • Scheduled scaling uses cron-like rules: "Set min instances to 20 during business hours, 5 at night." Useful for known traffic patterns (e-commerce sales events, streaming premieres).
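A reactive policy can be sketched as a small decision function with the hysteresis and cool-down described above. The thresholds (70%/30%), step sizes, and 5-minute cool-down mirror the examples in this module and are illustrative, not values from any particular cloud provider:

```python
import time

class ReactiveScaler:
    """Illustrative reactive policy: scale out fast, scale in slowly."""

    def __init__(self, min_inst=2, max_inst=20, cooldown_s=300):
        self.min, self.max, self.cooldown = min_inst, max_inst, cooldown_s
        self.last_scale_in = 0.0

    def desired(self, current: int, cpu_pct: float, now=None) -> int:
        now = time.time() if now is None else now
        if cpu_pct > 70:
            # Scale out aggressively: add 2 instances, capped at the maximum
            return min(current + 2, self.max)
        if cpu_pct < 30 and now - self.last_scale_in > self.cooldown:
            # Scale in conservatively, and only after the cool-down,
            # to avoid oscillating between adding and removing capacity
            self.last_scale_in = now
            return max(current - 1, self.min)
        return current
```

The asymmetry is deliberate: under-provisioning hurts users immediately, while over-provisioning only costs money, so scaling out is eager and scaling in is cautious.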
The Scale-In Trap

Scaling out is easy. Scaling in (removing instances) is dangerous. If you remove an instance that's processing requests, those connections are severed. Solutions: connection draining (finish in-flight requests before shutdown), deregistration delay (wait 30 seconds after removing from LB before terminating), and cool-down periods (don't scale in again for 5 minutes to avoid oscillation).

Explore: Auto-Scaling & Containers

See how Kubernetes Horizontal Pod Autoscaler and cloud provider auto-scaling groups work.


Service Discovery

When you add or remove servers dynamically (auto-scaling), how does the rest of the system know where to send traffic? Hardcoding IP addresses doesn't work when instances are ephemeral. Service Discovery solves this by maintaining a registry of available service instances.

Client-Side Discovery

The client queries a service registry (e.g., Consul, etcd, ZooKeeper) to get a list of healthy instances, then load-balances across them locally.

Used by: Netflix Eureka, Spring Cloud LoadBalancer
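A minimal sketch of the client-side pattern, assuming a hypothetical fetch_instances callback that queries the registry (in Consul this would be an HTTP call to the catalog API). The class name and refresh strategy are ours:

```python
import itertools

class ClientSideBalancer:
    """Round-robins locally across instances fetched from a registry."""

    def __init__(self, fetch_instances):
        self.fetch_instances = fetch_instances  # e.g., a Consul/etcd lookup
        self.rr = None
        self.refresh()

    def refresh(self):
        # Re-query the registry; real clients do this on a timer
        # or subscribe to registry change notifications ("watches")
        self.instances = self.fetch_instances()
        self.rr = itertools.cycle(self.instances)

    def pick(self) -> str:
        # Local load balancing: no extra network hop per request
        return next(self.rr)
```

The tradeoff versus server-side discovery: the client avoids a proxy hop, but every client now embeds discovery logic and must handle stale instance lists.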

Server-Side Discovery

The client sends traffic to a load balancer, which queries the registry and routes to a healthy instance. The client doesn't need to know about the registry.

Used by: AWS ALB + ECS, Kubernetes Services, Consul + Envoy

DNS-Based Discovery

Services register as DNS records. Clients resolve the service name to a set of IP addresses. Simple but limited — DNS TTL caching causes delayed updates.

Used by: Kubernetes DNS (CoreDNS), AWS Route 53 Service Discovery

Service Mesh (Sidecar)

A sidecar proxy (Envoy, Linkerd) handles discovery, load balancing, retries, and observability transparently. The application doesn't know about any of it.

Used by: Istio, Linkerd, Consul Connect, AWS App Mesh

Explore: Discovery & Networking

Understand how services find each other in dynamic environments, and how Kubernetes networking connects pods.


Health Checks & Graceful Shutdown

In a fleet of servers, instances fail constantly — hardware dies, processes crash, deployments roll out. The system must detect and react to failures automatically, without human intervention.

Health Check Types

  • Liveness probe: "Is the process alive?" A simple HTTP 200 response. If this fails, restart the instance. Example: GET /healthz → 200 OK.
  • Readiness probe: "Can this instance serve traffic?" Checks database connectivity, cache availability, and warm-up status. If this fails, don't send traffic to this instance, but don't restart it.
  • Startup probe: "Has the app finished starting?" Important for JVM applications with 30-60 second startup times. Prevents the liveness probe from killing the app before it's ready.
# Kubernetes health check configuration
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3   # 3 failures before restart
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5

Graceful Shutdown

When an instance is removed (scaling in, deploying a new version), it must shut down gracefully:

  1. Stop accepting new connections — deregister from the load balancer.
  2. Drain in-flight requests — wait for active requests to complete (with a timeout, e.g., 30 seconds).
  3. Close resources — close database connections, flush logs, commit offsets.
  4. Exit — terminate the process with exit code 0.

In Kubernetes, this is orchestrated by the SIGTERM signal followed by a configurable terminationGracePeriodSeconds (default: 30s). If the pod hasn't exited by then, it receives SIGKILL.

Try It: Kubernetes Simulators

Simulate rolling deployments, pod eviction, and see how Kubernetes orchestrates zero-downtime updates.


Load Balancing Algorithms Deep Dive

Choosing the right LB algorithm determines how evenly traffic is distributed. No single algorithm is best for all workloads.

  • Round Robin: requests go to server 1, 2, 3, 1, 2, 3... in order. Best for identical servers with uniform request cost.
  • Weighted Round Robin: proportional distribution (e.g., a 3:1 ratio for fast:slow servers). Best for mixed hardware and canary deploys (10% to the new version).
  • Least Connections: route to the server with the fewest active connections. Best for variable request durations (WebSockets, uploads).
  • Least Response Time: route to the server with the fastest average response time. Best for latency-sensitive APIs.
  • IP Hash: hash the client's IP to deterministically pick a server. Best for cheap sticky sessions without cookies.
  • Power of Two Random Choices: pick 2 servers at random and route to the one with fewer connections. Best for large-scale systems, since it avoids a "herd" rushing to the single least-loaded server.
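Power of Two Random Choices is simple enough to sketch in a few lines. The simulation below is illustrative; a real load balancer tracks live connection counts that rise and fall, not cumulative totals:

```python
import random

def power_of_two_choices(connections: dict[str, int]) -> str:
    """Pick 2 servers at random; route to the one with fewer connections."""
    a, b = random.sample(list(connections), 2)
    return a if connections[a] <= connections[b] else b

# Simulate 10,000 requests across 5 servers, treating the running
# request count as the "connection" count
counts = {f"s{i}": 0 for i in range(5)}
for _ in range(10_000):
    counts[power_of_two_choices(counts)] += 1
```

Despite sampling only two servers per request, the resulting distribution is nearly perfectly even, which is the surprising result that makes this algorithm popular at scale.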

L4 vs L7 Load Balancing

Layer 4 (Transport) LBs route based on IP:port — extremely fast, no HTTP parsing. Layer 7 (Application) LBs inspect HTTP headers, URL paths, and cookies — enabling content-based routing (/api to backend servers, /static to CDN). Use L4 for raw throughput, L7 for intelligent routing. Most cloud load balancers (AWS ALB, GCP HTTPS LB) are L7.

Try It: Load Balancing Simulator

Visualize Round Robin, Least Connections, and Weighted algorithms distributing requests in real-time.


Deployment Strategies

Horizontal scaling changes how you deploy. You can no longer "stop the server, update, restart." You need zero-downtime deployment strategies:

Rolling Update

Replace instances one at a time. At any point, the fleet runs a mix of old and new versions. Simple, but causes brief version inconsistency. Kubernetes uses the RollingUpdate strategy by default; set maxSurge: 1, maxUnavailable: 0 for zero-downtime rollouts.

Blue-Green Deployment

Run two identical environments: "Blue" (current) and "Green" (new). Deploy to Green, test it, then switch the load balancer to point to Green. Instant rollback = switch back to Blue. Requires 2x infrastructure during the switch.

Canary Deployment

Route a small percentage (1-5%) of traffic to the new version. Monitor error rates and latency. If healthy, gradually increase to 10% → 50% → 100%. If errors spike, roll back instantly. Used by Google, Netflix, and Amazon for every deploy.
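Hash-based bucketing is one common way to implement the traffic split deterministically, so a given user consistently sees the same version across requests. A sketch (the bucketing scheme here is illustrative, not any specific vendor's):

```python
import hashlib

def canary_bucket(user_id: str, canary_pct: int) -> str:
    """Deterministically route a stable percentage of users to the canary."""
    # Hash the user ID so the same user always lands in the same bucket;
    # ramping 1% -> 5% -> 50% only ever adds users, never flip-flops them
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return "canary" if bucket < canary_pct else "stable"
```

Ramping the rollout is then just a config change to canary_pct, and rollback is setting it to 0; no user bounces between versions mid-session.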

Feature Flags

Deploy new code to 100% of servers but activate it for only selected users via configuration. Separates deployment (code running in production) from release (users seeing the feature). Tools: LaunchDarkly, Unleash, Flagsmith.


Externalizing Everything Else

File Uploads → Object Storage

If users upload files (images, documents, videos), storing them on the local filesystem breaks with horizontal scaling — Server 1 has the file, but Server 2 doesn't. The solution: object storage (S3, GCS, Azure Blob). Upload directly to S3 using pre-signed URLs — the app server never touches the file, reducing CPU and memory load.

// Pre-signed URL flow: app gives client a signed URL to upload directly to S3
1. Client → App Server: "I want to upload a file"
2. App Server → S3: generatePresignedUrl("uploads/img-123.jpg", expiry=5min)
3. App Server → Client: "Upload here: https://s3.../uploads/img-123.jpg?sig=..."
4. Client → S3: PUT file directly (bypasses app server entirely)
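The idea behind pre-signing is that the app server vouches for a specific path and expiry with a signature the storage service can verify. The sketch below uses a generic HMAC scheme to illustrate the concept; real S3 pre-signed URLs use AWS Signature Version 4, typically produced by an SDK call such as boto3's generate_presigned_url, and the hostname below is a placeholder:

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"server-side-secret"  # shared with storage, never sent to clients

def presign(path: str, expires_in_s: int = 300) -> str:
    # Sign the path + expiry so the storage service can verify the grant
    expires = int(time.time()) + expires_in_s
    msg = f"{path}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return (f"https://storage.example.com{path}?"
            + urlencode({"expires": expires, "sig": sig}))

def verify(path: str, expires: int, sig: str) -> bool:
    # The storage service recomputes and compares the signature
    if expires < time.time():
        return False  # link expired
    msg = f"{path}:{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

Because the grant is scoped to one path and a short expiry, leaking a pre-signed URL exposes only that single object for a few minutes, not the whole bucket.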

WebSocket Scaling

WebSocket connections are long-lived and stateful by nature — a user connects to Server 1, and all messages must flow through Server 1. This contradicts stateless scaling. Solutions:

  • Redis Pub/Sub for fan-out: Each server subscribes to relevant Redis channels. When a message arrives for a user on Server 1, it's published to Redis, and Server 1's subscription delivers it. All servers can receive messages for any user.
  • Sticky connections with session awareness: Use L7 LB cookie-based routing to keep WebSocket connections on the same server, combined with a Redis-backed mapping of user → server for targeted message delivery.
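The Pub/Sub fan-out pattern can be illustrated with an in-process stand-in for Redis. The Broker class below is a toy: in production, each gateway server would hold a real Redis subscription, and publishes would cross the network:

```python
from collections import defaultdict

class Broker:
    """In-process stand-in for Redis Pub/Sub, to illustrate fan-out."""

    def __init__(self):
        self.subs = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subs[channel].append(callback)

    def publish(self, channel, message):
        # Deliver to every subscriber of the channel
        for cb in self.subs[channel]:
            cb(message)

class GatewayServer:
    """Each app server subscribes to channels for its connected users."""

    def __init__(self, name, broker):
        self.name, self.delivered = name, []
        self.broker = broker

    def connect(self, user_id):
        # One channel per user; only the server holding the
        # WebSocket subscribes, so messages reach the right socket
        self.broker.subscribe(f"user:{user_id}", self.delivered.append)

broker = Broker()
s1, s2 = GatewayServer("s1", broker), GatewayServer("s2", broker)
s1.connect("alice")  # alice's WebSocket lives on s1
s2.connect("bob")
# Any server can publish; the broker routes to the server holding the socket
broker.publish("user:alice", "hello alice")
```

The key property: the publishing server never needs to know which gateway holds the user's connection, so gateways can be added and removed freely.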

Explore: Real-Time Systems

See how WebSockets, Server-Sent Events, and real-time communication protocols work under the hood.


Lessons from the Trenches

Case Study: The Billion-Dollar Health Check Bug

A major e-commerce platform had an outage where their load balancer's health checks were too aggressive. When a few nodes became slow under load, the LB marked them "Offline" and moved their traffic to remaining nodes. The remaining nodes, now overloaded, also became slow and were marked offline. Within 60 seconds, the entire 500-node cluster was "Offline" despite the hardware being perfectly healthy. The LB had created a cascading failure by being too eager to remove slow nodes.

Takeaway: Health checks need flapping protection (minimum 3 consecutive failures before removal), a minimum healthy percentage (never remove more than 30% of the fleet), and hysteresis (different thresholds for marking down vs. marking up).
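These protections can be sketched as a small health tracker. The down threshold (3 consecutive failures) and the 70% healthy floor mirror the takeaway above; the up threshold of 2 is an illustrative assumption:

```python
class HealthTracker:
    """Flapping protection with hysteresis: slow to mark a node down,
    cautious to mark it back up."""

    def __init__(self, down_after=3, up_after=2):
        self.down_after, self.up_after = down_after, up_after
        self.fails = self.oks = 0
        self.healthy = True

    def record(self, ok: bool) -> bool:
        if ok:
            self.oks += 1
            self.fails = 0
            # Hysteresis: a different (stricter) threshold for marking up
            if not self.healthy and self.oks >= self.up_after:
                self.healthy = True
        else:
            self.fails += 1
            self.oks = 0
            # Flapping protection: require consecutive failures
            if self.healthy and self.fails >= self.down_after:
                self.healthy = False
        return self.healthy

def safe_to_remove(fleet_size: int, unhealthy: int, min_healthy_pct=70) -> bool:
    # Minimum healthy percentage: never evict nodes if doing so
    # drops the fleet below the healthy floor
    return (fleet_size - unhealthy) / fleet_size * 100 >= min_healthy_pct
```

In the outage above, a floor like safe_to_remove would have stopped the cascade: once 30% of the 500 nodes were marked down, the balancer would have kept routing to "slow" nodes instead of concentrating all traffic on a shrinking remainder.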

Case Study: Netflix's Chaos Monkey

Netflix deliberately kills random production instances during business hours using Chaos Monkey. The philosophy: if your system can't handle a server dying at 2 PM on a Tuesday, it certainly can't handle one dying at 2 AM on a Saturday. This forced every team to build truly stateless, fault-tolerant services. Any service that couldn't survive random instance termination was redesigned. The result: Netflix can survive an entire AWS Availability Zone going offline.

Takeaway: Don't wait for failures to test your scaling assumptions. Kill your servers in production regularly. If your system breaks, better to find out during business hours with engineers at their desks.

Case Study: Discord's 5M Concurrent Connections

Discord handles over 5 million concurrent WebSocket connections using a combination of consistent hashing (user → gateway server mapping), Redis Pub/Sub (cross-server message delivery), and Elixir/Erlang (millions of lightweight processes per node). Each gateway server handles ~1M connections. When a server is removed, its connections are rehashed and reconnected to other servers within seconds.

Takeaway: Stateful protocols (WebSockets) can scale horizontally by combining consistent hashing with a message bus. The key is making reconnection fast and transparent to users.


Further Reading & Citations
  • The Art of Scalability by Martin Abbott & Michael Fisher — The Scale Cube (X/Y/Z axes) defines the three dimensions of scaling. (Addison-Wesley, 2015)
  • Designing Data-Intensive Applications by Martin Kleppmann — Chapter 6 (Partitioning) and Chapter 8 (Trouble with Distributed Systems). (O'Reilly, 2017)
  • The Netflix Simian Army — Netflix Tech Blog — How Chaos Monkey, Chaos Kong, and Latency Monkey validate resilience.
  • How Discord Stores Trillions of Messages — Scaling real-time systems to millions of concurrent connections.
  • Kubernetes HPA Documentation — Official guide to Horizontal Pod Autoscaler, including custom metrics.
  • Canary Release — Martin Fowler — The original description of canary deployment strategy.
  • Stateless Auth for Stateful Minds — Auth0 — JWT-based session management patterns and tradeoffs.
  • Release It! by Michael Nygard — Stability patterns including circuit breakers, bulkheads, and timeouts for production scaling. (Pragmatic Bookshelf, 2018)

All Hands-on Resources

Reinforce these concepts with interactive simulators and visual deep-dives.

What's Next?

Load Balancing

Go deeper into load balancing algorithms, health probes, TLS termination, and the infrastructure that keeps your fleet alive.
