Scaling Out
Breaking the vertical ceiling and distributing traffic across clusters.
The "Scale Cube" (from The Art of Scalability by Abbott & Fisher) provides a mental model for three dimensions of scaling:
X-Axis: Cloning
Run N identical copies of your application behind a load balancer. Every instance handles every request type. Simplest form of scaling — this module's focus.
Y-Axis: Decomposition
Split by function into microservices. The payments service scales independently from the search service. Covered in the Strangler Fig section of Module 1.
Z-Axis: Sharding
Split by data. Users A-M go to Shard 1, N-Z to Shard 2. Each shard runs the same code but owns a subset of data. Covered in Module 7 (Database Scaling).
X-axis scaling (this module) is the prerequisite for everything else. If your application can't run as multiple identical clones, you can't decompose it into microservices or shard its data. The key requirement: your application must be stateless.
Horizontal scaling only works if your application servers are stateless — meaning no server holds any user-specific data in local memory. If a load balancer sends Request 1 from User A to Server 1 and Request 2 to Server 3, both servers must be able to serve User A identically.
What "state" do applications commonly hold in memory?
- Session data: Login status, shopping cart, form wizard progress
- File uploads: Temporary files saved to local disk
- In-process caches: Application-level data caches (handled in Module 3)
- WebSocket connections: Long-lived connections bound to a specific server
- Background job state: In-progress job tracking stored in memory
Each of these must be externalized to a shared store to enable stateless scaling.
Session Management Strategies
The most common stateful element is the user session. Here are three strategies for externalizing it, each with distinct tradeoffs:
1. Sticky Sessions (Session Affinity)
The load balancer remembers which server a user was routed to (via a cookie or IP hash) and sends all subsequent requests to the same server. Not truly stateless — more of a "patch" that avoids the real problem.
- Zero code changes required
- Works with any session implementation
- Server death = session loss for all its users
- Uneven load distribution (some servers get "heavy" users)
- Prevents auto-scaling (can't remove a server without killing sessions)
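The affinity mechanism itself is tiny. A minimal sketch of IP-hash routing, one of the two mechanisms mentioned above (server names are illustrative):

```python
import hashlib

def pick_server(client_ip: str, servers: list[str]) -> str:
    """Deterministically map a client IP to one server (IP-hash affinity)."""
    # Use a stable hash: Python's built-in hash() is salted per process,
    # so different LB nodes would disagree on the mapping.
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

servers = ["app-1", "app-2", "app-3"]
# The same client always lands on the same server...
assert pick_server("203.0.113.7", servers) == pick_server("203.0.113.7", servers)
```

Note the downside baked into the modulo: adding or removing a server changes `len(servers)` and reshuffles most users' affinity, which is exactly why sticky sessions fight auto-scaling.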
2. Centralized Session Store (Redis)
Move all session data to a fast, shared external store like Redis. Every application server reads from and writes to the same Redis cluster. The session is fully decoupled from the server.
- Truly stateless servers — any server can handle any request
- Server death doesn't affect user sessions
- Enables autoscaling, rolling deploys
- Redis becomes a dependency (must be HA)
- ~0.5ms latency per session lookup
- Redis failure = all sessions lost (unless persistent)
3. Client-Side Tokens (JWT)
Store the session data in the user's browser as a JSON Web Token. The server doesn't store anything — it just validates the cryptographic signature on each request. The token contains the user ID, roles, and expiry.
- Zero server-side storage — ultimate statelessness
- Works across services (microservice auth)
- No external dependency (no Redis needed)
- Can't revoke tokens before expiry (without a blocklist)
- Payload visible to the client (never put secrets in a JWT)
- Larger HTTP headers (~1KB per request)
Explore: Session & Auth
See how JWT tokens are created, verified, and how OAuth/OIDC flows delegate authentication to identity providers.
Auto-Scaling
Running a fixed number of servers wastes money during low traffic and underserves during peaks. Auto-scaling dynamically adjusts the number of instances based on demand.
Scaling Metrics
What metric should trigger scaling? The answer depends on your bottleneck:
| Metric | When to Use | Target |
|---|---|---|
| CPU Utilization | Compute-bound workloads (image processing, ML inference) | Scale out at 70%, scale in at 30% |
| Request count (RPS) | Web APIs with predictable per-request cost | N requests per instance (e.g., 500 RPS/instance) |
| Queue depth | Worker/consumer services processing from a queue | Scale when queue length > threshold for > 2 min |
| Response latency (P99) | Latency-sensitive APIs (real-time, gaming) | Scale when P99 > SLA target (e.g., 200ms) |
| Memory utilization | Memory-bound workloads (in-process caches, JVM heaps) | Scale out at 80% — leave headroom for GC |
Scaling Policies
- Reactive scaling responds to current metrics: "If CPU > 70% for 3 minutes, add 2 instances." Fast to implement but always lags behind demand.
- Predictive scaling uses historical patterns: "Every Monday at 9 AM, traffic increases 5x, so pre-scale at 8:55 AM." AWS Auto Scaling and GCP Autoscaler support this natively.
- Scheduled scaling uses cron-like rules: "Set min instances to 20 during business hours, 5 at night." Useful for known traffic patterns (e-commerce sales events, streaming premieres).
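A reactive policy with a cool-down can be sketched like this (the 70/30 thresholds, step sizes, and 300-second cool-down are illustrative):

```python
class ReactiveScaler:
    """Reactive policy: scale out at 70% CPU, in at 30%, with a cool-down
    between changes so the fleet doesn't oscillate."""

    def __init__(self, min_n: int = 2, max_n: int = 20, cooldown: float = 300):
        self.n = min_n
        self.min_n, self.max_n = min_n, max_n
        self.cooldown = cooldown
        self.last_change = 0.0

    def evaluate(self, cpu_pct: float, now: float) -> int:
        if now - self.last_change < self.cooldown:
            return self.n  # still in cool-down: no change either direction
        if cpu_pct > 70 and self.n < self.max_n:
            self.n += 2            # scale out aggressively
            self.last_change = now
        elif cpu_pct < 30 and self.n > self.min_n:
            self.n -= 1            # scale in cautiously, one at a time
            self.last_change = now
        return self.n

scaler = ReactiveScaler()
assert scaler.evaluate(85, now=1000) == 4   # burst: 2 -> 4
assert scaler.evaluate(85, now=1060) == 4   # within cool-down: no change
assert scaler.evaluate(20, now=1400) == 3   # quiet: drain one instance
```

The asymmetry (add two, remove one) reflects the reactive lag mentioned above: under-provisioning hurts users immediately, while over-provisioning only costs money.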
Scaling out is easy. Scaling in (removing instances) is dangerous. If you remove an instance that's processing requests, those connections are severed. Solutions: connection draining (finish in-flight requests before shutdown), deregistration delay (wait 30 seconds after removing from LB before terminating), and cool-down periods (don't scale in again for 5 minutes to avoid oscillation).
Explore: Auto-Scaling & Containers
See how Kubernetes Horizontal Pod Autoscaler and cloud provider auto-scaling groups work.
Service Discovery
When you add or remove servers dynamically (auto-scaling), how does the rest of the system know where to send traffic? Hardcoding IP addresses doesn't work when instances are ephemeral. Service Discovery solves this by maintaining a registry of available service instances.
Client-Side Discovery
The client queries a service registry (e.g., Consul, etcd, ZooKeeper) to get a list of healthy instances, then load-balances across them locally.
Used by: Netflix Eureka, Spring Cloud LoadBalancer
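A sketch of the client side, with a plain dict standing in for the registry (a real client would poll or watch Consul/etcd for membership changes; service name and addresses are illustrative):

```python
import itertools

# A dict stands in for a service registry like Consul or etcd.
registry = {
    "payments": ["10.0.0.5:8080", "10.0.0.6:8080", "10.0.0.7:8080"],
}

class DiscoveryClient:
    """Client-side discovery: fetch the instance list once, then
    load-balance locally with round-robin."""

    def __init__(self, service: str):
        self.service = service
        # In practice this list is refreshed when the registry changes.
        self._rr = itertools.cycle(registry[service])

    def next_instance(self) -> str:
        return next(self._rr)

client = DiscoveryClient("payments")
assert client.next_instance() == "10.0.0.5:8080"
assert client.next_instance() == "10.0.0.6:8080"
```

The tradeoff is visible here: the client gains full control over balancing, but every client now embeds registry-aware logic — the complexity server-side discovery moves into the load balancer.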
Server-Side Discovery
The client sends traffic to a load balancer, which queries the registry and routes to a healthy instance. The client doesn't need to know about the registry.
Used by: AWS ALB + ECS, Kubernetes Services, Consul + Envoy
DNS-Based Discovery
Services register as DNS records. Clients resolve the service name to a set of IP addresses. Simple but limited — DNS TTL caching causes delayed updates.
Used by: Kubernetes DNS (CoreDNS), AWS Route 53 Service Discovery
Service Mesh (Sidecar)
A sidecar proxy (Envoy, Linkerd) handles discovery, load balancing, retries, and observability transparently. The application doesn't know about any of it.
Used by: Istio, Linkerd, Consul Connect, AWS App Mesh
Explore: Discovery & Networking
Understand how services find each other in dynamic environments, and how Kubernetes networking connects pods.
Health Checks & Self-Healing
In a fleet of servers, instances fail constantly — hardware dies, processes crash, deployments roll out. The system must detect and react to failures automatically, without human intervention.
Health Check Types
- Liveness probe: "Is the process alive?" A simple HTTP 200 response (e.g., GET /healthz → 200 OK). If this fails, restart the instance.
- Readiness probe: "Can this instance serve traffic?" Checks database connectivity, cache availability, and warm-up status. If this fails, don't send traffic to this instance, but don't restart it.
- Startup probe: "Has the app finished starting?" Important for JVM applications with 30-60 second startup times. Prevents the liveness probe from killing the app before it's ready.
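The liveness/readiness distinction in miniature (the 200/503 status codes follow the usual convention; the dependency checks are illustrative):

```python
def liveness() -> int:
    """Liveness: can the process respond at all? Keep it dependency-free --
    checking the database here would restart the app on every DB blip."""
    return 200

def readiness(db_ok: bool, cache_ok: bool, warmed_up: bool) -> int:
    """Readiness: is it safe to send traffic? Failing this removes the
    instance from the load balancer but does NOT restart the process."""
    return 200 if (db_ok and cache_ok and warmed_up) else 503

assert liveness() == 200
assert readiness(db_ok=True, cache_ok=True, warmed_up=True) == 200
# DB down: stop routing traffic here, but don't kill the process.
assert readiness(db_ok=True, cache_ok=False, warmed_up=True) == 503
```

Wiring database checks into the liveness probe is a classic mistake: a transient dependency failure then triggers a fleet-wide restart storm instead of a quiet traffic drain.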
Graceful Shutdown
When an instance is removed (scaling in, deploying a new version), it must shut down gracefully:
- Stop accepting new connections — deregister from the load balancer.
- Drain in-flight requests — wait for active requests to complete (with a timeout, e.g., 30 seconds).
- Close resources — close database connections, flush logs, commit offsets.
- Exit — terminate the process with exit code 0.
In Kubernetes, this is orchestrated by the SIGTERM signal followed by a
configurable terminationGracePeriodSeconds (default: 30s). If the pod hasn't
exited by then, it receives SIGKILL.
Try It: Kubernetes Simulators
Simulate rolling deployments, pod eviction, and see how Kubernetes orchestrates zero-downtime updates.
Load Balancing Algorithms
Choosing the right LB algorithm determines how evenly traffic is distributed. No single algorithm is best for all workloads.
| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Requests go to server 1, 2, 3, 1, 2, 3... in order | Identical servers, uniform request cost |
| Weighted Round Robin | Proportional distribution (e.g., 3:1 ratio for fast:slow servers) | Mixed hardware, canary deploys (10% to new version) |
| Least Connections | Route to the server with the fewest active connections | Variable request durations (WebSockets, uploads) |
| Least Response Time | Route to the server with the fastest average response time | Latency-sensitive APIs |
| IP Hash | Hash the client's IP to deterministically pick a server | Cheap sticky sessions without cookies |
| Power of Two Random | Pick 2 servers at random, route to the one with fewer connections | Large-scale systems (avoids "herd" to the least-loaded) |
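Power-of-two-choices from the table above is short enough to sketch directly (connection counts are illustrative):

```python
import random

def power_of_two(conns: dict[str, int], rng=random) -> str:
    """Pick two servers at random, route to the one with fewer active
    connections -- near-optimal balance without scanning the whole fleet."""
    a, b = rng.sample(list(conns), 2)
    return a if conns[a] <= conns[b] else b

conns = {"s1": 12, "s2": 3, "s3": 40, "s4": 7}
choice = power_of_two(conns)
assert choice in conns
# The chosen server is never the more loaded of the sampled pair, so the
# hot spot ("s3") is avoided without any global coordination.
```

This is why it scales: pure "least connections" requires every LB node to track the whole fleet's state, while sampling two avoids both that cost and the thundering herd toward whichever server briefly reports the lowest count.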
L4 vs L7 Load Balancing
Layer 4 (Transport) LBs route based on IP:port — extremely fast, no HTTP parsing. Layer 7
(Application) LBs inspect HTTP headers, URL paths, and cookies — enabling content-based
routing (/api to backend servers, /static to CDN). Use L4 for raw throughput,
L7 for intelligent routing. Most cloud load balancers (AWS ALB, GCP HTTPS LB) are L7.
Try It: Load Balancing Simulator
Visualize Round Robin, Least Connections, and Weighted algorithms distributing requests in real-time.
Zero-Downtime Deployments
Horizontal scaling changes how you deploy. You can no longer "stop the server, update, restart." You need zero-downtime deployment strategies:
Rolling Update
Replace instances one at a time. At any point, the fleet has a mix of old and new
versions. Simple but causes brief inconsistency. Kubernetes default via RollingUpdate
strategy with maxSurge: 1, maxUnavailable: 0.
Blue-Green Deployment
Run two identical environments: "Blue" (current) and "Green" (new). Deploy to Green, test it, then switch the load balancer to point to Green. Instant rollback = switch back to Blue. Requires 2x infrastructure during the switch.
Canary Deployment
Route a small percentage (1-5%) of traffic to the new version. Monitor error rates and latency. If healthy, gradually increase to 10% → 50% → 100%. If errors spike, roll back instantly. Standard practice at large operators such as Google, Netflix, and Amazon.
Feature Flags
Deploy new code to 100% of servers but activate it for only selected users via configuration. Separates deployment (code running in production) from release (users seeing the feature). Tools: LaunchDarkly, Unleash, Flagsmith.
File Uploads → Object Storage
If users upload files (images, documents, videos), storing them on the local filesystem breaks with horizontal scaling — Server 1 has the file, but Server 2 doesn't. The solution: object storage (S3, GCS, Azure Blob). Upload directly to S3 using pre-signed URLs — the app server never touches the file, reducing CPU and memory load.
WebSocket Scaling
WebSocket connections are long-lived and stateful by nature — a user connects to Server 1, and all messages must flow through Server 1. This contradicts stateless scaling. Solutions:
- Redis Pub/Sub for fan-out: Each server subscribes to relevant Redis channels. When a message arrives for a user on Server 1, it's published to Redis, and Server 1's subscription delivers it. All servers can receive messages for any user.
- Sticky connections with session awareness: Use L7 LB cookie-based routing to keep WebSocket connections on the same server, combined with a Redis-backed mapping of user → server for targeted message delivery.
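The fan-out pattern in miniature, with an in-process bus standing in for Redis Pub/Sub (with redis-py, each server would call `pubsub.subscribe(channel)` and `publish(channel, message)`; channel names are illustrative):

```python
from collections import defaultdict

# An in-process bus stands in for Redis Pub/Sub so this sketch runs alone.
subscribers = defaultdict(list)  # channel -> list of delivery callbacks

def subscribe(channel: str, callback) -> None:
    subscribers[channel].append(callback)

def publish(channel: str, message: str) -> None:
    for cb in subscribers[channel]:
        cb(message)

# Server 1 holds the WebSocket for user-42, so only it subscribes.
delivered = []
subscribe("user:42", lambda msg: delivered.append(("server-1", msg)))

# Server 2 handles an HTTP request addressed to user-42 and publishes;
# the bus routes the message to whichever server holds the socket.
publish("user:42", "new direct message")
assert delivered == [("server-1", "new direct message")]
```

The app servers stay interchangeable for HTTP traffic: any of them can accept the "send a message" request, because delivery to the right socket is the bus's job, not the receiving server's.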
Explore: Real-Time Systems
See how WebSockets, Server-Sent Events, and real-time communication protocols work under the hood.
Case Study: The Billion-Dollar Health Check Bug
A major e-commerce platform had an outage where their load balancer's health checks were too aggressive. When a few nodes became slow under load, the LB marked them "Offline" and moved their traffic to remaining nodes. The remaining nodes, now overloaded, also became slow and were marked offline. Within 60 seconds, the entire 500-node cluster was "Offline" despite the hardware being perfectly healthy. The LB had created a cascading failure by being too eager to remove slow nodes.
Takeaway: Health checks need flapping protection (minimum 3 consecutive failures before removal), a minimum healthy percentage (never remove more than 30% of the fleet), and hysteresis (different thresholds for marking down vs. marking up).
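The flapping-protection and hysteresis guards can be sketched as a small per-node health tracker (the 3-down/2-up thresholds are illustrative; a fleet-level minimum-healthy-percentage guard would sit alongside this):

```python
class HealthTracker:
    """Flapping protection with hysteresis: it takes 3 consecutive failed
    checks to mark a node down, and 2 consecutive passes to mark it up."""
    DOWN_AFTER = 3
    UP_AFTER = 2

    def __init__(self):
        self.healthy = True
        self.fails = 0
        self.oks = 0

    def record(self, check_passed: bool) -> bool:
        if check_passed:
            self.fails, self.oks = 0, self.oks + 1
            if not self.healthy and self.oks >= self.UP_AFTER:
                self.healthy = True
        else:
            self.oks, self.fails = 0, self.fails + 1
            if self.healthy and self.fails >= self.DOWN_AFTER:
                self.healthy = False
        return self.healthy

node = HealthTracker()
assert node.record(False)      # one slow response: still in rotation
assert node.record(False)      # two: still in rotation (no flapping)
assert not node.record(False)  # three consecutive: now marked down
```

The asymmetric thresholds are the hysteresis: a node that is merely slow under load survives a couple of bad checks, and a node that was removed must prove itself more than once before taking traffic again.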
Case Study: Netflix's Chaos Monkey
Netflix deliberately kills random production instances during business hours using Chaos Monkey. The philosophy: if your system can't handle a server dying at 2 PM on a Tuesday, it certainly can't handle one dying at 2 AM on a Saturday. This forced every team to build truly stateless, fault-tolerant services. Any service that couldn't survive random instance termination was redesigned. The result: Netflix can survive an entire AWS Availability Zone going offline.
Takeaway: Don't wait for failures to test your scaling assumptions. Kill your servers in production regularly. If your system breaks, better to find out during business hours with engineers at their desks.
Case Study: Discord's 5M Concurrent Connections
Discord handles over 5 million concurrent WebSocket connections using a combination of consistent hashing (user → gateway server mapping), Redis Pub/Sub (cross-server message delivery), and Elixir/Erlang (millions of lightweight processes per node). Each gateway server handles ~1M connections. When a server is removed, its connections are rehashed and reconnected to other servers within seconds.
Takeaway: Stateful protocols (WebSockets) can scale horizontally by combining consistent hashing with a message bus. The key is making reconnection fast and transparent to users.
- The Art of Scalability by Martin Abbott & Michael Fisher — The Scale Cube (X/Y/Z axes) defines the three dimensions of scaling. (Addison-Wesley, 2015)
- Designing Data-Intensive Applications by Martin Kleppmann — Chapter 6 (Partitioning) and Chapter 8 (Trouble with Distributed Systems). (O'Reilly, 2017)
- The Netflix Simian Army — Netflix Tech Blog — How Chaos Monkey, Chaos Kong, and Latency Monkey validate resilience.
- How Discord Stores Trillions of Messages — Scaling real-time systems to millions of concurrent connections.
- Kubernetes HPA Documentation — Official guide to Horizontal Pod Autoscaler, including custom metrics.
- Canary Release — Martin Fowler — The original description of canary deployment strategy.
- Stateless Auth for Stateful Minds — Auth0 — JWT-based session management patterns and tradeoffs.
- Release It! by Michael Nygard — Stability patterns including circuit breakers, bulkheads, and timeouts for production scaling. (Pragmatic Bookshelf, 2018)
All Hands-on Resources
Reinforce these concepts with interactive simulators and visual deep-dives.
What's Next?
Load Balancing
Go deeper into load balancing algorithms, health probes, TLS termination, and the infrastructure that keeps your fleet alive.