Orchestration & Resiliency
Containers, circuit breakers, and chaos engineering. Build systems that heal themselves.
A container packages your application with all its dependencies (runtime, libraries, config files) into a single, portable unit that runs identically everywhere — dev laptop, CI server, production cloud. Unlike VMs, containers share the host OS kernel, making them lightweight (startup in milliseconds, minimal overhead).
Images & Layers
A Docker image is built from a Dockerfile — each instruction (FROM, RUN, COPY) creates a read-only layer. Layers are cached and shared between images, so 10 containers based on the same image share the base layers. Only the writable layer (the container's filesystem changes) is unique.
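The caching behavior above is why instruction order matters. A minimal sketch of a Dockerfile that exploits it (the Node.js base image and file names are illustrative, not from this handbook):

```dockerfile
# Hypothetical Node.js service; each instruction creates one layer.
FROM node:20-slim            # base layers, shared by every image built FROM it
WORKDIR /app
COPY package*.json ./        # copy manifests first so the next layer caches well
RUN npm ci --omit=dev        # re-runs only when package*.json changes
COPY . .                     # source edits invalidate only this layer and below
CMD ["node", "server.js"]
```

Because dependency installation sits above the `COPY . .` of application source, day-to-day code changes reuse the cached install layer instead of re-downloading packages.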
Namespaces & cgroups
Linux namespaces provide isolation (PID, network, mount, user). Each container sees its own process tree, network stack, and filesystem. cgroups (control groups) limit resources: CPU, memory, disk I/O. Together, they create process-level isolation without a hypervisor.
Kubernetes (K8s) is a container orchestration platform that automates deployment, scaling, networking, and lifecycle management of containerized applications. Instead of manually running containers on servers, you declare the desired state, and K8s makes it reality.
Core Primitives
Pod
The smallest deployable unit. Contains one or more containers sharing network and storage. Pods are ephemeral — K8s replaces failed pods automatically.
Deployment
Manages a set of identical pods. Handles rolling updates, rollbacks, and scaling. "I want 5 replicas of my API server" → the Deployment controller ensures exactly 5 pods are always running.
Service
A stable network endpoint that load-balances across pods. Even when pods are replaced (new IPs), the Service IP and DNS name remain constant. Types: ClusterIP (internal), NodePort, LoadBalancer (external).
Ingress
Routes external HTTP/HTTPS traffic to Services based on hostnames and URL paths: api.example.com/v1 → API Service, example.com → Frontend Service. Handles TLS termination.
HPA (Horizontal Pod Autoscaler)
Automatically scales pods based on CPU, memory, or custom metrics. "Scale from 3 to 20 pods when average CPU > 70%." Integrates with Prometheus for custom metrics (queue depth, latency).
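These primitives are declared as YAML manifests. A minimal sketch of a Deployment ("5 replicas of my API server") plus the ClusterIP Service that load-balances across its pods; all names and the image are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 5                  # the controller keeps exactly 5 pods running
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: example/api:1.0
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: api-svc
spec:
  type: ClusterIP              # internal-only; LoadBalancer exposes it externally
  selector:
    app: api                   # routes to whichever pods carry this label
  ports:
    - port: 80
      targetPort: 8080
```

When a pod dies and is replaced with a new IP, the Service's label selector picks up the replacement automatically, so clients keep using the same stable DNS name.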
Explore: Kubernetes
See how Kubernetes creates pods, schedules them on nodes, and handles rolling deployments.
In a distributed system, failures are normal, not exceptional. Networks partition. Services crash. Dependencies become slow. Resiliency patterns ensure that these failures are contained and don't cascade into total system failure.
Circuit Breaker
The circuit breaker prevents a service from repeatedly calling a failing dependency. Like an electrical circuit breaker, it "trips" when failures exceed a threshold, and "opens" the circuit — returning errors immediately without attempting the call. After a timeout, it lets one "probe" request through to check if the dependency has recovered.
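The trip-open-probe lifecycle can be sketched in a few dozen lines. This is a minimal single-threaded illustration, not a production implementation; the thresholds are arbitrary:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch; thresholds are illustrative."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, let one probe request through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0      # probe (or normal call) succeeded
            self.opened_at = None  # close the circuit again
            return result
```

While open, callers get an immediate error instead of tying up threads waiting on a dead dependency; a real implementation would also need locking and per-endpoint state.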
Retry with Exponential Backoff
Transient failures (network blips, brief overloads) often resolve on their own. Retries handle these automatically, but naive retries can overwhelm a recovering service. Exponential backoff spaces retries exponentially: 100ms → 200ms → 400ms → 800ms. Adding jitter (randomness) prevents synchronized retry storms.
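A sketch of the backoff-with-jitter loop, assuming the "full jitter" variant (delay drawn uniformly between zero and the exponential cap); attempt counts and delays are illustrative:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry sketch: the delay cap doubles each attempt (100ms, 200ms,
    400ms, ...), with full jitter so clients don't retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # "full jitter" variant
```

In practice you would retry only on errors known to be transient (timeouts, 503s), never on client errors like 400, and only for idempotent operations.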
Bulkhead Pattern
Named after ship compartments that prevent flooding from sinking the entire vessel. The bulkhead pattern isolates different parts of your system so that a failure in one doesn't consume resources needed by others.
- Thread pool isolation: Each dependency gets its own thread pool. If the payment service is slow and exhausts its 20 threads, the user service still has its own 20 threads available.
- Connection pool isolation: Separate database connection pools for critical vs. non-critical queries. Analytics queries can't starve checkout queries.
- Process isolation: Run critical and non-critical workloads in separate containers/pods with independent resource limits.
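Thread-pool isolation, the first variant above, can be sketched with Python's standard library; the pool sizes and service names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# One pool per dependency: a slow payment service can exhaust its own
# 20 workers without starving calls to the user service.
payment_pool = ThreadPoolExecutor(max_workers=20, thread_name_prefix="payments")
user_pool = ThreadPoolExecutor(max_workers=20, thread_name_prefix="users")

def call_payments(fn, *args):
    return payment_pool.submit(fn, *args)  # queued behind payment work only

def call_users(fn, *args):
    return user_pool.submit(fn, *args)     # unaffected by a payment backlog
```

If every dependency shared one pool, 20 stuck payment calls would block user lookups too; with separate pools, the failure stays in its compartment.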
Rate Limiting
Protect your services from being overwhelmed — whether by legitimate traffic spikes or malicious abuse. Common algorithms:
| Algorithm | How It Works | Best For |
|---|---|---|
| Token Bucket | Tokens refill at a fixed rate. Each request consumes a token. No tokens = rejected. | API rate limiting (allows short bursts) |
| Leaky Bucket | Requests queue and drain at a fixed rate, like water from a bucket with a hole. | Smooth traffic shaping (no bursts) |
| Fixed Window | Count requests in fixed time windows (e.g., 100 req/min). Reset at window boundary. | Simple, but allows 2x burst at window edges |
| Sliding Window Log | Track timestamp of each request. Count requests in the last N seconds. | Most accurate, but memory-intensive |
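The token bucket from the table above fits in a few lines. A minimal in-process sketch (a distributed limiter would keep this state in something like Redis); rate and capacity are illustrative:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter sketch."""

    def __init__(self, rate=10.0, capacity=20):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill lazily based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend a token for this request
            return True
        return False          # bucket empty: reject the request
```

Because the bucket can hold up to `capacity` tokens, short bursts pass through, while sustained traffic is held to `rate` requests per second, exactly the trade-off the table describes.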
Try It: Resiliency Simulators
Watch circuit breakers trip, retries back off, and rate limiters reject traffic in real-time.
You can't fix what you can't see. Observability gives you insight into the internal state of your system through three complementary signals:
Metrics
Numeric measurements over time. Request rate, error rate, latency percentiles (P50, P95, P99), CPU/memory usage. Tools: Prometheus, Grafana, Datadog. USE method: Utilization, Saturation, Errors — for every resource.
Logs
Structured event records. Each request generates log entries with trace ID, timestamp, severity, and context. Tools: ELK (Elasticsearch, Logstash, Kibana), Loki, Fluentd. Always use structured JSON logs — not plain text.
Traces
Follow a single request across multiple services. A trace shows the complete journey: API Gateway → Auth Service → User Service → Database. Tools: Jaeger, Zipkin, OpenTelemetry, AWS X-Ray. Essential for debugging distributed systems.
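The structured-log advice above is easy to demonstrate. A sketch of emitting one JSON log line carrying a trace ID so logs can be joined with traces; the field names are illustrative, and real services would use a logging library or the OpenTelemetry SDK rather than bare `print`:

```python
import json
import time
import uuid

def log_event(severity, message, trace_id, **context):
    """Emit one structured JSON log line and return it."""
    entry = {
        "ts": time.time(),
        "severity": severity,
        "trace_id": trace_id,  # same ID flows through every service on the path
        "message": message,
        **context,
    }
    line = json.dumps(entry)
    print(line)  # stdout; a shipper (Fluentd, Loki) collects and indexes it
    return line

trace_id = str(uuid.uuid4())
log_event("INFO", "user lookup", trace_id, service="user-service", user_id=42)
```

Because every field is a key, log backends can filter on `trace_id` or `user_id` directly, which is what makes the cross-service correlation in the Traces section possible.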
Chaos engineering is the practice of deliberately injecting failures into production systems to test resiliency. The hypothesis: "If we kill this server / inject 500ms of latency / corrupt this response, our system should continue serving users without degradation."
- Chaos Monkey: Randomly terminates instances. Validates auto-healing and statelessness.
- Latency injection: Adds artificial delay to network calls. Tests timeout and circuit breaker configurations.
- Resource stress: Consume CPU, memory, or disk I/O. Tests auto-scaling and resource limits.
- DNS failure: Return NXDOMAIN for a dependency. Tests fallback behavior.
- Zone/region failure: Simulate an entire availability zone going down. Tests multi-AZ/multi-region failover.
Tools: Netflix Chaos Monkey, Gremlin, LitmusChaos, AWS Fault Injection Simulator.
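Latency injection, the second experiment above, is simple enough to sketch in-process. This hypothetical decorator is a toy stand-in for what tools like Gremlin do at the network layer; the probability and delay are illustrative:

```python
import random
import time
from functools import wraps

def inject_latency(p=0.1, delay_s=0.5):
    """Chaos sketch: with probability p, add delay_s of artificial latency,
    enough to verify that timeouts and circuit breakers actually fire."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < p:
                time.sleep(delay_s)  # simulated slow dependency
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(p=1.0, delay_s=0.05)  # always inject, for demonstration
def fetch_profile(user_id):
    return {"id": user_id}
```

If callers of `fetch_profile` have no timeout configured, this experiment surfaces that gap immediately, before a real slow dependency does.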
Case Study: Netflix's Zuul → Envoy Migration
Netflix's API gateway Zuul handled all incoming traffic with built-in circuit breakers, rate limiting, and routing. As they migrated to a service mesh architecture, they replaced per-service resiliency code with Envoy sidecars injected into every pod. Circuit breakers, retries, and timeouts are now configured in the mesh control plane, not in application code. This reduced per-service boilerplate and ensured consistent resiliency policies across 1,000+ microservices.
Takeaway: Resiliency patterns should be infrastructure concerns, not application concerns. A service mesh (Istio, Linkerd) enforces circuit breakers, retries, and mTLS uniformly without touching application code.
Case Study: Spotify's Golden Path
Spotify has over 2,000 microservices managed by ~200 autonomous teams. To prevent chaos, they created the Golden Path: a standardized Kubernetes deployment template with built-in health checks, circuit breakers, structured logging, Prometheus metrics, and distributed tracing via OpenTelemetry. Teams don't need to configure resiliency — it's baked into the platform. This reduced mean time to recovery (MTTR) by 40%.
Takeaway: Platform engineering — providing opinionated, secure defaults — is more effective than writing documentation and hoping teams follow it.
Case Study: AWS S3's Cascading Failure (2017)
In 2017, an engineer mistyped a command during routine maintenance and removed far more S3 servers than intended. The blast radius was enormous because S3's internal subsystems depended on each other: the billing system depended on the metadata system, which depended on the placement system. Without service-level bulkheads and circuit breakers between them, the failure cascaded across all S3 subsystems, causing a roughly four-hour outage that affected a significant portion of the internet.
Takeaway: Internal dependencies are as dangerous as external ones. Every dependency boundary needs a circuit breaker, timeout, and fallback — including dependencies between your own services.
- Release It! by Michael Nygard — The canonical reference on stability patterns: circuit breakers, bulkheads, timeouts, and steady-state. (Pragmatic Bookshelf, 2018)
- Site Reliability Engineering by Beyer, Jones, Petoff, Murphy — Google's SRE practices including error budgets, SLOs, and toil. (O'Reilly, 2016)
- The Netflix Simian Army — Chaos engineering at scale: Chaos Monkey, Chaos Kong, Latency Monkey.
- Principles of Chaos Engineering — The manifesto for chaos engineering practices.
- Kubernetes Documentation — Official docs for Pods, Deployments, Services, HPA, and more.
- OpenTelemetry Documentation — The emerging standard for metrics, logs, and traces instrumentation.
- Monitoring Distributed Systems — Google SRE Book — The four golden signals: latency, traffic, errors, and saturation.
All Hands-on Resources
Reinforce these concepts with interactive simulators and visual deep-dives.
You Made It.
You've covered the fundamental building blocks of system design — from monoliths to distributed, fault-tolerant architectures. Go back to explore more modules or revisit the ones you need.