Orchestration & Resiliency
Containers, circuit breakers, and chaos engineering. Build systems that heal themselves.
A container packages your application with all its dependencies (runtime, libraries, config files) into a single, portable unit that runs identically everywhere — dev laptop, CI server, production cloud. Unlike VMs, containers share the host OS kernel, making them lightweight (startup in milliseconds, minimal overhead).
Images & Layers
A Docker image is built from a Dockerfile — each instruction (FROM, RUN, COPY) creates a read-only layer. Layers are cached and shared between images, so 10 containers based on the same image share the base layers. Only the writable layer (the container's filesystem changes) is unique.
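The caching behavior above is why instruction order matters. A minimal sketch of a Dockerfile that exploits it (the Node.js base image and file names are illustrative, not from this handbook):

```dockerfile
# Hypothetical Node.js service; each instruction creates one layer.
FROM node:20-slim            # base layers, shared by every image built FROM it
WORKDIR /app
COPY package*.json ./        # copy manifests first so the next layer caches well
RUN npm ci --omit=dev        # re-runs only when package*.json changes
COPY . .                     # source edits invalidate only this layer and below
CMD ["node", "server.js"]
```

Because dependency installation sits above the `COPY . .` of application source, day-to-day code changes reuse the cached install layer instead of re-downloading packages.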
Namespaces & cgroups
Linux namespaces provide isolation (PID, network, mount, user). Each container sees its own process tree, network stack, and filesystem. cgroups (control groups) limit resources: CPU, memory, disk I/O. Together, they create process-level isolation without a hypervisor.
Kubernetes (K8s) is a container orchestration platform that automates deployment, scaling, networking, and lifecycle management of containerized applications. Instead of manually running containers on servers, you declare the desired state, and K8s makes it reality.
Core Primitives
Pod
The smallest deployable unit. Contains one or more containers sharing network and storage. Pods are ephemeral — K8s replaces failed pods automatically.
Deployment
Manages a set of identical pods. Handles rolling updates, rollbacks, and scaling. "I want 5 replicas of my API server" → the Deployment controller ensures exactly 5 pods are always running.
Service
A stable network endpoint that load-balances across pods. Even when pods are replaced (new IPs), the Service IP and DNS name remain constant. Types: ClusterIP (internal), NodePort, LoadBalancer (external).
Ingress
Routes external HTTP/HTTPS traffic to Services based on hostnames and URL paths: api.example.com/v1 → API Service, example.com → Frontend Service. Handles TLS termination.
HPA (Horizontal Pod Autoscaler)
Automatically scales pods based on CPU, memory, or custom metrics. "Scale from 3 to 20 pods when average CPU > 70%." Integrates with Prometheus for custom metrics (queue depth, latency).
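These primitives are declared as YAML manifests. A minimal sketch of a Deployment ("5 replicas of my API server") plus the ClusterIP Service that load-balances across its pods; all names and the image are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 5                  # the controller keeps exactly 5 pods running
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: example/api:1.0
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: api-svc
spec:
  type: ClusterIP              # internal-only; LoadBalancer exposes it externally
  selector:
    app: api                   # routes to whichever pods carry this label
  ports:
    - port: 80
      targetPort: 8080
```

When a pod dies and is replaced with a new IP, the Service's label selector picks up the replacement automatically, so clients keep using the same stable DNS name.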
Explore: Kubernetes
See how Kubernetes creates pods, schedules them on nodes, and handles rolling deployments.
In a distributed system, failures are normal, not exceptional. Networks partition. Services crash. Dependencies become slow. Resiliency patterns ensure that these failures are contained and don't cascade into total system failure.
Circuit Breaker
The circuit breaker prevents a service from repeatedly calling a failing dependency. Like an electrical circuit breaker, it "trips" when failures exceed a threshold, and "opens" the circuit — returning errors immediately without attempting the call. After a timeout, it lets one "probe" request through to check if the dependency has recovered.
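The trip-open-probe lifecycle can be sketched in a few dozen lines. This is a minimal single-threaded illustration, not a production implementation; the thresholds are arbitrary:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch; thresholds are illustrative."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, let one probe request through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0      # probe (or normal call) succeeded
            self.opened_at = None  # close the circuit again
            return result
```

While open, callers get an immediate error instead of tying up threads waiting on a dead dependency; a real implementation would also need locking and per-endpoint state.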
Retry with Exponential Backoff
Transient failures (network blips, brief overloads) often resolve on their own. Retries handle these automatically, but naive retries can overwhelm a recovering service. Exponential backoff spaces retries exponentially: 100ms → 200ms → 400ms → 800ms. Adding jitter (randomness) prevents synchronized retry storms.
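A sketch of the backoff-with-jitter loop, assuming the "full jitter" variant (delay drawn uniformly between zero and the exponential cap); attempt counts and delays are illustrative:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry sketch: the delay cap doubles each attempt (100ms, 200ms,
    400ms, ...), with full jitter so clients don't retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # "full jitter" variant
```

In practice you would retry only on errors known to be transient (timeouts, 503s), never on client errors like 400, and only for idempotent operations.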
Bulkhead Pattern
Named after ship compartments that prevent flooding from sinking the entire vessel. The bulkhead pattern isolates different parts of your system so that a failure in one doesn't consume resources needed by others.
- Thread pool isolation: Each dependency gets its own thread pool. If the payment service is slow and exhausts its 20 threads, the user service still has its own 20 threads available.
- Connection pool isolation: Separate database connection pools for critical vs. non-critical queries. Analytics queries can't starve checkout queries.
- Process isolation: Run critical and non-critical workloads in separate containers/pods with independent resource limits.
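Thread-pool isolation, the first variant above, can be sketched with Python's standard library; the pool sizes and service names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# One pool per dependency: a slow payment service can exhaust its own
# 20 workers without starving calls to the user service.
payment_pool = ThreadPoolExecutor(max_workers=20, thread_name_prefix="payments")
user_pool = ThreadPoolExecutor(max_workers=20, thread_name_prefix="users")

def call_payments(fn, *args):
    return payment_pool.submit(fn, *args)  # queued behind payment work only

def call_users(fn, *args):
    return user_pool.submit(fn, *args)     # unaffected by a payment backlog
```

If every dependency shared one pool, 20 stuck payment calls would block user lookups too; with separate pools, the failure stays in its compartment.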
Rate Limiting
Protect your services from being overwhelmed — whether by legitimate traffic spikes or malicious abuse. Common algorithms:
| Algorithm | How It Works | Best For |
|---|---|---|
| Token Bucket | Tokens refill at a fixed rate. Each request consumes a token. No tokens = rejected. | API rate limiting (allows short bursts) |
| Leaky Bucket | Requests queue and drain at a fixed rate, like water from a bucket with a hole. | Smooth traffic shaping (no bursts) |
| Fixed Window | Count requests in fixed time windows (e.g., 100 req/min). Reset at window boundary. | Simple, but allows 2x burst at window edges |
| Sliding Window Log | Track timestamp of each request. Count requests in the last N seconds. | Most accurate, but memory-intensive |
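The token bucket from the table above fits in a few lines. A minimal in-process sketch (a distributed limiter would keep this state in something like Redis); rate and capacity are illustrative:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter sketch."""

    def __init__(self, rate=10.0, capacity=20):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill lazily based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # spend a token for this request
            return True
        return False          # bucket empty: reject the request
```

Because the bucket can hold up to `capacity` tokens, short bursts pass through, while sustained traffic is held to `rate` requests per second, exactly the trade-off the table describes.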
Try It: Resiliency Simulators
Watch circuit breakers trip, retries back off, and rate limiters reject traffic in real-time.
You can't fix what you can't see. Observability gives you insight into the internal state of your system through three complementary signals:
Metrics
Numeric measurements over time. Request rate, error rate, latency percentiles (P50, P95, P99), CPU/memory usage. Tools: Prometheus, Grafana, Datadog. USE method: Utilization, Saturation, Errors — for every resource.
Logs
Structured event records. Each request generates log entries with trace ID, timestamp, severity, and context. Tools: ELK (Elasticsearch, Logstash, Kibana), Loki, Fluentd. Always use structured JSON logs — not plain text.
Traces
Follow a single request across multiple services. A trace shows the complete journey: API Gateway → Auth Service → User Service → Database. Tools: Jaeger, Zipkin, OpenTelemetry, AWS X-Ray. Essential for debugging distributed systems.
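The structured-log advice above is easy to demonstrate. A sketch of emitting one JSON log line carrying a trace ID so logs can be joined with traces; the field names are illustrative, and real services would use a logging library or the OpenTelemetry SDK rather than bare `print`:

```python
import json
import time
import uuid

def log_event(severity, message, trace_id, **context):
    """Emit one structured JSON log line and return it."""
    entry = {
        "ts": time.time(),
        "severity": severity,
        "trace_id": trace_id,  # same ID flows through every service on the path
        "message": message,
        **context,
    }
    line = json.dumps(entry)
    print(line)  # stdout; a shipper (Fluentd, Loki) collects and indexes it
    return line

trace_id = str(uuid.uuid4())
log_event("INFO", "user lookup", trace_id, service="user-service", user_id=42)
```

Because every field is a key, log backends can filter on `trace_id` or `user_id` directly, which is what makes the cross-service correlation in the Traces section possible.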
Chaos engineering is the practice of deliberately injecting failures into production systems to test resiliency. The hypothesis: "If we kill this server / inject 500ms of latency / corrupt this response, our system should continue serving users without degradation."
- Chaos Monkey: Randomly terminates instances. Validates auto-healing and statelessness.
- Latency injection: Adds artificial delay to network calls. Tests timeout and circuit breaker configurations.
- Resource stress: Consume CPU, memory, or disk I/O. Tests auto-scaling and resource limits.
- DNS failure: Return NXDOMAIN for a dependency. Tests fallback behavior.
- Zone/region failure: Simulate an entire availability zone going down. Tests multi-AZ/multi-region failover.
Tools: Netflix Chaos Monkey, Gremlin, LitmusChaos, AWS Fault Injection Simulator.
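Latency injection, the second experiment above, is simple enough to sketch in-process. This hypothetical decorator is a toy stand-in for what tools like Gremlin do at the network layer; the probability and delay are illustrative:

```python
import random
import time
from functools import wraps

def inject_latency(p=0.1, delay_s=0.5):
    """Chaos sketch: with probability p, add delay_s of artificial latency,
    enough to verify that timeouts and circuit breakers actually fire."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < p:
                time.sleep(delay_s)  # simulated slow dependency
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(p=1.0, delay_s=0.05)  # always inject, for demonstration
def fetch_profile(user_id):
    return {"id": user_id}
```

If callers of `fetch_profile` have no timeout configured, this experiment surfaces that gap immediately, before a real slow dependency does.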
Case Study: Netflix's Zuul → Envoy Migration
Netflix's API gateway Zuul handled all incoming traffic with built-in circuit breakers, rate limiting, and routing. As they migrated to a service mesh architecture, they replaced per-service resiliency code with Envoy sidecars injected into every pod. Circuit breakers, retries, and timeouts are now configured in the mesh control plane, not in application code. This reduced per-service boilerplate and ensured consistent resiliency policies across 1,000+ microservices.
Takeaway: Resiliency patterns should be infrastructure concerns, not application concerns. A service mesh (Istio, Linkerd) enforces circuit breakers, retries, and mTLS uniformly without touching application code.
Case Study: Spotify's Golden Path
Spotify has over 2,000 microservices managed by ~200 autonomous teams. To prevent chaos, they created the Golden Path: a standardized Kubernetes deployment template with built-in health checks, circuit breakers, structured logging, Prometheus metrics, and distributed tracing via OpenTelemetry. Teams don't need to configure resiliency — it's baked into the platform. This reduced mean time to recovery (MTTR) by 40%.
Takeaway: Platform engineering — providing opinionated, secure defaults — is more effective than writing documentation and hoping teams follow it.
Case Study: AWS S3's Cascading Failure (2017)
In 2017, an engineer mistyped a command during routine maintenance and removed far more S3 servers than intended. The blast radius was enormous because S3's internal subsystems depended on each other: the billing system depended on the metadata system, which depended on the placement system. Without service-level bulkheads and circuit breakers between them, the failure cascaded across all S3 subsystems, causing a roughly four-hour outage that affected a significant portion of the internet.
Takeaway: Internal dependencies are as dangerous as external ones. Every dependency boundary needs a circuit breaker, timeout, and fallback — including dependencies between your own services.
- Release It! by Michael Nygard — The canonical reference on stability patterns: circuit breakers, bulkheads, timeouts, and steady-state. (Pragmatic Bookshelf, 2018)
- Site Reliability Engineering by Beyer, Jones, Petoff, Murphy — Google's SRE practices including error budgets, SLOs, and toil. (O'Reilly, 2016)
- The Netflix Simian Army — Chaos engineering at scale: Chaos Monkey, Chaos Kong, Latency Monkey.
- Principles of Chaos Engineering — The manifesto for chaos engineering practices.
- Kubernetes Documentation — Official docs for Pods, Deployments, Services, HPA, and more.
- OpenTelemetry Documentation — The emerging standard for metrics, logs, and traces instrumentation.
- Monitoring Distributed Systems — Google SRE Book — The four golden signals: latency, traffic, errors, and saturation.
All Hands-on Resources
Reinforce these concepts with interactive simulators and visual deep-dives.
You Made It.
You've covered the fundamental building blocks of system design — from monoliths to distributed, fault-tolerant architectures. Go back to explore more modules or revisit the ones you need.