Module 12: The 4-Step Design Framework

Track 3: Global Scale (6+ YoE)

You now have the building blocks — scaling, caching, databases, queues, load balancing. But how do you pull them together when faced with a blank whiteboard? This module gives you a structured, repeatable process for designing any system — whether it's a 45-minute interview or a real-world greenfield architecture.

Step 1: Requirements & Scope (5 min)

The biggest failure mode: designing the wrong system. Before drawing any boxes, spend 5 minutes clarifying exactly what you're building and for whom.

Functional Requirements

What does the system do? These are the user-facing features. Be specific:

❌ "Design a URL shortener" → too vague
✅ "Users can create short URLs. Short URLs redirect to the original. Users can see click analytics. URLs expire after 30 days by default. Custom aliases are supported."

Ask clarifying questions: Can users delete URLs? Is authentication required? Are there rate limits? What's the max URL length?

Non-Functional Requirements (NFRs)

How should it behave? These are the quality attributes:

Availability

99.9%? 99.99%? 99.9% allows 8.76 hours of downtime per year; 99.99% allows about 52 minutes per year. That extra nine makes a big difference in architecture.
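The downtime figures come from multiplying the unavailable fraction by the hours in a year. Here is a minimal Python sketch, assuming a 365-day year (leap years ignored):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a 365-day year


def downtime_hours_per_year(availability_pct: float) -> float:
    """Allowed downtime per year for an availability target given in percent."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * HOURS_PER_YEAR


for target in (99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(target)
    print(f"{target}% -> {hours:.2f} h/year ({hours * 60:.1f} min/year)")
```

Each additional nine cuts the yearly downtime budget by a factor of 10.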

Latency

P99 latency target? URL redirect should be <50ms. Analytics dashboard can tolerate 2-3 seconds.
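A P99 target constrains the slowest 1% of requests, which an average hides. The small sketch below uses the nearest-rank percentile method. This is one of several common definitions, and monitoring tools may interpolate instead. The sample latencies are made up for illustration:

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical redirect latencies in ms: 98 fast requests, 2 slow ones.
latencies_ms = [20.0] * 98 + [400.0] * 2
mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean={mean:.1f} ms, p50={percentile(latencies_ms, 50)} ms, "
      f"p99={percentile(latencies_ms, 99)} ms")
```

Here the mean (27.6 ms) sits comfortably under a 50 ms target, while the P99 (400 ms) exposes the tail that some users actually hit.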

Consistency

Strong or eventual? URL creation must be strongly consistent (two URLs must never receive the same short code or custom alias). Analytics can be eventually consistent.
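A common way to guarantee "no duplicate short codes" is to let the database enforce uniqueness and treat a collision as a failed insert. This avoids the race in "check first, then write". The sketch uses Python's built-in sqlite3 as a stand-in for the real store. The table and function names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls ("
    " short_code TEXT PRIMARY KEY,"  # uniqueness enforced by the database
    " long_url   TEXT NOT NULL)"
)


def create_alias(short_code: str, long_url: str) -> bool:
    """Atomically claim short_code; return False if it is already taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO urls (short_code, long_url) VALUES (?, ?)",
                (short_code, long_url),
            )
        return True
    except sqlite3.IntegrityError:  # primary-key collision
        return False


print(create_alias("promo", "https://example.com/spring-sale"))  # True
print(create_alias("promo", "https://example.com/other-page"))   # False
```

In a distributed database, the same idea usually appears as a conditional "insert if not exists" write.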

Scale

How many users? Requests/sec? Data volume? This feeds directly into Step 2.

Step 2: Back-of-Envelope Estimation (5 min)

Estimations ground your design in reality. They tell you whether you need 1 server or 1,000, and whether your data fits on SSD or requires distributed storage.

Key Numbers Every Engineer Should Know

Operation	Latency
L1 cache reference	0.5 ns
L2 cache reference	7 ns
Main memory reference	100 ns
SSD random read	150 μs
HDD random read	10 ms
Same-datacenter round trip	0.5 ms
Cross-continent round trip	150 ms
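These numbers are most useful as a budget. The sketch below counts how many of each operation fit, back to back, into the 50 ms redirect target from Step 1. It assumes the operations run sequentially, with no parallelism:

```python
# Latencies from the table above, converted to nanoseconds.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "Main memory reference": 100,
    "SSD random read": 150_000,
    "HDD random read": 10_000_000,
    "Same-datacenter round trip": 500_000,
    "Cross-continent round trip": 150_000_000,
}


def ops_within_budget(budget_ms: float) -> dict:
    """How many sequential operations of each kind fit in the latency budget."""
    budget_ns = budget_ms * 1_000_000
    return {op: int(budget_ns // ns) for op, ns in LATENCY_NS.items()}


for op, count in ops_within_budget(50).items():
    print(f"{op:<28} {count:>12,}")
```

One cross-continent round trip (150 ms) already exceeds the budget on its own, while about 100 same-datacenter hops still fit.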

Estimation Example: URL Shortener

// Traffic estimation
100M new URLs/month ÷ ~2.6M sec/month → ~40 URLs/sec (write)
Read:Write ratio = 100:1 → ~4,000 reads/sec

// Storage estimation (5 years)
100M URLs/month × 12 months × 5 years = 6B URLs
Avg record size: 500 bytes (original URL) + 7 bytes (short code) + metadata ≈ 1 KB (rounded up)
6B × 1 KB ≈ 6 TB total storage

// Cache estimation (80/20 rule: 20% of URLs get 80% of traffic)
4,000 reads/sec × 86,400 sec/day ≈ 345M reads/day
Cache 20% of daily reads: 345M × 0.2 × 1 KB ≈ 70 GB (an upper bound, since repeat reads of the same URL share one cache entry)

// → Fits in memory on a single Redis node with a large-memory host (e.g., 256 GB RAM); Redis has no fixed size cap, only the host's RAM
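The same arithmetic also works as a small Python function, so you can change one assumption and rerun it. The parameter names are illustrative, and the function assumes decimal units (1 KB = 1,000 bytes) and 30-day months. Because it skips the intermediate rounding, it lands slightly below the rounded figures above, at about 3,860 reads/sec and 67 GB of cache:

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_MONTH = 30 * SECONDS_PER_DAY  # assumes 30-day months


def estimate(new_urls_per_month: int = 100_000_000,
             read_write_ratio: int = 100,
             years: int = 5,
             record_bytes: int = 1_000,      # ~1 KB per URL record (rounded up)
             cache_fraction: float = 0.2):   # 80/20 rule
    writes_per_sec = new_urls_per_month / SECONDS_PER_MONTH
    reads_per_sec = writes_per_sec * read_write_ratio
    total_urls = new_urls_per_month * 12 * years
    storage_tb = total_urls * record_bytes / 1e12
    reads_per_day = reads_per_sec * SECONDS_PER_DAY
    cache_gb = reads_per_day * cache_fraction * record_bytes / 1e9
    return {
        "writes_per_sec": writes_per_sec,
        "reads_per_sec": reads_per_sec,
        "total_urls": total_urls,
        "storage_tb": storage_tb,
        "cache_gb": cache_gb,
    }


for name, value in estimate().items():
    print(f"{name:>15}: {value:,.1f}")
```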

Step 3: High-Level Design (15 min)

Now draw the architecture. Start with the happy path for the most critical use case, then expand.

The Universal Building Blocks

Clients: Web, mobile, API consumers
Load Balancer / API Gateway: Entry point, TLS termination, rate limiting, routing
Application Servers: Stateless services behind LB
Cache: Redis/Memcached for hot reads
Database: PostgreSQL (OLTP), Cassandra (write-heavy), etc.
Message Queue: Kafka/SQS for async processing
Object Storage: S3 for files, images, videos
CDN: Cloudflare/CloudFront for static assets and edge caching
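On the read path, these blocks compose into the cache-aside pattern. The stateless app server checks the cache first. On a miss, it falls back to the database and then populates the cache. In the sketch below, a plain lookup function stands in for the database and an in-process LRU stands in for Redis. The class and parameter names are hypothetical:

```python
from collections import OrderedDict
from typing import Callable, Optional


class CacheAsideReader:
    """Read path: check cache, fall back to the database, then populate the cache."""

    def __init__(self, db_lookup: Callable[[str], Optional[str]], capacity: int = 1_000) -> None:
        self._db_lookup = db_lookup  # stands in for a real database query
        self._capacity = capacity    # max entries before LRU eviction
        self._cache: "OrderedDict[str, str]" = OrderedDict()  # stands in for Redis
        self.db_reads = 0

    def get_long_url(self, short_code: str) -> Optional[str]:
        if short_code in self._cache:
            self._cache.move_to_end(short_code)  # mark as most recently used
            return self._cache[short_code]
        self.db_reads += 1                       # cache miss: go to the database
        long_url = self._db_lookup(short_code)
        if long_url is not None:
            self._cache[short_code] = long_url
            if len(self._cache) > self._capacity:
                self._cache.popitem(last=False)  # evict least recently used
        return long_url
```

With the 80/20 skew estimated in Step 2, most redirects should be served by the first branch and never reach the database.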

API Design First

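Before drawing boxes, define the API contract. For a URL shortener:
POST /api/shorten {"url": "...", "ttl": 3600} → {"short_url": "..."} (ttl is optional and given here in seconds; omitting it applies the 30-day default)
GET /{short_code} → 302 Redirect to the original URL
A 302 (temporary) redirect, unlike a 301 (permanent), is not cached by browsers by default. Every click therefore reaches your servers and can feed the click analytics. The API reveals the data flow, which dictates the architecture.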
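You can prototype the two endpoints without any infrastructure. The sketch below is one possible approach, not the only one, and it makes several assumptions:

- Short codes come from base62-encoding an incrementing counter. Seven base62 characters give about 3.5 trillion codes, far above the 6B estimate.
- ttl is in seconds.
- Unknown or expired codes get a 404.
- The host and class names are hypothetical.

```python
import string
import time
from typing import Dict, Optional, Tuple

ALPHABET = string.digits + string.ascii_letters  # 62 symbols: 0-9, a-z, A-Z
DEFAULT_TTL_S = 30 * 24 * 3600                   # 30-day default from Step 1


def base62(n: int) -> str:
    """Encode a non-negative integer using the 62-symbol alphabet."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


class InMemoryShortener:
    """Toy backend for POST /api/shorten and GET /{short_code}."""

    def __init__(self, base_url: str = "https://sho.rt/") -> None:  # hypothetical host
        self.base_url = base_url
        self._next_id = 1
        self._urls: Dict[str, Tuple[str, float]] = {}  # code -> (long URL, expiry time)

    def shorten(self, url: str, ttl: Optional[int] = None,
                now: Optional[float] = None) -> Dict[str, str]:
        now = time.time() if now is None else now
        code = base62(self._next_id)
        self._next_id += 1
        expires_at = now + (DEFAULT_TTL_S if ttl is None else ttl)
        self._urls[code] = (url, expires_at)
        return {"short_url": self.base_url + code}

    def redirect(self, short_code: str,
                 now: Optional[float] = None) -> Tuple[int, Optional[str]]:
        """Return (HTTP status, Location header value)."""
        now = time.time() if now is None else now
        entry = self._urls.get(short_code)
        if entry is None or now >= entry[1]:
            return 404, None  # unknown or expired
        return 302, entry[0]
```

With multiple app servers, a single in-process counter no longer works. Each server would need a coordinated source of IDs, which is a natural topic for the deep dive.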

Step 4: Deep Dive (20 min)

Pick 2-3 components and go deep. This is where you demonstrate technical depth. Common deep-dive areas:

Data Model & Schema

Table schemas, partition keys, indexes. How do you handle queries efficiently? What denormalization is needed?
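For the URL shortener, the redirect query looks rows up only by short code, which makes it the natural partition key. The sketch below shows a hypothetical record layout and stable hash-based shard routing. It also shows why plain modulo sharding is painful to grow: adding a shard remaps most keys, which is the usual motivation for consistent hashing. Field and function names are illustrative:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlRecord:
    short_code: str  # partition key: every redirect looks rows up by this
    long_url: str
    owner_id: str    # would need a secondary index for "list my URLs"
    created_at: int  # Unix seconds
    expires_at: int  # Unix seconds; enables TTL cleanup


def shard_for(short_code: str, num_shards: int) -> int:
    """Map a short code to a shard with a stable hash.

    Python's built-in hash() is randomized per process for str, so a
    cryptographic digest is used to get the same answer on every server.
    """
    digest = hashlib.sha256(short_code.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys, old_shards: int, new_shards: int) -> float:
    """Share of keys that land on a different shard after resizing."""
    moved = sum(shard_for(k, old_shards) != shard_for(k, new_shards) for k in keys)
    return moved / len(keys)


keys = [f"code{i}" for i in range(10_000)]
print(f"moved when going 16 -> 17 shards: {fraction_moved(keys, 16, 17):.0%}")
```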

Scaling Bottlenecks

What breaks at 10x traffic? Database sharding strategy? Hot partition handling? Cache stampede prevention?
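One common cache-stampede defense is per-key request coalescing. When many requests miss on the same hot key at once, only one of them loads it from the database, and the rest wait for that result. Below is a thread-based sketch for a single process. The names are hypothetical. Across many app servers, you would need a distributed lock or a similar mechanism instead:

```python
import threading
from typing import Callable, Dict


class SingleFlightCache:
    """Cache where concurrent misses on one key trigger a single load."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader                   # e.g., a database query
        self._data: Dict[str, str] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()          # protects the _locks dict

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key: str) -> str:
        value = self._data.get(key)             # fast path: cache hit
        if value is not None:
            return value
        with self._lock_for(key):               # only one loader per key at a time
            value = self._data.get(key)         # re-check: another thread may have filled it
            if value is None:
                value = self._loader(key)
                self._data[key] = value
            return value
```

Other threads that miss on the same key block on its lock, then find the value on the re-check instead of hitting the database again.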

Failure Modes

What happens if the database is down? If a datacenter fails? What's the blast radius? How do you detect and recover?
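One standard answer to "what happens if the database is down?" is a circuit breaker. After repeated failures, callers fail fast instead of piling more requests onto a sick dependency. After a cool-down, a trial call detects recovery. Below is a minimal single-threaded sketch. The thresholds and names are illustrative assumptions:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("dependency unhealthy; failing fast")
            # Cool-down elapsed ("half-open"): let this call through as a probe.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed probe, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast limits the blast radius. Callers can immediately serve a fallback, such as cached redirects, instead of tying up threads on timeouts.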

Consistency & Concurrency

Race conditions? Distributed transactions? How do you get exactly-once effects when messages may be delivered more than once? What consistency model?
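Many queue setups deliver messages at least once, so "exactly-once" is usually achieved as exactly-once effects: the consumer deduplicates by a unique event ID. The sketch below uses the shortener's click analytics as the example. The class name and event shape are hypothetical:

```python
from collections import Counter
from typing import Set


class ClickCounter:
    """Consumes click events idempotently, keyed by a unique event ID."""

    def __init__(self) -> None:
        self.processed_ids: Set[str] = set()
        self.clicks: Counter = Counter()

    def handle(self, event_id: str, short_code: str) -> bool:
        """Apply the event once; return False for a redelivered duplicate."""
        if event_id in self.processed_ids:
            return False
        # In a real system, recording event_id and updating the count must
        # commit atomically (e.g., in one database transaction); otherwise a
        # crash between the two steps reintroduces double counting or loss.
        self.clicks[short_code] += 1
        self.processed_ids.add(event_id)
        return True


consumer = ClickCounter()
for event_id, code in [("e1", "abc"), ("e2", "abc"), ("e1", "abc")]:  # e1 redelivered
    consumer.handle(event_id, code)
print(consumer.clicks["abc"])  # 2
```

In production, the set of processed IDs would typically be bounded, for example by keeping IDs only for the queue's redelivery window.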

Common Anti-Patterns

Over-engineering: Designing for 1B users when you have 1,000. Start simple, add complexity only when justified by estimation.
Buzzword architecture: Adding Kafka, Redis, Elasticsearch to every design without justification. Each component adds operational cost.
Ignoring failure: A design without failure modes is a design that will fail catastrophically. What happens when X dies?
Monologue mode: In interviews, driving the design without checking in. In real life, designing without stakeholder alignment.
Jumping to solutions: Drawing boxes before understanding requirements. The most common mistake.

Design: File Storage

Put the framework into practice. Design an S3-like distributed file storage system from scratch.
