Caching Strategies

The art of saving results to avoid redundant work, from browser caches to global CDNs.

Module 3: Caching Strategies
Track 1: Foundations (0–2 YoE)
In Module 1, we learned that the database is typically the bottleneck of a monolith. In Module 2, we traced how requests travel across the network. Now we address the single most effective technique for reducing latency and backend load: caching. A well-placed cache can eliminate 90% of database queries, turning a 50ms disk-bound operation into a 0.1ms memory lookup. Caching is not optional at scale — it is the defining optimization that separates hobby projects from production systems.
Why Cache? The Latency Hierarchy

At its core, caching is storing a copy of data in a faster access layer so future requests for the same data bypass the slower original source. The inspiration comes from hardware: every modern CPU has L1, L2, and L3 caches because main memory is too slow. Software caching applies the same principle — keep hot data close to where it's needed.

| Access Layer | Latency | Relative Speed |
| --- | --- | --- |
| CPU L1 cache | ~1 ns | 1x (baseline) |
| CPU L2 cache | ~4 ns | 4x slower |
| RAM (in-process cache) | ~100 ns | 100x slower |
| SSD random read | ~100 μs | 100,000x slower |
| Redis (same AZ) | ~500 μs | 500,000x slower |
| Database query (indexed) | ~1-10 ms | 1,000,000-10,000,000x slower |
| Cross-region API call | ~100-300 ms | 100,000,000x+ slower |

Latency

A Redis lookup (~500 μs) versus a PostgreSQL query (~5 ms) is a 10x difference. For a page that makes 20 sequential DB queries, that is the difference between a 100 ms and a 10 ms page load, comfortably under the ~100 ms threshold at which a response feels instant.

Throughput

By serving 90% of requests from cache, your database handles 10% of the original load. This is equivalent to a 10x "free" database scaling. Redis can handle 100,000+ operations/sec on a single instance.

Cost

A Redis instance with 16 GB RAM costs ~$0.15/hr. A comparable RDS instance costs 5-10x more. Caching is the most cost-effective performance win available.


The Cache Hit/Miss Cycle

The effectiveness of your cache is measured by the Cache Hit Ratio:

Hit Ratio = Cache Hits / (Cache Hits + Cache Misses)
  • Cache Hit: The requested data is found in the cache. The response is served directly from memory. Fast, cheap, no database load.
  • Cache Miss: The data is not in the cache. The system must fetch it from the source (database, API, computation), serve the response, and typically store the result in the cache for next time.
  • Cold Miss: The cache is empty (just started, or after a restart). Every first request for a key is a cold miss. This is unavoidable.
  • Capacity Miss: The data was in the cache but was evicted because the cache ran out of memory. This indicates you need more cache memory or a better eviction policy.
  • Conflict Miss: The data was in the cache but was evicted because another key hashed to the same slot (relevant for hardware caches and hash-based data structures).

A hit ratio of 95% means that only 5 out of every 100 requests touch your database. This is a realistic target for read-heavy workloads (product catalogs, user profiles, configuration data). Highly personalized data (user dashboards, real-time analytics) tends to have lower hit ratios because the data is unique per user.

Measuring hit ratio is non-negotiable. Redis exposes these counters via the INFO stats command, which reports keyspace_hits and keyspace_misses. Monitor this metric continuously. If your hit ratio drops below 80%, you likely have a cache key design problem, your working set exceeds available memory, or your TTLs are too aggressive.
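The ratio itself is trivial to compute from those two counters. A minimal sketch (function name is hypothetical; the counter values would come from parsing INFO stats):

```python
# Hypothetical helper: compute hit ratio from the counters Redis reports
# under INFO stats (keyspace_hits and keyspace_misses).
def hit_ratio(keyspace_hits: int, keyspace_misses: int) -> float:
    """Return hits / (hits + misses), or 0.0 before any traffic."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else 0.0

# Example: 950 hits, 50 misses -> 0.95, i.e. only 5% of requests reach the DB
ratio = hit_ratio(950, 50)
assert ratio == 0.95
```

Track this value over time rather than as a single snapshot; a sudden drop usually means a deploy changed key names or a TTL.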


Update Strategies (Write Patterns)

Choosing when and how to update your cache is one of the most critical design decisions. Each strategy trades off between consistency, performance, and complexity.

1. Cache-Aside (Lazy Loading)

The application is responsible for both the cache and the database. On read: check cache → if miss, query DB → store in cache → return. On write: update DB → invalidate (delete) the cache entry.

// Pseudocode: Cache-Aside read path
data = cache.get(key)
if data == null {
    data = db.query("SELECT * FROM users WHERE id = ?", key)
    cache.set(key, data, ttl=300) // 5 min TTL
}
return data
Pros
  • Only requested data is cached (lazy, memory-efficient)
  • Cache failure → graceful degradation (just slower)
  • Simple to implement and reason about
Cons
  • First request for each key is always a miss (cold start)
  • Brief staleness window between DB write and cache invalidation
  • Application code manages two data stores

Used by: Most web applications worldwide. This is the default pattern you should reach for first.
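The read and write paths can be sketched concretely in Python, using plain dicts as stand-ins for Redis and the database (all names here are hypothetical, for illustration only):

```python
import time

db = {"user:1": {"name": "Ada"}}   # stand-in for the database
cache: dict = {}                   # stand-in for Redis: key -> (value, expires_at)

def get_user(key: str, ttl: float = 300.0):
    """Read path: check cache, fall back to DB on a miss, then populate."""
    entry = cache.get(key)
    if entry and entry[1] > time.time():   # cache hit, not yet expired
        return entry[0]
    value = db.get(key)                    # cache miss: query the "DB"
    cache[key] = (value, time.time() + ttl)
    return value

def update_user(key: str, value):
    """Write path: update the DB first, then invalidate (delete-on-write)."""
    db[key] = value
    cache.pop(key, None)                   # next read repopulates the cache
```

Note the write path deletes rather than updates the cache entry; the next read reloads fresh data, which sidesteps most write-ordering races.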

2. Write-Through Cache

Data is written to the cache and the database in a single synchronous operation. The cache is always up-to-date because every write passes through it. The write is only acknowledged after both stores confirm.

Pros
  • Cache is never stale — strong consistency guarantee
  • Read-after-write is always consistent
  • Simpler invalidation (no separate invalidation logic needed)
Cons
  • Write latency increases (writing to two stores sequentially)
  • Cache may hold data that is never read ("write-only" waste)
  • Write availability depends on both cache AND database being up

Used by: DynamoDB Accelerator (DAX), hardware CPU caches.
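A minimal sketch of the synchronous write path, again with dicts standing in for the two stores (function names are illustrative):

```python
db: dict = {}       # stand-in for the durable database
cache: dict = {}    # stand-in for the cache (e.g. DAX in front of DynamoDB)

def write_through(key, value):
    """Acknowledge the write only after BOTH stores have accepted it."""
    db[key] = value      # durable store first (the ordering is a design choice)
    cache[key] = value   # cache updated in the same synchronous operation

def read(key):
    # Any key that has been written is guaranteed fresh in the cache.
    return cache.get(key)
```

In a real system each assignment is a network call, which is exactly why write latency is the sum of both stores' latencies.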

3. Write-Back (Write-Behind)

Data is written only to the cache first. The database is updated asynchronously — after a delay, in batches, or when the cached entry is evicted. Provides the fastest writes but introduces data loss risk.

// Write-Behind: async DB flush
cache.set(key, data) // instant return — ~0.5ms
// Background worker, every 5 seconds:
dirty_keys = cache.get_dirty()
db.batch_upsert(dirty_keys) // bulk write to DB
Pros
  • Write latency = cache speed (~0.5ms for networked Redis)
  • Batch writes reduce DB load dramatically
  • Absorbs write spikes (acts as a buffer)
Cons
  • Data loss risk if cache crashes before DB sync
  • Complex failure handling and recovery
  • DB reads may see stale data

Used by: Linux page cache, CPU L1/L2 caches, gaming leaderboard systems, real-time analytics counters.
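The dirty-key bookkeeping behind the pseudocode above can be sketched like this (a single-process toy with hypothetical names; a real system would run flush() on a timer in a background worker):

```python
import threading

db: dict = {}
cache: dict = {}
dirty: set = set()          # keys written to the cache but not yet to the DB
lock = threading.Lock()

def write_back(key, value):
    with lock:
        cache[key] = value  # instant: memory-speed write, no DB round trip
        dirty.add(key)      # mark for the next background flush

def flush():
    """Background worker: batch-write all dirty keys to the DB at once."""
    with lock:
        batch = {k: cache[k] for k in dirty}
        dirty.clear()
    db.update(batch)        # one bulk write instead of N single writes

write_back("score:1", 100)
write_back("score:1", 120)  # overwritten in cache; the DB only ever sees 120
flush()
assert db["score:1"] == 120
```

The second write never reaches the database, which is the write-coalescing benefit; it is also exactly the data that would be lost if the cache crashed before flush() ran.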

4. Read-Through Cache

The cache itself is responsible for loading data from the database on a miss. The application only talks to the cache — it never directly queries the DB. The cache acts as a transparent read proxy.

Pros
  • Application code is simpler (cache abstracts DB layer)
  • Consistent loading logic across all callers
  • Easy to add caching to existing code
Cons
  • Cache must know how to query the database
  • Tighter coupling between cache and data layer
  • Harder to customize loading logic per caller

Used by: Hibernate L2 Cache, Guava LoadingCache, Caffeine (Java), NCache (.NET).
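The pattern amounts to handing the cache a loader callback, in the spirit of Guava's LoadingCache. A minimal sketch (class and names are hypothetical):

```python
class ReadThroughCache:
    """The cache owns the loading logic; callers never query the DB directly."""
    def __init__(self, loader):
        self.loader = loader        # e.g. a function that runs the SQL query
        self.store: dict = {}

    def get(self, key):
        if key not in self.store:               # miss: the CACHE loads the value
            self.store[key] = self.loader(key)
        return self.store[key]

users = {"1": "Ada"}                            # stand-in for the database
cache = ReadThroughCache(loader=lambda k: users.get(k))
assert cache.get("1") == "Ada"                  # first call loads from the source
```

This also makes the staleness trade-off visible: once loaded, a key stays cached until something invalidates it, even if the source changes.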

Cache Invalidation: The Two Hard Problems

Phil Karlton famously said: "There are only two hard things in Computer Science: cache invalidation and naming things." Invalidation — deciding when cached data is no longer valid — is where most caching bugs live.

  • Delete-on-write: When data changes, delete the cache entry. Next read triggers a fresh cache-aside load. Simple and safe. The brief window between delete and re-cache is the "stale window."
  • Update-on-write: When data changes, update the cache entry directly. Eliminates the stale window but introduces race conditions: if two writes happen concurrently, the cache might end up with older data than the database.
  • TTL-based expiry: Set a time-to-live on every entry. Data automatically becomes stale after N seconds. Simple to implement, but you're trading consistency for simplicity — users see stale data for up to TTL seconds.
  • Event-driven invalidation: Use database change data capture (CDC) or message queues to push invalidation events. When a row changes in the DB, a Kafka event triggers cache deletion. More complex but provides near-real-time consistency.



Eviction Policies

Caches have finite memory. When full, the cache must "evict" old data to make room for new data. The eviction policy determines which entry to remove, and this choice dramatically affects your cache hit ratio.

LRU (Least Recently Used)

Evicts the entry that was accessed longest ago. Implemented with a doubly-linked list + hash map for O(1) operations. Assumes temporal locality — recently used data is likely to be used again.

Redis: allkeys-lru or volatile-lru. This is the default choice for most workloads.
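In Python, the doubly-linked list + hash map combination is exactly what collections.OrderedDict provides, so an LRU cache is a short sketch (class name is illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """O(1) get/put: OrderedDict plays the linked-list + hash-map role."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)            # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)     # evict the least recently used

c = LRUCache(2)
c.put("a", 1); c.put("b", 2)
c.get("a")                                    # "a" is now most recent
c.put("c", 3)                                 # capacity exceeded: evicts "b"
assert c.get("b") is None and c.get("a") == 1
```

The access in the middle is what saves "a": without it, insertion order alone would have made "a" the eviction victim.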

LFU (Least Frequently Used)

Tracks how often each entry is accessed. Evicts the entry with the lowest access count. Uses aging to decay old counts so historically hot keys don't persist forever.

Redis: allkeys-lfu. Best for workloads with stable popularity patterns (e.g., trending content).

FIFO (First In, First Out)

Evicts the oldest entry by insertion time, regardless of access pattern. Implemented as a simple queue. No overhead for tracking access frequency or recency.

Best for: Simple, predictable eviction where all entries have equal value.

TTL (Time To Live)

Each entry has an expiration timestamp. After TTL elapses, the entry is removed (eagerly via background scanner, or lazily on next access).

Best for: Time-sensitive data: JWT tokens, rate limit counters, session data. Often combined with LRU.

Random Eviction

Evicts a random entry. Surprisingly effective — often only 5-10% worse hit ratio than LRU, with zero tracking overhead. Used when simplicity and CPU efficiency matter more than optimal hit ratio.

Redis: allkeys-random. Good for very large caches where LRU overhead is measurable.

W-TinyLFU (Modern Best)

Combines LRU for recency and LFU for frequency using a probabilistic admission filter (Count-Min Sketch). Used by Java's Caffeine cache — consistently achieves the highest hit ratios in benchmarks.

Used by: Caffeine (Java), Ristretto (Go). The state-of-the-art eviction policy.



Cache Stampede (Thundering Herd)

One of the most dangerous failure modes is the cache stampede. It happens when a popular cache entry expires, and hundreds of concurrent requests simultaneously see a cache miss and all query the database at once.

Imagine a product page that gets 10,000 req/s. The cached response has a 60-second TTL. When the TTL expires, every request arriving in that moment misses the cache and queries the database simultaneously: a burst of up to 10,000 concurrent queries against a database that, seconds earlier, was handling almost none. If the database can't absorb the burst, it cascades into a full outage.

Prevention Strategies

1. Mutex Lock (Single-Flight)

When a cache miss occurs, acquire a distributed lock (Redis SETNX or Go's singleflight) before querying the database. Only one request rebuilds the cache; all others wait and read the fresh entry. This is the most common solution.
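An in-process version of single-flight can be sketched with a lock and per-key events (names are hypothetical; a distributed deployment would use a Redis lock via SET with NX and EX instead of a thread lock, and production code would also handle a rebuild that raises):

```python
import threading

_guard = threading.Lock()
_inflight: dict = {}   # key -> Event that is set when the leader finishes
_results: dict = {}

def single_flight(key, rebuild):
    """Only one caller per key runs `rebuild`; the rest wait for its result."""
    with _guard:
        event = _inflight.get(key)
        if event is None:               # no one is loading this key: we lead
            event = threading.Event()
            _inflight[key] = event
            leader = True
        else:
            leader = False
    if leader:
        try:
            _results[key] = rebuild()   # e.g. query the DB and refill the cache
        finally:
            event.set()                 # wake the waiters
            with _guard:
                del _inflight[key]
    else:
        event.wait()                    # wait for the leader, don't hit the DB
    return _results[key]
```

Under a stampede, one thread pays the database cost and every concurrent caller for the same key reuses that single result.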

2. Stale-While-Revalidate

Serve the stale (expired) cache entry immediately while a background process refreshes it asynchronously. Used by CDNs via the stale-while-revalidate HTTP header. The user gets fast (slightly outdated) response while the cache updates silently.

3. Probabilistic Early Expiration

Add random jitter to TTL: TTL = 60 ± rand(10) seconds. This spreads out expirations across a 20-second window, preventing synchronized cache misses. Simple, effective, and avoids the complexity of distributed locks.
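The jitter itself is a one-liner (function name and defaults are illustrative):

```python
import random

def jittered_ttl(base: float = 60.0, jitter: float = 10.0) -> float:
    """TTL = base ± jitter, spreading expirations over a 2*jitter window."""
    return base + random.uniform(-jitter, jitter)

ttl = jittered_ttl()        # somewhere in [50, 70] seconds
assert 50.0 <= ttl <= 70.0
```

Keys cached at the same instant (e.g. after a deploy) now expire at different times instead of all at once.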

4. Lease Tokens (Facebook's Approach)

The cache gives the first client a "lease token" for the missing key. Subsequent clients requesting the same key see the lease and wait instead of all hitting the database. This prevents both stampedes and stale sets from concurrent writers.



Redis vs Memcached

The two dominant distributed cache implementations serve different use cases. Choosing the wrong one can cost you performance or flexibility.

| Feature | Redis | Memcached |
| --- | --- | --- |
| Data structures | Strings, Hashes, Lists, Sets, Sorted Sets, Streams, HyperLogLog | Strings only (key-value) |
| Persistence | RDB snapshots + AOF log for durability | None — pure in-memory, data lost on restart |
| Clustering | Redis Cluster with automatic sharding (16,384 hash slots) | Client-side sharding (consistent hashing) |
| Pub/Sub | Built-in pub/sub and Streams | Not supported |
| Threading | Single-threaded event loop (6.0+ has I/O threads) | Multi-threaded — better per-node throughput for simple ops |
| Best for | General purpose, complex data, leaderboards, rate limiting, sessions | Simple key-value caching at very high throughput |

Rule of thumb: Start with Redis. Its data structures (sorted sets for leaderboards, hash maps for objects, streams for event sourcing) solve problems that plain key-value can't. Only choose Memcached if you need pure key-value caching at extreme throughput and want multi-threaded performance on a single node.


Multi-Layer Cache Architecture

Production systems rarely use a single cache layer. Instead, they stack multiple caches at different levels, each optimized for different access patterns:

L1: In-Process Cache (Heap)

A HashMap or LRU cache embedded in the application process (Guava Cache, Caffeine, Go sync.Map). Zero network overhead — 100ns access. But: limited by process memory, not shared across instances, lost on restart. Best for: hot configuration data, auth tokens, frequently-accessed reference data.

L2: Distributed Cache (Redis / Memcached)

A shared cache accessible by all application instances over the network. ~500μs latency per request. Provides consistency across the fleet — when one instance invalidates a key, all instances see the miss on next access. Redis can handle 100K+ ops/sec per node.

L3: CDN / Edge Cache

Static assets, API responses, and HTML pages cached at CDN Points of Presence (PoPs) worldwide. User hits a nearby edge server (5ms) instead of your origin (200ms). Controlled via Cache-Control headers. Cloudflare has 300+ PoPs globally.

L4: Browser Cache

The user's browser caches static assets (JS, CSS, images) based on Cache-Control and ETag headers. This is the fastest possible cache — zero network request required. Proper cache headers can reduce your bandwidth costs by 40-60%.

The typical lookup flow: Browser → CDN → L1 (in-process) → L2 (Redis) → Database. Each miss cascades to the next layer. A well-tuned multi-layer cache achieves a 99%+ hit ratio at the origin, meaning your database handles only 1% of total user traffic.
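The cascade (minus the browser and CDN layers) can be sketched with dicts standing in for each store; note how a miss fills the faster layers on the way back up (all names hypothetical):

```python
l1: dict = {}                       # in-process cache (~100 ns)
l2: dict = {}                       # stand-in for Redis (~500 μs)
db = {"config:theme": "dark"}       # stand-in for the database (~5 ms)

def layered_get(key):
    """Check each layer in order; promote the value into faster layers on a miss."""
    if key in l1:
        return l1[key]              # L1 hit: no network at all
    if key in l2:
        l1[key] = l2[key]           # L2 hit: promote to L1 for next time
        return l1[key]
    value = db.get(key)             # final fallback: the database
    l2[key] = value                 # fill L2...
    l1[key] = value                 # ...and L1 on the way back
    return value

assert layered_get("config:theme") == "dark"   # cold: falls through to the DB
assert "config:theme" in l1 and "config:theme" in l2
```

A real implementation would also give each layer its own TTL, typically shortest at L1 since in-process copies cannot be invalidated centrally.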

CDN Caching Deep Dive

CDNs are the most impactful cache for user-facing applications because they eliminate the biggest latency source: geographic distance. Key HTTP headers for CDN caching:

# Cache for 1 hour, serve stale for 1 day while revalidating
Cache-Control: public, max-age=3600, stale-while-revalidate=86400
# Cache by cookie (user-specific content)
Vary: Cookie, Accept-Encoding
# Conditional request (saves bandwidth)
ETag: "abc123"
If-None-Match: "abc123" → 304 Not Modified (0 bytes)
# Prevent caching of sensitive data
Cache-Control: no-store, private
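The ETag round trip can be sketched as a tiny server-side handler; generating the ETag from a content hash is one common scheme, not the only one (function names are hypothetical):

```python
import hashlib

def make_etag(body: bytes) -> str:
    # One common scheme (an assumption here): a strong ETag from a content hash.
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(body, if_none_match=None):
    """Return (status, payload); 304 with an empty body when the ETag matches."""
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, b""              # client's copy is current: send 0 bytes
    return 200, body

page = b"<html>hello</html>"
status, _ = respond(page)                          # first visit: full response
status_2, body_2 = respond(page, make_etag(page))  # revalidation: empty 304
assert (status, status_2, body_2) == (200, 304, b"")
```

The client still pays a round trip for the conditional request, but the response body costs zero bytes of bandwidth.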



Lessons from the Trenches

Case Study: Facebook's Memcache at Scale (2013)

Facebook published a landmark NSDI 2013 paper describing their memcached deployment: thousands of servers holding trillions of items. Their key insight was treating memcached as a "demand-filled look-aside cache" and using lease tokens to prevent both stale sets and stampedes. When a client experiences a cache miss, it receives a lease. If another client already holds a lease for that key, subsequent clients wait rather than all querying the database.

Takeaway: At extreme scale, cache-aside with lease-based stampede prevention outperforms more complex patterns. Keep the cache layer simple and stateless — complexity belongs in the application layer.

Case Study: Reddit's Hot Key Problem

When a Reddit post goes viral, a single cache key receives millions of requests per second. Even Redis Cluster struggles because all requests for that key route to the same shard (hash slot). Reddit solved this by replicating hot keys across multiple Redis instances: post:123:r1, post:123:r2, etc. The application randomly selects a replica, distributing the load across shards.

Takeaway: Distributed caches have "hot key" bottlenecks just like databases. Solutions: replicate hot keys, use in-process L1 caches, or use consistent hashing with virtual nodes to spread load.
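The hot-key replication scheme reduces to two helpers: reads pick a random replica key, writes must touch all of them. A sketch assuming a fixed replica count (the post:123:rN naming follows the pattern above; the constant is hypothetical):

```python
import random

N_REPLICAS = 4   # hypothetical: tuned to how hot the key is

def replica_key(base: str) -> str:
    """Read path: pick one replica at random, spreading load across shards."""
    return f"{base}:r{random.randrange(N_REPLICAS)}"

def all_replica_keys(base: str):
    """Write path: every replica must be set (or invalidated) together."""
    return [f"{base}:r{i}" for i in range(N_REPLICAS)]

assert replica_key("post:123") in all_replica_keys("post:123")
```

Because the replica suffix changes the hash slot, the copies land on different Redis Cluster shards, which is the whole point.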

Case Study: Cloudflare's Cache Everything

Cloudflare reports that their CDN serves ~3.5 million HTTP requests per second from cache, with only ~1% of requests reaching customer origin servers. Their key innovation: tiered caching. Instead of every PoP independently fetching from origin on a miss, cache misses first check a regional "upper tier" cache. This reduces origin traffic by 90% compared to a flat CDN architecture and prevents stampede effects during cache purges.

Takeaway: Multi-tier caching (L1 PoP → L2 Regional → Origin) provides exponential reduction in origin load. Each layer absorbs the majority of misses from the layer above it.


Further Reading & Citations
  • Designing Data-Intensive Applications by Martin Kleppmann — Chapter 5 (Replication) and Chapter 11 (Stream Processing) cover cache consistency and event-driven invalidation. (O'Reilly, 2017)
  • Scaling Memcache at Facebook — Nishtala et al. (NSDI 2013) — The definitive paper on cache-aside at extreme scale, introducing lease tokens for stampede prevention.
  • Redis Benchmarks (Official) — Throughput data at various payload sizes and concurrency levels.
  • AWS Caching Best Practices — Cloud caching strategies covering ElastiCache, DAX, and CloudFront.
  • Optimal Probabilistic Cache Stampede Prevention — Vattani, Chierichetti & Lowenstein. Proceedings of the VLDB Endowment, 2015. Formal analysis of XFetch and probabilistic early expiration.
  • Caffeine Cache — Efficiency — Benchmarks showing W-TinyLFU outperforming LRU, LFU, and ARC on real workloads.
  • Cloudflare Tiered Cache — How tiered caching reduces origin requests by 90%.
  • MDN: HTTP Caching — Comprehensive reference for browser and CDN cache headers.


What's Next?

Scaling Out

Now that you know how to reduce load with caching, the next step is distributing that load across multiple servers. Learn horizontal scaling, stateless design, and session management.
