Advanced Caching & Invalidation

The hardest problem in computer science, solved at scale.

Module 10: Advanced Caching & Invalidation
Track 3: Global Scale (6+ YoE)
In Module 3, we covered caching fundamentals: cache-aside reads, write-through, eviction policies. This module dives into the hard problem — cache invalidation at scale. Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. He wasn't joking: when you have millions of cached objects across dozens of servers, keeping stale data from poisoning your application is an engineering discipline unto itself.
Why Cache Invalidation Is Hard

Caching introduces a second source of truth. Your database says the price is $29.99, but the cache says $24.99 because it was populated three minutes ago, before the price change. Every user hitting the cache sees the wrong price. This seems simple to fix — just delete the cache entry when the price changes. But at scale, it becomes a distributed coordination nightmare:

  • Multiple cache layers: Browser cache, CDN edge caches (300+ PoPs), application cache (Redis), database query cache. Invalidating one layer doesn't invalidate the others.
  • Race conditions: Service A reads price ($24.99), Service B updates price to $29.99 and invalidates cache, Service A writes its stale read ($24.99) back to cache. The cache is now permanently stale.
  • Partial invalidation: A product page is assembled from 15 cached fragments (product info, reviews, recommendations, pricing, inventory). When the price changes, you must invalidate only the pricing fragment, not the entire page.
  • Distributed caches: Redis Cluster has 6 shards across 3 availability zones. A cache invalidation message sent to one shard doesn't automatically propagate to others holding derived data.

Invalidation Strategies
  • TTL (Time-To-Live): the entry expires after N seconds, regardless of changes. Staleness window: 0 to TTL seconds. Complexity: low.
  • Event-Driven: publish an invalidation event when data changes; subscribers delete or update the cache. Staleness window: milliseconds. Complexity: medium.
  • Write-Through: every write updates both the database AND the cache atomically. Staleness window: zero (if no races). Complexity: medium.
  • Version-Based: the cache key includes a version number; bump the version on write to invalidate naturally. Staleness window: zero. Complexity: low.
  • Lease-Based: the cache issues a lease (token) on miss; only the lease holder can populate the cache. Staleness window: milliseconds. Complexity: high.

Version-Based Invalidation

Instead of deleting cache entries, change the key. Keep a version counter for each entity. When the entity updates, increment the version. The old cache entry is never read again — it simply expires via TTL.

// Version-based cache keys
// On read: fetch the version, build the key, check the cache
version = GET("user:42:version")        // → 7
cache_key = "user:42:v" + version       // → "user:42:v7"
data = GET(cache_key)                   // hit — on a miss, read the DB and SET cache_key
// On write: update the DB, then increment the version → old keys become orphans
INCR("user:42:version")                 // → 8
// "user:42:v7" is never read again; it expires via TTL
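The flow above can be made concrete as a runnable in-memory sketch, with plain dicts standing in for Redis (an illustrative assumption; a real deployment would issue INCR/GET against Redis, and orphaned keys would expire via TTL):

```python
versions = {}  # entity -> current version (stands in for INCR/GET "...:version")
cache = {}     # versioned key -> cached value (old keys are never deleted,
               # only orphaned; a real cache expires them via TTL)

def write(entity, value, db):
    db[entity] = value
    versions[entity] = versions.get(entity, 0) + 1  # bump version on write

def read(entity, db):
    key = f"{entity}:v{versions.get(entity, 0)}"
    if key not in cache:
        cache[key] = db[entity]  # miss: populate the current-version key
    return cache[key]
```

After a write, readers immediately build the new key, so no explicit delete is ever issued.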

Event-Driven Invalidation with CDC

Change Data Capture (CDC) tails the database's write-ahead log and publishes every change as an event. Consumers subscribe and invalidate or update caches in near-real-time. This decouples cache invalidation from application code — the application just writes to the database, and the CDC pipeline handles the rest.

CDC Pipeline

Database WAL → Debezium (CDC connector) → Kafka topic → Cache Invalidation Consumer → DEL cache_key. End-to-end latency: 50-200ms. Used by: LinkedIn, Airbnb, Uber. Tools: Debezium, Maxwell, AWS DMS.
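A minimal consumer for this pipeline might look like the sketch below. The event shape is an assumption modeled loosely on Debezium's change-event envelope (`source`, `before`, `after`); a real consumer would poll Kafka and issue DEL against Redis:

```python
def handle_change_event(event, cache):
    """Map one CDC change event to a cache key and invalidate it."""
    table = event["source"]["table"]
    row = event["after"] or event["before"]  # "after" is None for deletes
    key = f"{table}:{row['id']}"
    cache.pop(key, None)  # the DEL — the next reader repopulates from the DB
    return key
```

The application never touches the cache on writes; the pipeline derives every invalidation from the database log.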


Race Conditions & Stale Sets

The most insidious cache bug: a stale set where a slow reader writes outdated data to the cache after the cache has been invalidated.

// The Classic Stale Set Race
T1: Reader A: cache miss → reads DB (price = $24.99) → [context switch]
T2: Writer B: updates DB (price = $29.99)
T3: Writer B: deletes cache key // invalidation
T4: Reader A: SET cache_key = $24.99 // STALE! Races in after delete
// Cache now shows $24.99 until TTL — could be hours!

Solutions to Stale Sets

  • Delete-after-write with short TTL backfill: After invalidation, set a short-TTL tombstone (SET key "__DELETED__" EX 5) that prevents readers from populating the cache for 5 seconds.
  • Lease-based caching (Facebook's approach): On cache miss, the cache server issues a lease token. Only the holder of the current lease can SET the value. If the key is invalidated between the miss and the SET, the lease becomes invalid and the SET is rejected.
  • Version stamps: Include a version number in the cached value. On SET, compare versions — only write if the incoming version is newer than what's currently cached. (Plain Redis has no conditional SET on a field inside the value, so this compare-and-set is typically done in a Lua script or client-side.)
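The lease approach can be modeled in a few lines. This is a single-threaded, in-memory sketch of the idea — the names and structure are ours, not Facebook's implementation:

```python
import itertools

class LeaseCache:
    def __init__(self):
        self.data = {}                    # key -> value
        self.leases = {}                  # key -> outstanding lease token
        self.tokens = itertools.count(1)  # monotonically increasing token source

    def get(self, key):
        if key in self.data:
            return self.data[key], None   # hit: no lease needed
        token = next(self.tokens)
        self.leases[key] = token          # miss: issue a lease to this reader
        return None, token

    def set(self, key, value, token):
        if self.leases.get(key) != token:
            return False                  # lease revoked or superseded: reject the stale set
        self.data[key] = value
        del self.leases[key]
        return True

    def invalidate(self, key):
        self.data.pop(key, None)
        self.leases.pop(key, None)        # revoke any outstanding lease

# Replaying the race above:
cache = LeaseCache()
_, lease = cache.get("price:42")          # T1: Reader A misses, receives a lease
cache.invalidate("price:42")              # T2-T3: Writer B updates DB, invalidates
assert not cache.set("price:42", 24.99, lease)  # T4: the stale set is rejected
```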

Distributed Cache Architecture

Redis Cluster

Redis Cluster partitions data across multiple masters using hash slots (16,384 slots). Each key is assigned to a slot via CRC16(key) % 16384. Each master owns a range of slots and has one or more replicas for failover.

  • Multi-key operations only work if all keys map to the same slot. Use hash tags ({user:42}:profile, {user:42}:sessions) to co-locate related keys.
  • Resharding moves hash slots between nodes. During migration, the source node redirects clients to the target with MOVED or ASK redirections.
  • Failover: When a master fails, its replica is promoted automatically. Cluster-wide gossip protocol detects failures within seconds.
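The slot assignment, including the hash-tag rule, is small enough to sketch directly. This uses the standard CRC16-CCITT/XMODEM polynomial that Redis Cluster specifies; the helper names are ours:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT/XMODEM (poly 0x1021, init 0), as used by Redis Cluster."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    """Slot = CRC16(key) % 16384; only the first non-empty {...} tag is hashed."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:               # non-empty tag between { and }
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

# Hash tags co-locate related keys on the same slot (and thus the same shard):
assert hash_slot("{user:42}:profile") == hash_slot("{user:42}:sessions")
```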

Multi-Layer Cache Invalidation

Real-world systems have multiple cache layers. Invalidation must propagate through all of them:

L1: In-Process Cache (Guava, Caffeine)

Fastest (nanoseconds). Lives in the application's heap. Invalidated by: local events or short TTL. Problem: each instance has its own L1, so they diverge after invalidation.

L2: Distributed Cache (Redis, Memcached)

Shared across all instances (~1ms network hop). Invalidated by: application writes or CDC events. Single source of truth for the application layer.
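The L1/L2 interplay can be sketched as a read-through chain. Dicts stand in for Caffeine and Redis here — an illustrative assumption, not a production pattern:

```python
import time

class LayeredCache:
    def __init__(self, l1_ttl=5.0):
        self.l1 = {}          # key -> (value, expires_at): per-instance, heap-local
        self.l2 = {}          # key -> value: shared tier (Redis in practice)
        self.l1_ttl = l1_ttl

    def get(self, key, load_from_db):
        entry = self.l1.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]                        # L1 hit
        if key in self.l2:
            value = self.l2[key]                   # L2 hit (~1ms hop in practice)
        else:
            value = load_from_db(key)              # miss everywhere: source of truth
            self.l2[key] = value
        self.l1[key] = (value, time.monotonic() + self.l1_ttl)
        return value

    def invalidate(self, key):
        self.l2.pop(key, None)  # clears the shared tier for every instance
        self.l1.pop(key, None)  # but only THIS instance's L1; other instances'
                                # L1 copies converge via the short TTL
```

Note the asymmetry in `invalidate`: it is exactly the L1-divergence problem described above, and the short TTL is what bounds it.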

L3: CDN Edge Cache (Cloudflare, Fastly)

Global, 300+ PoPs. Invalidated by: purge API calls, surrogate keys, or TTL. Latency to purge all edges: 100ms-5 seconds.

L4: Browser Cache (HTTP Cache-Control)

Client-side. You CANNOT invalidate this remotely. Must use immutable URLs with content hashes (app.a3f8b2.js) or short max-age with stale-while-revalidate.

The Browser Cache Trap

Never set Cache-Control: max-age=31536000 on mutable resources (HTML pages, API responses). If you do, users will see stale content for up to a year with no way to fix it — you can't purge their browser cache. Use long max-age ONLY for immutable, content-hashed assets (JS, CSS, images). For HTML: Cache-Control: no-cache (always revalidate) or max-age=60, stale-while-revalidate=600.



CDN Cache Invalidation

CDNs cache content at 300+ global edge locations. Invalidating across all of them is a distributed systems problem. Strategies:

Purge API

Call the CDN's API to delete a specific URL or pattern. Propagation takes 100ms to 5 seconds globally. Rate-limited by CDN providers (Cloudflare: 1,000 purges/min free, unlimited on enterprise).

Surrogate Keys (Cache Tags)

Tag cached responses with custom headers (Surrogate-Key: product-42 category-electronics). When product 42 changes, purge all objects tagged with product-42 in one API call — regardless of URL. Used by Fastly, Varnish, Cloudflare (Cache Tags).
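One-call tag purges work because the CDN maintains a reverse index from tag to cached URLs. A toy model of that index (the function and field names are ours):

```python
def cache_response(url, body, tags, cdn_cache, tag_index):
    """Store a response and index it under each of its surrogate keys."""
    cdn_cache[url] = body
    for tag in tags:
        tag_index.setdefault(tag, set()).add(url)

def purge_by_tag(tag, cdn_cache, tag_index):
    """Purge every URL carrying the tag, regardless of URL shape."""
    purged = tag_index.pop(tag, set())
    for url in purged:
        cdn_cache.pop(url, None)
    return purged
```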

Stale-While-Revalidate

Cache-Control: max-age=60, stale-while-revalidate=600. Serve stale content immediately while fetching a fresh copy in the background. Users see fast responses; freshness converges within seconds. Best for content that doesn't require instant consistency.
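The semantics reduce to a lookup keyed on entry age. For testability this sketch takes `now` as a parameter and refreshes inline; a real CDN refreshes in the background (the function shape is an assumption):

```python
def get_swr(key, cache, fetch, now, max_age=60, swr_window=600):
    entry = cache.get(key)  # (value, fetched_at) or None
    if entry is not None:
        value, fetched_at = entry
        age = now - fetched_at
        if age <= max_age:
            return value, "fresh"                # within max-age: serve as-is
        if age <= max_age + swr_window:
            cache[key] = (fetch(key), now)       # revalidate (async in real CDNs)
            return value, "stale"                # ...but answer with the stale copy now
    cache[key] = (fetch(key), now)               # expired or missing: blocking fetch
    return cache[key][0], "miss"
```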


Lessons from the Trenches

Case Study: Facebook's Memcache Lease System

Facebook's memcache fleet serves billions of reads per second. Their 2013 paper, "Scaling Memcache at Facebook," describes the lease mechanism that solved the stale set problem: on a cache miss, the cache server issues a lease token (throttled, by default, to one new token per key every 10 seconds). Only the request holding the current lease can populate the cache. If the key is invalidated before the set arrives, the lease is revoked — preventing the stale write. This eliminated an entire class of consistency bugs across their infrastructure.

Takeaway: Lease-based caching is the gold standard for preventing stale sets in high-throughput systems. Facebook's Memcache paper is required reading for any cache engineer.

Case Study: Cloudflare's Tiered Cache

Cloudflare's CDN has 300+ PoPs. When a cache miss occurs at an edge node, instead of going directly to the origin server, it first checks a regional tier (a larger cache closer to the origin). This tiered caching dramatically reduces origin load — cache misses at 300 edges become cache hits at ~15 regional nodes. Invalidation uses a fan-out pattern: purge the regional tier, which then propagates to its child edge nodes.

Takeaway: Tiered caching collapses hundreds of edge misses into a handful of regional fetches, dramatically reducing origin load. Invalidation should flow top-down through the cache hierarchy, not broadcast to every node independently.

Case Study: Reddit's Hot Key Problem

When a Reddit post goes viral, millions of users request the same cache key simultaneously. A single Redis shard handling that key becomes a bottleneck. Reddit's solution: local L1 caches with very short TTLs (1-5 seconds) in front of Redis. Hot keys are served from in-process cache, reducing Redis load by 100x. They also use cache warming — when a post trends, proactively populate caches before the traffic spike hits.

Takeaway: Hot keys require special treatment. L1 caches, key replication across shards, and proactive warming prevent single-shard bottlenecks.



What's Next?

NoSQL & Purpose-Built Databases

Document stores, key-value stores, wide-column, and graph databases. When to leave the relational world behind.
