Database Scaling & Distributed Data

The hardest component to scale. Master replication, sharding, and the fundamental tradeoffs of distributed data.

Module 7: Database Scaling & Distributed Data
Track 2: Distributed Shift (3–5 YoE)
Scaling the application layer is easy — add more stateless servers behind a load balancer. Scaling the database is the hardest problem in distributed systems. The database holds your most valuable asset: data. It must be durable, consistent, and available. This module covers the two fundamental techniques for scaling databases — replication and sharding (horizontal partitioning) — along with the theoretical tradeoffs (CAP, PACELC) that govern every design decision.
Why Databases Are the Bottleneck

Application servers are stateless — you can restart, replace, or scale them freely. Databases are stateful — they hold data that must persist across restarts, be consistent across replicas, and survive hardware failures. This statefulness creates three fundamental challenges:

  • Write throughput: A single PostgreSQL instance maxes out at ~10,000 writes/sec on good hardware. If your application generates more, you need multiple write nodes — which introduces consistency problems.
  • Data volume: When your dataset exceeds what fits on a single disk (or in RAM for efficient queries), you must partition it across machines.
  • Availability: A single database server is a single point of failure. Hardware dies. Disks fail. You need replicas to survive failures.

Replication

Replication creates copies of your data on multiple machines. It solves two problems: availability (if one node dies, another has the data) and read scalability (distribute read queries across replicas).

Leader-Follower (Primary-Replica)

The most common replication topology. One node is the leader (primary) — all writes go to it. The leader streams changes to one or more followers (replicas) via the Write-Ahead Log (WAL). Reads can go to any replica.

Synchronous Replication

The leader waits for at least one follower to confirm the write before acknowledging to the client. This guarantees no data loss on leader failure, but increases write latency. In PostgreSQL, this is configured via synchronous_standby_names.
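
For reference, this is roughly how synchronous replication is enabled in PostgreSQL's postgresql.conf on the primary (the standby names are placeholders, not real hosts):

```ini
# postgresql.conf on the primary
# Wait for 1 of the listed standbys to confirm before acknowledging a commit
synchronous_standby_names = 'FIRST 1 (replica_a, replica_b)'
synchronous_commit = on
```

Setting synchronous_commit = off on the same primary would switch it back to asynchronous behavior for that session.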

Asynchronous Replication

The leader acknowledges the write immediately and streams changes to followers in the background. Faster writes, but followers may lag behind (replication lag). If the leader dies before replication, those writes are lost.

Replication Lag: The Silent Bug

With async replication, a user writes data, then immediately reads it. If the read goes to a lagging replica, the user sees stale data — their change appears to have disappeared. Solutions: read-your-writes consistency (route reads to the leader for data the user recently wrote), monotonic reads (always read from the same replica for a given user), or use semi-synchronous replication (at least one follower is synchronous).
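
A minimal sketch of the first solution, read-your-writes routing. The router, node names, and 5-second pinning window are all invented for illustration; the window is an assumed upper bound on replication lag:

```python
import random
import time

class ReadRouter:
    """Route a user's reads to the leader shortly after their own write
    (read-your-writes); otherwise spread reads across replicas."""

    def __init__(self, leader, replicas, pin_seconds=5.0):
        self.leader = leader
        self.replicas = replicas
        self.pin_seconds = pin_seconds   # assumed upper bound on replication lag
        self.last_write_at = {}          # user_id -> time of that user's last write

    def write(self, user_id, payload):
        self.last_write_at[user_id] = time.monotonic()
        return self.leader               # all writes go to the leader

    def pick_read_node(self, user_id):
        wrote_at = self.last_write_at.get(user_id)
        if wrote_at is not None and time.monotonic() - wrote_at < self.pin_seconds:
            return self.leader           # recent writer: must see their own write
        return random.choice(self.replicas)  # anyone else can tolerate lag

router = ReadRouter("leader", ["replica-1", "replica-2"])
router.write("alice", {"name": "Alice"})
print(router.pick_read_node("alice"))    # leader (alice just wrote)
print(router.pick_read_node("bob"))      # some replica
```

The tradeoff: pinning recent writers to the leader shifts read load back onto it, so the window should be no larger than your observed replication lag.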

Multi-Leader Replication

For multi-region deployments, having a single leader in one region means all writes from other regions suffer cross-continent latency. Multi-leader (multi-primary) replication allows writes to any of several leaders, which then asynchronously replicate to each other.

The critical challenge: write conflicts. If User A updates their name to "Alice" on Leader 1, and User B updates the same record to "Bob" on Leader 2 simultaneously, which write wins? Conflict resolution strategies:

  • Last-Write-Wins (LWW): Use timestamps; the latest write wins. Simple but lossy — silently drops writes.
  • Merge: Automatically merge (e.g., CRDT — Conflict-free Replicated Data Types). Works for counters, sets, and specific data structures.
  • Application-level resolution: Present both versions to the user and let them resolve manually (like Git merge conflicts).
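
As a minimal illustration of the merge strategy, here is a grow-only counter CRDT (class and node names are ours). Each node increments only its own slot, and merge takes the per-node maximum, so merging is commutative, associative, and idempotent; concurrent updates can never conflict:

```python
class GCounter:
    """Grow-only counter CRDT: replicas converge under any merge order."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}                 # node_id -> increments seen from that node

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Per-node max: applying the same merge twice changes nothing
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two leaders accept increments concurrently...
a, b = GCounter("leader-1"), GCounter("leader-2")
a.increment(3)
b.increment(2)
# ...then replicate to each other; both converge to the same total.
a.merge(b)
b.merge(a)
print(a.value(), b.value())  # 5 5
```

Note the limitation from the bullet above: this works because "counter" has a natural merge; a free-text field like a user's name does not, which is why LWW or application-level resolution exists.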

Leaderless Replication

In leaderless systems (DynamoDB, Cassandra, Riak), the client writes to multiple nodes simultaneously. A write is considered successful when W (write quorum) out of N nodes acknowledge. Reads query R (read quorum) nodes and take the latest value. As long as W + R > N, the client is guaranteed to read the latest write (quorum overlap).

// Quorum formula: W + R > N guarantees overlap
N = 3 replicas
W = 2 (write to at least 2 nodes)
R = 2 (read from at least 2 nodes)
// W + R = 4 > 3 → guaranteed to see latest write
// Tunable consistency:
W=3, R=1 → Strong consistency writes, fast reads
W=1, R=3 → Fast writes, strong consistency reads
W=1, R=1 → Fastest, but may read stale data
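
The overlap guarantee can be checked by brute force. A short sketch (the function is ours, not from any library) that enumerates every possible write quorum and read quorum and verifies they intersect:

```python
from itertools import combinations

def quorums_overlap(n, w, r):
    """True iff every write quorum of size w shares at least one node
    with every read quorum of size r, given n replicas."""
    nodes = range(n)
    return all(set(wq) & set(rq)
               for wq in combinations(nodes, w)
               for rq in combinations(nodes, r))

# W + R > N  -> every read quorum touches a node that saw the write
print(quorums_overlap(3, 2, 2))  # True  (2 + 2 > 3)
# W + R <= N -> some read quorum can miss the write entirely
print(quorums_overlap(3, 1, 1))  # False (1 + 1 <= 3): stale reads possible
```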

Try It: Replication & Quorums

Visualize quorum reads/writes, adjust W and R values, and see how they affect consistency and availability.


Sharding (Horizontal Partitioning)

Replication creates copies of the same data. Sharding splits different data across different nodes. Each shard holds a subset of rows, and together they hold the complete dataset. Sharding is necessary when your data is too large for a single machine or when write throughput exceeds what one node can handle.

| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Hash-Based | shard = hash(key) % N | Even distribution, simple | Adding shards remaps most keys; range queries impossible |
| Range-Based | Users A–M → Shard 1, N–Z → Shard 2 | Supports range queries, simple lookups | Hotspots if data isn't uniformly distributed |
| Directory-Based | A lookup table maps each key to its shard | Maximum flexibility, easy to rebalance | Directory is a bottleneck and SPOF |
| Geo-Based | US users → US shard, EU → EU shard | Low latency for regional data access | Cross-region queries are expensive |
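
To see why adding shards remaps most keys under hash % N, here is a sketch (the keys are invented) that counts how many of 10,000 keys change shards when a fifth shard is added:

```python
import hashlib

def shard(key, n):
    # Stable hash: Python's built-in hash() is salted per process,
    # so we derive one from SHA-256 instead
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    return h % n

keys = [f"user-{i}" for i in range(10_000)]
moved = sum(shard(k, 4) != shard(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys move going from 4 to 5 shards")
# Roughly 80% (about 1 - 1/5): nearly the whole dataset has to migrate.
```

Consistent hashing and the logical-shard technique in the Instagram case study below both exist to avoid exactly this mass migration.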

Choosing the Shard Key

The shard key is the column used to determine which shard a row belongs to. It's the most critical decision in database sharding:

  • High cardinality: The key should have many unique values. Sharding by country (only ~200 values) creates hotspots. Sharding by user_id (millions of values) distributes evenly.
  • Query alignment: Most queries should include the shard key. If you shard by user_id but often query by order_date, every query must scatter across all shards (scatter-gather) — killing performance.
  • Write distribution: Auto-incrementing IDs as shard keys cause all new writes to go to the latest shard (hotspot). Use UUIDs or hash-based keys for even write distribution.

Cross-Shard Joins Are Forbidden

Once data is sharded, JOINs across shards are extremely expensive — they require fetching data from multiple nodes over the network and merging in the application layer. Design your schema so that data that's queried together lives on the same shard. This may mean denormalizing your data model.
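
A toy illustration of the difference, using three in-memory "shards" keyed by user_id % 3 (the data and helper names are invented): a query that includes the shard key touches one node, while one that doesn't must scatter-gather across all of them.

```python
# Three shards of an orders table, partitioned by user_id % 3
SHARDS = {
    0: [{"user_id": 3, "order_date": "2024-01-03", "total": 40}],
    1: [{"user_id": 1, "order_date": "2024-01-03", "total": 10}],
    2: [{"user_id": 2, "order_date": "2024-01-05", "total": 25}],
}

def shard_for(user_id):
    return user_id % len(SHARDS)

def orders_for_user(user_id):
    # Targeted: the shard key is in the query, so exactly one shard is read
    return [o for o in SHARDS[shard_for(user_id)] if o["user_id"] == user_id]

def orders_on_date(date):
    # Scatter-gather: no shard key, so every shard is queried and the
    # results are merged in the application layer
    return [o for rows in SHARDS.values() for o in rows
            if o["order_date"] == date]

print(len(orders_for_user(1)))        # 1 shard touched
print(len(orders_on_date("2024-01-03")))  # all 3 shards touched
```

In a real system each SHARDS entry is a network round trip to a different database node, which is why the second query pattern gets slower as you add shards.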

Try It: Database Sharding

Visualize how hash-based and range-based sharding distribute data across nodes and see the impact of adding/removing shards.


CAP Theorem

The CAP theorem (Brewer, 2000) states that a distributed data store can provide at most two out of three guarantees simultaneously:

C: Consistency

Every read receives the most recent write. All nodes see the same data at the same time. Linearizable reads — as if there's only one copy of the data.

A: Availability

Every request receives a response (not an error), even if it's not the most recent data. The system never refuses a request.

P: Partition Tolerance

The system continues to operate even if network messages between nodes are dropped or delayed. In a real distributed system, partitions WILL happen.

Since network partitions are inevitable in distributed systems (P is non-negotiable), the real choice is between CP (consistency during partitions — return errors rather than stale data) and AP (availability during partitions — return possibly stale data rather than errors).

| System | CAP Choice | Behavior During Partition |
|---|---|---|
| PostgreSQL (single leader) | CP | Followers become read-only; writes blocked until leader is reachable |
| MongoDB (replica set) | CP | Elects new primary; minority partition becomes read-only |
| Cassandra | AP | Continues serving reads and writes; reconciles conflicts after partition heals |
| DynamoDB | AP | Always available; eventual consistency by default, strong consistency optional |
| CockroachDB / Spanner | CP | Serializable transactions; unavailable during partitions that split quorum |

Try It: CAP Theorem Simulator

Simulate network partitions and see how CP and AP systems respond differently in real-time.


Consistency Models

"Consistency" means different things in different contexts. Here's the spectrum from strongest to weakest:

Linearizability (Strict)

Every operation appears to take effect instantaneously at some point between invocation and response. The strongest guarantee — as if there's one copy of the data. Required for: distributed locks, leader election, financial transactions.

Sequential Consistency

Operations from each client appear in the order that client issued them, but operations from different clients may be interleaved in any order. Weaker than linearizability (no real-time guarantee), and easier to implement.

Causal Consistency

If event A causes event B, all nodes see A before B. Events without a causal relationship can be seen in any order. Used by: MongoDB (causal sessions), CockroachDB.

Eventual Consistency

If no new writes occur, all replicas will eventually converge to the same value. No guarantee about how long "eventually" takes. The weakest guarantee, but the highest performance. Used by: DynamoDB, Cassandra, DNS.

PACELC: Beyond CAP

The PACELC theorem extends CAP: during a Partition, choose between Availability and Consistency; Else (no partition), choose between Latency and Consistency. This captures the real-world tradeoff: even when the network is healthy, you still trade latency for consistency.

| Database | During Partition (PA/PC) | Else (EL/EC) |
|---|---|---|
| DynamoDB | PA (Available) | EL (Low Latency) |
| Cassandra | PA | EL |
| MongoDB | PC (Consistent) | EC (Consistent) |
| Spanner | PC | EC |

Lessons from the Trenches

Case Study: Instagram's Sharding (2012)

Instagram sharded their PostgreSQL database across multiple machines using user_id as the shard key. Each shard is a complete PostgreSQL instance with its own replica. They used a mapping function: shard_id = user_id % 4096 (logical shards), then mapped logical shards to physical machines. This allowed them to rebalance by moving logical shards between machines without changing the mapping function.

Takeaway: Use logical shards (thousands) mapped to physical machines (dozens). Rebalancing becomes moving logical shards between machines, not remapping individual keys.
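
The pattern can be sketched in a few lines. The host names and round-robin placement are invented; only the fixed user_id % 4096 mapping comes from the case study:

```python
LOGICAL_SHARDS = 4096

def logical_shard(user_id):
    # Fixed mapping function: a user's logical shard never changes
    return user_id % LOGICAL_SHARDS

# Logical -> physical placement is just configuration. Rebalancing edits
# this table (moving whole logical shards), never the function above.
placement = {ls: f"db-{ls % 2}" for ls in range(LOGICAL_SHARDS)}  # 2 machines

def host_for(user_id):
    return placement[logical_shard(user_id)]

before = host_for(12345)
# Capacity added: move every third logical shard onto a new machine
for ls in range(0, LOGICAL_SHARDS, 3):
    placement[ls] = "db-2"
print(before, "->", host_for(12345))
```

Contrast this with hash(key) % N over physical machines: there, adding a machine changes the mapping function itself and remaps most individual keys.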

Case Study: Google Spanner's TrueTime

Google Spanner achieved the impossible: a globally distributed database with external consistency (linearizability). The secret ingredient: TrueTime, a GPS/atomic clock API that gives a bounded time uncertainty interval. By waiting out the uncertainty (usually <7ms), Spanner guarantees that if transaction T1 commits before T2 starts, T1's timestamp is earlier — even across continents. This eliminates the need for consensus on every read.
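
A toy sketch of the commit-wait idea (this is not the real TrueTime API; the ~7 ms uncertainty is taken from the figure above, and all names are ours):

```python
import time

def truetime_now(epsilon=0.007):
    """TrueTime-style interval: real time is somewhere in [earliest, latest]."""
    t = time.time()
    return (t - epsilon, t + epsilon)

def commit_wait(commit_timestamp):
    # Block until the timestamp is definitely in the past on every node,
    # i.e. until even the earliest possible "true" time has passed it.
    while truetime_now()[0] <= commit_timestamp:
        time.sleep(0.001)

earliest, latest = truetime_now()
ts = latest          # assign the commit a timestamp at the latest bound
commit_wait(ts)      # wait out the uncertainty (about 2 * epsilon)
# Any transaction starting after this returns is guaranteed a later timestamp.
```

The cost of the guarantee is visible here: every commit pays a wait proportional to the clock uncertainty, which is why Google invests in GPS and atomic clocks to keep epsilon small.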

Takeaway: Spanner proves that you CAN have global consistency, but it requires specialized hardware (GPS + atomic clocks in every datacenter). CockroachDB approximates this using NTP with hybrid logical clocks.

Case Study: DynamoDB's Design

Amazon's Dynamo paper (2007) introduced many concepts now standard in distributed databases: consistent hashing for partitioning, vector clocks for conflict detection, sloppy quorums for availability during failures, and anti-entropy (Merkle trees) for replica synchronization. DynamoDB, the managed service that grew out of this work, serves millions of requests per second for Amazon's shopping cart, checkout, and recommendations — all with single-digit millisecond latency.

Takeaway: The Dynamo paper is a masterclass in choosing availability over consistency. If your application can tolerate eventual consistency, the AP design enables extreme scalability.



What's Next?

Database Indexing

B-Trees, hash indexes, composite keys, covering indexes — make your queries fly with optimal data structures.
