Database Scaling & Distributed Data
The hardest component to scale. Master replication, sharding, and the fundamental tradeoffs of distributed data.
Application servers are stateless — you can restart, replace, or scale them freely. Databases are stateful — they hold data that must persist across restarts, be consistent across replicas, and survive hardware failures. This statefulness creates three fundamental challenges:
- Write throughput: A single PostgreSQL instance maxes out at ~10,000 writes/sec on good hardware. If your application generates more, you need multiple write nodes — which introduces consistency problems.
- Data volume: When your dataset exceeds what fits on a single disk (or in RAM for efficient queries), you must partition it across machines.
- Availability: A single database server is a single point of failure. Hardware dies. Disks fail. You need replicas to survive failures.
Replication creates copies of your data on multiple machines. It solves two problems: availability (if one node dies, another has the data) and read scalability (distribute read queries across replicas).
Leader-Follower (Primary-Replica)
The most common replication topology. One node is the leader (primary) — all writes go to it. The leader streams changes to one or more followers (replicas) via the Write-Ahead Log (WAL). Reads can go to any replica.
Synchronous Replication
The leader waits for at least one follower to confirm the write before acknowledging to the client. This guarantees no data loss on leader failure, but increases write latency. In PostgreSQL, this is configured via the `synchronous_standby_names` setting.
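As an illustrative sketch (the standby names below are made up), synchronous replication in PostgreSQL is controlled by settings like these in `postgresql.conf` on the leader:

```ini
# postgresql.conf on the leader -- standby names are illustrative
# Wait for any ONE of the two named standbys to confirm each commit:
synchronous_standby_names = 'FIRST 1 (replica_a, replica_b)'
# Commits return only after the synchronous standby has confirmed:
synchronous_commit = on
```

Setting `synchronous_commit = off` (or clearing `synchronous_standby_names`) yields asynchronous replication: faster commits, at the cost of possible data loss on leader failure.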
Asynchronous Replication
The leader acknowledges the write immediately and streams changes to followers in the background. Faster writes, but followers may lag behind (replication lag). If the leader dies before replication, those writes are lost.
With async replication, a user writes data, then immediately reads it. If the read goes to a lagging replica, the user sees stale data: their change appears to have disappeared. Solutions: read-your-writes consistency (route reads to the leader for data the user recently wrote), monotonic reads (always read from the same replica for a given user), or semi-synchronous replication (at least one follower is synchronous).
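The read-your-writes strategy can be sketched in a few lines: track each user's last write, and route their reads to the leader until the replicas have had time to catch up. This is a minimal illustration (the class name, threshold, and node labels are invented for the example):

```python
import time

class ReadYourWritesRouter:
    """Route reads to the leader for users who wrote recently (illustrative sketch)."""

    def __init__(self, max_lag_seconds=5.0):
        self.max_lag = max_lag_seconds
        self.last_write_at = {}  # user_id -> monotonic timestamp of last write

    def record_write(self, user_id):
        self.last_write_at[user_id] = time.monotonic()

    def choose_node(self, user_id):
        wrote = self.last_write_at.get(user_id)
        # Recent writers must read from the leader; everyone else can hit a replica.
        if wrote is not None and time.monotonic() - wrote < self.max_lag:
            return "leader"
        return "replica"

router = ReadYourWritesRouter()
router.record_write("alice")
print(router.choose_node("alice"))  # recent writer -> "leader"
print(router.choose_node("bob"))    # no recent write -> "replica"
```

In production the "recently wrote" signal is often a replication-lag measurement or a session token rather than a fixed timeout, but the routing decision has this shape.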
Multi-Leader Replication
For multi-region deployments, having a single leader in one region means all writes from other regions suffer cross-continent latency. Multi-leader (multi-primary) replication allows writes to any of several leaders, which then asynchronously replicate to each other.
The critical challenge: write conflicts. If User A updates their name to "Alice" on Leader 1, and User B updates the same record to "Bob" on Leader 2 simultaneously, which write wins? Conflict resolution strategies:
- Last-Write-Wins (LWW): Use timestamps; the latest write wins. Simple but lossy — silently drops writes.
- Merge: Automatically merge (e.g., CRDT — Conflict-free Replicated Data Types). Works for counters, sets, and specific data structures.
- Application-level resolution: Present both versions to the user and let them resolve manually (like Git merge conflicts).
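Last-Write-Wins is the simplest of these strategies to show in code. A minimal sketch (the record format is invented for the example) makes its lossiness obvious: whichever version carries the later timestamp survives, and the other write vanishes silently:

```python
def last_write_wins(versions):
    """Resolve a multi-leader conflict by keeping the version with the
    latest timestamp. Lossy: the concurrent write is silently dropped."""
    return max(versions, key=lambda v: v["ts"])

conflict = [
    {"value": "Alice", "ts": 1700000001.5},  # written on Leader 1
    {"value": "Bob",   "ts": 1700000001.9},  # written on Leader 2
]
print(last_write_wins(conflict)["value"])  # Leader 1's write is lost
```

Note that LWW also depends on clocks being comparable across leaders; with skewed clocks, the "latest" write may not be the one the user made last.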
Leaderless Replication
In leaderless systems (DynamoDB, Cassandra, Riak), the client writes to multiple nodes simultaneously. A write is considered successful when W (write quorum) out of N nodes acknowledge. Reads query R (read quorum) nodes and take the latest value. As long as W + R > N, the client is guaranteed to read the latest write (quorum overlap).
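The quorum-overlap argument can be made concrete with a small sketch (replica records and version numbers are invented for the example). With N=3 and W=2, a write reaches two replicas; any read quorum of R=2 must intersect that write set, so taking the highest version among the responses returns the latest value:

```python
def quorums_overlap(n, w, r):
    """W + R > N guarantees every read quorum intersects every write quorum."""
    return w + r > n

def quorum_read(responses):
    """Take the value carried by the highest version among the replicas read."""
    return max(responses, key=lambda rep: rep["version"])["value"]

# N=3 nodes; a write with W=2 reached the first two before one replica lagged.
replicas = [
    {"version": 2, "value": "new"},
    {"version": 2, "value": "new"},
    {"version": 1, "value": "old"},  # lagging replica
]

assert quorums_overlap(n=3, w=2, r=2)
# Any 2-of-3 read quorum includes at least one replica at version 2:
print(quorum_read(replicas[1:3]))  # even the quorum containing the laggard
```

With W + R <= N (say W=1, R=1 here), a read could land entirely on the lagging replica and return stale data, which is exactly the availability-for-consistency trade these systems let you tune.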
Try It: Replication & Quorums
Visualize quorum reads/writes, adjust W and R values, and see how they affect consistency and availability.
Replication creates copies of the same data. Sharding splits different data across different nodes. Each shard holds a subset of rows, and together they hold the complete dataset. Sharding is necessary when your data is too large for a single machine or when write throughput exceeds what one node can handle.
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Hash-Based | shard = hash(key) % N | Even distribution, simple | Adding shards remaps most keys; range queries impossible |
| Range-Based | Users A-M → Shard 1, N-Z → Shard 2 | Supports range queries, simple lookups | Hotspots if data isn't uniformly distributed |
| Directory-Based | A lookup table maps each key to its shard | Maximum flexibility, easy to rebalance | Directory is a bottleneck and SPOF |
| Geo-Based | US users → US shard, EU → EU shard | Low latency for regional data access | Cross-region queries are expensive |
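The first two strategies in the table can be sketched directly (shard count and letter ranges are invented for the example). Hash-based sharding spreads keys evenly but scatters adjacent keys; range-based sharding keeps adjacent keys together but inherits any skew in the data:

```python
import hashlib

NUM_SHARDS = 4

def hash_shard(key):
    """Hash-based: shard = hash(key) % N. Even spread, but adding a
    shard changes N and remaps most keys."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def range_shard(name):
    """Range-based: contiguous key ranges per shard. Range scans stay on
    one shard, but popular ranges become hotspots."""
    ranges = [("a", "g"), ("h", "m"), ("n", "s"), ("t", "z")]
    first = name[0].lower()
    for shard_id, (lo, hi) in enumerate(ranges):
        if lo <= first <= hi:
            return shard_id
    raise ValueError(f"no range covers {name!r}")

print(hash_shard(12345))     # deterministic shard in 0..3
print(range_shard("alice"))  # 'a' falls in the a-g range
```

The `% N` in `hash_shard` is why naive hash sharding reshuffles nearly everything when N changes; consistent hashing (used by Dynamo-style systems) exists to fix exactly that.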
Choosing the Shard Key
The shard key is the column used to determine which shard a row belongs to. It's the most critical decision in database sharding:
- High cardinality: The key should have many unique values. Sharding by `country` (only ~200 values) creates hotspots; sharding by `user_id` (millions of values) distributes evenly.
- Query alignment: Most queries should include the shard key. If you shard by `user_id` but often query by `order_date`, every query must scatter across all shards (scatter-gather), killing performance.
- Write distribution: Auto-incrementing IDs as shard keys send all new writes to the latest shard (a hotspot). Use UUIDs or hash-based keys for even write distribution.
Once data is sharded, JOINs across shards are extremely expensive — they require fetching data from multiple nodes over the network and merging in the application layer. Design your schema so that data that's queried together lives on the same shard. This may mean denormalizing your data model.
Try It: Database Sharding
Visualize how hash-based and range-based sharding distribute data across nodes and see the impact of adding/removing shards.
The CAP theorem (Brewer, 2000) states that a distributed data store can provide at most two out of three guarantees simultaneously:
C: Consistency
Every read receives the most recent write. All nodes see the same data at the same time. Linearizable reads — as if there's only one copy of the data.
A: Availability
Every request receives a response (not an error), even if it's not the most recent data. The system never refuses a request.
P: Partition Tolerance
The system continues to operate even if network messages between nodes are dropped or delayed. In a real distributed system, partitions WILL happen.
Since network partitions are inevitable in distributed systems (P is non-negotiable), the real choice is between CP (consistency during partitions — return errors rather than stale data) and AP (availability during partitions — return possibly stale data rather than errors).
| System | CAP Choice | Behavior During Partition |
|---|---|---|
| PostgreSQL (single leader) | CP | Followers become read-only; writes blocked until leader is reachable |
| MongoDB (replica set) | CP | Elects new primary; minority partition becomes read-only |
| Cassandra | AP | Continues serving reads and writes; reconciles conflicts after partition heals |
| DynamoDB | AP | Always available; eventual consistency by default, strong consistency optional |
| CockroachDB / Spanner | CP | Serializable transactions; unavailable during partitions that split quorum |
Try It: CAP Theorem Simulator
Simulate network partitions and see how CP and AP systems respond differently in real-time.
"Consistency" means different things in different contexts. Here's the spectrum from strongest to weakest:
Linearizability (Strict)
Every operation appears to take effect instantaneously at some point between invocation and response. The strongest guarantee — as if there's one copy of the data. Required for: distributed locks, leader election, financial transactions.
Sequential Consistency
Operations from each client appear in the order they were issued, but operations from different clients may be interleaved. Weaker than linearizable but easier to implement.
Causal Consistency
If event A causes event B, all nodes see A before B. Events without a causal relationship can be seen in any order. Used by: MongoDB (causal sessions), CockroachDB.
Eventual Consistency
If no new writes occur, all replicas will eventually converge to the same value. No guarantee about how long "eventually" takes. The weakest guarantee, but the highest performance. Used by: DynamoDB, Cassandra, DNS.
PACELC: Beyond CAP
The PACELC theorem extends CAP: during a Partition, choose between Availability and Consistency; Else (no partition), choose between Latency and Consistency. This captures the real-world tradeoff: even when the network is healthy, you still trade latency for consistency.
| Database | During Partition (PA/PC) | Else (EL/EC) |
|---|---|---|
| DynamoDB | PA (Available) | EL (Low Latency) |
| Cassandra | PA | EL |
| MongoDB | PC (Consistent) | EC (Consistent) |
| Spanner | PC | EC |
Case Study: Instagram's Sharding (2012)
Instagram sharded their PostgreSQL database across multiple machines using `user_id` as the shard key. Each shard is a complete PostgreSQL instance with its own replica. They used a mapping function, `shard_id = user_id % 4096` (logical shards), then mapped logical shards to physical machines. This allowed them to rebalance by moving logical shards between machines without changing the mapping function.
Takeaway: Use logical shards (thousands) mapped to physical machines (dozens). Rebalancing becomes moving logical shards between machines, not remapping individual keys.
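The logical-to-physical indirection can be sketched in a few lines (the machine count and mapping table here are invented for the example; only the `% 4096` function comes from Instagram's write-up):

```python
LOGICAL_SHARDS = 4096  # fixed forever: the mapping function never changes

# Logical shards assigned to a handful of physical machines (illustrative).
logical_to_physical = {shard: shard % 8 for shard in range(LOGICAL_SHARDS)}

def locate(user_id):
    logical = user_id % LOGICAL_SHARDS            # fixed mapping function
    return logical, logical_to_physical[logical]  # only this table changes

logical, machine = locate(9001)
print(logical, machine)

# Rebalancing = reassigning a logical shard, not remapping individual keys:
logical_to_physical[logical] = 8   # move this logical shard to a new machine
assert locate(9001)[0] == logical  # the user's logical shard never changes
```

Because `% 4096` is immutable, no key ever needs rehashing; capacity changes are pure metadata updates to the logical-to-physical table.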
Case Study: Google Spanner's TrueTime
Google Spanner achieved the impossible: a globally distributed database with external consistency (linearizability). The secret ingredient: TrueTime, a GPS/atomic clock API that gives a bounded time uncertainty interval. By waiting out the uncertainty (usually <7ms), Spanner guarantees that if transaction T1 commits before T2 starts, T1's timestamp is earlier — even across continents. This eliminates the need for consensus on every read.
Takeaway: Spanner proves that you CAN have global consistency, but it requires specialized hardware (GPS + atomic clocks in every datacenter). CockroachDB approximates this using NTP with hybrid logical clocks.
Case Study: DynamoDB's Design
Amazon's Dynamo paper (2007) introduced many concepts now standard in distributed databases: consistent hashing for partitioning, vector clocks for conflict detection, sloppy quorums for availability during failures, and anti-entropy (Merkle trees) for replica synchronization. DynamoDB serves millions of requests per second for Amazon's shopping cart, checkout, and recommendations — all with single-digit millisecond latency.
Takeaway: The Dynamo paper is a masterclass in choosing availability over consistency. If your application can tolerate eventual consistency, the AP design enables extreme scalability.
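Of the Dynamo techniques listed above, consistent hashing is the most compact to sketch. In this minimal version (class name, virtual-node count, and node labels are invented for the example), nodes and keys hash onto the same ring, and each key belongs to the next node clockwise, so adding or removing a node moves only the keys adjacent to it:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring, a sketch of the Dynamo partitioning idea."""

    def __init__(self, nodes, vnodes=100):
        # Each node gets many virtual positions for a smoother distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # The first ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect(self._positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("cart:user:42"))  # same key always maps to the same node
```

Compare this with `hash(key) % N`: removing a node here reassigns only that node's arc of the ring, roughly 1/N of the keys, instead of reshuffling nearly all of them.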
- Designing Data-Intensive Applications by Martin Kleppmann — Chapters 5-9 are the definitive reference on replication, partitioning, transactions, and consistency. (O'Reilly, 2017)
- Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007) — The paper that shaped modern NoSQL databases.
- Spanner: Google's Globally Distributed Database (OSDI 2012) — How Google achieved global consistency with TrueTime.
- Jepsen — Distributed Systems Safety Research — Kyle Kingsbury's testing framework that has found bugs in nearly every distributed database.
- Sharding & IDs at Instagram — How Instagram generates globally unique, time-sortable IDs across shards.
- CAP Twelve Years Later by Eric Brewer — The original CAP theorem author's 2012 clarification on what CAP really means. (IEEE Computer, 2012)
- Living Without Atomic Clocks — CockroachDB — How CockroachDB approximates Spanner without GPS clocks.
All Hands-on Resources
Reinforce these concepts with interactive simulators and visual deep-dives.
What's Next?
Database Indexing
B-Trees, hash indexes, composite keys, covering indexes — make your queries fly with optimal data structures.
Continue Journey