NoSQL & Purpose-Built Databases

Module 11: NoSQL & Purpose-Built Databases

Track 3: Global Scale (6+ YoE)

Relational databases (PostgreSQL, MySQL) are remarkable general-purpose tools. But not every problem is best solved with tables, joins, and ACID transactions. NoSQL databases trade some relational features for specific strengths: massive write throughput, flexible schemas, graph traversals, or sub-millisecond lookups. This module covers each major NoSQL category, when to choose it, and when to combine multiple databases (polyglot persistence).

Why Not Just Use Postgres?

PostgreSQL can handle an astonishing range of workloads — JSONB, full-text search, pub/sub, even time-series with TimescaleDB. But it has fundamental limitations at extreme scale:

Write scalability: PostgreSQL has a single leader for writes. At millions of writes/sec, you need multi-leader or leaderless replication — which relational databases weren't designed for.
Schema rigidity: Some ALTER TABLE operations (those that rewrite the table, such as changing a column's type) can lock a 500M-row table for minutes. If your data model evolves rapidly (e.g., user-generated content, IoT events), a schemaless store is faster to iterate with.
Access patterns: If you always read by primary key and never join, you're paying the overhead of a relational query planner for nothing. A key-value store is simpler to operate and typically faster for that pattern.
Specialized queries: Graph traversals (6 degrees of separation), geospatial nearest-neighbor, full-text relevance ranking — each has purpose-built databases optimized for that specific operation.

The Default Answer Is Still Relational

Start with PostgreSQL until you have a specific reason not to. Most applications never outgrow a well-tuned relational database. NoSQL is for when you've identified a specific access pattern or scale requirement that relational databases can't efficiently support.

Document Stores

Store data as self-contained JSON/BSON documents. Each document can have a different structure — no fixed schema. Queries can filter on any field, including nested objects and arrays.

MongoDB

The most popular document database. BSON storage, rich query language, aggregation pipeline, multi-document ACID transactions (since v4.0). Sharding via mongos router.

Best for: Content management, user profiles, product catalogs, real-time analytics, rapid prototyping.

CouchDB / PouchDB

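Built for offline-first architectures. Multi-version concurrency control (MVCC) lets replicas accept writes independently; on sync, CouchDB detects conflicting revisions, deterministically picks a winner, and keeps the losing revisions so the application can resolve them. PouchDB runs in the browser and syncs with CouchDB when online.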

Best for: Mobile apps, offline-first PWAs, field data collection.

// MongoDB document: no fixed schema
{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "name": "Alice",
  "email": "alice@example.com",
  "addresses": [  // Embedded array, so no JOIN is needed
    { "type": "home", "city": "SF", "zip": "94105" },
    { "type": "work", "city": "NYC", "zip": "10001" }
  ],
  "metadata": { "signupSource": "referral", "tier": "premium" }
}
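To make "filter on any field, including nested objects and arrays" concrete, here is a minimal in-memory sketch in Python. It is not MongoDB's implementation or API; the helper names `get_path` and `matches` are hypothetical, and only equality on dotted paths is modeled.

```python
def get_path(doc, path):
    """Return every value reachable at a dotted path, fanning out across arrays."""
    values = [doc]
    for key in path.split("."):
        next_values = []
        for value in values:
            # An array is searched element by element.
            candidates = value if isinstance(value, list) else [value]
            for item in candidates:
                if isinstance(item, dict) and key in item:
                    next_values.append(item[key])
        values = next_values
    return values


def matches(doc, query):
    """True if, for every dotted path in the query, some value at that path equals the target."""
    for path, expected in query.items():
        hits = []
        for value in get_path(doc, path):
            # A leaf that is itself an array matches if any element matches.
            hits.extend(value if isinstance(value, list) else [value])
        if expected not in hits:
            return False
    return True


alice = {
    "name": "Alice",
    "addresses": [
        {"type": "home", "city": "SF", "zip": "94105"},
        {"type": "work", "city": "NYC", "zip": "10001"},
    ],
    "metadata": {"signupSource": "referral", "tier": "premium"},
}
```

With this sketch, `matches(alice, {"addresses.city": "NYC"})` holds without any join, because the addresses travel inside the document.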

Key-Value Stores

The simplest data model: a hash map. GET(key) → value, SET(key, value). No queries by any field except the key. Extremely fast (sub-millisecond), horizontally scalable, and simple to operate.

Database	Data Location	Durability	Best For
Redis	In-memory (+ optional disk persistence)	Optional (AOF/RDB)	Caching, sessions, leaderboards, pub/sub, rate limiting
DynamoDB	SSD (managed by AWS)	Durable (3-AZ replication)	Serverless apps, shopping carts, user preferences
Memcached	In-memory only	None (volatile)	Simple caching (no data structures, no persistence)
etcd	Disk (Raft consensus)	Strong (linearizable)	Configuration store, service discovery (used by K8s)
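The GET/SET model is small enough to sketch in a few lines. The Python class below is an illustrative in-memory store with optional per-key expiry, loosely in the spirit of a cache with TTLs. The class name `TTLStore` and its injectable `clock` parameter are assumptions made for this example, not any product's API.

```python
import time


class TTLStore:
    """In-memory key-value store with optional per-key time-to-live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._data = {}       # key -> (value, expires_at or None)

    def set(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]   # lazy expiration: evict when the key is next read
            return None
        return value
```

Lookups touch exactly one hash-map entry, which is why this access pattern stays fast and is easy to partition across nodes by key.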

Wide-Column Stores

Data is organized by row key and column families. Each row can have different columns. Designed for massive write throughput and time-series data. Data is stored sorted (by row key in Bigtable and HBase, by clustering columns within each partition in Cassandra), enabling efficient range scans.

Cassandra

AP system (availability over consistency). Masterless ring architecture — every node is equal. Tunable consistency per query. CQL query language (SQL-like).

Best for: Write-heavy workloads, time-series, IoT, messaging (Discord, Netflix, Apple).
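Tunable consistency is usually reasoned about with the overlap rule: if the number of replicas that acknowledge a write plus the number consulted on a read exceeds the replication factor, every read overlaps at least one replica holding the latest acknowledged write. The Python sketch below encodes that arithmetic only. The function names are hypothetical, and it ignores failure modes such as partially applied writes.

```python
def quorum(replication_factor):
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_acks, read_acks):
    """Overlap rule: read and write replica sets must intersect (R + W > N)."""
    return write_acks + read_acks > replication_factor
```

For a replication factor of 3, quorum writes and quorum reads (2 + 2 > 3) overlap, while writing and reading at a single replica (1 + 1) does not, which trades consistency for latency and availability.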

Google Bigtable / HBase

CP system. Designed for petabyte-scale structured data. Powers Google Search, Maps, Gmail. HBase is an open-source implementation of the Bigtable design that runs on HDFS.

Best for: Analytics at petabyte scale, ML feature stores, time-series at extreme scale.

-- Cassandra data model: partition key + clustering columns
CREATE TABLE messages (
    channel_id UUID,      -- partition key (determines shard)
    message_id TIMEUUID,  -- clustering column (sorted within partition)
    author TEXT,
    content TEXT,
    PRIMARY KEY (channel_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
-- All messages for a channel are on one partition → fast reads
-- Sorted by time → latest messages first without sorting
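The effect of a partition key plus a clustering column can be imitated in memory. The Python sketch below is an analogy, not how Cassandra stores data: the class `MessageTable` is hypothetical, and plain integers stand in for TIMEUUID values.

```python
import bisect
from collections import defaultdict


class MessageTable:
    """Rows grouped by partition key and kept sorted by clustering key."""

    def __init__(self):
        # channel_id -> list of (message_id, content), ascending by message_id
        self._partitions = defaultdict(list)

    def insert(self, channel_id, message_id, content):
        # Keep the partition sorted at write time so reads never have to sort.
        bisect.insort(self._partitions[channel_id], (message_id, content))

    def latest(self, channel_id, limit):
        """Newest `limit` messages of one channel, newest first."""
        if limit <= 0:
            return []
        rows = self._partitions.get(channel_id, [])
        return rows[-limit:][::-1]
```

A "latest N messages" read touches a single partition and slices an already sorted list, which mirrors why the table above answers that query without a sort step.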

Graph Databases

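Optimized for data with complex relationships. Data is stored as nodes (entities) and edges (relationships). With native graph storage, following a relationship is a direct pointer lookup whose cost does not depend on the total size of the graph, whereas each hop expressed as a SQL JOIN needs an index lookup or scan whose cost grows with the table.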

Neo4j

The most popular graph database. Cypher query language. Native graph storage (index-free adjacency). ACID transactions.

Best for: Social networks, recommendation engines, fraud detection, knowledge graphs.

Amazon Neptune / Dgraph

Managed graph databases. Neptune supports both property graph (Gremlin) and RDF (SPARQL). Dgraph is a distributed graph DB designed for horizontal scaling.

Best for: Enterprise knowledge graphs, identity resolution, network topology.

// Cypher (Neo4j): find friends-of-friends who like Python
MATCH (me:User {name: "Alice"})-[:FRIEND]->(friend)-[:FRIEND]->(fof)
WHERE (fof)-[:LIKES]->(:Topic {name: "Python"})
  AND NOT (me)-[:FRIEND]->(fof)
  AND fof <> me
RETURN fof.name, COUNT(*) AS mutual_friends
ORDER BY mutual_friends DESC LIMIT 10;
// In SQL, this needs several self-joins on the friendship table plus an anti-join,
// and each additional hop adds another join
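The same traversal can be sketched over plain adjacency sets, which is roughly what index-free adjacency buys: each hop is a direct lookup from a node to its neighbors. This Python version is an illustration only. The function name and the `friends` and `likes` dictionaries are assumed shapes, not a Neo4j API. It also skips the starting user, who would otherwise appear as their own friend-of-friend when friendships are stored in both directions.

```python
from collections import Counter


def friends_of_friends(friends, likes, me, topic, limit=10):
    """Rank non-friends reachable in two hops who like `topic`, by mutual friend count."""
    direct = friends.get(me, set())
    counts = Counter()
    for friend in direct:                       # hop 1
        for fof in friends.get(friend, set()):  # hop 2
            if fof == me or fof in direct:
                continue
            if topic in likes.get(fof, set()):
                counts[fof] += 1                # one more mutual friend
    return counts.most_common(limit)


friends = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice", "Dave", "Erin"},
    "Carol": {"Alice", "Dave"},
}
likes = {"Dave": {"Python"}, "Erin": {"Go"}}
```

The work done is proportional to the neighborhoods actually visited, not to the total number of users or friendships.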

Specialized Databases

Search Engines: Elasticsearch / OpenSearch

Inverted index-based full-text search with relevance scoring (TF-IDF, BM25). Also used for log aggregation (ELK stack), APM, and real-time analytics. NOT a primary database — use as a secondary read store synced via CDC.
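A toy version of an inverted index with BM25 ranking fits in a short Python class. This is a sketch under stated assumptions: whitespace tokenization, the commonly used defaults k1 = 1.2 and b = 0.75, and the smoothed IDF form ln(1 + (N - n + 0.5) / (n + 0.5)). The class name `TinySearchIndex` is hypothetical and nothing here reflects Elasticsearch internals.

```python
import math
from collections import Counter, defaultdict


class TinySearchIndex:
    def __init__(self, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of terms

    def add(self, doc_id, text):
        terms = text.lower().split()
        self.doc_len[doc_id] = len(terms)
        for term, tf in Counter(terms).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        """Return (doc_id, score) pairs, best match first."""
        n_docs = len(self.doc_len)
        if n_docs == 0:
            return []
        avg_len = sum(self.doc_len.values()) / n_docs
        scores = defaultdict(float)
        for term in query.lower().split():
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                # Length normalization: long documents need more hits to score as well.
                norm = tf + self.k1 * (1 - self.b + self.b * self.doc_len[doc_id] / avg_len)
                scores[doc_id] += idf * tf * (self.k1 + 1) / norm
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The query only visits documents listed under its terms, which is the property that lets an inverted index avoid scanning every document.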

Time-Series: InfluxDB / TimescaleDB / ClickHouse

Optimized for append-only, time-ordered data (metrics, IoT sensor data, financial ticks). Features: automatic data retention policies, downsampling (1s → 1min → 1hr), efficient range queries. ClickHouse is column-oriented, which often makes analytical scans and aggregations orders of magnitude faster than on a row store.
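Downsampling is easy to picture as bucketing by time and aggregating each bucket. The Python sketch below averages points into fixed-width windows. The function name `downsample` and the `(unix_timestamp, value)` tuple shape are assumptions for illustration; real time-series databases typically run this as a background or continuous job.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (unix_timestamp, value) points into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) sorted by bucket_start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - ts % bucket_seconds   # align to the window boundary
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]
```

Applying the function again to its own output with a larger window gives coarser rollups, although averaging averages weights every bucket equally regardless of how many raw points it held.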

Vector Databases: Pinecone / Weaviate / Milvus

Store high-dimensional vectors (embeddings from ML models) and perform approximate nearest-neighbor (ANN) search. They power semantic search, recommendation systems, and RAG (retrieval-augmented generation) for LLMs. One of the fastest-growing database categories.
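Nearest-neighbor search is simplest to see in its exact, brute-force form. The Python sketch below ranks stored vectors by cosine similarity to a query. It is deliberately not an ANN index: vector databases use approximate structures to avoid comparing against every vector. The function names and the tiny two-dimensional vectors are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query, vectors, k=3):
    """Return the ids of the k stored vectors most similar to `query` (exact search)."""
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in vectors.items()]
    scored.sort(reverse=True)   # highest similarity first
    return [item_id for _, item_id in scored[:k]]


vectors = {
    "shoe": [1.0, 0.0],
    "boot": [0.9, 0.1],
    "laptop": [0.0, 1.0],
}
```

This scan costs one comparison per stored vector, which is the cost that approximate indexes trade a little recall to avoid.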

Ledger Databases: Amazon QLDB / Hyperledger

Immutable, cryptographically verifiable transaction logs. Every change is recorded permanently and can be audited. Used in: financial systems, supply chain, healthcare records.
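The idea behind a verifiable log can be shown with a simple hash chain, where each entry's hash covers the previous entry's hash. The Python sketch below is an illustration of that principle only. It is not the data structure any particular ledger product uses, and the function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64   # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, payload):
    # Canonical JSON so the same payload always hashes to the same digest.
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + body).hexdigest()


def append_entry(ledger, payload):
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": entry_hash(prev_hash, payload),
    })


def verify(ledger):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each hash depends on everything before it, changing an old record requires rewriting every later entry, which an auditor holding the latest hash would detect.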

Polyglot Persistence

Modern systems use multiple databases, each optimized for its specific access pattern. A single e-commerce platform might use:

Data	Database	Why
Orders, payments	PostgreSQL	ACID transactions, referential integrity
Product catalog	MongoDB	Flexible schema (varying product attributes)
Sessions, cart	Redis	Sub-ms lookups, TTL-based expiration
Product search	Elasticsearch	Full-text search with facets and relevance
Recommendations	Neo4j + Pinecone	Graph for "bought together," vectors for "similar items"
Metrics, logs	ClickHouse	Column-oriented; fast scans and aggregations for analytical queries

Polyglot Persistence Cost

Every new database adds operational overhead: monitoring, backups, security patching, connection pooling, team expertise. Don't use 6 databases because you can — use them because you must. The "boring" answer (just use PostgreSQL) is often correct. Add specialized databases only when PostgreSQL becomes a clear bottleneck for that specific workload.

Lessons from the Trenches

Case Study: Discord's Message Storage

Discord stores trillions of messages, originally in Cassandra. Messages are partitioned by channel and sorted within each partition by snowflake ID (time-ordered), which makes "fetch the last 50 messages in channel X" an efficient query. But Discord found that hot partitions, compaction backlogs, and JVM garbage-collection pauses caused latency spikes. Their solution was to migrate to ScyllaDB (a Cassandra-compatible database written in C++). Discord reported p99 read latency dropping from 40-125ms to about 15ms, with the cluster shrinking from 177 Cassandra nodes to 72 ScyllaDB nodes.

Takeaway: Wide-column stores excel at time-series messaging data. But implementation quality matters as much as architecture: the same data model on a Cassandra-compatible engine delivered several times lower tail latency on fewer nodes.

Case Study: LinkedIn's Graph of 900M Members

LinkedIn uses a custom graph database to power "People You May Know," connection suggestions, and recruiter search. The social graph has 900M+ member nodes and billions of edges (connections, follows, company affiliations). Traversals like "2nd-degree connections who work at Google and know Java" require multi-hop graph queries that would be impossibly slow in a relational database.

Takeaway: When your core product IS the graph (social networks, knowledge bases), a purpose-built graph store is usually the practical choice. Relational JOINs become prohibitively expensive for multi-hop traversals over billions of edges.

Case Study: Airbnb's Search Migration to Elasticsearch

Airbnb's property search initially queried PostgreSQL with complex WHERE clauses, full-text matching, and geospatial filters. As they grew to millions of listings, search queries took seconds. They migrated search to Elasticsearch: listings are synced from PostgreSQL via a CDC pipeline. Elasticsearch handles full-text search (fuzzy matching, synonyms), geo-filtering (bounding box, polygon), faceted navigation (price ranges, amenities), and relevance ranking — all in under 50ms.

Takeaway: Don't try to make PostgreSQL do everything. Use it as the source of truth for transactional data, and sync specialized read stores (Elasticsearch, Redis, ClickHouse) for specific query patterns via CDC.
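The "source of truth plus synced read stores" pattern boils down to replaying an ordered stream of change events into the secondary store. The Python sketch below shows an idempotent consumer. The event shape (`lsn`, `op`, `key`, `row`) and the function name are assumptions made for this example, not the format of any specific CDC tool.

```python
def apply_changes(read_store, events, last_applied=0):
    """Replay change events into a read store (a dict here), skipping ones already applied.

    Returns the position of the last applied event so the caller can checkpoint it.
    """
    for event in events:
        if event["lsn"] <= last_applied:
            continue   # duplicate delivery: safe to ignore
        if event["op"] in ("insert", "update"):
            read_store[event["key"]] = event["row"]
        elif event["op"] == "delete":
            read_store.pop(event["key"], None)
        else:
            raise ValueError(f"unknown operation: {event['op']}")
        last_applied = event["lsn"]
    return last_applied


events = [
    {"lsn": 1, "op": "insert", "key": 42, "row": {"title": "Loft", "price": 120}},
    {"lsn": 2, "op": "update", "key": 42, "row": {"title": "Loft", "price": 135}},
    {"lsn": 3, "op": "insert", "key": 43, "row": {"title": "Cabin", "price": 90}},
    {"lsn": 4, "op": "delete", "key": 43, "row": None},
]
```

Tracking the last applied position makes redelivery harmless, which matters because most pipelines guarantee at-least-once rather than exactly-once delivery.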
