NoSQL & Purpose-Built Databases
The right tool for the right job. When to leave the relational world behind.
PostgreSQL can handle an astonishing range of workloads — JSONB, full-text search, pub/sub, even time-series with TimescaleDB. But it has fundamental limitations at extreme scale:
- Write scalability: PostgreSQL has a single leader for writes. At millions of writes/sec, you need multi-leader or leaderless replication — which relational databases weren't designed for.
- Schema rigidity: ALTER TABLE on a 500M-row table can lock it for minutes. If your data model evolves rapidly (e.g., user-generated content, IoT events), a schemaless store is faster to iterate with.
- Access patterns: If you always read by primary key and never join, you're paying the overhead of a relational query planner for nothing. A key-value store is an order of magnitude simpler to operate and faster for point lookups.
- Specialized queries: Graph traversals (6 degrees of separation), geospatial nearest-neighbor, full-text relevance ranking — each has purpose-built databases optimized for that specific operation.
The Default Answer Is Still Relational
Start with PostgreSQL until you have a specific reason not to. Most applications never outgrow a well-tuned relational database. NoSQL is for when you've identified a specific access pattern or scale requirement that relational databases can't efficiently support.
Document Databases
Store data as self-contained JSON/BSON documents. Each document can have a different structure — no fixed schema. Queries can filter on any field, including nested objects and arrays.
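The document model can be sketched in plain Python, with dicts standing in for BSON documents and a small illustrative `matches` helper (not a real driver API) playing the role of the store's query engine:

```python
# Illustrative sketch: documents with differing shapes, queried on a
# nested field -- mimicking what a document store's query engine does.
docs = [
    {"_id": 1, "name": "Ada", "address": {"city": "London"}},
    {"_id": 2, "name": "Linus", "tags": ["kernel", "git"]},  # no address field
    {"_id": 3, "name": "Grace", "address": {"city": "New York"}},
]

def matches(doc, path, value):
    """Walk a dotted path ('address.city') into nested dicts."""
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return node == value

london = [d["name"] for d in docs if matches(d, "address.city", "London")]
print(london)  # ['Ada']
```

Note that document 2 simply lacks an `address` field and is skipped — no schema migration was needed to mix shapes in one collection.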
MongoDB
The most popular document database. BSON storage, rich query language, aggregation pipeline, multi-document ACID transactions (since v4.0). Sharding via mongos router.
Best for: Content management, user profiles, product catalogs, real-time analytics, rapid prototyping.
CouchDB / PouchDB
Built for offline-first architectures. Multi-version concurrency with deterministic conflict handling: conflicting revisions are preserved and a winner is picked deterministically, leaving semantic resolution to the application. PouchDB runs in the browser and syncs with CouchDB when online.
Best for: Mobile apps, offline-first PWAs, field data collection.
Key-Value Stores
The simplest data model: a hash map. GET(key) → value, SET(key, value). No queries by any field except the key. Extremely fast (sub-millisecond lookups), horizontally scalable, and simple to operate.
| Database | Data Location | Durability | Best For |
|---|---|---|---|
| Redis | In-memory (+ optional disk persistence) | Optional (AOF/RDB) | Caching, sessions, leaderboards, pub/sub, rate limiting |
| DynamoDB | SSD (managed by AWS) | Durable (3-AZ replication) | Serverless apps, shopping carts, user preferences |
| Memcached | In-memory only | None (volatile) | Simple caching (no data structures, no persistence) |
| etcd | Disk (Raft consensus) | Strong (linearizable) | Configuration store, service discovery (used by K8s) |
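The key-value model above fits in a few lines. This is a toy in-process store with Redis-style lazy TTL expiry, not a real client API:

```python
import time

class KVStore:
    """Toy key-value store: GET/SET by key only, with optional TTL."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.monotonic() >= expires:
            del self._data[key]  # lazy expiry on read, as Redis does
            return None
        return value

kv = KVStore()
kv.set("session:42", {"user": "ada"}, ttl=30)
print(kv.get("session:42"))  # {'user': 'ada'}
print(kv.get("missing"))     # None
```

TTL-based expiry is what makes this model a natural fit for sessions and caches: stale entries vanish without any cleanup job.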
Wide-Column Stores
Data is organized by row key and column families. Each row can have different columns. Designed for massive write throughput and time-series data. Data is stored sorted by row key, enabling efficient range scans.
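A minimal sketch of why sorted row keys matter (the `sensor#timestamp` key scheme and `range_scan` helper are illustrative, not any particular store's API):

```python
from bisect import bisect_left

# Toy wide-column layout: rows kept sorted by row key, each row holding
# an arbitrary set of columns. Sorted keys make range scans one slice.
rows = {
    "sensor1#2024-01-01T00:00": {"temp": 21.5},
    "sensor1#2024-01-01T00:01": {"temp": 21.7, "humidity": 40},  # extra column
    "sensor1#2024-01-02T00:00": {"temp": 19.0},
    "sensor2#2024-01-01T00:00": {"temp": 25.1},
}
sorted_keys = sorted(rows)

def range_scan(start, end):
    """Return rows whose key falls in [start, end) -- a contiguous slice."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, end)
    return [(k, rows[k]) for k in sorted_keys[lo:hi]]

# All of sensor1's readings on Jan 1: two adjacent rows, no full scan.
day1 = range_scan("sensor1#2024-01-01", "sensor1#2024-01-02")
print([k for k, _ in day1])
```

This is the same idea behind Cassandra's partition key plus clustering columns: pick a key scheme so your hot queries become contiguous reads.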
Cassandra
AP system (availability over consistency). Masterless ring architecture — every node is equal. Tunable consistency per query. CQL query language (SQL-like).
Best for: Write-heavy workloads, time-series, IoT, messaging (Discord, Netflix, Apple).
Google Bigtable / HBase
CP system. Designed for petabyte-scale structured data. Powers Google Search, Maps, Gmail. HBase is the open-source implementation on HDFS.
Best for: Analytics at petabyte scale, ML feature stores, time-series at extreme scale.
Graph Databases
Optimized for data with complex relationships. Data is stored as nodes (entities) and edges (relationships). Traversing a relationship is O(1) per hop thanks to index-free adjacency, whereas a SQL JOIN pays an index lookup (or scan) per joined table, with cost that grows with table size.
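Index-free adjacency can be sketched with plain adjacency lists: each node holds direct references to its neighbours, so a multi-hop traversal is a BFS rather than a chain of joins (the names and edges below are made up for illustration):

```python
from collections import deque

# Toy social graph: adjacency lists give O(1) access to a node's
# neighbours, so each hop costs constant time regardless of graph size.
edges = {
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice", "erin"],
    "dave":  ["bob"],
    "erin":  ["carol", "frank"],
    "frank": ["erin"],
}

def within_hops(start, max_hops):
    """BFS: everyone reachable from `start` in at most `max_hops` edges."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    seen.discard(start)
    return seen

print(sorted(within_hops("alice", 2)))  # alice's 2nd-degree network
```

Expressing the same "friends of friends" query relationally would require one self-join per hop, which is exactly what stops scaling on billions of edges.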
Neo4j
The most popular graph database. Cypher query language. Native graph storage (index-free adjacency). ACID transactions.
Best for: Social networks, recommendation engines, fraud detection, knowledge graphs.
Amazon Neptune / Dgraph
Managed graph databases. Neptune supports both property graph (Gremlin) and RDF (SPARQL). Dgraph is a distributed graph DB designed for horizontal scaling.
Best for: Enterprise knowledge graphs, identity resolution, network topology.
Search Engines: Elasticsearch / OpenSearch
Inverted index-based full-text search with relevance scoring (TF-IDF, BM25). Also used for log aggregation (ELK stack), APM, and real-time analytics. NOT a primary database — use as a secondary read store synced via CDC.
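The inverted index at Elasticsearch's core maps each term to the set of documents containing it, making term lookups independent of corpus size. A minimal sketch (whitespace tokenization only; real analyzers add stemming, stop words, and scoring):

```python
from collections import defaultdict

# Toy corpus: listing ID -> description text.
docs = {
    1: "cozy cabin with mountain view",
    2: "modern loft downtown",
    3: "cabin near the lake, mountain trails",
}

# Build the inverted index: term -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().replace(",", "").split():
        index[term].add(doc_id)

def search(query):
    """AND-query: documents containing every query term."""
    term_sets = [index[t] for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

print(search("mountain cabin"))  # {1, 3}
```

A relevance-ranked engine would then score each hit (e.g. with BM25) instead of returning a flat set, but the index structure is the same.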
Time-Series: InfluxDB / TimescaleDB / ClickHouse
Optimized for append-only, time-ordered data (metrics, IoT sensor data, financial ticks). Features: automatic data retention policies, downsampling (1s → 1min → 1hr), efficient range queries. ClickHouse is column-oriented, which can make analytical scans up to 100x faster than a row-oriented store.
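The downsampling step (1s → 1min) amounts to bucketing timestamps and aggregating each bucket. A sketch with synthetic readings (the data and bucket math are illustrative, not any engine's API):

```python
from statistics import mean

# Synthetic per-second readings: (epoch_seconds, value).
readings = [(t, 20.0 + (t % 7) * 0.1) for t in range(0, 180)]

def downsample(points, bucket_seconds):
    """Collapse raw points into (bucket_start, average) pairs."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % bucket_seconds, []).append(value)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]

per_minute = downsample(readings, 60)
print(len(per_minute))  # 3 one-minute buckets from 180 seconds of data
```

Retention policies then drop the raw 1s points after a window, keeping only the cheaper rollups — trading resolution for storage on old data.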
Vector Databases: Pinecone / Weaviate / Milvus
Store high-dimensional vectors (embeddings from ML models) and perform approximate nearest-neighbor (ANN) search. They power semantic search, recommendation systems, and RAG (retrieval-augmented generation) for LLMs. The fastest-growing database category.
Ledger Databases: Amazon QLDB / Hyperledger
Immutable, cryptographically verifiable transaction logs. Every change is recorded permanently and can be audited. Used in: financial systems, supply chain, healthcare records.
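The verifiability comes from hash chaining: each entry's hash covers its payload plus the previous entry's hash, so altering any historical record invalidates everything after it. A minimal sketch (the entry format is made up; QLDB's actual journal format differs):

```python
import hashlib
import json

def entry_hash(payload, prev_hash):
    """Hash the payload together with the previous entry's hash."""
    blob = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

ledger = []

def append(payload):
    prev = ledger[-1]["hash"] if ledger else "genesis"
    ledger.append({"payload": payload, "prev": prev,
                   "hash": entry_hash(payload, prev)})

def verify():
    """Recompute every hash; any tampering breaks the chain."""
    prev = "genesis"
    for entry in ledger:
        if entry["prev"] != prev or entry["hash"] != entry_hash(entry["payload"], prev):
            return False
        prev = entry["hash"]
    return True

append({"tx": "debit $10"})
append({"tx": "credit $10"})
print(verify())  # True: chain is intact

ledger[0]["payload"]["tx"] = "debit $1000"  # tamper with history
print(verify())  # False: the altered entry no longer matches its hash
```

Note that this proves tampering happened; it does not prevent it — which is why ledger databases pair the chain with strict append-only storage.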
Polyglot Persistence
Modern systems use multiple databases, each optimized for its specific access pattern. A single e-commerce platform might use:
| Data | Database | Why |
|---|---|---|
| Orders, payments | PostgreSQL | ACID transactions, referential integrity |
| Product catalog | MongoDB | Flexible schema (varying product attributes) |
| Sessions, cart | Redis | Sub-ms lookups, TTL-based expiration |
| Product search | Elasticsearch | Full-text search with facets and relevance |
| Recommendations | Neo4j + Pinecone | Graph for "bought together," vectors for "similar items" |
| Metrics, logs | ClickHouse | Column-oriented, 100x faster for analytical queries |
Every new database adds operational overhead: monitoring, backups, security patching, connection pooling, team expertise. Don't use 6 databases because you can — use them because you must. The "boring" answer (just use PostgreSQL) is often correct. Add specialized databases only when PostgreSQL becomes a clear bottleneck for that specific workload.
Case Study: Discord's Message Storage
Discord stores trillions of messages in Cassandra. Each channel ID serves as the partition key, and messages within a partition are sorted by snowflake ID (time-ordered). This allows efficient "fetch last 50 messages in channel X" queries. But Discord found that Cassandra's compaction storms caused latency spikes. Their solution: migrate to ScyllaDB (a C++ rewrite of Cassandra), which reduced P99 latency from 200ms to 15ms while handling more traffic on one third of the hardware.
Takeaway: Wide-column stores excel at time-series messaging data. But implementation quality matters as much as architecture — ScyllaDB's C++ rewrite of Cassandra achieved 10x better latency.
Case Study: LinkedIn's Graph of 900M Members
LinkedIn uses a custom graph database to power "People You May Know," connection suggestions, and recruiter search. The social graph has 900M+ member nodes and billions of edges (connections, follows, company affiliations). Traversals like "2nd-degree connections who work at Google and know Java" require multi-hop graph queries that would be impossibly slow in a relational database.
Takeaway: When your core product IS the graph (social networks, knowledge bases), a graph database isn't optional — it's the only viable option. Relational JOINs don't scale to multi-hop traversals on billions of edges.
Case Study: Airbnb's Search Migration to Elasticsearch
Airbnb's property search initially queried PostgreSQL with complex WHERE clauses, full-text matching, and geospatial filters. As they grew to millions of listings, search queries took seconds. They migrated search to Elasticsearch: listings are synced from PostgreSQL via a CDC pipeline. Elasticsearch handles full-text search (fuzzy matching, synonyms), geo-filtering (bounding box, polygon), faceted navigation (price ranges, amenities), and relevance ranking — all in under 50ms.
Takeaway: Don't try to make PostgreSQL do everything. Use it as the source of truth for transactional data, and sync specialized read stores (Elasticsearch, Redis, ClickHouse) for specific query patterns via CDC.
- Designing Data-Intensive Applications by Martin Kleppmann — Chapter 2 (Data Models and Query Languages) covers relational vs document vs graph models. (O'Reilly, 2017)
- How Discord Stores Trillions of Messages — Discord's migration from Cassandra to ScyllaDB.
- Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007) — The paper that inspired DynamoDB, Riak, and Cassandra.
- Why Graph Databases — Neo4j — When relational JOINs hit their limits.
- Database Internals by Alex Petrov — Deep coverage of B-Tree, LSM-Tree, and distributed database internals. (O'Reilly, 2019)
- MongoDB Data Modeling — Official guide to document modeling: embedding vs referencing.
- Cassandra Data Modeling — Query-driven design: model your tables around your queries, not your entities.