Database Indexing
From full table scans to O(log n) lookups. Master the data structures that make queries fly.
A database index is like a book's index. Instead of reading every page to find "B-Tree," you look up "B-Tree" in the index, which tells you it's on page 142. The database equivalent: instead of scanning 10 million rows, the index points directly to the matching rows.
Every index has a cost: additional storage space (typically 10-30% of table size) and slower writes (every INSERT/UPDATE must also update the index). The tradeoff: faster reads at the expense of write performance and storage.
| Without Index | With Index |
|---|---|
| Full table scan: O(n) | B-Tree lookup: O(log n) |
| 10M rows → 10M comparisons | 10M rows → ~23 comparisons |
| ~5 seconds | ~0.5 milliseconds |
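The logarithmic lookup in the table above can be checked directly: binary search over a sorted array is the flat analogue of a B-Tree descent. A minimal sketch in Python (the `binary_search` helper and comparison counting are ours, for illustration):

```python
import math

def binary_search(sorted_keys, target):
    """Binary search that also counts how many probes it makes."""
    lo, hi, comparisons = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_keys[mid] == target:
            return mid, comparisons
        elif sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, comparisons

# range() gives us 10M "sorted keys" without allocating 10M objects.
keys = range(10_000_000)
idx, comps = binary_search(keys, 7_654_321)
print(idx, comps, math.ceil(math.log2(len(keys))))
```

The probe count stays at or below ceil(log2(10M)) = 24, which is where the "~23 comparisons" figure comes from.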
The B-Tree (balanced tree) is the default index type in virtually every relational database (PostgreSQL, MySQL, Oracle, SQL Server). It's a self-balancing tree where each node can have hundreds of children (high fan-out), which keeps the tree shallow — typically 3-4 levels deep even for billions of keys.
B-Tree Internals
- Root node: The top of the tree. Small enough that it is effectively always cached in memory.
- Internal nodes: Contain keys and pointers to child nodes. Each node is one disk page (typically 4-16 KB).
- Leaf nodes: Contain the actual index entries (key + pointer to the table row). Leaf nodes are linked in a doubly-linked list for efficient range scans.
A B-Tree with branching factor 500 and 4 levels can hold 500^4 = 62.5 billion keys, with only 4 disk reads to find any key. Since the root and first-level internal nodes are cached in memory, most lookups require only 1-2 disk reads.
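A quick sanity check of that arithmetic in Python, plus the depth needed for a billion keys at the same fan-out:

```python
import math

# Capacity of a tree with branching factor b and depth d is b**d.
branching, levels = 500, 4
capacity = branching ** levels
print(f"{capacity:,} keys")  # 62,500,000,000 = 62.5 billion

# Depth required to hold 1 billion keys at the same fan-out:
depth = math.ceil(math.log(1_000_000_000, branching))
print(depth, "levels")  # matches the "3-4 levels" claim
```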
B-Tree vs B+Tree
Most databases use B+Trees (a variant): data is stored only in leaf nodes, internal nodes store only keys and pointers. This maximizes the branching factor (more keys per internal node = shallower tree) and enables efficient range scans via linked leaf nodes. When databases say "B-Tree index," they almost always mean B+Tree.
Try It: B-Tree Simulator
Insert, delete, and search keys in a B-Tree. Watch splits, merges, and rebalancing in real-time.
| Index Type | Best For | Limitation |
|---|---|---|
| Hash Index | Exact-match lookups (WHERE id = 42) | Cannot do range queries, sorting, or prefix matching |
| LSM-Tree (Log-Structured Merge) | Write-heavy workloads (Cassandra, RocksDB, LevelDB) | Slower reads (check multiple levels), background compaction uses CPU |
| GIN (Generalized Inverted) | Full-text search, JSONB, arrays (PostgreSQL) | Large index size, slow updates |
| GiST (Generalized Search Tree) | Geospatial queries (PostGIS), range types | Slower than B-Tree for simple comparisons |
| BRIN (Block Range Index) | Naturally ordered data (timestamps, sequential IDs) | Only works when data is physically ordered on disk |
| Bitmap Index | Low-cardinality columns (gender, status, boolean) | Terrible for high-cardinality columns, expensive to update |
LSM-Tree: Write-Optimized Storage
B-Trees are read-optimized: lookups are fast, but writes require random disk I/O (updating the tree in-place). LSM-Trees flip this tradeoff: writes go to an in-memory buffer (memtable), which is periodically flushed to disk as sorted files (SSTables). Reads must check the memtable and all SSTable levels. Background compaction merges SSTables to reduce read amplification.
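The write and read paths just described can be sketched as a toy in-memory model. This is deliberately minimal and not how RocksDB is implemented (real engines add a write-ahead log, bloom filters, and leveled compaction); the class and method names below are made up for the sketch:

```python
class LSMTree:
    """Toy LSM-tree: memtable -> flushed sorted runs -> compaction."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}          # in-memory write buffer
        self.sstables = []          # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # writes never modify old data in place
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Flush = dump the memtable as an immutable sorted run ("SSTable").
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Read amplification: may probe every run, newest first.
        for run in reversed(self.sstables):
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one; newer values win. Shrinks the read path.
        merged = {}
        for run in self.sstables:
            merged.update(run)
        self.sstables = [sorted(merged.items())]

tree = LSMTree()
for i in range(10):
    tree.put(f"k{i}", i)
tree.put("k0", 100)   # overwrite lands in the memtable, not the old run
tree.compact()
print(tree.get("k0"))
```

Note how an overwrite is just another append-style write; the stale value in the older run is only discarded at compaction time, which is exactly the CPU cost the table above mentions.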
LSM-Trees are used by: Cassandra, ScyllaDB, RocksDB (a fork of LevelDB), CockroachDB (storage engine), InfluxDB, and most time-series databases.
Explore: Storage Engines
See how B-Trees and LSM-Trees store data on disk, and how WAL ensures durability.
Composite Indexes (Multi-Column)
A composite index covers multiple columns. The column order matters: CREATE INDEX ON orders (user_id, created_at)
supports queries filtering by user_id or user_id AND created_at, but
NOT by created_at alone. This is the leftmost prefix rule.
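The leftmost prefix rule is easy to observe with SQLite's EXPLAIN QUERY PLAN (the table and index names below are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (user_id INT, created_at TEXT, total REAL)")
db.execute("CREATE INDEX idx_user_created ON orders (user_id, created_at)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable detail last.
    rows = db.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(r[-1] for r in rows)

# Leftmost column alone: the composite index is usable.
print(plan("SELECT * FROM orders WHERE user_id = 42"))
# Both columns: also usable.
print(plan("SELECT * FROM orders WHERE user_id = 42 AND created_at > '2024-01-01'"))
# Trailing column alone: the index cannot help; plan falls back to a scan.
print(plan("SELECT * FROM orders WHERE created_at > '2024-01-01'"))
```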
Covering Indexes (Index-Only Scans)
A covering index includes all columns needed by a query, so the database can answer the query entirely from the index without reading the actual table rows. This eliminates the most expensive operation: the random I/O of fetching rows from the heap.
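SQLite reports an index-only scan as a COVERING INDEX, which makes the effect visible. The schema below is illustrative; the extra signup column exists only so that SELECT * cannot be answered from the index:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT, signup TEXT)"
)
db.execute("CREATE INDEX idx_email_name ON users (email, name)")

# Every column the query needs (email, name) lives in the index:
covered = db.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE email = 'a@b.com'"
).fetchall()
print(covered[0][-1])      # reports a COVERING INDEX: no table access

# SELECT * also needs signup, so the index search must fetch the row:
not_covered = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = 'a@b.com'"
).fetchall()
print(not_covered[0][-1])  # plain index search plus a row fetch
```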
Partial Indexes
A partial index only indexes rows matching a condition. If 95% of your orders
are status = 'completed' and you only ever query status = 'pending',
index only the pending rows: the index stays small, cheap to maintain, and fast to scan.
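A sketch using SQLite, which supports partial indexes with the same CREATE INDEX ... WHERE syntax as PostgreSQL (table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, created_at TEXT)"
)
# Only 'pending' rows are indexed; 'completed' rows cost nothing here.
db.execute(
    "CREATE INDEX idx_pending ON orders (created_at) WHERE status = 'pending'"
)

# A query whose predicate implies the index condition can use it...
pending_plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders "
    "WHERE status = 'pending' ORDER BY created_at"
).fetchall()
print(pending_plan[0][-1])

# ...but a query over completed orders cannot.
completed_plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE status = 'completed'"
).fetchall()
print(completed_plan[0][-1])
```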
EXPLAIN ANALYZE is the most important tool for query optimization. It shows you
exactly what the database is doing: which indexes are used, how many rows are scanned,
and where time is spent.
Red Flags in Query Plans
- Seq Scan on large tables: Full table scan. Add an index on the filtered column.
- High "Rows Removed by Filter": The database read many rows but discarded most. The filter column needs an index.
- Hash Join / Nested Loop with no index: Joining two tables without an index on the join column. Add an index.
- Sort with high cost: If the sort doesn't match an index, the DB sorts in memory (or worse, on disk). Add an index matching the ORDER BY.
- Bitmap Heap Scan: Not necessarily bad, but indicates the query matched many rows. Consider if the query is too broad.
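The first red flag can be reproduced in miniature with SQLite, whose plans start with SCAN for a full table scan and SEARCH for an index lookup (names below are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INT, payload TEXT)")

def detail(sql):
    return db.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

q = "SELECT * FROM events WHERE user_id = 7"

before = detail(q)
print(before)   # starts with SCAN: full table scan on the filtered column

db.execute("CREATE INDEX idx_events_user ON events (user_id)")
after = detail(q)
print(after)    # starts with SEARCH: the new index handles the filter
```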
Functions on indexed columns break indexes: WHERE LOWER(email) = 'foo@bar.com'
can't use a B-Tree index on email. Fix: create a functional index:
CREATE INDEX ON users (LOWER(email)). Negation breaks indexes: WHERE status != 'deleted'
can't efficiently use an index. LIKE with leading wildcard breaks indexes: WHERE name LIKE '%john%' requires a full scan (use GIN/trigram instead).
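The functional-index fix can be tried in SQLite, which supports indexes on expressions. Note the query must repeat the indexed expression exactly for the planner to match it (names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
# Expression index: same idea as CREATE INDEX ON users (LOWER(email)) in Postgres.
db.execute("CREATE INDEX idx_email_lower ON users (LOWER(email))")

def detail(sql):
    return db.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

# Matches the indexed expression: index is used.
used = detail("SELECT * FROM users WHERE LOWER(email) = 'foo@bar.com'")
print(used)
# Plain column comparison: the expression index cannot help.
unused = detail("SELECT * FROM users WHERE email = 'foo@bar.com'")
print(unused)
```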
Case Study: Stack Overflow's Index Strategy
Stack Overflow serves 1.7 billion page views per month from a single SQL Server instance. Their secret: aggressive indexing. Every query path has a tailored composite index. They use covering indexes extensively to avoid heap lookups. Their DBA reviews every slow query log entry and adds or adjusts indexes proactively. The result: average query time is under 1ms, even on tables with hundreds of millions of rows.
Takeaway: Indexing is a continuous process, not a one-time setup. Monitor slow query logs, run EXPLAIN ANALYZE on production queries, and iterate on your index strategy regularly.
Case Study: Uber's Schemaless (MySQL + LSM)
Uber built Schemaless, a key-value store on top of sharded MySQL, to handle their trip data. They moved from InnoDB (B-Tree) to MyRocks (LSM-Tree based on RocksDB) because their write-heavy workload (every GPS ping from every driver generates writes) caused massive write amplification in B-Trees. The switch to LSM-Trees reduced write amplification by 10x and halved their storage costs.
Takeaway: B-Tree is not always the answer. For write-heavy workloads (IoT, time-series, event logging), LSM-Tree-based storage engines significantly outperform B-Trees.
Case Study: The Missing Index That Cost $50K/Month
A SaaS company was running a PostgreSQL query that fetched recent activity for each user's
dashboard. The query: SELECT * FROM events WHERE user_id = ? ORDER BY created_at DESC LIMIT 20. Without a composite index on (user_id, created_at DESC), PostgreSQL was
fetching every matching row via the user_id index and then sorting them in memory. With 50M events,
this took 200ms per query. They were scaling up their database instance to handle the load —
costing $50K/month. Adding the composite index dropped the query to 2ms, and they downgraded
to a $5K/month instance.
Takeaway: Before scaling hardware, check your indexes. A single missing index can cost 100x in infrastructure spend.
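A miniature of this case study in SQLite: the tell-tale explicit sort step (USE TEMP B-TREE FOR ORDER BY) disappears once the composite index exists. The schema mirrors the query in the text but is otherwise illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INT, created_at TEXT, data TEXT)")
db.execute("CREATE INDEX idx_user ON events (user_id)")

q = ("SELECT * FROM events WHERE user_id = 1 "
     "ORDER BY created_at DESC LIMIT 20")

def plan(sql):
    return [r[-1] for r in db.execute("EXPLAIN QUERY PLAN " + sql)]

before_plan = plan(q)
print(before_plan)   # includes a TEMP B-TREE step: the DB must sort

db.execute("CREATE INDEX idx_user_created ON events (user_id, created_at DESC)")
after_plan = plan(q)
print(after_plan)    # index returns rows pre-sorted; the sort step is gone
```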
- Designing Data-Intensive Applications by Martin Kleppmann — Chapter 3 (Storage and Retrieval) covers B-Trees, LSM-Trees, and hash indexes in depth. (O'Reilly, 2017)
- Use The Index, Luke! — The definitive free online book on SQL indexing and query optimization. Covers composite indexes, partial indexes, and EXPLAIN.
- PostgreSQL Index Documentation — Official docs covering B-Tree, Hash, GiST, GIN, BRIN, and partial indexes.
- RocksDB Documentation — LSM-Tree-based storage engine used by CockroachDB, MyRocks, and TiKV.
- Uber's InnoDB to MyRocks Migration — How switching from B-Tree to LSM-Tree reduced write amplification by 10x.
- Database Internals by Alex Petrov — Deep coverage of B-Tree variants, LSM-Trees, and storage engine internals. (O'Reilly, 2019)
- pgMustard — Visual tool for understanding PostgreSQL EXPLAIN ANALYZE output.
All Hands-on Resources
Reinforce these concepts with interactive simulators and visual deep-dives.
What's Next?
Orchestration & Resiliency
Containers, Kubernetes, circuit breakers, retries with exponential backoff, and bulkhead patterns. Build systems that heal themselves.
Continue Journey