Database Indexing

From full table scans to O(log n) lookups. Master the data structures that make queries fly.

Module 8: Database Indexing
Track 2: Distributed Shift (3–5 YoE)
In Module 7, we scaled the database across machines. But even a perfectly sharded database is slow if every query requires a full table scan — reading every row to find matches. An index is a data structure that transforms O(n) scans into O(log n) lookups. Choosing the right index strategy is one of the highest-leverage optimizations an engineer can make — a single missing index can turn a 5ms query into a 5-second query.
How Indexes Work

A database index is like a book's index. Instead of reading every page to find "B-Tree," you look up "B-Tree" in the index, which tells you it's on page 142. The database equivalent: instead of scanning 10 million rows, the index points directly to the matching rows.

Every index has a cost: additional storage space (typically 10-30% of table size) and slower writes (every INSERT/UPDATE must also update the index). The tradeoff: faster reads at the expense of write performance and storage.

  • Without index: full table scan, O(n). 10M rows → 10M comparisons, ~5 seconds.
  • With index: B-Tree lookup, O(log n). 10M rows → ~23 comparisons, ~0.5 milliseconds.
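The arithmetic behind those numbers can be checked with a short Python sketch. The table size (10 million rows) and the fan-out of 500 are illustrative values, not measurements:

```python
import math

N = 10_000_000  # rows in a hypothetical orders table

# Full table scan: in the worst case every row is examined.
scan_comparisons = N

# Index lookup: O(log n). With binary comparisons that's about log2(N);
# a real B-Tree with fan-out f needs only log_f(N) page reads.
btree_comparisons = math.ceil(math.log2(N))      # 24 (≈ log2 of 10M)
fanout_500_pages = math.ceil(math.log(N, 500))   # 3 page reads

print(scan_comparisons, btree_comparisons, fanout_500_pages)
```

The gap between 10 million comparisons and a handful of page reads is the entire value proposition of an index.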

B-Tree: The Default Index

The B-Tree (balanced tree) is the default index type in virtually every relational database (PostgreSQL, MySQL, Oracle, SQL Server). It's a self-balancing tree where each node can have hundreds of children (high fan-out), which keeps the tree shallow — typically 3-4 levels deep even for billions of keys.

B-Tree Internals

  • Root node: The top of the tree. Always in memory.
  • Internal nodes: Contain keys and pointers to child nodes. Each node is one disk page (typically 4-16 KB).
  • Leaf nodes: Contain the actual index entries (key + pointer to the table row). Leaf nodes are linked in a doubly-linked list for efficient range scans.

A B-Tree with branching factor 500 and 4 levels can hold 500^4 = 62.5 billion keys, with only 4 disk reads to find any key. Since the root and first-level internal nodes are cached in memory, most lookups require only 1-2 disk reads.

B-Tree vs B+Tree

Most databases use B+Trees (a variant): data is stored only in leaf nodes, internal nodes store only keys and pointers. This maximizes the branching factor (more keys per internal node = shallower tree) and enables efficient range scans via linked leaf nodes. When databases say "B-Tree index," they almost always mean B+Tree.
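To make the structure concrete, here is a toy two-level B+Tree in Python: separator keys in the root, (key, row pointer) entries in linked leaves. The tree is hand-built and tiny by assumption; a real node would hold hundreds of entries per disk page:

```python
import bisect

# Minimal hand-built B+Tree sketch.
# Leaves hold sorted (key, row_pointer) entries and link to the next leaf.
leaves = [
    {"entries": [(10, "row10"), (20, "row20")], "next": 1},
    {"entries": [(30, "row30"), (40, "row40")], "next": 2},
    {"entries": [(50, "row50"), (60, "row60")], "next": None},
]
# Root: separator keys partition the key space across children.
root = {"keys": [30, 50], "children": [0, 1, 2]}

def search(key):
    # Descend from the root: pick the child whose range contains key.
    child = root["children"][bisect.bisect_right(root["keys"], key)]
    for k, ptr in leaves[child]["entries"]:
        if k == key:
            return ptr
    return None

def range_scan(lo, hi):
    # Find the starting leaf, then follow the leaf links — no re-descent.
    leaf = root["children"][bisect.bisect_right(root["keys"], lo)]
    out = []
    while leaf is not None:
        for k, ptr in leaves[leaf]["entries"]:
            if lo <= k <= hi:
                out.append(ptr)
            elif k > hi:
                return out
        leaf = leaves[leaf]["next"]
    return out

print(search(40))          # row40
print(range_scan(20, 50))  # ['row20', 'row30', 'row40', 'row50']
```

The linked leaves are why B+Trees excel at range scans: after one descent, the scan walks sideways through sorted leaves instead of re-traversing the tree.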

Try It: B-Tree Simulator

Insert, delete, and search keys in a B-Tree. Watch splits, merges, and rebalancing in real-time.


Index Types Beyond B-Tree
  • Hash Index: exact-match lookups (WHERE id = 42). Limitation: no range queries, sorting, or prefix matching.
  • LSM-Tree (Log-Structured Merge): write-heavy workloads (Cassandra, RocksDB, LevelDB). Limitation: slower reads (must check multiple levels); background compaction uses CPU.
  • GIN (Generalized Inverted): full-text search, JSONB, arrays (PostgreSQL). Limitation: large index size, slow updates.
  • GiST (Generalized Search Tree): geospatial queries (PostGIS), range types. Limitation: slower than B-Tree for simple comparisons.
  • BRIN (Block Range Index): naturally ordered data (timestamps, sequential IDs). Limitation: effective only when data is physically ordered on disk.
  • Bitmap Index: low-cardinality columns (gender, status, boolean). Limitation: poor for high-cardinality columns, expensive to update.

LSM-Tree: Write-Optimized Storage

B-Trees are read-optimized: lookups are fast, but writes require random disk I/O (updating the tree in-place). LSM-Trees flip this tradeoff: writes go to an in-memory buffer (memtable), which is periodically flushed to disk as sorted files (SSTables). Reads must check the memtable and all SSTable levels. Background compaction merges SSTables to reduce read amplification.

LSM-Trees are used by Cassandra, ScyllaDB, RocksDB (a fork of LevelDB), CockroachDB (as its storage engine), InfluxDB, and most time-series databases.
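The write path described above can be sketched in a few dozen lines of Python. This is a toy model under heavy assumptions (single-threaded, SSTables kept in memory, no write-ahead log, no tombstones), but it shows the memtable flush, newest-first reads, and compaction:

```python
# Toy LSM-Tree sketch: memtable → flush to sorted runs → compaction.
class LSMTree:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # in-memory write buffer
        self.sstables = []   # immutable sorted runs, newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # writes touch memory only
        if len(self.memtable) >= self.limit:
            # Flush: persist the buffer as a sorted, immutable "SSTable".
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Reads check the memtable first, then SSTables newest-to-oldest —
        # this multi-level lookup is the read amplification cost.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs; newer values shadow older ones.
        merged = {}
        for run in reversed(self.sstables):  # oldest first, newest last wins
            merged.update(run)
        self.sstables = [sorted(merged.items())]

db = LSMTree()
db.put("a", 1); db.put("b", 2)   # triggers flush #1
db.put("a", 3); db.put("c", 4)   # triggers flush #2; newer "a" shadows older
db.compact()                     # two runs merge into one
print(db.get("a"))  # 3
```

Note how `put` never touches the sorted runs: that is why LSM writes are cheap, and why `get` and `compact` have to do the work a B-Tree does up front.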

Explore: Storage Engines

See how B-Trees and LSM-Trees store data on disk, and how WAL ensures durability.


Composite & Covering Indexes

Composite Indexes (Multi-Column)

A composite index covers multiple columns. The column order matters: CREATE INDEX ON orders (user_id, created_at) supports queries filtering by user_id or user_id AND created_at, but NOT by created_at alone. This is the leftmost prefix rule.

-- Index: (user_id, created_at, status)
✅ WHERE user_id = 42 -- uses index
✅ WHERE user_id = 42 AND created_at > '2024-01' -- uses index
✅ WHERE user_id = 42 AND created_at > '2024-01' AND status = 'active' -- uses full index
❌ WHERE created_at > '2024-01' -- CANNOT use index (missing leftmost column)
❌ WHERE status = 'active' -- CANNOT use index
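The leftmost prefix rule falls out of how a composite index sorts its entries. In this Python sketch (illustrative user IDs and dates), the index is a list sorted by (user_id, created_at), just as a B-Tree on those columns would store it:

```python
import bisect

# Composite index sketch: entries sorted by (user_id, created_at).
index = sorted([
    (42, "2024-01-05"), (42, "2024-03-01"),
    (7,  "2024-02-10"), (99, "2023-12-25"),
])

def by_user(user_id):
    # Leftmost prefix: one user's entries are contiguous → binary search works.
    lo = bisect.bisect_left(index, (user_id, ""))
    hi = bisect.bisect_right(index, (user_id, "\uffff"))
    return index[lo:hi]

def by_date(date):
    # created_at alone is NOT sorted in this ordering → full scan required.
    return [entry for entry in index if entry[1] > date]

print(by_user(42))  # [(42, '2024-01-05'), (42, '2024-03-01')]
```

Sorting by user_id first means the created_at values are only ordered *within* each user, which is exactly why `WHERE created_at > …` alone cannot binary-search this index.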

Covering Indexes (Index-Only Scans)

A covering index includes all columns needed by a query, so the database can answer the query entirely from the index without reading the actual table rows. This eliminates the most expensive operation: the random I/O of fetching rows from the heap.

-- Query: SELECT email FROM users WHERE last_login > '2024-01-01'
-- Covering index: includes email in the index
CREATE INDEX idx_login_covering ON users (last_login) INCLUDE (email);
-- Now: Index-Only Scan → no heap access → 10x faster

Partial Indexes

A partial index only indexes rows matching a condition. If 95% of your orders are status = 'completed' and you only query status = 'pending', index only the pending rows:

CREATE INDEX idx_pending_orders ON orders (created_at)
WHERE status = 'pending';
-- Index is 20x smaller, 20x faster to scan and maintain

EXPLAIN ANALYZE: Reading Query Plans

EXPLAIN ANALYZE is the most important tool for query optimization. It shows you exactly what the database is doing: which indexes are used, how many rows are scanned, and where time is spent.

-- Check whether your query uses an index
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;

-- Good plan: index scan
Index Scan using idx_orders_user_id on orders
  Index Cond: (user_id = 42)
  Rows: 15  Loops: 1
Planning Time: 0.1ms  Execution Time: 0.3ms ✓

-- Bad plan: full table scan
Seq Scan on orders ← BAD: full table scan
  Filter: (user_id = 42)
  Rows Removed by Filter: 9,999,985
Planning Time: 0.1ms  Execution Time: 4,523ms ✗

Red Flags in Query Plans

  • Seq Scan on large tables: Full table scan. Add an index on the filtered column.
  • High "Rows Removed by Filter": The database read many rows but discarded most. The filter column needs an index.
  • Hash Join / Nested Loop with no index: Joining two tables without an index on the join column. Add an index.
  • Sort with high cost: If the sort doesn't match an index, the DB sorts in memory (or worse, on disk). Add an index matching the ORDER BY.
  • Bitmap Heap Scan: Not necessarily bad, but indicates the query matched many rows. Consider if the query is too broad.

Index Pitfalls

  • Functions on indexed columns break indexes: WHERE LOWER(email) = 'foo@bar.com' can't use a B-Tree index on email. Fix: create a functional index: CREATE INDEX ON users (LOWER(email)).
  • Negation breaks indexes: WHERE status != 'deleted' can't efficiently use an index.
  • LIKE with a leading wildcard breaks indexes: WHERE name LIKE '%john%' requires a full scan (use a GIN/trigram index instead).
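Why does wrapping a column in a function break the index? Because the index is sorted by the raw value, and the transformed values are no longer in sorted order, so binary search no longer applies. A small Python sketch (with made-up email addresses) makes this visible:

```python
import bisect

# A plain index on email is sorted by the RAW value.
# In ASCII, uppercase sorts before lowercase:
emails = sorted(["Alice@x.com", "bob@x.com", "CAROL@x.com"])
# → ['Alice@x.com', 'CAROL@x.com', 'bob@x.com']

# Apply LOWER() to the sorted entries: the result is NOT sorted,
# so a lookup by LOWER(email) must scan every entry.
lowered = [e.lower() for e in emails]
print(lowered == sorted(lowered))  # False

# A functional index stores the TRANSFORMED value, restoring sorted
# order — binary search works again.
func_index = sorted(e.lower() for e in emails)
pos = bisect.bisect_left(func_index, "carol@x.com")
print(func_index[pos])  # carol@x.com
```

The same reasoning explains the LIKE pitfall: '%john%' has no fixed prefix, so there is no contiguous sorted region for the index to seek to.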


Lessons from the Trenches

Case Study: Stack Overflow's Index Strategy

Stack Overflow serves 1.7 billion page views per month from a remarkably small SQL Server footprint. Their secret: aggressive indexing. Every query path has a tailored composite index. They use covering indexes extensively to avoid heap lookups. Their DBAs review every slow query log entry and add or adjust indexes proactively. The result: average query time is under 1ms, even on tables with hundreds of millions of rows.

Takeaway: Indexing is a continuous process, not a one-time setup. Monitor slow query logs, run EXPLAIN ANALYZE on production queries, and iterate on your index strategy regularly.

Case Study: Uber's Schemaless (MySQL + LSM)

Uber built Schemaless, a key-value store on top of sharded MySQL, to handle their trip data. They moved from InnoDB (B-Tree) to MyRocks (LSM-Tree based on RocksDB) because their write-heavy workload (every GPS ping from every driver generates writes) caused massive write amplification in B-Trees. The switch to LSM-Trees reduced write amplification by 10x and halved their storage costs.

Takeaway: B-Tree is not always the answer. For write-heavy workloads (IoT, time-series, event logging), LSM-Tree-based storage engines significantly outperform B-Trees.

Case Study: The Missing Index That Cost $50K/Month

A SaaS company was running a PostgreSQL query that fetched recent activity for each user's dashboard. The query: SELECT * FROM events WHERE user_id = ? ORDER BY created_at DESC LIMIT 20. Without a composite index on (user_id, created_at DESC), PostgreSQL was using the index on user_id to fetch all of that user's events and then sorting them in memory. With 50M events, this took 200ms per query. They were scaling up their database instance to handle the load — costing $50K/month. Adding the composite index let PostgreSQL read the top 20 rows directly in index order, dropping the query to 2ms, and they downgraded to a $5K/month instance.

Takeaway: Before scaling hardware, check your indexes. A single missing index can cost 100x in infrastructure spend.


Further Reading & Citations
  • Designing Data-Intensive Applications by Martin Kleppmann — Chapter 3 (Storage and Retrieval) covers B-Trees, LSM-Trees, and hash indexes in depth. (O'Reilly, 2017)
  • Use The Index, Luke! — The definitive free online book on SQL indexing and query optimization. Covers composite indexes, partial indexes, and EXPLAIN.
  • PostgreSQL Index Documentation — Official docs covering B-Tree, Hash, GiST, GIN, BRIN, and partial indexes.
  • RocksDB Documentation — LSM-Tree-based storage engine used by CockroachDB, MyRocks, and TiKV.
  • Uber's InnoDB to MyRocks Migration — How switching from B-Tree to LSM-Tree reduced write amplification by 10x.
  • Database Internals by Alex Petrov — Deep coverage of B-Tree variants, LSM-Trees, and storage engine internals. (O'Reilly, 2019)
  • pgMustard — Visual tool for understanding PostgreSQL EXPLAIN ANALYZE output.

All Hands-on Resources

Reinforce these concepts with interactive simulators and visual deep-dives.

What's Next?

Orchestration & Resiliency

Containers, Kubernetes, circuit breakers, retries with exponential backoff, and bulkhead patterns. Build systems that heal themselves.

Continue Journey