The Engineering of Database Indexes: B-Trees and Disk Anatomy
A SQL query that takes 45 minutes without an index can complete in 5 milliseconds with one. This 540,000x difference in speed isn't magic; it is the raw mathematical power of logarithmic time complexity applied to physical storage media. To truly understand database performance, we must look beneath the SQL syntax and examine how bits are physically arranged on spinning rust and solid-state silicon.
Part 1: The Physics of the Disk
Databases do not read single rows from a hard drive. It is physically impossible. Storage media (both HDDs and SSDs) only allow reading and writing in fixed-size blocks. In PostgreSQL, this block is called a Page and is 8 kilobytes by default. In MySQL's InnoDB, the default is 16KB.
If you ask the database for "User ID 50", and that user row is 100 bytes long, the database must instruct the OS to fetch the entire 8,192-byte Page from the disk into RAM, just to extract your 100 bytes. This is called a Page Read.
Memory (RAM) is millions of times faster than Disk. Therefore, the absolute golden rule of database performance is minimizing Disk I/O (Input/Output). A fast query reads 3 pages from the disk. A slow query reads 1,000,000 pages from the disk.
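To make the page-read amplification concrete, here is a back-of-the-envelope sketch in Python using the numbers above (an 8,192-byte page and a 100-byte row):

```python
# Fetching one 100-byte row still costs one full 8 KB page read.
PAGE_SIZE = 8192   # PostgreSQL's default page size in bytes
ROW_SIZE = 100     # the example row from the text

rows_per_page = PAGE_SIZE // ROW_SIZE   # how many such rows fit in one page
wasted_bytes = PAGE_SIZE - ROW_SIZE     # bytes read from disk but not needed
amplification = PAGE_SIZE / ROW_SIZE    # bytes read per byte actually wanted

print(rows_per_page, wasted_bytes, round(amplification, 1))
```

So fetching a single 100-byte row drags roughly 80x more data off the disk than the query actually needs, which is why counting page reads, not rows, is the right mental model.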
Part 2: The Full Table Scan (O(N))
Imagine a table, transactions, with 10 million rows. You execute:
SELECT amount FROM transactions WHERE transaction_id = 999999;
If the transaction_id column is not indexed, the data is just a disorganized heap of 8KB Pages on the disk. The database engine has no map.
It must start at Page 1. It loads Page 1 into RAM. It scans every row. No match. It discards it. It loads Page 2. No match. It heavily taxes the CPU and saturates the Disk I/O throughput as it sequentially loads and checks every single one of the 10 million rows. This is a Sequential Scan (Seq Scan), and its time complexity is O(N). If the table doubles in size, the query takes twice as long.
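The cost of a Seq Scan can be modeled in a few lines. This is a toy model, not a real engine: it assumes 10 rows fit in each 8KB page (roughly 800-byte rows, an illustrative assumption), which yields the million-page figure used later in this article:

```python
# Toy cost model for a Sequential Scan: cost = pages read, not rows checked.
ROWS = 10_000_000
ROWS_PER_PAGE = 10  # assumption: ~800-byte rows in an 8 KB page

# Ceiling division: total heap pages the table occupies on disk.
total_pages = -(-ROWS // ROWS_PER_PAGE)

def seq_scan_pages(target_row_number: int) -> int:
    """Pages read before the scan reaches the target row (worst case: all of them)."""
    return -(-target_row_number // ROWS_PER_PAGE)

print(total_pages)            # pages for a full sweep of the table
print(seq_scan_pages(ROWS))   # worst case: the match is in the very last page
```

Doubling ROWS doubles total_pages, which is exactly the O(N) behavior described above.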
Part 3: The B-Tree Index (O(log N))
When you execute CREATE INDEX idx_tx_id ON transactions(transaction_id);, the database builds a massive, secondary data structure called a Balanced Tree (B-Tree or B+Tree).
The B-Tree is entirely separate from the original table data (the Heap). It looks like an upside-down tree, and for a table of this size it typically has 3 distinct levels:
- Root Node: A single 8KB Page at the very top. It acts as a massive signpost. It says "IDs less than 3,000,000 go down the Left Branch. IDs between 3,000,000 and 6,000,000 go down the Middle Branch..."
- Branch Nodes: Intermediate 8KB Pages that subdivide the ranges further.
- Leaf Nodes (The Address Book): The bottom of the tree. These 8KB Pages contain a perfectly sorted list of every transaction_id in existence, each sitting right next to a pointer (Tuple ID, or TID) that gives the exact physical location (Page #, Row #) of that row in the main table.
The Logarithmic Jump
When we search for ID = 999999 with the index, the engine loads the Root Node (1 page read). It follows the pointer down to the correct Branch Node (1 page read). It follows the pointer to the exact Leaf Node (1 page read). The Leaf Node contains the physical disk location, so the engine jumps directly to the Main Table (1 page read) to fetch the row.
Total Cost: 4 Page Reads, versus 1,000,000 page reads for a Sequential Scan. That is the magic of O(log N).
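A quick sketch shows why the tree stays so shallow. Each 8KB index page holds hundreds of (key, pointer) entries, a number called the fanout; a fanout of 400 here is an illustrative assumption, not a measured value:

```python
import math

# Tree height grows with log_fanout(N): each level multiplies the
# number of reachable keys by the fanout.
FANOUT = 400       # assumed (key, pointer) entries per 8 KB index page
N = 10_000_000     # rows in the transactions table

height = math.ceil(math.log(N) / math.log(FANOUT))  # index pages read
page_reads = height + 1                             # + the final heap fetch

print(height, page_reads)
```

With a fanout of 400, three levels can address 400^3 = 64 billion keys, so 10 million rows fit comfortably in a 3-level tree: three index page reads plus one heap read, matching the 4-read walkthrough above.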
Part 4: Clustered vs. Non-Clustered Indexes
In a Non-Clustered Index (like the one described above, standard in Postgres), the Index is a completely separate file from the Table. The Leaf Nodes contain pointers to the Table.
In a Clustered Index (Standard in MySQL's InnoDB for Primary Keys), the B-Tree IS the table. The actual row data (amount, date, user_ip) is stored directly inside the Leaf Nodes of the B-Tree itself.
Pro: Retrieving data is even faster because there is no final "jump" to the heap.
Con: You can only have one clustered index per table, because you can only physically sort and store the raw data one way. All other indexes on that table must be Non-Clustered (Secondary Indexes), and their leaf nodes will point to the Primary Key instead of a physical disk location.
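Under the same illustrative fanout assumption, the trade-off can be sketched as page-read counts for the three lookup paths: a non-clustered lookup, a clustered primary-key lookup, and a secondary-index lookup that must traverse two trees:

```python
import math

def tree_height(n: int, fanout: int = 400) -> int:
    """Levels in a B-Tree indexing n keys (fanout of 400 is an assumption)."""
    return math.ceil(math.log(n) / math.log(fanout))

N = 10_000_000
h = tree_height(N)

nonclustered = h + 1           # index traversal + one jump into the heap
clustered_pk = h               # the row lives in the leaf: no final jump
secondary_on_clustered = h + h # secondary tree, then the PK tree to reach the row

print(nonclustered, clustered_pk, secondary_on_clustered)
```

This is why the clustered primary-key lookup is the cheapest path, while a secondary index on a clustered table pays for two full traversals.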
Part 5: Selectivity and The Optimizer's Rebellion
Sometimes you create an index, run an EXPLAIN ANALYZE, and discover the database has stubbornly ignored your index and done a Full Table Scan anyway. Why? The culprit is usually Poor Selectivity.
Assume you have an index on the `status` column (e.g., "active" or "deleted"). If 95% of your 10 million rows are "active", querying WHERE status = 'active' means fetching 9.5 million rows.
If the database uses the index, it must do 9.5 million Random I/O seeks to follow 9.5 million pointers from the Leaf Nodes into the Heap. Random I/O is incredibly slow.
The Database Query Optimizer is a mathematical genius. It instantly calculates that doing 9.5 million Random I/O seeks is actually much slower than just doing one massive, sequential sweep (Sequential Scan) of the entire disk and filtering out the 5% of deleted rows. The index is ignored.
Rule of Thumb: Indexes are only used when they retrieve a tiny sliver of the data (high selectivity).
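The optimizer's decision can be sketched as a simple cost comparison. The 4x random-I/O penalty mirrors PostgreSQL's default random_page_cost (4.0) versus seq_page_cost (1.0); the one-seek-per-match model and the 10-rows-per-page figure are deliberate simplifications:

```python
# Toy planner cost model, in units of "page reads".
ROWS = 10_000_000
ROWS_PER_PAGE = 10        # assumption: ~800-byte rows in an 8 KB page
MATCH_FRACTION = 0.95     # WHERE status = 'active' matches 95% of rows
RANDOM_PENALTY = 4.0      # random I/O costed at 4x sequential I/O

total_pages = -(-ROWS // ROWS_PER_PAGE)
seq_scan_cost = total_pages  # one cheap sequential sweep of the whole table
index_cost = ROWS * MATCH_FRACTION * RANDOM_PENALTY  # one random seek per match

print(seq_scan_cost, index_cost, seq_scan_cost < index_cost)
```

A one-million-page sweep beats 38 million penalized random seeks by a wide margin, so the planner ignores the index, exactly as described above.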
Conclusion: The Write Penalty
If indexes are mathematically magical, why not index every single column? Because of the Write Penalty.
Every time you `INSERT`, `UPDATE`, or `DELETE` a row, the database must not only write the data to the main table, it must also traverse and update every single B-Tree Index attached to that table. If a table has 15 indexes, one simple `INSERT` statement initiates a cascading avalanche of 16 distinct, heavy Disk I/O write operations. Massive tables with too many indexes become practically read-only due to catastrophic write latency.
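The write-amplification math from this paragraph, as a one-liner:

```python
# One INSERT touches the heap plus every B-Tree index on the table.
def writes_per_insert(num_indexes: int) -> int:
    """Distinct write operations for one INSERT: 1 heap write + 1 per index."""
    return 1 + num_indexes

print(writes_per_insert(15))  # the 15-index example from the text
```

Every index you add shifts this balance further toward write cost, which is why indexing should follow the actual read patterns of your queries rather than a reflex to index everything.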