LSM-Tree vs B-Tree Simulator
Interactive visualization of Write-Optimized (LSM) vs Read-Optimized (B-Tree) Storage Engines
How to Use This Simulator
Interacting with Data:
- Insert: Add a Key-Value pair. Watch it propagate to both engines.
- Batch: Insert 10 random records at once to trigger the engines' capacity limits quickly.
- Delete: See how LSM uses "Tombstones" while B-Tree performs in-place removal.
Observing Internal States:
- LSM-Tree: Watch the Memtable fill and Flush to Disk (SSTables). Use "Force Compaction" to merge files.
- B-Tree: Observe how nodes Split when they exceed capacity and grow the tree height.
LSM-Trees vs B-Trees: The Definitive Storage Engine Guide
At the heart of every database lies a storage engine—the component responsible for storing, retrieving, and managing data on disk. While there are many variations, two fundamental data structures dominate the landscape of modern databases: Log-Structured Merge-Trees (LSM-Trees) and B-Trees. This interactive simulator allows you to visualize the mechanics of each, helping you understand why you might choose RocksDB over PostgreSQL, or vice versa, for specific workloads.
B-Trees: The Ubiquitous Standard
The B-Tree (and its variant, the B+ Tree) has been the gold standard for general-purpose storage engines since the 1970s. It is the default engine for ubiquitous relational databases like PostgreSQL, MySQL (InnoDB), SQL Server, and SQLite.
How It Works
The fundamental unit of a B-Tree is the Page (or block), typically 4KB or 8KB in size. This corresponds closely to the underlying hardware's block size.
- In-Place Updates: Unlike LSM-trees, B-Trees are mutable. When you update a record, the engine finds the specific page containing that record and overwrites the data in place. If the new data fits, no other pages are affected.
- Tree Structure: B-Trees are balanced search trees. They maintain a strict hierarchy where all leaf nodes are at the same depth. This balancing guarantees that looking up any key takes O(log n) time, providing extremely predictable read performance.
- Page Splitting: As seen in the simulator, B-Trees have a fixed capacity per node. When you insert data into a full page, it must "split" into two half-full pages. The middle key is promoted to the parent node. This process can ripple up to the root, increasing the tree's height.
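The split mechanics can be sketched in a few lines of Python. This is a toy, not a full B-Tree: it models a single page of capacity `ORDER` (a small value I chose for illustration; real engines size pages to roughly 4KB) and shows only the root-level split that grows the tree by one level.

```python
import bisect

ORDER = 4  # max keys per page (toy value; real pages hold hundreds of keys)

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty list means this is a leaf

def split(node):
    """Split a full page: the middle key is promoted, halves become siblings."""
    mid = len(node.keys) // 2
    promoted = node.keys[mid]
    left = Node(node.keys[:mid], node.children[:mid + 1])
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    return promoted, left, right

def insert(root, key):
    """Toy insert: only handles a leaf root, splitting it when it overflows."""
    bisect.insort(root.keys, key)
    if len(root.keys) > ORDER:                  # page overflow -> split
        promoted, left, right = split(root)
        return Node([promoted], [left, right])  # tree grows one level taller
    return root

root = Node()
for k in [10, 20, 30, 40, 50]:
    root = insert(root, k)
print(root.keys)                                     # [30]
print(root.children[0].keys, root.children[1].keys)  # [10, 20] [40, 50]
```

Inserting the fifth key overflows the page, so the middle key (30) is promoted into a new root, exactly the ripple-up behavior the simulator animates.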
Strengths & Weaknesses
Why use B-Trees?
- ✅ Fast Reads: Data is structured for quick lookup. Range scans are efficient (in B+ Trees).
- ✅ Predictable Latency: No background compaction processes causing CPU spikes.
- ✅ Transaction Support: Easier to implement ACID isolation levels due to single-copy semantics.
Challenges
- ❌ Random Writes: To update a tiny row, the entire 4KB page must be written. This leads to "Write Amplification".
- ❌ Fragmentation: Page splitting often results in pages that are only partially full, wasting disk space.
- ❌ Concurrency: Locking a page for update can block other writers, though modern techniques (latching) mitigate this.
LSM-Trees: The Write-Throughput Kings
Log-Structured Merge-Trees took center stage with the rise of "Big Data" and high-ingestion workloads. They are the backbone of modern NoSQL and NewSQL systems like Apache Cassandra, ScyllaDB, RocksDB (used by CockroachDB, YugabyteDB), and Google BigTable.
How It Works
LSM-Trees fundamentally differ from B-Trees by treating the disk as an append-only log. They never modify existing files; they only write new ones.
- The Memtable (RAM): When a write comes in, it initially goes into an in-memory buffer called the Memtable. This is usually a sorted structure (like a Red-Black tree or Skip List). Because it's in RAM, inserts are incredibly fast.
- The Flush (Sequential I/O): When the Memtable fills up (e.g., reaches 64MB), it is flushed to disk as a Sorted String Table (SSTable). Crucially, this write is sequential. Mechanical HDDs and even modern SSDs prefer sequential writes over random writes, giving LSM-trees massive write throughput.
- SSTables (Immutable Disk): Once written, an SSTable is never changed. If you update a key, you simply write the new value to a new Memtable. If you delete a key, you write a special "Tombstone" marker.
- Compaction: Over time, you accumulate many SSTables. A background process runs Compaction, which merges these files, discarding overwritten values and tombstoned records to reclaim space.
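This write path can be sketched minimally. Assumptions here: a toy flush threshold of 4 entries (real engines use megabytes), and plain Python dicts standing in for the sorted memtable and on-disk SSTable files.

```python
TOMBSTONE = object()      # deletion marker written instead of removing data
MEMTABLE_LIMIT = 4        # toy flush threshold (real engines use e.g. 64MB)

memtable, sstables = {}, []   # sstables: newest first, each an immutable dict

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:   # flush: one sorted, sequential write
        sstables.insert(0, dict(sorted(memtable.items())))
        memtable.clear()

def delete(key):
    put(key, TOMBSTONE)   # old files are never touched; just append a tombstone

def get(key):
    for table in [memtable] + sstables:   # check newest data first
        if key in table:
            v = table[key]
            return None if v is TOMBSTONE else v
    return None

def compact():
    """Merge all SSTables, dropping overwritten values and tombstoned keys."""
    global sstables
    merged = {}
    for table in reversed(sstables):      # oldest first, so newer data wins
        merged.update(table)
    sstables = [{k: merged[k] for k in sorted(merged)
                 if merged[k] is not TOMBSTONE}]

put("a", 1); put("b", 2); put("c", 3)
delete("a")                # tombstone for "a"
put("d", 4)                # 4th entry triggers the flush
print(get("a"), get("b"))  # None 2
compact()                  # "a" and its tombstone are gone from disk
```

Real SSTables are sorted files with sparse indexes rather than dicts, but the flow (buffer in RAM, flush sequentially, merge in the background) is the same.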
Strengths & Weaknesses
Why use LSM-Trees?
- ✅ Extreme Write Speed: Buffer writes in RAM + Sequential I/O = millions of writes per second.
- ✅ Compression: SSTables are immutable, allowing for aggressive block compression, saving 50%+ disk space compared to B-Trees.
- ✅ Hardware Friendly: Reduces write amplification on flash storage (SSDs) and minimizes seek times on HDDs.
Challenges
- ❌ Read Amplification: To read a key, the system might check the Memtable, then L0 SSTable, then L1... potentially keeping multiple files open. Bloom Filters are used to mitigate this.
- ❌ Compaction Stalls: Heavy background compaction can consume disk bandwidth, slowing down user queries temporarily.
- ❌ Space Amplification: Old data lingers on disk until compaction runs.
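The Bloom filters mentioned above can be sketched as follows. This is an illustrative toy (hash positions derived from SHA-256 over a small bit array), not a tuned production design; the point is that a filter answering "definitely absent" lets the engine skip reading an SSTable from disk entirely.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'definitely absent' or 'maybe present'."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        # Derive k independent bit positions from salted SHA-256 digests
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True (no false negatives, ever)
print(bf.might_contain("user:99"))   # False with high probability
```

False positives cost only a wasted file check; false negatives never happen, which is why every serious LSM engine attaches a Bloom filter to each SSTable.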
Head-to-Head Comparison
| Feature | B-Tree (InnoDB, WiredTiger B-Tree) | LSM-Tree (RocksDB, LevelDB) |
|---|---|---|
| Write Pattern | Random I/O (Update pages in place) | Sequential I/O (Append Only) |
| Write Performance | Good (limited by seek/IOPS) | Excellent (limited by bandwidth) |
| Read Performance | Excellent & Stable | Variable (Bloom filters help significantly) |
| Space Efficiency | Lower (Fragmentation) | Higher (Better compression) |
| Best For... | General Purpose, Read-Heavy, Transactional (OLTP) | Write-Heavy, Time-Series, Logs, High-Throughput |
Concept Deep Dive: Write Amplification
One of the most critical metrics in storage engine design is Write Amplification (WA). It is defined as the ratio of data written to disk versus data written by the user.
WA = (Bytes written to storage medium) / (Bytes written by user)
B-Tree WA
If you update a 100-byte row in a B-Tree, the engine must write out the full 4KB (4096
bytes) page (plus WAL entry).
WA ≈ 4096 / 100 ≈ 41x.
This wears out SSDs faster and consumes IOPS.
LSM-Tree WA
LSM writes the data compactly to the WAL and SSTable. However, Compaction reads
and re-writes the same data multiple times as it moves down levels.
Even so, for random write workloads, LSM trees typically have lower WA than B-Trees, often
in the range of 5x - 15x depending on the workload.
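The arithmetic above can be checked directly. The page and row sizes come from the example in this section; the compaction rewrite count of 10 is an assumed illustrative figure, not a measured one.

```python
PAGE_SIZE = 4096   # bytes per B-Tree page
ROW_SIZE = 100     # bytes actually changed by the user

# B-Tree: the whole page is rewritten to update one small row
btree_wa = PAGE_SIZE / ROW_SIZE
print(f"B-Tree WA ~ {btree_wa:.0f}x")   # ~41x

# LSM: written once to the WAL, once to an L0 SSTable, then rewritten
# each time compaction moves it down a level (assumed 10 rewrites here)
compaction_rewrites = 10   # illustrative; depends on level count and workload
lsm_wa = 1 + 1 + compaction_rewrites
print(f"LSM WA    ~ {lsm_wa}x")         # ~12x
```

Even with a generous rewrite count, the LSM figure lands inside the 5x-15x range quoted above, well below the B-Tree's page-granularity penalty for small random updates.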
References & Further Reading
- RocksDB Overview (LSM Implementation) - Official documentation from Meta explaining the architecture of one of the most popular LSM stores.
- Bigtable: A Distributed Storage System for Structured Data - The original paper by Google (Chang et al., 2006) that popularized the LSM-tree concept.
- Wikipedia: B-Tree - Mathematical properties and history of the B-Tree data structure.
- Log Structured Merge Trees - Ben Stopford - An excellent deep-dive article from a Kafka engineer.
- The LSM-Tree Paper (O'Neil et al., 1996) - The original academic paper introducing the concept.