LSM-Tree vs B-Tree Simulator

Runs in the browser

Interactive visualization of Write-Optimized (LSM) vs Read-Optimized (B-Tree) Storage Engines

LSM-Tree

Write Optimized (Append-Only)

Memory

Memtable (Sorted)

Disk

SSTables (Immutable)

B-Tree

Read Optimized (Page Based)

Order: 3 (Max 2 keys/node)

How to Use This Simulator

Interacting with Data:

  • Insert: Add a Key-Value pair. Watch it propagate to both engines.
  • Batch: Randomly insert 10 records to trigger limits rapidly.
  • Delete: See how LSM uses "Tombstones" while B-Tree performs in-place removal.

Observing Internal States:

  • LSM-Tree: Watch the Memtable fill and Flush to Disk (SSTables). Use "Force Compaction" to merge files.
  • B-Tree: Observe how nodes Split when they exceed capacity and grow the tree height.

LSM-Trees vs B-Trees: The Definitive Storage Engine Guide

At the heart of every database lies a storage engine—the component responsible for storing, retrieving, and managing data on disk. While there are many variations, two fundamental data structures dominate the landscape of modern databases: Log-Structured Merge-Trees (LSM-Trees) and B-Trees. This interactive simulator allows you to visualize the mechanics of each, helping you understand why you might choose RocksDB over PostgreSQL, or vice versa, for specific workloads.


B-Trees: The Ubiquitous Standard

The B-Tree (and its variant, the B+ Tree) has been the gold standard for general-purpose storage engines since the 1970s. It is the default engine for widely used relational databases like PostgreSQL, MySQL (InnoDB), SQL Server, and SQLite.

How It Works

The fundamental unit of a B-Tree is the Page (or block), typically 4KB or 8KB in size. This size is chosen to align with the operating system's page size and the underlying hardware's block size.

  • In-Place Updates: Unlike LSM-trees, B-Trees are mutable. When you update a record, the engine finds the specific page containing that record and overwrites the data in place. If the new data fits, no other pages are affected.
  • Tree Structure: B-Trees are balanced search trees. They maintain a strict hierarchy where all leaf nodes are at the same depth. This balancing guarantees that looking up any key takes O(log n) time, providing extremely predictable read performance.
  • Page Splitting: As seen in the simulator, B-Trees have a fixed capacity per node. When you insert data into a full page, it must "split" into two half-full pages. The middle key is promoted to the parent node. This process can ripple up to the root, increasing the tree's height.
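The splitting behavior described above can be sketched in a few dozen lines. This is an illustrative toy matching the simulator's settings (order 3, max 2 keys per node), not a production engine: there are no disk pages and no deletion, and all names are made up for the example.

```python
# Toy B-Tree insertion with page splitting (order 3: max 2 keys/node).
class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []          # sorted keys in this "page"
        self.children = children or []  # empty for leaf nodes

MAX_KEYS = 2  # order 3 => a page holds at most 2 keys

def insert(root, key):
    """Insert key; return the (possibly new) root after any splits."""
    split = _insert(root, key)
    if split:  # the root itself overflowed: grow the tree by one level
        mid, right = split
        return Node([mid], [root, right])
    return root

def _insert(node, key):
    if not node.children:                     # leaf: place key in order
        node.keys.append(key)
        node.keys.sort()
    else:                                     # internal node: descend
        i = sum(k < key for k in node.keys)
        split = _insert(node.children[i], key)
        if split:                             # child split: absorb middle key
            mid, right = split
            node.keys.insert(i, mid)
            node.children.insert(i + 1, right)
    if len(node.keys) > MAX_KEYS:             # overflow: split this page
        mid = node.keys[1]                    # middle key is promoted upward
        right = Node(node.keys[2:], node.children[2:])
        node.keys, node.children = node.keys[:1], node.children[:2]
        return mid, right
    return None

root = Node()
for k in [10, 20, 30, 40, 50]:
    root = insert(root, k)
print(root.keys)  # promoted middle keys accumulate in the root: [20, 40]
```

Inserting 10, 20, 30 forces the first split (20 is promoted into a new root), exactly the ripple-up behavior the simulator animates.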

Strengths & Weaknesses

Why use B-Trees?

  • Fast Reads: Data is structured for quick lookup. Range scans are efficient (especially in B+ Trees, whose leaf nodes are linked).
  • Predictable Latency: No background compaction processes causing CPU spikes.
  • Transaction Support: Easier to implement ACID isolation levels due to single-copy semantics.

Challenges

  • Random Writes: To update a tiny row, the entire 4KB page must be written. This leads to "Write Amplification".
  • Fragmentation: Page splitting often results in pages that are only partially full, wasting disk space.
  • Concurrency: Locking a page for update can block other writers, though modern techniques (latching) mitigate this.

LSM-Trees: The Write-Throughput Kings

Log-Structured Merge-Trees took center stage with the rise of "Big Data" and high-ingestion workloads. They are the backbone of modern NoSQL and NewSQL systems like Apache Cassandra, ScyllaDB, RocksDB (used by CockroachDB, YugabyteDB), and Google BigTable.

How It Works

LSM-Trees fundamentally differ from B-Trees by treating the disk as an append-only log. They never modify existing files; they only write new ones.

  1. The Memtable (RAM): When a write comes in, it initially goes into an in-memory buffer called the Memtable. This is usually a sorted structure (like a Red-Black tree or Skip List). Because it's in RAM, inserts are incredibly fast.
  2. The Flush (Sequential I/O): When the Memtable fills up (e.g., reaches 64MB), it is flushed to disk as a Sorted String Table (SSTable). Crucially, this write is sequential. Mechanical HDDs and even modern SSDs prefer sequential writes over random writes, giving LSM-trees massive write throughput.
  3. SSTables (Immutable Disk): Once written, an SSTable is never changed. If you update a key, you simply write the new value to a new Memtable. If you delete a key, you write a special "Tombstone" marker.
  4. Compaction: Over time, you accumulate many SSTables. A background process runs Compaction, which merges these files, discarding overwritten values and tombstoned records to reclaim space.
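The four steps above can be condensed into a toy write path. This is a minimal sketch, not any real engine's code: the memtable limit of 4 matches the simulator, a plain dict stands in for a sorted structure, and a full-merge compaction stands in for leveled compaction.

```python
# Toy LSM write path: memtable, flush to immutable SSTables,
# tombstone deletes, and a full merge compaction.
TOMBSTONE = object()          # special marker written on delete
MEMTABLE_LIMIT = 4            # flush threshold (the simulator's limit)

class LSMTree:
    def __init__(self):
        self.memtable = {}    # in-memory buffer (dict here for brevity)
        self.sstables = []    # newest last; each is an immutable sorted list

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)     # deletes are just writes of a marker

    def flush(self):
        # Sequential write: dump the memtable as one sorted, immutable run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Check newest data first: memtable, then SSTables newest-to-oldest.
        if key in self.memtable:
            v = self.memtable[key]
            return None if v is TOMBSTONE else v
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return None if v is TOMBSTONE else v
        return None

    def compact(self):
        # Merge all runs, keeping only the newest version of each key
        # and dropping tombstoned records to reclaim space.
        merged = {}
        for table in self.sstables:   # oldest first, so newer overwrites
            merged.update(table)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.sstables = [sorted(live.items())]
```

Note that `delete` never touches the old SSTable containing the key; the tombstone only masks it until `compact` physically discards both copies.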

Strengths & Weaknesses

Why use LSM-Trees?

  • Extreme Write Speed: Buffer writes in RAM + Sequential I/O = millions of writes per second.
  • Compression: SSTables are immutable, allowing for aggressive block compression, often saving 50% or more disk space compared to B-Trees.
  • Hardware Friendly: Reduces write amplification on flash storage (SSDs) and minimizes seek times on HDDs.

Challenges

  • Read Amplification: To read a key, the system might check the Memtable, then the L0 SSTables, then L1... potentially touching several files for a single lookup. Bloom Filters are used to mitigate this.
  • Compaction Stalls: Heavy background compaction can consume disk bandwidth, slowing down user queries temporarily.
  • Space Amplification: Old data lingers on disk until compaction runs.
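The Bloom filter mentioned above deserves a sketch: LSM engines keep one per SSTable so a read can skip files that definitely do not contain a key. This is a minimal illustration with made-up sizing (1024 bits, 3 hash positions derived from one SHA-256 digest), not any engine's actual implementation.

```python
# Minimal Bloom filter: a bit array plus k hash positions per key.
# "No" answers are exact; "yes" answers may be false positives.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, key):
        # Derive k bit positions from one digest (illustrative choice).
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[i * 4:(i + 1) * 4], "big")
            yield chunk % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False => definitely absent: the SSTable can be skipped entirely.
        # True  => maybe present: the file must actually be read.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))
```

Because a present key always sets its bits, the filter never produces false negatives; the occasional false positive merely costs one unnecessary file read.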

Head-to-Head Comparison

| Feature | B-Tree (InnoDB, WiredTiger B-Tree) | LSM-Tree (RocksDB, LevelDB) |
| --- | --- | --- |
| Write Pattern | Random I/O (update pages in place) | Sequential I/O (append only) |
| Write Performance | Good (limited by seek/IOPS) | Excellent (limited by bandwidth) |
| Read Performance | Excellent & stable | Variable (Bloom filters help significantly) |
| Space Efficiency | Lower (fragmentation) | Higher (better compression) |
| Best For... | General purpose, read-heavy, transactional (OLTP) | Write-heavy, time-series, logs, high-throughput |

Concept Deep Dive: Write Amplification

One of the most critical metrics in storage engine design is Write Amplification (WA). It is defined as the ratio of data written to disk versus data written by the user.

WA = (Bytes written to storage medium) / (Bytes written by user)

B-Tree WA

If you update a 100-byte row in a B-Tree, the engine must write out the full 4KB (4096 bytes) page (plus WAL entry).
WA ≈ 4096 / 100 ≈ 41x.
This wears out SSDs faster and consumes IOPS.

LSM-Tree WA

LSM writes the data compactly to the WAL and SSTable. However, Compaction reads and re-writes the same data multiple times as it moves down levels.
Even so, for random write workloads, LSM trees typically have lower WA than B-Trees, often in the range of 5x - 15x depending on the workload.
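The two estimates can be put side by side in a few lines of arithmetic. The level count for the LSM case is an illustrative assumption (one rewrite per compaction level, a common simplification), not a measured figure from any system.

```python
# Back-of-the-envelope write amplification: a 100-byte user write
# through a 4 KB-page B-Tree vs. an LSM-tree with a 6-level hierarchy.
PAGE_SIZE = 4096
ROW_SIZE = 100

# B-Tree: the whole page is rewritten for one small row (WAL ignored).
btree_wa = PAGE_SIZE / ROW_SIZE           # ≈ 41x

# LSM: WAL write + memtable flush + one rewrite per compaction level.
LEVELS = 6                                # assumed depth, for illustration
lsm_wa = (ROW_SIZE + ROW_SIZE + LEVELS * ROW_SIZE) / ROW_SIZE   # = 8x

print(f"B-Tree WA ≈ {btree_wa:.0f}x, LSM WA ≈ {lsm_wa:.0f}x")
```

Under these assumptions the LSM figure lands at 8x, inside the 5x-15x range quoted above, while the B-Tree pays its full page cost on every small update.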

References & Further Reading