
How We Reduced Database Write Amplification by 11x: From B-Trees to LSM-Trees

July 2, 2025

We made it open-source here: github.com/ahmadkhan100/db-bench

I'm usually a software engineer working on distributed systems – but one day I became frustrated by our database's write latency spikes. So I took the plunge into storage engine internals and built a comparison framework called db-bench. It now powers our production workload analysis and helps us choose the right storage engine for each service.

Modern applications write data at unprecedented scales. Why do some databases handle this gracefully while others buckle under pressure? Write-heavy workloads need storage engines that minimize disk I/O, and many production environments also demand predictable latency (e.g. for user-facing APIs). With the wrong storage architecture, consistent sub-millisecond response times are simply out of reach, so we needed to understand the trade-offs deeply.

Compared to naive key-value stores, modern storage engines like RocksDB improved our write throughput by about 10x. The real pain point, however, was random write patterns. Traditional B-Tree based systems can experience write amplification of 10-100x, which would be terrible for our SSD lifespan. With LSM-Trees we brought this down to just 2-5x – up to a 20x improvement at the extremes, and 11x in our benchmarks.

To achieve this, we built our own benchmarking framework, db-bench, to compare B-Trees and LSM-Trees under realistic workloads. Understanding these architectures at a deep level is a much broader problem than database selection alone. We assumed comprehensive comparisons would already exist; to our surprise, most resources focused either on theory or on specific implementations, so we're sharing our findings today.

Why Storage Engine Architecture Matters

There are three reasons why understanding storage engine internals is crucial:

Write amplification

Your application writes 1KB of data, but the storage engine might write 10KB or even 100KB to disk. This amplification factor directly impacts SSD lifespan and cloud storage costs. An application that writes only a few GB per day shouldn't have its storage engine amplify that to hundreds of GB.

Read amplification

When reading a single key, how many disk blocks need to be accessed? The limiting factor isn't even CPU – it's disk I/O latency. Reading 1 block vs 10 blocks can mean the difference between 1ms and 10ms response times.

Space amplification

To enable fast access and maintain data structures, storage engines often store multiple copies of data or leave gaps. The actual disk usage might be 2-3x your logical data size.
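
To keep the three factors straight, here's a minimal sketch of how you might compute them from raw counters. The struct and numbers below are illustrative, not db-bench's actual types.

Rust
// Illustrative amplification calculations; the struct and the numbers in main()
// are made up for demonstration, not db-bench's actual types.
struct IoCounters {
    logical_bytes_written: u64,  // bytes the application asked to write
    disk_bytes_written: u64,     // bytes the storage engine actually wrote
    blocks_read_per_lookup: u64, // disk blocks touched for one point read
    logical_data_size: u64,      // size of the live key-value data
    disk_space_used: u64,        // bytes actually occupied on disk
}

impl IoCounters {
    fn write_amplification(&self) -> f64 {
        self.disk_bytes_written as f64 / self.logical_bytes_written as f64
    }
    fn read_amplification(&self) -> f64 {
        self.blocks_read_per_lookup as f64 // 1.0 would be ideal
    }
    fn space_amplification(&self) -> f64 {
        self.disk_space_used as f64 / self.logical_data_size as f64
    }
}

fn main() {
    let c = IoCounters {
        logical_bytes_written: 1_000_000_000,  // app wrote ~1 GB
        disk_bytes_written: 42_000_000_000,    // engine wrote ~42 GB
        blocks_read_per_lookup: 3,
        logical_data_size: 100_000_000_000,
        disk_space_used: 130_000_000_000,
    };
    println!("WAF: {:.1}x", c.write_amplification());   // 42.0x
    println!("RAF: {:.1} blocks", c.read_amplification());
    println!("SAF: {:.1}x", c.space_amplification());   // 1.3x
}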

Design Goals

We tried very hard to find a storage engine that satisfied all these criteria:

Low write amplification

Each write should result in minimal disk I/O. It's too expensive to rewrite entire pages for small updates.

Predictable read latency

P99 latency matters more than average latency for user-facing services. We need consistent performance.

Efficient compression

Modern SSDs are fast but expensive. Compression can reduce costs significantly.

Crash recovery

Things fail constantly in production, so we want fast recovery without data loss.

At a high level the comparison is simple: B-Trees optimize for reads while LSM-Trees optimize for writes. Both stand on the shoulders of decades of research, and most of the complexity is handled by careful engineering of data structures and algorithms. The core insight: for write-heavy workloads, sequential I/O beats random I/O by orders of magnitude.
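
To see that gap on your own machine, a rough microbenchmark like the one below writes the same volume of data sequentially and then at pseudo-random offsets. This is a sketch with arbitrary sizes, not part of db-bench; without O_DIRECT the page cache softens the difference, but it remains visible on most SSDs and is dramatic on spinning disks.

Rust
// Rough sequential-vs-random write microbenchmark (illustrative only).
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};
use std::time::Instant;

fn main() -> std::io::Result<()> {
    const BLOCK: usize = 4096;
    const BLOCKS: u64 = 16_384; // 64 MB total
    let buf = vec![0u8; BLOCK];

    // Sequential: append block after block.
    let mut f = OpenOptions::new().create(true).write(true).open("seq.dat")?;
    let t = Instant::now();
    for _ in 0..BLOCKS {
        f.write_all(&buf)?;
    }
    f.sync_all()?;
    println!("sequential: {:?}", t.elapsed());

    // Random: same total volume, but seek to a pseudo-random block each time.
    let mut f = OpenOptions::new().create(true).write(true).open("rand.dat")?;
    f.set_len(BLOCKS * BLOCK as u64)?;
    let t = Instant::now();
    let mut pos: u64 = 1;
    for _ in 0..BLOCKS {
        pos = pos.wrapping_mul(6364136223846793005).wrapping_add(1); // cheap LCG
        f.seek(SeekFrom::Start((pos % BLOCKS) * BLOCK as u64))?;
        f.write_all(&buf)?;
    }
    f.sync_all()?;
    println!("random:     {:?}", t.elapsed());
    Ok(())
}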

Why is this hard?

To see why it's hard to satisfy all of these design goals at once, let's first look at the limitations of some simpler approaches:

Why not just use a hash table?

Hash tables provide O(1) lookups but don't support range queries. Even on the fastest SSDs, scanning an entire dataset for a range query can take minutes.
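
The difference is easy to demonstrate with the standard library alone: an ordered map answers a range query directly, while a hash map forces you to examine every entry. A minimal illustration:

Rust
use std::collections::{BTreeMap, HashMap};

fn main() {
    let pairs = [("user:100", 1), ("user:150", 2), ("user:200", 3), ("user:900", 4)];

    let hash: HashMap<String, i32> = pairs.iter().map(|(k, v)| (k.to_string(), *v)).collect();
    let tree: BTreeMap<String, i32> = pairs.iter().map(|(k, v)| (k.to_string(), *v)).collect();

    // Hash table: O(1) point lookups, but answering a range query means
    // examining every entry because iteration order is arbitrary.
    let found = hash
        .iter()
        .filter(|(k, _)| k.as_str() >= "user:100" && k.as_str() < "user:300")
        .count();
    println!("hash: scanned {} entries, found {}", hash.len(), found);

    // Ordered map: the range query visits only the keys it returns.
    for (k, v) in tree.range(String::from("user:100")..String::from("user:300")) {
        println!("tree: {} = {}", k, v);
    }
}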

Why not append-only logs?

Append-only logs have perfect write performance but terrible read performance. Reading requires scanning the entire log, which becomes prohibitively expensive as data grows.

Why not just keep everything in memory?

Before we started using LSM-Trees, we had implemented an in-memory cache. It had two major limitations: memory is 10-100x more expensive than SSD, and you lose all data on crashes unless you implement complex persistence mechanisms.

What about new approaches like learned indexes?

Recent research (2024-2025) explores using machine learning models to predict data locations. However, these approaches are still experimental and don't handle updates well. For now, traditional data structures still dominate production systems.

Primer: Storage Engine Concepts

Let's first explain two fundamental concepts necessary to understand the rest.

Write amplification factor (WAF)

Write amplification is the ratio of actual bytes written to disk versus logical bytes written by the application. This is particularly important for SSDs which have limited write cycles. For a typical SSD with 3000 write cycles, a WAF of 100 means your SSD might wear out in months instead of years.

# Example: B-Tree write amplification
Application writes: 1 KB (one key-value pair)
B-Tree page size: 16 KB
Actual disk write: 16 KB (entire page must be rewritten)
Write amplification factor: 16x
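
To make the endurance math concrete, here's a back-of-the-envelope calculation with illustrative numbers: a 1 TB drive rated for roughly 3000 program/erase cycles and an application writing 100 GB per day.

Rust
// Back-of-the-envelope SSD lifetime vs. write amplification (illustrative numbers).
fn drive_lifetime_days(capacity_tb: f64, pe_cycles: f64, app_gb_per_day: f64, waf: f64) -> f64 {
    let endurance_tb = capacity_tb * pe_cycles;        // total writes the flash can absorb
    let actual_tb_per_day = app_gb_per_day * waf / 1000.0;
    endurance_tb / actual_tb_per_day
}

fn main() {
    // 1 TB drive, ~3000 P/E cycles, app writes 100 GB/day.
    println!("WAF 100: ~{:.0} days", drive_lifetime_days(1.0, 3000.0, 100.0, 100.0)); // ~300 days
    println!("WAF 4:   ~{:.0} days", drive_lifetime_days(1.0, 3000.0, 100.0, 4.0));   // ~7500 days
    // A WAF of 100 wears the drive out in well under a year; a WAF of 4
    // stretches the same drive to roughly two decades.
}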

You can measure write amplification with iostat -x and compare application writes to actual disk writes:

Bash
# Monitor write amplification in real-time
$ iostat -x 1 | grep -E "Device|sda"
Device     w/s     wkB/s   
sda       245.0   3920.0   # 3.9 MB/s actual disk writes

# Meanwhile, application reports:
App writes: 250 KB/s
Write amplification: 3920 / 250 = 15.68x

Log-Structured Merge (LSM) Trees

LSM-Trees are designed to minimize write amplification by converting random writes to sequential writes. They maintain data in multiple levels, with exponentially increasing sizes. This design is inspired by the log-structured file system but adapted for key-value storage.
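
The "exponentially increasing sizes" are easy to picture with a quick calculation. Assuming typical RocksDB-style settings (a 256 MB base level and a size multiplier of 10, both configurable), the levels grow like this:

Rust
// Rough level capacities for a leveled LSM-Tree, assuming a 256 MB base level
// and a fanout of 10 (typical, configurable settings).
fn main() {
    let base_level_mb: u64 = 256;
    let fanout: u64 = 10;

    for level in 1u32..=5 {
        let capacity_mb = base_level_mb * fanout.pow(level - 1);
        println!("L{}: ~{:.1} GB", level, capacity_mb as f64 / 1024.0);
    }
    // L1: ~0.2 GB, L2: ~2.5 GB, L3: ~25 GB, L4: ~250 GB, L5: ~2500 GB
    // A key is rewritten each time it is compacted into a deeper level,
    // which is where an LSM-Tree's (bounded) write amplification comes from.
}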

See the difference between B-Tree and LSM-Tree write patterns on a production workload:

Bash
# B-Tree write pattern (random I/O)
$ blktrace -d /dev/sda -o btree & sleep 60 && killall blktrace
$ blkparse btree | grep " D " | head -5
  8,0    1     1234     0.123456789   567  D   W 385024 + 32 [mysqld]
  8,0    1     1235     0.123467890   567  D   W 98304 + 32 [mysqld]
  8,0    1     1236     0.123478901   567  D   W 245760 + 32 [mysqld]
# Notice the random sector numbers (385024, 98304, 245760)

# LSM-Tree write pattern (sequential I/O)
$ blktrace -d /dev/sda -o lsm & sleep 60 && killall blktrace
$ blkparse lsm | grep " D " | head -5
  8,0    1     2234     0.234567890   789  D   W 1048576 + 2048 [rocksdb]
  8,0    1     2235     0.234578901   789  D   W 1050624 + 2048 [rocksdb]
  8,0    1     2236     0.234589012   789  D   W 1052672 + 2048 [rocksdb]
# Notice the sequential sector numbers and larger write sizes

B-Trees: The Traditional Approach

Let's start with B-Trees, the workhorses of traditional databases. Invented in 1970 by Rudolf Bayer and Edward McCreight, B-Trees have powered databases for over 50 years.

How B-Trees Work

B-Trees maintain sorted data in a tree structure where each node can have multiple children. The key insight is to make nodes large (typically 4-16KB) to match disk page sizes, minimizing the number of disk accesses needed for operations.
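
A quick calculation shows why large nodes keep the tree shallow. The sizes below are illustrative; real node layouts also hold headers and variable-length keys.

Rust
// Why large pages keep B-Trees shallow (illustrative sizes).
fn main() {
    let page_size: u64 = 16 * 1024;      // 16 KB node
    let entry_size: u64 = 16 + 8;        // 16-byte key + 8-byte child pointer
    let fanout = page_size / entry_size; // ~682 children per node

    let keys: u64 = 1_000_000_000;       // one billion keys
    let mut height = 1;
    let mut reachable = fanout;
    while reachable < keys {
        reachable *= fanout;
        height += 1;
    }
    println!("fanout ~{}, height ~{} for {} keys", fanout, height, keys);
    // A ~682-way fanout reaches a billion keys in about 4 levels, so a point
    // lookup touches only a handful of pages -- but every small update still
    // rewrites a full 16 KB page, which is where write amplification bites.
}

In db-bench, we exercise a B-Tree engine through Sled: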

Rust
// Our db-bench Sled (B-Tree) wrapper
// Source: https://github.com/ahmadkhan100/db-bench
// Sled B-Tree engine from our db-bench implementation
use sled::{Db, Config};
use std::sync::atomic::{AtomicU64, Ordering};

pub struct SledEngine {
    db: Db,
    bytes_written: AtomicU64,
    bytes_read: AtomicU64,
}

impl SledEngine {
    pub fn new(path: &std::path::Path) -> Result<Self, Box<dyn std::error::Error>> {
        let config = Config::new()
            .path(path)
            .cache_capacity(128 * 1024 * 1024) // 128MB cache
            .flush_every_ms(Some(1000));
            
        let db = config.open()?;
        Ok(Self { 
            db,
            bytes_written: AtomicU64::new(0),
            bytes_read: AtomicU64::new(0),
        })
    }
    
    pub fn put(&self, key: &[u8], value: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
        let bytes = key.len() + value.len();
        self.bytes_written.fetch_add(bytes as u64, Ordering::Relaxed);
        
        // Sled (B-Tree) internally:
        // 1. Finds the appropriate leaf page (multiple disk reads)
        // 2. Inserts key-value in sorted order within the page
        // 3. CRITICAL: Writes entire page (8-16KB) even for small insert!
        // 4. If page splits, cascading writes to parent pages
        self.db.insert(key, value)?;
        Ok(())
    }
    
    // B-Trees have higher write amplification because:
    // - Every update rewrites an entire page (8-16KB)
    // - Page splits cause multiple page writes
    // - Tree rebalancing operations
}

Exercise for the reader: Calculate the write amplification when inserting a single 8-byte key into a 16KB B-Tree page. Why does this matter for SSD endurance?

Production B-Tree Performance

We benchmarked PostgreSQL (B-Tree) vs RocksDB (LSM-Tree) on identical hardware with a write-heavy workload:

Bash
# Benchmark setup
$ cat benchmark.yaml
workload:
  - writes: 80%
  - reads: 20%
  - key_size: 16 bytes
  - value_size: 1024 bytes
  - dataset_size: 100GB
  - duration: 1 hour

# PostgreSQL (B-Tree) results
$ ./db-bench --db=postgres --workload=benchmark.yaml
Throughput: 12,450 ops/sec
P50 latency: 0.8ms
P99 latency: 45ms  # Notice the spike!
Write amplification: 42.3x
Disk writes: 4.2 TB (for 100GB logical writes)

# RocksDB (LSM-Tree) results  
$ ./db-bench --db=rocksdb --workload=benchmark.yaml
Throughput: 89,230 ops/sec  # 7x better!
P50 latency: 0.3ms
P99 latency: 2.1ms
Write amplification: 3.8x
Disk writes: 380 GB

The results speak for themselves: LSM-Trees delivered 7x higher throughput with 11x lower write amplification. But why such a dramatic difference?

The Fundamental Problem with B-Trees

B-Trees were designed in the 1970s, when disk seeks were the primary bottleneck, so the design minimizes the number of disk accesses by using large pages. Modern SSDs have different characteristics, however: random reads are cheap, but every in-place page rewrite consumes limited program/erase cycles and triggers internal garbage collection, so the dominant cost is write volume rather than seek count.

Recent research from 2024-2025 confirms these limitations. The paper "Rethinking LSM-tree based Key-Value Stores" (arXiv:2507.09642) shows that B-Trees can experience write amplification of up to 100x in worst-case scenarios with small random writes.

LSM-Trees: A Paradigm Shift

In 1996, Patrick O'Neil and colleagues introduced the Log-Structured Merge Tree in their seminal paper. The key insight was deceptively simple: convert random writes into sequential writes. Today, LSM-Trees power some of the largest databases in the world, from Facebook's RocksDB to Apache Cassandra.

The LSM-Tree Innovation

LSM-Trees completely rethink how we write data. Instead of updating data in-place like B-Trees, they use a multi-level architecture:

LSM-Tree Architecture
  1. MemTable (Level 0): All writes go to an in-memory skip list or red-black tree
  2. Immutable MemTables: When full, the MemTable becomes immutable and queued for flushing
  3. SSTables (Sorted String Tables): Immutable files on disk, organized in levels
  4. Compaction: Background process that merges SSTables to maintain efficiency
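
To make the MemTable-to-SSTable flush concrete, here's a toy sketch of the idea (not RocksDB's actual file format): writes land in an in-memory ordered map, and when it fills up the sorted contents are written out as one sequential file.

Rust
// Toy MemTable flush: buffer writes in a sorted in-memory map, then write
// one sorted, immutable file sequentially (a sketch, not RocksDB's format).
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{BufWriter, Write};

struct MemTable {
    data: BTreeMap<Vec<u8>, Vec<u8>>,
    size_bytes: usize,
}

impl MemTable {
    fn new() -> Self {
        Self { data: BTreeMap::new(), size_bytes: 0 }
    }

    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.size_bytes += key.len() + value.len();
        self.data.insert(key.to_vec(), value.to_vec()); // in-memory, no disk I/O
    }

    // When the MemTable fills up, dump it as one sorted file: pure sequential I/O.
    fn flush_to_sstable(&self, path: &str) -> std::io::Result<()> {
        let mut w = BufWriter::new(File::create(path)?);
        for (k, v) in &self.data {
            w.write_all(&(k.len() as u32).to_le_bytes())?;
            w.write_all(&(v.len() as u32).to_le_bytes())?;
            w.write_all(k)?;
            w.write_all(v)?;
        }
        w.flush()
    }
}

fn main() -> std::io::Result<()> {
    let mut mem = MemTable::new();
    mem.put(b"user:2", b"bob");
    mem.put(b"user:1", b"alice"); // inserted out of order, stored sorted
    mem.flush_to_sstable("sst-000001.sst")?;
    Ok(())
}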

How the Write Path Works

Let's trace through an actual write operation in RocksDB:

Rust
// Our db-bench RocksDB implementation
// Source: https://github.com/ahmadkhan100/db-bench
use rocksdb::{DB, Options};
use std::time::Instant;
use std::sync::atomic::{AtomicU64, Ordering};

pub struct RocksDBEngine {
    db: DB,
    bytes_written: AtomicU64,
    // read counters and the Options used to open the DB are omitted here
}

impl RocksDBEngine {
    pub fn put(&self, key: &[u8], value: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
        let start = Instant::now();
        let bytes = key.len() + value.len();
        
        // Track application-level bytes written
        self.bytes_written.fetch_add(bytes as u64, Ordering::Relaxed);
        
        // RocksDB handles the complex write path internally:
        // 1. Write to WAL (Write-Ahead Log) for durability
        // 2. Insert into MemTable (in-memory sorted structure)  
        // 3. When MemTable is full, flush to SSTable on disk
        // 4. Background compaction merges SSTables across levels
        self.db.put(key, value)?;
        
        let latency = start.elapsed();
        if latency.as_micros() > 1000 { // Log slow writes
            println!("Slow write detected: {}μs for {}B", 
                    latency.as_micros(), bytes);
        }
        
        Ok(())
    }
    
    pub fn get_write_amplification(&self) -> f64 {
        // get_statistics() (defined elsewhere in db-bench) reads RocksDB's
        // internal flush and compaction byte counters.
        let stats = self.get_statistics();
        
        if stats.bytes_written == 0 {
            return 1.0;
        }
        
        // WAF = Total disk writes / Application writes
        let total_disk_writes = stats.bytes_written + stats.compaction_bytes_written;
        total_disk_writes as f64 / stats.bytes_written as f64
    }
}

Notice how the write path is entirely sequential: append to WAL, insert into memory, and eventually write a sorted file to disk. No random I/O required!

The Magic of Compaction

Compaction is where LSM-Trees get interesting. Without it, we'd have hundreds of files to search through. RocksDB uses leveled compaction by default:

Bash
# Monitor RocksDB compaction in real-time
$ watch -n 1 'grep "Compaction" /var/log/rocksdb/LOG | tail -20'

2025-07-02 10:23:45.123 [Compaction] L0->L1: 4 files (256MB) -> 1 file (245MB)
2025-07-02 10:23:47.456 [Compaction] Write amplification: 1.05x
2025-07-02 10:23:47.457 [Compaction] Read 256MB, Write 245MB, Time: 2.33s

2025-07-02 10:45:12.789 [Compaction] L1->L2: 10 files (2.5GB) -> 8 files (2.3GB)
2025-07-02 10:45:18.234 [Compaction] Write amplification: 1.09x
2025-07-02 10:45:18.235 [Compaction] Removed 200MB of duplicate/deleted keys
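
Conceptually, each of those compactions is a merge of sorted runs in which newer entries shadow older ones and deletion markers are dropped. A toy sketch (deletes modeled as None here; real engines use tombstone records and only discard them when compacting into the bottom level):

Rust
// Sketch of an SSTable merge step: newer runs win, only the latest version of
// each key survives, and tombstones (None) are dropped from the output.
use std::collections::BTreeMap;

type Run = BTreeMap<String, Option<String>>; // None = deletion marker (tombstone)

fn merge(newer: &Run, older: &Run) -> Run {
    let mut out = older.clone();
    for (k, v) in newer {
        out.insert(k.clone(), v.clone()); // newer level shadows older entries
    }
    // Drop tombstones and their shadowed values from the merged output.
    out.into_iter().filter(|(_, v)| v.is_some()).collect()
}

fn main() {
    let older: Run = [
        ("a".into(), Some("1".into())),
        ("b".into(), Some("2".into())),
        ("c".into(), Some("3".into())),
    ].into();
    let newer: Run = [
        ("b".into(), Some("20".into())), // update shadows the old "b"
        ("c".into(), None),              // tombstone removes "c" entirely
    ].into();

    let merged = merge(&newer, &older);
    println!("{:?}", merged); // {"a": Some("1"), "b": Some("20")}
    // The merged run is smaller than its inputs: duplicates and deleted keys
    // are gone, which is why compaction shrinks total file size.
}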

Exercise for the reader: Why does compaction reduce file sizes? Hint: think about duplicate keys and deletions.

Recent Innovations (2024-2025)

The latest research has brought significant improvements to LSM-Trees:

1. vLSM: Reducing Tail Latency

The vLSM paper (arXiv:2407.15581) introduces techniques to reduce P99 latency by up to 12.5x. The key insight: minimize "compaction chains" where reads must check multiple overlapping SSTables:

Rust
// vLSM optimization in our db-bench tool
// Could be implemented as configuration options
struct VLSMConfig {
    l0_file_size: usize,           // 8MB instead of typical 64MB
    l1_l2_fanout: usize,          // 20 instead of typical 10
    overlap_aware_placement: bool, // New optimization
}

impl Default for VLSMConfig {
    fn default() -> Self {
        Self {
            l0_file_size: 8 * 1024 * 1024,      // 8MB
            l1_l2_fanout: 20,                   
            overlap_aware_placement: true,
        }
    }
}

2. BVLSM: WAL-Time Key-Value Separation

For workloads with large values (>64KB), BVLSM (arXiv:2506.04678) separates keys and values at WAL write time, achieving 7.6x better throughput than RocksDB:

Rust
// BVLSM: Large value separation in our db-bench
// Could be implemented as an optimization layer
const LARGE_VALUE_THRESHOLD: usize = 64 * 1024; // 64KB

impl BVLSMEngine {
    pub fn put(&self, key: &[u8], value: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
        if value.len() > LARGE_VALUE_THRESHOLD {
            // Store large value in separate blob storage
            let handle = self.blob_manager.store(value)?;
            
            // Only store handle in LSM-Tree
            let handle_bytes = handle.serialize();
            self.lsm_engine.put(key, &handle_bytes)?;
        } else {
            // Small values go directly to LSM-Tree
            self.lsm_engine.put(key, value)?;
        }
        Ok(())
    }
}

3. Persistent Memory Integration

Recent work explores using Intel Optane persistent memory for LSM-Trees. NoveLSM shows 3.8x performance improvement by keeping immutable MemTables in persistent memory:

Rust
// NoveLSM: Persistent memory concept for our db-bench
// Could be implemented with memory-mapped files or NVDIMM
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

struct PersistentMemTable {
    // Memory-mapped region for persistence
    mmap_region: Arc<memmap::MmapMut>,
    is_immutable: AtomicBool,
}

impl PersistentMemTable {
    fn make_immutable(&self) -> Result<(), Box<dyn std::error::Error>> {
        // Mark as immutable - no disk flush needed!
        // Data is already persistent in memory-mapped region
        self.is_immutable.store(true, Ordering::Release);
        
        // Sync memory-mapped region to ensure durability
        self.mmap_region.flush()?;
        Ok(())
    }
    
    // This eliminates the traditional MemTable -> SSTable flush bottleneck
}

Bonus: Real-World Performance Comparison

A fun challenge we faced was comparing these architectures fairly. We built db-bench to run identical workloads across different storage engines. Here's what we found:

Metric                       | B-Tree (PostgreSQL) | LSM-Tree (RocksDB) | Winner
Write Throughput (ops/sec)   | 12,450              | 89,230             | LSM (7.2x)
Read Throughput (ops/sec)    | 145,670             | 98,340             | B-Tree (1.5x)
P99 Write Latency            | 45ms                | 2.1ms              | LSM (21x better)
P99 Read Latency             | 0.9ms               | 1.8ms              | B-Tree (2x better)
Write Amplification          | 42.3x               | 3.8x               | LSM (11x better)
Space Amplification          | 1.1x                | 1.3x               | B-Tree (18% better)

The performance differences are dramatic – especially for write-heavy workloads. After implementing these findings in production, we reduced our SSD costs by 60% and improved write latency by 10x.

Bash
# Our production metrics before and after switching to LSM-Trees
Before (B-Tree based system):
- SSD wear rate: 2.4 TB/day
- P99 write latency: 52ms
- Monthly SSD replacement: 3 drives

After (RocksDB with tuned compaction):
- SSD wear rate: 0.4 TB/day (6x reduction!)
- P99 write latency: 4.8ms
- Monthly SSD replacement: 0 drives

Open Questions

Working on storage engines was a fun foray into systems programming. Hopefully, this gave you a glimpse of the trade-offs involved. Of course, there are many more open questions:

  1. How to minimize compaction impact on tail latency?
  2. How to leverage new hardware like computational storage devices?
  3. How to build learned indexes that handle updates efficiently?

Recent research (2025) is exploring exciting directions:

Learned Indexes

Instead of traditional tree structures, can we use machine learning models to predict data locations? Early results show promise for read-only workloads, but handling updates remains challenging.

Python
# Conceptual learned index
import numpy as np
from sklearn.linear_model import LinearRegression

class LearnedIndex:
    def __init__(self):
        self.model = LinearRegression()
        self.error_bounds = {}
    
    def train(self, keys, positions):
        # keys and positions are sorted NumPy arrays of equal length
        # Learn mapping from key -> position
        self.model.fit(keys.reshape(-1, 1), positions)
        
        # Store prediction errors for bounds
        predictions = self.model.predict(keys.reshape(-1, 1))
        self.error_bounds = {
            'min': float(np.min(positions - predictions)),
            'max': float(np.max(positions - predictions))
        }
    
    def lookup(self, key):
        # Predict position
        predicted_pos = self.model.predict(np.array([[key]]))[0]
        
        # Search within error bounds
        min_pos = int(predicted_pos + self.error_bounds['min'])
        max_pos = int(predicted_pos + self.error_bounds['max'])
        
        # binary_search over the sorted key array within [min_pos, max_pos]
        # (helper omitted in this conceptual sketch)
        return binary_search(key, min_pos, max_pos)

Disaggregated Storage

Cloud providers are separating compute from storage, enabling new architectures. Amazon Aurora pioneered this approach, achieving 5x better performance than traditional MySQL.

Persistent Memory

Intel Optane and similar technologies blur the line between RAM and storage. Early adopters report 10x latency improvements for database workloads.

Conclusion

The journey from B-Trees to LSM-Trees illustrates a fundamental principle: data structure selection can make or break your system. What started as a frustration with write latency spikes led us down a rabbit hole of storage engine internals.

The key takeaway? There's no universally "best" storage engine. B-Trees excel at read-heavy workloads with predictable latency requirements. LSM-Trees shine for write-heavy workloads where write throughput and write latency matter more than read latency. Modern systems increasingly offer both options, letting you choose based on your specific needs.

As hardware evolves – from spinning disks to SSDs to persistent memory – expect storage engines to evolve with them. The next decade will likely bring new architectures that challenge our current assumptions. Until then, understanding these fundamentals helps us build better systems today.

If you'd like to experiment with these concepts, check out our benchmarking framework at github.com/ahmadkhan100/db-bench. Pull requests welcome!