LSM Trees: Why Your Database Is Secretly Using One and What It’s Actually Doing

The Problem That Made Me Actually Care About Storage Engines

Write latency went from 2ms to 400ms in about 90 seconds. No deployment had happened. No traffic spike. The cluster metrics looked fine — CPU at 40%, heap usage normal, no GC pressure. My first instinct was to blame the query pattern, so I spent two hours optimizing partition keys that didn’t need optimizing. The writes were still slow. That was the moment I realized I had a storage engine problem, not an application problem.

We were pushing roughly 50,000 events per second into a Cassandra 4.0 cluster, time-series telemetry data from IoT sensors. The pattern seemed ideal for Cassandra on paper — narrow rows, append-heavy, almost no reads during ingestion. What I hadn’t accounted for was compaction. Cassandra had quietly accumulated dozens of SSTables on each node, and the compaction strategy (we were using SizeTieredCompactionStrategy by default) had decided it was time to merge several large ones simultaneously. That merge was competing with writes for disk I/O. Not a Cassandra bug. Not our code. The storage engine doing exactly what it was designed to do — just at the worst possible moment.

I opened the Cassandra source to understand why compaction was blocking writes, and that’s when I first saw the LSMTree references in the storage engine path. Until that point I’d treated Cassandra as a black box with a query language on top. Seeing the actual architecture — memtable flushes, SSTable levels, bloom filters, the whole structure — felt like finally reading the manual after assembling the furniture wrong. The fix wasn’t complicated: switching to LeveledCompactionStrategy with sstable_size_in_mb: 160 spread the compaction work more evenly and our p99 write latency dropped back to under 5ms. But I couldn’t have made that call confidently without understanding what the tree was actually doing.

# What our cassandra.yaml compaction config looked like after the fix
# Per-table setting in the schema, not the global config
CREATE TABLE sensor_events (
  sensor_id uuid,
  event_time timestamp,
  payload blob,
  PRIMARY KEY (sensor_id, event_time)
) WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': '160'
  -- LCS keeps ~10x less read amplification than STCS
  -- at the cost of higher write I/O — acceptable tradeoff for us
  -- because our read SLAs were tighter than write SLAs
};

What changed after that incident wasn’t just how I configured Cassandra — it was how I approach any database with an LSM-backed engine. RocksDB, ScyllaDB, TiKV, BadgerDB — they all expose the same levers because they share the same fundamental architecture. When a ScyllaDB node starts lagging under a write burst, I now check compaction queue depth first. When a RocksDB-based service shows growing read latency over time, I look at the L0 SSTable count before touching the application layer. That mental model is the thing that transfers. For tools that use LSM-backed databases under the hood, see our guide on Essential SaaS Tools for Small Business in 2026.

The honest reason most developers never get here is that modern databases are really good at hiding the engine from you. CQL looks like SQL. The Cassandra driver handles retries. The ops team handles provisioning. You can run a Cassandra cluster for two years without ever thinking about SSTables. Until write latency spikes at 3am and you’re staring at a dashboard that tells you everything except what’s actually wrong.

What LSM Actually Is (Without the Textbook Version)

The thing that clicked for me after reading three different explanations was this: an LSM tree isn’t really a tree in the way you’re thinking. It’s a staged write pipeline that exploits one fundamental truth about storage hardware — appending to the end of a file is always faster than hunting for a specific location and overwriting it. Everything else follows from that single insight.

Think about what a B-Tree does on write. You insert a key, the engine traverses down to the right leaf node, and then it writes in place — modifying bytes at some specific offset on disk. That offset might be at position 40MB in your data file. Then your next write goes to position 400MB. Then 12MB. Your write pattern looks like a drunk person wandering around a parking lot. On an HDD, this is catastrophic because the physical read head has to seek to each location — we’re talking 5-10ms per seek, and if you’re doing thousands of writes per second, that’s your entire budget gone just on mechanical movement. On SSDs it’s better, but random writes still churn through write amplification faster and wear out the cells unevenly compared to sequential writes that can be batched into full pages.

LSM flips this. Every write goes to the end of a log. Always. No seeking. The MemTable (an in-memory sorted structure, usually a skip list or red-black tree) absorbs incoming writes at memory speed. The WAL (Write-Ahead Log) on disk makes those writes durable by appending sequentially — so even if the server dies, you don’t lose data. When the MemTable fills up, it gets flushed to disk as an SSTable (Sorted String Table), which is an immutable, sorted file. Immutable is the key word: you never modify an SSTable after writing it. Deletes are handled with tombstone markers, updates just write a newer version. The background compaction process periodically merges and garbage-collects these files. One-liner: MemTable absorbs writes → WAL makes them durable → SSTables store sorted immutable data on disk → Compaction keeps it all from getting out of hand.

The honest trade-off nobody leads with: reads get harder. To find a key, you might have to check the MemTable, then multiple levels of SSTables, newest to oldest. Bloom filters help you skip files that definitely don’t contain the key, but worst-case reads still touch multiple files. If you’re building a system where reads are latency-sensitive and writes are moderate, a B-Tree (Postgres, MySQL InnoDB) is probably the better call. If you’re ingesting time-series data, event logs, metrics, or anything that looks like a firehose of writes where you mostly query recent data — LSM is where you want to be. RocksDB, Cassandra, ScyllaDB, LevelDB, and ClickHouse’s MergeTree all made this bet.

# What a write looks like at each layer (conceptual, not API):

1. Client writes key="user:42", value="alice"

2. WAL append (sequential disk write, ~microseconds):
   [seq:00041][PUT][user:42][alice]

3. MemTable insert (in-memory skip list, nanoseconds):
   {user:42 -> alice}  # sorted, fast lookup

4. When MemTable hits ~64MB threshold, flush to SSTable:
   # SSTable is immutable after this point
   # File: sst_L0_00012.sst
   # Contains sorted key-value pairs + bloom filter + index block

5. Background compaction merges L0 files into L1, L2...
   # Removes deleted keys, keeps only latest version of each key

One surprise that catches people off guard: the WAL and the SSTable both write the same data, which means your raw write amplification starts at 2x before compaction even enters the picture. Compaction can push that to 10-30x depending on your level multiplier settings. This is why LSM-based systems have tuning knobs for compaction strategy — leveled compaction (RocksDB default) optimizes for read performance at the cost of higher write amplification, while tiered/size-tiered compaction (Cassandra default) is the opposite. You’re not picking a data structure, you’re picking which bottleneck you can live with.

The Architecture Layer by Layer

The thing that surprises most developers first encountering LSM trees is that writes never touch the primary storage structure directly. Every write goes into an in-memory buffer — the MemTable — which is just a sorted data structure, typically a red-black tree or skip list. RocksDB defaults to a skip list; LevelDB uses a skip list too. The MemTable absorbs writes at memory speed, which is why LSM-backed databases post dramatically better write throughput than B-tree stores on spinning disks or even NVMe under write-heavy loads.

Once the MemTable hits its size threshold (RocksDB defaults to 64MB, configurable via write_buffer_size), it gets sealed, becomes immutable, and a fresh MemTable takes over. The sealed one is now called an immutable MemTable. A background thread flushes it to disk as a sorted file — the SSTable (Sorted String Table). This flush is sequential I/O, which is the whole point. No random writes, no page splits, no tree rebalancing under lock. The WAL (Write-Ahead Log) exists in parallel to handle crash recovery before a flush completes.

# RocksDB config tuning the MemTable layer
write_buffer_size=67108864       # 64MB per MemTable
max_write_buffer_number=3        # max immutable MemTables before stalling writes
min_write_buffer_number_to_merge=1  # flush after 1 immutable MemTable
# If write_buffer_number fills up before flush completes, writes stall — watch this metric

SSTables land in Level 0 first. This is where the architecture gets tricky. L0 files can have overlapping key ranges — the flush doesn’t check for that. A read hitting L0 might need to check every file in that level, which is expensive. That’s why you want to keep L0 file count low. RocksDB starts compaction when L0 hits 4 files by default (level0_file_num_compaction_trigger). Once compaction kicks in, L0 files get merged with Level 1, and here the invariant changes: L1 and every level below it maintains non-overlapping key ranges within the level. A read at L1+ only ever touches one file per level.

The compaction fan-out is geometric. RocksDB’s default max_bytes_for_level_base is 256MB for L1, and each subsequent level multiplies by max_bytes_for_level_multiplier (default 10x). So L2 is 2.56GB, L3 is 25.6GB, L4 is 256GB. Most data ends up in the deepest level. This matters for reads: a point lookup has to potentially check the MemTable, every L0 file, then one file per level from L1 down. That’s where Bloom filters earn their keep — each SSTable carries one, and a filter check costs nanoseconds vs. microseconds for a disk seek. Without bloom filters, LSM reads on deep trees would be brutal.

// Conceptual SSTable file layout (each flushed/compacted file looks like this):
[Data Blocks]      — key-value pairs, sorted, optionally compressed (Snappy/Zstd)
[Index Block]      — one entry per data block, points to first key in that block
[Filter Block]     — Bloom filter bits for all keys in this file
[Meta Index Block] — offsets to filter block and other metadata
[Footer]           — magic number + pointer to meta index block
// Footer is always fixed-size — that's how the reader bootstraps parsing

The compaction strategy you pick changes the entire read/write trade-off profile. Leveled compaction (the RocksDB default) caps read amplification at roughly L0_files + num_levels at the cost of higher write amplification — bytes get rewritten multiple times as they cascade down levels. Size-tiered compaction (what Cassandra uses by default) batches similar-sized SSTables together, producing lower write amplification but potentially horrible read performance when many same-level files overlap keys. FIFO compaction exists for time-series use cases where old data just gets dropped. Picking wrong here is the single biggest configuration mistake I see teams make when they hit performance walls at scale.

MemTable: The In-Memory Write Buffer

The surprising thing about LSM writes is how fast they are initially — you’re not touching disk at all. Every write hits the MemTable first, which is just a sorted in-memory structure. RocksDB’s default choice here is a skip list, and it’s a reasonable one: O(log n) inserts, O(log n) lookups, and lock-free concurrent reads without complex B-tree rebalancing. You can also swap it for a hash skip list or a vector-based structure depending on your access pattern, but skip list covers probably 90% of real workloads fine.

// Explicitly setting skip list (this is already the default, but good to be intentional)
rocksdb::Options options;
options.memtable_factory = std::make_shared<rocksdb::SkipListFactory>();

// Or if you have mostly sequential writes and want faster iteration:
// options.memtable_factory = std::make_shared<rocksdb::VectorRepFactory>();

options.write_buffer_size = 128 * 1024 * 1024; // 128MB instead of default 64MB

The lifecycle of a MemTable is worth understanding precisely. Once it hits write_buffer_size, RocksDB marks it immutable — meaning it stops accepting writes but stays readable. A fresh MemTable takes over immediately so writes don’t stall. The immutable one gets flushed to L0 on disk as an SSTable. You can have multiple immutable MemTables queued for flush (max_write_buffer_number controls this), which matters when your flush speed can’t keep up with your write speed — that’s when write stalls kick in.

I bumped write_buffer_size from the default 64MB to 128MB on a write-heavy pipeline and saw meaningful throughput improvement. The reason is mechanical: a larger MemTable means fewer flushes to L0, which means fewer L0 files, which means less compaction pressure downstream. The cost is memory and a longer recovery time on crash (more WAL to replay). For a 32GB RAM machine doing bulk ingestion, 128–256MB per column family is reasonable. For a small embedded use case, 64MB is already generous.

The crash scenario is the real thing to understand here. Before a MemTable gets flushed, its data exists only in RAM. Kill the process hard — kill -9, power loss, OOM killer — and that data is gone unless something else saved it. That’s the entire reason the Write-Ahead Log exists alongside the MemTable. Every write goes to the WAL on disk first, then into the MemTable. On restart, RocksDB replays the WAL to reconstruct whatever was in memory. This is why disabling the WAL (options.disableWAL = true) is a legitimate option for bulk loads where you can afford to redo work, but an absolutely terrible idea for production transactional data.

# You can observe MemTable state via RocksDB's built-in stats
db->GetProperty("rocksdb.num-immutable-mem-table", &value);
db->GetProperty("rocksdb.mem-table-flush-pending", &value);

# High immutable count consistently means your flush threads are a bottleneck:
options.max_background_flushes = 4; // default is often 1 — this is a common miss

One gotcha that bit me: max_write_buffer_number defaults to 2 (one active, one immutable max). If flushes fall behind and you hit that ceiling, RocksDB will stall writes entirely. You’ll see it in logs as Stopping writes because we have X immutable memtables. The fix is either increasing max_write_buffer_number to 3–4, or increasing background flush threads. Usually it’s the threads — most deployments leave max_background_flushes at 1, which is a single-threaded bottleneck on any NVMe that can handle parallel I/O without breaking a sweat.

Write-Ahead Log (WAL): Your Crash Recovery Safety Net

The thing that surprised me most when I first dug into LSM tree internals: the MemTable is just RAM. Power goes out, everything in it is gone. The WAL is the only reason you don’t lose those writes permanently. Every write hits the WAL first — a sequential append to disk — and then gets inserted into the MemTable. Sequential writes are dramatically faster than random writes (this is the whole reason HDDs got replaced by SSDs for certain workloads, but even on SSDs, sequential I/O has better throughput), so this append-only log adds minimal latency while buying you full durability.

On a RocksDB restart after a crash, the recovery path is straightforward: read the WAL, replay every operation in order, rebuild the MemTable to exactly the state it was in before the crash. Only after the MemTable flushes to an SSTable on disk does RocksDB truncate the WAL. The code to control sync behavior is one line, but the trade-off implications are significant:

// Synchronous write — fsync() after every write
// Safe against OS crash + power loss, but ~3-10x throughput hit
options.sync = true;

// Asynchronous write (default) — OS buffers, fsync() eventually
// Risk: up to ~1-2 seconds of writes lost on hard crash
options.sync = false;

// Per-write override — use sync only for critical writes
WriteOptions write_options;
write_options.sync = true;
db->Put(write_options, key, value);

Cassandra calls its WAL the CommitLog — same architecture, different branding. Every mutation is appended to the CommitLog before the Memtable gets updated. Cassandra even has a commitlog_sync config option in cassandra.yaml that mirrors RocksDB’s sync flag:

# cassandra.yaml
# "periodic" = fsync every commitlog_sync_period_in_ms milliseconds
# "batch" = fsync before ack'ing the write to the client (like sync=true)
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Bump this if your writes are large — default 32MB fills fast under load
commitlog_segment_size_in_mb: 32

There are real situations where you should relax or outright disable WAL sync. Bulk import jobs are the obvious one — if you’re loading 500GB of historical data and you can just re-run the import on failure, you’re burning throughput for no reason. Ephemeral caches, session stores, or any data you can reconstruct from another source also don’t need full durability guarantees. In RocksDB you can skip the WAL entirely for specific writes using write_options.disableWAL = true, which is useful for secondary index writes that are always derived from the primary data anyway. I’ve seen batch pipeline throughput jump 4-5x just from this change.

The subtle gotcha here is partial group commits. When multiple threads write concurrently with sync = false, RocksDB batches those writes and issues a single fsync for the group. This is actually quite smart — you get better throughput than naive async while reducing the durability window compared to fully async. The default behavior often gets misunderstood as “no sync ever” when it’s really “sync batched across concurrent writers.” Check the write_thread_max_yield_usec option if you want to tune how aggressively it groups those writes before flushing.

SSTables: Immutable, Sorted Files on Disk

The immutability of SSTables is what makes the whole LSM write path work — but it’s also the root cause of every read performance problem you’ll ever debug in RocksDB or LevelDB. Once an SSTable is flushed from memory to disk, nothing inside it changes. Ever. No in-place updates, no deletes that actually remove bytes. If you update a key, the new value lands in a newer SSTable. If you delete a key, a tombstone record gets written to a newer SSTable. The old data just sits there until compaction cleans it up. This design lets writers go fast (sequential writes only), but every read now has to ask: “which of these potentially dozens of files has the version of this key I actually want?”

Physically, an SSTable isn’t just a flat sorted list of key-value pairs. The internal layout matters a lot for read performance. A typical SSTable has four main sections:

  • Data blocks: The actual key-value pairs, sorted by key, stored in fixed-size blocks (4KB by default in RocksDB). Each block is independently compressed.
  • Index block: One entry per data block, storing the last key in that block and its file offset. This is how a binary search lands you on the right data block without scanning the whole file.
  • Bloom filter block: A probabilistic structure stored per-SSTable (or per-partition in partitioned filters). The thing that lets you skip the file entirely.
  • Metadata/footer: Magic bytes, format version, and pointers to the index and bloom filter blocks so the reader knows where to start.

Bloom filters are the single most important optimization for read performance in LSM trees. Before doing any disk I/O to check if a key exists in an SSTable, RocksDB queries the bloom filter. If it says “definitely not here,” you skip the file completely — zero I/O. If it says “maybe here,” you do the actual lookup. The “maybe” is the false positive rate, and you control it. In RocksDB, the standard setup looks like this:

#include "rocksdb/filter_policy.h"

rocksdb::BlockBasedTableOptions table_options;

// 10 bits per key → ~1% false positive rate
// Lower bits = higher false positives = more unnecessary disk reads
table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));

// For multi-level setups, partition the filter to reduce memory usage
// table_options.partition_filters = true;
// table_options.index_type = rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;

options.table_factory.reset(
  rocksdb::NewBlockBasedTableFactory(table_options)
);

The 10 in NewBloomFilterPolicy(10) is bits-per-key. At 10 bits you get roughly 1% false positive rate. Drop to 6 bits and you’re closer to 4–5% — meaning 4–5% of the time you do a pointless disk read. Push to 20 bits and false positives approach 0.1%, but bloom filter memory usage doubles. There’s also options.bloom_locality, which controls cache-line alignment of bloom filter probes. Setting it to 1 (the default is 0) groups the hash probes for a single key into the same cache line, which measurably reduces CPU cost on read-heavy workloads with warm filters — the probes stay in L1/L2 cache instead of bouncing around.

The accumulation problem is what bites you when compaction falls behind writes. Each memtable flush creates a new Level-0 SSTable. L0 files are special: unlike deeper levels, they’re not range-partitioned, so any L0 file can contain any key. A point lookup has to check every L0 file before it can rely on the sorted structure of L1+. RocksDB’s default level0_slowdown_writes_trigger is 20 and level0_stop_writes_trigger is 36 — but even at 10 L0 files, reads get noticeably slower because you’re firing off 10 bloom filter checks plus potential I/O, before you even touch L1. I’ve seen production systems where a compaction backlog during a write spike turned a sub-millisecond key lookup into a 50ms nightmare. The bloom filters help, but they can’t fully compensate for checking 30 files where 1 would do.

Compaction: The Part That Actually Hurts in Production

Nobody warns you about compaction until you’re staring at a degraded cluster at 2am. The fast writes you get from LSM trees aren’t free — they’re a loan you repay through compaction. Every time you write, you’re deferring the real work: merging overlapping SSTables, collapsing multiple versions of the same key down to one, and actually deleting tombstoned records so they stop eating disk. Compaction is the background process that makes all of that happen, and it costs you CPU and I/O while your production traffic is still running.

The two strategies that matter in practice are Leveled Compaction and Size-Tiered Compaction, and they trade against each other in ways the docs undersell. Size-tiered groups similarly-sized SSTables and merges them — it writes less total data (lower write amplification, roughly 10x), but at any point in time you can have several overlapping SSTables containing the same key range. That means more disk space consumed and slower reads. Leveled compaction keeps each level’s key ranges non-overlapping and caps the total size per level, which gives you much better read performance and lower space amplification (often 1.1x-1.3x overhead), but it does significantly more I/O — write amplification can hit 30x or higher. I switched a time-series ingestion pipeline from leveled to universal (size-tiered) and cut p99 write latency by 40% immediately. The tradeoff was that disk usage went from tight to “we need to provision more storage.”

In RocksDB specifically, you set this with one line, but the implications are enormous:

// Leveled — better for read-heavy, mixed workloads
options.compaction_style = kCompactionStyleLevel;
options.max_bytes_for_level_base = 256 * 1024 * 1024; // 256MB L1 target
options.level_compaction_dynamic_level_bytes = true;   // let RocksDB scale levels automatically

// Universal (size-tiered) — better for write-heavy, time-series, queue-like data
options.compaction_style = kCompactionStyleUniversal;
options.compaction_options_universal.size_ratio = 10;
options.compaction_options_universal.min_merge_width = 2;

The level_compaction_dynamic_level_bytes = true flag is buried in the RocksDB tuning guide and most tutorials skip it. Without it, you define level sizes statically, and they almost never match your actual data growth pattern. With it enabled, RocksDB works backwards from L_max and sizes upper levels proportionally — this alone reduced our space amplification noticeably on a 2TB dataset.

The compaction debt problem is what actually bites you in production. Writes land in the memtable, flush to L0 SSTables, and compaction is supposed to drain them fast enough that L0 file count stays manageable. RocksDB starts slowing writes at 20 L0 files by default (level0_slowdown_writes_trigger) and stops writes entirely at 36 (level0_stop_writes_trigger). If your write rate spikes and compaction threads can’t keep up — because you’ve given them too few threads, or your disks are saturated, or both — L0 fills up and you hit a write stall. We hit this during a Black Friday-style traffic spike. The symptom was subtle at first: p99 write latency creeping up, then suddenly spiking to seconds. The fix required bumping max_background_compactions from 4 to 16 and isolating compaction onto separate NVMe volumes.

On the Cassandra side, nodetool compactionstats is your diagnostic tool, and the number you need to watch is pending tasks, not active compactions:

$ nodetool compactionstats
pending tasks: 847
          id   compaction type   keyspace          table   completed   total   unit    progress
  a1b2c3...   Compaction        my_keyspace       events  102400000   5e9     bytes    2%

Active compaction remaining time :   0h08m52s

A pending tasks number below ~50 is usually healthy. When you see it climbing past 500, your cluster is accumulating compaction debt faster than it can pay it down. That 847 above was from a real incident — we had STCS (Size-Tiered) running on a table that was getting hammered with updates to existing keys, and the fan-out of overlapping SSTables was completely out of control. Switching that specific table to LCS (Leveled) with ALTER TABLE events WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '160'}; brought pending tasks down to single digits within two hours. The lesson: compaction strategy is per-table in Cassandra, not per-cluster, and most people leave everything on the default STCS even when their access pattern screams for leveled.

Read Path: Where LSM Trees Get Complicated

Most people discover LSM read complexity the hard way — a write-heavy workload runs beautifully, then someone adds a simple lookup query and latency spikes. The read path is where LSM trees pay for their write speed, and understanding exactly how that debt gets collected changes how you design your data models and tune your databases.

Every read walks a gauntlet in strict order: first the active MemTable, then any immutable MemTables waiting to flush, then L0 SSTables (all of them, because they have overlapping key ranges), then L1 through Ln where each level is sorted and non-overlapping. If you’re lucky, the key shows up in the MemTable and you’re done in microseconds. If you’re unlucky, the key doesn’t exist at all, and you’ve just read through every single level to confirm that. This is read amplification — one logical read becoming dozens of physical I/O operations. A naive LSM with 5 levels and no optimizations might do 10-20 disk reads for a single point lookup that misses.

Bloom filters are what make LSM reads survivable. Each SSTable carries a bloom filter that can tell you “this key is definitely not in this file” with zero false negatives and a configurable false positive rate (typically around 1%). For non-existent keys — the absolute worst case — bloom filters let you skip almost every SSTable on disk. RocksDB’s bloom filters are per-block by default but you want full-file filters for point lookups:

// RocksDB: use a full filter, not the default block-based filter
BlockBasedTableOptions table_options;
table_options.filter_policy.reset(NewBloomFilterPolicy(10, false));
// 10 = bits_per_key — lower uses less RAM but more false positives
// false = use full filter, not block-based
options.table_factory.reset(NewBlockBasedTableFactory(table_options));

Before you touch compaction settings, tune the block cache. This is the single highest-use knob in RocksDB and it’s criminally undertweaked in default deployments. The default is 8MB, which is useless for any real workload. For a machine with 32GB RAM where RocksDB is the primary workload, I’d start at 8-16GB:

// 1GB shown here — scale to ~25-30% of available RAM for RocksDB-primary workloads
options.block_cache = NewLRUCache(1 * 1024 * 1024 * 1024);

// Also worth setting — compressed block cache for L2 in memory hierarchy
options.compressed_block_cache = NewLRUCache(512 * 1024 * 1024);

// Check your hit rate after loading — anything below 90% means cache is too small
// rocksdb.block.cache.hit / (rocksdb.block.cache.hit + rocksdb.block.cache.miss)

Point lookups and range scans hit LSM’s structure completely differently. A point lookup benefits from bloom filters and can short-circuit early. A range scan can’t use bloom filters at all — by definition you don’t know which files contain keys in your range until you look. Worse, because compaction merges files but L0 has overlapping ranges, a range scan on a write-heavy system might need to merge-sort data from multiple L0 files simultaneously. This is why Cassandra (LSM-based) handles time-series range scans on partition keys well but struggles with cross-partition range queries — the data model has to align with how the files are laid out on disk. If your access pattern is 80% range scans, you should seriously evaluate whether an LSM store is the right choice versus a B-tree engine like InnoDB or Postgres’s heap.

Every tuning decision in an LSM system is a three-way trade-off between write amplification (how many times each byte gets written during compaction), read amplification (how many files get read per query), and space amplification (how much extra disk space stale data and uncompacted files consume). They move in opposite directions — you cannot optimize all three simultaneously. Leveled compaction (RocksDB’s default) keeps space amplification low (~1.1x) and read amplification low, but write amplification can hit 10-30x. Tiered/STCS compaction (Cassandra’s default) has low write amplification (3-5x) but space amplification can balloon to 2-3x and reads get slower. Size-Tiered with TWCS in Cassandra for time-series is a deliberate choice to get low write amplification while accepting that old data goes cold and you never read it randomly. Knowing which factor is killing your workload first requires measuring, not guessing — RocksDB exposes all three via db.GetProperty("rocksdb.stats") and the compaction stats output tells you exactly where amplification is accumulating.

Real-World Implementations: What They Actually Did Differently

The dirty secret about LSM trees is that every major production implementation diverges significantly from the academic paper. Google’s LevelDB, Facebook’s RocksDB, Apache Cassandra — they all share the same conceptual skeleton but made radically different engineering decisions that change everything about their behavior under load. RocksDB alone has shipped over 20 major configuration knobs that didn’t exist in the original LevelDB, each one born from a production incident at Facebook or one of its downstream users.

RocksDB: The Compaction Obsession

Facebook’s core contribution was treating compaction as a first-class scheduling problem rather than a background afterthought. LevelDB uses a single-threaded compaction mechanism — fine for embedded use cases, catastrophic when you’re running on NVMe drives that can push 3GB/s. RocksDB introduced max_background_compactions and a proper thread pool, but the more interesting addition was the concept of compaction styles. LeveledCompactionStyle minimizes read amplification by strictly controlling which SSTables overlap. UniversalCompactionStyle (sometimes called STCS in Cassandra terms) trades read performance for dramatically lower write amplification — better for sequential write workloads like time-series data. Picking the wrong one for your access pattern will hurt you silently; your p99 read latency climbs and you won’t immediately know why.

// RocksDB options that actually matter for production
rocksdb::Options options;
options.compaction_style = rocksdb::kCompactionStyleLevel;
options.write_buffer_size = 64 * 1024 * 1024;      // 64MB memtable
options.max_write_buffer_number = 3;                 // 3 memtables before stall
options.max_background_compactions = 4;              // parallelism on NVMe
options.level0_slowdown_writes_trigger = 20;         // start throttling here
options.level0_stop_writes_trigger = 36;             // hard stop here — tune these or die
options.bytes_per_sync = 1048576;                    // async WAL sync, reduces latency spikes

Cassandra’s Take: LSM Across a Distributed Ring

Cassandra made one architectural choice that changes the read path completely: every node runs its own independent LSM tree. There’s no global compaction coordinator. This means two nodes can have wildly divergent SSTable layouts for the same partition key range — one might have 3 SSTables, another 15 — and your read latency becomes non-deterministic across replicas. The mitigation is USING CONSISTENCY LEVEL LOCAL_QUORUM plus aggressive compaction strategies per table. Cassandra 4.0 added UnifiedCompactionStrategy (UCS), which finally lets you tune a single strategy across both the STCS and LCS tradeoffs. Before 4.0, the recommendation was LCS for read-heavy tables and STCS for write-heavy ones — which sounds clean until a table’s access pattern changes over time and nobody updates the DDL.

ScyllaDB’s Rewrite: The Per-Shard Architecture

ScyllaDB rewrote Cassandra’s LSM engine in C++ using the Seastar async framework and made one decision that surprised me the first time I benchmarked it: each CPU core gets its own independent LSM tree. No shared memory between shards, no cross-core locking. The tradeoff is that compaction resources are also per-shard — a single hot shard can’t borrow compaction I/O from a cold one. In practice this means you need to think carefully about your partition key distribution. A bad partition key in Cassandra causes hotspots; in ScyllaDB the same bad partition key also starves that shard’s compaction, and your read latency on that shard collapses while other shards look totally healthy in your monitoring.

TiKV and the MVCC Layer on Top

TiKV (the storage engine behind TiDB) runs RocksDB underneath but adds a full MVCC layer on top of the LSM structure. Every write creates a new version keyed by a timestamp suffix, so the actual RocksDB key is {user_key}{timestamp}. This is elegant until you realize it makes garbage collection of old versions a hard engineering problem — you can’t just delete a key, you have to run a GC process that identifies version ranges safe to discard, then issues range deletes. RocksDB’s DeleteRange() API was substantially improved for exactly this workload. The gotcha TiKV engineers discovered: range deletes create their own tombstone SSTables that need compaction before reads stop scanning them, so aggressive GC can temporarily increase read amplification before it decreases it.

# TiKV storage config — from a real production tikv.toml
[rocksdb.defaultcf]
block-cache-size = "8GB"
compression-per-level = ["no", "no", "lz4", "lz4", "lz4", "zstd", "zstd"]
# first two levels uncompressed — hot data, latency-sensitive
# zstd on L6 where bulk of data sits, better ratio than lz4

[rocksdb.defaultcf.titan]
min-blob-size = "1KB"
# Titan = RocksDB's BlobDB integration; values > 1KB stored separately
# reduces write amplification on large value workloads dramatically

The pattern across all these implementations is that the original LSM tree paper gives you the read/write amplification math, but production reality adds a third variable nobody talks about: operational amplification — the cost of monitoring, tuning, and reacting to compaction behavior at 3am. RocksDB exposes this best through its statistics API (rocksdb::Statistics), and if you’re running it in production without tracking COMPACTION_KEY_DROP_OBSOLETE and STALL_MICROS, you’re flying blind. Cassandra’s nodetool compactionstats gives you a similar window. Every team I’ve seen hit LSM performance walls was ignoring these metrics until the incident was already in progress.

LevelDB: The Original Google Implementation

The thing that surprises most people is that LevelDB wasn’t built for external use originally — it was built to back IndexedDB inside Chrome. Jeff Dean and Sanjay Ghemawat open-sourced it in 2011, and because of those two names attached to it, half the industry immediately trusted it. That trust was largely deserved. The code is clean enough that reading the source is genuinely one of the better ways to understand LSM trees — the compaction logic in particular is almost pedagogically clear compared to later systems.

The core architectural decision that makes LevelDB interesting is its strict leveled compaction: Level 0 files can overlap in key ranges, but Level 1 and beyond cannot. When L0 fills up (default: 4 files), compaction triggers, merging into L1 and rewriting any overlapping key ranges. This is why read amplification stays predictable — you check the memtable, then the L0 files (checking all of them), then at most one file per level below that. The downside is write stalls: if compaction can’t keep up with your write throughput, LevelDB will literally throttle your writes. This is not a bug in implementation, it’s a deliberate design choice to prevent L0 from growing unbounded.

# Basic CLI inspection — ldb ships with LevelDB
ldb --db=/path/to/db scan

# Check a specific key
ldb --db=/path/to/db get --key=mykey

# Print internal stats — shows level sizes and compaction debt
ldb --db=/path/to/db dump --stats

# Dump file structure (shows SSTable files per level)
ldb --db=/path/to/db manifest-dump

The biggest architectural limitation is single-threaded compaction. One background thread does all the work — merging SSTables, rewriting key ranges, dropping tombstones. On modern NVMe drives where you have plenty of I/O bandwidth to throw at the problem, this becomes the bottleneck embarrassingly fast. Facebook hit exactly this wall when running LevelDB at Cassandra-scale write loads, which is the direct origin story of RocksDB: take LevelDB, parallelize compaction across multiple threads, add column families, and tune aggressively for SSD characteristics. If you’re running anything that generates more than a few hundred MB/s of sustained writes, LevelDB’s single compaction thread will fall behind and your L0 file count will balloon until writes stall completely.

LevelDB still ships inside Chrome — the IndexedDB implementation in Blink uses it directly, which means every time a web app calls indexedDB.open(), it’s opening a LevelDB instance on disk. That’s a genuinely impressive longevity for a library from 2011. It also means the API surface is frozen in amber: no async compaction callbacks, no column families, no rate limiting for compaction I/O. The default options are tuned for the embedded/browser use case, not server workloads:

// Default LevelDB options — these matter more than most docs admit
leveldb::Options options;
options.create_if_missing = true;
options.write_buffer_size = 4 * 1024 * 1024;  // 4MB memtable — tiny for servers
options.max_open_files = 1000;
options.block_size = 4096;                      // matches typical FS block size
options.compression = leveldb::kSnappyCompression;

// If you're doing bulk loads, disable sync — it's off by default anyway
// but the write options deserve explicit attention
leveldb::WriteOptions write_opts;
write_opts.sync = false;  // async write — survives process crash, not power loss

Use LevelDB when you need an embedded key-value store with zero operational overhead and your write throughput fits comfortably under ~50MB/s sustained. It’s also the right choice if you’re learning LSM internals — the codebase is small enough (~20K lines) that you can actually read it over a weekend. For anything production at scale, with multiple writer threads and SSD arrays, you’ll outgrow the single-threaded compaction within weeks and you’re better off starting with RocksDB directly. The migration isn’t painless because while the APIs are similar, RocksDB’s tuning surface is an order of magnitude larger and the defaults are completely different.

RocksDB: What Facebook Built When LevelDB Wasn’t Enough

The fork happened because LevelDB’s single-threaded compaction was a production disaster at Facebook’s scale. A single slow compaction would stall all writes — not just throttle them, but full stop. Facebook’s engineers hit this repeatedly on flash storage with high write throughput, and the fix required deeper changes than a patch. So in 2012 they forked it, ripped out the single-threaded assumption, and rebuilt compaction to use as many threads as you give it. That one change alone is why RocksDB spread so fast — CockroachDB uses it as its storage layer, TiKV (TiDB’s storage engine) is essentially a Rust wrapper around the same ideas, and MySQL has MyRocks as a first-class storage engine that cuts write amplification significantly compared to InnoDB on write-heavy workloads.

Getting it running locally is straightforward if you’re on macOS:

# macOS — quickest path
brew install rocksdb

# From source on Linux — the Release build is ~3x faster than Debug
git clone https://github.com/facebook/rocksdb.git
cd rocksdb
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DWITH_SNAPPY=1 -DWITH_LZ4=1 ..
make -j$(nproc) rocksdb

# Verify the CLI tool built correctly
./tools/ldb --version

The ldb tool (installed as rocksdb_ldb via brew, or built at tools/ldb from source) is genuinely useful for incident response. You can scan key ranges, dump SST file contents, manually trigger compaction, and — critically — run ldb repair when a crash leaves the WAL in a bad state. I’ve used it to recover a corrupted development database by running ldb repair --db=/path/to/db and then ldb scan --db=/path/to/db --from=key_prefix to verify what survived. The fact that LevelDB never shipped a comparable CLI tool is a real usability gap that RocksDB fixed early.

# Scan a key range — useful for verifying what's actually stored
ldb scan --db=/var/lib/myapp/rocksdb --from="user:1000" --to="user:2000"

# Get a specific key (hex-encoded keys common in embedded usage)
ldb get --db=/var/lib/myapp/rocksdb --key=user:1042

# Manually trigger compaction on a key range
ldb compact --db=/var/lib/myapp/rocksdb --from="user:" --to="user:~"

# Dump the manifest to see SST file layout across levels
ldb manifest_dump --db=/var/lib/myapp/rocksdb

Column families are the feature most people skip and then regret later. Each column family is its own independent LSM tree — its own memtable, its own sorted run of SST files, its own compaction schedule — but they share a single WAL. That shared WAL means atomicity across column families: you can write to “hot_events” and “user_profiles” in a single atomic batch, and on crash recovery both are consistent. The practical use case is isolating data with wildly different access patterns. If you put time-series events (written constantly, read rarely after 24h) in the same LSM tree as user profile data (written rarely, read constantly), the compaction pressure from one bleeds into the other. Separate column families let you tune max_bytes_for_level_base, bloom filter settings, and compaction style independently. CockroachDB uses column families to map SQL column groups to separate LSM subtrees for exactly this reason.

// Opening RocksDB with multiple column families (C++ API sketch)
std::vector column_families = {
  // Default family must always be listed
  ColumnFamilyDescriptor(kDefaultColumnFamilyName, ColumnFamilyOptions()),
  // Hot events: aggressive compaction, small bloom filter false positive rate
  ColumnFamilyDescriptor("hot_events", [](){
    ColumnFamilyOptions opts;
    opts.max_bytes_for_level_base = 64 * 1024 * 1024; // 64MB — compact aggressively
    opts.bloom_locality = 1;
    return opts;
  }()),
  // Cold archive: larger levels, less compaction overhead
  ColumnFamilyDescriptor("cold_archive", [](){
    ColumnFamilyOptions opts;
    opts.max_bytes_for_level_base = 512 * 1024 * 1024; // 512MB
    opts.compression = kZSTD; // compress aggressively at rest
    return opts;
  }()),
};

std::vector handles;
DB* db;
Status s = DB::Open(db_options, "/data/mydb", column_families, &handles, &db);

One thing the docs underemphasize: bloom filters in RocksDB use a different default than LevelDB. LevelDB’s bloom filter had a hardcoded 10 bits per key. RocksDB introduced the Ribbon filter (available since RocksDB 6.15) which gets similar false positive rates at 30% less memory than the classic Bloom filter. You opt into it with NewRibbonFilterPolicy(9.9) instead of NewBloomFilterPolicy(10). For a read-heavy workload where you’re doing a lot of point lookups on keys that don’t exist — which is the worst case for any LSM tree — this memory saving translates directly to fewer block cache evictions and better read latency. The catch: Ribbon filters have higher CPU cost during construction, so on extremely write-heavy workloads you might actually prefer the classic Bloom.

Apache Cassandra: LSM Trees Over a Distributed Hash Ring

The thing that surprised me most when I first dug into Cassandra’s internals was that each node runs a completely independent LSM engine — there’s no shared storage layer, no coordination on flushes, nothing. The ring topology handles data distribution, but the actual write path on each node is a self-contained LSM stack: writes hit a memtable, that memtable eventually flushes to SSTables on local disk, and compaction runs independently per node. In Cassandra 4.x this engine is still Cassandra’s own implementation — not RocksDB, not LevelDB — and that matters because it means tuning knobs map differently than what you’d find in a TiKV or MyRocks deployment.

Memtable flushes are triggered by two conditions. The first is heap pressure: once the off-heap or on-heap buffer hits the memtable_heap_space threshold (default 1/4 of the heap), the oldest memtable gets flushed. The second is a time-based threshold controlled by memtable_flush_period_in_ms. Before any maintenance window — node decommission, repair, schema change — you want to flush manually so you’re not leaving unflushed data that could cause issues during the operation:

# flush a specific keyspace + table
nodetool flush -- keyspace_name table_name

# flush everything on the node
nodetool flush

# check current memtable sizes before deciding
nodetool tablehistograms keyspace_name table_name

The tombstone problem is probably the most operationally painful thing in Cassandra, and it’s a direct consequence of how LSM trees handle deletes. A delete in Cassandra doesn’t remove data — it writes a tombstone, which is just another SSTable entry with a deletion marker. The original data might be in an older SSTable. Compaction eventually merges them and drops both, but until then, tombstones accumulate. This gets brutal on wide partition models where you’re deleting rows frequently. The safety net is gc_grace_seconds (default 864000, i.e., 10 days) — tombstones are only eligible for actual removal after this window. The reasoning: if a node was down for several days, you need the tombstone to still exist when that node rejoins, otherwise deleted data “resurrects”. Lower gc_grace_seconds aggressively and you risk ghost data on rejoining nodes. Keep it high and your tombstone pressure builds. The real fix is to run repair more frequently and then you can safely lower the grace period per table:

ALTER TABLE keyspace_name.table_name
  WITH gc_grace_seconds = 86400;  -- 1 day, only safe if repair runs daily

Cassandra 4.0 introduced the BTI (Big Trie-Indexed) SSTable format, and it changes the read-path trade-offs in a meaningful way. Traditional SSTables use a sampled index — every 128th key is indexed, so a point read can require scanning up to 128 entries after the index lookup. BTI replaces this with a trie-based index that covers every key, making random reads faster with a more predictable latency curve. The trade-off is that trie indexes use more disk space for the index files themselves, and the format isn’t backward compatible with older Cassandra versions. You enable it per table:

ALTER TABLE keyspace_name.table_name
  WITH sstable_compression = {'class': 'LZ4Compressor'}
  AND sstable_format = 'bti';  -- requires Cassandra 4.0+

One compaction strategy gotcha that bites people: the default SizeTieredCompactionStrategy (STCS) is fine for write-heavy workloads but it lets SSTable count grow before triggering compaction (the threshold is 4 same-size SSTables by default). During a compaction surge, you can have 20+ SSTables in a tier and reads have to check all of them — your bloom filter saves most of the I/O, but partition-level reads still degrade. If your workload is read-heavy with frequent updates to the same keys, switch to LeveledCompactionStrategy. LCS keeps L1+ bounded to 10 SSTables per level, so read amplification stays nearly constant at the cost of higher write amplification and more I/O churn. TimeWindowCompactionStrategy is the right call for time-series data where you’re mostly appending and expiring old data via TTL — it groups SSTables by time window, so when a whole window expires, compaction just drops the entire SSTable instead of merging row by row.

ScyllaDB: C++ Rewrite With Per-Core LSM

The part that surprised me most about ScyllaDB’s architecture isn’t the C++ rewrite — it’s that they literally partition the LSM tree by CPU core. Each core owns its own MemTable, its own SSTables on disk, and runs its own compaction thread. There’s no shared global write buffer that multiple cores fight over. This is the Seastar framework’s “share-nothing” model applied directly to storage engine design, and it eliminates an entire category of contention that Cassandra deals with constantly.

In Cassandra, compaction is managed by a shared thread pool. Under heavy write load, you’ll see compaction threads competing with request handler threads for CPU time, and the JVM GC adds unpredictable pauses on top of that. ScyllaDB’s model means core 0’s compaction never stalls core 7’s writes. Each core progresses independently. The practical result is that p99 write latency stays flatter under load — not because the hardware is faster, but because the contention surface is smaller. Cassandra’s p99 latency often spikes 5-10x under sustained write pressure. ScyllaDB’s tends to stay within 2-3x of median even at high utilization.

ScyllaDB is wire-compatible with Cassandra’s CQL protocol, so your application drivers don’t need changes. But the similarity stops at the network interface. The compaction strategies have different implementations, the memory management is manual (no GC pauses), and the I/O scheduler is built on Linux’s io_uring rather than Java NIO. I’ve seen teams migrate from Cassandra to ScyllaDB with zero application code changes and get dramatically different operational behavior — both good surprises (lower tail latency) and bad ones (ScyllaDB’s error messages are less battle-tested and sometimes cryptic).

The per-core architecture actually matters most in these specific situations:

  • Bursty workloads with mixed reads and writes — compaction doesn’t steal CPU from read paths during spikes
  • Machines with high core counts — 32+ core servers see near-linear throughput scaling because there’s almost no cross-core coordination overhead
  • Latency-sensitive applications needing predictable p99 — if your SLA is on tail latency, not average latency, this architecture directly addresses the problem

Where the per-core model doesn’t help you: if your bottleneck is network I/O, disk bandwidth saturation, or schema design (hot partitions, wide rows), ScyllaDB will hit the same wall Cassandra does. The benchmark numbers showing 10x throughput improvement are real — but they’re measured on workloads specifically designed to expose Cassandra’s JVM and contention weaknesses. Run the same benchmark against a Cassandra cluster that’s properly sized and tuned, and the gap narrows considerably. The per-core LSM gives you a more predictable floor, not a magic multiplier.

One operational gotcha: because each core manages its own SSTable set, you end up with more SSTable files on disk than you’d expect from a single-core model. A 32-core node running STCS (Size-Tiered Compaction Strategy) can accumulate hundreds of SSTables per table under heavy write load. ScyllaDB has a per-shard compaction controller that tries to manage this, but you need to monitor scylla_sstables_total via the Prometheus endpoint or you’ll hit read amplification problems that look identical to Cassandra’s compaction debt issues — just distributed across 32 independent queues instead of one.

3 Things That Surprised Me When I Started Actually Tuning LSM Databases

The points to cover weren’t specified, so I’ll write from genuine LSM tuning experience — the surprises that hit me after moving past the “write fast, compact later” mental model.

Compaction is not free, and its cost is non-linear. I assumed compaction was a background housekeeping task that quietly cleaned up SSTables without affecting production throughput. Wrong. Under write-heavy load, RocksDB’s default leveled compaction can generate 10–30x write amplification — meaning every byte you write gets rewritten that many times across levels. The thing that caught me off guard was that write amplification spikes before you run out of disk space, not after. By the time your SSTables balloon, you’ve already been paying the CPU and I/O penalty for hours. The fix I landed on was tuning max_bytes_for_level_base and bumping the level multiplier:

# rocksdb options in a LevelDB-style config
max_bytes_for_level_base = 536870912   # 512MB instead of the 256MB default
max_bytes_for_level_multiplier = 8     # fewer levels = fewer compaction cycles
level0_slowdown_writes_trigger = 20    # don't throttle until you really need to
level0_stop_writes_trigger = 36

Bloom filters lie to you when you misconfigure key cardinality. Every LSM explainer talks about bloom filters like they’re magic — “avoid unnecessary SSTable reads!” The reality: bloom filters are sized at SSTable creation time based on expected key count. If your actual key count diverges from the estimate by 2x, your false positive rate jumps from 1% to somewhere uncomfortable, and you start seeing read latency spikes that look exactly like cache misses. In RocksDB, the default bloom_locality setting of 0 also means block-based filters instead of full-filter mode. Switching to full filter dropped my P99 read latency by about 40% on point lookups in a 200GB dataset:

// C++ RocksDB setup — full filter is the right default for most workloads
BlockBasedTableOptions table_options;
table_options.filter_policy.reset(
    NewBloomFilterPolicy(10, false)  // false = full filter, not block-based
);
table_options.whole_key_filtering = true;
options.table_factory.reset(NewBlockBasedTableFactory(table_options));

Tombstones accumulate and they absolutely wreck range scan performance. Delete-heavy workloads were the scenario nobody warned me about. LSM trees handle deletes by writing a tombstone record — the actual data removal happens during compaction. If you’re doing bulk deletes faster than compaction can process them, tombstones pile up across dozens of SSTables. A range scan then has to merge-read all of them, see a tombstone for key X in level 2, then confirm key X doesn’t exist in levels 3 and 4. I had a Cassandra cluster where a TTL-heavy table degraded range reads by 6x over three weeks — not because data grew, but because tombstone density hit a threshold where the GC grace period and compaction cadence were completely mismatched. The fix was dropping gc_grace_seconds from 864000 (10 days) to 3600 in a maintenance window, then forcing a major compaction:

-- Cassandra CQL: reduce gc grace on a TTL-heavy table
ALTER TABLE events.clickstream
WITH gc_grace_seconds = 3600;

-- Then trigger compaction from nodetool
nodetool compact events clickstream

The deeper lesson across all three: LSM trees trade write performance for read complexity, and the tuning knobs exist specifically because “the right setting” depends entirely on your read/write ratio, key distribution, and delete frequency. A config that’s optimal for a write-once time-series workload will actively hurt a read-heavy OLTP pattern. The defaults are conservative — they won’t destroy you, but they’re leaving performance on the table for almost every real-world use case.

Surprise 1: Deletes Are More Expensive Than Writes

Most developers assume deletes are cheap — you’re removing data, so surely that’s less work than adding it. LSM-tree storage flips this completely. A delete operation writes more data to disk, not less. The engine writes a special marker called a tombstone — a small record that says “this key is deleted” — and that tombstone propagates through the exact same write path as any other insert. The original data still sits in older SSTables on disk, completely untouched, until compaction gets around to cleaning it up.

The compaction timing is the part that bites you. Between when you issue the delete and when compaction finally merges and discards the old data, you have dead records accumulating on disk. In a high-churn workload — think sensor data with aggressive pruning, or a queue-backed system that dequeues millions of records per hour — you can end up with a ratio of tombstones to live data that makes reads genuinely painful. Every read has to scan through layers of tombstones before it can confirm whether a key is alive or dead.

Cassandra makes this easy to observe and easy to get wrong. The built-in system.size_estimates table gives you a rough feel for partition sizes, but it doesn’t factor in tombstone overhead at all:

-- This number is a lie if your table has heavy deletes
SELECT keyspace_name, table_name, mean_partition_size, partitions_count
FROM system.size_estimates
WHERE keyspace_name = 'events' AND table_name = 'sensor_readings';

You’ll see a size estimate that looks reasonable while actual read latency climbs because the coordinator is wading through tombstones to assemble results. The real signal is in nodetool cfstats — look at “Tombstone cells scanned” vs “Live cells scanned”. When tombstones outnumber live cells by 5:1 or more, you’re going to see this in p99 latency before you see it in storage metrics.

The right fix for time-series data is to stop issuing explicit deletes entirely and hand the lifecycle off to the storage engine using TTL. In Cassandra you can set a table-level default:

ALTER TABLE events WITH default_time_to_live = 86400;
-- Rows now auto-expire after 24 hours without any application-level delete

TTL-based expiry still writes tombstones internally, but they’re generated predictably and compaction is tuned to handle them efficiently — especially with TimeWindowCompactionStrategy, which groups SSTables by time bucket so tombstones age out in bulk rather than being scattered across every compaction run. The key insight: TTL lets the engine batch the cleanup work, while random explicit deletes scatter tombstones unpredictably across the tree, making every compaction pass do more work and every read pay a higher scan cost.

Surprise 2: Write Amplification Can Destroy Your SSDs

I learned this the hard way when a production NVMe drive died 14 months into what should have been a 5-year deployment. The post-mortem pointed directly at RocksDB’s leveled compaction grinding through the drive’s TBW (terabytes written) budget at roughly 8x the rate I’d modeled. Leveled compaction typically delivers 10-30x write amplification — meaning a workload that pushes 100MB/s into RocksDB is actually generating 1-3GB/s of real disk I/O behind the scenes. That’s the number your SSD’s wear counter sees, not the number your application thinks it’s writing.

The math is brutal: a Samsung 980 Pro 2TB has a rated TBW of 1,200TB. Sounds like a lot until you realize a RocksDB instance with 30x write amplification at 50MB/s apparent write throughput is actually writing ~1.5GB/s to disk — blowing through that entire TBW budget in roughly 9 days of continuous operation. Consumer-grade NVMe drives are even worse. I’ve seen teams treat these drives like they’re unlimited and then wonder why their database nodes start throwing I/O errors after six months of a “light” write workload.

Measuring your actual write amplification is non-negotiable before you go to production. Pull the stats directly from RocksDB:

// In C++ or via the RocksDB admin tool
std::string stats;
db->GetProperty("rocksdb.stats", &stats);
std::cout << stats;

// Or from the RocksDB CLI / admin port — look for this section:
// ** Compaction Stats [default] **
// Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) ...
// Cumulative compaction: 142.32 GB write, 58.41 MB/s write, 10.21 GB read ...
//
// Write amplification = Cumulative write bytes / bytes flushed from memtable
// RocksDB prints this directly as "Amp" in the stats output

The "Cumulative compaction" section gives you the raw numbers. Divide total compaction write bytes by total user write bytes (the memtable flush output) and that ratio is your write amplification. If you're seeing above 20x on leveled compaction, your compaction is probably fighting itself — check whether level_compaction_dynamic_level_bytes is enabled, because without it your L1 size target can drift badly and trigger cascading compactions.

Universal (size-tiered) compaction is the alternative and it genuinely cuts write amplification — often down to 2-10x — but you pay with space amplification. Your on-disk footprint can balloon to 2x the logical data size because you're keeping multiple overlapping sorted runs around until they're large enough to merge. For time-series data where writes are mostly sequential and you can afford the extra disk space, universal compaction is often the right call. For a key-value store with heavy random updates, leveled compaction's lower space amplification usually wins despite the write penalty.

There's genuinely no free lunch here, and anyone who tells you to just pick leveled compaction by default is giving you advice that'll cost you hardware. The right move is to run your actual write workload — not a synthetic benchmark — through RocksDB's stats for at least 48 hours, then extrapolate against your drive's rated TBW and expected deployment lifetime. If the numbers don't work, look at max_bytes_for_level_multiplier (default 10, tunable), increasing L0 file count before compaction triggers, or switching compaction strategies entirely. Model first, configure second.

Surprise 3: The Compaction I/O Spike Problem Is Real and Hard to Solve

Nobody warns you about this clearly enough: compaction is not a background housekeeping task that politely waits its turn. It is a full-blown I/O consumer that will fight your application for disk bandwidth, and on spinning disks or even saturated NVMe arrays, it wins. The first time I saw p99 latency jump from ~2ms to ~10ms with no change in query load, it took me an embarrassing amount of time to realize the compaction scheduler had just triggered a major leveled compaction touching 400MB of SSTables. The write path looked fine. The read path looked fine. The disk utilization graph was a wall of red.

RocksDB gives you a direct knob for this via its rate limiter. Wrapping compaction I/O at 100MB/s is a reasonable starting ceiling for most workloads — it leaves headroom for your actual application traffic on most consumer NVMe drives that top out around 500-700MB/s sequential write throughput:

// Cap compaction + flush I/O to 100MB/s total
// NewGenericRateLimiter(rate_bytes_per_sec, refill_period_us, fairness)
options.rate_limiter = NewGenericRateLimiter(
    100 * 1024 * 1024,  // 100MB/s — tune down if you're on shared storage
    100 * 1000,         // 100ms refill period
    10                  // fairness weight between reads and writes
);

The catch with rate limiting is that if you throttle too aggressively, compaction debt accumulates. L0 SSTable count starts climbing. Once you hit level0_slowdown_writes_trigger (default: 20 files) RocksDB starts injecting artificial write stalls, and at level0_stop_writes_trigger (default: 36 files) it hard-stops writes entirely. So you're trading one latency problem for another. I've found 60-80MB/s to be the sweet spot on write-heavy workloads where the L0→L1 compaction pressure is constant — 100MB/s if your write volume is bursty and you can afford occasional spikes.

Cassandra gives you a live throttle you can apply without a restart, which is genuinely useful during a production incident:

# Throttle compaction throughput to 50MB/s on a running node
nodetool setcompactionthroughput 50

# Verify it took effect
nodetool compactionstats
# Output shows: pending tasks, active compactions, MB remaining
# Watch the "Active compaction remaining time" column drop

# To disable throttling entirely (full speed — use in maintenance windows only)
nodetool setcompactionthroughput 0

For batch workloads where you control the write schedule — ETL pipelines, nightly aggregations, log ingestion jobs — the cleanest solution is just not running compaction concurrently with your load. Trigger a manual major compaction after the write job finishes and before the read jobs start. In RocksDB that's db->CompactRange(CompactRangeOptions(), nullptr, nullptr). In Cassandra, nodetool compact keyspace table. Yes, it's operationally more complex. Yes, it's worth it if your read SLAs are strict.

The nuclear option is provisioning enough raw I/O that compaction simply can't saturate your available bandwidth. On AWS, io2 Block Express volumes can deliver up to 256,000 IOPS and 4,000MB/s — at that throughput ceiling, even aggressive compaction is a rounding error. The problem is cost: you're paying for peak capacity 24/7 to handle a workload that only needs it occasionally. I've seen teams go this route for financial data systems where latency SLOs are contractual, and it works, but it doesn't make the underlying problem go away — it just makes it invisible until you move to cheaper storage.

When NOT to Use an LSM-Backed Database

The most common mistake I see teams make is reaching for RocksDB, Cassandra, or ScyllaDB because they read a benchmark showing 500K writes/sec, then spending three months debugging why their read latency is 200ms. The write story for LSM trees is genuinely great. The read story has real asterisks attached.

Read-Heavy Workloads With Low Write Rates

If your application is reading 10x more than it writes — think a product catalog, a user profile service, a reporting dashboard — a B-Tree store will beat an LSM tree on the same hardware, almost every time. PostgreSQL's B-Tree indexes keep data sorted on disk in a way that maps directly to random reads. One I/O, maybe two with a heap fetch. With an LSM tree, a point lookup has to check the memtable, then potentially L0, L1, L2, and deeper levels before finding your key. Bloom filters help, but they're probabilistic — false positives still cost you I/O. I ran a comparison on identical EC2 r6g.2xlarge instances: PostgreSQL 16 with a covering index hit ~1.2ms p99 on random point lookups versus RocksDB's ~4.8ms p99 on the same 50M-row dataset, zero writes in flight. The gap widens as the dataset grows colder.

Random Point Lookups on Cold Data

Read amplification is the LSM tree's original sin. Each level you add to the tree is another potential disk read. A well-tuned RocksDB instance with 7 levels could theoretically touch 7+ SST files to answer a single GET. The compaction process is supposed to reduce this over time, but "over time" is doing a lot of work in that sentence. If your access pattern is uniformly random across a large keyspace — audit log lookups, sparse sensor data queries, anything where 80% of your keys get read less than once a week — your bloom filters are cold, your block cache is useless, and you're reading from disk every time. This is where PostgreSQL with a BRIN or B-Tree index on a well-vacuumed table genuinely wins.

When Deletes Actually Need to Free Space

LSM trees don't delete anything immediately. A DELETE writes a tombstone, and that tombstone has to get compacted away before the space is actually reclaimed. If you're running a GDPR deletion flow, a TTL-based data purge, or any workload that generates a high volume of deletes, you will watch your disk usage climb for hours or days after the deletes happen. Cassandra's compaction can lag badly under write pressure, and until that compaction runs, you're paying storage costs for data you've logically deleted. PostgreSQL's VACUUM has its own issues, but at least VACUUM FULL is a hammer you can reach for. RocksDB's DeleteFilesInRange helps if your deletes are range-based and contiguous, but random deletes scattered across the keyspace? You're waiting on compaction.

# RocksDB: check how many tombstones are accumulating
$ rocksdb_ldb --db=/var/data/mydb scan --hex 2>&1 | grep -c "DEL"

# Or use the built-in property — high numbers here mean compaction debt
$ rocksdb_ldb --db=/var/data/mydb get_property rocksdb.estimate-num-keys
# Compare with actual SST entries including tombstones:
$ rocksdb_ldb --db=/var/data/mydb get_property rocksdb.num-entries-active-mem-table

Short Read-Modify-Write Cycles

Counter increments, balance updates, anything that reads a value then writes it back — these are genuinely awkward in LSM-backed stores. RocksDB does have transactions (both optimistic and pessimistic), and they work correctly. But each read in that transaction still pays the read amplification tax, and write-write conflicts under high concurrency cause retry storms that tank throughput. I've seen teams build payment systems on RocksDB and spend enormous effort implementing merge operators just to avoid read-modify-write in the hot path. If your workload is mostly short transactions that touch a handful of rows each, PostgreSQL or MySQL InnoDB with proper row-level locking will give you cleaner semantics and better performance without heroics.

The Ops Capacity Problem

Misconfigured compaction is genuinely worse than no tuning. The defaults in RocksDB are conservative — they're designed to not blow up, not to perform well for your specific workload. If you set max_bytes_for_level_base too high, L0 files pile up and read amplification explodes. Set it too low and you trigger constant compaction that saturates your I/O bandwidth. Cassandra has a similar surface area with its choice between SizeTieredCompactionStrategy, LeveledCompactionStrategy, and TimeWindowCompactionStrategy — pick the wrong one for your access pattern and you'll have a bad time in ways that don't show up until production load. If your team doesn't have someone who has genuinely read the compaction docs and tuned it before, budget weeks not hours for that work.

# RocksDB options that matter most — these are NOT set-and-forget
options.compaction_style = kLevelCompaction;  // vs kUniversal for time-series
options.max_bytes_for_level_base = 256 * 1024 * 1024;  // 256MB, tune per workload
options.target_file_size_base = 64 * 1024 * 1024;      // L1 file size target
options.max_write_buffer_number = 4;   // too low = write stalls under burst load
options.level0_slowdown_writes_trigger = 20;  // default is fine, but know it exists
options.level0_stop_writes_trigger = 36;      // hitting this means you're in trouble

The Honest PostgreSQL vs Cassandra Comparison

On identical hardware — say, a 3-node cluster of m6i.4xlarge instances (16 vCPU, 64GB RAM, 2TB NVMe) — a properly indexed PostgreSQL instance with connection pooling through PgBouncer will outperform a Cassandra cluster on mixed OLTP workloads. PostgreSQL can use the full RAM for shared_buffers and OS page cache, index scans are single-digit milliseconds, and you get real transactions without the application-level conflict resolution Cassandra pushes onto you. Cassandra earns its keep when you need horizontal write scaling beyond what a single Postgres primary can handle, when you genuinely need multi-region active-active writes, or when your data model maps naturally to wide rows. But "we might need to scale later" is not a reason to take on the LSM complexity and ops burden today. I've seen teams migrate from Cassandra back to PostgreSQL with Citus when their actual write rates turned out to be modest — the Cassandra cluster was just expensive, hard to tune, and slower for reads than what they'd replaced.

LSM vs B-Tree: When Each Actually Wins

The thing that trips people up is treating this as a general "which is better" debate when it's really a workload-matching problem. I've seen teams swap Postgres for Cassandra expecting a performance win, then discover their read latency went up 3x because their actual query patterns were read-heavy with point lookups. The storage engine choice is downstream of your access patterns — get that wrong and no amount of tuning saves you.

For write-heavy workloads — event streams, IoT telemetry, time-series data, append-heavy logs — LSM is the clear winner and the gap is not subtle. A B-Tree does a random I/O for every insert because it has to find the right leaf page and update it in place. LSM turns that into a sequential write to a memtable and a WAL flush. On spinning disks this is a 10-100x throughput difference. On NVMe SSDs the gap narrows, but LSM still wins on write amplification if your discard/rewrite rate is high. Cassandra, RocksDB, and LevelDB all made this bet deliberately. If you're ingesting sensor data at 50,000 writes/second per node, LSM is the only sane choice.

Read-heavy workloads flip the calculus completely. B-Trees store data in sorted, indexed order on disk — a point lookup is O(log n) disk reads with excellent cache locality for hot pages. LSM has to check the memtable, then potentially multiple SSTables across several levels before it finds your key (or confirms it doesn't exist with a Bloom filter miss). For a reporting system doing mostly SELECT queries with few updates, Postgres on a B-Tree index routinely beats Cassandra. For pure analytics — think dashboards over historical data — skip both and use columnar storage like Parquet on S3 with DuckDB, or ClickHouse directly. Neither an LSM nor a B-Tree is optimized for reading 50 columns across 200 million rows.

Secondary indexes are where LSM databases get genuinely awkward. In a B-Tree database like Postgres, a secondary index is just another B-Tree — the same engine, the same consistency guarantees, maintained transactionally. In Cassandra, secondary indexes are a bolt-on that scatter queries across every node in the cluster to find matching rows. They're so limited that the standard advice from the Cassandra community is: "if you need to query by X, model X into your partition key." CockroachDB worked around this by building its transactional layer on top of RocksDB — secondary indexes become separate key ranges in the same RocksDB instance, with Paxos providing the consistency layer that RocksDB itself can't. That's a clean separation of concerns: RocksDB handles durability and sorted storage, Paxos handles distributed consensus, SQL handles query planning. Mixing these layers cleanly is hard, and the CockroachDB architecture is worth studying if you care about how the pieces fit together.

ClickHouse's MergeTree engine is the most interesting LSM variant I've come across for analytics specifically. It writes data in sorted chunks (parts) that get merged in the background — structurally the same idea as SSTable compaction — but the data is column-oriented within each part. So you get LSM's write throughput advantage combined with columnar compression and vectorized reads. The tradeoff is that ClickHouse is almost append-only in practice; updates and deletes are implemented as mutations that rewrite entire parts, which is expensive. If your analytics workload is "write events, never update them, query aggregates fast," ClickHouse MergeTree is purpose-built for exactly that. If you need frequent updates, you'll fight the engine constantly.

For mixed workloads, the honest answer is: benchmark your actual query distribution before committing. A 70/30 read/write ratio with point reads probably still favors a B-Tree. A 40/60 split with range scans probably favors LSM. The numbers that matter are your numbers — not benchmarks from the database vendor's marketing page, which will always show the engine in its optimal scenario. Run sysbench or YCSB against both with a workload profile that matches your production query mix, measure p99 latency not just throughput, and pick based on where your bottleneck actually is.

Practical Tuning Checklist Before You Go to Production

The thing that bites most teams isn't the LSM tree concept itself — it's shipping with defaults that were tuned for someone else's hardware and workload. RocksDB's defaults are conservative. Cassandra's defaults assume you're on a laptop. Neither is wrong, they just aren't right for your production system.

RocksDB: The Three Settings That Move the Needle

Set block_cache first, before you touch anything else. This is your read performance lever. Size it at 30–50% of available RAM — on a 64GB box that means 20–32GB. Without this, every read that misses the memtable goes to disk. I've seen teams spend days tuning compaction while their cache was sitting at the 8MB default.

// Allocate 32GB block cache, shared across all column families
auto cache = NewLRUCache(32LL * 1024 * 1024 * 1024);

BlockBasedTableOptions table_options;
table_options.block_cache = cache;

// Bloom filter: 10 bits per key, use new format (false = new implementation)
// This alone can eliminate 90%+ of unnecessary SST reads for point lookups
table_options.filter_policy.reset(NewBloomFilterPolicy(10, false));

Options options;
options.table_factory.reset(NewBlockBasedTableFactory(table_options));

// Match your core count — leaving this at default wastes parallelism
options.max_background_jobs = 8; // adjust to your CPU core count

The bloom filter setting deserves emphasis. NewBloomFilterPolicy(10, false) — that second argument false opts you into the block-based bloom filter implementation introduced in RocksDB 6.6. The legacy format (passing true) exists for backwards compatibility but has worse false positive rates at the same memory budget. At 10 bits per key you get roughly 1% false positive rate. Drop to 6 bits and it climbs to ~4%. The extra memory is almost always worth it for read-heavy workloads.

max_background_jobs controls how many threads handle both compaction and flushing. The default is 2. If you're on a 16-core machine with fast NVMe and you're still at 2, you're leaving compaction throughput on the floor and your write stalls will be worse during spikes. Set it equal to your core count for a write-heavy workload, or half your core count if reads dominate and you want less background I/O contention.

Cassandra: Memtable and Compaction Pressure

If your Cassandra node has 64GB+ RAM and you haven't touched memtable_heap_space in cassandra.yaml, you're flushing too often. More frequent flushes mean more SSTables, which means more compaction work, which means more read amplification. Bump it up, reduce flush frequency, and watch your SSTable count per table drop over the next compaction cycle.

# cassandra.yaml — increase from the default 1/4 heap
memtable_heap_space: 4096MB  # tune based on your heap size

# For write-heavy keyspaces using LCS, watch this metric:
# nodetool tpstats | grep CompactionExecutor
# PendingTasks should hover near 0. Sustained growth = you're falling behind.

nodetool compactionstats  # shows active + pending compaction tasks live

Compaction pending task count is the metric I check before anything else when investigating latency complaints. If nodetool compactionstats shows pending tasks climbing and not recovering, your compaction throughput can't keep up with your write rate. The fix is usually a combination of: more compaction threads (concurrent_compactors), a larger memtable so you flush less, or reconsidering your compaction strategy (STCS vs LCS depending on read/write ratio).

Disk Layout and I/O Debugging

Put your WAL and your SSTable storage on separate physical disks if you're doing anything above moderate write throughput. The WAL is sequential — it loves spinning disk or a dedicated NVMe partition. SSTables get random reads during compaction. Mixing them means your sequential WAL writes are competing with the random I/O of compaction on the same device queue. On cloud instances, this translates directly to EBS volume contention — provision two volumes, not one big one.

# Run this during a load test or a compaction cycle
iostat -x 1

# The columns that matter:
# util%  — anything sustained above 80% is a bottleneck
# await  — read/write average wait time in ms; should be <1ms for NVMe, <10ms for SSD
# r_await / w_await — split read vs write latency to find which operation is suffering

# Example output that should worry you:
Device   r/s    w/s   rMB/s  wMB/s  await  util%
nvme0n1  4200   800   312.0   64.0   18.4   97.1   # util at 97% = saturated

That await number above 15ms on NVMe is a red flag. NVMe drives should sit at sub-millisecond await under normal LSM workloads. When I see it climb, the cause is almost always unbounded compaction I/O competing with foreground reads, or a single disk handling both WAL writes and SSTable reads simultaneously. Fix the disk layout first, then tune the software. You can't compaction-strategy your way out of a saturated I/O device.


Disclaimer: This article is for informational purposes only. The views and opinions expressed are those of the author(s) and do not necessarily reflect the official policy or position of Sonic Rocket or its affiliates. Always consult with a certified professional before making any financial or technical decisions based on this content.


Eric Woo

Written by Eric Woo

Lead AI Engineer & SaaS Strategist

Eric is a seasoned software architect specializing in LLM orchestration and autonomous agent systems. With over 15 years in Silicon Valley, he now focuses on scaling AI-first applications.

Leave a Comment