LSM Trees: Why Your Database Writes Are Fast and Your Reads Are Lying to You

The Problem That Made Me Actually Care About Storage Engines

The thing that broke my comfortable ignorance about storage engines was a pipeline ingesting sensor telemetry — about 50,000 inserts per second into a PostgreSQL 15 cluster. The hardware wasn’t cheap: NVMe drives, 32 cores, 128GB RAM. Didn’t matter. Around 40k inserts/sec, write latency would climb from 2ms to 400ms and stay there. pg_stat_bgwriter showed checkpoint pressure. iostat -x 1 showed %util pinned at 100% on the data volume. PostgreSQL’s B-tree indexes update in-place — every insert is a random write, and at that velocity, random writes just kill you.

# What I was staring at every morning
$ iostat -x 1 /dev/nvme0n1
Device    r/s    w/s    rkB/s    wkB/s   await  %util
nvme0n1  12.4  8941.2   198.4  142058.1   48.3  100.00

Switching to Cassandra fixed the write problem immediately. Genuinely — 50k inserts/sec became a rounding error. Cassandra’s commit log and memtable design meant writes were sequential, not random. The disk stopped being the bottleneck. I felt smart for about a week. Then a teammate asked why a simple SELECT * FROM events WHERE device_id = 'abc' AND ts = 1704067200 was taking 80ms on a table with 10 minutes of data in it. I had no coherent answer. I said something about compaction. He nodded. Neither of us actually knew what that meant.

That gap — writes fast, reads need explanation — is what forced me to actually read the source material. Not blog posts, but the original 2006 Bigtable paper and O’Neil’s 1996 LSM-tree paper. The mental model I’d been operating on (“Cassandra is fast because distributed”) was embarrassingly incomplete. The write performance comes from a specific structural decision about how data is organized on disk, and that same decision is exactly why reads are more expensive and why you sometimes get stale results without realizing it. Those things are linked. You can’t understand one without the other.

The honest reason most developers never build this mental model is that the abstraction holds until it doesn't. Your ORM inserts rows, life is good. But write-heavy workloads — IoT ingestion, event sourcing, time-series data, audit logs — hit the ceiling fast, and when they do, you're debugging symptoms instead of causes. I wasted probably three days tuning PostgreSQL autovacuum settings and work_mem before accepting that the architecture was wrong for the workload, not the configuration. The storage engine is the first thing you should understand, not the last resort after everything else fails.

What specifically clicked for me was realizing that PostgreSQL’s heap-based, in-place update model and Cassandra’s LSM-tree model make opposite bets. PostgreSQL bets that reads are common and updates are scattered — it optimizes read paths and pays a write amplification cost. LSM trees bet that writes arrive in bursts and reads can tolerate some indirection — they turn random writes into sequential ones by staging data through memory before flushing to disk. Neither bet is universally correct. Matching the bet to your workload is the actual skill.

What an LSM Tree Actually Does (Without the Textbook Nonsense)

The thing that surprised me most when I first dug into LSM trees wasn’t the cleverness — it was how much of the design is just exploiting one simple fact: sequential writes on disk are an order of magnitude faster than random writes. Everything else flows from that. RocksDB, LevelDB, Cassandra’s storage engine, ClickHouse’s MergeTree — they’re all built on this same bet.

Here’s what actually happens when you write a key-value pair. The write goes into the memtable — an in-memory sorted structure, typically a red-black tree or skip list. Both give you O(log n) inserts with sorted iteration, which matters because you’ll need to dump this thing to disk in order. The write also gets appended to the Write-Ahead Log (WAL) on disk before the memtable is updated. That order matters: WAL first, memtable second. If your process crashes before the memtable flushes, the WAL is how you replay the missing writes. Skip the WAL and you get fast writes until you lose power — then you lose data, full stop.

# Simplified view of what a WAL entry looks like in RocksDB's internal format:
# [sequence_number][type][key_length][key][value_length][value]
# Type 1 = Put, Type 0 = Delete

# You can inspect WAL files with:
./ldb --db=/path/to/db dump_wal --walfile=000003.log --header
# Output snippet:
# Sequence 1112, count: 1, WriteBatch
#   PUT : 'user:4821' => '{"name":"alice","score":99}'

Once the memtable hits a size threshold (64MB by default in RocksDB, commonly raised to 128–512MB), it becomes immutable and a new memtable takes over. A background thread flushes the immutable memtable to disk as an SSTable — Sorted String Table. The key word is sorted: the data is written in key order, as one big sequential pass. No seeking around. The file is written once and never modified. That’s the immutability guarantee. Updates to an existing key don’t overwrite the old value; they write a newer entry that shadows the old one. Deletes write a tombstone marker. The old data sticks around until compaction runs.
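
A flush is conceptually just "write the memtable out in key order, once." A minimal sketch in the same toy style; flush_memtable and its JSON-lines format are made up, and real SSTables add index blocks, bloom filters, and checksums on top of the sorted data:

import json

def flush_memtable(memtable, sstable_path):
    """Write the now-immutable memtable to disk in key order, in one sequential pass."""
    with open(sstable_path, "w") as f:
        for key in sorted(memtable):          # sorted iteration is why the memtable must stay ordered
            f.write(json.dumps({"k": key, "v": memtable[key]}) + "\n")
    # This file is never modified again. A later write to the same key lands in a
    # newer memtable/SSTable and shadows this entry; a delete writes a tombstone.

flush_memtable({"user:1002": "bob", "user:1001": "alice"}, "/tmp/sst-000001.jsonl")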

The sequential write speed difference is real and measurable. On a spinning HDD, random writes land somewhere around 100–200 IOPS, while sequential writes can push 100–200 MB/s. That’s not a marketing number — it’s physics. The read/write head has to seek to a new position for every random write, and a seek takes ~8ms on a typical 7200 RPM drive. Sequential writes just stream data to wherever the head already is. On NVMe SSDs the gap narrows but doesn’t disappear: random writes still generate more write amplification due to flash page alignment, and the SSD’s FTL (Flash Translation Layer) has to do more work managing out-of-place updates. LSM trees hand the SSD a stream of large sequential writes, which the FTL handles efficiently and which reduces wear on the flash cells.

# You can observe this difference yourself with fio:
# Random write test (4K blocks, simulates B-tree in-place updates)
fio --name=randwrite --rw=randwrite --bs=4k --numjobs=1 \
    --size=1G --runtime=30 --filename=/tmp/test.dat --ioengine=libaio --direct=1

# Sequential write test (simulates SSTable flush)
fio --name=seqwrite --rw=write --bs=4M --numjobs=1 \
    --size=1G --runtime=30 --filename=/tmp/test.dat --ioengine=libaio --direct=1

# On a mid-range NVMe you'll typically see seqwrite at 5-10x the IOPS of randwrite
# On HDD the gap is more like 50-100x

The trade-off you’re accepting with all of this is read complexity. A key might live in the active memtable, the immutable memtable, or any of several SSTable files across multiple levels. Reading requires checking all of them in order, newest first. That’s why Bloom filters exist in every production LSM implementation — they let you skip SSTables that definitely don’t contain your key with a single probabilistic check. But even with Bloom filters, read amplification is higher than a B-tree’s worst case. If you benchmark RocksDB write throughput and think “this is cheating,” you’re right. The cost gets deferred to reads and to compaction, which is a background process that merges SSTables and actually evicts stale data and tombstones. More on that later.

The Write Path Step by Step

The part that surprised me most when I first dug into RocksDB internals: a “write” is actually two completely separate I/O operations happening at different speeds, on different media, for different reasons. Most explanations skip past this and just say “writes are fast.” They’re fast because the design is deliberately staged.

Step 1: The WAL Gets It First

Every write hits the Write-Ahead Log before anything else. On disk, sequentially. In RocksDB, that’s the .log file sitting in your DB directory — you’ll see files like 000003.log. Sequential writes are fast because the kernel can buffer and flush them without seeking. The WAL’s only job is durability: if the process crashes before the memtable is flushed, RocksDB replays the WAL on startup and reconstructs in-memory state. If you’re on NVMe, WAL writes are essentially “free” in terms of latency. On network-attached storage, this is where you start bleeding milliseconds.

Step 2: The Memtable Gets a Copy

After the WAL write, the data lands in the active memtable — a sorted in-memory structure (RocksDB uses a skip list by default, though you can swap in a hash skip list or vector). This is what makes reads fast for recently written data: the memtable is a live index in RAM. Writes here are just memory operations, sub-microsecond. The memtable is where your data actually “lives” until a flush happens.

Step 3: Memtable Fills Up, Becomes Immutable

When the active memtable hits write_buffer_size (default 64MB in RocksDB), it stops accepting new writes and becomes immutable. A new active memtable takes over immediately. The key config here is max_write_buffer_number, which controls how many memtables (active + immutable combined) can exist before RocksDB starts applying write stalls. Default is 2. If your flush thread can’t keep up and you hit that limit, writes block — that’s not a bug, it’s intentional back-pressure.

// RocksDB options in C++ or mapped 1:1 in rocksdb-rs, python-rocksdb, etc.
rocksdb::Options options;

// Each memtable gets 128MB before going immutable
options.write_buffer_size = 128 * 1024 * 1024;

// Allow up to 4 memtables total (1 active + 3 immutable waiting for flush)
// More headroom before write stalls hit, but more RAM used
options.max_write_buffer_number = 4;

// Merge 2 immutable memtables into a single flush
// (default is 1; raising it produces fewer, larger L0 files per flush)
options.min_write_buffer_number_to_merge = 2;

The gotcha with max_write_buffer_number: raising it buys you headroom against write stalls, but your worst-case memory usage scales with it. At 128MB × 4, you’re committing 512MB to just the write buffer layer before flushing has even started. On a write-heavy workload, I’ve seen people triple this trying to fix stalls, then wonder why their RSS is through the roof.

Step 4: Background Flush to Level 0

A dedicated background thread picks up immutable memtables and flushes them as SSTable files into Level 0. Each flush produces one SSTable — a sorted, immutable file on disk. Level 0 is the only level where files can have overlapping key ranges, which is why reads at L0 are more expensive (the read path has to check every L0 file). The flush itself is sequential I/O, so it’s fast, but it does compete with compaction for disk bandwidth. If you’re on a single spinning disk and doing heavy writes, this is where contention actually shows up.

You can watch this in real time without instrumenting your app. RocksDB exposes internal state through properties:

# Using the rocksdb CLI tool (ldb) to check live state
ldb --db=/path/to/your/db get_property rocksdb.num-immutable-mem-table

# Or from within your application (C++ example, but the property name is identical
# in every language binding)
std::string value;
db->GetProperty("rocksdb.num-immutable-mem-table", &value);
// value == "0" means flush is keeping up
// value > 1 consistently means your flush thread is falling behind

# Other useful properties to watch alongside it:
# rocksdb.mem-table-flush-pending  — "1" if a flush is queued
# rocksdb.num-running-flushes      — how many flush threads are active right now
# rocksdb.estimate-pending-compaction-bytes — how far behind compaction is

If rocksdb.num-immutable-mem-table is sitting at 2 or 3 consistently during normal load, you’re already flirting with write stalls. Either your flush disk is too slow, or you need to bump max_background_flushes (default 1 in older RocksDB versions — set it to 2 on any serious workload). The write path only feels “automatic” until you push it hard enough to see the seams.

The Read Path — Why It’s More Expensive Than You Think

The surprising thing about LSM read paths is that the cost isn’t linear with data size — it’s linear with the number of SSTables you’ve accumulated. I’ve seen databases with 500MB of total data have worse read latency than ones with 5GB, purely because compaction wasn’t keeping up and the read path was checking 30+ files per query.

The lookup order is strict: memtable first, then any immutable memtables waiting to flush, then SSTables from newest to oldest. The “newest to oldest” part matters because it enforces correctness — a more recent write always shadows an older one. But it also means you can’t short-circuit without help. Every layer is a potential stop on the tour. A recently deleted key at least stops the search at its tombstone; a key that never existed completes the entire tour. That’s the read amplification problem in its worst form: a point lookup for a missing key hits every single SSTable on disk.
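
Spelled out as code, the lookup discipline looks roughly like this. It is a sketch, not any engine's real read path; lsm_get, TOMBSTONE, and the dicts standing in for files are all hypothetical:

TOMBSTONE = object()   # stand-in for a delete marker

def lsm_get(key, memtable, immutable_memtables, sstables):
    """Check layers newest-first; the first hit, value or tombstone, wins."""
    for layer in [memtable, *immutable_memtables]:
        if key in layer:
            v = layer[key]
            return None if v is TOMBSTONE else v
    for sst in sstables:                      # ordered newest -> oldest
        if key in sst:                        # a real engine asks the bloom filter before touching disk
            v = sst[key]
            return None if v is TOMBSTONE else v
    return None                               # miss confirmed only after touching every layer

# A deleted key stops at its tombstone; a key that never existed tours everything:
print(lsm_get("user:7", {"user:7": TOMBSTONE}, [], [{"user:7": "old"}]))       # None, 1 layer checked
print(lsm_get("missing", {"a": 1}, [{"b": 2}], [{"c": 3}, {"d": 4}]))          # None, every layer checked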

Bloom filters are what make this survivable. Each SSTable carries a bloom filter that answers “is this key definitely not in here?” with high confidence. The filter can have false positives (it says maybe when the answer is no) but never false negatives. So the read path becomes: ask the bloom filter, and if it says no, skip the SSTable entirely. In RocksDB, you control this with the bits-per-key argument to NewBloomFilterPolicy, wired in through BlockBasedTableOptions:

// RocksDB C++ — higher bits_per_key = lower false positive rate,
// but more memory for the filter. 10 is the standard default.
rocksdb::BlockBasedTableOptions table_options;
table_options.filter_policy.reset(
  rocksdb::NewBloomFilterPolicy(10, false)
);
options.table_factory.reset(
  rocksdb::NewBlockBasedTableFactory(table_options)
);

At 10 bits per key, the false positive rate sits around 1%. Going to 16 bits drops it to roughly 0.1% but your block cache starts competing with filter memory. The tradeoff is real — don’t just bump it without watching RSS. For a workload heavy on point lookups for potentially missing keys (think cache-miss patterns, existence checks), I’d go to 12–14. For mostly-present key lookups, 10 is fine.
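
The 10-bits figure of roughly 1% is not arbitrary; it falls out of the standard bloom filter formula. A quick sanity check with the idealized math (RocksDB's actual filter implementation lands close to, but not exactly on, these numbers):

import math

def bloom_fp_rate(bits_per_key):
    # With m/n bits per key and the optimal hash count k = (m/n) * ln 2,
    # the false positive rate is (1 - e^(-k / (m/n)))^k, which simplifies to 0.5^k.
    k = bits_per_key * math.log(2)
    return (1 - math.exp(-k / bits_per_key)) ** k

for bits in (10, 12, 14, 16):
    print(f"{bits} bits/key -> ~{bloom_fp_rate(bits):.3%} false positives")
# 10 -> ~0.8%, 12 -> ~0.3%, 14 -> ~0.1%, 16 -> ~0.05%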

The write amplification vs read amplification tradeoff is the core tension you’re always managing. Aggressive compaction (like RocksDB’s leveled strategy) funnels data down into fewer, larger SSTables — so reads touch fewer files. But to get there, the same data gets rewritten 10–30x. STCS (size-tiered compaction, what Cassandra defaults to) writes less but lets SSTables pile up, which punishes reads. There’s no free lunch. Your workload ratio — mostly writes vs mostly reads — should determine which side you accept pain on.

The practical debugging signal I reach for first when p99 read latency starts climbing is the level-0 file count. Level-0 is where freshly flushed SSTables land before compaction moves them down, and unlike other levels, reads in level-0 have to check all files because their key ranges overlap. When that number climbs, you feel it immediately in tail latency:

# Check L0 file count at runtime — if this is above 20, you have a problem
$ rocksdb_ldb --db=/path/to/db get_property rocksdb.num-files-at-level0

# Or via the RocksDB C++ API at runtime:
std::string value;
db->GetProperty("rocksdb.num-files-at-level0", &value);
// Also useful: "rocksdb.stats" dumps the full compaction status table

RocksDB triggers a write stall at L0 by default when you hit 20 files (level0_slowdown_writes_trigger) and a hard stop at 36 (level0_stop_writes_trigger). But your read latency will degrade well before the write stall kicks in — usually somewhere around 8–12 L0 files depending on key distribution. Don’t wait for write stalls to tell you something is wrong. Watch the L0 count proactively and treat it as a leading indicator.

Compaction: The Thing That Keeps LSM Trees From Falling Apart

The first time I watched an LSM-based system fall over in production, compaction was the culprit. Writes looked fine, latencies were normal, and then reads started climbing — 10ms, 50ms, 400ms — until the service was basically dead. We had 200+ L0 SSTables piled up and RocksDB was fanning out reads across all of them. Without compaction running fast enough to keep pace with ingestion, LSM trees degrade into something worse than a naive append-only log.

The mental model that helped me: compaction is the garbage collector of the LSM world. Without it, every read has to check more and more SSTables for the latest version of a key, bloom filters start costing real memory, and space amplification balloons because deleted or overwritten data just sits there in old SSTables. The write path stays fast — that’s the whole point — but you’re borrowing against future read performance and disk space.

Leveled vs Size-Tiered: Pick Your Poison

Leveled compaction (LevelDB’s default, also the default in RocksDB) organizes SSTables into levels where each level is roughly 10x the size of the one above it. L1 might be 256MB, L2 2.5GB, L3 25GB. When L0 accumulates enough files, they get merged down into L1, and so on. The upside is bounded read amplification — you check at most one SSTable per level, so worst-case reads touch maybe 5-6 files total. The downside is write amplification. I’ve measured 10-30x write amplification on write-heavy workloads with leveled compaction, which destroys SSD endurance over time and burns I/O bandwidth.

Size-tiered compaction (Cassandra’s default) takes a different approach: it groups SSTables of similar size and merges them together. You end up with fewer merge operations and much lower write amplification — good for pure write throughput. But during a merge, you temporarily need up to 2x the space, and reads can end up scanning many same-tier SSTables because overlapping key ranges aren’t separated cleanly. If you’re running Cassandra on a time-series workload and space is tight, size-tiered will bite you. I’ve seen disk usage spike 60-70% above the actual data size during heavy compaction windows.

FIFO compaction in RocksDB is the one most people ignore and the one that’s actually perfect for the right use case: time-series data with a known retention window. Instead of merging, it just drops the oldest SSTable when total size hits the configured limit. Zero write amplification from compaction. The catch is it only works if your reads don’t need data older than the retention window and keys are roughly time-ordered. Configure it like this:

// RocksDB options for FIFO compaction
options.compaction_style = kCompactionStyleFIFO;
options.compaction_options_fifo.max_table_files_size = 10ULL * 1024 * 1024 * 1024; // 10GB
options.compaction_options_fifo.allow_compaction = false; // pure FIFO, no intra-level merges
options.ttl = 86400; // 1 day TTL, pairs well with FIFO

Checking Compaction Health Before It Becomes an Incident

Two ways I check compaction status in RocksDB. The quick one for a running process is via GetProperty:

// Check if compaction is pending
std::string value;
db->GetProperty("rocksdb.compaction-pending", &value);
// Returns "1" if compaction is pending

// Full stats dump — pipe this to a log or metrics system
db->GetProperty("rocksdb.stats", &value);
std::cout << value;

// L0 file count specifically — this is your early warning signal
db->GetProperty("rocksdb.num-files-at-level0", &value);

For offline benchmarking or when you’re trying to reproduce a compaction problem, db_bench gives you the full statistics breakdown:

# Run with statistics enabled, then inspect compaction metrics
./db_bench \
  --benchmarks=fillrandom,stats \
  --statistics=true \
  --stats_interval_seconds=10 \
  --db=/tmp/testdb \
  --num=10000000 \
  --value_size=256

# After the run, look for these in the output:
# rocksdb.compaction.times.micros
# rocksdb.l0.num.files.stall.micros  <-- this is the one that kills you
# rocksdb.write.stall

Compaction Debt Is a Real Thing and It Sneaks Up on You

The problem I saw in production: we had a batch job that would spike writes for about 45 minutes every hour. RocksDB's background compaction threads (we had 2) couldn't drain L0 fast enough. L0 file count hit the level0_slowdown_writes_trigger (default: 20 files), writes started getting throttled, and then hit level0_stop_writes_trigger (default: 36 files), and writes stopped entirely. The fix wasn't magic — we bumped max_background_compactions from 2 to 6 and tuned max_bytes_for_level_base to match our actual write rate:

options.max_background_compactions = 6;
options.max_background_flushes = 2;
options.max_bytes_for_level_base = 512 * 1024 * 1024; // 512MB instead of default 256MB
options.level0_file_num_compaction_trigger = 4;
options.level0_slowdown_writes_trigger = 20;
options.level0_stop_writes_trigger = 36;
// Give compaction threads access to more I/O
options.rate_limiter.reset(NewGenericRateLimiter(200 * 1024 * 1024)); // 200MB/s

The right compaction strategy depends entirely on your read/write ratio and whether you can tolerate space amplification. Leveled is a safe default for mixed workloads. Size-tiered wins on write-heavy pipelines where you have headroom on disk. FIFO is genuinely underrated for logs and metrics with a TTL. What you can't do is set it once and forget it — compaction debt accumulates silently and announces itself at the worst possible time.

Setting Up RocksDB and Hitting the Real Rough Edges

The thing that caught me off guard with RocksDB wasn't the LSM theory — it was that a default install quietly bleeds file descriptors until your process hits the OS limit and starts throwing cryptic IO errors. Most tutorials skip straight to the "look how fast writes are" benchmark and never mention that you need to sort out ulimits before you open your first DB instance.

Building from Source vs. the Python Binding

If you need the C++ library directly — which you will if you're embedding RocksDB in a service — build it yourself:

git clone https://github.com/facebook/rocksdb.git
cd rocksdb
# DEBUG_LEVEL=0 gives you the optimized build, not the debug one
DEBUG_LEVEL=0 make static_lib -j$(nproc)
sudo make install-static

That spits out librocksdb.a under /usr/local/lib. For most Python experimentation though, the binding is faster to get running:

sudo apt install librocksdb-dev libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev libzstd-dev
pip install python-rocksdb

The python-rocksdb package links against your system's RocksDB, so make sure the system package and the Python binding versions aren't mismatched. I've been burned by Ubuntu 22.04 shipping RocksDB 6.11 while the pip package expects 7.x — the import crashes with a symbol lookup error that gives you zero useful context.

Actual Working Code — Open, Write, Read

import rocksdb

# options.create_if_missing is required — it won't create the dir otherwise
opts = rocksdb.Options()
opts.create_if_missing = True
opts.max_open_files = -1  # let RocksDB manage its own FD pool
opts.write_buffer_size = 67108864  # 64MB memtable before flush
opts.max_write_buffer_number = 3
opts.target_file_size_base = 67108864

db = rocksdb.DB("/tmp/testdb", opts)

# Keys and values must be bytes — plain strings raise a TypeError (or crash outright in older bindings)
db.put(b"user:1001", b'{"name":"alice","plan":"pro"}')
db.put(b"user:1002", b'{"name":"bob","plan":"free"}')

val = db.get(b"user:1001")
print(val.decode("utf-8"))
# {"name":"alice","plan":"pro"}

# Batch writes — this is the pattern you actually want in production
batch = rocksdb.WriteBatch()
for i in range(1000, 2000):
    batch.put(f"event:{i}".encode(), b"payload")
db.write(batch)

One thing that trips people up: keys and values are bytes, not strings. Pass a plain Python string and you'll get a TypeError, but in older versions of the binding you'd get a segfault instead. Always encode explicitly.

The File Descriptor Exhaustion Problem

RocksDB keeps SST files open as it works through compaction levels. On a database with any real write volume, you can easily have hundreds of files open simultaneously — L0 alone can back up to 20+ files before compaction kicks in. The default Linux ulimit for open files is 1024, which sounds like a lot until RocksDB hits a busy compaction cycle and opens 300 files at once.

Fix this before you start the process, not after it's already running:

# Check current limits
ulimit -n
# 1024 — that's going to be a problem

# Set for the current shell session
ulimit -n 100000

# For a systemd service, add this to the unit file:
# [Service]
# LimitNOFILE=100000

# Permanent fix in /etc/security/limits.conf:
# * soft nofile 100000
# * hard nofile 100000

Then in your RocksDB options, set max_open_files = -1. This tells RocksDB to manage its own internal file descriptor cache rather than capping it at an arbitrary number. The alternative — setting max_open_files to a specific count — forces RocksDB to close and reopen files constantly, and you'll pay for it with read latency on cold data. The only reason to set a specific number is if you're running multiple RocksDB instances in the same process and need to divide your FD budget between them.

Monitoring: What the Built-In Stats Actually Tell You

import rocksdb
import time

opts = rocksdb.Options()
opts.create_if_missing = True
opts.max_open_files = -1
# This is the line most people miss — stats are OFF by default
opts.statistics = rocksdb.Statistics()

db = rocksdb.DB("/tmp/testdb_monitored", opts)

# Do some work...
batch = rocksdb.WriteBatch()
for i in range(10000):
    batch.put(f"key:{i}".encode(), b"x" * 512)
db.write(batch)

# Dump the stats string — it's verbose but searchable
stats = db.get_property(b"rocksdb.stats")
print(stats.decode())

# The specific counters I actually watch:
props = [
    b"rocksdb.num-files-at-level0",   # if this climbs past 20, writes are stalling
    b"rocksdb.num-files-at-level1",
    b"rocksdb.estimate-pending-compaction-bytes",  # compaction backlog
    b"rocksdb.mem-table-flush-pending",  # 1 means a flush is queued
    b"rocksdb.compaction-pending",
    b"rocksdb.estimate-num-keys",
]

for prop in props:
    val = db.get_property(prop)
    print(f"{prop.decode()}: {val.decode() if val else 'N/A'}")

The two numbers I watch religiously are rocksdb.num-files-at-level0 and rocksdb.estimate-pending-compaction-bytes. L0 file count climbing past 20 means your write rate is exceeding compaction throughput — RocksDB will start throttling inbound writes before it stalls completely, but by the time you see that throttle in latency, you're already in trouble. The pending compaction bytes tell you how far behind the background workers are. If that number is growing faster than it's shrinking, you need to either increase max_background_compactions or accept that your write rate is too high for your hardware.

How Cassandra and ScyllaDB Use LSM Differently Than RocksDB

Cassandra's SSTable Is Not RocksDB

A lot of people assume Cassandra uses RocksDB under the hood the way Kafka uses it for certain state stores, or how MyRocks is essentially MySQL bolted onto RocksDB. That's not what's happening. Cassandra has its own hand-rolled SSTable format — the on-disk version has moved from 'ma' in 3.0 through 'nb' in 4.1, with 'oa' arriving in 5.0 — and it carries a lot of Cassandra-specific metadata: partition indexes, row-level tombstone markers, bloom filters per-SSTable, and compression chunk maps. When you crack open a data directory you'll see files like:

# Cassandra 4.1 SSTable files for a single generation
nb-1-big-Data.db
nb-1-big-Index.db
nb-1-big-Filter.db       # bloom filter
nb-1-big-Statistics.db   # min/max timestamps, tombstone counts
nb-1-big-CompressionInfo.db
nb-1-big-TOC.txt

RocksDB's SSTable is a much simpler key-value store format. Cassandra's needs to encode wide rows, clustering columns, TTL expiry per cell, and deletion markers across multiple hierarchy levels. That design choice matters when you're debugging: an sstabledump of a Cassandra SSTable will show you row-level structure, whereas RocksDB's tooling is purely byte-range KV. Neither is better — they're solving different schemas.

ScyllaDB Rewrote the Engine, Kept the Protocol

ScyllaDB's whole pitch is that they kept the Cassandra Query Language wire protocol (CQL) and the SSTable format compatibility, but threw out the JVM runtime and reimplemented the storage engine in C++ using the Seastar framework. The practical consequence is a share-nothing, per-CPU-shard architecture where each shard owns its own memtable and compaction queue. You can point your existing Cassandra driver at ScyllaDB without changing a line of application code.

The per-shard compaction model is where ScyllaDB genuinely diverges under load. In Cassandra, compaction is coordinated by a shared thread pool — the default concurrent_compactors is usually 1 or 2, and under heavy write pressure, the compaction queue backs up globally. I've seen production Cassandra clusters where SSTables pile up to 200+ per partition key because compaction couldn't keep pace with ingest. ScyllaDB's shards compact independently, so a hot shard on one core doesn't block compaction on others. Under 32-core hardware, that's the difference between 32 parallel compaction workers vs. Cassandra's 2.

Tombstones: The Cassandra LSM Problem That Will Eventually Bite You

Deletes in any LSM-based system get written as markers rather than actual removals — the actual data only disappears during compaction. In Cassandra, these markers are called tombstones, and they're more granular than you'd expect: you can have cell-level tombstones, row tombstones, range tombstones, and partition tombstones. The trouble is that reads have to scan through all of them. When you query a partition with a lot of historical deletes and compaction hasn't caught up, Cassandra has to evaluate each tombstone to determine if the data beneath it is still live.
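
To see why that hurts reads specifically, picture the scan doing the bookkeeping: every tombstone has to be examined to decide whether the data beneath it is live, so the work scales with dead data rather than returned rows. A hedged sketch of that accounting, not Cassandra's actual merge path:

def scan_partition(cells, limit):
    """cells: newest-first (kind, key, value) tuples, kind is 'live' or 'tombstone'."""
    shadowed, live, tombstones_scanned = set(), [], 0
    for kind, key, value in cells:
        if kind == "tombstone":
            tombstones_scanned += 1
            shadowed.add(key)
        elif key not in shadowed:
            live.append((key, value))
            if len(live) == limit:
                break
    return live, tombstones_scanned

# 100 live rows buried under 5,000 tombstones from an earlier purge:
cells = [("tombstone", f"evt:{i}", None) for i in range(5000)] + \
        [("live", f"evt:{i}", "payload") for i in range(5000, 5100)]
rows, scanned = scan_partition(cells, limit=50)
print(f"returned {len(rows)} rows after scanning {scanned} tombstones")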

Hit enough tombstones in a single read and you get the dreaded TombstoneOverwhelmingException:

WARN  [ReadStage-1] 2024-03-15 Read 1001 live rows and 100001 tombstone cells
for query SELECT * FROM events WHERE user_id = 'abc123' LIMIT 1000
(see tombstone_warn_threshold); query aborted (see tombstone_failure_threshold)

# cassandra.yaml thresholds
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000

The usual culprit is a time-series pattern where you're deleting old events or using TTLs heavily, combined with infrequent compaction. The fix isn't just tuning those thresholds — that's just muting the smoke alarm. The actual fix is choosing the right compaction strategy (TWCS for time-series specifically, because it creates SSTables with non-overlapping time windows that compact and expire cleanly) and ensuring your TTLs are actually triggering compaction on schedule.

Manual Compaction: When nodetool compact Makes Sense

Background compaction in Cassandra (managed by whatever strategy you've configured — STCS, LCS, or TWCS) is designed to be self-regulating. Most of the time you should leave it alone. But there are specific scenarios where triggering it manually is the right call:

  • After a bulk delete or data expiry event — if you just ran a mass delete or a large batch of TTLs just fired, background compaction will get there eventually, but running nodetool compact keyspace table immediately reclaims disk and clears tombstone debt before your next read-heavy window.
  • Before decommissioning a node — compacting before you stream data out reduces the amount of tombstone-laden data sent to peers.
  • After restoring from snapshot — restored SSTables aren't merged, so a manual compact avoids a read-amplification spike during the first wave of queries.
# Compact a specific table — blocks until done, use with caution on large tables
nodetool compact my_keyspace events

# Check compaction queue depth before and after
nodetool compactionstats

# Review recently completed compactions
nodetool compactionhistory

What you don't want to do is schedule nodetool compact as a daily cron job on production nodes. It's a blocking, CPU and I/O heavy operation — running it on all nodes simultaneously during peak hours is a great way to cause a latency incident. If you need predictable compaction, tune compaction_throughput_mb_per_sec and the strategy parameters instead.

ClickHouse's MergeTree: Same Idea, Different Vocabulary

ClickHouse uses the term "parts" where other LSM systems say SSTables, but the underlying pattern is identical: writes land in a small part, and background merges combine parts into larger ones. The MergeTree family (ReplacingMergeTree, AggregatingMergeTree, CollapsingMergeTree) are all variations on what compaction does when parts merge — deduplicate by primary key, aggregate pre-computed values, or collapse update pairs respectively.

The manual compaction equivalent in ClickHouse is OPTIMIZE TABLE:

-- Merge all parts in the table — can take a long time on large datasets
OPTIMIZE TABLE events;

-- Force into a single part (expensive, rarely needed)
OPTIMIZE TABLE events FINAL;

-- Compact only a specific partition
OPTIMIZE TABLE events PARTITION '2024-03';

-- Check current part count before/after
SELECT partition, count() as parts, sum(rows) as total_rows
FROM system.parts
WHERE table = 'events' AND active = 1
GROUP BY partition
ORDER BY partition;

The thing that catches people off guard with ClickHouse: OPTIMIZE TABLE without FINAL doesn't guarantee a single part per partition — it just triggers a merge pass. If you're using ReplacingMergeTree for upsert semantics and you need guaranteed deduplication before a query, you either need FINAL on the OPTIMIZE or use SELECT ... FINAL at query time (which does the deduplication on the fly, at read cost). It's a sharp edge that's bitten every ClickHouse user at least once.

LSM vs B-Tree Storage Engines: When You're Picking the Wrong Tool

The thing that surprises most people is that B-Trees don't actually lose on reads — they lose on writes, specifically random writes. A B-Tree like InnoDB maintains a balanced tree on disk. Every UPDATE means finding the exact page that holds that row and modifying it in-place. When you're updating 50,000 rows per second spread across a 200GB table, you're hitting hundreds of different disk pages — that's random I/O, and spinning disks absolutely hate it. Even on NVMe, the write amplification compounds: WAL write, page write, possibly a double-write buffer write. You end up with 3–5x write amplification before the data even settles.
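
The random-I/O claim is easy to sanity-check with arithmetic. The numbers below are assumptions (uniform key distribution, InnoDB's 16KB pages), not measurements:

updates_per_sec = 50_000
row_change_bytes = 200                         # the logical change per UPDATE
page_size = 16 * 1024                          # InnoDB default page size
table_bytes = 200 * 1024**3
pages_in_table = table_bytes // page_size      # ~13 million pages; scattered keys mean scattered pages

logical_mb_s = updates_per_sec * row_change_bytes / 1024**2
page_traffic_mb_s = updates_per_sec * page_size / 1024**2    # worst case: one dirty page per update

print(f"{pages_in_table:,} pages in the table")
print(f"logical change: ~{logical_mb_s:.0f} MB/s, worst-case page writes: ~{page_traffic_mb_s:.0f} MB/s")
# ~10 MB/s of intent turning into ~780 MB/s of page traffic (plus the redo log and the
# doublewrite buffer) is the random-write amplification this paragraph is describing.
# In practice the buffer pool absorbs repeat hits to hot pages, which is how the
# effective factor comes back down to the quoted 3-5x range.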

LSM flips this. Every write is an append to the WAL plus an insert into an in-memory memtable that eventually flushes to an immutable SSTable on disk. Sequential I/O is fast. But you're trading write efficiency for read complexity — a read might have to check the memtable, L0 SSTables, L1 SSTables, all the way down to Lmax. RocksDB mitigates this with bloom filters on each SSTable (commonly 10 bits per key), so you avoid disk reads for keys that don't exist, but a real key lookup on a cold cache is still doing more work than a single B-Tree page walk. That's the honest trade-off nobody puts in the marketing copy.

Where I've actually seen LSM win in production: any pipeline where the write pattern is append-dominated. Event ingestion, CDC pipelines feeding into Kafka consumers that write downstream, time-series sensor data, audit logs. In these cases the data almost never gets updated after insert — you're writing rows and then maybe scanning ranges over them later. Cassandra and ClickHouse are both LSM-backed for exactly this reason. ClickHouse uses a custom LSM variant (MergeTree) that's tuned for columnar batch writes and will absorb hundreds of thousands of rows per second without choking. On the flip side, if your workload has UPDATE-heavy patterns — e-commerce inventory, banking ledgers, anything with real concurrent row mutations — put it in PostgreSQL. The B-Tree's in-place update model is genuinely better for that, and you get real MVCC, foreign keys, and planner-driven join optimization without fighting the storage engine.

The space amplification story with LSM is underappreciated until you're running out of disk. When you update a key in RocksDB, the old version doesn't disappear — it sits in an older SSTable level as a "dead" entry until compaction merges and drops it. Under active write load with default compaction settings, I've seen RocksDB sit at 1.5–2x the logical data size in dead versions. Cassandra is worse if you're running light compaction because tombstones accumulate and don't get cleaned up until a full compaction cycle runs across all replicas. If you're on a write-heavy RocksDB setup and you want tighter space overhead, you can tune max_bytes_for_level_multiplier downward and increase compaction thread count, but you're trading CPU and I/O for space:

# RocksDB options (passed via config or programmatically)
max_bytes_for_level_base = 268435456      # 256MB at L1 (the default, stated explicitly)
max_bytes_for_level_multiplier = 5        # default is 10 — smaller = more aggressive compaction
max_background_compactions = 4            # more concurrent compaction jobs
compression_per_level = [none, none, lz4, lz4, zstd, zstd, zstd]

Be careful with aggressive compaction tuning — I turned max_bytes_for_level_multiplier down to 4 once and the compaction I/O started competing with read traffic during business hours. There's no free lunch.

Here's how the four main engines actually compare across the amplification axes:

  • RocksDB: write amplification 10–30x (leveled); read amplification low with bloom filters, spikes on cache miss; space amplification ~1.1x (tiered) to ~2x (leveled, during writes); transactions single-node only (optimistic or pessimistic), no distributed ACID; operational complexity high — tuning compaction is a full-time job.
  • Cassandra: write amplification ~10x+ with LCS, lower with STCS; read amplification moderate — partition key reads fast, wide scans slow; space amplification 1.5–3x with uncompacted tombstones; transactions limited to lightweight transactions (Paxos-based, slow); operational complexity high — tombstone management, repair, compaction strategy selection.
  • PostgreSQL: write amplification 2–5x (WAL + heap + vacuum); read amplification very low — index points directly to the heap page; space amplification ~1.2–1.5x with bloat, VACUUM reclaims dead tuples; transactions full ACID, MVCC, serializable isolation; operational complexity medium — autovacuum tuning, bloat monitoring.
  • ClickHouse: write amplification low for batch inserts, high for single-row INSERTs; read amplification low for columnar scans, bad for point lookups; space amplification ~1.5x during active merges, excellent at rest with compression; transactions limited — no multi-table ACID, no row-level locking; operational complexity medium — merge scheduling, part management, avoiding tiny inserts.

The practical decision rule I use: if you're doing more than ~20% UPDATEs or DELETEs on your data, or if you need joins and foreign key constraints, PostgreSQL is the right default. If your data mostly flows in one direction — time-ordered events, logs, metrics, CDC streams — and you can structure your access patterns around partition keys or time ranges, an LSM-backed engine will let you write faster and scale storage horizontally without the random-write bottleneck. The mistake I see most often is people picking Cassandra for a workload that has complex queries and then spending months fighting its lack of secondary index support. Know your read pattern before you commit to the write-optimized path.

3 Things That Surprised Me After Running LSM in Production

I spent the first few weeks with RocksDB feeling smug about write throughput numbers. Then production happened. Three specific behaviors burned me badly enough that I now brief every engineer who touches our storage layer on them before they ship anything.

Surprise 1: Deletes Are a Lie (Until Compaction Runs)

The thing that trips people up is expecting a delete to behave like a delete. It doesn't. A delete is a write — specifically a tombstone entry that says "this key is gone now." The actual data underneath? Still sitting on disk. Space doesn't free up until compaction runs and physically merges the tombstone with the old value and drops both. If you kick off a bulk delete job — say, purging 30 million expired records — you will watch your disk usage climb before it ever comes down. The tombstones themselves take space, and compaction hasn't caught up yet.

The gotcha inside the gotcha: if your compaction is already behind (which it often is under load), a bulk delete makes it worse. You're adding write pressure at the exact moment the system needs breathing room to compact. I've seen teams run a "cleanup job" that doubled disk usage temporarily and triggered alerts because monitoring interpreted the growth as runaway data. The fix isn't to avoid bulk deletes — it's to throttle them and monitor compaction queue depth separately from raw disk usage. In RocksDB you can check pending compaction bytes with:

db.GetProperty("rocksdb.estimate-pending-compaction-bytes")

Watch that number during any bulk delete. If it's growing faster than compaction can drain it, back off.
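
In practice, "back off" becomes a small throttle loop around the delete batches. A sketch of the shape using the python-rocksdb binding from earlier; the batch size, the 2GB threshold, and the sleep interval are placeholders to tune:

import time
import rocksdb   # the python-rocksdb binding used earlier in this article

PENDING_LIMIT = 2 * 1024**3       # back off once compaction debt exceeds ~2GB; pick your own number

def throttled_bulk_delete(db, keys, batch_size=5000):
    batch = rocksdb.WriteBatch()
    for i, key in enumerate(keys, 1):
        batch.delete(key)                           # writes a tombstone; frees nothing yet
        if i % batch_size == 0:
            db.write(batch)
            batch = rocksdb.WriteBatch()
            # Let compaction drain before piling on more tombstones.
            while int(db.get_property(b"rocksdb.estimate-pending-compaction-bytes") or 0) > PENDING_LIMIT:
                time.sleep(5)
    db.write(batch)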

Surprise 2: Write Stalls Will Take Down Your Service at 2am

RocksDB has a self-preservation mechanism that most people don't read about until it bites them. When the number of L0 files hits level0_slowdown_writes_trigger (default: 20), RocksDB deliberately throttles write throughput. When it hits level0_stop_writes_trigger (default: 36), writes stop entirely. Not degrade — stop. Any write call blocks until compaction catches up.

I watched this take down a service at 2am. The compaction threads couldn't keep up with an ingest spike, L0 file count climbed, and every write in the system started blocking. From the application side it looked like total database unavailability. The fix we shipped afterward was a combination of: bumping max_background_compactions, setting max_subcompactions to use more CPU per compaction job, and adding an alert on L0 file count before it hits the slowdown trigger — not after. Here's the config we landed on for a write-heavy workload:

options.max_background_compactions = 8;
options.max_background_flushes = 4;
options.max_subcompactions = 4;
// Give compaction more room before it panics
options.level0_slowdown_writes_trigger = 40;
options.level0_stop_writes_trigger = 56;
// Alert at 30 — gives you time to react
// rocksdb.num-files-at-level0 via GetProperty()

Raising the trigger numbers buys you time but doesn't fix the underlying problem — you still need compaction to actually keep up. The real lever is CPU and I/O budget for background jobs. Don't run compaction threads starved on a box that's also doing heavy reads.

Surprise 3: Your Restart Time Is Hostage to WAL Size

LSM writes live in the memtable before they ever reach an SSTable, and the WAL (write-ahead log) is what makes that safe across crashes. On restart, RocksDB has to replay any WAL data that wasn't flushed to an SST file. The bigger your write_buffer_size, the more unflushed data can exist at crash time, and the longer your restart takes replaying it.

The default write_buffer_size is 64MB per column family. Sounds fine until you have 16 column families and a bursty write workload that filled all of them right before a deploy restart. That's potentially over 1GB of WAL to replay. On rotational disk, or even on a loaded NVMe, that adds tens of seconds to startup — and if you have a readiness probe with a 30-second timeout, you will fail health checks and crash-loop. I've seen Kubernetes pods get stuck in exactly this loop because the WAL replay pushed past the probe window.
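
The restart math is worth running for your own setup. A rough estimate; the replay rates below are assumptions (replay rebuilds memtables, so it runs well below raw sequential read speed), so substitute what you measure:

column_families = 16
write_buffer_mb = 64              # per-CF write_buffer_size
fill_fraction = 1.0               # worst case: every memtable full at crash time

wal_to_replay_mb = column_families * write_buffer_mb * fill_fraction

for label, replay_mb_per_s in [("local NVMe", 100), ("network-attached volume", 30)]:
    seconds = wal_to_replay_mb / replay_mb_per_s
    print(f"{label}: ~{wal_to_replay_mb:.0f}MB of WAL -> ~{seconds:.0f}s of replay on startup")
# Compare the worst case against your readiness-probe timeout before sizing write_buffer_size.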

The trade-off is real: smaller write_buffer_size means faster restarts and more frequent flushing, but more L0 files and more compaction pressure. A setting I've found reasonable for services that need predictable restart times:

# In RocksDB options (or equivalent in LevelDB-derived systems)
write_buffer_size = 32MB          # smaller memtable = less WAL to replay
max_write_buffer_number = 3       # allow two in-flight while one flushes
min_write_buffer_number_to_merge = 1  # flush aggressively

# For column-family-heavy setups, also check:
db_write_buffer_size = 256MB      # global cap across all CFs

The real lesson: size write_buffer_size by thinking about restart latency first, write throughput second. If your service lives in Kubernetes with tight health check windows, 32–64MB per column family is usually the ceiling, not the floor.

When NOT to Use an LSM-Based Database

The single biggest mistake I see with LSM adoption is cargo-culting. Someone reads that RocksDB or Cassandra handles millions of writes per second, and suddenly every new project gets an LSM-based backend. Here's the thing: LSM trees trade read performance for write performance, and that trade-off only makes sense in specific conditions. Miss those conditions and you've added operational complexity for negative returns.

Heavy random point reads on a large cold dataset

Bloom filters help, but they're not magic. A bloom filter tells you "this key is definitely not in this SSTable" — it can't tell you which SSTable has it. On a cold dataset with many SSTables across multiple levels, a single key lookup might still touch 3-5 files from disk after the filter eliminates the obvious misses. Compare that to a B-tree index in PostgreSQL where a point read on a well-indexed column costs you O(log n) page reads, almost always 2-3 I/Os, and those hot pages are likely in the buffer cache. I've seen read latency on RocksDB go from 2ms to 40ms just because the compaction fell behind and L0 accumulated 20 files. That doesn't happen with a B-tree.

Complex multi-table joins and transactions with ACID guarantees

LSM engines are fundamentally key-value stores or wide-column stores. ScyllaDB, Cassandra, RocksDB, LevelDB — they're all excellent at "give me the value for this key" or "scan this partition." The moment you need multi-table joins, foreign key constraints, or multi-row transactions with rollback semantics, you're fighting the data model. Yes, you can bolt a SQL layer on top (TiDB does this over TiKV, which is RocksDB under the hood), but you're adding significant complexity. If your schema looks like a normalized relational model with 10+ tables and complex query patterns, PostgreSQL 16 with proper indexing will outperform almost any LSM-based SQL alternative while being massively easier to operate.

Workloads with frequent small updates to the same key

This one is counterintuitive because LSM databases are marketed as write-optimized. But "write-optimized" means ingestion throughput — appending new data. If you're doing something like:

# Incrementing a counter for an active user session every 5 seconds
UPDATE user_sessions SET last_seen = NOW(), request_count = request_count + 1
WHERE session_id = 'abc123'

...then every update writes a new version of that key. Compaction has to repeatedly merge and discard old versions of the same key. Your write amplification factor can balloon to 10x–30x on NVMe (meaning 10–30 bytes written to disk per logical byte you write). You're not getting the throughput benefit because the hot keys are constantly being re-written through every compaction level. For this pattern, Redis with persistence or even PostgreSQL with an UPDATE-heavy workload on a table with a proper primary key index will be more efficient.

Teams that haven't tuned compaction before

Misconfigured compaction is insidious because it doesn't fail loudly — it just silently degrades over weeks. I've watched a Cassandra cluster go from 5ms p99 reads to 300ms p99 reads over 6 weeks because the compaction strategy was set to SizeTieredCompactionStrategy on a table with a high tombstone ratio. There was no alert, no error — just gradual degradation that looked like a traffic increase until we profiled it. Tuning compaction means understanding the differences between leveled, size-tiered, and TWCS strategies, knowing how to read the compaction metrics, and having the operational runbook for when things go sideways. If your team is primarily application developers who treat the database as a black box, you'll eventually hit a production incident that takes days to diagnose.

When your actual write volume is modest

If your application handles a few hundred writes per second or fewer, PostgreSQL on a decent instance (even a db.t3.medium on RDS at ~$0.068/hr) will handle it without breaking a sweat. I've seen teams spin up managed Cassandra clusters (DataStax Astra starts at meaningful per-GB pricing once you're past the free tier) for workloads that genuinely fit in a single Postgres instance. The LSM write optimization only pays for itself when you're sustaining tens of thousands of writes per second with high concurrency. Below that threshold, you're just paying the complexity tax — harder schema migrations, no joins, manual data modeling — with none of the performance upside.

Practical Tuning Checklist Before You Go to Production

Most LSM tree performance problems I've seen in production weren't because someone chose the wrong database — they were because the defaults got shipped as-is. RocksDB's defaults are tuned for correctness on a laptop, not for a 32-core machine pushing 100K writes/second. Here's what I actually change before anything goes live.

RocksDB Block Cache

The block cache is your read amplification escape hatch. Without it, every point lookup that misses the memtable triggers SST file reads at multiple levels. I set it to 30–50% of available RAM using a shared cache across column families — that way you're not accidentally double-allocating per-CF caches:

#include "rocksdb/cache.h"
#include "rocksdb/table.h"

// 8GB cache — adjust to 30-50% of your machine's RAM
auto cache = rocksdb::NewLRUCache(8LL * 1024 * 1024 * 1024);

rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = cache;

// Apply to every column family you open
rocksdb::Options options;
options.table_factory.reset(
  rocksdb::NewBlockBasedTableFactory(table_options)
);

The thing that caught me off guard: if you open multiple column families without sharing the cache object, you end up with N independent caches each thinking they own 30% of RAM. You'll blow past your memory budget fast. Share the std::shared_ptr explicitly.

Bloom Filters on Every Column Family

Bloom filters are the single highest-leverage change for read performance in LSM trees. Without them, a key lookup has to check every SST file at every level until it finds the key (or exhausts the search). A false-positive rate of ~1% with 10 bits per key is the standard trade-off:

rocksdb::BlockBasedTableOptions table_options;

// 10 bits/key = ~1% false positive rate — good default for most workloads
table_options.filter_policy.reset(
  rocksdb::NewBloomFilterPolicy(10)
);

// NewBloomFilterPolicy(10) builds full (per-file) filters by default in modern RocksDB;
// whole_key_filtering = true (also the default) filters on complete keys, not prefixes
table_options.whole_key_filtering = true;

Skip this and your tail read latencies will be brutal under compaction when L0 file count spikes. I've seen p99 reads jump 10x in that window on a system without bloom filters. This is non-negotiable.

Compaction and Flush Thread Counts

The defaults — max_background_compactions = 1, max_background_flushes = 1 — made sense when RocksDB was being cautious about resource usage. On any modern server with 8+ cores and NVMe storage, they're a bottleneck waiting to ambush you under write load:

options.max_background_compactions = 4;  // start here, tune up if stalls persist
options.max_background_flushes = 2;      // flushes block writes; keep ahead of memtable fills
options.max_background_jobs = 6;         // RocksDB 6.x unified thread pool — set this too

// Increase the env thread pool to actually back these up
rocksdb::Env::Default()->SetBackgroundThreads(4, rocksdb::Env::LOW);
rocksdb::Env::Default()->SetBackgroundThreads(2, rocksdb::Env::HIGH);

The gotcha: setting max_background_compactions without also calling SetBackgroundThreads does nothing useful — the thread pool won't grow automatically. I've watched engineers bump the compaction count to 8 and wonder why nothing changed. Check the actual thread pool size.

Monitoring Write Stalls

Write stalls are RocksDB's self-preservation mechanism — it throttles or stops writes when compaction can't keep up with flush output. You want rocksdb.stall.micros at or near zero in steady state. If it's climbing, you have a compaction backpressure problem, not an application problem:

// Enable statistics collection
options.statistics = rocksdb::CreateDBStatistics();

// Later, check stall time
std::string stall_micros;
db->GetProperty("rocksdb.stats", &stall_micros);

// Or pull specific counter
uint64_t stall = options.statistics->getTickerCount(
  rocksdb::STALL_MICROS
);
// Alert if this grows faster than 1ms/s in steady state

I also expose this via a sidecar that scrapes GetProperty("rocksdb.stats") every 30 seconds and pushes it to Prometheus. A stall counter that's non-zero but stable usually means you've hit a compaction ceiling — increase threads first, then look at compaction style (leveled vs. universal).
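
The sidecar itself is a few dozen lines. A sketch of the idea using the Python binding and prometheus_client; the property names are real RocksDB properties, but the port, interval, and metric names are mine to illustrate:

import time
import rocksdb
from prometheus_client import Gauge, start_http_server

PROPS = {
    b"rocksdb.num-files-at-level0": Gauge("rocksdb_l0_files", "SSTable count at level 0"),
    b"rocksdb.estimate-pending-compaction-bytes": Gauge("rocksdb_pending_compaction_bytes",
                                                        "Estimated compaction backlog in bytes"),
    b"rocksdb.num-immutable-mem-table": Gauge("rocksdb_immutable_memtables",
                                              "Immutable memtables waiting for flush"),
}

def run_exporter(db, port=9099, interval_seconds=30):
    start_http_server(port)                    # Prometheus scrapes http://host:9099/metrics
    while True:
        for prop, gauge in PROPS.items():
            value = db.get_property(prop)
            if value is not None:
                gauge.set(int(value))
        time.sleep(interval_seconds)

if __name__ == "__main__":
    opts = rocksdb.Options(create_if_missing=True)
    run_exporter(rocksdb.DB("/tmp/testdb_monitored", opts))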

Cassandra: Check Dead Cell Ratio Before Blaming Reads

Every time I've gotten a Cassandra read latency complaint, the first thing I run is nodetool cfstats. A high dead-to-live cell ratio means tombstones are stacking up and reads are churning through ghost data across SSTables — no amount of read tuning fixes a tombstone problem:

# Run on each node, filter to your keyspace/table
nodetool cfstats keyspace_name.table_name | grep -E "Live|Tombstone|SSTable"

# You want output like:
#   Number of live cells per slice (last five minutes): 42.0
#   Number of tombstones per slice (last five minutes): 1.0
#   SSTable count: 8

If you see tombstone counts within an order of magnitude of live cells, you have a data modeling or TTL problem. Fix your delete patterns or compaction strategy (TWCS for time-series, STCS vs LCS for the right access pattern) before you start touching read timeouts.

ScyllaDB Per-Shard Compaction Queue Depth

ScyllaDB's shard-per-core model means compaction backpressure isn't global — one shard can be completely saturated while others are idle. The Prometheus endpoint at /metrics (default port 9180) exposes exactly this:

# Scrape the metrics endpoint directly
curl -s http://localhost:9180/metrics | grep compaction_backlog

# Look for per-shard breakdown:
# scylla_compaction_manager_backlog{shard="0"} 0.0
# scylla_compaction_manager_backlog{shard="1"} 847123.0  <-- problem shard
# scylla_compaction_manager_backlog{shard="2"} 0.0

# Also check pending compactions
curl -s http://localhost:9180/metrics | grep "pending_compactions"

A single hot shard with a massive backlog usually means your partition key has a hotspot — one logical key is receiving a disproportionate share of writes and its SSTables are accumulating on one shard faster than compaction can drain them. The fix is upstream in your data model, not in compaction thread counts. ScyllaDB's Grafana dashboard (the one in their monitoring stack repo) visualizes this per-shard breakdown out of the box if you don't want to grep metrics manually.

