The Problem That Sent Me Down This Rabbit Hole
The query that broke my patience was dead simple: filter a 500GB Delta Lake table by a bounding box — latitude/longitude range — and aggregate a few columns. Should have been fast. On paper, the data was “organized.” We had date partitions. We had Parquet compression. We had a cluster with enough RAM to handle it. And we were still sitting at 38–42 seconds per query, every time, no matter what we tried tuning on the Spark side.
The first instinct — and I’ve seen this across every team I’ve worked with — was to throw more compute at it. Scale the cluster from 8 workers to 16. Bump executor memory. Crank up spark.sql.shuffle.partitions. You know what that got us? 36-second queries. The problem wasn’t compute. The problem was that Spark was still opening thousands of Parquet files and scanning most of them, because the data layout gave it no way to skip anything. More workers just means more parallelized wasted work. The bottleneck was always I/O and file enumeration, not CPU.
Here’s the trap with traditional Hive-style partitioning: you have to commit to one or two columns upfront, and that decision has to match your query patterns forever. We partitioned by event_date because that was the most common filter. Except product started asking for spatial queries. Then user-segment queries. Then mixed queries that didn’t touch date at all. Every new query pattern was either a full table scan or required a redesign of the partition scheme. The “partition or column?” choice gets especially painful when you realize adding a high-cardinality column as a partition key — like a region ID with 10,000 distinct values — doesn’t help; it creates millions of tiny files and tanks write performance instead.
Adaptive Compression in Inverted Indexes: What Actually Happens Inside Lucene, Elasticsearch, and Tantivy
# What we were doing — and why it stopped working
spark.read.format("delta") \
.load("s3://my-lake/events/") \
.filter(
(col("lat") >= 48.8) & (col("lat") <= 48.9) &
(col("lon") >= 2.3) & (col("lon") <= 2.4)
) \
.groupBy("user_segment") \
.count()
# Spark plan showed: Scan parquet → PartitionFilters: [event_date]
# DataFilters: [lat >= 48.8, lat <= 48.9, lon >= 2.3, lon <= 2.4]
# Files scanned: 4,200 out of 4,312. Basically the whole table.
What I actually needed was file-level skipping based on the query's actual filter columns — at query time, not at write time. Parquet has min/max statistics per column per file, and Delta Lake can use those for data skipping. But the catch is your data has to be physically sorted in a way that makes those min/max ranges tight. If your lat/lon values are scattered randomly across files because they were ingested in time order, the min/max of every file covers the whole range, and skipping is useless. You need a spatial index — something that clusters multi-dimensional data so that nearby values in space end up in the same files. That's the gap Qbeast's OTree index fills, and why it became the thing I actually needed rather than a bigger cluster.
Quick Orientation: What Qbeast Actually Is (And Isn't)
Most people's first instinct is to treat Qbeast like a query engine or some kind of managed database layer. It's neither. Qbeast is a table format extension — it bolts onto Delta Lake and changes how data is physically organized on disk when you write it. Your existing Spark setup, your S3 or ADLS storage, your Delta transaction log — all of that stays. Qbeast just takes over the file layout decisions.
The OTree index is what makes it interesting. Instead of organizing data by partition column or arbitrary file boundaries, it builds a multi-dimensional spatial tree where data is distributed by value density. Think of it like a quadtree over your numeric columns: dense regions of data get finer subdivisions, sparse regions get coarser ones. The practical effect is that a query like "give me rows where price BETWEEN 100 AND 200 AND quantity > 50" can skip whole branches of the tree without touching the files they point to. The index is genuinely multi-dimensional — you're not just partitioning on one column and hoping for the best.
The thing that caught me off guard was how fundamentally different this is from Delta's built-in Z-ordering. Z-order in Delta is a post-write reorg — you write your data however it lands, then you run OPTIMIZE ZORDER BY (col1, col2) as a separate maintenance job that physically reshuffles files. OTree is written at ingest time. The index structure is built while the data is being committed to Delta. That's an enormous operational difference: with Qbeast you skip the maintenance job entirely, but you're paying a small extra cost on every write. If you're doing heavy batch ingestion and running OPTIMIZE once a day anyway, Z-order might be simpler. If you're doing continuous streaming writes and want query efficiency without a separate reorg pipeline, OTree starts looking very attractive.
How Search Engines Shrink Inverted Indexes Without Killing Query Speed: Adaptive Compression in Practice
Before you download anything: Qbeast requires Spark 3.x — it does not work on Spark 2.x, and Spark 4.x support is still catching up depending on the release. You also need the qbeast-spark JAR on your classpath before any of the format registration hooks work. Forget this and you'll get cryptic AnalysisException: Could not find a suitable table format errors that look like a Delta problem but aren't. The JAR needs to be there at driver startup, not just added mid-session.
# Check the releases page first — versions are tightly coupled
# https://github.com/qbeast-io/qbeast-spark/releases
# Example for Spark 3.4 + Delta 2.4:
spark-shell \
--packages io.qbeast:qbeast-spark_2.12:0.6.0 \
--conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog
The GitHub repo is at qbeast-io/qbeast-spark — go to the releases page, not the README, because the README sometimes trails the actual release matrix. Match your Spark version, your Scala version (2.12 vs 2.13), and your Delta Lake version before pulling a JAR. Getting those three misaligned is the most common reason people spend an afternoon debugging ClassNotFoundExceptions instead of running queries.
Installation and First Config — The Steps I Actually Had to Do
The first thing that bit me wasn't the concept — it was a ClassNotFoundException at 11pm because I grabbed the wrong Delta Lake version. Qbeast pins hard to specific Delta releases, and this isn't plastered at the top of their README where it should be. Before you write a single line of Spark code, go to their GitHub compatibility matrix and confirm which Delta version your target Qbeast release expects. Qbeast 0.6.x works with Delta 2.4.x, and mixing even minor versions will silently blow up at class loading time, not at compile time.
For a quick spark-submit test run, the dependency flag looks like this:
spark-submit \
--packages io.qbeast:qbeast-spark_2.12:0.6.0,io.delta:delta-core_2.12:2.4.0 \
--conf spark.sql.extensions=io.qbeast.spark.internal.sources.QbeastSparkSessionExtension,io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
your_job.py
That spark.sql.extensions line is doing double duty — you need both the Qbeast extension and the Delta extension registered, comma-separated, in that config key. Miss the Delta one and you get a catalog registration failure. If you're building with sbt instead of a command flag, pin both artifacts explicitly in your build.sbt:
libraryDependencies ++= Seq(
"io.qbeast" %% "qbeast-spark" % "0.6.0",
"io.delta" %% "delta-core" % "2.4.0"
)
// Tell sbt not to resolve a "newer" Delta on its own
dependencyOverrides += "io.delta" %% "delta-core" % "2.4.0"
The dependencyOverrides line is critical if your project pulls in other Delta-dependent libraries. Without it, sbt's eviction resolution will silently pick a different Delta version and you're back to runtime class errors.
desiredCubeSize is the tuning knob the docs mention in one sentence and then move on from, which is a mistake. This number controls how many rows land in each OTree cube node before Qbeast splits it into children. Too small (like the default 5,000) and you get massive tree depth, slow ingestion, and too many small files — the Spark small-file problem in tree form. Too large and each cube covers a wide range of your index columns, which degrades query pruning back toward full-scan territory. I've had good results with 500,000 for datasets in the 50M–500M row range. For datasets under 10M rows, 100,000 is more appropriate. Think of it like choosing B-tree page size — it's a write-vs-read trade-off.
Once your session config is right, the verification write that confirms everything is wired up correctly looks like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.sql.extensions",
"io.qbeast.spark.internal.sources.QbeastSparkSessionExtension,"
"io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
df = spark.range(1_000_000).toDF("id") \
.withColumn("value", (df["id"] * 3.14).cast("double"))
df.write.format("qbeast") \
.option("columnsToIndex", "id,value") \
.option("cubeSize", "500000") \
.mode("overwrite") \
.save("/tmp/qbeast_test")
# If this runs without ClassNotFoundException or AnalysisException, you're good
spark.read.format("qbeast").load("/tmp/qbeast_test").show(5)
If that write completes and the read shows 5 rows, your Qbeast install is functional. If you see AnalysisException: Table or view not found or any mention of QbeastCatalog failing, the extension registration is wrong — double-check that both extension class names are in a single comma-separated string in that one config key, not set as two separate .config() calls overwriting each other.
Understanding OTree Spatial Index: The Mental Model You Need
The thing that caught me off guard about OTree is that it doesn't care how many rows are in a partition — it cares about where those rows live in value space. Most indexing systems I'd worked with before split data by row count or by a single partition key. OTree subdivides the actual n-dimensional coordinate space your columns inhabit. If your data has a cluster of GPS coordinates all jammed into São Paulo, OTree doesn't just dump them into one overloaded partition — it keeps splitting that dense region into smaller and smaller cubes until the density is manageable. That's the core mental model: recursive spatial subdivision based on data distribution, not row count.
Concretely, imagine your dataset has two dimensions: latitude and event_timestamp. OTree maps both columns onto axes of a coordinate space, then recursively halves that space — first into two halves, then four, then eight cubes — based on where your actual data points land. A query like WHERE latitude BETWEEN 23.0 AND 24.0 AND event_timestamp BETWEEN '2024-01-01' AND '2024-01-31' can now prune entire branches of that tree. If the branch's bounding cube doesn't intersect your filter rectangle, Spark skips every file in that branch without opening them. With a traditional single-column partition on date, that same query still scans every partition looking for the latitude match. The multi-dimensional pruning is the actual win here.
The cube/file relationship is worth burning into memory before you go further because it trips people up. One cube maps to one or more Parquet files on disk — when a cube gets too large, Qbeast flushes a new file under the same cube. But every file belongs to exactly one cube, never shared. This matters because when the index prunes at query time, it operates at the cube level. If a cube has three files and your filter intersects that cube's bounding box, you read all three files. You can't partially skip within a cube. So when you're tuning, the cubeWeightsBufferCapacity setting controls how big a cube gets before it flushes:
df.write
.format("qbeast")
.option("columnsToIndex", "latitude,event_timestamp")
.option("cubeSize", "500000") // target rows per cube before splitting
.save("/data/events_indexed")
Smaller cube sizes mean more granular pruning but more files to manage. I found 300K–700K rows per cube is a reasonable starting point depending on your average Parquet row group size.
The skew-tolerance behavior is genuinely different from static partitioning and it's why I stopped reaching for PARTITION BY (country) when data is uneven. With static partitioning, a country with ten million rows and a country with fifty rows get the same single partition — you can't do anything about that without manual sub-partitioning. OTree handles it automatically: dense regions get split more aggressively, sparse regions might never split beyond the root cube. The index is essentially self-balancing along the value distribution curve. The practical effect is that your hot spots in São Paulo or New York don't create a single giant partition that tanks your query times — they get subdivided until the cube sizes are roughly comparable to your target cubeSize.
The sampling behavior is the feature that made me show it to three other people on my team within the same afternoon. OTree stores the revision and weight metadata needed to return a statistically representative sample of your dataset without touching most files:
-- Returns ~10,000 representative rows across the full data distribution
-- NOT random rows from arbitrary files — spatially stratified
SELECT * FROM events TABLESAMPLE (10000 ROWS);
The reason this works is that OTree assigns each row a fractional weight during indexing — rows in dense cubes get low weights, rows in sparse cubes get higher weights. A sample query uses these weights to pull from across the entire value distribution rather than just reading the first few files. For a 500GB lake this means a 10K-row sample completes in seconds instead of minutes, and the sample actually reflects your data shape rather than being biased toward whatever files happen to be listed first. If you're doing exploratory analysis or building histograms for query planning, this changes the workflow completely.
Picking the Right Columns to Index — Where Most People Get It Wrong
The most common mistake I see people make with Qbeast's OTree index is treating column selection like a standard B-tree index decision — picking whatever column has the highest cardinality and calling it a day. That intuition is wrong for a completely different reason here: OTree is a spatial partitioning structure, and what matters is how queries filter across multiple dimensions simultaneously, not how many distinct values any single column has.
Low-cardinality columns like a status enum (active, pending, cancelled) are a trap. You'd think they'd help prune, but an OTree node subdivides space — and if one of your indexed dimensions only has 4 possible values, the tree gets shallow and asymmetric almost immediately. You end up with huge leaf partitions that barely prune because the spatial split barely discriminates. You still pay the overhead of maintaining the index structure across that dimension, and you get almost no query skip benefit back. I've seen this mistake double scan times versus just not indexing that column at all.
The sweet spot I've landed on is 2–3 columns that are continuously distributed or have genuinely high cardinality across your actual data range. Think Unix epoch timestamps, user/entity IDs above a few million distinct values, latitude/longitude pairs, or sensor readings. These give the OTree space to subdivide meaningfully and actually skip large cube regions. When I added event_timestamp and user_id together as my indexed columns, cube-based pruning started eliminating whole file groups rather than just shaving a few row groups:
# Writing with Qbeast format in PySpark
df.write \
.format("qbeast") \
.option("columnsToIndex", "event_timestamp,user_id") \
.option("cubeSize", "500000") \
.save("s3://my-lake/events")
Before you commit to a column config, run ANALYZE on your query logs — not your schema docs. What you think your analysts query and what they actually run are usually different. I pull the last 30 days of slow queries out of Spark history server, grep for WHERE clause patterns, and tally co-occurrence:
# Rough bash one-liner to tally column co-occurrence in WHERE clauses
grep -h "WHERE" spark-query-log.txt \
| grep -oP '\b(event_timestamp|user_id|region|status|session_id)\b' \
| sort | uniq -c | sort -rn
The column pair that shows up together most often in real filter predicates is your index config. Not the pair with the highest theoretical cardinality, and definitely not what the schema designer intended as the "primary key".
The thing that bit my team hardest: columnsToIndex is not something you tune after ingestion. Changing it requires a full rewrite of the dataset — you can't alter it in-place the way you'd add a secondary index to Postgres. If you're loading production data in the range of hundreds of gigabytes or terabytes, a full rewrite is hours of Spark cluster time and non-trivial cost. Get this decision right before you ingest. I now treat the index column choice as a schema-level decision, documented alongside the DDL, with sign-off before any production backfill runs.
The desiredCubeSize Tuning Loop
The default desiredCubeSize of 5,000,000 rows sounds reasonable until you actually look at what it does to a table with under 50GB of data. At that scale, you end up with one or two massive cubes that the OTree spatial index can't prune meaningfully — your query filter might touch 10% of the logical space, but Spark still reads 80% of the files because the cubes are too coarse. I hit this exact wall on a 30GB events table where query times barely improved after indexing. The index existed, the statistics were written, but the spatial partitioning was too blunt to help.
The failure mode on the other side is subtler but just as painful. Drop desiredCubeSize to something like 10,000 on a table that gets append-heavy writes and you'll generate thousands of small Parquet files in S3 or GCS. Spark's file-open overhead on object storage isn't trivial — each file open involves a metadata request, and when you're opening 4,000 files instead of 40, the latency stacks up fast. The OTree index prunes aggressively, but you're still paying per-file cost on every file that survives pruning. The sweet spot is a cube size where the index prunes hard and each surviving file has enough rows to amortize open cost.
The feedback loop that actually works: write a representative sample of your real data, run your most common filter query with EXPLAIN, then open Spark UI and look at the "files read" metric under the scan stage. That number is your tuning signal — not query time, not shuffle size, not executor memory. Files read. If your table has 200 files and a selective query reads 180 of them, your cube size is too large. If it reads 3, you might be over-partitioned and generating tiny files. Here's the test use I actually use:
# Write a test dataset with a candidate cubeSize
df.write.format("qbeast")
.option("columnsToIndex", "event_timestamp,user_id")
.option("cubeSize", "300000")
.save("/tmp/qbeast_test_v3")
# Then in a separate session, run your real query pattern
spark.read.format("qbeast").load("/tmp/qbeast_test_v3") \
.filter("event_timestamp BETWEEN '2024-01-01' AND '2024-01-07' AND user_id = 'u_1234'") \
.explain()
# Check Spark UI → Jobs → your query → expand the FileScan node
# Look for: "number of files read" vs total files in the path
For a ~200GB time-series table with mixed access patterns — range scans on event_timestamp, point lookups on user_id — I landed on cubeSize 300000 with those two columns as the index. The ordering matters: put the column you filter on most frequently first. In my case that's event_timestamp because 90% of queries are time-windowed. With this config on a 200GB dataset I went from reading 340 files on a week-long time window query down to reading 28. The config that stuck:
df.write.format("qbeast")
.option("columnsToIndex", "event_timestamp,user_id")
.option("cubeSize", "300000")
.mode("overwrite")
.save("gs://my-lake/events_indexed/")
One thing that caught me off guard: desiredCubeSize is a target, not a hard limit. Qbeast will exceed it when the data distribution is skewed — if 40% of your rows share the same user_id, those cubes will be larger. So after writing, always run DeltaLog inspection or spark.read.format("qbeast").load(path).selectExpr("element_at(tags, 'cube')", "count(*)").groupBy(1).count() to see actual cube size distribution. If you see one cube with 3 million rows and everything else at 300k, you have a skew problem that cubeSize alone won't solve — you need a third index column to break up the hot partition.
3 Things That Surprised Me After Running This in Practice
I expected the usual "impressive benchmark, painful production reality" gap. Qbeast with OTree narrowed that gap more than I anticipated, but it also had one rough edge that genuinely annoyed me for a week before I figured out what was happening. Here's what actually caught me off guard.
Surprise 1: Incremental appends keep the index healthy without a maintenance job
My assumption going in was that OTree would behave like Delta Lake's Z-order — great after a fresh OPTIMIZE, degraded after 50 appends, requiring you to schedule another OPTIMIZE to bring things back. That mental model is completely wrong for Qbeast. OTree rebalances on write. Every time you append a batch, the index structure updates to account for the new data distribution. There's no separate compaction step you have to remember to run. I ran a pipeline that appended roughly 10 million rows daily for three weeks and file pruning efficiency on range queries stayed within a few percentage points the entire time. With Z-order I was used to watching pruning ratios degrade noticeably within days of the last OPTIMIZE. The write-time cost is real — appends take slightly longer — but for append-heavy workloads the trade-off is obviously worth it.
Surprise 2: The sampling API is actually production-grade
Before switching, we had a reservoir sampling pipeline — a Spark job that read the full dataset, shuffled, and wrote a stratified sample out to a separate location. It was slow, it drifted from the source data, and it had to be re-run manually whenever the underlying data changed significantly. I replaced the whole thing with a SQL hint:
-- fraction is approximate, OTree uses the spatial structure to serve it efficiently
SELECT *
FROM delta.`/data/events`
TABLESAMPLE (5 PERCENT) REPEATABLE (42)
-- Qbeast pushes this down to the index; you're not reading 100% of files
The key thing that makes this not just a syntactic shortcut: Qbeast resolves the sample using the OTree structure, so it reads a fraction of the physical files rather than scanning everything and then discarding rows. Our reservoir pipeline was reading 40+ GB to produce a 2 GB sample. The SQL hint reads roughly what you'd expect proportionally. We now point our exploratory notebooks directly at the hint instead of at a stale sample table. The one caveat — if your query also has a tight filter on a non-indexed column, you need to verify the sample is still representative for your use case, because the index selects files spatially first.
Surprise 3: Plain Delta readers still work — you just lose the pushdown
I was worried that writing through Qbeast would lock us into a specialized reader everywhere. It doesn't. Any tool that can read Delta Lake — Trino, DuckDB, Databricks, plain delta-rs — reads a Qbeast-managed table without modification. The files are valid Delta, the transaction log is valid Delta. What you lose is OTree pushdown, so those readers fall back to scanning based on Delta's own file stats (min/max per column). For tools like Metabase or dbt that query your lake via a generic Delta connector, this is a non-issue — they were never doing spatial pruning anyway. For Spark workloads that need pruning, you use the Qbeast DataSource. We run both simultaneously: Qbeast for the Spark pipelines that care about pruning, plain Delta reader for the BI layer.
The rough edge: Spark UI tells you almost nothing about OTree pruning
This one cost me real time. After I deployed, I wanted to verify that OTree pruning was actually firing on production queries. The Spark UI's "Input Files" count and "Data Read" metrics don't distinguish between "we skipped these files because of Delta stats" and "we skipped these files because of OTree spatial pruning." They're both just "files not read." To get actual OTree-specific metrics, you have to go through Qbeast's own instrumentation. The fastest way I found was checking the qbeast.filesRead metric exposed via the Qbeast analyzer, or just being deliberate about comparing files-read counts between a plain Delta read and a Qbeast read on the same query:
-- Run both and compare sc.ui() files-read, or use Qbeast metrics API
val qbeastDf = spark.read.format("qbeast")
.option("columnsToIndex", "user_id,event_ts")
.load("/data/events")
// Check pruning via Qbeast metrics after execution
qbeastDf.filter("event_ts > '2024-01-01' AND user_id < 100000").count()
// Then inspect: spark.sparkContext.statusTracker or Qbeast's own QueryReport
The fix I actually landed on: add a lightweight wrapper to our pipeline logging that records files-read before and after the Qbeast query, and alert if the ratio degrades past a threshold. Not elegant, but it works until Qbeast exposes a proper Spark plugin hook for metrics. If you're running this on Databricks, the cluster event logs give you slightly more visibility, but it's still not as clean as you'd want.
When NOT to Use Qbeast + OTree
The most useful thing I can tell you about Qbeast is when to put it back on the shelf. I've watched teams adopt it because the benchmarks look great, then spend two weeks debugging rebalancing behavior on workloads it was never designed for. The OTree index earns its complexity at scale and with multi-dimensional queries — but that complexity has real costs in the wrong context.
Point lookups by primary key. If your access pattern is "give me the row where user_id = 18372," Qbeast is actively the wrong tool. Hive-style partitioning on a date column plus a sorted Parquet file with Bloom filters will get you there in single-digit milliseconds. A proper OLAP database like ClickHouse or even DuckDB on flat Parquet will demolish OTree on this pattern. The spatial index shines when you're querying across multiple numeric dimensions simultaneously — the moment you reduce that to a single equality filter, you're paying index overhead for zero gain.
Tables under ~10GB. Seriously, just load it into memory. DuckDB reading flat Parquet files will outperform any indexed format on a dataset this size, and you skip all the write-time overhead. The OTree cube structure requires a minimum data volume to fill enough cubes that the skip stats actually eliminate work. Below that threshold you're often scanning more metadata than data. I've seen people add Qbeast to a 3GB events table and wonder why queries got slower — the index fanout was wider than the table.
Streaming micro-batch writes at high frequency. If you're landing thousands of small Spark Structured Streaming micro-batches per minute, cube rebalancing will add measurable latency to every write commit. The OTree index needs to reassign data between cubes as it learns the distribution — that's fine when you're doing large batch appends, brutal when each micro-batch is 10K rows and you're committing every 30 seconds. Your write SLA will suffer. For high-frequency streaming ingestion, land raw Parquet first, compact + index on a schedule (hourly or daily), and query the indexed layer separately.
Teams not already running Spark. Qbeast is Spark-native and that's not changing fast. As of the time of writing, there's no standalone reader for Trino or Presto — if your query layer is either of those, you either wrap a Spark thrift server in front (painful) or you skip Qbeast entirely. The Delta Lake format underneath is readable by Trino, but you lose all the OTree skip logic and just get a full scan. This is a genuine ecosystem gap, not a configuration problem you can fix.
Highly predictable, static query patterns. If every query filters on the same two or three columns in the same ranges — think a time-series table always queried by event_date and region — Hive partitioning plus predicate pushdown is faster to set up, easier to debug, and requires zero special Spark configuration. You partition on event_date, maybe bucket on region, and Spark's built-in file pruning does most of the work. Hive partitioning is also readable by every query engine without a plugin. Qbeast's value is specifically in ad hoc multi-dimensional queries where you don't know which column combinations users will filter on — the moment your patterns crystallize, the simpler solution wins.
Performance Mindset Shift: Stop Optimizing After the Fact
The thing that changed how I think about query performance wasn't a new execution engine or a fancier cluster configuration — it was realizing I'd been optimizing at the wrong time. I kept throwing more parallelism at read-time problems that were actually write-time problems. Tuning Spark executor memory or shuffle partitions is real work, but it's patching over the fact that your data is physically laid out in a way that forces full scans no matter how clever your query plan gets.
With Qbeast's OTree spatial index, the layout decision happens at ingestion, and the index structure itself encodes your query patterns into the file organization on disk. That sounds abstract until you see what it means in practice: a query with a range filter on two columns that used to scan 40GB now touches 3GB, not because the engine got smarter, but because the files containing relevant data are physically grouped together. The data layout decision at write time delivered a 10x improvement that no executor tuning would have gotten close to.
The workflow I now run before any significant ingestion looks like this — profile staging data first, then design the index:
# 1. Profile your actual query patterns against a sample in staging
# Look at which columns appear in WHERE clauses and JOIN conditions most
# 2. Define index columns and cube size in your write config
df.write \
.format("qbeast") \
.option("columnsToIndex", "event_timestamp,user_id") \
.option("cubeSize", "500000") \ # tune this based on data volume — smaller = more granular pruning
.save("s3://my-lake/events_indexed/")
# 3. Verify pruning is actually happening post-ingest
spark.read.format("qbeast") \
.load("s3://my-lake/events_indexed/") \
.filter("event_timestamp > '2024-01-01' AND user_id BETWEEN 1000 AND 5000") \
.explain(True)
# Check "files read" in the physical plan — if it's still the full dataset, your columnsToIndex is wrong
The cubeSize parameter is where I see people make the biggest mistake. Setting it too high means coarse-grained cubes that don't prune effectively on narrow queries. Too low and you end up with thousands of tiny files that kill throughput on broad aggregations. My rule of thumb: start at 500K rows per cube for datasets under 500GB, and benchmark both a narrow point query and a full-table aggregate before committing. The OTree structure forces you to think about this before ingestion — and that's genuinely a feature. It creates a forcing function that schema reviews never did, because schemas feel abstract but "how many files will this query touch" is immediately measurable.
What I've started doing is treating data layout the same way I treat schema migrations: it gets a dedicated review step in the pipeline design process. Before any new dataset goes into the lake, someone has to answer three questions: What are the top three query patterns? What columns drive those filters? What's the estimated row count at 6-month growth? Those answers directly map to columnsToIndex and cubeSize. If nobody can answer those questions, the dataset isn't ready to ingest — same logic as not merging a schema change without knowing downstream consumers. For teams building out the surrounding infrastructure, the Essential SaaS Tools for Small Business in 2026 guide covers several data orchestration tools that slot naturally into this pattern — specifically the ones that support pre-ingest profiling and pipeline metadata management alongside a Qbeast-backed lake.
The mindset shift is this: read-time optimization has a ceiling determined entirely by write-time decisions. You can tune parallelism, cache aggressively, and repartition on read all day — but if the data isn't spatially organized around your query patterns, you're just running a full scan faster. OTree makes the right thing the only obvious path, which is exactly what good tooling should do.
Quick Reference: Commands and Config I Actually Use
I keep forgetting which config keys go where, so here's the block I paste into every new project. The ordering matters less than making sure all three extension points are present — miss the catalog config and you'll get a generic Parquet reader with no OTree awareness at all.
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName("qbeast-workload")
# Without this, Delta operations silently fall back to default behavior
.config("spark.sql.extensions",
"io.delta.sql.DeltaSparkSessionExtension,"
"io.qbeast.spark.delta.QbeastSparkSessionExtension")
# Qbeast piggybacks on Delta's catalog — both must be declared
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog")
.config("spark.qbeast.index.defaultCubeSize", "5000000")
# Raise this if your cluster has headroom; lower it if cubes are too sparse
.config("spark.qbeast.index.numberOfRetries", "3")
.getOrCreate()
)
Writing with the index is the part where most people under-specify and wonder why sampling doesn't work later. columnsToIndex is mandatory — Qbeast builds the OTree around those dimensions. cubeSize is optional but you want to set it explicitly instead of trusting the default, especially if your rows are wide. I typically start at 5M for datasets under 500GB and tune from there based on the explain plan output.
# Writing a partitioned Delta table with OTree indexing
(
df.write
.format("qbeast")
.option("columnsToIndex", "user_id,event_timestamp")
.option("cubeSize", "5000000")
# appendMode works but you'll want to OPTIMIZE after heavy appends
.mode("overwrite")
.save("/mnt/datalake/events_indexed")
)
# If you're appending incrementally:
(
new_df.write
.format("qbeast")
.option("columnsToIndex", "user_id,event_timestamp")
.option("cubeSize", "5000000")
.mode("append")
.save("/mnt/datalake/events_indexed")
)
Reading back and confirming the index is actually being used requires running explain() — don't assume the pushdown is working just because the query returns. Look for QbeastScan in the physical plan. If you see FileScan parquet or FileScan delta instead, something in your session config is missing.
df = spark.read.format("qbeast").load("/mnt/datalake/events_indexed")
# Filter pushdown — Qbeast prunes cubes that can't contain matching rows
filtered = df.filter(
(df.event_timestamp >= "2024-01-01") &
(df.event_timestamp < "2024-02-01")
)
# Check the physical plan — you want QbeastScan, not FileScan
filtered.explain(mode="formatted")
# Expected output excerpt (confirm this appears):
# == Physical Plan ==
# *(1) Filter (...)
# +- QbeastScan qbeast [user_id, event_timestamp, ...]
# PushedFilters: [IsNotNull(event_timestamp), GreaterThanOrEqual(...)]
# ReadSchema: ...
The TABLESAMPLE syntax is where OTree earns its keep. Normal TABLESAMPLE in Spark is random-row filtering applied after a full scan — it's just cosmetic. Qbeast's version actually navigates the cube tree and returns only top-level cubes, which means you get a representative statistical sample without reading the whole dataset. The difference shows up immediately in scan metrics.
-- OTree-aware sampling via SQL — reads only top-N cube levels, not all files
SELECT *
FROM qbeast.`/mnt/datalake/events_indexed`
TABLESAMPLE(10 PERCENT)
-- Or from PySpark:
sampled = spark.sql("""
SELECT user_id, event_timestamp
FROM qbeast.`/mnt/datalake/events_indexed`
TABLESAMPLE(5 PERCENT)
""")
# Sanity check: file count read should be much lower than total file count
sampled.explain(mode="formatted")
Inspecting cube structure directly is something I only learned to do after a bad append left the index fragmented. queryExecution alone isn't the cleanest interface — the more reliable path is using Qbeast's built-in metadata table, which surfaces cube IDs, their depth in the tree, the element count per cube, and the min/max ranges per indexed column. Run this after any bulk write or OPTIMIZE to confirm the tree looks sane.
from qbeast.spark.implicits import QbeastTable
# Full cube metadata — depth, element count, min/max per indexed column
qt = QbeastTable.forPath(spark, "/mnt/datalake/events_indexed")
qt.getIndexMetrics().show(truncate=False)
# Lower-level: poke at the raw query execution plan
raw = spark.read.format("qbeast").load("/mnt/datalake/events_indexed")
print(raw.queryExecution.optimizedPlan)
# If you want cube-level file mapping:
qt.analyze() # Re-runs OTree stats without rewriting files
# Output includes: cubeId | depth | elementCount | minTimestamp | maxTimestamp
One gotcha that burned me: after appending data multiple times without running OPTIMIZE, the cube tree gets unbalanced — some cubes hold far more elements than the target cubeSize, and sampling accuracy degrades silently. The fix is running qt.optimize() or OPTIMIZE qbeast.\`path\` via SQL periodically. For streaming ingestion I set a scheduled job that runs it every 6 hours; for batch loads I run it immediately after each write. You can tell the tree needs it when getIndexMetrics() shows cube element counts consistently above 2x your configured cubeSize.