DuckLake’s 900x Speed Claim: A Database in Your Catalog Is Worth Two in the Cloud

DuckLake’s data inlining promises 926x faster queries than Iceberg by storing streaming data directly in the catalog database. We dissect the architecture, benchmarks, and whether this is genuine innovation or just moving the bottleneck.


When a new data tool claims to be nine hundred times faster than Apache Iceberg, the default reaction shouldn’t be excitement, it should be suspicion. DuckLake’s recent announcement of data inlining promises exactly that: 926× faster aggregation queries and 105× faster ingestion for streaming workloads. These aren’t marginal gains, they’re the kind of numbers that suggest either a fundamental architectural breakthrough or a benchmark carefully tuned to generate headlines.

The truth, as usual, lives in the uncomfortable middle. DuckLake isn’t cheating with the metrics, but it is cheating with the architecture, in a way that challenges everything we think we know about separating storage from compute.

The Small Files Problem Is a Tax on Streaming

To understand why DuckLake’s claims aren’t just marketing fluff, you need to understand the pathology of traditional data lakes. Iceberg, Delta Lake, and Hudi all suffer from the same design constraint: every write operation creates files. Not just data files (Parquet), but metadata files (JSON manifests, Avro snapshots) that track what changed.

Insert 100 rows into an Iceberg table one row at a time, a common pattern in sensor data or event streaming, and you’ve created 100 Parquet files plus 300+ metadata files. Querying that data requires traversing a forest of tiny objects, issuing thousands of S3 LIST operations, and downloading manifest files just to figure out which bytes to read. It’s like trying to read a book where every sentence is stored in a separate envelope, and you need to open a filing cabinet to find each envelope.
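
The object-count arithmetic above can be sketched directly. A minimal model, assuming (as the article's "300+" figure implies) roughly three metadata objects per Iceberg commit — a manifest, a manifest list, and a new table-metadata JSON:

```python
def iceberg_object_count(num_commits: int,
                         metadata_objects_per_commit: int = 3) -> dict:
    """Rough object count after streaming micro-commits into an Iceberg table.

    Each micro-batch commit writes one Parquet data file plus (by assumption
    here) ~3 metadata objects: manifest, manifest list, table-metadata JSON.
    """
    data_files = num_commits  # one Parquet file per micro-batch commit
    metadata_files = num_commits * metadata_objects_per_commit
    return {
        "data_files": data_files,
        "metadata_files": metadata_files,
        "total_objects": data_files + metadata_files,
    }

print(iceberg_object_count(100))
# 100 data files plus ~300 metadata objects for 100 single-row inserts
```

The linear growth is the point: every additional micro-commit adds a fixed bundle of objects that a reader must later enumerate and fetch.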

The standard workaround is compaction: batching those tiny files into larger ones through scheduled maintenance jobs. But compaction is a band-aid on a gunshot wound. It doesn’t prevent the file explosion at write time, it consumes resources while running, and it introduces latency: freshly written data stays scattered across tiny, slow-to-scan files until the next compaction run.

Data Inlining: Putting the Database Back in Data Lake

DuckLake’s solution is conceptually simple and architecturally radical: use the catalog database as a write buffer. Instead of immediately persisting small writes to object storage as Parquet files, DuckLake stores them directly in the catalog (PostgreSQL, SQLite, or DuckDB) until they hit a configurable threshold (10 rows by default).

This isn’t just caching, it’s a structural inversion of the storage hierarchy. Traditional lakes treat the catalog as a thin metadata layer pointing to files in S3. DuckLake treats the catalog as the primary storage tier for active data, with S3 serving as the archival tier for larger, colder datasets.

The mechanics are straightforward. When you insert fewer than 10 rows:

import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake;")  # DuckLake ships as a DuckDB extension
con.execute("ATTACH 'ducklake:sqlite:sensors.ducklake' AS lake (DATA_PATH 'sensor_data/')")
con.execute("CREATE TABLE IF NOT EXISTS lake.readings (sensor_id INTEGER, temperature DOUBLE, ts TIMESTAMP)")
con.execute("INSERT INTO lake.readings VALUES (1, 21.5, now())")

DuckLake creates zero Parquet files. Instead, it appends to an internal table in the catalog database with schema:

┌────────┬────────────────┬──────────────┬───────────┬─────────────┬─────────────────────┐
│ row_id │ begin_snapshot │ end_snapshot │ sensor_id │ temperature │         ts          │
│ int64  │     int64      │    int64     │   int32   │   double    │      timestamp      │
├────────┼────────────────┼──────────────┼───────────┼─────────────┼─────────────────────┤
│      0 │              2 │         NULL │         1 │        21.5 │ 2025-03-27 10:00:00 │
└────────┴────────────────┴──────────────┴───────────┴─────────────┴─────────────────────┘

begin_snapshot and end_snapshot columns maintain time-travel semantics without requiring separate metadata files. Deletions are handled by updating the end_snapshot column rather than writing deletion vectors or rewriting files. When the buffer finally flushes, either manually via CALL ducklake_flush_inlined_data() or during checkpoint operations, it consolidates everything into a single, properly sized Parquet file.
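
The snapshot columns boil down to a simple visibility rule, sketched here in plain Python as a conceptual model (not DuckLake's actual implementation, and the exact boundary semantics are an assumption): a row is visible at snapshot S if it was created at or before S and not yet deleted as of S.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InlinedRow:
    row_id: int
    begin_snapshot: int
    end_snapshot: Optional[int]  # None models SQL NULL: row is still live
    payload: tuple

def visible_at(row: InlinedRow, snapshot: int) -> bool:
    """Time travel: the row exists at `snapshot` if it was created at or
    before it and has not been end-stamped yet."""
    return (row.begin_snapshot <= snapshot and
            (row.end_snapshot is None or snapshot < row.end_snapshot))

def soft_delete(row: InlinedRow, snapshot: int) -> None:
    """A delete just stamps end_snapshot -- no deletion vectors,
    no file rewrites."""
    row.end_snapshot = snapshot

row = InlinedRow(0, begin_snapshot=2, end_snapshot=None, payload=(1, 21.5))
assert visible_at(row, 2) and visible_at(row, 5)
soft_delete(row, snapshot=6)
assert visible_at(row, 5)      # still there when time-traveling to snapshot 5
assert not visible_at(row, 6)  # gone from snapshot 6 onward
```

Because both the data and the snapshot stamps live in one transactional database, a delete is a single-column UPDATE rather than a new metadata file.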

Figure: Iceberg vs. DuckLake after 100 single-row inserts. Iceberg creates 100 Parquet files and 100 metadata snapshots; DuckLake stores all data inlined in the catalog with zero Parquet files, then consolidates everything into a single Parquet file on flush.

Deconstructing the 926× Claim

The benchmark numbers are staggering, but they reveal the specific pain point DuckLake solves rather than universal superiority. In the autonomous car sensor simulation (300,000 rows across 30,000 batches over 50 minutes), the aggregation query performance tells the story:

| Step          | Iceberg (Polaris) | DuckLake with inlining | Improvement |
|---------------|-------------------|------------------------|-------------|
| Insert        | 1,148.77 s        | 10.88 s                | 105×        |
| Aggregation   | 83.06 s           | 0.09 s                 | 923×        |
| Checkpointing | 52.83 s           | 0.28 s                 | 189×        |

The 923× speedup on aggregations isn’t magic, it’s the difference between opening 30,000 individual Parquet files on S3 versus executing a SQL query against a PostgreSQL table that happens to be sitting on a fast NVMe drive in the same VPC. When DuckLake runs aggregations, it’s not reading from a data lake, it’s reading from a database, which has been optimizing point queries for decades.
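
A back-of-envelope model shows why the gap is this large. The per-object latency below is an illustrative assumption for S3 request overhead, not a measured number:

```python
# Rough cost model: per-object request overhead dominates tiny-file scans.
S3_OVERHEAD_S = 0.0025   # assumed LIST/GET latency per small object (illustrative)
files = 30_000           # one Parquet file per micro-batch in the benchmark

scan_time_s = files * S3_OVERHEAD_S
print(f"per-file overhead alone: ~{scan_time_s:.0f} s")  # ~75 s

# That fixed overhead alone lands in the same ballpark as Iceberg's
# measured 83.06 s aggregation, before any bytes are actually scanned.
# A single query against a local catalog table pays no per-file cost.
```

Under this model the speedup is almost entirely the elimination of per-object round trips, which is why it scales with the number of micro-batches rather than with data volume.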

Against standard DuckLake without inlining, the results are less dramatic for ingestion but still significant: 5.2× faster inserts and 925.9× faster aggregations. The insertion gains come from avoiding the round-trip to S3 for every micro-batch; the aggregation gains come from eliminating the metadata traversal tax.

But here’s the critical caveat: these gains evaporate if your workload isn’t streaming micro-batches. If you’re doing bulk loads of 100,000+ rows, traditional Iceberg performs comparably because it was designed for batch analytics, not event streaming. The 900x figure is real, but it’s a measure of how badly unsuited traditional lakes are for high-frequency small writes, not necessarily how fast DuckLake is in absolute terms.

The Architecture Tradeoffs Nobody Talks About

Moving data into the catalog database isn’t free, it just moves the costs around. DuckLake’s approach introduces new constraints that the benchmarks don’t highlight:

Catalog Capacity Becomes a Bottleneck. PostgreSQL is excellent at transactional workloads, but it wasn’t designed to hold terabytes of sensor data indefinitely. The inlining threshold (default 10 rows) acts as a safety valve, but if you have thousands of tables receiving constant micro-writes, your catalog database will bloat. DuckLake addresses this with periodic flushing, but you’re now managing database capacity in addition to object storage.
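
The safety-valve behavior can be modeled as a simple threshold buffer. This is a conceptual sketch of the flush policy, assuming the default 10-row limit; `flush()` here stands in for DuckLake's `CALL ducklake_flush_inlined_data()`:

```python
class InlineBuffer:
    """Conceptual model of catalog-side inlining with a row threshold."""

    def __init__(self, row_limit: int = 10):
        self.row_limit = row_limit
        self.inlined_rows = []    # rows living in the catalog database
        self.parquet_files = []   # files persisted to object storage

    def insert(self, rows):
        if len(rows) < self.row_limit:
            self.inlined_rows.extend(rows)         # small write: stays in catalog
        else:
            self.parquet_files.append(list(rows))  # big write: straight to Parquet

    def flush(self):
        """Consolidate all inlined rows into one properly sized file."""
        if self.inlined_rows:
            self.parquet_files.append(self.inlined_rows)
            self.inlined_rows = []

buf = InlineBuffer()
for i in range(100):
    buf.insert([(i, 21.5)])          # 100 single-row inserts
assert len(buf.parquet_files) == 0   # zero Parquet files while buffering
buf.flush()
assert len(buf.parquet_files) == 1   # one consolidated file after flush
```

Note what the model also shows: until `flush()` runs, all 100 rows sit in the catalog, which is exactly the bloat risk described above when thousands of tables buffer simultaneously.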

Concurrency Control Shifts to the Database. Iceberg uses optimistic concurrency control with file-level atomic swaps, which allows multiple writers to S3 without a coordination service (though REST catalogs like Polaris change this). DuckLake pushes all concurrency concerns into the catalog database. This is fine if you’re using PostgreSQL with proper isolation levels, but it means your lake’s write throughput is now capped by your database’s transaction processing capacity, not your object storage’s bandwidth.

Cold Start Latency. Querying inlined data requires a connection to the catalog database. If you’re using DuckLake with SQLite for edge deployments, this is trivial. But if your catalog is RDS PostgreSQL, you’ve introduced a network hop that doesn’t exist in file-based metadata systems where clients can cache manifests locally.

These aren’t fatal flaws, but they represent a shift in operational complexity. You’re trading S3 costs and file management for database administration and connection pooling. For teams already running embedded analytics databases in production, this feels natural. For teams invested in serverless query engines that expect S3-as-source-of-truth, it’s a paradigm shift.

The Iceberg Interop Reality

DuckLake’s authors are quick to point out that this isn’t an attack on Iceberg, it’s an evolution. DuckLake uses the same Parquet storage format and supports schema evolution and time travel. It even offers metadata-only migration paths from Iceberg tables, suggesting a future where these formats coexist rather than compete.

But the interoperability story has friction. Iceberg’s ecosystem (Spark, Trino, Flink, Snowflake) is vast and battle-tested. DuckLake currently has first-class support in DuckDB, with Spark and DataFusion implementations in various alpha states. If you adopt DuckLake for the streaming ingestion speed but need to query the data with Spark for heavy ETL, you’re betting on the Spark connector’s maturity.

The MotherDuck ecosystem offers a managed path around this, though recent cloud analytics platform pricing changes have made the fully-managed route more expensive for small teams. Self-hosting DuckLake with PostgreSQL remains viable for cost-conscious users, but requires accepting the operational burden of database maintenance.

Is It Worth the Switch?

The 900x number is technically accurate for the specific torture test of high-frequency streaming inserts, but it’s misleading as a general performance claim. What DuckLake actually delivers is architectural appropriateness: using a database for what databases do well (fast, transactional small writes) and object storage for what it does well (cheap, immutable bulk storage).

If your data pipeline involves streaming events, IoT sensor data, or CDC (Change Data Capture) with high frequency and low latency requirements, DuckLake’s inlining eliminates the Kafka-to-Lake buffering complexity that currently requires Flink or Spark Streaming. You can write directly to the lake without fear of the small files problem, because the files don’t exist yet.

If you’re running traditional batch ETL (hourly or daily loads of millions of rows), the benefits diminish. Iceberg’s mature ecosystem, broader engine support, and proven compaction strategies remain the safer bet for pure analytics workloads.

DuckLake isn’t revolutionizing query execution, it’s revolutionizing write buffering. And in a world where “streaming-first” is becoming the default rather than the exception, that might be enough to justify the hype, if not quite the full 900x.
