You Migrated from Spark to dbt. Your Incremental Models Got Slower. Here's Why.

You Migrated from Spark to dbt. Your Incremental Models Got Slower. Here’s Why.

A real-world account of migrating from Spark-based ETL to dbt ELT on Redshift, highlighting the surprising performance drawbacks of Redshift Serverless for incremental models compared to provisioned Redshift.

You know that sinking feeling when you upgrade your infrastructure expecting your problems to vanish, only to discover a new, more confusing set of problems?

One data engineering team lived this reality. They ditched a painfully slow Spark-based ETL pipeline that took six hours to complete. They moved to a modern dbt ELT pipeline on Redshift and slashed runtime to 90 minutes on provisioned hardware. A win, right?

Then they migrated to Redshift Serverless, expecting an even bigger spike in performance. Instead, they got a paradox: their full refreshes were blazing fast, but their incremental models, the ones designed to be faster, were actually slower than a full reload.

The pipeline that was supposed to be optimized had become its own worst enemy.

The Migration Nobody Regretted (At First)

The original stack was a classic case of over-engineering. A Spark-based ETL process that took six hours to complete. The team had a strong Python and SQL background, but the Spark pipeline was built without a deep understanding of how to configure its distributed engine. As one experienced engineer put it, building a Spark pipeline without understanding shuffle, partitions, and memory tuning almost guarantees you’ll end up with code that “runs slowly and costs a lot to run.”

The switch to dbt Core was a breath of fresh air. dbt compiles Jinja-based SQL models into warehouse-ready queries, tracks lineage through a ref()-based dependency graph, and enforces build order. The team was suddenly working in their native language, SQL, rather than wrestling with Spark’s DataFrame API.

The results spoke for themselves:

Pipeline Type Runtime
Original Spark ETL ~6 hours
dbt on Provisioned Redshift ~1 hour 30 min
dbt on Redshift Serverless ~45 minutes (full refresh)

That’s an 87% reduction in runtime just by switching to ELT on a well-tuned warehouse. But the story doesn’t end there.

The Redshift Serverless Paradox

Here’s where things get strange. The team observed that on Redshift Serverless:

  • Full refreshes and models materialized as tables performed “extremely well.”
  • Incremental models, the ones that should only process new or changed data, actually performed worse than full refreshes.

This is the kind of behavior that makes you question your understanding of basic physics. Incremental models exist specifically to avoid reprocessing historic data. If a full refresh is faster, the incremental model is failing at its only job.

The root cause lies in how Redshift Serverless manages compute resources. Serverless architectures auto-scale and allocate compute on-demand. For large, sequential full-refresh operations, this is great, Redshift can throw all available resources at a single, predictable query pattern.

Incremental models, by contrast, are often small, frequent operations that involve MERGE or UPDATE statements. These operations require Redshift to:
1. Scan the target table to find matching rows.
2. Perform row-level modifications (which in Redshift’s columnar storage means deleting and re-inserting whole blocks).
3. Manage transaction logs and concurrency.

On provisioned Redshift, you pay for fixed compute capacity and can tune workload management to prioritize these operations. On Serverless, the compute layer abstracts this away, and the abstraction doesn’t always favor the incremental pattern. The cost-per-compute-time on Serverless is higher than provisioned, and the auto-scaling logic can over-provision or under-provision for these mixed workloads.

When Spark Actually Makes Sense

This isn’t a “dbt beats Spark” story. It’s a story about fit. The original Spark pipeline was poorly built, but that doesn’t invalidate the technology.

Apache Spark scores an 8.6/10 overall with a 9.2/10 for features, driven by its Catalyst Optimizer and unified batch/streaming runtime. It’s the #1 ranked tool for a reason. When properly configured, Spark can be faster than Redshift for complex transformations because it lets you define exactly how many compute nodes to use and how to partition your data.

The key trade-off is this:

  • Spark excels when you need a distributed engine for massive shuffles, complex joins, or ML feature engineering.
  • dbt on a warehouse excels when your data already lives in a SQL database and you want to enforce testing, documentation, and lineage.

There’s no universal winner, only the right tool for the right workload pattern.

The Silent Cost of Redshift Serverless

Beyond the performance paradox, there’s a financial reality that’s easy to overlook. The team noted that “Redshift Serverless is costing us more compared to provisioned.”

This isn’t an accident. Serverless pricing models are designed for elasticity, not efficiency. With provisioned Redshift, you pay for a fixed cluster 24/7, even when it’s idle. Serverless charges per compute time used. For sporadic workloads, this is cheaper. But for a pipeline that runs daily and processes significant data, the per-unit compute cost on Serverless often exceeds what you’d pay for sustained provisioned capacity.

The smartest approach might be a hybrid setup: use provisioned Redshift for predictable, high-volume loads, and route sporadic or exploratory workloads to Serverless for elasticity.

Managing large-scale database infrastructure is never as simple as picking the “newer” option.

What Actually Works: Practical Fixes for the Incremental Model Slowdown

If you’re stuck in this paradox, here’s where to start debugging:

1. Profile Your Merge Operations

Run an EXPLAIN on your incremental model’s SQL. Look for Nested Loop joins or Merge Join operations targeting the entire table. Redshift’s optimizer can make poor choices when the target table has no distribution keys aligned with the incremental source data.

2. Optimize Distribution and Sort Keys

Incremental models benefit wildly from proper DISTKEY and SORTKEY configuration. If your incremental model does a MERGE on user_id, make sure the target table has DISTKEY(user_id) and SORTKEY(updated_at). Without this, Redshift Serverless will shuffle data across nodes for every incremental run.

3. Consider Hardcoding a Refresh Threshold

If incremental models are slower than full refreshes for your data volume, just… stop using them for that model. Materialize as a table and run a full refresh. The team reported that their total run time on Serverless is still only 45 minutes, that’s likely acceptable for many workflows.

4. Test with Concurrency Scaling

Redshift Serverless has built-in concurrency scaling. If your incremental models are competing for resources with other queries, isolate them. Use separate user groups or schedule incremental models during low-query windows.

The Bigger Lesson: ELT Isn’t a Magic Bullet

The excitement around dbt and ELT is warranted, but it’s also dangerous. Moving transformation logic into the warehouse can create performance bottlenecks that are harder to diagnose than distributed engine issues.

Exploring the SQL vs Spark ecosystem trade-offs reveals a fundamental truth: both approaches have sharp edges. Spark can be a resource hog if untuned. dbt can create warehouse CPU spikes that throttle everything else.

The real skill isn’t picking the “best” framework, it’s knowing which failure mode you’re willing to live with.

The data engineer who posted this story is living the reality of modern data infrastructure. Their pipeline went from 6 hours to 45 minutes. That’s a win. But the fact that incremental models underperform full refreshes on Redshift Serverless is a cautionary tale that most blog posts won’t tell you.

Serverless isn’t always faster. Incremental isn’t always cheaper. And the tool that solves one set of problems will always create a new set.

The next time you’re tempted to chase the hype of “modernizing” your stack, ask yourself: what new, interesting failure mode am I about to discover?

Leah Kessler, whose research on compiling software ranks dbt at 7.8/10 overall with a feature score of 8.2/10, captures this perfectly: dbt’s incremental model patterns are powerful, but they “require warehouse literacy since most behavior depends on SQL engine semantics.” Translation: the tool is only as smart as the engineer tuning it.

Share:

Related Articles