
The Death of Data Partitioning? Clustering Takes Over in Modern Data Lakes
Emerging shift from traditional data partitioning to clustering techniques in Delta Lake, Iceberg, Snowflake, and Databricks, and what it means for data engineering best practices.
The data engineering playbook is being rewritten in real time, and partitioning, the foundational concept we’ve all relied on for decades, might be on its way out. Traditional Hive-style partitioning, with its rigid folder structures, is facing extinction, replaced by more flexible clustering techniques that promise to eliminate some of our biggest data layout headaches.
The Partitioning Predicament: Where Traditional Data Layout Falls Short
Remember the pain of dt=YYYY-MM-DD folder hierarchies? Partitioning has served us well, but its limitations have become glaringly obvious in modern data environments:
- Schema rigidity: Once you commit to partition keys, changing them requires massive data rewrites
- Small file problems: Poorly chosen partitions can lead to thousands of tiny files
- Cardinality nightmare: High-cardinality columns become partitioning disasters
- Skewed performance: Uneven data distribution can negate partitioning benefits
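For concreteness, the classic pattern looks like this (a minimal sketch in Databricks SQL; the events table and its columns are hypothetical):

```sql
-- Classic Hive-style partitioning: the partition column becomes a folder level.
CREATE TABLE events (
  user_id STRING,
  action  STRING,
  dt      DATE
)
USING DELTA
PARTITIONED BY (dt);

-- On disk this yields the familiar rigid hierarchy:
--   .../events/dt=2024-06-01/part-00000.parquet
--   .../events/dt=2024-06-02/part-00000.parquet
-- Repartitioning on a different key later means rewriting the entire table.
```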
The industry sentiment has shifted dramatically. As developers on technical forums note, major platforms like Snowflake and Databricks have moved away from partitioning in favor of clustering approaches. Databricks now explicitly recommends clustering as the best method and advises avoiding partitioning entirely.
Liquid Clustering: The Game Changer from Databricks
Databricks Liquid Clustering represents the industry’s most sophisticated approach to this shift. As their documentation explains, “Liquid clustering replaces table partitioning and ZORDER to simplify data layout decisions and optimize query performance. It provides the flexibility to redefine clustering keys without rewriting existing data, allowing data layout to evolve alongside analytic needs over time.”
The key innovation? Virtual partitioning. Instead of physically co-locating files in nested folders (partitions), Liquid Clustering records in a corresponding metadata JSON file that a given Parquet file belongs to cluster (partition) X.
This enables revolutionary flexibility: you can change your clustering scheme without rewriting the data. As one technical discussion highlighted, the writer also controls file sizes, so even if you design your clustering scheme poorly, clustering takes care of the small-file problem.
Implementation Reality: How Clustering Actually Works
Setting up Liquid Clustering in Databricks is straightforward:
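A minimal sketch in Databricks SQL, using a hypothetical events table:

```sql
-- Create a Delta table with Liquid Clustering instead of partitions.
CREATE TABLE events (
  user_id STRING,
  action  STRING,
  dt      DATE
)
CLUSTER BY (dt, user_id);

-- Clustering keys can be redefined later without rewriting existing data.
ALTER TABLE events CLUSTER BY (action, dt);

-- Clustering is applied incrementally by the usual maintenance command.
OPTIMIZE events;
```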
The size thresholds that trigger clustering on write vary by table type and the number of clustering columns:
| Number of clustering columns | Threshold size for Unity Catalog managed tables | Threshold size for other Delta tables |
|---|---|---|
| 1 | 64 MB | 256 MB |
| 2 | 256 MB | 1 GB |
| 3 | 512 MB | 2 GB |
| 4 | 1 GB | 4 GB |
But automatic clustering takes this even further. In Databricks Runtime 15.4 LTS and above, CLUSTER BY AUTO lets the platform intelligently choose clustering keys based on historical query patterns:
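A sketch of both forms, again with hypothetical table and column names:

```sql
-- Let the platform pick clustering keys from historical query patterns.
CREATE TABLE events_auto (
  user_id STRING,
  action  STRING,
  dt      DATE
)
CLUSTER BY AUTO;

-- Existing Liquid Clustering tables can opt in the same way.
ALTER TABLE events CLUSTER BY AUTO;
```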
The ZORDER Connection: What Snowflake and Databricks Do Differently
So what’s different about how major platforms implement clustering compared to vanilla Spark? ZORDER-style multi-dimensional sorting is the technique at clustering’s core, but the platforms have built sophisticated automation around it.
Developers have asked exactly this question: “ZORDER is basically the clustering technique, but what does Snowflake or Databricks do differently that avoids partitioning entirely?” The answer lies in the metadata management and automated optimization layers that these platforms provide.
The clustering approach fundamentally changes how we think about data skipping. Instead of relying on partition elimination through folder structures, clustering enables data skipping through per-file metadata about clustering key ranges.
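As an illustrative example (the same hypothetical events table, clustered by dt and user_id), a selective filter on the clustering keys lets the engine skip every file whose min/max range cannot match:

```sql
-- Reads only files whose per-file min/max metadata overlaps the predicate;
-- no folder-based partition elimination is involved.
SELECT action, count(*) AS cnt
FROM events
WHERE dt = DATE'2024-06-01'
  AND user_id = 'u-12345'
GROUP BY action;
```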
The Performance Trade-Offs: When Partitioning Still Makes Sense
Despite the hype, partitioning hasn’t disappeared entirely. The conventional wisdom is that if you know what you’re doing, partitioning can still be effective, and at very large scale it can be the better choice for certain scenarios.
Performance optimization experts caution that “you should be careful on the costs of repeatedly running ZORDER on a large table. It makes sense for some columns in some tables, but you have to have enough people hitting that table often enough to justify all the time you spend on zordering the table.”
One practical compromise many teams are adopting: “If you have big enough data, partitioning on date and running ZORDER within each date can be a good compromise because you only have to run ZORDER once per date, and only on the data in that partition.”
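In Delta Lake terms the compromise looks roughly like this (a sketch; OPTIMIZE accepts a WHERE clause only on partition columns):

```sql
-- Table partitioned by dt; ZORDER runs once, and only on the newest partition.
OPTIMIZE events
WHERE dt = '2024-06-01'
ZORDER BY (user_id);
```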
Concurrent Updates and Partitioning’s Last Stand
There’s one area where traditional partitioning still holds a significant advantage: concurrent updates. As noted in technical discussions, the main reason to specify partitions explicitly is to support concurrent updates of a table, where each concurrent update targets a different partition.
This becomes crucial in scenarios where multiple processes are writing to different portions of a dataset simultaneously. Partition boundaries provide natural isolation mechanisms that clustering metadata doesn’t inherently provide in the same way.
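A sketch of the pattern on a date-partitioned Delta table, using Databricks SQL’s REPLACE WHERE (table names hypothetical): because the two writers touch disjoint partitions, neither transaction conflicts with the other.

```sql
-- Writer A: overwrite only the 2024-06-01 partition.
INSERT INTO events REPLACE WHERE dt = '2024-06-01'
SELECT * FROM staging_events WHERE dt = '2024-06-01';

-- Writer B, running concurrently: a different partition, so no write conflict.
INSERT INTO events REPLACE WHERE dt = '2024-06-02'
SELECT * FROM staging_events WHERE dt = '2024-06-02';
```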
The Clustering Sweet Spot: Where It Shines
Databricks documentation outlines specific scenarios where clustering provides maximum benefit:
- Tables often filtered by high cardinality columns
- Tables with significant skew in data distribution
- Tables that grow quickly and require maintenance and tuning effort
- Tables with concurrent write requirements
- Tables with access patterns that change over time
- Tables where typical partition keys could leave too many or too few partitions
This flexibility makes clustering particularly valuable in modern data environments where query patterns evolve and data characteristics change rapidly.
Implementation Best Practices: Beyond the Hype
Choosing clustering keys requires careful consideration. The Databricks guidance recommends:
- Selecting keys based on columns most frequently used in query filters
- Using at most four clustering keys
- Avoiding highly correlated columns
- For smaller tables (<10TB), fewer clustering keys typically perform better
The migration path from traditional approaches is straightforward:
| Current data optimization technique | Recommendation for clustering keys |
|---|---|
| Hive-style partitioning | Use partition columns as clustering keys |
| Z-order indexing | Use the ZORDER BY columns as clustering keys |
| Hive-style partitioning and Z-order | Use both partition columns and ZORDER BY columns as clustering keys |
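Concretely, the switch is a metadata operation for unpartitioned tables, while Hive-partitioned tables typically need a one-time rewrite (a sketch; OPTIMIZE ... FULL, available on recent runtimes, reclusters historical data and can be expensive):

```sql
-- Unpartitioned table previously maintained with ZORDER: metadata-only switch.
ALTER TABLE events CLUSTER BY (dt, user_id);
OPTIMIZE events FULL;  -- optional: recluster already-written data

-- Hive-partitioned table: rewrite once into a clustered copy.
CREATE TABLE events_clustered
CLUSTER BY (dt, user_id)
AS SELECT * FROM events_partitioned;
```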
The Future of Data Layout: Towards Autonomous Optimization
The evolution continues with automatic liquid clustering representing the next frontier. When enabled, “Databricks intelligently chooses clustering keys to optimize query performance. Key selection and clustering operations run asynchronously as a maintenance operation.”
This moves data optimization toward autonomous systems that adapt to changing usage patterns without human intervention. The system analyzes “historical query workload and identifies the best candidate columns” and changes clustering keys when “the predicted cost savings from data skipping improvements outweigh the data clustering cost.”
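To see which keys the system is currently using, the table metadata can be inspected (a sketch; the clusteringColumns field appears in DESCRIBE DETAIL output for Liquid Clustering tables):

```sql
-- Check the clustering keys in effect, including any AUTO-selected ones.
DESCRIBE DETAIL events;  -- inspect the clusteringColumns column
```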
The Bottom Line: Partitioning Isn’t Dead, But It’s Mortally Wounded
Partitioning won’t disappear from the data engineering vocabulary overnight, but its dominance as the go-to optimization strategy is ending. As Databricks positions Liquid Clustering as the best method and advises avoiding partitioning, and other major platforms follow similar paths, the industry shift is undeniable.
For new implementations, clustering should be the default choice. For existing systems, the migration path depends on specific use cases, with automatic clustering offering the lowest-maintenance future-proofing.
The era of manually managing data layouts through rigid partitioning schemes is drawing to a close. The future belongs to adaptive, metadata-driven clustering that can evolve alongside your data and query patterns, whether you call it Liquid Clustering in Databricks, automatic clustering in Snowflake, or the next generation of optimization techniques emerging across the data ecosystem.



