
The Death of Data Partitioning? Clustering Takes Over in Modern Data Lakes
Emerging shift from traditional data partitioning to clustering techniques in Delta Lake, Iceberg, Snowflake, and Databricks, and what it means for data engineering best practices.
The data engineering playbook is being rewritten in real time, and partitioning, the foundational concept we’ve all relied on for decades, might be on its way out. Traditional Hive-style partitioning, with its rigid folder structures, is facing extinction, replaced by more flexible clustering techniques that promise to eliminate some of our biggest data layout headaches.
The Partitioning Predicament: Where Traditional Data Layout Falls Short
Remember the pain of dt=YYYY-MM-DD folder hierarchies? Partitioning has served us well, but its limitations have become glaringly obvious in modern data environments:
- Schema rigidity: Once you commit to partition keys, changing them requires massive data rewrites
- Small file problems: Poorly chosen partitions can lead to thousands of tiny files
- Cardinality nightmare: High-cardinality columns become partitioning disasters
- Skewed performance: Uneven data distribution can negate partitioning benefits
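For concreteness, the classic pattern looks like this (a minimal sketch in Databricks SQL; the events table and its columns are hypothetical):

```sql
-- Classic Hive-style partitioning: the partition column becomes a folder level.
CREATE TABLE events (
  user_id STRING,
  action  STRING,
  dt      DATE
)
USING DELTA
PARTITIONED BY (dt);

-- On disk this yields the familiar rigid hierarchy:
--   .../events/dt=2024-06-01/part-00000.parquet
--   .../events/dt=2024-06-02/part-00000.parquet
-- Repartitioning on a different key later means rewriting the entire table.
```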
The industry sentiment has shifted dramatically. As developers on technical forums note, major platforms like Snowflake and Databricks have moved away from partitioning in favor of clustering approaches. Databricks now explicitly recommends clustering as the best method and advises avoiding partitioning entirely.
Liquid Clustering: The Game Changer from Databricks
Databricks Liquid Clustering represents the industry’s most sophisticated approach to this shift. As their documentation explains, “Liquid clustering replaces table partitioning and ZORDER to simplify data layout decisions and optimize query performance. It provides the flexibility to redefine clustering keys without rewriting existing data, allowing data layout to evolve alongside analytic needs over time.”
The key innovation? Virtual partitioning. Instead of physically co-locating files in nested folders (partitions), Liquid Clustering records in a corresponding metadata JSON file that a given Parquet file belongs to cluster (partition) X.
This enables revolutionary flexibility: you can change your clustering scheme without rewriting the data. As one technical discussion highlighted, the writer also controls file sizes, so even if you design your clustering scheme poorly, clustering takes care of the small-file problem.
Implementation Reality: How Clustering Actually Works
Setting up Liquid Clustering in Databricks is straightforward:
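A minimal sketch in Databricks SQL, using a hypothetical events table:

```sql
-- Create a Delta table with Liquid Clustering instead of partitions.
CREATE TABLE events (
  user_id STRING,
  action  STRING,
  dt      DATE
)
CLUSTER BY (dt, user_id);

-- Clustering keys can be redefined later without rewriting existing data.
ALTER TABLE events CLUSTER BY (action, dt);

-- Clustering is applied incrementally by the usual maintenance command.
OPTIMIZE events;
```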
The size thresholds that trigger clustering on write vary by table type and the number of clustering columns:
| Number of clustering columns | Threshold size for Unity Catalog managed tables | Threshold size for other Delta tables |
|---|---|---|
| 1 | 64 MB | 256 MB |
| 2 | 256 MB | 1 GB |
| 3 | 512 MB | 2 GB |
| 4 | 1 GB | 4 GB |
But automatic clustering takes this even further. In Databricks Runtime 15.4 LTS and above, CLUSTER BY AUTO lets the platform intelligently choose clustering keys based on historical query patterns:
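A sketch of both forms, again with hypothetical table and column names:

```sql
-- Let the platform pick clustering keys from historical query patterns.
CREATE TABLE events_auto (
  user_id STRING,
  action  STRING,
  dt      DATE
)
CLUSTER BY AUTO;

-- Existing Liquid Clustering tables can opt in the same way.
ALTER TABLE events CLUSTER BY AUTO;
```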
The ZORDER Connection: What Snowflake and Databricks Do Differently
So what’s different about how major platforms implement clustering compared to vanilla Spark? ZORDER-style multi-dimensional sorting is the technique at clustering’s core, but the platforms have built sophisticated automation around it.
Developers have asked exactly this question: “ZORDER is basically the clustering technique, but what does Snowflake or Databricks do differently that avoids partitioning entirely?” The answer lies in the metadata management and automated optimization layers that these platforms provide.
The clustering approach fundamentally changes how we think about data skipping. Instead of relying on partition elimination through folder structures, clustering enables data skipping through per-file metadata about clustering key ranges.
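As an illustrative example (the same hypothetical events table, clustered by dt and user_id), a selective filter on the clustering keys lets the engine skip every file whose min/max range cannot match:

```sql
-- Reads only files whose per-file min/max metadata overlaps the predicate;
-- no folder-based partition elimination is involved.
SELECT action, count(*) AS cnt
FROM events
WHERE dt = DATE'2024-06-01'
  AND user_id = 'u-12345'
GROUP BY action;
```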
The Performance Trade-Offs: When Partitioning Still Makes Sense
Despite the hype, partitioning hasn’t disappeared entirely. The conventional wisdom is that if you know what you’re doing, partitioning can still be effective, and at very large scale it can be the better choice for certain scenarios.
Performance optimization experts caution that “you should be careful on the costs of repeatedly running ZORDER on a large table. It makes sense for some columns in some tables, but you have to have enough people hitting that table often enough to justify all the time you spend on zordering the table.”
One practical compromise many teams are adopting: “If you have big enough data, partitioning on date and running ZORDER within each date can be a good compromise because you only have to run ZORDER once per date, and only on the data in that partition.”
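In Delta Lake terms the compromise looks roughly like this (a sketch; OPTIMIZE accepts a WHERE clause only on partition columns):

```sql
-- Table partitioned by dt; ZORDER runs once, and only on the newest partition.
OPTIMIZE events
WHERE dt = '2024-06-01'
ZORDER BY (user_id);
```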
Concurrent Updates and Partitioning’s Last Stand
There’s one area where traditional partitioning still holds a significant advantage: concurrent updates. As noted in technical discussions, the main reason to specify partitions explicitly is to support concurrent updates of a table, where each concurrent update targets a different partition.
This becomes crucial in scenarios where multiple processes are writing to different portions of a dataset simultaneously. Partition boundaries provide natural isolation mechanisms that clustering metadata doesn’t inherently provide in the same way.
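A sketch of the pattern on a date-partitioned Delta table, using Databricks SQL’s REPLACE WHERE (table names hypothetical): because the two writers touch disjoint partitions, neither transaction conflicts with the other.

```sql
-- Writer A: overwrite only the 2024-06-01 partition.
INSERT INTO events REPLACE WHERE dt = '2024-06-01'
SELECT * FROM staging_events WHERE dt = '2024-06-01';

-- Writer B, running concurrently: a different partition, so no write conflict.
INSERT INTO events REPLACE WHERE dt = '2024-06-02'
SELECT * FROM staging_events WHERE dt = '2024-06-02';
```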
The Clustering Sweet Spot: Where It Shines
Databricks documentation outlines specific scenarios where clustering provides maximum benefit:
- Tables often filtered by high cardinality columns
- Tables with significant skew in data distribution
- Tables that grow quickly and require maintenance and tuning effort
- Tables with concurrent write requirements
- Tables with access patterns that change over time
- Tables where typical partition keys could leave too many or too few partitions
This flexibility makes clustering particularly valuable in modern data environments where query patterns evolve and data characteristics change rapidly.
Implementation Best Practices: Beyond the Hype
Choosing clustering keys requires careful consideration. The Databricks guidance recommends:
- Selecting keys based on columns most frequently used in query filters
- Using at most four clustering keys
- Avoiding highly correlated columns
- For smaller tables (<10TB), fewer clustering keys typically perform better
The migration path from traditional approaches is straightforward:
| Current data optimization technique | Recommendation for clustering keys |
|---|---|
| Hive-style partitioning | Use partition columns as clustering keys |
| Z-order indexing | Use the ZORDER BY columns as clustering keys |
| Hive-style partitioning and Z-order | Use both partition columns and ZORDER BY columns as clustering keys |
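Concretely, the switch is a metadata operation for unpartitioned tables, while Hive-partitioned tables typically need a one-time rewrite (a sketch; OPTIMIZE ... FULL, available on recent runtimes, reclusters historical data and can be expensive):

```sql
-- Unpartitioned table previously maintained with ZORDER: metadata-only switch.
ALTER TABLE events CLUSTER BY (dt, user_id);
OPTIMIZE events FULL;  -- optional: recluster already-written data

-- Hive-partitioned table: rewrite once into a clustered copy.
CREATE TABLE events_clustered
CLUSTER BY (dt, user_id)
AS SELECT * FROM events_partitioned;
```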
The Future of Data Layout: Towards Autonomous Optimization
The evolution continues with automatic liquid clustering representing the next frontier. When enabled, “Databricks intelligently chooses clustering keys to optimize query performance. Key selection and clustering operations run asynchronously as a maintenance operation.”
This moves data optimization toward autonomous systems that adapt to changing usage patterns without human intervention. The system analyzes “historical query workload and identifies the best candidate columns” and changes clustering keys when “the predicted cost savings from data skipping improvements outweigh the data clustering cost.”
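To see which keys the system is currently using, the table metadata can be inspected (a sketch; the clusteringColumns field appears in DESCRIBE DETAIL output for Liquid Clustering tables):

```sql
-- Check the clustering keys in effect, including any AUTO-selected ones.
DESCRIBE DETAIL events;  -- inspect the clusteringColumns column
```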
The Bottom Line: Partitioning Isn’t Dead, But It’s Mortally Wounded
Partitioning won’t disappear from the data engineering vocabulary overnight, but its dominance as the go-to optimization strategy is ending. As Databricks positions Liquid Clustering as the best method and advises avoiding partitioning, and other major platforms follow similar paths, the industry shift is undeniable.
For new implementations, clustering should be the default choice. For existing systems, the migration path depends on specific use cases, with automatic clustering offering the lowest-maintenance future-proofing.
The era of manually managing data layouts through rigid partitioning schemes is drawing to a close. The future belongs to adaptive, metadata-driven clustering that can evolve alongside your data and query patterns, whether you call it Liquid Clustering in Databricks, automatic clustering in Snowflake, or the next generation of optimization techniques emerging across the data ecosystem.



