Normalization is a Trap: Why ClickHouse Wants Your Data Wide and Flat

You spent years learning to normalize your database. Third normal form was drilled into you as the one true path. Then you hit ClickHouse, or any serious OLAP engine, and everything you know is suddenly wrong.

The database that processes 100 million rows in under 100 milliseconds doesn’t want your normalized, join-heavy schema. It wants wide, flat, denormalized tables that would make your DBA instructor weep.

This isn’t a bug. It’s a fundamental architectural choice rooted in how columnar engines actually execute queries.

The Columnar Revolution That Broke the Rules

Normalization is a trap for OLAP workloads.

Let’s be clear: normalization exists for a good reason. In OLTP systems like PostgreSQL or MySQL, splitting data into multiple tables prevents anomalies, ensures consistency, and makes single-row writes cheap and safe.

But ClickHouse, BigQuery, Snowflake, and DuckDB don’t store data by rows. They store data by columns.

This single design decision cascades into a completely different set of performance characteristics:

Aspect	Row-oriented	Columnar
Storage layout	Tuples packed into pages	One file per column
Best workload	OLTP, point lookups	OLAP, aggregations, bulk scans
Compression	1.5, 3× typical	5, 10× typical, up to 30×
Read I/O on wide tables	Reads all columns	Reads only selected columns
Write path	Single-row insert/update	Bulk-append optimized

The columnar layout means a query that touches 3 columns out of 50 reads roughly 6% of the data a row store would need. That’s not incremental improvement, that’s a different class of performance entirely.

But here’s where it gets spicy: you throw away that advantage the moment you introduce joins.

Why Joins Are the Hidden Performance Killer in ClickHouse

ClickHouse’s execution engine is optimized for scanning and aggregating columns, not for stitching tables together. Every join in a columnar database forces it to:

Read the join key columns from both tables
Build hash tables or sort both sides
Materialize the combined result

This process bypasses the core columnar advantage, reading only what you need, because the join engine must touch multiple columns across multiple tables simultaneously.

The ClickHouse team has made real progress here. Recent versions improved join performance by over 26x through better statistics, optimizer improvements, spill-to-disk, and new algorithms. But the fundamental constraint remains: the fastest join is the one you don’t execute.

This is where the community gets heated. One school of thought, represented by Armend Avdijaj’s article on GlassFlow’s blog, argues that you shouldn’t denormalize because ClickHouse now supports joins well enough. A ClickHouse employee even responded to the Reddit discussion of that post saying they “personally don’t agree with the claims it makes.”

But the data tells a different story.

The Case Study That Proves the Point

An e-commerce company implemented ClickHouse with a normalized schema, separate tables for orders, customers, and products. Their dashboards required joining all three for every query. Performance was mediocre.

They denormalized into a single order_details table. Storage grew. Compression ratios worsened. Updates became painful.

But their query performance jumped dramatically.

The financial analytics platform that chose a different path, pre-aggregation via materialized views rather than full denormalization, got similar performance gains without the storage bloat.

The lesson isn’t that one approach is universally correct. It’s that the optimal strategy depends on your specific update patterns and query characteristics.

When Denormalization Wins (and When It Kills You)

Denormalize When:

Your data is append-only or mostly immutable. Logs, events, time-series data, and clickstreams rarely need updates. A single wide table with denormalized dimensions eliminates join overhead entirely, and the lack of updates means you’ll never pay the “rewrite the whole table” penalty.

Storage is cheap and query speed is paramount. ClickHouse’s compression algorithms are aggressive. The “wasted space” from denormalization is smaller than you think, ZSTD compression on low-cardinality columns can hit 30x ratios. Storage costs are dropping. Your analysts’ time isn’t.

Your access patterns are predictable. If every dashboard query touches the same columns, a wide table with those columns materialized together will always outperform normalized tables with joins.

Keep it Normalized When:

Data changes frequently. Updating a product name in a denormalized table requires rewriting every row that references that product. In a normalized schema, you update one row in one table. ClickHouse is append-optimized, the fundamental tension between normalized integrity and denormalized performance becomes existential when updates are frequent.

You need strong data consistency guarantees. Denormalization introduces duplication risk. If your product catalog updates through one pipeline and your order data through another, you’ll inevitably experience drift.

Your schema has complex relationships. Many-to-many associations and hierarchies are painful to model in wide tables. Challenging the Kimball star schema orthodoxy for analytical workloads suggests that even the dimensional modeling standard isn’t always right.

The ClickHouse Toolkit: Alternatives to Full Denormalization

Full denormalization is a blunt instrument. ClickHouse provides surgical alternatives:

Materialized Views

CREATE MATERIALIZED VIEW daily_visits_mv 
ENGINE = SummingMergeTree()
PARTITION BY toDate(time)
ORDER BY (date, country)
AS 
SELECT 
    toDate(time) AS date,
    country,
    count(*) AS visit_count
FROM visits
GROUP BY date, country;

This precomputes aggregations at insert time. Your raw data stays normalized, and queries against the materialized view are instant.

Dictionary Tables

Dictionaries are in-memory key-value maps for dimension lookups:

CREATE DICTIONARY user_dict (
    id UInt64,
    name String
) PRIMARY KEY id
SOURCE(CLICKHOUSE(HOST 'localhost' PORT 9000 USER 'default' TABLE 'users'))
LAYOUT(FLAT())
LIFETIME(3600);

SELECT 
    dictGet('user_dict', 'name', user_id) AS user_name,
    total_amount 
FROM orders;

This replaces a join with a memory lookup. For small dimension tables, it’s effectively free.

Projections

Projections are an elegant alternative, they let ClickHouse maintain alternative data layouts automatically:

ALTER TABLE sales ADD PROJECTION category_revenue (
    SELECT category, sum(amount) AS total_revenue 
    GROUP BY category
);

Now queries that match this pattern use the projection transparently. No schema changes. No maintenance.

The Real Trade-Off: Complexity vs. Performance

The GlassFlow article is right about one thing: denormalization introduces real costs. Table management becomes harder. Schema evolution is painful. Ingestion pipelines get more complex.

But the counterargument from the field is equally valid: analytical queries on denormalized schemas run 3-10x faster than their normalized equivalents in most real-world deployments.

The decision matrix looks like this:

Factor	Normalize	Denormalize
Query Performance	Slower (joins)	Faster (direct scan)
Storage Efficiency	High	Lower (compression helps)
Update Cost	Low	High (rewrite rows)
Data Integrity	High	Lower (duplication risk)
Schema Evolution	Easier	Harder
Best For	Frequent updates, complex relationships	Analytics, immutable data, speed

The Practical Path Forward

Don’t start with denormalization. Optimize your schema in stages:

Start normalized. It’s easier to reason about, easier to debug, and easier to evolve.
Profile your slow queries. Use system.query_log to identify the queries consuming the most resources.
Apply surgical optimizations. A materialized view for your most expensive aggregation. A dictionary for your most-hit dimension lookup. A projection for your most common filter pattern.
Only then, denormalize. If you still have a hot table that’s causing pain, merge its related dimensions. The performance gain will be immediate and dramatic.

The Bottom Line

The normalization vs. denormalization debate isn’t a religious war, it’s an engineering trade-off that depends on your specific workload.

ClickHouse, like all columnar databases, performs best when queries scan contiguous columns without joins. Whether you achieve that through denormalization, materialized views, or pre-aggregation depends on your update patterns, team capacity, and performance requirements.

For alternatives to real-time aggregation for analytical query performance, ClickHouse’s projection system deserves a close look. It gives you many of the benefits of denormalization without the schema complexity.

The engineers who get this right don’t dogmatically normalize or denormalize. They understand how their execution engine actually works, measure what’s happening, and choose the right tool for the job.

Sometimes that tool is a wide, flat table. And that’s okay.