The Postgres Betrayal: When Your Database Becomes a Data Lakehouse

Snowflake just open-sourced pg_lake, letting Postgres talk directly to Iceberg tables. The SQL you know now touches petabytes.
November 5, 2025

For decades, the separation was sacred: transactional databases handled business logic while data warehouses handled analytics. But Snowflake’s open-sourcing of pg_lake just declared war on that architecture. PostgreSQL, the world’s most beloved open-source database, can now directly query and manage Apache Iceberg tables in your data lake. The implications are seismic.

The Architecture Shift You Didn’t See Coming

The fundamental design of pg_lake is both elegant and borderline audacious. Rather than forcing Postgres to become something it’s not, the extension delegates heavy lifting to DuckDB through a clever wire protocol implementation.

When you run CREATE TABLE iceberg_test USING iceberg in Postgres, you’re not just creating another table; you’re initiating a distributed dance between Postgres’ transaction management and DuckDB’s columnar execution engine. Postgres handles the ACID transactions and SQL planning, while pgduck_server processes the actual data scanning and computation.

[Figure: pg_lake architecture]

The architecture separates concerns beautifully: Postgres remains the familiar SQL interface while DuckDB handles the parallel execution that Postgres processes were never designed for. This hybrid approach sidesteps Postgres’ threading limitations while leveraging its battle-tested transaction model.

Real-World Code That Actually Works

The pg_lake documentation shows this isn’t just theoretical. Here’s the exact pattern developers can implement today:

-- Create an Iceberg table directly from Postgres
CREATE TABLE iceberg_test USING iceberg
AS SELECT i as key, 'val_' || i as val FROM generate_series(0,99) i;

-- Query it like any other Postgres table
SELECT count(*) FROM iceberg_test;

-- Export query results directly to S3
COPY (SELECT * FROM iceberg_test)
TO 's3://testbucketpglake/parquet_data/iceberg_test.parquet';

The foreign data wrapper implementation is equally compelling:

-- Create foreign table from S3 Parquet files with automatic schema inference
CREATE FOREIGN TABLE parquet_table()
SERVER pg_lake
OPTIONS (path 's3://testbucketpglake/parquet_data/*.parquet');

This eliminates the traditional ETL pipeline bottleneck. No more waiting for data engineers to transform and load data: developers can query petabytes of lake data directly from their familiar Postgres terminal.
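As a quick illustration of what that looks like, the foreign table defined above can be queried with ordinary SQL. The key and val columns assume the schema inferred from the Parquet files exported by the earlier COPY example; in practice the columns are whatever schema inference finds in your files:

-- Aggregate directly over the Parquet files in S3; the (key, val) schema
-- is assumed from the files written by the earlier COPY example
SELECT count(*), min(key), max(key)
FROM parquet_table;

-- Filter and project like any local table; the scan itself runs in DuckDB
SELECT key, val
FROM parquet_table
WHERE key < 10
ORDER BY key;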

The Modular Design That Makes It Scalable

What makes pg_lake particularly interesting is its modular architecture. Instead of creating another monolithic extension, the team applied lessons from previous Postgres extension development (including Citus and pg_cron) to build a composable system:

  • pg_lake_iceberg: Implements the full Iceberg specification
  • pg_lake_table: Foreign data wrapper for object storage files
  • pg_lake_copy: COPY to/from data lake operations
  • pg_lake_engine: Common module for different extensions
  • pgduck_server: Separate process handling DuckDB execution

This componentization means teams can use just what they need rather than adopting an all-or-nothing platform. The approach recognizes that data teams have diverse requirements and existing infrastructure investments.
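If each module is packaged as a regular Postgres extension under the names listed above (an assumption worth checking against the project’s installation docs), selective adoption would look like a couple of CREATE EXTENSION statements:

-- Hypothetical selective install, assuming each module ships as a standard
-- Postgres extension under the names listed above
CREATE EXTENSION pg_lake_engine;   -- common module used by the other extensions
CREATE EXTENSION pg_lake_iceberg;  -- Iceberg table support only
-- Skip pg_lake_table and pg_lake_copy if you never read raw files or COPY to S3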

Why This Changes Everything for Data Teams

The traditional data pipeline has been a source of constant friction. Business logic lives in Postgres applications, while analytics happens in separate systems like Snowflake, Databricks, or BigQuery. Moving data between these systems introduces latency, complexity, and governance headaches.

With pg_lake, developers can query petabytes of data lake information directly from their application code. Data scientists can use their existing Postgres skills to analyze Iceberg tables without learning new query languages. Operations teams maintain one security model instead of multiple.
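As a sketch of what that unification enables, consider joining an ordinary Postgres table against the Iceberg table created earlier. The orders table here is hypothetical; iceberg_test comes from the first example:

-- Hypothetical operational table living in regular Postgres storage
CREATE TABLE orders (
    order_id bigint PRIMARY KEY,
    key      int,
    amount   numeric
);

-- Join transactional rows against the Iceberg table from the first example;
-- Postgres plans the query while pgduck_server handles the lake-side scan
SELECT o.order_id, o.amount, t.val
FROM orders o
JOIN iceberg_test t ON t.key = o.key
WHERE o.amount > 100;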

Snowflake’s announcement specifically addresses this unification: “The separation of transactional data in Postgres from analytical data has long been a major architectural roadblock for enterprises, forcing costly data movement and preventing real-time data access for apps and AI agents.” pg_lake aims to eliminate that separation entirely.

The Performance Trade-Offs You Need to Understand

This architectural elegance comes with performance considerations. Delegating computation to DuckDB introduces network overhead between Postgres and pgduck_server. While this works well for analytical workloads scanning large datasets, it might not suit high-frequency transactional operations, as the short comparison after the lists below illustrates.

The system excels at:

  • Analytical queries across large Iceberg tables
  • Data import/export operations to object storage
  • JOIN operations between Postgres tables and lake data

But it introduces latency for:

  • High-frequency point lookups
  • Real-time transactional workloads
  • Applications requiring sub-millisecond response times
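To make the contrast concrete, compare two queries against the iceberg_test table from the first example. Both are delegated to DuckDB through pgduck_server, but only the second does enough work to amortize the round trip:

-- A single-row lookup pays the full Postgres-to-pgduck_server round trip
SELECT val FROM iceberg_test WHERE key = 42;

-- A scan-heavy aggregate spreads that same overhead across the whole table
SELECT count(*), avg(key) FROM iceberg_test;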

The approach acknowledges that most data value comes from analytical processing rather than transactional speed, a pragmatic trade-off for the lakehouse use case.

Where This Fits in the Broader Ecosystem

Snowflake’s timing here is strategic. With their acquisition of Crunchy Data and subsequent open-sourcing of pg_lake, they’re positioning PostgreSQL as the universal data access layer. This isn’t just about making Postgres better; it’s about creating bridges between competing data platforms.

According to Snowflake’s BUILD 2025 announcements, they’re embracing open standards like Apache Iceberg REST Catalog and Apache Polaris, recognizing that enterprises won’t tolerate vendor lock-in in the AI era. pg_lake becomes the gateway drug: use Postgres for everything, then optionally leverage Snowflake’s platform when you need enterprise features.

The Practical Implementation Reality

Getting started requires understanding the deployment options. For development work, the Docker setup provides a quick start. Production deployments need consideration of:

  • Memory Management: pgduck_server defaults to 80% of system memory; adjust this based on workload
  • Credentials: Leverages DuckDB’s secrets manager for S3/GCP access (see the sketch after this list)
  • Network: Unix domain socket communication between Postgres and pgduck_server
  • Caching: Remote file caching configuration for performance optimization
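For the credentials bullet, DuckDB’s native secrets manager is configured with CREATE SECRET statements like the sketch below. Exactly how pg_lake surfaces this from the Postgres side should be confirmed against the project’s docs, but the shape of the configuration is the same (all values here are placeholders):

-- DuckDB-style S3 secret with placeholder values; pg_lake delegates
-- object-storage credentials to DuckDB's secrets manager
CREATE SECRET lake_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET 'replace-me',
    REGION 'us-east-1'
);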

The modular design means teams can deploy just the components they need. If you’re only using Iceberg tables, you don’t need the full COPY functionality. If you’re focused on file ingestion, you can ignore the Iceberg components.

The Bigger Picture: Unification Over Revolution

What makes pg_lake compelling isn’t just the technology; it’s the philosophical shift. Instead of forcing everyone onto a new platform, it extends existing tools to handle new workloads. Developers keep their Postgres skills, applications maintain compatibility, and enterprises avoid yet another platform migration.

As enterprises struggle with AI data readiness (80% of IT leaders cite data silos as the primary obstacle), tools like pg_lake bridge the gap between operational and analytical systems. They’re not building a better warehouse; they’re making the warehouse accessible through the tools teams already use.

The real impact might be organizational. When developers can directly query petabytes of analytics data, the traditional boundaries between application development and data engineering start to blur. That organizational shift could be more transformative than any technical innovation.

The pg_lake project represents a mature approach to data architecture: build bridges rather than walls, extend rather than replace, and recognize that most data value comes from accessibility rather than technological perfection. For teams tired of data pipelines and platform migrations, that’s a revolution worth paying attention to.
