The Postgres Betrayal: When Your Database Becomes a Data Lakehouse

Snowflake just open-sourced pg_lake, letting Postgres talk directly to Iceberg tables. The SQL you know now touches petabytes.
November 5, 2025

For decades, the separation was sacred: transactional databases handled business logic while data warehouses handled analytics. But Snowflake’s open-sourcing of pg_lake just declared war on that architecture. PostgreSQL, the world’s most beloved open-source database, can now directly query and manage Apache Iceberg tables in your data lake. The implications are seismic.

The Architecture Shift You Didn’t See Coming

The fundamental design of pg_lake is both elegant and borderline audacious. Rather than forcing Postgres to become something it’s not, the extension delegates heavy lifting to DuckDB through a clever wire protocol implementation.

When you run CREATE TABLE iceberg_test USING iceberg in Postgres, you’re not just creating another table; you’re initiating a distributed dance between Postgres’ transaction management and DuckDB’s columnar execution engine. Postgres handles the ACID transactions and SQL planning, while pgduck_server processes the actual data scanning and computation.

[Figure: pg_lake architecture]

The architecture separates concerns beautifully: Postgres remains the familiar SQL interface while DuckDB handles the parallel execution that Postgres processes were never designed for. This hybrid approach sidesteps Postgres’ threading limitations while leveraging its battle-tested transaction model.

Real-World Code That Actually Works

The pg_lake documentation shows this isn’t just theoretical. Here’s the exact pattern developers can implement today:

-- Create an Iceberg table directly from Postgres
CREATE TABLE iceberg_test USING iceberg
AS SELECT i as key, 'val_' || i as val FROM generate_series(0,99) i;

-- Query it like any other Postgres table
SELECT count(*) FROM iceberg_test;

-- Export query results directly to S3
COPY (SELECT * FROM iceberg_test)
TO 's3://testbucketpglake/parquet_data/iceberg_test.parquet';

The foreign data wrapper implementation is equally compelling:

-- Create foreign table from S3 Parquet files with automatic schema inference
CREATE FOREIGN TABLE parquet_table()
SERVER pg_lake
OPTIONS (path 's3://testbucketpglake/parquet_data/*.parquet');

This eliminates the traditional ETL pipeline bottleneck. No more waiting for data engineers to transform and load data: developers can query petabytes of lake data directly from their familiar Postgres terminal.
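As a quick illustration of what that looks like, the foreign table defined above can be queried with ordinary SQL. The key and val columns assume the schema inferred from the Parquet files exported by the earlier COPY example; in practice the columns are whatever schema inference finds in your files:

-- Aggregate directly over the Parquet files in S3; the (key, val) schema
-- is assumed from the files written by the earlier COPY example
SELECT count(*), min(key), max(key)
FROM parquet_table;

-- Filter and project like any local table; the scan itself runs in DuckDB
SELECT key, val
FROM parquet_table
WHERE key < 10
ORDER BY key;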

The Modular Design That Makes It Scalable

What makes pg_lake particularly interesting is its modular architecture. Instead of creating another monolithic extension, the team applied lessons from previous Postgres extension development (including Citus and pg_cron) to build a composable system:

  • pg_lake_iceberg: Implements the full Iceberg specification
  • pg_lake_table: Foreign data wrapper for object storage files
  • pg_lake_copy: COPY to/from data lake operations
  • pg_lake_engine: Common module for different extensions
  • pgduck_server: Separate process handling DuckDB execution

This componentization means teams can use just what they need rather than adopting an all-or-nothing platform. The approach recognizes that data teams have diverse requirements and existing infrastructure investments.
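If each module is packaged as a regular Postgres extension under the names listed above (an assumption worth checking against the project’s installation docs), selective adoption would look like a couple of CREATE EXTENSION statements:

-- Hypothetical selective install, assuming each module ships as a standard
-- Postgres extension under the names listed above
CREATE EXTENSION pg_lake_engine;   -- common module used by the other extensions
CREATE EXTENSION pg_lake_iceberg;  -- Iceberg table support only
-- Skip pg_lake_table and pg_lake_copy if you never read raw files or COPY to S3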

Why This Changes Everything for Data Teams

The traditional data pipeline has been a source of constant friction. Business logic lives in Postgres applications, while analytics happens in separate systems like Snowflake, Databricks, or BigQuery. Moving data between these systems introduces latency, complexity, and governance headaches.

With pg_lake, developers can query petabytes of data lake information directly from their application code. Data scientists can use their existing Postgres skills to analyze Iceberg tables without learning new query languages. Operations teams maintain one security model instead of multiple.
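As a sketch of what that unification enables, consider joining an ordinary Postgres table against the Iceberg table created earlier. The orders table here is hypothetical; iceberg_test comes from the first example:

-- Hypothetical operational table living in regular Postgres storage
CREATE TABLE orders (
    order_id bigint PRIMARY KEY,
    key      int,
    amount   numeric
);

-- Join transactional rows against the Iceberg table from the first example;
-- Postgres plans the query while pgduck_server handles the lake-side scan
SELECT o.order_id, o.amount, t.val
FROM orders o
JOIN iceberg_test t ON t.key = o.key
WHERE o.amount > 100;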

Snowflake’s announcement specifically addresses this unification: “The separation of transactional data in Postgres from analytical data has long been a major architectural roadblock for enterprises, forcing costly data movement and preventing real-time data access for apps and AI agents.” pg_lake aims to eliminate that separation entirely.

The Performance Trade-Offs You Need to Understand

This architectural elegance comes with performance considerations. Delegating computation to DuckDB introduces network overhead between Postgres and pgduck_server. While this works well for analytical workloads scanning large datasets, it might not suit high-frequency transactional operations, as the short comparison after the lists below illustrates.

The system excels at:

  • Analytical queries across large Iceberg tables
  • Data import/export operations to object storage
  • JOIN operations between Postgres tables and lake data

But it introduces latency for:

  • High-frequency point lookups
  • Real-time transactional workloads
  • Applications requiring sub-millisecond response times
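To make the contrast concrete, compare two queries against the iceberg_test table from the first example. Both are delegated to DuckDB through pgduck_server, but only the second does enough work to amortize the round trip:

-- A single-row lookup pays the full Postgres-to-pgduck_server round trip
SELECT val FROM iceberg_test WHERE key = 42;

-- A scan-heavy aggregate spreads that same overhead across the whole table
SELECT count(*), avg(key) FROM iceberg_test;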

The approach acknowledges that most data value comes from analytical processing rather than transactional speed, a pragmatic trade-off for the lakehouse use case.

Where This Fits in the Broader Ecosystem

Snowflake’s timing here is strategic. With their acquisition of Crunchy Data and subsequent open-sourcing of pg_lake, they’re positioning PostgreSQL as the universal data access layer. This isn’t just about making Postgres better; it’s about creating bridges between competing data platforms.

According to Snowflake’s BUILD 2025 announcements, they’re embracing open standards like Apache Iceberg REST Catalog and Apache Polaris, recognizing that enterprises won’t tolerate vendor lock-in in the AI era. pg_lake becomes the gateway drug: use Postgres for everything, then optionally leverage Snowflake’s platform when you need enterprise features.

The Practical Implementation Reality

Getting started requires understanding the deployment options. For development work, the Docker setup provides a quick start. Production deployments need consideration of:

  • Memory Management: pgduck_server defaults to 80% of system memory; adjust this based on workload
  • Credentials: Leverages DuckDB’s secrets manager for S3/GCP access (see the sketch after this list)
  • Network: Unix domain socket communication between Postgres and pgduck_server
  • Caching: Remote file caching configuration for performance optimization
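For the credentials bullet, DuckDB’s native secrets manager is configured with CREATE SECRET statements like the sketch below. Exactly how pg_lake surfaces this from the Postgres side should be confirmed against the project’s docs, but the shape of the configuration is the same (all values here are placeholders):

-- DuckDB-style S3 secret with placeholder values; pg_lake delegates
-- object-storage credentials to DuckDB's secrets manager
CREATE SECRET lake_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET 'replace-me',
    REGION 'us-east-1'
);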

The modular design means teams can deploy just the components they need. If you’re only using Iceberg tables, you don’t need the full COPY functionality. If you’re focused on file ingestion, you can ignore the Iceberg components.

The Bigger Picture: Unification Over Revolution

What makes pg_lake compelling isn’t just the technology; it’s the philosophical shift. Instead of forcing everyone onto a new platform, it extends existing tools to handle new workloads. Developers keep their Postgres skills, applications maintain compatibility, and enterprises avoid yet another platform migration.

As enterprises struggle with AI data readiness (80% of IT leaders cite data silos as the primary obstacle), tools like pg_lake bridge the gap between operational and analytical systems. They’re not building a better warehouse; they’re making the warehouse accessible through the tools teams already use.

The real impact might be organizational. When developers can directly query petabytes of analytics data, the traditional boundaries between application development and data engineering start to blur. That organizational shift could be more transformative than any technical innovation.

The pg_lake project represents a mature approach to data architecture: build bridges rather than walls, extend rather than replace, and recognize that most data value comes from accessibility rather than technological perfection. For teams tired of data pipelines and platform migrations, that’s a revolution worth paying attention to.
