A data pipeline processing terabytes of live imagery every hour started experiencing extreme latency and crippled checkpoint times. The culprit wasn’t Flink, Pulsar, or Iceberg; it was the decision to stuff megabyte-sized image binaries directly into Parquet files. What began as an elegant “single source of truth” architecture quickly revealed why columnar formats and large binary objects are fundamentally misaligned.
The Seductive Simplicity of “One Place for Everything”
The original architecture seemed clean: ingest live imagery and metadata from Apache Pulsar, buffer through Flink, and land everything in Iceberg tables backed by Parquet. Having image data co-located with metadata promised streamlined analytics and machine learning workflows. Data scientists could theoretically run Spark SQL queries that returned both image features and pixel data in a single operation. No separate S3 lookups, no eventual consistency headaches, no dual-system orchestration.
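For concreteness, the inline design under discussion amounts to a table where pixel payloads sit next to their metadata. A hypothetical Spark SQL sketch follows; the catalog, table, and column names are illustrative, not the team’s actual schema, and it assumes a SparkSession with an Iceberg catalog already configured.

```python
# Hypothetical sketch of the "everything inline" table design under discussion.
# Table and column names are illustrative; assumes an Iceberg catalog is configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.imagery_frames (
        frame_id     STRING,
        captured_at  TIMESTAMP,
        device_id    STRING,
        width        INT,
        height       INT,
        image_bytes  BINARY  -- multi-megabyte JPEG/PNG payload stored inline
    )
    USING iceberg
    PARTITIONED BY (days(captured_at))
""")
```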
This appeal is powerful enough to make experienced engineers question decades of storage best practices. When you’re drowning in hundreds of terabytes of images that need tight coupling with metadata for pixel-level analytics, the idea of self-contained Parquet rows feels like architectural perfection. Why manage two storage systems when one could suffice?
The answer, as with most performance questions, emerges only at scale.
The Checkpoint Death Spiral
The first warning sign appears in Flink’s checkpoint mechanism. Checkpointing already involves capturing state snapshots; when each row contains multiple megabytes of image data, those snapshots balloon. A pipeline ingesting terabytes per hour doesn’t just write data; it also has to capture every image still buffered in its operators and writers as checkpoint state. This isn’t a gradual slowdown but a compounding latency disaster that eventually breaches SLA thresholds.
Parquet’s design exacerbates the problem. As a columnar format, it excels at compressing repetitive data across rows and enabling predicate pushdown. But image binaries are already compressed, already unique, and stored as large blobs that defeat Parquet’s internal chunking strategies. The format generates massive row groups that must be buffered entirely before writing, creating memory pressure that Flink wasn’t designed to handle. You’re essentially using a Formula 1 car to haul concrete: technically possible, but the engine will seize.
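The buffering cost is easy to reproduce. The sketch below (PyArrow, with made-up sizes) writes rows carrying fake 5 MB blobs; the writer has to hold an entire row group in memory before it can flush, so its footprint scales with rows-per-group times blob size.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Fake 5 MB "images": the writer must buffer an entire row group of these
# in memory before flushing, so memory use ~= rows_per_group * blob size.
fake_image = b"\x00" * (5 * 1024 * 1024)
rows_per_group = 20  # ~100 MB buffered per row group with 5 MB blobs

schema = pa.schema([("frame_id", pa.string()), ("image_bytes", pa.binary())])

with pq.ParquetWriter("frames.parquet", schema) as writer:
    batch = pa.table(
        {
            "frame_id": [f"frame-{i}" for i in range(rows_per_group)],
            "image_bytes": [fake_image] * rows_per_group,
        },
        schema=schema,
    )
    # Each write_table call here becomes (at least) one row group.
    writer.write_table(batch, row_group_size=rows_per_group)

meta = pq.ParquetFile("frames.parquet").metadata
print(meta.num_row_groups, meta.row_group(0).total_byte_size)
```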
The Brutal Truth: Zero Columnar Benefits
Here’s where conventional wisdom collides with reality. Parquet’s metadata statistics (min/max values, dictionary encoding, bloom filters) provide almost zero value for image columns. You can’t prune row groups based on pixel data ranges. You can’t skip reading a 5 MB image blob because Parquet’s metadata suggested it wasn’t relevant. Every query that touches an image column pulls the full weight of those binaries through the execution engine.
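You can see this directly by inspecting a footer. A minimal PyArrow sketch, reusing the frames.parquet file from the earlier example: the image column’s statistics, if they are written at all, are opaque byte prefixes that no realistic predicate can use, while the blob dwarfs everything else in the row group.

```python
import pyarrow.parquet as pq

# Inspect row-group statistics in the file written earlier (frames.parquet).
meta = pq.ParquetFile("frames.parquet").metadata

for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        # For the binary image column, min/max (if present at all) are opaque
        # byte prefixes -- useless for pruning, unlike frame_id's stats.
        print(chunk.path_in_schema, chunk.total_compressed_size, chunk.statistics)
```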
The performance implications cascade. Spark clusters need far more memory just to hold oversized Parquet footers and row-group metadata. Network transfer costs spike as entire files must be scanned rather than selectively read. Even queries that never ask for pixels pay a tax, because the image column inflates files, footers, and split planning well beyond what the metadata alone would require. The data scientists who were promised convenience instead get queries that time out and clusters that crash with OOM errors.
What Actually Works: The Two-Step Pattern
The data engineering community’s consensus is unequivocal: store references, not binaries. The dominant pattern involves writing image files directly to blob storage (S3, GCS, Azure Blob) with paths or hashes recorded in Iceberg tables. This separation unlocks multiple advantages:
- Storage efficiency improves through deduplication: identical images across timestamps or devices reference the same blob.
- Retrieval optimization lets you use CDNs, range requests, and direct HTTP access patterns that Parquet can’t provide.
- Query cost shrinks dramatically: a Spark cluster processing metadata-only Iceberg tables runs with a fraction of the executors and memory.
The migration path is straightforward. Replace image columns with s3_path STRING or image_hash STRING fields. For analytics, join on these references and load images lazily in Python using libraries like Pillow or OpenCV. The two-step process (SQL to fetch paths, then Python to load pixels) is no more effort than extracting blobs from Parquet, but it executes orders of magnitude faster.
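In practice the two steps look something like the sketch below. It assumes a metadata-only Iceberg table named lake.imagery_frames with an s3_path column, boto3 credentials already configured, and Pillow installed; all of those names are illustrative rather than taken from the original pipeline.

```python
import io

import boto3
from PIL import Image
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
s3 = boto3.client("s3")

# Step 1: SQL against the metadata-only Iceberg table returns paths, not pixels.
paths = [
    row.s3_path
    for row in spark.sql(
        """
        SELECT s3_path
        FROM lake.imagery_frames
        WHERE captured_at >= date_sub(current_date(), 1)
          AND device_id = 'cam-042'
        """
    ).collect()
]

# Step 2: load pixels lazily, one image at a time, only for the rows you need.
def load_image(s3_path: str) -> Image.Image:
    bucket, key = s3_path.removeprefix("s3://").split("/", 1)
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return Image.open(io.BytesIO(body))

for path in paths[:10]:
    img = load_image(path)
    print(path, img.size)
```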
One engineer suggested a migration-resilient approach: storing content-addressable hashes (like SHA256) instead of direct S3 paths. This creates a lightweight lookup service that survives storage reorganization. Combined with periodic maintenance jobs to purge orphaned blobs, this pattern handles failure modes more gracefully than monolithic Parquet files.
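A minimal version of that content-addressable write path might look like this; the bucket, key layout, and helper name are hypothetical:

```python
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "example-image-blobs"  # hypothetical bucket


def store_image(image_bytes: bytes) -> str:
    """Upload image bytes under a content-addressed key and return the SHA-256.

    The table stores only the hash; a lookup service (or a key convention like
    the one below) maps hashes back to physical locations, so storage can be
    reorganized without rewriting table rows.
    """
    digest = hashlib.sha256(image_bytes).hexdigest()
    key = f"blobs/{digest[:2]}/{digest}"  # shard by hash prefix to spread keys

    try:
        s3.head_object(Bucket=BUCKET, Key=key)  # already stored: dedupe for free
    except ClientError as err:
        if err.response["Error"]["Code"] not in ("404", "NoSuchKey"):
            raise
        s3.put_object(Bucket=BUCKET, Key=key, Body=image_bytes)

    return digest
```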
The Edge Case That Isn’t
Some argue that small, mostly-unique images at low volume might justify inline storage. The reality? Even modest scales reveal the same fundamental problems. A Spark cluster handling gigabytes of image data in Parquet still suffers from checkpoint amplification, read amplification, and wasted compute. The threshold where this becomes unacceptable is far lower than most anticipate.
The only scenario where inline storage makes sense is when images are tiny (kilobytes), truly unique, and accessed exclusively through full-table scans where Parquet’s I/O patterns don’t matter. If that doesn’t describe your use case (and for terabytes-per-hour pipelines, it absolutely doesn’t), then you’re building technical debt that will collapse under its own weight.
The Lakehouse Philosophy vs. Operational Reality
This debate exposes a deeper tension in modern data architecture. The lakehouse movement promises unified storage for all data types, but conflating “can store” with “should store” leads to catastrophic decisions. Iceberg and Parquet were designed for structured and semi-structured data where columnar statistics drive query optimization. They weren’t built to be object stores, and no amount of metadata layering changes that fundamental limitation.
The S3 Tables feature, built on Iceberg, demonstrates where the real value lies: using Parquet for metrics and metadata while letting S3 handle the object storage it was built for. The newer observability-driven patterns, such as S3 Storage Lens exports to Iceberg tables, point the same way: keep binary data in blob storage and analyze the metadata in columnar formats.
Concrete Recommendations
If your pipeline is already struggling, migrate immediately. The steps are clear:
- Extract images from Parquet to S3 with hive-style partitioning (e.g., s3://bucket/images/year=2025/month=12/day=26/hour=14/).
- Replace image columns in Iceberg with image_url STRING populated during the migration (a backfill sketch covering these first two steps follows the list).
- Update Flink jobs to write images to S3 and emit paths to Pulsar/Flink instead of binaries.
- Implement a cleanup job (monthly scan for orphaned blobs) or use S3 lifecycle policies with timestamp-based keys.
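A backfill for the first two steps could look roughly like the following PySpark sketch. It assumes the existing table is lake.imagery_frames with an inline image_bytes column alongside the metadata shown earlier; the destination bucket, new table name, and partition layout are illustrative.

```python
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
BUCKET = "example-image-archive"  # hypothetical destination bucket


def upload_partition(rows):
    """Write each inline image to S3 and yield (frame_id, image_url)."""
    s3 = boto3.client("s3")  # one client per partition, not per row
    for row in rows:
        ts = row.captured_at
        key = (
            f"images/year={ts.year}/month={ts.month:02d}/"
            f"day={ts.day:02d}/hour={ts.hour:02d}/{row.frame_id}.jpg"
        )
        s3.put_object(Bucket=BUCKET, Key=key, Body=bytes(row.image_bytes))
        yield (row.frame_id, f"s3://{BUCKET}/{key}")


src = spark.table("lake.imagery_frames").select("frame_id", "captured_at", "image_bytes")
urls = src.rdd.mapPartitions(upload_partition).toDF(["frame_id", "image_url"])

# New metadata-only table: original columns minus the blob, plus the S3 reference.
(
    spark.table("lake.imagery_frames")
    .drop("image_bytes")
    .join(urls, on="frame_id")
    .writeTo("lake.imagery_frames_v2")
    .using("iceberg")
    .createOrReplace()
)
```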
For new pipelines, start with the reference pattern. Use DuckDB’s read_blob() if you need SQL-native access to images without the overhead of a full Spark cluster, as in the sketch below. The performance difference isn’t incremental; it’s the difference between a pipeline that sustains terabytes per hour and one that grinds to a halt.
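A minimal DuckDB sketch of that SQL-native access, assuming the httpfs extension, S3 credentials configured for DuckDB, a Parquet export of the metadata table at a hypothetical path, and image_url values that store full s3:// paths:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # one-time; needed for s3:// paths
con.execute("LOAD httpfs")
# S3 credentials must be configured separately (e.g. with DuckDB's CREATE SECRET).

# Join a (hypothetical) Parquet export of the metadata table to the blobs
# themselves -- no Spark cluster involved. Paths and column names are illustrative.
rows = con.execute(
    """
    SELECT m.frame_id, b.size, b.content
    FROM read_parquet('s3://example-bucket/exports/imagery_frames/*.parquet') AS m
    JOIN read_blob('s3://example-image-archive/images/year=2025/month=12/day=26/hour=14/*.jpg') AS b
      ON b.filename = m.image_url
    WHERE m.device_id = 'cam-042'
    """
).fetchall()

for frame_id, size, content in rows[:5]:
    print(frame_id, size, len(content))
```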
The controversy isn’t whether this works. It doesn’t, and the data proves it. The real debate is why teams keep making this mistake despite clear evidence. Architectural purity is a poor substitute for operational stability.


