The pitch for Apache Iceberg is compelling: ACID transactions, painless schema evolution, time travel. It promises database-like reliability for your petabyte-scale data lake. You get the demo working, your team gets excited about hidden partitioning, and you start migrating tables. Then you hit production. Snapshot #452 lands on your timeline, the small-file count spikes, and your query performance tanks. Welcome to the operational iceberg, where 90% of the work (compaction schedules, orphan file cleanup, manifest rewrites) is hidden beneath the waterline, threatening to sink your entire data platform.
This is the dirty secret nobody puts in the README. The real-world experience is far from the seamless, self-managing utopia sold by vendors. As one exasperated engineer recently asked in a forum: how are teams actually handling Iceberg table maintenance in production? The answer, distilled from countless war stories, is a tangled web of custom glue code, external schedulers, and constant vigilance.
The Hidden Toil: What Marketing Brochures Don’t Show
The reality in production is stark. Developers running Iceberg on Spark report being "surprised" by the sheer volume of maintenance glue code required. This isn't about writing queries; it's about orchestrating a relentless background process to keep the system from collapsing under its own weight.
Here’s the laundry list that becomes your new on-call rotation:
* Compaction Schedules: Merging thousands of tiny Parquet files into larger, query-efficient ones.
* Snapshot Expiration: Pruning old metadata to prevent the snapshots table from becoming a performance bottleneck itself.
* Orphan File Cleanup: Identifying and deleting data files no longer referenced by any snapshot (a critical, often forgotten, cost control).
* Manifest Rewrites: Optimizing the metadata that tells queries where the data is.
* Monitoring Small-File Counts: Because when this metric "blows up", your query engine's latency explodes (a minimal check is sketched just below).
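What does watching that metric actually look like? Here is a minimal sketch in PySpark, assuming an active SparkSession (spark), the demo catalog and prod.events table used in the examples later in this post, and purely illustrative thresholds. Iceberg's files metadata table gives you the raw numbers:

# Count data files below a "small" cutoff via the files metadata table.
small = spark.sql("""
    SELECT count(*) AS small_file_count,
           avg(file_size_in_bytes) AS avg_file_size
    FROM demo.prod.events.files
    WHERE file_size_in_bytes < 128 * 1024 * 1024
""").first()

# Export avg_file_size to your metrics stack; alert when the count crosses a line.
if small.small_file_count > 1000:
    print(f"prod.events has {small.small_file_count} files under 128MB; compaction overdue")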
This is the foundation of your modern data stack? More like a second full-time job.
The Anatomy of an Iceberg Maintenance Bomb
To understand why this happens, you need to look under the hood. Iceberg’s power comes from its immutable, versioned metadata layer. Every write creates a new snapshot. Every snapshot points to a set of manifest files, which in turn list the actual data files. It’s elegant. It’s also a factory for metadata sprawl.
This architecture, perfect for streaming use cases, is also its own worst enemy. High-frequency writers (think IoT data, clickstreams) create a torrent of small files. Each new file is a new entry in a manifest. Over time, this leads to manifest bloat, where planning a simple SELECT * FROM table LIMIT 10 requires reading megabytes of metadata before touching a single byte of actual data.
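You don't have to take that on faith. Iceberg exposes its own metadata as queryable tables, so you can watch the sprawl happen. A quick sketch, assuming a PySpark SparkSession (spark) and the same hypothetical demo.prod.events table used in the examples below:

# How many snapshots has the table accumulated since the last expiration?
spark.sql("SELECT count(*) AS snapshot_count FROM demo.prod.events.snapshots").show()

# How large has the manifest layer grown?
spark.sql("""
    SELECT count(*) AS manifest_count,
           round(sum(length) / 1024 / 1024, 1) AS manifest_mb
    FROM demo.prod.events.manifests
""").show()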
The documented Spark procedures are your tools, but they are manual, low-level levers, not an automated system. You have:
* CALL system.rewrite_data_files(...): For compaction.
* CALL system.expire_snapshots(...): For metadata cleanup.
* CALL system.remove_orphan_files(...): For cost control.
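* CALL system.rewrite_manifests(...): For manifest optimization.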
These are SQL commands you must wrap, schedule, monitor, and retry. The IOMETE documentation provides the syntax, but offers zero guidance on the “when” and “how often.” That’s the entire job they’ve left for you.
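What does that wrapping look like? A minimal PySpark sketch, with the retry budget and backoff as illustrative assumptions; the call itself is the standard rewrite_manifests procedure, the one behind the manifest-rewrite chore from the laundry list above:

import logging
import time

log = logging.getLogger("iceberg-maintenance")

def run_procedure(spark, sql, max_attempts=3):
    # Run one maintenance procedure with basic logging and retry.
    for attempt in range(1, max_attempts + 1):
        try:
            result = spark.sql(sql).collect()
            log.info("maintenance succeeded on attempt %d: %s", attempt, result)
            return result
        except Exception:
            log.exception("attempt %d/%d failed", attempt, max_attempts)
            if attempt == max_attempts:
                raise
            time.sleep(30 * attempt)  # crude linear backoff

# Example: consolidate bloated metadata for the hypothetical prod.events table.
run_procedure(spark, "CALL demo.system.rewrite_manifests(table => 'prod.events')")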
The Siren Song of “Managed” Services
Faced with this operational quagmire, the natural reaction is to look for a managed solution. The cloud providers have an answer. AWS offers Glue’s “low-code” maintenance for Iceberg tables. Google Cloud’s BigQuery Lakehouse promo explicitly states it “automates routine Iceberg maintenance, like compaction and clustering, to optimize price-performance and eliminate manual overhead.”
Sounds perfect, right? Not so fast.
The feedback from the trenches is cautionary. As one practitioner warned, these services sound great until you realize you have “ZERO control over WHEN it runs, HOW OFTEN it runs other than some config options.” If you’re doing high-commit-volume streaming and need to stay on top of intraday performance, you’re often forced back to writing your own Spark jobs. The “managed” solution becomes a black box, trading control for convenience and sometimes creating more problems than it solves when your workload doesn’t fit their one-size-fits-all schedule.
This is a classic trade-off in modern platform engineering: the convenience of abstraction versus the control required for performance-critical systems. It forces you to ask whether your data platform is truly autonomous, or whether the architectural complexity and operational overhead have simply been repackaged.
Building Your Own Maintenance Engine: A Survival Guide
Since the off-the-shelf solutions can be brittle, many teams end up building their own. The patterns emerging from the community are instructive.
1. The Centralized Scheduler Pattern: This is the most common approach. Teams use a workflow orchestrator like Apache Airflow to define and schedule maintenance DAGs. Each table gets its own configuration (a default compaction policy, a retention period for snapshots), and a central job reads a manifest of all tables and fires off the appropriate Spark procedures. New tables are automatically added to the manifest. It's robust, but it's also more infrastructure to manage (a minimal sketch of the central job follows this list).
2. The Event-Driven Alerting Pattern: Instead of preventing problems, some teams focus on reacting quickly. They set up monitoring for key metrics (small-file count, manifest size, snapshot age) and trigger corrective maintenance jobs via alerts. The biggest win, as one engineer noted, is simply “having good alerts for when the small file count hits a threshold.” This pairs well with the scheduler pattern for baseline maintenance.
3. The Infrastructure-as-Code Pattern: Modern teams are defining their maintenance logic alongside their table definitions. Using tools like CDK or Terraform, they auto-generate Spark jobs triggered by EventBridge on deployment or on a default interval, with developers able to override schedules per table. Stack and table tags propagate to the EMR job runs for full observability. It’s a “set and forget” ideal, assuming your forgetfulness doesn’t lead to a runaway storage bill.
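What might that central job look like? A minimal PySpark sketch of the scheduler pattern; the table list, policies, and thresholds are illustrative assumptions, and a real deployment would run this from an Airflow task with per-table error handling and metrics:

# A per-table maintenance manifest; in practice this lives in config
# (YAML, a control table, or tags) rather than hard-coded in the job.
TABLES = {
    "prod.events": {"target_file_size_bytes": 512 * 1024 * 1024, "retain_days": 7},
    "prod.orders": {"target_file_size_bytes": 256 * 1024 * 1024, "retain_days": 30},
}

def maintain(spark, table, cfg):
    # 1. Compact small files toward the configured target size.
    spark.sql(f"""
        CALL demo.system.rewrite_data_files(
            table => '{table}',
            strategy => 'binpack',
            options => map('target-file-size-bytes', '{cfg['target_file_size_bytes']}')
        )
    """)
    # 2. Expire snapshots older than the configured retention window.
    spark.sql(f"""
        CALL demo.system.expire_snapshots(
            table => '{table}',
            older_than => timestamp(date_sub(current_date(), {cfg['retain_days']}))
        )
    """)

for table, cfg in TABLES.items():
    maintain(spark, table, cfg)

Nothing here is clever; the value is that every table's policy lives in one place the scheduler can read, and onboarding a new table is a one-line change.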
The core theme? Automation is non-negotiable. Hand-running expire_snapshots is a path to burnout and error.
Integrating the Maintenance Cycle: Code Examples and Schedules
Let’s get concrete. What does this glue code actually look like? Based on the Reintech tutorial, here’s a typical compaction job you’d need to schedule:
-- Target ~512MB files, only rewrite partitions needing it
CALL demo.system.rewrite_data_files(
  table => 'prod.events',
  strategy => 'binpack',
  options => map('target-file-size-bytes', '536870912', 'min-input-files', '5')
);
And the crucial, cost-saving orphan cleanup:
-- Dry-run first to see what you'll delete!
CALL demo.system.remove_orphan_files(
  table => 'prod.events',
  older_than => timestamp(date_sub(current_date(), 3)),
  dry_run => true
);
-- Then execute
CALL demo.system.remove_orphan_files(
  table => 'prod.events',
  older_than => timestamp(date_sub(current_date(), 3))
);
But when do you run these? There is no universal answer, but patterns emerge:
* Compaction (rewrite_data_files): Run hourly for high-volume streaming tables, daily for batch-ingested tables. Monitor the average data file size.
* Snapshot Expiration (expire_snapshots): Run daily, retaining 7 to 30 days of history for time-travel debugging. Balance storage cost against utility.
* Orphan Cleanup (remove_orphan_files): Run weekly. This is your garbage collection. Miss it, and you’re paying for unused cloud storage indefinitely.
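If you want those cadences as code rather than tribal knowledge, they reduce to a handful of schedule entries wired into whatever orchestrator runs your maintenance jobs. The cron expressions below are illustrative defaults, not recommendations:

# Illustrative baseline cadences for the maintenance trio; tune per table
# based on write volume, query SLAs, and storage cost.
MAINTENANCE_SCHEDULES = {
    "rewrite_data_files":  "0 * * * *",  # hourly: high-volume streaming tables
    "expire_snapshots":    "0 3 * * *",  # daily: prune metadata off-peak
    "remove_orphan_files": "0 4 * * 0",  # weekly: reclaim unreferenced storage
}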
This isn't data engineering; it's data janitoring. And its reliability is as critical as that of your ETL pipelines, echoing the familiar challenge of keeping background maintenance processes reliable in any distributed system.
The Emerging Frontier: Agentic Automation and the Future
The industry isn’t blind to this pain. A new wave of tooling is emerging, moving from scheduled scripts to observability-driven, autonomous maintenance. Companies like definity (which just raised a $12M Series A) are building what they call “agentic data engineering platforms.” The pitch is shifting from “here are the tools” to “the platform will figure it out.”
These systems work directly within production pipelines, capturing runtime signals across infrastructure behavior, pipeline execution, and data characteristics. Instead of you defining a static “compact every 4 hours” rule, an AI agent could analyze write patterns, query performance, and cost metrics to decide when and how to compact, then execute it. Google’s push for “AI-powered assistance and automation for all data users” in BigQuery points in the same direction.
The promise is a shift from reactive, script-heavy operations to declarative management: "Keep this table's query performance under 5 seconds and storage costs below $X/month." The platform handles the how. Most teams aren't there yet, but it's the logical endgame for taming these unpredictable operational nightmares.
Practical Takeaways: Navigating the Iceberg
So, you’re committed to Iceberg. What now?
- Budget for Operations From Day One: This isn’t an afterthought. If you’re adopting Iceberg, you are also adopting a maintenance subsystem. Factor in the development and runtime cost.
- Start with Managed, But Plan for Escape: Try the managed maintenance from your cloud provider or lakehouse platform. Monitor its effectiveness closely. Have a plan (and code) to take over if it doesn’t meet your needs.
- Instrument Everything: You cannot manage what you cannot measure. Export metrics for snapshot count, manifest size, average data file size, and orphan file count to your observability stack. Set alerts.
- Build Runbooks, Not Just Scripts: The Reintech guide wisely advises to “build runbooks for common operations like compaction and snapshot expiration.” Document the why and when, not just the how.
- Treat Table Properties as Configuration: Use Iceberg's table properties (write.target-file-size-bytes, history.expire.max-snapshot-age-ms) to encode your maintenance intent directly into the table. This makes your automation simpler and more portable.
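As a sketch of that last point: the intent can be set once on the table itself, where every maintenance job (and every engine) can see it. The values below are illustrative; both property names are standard Iceberg table properties, and the table name is the same hypothetical demo.prod.events used throughout:

# Encode maintenance intent as table properties so jobs read it instead of hard-coding it.
spark.sql("""
    ALTER TABLE demo.prod.events SET TBLPROPERTIES (
        'write.target-file-size-bytes' = '536870912',        -- aim for ~512 MB data files
        'history.expire.max-snapshot-age-ms' = '604800000'   -- keep 7 days of snapshots
    )
""")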
The power of Iceberg is real. Its transactional guarantees and evolution capabilities are transformative for data lakes. But that power comes with a metabolic cost: the continuous energy required to keep its metadata ecosystem healthy. Ignore that maintenance, and your high-performance lakehouse will quietly degenerate into a swamp of small files and slow queries. The choice isn't whether to use Iceberg; it's whether you're prepared to operate it. The battle between lakehouse platform frameworks will rage on, but the daily grind of file management is the universal constant. Plan for it, automate it, and maybe one day an agent will do it for you. Until then, keep your compaction schedules tight and your orphan file alerts loud.




