Spark 3.5’s Phantom Exit: When Long-Running Loops Vanish Without a Trace

Spark 3.5+ jobs in infinite loops are silently terminating after 8-12 iterations despite stable memory and no errors. The culprit isn’t OOM or GC; it’s metadata accumulation in Catalyst. Here’s what the logs won’t tell you.

by Andre Banandre

Your Spark job runs perfectly for 8 iterations. On the 9th, it simply disappears. No stack trace. No OutOfMemoryError. No executor loss. YARN reports exit code 0, the Spark UI shows healthy executors right up until they vanish, and your logs are emptier than a dev environment on Friday afternoon. This isn’t a crash; it’s a phantom exit, and it’s been haunting Spark 3.5+ users who dare to wrap their logic in while(true).

The Crime Scene: What We Actually Know

A recent incident report from a production environment paints a frustrating picture: Spark 3.5.1 on YARN 3.4, running a straightforward batch job inside an infinite loop. The setup was modest: two executors with 16GB each, an 8GB driver, and Parquet input read from S3A. The loop included a short sleep between iterations, standard practice for polling-style workloads. Everything worked flawlessly for 8 to 12 iterations. Then, silence.

The monitoring told a confusing story: memory stayed stable, GC behavior was normal, and even aggressive cache clearing with unpersist(), clearCache(), and checkpointing did nothing. Extending heartbeat intervals and monitoring JVM metrics revealed no smoking gun. The job simply exited cleanly, as if someone had sent a polite SIGTERM and Spark had decided “sure, why not?”
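
For concreteness, the pattern looks roughly like this, a minimal PySpark sketch of the setup and the cleanup attempts described above. The paths, column name, and transformation are hypothetical placeholders, not the original job’s code:

    from time import sleep

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("polling-batch").getOrCreate()
    spark.sparkContext.setCheckpointDir("s3a://my-bucket/checkpoints/")  # hypothetical path

    iteration = 0
    while True:
        # Hypothetical stand-in for the real work: read Parquet from S3A,
        # aggregate, write the result back out.
        df = spark.read.parquet("s3a://my-bucket/incoming/")
        result = df.groupBy("key").count()
        result.write.mode("overwrite").parquet("s3a://my-bucket/output/")

        # The cleanup attempts from the incident report. None of them prevented
        # the silent exit; they clear cached data, not plan or lineage metadata.
        result.unpersist()
        spark.catalog.clearCache()
        df.checkpoint(eager=True)

        iteration += 1
        print(f"completed iteration {iteration}")
        sleep(60)  # short pause between polling iterations, as described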

The Metadata Bomb Theory

The working hypothesis from engineers who’ve debugged this points to a subtle culprit: Dataset lineage and plan metadata accumulation in Spark’s Catalyst optimizer. Each iteration of a DataFrame operation builds a logical plan, and Catalyst incrementally tracks transformations to enable optimization. In batch jobs, this is fine: the plan builds, executes, and the context terminates. But in a long-running loop, that metadata never gets truly cleared. It accumulates, iteration after iteration, like a memory leak that doesn’t show up in your heap metrics.

Spark 3.5 introduced several optimizer enhancements, including more aggressive plan caching and metadata tracking for improved query performance. The unintended consequence? An infinite loop that generates slightly different plans each iteration (or even the same plan, re-registered) can cause the Catalyst metadata store to grow unbounded. Eventually, it crosses an internal threshold, likely a safety mechanism, and triggers a graceful shutdown. No error, because from Spark’s perspective, this is cleanup, not failure.
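
You can’t inspect the hypothesized internal store directly, but you can watch the public-facing symptom with nothing more than explain(). A minimal sketch, assuming the common case where a loop re-derives a DataFrame from itself each pass: the extended logical plan gets longer on every iteration even though the data never changes. This shows lineage growth through public APIs; it’s a proxy for the theory, not a direct probe of Spark 3.5’s optimizer internals:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("plan-growth-demo").getOrCreate()

    df = spark.range(10)
    for i in range(5):
        # Re-deriving the DataFrame each iteration keeps stacking operators
        # onto its logical plan, even though the underlying data is static.
        df = df.withColumn(f"c{i}", F.lit(i))
        df.count()
        # The extended plan printed here grows with every iteration.
        df.explain(extended=True)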

Why Traditional Fixes Fail

The Reddit thread discussing this issue reads like a laundry list of failed workarounds. Let’s examine why each one misses the mark:

  • unpersist() and clearCache(): These clear cached data, not metadata. The lineage graph in Catalyst remains intact.
  • Checkpointing: Forces materialization but doesn’t prune the logical plan history that Catalyst maintains for optimization purposes.
  • Extended heartbeats: These address network timeout issues, which aren’t the problem here.
  • GC monitoring: The metadata bloat isn’t garbage the JVM can reclaim; it’s live references held in Spark’s internal structures, so standard heap metrics look healthy.

Memory staying stable is the key clue. This isn’t a Java memory leak; it’s a Spark metadata leak.

The “Nuclear Option” Workaround

One engineer discovered a bizarre fix: adding spark.range(1).count() inside the loop after the sleep. This shouldn’t work, but it does. Why? Forcing a trivial action like this likely triggers Catalyst’s plan evaluation and cleanup mechanisms. It’s a hack that punches through the optimizer’s accumulation logic, essentially telling Spark “flush your metadata now.” But relying on this in production is like fixing a roof leak with duct tape: it’ll hold until it doesn’t.
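
If you’re stuck with the loop for now, here’s roughly what that workaround looks like in place. This is a sketch only: run_batch is a hypothetical stand-in for your per-iteration logic, and whether the trick holds up in your environment is anyone’s guess:

    from time import sleep

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("polling-batch").getOrCreate()

    def run_batch(spark):
        """Hypothetical stand-in for the real per-iteration work."""
        ...

    while True:
        run_batch(spark)
        sleep(60)

        # The "nuclear option": a trivial action right after the sleep. The
        # working theory is that forcing an action nudges Catalyst into
        # evaluating and flushing accumulated plan state. Duct tape, not a fix.
        spark.range(1).count()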

What Actually Works: Structured Streaming Micro-Batches

The consensus among Spark maintainers and experienced users is clear: stop using infinite loops. Structured Streaming was designed precisely for this use case. Micro-batch mode isolates each execution’s DAG, manages lineage automatically, and provides exactly-once semantics. The metadata lifecycle is bounded per micro-batch, preventing the accumulation that kills long-running batch loops.
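
Here is a minimal sketch of the same polling workload expressed as a Structured Streaming query with a processing-time trigger and foreachBatch, which lets each micro-batch reuse batch-style logic. The schema, paths, and trigger interval are assumptions, not values from the incident report:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("polling-as-streaming").getOrCreate()

    # Streaming file sources need an explicit schema; this one is hypothetical.
    schema = StructType([
        StructField("key", StringType()),
        StructField("value", LongType()),
    ])

    incoming = spark.readStream.schema(schema).parquet("s3a://my-bucket/incoming/")

    def process_batch(batch_df, batch_id):
        # Each invocation sees only that micro-batch's data, with its own
        # bounded plan lifecycle, and can use ordinary batch writers.
        (batch_df.groupBy("key").count()
            .write.mode("overwrite")
            .parquet(f"s3a://my-bucket/output/batch_id={batch_id}/"))

    query = (
        incoming.writeStream
        .foreachBatch(process_batch)
        .option("checkpointLocation", "s3a://my-bucket/stream-checkpoints/")
        .trigger(processingTime="60 seconds")
        .start()
    )
    query.awaitTermination()

foreachBatch is the usual bridge when you want streaming’s bounded per-batch lifecycle but your sinks and downstream logic are still batch-shaped.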

If you must stick with batch semantics, the only reliable approach is to restart the Spark context periodically. This is inherently fragile: it adds orchestration complexity and risks state loss, but it’s the only way to guarantee a clean slate for Catalyst’s metadata. A better pattern: external orchestration (Airflow, Dagster) that launches discrete batch jobs on a schedule, letting each job live and die naturally.
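
If you go the restart route anyway, a hedged sketch of the pattern: cap the iterations per SparkSession, stop it, and start fresh, with the cap set comfortably below the 8-to-12 range where jobs go quiet. The budget, helper, and paths are assumptions, and restarting a context in-process has its own caveats; in production you’d usually let the orchestrator own the outer loop:

    from time import sleep

    from pyspark.sql import SparkSession

    ITERATIONS_PER_SESSION = 5  # arbitrary budget, well under the observed 8-12

    def run_iteration(spark):
        """Hypothetical stand-in for the real per-iteration work."""
        spark.read.parquet("s3a://my-bucket/incoming/") \
            .groupBy("key").count() \
            .write.mode("overwrite").parquet("s3a://my-bucket/output/")

    while True:
        # A fresh SparkSession (and SparkContext) per batch of iterations gives
        # Catalyst a clean slate; getOrCreate() builds a new one after stop().
        spark = SparkSession.builder.appName("bounded-batch").getOrCreate()
        for _ in range(ITERATIONS_PER_SESSION):
            run_iteration(spark)
            sleep(60)
        spark.stop()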

The Root Cause: Spark’s Batch-First Design Philosophy

This isn’t a bug; it’s a fundamental design assumption. Spark was built for finite batch jobs and streaming workloads, not infinite batch loops. The Catalyst optimizer’s metadata management optimizes for plan reuse and incremental optimization within a single job lifecycle. When you stretch that lifecycle to infinity, you break the contract.

Spark 3.5’s “silent termination” is actually a protective measure. Without it, metadata accumulation would eventually cause more severe issues: actual OOMs, corrupted plans, or undefined behavior. The graceful exit is Spark’s way of saying “you’re using me wrong.”

The Takeaway

If you’re running Spark 3.5+ in a while(true) loop, you’re fighting the framework. The phantom exit is a feature, not a bug: a forced reminder that long-lived batch loops are an anti-pattern. Migrate to Structured Streaming, or accept the overhead of job restarts. The metadata leak isn’t going away, because it’s not a leak. It’s Spark doing exactly what it was designed to do, just not what you wanted.

The next time your job vanishes without a trace, check the optimizer metrics. The smoking gun is there, buried in Catalyst’s plan history, right where you weren’t looking.
