
JSON in PySpark: The Performance Trap You’re Probably Falling For
Why writing large PySpark DataFrames as JSON to S3 is fundamentally flawed - and what you should do instead
The allure of JSON is undeniable. It’s human-readable, universally understood, and seems like the perfect format for moving data between systems. But when you’re dealing with PySpark DataFrames containing 60+ million rows headed for S3, JSON becomes less of a convenience and more of a performance nightmare.
The JSON Illusion: Why This Approach Feels Right (But Isn’t)
The scenario is painfully common: you need to move massive amounts of data from PySpark to Snowflake via S3. JSON seems like the logical choice - it’s the lingua franca of data exchange, right? Developers often reach for combinations of UDFs and collect_list to bundle everything into neat JSON arrays, only to discover they’ve built a performance trap.
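To make the trap concrete, here’s a minimal sketch of that anti-pattern - assuming a DataFrame called df standing in for the 60M+ row dataset and a purely hypothetical S3 path:

```python
from pyspark.sql import functions as F

# Anti-pattern sketch: collapse the entire DataFrame into one giant JSON array.
# `df` stands in for the 60M+ row DataFrame; the S3 path is hypothetical.
json_array = df.agg(
    F.to_json(F.collect_list(F.struct(*df.columns))).alias("payload")
)

# The global collect_list forces every row into a single aggregation buffer
# before serialization - this is exactly where the memory pressure comes from.
json_array.write.mode("overwrite").text("s3://my-bucket/exports/events_json/")
```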
The fundamental problem isn’t JSON itself - it’s how JSON interacts with distributed computing paradigms. PySpark is designed for distributed processing, but JSON’s structure encourages anti-patterns that undermine this architecture.
The Memory Wall: When 60 Million Rows Become Unmanageable
Consider the original problem: 60+ million rows needing consolidation into JSON arrays. The approach of using collect_list to combine rows into arrays sounds reasonable until you hit memory constraints. Every task that has to collect and serialize millions of rows at once creates memory pressure that can bring your cluster to its knees.
As one developer discovered, this approach includes an unwanted side effect: PySpark adds column names as outer JSON attributes, creating structural mismatches with downstream systems like Snowflake’s COPY INTO command. More critically, the memory overhead becomes prohibitive as dataset size increases.
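That structural mismatch is easy to reproduce on a toy DataFrame (assuming an active SparkSession named spark): to_json wraps every row in an object keyed by its column names, which is the “outer attributes” problem in miniature.

```python
from pyspark.sql import functions as F

toy = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# Each row is serialized as an object keyed by its column names,
# e.g. {"id":1,"name":"a"} - the "outer attributes" described above.
toy.select(F.to_json(F.struct(*toy.columns)).alias("value")).show(truncate=False)
```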
The Object Storage Nightmare: 60 Million Files and Counting
When the array approach fails, many developers pivot to writing individual JSON objects - one per row. This seems like a reasonable compromise until you do the math: 60 million rows means 60 million individual files in S3.
S3 may be “unlimited” in capacity, but it has very real limitations when dealing with massive numbers of small objects. Each file operation incurs latency, and listing 60 million objects becomes an operation measured in minutes or hours rather than seconds. The S3 Tables documentation ↗ hints at this challenge, emphasizing the importance of optimized file organization for analytical workloads.
Snowflake’s COPY INTO can handle multiple files, but the overhead of managing millions of tiny JSON files creates operational complexity that scales poorly. The request costs alone - millions of PUT, LIST, and GET operations - can end up significantly higher than with fewer, larger files, since S3 bills per request on top of per-gigabyte storage.
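For context, the Snowflake side of this pattern typically looks something like the sketch below, using the snowflake-connector-python package - the stage, table, and connection details are all hypothetical:

```python
import snowflake.connector  # assumes the snowflake-connector-python package

# Hypothetical connection details.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="LOAD_WH", database="ANALYTICS", schema="RAW",
)

# COPY INTO happily fans out over many files in a stage, but listing and
# loading millions of tiny JSON objects is where the time and money go.
conn.cursor().execute("""
    COPY INTO raw_events
    FROM @my_s3_stage/exports/events_json/
    FILE_FORMAT = (TYPE = JSON)
""")
```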
The Format War: JSON vs. Modern Alternatives
The real controversy emerges when you question why JSON is the requirement in the first place. As developers on data engineering forums point out, if your destination is Snowflake, you’re likely better served by formats like Parquet or ORC.
Amazon S3 Tables ↗ specifically optimizes for analytical formats like Parquet, Avro, and ORC - not JSON. These columnar formats offer better compression, faster query performance, and built-in schema evolution capabilities. The performance difference isn’t marginal - we’re talking about 3x faster queries and significant storage cost reductions.
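If S3 has to stay in the pipeline, switching formats is usually a one-line change. A minimal sketch, again assuming a DataFrame df and a hypothetical bucket:

```python
# Columnar output instead of JSON: a one-line change from .json(...) to .parquet(...).
(
    df.write
      .mode("overwrite")
      .option("compression", "snappy")  # snappy is the usual Parquet default
      .parquet("s3://my-bucket/exports/events_parquet/")
)
```

On the Snowflake side, COPY INTO can load these files with FILE_FORMAT = (TYPE = PARQUET), so nothing in the pipeline ever parses JSON.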
The irony is that many teams insist on JSON for “compatibility” reasons, only to discover they’ve chosen the least compatible format for modern data lake architectures.
The PySpark Connector Solution: Cutting Out the Middleman
Perhaps the most damning revelation from the developer community is the existence of purpose-built solutions that make the entire JSON-to-S3-to-Snowflake pipeline unnecessary. The PySpark connector for Snowflake ↗ allows direct DataFrame-to-Snowflake transfers without intermediate file storage.
This approach eliminates multiple pain points:
- No S3 file management overhead
- No JSON serialization/deserialization costs
- Built-in type mapping between PySpark and Snowflake
- Transactional consistency guarantees
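In code, the direct path is short. Here’s a sketch assuming the Spark-Snowflake connector is on your cluster’s classpath; every connection value below is hypothetical:

```python
# Assumes the Spark-Snowflake connector (and its JDBC driver) is on the classpath.
sf_options = {
    "sfURL": "my_account.snowflakecomputing.com",  # hypothetical account
    "sfUser": "my_user",
    "sfPassword": "...",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "RAW",
    "sfWarehouse": "LOAD_WH",
}

(
    df.write
      .format("snowflake")  # older connector versions use "net.snowflake.spark.snowflake"
      .options(**sf_options)
      .option("dbtable", "RAW_EVENTS")
      .mode("overwrite")
      .save()
)
```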
Yet many teams continue with the JSON approach due to legacy requirements or unfamiliarity with available alternatives.
Partitioning: The Band-Aid That Doesn’t Stop the Bleeding
Some developers suggest partitioning as a solution - breaking the 60 million rows into manageable chunks. While partitioning can help with memory constraints, it doesn’t address the fundamental format inefficiency. You’re still writing JSON, just in smaller batches.
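In practice, the band-aid looks something like this - a repartition caps the memory per task and the number of output files, but every byte written is still JSON (the path and partition count are hypothetical):

```python
# Repartitioning bounds memory per task and yields ~200 larger files
# instead of millions of tiny objects - but the payload is still JSON.
(
    df.repartition(200)
      .write
      .mode("overwrite")
      .option("compression", "gzip")
      .json("s3://my-bucket/exports/events_json_partitioned/")
)
```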
The real issue is that JSON lacks the structural optimizations that make formats like Parquet ideal for analytical workloads. Even partitioned JSON files will underperform compared to properly organized columnar formats.
The Performance Reality: Benchmarks Don’t Lie
When you compare JSON against modern alternatives in PySpark-to-S3 scenarios, the results are stark:
- Write Performance: Parquet writes 2-3x faster than JSON due to better compression and serialization efficiency
- Storage Costs: Parquet files are typically 30-50% smaller than equivalent JSON
- Query Performance: Snowflake can query Parquet files directly without the parsing overhead JSON requires
- Network Transfer: Smaller file sizes mean faster S3 uploads and downloads
These performance gaps only widen as dataset sizes grow from millions to billions of rows.
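Exact numbers depend on your schema, cluster, and compression settings, so measure on your own data. A crude wall-clock sketch, assuming a DataFrame df and a hypothetical bucket:

```python
import time

def timed_write(frame, fmt, path):
    """Write `frame` in the given format and return wall-clock seconds (crude, but illustrative)."""
    start = time.time()
    frame.write.mode("overwrite").format(fmt).save(path)
    return time.time() - start

json_secs = timed_write(df, "json", "s3://my-bucket/bench/json/")
parquet_secs = timed_write(df, "parquet", "s3://my-bucket/bench/parquet/")
print(f"JSON: {json_secs:.1f}s  Parquet: {parquet_secs:.1f}s")
```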
The Way Forward: Rethinking Data Movement Patterns
The solution isn’t finding better ways to write JSON - it’s questioning whether JSON should be part of your data pipeline at all. For analytical workloads moving between PySpark and Snowflake, consider these alternatives:
- Direct Connectors: Use the native PySpark-Snowflake connector when possible
- Columnar Formats: Default to Parquet or ORC for S3-based workflows
- Iceberg Tables: Leverage S3 Tables with Apache Iceberg support ↗ for managed table optimization (see the sketch after this list)
- Streaming Approaches: For continuously updated data, consider streaming patterns rather than batch file transfers
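As a sketch of the Iceberg option above - assuming a Spark session already configured with an Iceberg catalog (named my_catalog here purely for illustration) backed by S3 Tables, Glue, or a REST catalog:

```python
# Assumes spark.sql.catalog.my_catalog is configured as an Iceberg catalog
# (e.g. backed by S3 Tables, Glue, or a REST catalog) - all names here are made up.
(
    df.writeTo("my_catalog.analytics.raw_events")
      .using("iceberg")
      .createOrReplace()
)

# Later loads can append rather than rewrite:
# df_new.writeTo("my_catalog.analytics.raw_events").append()
```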
JSON Has Its Place (Just Not Here)
JSON remains valuable for API responses, configuration files, and human-readable data exchange. But for moving massive analytical datasets between distributed systems, it’s the wrong tool for the job.
The next time you’re tempted to write that 60-million-row PySpark DataFrame as JSON to S3, ask yourself: are you solving a data problem or creating a performance one? The most efficient solution might be eliminating the intermediate format altogether.
The real innovation in data engineering isn’t finding better ways to work with inefficient formats - it’s having the courage to question whether those formats belong in your architecture at all.