
JSON in PySpark: The Performance Trap You’re Probably Falling For
Why writing large PySpark DataFrames as JSON to S3 is fundamentally flawed - and what you should do instead
The allure of JSON is undeniable. It’s human-readable, universally understood, and seems like the perfect format for moving data between systems. But when you’re dealing with PySpark DataFrames containing 60+ million rows headed for S3, JSON becomes less of a convenience and more of a performance nightmare.
The JSON Illusion: Why This Approach Feels Right (But Isn’t)
The scenario is painfully common: you need to move massive amounts of data from PySpark to Snowflake via S3. JSON seems like the logical choice - it’s the lingua franca of data exchange, right? Developers often reach for combinations of UDFs and collect_list to bundle everything into neat JSON arrays, only to discover they’ve built a performance trap.
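To make the trap concrete, here’s a minimal sketch of that anti-pattern - assuming a DataFrame called df standing in for the 60M+ row dataset and a purely hypothetical S3 path:

```python
from pyspark.sql import functions as F

# Anti-pattern sketch: collapse the entire DataFrame into one giant JSON array.
# `df` stands in for the 60M+ row DataFrame; the S3 path is hypothetical.
json_array = df.agg(
    F.to_json(F.collect_list(F.struct(*df.columns))).alias("payload")
)

# The global collect_list forces every row into a single aggregation buffer
# before serialization - this is exactly where the memory pressure comes from.
json_array.write.mode("overwrite").text("s3://my-bucket/exports/events_json/")
```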
The fundamental problem isn’t JSON itself - it’s how JSON interacts with distributed computing paradigms. PySpark is designed for distributed processing, but JSON’s structure encourages anti-patterns that undermine this architecture.
The Memory Wall: When 60 Million Rows Become Unmanageable
Consider the original problem: 60+ million rows needing consolidation into JSON arrays. The approach of using collect_list to combine rows into arrays sounds reasonable until you hit memory constraints. Every task that has to collect and serialize millions of rows at once creates memory pressure that can bring your cluster to its knees.
As one developer discovered, this approach includes an unwanted side effect: PySpark adds column names as outer JSON attributes, creating structural mismatches with downstream systems like Snowflake’s COPY INTO command. More critically, the memory overhead becomes prohibitive as dataset size increases.
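That structural mismatch is easy to reproduce on a toy DataFrame (assuming an active SparkSession named spark): to_json wraps every row in an object keyed by its column names, which is the “outer attributes” problem in miniature.

```python
from pyspark.sql import functions as F

toy = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])

# Each row is serialized as an object keyed by its column names,
# e.g. {"id":1,"name":"a"} - the "outer attributes" described above.
toy.select(F.to_json(F.struct(*toy.columns)).alias("value")).show(truncate=False)
```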
The Object Storage Nightmare: 60 Million Files and Counting
When the array approach fails, many developers pivot to writing individual JSON objects - one per row. This seems like a reasonable compromise until you do the math: 60 million rows means 60 million individual files in S3.
S3 may be “unlimited” in capacity, but it has very real limitations when dealing with massive numbers of small objects. Each file operation incurs latency, and listing 60 million objects becomes an operation measured in minutes or hours rather than seconds. The S3 Tables documentation ↗ hints at this challenge, emphasizing the importance of optimized file organization for analytical workloads.
Snowflake’s COPY INTO can handle multiple files, but the overhead of managing millions of tiny JSON files creates operational complexity that scales poorly. The request costs alone - millions of PUT, LIST, and GET operations - can end up significantly higher than with fewer, larger files, since S3 bills per request on top of per-gigabyte storage.
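For context, the Snowflake side of this pattern typically looks something like the sketch below, using the snowflake-connector-python package - the stage, table, and connection details are all hypothetical:

```python
import snowflake.connector  # assumes the snowflake-connector-python package

# Hypothetical connection details.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="LOAD_WH", database="ANALYTICS", schema="RAW",
)

# COPY INTO happily fans out over many files in a stage, but listing and
# loading millions of tiny JSON objects is where the time and money go.
conn.cursor().execute("""
    COPY INTO raw_events
    FROM @my_s3_stage/exports/events_json/
    FILE_FORMAT = (TYPE = JSON)
""")
```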
The Format War: JSON vs. Modern Alternatives
The real controversy emerges when you question why JSON is the requirement in the first place. As developers on data engineering forums point out, if your destination is Snowflake, you’re likely better served by formats like Parquet or ORC.
Amazon S3 Tables ↗ specifically optimizes for analytical formats like Parquet, Avro, and ORC - not JSON. These columnar formats offer better compression, faster query performance, and built-in schema evolution capabilities. The performance difference isn’t marginal - we’re talking about 3x faster queries and significant storage cost reductions.
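If S3 has to stay in the pipeline, switching formats is usually a one-line change. A minimal sketch, again assuming a DataFrame df and a hypothetical bucket:

```python
# Columnar output instead of JSON: a one-line change from .json(...) to .parquet(...).
(
    df.write
      .mode("overwrite")
      .option("compression", "snappy")  # snappy is the usual Parquet default
      .parquet("s3://my-bucket/exports/events_parquet/")
)
```

On the Snowflake side, COPY INTO can load these files with FILE_FORMAT = (TYPE = PARQUET), so nothing in the pipeline ever parses JSON.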
The irony is that many teams insist on JSON for “compatibility” reasons, only to discover they’ve chosen the least compatible format for modern data lake architectures.
The PySpark Connector Solution: Cutting Out the Middleman
Perhaps the most damning revelation from the developer community is the existence of purpose-built solutions that make the entire JSON-to-S3-to-Snowflake pipeline unnecessary. The PySpark connector for Snowflake ↗ allows direct DataFrame-to-Snowflake transfers without intermediate file storage.
This approach eliminates multiple pain points:
- No S3 file management overhead
- No JSON serialization/deserialization costs
- Built-in type mapping between PySpark and Snowflake
- Transactional consistency guarantees
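In code, the direct path is short. Here’s a sketch assuming the Spark-Snowflake connector is on your cluster’s classpath; every connection value below is hypothetical:

```python
# Assumes the Spark-Snowflake connector (and its JDBC driver) is on the classpath.
sf_options = {
    "sfURL": "my_account.snowflakecomputing.com",  # hypothetical account
    "sfUser": "my_user",
    "sfPassword": "...",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "RAW",
    "sfWarehouse": "LOAD_WH",
}

(
    df.write
      .format("snowflake")  # older connector versions use "net.snowflake.spark.snowflake"
      .options(**sf_options)
      .option("dbtable", "RAW_EVENTS")
      .mode("overwrite")
      .save()
)
```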
Yet many teams continue with the JSON approach due to legacy requirements or unfamiliarity with available alternatives.
Partitioning: The Band-Aid That Doesn’t Stop the Bleeding
Some developers suggest partitioning as a solution - breaking the 60 million rows into manageable chunks. While partitioning can help with memory constraints, it doesn’t address the fundamental format inefficiency. You’re still writing JSON, just in smaller batches.
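In practice, the band-aid looks something like this - a repartition caps the memory per task and the number of output files, but every byte written is still JSON (the path and partition count are hypothetical):

```python
# Repartitioning bounds memory per task and yields ~200 larger files
# instead of millions of tiny objects - but the payload is still JSON.
(
    df.repartition(200)
      .write
      .mode("overwrite")
      .option("compression", "gzip")
      .json("s3://my-bucket/exports/events_json_partitioned/")
)
```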
The real issue is that JSON lacks the structural optimizations that make formats like Parquet ideal for analytical workloads. Even partitioned JSON files will underperform compared to properly organized columnar formats.
The Performance Reality: Benchmarks Don’t Lie
When you compare JSON against modern alternatives in PySpark-to-S3 scenarios, the results are stark:
- Write Performance: Parquet writes 2-3x faster than JSON due to better compression and serialization efficiency
- Storage Costs: Parquet files are typically 30-50% smaller than equivalent JSON
- Query Performance: Snowflake can query Parquet files directly without the parsing overhead JSON requires
- Network Transfer: Smaller file sizes mean faster S3 uploads and downloads
These performance gaps only widen as dataset sizes grow from millions to billions of rows.
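Exact numbers depend on your schema, cluster, and compression settings, so measure on your own data. A crude wall-clock sketch, assuming a DataFrame df and a hypothetical bucket:

```python
import time

def timed_write(frame, fmt, path):
    """Write `frame` in the given format and return wall-clock seconds (crude, but illustrative)."""
    start = time.time()
    frame.write.mode("overwrite").format(fmt).save(path)
    return time.time() - start

json_secs = timed_write(df, "json", "s3://my-bucket/bench/json/")
parquet_secs = timed_write(df, "parquet", "s3://my-bucket/bench/parquet/")
print(f"JSON: {json_secs:.1f}s  Parquet: {parquet_secs:.1f}s")
```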
The Way Forward: Rethinking Data Movement Patterns
The solution isn’t finding better ways to write JSON - it’s questioning whether JSON should be part of your data pipeline at all. For analytical workloads moving between PySpark and Snowflake, consider these alternatives:
- Direct Connectors: Use the native PySpark-Snowflake connector when possible
- Columnar Formats: Default to Parquet or ORC for S3-based workflows
- Iceberg Tables: Leverage S3 Tables with Apache Iceberg support ↗ for managed table optimization (see the sketch after this list)
- Streaming Approaches: For continuously updated data, consider streaming patterns rather than batch file transfers
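As a sketch of the Iceberg option above - assuming a Spark session already configured with an Iceberg catalog (named my_catalog here purely for illustration) backed by S3 Tables, Glue, or a REST catalog:

```python
# Assumes spark.sql.catalog.my_catalog is configured as an Iceberg catalog
# (e.g. backed by S3 Tables, Glue, or a REST catalog) - all names here are made up.
(
    df.writeTo("my_catalog.analytics.raw_events")
      .using("iceberg")
      .createOrReplace()
)

# Later loads can append rather than rewrite:
# df_new.writeTo("my_catalog.analytics.raw_events").append()
```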
JSON Has Its Place (Just Not Here)
JSON remains valuable for API responses, configuration files, and human-readable data exchange. But for moving massive analytical datasets between distributed systems, it’s the wrong tool for the job.
The next time you’re tempted to write that 60-million-row PySpark DataFrame as JSON to S3, ask yourself: are you solving a data problem or creating a performance one? The most efficient solution might be eliminating the intermediate format altogether.
The real innovation in data engineering isn’t finding better ways to work with inefficient formats - it’s having the courage to question whether those formats belong in your architecture at all.