AWS Glue’s Dirty Little Secret: Why Small Files Break Everything
AWS Glue promises serverless ETL simplicity, but when you hit it with hundreds of thousands of small files, the bill comes due in compute time, frustration, and architectural complexity. The reality is this: Spark was never designed for the “small file hell” that modern data ingestion patterns create.

The $235 Wake-Up Call
One AWS customer learned this lesson the hard way when their analytics dashboard timed out, costing them $235 in a single failed Athena query. The culprit? Poor partitioning forced Athena to scan three years of historical data instead of just the targeted week’s worth of user events.
But wait, it gets worse. Consider this real-world scenario from developers facing the same issue: 500,000 Parquet files averaging 20KB each, totaling just 10GB of data. The spark.read.parquet(input_path) call alone takes 1.5 hours, despite the relatively small dataset size.
from pyspark.sql import SparkSession

input_path = "s3://…/parsed-data/us/*/data.parquet"
output_path = "s3://…/app-data-parquet/"

def main():
    spark = SparkSession.builder.appName("JsonToParquetApps").getOrCreate()
    print("Reading Parquet from:", input_path)
    df = spark.read.parquet(input_path)  # This line takes 1.5 hours
    print("after spark.read.parquet")
    df_coalesced = df.coalesce(50)
    print("after df.coalesce(50)")
    df_coalesced.write.mode("overwrite").parquet(output_path)
    spark.stop()

if __name__ == "__main__":
    main()
Why Small Files Wreck Glue Performance
The Metadata Avalanche
Spark’s architecture assumes you’re working with reasonably sized files. When you feed it thousands of tiny files, you’re essentially asking it to manage a massive distributed filesystem catalog for what amounts to a dataset that could fit comfortably in memory.
- Driver Memory Pressure: The Spark driver must track every single file, partition, and task
- Scheduler Overhead: Each file becomes a separate task, overwhelming the task scheduler
- I/O Inefficiency: Reading many small files means more S3 API calls and less optimal data loading
- Athena Collateral Damage: “Athena spends more time listing files than reading them” when dealing with files under 10MB
This pattern runs counter to Parquet’s intended sweet spot: files between 128 MB and 512 MB that balance compression efficiency with query performance.
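A back-of-the-envelope calculation makes the overhead concrete. Using the 500,000-file scenario from earlier (the target file size here is an illustrative assumption):

```python
# Back-of-the-envelope overhead estimate for the small-file scenario above.
# Figures are illustrative, not measured values.

def estimate_overhead(num_files, avg_file_bytes, target_file_bytes):
    """Compare Spark task counts before and after compaction."""
    total_bytes = num_files * avg_file_bytes
    # Roughly one Spark task (and at least one S3 GET) per file.
    tasks_before = num_files
    # After compaction, task count scales with data volume, not file count.
    tasks_after = max(1, total_bytes // target_file_bytes)
    return tasks_before, tasks_after

before, after = estimate_overhead(
    num_files=500_000,
    avg_file_bytes=20 * 1024,             # ~20KB per file
    target_file_bytes=256 * 1024 * 1024,  # 256MB target, mid-sweet-spot
)
print(before, after)  # 500,000 tasks collapse to about 38
```

Same 10GB of data, three orders of magnitude fewer tasks for the scheduler to juggle.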

The Architectural Shift: Stop Creating the Problem
The most effective solution isn’t optimization; it’s prevention. Instead of fighting small files after they’re created, engineers are shifting to architectures that avoid them entirely.
Kinesis Data Firehose: The Set-and-Forget Solution
One developer originally facing 2-hour compaction times reported success after switching: “I ended up going with AWS Data Firehose to compact my Parquet files, and it’s working well.” Firehose automatically buffers incoming records and writes them in larger chunks, typically around 128MB based on configuration.
The workflow looks like this:
1. Lambdas write streaming data directly to Firehose
2. Firehose buffers data to optimal file sizes
3. Output lands as properly-sized Parquet files
4. No post-processing compaction required
For high-throughput scenarios, developers suggest pairing SQS with a Glue streaming job: “Lambda write data to SQS queue, Glue streaming job constantly reading from SQS and doing the transformation, stop the job if no files for 10 minutes.”
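The Firehose buffering behavior described above is driven by the destination configuration you pass to boto3’s `firehose.create_delivery_stream`. Here’s a minimal sketch of the relevant settings; the bucket, role, and Parquet-conversion details are placeholders (a real `DataFormatConversionConfiguration` also needs input/output format and a Glue schema reference):

```python
# Sketch of the S3 destination settings for firehose.create_delivery_stream.
# ARNs are hypothetical; this is a partial config, not a full working call.

def firehose_s3_config(bucket_arn, role_arn):
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        # Buffer up to 128MB or 900 seconds, whichever comes first, so
        # output files land near Parquet's sweet spot instead of per-record.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900},
        # Convert incoming JSON records to Parquet on the way in
        # (real usage requires format and schema sub-configs as well).
        "DataFormatConversionConfiguration": {"Enabled": True},
    }

cfg = firehose_s3_config(
    "arn:aws:s3:::my-data-lake",                     # hypothetical bucket
    "arn:aws:iam::123456789012:role/firehose-role",  # hypothetical role
)
print(cfg["BufferingHints"])  # {'SizeInMBs': 128, 'IntervalInSeconds': 900}
```

With buffering set at the 128MB ceiling, Firehose does the compaction for you before the data ever lands in S3.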
When You Inherit the Mess: Glue Optimization Tactics
Spark Configuration Tuning
# Critical settings for small file performance
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024) # 64MB
spark.conf.set("spark.sql.files.openCostInBytes", 4 * 1024 * 1024) # 4MB
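These two settings feed Spark’s split-size formula when it plans read partitions. A simplified pure-Python model of that calculation (based on Spark’s `FilePartition.maxSplitBytes` logic; treat it as an approximation, not the exact source) shows why the open cost matters so much for tiny files:

```python
# Simplified model of how Spark sizes read partitions; approximates
# FilePartition.maxSplitBytes in recent Spark versions.

def max_split_bytes(total_bytes, num_files,
                    max_partition_bytes=64 * 1024 * 1024,  # 64MB, as above
                    open_cost=4 * 1024 * 1024,             # 4MB, as above
                    default_parallelism=200):               # assumed cores
    # Each file "costs" open_cost extra bytes, so hundreds of thousands of
    # tiny files inflate the per-core workload far beyond the real data size.
    bytes_per_core = (total_bytes + num_files * open_cost) // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

# 500,000 x 20KB files: the 4MB open cost dwarfs the 10GB of actual data,
# so Spark hits the maxPartitionBytes cap and packs many files per task.
split = max_split_bytes(total_bytes=500_000 * 20 * 1024, num_files=500_000)
print(split == 64 * 1024 * 1024)  # True
```

Raising `openCostInBytes` from its default nudges Spark to bundle more small files into each partition, trading a little scheduling precision for far fewer tasks.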
File Reading Strategy
Remove wildcards from your paths when possible: use s3://…/parsed-data/us/ instead of s3://…/parsed-data/us/*/data.parquet, and enable .option("recursiveFileLookup", "true") to avoid driver-side listing bottlenecks.
Parallel Processing Considerations
Increasing worker count helps, but only to a point. More workers mean more parallel file listings, but the driver still coordinates everything. The consensus from experienced engineers is clear: “But really, that’s not Spark use case, there is nothing ‘analytical’ about this kind of processing.”

Beyond Spark: Alternative Tooling
DuckDB: “I remember hearing from a coworker in a similar situation that DuckDB massively sped up their ingest compared to their original solution.” DuckDB’s single-node architecture avoids Spark’s distributed coordination overhead, making it surprisingly effective for datasets that fit in memory.
Polars: Another developer suggested “using polars and then manually chunking into 50 partitions to output. I would think it would handle the small files better.” Polars’ Rust-based engine and different parallelism model can outperform Spark for specific file-heavy workloads.
S3 Select: While one developer noted AWS appears to have discontinued certain S3 Select features, the principle remains: serverless SQL engines can sometimes process small files more efficiently than distributed frameworks.
The Iceberg Solution: Built-in Compaction
For teams using modern table formats, AWS Glue now offers automatic compaction features that specifically address the small file problem. Medidata’s journey to a modern lakehouse architecture demonstrates how “AWS Glue Iceberg optimization features that include compaction, snapshot retention, and orphan file deletion provide a set-and-forget experience for solving a number of common Iceberg frustrations, such as the small file problem.”
The key insight from AWS documentation: “Enable automatic compaction through the Data Catalog or use S3 Tables (on by default). You will get hands-free optimization without building custom maintenance jobs.” This eliminates the need for manual compaction scripts and provides continuous file size optimization.
Production-Grade Compaction Strategy
# Production file compaction
df = spark.read.parquet("s3://path/to/small/files/")
df.repartition(10).write.mode("overwrite").parquet("s3://path/to/compacted/")
Why repartition instead of coalesce? Coalesce only merges existing partitions without a shuffle, which is cheap but can leave output files skewed in size. When your bottleneck is read performance (not write), repartition’s full shuffle gives you better control over output file sizes and distribution. The goal is to create files in that 128-512MB sweet spot that analytics engines love.
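Picking the repartition count is simple arithmetic: divide total data size by your target file size. A small helper makes the choice explicit (the 256MB default is an assumption in the middle of the sweet spot; tune it per workload):

```python
# Pick a repartition() count that lands output files near a target size.
# The 256MB default is an illustrative choice within the 128-512MB range.
import math

def partitions_for(total_bytes, target_file_bytes=256 * 1024 * 1024):
    return max(1, math.ceil(total_bytes / target_file_bytes))

# The 10GB dataset from the example above:
print(partitions_for(10 * 1024**3))  # 40
```

So for the 10GB scenario, `df.repartition(40)` would target ~256MB files, while the hard-coded `repartition(10)` above yields ~1GB files, still far healthier than 500,000 shards.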
The Cost of Getting It Wrong
- Request Overhead: More files mean more S3 PUT/LIST requests
- Compute Waste: Glue jobs spending 90% of their time on metadata operations
- Query Costs: Athena scanning more metadata than actual data
- Maintenance Burden: Custom compaction jobs running daily or weekly
One team reported that implementing proper partitioning and compaction strategies “eliminated 80% of ‘full table scans’ and saved ~$2,000/month” on a moderate data lake.
When Prevention Isn’t Possible
- Iceberg Tables: Upgrade to Apache Iceberg V3 tables that support automatic compaction
- Batch Processing: Schedule regular compaction jobs during low-usage windows
- Incremental Approach: Compact oldest/most problematic partitions first
- Cost Monitoring: Set Athena query data scan limits to prevent runaway costs
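That last safeguard maps to a per-query scan cap on an Athena workgroup, set via boto3’s `athena.update_work_group`. A sketch of the update payload (the workgroup name and the 1TB limit are illustrative choices):

```python
# Sketch of an Athena per-query scan cap, as passed to
# athena.update_work_group. Workgroup name and limit are placeholders.

def scan_limit_update(limit_bytes=1024**4):  # 1TB cap, illustrative
    return {
        "WorkGroup": "primary",  # hypothetical workgroup name
        "ConfigurationUpdates": {
            # Athena cancels any query that scans more than this many bytes,
            # turning a runaway $235 query into a fast failure.
            "BytesScannedCutoffPerQuery": limit_bytes,
            "EnforceWorkGroupConfiguration": True,
        },
    }

update = scan_limit_update()
print(update["ConfigurationUpdates"]["BytesScannedCutoffPerQuery"])
```

With enforcement on, the cap applies to every query in the workgroup regardless of client-side settings.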
The Architectural Lesson
The small file problem exposes a fundamental truth about cloud data engineering: storage is cheap, but compute is expensive. Saving a few dollars on development time by writing files immediately can cost hundreds in unnecessary compute overhead.
The most successful teams architect for file size from day one. They choose ingestion patterns (like Firehose) that naturally create optimal file sizes. They implement table formats (like Iceberg) with built-in compaction. And they treat file size as a first-class metric alongside data freshness and quality.
Because in the end, nobody wants to explain why their “simple” data compaction job costs more than their actual data processing, or why it takes longer to list files than to actually analyze the data inside them.