Your S3 Batch Processing Strategy is Probably Terrible

Facing 2 million files in S3? Here’s how to avoid the performance pitfalls and metadata mistakes that kill large-scale processing jobs.

by Andre Banandre

Processing millions of files in Amazon S3 sounds straightforward until you do the math. At one second per file, 2 million files would take you 20+ days of continuous processing. That single fact makes most naive approaches completely impractical for production workloads. The reality of high-volume S3 processing reveals fundamental architectural decisions that can make or break your pipeline.

The Parallelism Imperative: Why Single-Threaded Doesn’t Cut It

When a data engineering newcomer recently asked about processing 2M+ S3 files, the community response was unanimous: parallelization isn’t optional.

The most practical advice came from experienced engineers: “Copy the files locally into 10 folders with 200k each. Start multiple instances of the script against each folder.” This simple folder-based partitioning strategy avoids the complexity of distributed systems while still achieving significant speedup.

Consider the alternatives: running everything sequentially would take weeks, but partitioning into just 10 parallel streams reduces processing time to days. The key insight here is granularity versus overhead – finding the right balance between parallel workers and the coordination overhead they introduce.
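A minimal sketch of that partitioning idea in Python, using only the standard library. The key list and `process_chunk` body are placeholders; in a real job the keys would come from a boto3 `list_objects_v2` paginator and `process_chunk` would do the actual download-and-transform work:

```python
from multiprocessing import Pool

def partition_keys(keys, n_workers):
    """Deal keys round-robin into n_workers roughly equal chunks."""
    chunks = [[] for _ in range(n_workers)]
    for i, key in enumerate(keys):
        chunks[i % n_workers].append(key)
    return chunks

def process_chunk(chunk):
    # Placeholder: download and process each file in this chunk.
    return len(chunk)

if __name__ == "__main__":
    # 2M keys in the real job; 1,000 here to keep the demo fast.
    keys = [f"data/file-{i:07d}.json" for i in range(1000)]
    chunks = partition_keys(keys, 10)        # 10 parallel streams
    with Pool(processes=4) as pool:
        total = sum(pool.map(process_chunk, chunks))
    print(f"processed {total} files")
```

Running ten OS processes against ten folders, as the quoted advice suggests, is the same idea with the operating system doing the scheduling instead of `multiprocessing`.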

The Metadata Management Trap

One of the most controversial decisions in S3 batch processing is how to track progress. The original plan proposed modifying source file metadata to mark completion status, which seems elegant initially but violates core data engineering principles.

As one experienced engineer noted, “One of the standing principles in DE world is not to change source content, for good reason too. Let the ‘state/status’ be captured elsewhere.” This principle protects data integrity and ensures reprocessing capability when schemas evolve or business logic changes.

The better approach? External state tracking through databases like DynamoDB or even simple log files. While writing 2M documents to DynamoDB seems expensive, it’s often cheaper than modifying millions of S3 objects and maintains separation between processing state and source data.
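A sketch of the "simple log file" variant of external state tracking. The class name and interface are illustrative, not from the original discussion; the same `is_done`/`mark_done` interface maps naturally onto DynamoDB with a conditional `PutItem` keyed on the object key:

```python
import os

class LogFileCheckpoint:
    """Append-only checkpoint log: processed keys are recorded outside S3,
    so source objects are never modified and restarts skip finished work."""

    def __init__(self, path):
        self.path = path
        self.done = set()
        if os.path.exists(path):
            with open(path) as f:
                self.done = {line.strip() for line in f if line.strip()}

    def is_done(self, key):
        return key in self.done

    def mark_done(self, key):
        with open(self.path, "a") as f:
            f.write(key + "\n")   # one line per finished object
        self.done.add(key)
```

On restart, the constructor replays the log into a set, so a crashed job resumes exactly where it left off without touching a single source object.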

The problem with S3 metadata checkpointing: object metadata is immutable after upload, so "updating" it actually means copying the object in place over itself. For 2M files, that's 2M extra COPY requests, and with versioning enabled the old copies stick around too, roughly doubling your storage costs.

Infrastructure Architecture Choices: VM, Serverless, or Distributed?

The research reveals three main architectural patterns for high-volume S3 processing:

Single VM with multiprocessing works for one-off jobs but lacks scalability. As the discussion noted, “It’s going to run on a VM due to length of time required to process” – a practical choice but not necessarily the optimal one.

Serverless approaches using AWS Lambda and SQS can provide massive parallelism without infrastructure management. However, Lambda’s execution time limits (15 minutes) and memory constraints require careful design of chunk sizes and error handling.
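One concrete way to design those chunk sizes: size each SQS message so a single Lambda invocation finishes well inside the 15-minute cap. The function below is a back-of-the-envelope sketch; the per-file timing and safety factor are assumptions you would measure for your own workload:

```python
import math

def files_per_message(per_file_seconds, lambda_timeout_s=900, safety_factor=0.8):
    """How many S3 keys to pack into one SQS message so a single
    Lambda invocation finishes well inside its timeout."""
    budget = lambda_timeout_s * safety_factor   # headroom for cold starts and retries
    return max(1, math.floor(budget / per_file_seconds))

# At 1 s per file with a 20% safety margin: 720 files per message,
# so 2M files fan out into ~2,778 messages across concurrent Lambdas.
```

Keeping the batch well under the timeout also means a failed invocation re-drives only a few hundred files from the queue, not the whole job.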

Distributed frameworks like Spark or AWS Glue offer the most sophisticated approach. AWS Glue “automatically scales even the most demanding resource-intensive data processing jobs from gigabytes to petabytes with no infrastructure to manage”, making it ideal for unpredictable workloads.

The choice depends on your requirements:
One-time batch: Simple VM with folder partitioning
Recurring batches: Serverless with S3 event triggers
Real-time streams: Apache Kafka or Kinesis
Variable volumes: AWS Glue or EMR clusters

Cost Optimization: The Hidden S3 Tax

Most engineers focus on compute costs while overlooking the significant data transfer expenses. As one commenter wisely noted, “if there are ingress/egress fees, I’d just make sure everything is in the same location.”

Here’s what they’re talking about:
Cross-region data transfer: $0.02 per GB between most regions
Cross-AZ traffic between EC2 instances: $0.01 per GB each direction (S3 itself is regional, so same-region S3 access avoids this charge)
S3 API requests: $0.0004 per 1,000 GET requests; $0.005 per 1,000 PUT/COPY/LIST requests

For 2M files, even if each file is only 1KB, that's 2GB of data transfer and 2M API calls. The GET requests alone stay under a dollar, but repeated passes, LIST operations, and cross-region transfer of larger files multiply quickly. Using S3 VPC endpoints and keeping processing in the same region keeps these costs close to zero.
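The arithmetic is worth making explicit. A rough estimator for one full read pass over a bucket, with the published S3 Standard prices above as defaults (region-dependent and subject to change, so treat them as assumptions):

```python
def s3_job_cost(n_files, avg_file_gb,
                get_per_1k=0.0004,       # S3 Standard GET price per 1,000 requests (assumed)
                transfer_per_gb=0.02):   # cross-region transfer rate per GB (assumed)
    """Rough request + transfer cost in USD for one full pass over a bucket."""
    request_cost = n_files / 1000 * get_per_1k
    transfer_cost = n_files * avg_file_gb * transfer_per_gb
    return round(request_cost + transfer_cost, 2)

# 2M files at 1 KB each, read cross-region: well under a dollar.
# The same pass at 1 MB each moves ~2 TB and costs ~$40 in transfer alone.
```

The lesson falls out of the numbers: for small files the request count dominates, for large files the transfer dominates, and same-region processing with a VPC gateway endpoint removes the transfer term entirely.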

Monitoring and Resilience Patterns

“When running scripts on VMs, add a Slack function to send messages using webhooks at intervals”, suggested one engineer. This simple practice transforms batch jobs from black boxes into observable systems.

More sophisticated approaches include:
CloudWatch metrics for processing rates and error counts
Dead letter queues for retry management
Checkpointing at manageable intervals (every 1000 files, not every file)
Graceful degradation when downstream systems are overloaded
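A sketch combining two of the patterns above: interval-based checkpointing and progress notification. The `notify` callback stands in for a real Slack webhook POST (endpoint and payload format omitted here); everything else is plain Python:

```python
def run_with_progress(keys, process_one, notify, checkpoint_every=1000):
    """Process keys in order, reporting progress and flushing state
    every checkpoint_every files instead of once per file."""
    batch = []
    for i, key in enumerate(keys, start=1):
        process_one(key)
        batch.append(key)
        if i % checkpoint_every == 0:
            notify(f"processed {i}/{len(keys)} files")
            # flush `batch` to the external state store here, then reset
            batch.clear()
    if batch:
        notify(f"processed {len(keys)}/{len(keys)} files (final)")
    return len(keys)
```

With a real Slack webhook, `notify` would be a small `urllib.request` POST of `{"text": message}` to the webhook URL. Batching the checkpoint writes is the point: one state-store write per thousand files instead of per file cuts write costs and latency by three orders of magnitude.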

The real challenge isn’t just processing – it’s knowing when processing fails and being able to resume efficiently. As the community observed, “Parallelism and restartability are keys to ur use case.”

Tool Selection Spectrum: From DIY to Fully Managed

DIY Python scripts offer maximum flexibility but require handling all error cases, retry logic, and monitoring yourself. Perfect for one-off jobs with simple requirements.

Open-source frameworks like Apache Airflow or Airbyte provide orchestration and some built-in patterns but still require infrastructure management.

Managed services like AWS Glue handle scaling, monitoring, and error recovery automatically but may have less flexibility for custom requirements.

The decision matrix comes down to:
Complexity: How many transformation steps?
Frequency: One-time or recurring?
Team expertise: Python-heavy or GUI-preferred?
Budget: Engineering time vs. cloud costs

The Reality Check: When Simple Wins

Despite all the sophisticated options available, sometimes the simplest approach wins. As the original poster realized after considering various alternatives, “Yeah I was thinking the same. You’re correct, it’s a one off job… The actual challenge comes after this part!”

This highlights a critical insight: Don’t over-engineer one-time data preparations. If the real value comes from subsequent ML processing or analysis, optimize your effort accordingly.

The most elegant solution isn’t always the most complex one. Sometimes partitioning files into folders and running multiple Python processes delivers the best ROI for time-constrained projects.

Key Takeaways for Production Success

After processing millions of files across various projects, here are the non-negotiable patterns:

  1. Always implement checkpointing – but use external state tracking, not S3 metadata modifications
  2. Design for parallelism from day one – sequential processing rarely survives contact with production volumes
  3. Monitor everything – implement progress reporting, error tracking, and cost monitoring
  4. Optimize for your specific constraints – whether it’s time, budget, or team expertise
  5. Keep processing close to data – minimize cross-region transfers and use VPC endpoints

The real art of S3 batch processing isn't just making it work – it's making it work efficiently, reliably, and cost-effectively at scale. Whether you choose VMs with folder partitioning, serverless Lambda functions, or fully managed AWS Glue, the principles remain the same: parallelize intelligently, monitor relentlessly, and keep your processing state separate from your source data.
