Amazon’s AI training datasets contained hundreds of thousands of pieces of child sexual abuse material in 2025, an order of magnitude more than every other major tech company combined. The company detected and removed the content before training its models, but has refused to provide child safety officials with basic information about where the material originated, rendering the reports essentially useless for law enforcement.
The scale is staggering. The National Center for Missing and Exploited Children (NCMEC) processed over 1 million AI-related reports in 2025, a fifteen-fold increase from the previous year. The “vast majority” came from Amazon alone. For context, Amazon’s total CSAM reports across all business units in 2024 were 64,195. In 2025, just their AI training data pipeline generated hundreds of thousands more.
This isn’t a case of Amazon being more diligent at detection. Child safety experts call the company an “outlier” whose behavior raises more questions than answers.
The Detection Gap: What Amazon Found vs. What They’re Sharing
Amazon uses automated hashing tools to compare training data against known CSAM databases, a standard industry practice. The company claims it takes a “deliberately cautious approach” and intentionally over-reports to avoid missing cases, which explains some false positives. But that doesn’t explain the source opacity.
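For readers who want to see what that scanning looks like in practice, here is a minimal sketch of hash-based matching against a known-hash list. Everything in it is an illustrative assumption: the hash value, the file paths, and the use of SHA-256 are stand-ins, since Amazon hasn't disclosed its tooling, and production systems typically rely on perceptual hashes such as PhotoDNA that survive re-encoding rather than plain cryptographic hashes.

```python
import hashlib
from pathlib import Path

# Hypothetical set of known-bad hashes, as supplied by a clearinghouse.
# Real pipelines use perceptual hashes; SHA-256 keeps this sketch self-contained.
KNOWN_HASHES = {
    "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",
}

def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def scan_dataset(root: Path) -> list[Path]:
    """Flag any file in a training corpus whose hash matches the known-bad set."""
    return [p for p in root.rglob("*") if p.is_file() and sha256_of(p) in KNOWN_HASHES]

if __name__ == "__main__":
    for hit in scan_dataset(Path("training_data")):
        # In a real pipeline a match would trigger removal plus a CyberTipline
        # report, ideally recording where the file was scraped from — exactly
        # the provenance NCMEC says Amazon's reports lack.
        print(f"match: {hit}")
```

The point of the sketch is that the matching step is the easy part; recording where each flagged file came from is a bookkeeping decision the pipeline either makes or doesn't.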
When NCMEC officials asked Amazon where the material came from (information essential for tracking down perpetrators and removing content at its source), Amazon’s response was essentially: we don’t know and can’t tell you.
Fallon McNulty, executive director of NCMEC’s CyberTipline, told Bloomberg the reports contained “very little to almost no information” about origin, who shared it, or whether it remains accessible online. “There’s nothing then that can be done with those reports”, she said. “Our team has been really clear with [Amazon] that those reports are inactionable.”
This creates a perverse situation: Amazon fulfills its legal reporting obligation while actively preventing the reports from serving their intended purpose.
Where Did It Actually Come From? The AWS Theory
Amazon claims the data came from “external sources” and “the public web”, but won’t elaborate. The vacuum has fueled speculation across technical communities about the actual provenance.
The most damning theory, one that keeps surfacing in developer forums, suggests Amazon scraped customer data from its own AWS infrastructure. The logic is simple: Amazon has unprecedented access to petabytes of customer data stored in S3 buckets, including content that’s technically public but not intended for scraping. If they trained models on this data without explicit permission, they’d be violating their own terms of service and potentially breaking the law.
Encrypted storage doesn’t prevent this. As one security engineer noted, server-side encryption at rest protects against someone stealing the physical drives, not against the infrastructure provider, which holds the keys, reading the data; only client-side encryption with keys the customer never shares would block that. If Amazon wanted to index and train on customer data, the technical barriers are minimal.
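To make that point concrete, here is a hedged illustration of why server-side encryption is transparent to any caller with read access. It is not a claim about what Amazon did; the bucket and object names are invented.

```python
import boto3

# With server-side encryption (SSE-S3), AWS manages the keys, so any principal
# with read permission on the object gets plaintext back automatically.
s3 = boto3.client("s3")

resp = s3.get_object(Bucket="some-customer-bucket", Key="photos/upload.jpg")  # hypothetical names
plaintext = resp["Body"].read()  # decrypted transparently; the caller never touches key material
```

The encryption protects the bytes on disk, not the data from whoever operates the infrastructure and the key service sitting on top of it.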
The company has every reason to hide this. Admitting they trained on customer data would open them to massive lawsuits and regulatory action. It would also mean they know exactly which AWS customers are hosting CSAM but failed to report them properly, instead just quietly removing the content from their training pipeline.
Amazon’s relationship with NCMEC adds another layer of complexity. The company funds the organization and holds a corporate board seat. While NCMEC officials have been “clear” about the reports being inactionable, they’ve stopped short of public condemnation.
The Industry Comparison: Everyone Else Isn’t Having This Problem
Amazon’s peers report dramatically different numbers. Google, Meta, OpenAI, and Anthropic all scan their training data and report findings to NCMEC. Collectively, they submitted “a handful” of reports in 2025, compared to Amazon’s hundreds of thousands.
These companies manage to provide actionable details. Meta and Google specifically structure their reports so AI-related findings are distinguishable from their other businesses, and they include origin information that helps law enforcement.
Anthropic reported finding no CSAM in its training data. OpenAI and Google found some, but at normal volumes that track with their data acquisition scale.
The discrepancy suggests either:
1. Amazon’s data sourcing is uniquely reckless, vacuuming up high-risk sources others avoid
2. Amazon’s detection is so hypersensitive that its reports are mostly noise
3. Amazon is training on fundamentally different (and more contaminated) data sources
The company claims 99.97% of the reports came from “non-proprietary training data”, meaning public web scrapes. But if that’s true, why aren’t other companies scraping the same sources and finding similar volumes?
The False Positive Defense Doesn’t Hold Up
Amazon leans heavily on the “over-inclusive threshold” explanation, suggesting many reports are false positives. But NCMEC’s issue isn’t with false positives; it’s with missing provenance data.
Even if 90% of Amazon’s reports were false alarms, that would still leave tens of thousands of confirmed CSAM cases with no traceable origin. The volume is so large that even aggressive over-reporting can’t explain it away.
David Rust-Smith, a data scientist at Thorn (which provides CSAM detection tools to tech companies), notes that “if you hoover up a ton of the internet, you’re going to get [CSAM].” But Amazon seems to be hoovering from a different, dirtier corner than everyone else.
Legal and Financial Exposure
This isn’t just a PR problem. Amazon’s behavior potentially violates the spirit of reporting laws designed to protect children. While they technically report the content, providing inactionable reports could be seen as a form of compliance theater, checking a box while undermining the law’s purpose.
The financial implications are significant. Amazon’s $10 billion OpenAI investment and its aggressive AI infrastructure spending create pressure to acquire training data cheaply and quickly. Properly vetting data sources and tracking provenance is expensive and time-consuming. When you’re racing to compete with Google and Microsoft, corners get cut.
If investigators determine Amazon knowingly trained on customer data or failed to report CSAM sources they could identify, the liability could be enormous. The company is already facing scrutiny for replacing 600,000 workers with robotics and driving up electric bills nationwide through data center expansion. Adding child safety violations to that list would be catastrophic.
What This Reveals About AI Development
The CSAM contamination is a symptom of a deeper disease in AI development: the race for scale over safety. Foundation models require massive datasets, and companies are incentivized to acquire them as quickly and cheaply as possible.
This creates a market for shady data brokers and encourages scraping everything accessible, regardless of quality or legality. The assumption is that you can always filter later, which only holds if your filters are good and you know what you’re filtering for.
Training AI on CSAM is uniquely dangerous. It can teach models to:
– Better manipulate images of children
– Generate new CSAM that re-victimizes real children
– Normalize patterns of abuse in generated content
Amazon claims their models haven’t generated CSAM, but the training data contamination alone is a form of perpetuating abuse. Every time that data is used, it spreads the digital fingerprint of real victims.
The Transparency Problem
David Thiel, former chief technologist at Stanford Internet Observatory, argues the industry needs radical transparency about data sourcing and cleaning. “There should be more transparency on how companies are gathering and analyzing the data to train their models, and how they’re training them”, he said.
Amazon’s opacity makes this impossible. They won’t share:
– Specific data sources
– Scraping methodologies
– Filtering criteria
– False positive rates by source
– Why their volume is so much higher than peers
Without this information, independent researchers can’t verify safety claims, and regulators can’t assess compliance. It’s a black box that happens to contain a horrifying amount of child abuse material.
What Happens Next
NCMEC continues to push Amazon for better data, but has limited leverage. The organization depends on tech industry funding, Amazon included, to operate. This creates a conflict of interest that may explain the muted public response.
Regulatory action is possible. California has already launched investigations into AI companies for CSAM concerns. Federal lawmakers could mandate provenance tracking for training data or penalize companies that submit inactionable reports.
For now, Amazon continues its AI development unabated. The company is simultaneously cutting 16,000 jobs while investing billions in AI infrastructure, suggesting they view the CSAM issue as a manageable PR problem rather than a fundamental flaw in their data pipeline.
The Real Cost of “Move Fast and Break Things”
Amazon’s case proves that the “move fast and break things” mentality has no place in AI development when children’s safety is at stake. The company’s detection tools work well enough to find CSAM but not well enough to prevent it from entering their pipeline in the first place.
More importantly, their response prioritizes legal compliance over child protection. They’ve found a loophole: report everything while providing nothing, satisfying the letter of the law while violating its spirit.
The AI community needs to demand better. This means:
– Mandatory provenance tracking for all training data (a sketch of what such a record might contain follows this list)
– Independent audits of data sourcing practices
– Real consequences for companies that can’t explain where their contaminated data came from
– Industry standards that make Amazon’s volumes unthinkable, not just an outlier
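As referenced above, here is a minimal sketch of what a per-item provenance record might contain. The field names and values are hypothetical, drawn from no existing standard and certainly not from Amazon’s pipeline; the point is only how little metadata it would take to make a report actionable.

```python
from dataclasses import dataclass, asdict
import datetime
import json

@dataclass
class ProvenanceRecord:
    """Hypothetical per-item metadata attached to every file in a training corpus."""
    source_url: str          # where the item was scraped or purchased from
    retrieved_at: str        # ISO-8601 timestamp of acquisition
    acquisition_method: str  # e.g. "crawler", "licensed dataset", "user upload"
    dataset_name: str        # which corpus/snapshot the item landed in
    content_sha256: str      # stable identifier for the exact bytes
    still_accessible: bool   # whether the source URL resolved at report time

record = ProvenanceRecord(
    source_url="https://example.org/forum/post/123",
    retrieved_at=datetime.datetime.now(datetime.timezone.utc).isoformat(),
    acquisition_method="crawler",
    dataset_name="web-snapshot-2025-06",
    content_sha256="3a7bd3e2...",
    still_accessible=True,
)

# A report carrying even this much context would let NCMEC route it to law
# enforcement and request takedown at the source.
print(json.dumps(asdict(record), indent=2))
```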
Until then, every AI model trained on mystery data carries the risk of hidden contamination. And for the victims whose abuse images flowed through Amazon’s servers, the company’s silence adds insult to life-altering injury.
The question isn’t whether Amazon can build powerful AI. It’s whether they should be allowed to, given their apparent inability to do so without scraping the darkest corners of the internet, and their refusal to explain how those corners ended up in their training data to begin with.