The Benchmark Is Lying: Qwen Team Exposes Massive Flaws in AI’s Most Trusted Tests
The scoreboard says GPT-5.2 leads on scientific reasoning. Claude dominates graduate-level questions. Gemini crushes abstract problem-solving. But what if the scoreboard itself is rigged?
A month ago, an independent researcher noticed something strange while pushing DeepSeek to its limits. The model kept failing questions it should have aced, yet it wasn’t hallucinating: it was deriving correct answers that contradicted the “gold standard” labels. Hand-verifying the math line by line made the culprit clear: the benchmark data was wrong.
Now the Qwen team has dropped a bombshell paper that doesn’t just confirm those suspicions, it systematically proves that two of AI’s most influential evaluation datasets, GPQA and HLE (Humanity’s Last Exam), are riddled with errors that distort every major model ranking you’ve seen.
The $40 Million Question No One Was Asking
Benchmarks are the currency of AI progress. When a lab claims their model hit 94.3% on GPQA Diamond, that number drives funding decisions, product launches, and the entire narrative of who’s “winning” the AI race. But here’s the problem: nobody was auditing the tests themselves.
The Qwen team’s HLE-Verified paper reveals the uncomfortable truth. Out of 2,500 items in the original HLE dataset:
- 641 items (25.6%) were verified as correct without modification
- 1,170 items (46.8%) contained fixable errors in problems, answers, or rationales
- 689 items (27.6%) remain too ambiguous or flawed to certify
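The split is easy to sanity-check yourself; a few lines of Python reproduce the reported percentages from the raw counts:

```python
# Sanity-check the HLE-Verified breakdown (counts as reported in the paper).
counts = {
    "gold (verified as-is)": 641,
    "revised (fixable errors)": 1170,
    "uncertain (too flawed to certify)": 689,
}

total = sum(counts.values())
assert total == 2500  # matches the original HLE item count

for label, n in counts.items():
    print(f"{label}: {n} items ({n / total:.1%})")
```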
Let that sink in. Nearly three-quarters of the questions used to evaluate our most advanced AI systems have problems that directly impact scoring.
How Bad Is It? The Numbers Don’t Lie
The downstream impact is staggering. When models are evaluated on the cleaned HLE-Verified dataset versus the flawed original, accuracy jumps by 7-10 percentage points on average. But on the subset of items where the original problem statement or answer was outright erroneous? 30-40 percentage point improvements.
This isn’t incremental. This fundamentally reshuffles the leaderboard.
| Model | Raw HLE Score | HLE-Verified Score | Absolute Gain (pts) |
|---|---|---|---|
| GPT-5.2-Thinking | 33.35% | 43.30% | +9.95 |
| Claude-Opus-4.6 | 38.95% | 46.80% | +7.85 |
| Gemini-3-Pro | 40.42% | 48.20% | +7.78 |
| DeepSeek-V3.2 | 24.90% | 36.40% | +11.50 |
On just the revised subset, the gains are more dramatic: GPT-5.2 gains 38.04 points. DeepSeek jumps 39.58 points. These aren’t model improvements, they’re corrections of systematic measurement error.
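The gain column above is straightforward arithmetic on the two score columns, which a quick script can reproduce:

```python
# Recompute absolute gains (in percentage points) from the raw vs.
# HLE-Verified scores reported in the table.
scores = {
    "GPT-5.2-Thinking": (33.35, 43.30),
    "Claude-Opus-4.6": (38.95, 46.80),
    "Gemini-3-Pro": (40.42, 48.20),
    "DeepSeek-V3.2": (24.90, 36.40),
}

gains = {model: round(verified - raw, 2) for model, (raw, verified) in scores.items()}
for model, gain in gains.items():
    print(f"{model}: +{gain:.2f} points")
```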
The Error Taxonomy: Where Benchmarks Break
The Qwen team didn’t just flag problems, they built a 19-category defect taxonomy that reveals why high-difficulty benchmarks fail. The issues aren’t random, they’re structural:
Problem-Level Defects (5 categories)
– Semantic errors: ambiguous or contradictory statements
– Knowledge errors: incorrect factual premises
– Missing information: unstated assumptions required for unique solutions
– Theoretical invalidity: problems that violate accepted domain theory
– Format semantic errors: LaTeX or notation that distorts meaning
Rationale-Level Defects (10 categories)
– Missing prerequisites: critical assumptions omitted
– Circular reasoning: conclusions that depend on themselves
– Empirical soundness violations: steps contradict established facts
– Format semantic errors: 63% of Computer Science rationale defects fall here
Answer-Level Defects (4 categories)
– Incorrect answers: the killer category, accounting for 90% of answer-level defects
– Incomplete or ambiguous responses
– Format errors that impair interpretability
The pattern is clear: answer keys are wrong most often, rationales are incomplete, and problems are ambiguous. This isn’t a labeling error, it’s a fundamental quality control failure.
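To make that structure concrete, here is a minimal sketch of how defect annotations at the three levels could be tallied. The level names follow the taxonomy above, but the sample records and field names are invented for illustration:

```python
from collections import Counter

# Hypothetical annotation records: each flagged item tagged with
# (defect level, category). The three levels mirror the paper's taxonomy;
# the sample data below is made up for illustration only.
annotations = [
    ("answer", "incorrect_answer"),
    ("answer", "incorrect_answer"),
    ("rationale", "format_semantic_error"),
    ("problem", "missing_information"),
    ("answer", "ambiguous_response"),
]

by_level = Counter(level for level, _ in annotations)
print(by_level.most_common(1)[0])  # dominant defect level in this sample
```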
Domain-Specific Disaster Zones
Not all subjects are equally broken. The error distribution reveals systematic blindspots:
High-Validity Domains:
– Mathematics: 90%+ problem validity, but 35-40% of answers are incorrect
– Biology/Medicine: High problem validity, but rationales are predominantly invalid due to omitted clinical prerequisites
Low-Validity Domains:
– Physics: ~30% problem validity, dominated by uncertainty rather than explicit errors
– Engineering: Similar 30-35% validity, with massive uncertainty from unstated conventions
– Humanities/Social Science: Interpretive flexibility makes verification nearly impossible
The takeaway? Your model’s “weakness” in physics might just be its refusal to guess what the benchmark author meant by an underspecified problem.
The Confidence Signal: Models Know When Tests Are Wrong
Here’s the fascinating part: models know something is off. The Qwen team found a strong correlation between model confidence and benchmark errors. On items with flawed problem statements, models express significantly lower confidence. After repair, confidence increases by 1.83 to 11.08 points across all evaluated models.
This means confidence scores aren’t just miscalibrated, they’re diagnostic. When GPT-5.2 says “I’m 60% sure” on a question it should theoretically ace, it’s not being humble. It’s detecting the ambiguity, missing constraints, or contradictory premises that humans missed during benchmark creation.
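Here is a rough sketch of how that signal could be operationalized on your own evaluation runs. The 0.5 threshold, the field names, and the notion of an "expected easy" item are illustrative assumptions, not the paper's method:

```python
# Heuristic: flag benchmark items where a model reports suspiciously low
# confidence on questions it should find easy -- candidates for a manual
# audit of the question and gold label. Threshold and schema are assumed.

def flag_suspect_items(results, threshold=0.5):
    """results: dicts with 'item_id', 'confidence' (0-1), 'expected_easy'."""
    return [
        r["item_id"]
        for r in results
        if r["expected_easy"] and r["confidence"] < threshold
    ]

results = [
    {"item_id": "hle-0412", "confidence": 0.93, "expected_easy": True},
    {"item_id": "hle-1077", "confidence": 0.41, "expected_easy": True},   # suspect
    {"item_id": "hle-2209", "confidence": 0.38, "expected_easy": False},  # just hard
]
print(flag_suspect_items(results))
```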
From Discovery to Verification: The Two-Stage Fix
The community actually caught this first. A researcher running the DeepSeek-Overclock project forensically audited HLE a month ago, writing Python scripts to verify math from first principles. The Qwen team’s contribution is turning that forensic approach into a systematic pipeline:
Stage I: Component-wise Verification
– Domain experts review each item’s problem, answer, and rationale
– Multiple frontier models attempt solutions (pass@8 sampling)
– Items enter the gold subset only if both problem and answer are unproblematic
Stage II: Systematic Revision
– Two independent expert teams propose fixes preserving original intent
– Model-assisted auditing provides auxiliary evidence
– Final adjudication selects canonical repairs or routes to uncertain set
The result: 1,811 certified items (641 gold + 1,170 revised) with full revision metadata, plus 689 explicitly documented uncertain items for community refinement.
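The routing logic behind those three buckets reduces to a small decision rule. This sketch follows the description above, though the function and flag names are illustrative rather than taken from the paper's code:

```python
# Route an HLE item per the two-stage pipeline described above: "gold" only
# if expert review finds both problem and answer unproblematic; "revised" if
# adjudication produced a canonical repair; otherwise "uncertain".

def route_item(problem_ok: bool, answer_ok: bool, has_canonical_repair: bool) -> str:
    if problem_ok and answer_ok:
        return "gold"
    if has_canonical_repair:
        return "revised"
    return "uncertain"

assert route_item(True, True, False) == "gold"
assert route_item(True, False, True) == "revised"   # answer was wrong but fixable
assert route_item(False, False, False) == "uncertain"
```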
Why This Matters Beyond HLE
This isn’t just about one benchmark. The same patterns likely infect GPQA, MMLU, and every other high-stakes evaluation. When FutureHouse independently audited HLE, they estimated only 51.3% of items were research-supported. The Qwen team’s work validates that estimate with rigorous methodology.
The implications ripple across the entire AI ecosystem:
- Model rankings are fragile: A 7-10 point swing is enough to reorder “best-in-class” claims
- Training data contamination: If benchmarks are wrong, fine-tuning on them teaches models to reproduce errors
- Research waste: PhD students and corporate labs may be optimizing for flawed metrics
- Benchmark saturation: Claims that models are “saturating” benchmarks may reflect benchmark decay, not capability ceilings
The Qwen Pattern: From Models to Measurement
What’s particularly noteworthy is that this verification work comes from the same team delivering state-of-the-art models like Qwen3-VL, Qwen-Image-2512, and Qwen Next. While other labs focus on scaling parameters, Qwen is systematically attacking the infrastructure of AI evaluation itself.
Their recent releases, whether it’s Qwen3-VL running locally or Qwen3 Coder Next challenging cloud APIs, consistently push beyond synthetic benchmarks toward real-world performance. This verification work suggests they understand a critical truth: better models require better measurements.
Even their Qwen3-TTS latency claims sparked benchmark scrutiny, mirroring the broader dataset concerns. When a team builds systems that perform well in production but not on leaderboards, they start questioning the leaderboards.
What Practitioners Should Do Right Now
If you’re a researcher:
– Stop treating benchmark scores as ground truth. Demand revision histories and error taxonomies.
– Use confidence scores as diagnostic tools. Low confidence on “easy” questions suggests data issues.
– Prioritize benchmarks with transparent verification protocols like HLE-Verified.
If you’re a developer:
– Evaluate models on your own tasks, not just leaderboards. The llama.cpp integration of Qwen3 makes local testing accessible.
– When models “fail” on benchmarks, investigate. They might be correctly solving incorrectly labeled problems.
– For coding tasks, consider Qwen3 Coder Next which challenges benchmarks with real-world performance rather than synthetic scores.
If you’re a decision-maker:
– Treat model vendor claims with appropriate skepticism. Ask how they verified their evaluation data.
– Recognize that “state-of-the-art” is often a marketing term built on unverified foundations.
– Invest in internal evaluation frameworks that mirror your actual use cases.
The Uncomfortable Truth About Benchmark Culture
The AI community has a benchmark problem. We reward labs for posting new SOTA numbers without requiring them to prove their tests are valid. We treat evaluation as a solved problem while ignoring mounting evidence that our yardsticks are warped.
The Qwen team’s work isn’t just a technical contribution, it’s a call to action. They’ve shown that systematic verification is possible, that models can detect errors we miss, and that cleaning up evaluation infrastructure yields immediate, substantial improvements in measurement accuracy.
But it also means every headline you’ve read about model capabilities in the past year might be off by 7 to 40 percentage points. That’s not a rounding error. That’s the difference between “Claude beats GPT” and “GPT beats Claude.” Between “Gemini leads” and “Gemini trails.”
The next time you see a benchmark score, ask yourself: Is this measuring intelligence, or measuring who memorized the most flawed data?
The Qwen team just proved we can’t tell the difference. Yet.