The Benchmark Is Lying: Qwen Team Exposes Massive Flaws in AI’s Most Trusted Tests
GPQA and HLE, the benchmarks that determine which AI models lead the pack, are fundamentally broken. The Qwen team's systematic verification reveals incorrect reference answers, ambiguous problems, and flawed questions that artificially deflate model scores by up to 40%.