The Benchmark Is Lying: Qwen Team Exposes Massive Flaws in AI’s Most Trusted Tests

GPQA and HLE, two benchmarks that determine which AI models lead the pack, are fundamentally broken. The Qwen team’s systematic verification reveals incorrect answers, ambiguous problems, and other recurring errors that artificially deflate model scores by up to 40%.

#ai evaluation #data quality #GPQA...