
LLM Benchmarks: Why 'Top 50 Humans' Might Be Better Than MMLU
A new subjective benchmarking approach reveals what standardized tests miss about AI model capabilities and training data overlap.
Traditional LLM benchmarks have become the standardized tests of the AI world: they measure what’s easy to measure, not necessarily what matters. While MMLU and HellaSwag provide neat scores, they often miss the nuanced differences that actually distinguish one model from another. But a new approach built on subjective human lists is turning benchmarking on its head, revealing surprising insights about model similarity and training data contamination.
The Problem with Objective Benchmarks
Most LLM evaluation relies on standardized tests with predefined answers. As researchers noted in a recent integrated evaluation framework study ↗, this approach has a fundamental flaw: “If evaluation solely focuses on the presence of the term ‘vegetable,’ both responses would incorrectly be considered equally correct.” The problem is that traditional benchmarks prioritize measurable correctness over qualitative understanding.
Consider this: when two models answer “What is a tomato?” and both mention “vegetable”, traditional benchmarks might score them equally. But one model might provide a scientifically grounded explanation based on botanical definitions, while another relies on subjective, everyday perceptions. The standardized test misses this crucial distinction entirely.
The Subjective List Approach: A New Benchmarking Frontier
The breakthrough comes from an unexpected direction: asking models to generate subjective lists. One researcher’s approach stands out: ask LLMs to list their “top 50 best humans currently alive” and analyze the results using the Rank-Biased Overlap (RBO) measure.
This methodology, shared through a Python implementation ↗, creates what the developer calls a “fingerprint for identifying training data overlap.” The premise is simple: models trained on similar data will produce similar subjective rankings, even for highly opinion-based questions.
The approach uses sophisticated name consolidation with fuzzy matching (90% similarity threshold) and RBO calculations with a persistence parameter of 0.95. The result is a network graph that visually represents model similarity based on their subjective outputs.
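To make the mechanics concrete, here is a minimal sketch of a truncated RBO calculation. The persistence parameter p=0.95 mirrors the value described above, but the function name and example lists are illustrative, not taken from the original implementation.

```python
def rbo(list_a, list_b, p=0.95):
    """Truncated Rank-Biased Overlap: (1 - p) * sum_d p^(d-1) * |A_d ∩ B_d| / d."""
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        agreement = len(seen_a & seen_b) / d  # proportional overlap at depth d
        score += (p ** (d - 1)) * agreement   # geometric weights favor the top of the list
    return (1 - p) * score

# Two hypothetical top-3 rankings: agreement near the top dominates the score.
print(rbo(["Person A", "Person B", "Person C"],
          ["Person B", "Person A", "Person D"]))
```

Because the weights decay geometrically, swaps near the top of a list move the score far more than swaps near the bottom, which is the behavior you want when comparing “top 50” rankings.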
Initial Results: Gemini and Grok Under the Microscope
Initial testing with Gemini and Grok revealed fascinating patterns. The RBO-based similarity matrix showed measurable differences in how these models approach subjective ranking tasks. While both models could complete the task, their ranking methodologies and underlying biases differed significantly.
What makes this approach particularly powerful is its ability to detect subtle training data contamination. As one developer noted, models that “vibe with certain people” tend to cluster together in the similarity network. This suggests that subjective list generation can reveal training data relationships that traditional benchmarks miss.
The methodology aligns with broader trends in AI evaluation. Recent frameworks like MCP-Bench ↗ are pushing for more realistic evaluation scenarios that test “long-term planning and fuzzy instructions across multiple tools.” The subjective list approach fits neatly into this paradigm: it tests models on tasks that lack clear right answers but require nuanced understanding.
Why Subjective Benchmarks Matter More Than You Think
The implications extend beyond academic curiosity. As CEBench researchers discovered ↗, traditional benchmarks often “overlook economic considerations, making their findings less applicable to practical scenarios.” Subjective benchmarking provides a more realistic assessment of how models will perform in real-world applications where answers aren’t neatly categorized as right or wrong.
Consider the educational application explored in a PLOS One study ↗, where AI evaluations using structured prompts showed 70.8% agreement with expert human assessments. The structured prompts that provided specific evaluation criteria significantly outperformed unstructured approaches, which showed only 7.6% agreement. This demonstrates that well-designed subjective evaluation can achieve human-level reliability.
The Technical Implementation: How It Works
The methodology involves several sophisticated components:
- Name normalization: Basic cleaning removes titles and formatting inconsistencies
- Fuzzy matching consolidation: Groups similar names using rapidfuzz with a 90% similarity threshold (see the sketch after this list)
- Rank-Biased Overlap calculation: Measures list similarity with a focus on top-ranked items
- Network visualization: Creates provider-colored similarity graphs using matplotlib
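As referenced above, a hedged sketch of the consolidation step might look like the following. It assumes rapidfuzz’s token_sort_ratio scorer and the 90% cutoff mentioned in the article; the function and example names are purely illustrative.

```python
from rapidfuzz import fuzz, process

def consolidate(names, threshold=90):
    """Map near-duplicate spellings (reordered or re-punctuated names) to one canonical form."""
    canonical, mapping = [], {}
    for name in names:
        # Look for an already-seen name that is at least `threshold`% similar.
        match = process.extractOne(name, canonical,
                                   scorer=fuzz.token_sort_ratio,
                                   score_cutoff=threshold)
        if match:
            mapping[name] = match[0]   # reuse the existing canonical spelling
        else:
            canonical.append(name)     # first sighting: becomes its own canonical form
            mapping[name] = name
    return mapping

# "Doe, Jane" collapses onto "Jane Doe"; "John Smith" stays separate.
print(consolidate(["Jane Doe", "Doe, Jane", "John Smith"]))
```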
The Python implementation handles multiple samples per model (typically 5 samples) and aggregates results to create representative lists. The system includes provider mapping for major AI companies and uses color-coding to visualize relationships in the resulting network graphs.
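Putting the pieces together, the aggregation and visualization stage might be sketched as follows. This assumes networkx for graph construction alongside the matplotlib rendering the article mentions, and it reuses the rbo function from the earlier sketch; the model names, lists, and styling are illustrative, not the project’s actual code.

```python
import itertools
import matplotlib.pyplot as plt
import networkx as nx

# One consolidated, representative ranking per model (in practice aggregated
# over ~5 samples and far longer than shown here).
model_lists = {
    "gemini": ["Person A", "Person B", "Person C"],
    "grok":   ["Person B", "Person A", "Person D"],
}

graph = nx.Graph()
for (m1, l1), (m2, l2) in itertools.combinations(model_lists.items(), 2):
    graph.add_edge(m1, m2, weight=rbo(l1, l2, p=0.95))  # rbo() from the earlier sketch

# Thicker edges for more similar models; provider color-coding omitted for brevity.
widths = [5 * graph[u][v]["weight"] for u, v in graph.edges]
nx.draw(graph, with_labels=True, width=widths)
plt.show()
```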
Beyond “Top Humans”: Expanding the Methodology
The current implementation focuses on “top humans” as a test case, but the researchers envision expanding to multiple categories. This expansion would provide larger sample sizes and more diverse testing scenarios. Potential categories could include:
- Most influential historical figures
- Best scientific discoveries
- Most important philosophical concepts
- Greatest artistic achievements
Each category would test different aspects of model knowledge and reasoning, creating a multidimensional similarity profile.
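A rough sketch of what such a profile could reduce to in code, again reusing the rbo function from above; the helper name and category keys are hypothetical.

```python
def similarity_profile(model_a_lists, model_b_lists, p=0.95):
    """One RBO score per shared category: a multidimensional fingerprint for a model pair."""
    return {category: rbo(model_a_lists[category], model_b_lists[category], p)
            for category in model_a_lists if category in model_b_lists}
```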
The Controversy: Is Subjective Better?
Not everyone is convinced. Critics argue that subjective benchmarks introduce their own biases and lack the reproducibility of standardized tests. However, proponents counter that subjective evaluation better reflects real-world usage, where LLMs often operate in ambiguous, opinion-based contexts.
The debate mirrors earlier discussions in educational assessment, where standardized testing faced similar criticisms for prioritizing measurable outcomes over deeper understanding. As one education researcher noted, “Expert evaluations may include subjective interpretations and deep expertise, potentially resulting in different outcomes compared to AI evaluations.”
The Future of LLM Evaluation
The subjective list approach represents a shift toward more holistic evaluation methodologies. As Arize’s evaluation platform ↗ demonstrates, the industry is moving beyond simple accuracy metrics toward multifaceted assessment that includes factors like toxicity, contextual appropriateness, and factual accuracy.
The methodology’s open-source nature means it could become a standard tool for detecting training data overlap, a growing concern as models are increasingly trained on similar internet-scale datasets. With proper standardization and validation, subjective benchmarking could complement existing approaches, providing a more complete picture of model capabilities.
What’s clear is that the era of one-size-fits-all benchmarking is ending. The future belongs to multidimensional evaluation frameworks that acknowledge the complexity of intelligence, whether artificial or human. The subjective list methodology may seem unconventional, but it’s precisely this unconventional thinking that drives progress in understanding what our AI systems can truly do.