Claude Sonnet 4.5 Eviscerates GPT-5-Codex on Real Coding Challenges

SWE-rebench results show Claude leading decisively with a 55.1% pass@5 and uniquely fixing bugs that left OpenAI’s flagship coding model behind
October 15, 2025

The polite fiction that every new AI model represents a linear advancement in capability just got demolished by hard data from the trenches. The latest SWE-rebench results show Claude Sonnet 4.5 achieving a decisive 55.1% pass@5 on fresh September 2025 GitHub PR bug-fix tasks, outperforming GPT-5-Codex’s 44.9% and solving problems that stumped every other model tested.
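For readers parsing the leaderboard below, the two headline metrics differ: the resolved rate is the average share of problems fixed in a single attempt, while pass@5 counts a problem as solved if any of five attempts succeeds. A minimal sketch of the standard unbiased pass@k estimator (the one popularized by the HumanEval paper) follows; SWE-rebench’s exact aggregation may differ.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k sampled
    attempts succeeds, given c successes observed in n total attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: a problem attempted 5 times with 2 successful runs.
print(pass_at_k(n=5, c=2, k=5))  # 1.0 -- any success among the 5 attempts counts
print(pass_at_k(n=5, c=2, k=1))  # 0.4 -- expected single-attempt resolved rate
```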

SWE-rebench results showing Claude Sonnet 4.5 outperforming GPT-5-Codex

| Rank | Model | Resolved Rate (%) | Resolved Rate SEM (±) | Pass@5 (%) | Cost per Problem ($) | Tokens per Problem |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.5 | 44.5% | 1.00% | 55.1% | $0.89 | 1,797,999 |
| 2 | gpt-5-codex | 41.2% | 0.76% | 44.9% | $0.86 | 1,658,941 |
| 3 | Claude Sonnet 4 | 40.6% | 1.08% | 46.9% | $0.91 | 1,915,332 |
| 4 | Claude Opus 4.1 | 40.2% | 0.77% | 44.9% | $4.03 | 1,675,141 |
| 5 | gpt-5-2025-08-07-medium | 38.8% | 1.29% | 44.9% | $0.72 | 1,219,914 |
| 6 | gpt-5-mini-2025-08-07-medium | 37.1% | 1.19% | 44.9% | $0.32 | 1,039,942 |
| 7 | gpt-5-2025-08-07-high | 36.3% | 2.08% | 46.9% | $1.05 | 1,641,219 |
| 8 | o3-2025-04-16 | 36.3% | 1.98% | 46.9% | $1.33 | 1,404,415 |
| 9 | Qwen3-Coder-480B-A35B-Instruct | 35.7% | 1.51% | 44.9% | $0.59 | 1,466,625 |
| 10 | GLM-4.5 | 35.1% | 1.35% | 44.9% | $0.92 | 1,518,166 |
| 11 | Grok 4 | 34.6% | 2.07% | 44.9% | $1.53 | 1,168,808 |
| 12 | GLM-4.5 Air | 31.0% | 2.18% | 42.9% | $0.32 | 1,578,223 |
| 13 | gpt-5-2025-08-07-minimal | 30.6% | 0.65% | 46.9% | $0.32 | 629,319 |
| 14 | Grok Code Fast 1 | 30.1% | 2.11% | 42.9% | $0.04 | 957,736 |
| 15 | gpt-oss-120b | 28.7% | 1.30% | 42.9% | $0.04 | 1,161,946 |
| 16 | Qwen3-235B-A22B-Instruct-2507 | 28.6% | 1.83% | 40.8% | $0.18 | 899,731 |
| 17 | gpt-4.1-2025-04-14 | 28.4% | 1.85% | 42.9% | $0.48 | 518,584 |
| 18 | o4-mini-2025-04-16 | 27.3% | 1.53% | 44.9% | $0.95 | 1,726,082 |
| 19 | Kimi K2 Instruct 0905 | 25.9% | 2.02% | 40.8% | $1.10 | 1,815,589 |
| 20 | DeepSeek-V3.1 | 24.9% | 2.47% | 42.9% | $0.41 | 1,509,692 |
| 21 | Qwen3-Coder-30B-A3B-Instruct | 23.3% | 1.38% | 32.7% | $0.06 | 584,337 |
| 22 | Qwen3-235B-A22B-Thinking-2507 | 22.4% | 1.29% | 32.7% | $0.14 | 512,537 |
| 23 | DeepSeek-V3-0324 | 22.1% | 0.72% | 30.6% | $0.17 | 324,623 |
| 24 | gemini-2.5-pro | 21.4% | 1.07% | 34.7% | $0.59 | 1,111,184 |
| 25 | DeepSeek-R1-0528 | 20.0% | 2.08% | 30.6% | $0.63 | 679,114 |
| 26 | Qwen3-Next-80B-A3B-Instruct | 19.7% | 1.35% | 32.7% | $0.23 | 444,208 |
| 27 | gemini-2.5-flash | 17.6% | 0.99% | 30.6% | $0.17 | 1,384,944 |
| 28 | gpt-4.1-mini-2025-04-14 | 15.9% | 2.36% | 36.7% | $0.21 | 1,217,617 |
| 29 | Qwen3-30B-A3B-Thinking-2507 | 13.1% | 1.38% | 26.5% | $0.05 | 436,857 |
| 30 | Qwen3-30B-A3B-Instruct-2507 | 10.2% | 1.44% | 26.5% | $0.12 | 1,128,517 |
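One normalization worth doing on these numbers: the table reports average cost per problem regardless of outcome, so dividing by the resolved rate gives a rough cost per successfully resolved problem. A quick sketch using a few rows from the table above:

```python
# Rough cost-effectiveness from the leaderboard:
# cost per *resolved* problem = avg cost per problem / resolved rate.
rows = {
    "Claude Sonnet 4.5": (0.89, 0.445),
    "gpt-5-codex":       (0.86, 0.412),
    "Claude Opus 4.1":   (4.03, 0.402),
    "Grok Code Fast 1":  (0.04, 0.301),
}

for model, (cost_per_problem, resolved_rate) in rows.items():
    print(f"{model:20s} ${cost_per_problem / resolved_rate:.2f} per resolved problem")
```

On that crude measure, Sonnet 4.5 (about $2.00 per resolved problem) and GPT-5-Codex (about $2.09) are nearly tied, while Opus 4.1's premium (roughly $10 per resolved problem) stands out.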

The Numbers Don’t Lie: Claude’s Clear Edge

The Nebius evaluation ran 25+ models against 49 real GitHub PR bug-fix tasks created just last month, giving the most current snapshot of coding-model performance available. What emerged wasn’t just another incremental improvement; it was a paradigm shift in how we think about AI coding capabilities.

Claude Sonnet 4.5 didn’t just edge out the competition; it dominated. The model solved several complex instances that no other model cracked, including the python-trio/trio-3334 exception wrapping issue, the cubed-dev/cubed-799 region support implementation, and the canopen-python/canopen-613 PDO selection bug.

Early adopters are already noticing the difference. Replit President Michele Catasta reported “We went from 9% error rate on Sonnet 4 to 0% on our internal code editing benchmark”, while Cognition’s Scott Wu noted “Claude Sonnet 4.5 increased planning performance by 18% and end-to-end eval scores by 12%, the biggest jump we’ve seen since Claude 3.6.”

Where the Rubber Meets the Road: Real Development Scenarios

The gap becomes even more pronounced when examining actual development workflows. Developers testing both models report distinct performance patterns that align with the benchmark findings.

One developer spent $104 comparing models on 135k+ lines of Rust code and found Sonnet 4.5 “excels at testing its own code, enabling Devin to run longer, handle harder tasks, and deliver production-ready code.” The consensus emerging from practical testing suggests Claude’s architecture handles complex, multi-file contexts more reliably, while GPT-5-Codex occasionally struggles with architectural coherence across larger codebases.

The cost-performance calculus is more nuanced in practice. In head-to-head e-commerce app development tests, Sonnet 4.5 with Claude Code used approximately 18M input tokens and 117k output tokens, costing around $10.26, while GPT-5-Codex consumed roughly 600k input and 103k output tokens for just $2.50. That fourfold price gap narrows once Claude’s higher success rate and reduced iteration needs are factored in.

Architectural Underpinnings: Why Claude Excels

Claude Sonnet 4.5’s performance surge appears rooted in Anthropic’s refined hybrid reasoning system, which modulates compute allocation within a single unified model rather than routing requests between separate sub-models, as GPT-5’s architecture does.

This approach eliminates routing overhead but potentially allocates more compute than strictly necessary for simple tasks, a trade-off that pays dividends on complex, multi-step coding challenges. Claude’s ability to sustain focus on projects for over 30 hours gives it an edge on real-world development timelines where context retention matters more than raw speed.

SWE-bench Verified, which measures AI models’ ability to resolve real-world software issues, shows Claude Sonnet 4.5 scoring 77.2%, up from 72.7% for Sonnet 4, with further gains when parallel test-time compute is enabled. This is the highest score yet reported on the benchmark and suggests architectural improvements specifically targeting software engineering workflows.
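“Parallel compute” here refers to test-time scaling: sampling several candidate patches and keeping one that passes the project’s test suite. A minimal sketch of that selection loop is below; generate_patch and run_tests are hypothetical stand-ins, not Anthropic’s or SWE-bench’s actual harness.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_patch(issue: str, seed: int) -> str:
    """Hypothetical stand-in: one independent model attempt at a fix."""
    raise NotImplementedError

def run_tests(patch: str) -> bool:
    """Hypothetical stand-in: apply the patch and run the repo's test suite."""
    raise NotImplementedError

def best_of_n(issue: str, n: int = 5) -> str | None:
    """Sample n candidate patches in parallel and keep one that passes tests."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda seed: generate_patch(issue, seed), range(n)))
    for patch in candidates:
        if run_tests(patch):
            return patch
    return None  # no attempt passed; the problem counts as unresolved
```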

The Developer Experience Divide

Beyond raw performance numbers, developers report meaningful differences in how these models approach coding tasks.

Many developers prefer Claude’s coding style, describing it as more “human-like” and cohesive. As one developer noted, “Claude outdid GPT-5 in frontend implementation and GPT-5 outshone Claude in debugging and implementing backend.” This specialization pattern suggests teams might benefit from using multiple models for different phases of development.

Claude Code’s interface improvements, particularly the checkpoint system that automatically saves state before changes, provide a safety net that encourages bolder refactoring attempts. This reduces the “hallucination anxiety” that plagues AI-assisted coding, where developers fear irreversible damage to their codebase.
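The checkpoint idea is easy to approximate in any workflow, even outside Claude Code. Below is a minimal sketch using plain git; it illustrates the concept, is not Claude Code’s actual mechanism, and only covers tracked files.

```python
import subprocess

def git(*args: str) -> str:
    """Thin wrapper around the git CLI; raises on failure."""
    result = subprocess.run(["git", *args], check=True, capture_output=True, text=True)
    return result.stdout.strip()

def checkpoint(label: str) -> str:
    """Snapshot tracked files before an AI edit.

    `git stash create` records the working-tree state as an unreferenced
    commit without touching HEAD, the index, or the files themselves.
    Returns the snapshot's commit id (falls back to HEAD if the tree is clean).
    """
    return git("stash", "create", label) or git("rev-parse", "HEAD")

def rollback(snapshot: str) -> None:
    """Restore tracked files to the checkpointed state after a bad AI edit."""
    git("checkout", snapshot, "--", ".")
```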

The Open-Source Surprise: Qwen3-Coder’s Strong Showing

While the Claude vs. OpenAI battle grabs headlines, the SWE-rebench results reveal another significant development: Qwen3-Coder emerged as the best open-source performer, beating established models like Gemini-2.5-Pro and DeepSeek-R1-0528.

This suggests we’re entering an era of specialization where different models excel at different aspects of coding. GLM-4.5 trailed Qwen3-Coder closely, and developers are eagerly awaiting the GLM-4.6 release to see if it can claim the open-source crown.

The competitive landscape is becoming more nuanced, with developers reporting that “Claude feels noticeably faster and smarter than the previous powerhouse, Claude Opus”, particularly in “areas like finance, statistics, and data dashboards.”

Practical Implications for Development Teams

For engineering leaders, these results demand a reconsideration of AI tooling strategy. The one-size-fits-all approach to AI coding assistants is quickly becoming obsolete as models specialize.

Teams working on complex, multi-file projects with extensive context requirements might prioritize Claude for its superior architectural comprehension, while those focused on rapid prototyping or debugging might still prefer GPT-5-Codex’s cost-efficiency for simpler tasks.
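In practice, that split can be encoded as a thin routing shim in the team’s tooling. The sketch below is purely illustrative: the task fields, thresholds, and model identifier strings are assumptions, not vendor guidance.

```python
def pick_model(task: dict) -> str:
    """Illustrative routing policy; criteria and model names are assumptions."""
    if task["files_touched"] > 3 or task["needs_long_context"]:
        return "claude-sonnet-4-5"   # complex, multi-file work with heavy context
    if task["kind"] in {"debug", "hotfix"}:
        return "gpt-5-codex"         # surgical fixes at lower cost per attempt
    return "gpt-5-mini"              # cheap default for quick prototypes

task = {"files_touched": 7, "needs_long_context": True, "kind": "refactor"}
print(pick_model(task))  # claude-sonnet-4-5
```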

The cost-benefit analysis becomes particularly interesting at scale. While GPT-5 maintains pricing advantages ($1.25/$10 per million tokens vs Claude’s $3/$15), Claude’s higher success rate and reduced iteration overhead can make it more cost-effective for complex projects where failed attempts accumulate significant token costs.
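A back-of-the-envelope way to run that comparison is to fold the success rate into the price: if an attempt costs c and succeeds with probability p, a naive independent-retry model puts the expected cost of a working fix at c/p. The sketch below uses the list prices quoted above, but the token usage and success rates are placeholders; the point is the shape of the calculation, not the specific answer.

```python
# Expected cost per working fix under independent retries: cost_per_attempt / success_rate.
# Prices are list prices ($ per million input/output tokens); usage and rates are placeholders.
PRICES = {"Claude Sonnet 4.5": (3.00, 15.00), "GPT-5": (1.25, 10.00)}

def attempt_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

def expected_cost(model: str, input_tokens: int, output_tokens: int,
                  success_rate: float) -> float:
    return attempt_cost(model, input_tokens, output_tokens) / success_rate

# Hypothetical task profile: 200k input / 20k output tokens per attempt.
for model, rate in [("Claude Sonnet 4.5", 0.60), ("GPT-5", 0.40)]:
    print(f"{model}: ${expected_cost(model, 200_000, 20_000, rate):.2f} per successful fix")
```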

The Future of AI-Assisted Development

These benchmark results signal a maturing of the AI coding ecosystem. We’re moving beyond simple code completion toward genuinely collaborative development partnerships where models understand project architecture, maintain context across extended sessions, and handle increasingly complex software engineering tasks autonomously.

The ability to solve real GitHub PRs that stumped human developers for days represents a milestone in AI capability. As one developer building with both systems observed, “Codex delivered the entire recommendation engine with fewer retries and far fewer context errors. Claude’s output looked cleaner on the surface, but Codex’s results actually held up in production.”

What emerges is a more nuanced picture: Claude excels at architectural coherence and long-term project maintenance, while GPT-5-Codex handles surgical fixes and debugging with precision. The best development teams will likely learn to leverage both strengths rather than declaring unilateral superiority.

The coding assistant wars have moved from marketing claims to measurable performance on real engineering challenges, and right now, Claude Sonnet 4.5 holds the advantage where it matters most: shipping working code that passes production tests.
