Claude Sonnet 4.5 Eviscerates GPT-5-Codex on Real Coding Challenges

SWE-rebench results show Claude leading decisively with a 55.1% pass@5 and uniquely fixing bugs that left OpenAI’s flagship coding model behind
October 15, 2025

The polite fiction that every new AI model represents a linear advancement in capability just got demolished by hard data from the trenches. The latest SWE-rebench results show Claude Sonnet 4.5 achieving a decisive 55.1% pass@5 on fresh September 2025 GitHub PR bug-fix tasks, outperforming GPT-5-Codex’s 44.9% and solving problems that stumped every other model tested.
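For readers parsing the leaderboard below, the two headline metrics differ: the resolved rate is the average share of problems fixed in a single attempt, while pass@5 counts a problem as solved if any of five attempts succeeds. A minimal sketch of the standard unbiased pass@k estimator (the one popularized by the HumanEval paper) follows; SWE-rebench’s exact aggregation may differ.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k sampled
    attempts succeeds, given c successes observed in n total attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: a problem attempted 5 times with 2 successful runs.
print(pass_at_k(n=5, c=2, k=5))  # 1.0 -- any success among the 5 attempts counts
print(pass_at_k(n=5, c=2, k=1))  # 0.4 -- expected single-attempt resolved rate
```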

SWE-rebench results showing Claude Sonnet 4.5 outperforming GPT-5-Codex

| Rank | Model | Resolved Rate (%) | Resolved Rate SEM (±) | Pass@5 (%) | Cost per Problem ($) | Tokens per Problem |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.5 | 44.5% | 1.00% | 55.1% | $0.89 | 1,797,999 |
| 2 | gpt-5-codex | 41.2% | 0.76% | 44.9% | $0.86 | 1,658,941 |
| 3 | Claude Sonnet 4 | 40.6% | 1.08% | 46.9% | $0.91 | 1,915,332 |
| 4 | Claude Opus 4.1 | 40.2% | 0.77% | 44.9% | $4.03 | 1,675,141 |
| 5 | gpt-5-2025-08-07-medium | 38.8% | 1.29% | 44.9% | $0.72 | 1,219,914 |
| 6 | gpt-5-mini-2025-08-07-medium | 37.1% | 1.19% | 44.9% | $0.32 | 1,039,942 |
| 7 | gpt-5-2025-08-07-high | 36.3% | 2.08% | 46.9% | $1.05 | 1,641,219 |
| 8 | o3-2025-04-16 | 36.3% | 1.98% | 46.9% | $1.33 | 1,404,415 |
| 9 | Qwen3-Coder-480B-A35B-Instruct | 35.7% | 1.51% | 44.9% | $0.59 | 1,466,625 |
| 10 | GLM-4.5 | 35.1% | 1.35% | 44.9% | $0.92 | 1,518,166 |
| 11 | Grok 4 | 34.6% | 2.07% | 44.9% | $1.53 | 1,168,808 |
| 12 | GLM-4.5 Air | 31.0% | 2.18% | 42.9% | $0.32 | 1,578,223 |
| 13 | gpt-5-2025-08-07-minimal | 30.6% | 0.65% | 46.9% | $0.32 | 629,319 |
| 14 | Grok Code Fast 1 | 30.1% | 2.11% | 42.9% | $0.04 | 957,736 |
| 15 | gpt-oss-120b | 28.7% | 1.30% | 42.9% | $0.04 | 1,161,946 |
| 16 | Qwen3-235B-A22B-Instruct-2507 | 28.6% | 1.83% | 40.8% | $0.18 | 899,731 |
| 17 | gpt-4.1-2025-04-14 | 28.4% | 1.85% | 42.9% | $0.48 | 518,584 |
| 18 | o4-mini-2025-04-16 | 27.3% | 1.53% | 44.9% | $0.95 | 1,726,082 |
| 19 | Kimi K2 Instruct 0905 | 25.9% | 2.02% | 40.8% | $1.10 | 1,815,589 |
| 20 | DeepSeek-V3.1 | 24.9% | 2.47% | 42.9% | $0.41 | 1,509,692 |
| 21 | Qwen3-Coder-30B-A3B-Instruct | 23.3% | 1.38% | 32.7% | $0.06 | 584,337 |
| 22 | Qwen3-235B-A22B-Thinking-2507 | 22.4% | 1.29% | 32.7% | $0.14 | 512,537 |
| 23 | DeepSeek-V3-0324 | 22.1% | 0.72% | 30.6% | $0.17 | 324,623 |
| 24 | gemini-2.5-pro | 21.4% | 1.07% | 34.7% | $0.59 | 1,111,184 |
| 25 | DeepSeek-R1-0528 | 20.0% | 2.08% | 30.6% | $0.63 | 679,114 |
| 26 | Qwen3-Next-80B-A3B-Instruct | 19.7% | 1.35% | 32.7% | $0.23 | 444,208 |
| 27 | gemini-2.5-flash | 17.6% | 0.99% | 30.6% | $0.17 | 1,384,944 |
| 28 | gpt-4.1-mini-2025-04-14 | 15.9% | 2.36% | 36.7% | $0.21 | 1,217,617 |
| 29 | Qwen3-30B-A3B-Thinking-2507 | 13.1% | 1.38% | 26.5% | $0.05 | 436,857 |
| 30 | Qwen3-30B-A3B-Instruct-2507 | 10.2% | 1.44% | 26.5% | $0.12 | 1,128,517 |
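One normalization worth doing on these numbers: the table reports average cost per problem regardless of outcome, so dividing by the resolved rate gives a rough cost per successfully resolved problem. A quick sketch using a few rows from the table above:

```python
# Rough cost-effectiveness from the leaderboard:
# cost per *resolved* problem = avg cost per problem / resolved rate.
rows = {
    "Claude Sonnet 4.5": (0.89, 0.445),
    "gpt-5-codex":       (0.86, 0.412),
    "Claude Opus 4.1":   (4.03, 0.402),
    "Grok Code Fast 1":  (0.04, 0.301),
}

for model, (cost_per_problem, resolved_rate) in rows.items():
    print(f"{model:20s} ${cost_per_problem / resolved_rate:.2f} per resolved problem")
```

On that crude measure, Sonnet 4.5 (about $2.00 per resolved problem) and GPT-5-Codex (about $2.09) are nearly tied, while Opus 4.1's premium (roughly $10 per resolved problem) stands out.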

The Numbers Don’t Lie: Claude’s Clear Edge

The Nebius evaluation ran 25+ models against 49 real GitHub PR bug-fix tasks created just last month, giving the most current snapshot of coding-model performance available. What emerged wasn’t just another incremental improvement; it was a paradigm shift in how we think about AI coding capabilities.

Claude Sonnet 4.5 didn’t just edge out the competition; it dominated. The model solved several complex instances that no other model cracked, including the python-trio/trio-3334 exception wrapping issue, the cubed-dev/cubed-799 region support implementation, and the canopen-python/canopen-613 PDO selection bug.

Early adopters are already noticing the difference. Replit President Michele Catasta reported “We went from 9% error rate on Sonnet 4 to 0% on our internal code editing benchmark”, while Cognition’s Scott Wu noted “Claude Sonnet 4.5 increased planning performance by 18% and end-to-end eval scores by 12%, the biggest jump we’ve seen since Claude 3.6.”

Where the Rubber Meets the Road: Real Development Scenarios

The gap becomes even more pronounced when examining actual development workflows. Developers testing both models report distinct performance patterns that align with the benchmark findings.

One developer spent $104 comparing models on 135k+ lines of Rust code and found Sonnet 4.5 “excels at testing its own code, enabling Devin to run longer, handle harder tasks, and deliver production-ready code.” The consensus emerging from practical testing suggests Claude’s architecture handles complex, multi-file contexts more reliably, while GPT-5-Codex occasionally struggles with architectural coherence across larger codebases.

The cost-performance calculus is more nuanced in practice. In head-to-head e-commerce app development tests, Sonnet 4.5 with Claude Code used approximately 18M input tokens and 117k output tokens, costing around $10.26, while GPT-5-Codex consumed roughly 600k input and 103k output tokens for just $2.50. That fourfold price gap narrows once Claude’s higher success rate and reduced iteration needs are factored in.

Architectural Underpinnings: Why Claude Excels

Claude Sonnet 4.5’s performance surge appears rooted in Anthropic’s refined hybrid reasoning system, which modulates compute allocation within a single unified model rather than routing requests between separate sub-models, as GPT-5’s architecture does.

This approach eliminates routing overhead but potentially allocates more compute than strictly necessary for simple tasks, a trade-off that pays dividends on complex, multi-step coding challenges. Claude’s ability to sustain focus on projects for over 30 hours gives it an edge on real-world development timelines where context retention matters more than raw speed.

SWE-bench Verified, which measures AI models’ ability to resolve real-world software issues, shows Claude Sonnet 4.5 scoring 77.2%, up from 72.7% for Sonnet 4, with further gains when parallel test-time compute is enabled. This is the highest score yet reported on the benchmark and suggests architectural improvements specifically targeting software engineering workflows.
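“Parallel compute” here refers to test-time scaling: sampling several candidate patches and keeping one that passes the project’s test suite. A minimal sketch of that selection loop is below; generate_patch and run_tests are hypothetical stand-ins, not Anthropic’s or SWE-bench’s actual harness.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_patch(issue: str, seed: int) -> str:
    """Hypothetical stand-in: one independent model attempt at a fix."""
    raise NotImplementedError

def run_tests(patch: str) -> bool:
    """Hypothetical stand-in: apply the patch and run the repo's test suite."""
    raise NotImplementedError

def best_of_n(issue: str, n: int = 5) -> str | None:
    """Sample n candidate patches in parallel and keep one that passes tests."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda seed: generate_patch(issue, seed), range(n)))
    for patch in candidates:
        if run_tests(patch):
            return patch
    return None  # no attempt passed; the problem counts as unresolved
```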

The Developer Experience Divide

Beyond raw performance numbers, developers report meaningful differences in how these models approach coding tasks.

Many developers prefer Claude’s coding style, describing it as more “human-like” and cohesive. As one developer noted, “Claude outdid GPT-5 in frontend implementation and GPT-5 outshone Claude in debugging and implementing backend.” This specialization pattern suggests teams might benefit from using multiple models for different phases of development.

Claude Code’s interface improvements, particularly the checkpoint system that automatically saves state before changes, provide a safety net that encourages bolder refactoring attempts. This reduces the “hallucination anxiety” that plagues AI-assisted coding, where developers fear irreversible damage to their codebase.
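The checkpoint idea is easy to approximate in any workflow, even outside Claude Code. Below is a minimal sketch using plain git; it illustrates the concept, is not Claude Code’s actual mechanism, and only covers tracked files.

```python
import subprocess

def git(*args: str) -> str:
    """Thin wrapper around the git CLI; raises on failure."""
    result = subprocess.run(["git", *args], check=True, capture_output=True, text=True)
    return result.stdout.strip()

def checkpoint(label: str) -> str:
    """Snapshot tracked files before an AI edit.

    `git stash create` records the working-tree state as an unreferenced
    commit without touching HEAD, the index, or the files themselves.
    Returns the snapshot's commit id (falls back to HEAD if the tree is clean).
    """
    return git("stash", "create", label) or git("rev-parse", "HEAD")

def rollback(snapshot: str) -> None:
    """Restore tracked files to the checkpointed state after a bad AI edit."""
    git("checkout", snapshot, "--", ".")
```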

The Open-Source Surprise: Qwen3-Coder’s Strong Showing

While the Claude vs. OpenAI battle grabs headlines, the SWE-rebench results reveal another significant development: Qwen3-Coder emerged as the best open-source performer, beating established models like Gemini-2.5-Pro and DeepSeek-R1-0528.

This suggests we’re entering an era of specialization where different models excel at different aspects of coding. GLM-4.5 trailed Qwen3-Coder closely, and developers are eagerly awaiting the GLM-4.6 release to see if it can claim the open-source crown.

The competitive landscape is becoming more nuanced, with developers reporting that “Claude feels noticeably faster and smarter than the previous powerhouse, Claude Opus”, particularly in “areas like finance, statistics, and data dashboards.”

Practical Implications for Development Teams

For engineering leaders, these results demand a reconsideration of AI tooling strategy. The one-size-fits-all approach to AI coding assistants is quickly becoming obsolete as models specialize.

Teams working on complex, multi-file projects with extensive context requirements might prioritize Claude for its superior architectural comprehension, while those focused on rapid prototyping or debugging might still prefer GPT-5-Codex’s cost-efficiency for simpler tasks.
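In practice, that split can be encoded as a thin routing shim in the team’s tooling. The sketch below is purely illustrative: the task fields, thresholds, and model identifier strings are assumptions, not vendor guidance.

```python
def pick_model(task: dict) -> str:
    """Illustrative routing policy; criteria and model names are assumptions."""
    if task["files_touched"] > 3 or task["needs_long_context"]:
        return "claude-sonnet-4-5"   # complex, multi-file work with heavy context
    if task["kind"] in {"debug", "hotfix"}:
        return "gpt-5-codex"         # surgical fixes at lower cost per attempt
    return "gpt-5-mini"              # cheap default for quick prototypes

task = {"files_touched": 7, "needs_long_context": True, "kind": "refactor"}
print(pick_model(task))  # claude-sonnet-4-5
```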

The cost-benefit analysis becomes particularly interesting at scale. While GPT-5 maintains pricing advantages ($1.25/$10 per million tokens vs Claude’s $3/$15), Claude’s higher success rate and reduced iteration overhead can make it more cost-effective for complex projects where failed attempts accumulate significant token costs.
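A back-of-the-envelope way to run that comparison is to fold the success rate into the price: if an attempt costs c and succeeds with probability p, a naive independent-retry model puts the expected cost of a working fix at c/p. The sketch below uses the list prices quoted above, but the token usage and success rates are placeholders; the point is the shape of the calculation, not the specific answer.

```python
# Expected cost per working fix under independent retries: cost_per_attempt / success_rate.
# Prices are list prices ($ per million input/output tokens); usage and rates are placeholders.
PRICES = {"Claude Sonnet 4.5": (3.00, 15.00), "GPT-5": (1.25, 10.00)}

def attempt_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

def expected_cost(model: str, input_tokens: int, output_tokens: int,
                  success_rate: float) -> float:
    return attempt_cost(model, input_tokens, output_tokens) / success_rate

# Hypothetical task profile: 200k input / 20k output tokens per attempt.
for model, rate in [("Claude Sonnet 4.5", 0.60), ("GPT-5", 0.40)]:
    print(f"{model}: ${expected_cost(model, 200_000, 20_000, rate):.2f} per successful fix")
```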

The Future of AI-Assisted Development

These benchmark results signal a maturing of the AI coding ecosystem. We’re moving beyond simple code completion toward genuinely collaborative development partnerships where models understand project architecture, maintain context across extended sessions, and handle increasingly complex software engineering tasks autonomously.

The ability to solve real GitHub PRs that stumped human developers for days represents a milestone in AI capability. As one developer building with both systems observed, “Codex delivered the entire recommendation engine with fewer retries and far fewer context errors. Claude’s output looked cleaner on the surface, but Codex’s results actually held up in production.”

What emerges is a more nuanced picture: Claude excels at architectural coherence and long-term project maintenance, while GPT-5-Codex handles surgical fixes and debugging with precision. The best development teams will likely learn to leverage both strengths rather than declaring unilateral superiority.

The coding assistant wars have moved from marketing claims to measurable performance on real engineering challenges, and right now, Claude Sonnet 4.5 holds the advantage where it matters most: shipping working code that passes production tests.
