BANANDRE
NO ONE CARES ABOUT CODE


Tagged with

#model-evaluation

3 articles found

The ‘Q4_K_M’ Illusion: Why KL Divergence and Perplexity Are Your Only Friends in the GGUF Wild West
benchmarking · Featured

A data-driven approach to evaluating quantized LLMs reveals that not all Q4_K_M files are created equal. KL Divergence and Perplexity metrics expose the hidden variance in quantization quality, helping you avoid the ‘vibes-based’ selection trap.

#benchmarking #gguf #kl-divergence...
Read More
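The two metrics the article above leans on are easy to sketch. The idea: run the same prompts through the full-precision reference model and the quantized GGUF, then compare the next-token distributions with KL divergence and score each model against the observed tokens with perplexity. A minimal NumPy sketch (function names and shapes are my own illustration, not from the article):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl_divergence(ref_logits, quant_logits):
    # Mean per-token KL(P_ref || Q_quant): how far the quantized model's
    # next-token distribution drifts from the full-precision reference.
    # logits shape: (num_tokens, vocab_size)
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

def perplexity(logits, token_ids):
    # exp(mean negative log-likelihood) of the observed tokens;
    # lower is better, and a large jump vs. the reference model
    # signals quantization damage.
    log_probs = np.log(softmax(logits))
    nll = -log_probs[np.arange(len(token_ids)), token_ids]
    return float(np.exp(nll.mean()))
```

KL divergence is the more sensitive check here: two quants can have near-identical perplexity while one of them visibly reshapes the output distribution, which is exactly the variance a 'vibes-based' comparison misses.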
IQuest-Coder-V1’s 81% SWE-Bench Claim: A 40B Model That Punches Above Its Weight, or Just Benchmark Boxing?
benchmark-controversy

A new 40B-parameter dense coding model claims state-of-the-art results on SWE-Bench and LiveCodeBench, reigniting debates about benchmark validity and open-source AI competitiveness.

#benchmark-controversy #coding-llms #model-evaluation...
Read More
Meta’s Context Cap: How Community Hacking Unlocked Llama 3.3 8B’s True Potential
context-extension

Community testing reveals that unofficial context extensions of Llama 3.3 8B significantly outperform Meta’s official 8k configuration, exposing gaps in model evaluation and raising questions about intentional limitations.

#context-extension #llama-3.3 #llm-benchmarks...
Read More

© 2026 BANANDRE
Privacy Policy · Terms · Impressum
Built with 🍌