96GB VRAM Is the New Minimum: How Qwen3.5 Is Eating GPT-OSS-120b’s Lunch in Local Agentic Coding
The boundary between open and closed AI models is collapsing. While OpenAI’s GPT-OSS-120b was supposed to dominate local agentic coding, Alibaba’s Qwen3.5 family is proving that smaller, efficiently architected models can match or exceed performance on consumer-grade hardware. From 12GB VRAM laptops running Qwen3.5-9B to 96GB workstation battles between Qwen3.5-122B and GPT-OSS-120b, the landscape is shifting toward local inference, but not without caveats. High variance in output quality and the complexity of MoE quantization are creating a fragmented ecosystem where developers constantly switch between models based on task requirements.
The 96GB Reality Check
Local AI used to mean “run it on your gaming laptop.” Those days are dead. The new baseline for serious agentic coding is 96GB of VRAM, and if you’re still rocking a 24GB card, you’re officially running a toy. This isn’t gatekeeping; it’s physics. When you’re asking an AI to maintain context across an entire codebase, debug multi-file architectures, and execute terminal commands autonomously, state-of-the-art coding models demand serious memory headroom.
The shift happened faster than most anticipated. OpenAI released GPT-OSS-120b with fanfare about democratizing AI, but the model’s appetite for memory and its 128K context limitation quickly revealed the cost of “open” source. Enter Qwen3.5, Alibaba’s answer to the local coding revolution, bringing 256K native context windows (extendable to 1M tokens via YaRN), vision capabilities, and parallel tool calling to the table. On paper, it should dominate. In practice, it’s complicated.
From 12GB to 96GB: The Qwen3.5 Scaling Spectrum
The Qwen3.5 family spans an absurd range of hardware targets, and community testing reveals a patchwork of viability across the VRAM spectrum.
At the entry level, developers report surprising success with Qwen3.5-9B on 12GB VRAM setups, specifically the RTX 3060. One developer running Kilo Code and Roo Code described the model as the first to “actually work for more than an hour, doing kind of significant work and capable of going on by itself without getting stuck.” That’s high praise for hardware that was considered obsolete for LLM inference just months ago.
But the real action happens at the high end. On a Mac Mini M4 with 64GB unified memory, Qwen3.5-35B-A3B-4bit generates tokens at 35.2 tokens/second, more than 3.5x faster than the dense Qwen3-32B-4bit running on identical hardware. The secret? Mixture of Experts (MoE) architecture. The “A3B” designation means only 3.3 billion parameters activate per token despite the model having 35 billion total parameters. It’s a cheat code for memory bandwidth constraints, delivering 35B-class intelligence with 3B-class compute costs.
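The speedup follows from simple arithmetic: autoregressive decode is largely memory-bandwidth-bound, so tokens/second scales roughly with bandwidth divided by the bytes of active weights streamed per token. A back-of-envelope sketch illustrates why only the active parameters matter (the ~120 GB/s bandwidth figure and 0.5 bytes/param at 4-bit are assumed round numbers, not measurements; attention, KV cache reads, and router overhead shrink the gap in practice, which is why the observed speedup is ~3.5x rather than the ideal ratio):

```python
# Back-of-envelope decode speed for a memory-bandwidth-bound model.
# Assumptions (illustrative, not measured): ~120 GB/s unified memory
# bandwidth, 4-bit weights (~0.5 bytes/param), 3.3B active params for
# the MoE vs 32B for the dense model.

def tokens_per_sec(bandwidth_gbps, active_params_b, bytes_per_param):
    # Each decoded token must stream the active weights from memory once.
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbps * 1e9 / bytes_per_token

moe = tokens_per_sec(120, 3.3, 0.5)    # MoE: only active experts are read
dense = tokens_per_sec(120, 32, 0.5)   # dense 32B: all weights read per token

print(f"MoE A3B:   ~{moe:.1f} tok/s")
print(f"Dense 32B: ~{dense:.1f} tok/s")
print(f"ideal speedup: ~{moe / dense:.1f}x")
```

The ideal ratio overshoots the measured 3.5x because real decode also pays for KV cache traffic and non-expert layers, but the direction of the advantage is exactly what the benchmarks show.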
For the 96GB crowd, the comparison gets spicy. Qwen3.5-122B-A10B (running at UD_Q4_K_XL quantization) goes head-to-head with GPT-OSS-120b, and the results are inconsistent enough to drive developers mad. Some report the Qwen model matching or exceeding GPT-OSS-120b on coding benchmarks, while others find themselves switching back to OpenAI’s offering for speed and consistency.
When Qwen3.5 Works (And When It Deletes Your Project)
The variance problem is real. Community reports oscillate between “shockingly good” and “completely unusable” with the same model weights. One developer using Qwen3.5-9B on an RTX 4060 reported that the model would “completely mess up my build system then delete the project,” hardly the behavior you want from an autonomous coding agent. Others find that even the 0.8B parameter variants make solid tool calls using MCPs for web search and browser control.
This inconsistency manifests across the entire Qwen3.5 lineup. The 122B model, while capable of impressive reasoning, suffers from what developers describe as “higher variance of quality” compared to GPT-OSS-120b. The thinking modes, where the model generates internal reasoning blocks before responding, can spiral into overthinking, consuming context windows and getting lost in its own reasoning chains. One workaround involves using Qwen3.5-27B with Opus 4.6 CoT distilled variants, which reportedly handle the overthinking problem better than the base models.
The issue isn’t just accuracy; it’s predictability. When you’re running agentic workflows, you need to trust that the model won’t hallucinate a terminal command that wipes your working directory. GPT-OSS-120b, for all its limitations, offers more consistent behavior, a crucial factor when the security of local AI agents is paramount.
Speed vs Smarts: The MoE Advantage
Where Qwen3.5 undeniably wins is the speed-to-intelligence ratio. On an RTX Pro 6000 with 96GB VRAM, developers are running Qwen3.5-122B at Q4 quantization while maintaining usable inference speeds. The parallel tool calling support and vision capabilities (multimodal inputs for debugging screenshots) give it practical advantages over GPT-OSS-120b’s text-only interface in real development workflows.
But GPT-OSS-120b fights back with raw throughput. The model’s architecture allows for faster token generation on identical hardware, a crucial factor when you’re iterating on code. Developers report switching back to GPT-OSS-120b specifically for speed benefits, even when Qwen3.5 produces technically superior code, because “the quality difference… is not as pronounced as benchmarks indicate.”
The quantization game also favors different approaches for each model. Qwen3.5’s MoE architecture responds well to aggressive quantization: Unsloth’s UD-TQ1_0 variant compresses the 30B model to just 8.01GB, making it viable for code completion tasks even on limited hardware. However, for agentic coding where tool calling accuracy matters, community consensus suggests Q5 variants at minimum to avoid a “decent amount of loss” in reasoning capabilities.
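The size figures quoted above are easy to sanity-check: on-disk size is roughly total parameters times bits per weight. A small estimator (the bits-per-weight values are approximate assumptions for mixed-precision GGUF quants, which is also why real files like the 8.01GB UD-TQ1_0 run larger than the naive 1.58-bit math suggests, since embeddings and output layers stay at higher precision):

```python
# Rough on-disk size estimate for a model at various quantization levels.
# Bits-per-weight values are approximate averages (assumption): real quants
# mix precisions across layers, so treat these as ballpark figures.

def model_size_gb(total_params_b, bits_per_weight):
    # params * bits / 8 bits-per-byte, reported in decimal GB
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

approx_bpw = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "TQ1_0": 1.7}
for name, bpw in approx_bpw.items():
    print(f"30B @ {name}: ~{model_size_gb(30, bpw):.1f} GB")
```

The same arithmetic explains the 96GB tier: a 122B model at ~4.5 effective bits per weight already consumes ~68GB before the KV cache for a long context is allocated.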
Integration Hell: Claude Code and the KV Cache Trap
Getting these models running is only half the battle. Integration with existing coding assistants like Claude Code reveals architectural landmines. A critical bug discovered by the Unsloth team shows that Claude Code invalidates the KV cache for local models by prepending internal IDs to prompts, making inference 90% slower by default. This isn’t a model problem; it’s a middleware problem.
The fix involves stripping these prefixes or using alternative inference servers like omlx specifically optimized for Apple Silicon. For developers running Qwen3.5-35B-A3B locally, this optimization transforms the experience from “unusable slideshow” to “productive coding session.”
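The mechanics of the bug are worth spelling out: prompt caching only helps when successive requests share a byte-identical prefix, so a per-request ID prepended by middleware kills reuse even though the rest of the conversation is unchanged. A minimal sketch of the stripping fix (the `<session-id:...>` marker and regex are hypothetical illustrations, not Claude Code’s actual prompt format):

```python
# Sketch: KV cache reuse requires a byte-identical prompt prefix across
# turns. Stripping a volatile per-request ID restores that shared prefix.
# The marker format below is a hypothetical stand-in for illustration.
import os.path
import re

VOLATILE_ID = re.compile(r"^<session-id:[0-9a-f]+>\n")  # hypothetical marker

def stable_prompt(raw_prompt: str) -> str:
    # Remove the per-request ID so the system prompt + history prefix
    # stays identical across turns and cached KV entries can be reused.
    return VOLATILE_ID.sub("", raw_prompt)

turn1 = "<session-id:a1b2c3>\nSYSTEM: coding agent\nUSER: fix the bug"
turn2 = "<session-id:d4e5f6>\nSYSTEM: coding agent\nUSER: fix the bug\nUSER: add tests"

# Raw prompts diverge almost immediately; stripped prompts share turn 1 entirely.
common_raw = os.path.commonprefix([turn1, turn2])
common_fixed = os.path.commonprefix([stable_prompt(turn1), stable_prompt(turn2)])
print(f"shared prefix: raw={len(common_raw)} chars, stripped={len(common_fixed)} chars")
```

With the raw prompts, the cache can reuse almost nothing; with the stripped prompts, everything up through the previous turn is a cache hit, which is where the reported 90% slowdown disappears.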
Tool calling implementations also vary wildly. While Qwen3.5 supports function calling, getting it to play nice with Cline or Roo Code requires specific quantization formats and parameter tuning. Recommended settings from the community include temperature=0.25, top_p=0.9, and top_k=40 for coding tasks, with num_ctx pushed as high as 65536 tokens to maintain codebase context.
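Those community settings map directly onto an Ollama request payload via the standard `options` keys (the model tag below is a placeholder; check `ollama list` for whatever name your local pull actually uses):

```python
# Community-recommended coding settings expressed as an Ollama API payload.
# The model tag is a placeholder assumption; temperature/top_p/top_k/num_ctx
# are standard Ollama option names.
import json

payload = {
    "model": "qwen3.5:35b-a3b",  # placeholder tag, adjust to your local pull
    "prompt": "Refactor this function to be iterative: ...",
    "stream": False,
    "options": {
        "temperature": 0.25,  # low temp: fewer creative-but-wrong tool calls
        "top_p": 0.9,
        "top_k": 40,
        "num_ctx": 65536,     # large window to keep codebase context resident
    },
}
print(json.dumps(payload, indent=2))
# POST this to http://localhost:11434/api/generate on a running Ollama server.
```

Note that `num_ctx=65536` is where the VRAM math bites back: the KV cache for a 64K-token window can add several gigabytes on top of the weights, which is part of why the high-end tiers exist.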
The Nemotron Curveball
Just when the Qwen3.5 vs GPT-OSS-120b narrative seemed settled, NVIDIA dropped Nemotron-3-nano 120B. Early reports from 96GB VRAM users suggest it approaches problems differently than both Qwen and OpenAI’s offerings, offering speeds around 600 tokens/second in prompt processing on multi-GPU setups. This creates a three-way battle at the high end, with developers reporting that Nemotron “kicks both of these models’ ass in coding” when running on RTX Pro 6000 setups.
Local LLM efficiency benchmarks are shifting weekly. StepFun 3.5 and Minimax M2.5 also enter the conversation, with the former showing particular strength in mathematical and logical reasoning at Q4 quantization, reportedly outperforming even 397B-parameter models running at Q3.
Practical Deployment: What Actually Works
For developers looking to deploy today, the consensus emerging from community testing suggests a tiered approach:
Consumer Hardware (12-24GB VRAM):
Qwen3.5-9B with optimized tool-calling fine-tunes (like the Cline/Roo Code specific variants) offers viable agentic coding, albeit with careful monitoring. The UD-TQ1_0 quantization at 8.01GB works for code completion with Continue, but skip it for complex agentic tasks.
Prosumer Setup (48-64GB VRAM):
Qwen3.5-35B-A3B-4bit hits the sweet spot. At ~20GB, it leaves room for context while delivering 35+ tokens/second on Apple Silicon or high-end consumer GPUs.
Workstation Class (96GB+ VRAM):
The battle between Qwen3.5-122B (Q4) and GPT-OSS-120b comes down to task type. For long-context repository analysis, Qwen3.5’s 256K window wins. For rapid iteration and consistent tool calling, GPT-OSS-120b maintains an edge. Consider optimizing token usage for local agents to squeeze more performance out of limited context windows.
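The three tiers above reduce to a simple lookup. A sketch encoding them as a helper (the thresholds and model names mirror this article’s recommendations and are a starting point, not a rule):

```python
# The article's tiered guidance as a lookup helper. Thresholds and model
# names mirror the recommendations above; treat them as assumptions that
# will shift as new fine-tunes and quants land.

def recommend(vram_gb: float) -> str:
    if vram_gb >= 96:
        return "Qwen3.5-122B (Q4) or GPT-OSS-120b, by task type"
    if vram_gb >= 48:
        return "Qwen3.5-35B-A3B-4bit"
    if vram_gb >= 12:
        return "Qwen3.5-9B (tool-calling fine-tune)"
    return "UD-TQ1_0 for code completion only, or remote inference"

for vram in (8, 16, 64, 96):
    print(f"{vram:>3} GB -> {recommend(vram)}")
```

The interesting boundary is 48GB: below it you are monitoring an assistant, above it you can start trusting multi-step agentic runs.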
The Fragmentation Problem
The real story here isn’t just “Qwen3.5 beats GPT-OSS-120b”; it’s the fragmentation of the local AI ecosystem. We’re entering an era where model selection becomes as complex as hyperparameter tuning. The variance in Qwen3.5 outputs means developers need fallback strategies, with many reporting workflows that switch between GPT-OSS-120b for speed and Qwen3.5 for specific reasoning tasks.
For businesses considering bringing local models online, this instability presents a challenge. You can’t build reliable automation on a model that might delete your project files 10% of the time. Until Qwen3.5’s variance issues resolve through better fine-tunes or quantization techniques, GPT-OSS-120b remains the conservative choice for production agentic workflows, while Qwen3.5 represents the bleeding edge for experimentation.
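The switch-by-task workflow many developers describe can be encoded explicitly rather than done by hand. A minimal routing sketch (task labels and model tags are illustrative assumptions; the 128K cutoff reflects GPT-OSS-120b’s context limitation noted earlier):

```python
# Sketch of the "switch by task" fallback workflow: fast iteration goes to
# GPT-OSS-120b, long-context or reasoning-heavy jobs go to Qwen3.5.
# Task labels and model tags are illustrative assumptions.

def pick_model(task: str, context_tokens: int) -> str:
    if context_tokens > 128_000:
        return "qwen3.5-122b"   # GPT-OSS-120b caps out at 128K context
    if task in {"repo-analysis", "deep-reasoning"}:
        return "qwen3.5-122b"   # worth the variance for hard reasoning
    return "gpt-oss-120b"       # default to the faster, steadier model

print(pick_model("quick-edit", 4_000))
print(pick_model("repo-analysis", 4_000))
print(pick_model("quick-edit", 200_000))
```

Making the routing explicit also makes the reliability trade visible: every path through the Qwen branch is a path where you accept higher output variance in exchange for context or reasoning depth.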
The landscape is shifting, but it’s not a clean revolution; it’s a messy, hardware-intensive brawl where 96GB is the new entry fee and consistency remains the final boss.



