Cohere Command A+ Is a 218B MoE Model for Two GPUs, And the Benchmark Skeptics Are Circling

Cohere’s Command A+ represents a pragmatic shift in how enterprise-grade AI models reach developers. Rather than dropping another 400-billion-parameter behemoth that only trillion-dollar labs can touch, Cohere released a 218-billion-parameter sparse mixture-of-experts (MoE) model explicitly engineered to compress down to as few as two NVIDIA H100 GPUs. Released under an Apache 2.0 license, it unifies reasoning, multimodal understanding, multilingual translation, and agentic tool use into a single architecture. The headline isn’t just the parameter count, it’s the quantization-aware distillation and selective precision that make dense-model performance accessible without dense-model infrastructure. But as the open-source community digs into the numbers, a familiar tension is emerging between deployment efficiency and raw benchmark dominance.

The Hardware Reality: From 218B Parameters to Two H100s

Let’s be blunt about what usually happens when a lab announces a 200B+ parameter model. You smile, nod, and quietly accept that you’ll never touch it outside of a managed API. Cohere is explicitly targeting that pain point with Command A+. The model ships in three quantization tiers, each mapping directly to realistic hardware budgets:

Quantization	Blackwell Minimum	Hopper Minimum
BF16 (16-bit)	4× B200	8× H100
FP8 (8-bit)	2× B200	4× H100
W4A4 (4-bit)	1× B200	2× H100

The recommended configuration is W4A4, and Cohere claims the benchmark drops between these tiers are “negligible.” If true, and that’s a big if that we’ll dissect shortly, it means a model with 218 billion total parameters and 25 billion active parameters can reside entirely on a pair of enterprise GPUs. That is not normal. Most architectures at this scale assume you have a rack.

The aggressive footprint is possible because Command A+ is sparse: each token only activates a subset of experts rather than dragging the entire parameter set through inference. Specifically, the architecture routes tokens through 128 experts, with 8 active per token and a single shared expert applied universally. The attention layers interleave sliding-window attention with rotational positional embeddings and global attention layers in a 3:1 ratio, design choices that favor long-context efficiency over brute-force memorization.

The Quantization Surgery: Why Experts Get 4-Bit Treatment

The real engineering story here isn’t the MoE routing, it’s what happens to the weights when you try to squeeze them onto commodity (relatively speaking) silicon. Reasoning-heavy models typically suffer brutal quantization degradation because decoding errors compound across long thinking traces. Cohere’s workaround is surgical: they apply NVFP4 W4A4 quantization only to the MoE experts, while keeping the attention pathway, including Q/K/V/O projections, KV cache, and attention compute, at full precision.

To close the remaining quality gap, Cohere uses Quantization-Aware Distillation (QAD). The quantized student model is trained to mimic its full-precision teacher’s output distribution, inserting fake quantization operators during the forward pass and straight-through estimators on the backward pass. Think of it as teaching the compressed model to hallucinate mistakes in the same way the uncompressed model would, rather than introducing entirely new failure modes.

The result is a model whose expert GEMMs, the operations that typically bottleneck short-to-medium-context decode, run faster and leaner without sacrificing the precision of the pathways that actually maintain coherence across long contexts. It’s a pragmatic admission that not all parameters deserve equal protection.

Speed Claims That Actually Show Up in the Margins

Cohere isn’t just promising smaller footprints, it’s claiming significant latency wins over its own previous dense architectures. At equivalent quantization and concurrency levels, Command A+ delivers up to 63% higher Output Tokens per Second (TOPS) and reduces Time To First Token (TTFT) by up to 17% compared to Command A Reasoning. The W4A4 quantization itself contributes an additional 47% speedup and 13% latency reduction.

Then there’s speculative decoding, which Cohere optimized specifically for the MoE architecture. Rather than treating the draft model as a one-size-fits-all accelerator, they tailored it around sparse expert activation, squeezing out an extra 1.5, 1.6× inference speedup for both text and multimodal inputs.

The tokenizer deserves its own mention. Command A+ introduces a new tokenizer with substantial compression improvements for non-European languages, often the casualties of tokenizer training data bias. Tokenization efficiency improved by 20% for Arabic, 16% for Korean, and 18% for Japanese, which translates directly to lower inference costs and faster generation for a huge swath of global users.

Benchmarks, Benchmaxxing, and the Score of 37

Here is where the conversation gets spicy. Cohere touted several striking internal improvements: on 𝜏²-Bench Telecom, Command A+ jumped from 37% to 85% over Command A Reasoning, Terminal-Bench Hard agentic coding climbed from 3% to 25%, and within Cohere’s North platform, agentic question answering improved by 20%, spreadsheet analysis by 32%, and memory usage quality jumped from 39% to 54%.

Externally, the model achieved a 37 on the Artificial Analysis Intelligence Index, which Cohere noted outperforms other leading open models. But the developer community immediately flagged the absence of direct comparisons to contemporary heavyweights in the same size class, such as MiniMax’s recent releases. That gap left room for skepticism about whether Command A+ can compete on raw quality or merely on efficiency.

The broader discussion quickly turned to the phenomenon some developers call “benchmaxxing”, models explicitly tuned to maximize leaderboard scores while faltering on messy, unstructured tasks. There is an inherent trade-off when you optimize for hardware efficiency: every quantization bit and distillation step is a potential source of brittleness that won’t show up in sanitized evaluation harnesses. Command A+ will be tested not by its 𝜏² scores, but by how well it survives a Friday afternoon enterprise RAG pipeline with inconsistent PDF formatting and ambiguous user queries.

What “Sovereign AI” Means When the Weights Are Actually Yours

Cohere’s announcement leaned hard into the “sovereign AI” narrative, enterprise deployment inside customer-controlled environments without data exfiltration to third-party APIs. Nick Frosst, Cohere’s co-founder, emphasized in the r/LocalLLaMA announcement that the release offers “total, near unfettered access” under Apache 2.0. That licensing choice matters more than the architecture. Unlike models buried under commercial use restrictions or API-only access, an Apache 2.0 release means you can modify, redistribute, and embed the weights into products without asking permission.

This aligns with the trend toward running efficient models locally, similar to Command A+’s design for smaller hardware, though Cohere’s current form factor still assumes data-center GPUs rather than consumer cards. When pressed on whether smaller variants might follow, the team hinted at future releases, suggesting that discussion of scaling limitations and efficiency trade-offs may eventually extend all the way down to individual workstations.

Developer Experience: Reasoning Traces, Tool Calls, and vLLM Reality

Command A+ isn’t just a weights dump. Cohere shipped integration for both Transformers and vLLM, complete with chat templates for tool use using JSON schema. When reasoning is enabled, the model generates thinking traces between explicit <|START_THINKING|> and <|END_THINKING|> tags before producing its final answer, useful for debugging agentic loops, though potentially verbose.

For W4A4 deployment specifically, you’ll need vLLM ≥0.21.0 and cohere_melody>=0.9.0 for accurate response parsing. The recommended sampling parameters run hot: temperature=0.9, top_p=0.95, and repetition_penalty=1.04.

Running inference looks straightforward enough in the Hugging Face examples:

from transformers import AutoTokenizer, AutoModelForImageTextToText

model_id = "CohereLabs/command-a-plus-05-2026-w4a4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id)

messages = [{"role": "user", "content": "Explain the Transformer architecture"}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
)

gen_tokens = model.generate(
    input_ids, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95
)

But don’t mistake accessibility for simplicity. Deploying a 218B-parameter MoE model, even a quantized one, still requires serious GPU memory topology awareness. The fact that it can run on two H100s doesn’t mean it will forgive a poorly configured vLLM scheduler.

The Context: Efficiency Is the New Battleground

Command A+ arrives in a market increasingly skeptical of parameter-count arms races. We’ve seen smaller, efficient models outperforming larger ones, similar to Command A+’s GPU efficiency, and stealth-launched open-weight behemoths like GLM-5 demonstrating that size without accessibility is largely academic. Cohere’s release joins a growing class of open-weight models with specialized agentic capabilities that prioritize what developers can actually deploy over what benchmarks can measure.

The model’s multilingual expansion, from 23 to 48 languages, and its multimodal reasoning (63% on MMMU Pro, 75.1% on MMMU) position it as a practical workhorse rather than a narrow research artifact. Fujitsu has already signed on publicly, citing alignment with their own sovereign AI strategy through the Kozuchi Enterprise AI Factory. That enterprise validation is crucial, because if Command A+ can survive real operational constraints rather than synthetic evals, it may prove that the next competitive moat in AI isn’t raw capability, it’s deployability.

The Verdict

Command A+ is Cohere’s most interesting open release in years precisely because it makes an architectural bet that efficiency and capability are not mutually exclusive. The quantization strategy is genuinely clever: protecting attention precision while aggressively compressing experts is the kind of hardware-aware compromise more labs should consider. The speed and latency claims are substantial, and the Apache 2.0 license removes the legal friction that hobbles many “open” models.

But the skepticism is warranted. A score of 37 on the Artificial Analysis Intelligence Index and a lack of direct head-to-head comparisons with the current frontier leave unanswered questions. The true test will come when developers actually load these weights onto two H100s, feed them real enterprise slop, and see whether the 4-bit experts hold up or crumble. Cohere has given the tools to find out. The community now has to do the grinding work of validation, which, for an open-weight release, is exactly how it should be.