The new --fit flag doesn’t just automate what humans were doing manually; it fundamentally rethinks the problem. And for Mixture-of-Experts (MoE) models, the performance gains aren’t incremental; they’re transformational.
The Manual Tuning Trap: Why Heuristics Fail
Until now, hybrid GPU-CPU inference meant explicitly setting --n-gpu-layers and --tensor-split parameters, hoping your mental math matched the model’s actual memory footprint. Ollama and KoboldCpp automated this with heuristic algorithms, but as the implementation notes candidly admit, these rely on “rough heuristics and tend to be inaccurate.” The result? Conservative allocations that leave 10-15% of VRAM on the table, or aggressive settings that trigger OOM crashes when context sizes shift.
The problem becomes acute with MoE models like Qwen 3 Next or Mistral Large 3. These models don’t just have dense layers; they also have sparse expert layers that activate conditionally. A simple layer-count heuristic can’t distinguish between critical dense tensors (attention, embeddings) and expert FFNs that might be swapped to CPU with minimal performance impact. The existing approach treats all layers as equal, which is like using a sledgehammer for brain surgery.
Automated Intelligence: How It Actually Works
llama.cpp’s solution is brutally elegant: virtual test allocations with iterative feedback. Instead of guessing, the system actually tries to allocate tensors across available GPUs, measures the deficit, and systematically reduces memory usage until the model fits. The algorithm follows a precise hierarchy (a simplified sketch follows the list below):
- Context size reduction first: If the model won’t fit, slash context size from 131072 to the user-specified minimum (default 4096), freeing up gigabytes instantly.
- Dense tensor prioritization: For MoE models, move only the sparse expert tensors to CPU while keeping dense attention weights on GPU.
- Layer-level overflow: When individual layers exceed GPU capacity (common with 24GB GPUs), split them across devices rather than moving entire layers to CPU.
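To make that hierarchy concrete, here is a minimal Python sketch of the control flow. The cost constants, the deficit_mib() stand-in, and the layer count are all invented for illustration; the real implementation measures deficits through virtual test allocations in the backends rather than a formula.

```python
# A minimal sketch of the three-step hierarchy, assuming a toy cost model.
# The constants and the deficit_mib() stand-in are invented for illustration;
# llama.cpp's real code measures deficits via virtual test allocations instead.
from dataclasses import dataclass

@dataclass
class Plan:
    ctx_size: int = 131072          # start from the model's maximum context
    experts_on_cpu: bool = False    # step 2: spill sparse expert FFNs to CPU
    gpu_layers: int = 48            # step 3: reduce/split layers only as a last resort

def deficit_mib(plan: Plan, vram_mib: int) -> int:
    """Toy stand-in for a virtual test allocation: returns the shortfall in MiB."""
    dense = 14000                                   # dense attention/embedding weights
    experts = 0 if plan.experts_on_cpu else 22000   # sparse expert FFN weights
    kv_cache = plan.ctx_size // 32                  # pretend KV-cache cost per context
    budget = vram_mib * plan.gpu_layers // 48       # VRAM available to offloaded layers
    return max(0, dense + experts + kv_cache - budget)

def fit(vram_mib: int, min_ctx: int = 4096) -> Plan:
    plan = Plan()
    # 1. Shrink the context toward the user-specified minimum first.
    while deficit_mib(plan, vram_mib) and plan.ctx_size > min_ctx:
        plan.ctx_size //= 2
    # 2. Then move sparse expert tensors to the CPU, keeping dense weights on GPU.
    if deficit_mib(plan, vram_mib):
        plan.experts_on_cpu = True
    # 3. Finally, drop (or split) whole layers until the deficit reaches zero.
    while deficit_mib(plan, vram_mib) and plan.gpu_layers > 0:
        plan.gpu_layers -= 1
    return plan

print(fit(vram_mib=24080))
# Plan(ctx_size=4096, experts_on_cpu=True, gpu_layers=48)
```

The point of the ordering is that each step is cheaper than the next: shrinking context costs nothing at load time, offloading experts costs some decode speed, and evicting whole layers is the last resort.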
The llama-fit-params tool exposes this logic transparently. Running it on a Qwen 3 Next BF16 model with two RTX 4090s produces:
```
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl: - CUDA0: 24080 total, 34873 used, 11187 deficit
llama_params_fit_impl: - CUDA1: 24080 total, 31847 used, 8161 deficit
llama_params_fit_impl: context size reduced from 131072 to 4096 -> need 4490 MiB less
llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 42064 MiB
```
The final allocation keeps 37 layers on GPU with strategic tensor overrides: -ot blk\.13\.ffn_(up|gate|down).*=CUDA1,blk\.25\.ffn_down.*=CPU,.... This isn’t random; it’s a computed optimum that manual tuning could never reliably achieve.
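For readers unfamiliar with tensor overrides, the sketch below shows how regex-to-backend rules of this shape route individual tensors. Only the two patterns are taken from the output above; the tensor names and the place() helper are simplified stand-ins, not llama.cpp internals.

```python
# Illustration of regex-style tensor overrides ("pattern=backend", first match
# wins, unmatched tensors stay on the default device). This is a simplified
# mock of the idea, not llama.cpp's actual override machinery.
import re

overrides = [
    (r"blk\.13\.ffn_(up|gate|down).*", "CUDA1"),
    (r"blk\.25\.ffn_down.*",           "CPU"),
]

def place(tensor_name: str, default: str = "CUDA0") -> str:
    for pattern, backend in overrides:
        if re.fullmatch(pattern, tensor_name):
            return backend
    return default

# Hypothetical tensor names in the usual blk.<n>.<tensor> naming scheme.
for name in ("blk.13.ffn_gate_exps.weight",
             "blk.25.ffn_down_exps.weight",
             "blk.2.attn_q.weight"):
    print(f"{name:32s} -> {place(name)}")
# The expert FFN tensors get routed to CUDA1 and CPU respectively, while the
# attention weight stays on the default device.
```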
Benchmarks That Tell a Story
The performance data reveals why this matters. On Qwen 3 Next BF16, automated fitting delivers:
| GPUs | VRAM Utilization | pp4096 (t/s) | tg128 (t/s) |
|---|---|---|---|
| 1x RTX 4090 | 88.1% | 381.52 | 19.01 |
| 2x RTX 4090 | 88.5% | 246.29 | 20.89 |
| 4x RTX 4090 | 89.3% | 433.10 | 24.70 |
The key insight isn’t just the throughput numbers; it’s the 88-90% VRAM utilization across configurations. Manual tuning typically plateaus at 75-80% before hitting stability issues. For the GPT OSS 120b mxfp4 model, the system achieves 87-90% utilization while maintaining crash-free operation, something that had eluded even sophisticated users.
But the real controversy emerges in the comments: developers note that for dense models, keeping attention weights on GPU while moving FFN layers to CPU can yield 50% faster token generation at equivalent VRAM usage. The automated system doesn’t yet exploit this sub-layer granularity for dense models, treating them as contiguous blocks. This points to a future optimization path that could render even more manual tuning practices obsolete.
Why This Breaks Existing Tools
Here’s where it gets spicy: Ollama and KoboldCpp’s heuristic approaches aren’t just suboptimal; they’re fundamentally misaligned with how modern MoE models work. Their token-balancing strategies, inherited from traditional Expert Parallelism research, assume compute-bound operations where runtime scales linearly with token count. But as recent research from Yale, Princeton, and NVIDIA demonstrates, MoE decode phases are memory-bound, not compute-bound. Runtime scales with activated expert replicas, not token throughput.
The Metro paper shows that token-balancing algorithms can increase decode latency by 14% while reducing total throughput by 10%. By contrast, llama.cpp’s approach, minimizing activated experts while maximizing dense tensor residency, directly addresses the memory bottleneck. It’s not just a better heuristic; it’s a different theoretical foundation.
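A back-of-envelope model makes the distinction tangible. The sketch below compares a cost proxy that grows with routed tokens (the compute-bound assumption behind token balancing) against one that grows with the distinct experts whose weights must be streamed from memory each decode step (the memory-bound view). Every constant is made up for illustration.

```python
# Toy contrast between compute-bound and memory-bound cost models for MoE decode.
# All constants are illustrative; nothing here is measured on real hardware.
import random

N_EXPERTS = 128        # experts per MoE layer (hypothetical)
TOP_K = 8              # experts activated per token (hypothetical)
EXPERT_MIB = 350       # size of one expert's weights (hypothetical)
MEM_BW_MIB_MS = 1000   # memory bandwidth in MiB per millisecond (hypothetical)

def decode_step(batch_tokens: int) -> tuple[int, float]:
    # Compute-bound proxy: cost grows linearly with tokens routed.
    token_proxy = batch_tokens * TOP_K
    # Memory-bound estimate: cost grows with *distinct* experts touched, since
    # each activated expert's weights are read from memory once per step.
    touched = {e for _ in range(batch_tokens)
                 for e in random.sample(range(N_EXPERTS), TOP_K)}
    weight_read_ms = len(touched) * EXPERT_MIB / MEM_BW_MIB_MS
    return token_proxy, weight_read_ms

random.seed(0)
for batch in (1, 4, 16, 64):
    proxy, ms = decode_step(batch)
    print(f"batch={batch:3d}  token-proxy={proxy:4d}  est. weight reads={ms:5.1f} ms")
# The token proxy grows 64x from batch 1 to 64, but the weight-read estimate
# saturates once most experts are touched: expert residency, not token count,
# is what dominates decode time in this toy model.
```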
This creates an existential question for downstream tools: do they double down on their existing architectures, or do they effectively become wrappers around llama.cpp’s automation? The latter path means acknowledging that the core innovation now lives upstream.
The MoE Multiplier Effect
MoE models expose the limitations of manual tuning more dramatically than dense models. A Mistral Large 3 with 675B total parameters but only 41B active per forward pass requires surgical memory management. The dense backbone must stay on GPU, but which of the hundreds of expert FFNs can spill to CPU without destroying performance?
llama.cpp’s algorithm answers this systematically. The --fit-target parameter (default 1024 MiB margin) ensures safety while the iterative solver finds the exact split. For Qwen 3 Next, this means keeping 36 layers on GPU1 and 22 on GPU2, with 11 layers partially overflowing. The result: roughly 21484 MiB used on GPU1 with just 2201 MiB left free, a tight fit that manual tuning would need dozens of attempts to replicate.
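As a rough illustration of what that margin buys, here is a toy greedy packer that refuses to place a layer on a GPU if doing so would dip into the fit-target headroom. The capacities and per-layer sizes are invented, and the real solver works from measured test allocations rather than a fixed layer size.

```python
# Hypothetical illustration of a fit-target style safety margin when packing
# layers onto heterogeneous GPUs; not the actual llama.cpp solver.
FIT_TARGET_MIB = 1024                     # leave at least this much headroom per GPU

def pack_layers(layer_mib: list[int], gpu_free_mib: list[int]) -> list[list[int]]:
    """Greedily assign layer indices to GPUs, keeping FIT_TARGET_MIB free on each."""
    assignment: list[list[int]] = [[] for _ in gpu_free_mib]
    remaining = list(gpu_free_mib)
    gpu = 0
    for i, size in enumerate(layer_mib):
        # Move to the next GPU once placing another layer here would eat
        # into the safety margin.
        while gpu < len(remaining) and remaining[gpu] - size < FIT_TARGET_MIB:
            gpu += 1
        if gpu == len(remaining):
            break                         # leftover layers overflow to the CPU
        assignment[gpu].append(i)
        remaining[gpu] -= size
    return assignment

# Two unequal GPUs and 60 layers of ~600 MiB each (all numbers made up).
plan = pack_layers([600] * 60, gpu_free_mib=[24080, 16176])
print([len(layers) for layers in plan])   # -> [38, 22]: layers placed per GPU
```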
Practical Impact: What Users Actually Get
- No more --n-gpu-layers guesswork: The system measures and adjusts automatically
- MoE models become first-class citizens: No special configuration needed for Qwen, DeepSeek-V3, or Mistral Large 3
- Context size becomes dynamic: The system will transparently reduce context to fit larger models rather than crashing
- Multi-GPU just works: Tensor splitting is computed to balance across heterogeneous GPUs
The time-to-fit is surprisingly low: 4-8 seconds for most models, scaling linearly with GPU count. This is a one-time cost at load that pays dividends across the entire inference session.
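For downstream tools, the obvious integration point is to shell out to llama-fit-params once at load time and reuse whatever it reports. The sketch below is a hypothetical wrapper: it assumes the binary accepts the usual -m model flag and prints plain text, neither of which is confirmed here, so treat both as assumptions rather than a documented interface.

```python
# Hypothetical wrapper around llama-fit-params; the -m flag and plain-text
# output are assumptions, not a documented interface.
import subprocess
import sys

def suggest_fit(model_path: str, extra_args: tuple[str, ...] = ()) -> str:
    """Run llama-fit-params once at load time and return its raw output."""
    cmd = ["./llama-fit-params", "-m", model_path, *extra_args]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        sys.exit(f"llama-fit-params failed:\n{result.stderr}")
    return result.stdout

if __name__ == "__main__":
    # A one-time cost of a few seconds at load, per the timings above.
    print(suggest_fit("models/qwen3-next-bf16.gguf"))
```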
The Controversy: Are We Losing Control?
Critics on the forums raise valid concerns. One developer noted that the automated system doesn’t yet implement sub-layer optimization for dense models, potentially leaving 10-15% performance on the table. Another pointed out that deterministic allocation is easier to debug than a black-box solver.
These are fair points, but they miss the larger trend: local LLM inference is becoming a commodity. The value isn’t in manual optimization; it’s in reliable, performant execution. Just as compilers replaced hand-tuned assembly for most code, automated memory management will replace manual GPU tuning for most models.
The real debate isn’t whether automation is perfect; it’s whether the developer community trusts llama.cpp’s upstream decisions or wants configurable knobs to override them. The current implementation wisely offers --fit off for purists, but defaults to automation for the 95% use case.
Looking Forward: The End of an Era
This shift signals the maturation of the local LLM ecosystem. We’ve moved from “can we run this at all?” to “how do we run this optimally without thinking about it?” The next frontier isn’t memory management; it’s higher-level optimizations like prompt caching, speculative decoding, and intelligent batching.
For Ollama and KoboldCpp, the path forward is clear: integrate llama.cpp’s automation or become obsolete. Their value proposition was always usability, not algorithmic innovation. Now that the core library has solved the hardest technical problem, the differentiation moves to UX, ecosystem integration, and model distribution.
For users, the message is simpler: delete your tuning scripts. Stop counting layers. Run ./llama-fit-params once, then enjoy 20% faster generation and zero crashes. The age of GPU archaeology is over.