Llama.cpp’s MTP Merge Tanks Throughput on Constrained VRAM. Here’s How a Community Fork Pushes 110 tok/s on a 12GB Card.

Llama.cpp’s MTP Merge Tanks Throughput on Constrained VRAM. Here’s How a Community Fork Pushes 110 tok/s on a 12GB Card.

After llama.cpp’s MTP merge caused a 20% performance regression, ik_llama.cpp brings back 110 tok/s for local Qwen3.6 inference on constrained VRAM.

A recent benchmark reveals that the official llama.cpp MTP implementation caused a severe performance regression for Qwen3.6 inference on constrained VRAM, dropping speeds to ~90 tok/s where an unofficial fork, ik_llama.cpp, pushes 110 tok/s on the same 12GB RTX 4070 Super. We break down the exact commands, the counterintuitive acceptance-rate paradox, and the OS-level tricks making it possible.

Cover image for article: Why MTP Doesn't Speed Up Your llama.cpp Inference (and How to Actually Get 110 tok/s)
Performance regression illustration: Official llama.cpp MTP merge vs. ik_llama.cpp fork throughput on 12GB VRAM.

When the Merge Button Becomes the Bottleneck

Multi-token prediction (MTP) was supposed to be the great equalizer for local LLM inference. By baking speculative decoding directly into the model, you could theoretically leap from ~7 tok/s to 80+ tok/s on consumer GPUs without buying new hardware. That promise held true, right up until the official llama.cpp MTP PR merged into upstream. Then, for at least one rigorous tester, performance cratered.

Running a byteshape Qwen3.6-35B-A3B IQ4_XS quant on an RTX 4070 Super with 12GB VRAM, the same configuration that previously flirted with 80 tok/s suddenly struggled to stay above non-MTP speeds. The fix came from an unlikely place: a community fork called ik_llama.cpp, which reclaimed a 23% throughput increase and pushed the setup to 110 tok/s. The kicker? It achieved that win with a lower draft acceptance rate than the official build.

If you think higher acceptance always equals higher throughput, the numbers below will change your mind.

The Benchmark: 90 tok/s vs. 110 tok/s

The test rig was modest: CachyOS, AMD Ryzen 7 9700X, 48GB DDR5-6000, and the aforementioned RTX 4070 Super 12GB. The model was the Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf quant, roughly 18.6 GB on disk and requiring partial CPU offload to squeeze into 12GB VRAM. Benchmarks were run using the standardized mtp-bench.py script across nine prompt categories.

Official llama.cpp (post-merge):

Prompt Predicted Draft Accepted Rate tok/s
code_python 192 122 118 0.967 79.8
code_cpp 192 117 110 0.940 89.1
explain_concept 192 124 113 0.911 88.0
summarize 192 139 127 0.914 95.0
qa_factual 192 133 128 0.962 97.0
translation 192 125 117 0.936 91.6
creative_short 192 109 99 0.908 82.1
stepwise_math 192 130 125 0.962 97.0
long_code_review 192 121 115 0.950 88.2

Aggregate: 89.76 tok/s, acceptance rate 0.9393, wall time 21.86 s.

ik_llama.cpp (same model, same hardware):

Prompt Predicted Draft Accepted Rate tok/s
code_python 192 135 122 0.904 105.1
code_cpp 192 136 120 0.882 110.3
explain_concept 192 133 116 0.872 109.0
summarize 56 38 37 0.974 122.3
qa_factual 192 141 127 0.901 116.0
translation 192 143 113 0.790 104.1
creative_short 192 133 118 0.887 109.4
stepwise_math 192 140 125 0.893 114.6
long_code_review 192 128 108 0.844 101.4

Aggregate: 110.24 tok/s, acceptance rate 0.8749, wall time 16.64 s.

Notice the paradox: llama.cpp won on acceptance rate (93.9% vs. 87.5%) but lost on wall-clock time by over five seconds. Whatever overhead the official implementation introduced, likely a combination of inefficient CPU/GPU hybrid scheduling and CUDA graph thrashing, more than erased its accuracy advantage.

For background on llama.cpp MTP implementation and its performance trade-offs, the broader picture shows that MTP’s gains are highly workload-dependent, but upstream’s current execution clearly leaves tokens on the table.

Why Official llama.cpp Stumbles Under Real Load

The performance gap isn’t random. When you force an 18.6 GB model to live inside 12 GB of VRAM, you enter the hostile territory of partial CPU offload, and that is exactly where ik_llama.cpp is reputed to outperform its parent project. The fork includes aggressive optimizations for hybrid inference, tensor placement decisions, buffer reuse, and KV cache management that upstream either hasn’t merged or hasn’t implemented.

Three failure modes from the broader developer community explain why MTP can regress instead of accelerate:

  1. KV cache thrashing: Generating multiple draft candidates per step churns the KV cache far more aggressively than standard autoregressive decoding. On a memory-starved GPU, that extra pressure spills into system RAM or triggers reallocations, stalling the CUDA cores.
  2. CUDA graph capture collapse: llama.cpp relies on CUDA graphs to amortize kernel-launch overhead. MTP introduces dynamic shapes because the number of accepted tokens varies per step. If the graph is re-captured every iteration, you burn more time on setup than you save on inference. The unofficial fork appears to handle graph stability better under these variable conditions.
  3. Draft-model overhead at high concurrency: While less relevant to the single-user 12 GB scenario, upstream MTP can actually hurt throughput under concurrent load because the draft head adds compute that saturates an already-busy GPU. One developer found that at concurrency 4, MTP reduced generation speed from 41.5 tok/s to 29.9 tok/s.

Combined, these factors mean that a superficially high acceptance rate can mask a broken pipeline. If verifying 93% of drafts takes longer than generating 87% of them in a better-optimized engine, the “worse” engine wins.

The Escape Hatch: Exact Commands That Matter

Official llama.cpp launch:

llama-server \
  -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
  --fit on \
  --fit-target 512 \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cache-type-k-draft q8_0 \
  --cache-type-v-draft q8_0 \
  --spec-type draft-mtp \
  --spec-draft-p-min 0.75 \
  --spec-draft-n-max 3 \
  --no-mmap \
  --mlock \
  --threads 8 \
  --temp 0.0

ik_llama.cpp launch (the winning config):

llama-server \
  -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
  --fit \
  --fit-margin 1664 \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cache-type-k-draft q8_0 \
  --cache-type-v-draft q8_0 \
  --multi-token-prediction \
  --draft-p-min 0.75 \
  --draft-max 3 \
  --no-mmap \
  --mlock \
  --threads 8 \
  --temp 0.0

Key differences:
--fit in ik_llama.cpp is auto-calibrating, upstream requires --fit on plus a --fit-target hint.
--fit-margin 1664 explicitly reserves 1664 MiB of VRAM breathing room in the fork. If you hit OOM, bump this to 1792 or 2048, an extremely useful parameter that upstream’s equivalent setup doesn’t expose as cleanly.
– The draft flags are completely renamed: --multi-token-prediction, --draft-p-min, and --draft-max replace --spec-type draft-mtp, --spec-draft-p-min, and --spec-draft-n-max.

If you are running the GPU as a secondary device with your monitor plugged into the iGPU, freeing the entire 12GB for inference, both configurations work, but only the fork seems to respect the constraints efficiently. There is a performance comparison of ik_llama.cpp across different hardware configs that shows this pattern repeating on 24 GB cards as well.

Squeezing 35B Parameters Into 12GB: VRAM Tactics

Running an 18.6 GB quant on a 12 GB card is not for the faint of heart. Beyond the --fit-margin tuning, several OS-level hacks claw back precious megabytes:

  • iGPU trick: Plug your monitor into the motherboard’s integrated graphics instead of the discrete GPU. This leaves 100% of the 12 GB available for inference rather than losing hundreds of megabytes to the display buffer.
  • CPU-only desktop sessions: On CachyOS with KDE Wayland, creating a custom .desktop session that forces software rendering (LIBGL_ALWAYS_SOFTWARE=1 GALLIUM_DRIVER=llvmpipe) drops idle VRAM usage from ~1 GB to roughly 126 MB. One test even showed active workload VRAM dropping from ~1 GB to ~162 MB. That is nearly a full gigabyte reclaimed, enough to push a borderline model from OOM to stable.
  • Quantize the KV cache: Using q8_0 for both K and V caches, plus their draft equivalents, is mandatory at this scale. The alternative is f16, which would consume roughly double the memory and immediately crash the load.

These maneuvers are tedious, but they are the difference between actually running the model and staring at a CUDA out-of-memory error.

Forks, Fragmentation, and the Cost of Conservative Merges

The existence of ik_llama.cpp is not an accident. Tensions between the original fork’s maintainer and the upstream project have been well documented. While some praise llama.cpp’s conservatism, arguing that maintainers have no obligation to merge every optimization, others note that the friction pushes genuine performance work into permanently fragmented branches. There is a long history of llama.cpp adopting ik_llama’s quantization techniques precisely because the fork routinely proves its innovations in the wild before upstream feels pressure to integrate them.

In the case of MTP, the beta period generated excitement about llama.cpp closing the gap with vLLM, as covered in earlier looks at the competitive positioning of MTP against other inference engines. Yet the post-merge reality for VRAM-constrained users is that the official implementation currently serves as a proof-of-concept, while the fork is production-grade for anyone counting tokens per second.

How to Diagnose Your Own MTP Setup

If you have already enabled MTP and your results feel underwhelming, run through this checklist before blaming your GPU:

  1. Measure acceptance rate. If your aggregate acceptance rate is below ~60%, MTP is actively hurting you. You are paying the compute cost of draft generation without reaping the parallel-verification reward. Use a standardized benchmark script like mtp-bench.py rather than eyeballing a single chat response.
  2. Monitor VRAM headroom. Run nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv -l 1 in a second terminal. If you are pinned above 95% utilization, MTP’s extra KV allocations are likely spilling into slower memory. Reduce --ctx-size, quantize the cache more aggressively, or increase --fit-margin.
  3. Test CUDA graph stability. As a control, force-disable graphs and compare:
    bash
    GGML_CUDA_DISABLE_GRAPHS=1 ./llama-cli -m your-model.gguf ...

    If throughput improves with graphs disabled, your build is re-capturing the graph every step, a known failure mode under MTP’s variable token shapes.
  4. Tune --draft-p-min (or --spec-draft-p-min). Counterintuitively, raising the minimum acceptance probability from the default to around 0.75 often increases realized throughput. It forces the draft head to only propose high-confidence tokens, which reduces rejections and wasted KV churn.
  5. Match the quantization to the model head. Some MTP heads are more sensitive to aggressive quants than the base weights. If acceptance mysteriously tanks after switching to a smaller quant, that is your culprit.

For additional context on how Qwen3.6 models integrate with the broader llama.cpp ecosystem, and why this model family is worth the trouble in the first place, see our earlier background on Qwen3 models and their integration into llama.cpp ecosystem.

The Takeaway

MTP is not a free speed boost. It is a high-wire act that demands the right model quant, the right launch flags, and, at least for now, the right fork. The official llama.cpp merge proved that speculative decoding can exist upstream without actually being optimal for the users who need it most. On constrained VRAM, the 23% gap between 90 tok/s and 110 tok/s is the difference between a usable local setup and a frustrating regression.

If you are running hybrid CPU/GPU offload on a 12 GB card, ik_llama.cpp is currently the best path to the performance that MTP promised. Set your --fit-margin, pin your draft probability at 0.75, and stop giving your display buffer a slice of VRAM it does not deserve. The tokens you save may be your own.

Share:

Related Articles