Gemma 4’s MTP Drafters: Not Just a Speed Hack, But an Architectural Power Shift
The narrative in local AI has been straightforward: for speed, pay the cloud tax or suffer the latency. Google’s latest move with the Gemma 4 family shatters that bargain. By releasing Multi-Token Prediction (MTP) drafters alongside its main models, Google isn’t just offering a performance tweak; it’s architecting a fundamental shift in how efficient inference can be. This isn’t about squeezing out a few extra tokens per second; it’s about decoupling the physics of memory bandwidth from the logic of token generation.
Here’s the stark reality: standard LLM inference is memory-bandwidth bound. Your GPU’s processor spends most of its time fetching billions of parameters from VRAM just to produce a single token, leading to underutilized silicon and sluggish responses, especially on consumer hardware. Speculative decoding, the technique MTP implements, tackles this head-on by separating the roles of prediction and verification. It’s like having a hyper-fast scout running ahead to map the trail, while the seasoned expert follows behind, confirming the path in a single, efficient glance.
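To see why bandwidth, not compute, sets the ceiling, run the back-of-envelope numbers. Everything below is an illustrative assumption (a ~1 TB/s consumer-class GPU, bf16 weights), not a measurement of any specific card or Gemma 4 build:

```python
# Back-of-envelope: decode speed when every token requires streaming all weights.
# All figures are illustrative assumptions, not benchmarks.

PARAMS = 31e9            # assumed dense parameter count (Gemma 4 31B)
BYTES_PER_PARAM = 2      # assumed bf16/fp16 weights
BANDWIDTH = 1000e9       # assumed ~1 TB/s effective VRAM bandwidth

weight_bytes = PARAMS * BYTES_PER_PARAM    # ~62 GB touched per token
tokens_per_sec = BANDWIDTH / weight_bytes  # upper bound if bandwidth-bound

print(f"~{tokens_per_sec:.1f} tokens/s ceiling")  # ~16 tokens/s
```

Verifying a block of drafted tokens costs roughly one pass through those same weights, which is exactly the loophole speculative decoding exploits.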
Google’s implementation, however, goes beyond the standard speculative decoding playbook. Let’s tear into the details.
How MTP Drafters Actually Work: The Scout-and-Verify Paradigm
At its core, the system pairs a heavyweight “target” model (like the 31B-parameter Gemma 4) with a featherweight “drafter” model. The official technical documentation lays out the process clearly: the drafter model predicts several tokens autoregressively in the time it takes the target model to process just one. The target model then verifies this entire proposed sequence in a single parallel forward pass.
If the target model agrees with the draft, it accepts the whole block and even generates an extra token of its own in the same step, so up to N+1 tokens emerge in the time previously needed for one. If it rejects a token, it discards the rest of that draft sequence and generates the correct token itself, ensuring output quality is mathematically identical to standard autoregressive generation. No hallucinations, no quality drop, just raw speed.
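A minimal greedy sketch of that loop, with two hypothetical callables (`draft_next` for the cheap drafter, `target_argmax` for one parallel target pass) standing in for the real models:

```python
def speculative_step(prefix, draft_next, target_argmax, k=4):
    """One draft-and-verify step (greedy variant, illustrative only).

    draft_next(tokens)    -> drafter's next-token guess (cheap)
    target_argmax(tokens) -> the target's greedy pick after every
                             prefix of `tokens`, from ONE parallel
                             forward pass (expensive)
    """
    # 1. Scout: the drafter proposes k tokens autoregressively.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2. Verify: a single target pass scores the whole proposed block.
    #    Slicing keeps the k+1 predictions that matter here.
    verified = target_argmax(prefix + draft)[len(prefix) - 1:]

    # 3. Accept the longest matching run; on a mismatch, keep the
    #    target's own token and discard the rest of the draft.
    accepted = []
    for guess, truth in zip(draft, verified):
        if guess == truth:
            accepted.append(guess)   # drafter was right
        else:
            accepted.append(truth)   # target overrules; stop here
            break
    else:
        accepted.append(verified[len(draft)])  # bonus (N+1)th token

    return prefix + accepted
```

Real implementations use a probabilistic acceptance rule so sampled outputs match the target’s distribution too, but the greedy version shows the accept/reject control flow.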
The drafter models aren’t standalone entities. They are clever architectural extensions. They share the target model’s input embedding table and, crucially, build directly upon its last-layer activations. This means they’re not starting from scratch, they’re leveraging the “thinking” the larger model has already done. For the smallest models, the E2B and E4B, Google adds another trick: an “efficient embedder” that uses token clustering to avoid the expensive operation of predicting across the entire 262k-token vocabulary.
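A hypothetical PyTorch sketch of such a head follows. Only the two couplings described above, the shared embedding table and the conditioning on last-layer activations, come from the release notes; every size and layer choice here is an assumption:

```python
import torch
import torch.nn as nn

class MTPDrafterHead(nn.Module):
    """Hypothetical drafter head. The shared embedding table and use of
    the target's last-layer activations are from the description; the
    hidden size and single transformer block are illustrative guesses."""

    def __init__(self, target_embedding: nn.Embedding, hidden: int = 640):
        super().__init__()
        self.embed = target_embedding                # shared, not copied
        d_target = target_embedding.embedding_dim
        self.proj = nn.Linear(d_target * 2, hidden)  # fuse activation + token
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=8, batch_first=True
        )
        self.lm_head = nn.Linear(hidden, target_embedding.num_embeddings)

    def forward(self, last_hidden: torch.Tensor, last_token: torch.Tensor):
        # last_hidden: target's final-layer activation at the current position
        # last_token:  id of the most recently committed token
        tok = self.embed(last_token)
        x = self.proj(torch.cat([last_hidden, tok], dim=-1)).unsqueeze(1)
        x = self.block(x)
        return self.lm_head(x.squeeze(1))            # logits over full vocab
```

The E2B/E4B “efficient embedder” presumably swaps that full 262k-way output projection for a clustered, two-stage prediction; that piece is omitted here.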
The Four Drafters: From Workstation to Edge
Google released four drafters, one for each main Gemma 4 model size, available now on Hugging Face:
- gemma-4-31B-it-assistant (0.5B params): For the flagship dense 31B model.
- gemma-4-26B-A4B-it-assistant (0.4B params): For the 26B Mixture-of-Experts (MoE) model, which only activates 4B parameters per token.
- gemma-4-E4B-it-assistant (78.8M params): For the 4.5B effective-parameter dense model.
- gemma-4-E2B-it-assistant (78M params): For the 2.3B effective-parameter dense model.
It’s the last two that are the real eyebrow-raisers. The E2B drafter is a mere 78 million parameters. As developers noted in forums, that’s a “cute” and “tiny little safetensor” enabling high-performance inference on mobile phones. This drafter leverages Gemma 4’s massive 262k-token vocabulary, itself a form of “knowledge compression”, to punch far above its weight class.
This move aligns with a broader industry paradigm shift toward AI efficiency over raw parameter counts. The focus is no longer just on total FLOPs, but on active compute and memory movement.
Why This Matters for Mixture-of-Experts
The 26B A4B MoE model presents a unique case. As Google notes, MoE models activate different experts per token. Verifying a batch of drafted tokens might require loading different expert weights from memory, which can offset drafting gains at batch size 1. However, at higher batch sizes (e.g., 4-8), expert activation overlap increases, and the MTP benefits shine, reportedly unlocking up to ~2.2x speedups. This nuance is critical for developers tuning for local inference performance on consumer hardware.
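As a rough intuition pump, a toy simulation makes the amortization visible. It assumes uniform random routing (real routers are far from uniform) and an expert count and top-k that are placeholders, not Gemma 4’s actual configuration:

```python
import random

NUM_EXPERTS = 64   # assumed expert count, illustrative only
TOP_K = 2          # assumed experts activated per token
DRAFT_LEN = 5      # drafted tokens verified per step

def unique_experts(batch_size: int, trials: int = 2_000) -> float:
    """Average distinct experts whose weights must be loaded to verify
    one drafted block across `batch_size` concurrent sequences."""
    total = 0
    for _ in range(trials):
        touched = set()
        for _ in range(batch_size * DRAFT_LEN):
            touched.update(random.sample(range(NUM_EXPERTS), TOP_K))
        total += len(touched)
    return total / trials

for bs in (1, 4, 8):
    print(f"batch {bs}: ~{unique_experts(bs):.1f} experts loaded "
          f"for {bs * DRAFT_LEN} token verifications")
```

Per verified token, expert-weight traffic falls sharply as the batch grows, which is consistent with the ~2.2x figure showing up at batch sizes 4-8 rather than 1.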
The Performance Claim: Up to 3x Speedup, For Real?
The official blog post shows a chart of tokens-per-second gains measured across LiteRT-LM, MLX, Hugging Face Transformers, and vLLM, claiming “up to a 3x speedup.” Real-world performance will depend heavily on hardware, batch size, and the “acceptance rate” of the drafter’s tokens.
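Standard speculative-decoding arithmetic shows why acceptance rate dominates. A minimal sketch, assuming a fixed draft length k, an i.i.d. per-token acceptance probability a, and a drafter cost c expressed as a fraction of one target forward pass (all three are modeling assumptions, not Google’s benchmark methodology):

```python
def expected_speedup(a: float, k: int = 5, c: float = 0.02) -> float:
    """Illustrative speculative-decoding speedup model.

    a: probability each drafted token is accepted (i.i.d. assumption)
    k: tokens drafted per step
    c: drafter forward-pass cost relative to the target's
    """
    # Expected tokens committed per step: the accepted prefix plus the
    # target's own token (a correction, or the bonus on full acceptance).
    expected_tokens = sum(a**i for i in range(1, k + 1)) + 1
    step_cost = 1 + k * c   # one target pass + k drafter passes
    return expected_tokens / step_cost

for a in (0.6, 0.8, 0.9):
    print(f"acceptance {a:.0%}: ~{expected_speedup(a):.2f}x")
```

At roughly 80% acceptance this toy model lands near the claimed 3x, and it degrades gracefully as the drafter’s guesses get worse.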
The drafter uses a heuristic schedule for the number of tokens to draft (num_assistant_tokens_schedule = "heuristic"). If all draft tokens are accepted, it increases the draft length by 2; if any are rejected, it decreases it by 1. This adaptive mechanism seeks to maximize throughput without wasting cycles on incorrect drafts; the sketch below shows the rule in isolation.
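In code, the described schedule amounts to a tiny update function (the floor of 1 is an assumption here):

```python
def update_draft_length(num_assistant_tokens: int, num_matches: int) -> int:
    """Adaptive draft length, as described: grow by 2 when the entire
    draft is accepted, shrink by 1 on any rejection (clamped to >= 1)."""
    if num_matches == num_assistant_tokens:    # full acceptance: be bolder
        return num_assistant_tokens + 2
    return max(1, num_assistant_tokens - 1)    # partial: back off
```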
In Hugging Face Transformers, enabling the drafter is a single extra argument to generate. A minimal runnable sketch, using the model IDs from the release:

```python
from transformers import AutoProcessor, AutoModelForCausalLM

TARGET_MODEL_ID = "google/gemma-4-31B-it"
ASSISTANT_MODEL_ID = "google/gemma-4-31B-it-assistant"

processor = AutoProcessor.from_pretrained(TARGET_MODEL_ID)
target_model = AutoModelForCausalLM.from_pretrained(TARGET_MODEL_ID, dtype="auto", device_map="auto")
assistant_model = AutoModelForCausalLM.from_pretrained(ASSISTANT_MODEL_ID, dtype="auto", device_map="auto")

# Tokenize a prompt for the target model
inputs = processor("Explain speculative decoding in one paragraph.", return_tensors="pt").to(target_model.device)

# Generate with MTP
outputs = target_model.generate(
    **inputs,
    assistant_model=assistant_model,  # this one line enables MTP
    max_new_tokens=256,
)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
This simplicity masks a sophisticated backend that shares KV caches and uses the target model’s activations, a level of integration that makes earlier, more generic speculative decoding implementations look clunky.
The Bigger Picture: A Blueprint for Commoditized Inference
Google’s decision to release these as open-weight, Apache 2.0-licensed components is a strategic masterstroke. It provides a ready-made, optimized architectural blueprint for commoditized inference that anyone can use and build upon. It directly pressures the ecosystem, pushing other model providers and inference engine developers to match this level of integration.
This also directly competes with other efficiency frontiers. While projects are working on native Multi-Token Prediction support in engines like llama.cpp, Google’s approach is model-native rather than bolted onto the engine. Furthermore, the efficiency of the E2B/E4B drafters, combined with Gemma’s Per-Layer Embeddings (PLE), shows a deep investment in on-device AI that goes beyond just scaling down large models.
For developers building agentic workflows or real-time applications, the implications are profound. Faster token generation means more complex reasoning steps per second, more responsive chat interfaces, and more viable on-device autonomous systems. It changes the calculus for what’s possible on a laptop, a phone, or an edge device.
The Bottom Line: It’s About Latency, Not Just Throughput
The ultimate value of MTP drafters isn’t just a higher number on a benchmark chart. It’s about reducing perceived latency, which is the killer metric for user experience. When an AI coding assistant can stream suggestions almost as fast as you type, or a local agent can re-plan its next move without a noticeable pause, the technology fades into the background where it belongs.
Google’s Gemma 4 MTP release is a clear signal: the frontier of AI is no longer just about who has the biggest model, but about who can deliver that model’s intelligence the fastest, most efficiently, and most responsively. For developers, the message is clear. Hand-tuned kernels for faster tokens still matter, but architectural innovations like integrated speculative decoding are becoming the new baseline. The race for efficient inference just got a definitive, open-source entry from a major player. Now it’s time to see what the community builds with it.
