How a Single llama.cpp PR Just Fixed Agentic Coding’s Worst Performance Bottleneck

You spend 15 minutes discussing a feature with your local coding agent. Fifty thousand tokens of back-and-forth. Then you say “implement it”, and it reads your files, writes code, runs commands, produces twenty thousand more tokens. The code is ready. You sigh, type “thank you”, and… nothing.

You wait. And wait. You watch your terminal log as that dreaded message scrolls by: forcing full prompt re-processing due to lack of cache data.

Seventy thousand tokens. Reprocessed. From scratch.

If you’ve been running agentic coding tools like opencode, pi, or GitHub Copilot on local models, you know this pain intimately. The model that was snappy a moment ago suddenly forgets everything and takes minutes to “catch up.” The culprit? Not the model. Not your hardware. It’s how your agentic tool mutates the conversation history.

Diagram showing the performance bottleneck of full prompt re-processing in local agentic coding workflows — Illustration of the token re-processing bottleneck that the llama.cpp checkpoint fix aims to eliminate.

The Hidden Tax of “Smart” Context Optimization

Here’s the pattern that’s probably wrecking your agentic coding sessions. Tools like opencode try to be clever with context management. They rewrite the conversation history, removing reasoning tokens, trimming tool outputs, or restructuring message boundaries. Their goal is noble: keep the context window lean and the model focused.

The reality is brutal.

Every edit to the conversation history forces llama.cpp to reevaluate its prompt cache. In the best case, it only reprocesses from the point of modification, say, twenty thousand tokens. In the worst case, and this is what’s been crushing workflows, it invalidates the entire cache chain and reprocesses everything from token zero.

The developer of this fix, Jacek Poplawski, documented exactly this behavior with a clean metaphor: “Some tools, like opencode, try to be smart and optimize the context. They modify something in the conversation history. In the best case, llama.cpp has to reprocess everything from that point. In the worst case, it has to reprocess the entire context.”

His solution? He switched from opencode to pi. Not because pi is magic, but because it doesn’t rewrite the context history.

The Checkpoint Problem: Context Boundaries vs. Token Counts

Before this PR, llama.cpp created context checkpoints at fixed token intervals. You’d get a checkpoint every 8192 tokens or whatever --checkpoint-every-n-tokens you’d set. This is fine for simple chat scenarios, but it’s catastrophically wrong for agentic coding.

The problem is that agentic conversations have semantic boundaries, user message changes, tool call completions, system prompt injections, that don’t align with token intervals. When an agent removes reasoning tokens from a past assistant response, it changes content between checkpoints. But the cache chain doesn’t know that only that section changed. Invalidate one checkpoint, and the entire chain collapses.

The resulting behavior is what developers like joost00719 called out: “Finally. Been struggling with this a lot.” The fix involves something deceptively simple: instead of placing checkpoints at arbitrary token counts, place them at conversation message boundaries.

How the Conversation-Boundary Checkpoint Works

The PR, merged on May 25, 2026, changes the entire philosophy of context checkpointing. The core logic:

Extract message_spans from chat templates: The server now understands where each user message begins and ends in the token stream.
Find the prompt token position before the latest user message: Instead of arbitrary intervals, it targets the exact boundary before the newest input.
Split prompt batching at that position: The pipeline stops processing at the message boundary, creates a checkpoint, then continues. This creates a crystal-clear separation between “old context that won’t change” and “new context that might.”
Avoid periodic mid-prompt checkpoints when the boundary is known: If the server knows where the next checkpoint should go, it skips the wasteful intermediate checkpoints.

The practical effect is visible in the test logs:

slot update_slots: id  0 | task 0 | checkpoint before user input reached: ending prompt batch at prompt_n_tokens = 3549

That line is the money shot. Instead of creating checkpoints every 8192 tokens and hoping one aligns with a conversation boundary, the server now says “stop here, this is where the conversation actually changes.”

The `preserve_thinking` Connection

An interesting sub-plot emerged during testing. The preserve_thinking parameter, available when using models like Qwen 3.6, plays a crucial supporting role.

Many models strip reasoning tokens from the output before storing the conversation. This is “smart” behavior that backfires horribly. Each time the model’s thinking is removed, the prompt history changes, and llama.cpp invalidates its cache.

Jacek Poplawski’s testing command shows how to enable both features:

--chat-template-kwargs '{"preserve_thinking":true}'

His observation is blunt: “preserve_thinking really helps, without it, the prompt history changes, so there is always some reprocessing.”

The combination of these two features, conversation-boundary checkpoints and preserved thinking tokens, is what makes agentic coding “more responsive” now. They prevent the cause of cache invalidation and minimize the impact when invalidation is unavoidable.

Real-World Performance: From 30-Second Waits to 2-Token Increments

The test logs from Qwen 3.6 27B testing reveal something remarkable. Watch how the prompt eval time shrinks as the checkpoint system matures:

Request	Tokens Processed	Prompt Eval Time	Tokens/Second
Initial	3,562	2,163.37 ms	1,646.51
Follow-up	451	495.71 ms	909.80
Long context	12,668	7,495.20 ms	1,690.15
Checkpoint hit	42	195.19 ms	215.18
Near-hit	2	830.98 ms*	*restored from checkpoint

The pattern is clear: once a checkpoint is established at a conversation boundary, subsequent requests with the same prefix only need to process the difference. The worst case, full reprocessing, becomes the exception rather than the rule.

One user, corrm, reported: “The message forcing full prompt re-processing due to lack of cache data never shows up in logs, and now almost all the time, I get a cache hit.”

Another tester noted that after fixing the checkpoint spacing, even with 8 checkpoints instead of 24, the system remained stable. The conversation-boundary approach is so effective that the number of checkpoints matters less than where they’re placed.

The Remaining Edges: Where It Still Breaks

The fix isn’t perfect, and the community has identified at least two remaining failure modes.

The GitHub Copilot regression: User mscheurwater found that while pi works beautifully, GitHub Copilot’s workflow hits a different pattern. Copilot appends a unique session UUID to the system prompt for each new chat:

VSCODE_TARGET_SESSION_LOG: .../debug-logs/0207a5a3-7ced-472e-9eec-d93c811c0bfa

Every new session gets a different UUID, which changes the system prompt prefix. The checkpoint at token position 14,910 is now wrong because the content before the checkpoint has changed. Jacek Poplawski acknowledged this: “the functionality you actually need is a ‘system prompt checkpoint’… I could later try to create another PR with that.”

The SWA/hybrid model gap: Users running Qwen 3.5 122B on M3 Ultra hardware reported that some forcing full prompt re-processing messages still appear, especially with SWA (sliding window attention) or hybrid recurrent models. The prompt cache logic for these architectures has different invalidation patterns.

Checkpoint storage location: User DistanceSolar1449 raised a valid point: “Checkpoints in VRAM is terrible, checkpoints stored in RAM is slightly better (but not good for MoE models or Macs). Ideally checkpoints should be stored on a fast SSD.” A MacBook Pro M1 can read a 35B model’s KV cache from SSD in about one second, potentially faster than keeping it in limited VRAM.

What This Means for Your Agentic Coding Setup

If you’re running llama.cpp for local agentic coding, the changes are already live in master. Here’s what you need to configure:

Minimum configuration for the checkpoint fix:

./bin/llama-server \
  --ctx-checkpoints 24 \
  --checkpoint-min-step 256 \
  --jinja

Recommended configuration for Qwen 3.6 users:

./bin/llama-server \
  -c 200000 \
  --ctx-checkpoints 24 \
  --cache-ram 65536 \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --checkpoint-min-step 256

If you’re still seeing full reprocessing: Increase --ctx-checkpoints to 32 or 64. The PR author notes that with 8 checkpoints, he could still reproduce the issue. With 24, he worked for hours without problems.

This is particularly relevant if you’re running models like Qwen 3.6 27B on consumer hardware. The intersection of checkpoint optimization and memory constraints is where most users hit walls, and understanding these tradeoffs is essential for avoiding the kind of performance collapse that happens when VRAM runs out.

The Deeper Lesson: Agentic Coding Demands Infrastructure Rethink

This PR reveals something uncomfortable about the current state of agentic coding: the tools are lying to you.

When opencode rewrites your conversation history to “optimize context”, it’s breaking the fundamental contract with the inference engine. The inference engine assumes the prompt is immutable. The agentic tool assumes modification is safe. These assumptions are incompatible, and the user pays the price in wait time.

The broader implication is that agentic coding frameworks need to tell the truth to the inference layer. Instead of silently mutating context, they should:

Use explicit checkpoint markers for regions that are static (system prompts, codebase contexts)
Signal when context is being trimmed or modified
Preserve reasoning tokens instead of stripping them

This ties directly into the broader debate about manual optimization versus AI-generated code. The most performant local coding workflows aren’t the ones with the most sophisticated context manipulation, they’re the ones that understand the infrastructure they’re running on and work with it rather than against it.

The SSD Checkpoint Problem

One of the most interesting threads in the PR discussion is the request for SSD-based checkpoint storage. Current checkpoints live in VRAM or RAM. For dense models, RAM works. For MoE models and Macs, it’s painful.

The math from the discussion is revealing:
– A MacBook Pro M1 SSD reads at ~5GB/sec
– Qwen 3.6 35B at BF16 max context would take ~1 second from SSD
– Qwen 3.6 27B max KV cache is ~16GB → ~3 seconds to load

Three seconds is much better than a 30-second full reprocess. This isn’t implemented yet, but the logic is sound: why waste VRAM on checkpoint state you can stream from storage on demand?

For users running local models on Apple hardware with severe VRAM limitations, this feature could be transformative, it would enable checkpoint management without competing for the same constrained memory the model needs for inference.

The Bottom Line

The “fix checkpoints creation” PR is a masterclass in finding the right abstraction level for a performance problem. The old approach treated context as a linear byte stream and placed checkpoints at arbitrary positions. The new approach understands conversation structure and places checkpoints where they actually matter.

The result is that agentic coding on local models has become genuinely responsive for the first time. The 30-second pauses after typing “thank you” are gone. The full prompt reprocess that happened every few minutes now happens only when something fundamental changes.

Jacek Poplawski’s response to a tester sums it up: “As expected, it reprocessed the full prompt.” The key word is expected. The system now fails predictably rather than constantly and mysteriously. That’s the difference between a tool you fight against and a tool you can rely on.

If you’ve been on the fence about running local agentic coding, the checkpoint fix is the stability improvement you’ve been waiting for. Running Qwen 3.6 27B on consumer GPU hardware now has a caching system that doesn’t actively sabotage your workflow.

The infrastructure is finally catching up to the ambition.

How a Single llama.cpp PR Just Fixed Agentic Coding’s Worst Performance Bottleneck

The Hidden Tax of “Smart” Context Optimization

The Checkpoint Problem: Context Boundaries vs. Token Counts

How the Conversation-Boundary Checkpoint Works

The `preserve_thinking` Connection

Real-World Performance: From 30-Second Waits to 2-Token Increments

The Remaining Edges: Where It Still Breaks

What This Means for Your Agentic Coding Setup

The Deeper Lesson: Agentic Coding Demands Infrastructure Rethink

The SSD Checkpoint Problem

The Bottom Line

Related Articles

3 Billion Active Parameters Just Challenged 30 Billion: Inside Qwen3.6’s Sparse MoE

How a Single llama.cpp PR Just Fixed Agentic Coding’s Worst Performance Bottleneck

The Hidden Tax of “Smart” Context Optimization

The Checkpoint Problem: Context Boundaries vs. Token Counts

How the Conversation-Boundary Checkpoint Works

The preserve_thinking Connection

Real-World Performance: From 30-Second Waits to 2-Token Increments

The Remaining Edges: Where It Still Breaks

What This Means for Your Agentic Coding Setup

The Deeper Lesson: Agentic Coding Demands Infrastructure Rethink

The SSD Checkpoint Problem

The Bottom Line

Related Articles

3 Billion Active Parameters Just Challenged 30 Billion: Inside Qwen3.6’s Sparse MoE

The `preserve_thinking` Connection