
AMD RDNA3 Users Finally Get Decent llama.cpp Performance - Here's What Changed
New optimizations fix critical performance drops and crashes on AMD RDNA3 GPUs, delivering faster long-context inference on hardware like Ryzen AI Max 395.
The AMD versus NVIDIA battle in AI inference just got more interesting. For months, AMD RDNA3 GPU users suffered through disappointing performance in llama.cpp’s ROCm backend - particularly as context lengths grew. Promising hardware like the Ryzen AI Max 395 with its powerful integrated RDNA3 GPU was hamstrung by software bottlenecks that made long-context workloads practically unusable.
That story just changed dramatically.
The Problem: RDNA3’s Performance Cliff
The core issue was simple but devastating: llama.cpp’s ROCm/HIP backend with rocWMMA enabled would experience catastrophic performance degradation as context length increased. What started as promising performance at short context lengths turned into a performance cliff at longer sequences.
The data reveals how bad it was. Comparing standard rocWMMA versus the default HIP path on a Llama 3.2 1B model:
At 65K context length, prefill throughput dropped by 15% while decode throughput plummeted by 58%. Larger models suffered as well: on GPT-OSS-20B, decode performance at 65K context saw a 57% regression compared to the standard HIP backend without rocWMMA.
What made this particularly frustrating was that rocWMMA was supposed to be the optimized path. The official llama.cpp ROCm build documentation ↗ recommends enabling -DGGML_HIP_ROCWMMA_FATTN=ON specifically for better FlashAttention performance on RDNA3+ architectures. This recommendation had become a performance trap.
The Fix: Understanding What Went Wrong
Developer randomfoo2 discovered the root causes while digging into ROCm’s matrix multiplication acceleration. The problems were architectural:
First, the decode path was incorrectly routed through WMMA (Wave Matrix Multiply-Accumulate) operations where the simpler VEC/TILE kernels would suffice, forcing matrix-instruction usage onto work that gained nothing from it.
Second, there was an architectural mismatch in how FlashAttention was being tuned for RDNA3. The configuration wasn’t optimized for AMD’s hardware characteristics, particularly around local data share (LDS) usage and block residency.
Most concerning of all, the default HIP path lacked a critical safety guard: when a predicted TILE split had no suitable configuration, the system would crash rather than fall back to VEC operations. This explained many of the ROCm backend crashes users had reported.
The Numbers: Performance Transformation
The optimized branch delivers staggering improvements, particularly at the context lengths where ROCm previously struggled most:
Llama 3.2 1B Q4_K_M Performance Comparison
Prefill Improvements (prompt processing; higher is better):
| Context Depth | Default rocWMMA | Optimized rocWMMA | Improvement |
|---|---|---|---|
| 512 tokens | 4884 tokens/s | 4970 tokens/s | +1.75% |
| 16K tokens | 1266 tokens/s | 2065 tokens/s | +63.14% |
| 65K tokens | 360 tokens/s | 706 tokens/s | +96.11% |
Decode Improvements (token generation; higher is better):
| Context Depth | Default rocWMMA | Optimized rocWMMA | Improvement |
|---|---|---|---|
| 4K tokens | 144 tokens/s | 173 tokens/s | +20.74% |
| 16K tokens | 88 tokens/s | 127 tokens/s | +45.11% |
| 65K tokens | 27 tokens/s | 65 tokens/s | +136.06% |
The numbers tell a clear story: what was supposed to be an optimization path was actually making RDNA3 performance substantially worse in realistic use cases.
What Changed Technically
The optimized branch ↗ implements several key fixes:
Adaptive KQ Stride Optimization
The FlashAttention implementation now adjusts the KQ stride dynamically based on head dimension, using a stride of 128 for D <= 128 to reduce LDS footprint. This improves memory efficiency without sacrificing compute throughput.
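To illustrate the idea (this is a minimal sketch, not the branch's actual code; the helper name and the 256-wide alternative are assumptions), the stride choice reduces to a small function of the head dimension:

```cpp
// Sketch only: smaller heads do not need a wide K/Q stride, and capping it
// at 128 halves the per-workgroup LDS footprint, letting more blocks stay
// resident per compute unit.
#include <cstdio>
#include <initializer_list>

// Hypothetical helper; the real branch bakes this into kernel template
// parameters rather than a runtime function.
constexpr int choose_kq_stride(int head_dim) {
    return head_dim <= 128 ? 128 : 256;  // 256 is an assumed wide default
}

int main() {
    for (int d : {64, 80, 128, 256}) {
        std::printf("head_dim=%3d -> KQ stride %d\n", d, choose_kq_stride(d));
    }
    return 0;
}
```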
Block Residency Enhancement
Tighter __launch_bounds__ constraints raise HIP occupancy, keeping at least 2 blocks resident per compute unit and making better use of RDNA3's compute resources.
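A minimal sketch of what such a residency hint looks like in CUDA/HIP-style C++ (the kernel body, block size, and names are invented for illustration; llama.cpp's real FlashAttention kernels are far more involved):

```cpp
// Illustration only: the second __launch_bounds__ argument asks the compiler
// to budget registers and LDS so that at least 2 blocks can stay resident
// per SM/CU (CUDA semantics). Compile with nvcc or hipcc; llama.cpp's ROCm
// backend builds its CUDA-style kernels through hipcc.
#define FATTN_THREADS 256  // hypothetical block size for this sketch

__global__ void __launch_bounds__(FATTN_THREADS, 2)
flash_attn_tile_sketch(const float * q, const float * k, float * out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = q[i] * k[i];  // placeholder math, not real attention
    }
}
```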
Intelligent Kernel Selection
The system now properly avoids WMMA for decode operations, using HIP's tuned VEC/TILE selection instead. This prevents the performance cliff that occurred when matrix (WMMA) instructions were used for work better suited to the plain vector and tile paths.
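A hypothetical sketch of that dispatch decision follows; the function, the enum, and the batch-size threshold are all invented here, and the real logic in ggml's FlashAttention dispatch weighs more factors:

```cpp
// Sketch of the selection idea: decode steps (a single new query token) take
// the vector path, small batches take the tiled path, and WMMA is reserved
// for prefill-sized batches where matrix instructions have enough work to
// amortize their overhead.
#include <cstdio>

enum class fattn_kernel { VEC, TILE, WMMA };

fattn_kernel select_fattn_kernel(int n_queries, bool wmma_available) {
    if (n_queries == 1) {
        return fattn_kernel::VEC;   // decode: one query per step
    }
    if (!wmma_available || n_queries < 32) {  // threshold is an assumption
        return fattn_kernel::TILE;
    }
    return fattn_kernel::WMMA;      // large prefill batches
}

int main() {
    std::printf("decode  -> kernel %d\n", (int) select_fattn_kernel(1,   true));
    std::printf("prefill -> kernel %d\n", (int) select_fattn_kernel(512, true));
    return 0;
}
```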
Crash Prevention
Added decode-time safety guards that fall back to VEC operations when predicted TILE splits have no suitable configuration, eliminating many of the random crashes users experienced with the ROCm backend.
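In spirit, the guard behaves like this sketch (every name and the shape check are made up for illustration; the real guard inspects the predicted kernel launch configuration inside the HIP backend):

```cpp
// Sketch: if no TILE configuration fits the requested shape, degrade
// gracefully to the VEC kernel instead of aborting.
#include <cstdio>
#include <optional>

struct tile_config { int cols_per_block; int parallel_blocks; };

// Pretend lookup: returns a config only if one exists for this shape.
std::optional<tile_config> find_tile_config(int head_dim, int kv_len) {
    if (head_dim == 64 || head_dim == 128) {
        return tile_config{32, kv_len > 4096 ? 4 : 1};
    }
    return std::nullopt;  // previously this situation ended in a crash
}

void launch_fattn(int head_dim, int kv_len) {
    if (auto cfg = find_tile_config(head_dim, kv_len)) {
        std::printf("TILE: cols=%d, parallel_blocks=%d\n",
                    cfg->cols_per_block, cfg->parallel_blocks);
    } else {
        // New behaviour: fall back rather than abort.
        std::printf("no TILE config for head_dim=%d, falling back to VEC\n",
                    head_dim);
    }
}

int main() {
    launch_fattn(128, 65536);  // supported shape -> TILE
    launch_fattn(96,  65536);  // unsupported shape -> VEC fallback
    return 0;
}
```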
The Developer Reaction: Progress or Politics?
The community response has been overwhelmingly positive, with developers calling the improvements “game-changing” for affordable local inference hardware. As one commenter noted, “people like you and your PR keep alive local inference for modest wallets and old hardware.”
However, the official response highlights the complexities of open-source maintenance. llama.cpp's ROCm backend maintainer commented ↗ that they're working on a more fundamental rewrite of the WMMA kernel system:
“I very much expect a proper MMA implementation to be faster than the WMMA kernel so I don’t want to make any more changes to it until it is removed. If it turns out that the kernel in this PR is still faster at the end I will reconsider.”
This creates tension between immediate performance gains for users and long-term architectural direction. The maintainer estimates a month-long timeline for the replacement MMA kernel implementation, leaving AMD RDNA3 users in limbo.
Practical Impact: Ryzen AI Max Suddenly Useful
The performance improvements matter most for newer hardware like the Ryzen AI Max 395 with its RDNA3.5 graphics. Previously, these chips couldn’t leverage their full potential for AI workloads despite having capable hardware.
Compared to NVIDIA's DGX Spark, the gap has narrowed considerably. With the optimized ROCm backend, the token-generation deficit at 32K context shrinks from 117.5% slower to just 6.4% slower. For prefill, the gap also narrows sharply, from 445.6% slower to 199.4% slower.
While NVIDIA still leads, the optimized ROCm backend makes AMD hardware genuinely competitive for local AI inference, especially considering the price-performance advantage.
How to Get the Improvements Today
For developers eager to try these optimizations, the process is straightforward: check out the optimized branch and build llama.cpp with the ROCm/HIP backend and -DGGML_HIP_ROCWMMA_FATTN=ON enabled, the same flag the official documentation already recommends.
Community members have also created Docker containers ↗ with the optimizations pre-applied for easier testing.
The Bigger Picture: AMD’s AI Ecosystem Matures
These optimizations represent more than just performance improvements - they signal a maturing ROCm ecosystem for AI workloads. The fact that community developers can achieve such dramatic improvements suggests the hardware potential was always there, waiting for the software to catch up.
For developers with AMD hardware, particularly newer RDNA3-based systems, this means:
- Long-context inference is now practical - no more avoiding workflows that require substantial context windows
- Fewer crashes - the safety guards prevent many common failure modes
- Better hardware utilization - RDNA3's WMMA matrix acceleration finally delivers on its promise
- Viable alternative to NVIDIA - AMD GPUs can now handle serious AI workloads without being crippled by software bottlenecks
The improvements come at a crucial time as AMD continues pushing AI capabilities into its consumer hardware lineup. With Ryzen AI 300 series processors bringing more powerful NPUs to laptops, robust ROCm support becomes increasingly important.
Looking Forward
While the optimizations currently live in a community branch, the performance gains are too significant to ignore. They demonstrate that with proper tuning, AMD hardware can deliver competitive AI inference performance.
The question now is whether these improvements will make it into the mainline llama.cpp repository or if users will need to maintain forks. Given the substantial performance benefits and AMD’s growing importance in the AI hardware ecosystem, there’s strong incentive for upstream adoption.
For now, AMD RDNA3 users finally have a path to competitive llama.cpp performance - and that’s a win for everyone pushing the boundaries of local AI inference.



