BANANDRE
NO ONE CARES ABOUT CODE



Tagged with

#llm-inference

2 articles found

Featured
Vulkan Is Quietly Outpacing CUDA for Specific LLMs on Consumer GPUs

Benchmarks reveal Vulkan achieving up to 2.2× speedup over CUDA for select quantized models on RTX 3080, challenging assumptions about optimal local inference backends.

#cuda #gpu-acceleration #llama.cpp ...
Tencent’s WeDLM 8B: When Diffusion Models Beat Autoregressive LLMs at Their Own Game

Tencent’s diffusion-based language model achieves 3-6× faster inference than vLLM-optimized Qwen3-8B on math reasoning, challenging the token-by-token generation paradigm that has dominated LLMs since GPT-2.

#diffusion-models #llm-inference #math-reasoning ...
© 2026 BANANDRE
Privacy Policy · Terms · Impressum
Built with 🍌