20x Faster Top-K Sampling Without a GPU: The AVX2 Optimization Rewriting LLM Inference Rules
A new open-source AVX2-optimized Top-K implementation achieves a 20x speedup over PyTorch's CPU kernel and delivers 63% faster prompt processing in llama.cpp for large MoE models, in some cases matching CUDA performance without the overhead of a GPU.