Liquid AI’s LFM2-24B: A 24B-Parameter MoE That Runs on 32GB RAM and Makes Cloud APIs Look Overpriced

Liquid AI’s new sparse MoE model activates only 2.3B parameters per token, delivering server-class AI performance on consumer hardware while challenging the cloud-only paradigm.

Liquid AI just dropped a model that fundamentally disrupts the “bigger is better” narrative. Their LFM2-24B-A2B crams 24 billion parameters into a sparse MoE architecture that activates only 2.3 billion per token, roughly 9.6% of its total capacity. The kicker? It runs comfortably on a laptop with 32GB RAM, spitting out 112 tokens per second on a consumer AMD CPU. This isn’t another incremental efficiency gain; it’s a direct assault on the assumption that serious AI requires cloud infrastructure and enterprise budgets.

The Architecture Heresy: When Convolutions Outsmart Attention

Most frontier models are attention purists. GPT, Claude, and Gemini are essentially elaborate attention mechanisms stacked to the moon. Liquid AI looked at that orthodoxy and said, “What if we used convolutions for 75% of our layers instead?” The result is a hybrid architecture that would make a Transformer traditionalist squirm: 30 gated short convolution layers with kernel size 3, plus only 10 grouped-query attention layers.

This isn’t architectural arbitrariness. Liquid AI ran hardware-in-the-loop architecture searches and found that convolution-dominant stacks outperform alternatives under fixed on-device performance budgets. They tested linear attention, state-space models, and additional convolution variants. None beat their conv-heavy design when compute budgets were constrained.

The math is brutal for attention purists: depthwise convolutions have O(1) per-step decode cost, while attention’s KV cache balloons with context length. When three-quarters of your layers are convolutions, sequential decoding on CPUs becomes dramatically faster. This is why LFM2-24B can hit 112 tok/s on an AMD Ryzen AI Max+ 395, hardware that isn’t even server-grade.
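
The asymmetry is easy to sketch numerically. Below is a toy cost model, not Liquid AI’s actual profiler: it just counts the positions each layer type must read per generated token, using the article’s 30-conv/10-attention layer split.

```python
# Toy per-token decode cost model (illustrative only, not a real profiler).
# A depthwise conv with kernel size k reads a fixed k-sized window per step: O(1).
# An attention layer must read its entire KV cache at step t: O(t).

def conv_decode_cost(step, kernel_size=3):
    """Work per generated token for a depthwise conv layer: constant."""
    return kernel_size

def attention_decode_cost(step):
    """Work per generated token for an attention layer: grows with context."""
    return step  # attends over all previous positions

def hybrid_cost(step, conv_layers=30, attn_layers=10):
    """LFM2-24B's split per the article: 30 conv layers, 10 attention layers."""
    return conv_layers * conv_decode_cost(step) + attn_layers * attention_decode_cost(step)

def pure_attention_cost(step, layers=40):
    """A hypothetical all-attention stack of the same depth, for comparison."""
    return layers * attention_decode_cost(step)

for t in (1_000, 8_000, 32_000):
    print(t, hybrid_cost(t), pure_attention_cost(t))
```

At 32K context the hybrid stack does roughly a quarter of the per-token read work of an equally deep all-attention stack, and the conv layers’ cost never grows at all, which is exactly what CPU decoding rewards.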

The MoE implementation uses 64 experts per block with top-4 routing. For each token, the model consults 64 specialists, activates the best 4, and ignores the rest. It’s like having a hospital with 64 doctors but only calling in the specialists relevant to your symptoms.
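
The routing step can be sketched in a few lines. This is generic top-k MoE gating under the article’s 64-expert, top-4 configuration, not Liquid AI’s actual implementation: score all experts, keep the best four, and softmax-normalize their gate weights so they sum to 1.

```python
import math
import random

# Hedged sketch of top-k expert routing (generic MoE gating, not Liquid AI's code).
NUM_EXPERTS, TOP_K = 64, 4

def route(router_logits, k=TOP_K):
    """Return [(expert_index, gate_weight)] for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the 4 gate weights sum to 1.
    exps = [math.exp(router_logits[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]  # one token's router scores
selected = route(logits)
print(selected)  # 4 (expert, weight) pairs; weights sum to 1
```

The token’s output is then the gate-weighted sum of those four experts’ outputs; the other 60 experts contribute no compute at all, which is where the 24B-total/2.3B-active gap comes from.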

Liquid AI's LFM2-24B architecture diagram
| Specification | Details |
| --- | --- |
| Total Parameters | 24 billion |
| Active Parameters/Token | 2.3 billion |
| Architecture | Hybrid Conv + GQA MoE |
| Experts per MoE Block | 64 (top-4 routing) |
| Context Window | 32,768 tokens |
| Training Data | 17 trillion tokens (ongoing) |
| Memory (Q4_K_M) | ~32 GB RAM |
| Languages | English, Arabic, Chinese, French, German, Japanese, Korean, Spanish, Portuguese |

Performance Numbers That Actually Matter

Let’s cut through the benchmark theater. Here are the numbers that determine whether you can use this model today:

| Platform | Decode Speed | Notes |
| --- | --- | --- |
| AMD Ryzen AI Max+ 395 (CPU) | 112 tok/s | Q4_K_M via llama.cpp |
| NVIDIA H100 SXM5 (single stream) | 293 tok/s | vLLM |
| NVIDIA H100 SXM5 (1,024 concurrent) | 26,800 tok/s aggregate | vLLM continuous batching |

112 tokens per second on a laptop CPU isn’t just “good for local.” It’s faster than many cloud API calls under load. The H100 numbers are equally telling: 26,800 tokens/s aggregate throughput under continuous batching beats both Qwen3-30B-A3B and gpt-oss-20b, despite LFM2-24B using fewer active parameters (2.3B vs 3.3-3.6B).

For context, the smaller LFM2-8B-A1B runs at 48.6 tok/s on a Samsung Galaxy S25. The progression is clear: Liquid AI is optimizing for the edge, not just paying lip service to it.
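
The batching tradeoff behind those H100 numbers is worth a quick back-of-envelope pass, using only the figures quoted above:

```python
# Arithmetic on the article's H100 numbers (illustrative only).
single_stream = 293      # tok/s for one request
aggregate = 26_800       # tok/s summed across 1,024 concurrent requests
streams = 1_024

per_stream_batched = aggregate / streams
print(f"~{per_stream_batched:.1f} tok/s per request under continuous batching")
print(f"~{aggregate / single_stream:.0f}x aggregate speedup over a single stream")
```

Each individual request slows to roughly 26 tok/s, still readable-speed, but the GPU as a whole serves about 91 times more tokens. That is the economics continuous batching is built on.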

The 32GB RAM Sweet Spot

Here’s where things get practical. At Q4_K_M quantization, LFM2-24B fits in ~32GB RAM. That puts it within reach of:
– High-end gaming laptops (RTX 4090 mobile configs with 64GB RAM are common)
– Desktop workstations with modest GPU upgrades
– M2/M3 Ultra Mac Studios
– Even some Threadripper-based systems
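
The ~32 GB figure is easy to sanity-check. A back-of-envelope sketch, assuming Q4_K_M averages roughly 4.85 bits per weight (a commonly cited average for llama.cpp’s Q4_K_M; exact per-tensor layouts vary):

```python
# Rough sanity check of the ~32 GB claim (assumed ~4.85 bits/weight for Q4_K_M).
total_params = 24e9
bits_per_weight = 4.85

weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"quantized weights: ~{weights_gb:.1f} GB")
# The rest of a 32 GB budget goes to KV cache, activations,
# the inference runtime, and the OS itself.
print(f"headroom in 32 GB: ~{32 - weights_gb:.1f} GB")
```

The weights alone land around 14-15 GB; the 32 GB recommendation leaves comfortable room for context and everything else running on the machine.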

This is the real innovation: democratizing access to 24B-parameter models. While competitors chase trillion-parameter monsters that require server farms, Liquid AI targets the hardware developers already own. Consumer hardware capable of running large models has been improving steadily; it’s the models that haven’t kept pace on the optimization side.

The model’s 32K context window is a notable limitation compared to Qwen3’s 128K. For summarization and agentic tasks, 32K is sufficient. For long document analysis, you’ll hit walls. This is the trade-off: efficiency versus capacity.

The License: Open Weight, Not Open Source

Before you rebuild your startup on LFM2-24B, read the fine print. Liquid AI uses its custom LFM Open License v1.0, based on Apache 2.0 but with a revenue threshold. Companies under $10 million annual revenue can use it freely. Above that, you need a commercial license.

This makes it “open weight” rather than truly open source in the OSI sense. The distinction matters if you’re building a product around it. Many developers learned this lesson the hard way with other “open” models that had hidden restrictions. Liquid AI is at least transparent about the threshold, but it’s a reminder that “free” in AI often comes with asterisks.

What’s Missing (And Why It Matters)

The current release is a base model checkpoint. Training is still ongoing at 17 trillion tokens. Critical missing pieces include:

  • No reasoning/thinking mode: LFM2.5-24B-A2B with RL training is promised but undated
  • 32K context: Adequate for most tasks but lagging behind competitors
  • Instruction tuning: The current model is a base checkpoint, not optimized for chat

Liquid AI’s scaling curves suggest substantial quality improvements as training progresses. Their 1.2B LFM2.5 model already offered a preview: smaller models punching above their weight class through architectural innovation rather than brute scale.

The Biological Roots: From Worm Brains to Efficient AI

Liquid AI’s origin story is refreshingly non-corporate. Founded by four MIT researchers in 2023, the company grew out of research on C. elegans, a 1mm roundworm with exactly 302 neurons. The team studied how its neural architecture processes information through graded analog signals rather than digital spikes.

This work produced Liquid Time-Constant Networks in 2020, where parameters dynamically change based on input. While LFM2 has evolved beyond pure liquid neural networks into a pragmatic conv-attention hybrid, the efficiency-first philosophy remains.

The company has raised $297 million, including a $250 million Series A led by AMD Ventures in December 2024 that valued it at over $2 billion. That AMD relationship explains why their primary benchmark hardware is AMD’s Ryzen AI Max+ rather than Intel or Apple silicon.

Industry Disruption: Who Should Be Worried

LFM2-24B’s release sends tremors through several established narratives:

Cloud Providers: If serious models run locally, API revenue evaporates. The “inference tax” becomes harder to justify when 112 tok/s is free on your laptop.

GPU Giants: NVIDIA’s moat assumes models need their hardware. LFM2-24B performs admirably on CPUs and AMD APUs. Liquid AI’s 1.2B model showed the same real-world efficiency tradeoffs; this isn’t a fluke but a systematic approach to efficiency.

Closed Model Vendors: When open-weight models match cloud APIs on quality while offering privacy and control, the value proposition of closed models weakens. Liquid AI’s previous LFM2.5 family demonstrated this trend on consumer hardware; LFM2-24B accelerates it.

Getting Started: The Practical Path

Weights are on Hugging Face with 10 GGUF quantization variants. The model works with:
– llama.cpp (for CPU inference)
– vLLM and SGLang (for GPU serving)
– MLX (for Apple Silicon)
– LM Studio (for desktop UI)
– Unsloth and TRL with LoRA/DPO/GRPO (for fine-tuning)

The setup is straightforward: download the Q4_K_M GGUF, fire up llama.cpp, and you’re generating at 112 tok/s on supported AMD hardware. For NVIDIA users, vLLM provides server-class throughput.
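
That flow fits in a couple of commands. A hedged sketch; the Hugging Face repo id and GGUF filename below are illustrative, so check the actual model card for exact names:

```shell
# Fetch a Q4_K_M GGUF from Hugging Face (repo id illustrative; see the model card).
huggingface-cli download LiquidAI/LFM2-24B-A2B-GGUF \
  --include "*Q4_K_M*.gguf" --local-dir ./models

# Run it with llama.cpp: -m model path, -c context size, -n tokens to generate,
# -t CPU threads (tune to your physical core count).
llama-cli -m ./models/LFM2-24B-A2B-Q4_K_M.gguf \
  -c 8192 -n 256 -t 16 \
  -p "Summarize the tradeoffs of sparse MoE models."
```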

The Verdict: Efficiency as a Feature, Not a Compromise

LFM2-24B represents a philosophical shift in AI development. While the industry chases scale at any cost, Liquid AI asks: “What if we built models that respect computational constraints?” The answer is a model that runs where you work, not where you’re billed.

The 32GB RAM requirement is a real barrier, but a surmountable one. It’s not “runs on a Raspberry Pi” efficiency, but it’s achievable for serious developers without corporate infrastructure. Sparse, efficiency-focused designs like Qwen3.5-122B-A10B and Llama 4 Maverick show the market is waking up to this reality.

Whether convolution-dominant architectures can truly compete with pure Transformers at scale remains the open question. If LFM2.5-24B ships with competitive reasoning capabilities, Liquid AI will have proven that the Transformer is not the only viable path, and that consumer hardware can run serious models without an API call in sight.

For anyone interested in running open-source LLMs locally, the 32GB requirement puts this within reach of a well-equipped laptop or any modern desktop. The cloud-only era of AI may be ending not with a bang, but with a 112 tok/s whisper on your local CPU.

Liquid AI's LFM2-24B performance benchmarks