
800 Million Parameters, One Demonslayer: The Sub-1B Model Running Doom on Your Wrist

A few years ago, running a vision-language model on a smartwatch was science fiction. Running one that could interpret a 3D game environment, identify enemies, and make tactical decisions? That was laughable. Yet here we are, watching an 800-million-parameter model, small enough to fit in a fraction of a modern watch’s RAM, successfully navigate the demon-infested corridors of Doom.
This isn’t a demo from a well-funded AI lab with bespoke optimizations. It’s a hobbyist project using Qwen 3.5 0.8B, VizDoom, and a simple HTTP loop. The implications are simultaneously impressive and deeply uncomfortable for anyone invested in the “bigger is better” school of AI architecture.
The Setup: Grid Overlay and Gunfire
The mechanics are brutally simple. The system captures a screenshot from VizDoom, overlays a numbered grid on the image, and feeds it to the model with access to two tools: shoot and move. The model analyzes the visual scene, selects a grid coordinate, and executes. Rinse and repeat every ten seconds.
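The grid mechanic reduces aiming to picking a number. A minimal sketch of that cell-to-pixel mapping (the row-major numbering, grid dimensions, and center-of-cell targeting are assumptions; the project's actual layout may differ):

```python
# Map the model's chosen grid cell to a screen pixel, and back.
# Cells are numbered row-major: left to right, top to bottom (assumed).

def cell_to_pixel(cell, rows=6, cols=8, width=640, height=480):
    """Return the (x, y) pixel at the center of a numbered grid cell."""
    row, col = divmod(cell, cols)
    cell_w, cell_h = width / cols, height / rows
    return (int(col * cell_w + cell_w / 2), int(row * cell_h + cell_h / 2))

def pixel_to_cell(x, y, rows=6, cols=8, width=640, height=480):
    """Return the grid cell containing a screen pixel."""
    col = min(int(x / (width / cols)), cols - 1)
    row = min(int(y / (height / rows)), rows - 1)
    return row * cols + col
```

A shoot command then becomes "turn until cell N sits under the crosshair, then fire" — the model never needs pixel-level aim, only a cell index.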
Ten seconds per action sounds glacial if you’re used to cloud-based inference, but consider the constraints. Running on an M-series Mac through LM Studio, this sub-1B model is processing visual input, reasoning about spatial relationships, and generating tool calls locally. No API keys, no data center, no 70B-parameter behemoth. Just under a billion parameters doing the work that previously required an order of magnitude more compute.
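Because LM Studio exposes an OpenAI-compatible chat endpoint on localhost, the whole loop amounts to one POST per turn. A hedged sketch of the request body (the model name, tool parameters, and prompt wording are assumptions, not the project's actual code):

```python
# Tool definitions in the OpenAI function-calling schema: the model may
# only respond by shooting at a grid cell or moving in a direction.
TOOLS = [
    {"type": "function", "function": {
        "name": "shoot",
        "parameters": {"type": "object",
                       "properties": {"cell": {"type": "integer"}},
                       "required": ["cell"]}}},
    {"type": "function", "function": {
        "name": "move",
        "parameters": {"type": "object",
                       "properties": {"direction": {
                           "type": "string",
                           "enum": ["forward", "back", "left", "right"]}},
                       "required": ["direction"]}}},
]

def build_request(screenshot_b64, model="qwen3.5-0.8b"):
    """Build one chat-completion request carrying the gridded screenshot."""
    return {
        "model": model,  # whatever identifier LM Studio loaded the model under
        "tools": TOOLS,
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": "Pick a grid cell to shoot, or move."},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{screenshot_b64}"}},
        ]}],
    }

# The loop POSTs this to http://localhost:1234/v1/chat/completions every
# ten seconds and executes whichever tool call comes back.
```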
The model demonstrates genuine spatial awareness. In basic scenarios, it identifies enemies, selects the correct grid column, and successfully eliminates targets. It processes visual information, recognizing the distinctive sprite of an imp or fireball, and maps that to actionable coordinates. For a model that weighs in at roughly 0.6GB of RAM in Q4 quantization, that’s a remarkable display of embodied intelligence.

The Glitches: Ammo Conservation and Existential Doubt
Of course, it’s not perfect. The model exhibits the charming limitations of edge-grade reasoning. In “defend_the_center” scenarios, it successfully hits enemies but fails catastrophically at resource management, emptying its magazine without restraint and then attempting to shoot dry air. It’s a stark reminder that while sub-1B models can handle perception and immediate action, long-term planning and state tracking remain challenging.
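The dry-firing failure is fixable outside the model: VizDoom exposes ammunition as a game variable, so a thin guard can veto shots the model shouldn't take. A sketch of that idea (the fallback action and the specific ammo variable are assumptions):

```python
class AmmoGate:
    """Veto shoot actions when the tracked ammo count hits zero."""

    def __init__(self):
        self.ammo = 0

    def update(self, ammo):
        # Called each tick with a value read from the engine, e.g.
        # game.get_game_variable(GameVariable.AMMO2) in VizDoom.
        self.ammo = ammo

    def filter(self, action):
        # Downgrade an impossible shot to a move rather than firing dry.
        if action["name"] == "shoot" and self.ammo <= 0:
            return {"name": "move", "direction": "forward"}
        return action
```

Offloading state tracking to ten lines of scaffolding like this, rather than asking 800 million parameters to remember a magazine count, is exactly the kind of division of labor edge agents seem to need.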
More interesting are the hallucinations, or perhaps philosophical moments, where the model outputs observations like “I see a fireball but I’m not sure if it’s an enemy.” This isn’t a failure mode; it’s an artifact of uncertainty quantification in tiny parameter spaces. The model knows it doesn’t know, which is arguably more sophisticated than confidently hallucinating a threat.
Developers are already iterating on these limitations. Adding a “reason” field to tool calls forces the model to articulate its visual perception before acting, creating a chain-of-thought mechanism that should improve ammunition discipline and reduce phantom target acquisition.
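Concretely, that change is one extra required field in the tool schema, forcing the model to describe what it sees before it acts. A sketch of the amended shoot tool (field names and descriptions are assumptions):

```python
# Shoot tool with a mandatory "reason" field: the model must articulate
# its visual perception before it is allowed to pull the trigger.
SHOOT_TOOL = {
    "type": "function",
    "function": {
        "name": "shoot",
        "parameters": {
            "type": "object",
            "properties": {
                "reason": {"type": "string",
                           "description": "What you see and why it is a target."},
                "cell": {"type": "integer",
                         "description": "Grid cell to aim at."},
            },
            "required": ["reason", "cell"],
        },
    },
}
```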
The Sub-10B Landscape: Speed vs. Brains
This Doom-playing experiment isn’t an isolated stunt. Recent benchmarks of small language models under 10B parameters reveal a field undergoing rapid capability compression. The Qwen 3.5 series spans 0.8B to 9B, all sharing the same Gated DeltaNet hybrid architecture and native multimodal support. The 9B variant scores 82.5 on MMLU-Pro, numbers that would have topped open-source leaderboards six months ago, while the 0.8B variant you’re strapping to your wrist manages genuine visual reasoning.
The trade-offs are fascinating. One developer is planning a Doom deathmatch pitting the small-but-fast models (0.8B and 2B) against their smarter-but-slower siblings (4B and 9B). The hypothesis, that raw inference speed might outperform deeper reasoning in real-time combat scenarios, cuts to the heart of edge AI design philosophy. Do you want the model that thinks harder or the one that shoots faster?
Google’s Gemma 3 4B IT posts an 89.2% on GSM8K and 71.3% on HumanEval, while Microsoft’s Phi-4-mini (3.8B) dominates ARC-C reasoning at 83.7%. These aren’t toy models; they’re production-grade reasoning engines that happen to fit in your pocket. When you’re comparing memory constraints and quantization trade-offs for local AI inference, the math increasingly favors these compressed architectures.
The Agentic Reality Check
There’s a broader lesson here about the state of agentic AI. While enterprise vendors promise autonomous agents that will revolutionize workflows, the reality on the ground is messier. Bridging the gap between agentic AI promises and production reality requires acknowledging that current agents struggle with basic state management, like remembering how many bullets are left in a virtual magazine.
This experiment serves as a counterpoint to the overconfidence we’ve seen in the automation space. Recent high-profile attempts at mass agent deployment have demonstrated what happens when executive ambition outruns the technical feasibility of agent automation: systems that work beautifully in demos and fail catastrophically in production. A sub-1B model that admits uncertainty about fireballs is arguably more trustworthy than a 400B-parameter cloud model that confidently fabricates entire workflows.
What Actually Fits on Your Hardware
The deployment implications are immediate. A year ago, models under 3B parameters were barely useful for autocomplete. Today, Qwen 3.5 0.8B handles multimodal inputs on roughly 0.6GB of RAM. Gemma 3 1B runs at over 2,500 tokens per second on mobile GPUs while scoring 62.8% on grade-school math. These aren’t edge cases; they’re viable production alternatives for privacy-sensitive applications.
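That 0.6GB figure is roughly what Q4 quantization predicts. A back-of-the-envelope check (the ~4.5 effective bits per weight and the fixed overhead allowance are assumptions about a typical Q4_K-style scheme, not measured values):

```python
def q4_footprint_gb(n_params, bits_per_weight=4.5, overhead_gb=0.15):
    """Estimate resident RAM for a quantized model: weight storage plus a
    rough allowance for KV cache, activations, and runtime buffers."""
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb

# 800M parameters at ~4.5 bits/weight is ~0.45GB of weights, landing
# near the observed 0.6GB once runtime overhead is included.
estimate = q4_footprint_gb(800e6)
```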
For phones with 4-8GB of RAM, the options have expanded dramatically. Qualcomm’s Snapdragon 8 Gen 5 NPU delivers 46% faster inference than the previous generation, and Meta’s ExecuTorch now supports 12+ hardware backends. When comparing CPU-only inference limits for sub-100M parameter voice models to these new sub-1B multimodal agents, the gap between specialized and general intelligence on the edge is narrowing fast.
The trajectory is clear: we’re approaching a threshold where the device in your pocket, or on your wrist, can run sophisticated visual agents without phoning home. The Doom demo is silly, but it’s also a proof of concept. If 800 million parameters can navigate a 3D environment and make tactical decisions, what happens when that same architecture is pointed at your calendar, your email, or your home automation system?
The answer is probably “it will try to shoot your empty inbox and then apologize for the uncertainty”, but that’s still further along than we were last year.




