768GB 10x GPU Mobile AI Rig: Redefining Local LLM and Generative AI Workstations

A custom-built, fully enclosed, portable 10-GPU AI workstation featuring a high-core-count Threadripper Pro, 768GB of combined system and GPU memory, dual power supplies, and mixed 3090/5090 GPUs, built for running large MoE models and high-detail generative tasks locally.

by Andre Banandre

Someone just strapped 10 GPUs, 768GB of combined memory, and dual power supplies into a Thermaltake W200 case and called it “portable.” The result? A $17,000 local AI workstation that runs DeepSeek-V3.1 at 24.92 tokens per second without hitting a single cloud API. This isn’t another mining rig repurposed for AI; it’s a deliberate engineering statement about where AI infrastructure is headed.

The build, documented in recent hardware forums, centers on a Threadripper Pro 3995WX with 512GB DDR4 (the “768GB” in our title includes the combined system RAM and GPU memory, a crucial distinction for MoE workloads). Eight RTX 3090s and two RTX 5090s provide 256GB of GDDR6X/GDDR7, all shoehorned into a dual-system enclosure that was never meant to hold this much silicon. The creator’s goal was simple: run massive Mixture-of-Experts models like DeepSeek-V3.1 and Kimi K2 locally, support a graphic designer’s video generation pipeline, and keep the whole thing enclosed enough to survive an encounter with curious cats.

The Engineering Reality Check

Let’s address the obvious: this build shouldn’t work. The thermal envelope alone (eight 3090s at 350W each plus two 5090s at 550W) pushes 3,900W at full bore. That’s before you factor in the Threadripper’s 280W TDP. The solution? Aggressive power limiting (3090s capped at 200-250W, 5090s at 500W) and a fan strategy that would make a data center engineer weep: 12x 140mm fans creating airflow patterns that somehow keep GPU temps in operational range.
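For a sense of why the fan wall and power caps matter, here is the back-of-the-envelope budget behind those numbers, using only the wattages quoted above (nominal TDPs, not measured wall draw):

```python
# Back-of-the-envelope power budget for the build described above.
# Wattages are the figures quoted in the article; everything is nominal TDP,
# not measured wall draw.
gpus_stock  = {"RTX 3090": (8, 350), "RTX 5090": (2, 550)}   # (count, watts each)
gpus_capped = {"RTX 3090": (8, 250), "RTX 5090": (2, 500)}   # the build's reported caps
cpu_tdp = 280                                                # Threadripper Pro 3995WX

def total_watts(gpus, cpu):
    return cpu + sum(count * watts for count, watts in gpus.values())

print("Stock limits:  ", total_watts(gpus_stock, cpu_tdp), "W")   # 4180 W
print("Power-limited: ", total_watts(gpus_capped, cpu_tdp), "W")  # 3280 W
```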

The enclosure choice is what makes this build genuinely interesting. The Thermaltake W200, designed for dual-system setups, gets hacked by mounting the motherboard upside-down in its secondary compartment. This creates a perfect orientation for PCIe risers to connect GPUs mounted in the main compartment. It’s the kind of geometric workaround that feels obvious only after someone else has done it. The result is dense, so dense that accessing the motherboard requires removing GPUs first, but it eliminates the jank of mining frames zip-tied to rolling racks.

Power delivery splits across two PSUs: an EVGA 1600W and an ASRock 1300W. This isn’t just about wattage; it’s about circuit protection. As hardware forums note, running this on a single 240V 15A breaker would trip it after a few minutes: 15A at 240V is 3,600W on paper, continuous loads are conventionally derated to 80% of that (2,880W), and this rig’s sustained wall draw sits well above that line once PSU losses are counted. The dual-PSU approach mirrors what Prism, the distributed MoE inference framework, does at the software level: divide and conquer.

Benchmarks That Actually Matter

The performance numbers reveal why someone would bother with this instead of just spinning up cloud instances:

| Model | GPU Offload | Tokens | Time to First Token | Tokens/sec |
| --- | --- | --- | --- | --- |
| DeepSeek V3.1 Terminus Q2XXS | 100% | 2338 | 1.38s | 24.92 |
| GLM 4.6 Q4KXL | 100% | 4096 | 0.76s | 26.61 |
| Kimi K2 TQ1 | 87% | 1664 | 2.59s | 19.61 |
| Hermes 4 405b Q3KXL | 100% | not recorded | 1.13s | 3.52 |
| Qwen 235b Q6KXL | 100% | 3081 | 0.42s | 31.54 |

These aren’t synthetic benchmarks. The DeepSeek-V3.1 result, 24.92 tokens per second with 100% GPU offload, means you can have a conversation with a 671B-parameter MoE model without the latency hit of cloud inference. The Kimi K2 numbers show the cost of partial offloading: 87% GPU offload drops you to 19.61 tokens/sec, but that’s still viable for interactive use.

What makes these numbers controversial is the cost comparison. A single hour of 8x A100 on a major cloud provider runs roughly $32. This rig pays for itself after 531 hours of equivalent compute. For a research lab or a small AI studio running experiments daily, that’s a two-month ROI. The creator explicitly avoided going all-in on 5090s or AMD Instinct MI600 Pros because the diminishing returns don’t justify the 3-4x price premium, not when you’re power-limiting anyway.
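The break-even claim is easy to sanity-check; a minimal version of the math, treating the quoted cloud rate and a daily usage pattern as assumptions:

```python
# Break-even estimate using the article's figures; the ~$32/hour 8x A100 rate
# and the 8-hours-a-day usage pattern are assumptions, not measurements.
build_cost = 17_000   # USD
cloud_rate = 32       # USD per hour of roughly equivalent cloud compute

breakeven_hours = build_cost / cloud_rate
print(f"Break-even after ~{breakeven_hours:.0f} hours")       # ~531 hours
print(f"At 8 hours/day: ~{breakeven_hours / 8:.0f} days")     # ~66 days, about two months
```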

The Distributed Inference Problem Nobody Talks About

Here’s where the Reddit build intersects with cutting-edge research. Running MoE models across heterogeneous GPUs isn’t just about memory capacity; it’s about expert placement strategy. The Prism framework, detailed in recent papers on edge inference, tackles exactly this problem.

MoE models like DeepSeek-V3.1 have sparse activation patterns: each token hits only a subset of experts. Prism’s key insight is that expert activation patterns are both task-dependent and layer-varying. Arithmetic reasoning might hammer Expert 0 in layer 0, while ASCII recognition prefers Expert 3. This creates a placement optimization problem that cloud-based uniform distribution strategies completely miss.

Prism’s algorithm works in two stages (a code sketch follows the list):

  1. Layer-wise expert count allocation: Uses Shannon entropy of activation frequencies to determine how many experts each server (or GPU, in our workstation’s case) should host per layer. Higher entropy = more uniform distribution = more experts needed locally.

  2. Expert-to-server assignment: A greedy algorithm that selects top-N experts per server based on historical activation patterns, with theoretical guarantees of (1-1/e) approximation optimality.
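A compact sketch of those two stages, assuming you have a per-layer histogram of expert activations to work from; the counts, budgets, and function names below are illustrative, not Prism’s implementation:

```python
import math
from collections import Counter

# Illustrative sketch of the two-stage placement idea described above.
# The activation histogram, budgets, and function names are hypothetical;
# this is not Prism's actual code.

def layer_entropy(activation_counts):
    """Shannon entropy of one layer's expert-activation frequencies."""
    total = sum(activation_counts.values())
    probs = [c / total for c in activation_counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def experts_to_host(activation_counts, max_experts):
    """Stage 1: higher entropy (more uniform usage) -> host more experts locally."""
    h_max = math.log2(len(activation_counts))        # entropy of a uniform distribution
    ratio = layer_entropy(activation_counts) / h_max
    return max(1, round(max_experts * ratio))

def assign_experts(activation_counts, budget):
    """Stage 2: greedily keep the top-N most frequently activated experts."""
    return [eid for eid, _ in Counter(activation_counts).most_common(budget)]

# Hypothetical histogram for one MoE layer: expert_id -> activation count.
layer0 = {0: 900, 1: 40, 2: 35, 3: 420, 4: 15, 5: 10, 6: 300, 7: 80}
budget = experts_to_host(layer0, max_experts=4)
print(budget, assign_experts(layer0, budget))        # 3 [0, 3, 6]
```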

The Reddit build implements this intuition manually: the two 5090s, with their larger VRAM and faster compute, likely host the most frequently accessed experts for video generation tasks, while the 3090s handle the long tail. The power-limiting strategy (capping 3090s at 200-250W) isn’t just thermal management; it’s a crude form of load balancing that prevents any single GPU from becoming a bottleneck.

When Local Beats Cloud (And When It Doesn’t)

The portability claim is simultaneously absurd and practical. The W200 case has handles and rolls. You can move it between rooms. You cannot fly with it. But for a small studio or research group, this “portability” means you can relocate your AI infrastructure without re-architecting your entire pipeline.

This challenges the dominant narrative that AI infrastructure must centralize in the cloud. The Prism paper explicitly addresses this: edge deployments avoid cloud latency, data privacy concerns, and the operational complexity of multi-tenant GPU clusters. But they introduce new problems: heterogeneous hardware, limited bandwidth between nodes, and dynamic workload adaptation.

The 10-GPU rig is essentially a single-node edge cluster. The dual-PSU setup mirrors the redundant power in data centers. The 12-fan cooling system replicates row-level airflow. You’re compressing a rack’s worth of infrastructure into a box that fits under a desk.

Where this breaks down is scale. The build’s $17k cost doesn’t include the time spent tuning: weeks of PCIe bifurcation debugging, power-curve optimization, and thermal profiling. Cloud instances are plug-and-play. This rig is a science project that happens to produce 25 tokens/sec.

The Software Stack Reality

Running Ubuntu, the creator uses standard inference frameworks with one critical tweak: power limiting via nvidia-smi. The benchmarks show MoE-specific quantization (Q2XXS, Q4KXL) that reduces memory footprint while maintaining quality. This is where DeepSpeed’s MoE tutorials become relevant: they show how to configure expert parallelism across GPUs with an ep_size parameter that controls expert distribution.
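The power-limiting piece is a one-line command per GPU; a minimal sketch of applying the caps described earlier, with the GPU index mapping left as an assumption:

```python
import subprocess

# Sketch of the power-limiting tweak mentioned above, using nvidia-smi's
# standard -i (GPU index) and -pl (power limit, in watts) flags.
# The mapping of indices 0-7 to the 3090s and 8-9 to the 5090s is an
# assumption; check your topology with nvidia-smi -L. Needs root privileges.
CAPS = {**{i: 250 for i in range(8)},    # RTX 3090s capped at 250 W
        **{i: 500 for i in (8, 9)}}      # RTX 5090s capped at 500 W

for gpu_index, watts in CAPS.items():
    subprocess.run(["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)],
                   check=True)
```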

The DeepSpeed MoE API now accepts ep_size directly, letting you define expert parallelism per layer. For a 10-GPU system, you might set ep_size=2 for early layers (less critical) and ep_size=1 for later layers (more frequent access), manually implementing Prism’s entropy-based allocation heuristic.
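A hedged sketch of what that per-layer split could look like with DeepSpeed’s MoE wrapper; the hidden size, expert count, top-k, and the early/late layer split are placeholders rather than anything from the build:

```python
import torch
import deepspeed
from deepspeed.moe.layer import MoE

# Hedged sketch of the per-layer ep_size idea above, using DeepSpeed's MoE
# wrapper. Hidden size, expert count, top-k, and the early/late split are
# illustrative placeholders, not the build's configuration. Run under the
# deepspeed launcher so the expert-parallel process groups can be created.
deepspeed.init_distributed()

hidden = 4096
expert_ffn = torch.nn.Sequential(        # a toy expert MLP
    torch.nn.Linear(hidden, 4 * hidden),
    torch.nn.GELU(),
    torch.nn.Linear(4 * hidden, hidden),
)

def moe_layer(layer_idx, num_layers=48):
    # Early layers: shard experts across two GPUs (ep_size=2);
    # later, more frequently hit layers: keep experts local (ep_size=1).
    ep = 2 if layer_idx < num_layers // 2 else 1
    return MoE(hidden_size=hidden,
               expert=expert_ffn,
               num_experts=8,            # must be divisible by ep_size
               ep_size=ep,
               k=2)                      # top-2 routing
```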

The Hermes 4 405b result, 3.52 tokens/sec, shows the limit. The creator admits they “forgot to record” the token count because the response quality was underwhelming. This is the quantization tax: aggressive compression makes some models unusable, regardless of hardware. It’s a reminder that local inference isn’t just about VRAM; it’s about finding the right model precision for your compute budget.

Engineering Tradeoffs Worth Stealing

Even if you’ll never build this, three design decisions matter for any local AI setup:

1. Enclosure-first design: Starting with the W200’s form factor forced thermal and power constraints early. Most builds start with components and discover they don’t fit later.

2. Mixed-generation GPUs: The 3090/5090 split acknowledges that not all tensor operations need the latest Blackwell silicon. For MoE’s sparse matrix multiplies, Ampere’s efficiency at lower power is actually advantageous.

3. Power as a tuning knob: Capping GPUs at 60-70% TDP isn’t a compromise; it’s a strategy. The performance-per-watt sweet spot for 3090s sits around 200W, not their 350W default.

The Prism research validates this approach: their experiments show that heterogeneous GPU allocations (1,1,2 GPUs across three simulated servers) with bandwidth constraints of 500 Mbps achieve 30.6% lower latency than uniform distribution. Your workstation’s PCIe 4.0 lanes between GPUs are effectively a high-bandwidth edge network.

The Bottom Line

This build represents a fundamental shift in AI infrastructure thinking. Cloud providers want you to believe that only their hyper-optimized data centers can run frontier models. They’re wrong: for certain workloads, with certain tradeoffs, local infrastructure wins.

The $17k price tag is deceptive. It includes RAM now worth $5k+ that cost half that when purchased. It includes GPUs bought before the 5090 launch. Replicating it today might cost $22k. But against cloud pricing at scale, it still pays for itself in months, not years.

More importantly, it gives you control over your AI stack. No API rate limits. No unexpected pricing changes. No data leaving your premises. For a graphic designer generating video assets or a researcher iterating on MoE architectures, that autonomy translates directly to productivity.

The real innovation isn’t the hardware; it’s the recognition that AI inference is becoming a local problem again, just as it was in the early days of personal computing. The cloud abstracted away hardware complexity, but it introduced latency, cost, and control issues that this build directly addresses.

Your next AI workstation won’t look like this. It might have 4 GPUs instead of 10, or use unified memory instead of PCIe risers. But the philosophy (enclosed, portable, power-aware, and MoE-optimized) will define the next generation of AI development hardware. The future of AI isn’t just in the cloud. It’s rolling on casters between rooms, running Ubuntu, and generating 24.92 tokens per second right next to your desk.
