A new round of painstaking, multi-day benchmarks pits four very different approaches to local AI inference against each other in the same room with the same power and cooling. The results confirm what engineering intuition suggested but marketing departments hoped you would ignore: memory bandwidth is the single best predictor of token generation speed. An NVIDIA RTX PRO 6000 with roughly 1,800 GB/s of VRAM bandwidth dominates when models fit, but Apple’s M5 Max quietly turns 614 GB/s of unified memory into a competitive weapon for large-context workloads. Meanwhile, the DGX Spark and AMD’s Strix Halo both limp along with roughly 256 GB/s, separated more by price and ecosystem than by performance. If you are ecosystem agnostic and running dense models locally, the maxed-out M5 Max is shockingly legitimate, and the DGX Spark has some explaining to do.
The Methodology: One Room, One Test Harness, Zero PR Fluff
Independent hardware tester Signal_Ad657 locked four machines in a room with adequate power and cooling for three days and ran standardized inference loads in parallel. The test matrix covered both a dense Qwen3.6 27B Q8 model and a Mixture-of-Experts (MoE) 35B-A3B variant, using canonical backends for each platform: llama.cpp Metal for the Mac, llama.cpp CUDA for the RTX 6000 tower and DGX Spark, and llama.cpp Vulkan for the Strix Halo EVO X2. Raw CSV traces, thermal logs, and power draw data were published to GitHub, making this one of the rare hardware comparisons built on open evidence rather than vendor slide decks.
Memory Bandwidth: The Hidden Governor
If you take one thing from this entire comparison, make it this: token generation speed is mostly a function of memory bandwidth, not TFLOPS. The measured bandwidth hierarchy across the four contenders breaks down roughly as follows:
- RTX PRO 6000 Blackwell: ~1,800 GB/s (VRAM)
- Apple M5 Max: ~614 GB/s (unified memory)
- NVIDIA DGX Spark: ~256 GB/s (unified LPDDR5X)
- AMD Strix Halo (EVO X2): ~256 GB/s (LPDDR5X)
Tokens per second per device tracked this curve predictably. Where it gets interesting is what happens when the model size collides with the memory architecture.
RTX PRO 6000 Blackwell: The Performance King with a Cliff
Discrete GPUs live and die by the VRAM wall. The RTX PRO 6000 offers blistering throughput with its 1,800 GB/s frame buffer, and when the model weights plus KV cache fit entirely inside GPU memory, it outperforms everything else in the room. It is also the only platform in this test that can genuinely train models or serve high-batch workloads. Independent testers running vLLM on a consumer-grade 4090 noted it could handle roughly 50 parallel requests before memory pressure became an issue, something the unified-memory appliances simply cannot match today.
The catch is the cliff. Once the model and its context overflow VRAM, the system falls back to main system memory across PCIe, and performance tanks by an order of magnitude. For larger models, think 70B Q8 with a 32K context window, that cliff is real, and the M5 Max starts looking unexpectedly competitive simply because it has no memory tiering bottleneck.
Apple M5 Max: A Laptop That Ate a Server
Apple’s silicon advantage has always been integration, but the M5 Max pushes it into server territory. With up to 128 GB of unified memory and 614 GB/s of bandwidth shared between CPU and GPU, the chip can load a Llama 3.3 70B Q8 model entirely in memory without touching PCIe. Notebookcheck testing confirms the M5 Max 40-core GPU sits in the same performance class as NVIDIA’s mobile RTX 5070 in raw compute, but for LLM inference, the unified memory pool is the killer feature.
Real-world MLX benchmarks on the M5 Max show why: Llama 3.3 70B hits roughly 28 tokens per second at Q4 quantization and around 16 tok/s at Q8. Neural accelerators embedded in every GPU core also deliver roughly 4x faster prompt processing versus the previous M4 Max, shrinking the time to first token on large context windows from half a minute to under ten seconds. The long-term cost analysis is equally aggressive, amortized over three years, a $4,499 M5 Max configuration breaks even against GPT-4o API pricing at roughly 13 million output tokens per month.
But the marketing myth of a “quiet Mac” dies under sustained AI load. During extended inference, the chassis stabilized around 80°C, and the fans spun up to what testers described as full gaming-laptop scream. It is built like an aircraft carrier, performs like one, and you will absolutely know it is working.

DGX Spark: Marketing PFLOPS vs. Memory Reality
NVIDIA’s DGX Spark arrived with gold aesthetics and a headline promising 1 PFLOPS of AI performance. The reality is more modest. Memory bandwidth is capped at approximately 256 GB/s, identical to the Strix Halo, and independent analysis places its raw compute in the ballpark of an RTX PRO 5000, not the 5090 its marketing vaguely implies. The DGX Spark real-world value proposition unravels when you realize a maxed-out M5 Max offers more than double the memory bandwidth for single-user inference, and the M5 Max’s 128 GB unified pool allows it to physically host larger models without offloading layers.
The DGX Spark is not without merit. Its ConnectX-7 NIC ports enable tensor parallelism across multiple units, which can boost effective processing bandwidth by roughly 1.8x and scales unified memory to 256 GB in a cluster. That makes the platform more future-proof than a single appliance. For now, though, an official unit runs $4,700, while the ASUS Ascent GX10 variant hovers closer to $3,500. At either price, NVIDIA’s marketing claims don’t survive contact with reality when compared to the M5 Max’s single-machine throughput on dense 27B to 70B models.
AMD Strix Halo: Bandwidth Starved, Budget Blessed
AMD’s Strix Halo is the people’s champion on paper: up to 192 GB of addressable memory, an open ecosystem, and system prices that undercut the competition. Unfortunately, the LPDDR5X interface bottlenecks the entire affair to roughly 256 GB/s of bandwidth. Independent testing showed that even small dense models at extended context lengths can expose the limitation, with 32K context prefill alone pushing the EVO X2 rig close to thermal timeout ceilings.
The platform does not beat either the RTX 6000 or the M5 Max for raw inference performance at any tested model size. What it offers is moderate performance on large models at a tight budget, without breaking the bank on hardware or power bills. It also occupies a unique niche: if you want massive memory capacity for ramdisk-style model storage or RAG pipelines and can tolerate slower token generation, the Strix Halo is arguably the cheapest way to get there. The memory bandwidth trap remains its Achilles’ heel, but for cost-conscious hobbyists, the trade-off is deliberate, not accidental.

When the Cliff Hits: Model Size vs. VRAM
The most counterintuitive finding from the head-to-head is how the performance hierarchy inverts as models grow. The RTX PRO 6000 dominates on 27B and smaller models that fit comfortably in its VRAM, but the moment weights or KV cache spill past the frame buffer, the machine drops to system memory bandwidth and falls behind the M5 Max. Because the MacBook has no PCIe gulf between system and GPU memory, its 614 GB/s stays constant regardless of model size. That architectural cheat code means a 70B Q8 model with a long context window can actually infer faster on the M5 Max than on a discrete GPU forced to shuffle layers across the bus.
Software Stacks: The Battle Behind the Benchmarks
Hardware is only half the story, backend maturity determines what you can actually run day-to-day.
On the M5 Max, MLX remains the fastest framework for Apple Silicon, leveraging neural accelerators to deliver up to 80% higher throughput than Ollama or raw llama.cpp. Ollama provides the easiest setup and an OpenAI-compatible API endpoint, while LM Studio now ships with an MLX backend that pairs graphical management with near-native speed.


The DGX Spark enjoys a clean CUDA aarch64 path, and the RTX 6000 tower leverages mature CUDA Q8 pipelines, though developers should note that native llama.cpp Q8 MoE hits SOFT_MAX errors on sm_120 Blackwell chips, forcing a fallback to vLLM FP8 for some model rows. Meanwhile, Strix Halo remains stuck in partial Vulkan support with a ROCm retry still pending, a backend gap that effectively shaves real-world performance below what the silicon alone might suggest.
Thermals, Power, and the Noise Tax
Sustained inference is a thermal marathon. The EVO X2 housing the Strix Halo struggled with extended runs, hitting a 300-second timeout wall during 32K context generation tests. The M5 Max MacBook Pro surprised testers by holding steady at roughly 80°C under days-long load, though the acoustic trade-off was significant. The RTX 6000 tower, tested in a full power-cap sweep, showed that anything above 500W is wasted on LLM inference, while diffusion workloads genuinely scale up to 600W. For anyone building a local rig, cooling and power challenges are often the hidden costs that blow past the price of the silicon itself.
The Parallelism Trap: Single User vs. Small Server
One sharp distinction emerged around concurrency. The M5 Max and Strix Halo are effectively single-user inference appliances. You can run one model and serve one request stream beautifully, but batching multiple users or background agentic harnesses is not their strength. The RTX 6000, by contrast, can saturate its VRAM with batched requests through vLLM, serving dozens of concurrent coding assistants or chatbots. DGX Spark units can cluster for distributed training, something no Apple Silicon laptop can presently match. If your definition of “local AI” is “a personal research assistant”, the M5 Max is brilliant. If your definition is “a replacement for an API backend”, you still need the discrete GPU server.
Price-to-Performance Economics
As of testing, the 128 GB M5 Max configuration sits around $5,500. The official DGX Spark lands near $4,700, with third-party ASUS variants closer to $3,500. The Strix Halo undercuts both. The RTX PRO 6000 is in a different pricing tier entirely, aimed at professional workstations.
For pure single-user LLM inference, the M5 Max’s bandwidth advantage over the DGX Spark translates to faster responses for any model that exceeds the Spark’s effective working set. That makes the Apple machine a surprisingly rational purchase for developers who need to run 70B-class models locally without building a water-cooled tower. Conversely, AMD’s alternative to the DGX Spark offers a broader view of the budget landscape: if absolute dollars matter more than tokens-per-second, the Strix Halo is the pragmatic underdog.
Final Dispatch
The winner depends entirely on the shape of your workload. Need to train or serve a team concurrently? Buy the RTX PRO 6000 and stop pretending a laptop can replace a server. Need the biggest possible model running silently(ish) on a desk with zero configuration drama? The M5 Max is currently the most capable local-inference laptop on the market, full stop. Operating on a tight budget but need 100B+ parameter access? The Strix Halo gives you the memory capacity, just don’t expect it to feel fast. And if you simply want to pay a premium for a gold box that underperforms a MacBook on memory bandwidth, the DGX Spark is waiting. Choose accordingly.




