The AI community is frothing at the mouth over the latest 3D generation showdown, but here’s the dirty secret: both competitors are running on the same NVIDIA hardware, and neither can actually see what’s behind your subject. Apple just dropped SHARP, a model that promises photorealistic 3D Gaussian splatting from a single image in under a second. Not to be outdone, Microsoft fired back with TRELLIS.2, a 4-billion-parameter behemoth that generates full PBR-textured assets using a novel “O-Voxel” representation. The headlines scream innovation, but the reality is a masterclass in computational irony and architectural band-aids.
The Technical Cage Match: Gaussians vs. Voxels
At their core, these models represent fundamentally different bets on the future of 3D AI. Apple’s SHARP goes all-in on 3D Gaussian Splatting (3DGS), regressing Gaussian parameters in a single feedforward pass. The results are legitimately impressive: a 25-34% reduction in LPIPS and a 21-43% reduction in DISTS (both perceptual image-similarity metrics, lower is better) versus prior models, with synthesis time dropping by three orders of magnitude. On an M1 Max MacBook, you’re looking at 5-10 seconds per generation. The output is a metrically accurate representation with absolute scale, meaning you can move a virtual camera through the scene with physically correct parallax.
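If “regressing Gaussian parameters in a single feedforward pass” sounds abstract, here’s a deliberately tiny sketch of the pattern: a network whose output channels simply are per-pixel Gaussian parameters. To be clear, this is our toy illustration of the idea, not Apple’s actual architecture.

```python
import torch
import torch.nn as nn

# Minimal stand-in for the single-pass idea behind SHARP: the network's
# output channels *are* the Gaussian parameters. Our illustration, not Apple's.
class GaussianHead(nn.Module):
    def __init__(self, in_ch: int = 3, feat: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        # Per pixel: 3 (xyz) + 3 (scale) + 4 (rotation) + 1 (opacity) + 3 (color) = 14
        self.head = nn.Conv2d(feat, 14, 1)

    def forward(self, img: torch.Tensor) -> dict:
        params = self.head(self.backbone(img))                # (B, 14, H, W)
        xyz, scale, quat, alpha, rgb = params.split([3, 3, 4, 1, 3], dim=1)
        return {
            "xyz": xyz,                                       # metric positions
            "scale": scale.exp(),                             # enforce positivity
            "rot": nn.functional.normalize(quat, dim=1),      # unit quaternions
            "opacity": alpha.sigmoid(),
            "rgb": rgb.sigmoid(),
        }

# One feedforward pass, one Gaussian per pixel, no per-scene optimization.
gaussians = GaussianHead()(torch.rand(1, 3, 256, 256))
print({k: tuple(v.shape) for k, v in gaussians.items()})
```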
Microsoft’s TRELLIS.2, meanwhile, leverages a flow-matching transformer paired with a sparse voxel-based VAE. Its “O-Voxel” (omni-voxel) structure encodes both geometry and appearance in a field-free representation that handles arbitrary topology without the iso-surface constraints that plague SDF-based methods. The model compresses a 1024³ asset into just ~9.6K latent tokens via 16× spatial downsampling, enabling generation at resolutions up to 1536³. On an NVIDIA H100, TRELLIS.2 cranks out a 512³ asset in ~3 seconds, 1024³ in ~17 seconds, and 1536³ in ~60 seconds.
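Those compression numbers check out on the back of an envelope, and the arithmetic is revealing: the 16× downsampling alone still leaves a dense 64³ latent grid, so it’s the sparsity that does the heavy lifting. A quick sanity check, using only the figures quoted above:

```python
# Sanity-checking the compression claim with Microsoft's reported figures.
res, down = 1024, 16
latent_res = res // down                 # 64 voxels per axis after downsampling
dense_tokens = latent_res ** 3           # 262,144 tokens if the grid were dense
active_tokens = 9_600                    # ~9.6K latent tokens, as reported
print(f"occupancy: {active_tokens / dense_tokens:.1%}")   # ~3.7% of the grid
```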
But here’s where the marketing falls apart: both models are single-image systems. They don’t reconstruct 3D scenes; they hallucinate them. If you photograph a car from the front, neither model can magically reveal the license plate number on the back. They’re sophisticated pattern-completion engines trained on datasets that may include enough car rear-views to make educated guesses, but educated guesses are still guesses.
The CUDA Paradox: When Apple’s Model Requires NVIDIA
The most delicious irony in this entire saga is that Apple’s SHARP requires an NVIDIA GPU for its full feature set. The model itself runs fine on Apple Silicon: generating the 3D Gaussian representation works on CPU, CUDA, or MPS. But if you want to render those slick demo videos? CUDA only, baby. The rendering pipeline uses gsplat, which demands NVIDIA hardware, and Apple’s requirements.txt explicitly pins CUDA packages for x86-64 Linux environments.
This isn’t a minor footnote; it’s a strategic face-plant. Apple went out of its way to ensure the model fails gracefully on non-CUDA systems, but the message is clear: even Apple’s own researchers live in a CUDA-centric world. The community has already noticed, with developers pointing out that “a Mac, a non-NVIDIA, non-x64, non-Linux environment, was never a concern for them.” The model’s own documentation admits that rendering trajectories are “CUDA GPU only”, which means your shiny new M3 Ultra Mac Studio can’t generate the showcase videos that make SHARP look revolutionary.
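The split is easy to reproduce in any PyTorch project. Here’s what “fails gracefully” looks like in practice, using only standard PyTorch device checks (nothing SHARP-specific):

```python
import torch

def pick_device() -> torch.device:
    # Inference runs anywhere; fall back in the order CUDA -> MPS -> CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
can_render_video = device.type == "cuda"   # gsplat's rasterizer is CUDA-only
print(f"inference: {device} | trajectory rendering: {can_render_video}")
```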
TRELLIS.2 isn’t any better. Microsoft explicitly states the model is “currently tested only on Linux” and requires an NVIDIA GPU with at least 24GB of memory. The code has been verified on A100 and H100 GPUs, cards that cost more than a used car. Both giants are locked in a hardware monoculture that makes a mockery of their “open” branding.
The Hallucination Problem: Single-Image Generation Is a Gimmick
The Reddit communities have been brutal about the fundamental limitation of single-image models. As one developer put it: “It’s impossible to capture all aspects of a 3D object in a single shot. It has to hallucinate the other side and it’s always completely wrong.” This isn’t theoretical: users report that TRELLIS.2’s outputs are “decent, but nowhere near the example shown”, even with default settings.
The skepticism is warranted. When Microsoft shows off a complex mech model generated from one image, developers smell a rat: “The mech mesh seems suspiciously ‘accurate’… they’re picking an extremely ideal candidate to show off, rather than reflecting real-world results.” The suspicion is that training datasets are contaminated with canonical 3D models (miniature STLs scraped from the internet), which means the model isn’t generalizing; it’s memorizing and regurgitating.
Apple’s SHARP avoids this specific critique by focusing on view synthesis rather than explicit 3D reconstruction, but the core problem remains: you can’t defy information theory. A single 2D projection doesn’t contain enough data to uniquely determine a 3D structure. The models work when they’re interpolating within the distribution of shapes they saw during training, but they collapse when asked to extrapolate truly novel geometry.
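A crude counting argument makes the gap concrete. Every number below is an illustrative assumption, not a figure from either paper, but the conclusion is robust: most of any single-image output is prior, not observation.

```python
# All numbers are illustrative assumptions, not figures from either paper.
photo_values = 1024 * 1024 * 3            # one 1024x1024 RGB photo
surface_voxels = 6 * 1024 ** 2            # rough surface area of a 1024^3 grid
pbr_channels = 3 + 3 + 1 + 1              # albedo, normal, roughness, metallic
asset_values = surface_voxels * pbr_channels
print(f"photo supplies  {photo_values:>12,} values")
print(f"asset requires  {asset_values:>12,} values")
print(f"prior fills in >= {1 - photo_values / asset_values:.0%}")   # ~94%
```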
The Multi-Image Rebellion
The community is already demanding what should have been obvious: multi-image support. TRELLIS.2’s developers have promised that “training-free multi-image conditioning will be added later”, since the original TRELLIS had it. Apple hasn’t made similar promises, but the pressure is mounting.
The real innovation isn’t single-image generation; it’s using AI to augment traditional photogrammetry. Imagine feeding a model five or six casually snapped photos from your phone and getting a perfect 3D model. That’s achievable. But the current marketing hype around “single-image” generation is setting unrealistic expectations and leading to inevitable disappointment.
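What might that conditioning look like? One plausible approach, and we stress this is a hypothetical sketch rather than TRELLIS.2’s actual mechanism, is to reuse the model’s existing image encoder across views and pool the resulting tokens:

```python
import torch
from typing import Callable

# Hypothetical pooling scheme: run the single-view image encoder over N
# photos and concatenate the token sets into one conditioning sequence.
# A sketch of the *idea*, not TRELLIS.2's actual mechanism.
def multi_view_condition(
    encoder: Callable[[torch.Tensor], torch.Tensor],
    images: list,
) -> torch.Tensor:
    tokens = [encoder(img) for img in images]   # each: (n_tokens, dim)
    return torch.cat(tokens, dim=0)             # longer cross-attention context
```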
Benchmark Theater: When Speed Misleads
Both companies are touting speed benchmarks that are technically true but practically misleading. Apple’s “under a second” claim applies to the neural network pass only, not the full pipeline from input to rendered video. Microsoft’s “3 seconds for 512³” requires an H100 GPU that most developers will never touch.
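If you want numbers you can actually compare, time the whole pipeline yourself rather than trusting the headline figure. A minimal harness in plain Python, with no vendor code assumed:

```python
import time

def time_end_to_end(pipeline, *args, warmup: int = 1, runs: int = 5) -> float:
    """Average wall-clock seconds for the FULL pipeline (load -> infer -> export)."""
    for _ in range(warmup):
        pipeline(*args)                  # let caches, JIT, and allocators settle
    start = time.perf_counter()
    for _ in range(runs):
        pipeline(*args)
    return (time.perf_counter() - start) / runs
```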
More importantly, speed is irrelevant if the output is wrong. A hallucinated backside that looks plausible in a demo video but fails in production is worse than useless; it’s a liability. For applications like 3D printing, game asset creation, or AR/VR, geometric accuracy is non-negotiable. TRELLIS.2’s own documentation admits that generated meshes “may occasionally contain small holes or minor topological discontinuities”, requiring post-processing scripts for watertight geometry.
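In practice, that means budgeting a repair pass for every generated mesh. Here’s a minimal example using the open-source trimesh library; “asset.glb” is a placeholder path, and note that trimesh’s built-in repair only patches small holes:

```python
import trimesh

# "asset.glb" is a placeholder path; trimesh's repair only fills small
# boundary loops and is not a substitute for proper remeshing.
mesh = trimesh.load("asset.glb", force="mesh")
if not mesh.is_watertight:
    trimesh.repair.fix_normals(mesh)
    trimesh.repair.fill_holes(mesh)      # patches small holes where possible
print("watertight:", mesh.is_watertight)
mesh.export("asset_repaired.glb")
```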
The Ecosystem Lock-In Play
Make no mistake: this isn’t about democratizing 3D creation. It’s about ecosystem lock-in. Apple wants you generating assets that work seamlessly with Vision Pro and ARKit. Microsoft wants you building within the Azure/OpenAI universe. Both are releasing “open” models that require proprietary hardware and software stacks.
The MIT license on TRELLIS.2 and Apple’s academic licensing might seem generous, but they’re Trojan horses. The real cost isn’t financial; it’s the implicit requirement to buy into their entire infrastructure. You can download the models for free, but you’ll pay thousands for the hardware to run them effectively, and you’ll be locked into workflows that favor their platforms.
Who’s Actually Winning?
Right now? NVIDIA. The entire AI industry is building on CUDA, and these releases just cement that dominance. Apple and Microsoft are fighting for second place while Jensen Huang laughs all the way to the bank.
Between the two models, SHARP has the edge in pure view synthesis quality and speed on consumer hardware (if you own a high-end Mac). TRELLIS.2 wins for full asset generation with PBR materials and higher resolution, but only if you have data-center-grade GPUs.
The real winner will be whichever company first ships multi-image conditioning that works on consumer hardware. Until then, we’re watching a glorified parlor trick competition where both players are using the same rigged deck.
If you’re building 3D AI pipelines today:
- For rapid prototyping on Mac: SHARP is viable for generating 3DGS representations, but forget about rendering showcase videos without an NVIDIA box.
- For production asset generation: TRELLIS.2 is more capable but requires Linux and expensive GPUs. Budget $10k+ for a proper workstation.
- For accuracy: Neither. Use traditional photogrammetry or structured light scanning. These models are toys for now.
- For research: Both are goldmines. The architectures are genuinely innovative, even if the marketing is dishonest.
The controversy isn’t which model is better; it’s that both models expose how broken the “open AI” narrative has become. Open source doesn’t mean accessible, and cutting-edge doesn’t mean useful. The 3D AI race is heating up, but right now, it’s mostly generating hot air.
Want to experiment? Grab SHARP from GitHub and TRELLIS.2 from Hugging Face. Just don’t say we didn’t warn you about the CUDA requirements, and keep a healthy skepticism about anything generated from a single photograph.