The idea of large language models playing strategy games isn’t new. The idea that they might actually be good at it? That’s where things get interesting. In what might be the most comprehensive test of LLM strategic reasoning to date, researchers pitted two open-source models, OSS-120B and GLM-4.6, against Civilization V’s legendary complexity. The outcome wasn’t victory, but something far more revealing: a mirror showing what these models have actually learned about competition, cooperation, and power.
The Experiment: 1,408 Games of Digital Statecraft
The Vox Deorum project didn’t just dabble in AI gameplay; it went all-in: 1,408 complete Civilization V games, running from the Stone Age to the future eras, with the notoriously challenging Vox Populi community patch activated. This wasn’t about micromanaging units; instead, a hybrid architecture let the LLMs set grand strategy while the game’s algorithmic AI handled execution.
The setup was straightforward: each turn, the LLM receives a text dump of the game state (diplomatic standings, military strength, resource counts, technology trees) and outputs high-level decisions: which victory type to pursue, how to prioritize production, what ideology to adopt. The baseline algorithmic AI then implements these choices through its existing tactical systems.
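To make that division of labor concrete, here is a minimal sketch of what one such turn could look like. Everything in it is illustrative: the `query_llm` helper, the prompt wording, and the decision fields are assumptions, not the project’s actual interface.

```python
import json

def llm_strategy_turn(game_state: dict, query_llm) -> dict:
    """One hypothetical turn of the hybrid loop: the LLM sees a text dump
    of the game state and returns only macro-level decisions."""
    prompt = (
        "You are the leader of a Civilization V empire.\n"
        f"Game state:\n{json.dumps(game_state, indent=2)}\n\n"
        "Return JSON with keys: victory_type, production_priority, ideology."
    )
    raw = query_llm(prompt)        # e.g. a call to OSS-120B or GLM-4.6 via an API
    decisions = json.loads(raw)    # {"victory_type": "domination", ...}

    # Tactical execution (unit moves, combat math) stays with the game's own AI;
    # the LLM never issues unit-level orders.
    return {
        "victory_type": decisions.get("victory_type", "science"),
        "production_priority": decisions.get("production_priority", "growth"),
        "ideology": decisions.get("ideology"),
    }
```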
This approach solved a fundamental problem. Pure LLM or reinforcement learning approaches previously failed to even survive full Civilization games. The hybrid model? 97.5% survival rate, nearly matching the algorithmic AI’s 97.3%. The key insight: LLMs don’t need to control every spearman’s movement. They need to think like a leader.
The Boring Results: Competitive, But Not Dominant
Let’s get the headline numbers out of the way. With simple prompts and minimal memory, both models performed… fine. OSS-120B and GLM-4.6 achieved 1-2% higher scores in their best games but 1-3% lower win rates compared to the baseline. Across 2,207 total games (including 919 baseline runs), these differences weren’t statistically significant.
In other words, throwing a 120-billion-parameter model at Civ V doesn’t automatically produce a superintelligence. It produces a competent player that makes interesting decisions but won’t be dethroning human experts anytime soon. The real story isn’t in the win rates, it’s in how they played.
The Surprising Part: Emergent Ideologies and Playstyles
Here’s where things get spicy. The two models didn’t just play differently. They developed fundamentally opposed personalities that reveal deep biases in their training data and alignment.
OSS-120B went full warmonger: 31.5% more Domination victories and 23% fewer Cultural victories. It saw the game as a zero-sum conflict and acted accordingly, prioritizing military expansion and crushing neighbors.
GLM-4.6 played a more balanced game, pursuing both Domination and Cultural strategies with greater nuance. Less aggressive, more opportunistic.
But both models agreed on one thing: Order is better than Freedom. They chose the communist-like Order ideology ~24% more often than the democratic-like Freedom path. In a game about building societies from scratch, two different open-source models independently converged on authoritarian central planning.
What does this tell us? That LLMs trained on human text reflect human history’s darker patterns? That the “alignment” we think we’ve achieved is just a thin veneer over statistical likelihoods? The researchers diplomatically call this “divergent playstyles.” Others might call it concerning.
The Economics: $0.86 Per Game, 53,000 Tokens Per Turn
Running this experiment wasn’t cheap. Each turn consumed ~53,000 input tokens and 1,500 output tokens for OSS-120B. At OpenRouter pricing (as of December 2025), that’s $0.86 per game.
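As a rough back-of-the-envelope check, the per-game figure follows directly from those token counts. The per-million-token prices and the turn count below are placeholder assumptions (only the ~53,000 input and ~1,500 output tokens per turn come from the write-up), chosen so the total lands in the same ballpark as the reported $0.86.

```python
# Illustrative cost estimate; prices and turn count are assumptions.
INPUT_TOKENS_PER_TURN = 53_000
OUTPUT_TOKENS_PER_TURN = 1_500
TURNS_PER_GAME = 500          # assumed; varies with game length and speed
PRICE_IN_PER_M = 0.03         # assumed $ per 1M input tokens
PRICE_OUT_PER_M = 0.12        # assumed $ per 1M output tokens

cost_per_turn = (
    INPUT_TOKENS_PER_TURN / 1e6 * PRICE_IN_PER_M
    + OUTPUT_TOKENS_PER_TURN / 1e6 * PRICE_OUT_PER_M
)
print(f"per turn: ${cost_per_turn:.4f}, per game: ${cost_per_turn * TURNS_PER_GAME:.2f}")
```

The point of the arithmetic isn’t the exact dollar figure; it’s that input tokens dominate the bill, which is exactly the scaling problem discussed next.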
The token count reveals a critical scaling problem: input tokens grow linearly with game state complexity. A late-game turn with 20+ cities and 100+ units easily exceeds 50,000 tokens. But here’s the kicker: output tokens stay flat. The models don’t automatically “think harder” as the game gets more complex. They’re as verbose on turn 1 as they are during global nuclear war.
This exposes a fundamental limitation of current LLMs for long-horizon tasks. Humans compress information: we focus on what’s changed and what’s critical. LLMs currently reprocess everything, every time. The researchers are exploring multimodal approaches and more efficient state representations, but the token bloat problem remains unsolved.
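One direction worth sketching (purely as an illustration of state compression, not something the paper implements) is to send the model only a diff of the game state each turn, so prompt size tracks the rate of change rather than the size of the empire:

```python
def state_delta(prev: dict, curr: dict) -> dict:
    """Return only the fields that changed since last turn, so the prompt
    grows with the rate of change instead of with total game-state size."""
    delta = {}
    for key, value in curr.items():
        if key not in prev:
            delta[key] = value                      # newly added entity
        elif isinstance(value, dict) and isinstance(prev[key], dict):
            nested = state_delta(prev[key], value)  # recurse into sub-objects
            if nested:
                delta[key] = nested
        elif prev[key] != value:
            delta[key] = value                      # changed value
    for key in prev.keys() - curr.keys():
        delta[key] = None                           # mark removals
    return delta

# Example: only the changed city and the new unit make it into the prompt.
prev = {"cities": {"Rome": {"pop": 6}}, "units": {"u1": "warrior"}}
curr = {"cities": {"Rome": {"pop": 7}}, "units": {"u1": "warrior", "u2": "archer"}}
print(state_delta(prev, curr))  # {'cities': {'Rome': {'pop': 7}}, 'units': {'u2': 'archer'}}
```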
The Architecture: Why Hybrid Works When Pure LLM Fails
The Vox Deorum paper details a crucial insight: LLMs shouldn’t play the game. They should direct it.
Previous attempts at pure LLM or RL gameplay failed catastrophically. Models couldn’t handle the action space or long-term credit assignment. The hybrid approach elegantly sidesteps this:
- **LLM Layer**: Macro-strategic reasoning in natural language
- **Algorithmic AI**: Tactical execution (unit movement, combat calculations)
- **Game Interface**: MCP server exposing game state as text
This architecture achieves competitive end-to-end gameplay while opening doors for future improvements. The researchers note that even OSS-20B works locally, suggesting model size isn’t the bottleneck, design is.
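A minimal sketch of how those three layers might be wired together follows; the class and method names are hypothetical, not the Vox Deorum codebase’s actual API.

```python
class HybridCivAgent:
    """Hypothetical wiring of the three layers; names are illustrative only."""

    def __init__(self, llm_client, game_interface):
        self.llm = llm_client        # macro-strategy in natural language
        self.game = game_interface   # e.g. an MCP-style server exposing state as text

    def play_turn(self) -> None:
        state_text = self.game.read_state()        # 1. game interface: state as text
        directives = self.llm.decide(state_text)   # 2. LLM layer: grand strategy only
        self.game.apply_directives(directives)     # 3. algorithmic AI executes tactics
        self.game.end_turn()
```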
The Community Response: From Skepticism to Collaboration
The project dropped on Reddit and CivFanatics forums, and the community’s reaction was immediate. One developer with RL experience offered to help, having trained StarCraft 2 AIs that beat the hardest difficulty with just marines. The Vox Populi lead developer weighed in, seeing LLMs as a path to finally eliminating AI bonuses in favor of scaling difficulty based on strategic sophistication.
The most interesting question from the community: **“What’s a good way to express game state more efficiently?”** This isn’t just about Civ V. It’s about whether LLMs can ever handle truly complex, evolving environments without drowning in context.
The Controversy: What This Really Means for AI Strategy
This research sits at an uncomfortable intersection. On one hand, it proves LLMs can survive in complex, long-horizon environments, a prerequisite for real-world strategic planning. On the other, it reveals how surface-level their “reasoning” actually is.
The models didn’t discover new strategies. They regressed to historical patterns of conquest and centralization. They didn’t show emergent creativity. They showed emergent **bias**. When given a blank slate to build a society, they chose authoritarianism and war.
This matters because the same architectures are being deployed for business strategy, policy planning, and military wargaming. If an LLM’s “strategic thinking” in Civ V reflects its training data’s darkest tendencies, what happens when we ask it to optimize for more consequential domains?
The researchers are transparent about limitations. They’re exploring RAG, self-play, and long-term memory. But the fundamental question remains: **Can LLMs truly strategize, or are they just very sophisticated pattern matchers?**
The Future: Beyond Token Dumps
The Vox Deorum team has open-sourced everything. You can download the mod, watch replay files, or even expose the game as an MCP server for your own agents.
Next steps include:
- **Multimodal state representation** (vision models reading the game board)
- **Self-play with reflection** (models learning from their own mistakes)
- **In-game chat interfaces** (actually negotiating with AI opponents)
- **RL integration** for tactical improvement
The holy grail? AIs that don’t need artificial bonuses, where difficulty scales with strategic sophistication, not resource cheats.
Takeaways for AI Practitioners
- **Hybrid architectures beat pure approaches**: LLMs for reasoning, traditional AI for execution
- **Token efficiency is critical**: Long-horizon tasks require better state compression
- **Emergent behavior reveals training bias**: Watch what models do, not what they say
- **Surviving ≠ thriving**: Competence in complex environments is table stakes, not success
- **Open-source models are viable**: OSS-120B and GLM-4.6 prove you don’t need GPT-4 for strategic tasks
The Civilization V benchmark won’t replace traditional RL environments, but it offers something unique: a lens into how LLMs handle the messiness of human-like strategic thinking. The view isn’t always flattering.
**Want to experiment?** The Vox Deorum repo is live. Just remember: when your AI opponent chooses Order over Freedom and starts massing troops on your border, it’s not a bug. It’s a feature of its training.
Discussion Questions
- How would you design a more efficient game state representation for LLMs?
- Do the ideological biases in these models concern you for real-world applications?
- Could RL eventually replace the algorithmic AI layer for tactical execution?
- What other complex simulations should we test LLM strategic reasoning on?




