Specification-Driven Development: Why Your LLM Prompts Need a Type System
The Kotlin creator’s new project argues we’ve been talking to AI wrong. Here’s the data on why formal specifications beat natural language for serious software engineering.
The reality? We’re drowning in generated spaghetti that compiles but doesn’t cohere. A recent analysis of AI code generation outcomes showed that while AI can produce functional code at shocking speeds, the resulting architecture often collapses under maintenance load. The culprit isn’t the models; it’s the interface. Natural language is ambiguous, context-dependent, and fundamentally unsuited for specifying the precise constraints that robust software requires.
Enter Specification-Driven Development (SDD), a paradigm that treats LLMs not as omniscient oracles to be chatted with, but as compilers for formal specifications. The approach, championed by JetBrains’ Kotlin creator Andrey Breslav through his new CodeSpeak language, argues that we should maintain specs, not code, and let AI handle the translation.
The 10x Compression Claim: Specs vs. Implementation
CodeSpeak’s central thesis is radical in its simplicity: plain-text specifications can generate production code at compression ratios of 5-10x compared to traditional implementation. This isn’t theoretical vaporware. The team has published concrete case studies from real open-source projects:
| Case Study | Code LOC | Spec LOC | Shrink Factor | Domain |
|---|---|---|---|---|
| WebVTT subtitles (yt-dlp) | 255 | 38 | 6.7x | Video Processing |
| Italian SSN generator (Faker) | 165 | 21 | 7.9x | Data Generation |
| Encoding detection (BeautifulSoup4) | 826 | 141 | 5.9x | Text Processing |
| EML to Markdown converter | 139 | 14 | 9.9x | Document Conversion |
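The shrink factors in the table follow directly from the raw line counts; a quick sanity check in Python reproduces them:

```python
# Recompute the shrink factors from the case-study line counts.
cases = {
    "WebVTT subtitles (yt-dlp)": (255, 38),
    "Italian SSN generator (Faker)": (165, 21),
    "Encoding detection (BeautifulSoup4)": (826, 141),
    "EML to Markdown converter": (139, 14),
}

for name, (code_loc, spec_loc) in cases.items():
    print(f"{name}: {code_loc / spec_loc:.1f}x")
# WebVTT: 6.7x, SSN: 7.9x, encoding: 5.9x, EML: 9.9x
```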
The implications are stark. An EML converter that required 139 lines of carefully crafted Python (with 27 accompanying tests) collapses to 14 lines of specification. The generated code passes the same test suite; in fact, it passes more tests, because the spec can generate edge cases the original author missed.
But here’s the controversial part: CodeSpeak explicitly targets “engineers building complex software” and “teams of humans”, deliberately excluding what the industry has euphemistically termed “vibe-coding.” This is a direct assault on the democratization narrative that has driven AI adoption. The tool assumes you understand software architecture enough to specify constraints formally, which puts it at odds with the “anyone can code” movement.
When Natural Language Fails: The UML Generation Study
The limitations of natural language prompting become quantifiable when we look at formal model generation. A recent study from Monash University tested state-of-the-art LLMs (GPT-5, Claude Sonnet 4.0, Gemini 2.5, Llama-3.1) on generating UML class diagrams from natural language requirements. The results confirm what writing on software architecture in the AI-augmented era has been suggesting: current approaches hit a complexity ceiling.
The research used a dual-validation framework combining LLM-as-a-Judge with human expert evaluation across five dimensions: completeness, correctness, standards adherence, comprehensibility, and terminological alignment. While GPT-5 achieved substantial alignment with human evaluators (Cohen’s Kappa κ=0.684), the models consistently struggled with domain-specific complexity.
The Pacemaker dataset, a safety-critical medical device specification with 187 requirements, proved the breaking point: GPT-5’s completeness scores dropped to 2-3 out of 5, with evaluators noting “major gaps” in critical items.
– Monash University Research Data
Traditional architecture documentation has well-documented limitations, and it turns out LLMs amplify them when fed ambiguous natural-language inputs.
The study revealed that “requirement smells”, particularly semantic inconsistencies and ambiguities in the source text, severely hamper model fidelity. When requirements contain contradictory constraints or undefined terms, even the most advanced LLMs produce diagrams too unreliable to deploy, widening the gap between architectural theory and implementation.
The Formal Specification Advantage
So why do formal specifications succeed where natural language fails? The answer lies in constraint propagation.
When you write a specification in a structured domain-specific language (DSL), you’re not just describing intent; you’re defining an invariant space. The specification encodes relationships, types, and constraints that must hold true across all generated implementations. This aligns with the industry shift toward text-based specifications that treat diagrams as code rather than decoration.
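CodeSpeak’s actual syntax isn’t reproduced in this article, but the idea of an invariant space can be sketched in plain Python: a constrained type pins down properties that every generated implementation must respect, regardless of how the implementation is written.

```python
# Stand-in for a spec-level invariant (illustrative only; not CodeSpeak
# syntax). Any parser regenerated from the spec must emit cues that
# satisfy these constraints, so a violation is a spec bug, not a
# scattered implementation bug.
from dataclasses import dataclass


@dataclass(frozen=True)
class SubtitleCue:
    start_ms: int
    end_ms: int
    text: str

    def __post_init__(self):
        # Invariants declared once, enforced on every construction.
        if self.start_ms < 0:
            raise ValueError("cue start must be non-negative")
        if self.end_ms <= self.start_ms:
            raise ValueError("cue must end after it starts")
```

The names (`SubtitleCue`, the millisecond fields) are hypothetical, loosely inspired by the WebVTT case study above.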
CodeSpeak’s approach uses what they call “managed files”, portions of the codebase that are entirely AI-generated from specs and marked as immutable by human developers. You don’t debug the generated Python; you debug the 14-line spec, and the AI regenerates the implementation. This inverts the traditional maintenance burden: instead of refactoring thousands of lines of legacy code, you refactor a concise specification, and the AI propagates changes through the entire dependency graph.
The technical mechanism involves “spec dependencies”, explicit declarations of how specifications relate to each other and to manually-written code. This allows mixed-mode projects where critical paths are formally specified while glue code remains hand-crafted, addressing the practical reality that most enterprises can’t afford to rewrite their entire stack.
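The article doesn’t detail how CodeSpeak resolves these dependencies internally, but the propagation idea is easy to sketch: given a graph of what depends on what, a change to one spec marks every reachable managed file as stale and due for regeneration. All names below are hypothetical.

```python
# Illustrative sketch (not CodeSpeak's actual mechanism): walk the
# dependents graph breadth-first to find everything that must be
# regenerated when one spec changes.
from collections import deque


def regeneration_set(dependents: dict[str, list[str]], changed: str) -> set[str]:
    """Return all specs and managed files reachable from `changed`."""
    stale, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, []):
            if dep not in stale:
                stale.add(dep)
                queue.append(dep)
    return stale


deps = {
    "email.spec": ["eml_to_md.spec", "parser.py"],  # hypothetical names
    "eml_to_md.spec": ["converter.py"],
}
print(regeneration_set(deps, "email.spec"))
# parser.py, eml_to_md.spec, and converter.py all become stale
```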
The Human-AI Collaboration Model
Critics argue that specification-driven development simply moves the complexity around: instead of writing code, you write specs, which is just code with extra steps. But this misses the semantic density difference. A 14-line specification that generates 139 lines of implementation isn’t just shorter; it’s verifiable.
Formal Criteria = Verifiability
The Monash study demonstrated that LLM judges (using Grok and Mistral as evaluators) achieved substantial agreement (κ=0.773) when assessing UML quality against formal criteria. This suggests that formal specifications create an evaluable intermediate representation, something natural language prompts fundamentally lack. You can test whether a spec satisfies architectural constraints; testing whether a prompt will produce correct code is stochastic guesswork.
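The κ values cited here are Cohen’s kappa, which corrects raw agreement for the agreement two raters would reach by chance. A minimal computation over two raters’ labels (toy data on a 1-5 quality scale, not the study’s actual ratings):

```python
# Cohen's kappa: observed agreement corrected for chance agreement.
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[label] * cb[label] for label in ca) / n**2  # by chance
    return (p_o - p_e) / (1 - p_e)


human = [4, 4, 3, 2, 5, 3, 4, 2]  # illustrative labels only
judge = [4, 4, 3, 3, 5, 3, 4, 1]
print(round(cohens_kappa(human, judge), 3))  # → 0.667
```

Raw agreement here is 6/8 = 0.75, but kappa discounts the 0.25 expected by chance, landing at 0.667, the same "substantial agreement" band the study reports for its evaluators.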
Living Documentation That Doesn’t Drift
Moreover, specifications serve as living documentation that doesn’t drift from implementation. When your spec is the source of truth, and the code is a disposable artifact, you solve the synchronization problem that has plagued traditional architecture documentation for decades.
The Controversy: Gatekeeping or Necessary Discipline?
The spicy undercurrent here is elitism. By requiring engineers to think in formal specifications rather than conversational English, SDD erects a barrier to entry. The CodeSpeak landing page explicitly states it’s for “production-grade systems” and “long-term projects, not just prototypes”, a clear dig at the rapid-prototyping culture that has dominated AI-assisted development.
This creates tension with the accessibility narrative. If the future of software engineering requires formal methods training, we may see a bifurcation in the industry: architects who specify and maintain the “source of truth”, and implementers who increasingly work with generated code they don’t fully control.
The data from the UML generation study supports this bifurcation. While LLMs can generate structurally coherent diagrams from natural language, the quality variance is too high for safety-critical applications. The Pacemaker case study showed that even GPT-5 produced diagrams with correctness scores of only 3/5 when faced with complex domain logic: acceptable for a CRUD app, potentially lethal for medical device firmware.
Practical Implementation: Mixed-Mode Development
Hybrid Approaches for Organizations Not Ready to Fully Commit
For organizations not ready to jump fully into specification-driven development, the research suggests a hybrid approach. The Monash study’s “LLM-as-a-Judge” methodology can be repurposed for code review: generate multiple implementations from the same spec, use LLMs to rank them against formal criteria, and have humans validate only the winners.
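That review loop is straightforward to sketch. The judge below is a deterministic stub (a real pipeline would prompt an LLM with the spec, the candidate, and the rubric, and parse its scores); the point is the shape: generate N candidates, score each against formal criteria, and surface only the top-ranked ones for human validation. All function names here are hypothetical.

```python
# Sketch of a generate/rank/validate loop. `score_candidate` stands in
# for an LLM judge scoring against formal criteria; here it's a stub
# keyed off the candidate's length so the example is runnable.
from statistics import mean

CRITERIA = ["completeness", "correctness", "standards_adherence"]


def score_candidate(impl: str) -> dict[str, float]:
    # Hypothetical stub: a real judge would return per-criterion
    # scores parsed from an LLM's structured response.
    return {c: float(len(impl) % 5 + 1) for c in CRITERIA}


def rank_for_review(candidates: list[str], top_k: int = 2) -> list[str]:
    """Rank candidates by mean criterion score; humans see only top_k."""
    scored = [(mean(score_candidate(c).values()), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [impl for _, impl in scored[:top_k]]
```

Humans then validate only the winners, which is where the Monash finding matters: the judge’s rankings are trustworthy precisely because they are made against formal criteria rather than free-form impressions.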
CodeSpeak supports this through what they term “mixed projects”, codebases where some files are managed by specs and others are handwritten. This allows teams to formalize their most critical components (security boundaries, data models, API contracts) while keeping business logic in traditional code.
The shrink factors observed in the case studies suggest that maintenance burden concentrates in specific domains. The WebVTT subtitle parser required 255 lines of intricate state management code that compressed to 38 lines of specification, likely because subtitle parsing involves complex temporal logic that’s easy to specify formally but tedious to implement manually.
The Verdict
Specification-driven development isn’t going to replace Copilot or Cursor for quick scripts. But for the architecture requirements of serious software in an AI-augmented era, it represents a necessary evolution. Natural language is a fantastic interface for exploration and ideation, but it’s a terrible interface for engineering.
The 5-10x compression ratios aren’t just about saving keystrokes; they’re about reducing the surface area for error. When your specification is 14 lines and your implementation is 139 lines, debugging becomes a search problem over a constrained space rather than an archaeological dig through generated sludge.
As the Monash researchers concluded, LLMs can serve as “reliable evaluators in automated requirements engineering workflows”, but only when given formal criteria to evaluate against. Natural language prompts lack that rigidity, which is why they excel at creativity but fail at consistency.
The future likely isn’t “vibe coding” or pure specification-driven development, but a stratified approach where formal specs govern critical paths and natural language handles the long tail of glue code. The engineers who thrive will be those who know when to specify and when to prompt, and have the architectural judgment to tell the difference.
After all, if you’re building software that needs to last longer than the average Hacker News hype cycle, you probably want more than a vibe. You want a spec.




