Architecture rot doesn’t announce itself with a crash. It arrives as a gradual decay: a database query sneaking into a domain model here, a framework import contaminating business logic there. By the time your architectural health metrics start screaming, you’re already six months into a refactoring nightmare.
Enter Architect Genesis, an open-source CLI tool that’s generating buzz for promising to automate what has traditionally been the sacred domain of senior engineers: architectural governance. It parses your Abstract Syntax Tree (AST) using Tree-Sitter, builds dependency graphs, and spits out a 0-100 score across four dimensions. Plug it into CI, and architecture violations break the build just like failing unit tests.
The premise is seductive. The execution is technically impressive. But the underlying question remains unresolved: Can a numerical score ever capture the contextual nuance that separates good architecture from cargo-culted structure?
## How Architect Genesis Actually Works (And It’s Not Regex)
Let’s give credit where it’s due. Unlike early static analysis tools that relied on pattern matching and regex hacks, Architect Genesis uses Tree-Sitter to parse the actual AST of your codebase. This matters because it reads the real dependency structure, not approximations.
It supports seven languages (TypeScript, Python, Go, Java, Rust, Ruby, and PHP) and automatically infers your stack and framework. The scoring engine evaluates four weighted dimensions:
| Dimension | Weight | Measures |
|---|---|---|
| Modularity | 40% | Module separation and encapsulation |
| Coupling | 25% | Cross-boundary dependency density |
| Cohesion | 20% | Relatedness of elements within modules |
| Layering | 15% | Clean separation of View/Core/Data/Infra |
The tool detects anti-patterns by analyzing the graph structure: God Classes (files with excessive dependents), Circular Dependencies (import cycles), Leaky Abstractions (layer boundary violations), and Spaghetti Modules (high coupling without clear interfaces). These aren’t heuristic guesses based on line counts; they’re structural realities extracted from the AST.
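To make “structural realities extracted from the AST” concrete, here is a minimal sketch of circular-dependency detection over an import graph. The module names and graph are hypothetical; the real tool builds its graph from Tree-Sitter parses, but the underlying check is ordinary cycle detection:

```python
# Minimal circular-dependency detector over an import graph.
# The graph below is a made-up example; Architect Genesis derives
# the real one from Tree-Sitter ASTs.
from typing import Dict, List, Optional

def find_cycle(imports: Dict[str, List[str]]) -> Optional[List[str]]:
    """Depth-first search; returns the first import cycle found, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2       # unvisited / on stack / done
    color = {m: WHITE for m in imports}
    stack: List[str] = []

    def dfs(module: str) -> Optional[List[str]]:
        color[module] = GRAY
        stack.append(module)
        for dep in imports.get(module, []):
            if color.get(dep, WHITE) == GRAY:   # back edge: cycle closed
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                found = dfs(dep)
                if found:
                    return found
        color[module] = BLACK
        stack.pop()
        return None

    for m in list(imports):
        if color[m] == WHITE:
            found = dfs(m)
            if found:
                return found
    return None

graph = {
    "orders": ["payments"],
    "payments": ["billing"],
    "billing": ["orders"],   # closes the cycle
    "ui": ["orders"],
}
print(find_cycle(graph))  # ['orders', 'payments', 'billing', 'orders']
```

A cycle either exists in the graph or it doesn’t, which is exactly why this class of finding is a reliable hard gate.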
You declare your governance rules in `.architect.rules.yml`:

```yaml
quality_gates:
  min_overall_score: 60
  max_critical_anti_patterns: 0
boundaries:
  allow_circular_dependencies: false
  banned_imports:
    - from: "presentation/*"
      to: "infrastructure/*"
```
Then `architect check` validates against these rules with exit codes suitable for CI/CD. This is automated architectural enforcement taken to its logical extreme.
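As a rough sketch of what evaluating a `banned_imports` rule could look like, consider glob matching over module paths. The matching semantics here are my assumption, not the tool’s documented implementation:

```python
# Hypothetical evaluation of a banned_imports rule; the real tool's
# matching semantics may differ.
from fnmatch import fnmatch

def violates(rule: dict, importer: str, imported: str) -> bool:
    """True if a module matching rule['from'] imports one matching rule['to']."""
    return fnmatch(importer, rule["from"]) and fnmatch(imported, rule["to"])

rule = {"from": "presentation/*", "to": "infrastructure/*"}

print(violates(rule, "presentation/OrderView", "infrastructure/PgOrderRepo"))  # True
print(violates(rule, "core/OrderService", "infrastructure/PgOrderRepo"))       # False
```

Any violation then maps to a nonzero exit code, which is what the CI pipeline keys off.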
## The Adaptive Scoring Mirage
Here’s where it gets interesting. The tool ships with six scoring profiles (default, frontend-spa, backend-monolith, microservices, data-pipeline, and library) that adjust the dimension weights based on your detected framework. A React SPA might prioritize cohesion over coupling, while a backend monolith inverts those priorities.
The creator calibrated these weights against approximately 30 codebases, but openly admits uncertainty about whether they generalize across different project sizes and paradigms. This honesty reveals the fundamental tension in automated architecture governance: metrics optimized for one context become anti-patterns in another.
Consider the weight distribution. Modularity dominates at 40%, followed by coupling at 25%. But in a microservices architecture, coupling between services might matter more than internal modularity. In a data pipeline, layering could be less critical than cohesion. The prevailing sentiment in early community feedback suggests these weights should be configurable per project, not imposed by algorithmic fiat.
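The mechanics of profile-weighted scoring are simple enough to sketch. The `default` weights below come from the table earlier in this piece; the `frontend-spa` weights are purely illustrative, since the tool’s actual per-profile values aren’t published here:

```python
# Weighted overall score under different profiles.
# "default" weights match the published table; "frontend-spa" is illustrative.
PROFILES = {
    "default":      {"modularity": 0.40, "coupling": 0.25, "cohesion": 0.20, "layering": 0.15},
    "frontend-spa": {"modularity": 0.35, "coupling": 0.15, "cohesion": 0.35, "layering": 0.15},
}

def overall_score(dimension_scores: dict, profile: str = "default") -> float:
    weights = PROFILES[profile]
    return sum(dimension_scores[d] * w for d, w in weights.items())

scores = {"modularity": 80, "coupling": 60, "cohesion": 90, "layering": 70}
print(round(overall_score(scores, "default"), 1))       # 75.5
print(round(overall_score(scores, "frontend-spa"), 1))  # 79.0
```

The same dimension scores yield different verdicts under different profiles, which is precisely why the weights themselves carry so much unexamined authority.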
There’s also the gaming problem. When you attach career incentives to a numerical score, especially one that gates production deployments, developers optimize for the metric, not the architecture. A team could split God Classes into anemic micro-modules that ace the modularity score while destroying domain cohesion. The tool sees the graph; it doesn’t see the intent.
## AI-Assisted Refactoring: Speed With Training Wheels
Architect Genesis doesn’t stop at analysis. It generates refactoring plans using five rule-based transformations (Hub Splitting, Barrel Optimization, Import Organization, Module Grouping, and Dead Code Detection) and can execute them autonomously using Claude, GPT, or Gemini.
But the creator learned quickly that fully autonomous refactoring is “terrifying.” The current implementation requires human gating on every step: approve, skip, retry, or rollback. It creates protective git branches and commits each step individually, acknowledging that AI agents still struggle to adhere to established architectural patterns.
This hybrid approach, machine-generated plans with human oversight, reflects a broader truth about how AI coding agents are reshaping architectural control. The AI can see the graph edges, but it cannot see the business context that justifies a seemingly circular dependency, or the strategic reason why a particular abstraction leaks intentionally.
## The Knowledge Base and Forecasting Trap
Every analysis run persists to a local SQLite database, building an Architecture Knowledge Base that tracks score trends over time. The `architect forecast` command uses ML-based regression to predict score decay three to six months out, theoretically letting you intervene before architecture becomes a crisis.
This sounds like living architecture documentation finally realized. But forecasting architectural health from historical scores assumes the scoring model captures the right variables. If your 0-100 score doesn’t correlate with actual maintainability (and there’s no guarantee it does), then you’re simply projecting the decay of a vanity metric.
The tool also suggests governance rules based on recurring patterns in the knowledge base. While helpful, this risks calcifying existing architectural decisions, good or bad. A codebase with accumulated technical debt will generate suggestions that entrench that debt, not resolve it.
## When the Build Should Break (And When It Shouldn’t)
The most dangerous feature might be the CI integration. Automated architectural enforcement works brilliantly when your architecture is well-defined and your scoring model aligns with reality. But in the messy middle of a refactor, or during experimental prototyping, a hard gate on a 60-point minimum score becomes organizational friction.
There’s a difference between “this change introduces a circular dependency” (objectively verifiable via AST) and “this change reduces the modularity score from 85 to 78” (contextually ambiguous). The former should block the build. The latter might represent necessary exploration or intentional technical debt.
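One way to encode that distinction is a two-tier gate policy, sketched below as a hypothetical function rather than anything the tool ships: binary structural findings block outright, while score regressions above the configured floor only warn.

```python
# Hypothetical two-tier CI gate: binary structural violations always block;
# score regressions merely warn unless they cross a hard floor.
def gate(has_cycle: bool, layer_violations: int, score: float,
         previous_score: float, floor: float = 60.0) -> str:
    if has_cycle or layer_violations > 0:
        return "block"   # objectively verifiable via the AST
    if score < floor:
        return "block"   # hard minimum, as in quality_gates
    if score < previous_score:
        return "warn"    # contextually ambiguous: surface it, don't gate on it
    return "pass"

print(gate(False, 0, score=78, previous_score=85))  # warn
print(gate(True, 0, score=90, previous_score=85))   # block
```

A warn-not-block tier gives a team room for intentional technical debt without training engineers to ignore the tool.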
This distinction matters because the risks of architecture-as-code often manifest as false positives that erode developer trust. When the tool flags a legitimate architectural compromise as a violation, engineers learn to game the system or bypass the gates entirely.
## The Intent Compiler Fantasy
The roadmap for Architect Genesis reveals the ultimate ambition: evolving from analyzer to “intent compiler.” Describe the architecture you want in natural language, and the system generates the code: from requirements document to scaffolded project, with architecture decisions baked in.
This is where we depart from static analysis and enter the realm of generative AI. The `architect genesis-create` command already attempts this, parsing requirements documents to generate bounded contexts, stack decisions, and governance rules. It creates `.architect.rules.yml` from day one, theoretically preventing architectural drift before it starts.
But this vision assumes architectural intent can be fully specified upfront, a waterfall fantasy that ignores the emergent nature of good software design. The best architectures I’ve seen evolved through conversation, not specification, through refactoring in response to changing requirements, not through upfront generation.
## What AST Scoring Can’t See
Tree-Sitter gives you the dependency graph, but it cannot tell you:
- Whether a coupling violation represents a necessary integration or an architectural sin
- If a God Class is actually a well-designed facade hiding complex subsystem interactions
- Whether your layering violations are technical debt or pragmatic optimizations
- If the “spaghetti” is actually a performance-critical hot path that shouldn’t be decoupled
These calls require domain knowledge, business context, and engineering judgment, capabilities that resist quantification. The AST sees the structure; it cannot see the forces that shaped it.
This isn’t an argument against tools like Architect Genesis. It’s an argument against treating them as oracles. They excel at bridging the gap between stale diagrams and live codebases, catching obvious violations early, and maintaining baseline hygiene. They fail when asked to replace architectural thinking with algorithmic scoring.
## The Verdict: Augmentation, Not Replacement
Automated AST scoring won’t replace human reviews, but it will reshape them. The future isn’t senior architects poring over every import statement; that doesn’t scale. It’s also not blind faith in a 0-100 score; that doesn’t work.
The viable path is human-machine collaboration: using tools like Architect Genesis to handle the mechanical aspects of governance (detecting cycles, enforcing layer boundaries, tracking dependency drift) while reserving human judgment for the contextual decisions that separate maintainable systems from metric-compliant disasters.
Start with the anti-pattern detection. Let the tool find your God Classes and circular dependencies. But when it comes to the score? Use it as a conversation starter, not a verdict. And definitely don’t let it break the build during exploratory phases.
The “revolutionary” promise of automated architecture governance isn’t that it replaces human judgment. It’s that it finally gives us a shared language to discuss structural quality before the rot sets in. That’s valuable enough without the pretense of algorithmic omniscience.
