When Function Calling Finally Works: 6.75% to 100% Success Fix

A developer presents a solution for deeply recursive union types in Qwen function calling, an area the industry generally claims doesn’t work: a system that achieved a 100% first-try success rate on qwen3-coder-next and fixed a double-stringify bug affecting the entire Qwen 3.5 family.

Figure 1: Systematic approach to fixing model function calling failure modes

Function calling on deeply recursive union types is widely held to be structurally broken by design. Industry consensus, backed by papers like NESTFUL (EMNLP 2025) and JSONSchemaBench (ICLR 2025), puts GPT-4o at 28% accuracy on nested tool sequences and suggests 3–41% coverage is the ceiling for complex schemas. The recommendation is clear: don’t bother with recursive unions. Use flat structures or accept the failure rate.

Then Jeongho Nam from Wrtn Technologies walked onto the stage at Qwen Meetup Korea and demonstrated a system hitting 100% first-try success on qwen3-coder-next, up from 6.75%, and fixing the entire Qwen 3.5 family’s 0% success rate on union types caused by a systematic double-stringify bug. Not by using a bigger model. Not by prompt engineering. By treating function calling as an engineering problem, not a model problem.

The 6.75% Reality Check

Most AI agents generate text and pray. AutoBe, an open-source backend generation agent, takes a different approach: it forces the LLM to fill out structured schemas, specifically TypeScript ASTs like IJsonSchema (10 recursive variants) and IExpression (30+ recursive variants). These aren’t simple flat objects; they’re combinatorial explosion machines, where 10 variants nested three levels deep create 1,000 possible paths.
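
To make the combinatorics concrete, here is a trimmed-down, hypothetical sketch of what such a recursive union looks like; the variant names are simplified illustrations, not AutoBe’s actual AST:

```typescript
// A hypothetical sketch in the spirit of AutoBe's IJsonSchema;
// variants simplified for illustration.
type IJsonSchemaSketch =
  | { type: "string" }
  | { type: "number"; minimum?: number }
  | { type: "array"; items: IJsonSchemaSketch }                       // recursive
  | { type: "object"; properties: Record<string, IJsonSchemaSketch> } // recursive
  | { anyOf: IJsonSchemaSketch[] };                                   // recursive union

// With v variants nested d levels deep, the model must pick correctly
// among v^d possible shapes.
const paths = (variants: number, depth: number): number =>
  Math.pow(variants, depth);

console.log(paths(10, 3)); // 1000 possible paths at 10 variants, depth 3
```

Every level of nesting multiplies the number of shapes the model can get wrong, which is why flat schemas look deceptively easy by comparison.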

When tested on qwen3-coder-next, the first-try success rate was 6.75%. The Qwen 3.5 family fared worse: 0% on union types, because of a consistent double-stringify bug in which every anyOf field got wrapped in an extra layer of JSON stringification. Not occasionally. Every time.

These numbers align with what the industry expects. Qwen 3.5’s agentic coding performance looks impressive in benchmarks, but local deployment often reveals a different story: skipped tool calls, malformed JSON, and the kind of protocol fragility that makes production deployment a nightmare. The assumption has been that this is a model capability issue, that we need larger models, better training, or more sophisticated prompting.

That assumption is wrong.

The Double-Stringify Bug That Ate an Entire Model Family

The Qwen 3.5 family’s 0% success rate on union types wasn’t random noise. It was a systematic failure mode: the model consistently double-stringified union type fields, producing outputs like "{\"type\":\"card\",...}" instead of the actual object. Standard JSON.parse() hands back a string where the schema expects an object. Most validation libraries throw up their hands.

Typia, a TypeScript compiler library that generates runtime validators from types, handles this through lenient parsing. Instead of expecting pristine JSON, Typia’s LlmJson.parse() recursively unwinds double-stringified content, auto-closes unclosed brackets, accepts unquoted keys, coerces strings to numbers when the schema demands it, and strips Markdown code blocks and conversational preamble.
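
The exact recovery logic lives inside Typia, but the core idea of unwinding double-stringified content can be sketched in a few lines of plain TypeScript; this is an illustrative reimplementation, not Typia’s code:

```typescript
// Illustrative sketch (not Typia's implementation): recursively re-parse
// any string field that itself looks like stringified JSON, undoing
// double-stringification.
function unwindStringified(value: unknown): unknown {
  if (typeof value === "string") {
    const trimmed = value.trim();
    if (trimmed.startsWith("{") || trimmed.startsWith("[")) {
      try {
        return unwindStringified(JSON.parse(trimmed));
      } catch {
        return value; // not valid JSON after all; keep the raw string
      }
    }
    return value;
  }
  if (Array.isArray(value)) return value.map(unwindStringified);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(
        ([k, v]) => [k, unwindStringified(v)],
      ),
    );
  }
  return value;
}

// A double-stringified union field comes back as a proper object:
const raw = '{"payment": "{\\"type\\":\\"card\\",\\"last4\\":\\"4242\\"}"}';
const fixed = unwindStringified(JSON.parse(raw)) as any;
console.log(fixed.payment.type); // "card"
```

The real parser layers more recovery on top (unclosed brackets, unquoted keys, Markdown fences), but this single transformation alone is what turns the Qwen 3.5 family’s 0% on union types into parseable output.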

Figure 2: Visualizing the structural changes required for robust function calling

The Harness Architecture: AutoBe + Typia

The breakthrough wasn’t a single fix but a systematic harness: type schemas that constrain outputs, compilers that verify results, and structured feedback that pinpoints exactly where and why something failed.

Typia handles the function-calling layer:

  • typia.llm.application<T>() generates JSON Schema from TypeScript types at compile time
  • ILlmFunction.parse() performs broken JSON recovery and type coercion
  • ILlmFunction.validate() detects schema violations with precise path tracking
  • LlmJson.stringify() renders errors as inline comments (// ❌) on the LLM’s original JSON
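
The inline-comment feedback idea can be sketched roughly as follows; IValidationError and renderFeedback are illustrative names under assumed shapes, not Typia’s real API:

```typescript
// Hypothetical sketch of rendering validation errors as inline comments
// on the model's own JSON (illustrative, not Typia's implementation).
interface IValidationError {
  path: string;     // e.g. "order.product.price"
  expected: string; // e.g. "number & Minimum<0>"
  value: unknown;   // what the model actually produced
}

function renderFeedback(
  json: Record<string, unknown>,
  errors: IValidationError[],
): string {
  // Index errors by top-level path; real path tracking is recursive.
  const notes = new Map(
    errors.map((e) => [e.path, e] as [string, IValidationError]),
  );
  return Object.entries(json)
    .map(([key, value]) => {
      const line = `  "${key}": ${JSON.stringify(value)},`;
      const err = notes.get(key);
      return err
        ? `${line} // ❌ expected ${err.expected}, got ${JSON.stringify(err.value)}`
        : line;
    })
    .join("\n");
}

console.log(renderFeedback(
  { id: "A-1", price: -100 },
  [{ path: "price", expected: "number & Minimum<0>", value: -100 }],
));
//   "id": "A-1",
//   "price": -100, // ❌ expected number & Minimum<0>, got -100
```

Annotating the failure in place, next to the model’s own output, is what lets the LLM correct only the broken field instead of regenerating everything.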

AutoBe provides the system-level validation:

  • 4 AST types (Analyze, Database, Interface, Test) with strict constraints
  • 4-tier compiler validation (Prisma, OpenAPI, Test, TypeScript)
  • Self-healing loops that preserve successful parts and correct only failures

The combination creates a deterministic loop around a probabilistic model. When validation fails, the system doesn’t just say “error”; it returns structured feedback like “Field order.product.price should be number & Minimum<0> but you gave -100.” The LLM self-corrects and retries. Strong models converge in 1–2 attempts; weaker models take 3–4. Both reach 100%.
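
The loop itself can be sketched generically; callModel, parseLenient, and validate below are placeholder interfaces standing in for the model call, the lenient parser, and the schema validator, not AutoBe’s real API:

```typescript
// Minimal sketch of a deterministic loop around a probabilistic model.
// All interfaces here are assumed placeholders, not AutoBe's real API.
interface IValidateResult {
  success: boolean;
  errors: string[]; // structured, path-level feedback for the LLM
}

async function callWithRetry(
  callModel: (feedback?: string) => Promise<string>,
  parseLenient: (raw: string) => unknown,
  validate: (value: unknown) => IValidateResult,
  maxAttempts = 4,
): Promise<unknown> {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(feedback); // probabilistic step
    const value = parseLenient(raw);       // recover broken JSON
    const result = validate(value);        // deterministic check
    if (result.success) return value;      // converged
    feedback = result.errors.join("\n");   // feed errors back and retry
  }
  throw new Error(`No valid output after ${maxAttempts} attempts`);
}
```

Everything probabilistic sits inside one line of this loop; everything around it is deterministic, which is why the convergence guarantee belongs to the harness rather than the model.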

This approach is model-neutral. AutoBe runs Qwen 3.5 edge AI variants, GLM, DeepSeek, and OpenAI models with the same schemas and pipelines, achieving 99.8%+ compilation across all of them. No model-specific prompt tuning required.

Why Small Models Are the Best QA Engineers

Counterintuitively, the journey from 6.75% to 100% was driven by small models. Large models like GPT-4o “correctly guess” ambiguous schema parts, papering over system vulnerabilities. Small models expose everything.

When testing with qwen3-30b-a3b (3B active parameters), the ~10% success rate revealed fundamental schema ambiguities and missing required fields that larger models had silently handled. Each failure pointed to a system vulnerability; each fix strengthened the pipeline for all models.

This flips the conventional wisdom about distilled Qwen3 model performance. While the industry obsesses over benchmark scores, the real value of small models is as QA infrastructure: they’re brutally honest about protocol compliance. When even a 3B-active model can’t break your system, no model will.

The Universal Pattern: Verification Over Capability

The harness pattern extends beyond code generation. Any domain with deterministic validators can apply the same structure of recursive union types, hierarchical decomposition, and progressive validation: semiconductors (DRC → LVS → SPICE), chemical processes (mass balance → energy balance → ASPEN), and browser-based physics simulations all qualify.

The key insight is that function calling doesn’t require the LLM to be perfect. It requires the structure around the LLM to be perfect. Types eliminate ambiguity; schemas constrain through absence rather than prohibition; validators provide mechanical verification. Native function calling claims often obscure this reality by promising magic while hiding the error correction loops.

As one developer noted in discussions about the presentation, the framing of “6.75% is not failure, it’s the first input to the loop” represents a genuine mental model shift. Most teams abandon structured output when initial accuracy drops, not realizing that the entire point of a feedback loop is to start somewhere measurable and converge through verification.

The Takeaway

The industry has been solving the wrong problem. We’ve been trying to make the probabilistic part of AI reliable through prompt engineering and larger models, when we should be making the deterministic part perfect through type schemas, compilers, and validation feedback.

With the right harness, 6.75% becomes 100%. Not because the model got smarter, but because the system got engineered. The Qwen 3.5 family’s double-stringify bug wasn’t a death sentence; it was just another failure mode to handle in the loop.

If you can verify, you converge. Everything else is just the first iteration.
