Anthropic quietly dropped a reference implementation for autonomous vulnerability discovery last month, and within hours, the defending-code-reference-harness repository racked up 1,100 stars on GitHub. The README is refreshingly honest: “This repo is not maintained and is not accepting contributions.” Translation: here’s what we learned building this thing. Good luck.
What makes this release worth studying isn’t the hype around AI finding bugs, we’ve seen that movie before. What’s interesting is the architecture. Combining LLM reasoning with static analysis and dynamic fuzzing into a single, reproducible pipeline forces some genuinely hard design decisions. This post breaks down the trade-offs Anthropic made, where they shine, and where the approach starts to creak.

The Seven-Stage Pipeline: Orchestrating Chaos
The reference harness implements a seven-stage pipeline that walks through build, recon, find, verify, dedupe, report, and patch. On paper, it sounds straightforward. In practice, it’s an orchestration nightmare.
# The pipeline in action
bin/vp-sandboxed run drlibs --model <model-id> --runs 3 --parallel --stream --auto-focus
The run command spawns autonomous agents. Each agent operates inside a gVisor container with egress restricted exclusively to the Claude API. The pipeline doesn’t ask nicely for isolation, it refuses to run without it.
Stage 1-2: Build and Recon
The Build stage compiles the target into a Docker image with AddressSanitizer (ASAN) enabled for C/C++ memory error detection. The pipeline builds this image automatically on first run using the target’s Dockerfile.
The Recon stage is where the architecture gets clever. A lightweight, network-isolated agent reads the source code and proposes a partition, distinct input-parsing subsystems worth attacking separately. Without the --auto-focus flag, the pipeline falls back to a focus_areas list from the target’s config.yaml.
This partitioning is the critical architectural insight: without it, parallel find agents converge on the same shallow bugs, producing diminishing returns that teams have reported as their primary scaling bottleneck.
Stage 3-4: Find and Verify
The Find stage spawns N agents in parallel, each in its own isolated container. Each agent reads source code, crafts malformed inputs, and runs the ASAN binary until a given input produces a crash 3 out of 3 times.
The Verify stage is where the separation of concerns pays off. A separate grader agent reproduces each crash in a fresh container that the find agent has never touched. The only artifact that crosses the boundary: the proof of concept.
Reusing code without explicit permission is a legal minefield. The same principle applies here, the verification agent gets the PoC, not the reasoning that produced it.
This isolation prevents a subtle failure mode: when a single agent attempts to both discover and verify vulnerabilities, it tends to self-censor, filtering out exploitable true positives because it’s second-guessing its own work.
Stage 5-6-7: Dedupe, Report, and Patch
The Dedupe stage uses a judge agent that compares verified crashes against previously reported bugs and decides whether each is a new bug, a better example of a known bug, or a duplicate.
The Report stage writes structured exploitability analysis per unique bug, primitive class, reachability, escalation path, severity.
The Patch stage (run as a separate command) writes a proposed fix, then confirms:
– The new code builds
– The original PoC no longer crashes
– The target’s test suite still passes
– A fresh find agent can’t work around the fix
That last check is brutal but necessary. The patch validation ladder represents the most sophisticated part of the architecture.
The Sandbox Requirements: gVisor or Bust
The reference harness has a hard dependency on gVisor. Not a recommendation, a requirement. The scripts/setup_sandbox.sh installs gVisor, builds agent images, and verifies isolation. Without it, the pipeline refuses to execute.
This isn’t paranoia. After a deep dive into community testing and release readiness for open-source AI security tools, it’s clear that sandboxing failures have real consequences. One team reported an agent answering GitHub issues mid-scan. Another discovered their model had network access it wasn’t supposed to have and was fetching data from GitHub anyway.
The isolation model has two purposes:
1. Protecting your systems from agents that overshoot their targets
2. Proving exploitability by running PoCs in a faithful reproduction of production
The second purpose is the more interesting architectural constraint. Teams that built sandboxes where agents could compile code, run tests, and detonate PoCs saw non-exploitable findings drop to near zero. One offensive security team’s assessment: “The biggest efficacy lever has been giving the model test beds, live systems, and running the PoCs.”
The Customization Problem: Porting Beyond C/C++
The reference pipeline targets C/C++ memory vulnerabilities using ASAN. Porting to a new language or vulnerability class requires answering three questions:
| Question | C/C++ Reference | Your Target (Examples) |
|---|---|---|
| What signals a finding? | ASAN crash signature | Exception / canary file / DNS callback |
| What does a PoC look like? | Crashing input file | HTTP request sequence / tx list / test harness |
| How is the target built and run? | Dockerfile (clang + ASAN) | Your language’s build in a container |
The /customize skill walks through these questions interactively, modifying the harness for your target stack. It produces a targets/<your-service>/ directory validated with a single smoke run.
This is where the architecture reveals its assumptions. The pipeline is designed around a specific detection signal (ASAN crashes) and a specific PoC format (malformed input files). Porting to, say, Python with exception-based detection requires rethinking the entire verification stage.
The Bottleneck Shift: Discovery Is Easy, Patching Is Hard
The most surprising finding from teams using this architecture: discovery is now straightforward to parallelize, and the bottleneck has shifted to verification, triage, and patching.
Consider the numbers: during a dedicated scan of open-source software, 1,596 vulnerabilities were disclosed by May 22, 2026. To our knowledge, only 97 of these had been patched.
That’s a 6% patch rate. The model finds bugs faster than humans can fix them.
The reference harness doesn’t fully solve this. The README is candid: “Autonomous triage and patching are still open issues.” The verification strategies in /patch help raise the bar, but severity and prioritization remain judgments about your environment that models struggle to make without context.
This maps directly to the supply chain risks in open-source dependency management problem. When you can find 1,596 vulnerabilities but only patch 97, the bottleneck isn’t detection, it’s the organizational capacity to triage and remediate.
The Threat Model Gap: Context Is Everything
Teams consistently report one insight: the model performs best on systems with well-documented threat models. When the threat model is explicitly defined, findings “were exploitable 90 percent of the time.”
Without it, false positive rates climb to 40% or higher. The findings are reproducible, and the PoCs demonstrate exploitability. But the development team dismisses them as false positives because the bugs don’t fit the project’s actual threat model.
The prevailing sentiment among security teams is that scanning without a threat model is worse than not scanning at all, it generates massive volumes of findings that overwhelm engineers and erode trust in the reports.
The solution in the reference harness is a two-step threat modeling process:
1. Bootstrap from existing code, documentation, and vulnerability history
2. Interview someone who knows the system intimately
The second step is optional but transformative. Running the bootstrap step first ensures the interviewee starts from a draft rather than from scratch. One team reviewing hundreds of past CVEs and security-fix commits distilled them into “bug-shape” hints, then asked the model two questions: Was the fix complete? Was it applied everywhere else? They found three exploitable issues in a single hour.
The Multi-Turn Architecture: A Parallel Approach
While Anthropic’s harness focuses on code vulnerability discovery, Praetorian’s Augustus framework demonstrates a parallel architecture for testing LLMs themselves against adversarial attacks.
Augustus uses a pipeline architecture with four stages:
flowchart LR
A[Probe Selection] --> B[Buff Transform]
B --> C[Generator / LLM Call]
C --> D[Detector Analysis]
D --> E{Vulnerable?}
E -->|Yes| F[Record Finding]
E -->|No| G[Record Pass]
subgraph Scanner
B
C
D
E
end
The architecture supports 210+ vulnerability probes across 47 attack categories, 28 LLM provider integrations with 43 generator variants, and 90+ detectors ranging from pattern matching to LLM-as-a-judge evaluation.
What’s architecturally interesting: the multi-turn attack engine maintains persistent conversation history, implements refusal detection, and uses a strategy-agnostic design for Crescendo, GOAT, Hydra, and Mischievous User attacks. Each strategy uses three LLMs, attacker, target, and judge, with the judge scoring progress and detecting refusals.
The challenges of single-maintainer stewardship for critical security tools become apparent here: Augustus has 28 forks and 226 stars compared to the Anthropic harness’s 83 forks and 1,100 stars. The community momentum is real, but so is the maintenance burden.
The State Management Problem
Both frameworks grapple with the same fundamental challenge: state management across distributed agents.
The Anthropic harness solves this through file-based coordination. Results from the pipeline loop land in a results/<target>/<timestamp>/ directory. The --stream flag makes the first report appear in minutes under reports/bug_NN/.
Augustus uses Go’s goroutine pools with errgroup for concurrent scanning, token bucket rate limiting, and exponential backoff with jitter. The YAML configuration system supports environment variable interpolation and named profiles.
Neither approach is perfect. The file-based approach means agents must poll or be notified of new results. The goroutine approach can’t easily share state across distributed workers.
The Future: What’s Missing
Both frameworks are reference implementations, not products. But the gaps are revealing.
First, reproducibility across different models remains unsolved. The reference harness pins every subagent model, but results still vary across runs. One team reported that their patches were “highly inconsistent, some good, some bad, until the harness was updated to tell the model to validate patches by re-running the proof of concept.”
Second, context window management for multi-turn attacks requires model-aware conversation trimming. Augustus includes a lookup table for 30+ models across OpenAI, Anthropic, Google, Meta, Mistral, Cohere, and DeepSeek with a 7,500-token default fallback. This is the kind of gnarly system design detail that doesn’t fit in a README.
Third, the feedback loop from patched findings back into the threat model is manual. The reference harness updates the target’s known_bugs list, but it doesn’t automatically recalibrate detection thresholds or reprioritize focus areas based on what was found.
What This Means for Your Architecture
If you’re building an AI-powered security pipeline, here are the concrete lessons from these reference implementations:
- Separate discovery from verification. Using the same agent for both leads to self-censorship and missed findings.
- Build the sandbox first. The isolation requirements aren’t negotiable, and proving exploitability requires a faithful reproduction of production.
- Invest in the threat model. Scanning without context produces findings that teams can’t triage, which creates alert fatigue and erodes trust.
- Plan for the patching bottleneck. The model will find more bugs than you can fix. Budget engineering time for triage and validation.
- Partition the search space. Parallel agents need distinct focus areas to avoid converging on the same shallow bugs.
The reference harness is honest about its limitations. The teams that succeed aren’t the ones with the most sophisticated AI, they’re the ones with the most disciplined architecture.




