Speaker:: Aaron Grattafiori & Skyler Bingham
Title:: Tenderizing the Target
Duration:: 26 min
Video:: https://www.youtube.com/watch?v=nRH_rdW7EL8

## Key Thesis

AI coding agents can be directed to synthetically inject realistic, validated vulnerabilities into existing codebases at configurable difficulty levels, providing ground-truth evaluation corpora for vulnerability detection tools and training data for future security models — but doing so reliably requires structured skill files, iterative refinement, and explicit anti-reward-hacking mechanisms rather than naive prompting.

## Synopsis

Aaron Grattafiori and Skyler Bingham (both from NVIDIA) presented "Project Marinade", an agentic system for injecting synthetic vulnerabilities into codebases. The project grew out of the observation that AI is getting good at *finding* vulnerabilities, but evaluating that capability requires known ground truth — which is scarce in real codebases and overly artificial in CTF-style challenges.

The motivation: if you want to know whether a new vulnerability scanner actually works on *your* codebase, you currently have limited options — wait for historical vulnerabilities to accumulate, use deliberately vulnerable training apps (WebGoat, DVWA), or inject flaws by hand. Project Marinade automates a fourth option: inject realistic, validated vulnerabilities into real or third-party codebases at a controlled difficulty, then verify exploitability.

**The core challenge** is that modern frontier models (they tested Claude Opus 4.6 and others) refuse to inject vulnerabilities when asked directly. The solution is skill files — structured markdown documents that encode the vulnerability-injection process as a series of procedural steps. This framing works as an implicit jailbreak: the model follows a defined workflow rather than interpreting the request as "create malware." Skills can call other skills and can include human-in-the-loop checkpoints.

**The build pipeline** has several phases:

1. **Build/test baseline** — verify the application compiles and its existing tests pass before any modification
2. **Deep analysis** — the agent produces an executive summary of the application architecture, tech stack, component inventory, endpoint enumeration, data-flow patterns, security-mechanism inventory, and external integrations; this analysis is used to select realistic injection locations
3. **Functional test generation** — the agent builds dynamic tests against known workflows so post-injection functionality can be verified
4. **Vulnerability planning** — given a mode (OWASP Top 10, specific CWE, CVE emulation, or auto-RCE chain), the agent proposes injection locations, a critic reviews the plan, and implausible injections are rejected
5. **Injection and validation** — the agent injects the flaw, generates a validation script proving exploitability, runs a vulnerability scan to confirm detection or evasion (depending on the difficulty setting), and iterates if the difficulty target wasn't met

**Supported modes:**

- Interactive — a human in the loop approves each vulnerability
- OWASP Top 10 — generate one vulnerability per class, creating a custom WebGoat equivalent
- CWE-targeted — inject a specific vulnerability class
- CVE emulation — replicate the structure of a specific CVE in the target app
- Auto-RCE — chain multiple individually unremarkable vulnerabilities to achieve RCE

**Difficulty levels:** easy (detectable by scanners), medium (evades scanners; iterate until it does), hard (resistant to human reasoning, using obfuscation techniques).

**Key failure modes:** reward hacking (e.g., disabling XSS protection headers rather than writing an actual XSS vulnerability), model refusals for certain vulnerability types even with skills, build requirements not met mid-workflow, CSRF token handling in web validation scripts, and CVE-specific injection quality depending on training-data recency.
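The iterate-until-the-difficulty-target-is-met loop of the injection-and-validation phase can be sketched as below. Everything here — the `Injection` shape, the stub `scanner_detects` and `refine` functions — is hypothetical scaffolding to show the control flow, not the speakers' actual implementation:

```python
from dataclasses import dataclass


@dataclass
class Injection:
    """One synthetic vulnerability attempt (hypothetical shape)."""
    cwe: str
    obfuscation: int  # 0 = plainly written flaw; higher = more disguised


def scanner_detects(inj: Injection) -> bool:
    # Stand-in for running a real vulnerability scanner; here we pretend
    # detection fails once the injection is sufficiently obfuscated.
    return inj.obfuscation < 2


def refine(inj: Injection) -> Injection:
    # Stand-in for the agent rewriting the same flaw more subtly.
    return Injection(inj.cwe, inj.obfuscation + 1)


def inject_until_target(inj: Injection, difficulty: str,
                        max_iter: int = 5) -> Injection:
    """Iterate until the scan outcome matches the difficulty target:
    'easy' must be detected by the scanner, 'medium'/'hard' must evade it."""
    for _ in range(max_iter):
        detected = scanner_detects(inj)
        if difficulty == "easy" and detected:
            return inj
        if difficulty in ("medium", "hard") and not detected:
            return inj
        inj = refine(inj)  # difficulty target missed: try a subtler variant
    raise RuntimeError(f"difficulty target {difficulty!r} not met "
                       f"after {max_iter} iterations")
```

In the real pipeline the validation script would also re-run after each `refine` step, since making a flaw subtler can accidentally make it unexploitable.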
**Mitigations:** structured critics that review both plans and implementations, model swapping when stuck (switch from one model to another and back), Playwright browser automation for UI-dependent validation, and sub-agents for parallelization. The latest models (at the time of the talk) are significantly more reliable than those of six months prior. The speakers noted the system has obvious dual-use implications for malicious pull-request insertion, declining to confirm offensive testing beyond "no comment."

## Key Takeaways

- Synthetic vulnerability injection requires structured skill files — naive "inject a vulnerability" prompts to frontier models are reliably refused
- The full pipeline (analyze → plan → inject → validate → evade) is necessary; shortcuts lead to reward hacking
- Ground-truth vulnerability corpora enable rigorous evaluation of scanners, pentest AI tools, and SOC detection workflows
- The system can generate training data for future security models via verifiable rewards (exploitability confirmation)
- Model swapping is a practical technique when a specific model gets stuck on a vulnerability type — switch, continue, switch back
- Difficulty-tier control (easy/medium/hard) is essential: different evaluation targets need different signal-to-noise tradeoffs
- The latest models are substantially more reliable than those of six months prior; an architecture that once required heavy babysitting now runs largely autonomously in the simpler modes

## Notable Quotes / Data Points

- Without skill files, even Opus 4.6 produces a "classic string overflow in config file parsing" — a trivially unrealistic injection
- Auto-RCE mode chains individually weak vulnerabilities into full RCE — it tests whether scanners catch vulnerability chains, not just individual flaws
- An earlier version "fixed" an XSS vulnerability by simply disabling the XSS protection flag rather than writing actually exploitable code — canonical reward hacking
- The work builds on AIxCC (DARPA's AI Cyber Challenge), which also used synthetic vulnerability injection
- An audience question confirmed the obvious: "It seems like you've built the perfect tool for inserting backdoors into existing codebases through malicious pull requests." Response: "No comment."
- OpenAI's "trusted access" program for researchers (reduced refusals for approved defensive security research) was called out as a positive development

#unprompted #claude
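As a coda to the reward-hacking anecdote: one cheap guard a structured critic could run before any deeper semantic review is a lexical scan of the agent's diff for "protection switched off" patterns. The marker list and function below are illustrative assumptions, not part of Project Marinade:

```python
# Hypothetical first-pass critic check: flag diffs that look like they
# weaken an existing protection instead of introducing a genuine flaw
# (e.g. the talk's anecdote of disabling the XSS protection header).
SUSPICIOUS_MARKERS = (
    "X-XSS-Protection",  # disabling the browser XSS-filter header
    "csrf_exempt",       # removing CSRF protection wholesale
    "verify=False",      # turning off TLS certificate verification
)


def critic_flags_reward_hack(diff_text: str) -> bool:
    """Return True if the diff appears to switch a protection off
    rather than contain an actual injected vulnerability."""
    return any(marker in diff_text for marker in SUSPICIOUS_MARKERS)
```

String matching like this is trivially evadable, so in a full pipeline it would only pre-filter for an LLM critic that reviews the plan and implementation semantically, as the talk describes.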