Speaker:: Scott Behrens & Justice Cassel
Title:: Source to Sink: Improving LLM Vuln Discovery
Duration:: 26 min
Video:: https://www.youtube.com/watch?v=bxwEZMhqeR0
## Key Thesis
LLM-based vulnerability discovery is real and productive, but naive approaches (monolithic prompts, simple agents) leave significant true positives on the table — the architectural choices around context management, agent specialization, source-to-sink tracing, and dedicated false-positive filtering matter more than raw model power, especially on large codebases.
## Synopsis
Behrens and Cassel (both from an offensive security background) walk through roughly 13 months of iteration on LLM-assisted first-party vulnerability discovery, from initial naive prompting (February 2025) through IDE assistants, MCP-based context injection, and finally an in-house orchestration framework. The talk is structured around concrete experimental results from a purpose-built evaluation benchmark of 41 known true positives across a test codebase, with a larger 191-TP benchmark for XL codebase testing.
**Experiment 1 — Monolithic vs. Specialized Agents:** A monolithic "super agent" (a single rule file packed with security context) found 36 of 41 TPs at lower token cost; a multi-agent configuration of specialized per-class agents found more TPs at higher cost. Conclusion: monolithic prompting is cost-effective, but multi-agent specialization gives better coverage.
**Experiment 2 — Dedicated False Positive Filtering:** An orchestrated scan without a dedicated FP agent produced 26% correct severity assignments; with a dedicated FP agent, 74%. Strong recommendation: always run a dedicated post-analysis FP/triage agent — the token cost is worth it for accurate severity.
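The dedicated FP pass described here can be sketched as a second stage that re-examines every candidate finding. This is a minimal sketch, not the speakers' actual implementation: `Finding`, `triage`, and the stub verifier are all hypothetical names, and the stub stands in for what the talk describes as a separate LLM triage agent.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Finding:
    vuln_class: str
    file: str
    severity: str  # severity assigned by the discovery agent

def triage(findings: list[Finding],
           verify: Callable[[Finding], Optional[str]]) -> list[Finding]:
    """Dedicated FP/triage pass: each candidate finding from the discovery
    agents is re-examined by a separate verifier, which returns a corrected
    severity, or None to drop the finding as a false positive."""
    kept = []
    for f in findings:
        corrected = verify(f)
        if corrected is None:
            continue  # verifier judged it a false positive: drop it
        kept.append(Finding(f.vuln_class, f.file, corrected))
    return kept

# Stub verifier standing in for the FP agent: drops findings located in
# test fixtures and otherwise confirms the assigned severity.
def stub_verify(f: Finding) -> Optional[str]:
    if "/tests/" in f.file:
        return None
    return f.severity

candidates = [
    Finding("sql_injection", "app/db.py", "high"),
    Finding("sql_injection", "app/tests/fixtures.py", "high"),
]
kept = triage(candidates, stub_verify)
```

The key design point from the experiment is the separation itself: severity judgment happens in a dedicated pass with its own context, not inline with discovery.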
**Experiment 3 — Grouped Vulnerability Categories vs. Individual Agents:** Grouping agents by category (e.g., "injection vulnerabilities") to save tokens caused a 17% miss rate vs. individual per-class agents (34 vs 41 TPs found). Hypothesis: the models may be overfitting on the category name strings. Recommendation: don't group by category; run individual per-class agents.
**Experiment 4 — Enumeration, Discovery, and Tracing:** Adding thoughtful context management (file tree filtering, architecture discovery, source-to-sink tracing) produced a ~26% cost reduction while improving depth. The insight: loading only relevant agents for the detected architecture (don't run an SSTI agent if there's no templating engine) saves significant tokens.
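The discovery-gating idea from this experiment reduces to a prerequisite table: each specialized agent declares the architecture features it needs, and only agents whose prerequisites were actually detected get loaded. A minimal sketch, with a hypothetical gate table (agent names and feature labels are illustrative, not the speakers' framework):

```python
# Hypothetical gate table: each specialized agent is loaded only if the
# discovery phase detected the architecture feature(s) it depends on.
AGENT_REQUIREMENTS = {
    "sql_injection": {"database"},
    "ssti": {"template_engine"},
    "xss": {"html_output"},
    "command_injection": set(),  # no prerequisite: always relevant
}

def gate_agents(detected_features: set[str]) -> list[str]:
    """Return only the agents whose prerequisites appear in the detected
    architecture, so no tokens are spent running (e.g.) an SSTI agent on
    a codebase with no templating engine."""
    return sorted(
        agent for agent, needs in AGENT_REQUIREMENTS.items()
        if needs <= detected_features
    )

# Architecture discovery found a database and HTML output, but no
# templating engine, so the SSTI agent is never loaded.
loaded = gate_agents({"database", "html_output"})
```

The subset check (`needs <= detected_features`) also handles agents with multiple prerequisites without extra logic.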
**Experiment 5 — Forced File Batching:** Having agents deterministically review every line of code finds the most bugs but is expensive ($400-500/scan on large repos). Recommended only for the most critical codebases, or when a lighter scan surfaces code smells worth a deep dive.
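The "forced" part of forced file batching is the deterministic guarantee that every file lands in exactly one batch, rather than letting the agent choose what to read. A minimal sketch of one way to partition files under a per-call token budget (the function and budget scheme are assumptions, not the talk's implementation):

```python
def batch_files(files: list[tuple[str, int]], budget: int) -> list[list[str]]:
    """Deterministically partition (path, token_count) pairs into batches
    that stay under a per-call token budget, guaranteeing every file is
    reviewed exactly once; a file larger than the budget gets its own batch."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for path, tokens in files:
        if current and used + tokens > budget:
            batches.append(current)  # budget exceeded: flush current batch
            current, used = [], 0
        current.append(path)
        used += tokens
    if current:
        batches.append(current)
    return batches
```

Because batching is a pure function of the file list, re-running a scan reviews the same code in the same groupings, which makes results comparable across runs.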
**Opus 4.6 surprise:** On small codebases (1-2K lines), Opus 4.6 with crafty prompting gets close to 41/41 TPs (~38 average). On large codebases (30K+ lines, 125+ files): Opus solo found 40/191 TPs, while a super-agent with a less powerful model found 80, and the full orchestrated workflow found 116. Context management quality beats raw model capability at scale.
The final workflow architecture: (1) file tree enumeration, filtering out tests and build artifacts; (2) a parallel discovery phase in which static analysis, architecture discovery, and data flow tracing run simultaneously; (3) discovery gates that load only agents matching the detected architecture (no SQL injection agent if there's no database); (4) file batching prioritized by source/sink relationships and cross-import dependencies; (5) schematization with programmatic tool calling to enforce output format and reject hallucinated vulnerability classes; and (6) a dedicated FP/triage phase.
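The schematization step in the workflow above can be enforced with a plain allowlist gate on model output: any finding whose vulnerability class or severity falls outside a fixed vocabulary is rejected outright instead of flowing downstream. A minimal sketch under assumed vocabularies (the class and severity sets here are illustrative):

```python
# Assumed fixed vocabularies; a real deployment would use its own taxonomy.
ALLOWED_CLASSES = {"sql_injection", "xss", "ssti", "command_injection", "toctou"}
ALLOWED_SEVERITIES = {"critical", "high", "medium", "low"}

def validate_finding(raw: dict) -> dict:
    """Programmatic schema gate on model output: a finding with a
    hallucinated vuln class or out-of-vocabulary severity raises, which
    the orchestrator can treat as a signal to make the model re-emit."""
    if raw.get("vuln_class") not in ALLOWED_CLASSES:
        raise ValueError(f"hallucinated vuln class: {raw.get('vuln_class')!r}")
    if raw.get("severity") not in ALLOWED_SEVERITIES:
        raise ValueError(f"invalid severity: {raw.get('severity')!r}")
    # Re-emit only the schema fields, dropping anything extra the model added.
    return {
        "vuln_class": raw["vuln_class"],
        "severity": raw["severity"],
        "file": str(raw.get("file", "")),
    }
```

In practice this validation sits behind the tool-calling interface, so the model's structured output is checked before any human or downstream agent sees it.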
Real-world disclosures from the research include: a TOCTOU/double-spend vulnerability in a trading platform (free money exploit) and a command injection in a VoIP stack amounting to RCE over fax.
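The TOCTOU double-spend pattern behind the trading-platform disclosure can be illustrated abstractly. This is a generic sketch of the vulnerability class, not the disclosed code: between checking a balance (time of check) and deducting it (time of use), a concurrent request can pass the same check, spending the same funds twice.

```python
import threading

class Account:
    def __init__(self, balance: int):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw_toctou(self, amount: int) -> bool:
        # VULNERABLE check-then-act: a concurrent caller can observe the
        # same pre-deduction balance and both withdrawals succeed,
        # yielding a double-spend ("free money") on the same funds.
        if self.balance >= amount:
            self.balance -= amount
            return True
        return False

    def withdraw_atomic(self, amount: int) -> bool:
        # Fix: perform the check and the deduction as one atomic unit
        # (here a lock; in a real system, a DB transaction or a
        # conditional UPDATE), so the check can never go stale.
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False
```

As the speakers note in the takeaways, classes like this are exactly where models need explicit prompt context: the check-then-act window is invisible unless the reviewer is told to look for it.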
## Key Takeaways
- Context management architecture beats model power for large codebases — Opus 4.6 solo (40 TPs) was far surpassed by orchestrated workflow with lesser model (116 TPs) on 191-TP XL benchmark
- Dedicated false-positive filtering agents are critical: 26% → 74% correct severity assignments
- Don't group agents by vulnerability category — individual per-class agents find more TPs
- Architecture-aware discovery gates (only run relevant agents) reduce cost ~26% while improving depth
- For esoteric vuln classes (e.g., TOCTOU), the model won't find them without explicit context in the prompt; model coverage tracks how much training data exists per vuln class
- Schematization + programmatic enforcement prevents hallucinated severity/vuln class outputs
- Incremental diff scanning and saving trace data enables efficient PR-level scanning without full re-scan
- Chain models by cost/capability (Haiku → Sonnet → Opus as signal escalates) for better cost/result tradeoff
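The cost-escalation takeaway above can be sketched as a simple ladder: run the cheapest model first, and only pay for the next tier while the signal stays strong. The tier names mirror the talk's Haiku → Sonnet → Opus example, but the ladder structure and the confidence-score interface are assumptions; the lambdas stand in for real model calls.

```python
from typing import Callable

def escalating_scan(target: str,
                    ladder: list[tuple[str, Callable[[str], float]]],
                    threshold: float = 0.5) -> list[str]:
    """Run cost-ordered model tiers against a target, escalating to the
    next (stronger, pricier) tier only while the current tier reports a
    confidence signal at or above threshold. Returns the tiers invoked."""
    invoked = []
    for name, model in ladder:
        invoked.append(name)
        if model(target) < threshold:
            break  # weak signal: stop paying for stronger models
    return invoked

# Stub ladder: cheap tiers flag the file as suspicious, so the scan
# escalates all the way to the most capable model.
ladder = [
    ("haiku", lambda t: 0.9),
    ("sonnet", lambda t: 0.8),
    ("opus", lambda t: 0.2),
]
tiers = escalating_scan("app/payments.py", ladder)
```

The tradeoff is that the cheap tier acts as a filter: most files never reach the expensive model, which is what makes per-scan costs tractable on large repos.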
## Notable Quotes / Data Points
- Benchmark: 41 known TPs on small codebase; 191 known TPs on XL codebase (30K+ lines, 125+ files)
- Opus 4.6 solo on XL: 40/191 TPs; super-agent (monolithic, lesser model): 80/191; orchestrated workflow: 116/191
- Opus 4.6 on small codebase: ~38/41 TPs average
- Dedicated FP agent: correct severity 26% → 74%
- Source-to-sink tracing: ~26% cost reduction
- Disclosures: TOCTOU double-spend in trading platform; RCE via command injection in VoIP stack
- Per-scan cost at full forced-file-batching: $400-500/repo — only viable for most critical codebases
- "The models are trained based on a volume of data — for SQL injection everybody knows it, but for TOCTOU there's way less articles"
- Opus 4.6 dropped mid-talk-prep, which prompted real questions about whether months of workflow work was still worthwhile
#unprompted #claude