Speaker:: Jenny Guanni Qu
Title:: Why Most ML Vulnerability Detection Fails
Duration:: 13 min
Video:: https://www.youtube.com/watch?v=93jhfuL-ndo
## Key Thesis
Most ML vulnerability detection approaches fail because they benchmark against weak baselines, train on poorly curated datasets, and use architectures that cannot see enough code at once. The Linux kernel's 125,000-labeled-commit dataset shows that simple three-feature baselines, long context windows, and careful dataset curation matter more than fancy architectures.
## Synopsis
Qu (AI security researcher backed by Pebble Ventures, former Caltech math/ML background, DEF CON CTF third place) chose Linux kernel vulnerability detection as her research domain because of the unique infrastructure problem: 1,400 emails/day on the mailing list, volunteer maintainers drowning in patches, no centralized tracking, and no CI coordination (more CI ≠ better CI). The kernel has a 125,000-commit labeled vulnerability dataset via the "Fixes:" tag convention — a commit introducing a bug will have a later fixing commit referencing it with `Fixes: [commit hash]`, providing clean ground truth labels.
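The `Fixes:` labeling convention can be sketched as a small parser: a commit whose message carries a `Fixes:` line labels the referenced earlier commit as bug-introducing. This is a minimal illustrative sketch, not the talk's pipeline; the function name and sample message are invented.

```python
import re

# "Fixes:" lines take the form: Fixes: <abbreviated hex hash> ("subject")
FIXES_RE = re.compile(r"^Fixes:\s+([0-9a-f]{8,40})", re.IGNORECASE | re.MULTILINE)

def introduced_bug_hashes(commit_message: str) -> list[str]:
    """Return hashes of commits this message claims to fix.

    Each returned hash labels an *earlier* commit as bug-introducing,
    which is how the labeled dataset's ground truth is derived."""
    return FIXES_RE.findall(commit_message)

msg = """mm: fix use-after-free in do_something()

Fixes: 1a2b3c4d5e6f ("mm: add do_something fast path")
Signed-off-by: A Developer <dev@example.com>
"""
print(introduced_bug_hashes(msg))  # ['1a2b3c4d5e6f']
```

Running this extraction over the full kernel history is what yields the 125,000 labeled commits.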
Her first key finding about the domain: 13% of bugs hide for over 5 years. A 19-year-old bug (introduced August 2006, fixed August 2025) was itself a fixing commit. Race condition bugs have the longest lifetimes because they require precise timing to trigger. The appearance that recent-year bugs get fixed faster is a data artifact — right censoring means there's simply been less time for long-lived bugs to fully manifest.
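The right-censoring artifact can be demonstrated with a toy simulation (invented data, not the talk's): every bug draws its lifetime from the same distribution, yet bugs introduced recently only appear in the fixed-bug data if they were fixed quickly, so recent years look like they fix bugs faster.

```python
import random

random.seed(0)
CUTOFF = 2025  # end of observation window

def observed_mean_lifetime(intro_year: int, n: int = 10_000) -> float:
    # All bugs share one lifetime distribution (exponential, mean 4 years),
    # but we only "see" bugs whose fix lands before the cutoff.
    lifetimes = [random.expovariate(1 / 4.0) for _ in range(n)]
    observed = [t for t in lifetimes if intro_year + t <= CUTOFF]
    return sum(observed) / len(observed)

for year in (2005, 2015, 2022, 2024):
    print(year, round(observed_mean_lifetime(year), 2))
```

The printed means shrink for recent introduction years even though nothing about the bugs changed, which is the censoring effect the talk describes.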
Before training any neural model, she built four simple baselines. B1: just three numbers per diff (lines added, lines removed, files changed), which achieved AUC 0.779, far better than expected. B4: commit subject line only. B6: subject + diff combined. B8: 118 hand-engineered diff features. The three-number baseline performs so well partly because of class imbalance: vulnerable commits are rare, so a scorer only slightly better than chance on the rare class still looks strong on ranking metrics. Critical lesson: if your neural model can't beat these, it's just a fancy tokenizer.
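A B1-style baseline needs no ML library at all: score each commit by its three size features and compute AUC as a rank statistic. Everything here is a toy sketch with invented commits and hand-set weights; the 0.779 figure comes from the real dataset, not from this code.

```python
def auc(scores_pos, scores_neg):
    """Probability a random positive outranks a random negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def size_score(added: int, removed: int, files: int) -> float:
    # Crude hand-set weighting; a real baseline would fit, e.g., logistic regression.
    return added + removed + 10 * files

# (lines added, lines removed, files changed) per commit; toy data in which
# vulnerability-introducing commits tend to be larger changes.
vulnerable = [(120, 40, 3), (300, 80, 6), (45, 10, 2)]
safe = [(5, 2, 1), (12, 0, 1), (80, 20, 2), (3, 1, 1)]

print(auc([size_score(*c) for c in vulnerable],
          [size_score(*c) for c in safe]))
```

The point of the exercise is the yardstick: any neural model must clear whatever AUC this trivial scorer reaches on the same split.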
Context length mattered dramatically: at 512 tokens, transformer models "read messages not code" — they can't see most of a diff. Increasing to 8,000 tokens produced significantly better AUC by enabling actual code understanding.
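Why 512 tokens "reads messages, not code" can be made concrete with back-of-envelope arithmetic: with the commit message consuming the front of the context, a short budget is exhausted before most of the diff. Whitespace splitting stands in for a real subword tokenizer here, and the sizes are invented.

```python
def visible_fraction_of_diff(message: str, diff: str, budget: int) -> float:
    """Fraction of the diff's tokens that fit after the message fills the context."""
    msg_tokens = message.split()
    diff_tokens = diff.split()
    remaining = max(0, budget - len(msg_tokens))
    return min(1.0, remaining / len(diff_tokens))

message = "commit subject " * 40        # ~80 tokens of message/metadata
diff = "- old_line + new_line " * 2000  # ~8,000 tokens of diff

for budget in (512, 8000):
    print(budget, round(visible_fraction_of_diff(message, diff, budget), 3))
```

At a 512-token budget the model sees only a few percent of this diff; at 8,000 tokens it sees nearly all of it, which is the mechanism behind the AUC jump.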
Vulnerability patterns have a shelf life — models trained on older data predict earlier commits well but decay on newer ones, implying retraining cadence must be tuned based on how fast the pattern landscape shifts.
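Pattern decay only shows up if evaluation is sliced by time: fit on an early window, then report a metric per later year instead of one pooled test score, so decay appears as a per-year downward trend. This split helper is an illustrative sketch with placeholder records, not the talk's evaluation code.

```python
from datetime import date

# (commit_date, record) placeholders standing in for real labeled commits.
commits = [
    (date(2018, 1, 3), "rec1"), (date(2019, 6, 1), "rec2"),
    (date(2021, 2, 9), "rec3"), (date(2021, 11, 20), "rec4"),
    (date(2023, 5, 5), "rec5"),
]

def time_split(commits, train_end):
    """Train on everything before train_end; bucket the rest by year for eval."""
    train = [c for c in commits if c[0] < train_end]
    evals: dict[int, list] = {}
    for c in commits:
        if c[0] >= train_end:
            evals.setdefault(c[0].year, []).append(c)
    return train, evals

train, evals = time_split(commits, date(2020, 1, 1))
print(len(train), sorted(evals))  # fit on `train`; score each year bucket separately
```

If per-year scores fall as the bucket moves further from the training window, the retraining cadence needs to be at least as fast as that decay.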
Curriculum learning turned out to be backwards from her initial intuition. She assumed large, complex diffs with many changes would be hard to classify. In reality, large diffs carry more signal and are easier. The genuinely hard samples are 1-5 line changes: a missing null check, a reference-count increment without an error path, a lock taken in one function and released in another. These are the bugs that hide longest.
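The reversed curriculum reduces to ordering training samples from large diffs (more signal, easier) down to tiny ones (hardest). A minimal sketch, with invented sample tuples of (diff line count, label):

```python
# Order training samples easiest-first for the curriculum: largest diffs
# lead, 1-5 line changes come last. Sample names are illustrative only.
samples = [(3, "missing-null-check"), (250, "big-refactor"),
           (1, "refcount-no-error-path"), (40, "medium-fix")]

curriculum = sorted(samples, key=lambda s: s[0], reverse=True)
print([name for _, name in curriculum])
```

The naive curriculum would sort ascending; the talk's finding is that descending by diff size is the easy-to-hard ordering.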
Most surprising finding: better data curation mattered more than better architectures. Specifically, including "easy negatives" (obviously safe commits) helped more than hard negatives. The model needs to learn "obviously safe" before it can detect "subtly dangerous." Hard negative mining alone was counterproductive.
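The curation finding translates into how the negative pool is sampled: deliberately include a large share of obviously safe commits (e.g., docs-only changes) alongside positives, rather than mining hard negatives alone. The ratios, pool sizes, and names below are invented for illustration.

```python
import random

random.seed(1)

positives = [f"vuln_{i}" for i in range(100)]
easy_negatives = [f"docs_{i}" for i in range(5000)]   # obviously safe commits
hard_negatives = [f"tricky_{i}" for i in range(500)]  # near-miss safe commits

def build_training_set(easy_ratio=4, hard_ratio=1):
    """Mix positives with mostly easy negatives plus a smaller hard-negative slice."""
    n = len(positives)
    batch = (positives
             + random.sample(easy_negatives, easy_ratio * n)
             + random.sample(hard_negatives, hard_ratio * n))
    random.shuffle(batch)
    return batch

train = build_training_set()
print(len(train))  # 100 positives + 400 easy + 100 hard = 600
```

Per the talk, dropping the easy-negative slice and keeping only the hard negatives made results worse, not better.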
## Key Takeaways
- 13% of Linux kernel bugs hide for more than 5 years; race conditions hide longest
- Three-number baseline (lines added/removed, files changed) achieves AUC 0.779 — your fancy model needs to beat this or it's worthless
- Context length unlocks code understanding — 512 tokens reads messages, 8K tokens reads code
- Vulnerability patterns decay over time — models need periodic retraining
- Easy negatives in training data are more important than hard negatives — models must learn "obviously safe" first
- Small diffs (1-5 lines) are harder to classify than large diffs, despite intuition — less signal
- Data quality > architecture sophistication for this domain
- The "Fixes:" tag convention in Linux kernel provides a rare, high-quality labeled dataset
## Notable Quotes / Data Points
- 125,000 labeled vulnerability commits in Linux kernel via "Fixes:" tag convention
- 1,400 emails/day on Linux kernel mailing list
- 13% of bugs hide for 5+ years; a 19-year-old bug (Aug 2006 → Aug 2025) was itself a "fixing" commit
- Three-number baseline AUC: 0.779
- 512 tokens: model reads messages. 8,000 tokens: model reads code — "context length unlocks code understanding"
- Blog post went to #1 on Hacker News
- Top bug fixer Dan Carpenter: 2,000+ bug fixes and also invented the "Fixes:" tag convention
- Hard sample examples: 1-5 line changes with missing null checks, refcount increment without error path
#unprompted #claude