Adkins & Flynn - Evaluating Threats at Google

Speaker:: Heather Adkins & Four Flynn
Title:: Evaluating Threats & Automating Defense at Google
Duration:: 20 min
Video:: https://www.youtube.com/watch?v=B_7RpP90rUk

## Key Thesis
Google's mission is to eliminate every software vulnerability on Earth, and the speakers believe agentic AI systems now make that technically achievable. Big Sleep finds deep memory-safety bugs autonomously with zero false positives; Code Mender automatically patches them. Together they form an end-to-end, human-free vulnerability discovery and remediation pipeline.

## Synopsis
Adkins (Google cyber leadership) and Flynn (CISO-equivalent at DeepMind) open by framing 2026 as "the year everything pivoted" and predict it will soon look like a simpler time. Adkins notes that the NVD has a backlog of 30,000 unanalyzed vulnerabilities and that CVEs grew 35% between 2024 and 2025. She estimates that roughly $1 billion in VC funding is flowing into vulnerability discovery and pen testing startups, and points out that open-source agentic frameworks for finding web app vulnerabilities already exist. Her conclusion: within a few years, every vulnerability in every system will be findable by agentic tools, making CVSS scoring obsolete because the number of known vulnerabilities will be overwhelming.

**Big Sleep** is Google/DeepMind's agentic vulnerability research system. It replicates the methodology of Project Zero researchers: deep codebase familiarity, variant analysis based on past bugs, and iterative hypothesis-testing. The system runs in a continuous agentic loop with three tools: a debugger, a code browser, and a Python interpreter. It builds test cases, tests hypotheses, and loops on failures. When it produces a crash, it enters a verification phase that generates an actual proof-of-vulnerability exploit; this step is key to achieving zero false positives. The final output is a Gemini-written report with a step-by-step explanation, proof-of-concept code, and enough detail for a non-expert developer to understand the bug.
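
The verification gate described above can be illustrated with a toy sketch. This is not Big Sleep's implementation: the function names, the `reruns` parameter, and the buggy `toy_parser` are all hypothetical, a fixed list of inputs stands in for LLM-generated hypotheses, and a Python exception stands in for a memory-safety crash. It only shows the idea of reporting a finding after it has been reproduced, never on the first observed crash.

```python
from typing import Callable, Iterable, Optional


def crashes(target: Callable[[bytes], None], data: bytes) -> bool:
    """Run the target once; any exception counts as a crash (toy stand-in)."""
    try:
        target(data)
    except Exception:
        return True
    return False


def find_verified_crash(
    target: Callable[[bytes], None],
    hypotheses: Iterable[bytes],
    reruns: int = 3,  # hypothetical knob, not from the talk
) -> Optional[bytes]:
    """Try each candidate input; report only crashes that reproduce."""
    for data in hypotheses:
        if not crashes(target, data):
            continue  # hypothesis failed, loop to the next one
        # Verification phase: the crash must reproduce on every rerun.
        if all(crashes(target, data) for _ in range(reruns)):
            return data
    return None


def toy_parser(data: bytes) -> None:
    """Hypothetical buggy target: trusts a length byte it never validates."""
    length, payload = data[0], data[1:]
    for i in range(length):
        _ = payload[i]  # IndexError when length overstates the payload
```

Calling `find_verified_crash(toy_parser, [b"\x02ab", b"\x05ab"])` skips the well-formed input and returns the one whose length byte overstates the payload. A crash that appears once and then disappears is dropped, which is the property the talk credits for the zero-false-positive rate.
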
All bugs Big Sleep finds are posted publicly; as of the talk, all but five are fixed. It is also finding bugs that fuzzers (including OSS-Fuzz) miss, which indicates it is reaching genuinely deep code paths.

**Code Mender** takes over after discovery. It was started in recognition of the coming "vuln apocalypse." The patching system has three goals: (1) fix the security vulnerability, (2) preserve existing functionality, and (3) honor the original developer's coding idioms. It uses a pluggable verifier system that includes fuzzing before and after the patch to validate functional equivalence, formal verification, differential testing with malicious inputs, and an LLM judge with a carefully crafted pre-prompt. Only patches that pass all verifiers are submitted. Feedback from failed patches goes back into the LLM's context for the next iteration. Code Mender has produced 178 autonomous fixes in open source so far, including work in libwebp and Chrome pointer hardening.

The green-field vibe coding problem (securing newly AI-generated code) is also being worked on but was not covered in depth. The speakers acknowledged that the hardest unsolved problem is getting the world to actually apply patches at the speed AI produces them.
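
The "pass every verifier or iterate" behavior described for Code Mender can be sketched as a small loop. This is a minimal illustration under stated assumptions, not Code Mender's code: `Verifier`, `run_verifiers`, `patch_loop`, `max_attempts`, and the toy functions are hypothetical names, patches are plain strings, and the two toy checks stand in for the real fuzzing, formal verification, differential testing, and LLM-judge verifiers.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: a verifier returns (passed, feedback).
Verifier = Callable[[str], Tuple[bool, str]]


def run_verifiers(patch: str, verifiers: List[Verifier]) -> List[str]:
    """Collect feedback from every verifier that rejects the patch."""
    failures = []
    for verify in verifiers:
        passed, feedback = verify(patch)
        if not passed:
            failures.append(feedback)
    return failures


def patch_loop(
    propose: Callable[[List[str]], str],
    verifiers: List[Verifier],
    max_attempts: int = 3,  # hypothetical limit, not from the talk
) -> Optional[str]:
    """Return the first patch that passes all verifiers, else None."""
    context: List[str] = []
    for _ in range(max_attempts):
        patch = propose(context)
        failures = run_verifiers(patch, verifiers)
        if not failures:
            return patch  # only fully verified patches are submitted
        context.extend(failures)  # feed failures back for the next try
    return None


def toy_propose(context: List[str]) -> str:
    # Stand-in for the LLM: a naive first attempt, better with feedback.
    return "add bounds check" if context else "delete the function"


def fixes_bug(patch: str) -> Tuple[bool, str]:
    return ("bounds check" in patch, "vulnerability still reachable")


def keeps_behavior(patch: str) -> Tuple[bool, str]:
    return ("delete" not in patch, "patch removes existing functionality")
```

With these toy pieces, `patch_loop(toy_propose, [fixes_bug, keeps_behavior])` rejects the first proposal on both checks, passes the feedback back, and accepts the second. Keeping verifiers as plain callables is one way to get the pluggability the talk mentions: adding a check means appending to the list.
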

## Key Takeaways
- "Vuln apocalypse" is coming: every system will be fully pen-testable by automated tools within years, making current CVSS ranking meaningless
- Big Sleep achieves zero false positives by requiring a working exploit as proof before reporting any vulnerability
- Code Mender's multi-stage verifier (fuzzing, formal verification, differential testing, LLM judge) is the "secret sauce" that enables high-quality autonomous patches
- 178 open-source autonomous fixes produced so far; Google is prioritizing quality over speed of release
- The hardest open problem remains patch adoption: AI can find and fix bugs faster than humans can deploy fixes

## Notable Quotes / Data Points
- NVD backlog: 30,000 unanalyzed vulnerabilities
- 35% increase in CVEs logged between 2024 and 2025
- ~$1 billion VC flowing into vuln discovery / pen testing startups
- 178 autonomous fixes in open source from Code Mender
- RAND 2017 study: expert researchers take ~1 month to find a deep embedded bug and ~22 days to successfully exploit it
- Big Sleep is finding bugs OSS-Fuzz misses
- All but 5 Big Sleep findings are now fixed (as of talk date)
- Google's stated goal: "eliminate every software vulnerability on Earth"

#unprompted #claude