Speaker:: Ilia Shumailov
Title:: AI Security with Guarantees
Duration:: 26 min
Video:: https://www.youtube.com/watch?v=NU6l0Qcf5rU
## Key Thesis
The current AI security paradigm — probabilistic defenses evaluated against weak attacks, creating a cat-and-mouse cycle that all eventually fall — is fundamentally broken. The escape is to separate data and instructions at the architectural level, producing systems with formal guarantees analogous to control flow integrity in software, and empirical results show this approach solves 70-80% of tasks without the model ever seeing untrusted data.
## Synopsis
Shumailov (academic researcher, formerly Google, now runs a company commercializing this work) opens by challenging the audience's intuition: most people think AI won't solve its own security problems, but jailbreaks are empirically harder than two years ago. The problem is the cycle — as models improve, they understand attacks better, so both attack and defense capability scale, and the net vulnerability doesn't necessarily improve.
The field is stuck in a vicious loop: you can't build good defenses without good attacks; you can't evaluate attacks without good defenses. Every defense published gets broken within weeks at sub-dollar budget using general adversarial search (genetic algorithm-style techniques). His team published a paper proving all major defenses break under stronger attacks, then published a better defense a month later, which will itself break. The cycle is the problem, not any specific defense.
The insight is to move from anomaly-detection-style probabilistic security to principled mechanisms with formal guarantees — the same way software security evolved from "hope they don't exploit this" to ASLR + CFI + sandboxing.
The CAMEL system (published ~1 year before the talk) is the first mechanism with actual guarantees against prompt injection. The architecture: (1) take the user query and rewrite it into a formal program in a controlled language — critically, the model performing this step never sees untrusted data; (2) freeze this formal program; (3) execute the frozen program, which only calls a second LLM to process untrusted data into boolean or structured outputs. The second LLM's output power is constrained — it can output that an eligibility criterion is met, but it cannot redirect the execution flow.
This achieves control flow integrity: an adversary injecting content into untrusted data cannot change what the program does — only potentially affect boolean values. The blast radius of any indirect prompt injection is restricted to flipping structured values, not hijacking control flow.
Benchmark results: CAMEL performs on par with standard LLMs on most agentic tasks, with performance losses only on specific data-dependent tasks. The planning capability scales with model quality — better models produce better plans, so security performance improves automatically as models improve, breaking the historical decoupling between capability and security.
The key conceptual framing is "task-data independence": if a task can be solved without seeing the untrusted data, never show it the data. The Rubik's cube analogy: a robot solving a Rubik's cube needs only the initial state — once given that, it's a completely independent task. For web agents (computer use), Shumailov shows that 70-80% of standard benchmark tasks can be solved without the agent ever seeing a webpage, because the model already knows the general algorithm (e.g., how to order something on Amazon) and just needs to execute it.
A remaining vulnerability class: data flow attacks within the frozen plan. Cookie-prompt ads on websites are an example — if the plan always checks for a cookie prompt first, buying ad space to display fake cookie prompts causes the agent to "click accept" before proceeding. This is a data flow attack, not a control flow attack: it doesn't redirect the execution path, it manipulates a conditional branch the plan was already going to evaluate. Shumailov's group is working on Selmate — a communication layer between agents and browser environments enabling explicit expectation-setting (the agent declares "clicking accept on this prompt should keep me on the same domain").
CAMEL + Selmate together reportedly address 99.999% of practical attack scenarios. The main barrier to adoption isn't technical — it's engineering difficulty of integrating this architecture into existing production systems (6 months, millions of lines of code for their internal deployment).
## Key Takeaways
- All current AI security defenses are probabilistic and will fall — the cycle will continue until we build mechanisms with formal guarantees
- CAMEL achieves control flow integrity for LLM agents by separating instruction planning from untrusted data processing
- Task-data independence: if the task doesn't require seeing the data, never show it the data — this eliminates prompt injection for that class
- 70-80% of computer-use benchmark tasks are solvable without the agent ever seeing a webpage
- Data flow attacks (exploiting known conditional branches) remain a vulnerability even with CFI — e.g., fake cookie-prompt ads
- Security performance scales with model capability in the CAMEL architecture — better model → better planning → better security
- Selmate (browser policy enforcement layer) + CAMEL together close nearly all practical attack vectors
- The engineering barrier to adoption is the real obstacle, not conceptual difficulty
## Notable Quotes / Data Points
- "We have no idea how to even define what robustness means properly for simple small models"
- Published paper: all major defenses break at sub-$1 budget with general adversarial search
- 6 months, millions of lines of code to production-ize CAMEL internally
- 70-80% of computer-use benchmark tasks solvable without internet access, by the 7th planning attempt on average
- Plans are ~3,000-4,000 lines of code per complex computer-use task
- Cookie-prompt ad injection: buying ad space to place fake cookie prompts causes agents to click "accept" as a data-flow exploit
- CAMEL + Selmate: "I promise you you will fail" — 99.999% attack prevention claimed
- "It's the same as load external code, run it" — on fully data-dependent task resolution
- Company now provides production system via openly available API, model-agnostic
#unprompted #claude