Speaker:: Kyle Polley
Title:: Training BrowseSafe: Detecting Prompt Injection
Duration:: 29 min
Video:: https://www.youtube.com/watch?v=Fzgqx1MauJg

## Key Thesis

Existing open-source prompt injection classifiers and academic benchmarks fail in production because they optimize for obvious keyword patterns rather than attacker intent, and they don't account for real-world distractors like cookie consent banners. Perplexity's BrowseSafe, a fine-tuned Qwen 30B model trained on a real-world taxonomy of injection attacks, achieves 90.4% F1 at sub-second latency, outperforming GPT-5 mini (85.4% F1, ~2s latency) and GPT-5 (2s–20s latency) while remaining fast enough to deploy in production as a defense-in-depth layer.

## Synopsis

Polley leads security at Perplexity AI, which operates three AI products: Perplexity (AI search/answer engine), Comet (agentic AI browser), and Perplexity Computer (browser + filesystem + code execution). His team's direct production exposure to prompt injection in browser-agent contexts motivated the development of BrowseSafe.

The talk establishes the browser-agent threat model: a user instructs the agent to accomplish a task (e.g., "book me a flight to New York"), the agent visits websites, and those websites may contain malicious instructions embedded in the page content. The LLM cannot reliably distinguish tool output from user input or the system prompt. Attack types observed in the wild include: prompt template impersonation (injecting content formatted to mimic the system prompt's structure); agent social engineering (benign-looking step-by-step instructions mimicking legitimate workflows); conditional triggers (instructions hidden in calendar events that activate only on specific user queries, as discussed in earlier talks); and hidden HTML elements or URL-embedded instructions.

Polley benchmarked existing solutions, including Meta's Prompt Guard and GPT OSS Safeguard, against a taxonomy of real-world attack types and found critical failure modes.
First, these models detect obvious attacks (e.g., "ignore previous instructions") well but fail on context-manipulation attacks where the injection is phrased as social engineering; detection rates drop significantly as attack sophistication increases. Second, multi-language attacks: an injection written in Hebrew or Spanish confuses the detection models, and the instructions get followed regardless. The root cause is that the models pattern-match on keywords rather than evaluating intent. Third, distractors: adding realistic, benign-looking web content (e.g., cookie consent banners that say "Important: you must click accept to proceed") causes a 19-percentage-point drop in F1 even for models that otherwise perform well, because they flag based on surface-level heuristics.

BrowseSafe Bench is the resulting open-source evaluation dataset, built from deanonymized and classified real-world attacks and organized into a taxonomy of attack types and injection strategies. The dataset was used both for evaluation and for fine-tuning BrowseSafe itself. The fine-tuned Qwen 30B model achieves 90.4% F1 at sub-second latency. A distinctive feature is that BrowseSafe outputs a probability confidence score rather than a binary classification, enabling security teams to tune the precision/recall tradeoff for their specific product context (e.g., a product that can ask users to confirm suspected injections can afford more false positives).

Held-out generalization testing showed that BrowseSafe generalizes well to unseen URLs and attack languages but drops significantly on novel injection strategies (new placement techniques). This signals that attackers will discover new placement methods over time, making a data flywheel essential.
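The probability-score output can be consumed with simple thresholding. A minimal sketch of that pattern; the function name, thresholds, and action labels here are illustrative assumptions, not BrowseSafe's actual interface:

```python
def route_verdict(score: float,
                  block_at: float = 0.9,
                  review_at: float = 0.5) -> str:
    """Map an injection-probability score to a product action.

    A product that can ask the user to confirm a suspected injection
    can lower `review_at` (tolerating more false positives); a fully
    autonomous agent might raise both thresholds instead.
    """
    if score >= block_at:
        return "block"      # high confidence: reject the tool output
    if score >= review_at:
        return "escalate"   # mid confidence: LLM backstop or user confirm
    return "allow"          # low score: pass the content through


# Tuning example: the same mid-range score is escalated by default,
# but allowed through under a stricter review threshold.
print(route_verdict(0.95))                 # "block"
print(route_verdict(0.60))                 # "escalate"
print(route_verdict(0.60, review_at=0.7))  # "allow"
```

Because the thresholds are ordinary parameters, each product surface can pick its own operating point on the precision/recall curve without retraining the classifier.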
The production deployment strategy Polley recommends is defense-in-depth across four layers:

1. Pre-processing: strip hidden elements and HTML comments before content reaches the LLM context.
2. Classification: run BrowseSafe or prompted GPT-5 mini (85.4% F1, ~2s) at every untrusted tool-call boundary, returning structured errors that inform the agent rather than blocking silently.
3. Latest frontier model as the agent: the jump from Claude Opus 4.5 to 4.6 reduced prompt injection success rates from 16.2% to 2.83%.
4. LLM backstop: low-confidence BrowseSafe outputs are escalated to a large model for final judgment, and those verdicts feed the training flywheel to retrain BrowseSafe against new attack patterns.

In Q&A, Polley addressed the Zenity research on Comet. One finding was that Comet's soft guardrails (system-prompt-level restrictions on filesystem access) were bypassable; these were replaced with hard architectural controls. The second finding involved a Hebrew-language prompt injection attack against the 1Password integration; Polley noted that Zenity had to run the attack roughly 200 times before it succeeded once, which he argued mischaracterized the severity of the vulnerability.
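The pre-processing layer (stripping hidden elements and HTML comments before page content reaches the LLM) can be sketched with Python's stdlib `html.parser`. This is an illustrative approximation, not Perplexity's actual pipeline: it only checks inline `display:none`/`visibility:hidden` styles and the `hidden` attribute, and it ignores edge cases like unclosed void tags inside hidden subtrees or CSS defined in stylesheets.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect only text a human reader would actually see,
    dropping HTML comments and hidden subtrees."""

    HIDDEN_MARKERS = ("display:none", "visibility:hidden")

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # nesting depth inside a hidden subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        if "hidden" in attrs or any(m in style for m in self.HIDDEN_MARKERS):
            self.hidden_depth += 1
        elif self.hidden_depth:
            self.hidden_depth += 1  # children of a hidden element stay hidden

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_comment(self, data):
        pass  # comments never reach the LLM context

    def handle_data(self, data):
        if not self.hidden_depth and data.strip():
            self.chunks.append(data.strip())


def strip_hidden(html_text: str) -> str:
    """Return only the visible text of an HTML fragment."""
    parser = VisibleTextExtractor()
    parser.feed(html_text)
    return " ".join(parser.chunks)
```

An injection hidden in a comment or a `display:none` element never enters the agent's context, shrinking the attack surface before the classification layer even runs.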
## Key Takeaways

- Academic prompt injection benchmarks don't translate to production; they miss distractors and novel attack types
- Models detect obvious injection keywords well but fail on intent-based social engineering and multi-language attacks
- Adding just 3 distractors to an eval drops detection from ~95% to 81%, a 14-point collapse from realistic web content
- Fine-tuning on domain-specific real-world data dramatically outperforms prompted general-purpose LLMs
- 90.4% F1 at sub-second latency makes production deployment practical; GPT-5 mini is a usable fallback (85.4% F1, ~2s)
- Probability confidence scores matter: a binary classifier doesn't let you tune precision/recall for your product's UX
- Claude Opus 4.6 reduced prompt injection success from 16.2% to 2.83% vs 4.5; always run the latest model as your agent
- When an injection is detected, don't silently block; return a structured error message so the agent knows why and can communicate it to the user
- A data flywheel (classifier → LLM backstop → retrain) is essential because new injection placement strategies will keep appearing

## Notable Quotes / Data Points

- BrowseSafe: 90.4% F1, sub-second latency (fine-tuned Qwen 30B)
- GPT-5 mini, prompted: 85.4% F1, ~2s latency
- GPT-5, prompted: competitive F1 but up to 20s latency, not production-viable
- Prompt Guard and GPT OSS Safeguard: "don't actually work that well in live environments"
- Claude Opus 4.6 prompt injection attack success: 2.83% (vs 16.2% for 4.5); with updated safeguards: 0.8%
- Zenity required ~200 attempts before their Hebrew-injection attack succeeded once on Comet
- The BrowseSafe model and BrowseSafe Bench dataset are fully open-source on Hugging Face
- Research collaboration: Mark Tennenholtz, Dennis Sierras, Jerry Ma (Perplexity) + Caillou Zhang, Dr. Nigu Lee (Purdue University)

#unprompted #claude