Speaker:: Jeffrey Zhang & Siddh Shah
Title:: Guardrails beyond Vibes
Duration:: 20 min
Video:: https://www.youtube.com/watch?v=KrKk8BGPeQA

## Key Thesis

Stripe built two production security agents — a threat modeling agent and a security routing agent — and the central lesson is that measuring agent quality requires moving beyond simple accuracy metrics toward LLM-as-judge semantic evaluation calibrated against human-created golden datasets, combined with phased rollouts and humans kept in the loop.

## Synopsis

Zhang and Shah (software and security engineers at Stripe) describe their work shipping two AI security agents in production: one for automated threat modeling of security review tickets, and one for routing developer security questions to the correct internal security team.

**Threat modeling agent**: Designed as an async, long-running process (not conversational) because accuracy mattered more than response speed. The architecture is a modular multi-agent pipeline: an orchestrator agent feeds into input agents (which gather context from linked Google Docs, Slack threads, etc.), specialized security agents (each scoped to a specific review category, e.g., third-party integrations), and output agents that format results for different audiences (human-readable summary, MITRE-structured data for metrics tooling, input for a conversational follow-up agent). The orchestrator was intentionally constrained to sequential execution rather than free agency, because unrestricted orchestrators skipped relevant specialized agents. Core design principle: define a baseline of required coverage areas per review type (e.g., data sensitivity, transport protocols, auth story) to ensure completeness even when the agent has flexibility beyond that.
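The sequential-orchestrator constraint and the per-review-type coverage baseline can be sketched as follows. This is a minimal illustration, not Stripe's implementation: the agent functions, coverage areas, and findings are all hypothetical, with the LLM-backed agents stubbed out so the sketch runs.

```python
from dataclasses import dataclass

# Required coverage areas per review type (hypothetical baseline).
REQUIRED_COVERAGE = {
    "third_party_integration": {"data_sensitivity", "transport_protocols", "auth_story"},
}

@dataclass
class Finding:
    area: str
    risk: str
    mitigation: str

def third_party_agent(context: dict) -> list[Finding]:
    # In a real system this would call an LLM with a prompt scoped to the
    # third-party-integration review category; stubbed here.
    return [
        Finding("data_sensitivity", "PII shared with vendor", "minimize fields sent"),
        Finding("transport_protocols", "plain-HTTP webhook", "require TLS"),
        Finding("auth_story", "static API key", "scope and rotate tokens"),
    ]

def orchestrate(review_type: str, context: dict) -> list[Finding]:
    # Sequential execution: every agent registered for this review type
    # runs, in order -- the orchestrator cannot choose to skip one.
    agents = {"third_party_integration": [third_party_agent]}[review_type]
    findings: list[Finding] = []
    for agent in agents:
        findings.extend(agent(context))
    # Baseline check: fail loudly if any required coverage area was missed,
    # rather than silently returning an incomplete threat model.
    missing = REQUIRED_COVERAGE[review_type] - {f.area for f in findings}
    if missing:
        raise ValueError(f"coverage gaps: {sorted(missing)}")
    return findings

findings = orchestrate("third_party_integration", {"ticket": "SEC-123"})
print(len(findings))  # prints 3
```

The design choice mirrors the talk: flexibility is allowed on top of the baseline, but completeness of the baseline itself is enforced mechanically, not left to the orchestrator's judgment.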
**Security routing agent**: Tried a progression from one-shot LLM prompt (fast but hallucinated on internal terminology) → fully agentic with many tools (accurate but ~10 minutes runtime) → minimal tool set via iterative pruning (30 seconds runtime, accuracy preserved). Final state: 2 tools, ~30 seconds. Rollout was user-feedback-driven: web page → Slack integration → company-wide internal chat.

**Evaluation**: Threat modeling resists deterministic matching (MITRE category labeling is inconsistent even when risks are correctly identified; keyword matching misses semantics). They adopted LLM-as-judge with human-created golden datasets. Key design: the LLM judge is not evaluating the threat model in isolation — it compares actual agent output against a human-written expected output for semantic equivalence of risks and mitigations. This hybrid approach avoids the circular trust problem. Iterating on prompts via this pipeline yielded ~10% accuracy gains from prompt improvements and another ~10% from model selection (tested across flagship LLMs on a mega-dataset with duplicated test cases to counteract non-determinism). A JSON formatting instruction accidentally dropped accuracy by 10% — caught only by the eval pipeline, not visible in spot-checks.

**Phased rollout**: Started in shadow mode on a constrained subcategory of reviews (similar risks/mitigations = more automation-friendly). Targeted ~80% accuracy with humans reviewing AI outputs before action (80% accuracy + human-in-loop beats 95% accuracy with no oversight for novel categories).

**Key learnings**: AlphaEvolve-style automated prompt evolution didn't work for language (too open-ended; iterations just paraphrased). Humans in the loop are not optional; invest in eval early; agent architecture depends on the task; garbage input produces garbage output, and agents must be taught to say "I don't know" instead of hallucinating.
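The hybrid evaluation scheme described above (an LLM judge comparing agent output to a human-written golden expected output, with test cases duplicated to counteract non-determinism) can be sketched roughly as below. All names and data are hypothetical, and both the agent and the judge are stubbed: the real judge is an LLM prompted for semantic equivalence of risks and mitigations, not the crude token overlap used here for runnability.

```python
# Hypothetical golden dataset: each case pairs an input ticket with a
# human-written expected threat model.
GOLDEN = [
    {"ticket": "SEC-101",
     "expected": "Risk: tokens logged in plaintext. Mitigation: redact logs."},
]

def run_agent(ticket: str) -> str:
    # Stub for the threat modeling agent under evaluation.
    return ("Risk: access tokens appear in application logs. "
            "Mitigation: add log redaction.")

def judge(expected: str, actual: str) -> bool:
    # Stub for the LLM judge. The real judge decides whether the risks and
    # mitigations are semantically equivalent (not a keyword or exact-string
    # match); token overlap stands in here so the sketch is deterministic.
    exp, act = set(expected.lower().split()), set(actual.lower().split())
    return len(exp & act) / len(exp) >= 0.3

def evaluate(cases, repeats: int = 5) -> float:
    # Duplicate each case `repeats` times: with a real (non-deterministic)
    # agent and judge, one flaky sample should not swing reported accuracy.
    results = [
        judge(case["expected"], run_agent(case["ticket"]))
        for case in cases
        for _ in range(repeats)
    ]
    return sum(results) / len(results)

accuracy = evaluate(GOLDEN)
print(f"{accuracy:.0%}")  # prints 100%
```

A pipeline like this is what makes silent regressions visible: rerunning `evaluate` after a prompt change (e.g., adding a JSON formatting instruction) surfaces an accuracy drop that spot-checking individual outputs would miss.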
## Key Takeaways

- LLM-as-judge with human golden datasets is the right eval approach for open-ended security tasks where deterministic matching fails
- The eval pipeline catches regressions that spot-checks miss (a 10% accuracy drop from a formatting instruction was invisible without it)
- Minimize agent tool sets aggressively — Stripe went from "all tools" (~10 min) to 2 tools (~30 sec) with comparable accuracy
- Phased rollout starting in shadow mode with constrained scope is the right path to production
- Humans in the loop are not optional — 80% accuracy + human review beats full autonomy at lower accuracy thresholds
- AlphaEvolve-style automated prompt evolution did not work well for open-ended language tasks
- "Garbage input = garbage output" — agents must be taught to acknowledge missing information rather than hallucinate

## Notable Quotes / Data Points

- Security routing agent: runtime reduced from ~10 min to ~30 sec by pruning from many tools to 2
- Threat modeling accuracy improvements: ~10% from prompt engineering, ~10% from model selection
- A JSON formatting instruction inadvertently caused a 10% accuracy drop (only caught via the eval pipeline)
- Targeted ~80% accuracy threshold for initial production rollout with human-in-loop
- AlphaEvolve-style prompt evolution in practice: "variation one just added two words; variation two just paraphrased the entire prompt"
- "When we gave our orchestrator agent too much agency, it wouldn't always run the relevant specialized agent we wanted it to"

#unprompted #claude