Speaker:: Mudita Khurana
Title:: Rethinking how we evaluate security agents for real-world use
Duration:: 10 min
Video:: https://www.youtube.com/watch?v=uImn7_dmeoY
## Key Thesis
Current security agent evaluation is rooted in narrow, outcome-only scoring that hides whether agents actually reason correctly or just pattern-match to the right answer. A capability-centric framework called CLASP provides six rubrics to measure how an agent achieves results, not just whether it does, enabling targeted improvement and more reliable deployment.
## Synopsis
Khurana opens with an analogy: a student who gives the right answer by guessing cannot be reliably corrected or improved, because there's no underlying skill to give feedback on. The same problem plagues security agents graded on brittle, outcome-only benchmarks. An 80% benchmark score tells you nothing about why the agent succeeded, why it failed the other 20% of the time, or whether its performance will transfer to your environment.
She illustrates the problem with a SQL injection detection example. An agent might correctly flag a vulnerable code pattern, but a trace review reveals it used shallow pattern matching — it never traced taint flow, never checked downstream sanitization, and produced no exploit evidence. That's a passing score masking fundamentally broken reasoning. Worse, when that agent is part of a larger security workflow (find → exploit → patch → validate), carrying incorrect or incomplete context forward causes the whole chain to fail even if individual steps pass.
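The SQL injection example can be made concrete with a minimal sketch (not from the talk; the function and schema are hypothetical). A detector that pattern-matches on "string concatenation into `execute()`" would flag this code, but tracing taint flow shows the input is reduced to a safe token before it reaches the sink, so the finding is not exploitable:

```python
import re
import sqlite3

def lookup_user(conn: sqlite3.Connection, username: str):
    """Shallow pattern matching flags the concatenation below as SQL
    injection, but taint analysis shows the user input is stripped to an
    alphanumeric token first, so no injection payload survives."""
    sanitized = re.sub(r"[^A-Za-z0-9_]", "", username)  # downstream sanitization
    query = "SELECT name FROM users WHERE name = '" + sanitized + "'"  # concatenation at the sink
    return conn.execute(query).fetchall()

# Demo: a classic injection payload is neutralized by the sanitizer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice'), ('bob')")
print(lookup_user(conn, "alice"))        # normal lookup returns the row
print(lookup_user(conn, "' OR '1'='1"))  # payload collapses to 'OR11', matches nothing
```

An agent that never checks the sanitization step can be right on vulnerable code and wrong here, and an outcome-only score cannot tell the two apart.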
CLASP (Capability-Centric Evaluation Framework) provides six rubrics covering agentic capabilities like reasoning, memory, planning, and tool use. Each rubric distinguishes brittle execution (low score) from adaptive, feedback-aware execution (high score). Applied to existing agents, CLASP revealed that different capabilities matter for different security tasks — for enumeration-heavy recon tasks, breadth of tool coverage mattered more than depth of reasoning; for vulnerability confirmation tasks, the inverse held.
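One way to picture a CLASP-style evaluation result is as per-capability scores with trace evidence attached. The sketch below is illustrative only (the scale, capability names beyond the four mentioned in the talk, and the helper are assumptions, not the paper's schema); the point is that low scores localize the weakness:

```python
from dataclasses import dataclass

@dataclass
class RubricScore:
    capability: str   # e.g. "reasoning", "memory", "planning", "tool_use"
    score: int        # 1 = brittle execution ... 5 = adaptive, feedback-aware
    evidence: str     # trace excerpt justifying the score

def weakest_capabilities(scores: list[RubricScore], threshold: int = 3) -> list[str]:
    """Diagnose where the agent is weak: capabilities scoring below threshold."""
    return sorted(s.capability for s in scores if s.score < threshold)

scores = [
    RubricScore("reasoning", 2, "flagged pattern without tracing taint flow"),
    RubricScore("tool_use", 4, "selected scanner, adjusted flags after errors"),
    RubricScore("planning", 2, "no fallback after failed exploit attempt"),
    RubricScore("memory",   4, "carried earlier findings into patch step"),
]
print(weakest_capabilities(scores))  # -> ['planning', 'reasoning']
```

A single pass/fail outcome would hide exactly this breakdown.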
Practically, CLASP can be applied two ways: using an LLM-as-judge pipeline fed the CLASP rubric and the agent's trace (lighter lift), or by building benchmark scenarios at varying complexity per capability (heavier lift but more rigorous). A blueprint for the latter is described in the accompanying academic paper.
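The lighter-lift LLM-as-judge route might look roughly like the following sketch. Everything here is an assumption for illustration: the rubric wording, the prompt layout, and `call_llm` (a stand-in for whatever provider client you use) are hypothetical, and the stub judge exists only to make the example runnable:

```python
import json

RUBRIC = (
    "Capability under review: reasoning.\n"
    "Score 1-5: 1 = brittle pattern matching; 5 = adaptive, feedback-aware execution.\n"
    'Return JSON: {"score": <int>, "justification": "<one sentence>"}'
)

def judge_trace(trace: str, rubric: str, call_llm) -> dict:
    """Assemble a judge prompt from a CLASP-style rubric plus the agent's
    trace, then parse the model's JSON verdict. `call_llm` is any function
    mapping a prompt string to a completion string (provider-specific)."""
    prompt = (
        "You are grading a security agent against one rubric.\n\n"
        f"Rubric:\n{rubric}\n\nAgent trace:\n{trace}\n\nVerdict:"
    )
    return json.loads(call_llm(prompt))

# Stub judge for illustration; swap in a real model client in practice.
fake_llm = lambda prompt: '{"score": 2, "justification": "No taint tracing in trace."}'
verdict = judge_trace("flagged concat into execute(); no taint analysis", RUBRIC, fake_llm)
print(verdict["score"])
```

The heavier-lift alternative, per-capability benchmark scenarios at varying complexity, is blueprinted in the accompanying paper.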
## Key Takeaways
- Outcome-only scoring hides explainability, reliability, and improvability gaps in security agents
- Taint flow analysis, exploitability assessment, and evidence documentation are the kinds of reasoning that outcome scores miss
- Security is a workflow, not an isolated task — agents must carry context across stages for end-to-end success
- CLASP's six rubrics let you diagnose where an agent is weak and prescribe what to improve
- For recon-style tasks, breadth of tool use matters more than depth of reasoning; this tradeoff is task-specific
- LLM-as-judge is a workable starting point; test scenario benchmarks are more rigorous
- The work is personal research, not affiliated with Airbnb
## Notable Quotes / Data Points
- "Don't evaluate on isolated narrow outcomes. Dig deeper into the how."
- An agent achieving 80% benchmark success "doesn't tell you why it worked 80% of the time" nor "why it failed the other 20%"
- CLASP paper includes a benchmark blueprint for open-source release
- For recon agents: agents that planned broadly and used tools extensively outperformed agents that reasoned deeply but narrowly
#unprompted #claude