Speaker:: Joshua Saxe
Title:: The Hard Part Isn't Building the Agent: Measuring Effectiveness
Duration:: 22 min
Video:: https://www.youtube.com/watch?v=rO2yA52U_i4
## Key Thesis
Classical ML evaluation metrics (precision, recall, F-score, AUC) are fundamentally broken for autonomous cyber defense agents because security ground truth is inherently noisy and multi-dimensional. The field needs to shift from oracle-based binary outcome scoring to holistic, rubric-based evaluation that assesses how agents reason under uncertainty — treating them like junior security engineers, not classifiers.
## Synopsis
Saxe (recently co-founded a startup; previously led AI security work at Meta for four years) opens by sketching the near-future threat landscape: within a few years, attackers will transition from manual operators to managers of AI agents performing attacks at scale. Small organizations with minimal IT security staff will need autonomous defenders. Larger organizations will need AI to fill their perennial staffing gaps. The prerequisite for deploying these autonomous systems — trusting an AI to quarantine a production server, close a critical alert, or patch a live codebase without human sign-off — is robust evaluation. And the field currently doesn't have it.
He identifies the core structural problem: classical ML evaluation assumes an "oracle" — a transparent, trustworthy label source, like knowing a photo contains a cat. Security doesn't have that. SOC analysts disagree on alert severity at double-digit rates. Access control investigators disagree on who should have access to what at double-digit rates. Program security is subject to halting-problem-adjacent uncertainty about whether a patch is correct. Binary classification metrics computed over noisy labels give results nobody actually trusts; practitioners keep using them only because they don't know what to replace them with.
Saxe ran a simulation: adding even a 1% label-flip rate (true positive flipped to false positive and vice versa) causes apparent system accuracy to plummet, and above ~3% flip rate you hit a "noise ceiling" where you can no longer measure whether the system is actually improving. This means that for any real-world security ML system, published accuracy numbers are largely fiction.
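The talk doesn't spell out the simulation's mechanics, but the noise-ceiling effect can be sketched with a toy analytic model: if each ground-truth label flips independently with some probability, a correct prediction is scored wrong on a flipped label and vice versa. Everything below is an illustrative stand-in, not Saxe's actual simulation.

```python
def apparent_accuracy(true_accuracy: float, flip_rate: float) -> float:
    """Expected measured accuracy when each label flips independently with
    probability flip_rate: correct predictions score wrong on flipped labels,
    and incorrect predictions score right on them."""
    return true_accuracy * (1 - flip_rate) + (1 - true_accuracy) * flip_rate

# A perfect system can never score above 1 - flip_rate: the "noise ceiling".
print(apparent_accuracy(1.00, 0.03))  # 0.97

# A genuine 2-point gap between two systems is compressed by (1 - 2 * flip_rate).
gap = apparent_accuracy(0.97, 0.03) - apparent_accuracy(0.95, 0.03)
print(round(gap, 4))  # 0.0188
```

On top of this compression, the flipped labels add sampling variance to any finite eval set, so small real improvements sink below the measurement noise — which is the sense in which improvement becomes unmeasurable above a few percent flip rate.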
His proposed reframe: evaluate agents the way you'd evaluate a human hire. Define a rubric covering multiple dimensions — evidence gathering, policy understanding, first-principles reasoning, auditability of outputs, and decision accuracy. Grade agent trajectories (the full chain of reasoning, tool calls, and outputs) against this rubric, calibrated by a small committee of domain experts. Then train an LLM judge to automate that rubric scoring at scale. Use ~100 samples to calibrate the judge. Define a deployment bar above which you'll ship, and hill-climb all dimensions until you're above it. Monitor at steady state post-deployment.
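The rubric-and-deployment-bar workflow above can be sketched in a few lines. The five dimension names come from the talk; the 0-5 scale, equal weighting, and the bar value of 4.0 are hypothetical placeholders, not figures Saxe gives.

```python
# Dimension names are from the talk; scale, weights, and bar are hypothetical.
RUBRIC_DIMENSIONS = [
    "evidence_gathering",
    "policy_understanding",
    "first_principles_reasoning",
    "auditability",
    "decision_accuracy",
]
DEPLOYMENT_BAR = 4.0  # hypothetical minimum mean score required to ship

def score_trajectory(judge_scores: dict) -> tuple:
    """Aggregate per-dimension 0-5 scores from an LLM judge into an overall
    score and a ship / no-ship decision against the deployment bar."""
    missing = set(RUBRIC_DIMENSIONS) - set(judge_scores)
    if missing:
        raise ValueError(f"judge did not score: {sorted(missing)}")
    overall = sum(judge_scores[d] for d in RUBRIC_DIMENSIONS) / len(RUBRIC_DIMENSIONS)
    return overall, overall >= DEPLOYMENT_BAR

# Example: strong evidence gathering, weak auditability.
overall, ship = score_trajectory({
    "evidence_gathering": 5.0,
    "policy_understanding": 4.0,
    "first_principles_reasoning": 4.0,
    "auditability": 2.0,
    "decision_accuracy": 5.0,
})
print(overall, ship)  # 4.0 True
```

In the workflow Saxe describes, the ~100 expert-graded trajectories serve as the calibration set: the LLM judge's per-dimension scores are tuned until they agree with the expert committee's, and only then is the judge trusted to score at scale.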
He notes that at Meta and at his new startup, this approach — combined with using LLMs and genetic-algorithm-style prompt optimization to automatically hill-climb evaluation scores — has substantially accelerated shipping autonomous security systems.
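Saxe doesn't detail the optimizer, but "genetic-algorithm-style prompt optimization" generally means: maintain a population of prompt variants, mutate them, and keep the fittest against the eval score. A minimal sketch, with the LLM-proposed mutations and the rubric-judge fitness both stubbed out as toy stand-ins:

```python
import random

# Hypothetical instruction fragments standing in for LLM-proposed mutations.
MUTATION_SNIPPETS = [
    "Cite every log line you relied on.",
    "List the policies you consulted before deciding.",
    "State your confidence and what evidence would change it.",
]

def mutate(prompt: str, rng: random.Random) -> str:
    """Append one candidate instruction (stand-in for LLM-generated edits)."""
    return prompt + " " + rng.choice(MUTATION_SNIPPETS)

def fitness(prompt: str) -> float:
    """Toy stand-in for the rubric judge's score: rewards distinct
    instructions so the loop has something to climb. Replace with the
    calibrated LLM judge's mean rubric score over an eval set."""
    return float(len(set(prompt.split())))

def optimize(seed_prompt: str, generations: int = 10, pop_size: int = 8) -> str:
    """Hill-climb the eval score: mutate survivors, keep the top pop_size."""
    rng = random.Random(0)
    population = [seed_prompt]
    for _ in range(generations):
        population += [mutate(p, rng) for p in population for _ in range(2)]
        population.sort(key=fitness, reverse=True)
        population = population[:pop_size]
    return population[0]

seed = "Investigate the alert and decide whether to escalate."
best = optimize(seed)
print(fitness(best) > fitness(seed))  # True
```

The structure, not the stubs, is the point: the eval rubric becomes the fitness function, so better evaluation directly drives better agents — which is how eval investment converts into shipping velocity.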
## Key Takeaways
- Precision/recall/F-score are unreliable for security agents because ground truth labels are inherently noisy at the domain level, not just through measurement error
- Even 1-3% label noise creates a "noise ceiling" that prevents meaningful measurement of system improvement
- The right model: evaluate agents like you evaluate human security engineers — assess reasoning quality, evidence gathering, and policy understanding, not just binary outcomes
- An LLM judge calibrated on ~100 expert-reviewed samples can automate rubric scoring at scale
- Eval takes ~50% of team time but produces ~10x shipping velocity
- Autonomous agents must eventually make "dangerous" decisions (quarantine prod, patch live systems) — evaluation is the prerequisite for trusting that
## Notable Quotes / Data Points
- SOC analyst disagreement rate on alert labeling: double-digit percentage
- Access control investigator disagreement rate: double-digit percentage
- Simulation: at 3% label flip rate, ability to measure system improvement approaches zero
- ~100 samples sufficient to calibrate an LLM judge for rubric-based evaluation
- "Evaluation is the main hard problem" in deploying autonomous cyber defense
- "We are going to need to get comfortable with AI systems doing jobs we only entrusted to humans in the past"
- "There are many teams right now in AI security operating with no eval — this is a real dystopia, it's all vibes"
- Expects "armies of artificially general intelligent robots scaling up adversarial activity against all our networks" within a few years
#unprompted #claude