Speaker:: Matt Rittinghouse & Millie Huang
Title:: 1.8M Prompts, 30 Alerts
Duration:: 22 min
Video:: https://www.youtube.com/watch?v=PtWwrOm3BeE

## Key Thesis

Content moderation (prompt filtering) is insufficient to secure agentic systems — you must also instrument the execution layer with behavioral anomaly detection that profiles what agents normally do across three axes (user, agent, org). That ensemble approach can reduce 1.8 million daily prompts to fewer than 30 actionable alerts while remaining blind to customer prompt data.

## Synopsis

Rittinghouse and Huang work on the data science team in Salesforce's security operations center, defending "AgentForce" — Salesforce's agentic AI platform deployed across both customer orgs and internal Salesforce orgs. The scale they operate at: 12,000 unique daily active agents across 55,000 monitored organizations, generating roughly 1.8 million prompts per day. Their threat model has two buckets: (1) platform-targeted attacks — misconfigured permissions and flaws in custom Apex skills exposed to the reasoning engine; (2) abuse of legitimate agency — threat actors using the agent's built-in capabilities for valid-but-malicious actions. The second category is the hard one, because the actions themselves look normal.

The content moderation layer they have (the "trust layer") does useful work — it catches toxicity and prompt injection at the reasoning/dialogue layer. But it has three fundamental gaps. First, reasoning vs. execution: it polices the model's thought process but cannot see the actual system calls and API invocations the agent makes. Second, the blocking dilemma: content moderation catches malicious intent but lacks the high-fidelity signal needed for automated inline blocking in an enterprise context — you need behavioral context to block confidently. Third, post-generation blindness: even a clean prompt can produce an agent plan that accesses unauthorized data or escalates privileges, which the content filter cannot observe.
Their solution: behavioral anomaly detection at the execution layer. The model profiles three axes simultaneously — user-level behavior, agent-level behavior, and org-level behavior — and combines them as an ensemble to minimize noise. Features focus on data access depth (frequency of DB calls, proportion of data accessed — catching exfiltration attempts via ratio analysis) and data access breadth (sensitivity/rarity of fields accessed, PII field flags, column-access frequency profiling). Implementation uses incremental historical profiling: daily session profiles build a rolling baseline, and new sessions are scored in near-real-time against standard-deviation thresholds from that baseline. This enables both alert prioritization (by how far outside baseline a session falls) and rapid response.

Key failures they learned from:

1. A custom-built query complexity calculator — it seemed smart but confused layers of identity (agents construct queries, not users, so the feature measured the wrong thing).
2. Joining data from multiple metadata tables — enormously expensive computational joins, later refactored down to a single-table model, reducing training time by two-thirds.
3. Feature explosion — adding features across user/agent/org dimensions creates multiplicative noise, requiring aggressive culling.

The result: 1.8M daily prompts → trust layer filtering → fewer than 30 active alerts requiring investigation. Each alert generates a structured JSON payload that an additional LLM agent translates into a plain-English summary for SOC analysts — no security background required. The roadmap includes moving from 12-24 hour batch detection to in-flight inference using real-time cached behavioral baselines, enabling auto-containment (session kill, token revocation, bot-level trigger) when sessions hit statistically impossible deviation thresholds.
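The core loop described above, rolling per-entity baselines, standard-deviation scoring of new sessions, and a three-axis ensemble, can be sketched as follows. This is a minimal illustration, not Salesforce's implementation: all class and feature names are hypothetical, and the ensemble rule (alert only when all three axes deviate) is one plausible reading of how the ensemble "minimizes noise."

```python
from collections import defaultdict
from dataclasses import dataclass
import math

@dataclass
class RunningStats:
    """Incremental mean/variance (Welford's online algorithm), so the
    baseline updates per session without re-reading history."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x: float) -> float:
        if self.n < 2:
            return 0.0  # warm-up: not enough history to score yet
        std = math.sqrt(self.m2 / (self.n - 1))
        return 0.0 if std == 0 else (x - self.mean) / std

class BehavioralProfiler:
    """Scores each session against user-, agent-, and org-level baselines
    and combines the three axes into one alert decision."""

    AXES = ("user", "agent", "org")

    def __init__(self, threshold: float = 3.0):
        self.threshold = threshold  # alert at >3 std devs from baseline
        # baselines[axis][(entity_id, feature_name)] -> RunningStats
        self.baselines = {a: defaultdict(RunningStats) for a in self.AXES}

    def score(self, session: dict) -> dict:
        """session = {"user": id, "agent": id, "org": id,
                      "features": {"db_calls": n, "pii_fields": n, ...}}"""
        axis_scores = {}
        for axis in self.AXES:
            entity = session[axis]
            zs = []
            for feat, value in session["features"].items():
                stats = self.baselines[axis][(entity, feat)]
                zs.append(abs(stats.zscore(value)))
                # Fold the session into the rolling baseline. A real
                # system would likely exclude sessions it just flagged.
                stats.update(value)
            axis_scores[axis] = max(zs) if zs else 0.0
        # Ensemble rule: require every axis to look anomalous before
        # alerting, which is what keeps the alert count low.
        combined = min(axis_scores.values())
        return {"axis_scores": axis_scores,
                "alert": combined > self.threshold}
```

Note how the warm-up behavior falls out naturally: a new entity scores zero until it has enough history, which mirrors the talk's advice to expect a noisy break-in period for any new agent before its baseline stabilizes.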
## Key Takeaways

- Content moderation alone is insufficient for agentic security — execution-layer behavioral detection is required
- Three-axis ensemble (user + agent + org) produces much lower noise than single-axis anomaly detection
- 1.8M daily prompts reduced to <30 active alerts via behavioral profiling
- Simple features (invocation count, sensitive asset frequency) outperformed complex ones (query complexity parser) in both signal quality and latency
- Agents construct queries — feature engineering that assumes user-constructed queries will measure the wrong thing
- Build your behavioral baseline before you know what the attack looks like — you need the statistical mean ready when the deviation occurs
- Expect 14 days of noise for any new agent; build a warm-up period into SOC playbooks
- LLM-assisted alert triage (plain-English summaries from structured JSON) reduces the SOC analyst skill barrier
- The "observability gap" is the first battle: engineering logs need structured events linking user ID to agent ID — without that, abuse detection is blind

## Notable Quotes / Data Points

- Scale: 12,000 unique daily active agents; 55,000 monitored organizations; ~1.8M daily prompts
- Result: fewer than 30 active alerts after behavioral anomaly detection
- Refactoring to a single-table model: training time reduced by ~2/3
- Content moderation's catch rate at the correct severity is insufficient for inline blocking; the behavioral model is needed
- "Don't wait for a known attack to build your defense" — signature-based approaches fail at 12,000+ unique agents
- Current detection latency: 12-24 hours (batch); roadmap: in-flight inference with auto-containment
- "Simple telemetry like invocation count or sensitive asset frequency produced way higher fidelity signals"
- Purple team exercise planned to tune auto-containment thresholds before production deployment

#unprompted #claude