Speaker:: Arthi Nagarajan
Title:: Exploring the AI Automation Boundary
Duration:: 23 min
Video:: https://www.youtube.com/watch?v=EZSLjT8O2rw
## Key Thesis
Building an AI-powered threat hunting copilot revealed that LLMs alone are inadequate for production security workflows because semantic accuracy — generating queries that surface the *right* data, not just syntactically valid queries — is fundamentally a schema discovery problem that requires live data analysis, not documentation lookup. The automation boundary that worked at Datadog was: let AI handle query generation, iterative refinement, and synthesis; let humans drive the hunt and define the playbook.
## Synopsis
Nagarajan is a software engineer at Datadog who works with internal threat intelligence and detection teams to productionize AI-powered security tools. Over six to nine months, her team developed the "Hunting Copilot," an AI threat hunting tool, and iteratively discovered where the automation boundary should sit.
**V1 — Direct LLM Agent.** The initial design used a single GPT-4.1 agent with web search and the AWS Knowledge MCP for CloudTrail schema documentation. The system prompt included Datadog Lucene syntax documentation and a few threat hunting hypothesis-to-query examples. Results were mediocre: response time was fast (avg 5 seconds), but the syntax error rate was 25%, hunters complained about overly aggressive hypothesis generation, fields were frequently misnamed, and the agent often ignored its tools. The fundamental problem: LLMs are bad at threat hunting out of the box because log data at Datadog comes in large volumes, spans 450-day lookback periods, mixes faceted and unfaceted fields, flows through customizable ingestion pipelines, and offers no reliable ground truth for evaluation. Fine-tuning was considered but rejected as too high-risk and high-effort given the absence of a well-curated, consistent corpus of past hunts.
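The V1 prompt shape — syntax documentation plus a few hypothesis-to-query examples in one system prompt — can be sketched roughly as below. The doc text and example queries are invented placeholders for illustration, not the talk's actual prompt or Datadog's implementation.

```python
# Illustrative sketch of a V1-style system prompt: syntax docs plus few-shot
# hypothesis-to-query pairs. All strings here are hypothetical placeholders.

SYNTAX_DOCS = "Datadog Lucene: facets start with @; negate with -; wildcard *."

FEW_SHOT = [
    ("New IAM login profiles created for existing users",
     "source:cloudtrail @evt.name:CreateLoginProfile"),
    ("Console logins without MFA",
     "source:cloudtrail @evt.name:ConsoleLogin -@additionalEventData.MFAUsed:Yes"),
]

def build_system_prompt(docs: str, examples) -> str:
    """Assemble docs and few-shot examples into one system prompt string."""
    shots = "\n".join(f"Hypothesis: {h}\nQuery: {q}" for h, q in examples)
    return f"{docs}\n\nExamples:\n{shots}\n\nGenerate one query per hypothesis."

prompt = build_system_prompt(SYNTAX_DOCS, FEW_SHOT)
print(prompt.count("Hypothesis:"))  # -> 2
```

The weakness the talk identifies is visible in the shape itself: everything the agent knows about the schema is frozen into static prompt text.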
**Reframing around schema discovery.** The insight that unlocked V2 came from sitting down with threat hunters and documenting their actual workflow: start with a hypothesis and broad query → inspect raw log samples → tweak query → repeat until signal found. This process is called "log schema exploration" — the schema should be derived live from the data, not from documentation. This directly addressed the core accuracy problem and also the hunter request to automatically rule out false positives from test accounts.
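The hunter workflow above — broad query, inspect raw samples, tweak, repeat — can be sketched as a simple refinement loop. All function names (`run_query`, `refine`, `explore_schema`) are illustrative assumptions, not Datadog APIs.

```python
# Hypothetical sketch of the "log schema exploration" loop: derive the schema
# from live data by iteratively inspecting samples and tweaking the query.

def explore_schema(initial_query, run_query, refine, max_rounds=5):
    """Narrow a broad query by inspecting raw log samples.

    run_query(query) -> list of raw log dicts (a live sample, not docs)
    refine(query, samples) -> (new_query, done) based on observed fields
    """
    query = initial_query
    for _ in range(max_rounds):
        samples = run_query(query)            # inspect live data, not documentation
        query, done = refine(query, samples)  # tweak based on real field names
        if done:
            break
    return query

# Toy usage: discover that the real field is "usr.name", not "user".
def fake_run_query(q):
    return [{"usr.name": "alice", "evt.name": "CreateLoginProfile"}]

def fake_refine(q, samples):
    fields = set().union(*(s.keys() for s in samples))
    if "user" in q and "usr.name" in fields:
        return q.replace("user", "usr.name"), True
    return q, True

print(explore_schema("user:alice", fake_run_query, fake_refine))
# -> usr.name:alice
```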
**V2 — Multi-Agent Framework.** V2 introduced an orchestrator-expert architecture with reasoning models (GPT-5.1), a Datadog MCP server that allows query execution against real logs (returning concise syntax corrections), and tighter system prompts with explicit construction rules for tricky syntax (negation, JSON policy documents, wildcard placement). Architecture: Orchestrator (maintains overall hunt context, routes to sub-agents) → AWS/GCP source documentation agents → Datadog agent (iteratively generates and refines queries against live data, biased toward returning results, with a 5-iteration cap). Results: syntax accuracy improved 17%, tool usage increased. Cost: response time jumped from 5 seconds to an average of 6 minutes, with edge cases at 30–60 minutes. Users were not happy.
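The orchestrator-expert shape can be sketched minimally as below: the orchestrator holds the long-lived hunt context and routes to expert sub-agents, each of which sees only its own task, while the Datadog expert refines queries under the 5-iteration cap. Class and function names, and the toy query tool, are assumptions for illustration, not Datadog's code.

```python
# Illustrative sketch of the V2 orchestrator-expert architecture.

MAX_ITERATIONS = 5  # the Datadog agent's stated refinement cap

def fake_execute(query):
    """Stand-in for the MCP query tool: returns (hits, concise_correction)."""
    if "@evt.name" not in query:
        return [], query.replace("evt.name", "@evt.name")  # short, actionable fix
    return [{"@evt.name": "CreateLoginProfile"}], None

def datadog_expert(task):
    """Iteratively refine a query against live logs, capped at 5 rounds."""
    query = task["query"]
    for _ in range(MAX_ITERATIONS):
        hits, correction = fake_execute(query)
        if hits:                              # biased toward returning results
            return {"query": query, "hits": hits}
        if correction:
            query = correction
    return {"query": query, "hits": []}

class Orchestrator:
    """Holds overall hunt context; each expert call starts with fresh context."""
    def __init__(self, experts):
        self.experts = experts
        self.hunt_context = []                # long-lived state accumulates here only

    def route(self, expert_name, task):
        result = self.experts[expert_name](task)   # sub-agent sees only `task`
        self.hunt_context.append((expert_name, result))
        return result

hunt = Orchestrator({"datadog": datadog_expert})
out = hunt.route("datadog", {"query": "evt.name:CreateLoginProfile"})
print(out["query"])  # -> @evt.name:CreateLoginProfile
```

Keeping each sub-agent's context fresh is what prevents the context pollution the talk describes; the price, as the results show, is routing latency.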
**Key lessons from V2.** Reasoning models are too slow for production; use them sparingly and show reasoning transparency to users. Multi-agent frameworks manage context well — each expert sub-agent starts fresh, preventing context pollution — but delegation adds latency and coordination complexity. Chains longer than three levels tend to get stuck due to looping. Tools need to return short, actionable corrections, not verbose outputs. Specify which tool to call at which step in the system prompt.
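The "short, actionable corrections" lesson can be illustrated with a wrapper that reduces a verbose validator dump to a single usable line before it reaches the agent. The error format and `summarize_error` helper are invented for illustration.

```python
# Hedged sketch: shrink a verbose tool error into one actionable correction
# so the agent doesn't burn context (and iterations) parsing it.

def summarize_error(verbose_error: str) -> str:
    """Reduce a multi-line validator dump to a single actionable correction."""
    for line in verbose_error.splitlines():
        if line.startswith("suggestion:"):
            return line.removeprefix("suggestion:").strip()
    return "query invalid; no suggestion available"

verbose = """error: unbalanced parenthesis at position 27
trace: parser state S12 -> S4 -> FAIL
suggestion: close the group: source:cloudtrail (@evt.name:CreateUser)"""

print(summarize_error(verbose))
# -> close the group: source:cloudtrail (@evt.name:CreateUser)
```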
**Approaches to semantic accuracy.** Four strategies explored: (1) external documentation — not representative of live data; (2) log clustering/aggregation — easy to misjudge which facets to aggregate on; (3) ingestion pipeline pre-processing (static schema from grok parsers and remappers) — misses information, consumes context, changes over time; (4) live sampling — computationally expensive. The final approach is a coordinated combination, shifting more of the computational burden to tools rather than the agent.
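One way to read "shift the burden to tools" is a tool that merges the static pipeline schema with a small live sample and hands the agent only a compact field summary. The function and schema format below are assumptions for illustration, not the talk's actual mechanism.

```python
# Sketch of a coordinated schema-discovery tool: combine declared pipeline
# fields with fields actually observed in a live sample, and return only a
# small summary so the sampling cost stays in the tool, not the agent context.

from collections import Counter

def field_summary(pipeline_schema, live_sample, top_n=5):
    """Merge declared fields with fields observed in live logs."""
    observed = Counter()
    for log in live_sample:
        observed.update(log.keys())
    fields = {}
    for name in set(pipeline_schema) | set(observed):
        fields[name] = {
            "declared": name in pipeline_schema,      # from grok/remapper config
            "seen_in_sample": observed.get(name, 0),  # from live data
        }
    # Keep only the most frequently observed fields to keep context small.
    ranked = sorted(fields.items(), key=lambda kv: -kv[1]["seen_in_sample"])
    return dict(ranked[:top_n])

schema = ["usr.name", "evt.name", "network.client.ip"]
sample = [{"usr.name": "a", "evt.name": "CreateLoginProfile"},
          {"usr.name": "b", "evt.name": "CreateUser", "requestParameters": {}}]
summary = field_summary(schema, sample)
print(sorted(summary))
```

This also surfaces the failure modes named above: declared-but-unseen fields (documentation drift) and seen-but-undeclared fields (what static pipeline pre-processing misses).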
**Evaluation strategy.** Traditional input/output benchmarks with a single accuracy score were found to be misleading — high benchmark scores didn't correlate with user trust. The team prioritized: comparative A/B testing (hunting copilot users vs. non-users on the same hunt week), a tight user feedback loop, and limiting traditional benchmarks to syntactic accuracy (using Datadog's built-in syntax validator) and response speed.
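Limiting traditional benchmarks to what can be measured reliably — syntax validity and latency — might look like the harness below. `validate_syntax` is a toy stand-in for Datadog's built-in syntax validator, not a real API.

```python
# Minimal sketch of a benchmark restricted to syntactic accuracy and speed,
# leaving semantic quality to A/B tests with real hunters.

import time

def validate_syntax(query: str) -> bool:
    """Toy validator: balanced parentheses and a non-empty query."""
    return bool(query) and query.count("(") == query.count(")")

def benchmark(generate_query, hypotheses):
    """Score only what we can measure reliably: syntax validity and latency."""
    valid, latencies = 0, []
    for h in hypotheses:
        start = time.perf_counter()
        q = generate_query(h)
        latencies.append(time.perf_counter() - start)
        valid += validate_syntax(q)
    return {"syntax_accuracy": valid / len(hypotheses),
            "avg_latency_s": sum(latencies) / len(latencies)}

result = benchmark(lambda h: f"source:cloudtrail (@evt.name:{h})",
                   ["CreateUser", "CreateLoginProfile"])
print(result["syntax_accuracy"])  # -> 1.0
```

Note what the harness deliberately does not score: whether the query surfaces the *right* data — per the talk, that signal only comes from comparative testing with users.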
**Production results from an AWS privilege escalation hunt week.** A/B test: Team A used the hunting copilot; Team B (more experienced hunters) used only ChatGPT. The copilot cut average query iteration time from 10 to 5 minutes and saved 25 of 60 minutes per hypothesis. All response times remained under 6 minutes. The tool excelled at hypothesis relevancy and context awareness but struggled on complex hypotheses with high data volume (truncation prevented complete log analysis). The hunt produced 2 new cloud detection rules and 0 incidents. A notable success: the agent identified suspicious IAM CreateLoginProfile events, correctly identified that some users were test threat actors created by the security research team (as verified post-hoc), and recommended follow-up verification actions.
## Key Takeaways
- LLMs are bad at threat hunting out of the box; don't expect zero-shot performance on complex schema-heavy tasks
- Semantic accuracy (queries that answer the right question) is harder than syntax accuracy and requires live data validation, not documentation lookup
- Log schema exploration — deriving schema from the data itself — is the key reframe that unlocks accuracy
- Reasoning models are too slow for production threat hunting; save them for specific high-value sub-tasks
- Multi-agent orchestrator-expert frameworks manage context effectively but add latency and coordination complexity
- Keep agent chains to ≤ 3 layers to avoid downstream looping failures
- Tools should return concise corrections, not verbose outputs — take the correction burden off the agent
- Traditional eval benchmarks mislead; A/B testing with real users is the most meaningful signal
- Current automation boundary: AI handles query generation, iterative field discovery, and synthesis; humans drive the hunt and define the playbook
## Notable Quotes / Data Points
- Datadog telemetry: logs with up to 450-day lookback, mix of faceted and unfaceted fields, customizable ingestion pipelines
- V1 syntax error rate: 25%
- V2 syntax accuracy improvement: +17% vs V1
- V2 response time: avg 6 minutes (vs 5 seconds for V1); edge cases 30–60 minutes
- A/B test: Hunting Copilot cut query iteration time from 10 min → 5 min, saved 25 of 60 min per hypothesis
- Hunt week result: 2 new cloud detection rules, 0 incidents
- Agent chains of length >3 are "subject to getting stuck downstream, especially if there was looping"
- Fine-tuning rejected: "too high risk and high effort with expected low return" given data quality constraints
#unprompted #claude