Speaker:: Aaron Brown & Madhur Prashant
Title:: Trajectory-Aware Post-Training Security Agents
Duration:: 26 min
Video:: https://www.youtube.com/watch?v=4zoYCfHwhEk
## Key Thesis
General-purpose frontier models are poorly suited for multi-stage security tasks (penetration testing, incident response) because they are optimized for breadth rather than depth on long-horizon agentic workflows. Post-training — specifically supervised fine-tuning followed by online reinforcement learning on security-specific trajectory data — can substantially improve specialized task performance; Aaron Brown released an open-source framework ("Open Trajectory Gym") to make this accessible to practitioners without a PhD-level ML background.
## Synopsis
Aaron Brown (background in tech, government, and open source; focused on model post-training and frontier agent design) delivered a dense technical talk on post-training small language models for security-specific agentic tasks, anchored by the release of the "Open Trajectory Gym" project at the time of the talk.
**The problem with general models on security tasks:** Current frontier models (including GPT-5 class) perform well on atomic security tasks — identifying a single LFI or XSS vulnerability, for example — but are poor at multi-stage tasks that require chaining vulnerabilities or sustaining coherent strategy over many turns. This is because base models are optimized for a broad distribution of tasks. Security-specific post-training is the fix.
**Post-training landscape overview:** Brown explained the three-stage model development pipeline:
1. **Pre-training** — learning from large web corpora (what makes GPT-3 class models useful at all)
2. **Instruction tuning / SFT (Supervised Fine-Tuning)** — aligning the model to a specific task format (function calling, QA, tool invocations)
3. **Reinforcement Learning (RL)** — optimizing for verifiable task outcomes via reward signals
For security agents, the relevant stage is SFT + RL. The key innovation is training on **expert trajectories** — recording how stronger models (GLM5, Kimi K2) solve tasks, then distilling that behavior into smaller, faster, cheaper models.
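An expert trajectory for this kind of distillation is typically a chat-message log with tool calls plus a verifiable outcome. The record below is an illustrative sketch — the field names and schema are assumptions, not the talk's format:

```python
# A hypothetical expert trajectory as a stronger teacher model might
# produce it. Field names are illustrative, not a schema from the talk.
trajectory = {
    "task_id": "ctf-web-001",
    "messages": [
        {"role": "system", "content": "You are a pentesting agent with shell access."},
        {"role": "user", "content": "Capture the flag at http://target:8080."},
        {"role": "assistant", "tool_calls": [
            {"name": "shell", "arguments": {"cmd": "curl -s http://target:8080/robots.txt"}}
        ]},
        {"role": "tool", "name": "shell", "content": "Disallow: /admin"},
        {"role": "assistant", "content": "Found /admin; enumerating next."},
    ],
    "reward": 1.0,  # verifiable outcome: flag captured
}

# SFT distills the teacher's behavior into a smaller model by maximizing
# the likelihood of the assistant turns (including tool-call tokens).
assistant_turns = [m for m in trajectory["messages"] if m["role"] == "assistant"]
```

In SFT, only the assistant turns are loss-bearing; the system, user, and tool turns are conditioning context.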
**Agent architecture review:** Brown defined an agent as a stateless model invoked each turn with: system prompt + tool schemas (function names, argument types, doc strings) + full conversation history, producing the next action as output. Tools include shell execution, code generation, memory stores, and environment interaction. The "long horizon" problem arises because security tasks require many turns with large KV caches, making both inference and training expensive.
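The stateless invocation pattern can be sketched as a minimal agent loop. `call_model` and the tool-dict shape are placeholders, not Open Trajectory Gym's API:

```python
def run_agent(call_model, tools, task, max_turns=25):
    """Minimal agent loop: the model is stateless, so the full history
    (system prompt + tool schemas + all prior turns) is resent every
    turn -- this is why long horizons blow up the KV cache."""
    system = "You are a security agent."
    schemas = [t["schema"] for t in tools.values()]
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = call_model(system=system, tools=schemas, messages=history)
        history.append({"role": "assistant", **action})
        if action.get("final"):           # model declares the task done
            return action["content"], history
        tool = tools[action["name"]]      # dispatch the requested tool
        result = tool["fn"](**action["arguments"])
        history.append({"role": "tool", "name": action["name"], "content": result})
    return None, history                  # horizon exhausted, no answer
```

Each turn re-encodes the entire history, so cost grows roughly quadratically with turn count unless the KV cache is reused across turns.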
**Open Trajectory Gym:** Released one hour before the talk, this open-source project (built on TRL, Sky Computing Lab's SkyRL, and related projects) provides a practitioner-level framework for building security-specific post-training pipelines. Key capabilities:
- Bring Your Own Benchmark and Bring Your Own Model
- Support for LangChain, AutoGen, Strands agents
- Synthetic data generation module (world manifest → teacher model → synthetic traces)
- Reward function composability (binary, temporal decay, information sparsity, uniqueness)
- GRPO (Group Relative Policy Optimization) — runs 8 agent instances in parallel on the same task, scores each trace relative to the group mean reward, and updates the model weights toward above-average traces
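The group-relative scoring at the heart of GRPO can be sketched as follows — a simplified illustration of the advantage computation, not Open Trajectory Gym's implementation:

```python
def grpo_advantages(rewards):
    """GRPO scores each rollout against its group:
    advantage = (reward - group mean) / group std.
    Rollouts above the group mean push the policy toward their actions,
    those below push away -- no learned value network is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # avoid division by zero when all rewards tie
    return [(r - mean) / std for r in rewards]

# 8 parallel rollouts of the same task, as described in the talk:
adv = grpo_advantages([0.0, 0.0, 1.0, 0.0, 0.3, 0.0, 1.0, 0.0])
```

Note the degenerate case: if all 8 rollouts fail (or all succeed), every advantage is zero and the group contributes no gradient — one reason progressive difficulty scaling matters.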
**Training pipeline executed for the talk:** Brown used Qwen 3.5-27B (dense model, released 2 days before the talk), trained against Cybench — 40 web application and static code analysis challenges spanning CTF-style flag capture and CVE identification. Three stages:
1. Supervised fine-tuning on ~285 open-source expert traces (tool calling format and terminology alignment)
2. Online RL on the Cybench environment with composite reward functions
3. Test-time: Genetic Prompt Evolution (VERP) — a prompt optimizer that evolves the system prompt between steps using an out-of-band LLM reviewing past tool calls and traces
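The composite reward in stage 2 might combine the four signal types listed earlier (binary, temporal decay, information sparsity, uniqueness). The weights and term definitions below are illustrative guesses, not the talk's exact functions:

```python
def composite_reward(solved, turns_used, max_turns, new_facts, trace, seen_traces):
    """Illustrative composite reward mixing the four signal types named
    in the talk. All weights and caps are made up for this sketch."""
    binary = 1.0 if solved else 0.0                        # verifiable outcome
    decay = binary * (1.0 - turns_used / max_turns)        # temporal decay: solve faster
    sparsity = min(new_facts, 5) / 5.0 * 0.2               # reward newly discovered info
    uniqueness = 0.1 if trace not in seen_traces else 0.0  # penalize repeated strategies
    return binary + decay + sparsity + uniqueness
```

A binary-only reward gives the all-zero-gradient problem on hard tasks; the sparsity and uniqueness terms provide signal even on failed rollouts.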
**Results:** Baseline solve rate on Cybench: ~12.5%. After SFT + RL: ~35% solve rate. Brown explicitly noted this is not a SOTA claim — it is a demonstration of the recipe's validity with limited compute and a single conference-sprint-sized run. Larger open-weight models (Minimax 2.5, GLM5, ~5-10x the parameter count) would show substantially higher starting baselines.
**Case study:** A crypto challenge requiring reverse engineering of an AES variant with a shuffled nonlinear layer, interacting with a network oracle, and writing a solver — the base model failed after ~25 turns; the tuned model solved it in fewer turns.
**Lessons learned:**
- Use dense models for RL, not MoE (mixture-of-experts) — weight synchronization with MoE expert nodes is unsolved in open-source frameworks
- Progressive difficulty scaling: start the agent at easy challenges and progress; don't start at expert level
- Long-horizon tasks are compute-expensive beyond just model size — KV cache for 128K context windows requires 2-3x the VRAM of the model itself at inference
- Binary rewards are a starting point; composite/progressive rewards provide better training signal
- TRL is the foundational framework; Unsloth is optimized for SFT/instruction tuning (not RL); SkyRL and VERP extend the RL capabilities
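The KV-cache VRAM point can be made concrete with back-of-envelope arithmetic. The layer/head dimensions below are assumptions for a generic ~27B dense transformer with grouped-query attention, not published model specs:

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, batch=1, bytes_per=2):
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim
    * context length * batch * bytes per element (2 for bf16)."""
    return 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per / 1024**3

# Assumed dims for a generic ~27B dense model (62 layers, 8 KV heads of
# dim 128 -- illustrative only):
per_seq = kv_cache_gb(layers=62, kv_heads=8, head_dim=128, context_len=128_000)
group = kv_cache_gb(layers=62, kv_heads=8, head_dim=128, context_len=128_000, batch=8)
```

Under these assumptions a single 128K-token sequence costs roughly 30 GB of KV cache, and 8 parallel GRPO rollouts cost roughly 240 GB — several times the weights of a ~60 GB model, before any optimizer state for training.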
## Key Takeaways
- Frontier general models fail at multi-stage security tasks (vulnerability chaining, long-horizon pentesting) — post-training is required, not optional
- SFT + RL on security-specific trajectory data produces meaningful performance lifts (12.5% → 35% on Cybench) even with limited compute
- GRPO runs 8 parallel agent instances per task and reinforces traces that score above the group mean reward — effectively learning from the model's own strongest rollouts, with no learned value network required
- Composite reward functions (binary + temporal decay + information sparsity + uniqueness) significantly outperform binary-only reward signals
- MoE models are currently unusable for RL post-training in open source — use dense models
- Security currently lacks the equivalent of SWE-bench/math benchmark infrastructure — no widely trusted, rigorous eval framework
- Long-horizon RL environments need better design; current security benchmarks (Cybench, CVE-bench) are incomplete but functional starting points
- Open Trajectory Gym is available open source now; Discord community available for practitioners
## Notable Quotes / Data Points
- Cybench: 40 challenges across web application pentesting and static code analysis
- Training data used: 285 open-source expert traces from stronger models (typical production scale: 20,000-30,000 traces for SFT)
- Qwen 3.5-27B results: 12.5% baseline → 35% solve rate after SFT + RL
- Inference VRAM note: a 27B model at 60GB VRAM load + 128K context = easily 2-3x that VRAM requirement
- GLM5 and Kimi K2 cited as strong open-weight teacher models for trajectory distillation
- "We don't have great evals, we don't have great measures, no one believes anything" — on the state of security ML benchmarks
- VERP (Genetic Prompt Evolution) evolves the system prompt between turns without modifying model weights — pure test-time optimization
- "Every RL training program needs to have a verifiable reward at the end" — the fundamental constraint for security RL environment design