Speaker:: Carl Hurd
Title:: Glass-Box Security: Operationalizing Mechanistic Interpretability
Duration:: 28 min
Video:: https://www.youtube.com/watch?v=JZlaijmG-Ng

## Key Thesis

Current host-based and network-based AI security solutions operate only on the plaintext portions of a model's lifecycle and are fundamentally insufficient for deep behavioral detection. The next generation of AI security requires "glass-box" introspection — capturing activations during the model's forward pass and using mechanistic interpretability techniques (intent via cosine similarity, strength via scalar projection) to detect concepts and behaviors in the model's native high-dimensional space before they manifest as actions.

## Synopsis

Carl Hurd, co-founder of Star Security, drew on a background spanning national labs, seven years of zero-day research at Cisco Talos (with an ICS/embedded-systems focus), and DARPA formal methods work to argue that AI security is repeating a maturity arc the industry has already traveled in traditional security — and shouldn't have to.

Hurd's framing of the problem: every current AI security product, regardless of marketing, is either host-based (eBPF/ETW) or network-based (prompt firewalls, TLS inspection). Both categories share a critical limitation — they operate only on content that is **in plaintext**, which represents roughly half of the model's processing lifecycle. Once text is embedded and begins passing through transformer layers, no current commercial product has visibility into what the model is "thinking."

He introduced **glass-box security** as the solution concept, built on two pillars:

**1. Intent** — captured via mechanistic interpretability hooks on the model's forward pass. These hooks collect activation vectors from specific layers of the model during inference (no backward pass required).
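The two pillars can be sketched numerically. This is a minimal illustration assuming activation and concept vectors have already been captured; the function names, toy vectors, and thresholds are hypothetical, not Star Security's implementation:

```python
import numpy as np

def intent_similarity(activation: np.ndarray, concept: np.ndarray) -> float:
    """Intent: cosine similarity between a captured activation vector and a
    pre-captured concept vector (e.g. an "illegality" direction)."""
    return float(np.dot(activation, concept) /
                 (np.linalg.norm(activation) * np.linalg.norm(concept)))

def intent_strength(activation: np.ndarray, concept: np.ndarray) -> float:
    """Strength: scalar projection of the activation onto the unit concept
    direction -- how large a "shadow" the concept casts."""
    return float(np.dot(activation, concept) / np.linalg.norm(concept))

# Toy example: two activations pointing the same way as a "theft" concept
# vector, but with different magnitudes along it.
concept = np.array([1.0, 0.0, 0.0])
rob_bank = np.array([4.0, 0.5, 0.0])    # strong "theft" component
steal_pen = np.array([0.6, 0.5, 0.0])   # weak "theft" component

print(intent_similarity(rob_bank, concept))  # close to +1: intent present
print(intent_strength(rob_bank, concept))    # 4.0
print(intent_strength(steal_pen, concept))   # 0.6
```

Both prompts share the direction (intent), but the scalar projection separates them by magnitude (strength), mirroring the rob-a-bank vs. steal-a-pen example below.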
The technique is well-established in interpretability research (Anthropic blog posts, Google DeepMind) via linear probes, sparse autoencoders, and differential prompt analysis. To detect a concept like "illegality," you compare the activation vector of the prompt against a pre-captured "illegality" vector using cosine similarity. If they point in the same high-dimensional direction, the intent exists in the prompt.

**2. Strength** — measured via latent-space geometry. Cosine similarity alone tells you direction but not magnitude. Using scalar projection (the dot product of the activation vector with the unit intent vector), you can measure how large a "shadow" that specific intent casts relative to the total magnitude of the activation tensor. This distinguishes "how do I rob a bank" from "how do I steal a pen from a bank" — both involve theft, but the strength of the concept differs, and LLMs actually do respond differently (Gemini example: it refuses the first and gives a tongue-in-cheek answer about the second).

Combining intent and strength provides the basis for **behavior-based detection in neural networks** — analogous to how EDRs moved from pure signature detection to behavioral analysis. The linear representation hypothesis supports building detection "manifolds" (multi-layer composite detections) that serve as trip wires, enabling not just detection but remediation or generation halting.

**Engineering challenges and solutions:**

1. *Activations unavailable for frontier models* → Use a "canary model" approach: instrument smaller open-weight models running in-line or async as interpretability proxies for large closed models
2. *Activation data volume is enormous* → GPT-NeoX 20B generates ~4MB of activation data per token; a full context window = ~10TB. Mitigation: hook only the residual stream (not attention heads), and only monitor layers empirically identified as most active for the target detection
3. *Writing detection content* → Progressive context enhancement: combine plaintext host/network signals with model-layer introspection signals; expose these via Yara modules and Cedar policies using familiar detection-engineering primitives
4. *Detection content is not universal* → Adopt open-source detection frameworks and extend them; context determines what's anomalous for a given agent or user — same as ICS/SCADA security, where "good" and "bad" packets are context-dependent

For **agentic workflows** specifically, Hurd argued that agents in observe-decide-act loops compound the detection problem: an agent pursuing a goal will try known-good techniques (e.g., using packet-capture tools to execute binaries with elevated privileges — a known CTF technique) that syntactic rules can't anticipate. Mechanistic interpretability provides **semantic traceability** — you can measure which parts of the model were activated during decision-making and verify the agent didn't find a loophole or reward-hack its way to the objective.

The ideal future detection content he described looks like a Yara rule with a mechanistic interpretability module: "if file deletion intent exists on layers [X, Y, Z] with magnitude > threshold, block."
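A rule of that shape can be sketched as a small policy check: per-layer scalar projections onto a concept direction, gated by a multi-layer threshold. Everything here — the layer indices, the "file deletion" direction, the threshold, the function names — is hypothetical illustration of the idea, not an actual Yara module:

```python
import numpy as np

def layer_strengths(activations: dict, concept: np.ndarray) -> dict:
    """Scalar projection of each layer's residual-stream activation onto
    the unit concept direction."""
    unit = concept / np.linalg.norm(concept)
    return {layer: float(np.dot(vec, unit)) for layer, vec in activations.items()}

def rule_fires(activations: dict, concept: np.ndarray,
               layers: tuple, threshold: float) -> bool:
    """Fire when the concept's magnitude exceeds the threshold on every
    monitored layer -- a multi-layer "manifold" trip wire."""
    strengths = layer_strengths(activations, concept)
    return all(strengths.get(layer, 0.0) > threshold for layer in layers)

# Toy residual-stream activations for three monitored layers.
file_deletion = np.array([0.0, 1.0, 0.0])   # hypothetical concept direction
acts = {10: np.array([0.1, 2.4, 0.3]),
        14: np.array([0.0, 3.1, 0.1]),
        18: np.array([0.2, 2.8, 0.0])}

if rule_fires(acts, file_deletion, layers=(10, 14, 18), threshold=2.0):
    print("block generation")   # remediation / generation halting
```

Requiring the threshold on several layers at once is what makes this a composite detection rather than a single-point rule: a spurious spike on one layer does not trip it.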
## Key Takeaways

- All current AI security products operate only on plaintext — they miss the majority of the model's processing lifecycle
- Mechanistic interpretability hooks on the residual stream provide visibility into model "thought" during inference without requiring backward passes
- Intent (cosine similarity) + strength (scalar projection/dot product) together form behavior-based detection analogous to EDR behavioral analysis
- The linear representation hypothesis enables building composable, multi-layer detection manifolds rather than single-point rules
- The canary-model architecture solves the "closed frontier model" problem by running instrumented smaller models as interpretability proxies
- Residual-stream-only hooking keeps activation data volumes manageable (avoids quadratic attention-head costs)
- Detection content universality is an industry-wide unsolved problem — context determines what's anomalous, same as in ICS security
- Sovereign inference infrastructure is a requirement for organizations taking model security seriously
- Fine-tuning drift: if fine-tuning freezes the layers where detection manifolds were calibrated, the manifolds remain valid; if those layers are modified, recalibration is needed

## Notable Quotes / Data Points

- GPT-NeoX 20B: ~4MB activation data per first token; full context window fill = ~10TB activation data
- GPT-NeoX 20B has 44 layers — every visualization showing "3 layers" is unrealistically simplified
- "Autonomous agents bypass traditional perimeter boundaries. Static syntactic guardrails are not going to be effective long-term. We have to intercept the thought before the action occurs."
- "Semantic observability is really the new foundation for security moving forward"
- "Sovereign infrastructure is a requirement for anyone that wants to take secure model usage seriously"
- Cosine similarity range: +1 (perfectly aligned) to -1 (perfectly opposed)
- Indirect prompt injection detection via before/after RAG-pipeline activation comparison was identified as a promising application
- Carl previously co-authored Badgerboard (a PLC backplane IDS/IPS from Cisco Talos) and reverse engineered the VPNFilter malware
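The before/after RAG comparison mentioned above can be sketched as a delta check: capture an activation for the query alone, capture it again after retrieved documents enter the context, and flag a jump toward an injection-concept direction. The vectors, the injection direction, and the 0.5 threshold are all hypothetical toy values:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def injection_delta(pre_rag: np.ndarray, post_rag: np.ndarray,
                    injection_dir: np.ndarray) -> float:
    """How much closer the activation moved toward the injection-concept
    direction after retrieved documents entered the context."""
    return cosine(post_rag, injection_dir) - cosine(pre_rag, injection_dir)

injection_dir = np.array([0.0, 0.0, 1.0])   # hypothetical concept vector
pre = np.array([1.0, 0.4, 0.1])             # activation: query alone
post = np.array([0.9, 0.3, 2.5])            # activation: query + retrieved docs

delta = injection_delta(pre, post, injection_dir)
print(delta > 0.5)   # large jump toward the injection direction -> flag
```

The query itself is benign in both captures; only the shift introduced by the retrieved content trips the detector, which is what makes this a fit for indirect prompt injection.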