Reed - Are You Thinking What I'm Thinking

Speaker:: Jackson Reed Title:: Are you thinking what I'm thinking? Duration:: 5 min Video:: https://www.youtube.com/watch?v=j2_VsH6aNzY ## Key Thesis The cryptographic signatures protecting reasoning/thinking blocks in models like Claude and GPT-4o only verify that a block originated from the provider's model — they do not bind the reasoning to a specific conversation, session, or API key. This allows an attacker to harvest reasoning blocks from one conversation and inject them into another, causing the model to "remember" thinking about something it never actually thought about. ## Synopsis Reasoning models (Anthropic, OpenAI, Gemini) emit "thinking blocks" — special token sequences that enrich the model's context and improve response quality. These blocks are cryptographically protected: Anthropic and OpenAI provide either an encrypted blob or an HMAC signature, so you can't tamper with the content without the API rejecting it. Reed's discovery: the protection only checks "did a model from our provider actually produce this?" — not "did this model produce this in THIS conversation?" The blocks are not bound to conversation ID, session, or API key. This means reasoning blocks are replayable across conversations and even across different API keys. The live demo showed this concretely: Reed asked Claude Haiku about the capital of the Île-de-France region (Paris), captured the reasoning block, then asked a different question about the capital of the Alsace region (Strasbourg). He injected the Paris reasoning block into the Strasbourg conversation and deactivated the genuine Strasbourg thinking. When then asked "what were you thinking before giving your answer?", Claude responded that it had briefly thought about Paris before giving the correct answer — acknowledging false reasoning it never actually performed. Reed notes the steering is reportedly even more effective on OpenAI than Anthropic. His interpretation: the cryptographic check was likely implemented to prevent malformed reasoning blocks from "dorking the model" — a technical integrity check, not a security control. None of Anthropic, OpenAI, or Gemini bind reasoning to context (though Gemini's different signature approach may have incidentally fixed this). Reed planned to publish a blog post with full details the following week. ## Key Takeaways - Reasoning block signatures verify provider origin only — not conversation, session, or API key binding - Reasoning blocks are replayable: harvest from conversation A, inject into conversation B - The attack works cross-API-key in some providers - OpenAI's steering via this technique is reportedly more effective than Anthropic's - Gemini may be incidentally resistant due to a different signature architecture - This appears to be a missed threat model item rather than an intentional security design choice - A model can be made to "confess" to reasoning it never performed, which has implications for any system that audits or relies on model thinking for trust ## Notable Quotes / Data Points - Demo: injected "Paris" reasoning into a "Strasbourg capital" query; model reported thinking about Paris before answering correctly - "I think it was just kind of missed in threat modeling" — Reed's assessment of why providers don't bind reasoning to context - "The steering is actually more effective and more concerning from OpenAI than it is on Anthropic" - Gemini's approach is "completely different" and may incidentally prevent this - Planned blog post with full harness details for Anthropic; OpenAI harness in development at time of talk #unprompted #claude