Speaker:: Dongdong Sun
Title:: From OSINT Chaos to Knowledge Graph
Duration:: 27 min
Video:: https://www.youtube.com/watch?v=lib_KZKISOo

## Key Thesis

Open source threat intelligence is overwhelming in volume (10,000+ reports/week from 200+ sources) and insufficiently structured for query-driven analysis. The solution is a semi-structured knowledge graph extracted by LLMs in a multi-step pipeline, combined with a graph-traversal agent that can answer multi-hop threat intelligence questions with source attribution, avoiding model hallucination by grounding answers in the extracted graph.

## Synopsis

Sun (ML engineer at Palo Alto Networks working on threat intelligence) frames the OSINT problem concisely: the data is theoretically free, but consuming it is not. 200+ tracked sources, ~10,000 reports/week, and multiple vendors covering overlapping events with different details. Reading everything is impossible; indicator feeds provide structured IOCs but strip out the contextual narrative that makes intelligence actionable. The real value is in the unstructured text.

Current approaches fail in two ways: humans can't keep up with the volume, and simply dumping reports into an LLM doesn't work because LLMs lack current information and have no principled structure. STIX exists but is machine-readable only and generates unwieldy output that is hard for both humans and AI to work with. The solution space is "semi-structured": enough structure to enable querying, enough unstructured text to retain context. The knowledge graph ontology naturally fits threat intelligence because cyber data is interconnected: threat actors → malware → campaigns → vulnerabilities → victims → indicators, all with typed relationships.

The extraction pipeline is multi-step to achieve reliability at scale:

1. **Skeleton extractor** (router): scans the report for worthwhile entities and routes them to per-entity-type extraction components
2. **Entity extraction**: detailed extraction per entity type, grounded to exact phrasing in the report where possible
3. **MITRE ATT&CK mapping**: LLMs have good recall of the ATT&CK framework but produce fuzzy matches at query time. The fix: extract behavior descriptions, generate candidate ATT&CK technique IDs, then let the LLM read the actual ATT&CK technique descriptions from a knowledge base and self-correct to the final mapping
4. **Relationship extraction**: constrained to a pre-defined set of entity-relationship triplets; each relationship includes a prose description of how the entities connect (not just a verb), providing contextual clues in the graph

A demo used a BeyondTrust critical vulnerability report: the pipeline extracted 100+ entities and 100+ relationships from a single 11-minute-read report. Graph traversal surfaced VShell malware's infection vectors, historical campaigns, related vulnerability exploitation, and a threat actor from 2024 who exploited the same BeyondTrust vulnerability. Extracting and organizing that context by manual reading would take 40-50 minutes.

The agent uses graph traversal starting from a pivot node, walking relationships to gather subgraphs, with semantic search to find related entities not directly linked. Multi-hop query example: "What vulnerabilities does APT28 exploit, and what other groups exploit the same vulnerabilities?" The agent resolves aliases (APT28 has multiple names across reports), hops from threat actor → exploitation relationship → vulnerabilities → other threat actors, and returns a sourced answer with an investigation graph showing every step.

Evaluation was a key investment, consuming 50%+ of engineering time. It started with small human-curated datasets and iterative refinement. Automatic prompt optimization created problems: an unconstrained loop would fix one issue and break another. The solution was to partition the prompt into sections the LLM may modify and sections only humans can change.
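The partitioning idea can be sketched as a guard around the optimization loop. The section names, the dict-based prompt format, and the merge function here are illustrative assumptions, not the talk's actual implementation:

```python
# Sketch: the optimizer LLM may rewrite only designated prompt sections;
# human-owned sections are locked. Section names are made up for illustration.

LOCKED = {"task_definition", "output_schema"}          # human-only sections
EDITABLE = {"extraction_hints", "few_shot_examples"}   # LLM may rewrite these


def apply_optimizer_edits(prompt_sections: dict, proposed: dict) -> dict:
    """Merge an optimizer's proposed rewrites, rejecting edits to locked sections."""
    merged = dict(prompt_sections)
    for name, text in proposed.items():
        if name in LOCKED:
            raise ValueError(f"optimizer may not modify locked section {name!r}")
        if name not in EDITABLE:
            raise ValueError(f"unknown section {name!r}")
        merged[name] = text
    return merged


def render(prompt_sections: dict) -> str:
    """Assemble the full prompt in a fixed section order."""
    order = ["task_definition", "extraction_hints", "few_shot_examples", "output_schema"]
    return "\n\n".join(prompt_sections[name] for name in order)


base = {
    "task_definition": "Extract threat entities from the report.",
    "extraction_hints": "Prefer exact report phrasing.",
    "few_shot_examples": "(examples elided)",
    "output_schema": "Return JSON matching the entity schema.",
}
# An optimizer proposal touching an editable section is merged...
updated = apply_optimizer_edits(base, {"extraction_hints": "Quote entity spans verbatim."})
# ...while a proposal touching a locked section raises ValueError.
```

The point of the guard is that the convergence problem disappears structurally: the loop can only perturb the parts of the prompt whose drift is acceptable, while the task definition and output contract stay fixed between human reviews.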
The reflection aggregator also surfaced annotation inconsistencies between human researchers: when two researchers disagreed on whether something was malware or a tool, the LLM flagged the conflict and sent it back to the researchers to resolve. Thinking models performed worse than non-thinking models on this task because they over-relied on internal model knowledge rather than grounding on the provided source text.

## Key Takeaways

- OSINT is not free: volume (10K reports/week, 200+ sources) makes human consumption impossible without automation
- STIX is insufficient: machine-readable but unwieldy for both humans and LLMs; a semi-structured knowledge graph is the better ontology
- Multi-step extraction (router → per-entity extractor → MITRE mapping → relationship extraction) is necessary for reliable entity/relationship extraction at scale
- MITRE ATT&CK mapping requires self-correction: generate candidates, then ground against the knowledge base to reduce fuzzy-match noise
- A graph-traversal agent with source attribution prevents hallucination: every answer is grounded in extracted graph nodes, not model memory
- Alias resolution is critical: APT28 has multiple names across reports; failing to merge aliases produces fragmented intelligence
- Automatic prompt optimization loops can diverge; constrain what the LLM is allowed to change vs.
human-only sections
- Thinking models performed worse than non-thinking models for this extraction task; they over-rely on model knowledge over source documents
- 90% accuracy achieved after 2-3 human-annotation iterations on small seed datasets

## Notable Quotes / Data Points

- 200+ tracked OSINT sources; ~10,000 reports/week ingested
- BeyondTrust demo report: 11 min to read, 40-50 min to extract manually; pipeline extracted 100+ entities and 100+ relationships
- 50%+ of engineering time invested in evaluation infrastructure
- 90% accuracy on most entity types after 2-3 annotation iterations
- Thinking models performed worse for grounding tasks: "thinking too much, relying on model knowledge"
- Newer LLM models didn't necessarily improve benchmark performance; evaluation caught regressions
- Investigation graph provided with every answer, showing which nodes and hops were used to derive it
- Co-occurrence search (same entities mentioned in the same report) used for connections not expressible in the formal ontology

#unprompted #claude
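The multi-hop traversal with alias resolution described in the synopsis can be sketched as follows. The triplet schema (subject, relation, object, source report), the relation names, the inverse-edge convention, and the example data are all illustrative assumptions, not the talk's actual data model:

```python
from collections import defaultdict

# Sketch of a graph store that resolves aliases to canonical names before
# reading or writing edges, and a two-hop query over it. Every edge carries
# the source report ID so answers stay attributable.

class ThreatGraph:
    def __init__(self):
        self.edges = defaultdict(list)   # canonical node -> [(relation, node, source)]
        self.aliases = {}                # alias -> canonical name

    def add_alias(self, alias, canonical):
        self.aliases[alias] = canonical

    def resolve(self, name):
        return self.aliases.get(name, name)

    def add(self, subj, rel, obj, source):
        s, o = self.resolve(subj), self.resolve(obj)
        self.edges[s].append((rel, o, source))
        self.edges[o].append((rel + "_by", s, source))  # inverse edge for traversal

    def hop(self, node, rel):
        """One traversal step: follow edges of a given relation type."""
        return [(o, src) for r, o, src in self.edges[self.resolve(node)] if r == rel]


def who_else_exploits(graph, actor):
    """Multi-hop query: actor -> exploited CVEs -> other actors, with sources."""
    answer = {}
    for cve, _ in graph.hop(actor, "exploits"):
        answer[cve] = [(a, src) for a, src in graph.hop(cve, "exploits_by")
                       if a != graph.resolve(actor)]
    return answer


g = ThreatGraph()
g.add_alias("Fancy Bear", "APT28")                       # merge known aliases first
g.add("APT28", "exploits", "CVE-2023-23397", "report-001")
g.add("GroupB", "exploits", "CVE-2023-23397", "report-042")
result = who_else_exploits(g, "Fancy Bear")
# result maps the shared CVE to the other exploiting group with its source report
```

The alias table matters: without `add_alias`, the query pivoting on "Fancy Bear" would start from a node with no edges, which is exactly the fragmented-intelligence failure mode the takeaways warn about.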