Speaker:: Dongdong Sun
Title:: From OSINT Chaos to Knowledge Graph
Duration:: 27 min
Video:: https://www.youtube.com/watch?v=lib_KZKISOo
## Key Thesis
Open source threat intelligence is overwhelming in volume (10,000+ reports/week from 200+ sources) and insufficiently structured for query-driven analysis. The proposed solution is a semi-structured knowledge graph extracted by LLMs in a multi-step pipeline, combined with a graph-traversal agent that answers multi-hop threat intelligence questions with source attribution, avoiding hallucination by grounding every answer in the extracted graph rather than model memory.
## Synopsis
Sun (ML engineer at Palo Alto Networks working on threat intelligence) frames the OSINT problem concisely: the data is theoretically free, but consuming it is not. 200+ tracked sources, ~10,000 reports/week, multiple vendors covering overlapping events with different details. Reading everything is impossible; indicator feeds provide structured IOCs but strip out the contextual narrative that makes intelligence actionable. The real value is in the unstructured text.
Current approaches fail in two ways: humans can't keep up with volume, and simply dumping reports into an LLM doesn't work because LLMs lack current information and have no principled structure. STIX exists but is machine-readable-only and generates unwieldy output that's hard for both humans and AI to work with. The solution space is "semi-structured" — enough structure to enable querying, enough unstructured text to retain context.
The knowledge graph ontology naturally fits threat intelligence because cyber data is interconnected: threat actors → malware → campaigns → vulnerabilities → victims → indicators, all with typed relationships. The extraction pipeline is multi-step to achieve reliability at scale:
1. **Skeleton extractor** (router): scans the report for worthwhile entities, routes to per-entity-type extraction components
2. **Entity extraction**: detailed extraction per entity type, grounded to exact phrasing in the report where possible
3. **MITRE ATT&CK mapping**: LLMs recall the ATT&CK framework well but produce fuzzy matches at query time. The fix is self-correction: extract behavior descriptions, generate candidate ATT&CK technique IDs, then have the LLM read the actual technique descriptions from a knowledge base and revise to a final mapping
4. **Relationship extraction**: constrained to a pre-defined set of entity-relationship triplets; relationships include prose description of how they connect (not just a verb), providing contextual clues in the graph
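The pipeline stages above can be sketched as follows. This is a minimal illustration, not the production system: the entity names, CVE ID, and gazetteer are hypothetical toy data, and simple keyword matching stands in for the LLM calls at each step.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    etype: str      # "threat_actor", "malware", "vulnerability", ...
    evidence: str   # exact phrasing from the report, for grounding

@dataclass
class Relationship:
    source: str
    rel: str
    target: str
    description: str  # prose context, not just a verb

# Relationship extraction is constrained to a pre-defined triplet set,
# keeping the graph well-typed.
ALLOWED_TRIPLETS = {
    ("threat_actor", "uses", "malware"),
    ("threat_actor", "exploits", "vulnerability"),
    ("malware", "exploits", "vulnerability"),
}

def skeleton_extract(report, gazetteer):
    """Step 1 (router): cheap scan for entities worth extracting; routes
    each hit to a per-type extractor. A keyword gazetteer stands in for
    the LLM call here."""
    routed = {}
    for name, etype in gazetteer.items():
        if name in report:
            routed.setdefault(etype, []).append(name)
    return routed

def extract_entities(report, routed):
    """Step 2: detailed per-type extraction, grounded to the report's
    own text via an evidence snippet."""
    out = []
    for etype, names in routed.items():
        for name in names:
            i = report.find(name)
            out.append(Entity(name, etype,
                              report[max(0, i - 15): i + len(name) + 15]))
    return out

def extract_relationships(entities):
    """Step 4: emit only type-compatible triplets. The real system asks
    the LLM which pairs are actually related and for a prose description;
    here every compatible pair is emitted for illustration."""
    rels = []
    for a in entities:
        for b in entities:
            for st, rel, tt in ALLOWED_TRIPLETS:
                if (a.etype, b.etype) == (st, tt):
                    rels.append(Relationship(a.name, rel, b.name,
                                f"{a.name} {rel} {b.name} (see report context)"))
    return rels

# Toy report with hypothetical names/IDs:
report = "ActorX deployed ToolY after exploiting CVE-2024-0001 in the appliance."
gazetteer = {"ActorX": "threat_actor", "ToolY": "malware",
             "CVE-2024-0001": "vulnerability"}
entities = extract_entities(report, skeleton_extract(report, gazetteer))
rels = extract_relationships(entities)
```

The router keeps the expensive per-type extractors from running over irrelevant text, and the triplet allow-list prevents the LLM from inventing relationship types outside the ontology.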
A demo using a BeyondTrust critical vulnerability report: the pipeline extracted 100+ entities and 100+ relationships from a single 11-minute-read report. Graph traversal showed VShell malware's infection vectors, historical campaigns, related vulnerability exploitation, and a threat actor from 2024 who exploited the same BeyondTrust vulnerability — context that manual reading would require 40-50 minutes to extract and organize.
The agent uses graph traversal starting from a pivot node, walking relationships to gather subgraphs, with semantic search to find related entities not directly linked. Multi-hop query example: "What vulnerabilities does APT28 exploit, and what other groups exploit the same vulnerabilities?" — the agent resolves aliases (APT28 has multiple names across reports), hops from threat actor → exploitation relationship → vulnerabilities → other threat actors, and returns a sourced answer with an investigation graph showing every step.
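The multi-hop traversal can be sketched as below. The graph contents are hypothetical (`CVE-A`, `CVE-B`, `GroupZ` are placeholder nodes); the aliases shown for APT28 are well-known ones, illustrating the alias-resolution step before the walk begins.

```python
# Alias table: many names across reports resolve to one canonical node.
ALIASES = {"Fancy Bear": "APT28", "Sofacy": "APT28"}

# Hypothetical adjacency list of (relation, neighbor) edges.
GRAPH = {
    "APT28":  [("exploits", "CVE-A"), ("exploits", "CVE-B")],
    "CVE-A":  [("exploited_by", "APT28"), ("exploited_by", "GroupZ")],
    "CVE-B":  [("exploited_by", "APT28")],
    "GroupZ": [("exploits", "CVE-A")],
}

def resolve(name):
    """Merge aliases before traversal so intelligence isn't fragmented."""
    return ALIASES.get(name, name)

def multi_hop(pivot, hops):
    """Walk from a pivot node, following the given relation sequence.
    Every edge taken is recorded, so the answer ships with an
    'investigation graph' showing how it was derived."""
    frontier = {resolve(pivot)}
    trail = []
    for rel in hops:
        nxt = set()
        for node in frontier:
            for r, neighbor in GRAPH.get(node, []):
                if r == rel:
                    trail.append((node, r, neighbor))
                    nxt.add(neighbor)
        frontier = nxt
    return frontier, trail

# "What vulnerabilities does APT28 exploit, and what other groups
# exploit the same vulnerabilities?" -- starting from an alias.
vulns, _ = multi_hop("Fancy Bear", ["exploits"])
groups, trail = multi_hop("Fancy Bear", ["exploits", "exploited_by"])
other_groups = groups - {"APT28"}
```

Semantic and co-occurrence search would extend this by proposing pivot candidates that lack a direct edge; the traversal itself stays grounded in extracted edges only.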
Evaluation was a key investment: 50%+ of engineering time. The team started with small human-curated datasets and refined them iteratively. Automatic prompt optimization created problems — an unconstrained loop would fix one issue and break another — so the prompt was partitioned into sections the LLM can modify and sections only humans can change. The reflection aggregator also surfaced annotation inconsistencies between human researchers: when two researchers disagreed on whether something was malware vs. a tool, the LLM flagged the conflict and sent it back to the researchers to resolve. Thinking models performed worse than non-thinking models on this task because they over-relied on internal model knowledge rather than grounding on the provided source text.
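One way to constrain the optimization loop is a markup convention separating human-owned spans from optimizer-editable ones. The `<locked>` tag convention and the prompt text below are assumptions for illustration, not the talk's actual implementation:

```python
import re

# Hypothetical convention: human-only constraints are wrapped in
# <locked>...</locked>; the automatic optimizer may only rewrite the rest.
PROMPT = """<locked>You are a threat-intel extractor. Output JSON only.</locked>
Extract every malware family mentioned in the report.
<locked>Never invent entities that are absent from the source text.</locked>"""

def apply_optimized(prompt, optimizer):
    """Run an automatic prompt-optimizer only on unlocked spans,
    leaving human-owned constraints byte-for-byte intact."""
    parts = re.split(r"(<locked>.*?</locked>)", prompt, flags=re.S)
    return "".join(p if p.startswith("<locked>") else optimizer(p)
                   for p in parts)

# A trivial stand-in optimizer rewrites only the editable middle line.
rewritten = apply_optimized(PROMPT, lambda s: s.replace("every", "each"))
```

This keeps the diverging fix-one-break-another loop away from the invariants (output format, grounding rules) that the humans decided must never drift.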
## Key Takeaways
- OSINT is not free: volume (10K reports/week, 200+ sources) makes human consumption impossible without automation
- STIX is insufficient — machine-readable but unwieldy for both humans and LLMs; semi-structured knowledge graph is the better ontology
- Multi-step extraction (router → per-entity extractor → MITRE mapping → relationship extraction) is necessary for reliable entity/relationship extraction at scale
- MITRE ATT&CK mapping requires self-correction: generate candidates, then ground against the knowledge base to reduce fuzzy-match noise
- Graph traversal agent with source attribution prevents hallucination — every answer is grounded in extracted graph nodes, not model memory
- Alias resolution is critical: APT28 has multiple names across reports; failing to merge aliases produces fragmented intelligence
- Automatic prompt optimization loops can diverge — constrain what the LLM is allowed to change vs. human-only sections
- Thinking models performed worse than non-thinking models for this extraction task — they over-rely on model knowledge over source documents
- 90% accuracy achieved after 2-3 human-annotation iterations on small seed datasets
## Notable Quotes / Data Points
- 200+ tracked OSINT sources; ~10,000 reports/week ingested
- BeyondTrust demo report: 11 min to read, 40-50 min to extract manually; pipeline extracted 100+ entities and 100+ relationships
- 50%+ of engineering time invested in evaluation infrastructure
- 90% accuracy on most entity types after 2-3 annotation iterations
- Thinking models performed worse for grounding tasks — "thinking too much, relying on model knowledge"
- Newer LLM models didn't necessarily improve benchmark performance — evaluation caught regressions
- Investigation graph provided with every answer, showing which nodes and hops were used to derive the answer
- Co-occurrence search (same entities mentioned in same report) used for connections not expressible in the formal ontology
#unprompted #claude