Speaker:: Xenia Mountrouidou
Title:: Traditional ML vs LLMs: who can classify better?
Duration:: 8 min
Video:: https://www.youtube.com/watch?v=fAmr0N2rHIU
## Key Thesis
In head-to-head security classification benchmarks, traditional ML models (specifically XGBoost-style models with careful feature engineering) outperform LLMs on precision and recall — but LLMs are competitive zero-shot classifiers that require no training data, which makes them a genuinely useful complement rather than a replacement.
## Synopsis
Mountrouidou, a data scientist with 15 years in security data science at Expel (a managed detection and response company), frames this as a question that has been "torturing" her: can generative models do the boundary-finding work that traditional predictive models do for security classification?
Her comparison used network data, specifically packet captures from the CTU-13 botnet dataset, which she chose deliberately because it is unambiguous: well-labeled known-bad traffic versus clean benign activity, with no label doubt. The traditional ML model won on both precision and recall: it is better at finding evil and more precise in its calls. The LLM classifiers, however, are zero-shot: they require no training data, just prompting. That is a significant operational advantage.
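Precision and recall, the two metrics the talk uses to score XGBoost against the LLMs, can be computed directly from the confusion counts. A minimal sketch (the sample labels below are illustrative, not results from the talk):

```python
def precision_recall(y_true, y_pred):
    """Precision = TP/(TP+FP): how often a 'malicious' call is right.
    Recall = TP/(TP+FN): how much of the actual evil gets found."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 4 malicious (1) and 4 benign (0) samples,
# with one missed detection and one false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
print(p, r)  # -> 0.75 0.75
```

"Winning on both precision and recall" means the traditional model makes fewer false alarms *and* misses less evil at the same time, rather than trading one for the other.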
The key insight from her research and follow-on work: LLMs are measurably worse on network/packet data (which is not text) and measurably better on text-like data (phishing emails). For a phishing email classification experiment using Hugging Face datasets, an XGBoost model with good feature engineering still wins outright, but a "router" architecture — where XGBoost classifies first and the LLM handles overflow or edge cases — beats either model alone. This ensemble/routing approach is where she's focusing next.
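The router idea can be sketched as a confidence gate: trust the traditional model when its score is decisive, and defer only the ambiguous middle to the LLM. Everything here is an illustrative assumption, not the talk's implementation; `llm_classify` is a placeholder for a real zero-shot API call, and the 0.8 threshold is invented for the example.

```python
def llm_classify(email_text: str) -> int:
    """Placeholder for a zero-shot LLM call (in practice: prompt an
    API with the email body and parse a benign/phishing verdict).
    This toy stand-in just keyword-matches so the sketch runs."""
    return int("verify your account" in email_text.lower())

def route(email_text: str, model_proba: float, threshold: float = 0.8) -> int:
    """Router: the traditional model (e.g. XGBoost predict_proba)
    handles confident calls; edge cases fall through to the LLM."""
    if model_proba >= threshold:
        return 1                      # confident phishing call
    if model_proba <= 1 - threshold:
        return 0                      # confident benign call
    return llm_classify(email_text)   # ambiguous -> zero-shot LLM

# Confident traditional-model scores never reach the LLM:
print(route("quarterly report attached", model_proba=0.05))       # -> 0
# Ambiguous scores are routed to the zero-shot classifier:
print(route("please verify your account now", model_proba=0.5))   # -> 1
```

The appeal of this design is that the expensive, slower LLM call is paid only on the slice of traffic where the cheap discriminative model is unsure, which is exactly where the zero-shot model adds value.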
She also experimented with Claude Opus (the exact version is garbled in the recording) and found that on small datasets (~100 samples) Claude performs better, but with larger labeled datasets the traditional model overtakes it. This aligns with the theoretical expectation: LLMs were not designed as classifiers; they can't find the same decision boundaries that discriminative models do.
## Key Takeaways
- Traditional ML (XGBoost + feature engineering) beats LLMs on precision and recall for security classification tasks
- LLMs are strong zero-shot classifiers — no training data needed — which is a real operational advantage
- LLMs perform best on text-native data (emails); significantly worse on non-text data (network packets, binary features)
- Ensemble/router architectures (route to LLM after traditional model) outperform either alone
- Claude Opus performed better with small datasets (~100 samples) but was surpassed by traditional models as data volume increased
- The worst use case for LLMs is classification on structured/numeric/non-text data — "just give it to the LLM to find evil" is not reliable for network data
## Notable Quotes / Data Points
- Dataset: CTU-13 botnet dataset, chosen for unambiguous labeling
- Traditional ML model wins on both precision and recall against all LLM variants tested
- Router (XGBoost first, then LLM) outperformed standalone LLM and matched or beat standalone XGBoost
- Claude with 100-sample dataset: competitive; Claude with larger dataset: underperforms traditional models
- "LLMs were not meant for that" — classification boundary-finding is not what generative models are optimized for
#unprompted #claude