Speaker:: Xenia Mountrouidou
Title:: Traditional ML vs LLMs: who can classify better?
Duration:: 8 min
Video:: https://www.youtube.com/watch?v=fAmr0N2rHIU
## Key Thesis
In head-to-head security classification benchmarks, traditional ML models (specifically XGBoost-style models with careful feature engineering) outperform LLMs on precision and recall — but LLMs are competitive zero-shot classifiers that require no training data, which makes them a genuinely useful complement rather than a replacement.
## Synopsis
Mountrouidou, a data scientist with 15 years in security data science at Expel (a managed detection and response company), frames this as a question that has been "torturing" her: can generative models do the boundary-finding work that traditional predictive models do for security classification?
Her comparison used network data, specifically packet captures from the CTU-13 botnet dataset, which she chose deliberately because it is unambiguous: well-labeled known-bad traffic versus clean benign activity, with no label doubt. The traditional ML model won on both precision and recall: it is better at finding evil and more precise in its calls. The LLM classifiers, however, are zero-shot: they require no training data, just prompting. That is a significant operational advantage.
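Precision and recall, the two metrics the talk uses to score XGBoost against the LLMs, can be computed directly from the confusion counts. A minimal sketch (the sample labels below are illustrative, not results from the talk):

```python
def precision_recall(y_true, y_pred):
    """Precision = TP/(TP+FP): how often a 'malicious' call is right.
    Recall = TP/(TP+FN): how much of the actual evil gets found."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 4 malicious (1) and 4 benign (0) samples,
# with one missed detection and one false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
print(p, r)  # -> 0.75 0.75
```

"Winning on both precision and recall" means the traditional model makes fewer false alarms *and* misses less evil at the same time, rather than trading one for the other.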
The key insight from her research and follow-on work: LLMs are measurably worse on network/packet data (which is not text) and measurably better on text-like data (phishing emails). For a phishing email classification experiment using Hugging Face datasets, an XGBoost model with good feature engineering still wins outright, but a "router" architecture — where XGBoost classifies first and the LLM handles overflow or edge cases — beats either model alone. This ensemble/routing approach is where she's focusing next.
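The router idea can be sketched as a confidence gate: trust the traditional model when its score is decisive, and defer only the ambiguous middle to the LLM. Everything here is an illustrative assumption, not the talk's implementation; `llm_classify` is a placeholder for a real zero-shot API call, and the 0.8 threshold is invented for the example.

```python
def llm_classify(email_text: str) -> int:
    """Placeholder for a zero-shot LLM call (in practice: prompt an
    API with the email body and parse a benign/phishing verdict).
    This toy stand-in just keyword-matches so the sketch runs."""
    return int("verify your account" in email_text.lower())

def route(email_text: str, model_proba: float, threshold: float = 0.8) -> int:
    """Router: the traditional model (e.g. XGBoost predict_proba)
    handles confident calls; edge cases fall through to the LLM."""
    if model_proba >= threshold:
        return 1                      # confident phishing call
    if model_proba <= 1 - threshold:
        return 0                      # confident benign call
    return llm_classify(email_text)   # ambiguous -> zero-shot LLM

# Confident traditional-model scores never reach the LLM:
print(route("quarterly report attached", model_proba=0.05))       # -> 0
# Ambiguous scores are routed to the zero-shot classifier:
print(route("please verify your account now", model_proba=0.5))   # -> 1
```

The appeal of this design is that the expensive, slower LLM call is paid only on the slice of traffic where the cheap discriminative model is unsure, which is exactly where the zero-shot model adds value.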
She also experimented with Claude Opus (the exact version is garbled in the recording) and found that on small datasets (~100 samples) Claude performs better, but with larger labeled datasets the traditional model overtakes it. This aligns with the theoretical expectation: LLMs were not designed as classifiers; they can't find the same decision boundaries that discriminative models do.
## Key Takeaways
- Traditional ML (XGBoost + feature engineering) beats LLMs on precision and recall for security classification tasks
- LLMs are strong zero-shot classifiers — no training data needed — which is a real operational advantage
- LLMs perform best on text-native data (emails); significantly worse on non-text data (network packets, binary features)
- Ensemble/router architectures (route to LLM after traditional model) outperform either alone
- Claude Opus performed better with small datasets (~100 samples) but was surpassed by traditional models as data volume increased
- The worst use case for LLMs is classification on structured/numeric/non-text data — "just give it to the LLM to find evil" is not reliable for network data
## Notable Quotes / Data Points
- Dataset: CTU-13 botnet dataset, chosen for unambiguous labeling
- Traditional ML model wins on both precision and recall against all LLM variants tested
- Router (XGBoost first, then LLM) outperformed standalone LLM and matched or beat standalone XGBoost
- Claude with 100-sample dataset: competitive; Claude with larger dataset: underperforms traditional models
- "LLMs were not meant for that" — classification boundary-finding is not what generative models are optimized for
#unprompted #claude