Next-generation cyberattack detection with large language models: anomaly analysis across heterogeneous logs
This project explores large language models (LLMs) for anomaly detection across heterogeneous log sources. Traditional intrusion detection systems suffer from high false-positive rates, semantic blindness, and data scarcity, as logs are inherently sensitive, making clean datasets rare. We address these challenges through three contributions: (1) LogAtlas-Foundation-Sessions and LogAtlas-Defense-Set, balanced and heterogeneous log datasets with explicit attack annotations and privacy preservation; (2) empirical benchmarking revealing why standard metrics such as F1 and accuracy are misleading for security applications; and (3) a two-phase training framework combining log understanding (Base-AMAN, 3B parameters) with real-time detection (AMAN, 0.5B parameters via knowledge distillation). Results demonstrate practical feasibility, with inference times of 0.3–0.5 seconds per session and operational costs below 50 USD per day.
💡 Research Summary
This paper presents a comprehensive framework for using large language models (LLMs) to detect cyber‑attack anomalies across heterogeneous log sources. Recognizing that traditional intrusion detection systems suffer from high false‑positive rates, lack of semantic understanding, and severe data scarcity due to privacy constraints, the authors make three primary contributions.
First, they release two balanced, publicly available datasets: LogAtlas‑Foundation‑Sessions and LogAtlas‑Defense‑Set. The former contains over 44 000 temporal sessions and roughly 19 million raw log events drawn from eight synthetic enterprise testbeds, preserving a natural attack prevalence of about 2 %. Each session is enriched with metadata (duration, host ID, time‑of‑day flags, log type, parsing statistics) to enable models to learn contextual variations across time, systems, and sources. The latter is designed for the detection phase and includes approximately 1.68 million attack‑related logs and 3 million normal logs, yielding a deliberately high 35 % attack prevalence. This balanced composition mitigates the well‑known class‑imbalance pathology that causes models to collapse to majority‑class predictions while still reflecting realistic SOC workloads where attacks may constitute 30‑40 % of examined traffic. Both datasets are hosted on Hugging Face and include explicit attack annotations, facilitating reproducibility and future benchmarking.
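The per-session metadata described above can be pictured as a record with one field per attribute. This is a hypothetical sketch, not the dataset's actual schema: the field names below are illustrative stand-ins for the duration, host ID, time-of-day flags, log type, and parsing statistics the paper lists.

```python
from dataclasses import dataclass

# Hypothetical record for one LogAtlas-Foundation-Sessions entry.
# Field names are illustrative assumptions, not the published schema.
@dataclass
class Session:
    session_id: str
    host_id: str
    log_type: str              # e.g. "auth", "dns", "firewall"
    duration_s: float          # session duration in seconds
    is_business_hours: bool    # time-of-day flag
    parse_success_rate: float  # parsing statistics
    is_attack: bool            # explicit attack annotation

def attack_prevalence(sessions):
    """Fraction of sessions labelled as attacks."""
    return sum(s.is_attack for s in sessions) / len(sessions)

# Toy corpus mirroring the ~2 % natural attack prevalence of the
# Foundation-Sessions split (1 attack in every 50 sessions).
corpus = [
    Session(f"s{i}", "host-1", "auth", 12.0, True, 0.99,
            is_attack=(i % 50 == 0))
    for i in range(1000)
]
print(f"attack prevalence: {attack_prevalence(corpus):.1%}")  # → 2.0%
```

Keeping the annotation as an explicit field on each session, rather than in a separate label file, is what makes prevalence-controlled resampling (2 % vs. 35 %) straightforward.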
Second, the authors conduct a systematic benchmarking study that demonstrates why conventional metrics such as accuracy and F1 are misleading for security applications. By varying the attack proportion in test sets from 0 % to 100 % while keeping the total size fixed at 10 000 samples, they show that a supervised RoBERTa classifier predicts every sample as normal regardless of the underlying distribution, achieving high accuracy but zero true positives. An unsupervised LogBERT model only begins to detect attacks when the attack share exceeds 80 %. These findings expose a measurement crisis: models that appear to perform well on standard benchmarks may provide no operational value when deployed in realistic, highly imbalanced environments. Consequently, the paper advocates for evaluation protocols that incorporate realistic attack prevalence, multiple metrics (detection rate, false‑positive rate, ROC‑AUC), and cost‑sensitive analysis.
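The failure mode above can be reproduced with a few lines of arithmetic: score a degenerate classifier that flags nothing (as the RoBERTa baseline effectively did) across different attack shares of a fixed 10 000-sample test set. This is a minimal sketch of the experimental logic, not the authors' evaluation code.

```python
# Metric-sensitivity sketch: a classifier that predicts "normal" for every
# sample looks excellent on accuracy yet detects zero attacks.
def evaluate_all_normal(n_total, attack_fraction):
    n_attack = int(n_total * attack_fraction)
    n_normal = n_total - n_attack
    tp = 0           # no attack is ever flagged
    tn = n_normal    # every normal sample is trivially correct
    accuracy = (tp + tn) / n_total
    detection_rate = tp / n_attack if n_attack else float("nan")
    return accuracy, detection_rate

for frac in (0.02, 0.35, 0.80):
    acc, dr = evaluate_all_normal(10_000, frac)
    print(f"attack share {frac:4.0%}: accuracy={acc:.2f}, detection rate={dr:.2f}")
```

At the realistic 2 % prevalence this degenerate model scores 98 % accuracy with a 0 % detection rate, which is exactly why the paper insists on reporting detection rate, false-positive rate, and ROC-AUC alongside accuracy.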
Third, the paper introduces a two‑phase training pipeline that balances performance with computational feasibility. In Phase 1, a 3‑billion‑parameter foundation model (Base‑AMAN) is trained on LogAtlas‑Foundation‑Sessions using a combination of Chinchilla scaling, Low‑Rank Adaptation (LoRA), and a Soft Mixture‑of‑Experts (Soft‑MoE) architecture. The Chinchilla principle guides a token‑to‑parameter ratio of roughly 51.6 : 1, resulting in 1.544 billion tokens for training. LoRA freezes the bulk of the Qwen2.5‑3B‑Instruct weights and fine‑tunes only 0.96 % (≈29.9 M) of parameters, dramatically reducing memory and compute requirements. Soft‑MoE replaces dense feed‑forward layers with four expert sub‑networks, each receiving soft assignments to avoid routing collapse while allowing specialization for diverse log patterns.
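The Phase-1 numbers are internally consistent if the Chinchilla-style ratio is read against the trainable LoRA parameters rather than the full 3 B backbone, which is an interpretation of the quoted figures, not a claim from the paper itself:

```python
# Back-of-the-envelope check of the Phase-1 budget quoted above.
# Assumption: the 51.6:1 token-to-parameter ratio applies to the ~29.9 M
# *trainable* LoRA parameters, not the frozen 3 B Qwen2.5 backbone.
trainable_params = 29.9e6   # ≈0.96 % of Qwen2.5-3B-Instruct left unfrozen
training_tokens = 1.544e9   # reported training-token budget
ratio = training_tokens / trainable_params
print(f"tokens per trainable parameter: {ratio:.1f}")  # ≈51.6
```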
Phase 2 distills the knowledge of Base‑AMAN into a lightweight 0.5‑billion‑parameter detection model (AMAN). The distillation process uses structured instruction‑response pairs that contain (1) a concise activity summary, (2) identified anomalous events, (3) a risk score with justification (CRITICAL/HIGH/MEDIUM/LOW), and (4) recommended remediation steps. By training AMAN on the balanced LogAtlas‑Defense‑Set, the model learns to produce actionable, interpretable outputs while maintaining sub‑second inference latency (0.3–0.5 seconds per session). Operational cost analysis shows that a single GPU instance can run AMAN for under 50 USD per day, making the solution viable for continuous SOC deployment.
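A downstream SOC pipeline would consume AMAN's four-part response programmatically. The sketch below assumes a simple `HEADER:` line format for the four components; the section names, the example response, and the parser are all illustrative, since the paper does not publish the exact output template.

```python
import re
from enum import Enum

# Risk levels named in the paper's structured response format.
class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

SECTIONS = ("SUMMARY", "ANOMALIES", "RISK", "REMEDIATION")

def parse_response(text):
    """Split a model response into the four structured fields (hypothetical format)."""
    out = {}
    for name in SECTIONS:
        m = re.search(rf"{name}:\s*(.*?)(?=\n[A-Z]+:|\Z)", text, re.S)
        out[name.lower()] = m.group(1).strip() if m else ""
    level = out["risk"].split()[0].upper() if out["risk"] else ""
    out["risk_level"] = Risk[level] if level in Risk.__members__ else None
    return out

# Illustrative response in the assumed format.
demo = """SUMMARY: repeated failed SSH logins followed by a successful login
ANOMALIES: 42 auth failures from one source IP within 90 s
RISK: HIGH - pattern consistent with brute-force credential access
REMEDIATION: block source IP; rotate the affected credential"""

print(parse_response(demo)["risk_level"])  # Risk.HIGH
```

Mapping the free-text risk label onto an enum is what lets alerting thresholds (e.g. page on HIGH and CRITICAL only) be enforced outside the model.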
Experimental results on both datasets demonstrate high detection rates (>92 %) and low false‑positive rates (<5 %). On the Defense‑Set, AMAN’s F1 score is within 1–2 % of the 3‑B Base‑AMAN, yet it consumes roughly one‑sixth of the memory and achieves a five‑fold speedup. The authors also provide extensive visualizations of session duration, log volume, and temporal distributions to illustrate dataset realism.
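The operational figures quoted in the summary (0.3–0.5 s per session, under 50 USD per GPU-day) imply a rough per-session cost ceiling, assuming one session processed at a time on a single instance:

```python
# Rough feasibility arithmetic from the figures above; sequential
# single-GPU processing is an assumption, so real throughput may differ.
SECONDS_PER_DAY = 24 * 3600
GPU_COST_PER_DAY = 50.0  # USD, upper bound quoted in the paper

for latency in (0.3, 0.5):
    sessions_per_day = SECONDS_PER_DAY / latency
    cost_per_1k = GPU_COST_PER_DAY / (sessions_per_day / 1000)
    print(f"{latency:.1f} s/session -> {sessions_per_day:,.0f} sessions/day, "
          f"~${cost_per_1k:.2f} per 1 000 sessions")
```

Even at the slower 0.5 s latency this is on the order of 170 000 sessions per day per GPU, which is what makes continuous SOC deployment plausible at the quoted budget.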
In conclusion, the paper delivers (a) novel, privacy‑preserving, heterogeneous log benchmarks; (b) a critical re‑examination of evaluation metrics for security‑oriented anomaly detection; and (c) a practical, two‑stage LLM training and distillation framework that bridges the gap between state‑of‑the‑art language understanding and real‑time cyber‑attack detection. Future work is outlined to incorporate additional cloud‑native logs, multimodal telemetry (e.g., network packets, system metrics), and enhanced explainability techniques for deeper root‑cause analysis.