AlertBERT: A noise-robust alert grouping framework for simultaneous cyber attacks

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Automated detection of cyber attacks is a critical capability for countering the growing volume and sophistication of cyber attacks. However, the high number of security alerts issued by intrusion detection systems leads to alert fatigue among analysts working in security operations centres (SOCs), which in turn causes slow reaction times and incorrect decision making. Alert grouping, which refers to clustering security alerts according to their underlying causes, can significantly reduce the number of distinct items analysts have to consider. Unfortunately, conventional time-based alert grouping solutions are unsuitable for large-scale computer networks characterised by high levels of false-positive alerts and simultaneously occurring attacks. To address these limitations, we propose AlertBERT, a self-supervised framework designed to group alerts from isolated or concurrent attacks in noisy environments. Our open-source implementation of AlertBERT leverages masked language models and density-based clustering to support both real-time and forensic operation. To evaluate our framework, we further introduce a novel data augmentation method that enables flexible control over noise levels and simulates concurrent attack occurrences. Based on the datasets generated through this method, we demonstrate that AlertBERT consistently outperforms conventional time-based grouping techniques, achieving superior accuracy in identifying correct alert groups.


💡 Research Summary

The paper addresses the growing problem of alert fatigue in Security Operations Centers (SOCs) caused by the massive volume of intrusion detection system (IDS) alerts. Traditional alert grouping methods—most notably the time‑delta approach that clusters alerts solely based on timestamp proximity—perform adequately in small, low‑noise environments but break down when faced with (i) highly variable alert densities, (ii) a large proportion of false‑positive (noise) alerts, and (iii) multiple attacks that overlap in time. To overcome these limitations, the authors introduce AlertBERT, a self‑supervised framework that combines masked‑language‑model embeddings with density‑based clustering.

AlertBERT consists of two phases. In the Embedding‑Phase, raw IDS alerts (typically JSON objects) are serialized into text strings and fed into a BERT‑style transformer trained with a masked‑language objective. Random masking of key‑value tokens forces the model to learn contextual representations of alert fields without any manual labeling. The resulting high‑dimensional embeddings capture semantic similarity: alerts generated by the same underlying attack are mapped to nearby points in the embedding space.
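As a rough illustration of this serialization-and-masking step, a key-value alert can be flattened into tokens and randomly masked before being fed to the transformer. This is a minimal sketch, not the paper's code: the alert field names, the `key=value` token format, and the masking rate are all assumptions for illustration.

```python
import json
import random

MASK_TOKEN = "[MASK]"

def serialize_alert(alert: dict) -> list[str]:
    """Flatten a JSON alert into "key=value" tokens (one possible serialization)."""
    return [f"{k}={v}" for k, v in sorted(alert.items())]

def mask_tokens(tokens: list[str], mask_prob: float = 0.15, seed: int = 0):
    """Randomly replace tokens with a mask token, returning the masked
    sequence plus the original targets the model must reconstruct."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[i] = tok  # ground truth for the masked-language objective
        else:
            masked.append(tok)
    return masked, targets

# Hypothetical Suricata-style alert (field names are illustrative only).
alert = {"src_ip": "10.0.0.5", "dest_port": 445, "signature": "SMB brute force"}
tokens = serialize_alert(alert)
masked, targets = mask_tokens(tokens, mask_prob=0.5, seed=1)
```

In a real pipeline these masked sequences would be tokenized further (e.g. with a subword vocabulary) and used to train the BERT-style encoder; the point here is only that the self-supervised objective needs no manual labels.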

In the Grouping‑Phase, the framework jointly considers temporal distance and embedding distance. Using algorithms such as DBSCAN or HDBSCAN, it identifies dense regions in the combined space, automatically separating clusters that may overlap temporally but are distinct semantically. The clustering parameters are adaptively tuned based on the estimated noise level, allowing the method to remain robust when background noise is high.
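The core idea of the Grouping-Phase, density clustering over a distance that mixes temporal and semantic gaps, can be sketched with a minimal DBSCAN. The combined-distance formula, its weighting, and all parameter values below are illustrative assumptions, not the paper's implementation, and 2-D points stand in for the high-dimensional BERT embeddings.

```python
import math

def combined_distance(a, b, time_weight=0.5):
    """Distance mixing the temporal gap and the embedding (semantic) gap.
    a, b are (timestamp, embedding_vector) pairs; the weight is illustrative."""
    dt = abs(a[0] - b[0])
    de = math.dist(a[1], b[1])
    return time_weight * dt + (1 - time_weight) * de

def dbscan(points, eps, min_pts, dist):
    """Minimal DBSCAN: returns one label per point (-1 = noise)."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1  # provisionally noise; a later cluster may claim it
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neigh_j = [k for k in range(len(points)) if dist(points[j], points[k]) <= eps]
            if len(neigh_j) >= min_pts:
                queue.extend(k for k in neigh_j if labels[k] is None)
    return labels

# Two attacks that overlap in time but differ semantically, plus one stray alert.
points = [
    (0.0, (0.0, 0.0)), (1.0, (0.1, 0.0)), (2.0, (0.0, 0.1)),  # attack A
    (0.5, (5.0, 5.0)), (1.5, (5.1, 5.0)), (2.5, (5.0, 5.1)),  # attack B
    (10.0, (2.5, 9.0)),                                       # isolated noise alert
]
labels = dbscan(points, eps=1.0, min_pts=2, dist=combined_distance)
```

Even though attacks A and B interleave on the time axis, their embedding separation keeps them in distinct clusters, while the isolated alert is labelled noise; this is exactly the failure mode where timestamp-only grouping would merge everything.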

A novel data‑augmentation technique is also proposed. By injecting synthetic noise alerts and overlapping attack sequences into existing datasets, the authors can systematically vary noise ratios (10‑50 %) and the number of concurrent attacks (1‑3). This synthetic data enables thorough evaluation of robustness and provides a controllable benchmark for future work.
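A minimal sketch of such an augmentation step, assuming alerts are (timestamp, payload) pairs and injected noise alerts carry a placeholder signature (both assumptions for illustration, not the authors' generator):

```python
import random

def augment(attacks, noise_ratio=0.3, seed=42):
    """Interleave several attack alert sequences by timestamp and inject
    synthetic noise alerts until noise makes up `noise_ratio` of the stream.
    Returns (timestamp, payload, label) tuples; label -1 marks noise."""
    rng = random.Random(seed)
    merged = [(t, payload, aid) for aid, seq in enumerate(attacks) for t, payload in seq]
    t_min = min(t for t, _, _ in merged)
    t_max = max(t for t, _, _ in merged)
    # Solve n_noise / (n_noise + len(merged)) == noise_ratio for n_noise.
    n_noise = round(noise_ratio * len(merged) / (1 - noise_ratio))
    for _ in range(n_noise):
        merged.append((rng.uniform(t_min, t_max), {"sig": "benign-fp"}, -1))
    merged.sort(key=lambda a: a[0])
    return merged

# Two concurrent two-alert attacks, augmented to a 50 % noise ratio.
attacks = [
    [(0.0, {"sig": "scan"}), (1.0, {"sig": "scan"})],
    [(0.5, {"sig": "exploit"}), (1.5, {"sig": "exploit"})],
]
stream = augment(attacks, noise_ratio=0.5)
```

Because the noise count is derived from the target ratio and the timestamps are drawn from the attacks' own time window, both knobs the paper varies (noise ratio and number of concurrent attacks) are directly controllable.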

Experiments are conducted on the publicly available AIT Alert Dataset and on the augmented datasets. Metrics include precision, recall, F1‑score, and clustering quality (Silhouette). AlertBERT consistently outperforms the time‑delta baseline, achieving an average F1 improvement of 18 % and maintaining recall above 0.85 even at 30 % noise. Clustering quality scores (≈0.62) indicate clearer separation than timestamp‑only methods. Processing latency is low: in online streaming mode the average per‑alert processing time is under 250 ms, while batch (forensic) mode can handle thousands of alerts per second. The framework works across multiple IDS formats (e.g., AMiner, Suricata, Zeek) without requiring hand‑crafted schemas, demonstrating its format‑agnostic capability.
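The summary does not spell out how precision and recall are defined for groupings; one common pairwise formulation, shown here as an assumption rather than the paper's exact metric, scores every pair of alerts on whether both labelings place it in the same group:

```python
from itertools import combinations

def pairwise_prf(true_labels, pred_labels):
    """Pairwise precision/recall/F1 for a clustering: a pair of alerts is a
    true positive when both labelings put the two alerts in the same group."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1  # grouped together, but from different causes
        elif same_true:
            fn += 1  # same cause, but split across groups
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One alert of the first attack misassigned to the second group.
p, r, f = pairwise_prf([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
```

This pair-counting view makes the metric independent of how cluster IDs are numbered, which is why it is a common choice for evaluating alert grouping.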

Key contributions are: (1) a self‑supervised embedding pipeline that learns meaningful representations directly from raw alert text; (2) a hybrid clustering strategy that fuses temporal and semantic dimensions to handle high‑noise and simultaneous‑attack scenarios; (3) a controllable data‑augmentation method for systematic robustness testing; and (4) an open‑source implementation that facilitates reproducibility and real‑world adoption.

The authors discuss future directions, including multimodal integration of network flow and host logs, lightweight transformer variants for tighter real‑time constraints, finer‑grained attack‑stage grouping for meta‑alert generation, and mechanisms for continual online learning to address model drift. Overall, AlertBERT represents a significant step toward scalable, noise‑robust alert grouping, promising to reduce analyst workload and improve the timeliness of cyber‑threat response in large‑scale enterprise environments.

