Semantic-Aware Advanced Persistent Threat Detection Using Autoencoders on LLM-Encoded System Logs
Advanced Persistent Threats (APTs) are among the most challenging cyberattacks to detect. They are carried out by highly skilled attackers who carefully study their targets and operate in a stealthy, long-term manner. Because APTs exhibit “low-and-slow” behavior, traditional statistical methods and shallow machine learning techniques often fail to detect them. Previous research on APT detection has explored machine learning approaches and provenance graph analysis. However, provenance-based methods often fail to capture the semantic intent behind system activities. This paper proposes a novel anomaly detection approach that leverages semantic embeddings generated by Large Language Models (LLMs). The method enhances APT detection by extracting meaningful semantic representations from unstructured system log data. First, raw system logs are transformed into high-dimensional semantic embeddings using a pre-trained transformer model. These embeddings are then analyzed using an Autoencoder (AE) to identify anomalous and potentially malicious patterns. The proposed method is evaluated using the DARPA Transparent Computing (TC) dataset, which contains realistic APT attack scenarios generated by red teams in live environments. Experimental results show that the AE trained on LLM-derived embeddings outperforms widely used unsupervised baseline methods, including Isolation Forest (IForest), One-Class Support Vector Machine (OC-SVM), and Principal Component Analysis (PCA). Performance is measured using the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), where the proposed approach consistently achieves superior results, even in complex threat scenarios. These findings highlight the importance of semantic understanding in detecting non-linear and stealthy attack behaviors that are often missed by conventional detection techniques.
💡 Research Summary
The paper tackles the persistent challenge of detecting Advanced Persistent Threats (APTs) by moving beyond traditional statistical or shallow‑learning approaches that treat system logs merely as collections of tokens or numeric features. Recognizing that APTs often manifest as “low‑and‑slow” behaviors, the authors propose a semantic‑aware, unsupervised detection pipeline that first converts raw provenance records into human‑readable natural‑language sentences (e.g., “Process 1054 started /bin/bash and connected socket 192.168.1.5:80 and changed /etc/passwd.”). These sentences are then fed into a pre‑trained transformer model, all‑mpnet‑base‑v2, which produces 768‑dimensional dense embeddings. MPNet, trained with a combination of Masked Language Modeling and Permuted Language Modeling, offers richer contextual representations than classic BERT‑style models, especially after contrastive fine‑tuning on billions of sentence pairs.
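The template step described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' exact code: the record field names (`pid`, `image`, `remote`, `file_written`) are assumptions, and the resulting sentence would then be embedded with the `all-mpnet-base-v2` model (e.g. via the `sentence-transformers` library) to obtain a 768-dimensional vector.

```python
def record_to_sentence(record: dict) -> str:
    """Render one provenance record as a human-readable sentence
    (field names are hypothetical, chosen to match the paper's example)."""
    parts = [f"Process {record['pid']} started {record['image']}"]
    if "remote" in record:
        parts.append(f"connected socket {record['remote']}")
    if "file_written" in record:
        parts.append(f"changed {record['file_written']}")
    return " and ".join(parts) + "."

sentence = record_to_sentence(
    {"pid": 1054, "image": "/bin/bash",
     "remote": "192.168.1.5:80", "file_written": "/etc/passwd"}
)
print(sentence)
# -> Process 1054 started /bin/bash and connected socket 192.168.1.5:80 and changed /etc/passwd.

# The sentence would then be embedded, e.g.:
#   SentenceTransformer("all-mpnet-base-v2").encode([sentence])  # shape (1, 768)
```

Extending the pipeline to a new log format would mean adding another template of this kind, which is the engineering cost the limitations section notes.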
The core anomaly detector is a vanilla autoencoder with a symmetric architecture (768 → 256 → 64 → 256 → 768). It is trained exclusively on embeddings derived from normal system activity, minimizing mean‑squared reconstruction error. During inference, each sample’s reconstruction loss serves as an anomaly score; a threshold, calibrated on a validation set, separates normal from malicious events. The authors evaluate the framework—named MPNet‑AE—on the DARPA Transparent Computing (TC) benchmark, which includes five realistic APT scenarios (5DIR, CADETS, CLEARSCOPE, THEIA, TRACE) across multiple operating systems. Each scenario is further decomposed into five provenance views (process events, execution, parent‑child relationships, network flow, and an all‑inclusive view), allowing fine‑grained performance analysis.
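The scoring logic of the autoencoder can be sketched as follows. This is a minimal NumPy stand-in, not the authors' implementation: the weights below are random placeholders (in the paper they are learned on benign embeddings by minimizing mean-squared reconstruction error), and the 99th-percentile threshold is an assumed calibration rule standing in for whatever the validation-set procedure actually was.

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric layer sizes from the paper: 768 -> 256 -> 64 -> 256 -> 768.
DIMS = [768, 256, 64, 256, 768]

# Placeholder weights; training on normal-activity embeddings is omitted.
weights = [rng.normal(0, 0.05, (a, b)) for a, b in zip(DIMS[:-1], DIMS[1:])]
biases = [np.zeros(b) for b in DIMS[1:]]

def forward(x: np.ndarray) -> np.ndarray:
    """Batch forward pass: ReLU on hidden layers, linear output layer."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = h @ W + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)
    return h

def anomaly_scores(x: np.ndarray) -> np.ndarray:
    """Per-sample mean-squared reconstruction error used as the anomaly score."""
    return ((x - forward(x)) ** 2).mean(axis=1)

# Threshold calibrated on held-out benign embeddings (assumed: 99th percentile).
val = rng.normal(size=(200, 768))
threshold = np.quantile(anomaly_scores(val), 0.99)
flags = anomaly_scores(rng.normal(size=(10, 768))) > threshold
```

Since the model is trained only on normal activity, malicious events are expected to reconstruct poorly and exceed the calibrated threshold.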
Results are presented through both qualitative and quantitative lenses. t‑SNE visualizations reveal tight clusters for normal embeddings and scattered outliers for attacks, confirming that the semantic space separates benign and malicious behaviors. Quantitatively, MPNet‑AE consistently outperforms three widely used unsupervised baselines—Isolation Forest, One‑Class SVM, and Principal Component Analysis—by a margin of 5–10 percentage points in Area Under the ROC Curve (AUC‑ROC). This advantage persists even under extreme class imbalance (malicious events constitute less than 0.004 % of the data), demonstrating the method’s robustness to rare‑event detection.
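The AUC-ROC metric used above is well suited to this extreme imbalance because it equals the probability that a randomly chosen malicious sample receives a higher anomaly score than a randomly chosen benign one (the Mann-Whitney statistic), independent of class frequencies. A minimal sketch of that computation, with toy scores rather than the paper's data:

```python
def auc_roc(scores, labels):
    """AUC-ROC from anomaly scores; labels: 1 = malicious, 0 = benign.
    Pairwise (Mann-Whitney) formulation; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy check: if malicious events always score higher, AUC is 1.0.
print(auc_roc([0.9, 0.8, 0.1, 0.2, 0.3], [1, 1, 0, 0, 0]))  # -> 1.0
```

The O(|pos| * |neg|) pairwise loop is fine for illustration; production code would use a rank-based implementation such as `sklearn.metrics.roc_auc_score`.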
The paper’s contributions are threefold: (1) introducing a semantic embedding pipeline that replaces bag‑of‑words or frequency‑based features with LLM‑derived vectors; (2) integrating these embeddings with a reconstruction‑based autoencoder to capture normal system behavior without any labeled attack data; (3) providing a comprehensive empirical evaluation on a high‑fidelity APT dataset, showing clear gains over traditional baselines.
Nevertheless, the study has notable limitations. The natural‑language conversion step relies on handcrafted templates; extending the approach to new log formats or heterogeneous environments would require additional engineering. MPNet embeddings are high‑dimensional, incurring substantial GPU memory and compute costs, which may hinder real‑time deployment. The autoencoder’s reliance on reconstruction error alone can produce false positives when benign but rare activities yield high errors. Moreover, the baseline comparison set excludes recent deep graph‑based or transformer‑encoder‑decoder detectors, leaving open the question of how MPNet‑AE stacks up against the state of the art in end‑to‑end deep APT detection.
Future work suggested by the authors includes exploring lightweight embedding models (e.g., Distil‑MPNet), fusing multimodal telemetry such as network flows and file hashes, optimizing the pipeline for streaming inference, and incorporating explainable AI techniques to make anomaly scores interpretable for analysts. Overall, the paper convincingly demonstrates that semantic awareness—derived from large language models—can substantially improve unsupervised APT detection, marking a promising direction for next‑generation cyber‑defense systems.