Anomaly Sequences Detection from Logs Based on Compression
Mining information from logs is an old and still active research topic. In recent years, with the rapid rise of cloud computing, log mining has become increasingly important to industry. This paper focuses on one major task of log mining, anomaly detection, and proposes a novel method for mining abnormal sequences from large logs. Unlike previous anomaly detection systems based on statistics, probabilities, and Markov assumptions, our approach measures the strangeness of a sequence using compression. It first trains a grammar of normal behaviors using grammar-based compression, then measures the information quantity and density of questionable sequences according to the increase in grammar length. We have applied our approach to mining real bugs from fine-grained execution logs. We have also tested its ability on intrusion detection using publicly available system call traces. The experiments show that our method successfully selects the strange sequences related to bugs or attacks.
💡 Research Summary
The paper “Anomaly Sequences Detection from Logs Based on Compression” introduces a novel approach to log‑based anomaly detection that departs from the dominant statistical, probabilistic, and Markov‑chain techniques. The authors argue that normal system behavior can be described compactly, whereas abnormal behavior requires a longer description. They operationalize this intuition by measuring the increase in the size of a grammar‑based compression model when a candidate sequence is added to a model trained on normal logs.
Methodology
- Training Phase – A set of normal log sequences (S_n) is transformed into an admissible context‑free grammar (G) using a greedy grammar‑transform algorithm (a variant of SEQUITUR). The start symbol (p_0) represents the concatenation of all normal sequences, each wrapped as a distinct non‑terminal to prevent cross‑sequence pattern leakage. The compressed size of the grammar, denoted (Q_0), is the sum of the lengths of all right‑hand sides of the production rules (i.e., the number of symbols needed to describe the normal behavior).
- Evaluation Phase – For each suspect sequence (t \in S_q), the algorithm inserts (t) into the existing grammar, recomputes the compressed size (Q_t), and derives two metrics:
- Information quantity (I_t = Q_t - Q_0) – the absolute increase in description length, indicating how many new symbols the candidate introduces.
- Information density (D_t = I_t / |t|) – a length‑normalized version that highlights short, highly unusual sequences.
- Reporting Phase – The system reports the top‑(m_1) sequences with the largest (I_t) and the top‑(m_2) sequences with the largest (D_t). Both parameters are user‑configurable, allowing the analyst to focus on either raw novelty or novelty per unit length.
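The three phases above can be sketched with a toy pair-replacement compressor standing in for the paper's grammar transform (the real system uses a SEQUITUR variant; the `grammar_size` helper, the tiny `normal` corpus, and the `score` function below are illustrative assumptions, not the authors' implementation):

```python
from collections import Counter

def grammar_size(sequences):
    """Approximate compressed size: total right-hand-side symbols after
    greedily replacing every repeated adjacent pair with a fresh
    non-terminal. Each sequence is kept as its own rule, so patterns
    never form across sequence boundaries (as in the wrapped transform)."""
    rules = [list(s) for s in sequences]  # one rule per sequence
    next_nt = 0
    while True:
        pairs = Counter()
        for r in rules:
            for i in range(len(r) - 1):
                pairs[(r[i], r[i + 1])] += 1
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        nt = ('NT', next_nt)
        next_nt += 1
        new_rules = []
        for r in rules:
            out, i = [], 0
            while i < len(r):
                if i + 1 < len(r) and (r[i], r[i + 1]) == pair:
                    out.append(nt)
                    i += 2
                else:
                    out.append(r[i])
                    i += 1
            new_rules.append(out)
        rules = new_rules
        rules.append(list(pair))  # the new production: NT -> pair
    return sum(len(r) for r in rules)

# Training phase: compress the normal corpus once.
normal = ["abcabcabc", "abcabc"]
q0 = grammar_size(normal)

# Evaluation phase: re-compress with the suspect sequence added.
def score(t):
    qt = grammar_size(normal + [t])
    i = qt - q0        # information quantity I_t
    d = i / len(t)     # information density D_t
    return i, d
```

A sequence resembling the training set (e.g. `"abcabc"`) barely grows the grammar, while one made of unseen symbols (e.g. `"xyzqrs"`) adds one grammar symbol per token, so its density is maximal; the reporting phase would then simply sort suspects by these two scores.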
Why Grammar‑Based Compression?
The authors critique generic compressors (gzip, bzip2) for three reasons: (i) they rely on LZ77 sliding windows that forget long‑range structure; (ii) they operate on byte streams, whereas logs are naturally tokenized (e.g., function names, line numbers); (iii) they can create patterns that span across distinct log entries, contaminating the evaluation of a new sequence. Grammar‑based compression, by contrast, builds a hierarchical representation that respects token boundaries and preserves the independence of individual log entries.
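Point (ii) is easy to make concrete: before any compression, log lines are mapped to whole-token symbols rather than bytes. A minimal sketch (the `func line` trace format below is a hypothetical example, not the paper's actual log format):

```python
def tokenize(log_lines):
    """Map each log token (e.g. a function name or line number) to a
    small integer ID, so the compressor sees whole tokens, not bytes."""
    vocab = {}        # token -> integer symbol ID
    sequences = []
    for line in log_lines:
        seq = [vocab.setdefault(tok, len(vocab)) for tok in line.split()]
        sequences.append(seq)
    return sequences, vocab

trace = ["open 12 read 47 read 47 close 9",
         "open 12 read 47 close 9"]
seqs, vocab = tokenize(trace)
# "read 47" becomes the symbol pair (2, 3) wherever it recurs,
# so a grammar rule can capture it; a byte-level compressor would
# instead match arbitrary substrings such as "7 re".
```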
Algorithmic Details
The greedy grammar transform iteratively applies three reduction rules:
- Rule 1 eliminates a non‑terminal that appears only once on the right‑hand side, inlining its definition.
- Rule 2 extracts a repeated substring (\beta) (length > 1) into a new non‑terminal.
- Rule 3 merges common substrings shared by two different productions into a shared non‑terminal.
The transformation is wrapped (Figure 2 in the paper) so that each original log sequence becomes a child of the start symbol; this guarantees that patterns cannot be formed across sequence boundaries. After the grammar for the normal set is built, the evaluation of a new sequence proceeds by (a) inserting the sequence as a new non‑terminal, (b) recomputing the total number of symbols in all productions, and (c) computing (I) and (D).
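Rule 1 can be illustrated in isolation. In this sketch (the `dict`-of-lists grammar representation and the `'S'` start-symbol convention are assumptions for illustration), any non-terminal used exactly once is spliced into its single use site and its production deleted:

```python
from collections import Counter

def inline_single_use(rules):
    """Rule 1 sketch: inline every non-terminal that appears exactly
    once on any right-hand side, then delete its production.
    `rules` maps non-terminal name -> list of right-hand-side symbols."""
    changed = True
    while changed:
        changed = False
        # count right-hand-side uses of every defined non-terminal
        uses = Counter(sym for body in rules.values()
                       for sym in body if sym in rules)
        for nt, n in list(uses.items()):
            if n == 1 and nt != 'S':          # never remove the start symbol
                for body in rules.values():
                    if nt in body:
                        i = body.index(nt)
                        body[i:i + 1] = rules[nt]  # splice definition in place
                        break
                del rules[nt]
                changed = True
                break
    return rules
```

For example, `{'S': ['A', 'c'], 'A': ['a', 'b']}` collapses to `{'S': ['a', 'b', 'c']}`, since `A` contributes no sharing and only lengthens the grammar.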
Experimental Validation
Two domains were used to assess the approach:
- Fine‑grained execution logs – The authors employed ReBranch, a record‑and‑replay tool that logs every branch outcome as a line‑number token. They converted traces from the lightweight web server lighttpd and the key‑value cache memcached into line‑number sequences. Normal executions formed the training set; executions containing two known non‑deterministic bugs formed the test set. The buggy traces introduced sequences that caused a substantial increase in grammar size (high (I)) and, for shorter buggy fragments, high density (D). The method successfully ranked these fragments among the top anomalies.
- System‑call traces for intrusion detection – Publicly available attack traces (e.g., buffer overflow, privilege escalation) were compared against normal process traces. Again, the abnormal traces produced markedly larger (I) and (D) values, allowing the system to separate attack phases from benign activity.
Across both experiments, the compression‑based detector achieved higher recall and lower false‑positive rates than representative Markov‑based detectors, especially for long‑range, high‑level anomalies that are invisible to short‑window statistical models.
Strengths
- Theoretical grounding – The method links Kolmogorov‑style description length to anomaly detection, providing a principled measure of “strangeness.”
- Structure awareness – By operating on tokenized logs and preserving sequence boundaries, the approach captures hierarchical patterns that generic compressors miss.
- Dual scoring – Using both absolute increase and density allows analysts to detect both large, novel blocks and concise, highly unusual snippets.
Limitations and Future Work
- Scalability – Grammar construction is linear in the total number of tokens but can be memory‑intensive for massive logs; incremental or streaming variants are needed for real‑time deployment.
- Bias toward length – Very long normal sequences may yield low density scores, potentially causing false negatives; conversely, short normal sequences could be flagged as anomalous if they happen to increase grammar size modestly. The authors suggest normalizing density further or combining it with other statistical cues.
- Parameter sensitivity – The choice of (m_1) and (m_2) influences the trade‑off between detection thoroughness and alert fatigue; adaptive thresholding could be explored.
Conclusion
The paper demonstrates that compression‑based anomaly detection, realized through a greedy grammar transform, offers a compelling alternative to conventional statistical methods. By measuring how much additional description is required to accommodate a new log sequence, the system naturally highlights behaviors that deviate from the learned normal model, even when those deviations manifest at higher abstraction levels. The experimental results on both bug‑finding in fine‑grained execution logs and intrusion detection on system‑call traces validate the approach’s effectiveness and open avenues for further research into scalable, incremental grammar learning and hybrid scoring mechanisms.