On Sequence-to-Sequence Models for Automated Log Parsing
Log parsing is a critical step in operating software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribution shifts between training and deployment data, and the brittleness of rule-based approaches. This study systematically evaluates how sequence modelling architecture, representation choice, sequence length, and training data availability influence automated log parsing performance and computational cost. We conduct a controlled empirical study comparing four sequence modelling architectures: Transformer, Mamba state-space, unidirectional LSTM, and bidirectional LSTM models. In total, 396 models are trained across multiple dataset configurations and evaluated using relative Levenshtein edit distance with statistical significance testing. The Transformer achieves the lowest mean relative edit distance (0.111), followed by Mamba (0.145), the unidirectional LSTM (0.186), and the bidirectional LSTM (0.265), where lower values are better. Mamba provides competitive accuracy at substantially lower computational cost. Character-level tokenization generally improves performance, sequence length has negligible practical impact on Transformer accuracy, and both Mamba and the Transformer demonstrate stronger sample efficiency than the recurrent models. Overall, the Transformer reduces parsing error by 23.4% relative to the best recurrent baseline, while Mamba is a strong alternative under data or compute constraints. These results also clarify the roles of representation choice, sequence length, and sample efficiency, providing practical guidance for researchers and practitioners.
💡 Research Summary
This paper presents a systematic, large‑scale empirical study of sequence‑to‑sequence (Seq2Seq) deep learning models for the task of automated log parsing. The authors compare four modern architectures—Transformer, Mamba state‑space model, unidirectional LSTM, and bidirectional LSTM—across a total of 396 experimental configurations that vary sequence length, training data size, and tokenization granularity. Two benchmark families are used: the real‑world LogHub‑2k collection (13 heterogeneous system logs, each sampled to 2,000 entries) and a synthetic HTTPd‑parse suite originally introduced by Rand et al., which provides four difficulty levels (trivial, easy, medium, hard) and three validation sets (VA, VB, VC) with controlled format mismatches (ELF vs. CLF ratios).
The Transformer and Mamba models adopt an encoder‑decoder paradigm, while the LSTMs are used as pure decoders; all are trained to map raw log strings to a character‑wise field‑type sequence (e.g., “h l u t …”). Tokenization is performed at both the character level and the word level to assess the impact of granularity. Sequence lengths of 256, 512, and 1024 tokens and training set sizes of 1k, 5k, and 10k samples are explored, yielding the 396 distinct runs.
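To make the character‑wise target concrete, the sketch below labels each character of a Common Log Format (CLF) entry with a field‑type code. The codes follow Apache's format specifiers (h = host, l = ident, u = user, t = time); the exact code alphabet and the handling of delimiters here are illustrative assumptions, not the paper's specification.

```python
# Hypothetical sketch of the character-wise field-type targets described
# above. Field codes mirror Apache CLF specifiers: h=host, l=ident,
# u=user, t=time. Spaces between fields stay spaces in the label string;
# spaces *inside* a field (e.g. the timestamp) take that field's code.

def char_labels(fields):
    """fields: ordered list of (field_text, field_code) pairs.
    Returns a label string aligned character-for-character with the
    space-joined raw log line."""
    return " ".join(code * len(text) for text, code in fields)

# Toy CLF prefix: host, identd, user, timestamp.
line_fields = [
    ("127.0.0.1", "h"),
    ("-", "l"),
    ("frank", "u"),
    ("[10/Oct/2000:13:55:36 -0700]", "t"),
]
target = char_labels(line_fields)
raw = " ".join(text for text, _ in line_fields)
```

A model trained on such pairs predicts `target` from `raw`, so every output character classifies exactly one input character.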
The primary evaluation metric is the relative Levenshtein edit distance (DR), i.e., the absolute edit distance normalized by the target length. DR captures partial correctness, rewarding predictions that get many fields right even if some are wrong. In addition, parsing accuracy (PA), FLOPs, and memory footprint are reported. Statistical significance is assessed via bootstrap resampling and Wilcoxon signed‑rank tests.
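The DR metric can be sketched in a few lines: a standard dynamic‑programming Levenshtein distance divided by the target length. The DP implementation is textbook; the handling of empty targets is an assumption not stated in the paper.

```python
# Minimal sketch of the relative Levenshtein edit distance (DR):
# absolute edit distance normalized by the target length (lower is better).

def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance over insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def relative_edit_distance(pred: str, target: str) -> float:
    """DR = Levenshtein(pred, target) / len(target)."""
    if not target:  # assumption: degenerate case, not specified in the paper
        return float(bool(pred))
    return levenshtein(pred, target) / len(target)
```

For example, a prediction that mislabels one of four characters scores DR = 0.25, capturing the partial correctness that exact-match accuracy would discard.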
Key findings:
- Accuracy – The Transformer achieves the lowest mean DR of 0.111, outperforming the other models by a large margin. This translates into a 23.4% reduction in parsing error relative to the best LSTM baseline.
- Efficiency – Mamba attains a DR of 0.145, only slightly worse than the Transformer, while requiring 3–5× fewer floating‑point operations and substantially less GPU memory. This makes it attractive for production environments with limited compute budgets.
- Tokenization – Character‑level tokenization consistently improves performance across all architectures, especially for logs with many short fields or irregular delimiters. Word‑level tokenization can increase sequence length and hurt models that are sensitive to long‑range dependencies.
- Sequence Length – Varying the maximum input length has negligible impact on Transformer performance (DR remains stable from 256 to 1024 tokens). By contrast, both LSTM variants degrade sharply as length grows, confirming their limited ability to capture long‑range context.
- Sample Efficiency – With the same amount of training data (e.g., 5 k samples), Transformer and Mamba achieve 10–15% lower DR than the LSTMs, indicating superior generalization when data are scarce or when training‑validation format shift is present.
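The bootstrap resampling used to test these per‑model DR gaps can be sketched as a percentile confidence interval on the mean paired difference. The sample data and the interval construction below are illustrative assumptions (the paper also applies Wilcoxon signed‑rank tests, not shown here).

```python
import random

# Sketch of bootstrap significance testing on paired DR differences:
# resample the per-log differences with replacement and check whether the
# percentile interval for the mean excludes zero.

def bootstrap_ci(diffs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

# Toy per-log DR differences (Transformer minus LSTM); negative values
# mean the Transformer parsed that log more accurately.
diffs = [-0.08, -0.05, -0.11, -0.02, -0.09, -0.07, -0.04, -0.10, -0.06, -0.03]
lo, hi = bootstrap_ci(diffs)
significant = hi < 0  # interval excludes zero -> difference is significant
```

Because every toy difference favors the Transformer, the whole interval lies below zero and the gap registers as significant.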
Based on these results, the authors propose practical guidelines: when ample GPU resources are available, the Transformer should be the default choice for highest accuracy. In resource‑constrained or latency‑critical settings, Mamba offers a compelling trade‑off between speed, memory, and accuracy. For heterogeneous log streams, character‑level tokenization and a modest sequence length (256–512 tokens) are sufficient.
Overall, the study fills a gap in the literature by providing the first head‑to‑head comparison of state‑of‑the‑art Seq2Seq models on realistic log parsing workloads, quantifying the effects of architectural design, token granularity, and data availability. The findings are directly actionable for researchers developing new log‑analysis tools and for practitioners tasked with building robust, scalable log ingestion pipelines.