Comparison of Outlier Detection Algorithms on String Data
Outlier detection is a well-researched and crucial problem in machine learning. However, there is little research on outlier detection for string data, as most of the literature focuses on numerical data. A robust string outlier detection algorithm could assist with data cleaning or with anomaly detection in system log files. In this thesis, we compare two string outlier detection algorithms. First, we introduce a variant of the well-known local outlier factor algorithm, tailored to string data by using the Levenshtein measure to estimate the local density of the dataset. We also present a weighted Levenshtein measure, which considers hierarchical character classes and can be used to tune the algorithm to a specific string dataset. Second, we introduce a new kind of outlier detection algorithm based on the hierarchical left regular expression learner, which infers a regular expression for the expected data. Using various datasets and parameters, we experimentally show that both algorithms can find outliers in string data. The regular expression-based algorithm is especially good at finding outliers when the expected values have a distinct structure that differs sufficiently from the structure of the outliers. In contrast, the local outlier factor variant performs best when the outliers' edit distances to the expected data are sufficiently larger than the edit distances among the expected data themselves.
💡 Research Summary
The paper addresses the relatively under‑explored problem of outlier detection in string‑valued data, proposing and empirically comparing two distinct algorithms. The first algorithm adapts the well‑known Local Outlier Factor (LOF) method to strings by employing the Levenshtein edit distance as the dissimilarity measure. To enhance its sensitivity to domain‑specific character variations, the authors introduce a hierarchical weighting scheme for character classes (e.g., letters, digits, punctuation), allowing the distance function to reflect the semantic importance of different edits. The second algorithm builds on the Hierarchical Left Regular Expression learner (HiLRE) to infer a regular expression that captures the “expected” language of the dataset. Strings that are not accepted by the learned regular expression are flagged as anomalies.
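The class‑aware distance described above can be sketched as a standard dynamic‑programming Levenshtein computation with a modified substitution cost. The character classes and weights below are illustrative assumptions, not the thesis's exact hierarchy or values:

```python
def char_class(c: str) -> str:
    """Map a character to a coarse class (illustrative hierarchy)."""
    if c.isdigit():
        return "digit"
    if c.isalpha():
        return "letter"
    return "symbol"

def substitution_cost(a: str, b: str) -> float:
    """Substitutions within a class cost less than across classes (assumed weights)."""
    if a == b:
        return 0.0
    return 0.5 if char_class(a) == char_class(b) else 1.0

def weighted_levenshtein(s: str, t: str) -> float:
    """Dynamic-programming Levenshtein with class-aware substitution costs."""
    n = len(t)
    prev = [float(j) for j in range(n + 1)]  # row for the empty prefix of s
    for i in range(1, len(s) + 1):
        curr = [float(i)] + [0.0] * n
        for j in range(1, n + 1):
            curr[j] = min(
                prev[j] + 1.0,                                        # deletion
                curr[j - 1] + 1.0,                                    # insertion
                prev[j - 1] + substitution_cost(s[i - 1], t[j - 1]),  # substitution
            )
        prev = curr
    return prev[n]
```

Under these assumed weights, swapping one digit for another (`"1234"` vs. `"1235"`) costs 0.5, while swapping a digit for a letter (`"1234"` vs. `"123a"`) costs 1.0, so cross‑class edits are penalized more heavily.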
Both methods are evaluated on a suite of synthetic and real‑world datasets, including zip‑codes, county names, phone numbers, and mixed‑format identifiers. The evaluation uses standard metrics such as ROC‑AUC, precision‑recall curves, and F1‑score. Results show a clear complementarity: the regular‑expression‑based approach excels when the expected data exhibits a distinct syntactic structure that differs markedly from outliers (e.g., zip‑codes vs. free‑form text), achieving AUC values above 0.94 in those scenarios. Conversely, the weighted‑Levenshtein‑LOF performs best when outliers are separated from normal data by a noticeable edit‑distance gap, reaching AUC scores around 0.91 on datasets where minor character substitutions or insertions create anomalies.
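The flagging step of the regular‑expression approach is straightforward once a pattern has been learned. The sketch below assumes HiLRE's inference has already produced a pattern (here, a hypothetical five‑digit zip‑code expression); the learner itself is not reproduced:

```python
import re

# Assumed output of the learner for a zip-code column: five digits.
learned_pattern = re.compile(r"\d{5}")

def flag_outliers(strings: list[str]) -> list[str]:
    """Strings not fully matched by the learned expression are flagged as outliers."""
    return [s for s in strings if not learned_pattern.fullmatch(s)]

flag_outliers(["12345", "9876", "hello", "54321"])  # -> ["9876", "hello"]
```

Note the use of `fullmatch` rather than `match`: a string must conform to the learned expression in its entirety, not merely begin with a conforming prefix.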
The paper also discusses practical considerations. The hierarchical weighting in LOF can be tuned manually or via cross‑validation, and the choice of k influences the trade‑off between local noise sensitivity and global pattern detection. The HiLRE learner requires setting a minimum match count and a maximum tree depth to avoid over‑fitting; its training time grows super‑linearly with dataset size, making it less suitable for very large logs without further optimization.
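The role of k can be seen in a minimal LOF sketch over strings. This is a generic textbook LOF using a plain (unweighted) Levenshtein distance, not the thesis's tuned implementation; the data and choice of k below are illustrative:

```python
def levenshtein(s: str, t: str) -> int:
    """Plain edit distance (no class weighting)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def lof_scores(data: list[str], k: int) -> list[float]:
    """Local outlier factor of each string; scores well above 1 indicate outliers."""
    n = len(data)
    d = [[levenshtein(a, b) for b in data] for a in data]
    neigh, kdist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])
        kd = d[i][order[k - 1]]                      # k-distance of point i
        kdist.append(kd)
        neigh.append([j for j in order if d[i][j] <= kd])
    def lrd(i: int) -> float:
        # Local reachability density: inverse mean reachability distance.
        reach = [max(kdist[j], d[i][j]) for j in neigh[i]]
        return len(reach) / sum(reach) if sum(reach) > 0 else float("inf")
    lrds = [lrd(i) for i in range(n)]
    return [sum(lrds[j] for j in neigh[i]) / (len(neigh[i]) * lrds[i]) for i in range(n)]

scores = lof_scores(["cat", "cap", "can", "car", "xylophone"], k=2)
# The structurally distant string receives by far the highest score.
```

A larger k smooths the density estimate over more neighbours (more robust to local noise, less sensitive to small clusters), while a small k reacts to fine‑grained local structure; this is the trade‑off the summary refers to.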
In the conclusion, the authors summarize that neither method dominates universally; instead, the data’s intrinsic characteristics dictate which algorithm is preferable. They propose future work including hybrid ensembles that combine LOF scores with regular‑expression conformity, integration of neural character embeddings to enrich distance calculations, and streaming extensions for real‑time log monitoring. Overall, the study provides a solid baseline for string‑based outlier detection and highlights important avenues for advancing the field.