LogMaster: Mining Event Correlations in Logs of Large-scale Cluster Systems


This paper presents a methodology and a system, named LogMaster, for mining correlations of events that have multiple attributes, i.e., node ID, application ID, event type, and event severity, in logs of large-scale cluster systems. Different from traditional transactional data, e.g., supermarket purchases, system logs have unique characteristics, and hence we propose several innovative approaches to mine their correlations. We present a simple metric to measure correlations of events that may occur in an interleaved manner. On the basis of this measure, we propose two approaches to mine event correlations; meanwhile, we propose an innovative abstraction, event correlation graphs (ECGs), to represent event correlations, and present an ECG-based algorithm for predicting events. For two system logs, one from a production Hadoop-based cloud computing system at the Research Institute of China Mobile and one from a production HPC cluster system at Los Alamos National Lab (LANL), we evaluate our approaches in three scenarios: (a) predicting all events on the basis of both failure and non-failure events; (b) predicting only failure events on the basis of both failure and non-failure events; (c) predicting failure events after removing non-failure events.


💡 Research Summary

LogMaster is a comprehensive framework designed to mine and exploit event correlations in the logs of large‑scale cluster systems, where each log entry carries multiple attributes such as node ID, application ID, event type, and severity. The authors first formalize log records as nine‑tuple records and define n‑ary Log ID Sequences (LES) that preserve temporal ordering. To quantify correlation, they introduce a confidence metric based on two counts: the support count (how often the set of preceding events is followed by the posterior event) and the posterior count (how often the posterior event occurs at all). Confidence is simply the support count divided by the posterior count.
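The confidence metric can be sketched roughly as follows. The function name, the event representation, and the fixed-size lookback window are our own illustrative choices, not the paper's exact definitions (which operate on timestamped LES within a sliding time window):

```python
def rule_confidence(log, preceding, posterior, window=2):
    """Toy confidence for a rule "preceding events -> posterior event".

    `log` is a time-ordered list of event IDs. A rule occurrence is counted
    when all preceding events appear within `window` positions before an
    occurrence of the posterior event. Illustrative only.
    """
    posterior_count = sum(1 for e in log if e == posterior)
    support_count = 0
    for i, e in enumerate(log):
        if e == posterior:
            recent = set(log[max(0, i - window):i])
            if set(preceding) <= recent:
                support_count += 1
    # Confidence = support count / posterior count.
    return support_count / posterior_count if posterior_count else 0.0

log = ["A", "B", "C", "A", "B", "C", "D", "C"]
# "C" occurs 3 times; 2 of those occurrences are directly preceded by A, B.
print(rule_confidence(log, ["A", "B"], "C"))  # → 0.666...
```

A rule with confidence above a chosen threshold would then be kept as an event rule for graph construction.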

Building on this metric, two Apriori‑style mining algorithms are proposed. Apriori‑LES generates candidate k‑ary LES only when both (k‑1)‑ary adjacent subsets are frequent, dramatically reducing the candidate space compared to classic Apriori. Apriori‑simiLES further improves efficiency by observing that most 2‑ary rules in real logs involve events on the same node, application, or type; therefore it restricts mining to 2‑ary rules, cutting analysis time while retaining the most informative patterns.
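The candidate-pruning idea attributed to Apriori‑LES can be sketched as below. This is a generic Apriori-style join over event sequences, with our own function names; the paper's algorithm additionally enforces temporal windows and confidence thresholds:

```python
def gen_candidates(frequent_km1):
    """Generate candidate k-ary sequences from frequent (k-1)-ary ones.

    A k-ary candidate (e1, ..., ek) is emitted only when both of its
    adjacent (k-1)-ary subsequences, (e1, ..., e_{k-1}) and (e2, ..., ek),
    are already frequent. Sketch only, not the paper's implementation.
    """
    freq = set(frequent_km1)
    candidates = set()
    for a in freq:
        for b in freq:
            # Join step: a's tail must equal b's head, so both adjacent
            # (k-1)-ary subsequences of the result are frequent.
            if a[1:] == b[:-1]:
                candidates.add(a + (b[-1],))
    return candidates

freq2 = {("A", "B"), ("B", "C"), ("A", "C")}
print(sorted(gen_candidates(freq2)))  # → [('A', 'B', 'C')]
```

Because candidates are built only from pairs of already-frequent adjacent subsequences, the search space shrinks sharply relative to enumerating all k-ary combinations.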

The discovered event rules are abstracted into Event Correlation Graphs (ECGs), where nodes represent events and directed edges encode rules (condition → consequence). ECGs enable graph traversal for predictive analytics: given a current event, the graph can suggest likely future events, facilitating proactive failure detection.
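A minimal sketch of this graph-based prediction follows. The class, edge layout, and confidence threshold are illustrative assumptions; the paper's ECGs store richer per-rule metadata:

```python
from collections import defaultdict

class ECG:
    """Toy event correlation graph: directed edges weighted by confidence."""

    def __init__(self):
        self.edges = defaultdict(list)  # event -> [(consequence, confidence)]

    def add_rule(self, condition, consequence, confidence):
        self.edges[condition].append((consequence, confidence))

    def predict(self, current_event, min_confidence=0.5):
        """Return likely future events, highest-confidence edges first."""
        hits = [(e, c) for e, c in self.edges[current_event]
                if c >= min_confidence]
        return sorted(hits, key=lambda x: -x[1])

g = ECG()
g.add_rule("disk_warn", "disk_fail", 0.8)   # hypothetical rule
g.add_rule("disk_warn", "net_drop", 0.3)    # below threshold, filtered out
print(g.predict("disk_warn"))  # → [('disk_fail', 0.8)]
```

Given an observed event, a traversal like this surfaces the high-confidence consequences, which is the basis for proactive failure warnings.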

LogMaster’s architecture consists of a log agent (per‑node collection and preprocessing), a log server (mining and ECG construction), and a log database (persistent storage of rules and graphs). The system was evaluated on two real datasets: a 260‑node Hadoop cluster (130 MB, ~978 k records) and a 256‑node HPC cluster at LANL (31.5 MB, ~433 k records). Three prediction scenarios were tested: (a) predicting all events using both failure and non‑failure logs, (b) predicting only failure events using the full log set, and (c) predicting failures after removing non‑failure events. Results show precision rates of 78.20 % (Hadoop) and 81.19 % (HPC) for all‑event prediction, with comparable performance for failure‑only scenarios.

The paper highlights the inadequacy of traditional frequent‑itemset or sequential pattern mining for system logs, which require temporal constraints and multi‑attribute handling. By integrating a simple yet effective confidence metric, a tailored Apriori variant, and a graph‑based representation, LogMaster advances the state of the art in log‑driven failure prediction. Limitations include the lack of sensitivity analysis for window size and thresholds, and the absence of a concrete real‑time streaming deployment strategy. Nonetheless, LogMaster provides a solid foundation for future work on automated, graph‑driven fault diagnosis and proactive maintenance in large‑scale computing environments.

