Online Multi-modal Root Cause Identification in Microservice Systems
Root Cause Analysis (RCA) is essential for pinpointing the root causes of failures in microservice systems. Traditional data-driven RCA methods are typically limited to offline applications due to high computational demands, and existing online RCA methods handle only single-modal data, overlooking complex interactions in multi-modal systems. In this paper, we introduce OCEAN, a novel online multi-modal causal structure learning method for root cause localization. OCEAN employs a dilated convolutional neural network to capture long-term temporal dependencies and graph neural networks to learn causal relationships among system entities and key performance indicators. We further design a multi-factor attention mechanism to analyze and reassess the relationships among different metrics and log indicators/attributes for enhanced online causal graph learning. Additionally, a contrastive mutual information maximization-based graph fusion module is developed to effectively model the relationships across various modalities. Extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of our proposed method.
💡 Research Summary
In modern microservice architectures, identifying the root cause of failures is an increasingly complex task due to the intricate interdependencies between numerous services. Traditional Root Cause Analysis (RCA) methods face two primary hurdles: they are often computationally too expensive for real-time (online) application, and they typically rely on single-modal data (either metrics or logs), thereby failing to capture the synergistic interactions between different data types. To address these challenges, this paper introduces OCEAN, a novel online multi-modal causal structure learning framework designed for efficient and accurate root cause localization.
The OCEAN framework is built upon solving three critical technical challenges. First, to address the difficulty of capturing long-term temporal dependencies without the prohibitive computational costs of Transformers or RNNs, the authors employ a Temporal Convolutional Network (TCN). By stacking dilated convolutions, whose receptive field grows exponentially with network depth, OCEAN can cover thousands of time steps at far lower computational cost. Second, the paper introduces a multi-factor attention mechanism to model the complex interactions between metrics and logs. This mechanism calculates similarity matrices between different modalities and extracts importance weights, which are then integrated into a GraphSAGE-based architecture to prioritize influential features during causal graph learning. Third, to ensure effective multi-modal fusion, the authors implement a contrastive mutual information maximization approach using the InfoNCE loss. This technique maximizes the mutual information between metric and log representations, ensuring that the two modalities complement each other and preventing low-quality data from degrading the overall causal graph quality.
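To make the fusion objective concrete, the sketch below implements a symmetric InfoNCE loss over paired metric and log embeddings. This is a generic illustration of the contrastive mutual information maximization idea, not the paper's exact formulation; the batch construction (same entity as positive pair, other entities as negatives) and the temperature value are illustrative assumptions.

```python
import numpy as np

def infonce_loss(metric_emb, log_emb, temperature=0.1):
    """Symmetric InfoNCE: the metric/log embeddings of the same system
    entity form the positive pair; all other entities in the batch act
    as in-batch negatives. Inputs have shape (num_entities, dim).
    NOTE: a generic sketch of the technique, not OCEAN's exact loss."""
    z_m = metric_emb / np.linalg.norm(metric_emb, axis=1, keepdims=True)
    z_l = log_emb / np.linalg.norm(log_emb, axis=1, keepdims=True)
    logits = z_m @ z_l.T / temperature  # (N, N) scaled cosine similarities

    def ce_diag(lgt):
        # cross-entropy with the diagonal (matching pair) as the target
        lgt = lgt - lgt.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lgt - np.log(np.exp(lgt).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average both directions: metrics->logs and logs->metrics
    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))
```

Minimizing this loss pulls the two modalities' representations of the same entity together while pushing apart representations of different entities, which is what "maximizing mutual information between metric and log representations" amounts to in practice.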
The architecture of OCEAN is strategically divided into a static encoder and a dynamic encoder. The static encoder performs offline training on large-scale historical datasets to establish a stable baseline causal structure ($A_{old}$). In contrast, the dynamic encoder operates on streaming data batches to estimate incremental changes ($\Delta A$), allowing for progressive graph updates. Since both encoders share the same TCN-based temporal encoder and multi-factor attention modules, the model allows for parameter reuse, minimizing the computational overhead during the online fine-tuning phase.
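The progressive update described above can be sketched as a simple additive rule on adjacency matrices. The clipping range and the sparsification threshold below are illustrative assumptions for keeping the graph well-formed and sparse; the paper does not specify these post-processing details.

```python
import numpy as np

def update_causal_graph(a_old, delta_a, threshold=0.1):
    """Progressive causal-graph update: add the increment estimated by the
    dynamic encoder on the current streaming batch to the offline baseline
    from the static encoder. Clipping to [0, 1] and pruning weak edges are
    illustrative choices, not details taken from the paper."""
    a_new = np.clip(a_old + delta_a, 0.0, 1.0)   # keep edge weights valid
    a_new[a_new < threshold] = 0.0               # drop negligible edges
    return a_new
```

Because only ΔA is re-estimated per batch while A_old stays fixed, the expensive structure learning happens once offline, and the online phase reduces to a cheap fine-tuning step with shared encoder parameters.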
Extensive evaluations conducted on real-world datasets from AWS, Alibaba, and Azure demonstrate the superiority of OCEAN. The proposed method achieved a 12% to 18% improvement in Top-5 accuracy compared to existing online single-modal approaches. Furthermore, OCEAN ran inference more than five times faster than the offline-based VAR-GNN, confirming its suitability for real-time environments. Visualization of attention weights further showed that the model accurately re-weights key indicators, such as "error count" in logs and "CPU utilization" in metrics, during failure events.
Despite these advancements, the paper identifies certain limitations, such as the reliance on domain-specific rules for log preprocessing and the potential instability of contrastive learning in the presence of noisy labels. Future research directions include developing automated log parsing techniques and designing more robust contrastive loss functions to handle label noise effectively.