FT-MoE: Sustainable-learning Mixture of Experts for Fault-Tolerant Computing
Intelligent fault-tolerant (FT) computing has recently demonstrated significant advantages in predicting and diagnosing faults proactively, thereby ensuring reliable service delivery. However, due to the heterogeneity of fault knowledge, dynamic workloads, and limited data support, existing deep learning-based FT algorithms face challenges in fault detection quality and training efficiency. This is primarily because their homogenized perception of fault knowledge makes it difficult to fully capture diverse and complex fault patterns. To address these challenges, we propose FT-MoE, a sustainable-learning fault-tolerant computing framework based on a dual-path architecture for high-accuracy fault detection and classification. This model employs a mixture-of-experts (MoE) architecture, enabling different parameters to learn distinct fault knowledge. Additionally, we adopt a two-stage learning scheme that combines comprehensive offline training with continual online tuning, allowing the model to adaptively optimize its parameters in response to evolving real-time workloads. To facilitate realistic evaluation, we construct a new fault detection and classification dataset for edge networks, comprising 10,000 intervals with fine-grained resource features, surpassing existing datasets in both scale and granularity. Finally, we conduct extensive experiments on the FT benchmark to verify the effectiveness of FT-MoE. Results demonstrate that our model outperforms state-of-the-art methods.
💡 Research Summary
This paper addresses the critical challenge of intelligent fault-tolerant (FT) computing in dynamic edge network environments. While proactive fault prediction and diagnosis are essential for reliable service delivery, existing deep learning-based FT methods struggle with fault detection quality and training efficiency due to three main issues: the heterogeneity of fault knowledge across different failure types, highly dynamic workloads, and a lack of realistic, fine-grained training datasets.
To overcome these limitations, the authors propose FT-MoE, a novel sustainable-learning framework for fault-tolerant computing. The core innovation lies in its dual-path architecture, designed to capture both the intrinsic characteristics of faults and the systemic context of task scheduling.
The first path is a Fault-Adaptive Mixture-of-Experts (MoE) layer. Instead of using a single model for all faults, this path employs multiple expert networks. A novel Experts-Adaptation Gating (EA Gate) mechanism dynamically selects and activates the most relevant subset of experts for each input based on cosine similarity, allowing the model to specialize in and handle diverse fault patterns efficiently. The “Top-any” function within the gate provides flexibility by allowing a variable number of experts to be activated per input.
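The gating idea described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes the EA Gate keeps one learnable key per expert, activates every expert whose cosine similarity to the input exceeds a threshold ("Top-any", so the active count varies per input), and mixes the active experts' outputs with similarity-derived weights. The class name, threshold value, and fallback rule are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class EAGateMoE:
    """Sketch of a mixture-of-experts layer with a cosine-similarity
    'Top-any' gate: any expert above the similarity threshold fires,
    so the number of active experts varies per input."""

    def __init__(self, dim, num_experts, threshold=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.keys = rng.standard_normal((num_experts, dim))  # one gate key per expert
        self.experts = [rng.standard_normal((dim, dim)) / np.sqrt(dim)
                        for _ in range(num_experts)]
        self.threshold = threshold

    def forward(self, x):
        # Cosine similarity between the input and each expert's key.
        sim = (self.keys @ x) / (np.linalg.norm(self.keys, axis=1)
                                 * np.linalg.norm(x) + 1e-8)
        active = np.where(sim > self.threshold)[0]
        if active.size == 0:              # fall back to the single best expert
            active = np.array([sim.argmax()])
        weights = softmax(sim[active])    # renormalize over the active subset
        out = sum(w * (self.experts[i].T @ x) for i, w in zip(active, weights))
        return out, active
```

Compared with a fixed Top-k gate, this thresholded variant lets simple inputs touch one expert while ambiguous fault patterns recruit several, which matches the flexibility the summary attributes to the Top-any function.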
The second path is a Schedule-Aware Graph Encoder. Recognizing that task migrations between hosts significantly impact system state and potential faults, this path models the scheduling decisions as a graph. It uses a Graph Attention Network (GAT) to encode how resource states propagate and influence each other across hosts through these scheduling edges, capturing the relational context of fault occurrences.
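A single graph-attention layer over scheduling edges, as used in this path, can be sketched like this. It is a bare-bones, single-head GAT in the style of Veličković et al., assuming hosts are nodes, task migrations are directed edges from source to destination host, and self-loops let each host attend to its own state; the function and parameter names are illustrative, not the paper's.

```python
import numpy as np

def gat_layer(H, edges, W, a, leaky=0.2):
    """Minimal single-head graph attention layer.

    H: (N, F) per-host resource features; edges: list of (src, dst)
    scheduling decisions; W: (F, F') shared projection; a: (2*F',)
    attention vector."""
    Z = H @ W
    N = Z.shape[0]
    # Self-loops plus incoming scheduling edges: host state propagates
    # from the migration's source host to its destination host.
    adj = [[i] for i in range(N)]
    for s, d in edges:
        adj[d].append(s)
    out = np.zeros_like(Z)
    for i in range(N):
        nbrs = adj[i]
        scores = np.array([np.concatenate([Z[i], Z[j]]) @ a for j in nbrs])
        scores = np.where(scores > 0, scores, leaky * scores)  # LeakyReLU
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()                                   # softmax over neighbors
        out[i] = sum(w * Z[j] for w, j in zip(alpha, nbrs))
    return out
```

The key property for this setting is that attention weights are edge-dependent, so a host receiving many migrations can weight the senders unequally rather than averaging them.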
The outputs from these two paths are not processed independently. They are fused through a Cross Multi-Head Attention (CMHA) layer. Here, the fault-specific features from the MoE path serve as the Query, actively seeking relevant contextual information from the scheduling-aware host features (Key and Value) generated by the graph encoder. This enables the model to learn complex relationships, such as which resource anomaly patterns, under specific scheduling conditions, lead to particular faults. The fused representation is then passed through separate feed-forward networks for binary fault detection and multi-class fault type classification.
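The query/key-value asymmetry of this fusion step can be made concrete with a small sketch. For brevity it uses identity projections per head (a real CMHA layer would learn per-head projection matrices and an output projection); the assumption shown is only the direction of attention, with fault features as Query and host features as Key and Value.

```python
import numpy as np

def cross_attention(fault_feats, host_feats, num_heads=2):
    """Cross multi-head attention sketch: fault-specific features (Query)
    attend over schedule-aware host features (Key/Value)."""
    d = fault_feats.shape[-1]
    assert d % num_heads == 0
    hd = d // num_heads
    out = np.zeros_like(fault_feats)
    for h in range(num_heads):
        sl = slice(h * hd, (h + 1) * hd)
        Q, K, V = fault_feats[:, sl], host_feats[:, sl], host_feats[:, sl]
        scores = Q @ K.T / np.sqrt(hd)            # scaled dot-product
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = e / e.sum(axis=-1, keepdims=True)  # softmax over hosts
        out[:, sl] = attn @ V                     # context pulled from hosts
    return out
```

Because the Query comes from the MoE path, each fault representation selects which hosts' scheduling context to absorb, rather than the two paths being concatenated symmetrically.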
Beyond architecture, FT-MoE introduces a two-stage sustainable learning strategy. It undergoes comprehensive offline training on a historical dataset to establish a strong foundational model. After deployment, it engages in online tuning, where the model continuously adapts its parameters (including dynamically adding or pruning experts) using newly arriving real-time data. This approach combats performance degradation due to concept drift and ensures long-term adaptability in evolving edge environments.
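One way the online stage's expert add/prune decision could work is sketched below. This is a hypothetical maintenance rule, not the paper's algorithm: it assumes the system tracks how often each expert was activated over a recent window and the best gate similarity per input, prunes experts that have gone idle, and signals for a new expert when no existing one fits the incoming data well. Both thresholds are illustrative.

```python
import numpy as np

def online_expert_maintenance(usage_counts, gate_sims,
                              min_usage=5, novelty_thresh=0.2):
    """Hypothetical online-tuning rule for expert capacity.

    usage_counts: activations per expert over a recent window.
    gate_sims: (num_inputs, num_experts) gate similarities for that window.
    Returns indices of experts to prune, and whether to spawn a new expert."""
    # Prune experts the gate rarely selects anymore.
    prune = [i for i, c in enumerate(usage_counts) if c < min_usage]
    # If even the best-matching expert scores low on average, recent
    # inputs look novel (concept drift) and a fresh expert is warranted.
    add_new = float(np.mean(np.max(gate_sims, axis=1))) < novelty_thresh
    return prune, add_new
```

A rule of this shape gives the continual-learning loop a bounded, data-driven way to grow and shrink capacity, which is what lets the framework track concept drift without retraining from scratch.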
To enable realistic evaluation, the authors constructed a new, large-scale Fault Detection and Classification Dataset for Edge Networks. This dataset comprises over 10,000 time intervals and features fine-grained host resource metrics and scheduling decisions, offering greater scale and granularity than previously available public datasets.
Extensive experiments on this benchmark demonstrate that FT-MoE significantly outperforms state-of-the-art baselines, including specialized algorithm-based methods (DFTM, ECLB, PCFT, CMODLB) and recent deep learning models (PreGAN, PreGAN+), in terms of both fault detection accuracy and classification performance. The results validate the effectiveness of its dual-path design, dynamic expert selection, cross-attention fusion, and sustainable learning paradigm, positioning FT-MoE as a robust and adaptive solution for proactive fault management in complex edge computing systems.