EPSILON: Adaptive Fault Mitigation in Approximate Deep Neural Network using Statistical Signatures


The increasing adoption of approximate computing in deep neural network accelerators (AxDNNs) promises significant energy efficiency gains. However, permanent faults in AxDNNs can severely degrade their performance compared to their accurate counterparts (AccDNNs). Traditional fault detection and mitigation approaches, while effective for AccDNNs, introduce substantial overhead and latency, making them impractical for energy-constrained real-time deployment. To address this, we introduce EPSILON, a lightweight framework that leverages pre-computed statistical signatures and layer-wise importance metrics for efficient fault detection and mitigation in AxDNNs. Our framework introduces a novel non-parametric pattern-matching algorithm that enables constant-time fault detection without interrupting normal execution while dynamically adapting to different network architectures and fault patterns. EPSILON maintains model accuracy by intelligently adjusting mitigation strategies based on a statistical analysis of weight distribution and layer criticality while preserving the energy benefits of approximate computing. Extensive evaluations across various approximate multipliers, AxDNN architectures, popular datasets (MNIST, CIFAR-10, CIFAR-100, ImageNet-1k), and fault scenarios demonstrate that EPSILON maintains 80.05% accuracy while offering 22% improvement in inference time and 28% improvement in energy efficiency, establishing EPSILON as a practical solution for deploying reliable AxDNNs in safety-critical edge applications.


💡 Research Summary

The paper introduces EPSILON, a lightweight, just‑in‑time (JIT) fault detection and mitigation framework specifically designed for approximate deep neural network accelerators (AxDNNs). Approximate computing can dramatically reduce energy consumption in DNN inference, but permanent hardware faults can cause severe accuracy loss, especially in safety‑critical edge applications. Traditional fault‑tolerance techniques such as functional testing, built‑in self‑test (BIST), error‑correcting codes (ECC), or retraining either require dedicated test modes, incur large area/power overhead, or need access to training data, making them unsuitable for energy‑constrained real‑time deployment.

EPSILON leverages two complementary ideas: (1) multi‑exit neural networks (MENNs), which provide early‑exit predictions with confidence scores, and (2) pre‑computed statistical signatures for each layer obtained from a fault‑free golden model. Each layer l is characterized by a signature Sₗ = {µₗ, σₗ, Qₗ, ρₗ}, where µₗ and σₗ are the mean and standard deviation of the weights, Qₗ contains quartile values, and ρₗ encodes the sparsity pattern. A layer‑importance factor αₗ = βₚ·γₛ captures positional importance (βₚ) and structural importance (γₛ) such as connectivity and sparsity. The detection threshold for a layer is computed as Tₗ = (m + αₗ)·σₗ, allowing more critical layers to have tighter bounds.
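The signature and threshold computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `layer_signature` and the way `βₚ` and `γₛ` are supplied as plain inputs are assumptions, since the summary does not spell out how the importance sub-factors are derived.

```python
import numpy as np

def layer_signature(weights, beta_p, gamma_s, m=2.0):
    """Sketch of a per-layer signature S_l = {mu_l, sigma_l, Q_l, rho_l}
    and detection threshold T_l = (m + alpha_l) * sigma_l.

    beta_p / gamma_s (positional / structural importance) are assumed to be
    given; their exact computation is not detailed in the summary."""
    w = weights.ravel()
    mu = w.mean()                                 # mu_l: mean of the weights
    sigma = w.std()                               # sigma_l: std of the weights
    quartiles = np.percentile(w, [25, 50, 75])    # Q_l: quartile values
    sparsity = (weights != 0).astype(np.uint8)    # rho_l: binary sparsity pattern
    alpha = beta_p * gamma_s                      # alpha_l = beta_p * gamma_s
    threshold = (m + alpha) * sigma               # T_l = (m + alpha_l) * sigma_l
    return {"mu": mu, "sigma": sigma, "Q": quartiles,
            "rho": sparsity, "alpha": alpha, "T": threshold}
```

Because the signature is pre-computed once from the fault-free golden model, none of this work falls on the inference path.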

During inference, EPSILON operates in two stages. First, each exit i computes its prediction Fᵢ(x) and the associated confidence confᵢ = max(Fᵢ(x)). If any confᵢ exceeds a pre‑set confidence threshold γ, the framework immediately returns that prediction, bypassing any further analysis. This early‑exit path preserves the energy benefits of approximation when the network is operating correctly.
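The early-exit stage can be expressed compactly. In this sketch, `exit_outputs` is an assumed representation: a list of softmax probability vectors, one per exit, ordered from earliest to final.

```python
import numpy as np

def early_exit_inference(exit_outputs, gamma=0.9):
    """Stage 1 sketch: return the first exit whose confidence
    conf_i = max(F_i(x)) exceeds the threshold gamma."""
    for i, probs in enumerate(exit_outputs):
        conf = float(np.max(probs))
        if conf > gamma:
            # confident early exit: return prediction, skip fault analysis
            return i, int(np.argmax(probs))
    # every exit is low-confidence -> caller triggers the statistical stage
    return None, None
```

Returning `(None, None)` here is the signal that all exits were uncertain, which is exactly the condition that activates the statistical analysis stage described next.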

If all exits produce low confidence (confᵢ ≤ γ), EPSILON activates its statistical analysis stage. For every layer, the current weight matrix Wₗ is compared against its reference sparsity pattern ρₗ, producing a pattern‑score = Σ|ρₗ(i) − ρ_curr(i)|. When the pattern‑score exceeds Tₗ, a fault is declared in that layer. Faulty weights are then corrected: any weight w whose deviation |w − µₗ| exceeds Tₗ is replaced by the nearest valid value derived from the quartile set Qₗ. After all identified layers are repaired, the network re‑evaluates the input at the final exit to produce the final prediction.
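The detection and repair logic above might look like the following. This is a sketch under stated assumptions: `sig` carries the signature fields from the golden model (µₗ, σₗ, Qₗ, ρₗ, Tₗ), and "nearest valid value derived from Qₗ" is interpreted as snapping to the closest quartile; the paper's exact repair rule may differ.

```python
import numpy as np

def detect_and_repair(W, sig):
    """Stage 2 sketch: flag a layer whose sparsity pattern has drifted
    beyond T_l, then snap deviant weights to the nearest quartile in Q_l."""
    # pattern-score = sum_i |rho_l(i) - rho_curr(i)|
    rho_curr = (W != 0).astype(np.uint8)
    pattern_score = np.abs(sig["rho"].astype(int) - rho_curr.astype(int)).sum()
    faulty = pattern_score > sig["T"]
    if faulty:
        W = W.copy()
        # a weight is deviant when |w - mu_l| exceeds T_l
        deviant = np.abs(W - sig["mu"]) > sig["T"]
        # replace each deviant weight by its nearest quartile value from Q_l
        q = np.asarray(sig["Q"])
        nearest = q[np.argmin(np.abs(W[..., None] - q), axis=-1)]
        W[deviant] = nearest[deviant]
    return faulty, W
```

Note that both the detection and the repair need only the current weights and the pre-computed signature, which is what keeps the fallback path free of test patterns and retraining.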

Key contributions include: (i) a JIT detection mechanism that requires no test patterns or retraining, (ii) a layer‑importance‑aware thresholding scheme that adapts to the varying sensitivity of layers, and (iii) a dual‑stage workflow that exploits MENN early exits for minimal overhead while providing a thorough statistical fallback when needed.

The authors evaluate EPSILON across multiple approximate multipliers, several AxDNN architectures (e.g., ResNet‑18, MobileNet‑V2), and four benchmark datasets (MNIST, CIFAR‑10, CIFAR‑100, ImageNet‑1k). Fault rates ranging from 10% to 50% are injected. Compared with state‑of‑the‑art fault‑tolerance methods (BIST/ECC, redundancy, retraining), EPSILON maintains an average accuracy of 80.05% even at a 50% fault rate, reduces inference latency by 22%, and improves energy efficiency by 28%. Ablation studies show that removing the importance factor αₗ leads to a >12% drop in accuracy, confirming the effectiveness of the dynamic thresholding.

In conclusion, EPSILON demonstrates that statistical signatures combined with multi‑exit architectures can provide fast, low‑overhead fault detection and mitigation for approximate DNNs, making reliable deployment feasible in safety‑critical edge scenarios such as autonomous vehicles, medical monitoring, and industrial control. Future work is suggested on adaptive learning of importance factors, extending the approach to transformer‑based models, and validating the methodology on silicon prototypes.

