V-ABFT: Variance-Based Adaptive Threshold for Fault-Tolerant Matrix Multiplication in Mixed-Precision Deep Learning


Algorithm-Based Fault Tolerance (ABFT) is widely adopted to detect silent data corruptions (SDCs) in matrix multiplication, a cornerstone operation in deep learning systems. However, existing threshold determination methods face critical challenges: analytical bounds are overly conservative, while probabilistic approaches like A-ABFT yield thresholds $160$–$4200\times$ larger than actual rounding errors. We present V-ABFT, a variance-based adaptive threshold algorithm that achieves tighter error bounds by directly modeling the verification difference. By leveraging statistical variance estimation, V-ABFT reduces the threshold-to-actual-error ratio to approximately $7$–$20\times$ for FP32/FP64 and $48$–$158\times$ for BF16, representing a **6–48$\times$ improvement** over A-ABFT while maintaining zero false positive rate across BF16, FP16, FP32, and FP64 precisions. Furthermore, we demonstrate that for fused-kernel ABFT implementations that verify before output quantization, low-precision GEMM can use FP32-level thresholds ($e_{\max} \approx 10^{-6}$), enabling **$\sim$1000$\times$ finer detection granularity** compared to offline verification with low-precision output ($e_{\max} \approx 10^{-3}$). We reproduce A-ABFT's experimental setup and validate our implementation against the original paper's results. Our method requires only $O(n)$ complexity using max/min/mean statistics, compared to A-ABFT's $O(pn)$ for finding $p$ largest values. Extensive experiments on synthetic data and real model weights (LLaMA-7B, GPT-2, ViT) demonstrate V-ABFT's effectiveness across diverse distributions. V-ABFT is platform-agnostic and has been integrated into fault-tolerant GEMM implementations on both NPUs and GPUs.


💡 Research Summary

The paper introduces V‑ABFT, a variance‑based adaptive threshold algorithm for algorithm‑based fault tolerance (ABFT) in matrix multiplication, targeting the silent data corruptions (SDCs) that increasingly affect large‑scale deep‑learning workloads. Traditional threshold determination methods suffer from excessive conservatism (analytical bounds), heavy calibration effort (experimental tuning), or overly loose probabilistic bounds (A‑ABFT), which can be 160–4200× larger than the actual rounding errors. V‑ABFT tackles this by directly modeling the verification difference E between the checksum‑based and direct‑row‑sum verification paths, rather than bounding each component separately.
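The checksum-vs-direct comparison that ABFT builds on can be sketched as follows; this is a minimal NumPy illustration of the two verification paths, not the paper's fused kernel:

```python
import numpy as np

# Minimal ABFT check for C = A @ B. The checksum path computes
# (1^T A) @ B; the direct path sums the rows of C. In exact arithmetic
# the two are equal; in floating point they differ by a small rounding
# term E, which V-ABFT's threshold must bound.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 32)).astype(np.float32)
B = rng.standard_normal((32, 48)).astype(np.float32)

C = A @ B
checksum_path = A.sum(axis=0) @ B   # checksum of A first, then multiply
direct_path = C.sum(axis=0)         # multiply first, then sum rows of C
E = checksum_path - direct_path     # verification difference (rounding only)
print(np.abs(E).max())              # small FP32 rounding residue
```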

The authors decompose each matrix row into a mean μ and a standard deviation σ, treating the residuals as unit‑variance random variables. By introducing per‑row/column aggregates αₖ and βₖ that capture the average and variance of the relative rounding‑error factors eₖₙ between the two computation paths, the verification difference E is expressed as a sum of four terms: a deterministic bias (proportional to N·μ_A·μ_B), a “random B” term (√N·μ_A·σ_B·βₖ·b′ₖ), a “random A” term (√K·σ_A·μ_B·αₖ·aₘₖ), and an interaction term (√(NK)·σ_A·σ_B·αₖ·βₖ·aₘₖ·b′ₖ). Assuming independence of the random components, the variance of the random part is simply the sum of the variances of the second and third terms, while the interaction term is handled via a confidence multiplier.

A key technical contribution is the use of an extrema‑variance bound: for any data set, σ² ≤ (max − μ)(μ − min). This bound requires only the maximum, minimum, and mean of each row/column, which can be obtained in a single linear pass, yielding O(n) complexity. In contrast, A‑ABFT needs O(p·n) work to locate the p largest values. Using this bound, V‑ABFT computes an adaptive error coefficient e_max, defined as the maximum observed |E| divided by the absolute checksum magnitude, which captures the worst‑case relative rounding error between the two paths.
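The extrema-variance bound (known in the literature as the Bhatia–Davis inequality) needs only three statistics per row/column, all computable in one linear pass. A small sketch:

```python
import numpy as np

def variance_bound(x: np.ndarray) -> float:
    """One-pass bound: sigma^2 <= (max - mu) * (mu - min).

    Only max, min, and mean are needed, so the cost is O(n) per
    row/column, versus A-ABFT's O(p*n) search for the p largest values.
    """
    mu = x.mean()
    return float((x.max() - mu) * (mu - x.min()))

rng = np.random.default_rng(1)
x = rng.standard_normal(10_000)
assert x.var() <= variance_bound(x)   # the bound always holds
```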

The final threshold is
T_bound = e_max · |C_sum|,
where |C_sum| is the absolute checksum magnitude; a verification difference with |E| > T_bound is flagged as a fault. This form follows directly from the definition of e_max as the maximum observed |E| relative to the checksum magnitude.
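Putting the pieces together, detection amounts to comparing |E| against a threshold scaled by the checksum magnitude. The fixed `e_max` value below is a placeholder for illustration; the paper derives it adaptively from the variance model:

```python
import numpy as np

def abft_check(A, B, C, e_max=1e-6):
    """Return True if a fault is detected in C = A @ B.

    Flags columns where the verification difference exceeds
    T_bound = e_max * |checksum|. e_max=1e-6 is an illustrative
    FP32-level coefficient, not the paper's derived value.
    """
    checksum_path = A.sum(axis=0) @ B
    direct_path = C.sum(axis=0)
    E = checksum_path - direct_path
    T_bound = e_max * np.abs(checksum_path)   # adaptive per-column threshold
    return bool(np.any(np.abs(E) > T_bound))

rng = np.random.default_rng(2)
A = rng.standard_normal((128, 64))
B = rng.standard_normal((64, 96))
C = A @ B
assert not abft_check(A, B, C)   # clean FP64 result passes
C[5, 7] += 1.0                   # inject a silent data corruption
assert abft_check(A, B, C)       # corrupted result is flagged
```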

