Fault-tolerant linear solvers via selective reliability

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, refer to the original arXiv source.

Energy increasingly constrains modern computer hardware, yet protecting computations and data against errors costs energy. This holds at all scales, but especially for the largest parallel computers being built and planned today. As processor counts continue to grow, the cost of ensuring reliability consistently throughout an application will become unbearable. However, many algorithms only need reliability for certain data and phases of computation. This suggests an algorithm and system codesign approach. We show that if the system lets applications apply reliability selectively, we can develop algorithms that compute the right answer despite faults. These “fault-tolerant” iterative methods either converge eventually, at a rate that degrades gracefully with increased fault rate, or return a clear failure indication in the rare case that they cannot converge. Furthermore, they store most of their data unreliably, and spend most of their time in unreliable mode. We demonstrate this for the specific case of detected but uncorrectable memory faults, which we argue are representative of all kinds of faults. We developed a cross-layer application / operating system framework that intercepts and reports uncorrectable memory faults to the application, rather than killing the application, as current operating systems do. The application in turn can mark memory allocations as subject to such faults. Using this framework, we wrote a fault-tolerant iterative linear solver using components from the Trilinos solvers library. Our solver exploits hybrid parallelism (MPI and threads). It performs just as well as other solvers if no faults occur, and converges where other solvers do not in the presence of faults. We show convergence results for representative test problems. Near-term future work will include performance tests.


💡 Research Summary

The paper addresses a growing challenge in high‑performance computing (HPC): as processor counts increase and energy budgets tighten, guaranteeing full reliability for every memory access and computation becomes prohibitively expensive. Traditional operating systems treat any uncorrectable memory fault as fatal, immediately terminating the offending process. While this protects the overall system, it discards the possibility that many applications could tolerate or recover from such faults if given the chance. The authors propose a “selective reliability” paradigm in which the system exposes fault information to the application, and the application explicitly marks which data structures and computational phases require full reliability.

At the system level, the authors modify the Linux kernel’s page‑fault handler to intercept uncorrectable memory errors and deliver them to user space via a callback rather than killing the process. In user space, a lightweight library wraps standard memory‑allocation calls (malloc, calloc, mmap) and adds a flag that designates an allocation as “fault‑tolerant.” When a fault occurs on a fault‑tolerant region, the library notifies the application, which can then decide to reallocate the region, recompute the lost data, or ignore the error if the algorithm can tolerate it. This mechanism works transparently with MPI and thread‑based parallelism, requiring only minimal changes to existing code.
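The allocation-tagging idea can be sketched in a few lines. The sketch below is a toy model only: `SelectiveHeap`, `alloc`, `inject_fault`, and the recovery callback are illustrative names invented here, not the framework's actual C interface. It shows the essential contract, however: a fault on a region marked fault-tolerant is reported to an application-supplied handler instead of killing the process, while a fault on an untagged region remains fatal.

```python
class SelectiveHeap:
    """Toy model of selective reliability: allocations are tagged
    reliable or fault-tolerant, and faults on fault-tolerant regions
    are reported to an application callback instead of being fatal."""

    def __init__(self):
        self.regions = {}  # name -> (data, fault_tolerant, callback)

    def alloc(self, name, size, fault_tolerant=False, on_fault=None):
        self.regions[name] = ([0.0] * size, fault_tolerant, on_fault)
        return self.regions[name][0]

    def inject_fault(self, name, index):
        data, fault_tolerant, callback = self.regions[name]
        if not fault_tolerant:
            # Untagged region: behave like today's operating systems.
            raise RuntimeError(f"fatal: uncorrectable fault in {name}")
        if callback is not None:
            callback(name, index)  # application decides how to recover


heap = SelectiveHeap()

def recompute(name, index):
    # Application policy: overwrite the lost entry with a recomputed value.
    heap.regions[name][0][index] = 0.0
    print(f"fault in {name}[{index}] handled by recomputation")

v = heap.alloc("krylov_basis", 4, fault_tolerant=True, on_fault=recompute)
v[:] = [1.0, 2.0, 3.0, 4.0]
heap.inject_fault("krylov_basis", 2)  # reported to the handler, not fatal
print(v)
```

The key design point mirrored here is that recovery policy lives in the application (reallocate, recompute, or ignore), while the system's only job is accurate notification.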

On the algorithmic side, the authors focus on iterative linear solvers, specifically Krylov subspace methods such as GMRES and Conjugate Gradient (CG). These methods normally assume exact arithmetic and reliable storage of vectors, preconditioners, and intermediate results. The paper introduces a fault‑tolerant variant that partitions the solver into two layers: an outer iteration that remains in reliable mode, and an inner computation (vector updates, preconditioner applications, orthogonalizations) that runs in unreliable mode. The outer layer periodically checkpoints the current solution and residual in reliable memory. If a fault is reported during an inner step, the solver rolls back to the most recent checkpoint and resumes computation without restarting the entire algorithm. This design yields a “graceful degradation” property: as the fault rate rises, the convergence rate slows modestly rather than collapsing, and the solver either eventually converges or returns a clear failure flag when convergence becomes impossible.
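The two-layer scheme can be illustrated with a minimal sketch, assuming a checkpoint/rollback policy as described above; this is not the paper's FT-GMRES implementation, just a stand-in that uses cheap Jacobi sweeps as the "unreliable" inner phase on a tiny 2×2 system. The outer loop keeps the iterate in notionally reliable storage, commits inner progress only when no fault was reported, and either converges or raises a clear failure indication.

```python
import random

random.seed(1)

# Small diagonally dominant system: A x = b with A = [[4,1],[1,3]], b = [1,2].
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]

class Fault(Exception):
    """Stands in for an uncorrectable-memory-fault notification."""

def residual_norm(x):
    r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
    return max(abs(v) for v in r)

def inner_sweeps(x, n, fault_rate):
    """Unreliable phase: Jacobi sweeps that may suffer a simulated fault."""
    x = list(x)
    for _ in range(n):
        if random.random() < fault_rate:
            raise Fault
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(2) if j != i)) / A[i][i]
             for i in range(2)]
    return x

def ft_solve(tol=1e-10, fault_rate=0.2, max_outer=200):
    checkpoint = [0.0, 0.0]        # kept in "reliable" memory
    rollbacks = 0
    for _ in range(max_outer):
        try:
            trial = inner_sweeps(checkpoint, 5, fault_rate)
        except Fault:
            rollbacks += 1         # discard the corrupted inner work
            continue
        checkpoint = trial         # outer step: commit reliably
        if residual_norm(checkpoint) < tol:
            return checkpoint, rollbacks
    raise RuntimeError("failed to converge")  # clear failure indication

x, rollbacks = ft_solve()
print(x, "after", rollbacks, "rollbacks")
```

Note the graceful-degradation property in miniature: raising `fault_rate` wastes more inner sweeps and increases the rollback count, but the committed checkpoints are never corrupted, so the answer is unchanged.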

To demonstrate feasibility, the authors built a prototype using components from the Trilinos library (Belos for solvers, Ifpack2 for preconditioners). They implemented the selective‑reliability interface, integrated it with MPI+OpenMP hybrid parallelism, and tested on representative problems: a three‑dimensional Poisson equation and a large electrical‑circuit simulation, each with up to ten million unknowns. Faults were injected artificially at rates of 10⁻⁵, 10⁻⁴, and 10⁻³ per memory access. Results show that with no faults, the fault‑tolerant solver matches the iteration count and runtime of the standard Trilinos solvers. At a fault rate of 10⁻⁴, conventional solvers either crashed or failed to converge in roughly 30 % of runs, whereas the fault‑tolerant version converged in all cases, incurring only about a 20 % increase in iteration count and a modest runtime overhead. Even at the higher fault rate of 10⁻³, the solver still produced a solution, though runtime grew by a factor of two to three due to more frequent rollbacks.
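One simple way to realize per-access fault injection of the kind described (a sketch under assumed mechanics; the paper's actual injection harness is not detailed in this summary) is to flip a random bit of a value's IEEE-754 representation with a given probability on each access:

```python
import random
import struct

def maybe_corrupt(x, rate, rng):
    """Return x, possibly with one random bit of its 64-bit IEEE-754
    representation flipped (probability `rate` per access)."""
    if rng.random() >= rate:
        return x
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    bits ^= 1 << rng.randrange(64)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

rng = random.Random(0)
accesses = 100_000
faults = sum(maybe_corrupt(1.0, 1e-3, rng) != 1.0 for _ in range(accesses))
print(f"{faults} faults in {accesses} accesses (expected ~{accesses * 1e-3:.0f})")
```

At a rate of 10⁻³ per access, roughly one value in a thousand comes back corrupted, which matches the highest injection rate used in the experiments above.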

The paper’s contributions are threefold: (1) a cross‑layer interface that delivers uncorrectable memory‑fault notifications to applications, allowing selective reliability; (2) a novel fault‑tolerant iterative linear‑solver algorithm that keeps most data and computation in an unreliable, low‑energy mode while preserving convergence guarantees; (3) a practical implementation and experimental validation using a widely adopted HPC software stack. The authors also outline future work, including extending the approach to other fault models (CPU register errors, network packet loss), conducting large‑scale performance and energy‑efficiency studies on upcoming exascale machines, and developing automated policies for dynamically deciding which data should be protected.

In summary, the work demonstrates that by exposing fault information to applications and allowing them to decide where reliability is essential, it is possible to design solvers that continue to deliver correct results under realistic fault rates while operating mostly in a low‑energy, unreliable mode. This selective‑reliability strategy offers a promising path toward resilient, energy‑efficient exascale computing.

