CLEAR: Cross-Layer Exploration for Architecting Resilience - Combining Hardware and Software Techniques to Tolerate Soft Errors in Processor Cores

Reading time: 6 minute
...

📝 Abstract

We present a first of its kind framework which overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieve desired resilience targets at minimal costs (energy, power, execution time, area) by combining resilience techniques across various layers of the system stack (circuit, logic, architecture, software, algorithm). This is also referred to as cross-layer resilience. In this paper, we focus on radiation-induced soft errors in processor cores. We address both single-event upsets (SEUs) and single-event multiple upsets (SEMUs) in terrestrial environments. Our framework automatically and systematically explores the large space of comprehensive resilience techniques and their combinations across various layers of the system stack (586 cross-layer combinations in this paper), derives cost-effective solutions that achieve resilience targets at minimal costs, and provides guidelines for the design of new resilience techniques. We demonstrate the practicality and effectiveness of our framework using two diverse designs: a simple, in-order processor core and a complex, out-of-order processor core. Our results demonstrate that a carefully optimized combination of circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly cost-effective soft error resilience solution for general-purpose processor cores. For example, a 50x improvement in silent data corruption rate is achieved at only 2.1% energy cost for an out-of-order core (6.1% for an in-order core) with no speed impact. However, selective circuit-level hardening alone, guided by a thorough analysis of the effects of soft errors on application benchmarks, provides a cost-effective soft error resilience solution as well (with ~1% additional energy cost for a 50x improvement in silent data corruption rate).

💡 Analysis

We present a first of its kind framework which overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieve desired resilience targets at minimal costs (energy, power, execution time, area) by combining resilience techniques across various layers of the system stack (circuit, logic, architecture, software, algorithm). This is also referred to as cross-layer resilience. In this paper, we focus on radiation-induced soft errors in processor cores. We address both single-event upsets (SEUs) and single-event multiple upsets (SEMUs) in terrestrial environments. Our framework automatically and systematically explores the large space of comprehensive resilience techniques and their combinations across various layers of the system stack (586 cross-layer combinations in this paper), derives cost-effective solutions that achieve resilience targets at minimal costs, and provides guidelines for the design of new resilience techniques. We demonstrate the practicality and effectiveness of our framework using two diverse designs: a simple, in-order processor core and a complex, out-of-order processor core. Our results demonstrate that a carefully optimized combination of circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly cost-effective soft error resilience solution for general-purpose processor cores. For example, a 50x improvement in silent data corruption rate is achieved at only 2.1% energy cost for an out-of-order core (6.1% for an in-order core) with no speed impact. However, selective circuit-level hardening alone, guided by a thorough analysis of the effects of soft errors on application benchmarks, provides a cost-effective soft error resilience solution as well (with ~1% additional energy cost for a 50x improvement in silent data corruption rate).

📄 Content

CLEAR: Cross-Layer Exploration for Architecting Resilience Combining Hardware and Software Techniques to Tolerate Soft Errors in Processor Cores

Eric Cheng1, Shahrzad Mirkhani1, Lukasz G. Szafaryn2, Chen-Yong Cher3, Hyungmin Cho1, Kevin Skadron2, Mircea R. Stan2, Klas Lilja4, Jacob A. Abraham5, Pradip Bose3, Subhasish Mitra1

1Stanford University, 2University of Virginia, 3IBM Research, 4Robust Chip, Inc., 5University of Texas at Austin

Abstract We present a first of its kind framework which overcomes a major challenge in the design of digital systems that are resilient to reliability failures: achieve desired resilience targets at minimal costs (energy, power, execution time, area) by combining resilience techniques across various layers of the system stack (circuit, logic, architecture, software, algorithm). This is also referred to as cross-layer resilience. In this paper, we focus on radiation-induced soft errors in processor cores. We address both single-event upsets (SEUs) and single-event multiple upsets (SEMUs) in terrestrial environments. Our framework automatically and systematically explores the large space of comprehensive resilience techniques and their combinations across various layers of the system stack (586 cross-layer combinations in this paper), derives cost-effective solutions that achieve resilience targets at minimal costs, and provides guidelines for the design of new resilience techniques. We demonstrate the practicality and effectiveness of our framework using two diverse designs: a simple, in-order processor core and a complex, out-of-order processor core. Our results demonstrate that a carefully optimized combination of circuit-level hardening, logic-level parity checking, and micro-architectural recovery provides a highly cost-effective soft error resilience solution for general-purpose processor cores. For example, a 50× improvement in silent data corruption rate is achieved at only 2.1% energy cost for an out-of-order core (6.1% for an in-order core) with no speed impact. However, selective circuit-level hardening alone, guided by a thorough analysis of the effects of soft errors on application benchmarks, provides a cost-effective soft error resilience solution as well (with ~1% additional energy cost for a 50× improvement in silent data corruption rate).

CCS Concepts • General and reference → Reliability; • Hardware → Fault tolerance; Transient errors and upsets; • Computer systems organization → Reliability

Keywords Cross-layer resilience; soft errors

  1. Introduction This paper addresses the cross-layer resilience challenge for designing robust digital systems: given a set of resilience techniques at various abstraction layers (circuit, logic, architecture, software, algorithm), how does one protect a given design from radiation-induced soft errors using (perhaps) a combination of these techniques, across multiple abstraction layers, such that overall soft error resilience targets are met at minimal costs (energy, power, execution time, area)? Specific soft error resilience targets addressed in this paper are: Silent Data Corruption (SDC), where an error causes the system to output an incorrect result without error indication; and, Detected but Uncorrected Error (DUE), where an error is detected (e.g., by a resilience technique or a system crash or hang) but is not recovered automatically without user intervention. The need for cross-layer resilience, where multiple error resilience techniques from different layers of the system stack cooperate to achieve cost-effective error resilience, is articulated in several publications (e.g., [Borkar 05, Cappello 14, Carter 10, DeHon 10, Gupta 14, Henkel 14, Pedram 12]). There are numerous publications on error resilience techniques, many of which span multiple abstraction layers. These publications mostly

1 Other error sources (voltage noise and circuit-aging) may be incorporated into CLEAR, but are not the focus of this paper. 2 An earlier version ([Cheng 16]) included combinations that were valid, but describe specific implementations. Examples include structural integrity checking [Lu 82] and its derivatives (mostly spanning architecture and software layers) or the combined use of circuit hardening, error detection (e.g., using logic parity checking and residue codes) and instruction-level retry [Ando 03, Meaney 05, Sinharoy 11] (spanning circuit, logic, and architecture layers). Cross-layer resilience implementations in commercial systems are often based on “designer experience” or “historical practice.” There exists no comprehensive framework to systematically address the cross-layer resilience challenge. Creating such a framework is difficult. It must encompass the entire design flow end-to- end, from comprehensive and thorough analysis of various

This content is AI-processed based on ArXiv data.

Start searching

Enter keywords to search articles

↑↓
ESC
⌘K Shortcut