RepTFD: Replay Based Transient Fault Detection

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Advances in IC process technology make future chip multiprocessors (CMPs) increasingly vulnerable to transient faults. To detect transient faults, previous core-level schemes provide redundancy for each core separately. As a result, transient faults in the uncore parts, which occupy over 50% of the area of a modern CMP, may escape detection. This paper proposes RepTFD, the first core-level transient fault detection scheme with 100% coverage. Instead of providing redundancy for each core separately, RepTFD provides redundancy for a group of cores as a whole. Specifically, it replays the execution of the checked group of cores on a redundant group of cores. By comparing the execution results of the two groups, all malignant transient faults can be caught. Moreover, RepTFD adopts a novel pending-period-based record-replay approach, which greatly reduces the number of execution orders that must be enforced in the replay-run. As a result, RepTFD incurs only 4.76% performance overhead compared with normal execution without fault tolerance, according to our experiments on the RTL design of an industrial CMP named Godson-3. In addition, RepTFD consumes only about 0.83% of the area of Godson-3 and needs only trivial modifications to its existing components.


💡 Research Summary

The paper addresses the growing vulnerability of modern chip multiprocessors (CMPs) to transient faults, especially in uncore components such as last-level caches, the network-on-chip, and memory controllers, which can occupy more than half of a chip’s area. Existing core-level fault-detection schemes provide redundancy for each core separately, but they miss faults in shared uncore resources: the checked core and its redundant counterpart may read the same corrupted data and produce identical wrong results. To overcome this limitation, the authors propose RepTFD (Replay-Based Transient Fault Detection), which they present as the first core-level scheme with 100% coverage of malignant transient faults.
RepTFD partitions the CMP into two equal groups: a “checked” group and a “redundant” group. The two groups execute the same parallel program but are deliberately kept data-independent: no core in one group accesses data produced by a core in the other group. During the first execution (the first-run) on the checked group, RepTFD records two kinds of information: (1) the result of each instruction (the result-log) and (2) a relaxed start time and end time for each instruction block, taken from a global clock, which together define the block’s “pending period”. If two blocks have non-overlapping pending periods, their physical time order is known, so the system can infer the logical execution order of their conflicting memory accesses without explicitly logging it. Only the orders that cannot be inferred this way go into the determinism-log. Because more than 99% of execution orders can be inferred, the determinism-log is very small, and the replay-run on the redundant group experiences almost no stalls.
The replay-run follows the recorded pending periods, reproduces the execution order of the first-run, and compares each instruction’s result with the corresponding entry in the result-log. A transient fault can affect only one of the two groups, so it shows up as a mismatch between the replayed result and the logged result.
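
The pending-period idea can be illustrated with a minimal Python sketch. It is not the paper's hardware design: the `Block` record, the `classify_orders` and `first_mismatch` helpers, and the timestamps are all hypothetical, and the conflict test is simplified to "the two blocks touch a common address" (a real design would also consider read/write types and per-core program order). The sketch only shows how non-overlapping periods let an order be inferred, while overlapping periods leave an order that must be logged, and how a replayed result stream is checked against the result-log.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Block:
    """One instruction block from the first-run (hypothetical record layout)."""
    name: str
    start: int         # relaxed start time on the global clock
    end: int           # relaxed end time on the global clock
    addrs: frozenset   # addresses touched (simplified conflict test)


def classify_orders(blocks):
    """Split conflicting block pairs into inferred orders and orders to log."""
    inferred, must_log = [], []
    for a, b in combinations(blocks, 2):
        if not (a.addrs & b.addrs):
            continue  # no shared address, so there is no order to enforce
        if a.end < b.start:
            inferred.append((a.name, b.name))   # a finished before b began
        elif b.end < a.start:
            inferred.append((b.name, a.name))   # b finished before a began
        else:
            must_log.append((a.name, b.name))   # periods overlap: order unknown
    return inferred, must_log


def first_mismatch(result_log, replay_results):
    """Return the index of the first differing result, or None if all match.

    Assumes both sequences have the same length.
    """
    for i, (expected, actual) in enumerate(zip(result_log, replay_results)):
        if expected != actual:
            return i
    return None


# Made-up example: A and B do not overlap in time; C overlaps both.
blocks = [
    Block("A", 0, 10, frozenset({0x100})),
    Block("B", 12, 20, frozenset({0x100})),
    Block("C", 8, 15, frozenset({0x100, 0x200})),
    Block("D", 30, 40, frozenset({0x300})),
]
inferred, must_log = classify_orders(blocks)
print("inferred:", inferred)
print("must log:", must_log)
print("mismatch at:", first_mismatch([1, 2, 3], [1, 9, 3]))
```

In this toy input, only the pairs involving block C would need an entry in the determinism-log; the A-before-B order follows from the timestamps alone, and block D conflicts with nothing.
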
The authors implemented RepTFD on the RTL of an industrial 16-core CMP (Godson-3) and evaluated it with SPLASH-2 benchmarks. The measured performance overhead is only 4.76% relative to normal execution without fault tolerance, and the hardware cost is about 0.83% of the chip area, mainly for the additional logging buffers and modest control logic. The approach also eliminates the “input incoherence” problem that affects prior core-level schemes, because the two groups never share the same memory source. Compared with existing deterministic-replay techniques, which often incur more than 18% slowdown due to extensive order logging, RepTFD’s pending-period-based reduction of enforced orders yields the smallest replay slowdown reported to date.
Limitations include the requirement that the application be free of inter-group data dependencies; programs that need such communication cannot directly benefit from RepTFD. Moreover, the replay-run assumes a stable hardware environment (identical clock frequency, memory latency, etc.) to preserve determinism. In summary, RepTFD offers a low-overhead solution that provides full coverage of transient faults across both core and uncore regions, making it a compelling candidate for reliability-critical CMP designs.

