Towards CXL Resilience to CPU Failures


Compute Express Link (CXL) 3.0 and beyond allow the compute nodes of a cluster to share data with hardware cache coherence at the granularity of a cache line. This enables shared-memory semantics for distributed computing, but introduces new resilience challenges: a node failure loses the dirty data in that node's caches, corrupting application state. Unfortunately, the CXL specification does not consider processor failures. Moreover, when a component fails, the specification tries to isolate it and continue application execution; there is no attempt to bring the application to a consistent state. To address these limitations, this paper extends the CXL specification to be resilient to node failures and to correctly recover the application after them. We call the system ReCXL. To tolerate node failures, ReCXL augments the coherence transaction of a write with messages that propagate the update to a small set of other nodes (i.e., replicas), which save the update in a hardware Logging Unit. This replication ensures resilience to node failures. At regular intervals, the Logging Units dump the updates to memory. Recovery uses the logs in the Logging Units to bring the directory and memory to a correct state. Our evaluation shows that ReCXL enables fault-tolerant execution with only a 30% slowdown over the same platform with no fault-tolerance support.


💡 Research Summary

The paper addresses a critical gap in the Compute Express Link (CXL) 3.0+ ecosystem: the lack of support for host‑processor (Compute Node, CN) failures in a distributed shared‑memory (DSM) setting. While CXL already provides extensive reliability, availability, and serviceability (RAS) mechanisms for links, switches, and devices, it assumes that the CPUs and GPUs that drive the system remain operational. In reality, a CN crash instantly loses all dirty cache‑line data, corrupting the global state of any shared‑memory application. Existing CXL RAS therefore isolates the faulty component but does not attempt to restore a consistent application state.

To fill this void, the authors propose ReCXL, an extension to the CXL specification that adds fault tolerance for CN failures at modest performance overhead. The core idea is to augment every remote write transaction with replication messages that forward the update to a small, configurable set of replica CNs (Nᵣ = 2–4). Each replica hosts a dedicated hardware Logging Unit (LU). Upon receiving a REPL (replication) message, the LU creates a log entry containing the requester ID, a logical timestamp, the physical line address, the updated word(s), and a validity flag. The LU immediately acknowledges the requester with REPL_ACK. Only after the requester has collected ACKs from all replicas does it issue a VAL (validation) message carrying the same logical timestamp; receipt of VAL marks the corresponding log entries as committed.
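The REPL → REPL_ACK → VAL handshake can be sketched in software as follows. This is a minimal illustrative model, not the paper's hardware design: the class and field names (`LoggingUnit`, `LogEntry`, `replicated_write`) are assumptions, and real LUs operate on cache-line-granularity coherence messages rather than Python calls.

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    # Fields named in the paper's log-entry description; encodings are illustrative.
    requester_id: int
    timestamp: int        # logical timestamp assigned by the requesting CN
    address: int          # physical cache-line address
    data: bytes           # updated word(s)
    valid: bool = False   # set once the matching VAL arrives (entry committed)

class LoggingUnit:
    """Hypothetical model of a replica CN's hardware Logging Unit."""
    def __init__(self):
        self.log: list[LogEntry] = []

    def on_repl(self, requester_id: int, timestamp: int,
                address: int, data: bytes) -> str:
        # REPL: record the update, then acknowledge immediately.
        self.log.append(LogEntry(requester_id, timestamp, address, data))
        return "REPL_ACK"

    def on_val(self, requester_id: int, timestamp: int) -> None:
        # VAL: mark the matching entry as committed.
        for e in self.log:
            if e.requester_id == requester_id and e.timestamp == timestamp:
                e.valid = True

def replicated_write(replicas, requester_id, timestamp, address, data):
    """Requester side: send REPL to every replica, collect all ACKs,
    and only then issue VAL with the same logical timestamp."""
    acks = [lu.on_repl(requester_id, timestamp, address, data) for lu in replicas]
    assert all(a == "REPL_ACK" for a in acks)
    for lu in replicas:
        lu.on_val(requester_id, timestamp)
```

Keeping entries invalid until VAL arrives is what lets recovery later distinguish committed updates from writes that were still in flight when a CN failed.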

Log entries are first buffered in a small SRAM Log Buffer for low‑latency access and then stored in a larger DRAM‑based log. Periodically, each LU compresses its log and dumps the compressed data to the memory nodes (MNs). The authors assume MNs are highly reliable (e.g., equipped with stronger RAS) and therefore treat them as non‑faulty storage for persisted logs.
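The two-level log described above can be sketched as a small fast buffer spilling into a larger log, with periodic compressed dumps to an MN. This is a software analogy under stated assumptions: the class name `LogStore`, the capacity parameter, and the use of zlib stand in for the paper's unspecified SRAM sizing and hardware compression scheme.

```python
import zlib

class LogStore:
    """Hypothetical two-level log: a small SRAM buffer spills into a larger
    DRAM-backed log, which is periodically compressed and dumped to a memory
    node (MNs are assumed reliable, so the dump counts as persisted)."""
    def __init__(self, sram_capacity: int = 4):
        self.sram_capacity = sram_capacity
        self.sram_buffer: list[bytes] = []   # low-latency, small
        self.dram_log: list[bytes] = []      # larger backing log
        self.mn_storage: list[bytes] = []    # compressed dumps persisted on an MN

    def append(self, entry: bytes) -> None:
        # New entries land in the fast buffer; spill to DRAM when it fills.
        self.sram_buffer.append(entry)
        if len(self.sram_buffer) >= self.sram_capacity:
            self.dram_log.extend(self.sram_buffer)
            self.sram_buffer.clear()

    def dump_to_mn(self) -> None:
        # Periodic dump: drain both levels, compress, and persist on the MN.
        self.dram_log.extend(self.sram_buffer)
        self.sram_buffer.clear()
        blob = zlib.compress(b"".join(self.dram_log))
        self.mn_storage.append(blob)
        self.dram_log.clear()
```

The dump interval trades off log capacity pressure and dump bandwidth against the amount of log that recovery must replay, which is the sensitivity knob the evaluation varies.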

Failure detection builds on the existing CXL link‑layer error detection but adds a “Viral Status” bit per CN in each switch. When a CN stops responding (fail‑stop model), the switch sets this bit. An MSI interrupt is then sent to a surviving CN, which assumes the role of a Configuration Manager. This manager halts the workload, disables the failed CN in the directory, and initiates a distributed recovery algorithm. Recovery proceeds in three steps: (1) read the current directory to identify which cache lines were replicated where; (2) replay the persisted logs from the replicas in timestamp order, updating both the directory and the remote memory; (3) discard any log entries that never received a VAL (i.e., those that were in flight when the failure occurred). The result is a globally consistent memory state, after which normal execution resumes on the remaining healthy nodes.
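Steps (2) and (3) of the recovery algorithm can be sketched as a replay over the persisted logs. This is an illustrative reduction, assuming the log-entry fields from the paper; the directory-update detail (`{"state": "clean"}`) and the function name `recover` are placeholders for hardware state transitions the summary does not specify.

```python
from dataclasses import dataclass

@dataclass
class LogEntry:
    # Mirrors the log-entry fields described earlier (names are illustrative).
    requester_id: int
    timestamp: int
    address: int
    data: bytes
    valid: bool   # True only if the entry's VAL was received (committed)

def recover(failed_cn, logs_from_replicas, directory, memory):
    """Replay the failed CN's committed updates in logical-timestamp order;
    entries that never received a VAL (in flight at failure) are discarded."""
    committed = [e for log in logs_from_replicas for e in log
                 if e.requester_id == failed_cn and e.valid]
    for e in sorted(committed, key=lambda e: e.timestamp):
        memory[e.address] = e.data              # restore the lost dirty data
        directory[e.address] = {"state": "clean"}  # placeholder directory fix-up
    return memory, directory
```

Sorting by the logical timestamp is what makes the replay deterministic: any replica's log, replayed in that order, reconstructs the same final memory state.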

The authors evaluate ReCXL via cycle‑accurate simulations of a 16‑CN, 16‑MN cluster interconnected by a CXL switch. Benchmarks include several HPC kernels (Streamcluster, Barnes, Bodytrack, Raytrace) and a YCSB key‑value store. Across all workloads, the added replication and logging incur an average runtime overhead of about 30% compared with a baseline CXL‑DSM system that provides no fault tolerance. This overhead is dramatically lower than the alternatives of write‑through caching or frequent global persistent flushes, which can degrade performance by factors of 2–5. Sensitivity studies varying Nᵣ and the log‑dump interval demonstrate a smooth trade‑off between resilience, bandwidth consumption, and recovery latency.

Key contributions are: (1) an extension to the CXL transaction layer that introduces REPL, REPL_ACK, and VAL messages; (2) a hardware Logging Unit design that isolates logging from the CPU pipeline; (3) a logical‑timestamp‑based ordering scheme that guarantees deterministic replay; (4) a lightweight failure‑detection mechanism using a per‑CN viral status bit; and (5) a comprehensive evaluation showing that robust CN‑failure tolerance can be achieved with modest performance cost.

Limitations include the assumption that MNs never fail, the use of a static hash‑based replica selection (which may lead to load imbalance), and the focus on crash‑stop (non‑Byzantine) failures. Future work could explore dynamic replica placement, multi‑node simultaneous failures, and integration of MN‑level RAS to eliminate the single‑point‑of‑failure assumption.

In summary, ReCXL demonstrates that hardware‑assisted cache‑line replication combined with periodic log persistence can provide practical, low‑overhead resilience to CPU failures in CXL‑enabled distributed shared‑memory systems, paving the way for more reliable, high‑performance data‑center and HPC deployments.

