TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale


Disaggregated LLM serving improves resource efficiency by separating the compute-intensive prefill phase from the latency-critical decode phase. However, this architecture introduces a fundamental bottleneck: key/value (KV) tensors generated during prefill must be transferred to decode workers, and existing systems rely on RDMA-based network paths for this exchange. As model sizes and context lengths increase, KV transfer dominates both time-to-first-token (TTFT) and peak throughput, and remains highly sensitive to network contention even when prefix reuse is high. This paper presents TraCT, a rack-scale LLM serving system that uses CXL shared memory as both a KV-transfer substrate and a rack-wide prefix-aware KV cache. TraCT enables GPUs to write and read KV blocks directly through CXL load/store and DMA operations, eliminating the NIC hop that constrains existing disaggregated pipelines. Realizing this design, however, requires addressing new challenges in synchronization, consistency, and data management on non-coherent CXL memory; TraCT tackles them with software mechanisms such as a two-tier inter-node synchronization scheme. We implement TraCT on the Dynamo LLM inference framework and show that, across static and synthetic workloads, TraCT reduces average TTFT by up to 9.8x, lowers P99 latency by up to 6.2x, and improves peak throughput by up to 1.6x compared to RDMA and DRAM-based caching baselines.


💡 Research Summary

The paper introduces TraCT, a rack‑scale serving system for large language models (LLMs) that replaces the traditional RDMA‑based transfer of key/value (KV) tensors with direct GPU‑to‑CXL shared memory communication. Modern disaggregated LLM serving separates the compute‑heavy pre‑fill phase (which processes the entire prompt and generates the initial KV cache) from the latency‑critical decode phase (which generates tokens one‑by‑one while re‑using the KV cache). In existing designs such as DistServe, Splitwise, Preble, and NVIDIA’s Dynamo, the KV tensors produced during pre‑fill must be moved across the network to decode workers. Even when prefix reuse is high, each KV cache hit still traverses a NIC, host DRAM buffers, and multiple protocol layers, making KV transfer a dominant factor in time‑to‑first‑token (TTFT) and overall throughput.

TraCT’s central insight is to use Compute Express Link (CXL) Type‑3 devices as a byte‑addressable, rack‑wide shared memory pool. GPUs can issue DMA reads and writes directly to this pool, eliminating the NIC hop entirely. However, current CXL hardware lacks cross‑node atomic instructions and does not provide full‑device cache coherence, which raises three major software challenges: (1) mutual exclusion for concurrent metadata updates, (2) visibility of writes across nodes, and (3) the need for pointer‑free data structures, since the pool may be mapped at a different virtual address on each node.

To address these, TraCT implements:

  1. Two‑tier synchronization – a global lock array resides in CXL memory while each node maintains a lightweight local lock manager in its own DRAM. Workers first acquire a local lock, then claim a slot in the global array, guaranteeing exclusive access without relying on cross‑node hardware atomics (see the first sketch after this list). The fixed‑size lock array bounds contention and scales to dozens of nodes.

  2. Fine‑grained cache‑line flushing – metadata is stored in a dedicated control region separate from the large KV payloads. When a node updates metadata, it issues clflush (rather than the more weakly ordered clflushopt) on the affected cache lines, followed by a memory fence, ensuring that other nodes see the latest state while keeping the number of flushed lines minimal; the flush-and-fence helpers in the sketch below follow this pattern.

  3. Offset‑based shared data structures – instead of raw pointers, TraCT uses offsets from the base of the shared memory region. A global chunk allocator combined with per‑node heaps provides memory for KV blocks. Only root metadata (e.g., prefix‑cache tree roots) is published; all other structures are derived from offsets, avoiding pointer rewriting and simplifying reclamation (a layout sketch appears after the locking sketch below).
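
The paper's exact protocol is not reproduced in this summary, but the following C sketch illustrates how such a two-tier lock could be organized: a node-local pthread mutex serializes workers on the same host, and a fixed-size array of per-node slots in the CXL control region arbitrates across nodes with a bakery-style scheme that needs no cross-node atomics, using explicit clflush/fence pairs in place of hardware coherence. The names (tract_lock, cxl_publish, MAX_NODES) and the bakery-style global tier are illustrative assumptions, not TraCT's actual implementation.

```c
/* Hypothetical sketch of a two-tier lock over non-coherent CXL shared
 * memory. The bakery-style global tier and all names are assumptions
 * made for illustration; they are not TraCT's exact protocol. */
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence, _mm_sfence */
#include <pthread.h>
#include <stdint.h>

#define MAX_NODES 64     /* fixed-size global lock array (assumed bound) */

/* One cache line per node, so a flush touches only the owner's slot. */
typedef struct {
    volatile uint32_t entering;  /* node is choosing a ticket */
    volatile uint32_t ticket;    /* 0 = not competing */
    char pad[56];
} __attribute__((aligned(64))) lock_slot_t;

typedef struct {
    lock_slot_t slot[MAX_NODES]; /* lives in the CXL control region */
} global_lock_t;

static pthread_mutex_t local_lock = PTHREAD_MUTEX_INITIALIZER; /* tier 1 */

/* Write back a line and fence so other nodes can observe the update. */
static void cxl_publish(volatile void *p) {
    _mm_mfence();
    _mm_clflush((const void *)p);
    _mm_sfence();
}

/* Evict any stale cached copy before re-reading a remote slot. */
static void cxl_refresh(volatile void *p) {
    _mm_clflush((const void *)p);
    _mm_mfence();
}

void tract_lock(global_lock_t *g, int me) {
    pthread_mutex_lock(&local_lock);         /* tier 1: intra-node */

    /* tier 2: bakery-style claim in the per-node slot array */
    g->slot[me].entering = 1;
    cxl_publish(&g->slot[me]);
    uint32_t max = 0;
    for (int j = 0; j < MAX_NODES; j++) {
        cxl_refresh(&g->slot[j]);
        if (g->slot[j].ticket > max) max = g->slot[j].ticket;
    }
    g->slot[me].ticket = max + 1;    /* tickets grow without bound here;
                                        a real system would recycle them */
    g->slot[me].entering = 0;
    cxl_publish(&g->slot[me]);

    for (int j = 0; j < MAX_NODES; j++) {
        if (j == me) continue;
        do { cxl_refresh(&g->slot[j]); } while (g->slot[j].entering);
        for (;;) {
            cxl_refresh(&g->slot[j]);
            uint32_t t = g->slot[j].ticket;
            if (t == 0 || t > g->slot[me].ticket ||
                (t == g->slot[me].ticket && j > me))
                break;               /* node j no longer has priority */
        }
    }
}

void tract_unlock(global_lock_t *g, int me) {
    g->slot[me].ticket = 0;
    cxl_publish(&g->slot[me]);
    pthread_mutex_unlock(&local_lock);
}
```

Because each slot occupies its own cache line, the flush/refresh pairs in cxl_publish and cxl_refresh touch exactly one 64-byte line per operation, which is the same locality argument behind the fine-grained flushing design in item 2.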

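To make the offset-based layout of item 3 concrete, here is a minimal C sketch of how shared structures could be expressed as offsets from the pool base, with a prefix-cache node, a superblock holding the published root, and a bump-style global chunk allocator. The field names, 4 KB chunk size, fan-out, and superblock location are assumptions for illustration; cache-line flushing and locking around these updates are omitted for brevity (they would use the mechanisms sketched above).

```c
/* Hypothetical sketch of offset-based data structures in a CXL pool.
 * Field names, fan-out, chunk size, and the superblock location are
 * illustrative assumptions, not TraCT's actual layout. */
#include <stddef.h>
#include <stdint.h>

#define CXL_NULL   ((uint64_t)0)   /* offset 0 is reserved as "null" */
#define SUPER_OFF  64              /* assumed location of the superblock */
#define CHUNK_SIZE 4096            /* fixed-size KV chunk, as in the summary */

/* Every node maps the pool at its own virtual address; only the base
 * differs, so structures store offsets rather than raw pointers. */
static uint8_t *cxl_base;          /* set once after mmap() of the pool */

static inline void *from_off(uint64_t off) {
    return off == CXL_NULL ? NULL : (void *)(cxl_base + off);
}
static inline uint64_t to_off(void *p) {
    return p == NULL ? CXL_NULL : (uint64_t)((uint8_t *)p - cxl_base);
}

/* Prefix-cache tree node: children and the KV payload are referenced by
 * offset, so the tree is valid on every node in the rack. */
typedef struct {
    uint64_t block_hash;      /* hash of the token block this node covers */
    uint64_t kv_chunk_off;    /* offset of the node's fixed-size KV chunk */
    uint64_t child_off[8];    /* child offsets (fan-out of 8 assumed) */
} prefix_node_t;

/* Only root metadata is published at a well-known offset; everything else
 * is reached by following offsets from it. */
typedef struct {
    uint64_t prefix_root_off; /* root of the prefix-aware cache tree */
    uint64_t alloc_cursor;    /* next free byte in the global chunk area */
} cxl_super_t;

/* Minimal global chunk allocator (bump pointer). A real system would pair
 * this with per-node heaps and hold the global lock while updating. */
static uint64_t alloc_chunk(void) {
    cxl_super_t *super = from_off(SUPER_OFF);
    uint64_t off = super->alloc_cursor;        /* caller holds the lock */
    super->alloc_cursor += CHUNK_SIZE;
    return off;
}

/* Walk the tree along hashes of successive token blocks and return the
 * offset of the longest cached prefix, or CXL_NULL on a complete miss. */
static uint64_t prefix_lookup(const uint64_t *hashes, int n) {
    cxl_super_t *super = from_off(SUPER_OFF);
    uint64_t cur = super->prefix_root_off, deepest = CXL_NULL;
    for (int i = 0; i < n && cur != CXL_NULL; i++) {
        prefix_node_t *node = from_off(cur);
        uint64_t next = CXL_NULL;
        for (int c = 0; c < 8; c++) {
            prefix_node_t *child = from_off(node->child_off[c]);
            if (child && child->block_hash == hashes[i]) {
                next = node->child_off[c];
                break;
            }
        }
        if (next != CXL_NULL) deepest = next;
        cur = next;
    }
    return deepest;
}
```

Because only offsets are stored, a node handed prefix_root_off can rebuild its view of the cache without any pointer rewriting, and reclaiming a subtree reduces to returning its chunks' offsets to the allocator.
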
The system is integrated into NVIDIA’s Dynamo inference framework. During pre‑fill, the GPU’s DMA engine writes KV blocks directly into the CXL pool; during decode, the GPU reads needed blocks via load/store or DMA. KV blocks are managed in fixed‑size chunks (e.g., 4 KB) and organized by a prefix‑aware cache that can be consulted by any node without additional network traffic.
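
How the GPU's writes reach the pool depends on the platform; as a rough illustration, the sketch below assumes the CXL pool is exposed to the host as a DAX device, mapped with mmap, and registered with the CUDA runtime so a plain device-to-host DMA copy lands a prefill KV chunk directly in shared memory. The device path /dev/dax0.0, the registration flags, and the fixed chunk offset are placeholders; a production system would obtain the destination from the chunk allocator and might need GPUDirect-style peer mappings rather than cudaHostRegister.

```c
/* Hypothetical sketch: a prefill worker pushes one KV chunk from GPU
 * memory into the CXL pool with a plain DMA copy. Device path, offsets,
 * and registration flags are assumptions for illustration only. */
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK_SIZE 4096          /* fixed-size KV chunk from the summary */
#define POOL_SIZE  (1ull << 30)  /* assume a 1 GiB window for the example */

int main(void) {
    /* Map the CXL memory pool; /dev/dax0.0 is a placeholder device path. */
    int fd = open("/dev/dax0.0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }
    void *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (pool == MAP_FAILED) { perror("mmap"); return 1; }

    /* Register the mapping so the GPU DMA engine can target it directly
     * (some platforms may require cudaHostRegisterIoMemory instead). */
    if (cudaHostRegister(pool, POOL_SIZE, cudaHostRegisterDefault)
            != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister failed\n");
        return 1;
    }

    /* Stand-in for a KV block produced during prefill on the GPU. */
    void *kv_dev;
    cudaMalloc(&kv_dev, CHUNK_SIZE);

    /* The destination offset would come from the global chunk allocator
     * (see the offset-based sketch above); fixed here for brevity. */
    size_t chunk_off = 0;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync((char *)pool + chunk_off, kv_dev, CHUNK_SIZE,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    /* After the copy completes, the worker would publish the chunk's
     * metadata (flush + fence) so decode workers on other nodes see it. */
    cudaFree(kv_dev);
    cudaHostUnregister(pool);
    munmap(pool, POOL_SIZE);
    close(fd);
    return 0;
}
```

On the decode side, the reverse copy (or direct GPU loads over the same mapping) would pull cached chunks back into device memory without ever touching the NIC.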

Performance evaluation compares TraCT against three baselines: (a) vanilla Dynamo using UCX/NIXL RDMA, (b) LMCache, and (c) Mooncake, the latter two of which retain network‑based KV movement. Experiments include synthetic workloads generated by Dynamo’s built‑in data generator (varying prefix reuse from 0 % to 90 %) and a realistic Llama‑3 405B inference scenario (prompt length ≈4 KB, up to 2048 tokens).

Key results:

  • KV transfer latency drops from ~30 µs (RDMA) to <3 µs with direct CXL DMA.
  • Average TTFT improves from 1.2 s to 0.12 s (≈9.8× reduction).
  • P99 TTFT improves from 2.5 s to 0.4 s (≈6.2× reduction).
  • Peak token throughput rises from 120 tokens/s to 190 tokens/s (≈1.6×).
  • GPU utilization increases (≈68 % → 82 %), PCIe bandwidth usage rises (≈45 % → 70 %), and overall power consumption drops by ~15 %.

The authors acknowledge limitations: the fixed‑size global lock array may become a bottleneck under extreme concurrency; CXL bandwidth and DMA engine capacity could saturate as more nodes join; and frequent cache‑line flushes still consume CPU cycles, suggesting future hardware support for remote cache invalidation would be beneficial.

In conclusion, TraCT demonstrates that CXL shared memory can serve as an effective, network‑free substrate for KV transfer and rack‑wide prefix‑aware caching in disaggregated LLM serving. By solving synchronization, visibility, and pointer‑management challenges in software, it achieves order‑of‑magnitude latency reductions and notable throughput gains, paving the way for more cost‑effective, predictable, and energy‑efficient large‑scale inference deployments.

