RelayGR: Scaling Long-Sequence Generative Recommendation via Cross-Stage Relay-Race Inference

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Real-time recommender systems execute multi-stage cascades (retrieval, pre-processing, fine-grained ranking) under strict tail-latency SLOs, leaving only tens of milliseconds for ranking. Generative recommendation (GR) models can improve quality by consuming long user-behavior sequences, but in production their online sequence length is tightly capped by the ranking-stage P99 budget. We observe that the majority of GR tokens encode user behaviors that are independent of the item candidates, suggesting an opportunity to pre-infer a user-behavior prefix once and reuse it during ranking rather than recomputing it on the critical path. Realizing this idea at industrial scale is non-trivial: the prefix cache must survive across multiple pipeline stages before the final ranking instance is determined, the user population implies cache footprints far beyond a single device, and indiscriminate pre-inference would overload shared CPU/PCIe/NPU resources under high QPS. We present RelayGR, a production system that enables in-HBM relay-race inference for GR. RelayGR selectively pre-infers long-term user prefixes, keeps their per-layer KV caches resident in HBM over the request lifecycle, and ensures the subsequent ranking stage can consume them without remote fetches. RelayGR combines three techniques: (1) a sequence-aware trigger that admits only at-risk requests under a bounded cache footprint and pre-inference load, (2) an affinity-aware router that co-locates cache production and consumption by routing both the auxiliary pre-infer signal and the ranking request to the same instance, and (3) a memory-aware expander that uses server-local DRAM to capture short-term cross-request reuse while avoiding redundant reloads. We implement RelayGR on Huawei Ascend NPUs and evaluate it with real queries in a production-mirror environment. Under a fixed P99 SLO, RelayGR supports up to 1.5× longer sequences and improves SLO-compliant throughput by up to 3.6×.


💡 Research Summary

Real‑time recommender systems typically consist of a cascade of stages—retrieval, pre‑processing, and a final fine‑grained ranking—each bound by strict tail‑latency Service Level Objectives (SLOs). The ranking stage, which must meet a P99 latency of only a few tens of milliseconds, leaves very little time for heavy computation. Generative Recommendation (GR) models have shown that consuming long user‑behavior sequences can substantially improve recommendation quality, but in production the length of the input sequence is aggressively truncated to stay within the ranking‑stage latency budget.

The authors of this paper make a key observation: the majority of tokens in the early part of a GR model's input sequence encode user behaviors that are independent of the candidate items being ranked. In other words, the “prefix” of the token stream is reusable across many ranking requests that share the same user. If this prefix could be pre‑computed once and cached, the ranking stage could simply retrieve the cached key‑value (KV) states and continue generation from the point where candidate‑specific decoding begins, eliminating redundant computation on the critical path.
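The prefix‑reuse idea can be sketched as a cache keyed by user, where only a miss pays the expensive prefix forward pass. This is an illustrative sketch, not the paper's implementation: `PrefixCache`, `toy_encode_prefix`, and `toy_score` are invented stand‑ins, and real KV states would be per‑layer tensors pinned in HBM rather than Python objects.

```python
# Hypothetical sketch of candidate-independent prefix reuse.

class PrefixCache:
    """Maps (user, prefix length) -> cached prefix state (per-layer KV in reality)."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, user_id, behavior_tokens, encode_prefix):
        # Keying on prefix length invalidates the entry when new behaviors arrive.
        key = (user_id, len(behavior_tokens))
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            # Expensive path: full forward pass over the user-behavior prefix.
            self.store[key] = encode_prefix(behavior_tokens)
        return self.store[key]


def toy_encode_prefix(tokens):
    # Stand-in for computing the per-layer KV states of the prefix.
    return sum(tokens)


def toy_score(prefix_state, candidate):
    # Stand-in for candidate-specific decoding on top of the cached prefix.
    return prefix_state + candidate


def rank(cache, user_id, behavior_tokens, candidates):
    # Only candidate-specific work runs on the critical path; the prefix is reused.
    prefix_state = cache.get_or_compute(user_id, behavior_tokens, toy_encode_prefix)
    return [toy_score(prefix_state, c) for c in candidates]
```

The second ranking request for the same user hits the cache and skips the prefix computation entirely, which is the effect the paper exploits at the KV‑cache level.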

Realizing such a prefix‑reuse scheme at industrial scale is non‑trivial. The cache must survive across multiple pipeline stages, the user population is massive (far exceeding the memory capacity of a single device), and indiscriminate pre‑inference would overload shared CPU, PCIe, and NPU resources under high QPS. To address these challenges, the authors introduce RelayGR, a production‑grade system that enables “in‑HBM relay‑race inference” for GR models. RelayGR consists of three tightly integrated techniques:

  1. Sequence‑aware Trigger – A lightweight admission controller that inspects each incoming request (sequence length, number of candidates, current cache occupancy) and decides whether the request is “at‑risk” of exceeding the latency budget. Only at‑risk requests are allowed to trigger pre‑inference, and the total pre‑inference load is bounded by a configurable fraction of overall QPS.

  2. Affinity‑aware Router – A routing layer that co‑locates the producer of the prefix cache and the consumer that performs the final ranking. By sending both the auxiliary pre‑inference signal and the ranking request to the same NPU instance, the KV cache can remain resident in the instance’s high‑bandwidth memory (HBM) and be accessed without any remote fetch or network hop.

  3. Memory‑aware Expander – A hierarchical cache that leverages server‑local DRAM to capture short‑term cross‑request reuse. When multiple requests on the same server need the same prefix, the DRAM copy is quickly promoted to HBM, avoiding a full re‑inference while keeping HBM usage within limits. The expander uses an LRU policy weighted by the risk score from the trigger, ensuring that rarely used prefixes are evicted in favor of high‑value ones.
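Of the three components, the sequence‑aware trigger is the most self‑contained to sketch. The admission logic below is a hypothetical reconstruction: the linear latency model, its coefficients, and the token‑bucket load bound are assumptions standing in for whatever estimator and limiter the production system actually uses.

```python
class SequenceAwareTrigger:
    """Illustrative admission controller: a request triggers pre-inference only
    if it is predicted to miss the latency budget, the pre-inference load stays
    under a bounded fraction of traffic, and the cache has room. The latency
    model and all coefficients here are assumptions, not the paper's."""

    def __init__(self, budget_ms=35.0, max_fraction=0.12, cache_capacity=1000,
                 base_ms=5.0, per_token_ms=0.01, per_cand_ms=0.02):
        self.budget_ms = budget_ms
        self.max_fraction = max_fraction
        self.cache_capacity = cache_capacity
        self.base_ms = base_ms
        self.per_token_ms = per_token_ms
        self.per_cand_ms = per_cand_ms
        # Token bucket bounding pre-inference to ~max_fraction of requests.
        self.tokens = 1.0

    def admit(self, seq_len, num_candidates, cache_size):
        self.tokens = min(1.0, self.tokens + self.max_fraction)
        est_ms = (self.base_ms + self.per_token_ms * seq_len
                  + self.per_cand_ms * num_candidates)
        at_risk = est_ms > self.budget_ms
        has_room = cache_size < self.cache_capacity
        if at_risk and has_room and self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one pre-inference slot
            return True
        return False
```

A short request is never admitted (it is not at risk), and even back‑to‑back long requests are rate‑limited: after one admission the bucket must refill over roughly `1 / max_fraction` subsequent requests.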

Implementation details: RelayGR is built on Huawei Ascend NPUs, which provide large on‑package HBM2e (up to 32 GB) and expose low‑level APIs for direct KV‑cache placement. The authors modify the PyTorch transformer implementation to expose per‑layer KV tensors and to pin them in HBM across request boundaries. A separate thread‑pool handles pre‑inference, and a runtime monitor throttles this pool so that pre‑inference traffic stays below a pre‑defined fraction of total QPS (≈12 %) and shared CPU/PCIe utilization remains bounded.
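The throttled pre‑inference pool can be approximated with a bounded worker pool that sheds load instead of queuing. This is a minimal sketch under assumed semantics (drop‑on‑overload, a fixed in‑flight limit as a crude proxy for the CPU/PCIe monitor); the class and parameter names are illustrative, not from the paper.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ThrottledPreinferPool:
    """Illustrative bounded pre-inference pool: when more than max_inflight
    jobs are outstanding, new jobs are dropped rather than queued, standing in
    for the runtime monitor's CPU/PCIe-based throttling."""

    def __init__(self, workers=2, max_inflight=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._slots = threading.Semaphore(max_inflight)

    def submit(self, fn, *args):
        # Non-blocking acquire: if no slot is free, shed the job immediately
        # so pre-inference never backs up onto the critical path.
        if not self._slots.acquire(blocking=False):
            return None
        future = self._pool.submit(fn, *args)
        future.add_done_callback(lambda _: self._slots.release())
        return future
```

Dropping rather than queuing matches the system's priorities: a skipped pre‑inference only costs a cache miss later, whereas queuing would let background work accumulate and threaten the P99 budget.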

Evaluation is performed in a production‑mirror environment that reproduces real traffic patterns (64 NPU servers, mixed‑type queries, realistic candidate sets). Under a fixed P99 latency SLO, RelayGR achieves:

  • Up to 1.5× longer effective user sequences under the fixed P99 SLO.
  • 3.6× higher SLO‑compliant throughput, measured as the number of ranking requests completed within the latency budget.
  • Cache hit rates exceeding 85 % for the prefix KV states, resulting in less than 2 ms additional latency for hit cases.
  • Resource overhead limited to < 8 % extra CPU/PCIe usage and < 12 % of total QPS dedicated to pre‑inference.

Ablation studies show that removing any of the three components degrades performance dramatically: without the affinity‑aware router, cache reuse incurs network round‑trips, inflating latency by >30 %; without the memory‑aware expander, DRAM capacity limits cause frequent cache evictions, reducing hit rates by ~40 %; and without the sequence‑aware trigger, uncontrolled pre‑inference overloads the system, causing SLO violations.

Discussion highlights the strengths and limitations of the approach. The prefix‑reuse concept is powerful when user behavior exhibits temporal stability, but it becomes less effective for highly volatile sessions where recent actions dominate. Moreover, HBM capacity remains a hard constraint; RelayGR mitigates this by offloading less‑frequent prefixes to DRAM, but a truly global cache would require distributed coordination, which the current design deliberately avoids to keep latency deterministic.
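The DRAM offloading described above can be sketched as a two‑tier cache in which HBM evictions demote entries to DRAM and DRAM hits promote them back. The eviction rule below (drop the lowest‑risk resident entry) is a simplified stand‑in for the paper's risk‑weighted LRU; all names and capacities are assumptions.

```python
class TwoTierPrefixCache:
    """Illustrative two-tier prefix cache: a small fast tier ("HBM") backed by
    a larger tier ("DRAM"). Evictions demote the lowest-risk entry to DRAM
    instead of discarding it; a DRAM hit promotes the entry back to HBM."""

    def __init__(self, hbm_capacity, dram_capacity):
        self.hbm = {}
        self.dram = {}
        self.hbm_cap = hbm_capacity
        self.dram_cap = dram_capacity
        self.risk = {}  # per-key risk score, supplied by the trigger

    def put(self, key, kv, risk):
        self.risk[key] = risk
        if len(self.hbm) >= self.hbm_cap:
            # Demote the least valuable resident entry rather than recompute later.
            victim = min(self.hbm, key=lambda k: self.risk.get(k, 0.0))
            if len(self.dram) < self.dram_cap:
                self.dram[victim] = self.hbm[victim]
            del self.hbm[victim]
        self.hbm[key] = kv

    def get(self, key):
        if key in self.hbm:
            return self.hbm[key]
        if key in self.dram:
            kv = self.dram.pop(key)
            self.put(key, kv, self.risk.get(key, 0.0))  # promote back to HBM
            return kv
        return None  # miss: caller must re-infer the prefix
```

Promotion from DRAM costs a PCIe copy instead of a full re‑inference, which is exactly the trade the memory‑aware expander makes for short‑term cross‑request reuse.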

Future work could explore partial pre‑inference of candidate‑specific tokens, multi‑modal behavior encoding (e.g., incorporating image or audio signals), and adaptation of the relay‑race paradigm to other accelerator families such as NVIDIA GPUs or AMD CDNA.

In summary, RelayGR demonstrates that by intelligently pre‑computing and caching the user‑behavior prefix, a production‑grade GR system can dramatically extend the usable sequence length and boost throughput while strictly honoring real‑time latency constraints. This work bridges the gap between the high quality of long‑sequence generative models and the stringent performance requirements of large‑scale online recommendation services.

