LASER: An Efficient Target-Aware Segmented Attention Framework for End-to-End Long Sequence Modeling


Modeling ultra-long user behavior sequences is pivotal for capturing evolving and lifelong interests in modern recommendation systems. However, deploying such models in real-time industrial environments faces a strict “Latency Wall”, constrained by two distinct bottlenecks: the high I/O latency of retrieving massive user histories and the quadratic computational complexity of standard attention mechanisms. To break these bottlenecks, we present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). Our approach tackles the challenges through two complementary innovations: (1) System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. By implementing a hybrid DRAM-SSD indexing strategy, SeqVault reduces retrieval latency by 50% and CPU usage by 75%, ensuring millisecond-level access to full real-time and life-cycle user histories. (2) Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead. Motivated by the inherent sparsity of user interests, STA employs a sigmoid-based gating strategy that acts as a silence mechanism to filter out noisy items. Subsequently, a lightweight Global Stacked Target Attention (GSTA) module refines these compressed segments to capture cross-segment dependencies without incurring high computational costs. This design performs effective sequence compression, reducing the complexity of long-sequence modeling while preserving critical signals. Extensive offline evaluations demonstrate that LASER consistently outperforms state-of-the-art baselines. In large-scale online A/B testing serving over 100 million daily active users, LASER achieved a 2.36% lift in ADVV and a 2.08% lift in revenue, demonstrating its scalability and significant commercial impact.


💡 Research Summary

The paper addresses the pressing challenge of modeling ultra‑long user behavior sequences in real‑time recommendation systems, where both data retrieval latency and the quadratic cost of standard self‑attention create a “Latency Wall”. To break this wall, the authors introduce LASER, a full‑stack framework that combines a system‑level service (SeqVault) with a novel attention architecture (Segmented Target Attention, STA, and Global Stacked Target Attention, GSTA).

SeqVault is a unified, schema‑aware serving layer that stores user histories in a hybrid DRAM‑SSD index. A hash table in memory provides fast key look‑ups, while the actual sequences are packed efficiently on SSD using a columnar, type‑aware format. This design eliminates the fragmented short‑term/long‑term “LastN” pipelines of the legacy system, cuts retrieval P99 latency by roughly 50 % and reduces CPU usage by 75 %, enabling millisecond‑level access to histories that can contain thousands of items.
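The paper does not publish SeqVault's implementation; a minimal sketch of the hybrid DRAM-SSD idea described above — an in-memory hash index mapping each user to an (offset, count) pair inside one contiguously packed blob standing in for the SSD-resident column file — might look like the following. All class and method names here are illustrative, not from the paper.

```python
import struct

class SeqVaultSketch:
    """Toy hybrid index: a DRAM hash table maps user_id -> (offset, count)
    into a packed byte blob that stands in for the SSD column file."""

    def __init__(self):
        self.index = {}          # "DRAM": user_id -> (byte offset, item count)
        self.blob = bytearray()  # "SSD": item ids packed as little-endian uint64

    def write_history(self, user_id, item_ids):
        # Append the sequence contiguously so a read is a single range fetch.
        offset = len(self.blob)
        for item in item_ids:
            self.blob += struct.pack("<Q", item)
        self.index[user_id] = (offset, len(item_ids))

    def read_history(self, user_id):
        # One hash lookup in DRAM, then one contiguous "SSD" read.
        offset, count = self.index[user_id]
        raw = self.blob[offset:offset + 8 * count]
        return [struct.unpack_from("<Q", raw, 8 * i)[0] for i in range(count)]
```

The key property the sketch illustrates is that the latency-critical lookup touches memory only once before issuing a single sequential read, which is the access pattern SSDs serve well.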

On the algorithmic side, LASER first splits a user’s sequence H of length L into non‑overlapping segments of fixed width w, yielding L′ = L/w segments. For each segment Sᵢ, a shared query‑key‑value projection computes a target‑dependent attention score. A sigmoid‑based gating function (silence mechanism) multiplies the raw attention weights, suppressing noisy or irrelevant items while preserving signals that are predictive of the current target item t. The gated aggregation collapses each segment into a single token sᵢ, dramatically reducing the effective sequence length. The compressed sequence H′ ∈ ℝ^{L′×d} is then processed by GSTA, a lightweight stacked attention module that models cross‑segment dependencies without incurring the O(L²) cost of full self‑attention. The “compress‑then‑refine” pipeline thus achieves O(L′²) complexity, where L′ is an order of magnitude smaller than L.
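The paper's exact parameterization is not reproduced here; the following numpy sketch only illustrates the compress-then-refine flow described above — segment the history, score each item against the target, apply the sigmoid gate, pool each segment to one token, then run attention over the L′ compressed tokens — under assumed shapes and with untrained random projections. GSTA is simplified to a single attention layer.

```python
import numpy as np

rng = np.random.default_rng(0)
L, w, d = 1024, 32, 16            # sequence length, segment width, embed dim
Lp = L // w                       # L' compressed tokens (here 10x+ shorter than L)

H = rng.normal(size=(L, d))       # user history embeddings (stand-in)
t = rng.normal(size=(d,))         # target item embedding (stand-in)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# --- Segmented Target Attention: collapse each segment to one token ---
segments = H.reshape(Lp, w, d)            # non-overlapping segments S_i
q = t @ Wq                                # target-dependent query
compressed = np.empty((Lp, d))
for i, S in enumerate(segments):
    scores = (S @ Wk) @ q / np.sqrt(d)    # raw target-attention logits, shape (w,)
    gate = sigmoid(scores)                # "silence" gate: near-zero for noisy items
    weights = gate / (gate.sum() + 1e-8)  # gated aggregation weights
    compressed[i] = weights @ (S @ Wv)    # segment collapsed into token s_i

# --- GSTA, sketched as one softmax attention layer over the L' tokens ---
Q, K, V = compressed @ Wq, compressed @ Wk, compressed @ Wv
logits = Q @ K.T / np.sqrt(d)             # (L', L') cost instead of (L, L)
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
refined = attn @ V                        # cross-segment dependencies
```

Note how the quadratic term now lives only in the (L′, L′) logits matrix; with w = 32 that is a 1024x reduction in attention-score entries relative to full self-attention over H.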

For the final representation, the GSTA output is combined with max‑pooled recent features and passed through a multi‑resolution feature‑fusion layer before being fed to RankMixer, a state‑of‑the‑art feature‑interaction network, for CTR prediction. Training uses a binary cross‑entropy loss.
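The fusion and loss step above can be sketched as follows; the fusion rule (mean over refined tokens concatenated with max‑pooled recent features) and the linear head standing in for RankMixer are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
refined = rng.normal(size=(32, d))   # stand-in for GSTA output over L' tokens
recent = rng.normal(size=(50, d))    # stand-in for recent-behavior embeddings

# Illustrative fusion: pooled long-term summary + max-pooled recent features.
# The paper's multi-resolution fusion layer is not reproduced here.
fused = np.concatenate([refined.mean(axis=0), recent.max(axis=0)])  # shape (2d,)

w_out = rng.normal(size=(2 * d,)) / np.sqrt(2 * d)  # toy head in place of RankMixer
logit = fused @ w_out
p_ctr = 1.0 / (1.0 + np.exp(-logit))                # predicted click probability

# Binary cross-entropy against the observed click label, as in training.
y = 1.0
bce = -(y * np.log(p_ctr) + (1 - y) * np.log(1 - p_ctr))
```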

Extensive offline experiments on large‑scale click‑through‑rate datasets show that LASER consistently outperforms strong baselines such as DIN, DIN‑plus, SIM, and recent long‑sequence Transformers in AUC, GAUC, and LogLoss, even at a 10× compression ratio. In a live A/B test covering over 100 million daily active users, LASER delivers a 2.36 % lift in ADVV and a 2.08 % increase in revenue, confirming its commercial impact. System metrics indicate that average retrieval latency drops from ~3 ms to ~1.5 ms and CPU consumption falls below 30 % of the previous pipeline's.

The paper’s contributions are threefold: (1) a production‑grade, schema‑aware long‑sequence service that unifies multi‑scenario features and drastically reduces I/O overhead; (2) a target‑aware segmented attention mechanism that filters noise via sigmoid gating and compresses sequences efficiently; (3) a global stacked attention module that captures long‑range dependencies with minimal extra cost.

Limitations include a lack of sensitivity analysis for segment size w and gating hyper‑parameters, and an incomplete comparison of GSTA’s FLOPs and memory footprint against other efficient attention variants (e.g., Linformer, Performer). The authors also do not discuss the risk that aggressive gating might discard subtle long‑term patterns.

Overall, LASER demonstrates that a tightly coupled system‑algorithm co‑design can overcome the latency barrier in industrial recommender systems, offering a scalable path toward end‑to‑end modeling of lifelong user behavior. Future work could explore dynamic segment sizing, alternative gating functions, and deeper quantitative studies of GSTA’s efficiency.

