OneLive: Dynamically Unified Generative Framework for Live-Streaming Recommendation

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Live-streaming recommender systems serve as critical infrastructure bridging real-time interaction patterns between users and authors. Like traditional industrial recommender systems, live-streaming recommendation relies on cascade architectures to support large-scale concurrency. Recent advances in generative recommendation unify the multi-stage recommendation process with Transformer-based architectures, offering improved scalability and higher computational efficiency. However, the inherent complexity of live-streaming prevents the direct transfer of these methods: continuously evolving content, limited lifecycles, strict real-time constraints, and heterogeneous multi-objectives introduce unique challenges that invalidate static tokenization and conventional model frameworks. To address these issues, we propose OneLive, a dynamically unified generative recommendation framework tailored to the live-streaming scenario. OneLive integrates four key components: (i) a Dynamic Tokenizer that continuously encodes evolving real-time live content fused with behavior signals through residual quantization; (ii) a Time-Aware Gated Attention mechanism that explicitly models temporal dynamics for timely decision making; (iii) an efficient decoder-only generative architecture enhanced with Sequential MTP and QK Norm for stable training and accelerated inference; and (iv) a Unified Multi-Objective Alignment Framework that reinforces policy optimization for personalized preferences.


💡 Research Summary

The paper introduces OneLive, a novel generative recommendation framework specifically designed for live‑streaming platforms, where content evolves continuously, the lifespan of streams is short, latency constraints are strict, and multiple business objectives must be balanced. Traditional industrial recommender systems rely on a cascade architecture (retrieval → pre‑ranking → ranking) that suffers from objective misalignment across stages and low model‑to‑hardware utilization. Recent generative recommendation (GR) approaches replace the cascade with a single Transformer‑based model, but they assume static item representations and a fixed candidate pool, which are incompatible with the dynamic nature of live streams.

Key contributions:

  1. Dynamic Tokenizer – The authors propose a two‑stage tokenizer that first extracts multimodal embeddings from 30‑second sliding windows of live video, audio, and text using a distilled multimodal large language model (MLLM). These embeddings are fused with static author attributes via a gated MLP, then aligned in real time with user interaction signals (comments, gifts, clicks) through a dual‑tower architecture. The resulting “IA embedding” captures both content semantics and collaborative signals. Residual K‑means quantization is applied hierarchically to compress the embedding into multi‑level semantic IDs, achieving >99 % code utilization and low collision rates even with large codebooks.

  2. Time‑Aware Gated Attention – To respect the strict temporal dynamics of a live broadcast (initiation → growth → peak → decline → termination), the model injects the current stream phase and remaining exposure time into a gating mechanism that modulates attention scores. Q‑K normalization (QK Norm) is introduced to stabilize the scale of attention logits, preventing training divergence and improving convergence speed.

  3. Sequential Multi‑Token Prediction (MTP) and QK Norm – Instead of predicting the entire token sequence in one pass, OneLive generates tokens sequentially, conditioning each step on previously generated tokens. Combined with beam search, this yields a 2–3× speed‑up in candidate generation without sacrificing quality. QK Norm further enhances training stability and reduces memory consumption, leading to higher GPU utilization.

  4. Unified Multi‑Objective Alignment Framework – The system integrates multiple engagement signals (click, share, follow, gift) into a single reinforcement‑learning‑based reward function. An ensemble ranking model provides the reward, and DPO/GRPO losses are used to fine‑tune the policy network. This enables a single model to simultaneously optimize heterogeneous KPIs, eliminating the objective inconsistency inherent in cascade pipelines.
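The hierarchical residual K-means quantization behind the Dynamic Tokenizer (contribution 1) can be illustrated with a minimal NumPy sketch. This is a simplified reconstruction, not the paper's implementation; the function names, codebook sizes, and plain Lloyd's k-means here are assumptions:

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    # Plain Lloyd's k-means; returns centroids and point-to-centroid assignments.
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = x[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assign

def residual_quantize(emb, levels=3, k=8):
    # Hierarchical residual quantization: each level clusters the residual
    # left over from the previous level, yielding one code per level.
    residual = emb.copy()
    codebooks, codes = [], []
    for _ in range(levels):
        centroids, assign = kmeans(residual, k)
        codebooks.append(centroids)
        codes.append(assign)
        residual = residual - centroids[assign]
    # codes[:, l] is the level-l semantic ID for each embedding.
    return codebooks, np.stack(codes, axis=1), residual
```

Each level quantizes what the previous level left unexplained, so the concatenated per-level codes form a hierarchical semantic ID and the total residual energy shrinks with depth.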
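Contribution 2's pairing of a temporal gate with QK-normalized attention can be sketched in a single-head NumPy form. The gate parameterization and the fixed logit scale below are assumptions for illustration, not details from the paper:

```python
import numpy as np

def qk_norm(x, eps=1e-6):
    # L2-normalize along the feature axis so attention logits stay bounded.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def time_gated_attention(q, k, v, phase_feat, w_gate, b_gate, scale=10.0):
    # With QK Norm, each logit is a cosine similarity in [-1, 1] times `scale`,
    # which prevents the logit blow-up that destabilizes training.
    logits = scale * (qk_norm(q) @ qk_norm(k).T)
    # Temporal gate: a sigmoid computed from per-key temporal features
    # (e.g. stream phase, remaining exposure time) scales each key's weight.
    gate = 1.0 / (1.0 + np.exp(-(phase_feat @ w_gate + b_gate)))  # (num_keys,)
    attn = softmax(logits) * gate
    attn = attn / attn.sum(axis=-1, keepdims=True)  # renormalize rows
    return attn @ v
```

Keys from a stream near termination would receive small gate values, down-weighting them regardless of content similarity.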
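The sequential token generation with beam search described in contribution 3 can be sketched as a level-by-level search over semantic-ID codes. The scoring interface below is hypothetical; in the real system a decoder-only Transformer would supply the per-level logits:

```python
import numpy as np

def beam_search(score_fn, vocab_size, num_levels, beam_width=4):
    # Sequential MTP-style decoding: generate one semantic-ID token per level,
    # conditioning each step on the prefix and keeping the `beam_width`
    # highest-scoring prefixes.
    beams = [((), 0.0)]  # (token prefix, cumulative log-probability)
    for _ in range(num_levels):
        candidates = []
        for prefix, logp in beams:
            logits = score_fn(prefix)                      # (vocab_size,)
            logps = logits - np.log(np.exp(logits).sum())  # log-softmax
            for tok in range(vocab_size):
                candidates.append((prefix + (tok,), logp + logps[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams  # best-first list of (semantic ID, log-probability)
```

Because each step expands only `beam_width` prefixes instead of scoring the full Cartesian product of codes, candidate generation cost grows linearly in the number of levels.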
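For contribution 4, the DPO objective used to align the policy with the ensemble reward can be written in a few lines. This is the standard DPO loss on (chosen, rejected) pairs, sketched with log-probabilities; how items are paired by the reward model is an assumption here:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO: push the policy to prefer reward-model-chosen items over rejected
    # ones, measured relative to a frozen reference policy. The margin is the
    # difference of policy-vs-reference log-ratios for the two items.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(margin)), written via log1p for numerical stability.
    return float(np.mean(np.log1p(np.exp(-margin))))
```

A single pairwise loss like this lets one policy absorb heterogeneous signals (click, share, follow, gift) once the ensemble reward has collapsed them into a preference ordering.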

Experimental validation – Offline benchmarks show that OneLive improves NDCG@10 by over 7 % and CTR by more than 5 % compared with prior GR models, while maintaining inference latency under 30 ms. In large‑scale online A/B tests on Kuaishou’s live‑streaming service, OneLive delivers 8–12 % lifts in core business metrics such as average watch time, gift revenue, and new follower acquisition. The authors attribute these gains to the dynamic tokenizer’s ability to track content drift, the time‑aware attention’s rapid adaptation to phase changes, and the multi‑objective RL alignment’s effective handling of diverse user behaviors.

Conclusion – OneLive represents the first end‑to‑end generative recommendation system that successfully addresses the unique challenges of live‑streaming environments. By unifying dynamic tokenization, temporal attention, efficient sequential decoding, and multi‑objective reinforcement learning, it achieves both high scalability and strong business impact in a production setting. Future work will explore richer multimodal understanding, ultra‑low‑latency inference accelerators, and meta‑learning techniques to further personalize recommendations.

