Zenith: Scaling up Ranking Models for Billion-scale Livestreaming Recommendation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Accurately capturing feature interactions is essential in recommender systems, and recent trends show that scaling up model capacity could be a key driver for next-level predictive performance. While prior work has explored various model architectures to capture multi-granularity feature interactions, relatively little attention has been paid to efficient feature handling and scaling model capacity without incurring excessive inference latency. In this paper, we address this by presenting Zenith, a scalable and efficient ranking architecture that learns complex feature interactions with minimal runtime overhead. Zenith is designed to handle a few high-dimensional Prime Tokens with Token Fusion and Token Boost modules, and exhibits superior scaling laws compared to other state-of-the-art ranking methods, thanks to its improved token heterogeneity. Its real-world effectiveness is demonstrated by deploying the architecture to TikTok Live, a leading online livestreaming platform that attracts billions of users globally. Our A/B test shows that Zenith achieves +1.05%/-1.10% in online CTR AUC and Logloss, and realizes +9.93% gains in Quality Watch Session / User and +8.11% in Quality Watch Duration / User.


💡 Research Summary

The paper introduces Zenith, a novel ranking architecture designed to scale up recommender models for billion‑scale livestreaming while keeping inference latency low. The core idea is to compress the massive set of sparse input features into a small number of high‑dimensional “Prime Tokens”. Feature embeddings are first bucketed by semantic groups (e.g., user demographics, interaction sequences) and then projected through multilayer perceptrons into these Prime Tokens, reducing the token count from K raw features to T tokens where T ≪ K.
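The token-compression step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the group sizes, dimensions, and the two-layer ReLU MLP are all illustrative assumptions, since the summary only specifies that grouped feature embeddings are projected into T ≪ K Prime Tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: K=12 raw feature embeddings of dim 16, bucketed
# into T=3 semantic groups, each projected to a d=32-dim Prime Token.
K, d_feat, T, d_tok = 12, 16, 3, 32
features = [rng.standard_normal(d_feat) for _ in range(K)]
groups = [features[0:4], features[4:8], features[8:12]]  # semantic buckets

def mlp(x, w1, w2):
    """Two-layer MLP with ReLU (weights here are random placeholders)."""
    return np.maximum(x @ w1, 0.0) @ w2

prime_tokens = []
for g in groups:
    x = np.concatenate(g)                       # flatten the group's embeddings
    w1 = rng.standard_normal((x.size, 64)) * 0.1
    w2 = rng.standard_normal((64, d_tok)) * 0.1
    prime_tokens.append(mlp(x, w1, w2))

tokens = np.stack(prime_tokens)                 # (T, d_tok): T << K raw features
print(tokens.shape)                             # (3, 32)
```

Downstream Zenith layers then operate on this small (T, d_tok) token matrix rather than on all K raw embeddings, which is where the inference-cost savings come from.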

Each Zenith layer consists of two complementary modules: Token Fusion (TF) and Token Boost (TB). TF captures cross‑token interactions. The baseline Zenith uses Retokenized Self‑Attention (RSA): a standard self‑attention block followed by a retokenization step that reshapes token representations and an auxiliary MLP that generates additional tokens to compensate for the reduced token count. The enhanced version, Zenith++, replaces RSA with Token‑wise Multi‑Head Self‑Attention (TMHSA), applying independent multi‑head attention to each token, thereby preserving token heterogeneity across deep layers.
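A rough sketch of the token-wise attention idea in TMHSA follows. The summary does not give the exact parameterization, so this NumPy version makes one labeled assumption: each query token gets its own projection weights (so per-token parameters stay distinct across depth) while keys and values are shared; head count and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, H = 3, 32, 4            # tokens, model dim, heads (illustrative)
dh = d // H
X = rng.standard_normal((T, d))   # Prime Token matrix from the previous layer

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Assumption: one query projection PER TOKEN preserves token heterogeneity;
# key/value projections are shared across tokens in this sketch.
Wq = rng.standard_normal((T, d, d)) * 0.1
Wk = rng.standard_normal((d, d)) * 0.1
Wv = rng.standard_normal((d, d)) * 0.1

Ks, Vs = X @ Wk, X @ Wv           # shared keys/values, shape (T, d)
out = np.zeros_like(X)
for i in range(T):                # token-wise: each token runs its own MHSA
    q = X[i] @ Wq[i]
    for h in range(H):
        sl = slice(h * dh, (h + 1) * dh)
        attn = softmax(q[sl] @ Ks[:, sl].T / np.sqrt(dh))  # weights over T tokens
        out[i, sl] = attn @ Vs[:, sl]

print(out.shape)                  # (3, 32)
```

Because each token keeps private projection weights, deep stacking does not collapse all tokens toward the same representation, which is the heterogeneity property the paper credits for Zenith++'s smoother scaling.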

TB enriches the representation of each token individually. The basic model applies Token‑wise SwiGLU (TSwiGLU), a gated linear unit that adds non‑linearity per token. Zenith++ upgrades this to Token‑wise Sparse Mixture‑of‑Experts (TSMoE). A lightweight router selects a subset of expert sub‑networks for each token, enabling the model to expand capacity to billions of parameters while keeping the actual compute per request modest.
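The Token Boost pair can be sketched in the same style: a SwiGLU unit as the per-token expert, and a top-k router that mixes a few experts per token. The expert count, top-k value, and weight shapes below are illustrative assumptions; only the SwiGLU form and sparse routing follow the summary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, E, k = 3, 32, 8, 2      # tokens, dim, experts, top-k (illustrative)
X = rng.standard_normal((T, d))

def swiglu(x, Wg, Wu, Wo):
    """SwiGLU: (SiLU(x @ Wg) * (x @ Wu)) @ Wo."""
    g = x @ Wg
    return ((g / (1.0 + np.exp(-g))) * (x @ Wu)) @ Wo

# Each expert is one SwiGLU block; TSMoE routes every token to its
# top-k experts and mixes their outputs by normalized router weight.
experts = [(rng.standard_normal((d, 64)) * 0.1,
            rng.standard_normal((d, 64)) * 0.1,
            rng.standard_normal((64, d)) * 0.1) for _ in range(E)]
Wr = rng.standard_normal((d, E)) * 0.1        # lightweight router

logits = X @ Wr                               # (T, E) router scores
out = np.zeros_like(X)
for t in range(T):
    top = np.argsort(logits[t])[-k:]          # indices of the top-k experts
    w = np.exp(logits[t][top]); w /= w.sum()  # softmax over the selected k
    for wi, e in zip(w, top):
        out[t] += wi * swiglu(X[t], *experts[e])

print(out.shape)                              # (3, 32)
```

Only k of the E experts run per token, so parameter count can grow with E while per-request compute stays roughly constant, matching the capacity-vs-latency trade-off described above.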

The authors conduct extensive scaling‑law experiments, varying model depth, hidden size, and the use of MoE. Results show that Zenith's token‑heterogeneity‑preserving design yields smoother performance gains as depth increases, unlike traditional DLRM or Hiformer where deeper models quickly saturate. With the same FLOP budget, Zenith achieves 1.8–2.5% higher AUC on offline benchmarks (Criteo and internal TikTok Live logs).

A live A/B test on TikTok Live—serving billions of users and handling millions of concurrent requests—demonstrates real‑world impact. Compared with the production baseline, Zenith improves online CTR AUC by +1.05% and reduces Logloss by 1.10%. More importantly, user‑centric quality metrics rise dramatically: Quality Watch Sessions per user increase by +9.93% and Quality Watch Duration per user by +8.11%. Inference latency grows by only ~2.3 ms, staying well within the system's sub‑5 ms SLA.

Implementation details include FP16 quantization, batch‑level scheduling, and GPU‑CPU pipeline parallelism to meet millisecond‑level latency. The paper also discusses practical tokenization rules: ID features receive dedicated tokens, non‑ID features are grouped to balance information load, and each original embedding is kept intact within a token to avoid fragmentation.

In summary, Zenith contributes four key innovations: (1) a Prime‑Token abstraction that dramatically reduces token count while preserving rich feature information; (2) Retokenized or token‑wise multi‑head self‑attention for efficient cross‑token interaction; (3) token‑wise gating or sparse MoE for per‑token capacity expansion; and (4) empirical evidence that maintaining token heterogeneity is essential for effective scaling of deep ranking models. The authors suggest future work on multimodal token integration, dynamic token count adaptation to traffic, and reinforcement‑learning‑driven MoE routing.

