SMES: Towards Scalable Multi-Task Recommendation via Expert Sparsity

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Industrial recommender systems typically rely on multi-task learning to estimate diverse user feedback signals and aggregate them for ranking. Recent advances in model scaling have shown promising gains in recommendation. However, naively increasing model capacity imposes prohibitive online inference costs and often yields diminishing returns for sparse tasks with skewed label distributions. This mismatch between uniform parameter scaling and heterogeneous task capacity demands poses a fundamental challenge for scalable multi-task recommendation. In this work, we investigate parameter sparsification as a principled scaling paradigm and identify two critical obstacles when applying sparse Mixture-of-Experts (MoE) to multi-task recommendation: exploded expert activation that undermines instance-level sparsity, and expert load skew caused by independent task-wise routing. To address these challenges, we propose SMES, a scalable sparse MoE framework with progressive expert routing. SMES decomposes expert activation into a task-shared expert subset jointly selected across tasks and task-adaptive private experts, explicitly bounding per-instance expert execution while preserving task-specific capacity. In addition, SMES introduces a global multi-gate load-balancing regularizer that stabilizes training by regulating aggregated expert utilization across all tasks. SMES has been deployed in Kuaishou's large-scale short-video services, supporting over 400 million daily active users. Extensive online experiments demonstrate stable improvements, with a GAUC gain of 0.29% and a 0.31% uplift in user watch time.


💡 Research Summary

Industrial recommender systems must predict dozens of user interaction signals (click, like, share, watch time, etc.) simultaneously, which is typically addressed with multi‑task learning (MTL). While scaling model parameters has yielded impressive gains in large language models and recent recommendation work, naïvely increasing capacity in an industrial setting is problematic: online latency budgets are strict, and tasks exhibit highly heterogeneous data regimes—some are data‑rich while others are extremely sparse. Uniform scaling therefore wastes resources on sparse tasks and can even degrade performance, as shown on the public KuaiRand benchmark where “like” and “follow” tasks plateau or decline with larger dense models.

The authors identify two fundamental obstacles when applying sparse Mixture‑of‑Experts (MoE) to MTL: (1) exploded expert activation – independent top‑k routing per task can cause the union of activated experts for a single instance to grow with the number of tasks, breaking the desired instance‑level sparsity; (2) expert load skew – because experts are only updated when selected, a few popular experts receive the majority of traffic while many remain under‑trained, leading to unstable training as the expert pool grows.
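
The first obstacle is easy to see in a toy simulation. The sketch below (illustrative only; expert counts and routing scores are made up, not the paper's configuration) routes each task independently with top-k selection and counts how many distinct experts a single instance ends up activating:

```python
import random

random.seed(0)
NUM_EXPERTS, TOP_K, NUM_TASKS = 32, 2, 12

def route_per_task(num_tasks, num_experts, k):
    """Simulate independent per-task top-k routing for one instance."""
    activated = set()
    for _ in range(num_tasks):
        # Each task scores all experts independently and keeps its top-k.
        scores = [random.random() for _ in range(num_experts)]
        top = sorted(range(num_experts), key=lambda e: -scores[e])[:k]
        activated.update(top)
    return activated

unique = route_per_task(NUM_TASKS, NUM_EXPERTS, TOP_K)
# The union of per-task selections can approach NUM_TASKS * TOP_K = 24
# distinct experts, far above the per-task budget of 2.
print(len(unique))
```

Because the tasks rarely agree, the union of their selections grows roughly linearly with the task count, which is exactly the "exploded expert activation" problem the paper targets.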

To overcome these issues, they propose SMES (Scalable Multi‑task recommendation via Expert Sparsity), a sparse MoE framework with progressive expert routing and a global load‑balancing regularizer. SMES decomposes expert selection into two stages: a task‑shared router first selects a small set of experts that are jointly preferred by all tasks, establishing a common backbone for each instance. Then, each task employs a task‑adaptive sub‑router that can activate a limited number of private experts. By fixing the total number of distinct experts per instance to |shared| + k′, SMES guarantees predictable computation and prevents activation explosion.
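
The two-stage routing can be sketched as follows. This is a minimal illustration under assumed pool sizes and an assumed aggregation rule (summing per-task gate logits over the shared pool); the paper's exact router architecture may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_SHARED, NUM_PRIVATE_PER_TASK = 8, 4   # assumed expert pool sizes
S, K_PRIME, NUM_TASKS = 2, 1, 3           # shared budget s, private budget k'

def progressive_route(shared_gate_logits, private_logits):
    """Stage 1: pick s shared experts from scores aggregated over all tasks.
    Stage 2: each task's sub-router picks k' experts from its private pool."""
    agg = shared_gate_logits.sum(axis=0)              # aggregate task preferences
    shared = set(np.argsort(-agg)[:S].tolist())       # jointly selected subset
    routes = {}
    for t in range(NUM_TASKS):
        private = np.argsort(-private_logits[t])[:K_PRIME]
        routes[t] = shared | {f"task{t}_expert{e}" for e in private.tolist()}
    return shared, routes

gate_logits = rng.normal(size=(NUM_TASKS, NUM_SHARED))
priv_logits = rng.normal(size=(NUM_TASKS, NUM_PRIVATE_PER_TASK))
shared, routes = progressive_route(gate_logits, priv_logits)
# Every task executes exactly s + k' experts for this instance,
# regardless of how many tasks the model serves.
assert all(len(r) == S + K_PRIME for r in routes.values())
```

The key property is that the expert budget no longer depends on the number of tasks: the shared subset is chosen once per instance, and each task adds only a small fixed number of private experts on top.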

Training stability is further enhanced by a global multi‑gate load‑balancing regularizer that operates on the aggregated routing probabilities across all tasks, encouraging uniform expert utilization and mitigating hotspots that arise from multi‑task sparse routing. Unlike prior MoE approaches that balance each gate independently, this regularizer enforces a global fairness constraint, leading to more balanced gradients and faster convergence.
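
One plausible form of such a regularizer, following the standard importance-times-load auxiliary loss but computed on routing probabilities pooled across all task gates rather than per gate, is sketched below (an assumption about SMES's exact formulation):

```python
import numpy as np

def global_load_balance_loss(gate_probs):
    """gate_probs: (num_tasks, batch, num_experts) softmax routing probs.
    Pools all tasks' routing decisions, then penalizes the product of each
    expert's mean probability (importance) and its routed fraction (load)."""
    T, B, E = gate_probs.shape
    pooled = gate_probs.reshape(T * B, E)             # treat all gates jointly
    importance = pooled.mean(axis=0)                  # mean prob per expert
    top1 = np.argmax(pooled, axis=1)
    load = np.bincount(top1, minlength=E) / (T * B)   # fraction routed per expert
    return E * float(np.dot(importance, load))
```

Under perfectly balanced routing (each expert receives an equal share), the loss attains its minimum of 1; concentrating all traffic on one expert drives it up toward E, so minimizing it pushes the aggregated utilization toward uniformity across the whole task set.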

SMES also incorporates deduplicated expert execution: when multiple tasks select the same expert for an instance, the computation is performed once and the result is shared, reducing both FLOPs and activation memory. This optimization is crucial for meeting the sub‑millisecond latency requirements of high‑throughput services.
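
The caching idea itself is simple; a minimal sketch (with a toy stand-in for the expert network, not the production implementation) looks like this:

```python
call_count = {}

def expert_fn(expert_id, x):
    """Toy expert: records how often it runs, returns a scaled input."""
    call_count[expert_id] = call_count.get(expert_id, 0) + 1
    return x * (expert_id + 1)

def run_deduped(task_routes, x):
    """Execute each unique expert at most once per instance and share
    the cached output across all tasks that routed to it."""
    cache = {}
    outputs = {}
    for task, experts in task_routes.items():
        outs = []
        for e in experts:
            if e not in cache:        # first task to request this expert pays
                cache[e] = expert_fn(e, x)
            outs.append(cache[e])     # later tasks reuse the cached result
        outputs[task] = sum(outs)
    return outputs

routes = {"click": [0, 1], "like": [0, 2], "watch": [1, 2]}
out = run_deduped(routes, 1.0)
# Six routing slots across the three tasks, but only three unique experts run.
assert all(c == 1 for c in call_count.values()) and len(call_count) == 3
```

With overlapping task routes the savings compound: the shared-expert subset from the progressive router is by construction selected by every task, so its forward passes are amortized across the entire task set.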

Extensive experiments on public benchmarks and large‑scale internal datasets demonstrate that SMES consistently outperforms dense baselines, standard MMoE, PLE, and naive sparse MoE variants. On KuaiRand, SMES improves AUC/GAUC by 0.2–0.5% across tasks, especially stabilizing performance on sparse signals. In production at Kuaishou’s short‑video platform (over 400 M daily active users), an online A/B test showed a 0.29% lift in GAUC and a 0.31% increase in user watch time, while keeping per‑instance latency increase to only ~1.8 ms—a 60% reduction in computation compared with a dense MoE of comparable capacity.

In summary, SMES offers a principled scaling paradigm for multi‑task recommendation: (1) it aligns model capacity with heterogeneous task demands via shared‑private expert routing; (2) it guarantees instance‑level sparsity and balanced expert usage through a global load‑balancing loss; and (3) it provides practical deployment optimizations that satisfy industrial latency and memory constraints. Future directions include dynamic routing budgets, meta‑learning of expert parameters, and extending the framework to other domains such as advertising and e‑commerce.

