Enhancing Bandit Algorithms with LLMs for Time-varying User Preferences in Streaming Recommendations
In real-world streaming recommender systems, user preferences evolve dynamically over time. Existing bandit-based methods treat time merely as a timestamp, neglecting its explicit relationship with user preferences and leading to suboptimal performance. Moreover, online learning methods often suffer from an inefficient exploration-exploitation trade-off during the early online phase. To address these issues, we propose HyperBandit+, a novel contextual bandit policy that integrates a time-aware hypernetwork to adapt to time-varying user preferences and employs a large language model-assisted warm-start mechanism (LLM Start) to improve exploration-exploitation efficiency in the early online phase. Specifically, HyperBandit+ leverages a neural network that takes time features as input and generates parameters for estimating time-varying rewards, capturing the correlation between time and user preferences. Additionally, the LLM Start mechanism employs multi-step data augmentation to simulate realistic interaction data for effective offline learning, providing warm-start parameters for the bandit policy in the early online phase. To meet real-time streaming recommendation demands, we adopt low-rank factorization to reduce hypernetwork training complexity. Theoretically, we establish a sublinear regret upper bound that accounts for both the hypernetwork and the LLM warm-start mechanism. Extensive experiments on real-world datasets demonstrate that HyperBandit+ consistently outperforms state-of-the-art baselines in terms of accumulated rewards.
💡 Research Summary
The paper tackles two fundamental challenges in streaming recommender systems: (1) the non‑stationary, periodic evolution of user preferences over time, and (2) the severe data sparsity and inefficient exploration‑exploitation balance during the early online phase of contextual bandit algorithms. To address these issues, the authors propose HyperBandit+, a novel contextual bandit framework that integrates a time‑aware hypernetwork with a large language model (LLM)‑assisted warm‑start mechanism (LLM Start).
Time‑aware hypernetwork
The hypernetwork receives a discrete time‑period identifier p (e.g., one of 35 weekly slots) and maps it to an embedding s_p. This embedding is fed into a neural generator that produces the parameters of the user‑item preference matrix Θ*_p for the current period. By conditioning the bandit’s reward estimator on s_p, the model can dynamically adapt to periodic shifts such as weekday‑morning vs. weekend‑night behaviors without training separate models for each period. To keep the approach feasible for real‑time streaming, the authors apply low‑rank factorization, reducing the output dimension from d_a·d_u to τ(d_a + d_u), where τ≪min(d_a,d_u). This compression dramatically lowers both training and inference latency.
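A minimal NumPy sketch of this low-rank generator follows. The dimensions, the single-hidden-layer generator, and names such as `theta_for_period` are illustrative assumptions, not taken from the paper; the point is that the hypernetwork emits only τ(d_a + d_u) values per period instead of d_a·d_u:

```python
import numpy as np

rng = np.random.default_rng(0)

d_a, d_u, tau = 32, 64, 4        # item dim, user dim, low rank (illustrative)
n_periods, d_s, d_h = 35, 8, 16  # 35 weekly slots; embedding and hidden sizes

# Learnable period embeddings s_p and generator weights (random init here).
S = rng.normal(size=(n_periods, d_s))
W1 = rng.normal(size=(d_s, d_h)) / np.sqrt(d_s)
# Instead of all d_a*d_u entries, the generator emits two low-rank factors:
W_B = rng.normal(size=(d_h, d_a * tau)) / np.sqrt(d_h)
W_C = rng.normal(size=(d_h, d_u * tau)) / np.sqrt(d_h)

def theta_for_period(p: int) -> np.ndarray:
    """Generate the d_a x d_u preference matrix for time period p."""
    h = np.tanh(S[p] @ W1)           # condition on the period embedding s_p
    B = (h @ W_B).reshape(d_a, tau)  # item-side factor
    C = (h @ W_C).reshape(d_u, tau)  # user-side factor
    return B @ C.T                   # rank-tau approximation of Theta*_p

Theta = theta_for_period(3)
print(Theta.shape)  # (32, 64), built from only tau*(d_a + d_u) generator outputs
```

By construction the generated matrix has rank at most τ, which is what keeps per-round inference cheap enough for streaming use.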
LLM‑Start warm‑start
Before any online interaction, the system leverages a pre‑trained LLM (e.g., GPT‑4) to enrich side‑information (user profiles, item attributes) and to generate synthetic interaction logs. The augmentation proceeds in multiple steps: (i) LLM expands sparse metadata into richer textual descriptions; (ii) using these enriched representations, the LLM simulates realistic click/skip outcomes based on its world knowledge and contextual reasoning. The resulting synthetic dataset is used to pre‑train the bandit policy, providing a well‑initialized set of parameters that dramatically improves early‑stage performance.
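The two augmentation steps above can be sketched as a pipeline. Here `call_llm` is a stand-in for a real LLM API call (e.g., to GPT-4), replaced by a deterministic stub so the structure is runnable; all function and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str
    tags: list  # sparse metadata, e.g. ["jazz", "bar"]

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; a real system would query e.g. GPT-4."""
    return f"[LLM output for: {prompt[:40]}...]"

def enrich(item: Item) -> str:
    # Step (i): expand sparse metadata into a richer textual description.
    prompt = f"Describe an item with tags {item.tags} for a recommender profile."
    return call_llm(prompt)

def simulate_feedback(user_profile: str, item_desc: str) -> int:
    # Step (ii): have the LLM judge click (1) vs skip (0) from world knowledge.
    prompt = (f"User: {user_profile}\nItem: {item_desc}\n"
              "Would this user click? Answer yes or no.")
    answer = call_llm(prompt)
    return 1 if "yes" in answer.lower() else 0

def build_synthetic_log(user_profile, items):
    """Produce (user, item, reward) triples for offline warm-start training."""
    return [(user_profile, it.item_id, simulate_feedback(user_profile, enrich(it)))
            for it in items]

log = build_synthetic_log("weekend jazz listener",
                          [Item("i1", ["jazz", "bar"]), Item("i2", ["news"])])
print(log)
```

The resulting triples play the role of real interaction logs when pre-training the bandit's reward estimator before any online traffic arrives.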
Theoretical contribution
The authors derive a regret bound that explicitly accounts for both the hypernetwork approximation error and the warm‑start benefit. The total regret after T rounds satisfies
R_T = O(√(T log T) + τ log T),
showing sublinear growth even when the underlying preference matrix changes periodically. The bound extends classic contextual bandit analyses to the setting where the policy’s parameters are generated by a time‑conditioned hypernetwork.
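As a quick numerical sanity check of sublinearity, the per-round regret implied by the bound, R_T/T, shrinks toward zero as T grows (τ = 4 is an arbitrary illustrative choice; constants hidden by the O(·) are ignored):

```python
import math

tau = 4  # illustrative low-rank parameter
for T in (10**3, 10**5, 10**7):
    bound = math.sqrt(T * math.log(T)) + tau * math.log(T)
    print(f"T={T:>8}: R_T/T <= {bound / T:.4f}")
```

Each tenfold increase in T cuts the average per-round regret roughly by a factor of √10, which is the hallmark of a √(T log T)-type bound.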
Empirical evaluation
Experiments are conducted on two real‑world datasets: (a) Foursquare‑NYC points‑of‑interest data, which exhibits clear weekly and daily periodicities, and (b) a short‑video platform (Kuai) where user genre preferences differ between weekdays and weekends. Baselines include stationary contextual bandits (LinUCB), piecewise‑stationary methods (Sliding‑Window, Change‑Detection), and recent hypernetwork‑based bandits. HyperBandit+ consistently achieves higher cumulative rewards, click‑through rates, and lower early‑stage regret. Notably, when the LLM‑Start component is omitted, the algorithm suffers a 30 % drop in reward during the first week, whereas with LLM‑Start it reaches near‑optimal performance within the first 1,000 interactions.
Limitations and future work
The quality of the synthetic data depends heavily on prompt engineering and the chosen LLM; poor prompts can introduce bias. Moreover, an overly expressive hypernetwork may overfit to limited real interactions, suggesting a need for regularization or meta‑learning strategies. The authors propose future directions such as automated prompt optimization, meta‑learning‑driven hypernetwork architectures, and incorporation of multimodal side‑information (text, images, audio) to further enhance robustness.
In summary, HyperBandit+ introduces a compelling synergy between a time‑conditioned hypernetwork and LLM‑driven data augmentation, delivering both theoretical guarantees and practical gains for streaming recommendation under non‑stationary user behavior and cold‑start constraints.