We propose a novel Bayesian framework for efficient exploration in contextual multi-task multi-armed bandit settings, where the context is only observed partially and dependencies between reward distributions are induced by latent context variables. In order to exploit these structural dependencies, our approach integrates observations across all tasks and learns a global joint distribution, while still allowing personalised inference for new tasks. In this regard, we identify two key sources of epistemic uncertainty, namely structural uncertainty in the latent reward dependencies across arms and tasks, and user-specific uncertainty due to incomplete context and limited interaction history. To put our method into practice, we represent the joint distribution over tasks and rewards using a particle-based approximation of a log-density Gaussian process. This representation enables flexible, data-driven discovery of both inter-arm and inter-task dependencies without prior assumptions on the latent variables. Empirically, we demonstrate that our method outperforms baselines such as hierarchical model bandits, especially in settings with model misspecification or complex latent heterogeneity.
Multi-armed bandits (MABs) are a foundational framework for sequential decision-making (Thompson 1933), widely used in applications such as personalised recommendation (Zhou et al. 2017) or clinical decision support (Durand et al. 2018; Shrestha and Jain 2021; Aziz, Kaufmann, and Riviere 2021). In these settings, each user represents a new instance of a bandit problem, where the goal is to quickly identify the best arm (e.g., treatment or recommendation) while balancing exploration and exploitation. This becomes especially challenging in heterogeneous populations, where users differ in their optimal choices.
In this work, we consider the multi-task contextual bandit problem (Deshmukh, Dogan, and Scott 2017) with a partially observable task-context. Often, contextual features such as demographics or test results ($x_{\mathrm{obs}}$) are available and can guide personalisation by enabling data sharing across users. However, many important factors, including genetic traits, lifestyle, and long-term adherence, are not directly observed ($x_{\mathrm{unobs}}$). This creates a setting with partially informative context, complicating the decision of whether to trust past experience or gather more user-specific information. This trade-off is particularly critical in clinical applications, where users are patients and the arms correspond to medical treatments. Exploring suboptimal treatments can result in harm or reduced trust, leading to patient dropout or non-adherence. Conversely, over-reliance on incomplete context can cause long-term regret, especially when similar observed features mask underlying heterogeneity.
As a motivating example, consider a system that recommends one of three dietary plans (Figure 1). Each user is defined by an observed feature (age group: young, middle-aged, or old) and an unobserved feature (metabolic type: low or high). We assume that younger users are more likely to have high metabolism, while older users are more likely to have low metabolism. Thus, the observable age group is informative of the unobserved metabolic type. However, for middle-aged users, metabolism may depend on lifestyle factors, and we assume both metabolic types are equally likely for this age group. In Figure 1, we see that Plan 2 is best for users with low metabolism, whereas Plan 3 is preferable for those with high metabolism. Since only the age group is observed, the choice is clear for young and old users, but remains ambiguous for the middle-aged. However, for this group, Plan 1, while always suboptimal, provides valuable information: rewards close to one suggest low metabolism, while rewards near two suggest high metabolism. Thus, assigning Plan 1 to middle-aged users can help identify their metabolic type and, in turn, guide better recommendations between Plans 2 and 3. This example illustrates the importance of context-sensitive exploration strategies that adapt to latent uncertainty. Our method addresses this by modelling the joint reward distribution over users and actions, leveraging shared latent structure to make informed decisions even when some user characteristics are unobserved.
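The informational value of Plan 1 in this example can be illustrated with a single Bayesian update. The following is a minimal sketch and not the method proposed in this paper: it assumes Gaussian reward noise with a hypothetical standard deviation of 0.5, takes the Plan 1 mean rewards to be exactly 1 (low metabolism) and 2 (high metabolism), and uses the equal prior stated for middle-aged users. The names `posterior_metabolism` and `recommend` are illustrative only.

```python
import math

# From the text: Plan 1 rewards are near 1 for low and near 2 for high metabolism.
# Treating these as exact Gaussian means is an assumption of this sketch.
PLAN1_MEAN = {"low": 1.0, "high": 2.0}
NOISE_STD = 0.5  # hypothetical reward noise, not specified in the text


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_metabolism(reward, prior):
    """Posterior over the latent metabolic type after one Plan 1 reward."""
    unnorm = {
        m: prior[m] * gaussian_pdf(reward, PLAN1_MEAN[m], NOISE_STD)
        for m in prior
    }
    total = sum(unnorm.values())
    return {m: w / total for m, w in unnorm.items()}


def recommend(posterior):
    """Plan 2 is best for low metabolism, Plan 3 for high (as in the text)."""
    return "Plan 2" if posterior["low"] > posterior["high"] else "Plan 3"


# Middle-aged users: both metabolic types are equally likely a priori.
middle_aged_prior = {"low": 0.5, "high": 0.5}

post_high = posterior_metabolism(1.9, middle_aged_prior)
post_low = posterior_metabolism(1.1, middle_aged_prior)
post_ambiguous = posterior_metabolism(1.5, middle_aged_prior)

print(post_high, recommend(post_high))
print(post_low, recommend(post_low))
```

Under these assumptions, a single Plan 1 reward near two shifts most of the posterior mass to high metabolism, while a reward exactly halfway between the two means leaves the prior unchanged. With larger assumed noise, more exploratory pulls would be needed before the choice between Plans 2 and 3 becomes clear.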
Related work. Our contribution intersects several areas of contextual bandit research, including multi-task bandits (Wang et al. 2021; Deshmukh, Dogan, and Scott 2017), hierarchical models (Hong et al. 2022), meta-learning in bandits (Bastani, Simchi-Levi, and Zhu 2022; Peleg, Pearl, and Meir 2022; Basu et al. 2021; Cella, Lazaric, and Pontil 2020; Ortega et al. 2019), and mixed-effects modelling for information sharing (Aouali, Kveton, and Katariya 2023; Huch et al. 2024). It also relates to work addressing partial observability (Wang, Wu, and Wang 2016) and cold-start scenarios (Bharadhwaj 2019; Silva et al. 2023). These works address latent variables and information sharing across bandit instances under uncertainty or partial observability. However, most approaches make strong assumptions about the structure of missing variables, the form of context-reward relationships, or how information is shared across tasks. Assumptions such as rigid clustering, sequential task completion, or static recruitment limit their flexibility in dynamic, heterogeneous environments. In contrast, our method relaxes these assumptions by modelling a nonparametric distribution over latent reward functions, conditioned on partially observed context. This enables dynamic task recruitment, concurrent interactions, and robust adaptation without relying on predefined structure. A comprehensive review of related literature is provided in the Appendix.
We summarize the key distinctions between our model and the most relevant prior work consisting of hLin-UCB (Wang, Wu, and Wang 2016), KMTL-UCB (Deshmukh, Dogan, and Scott 2017), GradBand (Kveton et al. 2021a), RobustAgg (Wang et al. 2021), MTTS (Wan, Ge, and Song 2021), HierTS (Hong et al. 2022), and RoME (Huch et al. 2024) in Table 1, which compares six criteria: (i) use of multi-task learning, (ii) s