Fast Online Learning with Gaussian Prior-Driven Hierarchical Unimodal Thompson Sampling
We study a class of Multi-Armed Bandit (MAB) problems in which arms with Gaussian reward feedback are clustered. Such an arm structure arises in many real-world problems, for example mmWave communications and portfolio management with risky assets, owing to the universality of the Gaussian distribution. Building on the Thompson Sampling with Gaussian prior (TSG) algorithm for selecting the optimal arm, we propose Thompson Sampling with Clustered arms under Gaussian prior (TSCG), tailored to a 2-level hierarchical structure. We prove that by exploiting the 2-level structure, TSCG achieves a lower regret bound than ordinary TSG. In addition, when the reward is unimodal, our Unimodal Thompson Sampling with Clustered Arms under Gaussian prior (UTSCG) attains an even lower regret bound. Each proposed algorithm is accompanied by a theoretical upper bound on its regret, and our numerical experiments confirm the advantage of both algorithms.
💡 Research Summary
The paper tackles a class of stochastic multi‑armed bandit (MAB) problems where the arms generate Gaussian‑distributed rewards and are naturally grouped into clusters. Such a setting appears in many practical domains, for example in millimeter‑wave (mmWave) beam selection, where beams pointing in similar directions exhibit comparable channel statistics, or in portfolio management, where assets belonging to the same industry share risk‑return characteristics. The authors build on the classic Thompson Sampling with Gaussian prior (TSG), which maintains a Gaussian posterior for each arm and draws a sample from it at every round to decide which arm to pull. While TSG is optimal for independent arms, it ignores the latent hierarchical structure that can be exploited to reduce unnecessary exploration.
Hierarchical Model.
The authors introduce a two‑level Bayesian hierarchy. At the top level there are K clusters, each endowed with a cluster‑level mean θc drawn from a Gaussian prior N(μ0,τ0²). At the second level, each arm i belonging to cluster c has its own mean μi = θc + εi, where εi ∼ N(0,σc²) captures intra‑cluster variability. The observed reward for arm i at time t is then r_{i,t} ∼ N(μi,σ²). Because the Gaussian prior is conjugate to the Gaussian likelihood, posterior updates for both the cluster means and the arm‑specific means remain Gaussian and can be computed in closed form.
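Because both levels are Gaussian-conjugate, the posterior update at each level reduces to a precision-weighted average. A minimal sketch of that closed-form update (generic Gaussian-Gaussian conjugacy, not the paper's own code; variable names are illustrative):

```python
import numpy as np

def posterior_update(mu_prior, tau2_prior, rewards, sigma2):
    """Closed-form Gaussian-Gaussian conjugate update.

    Combines a N(mu_prior, tau2_prior) prior on a mean parameter with
    i.i.d. Gaussian observations of known variance sigma2, returning the
    posterior mean and variance.
    """
    n = len(rewards)
    if n == 0:
        return mu_prior, tau2_prior
    # Precisions (inverse variances) add under conjugacy.
    post_precision = 1.0 / tau2_prior + n / sigma2
    tau2_post = 1.0 / post_precision
    # Posterior mean is the precision-weighted average of prior and data.
    mu_post = tau2_post * (mu_prior / tau2_prior + np.sum(rewards) / sigma2)
    return mu_post, tau2_post
```

The same update applies to the cluster-level mean θc (pooling rewards from all arms in the cluster) and to each arm-level mean μi (using only that arm's rewards).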
Thompson Sampling with Clustered Arms (TSCG).
At each round the algorithm proceeds as follows: (1) sample a provisional cluster mean \tilde{θ}_c from the current posterior of each cluster; (2) for every arm i in cluster c, draw a provisional arm mean \tilde{μ}_i ∼ N(\tilde{θ}_c,σc²); (3) select the arm with the largest \tilde{μ}_i; (4) observe the reward and update both the cluster‑level and arm‑level posteriors. This procedure automatically shares information across arms that belong to the same cluster: a high reward observed from one arm pulls the posterior of its cluster upward, thereby increasing the sampling probability of its siblings.
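Steps (1)–(3) above can be sketched as a single round of hierarchical sampling. This is an illustrative reconstruction from the description, not the authors' implementation; the `clusters` data layout is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def tscg_round(clusters, sigma_c2):
    """One round of TSCG's arm selection (steps 1-3), as a sketch.

    `clusters` maps a cluster id to a dict holding its posterior
    ('mu', 'tau2') and the list of its arm ids.
    """
    best_arm, best_sample = None, -np.inf
    for c, state in clusters.items():
        # (1) sample a provisional cluster mean from the cluster posterior
        theta_c = rng.normal(state["mu"], np.sqrt(state["tau2"]))
        for arm in state["arms"]:
            # (2) sample a provisional arm mean around the cluster sample
            mu_i = rng.normal(theta_c, np.sqrt(sigma_c2))
            # (3) keep the arm with the largest sampled mean
            if mu_i > best_sample:
                best_arm, best_sample = arm, mu_i
    return best_arm
```

Step (4), the posterior update, would then feed the observed reward back into both the pulled arm's and its cluster's Gaussian posteriors.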
Regret analysis for TSCG.
Under standard assumptions—minimum gap Δ_min between any two cluster means, bounded intra‑cluster variance, and known observation noise variance—the authors prove an upper bound on the cumulative regret:
R_TSCG(T) ≤ C₁ K√(T log T) + C₂ Σ_{c=1}^K √(n_c T) ,
where n_c is the number of arms in cluster c. The first term reflects the cost of exploring the K clusters, while the second term accounts for the residual exploration within each cluster. Compared with the classic TSG bound O(N√(T log T)), the dependence on the total number of arms N is replaced by a much milder dependence on the per‑cluster arm counts. When N≫K, this yields a substantial theoretical improvement.
Unimodal Extension (UTSCG).
If the reward landscape is unimodal across clusters—i.e., there exists a single peak cluster and the expected reward decreases monotonically as one moves away from it—the authors further refine the algorithm. They construct an adjacency graph over the clusters (e.g., based on spatial proximity in mmWave or sector similarity in finance) and restrict exploration to neighboring clusters of the current best estimate. The Unimodal Thompson Sampling with Clustered Arms (UTSCG) therefore performs a “hill‑climbing” search: after drawing samples as in TSCG, it only moves to a neighboring cluster if its sampled mean exceeds that of the current cluster.
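The hill-climbing move rule can be sketched as follows; the adjacency-graph representation is an assumption for illustration:

```python
def utscg_move(current, samples, neighbors):
    """UTSCG's cluster move: hill-climbing on the cluster graph (sketch).

    `samples` maps cluster id -> sampled cluster mean for this round;
    `neighbors` maps cluster id -> list of adjacent cluster ids.
    Moves only to a neighbor whose sample beats the incumbent's.
    """
    best = current
    for c in neighbors[current]:
        if samples[c] > samples[best]:
            best = c
    return best
```

Restricting comparisons to the current cluster's neighborhood is what confines exploration in a unimodal landscape: a wrong-direction sample can only move the search one hop, and unimodality guarantees that repeated local improvements reach the peak cluster.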
The regret bound for UTSCG becomes
R_UTSCG(T) ≤ C₃ √(K T)·log(Δ_min^{-1}) + C₄ Σ_c √(n_c T) ,
which replaces the √(K log T) factor of TSCG by √K·log(Δ_min^{-1}). The logarithmic term reflects the limited number of “wrong‑direction” moves that can occur in a unimodal landscape. Consequently, when the number of clusters is modest and the unimodal gap is not too small, UTSCG attains near‑optimal √K scaling.
Proof sketch.
The analysis leverages the sub‑Gaussian nature of the Gaussian posterior and standard concentration inequalities to bound the probability that a sub‑optimal cluster is sampled more often than its optimal counterpart. For the hierarchical case, the authors decompose regret into a cluster‑level component and an intra‑cluster component, each handled by separate concentration arguments. The unimodal case builds on existing results for unimodal bandits (e.g., UCB‑L, IMED‑U) and shows that the Bayesian sampling step does not increase the number of “bad” moves beyond a logarithmic factor.
Empirical evaluation.
Two realistic testbeds are used: (i) mmWave beam selection with 200 beams partitioned into 10 directional clusters, where the reward is the received signal‑to‑noise ratio corrupted by Gaussian noise; (ii) portfolio allocation with 300 risky assets grouped into 15 industry sectors, where daily returns are modeled as Gaussian. Baselines include vanilla TSG, hierarchical UCB (UCB‑L), unimodal IMED (IMED‑U), and a neural‑network‑based Thompson Sampling variant. Results show that TSCG reduces cumulative regret by 30–45 % relative to TSG, while UTSCG achieves an additional 20–35 % reduction over TSCG. The gains are most pronounced when intra‑cluster variance is low and inter‑cluster gaps are large. Computational overhead is modest: the per‑round cost grows linearly with K (cluster sampling) plus a small linear term for arm‑level sampling, leading to only a 5–10 % increase in runtime compared with TSG.
Limitations and future work.
A key assumption is that the clustering structure is known a priori. In many applications, clusters must be inferred online; integrating Bayesian non‑parametric methods (e.g., Dirichlet‑process mixtures) or online clustering algorithms is a natural extension. Moreover, the Gaussian reward assumption excludes Bernoulli or Poisson settings common in click‑through‑rate optimization or network traffic modeling; extending the hierarchical Thompson framework to non‑Gaussian exponential families would require variational approximations or Monte‑Carlo methods. Finally, the current model assumes static clusters, whereas real‑world environments (e.g., evolving channel conditions or shifting market sectors) may exhibit time‑varying groupings. Adaptive mechanisms that allow clusters to split, merge, or drift over time constitute an important research direction.
Conclusion.
The paper introduces two novel algorithms—TSCG and its unimodal variant UTSCG—that exploit a two‑level hierarchical clustering of Gaussian‑reward arms within the Thompson Sampling paradigm. By sharing statistical strength across arms in the same cluster, TSCG achieves a regret bound that scales with the number of clusters rather than the total number of arms. UTSCG further leverages unimodality to obtain an even tighter bound with only √K dependence and a logarithmic correction. Theoretical analyses are complemented by extensive simulations on mmWave beam selection and financial portfolio tasks, demonstrating consistent empirical gains over state‑of‑the‑art baselines. This work highlights the power of incorporating structural priors into Bayesian bandit algorithms and opens avenues for handling unknown or dynamic cluster structures, non‑Gaussian rewards, and deep feature‑based representations in future research.