We study a type of Multi-Armed Bandit (MAB) problem in which arms with Gaussian reward feedback are clustered. Such an arm setting finds applications in many real-world problems, for example, mmWave communications and portfolio management with risky assets, as a result of the universality of the Gaussian distribution. Based on the Thompson Sampling with Gaussian prior (TSG) algorithm for the selection of the optimal arm, we propose our Thompson Sampling with Clustered arms under Gaussian prior (TSCG) algorithm, specific to the 2-level hierarchical structure. We prove that by utilizing the 2-level structure, we achieve a lower regret bound than ordinary TSG does. In addition, when the reward is Unimodal, we reach an even lower bound on the regret with our Unimodal Thompson Sampling algorithm with Clustered Arms under Gaussian prior (UTSCG). Each of our proposed algorithms is accompanied by a theoretical evaluation of the upper regret bound, and our numerical experiments confirm the advantage of the proposed algorithms.
The Gaussian distribution is widely observed in many physical and socioeconomic systems. Given two options whose rewards follow Gaussian distributions, the ordinary way to compare them is to measure the reward of each option repeatedly and identify the better one by running a statistical test on the two samples.
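As a concrete illustration of this baseline, the sketch below draws repeated Gaussian rewards for two hypothetical options and compares them with Welch's t-test; all reward parameters and sample sizes are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative Gaussian reward streams for two options (parameters are made up).
rewards_a = rng.normal(loc=1.0, scale=2.0, size=200)  # option A: mean 1.0, std 2.0
rewards_b = rng.normal(loc=1.2, scale=2.0, size=200)  # option B: mean 1.2, std 2.0

# Welch's t-test: is option B's mean reward different from option A's?
t_stat, p_value = stats.ttest_ind(rewards_b, rewards_a, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A small p-value suggests the measured difference is unlikely to be noise.
# Note that this baseline spends samples on both options regardless of how
# clearly one is worse -- the inefficiency that motivates bandit algorithms.
```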
We explore a type of policy optimization in which the player, or agent, has the prior knowledge that the reward of an option follows a Gaussian distribution with unknown mean and variance. Further, we work with a unique optimal arm, a consequence of the Unimodality property, which is expanded on in the next section. We present two such applications in the following examples.
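To make the setting concrete, here is a minimal sketch of Thompson Sampling for Gaussian rewards with unknown mean and variance, using a Normal-Inverse-Gamma conjugate prior. This is one standard construction, not necessarily the exact prior used by TSG in this paper, and the arm parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

class GaussianArmPosterior:
    """Normal-Inverse-Gamma posterior for a Gaussian reward with unknown
    mean and variance (a standard conjugate choice; the paper's exact
    prior may differ)."""
    def __init__(self, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
        self.mu, self.kappa, self.alpha, self.beta = mu0, kappa0, alpha0, beta0

    def update(self, x):
        # Sequential conjugate update with a single observation x.
        mu_n = (self.kappa * self.mu + x) / (self.kappa + 1)
        self.beta += 0.5 * self.kappa * (x - self.mu) ** 2 / (self.kappa + 1)
        self.mu, self.kappa, self.alpha = mu_n, self.kappa + 1, self.alpha + 0.5

    def sample_mean(self):
        # Draw sigma^2 ~ Inv-Gamma(alpha, beta), then mu | sigma^2 ~ Normal.
        sigma2 = self.beta / rng.gamma(self.alpha)
        return rng.normal(self.mu, np.sqrt(sigma2 / self.kappa))

# Hypothetical (mean, std) per arm; the agent knows neither parameter.
arms_true = [(0.5, 1.0), (1.0, 1.5), (0.8, 0.7)]
posteriors = [GaussianArmPosterior() for _ in arms_true]
for t in range(1000):
    k = int(np.argmax([p.sample_mean() for p in posteriors]))  # Thompson draw
    posteriors[k].update(rng.normal(*arms_true[k]))
```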
Example 1: mmWave communications Millimeter wave (mmWave) technology is a key enabler in 5G communication, offering abundant bandwidth to accommodate the growing demand for mobile data traffic. However, a significant challenge in mmWave communication is its heavy reliance on the line-of-sight (LOS) path, which is highly susceptible to blockages from obstacles such as buildings. Beamforming plays a critical role in mmWave communications by enabling directional transmission that focuses the wireless signal toward the receiver. The choice of communication frequency directly affects large-scale path loss, while the beam selection determines the antenna gain. Thus, optimizing the selection of both the communication frequency and beam is essential for efficient mmWave communications. Furthermore, the received noise power typically follows a Gaussian distribution in mmWave communications. The objective of the learning agent (i.e., the mmWave base station) is to identify the optimal communication frequency and beam that maximize the received signal strength in the presence of Gaussian noise. Fig. 1 illustrates an example of mmWave communications. There are three communication frequencies [1]: $f_1 = 24.25$ GHz, $f_2 = 43.5$ GHz, and $f_3 = 60$ GHz. For each communication frequency, there are three beams to be selected. In this context, each communication frequency can be considered as a cluster, with each beam representing an arm within that cluster.
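The toy environment below mirrors this structure: three frequency clusters with three beams each, where pulling an arm returns the beam's mean signal strength corrupted by Gaussian noise. All gain values and the noise level are invented for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 2-level environment: each frequency is a cluster, each beam an arm.
# Mean received signal strengths (dB) are illustrative numbers only.
MEAN_GAIN = {
    "f1 (24.25 GHz)": [-82.0, -75.0, -88.0],
    "f2 (43.5 GHz)":  [-79.0, -71.0, -84.0],
    "f3 (60 GHz)":    [-90.0, -77.0, -85.0],
}
NOISE_STD = 3.0  # Gaussian measurement noise (assumed)

def pull(freq: str, beam: int) -> float:
    """Play arm (cluster=freq, arm=beam); observe a noisy signal strength."""
    return MEAN_GAIN[freq][beam] + rng.normal(0.0, NOISE_STD)

# e.g. one measurement on beam 1 of the 43.5 GHz cluster:
print(pull("f2 (43.5 GHz)", 1))
```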
Example 2: Portfolio Selection In short time periods, the return of a risky asset, such as one particular stock or a weighted allocation of many as a portfolio, is stochastic. In many quantitative models, the rate of return is approximated by the superposition of a long-term drift and a Wiener process $dW_t$ that accounts for Gaussian fluctuations (our construction of the sample return statistics of different risky assets in this demonstration refers to the statistics in [3]). An example is shown in Fig. 2.
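A minimal simulation of this drift-plus-Wiener return model is sketched below; the drift, volatility, and time step are illustrative and not calibrated to the statistics in [2] or [3].

```python
import numpy as np

rng = np.random.default_rng(3)

# Discretized drift-plus-Wiener model: r_t = mu*dt + sigma*sqrt(dt)*Z,
# with Z standard normal. Parameters are illustrative only.
mu, sigma, dt = 0.08, 0.20, 1 / 252  # annualized drift/volatility, daily step
daily_returns = mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(252)
print(daily_returns.mean(), daily_returns.std())  # one year of Gaussian returns
```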
We start with a simple case, in which there are $n$ risky strategies as arms to choose from. Our aim is to find an automatic sampling scheme that gradually arrives at the optimal trading strategy, the one with the highest rate of return. Optimizations over both the mean and the variance in the Markowitz fashion can be found in [4][5][6], in which the agent maximizes a linear combination of the overall mean return and the risk measured by the variance, at a specific level of risk tolerance.
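For concreteness, one common scalarization of this mean-variance trade-off scores each strategy by its sample mean minus the risk-aversion-weighted sample variance. The sketch below illustrates that idea; it is not the exact objective of [4][5][6], and the return samples are synthetic.

```python
import numpy as np

def mean_variance_score(returns, risk_aversion=1.0):
    """Markowitz-style scalarization: mean return minus lambda * variance."""
    returns = np.asarray(returns)
    return returns.mean() - risk_aversion * returns.var(ddof=1)

# Hypothetical observed return samples for n = 3 strategies.
samples = [np.random.default_rng(s).normal(0.001 * s, 0.02, 100) for s in (1, 2, 3)]
best = max(range(3), key=lambda k: mean_variance_score(samples[k], risk_aversion=2.0))
print("best strategy by mean-variance score:", best)
```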
In each trading period or time slot, the AI-investor picks one asset and observes a return drawn from a Gaussian distribution that is specific to the asset and unknown to the investor. One such practice can be found in [7], which samples the arms in a UCB fashion, as sketched below. In our work, we examine selection in a more generalized setting, in which assets are clustered.

Fig. 2 Example of Gaussian returns (daily rate of return vs. frequency). The histogram in blue is synthesized according to statistics in the S&P 500 index in [2] and the one in red refers to CSI 300 statistics in [3].
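The following is a minimal UCB1-style sketch of such asset sampling; the exact index used in [7] may differ, and the asset means and noise level here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def ucb_select(means, counts, t, c=2.0):
    """UCB index: empirical mean plus an exploration bonus (UCB1-style)."""
    return int(np.argmax(means + np.sqrt(c * np.log(t) / counts)))

true_mu = np.array([0.02, 0.05, 0.03])  # hypothetical asset mean returns
means, counts = np.zeros(3), np.zeros(3)
for k in range(3):                      # play each arm once to initialize
    means[k], counts[k] = rng.normal(true_mu[k], 0.1), 1
for t in range(4, 1000):
    k = ucb_select(means, counts, t)
    r = rng.normal(true_mu[k], 0.1)     # Gaussian return of the chosen asset
    counts[k] += 1
    means[k] += (r - means[k]) / counts[k]  # incremental empirical mean
```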
The hierarchy and Unimodality of the arms in the systems above can be handled by algorithms designed to exploit these two properties. We first review previous work of this kind later in this section, and define our setup in Section 2. We introduce TSG in Section 3, TSCG in Section 4, and UTSCG in Section 5. The two algorithms are tested with experiments in Section 6. Section 7 concludes.
Extensive research has been conducted on the hierarchical multi-armed bandit (MAB) problem, where the set of available arms is partitioned into several distinct clusters [8][9][10][11]. Under different premises regarding the clustering configuration, these existing studies have derived corresponding regret bounds for their proposed methods. For instance, a Two-Level Policy (TLP) algorithm, which groups arms into multiple clusters, was first introduced in [12]. A limitation of this work is the absence of a theoretical assessment of the algorithm's upper bound on regret. [13] proposed a novel Hierarchical Thompson Sampling (HTS) algorithm. In a relevant context, the beams associated with a single selected group can be analogously treated as a cluster of arms within the MAB framework. Even so, this HTS approach fails to leverage the Unimodal characteristic inherent to each individual cluster.