Online Learning in Opportunistic Spectrum Access: A Restless Bandit Approach
We consider an opportunistic spectrum access (OSA) problem where the time-varying condition of each channel (e.g., as a result of random fading or certain primary users’ activities) is modeled as an arbitrary finite-state Markov chain. At each time instant, a (secondary) user probes a channel and collects a reward that is a function of the state of the channel (e.g., a good channel condition yields a higher data rate for the user). Each channel has a potentially different state space and statistics, both unknown to the user, who tries to learn which channel is best as it goes and to maximize its use of that channel. The objective is to construct an online learning algorithm that minimizes the difference between the user’s total reward and that of always using the best channel (on average) had the channel statistics been known a priori; this difference is known as the regret. This is a classic exploration-versus-exploitation problem, and results abound when the reward processes are assumed to be i.i.d. The biggest difference from prior work is that in our case the reward process is Markovian, of which i.i.d. is a special case. In addition, the reward processes are restless in that the channel conditions continue to evolve independent of the user’s actions. This leads to a restless bandit problem, for which, to the best of our knowledge, few results exist on either algorithms or performance bounds in this learning context. In this paper we introduce an algorithm that exploits the regenerative cycles of a Markov chain to compute a sample-mean-based index policy, and show that under mild conditions on the state transition probabilities of the Markov chains this algorithm achieves logarithmic regret uniformly over time, and that this logarithmic order is optimal.
💡 Research Summary
The paper tackles the opportunistic spectrum access (OSA) problem in which a secondary user must repeatedly select one among several wireless channels whose quality evolves over time. Unlike most prior work that assumes independent and identically distributed (i.i.d.) rewards, the authors model each channel’s condition as an arbitrary finite‑state Markov chain. The reward obtained from a channel at any time is a deterministic function of its current state (e.g., a “good” state yields a high data rate). Crucially, the Markov chains evolve regardless of whether the user probes the channel, making the problem a restless bandit—a setting for which learning algorithms with provable performance guarantees are scarce.
The objective is to design an online learning policy that minimizes regret, defined as the gap between the cumulative reward earned by the learning algorithm and the reward that would have been obtained by always using the best channel (the one with the highest stationary mean reward) if its statistics were known a priori. The authors propose a novel algorithm called Regenerative‑Cycle Upper Confidence Bound (RC‑UCB). For each channel, a “regeneration state” (any fixed state) is selected. Whenever the channel visits this state, a regenerative cycle ends and a new one begins. Within a cycle the algorithm records the total reward accumulated and the length of the cycle. Because successive cycles are independent and identically distributed, the sample‑mean estimator of the stationary reward can be formed by dividing the total reward observed over all completed cycles by the total time spent in those cycles.
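In symbols, with $\mu^*$ denoting the highest stationary mean reward among the channels, the regret of a policy $\pi$ after $T$ slots can be written as

$$ R^\pi(T) \;=\; T\mu^* \;-\; \mathbb{E}^\pi\!\left[\sum_{t=1}^{T} r_{a(t)}(t)\right], $$

where $a(t)$ is the channel probed at time $t$ and $r_{a(t)}(t)$ its reward. This notation is illustrative rather than copied from the paper; the point is that the algorithm aims to keep $R^\pi(T)$ growing only logarithmically in $T$.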
At each decision epoch the algorithm computes an index for every channel. Written in the standard UCB form (the exact constant $L$ is specified in the paper), the index of channel $i$ at time $t$ is

$$ g_i(t) \;=\; \bar{r}_i(t) \;+\; \sqrt{\frac{L \ln t}{n_i(t)}}, $$

where $\bar{r}_i(t)$ is the sample-mean reward of channel $i$ over its completed regenerative cycles, $n_i(t)$ is the total time spent in those cycles, and $L$ is a constant that depends on the channels’ transition probabilities. The channel with the highest index is probed next: the first term favors channels that have performed well so far (exploitation), while the second term, which shrinks as a channel accumulates observations, forces under-sampled channels to be revisited (exploration).
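Under the assumption that a channel's state is observable when probed, the regenerative-cycle bookkeeping can be sketched as a short simulation. This is a minimal illustration, not the paper's implementation: the two-state `Channel` model, the constant `L`, and all names below are hypothetical.

```python
import math
import random

class Channel:
    """Two-state Gilbert-Elliott channel: state 0 = bad, 1 = good (illustrative)."""
    def __init__(self, p01, p10, rewards=(0.1, 1.0), seed=0):
        self.p = {0: p01, 1: 1 - p10}   # probability the next state is 1
        self.rewards = rewards
        self.state = 0
        self.rng = random.Random(seed)

    def step(self):
        # The chain is restless: it evolves every slot, observed or not.
        self.state = 1 if self.rng.random() < self.p[self.state] else 0
        return self.rewards[self.state]

def rc_ucb(channels, horizon, regen_state=0, L=2.0):
    """Sketch of a regenerative-cycle UCB policy (constants are hypothetical)."""
    k = len(channels)
    reward_sum = [0.0] * k   # reward accumulated over completed cycles
    time_sum = [0] * k       # slots spent in completed cycles
    total = 0.0
    arm, in_cycle = 0, False
    cycle_r, cycle_t = 0.0, 0
    for t in range(1, horizon + 1):
        if not in_cycle:
            # Pick the channel with the highest index; unplayed channels first.
            def index(i):
                if time_sum[i] == 0:
                    return float('inf')
                mean = reward_sum[i] / time_sum[i]
                return mean + math.sqrt(L * math.log(t) / time_sum[i])
            arm = max(range(k), key=index)
            in_cycle, cycle_r, cycle_t = True, 0.0, 0
        # All chains evolve every slot (restlessness); only `arm` is observed.
        observations = [ch.step() for ch in channels]
        r = observations[arm]
        total += r
        cycle_r += r
        cycle_t += 1
        if channels[arm].state == regen_state and cycle_t > 1:
            # Cycle completed: fold it into the i.i.d. cycle statistics.
            reward_sum[arm] += cycle_r
            time_sum[arm] += cycle_t
            in_cycle = False
    return total
```

The key design point is that statistics are updated only at cycle boundaries: by the strong Markov property, rewards collected over successive cycles that start and end in the same regeneration state form i.i.d. blocks, which is what lets i.i.d.-style confidence bounds be applied to a restless Markovian reward process.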