Dynamic Multi-Arm Bandit Game Based Multi-Agents Spectrum Sharing Strategy Design
📝 Abstract
For a wireless avionics communication system, a Multi-Arm Bandit game is mathematically formulated, which includes channel states, strategies, and rewards. The simple case, which includes only two agents sharing the spectrum, is fully studied in terms of maximizing the cumulative reward over a finite time horizon. An Upper Confidence Bound (UCB) algorithm is used to achieve the optimal solutions for the stochastic Multi-Arm Bandit (MAB) problem. The MAB problem can also be solved from the Markov game framework perspective. Meanwhile, Thompson Sampling (TS) is used as a benchmark to evaluate the performance of the proposed approach. Numerical results are provided regarding minimizing the expectation of the regret and choosing the best parameter for the upper confidence bound.
📄 Content
Dynamic Multi-Arm Bandit Game Based Multi-Agents Spectrum Sharing Strategy Design
Jingyang Lu, Lun Li, Dan Shen, Genshe Chen, Bin Jia
Intelligent Fusion Technology, Inc.,
20271 Goldenrod Ln, Germantown, MD, 20876
{Jingyang.lu, lun.li, dshen, gchen, bin.jia}@intfusiontech.com
Erik Blasch, Khanh Pham
Air Force Research Laboratory
Rome, NY and Kirtland, AFB, NM
{erik.blasch.1, khanh.pham}@us.af.mil
Abstract— For a wireless avionics communication system, a multi-arm bandit game is mathematically formulated, which includes channel states, strategies, and rewards. The simple case, which includes only two agents sharing the spectrum, is fully studied in terms of maximizing the cumulative reward over a finite time horizon. An upper confidence bound (UCB) algorithm is used to achieve the optimal solutions for the stochastic multi-arm bandit (MAB) problem. The MAB problem can also be solved from the Markov game framework perspective. Meanwhile, Thompson sampling (TS) is used as a benchmark to evaluate the performance of the proposed approach. Numerical results are provided regarding minimizing the expectation of the regret and choosing the best parameter for the upper confidence bound.
Keywords—Multi-arm Bandit Game; Cognitive Radio Network; Dynamic Spectrum Access
I. INTRODUCTION
Avionics systems are dependent on communication capabilities for navigation and control [1][2]. A key element of future unmanned aerial systems (UAS) would be wireless communications. However, the many prospective UAS would be sharing the available spectrum for navigation and control.
The wireless spectrum, regarded as a limited resource, has been investigated with the aim of increasing its utilization efficiency [1]. Cognitive radio (CR), proposed by Joseph Mitola III [3], automatically adapts communication system parameters to overcome the conflict between the great demand for spectrum and the large amount of spectrum left underutilized. In a cognitive radio network (CRN), spectrum sensing provides the basis for the communication control center to dynamically allocate spectrum resources without causing harmful interference to other users [4][5].
Recently, the spectrum allocation problem has been studied at the physical (PHY), medium access control (MAC), and network layers using different approaches such as communication theory, signal processing, graph theory, machine learning, and game theory, all of which involve computational complexity and communication overhead [6].
These approaches are advancing capabilities for avionics systems. Thus, a key problem arises: how to coordinate the different parts of the communication system so as to balance spectrum utility against the computational complexity imposed by limited resources.
In this paper, a new type of game is formulated to design a strategy by which each communication node can dynamically select a candidate spectrum band and transmit efficiently with the smallest cumulative regret.
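Such a regret-minimizing channel selection can be sketched with a UCB rule, as the abstract describes. The following is a minimal illustrative simulation, not the paper's implementation: the per-channel success probabilities, the exploration constant c, and the horizon are all hypothetical values chosen for demonstration.

```python
import math
import random

def simulate_ucb(success_probs, horizon, c=2.0, seed=0):
    """UCB channel selection over Bernoulli channels.

    success_probs: hypothetical per-channel transmission success rates.
    Returns (total_reward, per-channel pull counts).
    """
    rng = random.Random(seed)
    k = len(success_probs)
    counts = [0] * k       # times each channel has been tried
    means = [0.0] * k      # empirical success rate per channel

    total = 0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1    # try every channel once first
        else:
            # pick the channel maximizing empirical mean + confidence bonus
            arm = max(range(k),
                      key=lambda a: means[a]
                      + math.sqrt(c * math.log(t) / counts[a]))
        reward = 1 if rng.random() < success_probs[arm] else 0
        counts[arm] += 1
        # incremental update of the empirical mean
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return total, counts

total, counts = simulate_ucb([0.3, 0.5, 0.8], horizon=5000)
```

With these illustrative probabilities, the node concentrates its transmissions on the best channel (index 2) while still occasionally probing the others, which is exactly the exploration-exploitation balance discussed below.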
In the multi-arm bandit (MAB) game, originally proposed in [7], a gambler has to choose one of K machines to play. Each time, the gambler pulls the arm of one slot machine and receives a reward or payoff. The purpose of the game is to maximize the gambler's cumulative return or, equivalently, to minimize the cumulative regret. The problem is a typical example of the trade-off between exploration and exploitation. If the player myopically focuses on the slot machine he thinks is the best, he may miss the actually best machine. On the other hand, if he spends most of his time trying different slot machines, he may fail to play the best option often enough to gain an optimal reward. The traditional multi-arm bandit game mostly depends on assumptions about the statistics of the slot machines.
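Thompson sampling, used as a benchmark in the abstract, resolves this trade-off by sampling from a posterior over each arm's statistics rather than by an explicit bonus. The sketch below is a standard Beta-Bernoulli variant under illustrative assumptions (Bernoulli rewards, uniform Beta(1,1) priors, hypothetical success rates), not the paper's exact setup.

```python
import random

def thompson_sampling(success_probs, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling: sample a success-rate estimate
    from each arm's Beta posterior and play the argmax."""
    rng = random.Random(seed)
    k = len(success_probs)
    alpha = [1.0] * k   # posterior successes + 1 (uniform prior)
    beta = [1.0] * k    # posterior failures + 1

    total = 0
    for _ in range(horizon):
        # one posterior draw per arm; exploration falls off automatically
        # as the posteriors concentrate
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < success_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total += reward
    return total, alpha, beta

total, alpha, beta = thompson_sampling([0.3, 0.5, 0.8], horizon=5000)
```

Early on, the wide posteriors make every arm likely to be sampled as the best (exploration); as evidence accumulates, the draws concentrate on the empirically best arm (exploitation).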
In [8], a new type of multi-arm bandit game is investigated, in which an adversary, instead of a well-behaved stochastic process, has complete control over the payoffs. It is proved that the proposed algorithm can approach the payoff of the best arm at a rate of O(T^(-1/2)) over a sequence of T plays. Considering the high computational complexity of solving stochastic dynamic games as the number of agents grows, a mean-field approximation is proposed that dramatically reduces this complexity, and a performance bound is derived to evaluate the approximation quality [9].
Considering the computability and plausibility limitations of the Markov perfect equilibrium, an approximation methodology called mean-field equilibrium is considered, in which each agent optimizes only with respect to an estimate of the other players' average behavior; this is reasonable because it is impossible for each player to track the state of all other players at all times. The necessary condition for the existence of a mean-field equilibrium in such games is derived and investigated [10]. The multi-arm bandit game is a type of sequential optimization problem, where in successive trials,