Dynamic Multi-Arm Bandit Game Based Multi-Agents Spectrum Sharing Strategy Design
📝 Abstract
For a wireless avionics communication system, a Multi-Arm Bandit game is mathematically formulated, which includes channel states, strategies, and rewards. The simple case, which includes only two agents sharing the spectrum, is fully studied in terms of maximizing the cumulative reward over a finite time horizon. An Upper Confidence Bound (UCB) algorithm is used to achieve the optimal solutions for the stochastic Multi-Arm Bandit (MAB) problem. The MAB problem can also be solved from the Markov game framework perspective. Meanwhile, Thompson Sampling (TS) is used as a benchmark to evaluate the performance of the proposed approach. Numerical results are provided regarding minimizing the expectation of the regret and choosing the best parameter for the upper confidence bound.
📄 Content
Dynamic Multi-Arm Bandit Game Based Multi-Agents Spectrum Sharing Strategy Design
Jingyang Lu, Lun Li, Dan Shen, Genshe Chen, Bin Jia
Intelligent Fusion Technology, Inc.,
20271 Goldenrod Ln, Germantown, MD, 20876
{Jingyang.lu, lun.li, dshen, gchen, bin.jia}@intfusiontech.com
Erik Blasch, Khanh Pham
Air Force Research Laboratory
Rome, NY and Kirtland, AFB, NM
{erik.blasch.1, khanh.pham}@us.af.mil
Abstract— For a wireless avionics communication system, a multi-arm bandit game is mathematically formulated, which includes channel states, strategies, and rewards. The simple case, which includes only two agents sharing the spectrum, is fully studied in terms of maximizing the cumulative reward over a finite time horizon. An upper confidence bound (UCB) algorithm is used to achieve the optimal solutions for the stochastic multi-arm bandit (MAB) problem. The MAB problem can also be solved from the Markov game framework perspective. Meanwhile, Thompson sampling (TS) is used as a benchmark to evaluate the performance of the proposed approach. Numerical results are provided regarding minimizing the expectation of the regret and choosing the best parameter for the upper confidence bound.
Keywords—Multi-arm Bandit Game; Cognitive Radio Network; Dynamic Spectrum Access
I. INTRODUCTION
Avionics systems are dependent on communication capabilities for navigation and control [1][2]. A key element of future unmanned aerial systems (UAS) would be wireless communications. However, the many prospective UAS would be sharing the available spectrum for navigation and control.
The wireless spectrum, regarded as a limited resource, has been investigated with the aim of increasing its utilization efficiency [1]. Cognitive radio (CR), proposed by Joseph Mitola III [3], automatically adapts communication system parameters to overcome the conflict between the great demand for spectrum and the large amount of spectrum left underutilized. In a cognitive radio network (CRN), spectrum sensing provides the basis for the communication control center to dynamically allocate spectrum resources without causing harmful interference to other users [4][5].
Recently, the spectrum allocation problem has been studied at the physical (PHY), medium access control (MAC), and network layers using different approaches such as communication theory, signal processing, graph theory, machine learning, and game theory, all of which involve computational complexity and communication overhead [6].
These approaches are advancing capabilities for avionics systems. Thus, a key problem arises: how to coordinate the different parts of the communication system so as to balance spectrum utility against the computational complexity imposed by limited resources.
In this paper, a new type of game is formulated to design a strategy by which each communication node can dynamically select a candidate spectrum band and transmit efficiently with the smallest cumulative regret.
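Such a regret-minimizing channel selection can be sketched with a UCB rule, as the abstract describes. The following is a minimal illustrative simulation, not the paper's implementation: the per-channel success probabilities, the exploration constant c, and the horizon are all hypothetical values chosen for demonstration.

```python
import math
import random

def simulate_ucb(success_probs, horizon, c=2.0, seed=0):
    """UCB channel selection over Bernoulli channels.

    success_probs: hypothetical per-channel transmission success rates.
    Returns (total_reward, per-channel pull counts).
    """
    rng = random.Random(seed)
    k = len(success_probs)
    counts = [0] * k       # times each channel has been tried
    means = [0.0] * k      # empirical success rate per channel

    total = 0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1    # try every channel once first
        else:
            # pick the channel maximizing empirical mean + confidence bonus
            arm = max(range(k),
                      key=lambda a: means[a]
                      + math.sqrt(c * math.log(t) / counts[a]))
        reward = 1 if rng.random() < success_probs[arm] else 0
        counts[arm] += 1
        # incremental update of the empirical mean
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return total, counts

total, counts = simulate_ucb([0.3, 0.5, 0.8], horizon=5000)
```

With these illustrative probabilities, the node concentrates its transmissions on the best channel (index 2) while still occasionally probing the others, which is exactly the exploration-exploitation balance discussed below.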
In the multi-arm bandit (MAB) game, originally proposed in [7], a gambler has to choose one of K machines to play. Each time, the gambler pulls the arm of one slot machine and receives a reward or payoff. The purpose of the game is to maximize the gambler's cumulative return or, equivalently, to minimize the cumulative regret. The problem is a typical example of the trade-off between exploration and exploitation. If the player myopically focuses on the slot machine he thinks is the best, he may miss the actually best machine. On the other hand, if he spends most of his time trying different slot machines, he may fail to play the best option often enough to gain an optimal reward. The traditional multi-arm bandit game mostly depends on assumptions about the statistics of the slot machines.
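Thompson sampling, used as a benchmark in the abstract, resolves this trade-off by sampling from a posterior over each arm's statistics rather than by an explicit bonus. The sketch below is a standard Beta-Bernoulli variant under illustrative assumptions (Bernoulli rewards, uniform Beta(1,1) priors, hypothetical success rates), not the paper's exact setup.

```python
import random

def thompson_sampling(success_probs, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling: sample a success-rate estimate
    from each arm's Beta posterior and play the argmax."""
    rng = random.Random(seed)
    k = len(success_probs)
    alpha = [1.0] * k   # posterior successes + 1 (uniform prior)
    beta = [1.0] * k    # posterior failures + 1

    total = 0
    for _ in range(horizon):
        # one posterior draw per arm; exploration falls off automatically
        # as the posteriors concentrate
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < success_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total += reward
    return total, alpha, beta

total, alpha, beta = thompson_sampling([0.3, 0.5, 0.8], horizon=5000)
```

Early on, the wide posteriors make every arm likely to be sampled as the best (exploration); as evidence accumulates, the draws concentrate on the empirically best arm (exploitation).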
In [8], a new type of multi-arm bandit game is investigated, in which an adversary, instead of a well-behaved stochastic process, has complete control over the payoffs. It is proved that the proposed algorithm can approach the payoff of the best arm at a rate of O(T^(-1/2)) over a sequence of T plays. Considering the high computational complexity of solving stochastic dynamic games as the number of agents grows, a mean-field approximation is proposed that dramatically reduces this complexity, and a performance bound is derived to evaluate the approximation quality [9].
Considering the computability and plausibility limitations of the Markov perfect equilibrium, an approximation methodology called mean-field equilibrium is considered, in which each agent optimizes only with respect to an estimate of the other players' average behavior; this is reasonable because it is impossible for each player to track the state of all other players at all times. The necessary condition for the existence of a mean-field equilibrium in such games is derived and investigated [10]. The multi-arm bandit game is a type of sequential optimization problem, where in successive trials,