Implementing EXP3 in Constant Time: Practical Algorithms and Trade-offs between Time and Regret
Reading time: 5 minutes
📝 Original Info
- Title: Implementing EXP3 in Constant Time: Practical Algorithms and Trade-offs between Time and Regret
- ArXiv ID: 2512.11201
- Date: 2026-02-23
- Authors: Ryoma Sato (National Institute of Informatics), Shinji Ito (The University of Tokyo and RIKEN)
📝 Abstract
We point out that EXP3 can be implemented in constant time per round, propose more practical algorithms, and analyze the trade-offs between the regret bounds and time complexities of these algorithms.
💡 Deep Analysis
Deep dive into "Implementing EXP3 in Constant Time: Practical Algorithms and Trade-offs between Time and Regret": the paper shows that EXP3 admits a constant-time-per-round implementation, proposes more practical algorithms, and analyzes the trade-offs between the regret bounds and the time complexities of these algorithms.
📄 Full Content
Fast EXP3 Algorithms
Ryoma Sato
rsato@nii.ac.jp
National Institute of Informatics
Shinji Ito
shinji@mist.i.u-tokyo.ac.jp
The University of Tokyo and RIKEN
Abstract
We point out that EXP3 can be implemented in constant time per round, propose more practical algorithms, and analyze the trade-offs between the regret bounds and time complexities of these algorithms.
1 Introduction
The computational complexity of adversarial bandits has been largely overlooked. In fact, in much of the
literature, the computational complexity of EXP3 [1] is either left unspecified or stated to be linear in the
number of arms per round [5, 6].
A notable exception is the work by Chewi et al. [4], who proposed an algorithm that runs in O(log² K) time
for the adversarial bandit problem with K arms, at the cost of worsening the constant factor in the regret
bound. However, this represents a suboptimal trade-off between computational time and regret.
First, we point out that a practical implementation of EXP3 is possible in O(log K) time per round exactly,
that is, without degrading the constant factor in the regret bound. This improves upon the algorithm of
Chewi et al. [4] in both the order of computational time and the regret constant.
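For intuition, the sketch below shows the standard sum-tree idea, a complete binary tree that stores subtree sums at internal nodes: updating one arm's weight and sampling an arm proportionally to the weights each take one root-to-leaf pass, i.e., O(log K) time. This is a minimal illustration of how such an O(log K)-per-round implementation can work; the class and method names are illustrative, and the paper's Section 3 construction may differ in its details.

```python
import random

class SumTree:
    """Binary tree over K leaf weights with subtree sums at internal nodes.

    Supports multiplying one leaf weight and sampling an index proportionally
    to the weights, both in O(log K) time. Illustrative sketch only.
    """

    def __init__(self, k, init=1.0):
        self.k = k
        self.tree = [0.0] * (2 * k)            # leaves live at indices k..2k-1
        for i in range(k, 2 * k):
            self.tree[i] = init
        for i in range(k - 1, 0, -1):          # internal node = sum of its children
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def multiply(self, i, factor):
        """Multiply the weight of arm i by `factor`, refreshing ancestor sums."""
        node = self.k + i
        self.tree[node] *= factor
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def total(self):
        """Total weight W = sum of all leaf weights."""
        return self.tree[1]

    def sample(self):
        """Sample an arm with probability weight_i / total()."""
        u = random.random() * self.tree[1]
        node = 1
        while node < self.k:                   # descend toward a leaf
            left = 2 * node
            if u <= self.tree[left]:
                node = left
            else:
                u -= self.tree[left]
                node = left + 1
        return node - self.k
```

Since EXP3 changes only the selected arm's weight in each round, one `multiply` and one `sample` per round suffice, and the sampling distribution w_{t,i}/W_t is exact, so the regret constant is unchanged.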
Next, we show that by using an advanced general-purpose data structure [7], it is possible to implement
EXP3 in O(1) expected time per round exactly, again, without degrading the constant factor in the regret
bound. This also improves upon the algorithm of Chewi et al. [4] in both the order of computational time
and the regret constant.
However, this data structure is exceedingly complex.
Therefore, we propose a simpler and more easily analyzable implementation of EXP3 that runs in O(1)
expected time per round.
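Table 1 attributes this simpler O(1)-expected-time method (Section 5) to the alias method. As background, the sketch below shows the standard Walker/Vose alias-table construction, which preprocesses a fixed weight vector in O(K) time and then samples in O(1) time; how the paper amortizes table rebuilds against EXP3's weight updates is not reproduced here, and the function names are illustrative.

```python
import random

def build_alias_table(weights):
    """Vose's alias method: O(K) preprocessing for O(1) sampling (sketch)."""
    k = len(weights)
    total = sum(weights)
    scaled = [w * k / total for w in weights]
    prob = [0.0] * k
    alias = [0] * k
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]                       # cell s is filled up to scaled[s]
        alias[s] = l                              # the remainder of cell s points to l
        scaled[l] = (scaled[l] + scaled[s]) - 1.0 # l donated mass to cell s
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                       # leftovers get probability 1
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    """Draw one index in O(1) time from a built alias table."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```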
Finally, we analyze the trade-offs between computational time and regret when making EXP3 anytime. We
show that the commonly used doubling trick is not optimal and that better trade-offs exist.
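For reference, the doubling trick referred to here is the standard wrapper sketched below: run a fixed-horizon algorithm with horizon guesses 1, 2, 4, …, restarting whenever the guess is exceeded. The `run_fixed_horizon` callable is a hypothetical stand-in for fixed-horizon EXP3 tuned for the guessed horizon; the paper's point is that this common construction does not give the best time/regret trade-off.

```python
def doubling_trick(run_fixed_horizon, total_rounds):
    """Standard doubling trick (illustrative baseline only).

    run_fixed_horizon(horizon, rounds): hypothetical routine that plays
    `rounds` rounds of a fixed-horizon algorithm tuned for `horizon`
    (e.g., EXP3 with eta = sqrt(2 ln K / (K * horizon))) and returns the
    losses it incurred. Restarting with horizon guesses 1, 2, 4, ...
    yields an anytime algorithm at the cost of a constant-factor blow-up
    in the regret bound.
    """
    losses, t, epoch = [], 0, 0
    while t < total_rounds:
        horizon = 2 ** epoch
        rounds = min(horizon, total_rounds - t)   # play until the guess is exhausted
        losses.extend(run_fixed_horizon(horizon, rounds))
        t += rounds
        epoch += 1
    return losses
```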
We summarize our main results in Table 1.
2 Background: Adversarial Bandits and EXP3
In this section, we define the setting of the adversarial bandit problem addressed in this paper and describe
the details of the baseline algorithm EXP3 along with its regret upper bound.
2.1 Problem Setting
We consider a slot machine with K ≥2 arms over a game of T rounds. The adversary (also called the
environment) and the player interact according to the following protocol.
Table 1: Summary of main results in this paper. All proposed methods outperform Chewi et al. [4] in both time complexity and regret constant. The regret coefficient is the coefficient of √(KT ln K) in the expected pseudo-regret for the anytime setting; smaller is better.

| Algorithm | Time Complexity per Round | Regret Coefficient |
| --- | --- | --- |
| Naive implementation | O(K) worst-case | 2 |
| Chewi et al. [4] | O(log² K) worst-case processing + O(log K) expected sampling | 4 |
| Binary tree (Section 3) | O(log K) worst-case | 2 |
| Advanced data structure (Section 4) | O(1) expected | 2 |
| Alias method (Section 5) | O(1) expected | 2 |
In each round t = 1, …, T:
1. Adversary's turn: The adversary determines the loss vector ℓ_t = (ℓ_{t,1}, …, ℓ_{t,K}) ∈ [0, 1]^K for each arm based on the history H_{t−1} = (a_1, ℓ_{1,a_1}, …, a_{t−1}, ℓ_{t−1,a_{t−1}}). Crucially, while the adversary knows the player's algorithm (policy), it must fix ℓ_t before knowing the realization of the player's internal randomness at round t (i.e., before knowing which arm a_t will be selected).
2. Player's turn: The player follows the algorithm and selects one arm a_t ∈ {1, …, K} using the observable history and its own internal randomness.
3. Observation and loss: The player observes only the loss ℓ_{t,a_t} of the selected arm and incurs this loss. The losses ℓ_{t,i} (i ≠ a_t) of the unselected arms are not observed (bandit feedback).
In this paper, we primarily consider the setting where the horizon T is known in advance. We will discuss
the setting where the horizon is unknown in Section 7.
In this setting, the player's objective is to minimize the expected pseudo-regret R̄_T defined as follows.

$$\bar{R}_T := \max_{i \in \{1,\dots,K\}} \mathbb{E}\left[\sum_{t=1}^{T} \ell_{t,a_t} - \sum_{t=1}^{T} \ell_{t,i}\right] \tag{1}$$
Here, the expectation E[·] is taken over the randomness in the player’s algorithm (and the adversary’s
randomness if the adversary is stochastic).
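As a concrete reading of eq. (1), the snippet below estimates the expected pseudo-regret by Monte Carlo for an oblivious adversary, i.e., when the loss matrix is fixed in advance so that the comparator term is deterministic and the max over arms reduces to the best fixed arm in hindsight. The function name and interface are illustrative.

```python
import numpy as np

def estimate_pseudo_regret(losses, chosen_arms_per_run):
    """Monte Carlo estimate of eq. (1) for an oblivious adversary (sketch).

    losses: (T, K) array, the fixed loss matrix chosen by the adversary.
    chosen_arms_per_run: list of length-T integer sequences, one per
        independent run of the player; the expectation over the player's
        randomness is approximated by averaging over these runs.
    """
    losses = np.asarray(losses, dtype=float)
    T, K = losses.shape
    # E[ sum_t loss of the played arm ], averaged over independent runs
    player_loss = np.mean(
        [losses[np.arange(T), np.asarray(arms)].sum() for arms in chosen_arms_per_run]
    )
    # For a fixed loss matrix, the comparator term is the cumulative loss
    # of the best fixed arm in hindsight.
    best_fixed_loss = losses.sum(axis=0).min()
    return player_loss - best_fixed_loss
```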
2.2 The EXP3 Algorithm
EXP3 [1] is an algorithm that maintains a weight wt,i for each arm and selects arms according to a probability
distribution based on these weights.
We fix the learning rate $\eta := \sqrt{\frac{2 \ln K}{K T}}$. The initial weights are set to w_{1,i} = 1 for i = 1, …, K, and in each round t, we define the total weight as $W_t := \sum_{i=1}^{K} w_{t,i}$. The player selects arm a_t according to the following probability distribution p_t = (p_{t,1}, …, p_{t,K}).

$$p_{t,i} := \frac{w_{t,i}}{W_t} \tag{2}$$
After selecting arm a_t and observing loss ℓ_{t,a_t}, the weights are updated as follows.

$$w_{t+1,i} := \begin{cases} w_{t,i}\, \exp\!\left(-\eta\, \dfrac{\ell_{t,i}}{p_{t,i}}\right) & \text{if } i = a_t \\ w_{t,i} & \text{otherwise} \end{cases} \tag{3}$$
This is equivalent to performing an exponential weight update using an inverse proba
…(Full text truncated)…
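Putting the definitions of Section 2.2 together, the following sketch is a naive implementation of EXP3 with the fixed learning rate η = √(2 ln K / (K T)), using eq. (2) for sampling and eq. (3) for the update. The `loss_fn(t, arm)` environment interface is illustrative. Recomputing W_t and drawing a_t from p_t are the O(K) steps per round; these are the operations that the faster implementations summarized in Table 1 target.

```python
import math
import random

def exp3(loss_fn, K, T, rng=random):
    """Naive EXP3 (Section 2.2): O(K) work per round (illustrative sketch).

    loss_fn(t, arm): returns the loss in [0, 1] of the pulled arm; this is a
        stand-in for the bandit feedback of the protocol (name illustrative).
    The learning rate eta = sqrt(2 ln K / (K T)) is the fixed value used in
    the paper, so the horizon T must be known in advance.
    """
    eta = math.sqrt(2.0 * math.log(K) / (K * T))
    w = [1.0] * K                                   # w_{1,i} = 1
    chosen, incurred = [], []
    for t in range(T):
        W = sum(w)                                  # W_t, costs O(K)
        p = [wi / W for wi in w]                    # eq. (2)
        arm = rng.choices(range(K), weights=p)[0]   # sample a_t ~ p_t, costs O(K)
        loss = loss_fn(t, arm)                      # only this loss is observed
        w[arm] *= math.exp(-eta * loss / p[arm])    # eq. (3): importance-weighted update
        chosen.append(arm)
        incurred.append(loss)
    return chosen, incurred
```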
Reference
This content is AI-processed based on ArXiv data.