Implementing EXP3 in Constant Time: Practical Algorithms and Trade-offs between Time and Regret
Reading time: 5 minutes
📝 Original Info
- Title: Implementing EXP3 in Constant Time: Practical Algorithms and Trade-offs between Time and Regret
- ArXiv ID: 2512.11201
- Date: 2026-02-23
- Authors: Ryoma Sato (National Institute of Informatics), Shinji Ito (The University of Tokyo and RIKEN)
📝 Abstract
We point out that EXP3 can be implemented in constant time per round, propose more practical algorithms, and analyze the trade-offs between the regret bounds and time complexities of these algorithms.
💡 Deep Analysis
Deep dive into "Implementing EXP3 in Constant Time: Practical Algorithms and Trade-offs between Time and Regret": the paper shows that EXP3 admits a constant-time-per-round implementation, proposes more practical algorithms, and analyzes the trade-offs between the regret bounds and the time complexities of these algorithms.
📄 Full Content
Fast EXP3 Algorithms
Ryoma Sato
rsato@nii.ac.jp
National Institute of Informatics
Shinji Ito
shinji@mist.i.u-tokyo.ac.jp
The University of Tokyo and RIKEN
Abstract
We point out that EXP3 can be implemented in constant time per round, propose more practical algorithms, and analyze the trade-offs between the regret bounds and time complexities of these algorithms.
1 Introduction
The computational complexity of adversarial bandits has been largely overlooked. In fact, in much of the
literature, the computational complexity of EXP3 [1] is either left unspecified or stated to be linear in the
number of arms per round [5, 6].
A notable exception is the work by Chewi et al. [4], who proposed an algorithm that runs in O(log² K) time
for the adversarial bandit problem with K arms, at the cost of worsening the constant factor in the regret
bound. However, this represents a suboptimal trade-off between computational time and regret.
First, we point out that a practical implementation of EXP3 is possible in O(log K) time per round exactly,
that is, without degrading the constant factor in the regret bound. This improves upon the algorithm of
Chewi et al. [4] in both the order of computational time and the regret constant.
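For intuition, the sketch below shows the standard sum-tree idea, a complete binary tree that stores subtree sums at internal nodes: updating one arm's weight and sampling an arm proportionally to the weights each take one root-to-leaf pass, i.e., O(log K) time. This is a minimal illustration of how such an O(log K)-per-round implementation can work; the class and method names are illustrative, and the paper's Section 3 construction may differ in its details.

```python
import random

class SumTree:
    """Binary tree over K leaf weights with subtree sums at internal nodes.

    Supports multiplying one leaf weight and sampling an index proportionally
    to the weights, both in O(log K) time. Illustrative sketch only.
    """

    def __init__(self, k, init=1.0):
        self.k = k
        self.tree = [0.0] * (2 * k)            # leaves live at indices k..2k-1
        for i in range(k, 2 * k):
            self.tree[i] = init
        for i in range(k - 1, 0, -1):          # internal node = sum of its children
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def multiply(self, i, factor):
        """Multiply the weight of arm i by `factor`, refreshing ancestor sums."""
        node = self.k + i
        self.tree[node] *= factor
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def total(self):
        """Total weight W = sum of all leaf weights."""
        return self.tree[1]

    def sample(self):
        """Sample an arm with probability weight_i / total()."""
        u = random.random() * self.tree[1]
        node = 1
        while node < self.k:                   # descend toward a leaf
            left = 2 * node
            if u <= self.tree[left]:
                node = left
            else:
                u -= self.tree[left]
                node = left + 1
        return node - self.k
```

Since EXP3 changes only the selected arm's weight in each round, one `multiply` and one `sample` per round suffice, and the sampling distribution w_{t,i}/W_t is exact, so the regret constant is unchanged.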
Next, we show that by using an advanced general-purpose data structure [7], it is possible to implement
EXP3 in O(1) expected time per round exactly, again, without degrading the constant factor in the regret
bound. This also improves upon the algorithm of Chewi et al. [4] in both the order of computational time
and the regret constant.
However, this data structure is exceedingly complex.
Therefore, we propose a simpler and more easily analyzable implementation of EXP3 that runs in O(1)
expected time per round.
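Table 1 attributes this simpler O(1)-expected-time method (Section 5) to the alias method. As background, the sketch below shows the standard Walker/Vose alias-table construction, which preprocesses a fixed weight vector in O(K) time and then samples in O(1) time; how the paper amortizes table rebuilds against EXP3's weight updates is not reproduced here, and the function names are illustrative.

```python
import random

def build_alias_table(weights):
    """Vose's alias method: O(K) preprocessing for O(1) sampling (sketch)."""
    k = len(weights)
    total = sum(weights)
    scaled = [w * k / total for w in weights]
    prob = [0.0] * k
    alias = [0] * k
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]                       # cell s is filled up to scaled[s]
        alias[s] = l                              # the remainder of cell s points to l
        scaled[l] = (scaled[l] + scaled[s]) - 1.0 # l donated mass to cell s
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                       # leftovers get probability 1
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    """Draw one index in O(1) time from a built alias table."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]
```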
Finally, we analyze the trade-offs between computational time and regret when making EXP3 anytime. We
show that the commonly used doubling trick is not optimal and that better trade-offs exist.
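For reference, the doubling trick referred to here is the standard wrapper sketched below: run a fixed-horizon algorithm with horizon guesses 1, 2, 4, …, restarting whenever the guess is exceeded. The `run_fixed_horizon` callable is a hypothetical stand-in for fixed-horizon EXP3 tuned for the guessed horizon; the paper's point is that this common construction does not give the best time/regret trade-off.

```python
def doubling_trick(run_fixed_horizon, total_rounds):
    """Standard doubling trick (illustrative baseline only).

    run_fixed_horizon(horizon, rounds): hypothetical routine that plays
    `rounds` rounds of a fixed-horizon algorithm tuned for `horizon`
    (e.g., EXP3 with eta = sqrt(2 ln K / (K * horizon))) and returns the
    losses it incurred. Restarting with horizon guesses 1, 2, 4, ...
    yields an anytime algorithm at the cost of a constant-factor blow-up
    in the regret bound.
    """
    losses, t, epoch = [], 0, 0
    while t < total_rounds:
        horizon = 2 ** epoch
        rounds = min(horizon, total_rounds - t)   # play until the guess is exhausted
        losses.extend(run_fixed_horizon(horizon, rounds))
        t += rounds
        epoch += 1
    return losses
```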
We summarize our main results in Table 1.
2 Background: Adversarial Bandits and EXP3
In this section, we define the setting of the adversarial bandit problem addressed in this paper and describe
the details of the baseline algorithm EXP3 along with its regret upper bound.
2.1 Problem Setting
We consider a slot machine with K ≥2 arms over a game of T rounds. The adversary (also called the
environment) and the player interact according to the following protocol.
Table 1: Summary of main results in this paper. All proposed methods outperform Chewi et al. [4] in both time complexity and regret constant. The regret coefficient is the coefficient of √(KT ln K) in the expected pseudo-regret for the anytime setting; smaller is better.

| Algorithm | Time Complexity per Round | Regret Coefficient |
| --- | --- | --- |
| Naive implementation | O(K) worst-case | 2 |
| Chewi et al. [4] | O(log² K) worst-case processing + O(log K) expected sampling | 4 |
| Binary tree (Section 3) | O(log K) worst-case | 2 |
| Advanced data structure (Section 4) | O(1) expected | 2 |
| Alias method (Section 5) | O(1) expected | 2 |
In each round t = 1, …, T:
1. Adversary's turn: The adversary determines the loss vector ℓ_t = (ℓ_{t,1}, …, ℓ_{t,K}) ∈ [0, 1]^K for each arm based on the history H_{t−1} = (a_1, ℓ_{1,a_1}, …, a_{t−1}, ℓ_{t−1,a_{t−1}}). Crucially, while the adversary knows the player's algorithm (policy), it must fix ℓ_t before knowing the realization of the player's internal randomness at round t (i.e., before knowing which arm a_t will be selected).
2. Player's turn: The player follows the algorithm and selects one arm a_t ∈ {1, …, K} using the observable history and its own internal randomness.
3. Observation and loss: The player observes only the loss ℓ_{t,a_t} of the selected arm and incurs this loss. The losses ℓ_{t,i} (i ≠ a_t) of the unselected arms are not observed (bandit feedback).
In this paper, we primarily consider the setting where the horizon T is known in advance. We will discuss
the setting where the horizon is unknown in Section 7.
In this setting, the player's objective is to minimize the expected pseudo-regret R̄_T defined as follows.

$$\bar{R}_T := \max_{i \in \{1,\dots,K\}} \mathbb{E}\left[\sum_{t=1}^{T} \ell_{t,a_t} - \sum_{t=1}^{T} \ell_{t,i}\right] \tag{1}$$
Here, the expectation E[·] is taken over the randomness in the player’s algorithm (and the adversary’s
randomness if the adversary is stochastic).
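As a concrete reading of eq. (1), the snippet below estimates the expected pseudo-regret by Monte Carlo for an oblivious adversary, i.e., when the loss matrix is fixed in advance so that the comparator term is deterministic and the max over arms reduces to the best fixed arm in hindsight. The function name and interface are illustrative.

```python
import numpy as np

def estimate_pseudo_regret(losses, chosen_arms_per_run):
    """Monte Carlo estimate of eq. (1) for an oblivious adversary (sketch).

    losses: (T, K) array, the fixed loss matrix chosen by the adversary.
    chosen_arms_per_run: list of length-T integer sequences, one per
        independent run of the player; the expectation over the player's
        randomness is approximated by averaging over these runs.
    """
    losses = np.asarray(losses, dtype=float)
    T, K = losses.shape
    # E[ sum_t loss of the played arm ], averaged over independent runs
    player_loss = np.mean(
        [losses[np.arange(T), np.asarray(arms)].sum() for arms in chosen_arms_per_run]
    )
    # For a fixed loss matrix, the comparator term is the cumulative loss
    # of the best fixed arm in hindsight.
    best_fixed_loss = losses.sum(axis=0).min()
    return player_loss - best_fixed_loss
```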
2.2 The EXP3 Algorithm
EXP3 [1] is an algorithm that maintains a weight wt,i for each arm and selects arms according to a probability
distribution based on these weights.
We fix the learning rate $\eta := \sqrt{\frac{2 \ln K}{K T}}$. The initial weights are set to w_{1,i} = 1 for i = 1, …, K, and in each round t, we define the total weight as $W_t := \sum_{i=1}^{K} w_{t,i}$. The player selects arm a_t according to the following probability distribution p_t = (p_{t,1}, …, p_{t,K}).

$$p_{t,i} := \frac{w_{t,i}}{W_t} \tag{2}$$
After selecting arm a_t and observing loss ℓ_{t,a_t}, the weights are updated as follows.

$$w_{t+1,i} := \begin{cases} w_{t,i}\, \exp\!\left(-\eta\, \dfrac{\ell_{t,i}}{p_{t,i}}\right) & \text{if } i = a_t \\ w_{t,i} & \text{otherwise} \end{cases} \tag{3}$$
This is equivalent to performing an exponential weight update using an inverse proba
…(Full text truncated)…
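Putting the definitions of Section 2.2 together, the following sketch is a naive implementation of EXP3 with the fixed learning rate η = √(2 ln K / (K T)), using eq. (2) for sampling and eq. (3) for the update. The `loss_fn(t, arm)` environment interface is illustrative. Recomputing W_t and drawing a_t from p_t are the O(K) steps per round; these are the operations that the faster implementations summarized in Table 1 target.

```python
import math
import random

def exp3(loss_fn, K, T, rng=random):
    """Naive EXP3 (Section 2.2): O(K) work per round (illustrative sketch).

    loss_fn(t, arm): returns the loss in [0, 1] of the pulled arm; this is a
        stand-in for the bandit feedback of the protocol (name illustrative).
    The learning rate eta = sqrt(2 ln K / (K T)) is the fixed value used in
    the paper, so the horizon T must be known in advance.
    """
    eta = math.sqrt(2.0 * math.log(K) / (K * T))
    w = [1.0] * K                                   # w_{1,i} = 1
    chosen, incurred = [], []
    for t in range(T):
        W = sum(w)                                  # W_t, costs O(K)
        p = [wi / W for wi in w]                    # eq. (2)
        arm = rng.choices(range(K), weights=p)[0]   # sample a_t ~ p_t, costs O(K)
        loss = loss_fn(t, arm)                      # only this loss is observed
        w[arm] *= math.exp(-eta * loss / p[arm])    # eq. (3): importance-weighted update
        chosen.append(arm)
        incurred.append(loss)
    return chosen, incurred
```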
Reference
This content is AI-processed based on ArXiv data.