Non-Stationary Online Resource Allocation: Learning from a Single Sample


Authors: Yiding Feng, Jiashuo Jiang, Yige Wang

Department of Industrial Engineering & Decision Analytics, Hong Kong University of Science and Technology

Abstract: We study online resource allocation under non-stationary demand with a minimum offline data requirement. In this problem, a decision-maker must allocate multiple types of resources to sequentially arriving queries over a finite horizon. Each query belongs to a finite set of types with fixed resource consumption and a stochastic reward drawn from an unknown, type-specific distribution. Critically, the environment exhibits arbitrary non-stationarity (arrival distributions may shift unpredictably), while the algorithm requires only one historical sample per period to operate effectively. We distinguish two settings based on sample informativeness: (i) reward-observed samples, containing both the query type and the reward realization, and (ii) the more challenging type-only samples, revealing only query-type information. We propose a novel type-dependent quantile-based meta-policy that decouples the problem into modular components: reward distribution estimation, optimization of target service probabilities via fluid relaxation, and real-time decisions through dynamic acceptance thresholds. For reward-observed samples, our static threshold policy achieves $\tilde{O}(\sqrt{T})$ regret without requiring large-budget assumptions. For type-only samples, we first establish that sublinear regret is impossible without additional structure; under a mild minimum-arrival-probability assumption, we design both a partially adaptive policy attaining the same $\tilde{O}(\sqrt{T})$ bound and, more significantly, a fully adaptive resolving policy with careful rounding that achieves the first poly-logarithmic regret guarantee of $O((\log T)^3)$ for non-stationary multi-resource allocation. Our framework advances prior work by operating with minimal offline data (one sample per period), handling arbitrary non-stationarity without variation-budget assumptions, and supporting multiple resource constraints, demonstrating that near-optimal performance is attainable even under stringent data requirements.

Keywords: Online Resource Allocation, Quantile-Based Meta-Policy, One Sample per Period, Poly-Logarithmic Regret

1. Introduction

Online resource allocation under uncertainty is a fundamental problem in sequential decision-making, with broad applications in revenue management, digital advertising, cloud infrastructure, and sharing-economy platforms. In this paradigm, a decision-maker must allocate limited resources to sequentially arriving queries over a finite time horizon. Each query yields a stochastic reward and consumes a random vector of resources. Decisions must be made immediately and irrevocably upon arrival without knowledge of future queries, with the objective of maximizing total reward subject to resource constraints.

Traditional models for online resource allocation often rely on idealized assumptions, which commonly presuppose stationary demand processes. In practice, however, operational environments increasingly defy these conditions. Consider an e-commerce cloud platform anticipating a flash sale: demand may surge unpredictably due to viral social trends or seasonal peaks such as Singles' Day.
Similarly, a ride-hailing service entering a new city faces rapidly shifting trip patterns influenced by weather, local events, or evolving commuter habits. In digital advertising, a breaking news event can abruptly reshape user engagement, rendering pre-event click-through models obsolete. These scenarios collectively underscore a fundamental tension in modern resource allocation: environments exhibit dynamic non-stationarity, characterized by trends, seasonality, and external shocks. Consequently, classical algorithms designed for stationary settings often exhibit degraded reliability and limited adaptability when deployed in highly volatile conditions.

Given the inherent intractability of computing optimal policies under uncertainty, the field prioritizes algorithms with rigorous performance guarantees. The standard benchmark is regret: the expected difference between the cumulative reward of an online policy and that of the optimal offline policy with full knowledge of all future realizations. Although research in stationary settings has achieved near-optimal theoretical results, prior work has also shown that in online learning problems, sublinear regret is unattainable without a minimal amount of data. Therefore, bridging the gap between idealized theoretical assumptions and the realities of non-stationarity and data scarcity forms the core motivation for our main research question:

How can we design online allocation algorithms that learn effectively from sparse, non-stationary data to achieve near-optimal performance, ideally with sublinear, or even logarithmic, regret?

In this work, we investigate non-stationary online resource allocation with limited historical samples. We consider a finite-horizon setting with multiple resource constraints, where queries arrive sequentially. Each query belongs to a finite set of types, and every type is associated with a fixed resource consumption vector and a stochastic reward drawn from an unknown, continuous, type-specific distribution. In each period, the decision-maker observes a query's type and its random reward realization, then must decide irrevocably whether to accept it (consuming resources for a reward) or reject it. The arrival process is non-stationary, with distributions that can change arbitrarily over time. Critically, both the arrival distribution of types and the conditional reward distributions are initially unknown; the algorithm has access to only one independent historical sample per period in advance, which may lack reward information. The objective is to design an online policy that leverages these sparse samples to adapt to distributional shifts while maximizing cumulative reward subject to resource constraints.

1.1. Our Main Results and Contributions

In this paper, we propose a novel, unified framework that achieves strong theoretical guarantees under non-stationarity and extreme data scarcity. A cornerstone of our contribution is the design of near-optimal online allocation algorithms under the minimal possible data requirement: leveraging only a single historical sample per period to adapt to arbitrary distributional shifts while attaining sublinear or even logarithmic regret.
Our main contributions are as follows:

Theoretical Regret Guarantees: For online non-stationary multi-resource allocation problems with unknown type-arrival and reward distributions, we design a fully adaptive algorithm that achieves poly-logarithmic regret using only a single historical sample per period, containing solely type-arrival information. Our main result is as follows:

Main Result: We establish the first poly-logarithmic regret upper bound $O((\log T)^3)$ for online multi-resource allocation problems under non-stationarity with only one type-only historical sample per period.

To the best of our knowledge, this result provides the first poly-logarithmic regret guarantee for non-stationary online resource allocation. Our bound relies on a minimal-arrival-probability assumption. We further demonstrate the necessity of this assumption by constructing a counterexample, which shows that without such an assumption, sublinear regret is unattainable in the worst case when only type samples are available. Under the same conditions, we also derive a $\tilde{O}(\sqrt{T})$ regret bound using a partially adaptive algorithm that updates reward distribution estimates online and is easier to implement. If the minimal-arrival-probability assumption is relaxed, attaining sublinear regret requires access to historical reward information. In this setting, we propose a simpler static threshold policy with reward-observed samples that achieves the same $\tilde{O}(\sqrt{T})$ upper bound.

The closest work to ours is by Ghuge et al. (2025), who also studied multi-resource allocation with a single historical sample and proposed an exponential pricing algorithm that attains a $(1-\epsilon)$ approximation to the hindsight optimum. Their result requires a large-budget assumption (specifically, budgets of order $\tilde{\Omega}(1/\epsilon^6)$), which, when resources scale linearly with the time horizon, translates to a regret rate of $\tilde{O}(T^{5/6})$. In contrast, our quantile-based method operates under arbitrary resource levels and achieves a sharper $\tilde{O}(\sqrt{T})$ regret in the same setting with reward-observed samples. For the simpler single-resource setting, Balseiro et al. (2023a) obtain a $\tilde{O}(\sqrt{T})$ bound, which our work matches in the multi-resource case under milder assumptions.

Overall, our approach relaxes several critical assumptions in prior work and delivers rigorous, scalable performance guarantees in data-scarce and non-stationary environments. We advance the theoretical understanding of online resource allocation through a novel and flexible quantile-based framework that unifies the analysis across different distributional assumptions, as will be elaborated later.

A Meta Quantile-Based Algorithm Design: To achieve the theoretical guarantees described, we introduce a novel type-dependent, quantile-based policy framework. This design departs fundamentally from the conventional dual-based paradigm by modularly decomposing the online allocation problem into three distinct components: (i) Reward Function Estimation: a standalone learning sub-problem where state-of-the-art methods can be applied directly; (ii) Optimization of Target Service Probabilities: a strategic layer that resolves the long-term resource-reward trade-off; and (iii) Real-time Decision-making: executed via dynamic type-specific acceptance thresholds derived from estimated reward quantiles.
The power of this decomposition lies in two key advantages. (i) Modularity for Direct Technical Integration: it establishes a clean interface between high-level resource budgeting and per-period operations. The framework is explicitly designed to be modular, allowing any advancement in quantile or distribution estimation to be plugged directly into the policy without modifying the core allocation logic. (ii) Type-Dependent, Non-Interfering Decisions: each query type is managed independently through its own quantile trajectory and acceptance threshold. This type-dependent design ensures that the decision logic for one type does not interfere with or complicate the processing of another, leading to a more transparent and robust management of the resource-reward trade-off across diverse arrival types.

In contrast to prior dual-based approaches, which typically adopt specialized, monolithic designs that tightly couple dual-variable learning with allocation logic and rely on intricate, algorithm-specific analyses, our quantile-based framework cleanly decouples learning from optimization. For instance, Balseiro et al. (2023a) employ dual variables as adaptive shadow prices updated via a dual FTRL algorithm to control budget pacing under uncertainty, while Ghuge et al. (2025) dynamically adjust dual prices using an exponential rule to guide resource allocation. Although effective, such methods inherently intertwine dual-space learning with decision-making, complicating adaptation and analysis. Our approach operates fundamentally differently: it bypasses explicit dual-variable maintenance altogether. Allocation decisions are derived solely by solving the primal optimization problem using locally estimated reward quantiles specific to each query type. This design requires only minimal, interpretable information per type, eliminates dependencies on dual dynamics, and enhances both theoretical tractability and practical adaptability across diverse online allocation settings, offering a versatile alternative to tightly coupled dual-based paradigms.

We position our contributions against key related works in Table 1, stated under our formulation with some mild assumptions omitted for simplicity.

Table 1: Comparison of our work to key related literature

                          Resource number   Method           Regret
Balseiro et al. (2023a)   single            dual-based       $\tilde{O}(T^{1/2})$
Ghuge et al. (2025)       multiple          dual-based       $\tilde{O}(T^{5/6})$
Our work                  multiple          quantile-based   $O((\log T)^3)$, $\tilde{O}(T^{1/2})$

1.2. Other Related Literature

Network Revenue Management with Logarithmic Regret: Network Revenue Management (NRM) is a central and extensively studied problem in online resource allocation, where a key objective is to develop practically effective policies with strong theoretical guarantees. The field's foundations include the dynamic pricing model for NRM introduced by Gallego and Van Ryzin (1997). A seminal contribution by Talluri and Van Ryzin (1998) proposed a static bid-price policy based on the dual variables of an ex-ante fluid relaxation, establishing a sublinear regret bound. This approach was later refined by Reiman and Wang (2008), who demonstrated that periodically re-solving the fluid program to update bid-prices yields an improved regret bound of $o(\sqrt{T})$. When the underlying demand functions are unknown, the problem requires balancing learning with optimization.
Besbes and Zeevi (2012) designed "blind" pricing policies that explore and exploit based only on observed sales data. Subsequent work, such as that of Ferreira et al. (2018), employed Bayesian methods like Thompson sampling to address this trade-off under inventory constraints. In a different direction, Devanur et al. (2019) developed a single algorithm that attains a $(1-\epsilon)$ fraction of the offline optimum for every possible arrival distribution. Recent extensions have addressed more complex settings, such as reusable resources (Baek and Ma 2022) and non-stationary environments with imperfect distributional knowledge (Jiang et al. 2025a).

A distinct and theoretically significant line of research aims to establish tighter, often logarithmic, regret bounds. These results typically require more structured problem assumptions, such as discrete distributions with finite support or non-degeneracy assumptions. For instance, Jasin and Kumar (2012) analyzed certainty-equivalent heuristics for NRM with customer choice, providing bounded revenue loss. Bumpensanti and Wang (2020) designed a re-solving heuristic with a constant regret bound independent of the time horizon and resource capacities. More recently, Li and Ye (2022) derived logarithmic regret for online linear programming under local strong convexity, a result that was further tightened and extended by Bray (2025). While Jiang et al. (2025b) introduced a similar quantile-based policy that removes the dependence on non-degeneracy, our framework significantly extends their results by accommodating arbitrary non-stationary arrival processes and incorporating online distribution learning for settings with initially unknown rewards.

Online Learning with Samples: The field of online learning with prior samples investigates how access to historical data can enhance sequential decision-making. Early foundational work often assumed inputs were drawn from a known or partially known distribution. For example, Garg et al. (2008) analyzed online algorithms under inputs from a fixed distribution, while Hu and Zhou (2009) considered sequences from non-identical distributions. Other studies designed robust algorithms to mitigate overfitting in problems such as the knapsack secretary problem (Bradac et al. 2019), or developed sample-driven methods for optimal stopping under random-order arrivals (Correa et al. 2024). More recent research has shifted toward stringent data limitations, focusing on online learning with very few historical samples. This line of inquiry has been applied across various online decision problems: prophet inequalities have been studied under single-sample access (Azar et al. 2014, Caramanis et al. 2022, Cristi and Ziliotto 2024); online matching was examined by Kaplan et al. (2022); and online network revenue management was considered by Argue et al. (2022). Furthermore, Dütting et al. (2024) investigated online combinatorial allocation with few samples for bidders with combinatorial valuations. Collectively, this work demonstrates both the challenge and the feasibility of achieving near-optimal performance with minimal prior data.

Several studies are particularly pertinent to our setting. Bu et al. (2020) studied online pricing with offline data, but their approach relies on $n$ samples per product and assumes linear demand models.
Cheung and Li (2025) considered an episodic framework repeated over $H$ rounds, which uses multiple samples per episode and is confined to single-resource settings. In contrast, our framework requires only one sample per period and accommodates general, nonparametric reward structures in the multi-resource setting. This context underscores our core contribution: we develop a framework for multi-resource online allocation that operates under extreme data scarcity (using only a single sample per period) while accommodating non-stationary arrivals and initially unknown reward functions, advancing beyond the limitations of prior single-sample and episodic models. Additional discussion of further related work can be found in Section A.

2. Preliminaries

We consider an online resource allocation problem over a finite horizon with $m$ resources. Each resource $i \in [m]$ has an initial capacity $C_i \in \mathbb{R}_{\geq 0}$. The process takes place over $T$ discrete time periods. At each period $t \in [T]$, a single query arrives, which we denote as query $t$. Each query $t$ is characterized by a random resource consumption vector $a_t = (a_{t,1}, \dots, a_{t,m}) \in \mathbb{R}^m_{\geq 0}$, where $a_{t,i}$ represents the amount of resource $i$ consumed by serving query $t$ for all $i \in [m]$, and a random reward $r_t \in \mathbb{R}_{\geq 0}$ that denotes the reward collected by serving query $t$.

We assume that for each period $t \in [T]$, the pair $(r_t, a_t)$ is drawn independently from a non-stationary distribution denoted by $G_t(\cdot)$. Furthermore, we make the following structural assumption on the distribution $G_t(\cdot)$: queries belong to a finite number of types, where the resource consumption is fixed for each query type, while the reward is continuous. For query $t$, we define its type as $j_t$, which is drawn independently from a non-stationary distribution denoted by $P_t(\cdot)$. For each $t \in [T]$, the consumption vector $a_t$ takes values in a finite set $\mathcal{A} = \{a^1, \dots, a^n\}$. When query $t$ is of type $j$, its consumption is given by $a_t = a^j$. We denote $P_t(j) = \Pr[a_t = a^j]$ for each $j \in [n]$ and each period $t$. Conditional on the query type, or equivalently given $a_t = a^j$, the reward $r_t$ is independently and identically distributed according to a distribution $F_j(\cdot)$. Regarding the distribution $F_j(\cdot)$, we make the following assumption:

Assumption 1. For each $j \in [n]$, the distribution function $F_j(\cdot)$ has a density function $f_j(\cdot)$ supported on the interval $[\underline{r}_j, \bar{r}_j]$, where $0 < \underline{r}_j < \bar{r}_j < \infty$. Also, there exist two constants $0 < \alpha < \beta < \infty$ such that for each $j \in [n]$ and each $r \in [\underline{r}_j, \bar{r}_j]$, it holds that $\alpha \leq f_j(r) \leq \beta$.

After query $t$ arrives and its associated values $(r_t, a_t)$ are revealed, the decision-maker has to decide immediately and irrevocably whether or not to serve query $t$, according to an online policy. Note that query $t$ can only be served if, for every resource $i$, the remaining capacity is at least $a_{t,i}$. The decision-maker's objective is to maximize the total collected reward subject to the resource capacity constraints. For any online policy $\pi$, we define the decision variables as $\{x^\pi_t\}_{t=1}^{T}$, where $x^\pi_t$ is a binary variable indicating whether query $t$ is served.
A policy $\pi$ is feasible if, for every $t$, the decision $x^\pi_t$ depends solely on $F_{j_t}(\cdot)$ and the previous observations $\{(r_s, a_s)\}_{s=1}^{t}$, and if the following constraint is satisfied:
$$\sum_{t \in [T]} a_{t,i} \cdot x^\pi_t \leq C_i.$$
The total reward of a policy $\pi$ is given by $V^\pi_C(I) = \sum_{t \in [T]} r_t \cdot x^\pi_t$, where $I = \{(r_t, a_t)\}_{t=1}^{T}$ denotes the sample path of the problem instance and $C = (C_1, \dots, C_m)$.

Reward-Observed Samples versus Type-Only Samples. We assume that both the query type distribution $P_t(\cdot)$ and the conditional reward distribution $F_j(\cdot)$ are unknown and must be learned. For each time period $t$, we have access to only one single historical sample, whose type $\hat{j}_t$ is drawn independently from $P_t$. Reward-observed samples include both the arrival type $\hat{j}_t$ and its corresponding reward $\hat{r}_t$ drawn from $F_{\hat{j}_t}$. In contrast, for type-only samples, we have no historical reward information. For every $t \in [T]$, the historical samples $\hat{j}_t$ and $\hat{r}_t$ are independent of the problem-instance random variables $j_t$ and $r_t$.

Regret Minimization and Fluid Relaxation Benchmark. To establish a benchmark for a decision policy $\pi$, it is common practice to adopt the offline optimal policy, that is, the policy chosen when the decision-maker has full prior knowledge of $(r_t, a_t)$ for all $t \in [T]$. We denote the corresponding decisions as $\{x^{\mathrm{off}}_t\}_{t=1}^{T}$, which form the optimal solution to the following offline problem:
$$\begin{aligned} \max_{x} \quad & \sum_{t \in [T]} r_t \cdot x_t \\ \text{s.t.} \quad & \sum_{t \in [T]} a_{t,i} \cdot x_t \leq C_i, \quad i \in [m] \\ & x_t \in \{0, 1\}, \quad t \in [T] \end{aligned} \qquad (V^{\mathrm{off}}_C(I))$$
We define the performance loss as regret, the expected gap between the objective value of the offline optimum and that of our policy $\pi$:
$$\mathrm{Regret}(\pi) := \mathbb{E}_{I \sim G}\left[V^{\mathrm{off}}_C(I)\right] - \mathbb{E}_{I \sim G}\left[V^\pi_C(I)\right]$$
However, in our setting, computing this offline optimal policy is challenging due to the stochastic nature of the problem instance $I$. To address this difficulty, we introduce its fluid relaxation. Specifically, for each query type $j \in [n]$, let $d_j$ represent the number of arriving queries with $a_t = a^j$ and with $r_t$ drawn from $F_j$ in the problem instance. To remove the dependence on the problem instance, we take the expectation of $d_j$ over the instance and obtain the following fluid relaxation formulation:
$$\begin{aligned} \max_{x} \quad & \sum_{j \in [n]} \mathbb{E}_{I \sim G}[d_j] \cdot \mathbb{E}_{r \sim F_j}[r \cdot x_j(r)] \\ \text{s.t.} \quad & \sum_{j \in [n]} \mathbb{E}_{I \sim G}[d_j] \cdot a_{j,i} \cdot \mathbb{E}_{r \sim F_j}[x_j(r)] \leq C_i, \quad i \in [m] \\ & x_j(r) \in [0, 1], \quad j \in [n], \; r \in [\underline{r}_j, \bar{r}_j] \end{aligned} \qquad (V^{\mathrm{fld}}_C)$$
where $\mathbb{E}_{I \sim G}[d_j] = \sum_{t \in [T]} P_t(j)$. The relaxation immediately yields an upper bound on the offline optimum $V^{\mathrm{off}}_C(I)$, as formalized in the following lemma:

Lemma 1. It holds that $V^{\mathrm{fld}}_C \geq \mathbb{E}_{I \sim G}[V^{\mathrm{off}}_C(I)]$.

Therefore, the regret can be upper bounded by the expected gap between $V^{\mathrm{fld}}_C$ and $V^\pi_C(I)$:
$$\mathrm{Regret}(\pi) \leq V^{\mathrm{fld}}_C - \mathbb{E}_{I \sim G}\left[V^\pi_C(I)\right]$$
In $V^{\mathrm{fld}}_C$, $\mathbb{E}_{r \sim F_j}[x_j(r)]$ can be interpreted as the probability of serving type-$j$ queries as a function of the reward $r$. We denote $\{x^*_j(r), \forall j \in [n], \forall r \in [\underline{r}_j, \bar{r}_j]\}$ as an optimal solution to $V^{\mathrm{fld}}_C$.
We have the following lemma, which shows the threshold property of the optimal solution:

Lemma 2. For an optimal solution $\{x^*_j(r), \forall j \in [n], \forall r \in [\underline{r}_j, \bar{r}_j]\}$ to $V^{\mathrm{fld}}_C$, there exists a set of thresholds $\{\kappa_j\}_{j=1}^{n}$ such that it is optimal to set $x^*_j(r) = 1$ if and only if $r \geq \kappa_j$, and $x^*_j(r) = 0$ if and only if $r < \kappa_j$, for any $j \in [n]$.

Consequently, following Lemma 2, the fluid relaxation problem can be equivalently rewritten as:
$$\begin{aligned} \max_{q} \quad & \sum_{j \in [n]} \mathbb{E}_{I \sim G}[d_j] \cdot \int_{1-q_j}^{1} F^{-1}_j(u)\,\mathrm{d}u \\ \text{s.t.} \quad & \sum_{j \in [n]} \mathbb{E}_{I \sim G}[d_j] \cdot a_{j,i} \cdot q_j \leq C_i, \quad i \in [m] \\ & q_j \in [0, 1], \quad j \in [n] \end{aligned} \qquad (\bar{V}^{\mathrm{fld}}_C)$$
Here the decision variable $q_j$ represents the probability of serving a query of type $j$. We denote $\{q^*_j\}_{j=1}^{n}$ as one optimal solution to $\bar{V}^{\mathrm{fld}}_C$. To minimize our regret, we would ideally use this solution to design a feasible online policy. However, the solution of the fluid relaxation depends on the unknown distribution functions $\{P_t(\cdot)\}_{t=1}^{T}$ and $\{F_j(\cdot)\}_{j=1}^{n}$. Consequently, the theoretically optimal service probabilities $\{q^*_j\}_{j=1}^{n}$ cannot be directly implemented to derive a feasible online policy. Instead, we must adopt a data-driven approach that relies on historical observations to obtain reliable high-probability estimates of both $\mathbb{E}_{I \sim G}[d_j]$ and $F_j$ for every $j \in [n]$. We specify the estimation methods in Section 3.3.

3. Type-Dependent Quantile-Based Policy Framework

In this section, we introduce a type-dependent quantile-based policy for the decision-maker. The core idea of our approach is to translate a target service probability for each query type into a dynamic reward threshold, which is then used to make real-time decisions. This framework effectively decouples the problem: determining the optimal service probabilities $q_j$ becomes a fluid optimization task, while enforcing these probabilities online is accomplished via a carefully calibrated threshold rule.

We denote $q^\pi_j$ as the probability that a type-$j$ query will be served under policy $\pi$. Conceptually, this service probability serves as a key mechanism for balancing reward accumulation against resource consumption. Given the target service probabilities $\{q^\pi_j\}_{j=1}^{n}$ and the estimated reward cumulative distribution functions $\{\hat{F}_j(\cdot)\}_{j=1}^{n}$, we can calculate a corresponding reward threshold $M(\hat{F}_j, q^\pi_j)$ to decide whether to accept or reject the arriving query based on its observed reward. This threshold is defined as the quantile of the estimated distribution above which a fraction $q^\pi_j$ of the rewards lie. Upon arrival of a type-$j$ query, our policy accepts it if and only if the reward meets or exceeds the threshold and sufficient resources remain. Intuitively, by accepting only rewards above the $(1 - q^\pi_j)$-th quantile, the empirical service rate for type $j$ converges to the target $q^\pi_j$, provided our distribution estimate is accurate.

Our proposed meta-policy is formalized in Subroutine 1. It operates online and takes an exogenous threshold $M(\hat{F}_{j_t}, q^\pi_{j_t})$ as a key input at each time period $t$. The specific choice of this threshold distinguishes the different settings, which are detailed in Sections 4, 5, and 6, respectively.
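To make the threshold rule concrete, the following minimal Python sketch shows how a target service probability is converted into an acceptance threshold and applied per period, in the spirit of Subroutine 1 below. Representing each estimated CDF by an array of reward samples and using `np.quantile` as its inverse are illustrative assumptions, not the paper's prescribed implementation; a kernel estimator as in Section 3.3 could be substituted.

```python
import numpy as np

def quantile_threshold(reward_samples, q_target):
    """M(F_hat, q) = F_hat^{-1}(1 - q): the level above which a fraction
    q_target of the estimated reward mass lies."""
    return np.quantile(reward_samples, 1.0 - np.clip(q_target, 0.0, 1.0))

def meta_policy_step(r_t, j_t, thresholds, A, remaining):
    """One period of the meta quantile-based policy: accept iff the reward
    clears the type-specific threshold and every resource suffices."""
    a_j = A[j_t]                          # fixed consumption vector of type j_t
    if r_t >= thresholds[j_t] and np.all(remaining >= a_j):
        return True, remaining - a_j      # ACCEPT: capacities shrink
    return False, remaining               # REJECT: capacities unchanged
```

Because each type carries its own threshold, the accept/reject test is a single comparison per query, which mirrors the modularity discussed below.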
Subroutine 1: Meta Quantile-Based Policy

Input: Arrival type $j_t$ and reward $r_t$; quantile-based threshold $M(\hat{F}_{j_t}, q^\pi_{j_t})$; current consumption $a^{j_t}$; remaining budget $C_{t,i}$ for every $i \in [m]$.
Output: Decision: ACCEPT or REJECT the query; updated budgets $C_{t+1,i}$ for all $i \in [m]$.
1. if $r_t \geq M(\hat{F}_{j_t}, q^\pi_{j_t})$ and $C_{t,i} \geq a_{j_t,i}$ for all $i \in [m]$ then
2.   ACCEPT the query and record $\mathrm{cons}_i = a_{j_t,i}$ for every $i \in [m]$;
3. else
4.   REJECT the query and record $\mathrm{cons}_i = 0$ for every $i \in [m]$;
5. Update the remaining budget $C_{t+1,i} = C_{t,i} - \mathrm{cons}_i$ for every $i \in [m]$.

The elegance of this meta quantile-based policy lies in its modular design. The complexity of learning reward distributions and determining appropriate service probabilities is abstracted into the inputs $\hat{F}_j$ and $q^\pi_j$. In turn, the online decision rule reduces to a straightforward threshold comparison, making the policy highly practical for real-time deployment in latency-sensitive systems. Overall, our work represents a paradigm shift from integrated, constraint-driven learning to a modular, estimation-first architecture, significantly advancing both analytical tractability and empirical adaptability in online resource allocation.

3.1. $\tilde{O}(\sqrt{T})$ Regret by Static and Partially Adaptive Thresholds

Our quantile-based framework achieves $\tilde{O}(\sqrt{T})$ regret through two distinct approaches: (i) static thresholds derived from samples with observed rewards, and (ii) partially adaptive thresholds leveraging only query-type information. Both policies operate within the unified modular meta quantile-based policy (Subroutine 1), which decouples distribution estimation from real-time decision-making via type-dependent reward thresholds.

In Section 4, we assume that historical data contain both query types and the corresponding reward values. Before the online process begins, we use the single sample per period to construct kernel-based estimators $\hat{F}_j$ of the reward distribution of each arrival type. We then formulate an estimated fluid relaxation $\hat{V}_C(\hat{d})$ that replaces the unknown expected arrivals $\mathbb{E}_{I \sim G}[d_j]$ with historical counts $\hat{d}_j$ and the true distributions $F_j$ with their estimates $\hat{F}_j$. Solving this estimation problem via its Lagrangian yields target service probabilities $\hat{q}_j$, from which we calculate the corresponding acceptance thresholds $M(\hat{F}_j, q^\pi_j)$ by setting $q^\pi_j = \hat{q}_j$. These thresholds are passed to Subroutine 1 for real-time decisions. Although estimation errors cause the implemented solution to deviate from the true fluid optimum, we rigorously bound the resulting regret by analyzing the performance gap and the penalty due to possible constraint violations. Theorem 1 shows that our static threshold policy (Algorithm 3) attains a regret of $O(m \log T \sqrt{nT})$.

Our approach fundamentally departs from dual-based frameworks, achieving superior theoretical performance through a more direct sample-coupled analysis and a decoupled design. Prior work on exponential pricing (Ghuge et al. 2025) relies heavily on no-regret properties with respect to the zero price vector, requiring intricate concentration analysis to control estimation errors in prefix consumption. Similarly, robust pacing algorithms (Balseiro et al. 2023a) build upon online convex optimization frameworks where regret bounds translate to solution quality guarantees.
In contrast, our method leverages coupling arguments between samples and realizations. This decoupled design not only enhances practical deployability but also contributes to a tighter regret bound. Specifically, our static threshold method achieves a $\tilde{O}(\sqrt{T})$ regret bound, which outperforms the $\tilde{O}(T^{5/6})$ rate of Ghuge et al. (2025) when resources scale linearly with the time horizon, and matches the $\tilde{O}(\sqrt{T})$ bound of Balseiro et al. (2023a) for the single-resource setting.

A further distinction lies in decision granularity. Conventional dual-based methods employ a single global price vector coupling all resources and types, obscuring per-query logic and entangling decisions with global state. Our framework, by contrast, enables transparent, type-specific decisions: each query type is managed independently by its own quantile threshold. This fundamental shift not only eliminates the need for cumbersome global estimation but also introduces greater flexibility, as each query type can be processed independently through its own quantile-based threshold without interference from others.

In Section 5, we consider a more difficult setting where historical samples contain only query types and no rewards. We first establish an impossibility result (Theorem 2): without further assumptions, no online policy can achieve sublinear regret, as illustrated by a counterexample in which non-stationary arrivals force a linear regret $\Omega(T)$. To enable learning, we introduce a mild minimum-arrival-probability assumption (Assumption 3), guaranteeing that each type appears with probability at least $\gamma > 0$ per period. Under this condition, we design a partially adaptive threshold scheme. Starting from a uniform prior, each time a query of type $j$ arrives, we update the estimator $\hat{F}_{j,t}$ using the newly observed reward. At period $t$, we solve an online version of the estimated relaxation $\hat{V}_{t,C}(\hat{d})$ using the current estimates $\hat{F}_{j,t}$ and historical counts $\hat{d}_j$, producing time-varying target service probabilities $q_{j,t}$ and corresponding thresholds $M(\hat{F}_{j,t}, q_{j,t})$. The resulting online policy (Algorithm 4) continuously refines distribution estimates while respecting resource constraints. Remarkably, Theorem 3 confirms that the same $O(m \log T \sqrt{nT})$ regret bound holds, demonstrating that online reward learning need not compromise theoretical performance when minimal arrival regularity is ensured.

3.2. Logarithmic Regret by Fully Adaptive Thresholds

To surpass the $\sqrt{T}$ barrier, in Section 6 we introduce a fully adaptive resolving policy that re-optimizes the remaining problem at each decision epoch and incorporates a careful rounding step. This method is also designed for type-only samples under the minimum-arrival-probability condition (Assumption 4). The core idea is to move from a static or partially adaptive fluid relaxation to a semi-fluid relaxation that explicitly accounts for the remaining horizon and current resource capacities. At the beginning of each period $t$, we compute the estimated future arrivals $\hat{b}_{j,t}$ from historical samples, together with the current reward estimates $\hat{F}_{j,t}$. Using these, we formulate a per-period estimated problem $\hat{V}_{t,c}(I_t)$ with remaining capacities $c_t$. Solving this problem via its Lagrangian yields candidate service probabilities $\hat{q}_{j,t}$.
To ensure robustness, we introduce a rounding procedure when setting the acceptance threshold: (i) if $\hat{q}_{j_t,t}$ is very high (above $1 - 2\kappa(\frac{\log T}{\sqrt{T-t+1}} + \frac{\log T}{\sqrt{t}})$), we set the threshold to the lower bound $\underline{r}_{j_t}$, effectively accepting all queries of that type; (ii) if $\hat{q}_{j_t,t}$ is very low (below $2\kappa(\frac{\log T}{\sqrt{T-t+1}} + \frac{\log T}{\sqrt{t}})$), we set the threshold to $\bar{r}_{j_t} + 1$, effectively rejecting all queries of that type; (iii) in the intermediate range, we use the quantile-based threshold $\hat{F}^{-1}_{j_t,t}(1 - \hat{q}_{j_t,t})$.

This rounding policy ensures that we can always construct a feasible solution to our benchmark. By analyzing the instantaneous regret at each step and aggregating over the horizon, we prove in Theorem 4 that the fully adaptive threshold policy (Algorithm 5) achieves a regret of $O((\log T)^3)$. This represents an exponential improvement over the typical $\tilde{O}(\sqrt{T})$ rates in the prior dual-based literature. The decoupled design also allows us to employ aggressive online learning of the distributions, because reward estimation is separated from resource budgeting; in dual-based approaches, the interleaving of dual updates and reward estimation makes it difficult to control the error propagation needed for logarithmic regret. To conclude, our poly-logarithmic regret bound bridges a significant theoretical gap in online resource allocation. It demonstrates that, even with extreme data scarcity (one sample per period) and no prior reward knowledge, near-optimal performance is attainable with only logarithmic growth in regret.

3.3. Estimation of Arrival Distributions

In this subsection, we present standard methods for estimating query arrival numbers and reward distributions, which serve as direct inputs to our modular framework.

Estimation of Query Numbers. Given the non-stationary nature of the environment, estimating per-period query type distributions is not meaningful. Instead, we focus on estimating the total number of query arrivals of each type over the entire time horizon. For each period $t$, we have one single historical sample $\hat{j}_t \sim P_t$. Since the query types are finite, i.e., $\hat{j}_t \in [n]$, we denote the number of type-$j$ query arrivals in the historical observations as $\hat{d}_j = \sum_{t \in [T]} \mathbb{1}\{\hat{j}_t = j\}$. It follows directly that $\sum_{j \in [n]} \hat{d}_j = T$, as we have exactly $T$ samples in total. We make the following assumption:

Assumption 2. For any historical instance with $\hat{d} = (\hat{d}_1, \dots, \hat{d}_n)$, at least one query of each type $j$ arrives, i.e., $\Pr[\hat{d}_j \geq 1] = 1$ for all $j \in [n]$.

For a given problem instance $I$, the actual number of type-$j$ query arrivals is $d_j = \sum_{t \in [T]} \mathbb{1}\{j_t = j\}$. The following lemma shows that $\hat{d}_j$ serves as a reliable estimate of $\mathbb{E}_{I \sim G}[d_j]$; we offer a detailed discussion in Section B.

Lemma 3. Define $\mu_j = \mathbb{E}_{I \sim G}[d_j]$. With probability at least $1 - \delta$, we have
$$\sum_{j \in [n]} |\hat{d}_j - \mu_j| \leq \sqrt{2nT \log(2/\delta)}, \qquad \sum_{j \in [n]} \frac{|\hat{d}_j - \mu_j|^2}{\mu_j} \leq n \left(\log(2/\delta)\right)^2$$

Estimation of Reward Distribution. Our quantile-based framework is compatible with general reward distribution estimation methods.
In our analysis, we adopt a kernel-based estimation approach. Let the kernel probability density function $k(x)$ satisfy the following properties: (i) $k(x)$ is a non-negative, symmetric density function; (ii) $k(x)$ has compact support on $[-1, 1]$, i.e., $k(x) = 0$ when $|x| > 1$; (iii) $\int k(x)\,\mathrm{d}x = 1$, $\int x k(x)\,\mathrm{d}x = 0$, and $\int |x| k(x)\,\mathrm{d}x < +\infty$. We define $K(x) = \int_{-\infty}^{x} k(u)\,\mathrm{d}u$ as the cumulative distribution function of $k(x)$. Therefore $K(x)$ is symmetric: $K(-x) = 1 - K(x)$ and $K(0) = \frac{1}{2}$.

Assume that we have $N$ i.i.d. samples $\{X_1, \dots, X_N\}$ drawn from a reward distribution function $F(\cdot)$, where $F(\cdot)$ is supported on $[a, b]$ with $0 < a < b < \infty$, and the probability density function $f(\cdot)$ satisfies $0 < \alpha \leq f(x) \leq \beta < \infty$ for all $x \in [a, b]$. Define the kernel estimate as follows:
$$\hat{F}(x) = \frac{1}{N} \sum_{i=1}^{N} K\left(\frac{x - X_i}{h_N}\right) \qquad (1)$$
where $h_N > 0$ is a bandwidth parameter to be specified. We then have the following lemma, whose proof is detailed in Section B.

Lemma 4. Given $N$ i.i.d. samples $\{X_1, \dots, X_N\}$ drawn from a reward distribution function $F(\cdot)$, set $h_N = N^{-1/2}$ and let $\hat{F}(x)$ be defined as in (1). With probability at least $1 - \delta$, the uniform estimation error satisfies
$$\sup_{x \in [a,b]} |F(x) - \hat{F}(x)| \leq O\left(\sqrt{\frac{\log(1/\delta)}{N}}\right)$$

4. Static Thresholds with Reward-Observed Samples

In this section, we assume that the historical samples (reward-observed samples) also contain the corresponding reward values $\hat{r}_t$, each drawn from $F_{\hat{j}_t}$. This allows us to compute, with high probability, an estimator of the reward distribution function of each type $j$ prior to the decision process. Specifically, we use the $\hat{d}_j$ historical samples associated with each type $j$ to construct an estimate of the reward distribution $F_j(\cdot)$ with the kernel method stated in Section 3.3.

Threshold Computation and Algorithm Design: We now develop the estimated threshold that will be used in Subroutine 1. Given historical counts $\hat{d} = (\hat{d}_1, \dots, \hat{d}_n)$, we define the estimated version of $\bar{V}^{\mathrm{fld}}_C$:
$$\begin{aligned} \max_{q} \quad & \sum_{j \in [n]} \hat{d}_j \cdot \int_{1-q_j}^{1} \hat{F}^{-1}_j(u)\,\mathrm{d}u \\ \text{s.t.} \quad & \sum_{j \in [n]} \hat{d}_j \cdot a_{j,i} \cdot q_j \leq C_i, \quad i \in [m] \\ & q_j \in [0, 1], \quad j \in [n] \end{aligned} \qquad (\hat{V}_C(\hat{d}))$$
We can solve the estimation problem $\hat{V}_C(\hat{d})$ and obtain its optimal solution:
$$\hat{q}_j(\hat{\lambda}) = 1 - \hat{F}_j\left(\sum_{i \in [m]} \hat{\lambda}_i a_{j,i}\right) \qquad (2)$$
where $\hat{\lambda}$ is the optimal dual variable of the Lagrangian function of $\hat{V}_C(\hat{d})$. Specifically, we summarize Subroutine 2, which serves as a solver for the estimation problem. In the context of our policy, the most straightforward way to implement the service probability is to set $q^\pi_j := \hat{q}_j(\hat{\lambda})$. Based on the preceding analysis, we can calculate the estimated threshold as follows:
$$M(\hat{F}_j, q^\pi_j) := \hat{F}^{-1}_j(1 - q^\pi_j)$$
Our static threshold policy is formalized in Algorithm 3. The algorithm operates in two phases: an offline pre-processing phase followed by an online decision phase. In the offline phase, the algorithm initializes the resource capacities and uses historical data to estimate the query arrival numbers and reward distributions. It then constructs and solves an estimated relaxation problem via Subroutine 2 to obtain target service probabilities for each query type. These probabilities are subsequently converted into type-specific reward thresholds.
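Step 3 of Algorithm 3 below instantiates the kernel estimator of equation (1). As a hedged illustration, the sketch uses the Epanechnikov kernel $k(u) = \frac{3}{4}(1-u^2)$ on $[-1,1]$, which satisfies properties (i)-(iii), together with the bandwidth $h_N = N^{-1/2}$ from Lemma 4; the specific kernel and the bisection-based inversion are our assumptions, since any kernel with these properties works.

```python
import numpy as np

def kernel_cdf(samples):
    """F_hat(x) = (1/N) * sum_i K((x - X_i) / h_N) with h_N = N^{-1/2},
    where K is the integrated Epanechnikov kernel (cf. equation (1))."""
    X = np.asarray(samples, dtype=float)
    h = len(X) ** -0.5                        # bandwidth from Lemma 4

    def K(u):                                 # K(u) = 0 for u <= -1, 1 for u >= 1
        u = np.clip(u, -1.0, 1.0)
        return 0.75 * (u - u ** 3 / 3.0) + 0.5

    return lambda x: float(np.mean(K((x - X) / h)))

def inverse_cdf(F_hat, q, lo, hi, tol=1e-9):
    """Smallest x in [lo, hi] with F_hat(x) >= q, found by bisection on the
    monotone estimate; used for thresholds of the form F_hat^{-1}(1 - q)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F_hat(mid) < q else (lo, mid)
    return hi
```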
Subroutine 2: Empirical Quantile Thresholds Solver

Input: Estimated query numbers $\{\hat{d}_j\}_{j=1}^{n}$; estimated reward distribution functions $\{\hat{F}_j(\cdot)\}_{j=1}^{n}$; resource consumption vectors $a^j$ for every $j \in [n]$; initial resource capacity $C$.
Output: Optimal solution $\{\hat{q}_j(\hat{\lambda})\}_{j=1}^{n}$.
1. Formulate the Lagrangian function as $\hat{L}(q, \lambda) = \sum_{j \in [n]} \hat{d}_j \int_{1-q_j}^{1} \hat{F}^{-1}_j(u)\,\mathrm{d}u - \sum_{i \in [m]} \lambda_i \left(\sum_{j \in [n]} \hat{d}_j a_{j,i} q_j - C_i\right)$;
2. Solve for the dual variable $\hat{\lambda} = \arg\min_{\lambda} \left\{\max_{q} \hat{L}(q, \lambda)\right\}$;
3. Compute the target service probabilities $\hat{q}_j(\hat{\lambda}) = 1 - \hat{F}_j\left(\sum_{i \in [m]} \hat{\lambda}_i a_{j,i}\right)$.

Algorithm 3: Static Threshold Policy

Input: Historical samples $\{(\hat{j}_t, \hat{r}_t)\}_{t=1}^{T}$; resource capacities $C_i \in \mathbb{R}_{\geq 0}$ for every $i \in [m]$; resource consumption vectors $a^j$ for every type $j \in [n]$.
1. Initialize remaining capacities $C_{1,i} = C_i$ for every $i \in [m]$;
2. Count the number of samples of each type $j$ as $\hat{d}_j$;
3. Estimate the distribution of type-$j$ rewards $\hat{F}_j$ from the $\hat{d}_j$ samples using kernel estimation;
4. Construct the estimated relaxation problem $\hat{V}_C(\hat{d})$;
5. Follow Subroutine 2 to compute the service probabilities $q^\pi_j$ as in (2) for every $j \in [n]$;
6. Compute the threshold $M(\hat{F}_j, q^\pi_j) = \hat{F}^{-1}_j(1 - q^\pi_j)$ for every $j \in [n]$;
7. for $t = 1, \dots, T$ do
8.   Observe arrival type $j_t$ and reward $r_t$;
9.   Call Subroutine 1 with input $(j_t, r_t, M(\hat{F}_{j_t}, q^\pi_{j_t}), a^{j_t})$ and $C_{t,i}$ for every $i \in [m]$.
10. end

In the online phase, for each arriving query, the policy compares its observed reward against the pre-computed threshold. The final accept/reject decision is made by our meta quantile-based policy (Subroutine 1), which also accounts for the current remaining resource budget. We state our main result in the following theorem:

Theorem 1. With reward-observed samples, under Assumptions 1 and 2, the regret of the Static Threshold Policy (Algorithm 3) is at most
$$(\bar{r} + \bar{r} m a_{\max})\left(\frac{4\beta + 2\beta k_1}{\alpha} + 2 + \bar{r}\right) \log T \sqrt{nT} = O(m \log T \sqrt{nT})$$
where $m$ is the number of resource constraints, $n$ is the number of arrival query types, $\bar{r} = \max_j \bar{r}_j$, $a_{\max} = \max_{j,i} a_{j,i}$, and $\alpha$, $\beta$, $k_1$ are all constants.

Performance Analysis: We now outline the analysis of the performance loss incurred by our approach. When accurate estimators are available for both $\mathbb{E}_{I \sim G}[d_j]$ and $F_j(\cdot)$, the estimated solution $\{\hat{q}_j(\hat{\lambda})\}_{j=1}^{n}$ can be used to approximate the true primal optimum. However, in practice, estimation errors inevitably introduce a performance gap between our policy and the theoretical benchmark. To assess the robustness of our approach, we now examine the optimal benchmark solution and demonstrate that the resulting regret remains bounded. Denote $\mu_j$ as the expectation of $d_j$. We define the Lagrangian function of the primal problem $\bar{V}^{\mathrm{fld}}_C$:
$$L(q, \lambda) = \sum_{j \in [n]} \mu_j \int_{1-q_j}^{1} F^{-1}_j(u)\,\mathrm{d}u - \sum_{i \in [m]} \lambda_i \left(\sum_{j \in [n]} \mu_j a_{j,i} q_j - C_i\right)$$
and its optimal solution can be written as
$$q^*_j(\lambda^*) = 1 - F_j\left(\sum_{i \in [m]} \lambda^*_i a_{j,i}\right) \qquad (3)$$
where $\lambda^*$ is the optimal dual variable of the Lagrangian function.
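For intuition, Subroutine 2's dual step admits the following sketch. Because the inner maximization has the closed form (2), the dual objective reduces to $\hat{g}(\lambda) = \sum_j \hat{d}_j \, \mathbb{E}_{r \sim \hat{F}_j}[(r - \lambda^\top a^j)^+] + \lambda^\top C$, which is convex and can be minimized numerically. Representing each $\hat{F}_j$ by its reward samples and relying on an off-the-shelf bounded minimizer are our simplifying assumptions, not the paper's specification.

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(d_hat, A, C, reward_samples):
    """Sketch of Subroutine 2: min_{lambda >= 0} max_q L_hat(q, lambda).

    d_hat          : length-n estimated arrival counts
    A              : (n, m) matrix whose row j is the consumption vector a^j
    C              : length-m capacity vector
    reward_samples : list of n arrays standing in for each F_hat_j
    """
    def g(lam):                                # dual objective g(lambda)
        prices = A @ lam                       # lambda^T a^j for every type j
        val = float(lam @ C)
        for j, r in enumerate(reward_samples):
            val += d_hat[j] * np.mean(np.maximum(r - prices[j], 0.0))
        return val

    lam = minimize(g, x0=np.zeros(len(C)), bounds=[(0.0, None)] * len(C)).x
    prices = A @ lam
    # Closed-form inner solution (2): q_j(lambda) = 1 - F_hat_j(lambda^T a^j).
    q = np.array([1.0 - np.mean(r <= prices[j])
                  for j, r in enumerate(reward_samples)])
    return lam, q
```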
We can show that the performance loss is a function of the difference between the optimal benchmark solution $\{q^*_j(\lambda^*)\}_{j=1}^{n}$ and the solution employed by our algorithm. It is important to note that, due to the budget constraints inherent in the problem, directly applying the estimated solution in our algorithm can also lead to constraint violations in practice. This arises because the constraints in the primal problem (based on true parameters) differ from those in the estimation problem (based on sampled data). To address this issue, we formally quantify the penalty of such constraint violations through the term $V(\hat{q}(\hat{\lambda}))$, defined in Equation (10). This term captures the expected excess resource consumption incurred by implementing a potentially infeasible estimated solution. Importantly, we can demonstrate that the impact of this constraint violation is manageable within our framework; specifically, it contributes only a sublinear term to the overall regret.

To conclude, the result is formally stated in the proof of Theorem 1 (see Section C), which guarantees the near-optimal performance of our static threshold policy despite the inherent estimation error. The theorem essentially shows that although perfect feasibility cannot be ensured when operating with estimated parameters, the resulting performance loss scales favorably with the sample size and does not dominate the long-term average reward.

5. Partially Adaptive Thresholds with Type-Only Samples

When we have type-only samples, historical reward information is unavailable, necessitating online learning of the reward distributions. Consequently, our goal is to sequentially update our estimates of the reward distribution functions and use the evolving solution to approximate the primal optimal solution at each step. We begin by presenting a theorem that establishes an impossibility result:

Theorem 2. Under Assumptions 1 and 2, no online policy can beat the $\Omega(T)$ regret lower bound with type-only samples.

We prove Theorem 2 by providing the following counterexample:

Example 1. Consider a setting with one resource constraint $m = 1$ and two query types $n = 2$. The resource capacity is $C = \frac{T}{2}$ and each query consumes one unit of resource, i.e., $a^1 = a^2 = 1$. During the first half of the time horizon, only type 1 queries arrive, so $P_t(1) = 1$, $P_t(2) = 0$ for $t = 1, 2, \dots, \frac{T}{2}$. In the second half, only type 2 queries arrive, so $P_t(1) = 0$, $P_t(2) = 1$ for $t = \frac{T}{2} + 1, \dots, T$. We consider two scenarios. In Scenario 1, the reward distribution of type 1 is uniform on $[1, 2]$ (denoted $U[1, 2]$), and that of type 2 is $U[0, 1]$. In Scenario 2, type 1 rewards follow $U[1, 2]$, while type 2 rewards follow $U[2, 3]$. Thus, the benchmark policy (with full arrival information) accepts all high-reward queries: type 1 in Scenario 1 and type 2 in Scenario 2. Since the total resource capacity equals the number of arrivals of each high-reward type, the expected benchmark reward is $\frac{3T}{4}$ in Scenario 1 and $\frac{5T}{4}$ in Scenario 2. Let $T_1(\pi)$ and $T_2(\pi)$ denote the expected amount of resource consumed by policy $\pi$ during the first $\frac{T}{2}$ time periods in Scenario 1 and Scenario 2, respectively. Equivalently, $T_1(\pi)$ and $T_2(\pi)$ represent the expected numbers of queries accepted by policy $\pi$ in each scenario over that interval.
Then the expected reward collected by policy $\pi$ in the two scenarios can be written as:
$$R^1_T(\pi) = \frac{3}{2} T_1(\pi) + \frac{1}{2}\left(\frac{T}{2} - T_1(\pi)\right) = \frac{T}{4} + T_1(\pi)$$
$$R^2_T(\pi) = \frac{3}{2} T_2(\pi) + \frac{5}{2}\left(\frac{T}{2} - T_2(\pi)\right) = \frac{5T}{4} - T_2(\pi)$$
Therefore, the regret of policy $\pi$ is $\frac{T}{2} - T_1(\pi)$ in Scenario 1 and $T_2(\pi)$ in Scenario 2. Moreover, because the policy $\pi$ can only depend on historical samples, its decisions during the first half must be identical in both scenarios; hence, we must have $T_1(\pi) = T_2(\pi)$. Consequently, we obtain
$$\mathrm{Regret}(\pi) \geq \max\left\{\frac{T}{2} - T_1(\pi), \; T_2(\pi)\right\} \geq \frac{T}{4} = \Omega(T)$$

The example above illustrates that without knowledge of future arrivals, it is impossible to achieve sublinear regret in both scenarios simultaneously. If the policy conserves too much resource in the first half, then in Scenario 1 it misses the opportunity to accept profitable type 1 arrivals early on, which cannot be recovered. Conversely, if it consumes too much resource in the first half, then in Scenario 2 it lacks sufficient capacity to accept the profitable type 2 arrivals later. Consequently, to enable online learning, we must impose a minimal arrival probability assumption:

Assumption 3. There exists a constant $\gamma > 0$ such that the type arrival probability satisfies $P_t(j) \geq \gamma$ for every $j \in [n]$ and $t \in [T]$.

Threshold Computation and Algorithm Design: We now develop the estimated threshold that will be used in Subroutine 1. The only difference from Section 4 is that we apply a partially adaptive method to update our estimated reward distributions during the implementation on the problem instance. Given historical counts $\hat{d} = (\hat{d}_1, \dots, \hat{d}_n)$, for each time period $t$ we define the online estimation problem:
$$\begin{aligned} \max_{q} \quad & \sum_{j \in [n]} \hat{d}_j \cdot \int_{1-q_j}^{1} \hat{F}^{-1}_{j,t}(u)\,\mathrm{d}u \\ \text{s.t.} \quad & \sum_{j \in [n]} \hat{d}_j \cdot a_{j,i} \cdot q_j \leq C_i, \quad i \in [m] \\ & q_j \in [0, 1], \quad j \in [n] \end{aligned} \qquad (\hat{V}_{t,C}(\hat{d}))$$
where $\hat{F}_{j,t}$ is the estimator of $F_j$ at time period $t$. Following Subroutine 2, with $\hat{F}_{j,t}$ substituted for $\hat{F}_j$ at each time period, we can solve for the optimal solution to $\hat{V}_{t,C}(\hat{d})$:
$$q_{j,t}(\lambda_t) = 1 - \hat{F}_{j,t}\left(\sum_{i \in [m]} \lambda_{t,i} a_{j,i}\right) \qquad (4)$$
where $\lambda_t$ is the optimal dual variable of the Lagrangian function of $\hat{V}_{t,C}(\hat{d})$. Setting $q^\pi_{j,t} := q_{j,t}(\lambda_t)$ for each time period $t$, we can calculate the estimated threshold as follows:
$$M(\hat{F}_{j,t}, q^\pi_{j,t}) := \hat{F}^{-1}_{j,t}(1 - q^\pi_{j,t})$$
We now present our partially adaptive threshold policy in Algorithm 4. The policy begins by initializing the remaining resource capacities and estimating the query arrival numbers from the input data, while initializing all reward distribution estimates as uniform. During the online execution, for each arriving query, we observe its type and reward value, and immediately use this new sample to update the reward distribution estimate for that specific type. We then reconstruct and solve the estimated relaxation problem based on the latest distributions to obtain updated optimal service probabilities for the current period. Finally, we compute a new adaptive threshold, which is passed to the meta quantile-based policy to guide the immediate decision. This process allows the thresholds to evolve in response to newly observed reward information.
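A minimal sketch of the per-period loop just described (and formalized in Algorithm 4 below). Storing the raw rewards per type, seeding each type with its known support endpoints to mimic the uniform prior, and reusing the `solve_dual` sketch above are illustrative choices, not the paper's prescribed implementation.

```python
import numpy as np

class PartialAdaptivePolicy:
    """Per-period loop of Algorithm 4: update F_hat_{j,t} for the arriving
    type, re-solve the estimated relaxation, and emit a fresh threshold."""

    def __init__(self, d_hat, A, C, r_bounds):
        self.d_hat, self.A, self.C = d_hat, A, C
        # Seed each type with its support endpoints as a crude uniform prior.
        self.rewards = [np.array([lo, hi], dtype=float) for lo, hi in r_bounds]

    def threshold(self, j_t, r_t):
        # (1) Update only the arriving type's estimate with the new reward.
        self.rewards[j_t] = np.append(self.rewards[j_t], r_t)
        # (2) Re-solve V_hat_{t,C}(d_hat) with the current estimates.
        _, q = solve_dual(self.d_hat, self.A, self.C, self.rewards)
        # (3) Threshold M(F_hat_{j,t}, q_{j,t}) = F_hat^{-1}(1 - q_{j,t}).
        return np.quantile(self.rewards[j_t],
                           1.0 - float(np.clip(q[j_t], 0.0, 1.0)))
```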
The main result of this section is formalized as follows:

Algorithm 4: Partially Adaptive Threshold Policy

Input: Historical samples $\{\hat{j}_t\}_{t=1}^{T}$; resource capacities $C_i \in \mathbb{R}_{\geq 0}$ for every $i \in [m]$; resource consumption vectors $a^j$ for every type $j \in [n]$.
1. Initialize remaining capacities $C_{1,i} = C_i$ for every $i \in [m]$;
2. Count the number of samples of each type $j$ as $\hat{d}_j$;
3. Initialize the distribution estimates $\hat{F}_{j,0}$ as uniform distributions for all $j \in [n]$;
4. for $t = 1, \dots, T$ do
5.   Observe arrival type $j_t$ and reward $r_t$;
6.   Update the estimate $\hat{F}_{j_t,t}$ using the new sample $r_t$; set $\hat{F}_{j,t} = \hat{F}_{j,t-1}$ for all $j \neq j_t$;
7.   Construct the estimated relaxation problem $\hat{V}_{t,C}(\hat{d})$;
8.   Follow Subroutine 2 to compute the service probabilities $q^\pi_{j,t}$ as in (4) for every $j \in [n]$;
9.   Compute the current threshold $M(\hat{F}_{j_t,t}, q^\pi_{j_t,t}) = \hat{F}^{-1}_{j_t,t}(1 - q^\pi_{j_t,t})$;
10.  Call Subroutine 1 with input $(j_t, r_t, M(\hat{F}_{j_t,t}, q^\pi_{j_t,t}), a^{j_t})$ and $C_{t,i}$ for every $i \in [m]$.
11. end

Theorem 3. With type-only samples, under Assumptions 1, 2, and 3, the regret of the Partially Adaptive Threshold Policy (Algorithm 4) is at most
$$(\bar{r} + \bar{r} m a_{\max})\left(\frac{4\beta}{\alpha} + \frac{8\beta k_1}{\alpha\gamma} + \frac{16}{\gamma} + 2\bar{r}\right) \log T \sqrt{nT} = O(m \log T \sqrt{nT})$$
where $m$ is the number of resource constraints, $n$ is the number of arrival query types, $\bar{r} = \max_j \bar{r}_j$, $a_{\max} = \max_{j,i} a_{j,i}$, and $\alpha$, $\beta$, $\gamma$, $k_1$ are all constants.

This theorem demonstrates that, under the minimal arrival probability assumption and despite the absence of prior reward knowledge, the policy achieves sublinear regret, effectively balancing exploration with exploitation while maintaining near-optimal resource utilization over time.

Performance Analysis: The regret of our partially adaptive threshold policy can still be expressed as a function of the difference between the optimal benchmark solution in (3) and the solution employed by our algorithm. As in Section 4, we also need to account for potential constraint violations. We quantify the impact of this infeasibility by $V(\hat{q}(\lambda_t))$, defined in Equation (16), which represents the penalty associated with constraint violation at step $t$. By implementing the online learning and decision-making procedure outlined in Algorithm 4, we can manage the dual challenges of distributional learning and constraint satisfaction. The performance of this policy is characterized by a result analogous to the reward-observed sample case, formally presented in Theorem 3 with a detailed proof in Section D.

6. Fully Adaptive Thresholds with Type-Only Samples

In this section, we introduce a poly-logarithmic regret policy based on a fully adaptive threshold method using type-only samples. We consider the same setting as Section 5, where the historical samples include no reward information. We state the following minimal-arrival-probability assumption:

Assumption 4. There exists a constant $\gamma > 0$ such that the type arrival probability satisfies $P_t(j) \geq \gamma$ for every $j \in [n]$ and $t \in [T]$. Here $\gamma$ is known to the decision-maker.

Previous literature rigorously establishes that static policies, which solve the relaxation problem only once (from period 1 to $T$) and fix decisions thereafter, incur a regret lower bound of $\Omega(\sqrt{T})$ (see Theorem 3 in Arlotto and Gurvich (2019)), making logarithmic performance unattainable.
Consequently, adaptively re-solving the problem (from period $t$ to $T$) at each time period $t$ is necessary to circumvent the fundamental $\sqrt{T}$ regret limitation and realize logarithmic-order guarantees, as validated by modern frameworks. In this paper, our static and partially adaptive policies rely on solutions computed from the total initial resource capacity. By contrast, the fully adaptive policy studied in this section repeatedly re-solves the problem using the remaining capacity at each period, thereby enabling dynamic adjustment to real-time resource availability.

Algorithm Design: We give a general description of our fully adaptive threshold approach. We define the random variable $\hat{b}_{j,t}$ as the number of type-$j$ query arrivals from period $t$ to $T$ in the historical samples. Denote by $c_t = (c_{t,1}, \dots, c_{t,m}) \in \mathbb{R}^m_{\geq 0}$ any vector of remaining resource capacities at the beginning of period $t$. Then, on the remaining problem instance $I_t = \{(r_s, a_s)\}_{s=t}^{T}$, we consider the following estimation problem starting from time $t$ given the remaining capacity $c$:
$$\begin{aligned} \max_{q} \quad & \sum_{j \in [n]} \hat{b}_{j,t} \cdot \int_{1-q_j}^{1} \hat{F}^{-1}_{j,t}(u)\,\mathrm{d}u \\ \text{s.t.} \quad & \sum_{j \in [n]} \hat{b}_{j,t} \cdot a_{j,i} \cdot q_j \leq c_i, \quad i \in [m] \\ & q_j \in [0, 1], \quad j \in [n] \end{aligned} \qquad (\hat{V}_{t,c}(I_t))$$
where $\hat{F}_{j,t}$ is the estimator of $F_j$ at time period $t$. With the new inputs $\{\hat{b}_{j,t}\}_{j=1}^{n}$, $\{\hat{F}_{j,t}(\cdot)\}_{j=1}^{n}$, and $c$ to Subroutine 2, we can solve for the optimal solution $\{\hat{q}_{j,t}\}_{j=1}^{n}$ to $\hat{V}_{t,c}(I_t)$. Our poly-logarithmic regret policy (Algorithm 5) is formalized as follows:

Algorithm 5: Fully Adaptive Threshold Policy

Input: Historical samples $\{\hat{j}_t\}_{t=1}^{T}$; resource capacities $C_i \in \mathbb{R}_{\geq 0}$ for every $i \in [m]$; resource consumption vectors $a^j$ for every type $j \in [n]$; reward bounds $\bar{r}_j$ and $\underline{r}_j$ for every type $j \in [n]$; constant $\kappa$ defined in Theorem 4.
1. Initialize remaining capacities $C_{1,i} = C_i$ for every $i \in [m]$;
2. Initialize the distribution estimates $\hat{F}_{j,0}$ as uniform distributions for all $j \in [n]$;
3. for $t = 1, \dots, T$ do
4.   Observe arrival type $j_t$ and reward $r_t$;
5.   Update the estimate $\hat{F}_{j_t,t}$ using the new sample $r_t$; set $\hat{F}_{j,t} = \hat{F}_{j,t-1}$ for all $j \neq j_t$;
6.   Compute $\hat{b}_{j,t}$ for each type $j$;
7.   Construct the estimated relaxation problem $\hat{V}_{t,c}(I_t)$;
8.   Follow Subroutine 2 to compute the service probabilities $\hat{q}_{j,t}$ for every $j \in [n]$;
9.   if $\hat{q}_{j_t,t} \geq 1 - 2\kappa\left(\frac{\log T}{\sqrt{T-t+1}} + \frac{\log T}{\sqrt{t}}\right)$ then
10.    Set $M(\hat{F}_{j_t,t}, q^\pi_{j_t,t}) = \underline{r}_{j_t}$;
11.  else if $\hat{q}_{j_t,t} \leq 2\kappa\left(\frac{\log T}{\sqrt{T-t+1}} + \frac{\log T}{\sqrt{t}}\right)$ then
12.    Set $M(\hat{F}_{j_t,t}, q^\pi_{j_t,t}) = \bar{r}_{j_t} + 1$;
13.  else
14.    Set $M(\hat{F}_{j_t,t}, q^\pi_{j_t,t}) = \hat{F}^{-1}_{j_t,t}(1 - \hat{q}_{j_t,t})$;
15.  end
16.  Call Subroutine 1 with input $(j_t, r_t, M(\hat{F}_{j_t,t}, q^\pi_{j_t,t}), a^{j_t})$ and $C_{t,i}$ for every $i \in [m]$.
17. end

To be specific, the policy initializes the remaining resource capacities and sets all reward distribution estimates to a uniform distribution. Unlike the previous methods, it does not rely on historical data to pre-estimate query arrival numbers over the whole time horizon. Instead, during online execution, after observing the query type and its reward at each time period, the policy updates the corresponding reward distribution and computes an estimate of the remaining query numbers for all types.
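The per-period re-solve and the three-regime rounding rule of Algorithm 5 can be sketched as follows; counting $\hat{b}_{j,t}$ directly from the historical type samples and reusing the `solve_dual` sketch from Section 4 are our illustrative assumptions.

```python
import numpy as np

def resolve_and_round(t, T, j_t, hist_types, rewards, A, c_t, r_lo, r_hi, kappa):
    """One period of Algorithm 5: estimate remaining arrivals b_hat_{j,t},
    re-solve on the remaining capacity c_t, then round the quantile."""
    n = A.shape[0]
    hist = np.asarray(hist_types)
    # Historical type-j arrivals from period t through T (1-indexed periods).
    b_hat = np.array([(hist[t - 1:] == j).sum() for j in range(n)])
    _, q = solve_dual(b_hat, A, c_t, rewards)

    eps = 2 * kappa * (np.log(T) / np.sqrt(T - t + 1) + np.log(T) / np.sqrt(t))
    if q[j_t] >= 1 - eps:
        return r_lo[j_t]                  # accept every type-j_t query
    if q[j_t] <= eps:
        return r_hi[j_t] + 1.0            # reject every type-j_t query
    return np.quantile(rewards[j_t], 1 - q[j_t])   # interior: F_hat^{-1}(1 - q)
```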
To be specific, the policy initializes the remaining resource capacities and sets all reward distribution estimates to a uniform distribution. Unlike previous methods, it does not rely on historical data to pre-estimate query arrival numbers over the whole time horizon. Instead, during online execution, after observing the query type and its reward at each time period, the policy updates the corresponding reward distribution and computes an estimate of the remaining query numbers for all types. It then constructs and solves an estimated relaxation problem using the latest remaining capacities and query number estimates to obtain the current period's optimal service probabilities. The key distinguishing feature of this policy is its threshold determination rule. Rather than directly inverting the estimated distribution function, it applies a carefully designed rounding procedure: based on the magnitude of the computed service probability, the policy selects a threshold from one of three distinct regimes. This rounding mechanism is crucial for controlling the approximation error dynamically throughout the horizon and is the core technique that enables the poly-logarithmic regret guarantee established in our analysis.

The proposed algorithm is proven to achieve a regret upper bound of $O((\log T)^3)$, as follows:

Theorem 4. With type-only samples, under Assumptions 1, 2 and 4, the regret of the Fully Adaptive Threshold Policy (Algorithm 5) is at most
$$\left(\frac{1}{\alpha\gamma} + \frac{5\kappa^2(\log T)^2}{\alpha}\right)(2\log T + 2 + 2\pi) + (2s_0 + 1)\cdot\bar r + \bar r \cdot m = O(\sqrt n \cdot (\log T)^3 + m),$$
where $m$ is the number of resource constraints, $n$ is the number of arrival query types, $\kappa = \frac{4\beta\sqrt n}{\alpha\gamma} + \frac{2\beta k_1}{\alpha\sqrt{\gamma^3}}$, $s_0 = \max\{144\kappa^2(\log T)^2,\ \frac{4}{\gamma^2\kappa^2(\log T)^2},\ \frac{2}{\gamma}\}$, and $\bar r, \alpha, \beta, \gamma, k_1$ are all constants.

This result demonstrates that, through fully adaptive resolving and careful rounding, poly-logarithmic regret is attainable with only minimal prior data. It thus represents a significant improvement over previous results in the literature.

Key Aspects of Proof Approach: We now highlight the key steps in our proof of the poly-logarithmic regret bound. Attaining such a guarantee requires using the semi-fluid relaxation as the performance benchmark; a detailed discussion is provided in Section E. The following lemma shows that the optimal solution to our estimation problem $\hat V_{t,c}(I_t)$ is a high-quality approximation of the solution to the semi-fluid relaxation.

Lemma 5. There exists a constant $\kappa$ such that for any $c \ge 0$, with probability at least $1 - \frac1T$,
$$|\tilde q^*_{j,t} - \hat q_{j,t}| \le \kappa\left(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\right), \qquad \forall j \in [n],$$
for the remaining problem instance $I_t$, where $s = T - t + 1$ and $\kappa = \frac{4\beta\sqrt n}{\alpha\gamma} + \frac{2\beta k_1}{\alpha\sqrt{\gamma^3}}$. Here $\{\tilde q^*_{j,t}\}_{j=1}^n$ is the optimal solution to the semi-fluid relaxation problem $\bar V^{semi}_c(I_t)$ and $\{\hat q_{j,t}\}_{j=1}^n$ is the optimal solution to $\hat V_{t,c}(I_t)$.

The rationale for incorporating a rounding rule when computing the decision threshold is also rooted in the analysis behind Theorem 4. The core of our poly-logarithmic regret guarantee lies in bounding the instantaneous regret at each time period $t$. This is achieved by constructing feasible solutions to the semi-fluid benchmark through a case analysis based on the computed quantile values. The rounding mechanism is essential because it ensures the validity of these constructed solutions even when the estimated quantile $\hat q_{j_t,t}$ is extremely close to 0 or 1, which prevents the regret bound from deteriorating. A complete proof of Theorem 4 is provided in Section E.
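The three-regime rounding rule translates directly into code. Below is a minimal sketch, assuming hypothetical per-type reward bounds r_lo[j], r_hi[j] and an inverse estimated CDF F_inv(j, p); kappa is the constant from Theorem 4.

    import math

    def threshold(q_hat, j, t, T, kappa, r_lo, r_hi, F_inv):
        # delta mirrors the 2*kappa*(log T / sqrt(T - t + 1) + log T / sqrt(t))
        # radius used in steps 9-11 of Algorithm 5.
        delta = 2.0 * kappa * (math.log(T) / math.sqrt(T - t + 1)
                               + math.log(T) / math.sqrt(t))
        if q_hat >= 1.0 - delta:
            return r_lo[j]              # serve almost surely: accept any reward
        if q_hat <= delta:
            return r_hi[j] + 1.0        # serve almost never: reject every reward
        return F_inv(j, 1.0 - q_hat)    # interior regime: plain quantile inversion

The two boundary regimes replace an unreliable inverted quantile (where estimation error is proportionally largest) with a deterministic accept-all or reject-all threshold, which is exactly what the case analysis in the proof of Theorem 4 exploits.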
7. Conclusion and Future Directions

We study online multi-resource allocation under arbitrary non-stationarity with the minimal possible data requirement: only a single historical sample per period. Our key contribution is a novel type-dependent quantile-based framework that cleanly decouples distribution estimation from optimization, enabling transparent, modular algorithm design.

Theoretically, we establish three principal results. First, with reward-observed samples, our proposed Static Threshold Policy achieves $\tilde O(\sqrt T)$ regret in the multi-resource setting. Second, for type-only samples, we prove that sublinear regret is impossible without structural assumptions, but becomes attainable under a mild minimum-arrival-probability condition via our proposed Partial Adaptive Threshold Policy. Third, and most significantly, we design the Fully Adaptive Threshold Policy with careful rounding, which achieves the first poly-logarithmic regret guarantee of $O((\log T)^3)$ for non-stationary multi-resource allocation, a qualitative improvement over all prior dual-based approaches.

Beyond regret bounds, our quantile-based paradigm offers conceptual advantages: decisions are type-specific with no cross-type interference, the framework is modular (allowing plug-and-play estimation methods), and the policy logic is transparent ("accept rewards above the $(1-q_j)$-th quantile"). This work demonstrates that near-optimal performance in volatile environments requires only the minimal possible offline information: a single sample per period suffices for remarkably strong guarantees.

Future directions include extending our framework to settings with reusable resources, contextual information, or bandit feedback where rewards are only observed upon acceptance. Finally, empirical validation on real-world non-stationary workloads would further substantiate the practical relevance of our theoretically grounded approach.

References

D. Adelman. Dynamic bid prices in revenue management. Operations Research, 55(4):647-661, 2007.
S. Agrawal and N. R. Devanur. Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems, 29, 2016.
S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, pages 989-1006, 2014.
S. Agrawal, N. R. Devanur, and L. Li. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 4-18, 2016.
C. Argue, A. Frieze, A. Gupta, and C. Seiler. Learning from a sample in online algorithms. Advances in Neural Information Processing Systems, 35:13852-13863, 2022.
A. Arlotto and I. Gurvich. Uniformly bounded regret in the multisecretary problem. Stochastic Systems, 9(3):231-260, 2019.
P. D. Azar, R. Kleinberg, and S. M. Weinberg. Prophet inequalities with limited information. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1358-1377, 2014.
A. Badanidiyuru, J. Langford, and A. Slivkins. Resourceful contextual bandits. In Conference on Learning Theory, pages 1109-1134, 2014.
A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. Journal of the ACM, 65(3):1-55, 2018.
J. Baek and W. Ma. Bifurcating constraints to improve approximation ratios for network revenue management with reusable resources. Operations Research, 70(4):2226-2236, 2022.
S. R. Balseiro, R. Kumar, V. Mirrokni, B. Sivan, and D. Wang. Robust budget pacing with a single sample. In International Conference on Machine Learning, pages 1636-1659, 2023a.
S. R. Balseiro, H. Lu, and V. Mirrokni. The best of many worlds: Dual mirror descent for online allocation problems. Operations Research, 71(1):101-119, 2023b.
O. Besbes and A. Zeevi. Blind network revenue management. Operations Research, 60(6):1537-1550, 2012.
O. Besbes, Y. Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. Advances in Neural Information Processing Systems, 27, 2014.
O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227-1244, 2015.
A. Boskovic, Q. Chen, D. Kufel, and Z. Zhou. Online learning and matching for resource allocation problems, 2019.
D. Bradac, A. Gupta, S. Singla, and G. Zuzic. Robust algorithms for the secretary problem, 2019.
R. L. Bray. Logarithmic regret in multisecretary and online linear programs with continuous valuations. Operations Research, 73(4):2188-2203, 2025.
J. Bu, D. Simchi-Levi, and Y. Xu. Online pricing with offline data: Phase transition and inverse square law. In International Conference on Machine Learning, pages 1202-1210, 2020.
N. Buchbinder, K. Jain, and J. Naor. Online primal-dual algorithms for maximizing ad-auctions revenue. In European Symposium on Algorithms, pages 253-264. Springer, 2007.
P. Bumpensanti and H. Wang. A re-solving heuristic with uniformly bounded loss for network revenue management. Management Science, 66(7):2993-3009, 2020.
C. Caramanis, P. Duetting, M. Faw, F. Fusco, P. Lazos, S. Leonardi, O. Papadigenopoulos, E. Pountourakis, and R. Reiffenhaeuser. Single-sample prophet inequalities via greedy-ordered selection. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1298-1325, 2022.
M. Castiglioni, A. Celli, and C. Kroer. Online learning with knapsacks: the best of both worlds. In International Conference on Machine Learning, pages 2767-2783, 2022.
X. Chen, Y. Wang, and Y.-X. Wang. Nonstationary stochastic optimization under $L_{p,q}$-variation measures. Operations Research, 67(6):1752-1765, 2019.
W. C. Cheung and Z. Li. Episodic contextual bandits with knapsacks under conversion models, 2025.
W. C. Cheung, W. Ma, D. Simchi-Levi, and X. Wang. Inventory balancing with online learning. Management Science, 68(3):1776-1807, 2022.
W. C. Cheung, D. Simchi-Levi, and R. Zhu. Nonstationary reinforcement learning: The blessing of (more) optimism. Management Science, 69(10):5722-5739, 2023.
J. Correa, A. Cristi, B. Epstein, and J. A. Soto. Sample-driven optimal stopping: From the secretary problem to the i.i.d. prophet inequality. Mathematics of Operations Research, 49(1):441-475, 2024.
A. Cristi and B. Ziliotto. Prophet inequalities require only a constant number of samples. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 491-502, 2024.
N. R. Devanur and K. Jain. Online matching with concave returns. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pages 137-144, 2012.
N. R. Devanur, K. Jain, B. Sivan, and C. A. Wilkens. Near optimal online algorithms and fast approximation algorithms for resource allocation problems. Journal of the ACM, 66(1):1-41, 2019.
P. Duetting, T. Kesselheim, B. Lucier, R. Reiffenhaeuser, and S. Singla. Online combinatorial allocations and auctions with few samples. In 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS), pages 1231-1250, 2024.
J. Feldman, N. Liu, H. Topaloglu, and S. Ziya. Appointment scheduling under patient preference and no-show behavior. Operations Research, 62(4):794-811, 2014.
K. J. Ferreira, D. Simchi-Levi, and H. Wang. Online network revenue management using Thompson sampling. Operations Research, 66(6):1586-1602, 2018.
G. Gallego and G. Van Ryzin. A multiproduct dynamic pricing problem and its applications to network yield management. Operations Research, 45(1):24-41, 1997.
N. Garg, A. Gupta, S. Leonardi, P. Sankowski, et al. Stochastic analyses for online combinatorial optimization problems. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2008), pages 942-951, 2008.
R. Ghuge, S. Singla, and Y. Wang. Single-sample and robust online resource allocation. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 1442-1453, 2025.
T. Hu and D.-X. Zhou. Online learning with samples drawn from non-identical distributions. Journal of Machine Learning Research, 10(12), 2009.
N. Immorlica, K. Sankararaman, R. Schapire, and A. Slivkins. Adversarial bandits with knapsacks. Journal of the ACM, 69(6):1-47, 2022.
S. Jasin and S. Kumar. A re-solving heuristic with bounded revenue loss for network revenue management with customer choice. Mathematics of Operations Research, 37(2):313-345, 2012.
J. Jiang, X. Li, and J. Zhang. Online stochastic optimization with Wasserstein-based nonstationarity. Management Science, 71(11):9104-9122, 2025a.
J. Jiang, W. Ma, and J. Zhang. Degeneracy is OK: Logarithmic regret for network revenue management with indiscrete distributions. Operations Research, 73(6):3405-3420, 2025b.
H. Kaplan, D. Naori, and D. Raz. Online weighted matching with a sample. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1247-1272, 2022.
N. Kell and D. Panigrahi. Online budgeted allocation with general budgets. In Proceedings of the 2016 ACM Conference on Economics and Computation, pages 419-436, 2016.
S. Kunnumkal and K. Talluri. On a piecewise-linear approximation for network revenue management. Mathematics of Operations Research, 41(1):72-91, 2016.
E. Lecarpentier and E. Rachelson. Non-stationary Markov decision processes, a worst-case approach using model-based reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
X. Li and Y. Ye. Online linear programming: Dual convergence, new algorithms, and regret bounds. Operations Research, 70(5):2948-2966, 2022.
X. Li, C. Sun, and Y. Ye. The symmetry between arms and knapsacks: A primal-dual approach for bandits with knapsacks. In International Conference on Machine Learning, pages 6483-6492, 2021.
S. Liu, J. Jiang, and X. Li. Non-stationary bandits with knapsacks. Advances in Neural Information Processing Systems, 35:16522-16532, 2022.
A. Mehta, A. Saberi, U. Vazirani, and V. Vazirani. AdWords and generalized online matching. Journal of the ACM, 54(5):22-es, 2007.
X. Pan, J. Song, J. Zhao, and V.-A. Truong. Online contextual learning with perishable resources allocation. IISE Transactions, 52(12):1343-1357, 2020.
M. I. Reiman and Q. Wang. An asymptotically optimal policy for a quantity-based network revenue management problem. Mathematics of Operations Research, 33(2):257-282, 2008.
K. Talluri and G. Van Ryzin. An analysis of bid-price controls for network revenue management. Management Science, 44(11):1577-1593, 1998.
V.-A. Truong. Optimal advance scheduling. Management Science, 61(7):1584-1597, 2015.
A. Vera, S. Banerjee, and I. Gurvich. Online allocation and pricing: Constant regret via Bellman inequalities. Operations Research, 69(3):821-840, 2021.
M. Zhalechian, E. Keyvanshokooh, C. Shi, and M. P. Van Oyen. Online resource allocation with personalized learning. Operations Research, 70(4):2138-2161, 2022.

Appendix A: Further Related Work

We discuss further work related to ours here.

Online Resource Allocation: The online resource allocation problem, commonly referred to as the AdWords problem, has been extensively studied. Foundational work by Mehta et al. (2007) introduced the trade-off revealing linear program and derived an optimal algorithm with a competitive ratio of $1 - 1/e$. Buchbinder et al. (2007) developed a primal-dual framework that achieves the same optimal ratio. This line of research was significantly generalized by Devanur and Jain (2012), who allowed for arbitrary concave returns and characterized the optimal competitive ratio. Kell and Panigrahi (2016) further extended the primal-dual analysis, demonstrating that a constant competitive ratio is impossible in the most general setting and providing asymptotically tight bounds. Beyond competitive analysis, structural and algorithmic insights have been developed for related dynamic scheduling problems. For instance, Feldman et al. (2014) proposed heuristic procedures for online advance scheduling, while Truong (2015) provided analytical results for a two-class model. A notable algorithmic framework was introduced by Vera et al. (2021), who designed simple yet efficient policies for a broad family of problems including online packing, budget-constrained probing, and contextual bandits with knapsacks. From a learning-theoretic perspective, Balseiro et al. (2023b) employed dual mirror descent to achieve sublinear regret relative to the hindsight optimum for online allocation with concave rewards.

To handle uncertainty in reward or consumption distributions, online learning techniques become essential. Pan et al. (2020) studied online matching with unknown reward distributions, proposing a two-phase explore-then-exploit algorithm for non-stationary Poisson arrivals.
Similarly, Cheung et al. (2022) combined inventory balancing with online learning to allocate resources to heterogeneous customers despite unknown consumption patterns. For non-stationary environments, Boskovic et al. (2019) proposed a time-segmentation approach that converts a non-stationary problem into a sequence of stationary sub-problems. More generally, Zhalechian et al. (2022) introduced a framework that jointly optimizes exploration, exploitation, and robustness against adversarial arrival sequences, offering a principled approach to dynamic resource allocation under distributional shifts.

Non-stationary Stochastic Optimization: A growing body of literature addresses sequential decision-making under non-stationarity. Early work in network revenue management (NRM) tackled time-varying arrivals through approximate dynamic programming; for instance, Adelman (2007) derived a deterministic linear program for bid-price control using such an approach. Subsequent refinements, such as the compact linear programming formulation for piecewise-linear value function approximations by Kunnumkal and Talluri (2016), offered improved computational tractability. A prominent framework for quantifying and managing non-stationarity is the variation budget, introduced by Besbes et al. (2015) for sequential stochastic optimization with changing cost functions. This concept was later generalized by Chen et al. (2019), who derived matching upper and lower regret bounds for smooth, strongly convex function sequences under $L_{p,q}$-variation constraints. In parallel, research on non-stationary Markov decision processes (MDPs) has extended these ideas to reinforcement learning. Lecarpentier and Rachelson (2019) studied model-based algorithms in evolving MDPs, while Cheung et al. (2023) developed a sliding-window upper confidence bound algorithm with a confidence-widening technique, establishing dynamic regret bounds when variation budgets are known.

MAB with Resource Constraints: Our work is further connected to the literature on multi-armed bandits (MAB) under resource constraints, commonly known as bandits with knapsacks (BwK). This framework generalizes the classical MAB problem by incorporating long-term resource consumption constraints alongside reward maximization. The BwK model was formally introduced by Badanidiyuru et al. (2018). Subsequent research has significantly expanded its scope and developed efficient algorithms. Agrawal and Devanur (2014) considered a general model accommodating concave rewards and convex constraints. The contextual setting, where side information is available each round, has been extensively studied: Badanidiyuru et al. (2014) examined contextual MAB under budget constraints, while Agrawal et al. (2016) and Agrawal and Devanur (2016) developed efficient algorithms employing confidence ellipsoids for parameter estimation. From a regret analysis perspective, Li et al. (2021) designed a primal-dual algorithm achieving a problem-dependent logarithmic regret bound. For adversarial environments, Immorlica et al. (2022) derived an algorithm with an $O(\log T)$ competitive ratio relative to the best fixed action distribution, a result later improved upon by Castiglioni et al. (2022). The challenge of non-stationarity within this constrained bandit setting has also been addressed.
Besbes et al. (2014) proposed a tractable MAB formulation allowing reward distributions to change over time. Most directly related to our setting, Liu et al. (2022) study BwK in a non-stationary environment, providing a primal-dual analysis that explicitly characterizes the interplay between resource constraints and distribution shifts.

Appendix B: Estimation of Arrival Distributions

Proof of Lemma 3: For every $t$, we obtain one single sample $\hat j_t \sim P_t$. Since the query types are finite, i.e. $\hat j_t \in [n]$, we define the number of type-$j$ query arrivals in the historical observations as $\hat d_j = \sum_{t=1}^T \mathbb{1}\{\hat j_t = j\}$. Clearly $\sum_{j=1}^n \hat d_j = T$, as we have exactly $T$ samples in total. For a given problem instance $I$, the actual number of type-$j$ queries is $d_j = \sum_{t=1}^T \mathbb{1}\{j_t = j\}$. Since $j_t$ and $\hat j_t$ are independent and identically distributed according to $P_t$, it follows that
$$\mathbb{E}_{I\sim\mathcal G}[\mathbb{1}\{j_t = j\}] = P_t(j), \qquad \mathbb{E}_{I\sim\mathcal G}[\mathbb{1}\{\hat j_t = j\}] = P_t(j).$$
Consequently, for any $j \in [n]$ we have
$$\mathbb{E}_{I\sim\mathcal G}[d_j] = \sum_{t=1}^T P_t(j) = \mathbb{E}_{I\sim\mathcal G}[\hat d_j].$$
Therefore $\hat d_j$ is an unbiased estimator of $\mathbb{E}_{I\sim\mathcal G}[d_j]$. For brevity, we write $\mu_j = \mathbb{E}_{I\sim\mathcal G}[d_j]$ in the following analysis. Applying Hoeffding's inequality, for any $\epsilon > 0$,
$$\Pr\big[|\hat d_j - \mu_j| \ge \epsilon\big] \le 2\exp\left(-\frac{2\epsilon^2}{\sum_{t=1}^T(1-0)^2}\right) = 2\exp\left(-\frac{2\epsilon^2}{T}\right).$$
Thus, with probability at least $1-\delta$,
$$|\hat d_j - \mu_j| \le \sqrt{\tfrac{T}{2}\log(2/\delta)}.$$
By the Cauchy-Schwarz inequality,
$$\mathbb{E}_{I\sim\mathcal G}\Big[\sum_{j=1}^n |\hat d_j - \mu_j|\Big] \le \sum_{j=1}^n \sqrt{\mathrm{Var}(\hat d_j)} \le \sqrt{n\sum_{j=1}^n \mathrm{Var}(\hat d_j)} \le \sqrt{nT},$$
where the last inequality holds because $\sum_{j=1}^n \mathrm{Var}(\hat d_j) = \sum_{j=1}^n\sum_{t=1}^T P_t(j)(1-P_t(j)) \le \sum_{j=1}^n\sum_{t=1}^T P_t(j) = T$. Applying McDiarmid's inequality, for any $\epsilon > 0$,
$$\Pr\Big[\Big|\sum_{j=1}^n |\hat d_j - \mu_j| - \mathbb{E}_{I\sim\mathcal G}\Big[\sum_{j=1}^n |\hat d_j - \mu_j|\Big]\Big| \ge \epsilon\Big] \le 2\exp\left(-\frac{2\epsilon^2}{4T}\right) = 2\exp\left(-\frac{\epsilon^2}{2T}\right).$$
Hence, with probability at least $1-\delta$,
$$\sum_{j=1}^n |\hat d_j - \mu_j| \le \sqrt{nT} + \sqrt{2T\log(2/\delta)} \le \sqrt{2nT\log(2/\delta)}. \qquad (5)$$
We now consider the upper bound for $\sum_{j=1}^n |\hat d_j - \mu_j|^2/\mu_j$. By Bernstein's inequality, for any $\epsilon > 0$,
$$\Pr\big[|\hat d_j - \mu_j| \ge \epsilon\big] \le 2\exp\left(-\frac{\epsilon^2}{2(\mathrm{Var}(\hat d_j) + \epsilon/3)}\right).$$
Let $\epsilon = L_1\sqrt{\mu_j}$, where $L_1$ is a constant to be determined; then
$$\Pr\big[|\hat d_j - \mu_j| \ge L_1\sqrt{\mu_j}\big] = \Pr\big[|\hat d_j - \mu_j|^2/\mu_j \ge L_1^2\big] \le 2\exp\left(-\frac{L_1^2\mu_j}{2(\mu_j + L_1\sqrt{\mu_j}/3)}\right).$$
From Assumption 2, $\Pr[\hat d_j \ge 1] = 1$ for any $j\in[n]$, so $\mu_j \ge 1$ always holds. Therefore
$$\Pr\left[\frac{|\hat d_j - \mu_j|^2}{\mu_j} \ge L_1^2\right] \le 2\exp\left(-\frac{L_1^2\mu_j}{2(\mu_j + L_1\mu_j/3)}\right) = 2\exp\left(-\frac{L_1^2}{2(1 + L_1/3)}\right).$$
Consequently, with probability at least $1-\delta$,
$$\frac{|\hat d_j - \mu_j|^2}{\mu_j} \le \left(\frac13\log\frac2\delta + \sqrt{\frac19\Big(\log\frac2\delta\Big)^2 + 2\log\frac2\delta}\right)^2 \le \left(\frac23\log\frac2\delta + 3\right)^2.$$
Therefore, for $\delta < 1/2$, with probability at least $1-\delta$,
$$\sum_{j=1}^n \frac{|\hat d_j - \mu_j|^2}{\mu_j} \le n\left(\frac23\log\frac2\delta + 3\right)^2 \le n\left(4\log\frac2\delta\right)^2. \qquad (6)$$
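As a numerical sanity check on Lemma 3 (an illustration only, not part of the proof), one can simulate an arbitrary non-stationary sequence $P_1,\dots,P_T$ and verify that the single-sample counts $\hat d_j$ track $\mu_j = \sum_t P_t(j)$ at the $O(\sqrt{T\log T})$ scale predicted by Hoeffding's inequality:

    import numpy as np

    rng = np.random.default_rng(0)
    T, n = 10_000, 5
    P = rng.dirichlet(np.ones(n), size=T)       # arbitrary non-stationary P_t
    mu = P.sum(axis=0)                          # mu_j = sum_t P_t(j)

    j_hat = np.array([rng.choice(n, p=P[t]) for t in range(T)])
    d_hat = np.bincount(j_hat, minlength=n)     # one sample per period

    print("max_j |d_hat_j - mu_j| :", np.abs(d_hat - mu).max())
    print("sqrt(T log(2T)/2) scale:", np.sqrt(T * np.log(2 * T) / 2))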
Proof of Lemma 4: We define the kernel estimator as
$$\hat F(x) = \frac1N\sum_{i=1}^N K\left(\frac{x - X_i}{h_N}\right),$$
where $h_N > 0$ is a bandwidth parameter to be specified. We split $\hat F(x) - F(x)$ into two parts: the bias $\mathbb{E}[\hat F(x)] - F(x)$ and the random fluctuation $\hat F(x) - \mathbb{E}[\hat F(x)]$.

We first bound the bias $\mathbb{E}[\hat F(x)] - F(x)$. Taking the expectation of $\hat F(x)$ with respect to the samples $\{X_1,\dots,X_N\}$, which are independent and identically distributed with distribution function $F(x)$, we have
$$\mathbb{E}[\hat F(x)] = \mathbb{E}_{X_1,\dots,X_N}\left[\frac1N\sum_{i=1}^N K\left(\frac{x-X_i}{h_N}\right)\right] = \mathbb{E}_{X_1}\left[K\left(\frac{x-X_1}{h_N}\right)\right] = \int_{-\infty}^{+\infty} K\left(\frac{x-y}{h_N}\right) f(y)\,dy,$$
where $f$ is the probability density function of $F$. Substituting $u = (x-y)/h_N$,
$$\mathbb{E}[\hat F(x)] = h_N\int_{-\infty}^{+\infty} K(u) f(x - h_N u)\,du.$$
Meanwhile,
$$F(x) = \int_{-\infty}^{+\infty} \mathbb{1}\{y \le x\} f(y)\,dy = h_N\int_{-\infty}^{+\infty} \mathbb{1}\{u \ge 0\} f(x - h_N u)\,du.$$
Define $L(u) = K(u) - \mathbb{1}\{u \ge 0\}$. Since the kernel density $k(u)$ is supported on $[-1,1]$, we know $K(u) = 0$ for $u \le -1$ and $K(u) = 1$ for $u \ge 1$; consequently $L(u)$ is also supported on $[-1,1]$. Therefore
$$\mathbb{E}[\hat F(x)] - F(x) = h_N\int_{-\infty}^{+\infty} L(u) f(x - h_N u)\,du = h_N\int_{-1}^{1} L(u) f(x - h_N u)\,du.$$
We show that $\int_{-1}^1 |L(u)|\,du$ is bounded. Since $K(u)$ is symmetric, $K(-u) = 1 - K(u)$ and $K(0) = \frac12$. Since $K(u)$ is monotone increasing, $0 \le K(u) \le \frac12$ for $-1 \le u \le 0$ and $\frac12 \le K(u) \le 1$ for $0 \le u \le 1$. Therefore
$$\int_{-1}^{1} |L(u)|\,du = \int_{-1}^{0} K(u)\,du + \int_{0}^{1} [1 - K(u)]\,du \le \int_{-1}^0 \tfrac12\,du + \int_0^1 \tfrac12\,du = 1.$$
Case 1: interior points. When $x \in [a + h_N, b - h_N]$, for all $u \in [-1,1]$ we have $x - h_N u \in [a,b]$, so $f(x - h_N u)$ is well defined and upper bounded by the constant $\beta$. Hence
$$\mathbb{E}[\hat F(x)] - F(x) \le h_N\int_{-1}^1 |L(u)||f(x - h_N u)|\,du \le \beta h_N\int_{-1}^1 |L(u)|\,du \le \beta h_N.$$
Case 2: boundary regions. When $x \in [a, a + h_N) \cup (b - h_N, b]$, we first consider the left boundary, $x = a + \theta h_N$ with $0 \le \theta < 1$. For $u > \theta$ we have $x - h_N u < a$, hence $f(x - h_N u) = 0$. Splitting the integral into two parts,
$$\mathbb{E}[\hat F(x)] - F(x) = h_N\int_{-1}^{\theta} L(u) f(x - h_N u)\,du + h_N\int_{\theta}^{1} L(u) f(x - h_N u)\,du \le h_N\int_{-1}^{\theta} |L(u)||f(x - h_N u)|\,du \le \beta h_N.$$
The same conclusion holds for the right boundary $x \in (b - h_N, b]$. To conclude,
$$\sup_{x\in[a,b]} \big|\mathbb{E}[\hat F(x)] - F(x)\big| \le \beta h_N. \qquad (7)$$
We now bound the random fluctuation $\hat F(x) - \mathbb{E}[\hat F(x)]$. Let $\bar F(x)$ denote the empirical distribution function. We have
$$\hat F(x) = \frac1N\sum_{i=1}^N \int_{-\infty}^{\frac{x-X_i}{h_N}} k(u)\,du = \int_{-\infty}^{+\infty} \bar F(x - h_N u)\,k(u)\,du,$$
and similarly
$$\mathbb{E}[\hat F(x)] = \int_{-\infty}^{+\infty} F(x - h_N u)\,k(u)\,du.$$
Therefore
$$\big|\hat F(x) - \mathbb{E}[\hat F(x)]\big| \le \int_{-\infty}^{+\infty} \big|\bar F(x - h_N u) - F(x - h_N u)\big|\,k(u)\,du \le \sup_{y\in[a,b]}|\bar F(y) - F(y)| \int_{-\infty}^{+\infty} k(u)\,du = \sup_{y\in[a,b]}|\bar F(y) - F(y)|.$$
Consequently,
$$\sup_{x\in[a,b]} \big|\hat F(x) - \mathbb{E}[\hat F(x)]\big| \le \sup_{y\in[a,b]} |\bar F(y) - F(y)|.$$
Applying the Dvoretzky-Kiefer-Wolfowitz inequality, for any $\epsilon > 0$,
$$\Pr\Big[\sup_{y\in[a,b]}|\bar F(y) - F(y)| \ge \epsilon\Big] \le 2\exp(-2N\epsilon^2).$$
Thus, with probability at least $1-\delta$,
$$\sup_{x\in[a,b]} \big|\hat F(x) - \mathbb{E}[\hat F(x)]\big| \le \sup_{y\in[a,b]} |\bar F(y) - F(y)| \le \sqrt{\frac{\log(2/\delta)}{2N}}. \qquad (8)$$
Combining (7) and (8), with probability at least $1-\delta$,
$$\sup_{x\in[a,b]} |F(x) - \hat F(x)| \le \sup_{x\in[a,b]} \big|\mathbb{E}[\hat F(x)] - F(x)\big| + \sup_{x\in[a,b]} \big|\hat F(x) - \mathbb{E}[\hat F(x)]\big| \le \beta h_N + \sqrt{\frac{\log(2/\delta)}{2N}}.$$
Setting $h_N = N^{-1/2}$, with probability at least $1-\delta$,
$$\sup_{x\in[a,b]} |F(x) - \hat F(x)| \le (\beta + 1)\sqrt{\frac{\log(1/\delta)}{N}} = O\left(\sqrt{\frac{\log(1/\delta)}{N}}\right).$$
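A concrete instance of the estimator analyzed above: the sketch below uses the integrated triangular kernel (any symmetric, monotone $K$ supported on $[-1,1]$ with $K(-1)=0$, $K(1)=1$ satisfies the conditions of the proof) with bandwidth $h_N = N^{-1/2}$, together with a bisection routine for the quantile $\hat F^{-1}$ needed by the threshold policies. The function names are ours, introduced for illustration.

    import numpy as np

    def K(u):
        # Integrated triangular kernel: K' = k is supported on [-1, 1], K is
        # symmetric in the sense K(-u) = 1 - K(u), and increases from 0 to 1.
        u = np.clip(u, -1.0, 1.0)
        return np.where(u < 0.0,
                        0.5 * (1.0 + u) ** 2,
                        1.0 - 0.5 * (1.0 - u) ** 2)

    def F_hat(x, samples):
        # F_hat(x) = (1/N) sum_i K((x - X_i) / h_N) with h_N = N^{-1/2}.
        X = np.asarray(samples, dtype=float)
        return float(np.mean(K((x - X) * np.sqrt(len(X)))))

    def F_hat_inv(p, samples, lo, hi, iters=60):
        # Quantile by bisection on [lo, hi]; valid since F_hat is nondecreasing.
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if F_hat(mid, samples) < p else (lo, mid)
        return 0.5 * (lo + hi)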
Appendix C: Missing Proof in Section 4

Proof of Theorem 1: We rewrite the regret with respect to $\bar V^{fld}_C$:
$$\text{Regret}(\pi) \le \bar V^{fld}_C - \mathbb{E}_{I\sim\mathcal G}[V^\pi_C(I)] = \mathbb{E}_{I\sim\mathcal G}\left[\sum_{j=1}^n d_j\left(\int_{1-q^*_j(\lambda^*)}^1 F^{-1}_j(u)\,du - \int_{1-q^\pi_j}^1 F^{-1}_j(u)\,du\right)\right] \le \mathbb{E}_{I\sim\mathcal G}\left[\sum_{j=1}^n \mu_j\,\bar r_j\,|q^*_j(\lambda^*) - q^\pi_j|\right],$$
where $q^\pi_j$ denotes the probability that a type-$j$ query is served by the online policy $\pi$, and the inequality holds because the reward of type $j$ is upper bounded by $\bar r_j$. The most straightforward way to implement $q^\pi_j$ is to set $q^\pi_j := \hat q_j(\hat\lambda)$. However, this would result in constraint violations in practice. Therefore, we reformulate the regret of our algorithm as
$$\text{Regret}(\pi) \le \mathbb{E}_{I\sim\mathcal G}\left[\sum_{j=1}^n \mu_j\,\bar r_j\,|q^*_j(\lambda^*) - q^\pi_j|\right] \le \mathbb{E}_{I\sim\mathcal G}\left[\sum_{j=1}^n \mu_j\,\bar r_j\,|q^*_j(\lambda^*) - \hat q_j(\hat\lambda)| + V(\hat q(\hat\lambda))\right], \qquad (9)$$
where $V(\hat q(\hat\lambda))$ denotes the penalty incurred by constraint violations when directly applying $\{\hat q_j(\hat\lambda)\}_{j=1}^n$ in our algorithm. For $V(\hat q(\hat\lambda))$ we have the upper bound
$$V(\hat q(\hat\lambda)) \le \bar r\sum_{i=1}^m\max\Big\{0, \sum_{j=1}^n \mu_j a_{j,i}\hat q_j(\hat\lambda) - C_i\Big\} \le \bar r\sum_{i=1}^m\max\Big\{0, \sum_{j=1}^n \mu_j a_{j,i}\hat q_j(\hat\lambda) - \sum_{j=1}^n \mu_j a_{j,i} q^*_j(\lambda^*)\Big\} \le \bar r\sum_{i=1}^m\sum_{j=1}^n \mu_j a_{j,i}|q^*_j(\lambda^*) - \hat q_j(\hat\lambda)| \le \sum_{j=1}^n \mu_j\,\bar r m a_{\max}\,|q^*_j(\lambda^*) - \hat q_j(\hat\lambda)|, \qquad (10)$$
where $\bar r = \max_j \bar r_j$, $a_{\max} = \max_{j,i} a_{j,i}$, and the second inequality follows from the budget constraints. We now proceed to bound the term $\sum_{j=1}^n \mu_j|q^*_j(\lambda^*) - \hat q_j(\hat\lambda)|$. Recall from (3) that $\{q^*_j(\lambda^*)\}_{j=1}^n$ is the optimal solution to $\bar V^{fld}_C$ and from (2) that $\{\hat q_j(\hat\lambda)\}_{j=1}^n$ is the optimal solution to $\hat V_C(\hat d)$. For notational simplicity, we omit the dependence on $\lambda$ and write $q^*_j(\lambda^*)$ as $q^*_j$ and $\hat q_j(\hat\lambda)$ as $\hat q_j$ in the subsequent analysis. For each $j\in[n]$, define
$$G_j(q) = \int_{1-q}^1 F^{-1}_j(u)\,du.$$
Since we assume lower and upper bounds on the density $f_j$, we have
$$G'_j(q) = F^{-1}_j(1-q), \qquad G''_j(q) = -\frac{1}{f_j(F^{-1}_j(1-q))} \le -\frac1\beta, \qquad (11)$$
so $G_j(q)$ is $1/\beta$-strongly concave and $G'_j(q)$ is $1/\alpha$-Lipschitz continuous.
To address the regret arising from estimation errors in both the query numbers and the reward distributions, we introduce an intermediate variable $\tilde q_j$ and decompose $|q^*_j - \hat q_j|$ into two parts via the triangle inequality. Specifically, we consider an intermediate estimation problem in which the reward distributions are known exactly:
$$\max_q \sum_{j\in[n]} \hat d_j\int_{1-q_j}^1 F^{-1}_j(u)\,du \quad \text{s.t.} \quad \sum_{j\in[n]} \hat d_j\,a_{j,i}\,q_j \le C_i \ \ \forall i\in[m], \qquad q_j\in[0,1] \ \ \forall j\in[n], \qquad (\tilde V_C(\hat d))$$
and let $\{\tilde q_j\}_{j=1}^n$ denote its optimal solution. Then
$$\sum_{j=1}^n \mu_j|q^*_j - \hat q_j| \le \sum_{j=1}^n \mu_j|q^*_j - \tilde q_j| + \sum_{j=1}^n \mu_j|\tilde q_j - \hat q_j|. \qquad (12)$$
For the first term on the right-hand side of (12), the strong concavity of $G_j$ yields
$$\frac1\beta(\tilde q_j - q^*_j)^2 \le \big(G'_j(\tilde q_j) - G'_j(q^*_j)\big)(q^*_j - \tilde q_j).$$
We denote the optimal dual variable of $\bar V^{fld}_C$ by $\lambda^*$ and that of $\tilde V_C(\hat d)$ by $\tilde\lambda$. From the first-order conditions,
$$G'_j(q^*_j) = \sum_{i=1}^m \lambda^*_i a_{j,i}, \qquad G'_j(\tilde q_j) = \sum_{i=1}^m \tilde\lambda_i a_{j,i}.$$
It follows that
$$\frac1\beta\sum_{j=1}^n \mu_j(\tilde q_j - q^*_j)^2 \le \sum_{j=1}^n \mu_j\big(G'_j(\tilde q_j) - G'_j(q^*_j)\big)(q^*_j - \tilde q_j) = \sum_{j=1}^n \mu_j\sum_{i=1}^m(\tilde\lambda_i - \lambda^*_i)a_{j,i}(q^*_j - \tilde q_j) = \sum_{i=1}^m(\tilde\lambda_i - \lambda^*_i)\Big(\sum_{j=1}^n \mu_j a_{j,i}(q^*_j - \tilde q_j)\Big).$$
Define $\Delta C_i = \sum_{j=1}^n \mu_j a_{j,i} q^*_j - \sum_{j=1}^n \hat d_j a_{j,i}\tilde q_j$ for every $i\in[m]$; then
$$\sum_{i=1}^m(\tilde\lambda_i - \lambda^*_i)\Big(\sum_{j=1}^n \mu_j a_{j,i}(q^*_j - \tilde q_j)\Big) = \sum_{i=1}^m(\tilde\lambda_i - \lambda^*_i)\Delta C_i + \sum_{i=1}^m(\tilde\lambda_i - \lambda^*_i)\Big(\sum_{j=1}^n(\mu_j - \hat d_j)a_{j,i}\tilde q_j\Big).$$
From complementary slackness and the resource constraints,
$$\lambda^*_i\Big(\sum_{j=1}^n \mu_j a_{j,i} q^*_j - C_i\Big) = 0, \qquad \sum_{j=1}^n \mu_j a_{j,i} q^*_j \le C_i,$$
$$\tilde\lambda_i\Big(\sum_{j=1}^n \hat d_j a_{j,i}\tilde q_j - C_i\Big) = 0, \qquad \sum_{j=1}^n \hat d_j a_{j,i}\tilde q_j \le C_i.$$
When $\lambda^*_i > 0$, it holds that $\sum_{j=1}^n \mu_j a_{j,i} q^*_j = C_i$, and thus $\Delta C_i \ge 0$; when $\tilde\lambda_i > 0$, it holds that $\sum_{j=1}^n \hat d_j a_{j,i}\tilde q_j = C_i$, and thus $\Delta C_i \le 0$. Consequently
$$\sum_{i=1}^m(\tilde\lambda_i - \lambda^*_i)\Delta C_i \le 0.$$
Therefore,
$$\frac1\beta\sum_{j=1}^n \mu_j(\tilde q_j - q^*_j)^2 \le \sum_{i=1}^m(\tilde\lambda_i - \lambda^*_i)\Delta C_i + \sum_{i=1}^m(\tilde\lambda_i - \lambda^*_i)\Big(\sum_{j=1}^n(\mu_j - \hat d_j)a_{j,i}\tilde q_j\Big) \le \sum_{j=1}^n(\mu_j - \hat d_j)\tilde q_j\big(G'_j(\tilde q_j) - G'_j(q^*_j)\big) \le \sum_{j=1}^n|\mu_j - \hat d_j|\,\big|G'_j(\tilde q_j) - G'_j(q^*_j)\big|.$$
Since $G'_j(q)$ is $1/\alpha$-Lipschitz continuous, $|G'_j(\tilde q_j) - G'_j(q^*_j)| \le \frac1\alpha|\tilde q_j - q^*_j|$, and thus
$$\frac1\beta\sum_{j=1}^n \mu_j(\tilde q_j - q^*_j)^2 \le \frac1\alpha\sum_{j=1}^n|\mu_j - \hat d_j|\,|\tilde q_j - q^*_j|.$$
By the Cauchy-Schwarz inequality,
$$\frac1\beta\sum_{j=1}^n \mu_j(\tilde q_j - q^*_j)^2 \le \frac1\alpha\sum_{j=1}^n|\hat d_j - \mu_j|\,|\tilde q_j - q^*_j| \le \frac1\alpha\Big(\sum_{j=1}^n \mu_j(\tilde q_j - q^*_j)^2\Big)^{\frac12}\Big(\sum_{j=1}^n\frac{|\hat d_j - \mu_j|^2}{\mu_j}\Big)^{\frac12}.$$
Therefore,
$$\sum_{j=1}^n \mu_j(\tilde q_j - q^*_j)^2 \le \Big(\frac\beta\alpha\Big)^2\sum_{j=1}^n\frac{|\hat d_j - \mu_j|^2}{\mu_j}.$$
From Section B, we have (6).
Following the Cauchy-Schwarz inequality, with probability at least $1-\frac1T$ we have
$$\sum_{j=1}^n \mu_j|\tilde q_j - q^*_j| \le \Big(\sum_{j=1}^n \mu_j\Big)^{\frac12}\Big(\sum_{j=1}^n \mu_j(\tilde q_j - q^*_j)^2\Big)^{\frac12} \le \sqrt T\cdot\frac\beta\alpha\Big(\sum_{j=1}^n\frac{|\hat d_j - \mu_j|^2}{\mu_j}\Big)^{\frac12} \le \sqrt T\cdot\frac\beta\alpha\cdot 4\log(2T)\sqrt n \le \frac{4\beta}{\alpha}\log T\,\sqrt{nT}. \qquad (13)$$
For the second term on the right-hand side of (12), we define
$$\hat G_j(q) = \int_{1-q}^1 \hat F^{-1}_j(u)\,du.$$
Applying the strong concavity of $G_j$,
$$G_j(\tilde q_j) \le G_j(\hat q_j) + G'_j(\hat q_j)(\tilde q_j - \hat q_j) - \frac{1}{2\beta}(\tilde q_j - \hat q_j)^2.$$
Thus
$$\frac{1}{2\beta}\sum_{j=1}^n\hat d_j(\tilde q_j - \hat q_j)^2 \le \sum_{j=1}^n\hat d_j G_j(\hat q_j) - \sum_{j=1}^n\hat d_j G_j(\tilde q_j) + \sum_{j=1}^n\hat d_j G'_j(\hat q_j)(\tilde q_j - \hat q_j) \le \sum_{j=1}^n\hat d_j G'_j(\hat q_j)(\tilde q_j - \hat q_j) = \sum_{j=1}^n\hat d_j\hat G'_j(\hat q_j)(\tilde q_j - \hat q_j) + \sum_{j=1}^n\hat d_j\big(G'_j(\hat q_j) - \hat G'_j(\hat q_j)\big)(\tilde q_j - \hat q_j) \le \sum_{j=1}^n\hat d_j\big|F^{-1}_j(1-\hat q_j) - \hat F^{-1}_j(1-\hat q_j)\big|\,|\tilde q_j - \hat q_j| \le \sum_{j=1}^n\hat d_j\frac{\epsilon_j}{\alpha}|\tilde q_j - \hat q_j|, \qquad (14)$$
where $\epsilon_j = \sup_x|\hat F_j(x) - F_j(x)|$, and the second and third inequalities hold by the optimality of $\{\tilde q_j\}_{j=1}^n$ for $\tilde V_C(\hat d)$ and of $\{\hat q_j\}_{j=1}^n$ for $\hat V_C(\hat d)$, respectively. From Lemma 4, with probability at least $1-\frac1T$ we have $\epsilon_j = \sup_x|\hat F_j(x) - F_j(x)| \le k_1\sqrt{\log T/\hat d_j} = O(\sqrt{\log T/\hat d_j})$, where $k_1$ is a constant. Therefore, by the Cauchy-Schwarz inequality,
$$\frac{1}{2\beta}\sum_{j=1}^n\hat d_j(\tilde q_j - \hat q_j)^2 \le \sum_{j=1}^n\hat d_j\frac{\epsilon_j}{\alpha}|\tilde q_j - \hat q_j| \le \frac1\alpha\Big(\sum_{j=1}^n\hat d_j(\tilde q_j - \hat q_j)^2\Big)^{\frac12}\Big(\sum_{j=1}^n\hat d_j\epsilon_j^2\Big)^{\frac12} \le \frac{k_1\sqrt{n\log T}}{\alpha}\Big(\sum_{j=1}^n\hat d_j(\tilde q_j - \hat q_j)^2\Big)^{\frac12}.$$
Thus, with probability at least $1-\frac1T$,
$$\sum_{j=1}^n\hat d_j(\tilde q_j - \hat q_j)^2 \le \Big(\frac{2\beta k_1}{\alpha}\Big)^2 n\log T.$$
By the Cauchy-Schwarz inequality,
$$\sum_{j=1}^n\hat d_j|\tilde q_j - \hat q_j| \le \Big(\sum_{j=1}^n\hat d_j\Big)^{\frac12}\Big(\sum_{j=1}^n\hat d_j(\tilde q_j - \hat q_j)^2\Big)^{\frac12} \le \sqrt T\cdot\frac{2\beta k_1}{\alpha}\sqrt{n\log T} = \frac{2\beta k_1}{\alpha}\sqrt{nT\log T}.$$
Therefore, applying (5), with probability at least $1-\frac1T$,
$$\sum_{j=1}^n \mu_j|\tilde q_j - \hat q_j| \le \sum_{j=1}^n\hat d_j|\tilde q_j - \hat q_j| + \sum_{j=1}^n|\mu_j - \hat d_j|\,|\tilde q_j - \hat q_j| \le \sum_{j=1}^n\hat d_j|\tilde q_j - \hat q_j| + \sum_{j=1}^n|\mu_j - \hat d_j| \le \frac{2\beta k_1}{\alpha}\sqrt{nT\log T} + 2\sqrt{nT\log T}. \qquad (15)$$
Combining (13) and (15), with probability at least $1-\frac1T$,
$$\sum_{j=1}^n \mu_j|q^*_j - \hat q_j| \le \sum_{j=1}^n \mu_j|q^*_j - \tilde q_j| + \sum_{j=1}^n \mu_j|\tilde q_j - \hat q_j| \le \frac{4\beta}{\alpha}\log T\,\sqrt{nT} + \frac{2\beta k_1}{\alpha}\sqrt{nT\log T} + 2\sqrt{nT\log T} \le \Big(\frac{4\beta + 2\beta k_1}{\alpha} + 2\Big)\log T\,\sqrt{nT}.$$
To conclude, considering (9) and (10), we obtain the following upper bound on the regret:
$$\text{Regret}(\pi) \le \mathbb{E}_{I\sim\mathcal G}\left[\sum_{j=1}^n \mu_j(\bar r_j + \bar r m a_{\max})|q^*_j - \hat q_j|\right] \le (\bar r + \bar r m a_{\max})\,\mathbb{E}_{I\sim\mathcal G}\left[\sum_{j=1}^n \mu_j|q^*_j - \hat q_j|\right] \le (\bar r + \bar r m a_{\max})\Big(\frac{4\beta + 2\beta k_1}{\alpha} + 2\Big)\log T\,\sqrt{nT} + (\bar r + \bar r m a_{\max})\bar r T\cdot\frac1T \le (\bar r + \bar r m a_{\max})\Big(\frac{4\beta + 2\beta k_1}{\alpha} + 2 + \bar r\Big)\log T\,\sqrt{nT} = O(m\log T\,\sqrt{nT}),$$
where $\bar r, a_{\max}, \alpha, \beta, k_1$ are all constants.
Appendix D: Missing Proof in Section 5

Proof of Theorem 3: We define $d_{j,t}$ and $\hat d_{j,t}$ as the indicator variables for the arrival of a type-$j$ query at time period $t$ in the primary problem instance and in the historical sample stream, respectively; both take values in $\{0,1\}$. It holds that $\mathbb{E}_{I\sim\mathcal G}[d_{j,t}] = \mathbb{E}_{I\sim\mathcal G}[\hat d_{j,t}] = P_t(j)$. Furthermore, at each time period exactly one type arrives, so $\sum_{j=1}^n d_{j,t} = \sum_{j=1}^n \hat d_{j,t} = 1$. Aggregating over time, $\sum_{t=1}^T d_{j,t} = d_j$ and $\sum_{t=1}^T \hat d_{j,t} = \hat d_j$ for every $j\in[n]$. With this definition of $d_{j,t}$ and $\hat d_{j,t}$, and similarly to Section C, we have
$$\text{Regret}(\pi) \le \mathbb{E}_{I\sim\mathcal G}\left[\sum_{j=1}^n\sum_{t=1}^T d_{j,t}\left(\int_{1-q^*_j(\lambda^*)}^1 F^{-1}_j(u)\,du - \int_{1-q^\pi_{j,t}}^1 F^{-1}_j(u)\,du\right)\right] \le \mathbb{E}_{I\sim\mathcal G}\left[\sum_{j=1}^n\sum_{t=1}^T P_t(j)\,\bar r_j\,|q^*_j(\lambda^*) - q^\pi_{j,t}|\right] \le \mathbb{E}_{I\sim\mathcal G}\left[\sum_{j=1}^n\sum_{t=1}^T P_t(j)\,\bar r_j\,|q^*_j(\lambda^*) - q_{j,t}(\lambda_t)| + \sum_{t=1}^T V(\hat q(\lambda_t))\right],$$
where the second inequality holds because the reward of type $j$ is upper bounded by $\bar r_j$, and $V(\hat q(\lambda_t))$ denotes the penalty of the constraint violation caused by directly applying $\{q_{j,t}(\lambda_t)\}_{j=1}^n$ at time period $t$ in our algorithm. We now bound $V(\hat q(\lambda_t))$:
$$V(\hat q(\lambda_t)) \le \bar r\sum_{i=1}^m\max\Big\{0,\sum_{j=1}^n P_t(j)a_{j,i}q_{j,t}(\lambda_t) - \sum_{j=1}^n P_t(j)a_{j,i}q^*_j(\lambda^*)\Big\} \le \bar r\sum_{i=1}^m\sum_{j=1}^n P_t(j)a_{j,i}|q^*_j(\lambda^*) - q_{j,t}(\lambda_t)| \le \bar r m a_{\max}\sum_{j=1}^n P_t(j)|q^*_j(\lambda^*) - q_{j,t}(\lambda_t)|, \qquad (16)$$
where $\bar r = \max_j \bar r_j$ and $a_{\max} = \max_{j,i} a_{j,i}$. Thus
$$\sum_{t=1}^T V(\hat q(\lambda_t)) \le \bar r m a_{\max}\sum_{j=1}^n\sum_{t=1}^T P_t(j)|q^*_j(\lambda^*) - q_{j,t}(\lambda_t)|,$$
$$\text{Regret}(\pi) \le (\bar r + \bar r m a_{\max})\cdot\mathbb{E}_{I\sim\mathcal G}\left[\sum_{j=1}^n\sum_{t=1}^T P_t(j)|q^*_j(\lambda^*) - q_{j,t}(\lambda_t)|\right]. \qquad (17)$$
For the next part of the analysis, we bound $\sum_{j=1}^n\sum_{t=1}^T P_t(j)|q^*_j(\lambda^*) - q_{j,t}(\lambda_t)|$. Recall that $\{q^*_j(\lambda^*)\}_{j=1}^n$ in (3) is the optimal solution to $\bar V^{fld}_C$ and $\{q_{j,t}(\lambda_t)\}_{j=1}^n$ in (4) is the optimal solution to $\hat V_{t,C}(\hat d)$. For notational simplicity we omit $\lambda$ and write $q^*_j(\lambda^*)$ as $q^*_j$ and $q_{j,t}(\lambda_t)$ as $q_{j,t}$ in the following. For each $j\in[n]$, define
$$G_j(q) = \int_{1-q}^1 F^{-1}_j(u)\,du.$$
Following (11), $G_j(q)$ is Lipschitz continuous and $1/\beta$-strongly concave. As in Section C, we introduce the intermediate estimation problem $\tilde V_C(\hat d)$ and denote by $\{\tilde q_j\}_{j=1}^n$ its optimal solution. By the triangle inequality, we decompose $|q^*_j - q_{j,t}|$ into two parts:
$$\sum_{j=1}^n\sum_{t=1}^T P_t(j)|q^*_j - q_{j,t}| \le \sum_{j=1}^n\sum_{t=1}^T P_t(j)|q^*_j - \tilde q_j| + \sum_{j=1}^n\sum_{t=1}^T P_t(j)|\tilde q_j - q_{j,t}| = \sum_{j=1}^n \mu_j|q^*_j - \tilde q_j| + \sum_{j=1}^n\sum_{t=1}^T P_t(j)|\tilde q_j - q_{j,t}|, \qquad (18)$$
where the equality holds since $\mu_j = \mathbb{E}_{I\sim\mathcal G}[d_j] = \sum_{t=1}^T P_t(j)$. The bound on the first term $\sum_{j=1}^n \mu_j|q^*_j - \tilde q_j|$ follows directly from (13). It remains to bound $\sum_{j=1}^n\sum_{t=1}^T P_t(j)|\tilde q_j - q_{j,t}|$. We use the strong concavity of $G_j$ for any $t\in[T]$:
$$G_j(\tilde q_j) \le G_j(q_{j,t}) + G'_j(q_{j,t})(\tilde q_j - q_{j,t}) - \frac{1}{2\beta}(\tilde q_j - q_{j,t})^2.$$
For each time period $t$, define
$$\hat G_{j,t}(q) = \int_{1-q}^1 \hat F^{-1}_{j,t}(u)\,du. \qquad (19)$$
Similarly to (14), from the optimality of $\{\tilde q_j\}_{j=1}^n$ for $\tilde V_C(\hat d)$ and of $\{q_{j,t}\}_{j=1}^n$ for $\hat V_{t,C}(\hat d)$, we have
$$\frac{1}{2\beta}\sum_{j=1}^n\hat d_j(\tilde q_j - q_{j,t})^2 \le \sum_{j=1}^n\hat d_j G_j(q_{j,t}) - \sum_{j=1}^n\hat d_j G_j(\tilde q_j) + \sum_{j=1}^n\hat d_j G'_j(q_{j,t})(\tilde q_j - q_{j,t}) \le \sum_{j=1}^n\hat d_j\big|F^{-1}_j(1-q_{j,t}) - \hat F^{-1}_{j,t}(1-q_{j,t})\big|\,|\tilde q_j - q_{j,t}| \le \sum_{j=1}^n\hat d_j\frac{\epsilon_{j,t}}{\alpha}|\tilde q_j - q_{j,t}|,$$
where $\epsilon_{j,t} = \sup_x|\hat F_{j,t}(x) - F_j(x)|$.
By the Cauchy-Schwarz inequality,
$$\frac{1}{2\beta}\sum_{j=1}^n\hat d_j(\tilde q_j - q_{j,t})^2 \le \sum_{j=1}^n\hat d_j\frac{\epsilon_{j,t}}{\alpha}|\tilde q_j - q_{j,t}| \le \frac1\alpha\Big(\sum_{j=1}^n\hat d_j(\tilde q_j - q_{j,t})^2\Big)^{\frac12}\Big(\sum_{j=1}^n\hat d_j\epsilon_{j,t}^2\Big)^{\frac12}.$$
Denote by $n_{j,t}$ the number of type-$j$ queries arriving from time period 1 to $t$. Under Assumption 3, it follows that $\mathbb{E}_{I\sim\mathcal G}[n_{j,t}] \ge \gamma t$ and $\mathbb{E}_{I\sim\mathcal G}[\hat d_j] \ge \gamma T$. By the Chernoff bound, with probability at least $1-\delta$,
$$\hat d_j \ge \mathbb{E}_{I\sim\mathcal G}[\hat d_j] - \sqrt{2\,\mathbb{E}_{I\sim\mathcal G}[\hat d_j]\log(1/\delta)}, \qquad n_{j,t} \ge \mathbb{E}_{I\sim\mathcal G}[n_{j,t}] - \sqrt{2\,\mathbb{E}_{I\sim\mathcal G}[n_{j,t}]\log(1/\delta)}.$$
Therefore, when $t \ge t_0 = \frac8\gamma\log T$, with probability at least $1-\frac1T$ we have $\hat d_j \ge \frac{\gamma T}{2}$ and $n_{j,t} \ge \frac{\gamma t}{2}$. By Lemma 4, with probability at least $1-\frac1T$, $\epsilon_{j,t} = \sup_x|\hat F_{j,t}(x) - F_j(x)| \le k_1\sqrt{\log T/n_{j,t}}$, where $k_1$ is a constant. Thus, with probability at least $1-\frac2T$,
$$\frac{\gamma T}{2}\sum_{j=1}^n(\tilde q_j - q_{j,t})^2 \le \sum_{j=1}^n\hat d_j(\tilde q_j - q_{j,t})^2 \le \Big(\frac{2\beta}{\alpha}\Big)^2\sum_{j=1}^n\hat d_j\epsilon_{j,t}^2 \le \Big(\frac{2\beta}{\alpha}\Big)^2\sum_{j=1}^n\hat d_j\cdot\frac{2k_1^2\log T}{\gamma t}.$$
Consequently,
$$\sum_{j=1}^n(\tilde q_j - q_{j,t})^2 \le \frac{2}{\gamma T}\Big(\frac{2\beta}{\alpha}\Big)^2\sum_{j=1}^n\hat d_j\cdot\frac{2k_1^2\log T}{\gamma t} \le \Big(\frac{4\beta k_1}{\alpha\gamma}\Big)^2\frac{\log T}{t}.$$
Therefore, by the Cauchy-Schwarz inequality, with probability at least $1-\frac2T$,
$$\sum_{j=1}^n\sum_{t=1}^T P_t(j)|\tilde q_j - q_{j,t}| \le \sum_{t=1}^T\sum_{j=1}^n|\tilde q_j - q_{j,t}| \le \sqrt n\Big(\frac{4\beta k_1}{\alpha\gamma}\sum_{t=1}^T\sqrt{\frac{\log T}{t}} + 2t_0\Big) \le \frac{8\beta k_1}{\alpha\gamma}\sqrt{nT\log T} + \frac{16}{\gamma}\sqrt n\log T. \qquad (20)$$
Combining (18), (13) and (20), with probability at least $1-\frac2T$,
$$\sum_{j=1}^n\sum_{t=1}^T P_t(j)|q^*_j - q_{j,t}| \le \sum_{j=1}^n \mu_j|q^*_j - \tilde q_j| + \sum_{j=1}^n\sum_{t=1}^T P_t(j)|\tilde q_j - q_{j,t}| \le \frac{4\beta}{\alpha}\log T\,\sqrt{nT} + \frac{8\beta k_1}{\alpha\gamma}\sqrt{nT\log T} + \frac{16}{\gamma}\sqrt n\log T \le \Big(\frac{4\beta}{\alpha} + \frac{8\beta k_1}{\alpha\gamma} + \frac{16}{\gamma}\Big)\log T\,\sqrt{nT}. \qquad (21)$$
Finally, combining (21) and (17), we obtain the following upper bound on the online regret:
$$\text{Regret}(\pi) \le (\bar r + \bar r m a_{\max})\cdot\mathbb{E}_{I\sim\mathcal G}\left[\sum_{j=1}^n\sum_{t=1}^T P_t(j)|q^*_j - q_{j,t}|\right] \le (\bar r + \bar r m a_{\max})\Big(\frac{4\beta}{\alpha} + \frac{8\beta k_1}{\alpha\gamma} + \frac{16}{\gamma}\Big)\log T\,\sqrt{nT} + (\bar r + \bar r m a_{\max})\bar r T\cdot\frac2T \le (\bar r + \bar r m a_{\max})\Big(\frac{4\beta}{\alpha} + \frac{8\beta k_1}{\alpha\gamma} + \frac{16}{\gamma} + 2\bar r\Big)\log T\,\sqrt{nT} = O(m\log T\,\sqrt{nT}),$$
where $\bar r, a_{\max}, \alpha, \beta, \gamma, k_1$ are all constants.

Appendix E: Further Discussion and Missing Proof in Section 6

Introduction to the Semi-fluid Relaxation: To obtain poly-logarithmic regret with the policy of Section 6, we now introduce the definition of the semi-fluid relaxation. For a fixed instance $I$ with $d = (d_1,\dots,d_n)$, we formulate the semi-fluid relaxation of $V^{off}_C(I)$:
$$\max_x \sum_{j\in[n]} d_j\,\mathbb{E}_{r\sim F_j}[r\cdot x_j(r)] \quad \text{s.t.} \quad \sum_{j\in[n]} d_j\,a_{j,i}\,\mathbb{E}_{r\sim F_j}[x_j(r)] \le C_i \ \ \forall i\in[m], \qquad x_j(r)\in[0,1] \ \ \forall j\in[n],\ r\in[\underline r_j,\bar r_j], \qquad (V^{semi}_C(I))$$
where $d_j$ denotes the number of type-$j$ query arrivals and $d$ depends on the sample path $I$. From Lemma 1, the semi-fluid relaxation also yields an upper bound on the offline optimum:
$$V^{fld}_C = \mathbb{E}_{I\sim\mathcal G}[V^{semi}_C(I)] \ge \mathbb{E}_{I\sim\mathcal G}[V^{off}_C(I)].$$
Following Lemma 2, the semi-fluid relaxation problem can be equivalently rewritten as
$$\max_q \sum_{j\in[n]} d_j\int_{1-q_j}^1 F^{-1}_j(u)\,du \quad \text{s.t.} \quad \sum_{j\in[n]} d_j\,a_{j,i}\,q_j \le C_i \ \ \forall i\in[m], \qquad q_j\in[0,1] \ \ \forall j\in[n], \qquad (\bar V^{semi}_C(I))$$
where the decision variable $q_j$ represents the probability of serving a query of type $j$.
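The quantile rewriting above can be checked numerically: for a threshold rule that serves a type-$j$ query exactly when $r \ge F_j^{-1}(1-q_j)$, the expected collected reward per arrival equals $\int_{1-q_j}^1 F^{-1}_j(u)\,du$. Below is a small Monte Carlo check with $F = \mathrm{Exp}(1)$ and $q = 0.3$ (an illustrative choice of distribution and quantile, not from the paper):

    import numpy as np
    from scipy.integrate import quad

    q = 0.3
    F_inv = lambda u: -np.log(1.0 - u)       # quantile function of Exp(1)
    # fluid term integral_{1-q}^{1} F^{-1}(u) du, via the substitution v = 1 - u
    fluid, _ = quad(lambda v: -np.log(v), 0.0, q)

    rng = np.random.default_rng(1)
    r = rng.exponential(size=1_000_000)
    served = np.mean(r * (r >= F_inv(1.0 - q)))  # reward under the threshold rule

    print(fluid, served)                          # both approx 0.661

This is exactly the sense in which the quantile variables $q_j$ summarize the optimal reward-dependent acceptance rule $x_j(r)$.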
To achieve poly-logarithmic regret, we now consider the problem starting from time period $t$. We define $b_{j,t}$ as the (random) number of type-$j$ query arrivals from period $t$ to $T$ in the problem instance $I$. Then, on the remaining problem instance $I_t = \{(r_s,a_s)\}_{s=t}^T$, we consider the following semi-fluid problem at time period $t$, which serves as a relaxation of the total reward collected by the prophet from period $t$ to $T$ given the remaining capacity $c$:
$$\max_q \sum_{j\in[n]} b_{j,t}\int_{1-q_j}^1 F^{-1}_j(u)\,du \quad \text{s.t.} \quad \sum_{j\in[n]} b_{j,t}\,a_{j,i}\,q_j \le c_i \ \ \forall i\in[m], \qquad q_j\in[0,1] \ \ \forall j\in[n], \qquad (\bar V^{semi}_c(I_t))$$
and we denote by $\{\tilde q^*_{j,t}\}_{j=1}^n$ its optimal solution.

Proof of Lemma 5: We introduce an intermediate variable $\tilde q_{j,t}$ and decompose $|\tilde q^*_{j,t} - \hat q_{j,t}|$ into two parts using the triangle inequality. Specifically, for instance $I_t$ we consider an intermediate estimation problem in which the reward distributions are known exactly:
$$\max_q \sum_{j\in[n]} \hat b_{j,t}\int_{1-q_j}^1 F^{-1}_j(u)\,du \quad \text{s.t.} \quad \sum_{j\in[n]} \hat b_{j,t}\,a_{j,i}\,q_j \le c_i \ \ \forall i\in[m], \qquad q_j\in[0,1] \ \ \forall j\in[n], \qquad (\tilde V_{t,c}(I_t))$$
and let $\{\tilde q_{j,t}\}_{j=1}^n$ denote its optimal solution. Then, for any $j\in[n]$,
$$|\tilde q^*_{j,t} - \hat q_{j,t}| \le |\tilde q^*_{j,t} - \tilde q_{j,t}| + |\tilde q_{j,t} - \hat q_{j,t}|. \qquad (22)$$
We first consider the first term on the right-hand side. By the strong concavity in (11),
$$\frac1\beta(\tilde q_{j,t} - \tilde q^*_{j,t})^2 \le \big(G'_j(\tilde q_{j,t}) - G'_j(\tilde q^*_{j,t})\big)(\tilde q^*_{j,t} - \tilde q_{j,t}).$$
Following the analysis of Section C, with $b_{j,t}$ and $c_i$ in place of $\mu_j$ and $C_i$ for every $j$ and $i$, we have
$$\frac1\beta\sum_{j=1}^n b_{j,t}(\tilde q_{j,t} - \tilde q^*_{j,t})^2 \le \frac1\alpha\sum_{j=1}^n|b_{j,t} - \hat b_{j,t}|\,|\tilde q_{j,t} - \tilde q^*_{j,t}|.$$
By the Cauchy-Schwarz inequality,
$$\frac1\beta\sum_{j=1}^n b_{j,t}(\tilde q_{j,t} - \tilde q^*_{j,t})^2 \le \frac1\alpha\Big(\sum_{j=1}^n b_{j,t}(\tilde q_{j,t} - \tilde q^*_{j,t})^2\Big)^{\frac12}\Big(\sum_{j=1}^n\frac{|b_{j,t} - \hat b_{j,t}|^2}{b_{j,t}}\Big)^{\frac12},$$
and therefore
$$\sum_{j=1}^n b_{j,t}(\tilde q_{j,t} - \tilde q^*_{j,t})^2 \le \Big(\frac\beta\alpha\Big)^2\sum_{j=1}^n\frac{|b_{j,t} - \hat b_{j,t}|^2}{b_{j,t}}.$$
From Section B, we have (6). Since $b_{j,t}$ and $\hat b_{j,t}$ are independent and identically distributed, with probability at least $1-\frac1T$ we have
$$\sum_{j=1}^n b_{j,t}|\tilde q_{j,t} - \tilde q^*_{j,t}| \le \Big(\sum_{j=1}^n b_{j,t}\Big)^{\frac12}\Big(\sum_{j=1}^n b_{j,t}(\tilde q_{j,t} - \tilde q^*_{j,t})^2\Big)^{\frac12} \le \sqrt s\cdot\frac\beta\alpha\Big(\sum_{j=1}^n\frac{|b_{j,t} - \hat b_{j,t}|^2}{b_{j,t}}\Big)^{\frac12} \le \sqrt s\cdot\frac\beta\alpha\cdot 4\sqrt2\log(2T)\sqrt n \le \frac{4\beta}{\alpha}\log T\,\sqrt{ns},$$
where the first inequality holds by the Cauchy-Schwarz inequality and the second holds since $\sum_{j=1}^n b_{j,t} = s$. Thus $b_{j,t}|\tilde q_{j,t} - \tilde q^*_{j,t}| \le \frac{4\beta}{\alpha}\log T\sqrt{ns}$ holds for every $j\in[n]$. With Assumption 3, we know that $b_{j,t} \ge \gamma s$, so with probability at least $1-\frac1T$,
$$|\tilde q_{j,t} - \tilde q^*_{j,t}| \le \frac{4\beta\sqrt n}{\alpha\gamma}\cdot\frac{\log T}{\sqrt s}, \qquad \forall j\in[n]. \qquad (23)$$
For the second term on the right-hand side of (22), we follow the definition of $\hat G_{j,t}$ in (19) and apply the strong concavity of $G_j$. Consequently,
$$\frac{1}{2\beta}\sum_{j=1}^n\hat b_{j,t}(\tilde q_{j,t} - \hat q_{j,t})^2 \le \sum_{j=1}^n\hat b_{j,t}G_j(\hat q_{j,t}) - \sum_{j=1}^n\hat b_{j,t}G_j(\tilde q_{j,t}) + \sum_{j=1}^n\hat b_{j,t}G'_j(\hat q_{j,t})(\tilde q_{j,t} - \hat q_{j,t}) \le \sum_{j=1}^n\hat b_{j,t}G'_j(\hat q_{j,t})(\tilde q_{j,t} - \hat q_{j,t}) = \sum_{j=1}^n\hat b_{j,t}\hat G'_{j,t}(\hat q_{j,t})(\tilde q_{j,t} - \hat q_{j,t}) + \sum_{j=1}^n\hat b_{j,t}\big(G'_j(\hat q_{j,t}) - \hat G'_{j,t}(\hat q_{j,t})\big)(\tilde q_{j,t} - \hat q_{j,t}) \le \sum_{j=1}^n\hat b_{j,t}\big|F^{-1}_j(1-\hat q_{j,t}) - \hat F^{-1}_{j,t}(1-\hat q_{j,t})\big|\,|\tilde q_{j,t} - \hat q_{j,t}| \le \sum_{j=1}^n\hat b_{j,t}\frac{\epsilon_{j,t}}{\alpha}|\tilde q_{j,t} - \hat q_{j,t}|,$$
where $\epsilon_{j,t} = \sup_x|\hat F_{j,t}(x) - F_j(x)|$.
The second inequality relies on the optimality of $\{\tilde q_{j,t}\}_{j=1}^n$ for $\tilde V_{t,c}(I_t)$, and the third inequality on the optimality of $\{\hat q_{j,t}\}_{j=1}^n$ for $\hat V_{t,c}(I_t)$. By the Cauchy-Schwarz inequality,
$$\frac{1}{2\beta}\sum_{j=1}^n\hat b_{j,t}(\tilde q_{j,t}-\hat q_{j,t})^2 \le \sum_{j=1}^n\hat b_{j,t}\frac{\epsilon_{j,t}}{\alpha}|\tilde q_{j,t}-\hat q_{j,t}| \le \frac1\alpha\Big(\sum_{j=1}^n\hat b_{j,t}(\tilde q_{j,t}-\hat q_{j,t})^2\Big)^{\frac12}\Big(\sum_{j=1}^n\hat b_{j,t}\epsilon_{j,t}^2\Big)^{\frac12}.$$
By Lemma 4, with probability at least $1-\delta$, we have $\epsilon_{j,t} = \sup_x|\hat F_{j,t}(x)-F_j(x)| \le k_1\sqrt{\log(1/\delta)/n_{j,t}}$, where $k_1$ is a constant. Moreover, under Assumption 4, $\mathbb{E}_{I\sim\mathcal G}[n_{j,t}] \ge \gamma t = \gamma(T-s+1)$. Therefore, with probability at least $1-\frac1T$,
$$\sum_{j=1}^n\hat b_{j,t}(\tilde q_{j,t}-\hat q_{j,t})^2 \le \Big(\frac{2\beta}{\alpha}\Big)^2\sum_{j=1}^n\hat b_{j,t}\epsilon_{j,t}^2 \le \Big(\frac{2\beta}{\alpha}\Big)^2\cdot\frac{k_1^2\log T}{\gamma(T-s+1)}\sum_{j=1}^n\hat b_{j,t} = \Big(\frac{2\beta}{\alpha}\Big)^2\cdot\frac{k_1^2\,s\log T}{\gamma(T-s+1)},$$
where the last equality holds since $\sum_{j=1}^n\hat b_{j,t} = s$. Therefore, by the Cauchy-Schwarz inequality,
$$\sum_{j=1}^n\hat b_{j,t}|\tilde q_{j,t}-\hat q_{j,t}| \le \Big(\sum_{j=1}^n\hat b_{j,t}\Big)^{\frac12}\Big(\sum_{j=1}^n\hat b_{j,t}(\tilde q_{j,t}-\hat q_{j,t})^2\Big)^{\frac12} \le s\cdot\frac{2\beta k_1}{\alpha}\sqrt{\frac{\log T}{\gamma(T-s+1)}}.$$
Thus $\hat b_{j,t}|\tilde q_{j,t}-\hat q_{j,t}| \le s\cdot\frac{2\beta k_1}{\alpha}\sqrt{\frac{\log T}{\gamma(T-s+1)}}$ holds for every $j\in[n]$. With Assumption 3, we know that $\hat b_{j,t} \ge \gamma s$, so with probability at least $1-\frac1T$,
$$|\tilde q_{j,t}-\hat q_{j,t}| \le \frac{2\beta k_1}{\alpha\sqrt{\gamma^3}}\sqrt{\frac{\log T}{T-s+1}}, \qquad \forall j\in[n]. \qquad (24)$$
Combining (22), (23) and (24), for any $j\in[n]$, with probability at least $1-\frac1T$,
$$|\tilde q^*_{j,t}-\hat q_{j,t}| \le \frac{4\beta\sqrt n}{\alpha\gamma}\cdot\frac{\log T}{\sqrt s} + \frac{2\beta k_1}{\alpha\sqrt{\gamma^3}}\sqrt{\frac{\log T}{T-s+1}} \le \left(\frac{4\beta\sqrt n}{\alpha\gamma} + \frac{2\beta k_1}{\alpha\sqrt{\gamma^3}}\right)\left(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\right),$$
where $\alpha,\beta,\gamma,k_1,\bar r$ are all constants.

Proof of Theorem 4: We denote by $\bar V_c(I_t)$ a relaxation of the total reward collected by the prophet from time period $t$ to $T$, given the remaining problem instance $I_t = \{(r_s,a_s)\}_{s=t}^T$ and the remaining capacity $c$. The regret of any online policy $\pi$ over the whole time horizon $T$ can therefore be upper bounded by the expected gap between $\bar V_C(I_1)$ and $V^\pi_C(I)$, where $C = (C_1,\dots,C_m)$ is the vector of initial capacities of all resources:
$$\text{Regret}(\pi) \le \mathbb{E}_{I_1\sim\mathcal G}[\bar V_C(I_1)] - \mathbb{E}_{I\sim\mathcal G}[V^\pi(I)]. \qquad (25)$$
For each $t\in[T]$, denote by $c^\pi_t = (c^\pi_{t,1},\dots,c^\pi_{t,m})$ the remaining capacities of the resources at the beginning of period $t$ during the implementation of policy $\pi$; $c^\pi_t$ is random for each period $t$ because of the randomness in the problem instance $I$ and the policy $\pi$. Noting that $c^\pi_1 = C$ and $\bar V_c(I_{T+1}) = 0$ for every $c$, the regret upper bound in (25) can be decomposed as
$$\mathbb{E}_{I_1\sim\mathcal G}[\bar V_C(I_1)] - \mathbb{E}_{I\sim\mathcal G}[V^\pi(I)] = \mathbb{E}_{I_1\sim\mathcal G}\left[\sum_{t=1}^T\big(\bar V_{c^\pi_t}(I_t) - \bar V_{c^\pi_{t+1}}(I_{t+1})\big)\right] - \mathbb{E}_{I\sim\mathcal G}[V^\pi(I)] = \sum_{t=1}^T\mathbb{E}_{I_t\sim\mathcal G}\left[\bar V_{c^\pi_t}(I_t) - \bar V_{c^\pi_t - a_t\cdot x^\pi_t}(I_{t+1}) - r_t\cdot x^\pi_t\right],$$
where $x^\pi_t$ is the decision of policy $\pi$ at period $t$. For each $c \ge 0$, we define the myopic regret
$$\text{Myopic}_t(\pi,c) = \mathbb{E}_{I_t\sim\mathcal G}\left[\bar V_c(I_t) - \bar V_{c - a_t\cdot x^\pi_t}(I_{t+1}) - r_t\cdot x^\pi_t\right]. \qquad (26)$$
In practice, applying Algorithm 5 may lead to infeasibility, which introduces additional regret between our policy and the benchmark. To illustrate this, consider a virtual buffer introduced for the algorithm, containing exactly one unit of resource to be consumed by an arriving query.
When the real remaining capacity is sufficient to accept the query, the policy behaves identically in the real and virtual settings. However, once the real remaining capacity is no longer sufficient to accept an arriving query, the policy rejects all subsequent queries in the real setting. Consequently, the only discrepancy between the virtual and real cases occurs at the first time period at which the real capacity of a resource runs out: in the virtual case the query can still be accepted using the remaining buffer, whereas in practice it must be rejected. This discrepancy results in a performance gap of at most $\bar r$ for each resource $i\in[m]$. Therefore, the total regret can be rewritten as
$$\text{Regret}(\pi) \le \sum_{t=1}^T\mathbb{E}_{c^\pi_t}[\text{Myopic}_t(\pi,c^\pi_t)] + m\cdot\bar r. \qquad (27)$$
We now derive an upper bound on the myopic regret. As defined in $\bar V^{semi}_c(I_t)$, we have
$$\bar V^{semi}_c(I_t) = \sum_{j=1}^n b_{j,t+1}\int_{1-\tilde q^*_{j,t}}^1 F^{-1}_j(u)\,du + \int_{1-\tilde q^*_{j_t,t}}^1 F^{-1}_{j_t}(u)\,du, \qquad (28)$$
where $j_t$ denotes the type of query $t$ in the instance $I$. Setting $\bar V_c(I_t) := \bar V^{semi}_c(I_t)$ in (26) and denoting by $q^\pi_{j,t}$ the probability that query $t$ is served by the online policy $\pi$,
$$\text{Myopic}_t(\pi,c) = \mathbb{E}_{I_{t+1}\sim\mathcal G}\Bigg[\sum_{j=1}^n b_{j,t+1}\int_{1-\tilde q^*_{j,t}}^1 F^{-1}_j(u)\,du + \int_{1-\tilde q^*_{j_t,t}}^1 F^{-1}_{j_t}(u)\,du - \int_{1-q^\pi_{j_t,t}}^1 F^{-1}_{j_t}(u)\,du - \mathbb{E}_{r\sim F_{j_t}}\big[\bar V^{semi}_{c-a_{j_t}\cdot x^\pi_t(r)}(I_{t+1})\big]\Bigg]$$
$$= \mathbb{E}_{I_{t+1}\sim\mathcal G}\Bigg[\sum_{j=1}^n b_{j,t+1}\int_{1-\tilde q^*_{j,t}}^1 F^{-1}_j(u)\,du + \int_{1-\tilde q^*_{j_t,t}}^{1-q^\pi_{j_t,t}} F^{-1}_{j_t}(u)\,du - q^\pi_{j_t,t}\,\bar V^{semi}_{c-a_{j_t}}(I_{t+1}) - (1-q^\pi_{j_t,t})\,\bar V^{semi}_c(I_{t+1})\Bigg]. \qquad (29)$$
We now distinguish three cases according to the quantile values computed from $\hat V_{t,c}(I_t)$, and construct feasible solutions to $\bar V^{semi}_{c-a_{j_t}}(I_{t+1})$ and $\bar V^{semi}_c(I_{t+1})$ to upper bound the myopic regret.

Case 1: $\hat q_{j_t,t} \ge 1 - 2\kappa\big(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\big)$. From Lemma 5,
$$|\tilde q^*_{j_t,t} - \hat q_{j_t,t}| \le \kappa\left(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\right),$$
which implies $\tilde q^*_{j_t,t} \ge 1 - 3\kappa\big(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\big) \ge \frac12$ when $s \ge 144\kappa^2(\log T)^2$ and $T-s+1 \ge 144\kappa^2(\log T)^2$. We know that $c \ge \hat b_{j_t,t}\cdot\hat q_{j_t,t}\cdot a_{j_t} \ge \frac{\gamma s\,a_{j_t}}{2} \ge a_{j_t}$ for $s \ge \frac2\gamma$. Therefore, we always have enough remaining capacity to serve query $t$ of type $j_t$. At such a high quantile, query $t$ of type $j_t$ should be accepted by our algorithm; for simplicity, we set $q^\pi_{j_t,t} = 1$. Thus we only need to construct a feasible solution to $\bar V^{semi}_{c-a_{j_t}}(I_{t+1})$, as it contributes negatively to the myopic regret. From the feasibility of $\{\tilde q^*_{j,t}\}_{j=1}^n$, we know
$$\sum_{j=1}^n b_{j,t+1}\,a_{j,i}\,\tilde q^*_{j,t} + a_{j_t,i}\,\tilde q^*_{j_t,t} \le c_i, \qquad \forall i\in[m]. \qquad (30)$$
We construct the solution $\{\tilde q'_{j,t}\}_{j=1}^n$ satisfying
$$\tilde q'_{j,t} = \tilde q^*_{j,t} \ \ \forall j\ne j_t, \qquad \tilde q'_{j_t,t} = \tilde q^*_{j_t,t} + \frac{\tilde q^*_{j_t,t}-1}{b_{j_t,t+1}}. \qquad (31)$$
Then $\sum_{j=1}^n b_{j,t+1}\,a_{j,i}\,\tilde q'_{j,t} \le c_i - a_{j_t,i}$ for every $i\in[m]$. Therefore $\{\tilde q'_{j,t}\}_{j=1}^n$ is a feasible solution to $\bar V^{semi}_{c-a_{j_t}}(I_{t+1})$, where $\tilde q'_{j_t,t} \ge 0$ follows from $\tilde q^*_{j_t,t} \ge \frac12$ and $b_{j_t,t+1} \ge 1$.
Consequently, following (29) and setting $q^\pi_{j_t,t} = 1$, we obtain an upper bound on the myopic regret:
$$\text{Myopic}_t(\pi,c) \le \mathbb{E}_{I_{t+1}\sim\mathcal G}\Bigg[\sum_{j=1}^n b_{j,t+1}\int_{1-\tilde q^*_{j,t}}^1 F^{-1}_j(u)\,du + \int_{1-\tilde q^*_{j_t,t}}^{1-q^\pi_{j_t,t}} F^{-1}_{j_t}(u)\,du - \sum_{j=1}^n b_{j,t+1}\int_{1-\tilde q'_{j,t}}^1 F^{-1}_j(u)\,du\Bigg] = \mathbb{E}_{I_{t+1}\sim\mathcal G}\Bigg[b_{j_t,t+1}\int_{1-\tilde q^*_{j_t,t}}^{1-\tilde q'_{j_t,t}} F^{-1}_{j_t}(u)\,du + \int_{1-\tilde q^*_{j_t,t}}^{1-q^\pi_{j_t,t}} F^{-1}_{j_t}(u)\,du\Bigg].$$
We introduce the following lemma:

Lemma 6. For any $q_1, q_2 \in [0,1]$, it holds that
$$\int_{1-q_1}^{1-q_2} F^{-1}_j(u)\,du \le F^{-1}_j(1-q_1)\cdot(q_1-q_2) + \frac{(q_1-q_2)^2}{2\alpha}$$
for any $j\in[n]$, where $\alpha$ is the lower bound on the density function $f_j(\cdot)$ defined in Assumption 1.

Applying Lemma 6,
$$\int_{1-\tilde q^*_{j_t,t}}^{1-\tilde q'_{j_t,t}} F^{-1}_{j_t}(u)\,du \le F^{-1}_{j_t}(1-\tilde q^*_{j_t,t})\cdot\frac{1-\tilde q^*_{j_t,t}}{b_{j_t,t+1}} + \frac{(1-\tilde q^*_{j_t,t})^2}{2\alpha\,b_{j_t,t+1}^2},$$
$$\int_{1-\tilde q^*_{j_t,t}}^{1-q^\pi_{j_t,t}} F^{-1}_{j_t}(u)\,du \le F^{-1}_{j_t}(1-\tilde q^*_{j_t,t})\cdot(\tilde q^*_{j_t,t}-1) + \frac{(1-\tilde q^*_{j_t,t})^2}{2\alpha}.$$
Therefore, with probability at least $1-\frac1T$, we get
$$\text{Myopic}_t(\pi,c) \le \mathbb{E}_{I_{t+1}\sim\mathcal G}\left[\frac{(1-\tilde q^*_{j_t,t})^2}{2\alpha} + \frac{(1-\tilde q^*_{j_t,t})^2}{2\alpha\,b_{j_t,t+1}}\right] \le \mathbb{E}_{I_{t+1}\sim\mathcal G}\left[\frac{2(1-\hat q_{j_t,t})^2}{2\alpha} + \frac{2(\hat q_{j_t,t}-\tilde q^*_{j_t,t})^2}{2\alpha} + \frac{1}{2\alpha\,b_{j_t,t+1}}\right] \le \frac{5\kappa^2(\log T)^2}{\alpha}\left(\frac1s + \frac{1}{T-s+1} + \frac{2}{\sqrt{s(T-s+1)}}\right) + \frac{1}{2\alpha\gamma(s-1)}, \qquad (32)$$
where the second inequality follows from the basic inequality $(x+y)^2 \le 2x^2+2y^2$, and the third holds by $\hat q_{j_t,t} \ge 1 - 2\kappa\big(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\big)$, Lemma 5 and Assumption 4.

Case 2: $\hat q_{j_t,t} \le 2\kappa\big(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\big)$. From Lemma 5,
$$|\tilde q^*_{j_t,t} - \hat q_{j_t,t}| \le \kappa\left(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\right),$$
which implies $\tilde q^*_{j_t,t} \le 3\kappa\big(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\big) \le \frac12$ when $144\kappa^2(\log T)^2 \le s \le T+1-144\kappa^2(\log T)^2$. At such a low quantile, query $t$ of type $j_t$ should be rejected by our algorithm with high probability; for simplicity, we set $q^\pi_{j_t,t} = 0$. Thus we only need to construct a feasible solution to $\bar V^{semi}_c(I_{t+1})$, as it contributes negatively to the myopic regret. We construct the solution $\{\tilde q''_{j,t}\}_{j=1}^n$ satisfying
$$\tilde q''_{j,t} = \tilde q^*_{j,t} \ \ \forall j\ne j_t, \qquad \tilde q''_{j_t,t} = \tilde q^*_{j_t,t}\cdot\frac{b_{j_t,t+1}+1}{b_{j_t,t+1}}. \qquad (33)$$
By the feasibility of $\{\tilde q^*_{j,t}\}_{j=1}^n$ in (30), we have $\sum_{j=1}^n b_{j,t+1}\,a_{j,i}\,\tilde q''_{j,t} \le c_i$ for every $i\in[m]$. Therefore $\{\tilde q''_{j,t}\}_{j=1}^n$ is a feasible solution to $\bar V^{semi}_c(I_{t+1})$, where $\tilde q''_{j_t,t} \le 1$ follows from $\tilde q^*_{j_t,t} \le \frac12$ and $b_{j_t,t+1} \ge 1$.
Following (29) and setting $q^\pi_{j_t,t} = 0$, we obtain an upper bound on the myopic regret:
$$\text{Myopic}_t(\pi,c) \le \mathbb{E}_{I_{t+1}\sim\mathcal G}\Bigg[\sum_{j=1}^n b_{j,t+1}\int_{1-\tilde q^*_{j,t}}^1 F^{-1}_j(u)\,du + \int_{1-\tilde q^*_{j_t,t}}^{1-q^\pi_{j_t,t}} F^{-1}_{j_t}(u)\,du - \sum_{j=1}^n b_{j,t+1}\int_{1-\tilde q''_{j,t}}^1 F^{-1}_j(u)\,du\Bigg] = \mathbb{E}_{I_{t+1}\sim\mathcal G}\Bigg[b_{j_t,t+1}\int_{1-\tilde q^*_{j_t,t}}^{1-\tilde q''_{j_t,t}} F^{-1}_{j_t}(u)\,du + \int_{1-\tilde q^*_{j_t,t}}^{1-q^\pi_{j_t,t}} F^{-1}_{j_t}(u)\,du\Bigg].$$
Applying Lemma 6,
$$\int_{1-\tilde q^*_{j_t,t}}^{1-\tilde q''_{j_t,t}} F^{-1}_{j_t}(u)\,du \le F^{-1}_{j_t}(1-\tilde q^*_{j_t,t})\cdot\frac{-\tilde q^*_{j_t,t}}{b_{j_t,t+1}} + \frac{(\tilde q^*_{j_t,t})^2}{2\alpha\,b_{j_t,t+1}^2},$$
$$\int_{1-\tilde q^*_{j_t,t}}^{1-q^\pi_{j_t,t}} F^{-1}_{j_t}(u)\,du \le F^{-1}_{j_t}(1-\tilde q^*_{j_t,t})\cdot\tilde q^*_{j_t,t} + \frac{(\tilde q^*_{j_t,t})^2}{2\alpha}.$$
Therefore, with probability at least $1-\frac1T$, we get
$$\text{Myopic}_t(\pi,c) \le \mathbb{E}_{I_{t+1}\sim\mathcal G}\left[\frac{(\tilde q^*_{j_t,t})^2}{2\alpha} + \frac{(\tilde q^*_{j_t,t})^2}{2\alpha\,b_{j_t,t+1}}\right] \le \mathbb{E}_{I_{t+1}\sim\mathcal G}\left[\frac{2(\hat q_{j_t,t})^2}{2\alpha} + \frac{2(\hat q_{j_t,t}-\tilde q^*_{j_t,t})^2}{2\alpha} + \frac{1}{2\alpha\,b_{j_t,t+1}}\right] \le \frac{5\kappa^2(\log T)^2}{\alpha}\left(\frac1s + \frac{1}{T-s+1} + \frac{2}{\sqrt{s(T-s+1)}}\right) + \frac{1}{2\alpha\gamma(s-1)}, \qquad (34)$$
where the second inequality follows from the basic inequality $(x+y)^2 \le 2x^2+2y^2$, and the third holds by $\hat q_{j_t,t} \le 2\kappa\big(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\big)$, Lemma 5 and Assumption 4.

Case 3: $2\kappa\big(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\big) \le \hat q_{j_t,t} \le 1 - 2\kappa\big(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\big)$. We know that $c \ge \hat b_{j_t,t}\cdot\hat q_{j_t,t}\cdot a_{j_t} \ge \gamma s\cdot\frac{2\kappa\log T}{\sqrt s}\cdot a_{j_t} \ge a_{j_t}$ for $s \ge \frac{1}{4\gamma^2\kappa^2(\log T)^2}$. Therefore, we always have enough remaining capacity to serve query $t$ of type $j_t$. From Lemma 5,
$$|\tilde q^*_{j_t,t} - \hat q_{j_t,t}| \le \kappa\left(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\right),$$
which implies $\kappa\big(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\big) \le \tilde q^*_{j_t,t} \le 1 - \kappa\big(\frac{\log T}{\sqrt s} + \frac{\log T}{\sqrt{T-s+1}}\big)$. In this case, the offline optimum accepts query $t$ with a probability neither close to 0 nor close to 1, so we set $q^\pi_{j_t,t} = \hat q_{j_t,t}$ in the algorithm. We construct the solution $\{\tilde q'_{j,t}\}_{j=1}^n$ to $\bar V^{semi}_{c-a_{j_t}}(I_{t+1})$ as in (31) and the solution $\{\tilde q''_{j,t}\}_{j=1}^n$ to $\bar V^{semi}_c(I_{t+1})$ as in (33). Then, for $s \ge \frac{4}{\gamma^2\kappa^2(\log T)^2}$, we have
$$\tilde q^*_{j_t,t} \ge \frac{\kappa\log T}{\sqrt s} \ge \frac{1}{\gamma(s-1)} \ge \frac{1}{b_{j_t,t+1}}, \qquad \tilde q^*_{j_t,t} \le 1 - \frac{\kappa\log T}{\sqrt s} \le 1 - \frac{1}{\gamma(s-1)} \le 1 - \frac{1}{b_{j_t,t+1}}.$$
Therefore $\tilde q'_{j_t,t} \ge 0$ and $\tilde q''_{j_t,t} \le 1$. By the feasibility of $\{\tilde q^*_{j,t}\}_{j=1}^n$ in (30), we know that $\{\tilde q'_{j,t}\}_{j=1}^n$ is feasible for $\bar V^{semi}_{c-a_{j_t}}(I_{t+1})$ and $\{\tilde q''_{j,t}\}_{j=1}^n$ is feasible for $\bar V^{semi}_c(I_{t+1})$. Following (29) and Lemma 6, setting $q^\pi_{j_t,t} = \hat q_{j_t,t}$, with probability at least $1-\frac1T$ we have
$$\text{Myopic}_t(\pi,c) \le \mathbb{E}_{I_{t+1}\sim\mathcal G}\Bigg[q^\pi_{j_t,t}\,b_{j_t,t+1}\int_{1-\tilde q^*_{j_t,t}}^{1-\tilde q'_{j_t,t}} F^{-1}_{j_t}(u)\,du + \int_{1-\tilde q^*_{j_t,t}}^{1-q^\pi_{j_t,t}} F^{-1}_{j_t}(u)\,du + (1-q^\pi_{j_t,t})\,b_{j_t,t+1}\int_{1-\tilde q^*_{j_t,t}}^{1-\tilde q''_{j_t,t}} F^{-1}_{j_t}(u)\,du\Bigg]$$
$$\le \mathbb{E}_{I_{t+1}\sim\mathcal G}\left[q^\pi_{j_t,t}\frac{(1-\tilde q^*_{j_t,t})^2}{2\alpha\,b_{j_t,t+1}} + (1-q^\pi_{j_t,t})\frac{(\tilde q^*_{j_t,t})^2}{2\alpha\,b_{j_t,t+1}} + \frac{(\tilde q^*_{j_t,t}-\hat q_{j_t,t})^2}{2\alpha}\right] \le \mathbb{E}_{I_{t+1}\sim\mathcal G}\left[\frac{1}{2\alpha\,b_{j_t,t+1}} + \frac{(\tilde q^*_{j_t,t}-\hat q_{j_t,t})^2}{2\alpha}\right] \le \frac{1}{2\alpha\gamma(s-1)} + \frac{\kappa^2(\log T)^2}{2\alpha}\left(\frac1s + \frac{1}{T-s+1} + \frac{2}{\sqrt{s(T-s+1)}}\right). \qquad (35)$$
We define $s_0 = \max\{144\kappa^2(\log T)^2,\ \frac{4}{\gamma^2\kappa^2(\log T)^2},\ \frac2\gamma\}$.
To conclude, when $s_0 \le s \le T - s_0 + 1$, combining (32), (34) and (35), with probability at least $1-\frac1T$ we have
$$\text{Myopic}_t(\pi,c) \le \frac{1}{2\alpha\gamma(s-1)} + \frac{5\kappa^2(\log T)^2}{\alpha}\left(\frac1s + \frac{1}{T-s+1} + \frac{2}{\sqrt{s(T-s+1)}}\right).$$
Following (27), we obtain the upper bound on the total regret:
$$\text{Regret}(\pi) \le \sum_{t=1}^T\mathbb{E}_{c^\pi_t}[\text{Myopic}_t(\pi,c^\pi_t)] + m\cdot\bar r \le \left(\frac{1}{\alpha\gamma} + \frac{5\kappa^2(\log T)^2}{\alpha}\right)(2\log T + 2 + 2\pi) + 2s_0\cdot\bar r + \bar r T\cdot\frac1T + m\cdot\bar r = O(\sqrt n\cdot(\log T)^3 + m).$$

Proof of Lemma 6: Following (11), we know that $G''_j(q) \in [-\frac1\alpha, -\frac1\beta]$ for any $j\in[n]$. From the strong concavity of $G_j$ and the curvature lower bound $G''_j \ge -\frac1\alpha$, we have
$$\int_{1-q_1}^{1-q_2} F^{-1}_j(u)\,du = G_j(q_1) - G_j(q_2) \le G'_j(q_1)\cdot(q_1-q_2) + \frac{(q_1-q_2)^2}{2\alpha} = F^{-1}_j(1-q_1)\cdot(q_1-q_2) + \frac{(q_1-q_2)^2}{2\alpha}.$$
