Wasserstein Distributionally Robust Online Learning


Proceedings of Machine Learning Research vol:1-41, 2026

Wasserstein Distributionally Robust Online Learning

Guixian Chen (GXCHEN@UMICH.EDU), University of Michigan
Salar Fattahi (FATTAHI@UMICH.EDU), University of Michigan
Soroosh Shafiee (SHAFIEE@CORNELL.EDU), Cornell University

Abstract

We study distributionally robust online learning, where a risk-averse learner updates decisions sequentially to guard against worst-case distributions drawn from a Wasserstein ambiguity set centered at past observations. While this paradigm is well understood in the offline setting through Wasserstein Distributionally Robust Optimization (DRO), its online extension poses significant challenges in both convergence and computation. In this paper, we address these challenges. First, we formulate the problem as an online saddle-point stochastic game between a decision maker and an adversary selecting worst-case distributions, and propose a general framework that converges to a robust Nash equilibrium coinciding with the solution of the corresponding offline Wasserstein DRO problem. Second, we address the main computational bottleneck, which is the repeated solution of worst-case expectation problems. For the important class of piecewise concave loss functions, we propose a tailored algorithm that exploits the problem geometry to achieve substantial speedups over state-of-the-art solvers such as Gurobi. The key insight is a novel connection between the worst-case expectation problem (an inherently infinite-dimensional optimization problem) and a classical and tractable budget allocation problem, which is of independent interest.

Keywords: risk-averse online learning, data-driven optimization, Wasserstein uncertainty

1. Introduction

The primary objective of statistical learning is to identify a decision rule $x$ within a feasible set $\mathcal{X} \subseteq \mathbb{R}^n$ that minimizes the expectation of a loss function $\ell : \mathcal{X} \times \Xi \to \mathbb{R}$ with respect to an underlying, unknown data-generating distribution $\mathbb{P}^\star \in \mathcal{P}(\Xi)$, where $\mathcal{P}(\Xi)$ denotes the set of all probability distributions supported on $\Xi \subseteq \mathbb{R}^m$. When $\mathbb{P}^\star$ is inaccessible but a static dataset of $T$ i.i.d. observations $\{\widehat{\xi}_1, \dots, \widehat{\xi}_T\}$ is available, Empirical Risk Minimization (ERM) approximates this goal using the empirical distribution $\widehat{\mathbb{P}}_T := \frac{1}{T}\sum_{t=1}^T \delta_{\widehat{\xi}_t}$, where $\delta_\xi$ denotes the Dirac measure centered at $\xi$. By replacing $\mathbb{P}^\star$ with this plug-in estimator, ERM solves the optimization problem
\[
\inf_{x \in \mathcal{X}} \Big\{ \mathbb{E}_{\xi \sim \widehat{\mathbb{P}}_T}[\ell(x,\xi)] = \frac{1}{T}\sum_{t=1}^T \ell(x, \widehat{\xi}_t) \Big\}.
\]
When observations arrive sequentially, the framework of Online Convex Optimization (OCO) provides efficient algorithms that minimize regret, ensuring convergence to the statistical learning solution as the time horizon $T \to \infty$ (Nemirovski et al., 2009; Shalev-Shwartz, 2012; Hazan, 2022).
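To ground these two estimation paradigms, here is a minimal Python sketch (our own illustration; the data model, feasible set, and step size are assumptions, not taken from the paper) comparing batch ERM with an online subgradient scheme under the absolute loss:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
xi = rng.normal(loc=1.0, scale=0.5, size=T)  # stream of samples from an (unknown) P*

def loss(x, xi):
    # absolute loss ell(x, xi) = |x - xi|: convex and Lipschitz in x
    return np.abs(x - xi)

# Batch ERM: minimize the empirical risk over a grid (minimizer = sample median)
grid = np.linspace(-2, 4, 601)
x_erm = grid[np.array([loss(x, xi).mean() for x in grid]).argmin()]

# Online projected subgradient descent on the same stream, eta_t = 1/sqrt(t)
x_t, avg = 0.0, 0.0
for t, s in enumerate(xi, start=1):
    g = np.sign(x_t - s)                             # subgradient of |x - s| at x_t
    x_t = np.clip(x_t - g / np.sqrt(t), -2.0, 4.0)   # projection onto X = [-2, 4]
    avg += (x_t - avg) / t                           # running average of the iterates

print(f"ERM solution ~ {x_erm:.3f}, online averaged iterate ~ {avg:.3f}")
```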
Despite its theoretical foundations, this standard statistical learning framework suffers from fundamental limitations. First, by relying solely on the expectation as a risk measure, it overlooks higher-order variations, failing to account for the risk sensitivity required in safety-critical applications. Second, the framework is notoriously brittle to data corruption during training, where measurement noise (Nettleton et al., 2010) or adversarial manipulation (Nietert et al., 2023, 2024) can severely degrade the learned model. Finally, it assumes the testing distribution perfectly matches the training distribution, causing performance to collapse under adversarial distribution shifts (Yang et al., 2024) or test-time corruption (Kurakin et al., 2017; Goodfellow et al., 2015).

Wasserstein Distributionally Robust Optimization (DRO) addresses these challenges in a unified manner. By defining an ambiguity set based on the Wasserstein distance, which captures the underlying geometry of the sample space, DRO effectively models geometric corruptions. Furthermore, it inherently regularizes the model against local perturbations, acting as a penalty on the Lipschitz constant or gradient variation of the loss. Formally, for a fixed $p \in [1,\infty)$, we consider a distributional ambiguity set centered around a reference distribution $\mathbb{P} \in \mathcal{P}(\Xi)$, defined as
\[
\mathbb{B}^p_\rho(\mathbb{P}) := \big\{ \mathbb{Q} \in \mathcal{P}(\Xi) : W_p^p(\mathbb{Q}, \mathbb{P}) \le \rho \big\},
\]
where $\rho > 0$ is the ambiguity radius and $W_p(\mathbb{P},\mathbb{Q})$ denotes the $p$-Wasserstein distance, defined as
\[
W_p(\mathbb{P},\mathbb{Q}) := \inf_{\pi \in \Pi(\mathbb{P},\mathbb{Q})} \big( \mathbb{E}_{(\xi,\xi') \sim \pi}[\|\xi - \xi'\|^p] \big)^{1/p}.
\]
Here, $\Pi(\mathbb{P},\mathbb{Q}) := \{\pi \in \mathcal{P}(\Xi^2) : \pi(\cdot \times \Xi) = \mathbb{P},\ \pi(\Xi \times \cdot) = \mathbb{Q}\}$ represents the set of all couplings with marginals $\mathbb{P}$ and $\mathbb{Q}$. Ideally, a risk-averse learner aims to solve the minimax problem centered at the true distribution $\mathbb{P}^\star$:
\[
\inf_{x \in \mathcal{X}} \sup_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P}^\star)} \mathbb{E}_{\xi \sim \mathbb{Q}}[\ell(x,\xi)]. \tag{1}
\]
We note that when $\rho = 0$, the ambiguity set collapses to a singleton, and the problem reduces to the standard statistical learning framework. Since $\mathbb{P}^\star$ is unknown, this problem is typically solved in the offline setting using a data-driven approximation. Specifically, given a static dataset of $T$ i.i.d. observations, the standard data-driven DRO approach (Mohajerin Esfahani and Kuhn, 2018) proceeds by substituting $\mathbb{P}^\star$ with $\widehat{\mathbb{P}}_T$ in (1) and solving the resulting optimization problem.

However, many modern applications operate in dynamic environments where data is not available as a static batch but arrives sequentially. In settings such as online recommendation systems (Bai et al., 2019; Wen et al., 2022) and real-time financial portfolio management (Costa and Iyengar, 2023), the learner must adapt to streaming data in real time. In these scenarios, waiting to accumulate a large dataset to solve a static DRO problem is computationally prohibitive and fails to capture temporal shifts. This necessitates algorithms that can learn sequentially while strictly controlling the risk of worst-case outcomes, motivating the central question of this paper:

How can we design efficient algorithms that learn sequentially from streaming data while remaining robust to worst-case distributions?

1.1. Summary of Contributions

We formulate the DRO problem (1) as an online zero-sum stochastic game. At each iteration $t$, the environment reveals a sample $\widehat{\xi}_t$ drawn from $\mathbb{P}^\star$. Simultaneously, the dual player (adversary) selects a worst-case distribution $\mathbb{Q}_t$ from a Wasserstein ambiguity set centered at the historical observations, while the primal player (decision maker) selects a decision $x_t$. The primal player then incurs the expected loss with respect to the dual player's chosen distribution, $\mathbb{E}_{\xi \sim \mathbb{Q}_t}[\ell(x_t,\xi)]$. Our objective is to design an online algorithm that competes against the offline, risk-averse benchmark defined in (1).
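For intuition about the dual player's feasible set, the following one-dimensional sketch (our own illustration, with an assumed radius and data model) checks whether a candidate distribution lies in the ball $\mathbb{B}^1_\rho(\widehat{\mathbb{P}}_t)$ using SciPy's univariate $W_1$ routine, scipy.stats.wasserstein_distance:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
history = rng.normal(0.0, 1.0, size=50)   # atoms of the empirical center P_hat_t
candidate = history + 0.3                 # a shifted candidate distribution Q
RHO = 0.5                                 # illustrative ambiguity radius (assumption)

# For p = 1 the ball constraint reads W_1(Q, P_hat_t) <= rho; in one dimension
# SciPy evaluates W_1 directly from the two empirical samples.
w1 = wasserstein_distance(candidate, history)
print(f"W1(Q, P_hat_t) = {w1:.3f}; inside the ball? {w1 <= RHO}")
```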
Solving the minimax problem (1) in an online fashion presents unique challenges that distinguish it from the application of standard OCO to saddle-point problems (Orabona, 2019, §12). Unlike typical min-max games where the dual variable lies in a fixed, finite-dimensional space, our maximization occurs over the space of probability measures, which is infinite-dimensional. Furthermore, the problem is inherently non-stationary and stochastic. That is, the dual player does not have access to the full ambiguity set $\mathbb{B}^p_\rho(\mathbb{P}^\star)$, but only observes a single sample $\widehat{\xi}_t$ at each step. Consequently, the immediate ambiguity set changes dynamically with every iteration as the center of the Wasserstein ball shifts based on the incoming data stream. In this work, we make equal contributions to both the theoretical foundations and the computational practicality of this field:

⋄ Novel Risk-Averse Framework: We propose a theoretical framework for online learning against Wasserstein uncertainty. We formulate the learning problem as an online saddle-point optimization between a primal player, responsible for updating the decision rule $x$, and a dual player that selects worst-case distributions within a Wasserstein ambiguity set. We show that the resulting online dynamics converge to a robust Nash equilibrium that coincides with the solution of the corresponding offline Wasserstein DRO problem. This provides theoretical guarantees for learning decisions that control tail risk and prevent large losses under admissible distributional perturbations.

⋄ Efficient Computation: To overcome the computational bottleneck of the inner maximization, we develop specialized and highly efficient algorithms for computing the worst-case expectation. Focusing on the important class of piecewise concave loss functions, our method achieves a $\delta$-optimal solution in $O(\mathrm{polylog}(1/\delta))$ iterations. The efficiency of the proposed approach arises from a novel connection between the worst-case expectation problem (an inherently infinite-dimensional optimization problem) and a classical budget allocation problem, a connection we believe is of independent interest.

1.2. Related Works

Wasserstein DRO. The Wasserstein metric provides a natural framework for modeling geometric uncertainty and data corruption by capturing the underlying geometry of the input space. To enable practical implementation, convex duality results have recently been developed to make Wasserstein DRO computationally efficient (Mohajerin Esfahani and Kuhn, 2018; Blanchet and Murthy, 2019; Gao and Kleywegt, 2023). The empirical success of these methods is often attributed to their theoretical connections with variation-based (Gao et al., 2024; Shafiee et al., 2025) and Lipschitz-based (Blanchet et al., 2019; Shafieezadeh-Abadeh et al., 2019) regularization. Furthermore, Wasserstein DRO offers strong generalization guarantees derived from measure concentration and transport inequality arguments (Mohajerin Esfahani and Kuhn, 2018; An and Gao, 2021; Gao, 2023). Despite these strengths, existing approaches rely on offline processing of the full training dataset, a limitation this paper addresses by developing an online framework for sequential data.

Online Wasserstein DRO. The setting of online Wasserstein DRO remains largely unexplored.
Yerenburg (2021) provides the first solution by dualizing the inner maximization to formulate a single minimization problem, which is then solved via online mirror descent. However, the reformulation approach requires strong oracles that solve a potentially nonconvex problem to find the worst-case perturbation. Furthermore, their analysis mandates that the ambiguity set size vanish as $T \to \infty$. Similarly, Wang et al. (2025) utilize online clustering for data compression but still require solving a full Wasserstein DRO problem on the compressed data at every iteration. In contrast to these methods, we propose a primal-dual algorithm that avoids repeatedly resolving the full optimization problem. Our approach efficiently identifies the worst-case distribution at each step and applies a first-order update to the primal decision.

Online Algorithms for Robust Optimization. OCO techniques have recently been adapted to robust optimization by casting such problems as semi-infinite programs. Seminal work by Ben-Tal et al. (2015) and follow-ups by Ho-Nguyen and Kılınç-Karzan (2018, 2019) reduce the problem to repeated robust feasibility checks via regret-minimizing algorithms, while more recent approaches (Postek and Shtern, 2024; Tu et al., 2024) avoid bisection through perspective reformulations or Lagrangian relaxations. These ideas extend naturally to DRO with specific ambiguity sets. For example, Namkoong and Duchi (2016) and Aigner et al. (2023) use primal-dual updates for $f$-divergence sets with discrete support, while dualizing the inner maximization enables direct online minimization for $f$-divergence sets (Qi et al., 2021) and Wasserstein sets (Yerenburg, 2021). Closely related is the prediction-with-expert-advice framework, including the Weighted Average and Aggregating Algorithms (Kivinen and Warmuth, 1999; Vovk, 1990), which can be viewed as Follow-the-Regularized-Leader on a probability simplex with negative entropy regularization. Distinct from these approaches, our work considers Wasserstein ambiguity sets without discreteness assumptions and solves the problem in a fully online manner, naturally interpreted as a dynamic game between the learner and a Wasserstein-constrained adversary.

Adversarial Training and Domain Shift. Adversarial training was originally proposed to reduce the sensitivity of machine learning models to small, carefully crafted noise (Goodfellow et al., 2015; Kurakin et al., 2017). This defense strategy can be rigorously reformulated as a robust optimization problem with box uncertainty, which is mathematically equivalent to a Wasserstein DRO problem using an $\infty$-Wasserstein ambiguity set (Gao et al., 2024). Sinha et al. (2017) extended this formulation to the general $p$-Wasserstein setting, establishing a framework that naturally accommodates adversarial domain shifts where the test distribution differs from the training distribution via bounded adversarial corruption. Wasserstein DRO offers provable robust generalization guarantees when facing such shifts (Lee and Raginsky, 2018; Tu et al., 2019; Wang et al., 2019; Kwon et al., 2020; Volpi et al., 2018). By solving the Wasserstein DRO problem in a fully online fashion, our approach is naturally suited to this adversarial setting.
Risk-Averse Online Learning. A classic result by Artzner et al. (1999) establishes that any coherent risk measure can be dually represented as a DRO problem over a specific ambiguity set. In optimal control, risk sensitivity is traditionally modeled using the entropic risk measure, particularly within the linear-exponential-Gaussian framework (Jacobson, 1973; Whittle, 1990). In reinforcement learning, this perspective has expanded to include objectives based on the Conditional Value-at-Risk (Chow and Ghavamzadeh, 2014; Hau et al., 2023) and exponential utility functions (Borkar, 2002). Similarly, the multi-armed bandit literature has addressed risk sensitivity through mean-variance criteria (Sani et al., 2012; Vakili and Zhao, 2016) and CVaR-based exploration (Galichet et al., 2013). While these approaches typically rely on specific functional forms of risk or $f$-divergence ambiguity sets, the notion of risk sensitivity in our work is geometrically induced by the Wasserstein ambiguity set.

Distributionally Robust Regret Optimization. A related paradigm is Distributionally Robust Regret Optimization (DRRO). In this setting, the minimax objective in (1) is modified to minimize the worst-case regret or excess risk, defined as the difference between the loss and the optimal loss under the worst-case distribution, rather than the worst-case expected loss itself. While DRRO achieves statistical minimax optimality under distributional shifts (Agarwal and Zhang, 2022), it introduces significant computational challenges. Recent work has addressed these issues for Wasserstein ambiguity sets (Chen and Xie, 2021; Bitar, 2024; Fiechtner and Blanchet, 2025; Xue and Rujeerapaiboon, 2025). We emphasize that although our analysis employs cumulative regret as a performance metric, our objective remains minimizing the robust loss, which differs from the minimax regret formulation studied in the DRRO literature.

1.3. Notation and Outline

Let $\|\cdot\|$ denote the Euclidean norm. The set of positive integers up to $n \in \mathbb{N}$ is denoted by $[n]$. We write $\mathcal{P}(\Xi)$ for the family of Borel probability measures on $\Xi \subseteq \mathbb{R}^m$, equipped with the $p$-Wasserstein distance, where $p \in [1,\infty)$. We write $\mathbb{E}_{\mathbb{P}}[\ell(x,\xi)]$ for the expectation of $\ell(x,\xi)$ with respect to $\xi \sim \mathbb{P}$; when clear from the context, the parameter and the random variable are dropped and we write $\mathbb{E}_{\mathbb{P}}[\ell]$. We write $\Pi_{\mathcal{X}}$ for the projection operator onto a closed and convex set $\mathcal{X}$. Let $\partial f(x)$ denote the subdifferential of $f$ at $x$ if $f$ is convex, or the superdifferential if $f$ is concave. When clear from the context, $\partial f(x)$ may also refer to a specific subgradient or supergradient. For $p \in [1,\infty)$, the $p$-th order homogeneous Sobolev (semi)norm of a continuously differentiable $f : \Xi \to \mathbb{R}$ with respect to $\mathbb{P}$ is $\|f\|_{\dot{H}^{1,p}(\mathbb{P})} = (\mathbb{E}_{\xi\sim\mathbb{P}}[\|\nabla f(\xi)\|^p])^{1/p}$. The Lipschitz constant of a Lipschitz continuous $f : \Xi \to \mathbb{R}$ is $\|f\|_{\mathrm{lip}}$.

The remainder of the paper is organized as follows. Section 2 introduces the problem setup and assumptions. In Section 3, we analyze the convergence of the proposed algorithm, assuming access to an oracle for the (inner) worst-case expectation problem. Section 4 presents an efficient algorithm that implements this oracle. All proofs and technical details are deferred to the appendix.

2. Problem Setup

In this section, we formalize the structural assumptions required for our analysis.
Assumption 1 (Regularity). The feasible region $\mathcal{X} \subseteq \mathbb{R}^n$ is nonempty, convex, and compact, with diameter $D_{\mathcal{X}}$. The support $\Xi \subseteq \mathbb{R}^m$ of the random variable is nonempty, closed, and convex. For any fixed $\xi \in \Xi$, the loss function $\ell(\cdot,\xi)$ is real-valued, convex, and Lipschitz continuous with Lipschitz constant $G_{\mathcal{X}} > 0$. For any $x \in \mathcal{X}$, there exists a constant $g > 0$ such that $\ell(x,\xi) \le g(1 + \|\xi\|^p)$ for all $\xi \in \Xi$.

Assumption 1 ensures that the optimization problem
\[
\inf_{x \in \mathcal{X}} \sup_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} \big\{ f(x,\mathbb{Q}) := \mathbb{E}_{\xi \sim \mathbb{Q}}[\ell(x,\xi)] \big\} \tag{2}
\]
has a finite optimal value (Gao and Kleywegt, 2023, Theorem 1). By linearity of the expectation, $f(x,\mathbb{Q})$ is affine in $\mathbb{Q}$. Hence, it inherits convexity in $x$ from the loss function $\ell$. Moreover, the Lipschitz continuity of $\ell$ also extends to its expectation:
\[
|f(x_1,\mathbb{Q}) - f(x_2,\mathbb{Q})| \le \mathbb{E}_{\xi \sim \mathbb{Q}}[|\ell(x_1,\xi) - \ell(x_2,\xi)|] \le G_{\mathcal{X}}\|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathcal{X}.
\]
This structural regularity allows us to establish strong duality for (2), implying that the order of the sup and inf operators can be interchanged without affecting the optimal value. However, strong duality alone does not guarantee the existence of a finite optimal solution to (2), as the supremum or infimum may fail to be attained. To ensure the existence of a solution pair $(x^\star, \mathbb{Q}^\star)$ for (2) (also known as a saddle point), additional assumptions are required; see, for example, (Shafiee et al., 2025, Theorem 1). One such assumption that guarantees the existence of a saddle point is the piecewise structure of the loss function proposed by Mohajerin Esfahani and Kuhn (2018).

Assumption 2 (Piecewise Structure). The loss function is $\ell(x,\xi) := \max_{k \in [K]} \ell_k(x,\xi)$, where for every fixed $x \in \mathcal{X}$ and $k \in [K]$, the function $\ell_k : \mathcal{X} \times \Xi \to \mathbb{R}$ is concave and differentiable.

As we will see in the next lemma, the above assumption guarantees the existence of a saddle point for (2), allowing us to replace the inf and sup operators with min and max, respectively, which is a requirement for any online saddle-point algorithm. Beyond this, the assumption offers an additional benefit: it enables (2) to admit a finite-dimensional reformulation (Mohajerin Esfahani and Kuhn, 2018, Theorem 4.2). We exploit this key property to design an efficient algorithm for solving the inner worst-case expectation problem. Notably, the assumed piecewise structure already encompasses a broad class of robust regression and classification models and is particularly attractive since any smooth function can be approximated arbitrarily well by piecewise linear functions; a concrete instance is sketched below.
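As an illustration (our own example, not one from the paper), a newsvendor-style loss $\ell(x,\xi) = \max\{h(x-\xi),\, b(\xi-x)\}$ satisfies Assumption 2 with $K = 2$ affine, hence concave and differentiable, pieces in $\xi$:

```python
import numpy as np

# Newsvendor-style loss: ell(x, xi) = max{ h*(x - xi), b*(xi - x) },
# a max of K = 2 pieces, each affine (so concave and differentiable) in xi.
H_COST, B_COST = 1.0, 3.0  # illustrative holding/backorder costs (assumptions)

def piece_1(x, xi):
    return H_COST * (x - xi)   # overage piece, affine in xi

def piece_2(x, xi):
    return B_COST * (xi - x)   # underage piece, affine in xi

def loss(x, xi):
    # ell(x, .) is convex in x and a max of concave pieces in xi (Assumption 2)
    return np.maximum(piece_1(x, xi), piece_2(x, xi))

xi_grid = np.linspace(-1.0, 3.0, 5)
print(loss(1.0, xi_grid))  # evaluates the max of the two pieces on a grid
```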
Lemma 1 (Existence of Saddle Point). Suppose Assumptions 1 and 2 hold. Moreover, if $p = 1$, suppose in addition that either $\Xi$ is compact or there exists a constant $g > 0$ such that $\ell(x,\xi) \le g(1 + \|\xi\|^r)$ for all $\xi \in \Xi$ and some $r \in (0,1)$. Then,
\[
\min_{x \in \mathcal{X}} \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x,\mathbb{Q}) = \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} \min_{x \in \mathcal{X}} f(x,\mathbb{Q}).
\]

We emphasize that, when $p > 1$, no additional assumptions beyond Assumptions 1 and 2 are required to guarantee the existence of a saddle point. In contrast, the case $p = 1$ requires more careful analysis, since the worst-case distribution may assign an asymptotically vanishing amount of probability mass to points escaping to infinity along directions in the recession cone of $\Xi$. The additional assumption that $\Xi$ is compact rules out this behavior, as its recession cone reduces to the singleton $\{0\}$. Alternatively, the restriction on the growth condition ensures that sending mass to infinity is never optimal for the worst-case distribution.

Next, we assume that the underlying data-generating distribution $\mathbb{P}^\star$ is well-behaved.

Assumption 3 (Light-tailed Distribution). The underlying data-generating distribution $\mathbb{P}^\star$ is light-tailed; that is, there exists $a > p \ge 1$ such that $\mathbb{E}_{\xi\sim\mathbb{P}^\star}[\exp(\|\xi\|^a)] < \infty$.

We note that Assumption 3 is primarily utilized to leverage the convergence rates of the empirical distribution $\widehat{\mathbb{P}}_t$ to $\mathbb{P}^\star$ in the Wasserstein metric, which serves only for deriving an end-to-end result on the convergence rate of our online algorithm in the next section. Our main algorithm relies on the following computational oracle to identify the worst-case distribution.

Algorithm 1: Online Distributional Best-Response Algorithm
  Input: initial decision $x_1 \in \mathcal{X}$, step sizes $\eta_t > 0$
  for $t = 1, 2, \dots$ do
    // 1. Distributional best-response step: query the oracle for the worst-case distribution
    $\mathbb{Q}_t \leftarrow \mathcal{O}_W(x_t, \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t))$
    // 2. Learning step: update the decision via the projected subgradient method
    $x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t \partial_x f(x_t,\mathbb{Q}_t))$
    // 3. Aggregation step: maintain the average of the decisions
    $\bar{x}_{t+1} = \frac{1}{t+1}\sum_{i=1}^{t+1} x_i = \frac{t}{t+1}\bar{x}_t + \frac{1}{t+1}x_{t+1}$
  end

Assumption 4 (Wasserstein Oracle). For any $\delta > 0$, there exists a Wasserstein oracle $\mathcal{O}_W(x, \mathbb{B}^p_\rho(\mathbb{P}))$ that returns a distribution $\mathbb{Q}^\star \in \mathbb{B}^p_\rho(\mathbb{P})$ satisfying $f(x,\mathbb{Q}^\star) \ge \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P})} f(x,\mathbb{Q}) - \delta$.

While we initially assume the existence of such an oracle, we dedicate Section 4 to opening this "black box" and providing efficient implementations under mild conditions. The resulting procedure, which we call the online distributional best-response algorithm, is detailed in Algorithm 1. At each iteration $t$, given the current primal decision $x_t$, the dual player queries the oracle $\mathcal{O}_W(x_t, \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t))$ to compute its best response, which corresponds to a solution of the inner worst-case expectation problem. The primal player then updates its decision using a single iteration of the projected subgradient method, $x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t \partial_x f(x_t,\mathbb{Q}_t))$, where $\eta_t$ is the step size and $\Pi_{\mathcal{X}}$ denotes the projection onto the feasible set $\mathcal{X}$.

This simple procedure is inspired by the best-response framework of Orabona (2019, Algorithm 12.2). However, unlike classical methods that operate on fixed dual feasible sets, our approach must contend with a non-stationary dual environment. Namely, the dual player has access only to the partial ambiguity set $\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)$, which evolves toward the offline benchmark $\mathbb{B}^p_\rho(\mathbb{P}^\star)$ as the data stream unfolds. This shifting landscape makes a best-response strategy for the dual player essential and unavoidable. In a standard simultaneous primal-dual update, the dual step could easily fall outside the currently valid (and shifting) Wasserstein ball, yielding updates that are either infeasible or insufficiently robust. By freezing the dual player's best response $\mathbb{Q}_t$ against the current decision $x_t$, we ensure the learner remains resilient to the most damaging distributional shift currently admissible.
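A minimal Python sketch of Algorithm 1's loop for the one-dimensional newsvendor loss above, with a crude enumeration stand-in for the oracle $\mathcal{O}_W$ (the radius, data model, and oracle implementation are illustrative assumptions of ours, not the paper's method):

```python
import numpy as np

H_COST, B_COST = 1.0, 3.0
RHO = 0.2  # illustrative ambiguity radius (assumption)

def loss(x, xi):
    return np.maximum(H_COST * (x - xi), B_COST * (xi - x))

def oracle_worst_case(x, data, rho, n_dirs=41):
    """Crude stand-in for O_W with p = 1: spend the whole transport budget
    t*rho moving a single atom, and keep the best such perturbation."""
    t = len(data)
    best_val, best_data = loss(x, data).mean(), data
    for i in range(t):
        for step in np.linspace(-t * rho, t * rho, n_dirs):
            cand = data.copy()
            cand[i] += step  # total transport cost |step| <= t*rho
            val = loss(x, cand).mean()
            if val > best_val:
                best_val, best_data = val, cand
    return best_data  # atoms of the (approximate) worst-case Q_t

rng = np.random.default_rng(2)
x_t, avg, stream = 0.0, 0.0, rng.normal(1.0, 0.5, size=100)
for t, xi_t in enumerate(stream, start=1):
    worst = oracle_worst_case(x_t, stream[:t], RHO)     # best-response step
    g = np.where(x_t >= worst, H_COST, -B_COST).mean()  # subgradient of f(., Q_t)
    x_t = np.clip(x_t - g / np.sqrt(t), -5.0, 5.0)      # projected subgradient step
    avg += (x_t - avg) / t                              # aggregation step
print(f"averaged robust decision ~ {avg:.3f}")
```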
For simplicity, we employ the projected subgradient method to update the primal variable, although alternative approaches, including projection-free Frank-Wolfe-type algorithms (Garber and Hazan, 2015), can be used without loss of generality.

3. Convergence Analysis

Before proceeding to the formal analysis, we define the learning objective within this online setting and establish the criteria for evaluating the performance of Algorithm 1. Let $x^\star$ denote an optimal solution to (1). Our goal is to ensure that the robust estimate $\bar{x}_t$ produced by Algorithm 1 converges to $x^\star$ according to a well-defined metric. We evaluate this convergence rate using the primal suboptimality gap, a standard performance measure in minimax optimization when the focus is on the primal decision variable. Specifically, we seek to derive an upper bound on the quantity
\[
\mathrm{Gap}(\bar{x}_T) := \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P}^\star)} f(\bar{x}_T,\mathbb{Q}) - \min_{x \in \mathcal{X}} \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x,\mathbb{Q}).
\]
The following lemma plays a key role in controlling this suboptimality gap.

Lemma 2. The suboptimality gap of the averaged iterate $\bar{x}_T$ admits the following upper bound:
\[
\mathrm{Gap}(\bar{x}_T) \le \frac{1}{T}\sum_{t=1}^T \big( f(x_t,\mathbb{Q}_t) - f(x^\star,\mathbb{Q}_t) \big)
+ \frac{1}{T}\sum_{t=1}^T \Big( \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x_t,\mathbb{Q}) - \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x_t,\mathbb{Q}) \Big)
+ \frac{1}{T}\sum_{t=1}^T \Big( \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x^\star,\mathbb{Q}) - \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x^\star,\mathbb{Q}) \Big) + \delta.
\]

According to this lemma, the suboptimality gap can be decomposed into four distinct components. The first component characterizes the regret of the learning step in Algorithm 1, evaluated on the sequence of functions $\{f(\cdot,\mathbb{Q}_t)\}_{t=1}^T$. The second and third components quantify the sensitivity of the worst-case expectations with respect to two Wasserstein balls, one centered at $\widehat{\mathbb{P}}_t$ and the other at $\mathbb{P}^\star$. Finally, the fourth component accounts for the error incurred by the Wasserstein oracle. Among these, the first component is the most straightforward to control, as it reduces to a standard regret analysis of the online projected subgradient method, which we present next.

Lemma 3. Under Assumption 1, for any $T \ge 1$, the sequence $\{(x_t,\mathbb{Q}_t)\}_{t=1}^T$ generated by Algorithm 1 with the step size $\eta_t = \frac{D_{\mathcal{X}}}{G_{\mathcal{X}}\sqrt{t}}$ satisfies
\[
\frac{1}{T}\sum_{t=1}^T \big( f(x_t,\mathbb{Q}_t) - f(x^\star,\mathbb{Q}_t) \big) \le \frac{G_{\mathcal{X}} D_{\mathcal{X}}(1 + \log T)}{\sqrt{T}}.
\]

Unlike the first component, controlling the second and third components in Lemma 2 is more delicate. In particular, although $\widehat{\mathbb{P}}_t \to \mathbb{P}^\star$, it is not immediate how this convergence translates into convergence of the corresponding worst-case expectations over Wasserstein balls centered at $\widehat{\mathbb{P}}_t$ and $\mathbb{P}^\star$. The following lemma characterizes this effect.

Lemma 4. Under Assumptions 1 and 2, for any $x \in \mathcal{X}$ and $t \ge 1$, we have
\[
\Big| \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x,\mathbb{Q}) - \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x,\mathbb{Q}) \Big| \le
\begin{cases}
\|\ell(x,\cdot)\|_{\mathrm{lip}}\, W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) & \text{if } p = 1,\\
\|\ell(x,\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}^\star)}\, W_p(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) & \text{if } p > 1,
\end{cases}
\]
where $q > 1$ is the conjugate exponent satisfying $\frac{1}{p} + \frac{1}{q} = 1$ when $p > 1$.

We emphasize that the implication of the above lemma is far from trivial and is perhaps surprising: although the worst-case expectations are taken over Wasserstein balls $\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)$ and $\mathbb{B}^p_\rho(\mathbb{P}^\star)$ with radius $\rho > 0$, the resulting bound is independent of $\rho$ and depends solely on the $p$-Wasserstein distance between $\widehat{\mathbb{P}}_t$ and $\mathbb{P}^\star$.
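As a quick numeric check of this $\rho$-independence (our own illustration, under assumed data): for the 1-Lipschitz loss $\ell(\xi) = |\xi|$ and $p = 1$, a standard duality computation gives $\sup_{\mathbb{Q}\in\mathbb{B}^1_\rho(\mathbb{P})}\mathbb{E}_{\mathbb{Q}}|\xi| = \mathbb{E}_{\mathbb{P}}|\xi| + \rho$, so the gap between the two worst-case values cancels $\rho$ exactly and is bounded by $W_1$ via Kantorovich-Rubinstein:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
p_atoms = rng.normal(0.0, 1.0, size=2000)   # stand-in for P*
q_atoms = p_atoms[:200]                     # stand-in for an empirical P_hat_t

def worst_case_abs(atoms, rho):
    # sup over the W1 ball of E|xi| for ell(xi) = |xi|: closed form E|xi| + rho
    return np.abs(atoms).mean() + rho

w1 = wasserstein_distance(p_atoms, q_atoms)
for rho in (0.1, 1.0, 10.0):
    gap = abs(worst_case_abs(p_atoms, rho) - worst_case_abs(q_atoms, rho))
    # Lemma 4's bound: gap <= ||ell||_lip * W1 = W1, uniformly in rho
    print(f"rho={rho:5.1f}  gap={gap:.4f}  <=  W1={w1:.4f}: {gap <= w1 + 1e-12}")
```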
Leveraging this key lemma, we are able to relax a restrictive assumption in Yerenburg (2021), which requires $\rho \to 0$ as $T \to \infty$. By combining Lemmas 3 and 4 with Lemma 2, we can establish the convergence of Algorithm 1.

Theorem 5. Under Assumptions 1, 2, and 4, consider the averaged iterate $\bar{x}_T$ generated by Algorithm 1 with step size $\eta_t = \frac{D_{\mathcal{X}}}{G_{\mathcal{X}}\sqrt{t}}$. The following guarantees hold.

(i) Case $p = 1$. Suppose that either $\Xi$ is compact, or there exists a constant $g > 0$ such that $\ell(x,\xi) \le g(1 + \|\xi\|^r)$ for all $\xi \in \Xi$ and some $r \in (0,1)$. Then,
\[
\mathrm{Gap}(\bar{x}_T) \le \frac{G_{\mathcal{X}} D_{\mathcal{X}}(1+\log T)}{\sqrt{T}} + \frac{1}{T}\sum_{t=1}^T \big( \|\ell(x_t,\cdot)\|_{\mathrm{lip}} + \|\ell(x^\star,\cdot)\|_{\mathrm{lip}} \big) W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) + \delta.
\]

(ii) Case $p > 1$. Let $q > 1$ denote the conjugate exponent satisfying $\frac{1}{p} + \frac{1}{q} = 1$. Then,
\[
\mathrm{Gap}(\bar{x}_T) \le \frac{G_{\mathcal{X}} D_{\mathcal{X}}(1+\log T)}{\sqrt{T}} + \frac{1}{T}\sum_{t=1}^T \big( \|\ell(x_t,\cdot)\|_{\dot{H}^{1,q}(\widehat{\mathbb{P}}_t)} + \|\ell(x^\star,\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}^\star)} \big) W_p(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) + \delta.
\]

The results above show that the suboptimality gap of the final output is governed by three main factors: (i) the regret incurred by the online projected subgradient method, (ii) the weighted average of the $p$-Wasserstein distances between the empirical measures and the true data-generating distribution, and (iii) the error tolerance of the Wasserstein oracle. When the data-generating distribution $\mathbb{P}^\star$ is additionally light-tailed, that is, it satisfies Assumption 3, the following corollary can be established.

Corollary 6. Suppose that Assumption 4 is satisfied with $\delta = \big( \frac{1}{T}\log\frac{T}{\tau} \big)^{\min\{\frac{p}{m},\frac{1}{2}\}}$. Under the conditions of Theorem 5 and Assumption 3, for any $p \ge 1$ and any $\tau \in (0,1)$, when $T \ge C_1 \log\frac{T}{\tau}$, with probability at least $1-\tau$ it holds that
\[
\mathrm{Gap}(\bar{x}_T) \le C_2 \Big( \frac{1}{T}\log\frac{T}{\tau} \Big)^{\min\{\frac{p}{m},\frac{1}{2}\}}.
\]
Here, $C_1, C_2$ are constants depending on $m$, $p$, $a$, $G_{\mathcal{X}}$, $D_{\mathcal{X}}$, $\|\ell(x^\star,\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}^\star)}$, the exponential moments of $\mathbb{P}^\star$, and uniform bounds on $\max_{x\in\mathcal{X}}\{\|\ell(x,\cdot)\|_{\mathrm{lip}}\}$ and $\max_{x\in\mathcal{X}}\{\|\ell(x,\cdot)\|_{\dot{H}^{1,q}(\widehat{\mathbb{P}}_t)}\}$.

The established bound highlights the interplay between the dimensionality of the uncertainty set and the convergence rate of the proposed online algorithm. In particular, it confirms that despite operating on streaming data, the online framework preserves the statistical guarantees dictated by measure concentration, consistent with the offline DRO results of Mohajerin Esfahani and Kuhn (2018). We also note that the corollary assumes access to a Wasserstein oracle with a prescribed error, whose efficient implementation is discussed in the next section.

4. Wasserstein Oracle

In this section, we develop an efficient implementation of the Wasserstein oracle introduced in Assumption 4. Our goal is to devise an algorithm that efficiently realizes the distributional best-response step of Algorithm 1. Specifically, for each iteration $t = 1, 2, \dots, T$, the algorithm returns a $\delta$-accurate solution to the following infinite-dimensional worst-case expectation problem:
\[
\max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} \mathbb{E}_{\xi\sim\mathbb{Q}}[\ell(x_t,\xi)]. \tag{3}
\]
To streamline the presentation, we henceforth omit the explicit dependence of the loss function on $x_t$ and write $\ell(\xi)$ in place of $\ell(x_t,\xi)$. Under Assumptions 1 and 2, problem (3) admits a finite-dimensional convex reformulation (Kuhn et al., 2019).
While this reformulation is an important step toward establishing the tractability of (3), it ultimately relies on generic off-the-shelf solvers. Such solvers fail to exploit the intrinsic structure of the resulting optimization problem. In this section, we uncover this useful structure and show how it can be exploited to solve (3) more efficiently. To streamline our analysis, we focus only on the 1-Wasserstein metric (which already covers several relevant applications; see Mohajerin Esfahani and Kuhn (2018)) and make the following simplifying assumption.

Assumption 5. The support of the random samples is $\Xi = \mathbb{R}^m$, and $p = 1$. There exists a constant $g > 0$ such that $\ell(x,\xi) \le g(1 + \|\xi\|^r)$ for all $\xi \in \Xi$ and some $r \in (0,1)$.

The sublinear growth assumption ensures solvability of the dual Wasserstein DRO problem and yields an explicit characterization of the worst-case distribution. In contrast, when $\ell$ exhibits linear or superlinear growth, the worst-case distribution may not be attained. At the core of our proposed algorithm lies the following key reformulation of (3).

Theorem 7. Under Assumptions 2 and 5, problem (3) is equivalent to
\[
\max_{b_1,\dots,b_t} \Big\{ \frac{1}{t}\sum_{i=1}^t S_i(b_i) \;\;\text{s.t.}\;\; \sum_{i=1}^t b_i \le \rho t, \;\; b_1,\dots,b_t \ge 0 \Big\}, \tag{4}
\]
where, for every $i \in [t]$ and $b \ge 0$, $S_i(b)$ is defined as
\[
S_i(b) := \max_{1\le k_1<k_2\le K}\ \max_{\alpha_1,\alpha_2,v_1,v_2} \Big\{ \alpha_1\ell_{k_1}(\widehat{\xi}_i - v_1) + \alpha_2\ell_{k_2}(\widehat{\xi}_i - v_2) \;\text{s.t.}\; \alpha_1\|v_1\| + \alpha_2\|v_2\| \le b, \; \alpha_1 + \alpha_2 = 1, \; \alpha_1,\alpha_2 \ge 0 \Big\}.
\]
Problem (4) is a classical budget allocation problem: a shared budget $\rho t$ is distributed across $t$ concave value functions $S_i$ (concavity is established in Lemma 21). Dualizing the budget constraint with a multiplier $\lambda$ decouples (4) into $t$ one-dimensional problems of the form $\max_{b \ge 0} S_i(b) - \lambda b$; Algorithms 2 and 3 summarize the resulting scheme.

Algorithm 2: Bisection search over the dual variable $\lambda$
  Input: subproblem functions $\widehat{S}_i(\cdot)$, $i \in [t]$, minimum interval length $\eta_\lambda > 0$
  Initialize: search bounds $\lambda_{\mathrm{low}} = 0$ and $\lambda_{\mathrm{high}} = \|\ell\|_{\mathrm{lip}}$
  while $\lambda_{\mathrm{high}} - \lambda_{\mathrm{low}} > \eta_\lambda$ do
    // 1. Dual variable assignment
    $\lambda_{\mathrm{mid}} \leftarrow \frac{\lambda_{\mathrm{low}} + \lambda_{\mathrm{high}}}{2}$
    // 2. Solving decomposed subproblems
    for $i = 1,\dots,t$ do: compute the local optimal budget $\widehat{b}_i(\lambda_{\mathrm{mid}})$ using Algorithm 3
    // 3. Budget feasibility check
    Evaluate the total budget consumption $\bar{b} \leftarrow \sum_{i=1}^t \widehat{b}_i(\lambda_{\mathrm{mid}})$
    if $\bar{b} > \rho t$ then $\lambda_{\mathrm{low}} \leftarrow \lambda_{\mathrm{mid}}$  // dual variable is too small
    else $\lambda_{\mathrm{high}} \leftarrow \lambda_{\mathrm{mid}}$  // dual variable is too large
  end
  Output: $\widehat{\lambda} := \lambda_{\mathrm{high}}$ and $\widehat{b}_i := \widehat{b}_i(\lambda_{\mathrm{high}})$ for all $i \in [t]$

Algorithm 3: Algorithm for solving the decoupled problems (7)
  Input: subproblem function $\widehat{S}_i(\cdot)$, dual candidate $\lambda$, minimum interval length $\eta_b$
  Initialize: local search bounds $b_{\mathrm{low}} = 0$, $b_{\mathrm{high}} = t\rho$, golden ratio $\phi = \frac{\sqrt{5}-1}{2}$,
    $z_1 \leftarrow b_{\mathrm{high}} - \phi(b_{\mathrm{high}} - b_{\mathrm{low}})$, $z_2 \leftarrow b_{\mathrm{low}} + \phi(b_{\mathrm{high}} - b_{\mathrm{low}})$
  while $b_{\mathrm{high}} - b_{\mathrm{low}} > \eta_b$ do
    if $\widehat{S}_i(z_1) - \lambda z_1 < \widehat{S}_i(z_2) - \lambda z_2$ then
      $b_{\mathrm{low}} \leftarrow z_1$, $z_1 \leftarrow z_2$, $z_2 \leftarrow b_{\mathrm{low}} + \phi(b_{\mathrm{high}} - b_{\mathrm{low}})$
    else
      $b_{\mathrm{high}} \leftarrow z_2$, $z_2 \leftarrow z_1$, $z_1 \leftarrow b_{\mathrm{high}} - \phi(b_{\mathrm{high}} - b_{\mathrm{low}})$
  end
  Output: $b_{\mathrm{low}}$
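The following Python sketch mirrors the structure of Algorithms 2 and 3: bisection on the dual variable $\lambda$, with each decoupled subproblem solved by golden-section search. The concave functions $S_i$ below are synthetic stand-ins (assumptions for illustration), since evaluating the paper's $S_i$ requires the subroutine of Assumption 6:

```python
import numpy as np

PHI = (np.sqrt(5.0) - 1.0) / 2.0  # golden-section ratio

def golden_section(f, lo, hi, eta=1e-9):
    """Maximize a concave function f on [lo, hi] (Algorithm 3's search pattern)."""
    z1, z2 = hi - PHI * (hi - lo), lo + PHI * (hi - lo)
    while hi - lo > eta:
        if f(z1) < f(z2):        # maximum lies in [z1, hi]
            lo, z1 = z1, z2
            z2 = lo + PHI * (hi - lo)
        else:                    # maximum lies in [lo, z2]
            hi, z2 = z2, z1
            z1 = hi - PHI * (hi - lo)
    return lo

def bisection_budget_allocation(S_list, rho, lip, eta_lam=1e-6):
    """Algorithm 2's pattern: bisection on the dual variable for problem (4);
    lip bounds the optimal dual variable (lambda in [0, ||ell||_lip])."""
    t = len(S_list)
    lam_lo, lam_hi = 0.0, lip
    while lam_hi - lam_lo > eta_lam:
        lam = 0.5 * (lam_lo + lam_hi)
        # decoupled subproblems: b_i(lam) = argmax_b S_i(b) - lam * b
        b = [golden_section(lambda x, S=S: S(x) - lam * x, 0.0, t * rho)
             for S in S_list]
        if sum(b) > rho * t:     # budget exceeded: dual variable too small
            lam_lo = lam
        else:                    # budget slack: dual variable too large
            lam_hi = lam
    lam = lam_hi
    return lam, [golden_section(lambda x, S=S: S(x) - lam * x, 0.0, t * rho)
                 for S in S_list]

# Illustrative concave stand-ins for S_i (assumptions, not the paper's S_i):
S_list = [lambda b, c=c: c * np.sqrt(b) for c in (1.0, 2.0, 3.0)]
lam, budgets = bisection_budget_allocation(S_list, rho=0.5, lip=5.0)
print(f"lambda ~ {lam:.4f}, budgets ~ {[round(x, 3) for x in budgets]}")
```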
Appendix B. Omitted Proofs

B.1. Proof of Lemma 1

To establish the existence of a saddle point, we verify the conditions required in (Shafiee et al., 2025, Lemmas 3 and 4). Under our structural Assumptions 1 and 2, the regularity requirements regarding convexity, lower semi-continuity, integrability, and compactness of sublevel sets with respect to $x$ (Shafiee et al., 2025, Assumptions 3-6, 8) are immediately satisfied.

The remaining requirement involves the existence of Slater points, as specified in (Shafiee et al., 2025, Assumption 7). First, since the transportation cost function defining the $p$-Wasserstein distance is real-valued and continuous ($c(\xi,\xi') = \|\xi-\xi'\|^p$), the support of the empirical distribution $\widehat{\mathbb{P}}_t$ trivially lies within the interior of the cost function's domain. This satisfies the Slater condition (Shafiee et al., 2025, Assumption 7(i)). Second, per Assumption 1, the feasible region $\mathcal{X}$ is a nonempty, compact, and convex set. Consequently, it has a nonempty relative interior and admits a Slater point, thereby satisfying the primal Slater condition (Shafiee et al., 2025, Assumption 7(ii)). Finally, the growth conditions provided for the $p = 1$ case ensure that the dual problem is well posed and the inner maximization is attained. Thus, all necessary conditions for the minimax theorem are met, and the claim follows. ■

B.2. Proof of Lemma 2

Using Jensen's inequality, we can write
\[
\mathrm{Gap}(\bar{x}_T) = \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(\bar{x}_T,\mathbb{Q}) - \min_{x\in\mathcal{X}}\max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x,\mathbb{Q}) \le \frac{1}{T}\sum_{t=1}^T \Big( \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x_t,\mathbb{Q}) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x^\star,\mathbb{Q}) \Big),
\]
where $x^\star \in \mathrm{argmin}_{x\in\mathcal{X}}\max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x,\mathbb{Q})$. Next, we have
\[
\max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x_t,\mathbb{Q}) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x^\star,\mathbb{Q}) \le f(x_t,\mathbb{Q}_t) - f(x^\star,\mathbb{Q}_t) + \Big( \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x_t,\mathbb{Q}) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x_t,\mathbb{Q}) \Big) + \Big( \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x^\star,\mathbb{Q}) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x^\star,\mathbb{Q}) \Big) + \delta.
\]
The above inequality is obtained by noting that $\max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x^\star,\mathbb{Q}) - f(x^\star,\mathbb{Q}_t) \ge 0$ and applying the Wasserstein oracle from Assumption 4, which ensures $f(x_t,\mathbb{Q}_t) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x_t,\mathbb{Q}) + \delta \ge 0$. Combining the above two inequalities completes the proof. ■

Algorithm 4: Algorithm for evaluating $S_i(b)$ based on (5) and (6)
  Input: local budget $b > 0$, golden ratio $\phi = \frac{\sqrt{5}-1}{2}$, minimum interval lengths $\eta_{\mathrm{in}}, \eta_{\mathrm{out}} > 0$
  for $1 \le k_1 < k_2 \le K$ do
    // Golden-section search on the weight allocations ($\alpha_1 + \alpha_2 = 1$)
    Initialize bounds for $\alpha_1$: $[L_\alpha, U_\alpha] \leftarrow [0,1]$, $a_\alpha \leftarrow U_\alpha - \phi(U_\alpha - L_\alpha)$, $b_\alpha \leftarrow L_\alpha + \phi(U_\alpha - L_\alpha)$
    while $(U_\alpha - L_\alpha) > \eta_{\mathrm{out}}$ do
      $V_a \leftarrow \mathrm{SolveInner}(\alpha_1 = a_\alpha, \alpha_2 = 1 - a_\alpha)$
      $V_b \leftarrow \mathrm{SolveInner}(\alpha_1 = b_\alpha, \alpha_2 = 1 - b_\alpha)$
      if $V_a > V_b$ then $U_\alpha \leftarrow b_\alpha$, $b_\alpha \leftarrow a_\alpha$, $a_\alpha \leftarrow U_\alpha - \phi(U_\alpha - L_\alpha)$
      else $L_\alpha \leftarrow a_\alpha$, $a_\alpha \leftarrow b_\alpha$, $b_\alpha \leftarrow L_\alpha + \phi(U_\alpha - L_\alpha)$
    end
    Set $S_i^{(k_1,k_2)}(b) = \max(V_a, V_b)$
  end
  Output: maximum value $\max_{1\le k_1<k_2\le K} S_i^{(k_1,k_2)}(b)$

  Function SolveInner($\alpha_1, \alpha_2$):
    Initialize bounds for $\beta$: $[L_\beta, U_\beta] \leftarrow [0, b]$, $a_\beta \leftarrow U_\beta - \phi(U_\beta - L_\beta)$, $b_\beta \leftarrow L_\beta + \phi(U_\beta - L_\beta)$
    while $(U_\beta - L_\beta) > \eta_{\mathrm{in}}$ do
      // Compute $V_a$ and $V_b$ using the subroutine from Assumption 6
      $V_a \leftarrow \max_{\|q_1\|\le a_\beta}\{\alpha_1\ell_{k_1}(\widehat{\xi}_i - \frac{q_1}{\alpha_1})\} + \max_{\|q_2\|\le b-a_\beta}\{\alpha_2\ell_{k_2}(\widehat{\xi}_i - \frac{q_2}{\alpha_2})\}$
      $V_b \leftarrow \max_{\|q_1\|\le b_\beta}\{\alpha_1\ell_{k_1}(\widehat{\xi}_i - \frac{q_1}{\alpha_1})\} + \max_{\|q_2\|\le b-b_\beta}\{\alpha_2\ell_{k_2}(\widehat{\xi}_i - \frac{q_2}{\alpha_2})\}$
      if $V_a > V_b$ then $U_\beta \leftarrow b_\beta$, $b_\beta \leftarrow a_\beta$, $a_\beta \leftarrow U_\beta - \phi(U_\beta - L_\beta)$
      else $L_\beta \leftarrow a_\beta$, $a_\beta \leftarrow b_\beta$, $b_\beta \leftarrow L_\beta + \phi(U_\beta - L_\beta)$
    end
    return $\max(V_a, V_b)$

B.3. Proof of Lemma 3

The result follows from the standard regret analysis of the projected subgradient method; see, for example, (Orabona, 2019, Section 2.2.2). For completeness, we provide a short proof. For any $z \in \mathcal{X}$, we have
\[
\|x_{t+1} - z\|^2 = \|\Pi_{\mathcal{X}}(x_t - \eta_t\partial_x f(x_t,\mathbb{Q}_t)) - z\|^2 \le \|x_t - \eta_t\partial_x f(x_t,\mathbb{Q}_t) - z\|^2 = \|x_t - z\|^2 - 2\eta_t\langle x_t - z, \partial_x f(x_t,\mathbb{Q}_t)\rangle + \eta_t^2\|\partial_x f(x_t,\mathbb{Q}_t)\|^2.
\]
Summing both sides over $t = 1, 2, \dots, T$ yields
\[
2\sum_{t=1}^T \eta_t\langle x_t - z, \partial_x f(x_t,\mathbb{Q}_t)\rangle \le \|x_1 - z\|^2 - \|x_{T+1} - z\|^2 + \sum_{t=1}^T \eta_t^2\|\partial_x f(x_t,\mathbb{Q}_t)\|^2 \le D_{\mathcal{X}}^2 + G_{\mathcal{X}}^2\sum_{t=1}^T \eta_t^2,
\]
where we used $\|\partial_x f(x_t,\mathbb{Q}_t)\| = \|\partial_x\mathbb{E}_{\xi\sim\mathbb{Q}_t}[\ell(x_t,\xi)]\| \le G_{\mathcal{X}}$ and $\|x_1 - z\| \le D_{\mathcal{X}}$.
Since $f(x,\mathbb{Q})$ is convex in $x$ for every $\mathbb{Q} \in \mathcal{P}(\Xi)$, we have $\langle x_t - z, \partial_x f(x_t,\mathbb{Q}_t)\rangle \ge f(x_t,\mathbb{Q}_t) - f(z,\mathbb{Q}_t)$, which implies
\[
2\sum_{t=1}^T \eta_t\big(f(x_t,\mathbb{Q}_t) - f(z,\mathbb{Q}_t)\big) \le D_{\mathcal{X}}^2 + G_{\mathcal{X}}^2\sum_{t=1}^T \eta_t^2.
\]
Since the sequence of step sizes $\{\eta_t\}_{t=1}^T$ is non-increasing, we obtain
\[
\sum_{t=1}^T \big(f(x_t,\mathbb{Q}_t) - f(z,\mathbb{Q}_t)\big) \le \frac{D_{\mathcal{X}}^2}{2\eta_T} + \frac{G_{\mathcal{X}}^2\sum_{t=1}^T \eta_t^2}{2\eta_T}.
\]
Upon choosing $\eta_t = \frac{D_{\mathcal{X}}}{G_{\mathcal{X}}\sqrt{t}}$ for all $t \in [T]$, we obtain
\[
\frac{1}{T}\sum_{t=1}^T \big(f(x_t,\mathbb{Q}_t) - f(z,\mathbb{Q}_t)\big) \le \frac{G_{\mathcal{X}}D_{\mathcal{X}}}{2\sqrt{T}} + \frac{G_{\mathcal{X}}D_{\mathcal{X}}}{2\sqrt{T}}\sum_{t=1}^T \frac{1}{t} \le \frac{G_{\mathcal{X}}D_{\mathcal{X}}(1+\log T)}{\sqrt{T}}.
\]
Substituting $z = x^\star$ in the above inequality completes the proof. ■

B.4. Proof of Lemma 4

For ease of notation, throughout this proof we suppress the dependence on the decision variable $x$ and simply write $\ell(\xi)$ in lieu of $\ell(x,\xi)$. First, we introduce two technical lemmas.

Lemma 12. Under Assumption 2, for $k \in [K]$, define the function $h_k : \Xi \times \mathbb{R}_+ \to \mathbb{R}$ such that
\[
h_k(\xi,\lambda) := \sup_{z\in\Xi}\, \ell_k(z) - \lambda\|z-\xi\|^p, \quad \forall \lambda \ge 0, \; \xi \in \Xi.
\]
For any fixed $\lambda \ge 0$, the function $h_k(\xi,\lambda)$ is concave in $\xi \in \Xi$. Further, define $h(\xi,\lambda) := \max_{k\in[K]} h_k(\xi,\lambda)$. We have
\[
h(\xi,\lambda) = \sup_{z\in\Xi}\, \ell(z) - \lambda\|z-\xi\|^p, \quad \forall \lambda \ge 0, \; \xi \in \Xi.
\]

Proof. For any fixed $\lambda \ge 0$, the function $\ell_k(z) - \lambda\|z-\xi\|^p$ is jointly concave in $(z,\xi) \in \Xi^2$ under Assumption 2. Therefore, the pointwise supremum $\sup_{z\in\Xi}\,\ell_k(z) - \lambda\|z-\xi\|^p$ is concave in $\xi \in \Xi$. Next, we verify that
\[
h(\xi,\lambda) := \max_{k\in[K]} h_k(\xi,\lambda) = \max_{k\in[K]}\sup_{z\in\Xi}\, \ell_k(z) - \lambda\|z-\xi\|^p = \sup_{z\in\Xi}\max_{k\in[K]}\, \ell_k(z) - \lambda\|z-\xi\|^p = \sup_{z\in\Xi}\, \ell(z) - \lambda\|z-\xi\|^p. \;\blacksquare
\]

Lemma 13. Under the conditions of Lemma 12, assume that $p > 1$ and $\lambda > 0$. Then $h_k(\xi,\lambda)$ is differentiable with respect to $\xi$, and $\|\nabla_\xi h_k(\xi,\lambda)\| \le \|\nabla\ell_k(\xi)\|$.

Proof. For any fixed $\lambda > 0$, the function $\ell_k(z) - \lambda\|z-\xi\|^p$ is strictly concave in $z$ due to $p > 1$ and the concavity of $\ell_k$. Therefore, since $\Xi \subseteq \mathbb{R}^m$ is convex and closed, and $\limsup_{\|z\|\to\infty} \frac{\ell_k(z)}{\|z-\xi\|^p} \le 0$, we obtain that $\mathrm{argmax}_{z\in\Xi}\,\ell_k(z) - \lambda\|z-\xi\|^p$ is nonempty and has a unique element. Denote $z^\star = \mathrm{argmax}_{z\in\Xi}\,\ell_k(z) - \lambda\|z-\xi\|^p$. Since $\ell_k(z) - \lambda\|z-\xi\|^p$ is differentiable with respect to $\xi$, by Danskin's theorem, $h_k(\xi,\lambda)$ is differentiable with respect to $\xi$, and
\[
\nabla_\xi h_k(\xi,\lambda) = \frac{\partial}{\partial\xi}\big[\ell_k(z) - \lambda\|z-\xi\|^p\big]\Big|_{z=z^\star} = \lambda p\|z^\star-\xi\|^{p-1}\,\frac{z^\star-\xi}{\|z^\star-\xi\|}. \tag{8}
\]
Since $\Xi$ is convex and closed, by the optimality of $z^\star$, we have
\[
\Big\langle \nabla\ell_k(z^\star) - \lambda p\|z^\star-\xi\|^{p-1}\tfrac{z^\star-\xi}{\|z^\star-\xi\|},\; z - z^\star \Big\rangle \le 0, \quad \forall z \in \Xi,
\]
which implies that $\langle \nabla\ell_k(z^\star) - \nabla_\xi h_k(\xi,\lambda), z - z^\star\rangle \le 0$ for all $z \in \Xi$. Let $z = \xi \in \Xi$, and note that, according to (8), $-\nabla_\xi h_k(\xi,\lambda)$ points in the same direction as $\xi - z^\star$. Therefore, we have
\[
\langle \nabla\ell_k(z^\star) - \nabla_\xi h_k(\xi,\lambda),\; -\nabla_\xi h_k(\xi,\lambda)\rangle \le 0,
\]
which leads to $\langle \nabla\ell_k(z^\star), \nabla_\xi h_k(\xi,\lambda)\rangle \ge \|\nabla_\xi h_k(\xi,\lambda)\|^2$. Since $\ell_k$ is concave, we have $\langle \nabla\ell_k(z^\star) - \nabla\ell_k(\xi), z^\star - \xi\rangle \le 0$. Again, using (8), we have
\[
\langle \nabla\ell_k(z^\star) - \nabla\ell_k(\xi),\; \nabla_\xi h_k(\xi,\lambda)\rangle \le 0.
\]
It follows that
\[
\|\nabla_\xi h_k(\xi,\lambda)\|^2 \le \langle \nabla\ell_k(z^\star), \nabla_\xi h_k(\xi,\lambda)\rangle \le \langle \nabla\ell_k(\xi), \nabla_\xi h_k(\xi,\lambda)\rangle \le \|\nabla\ell_k(\xi)\|\,\|\nabla_\xi h_k(\xi,\lambda)\|.
\]
If $\nabla_\xi h_k(\xi,\lambda) = 0$, the statement of the lemma holds trivially.
Moreover, if $\nabla_\xi h_k(\xi,\lambda) \neq 0$, the above inequality leads to $\|\nabla_\xi h_k(\xi,\lambda)\| \le \|\nabla\ell_k(\xi)\|$. This completes the proof. ■

Now, we are ready to present the proof of Lemma 4, which is a special case of the following theorem.

Theorem 14. Under Assumptions 1 and 2, given a pair of distributions $\mathbb{P}_1, \mathbb{P}_2 \in \mathcal{P}(\Xi)$, the difference of the worst-case expectations within the Wasserstein balls centered at $\mathbb{P}_1$ and $\mathbb{P}_2$ satisfies
\[
\sup_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}_1)} \mathbb{E}_{\xi\sim\mathbb{Q}}[\ell(\xi)] - \sup_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}_2)} \mathbb{E}_{\xi\sim\mathbb{Q}}[\ell(\xi)] \le
\begin{cases}
\|\ell(\cdot)\|_{\mathrm{lip}}\, W_1(\mathbb{P}_1,\mathbb{P}_2) & \text{if } p = 1,\\
\|\ell(\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}_2)}\, W_p(\mathbb{P}_1,\mathbb{P}_2) & \text{if } p > 1,
\end{cases}
\]
where $q > 1$ is a constant such that $\frac{1}{p} + \frac{1}{q} = 1$ when $p > 1$.

Proof. Define the value function $V_\rho(\mathbb{P}) = \sup_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P})} \mathbb{E}_{\xi\sim\mathbb{Q}}[\ell(\xi)]$. By (Blanchet and Murthy, 2019, Theorem 1), we can reformulate the value function as
\[
V_\rho(\mathbb{P}) = \inf_{\lambda\ge 0}\big\{\lambda\rho + \mathbb{E}_{\mathbb{P}}[h(\xi,\lambda)]\big\}, \quad \text{where } h(\xi,\lambda) := \sup_{z\in\Xi}\{\ell(z) - \lambda\|z-\xi\|^p\}.
\]
For any $\epsilon > 0$, by the definition of the infimum, there exists $\lambda_\epsilon \ge 0$ such that
\[
\lambda_\epsilon\rho + \mathbb{E}_{\mathbb{P}_2}[h(\xi,\lambda_\epsilon)] \le V_\rho(\mathbb{P}_2) + \epsilon.
\]
By the suboptimality of $\lambda_\epsilon$ in the minimization problem for $V_\rho(\mathbb{P}_1)$, we have $V_\rho(\mathbb{P}_1) \le \lambda_\epsilon\rho + \mathbb{E}_{\mathbb{P}_1}[h(\xi,\lambda_\epsilon)]$. Subtracting these inequalities yields
\[
V_\rho(\mathbb{P}_1) - V_\rho(\mathbb{P}_2) \le \mathbb{E}_{\mathbb{P}_1}[h(\xi,\lambda_\epsilon)] - \mathbb{E}_{\mathbb{P}_2}[h(\xi,\lambda_\epsilon)] + \epsilon. \tag{9}
\]
Letting $\pi$ be the optimal coupling for $W_p(\mathbb{P}_1,\mathbb{P}_2)$, we can rewrite the above expectation difference as
\[
\mathbb{E}_{\mathbb{P}_1}[h(\xi,\lambda_\epsilon)] - \mathbb{E}_{\mathbb{P}_2}[h(\xi,\lambda_\epsilon)] = \mathbb{E}_{(\xi_1,\xi_2)\sim\pi}[h(\xi_1,\lambda_\epsilon) - h(\xi_2,\lambda_\epsilon)].
\]

Case 1 ($p = 1$): In this case, $h(\xi,\lambda) = \sup_{z\in\Xi}\{\ell(z) - \lambda\|z-\xi\|\}$. First, by the properties of the Pasch-Hausdorff envelope (Rockafellar and Wets, 1998, Example 9.11), the function $h(\cdot,\lambda)$ is $\lambda$-Lipschitz continuous for any $\lambda \ge 0$. Second, we argue that we can restrict the search for $\lambda$ to the interval $[0,\|\ell\|_{\mathrm{lip}}]$. Suppose $\lambda > \|\ell\|_{\mathrm{lip}}$. Since $\ell$ is $\|\ell\|_{\mathrm{lip}}$-Lipschitz, for any $z \in \Xi$ we have $\ell(z) \le \ell(\xi) + \|\ell\|_{\mathrm{lip}}\|z-\xi\|$. Thus, we may conclude that
\[
\ell(z) - \lambda\|z-\xi\| \le \ell(\xi) + (\|\ell\|_{\mathrm{lip}} - \lambda)\|z-\xi\| \le \ell(\xi),
\]
where the last inequality holds because $\|\ell\|_{\mathrm{lip}} - \lambda < 0$. Since the value $\ell(\xi)$ is attained at $z = \xi$, it follows that $h(\xi,\lambda) = \ell(\xi)$ for all $\lambda \ge \|\ell\|_{\mathrm{lip}}$. In this regime, the dual objective $\lambda\rho + \mathbb{E}[h(\xi,\lambda)]$ is strictly increasing in $\lambda$. Therefore, the infimum must be attained at some $\lambda \in [0,\|\ell\|_{\mathrm{lip}}]$, and we can assume $\lambda_\epsilon \le \|\ell\|_{\mathrm{lip}}$ without loss of optimality. Finally, using the $\lambda_\epsilon$-Lipschitzness of $h$ and the optimal coupling $\pi \in \Pi(\mathbb{P}_1,\mathbb{P}_2)$, we have
\[
\mathbb{E}_{\mathbb{P}_1}[h(\xi,\lambda_\epsilon)] - \mathbb{E}_{\mathbb{P}_2}[h(\xi,\lambda_\epsilon)] = \mathbb{E}_{(\xi_1,\xi_2)\sim\pi}[h(\xi_1,\lambda_\epsilon) - h(\xi_2,\lambda_\epsilon)] \le \mathbb{E}_\pi[\lambda_\epsilon\|\xi_1-\xi_2\|] \le \|\ell\|_{\mathrm{lip}}\cdot W_1(\mathbb{P}_1,\mathbb{P}_2).
\]
Substituting this into (9) and taking $\epsilon \to 0$ completes the proof for $p = 1$.

Case 2 ($p > 1$): First, we consider the case $\lambda_\epsilon = 0$. It follows that $h(\xi,0) = \sup_{z\in\Xi}\ell(z)$, which is constant. Substituting this into (9) and taking $\epsilon \to 0$, we have $V_\rho(\mathbb{P}_1) - V_\rho(\mathbb{P}_2) \le 0$, which completes the proof. Next, assume that $\lambda_\epsilon > 0$.
One can write
\[
\begin{aligned}
\mathbb{E}_{\mathbb{P}_1}[h(\xi,\lambda_\epsilon)] - \mathbb{E}_{\mathbb{P}_2}[h(\xi,\lambda_\epsilon)]
&= \mathbb{E}_{(\xi_1,\xi_2)\sim\pi}[h(\xi_1,\lambda_\epsilon) - h(\xi_2,\lambda_\epsilon)]
= \mathbb{E}_{(\xi_1,\xi_2)\sim\pi}\Big[\max_{k\in[K]} h_k(\xi_1,\lambda_\epsilon) - \max_{k\in[K]} h_k(\xi_2,\lambda_\epsilon)\Big]\\
&\le \max_{k\in[K]} \mathbb{E}_{(\xi_1,\xi_2)\sim\pi}[h_k(\xi_1,\lambda_\epsilon) - h_k(\xi_2,\lambda_\epsilon)]
\overset{(a)}{\le} \max_{k\in[K]} \mathbb{E}_{(\xi_1,\xi_2)\sim\pi}\big[\nabla_\xi h_k(\xi_2,\lambda_\epsilon)^\top(\xi_1-\xi_2)\big]\\
&\overset{(b)}{\le} \max_{k\in[K]} \big(\mathbb{E}_{(\xi_1,\xi_2)\sim\pi}\|\nabla_\xi h_k(\xi_2,\lambda_\epsilon)\|^q\big)^{1/q}\big(\mathbb{E}_{(\xi_1,\xi_2)\sim\pi}\|\xi_1-\xi_2\|^p\big)^{1/p}\\
&\overset{(c)}{\le} \max_{k\in[K]} \big(\mathbb{E}_{\xi\sim\mathbb{P}_2}\|\nabla\ell_k(\xi)\|^q\big)^{1/q} W_p(\mathbb{P}_1,\mathbb{P}_2)
= \max_{k\in[K]} \|\ell_k(\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}_2)} W_p(\mathbb{P}_1,\mathbb{P}_2)
\le \|\ell(\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}_2)} W_p(\mathbb{P}_1,\mathbb{P}_2). 
\end{aligned} \tag{10}
\]
Here, (a) follows from the differentiability and concavity of $h_k(\cdot,\lambda_\epsilon)$, as established in Lemma 13 and Lemma 12, respectively. Moreover, (b) follows from Hölder's inequality. Finally, (c) follows from Lemma 13. Combining (9) and (10) yields
\[
V_\rho(\mathbb{P}_1) - V_\rho(\mathbb{P}_2) \le \|\ell(\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}_2)} W_p(\mathbb{P}_1,\mathbb{P}_2) + \epsilon.
\]
Letting $\epsilon \to 0$ completes the proof. ■

B.5. Proof of Theorem 5

To prove this theorem, it suffices to control the individual terms in Lemma 2. By Lemma 3, we have
\[
\frac{1}{T}\sum_{t=1}^T \big(f(x_t,\mathbb{Q}_t) - f(x^\star,\mathbb{Q}_t)\big) \le \frac{G_{\mathcal{X}}D_{\mathcal{X}}(1+\log T)}{\sqrt{T}}.
\]
On the other hand, when $p = 1$, we may invoke Lemma 4 with $x = x_t$ to obtain
\[
\frac{1}{T}\sum_{t=1}^T \Big(\max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x_t,\mathbb{Q}) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x_t,\mathbb{Q})\Big) \le \frac{1}{T}\sum_{t=1}^T \|\ell(x_t,\cdot)\|_{\mathrm{lip}}\, W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star).
\]
Similarly, invoking Lemma 4 with $x = x^\star$ yields
\[
\frac{1}{T}\sum_{t=1}^T \Big(\max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x^\star,\mathbb{Q}) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x^\star,\mathbb{Q})\Big) \le \frac{1}{T}\sum_{t=1}^T \|\ell(x^\star,\cdot)\|_{\mathrm{lip}}\, W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star).
\]
Combining the above inequalities with Lemma 2 completes the proof for the first case ($p = 1$). The second case ($p > 1$) follows by an analogous argument and is therefore omitted for brevity. ■

B.6. Proof of Corollary 6

To present the proof, we need the following result.

Theorem 15 (Fournier and Guillin (2015), Theorem 2). If Assumption 3 holds, then for all $T \ge 1$ and $\epsilon > 0$, we have
\[
\mathbb{P}\big\{W_p(\widehat{\mathbb{P}}_T,\mathbb{P}^\star) \ge \epsilon\big\} \le
\begin{cases}
c_1\exp(-c_2 T\epsilon^{\max\{m/p,\,2\}}) & \text{if } \epsilon \le 1,\\
c_1\exp(-c_2 T\epsilon^{a/p}) & \text{if } \epsilon > 1,
\end{cases} \tag{11}
\]
where $c_1, c_2$ are positive constants that depend only on $a$, $m$, and $A := \mathbb{E}_{\xi\sim\mathbb{P}^\star}[\exp(\|\xi\|^a)]$.

Proof of Corollary 6. We only present the proof for $p = 1$, as the case $p > 1$ follows identically. According to the first statement of Theorem 5, it suffices to control the term
\[
\frac{1}{T}\sum_{t=1}^T \big(\|\ell(x_t,\cdot)\|_{\mathrm{lip}} + \|\ell(x^\star,\cdot)\|_{\mathrm{lip}}\big) W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star)
\]
and show that it dominates the other terms. We have
\[
\frac{1}{T}\sum_{t=1}^T \big(\|\ell(x_t,\cdot)\|_{\mathrm{lip}} + \|\ell(x^\star,\cdot)\|_{\mathrm{lip}}\big) W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) \le \Big(2\max_{x\in\mathcal{X}}\{\|\ell(x,\cdot)\|_{\mathrm{lip}}\}\Big)\cdot\frac{1}{T}\sum_{t=1}^T W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star). \tag{12}
\]
By Theorem 15, for each $t \ge \frac{1}{c_2}\log(\frac{c_1T}{\tau})$, with probability at least $1 - \tau/T$, we have
\[
W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) \le \Big(\frac{1}{c_2t}\log\Big(\frac{c_1T}{\tau}\Big)\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}}.
\]
Therefore, with probability at least $1-\tau$, we have
\[
\sum_{t=\lceil\frac{1}{c_2}\log(\frac{c_1T}{\tau})\rceil}^{T} W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) \le \sum_{t=\lceil\frac{1}{c_2}\log(\frac{c_1T}{\tau})\rceil}^{T} \Big(\frac{1}{c_2t}\log\Big(\frac{c_1T}{\tau}\Big)\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}} = \Big(\frac{1}{c_2}\log\Big(\frac{c_1T}{\tau}\Big)\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}} \sum_{t=\lceil\frac{1}{c_2}\log(\frac{c_1T}{\tau})\rceil}^{T} \Big(\frac{1}{t}\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}} \le \Big(\frac{1}{c_2}\log\Big(\frac{c_1T}{\tau}\Big)\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}} T^{1-\min\{\frac{p}{m},\frac{1}{2}\}}.
\]
It follows that
\[
\frac{1}{T}\sum_{t=1}^T W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) = \frac{1}{T}\sum_{t=1}^{\lceil\frac{1}{c_2}\log(\frac{c_1T}{\tau})\rceil-1} W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) + \frac{1}{T}\sum_{t=\lceil\frac{1}{c_2}\log(\frac{c_1T}{\tau})\rceil}^{T} W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star)
\le \frac{1}{T}\sum_{t=1}^{\lceil\frac{1}{c_2}\log(\frac{c_1T}{\tau})\rceil-1} W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) + \Big(\frac{1}{c_2}\log\Big(\frac{c_1T}{\tau}\Big)\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}} T^{-\min\{\frac{p}{m},\frac{1}{2}\}}
\le C_2\Big(\frac{1}{T}\log\Big(\frac{T}{\tau}\Big)\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}},
\]
where $C_2$ is a constant depending on $m$, $p$, $a$, and $A$. This bound, combined with (12) and the first statement of Theorem 5, completes the proof for $p = 1$. ■

B.7. Proof of Theorem 7

We start with the following fundamental theorem, which provides a finite convex reformulation of the worst-case expectation problem.

Theorem 16 (Mohajerin Esfahani and Kuhn (2018), Theorem 4.4). Under Assumptions 2 and 5, the worst-case expectation problem (3) is equivalent to the following convex program:
\[
\max_{\{\alpha_{ik},q_{ik}\}} \Big\{ \frac{1}{t}\sum_{i=1}^t\sum_{k=1}^K \alpha_{ik}\,\ell_k\Big(\widehat{\xi}_i - \frac{q_{ik}}{\alpha_{ik}}\Big) \;\;\text{s.t.}\;\; \frac{1}{t}\sum_{i=1}^t\sum_{k=1}^K \|q_{ik}\| \le \rho, \;\; \sum_{k=1}^K \alpha_{ik} = 1 \;\forall i\in[t], \;\; \alpha_{ik} \ge 0 \;\forall i\in[t],\,k\in[K] \Big\}. \tag{13}
\]
In particular, the optimal values of (3) and (13) coincide. Moreover, let $\{\alpha^\star_{ik}, q^\star_{ik}\}$ be an optimal solution of (13). Then the discrete probability distribution
\[
\mathbb{Q}^\star := \frac{1}{t}\sum_{i=1}^t\sum_{k=1}^K \alpha^\star_{ik}\,\delta_{\xi^\star_{ik}}, \quad \text{with } \xi^\star_{ik} := \widehat{\xi}_i - \frac{q^\star_{ik}}{\alpha^\star_{ik}},
\]
belongs to the Wasserstein ball $\mathbb{B}^1_\rho(\widehat{\mathbb{P}}_t)$ and attains the maximum of (3).

Our first goal is to establish the equivalence between the formulation (14a) from the above theorem and a new formulation (14b):
\[
\max_{\{\alpha_{ik},q_{ik}\}} \Big\{ \frac{1}{t}\sum_{i=1}^t\sum_{k=1}^K \alpha_{ik}\,\ell_k\Big(\widehat{\xi}_i - \frac{q_{ik}}{\alpha_{ik}}\Big) \;\text{s.t.}\; \sum_{i=1}^t\sum_{k=1}^K \|q_{ik}\| \le t\rho, \; \sum_{k=1}^K \alpha_{ik} = 1, \; \alpha_{ik} \ge 0 \Big\}, \tag{14a}
\]
\[
\max_{\{\alpha_{ik},v_{ik}\}} \Big\{ \frac{1}{t}\sum_{i=1}^t\sum_{k=1}^K \alpha_{ik}\,\ell_k(\widehat{\xi}_i - v_{ik}) \;\text{s.t.}\; \sum_{i=1}^t\sum_{k=1}^K \alpha_{ik}\|v_{ik}\| \le t\rho, \; \sum_{k=1}^K \alpha_{ik} = 1, \; \alpha_{ik} \ge 0 \Big\}. \tag{14b}
\]
To establish this equivalence, we first rewrite both formulations as follows:
\[
\max\Big\{ \frac{1}{t}\sum_{i=1}^t S_i^{\mathrm{old}}(b_i) \;\text{s.t.}\; \sum_{i=1}^t b_i \le t\rho, \; b_i \ge 0 \Big\}, \tag{15a}
\]
\[
\max\Big\{ \frac{1}{t}\sum_{i=1}^t S_i^{\mathrm{new}}(b_i) \;\text{s.t.}\; \sum_{i=1}^t b_i \le t\rho, \; b_i \ge 0 \Big\}, \tag{15b}
\]
where
\[
S_i^{\mathrm{old}}(b) := \max_{\{\alpha_k,q_k\}} \Big\{ \sum_{k=1}^K \alpha_k\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k}{\alpha_k}\Big) \;\text{s.t.}\; \sum_{k=1}^K \|q_k\| \le b, \; \sum_{k=1}^K \alpha_k = 1, \; \alpha_k \ge 0 \Big\}, \tag{16a}
\]
\[
S_i^{\mathrm{new}}(b) := \max_{\{\alpha_k,v_k\}} \Big\{ \sum_{k=1}^K \alpha_k\,\ell_k(\widehat{\xi}_i - v_k) \;\text{s.t.}\; \sum_{k=1}^K \alpha_k\|v_k\| \le b, \; \sum_{k=1}^K \alpha_k = 1, \; \alpha_k \ge 0 \Big\}. \tag{16b}
\]

Lemma 17. Problem (14a) is equivalent to problem (15a). Similarly, problem (14b) is equivalent to problem (15b).

Proof. The equivalence between problems (14a) and (15a) follows directly by observing that the constraint $\sum_{i=1}^t\sum_{k=1}^K \|q_{ik}\| \le t\rho$ is equivalent to introducing auxiliary variables $b_i$ such that $\sum_{i=1}^t b_i \le t\rho$ and $\sum_{k=1}^K \|q_{ik}\| \le b_i$ for $i = 1,\dots,t$. Given this reformulation, the equivalence follows immediately from the definition of $S_i^{\mathrm{old}}(b)$ in (16a). The equivalence between problems (14b) and (15b) can be established analogously. ■

Therefore, to establish the equivalence between (14a) and (14b), it suffices to show that $S_i^{\mathrm{old}}(b) = S_i^{\mathrm{new}}(b)$ for every $b \ge 0$ and $i \in [t]$. This is established in the next lemma. This equivalence justifies the interchangeable use of the perspective formulation (16a) and the budget-allocation formulation (16b) throughout our analysis.

Lemma 18. Fix any $b \ge 0$ and any $i \in [t]$.
Under Assumptions 2 and 5, we have $S_i^{\mathrm{old}}(b) = S_i^{\mathrm{new}}(b)$. Moreover, given any $\epsilon_1$-suboptimal solution to (16b), one can construct a feasible solution to (16a) that attains the same objective value. Conversely, given any $\epsilon_2$-suboptimal solution to (16a), one can construct a sequence of feasible solutions to (16b) whose objective values converge to the same value.

Proof. Consider (16b). For any $\epsilon_1 > 0$, there exist feasible solutions $\{\alpha_k^{\epsilon_1}\}_{k\in[K]}, \{v_k^{\epsilon_1}\}_{k\in[K]}$ such that
\[
0 \le S_i^{\mathrm{new}}(b) - \sum_{k=1}^K \alpha_k^{\epsilon_1}\ell_k(\widehat{\xi}_i - v_k^{\epsilon_1}) \le \epsilon_1.
\]
Without loss of generality, when $\alpha_k^{\epsilon_1} = 0$, we may assume $v_k^{\epsilon_1} = 0$. Upon defining $q_k^{\epsilon_1} = \alpha_k^{\epsilon_1}v_k^{\epsilon_1}$ for $k \in [K]$, the solutions $\{\alpha_k^{\epsilon_1}\}_{k\in[K]}, \{q_k^{\epsilon_1}\}_{k\in[K]}$ are feasible for (16a), and
\[
\frac{q_k^{\epsilon_1}}{\alpha_k^{\epsilon_1}} = \begin{cases} 0 & \text{if } \alpha_k^{\epsilon_1} = 0,\\ v_k^{\epsilon_1} & \text{if } \alpha_k^{\epsilon_1} > 0. \end{cases}
\]
It then follows that the objective value of (16a) evaluated at $\{\alpha_k^{\epsilon_1}\}_{k\in[K]}, \{q_k^{\epsilon_1}\}_{k\in[K]}$ is the same as the objective value of (16b) evaluated at $\{\alpha_k^{\epsilon_1}\}_{k\in[K]}, \{v_k^{\epsilon_1}\}_{k\in[K]}$, i.e.,
\[
\sum_{k=1}^K \alpha_k^{\epsilon_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_1}}{\alpha_k^{\epsilon_1}}\Big) = \sum_{k=1}^K \alpha_k^{\epsilon_1}\ell_k(\widehat{\xi}_i - v_k^{\epsilon_1}).
\]
This yields
\[
S_i^{\mathrm{old}}(b) \ge \sum_{k=1}^K \alpha_k^{\epsilon_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_1}}{\alpha_k^{\epsilon_1}}\Big) = \sum_{k=1}^K \alpha_k^{\epsilon_1}\ell_k(\widehat{\xi}_i - v_k^{\epsilon_1}) \ge S_i^{\mathrm{new}}(b) - \epsilon_1.
\]
Letting $\epsilon_1 \to 0^+$, we conclude that $S_i^{\mathrm{old}}(b) \ge S_i^{\mathrm{new}}(b)$.

Next, we show $S_i^{\mathrm{old}}(b) \le S_i^{\mathrm{new}}(b)$. For any $\epsilon_2 > 0$, there exist feasible solutions $\{\alpha_k^{\epsilon_2}\}_{k\in[K]}, \{q_k^{\epsilon_2}\}_{k\in[K]}$ for (16a) such that
\[
0 \le S_i^{\mathrm{old}}(b) - \sum_{k=1}^K \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) \le \epsilon_2.
\]
Define $\mathcal{I}_0^{\epsilon_2} := \{k\in[K] : \alpha_k^{\epsilon_2} = 0,\, q_k^{\epsilon_2} \neq 0\}$, $\mathcal{I}_1^{\epsilon_2} := \{k\in[K] : \alpha_k^{\epsilon_2} = 0,\, q_k^{\epsilon_2} = 0\}$, and $\mathcal{I}_2^{\epsilon_2} := \{k\in[K] : \alpha_k^{\epsilon_2} > 0\}$. For every $k \in \mathcal{I}_1^{\epsilon_2}$, we have
\[
\alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big)\Big|_{\alpha_k^{\epsilon_2}=0,\,q_k^{\epsilon_2}=0} = 0,
\]
where we use the convention $0/0 = 0$. For every $k \in \mathcal{I}_0^{\epsilon_2}$, we have
\[
\alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big)\Big|_{\alpha_k^{\epsilon_2}=0} := \liminf_{\alpha\to 0^+} \alpha\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha}\Big).
\]
In this case, by the definition of $\liminf$, for any $\epsilon_3 > 0$ there exists some $\delta_1 > 0$ such that
\[
\liminf_{\alpha\to 0^+} \alpha\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha}\Big) \le \inf_{\alpha\in(0,\delta_1)} \alpha\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha}\Big) + \epsilon_3.
\]
Upon taking an arbitrary $\alpha_k^{\delta_1} \in (0,\delta_1)$, we have
\[
\liminf_{\alpha\to 0^+} \alpha\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha}\Big) \le \inf_{\alpha\in(0,\delta_1)} \alpha\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha}\Big) + \epsilon_3 \le \alpha_k^{\delta_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\delta_1}}\Big) + \epsilon_3.
\]
This leads to
\[
\begin{aligned}
S_i^{\mathrm{old}}(b) &\le \sum_{k=1}^K \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) + \epsilon_2
= \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \liminf_{\alpha\to 0^+} \alpha\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha}\Big) + \sum_{k\in\mathcal{I}_2^{\epsilon_2}} \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) + \epsilon_2\\
&\le \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \Big(\alpha_k^{\delta_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\delta_1}}\Big) + \epsilon_3\Big) + \sum_{k\in\mathcal{I}_2^{\epsilon_2}} \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) + \epsilon_2\\
&\le \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \alpha_k^{\delta_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\delta_1}}\Big) + \sum_{k\in\mathcal{I}_2^{\epsilon_2}} \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) + \epsilon_2 + K\epsilon_3.
\end{aligned}
\]
Denote $k^\star = \mathrm{argmax}_{k\in[K]}\alpha_k^{\epsilon_2}$. By the pigeonhole principle, we have $\alpha_{k^\star}^{\epsilon_2} \ge \frac{1}{K}$ since $\sum_{k=1}^K \alpha_k^{\epsilon_2} = 1$. This implies that $k^\star \in \mathcal{I}_2^{\epsilon_2}$. Assume that $\delta_1 \le \frac{1}{K^2}$. Since $|\mathcal{I}_0^{\epsilon_2}| \le K-1$, we have
\[
\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \alpha_k^{\delta_1} \ge \frac{1}{K} - (K-1)\frac{1}{K^2} = \frac{1}{K^2} > 0.
\]
Now we construct a feasible solution to (16b).
Specifically, we choose
\[
\alpha_k' = \begin{cases}
\alpha_k^{\delta_1} & \text{if } k \in \mathcal{I}_0^{\epsilon_2},\\
\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \alpha_k^{\delta_1} & \text{if } k = k^\star,\\
\alpha_k^{\epsilon_2} & \text{if } k \notin \mathcal{I}_0^{\epsilon_2} \text{ and } k \neq k^\star,
\end{cases}
\]
and $v_k' = \frac{q_k^{\epsilon_2}}{\alpha_k'}$ for $k \in [K]$. For this constructed solution $\{\alpha_k'\}_{k\in[K]}, \{v_k'\}_{k\in[K]}$, we can verify its feasibility by noting that $\alpha_k' > 0$ for every $k \in [K]$ with $q_k^{\epsilon_2} \neq 0$, that $\sum_{k=1}^K \alpha_k' = 1$, and that
\[
\sum_{k=1}^K \alpha_k'\|v_k'\| = \sum_{k=1}^K \|q_k^{\epsilon_2}\| \le b.
\]
Moreover, we have
\[
\begin{aligned}
&\Big( \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \alpha_k^{\delta_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\delta_1}}\Big) + \sum_{k\in\mathcal{I}_2^{\epsilon_2}} \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) \Big) - \sum_{k=1}^K \alpha_k'\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k'}\Big)\\
&\quad= \alpha_{k^\star}^{\epsilon_2}\ell_{k^\star}\Big(\widehat{\xi}_i - \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big) - \Big(\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}\Big)\ell_{k^\star}\Big(\widehat{\xi}_i - \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}}\Big)\\
&\quad= \Big(\sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}\Big)\ell_{k^\star}\Big(\widehat{\xi}_i - \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big) + \Big(\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}\Big)\Big( \ell_{k^\star}\Big(\widehat{\xi}_i - \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big) - \ell_{k^\star}\Big(\widehat{\xi}_i - \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}}\Big) \Big)\\
&\quad\le \Big(\sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}\Big)\Big( \ell_{k^\star}(\widehat{\xi}_i) + \|\ell\|_{\mathrm{lip}}\Big\|\frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big\| \Big) + \Big(\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}\Big)\|\ell\|_{\mathrm{lip}}\Big\| \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}} - \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}} \Big\|\\
&\quad= \Big(\sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}\Big)\Big( \ell_{k^\star}(\widehat{\xi}_i) + 2\|\ell\|_{\mathrm{lip}}\Big\|\frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big\| \Big)
\le \max\Big\{ K\delta_1\Big( \ell_{k^\star}(\widehat{\xi}_i) + 2\|\ell\|_{\mathrm{lip}}\Big\|\frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big\| \Big),\, 0 \Big\}.
\end{aligned}
\]
On the other hand, since $\alpha_{k^\star}^{\epsilon_2} \ge \frac{1}{K}$, we obtain
\[
\ell_{k^\star}(\widehat{\xi}_i) + 2\|\ell\|_{\mathrm{lip}}\Big\|\frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big\| \le \ell_{k^\star}(\widehat{\xi}_i) + 2K\|\ell\|_{\mathrm{lip}}\|q_{k^\star}^{\epsilon_2}\|.
\]
Without loss of generality, we may assume that $\ell_{k^\star}(\widehat{\xi}_i) + 2K\|\ell\|_{\mathrm{lip}}\|q_{k^\star}^{\epsilon_2}\| > 0$. For any $\epsilon_2 > 0$, let
\[
\delta_1 = \min\Big\{ \frac{1}{K^2},\; \frac{\epsilon_2}{K\big(\ell_{k^\star}(\widehat{\xi}_i) + 2K\|\ell\|_{\mathrm{lip}}\|q_{k^\star}^{\epsilon_2}\|\big)} \Big\} \quad\text{and}\quad \epsilon_3 = \frac{\epsilon_2}{K}.
\]
It follows that
\[
\begin{aligned}
S_i^{\mathrm{old}}(b) &\le \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \alpha_k^{\delta_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\delta_1}}\Big) + \sum_{k\in\mathcal{I}_2^{\epsilon_2}} \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) + \epsilon_2 + K\epsilon_3\\
&\le \sum_{k=1}^K \alpha_k'\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k'}\Big) + K\delta_1\Big(\ell_{k^\star}(\widehat{\xi}_i) + 2\|\ell\|_{\mathrm{lip}}\Big\|\frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big\|\Big) + \epsilon_2 + K\epsilon_3
\le \sum_{k=1}^K \alpha_k'\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k'}\Big) + 3\epsilon_2 \le S_i^{\mathrm{new}}(b) + 3\epsilon_2.
\end{aligned}
\]
Letting $\epsilon_2 \to 0^+$, we obtain $S_i^{\mathrm{old}}(b) \le S_i^{\mathrm{new}}(b)$. This completes the proof. ■

Next, we establish a fundamental property of the value function $S_i^{\mathrm{new}}(b)$: its global optimum can be recovered by solving $\binom{K}{2}$ smaller subproblems, each defined on a pair of variables.

Lemma 19. Fix $b \ge 0$ and $i \in [t]$. Under Assumptions 2 and 5, for $1 \le k_1 < k_2 \le K$, define
\[
S_i^{\mathrm{new},(k_1,k_2)}(b) := \max_{\alpha_{k_1},\alpha_{k_2},v_{k_1},v_{k_2}} \Big\{ \alpha_{k_1}\ell_{k_1}(\widehat{\xi}_i - v_{k_1}) + \alpha_{k_2}\ell_{k_2}(\widehat{\xi}_i - v_{k_2}) \;\text{s.t.}\; \alpha_{k_1}\|v_{k_1}\| + \alpha_{k_2}\|v_{k_2}\| \le b, \; \alpha_{k_1} + \alpha_{k_2} = 1, \; \alpha_{k_1},\alpha_{k_2} \ge 0 \Big\}. \tag{17}
\]
Then, we have $S_i^{\mathrm{new}}(b) = \max_{1\le k_1<k_2\le K} S_i^{\mathrm{new},(k_1,k_2)}(b)$.

Proof. For fixed values $\{v_k\}_{k\in[K]}$, define
\[
S_i^{\mathrm{new}}(b,\{v_k\}) := \max_{\{\alpha_k\}} \Big\{ \sum_{k=1}^K \alpha_k\ell_k(\widehat{\xi}_i - v_k) \;\text{s.t.}\; \sum_{k=1}^K \alpha_k\|v_k\| \le b, \; \sum_{k=1}^K \alpha_k = 1, \; \alpha_k \ge 0 \Big\}. \tag{18}
\]
For any $\epsilon_0 > 0$, there exists some $\{v_k^{\epsilon_0}\}_{k\in[K]}$ such that $0 \le S_i^{\mathrm{new}}(b) - S_i^{\mathrm{new}}(b,\{v_k^{\epsilon_0}\}) \le \epsilon_0$. When $\{v_k\}_{k\in[K]}$ are fixed to $\{v_k^{\epsilon_0}\}_{k\in[K]}$, problem (18) reduces to a linear program in the decision variables $\{\alpha_k\}_{k\in[K]}$. Moreover, its feasible region is a bounded polyhedron, since $\{\alpha_k\}_{k\in[K]}$ lies in the standard simplex. By the fundamental theorem of linear programming, there exists an optimal solution $\{\alpha_k^{\star,\epsilon_0}\}_{k\in[K]}$ that is an extreme point of the feasible region.
Next, we establish a fundamental property of the value function $S^{\mathrm{new}}_i(b)$: its global optimum can be recovered by solving $\binom{K}{2}$ smaller subproblems, each defined on a pair of variables.

Lemma 19 Fix $b \ge 0$ and $i \in [t]$. Under Assumptions 2 and 5, for $1 \le k_1 < k_2 \le K$, define
\[
S^{\mathrm{new},(k_1,k_2)}_i(b) := \max_{\alpha_{k_1},\alpha_{k_2},v_{k_1},v_{k_2}} \left\{ \alpha_{k_1}\ell_{k_1}(\widehat{\xi}_i - v_{k_1}) + \alpha_{k_2}\ell_{k_2}(\widehat{\xi}_i - v_{k_2}) \ :\
\begin{array}{l} \alpha_{k_1}\|v_{k_1}\| + \alpha_{k_2}\|v_{k_2}\| \le b, \\ \alpha_{k_1} + \alpha_{k_2} = 1,\ \alpha_{k_1}, \alpha_{k_2} \ge 0 \end{array} \right\}. \tag{17}
\]
Then, we have $S^{\mathrm{new}}_i(b) = \max_{1 \le k_1 < k_2 \le K} S^{\mathrm{new},(k_1,k_2)}_i(b)$.

Proof For fixed perturbations $\{v_k\}_{k\in[K]}$, let
\[
S^{\mathrm{new}}_i(b, \{v_k\}) := \max_{\alpha} \left\{ \sum_{k=1}^K \alpha_k \ell_k(\widehat{\xi}_i - v_k) \ :\ \sum_{k=1}^K \alpha_k \|v_k\| \le b,\ \sum_{k=1}^K \alpha_k = 1,\ \alpha_k \ge 0 \right\} \tag{18}
\]
denote the restriction of (16b) in which only the weights $\{\alpha_k\}_{k\in[K]}$ are optimized. For any $\epsilon_0 > 0$, there exists some $\{v^{\epsilon_0}_k\}_{k\in[K]}$ such that
\[
0 \le S^{\mathrm{new}}_i(b) - S^{\mathrm{new}}_i(b, \{v^{\epsilon_0}_k\}) \le \epsilon_0 .
\]
When $\{v_k\}_{k\in[K]}$ are fixed to $\{v^{\epsilon_0}_k\}_{k\in[K]}$, problem (18) reduces to a linear program in the decision variables $\{\alpha_k\}_{k\in[K]}$. Moreover, its feasible region is a bounded polyhedron, since $\{\alpha_k\}_{k\in[K]}$ lies in the standard simplex. By the Fundamental Theorem of Linear Programming, there exists an optimal solution $\{\alpha^{\star,\epsilon_0}_k\}_{k\in[K]}$ that is an extreme point of this polyhedron. Consequently, at least $K$ linearly independent constraints of (18) are active at $\{\alpha^{\star,\epsilon_0}_k\}_{k\in[K]}$. Since (18) has only two constraints that are not sign constraints (the budget inequality and the simplex equality), at least $K - 2$ constraints of the form $\alpha_k \ge 0$, $k \in [K]$, are active, and hence at most two components of $\{\alpha^{\star,\epsilon_0}_k\}_{k\in[K]}$ can be nonzero. Using this observation, one can enumerate all possible pairs of indices corresponding to potentially nonzero components and reformulate (18) accordingly as
\[
S^{\mathrm{new}}_i\big(b, \{v^{\epsilon_0}_k\}_{k\in[K]}\big) = \max_{1 \le k_1 < k_2 \le K} S^{\mathrm{new},(k_1,k_2)}_i\big(b, \{v^{\epsilon_0}_{k_1}, v^{\epsilon_0}_{k_2}\}\big), \tag{19}
\]
where the right-hand side denotes the pairwise problem (17) with the perturbations fixed to $v^{\epsilon_0}_{k_1}, v^{\epsilon_0}_{k_2}$. Since fixing the perturbations only restricts (17), each term on the right-hand side of (19) is at most $S^{\mathrm{new},(k_1,k_2)}_i(b)$. Therefore, for any $\epsilon_0 > 0$, we have
\[
S^{\mathrm{new}}_i(b) \le S^{\mathrm{new}}_i(b, \{v^{\epsilon_0}_k\}) + \epsilon_0 = \epsilon_0 + \max_{1 \le k_1 < k_2 \le K} S^{\mathrm{new},(k_1,k_2)}_i\big(b, \{v^{\epsilon_0}_{k_1}, v^{\epsilon_0}_{k_2}\}\big) \le \epsilon_0 + \max_{1 \le k_1 < k_2 \le K} S^{\mathrm{new},(k_1,k_2)}_i(b).
\]
Letting $\epsilon_0 \to 0^+$ yields $S^{\mathrm{new}}_i(b) \le \max_{1 \le k_1 < k_2 \le K} S^{\mathrm{new},(k_1,k_2)}_i(b)$. Conversely, every feasible solution to (17) extends to a feasible solution to (16b) with the same objective value by setting $\alpha_k = 0$ and $v_k = 0$ for all $k \notin \{k_1, k_2\}$, so $S^{\mathrm{new}}_i(b) \ge S^{\mathrm{new},(k_1,k_2)}_i(b)$ for every pair. This completes the proof.

Proof of Lemma 8 We analyze the nested golden section search of Algorithm 4 applied to the pairwise problem (17). Without loss of generality, we assume that the optimal solution satisfies $\alpha^\star_1, \alpha^\star_2 > 0$ with $\alpha^\star_1 + \alpha^\star_2 = 1$, and restrict the search range of $\alpha_1$ to $(0, 1)$. Under Assumption 6, for any fixed $\alpha_j \in (0, 1)$ and $\beta_j \in [0, b]$, $j = 1, 2$, we have access to a subroutine running in time $\mathrm{Cost}_{k_j, \delta_{\mathrm{eval}}/2}$ that outputs a vector $q^\star_j$ satisfying $\|q^\star_j\| \le \beta_j$ and
\[
\alpha_j \ell_{k_j}\Big(\widehat{\xi}_i - \frac{q^\star_j}{\alpha_j}\Big) \ \ge\ \max_{\|q\| \le \beta_j} \alpha_j \ell_{k_j}\Big(\widehat{\xi}_i - \frac{q}{\alpha_j}\Big) - \frac{\delta_{\mathrm{eval}}}{2}, \qquad j = 1, 2.
\]
Since
\[
\Psi(\alpha, \beta) = \max_{\|q\| \le \beta_1} \alpha_1 \ell_{k_1}\Big(\widehat{\xi}_i - \frac{q}{\alpha_1}\Big) + \max_{\|q\| \le \beta_2} \alpha_2 \ell_{k_2}\Big(\widehat{\xi}_i - \frac{q}{\alpha_2}\Big),
\]
we conclude that the approximate evaluation $\widehat{\Psi}(\alpha, \beta)$ of $\Psi(\alpha, \beta)$ can be computed in time $\mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2}$.

Now, fix $\alpha$ in the interior of the simplex. Since $\ell_{k_1}$ and $\ell_{k_2}$ are $\|\ell\|_{\mathrm{lip}}$-Lipschitz, increasing $\beta_1$ by $\Delta\beta$ can increase $\max_{\|q\| \le \beta_1} \alpha_1 \ell_{k_1}(\widehat{\xi}_i - q/\alpha_1)$ by at most $\|\ell\|_{\mathrm{lip}} \Delta\beta$ (the factor $\alpha_1$ cancels the $1/\alpha_1$ scaling of the argument). At the same time, $\max_{\|q\| \le b - \beta_1} \alpha_2 \ell_{k_2}(\widehat{\xi}_i - q/\alpha_2)$ does not increase and may decrease by at most $\|\ell\|_{\mathrm{lip}} \Delta\beta$. Therefore, $\Psi(\alpha, \cdot)$ is $\|\ell\|_{\mathrm{lip}}$-Lipschitz in $\beta$.

Let $\beta$ and $\beta'$ be two points evaluated by the inner golden section search in Algorithm 4. The algorithm discards a subinterval based on comparing $\widehat{\Psi}(\alpha, \beta)$ and $\widehat{\Psi}(\alpha, \beta')$. We claim that the algorithm makes the correct comparison whenever $|\Psi(\alpha, \beta) - \Psi(\alpha, \beta')| > \delta_{\mathrm{eval}}$. To see this, suppose without loss of generality that $\Psi(\alpha, \beta) > \Psi(\alpha, \beta') + \delta_{\mathrm{eval}}$. Then
\[
\widehat{\Psi}(\alpha, \beta) \ \ge\ \Psi(\alpha, \beta) - \delta_{\mathrm{eval}} \ >\ \Psi(\alpha, \beta') \ \ge\ \widehat{\Psi}(\alpha, \beta'),
\]
where the last inequality uses $\widehat{\Psi} \le \Psi$, since the subroutine returns feasible points. Hence, the algorithm correctly shrinks the interval while retaining the optimal solution $\beta^\star(\alpha)$. Thus, after sufficiently many iterations, the inner golden section search returns $\widehat{\beta}(\alpha)$ satisfying
\[
\big| \Psi(\alpha, \beta^\star(\alpha)) - \Psi(\alpha, \widehat{\beta}(\alpha)) \big| \ \le\ \delta_{\mathrm{eval}}. \tag{20}
\]
It remains to bound the number of iterations in the inner golden section search. Let $\eta_{\mathrm{in}}$ denote the final interval length of this golden section search. By Lipschitz continuity of $\Psi$,
\[
\big| \Psi(\alpha, \beta^\star(\alpha)) - \Psi(\alpha, \widehat{\beta}(\alpha)) \big| \ \le\ \|\ell\|_{\mathrm{lip}}\, \eta_{\mathrm{in}} .
\]
Hence, (20) is ensured by choosing $\eta_{\mathrm{in}} \le \delta_{\mathrm{eval}} / \|\ell\|_{\mathrm{lip}}$. Consequently, the inner golden section search runs in time
\[
O\Big( \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \log\Big( \frac{b}{\eta_{\mathrm{in}}} \Big) \Big)
= O\Big( \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \log\Big( \frac{b \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big) \Big).
\]
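Before turning to the outer search, the comparison rule just analyzed is easy to exercise in isolation. The Python sketch below implements a golden section search for a unimodal objective whose oracle, like the subroutine of Assumption 6, returns a one-sided underestimate within $\delta_{\mathrm{eval}}$; the objective and all numerical values are hypothetical.

```python
# A minimal sketch of golden section search with inexact evaluations, mirroring
# the comparison rule analyzed above: the oracle returns a value in
# [f(x) - delta_eval, f(x)], and comparisons are reliable once the true gap
# exceeds delta_eval. The objective f below is a hypothetical stand-in.
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0  # reciprocal of the golden ratio

def golden_section_max(oracle, lo, hi, eta):
    """Shrink [lo, hi] to length <= eta, keeping a near-maximizer inside."""
    x1 = hi - INV_PHI * (hi - lo)
    x2 = lo + INV_PHI * (hi - lo)
    f1, f2 = oracle(x1), oracle(x2)
    while hi - lo > eta:
        if f1 < f2:                      # maximizer lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + INV_PHI * (hi - lo)
            f2 = oracle(x2)
        else:                            # maximizer lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - INV_PHI * (hi - lo)
            f1 = oracle(x1)
    return 0.5 * (lo + hi)

# Toy instance: f is concave (hence unimodal); the oracle under-reports by at
# most delta_eval, as in the subroutine of Assumption 6.
delta_eval, lip = 1e-3, 1.0
f = lambda beta: -abs(beta - 0.7)                 # ||l||_lip = 1
oracle = lambda beta: f(beta) - 0.5 * delta_eval  # one-sided error
beta_hat = golden_section_max(oracle, 0.0, 2.0, delta_eval / lip)
assert f(0.7) - f(beta_hat) <= 2 * delta_eval     # same flavor of bound as (20)
print(f"beta_hat = {beta_hat:.5f}")
```

The stopping length `eta = delta_eval / lip` is exactly the choice $\eta_{\mathrm{in}} \le \delta_{\mathrm{eval}} / \|\ell\|_{\mathrm{lip}}$ made in the proof.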
Next, we turn to the outer golden section search (over $\alpha$). Similar to our analysis of the inner golden section search, let $\alpha$ and $\alpha'$ be two points evaluated by the outer golden section search. The algorithm discards a subinterval based on comparing $\widehat{\Psi}(\alpha, \widehat{\beta}(\alpha))$ and $\widehat{\Psi}(\alpha', \widehat{\beta}(\alpha'))$. We claim that the algorithm makes the correct comparison whenever $|\Psi(\alpha, \widehat{\beta}(\alpha)) - \Psi(\alpha', \beta^\star(\alpha'))| > 3\delta_{\mathrm{eval}}$. Without loss of generality, let us assume that
\[
\Psi(\alpha, \widehat{\beta}(\alpha)) > \Psi(\alpha', \beta^\star(\alpha')) + 3\delta_{\mathrm{eval}} .
\]
Note that $|\widehat{\Psi}(\alpha, \widehat{\beta}(\alpha)) - \Psi(\alpha, \widehat{\beta}(\alpha))| \le \delta_{\mathrm{eval}}$. Moreover,
\begin{align*}
\big| \widehat{\Psi}(\alpha', \widehat{\beta}(\alpha')) - \Psi(\alpha', \beta^\star(\alpha')) \big|
&\le \big| \widehat{\Psi}(\alpha', \widehat{\beta}(\alpha')) - \Psi(\alpha', \widehat{\beta}(\alpha')) \big| + \big| \Psi(\alpha', \widehat{\beta}(\alpha')) - \Psi(\alpha', \beta^\star(\alpha')) \big| \\
&\le \delta_{\mathrm{eval}} + \|\ell\|_{\mathrm{lip}}\, \eta_{\mathrm{in}} \ \le\ 2\delta_{\mathrm{eval}} .
\end{align*}
Therefore,
\[
\widehat{\Psi}(\alpha, \widehat{\beta}(\alpha)) \ \ge\ \Psi(\alpha, \widehat{\beta}(\alpha)) - \delta_{\mathrm{eval}} \ >\ \Psi(\alpha', \beta^\star(\alpha')) + 2\delta_{\mathrm{eval}} \ \ge\ \widehat{\Psi}(\alpha', \widehat{\beta}(\alpha')),
\]
and the algorithm correctly shrinks the interval while retaining the optimal solution $\alpha^\star$. Thus, after sufficiently many iterations, the outer golden section search returns a solution $\widehat{\alpha}$ satisfying
\[
\big| \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) - \Psi(\alpha^\star, \beta^\star(\alpha^\star)) \big| \ \le\ 3\delta_{\mathrm{eval}}. \tag{21}
\]
It remains to bound the number of iterations in the outer golden section search. Let $\eta_{\mathrm{out}}$ denote the final interval length of the outer golden section search. Since we assume $\alpha^\star$ lies in the interior of the simplex, there must exist an open neighborhood $\mathcal{N}(\alpha^\star) = \{\alpha : \|\alpha - \alpha^\star\| \le r\}$ within which $\Psi(\cdot, \beta^\star(\alpha^\star))$ is locally Lipschitz, that is, for any $\alpha \in \mathcal{N}(\alpha^\star)$:
\[
-L_{\alpha^\star} |\alpha - \alpha^\star| \ \le\ \Psi(\alpha, \beta^\star(\alpha^\star)) - \Psi(\alpha^\star, \beta^\star(\alpha^\star)) \ \le\ 0 .
\]
Using this property, one can write
\begin{align*}
\Psi(\alpha^\star, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha}))
&= \big( \Psi(\alpha^\star, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \beta^\star(\alpha^\star)) \big) + \big( \Psi(\widehat{\alpha}, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \beta^\star(\widehat{\alpha})) \big) \\
&\qquad + \big( \Psi(\widehat{\alpha}, \beta^\star(\widehat{\alpha})) - \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) \big) \\
&\le \big( \Psi(\alpha^\star, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \beta^\star(\alpha^\star)) \big) + \big( \Psi(\widehat{\alpha}, \beta^\star(\widehat{\alpha})) - \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) \big) \\
&\le L_{\alpha^\star}\, \eta_{\mathrm{out}} + \|\ell\|_{\mathrm{lip}}\, \eta_{\mathrm{in}} \ \le\ L_{\alpha^\star}\, \eta_{\mathrm{out}} + \delta_{\mathrm{eval}},
\end{align*}
where in the second line we used the fact that $\Psi(\widehat{\alpha}, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \beta^\star(\widehat{\alpha})) \le 0$ because, by definition, $\beta^\star(\widehat{\alpha})$ is the maximizer of $\Psi(\widehat{\alpha}, \cdot)$. Since $\Psi(\alpha^\star, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) \ge 0$ (due to the optimality of $(\alpha^\star, \beta^\star(\alpha^\star))$), we thus have
\[
\big| \Psi(\alpha^\star, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) \big| \ \le\ L_{\alpha^\star}\, \eta_{\mathrm{out}} + \delta_{\mathrm{eval}} . \tag{22}
\]
Hence, we satisfy (21) by choosing $\eta_{\mathrm{out}} \le \min\{ 2\delta_{\mathrm{eval}} / L_{\alpha^\star},\ r \}$. Combined with the complexity of the inner golden section search, this implies that the nested golden section search runs in time
\begin{align*}
&O\Big( \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \log\Big( \frac{b}{\eta_{\mathrm{in}}} \Big) \log\Big( \frac{1}{\eta_{\mathrm{out}}} \Big) \Big) \\
&\quad = O\Big( \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \log\Big( \frac{b \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big) \max\Big\{ \log\Big( \frac{L_{\alpha^\star}}{2\delta_{\mathrm{eval}}} \Big),\ \log\Big( \frac{1}{r} \Big) \Big\} \Big).
\end{align*}
The proof is completed by noting that $\Psi(\alpha^\star, \beta^\star(\alpha^\star))$ is precisely $S^{\mathrm{new},(k_1,k_2)}_i(b)$, and its evaluation $\widehat{\Psi}(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha}))$ satisfies
\[
\big| \widehat{\Psi}(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) - \Psi(\alpha^\star, \beta^\star(\alpha^\star)) \big| \le \big| \widehat{\Psi}(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) - \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) \big| + \big| \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) - \Psi(\alpha^\star, \beta^\star(\alpha^\star)) \big| \le \delta_{\mathrm{eval}} + 3\delta_{\mathrm{eval}} \le 4\delta_{\mathrm{eval}} .
\]

Proof of Lemma 9 The proof readily follows from (5) and the result of Lemma 8.
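To make the nested structure concrete, the following sketch mimics Algorithm 4 on a hypothetical scalar instance: an outer golden section search over $\alpha_1$, an inner one over the budget split $\beta_1$, and an innermost one standing in for the subroutine of Assumption 6. It is an illustration under these toy assumptions, not the authors' implementation.

```python
# A minimal sketch of the nested search analyzed above: an outer golden section
# search over alpha_1 in (0,1) whose objective is itself evaluated by an inner
# golden section search over the budget split beta_1 in [0, b]. The two concave
# pieces ell_1, ell_2 and the scalar setting are hypothetical stand-ins.
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0

def gss_max(f, lo, hi, eta):
    # generic golden section search for a unimodal f; final interval <= eta
    x1, x2 = hi - INV_PHI * (hi - lo), lo + INV_PHI * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > eta:
        if f1 < f2:
            lo, x1, f1 = x1, x2, f2
            x2 = lo + INV_PHI * (hi - lo); f2 = f(x2)
        else:
            hi, x2, f2 = x2, x1, f1
            x1 = hi - INV_PHI * (hi - lo); f1 = f(x1)
    x = 0.5 * (lo + hi)
    return x, f(x)

xi, b, lip, delta_eval = 0.3, 1.0, 1.0, 1e-4
ell1 = lambda z: -abs(z - 1.0)   # concave, 1-Lipschitz
ell2 = lambda z: -abs(z + 0.5)

def inner_max(ell, alpha, beta):
    # max_{|q| <= beta} alpha * ell(xi - q / alpha), standing in for the
    # delta_eval/2-accurate subroutine of Assumption 6
    _, val = gss_max(lambda q: alpha * ell(xi - q / alpha), -beta, beta,
                     delta_eval / (2 * lip))
    return val

def psi(alpha1, beta1):
    # Psi(alpha, beta) with alpha_2 = 1 - alpha_1 and beta_2 = b - beta_1
    return inner_max(ell1, alpha1, beta1) + inner_max(ell2, 1 - alpha1, b - beta1)

def psi_hat(alpha1):
    # inner golden section search over the budget split, as in Algorithm 4
    _, val = gss_max(lambda beta1: psi(alpha1, beta1), 0.0, b, delta_eval / lip)
    return val

alpha_hat, val = gss_max(psi_hat, 1e-6, 1.0 - 1e-6, delta_eval)
print(f"alpha_hat ~ {alpha_hat:.4f}, approximate pairwise value ~ {val:.6f}")
```

Concavity of $\beta_1 \mapsto \Psi(\alpha, \beta_1)$ and of the perspective-type objective in $\alpha_1$ is what makes both levels unimodal, so golden section search applies at each level.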
B.9. Proof of Lemma 10

Before presenting the proof of Lemma 10, we first need two helper lemmas.

Lemma 21 Under Assumptions 2 and 5, the function $S_i(b)$ is concave in $b \ge 0$ for each $i \in [t]$.

Proof By Lemma 18, $S_i(b)$ admits the representation
\[
S_i(b) = \max_{\alpha, q} \left\{ \sum_{k=1}^K \alpha_k \ell_k\Big(\widehat{\xi}_i - \frac{q_k}{\alpha_k}\Big) \ :\ \sum_{k=1}^K \|q_k\| \le b,\ \sum_{k=1}^K \alpha_k = 1,\ \alpha_k \ge 0 \right\}.
\]
We first characterize the feasible set. Since $\sum_{k=1}^K \|q_k\|$ is convex in $q$, its epigraph $\{ (q, b) : \sum_{k=1}^K \|q_k\| \le b \}$ is jointly convex in $(q, b)$. Moreover, the standard simplex $\{ \alpha : \sum_{k=1}^K \alpha_k = 1,\ \alpha_k \ge 0 \}$ is convex. Therefore, the set
\[
\mathcal{C}_1 := \left\{ \{(\alpha_k, q_k, b)\} : \sum_{k=1}^K \|q_k\| \le b,\ \sum_{k=1}^K \alpha_k = 1,\ \alpha_k \ge 0 \right\}
\]
is jointly convex in $\{(\alpha_k, q_k, b)\}$. Next, consider the objective. For each $k$, the function $-\alpha_k \ell_k(\widehat{\xi}_i - q_k/\alpha_k)$ is the perspective of the convex function $-\ell_k$, composed with the affine map $(\alpha_k, q_k) \mapsto (\alpha_k \widehat{\xi}_i - q_k, \alpha_k)$. Hence, it is jointly convex in $(\alpha_k, q_k)$. Summing over $k$, we conclude that
\[
\sum_{k=1}^K -\alpha_k \ell_k\Big(\widehat{\xi}_i - \frac{q_k}{\alpha_k}\Big)
\]
is jointly convex in $(\alpha, q)$. Consequently, its epigraph
\[
\mathcal{C}_2 := \left\{ \{(\alpha_k, q_k, t)\} : \sum_{k=1}^K -\alpha_k \ell_k\Big(\widehat{\xi}_i - \frac{q_k}{\alpha_k}\Big) \le t \right\}
\]
is jointly convex in $\{(\alpha_k, q_k, t)\}$. Combining the two parts, the set
\[
\mathcal{C}_1 \cap \mathcal{C}_2 = \left\{ \{(\alpha_k, q_k, b, t)\} : \sum_{k=1}^K \|q_k\| \le b,\ \sum_{k=1}^K \alpha_k = 1,\ \alpha_k \ge 0,\ \sum_{k=1}^K -\alpha_k \ell_k\Big(\widehat{\xi}_i - \frac{q_k}{\alpha_k}\Big) \le t \right\}
\]
is jointly convex in $(\alpha, q, b, t)$. Finally, observe that the epigraph of $-S_i$ can be written as
\[
\mathrm{epi}(-S_i) = \{ (b, t) : \exists (\alpha, q) \text{ s.t. } (\alpha, q, b, t) \in \mathcal{C}_1 \cap \mathcal{C}_2 \},
\]
which is the projection of a convex set onto the $(b, t)$-coordinates. Since projections preserve convexity, $\mathrm{epi}(-S_i)$ is convex, implying that $-S_i$ is convex and hence $S_i$ is concave in $b$.

Our next lemma establishes the Lipschitz continuity of $S_i$.

Lemma 22 Under Assumptions 2 and 5, the function $S_i(b)$ is Lipschitz continuous on $b \ge 0$ with Lipschitz constant $\|\ell\|_{\mathrm{lip}}$, for each $i \in [t]$.

Proof First, we show that the right derivative of $S_i(b)$ exists for $b \ge 0$. For a fixed budget $b \ge 0$, the right derivative of $S_i$ at $b$ is defined as
\[
S'_{i,+}(b) := \lim_{h \to 0^+} \phi(h) := \lim_{h \to 0^+} \frac{S_i(b + h) - S_i(b)}{h} \ \ge\ 0,
\]
where nonnegativity holds because $S_i$ is non-decreasing in $b$ (a larger budget enlarges the feasible set). To show this limit exists, it suffices to show that the difference quotient $\phi(h)$ is non-increasing in $h$, so that $\phi(h)$ is monotone as $h \downarrow 0$. Let $0 < h_1 < h_2$. Since $S_i(\cdot)$ is concave and
\[
b + h_1 = \Big( 1 - \frac{h_1}{h_2} \Big) b + \frac{h_1}{h_2} (b + h_2),
\]
by Jensen's inequality we have
\[
S_i(b + h_1) \ \ge\ \Big( 1 - \frac{h_1}{h_2} \Big) S_i(b) + \frac{h_1}{h_2} S_i(b + h_2),
\]
which can be rearranged as
\[
\phi(h_1) = \frac{S_i(b + h_1) - S_i(b)}{h_1} \ \ge\ \frac{S_i(b + h_2) - S_i(b)}{h_2} = \phi(h_2).
\]
Therefore, the right derivative $S'_{i,+}(b) \ge 0$ exists for every $b \ge 0$. Similarly, we can show that the left derivative
\[
S'_{i,-}(b) := \lim_{h \to 0^-} \frac{S_i(b + h) - S_i(b)}{h} \ \ge\ 0
\]
exists for every $b > 0$. Furthermore, it follows from concavity that $S'_{i,-}(b) \ge S'_{i,+}(b)$ for every $b > 0$, and the set of supergradients is given by $\partial S_i(b) = [S'_{i,+}(b), S'_{i,-}(b)]$.

Next, we show that for $0 \le b_1 < b_2$, we have $S'_{i,+}(b_1) \ge S'_{i,-}(b_2)$. Let $h_2 < 0 < h_1$ be such that $b_1 + h_1 \le b_2 + h_2$. Since $S_i(\cdot)$ is concave, the three-point slope inequality gives
\[
\frac{S_i(b_1 + h_1) - S_i(b_1)}{(b_1 + h_1) - b_1} \ \ge\ \frac{S_i(b_2 + h_2) - S_i(b_1 + h_1)}{(b_2 + h_2) - (b_1 + h_1)} \ \ge\ \frac{S_i(b_2) - S_i(b_2 + h_2)}{b_2 - (b_2 + h_2)},
\]
which implies that
\[
\frac{S_i(b_1 + h_1) - S_i(b_1)}{h_1} \ \ge\ \frac{S_i(b_2 + h_2) - S_i(b_2)}{h_2}.
\]
Letting $h_1 \to 0^+$ and $h_2 \to 0^-$, we have $S'_{i,+}(b_1) \ge S'_{i,-}(b_2)$. It follows that $0 \le g_i \le S'_{i,+}(0)$ for every $g_i \in \partial S_i(b)$ and all $b \ge 0$.

Now it remains to show that $S'_{i,+}(0) \le \|\ell\|_{\mathrm{lip}}$. For any $\delta > 0$ and $\epsilon > 0$, there exists a feasible solution $\{\alpha^{\delta,\epsilon}_k\}_{k\in[K]}, \{v^{\delta,\epsilon}_k\}_{k\in[K]}$ to (16b) with budget $\delta$ that satisfies
\[
0 \le S_i(\delta) - \sum_{k=1}^K \alpha^{\delta,\epsilon}_k \ell_k(\widehat{\xi}_i - v^{\delta,\epsilon}_k) < \epsilon .
\]
Therefore, the right derivative of $S_i$ at $0$ can be bounded as
\begin{align*}
S'_{i,+}(0) = \lim_{\delta \to 0^+} \frac{S_i(\delta) - S_i(0)}{\delta}
&\le \lim_{\delta \to 0^+} \frac{\sum_{k=1}^K \alpha^{\delta,\epsilon}_k \ell_k(\widehat{\xi}_i - v^{\delta,\epsilon}_k) + \epsilon - \max_{k \in [K]} \ell_k(\widehat{\xi}_i)}{\delta} \\
&\le \lim_{\delta \to 0^+} \frac{\sum_{k=1}^K \alpha^{\delta,\epsilon}_k \big[ \ell_k(\widehat{\xi}_i - v^{\delta,\epsilon}_k) - \ell_k(\widehat{\xi}_i) \big] + \epsilon}{\delta} \\
&\le \lim_{\delta \to 0^+} \frac{\sum_{k=1}^K \alpha^{\delta,\epsilon}_k \|\ell_k\|_{\mathrm{lip}} \|v^{\delta,\epsilon}_k\| + \epsilon}{\delta} \\
&\le \lim_{\delta \to 0^+} \frac{\max_{k \in [K]} \|\ell_k\|_{\mathrm{lip}}\, \delta + \epsilon}{\delta} \ =\ \lim_{\delta \to 0^+} \frac{\|\ell\|_{\mathrm{lip}}\, \delta + \epsilon}{\delta},
\end{align*}
where the first inequality uses $S_i(0) = \max_{k \in [K]} \ell_k(\widehat{\xi}_i)$, the second uses $\sum_k \alpha^{\delta,\epsilon}_k \ell_k(\widehat{\xi}_i) \le \max_{k \in [K]} \ell_k(\widehat{\xi}_i)$, and the fourth uses the budget constraint $\sum_k \alpha^{\delta,\epsilon}_k \|v^{\delta,\epsilon}_k\| \le \delta$. Since $\epsilon > 0$ can be chosen arbitrarily small for each fixed $\delta$, letting $\epsilon \to 0^+$ yields
\[
S'_{i,+}(0) \ \le\ \lim_{\delta \to 0^+} \frac{\|\ell\|_{\mathrm{lip}}\, \delta}{\delta} \ =\ \|\ell\|_{\mathrm{lip}} .
\]
This completes the proof.
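Lemmas 21 and 22 can also be probed numerically. The sketch below evaluates $S_i(b)$ by a grid search on a hypothetical one-dimensional instance with two concave pieces (for which the innermost maximization has a closed form), then tests midpoint concavity and the $\|\ell\|_{\mathrm{lip}}$-Lipschitz bound; the tolerance absorbs the grid-discretization error.

```python
# Numerical probe of Lemmas 21 and 22 on a hypothetical 1-D instance with two
# concave pieces ell_k(z) = c_k - L_k * |z - m_k|: S_i(b) is evaluated by a grid
# search over (alpha_1, budget split), using the closed-form inner maximizer.
import numpy as np

xi = 0.2
pieces = [(1.0, 1.0, -0.5), (0.8, 0.6, 1.5)]  # hypothetical (c_k, L_k, m_k)
lip = max(L for _, L, _ in pieces)            # ||l||_lip = max_k L_k

def inner(c, L, m, alpha, beta):
    # max_{alpha * |v| <= beta} alpha * ell(xi - v); zero-weight pieces give 0
    radius = beta / np.maximum(alpha, 1e-12)
    v = np.clip(xi - m, -radius, radius)      # closest feasible shift to xi - m
    return alpha * (c - L * np.abs(xi - v - m))

def S(b, n=301):
    a1 = np.linspace(0.0, 1.0, n)[:, None]    # weight grid (column)
    b1 = np.linspace(0.0, b, n)[None, :]      # budget-split grid (row)
    vals = inner(*pieces[0], a1, b1) + inner(*pieces[1], 1.0 - a1, b - b1)
    return float(vals.max())

bs = np.linspace(0.0, 3.0, 13)
vals = np.array([S(b) for b in bs])
mids = np.array([S(0.5 * (x + y)) for x, y in zip(bs[:-1], bs[1:])])
tol = 0.05                                    # slack for grid discretization
assert np.all(mids >= 0.5 * (vals[:-1] + vals[1:]) - tol), "concavity violated"
assert np.all(np.abs(np.diff(vals)) <= lip * np.diff(bs) + tol), "Lipschitz violated"
print("S_i(b) passes the concavity and Lipschitz checks on this toy instance")
```

Such a check cannot prove either lemma, but it exercises the two properties that the proof of Lemma 10 below relies on: concavity (so golden section search applies) and the $\|\ell\|_{\mathrm{lip}}$ bound on the slope of $S_i$.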
Proof of Lemma 10 Due to the concavity and non-decreasing property of $S_i(b)$, there exists $b^{\mathrm{limit}}_i \ge 0$ such that $S_i(b)$ is strictly increasing and concave on $[0, b^{\mathrm{limit}}_i]$, and constant on $(b^{\mathrm{limit}}_i, \infty)$. Note that $b^{\mathrm{limit}}_i$ can be $+\infty$, in which case $S_i(b)$ is strictly increasing on $[0, +\infty)$. Moreover, for any $\lambda \ge \|\ell\|_{\mathrm{lip}}$, the point $b = 0$ maximizes $S_i(b) - \lambda b$, since $S_i$ is $\|\ell\|_{\mathrm{lip}}$-Lipschitz by Lemma 22. Therefore, without loss of generality, we can restrict the search range for the dual variable $\lambda$ to $[0, \|\ell\|_{\mathrm{lip}}]$. Note that $\operatorname{argmax}_{b \ge 0} \{ S_i(b) - \lambda b \}$ is a closed set; we let $b^\star_i(\lambda)$ denote the smallest element of this set.

For any fixed $i \in [t]$ and dual candidate $\lambda \in [0, \|\ell\|_{\mathrm{lip}}]$, let $\widehat{b}_i(\lambda)$ be the output of Algorithm 3. Let $b_i$ and $b'_i$ be two distinct points evaluated by the golden section search presented in Algorithm 3. The algorithm discards a subinterval based on comparing $\widehat{S}_i(b_i) - \lambda b_i$ and $\widehat{S}_i(b'_i) - \lambda b'_i$. The algorithm makes the correct comparison whenever
\[
\big| (S_i(b_i) - \lambda b_i) - (S_i(b'_i) - \lambda b'_i) \big| > 2\delta_{\mathrm{eval}} .
\]
To see this, suppose without loss of generality that $(S_i(b_i) - \lambda b_i) > (S_i(b'_i) - \lambda b'_i) + 2\delta_{\mathrm{eval}}$. Then
\[
\widehat{S}_i(b_i) - \lambda b_i \ \ge\ S_i(b_i) - \lambda b_i - \delta_{\mathrm{eval}} \ \ge\ S_i(b'_i) - \lambda b'_i + \delta_{\mathrm{eval}} \ \ge\ \widehat{S}_i(b'_i) - \lambda b'_i,
\]
and the algorithm correctly shrinks the interval while retaining the optimal solution $b^\star_i(\lambda)$. Thus, after sufficiently many iterations, the golden section search returns $\widehat{b}_i(\lambda)$ satisfying
\[
\Big| \big( S_i(\widehat{b}_i(\lambda)) - \lambda \widehat{b}_i(\lambda) \big) - \big( S_i(b^\star_i(\lambda)) - \lambda b^\star_i(\lambda) \big) \Big| \ \le\ 2\delta_{\mathrm{eval}}. \tag{23}
\]
It remains to bound the number of iterations in the golden section search. Let $\eta_b$ denote the final interval length of this golden section search. By Lemma 22, the function $S_i$ is Lipschitz continuous with constant $\|\ell\|_{\mathrm{lip}}$. Therefore,
\[
\Big| \big( S_i(\widehat{b}_i(\lambda)) - \lambda \widehat{b}_i(\lambda) \big) - \big( S_i(b^\star_i(\lambda)) - \lambda b^\star_i(\lambda) \big) \Big| \ \le\ (\|\ell\|_{\mathrm{lip}} + \lambda)\, \eta_b \ \le\ 2 \|\ell\|_{\mathrm{lip}}\, \eta_b,
\]
where the last inequality uses the fact that $\lambda \in [0, \|\ell\|_{\mathrm{lip}}]$. Hence, (23) is ensured by choosing $\eta_b = \delta_{\mathrm{eval}} / \|\ell\|_{\mathrm{lip}}$. Consequently, combined with the complexity of Algorithm 4 derived in Lemma 9, we conclude that Algorithm 3 runs in time
\begin{align*}
&O\Big( \Gamma \cdot K^2 \cdot \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \cdot \log\Big( \frac{b}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{1}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{\rho t}{\eta_b} \Big) \Big) \\
&\quad = O\Big( \Gamma \cdot K^2 \cdot \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \cdot \log\Big( \frac{b}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{1}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{\rho t \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big) \Big).
\end{align*}
The proof is completed by noting that
\begin{align*}
\Big| \big( \widehat{S}_i(\widehat{b}_i(\lambda)) - \lambda \widehat{b}_i(\lambda) \big) - \big( S_i(b^\star_i(\lambda)) - \lambda b^\star_i(\lambda) \big) \Big|
&\le \big| \widehat{S}_i(\widehat{b}_i(\lambda)) - S_i(\widehat{b}_i(\lambda)) \big| + \Big| \big( S_i(\widehat{b}_i(\lambda)) - \lambda \widehat{b}_i(\lambda) \big) - \big( S_i(b^\star_i(\lambda)) - \lambda b^\star_i(\lambda) \big) \Big| \\
&\le 4\delta_{\mathrm{eval}} + 2\delta_{\mathrm{eval}} = 6\delta_{\mathrm{eval}} .
\end{align*}

B.10. Proof of Theorem 11

Denote the optimal dual variable of (4) by $\lambda^\star \in [0, \|\ell\|_{\mathrm{lip}}]$. By Lemma 10, for any $\lambda \in [0, \|\ell\|_{\mathrm{lip}}]$, Algorithm 7 runs in time
\[
O\Big( \Gamma \cdot \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \cdot \log\Big( \frac{b}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{1}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{\rho t \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big) \Big)
\]
and outputs a $\widehat{b}_i(\lambda)$ satisfying $|\widehat{b}_i(\lambda) - b^\star_i(\lambda)| \le \eta_b$ with $\eta_b \le \delta_{\mathrm{eval}} / \|\ell\|_{\mathrm{lip}}$.

Now, we analyze the bisection method of Algorithm 2 (over $\lambda$). Similar to our analysis in Lemma 10, let $\lambda$ be a dual candidate at some iteration of the bisection search. The algorithm discards a subinterval based on the sign of $\rho t - \sum_{i=1}^t \widehat{b}_i(\lambda)$. We claim that the algorithm makes the correct decision whenever
\[
\Big| \rho t - \sum_{i=1}^t b^\star_i(\lambda) \Big| > t \eta_b .
\]
To see this, assume without loss of generality that $\rho t - \sum_{i=1}^t b^\star_i(\lambda) > t \eta_b$. Then
\[
\rho t - \sum_{i=1}^t \widehat{b}_i(\lambda) = \rho t - \sum_{i=1}^t b^\star_i(\lambda) + \sum_{i=1}^t \big( b^\star_i(\lambda) - \widehat{b}_i(\lambda) \big) \ >\ t\eta_b - t\eta_b = 0 .
\]
This implies that the algorithm correctly shrinks the interval while retaining the optimal dual $\lambda^\star$ within the interval. Thus, after sufficiently many iterations, the bisection search returns a solution $\widehat{\lambda}$ satisfying
\[
\Big| \rho t - \sum_{i=1}^t b^\star_i(\widehat{\lambda}) \Big| \ \le\ t \eta_b . \tag{24}
\]
It remains to bound the number of iterations in the outer bisection search. Let $\eta_\lambda$ denote the final interval length of the bisection search. Given $\lambda^\star \in [0, \|\ell\|_{\mathrm{lip}}]$, there exists a neighborhood $\mathcal{N}_i(\lambda^\star) = \{\lambda : |\lambda - \lambda^\star| \le r_i\} \cap [0, \|\ell\|_{\mathrm{lip}}]$ within which $b^\star_i(\cdot)$ is locally Lipschitz, that is, for any $\lambda \in \mathcal{N}_i(\lambda^\star)$:
\[
|b^\star_i(\lambda) - b^\star_i(\lambda^\star)| \ \le\ L^{(i)}_{\lambda^\star} |\lambda - \lambda^\star| .
\]
Using this property and the fact that $\sum_{i=1}^t b^\star_i(\lambda^\star) = \rho t$ at the optimal dual, one can write
\[
\Big| \rho t - \sum_{i=1}^t b^\star_i(\widehat{\lambda}) \Big| = \Big| \sum_{i=1}^t \big( b^\star_i(\lambda^\star) - b^\star_i(\widehat{\lambda}) \big) \Big| \ \le\ \big| \widehat{\lambda} - \lambda^\star \big| \sum_{i=1}^t L^{(i)}_{\lambda^\star} \ \le\ \eta_\lambda \sum_{i=1}^t L^{(i)}_{\lambda^\star} .
\]
Hence, we satisfy (24) by choosing $\eta_\lambda = \min\{ \eta_b / L_{\lambda^\star},\ r_{\min} \}$, where
\[
L_{\lambda^\star} := \max_{i \in [t]} L^{(i)}_{\lambda^\star}, \qquad r_{\min} := \min_{i \in [t]} \{ r_i \} .
\]
Combined with the complexity of the golden section search from Lemma 10 and after noting that $\eta_b = \delta_{\mathrm{eval}} / \|\ell\|_{\mathrm{lip}}$, we conclude that Algorithm 2 runs in time
\begin{align*}
&O\Big( \Gamma \cdot K^2 \cdot \max_{k \in [K]} \big( \mathrm{Cost}_{k, \delta_{\mathrm{eval}}/2} \big) \cdot \log\Big( \frac{b}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{1}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{\rho t \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{\|\ell\|_{\mathrm{lip}}}{\eta_\lambda} \Big) \Big) \\
&\quad = O\Big( \Gamma \cdot K^2 \cdot \max_{k \in [K]} \big( \mathrm{Cost}_{k, \delta_{\mathrm{eval}}/2} \big) \cdot \log\Big( \frac{b}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{1}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{\rho t \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big) \cdot \max\Big\{ \log\Big( \frac{L_{\lambda^\star} \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big),\ \log\Big( \frac{1}{r_{\min}} \Big) \Big\} \Big).
\end{align*}
The proof is completed by noting that
\begin{align*}
\frac{1}{t} \sum_{i=1}^t S_i(b^\star_i(\lambda^\star)) - \frac{1}{t} \sum_{i=1}^t S_i(\widehat{b}_i(\widehat{\lambda}))
&\le \frac{1}{t} \sum_{i=1}^t \|\ell\|_{\mathrm{lip}} \big| b^\star_i(\lambda^\star) - \widehat{b}_i(\widehat{\lambda}) \big| \\
&\le \frac{\|\ell\|_{\mathrm{lip}}}{t} \sum_{i=1}^t \Big( \big| b^\star_i(\lambda^\star) - b^\star_i(\widehat{\lambda}) \big| + \big| b^\star_i(\widehat{\lambda}) - \widehat{b}_i(\widehat{\lambda}) \big| \Big) \\
&\le \frac{\|\ell\|_{\mathrm{lip}}}{t} \sum_{i=1}^t \Big( L^{(i)}_{\lambda^\star}\, \eta_\lambda + \eta_b \Big) \ \le\ \|\ell\|_{\mathrm{lip}} \big( L_{\lambda^\star}\, \eta_\lambda + \eta_b \big) \ \le\ 2\delta_{\mathrm{eval}} . \qquad \blacksquare
\end{align*}
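Finally, the overall dual scheme analyzed above can be sketched end to end: an inner golden section search approximates $b^\star_i(\lambda)$ for each $i$, and an outer bisection drives $\rho t - \sum_i \widehat{b}_i(\lambda)$ to zero. The concave surrogates $S_i$ below are hypothetical stand-ins chosen so that the bisection has an interior solution; the sketch mirrors the structure of Algorithm 2, not its exact implementation.

```python
# A minimal end-to-end sketch of the dual bisection analyzed above: for each i,
# an inner golden section search returns an approximate maximizer b_hat_i(lam)
# of S_i(b) - lam * b, and an outer bisection drives rho*t - sum_i b_hat_i(lam)
# to zero. The concave surrogates S_i below are hypothetical stand-ins.
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0

def gss_argmax(f, lo, hi, eta):
    x1, x2 = hi - INV_PHI * (hi - lo), lo + INV_PHI * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > eta:
        if f1 < f2:
            lo, x1, f1 = x1, x2, f2
            x2 = lo + INV_PHI * (hi - lo); f2 = f(x2)
        else:
            hi, x2, f2 = x2, x1, f1
            x1 = hi - INV_PHI * (hi - lo); f1 = f(x1)
    return 0.5 * (lo + hi)

t, rho = 5, 0.8
slopes = [0.5 + 0.2 * i for i in range(t)]          # S_i'(0), all <= ||l||_lip
lip = max(slopes)
S = [lambda b, a=a: a * (1.0 - math.exp(-b)) for a in slopes]  # concave, increasing

def b_hat(i, lam, b_max=20.0, eta=1e-6):
    # approximate b*_i(lam) = argmax_{b >= 0} S_i(b) - lam * b, as in Algorithm 3
    return gss_argmax(lambda b: S[i](b) - lam * b, 0.0, b_max, eta)

lo, hi = 0.0, lip                                    # dual range [0, ||l||_lip]
while hi - lo > 1e-8:
    lam = 0.5 * (lo + hi)
    excess = rho * t - sum(b_hat(i, lam) for i in range(t))
    if excess > 0:
        hi = lam   # budgets too small: decrease lam to spend more
    else:
        lo = lam
lam_hat = 0.5 * (lo + hi)
print(f"lam_hat = {lam_hat:.6f}, total budget = "
      f"{sum(b_hat(i, lam_hat) for i in range(t)):.4f} vs rho*t = {rho * t:.4f}")
```

The sign test in the loop is the decision rule analyzed in (24): since each $b^\star_i(\cdot)$ is non-increasing in $\lambda$, a positive excess indicates that the dual candidate is too large.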
