Wasserstein Distributionally Robust Online Learning


Proceedings of Machine Learning Research vol:1-41, 2026

Wasserstein Distributionally Robust Online Learning

Guixian Chen (GXCHEN@UMICH.EDU), University of Michigan
Salar Fattahi (FATTAHI@UMICH.EDU), University of Michigan
Soroosh Shafiee (SHAFIEE@CORNELL.EDU), Cornell University

Abstract

We study distributionally robust online learning, where a risk-averse learner updates decisions sequentially to guard against worst-case distributions drawn from a Wasserstein ambiguity set centered at past observations. While this paradigm is well understood in the offline setting through Wasserstein Distributionally Robust Optimization (DRO), its online extension poses significant challenges in both convergence and computation. In this paper, we address these challenges. First, we formulate the problem as an online saddle-point stochastic game between a decision maker and an adversary selecting worst-case distributions, and propose a general framework that converges to a robust Nash equilibrium coinciding with the solution of the corresponding offline Wasserstein DRO problem. Second, we address the main computational bottleneck, which is the repeated solution of worst-case expectation problems. For the important class of piecewise concave loss functions, we propose a tailored algorithm that exploits the problem geometry to achieve substantial speedups over state-of-the-art solvers such as Gurobi. The key insight is a novel connection between the worst-case expectation problem (an inherently infinite-dimensional optimization problem) and a classical and tractable budget allocation problem, which is of independent interest.

Keywords: risk-averse online learning, data-driven optimization, Wasserstein uncertainty

1. Introduction

The primary objective of statistical learning is to identify a decision rule $x$ within a feasible set $\mathcal{X} \subseteq \mathbb{R}^n$ that minimizes the expectation of a loss function $\ell : \mathcal{X} \times \Xi \to \mathbb{R}$ with respect to an underlying, unknown data-generating distribution $\mathbb{P}^\star \in \mathcal{P}(\Xi)$, where $\mathcal{P}(\Xi)$ denotes the set of all probability distributions supported on $\Xi \subseteq \mathbb{R}^m$. When $\mathbb{P}^\star$ is inaccessible but a static dataset of $T$ i.i.d. observations $\{\widehat{\xi}_1, \dots, \widehat{\xi}_T\}$ is available, Empirical Risk Minimization (ERM) approximates this goal using the empirical distribution $\widehat{\mathbb{P}}_T := \frac{1}{T}\sum_{t=1}^T \delta_{\widehat{\xi}_t}$, where $\delta_\xi$ denotes the Dirac measure centered at $\xi$. By replacing $\mathbb{P}^\star$ with this plug-in estimator, ERM solves the optimization problem
\[
\inf_{x \in \mathcal{X}} \Big\{ \mathbb{E}_{\xi \sim \widehat{\mathbb{P}}_T}[\ell(x,\xi)] = \frac{1}{T}\sum_{t=1}^T \ell(x, \widehat{\xi}_t) \Big\}.
\]
When observations arrive sequentially, the framework of Online Convex Optimization (OCO) provides efficient algorithms that minimize regret, ensuring convergence to the statistical learning solution as the time horizon $T \to \infty$ (Nemirovski et al., 2009; Shalev-Shwartz, 2012; Hazan, 2022).
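To ground these two estimation paradigms, here is a minimal Python sketch (our own illustration; the data model, feasible set, and step size are assumptions, not taken from the paper) comparing batch ERM with an online subgradient scheme under the absolute loss:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
xi = rng.normal(loc=1.0, scale=0.5, size=T)  # stream of samples from an (unknown) P*

def loss(x, xi):
    # absolute loss ell(x, xi) = |x - xi|: convex and Lipschitz in x
    return np.abs(x - xi)

# Batch ERM: minimize the empirical risk over a grid (minimizer = sample median)
grid = np.linspace(-2, 4, 601)
x_erm = grid[np.array([loss(x, xi).mean() for x in grid]).argmin()]

# Online projected subgradient descent on the same stream, eta_t = 1/sqrt(t)
x_t, avg = 0.0, 0.0
for t, s in enumerate(xi, start=1):
    g = np.sign(x_t - s)                             # subgradient of |x - s| at x_t
    x_t = np.clip(x_t - g / np.sqrt(t), -2.0, 4.0)   # projection onto X = [-2, 4]
    avg += (x_t - avg) / t                           # running average of the iterates

print(f"ERM solution ~ {x_erm:.3f}, online averaged iterate ~ {avg:.3f}")
```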
Despite its theoretical foundations, this standard statistical learning framework suffers from fundamental limitations. First, by relying solely on the expectation as a risk measure, it overlooks higher-order variations, failing to account for the risk sensitivity required in safety-critical applications. Second, the framework is notoriously brittle to data corruption during training, where measurement noise (Nettleton et al., 2010) or adversarial manipulation (Nietert et al., 2023, 2024) can severely degrade the learned model. Finally, it assumes the testing distribution perfectly matches the training distribution, causing performance to collapse under adversarial distribution shifts (Yang et al., 2024) or test-time corruption (Kurakin et al., 2017; Goodfellow et al., 2015).

Wasserstein Distributionally Robust Optimization (DRO) addresses these challenges in a unified manner. By defining an ambiguity set based on the Wasserstein distance, which captures the underlying geometry of the sample space, DRO effectively models geometric corruptions. Furthermore, it inherently regularizes the model against local perturbations, acting as a penalty on the Lipschitz constant or gradient variation of the loss. Formally, for a fixed $p \in [1,\infty)$, we consider a distributional ambiguity set centered around a reference distribution $\mathbb{P} \in \mathcal{P}(\Xi)$, defined as
\[
\mathbb{B}^p_\rho(\mathbb{P}) := \big\{ \mathbb{Q} \in \mathcal{P}(\Xi) : W_p^p(\mathbb{Q}, \mathbb{P}) \le \rho \big\},
\]
where $\rho > 0$ is the ambiguity radius and $W_p(\mathbb{P},\mathbb{Q})$ denotes the $p$-Wasserstein distance, defined as
\[
W_p(\mathbb{P},\mathbb{Q}) := \inf_{\pi \in \Pi(\mathbb{P},\mathbb{Q})} \big( \mathbb{E}_{(\xi,\xi') \sim \pi}[\|\xi - \xi'\|^p] \big)^{1/p}.
\]
Here, $\Pi(\mathbb{P},\mathbb{Q}) := \{\pi \in \mathcal{P}(\Xi^2) : \pi(\cdot \times \Xi) = \mathbb{P},\ \pi(\Xi \times \cdot) = \mathbb{Q}\}$ represents the set of all couplings with marginals $\mathbb{P}$ and $\mathbb{Q}$. Ideally, a risk-averse learner aims to solve the minimax problem centered at the true distribution $\mathbb{P}^\star$:
\[
\inf_{x \in \mathcal{X}} \sup_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P}^\star)} \mathbb{E}_{\xi \sim \mathbb{Q}}[\ell(x,\xi)]. \tag{1}
\]
We note that when $\rho = 0$, the ambiguity set collapses to a singleton, and the problem reduces to the standard statistical learning framework. Since $\mathbb{P}^\star$ is unknown, this problem is typically solved in the offline setting using a data-driven approximation. Specifically, given a static dataset of $T$ i.i.d. observations, the standard data-driven DRO approach (Mohajerin Esfahani and Kuhn, 2018) proceeds by substituting $\mathbb{P}^\star$ with $\widehat{\mathbb{P}}_T$ in (1) and solving the resulting optimization problem.

However, many modern applications operate in dynamic environments where data is not available as a static batch but arrives sequentially. In settings such as online recommendation systems (Bai et al., 2019; Wen et al., 2022) and real-time financial portfolio management (Costa and Iyengar, 2023), the learner must adapt to streaming data in real time. In these scenarios, waiting to accumulate a large dataset to solve a static DRO problem is computationally prohibitive and fails to capture temporal shifts. This necessitates algorithms that can learn sequentially while strictly controlling the risk of worst-case outcomes, motivating the central question of this paper:

How can we design efficient algorithms that learn sequentially from streaming data while remaining robust to worst-case distributions?

1.1. Summary of Contributions

We formulate the DRO problem (1) as an online zero-sum stochastic game. At each iteration $t$, the environment reveals a sample $\widehat{\xi}_t$ drawn from $\mathbb{P}^\star$. Simultaneously, the dual player (adversary) selects a worst-case distribution $\mathbb{Q}_t$ from a Wasserstein ambiguity set centered at the historical observations, while the primal player (decision maker) selects a decision $x_t$. The primal player then incurs the expected loss with respect to the dual player's chosen distribution, $\mathbb{E}_{\xi \sim \mathbb{Q}_t}[\ell(x_t,\xi)]$. Our objective is to design an online algorithm that competes against the offline, risk-averse benchmark defined in (1).
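For intuition about the dual player's feasible set, the following one-dimensional sketch (our own illustration, with an assumed radius and data model) checks whether a candidate distribution lies in the ball $\mathbb{B}^1_\rho(\widehat{\mathbb{P}}_t)$ using SciPy's univariate $W_1$ routine, scipy.stats.wasserstein_distance:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
history = rng.normal(0.0, 1.0, size=50)   # atoms of the empirical center P_hat_t
candidate = history + 0.3                 # a shifted candidate distribution Q
RHO = 0.5                                 # illustrative ambiguity radius (assumption)

# For p = 1 the ball constraint reads W_1(Q, P_hat_t) <= rho; in one dimension
# SciPy evaluates W_1 directly from the two empirical samples.
w1 = wasserstein_distance(candidate, history)
print(f"W1(Q, P_hat_t) = {w1:.3f}; inside the ball? {w1 <= RHO}")
```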
Solving the minimax problem (1) in an online fashion presents unique challenges that distinguish it from the application of standard OCO to saddle-point problems (Orabona, 2019, §12). Unlike typical min-max games where the dual variable lies in a fixed, finite-dimensional space, our maximization occurs over the space of probability measures, which is infinite-dimensional. Furthermore, the problem is inherently non-stationary and stochastic. That is, the dual player does not have access to the full ambiguity set $\mathbb{B}^p_\rho(\mathbb{P}^\star)$, but only observes a single sample $\widehat{\xi}_t$ at each step. Consequently, the immediate ambiguity set changes dynamically with every iteration as the center of the Wasserstein ball shifts based on the incoming data stream. In this work, we make equal contributions to both the theoretical foundations and the computational practicality of this field:

⋄ Novel Risk-Averse Framework: We propose a theoretical framework for online learning against Wasserstein uncertainty. We formulate the learning problem as an online saddle-point optimization between a primal player, responsible for updating the decision rule $x$, and a dual player that selects worst-case distributions within a Wasserstein ambiguity set. We show that the resulting online dynamics converge to a robust Nash equilibrium that coincides with the solution of the corresponding offline Wasserstein DRO problem. This provides theoretical guarantees for learning decisions that control tail risk and prevent large losses under admissible distributional perturbations.

⋄ Efficient Computation: To overcome the computational bottleneck of the inner maximization, we develop specialized and highly efficient algorithms for computing the worst-case expectation. Focusing on the important class of piecewise concave loss functions, our method achieves a $\delta$-optimal solution in $O(\mathrm{polylog}(1/\delta))$ iterations. The efficiency of the proposed approach arises from a novel connection between the worst-case expectation problem (an inherently infinite-dimensional optimization problem) and a classical budget allocation problem, a connection we believe is of independent interest.

1.2. Related Works

Wasserstein DRO. The Wasserstein metric provides a natural framework for modeling geometric uncertainty and data corruption by capturing the underlying geometry of the input space. To enable practical implementation, convex duality results have recently been developed to make Wasserstein DRO computationally efficient (Mohajerin Esfahani and Kuhn, 2018; Blanchet and Murthy, 2019; Gao and Kleywegt, 2023). The empirical success of these methods is often attributed to their theoretical connections with variation-based (Gao et al., 2024; Shafiee et al., 2025) and Lipschitz-based (Blanchet et al., 2019; Shafieezadeh-Abadeh et al., 2019) regularization. Furthermore, Wasserstein DRO offers strong generalization guarantees derived from measure concentration and transport inequality arguments (Mohajerin Esfahani and Kuhn, 2018; An and Gao, 2021; Gao, 2023). Despite these strengths, existing approaches rely on offline processing of the full training dataset, a limitation this paper addresses by developing an online framework for sequential data.

Online Wasserstein DRO. The setting of online Wasserstein DRO remains largely unexplored.
Yerenburg (2021) provides the first solution by dualizing the inner maximization to formulate a single minimization problem, which is then solved via online mirror descent. However, the reformulation approach requires strong oracles that solve a potentially nonconvex problem to find the worst-case perturbation. Furthermore, their analysis mandates that the ambiguity set size vanish as $T \to \infty$. Similarly, Wang et al. (2025) utilize online clustering for data compression but still require solving a full Wasserstein DRO problem on the compressed data at every iteration. In contrast to these methods, we propose a primal-dual algorithm that avoids repeatedly resolving the full optimization problem. Our approach efficiently identifies the worst-case distribution at each step and applies a first-order update to the primal decision.

Online Algorithms for Robust Optimization. OCO techniques have recently been adapted to robust optimization by casting such problems as semi-infinite programs. Seminal work by Ben-Tal et al. (2015) and follow-ups by Ho-Nguyen and Kılınç-Karzan (2018, 2019) reduce the problem to repeated robust feasibility checks via regret-minimizing algorithms, while more recent approaches (Postek and Shtern, 2024; Tu et al., 2024) avoid bisection through perspective reformulations or Lagrangian relaxations. These ideas extend naturally to DRO with specific ambiguity sets. For example, Namkoong and Duchi (2016) and Aigner et al. (2023) use primal-dual updates for $f$-divergence sets with discrete support, while dualizing the inner maximization enables direct online minimization for $f$-divergence sets (Qi et al., 2021) and Wasserstein sets (Yerenburg, 2021). Closely related is the prediction-with-expert-advice framework, including the Weighted Average and Aggregating Algorithms (Kivinen and Warmuth, 1999; Vovk, 1990), which can be viewed as Follow-the-Regularized-Leader on a probability simplex with negative entropy regularization. Distinct from these approaches, our work considers Wasserstein ambiguity sets without discreteness assumptions and solves the problem in a fully online manner, naturally interpreted as a dynamic game between the learner and a Wasserstein-constrained adversary.

Adversarial Training and Domain Shift. Adversarial training was originally proposed to reduce the sensitivity of machine learning models to small, carefully crafted noise (Goodfellow et al., 2015; Kurakin et al., 2017). This defense strategy can be rigorously reformulated as a robust optimization problem with box uncertainty, which is mathematically equivalent to a Wasserstein DRO problem using an $\infty$-Wasserstein ambiguity set (Gao et al., 2024). Sinha et al. (2017) extended this formulation to the general $p$-Wasserstein setting, establishing a framework that naturally accommodates adversarial domain shifts where the test distribution differs from the training distribution via bounded adversarial corruption. Wasserstein DRO offers provable robust generalization guarantees when facing such shifts (Lee and Raginsky, 2018; Tu et al., 2019; Wang et al., 2019; Kwon et al., 2020; Volpi et al., 2018). By solving the Wasserstein DRO problem in a fully online fashion, our approach is naturally suited to this adversarial setting.
Risk-Averse Online Learning. A classic result by Artzner et al. (1999) establishes that any coherent risk measure can be dually represented as a DRO problem over a specific ambiguity set. In optimal control, risk sensitivity is traditionally modeled using the entropic risk measure, particularly within the linear-exponential-Gaussian framework (Jacobson, 1973; Whittle, 1990). In reinforcement learning, this perspective has expanded to include objectives based on the Conditional Value-at-Risk (Chow and Ghavamzadeh, 2014; Hau et al., 2023) and exponential utility functions (Borkar, 2002). Similarly, the multi-armed bandit literature has addressed risk sensitivity through mean-variance criteria (Sani et al., 2012; Vakili and Zhao, 2016) and CVaR-based exploration (Galichet et al., 2013). While these approaches typically rely on specific functional forms of risk or $f$-divergence ambiguity sets, the notion of risk sensitivity in our work is geometrically induced by the Wasserstein ambiguity set.

Distributionally Robust Regret Optimization. A related paradigm is Distributionally Robust Regret Optimization (DRRO). In this setting, the minimax objective in (1) is modified to minimize the worst-case regret or excess risk, defined as the difference between the loss and the optimal loss under the worst-case distribution, rather than the worst-case expected loss itself. While DRRO achieves statistical minimax optimality under distributional shifts (Agarwal and Zhang, 2022), it introduces significant computational challenges. Recent work has addressed these issues for Wasserstein ambiguity sets (Chen and Xie, 2021; Bitar, 2024; Fiechtner and Blanchet, 2025; Xue and Rujeerapaiboon, 2025). We emphasize that although our analysis employs cumulative regret as a performance metric, our objective remains minimizing the robust loss, which differs from the minimax regret formulation studied in the DRRO literature.

1.3. Notation and Outline

Let $\|\cdot\|$ denote the Euclidean norm. The set of positive integers up to $n \in \mathbb{N}$ is denoted by $[n]$. We write $\mathcal{P}(\Xi)$ for the family of Borel probability measures on $\Xi \subseteq \mathbb{R}^m$, equipped with the $p$-Wasserstein distance, where $p \in [1,\infty)$. We write $\mathbb{E}_{\mathbb{P}}[\ell(x,\xi)]$ for the expectation of $\ell(x,\xi)$ with respect to $\xi \sim \mathbb{P}$; when clear from the context, the parameter and the random variable are dropped and we write $\mathbb{E}_{\mathbb{P}}[\ell]$. We write $\Pi_{\mathcal{X}}$ for the projection operator onto a closed and convex set $\mathcal{X}$. Let $\partial f(x)$ denote the subdifferential of $f$ at $x$ if $f$ is convex, or the superdifferential if $f$ is concave. When clear from the context, $\partial f(x)$ may also refer to a specific subgradient or supergradient. For $p \in [1,\infty)$, the $p$-th order homogeneous Sobolev (semi)norm of a continuously differentiable $f : \Xi \to \mathbb{R}$ with respect to $\mathbb{P}$ is $\|f\|_{\dot{H}^{1,p}(\mathbb{P})} = (\mathbb{E}_{\xi\sim\mathbb{P}}[\|\nabla f(\xi)\|^p])^{1/p}$. The Lipschitz constant of a Lipschitz continuous $f : \Xi \to \mathbb{R}$ is $\|f\|_{\mathrm{lip}}$.

The remainder of the paper is organized as follows. Section 2 introduces the problem setup and assumptions. In Section 3, we analyze the convergence of the proposed algorithm, assuming access to an oracle for the (inner) worst-case expectation problem. Section 4 presents an efficient algorithm that implements this oracle. All proofs and technical details are deferred to the appendix.

2. Problem Setup

In this section, we formalize the structural assumptions required for our analysis.
Assumption 1 (Regularity). The feasible region $\mathcal{X} \subseteq \mathbb{R}^n$ is nonempty, convex, and compact, with diameter $D_{\mathcal{X}}$. The support $\Xi \subseteq \mathbb{R}^m$ of the random variable is nonempty, closed, and convex. For any fixed $\xi \in \Xi$, the loss function $\ell(\cdot,\xi)$ is real-valued, convex, and Lipschitz continuous with Lipschitz constant $G_{\mathcal{X}} > 0$. For any $x \in \mathcal{X}$, there exists a constant $g > 0$ such that $\ell(x,\xi) \le g(1 + \|\xi\|^p)$ for all $\xi \in \Xi$.

Assumption 1 ensures that the optimization problem
\[
\inf_{x \in \mathcal{X}} \sup_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} \big\{ f(x,\mathbb{Q}) := \mathbb{E}_{\xi \sim \mathbb{Q}}[\ell(x,\xi)] \big\} \tag{2}
\]
has a finite optimal value (Gao and Kleywegt, 2023, Theorem 1). By linearity of the expectation, $f(x,\mathbb{Q})$ is affine in $\mathbb{Q}$. Hence, it inherits convexity in $x$ from the loss function $\ell$. Moreover, the Lipschitz continuity of $\ell$ also extends to its expectation:
\[
|f(x_1,\mathbb{Q}) - f(x_2,\mathbb{Q})| \le \mathbb{E}_{\xi \sim \mathbb{Q}}[|\ell(x_1,\xi) - \ell(x_2,\xi)|] \le G_{\mathcal{X}}\|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathcal{X}.
\]
This structural regularity allows us to establish strong duality for (2), implying that the order of the sup and inf operators can be interchanged without affecting the optimal value. However, strong duality alone does not guarantee the existence of a finite optimal solution to (2), as the supremum or infimum may fail to be attained. To ensure the existence of a solution pair $(x^\star, \mathbb{Q}^\star)$ for (2) (also known as a saddle point), additional assumptions are required; see, for example, (Shafiee et al., 2025, Theorem 1). One such assumption that guarantees the existence of a saddle point is the piecewise structure of the loss function proposed by Mohajerin Esfahani and Kuhn (2018).

Assumption 2 (Piecewise Structure). The loss function is $\ell(x,\xi) := \max_{k \in [K]} \ell_k(x,\xi)$, where for every fixed $x \in \mathcal{X}$ and $k \in [K]$, the function $\ell_k : \mathcal{X} \times \Xi \to \mathbb{R}$ is concave and differentiable.

As we will see in the next lemma, the above assumption guarantees the existence of a saddle point for (2), allowing us to replace the inf and sup operators with min and max, respectively, which is a requirement for any online saddle-point algorithm. Beyond this, the assumption offers an additional benefit: it enables (2) to admit a finite-dimensional reformulation (Mohajerin Esfahani and Kuhn, 2018, Theorem 4.2). We exploit this key property to design an efficient algorithm for solving the inner worst-case expectation problem. Notably, the assumed piecewise structure already encompasses a broad class of robust regression and classification models and is particularly attractive since any smooth function can be approximated arbitrarily well by piecewise linear functions; a concrete instance is sketched below.
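As an illustration (our own example, not one from the paper), a newsvendor-style loss $\ell(x,\xi) = \max\{h(x-\xi),\, b(\xi-x)\}$ satisfies Assumption 2 with $K = 2$ affine, hence concave and differentiable, pieces in $\xi$:

```python
import numpy as np

# Newsvendor-style loss: ell(x, xi) = max{ h*(x - xi), b*(xi - x) },
# a max of K = 2 pieces, each affine (so concave and differentiable) in xi.
H_COST, B_COST = 1.0, 3.0  # illustrative holding/backorder costs (assumptions)

def piece_1(x, xi):
    return H_COST * (x - xi)   # overage piece, affine in xi

def piece_2(x, xi):
    return B_COST * (xi - x)   # underage piece, affine in xi

def loss(x, xi):
    # ell(x, .) is convex in x and a max of concave pieces in xi (Assumption 2)
    return np.maximum(piece_1(x, xi), piece_2(x, xi))

xi_grid = np.linspace(-1.0, 3.0, 5)
print(loss(1.0, xi_grid))  # evaluates the max of the two pieces on a grid
```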
Lemma 1 (Existence of Saddle Point). Suppose Assumptions 1 and 2 hold. Moreover, if $p = 1$, suppose in addition that either $\Xi$ is compact or there exists a constant $g > 0$ such that $\ell(x,\xi) \le g(1 + \|\xi\|^r)$ for all $\xi \in \Xi$ and some $r \in (0,1)$. Then,
\[
\min_{x \in \mathcal{X}} \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x,\mathbb{Q}) = \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} \min_{x \in \mathcal{X}} f(x,\mathbb{Q}).
\]

We emphasize that, when $p > 1$, no additional assumptions beyond Assumptions 1 and 2 are required to guarantee the existence of a saddle point. In contrast, the case $p = 1$ requires more careful analysis, since the worst-case distribution may assign an asymptotically vanishing amount of probability mass to points escaping to infinity along directions in the recession cone of $\Xi$. The additional assumption that $\Xi$ is compact rules out this behavior, as its recession cone reduces to the singleton $\{0\}$. Alternatively, the restriction on the growth condition ensures that sending mass to infinity is never optimal for the worst-case distribution.

Next, we assume that the underlying data-generating distribution $\mathbb{P}^\star$ is well-behaved.

Assumption 3 (Light-tailed Distribution). The underlying data-generating distribution $\mathbb{P}^\star$ is light-tailed; that is, there exists $a > p \ge 1$ such that $\mathbb{E}_{\xi\sim\mathbb{P}^\star}[\exp(\|\xi\|^a)] < \infty$.

We note that Assumption 3 is primarily utilized to leverage the convergence rates of the empirical distribution $\widehat{\mathbb{P}}_t$ to $\mathbb{P}^\star$ in the Wasserstein metric, which serves only for deriving an end-to-end result on the convergence rate of our online algorithm in the next section. Our main algorithm relies on the following computational oracle to identify the worst-case distribution.

Algorithm 1: Online Distributional Best-Response Algorithm
  Input: initial decision $x_1 \in \mathcal{X}$, step sizes $\eta_t > 0$
  for $t = 1, 2, \dots$ do
    // 1. Distributional best-response step: query the oracle for the worst-case distribution
    $\mathbb{Q}_t \leftarrow \mathcal{O}_W(x_t, \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t))$
    // 2. Learning step: update the decision via the projected subgradient method
    $x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t \partial_x f(x_t,\mathbb{Q}_t))$
    // 3. Aggregation step: maintain the average of the decisions
    $\bar{x}_{t+1} = \frac{1}{t+1}\sum_{i=1}^{t+1} x_i = \frac{t}{t+1}\bar{x}_t + \frac{1}{t+1}x_{t+1}$
  end

Assumption 4 (Wasserstein Oracle). For any $\delta > 0$, there exists a Wasserstein oracle $\mathcal{O}_W(x, \mathbb{B}^p_\rho(\mathbb{P}))$ that returns a distribution $\mathbb{Q}^\star \in \mathbb{B}^p_\rho(\mathbb{P})$ satisfying $f(x,\mathbb{Q}^\star) \ge \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P})} f(x,\mathbb{Q}) - \delta$.

While we initially assume the existence of such an oracle, we dedicate Section 4 to opening this "black box" and providing efficient implementations under mild conditions. The resulting procedure, which we call the online distributional best-response algorithm, is detailed in Algorithm 1. At each iteration $t$, given the current primal decision $x_t$, the dual player queries the oracle $\mathcal{O}_W(x_t, \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t))$ to compute its best response, which corresponds to a solution of the inner worst-case expectation problem. The primal player then updates its decision using a single iteration of the projected subgradient method, $x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta_t \partial_x f(x_t,\mathbb{Q}_t))$, where $\eta_t$ is the step size and $\Pi_{\mathcal{X}}$ denotes the projection onto the feasible set $\mathcal{X}$.

This simple procedure is inspired by the best-response framework of Orabona (2019, Algorithm 12.2). However, unlike classical methods that operate on fixed dual feasible sets, our approach must contend with a non-stationary dual environment. Namely, the dual player has access only to the partial ambiguity set $\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)$, which evolves toward the offline benchmark $\mathbb{B}^p_\rho(\mathbb{P}^\star)$ as the data stream unfolds. This shifting landscape makes a best-response strategy for the dual player essential and unavoidable. In a standard simultaneous primal-dual update, the dual step could easily fall outside the currently valid (and shifting) Wasserstein ball, yielding updates that are either infeasible or insufficiently robust. By freezing the dual player's best response $\mathbb{Q}_t$ against the current decision $x_t$, we ensure the learner remains resilient to the most damaging distributional shift currently admissible.
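A minimal Python sketch of Algorithm 1's loop for the one-dimensional newsvendor loss above, with a crude enumeration stand-in for the oracle $\mathcal{O}_W$ (the radius, data model, and oracle implementation are illustrative assumptions of ours, not the paper's method):

```python
import numpy as np

H_COST, B_COST = 1.0, 3.0
RHO = 0.2  # illustrative ambiguity radius (assumption)

def loss(x, xi):
    return np.maximum(H_COST * (x - xi), B_COST * (xi - x))

def oracle_worst_case(x, data, rho, n_dirs=41):
    """Crude stand-in for O_W with p = 1: spend the whole transport budget
    t*rho moving a single atom, and keep the best such perturbation."""
    t = len(data)
    best_val, best_data = loss(x, data).mean(), data
    for i in range(t):
        for step in np.linspace(-t * rho, t * rho, n_dirs):
            cand = data.copy()
            cand[i] += step  # total transport cost |step| <= t*rho
            val = loss(x, cand).mean()
            if val > best_val:
                best_val, best_data = val, cand
    return best_data  # atoms of the (approximate) worst-case Q_t

rng = np.random.default_rng(2)
x_t, avg, stream = 0.0, 0.0, rng.normal(1.0, 0.5, size=100)
for t, xi_t in enumerate(stream, start=1):
    worst = oracle_worst_case(x_t, stream[:t], RHO)     # best-response step
    g = np.where(x_t >= worst, H_COST, -B_COST).mean()  # subgradient of f(., Q_t)
    x_t = np.clip(x_t - g / np.sqrt(t), -5.0, 5.0)      # projected subgradient step
    avg += (x_t - avg) / t                              # aggregation step
print(f"averaged robust decision ~ {avg:.3f}")
```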
For simplicity, we employ the projected subgradient method to update the primal variable, although alternative approaches, including projection-free Frank-Wolfe-type algorithms (Garber and Hazan, 2015), can be used without loss of generality.

3. Convergence Analysis

Before proceeding to the formal analysis, we define the learning objective within this online setting and establish the criteria for evaluating the performance of Algorithm 1. Let $x^\star$ denote an optimal solution to (1). Our goal is to ensure that the robust estimate $\bar{x}_t$ produced by Algorithm 1 converges to $x^\star$ according to a well-defined metric. We evaluate this convergence rate using the primal suboptimality gap, a standard performance measure in minimax optimization when the focus is on the primal decision variable. Specifically, we seek to derive an upper bound on the quantity
\[
\mathrm{Gap}(\bar{x}_T) := \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P}^\star)} f(\bar{x}_T,\mathbb{Q}) - \min_{x \in \mathcal{X}} \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x,\mathbb{Q}).
\]
The following lemma plays a key role in controlling this suboptimality gap.

Lemma 2. The suboptimality gap of the averaged iterate $\bar{x}_T$ admits the following upper bound:
\[
\mathrm{Gap}(\bar{x}_T) \le \frac{1}{T}\sum_{t=1}^T \big( f(x_t,\mathbb{Q}_t) - f(x^\star,\mathbb{Q}_t) \big)
+ \frac{1}{T}\sum_{t=1}^T \Big( \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x_t,\mathbb{Q}) - \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x_t,\mathbb{Q}) \Big)
+ \frac{1}{T}\sum_{t=1}^T \Big( \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x^\star,\mathbb{Q}) - \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x^\star,\mathbb{Q}) \Big) + \delta.
\]

According to this lemma, the suboptimality gap can be decomposed into four distinct components. The first component characterizes the regret of the learning step in Algorithm 1, evaluated on the sequence of functions $\{f(\cdot,\mathbb{Q}_t)\}_{t=1}^T$. The second and third components quantify the sensitivity of the worst-case expectations with respect to two Wasserstein balls, one centered at $\widehat{\mathbb{P}}_t$ and the other at $\mathbb{P}^\star$. Finally, the fourth component accounts for the error incurred by the Wasserstein oracle. Among these, the first component is the most straightforward to control, as it reduces to a standard regret analysis of the online projected subgradient method, which we present next.

Lemma 3. Under Assumption 1, for any $T \ge 1$, the sequence $\{(x_t,\mathbb{Q}_t)\}_{t=1}^T$ generated by Algorithm 1 with the step size $\eta_t = \frac{D_{\mathcal{X}}}{G_{\mathcal{X}}\sqrt{t}}$ satisfies
\[
\frac{1}{T}\sum_{t=1}^T \big( f(x_t,\mathbb{Q}_t) - f(x^\star,\mathbb{Q}_t) \big) \le \frac{G_{\mathcal{X}} D_{\mathcal{X}}(1 + \log T)}{\sqrt{T}}.
\]

Unlike the first component, controlling the second and third components in Lemma 2 is more delicate. In particular, although $\widehat{\mathbb{P}}_t \to \mathbb{P}^\star$, it is not immediate how this convergence translates into convergence of the corresponding worst-case expectations over Wasserstein balls centered at $\widehat{\mathbb{P}}_t$ and $\mathbb{P}^\star$. The following lemma characterizes this effect.

Lemma 4. Under Assumptions 1 and 2, for any $x \in \mathcal{X}$ and $t \ge 1$, we have
\[
\Big| \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x,\mathbb{Q}) - \max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x,\mathbb{Q}) \Big| \le
\begin{cases}
\|\ell(x,\cdot)\|_{\mathrm{lip}}\, W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) & \text{if } p = 1,\\
\|\ell(x,\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}^\star)}\, W_p(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) & \text{if } p > 1,
\end{cases}
\]
where $q > 1$ is the conjugate exponent satisfying $\frac{1}{p} + \frac{1}{q} = 1$ when $p > 1$.

We emphasize that the implication of the above lemma is far from trivial and is perhaps surprising: although the worst-case expectations are taken over Wasserstein balls $\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)$ and $\mathbb{B}^p_\rho(\mathbb{P}^\star)$ with radius $\rho > 0$, the resulting bound is independent of $\rho$ and depends solely on the $p$-Wasserstein distance between $\widehat{\mathbb{P}}_t$ and $\mathbb{P}^\star$.
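As a quick numeric check of this $\rho$-independence (our own illustration, under assumed data): for the 1-Lipschitz loss $\ell(\xi) = |\xi|$ and $p = 1$, a standard duality computation gives $\sup_{\mathbb{Q}\in\mathbb{B}^1_\rho(\mathbb{P})}\mathbb{E}_{\mathbb{Q}}|\xi| = \mathbb{E}_{\mathbb{P}}|\xi| + \rho$, so the gap between the two worst-case values cancels $\rho$ exactly and is bounded by $W_1$ via Kantorovich-Rubinstein:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
p_atoms = rng.normal(0.0, 1.0, size=2000)   # stand-in for P*
q_atoms = p_atoms[:200]                     # stand-in for an empirical P_hat_t

def worst_case_abs(atoms, rho):
    # sup over the W1 ball of E|xi| for ell(xi) = |xi|: closed form E|xi| + rho
    return np.abs(atoms).mean() + rho

w1 = wasserstein_distance(p_atoms, q_atoms)
for rho in (0.1, 1.0, 10.0):
    gap = abs(worst_case_abs(p_atoms, rho) - worst_case_abs(q_atoms, rho))
    # Lemma 4's bound: gap <= ||ell||_lip * W1 = W1, uniformly in rho
    print(f"rho={rho:5.1f}  gap={gap:.4f}  <=  W1={w1:.4f}: {gap <= w1 + 1e-12}")
```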
Leveraging this key lemma, we are able to relax a restrictive assumption in Yerenburg (2021), which requires $\rho \to 0$ as $T \to \infty$. By combining Lemmas 3 and 4 with Lemma 2, we can establish the convergence of Algorithm 1.

Theorem 5. Under Assumptions 1, 2, and 4, consider the averaged iterate $\bar{x}_T$ generated by Algorithm 1 with step size $\eta_t = \frac{D_{\mathcal{X}}}{G_{\mathcal{X}}\sqrt{t}}$. The following guarantees hold.

(i) Case $p = 1$. Suppose that either $\Xi$ is compact, or there exists a constant $g > 0$ such that $\ell(x,\xi) \le g(1 + \|\xi\|^r)$ for all $\xi \in \Xi$ and some $r \in (0,1)$. Then,
\[
\mathrm{Gap}(\bar{x}_T) \le \frac{G_{\mathcal{X}} D_{\mathcal{X}}(1+\log T)}{\sqrt{T}} + \frac{1}{T}\sum_{t=1}^T \big( \|\ell(x_t,\cdot)\|_{\mathrm{lip}} + \|\ell(x^\star,\cdot)\|_{\mathrm{lip}} \big) W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) + \delta.
\]

(ii) Case $p > 1$. Let $q > 1$ denote the conjugate exponent satisfying $\frac{1}{p} + \frac{1}{q} = 1$. Then,
\[
\mathrm{Gap}(\bar{x}_T) \le \frac{G_{\mathcal{X}} D_{\mathcal{X}}(1+\log T)}{\sqrt{T}} + \frac{1}{T}\sum_{t=1}^T \big( \|\ell(x_t,\cdot)\|_{\dot{H}^{1,q}(\widehat{\mathbb{P}}_t)} + \|\ell(x^\star,\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}^\star)} \big) W_p(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) + \delta.
\]

The results above show that the suboptimality gap of the final output is governed by three main factors: (i) the regret incurred by the online projected subgradient method, (ii) the weighted average of the $p$-Wasserstein distances between the empirical measures and the true data-generating distribution, and (iii) the error tolerance of the Wasserstein oracle. When the data-generating distribution $\mathbb{P}^\star$ is additionally light-tailed, that is, it satisfies Assumption 3, the following corollary can be established.

Corollary 6. Suppose that Assumption 4 is satisfied with $\delta = \big( \frac{1}{T}\log\frac{T}{\tau} \big)^{\min\{\frac{p}{m},\frac{1}{2}\}}$. Under the conditions of Theorem 5 and Assumption 3, for any $p \ge 1$ and any $\tau \in (0,1)$, when $T \ge C_1 \log\frac{T}{\tau}$, with probability at least $1-\tau$ it holds that
\[
\mathrm{Gap}(\bar{x}_T) \le C_2 \Big( \frac{1}{T}\log\frac{T}{\tau} \Big)^{\min\{\frac{p}{m},\frac{1}{2}\}}.
\]
Here, $C_1, C_2$ are constants depending on $m$, $p$, $a$, $G_{\mathcal{X}}$, $D_{\mathcal{X}}$, $\|\ell(x^\star,\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}^\star)}$, the exponential moments of $\mathbb{P}^\star$, and uniform bounds on $\max_{x\in\mathcal{X}}\{\|\ell(x,\cdot)\|_{\mathrm{lip}}\}$ and $\max_{x\in\mathcal{X}}\{\|\ell(x,\cdot)\|_{\dot{H}^{1,q}(\widehat{\mathbb{P}}_t)}\}$.

The established bound highlights the interplay between the dimensionality of the uncertainty set and the convergence rate of the proposed online algorithm. In particular, it confirms that despite operating on streaming data, the online framework preserves the statistical guarantees dictated by measure concentration, consistent with the offline DRO results of Mohajerin Esfahani and Kuhn (2018). We also note that the corollary assumes access to a Wasserstein oracle with a prescribed error, whose efficient implementation is discussed in the next section.

4. Wasserstein Oracle

In this section, we develop an efficient implementation of the Wasserstein oracle introduced in Assumption 4. Our goal is to devise an algorithm that efficiently realizes the distributional best-response step of Algorithm 1. Specifically, for each iteration $t = 1, 2, \dots, T$, the algorithm returns a $\delta$-accurate solution to the following infinite-dimensional worst-case expectation problem:
\[
\max_{\mathbb{Q} \in \mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} \mathbb{E}_{\xi\sim\mathbb{Q}}[\ell(x_t,\xi)]. \tag{3}
\]
To streamline the presentation, we henceforth omit the explicit dependence of the loss function on $x_t$ and write $\ell(\xi)$ in place of $\ell(x_t,\xi)$. Under Assumptions 1 and 2, problem (3) admits a finite-dimensional convex reformulation (Kuhn et al., 2019).
While this reformulation is an important step toward establishing the tractability of (3), it ultimately relies on generic off-the-shelf solvers. Such solvers fail to exploit the intrinsic structure of the resulting optimization problem. In this section, we uncover this useful structure and show how it can be exploited to solve (3) more efficiently. To streamline our analysis, we focus only on the 1-Wasserstein metric (which already covers several relevant applications; see Mohajerin Esfahani and Kuhn (2018)) and make the following simplifying assumption.

Assumption 5. The support of the random samples is $\Xi = \mathbb{R}^m$, and $p = 1$. There exists a constant $g > 0$ such that $\ell(x,\xi) \le g(1 + \|\xi\|^r)$ for all $\xi \in \Xi$ and some $r \in (0,1)$.

The sublinear growth assumption ensures solvability of the dual Wasserstein DRO problem and yields an explicit characterization of the worst-case distribution. In contrast, when $\ell$ exhibits linear or superlinear growth, the worst-case distribution may not be attained. At the core of our proposed algorithm lies the following key reformulation of (3).

Theorem 7. Under Assumptions 2 and 5, problem (3) is equivalent to
\[
\max_{b_1,\dots,b_t} \Big\{ \frac{1}{t}\sum_{i=1}^t S_i(b_i) \;\;\text{s.t.}\;\; \sum_{i=1}^t b_i \le \rho t, \;\; b_1,\dots,b_t \ge 0 \Big\}, \tag{4}
\]
where, for every $i \in [t]$ and $b \ge 0$, $S_i(b)$ is defined as
\[
S_i(b) := \max_{1\le k_1<k_2\le K}\ \max_{\alpha_1,\alpha_2,v_1,v_2} \Big\{ \alpha_1\ell_{k_1}(\widehat{\xi}_i - v_1) + \alpha_2\ell_{k_2}(\widehat{\xi}_i - v_2) \;\text{s.t.}\; \alpha_1\|v_1\| + \alpha_2\|v_2\| \le b, \; \alpha_1 + \alpha_2 = 1, \; \alpha_1,\alpha_2 \ge 0 \Big\}.
\]
Problem (4) is a classical budget allocation problem: a shared budget $\rho t$ is distributed across $t$ concave value functions $S_i$ (concavity is established in Lemma 21). Dualizing the budget constraint with a multiplier $\lambda$ decouples (4) into $t$ one-dimensional problems of the form $\max_{b \ge 0} S_i(b) - \lambda b$; Algorithms 2 and 3 summarize the resulting scheme.

Algorithm 2: Bisection search over the dual variable $\lambda$
  Input: subproblem functions $\widehat{S}_i(\cdot)$, $i \in [t]$, minimum interval length $\eta_\lambda > 0$
  Initialize: search bounds $\lambda_{\mathrm{low}} = 0$ and $\lambda_{\mathrm{high}} = \|\ell\|_{\mathrm{lip}}$
  while $\lambda_{\mathrm{high}} - \lambda_{\mathrm{low}} > \eta_\lambda$ do
    // 1. Dual variable assignment
    $\lambda_{\mathrm{mid}} \leftarrow \frac{\lambda_{\mathrm{low}} + \lambda_{\mathrm{high}}}{2}$
    // 2. Solving decomposed subproblems
    for $i = 1,\dots,t$ do: compute the local optimal budget $\widehat{b}_i(\lambda_{\mathrm{mid}})$ using Algorithm 3
    // 3. Budget feasibility check
    Evaluate the total budget consumption $\bar{b} \leftarrow \sum_{i=1}^t \widehat{b}_i(\lambda_{\mathrm{mid}})$
    if $\bar{b} > \rho t$ then $\lambda_{\mathrm{low}} \leftarrow \lambda_{\mathrm{mid}}$  // dual variable is too small
    else $\lambda_{\mathrm{high}} \leftarrow \lambda_{\mathrm{mid}}$  // dual variable is too large
  end
  Output: $\widehat{\lambda} := \lambda_{\mathrm{high}}$ and $\widehat{b}_i := \widehat{b}_i(\lambda_{\mathrm{high}})$ for all $i \in [t]$

Algorithm 3: Algorithm for solving the decoupled problems (7)
  Input: subproblem function $\widehat{S}_i(\cdot)$, dual candidate $\lambda$, minimum interval length $\eta_b$
  Initialize: local search bounds $b_{\mathrm{low}} = 0$, $b_{\mathrm{high}} = t\rho$, golden ratio $\phi = \frac{\sqrt{5}-1}{2}$,
    $z_1 \leftarrow b_{\mathrm{high}} - \phi(b_{\mathrm{high}} - b_{\mathrm{low}})$, $z_2 \leftarrow b_{\mathrm{low}} + \phi(b_{\mathrm{high}} - b_{\mathrm{low}})$
  while $b_{\mathrm{high}} - b_{\mathrm{low}} > \eta_b$ do
    if $\widehat{S}_i(z_1) - \lambda z_1 < \widehat{S}_i(z_2) - \lambda z_2$ then
      $b_{\mathrm{low}} \leftarrow z_1$, $z_1 \leftarrow z_2$, $z_2 \leftarrow b_{\mathrm{low}} + \phi(b_{\mathrm{high}} - b_{\mathrm{low}})$
    else
      $b_{\mathrm{high}} \leftarrow z_2$, $z_2 \leftarrow z_1$, $z_1 \leftarrow b_{\mathrm{high}} - \phi(b_{\mathrm{high}} - b_{\mathrm{low}})$
  end
  Output: $b_{\mathrm{low}}$
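The following Python sketch mirrors the structure of Algorithms 2 and 3: bisection on the dual variable $\lambda$, with each decoupled subproblem solved by golden-section search. The concave functions $S_i$ below are synthetic stand-ins (assumptions for illustration), since evaluating the paper's $S_i$ requires the subroutine of Assumption 6:

```python
import numpy as np

PHI = (np.sqrt(5.0) - 1.0) / 2.0  # golden-section ratio

def golden_section(f, lo, hi, eta=1e-9):
    """Maximize a concave function f on [lo, hi] (Algorithm 3's search pattern)."""
    z1, z2 = hi - PHI * (hi - lo), lo + PHI * (hi - lo)
    while hi - lo > eta:
        if f(z1) < f(z2):        # maximum lies in [z1, hi]
            lo, z1 = z1, z2
            z2 = lo + PHI * (hi - lo)
        else:                    # maximum lies in [lo, z2]
            hi, z2 = z2, z1
            z1 = hi - PHI * (hi - lo)
    return lo

def bisection_budget_allocation(S_list, rho, lip, eta_lam=1e-6):
    """Algorithm 2's pattern: bisection on the dual variable for problem (4);
    lip bounds the optimal dual variable (lambda in [0, ||ell||_lip])."""
    t = len(S_list)
    lam_lo, lam_hi = 0.0, lip
    while lam_hi - lam_lo > eta_lam:
        lam = 0.5 * (lam_lo + lam_hi)
        # decoupled subproblems: b_i(lam) = argmax_b S_i(b) - lam * b
        b = [golden_section(lambda x, S=S: S(x) - lam * x, 0.0, t * rho)
             for S in S_list]
        if sum(b) > rho * t:     # budget exceeded: dual variable too small
            lam_lo = lam
        else:                    # budget slack: dual variable too large
            lam_hi = lam
    lam = lam_hi
    return lam, [golden_section(lambda x, S=S: S(x) - lam * x, 0.0, t * rho)
                 for S in S_list]

# Illustrative concave stand-ins for S_i (assumptions, not the paper's S_i):
S_list = [lambda b, c=c: c * np.sqrt(b) for c in (1.0, 2.0, 3.0)]
lam, budgets = bisection_budget_allocation(S_list, rho=0.5, lip=5.0)
print(f"lambda ~ {lam:.4f}, budgets ~ {[round(x, 3) for x in budgets]}")
```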
Appendix B. Omitted Proofs

B.1. Proof of Lemma 1

To establish the existence of a saddle point, we verify the conditions required in (Shafiee et al., 2025, Lemmas 3 and 4). Under our structural Assumptions 1 and 2, the regularity requirements regarding convexity, lower semi-continuity, integrability, and compactness of sublevel sets with respect to $x$ (Shafiee et al., 2025, Assumptions 3-6, 8) are immediately satisfied.

The remaining requirement involves the existence of Slater points, as specified in (Shafiee et al., 2025, Assumption 7). First, since the transportation cost function defining the $p$-Wasserstein distance is real-valued and continuous ($c(\xi,\xi') = \|\xi-\xi'\|^p$), the support of the empirical distribution $\widehat{\mathbb{P}}_t$ trivially lies within the interior of the cost function's domain. This satisfies the Slater condition (Shafiee et al., 2025, Assumption 7(i)). Second, per Assumption 1, the feasible region $\mathcal{X}$ is a nonempty, compact, and convex set. Consequently, it has a nonempty relative interior and admits a Slater point, thereby satisfying the primal Slater condition (Shafiee et al., 2025, Assumption 7(ii)). Finally, the growth conditions provided for the $p = 1$ case ensure that the dual problem is well posed and the inner maximization is attained. Thus, all necessary conditions for the minimax theorem are met, and the claim follows. ■

B.2. Proof of Lemma 2

Using Jensen's inequality, we can write
\[
\mathrm{Gap}(\bar{x}_T) = \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(\bar{x}_T,\mathbb{Q}) - \min_{x\in\mathcal{X}}\max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x,\mathbb{Q}) \le \frac{1}{T}\sum_{t=1}^T \Big( \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x_t,\mathbb{Q}) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x^\star,\mathbb{Q}) \Big),
\]
where $x^\star \in \mathrm{argmin}_{x\in\mathcal{X}}\max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x,\mathbb{Q})$. Next, we have
\[
\max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x_t,\mathbb{Q}) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x^\star,\mathbb{Q}) \le f(x_t,\mathbb{Q}_t) - f(x^\star,\mathbb{Q}_t) + \Big( \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x_t,\mathbb{Q}) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x_t,\mathbb{Q}) \Big) + \Big( \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x^\star,\mathbb{Q}) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x^\star,\mathbb{Q}) \Big) + \delta.
\]
The above inequality is obtained by noting that $\max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x^\star,\mathbb{Q}) - f(x^\star,\mathbb{Q}_t) \ge 0$ and applying the Wasserstein oracle from Assumption 4, which ensures $f(x_t,\mathbb{Q}_t) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x_t,\mathbb{Q}) + \delta \ge 0$. Combining the above two inequalities completes the proof. ■

Algorithm 4: Algorithm for evaluating $S_i(b)$ based on (5) and (6)
  Input: local budget $b > 0$, golden ratio $\phi = \frac{\sqrt{5}-1}{2}$, minimum interval lengths $\eta_{\mathrm{in}}, \eta_{\mathrm{out}} > 0$
  for $1 \le k_1 < k_2 \le K$ do
    // Golden-section search on the weight allocations ($\alpha_1 + \alpha_2 = 1$)
    Initialize bounds for $\alpha_1$: $[L_\alpha, U_\alpha] \leftarrow [0,1]$, $a_\alpha \leftarrow U_\alpha - \phi(U_\alpha - L_\alpha)$, $b_\alpha \leftarrow L_\alpha + \phi(U_\alpha - L_\alpha)$
    while $(U_\alpha - L_\alpha) > \eta_{\mathrm{out}}$ do
      $V_a \leftarrow \mathrm{SolveInner}(\alpha_1 = a_\alpha, \alpha_2 = 1 - a_\alpha)$
      $V_b \leftarrow \mathrm{SolveInner}(\alpha_1 = b_\alpha, \alpha_2 = 1 - b_\alpha)$
      if $V_a > V_b$ then $U_\alpha \leftarrow b_\alpha$, $b_\alpha \leftarrow a_\alpha$, $a_\alpha \leftarrow U_\alpha - \phi(U_\alpha - L_\alpha)$
      else $L_\alpha \leftarrow a_\alpha$, $a_\alpha \leftarrow b_\alpha$, $b_\alpha \leftarrow L_\alpha + \phi(U_\alpha - L_\alpha)$
    end
    Set $S_i^{(k_1,k_2)}(b) = \max(V_a, V_b)$
  end
  Output: maximum value $\max_{1\le k_1<k_2\le K} S_i^{(k_1,k_2)}(b)$

  Function SolveInner($\alpha_1, \alpha_2$):
    Initialize bounds for $\beta$: $[L_\beta, U_\beta] \leftarrow [0, b]$, $a_\beta \leftarrow U_\beta - \phi(U_\beta - L_\beta)$, $b_\beta \leftarrow L_\beta + \phi(U_\beta - L_\beta)$
    while $(U_\beta - L_\beta) > \eta_{\mathrm{in}}$ do
      // Compute $V_a$ and $V_b$ using the subroutine from Assumption 6
      $V_a \leftarrow \max_{\|q_1\|\le a_\beta}\{\alpha_1\ell_{k_1}(\widehat{\xi}_i - \frac{q_1}{\alpha_1})\} + \max_{\|q_2\|\le b-a_\beta}\{\alpha_2\ell_{k_2}(\widehat{\xi}_i - \frac{q_2}{\alpha_2})\}$
      $V_b \leftarrow \max_{\|q_1\|\le b_\beta}\{\alpha_1\ell_{k_1}(\widehat{\xi}_i - \frac{q_1}{\alpha_1})\} + \max_{\|q_2\|\le b-b_\beta}\{\alpha_2\ell_{k_2}(\widehat{\xi}_i - \frac{q_2}{\alpha_2})\}$
      if $V_a > V_b$ then $U_\beta \leftarrow b_\beta$, $b_\beta \leftarrow a_\beta$, $a_\beta \leftarrow U_\beta - \phi(U_\beta - L_\beta)$
      else $L_\beta \leftarrow a_\beta$, $a_\beta \leftarrow b_\beta$, $b_\beta \leftarrow L_\beta + \phi(U_\beta - L_\beta)$
    end
    return $\max(V_a, V_b)$

B.3. Proof of Lemma 3

The result follows from the standard regret analysis of the projected subgradient method; see, for example, (Orabona, 2019, Section 2.2.2). For completeness, we provide a short proof. For any $z \in \mathcal{X}$, we have
\[
\|x_{t+1} - z\|^2 = \|\Pi_{\mathcal{X}}(x_t - \eta_t\partial_x f(x_t,\mathbb{Q}_t)) - z\|^2 \le \|x_t - \eta_t\partial_x f(x_t,\mathbb{Q}_t) - z\|^2 = \|x_t - z\|^2 - 2\eta_t\langle x_t - z, \partial_x f(x_t,\mathbb{Q}_t)\rangle + \eta_t^2\|\partial_x f(x_t,\mathbb{Q}_t)\|^2.
\]
Summing both sides over $t = 1, 2, \dots, T$ yields
\[
2\sum_{t=1}^T \eta_t\langle x_t - z, \partial_x f(x_t,\mathbb{Q}_t)\rangle \le \|x_1 - z\|^2 - \|x_{T+1} - z\|^2 + \sum_{t=1}^T \eta_t^2\|\partial_x f(x_t,\mathbb{Q}_t)\|^2 \le D_{\mathcal{X}}^2 + G_{\mathcal{X}}^2\sum_{t=1}^T \eta_t^2,
\]
where we used $\|\partial_x f(x_t,\mathbb{Q}_t)\| = \|\partial_x\mathbb{E}_{\xi\sim\mathbb{Q}_t}[\ell(x_t,\xi)]\| \le G_{\mathcal{X}}$ and $\|x_1 - z\| \le D_{\mathcal{X}}$.
Since $f(x,\mathbb{Q})$ is convex in $x$ for every $\mathbb{Q} \in \mathcal{P}(\Xi)$, we have $\langle x_t - z, \partial_x f(x_t,\mathbb{Q}_t)\rangle \ge f(x_t,\mathbb{Q}_t) - f(z,\mathbb{Q}_t)$, which implies
\[
2\sum_{t=1}^T \eta_t\big(f(x_t,\mathbb{Q}_t) - f(z,\mathbb{Q}_t)\big) \le D_{\mathcal{X}}^2 + G_{\mathcal{X}}^2\sum_{t=1}^T \eta_t^2.
\]
Since the sequence of step sizes $\{\eta_t\}_{t=1}^T$ is non-increasing, we obtain
\[
\sum_{t=1}^T \big(f(x_t,\mathbb{Q}_t) - f(z,\mathbb{Q}_t)\big) \le \frac{D_{\mathcal{X}}^2}{2\eta_T} + \frac{G_{\mathcal{X}}^2\sum_{t=1}^T \eta_t^2}{2\eta_T}.
\]
Upon choosing $\eta_t = \frac{D_{\mathcal{X}}}{G_{\mathcal{X}}\sqrt{t}}$ for all $t \in [T]$, we obtain
\[
\frac{1}{T}\sum_{t=1}^T \big(f(x_t,\mathbb{Q}_t) - f(z,\mathbb{Q}_t)\big) \le \frac{G_{\mathcal{X}}D_{\mathcal{X}}}{2\sqrt{T}} + \frac{G_{\mathcal{X}}D_{\mathcal{X}}}{2\sqrt{T}}\sum_{t=1}^T \frac{1}{t} \le \frac{G_{\mathcal{X}}D_{\mathcal{X}}(1+\log T)}{\sqrt{T}}.
\]
Substituting $z = x^\star$ in the above inequality completes the proof. ■

B.4. Proof of Lemma 4

For ease of notation, throughout this proof we suppress the dependence on the decision variable $x$ and simply write $\ell(\xi)$ in lieu of $\ell(x,\xi)$. First, we introduce two technical lemmas.

Lemma 12. Under Assumption 2, for $k \in [K]$, define the function $h_k : \Xi \times \mathbb{R}_+ \to \mathbb{R}$ such that
\[
h_k(\xi,\lambda) := \sup_{z\in\Xi}\, \ell_k(z) - \lambda\|z-\xi\|^p, \quad \forall \lambda \ge 0, \; \xi \in \Xi.
\]
For any fixed $\lambda \ge 0$, the function $h_k(\xi,\lambda)$ is concave in $\xi \in \Xi$. Further, define $h(\xi,\lambda) := \max_{k\in[K]} h_k(\xi,\lambda)$. We have
\[
h(\xi,\lambda) = \sup_{z\in\Xi}\, \ell(z) - \lambda\|z-\xi\|^p, \quad \forall \lambda \ge 0, \; \xi \in \Xi.
\]

Proof. For any fixed $\lambda \ge 0$, the function $\ell_k(z) - \lambda\|z-\xi\|^p$ is jointly concave in $(z,\xi) \in \Xi^2$ under Assumption 2. Therefore, the pointwise supremum $\sup_{z\in\Xi}\,\ell_k(z) - \lambda\|z-\xi\|^p$ is concave in $\xi \in \Xi$. Next, we verify that
\[
h(\xi,\lambda) := \max_{k\in[K]} h_k(\xi,\lambda) = \max_{k\in[K]}\sup_{z\in\Xi}\, \ell_k(z) - \lambda\|z-\xi\|^p = \sup_{z\in\Xi}\max_{k\in[K]}\, \ell_k(z) - \lambda\|z-\xi\|^p = \sup_{z\in\Xi}\, \ell(z) - \lambda\|z-\xi\|^p. \;\blacksquare
\]

Lemma 13. Under the conditions of Lemma 12, assume that $p > 1$ and $\lambda > 0$. Then $h_k(\xi,\lambda)$ is differentiable with respect to $\xi$, and $\|\nabla_\xi h_k(\xi,\lambda)\| \le \|\nabla\ell_k(\xi)\|$.

Proof. For any fixed $\lambda > 0$, the function $\ell_k(z) - \lambda\|z-\xi\|^p$ is strictly concave in $z$ due to $p > 1$ and the concavity of $\ell_k$. Therefore, since $\Xi \subseteq \mathbb{R}^m$ is convex and closed, and $\limsup_{\|z\|\to\infty} \frac{\ell_k(z)}{\|z-\xi\|^p} \le 0$, we obtain that $\mathrm{argmax}_{z\in\Xi}\,\ell_k(z) - \lambda\|z-\xi\|^p$ is nonempty and has a unique element. Denote $z^\star = \mathrm{argmax}_{z\in\Xi}\,\ell_k(z) - \lambda\|z-\xi\|^p$. Since $\ell_k(z) - \lambda\|z-\xi\|^p$ is differentiable with respect to $\xi$, by Danskin's theorem, $h_k(\xi,\lambda)$ is differentiable with respect to $\xi$, and
\[
\nabla_\xi h_k(\xi,\lambda) = \frac{\partial}{\partial\xi}\big[\ell_k(z) - \lambda\|z-\xi\|^p\big]\Big|_{z=z^\star} = \lambda p\|z^\star-\xi\|^{p-1}\,\frac{z^\star-\xi}{\|z^\star-\xi\|}. \tag{8}
\]
Since $\Xi$ is convex and closed, by the optimality of $z^\star$, we have
\[
\Big\langle \nabla\ell_k(z^\star) - \lambda p\|z^\star-\xi\|^{p-1}\tfrac{z^\star-\xi}{\|z^\star-\xi\|},\; z - z^\star \Big\rangle \le 0, \quad \forall z \in \Xi,
\]
which implies that $\langle \nabla\ell_k(z^\star) - \nabla_\xi h_k(\xi,\lambda), z - z^\star\rangle \le 0$ for all $z \in \Xi$. Let $z = \xi \in \Xi$, and note that, according to (8), $-\nabla_\xi h_k(\xi,\lambda)$ points in the same direction as $\xi - z^\star$. Therefore, we have
\[
\langle \nabla\ell_k(z^\star) - \nabla_\xi h_k(\xi,\lambda),\; -\nabla_\xi h_k(\xi,\lambda)\rangle \le 0,
\]
which leads to $\langle \nabla\ell_k(z^\star), \nabla_\xi h_k(\xi,\lambda)\rangle \ge \|\nabla_\xi h_k(\xi,\lambda)\|^2$. Since $\ell_k$ is concave, we have $\langle \nabla\ell_k(z^\star) - \nabla\ell_k(\xi), z^\star - \xi\rangle \le 0$. Again, using (8), we have
\[
\langle \nabla\ell_k(z^\star) - \nabla\ell_k(\xi),\; \nabla_\xi h_k(\xi,\lambda)\rangle \le 0.
\]
It follows that
\[
\|\nabla_\xi h_k(\xi,\lambda)\|^2 \le \langle \nabla\ell_k(z^\star), \nabla_\xi h_k(\xi,\lambda)\rangle \le \langle \nabla\ell_k(\xi), \nabla_\xi h_k(\xi,\lambda)\rangle \le \|\nabla\ell_k(\xi)\|\,\|\nabla_\xi h_k(\xi,\lambda)\|.
\]
If $\nabla_\xi h_k(\xi,\lambda) = 0$, the statement of the lemma holds trivially.
Moreover, if $\nabla_\xi h_k(\xi,\lambda) \neq 0$, the above inequality leads to $\|\nabla_\xi h_k(\xi,\lambda)\| \le \|\nabla\ell_k(\xi)\|$. This completes the proof. ■

Now, we are ready to present the proof of Lemma 4, which is a special case of the following theorem.

Theorem 14. Under Assumptions 1 and 2, given a pair of distributions $\mathbb{P}_1, \mathbb{P}_2 \in \mathcal{P}(\Xi)$, the difference of the worst-case expectations within the Wasserstein balls centered at $\mathbb{P}_1$ and $\mathbb{P}_2$ satisfies
\[
\sup_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}_1)} \mathbb{E}_{\xi\sim\mathbb{Q}}[\ell(\xi)] - \sup_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}_2)} \mathbb{E}_{\xi\sim\mathbb{Q}}[\ell(\xi)] \le
\begin{cases}
\|\ell(\cdot)\|_{\mathrm{lip}}\, W_1(\mathbb{P}_1,\mathbb{P}_2) & \text{if } p = 1,\\
\|\ell(\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}_2)}\, W_p(\mathbb{P}_1,\mathbb{P}_2) & \text{if } p > 1,
\end{cases}
\]
where $q > 1$ is a constant such that $\frac{1}{p} + \frac{1}{q} = 1$ when $p > 1$.

Proof. Define the value function $V_\rho(\mathbb{P}) = \sup_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P})} \mathbb{E}_{\xi\sim\mathbb{Q}}[\ell(\xi)]$. By (Blanchet and Murthy, 2019, Theorem 1), we can reformulate the value function as
\[
V_\rho(\mathbb{P}) = \inf_{\lambda\ge 0}\big\{\lambda\rho + \mathbb{E}_{\mathbb{P}}[h(\xi,\lambda)]\big\}, \quad \text{where } h(\xi,\lambda) := \sup_{z\in\Xi}\{\ell(z) - \lambda\|z-\xi\|^p\}.
\]
For any $\epsilon > 0$, by the definition of the infimum, there exists $\lambda_\epsilon \ge 0$ such that
\[
\lambda_\epsilon\rho + \mathbb{E}_{\mathbb{P}_2}[h(\xi,\lambda_\epsilon)] \le V_\rho(\mathbb{P}_2) + \epsilon.
\]
By the suboptimality of $\lambda_\epsilon$ in the minimization problem for $V_\rho(\mathbb{P}_1)$, we have $V_\rho(\mathbb{P}_1) \le \lambda_\epsilon\rho + \mathbb{E}_{\mathbb{P}_1}[h(\xi,\lambda_\epsilon)]$. Subtracting these inequalities yields
\[
V_\rho(\mathbb{P}_1) - V_\rho(\mathbb{P}_2) \le \mathbb{E}_{\mathbb{P}_1}[h(\xi,\lambda_\epsilon)] - \mathbb{E}_{\mathbb{P}_2}[h(\xi,\lambda_\epsilon)] + \epsilon. \tag{9}
\]
Letting $\pi$ be the optimal coupling for $W_p(\mathbb{P}_1,\mathbb{P}_2)$, we can rewrite the above expectation difference as
\[
\mathbb{E}_{\mathbb{P}_1}[h(\xi,\lambda_\epsilon)] - \mathbb{E}_{\mathbb{P}_2}[h(\xi,\lambda_\epsilon)] = \mathbb{E}_{(\xi_1,\xi_2)\sim\pi}[h(\xi_1,\lambda_\epsilon) - h(\xi_2,\lambda_\epsilon)].
\]

Case 1 ($p = 1$): In this case, $h(\xi,\lambda) = \sup_{z\in\Xi}\{\ell(z) - \lambda\|z-\xi\|\}$. First, by the properties of the Pasch-Hausdorff envelope (Rockafellar and Wets, 1998, Example 9.11), the function $h(\cdot,\lambda)$ is $\lambda$-Lipschitz continuous for any $\lambda \ge 0$. Second, we argue that we can restrict the search for $\lambda$ to the interval $[0,\|\ell\|_{\mathrm{lip}}]$. Suppose $\lambda > \|\ell\|_{\mathrm{lip}}$. Since $\ell$ is $\|\ell\|_{\mathrm{lip}}$-Lipschitz, for any $z \in \Xi$ we have $\ell(z) \le \ell(\xi) + \|\ell\|_{\mathrm{lip}}\|z-\xi\|$. Thus, we may conclude that
\[
\ell(z) - \lambda\|z-\xi\| \le \ell(\xi) + (\|\ell\|_{\mathrm{lip}} - \lambda)\|z-\xi\| \le \ell(\xi),
\]
where the last inequality holds because $\|\ell\|_{\mathrm{lip}} - \lambda < 0$. Since the value $\ell(\xi)$ is attained at $z = \xi$, it follows that $h(\xi,\lambda) = \ell(\xi)$ for all $\lambda \ge \|\ell\|_{\mathrm{lip}}$. In this regime, the dual objective $\lambda\rho + \mathbb{E}[h(\xi,\lambda)]$ is strictly increasing in $\lambda$. Therefore, the infimum must be attained at some $\lambda \in [0,\|\ell\|_{\mathrm{lip}}]$, and we can assume $\lambda_\epsilon \le \|\ell\|_{\mathrm{lip}}$ without loss of optimality. Finally, using the $\lambda_\epsilon$-Lipschitzness of $h$ and the optimal coupling $\pi \in \Pi(\mathbb{P}_1,\mathbb{P}_2)$, we have
\[
\mathbb{E}_{\mathbb{P}_1}[h(\xi,\lambda_\epsilon)] - \mathbb{E}_{\mathbb{P}_2}[h(\xi,\lambda_\epsilon)] = \mathbb{E}_{(\xi_1,\xi_2)\sim\pi}[h(\xi_1,\lambda_\epsilon) - h(\xi_2,\lambda_\epsilon)] \le \mathbb{E}_\pi[\lambda_\epsilon\|\xi_1-\xi_2\|] \le \|\ell\|_{\mathrm{lip}}\cdot W_1(\mathbb{P}_1,\mathbb{P}_2).
\]
Substituting this into (9) and taking $\epsilon \to 0$ completes the proof for $p = 1$.

Case 2 ($p > 1$): First, we consider the case $\lambda_\epsilon = 0$. It follows that $h(\xi,0) = \sup_{z\in\Xi}\ell(z)$, which is constant. Substituting this into (9) and taking $\epsilon \to 0$, we have $V_\rho(\mathbb{P}_1) - V_\rho(\mathbb{P}_2) \le 0$, which completes the proof. Next, assume that $\lambda_\epsilon > 0$.
One can write
\[
\begin{aligned}
\mathbb{E}_{\mathbb{P}_1}[h(\xi,\lambda_\epsilon)] - \mathbb{E}_{\mathbb{P}_2}[h(\xi,\lambda_\epsilon)]
&= \mathbb{E}_{(\xi_1,\xi_2)\sim\pi}[h(\xi_1,\lambda_\epsilon) - h(\xi_2,\lambda_\epsilon)]
= \mathbb{E}_{(\xi_1,\xi_2)\sim\pi}\Big[\max_{k\in[K]} h_k(\xi_1,\lambda_\epsilon) - \max_{k\in[K]} h_k(\xi_2,\lambda_\epsilon)\Big]\\
&\le \max_{k\in[K]} \mathbb{E}_{(\xi_1,\xi_2)\sim\pi}[h_k(\xi_1,\lambda_\epsilon) - h_k(\xi_2,\lambda_\epsilon)]
\overset{(a)}{\le} \max_{k\in[K]} \mathbb{E}_{(\xi_1,\xi_2)\sim\pi}\big[\nabla_\xi h_k(\xi_2,\lambda_\epsilon)^\top(\xi_1-\xi_2)\big]\\
&\overset{(b)}{\le} \max_{k\in[K]} \big(\mathbb{E}_{(\xi_1,\xi_2)\sim\pi}\|\nabla_\xi h_k(\xi_2,\lambda_\epsilon)\|^q\big)^{1/q}\big(\mathbb{E}_{(\xi_1,\xi_2)\sim\pi}\|\xi_1-\xi_2\|^p\big)^{1/p}\\
&\overset{(c)}{\le} \max_{k\in[K]} \big(\mathbb{E}_{\xi\sim\mathbb{P}_2}\|\nabla\ell_k(\xi)\|^q\big)^{1/q} W_p(\mathbb{P}_1,\mathbb{P}_2)
= \max_{k\in[K]} \|\ell_k(\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}_2)} W_p(\mathbb{P}_1,\mathbb{P}_2)
\le \|\ell(\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}_2)} W_p(\mathbb{P}_1,\mathbb{P}_2). 
\end{aligned} \tag{10}
\]
Here, (a) follows from the differentiability and concavity of $h_k(\cdot,\lambda_\epsilon)$, as established in Lemma 13 and Lemma 12, respectively. Moreover, (b) follows from Hölder's inequality. Finally, (c) follows from Lemma 13. Combining (9) and (10) yields
\[
V_\rho(\mathbb{P}_1) - V_\rho(\mathbb{P}_2) \le \|\ell(\cdot)\|_{\dot{H}^{1,q}(\mathbb{P}_2)} W_p(\mathbb{P}_1,\mathbb{P}_2) + \epsilon.
\]
Letting $\epsilon \to 0$ completes the proof. ■

B.5. Proof of Theorem 5

To prove this theorem, it suffices to control the individual terms in Lemma 2. By Lemma 3, we have
\[
\frac{1}{T}\sum_{t=1}^T \big(f(x_t,\mathbb{Q}_t) - f(x^\star,\mathbb{Q}_t)\big) \le \frac{G_{\mathcal{X}}D_{\mathcal{X}}(1+\log T)}{\sqrt{T}}.
\]
On the other hand, when $p = 1$, we may invoke Lemma 4 with $x = x_t$ to obtain
\[
\frac{1}{T}\sum_{t=1}^T \Big(\max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x_t,\mathbb{Q}) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x_t,\mathbb{Q})\Big) \le \frac{1}{T}\sum_{t=1}^T \|\ell(x_t,\cdot)\|_{\mathrm{lip}}\, W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star).
\]
Similarly, invoking Lemma 4 with $x = x^\star$ yields
\[
\frac{1}{T}\sum_{t=1}^T \Big(\max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\mathbb{P}^\star)} f(x^\star,\mathbb{Q}) - \max_{\mathbb{Q}\in\mathbb{B}^p_\rho(\widehat{\mathbb{P}}_t)} f(x^\star,\mathbb{Q})\Big) \le \frac{1}{T}\sum_{t=1}^T \|\ell(x^\star,\cdot)\|_{\mathrm{lip}}\, W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star).
\]
Combining the above inequalities with Lemma 2 completes the proof for the first case ($p = 1$). The second case ($p > 1$) follows by an analogous argument and is therefore omitted for brevity. ■

B.6. Proof of Corollary 6

To present the proof, we need the following result.

Theorem 15 (Fournier and Guillin (2015), Theorem 2). If Assumption 3 holds, then for all $T \ge 1$ and $\epsilon > 0$, we have
\[
\mathbb{P}\big\{W_p(\widehat{\mathbb{P}}_T,\mathbb{P}^\star) \ge \epsilon\big\} \le
\begin{cases}
c_1\exp(-c_2 T\epsilon^{\max\{m/p,\,2\}}) & \text{if } \epsilon \le 1,\\
c_1\exp(-c_2 T\epsilon^{a/p}) & \text{if } \epsilon > 1,
\end{cases} \tag{11}
\]
where $c_1, c_2$ are positive constants that depend only on $a$, $m$, and $A := \mathbb{E}_{\xi\sim\mathbb{P}^\star}[\exp(\|\xi\|^a)]$.

Proof of Corollary 6. We only present the proof for $p = 1$, as the case $p > 1$ follows identically. According to the first statement of Theorem 5, it suffices to control the term
\[
\frac{1}{T}\sum_{t=1}^T \big(\|\ell(x_t,\cdot)\|_{\mathrm{lip}} + \|\ell(x^\star,\cdot)\|_{\mathrm{lip}}\big) W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star)
\]
and show that it dominates the other terms. We have
\[
\frac{1}{T}\sum_{t=1}^T \big(\|\ell(x_t,\cdot)\|_{\mathrm{lip}} + \|\ell(x^\star,\cdot)\|_{\mathrm{lip}}\big) W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) \le \Big(2\max_{x\in\mathcal{X}}\{\|\ell(x,\cdot)\|_{\mathrm{lip}}\}\Big)\cdot\frac{1}{T}\sum_{t=1}^T W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star). \tag{12}
\]
By Theorem 15, for each $t \ge \frac{1}{c_2}\log(\frac{c_1T}{\tau})$, with probability at least $1 - \tau/T$, we have
\[
W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) \le \Big(\frac{1}{c_2t}\log\Big(\frac{c_1T}{\tau}\Big)\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}}.
\]
Therefore, with probability at least $1-\tau$, we have
\[
\sum_{t=\lceil\frac{1}{c_2}\log(\frac{c_1T}{\tau})\rceil}^{T} W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) \le \sum_{t=\lceil\frac{1}{c_2}\log(\frac{c_1T}{\tau})\rceil}^{T} \Big(\frac{1}{c_2t}\log\Big(\frac{c_1T}{\tau}\Big)\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}} = \Big(\frac{1}{c_2}\log\Big(\frac{c_1T}{\tau}\Big)\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}} \sum_{t=\lceil\frac{1}{c_2}\log(\frac{c_1T}{\tau})\rceil}^{T} \Big(\frac{1}{t}\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}} \le \Big(\frac{1}{c_2}\log\Big(\frac{c_1T}{\tau}\Big)\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}} T^{1-\min\{\frac{p}{m},\frac{1}{2}\}}.
\]
It follows that
\[
\frac{1}{T}\sum_{t=1}^T W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) = \frac{1}{T}\sum_{t=1}^{\lceil\frac{1}{c_2}\log(\frac{c_1T}{\tau})\rceil-1} W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) + \frac{1}{T}\sum_{t=\lceil\frac{1}{c_2}\log(\frac{c_1T}{\tau})\rceil}^{T} W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star)
\le \frac{1}{T}\sum_{t=1}^{\lceil\frac{1}{c_2}\log(\frac{c_1T}{\tau})\rceil-1} W_1(\widehat{\mathbb{P}}_t,\mathbb{P}^\star) + \Big(\frac{1}{c_2}\log\Big(\frac{c_1T}{\tau}\Big)\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}} T^{-\min\{\frac{p}{m},\frac{1}{2}\}}
\le C_2\Big(\frac{1}{T}\log\Big(\frac{T}{\tau}\Big)\Big)^{\min\{\frac{p}{m},\frac{1}{2}\}},
\]
where $C_2$ is a constant depending on $m$, $p$, $a$, and $A$. This bound, combined with (12) and the first statement of Theorem 5, completes the proof for $p = 1$. ■

B.7. Proof of Theorem 7

We start with the following fundamental theorem, which provides a finite convex reformulation of the worst-case expectation problem.

Theorem 16 (Mohajerin Esfahani and Kuhn (2018), Theorem 4.4). Under Assumptions 2 and 5, the worst-case expectation problem (3) is equivalent to the following convex program:
\[
\max_{\{\alpha_{ik},q_{ik}\}} \Big\{ \frac{1}{t}\sum_{i=1}^t\sum_{k=1}^K \alpha_{ik}\,\ell_k\Big(\widehat{\xi}_i - \frac{q_{ik}}{\alpha_{ik}}\Big) \;\;\text{s.t.}\;\; \frac{1}{t}\sum_{i=1}^t\sum_{k=1}^K \|q_{ik}\| \le \rho, \;\; \sum_{k=1}^K \alpha_{ik} = 1 \;\forall i\in[t], \;\; \alpha_{ik} \ge 0 \;\forall i\in[t],\,k\in[K] \Big\}. \tag{13}
\]
In particular, the optimal values of (3) and (13) coincide. Moreover, let $\{\alpha^\star_{ik}, q^\star_{ik}\}$ be an optimal solution of (13). Then the discrete probability distribution
\[
\mathbb{Q}^\star := \frac{1}{t}\sum_{i=1}^t\sum_{k=1}^K \alpha^\star_{ik}\,\delta_{\xi^\star_{ik}}, \quad \text{with } \xi^\star_{ik} := \widehat{\xi}_i - \frac{q^\star_{ik}}{\alpha^\star_{ik}},
\]
belongs to the Wasserstein ball $\mathbb{B}^1_\rho(\widehat{\mathbb{P}}_t)$ and attains the maximum of (3).

Our first goal is to establish the equivalence between the formulation (14a) from the above theorem and a new formulation (14b):
\[
\max_{\{\alpha_{ik},q_{ik}\}} \Big\{ \frac{1}{t}\sum_{i=1}^t\sum_{k=1}^K \alpha_{ik}\,\ell_k\Big(\widehat{\xi}_i - \frac{q_{ik}}{\alpha_{ik}}\Big) \;\text{s.t.}\; \sum_{i=1}^t\sum_{k=1}^K \|q_{ik}\| \le t\rho, \; \sum_{k=1}^K \alpha_{ik} = 1, \; \alpha_{ik} \ge 0 \Big\}, \tag{14a}
\]
\[
\max_{\{\alpha_{ik},v_{ik}\}} \Big\{ \frac{1}{t}\sum_{i=1}^t\sum_{k=1}^K \alpha_{ik}\,\ell_k(\widehat{\xi}_i - v_{ik}) \;\text{s.t.}\; \sum_{i=1}^t\sum_{k=1}^K \alpha_{ik}\|v_{ik}\| \le t\rho, \; \sum_{k=1}^K \alpha_{ik} = 1, \; \alpha_{ik} \ge 0 \Big\}. \tag{14b}
\]
To establish this equivalence, we first rewrite both formulations as follows:
\[
\max\Big\{ \frac{1}{t}\sum_{i=1}^t S_i^{\mathrm{old}}(b_i) \;\text{s.t.}\; \sum_{i=1}^t b_i \le t\rho, \; b_i \ge 0 \Big\}, \tag{15a}
\]
\[
\max\Big\{ \frac{1}{t}\sum_{i=1}^t S_i^{\mathrm{new}}(b_i) \;\text{s.t.}\; \sum_{i=1}^t b_i \le t\rho, \; b_i \ge 0 \Big\}, \tag{15b}
\]
where
\[
S_i^{\mathrm{old}}(b) := \max_{\{\alpha_k,q_k\}} \Big\{ \sum_{k=1}^K \alpha_k\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k}{\alpha_k}\Big) \;\text{s.t.}\; \sum_{k=1}^K \|q_k\| \le b, \; \sum_{k=1}^K \alpha_k = 1, \; \alpha_k \ge 0 \Big\}, \tag{16a}
\]
\[
S_i^{\mathrm{new}}(b) := \max_{\{\alpha_k,v_k\}} \Big\{ \sum_{k=1}^K \alpha_k\,\ell_k(\widehat{\xi}_i - v_k) \;\text{s.t.}\; \sum_{k=1}^K \alpha_k\|v_k\| \le b, \; \sum_{k=1}^K \alpha_k = 1, \; \alpha_k \ge 0 \Big\}. \tag{16b}
\]

Lemma 17. Problem (14a) is equivalent to problem (15a). Similarly, problem (14b) is equivalent to problem (15b).

Proof. The equivalence between problems (14a) and (15a) follows directly by observing that the constraint $\sum_{i=1}^t\sum_{k=1}^K \|q_{ik}\| \le t\rho$ is equivalent to introducing auxiliary variables $b_i$ such that $\sum_{i=1}^t b_i \le t\rho$ and $\sum_{k=1}^K \|q_{ik}\| \le b_i$ for $i = 1,\dots,t$. Given this reformulation, the equivalence follows immediately from the definition of $S_i^{\mathrm{old}}(b)$ in (16a). The equivalence between problems (14b) and (15b) can be established analogously. ■

Therefore, to establish the equivalence between (14a) and (14b), it suffices to show that $S_i^{\mathrm{old}}(b) = S_i^{\mathrm{new}}(b)$ for every $b \ge 0$ and $i \in [t]$. This is established in the next lemma. This equivalence justifies the interchangeable use of the perspective formulation (16a) and the budget-allocation formulation (16b) throughout our analysis.

Lemma 18. Fix any $b \ge 0$ and any $i \in [t]$.
Under Assumptions 2 and 5, we have $S_i^{\mathrm{old}}(b) = S_i^{\mathrm{new}}(b)$. Moreover, given any $\epsilon_1$-suboptimal solution to (16b), one can construct a feasible solution to (16a) that attains the same objective value. Conversely, given any $\epsilon_2$-suboptimal solution to (16a), one can construct a sequence of feasible solutions to (16b) whose objective values converge to the same value.

Proof. Consider (16b). For any $\epsilon_1 > 0$, there exist feasible solutions $\{\alpha_k^{\epsilon_1}\}_{k\in[K]}, \{v_k^{\epsilon_1}\}_{k\in[K]}$ such that
\[
0 \le S_i^{\mathrm{new}}(b) - \sum_{k=1}^K \alpha_k^{\epsilon_1}\ell_k(\widehat{\xi}_i - v_k^{\epsilon_1}) \le \epsilon_1.
\]
Without loss of generality, when $\alpha_k^{\epsilon_1} = 0$, we may assume $v_k^{\epsilon_1} = 0$. Upon defining $q_k^{\epsilon_1} = \alpha_k^{\epsilon_1}v_k^{\epsilon_1}$ for $k \in [K]$, the solutions $\{\alpha_k^{\epsilon_1}\}_{k\in[K]}, \{q_k^{\epsilon_1}\}_{k\in[K]}$ are feasible for (16a), and
\[
\frac{q_k^{\epsilon_1}}{\alpha_k^{\epsilon_1}} = \begin{cases} 0 & \text{if } \alpha_k^{\epsilon_1} = 0,\\ v_k^{\epsilon_1} & \text{if } \alpha_k^{\epsilon_1} > 0. \end{cases}
\]
It then follows that the objective value of (16a) evaluated at $\{\alpha_k^{\epsilon_1}\}_{k\in[K]}, \{q_k^{\epsilon_1}\}_{k\in[K]}$ is the same as the objective value of (16b) evaluated at $\{\alpha_k^{\epsilon_1}\}_{k\in[K]}, \{v_k^{\epsilon_1}\}_{k\in[K]}$, i.e.,
\[
\sum_{k=1}^K \alpha_k^{\epsilon_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_1}}{\alpha_k^{\epsilon_1}}\Big) = \sum_{k=1}^K \alpha_k^{\epsilon_1}\ell_k(\widehat{\xi}_i - v_k^{\epsilon_1}).
\]
This yields
\[
S_i^{\mathrm{old}}(b) \ge \sum_{k=1}^K \alpha_k^{\epsilon_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_1}}{\alpha_k^{\epsilon_1}}\Big) = \sum_{k=1}^K \alpha_k^{\epsilon_1}\ell_k(\widehat{\xi}_i - v_k^{\epsilon_1}) \ge S_i^{\mathrm{new}}(b) - \epsilon_1.
\]
Letting $\epsilon_1 \to 0^+$, we conclude that $S_i^{\mathrm{old}}(b) \ge S_i^{\mathrm{new}}(b)$.

Next, we show $S_i^{\mathrm{old}}(b) \le S_i^{\mathrm{new}}(b)$. For any $\epsilon_2 > 0$, there exist feasible solutions $\{\alpha_k^{\epsilon_2}\}_{k\in[K]}, \{q_k^{\epsilon_2}\}_{k\in[K]}$ for (16a) such that
\[
0 \le S_i^{\mathrm{old}}(b) - \sum_{k=1}^K \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) \le \epsilon_2.
\]
Define $\mathcal{I}_0^{\epsilon_2} := \{k\in[K] : \alpha_k^{\epsilon_2} = 0,\, q_k^{\epsilon_2} \neq 0\}$, $\mathcal{I}_1^{\epsilon_2} := \{k\in[K] : \alpha_k^{\epsilon_2} = 0,\, q_k^{\epsilon_2} = 0\}$, and $\mathcal{I}_2^{\epsilon_2} := \{k\in[K] : \alpha_k^{\epsilon_2} > 0\}$. For every $k \in \mathcal{I}_1^{\epsilon_2}$, we have
\[
\alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big)\Big|_{\alpha_k^{\epsilon_2}=0,\,q_k^{\epsilon_2}=0} = 0,
\]
where we use the convention $0/0 = 0$. For every $k \in \mathcal{I}_0^{\epsilon_2}$, we have
\[
\alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big)\Big|_{\alpha_k^{\epsilon_2}=0} := \liminf_{\alpha\to 0^+} \alpha\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha}\Big).
\]
In this case, by the definition of $\liminf$, for any $\epsilon_3 > 0$ there exists some $\delta_1 > 0$ such that
\[
\liminf_{\alpha\to 0^+} \alpha\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha}\Big) \le \inf_{\alpha\in(0,\delta_1)} \alpha\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha}\Big) + \epsilon_3.
\]
Upon taking an arbitrary $\alpha_k^{\delta_1} \in (0,\delta_1)$, we have
\[
\liminf_{\alpha\to 0^+} \alpha\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha}\Big) \le \inf_{\alpha\in(0,\delta_1)} \alpha\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha}\Big) + \epsilon_3 \le \alpha_k^{\delta_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\delta_1}}\Big) + \epsilon_3.
\]
This leads to
\[
\begin{aligned}
S_i^{\mathrm{old}}(b) &\le \sum_{k=1}^K \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) + \epsilon_2
= \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \liminf_{\alpha\to 0^+} \alpha\,\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha}\Big) + \sum_{k\in\mathcal{I}_2^{\epsilon_2}} \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) + \epsilon_2\\
&\le \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \Big(\alpha_k^{\delta_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\delta_1}}\Big) + \epsilon_3\Big) + \sum_{k\in\mathcal{I}_2^{\epsilon_2}} \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) + \epsilon_2\\
&\le \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \alpha_k^{\delta_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\delta_1}}\Big) + \sum_{k\in\mathcal{I}_2^{\epsilon_2}} \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) + \epsilon_2 + K\epsilon_3.
\end{aligned}
\]
Denote $k^\star = \mathrm{argmax}_{k\in[K]}\alpha_k^{\epsilon_2}$. By the pigeonhole principle, we have $\alpha_{k^\star}^{\epsilon_2} \ge \frac{1}{K}$ since $\sum_{k=1}^K \alpha_k^{\epsilon_2} = 1$. This implies that $k^\star \in \mathcal{I}_2^{\epsilon_2}$. Assume that $\delta_1 \le \frac{1}{K^2}$. Since $|\mathcal{I}_0^{\epsilon_2}| \le K-1$, we have
\[
\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \alpha_k^{\delta_1} \ge \frac{1}{K} - (K-1)\frac{1}{K^2} = \frac{1}{K^2} > 0.
\]
Now we construct a feasible solution to (16b).
Specifically, we choose
\[
\alpha_k' = \begin{cases}
\alpha_k^{\delta_1} & \text{if } k \in \mathcal{I}_0^{\epsilon_2},\\
\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \alpha_k^{\delta_1} & \text{if } k = k^\star,\\
\alpha_k^{\epsilon_2} & \text{if } k \notin \mathcal{I}_0^{\epsilon_2} \text{ and } k \neq k^\star,
\end{cases}
\]
and $v_k' = \frac{q_k^{\epsilon_2}}{\alpha_k'}$ for $k \in [K]$. For this constructed solution $\{\alpha_k'\}_{k\in[K]}, \{v_k'\}_{k\in[K]}$, we can verify its feasibility by noting that $\alpha_k' > 0$ for every $k \in [K]$ with $q_k^{\epsilon_2} \neq 0$, that $\sum_{k=1}^K \alpha_k' = 1$, and that
\[
\sum_{k=1}^K \alpha_k'\|v_k'\| = \sum_{k=1}^K \|q_k^{\epsilon_2}\| \le b.
\]
Moreover, we have
\[
\begin{aligned}
&\Big( \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \alpha_k^{\delta_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\delta_1}}\Big) + \sum_{k\in\mathcal{I}_2^{\epsilon_2}} \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) \Big) - \sum_{k=1}^K \alpha_k'\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k'}\Big)\\
&\quad= \alpha_{k^\star}^{\epsilon_2}\ell_{k^\star}\Big(\widehat{\xi}_i - \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big) - \Big(\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}\Big)\ell_{k^\star}\Big(\widehat{\xi}_i - \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}}\Big)\\
&\quad= \Big(\sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}\Big)\ell_{k^\star}\Big(\widehat{\xi}_i - \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big) + \Big(\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}\Big)\Big( \ell_{k^\star}\Big(\widehat{\xi}_i - \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big) - \ell_{k^\star}\Big(\widehat{\xi}_i - \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}}\Big) \Big)\\
&\quad\le \Big(\sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}\Big)\Big( \ell_{k^\star}(\widehat{\xi}_i) + \|\ell\|_{\mathrm{lip}}\Big\|\frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big\| \Big) + \Big(\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}\Big)\|\ell\|_{\mathrm{lip}}\Big\| \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}} - \frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2} - \sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}} \Big\|\\
&\quad= \Big(\sum_{k\in\mathcal{I}_0^{\epsilon_2}}\alpha_k^{\delta_1}\Big)\Big( \ell_{k^\star}(\widehat{\xi}_i) + 2\|\ell\|_{\mathrm{lip}}\Big\|\frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big\| \Big)
\le \max\Big\{ K\delta_1\Big( \ell_{k^\star}(\widehat{\xi}_i) + 2\|\ell\|_{\mathrm{lip}}\Big\|\frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big\| \Big),\, 0 \Big\}.
\end{aligned}
\]
On the other hand, since $\alpha_{k^\star}^{\epsilon_2} \ge \frac{1}{K}$, we obtain
\[
\ell_{k^\star}(\widehat{\xi}_i) + 2\|\ell\|_{\mathrm{lip}}\Big\|\frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big\| \le \ell_{k^\star}(\widehat{\xi}_i) + 2K\|\ell\|_{\mathrm{lip}}\|q_{k^\star}^{\epsilon_2}\|.
\]
Without loss of generality, we may assume that $\ell_{k^\star}(\widehat{\xi}_i) + 2K\|\ell\|_{\mathrm{lip}}\|q_{k^\star}^{\epsilon_2}\| > 0$. For any $\epsilon_2 > 0$, let
\[
\delta_1 = \min\Big\{ \frac{1}{K^2},\; \frac{\epsilon_2}{K\big(\ell_{k^\star}(\widehat{\xi}_i) + 2K\|\ell\|_{\mathrm{lip}}\|q_{k^\star}^{\epsilon_2}\|\big)} \Big\} \quad\text{and}\quad \epsilon_3 = \frac{\epsilon_2}{K}.
\]
It follows that
\[
\begin{aligned}
S_i^{\mathrm{old}}(b) &\le \sum_{k\in\mathcal{I}_0^{\epsilon_2}} \alpha_k^{\delta_1}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\delta_1}}\Big) + \sum_{k\in\mathcal{I}_2^{\epsilon_2}} \alpha_k^{\epsilon_2}\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k^{\epsilon_2}}\Big) + \epsilon_2 + K\epsilon_3\\
&\le \sum_{k=1}^K \alpha_k'\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k'}\Big) + K\delta_1\Big(\ell_{k^\star}(\widehat{\xi}_i) + 2\|\ell\|_{\mathrm{lip}}\Big\|\frac{q_{k^\star}^{\epsilon_2}}{\alpha_{k^\star}^{\epsilon_2}}\Big\|\Big) + \epsilon_2 + K\epsilon_3
\le \sum_{k=1}^K \alpha_k'\ell_k\Big(\widehat{\xi}_i - \frac{q_k^{\epsilon_2}}{\alpha_k'}\Big) + 3\epsilon_2 \le S_i^{\mathrm{new}}(b) + 3\epsilon_2.
\end{aligned}
\]
Letting $\epsilon_2 \to 0^+$, we obtain $S_i^{\mathrm{old}}(b) \le S_i^{\mathrm{new}}(b)$. This completes the proof. ■

Next, we establish a fundamental property of the value function $S_i^{\mathrm{new}}(b)$: its global optimum can be recovered by solving $\binom{K}{2}$ smaller subproblems, each defined on a pair of variables.

Lemma 19. Fix $b \ge 0$ and $i \in [t]$. Under Assumptions 2 and 5, for $1 \le k_1 < k_2 \le K$, define
\[
S_i^{\mathrm{new},(k_1,k_2)}(b) := \max_{\alpha_{k_1},\alpha_{k_2},v_{k_1},v_{k_2}} \Big\{ \alpha_{k_1}\ell_{k_1}(\widehat{\xi}_i - v_{k_1}) + \alpha_{k_2}\ell_{k_2}(\widehat{\xi}_i - v_{k_2}) \;\text{s.t.}\; \alpha_{k_1}\|v_{k_1}\| + \alpha_{k_2}\|v_{k_2}\| \le b, \; \alpha_{k_1} + \alpha_{k_2} = 1, \; \alpha_{k_1},\alpha_{k_2} \ge 0 \Big\}. \tag{17}
\]
Then, we have $S_i^{\mathrm{new}}(b) = \max_{1\le k_1<k_2\le K} S_i^{\mathrm{new},(k_1,k_2)}(b)$.

Proof. For fixed values $\{v_k\}_{k\in[K]}$, define
\[
S_i^{\mathrm{new}}(b,\{v_k\}) := \max_{\{\alpha_k\}} \Big\{ \sum_{k=1}^K \alpha_k\ell_k(\widehat{\xi}_i - v_k) \;\text{s.t.}\; \sum_{k=1}^K \alpha_k\|v_k\| \le b, \; \sum_{k=1}^K \alpha_k = 1, \; \alpha_k \ge 0 \Big\}. \tag{18}
\]
For any $\epsilon_0 > 0$, there exists some $\{v_k^{\epsilon_0}\}_{k\in[K]}$ such that $0 \le S_i^{\mathrm{new}}(b) - S_i^{\mathrm{new}}(b,\{v_k^{\epsilon_0}\}) \le \epsilon_0$. When $\{v_k\}_{k\in[K]}$ are fixed to $\{v_k^{\epsilon_0}\}_{k\in[K]}$, problem (18) reduces to a linear program in the decision variables $\{\alpha_k\}_{k\in[K]}$. Moreover, its feasible region is a bounded polyhedron, since $\{\alpha_k\}_{k\in[K]}$ lies in the standard simplex. By the fundamental theorem of linear programming, there exists an optimal solution $\{\alpha_k^{\star,\epsilon_0}\}_{k\in[K]}$ that is an extreme point of the feasible region.
Next, we establish a fundamental property of the value function $S^{\mathrm{new}}_i(b)$: its global optimum can be recovered by solving $\binom{K}{2}$ smaller subproblems, each defined on a pair of variables.

Lemma 19 Fix $b \ge 0$ and $i \in [t]$. Under Assumptions 2 and 5, for $1 \le k_1 < k_2 \le K$, define
\[
S^{\mathrm{new},(k_1,k_2)}_i(b) := \max_{\alpha_{k_1},\alpha_{k_2},v_{k_1},v_{k_2}} \left\{ \alpha_{k_1}\ell_{k_1}(\widehat{\xi}_i - v_{k_1}) + \alpha_{k_2}\ell_{k_2}(\widehat{\xi}_i - v_{k_2}) \ :\
\begin{array}{l} \alpha_{k_1}\|v_{k_1}\| + \alpha_{k_2}\|v_{k_2}\| \le b, \\ \alpha_{k_1} + \alpha_{k_2} = 1,\ \alpha_{k_1}, \alpha_{k_2} \ge 0 \end{array} \right\}. \tag{17}
\]
Then, we have $S^{\mathrm{new}}_i(b) = \max_{1 \le k_1 < k_2 \le K} S^{\mathrm{new},(k_1,k_2)}_i(b)$.

Proof For fixed perturbations $\{v_k\}_{k\in[K]}$, let
\[
S^{\mathrm{new}}_i(b, \{v_k\}) := \max_{\alpha} \left\{ \sum_{k=1}^K \alpha_k \ell_k(\widehat{\xi}_i - v_k) \ :\ \sum_{k=1}^K \alpha_k \|v_k\| \le b,\ \sum_{k=1}^K \alpha_k = 1,\ \alpha_k \ge 0 \right\} \tag{18}
\]
denote the restriction of (16b) in which only the weights $\{\alpha_k\}_{k\in[K]}$ are optimized. For any $\epsilon_0 > 0$, there exists some $\{v^{\epsilon_0}_k\}_{k\in[K]}$ such that
\[
0 \le S^{\mathrm{new}}_i(b) - S^{\mathrm{new}}_i(b, \{v^{\epsilon_0}_k\}) \le \epsilon_0 .
\]
When $\{v_k\}_{k\in[K]}$ are fixed to $\{v^{\epsilon_0}_k\}_{k\in[K]}$, problem (18) reduces to a linear program in the decision variables $\{\alpha_k\}_{k\in[K]}$. Moreover, its feasible region is a bounded polyhedron, since $\{\alpha_k\}_{k\in[K]}$ lies in the standard simplex. By the Fundamental Theorem of Linear Programming, there exists an optimal solution $\{\alpha^{\star,\epsilon_0}_k\}_{k\in[K]}$ that is an extreme point of this polyhedron. Consequently, at least $K$ linearly independent constraints of (18) are active at $\{\alpha^{\star,\epsilon_0}_k\}_{k\in[K]}$. Since (18) has only two constraints that are not sign constraints (the budget inequality and the simplex equality), at least $K - 2$ constraints of the form $\alpha_k \ge 0$, $k \in [K]$, are active, and hence at most two components of $\{\alpha^{\star,\epsilon_0}_k\}_{k\in[K]}$ can be nonzero. Using this observation, one can enumerate all possible pairs of indices corresponding to potentially nonzero components and reformulate (18) accordingly as
\[
S^{\mathrm{new}}_i\big(b, \{v^{\epsilon_0}_k\}_{k\in[K]}\big) = \max_{1 \le k_1 < k_2 \le K} S^{\mathrm{new},(k_1,k_2)}_i\big(b, \{v^{\epsilon_0}_{k_1}, v^{\epsilon_0}_{k_2}\}\big), \tag{19}
\]
where the right-hand side denotes the pairwise problem (17) with the perturbations fixed to $v^{\epsilon_0}_{k_1}, v^{\epsilon_0}_{k_2}$. Since fixing the perturbations only restricts (17), each term on the right-hand side of (19) is at most $S^{\mathrm{new},(k_1,k_2)}_i(b)$. Therefore, for any $\epsilon_0 > 0$, we have
\[
S^{\mathrm{new}}_i(b) \le S^{\mathrm{new}}_i(b, \{v^{\epsilon_0}_k\}) + \epsilon_0 = \epsilon_0 + \max_{1 \le k_1 < k_2 \le K} S^{\mathrm{new},(k_1,k_2)}_i\big(b, \{v^{\epsilon_0}_{k_1}, v^{\epsilon_0}_{k_2}\}\big) \le \epsilon_0 + \max_{1 \le k_1 < k_2 \le K} S^{\mathrm{new},(k_1,k_2)}_i(b).
\]
Letting $\epsilon_0 \to 0^+$ yields $S^{\mathrm{new}}_i(b) \le \max_{1 \le k_1 < k_2 \le K} S^{\mathrm{new},(k_1,k_2)}_i(b)$. Conversely, every feasible solution to (17) extends to a feasible solution to (16b) with the same objective value by setting $\alpha_k = 0$ and $v_k = 0$ for all $k \notin \{k_1, k_2\}$, so $S^{\mathrm{new}}_i(b) \ge S^{\mathrm{new},(k_1,k_2)}_i(b)$ for every pair. This completes the proof.

Proof of Lemma 8 We analyze the nested golden section search of Algorithm 4 applied to the pairwise problem (17). Without loss of generality, we assume that the optimal solution satisfies $\alpha^\star_1, \alpha^\star_2 > 0$ with $\alpha^\star_1 + \alpha^\star_2 = 1$, and restrict the search range of $\alpha_1$ to $(0, 1)$. Under Assumption 6, for any fixed $\alpha_j \in (0, 1)$ and $\beta_j \in [0, b]$, $j = 1, 2$, we have access to a subroutine running in time $\mathrm{Cost}_{k_j, \delta_{\mathrm{eval}}/2}$ that outputs a vector $q^\star_j$ satisfying $\|q^\star_j\| \le \beta_j$ and
\[
\alpha_j \ell_{k_j}\Big(\widehat{\xi}_i - \frac{q^\star_j}{\alpha_j}\Big) \ \ge\ \max_{\|q\| \le \beta_j} \alpha_j \ell_{k_j}\Big(\widehat{\xi}_i - \frac{q}{\alpha_j}\Big) - \frac{\delta_{\mathrm{eval}}}{2}, \qquad j = 1, 2.
\]
Since
\[
\Psi(\alpha, \beta) = \max_{\|q\| \le \beta_1} \alpha_1 \ell_{k_1}\Big(\widehat{\xi}_i - \frac{q}{\alpha_1}\Big) + \max_{\|q\| \le \beta_2} \alpha_2 \ell_{k_2}\Big(\widehat{\xi}_i - \frac{q}{\alpha_2}\Big),
\]
we conclude that the approximate evaluation $\widehat{\Psi}(\alpha, \beta)$ of $\Psi(\alpha, \beta)$ can be computed in time $\mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2}$.

Now, fix $\alpha$ in the interior of the simplex. Since $\ell_{k_1}$ and $\ell_{k_2}$ are $\|\ell\|_{\mathrm{lip}}$-Lipschitz, increasing $\beta_1$ by $\Delta\beta$ can increase $\max_{\|q\| \le \beta_1} \alpha_1 \ell_{k_1}(\widehat{\xi}_i - q/\alpha_1)$ by at most $\|\ell\|_{\mathrm{lip}} \Delta\beta$ (the factor $\alpha_1$ cancels the $1/\alpha_1$ scaling of the argument). At the same time, $\max_{\|q\| \le b - \beta_1} \alpha_2 \ell_{k_2}(\widehat{\xi}_i - q/\alpha_2)$ does not increase and may decrease by at most $\|\ell\|_{\mathrm{lip}} \Delta\beta$. Therefore, $\Psi(\alpha, \cdot)$ is $\|\ell\|_{\mathrm{lip}}$-Lipschitz in $\beta$.

Let $\beta$ and $\beta'$ be two points evaluated by the inner golden section search in Algorithm 4. The algorithm discards a subinterval based on comparing $\widehat{\Psi}(\alpha, \beta)$ and $\widehat{\Psi}(\alpha, \beta')$. We claim that the algorithm makes the correct comparison whenever $|\Psi(\alpha, \beta) - \Psi(\alpha, \beta')| > \delta_{\mathrm{eval}}$. To see this, suppose without loss of generality that $\Psi(\alpha, \beta) > \Psi(\alpha, \beta') + \delta_{\mathrm{eval}}$. Then
\[
\widehat{\Psi}(\alpha, \beta) \ \ge\ \Psi(\alpha, \beta) - \delta_{\mathrm{eval}} \ >\ \Psi(\alpha, \beta') \ \ge\ \widehat{\Psi}(\alpha, \beta'),
\]
where the last inequality uses $\widehat{\Psi} \le \Psi$, since the subroutine returns feasible points. Hence, the algorithm correctly shrinks the interval while retaining the optimal solution $\beta^\star(\alpha)$. Thus, after sufficiently many iterations, the inner golden section search returns $\widehat{\beta}(\alpha)$ satisfying
\[
\big| \Psi(\alpha, \beta^\star(\alpha)) - \Psi(\alpha, \widehat{\beta}(\alpha)) \big| \ \le\ \delta_{\mathrm{eval}}. \tag{20}
\]
It remains to bound the number of iterations in the inner golden section search. Let $\eta_{\mathrm{in}}$ denote the final interval length of this golden section search. By Lipschitz continuity of $\Psi$,
\[
\big| \Psi(\alpha, \beta^\star(\alpha)) - \Psi(\alpha, \widehat{\beta}(\alpha)) \big| \ \le\ \|\ell\|_{\mathrm{lip}}\, \eta_{\mathrm{in}} .
\]
Hence, (20) is ensured by choosing $\eta_{\mathrm{in}} \le \delta_{\mathrm{eval}} / \|\ell\|_{\mathrm{lip}}$. Consequently, the inner golden section search runs in time
\[
O\Big( \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \log\Big( \frac{b}{\eta_{\mathrm{in}}} \Big) \Big)
= O\Big( \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \log\Big( \frac{b \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big) \Big).
\]
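Before turning to the outer search, the comparison rule just analyzed is easy to exercise in isolation. The Python sketch below implements a golden section search for a unimodal objective whose oracle, like the subroutine of Assumption 6, returns a one-sided underestimate within $\delta_{\mathrm{eval}}$; the objective and all numerical values are hypothetical.

```python
# A minimal sketch of golden section search with inexact evaluations, mirroring
# the comparison rule analyzed above: the oracle returns a value in
# [f(x) - delta_eval, f(x)], and comparisons are reliable once the true gap
# exceeds delta_eval. The objective f below is a hypothetical stand-in.
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0  # reciprocal of the golden ratio

def golden_section_max(oracle, lo, hi, eta):
    """Shrink [lo, hi] to length <= eta, keeping a near-maximizer inside."""
    x1 = hi - INV_PHI * (hi - lo)
    x2 = lo + INV_PHI * (hi - lo)
    f1, f2 = oracle(x1), oracle(x2)
    while hi - lo > eta:
        if f1 < f2:                      # maximizer lies in [x1, hi]
            lo, x1, f1 = x1, x2, f2
            x2 = lo + INV_PHI * (hi - lo)
            f2 = oracle(x2)
        else:                            # maximizer lies in [lo, x2]
            hi, x2, f2 = x2, x1, f1
            x1 = hi - INV_PHI * (hi - lo)
            f1 = oracle(x1)
    return 0.5 * (lo + hi)

# Toy instance: f is concave (hence unimodal); the oracle under-reports by at
# most delta_eval, as in the subroutine of Assumption 6.
delta_eval, lip = 1e-3, 1.0
f = lambda beta: -abs(beta - 0.7)                 # ||l||_lip = 1
oracle = lambda beta: f(beta) - 0.5 * delta_eval  # one-sided error
beta_hat = golden_section_max(oracle, 0.0, 2.0, delta_eval / lip)
assert f(0.7) - f(beta_hat) <= 2 * delta_eval     # same flavor of bound as (20)
print(f"beta_hat = {beta_hat:.5f}")
```

The stopping length `eta = delta_eval / lip` is exactly the choice $\eta_{\mathrm{in}} \le \delta_{\mathrm{eval}} / \|\ell\|_{\mathrm{lip}}$ made in the proof.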
Next, we turn to the outer golden section search (over $\alpha$). Similar to our analysis of the inner golden section search, let $\alpha$ and $\alpha'$ be two points evaluated by the outer golden section search. The algorithm discards a subinterval based on comparing $\widehat{\Psi}(\alpha, \widehat{\beta}(\alpha))$ and $\widehat{\Psi}(\alpha', \widehat{\beta}(\alpha'))$. We claim that the algorithm makes the correct comparison whenever $|\Psi(\alpha, \widehat{\beta}(\alpha)) - \Psi(\alpha', \beta^\star(\alpha'))| > 3\delta_{\mathrm{eval}}$. Without loss of generality, let us assume that
\[
\Psi(\alpha, \widehat{\beta}(\alpha)) > \Psi(\alpha', \beta^\star(\alpha')) + 3\delta_{\mathrm{eval}} .
\]
Note that $|\widehat{\Psi}(\alpha, \widehat{\beta}(\alpha)) - \Psi(\alpha, \widehat{\beta}(\alpha))| \le \delta_{\mathrm{eval}}$. Moreover,
\begin{align*}
\big| \widehat{\Psi}(\alpha', \widehat{\beta}(\alpha')) - \Psi(\alpha', \beta^\star(\alpha')) \big|
&\le \big| \widehat{\Psi}(\alpha', \widehat{\beta}(\alpha')) - \Psi(\alpha', \widehat{\beta}(\alpha')) \big| + \big| \Psi(\alpha', \widehat{\beta}(\alpha')) - \Psi(\alpha', \beta^\star(\alpha')) \big| \\
&\le \delta_{\mathrm{eval}} + \|\ell\|_{\mathrm{lip}}\, \eta_{\mathrm{in}} \ \le\ 2\delta_{\mathrm{eval}} .
\end{align*}
Therefore,
\[
\widehat{\Psi}(\alpha, \widehat{\beta}(\alpha)) \ \ge\ \Psi(\alpha, \widehat{\beta}(\alpha)) - \delta_{\mathrm{eval}} \ >\ \Psi(\alpha', \beta^\star(\alpha')) + 2\delta_{\mathrm{eval}} \ \ge\ \widehat{\Psi}(\alpha', \widehat{\beta}(\alpha')),
\]
and the algorithm correctly shrinks the interval while retaining the optimal solution $\alpha^\star$. Thus, after sufficiently many iterations, the outer golden section search returns a solution $\widehat{\alpha}$ satisfying
\[
\big| \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) - \Psi(\alpha^\star, \beta^\star(\alpha^\star)) \big| \ \le\ 3\delta_{\mathrm{eval}}. \tag{21}
\]
It remains to bound the number of iterations in the outer golden section search. Let $\eta_{\mathrm{out}}$ denote the final interval length of the outer golden section search. Since we assume $\alpha^\star$ lies in the interior of the simplex, there must exist an open neighborhood $\mathcal{N}(\alpha^\star) = \{\alpha : \|\alpha - \alpha^\star\| \le r\}$ within which $\Psi(\cdot, \beta^\star(\alpha^\star))$ is locally Lipschitz, that is, for any $\alpha \in \mathcal{N}(\alpha^\star)$:
\[
-L_{\alpha^\star} |\alpha - \alpha^\star| \ \le\ \Psi(\alpha, \beta^\star(\alpha^\star)) - \Psi(\alpha^\star, \beta^\star(\alpha^\star)) \ \le\ 0 .
\]
Using this property, one can write
\begin{align*}
\Psi(\alpha^\star, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha}))
&= \big( \Psi(\alpha^\star, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \beta^\star(\alpha^\star)) \big) + \big( \Psi(\widehat{\alpha}, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \beta^\star(\widehat{\alpha})) \big) \\
&\qquad + \big( \Psi(\widehat{\alpha}, \beta^\star(\widehat{\alpha})) - \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) \big) \\
&\le \big( \Psi(\alpha^\star, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \beta^\star(\alpha^\star)) \big) + \big( \Psi(\widehat{\alpha}, \beta^\star(\widehat{\alpha})) - \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) \big) \\
&\le L_{\alpha^\star}\, \eta_{\mathrm{out}} + \|\ell\|_{\mathrm{lip}}\, \eta_{\mathrm{in}} \ \le\ L_{\alpha^\star}\, \eta_{\mathrm{out}} + \delta_{\mathrm{eval}},
\end{align*}
where in the second line we used the fact that $\Psi(\widehat{\alpha}, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \beta^\star(\widehat{\alpha})) \le 0$ because, by definition, $\beta^\star(\widehat{\alpha})$ is the maximizer of $\Psi(\widehat{\alpha}, \cdot)$. Since $\Psi(\alpha^\star, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) \ge 0$ (due to the optimality of $(\alpha^\star, \beta^\star(\alpha^\star))$), we thus have
\[
\big| \Psi(\alpha^\star, \beta^\star(\alpha^\star)) - \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) \big| \ \le\ L_{\alpha^\star}\, \eta_{\mathrm{out}} + \delta_{\mathrm{eval}} . \tag{22}
\]
Hence, we satisfy (21) by choosing $\eta_{\mathrm{out}} \le \min\{ 2\delta_{\mathrm{eval}} / L_{\alpha^\star},\ r \}$. Combined with the complexity of the inner golden section search, this implies that the nested golden section search runs in time
\begin{align*}
&O\Big( \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \log\Big( \frac{b}{\eta_{\mathrm{in}}} \Big) \log\Big( \frac{1}{\eta_{\mathrm{out}}} \Big) \Big) \\
&\quad = O\Big( \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \log\Big( \frac{b \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big) \max\Big\{ \log\Big( \frac{L_{\alpha^\star}}{2\delta_{\mathrm{eval}}} \Big),\ \log\Big( \frac{1}{r} \Big) \Big\} \Big).
\end{align*}
The proof is completed by noting that $\Psi(\alpha^\star, \beta^\star(\alpha^\star))$ is precisely $S^{\mathrm{new},(k_1,k_2)}_i(b)$, and its evaluation $\widehat{\Psi}(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha}))$ satisfies
\[
\big| \widehat{\Psi}(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) - \Psi(\alpha^\star, \beta^\star(\alpha^\star)) \big| \le \big| \widehat{\Psi}(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) - \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) \big| + \big| \Psi(\widehat{\alpha}, \widehat{\beta}(\widehat{\alpha})) - \Psi(\alpha^\star, \beta^\star(\alpha^\star)) \big| \le \delta_{\mathrm{eval}} + 3\delta_{\mathrm{eval}} \le 4\delta_{\mathrm{eval}} .
\]

Proof of Lemma 9 The proof readily follows from (5) and the result of Lemma 8.
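To make the nested structure concrete, the following sketch mimics Algorithm 4 on a hypothetical scalar instance: an outer golden section search over $\alpha_1$, an inner one over the budget split $\beta_1$, and an innermost one standing in for the subroutine of Assumption 6. It is an illustration under these toy assumptions, not the authors' implementation.

```python
# A minimal sketch of the nested search analyzed above: an outer golden section
# search over alpha_1 in (0,1) whose objective is itself evaluated by an inner
# golden section search over the budget split beta_1 in [0, b]. The two concave
# pieces ell_1, ell_2 and the scalar setting are hypothetical stand-ins.
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0

def gss_max(f, lo, hi, eta):
    # generic golden section search for a unimodal f; final interval <= eta
    x1, x2 = hi - INV_PHI * (hi - lo), lo + INV_PHI * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > eta:
        if f1 < f2:
            lo, x1, f1 = x1, x2, f2
            x2 = lo + INV_PHI * (hi - lo); f2 = f(x2)
        else:
            hi, x2, f2 = x2, x1, f1
            x1 = hi - INV_PHI * (hi - lo); f1 = f(x1)
    x = 0.5 * (lo + hi)
    return x, f(x)

xi, b, lip, delta_eval = 0.3, 1.0, 1.0, 1e-4
ell1 = lambda z: -abs(z - 1.0)   # concave, 1-Lipschitz
ell2 = lambda z: -abs(z + 0.5)

def inner_max(ell, alpha, beta):
    # max_{|q| <= beta} alpha * ell(xi - q / alpha), standing in for the
    # delta_eval/2-accurate subroutine of Assumption 6
    _, val = gss_max(lambda q: alpha * ell(xi - q / alpha), -beta, beta,
                     delta_eval / (2 * lip))
    return val

def psi(alpha1, beta1):
    # Psi(alpha, beta) with alpha_2 = 1 - alpha_1 and beta_2 = b - beta_1
    return inner_max(ell1, alpha1, beta1) + inner_max(ell2, 1 - alpha1, b - beta1)

def psi_hat(alpha1):
    # inner golden section search over the budget split, as in Algorithm 4
    _, val = gss_max(lambda beta1: psi(alpha1, beta1), 0.0, b, delta_eval / lip)
    return val

alpha_hat, val = gss_max(psi_hat, 1e-6, 1.0 - 1e-6, delta_eval)
print(f"alpha_hat ~ {alpha_hat:.4f}, approximate pairwise value ~ {val:.6f}")
```

Concavity of $\beta_1 \mapsto \Psi(\alpha, \beta_1)$ and of the perspective-type objective in $\alpha_1$ is what makes both levels unimodal, so golden section search applies at each level.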
B.9. Proof of Lemma 10

Before presenting the proof of Lemma 10, we first need two helper lemmas.

Lemma 21 Under Assumptions 2 and 5, the function $S_i(b)$ is concave in $b \ge 0$ for each $i \in [t]$.

Proof By Lemma 18, $S_i(b)$ admits the representation
\[
S_i(b) = \max_{\alpha, q} \left\{ \sum_{k=1}^K \alpha_k \ell_k\Big(\widehat{\xi}_i - \frac{q_k}{\alpha_k}\Big) \ :\ \sum_{k=1}^K \|q_k\| \le b,\ \sum_{k=1}^K \alpha_k = 1,\ \alpha_k \ge 0 \right\}.
\]
We first characterize the feasible set. Since $\sum_{k=1}^K \|q_k\|$ is convex in $q$, its epigraph $\{ (q, b) : \sum_{k=1}^K \|q_k\| \le b \}$ is jointly convex in $(q, b)$. Moreover, the standard simplex $\{ \alpha : \sum_{k=1}^K \alpha_k = 1,\ \alpha_k \ge 0 \}$ is convex. Therefore, the set
\[
\mathcal{C}_1 := \left\{ \{(\alpha_k, q_k, b)\} : \sum_{k=1}^K \|q_k\| \le b,\ \sum_{k=1}^K \alpha_k = 1,\ \alpha_k \ge 0 \right\}
\]
is jointly convex in $\{(\alpha_k, q_k, b)\}$. Next, consider the objective. For each $k$, the function $-\alpha_k \ell_k(\widehat{\xi}_i - q_k/\alpha_k)$ is the perspective of the convex function $-\ell_k$, composed with the affine map $(\alpha_k, q_k) \mapsto (\alpha_k \widehat{\xi}_i - q_k, \alpha_k)$. Hence, it is jointly convex in $(\alpha_k, q_k)$. Summing over $k$, we conclude that
\[
\sum_{k=1}^K -\alpha_k \ell_k\Big(\widehat{\xi}_i - \frac{q_k}{\alpha_k}\Big)
\]
is jointly convex in $(\alpha, q)$. Consequently, its epigraph
\[
\mathcal{C}_2 := \left\{ \{(\alpha_k, q_k, t)\} : \sum_{k=1}^K -\alpha_k \ell_k\Big(\widehat{\xi}_i - \frac{q_k}{\alpha_k}\Big) \le t \right\}
\]
is jointly convex in $\{(\alpha_k, q_k, t)\}$. Combining the two parts, the set
\[
\mathcal{C}_1 \cap \mathcal{C}_2 = \left\{ \{(\alpha_k, q_k, b, t)\} : \sum_{k=1}^K \|q_k\| \le b,\ \sum_{k=1}^K \alpha_k = 1,\ \alpha_k \ge 0,\ \sum_{k=1}^K -\alpha_k \ell_k\Big(\widehat{\xi}_i - \frac{q_k}{\alpha_k}\Big) \le t \right\}
\]
is jointly convex in $(\alpha, q, b, t)$. Finally, observe that the epigraph of $-S_i$ can be written as
\[
\mathrm{epi}(-S_i) = \{ (b, t) : \exists (\alpha, q) \text{ s.t. } (\alpha, q, b, t) \in \mathcal{C}_1 \cap \mathcal{C}_2 \},
\]
which is the projection of a convex set onto the $(b, t)$-coordinates. Since projections preserve convexity, $\mathrm{epi}(-S_i)$ is convex, implying that $-S_i$ is convex and hence $S_i$ is concave in $b$.

Our next lemma establishes the Lipschitz continuity of $S_i$.

Lemma 22 Under Assumptions 2 and 5, the function $S_i(b)$ is Lipschitz continuous on $b \ge 0$ with Lipschitz constant $\|\ell\|_{\mathrm{lip}}$, for each $i \in [t]$.

Proof First, we show that the right derivative of $S_i(b)$ exists for $b \ge 0$. For a fixed budget $b \ge 0$, the right derivative of $S_i$ at $b$ is defined as
\[
S'_{i,+}(b) := \lim_{h \to 0^+} \phi(h) := \lim_{h \to 0^+} \frac{S_i(b + h) - S_i(b)}{h} \ \ge\ 0,
\]
where nonnegativity holds because $S_i$ is non-decreasing in $b$ (a larger budget enlarges the feasible set). To show this limit exists, it suffices to show that the difference quotient $\phi(h)$ is non-increasing in $h$, so that $\phi(h)$ is monotone as $h \downarrow 0$. Let $0 < h_1 < h_2$. Since $S_i(\cdot)$ is concave and
\[
b + h_1 = \Big( 1 - \frac{h_1}{h_2} \Big) b + \frac{h_1}{h_2} (b + h_2),
\]
by Jensen's inequality we have
\[
S_i(b + h_1) \ \ge\ \Big( 1 - \frac{h_1}{h_2} \Big) S_i(b) + \frac{h_1}{h_2} S_i(b + h_2),
\]
which can be rearranged as
\[
\phi(h_1) = \frac{S_i(b + h_1) - S_i(b)}{h_1} \ \ge\ \frac{S_i(b + h_2) - S_i(b)}{h_2} = \phi(h_2).
\]
Therefore, the right derivative $S'_{i,+}(b) \ge 0$ exists for every $b \ge 0$. Similarly, we can show that the left derivative
\[
S'_{i,-}(b) := \lim_{h \to 0^-} \frac{S_i(b + h) - S_i(b)}{h} \ \ge\ 0
\]
exists for every $b > 0$. Furthermore, it follows from concavity that $S'_{i,-}(b) \ge S'_{i,+}(b)$ for every $b > 0$, and the set of supergradients is given by $\partial S_i(b) = [S'_{i,+}(b), S'_{i,-}(b)]$.

Next, we show that for $0 \le b_1 < b_2$, we have $S'_{i,+}(b_1) \ge S'_{i,-}(b_2)$. Let $h_2 < 0 < h_1$ be such that $b_1 + h_1 \le b_2 + h_2$. Since $S_i(\cdot)$ is concave, the three-point slope inequality gives
\[
\frac{S_i(b_1 + h_1) - S_i(b_1)}{(b_1 + h_1) - b_1} \ \ge\ \frac{S_i(b_2 + h_2) - S_i(b_1 + h_1)}{(b_2 + h_2) - (b_1 + h_1)} \ \ge\ \frac{S_i(b_2) - S_i(b_2 + h_2)}{b_2 - (b_2 + h_2)},
\]
which implies that
\[
\frac{S_i(b_1 + h_1) - S_i(b_1)}{h_1} \ \ge\ \frac{S_i(b_2 + h_2) - S_i(b_2)}{h_2}.
\]
Letting $h_1 \to 0^+$ and $h_2 \to 0^-$, we have $S'_{i,+}(b_1) \ge S'_{i,-}(b_2)$. It follows that $0 \le g_i \le S'_{i,+}(0)$ for every $g_i \in \partial S_i(b)$ and all $b \ge 0$.

Now it remains to show that $S'_{i,+}(0) \le \|\ell\|_{\mathrm{lip}}$. For any $\delta > 0$ and $\epsilon > 0$, there exists a feasible solution $\{\alpha^{\delta,\epsilon}_k\}_{k\in[K]}, \{v^{\delta,\epsilon}_k\}_{k\in[K]}$ to (16b) with budget $\delta$ that satisfies
\[
0 \le S_i(\delta) - \sum_{k=1}^K \alpha^{\delta,\epsilon}_k \ell_k(\widehat{\xi}_i - v^{\delta,\epsilon}_k) < \epsilon .
\]
Therefore, the right derivative of $S_i$ at $0$ can be bounded as
\begin{align*}
S'_{i,+}(0) = \lim_{\delta \to 0^+} \frac{S_i(\delta) - S_i(0)}{\delta}
&\le \lim_{\delta \to 0^+} \frac{\sum_{k=1}^K \alpha^{\delta,\epsilon}_k \ell_k(\widehat{\xi}_i - v^{\delta,\epsilon}_k) + \epsilon - \max_{k \in [K]} \ell_k(\widehat{\xi}_i)}{\delta} \\
&\le \lim_{\delta \to 0^+} \frac{\sum_{k=1}^K \alpha^{\delta,\epsilon}_k \big[ \ell_k(\widehat{\xi}_i - v^{\delta,\epsilon}_k) - \ell_k(\widehat{\xi}_i) \big] + \epsilon}{\delta} \\
&\le \lim_{\delta \to 0^+} \frac{\sum_{k=1}^K \alpha^{\delta,\epsilon}_k \|\ell_k\|_{\mathrm{lip}} \|v^{\delta,\epsilon}_k\| + \epsilon}{\delta} \\
&\le \lim_{\delta \to 0^+} \frac{\max_{k \in [K]} \|\ell_k\|_{\mathrm{lip}}\, \delta + \epsilon}{\delta} \ =\ \lim_{\delta \to 0^+} \frac{\|\ell\|_{\mathrm{lip}}\, \delta + \epsilon}{\delta},
\end{align*}
where the first inequality uses $S_i(0) = \max_{k \in [K]} \ell_k(\widehat{\xi}_i)$, the second uses $\sum_k \alpha^{\delta,\epsilon}_k \ell_k(\widehat{\xi}_i) \le \max_{k \in [K]} \ell_k(\widehat{\xi}_i)$, and the fourth uses the budget constraint $\sum_k \alpha^{\delta,\epsilon}_k \|v^{\delta,\epsilon}_k\| \le \delta$. Since $\epsilon > 0$ can be chosen arbitrarily small for each fixed $\delta$, letting $\epsilon \to 0^+$ yields
\[
S'_{i,+}(0) \ \le\ \lim_{\delta \to 0^+} \frac{\|\ell\|_{\mathrm{lip}}\, \delta}{\delta} \ =\ \|\ell\|_{\mathrm{lip}} .
\]
This completes the proof.
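Lemmas 21 and 22 can also be probed numerically. The sketch below evaluates $S_i(b)$ by a grid search on a hypothetical one-dimensional instance with two concave pieces (for which the innermost maximization has a closed form), then tests midpoint concavity and the $\|\ell\|_{\mathrm{lip}}$-Lipschitz bound; the tolerance absorbs the grid-discretization error.

```python
# Numerical probe of Lemmas 21 and 22 on a hypothetical 1-D instance with two
# concave pieces ell_k(z) = c_k - L_k * |z - m_k|: S_i(b) is evaluated by a grid
# search over (alpha_1, budget split), using the closed-form inner maximizer.
import numpy as np

xi = 0.2
pieces = [(1.0, 1.0, -0.5), (0.8, 0.6, 1.5)]  # hypothetical (c_k, L_k, m_k)
lip = max(L for _, L, _ in pieces)            # ||l||_lip = max_k L_k

def inner(c, L, m, alpha, beta):
    # max_{alpha * |v| <= beta} alpha * ell(xi - v); zero-weight pieces give 0
    radius = beta / np.maximum(alpha, 1e-12)
    v = np.clip(xi - m, -radius, radius)      # closest feasible shift to xi - m
    return alpha * (c - L * np.abs(xi - v - m))

def S(b, n=301):
    a1 = np.linspace(0.0, 1.0, n)[:, None]    # weight grid (column)
    b1 = np.linspace(0.0, b, n)[None, :]      # budget-split grid (row)
    vals = inner(*pieces[0], a1, b1) + inner(*pieces[1], 1.0 - a1, b - b1)
    return float(vals.max())

bs = np.linspace(0.0, 3.0, 13)
vals = np.array([S(b) for b in bs])
mids = np.array([S(0.5 * (x + y)) for x, y in zip(bs[:-1], bs[1:])])
tol = 0.05                                    # slack for grid discretization
assert np.all(mids >= 0.5 * (vals[:-1] + vals[1:]) - tol), "concavity violated"
assert np.all(np.abs(np.diff(vals)) <= lip * np.diff(bs) + tol), "Lipschitz violated"
print("S_i(b) passes the concavity and Lipschitz checks on this toy instance")
```

Such a check cannot prove either lemma, but it exercises the two properties that the proof of Lemma 10 below relies on: concavity (so golden section search applies) and the $\|\ell\|_{\mathrm{lip}}$ bound on the slope of $S_i$.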
Proof of Lemma 10 Due to the concavity and non-decreasing property of $S_i(b)$, there exists $b^{\mathrm{limit}}_i \ge 0$ such that $S_i(b)$ is strictly increasing and concave on $[0, b^{\mathrm{limit}}_i]$, and constant on $(b^{\mathrm{limit}}_i, \infty)$. Note that $b^{\mathrm{limit}}_i$ can be $+\infty$, in which case $S_i(b)$ is strictly increasing on $[0, +\infty)$. Moreover, for any $\lambda \ge \|\ell\|_{\mathrm{lip}}$, the point $b = 0$ maximizes $S_i(b) - \lambda b$, since $S_i$ is $\|\ell\|_{\mathrm{lip}}$-Lipschitz by Lemma 22. Therefore, without loss of generality, we can restrict the search range for the dual variable $\lambda$ to $[0, \|\ell\|_{\mathrm{lip}}]$. Note that $\operatorname{argmax}_{b \ge 0} \{ S_i(b) - \lambda b \}$ is a closed set; we let $b^\star_i(\lambda)$ denote the smallest element of this set.

For any fixed $i \in [t]$ and dual candidate $\lambda \in [0, \|\ell\|_{\mathrm{lip}}]$, let $\widehat{b}_i(\lambda)$ be the output of Algorithm 3. Let $b_i$ and $b'_i$ be two distinct points evaluated by the golden section search presented in Algorithm 3. The algorithm discards a subinterval based on comparing $\widehat{S}_i(b_i) - \lambda b_i$ and $\widehat{S}_i(b'_i) - \lambda b'_i$. The algorithm makes the correct comparison whenever
\[
\big| (S_i(b_i) - \lambda b_i) - (S_i(b'_i) - \lambda b'_i) \big| > 2\delta_{\mathrm{eval}} .
\]
To see this, suppose without loss of generality that $(S_i(b_i) - \lambda b_i) > (S_i(b'_i) - \lambda b'_i) + 2\delta_{\mathrm{eval}}$. Then
\[
\widehat{S}_i(b_i) - \lambda b_i \ \ge\ S_i(b_i) - \lambda b_i - \delta_{\mathrm{eval}} \ \ge\ S_i(b'_i) - \lambda b'_i + \delta_{\mathrm{eval}} \ \ge\ \widehat{S}_i(b'_i) - \lambda b'_i,
\]
and the algorithm correctly shrinks the interval while retaining the optimal solution $b^\star_i(\lambda)$. Thus, after sufficiently many iterations, the golden section search returns $\widehat{b}_i(\lambda)$ satisfying
\[
\Big| \big( S_i(\widehat{b}_i(\lambda)) - \lambda \widehat{b}_i(\lambda) \big) - \big( S_i(b^\star_i(\lambda)) - \lambda b^\star_i(\lambda) \big) \Big| \ \le\ 2\delta_{\mathrm{eval}}. \tag{23}
\]
It remains to bound the number of iterations in the golden section search. Let $\eta_b$ denote the final interval length of this golden section search. By Lemma 22, the function $S_i$ is Lipschitz continuous with constant $\|\ell\|_{\mathrm{lip}}$. Therefore,
\[
\Big| \big( S_i(\widehat{b}_i(\lambda)) - \lambda \widehat{b}_i(\lambda) \big) - \big( S_i(b^\star_i(\lambda)) - \lambda b^\star_i(\lambda) \big) \Big| \ \le\ (\|\ell\|_{\mathrm{lip}} + \lambda)\, \eta_b \ \le\ 2 \|\ell\|_{\mathrm{lip}}\, \eta_b,
\]
where the last inequality uses the fact that $\lambda \in [0, \|\ell\|_{\mathrm{lip}}]$. Hence, (23) is ensured by choosing $\eta_b = \delta_{\mathrm{eval}} / \|\ell\|_{\mathrm{lip}}$. Consequently, combined with the complexity of Algorithm 4 derived in Lemma 9, we conclude that Algorithm 3 runs in time
\begin{align*}
&O\Big( \Gamma \cdot K^2 \cdot \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \cdot \log\Big( \frac{b}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{1}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{\rho t}{\eta_b} \Big) \Big) \\
&\quad = O\Big( \Gamma \cdot K^2 \cdot \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \cdot \log\Big( \frac{b}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{1}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{\rho t \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big) \Big).
\end{align*}
The proof is completed by noting that
\begin{align*}
\Big| \big( \widehat{S}_i(\widehat{b}_i(\lambda)) - \lambda \widehat{b}_i(\lambda) \big) - \big( S_i(b^\star_i(\lambda)) - \lambda b^\star_i(\lambda) \big) \Big|
&\le \big| \widehat{S}_i(\widehat{b}_i(\lambda)) - S_i(\widehat{b}_i(\lambda)) \big| + \Big| \big( S_i(\widehat{b}_i(\lambda)) - \lambda \widehat{b}_i(\lambda) \big) - \big( S_i(b^\star_i(\lambda)) - \lambda b^\star_i(\lambda) \big) \Big| \\
&\le 4\delta_{\mathrm{eval}} + 2\delta_{\mathrm{eval}} = 6\delta_{\mathrm{eval}} .
\end{align*}

B.10. Proof of Theorem 11

Denote the optimal dual variable of (4) by $\lambda^\star \in [0, \|\ell\|_{\mathrm{lip}}]$. By Lemma 10, for any $\lambda \in [0, \|\ell\|_{\mathrm{lip}}]$, Algorithm 7 runs in time
\[
O\Big( \Gamma \cdot \big( \mathrm{Cost}_{k_1, \delta_{\mathrm{eval}}/2} + \mathrm{Cost}_{k_2, \delta_{\mathrm{eval}}/2} \big) \cdot \log\Big( \frac{b}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{1}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{\rho t \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big) \Big)
\]
and outputs a $\widehat{b}_i(\lambda)$ satisfying $|\widehat{b}_i(\lambda) - b^\star_i(\lambda)| \le \eta_b$ with $\eta_b \le \delta_{\mathrm{eval}} / \|\ell\|_{\mathrm{lip}}$.

Now, we analyze the bisection method of Algorithm 2 (over $\lambda$). Similar to our analysis in Lemma 10, let $\lambda$ be a dual candidate at some iteration of the bisection search. The algorithm discards a subinterval based on the sign of $\rho t - \sum_{i=1}^t \widehat{b}_i(\lambda)$. We claim that the algorithm makes the correct decision whenever
\[
\Big| \rho t - \sum_{i=1}^t b^\star_i(\lambda) \Big| > t \eta_b .
\]
To see this, assume without loss of generality that $\rho t - \sum_{i=1}^t b^\star_i(\lambda) > t \eta_b$. Then
\[
\rho t - \sum_{i=1}^t \widehat{b}_i(\lambda) = \rho t - \sum_{i=1}^t b^\star_i(\lambda) + \sum_{i=1}^t \big( b^\star_i(\lambda) - \widehat{b}_i(\lambda) \big) \ >\ t\eta_b - t\eta_b = 0 .
\]
This implies that the algorithm correctly shrinks the interval while retaining the optimal dual $\lambda^\star$ within the interval. Thus, after sufficiently many iterations, the bisection search returns a solution $\widehat{\lambda}$ satisfying
\[
\Big| \rho t - \sum_{i=1}^t b^\star_i(\widehat{\lambda}) \Big| \ \le\ t \eta_b . \tag{24}
\]
It remains to bound the number of iterations in the outer bisection search. Let $\eta_\lambda$ denote the final interval length of the bisection search. Given $\lambda^\star \in [0, \|\ell\|_{\mathrm{lip}}]$, there exists a neighborhood $\mathcal{N}_i(\lambda^\star) = \{\lambda : |\lambda - \lambda^\star| \le r_i\} \cap [0, \|\ell\|_{\mathrm{lip}}]$ within which $b^\star_i(\cdot)$ is locally Lipschitz, that is, for any $\lambda \in \mathcal{N}_i(\lambda^\star)$:
\[
|b^\star_i(\lambda) - b^\star_i(\lambda^\star)| \ \le\ L^{(i)}_{\lambda^\star} |\lambda - \lambda^\star| .
\]
Using this property and the fact that $\sum_{i=1}^t b^\star_i(\lambda^\star) = \rho t$ at the optimal dual, one can write
\[
\Big| \rho t - \sum_{i=1}^t b^\star_i(\widehat{\lambda}) \Big| = \Big| \sum_{i=1}^t \big( b^\star_i(\lambda^\star) - b^\star_i(\widehat{\lambda}) \big) \Big| \ \le\ \big| \widehat{\lambda} - \lambda^\star \big| \sum_{i=1}^t L^{(i)}_{\lambda^\star} \ \le\ \eta_\lambda \sum_{i=1}^t L^{(i)}_{\lambda^\star} .
\]
Hence, we satisfy (24) by choosing $\eta_\lambda = \min\{ \eta_b / L_{\lambda^\star},\ r_{\min} \}$, where
\[
L_{\lambda^\star} := \max_{i \in [t]} L^{(i)}_{\lambda^\star}, \qquad r_{\min} := \min_{i \in [t]} \{ r_i \} .
\]
Combined with the complexity of the golden section search from Lemma 10 and after noting that $\eta_b = \delta_{\mathrm{eval}} / \|\ell\|_{\mathrm{lip}}$, we conclude that Algorithm 2 runs in time
\begin{align*}
&O\Big( \Gamma \cdot K^2 \cdot \max_{k \in [K]} \big( \mathrm{Cost}_{k, \delta_{\mathrm{eval}}/2} \big) \cdot \log\Big( \frac{b}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{1}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{\rho t \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{\|\ell\|_{\mathrm{lip}}}{\eta_\lambda} \Big) \Big) \\
&\quad = O\Big( \Gamma \cdot K^2 \cdot \max_{k \in [K]} \big( \mathrm{Cost}_{k, \delta_{\mathrm{eval}}/2} \big) \cdot \log\Big( \frac{b}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{1}{\delta_{\mathrm{eval}}} \Big) \cdot \log\Big( \frac{\rho t \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big) \cdot \max\Big\{ \log\Big( \frac{L_{\lambda^\star} \|\ell\|_{\mathrm{lip}}}{\delta_{\mathrm{eval}}} \Big),\ \log\Big( \frac{1}{r_{\min}} \Big) \Big\} \Big).
\end{align*}
The proof is completed by noting that
\begin{align*}
\frac{1}{t} \sum_{i=1}^t S_i(b^\star_i(\lambda^\star)) - \frac{1}{t} \sum_{i=1}^t S_i(\widehat{b}_i(\widehat{\lambda}))
&\le \frac{1}{t} \sum_{i=1}^t \|\ell\|_{\mathrm{lip}} \big| b^\star_i(\lambda^\star) - \widehat{b}_i(\widehat{\lambda}) \big| \\
&\le \frac{\|\ell\|_{\mathrm{lip}}}{t} \sum_{i=1}^t \Big( \big| b^\star_i(\lambda^\star) - b^\star_i(\widehat{\lambda}) \big| + \big| b^\star_i(\widehat{\lambda}) - \widehat{b}_i(\widehat{\lambda}) \big| \Big) \\
&\le \frac{\|\ell\|_{\mathrm{lip}}}{t} \sum_{i=1}^t \Big( L^{(i)}_{\lambda^\star}\, \eta_\lambda + \eta_b \Big) \ \le\ \|\ell\|_{\mathrm{lip}} \big( L_{\lambda^\star}\, \eta_\lambda + \eta_b \big) \ \le\ 2\delta_{\mathrm{eval}} . \qquad \blacksquare
\end{align*}
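Finally, the overall dual scheme analyzed above can be sketched end to end: an inner golden section search approximates $b^\star_i(\lambda)$ for each $i$, and an outer bisection drives $\rho t - \sum_i \widehat{b}_i(\lambda)$ to zero. The concave surrogates $S_i$ below are hypothetical stand-ins chosen so that the bisection has an interior solution; the sketch mirrors the structure of Algorithm 2, not its exact implementation.

```python
# A minimal end-to-end sketch of the dual bisection analyzed above: for each i,
# an inner golden section search returns an approximate maximizer b_hat_i(lam)
# of S_i(b) - lam * b, and an outer bisection drives rho*t - sum_i b_hat_i(lam)
# to zero. The concave surrogates S_i below are hypothetical stand-ins.
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0

def gss_argmax(f, lo, hi, eta):
    x1, x2 = hi - INV_PHI * (hi - lo), lo + INV_PHI * (hi - lo)
    f1, f2 = f(x1), f(x2)
    while hi - lo > eta:
        if f1 < f2:
            lo, x1, f1 = x1, x2, f2
            x2 = lo + INV_PHI * (hi - lo); f2 = f(x2)
        else:
            hi, x2, f2 = x2, x1, f1
            x1 = hi - INV_PHI * (hi - lo); f1 = f(x1)
    return 0.5 * (lo + hi)

t, rho = 5, 0.8
slopes = [0.5 + 0.2 * i for i in range(t)]          # S_i'(0), all <= ||l||_lip
lip = max(slopes)
S = [lambda b, a=a: a * (1.0 - math.exp(-b)) for a in slopes]  # concave, increasing

def b_hat(i, lam, b_max=20.0, eta=1e-6):
    # approximate b*_i(lam) = argmax_{b >= 0} S_i(b) - lam * b, as in Algorithm 3
    return gss_argmax(lambda b: S[i](b) - lam * b, 0.0, b_max, eta)

lo, hi = 0.0, lip                                    # dual range [0, ||l||_lip]
while hi - lo > 1e-8:
    lam = 0.5 * (lo + hi)
    excess = rho * t - sum(b_hat(i, lam) for i in range(t))
    if excess > 0:
        hi = lam   # budgets too small: decrease lam to spend more
    else:
        lo = lam
lam_hat = 0.5 * (lo + hi)
print(f"lam_hat = {lam_hat:.6f}, total budget = "
      f"{sum(b_hat(i, lam_hat) for i in range(t)):.4f} vs rho*t = {rho * t:.4f}")
```

The sign test in the loop is the decision rule analyzed in (24): since each $b^\star_i(\cdot)$ is non-increasing in $\lambda$, a positive excess indicates that the dual candidate is too large.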
