On the Efficiency of Entropic Regularized Algorithms for Optimal Transport

Tianyi Lin⋄, Nhat Ho⋆, Michael I. Jordan⋄,†

⋄ Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
† Department of Statistics, University of California, Berkeley
⋆ Department of Statistics and Data Sciences, University of Texas, Austin

May 19, 2022

Abstract

We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as Greenkhorn, from $\widetilde{O}(n^2\varepsilon^{-3})$ to $\widetilde{O}(n^2\varepsilon^{-2})$. Notably, our result matches the best known complexity bound of Sinkhorn and helps clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates, as observed by Altschuler et al. [2017]. Second, we propose a new algorithm, which we refer to as APDAMD and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm [Dvurechensky et al., 2018] with a prespecified mirror mapping $\phi$. We prove that APDAMD achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$, in which $\delta > 0$ stands for the regularity of $\phi$. In addition, we show by a counterexample that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD before is invalid, and give a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a deterministic accelerated variant of Sinkhorn via appeal to an estimated sequence and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that the accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$, and APDAGD and accelerated alternating minimization (AAM) [Guminov et al., 2021] in terms of $n$.
Finally, we conduct experiments on synthetic and real data, and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice.

1 Introduction

From its origins in the seminal works by Monge [1781] and Kantorovich [1942], respectively in the eighteenth and twentieth centuries, and through to the present day, the optimal transport (OT) problem has played a determinative role in the theory of mathematics [Villani, 2009]. It has also found a wide range of applications in problem domains beyond the original setting in logistics. In the current era, the strong and increasing linkage between optimization and machine learning has brought new applications of OT to the fore; see, e.g., Nguyen [2013], Cuturi and Doucet [2014], Srivastava et al. [2015], Rolet et al. [2016], Peyré et al. [2016], Nguyen [2016], Carrière et al. [2017], Arjovsky et al. [2017], Gulrajani et al. [2017], Courty et al. [2017], Srivastava et al. [2018], Dvurechenskii et al. [2018], Tolstikhin et al. [2018], Sommerfeld et al. [2019], Lin et al. [2019b], Ho et al. [2019]. In these data-driven applications, the focus is on the probability distributions underlying the OT formulation; indeed, these distributions are either empirical distributions obtained by placing unit masses at data points, or probability models of a putative underlying data-generating process. The OT problem accordingly often has a direct inferential meaning, as the definition of an estimator [Dudley, 1969, Fournier and Guillin, 2015, Weed and Bach, 2019, Lei, 2020], the definition of a likelihood [Sommerfeld and Munk, 2018, Bernton et al., 2019, Blanchet and Murthy, 2019], or as the robust variant of an estimator [Blanchet et al., 2019, Paty and Cuturi, 2019, Balaji et al., 2020].
The key challenge is computational [Peyré and Cuturi, 2019]. Indeed, the underlying distributions generally involve high-dimensional data and complex models in machine learning (ML) applications. We study the OT problem in a discrete setting where we assume that the target and source probability distributions each have at most $n$ atoms. In this setting, the OT problem can be solved exactly using linear programming (LP) solvers based on specialized interior-point methods [Pele and Werman, 2009, Lee and Sidford, 2014, van den Brand et al., 2021], reflecting the LP formulation of the OT problem. In this context, van den Brand et al. [2021] have provided a family of randomized interior-point algorithms with improved runtimes for solving linear programs with two-sided constraints, leading to a new OT algorithm based on Laplacian system solvers that achieves the best known complexity bound of $\widetilde{O}(n^2)$. However, it does not provide an effective solution to large-scale machine learning problems in practice, since efficient implementations of the Laplacian approach are as yet unknown. Furthermore, many combinatorial techniques give exact algorithms for the OT problem. Indeed, the Hungarian algorithm [Kuhn, 1955, 1956, Munkres, 1957] solves the assignment problem in $O(n^3)$ time, while there are several combinatorial algorithms that can solve the OT problem exactly in $\widetilde{O}(n^{2.5})$ time [Gabow and Tarjan, 1991, Orlin and Ahuja, 1992]. Combined with the scaling technique, the network simplex algorithms [Orlin et al., 1993, Orlin, 1997] can be used to solve the OT problem exactly in $\widetilde{O}(n^3)$ time, and Lahn et al. [2019] have recently developed a faster approximation algorithm for the OT problem via appeal to a modification of the algorithm developed in Gabow and Tarjan [1991].
However, computing the OT problem exactly results in an output that is not differentiable with respect to the measures' locations or weights [Bertsimas and Tsitsiklis, 1997]. Moreover, OT suffers from the curse of dimensionality [Dudley, 1969, Fournier and Guillin, 2015] and is thus likely to be meaningless when used on samples from high-dimensional densities. An alternative way to solve the OT problem is via a class of approximation algorithms based on entropy regularization, which had been investigated in optimization and transportation science long before [Sinkhorn, 1974, Schneider and Zenios, 1990, Kalantari and Khachiyan, 1996, Knight, 2008, Kalantari et al., 2008, Chakrabarty and Khanna, 2018]. It was Cuturi [2013] who popularized the use of entropy regularization for OT in the machine learning community and thereby initiated a productive line of research in which an entropic regularization was imposed to approximate the non-negativity constraints in the original OT problem. The resulting problem is referred to as entropic regularized OT, and the corresponding class of approximation algorithms are called entropic regularized algorithms. It is worth mentioning that entropic regularized OT has many favorable properties that OT does not enjoy, motivating us to study the computational efficiency of entropic regularized algorithms in this paper. More specifically, from a statistical point of view, entropic regularized OT enjoys significantly better sample complexity that is polynomial in the dimension [Genevay et al., 2019, Mena and Niles-Weed, 2019, Chizat et al., 2020], demonstrating that adding an entropy regularization reduces the curse of dimensionality.
Even from a computational point of view, such regularization in OT leads to Sinkhorn, which attains the first near-linear-time guarantee for the OT problem [Cuturi, 2013, Altschuler et al., 2017, Dvurechensky et al., 2018], and also makes the problem differentiable with respect to the distributions [Feydy et al., 2019]; hence, the entropic regularized algorithms are more easily applicable to deep learning applications [Courty et al., 2017, Cuturi et al., 2019, Balaji et al., 2020], as opposed to combinatorial algorithms. This point was highlighted in Dong et al. [2020] and further necessitated the development of faster entropic regularized algorithms. In this regard, the greedy variant of Sinkhorn, Greenkhorn, was proposed and shown to outperform Sinkhorn empirically [Altschuler et al., 2017]. However, a sizable gap exists here, since the best known complexity bound of $\widetilde{O}(n^2\varepsilon^{-3})$ for Greenkhorn [Altschuler et al., 2017] is worse than that of $\widetilde{O}(n^2\varepsilon^{-2})$ for Sinkhorn [Dvurechensky et al., 2018]. Further progress has been made by adapting first-order optimization algorithms for the OT problem [Cuturi and Peyré, 2016, Genevay et al., 2016, Blondel et al., 2018, Dvurechensky et al., 2018, Altschuler et al., 2019, Guo et al., 2020, Guminov et al., 2021]. Among these approaches, two representatives are an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm [Dvurechensky et al., 2018] with the claimed complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ and an accelerated alternating minimization (AAM) algorithm [Guminov et al., 2021] with the complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$. Moreover, there are several second-order optimization algorithms [Allen-Zhu et al., 2017, Cohen et al., 2017] which have been adapted for the OT problem [Blanchet et al.
, 2018, Quanrud, 2019] and are guaranteed to achieve the improved complexity bound of $\widetilde{O}(n^2\varepsilon^{-1})$. However, the aforementioned second-order algorithms do not provide effective solutions to large-scale machine learning problems due to the lack of efficient implementations in practice.

Contributions. Given the advantages of entropic regularization in OT, we focus in this paper on the computational efficiency of a class of entropic regularized algorithms for the OT problem, and our theoretical analysis leads to several improvements over the state-of-the-art algorithms in the literature. We summarize the contributions as follows:

1. We improve the complexity bound of Greenkhorn from $\widetilde{O}(n^2\varepsilon^{-3})$ to $\widetilde{O}(n^2\varepsilon^{-2})$, which matches the best existing bound for Sinkhorn. The proof techniques are new and different from those used in Dvurechensky et al. [2018] for analyzing Sinkhorn. In particular, Greenkhorn only updates a single row or column at each iteration, and quantifying the per-iteration progress is more difficult than the corresponding measurement for Sinkhorn.

2. We propose an adaptive primal-dual accelerated mirror descent (APDAMD) algorithm which generalizes APDAGD with a prespecified mirror mapping $\phi$, and prove that APDAMD achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$, where $\delta > 0$ refers to the regularity of $\phi$ with respect to the $\ell_\infty$-norm. We show by a counterexample that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD [Dvurechensky et al., 2018] is invalid, and give a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$, which is worse than the claimed bound in terms of $n$.

3. We propose a deterministic accelerated variant of Sinkhorn via appeal to an estimated sequence and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$.
In particular, accelerated Sinkhorn consists of an exact minimization for the main iterates, accompanied by another sequence of iterates based on coordinate gradient updates and monotone search. Our results show that accelerated Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$, and APDAGD and AAM in terms of $n$.

We note that a preliminary version with only the analysis for Greenkhorn and APDAMD has been accepted by ICML [Lin et al., 2019a]. After our conference paper was published, some new algorithms were developed for solving the OT problem [Jambulapati et al., 2019, Lahn et al., 2019]. In particular, Jambulapati et al. [2019] developed a dual extrapolation algorithm with the complexity bound $\widetilde{O}(n^2\varepsilon^{-1})$ using an area-convex mapping [Sherman, 2017]. Despite the theoretically sound complexity bound, the lack of simplicity and ease of implementation makes this algorithm less competitive with Sinkhorn and Greenkhorn, which remain the baseline solution methods in practice [Flamary and Courty, 2017]. Different from the algorithm in Jambulapati et al. [2019], the combinatorial algorithm in Lahn et al. [2019] is a practical solution method for the OT problem. It is worth mentioning that the algorithm in Lahn et al. [2019] and other combinatorial algorithms, e.g., the Hungarian algorithm, outperform our algorithms in practice. This is consistent with the observation in Dong et al. [2020], who pointed out that combinatorial algorithms can outperform entropic regularized algorithms in speed even though the latter are asymptotically faster for OT (i.e., the case of large $n$). However, we believe our results are still valuable due to the importance of entropic regularized algorithms as mentioned before.

Organization. The remainder of the paper is organized as follows.
In Section 2, we present the basic setup for the primal and dual forms of the entropic regularized OT problem. In Section 3, we provide the complexity analysis for Greenkhorn. In Section 4, we propose APDAMD for solving entropic regularized OT and provide several results on the complexity bounds of APDAGD and APDAMD. In Section 5, we propose and analyze an accelerated variant of Sinkhorn. In Section 6, we conduct experiments on synthetic and real data, and the numerical results show the efficiency of our algorithms. We conclude the paper in Section 7.

Notation. For $n \geq 2$, we let $[n]$ be the set $\{1, 2, \ldots, n\}$ and $\mathbb{R}^n_+$ be the set of all vectors in $\mathbb{R}^n$ with non-negative coordinates. The notation $\Delta^n = \{v \in \mathbb{R}^n_+ : \sum_{i=1}^n v_i = 1\}$ stands for the probability simplex in $n-1$ dimensions. For a vector $x \in \mathbb{R}^n$ and $1 \leq p < +\infty$, the notation $\|x\|_p$ stands for the $\ell_p$-norm and $\|x\|$ indicates the $\ell_2$-norm. $\mathrm{diag}(x)$ is the diagonal matrix with the vector $x$ on its diagonal. $\mathbf{1}_n$ and $\mathbf{0}_n$ are the $n$-dimensional vectors with all components equal to 1 and 0, respectively. For a matrix $A \in \mathbb{R}^{n \times n}$, we denote by $\mathrm{vec}(A)$ the vector in $\mathbb{R}^{n^2}$ obtained by concatenating the rows and columns of $A$. The notation $\|A\|_{1 \to 1}$ stands for $\sup_{\|x\|_1 = 1} \|Ax\|_1$, and the notations $r(A) = A\mathbf{1}_n$ and $c(A) = A^\top \mathbf{1}_n$ stand for the row and column sums, respectively. For a function $f$, the notation $\nabla_x f$ denotes the partial derivative with respect to $x$. For the dimension $n$ and tolerance $\varepsilon > 0$, the notations $a = O(b(n, \varepsilon))$ and $a = \Omega(b(n, \varepsilon))$ indicate that $a \leq C_1 \cdot b(n, \varepsilon)$ and $a \geq C_2 \cdot b(n, \varepsilon)$, respectively, where $C_1$ and $C_2$ are independent of $n$ and $\varepsilon$. We also write $a = \Theta(b(n, \varepsilon))$ iff $a = O(b(n, \varepsilon))$ and $a = \Omega(b(n, \varepsilon))$. Similarly, we write $a = \widetilde{O}(b(n, \varepsilon))$ to indicate the previous inequality in which $C_1$ depends on some logarithmic function of $n$ and $\varepsilon$.
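As a quick illustration of the marginal notation used throughout, the row sums $r(A)$, the column sums $c(A)$, and the $\|\cdot\|_{1 \to 1}$ norm can be computed directly in numpy. This is a small sketch of ours; the helper names simply mirror the paper's notation:

```python
import numpy as np

def r(A):
    """Row sums r(A) = A @ 1_n."""
    return A @ np.ones(A.shape[1])

def c(A):
    """Column sums c(A) = A.T @ 1_n."""
    return A.T @ np.ones(A.shape[0])

def norm_1_to_1(A):
    """sup over ||x||_1 = 1 of ||A x||_1; the supremum is attained at a
    standard basis vector, so it equals the maximum column l1-norm."""
    return np.abs(A).sum(axis=0).max()
```

For example, with $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, `r(A)` is $(3, 7)$, `c(A)` is $(4, 6)$, and `norm_1_to_1(A)` is $6$.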
2 Problem Setup

In this section, we first present the linear programming (LP) representation of the optimal transport (OT) problem as well as a specification of an approximate transportation plan. We also present an entropic regularized variant of the OT problem and derive its dual form, in which the objective function takes the form of the logarithm of a sum of exponents. Finally, we establish several properties of that dual form which are useful for the subsequent analysis.

2.1 Linear programming representation

According to Kantorovich [1942], the problem of approximating the OT distance is equivalent to solving the following linear programming (LP) problem:
$$\min_{X \in \mathbb{R}^{n \times n}} \langle C, X \rangle \quad \text{s.t.} \quad X\mathbf{1}_n = r, \; X^\top \mathbf{1}_n = c, \; X \geq 0. \tag{1}$$
In the above formulation, $X$ refers to the transportation plan, $C = (C_{ij}) \in \mathbb{R}^{n \times n}_+$ stands for a cost matrix with non-negative components, and $r \in \mathbb{R}^n$ and $c \in \mathbb{R}^n$ are two probability distributions in the simplex $\Delta^n$. We see from Eq. (1) that the OT problem is an LP with $2n$ equality constraints and $n^2$ variables and can be solved by the interior-point method; however, this method performs poorly on large-scale problems due to its high per-iteration computational cost. In general, the solution that the algorithms aim at achieving is an $\varepsilon$-approximate transportation plan $\widehat{X} \in \mathbb{R}^{n \times n}_+$ satisfying the marginal distribution constraints $\widehat{X}\mathbf{1}_n = r$ and $\widehat{X}^\top \mathbf{1}_n = c$ and the inequality given by
$$\langle C, \widehat{X} \rangle \leq \langle C, X^\star \rangle + \varepsilon.$$
Here $X^\star$ is defined as an optimal transportation plan for the OT problem. For simplicity, we respectively call $\langle C, \widehat{X} \rangle$ an $\varepsilon$-approximate transportation cost and $\widehat{X}$ an $\varepsilon$-approximate transportation plan for the original problem. Formally, we have the following definition of an $\varepsilon$-approximate transportation plan.

Definition 1.
The matrix $\widehat{X} \in \mathbb{R}^{n \times n}_+$ is called an $\varepsilon$-approximate transportation plan if $\widehat{X}\mathbf{1}_n = r$, $\widehat{X}^\top \mathbf{1}_n = c$, and the following inequality holds true:
$$\langle C, \widehat{X} \rangle \leq \langle C, X^\star \rangle + \varepsilon,$$
where $X^\star$ is defined as an optimal transportation plan for the OT problem.

With this definition in mind, the goal of this paper is to study the OT problem from a computational point of view. Indeed, we hope to derive improved complexity bounds for the current state-of-the-art algorithms and to seek new practical algorithms whose running time required to obtain an $\varepsilon$-approximate transportation plan has better dependence on $1/\varepsilon$ than the benchmark algorithms in the literature. The aforementioned new algorithms are favorable in machine learning applications where high precision (small $\varepsilon$) is necessary.

2.2 Entropic regularized OT and its dual form

Seeking another formulation for the OT distance that is more amenable to computationally efficient algorithms, Cuturi [2013] proposed to solve an entropic regularized version of the OT problem in Eq. (1), which is given by
$$\min_{X \in \mathbb{R}^{n \times n}} \langle C, X \rangle - \eta H(X), \quad \text{s.t.} \quad X\mathbf{1}_n = r, \; X^\top \mathbf{1}_n = c, \tag{2}$$
where $\eta > 0$ denotes the regularization parameter and $H(X)$ denotes the entropic regularization term, which is given by
$$H(X) := -\langle X, \log(X) - \mathbf{1}_{n \times n} \rangle.$$
Note that the optimal solution of the entropic regularized OT problem exists, since the objective function $\langle C, X \rangle - \eta H(X)$ is continuous and the feasible region $\{X \in \mathbb{R}^{n \times n} : X \geq 0, X\mathbf{1}_n = r, X^\top \mathbf{1}_n = c\}$ is compact. Furthermore, the optimal solution is also unique, since the objective function $\langle C, X \rangle - \eta H(X)$ is strongly convex over the feasible region with respect to the $\ell_1$-norm. However, the optimal value of the entropic regularized OT problem (cf. Eq. (2)) yields a poor approximation to the unregularized OT problem if $\eta$ is large.
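For small $n$, the LP in Eq. (1) can be solved exactly with an off-the-shelf solver, which is useful as a ground truth when checking approximate plans. The sketch below uses scipy's `linprog` (HiGHS backend); the helper name `ot_lp` and the row-major vectorization of $X$ are our own choices, and one redundant equality constraint is dropped since both constraint sets sum to 1:

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(C, r, c):
    """Solve the OT linear program in Eq. (1) exactly for small n."""
    n = C.shape[0]
    # Constraints act on x = vec(X) (row-major): X 1_n = r and X^T 1_n = c.
    A_rows = np.kron(np.eye(n), np.ones((1, n)))   # row-sum constraints
    A_cols = np.kron(np.ones((1, n)), np.eye(n))   # column-sum constraints
    A_eq = np.vstack([A_rows, A_cols[:-1]])        # last constraint is redundant
    b_eq = np.concatenate([r, c[:-1]])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.x.reshape(n, n), res.fun
```

For instance, with $C = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ and $r = c = (1/2, 1/2)$, the optimal plan puts all mass on the diagonal and the optimal cost $\langle C, X^\star \rangle$ is $0$.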
An additional issue of entropic regularization is that the sparsity of the solution is lost. Even though an $\varepsilon$-approximate transportation plan can be found efficiently, it is not clear how different the sparsity pattern of this solution is from that of the solution of the actual OT problem. In contrast, the actual OT distance suffers from the curse of dimensionality [Dudley, 1969, Fournier and Guillin, 2015, Weed and Bach, 2019] and is significantly worse than its entropic regularized version in terms of sample complexity [Genevay et al., 2019, Mena and Niles-Weed, 2019, Chizat et al., 2020]. While there is an ongoing debate in the literature on the merits of solving the OT problem versus its entropic regularized version, we adopt here the viewpoint that reaching an additive approximation of the actual OT cost matters, and we therefore propose to scale $\eta$ as a function of the desired accuracy of the approximation.

We now proceed to derive the dual form of the entropic regularized OT problem in Eq. (2) and show that it is an unconstrained smooth optimization problem. Introducing the dual variables $\alpha, \beta \in \mathbb{R}^n$, we define the Lagrangian function of the entropic regularized OT problem as follows:
$$\mathcal{L}(X, \alpha, \beta) = \langle C, X \rangle - \eta H(X) - \alpha^\top (X\mathbf{1}_n - r) - \beta^\top (X^\top \mathbf{1}_n - c). \tag{3}$$
In order to derive the smooth dual objective function, we consider the following minimization problem:
$$\min_{X : \|X\|_1 = 1} \; \langle C, X \rangle - \eta H(X) - \alpha^\top (X\mathbf{1}_n - r) - \beta^\top (X^\top \mathbf{1}_n - c).$$
The above objective function is strongly convex over the domain $\{X \in \mathbb{R}^{n \times n}_+ \mid \|X\|_1 = 1\}$; thus, the optimal solution is unique. After simple calculations, the optimal solution $\bar{X} = X(\alpha, \beta)$ has the following form:
$$\bar{X}_{ij} = \frac{e^{\eta^{-1}(\alpha_i + \beta_j - C_{ij})}}{\sum_{1 \leq i,j \leq n} e^{\eta^{-1}(\alpha_i + \beta_j - C_{ij})}}. \tag{4}$$
Plugging Eq. (4) into Eq.
(3) yields that the dual form is
$$\max_{\alpha, \beta} \; -\eta \log\Big( \sum_{1 \leq i,j \leq n} e^{\eta^{-1}(\alpha_i + \beta_j - C_{ij})} \Big) + \alpha^\top r + \beta^\top c.$$
In order to streamline our presentation, we perform a change of variables, $u = \eta^{-1}\alpha$ and $v = \eta^{-1}\beta$, and reformulate the above problem as
$$\min_{u, v} \; \varphi(u, v) := \log\Big( \sum_{1 \leq i,j \leq n} e^{u_i + v_j - C_{ij}/\eta} \Big) - u^\top r - v^\top c.$$
To further simplify the notation, we define $B(u, v) := (B_{ij})_{1 \leq i,j \leq n} \in \mathbb{R}^{n \times n}$ by $B_{ij} = e^{u_i + v_j - C_{ij}/\eta}$. With this notation, we obtain the dual entropic regularized OT problem, defined by
$$\min_{u, v} \; \varphi(u, v) := \log(\|B(u, v)\|_1) - u^\top r - v^\top c. \tag{5}$$

Remark 2.1. The first part of the objective function $\varphi$ is in the form of the logarithm of a sum of exponents, while the second part is a linear function. This is different from the objective function used in the previous dual entropic regularized OT problem [Cuturi, 2013, Altschuler et al., 2017, Dvurechensky et al., 2018, Lin et al., 2019a]. Notably, Eq. (5) is a special instance of a softmax minimization problem, and the objective function $\varphi$ is known to be smooth [Nesterov, 2005]. Finally, we point out that the same formulation has been derived in Guminov et al. [2021] for analyzing AAM.

In the remainder of the paper, we also denote by $(u^\star, v^\star) \in \mathbb{R}^{2n}$ an optimal solution of the dual entropic regularized OT problem in Eq. (5).

2.3 Properties of dual entropic regularized OT

We present several useful properties of the dual entropic regularized OT problem in Eq. (5). In particular, we show that there exists an optimal solution $(u^\star, v^\star) \in \mathbb{R}^{2n}$ that admits an upper bound in terms of the $\ell_\infty$-norm.

Lemma 2.2. For the dual entropic regularized OT problem in Eq. (5), there exists an optimal solution $(u^\star, v^\star)$ such that
$$\|u^\star\|_\infty \leq R, \quad \|v^\star\|_\infty \leq R,$$
where $R := \eta^{-1}\|C\|_\infty + \log(n) - \log(\min_{1 \leq i,j \leq n}\{r_i, c_j\})$ depends on $C$, $r$ and $c$.

Proof.
First, we claim that there exists an optimal solution $(u^\star, v^\star)$ such that
$$\max_{1 \leq i \leq n} u^\star_i \geq 0 \geq \min_{1 \leq i \leq n} u^\star_i, \qquad \max_{1 \leq i \leq n} v^\star_i \geq 0 \geq \min_{1 \leq i \leq n} v^\star_i. \tag{6}$$
Indeed, letting $(\widehat{u}^\star, \widehat{v}^\star)$ be an optimal solution to Eq. (5), the claim holds true if $(\widehat{u}^\star, \widehat{v}^\star)$ satisfies Eq. (6). Otherwise, we define the shift terms given by
$$\widehat{\Delta}_u = \frac{\max_{1 \leq i \leq n} \widehat{u}^\star_i + \min_{1 \leq i \leq n} \widehat{u}^\star_i}{2}, \qquad \widehat{\Delta}_v = \frac{\max_{1 \leq i \leq n} \widehat{v}^\star_i + \min_{1 \leq i \leq n} \widehat{v}^\star_i}{2},$$
and define $(u^\star, v^\star)$ by $u^\star = \widehat{u}^\star - \widehat{\Delta}_u \mathbf{1}_n$ and $v^\star = \widehat{v}^\star - \widehat{\Delta}_v \mathbf{1}_n$. By definition, $(u^\star, v^\star)$ satisfies Eq. (6). Since $\mathbf{1}_n^\top r = \mathbf{1}_n^\top c = 1$, we have $(u^\star)^\top r = (\widehat{u}^\star)^\top r - \widehat{\Delta}_u$ and $(v^\star)^\top c = (\widehat{v}^\star)^\top c - \widehat{\Delta}_v$. In addition,
$$\log(\|B(u^\star, v^\star)\|_1) = \log(\|B(\widehat{u}^\star, \widehat{v}^\star)\|_1) - \widehat{\Delta}_u - \widehat{\Delta}_v.$$
Putting these pieces together yields $\varphi(u^\star, v^\star) = \varphi(\widehat{u}^\star, \widehat{v}^\star)$. Therefore, $(u^\star, v^\star)$ is an optimal solution of the dual entropic regularized OT problem that satisfies Eq. (6).

Then, we show that
$$\max_{1 \leq i \leq n} u^\star_i - \min_{1 \leq i \leq n} u^\star_i \leq \frac{\|C\|_\infty}{\eta} - \log\big(\min_{1 \leq i,j \leq n}\{r_i, c_j\}\big), \tag{7}$$
$$\max_{1 \leq i \leq n} v^\star_i - \min_{1 \leq i \leq n} v^\star_i \leq \frac{\|C\|_\infty}{\eta} - \log\big(\min_{1 \leq i,j \leq n}\{r_i, c_j\}\big). \tag{8}$$
Indeed, we derive from the optimality condition of $(u^\star, v^\star)$ that
$$\frac{e^{u^\star_i}\big(\sum_{j=1}^n e^{v^\star_j - \eta^{-1}C_{ij}}\big)}{\|B(u^\star, v^\star)\|_1} = r_i, \quad \text{for all } i \in [n].$$
Since $C_{ij} \geq 0$ for all $1 \leq i, j \leq n$ and $r_i \geq \min_{1 \leq i,j \leq n}\{r_i, c_j\}$ for all $1 \leq i \leq n$, we have
$$u^\star_i \geq \log\big(\min_{1 \leq i,j \leq n}\{r_i, c_j\}\big) - \log\Big(\sum_{j=1}^n e^{v^\star_j}\Big) + \log(\|B(u^\star, v^\star)\|_1), \quad \text{for all } i \in [n].$$
Since $0 < r_i \leq 1$ and $C_{ij} \leq \|C\|_\infty$, we have
$$u^\star_i \leq \frac{\|C\|_\infty}{\eta} - \log\Big(\sum_{j=1}^n e^{v^\star_j}\Big) + \log(\|B(u^\star, v^\star)\|_1), \quad \text{for all } i \in [n].$$
Putting these pieces together yields Eq. (7). By a similar argument, we can prove Eq. (8). Finally, we prove our main results. Indeed, Eq. (6) and Eq.
(7) imply that
$$-\frac{\|C\|_\infty}{\eta} + \log\big(\min_{1 \leq i,j \leq n}\{r_i, c_j\}\big) \leq \min_{1 \leq i \leq n} u^\star_i \leq 0,$$
and
$$0 \leq \max_{1 \leq i \leq n} u^\star_i \leq \frac{\|C\|_\infty}{\eta} - \log\big(\min_{1 \leq i,j \leq n}\{r_i, c_j\}\big).$$
Combining the above two inequalities with the definition of $R$ implies that $\|u^\star\|_\infty \leq R$. By a similar argument, we can prove that $\|v^\star\|_\infty \leq R$. As a consequence, we obtain the conclusion of the lemma.

The upper bound for the $\ell_\infty$-norm of an optimal solution of the dual entropic regularized OT problem in Lemma 2.2 directly leads to the following bound for the $\ell_2$-norm.

Corollary 2.3. For the dual entropic regularized OT problem in Eq. (5), there exists an optimal solution $(u^\star, v^\star)$ such that
$$\|u^\star\| \leq \sqrt{n}R, \quad \|v^\star\| \leq \sqrt{n}R,$$
where $R > 0$ is defined in Lemma 2.2.

Since the function $-H(X)$ is strongly convex with respect to the $\ell_1$-norm on the probability simplex $Q \subseteq \mathbb{R}^{n \times n}$, the entropic regularized OT problem in Eq. (2) is a special case of the following linearly constrained convex optimization problem:
$$\min_{x \in Q} f(x), \quad \text{s.t.} \quad Ax = b,$$
where $f$ is strongly convex with respect to the $\ell_1$-norm on the set $Q$:
$$f(x') - f(x) - (x' - x)^\top \nabla f(x) \geq \frac{\eta}{2}\|x' - x\|_1^2 \quad \text{for any } x', x \in Q.$$
By Nesterov [2005, Theorem 1] with the $\ell_2$-norm for the dual space of the Lagrange multipliers, the dual objective function $\widetilde{\varphi}$ satisfies the following inequality:
$$\widetilde{\varphi}(\alpha', \beta') - \widetilde{\varphi}(\alpha, \beta) - \begin{pmatrix} \alpha' - \alpha \\ \beta' - \beta \end{pmatrix}^\top \nabla \widetilde{\varphi}(\alpha, \beta) \leq \frac{\|A\|_{1 \to 2}^2}{2\eta} \left\| \begin{pmatrix} \alpha' - \alpha \\ \beta' - \beta \end{pmatrix} \right\|^2$$
for any $(\alpha', \beta'), (\alpha, \beta) \in \mathbb{R}^{2n}$.

Algorithm 1: Greenkhorn$(C, \eta, r, c, \varepsilon')$
Input: $t = 0$ and $u^0 = v^0 = \mathbf{0}_n$.
while $E_t > \varepsilon'$ do
  Compute $I = \mathrm{argmax}_{1 \leq i \leq n} \, \rho(r_i, r_i(B(u^t, v^t)))$, where $\rho(a, b) = b - a + a\log(a/b)$ and $(B(u^t, v^t))_{i'j'} = e^{u^t_{i'} + v^t_{j'} - C_{i'j'}/\eta}$ for all $(i', j')$.
  Compute $J = \mathrm{argmax}_{1 \leq j \leq n} \, \rho(c_j, c_j(B(u^t, v^t)))$.
  if $\rho(r_I, r_I(B(u^t, v^t))) > \rho(c_J, c_J(B(u^t, v^t)))$ then
    $u^{t+1}_I = u^t_I + \log(r_I) - \log(r_I(B(u^t, v^t)))$.
  else
    $v^{t+1}_J = v^t_J + \log(c_J) - \log(c_J(B(u^t, v^t)))$.
  end if
  Increment $t = t + 1$.
end while
Output: $B(u^t, v^t)$.

Recall that the function $\widetilde{\varphi}$ is given by
$$\widetilde{\varphi}(\alpha, \beta) = -\eta \log\Big( \sum_{1 \leq i,j \leq n} e^{\eta^{-1}(\alpha_i + \beta_j - C_{ij})} \Big) + \alpha^\top r + \beta^\top c. \tag{9}$$
We notice that the function $\varphi$ in Eq. (5) satisfies $\varphi(u, v) = -\eta^{-1}\widetilde{\varphi}(\eta u, \eta v)$. After some simple calculations, we have
$$\varphi(u', v') - \varphi(u, v) - \begin{pmatrix} u' - u \\ v' - v \end{pmatrix}^\top \nabla \varphi(u, v) \leq \frac{\|A\|_{1 \to 2}^2}{2} \left\| \begin{pmatrix} u' - u \\ v' - v \end{pmatrix} \right\|^2. \tag{10}$$
In the entropic regularized OT problem, each column of the matrix $A$ contains no more than two nonzero elements, which are equal to one. Since $\|A\|_{1 \to 2}$ is equal to the maximum $\ell_2$-norm of the columns of this matrix, we have $\|A\|_{1 \to 2} = \sqrt{2}$. Thus, the dual objective function $\varphi$ is 2-gradient Lipschitz with respect to the $\ell_2$-norm.

3 Greenkhorn

In this section, we present a complexity analysis for Greenkhorn. In particular, we improve the existing best known complexity bound $O(n^2\|C\|_\infty^3 \log(n)\varepsilon^{-3})$ [Altschuler et al., 2017] to $O(n^2\|C\|_\infty^2 \log(n)\varepsilon^{-2})$, which matches the current state-of-the-art complexity bound for Sinkhorn [Dvurechensky et al., 2018].

To facilitate the subsequent discussion, we present the pseudocode of Greenkhorn in Algorithm 1 and its application to regularized OT in Algorithm 2. The function for quantifying the progress in the dual objective value between two consecutive iterates is given by $\rho(a, b) = b - a + a\log(a/b)$, and we recall that $(u, v)$ is an optimal solution of the dual entropic regularized OT problem in Eq. (5) if $r(B(u, v)) - r = \mathbf{0}_n$ and $c(B(u, v)) - c = \mathbf{0}_n$.
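Algorithm 1 translates almost line for line into numpy. The sketch below is ours and recomputes $B(u, v)$ densely at every iteration for clarity, which costs $O(n^2)$ per iteration; the $O(n)$-per-iteration implementation mentioned in the text instead maintains the row and column sums incrementally. The stopping criterion is the $\ell_1$ marginal error $E_t$:

```python
import numpy as np

def rho(a, b):
    """Progress measure rho(a, b) = b - a + a*log(a/b) from Algorithm 1."""
    return b - a + a * np.log(a / b)

def greenkhorn(C, eta, r, c, eps_prime, max_iter=100_000):
    """Dense sketch of Algorithm 1: greedily fix the worst row or column."""
    n = C.shape[0]
    u, v = np.zeros(n), np.zeros(n)
    for _ in range(max_iter):
        B = np.exp(u[:, None] + v[None, :] - C / eta)
        rB, cB = B.sum(axis=1), B.sum(axis=0)        # r(B) and c(B)
        E_t = np.abs(rB - r).sum() + np.abs(cB - c).sum()
        if E_t <= eps_prime:
            break
        I = np.argmax(rho(r, rB))
        J = np.argmax(rho(c, cB))
        if rho(r[I], rB[I]) > rho(c[J], cB[J]):
            u[I] += np.log(r[I]) - np.log(rB[I])     # fix row I exactly
        else:
            v[J] += np.log(c[J]) - np.log(cB[J])     # fix column J exactly
    return np.exp(u[:, None] + v[None, :] - C / eta)
```

Each update makes the chosen row (or column) sum match its target marginal exactly, and the greedy index choice is what distinguishes this scheme from Sinkhorn's full row/column sweeps.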
This leads to the quantity which measures the error of the $t$-th iterate in Algorithm 1:
$$E_t := \|r(B(u^t, v^t)) - r\|_1 + \|c(B(u^t, v^t)) - c\|_1.$$
Both Sinkhorn and Greenkhorn can be interpreted as coordinate descent algorithms for minimizing the following convex function [Cuturi, 2013, Altschuler et al., 2017, Dvurechensky et al., 2018, Lin et al., 2019a]:
$$f(u, v) := \|B(u, v)\|_1 - u^\top r - v^\top c. \tag{11}$$

Algorithm 2: Approximating OT by Algorithm 1
Input: $\eta = \frac{\varepsilon}{4\log(n)}$ and $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$.
Step 1: Let $\widetilde{r} \in \Delta^n$ and $\widetilde{c} \in \Delta^n$ be defined by $(\widetilde{r}, \widetilde{c}) = (1 - \frac{\varepsilon'}{8})(r, c) + \frac{\varepsilon'}{8n}(\mathbf{1}_n, \mathbf{1}_n)$.
Step 2: Compute $\widetilde{X} = \text{Greenkhorn}(C, \eta, \widetilde{r}, \widetilde{c}, \frac{\varepsilon'}{2})$.
Step 3: Round $\widetilde{X}$ to $\widehat{X}$ using Altschuler et al. [2017, Algorithm 2] such that $\widehat{X}\mathbf{1}_n = r$ and $\widehat{X}^\top \mathbf{1}_n = c$.
Output: $\widehat{X}$.

Compared to the scheme of Sinkhorn, which consists of updates of all rows and columns, Algorithm 1 updates only one row or column at each step. As such, Algorithm 1 updates only $O(n)$ entries per iteration rather than $O(n^2)$ as in Sinkhorn. It is also worth mentioning that Algorithm 1 can be implemented such that each iteration runs in only $O(n)$ arithmetic operations [Altschuler et al., 2017]. Despite the cheap per-iteration computational cost, it is difficult to quantify the per-iteration progress of Algorithm 1, and the proof techniques for Sinkhorn in Dvurechensky et al. [2018] are not applicable here. This motivates us to investigate another proof strategy, which will be elaborated in the sequel.

3.1 Complexity analysis: bounding dual objective values

Given the definition of $E_t$, we first prove the following lemma, which yields an upper bound on the objective values of the iterates.

Lemma 3.1.
Letting $\{(u^t, v^t)\}_{t \geq 0}$ be the iterates generated by Algorithm 1, we have
$$f(u^t, v^t) - f(u^\star, v^\star) \leq 2E_t(\|u^\star\|_\infty + \|v^\star\|_\infty),$$
where $(u^\star, v^\star)$ is a point that minimizes $f(u, v) = \|B(u, v)\|_1 - u^\top r - v^\top c$.

Proof. By definition, we have
$$f(u, v) = \sum_{1 \leq i,j \leq n} e^{u_i + v_j - C_{ij}/\eta} - \sum_{i=1}^n u_i r_i - \sum_{j=1}^n v_j c_j.$$
Also by definition, we have $\nabla_u f(u^t, v^t) = B(u^t, v^t)\mathbf{1}_n - r$ and $\nabla_v f(u^t, v^t) = B(u^t, v^t)^\top \mathbf{1}_n - c$. Thus, we have $E_t = \|\nabla_u f(u^t, v^t)\|_1 + \|\nabla_v f(u^t, v^t)\|_1$. Since $f$ is convex and minimized at $(u^\star, v^\star)$, we have
$$f(u^t, v^t) - f(u^\star, v^\star) \leq (u^t - u^\star)^\top \nabla_u f(u^t, v^t) + (v^t - v^\star)^\top \nabla_v f(u^t, v^t).$$
Combining Hölder's inequality and the definition of $E_t$ yields
$$f(u^t, v^t) - f(u^\star, v^\star) \leq E_t(\|u^t - u^\star\|_\infty + \|v^t - v^\star\|_\infty). \tag{12}$$
Thus, it suffices to show that $\|u^t - u^\star\|_\infty + \|v^t - v^\star\|_\infty \leq 2\|u^\star\|_\infty + 2\|v^\star\|_\infty$.

The next result is the key observation that makes our analysis work for Greenkhorn. We use an induction argument to establish the following bound:
$$\max\{\|u^t - u^\star\|_\infty, \|v^t - v^\star\|_\infty\} \leq \max\{\|u^0 - u^\star\|_\infty, \|v^0 - v^\star\|_\infty\}. \tag{13}$$
It is clear that Eq. (13) holds true when $t = 0$. Supposing that the inequality holds true for $t \leq k_0$, we show that it also holds true for $t = k_0 + 1$. Without loss of generality, let $I$ be the row index chosen at the $(k_0+1)$-th iteration. Then
$$\|u^{k_0+1} - u^\star\|_\infty \leq \max\{\|u^{k_0} - u^\star\|_\infty, |u^{k_0+1}_I - u^\star_I|\}, \tag{14}$$
$$\|v^{k_0+1} - v^\star\|_\infty = \|v^{k_0} - v^\star\|_\infty. \tag{15}$$
By the updating formula for $u^{k_0+1}_I$ and the optimality condition for $u^\star_I$, we have
$$e^{u^{k_0+1}_I} = \frac{r_I}{\sum_{j=1}^n e^{-C_{Ij}/\eta + v^{k_0}_j}}, \qquad e^{u^\star_I} = \frac{r_I}{\sum_{j=1}^n e^{-C_{Ij}/\eta + v^\star_j}}.$$
Putting these pieces together with the inequality $\frac{\sum_{j=1}^n a_j}{\sum_{j=1}^n b_j} \le \max_{1 \le j \le n} \frac{a_j}{b_j}$ for all $a_j, b_j > 0$ yields
$$|u_I^{k_0+1} - u_I^\star| = \left|\log\left(\frac{\sum_{j=1}^n e^{-\eta^{-1} C_{Ij} + v_j^{k_0}}}{\sum_{j=1}^n e^{-\eta^{-1} C_{Ij} + v_j^\star}}\right)\right| \le \|v^{k_0} - v^\star\|_\infty. \tag{16}$$
Combining Eq. (14) and Eq. (16) yields
$$\|u^{k_0+1} - u^\star\|_\infty \le \max\{\|u^{k_0} - u^\star\|_\infty, \|v^{k_0} - v^\star\|_\infty\}. \tag{17}$$
Combining Eq. (15) and Eq. (17) further implies Eq. (13). This together with $u^0 = v^0 = \mathbf{0}_n$ implies
$$\|u^t - u^\star\|_\infty + \|v^t - v^\star\|_\infty \le 2\left(\|u^0 - u^\star\|_\infty + \|v^0 - v^\star\|_\infty\right) = 2\|u^\star\|_\infty + 2\|v^\star\|_\infty. \tag{18}$$
Putting Eq. (12) and Eq. (18) together yields the desired result. □

Our second lemma shows that at least one optimal solution $(u^\star, v^\star)$ of $f$ is bounded by $\eta^{-1}\|C\|_\infty + \log(n) - 2\log(\min_{1 \le i, j \le n}\{r_i, c_j\})$ in $\ell_\infty$-norm. This result is stronger than Dvurechensky et al. [2018, Lemma 1] and generalizes Blanchet et al. [2018, Lemma 10].

Lemma 3.2. There exists an optimal solution $(u^\star, v^\star)$ of the function $f$ defined in Eq. (11) such that the following inequality holds true:
$$\|u^\star\|_\infty \le R, \qquad \|v^\star\|_\infty \le R,$$
where $R := \eta^{-1}\|C\|_\infty + \log(n) - 2\log(\min_{1 \le i, j \le n}\{r_i, c_j\})$ depends on $C$, $r$ and $c$.

Proof. Using a similar argument as in Lemma 2.2, we can first show that there exists an optimal solution pair $(u^\star, v^\star)$ such that
$$\max_{1 \le i \le n} u_i^\star \ge 0 \ge \min_{1 \le i \le n} u_i^\star \tag{19}$$
(but not necessarily for $v^\star$ simultaneously). Then, we proceed to establish the bounds that are analogous to Eqs. (7) and (8):
$$\max_{1 \le i \le n} u_i^\star - \min_{1 \le i \le n} u_i^\star \le \frac{\|C\|_\infty}{\eta} - \log\left(\min_{1 \le i, j \le n}\{r_i, c_j\}\right), \tag{20}$$
$$\max_{1 \le i \le n} v_i^\star - \min_{1 \le i \le n} v_i^\star \le \frac{\|C\|_\infty}{\eta} - \log\left(\min_{1 \le i, j \le n}\{r_i, c_j\}\right). \tag{21}$$
Indeed, for each $1 \le i \le n$, we have
$$e^{-\eta^{-1}\|C\|_\infty + u_i^\star} \sum_{j=1}^n e^{v_j^\star} \le \sum_{j=1}^n e^{-\eta^{-1} C_{ij} + u_i^\star + v_j^\star} = [B(u^\star, v^\star)\mathbf{1}_n]_i = r_i \le 1,$$
which implies $u_i^\star \le \eta^{-1}\|C\|_\infty - \log(\sum_{j=1}^n e^{v_j^\star})$. Furthermore, we have
$$e^{u_i^\star} \sum_{j=1}^n e^{v_j^\star} \ge \sum_{j=1}^n e^{-\eta^{-1} C_{ij} + u_i^\star + v_j^\star} = [B(u^\star, v^\star)\mathbf{1}_n]_i = r_i \ge \min_{1 \le i, j \le n}\{r_i, c_j\},$$
which implies $u_i^\star \ge \log(\min_{1 \le i, j \le n}\{r_i, c_j\}) - \log(\sum_{j=1}^n e^{v_j^\star})$. Putting these pieces together yields Eq. (20). Using a similar argument, we can prove that Eq. (21) holds true.

Finally, we prove our main results. Since $\max_{1 \le i \le n} u_i^\star \ge 0 \ge \min_{1 \le i \le n} u_i^\star$, we derive from Eq. (20) that
$$-\frac{\|C\|_\infty}{\eta} + \log\left(\min_{1 \le i, j \le n}\{r_i, c_j\}\right) \le \min_{1 \le i \le n} u_i^\star \le \max_{1 \le i \le n} u_i^\star \le \frac{\|C\|_\infty}{\eta} - \log\left(\min_{1 \le i, j \le n}\{r_i, c_j\}\right).$$
This implies that $\|u^\star\|_\infty \le R$. Then, we bound $\|v^\star\|_\infty$ by considering two cases. For the former case, we assume that $\max_{1 \le i \le n} v_i^\star \ge 0$. Note that the optimality condition $\sum_{i,j=1}^n e^{-\eta^{-1} C_{ij} + u_i^\star + v_j^\star} = 1$ implies that
$$\max_{1 \le i \le n} u_i^\star + \max_{1 \le i \le n} v_i^\star \le \log\left(\max_{1 \le i, j \le n} e^{\eta^{-1} C_{ij}}\right) = \frac{\|C\|_\infty}{\eta}.$$
Since $\max_{1 \le i \le n} u_i^\star \ge 0$ and $\max_{1 \le i \le n} v_i^\star \ge 0$, we have $0 \le \max_{1 \le i \le n} v_i^\star \le \frac{\|C\|_\infty}{\eta}$. Combining $\max_{1 \le i \le n} v_i^\star \ge 0$ with Eq. (21) yields
$$\min_{1 \le i \le n} v_i^\star \ge -\frac{\|C\|_\infty}{\eta} + \log\left(\min_{1 \le i, j \le n}\{r_i, c_j\}\right),$$
which implies that $\|v^\star\|_\infty \le R$. For the latter case, we assume that $\max_{1 \le i \le n} v_i^\star \le 0$. Then, we have
$$\min_{1 \le i \le n} v_i^\star \ge \log\left(\min_{1 \le i, j \le n}\{r_i, c_j\}\right) - \log\left(\sum_{i=1}^n e^{u_i^\star}\right).$$
This together with $\|u^\star\|_\infty \le \frac{\|C\|_\infty}{\eta} - \log(\min_{1 \le i, j \le n}\{r_i, c_j\})$ yields that $\|v^\star\|_\infty \le R$. □

Putting Lemmas 3.1 and 3.2 together, we have the following straightforward consequence:

Corollary 3.3.
Letting $\{(u^t, v^t)\}_{t \ge 0}$ be the iterates generated by Algorithm 1, we have
$$f(u^t, v^t) - f(u^\star, v^\star) \le 4 R E_t.$$

Remark 3.4. The notation $R$ is also used in Dvurechensky et al. [2018] and has the same order as ours, since the $R$ in our paper only involves an additional term $\log(n) - \log(\min_{1 \le i, j \le n}\{r_i, c_j\})$.

Remark 3.5. We further comment on the proof techniques in this paper and in Dvurechensky et al. [2018]. Indeed, the proof of Dvurechensky et al. [2018, Lemma 2] depends on taking full advantage of the shift property of Sinkhorn; namely, either $B(u^t, v^t)\mathbf{1}_n = r$ or $B(u^t, v^t)^\top\mathbf{1}_n = c$, where $(u^t, v^t)$ stands for the iterate generated by Sinkhorn. Unfortunately, Greenkhorn does not enjoy such a shift property. We have thus proposed a different approach for bounding $f(u^t, v^t) - f(u^\star, v^\star)$ via appeal to the $\ell_\infty$-norm of the solution $(u^\star, v^\star)$.

3.2 Complexity analysis: bounding the number of iterations

We proceed to provide an upper bound on the number of iterations needed to achieve a desired tolerance $\varepsilon'$ in Algorithm 1. First, we start with a lower bound on the difference of function values between two consecutive iterates of Algorithm 1:

Lemma 3.6. Letting $\{(u^t, v^t)\}_{t \ge 0}$ be the iterates generated by Algorithm 1, we have
$$f(u^t, v^t) - f(u^{t+1}, v^{t+1}) \ge \frac{(E_t)^2}{28n}.$$

Proof. Combining Altschuler et al. [2017, Lemma 5] and the fact that the row or column update is chosen in a greedy manner, we have
$$f(u^t, v^t) - f(u^{t+1}, v^{t+1}) \ge \frac{1}{2n}\left(\rho(r, r(B(u^t, v^t))) + \rho(c, c(B(u^t, v^t)))\right).$$
Furthermore, Altschuler et al. [2017, Lemma 6] implies that
$$\rho(r, r(B(u^t, v^t))) + \rho(c, c(B(u^t, v^t))) \ge \frac{1}{7}\left(\|r - r(B(u^t, v^t))\|_1^2 + \|c - c(B(u^t, v^t))\|_1^2\right).$$
Putting these pieces together yields
$$f(u^t, v^t) - f(u^{t+1}, v^{t+1}) \ge \frac{1}{14n}\left(\|r - r(B(u^t, v^t))\|_1^2 + \|c - c(B(u^t, v^t))\|_1^2\right).$$
Combining the above inequality with the definition of $E_t$, together with $(a+b)^2 \le 2(a^2 + b^2)$, implies the desired result. □

We are now able to derive the iteration complexity of Algorithm 1.

Theorem 3.7. Letting $\{(u^t, v^t)\}_{t \ge 0}$ be the iterates generated by Algorithm 1, the number of iterations required to satisfy $E_t \le \varepsilon'$ is upper bounded by
$$t \le 2 + \frac{112 n R}{\varepsilon'},$$
where $R > 0$ is defined in Lemma 3.2.

Proof. Letting $\delta_t = f(u^t, v^t) - f(u^\star, v^\star)$, we derive from Corollary 3.3 and Lemma 3.6 that
$$\delta_t - \delta_{t+1} \ge \max\left\{\frac{\delta_t^2}{448 n R^2}, \frac{(\varepsilon')^2}{28n}\right\},$$
where $E_t \ge \varepsilon'$ as long as the stopping criterion is not fulfilled. In the following step we apply a switching strategy introduced by Dvurechensky et al. [2018]. Given any $t \ge 1$, we have two estimates:

(i) Considering the process from the first iteration to the $t$-th iteration, we have
$$\delta_{t+1} \le \frac{448 n R^2}{t + 448 n R^2 \delta_1^{-1}} \implies t \le 1 + \frac{448 n R^2}{\delta_t} - \frac{448 n R^2}{\delta_1}.$$

(ii) Considering the process from the $(t+1)$-th iteration to the $(t+m)$-th iteration for any $m \ge 1$, we have
$$\delta_{t+m} \le \delta_t - \frac{(\varepsilon')^2 m}{28n} \implies m \le \frac{28n\left(\delta_t - \delta_{t+m}\right)}{(\varepsilon')^2}.$$

We then minimize the sum of the two estimates by an optimal choice of $s \in (0, \delta_1]$, which yields the claimed bound $t \le 2 + 112 n R (\varepsilon')^{-1}$. □

We prove that APDAMD achieves the complexity bound $\widetilde{O}(n^2\sqrt{\delta}\,\varepsilon^{-1})$, in which $\delta > 0$ stands for the regularity of the mirror mapping $\phi$.

4.1 General setup

We follow the setup in Section 2 and consider the following generalization of the entropic regularized OT problem in Eq. (2):
$$\min_{x \in Q} f(x), \quad \text{s.t. } Ax = b, \tag{22}$$
where $f$ is strongly convex with respect to the $\ell_1$-norm on the set $Q$:
$$f(x') - f(x) - (x' - x)^\top \nabla f(x) \ge \frac{\eta}{2}\|x' - x\|_1^2 \quad \text{for any } x', x \in Q.$$
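For intuition, the strong-convexity condition above can be checked numerically for the entropic OT objective: on probability vectors, its Bregman divergence equals $\eta \cdot \mathrm{KL}(x' \| x)$, which dominates $\frac{\eta}{2}\|x' - x\|_1^2$ by Pinsker's inequality. The following short script is our illustrative sanity check (not part of the paper's development); the instance sizes and constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.5
c_flat = rng.random((4, 4)).ravel()  # flattened cost matrix C

def f(x):
    # entropic OT objective on a flattened plan: <C, x> + eta * sum x log x
    return float(c_flat @ x + eta * np.sum(x * np.log(x)))

def grad_f(x):
    return c_flat + eta * (np.log(x) + 1.0)

def bregman_gap(xp, x):
    # f(x') - f(x) - (x' - x)^T grad f(x); equals eta * KL(x' || x) on the simplex
    return f(xp) - f(x) - (xp - x) @ grad_f(x)

# draw strictly positive probability vectors (transportation plans, flattened)
for _ in range(100):
    x = rng.random(16) + 1e-3; x /= x.sum()
    xp = rng.random(16) + 1e-3; xp /= xp.sum()
    # strong convexity w.r.t. the l1-norm with modulus eta (Pinsker)
    assert bregman_gap(xp, x) >= 0.5 * eta * np.linalg.norm(xp - x, 1) ** 2 - 1e-12
print("l1 strong convexity check passed")
```

The check relies on the plans being normalized; off the simplex the linear terms of the Bregman divergence no longer cancel into a KL divergence.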
Note that, in the specific setting of the entropic regularized OT problem, the function is
$$f(x) = \sum_{i,j} C_{ij}\, x_{j+n(i-1)} + \eta\, x_{j+n(i-1)} \log(x_{j+n(i-1)}),$$
where $x_{j+n(i-1)} = X_{ij}$ for any $i, j$, with $X$ the transportation plan in Eq. (2), and the vector $b \in \mathbb{R}^{2n}$ is defined by $b_i = r_i$ for $1 \le i \le n$ and $b_i = c_{i-n}$ for $n+1 \le i \le 2n$. Furthermore, the matrix $A = (A_{ij}) \in \mathbb{R}^{2n \times n^2}$ is defined as follows: for $1 \le i \le n$, we set $A_{ij} = 1$ if $1 + n(i-1) \le j \le n \cdot i$ and $0$ otherwise; for $n+1 \le i \le 2n$, we set $A_{ij} = 1$ if $j \in \{i - n + n(l-1) : 1 \le l \le n\}$ and $0$ otherwise. To be consistent with the notation in Algorithms 4 and 5, we specifically write $A_{\mathrm{ot}}$ for the matrix $A$ corresponding to the entropic regularized OT problem.

After some calculations with the general problem (22), we obtain the dual problem
$$\min_{\lambda \in \mathbb{R}^{2n}} \widetilde{\varphi}(\lambda) := \langle \lambda, b \rangle + \max_{x \in \mathbb{R}^{n^2}} \left\{-f(x) - \langle A^\top \lambda, x \rangle\right\}, \tag{23}$$
with $\nabla \widetilde{\varphi}(\lambda) = b - A x(\lambda)$, where $x(\lambda) = \operatorname{argmax}_{x \in \mathbb{R}^{n^2}} \{-f(x) - \langle A^\top \lambda, x \rangle\}$; see the explicit form in Eq. (9) with $\lambda = (\alpha, \beta)$. By Nesterov [2005, Theorem 1] with the $\ell_1$-norm for the dual space of the Lagrange multipliers, the dual objective function $\widetilde{\varphi}$ satisfies the following inequality:
$$\widetilde{\varphi}(\lambda') - \widetilde{\varphi}(\lambda) - (\lambda' - \lambda)^\top \nabla \widetilde{\varphi}(\lambda) \le \frac{\|A\|_{1 \to 1}^2}{2\eta}\|\lambda' - \lambda\|_\infty^2. \tag{24}$$
In the entropic regularized OT problem, each column of the matrix $A_{\mathrm{ot}}$ contains no more than two nonzero elements, which are equal to one. Since $\|A_{\mathrm{ot}}\|_{1 \to 1}$ equals the maximum $\ell_1$-norm of the columns of this matrix, we have $\|A_{\mathrm{ot}}\|_{1 \to 1} = 2$. Thus, the dual objective function $\widetilde{\varphi}$ is $\frac{4}{\eta}$-gradient Lipschitz with respect to the $\ell_\infty$-norm.
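To make the definition of $A_{\mathrm{ot}}$ concrete, the following snippet (our illustrative code, not from the paper) builds $A_{\mathrm{ot}}$ for the row-major flattening $x_{j+n(i-1)} = X_{ij}$ and checks both that $A_{\mathrm{ot}}\,\mathrm{vec}(X)$ stacks the row and column marginals and that $\|A_{\mathrm{ot}}\|_{1\to1} = 2$.

```python
import numpy as np

def build_A_ot(n: int) -> np.ndarray:
    """Constraint matrix A_ot in R^{2n x n^2} for the row-major flattening
    x_{j + n(i-1)} = X_{ij} used in the text."""
    A = np.zeros((2 * n, n * n))
    for i in range(n):
        A[i, i * n:(i + 1) * n] = 1.0   # rows 1..n extract the i-th row sum of X
        A[n + i, i::n] = 1.0            # rows n+1..2n extract the i-th column sum
    return A

n = 4
A_ot = build_A_ot(n)
X = np.random.default_rng(1).random((n, n))
# A_ot vec(X) = (X 1_n ; X^T 1_n)
assert np.allclose(A_ot @ X.ravel(), np.r_[X.sum(axis=1), X.sum(axis=0)])
# ||A||_{1->1} is the maximum column l1-norm; every column has exactly two ones
assert A_ot.sum(axis=0).max() == 2.0 and A_ot.sum(axis=0).min() == 2.0
```

In practice one would never materialize $A_{\mathrm{ot}}$: applying it is just computing the two marginals of $X$, which is what the dense construction above verifies.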
In addition, we define the Bregman divergence $B_\phi : \mathbb{R}^{2n} \times \mathbb{R}^{2n} \to [0, +\infty)$ by
$$B_\phi(\lambda', \lambda) := \phi(\lambda') - \phi(\lambda) - (\lambda' - \lambda)^\top \nabla \phi(\lambda),$$
where the mirror mapping $\phi$ is $\frac{1}{\delta}$-strongly convex and $1$-smooth on $\mathbb{R}^{2n}$ with respect to the $\ell_\infty$-norm; that is,
$$\frac{1}{2\delta}\|\lambda' - \lambda\|_\infty^2 \le \phi(\lambda') - \phi(\lambda) - (\lambda' - \lambda)^\top \nabla \phi(\lambda) \le \frac{1}{2}\|\lambda' - \lambda\|_\infty^2.$$
For example, we can choose $\phi(\lambda) = \frac{1}{2n}\|\lambda\|^2$ and $B_\phi(\lambda', \lambda) = \frac{1}{2n}\|\lambda' - \lambda\|^2$ in APDAMD, where $\delta = n$. As such, $\delta > 0$ is a function of $n$ in general, and it will appear in the complexity bound of APDAMD for approximating the OT problem (cf. Theorem 4.5). It is worth noting that our algorithm uses a regularizer that acts only in the dual, and our complexity bound is the best existing one among this group of algorithms [Dvurechensky et al., 2018, Guo et al., 2020, Guminov et al., 2021]. A very recent work of Jambulapati et al. [2019] showed that the complexity bound can be improved to $\widetilde{O}(n^2\varepsilon^{-1})$ using a more advanced area-convex mirror mapping [Sherman, 2017].

Algorithm 4: Approximating OT by Algorithm 3
  Input: $\eta = \frac{\varepsilon}{4\log(n)}$ and $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$.
  Step 1: Let $\tilde r \in \Delta^n$ and $\tilde c \in \Delta^n$ be defined by $(\tilde r, \tilde c) = (1 - \frac{\varepsilon'}{8})(r, c) + \frac{\varepsilon'}{8n}(\mathbf{1}_n, \mathbf{1}_n)$.
  Step 2: Let $A_{\mathrm{ot}} \in \mathbb{R}^{2n \times n^2}$ and $b \in \mathbb{R}^{2n}$ be defined by $A_{\mathrm{ot}}\,\mathrm{vec}(X) = \binom{X\mathbf{1}_n}{X^\top\mathbf{1}_n}$ and $b = \binom{\tilde r}{\tilde c}$.
  Step 3: Compute $\tilde X = \textsc{Apdamd}(\widetilde{\varphi}, A_{\mathrm{ot}}, b, \varepsilon'/2)$, where $\widetilde{\varphi}$ is defined by Eq. (23).
  Step 4: Round $\tilde X$ to $\hat X$ using Altschuler et al. [2017, Algorithm 2] such that $\hat X\mathbf{1}_n = r$ and $\hat X^\top\mathbf{1}_n = c$.
  Output: $\hat X$.

4.2 Properties of APDAMD

We present several important properties of Algorithm 3 that will be used later for entropic regularized OT problems. First, we prove the following result regarding the number of line-search iterations in Algorithm 3 for solving the entropic regularized OT problem:

Lemma 4.1.
The number of line-search iterations in Algorithm 3 for solving the entropic regularized OT problem is finite. Furthermore, the total number of gradient oracle calls after the $t$-th iteration is bounded as
$$N_t \le 4t + 4 + \frac{2\log(\frac{8}{\eta}) - 2\log(L_0)}{\log 2}.$$

Proof. First, we observe that $M_t$ keeps being multiplied by two until the line-search stopping criterion is satisfied. Then, Eq. (24) implies that the number of line-search iterations is finite and that $M_t \le \frac{2\|A_{\mathrm{ot}}\|_{1\to1}^2}{\eta}$ holds true for all $t \ge 0$; otherwise, the line-search stopping criterion would already have been satisfied with $\frac{M_t}{2}$, since $\frac{M_t}{2} \ge \frac{\|A_{\mathrm{ot}}\|_{1\to1}^2}{\eta}$. Letting $i_j$ denote the total number of multiplications at the $j$-th iteration, we have
$$i_0 \le 1 + \frac{\log(\frac{M_0}{L_0})}{\log 2}, \qquad i_j \le 2 + \frac{\log(\frac{M_j}{M_{j-1}})}{\log 2}.$$
Then, the total number of line-search iterations is bounded by
$$\sum_{j=0}^t i_j \le 1 + \frac{\log(\frac{M_0}{L_0})}{\log 2} + \sum_{j=1}^t \left(2 + \frac{\log(\frac{M_j}{M_{j-1}})}{\log 2}\right) \le 2t + 1 + \frac{\log(\frac{2\|A_{\mathrm{ot}}\|_{1\to1}^2}{\eta}) - \log(L_0)}{\log 2}.$$
Since each line search contains two gradient oracle calls and $\|A_{\mathrm{ot}}\|_{1\to1} = 2$, we conclude the desired upper bound on the total number of gradient oracle calls after the $t$-th iteration. □

The next lemma presents a property of the function $\widetilde{\varphi}$ in Algorithm 3.

Lemma 4.2. For each iteration $t$ of Algorithm 3 and any $z \in \mathbb{R}^{2n}$, we have
$$\bar\alpha_t\widetilde{\varphi}(\lambda^t) \le \sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right) + \|z\|_\infty^2.$$

Proof. First, we claim that the following holds for any $z \in \mathbb{R}^{2n}$:
$$\alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) \le \bar\alpha_{t+1}\left(\widetilde{\varphi}(\mu^{t+1}) - \widetilde{\varphi}(\lambda^{t+1})\right) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}). \tag{25}$$
Indeed, the optimality condition in mirror descent implies that, for any $z \in \mathbb{R}^{2n}$, we have
$$(z - z^{t+1})^\top\left(\nabla\widetilde{\varphi}(\mu^{t+1}) + \frac{\nabla\phi(z^{t+1}) - \nabla\phi(z^t)}{\alpha_{t+1}}\right) \ge 0.$$
By definition, we have $B_\phi(z, z^t) - B_\phi(z, z^{t+1}) - B_\phi(z^{t+1}, z^t) = (z - z^{t+1})^\top(\nabla\phi(z^{t+1}) - \nabla\phi(z^t))$ and $B_\phi(z^{t+1}, z^t) \ge \frac{1}{2\delta}\|z^{t+1} - z^t\|_\infty^2$. Putting these pieces together yields
$$\begin{aligned}
\alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) &= \alpha_{t+1}(z^t - z^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + \alpha_{t+1}(z^{t+1} - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) \\
&\le \alpha_{t+1}(z^t - z^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + (z - z^{t+1})^\top(\nabla\phi(z^{t+1}) - \nabla\phi(z^t)) \\
&= \alpha_{t+1}(z^t - z^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}) - B_\phi(z^{t+1}, z^t) \\
&\le \alpha_{t+1}(z^t - z^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}) - \frac{\|z^{t+1} - z^t\|_\infty^2}{2\delta}.
\end{aligned} \tag{26}$$
The update formulas for $\mu^{t+1}$, $\lambda^{t+1}$, $\alpha_{t+1}$ and $\bar\alpha_{t+1}$ imply that
$$\lambda^{t+1} - \mu^{t+1} = \frac{\alpha_{t+1}}{\bar\alpha_{t+1}}(z^{t+1} - z^t), \qquad \delta M_t(\alpha_{t+1})^2 = \bar\alpha_{t+1}.$$
Therefore, we have
$$\alpha_{t+1}(z^t - z^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) = \bar\alpha_{t+1}(\mu^{t+1} - \lambda^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}),$$
and
$$\|z^{t+1} - z^t\|_\infty^2 = \left(\frac{\bar\alpha_{t+1}}{\alpha_{t+1}}\right)^2\|\mu^{t+1} - \lambda^{t+1}\|_\infty^2 = \delta M_t\,\bar\alpha_{t+1}\|\mu^{t+1} - \lambda^{t+1}\|_\infty^2.$$
Putting these pieces together with Eq. (26) yields
$$\begin{aligned}
\alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) &\le \bar\alpha_{t+1}(\mu^{t+1} - \lambda^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}) - \frac{\bar\alpha_{t+1} M_t}{2}\|\mu^{t+1} - \lambda^{t+1}\|_\infty^2 \\
&= \bar\alpha_{t+1}\left((\mu^{t+1} - \lambda^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) - \frac{M_t}{2}\|\mu^{t+1} - \lambda^{t+1}\|_\infty^2\right) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}) \\
&\le \bar\alpha_{t+1}\left(\widetilde{\varphi}(\mu^{t+1}) - \widetilde{\varphi}(\lambda^{t+1})\right) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}),
\end{aligned}$$
where the last inequality comes from the stopping criterion in the line search. This implies that Eq. (25) holds true. The next step is to bound the iterative objective gap:
$$\bar\alpha_{t+1}\widetilde{\varphi}(\lambda^{t+1}) - \bar\alpha_t\widetilde{\varphi}(\lambda^t) \le \alpha_{t+1}\left(\widetilde{\varphi}(\mu^{t+1}) + (z - \mu^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1})\right) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}). \tag{27}$$
Indeed, by combining $\bar\alpha_{t+1} = \bar\alpha_t + \alpha_{t+1}$ and the update formula for $\mu^{t+1}$, we have
$$\alpha_{t+1}(\mu^{t+1} - z^t) = (\bar\alpha_{t+1} - \bar\alpha_t)\mu^{t+1} - \alpha_{t+1} z^t = \alpha_{t+1} z^t + \bar\alpha_t\lambda^t - \bar\alpha_t\mu^{t+1} - \alpha_{t+1} z^t = \bar\alpha_t(\lambda^t - \mu^{t+1}).$$
This together with the convexity of $\widetilde{\varphi}$ implies that
$$\begin{aligned}
\alpha_{t+1}(\mu^{t+1} - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) &= \alpha_{t+1}(\mu^{t+1} - z^t)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + \alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) \\
&= \bar\alpha_t(\lambda^t - \mu^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + \alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) \\
&\le \bar\alpha_t\left(\widetilde{\varphi}(\lambda^t) - \widetilde{\varphi}(\mu^{t+1})\right) + \alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}).
\end{aligned}$$
Furthermore, we derive from Eq. (25) and $\bar\alpha_{t+1} = \bar\alpha_t + \alpha_{t+1}$ that
$$\begin{aligned}
\bar\alpha_t\left(\widetilde{\varphi}(\lambda^t) - \widetilde{\varphi}(\mu^{t+1})\right) + \alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) &\le \bar\alpha_t\left(\widetilde{\varphi}(\lambda^t) - \widetilde{\varphi}(\mu^{t+1})\right) + \bar\alpha_{t+1}\left(\widetilde{\varphi}(\mu^{t+1}) - \widetilde{\varphi}(\lambda^{t+1})\right) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}) \\
&= \bar\alpha_t\widetilde{\varphi}(\lambda^t) - \bar\alpha_{t+1}\widetilde{\varphi}(\lambda^{t+1}) + \alpha_{t+1}\widetilde{\varphi}(\mu^{t+1}) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}).
\end{aligned}$$
Putting these pieces together yields that Eq. (27) holds true.

Finally, we prove our main result. By changing the index $t$ to $j$ in Eq. (27) and summing the resulting inequality over $j = 0, 1, \ldots, t-1$, we have
$$\bar\alpha_t\widetilde{\varphi}(\lambda^t) - \bar\alpha_0\widetilde{\varphi}(\lambda^0) \le \sum_{j=0}^{t-1}\alpha_{j+1}\left(\widetilde{\varphi}(\mu^{j+1}) + (z - \mu^{j+1})^\top \nabla\widetilde{\varphi}(\mu^{j+1})\right) + B_\phi(z, z^0) - B_\phi(z, z^t).$$
Since $\alpha_0 = \bar\alpha_0 = 0$, $B_\phi(z, z^t) \ge 0$ and $\phi$ is $1$-smooth with respect to the $\ell_\infty$-norm, we have
$$\bar\alpha_t\widetilde{\varphi}(\lambda^t) \le \sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right) + B_\phi(z, z^0) \le \sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right) + \|z - z^0\|_\infty^2 \overset{z^0 = 0}{=} \sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right) + \|z\|_\infty^2.$$
This completes the proof. □

The final lemma provides a key lower bound for the accumulating parameter.

Lemma 4.3. For each iteration $t$ of Algorithm 3, we have $\bar\alpha_t \ge \frac{\eta(t+1)^2}{32\delta}$.

Proof.
For $t = 1$, we have $\bar\alpha_1 = \alpha_1 = \frac{1}{\delta M_1} \ge \frac{\eta}{8\delta}$, since $M_1 \le \frac{8}{\eta}$ was proven in Lemma 4.1; thus the desired result holds true when $t = 1$, as $\frac{\eta(1+1)^2}{32\delta} = \frac{\eta}{8\delta}$. We then proceed to prove that it holds true for all $t \ge 1$ by induction. Indeed, we have
$$\bar\alpha_{t+1} = \bar\alpha_t + \alpha_{t+1} = \bar\alpha_t + \frac{1 + \sqrt{1 + 4\delta M_t\bar\alpha_t}}{2\delta M_t} = \bar\alpha_t + \frac{1}{2\delta M_t} + \sqrt{\frac{1}{4(\delta M_t)^2} + \frac{\bar\alpha_t}{\delta M_t}} \ge \bar\alpha_t + \frac{1}{2\delta M_t} + \sqrt{\frac{\bar\alpha_t}{\delta M_t}} \ge \bar\alpha_t + \frac{\eta}{16\delta} + \sqrt{\frac{\eta\bar\alpha_t}{8\delta}},$$
where the last inequality comes from $M_t \le \frac{8}{\eta}$ as shown in Lemma 4.1. Suppose that the desired result holds true for $t = k_0$; then we have
$$\bar\alpha_{k_0+1} \ge \frac{\eta(k_0+1)^2}{32\delta} + \frac{\eta}{16\delta} + \sqrt{\frac{\eta^2(k_0+1)^2}{256\delta^2}} = \frac{\eta\left((k_0+1)^2 + 2 + 2(k_0+1)\right)}{32\delta} \ge \frac{\eta(k_0+2)^2}{32\delta}.$$
This completes the proof. □

4.3 Complexity analysis for APDAMD

We are now ready to establish the complexity bound of APDAMD for solving the entropic regularized OT problem. Indeed, we recall that $\widetilde{\varphi}(\lambda)$ is defined with $\lambda = (\alpha, \beta)$ by
$$\widetilde{\varphi}(\alpha, \beta) = \eta\log\Bigl(\sum_{1 \le i,j \le n} e^{\eta^{-1}(\alpha_i + \beta_j - C_{ij})}\Bigr) - \alpha^\top r - \beta^\top c.$$
Since $(\alpha, \beta)$ can be obtained from $(u, v)$ by $\alpha_i = \eta u_i$ and $\beta_j = \eta v_j$, we derive from Lemma 2.2 that
$$\|\alpha^\star\|_\infty \le \eta R, \qquad \|\beta^\star\|_\infty \le \eta R,$$
where $R$ is defined accordingly. Then, we proceed to the following key result, which determines an upper bound on the number of iterations for Algorithm 3 to reach a desired accuracy $\varepsilon'$:

Theorem 4.4. Letting $\{X^t\}_{t \ge 0}$ be the iterates generated by Algorithm 3, the number of iterations required to satisfy $\|A_{\mathrm{ot}}\,\mathrm{vec}(X^t) - b\|_1 \le \varepsilon'$ is upper bounded by
$$t \le 1 + \sqrt{\frac{128\delta R}{\varepsilon'}},$$
where $R > 0$ is defined in Lemma 2.2.

Proof. From Lemma 4.2, we have
$$\bar\alpha_t\widetilde{\varphi}(\lambda^t) \le \min_{z \in \mathcal{B}_\infty(2\eta R)}\left\{\sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right) + \|z\|_\infty^2\right\},$$
where $\mathcal{B}_\infty(\varrho) := \{\lambda \in \mathbb{R}^{2n} \mid \|\lambda\|_\infty \le \varrho\}$. This implies that
$$\bar\alpha_t\widetilde{\varphi}(\lambda^t) \le \min_{z \in \mathcal{B}_\infty(2\eta R)}\left\{\sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right)\right\} + 4\eta^2 R^2.$$
Since $\widetilde{\varphi}$ is the objective function of the dual entropic regularized OT problem, we have
$$\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j) = -f(x(\mu^j)) + z^\top(b - A_{\mathrm{ot}}x(\mu^j)).$$
Therefore, we conclude that
$$\bar\alpha_t\widetilde{\varphi}(\lambda^t) \le \min_{z \in \mathcal{B}_\infty(2\eta R)}\left\{\sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right)\right\} + 4\eta^2 R^2 \le 4\eta^2 R^2 - \bar\alpha_t f(x^t) + \min_{z \in \mathcal{B}_\infty(2\eta R)}\left\{\bar\alpha_t z^\top(b - A_{\mathrm{ot}}x^t)\right\} = 4\eta^2 R^2 - \bar\alpha_t f(x^t) - 2\bar\alpha_t\eta R\|A_{\mathrm{ot}}x^t - b\|_1,$$
where the second inequality comes from the convexity of $f$ and the last equality comes from the fact that the $\ell_1$-norm is the dual norm of the $\ell_\infty$-norm. That is to say,
$$f(x^t) + \widetilde{\varphi}(\lambda^t) + 2\eta R\|A_{\mathrm{ot}}x^t - b\|_1 \le \frac{4\eta^2 R^2}{\bar\alpha_t}.$$
Suppose that $\lambda^\star$ is an optimal solution to the dual entropic regularized OT problem satisfying $\|\lambda^\star\|_\infty \le \eta R$; then we have
$$f(x^t) + \widetilde{\varphi}(\lambda^t) \ge f(x^t) + \widetilde{\varphi}(\lambda^\star) = f(x^t) + b^\top\lambda^\star + \max_{x \in \mathbb{R}^{n^2}}\left\{-f(x) - (\lambda^\star)^\top A_{\mathrm{ot}}x\right\} \ge f(x^t) + b^\top\lambda^\star - f(x^t) - (\lambda^\star)^\top A_{\mathrm{ot}}x^t = (b - A_{\mathrm{ot}}x^t)^\top\lambda^\star \ge -\eta R\|A_{\mathrm{ot}}x^t - b\|_1.$$
Therefore, we conclude that
$$\|A_{\mathrm{ot}}x^t - b\|_1 \le \frac{4\eta R}{\bar\alpha_t} \le \frac{128\delta R}{(t+1)^2},$$
where the last inequality uses Lemma 4.3. This completes the proof. □

Now, we are ready to present the complexity bound of Algorithm 4 for approximating the OT problem.

Theorem 4.5. The APDAMD scheme for approximating optimal transport (Algorithm 4) returns an $\varepsilon$-approximate transportation plan (cf. Definition 1) in
$$O\left(\frac{n^2\sqrt{\delta}\,\|C\|_\infty\log(n)}{\varepsilon}\right)$$
arithmetic operations.

Proof. Using the same argument as in Theorem 3.8, we have
$$\langle C, \hat X\rangle - \langle C, X^\star\rangle \le \frac{\varepsilon}{2} + 4\left(\|\tilde X\mathbf{1}_n - r\|_1 + \|\tilde X^\top\mathbf{1}_n - c\|_1\right)\|C\|_\infty,$$
where $\hat X$ is returned by Algorithm 4, $X^\star$ is a solution to the OT problem, and $\tilde X = \textsc{Apdamd}(\widetilde{\varphi}, A_{\mathrm{ot}}, b, \varepsilon'/2)$. Since $\|\tilde X\mathbf{1}_n - r\|_1 + \|\tilde X^\top\mathbf{1}_n - c\|_1 \le \varepsilon'$ and $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$, we have $\langle C, \hat X\rangle - \langle C, X^\star\rangle \le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon$. The remaining step is to analyze the complexity bound.
It follows from Lemma 4.1 and Theorem 4.4 that
$$N_t \le 4t + 4 + \frac{2\log(\frac{8}{\eta}) - 2\log(L_0)}{\log 2} \le 8 + \sqrt{\frac{2048\delta R}{\varepsilon'}} + \frac{2\log(\frac{8}{\eta}) - 2\log(L_0)}{\log 2} = 8 + 256\sqrt{\frac{\delta R\|C\|_\infty\log(n)}{\varepsilon}} + \frac{2\log(\frac{32\log(n)}{\varepsilon}) - 2\log(L_0)}{\log 2}.$$
Combining the definition of $R$ in Lemma 2.2 with the definitions of $\eta$, $\tilde r$ and $\tilde c$ in Algorithm 4, we have
$$R \le \frac{4\|C\|_\infty\log(n)}{\varepsilon} + \log(n) - 2\log\left(\frac{\varepsilon}{64n\|C\|_\infty}\right).$$
Therefore, we conclude that
$$N_t \le 256\sqrt{\frac{\delta\|C\|_\infty\log(n)}{\varepsilon}}\sqrt{\frac{4\|C\|_\infty\log(n)}{\varepsilon} + \log(n) - 2\log\left(\frac{\varepsilon}{64n\|C\|_\infty}\right)} + \frac{2\log(\frac{32\log(n)}{\varepsilon}) - 2\log(L_0)}{\log 2} + 8 = O\left(\frac{\sqrt{\delta}\|C\|_\infty\log(n)}{\varepsilon}\right).$$
The total iteration complexity in Step 3 of Algorithm 4 is thus bounded by $O(\sqrt{\delta}\|C\|_\infty\log(n)\varepsilon^{-1})$, and each iteration of Algorithm 3 requires $O(n^2)$ arithmetic operations. Moreover, $\tilde r$ and $\tilde c$ in Step 1 of Algorithm 4 can be found in $O(n)$ arithmetic operations, and Altschuler et al. [2017, Algorithm 2] requires $O(n^2)$ arithmetic operations. Therefore, we conclude that the total number of arithmetic operations is $O(n^2\sqrt{\delta}\|C\|_\infty\log(n)\varepsilon^{-1})$. □

Algorithm 5: Approximating OT by Dvurechensky et al. [2018, Algorithm 3]
  Input: $\eta = \frac{\varepsilon}{4\log(n)}$ and $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$.
  Step 1: Let $\tilde r \in \Delta^n$ and $\tilde c \in \Delta^n$ be defined by $(\tilde r, \tilde c) = (1 - \frac{\varepsilon'}{8})(r, c) + \frac{\varepsilon'}{8n}(\mathbf{1}_n, \mathbf{1}_n)$.
  Step 2: Let $A_{\mathrm{ot}} \in \mathbb{R}^{2n \times n^2}$ and $b \in \mathbb{R}^{2n}$ be defined by $A_{\mathrm{ot}}\,\mathrm{vec}(X) = \binom{X\mathbf{1}_n}{X^\top\mathbf{1}_n}$ and $b = \binom{\tilde r}{\tilde c}$.
  Step 3: Compute $\tilde X = \textsc{Apdagd}(\widetilde{\varphi}, A_{\mathrm{ot}}, b, \varepsilon'/2)$, where $\widetilde{\varphi}$ is defined by Eq. (23).
  Step 4: Round $\tilde X$ to $\hat X$ using Altschuler et al. [2017, Algorithm 2] such that $\hat X\mathbf{1}_n = r$ and $\hat X^\top\mathbf{1}_n = c$.
  Output: $\hat X$.

The complexity result in Theorem 4.5 suggests an interesting feature of the (regularized) OT problem.
Indeed, the dependence of that bound on $\delta$ manifests the necessity of the $\ell_\infty$-norm in understanding the complexity of the entropic regularized OT problem. This view is also in harmony with the proof technique of the running-time analysis for Greenkhorn in Section 3, where we rely on the $\ell_\infty$-norm of optimal solutions of the dual entropic regularized OT problem to measure the progress in objective value across successive iterates.

4.4 Revisiting APDAGD

We revisit APDAGD [Dvurechensky et al., 2018] for the entropic regularized OT problem. First, we point out, via a simple counterexample, that its current complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ is not valid. Then, we establish a new complexity bound for APDAGD using our techniques from Section 4.3. Despite the issue with entropic regularized OT, we wish to emphasize that APDAGD is still an interesting and efficient accelerated algorithm for general linearly constrained convex optimization problems, with solid theoretical guarantees. More precisely, Dvurechensky et al. [2018, Theorem 3] is not applicable to entropic regularized OT, since no dual solution exists with a constant bound in $\ell_2$-norm; however, it can be used for analyzing other problems with bounded optimal dual solutions.

To facilitate the ensuing discussion, we first present the complexity bound for entropic regularized OT in Dvurechensky et al. [2018] using our notation. Indeed, we recall that APDAGD is developed for solving the optimization problem with the objective function $\widehat{\varphi}$ defined as follows:
$$\min_{\alpha, \beta \in \mathbb{R}^n} \widehat{\varphi}(\alpha, \beta) := \eta\sum_{i,j=1}^n e^{-\frac{C_{ij} - \alpha_i - \beta_j}{\eta} - 1} - \alpha^\top r - \beta^\top c. \tag{28}$$

Theorem 4.6 (Theorem 4 in Dvurechensky et al. [2018]). Let $\bar R > 0$ be defined such that there exists an optimal solution to the dual entropic regularized OT problem in Eq.
(23), denoted by $(u^\star, v^\star)$, satisfying $\|(u^\star, v^\star)\| \le \bar R$. Then the APDAGD scheme for approximating optimal transport (cf. Algorithm 5) returns an $\varepsilon$-approximate transportation plan (cf. Definition 1) in
$$O\left(\min\left\{n^{9/4}\,\frac{\sqrt{\bar R\|C\|_\infty\log(n)}}{\varepsilon},\; n^2\,\frac{\bar R\|C\|_\infty\log(n)}{\varepsilon^2}\right\}\right)$$
arithmetic operations.

From the above theorem, Dvurechensky et al. [2018] claim that the complexity bound for APDAGD is $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$. However, there are two issues:

1. The upper bound $\bar R$ is assumed to be independent of $n$, which is not correct; see our counterexample in Proposition 4.7.

2. The known upper bound $\bar R$ for the optimal solution depends on $\min_{1 \le i,j \le n}\{r_i, c_j\}$ (cf. Dvurechensky et al. [2018, Lemma 1] or Lemma 2.2 in our paper). This implies that a valid algorithm needs to take the rounding error in $r$ and $c$ into account.

Corrected upper bound $\bar R$. Corollary 2.3 and Lemma 3.2 imply that a straightforward upper bound for $\bar R$ is $\widetilde{O}(\sqrt{n})$. Given a tolerance $\varepsilon \in (0, 1)$, we further show that $\bar R$ is indeed $\Omega(\sqrt{n})$ by means of a specific entropic regularized OT problem.

Proposition 4.7. Suppose that $C = \mathbf{1}_n\mathbf{1}_n^\top$ and $r = c = \frac{1}{n}\mathbf{1}_n$. Given a tolerance $\varepsilon \in (0, 1)$ and the regularization parameter $\eta = \frac{\varepsilon}{4\log(n)}$, all optimal solutions of the dual entropic regularized OT problem in Eq. (28) satisfy $\|(\alpha^\star, \beta^\star)\| \gtrsim \sqrt{n}$.

Proof. By the definition of $r$, $c$ and $\eta$, we rewrite the dual function $\widehat{\varphi}(\alpha, \beta)$ as follows:
$$\widehat{\varphi}(\alpha, \beta) = \frac{\varepsilon}{4e\log(n)}\sum_{1 \le i,j \le n} e^{-\frac{4\log(n)(1 - \alpha_i - \beta_j)}{\varepsilon}} - \frac{\alpha^\top\mathbf{1}_n}{n} - \frac{\beta^\top\mathbf{1}_n}{n}.$$
Since $(\alpha^\star, \beta^\star)$ is an optimal solution of the dual entropic regularized OT problem, we have
$$e^{\frac{4\log(n)\alpha_i^\star}{\varepsilon}}\sum_{j=1}^n e^{-\frac{4\log(n)(1 - \beta_j^\star)}{\varepsilon}} = e^{\frac{4\log(n)\beta_i^\star}{\varepsilon}}\sum_{j=1}^n e^{-\frac{4\log(n)(1 - \alpha_j^\star)}{\varepsilon}} = \frac{e}{n} \quad \text{for all } i \in [n]. \tag{29}$$
This implies that $\alpha_i^\star = \alpha_j^\star$ and $\beta_i^\star = \beta_j^\star$ for all $i, j \in [n]$, and that $\alpha_i^\star + \beta_i^\star$ is the same for all $i \in [n]$. Without loss of generality, we can let $\alpha_i^\star = 0$ in Eq. (29) and obtain that
$$\beta_i^\star = 1 + \frac{\varepsilon}{4\log(n)}(1 - 2\log(n)) = 1 + \frac{\varepsilon}{4\log(n)} - \frac{\varepsilon}{2},$$
which implies that $\alpha_i^\star + \beta_i^\star = 1 + \frac{\varepsilon}{4\log(n)} - \frac{\varepsilon}{2} \ge \frac{1}{2}$ for all $i \in [n]$, since $\varepsilon \in (0, 1)$. Thus, we have
$$\|(\alpha^\star, \beta^\star)\| \ge \sqrt{\frac{\sum_{i=1}^n(\alpha_i^\star + \beta_i^\star)^2}{2}} \ge \frac{1}{2}\sqrt{\frac{n}{2}} \gtrsim \sqrt{n}.$$
As a consequence, we achieve the conclusion of the proposition. □

Approximation algorithm for OT by APDAGD. It is worth noting that the rounding procedure is missing in Dvurechensky et al. [2018, Algorithm 4], and we improve it to Algorithm 5. In particular, Dvurechensky et al. [2018, Algorithm 3] is used in Step 3 of Algorithm 5 with another function $\widetilde{\varphi}$, defined in Eq. (9). Given the corrected upper bound $\bar R$ and Algorithm 5 for approximating OT, we provide a new complexity bound for Algorithm 5 in the following proposition.

Proposition 4.8. The APDAGD scheme for approximating optimal transport (Algorithm 5) returns an $\varepsilon$-approximate transportation plan (cf. Definition 1) in
$$O\left(\frac{n^{5/2}\|C\|_\infty\sqrt{\log(n)}}{\varepsilon}\right)$$
arithmetic operations.

Proof. The proof is a simple modification of the proof of Dvurechensky et al. [2018, Theorem 4], and we only give a proof sketch here. In particular, we can obtain that the number of iterations required by Algorithm 5 to reach the tolerance $\varepsilon$ is
$$t \le O\left(\max\left\{\min\left\{n^{1/4}\,\frac{\sqrt{\bar R\|C\|_\infty\log(n)}}{\varepsilon},\; \frac{\bar R\|C\|_\infty\log(n)}{\varepsilon^2}\right\},\; \frac{\bar R\sqrt{\log(n)}}{\varepsilon}\right\}\right). \tag{30}$$
Moreover, we have $\bar R \le \sqrt{n}\,\eta R$, where $R = \eta^{-1}\|C\|_\infty + \log(n) - 2\log(\min_{1 \le i,j \le n}\{r_i, c_j\})$ as in Lemma 3.2. Therefore, the total iteration complexity in Step 3 of Algorithm 5 is $O(\sqrt{n\log(n)}\|C\|_\infty\varepsilon^{-1})$. Each iteration of APDAGD requires $O(n^2)$ arithmetic operations; therefore, the total number of arithmetic operations is $O(n^{5/2}\|C\|_\infty\sqrt{\log(n)}\varepsilon^{-1})$.
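The $\Omega(\sqrt{n})$ lower bound of Proposition 4.7 is easy to confirm numerically from the closed-form dual solution derived in its proof. The script below (our illustrative check, not part of the paper) evaluates $\beta^\star$ for several values of $n$, verifies the stationarity condition (29) in log-space, and checks that $\|(\alpha^\star, \beta^\star)\| \ge \frac{1}{2}\sqrt{n/2}$.

```python
import numpy as np

def dual_counterexample(n: int, eps: float):
    """Closed-form optimal dual pair for C = 1 1^T, r = c = (1/n) 1_n,
    eta = eps / (4 log n), normalized so that alpha* = 0 (see Prop. 4.7)."""
    eta = eps / (4.0 * np.log(n))
    alpha = np.zeros(n)
    # beta_i = 1 + eta * (1 - 2 log n) = 1 + eps/(4 log n) - eps/2
    beta = np.full(n, 1.0 + eta * (1.0 - 2.0 * np.log(n)))
    return alpha, beta, eta

for n in (10, 100, 1000):
    eps = 0.5
    alpha, beta, eta = dual_counterexample(n, eps)
    # stationarity (29) in log-space: beta_i/eta + log n - 1/eta = 1 - log n
    assert np.isclose(beta[0] / eta + np.log(n) - 1.0 / eta, 1.0 - np.log(n))
    # alpha_i + beta_i >= 1/2, hence ||(alpha*, beta*)|| >= (1/2) sqrt(n/2)
    norm = np.sqrt(alpha @ alpha + beta @ beta)
    assert norm >= 0.5 * np.sqrt(n / 2.0)
```

Working in log-space avoids the overflow that evaluating $e^{4\log(n)\beta_i^\star/\varepsilon}$ directly would cause for large $n$.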
Note that $\tilde r$ and $\tilde c$ in Step 1 of Algorithm 5 can be found in $O(n)$ arithmetic operations, and Altschuler et al. [2017, Algorithm 2] requires $O(n^2)$ arithmetic operations. Therefore, we conclude that the total number of arithmetic operations is $O(n^{5/2}\|C\|_\infty\sqrt{\log(n)}\varepsilon^{-1})$. □

Remark 4.9. As indicated in Proposition 4.8, the corrected complexity bound of APDAGD for entropic regularized OT is similar to that of APDAMD when we choose $\phi(\cdot) = \frac{1}{2n}\|\cdot\|^2$ and have $\delta = n$. From this perspective, our algorithm can be viewed as a generalization of APDAGD. Since our algorithm utilizes the $\ell_\infty$-norm in the line-search criterion, it is more robust than APDAGD in practice; see Section 6 for details.

5 Accelerating Sinkhorn

In this section, we present an accelerated variant of Sinkhorn for solving the entropic regularized OT problem in Eq. (2). Combined with a rounding scheme, our algorithm can be used for solving the OT problem in Eq. (1) and achieves a complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$, which improves that of Sinkhorn in terms of $1/\varepsilon$, and those of APDAGD and AAM [Guminov et al., 2021] in terms of $n$. The idea comes from a novel combination of Nesterov's estimate sequences and the techniques for analyzing Sinkhorn.

5.1 Algorithmic procedure

We present the pseudocode of accelerated Sinkhorn in Algorithm 6. This algorithm achieves acceleration by using Nesterov's estimate sequences [Nesterov, 2018].

Algorithm 6: $\textsc{AcceleratedSinkhorn}(C, \eta, r, c, \varepsilon')$
  Input: $t = 0$, $\theta_0 = 1$ and $\check u^0 = \tilde u^0 = \check v^0 = \tilde v^0 = \mathbf{0}_n$.
  while $E_t > \varepsilon'$ do
    Compute $\binom{\bar u^t}{\bar v^t} = (1 - \theta_t)\binom{\check u^t}{\check v^t} + \theta_t\binom{\tilde u^t}{\tilde v^t}$.
    Compute $\tilde u^{t+1}$ and $\tilde v^{t+1}$ by
    $$\tilde u^{t+1} = \tilde u^t - \frac{1}{2\theta_t}\left(\frac{r(B(\bar u^t, \bar v^t))}{\|B(\bar u^t, \bar v^t)\|_1} - r\right), \qquad \tilde v^{t+1} = \tilde v^t - \frac{1}{2\theta_t}\left(\frac{c(B(\bar u^t, \bar v^t))}{\|B(\bar u^t, \bar v^t)\|_1} - c\right).$$
    Compute $\binom{\grave u^t}{\grave v^t} = \binom{\bar u^t}{\bar v^t} + \theta_t\left(\binom{\tilde u^{t+1}}{\tilde v^{t+1}} - \binom{\tilde u^t}{\tilde v^t}\right)$.
    if $t$ is even then
      $\hat u^t = \grave u^t + \log(r) - \log(r(B(\grave u^t, \grave v^t)))$ and $\hat v^t = \grave v^t$.
    else
      $\hat u^t = \grave u^t$ and $\hat v^t = \grave v^t + \log(c) - \log(c(B(\grave u^t, \grave v^t)))$.
    end if
    Compute $\binom{u^t}{v^t} = \operatorname{argmin}\left\{\varphi(u, v) \;\middle|\; \binom{u}{v} \in \left\{\binom{\check u^t}{\check v^t}, \binom{\hat u^t}{\hat v^t}\right\}\right\}$.
    if $t$ is even then
      $\check u^{t+1} = u^t + \log(r) - \log(r(B(u^t, v^t)))$ and $\check v^{t+1} = v^t$.
    else
      $\check u^{t+1} = u^t$ and $\check v^{t+1} = v^t + \log(c) - \log(c(B(u^t, v^t)))$.
    end if
    Compute $\theta_{t+1} = \frac{\theta_t(\sqrt{\theta_t^2 + 4} - \theta_t)}{2}$. Set $t = t + 1$.
  end while
  Output: $B(u^t, v^t)$.

While our algorithm can be interpreted as an accelerated block coordinate descent algorithm, it is worth mentioning that it is purely deterministic and thus differs from other accelerated randomized algorithms [Lin et al., 2015, Fercoq and Richtárik, 2015, Lu et al., 2018] in the optimization literature. Algorithm 6 is a novel combination of Nesterov's estimate sequences, a monotone search step, the greedy choice of coordinates, and two coordinate updates. It is applied to solve the dual entropic regularized OT problem in Eq. (5):
$$\min_{u, v} \varphi(u, v) := \log(\|B(u, v)\|_1) - u^\top r - v^\top c.$$
More specifically, Nesterov's estimate sequences are responsible for optimizing the dual objective function $\varphi$ at a fast rate. The coordinate update guarantees that $\varphi(\hat u^t, \hat v^t) \le \varphi(\grave u^t, \grave v^t)$ and $\|B(\hat u^t, \hat v^t)\|_1 = 1$. The monotone search step guarantees that $\varphi(u^t, v^t) \le \varphi(\hat u^t, \hat v^t)$. The greedy coordinate update guarantees that $\varphi(\check u^{t+1}, \check v^{t+1}) \le \varphi(u^t, v^t)$ with sufficient progress. Furthermore, we use the same quantity as in Greenkhorn to measure the per-iteration residue of Algorithm 6:
$$E_t = \|r(B(u^t, v^t)) - r\|_1 + \|c(B(u^t, v^t)) - c\|_1. \tag{31}$$
(31)

The computationally expensive step is to compute $r(B(\bar{u}_t, \bar{v}_t))/\|B(\bar{u}_t, \bar{v}_t)\|_1$ and $c(B(\bar{u}_t, \bar{v}_t))/\|B(\bar{u}_t, \bar{v}_t)\|_1$. Since $B(\bar{u}_t, \bar{v}_t)$ does not have any special property, it is difficult to design an implementation trick to

Algorithm 7: Approximating OT by Algorithm 6
Input: $\eta = \frac{\varepsilon}{4\log(n)}$ and $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$.
Step 1: Let $\tilde{r} \in \Delta_n$ and $\tilde{c} \in \Delta_n$ be defined by $(\tilde{r}, \tilde{c}) = (1 - \frac{\varepsilon'}{8})(r, c) + \frac{\varepsilon'}{8n}(\mathbf{1}_n, \mathbf{1}_n)$.
Step 2: Compute $\tilde{X} = \text{Accelerated Sinkhorn}(C, \eta, \tilde{r}, \tilde{c}, \frac{\varepsilon'}{2})$.
Step 3: Round $\tilde{X}$ to $\hat{X}$ using Altschuler et al. [2017, Algorithm 2] such that $\hat{X}\mathbf{1}_n = r$ and $\hat{X}^\top\mathbf{1}_n = c$.
Output: $\hat{X}$.

reduce the order of $n$. As such, the arithmetic cost of each iteration is $O(n^2)$, exactly the same as for Sinkhorn [Cuturi, 2013], APDAGD [Dvurechensky et al., 2018] and AAM [Guminov et al., 2021]. Combining Algorithm 6 with Altschuler et al. [2017, Algorithm 2], we are ready to present the pseudocode of our main algorithm in Algorithm 7. The regularization parameter $\eta$ is set as before, and Step 1 is necessary since accelerated Sinkhorn is not well behaved if the marginal distributions have sparse support.

5.2 Technical lemmas

We first present two technical lemmas which are essential in the analysis of Algorithm 6. The first lemma provides an inductive relationship on the quantity

$$\delta_t = \varphi(\check{u}_t, \check{v}_t) - \varphi(u^\star, v^\star), \qquad (32)$$

where $(u^\star, v^\star)$ is an optimal solution of the dual entropic regularized OT problem in Eq. (5) that satisfies Lemma 2.3. To facilitate the discussion, we recall Eq. (10) with $\|A\|_{1\to2} = \sqrt{2}$ as follows:

$$\varphi(u', v') - \varphi(u, v) - \begin{pmatrix} u' - u \\ v' - v \end{pmatrix}^\top \nabla\varphi(u, v) \le \left\|\begin{pmatrix} u' - u \\ v' - v \end{pmatrix}\right\|^2, \qquad (33)$$

which will be used in the proof of the first lemma.

Lemma 5.1.
Let $\{(\check{u}_t, \check{v}_t)\}_{t\ge0}$ be the iterates generated by Algorithm 6 and $(u^\star, v^\star)$ be an optimal solution of the dual entropic regularized OT problem. Then, we have

$$\delta_{t+1} \le (1-\theta_t)\delta_t + \theta_t^2\left(\left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} u^\star - \tilde{u}_{t+1} \\ v^\star - \tilde{v}_{t+1} \end{pmatrix}\right\|^2\right).$$

Proof. Using Eq. (33) with $(u', v') = (\grave{u}_t, \grave{v}_t)$ and $(u, v) = (\bar{u}_t, \bar{v}_t)$, we have

$$\varphi(\grave{u}_t, \grave{v}_t) \le \varphi(\bar{u}_t, \bar{v}_t) + \theta_t\begin{pmatrix} \tilde{u}_{t+1} - \tilde{u}_t \\ \tilde{v}_{t+1} - \tilde{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) + \theta_t^2\left\|\begin{pmatrix} \tilde{u}_{t+1} - \tilde{u}_t \\ \tilde{v}_{t+1} - \tilde{v}_t \end{pmatrix}\right\|^2.$$

After simple calculations, we find that

$$\varphi(\bar{u}_t, \bar{v}_t) = (1-\theta_t)\varphi(\bar{u}_t, \bar{v}_t) + \theta_t\varphi(\bar{u}_t, \bar{v}_t),$$

$$\begin{pmatrix} \tilde{u}_{t+1} - \tilde{u}_t \\ \tilde{v}_{t+1} - \tilde{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) = -\begin{pmatrix} \tilde{u}_t - \bar{u}_t \\ \tilde{v}_t - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) + \begin{pmatrix} \tilde{u}_{t+1} - \bar{u}_t \\ \tilde{v}_{t+1} - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t).$$

Putting these pieces together yields that

$$\varphi(\grave{u}_t, \grave{v}_t) \le \underbrace{\theta_t\left(\varphi(\bar{u}_t, \bar{v}_t) + \begin{pmatrix} \tilde{u}_{t+1} - \bar{u}_t \\ \tilde{v}_{t+1} - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) + \theta_t\left\|\begin{pmatrix} \tilde{u}_{t+1} - \tilde{u}_t \\ \tilde{v}_{t+1} - \tilde{v}_t \end{pmatrix}\right\|^2\right)}_{\mathrm{I}} \qquad (34)$$
$$+ \underbrace{(1-\theta_t)\varphi(\bar{u}_t, \bar{v}_t) - \theta_t\begin{pmatrix} \tilde{u}_t - \bar{u}_t \\ \tilde{v}_t - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t)}_{\mathrm{II}}.$$

We first bound the term I. Indeed, by the update formula for $(\tilde{u}_{t+1}, \tilde{v}_{t+1})$ and the definition of $\nabla\varphi$, we have

$$\begin{pmatrix} u - \tilde{u}_{t+1} \\ v - \tilde{v}_{t+1} \end{pmatrix}^\top \left(\nabla\varphi(\bar{u}_t, \bar{v}_t) + 2\theta_t\begin{pmatrix} \tilde{u}_{t+1} - \tilde{u}_t \\ \tilde{v}_{t+1} - \tilde{v}_t \end{pmatrix}\right) = 0 \quad \text{for all } (u, v) \in \mathbb{R}^{2n}.$$

Letting $(u, v) = (u^\star, v^\star)$ and rearranging the resulting equation yields that

$$\begin{pmatrix} \tilde{u}_{t+1} - \bar{u}_t \\ \tilde{v}_{t+1} - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) = \begin{pmatrix} u^\star - \bar{u}_t \\ v^\star - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) + \theta_t\left(\left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} u^\star - \tilde{u}_{t+1} \\ v^\star - \tilde{v}_{t+1} \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} \tilde{u}_{t+1} - \tilde{u}_t \\ \tilde{v}_{t+1} - \tilde{v}_t \end{pmatrix}\right\|^2\right).$$

Using the convexity of $\varphi$, we have

$$\begin{pmatrix} u^\star - \bar{u}_t \\ v^\star - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) \le \varphi(u^\star, v^\star) - \varphi(\bar{u}_t, \bar{v}_t).$$

Putting these pieces together yields that

$$\mathrm{I} \le \theta_t\left(\varphi(u^\star, v^\star) + \theta_t\left(\left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} u^\star - \tilde{u}_{t+1} \\ v^\star - \tilde{v}_{t+1} \end{pmatrix}\right\|^2\right)\right).$$
(35)

We then bound the term II. Indeed, we see from the definition of $(\bar{u}_t, \bar{v}_t)$ that

$$-\theta_t\begin{pmatrix} \tilde{u}_t - \bar{u}_t \\ \tilde{v}_t - \bar{v}_t \end{pmatrix} = \theta_t\begin{pmatrix} \bar{u}_t \\ \bar{v}_t \end{pmatrix} + (1-\theta_t)\begin{pmatrix} \check{u}_t \\ \check{v}_t \end{pmatrix} - \begin{pmatrix} \bar{u}_t \\ \bar{v}_t \end{pmatrix} = (1-\theta_t)\begin{pmatrix} \check{u}_t - \bar{u}_t \\ \check{v}_t - \bar{v}_t \end{pmatrix}.$$

Combining the above equation with the convexity of $\varphi$, we have

$$\mathrm{II} = (1-\theta_t)\left(\varphi(\bar{u}_t, \bar{v}_t) + \begin{pmatrix} \check{u}_t - \bar{u}_t \\ \check{v}_t - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t)\right) \le (1-\theta_t)\varphi(\check{u}_t, \check{v}_t). \qquad (36)$$

Plugging Eq. (35) and Eq. (36) into Eq. (34) yields that

$$\varphi(\grave{u}_t, \grave{v}_t) \le (1-\theta_t)\varphi(\check{u}_t, \check{v}_t) + \theta_t\varphi(u^\star, v^\star) + \theta_t^2\left(\left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} u^\star - \tilde{u}_{t+1} \\ v^\star - \tilde{v}_{t+1} \end{pmatrix}\right\|^2\right).$$

Since $(\check{u}_{t+1}, \check{v}_{t+1})$ is obtained by a coordinate update from $(u_t, v_t)$, we have $\varphi(u_t, v_t) \ge \varphi(\check{u}_{t+1}, \check{v}_{t+1})$. By the definition of $(u_t, v_t)$, we have $\varphi(\hat{u}_t, \hat{v}_t) \ge \varphi(u_t, v_t)$. Since $(\hat{u}_t, \hat{v}_t)$ is obtained by a coordinate update from $(\grave{u}_t, \grave{v}_t)$, we have $\varphi(\grave{u}_t, \grave{v}_t) \ge \varphi(\hat{u}_t, \hat{v}_t)$. Collecting all of these results leads to

$$\varphi(\check{u}_{t+1}, \check{v}_{t+1}) - \varphi(u^\star, v^\star) \le (1-\theta_t)\big(\varphi(\check{u}_t, \check{v}_t) - \varphi(u^\star, v^\star)\big) + \theta_t^2\left(\left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} u^\star - \tilde{u}_{t+1} \\ v^\star - \tilde{v}_{t+1} \end{pmatrix}\right\|^2\right).$$

This completes the proof.

The second lemma provides an upper bound for $\delta_t$ defined by Eq. (32), where $\{(\check{u}_t, \check{v}_t)\}_{t\ge0}$ are generated by Algorithm 6 and $(u^\star, v^\star)$ is an optimal solution defined by Corollary 2.3.

Lemma 5.2. Let $\{(\check{u}_t, \check{v}_t)\}_{t\ge0}$ be the iterates generated by Algorithm 6 and $(u^\star, v^\star)$ be an optimal solution of the dual entropic regularized OT problem satisfying $\|(u^\star, v^\star)\| \le \sqrt{2n}R$, where $R$ is defined in Corollary 2.3. Then, we have

$$\delta_t \le \frac{8nR^2}{(t+1)^2}.$$

Proof. By simple calculations, we derive from the definition of $\theta_t$ that $\frac{\theta_{t+1}}{\theta_t} = \sqrt{1 - \theta_{t+1}}$.
Therefore, we conclude from Lemma 5.1 that

$$\frac{1-\theta_{t+1}}{\theta_{t+1}^2}\delta_{t+1} - \frac{1-\theta_t}{\theta_t^2}\delta_t \le \left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} u^\star - \tilde{u}_{t+1} \\ v^\star - \tilde{v}_{t+1} \end{pmatrix}\right\|^2.$$

Equivalently, by telescoping, we have

$$\frac{1-\theta_t}{\theta_t^2}\delta_t + \left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 \le \frac{1-\theta_0}{\theta_0^2}\delta_0 + \left\|\begin{pmatrix} u^\star - \tilde{u}_0 \\ v^\star - \tilde{v}_0 \end{pmatrix}\right\|^2.$$

Since $\theta_0 = 1$ and $\tilde{u}_0 = \tilde{v}_0 = \mathbf{0}_n$, we have

$$\delta_t \le \theta_{t-1}^2\|(u^\star, v^\star)\|^2 \le 2nR^2\theta_{t-1}^2.$$

The remaining step is to show that $0 < \theta_t \le \frac{2}{t+2}$. Indeed, the claim holds when $t = 0$ as we have $\theta_0 = 1$. Assuming the claim holds for $t = t_0$, i.e., $\theta_{t_0} \le \frac{2}{t_0+2}$, we have

$$\theta_{t_0+1} = \frac{2}{1 + \sqrt{1 + 4/\theta_{t_0}^2}} \le \frac{2}{t_0 + 3}.$$

Putting these pieces together yields the desired inequality for $\delta_t$.

5.3 Main results

We present an upper bound on the number of iterations required by Algorithm 6. Note that the per-iteration progress of Algorithm 6 is measured by the function $\rho: \mathbb{R}_+^n \times \mathbb{R}_+^n \to \mathbb{R}_+$ given by

$$\rho(a, b) := \mathbf{1}_n^\top(b - a) + \sum_{i=1}^n a_i\log\left(\frac{a_i}{b_i}\right).$$

Theorem 5.3. Let $\{(u_t, v_t)\}_{t\ge0}$ be the iterates generated by Algorithm 6. The number of iterations required to reach the stopping criterion $E_t \le \varepsilon'$ satisfies

$$t \le 1 + \left(\frac{16\sqrt{n}R}{\varepsilon'}\right)^{2/3},$$

where $R > 0$ is defined in Lemma 2.2.

Proof. We first claim that

$$\varphi(u_t, v_t) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) \ge \frac{1}{2}\left(\|r(B(u_t, v_t)) - r\|_1^2 + \|c(B(u_t, v_t)) - c\|_1^2\right). \qquad (37)$$

By the definition of $\varphi$, we have

$$\varphi(u_t, v_t) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) = \log(\|B(u_t, v_t)\|_1) - \log(\|B(\check{u}_{t+1}, \check{v}_{t+1})\|_1) - (u_t - \check{u}_{t+1})^\top r - (v_t - \check{v}_{t+1})^\top c. \qquad (38)$$

From the update formulas for $(\hat{u}_t, \hat{v}_t)$ and $(\check{u}_{t+1}, \check{v}_{t+1})$, it is clear that $\|B(\hat{u}_t, \hat{v}_t)\|_1 = 1$ and $\|B(\check{u}_{t+1}, \check{v}_{t+1})\|_1 = 1$ for all $t \ge 0$. Then, we derive from the update formula for $(u_t, v_t)$ that $\|B(u_t, v_t)\|_1 = 1$ for all $t \ge 1$.
Therefore, we have

$$\varphi(u_t, v_t) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) = -(u_t - \check{u}_{t+1})^\top r - (v_t - \check{v}_{t+1})^\top c = \big(\log(r) - \log(r(B(u_t, v_t)))\big)^\top r + \big(\log(c) - \log(c(B(u_t, v_t)))\big)^\top c.$$

Since $\mathbf{1}_n^\top r = \mathbf{1}_n^\top r(B(u_t, v_t)) = \mathbf{1}_n^\top c = \mathbf{1}_n^\top c(B(u_t, v_t)) = 1$, we have

$$\varphi(u_t, v_t) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) = \rho(r, r(B(u_t, v_t))) + \rho(c, c(B(u_t, v_t))).$$

Using Altschuler et al. [2017, Lemma 4], we derive Eq. (37) as desired. By the definition of $(u_t, v_t)$, we have $\varphi(\check{u}_t, \check{v}_t) \ge \varphi(u_t, v_t)$. Plugging this inequality into Eq. (37) together with the Cauchy–Schwarz inequality yields

$$\varphi(\check{u}_t, \check{v}_t) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) \ge \frac{1}{4}E_t^2.$$

Therefore, we conclude that

$$\varphi(\check{u}_j, \check{v}_j) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) \ge \frac{1}{4}\sum_{i=j}^t E_i^2 \quad \text{for any } j \in \{1, 2, \ldots, t\}.$$

Since $\varphi(\check{u}_{t+1}, \check{v}_{t+1}) \ge \varphi(u^\star, v^\star)$ for all $t \ge 1$, we have $\varphi(\check{u}_j, \check{v}_j) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) \le \delta_j$. Then, it follows from Lemma 5.2 that

$$\sum_{i=j}^t E_i^2 \le \frac{32nR^2}{(j+1)^2}.$$

Putting these pieces together with the fact that $E_i \ge \varepsilon'$ as long as the stopping criterion is not fulfilled yields

$$\frac{32nR^2}{(j+1)^2(t-j+1)} \ge (\varepsilon')^2.$$

Since this inequality holds true for all $j \in \{1, 2, \ldots, t\}$, we assume without loss of generality that $t$ is even and let $j = t/2$. Then, we obtain that

$$t \le 1 + \left(\frac{16\sqrt{n}R}{\varepsilon'}\right)^{2/3}.$$

This completes the proof of the theorem.

We are ready to present the complexity bound of Algorithm 7 for solving the OT problem in Eq. (1). Note that $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$ is defined using the desired accuracy $\varepsilon > 0$.

Theorem 5.4. The accelerated Sinkhorn scheme for approximating optimal transport (Algorithm 7) returns an $\varepsilon$-approximate transportation plan (cf. Definition 1) in

$$O\left(\frac{n^{7/3}\|C\|_\infty^{4/3}(\log(n))^{1/3}}{\varepsilon^{4/3}}\right)$$

arithmetic operations.

Proof.
Applying the same argument as in Theorem 3.8, we obtain that $\langle C, \hat{X}\rangle - \langle C, X^\star\rangle \le \varepsilon$, where $\tilde{X} = \text{Accelerated Sinkhorn}(C, \eta, \tilde{r}, \tilde{c}, \frac{\varepsilon'}{2})$ in Step 2 of Algorithm 7. It remains to bound the number of iterations required by Algorithm 6 to reach the stopping criterion $E_t \le \frac{\varepsilon'}{2}$. Using Theorem 5.3, we have

$$t \le 1 + \left(\frac{32\sqrt{n}R}{\varepsilon'}\right)^{2/3}.$$

By the definition of $R$ (cf. Lemma 2.2), $\eta = \frac{\varepsilon}{4\log(n)}$ and $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$, we have

$$t \le 1 + \left(\frac{32\sqrt{n}R}{\varepsilon'}\right)^{2/3} \le 1 + \left(\frac{256\sqrt{n}\|C\|_\infty}{\varepsilon}\left(\frac{\|C\|_\infty}{\eta} + \log(n) - \log\Big(\min_{1\le i,j\le n}\{r_i, c_j\}\Big)\right)\right)^{2/3}$$
$$\le 1 + \left(\frac{256\sqrt{n}\|C\|_\infty}{\varepsilon}\left(\frac{4\log(n)\|C\|_\infty}{\varepsilon} + \log(n) - \log\Big(\frac{\varepsilon}{64n\|C\|_\infty}\Big)\right)\right)^{2/3} = O\left(\frac{n^{1/3}\|C\|_\infty^{4/3}(\log(n))^{1/3}}{\varepsilon^{4/3}}\right).$$

Since each iteration of Algorithm 6 requires $O(n^2)$ arithmetic operations, the total number of arithmetic operations required by Step 2 of Algorithm 7 is $O(n^{7/3}\|C\|_\infty^{4/3}(\log(n))^{1/3}\varepsilon^{-4/3})$. Computing the two vectors $\tilde{r}$ and $\tilde{c}$ in Step 1 of Algorithm 7 requires $O(n)$ arithmetic operations, and Altschuler et al. [2017, Algorithm 2] requires $O(n^2)$ arithmetic operations. Therefore, the complexity bound of Algorithm 7 is $O(n^{7/3}\|C\|_\infty^{4/3}(\log(n))^{1/3}\varepsilon^{-4/3})$.

Remark 5.5. Theorem 5.4 shows that the complexity bound of accelerated Sinkhorn is better than that of Sinkhorn and Greenkhorn in terms of $1/\varepsilon$ but appears not to be near-linear in $n^2$. Thus, our algorithm is recommended when $n \ll 1/\varepsilon$. This occurs when the desired solution accuracy is relatively small, say $10^{-4}$; examples include application problems from economics and operations research. In contrast, Sinkhorn and Greenkhorn are recommended when $n \gg 1/\varepsilon$. This occurs when the desired solution accuracy is relatively large, say $10^{-2}$; examples include application problems from image processing.
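To make the scheme concrete before turning to experiments, the following is a simplified Python sketch of the main ingredients of Algorithms 6 and 7 under our reading of the pseudocode: the matrix $B(u, v) = \mathrm{diag}(e^u)\,e^{-C/\eta}\,\mathrm{diag}(e^v)$, the extrapolation and gradient-type update of the $(\tilde{u}, \tilde{v})$ sequence, the monotone search, the even/odd row/column updates, the recursion for $\theta_t$, and the rounding step of Altschuler et al. [2017, Algorithm 2]. This is a sketch, not the authors' implementation; all function names are ours.

```python
import numpy as np

def accelerated_sinkhorn_sketch(C, eta, r, c, eps, max_iter=5000):
    """Simplified sketch of Algorithm 6. The greedy coordinate choice is
    rendered as the even/odd row/column alternation of the pseudocode."""
    n = C.shape[0]
    K = np.exp(-C / eta)

    def Bmat(u, v):  # B(u, v) = diag(e^u) K diag(e^v)
        return np.exp(u)[:, None] * K * np.exp(v)[None, :]

    def phi(u, v):  # dual objective of Eq. (5)
        return np.log(Bmat(u, v).sum()) - u @ r - v @ c

    u_chk, v_chk = np.zeros(n), np.zeros(n)
    u_til, v_til = np.zeros(n), np.zeros(n)
    theta, X, E = 1.0, Bmat(np.zeros(n), np.zeros(n)), np.inf
    for t in range(max_iter):
        # Extrapolation: (bar) = (1 - theta)(check) + theta(tilde).
        u_bar = (1 - theta) * u_chk + theta * u_til
        v_bar = (1 - theta) * v_chk + theta * v_til
        Bb = Bmat(u_bar, v_bar); m = Bb.sum()
        # Gradient-type update of the tilde (estimate) sequence.
        u_new = u_til - (Bb.sum(axis=1) / m - r) / (2 * theta)
        v_new = v_til - (Bb.sum(axis=0) / m - c) / (2 * theta)
        u_grv = u_bar + theta * (u_new - u_til)
        v_grv = v_bar + theta * (v_new - v_til)
        u_til, v_til = u_new, v_new
        # Coordinate (row or column) update of the grave point.
        Xg = Bmat(u_grv, v_grv)
        if t % 2 == 0:
            u_hat, v_hat = u_grv + np.log(r) - np.log(Xg.sum(axis=1)), v_grv
        else:
            u_hat, v_hat = u_grv, v_grv + np.log(c) - np.log(Xg.sum(axis=0))
        # Monotone search: keep the better of (check) and (hat).
        if phi(u_hat, v_hat) <= phi(u_chk, v_chk):
            u_t, v_t = u_hat, v_hat
        else:
            u_t, v_t = u_chk, v_chk
        X = Bmat(u_t, v_t)
        E = np.abs(X.sum(axis=1) - r).sum() + np.abs(X.sum(axis=0) - c).sum()
        if E <= eps:
            break
        # Coordinate update defining the next (check) iterate.
        if t % 2 == 0:
            u_chk, v_chk = u_t + np.log(r) - np.log(X.sum(axis=1)), v_t
        else:
            u_chk, v_chk = u_t, v_t + np.log(c) - np.log(X.sum(axis=0))
        # theta_{t+1} = theta_t (sqrt(theta_t^2 + 4) - theta_t) / 2.
        theta = theta * (np.sqrt(theta**2 + 4) - theta) / 2
    return X, E

def round_to_polytope(X, r, c):
    """Rounding of Altschuler et al. [2017, Algorithm 2]: scale rows and
    columns down to feasibility, then fix the deficit by a rank-one term."""
    X = np.minimum(r / X.sum(axis=1), 1.0)[:, None] * X
    X = X * np.minimum(c / X.sum(axis=0), 1.0)[None, :]
    err_r, err_c = r - X.sum(axis=1), c - X.sum(axis=0)
    if err_r.sum() > 0:
        X = X + np.outer(err_r, err_c) / err_r.sum()
    return X
```

The rounding step guarantees exact marginals, so the returned plan is feasible for the OT problem in Eq. (1) even though Algorithm 6 only drives $E_t$ below a tolerance.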
6 Experiments

In this section, we conduct experiments to evaluate Greenkhorn, accelerated Sinkhorn and APDAMD on synthetic data and real images from the MNIST Digits dataset¹. We use

¹ http://yann.lecun.com/exdb/mnist/

[Figure 1 here. Panels (by row: Sinkhorn vs. Greenkhorn, APDAGD vs. APDAMD, Sinkhorn vs. accelerated Sinkhorn): distance to the transportation polytope $\|r(P) - r\|_1 + \|c(P) - c\|_1$ against row/column updates; spread (max/median/min) of the log competitive ratio; and value of OT against the true optimum with $1/\eta \in \{1, 5, 9\}$.]
Figure 1: Comparative performance of Sinkhorn vs. Greenkhorn, APDAGD vs. APDAMD and Sinkhorn vs. accelerated Sinkhorn on synthetic images.

Sinkhorn [Cuturi, 2013], APDAGD [Dvurechensky et al., 2018] and GCPB² [Genevay et al., 2016] as the baseline approaches. Since the focus of this paper is the entropic regularized algorithms, we exclude the combinatorial algorithms from our experiments and refer to Dong et al. [2020] for an excellent comparative study. In the literature, Greenkhorn and APDAGD were shown to outperform the Sinkhorn algorithm in terms of row/column updates [Altschuler et al., 2017, Dvurechensky et al., 2018], and we repeat the comparisons for the sake of completeness. For parameter tuning in the implementation of Greenkhorn, accelerated Sinkhorn and APDAMD, we follow most of the setups shown in Algorithms 1, 3 and 6 but employ a more aggressive choice of stepsize for the coordinate gradient updates in Algorithm 6. To obtain an optimal value of the OT problem, we employ the default LP solver in MATLAB.

² GCPB is simply an application of stochastic averaged gradient [Schmidt et al., 2017] for solving the dual entropic regularized OT problem.

6.1 Synthetic images

To generate the synthetic images, we adopt the process from Altschuler et al. [2017] and evaluate the performance of different algorithms on these synthetic images. The transportation distance is defined between two synthetic images, while the cost matrix is defined based on the $\ell_1$ distances among locations of pixels in the images. Each image is of size 20 by 20 pixels and is generated by randomly placing a foreground square in a black background. Furthermore, a uniform distribution on $[0, 1]$ is used for the intensities of the pixels in the background, while a uniform distribution on $[0, 50]$ is employed for the pixels in the foreground.
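The generation process just described can be sketched in a few lines of Python (a sketch under our reading of the setup; the function name and the way the square is placed are our assumptions):

```python
import numpy as np

def synthetic_image(size=20, fg_fraction=0.1, rng=None):
    """Generate one synthetic image as a probability measure over pixels:
    uniform [0, 1] background intensities with a randomly placed foreground
    square, covering roughly `fg_fraction` of the image, whose intensities
    are uniform on [0, 50]."""
    rng = np.random.default_rng() if rng is None else rng
    img = rng.uniform(0.0, 1.0, size=(size, size))
    side = max(1, int(round(np.sqrt(fg_fraction) * size)))  # square side
    i = rng.integers(0, size - side + 1)
    j = rng.integers(0, size - side + 1)
    img[i:i + side, j:j + side] = rng.uniform(0.0, 50.0, size=(side, side))
    # Normalize into a probability measure over the pixel grid.
    return (img / img.sum()).ravel()
```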
We fix the proportion of the size of the foreground square at 10% of the whole image and implement all candidate algorithms. We use two standard metrics to assess the performance of the candidate algorithms. The first metric $d(\mathcal{A})$ is the $\ell_1$ distance between the row/column marginals output by an algorithm $\mathcal{A}$ and the corresponding transportation polytope of the probability measures, given by $d(\mathcal{A}) := \|r(\mathcal{A}) - r\|_1 + \|c(\mathcal{A}) - c\|_1$, where $r(\mathcal{A})$ and $c(\mathcal{A})$ are the row and column marginals obtained from the output of algorithm $\mathcal{A}$, and $r$ and $c$ are the row and column marginal vectors of the original probability measures. The second metric is the competitive ratio $\log(d(\mathcal{A}_1)/d(\mathcal{A}_2))$, where $d(\mathcal{A}_1)$ and $d(\mathcal{A}_2)$ are the distances between the row/column outputs of algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$ and the transportation polytope. We perform three pairwise comparative experiments on 10 randomly generated datasets: Sinkhorn vs. Greenkhorn, APDAGD vs. APDAMD and Sinkhorn vs. accelerated Sinkhorn. To further evaluate these algorithms, we compare their performance with respect to different choices of the regularization parameter $\eta \in \{1, \frac{1}{5}, \frac{1}{9}\}$, using the value of the OT problem as the baseline. The maximum number of iterations is $T = 5$.

Figure 1 summarizes the experimental results. The images in the first row show the comparative performance of Sinkhorn and Greenkhorn in terms of row/column updates. In the leftmost image, the comparison uses the distance to the transportation polytope $d(\mathcal{A})$, where $\mathcal{A}$ is either Sinkhorn or Greenkhorn. In the middle image, the maximum, median and minimum values of the competitive ratios $\log(d(\mathcal{A}_1)/d(\mathcal{A}_2))$ on 10 images are utilized for the comparison, where $\mathcal{A}_1$ is Sinkhorn and $\mathcal{A}_2$ is Greenkhorn.
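For reference, the two evaluation metrics defined above amount to a few lines of code (a sketch; function names ours):

```python
import numpy as np

def distance_to_polytope(X, r, c):
    """d(A): l1 distance between the row/column marginals of an output
    plan X and the target marginals (r, c)."""
    return np.abs(X.sum(axis=1) - r).sum() + np.abs(X.sum(axis=0) - c).sum()

def competitive_ratio(X1, X2, r, c):
    """log(d(A1) / d(A2)); positive values favor algorithm A2."""
    return np.log(distance_to_polytope(X1, r, c)
                  / distance_to_polytope(X2, r, c))
```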
In the rightmost image, we vary the regularization parameter $\eta \in \{1, \frac{1}{5}, \frac{1}{9}\}$ for these algorithms, using the value of the unregularized OT problem as the baseline. The other rows of images present comparative results for APDAGD vs. APDAMD and Sinkhorn vs. accelerated Sinkhorn. We find that (i) Greenkhorn outperforms Sinkhorn in terms of row/column updates, illustrating the improvement from greedy coordinate descent; (ii) APDAMD with $\delta = n$ and $\varphi = \frac{1}{2n}\|\cdot\|^2$ is more robust than APDAGD, illustrating the advantage of using mirror descent and line search with $\|\cdot\|_\infty$; (iii) accelerated Sinkhorn outperforms Sinkhorn in terms of row/column updates, illustrating the improvement from the estimate sequence and monotone search.

6.2 MNIST images

We proceed to the comparison between different algorithms on real images, using essentially the same evaluation metrics as for the synthetic images. The MNIST dataset consists of 60,000 images of handwritten digits of size 28 by 28 pixels. To ensure that the masses of the probability measures are dense, which leads to a tight dependence on $n$ for our algorithms, we add a very small noise term ($10^{-6}$) to all zero elements in the measures and then normalize them so that their sum is 1. The maximum number of iterations is $T = 5$. Figures 2 and 3 summarize the experimental results on MNIST. In the first row of Figure 2, we compare Sinkhorn and Greenkhorn in terms of row/column updates.
The leftmost image specifies the distances $d(\mathcal{A})$ to the transportation polytope for the algorithm $\mathcal{A}$, which is either Sinkhorn or Greenkhorn; the middle image specifies the maximum, median and minimum of

[Figure 2 here. Panels (by row: Sinkhorn vs. Greenkhorn, APDAGD vs. APDAMD, Sinkhorn vs. accelerated Sinkhorn on MNIST): distance to the transportation polytope against row/column updates; spread (max/median/min) of the log competitive ratio; and value of OT against the true optimum with $1/\eta \in \{1, 5, 9\}$.]

Figure 2: Comparative performance of Sinkhorn vs. Greenkhorn, APDAGD vs.
APDAMD and Sinkhorn vs. accelerated Sinkhorn on the MNIST real images.

competitive ratios $\log(d(\mathcal{A}_1)/d(\mathcal{A}_2))$ on ten random pairs of MNIST images, where $\mathcal{A}_1$ and $\mathcal{A}_2$ respectively correspond to Sinkhorn and Greenkhorn; the rightmost image specifies the values of the entropic regularized OT problem with varying regularization parameter $\eta \in \{1, \frac{1}{5}, \frac{1}{9}\}$. The remaining rows present comparative results for APDAGD vs. APDAMD and Sinkhorn vs. accelerated Sinkhorn. We observe that (i) the comparative performances of Sinkhorn vs. Greenkhorn and APDAGD vs. APDAMD are consistent with those on synthetic images; (ii) accelerated Sinkhorn deteriorates but remains better than Sinkhorn; (iii) APDAMD is more robust than APDAGD and GCPB.

7 Conclusion

We first show that the complexity bound of Greenkhorn can be improved to $\widetilde{O}(n^2\varepsilon^{-2})$, which matches the best known bound of Sinkhorn. Then, we propose APDAMD by generalizing APDAGD with a prespecified mirror mapping $\varphi$ and show that it achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$, where $\delta > 0$ refers to the regularity of $\varphi$. We show that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ previously claimed for APDAGD is invalid and prove a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$.
Moreover, we propose a deterministic accelerated variant of Sinkhorn

[Figure 3 here. Panels: value of OT against time for GCPB (stepsizes $1/(Ln)$, $3/(Ln)$, $5/(Ln)$), APDAGD and APDAMD, one panel per $\eta \in \{1, \frac{1}{5}, \frac{1}{9}\}$.]

Figure 3: Performance of GCPB, APDAGD and APDAMD in terms of time on the MNIST real images. These images specify the values of entropic regularized OT with varying regularization parameter $\eta \in \{1, \frac{1}{5}, \frac{1}{9}\}$, demonstrating the robustness of APDAMD.

via appeal to estimate sequence techniques and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that accelerated Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$, and APDAGD and AAM in terms of $n$. Experiments on synthetic data and real images demonstrate the efficiency of our algorithms.

There are a few promising future directions arising from this work. First, it is important to develop fast algorithms to compute dimension-reduced versions of OT. Indeed, OT suffers from the curse of dimensionality [Dudley, 1969, Fournier and Guillin, 2015], which means that a large number of samples from two continuous measures is necessary to approximate the true OT between them. This can be mitigated when data lie on low-dimensional manifolds [Weed and Bach, 2019, Paty and Cuturi, 2019], but the sample complexity still remains pessimistic even in that case.
This motivates recent works on efficient dimension-reduced OT, e.g., sliced OT [Bonneel et al., 2015], generalized sliced OT [Kolouri et al., 2019] and distributional sliced OT [Nguyen et al., 2021], further inspiring us to explore the application of our algorithms to these settings and eventually to automatic differentiation schemes. Second, there have been several application problems arising from the interplay between OT and adversarial ML; see Bhagoji et al. [2019] and Pydi and Jog [2020] for examples. However, it is known that OT has robustness issues when there are outliers in the supports of the probability measures. Robust OT has been introduced to deal with these robustness issues [Balaji et al., 2020]; the idea is to relax the marginal constraints via certain probability divergences, such as the KL divergence, so as to limit the amount of mass that the transportation plan assigns to the outliers in the supports of the measures. As with OT, a key practical question for robust OT is computational. As such, we plan to develop efficient algorithms for the robust OT problem in future work.

8 Acknowledgments

This work was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764 to MJ, and by the NSF IFML 2019844 award and research gifts by the UT Austin ML grant to NH.

References

B. K. Abid and R. M. Gower. Greedy stochastic algorithms for entropy-regularized optimal transport problems. In AISTATS, 2018. (Cited on page 15.)

Z. Allen-Zhu, Y. Li, R. Oliveira, and A. Wigderson. Much faster algorithms for matrix scaling. In FOCS, pages 890–901. IEEE, 2017. (Cited on page 3.)

J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In NeurIPS, pages 1964–1974, 2017.
(Cited on pages 1, 2, 3, 7, 9, 10, 13, 14, 15, 17, 22, 24, 26, 29, 30, and 31.)

J. Altschuler, F. Bach, A. Rudi, and J. Niles-Weed. Massively scalable Sinkhorn distances via the Nyström method. In NeurIPS, pages 4429–4439, 2019. (Cited on page 3.)

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, pages 214–223, 2017. (Cited on page 1.)

Y. Balaji, R. Chellappa, and S. Feizi. Robust optimal transport with applications in generative modeling and domain adaptation. In NeurIPS, 2020. (Cited on pages 2, 3, and 34.)

E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. On parameter estimation with the Wasserstein distance. Information and Inference: A Journal of the IMA, 8(4):657–676, 2019. (Cited on page 2.)

D. Bertsimas and J. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997. (Cited on page 2.)

A. N. Bhagoji, D. Cullina, and P. Mittal. Lower bounds on adversarial robustness from optimal transport. In NeurIPS, pages 7498–7510, 2019. (Cited on page 34.)

J. Blanchet and K. Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, 2019. (Cited on page 2.)

J. Blanchet, A. Jambulapati, C. Kent, and A. Sidford. Towards optimal running times for optimal transport. ArXiv Preprint: 1810.07717, 2018. (Cited on pages 3 and 11.)

J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019. (Cited on page 2.)

M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. In AISTATS, pages 880–889, 2018. (Cited on page 3.)

N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
(Cited on page 34.)

M. Carrière, M. Cuturi, and S. Oudot. Sliced Wasserstein kernel for persistence diagrams. In ICML, pages 1–10, 2017. (Cited on page 1.)

D. Chakrabarty and S. Khanna. Better and simpler error analysis of the Sinkhorn–Knopp algorithm for matrix scaling. In 1st Symposium on Simplicity in Algorithms, 2018. (Cited on page 2.)

L. Chizat, P. Roussillon, F. Léger, F-X. Vialard, and G. Peyré. Faster Wasserstein distance estimation with the Sinkhorn divergence. In NeurIPS, pages 2257–2269, 2020. (Cited on pages 2 and 6.)

M. B. Cohen, A. Madry, D. Tsipras, and A. Vladu. Matrix scaling and balancing via box constrained Newton's method and interior point methods. In FOCS, pages 902–913. IEEE, 2017. (Cited on page 3.)

N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017. (Cited on pages 1 and 3.)

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, pages 2292–2300, 2013. (Cited on pages 2, 5, 7, 9, 26, and 31.)

M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In ICML, pages 685–693, 2014. (Cited on page 1.)

M. Cuturi and G. Peyré. A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1):320–343, 2016. (Cited on page 3.)

M. Cuturi, O. Teboul, and J-P. Vert. Differentiable ranks and sorting using optimal transport. In NeurIPS, pages 6861–6871, 2019. (Cited on page 3.)

Y. Dong, Y. Gao, R. Peng, I. Razenshteyn, and S. Sawlani. A study of performance of optimal transport. ArXiv Preprint: 2005.01182, 2020. (Cited on pages 3, 4, and 31.)

R. M. Dudley. The speed of mean Glivenko–Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50, 1969. (Cited on pages 2, 6, and 34.)

P.
Dvurechenskii, D. Dvinskikh, A. Gasnikov, C. Uribe, and A. Nedich. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. In NeurIPS, pages 10783–10793, 2018. (Cited on page 1.)

P. Dvurechensky, A. Gasnikov, and A. Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In ICML, pages 1367–1376, 2018. (Cited on pages 1, 2, 3, 7, 9, 10, 11, 13, 15, 16, 22, 23, 24, 26, and 31.)

O. Fercoq and P. Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015. (Cited on page 25.)

J. Feydy, T. Séjourné, F-X. Vialard, S-I. Amari, A. Trouvé, and G. Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In AISTATS, pages 2681–2690. PMLR, 2019. (Cited on page 2.)

R. Flamary and N. Courty. POT: Python optimal transport library, 2017. URL https://github.com/rflamary/POT. (Cited on page 4.)

N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015. (Cited on pages 2, 6, and 34.)

H. N. Gabow and R. E. Tarjan. Faster scaling algorithms for general graph matching problems. Journal of the ACM (JACM), 38(4):815–853, 1991. (Cited on page 2.)

A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In NeurIPS, pages 3440–3448, 2016. (Cited on pages 3 and 31.)

A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity of Sinkhorn divergences. In AISTATS, pages 1574–1583. PMLR, 2019. (Cited on pages 2 and 6.)

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NeurIPS, pages 5767–5777, 2017.
(Cited on page 1 .) S. Gumin o v, P . Dvurec hensky , N. T up itsa, and A. Gasniko v. On a combinatio n of alte rn ating minimization and Nestero v’s moment um . In ICML , pages 3886 –3898. PMLR, 2021. (Cited on pages 1 , 3 , 7 , 16 , 24 , and 26 .) W. Guo, N. Ho, and M. Jordan. F ast algorithms f or computational optimal transp ort and Wasserstein b arycen ter. In AIST A TS , pages 2088–2097 . PMLR, 2020. (Cited on pages 3 and 16 .) N. Ho, V. Huyn h, D. P h ung, and M. I. Jordan. Probabilistic m ultilev el clustering via com- p osite transp ortation distance. AIST A TS , pages 3149 –3157, 2019. (Cited on p age 1 .) A. Jam bu lapati, A. Sid ford, and K. Tian. A dir ect tilde { O } (1/epsilon) iteratio n parallel algorithm f or optimal transp ort. In NeurIPS , pages 11355–113 66, 2019. (Cited on p ages 4 and 16 .) B. Kalan tari and L. Khac hiya n. On the complexit y of n onnegativ e m atrix scaling. Line ar Algebr a and its A pplic ations , 240:87 –103, 1996. (Cited on page 2 .) B. Kalan tari, I. Lari, F. Ricca, and B. Simeone. O n the complexit y of general matrix scaling and entrop y minimization via the RAS algorithm. Mathematic al Pr o gr amming , 11 2(2): 371–4 01, 2008. (Cited on page 2 .) L. V. Kan toro vic h. On the translo cation of masses. In Dokl. Akad . Nauk. USSR (NS) , v olume 37, p ages 199–201, 1942. ( Cited on pages 1 and 5 .) P . A. Knigh t. Th e Sinkhorn-Kn op p algorithm: con v ergence and applications. SIAM Journal on Matrix Anal ysis and Applic ations , 30(1) :261–275, 2008. (Cited on page 2 .) S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. Roh d e. Generalized sliced Wasserstein distances. In NeurIPS , pages 261– 272, 2019. (Cited on p age 34 .) H. W. Kuhn. T he Hungarian m etho d for the assignmen t problem. Naval R ese ar ch L o gistics Quarterly , 2(1-2):83–9 7, 1955. (Cited on page 2 .) H. W. K uhn. V ariant s of the Hun garian metho d for assignmen t pr oblems. N aval R ese ar ch L o gistics Quarterly , 3(4):2 53–258, 1956. 
(Cited on page 2 .) N. Lahn, D. Mu lc handani, and S . Ragh ve nd ra. A graph theoretic additiv e app ro ximation of optimal tr ansp ort. In Neu rIPS , pages 13813–1 3823, 2019. ( Cited on pages 2 and 4 .) Y. T. Lee and A. Sidford. Pa th fin ding metho ds for linear p r ogramming: Solving linear programs in e O (sqr t(rank)) iterations and faster algorithms for maxim um flo w. In F OCS , pages 424 –433. IEEE, 2014. (Cited on p age 2 .) 37 J. Lei. Con verge nce and concen tration of empir ical m easur es un d er w asserstein d istance in unboun ded functional spaces. Bernoul li , 26(1):76 7–798, 2020. (Cited on page 2 .) Q. Lin, Z. Lu, and L. Xiao. An accelerate d randomized pro ximal co ordinate gradien t metho d and its application to regularized empirical risk minimizatio n. SIAM Journal on Optimiza- tion , 25(4):2244– 2273, 2015. (Cited on p age 25 .) T. Lin , N. Ho, and M. Jordan. On efficien t optimal tr ansp ort: An analysis of greedy and accele rated mirr or d escen t algorithms. In ICML , pages 3982–39 91, 2019a . ( Cited on pages 3 , 7 , and 10 .) T. Lin, Z. Hu, and X. Guo. Sparsemax and relaxed Wasserstein for topic sparsity . I n WSDM , pages 141 –149. A CM, 2019 b. (Cited on p age 1 .) H. Lu, R. F reund , and V. Mirrokni. Accelerating greedy co ordin ate descent metho d s. In ICML , pages 3263–327 2, 2018. (Cited on p age 25 .) G. Mena and J. Niles-W eed. Statistical b oun ds for entropic optimal transp ort: s amp le com- plexit y and the cen tral limit theorem. In NeurIPS , pages 4541–45 51, 2019. (Cited on pages 2 and 6 .) G. Monge. M ´ emoire sur la th ´ eorie des d´ eb lais et des rem blais. Histoir e de l’A c ad´ emie R oyale des Scienc es de Paris , 1781 . (Cited on page 1 .) J. Munkres. Algorithms for the assignmen t and transp ortation p roblems. Journal of the So cie ty for Industrial and Applie d M athematics , 5(1):32 –38, 1957. (Cited on page 2 .) Y. Nestero v. Smo oth minimization of non-smo oth fu nctions. 
Mathematic al Pr o gr amm ing , 103 (1):12 7–152, 2005. ( Cited on pages 7 , 8 , and 16 .) Y. Nestero v. L e ctur es on Convex Optimization , v olume 137. Springer, 2018. (Cited on page 24 .) K. Nguyen, N. Ho, T. Pham, and H. Bui. Distributional sliced-Wasserstein and applications to generativ e mo deling. In ICLR , 2021. URL https:// openrevie w.net/forum?id=QYjO70ACDK . (Cited on p age 34 .) X. Nguy en. C on v ergence of laten t mixing measures in fin ite and infinite mixture mo dels. Anna ls of Statistics , 4(1):370–4 00, 2013. (Cited on page 1 .) X. Nguyen. Borrowing strength in hierarc hical Bay es: posterior concen tration of th e Diric hlet base measure. Bernoul li , 22(3) :1535–157 1, 2016. ( Cited on page 1 .) J. B. Orlin. A p olynomial time primal n et w ork simp lex algorithm for minimum cost flo ws. Mathematic al Pr o gr amming , 78(2):1 09–129, 1997. (Cited on page 2 .) J. B. Orlin and R. K. Ah uja. New scaling algorithms for the assignment and minimum mean cycle problems. Mathematic al Pr o gr amming , 54(1):41–5 6, 1992. (Cited on p age 2 .) J. B. O rlin, S. A. Plotkin, and E. T ardos. P olynomial dual net work s implex algorithms. Mathematic al Pr o gr amming , 60(1):2 55–276, 1993. (Cited on page 2 .) F-P . P at y and M. Cuturi. Su b space robust Wasserstein distances. In ICML , p ages 5072–50 81. PMLR, 2019 . (Cited on pages 2 and 34 .) 38 O. P ele and M. W erman. F ast and robust earth mov er’s distance. In ICCV . IEEE, 2009. (Cited on p age 2 .) G. P eyr´ e and M. Cutur i. C omp utational optimal transp ort. F oundations and T r ends ® in Machine L e arning , 11(5-6):355 –607, 2019 . (Cited on page 2 .) G. P eyr ´ e, M. Cuturi, and J. Solomon. Gromo v-Wasserstein a v eraging of k ernel and distance matrices. I n ICML , pages 2664–2672 , 2016. (Cited on page 1 .) M. S. Pyd i and V. Jog. Adversarial risk via optimal transp ort and optimal couplin gs. In ICML , pages 7814–782 3. PMLR, 2020. (Cited on page 34 .) K. Quanr ud. 
Approximati ng optimal transp ort with linear programs. In SOSA , pages 61–69, 2019. (Cited on page 3 .) A. Rolet, M. Cu turi, and G. P eyr´ e. F ast d ictionary learning w ith a smo othed Wasserstein loss. In AIST A TS , pages 630– 638, 2016. (Cited on page 1 .) M. S c hmidt, N. L. Roux, and F. Bac h. Minimizing fi nite su ms with the sto c hastic a ve rage gradien t. Mathematic al Pr o gr amming , 162:83–11 2, 2017. (Cited on page 31 .) M. H. Schneider and S. A. Z enios. A comparativ e study of algorithms for matrix balancing. Op e r ations R ese ar ch , 38(3):43 9–455, 1990. (Cited on page 2 .) J. Sh erman. Area-con v exit y , ℓ ∞ regularization, and und ir ected multicommodity flow. In STOC , pages 452–460. ACM, 2017. (Cited on pages 4 and 16 .) R. Sinkhorn . Diagonal equiv alence to matrices with p rescrib ed ro w and column su ms. Pr o- c e e dings of the Americ an Mathematic al So ciety , 45(2):195 –198, 1974. (Cited on page 2 .) M. Sommerfeld and A. Mun k. Inf erence for empirical Wasserstein d istances on finite spaces. Journal of the R oyal Statistic al So c iety: Series B (Statistic al Metho dolo g y) , 80(1):219 –238, 2018. (Cited on page 2 .) M. Sommerfeld, Y. Z emel, and A. Mun k. Optimal transp ort: Fast probabilistic appro ximation with exact solve rs. Journal of M achine L e arning R ese ar ch , 20:1–23 , 2019. (Cited on p age 1 .) S. Sriv asta v a, V. Cevher, Q. Dinh, and D. Du nson. W ASP: Scalable Ba y es via barycen ters of subset p osteriors. In AIST A TS , pages 912 –920, 2015. (Cited on page 1 .) S. Sr iv astav a, C. Li, and D. Dunson. Scalable Ba y es via barycen ter in Wasserstein sp ace. Journal of Machine L e arning R ese ar c h , 19(8):1–3 5, 2018. (Cited on p age 1 .) I. T olstikhin, O. Bousqu et, S. Gelly , and B. S c ho elk opf. W asserstein auto-encoders . In ICLR , 2018. (Cited on page 1 .) J. v an den Brand , Y. T. Lee, Y. P . Liu, T. Saranurak, A. Sidford, Z. Song, and D. W ang. 
Minim um cost flows, MDPs, and ℓ 1 -regression in nearly linear time for dens e instances. In STOC , pages 859–869, 2021. (Cited on page 2 .) C. Villani. Optimal T r ansp ort: Old and Ne w , v olume 338. Spr in ger, 2009. (Cited on page 1 .) J. W eed and F. Bac h. Sharp asymptotic and finite-sample r ates of con verge nce of empirical measures in Wasserstein distance. Bernoul li , 25(4A):2 620–2648 , 2019. (Cited on pages 2 , 6 , and 34 .) 39