On the Efficiency of Entropic Regularized Algorithms for Optimal Transport

Tianyi Lin⋄, Nhat Ho⋆, Michael I. Jordan⋄,†

⋄ Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
† Department of Statistics, University of California, Berkeley
⋆ Department of Statistics and Data Sciences, University of Texas, Austin

May 19, 2022

Abstract

We present several new complexity results for the entropic regularized algorithms that approximately solve the optimal transport (OT) problem between two discrete probability measures with at most $n$ atoms. First, we improve the complexity bound of a greedy variant of Sinkhorn, known as Greenkhorn, from $\widetilde{O}(n^2\varepsilon^{-3})$ to $\widetilde{O}(n^2\varepsilon^{-2})$. Notably, our result matches the best known complexity bound of Sinkhorn and helps clarify why Greenkhorn significantly outperforms Sinkhorn in practice in terms of row/column updates, as observed by Altschuler et al. [2017]. Second, we propose a new algorithm, which we refer to as APDAMD and which generalizes an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm [Dvurechensky et al., 2018] with a prespecified mirror mapping $\phi$. We prove that APDAMD achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$, in which $\delta > 0$ stands for the regularity of $\phi$. In addition, we show by a counterexample that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD before is invalid, and give a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$. Further, we develop a deterministic accelerated variant of Sinkhorn via appeal to an estimated sequence and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that the accelerated variant of Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$, and APDAGD and accelerated alternating minimization (AAM) [Guminov et al., 2021] in terms of $n$.
Finally, we conduct experiments on synthetic and real data, and the numerical results show the efficiency of Greenkhorn, APDAMD and accelerated Sinkhorn in practice.

1 Introduction

From its origins in the seminal works by Monge [1781] and Kantorovich [1942], respectively in the eighteenth and twentieth centuries, and through to the present day, the optimal transport (OT) problem has played a determinative role in the theory of mathematics [Villani, 2009]. It has also found a wide range of applications in problem domains beyond the original setting in logistics. In the current era, the strong and increasing linkage between optimization and machine learning has brought new applications of OT to the fore; see, e.g., Nguyen [2013], Cuturi and Doucet [2014], Srivastava et al. [2015], Rolet et al. [2016], Peyré et al. [2016], Nguyen [2016], Carrière et al. [2017], Arjovsky et al. [2017], Gulrajani et al. [2017], Courty et al. [2017], Srivastava et al. [2018], Dvurechenskii et al. [2018], Tolstikhin et al. [2018], Sommerfeld et al. [2019], Lin et al. [2019b], Ho et al. [2019]. In these data-driven applications, the focus is on the probability distributions underlying the OT formulation; indeed, these distributions are either empirical distributions obtained by placing unit masses at data points, or probability models of a putative underlying data-generating process. The OT problem accordingly often has a direct inferential meaning, as the definition of an estimator [Dudley, 1969, Fournier and Guillin, 2015, Weed and Bach, 2019, Lei, 2020], the definition of a likelihood [Sommerfeld and Munk, 2018, Bernton et al., 2019, Blanchet and Murthy, 2019], or as the robust variant of an estimator [Blanchet et al., 2019, Paty and Cuturi, 2019, Balaji et al., 2020].
The key challenge is computational [Peyré and Cuturi, 2019]. Indeed, the underlying distributions generally involve high-dimensional data and complex models in machine learning (ML) applications. We study the OT problem in a discrete setting where we assume that the target and source probability distributions each have at most $n$ atoms. In this setting, the OT problem can be solved exactly using linear programming (LP) solvers based on specialized interior-point methods [Pele and Werman, 2009, Lee and Sidford, 2014, van den Brand et al., 2021], reflecting the LP formulation of the OT problem. In this context, van den Brand et al. [2021] have provided a family of randomized interior-point algorithms with improved runtimes for solving linear programs with two-sided constraints, leading to a new OT algorithm based on Laplacian system solvers that achieves the best known complexity bound of $\widetilde{O}(n^2)$. However, it does not provide an effective solution to large-scale machine learning problems in practice, since efficient implementations of the Laplacian approach are as yet unknown. Furthermore, many combinatorial techniques give exact algorithms for the OT problem. Indeed, the Hungarian algorithm [Kuhn, 1955, 1956, Munkres, 1957] solves the assignment problem in $O(n^3)$ time, while there are several combinatorial algorithms that can solve the OT problem exactly in $\widetilde{O}(n^{2.5})$ time [Gabow and Tarjan, 1991, Orlin and Ahuja, 1992]. Combined with the scaling technique, the network simplex algorithms [Orlin et al., 1993, Orlin, 1997] can be used to solve the OT problem exactly in $\widetilde{O}(n^3)$ time, and Lahn et al. [2019] have recently developed a faster approximation algorithm for the OT problem via appeal to a modification of the algorithm developed in Gabow and Tarjan [1991].
However, computing the OT problem exactly results in an output that is not differentiable with respect to the measures' locations or weights [Bertsimas and Tsitsiklis, 1997]. Moreover, OT suffers from the curse of dimensionality [Dudley, 1969, Fournier and Guillin, 2015] and is thus likely to be meaningless when used on samples from high-dimensional densities. An alternative way to solve the OT problem is via a class of approximation algorithms based on entropy regularization, which had been investigated in optimization and transportation science long before [Sinkhorn, 1974, Schneider and Zenios, 1990, Kalantari and Khachiyan, 1996, Knight, 2008, Kalantari et al., 2008, Chakrabarty and Khanna, 2018]. It was Cuturi [2013] who popularized the use of entropy regularization for OT in the machine learning community and thereby initiated a productive line of research in which an entropic regularization was imposed to approximate the non-negativity constraints in the original OT problem. The resulting problem is referred to as entropic regularized OT, and the corresponding class of approximation algorithms are called entropic regularized algorithms. It is worth mentioning that entropic regularized OT has many favorable properties that OT does not enjoy, motivating us to study the computational efficiency of entropic regularized algorithms in this paper. More specifically, from a statistical point of view, entropic regularized OT enjoys significantly better sample complexity that is polynomial in the dimension [Genevay et al., 2019, Mena and Niles-Weed, 2019, Chizat et al., 2020], demonstrating that adding an entropy regularization reduces the curse of dimensionality.
Even from a computational point of view, such regularization in OT leads to Sinkhorn, which attains the first near-linear-time guarantee for the OT problem [Cuturi, 2013, Altschuler et al., 2017, Dvurechensky et al., 2018], and also makes the problem differentiable with respect to the distributions [Feydy et al., 2019]; hence, the entropic regularized algorithms are more easily applicable to deep learning applications [Courty et al., 2017, Cuturi et al., 2019, Balaji et al., 2020], as opposed to combinatorial algorithms. This point was highlighted in Dong et al. [2020] and further necessitated the development of faster entropic regularized algorithms. In this regard, the greedy variant of Sinkhorn, Greenkhorn, was proposed and shown to outperform Sinkhorn empirically [Altschuler et al., 2017]. However, a sizable gap exists here, since the best known complexity bound of $\widetilde{O}(n^2\varepsilon^{-3})$ for Greenkhorn [Altschuler et al., 2017] is worse than that of $\widetilde{O}(n^2\varepsilon^{-2})$ for Sinkhorn [Dvurechensky et al., 2018]. Further progress has been made by adapting first-order optimization algorithms for the OT problem [Cuturi and Peyré, 2016, Genevay et al., 2016, Blondel et al., 2018, Dvurechensky et al., 2018, Altschuler et al., 2019, Guo et al., 2020, Guminov et al., 2021]. Among these approaches, two representatives are an adaptive primal-dual accelerated gradient descent (APDAGD) algorithm [Dvurechensky et al., 2018] with the claimed complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ and an accelerated alternating minimization (AAM) algorithm [Guminov et al., 2021] with the complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$. Moreover, there are several second-order optimization algorithms [Allen-Zhu et al., 2017, Cohen et al., 2017] which have been adapted for the OT problem [Blanchet et al.
, 2018, Quanrud, 2019] and are guaranteed to achieve the improved complexity bound of $\widetilde{O}(n^2\varepsilon^{-1})$. However, the aforementioned second-order algorithms do not provide effective solutions to large-scale machine learning problems due to the lack of efficient implementations in practice.

Contributions. Given the advantages of entropic regularization in OT, we focus in this paper on the computational efficiency of a class of entropic regularized algorithms for the OT problem, and our theoretical analysis leads to several improvements over the state-of-the-art algorithms in the literature. We summarize the contributions as follows:

1. We improve the complexity bound of Greenkhorn from $\widetilde{O}(n^2\varepsilon^{-3})$ to $\widetilde{O}(n^2\varepsilon^{-2})$, which matches the best existing bound for Sinkhorn. The proof techniques are new and different from those used in Dvurechensky et al. [2018] for analyzing Sinkhorn. In particular, Greenkhorn only updates a single row or column at each iteration, and quantifying the per-iteration progress is more difficult than the corresponding measurement for Sinkhorn.

2. We propose an adaptive primal-dual accelerated mirror descent (APDAMD) algorithm which generalizes APDAGD with a prespecified mirror mapping $\phi$, and prove that APDAMD achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$, where $\delta > 0$ refers to the regularity of $\phi$ with respect to the $\ell_\infty$-norm. We show by a counterexample that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ proved for APDAGD [Dvurechensky et al., 2018] is invalid, and give a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$, which is worse than the claimed bound in terms of $n$.

3. We propose a deterministic accelerated variant of Sinkhorn via appeal to an estimated sequence and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$.
In particular, accelerated Sinkhorn consists of an exact minimization for the main iterates, accompanied by another sequence of iterates based on coordinate gradient updates and monotone search. Our results show that accelerated Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$, and APDAGD and AAM in terms of $n$.

We note that a preliminary version with only the analysis for Greenkhorn and APDAMD has been accepted by ICML [Lin et al., 2019a]. After our conference paper was published, some new algorithms were developed for solving the OT problem [Jambulapati et al., 2019, Lahn et al., 2019]. In particular, Jambulapati et al. [2019] developed a dual extrapolation algorithm with the complexity bound $\widetilde{O}(n^2\varepsilon^{-1})$ using an area-convex mapping [Sherman, 2017]. Despite the theoretically sound complexity bound, the lack of simplicity and ease of implementation makes this algorithm less competitive with Sinkhorn and Greenkhorn, which remain the baseline solution methods in practice [Flamary and Courty, 2017]. Different from the algorithm in Jambulapati et al. [2019], the combinatorial algorithm in Lahn et al. [2019] is a practical solution method for the OT problem. It is worth mentioning that the algorithm in Lahn et al. [2019] and other combinatorial algorithms, e.g., the Hungarian algorithm, outperform our algorithms in practice. This is consistent with the observation in Dong et al. [2020], who pointed out that combinatorial algorithms can outperform entropic regularized algorithms in speed even though the latter are asymptotically faster for OT (i.e., the case of large $n$). However, we believe our results are still valuable due to the importance of entropic regularized algorithms as mentioned before.

Organization. The remainder of the paper is organized as follows.
In Section 2, we present the basic setup for the primal and dual forms of the entropic regularized OT problem. In Section 3, we provide the complexity analysis for Greenkhorn. In Section 4, we propose APDAMD for solving entropic regularized OT and provide several results on the complexity bounds of APDAGD and APDAMD. In Section 5, we propose and analyze an accelerated variant of Sinkhorn. In Section 6, we conduct experiments on synthetic and real data, and the numerical results show the efficiency of our algorithms. We conclude the paper in Section 7.

Notation. For $n \geq 2$, we let $[n]$ be the set $\{1, 2, \ldots, n\}$ and $\mathbb{R}^n_+$ be the set of all vectors in $\mathbb{R}^n$ with non-negative coordinates. The notation $\Delta^n = \{v \in \mathbb{R}^n_+ : \sum_{i=1}^n v_i = 1\}$ stands for the probability simplex in $n-1$ dimensions. For a vector $x \in \mathbb{R}^n$ and $1 \leq p < +\infty$, the notation $\|x\|_p$ stands for the $\ell_p$-norm and $\|x\|$ indicates the $\ell_2$-norm. $\mathrm{diag}(x)$ is the diagonal matrix with the vector $x$ on its diagonal. $\mathbf{1}_n$ and $\mathbf{0}_n$ are the $n$-dimensional vectors with all components equal to 1 and 0, respectively. For a matrix $A \in \mathbb{R}^{n \times n}$, we denote by $\mathrm{vec}(A)$ the vector in $\mathbb{R}^{n^2}$ obtained by concatenating the rows and columns of $A$. The notation $\|A\|_{1 \to 1}$ stands for $\sup_{\|x\|_1 = 1} \|Ax\|_1$, and the notations $r(A) = A\mathbf{1}_n$ and $c(A) = A^\top \mathbf{1}_n$ stand for the row and column sums, respectively. For a function $f$, the notation $\nabla_x f$ denotes the partial derivative with respect to $x$. For the dimension $n$ and tolerance $\varepsilon > 0$, the notations $a = O(b(n, \varepsilon))$ and $a = \Omega(b(n, \varepsilon))$ indicate that $a \leq C_1 \cdot b(n, \varepsilon)$ and $a \geq C_2 \cdot b(n, \varepsilon)$, respectively, where $C_1$ and $C_2$ are independent of $n$ and $\varepsilon$. We also write $a = \Theta(b(n, \varepsilon))$ iff $a = O(b(n, \varepsilon))$ and $a = \Omega(b(n, \varepsilon))$. Similarly, we write $a = \widetilde{O}(b(n, \varepsilon))$ to indicate the previous inequality in which $C_1$ depends on some logarithmic function of $n$ and $\varepsilon$.
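As a quick illustration of the marginal notation used throughout, the row sums $r(A)$, the column sums $c(A)$, and the $\|\cdot\|_{1 \to 1}$ norm can be computed directly in numpy. This is a small sketch of ours; the helper names simply mirror the paper's notation:

```python
import numpy as np

def r(A):
    """Row sums r(A) = A @ 1_n."""
    return A @ np.ones(A.shape[1])

def c(A):
    """Column sums c(A) = A.T @ 1_n."""
    return A.T @ np.ones(A.shape[0])

def norm_1_to_1(A):
    """sup over ||x||_1 = 1 of ||A x||_1; the supremum is attained at a
    standard basis vector, so it equals the maximum column l1-norm."""
    return np.abs(A).sum(axis=0).max()
```

For example, with $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, `r(A)` is $(3, 7)$, `c(A)` is $(4, 6)$, and `norm_1_to_1(A)` is $6$.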
2 Problem Setup

In this section, we first present the linear programming (LP) representation of the optimal transport (OT) problem as well as a specification of an approximate transportation plan. We also present an entropic regularized variant of the OT problem and derive its dual form, in which the objective function takes the form of the logarithm of a sum of exponents. Finally, we establish several properties of that dual form which are useful for the subsequent analysis.

2.1 Linear programming representation

According to Kantorovich [1942], the problem of approximating the OT distance is equivalent to solving the following linear programming (LP) problem:
$$\min_{X \in \mathbb{R}^{n \times n}} \langle C, X \rangle \quad \text{s.t.} \quad X\mathbf{1}_n = r, \; X^\top \mathbf{1}_n = c, \; X \geq 0. \tag{1}$$
In the above formulation, $X$ refers to the transportation plan, $C = (C_{ij}) \in \mathbb{R}^{n \times n}_+$ stands for a cost matrix with non-negative components, and $r \in \mathbb{R}^n$ and $c \in \mathbb{R}^n$ are two probability distributions in the simplex $\Delta^n$. We see from Eq. (1) that the OT problem is an LP with $2n$ equality constraints and $n^2$ variables and can be solved by the interior-point method; however, this method performs poorly on large-scale problems due to its high per-iteration computational cost. In general, the solution that the algorithms aim at achieving is an $\varepsilon$-approximate transportation plan $\widehat{X} \in \mathbb{R}^{n \times n}_+$ satisfying the marginal distribution constraints $\widehat{X}\mathbf{1}_n = r$ and $\widehat{X}^\top \mathbf{1}_n = c$ and the inequality given by
$$\langle C, \widehat{X} \rangle \leq \langle C, X^\star \rangle + \varepsilon.$$
Here $X^\star$ is defined as an optimal transportation plan for the OT problem. For simplicity, we respectively call $\langle C, \widehat{X} \rangle$ an $\varepsilon$-approximate transportation cost and $\widehat{X}$ an $\varepsilon$-approximate transportation plan for the original problem. Formally, we have the following definition of an $\varepsilon$-approximate transportation plan.

Definition 1.
The matrix $\widehat{X} \in \mathbb{R}^{n \times n}_+$ is called an $\varepsilon$-approximate transportation plan if $\widehat{X}\mathbf{1}_n = r$, $\widehat{X}^\top \mathbf{1}_n = c$, and the following inequality holds true:
$$\langle C, \widehat{X} \rangle \leq \langle C, X^\star \rangle + \varepsilon,$$
where $X^\star$ is defined as an optimal transportation plan for the OT problem.

With this definition in mind, the goal of this paper is to study the OT problem from a computational point of view. Indeed, we hope to derive improved complexity bounds for the current state-of-the-art algorithms and to seek new practical algorithms whose running time required to obtain an $\varepsilon$-approximate transportation plan has better dependence on $1/\varepsilon$ than the benchmark algorithms in the literature. The aforementioned new algorithms are favorable in machine learning applications where high precision (small $\varepsilon$) is necessary.

2.2 Entropic regularized OT and its dual form

Seeking another formulation for the OT distance that is more amenable to computationally efficient algorithms, Cuturi [2013] proposed to solve an entropic regularized version of the OT problem in Eq. (1), which is given by
$$\min_{X \in \mathbb{R}^{n \times n}} \langle C, X \rangle - \eta H(X), \quad \text{s.t.} \quad X\mathbf{1}_n = r, \; X^\top \mathbf{1}_n = c, \tag{2}$$
where $\eta > 0$ denotes the regularization parameter and $H(X)$ denotes the entropic regularization term, which is given by
$$H(X) := -\langle X, \log(X) - \mathbf{1}_{n \times n} \rangle.$$
Note that the optimal solution of the entropic regularized OT problem exists, since the objective function $\langle C, X \rangle - \eta H(X)$ is continuous and the feasible region $\{X \in \mathbb{R}^{n \times n} : X \geq 0, X\mathbf{1}_n = r, X^\top \mathbf{1}_n = c\}$ is compact. Furthermore, the optimal solution is also unique, since the objective function $\langle C, X \rangle - \eta H(X)$ is strongly convex over the feasible region with respect to the $\ell_1$-norm. However, the optimal value of the entropic regularized OT problem (cf. Eq. (2)) yields a poor approximation to the unregularized OT problem if $\eta$ is large.
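For small $n$, the LP in Eq. (1) can be solved exactly with an off-the-shelf solver, which is useful as a ground truth when checking approximate plans. The sketch below uses scipy's `linprog` (HiGHS backend); the helper name `ot_lp` and the row-major vectorization of $X$ are our own choices, and one redundant equality constraint is dropped since both constraint sets sum to 1:

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(C, r, c):
    """Solve the OT linear program in Eq. (1) exactly for small n."""
    n = C.shape[0]
    # Constraints act on x = vec(X) (row-major): X 1_n = r and X^T 1_n = c.
    A_rows = np.kron(np.eye(n), np.ones((1, n)))   # row-sum constraints
    A_cols = np.kron(np.ones((1, n)), np.eye(n))   # column-sum constraints
    A_eq = np.vstack([A_rows, A_cols[:-1]])        # last constraint is redundant
    b_eq = np.concatenate([r, c[:-1]])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.x.reshape(n, n), res.fun
```

For instance, with $C = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ and $r = c = (1/2, 1/2)$, the optimal plan puts all mass on the diagonal and the optimal cost $\langle C, X^\star \rangle$ is $0$.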
An additional issue of entropic regularization is that the sparsity of the solution is lost. Even though an $\varepsilon$-approximate transportation plan can be found efficiently, it is not clear how different the sparsity pattern of this solution is from that of the solution of the actual OT problem. In contrast, the actual OT distance suffers from the curse of dimensionality [Dudley, 1969, Fournier and Guillin, 2015, Weed and Bach, 2019] and is significantly worse than its entropic regularized version in terms of sample complexity [Genevay et al., 2019, Mena and Niles-Weed, 2019, Chizat et al., 2020]. While there is an ongoing debate in the literature on the merits of solving the OT problem versus its entropic regularized version, we adopt here the viewpoint that reaching an additive approximation of the actual OT cost matters, and we therefore propose to scale $\eta$ as a function of the desired accuracy of the approximation.

We now proceed to derive the dual form of the entropic regularized OT problem in Eq. (2) and show that it is an unconstrained smooth optimization problem. Introducing the dual variables $\alpha, \beta \in \mathbb{R}^n$, we define the Lagrangian function of the entropic regularized OT problem as follows:
$$\mathcal{L}(X, \alpha, \beta) = \langle C, X \rangle - \eta H(X) - \alpha^\top (X\mathbf{1}_n - r) - \beta^\top (X^\top \mathbf{1}_n - c). \tag{3}$$
In order to derive the smooth dual objective function, we consider the following minimization problem:
$$\min_{X : \|X\|_1 = 1} \; \langle C, X \rangle - \eta H(X) - \alpha^\top (X\mathbf{1}_n - r) - \beta^\top (X^\top \mathbf{1}_n - c).$$
The above objective function is strongly convex over the domain $\{X \in \mathbb{R}^{n \times n}_+ \mid \|X\|_1 = 1\}$; thus, the optimal solution is unique. After simple calculations, the optimal solution $\bar{X} = X(\alpha, \beta)$ has the following form:
$$\bar{X}_{ij} = \frac{e^{\eta^{-1}(\alpha_i + \beta_j - C_{ij})}}{\sum_{1 \leq i,j \leq n} e^{\eta^{-1}(\alpha_i + \beta_j - C_{ij})}}. \tag{4}$$
Plugging Eq. (4) into Eq.
(3) yields that the dual form is
$$\max_{\alpha, \beta} \; -\eta \log\Big( \sum_{1 \leq i,j \leq n} e^{\eta^{-1}(\alpha_i + \beta_j - C_{ij})} \Big) + \alpha^\top r + \beta^\top c.$$
In order to streamline our presentation, we perform a change of variables, $u = \eta^{-1}\alpha$ and $v = \eta^{-1}\beta$, and reformulate the above problem as
$$\min_{u, v} \; \varphi(u, v) := \log\Big( \sum_{1 \leq i,j \leq n} e^{u_i + v_j - C_{ij}/\eta} \Big) - u^\top r - v^\top c.$$
To further simplify the notation, we define $B(u, v) := (B_{ij})_{1 \leq i,j \leq n} \in \mathbb{R}^{n \times n}$ by $B_{ij} = e^{u_i + v_j - C_{ij}/\eta}$. With this notation, we obtain the dual entropic regularized OT problem, defined by
$$\min_{u, v} \; \varphi(u, v) := \log(\|B(u, v)\|_1) - u^\top r - v^\top c. \tag{5}$$

Remark 2.1. The first part of the objective function $\varphi$ is in the form of the logarithm of a sum of exponents, while the second part is a linear function. This is different from the objective function used in the previous dual entropic regularized OT problem [Cuturi, 2013, Altschuler et al., 2017, Dvurechensky et al., 2018, Lin et al., 2019a]. Notably, Eq. (5) is a special instance of a softmax minimization problem, and the objective function $\varphi$ is known to be smooth [Nesterov, 2005]. Finally, we point out that the same formulation has been derived in Guminov et al. [2021] for analyzing AAM.

In the remainder of the paper, we also denote by $(u^\star, v^\star) \in \mathbb{R}^{2n}$ an optimal solution of the dual entropic regularized OT problem in Eq. (5).

2.3 Properties of dual entropic regularized OT

We present several useful properties of the dual entropic regularized OT problem in Eq. (5). In particular, we show that there exists an optimal solution $(u^\star, v^\star) \in \mathbb{R}^{2n}$ that admits an upper bound in terms of the $\ell_\infty$-norm.

Lemma 2.2. For the dual entropic regularized OT problem in Eq. (5), there exists an optimal solution $(u^\star, v^\star)$ such that
$$\|u^\star\|_\infty \leq R, \quad \|v^\star\|_\infty \leq R,$$
where $R := \eta^{-1}\|C\|_\infty + \log(n) - \log(\min_{1 \leq i,j \leq n}\{r_i, c_j\})$ depends on $C$, $r$ and $c$.

Proof.
First, we claim that there exists an optimal solution $(u^\star, v^\star)$ such that
$$\max_{1 \leq i \leq n} u^\star_i \geq 0 \geq \min_{1 \leq i \leq n} u^\star_i, \qquad \max_{1 \leq i \leq n} v^\star_i \geq 0 \geq \min_{1 \leq i \leq n} v^\star_i. \tag{6}$$
Indeed, letting $(\widehat{u}^\star, \widehat{v}^\star)$ be an optimal solution to Eq. (5), the claim holds true if $(\widehat{u}^\star, \widehat{v}^\star)$ satisfies Eq. (6). Otherwise, we define the shift terms given by
$$\widehat{\Delta}_u = \frac{\max_{1 \leq i \leq n} \widehat{u}^\star_i + \min_{1 \leq i \leq n} \widehat{u}^\star_i}{2}, \qquad \widehat{\Delta}_v = \frac{\max_{1 \leq i \leq n} \widehat{v}^\star_i + \min_{1 \leq i \leq n} \widehat{v}^\star_i}{2},$$
and define $(u^\star, v^\star)$ by $u^\star = \widehat{u}^\star - \widehat{\Delta}_u \mathbf{1}_n$ and $v^\star = \widehat{v}^\star - \widehat{\Delta}_v \mathbf{1}_n$. By definition, $(u^\star, v^\star)$ satisfies Eq. (6). Since $\mathbf{1}_n^\top r = \mathbf{1}_n^\top c = 1$, we have $(u^\star)^\top r = (\widehat{u}^\star)^\top r - \widehat{\Delta}_u$ and $(v^\star)^\top c = (\widehat{v}^\star)^\top c - \widehat{\Delta}_v$. In addition,
$$\log(\|B(u^\star, v^\star)\|_1) = \log(\|B(\widehat{u}^\star, \widehat{v}^\star)\|_1) - \widehat{\Delta}_u - \widehat{\Delta}_v.$$
Putting these pieces together yields $\varphi(u^\star, v^\star) = \varphi(\widehat{u}^\star, \widehat{v}^\star)$. Therefore, $(u^\star, v^\star)$ is an optimal solution of the dual entropic regularized OT problem that satisfies Eq. (6).

Then, we show that
$$\max_{1 \leq i \leq n} u^\star_i - \min_{1 \leq i \leq n} u^\star_i \leq \frac{\|C\|_\infty}{\eta} - \log\big(\min_{1 \leq i,j \leq n}\{r_i, c_j\}\big), \tag{7}$$
$$\max_{1 \leq i \leq n} v^\star_i - \min_{1 \leq i \leq n} v^\star_i \leq \frac{\|C\|_\infty}{\eta} - \log\big(\min_{1 \leq i,j \leq n}\{r_i, c_j\}\big). \tag{8}$$
Indeed, we derive from the optimality condition of $(u^\star, v^\star)$ that
$$\frac{e^{u^\star_i}\big(\sum_{j=1}^n e^{v^\star_j - \eta^{-1}C_{ij}}\big)}{\|B(u^\star, v^\star)\|_1} = r_i, \quad \text{for all } i \in [n].$$
Since $C_{ij} \geq 0$ for all $1 \leq i, j \leq n$ and $r_i \geq \min_{1 \leq i,j \leq n}\{r_i, c_j\}$ for all $1 \leq i \leq n$, we have
$$u^\star_i \geq \log\big(\min_{1 \leq i,j \leq n}\{r_i, c_j\}\big) - \log\Big(\sum_{j=1}^n e^{v^\star_j}\Big) + \log(\|B(u^\star, v^\star)\|_1), \quad \text{for all } i \in [n].$$
Since $0 < r_i \leq 1$ and $C_{ij} \leq \|C\|_\infty$, we have
$$u^\star_i \leq \frac{\|C\|_\infty}{\eta} - \log\Big(\sum_{j=1}^n e^{v^\star_j}\Big) + \log(\|B(u^\star, v^\star)\|_1), \quad \text{for all } i \in [n].$$
Putting these pieces together yields Eq. (7). By a similar argument, we can prove Eq. (8). Finally, we prove our main results. Indeed, Eq. (6) and Eq.
(7) imply that
$$-\frac{\|C\|_\infty}{\eta} + \log\big(\min_{1 \leq i,j \leq n}\{r_i, c_j\}\big) \leq \min_{1 \leq i \leq n} u^\star_i \leq 0,$$
and
$$0 \leq \max_{1 \leq i \leq n} u^\star_i \leq \frac{\|C\|_\infty}{\eta} - \log\big(\min_{1 \leq i,j \leq n}\{r_i, c_j\}\big).$$
Combining the above two inequalities with the definition of $R$ implies that $\|u^\star\|_\infty \leq R$. By a similar argument, we can prove that $\|v^\star\|_\infty \leq R$. As a consequence, we obtain the conclusion of the lemma.

The upper bound for the $\ell_\infty$-norm of an optimal solution of the dual entropic regularized OT problem in Lemma 2.2 directly leads to the following bound for the $\ell_2$-norm.

Corollary 2.3. For the dual entropic regularized OT problem in Eq. (5), there exists an optimal solution $(u^\star, v^\star)$ such that
$$\|u^\star\| \leq \sqrt{n}R, \quad \|v^\star\| \leq \sqrt{n}R,$$
where $R > 0$ is defined in Lemma 2.2.

Since the function $-H(X)$ is strongly convex with respect to the $\ell_1$-norm on the probability simplex $Q \subseteq \mathbb{R}^{n \times n}$, the entropic regularized OT problem in Eq. (2) is a special case of the following linearly constrained convex optimization problem:
$$\min_{x \in Q} f(x), \quad \text{s.t.} \quad Ax = b,$$
where $f$ is strongly convex with respect to the $\ell_1$-norm on the set $Q$:
$$f(x') - f(x) - (x' - x)^\top \nabla f(x) \geq \frac{\eta}{2}\|x' - x\|_1^2 \quad \text{for any } x', x \in Q.$$
By Nesterov [2005, Theorem 1] with the $\ell_2$-norm for the dual space of the Lagrange multipliers, the dual objective function $\widetilde{\varphi}$ satisfies the following inequality:
$$\widetilde{\varphi}(\alpha', \beta') - \widetilde{\varphi}(\alpha, \beta) - \begin{pmatrix} \alpha' - \alpha \\ \beta' - \beta \end{pmatrix}^\top \nabla \widetilde{\varphi}(\alpha, \beta) \leq \frac{\|A\|_{1 \to 2}^2}{2\eta} \left\| \begin{pmatrix} \alpha' - \alpha \\ \beta' - \beta \end{pmatrix} \right\|^2$$
for any $(\alpha', \beta'), (\alpha, \beta) \in \mathbb{R}^{2n}$.

Algorithm 1: Greenkhorn$(C, \eta, r, c, \varepsilon')$
Input: $t = 0$ and $u^0 = v^0 = \mathbf{0}_n$.
while $E_t > \varepsilon'$ do
  Compute $I = \mathrm{argmax}_{1 \leq i \leq n} \, \rho(r_i, r_i(B(u^t, v^t)))$, where $\rho(a, b) = b - a + a\log(a/b)$ and $(B(u^t, v^t))_{i'j'} = e^{u^t_{i'} + v^t_{j'} - C_{i'j'}/\eta}$ for all $(i', j')$.
  Compute $J = \mathrm{argmax}_{1 \leq j \leq n} \, \rho(c_j, c_j(B(u^t, v^t)))$.
  if $\rho(r_I, r_I(B(u^t, v^t))) > \rho(c_J, c_J(B(u^t, v^t)))$ then
    $u^{t+1}_I = u^t_I + \log(r_I) - \log(r_I(B(u^t, v^t)))$.
  else
    $v^{t+1}_J = v^t_J + \log(c_J) - \log(c_J(B(u^t, v^t)))$.
  end if
  Increment $t = t + 1$.
end while
Output: $B(u^t, v^t)$.

Recall that the function $\widetilde{\varphi}$ is given by
$$\widetilde{\varphi}(\alpha, \beta) = -\eta \log\Big( \sum_{1 \leq i,j \leq n} e^{\eta^{-1}(\alpha_i + \beta_j - C_{ij})} \Big) + \alpha^\top r + \beta^\top c. \tag{9}$$
We notice that the function $\varphi$ in Eq. (5) satisfies $\varphi(u, v) = -\eta^{-1}\widetilde{\varphi}(\eta u, \eta v)$. After some simple calculations, we have
$$\varphi(u', v') - \varphi(u, v) - \begin{pmatrix} u' - u \\ v' - v \end{pmatrix}^\top \nabla \varphi(u, v) \leq \frac{\|A\|_{1 \to 2}^2}{2} \left\| \begin{pmatrix} u' - u \\ v' - v \end{pmatrix} \right\|^2. \tag{10}$$
In the entropic regularized OT problem, each column of the matrix $A$ contains no more than two nonzero elements, which are equal to one. Since $\|A\|_{1 \to 2}$ is equal to the maximum $\ell_2$-norm of the columns of this matrix, we have $\|A\|_{1 \to 2} = \sqrt{2}$. Thus, the dual objective function $\varphi$ is 2-gradient Lipschitz with respect to the $\ell_2$-norm.

3 Greenkhorn

In this section, we present a complexity analysis for Greenkhorn. In particular, we improve the existing best known complexity bound $O(n^2\|C\|_\infty^3 \log(n)\varepsilon^{-3})$ [Altschuler et al., 2017] to $O(n^2\|C\|_\infty^2 \log(n)\varepsilon^{-2})$, which matches the current state-of-the-art complexity bound for Sinkhorn [Dvurechensky et al., 2018].

To facilitate the subsequent discussion, we present the pseudocode of Greenkhorn in Algorithm 1 and its application to regularized OT in Algorithm 2. The function for quantifying the progress in the dual objective value between two consecutive iterates is given by $\rho(a, b) = b - a + a\log(a/b)$, and we recall that $(u, v)$ is an optimal solution of the dual entropic regularized OT problem in Eq. (5) if $r(B(u, v)) - r = \mathbf{0}_n$ and $c(B(u, v)) - c = \mathbf{0}_n$.
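Algorithm 1 translates almost line for line into numpy. The sketch below is ours and recomputes $B(u, v)$ densely at every iteration for clarity, which costs $O(n^2)$ per iteration; the $O(n)$-per-iteration implementation mentioned in the text instead maintains the row and column sums incrementally. The stopping criterion is the $\ell_1$ marginal error $E_t$:

```python
import numpy as np

def rho(a, b):
    """Progress measure rho(a, b) = b - a + a*log(a/b) from Algorithm 1."""
    return b - a + a * np.log(a / b)

def greenkhorn(C, eta, r, c, eps_prime, max_iter=100_000):
    """Dense sketch of Algorithm 1: greedily fix the worst row or column."""
    n = C.shape[0]
    u, v = np.zeros(n), np.zeros(n)
    for _ in range(max_iter):
        B = np.exp(u[:, None] + v[None, :] - C / eta)
        rB, cB = B.sum(axis=1), B.sum(axis=0)        # r(B) and c(B)
        E_t = np.abs(rB - r).sum() + np.abs(cB - c).sum()
        if E_t <= eps_prime:
            break
        I = np.argmax(rho(r, rB))
        J = np.argmax(rho(c, cB))
        if rho(r[I], rB[I]) > rho(c[J], cB[J]):
            u[I] += np.log(r[I]) - np.log(rB[I])     # fix row I exactly
        else:
            v[J] += np.log(c[J]) - np.log(cB[J])     # fix column J exactly
    return np.exp(u[:, None] + v[None, :] - C / eta)
```

Each update makes the chosen row (or column) sum match its target marginal exactly, and the greedy index choice is what distinguishes this scheme from Sinkhorn's full row/column sweeps.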
This leads to the quantity which measures the error of the $t$-th iterate in Algorithm 1:
$$E_t := \|r(B(u^t, v^t)) - r\|_1 + \|c(B(u^t, v^t)) - c\|_1.$$
Both Sinkhorn and Greenkhorn can be interpreted as coordinate descent algorithms for minimizing the following convex function [Cuturi, 2013, Altschuler et al., 2017, Dvurechensky et al., 2018, Lin et al., 2019a]:
$$f(u, v) := \|B(u, v)\|_1 - u^\top r - v^\top c. \tag{11}$$

Algorithm 2: Approximating OT by Algorithm 1
Input: $\eta = \frac{\varepsilon}{4\log(n)}$ and $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$.
Step 1: Let $\widetilde{r} \in \Delta^n$ and $\widetilde{c} \in \Delta^n$ be defined by $(\widetilde{r}, \widetilde{c}) = (1 - \frac{\varepsilon'}{8})(r, c) + \frac{\varepsilon'}{8n}(\mathbf{1}_n, \mathbf{1}_n)$.
Step 2: Compute $\widetilde{X} = \text{Greenkhorn}(C, \eta, \widetilde{r}, \widetilde{c}, \frac{\varepsilon'}{2})$.
Step 3: Round $\widetilde{X}$ to $\widehat{X}$ using Altschuler et al. [2017, Algorithm 2] such that $\widehat{X}\mathbf{1}_n = r$ and $\widehat{X}^\top \mathbf{1}_n = c$.
Output: $\widehat{X}$.

Compared to the scheme of Sinkhorn, which consists of updates of all rows and columns, Algorithm 1 updates only one row or column at each step. As such, Algorithm 1 updates only $O(n)$ entries per iteration rather than $O(n^2)$ as in Sinkhorn. It is also worth mentioning that Algorithm 1 can be implemented such that each iteration runs in only $O(n)$ arithmetic operations [Altschuler et al., 2017]. Despite the cheap per-iteration computational cost, it is difficult to quantify the per-iteration progress of Algorithm 1, and the proof techniques for Sinkhorn in Dvurechensky et al. [2018] are not applicable here. This motivates us to investigate another proof strategy, which will be elaborated in the sequel.

3.1 Complexity analysis: bounding dual objective values

Given the definition of $E_t$, we first prove the following lemma, which yields an upper bound on the objective values of the iterates.

Lemma 3.1.
Letting $\{(u^t, v^t)\}_{t \geq 0}$ be the iterates generated by Algorithm 1, we have
$$f(u^t, v^t) - f(u^\star, v^\star) \leq 2E_t(\|u^\star\|_\infty + \|v^\star\|_\infty),$$
where $(u^\star, v^\star)$ is a point that minimizes $f(u, v) = \|B(u, v)\|_1 - u^\top r - v^\top c$.

Proof. By definition, we have
$$f(u, v) = \sum_{1 \leq i,j \leq n} e^{u_i + v_j - C_{ij}/\eta} - \sum_{i=1}^n u_i r_i - \sum_{j=1}^n v_j c_j.$$
Also by definition, we have $\nabla_u f(u^t, v^t) = B(u^t, v^t)\mathbf{1}_n - r$ and $\nabla_v f(u^t, v^t) = B(u^t, v^t)^\top \mathbf{1}_n - c$. Thus, we have $E_t = \|\nabla_u f(u^t, v^t)\|_1 + \|\nabla_v f(u^t, v^t)\|_1$. Since $f$ is convex and minimized at $(u^\star, v^\star)$, we have
$$f(u^t, v^t) - f(u^\star, v^\star) \leq (u^t - u^\star)^\top \nabla_u f(u^t, v^t) + (v^t - v^\star)^\top \nabla_v f(u^t, v^t).$$
Combining Hölder's inequality and the definition of $E_t$ yields
$$f(u^t, v^t) - f(u^\star, v^\star) \leq E_t(\|u^t - u^\star\|_\infty + \|v^t - v^\star\|_\infty). \tag{12}$$
Thus, it suffices to show that $\|u^t - u^\star\|_\infty + \|v^t - v^\star\|_\infty \leq 2\|u^\star\|_\infty + 2\|v^\star\|_\infty$.

The next result is the key observation that makes our analysis work for Greenkhorn. We use an induction argument to establish the following bound:
$$\max\{\|u^t - u^\star\|_\infty, \|v^t - v^\star\|_\infty\} \leq \max\{\|u^0 - u^\star\|_\infty, \|v^0 - v^\star\|_\infty\}. \tag{13}$$
It is clear that Eq. (13) holds true when $t = 0$. Supposing that the inequality holds true for $t \leq k_0$, we show that it also holds true for $t = k_0 + 1$. Without loss of generality, let $I$ be the row index chosen at the $(k_0+1)$-th iteration. Then
$$\|u^{k_0+1} - u^\star\|_\infty \leq \max\{\|u^{k_0} - u^\star\|_\infty, |u^{k_0+1}_I - u^\star_I|\}, \tag{14}$$
$$\|v^{k_0+1} - v^\star\|_\infty = \|v^{k_0} - v^\star\|_\infty. \tag{15}$$
By the updating formula for $u^{k_0+1}_I$ and the optimality condition for $u^\star_I$, we have
$$e^{u^{k_0+1}_I} = \frac{r_I}{\sum_{j=1}^n e^{-C_{Ij}/\eta + v^{k_0}_j}}, \qquad e^{u^\star_I} = \frac{r_I}{\sum_{j=1}^n e^{-C_{Ij}/\eta + v^\star_j}}.$$
Putting these pieces together with the inequality $\frac{\sum_{j=1}^n a_j}{\sum_{j=1}^n b_j} \le \max_{1 \le j \le n} \frac{a_j}{b_j}$ for all $a_j, b_j > 0$ yields
$$|u_I^{k_0+1} - u_I^\star| = \left|\log\left(\frac{\sum_{j=1}^n e^{-\eta^{-1} C_{Ij} + v_j^{k_0}}}{\sum_{j=1}^n e^{-\eta^{-1} C_{Ij} + v_j^\star}}\right)\right| \le \|v^{k_0} - v^\star\|_\infty. \tag{16}$$
Combining Eq. (14) and Eq. (16) yields
$$\|u^{k_0+1} - u^\star\|_\infty \le \max\{\|u^{k_0} - u^\star\|_\infty, \|v^{k_0} - v^\star\|_\infty\}. \tag{17}$$
Combining Eq. (15) and Eq. (17) further implies Eq. (13). This together with $u^0 = v^0 = \mathbf{0}_n$ implies
$$\|u^t - u^\star\|_\infty + \|v^t - v^\star\|_\infty \le 2\left(\|u^0 - u^\star\|_\infty + \|v^0 - v^\star\|_\infty\right) = 2\|u^\star\|_\infty + 2\|v^\star\|_\infty. \tag{18}$$
Putting Eq. (12) and Eq. (18) together yields the desired result. □

Our second lemma shows that at least one optimal solution $(u^\star, v^\star)$ of $f$ is bounded by $\eta^{-1}\|C\|_\infty + \log(n) - 2\log(\min_{1 \le i, j \le n}\{r_i, c_j\})$ in $\ell_\infty$-norm. This result is stronger than Dvurechensky et al. [2018, Lemma 1] and generalizes Blanchet et al. [2018, Lemma 10].

Lemma 3.2. There exists an optimal solution $(u^\star, v^\star)$ of the function $f$ defined in Eq. (11) such that the following inequality holds true:
$$\|u^\star\|_\infty \le R, \qquad \|v^\star\|_\infty \le R,$$
where $R := \eta^{-1}\|C\|_\infty + \log(n) - 2\log(\min_{1 \le i, j \le n}\{r_i, c_j\})$ depends on $C$, $r$ and $c$.

Proof. Using a similar argument as in Lemma 2.2, we can first show that there exists an optimal solution pair $(u^\star, v^\star)$ such that
$$\max_{1 \le i \le n} u_i^\star \ge 0 \ge \min_{1 \le i \le n} u_i^\star \tag{19}$$
(but not necessarily for $v^\star$ simultaneously). Then, we proceed to establish the bounds that are analogous to Eqs. (7) and (8):
$$\max_{1 \le i \le n} u_i^\star - \min_{1 \le i \le n} u_i^\star \le \frac{\|C\|_\infty}{\eta} - \log\left(\min_{1 \le i, j \le n}\{r_i, c_j\}\right), \tag{20}$$
$$\max_{1 \le i \le n} v_i^\star - \min_{1 \le i \le n} v_i^\star \le \frac{\|C\|_\infty}{\eta} - \log\left(\min_{1 \le i, j \le n}\{r_i, c_j\}\right). \tag{21}$$
Indeed, for each $1 \le i \le n$, we have
$$e^{-\eta^{-1}\|C\|_\infty + u_i^\star} \sum_{j=1}^n e^{v_j^\star} \le \sum_{j=1}^n e^{-\eta^{-1} C_{ij} + u_i^\star + v_j^\star} = [B(u^\star, v^\star)\mathbf{1}_n]_i = r_i \le 1,$$
which implies $u_i^\star \le \eta^{-1}\|C\|_\infty - \log(\sum_{j=1}^n e^{v_j^\star})$. Furthermore, we have
$$e^{u_i^\star} \sum_{j=1}^n e^{v_j^\star} \ge \sum_{j=1}^n e^{-\eta^{-1} C_{ij} + u_i^\star + v_j^\star} = [B(u^\star, v^\star)\mathbf{1}_n]_i = r_i \ge \min_{1 \le i, j \le n}\{r_i, c_j\},$$
which implies $u_i^\star \ge \log(\min_{1 \le i, j \le n}\{r_i, c_j\}) - \log(\sum_{j=1}^n e^{v_j^\star})$. Putting these pieces together yields Eq. (20). Using a similar argument, we can prove that Eq. (21) holds true.

Finally, we prove our main results. Since $\max_{1 \le i \le n} u_i^\star \ge 0 \ge \min_{1 \le i \le n} u_i^\star$, we derive from Eq. (20) that
$$-\frac{\|C\|_\infty}{\eta} + \log\left(\min_{1 \le i, j \le n}\{r_i, c_j\}\right) \le \min_{1 \le i \le n} u_i^\star \le \max_{1 \le i \le n} u_i^\star \le \frac{\|C\|_\infty}{\eta} - \log\left(\min_{1 \le i, j \le n}\{r_i, c_j\}\right).$$
This implies that $\|u^\star\|_\infty \le R$. Then, we bound $\|v^\star\|_\infty$ by considering two cases. For the former case, we assume that $\max_{1 \le i \le n} v_i^\star \ge 0$. Note that the optimality condition $\sum_{i,j=1}^n e^{-\eta^{-1} C_{ij} + u_i^\star + v_j^\star} = 1$ implies that
$$\max_{1 \le i \le n} u_i^\star + \max_{1 \le i \le n} v_i^\star \le \log\left(\max_{1 \le i, j \le n} e^{\eta^{-1} C_{ij}}\right) = \frac{\|C\|_\infty}{\eta}.$$
Since $\max_{1 \le i \le n} u_i^\star \ge 0$ and $\max_{1 \le i \le n} v_i^\star \ge 0$, we have $0 \le \max_{1 \le i \le n} v_i^\star \le \frac{\|C\|_\infty}{\eta}$. Combining $\max_{1 \le i \le n} v_i^\star \ge 0$ with Eq. (21) yields
$$\min_{1 \le i \le n} v_i^\star \ge -\frac{\|C\|_\infty}{\eta} + \log\left(\min_{1 \le i, j \le n}\{r_i, c_j\}\right),$$
which implies that $\|v^\star\|_\infty \le R$. For the latter case, we assume that $\max_{1 \le i \le n} v_i^\star \le 0$. Then, we have
$$\min_{1 \le i \le n} v_i^\star \ge \log\left(\min_{1 \le i, j \le n}\{r_i, c_j\}\right) - \log\left(\sum_{i=1}^n e^{u_i^\star}\right).$$
This together with $\|u^\star\|_\infty \le \frac{\|C\|_\infty}{\eta} - \log(\min_{1 \le i, j \le n}\{r_i, c_j\})$ yields that $\|v^\star\|_\infty \le R$. □

Putting Lemmas 3.1 and 3.2 together, we have the following straightforward consequence:

Corollary 3.3.
Letting $\{(u^t, v^t)\}_{t \ge 0}$ be the iterates generated by Algorithm 1, we have
$$f(u^t, v^t) - f(u^\star, v^\star) \le 4 R E_t.$$

Remark 3.4. The notation $R$ is also used in Dvurechensky et al. [2018] and has the same order as ours, since the $R$ in our paper only involves an additional term $\log(n) - \log(\min_{1 \le i, j \le n}\{r_i, c_j\})$.

Remark 3.5. We further comment on the proof techniques in this paper and in Dvurechensky et al. [2018]. Indeed, the proof of Dvurechensky et al. [2018, Lemma 2] depends on taking full advantage of the shift property of Sinkhorn; namely, either $B(u^t, v^t)\mathbf{1}_n = r$ or $B(u^t, v^t)^\top\mathbf{1}_n = c$, where $(u^t, v^t)$ stands for the iterate generated by Sinkhorn. Unfortunately, Greenkhorn does not enjoy such a shift property. We have thus proposed a different approach for bounding $f(u^t, v^t) - f(u^\star, v^\star)$ via appeal to the $\ell_\infty$-norm of the solution $(u^\star, v^\star)$.

3.2 Complexity analysis: bounding the number of iterations

We proceed to provide an upper bound on the number of iterations needed to achieve a desired tolerance $\varepsilon'$ in Algorithm 1. First, we start with a lower bound on the difference of function values between two consecutive iterates of Algorithm 1:

Lemma 3.6. Letting $\{(u^t, v^t)\}_{t \ge 0}$ be the iterates generated by Algorithm 1, we have
$$f(u^t, v^t) - f(u^{t+1}, v^{t+1}) \ge \frac{(E_t)^2}{28n}.$$

Proof. Combining Altschuler et al. [2017, Lemma 5] and the fact that the row or column update is chosen in a greedy manner, we have
$$f(u^t, v^t) - f(u^{t+1}, v^{t+1}) \ge \frac{1}{2n}\left(\rho(r, r(B(u^t, v^t))) + \rho(c, c(B(u^t, v^t)))\right).$$
Furthermore, Altschuler et al. [2017, Lemma 6] implies that
$$\rho(r, r(B(u^t, v^t))) + \rho(c, c(B(u^t, v^t))) \ge \frac{1}{7}\left(\|r - r(B(u^t, v^t))\|_1^2 + \|c - c(B(u^t, v^t))\|_1^2\right).$$
Putting these pieces together yields
$$f(u^t, v^t) - f(u^{t+1}, v^{t+1}) \ge \frac{1}{14n}\left(\|r - r(B(u^t, v^t))\|_1^2 + \|c - c(B(u^t, v^t))\|_1^2\right).$$
Combining the above inequality with the definition of $E_t$, together with $(a+b)^2 \le 2(a^2 + b^2)$, implies the desired result. □

We are now able to derive the iteration complexity of Algorithm 1.

Theorem 3.7. Letting $\{(u^t, v^t)\}_{t \ge 0}$ be the iterates generated by Algorithm 1, the number of iterations required to satisfy $E_t \le \varepsilon'$ is upper bounded by
$$t \le 2 + \frac{112 n R}{\varepsilon'},$$
where $R > 0$ is defined in Lemma 3.2.

Proof. Letting $\delta_t = f(u^t, v^t) - f(u^\star, v^\star)$, we derive from Corollary 3.3 and Lemma 3.6 that
$$\delta_t - \delta_{t+1} \ge \max\left\{\frac{\delta_t^2}{448 n R^2}, \frac{(\varepsilon')^2}{28n}\right\},$$
where $E_t \ge \varepsilon'$ as long as the stopping criterion is not fulfilled. In the following step we apply a switching strategy introduced by Dvurechensky et al. [2018]. Given any $t \ge 1$, we have two estimates:

(i) Considering the process from the first iteration to the $t$-th iteration, we have
$$\delta_{t+1} \le \frac{448 n R^2}{t + 448 n R^2 \delta_1^{-1}} \implies t \le 1 + \frac{448 n R^2}{\delta_t} - \frac{448 n R^2}{\delta_1}.$$

(ii) Considering the process from the $(t+1)$-th iteration to the $(t+m)$-th iteration for any $m \ge 1$, we have
$$\delta_{t+m} \le \delta_t - \frac{(\varepsilon')^2 m}{28n} \implies m \le \frac{28n\left(\delta_t - \delta_{t+m}\right)}{(\varepsilon')^2}.$$

We then minimize the sum of the two estimates by an optimal choice of $s \in (0, \delta_1]$, which yields the claimed bound $t \le 2 + 112 n R (\varepsilon')^{-1}$. □

We prove that APDAMD achieves the complexity bound $\widetilde{O}(n^2\sqrt{\delta}\,\varepsilon^{-1})$, in which $\delta > 0$ stands for the regularity of the mirror mapping $\phi$.

4.1 General setup

We follow the setup in Section 2 and consider the following generalization of the entropic regularized OT problem in Eq. (2):
$$\min_{x \in Q} f(x), \quad \text{s.t. } Ax = b, \tag{22}$$
where $f$ is strongly convex with respect to the $\ell_1$-norm on the set $Q$:
$$f(x') - f(x) - (x' - x)^\top \nabla f(x) \ge \frac{\eta}{2}\|x' - x\|_1^2 \quad \text{for any } x', x \in Q.$$
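For intuition, the strong-convexity condition above can be checked numerically for the entropic OT objective: on probability vectors, its Bregman divergence equals $\eta \cdot \mathrm{KL}(x' \| x)$, which dominates $\frac{\eta}{2}\|x' - x\|_1^2$ by Pinsker's inequality. The following short script is our illustrative sanity check (not part of the paper's development); the instance sizes and constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.5
c_flat = rng.random((4, 4)).ravel()  # flattened cost matrix C

def f(x):
    # entropic OT objective on a flattened plan: <C, x> + eta * sum x log x
    return float(c_flat @ x + eta * np.sum(x * np.log(x)))

def grad_f(x):
    return c_flat + eta * (np.log(x) + 1.0)

def bregman_gap(xp, x):
    # f(x') - f(x) - (x' - x)^T grad f(x); equals eta * KL(x' || x) on the simplex
    return f(xp) - f(x) - (xp - x) @ grad_f(x)

# draw strictly positive probability vectors (transportation plans, flattened)
for _ in range(100):
    x = rng.random(16) + 1e-3; x /= x.sum()
    xp = rng.random(16) + 1e-3; xp /= xp.sum()
    # strong convexity w.r.t. the l1-norm with modulus eta (Pinsker)
    assert bregman_gap(xp, x) >= 0.5 * eta * np.linalg.norm(xp - x, 1) ** 2 - 1e-12
print("l1 strong convexity check passed")
```

The check relies on the plans being normalized; off the simplex the linear terms of the Bregman divergence no longer cancel into a KL divergence.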
Note that, in the specific setting of the entropic regularized OT problem, the function is
$$f(x) = \sum_{i,j} C_{ij}\, x_{j+n(i-1)} + \eta\, x_{j+n(i-1)} \log(x_{j+n(i-1)}),$$
where $x_{j+n(i-1)} = X_{ij}$ for any $i, j$, with $X$ the transportation plan in Eq. (2), and the vector $b \in \mathbb{R}^{2n}$ is defined by $b_i = r_i$ for $1 \le i \le n$ and $b_i = c_{i-n}$ for $n+1 \le i \le 2n$. Furthermore, the matrix $A = (A_{ij}) \in \mathbb{R}^{2n \times n^2}$ is defined as follows: for $1 \le i \le n$, we set $A_{ij} = 1$ if $1 + n(i-1) \le j \le n \cdot i$ and $0$ otherwise; for $n+1 \le i \le 2n$, we set $A_{ij} = 1$ if $j \in \{i - n + n(l-1) : 1 \le l \le n\}$ and $0$ otherwise. To be consistent with the notation in Algorithms 4 and 5, we specifically write $A_{\mathrm{ot}}$ for the matrix $A$ corresponding to the entropic regularized OT problem.

After some calculations with the general problem (22), we obtain the dual problem
$$\min_{\lambda \in \mathbb{R}^{2n}} \widetilde{\varphi}(\lambda) := \langle \lambda, b \rangle + \max_{x \in \mathbb{R}^{n^2}} \left\{-f(x) - \langle A^\top \lambda, x \rangle\right\}, \tag{23}$$
with $\nabla \widetilde{\varphi}(\lambda) = b - A x(\lambda)$, where $x(\lambda) = \operatorname{argmax}_{x \in \mathbb{R}^{n^2}} \{-f(x) - \langle A^\top \lambda, x \rangle\}$; see the explicit form in Eq. (9) with $\lambda = (\alpha, \beta)$. By Nesterov [2005, Theorem 1] with the $\ell_1$-norm for the dual space of the Lagrange multipliers, the dual objective function $\widetilde{\varphi}$ satisfies the following inequality:
$$\widetilde{\varphi}(\lambda') - \widetilde{\varphi}(\lambda) - (\lambda' - \lambda)^\top \nabla \widetilde{\varphi}(\lambda) \le \frac{\|A\|_{1 \to 1}^2}{2\eta}\|\lambda' - \lambda\|_\infty^2. \tag{24}$$
In the entropic regularized OT problem, each column of the matrix $A_{\mathrm{ot}}$ contains no more than two nonzero elements, which are equal to one. Since $\|A_{\mathrm{ot}}\|_{1 \to 1}$ equals the maximum $\ell_1$-norm of the columns of this matrix, we have $\|A_{\mathrm{ot}}\|_{1 \to 1} = 2$. Thus, the dual objective function $\widetilde{\varphi}$ is $\frac{4}{\eta}$-gradient Lipschitz with respect to the $\ell_\infty$-norm.
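To make the definition of $A_{\mathrm{ot}}$ concrete, the following snippet (our illustrative code, not from the paper) builds $A_{\mathrm{ot}}$ for the row-major flattening $x_{j+n(i-1)} = X_{ij}$ and checks both that $A_{\mathrm{ot}}\,\mathrm{vec}(X)$ stacks the row and column marginals and that $\|A_{\mathrm{ot}}\|_{1\to1} = 2$.

```python
import numpy as np

def build_A_ot(n: int) -> np.ndarray:
    """Constraint matrix A_ot in R^{2n x n^2} for the row-major flattening
    x_{j + n(i-1)} = X_{ij} used in the text."""
    A = np.zeros((2 * n, n * n))
    for i in range(n):
        A[i, i * n:(i + 1) * n] = 1.0   # rows 1..n extract the i-th row sum of X
        A[n + i, i::n] = 1.0            # rows n+1..2n extract the i-th column sum
    return A

n = 4
A_ot = build_A_ot(n)
X = np.random.default_rng(1).random((n, n))
# A_ot vec(X) = (X 1_n ; X^T 1_n)
assert np.allclose(A_ot @ X.ravel(), np.r_[X.sum(axis=1), X.sum(axis=0)])
# ||A||_{1->1} is the maximum column l1-norm; every column has exactly two ones
assert A_ot.sum(axis=0).max() == 2.0 and A_ot.sum(axis=0).min() == 2.0
```

In practice one would never materialize $A_{\mathrm{ot}}$: applying it is just computing the two marginals of $X$, which is what the dense construction above verifies.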
In addition, we define the Bregman divergence $B_\phi : \mathbb{R}^{2n} \times \mathbb{R}^{2n} \to [0, +\infty)$ by
$$B_\phi(\lambda', \lambda) := \phi(\lambda') - \phi(\lambda) - (\lambda' - \lambda)^\top \nabla \phi(\lambda),$$
where the mirror mapping $\phi$ is $\frac{1}{\delta}$-strongly convex and $1$-smooth on $\mathbb{R}^{2n}$ with respect to the $\ell_\infty$-norm; that is,
$$\frac{1}{2\delta}\|\lambda' - \lambda\|_\infty^2 \le \phi(\lambda') - \phi(\lambda) - (\lambda' - \lambda)^\top \nabla \phi(\lambda) \le \frac{1}{2}\|\lambda' - \lambda\|_\infty^2.$$
For example, we can choose $\phi(\lambda) = \frac{1}{2n}\|\lambda\|^2$ and $B_\phi(\lambda', \lambda) = \frac{1}{2n}\|\lambda' - \lambda\|^2$ in APDAMD, where $\delta = n$. As such, $\delta > 0$ is a function of $n$ in general, and it will appear in the complexity bound of APDAMD for approximating the OT problem (cf. Theorem 4.5). It is worth noting that our algorithm uses a regularizer that acts only in the dual, and our complexity bound is the best existing one among this group of algorithms [Dvurechensky et al., 2018, Guo et al., 2020, Guminov et al., 2021]. A very recent work of Jambulapati et al. [2019] showed that the complexity bound can be improved to $\widetilde{O}(n^2\varepsilon^{-1})$ using a more advanced area-convex mirror mapping [Sherman, 2017].

Algorithm 4: Approximating OT by Algorithm 3
  Input: $\eta = \frac{\varepsilon}{4\log(n)}$ and $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$.
  Step 1: Let $\tilde r \in \Delta^n$ and $\tilde c \in \Delta^n$ be defined by $(\tilde r, \tilde c) = (1 - \frac{\varepsilon'}{8})(r, c) + \frac{\varepsilon'}{8n}(\mathbf{1}_n, \mathbf{1}_n)$.
  Step 2: Let $A_{\mathrm{ot}} \in \mathbb{R}^{2n \times n^2}$ and $b \in \mathbb{R}^{2n}$ be defined by $A_{\mathrm{ot}}\,\mathrm{vec}(X) = \binom{X\mathbf{1}_n}{X^\top\mathbf{1}_n}$ and $b = \binom{\tilde r}{\tilde c}$.
  Step 3: Compute $\tilde X = \textsc{Apdamd}(\widetilde{\varphi}, A_{\mathrm{ot}}, b, \varepsilon'/2)$, where $\widetilde{\varphi}$ is defined by Eq. (23).
  Step 4: Round $\tilde X$ to $\hat X$ using Altschuler et al. [2017, Algorithm 2] such that $\hat X\mathbf{1}_n = r$ and $\hat X^\top\mathbf{1}_n = c$.
  Output: $\hat X$.

4.2 Properties of APDAMD

We present several important properties of Algorithm 3 that will be used later for entropic regularized OT problems. First, we prove the following result regarding the number of line-search iterations in Algorithm 3 for solving the entropic regularized OT problem:

Lemma 4.1.
The number of line-search iterations in Algorithm 3 for solving the entropic regularized OT problem is finite. Furthermore, the total number of gradient oracle calls after the $t$-th iteration is bounded as
$$N_t \le 4t + 4 + \frac{2\log(\frac{8}{\eta}) - 2\log(L_0)}{\log 2}.$$

Proof. First, we observe that $M_t$ keeps being multiplied by two until the line-search stopping criterion is satisfied. Then, Eq. (24) implies that the number of line-search iterations is finite and that $M_t \le \frac{2\|A_{\mathrm{ot}}\|_{1\to1}^2}{\eta}$ holds true for all $t \ge 0$; otherwise, the line-search stopping criterion would already have been satisfied with $\frac{M_t}{2}$, since $\frac{M_t}{2} \ge \frac{\|A_{\mathrm{ot}}\|_{1\to1}^2}{\eta}$. Letting $i_j$ denote the total number of multiplications at the $j$-th iteration, we have
$$i_0 \le 1 + \frac{\log(\frac{M_0}{L_0})}{\log 2}, \qquad i_j \le 2 + \frac{\log(\frac{M_j}{M_{j-1}})}{\log 2}.$$
Then, the total number of line-search iterations is bounded by
$$\sum_{j=0}^t i_j \le 1 + \frac{\log(\frac{M_0}{L_0})}{\log 2} + \sum_{j=1}^t \left(2 + \frac{\log(\frac{M_j}{M_{j-1}})}{\log 2}\right) \le 2t + 1 + \frac{\log(\frac{2\|A_{\mathrm{ot}}\|_{1\to1}^2}{\eta}) - \log(L_0)}{\log 2}.$$
Since each line search contains two gradient oracle calls and $\|A_{\mathrm{ot}}\|_{1\to1} = 2$, we conclude the desired upper bound on the total number of gradient oracle calls after the $t$-th iteration. □

The next lemma presents a property of the function $\widetilde{\varphi}$ in Algorithm 3.

Lemma 4.2. For each iteration $t$ of Algorithm 3 and any $z \in \mathbb{R}^{2n}$, we have
$$\bar\alpha_t\widetilde{\varphi}(\lambda^t) \le \sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right) + \|z\|_\infty^2.$$

Proof. First, we claim that the following holds for any $z \in \mathbb{R}^{2n}$:
$$\alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) \le \bar\alpha_{t+1}\left(\widetilde{\varphi}(\mu^{t+1}) - \widetilde{\varphi}(\lambda^{t+1})\right) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}). \tag{25}$$
Indeed, the optimality condition in mirror descent implies that, for any $z \in \mathbb{R}^{2n}$, we have
$$(z - z^{t+1})^\top\left(\nabla\widetilde{\varphi}(\mu^{t+1}) + \frac{\nabla\phi(z^{t+1}) - \nabla\phi(z^t)}{\alpha_{t+1}}\right) \ge 0.$$
By definition, we have $B_\phi(z, z^t) - B_\phi(z, z^{t+1}) - B_\phi(z^{t+1}, z^t) = (z - z^{t+1})^\top(\nabla\phi(z^{t+1}) - \nabla\phi(z^t))$ and $B_\phi(z^{t+1}, z^t) \ge \frac{1}{2\delta}\|z^{t+1} - z^t\|_\infty^2$. Putting these pieces together yields
$$\begin{aligned}
\alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) &= \alpha_{t+1}(z^t - z^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + \alpha_{t+1}(z^{t+1} - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) \\
&\le \alpha_{t+1}(z^t - z^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + (z - z^{t+1})^\top(\nabla\phi(z^{t+1}) - \nabla\phi(z^t)) \\
&= \alpha_{t+1}(z^t - z^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}) - B_\phi(z^{t+1}, z^t) \\
&\le \alpha_{t+1}(z^t - z^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}) - \frac{\|z^{t+1} - z^t\|_\infty^2}{2\delta}.
\end{aligned} \tag{26}$$
The update formulas for $\mu^{t+1}$, $\lambda^{t+1}$, $\alpha_{t+1}$ and $\bar\alpha_{t+1}$ imply that
$$\lambda^{t+1} - \mu^{t+1} = \frac{\alpha_{t+1}}{\bar\alpha_{t+1}}(z^{t+1} - z^t), \qquad \delta M_t(\alpha_{t+1})^2 = \bar\alpha_{t+1}.$$
Therefore, we have
$$\alpha_{t+1}(z^t - z^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) = \bar\alpha_{t+1}(\mu^{t+1} - \lambda^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}),$$
and
$$\|z^{t+1} - z^t\|_\infty^2 = \left(\frac{\bar\alpha_{t+1}}{\alpha_{t+1}}\right)^2\|\mu^{t+1} - \lambda^{t+1}\|_\infty^2 = \delta M_t\,\bar\alpha_{t+1}\|\mu^{t+1} - \lambda^{t+1}\|_\infty^2.$$
Putting these pieces together with Eq. (26) yields
$$\begin{aligned}
\alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) &\le \bar\alpha_{t+1}(\mu^{t+1} - \lambda^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}) - \frac{\bar\alpha_{t+1} M_t}{2}\|\mu^{t+1} - \lambda^{t+1}\|_\infty^2 \\
&= \bar\alpha_{t+1}\left((\mu^{t+1} - \lambda^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) - \frac{M_t}{2}\|\mu^{t+1} - \lambda^{t+1}\|_\infty^2\right) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}) \\
&\le \bar\alpha_{t+1}\left(\widetilde{\varphi}(\mu^{t+1}) - \widetilde{\varphi}(\lambda^{t+1})\right) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}),
\end{aligned}$$
where the last inequality comes from the stopping criterion in the line search. This implies that Eq. (25) holds true. The next step is to bound the iterative objective gap:
$$\bar\alpha_{t+1}\widetilde{\varphi}(\lambda^{t+1}) - \bar\alpha_t\widetilde{\varphi}(\lambda^t) \le \alpha_{t+1}\left(\widetilde{\varphi}(\mu^{t+1}) + (z - \mu^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1})\right) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}). \tag{27}$$
Indeed, by combining $\bar\alpha_{t+1} = \bar\alpha_t + \alpha_{t+1}$ and the update formula for $\mu^{t+1}$, we have
$$\alpha_{t+1}(\mu^{t+1} - z^t) = (\bar\alpha_{t+1} - \bar\alpha_t)\mu^{t+1} - \alpha_{t+1} z^t = \alpha_{t+1} z^t + \bar\alpha_t\lambda^t - \bar\alpha_t\mu^{t+1} - \alpha_{t+1} z^t = \bar\alpha_t(\lambda^t - \mu^{t+1}).$$
This together with the convexity of $\widetilde{\varphi}$ implies that
$$\begin{aligned}
\alpha_{t+1}(\mu^{t+1} - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) &= \alpha_{t+1}(\mu^{t+1} - z^t)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + \alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) \\
&= \bar\alpha_t(\lambda^t - \mu^{t+1})^\top \nabla\widetilde{\varphi}(\mu^{t+1}) + \alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) \\
&\le \bar\alpha_t\left(\widetilde{\varphi}(\lambda^t) - \widetilde{\varphi}(\mu^{t+1})\right) + \alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}).
\end{aligned}$$
Furthermore, we derive from Eq. (25) and $\bar\alpha_{t+1} = \bar\alpha_t + \alpha_{t+1}$ that
$$\begin{aligned}
\bar\alpha_t\left(\widetilde{\varphi}(\lambda^t) - \widetilde{\varphi}(\mu^{t+1})\right) + \alpha_{t+1}(z^t - z)^\top \nabla\widetilde{\varphi}(\mu^{t+1}) &\le \bar\alpha_t\left(\widetilde{\varphi}(\lambda^t) - \widetilde{\varphi}(\mu^{t+1})\right) + \bar\alpha_{t+1}\left(\widetilde{\varphi}(\mu^{t+1}) - \widetilde{\varphi}(\lambda^{t+1})\right) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}) \\
&= \bar\alpha_t\widetilde{\varphi}(\lambda^t) - \bar\alpha_{t+1}\widetilde{\varphi}(\lambda^{t+1}) + \alpha_{t+1}\widetilde{\varphi}(\mu^{t+1}) + B_\phi(z, z^t) - B_\phi(z, z^{t+1}).
\end{aligned}$$
Putting these pieces together yields that Eq. (27) holds true.

Finally, we prove our main result. By changing the index $t$ to $j$ in Eq. (27) and summing the resulting inequality over $j = 0, 1, \ldots, t-1$, we have
$$\bar\alpha_t\widetilde{\varphi}(\lambda^t) - \bar\alpha_0\widetilde{\varphi}(\lambda^0) \le \sum_{j=0}^{t-1}\alpha_{j+1}\left(\widetilde{\varphi}(\mu^{j+1}) + (z - \mu^{j+1})^\top \nabla\widetilde{\varphi}(\mu^{j+1})\right) + B_\phi(z, z^0) - B_\phi(z, z^t).$$
Since $\alpha_0 = \bar\alpha_0 = 0$, $B_\phi(z, z^t) \ge 0$ and $\phi$ is $1$-smooth with respect to the $\ell_\infty$-norm, we have
$$\bar\alpha_t\widetilde{\varphi}(\lambda^t) \le \sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right) + B_\phi(z, z^0) \le \sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right) + \|z - z^0\|_\infty^2 \overset{z^0 = 0}{=} \sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right) + \|z\|_\infty^2.$$
This completes the proof. □

The final lemma provides a key lower bound for the accumulating parameter.

Lemma 4.3. For each iteration $t$ of Algorithm 3, we have $\bar\alpha_t \ge \frac{\eta(t+1)^2}{32\delta}$.

Proof.
For $t = 1$, we have $\bar\alpha_1 = \alpha_1 = \frac{1}{\delta M_1} \ge \frac{\eta}{8\delta}$, since $M_1 \le \frac{8}{\eta}$ was proven in Lemma 4.1; thus the desired result holds true when $t = 1$, as $\frac{\eta(1+1)^2}{32\delta} = \frac{\eta}{8\delta}$. We then proceed to prove that it holds true for all $t \ge 1$ by induction. Indeed, we have
$$\bar\alpha_{t+1} = \bar\alpha_t + \alpha_{t+1} = \bar\alpha_t + \frac{1 + \sqrt{1 + 4\delta M_t\bar\alpha_t}}{2\delta M_t} = \bar\alpha_t + \frac{1}{2\delta M_t} + \sqrt{\frac{1}{4(\delta M_t)^2} + \frac{\bar\alpha_t}{\delta M_t}} \ge \bar\alpha_t + \frac{1}{2\delta M_t} + \sqrt{\frac{\bar\alpha_t}{\delta M_t}} \ge \bar\alpha_t + \frac{\eta}{16\delta} + \sqrt{\frac{\eta\bar\alpha_t}{8\delta}},$$
where the last inequality comes from $M_t \le \frac{8}{\eta}$ as shown in Lemma 4.1. Suppose that the desired result holds true for $t = k_0$; then we have
$$\bar\alpha_{k_0+1} \ge \frac{\eta(k_0+1)^2}{32\delta} + \frac{\eta}{16\delta} + \sqrt{\frac{\eta^2(k_0+1)^2}{256\delta^2}} = \frac{\eta\left((k_0+1)^2 + 2 + 2(k_0+1)\right)}{32\delta} \ge \frac{\eta(k_0+2)^2}{32\delta}.$$
This completes the proof. □

4.3 Complexity analysis for APDAMD

We are now ready to establish the complexity bound of APDAMD for solving the entropic regularized OT problem. Indeed, we recall that $\widetilde{\varphi}(\lambda)$ is defined with $\lambda = (\alpha, \beta)$ by
$$\widetilde{\varphi}(\alpha, \beta) = \eta\log\Bigl(\sum_{1 \le i,j \le n} e^{\eta^{-1}(\alpha_i + \beta_j - C_{ij})}\Bigr) - \alpha^\top r - \beta^\top c.$$
Since $(\alpha, \beta)$ can be obtained from $(u, v)$ by $\alpha_i = \eta u_i$ and $\beta_j = \eta v_j$, we derive from Lemma 2.2 that
$$\|\alpha^\star\|_\infty \le \eta R, \qquad \|\beta^\star\|_\infty \le \eta R,$$
where $R$ is defined accordingly. Then, we proceed to the following key result, which determines an upper bound on the number of iterations for Algorithm 3 to reach a desired accuracy $\varepsilon'$:

Theorem 4.4. Letting $\{X^t\}_{t \ge 0}$ be the iterates generated by Algorithm 3, the number of iterations required to satisfy $\|A_{\mathrm{ot}}\,\mathrm{vec}(X^t) - b\|_1 \le \varepsilon'$ is upper bounded by
$$t \le 1 + \sqrt{\frac{128\delta R}{\varepsilon'}},$$
where $R > 0$ is defined in Lemma 2.2.

Proof. From Lemma 4.2, we have
$$\bar\alpha_t\widetilde{\varphi}(\lambda^t) \le \min_{z \in \mathcal{B}_\infty(2\eta R)}\left\{\sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right) + \|z\|_\infty^2\right\},$$
where $\mathcal{B}_\infty(\varrho) := \{\lambda \in \mathbb{R}^{2n} \mid \|\lambda\|_\infty \le \varrho\}$. This implies that
$$\bar\alpha_t\widetilde{\varphi}(\lambda^t) \le \min_{z \in \mathcal{B}_\infty(2\eta R)}\left\{\sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right)\right\} + 4\eta^2 R^2.$$
Since $\widetilde{\varphi}$ is the objective function of the dual entropic regularized OT problem, we have
$$\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j) = -f(x(\mu^j)) + z^\top(b - A_{\mathrm{ot}}x(\mu^j)).$$
Therefore, we conclude that
$$\bar\alpha_t\widetilde{\varphi}(\lambda^t) \le \min_{z \in \mathcal{B}_\infty(2\eta R)}\left\{\sum_{j=0}^t \alpha_j\left(\widetilde{\varphi}(\mu^j) + (z - \mu^j)^\top \nabla\widetilde{\varphi}(\mu^j)\right)\right\} + 4\eta^2 R^2 \le 4\eta^2 R^2 - \bar\alpha_t f(x^t) + \min_{z \in \mathcal{B}_\infty(2\eta R)}\left\{\bar\alpha_t z^\top(b - A_{\mathrm{ot}}x^t)\right\} = 4\eta^2 R^2 - \bar\alpha_t f(x^t) - 2\bar\alpha_t\eta R\|A_{\mathrm{ot}}x^t - b\|_1,$$
where the second inequality comes from the convexity of $f$ and the last equality comes from the fact that the $\ell_1$-norm is the dual norm of the $\ell_\infty$-norm. That is to say,
$$f(x^t) + \widetilde{\varphi}(\lambda^t) + 2\eta R\|A_{\mathrm{ot}}x^t - b\|_1 \le \frac{4\eta^2 R^2}{\bar\alpha_t}.$$
Suppose that $\lambda^\star$ is an optimal solution to the dual entropic regularized OT problem satisfying $\|\lambda^\star\|_\infty \le \eta R$; then we have
$$f(x^t) + \widetilde{\varphi}(\lambda^t) \ge f(x^t) + \widetilde{\varphi}(\lambda^\star) = f(x^t) + b^\top\lambda^\star + \max_{x \in \mathbb{R}^{n^2}}\left\{-f(x) - (\lambda^\star)^\top A_{\mathrm{ot}}x\right\} \ge f(x^t) + b^\top\lambda^\star - f(x^t) - (\lambda^\star)^\top A_{\mathrm{ot}}x^t = (b - A_{\mathrm{ot}}x^t)^\top\lambda^\star \ge -\eta R\|A_{\mathrm{ot}}x^t - b\|_1.$$
Therefore, we conclude that
$$\|A_{\mathrm{ot}}x^t - b\|_1 \le \frac{4\eta R}{\bar\alpha_t} \le \frac{128\delta R}{(t+1)^2},$$
where the last inequality uses Lemma 4.3. This completes the proof. □

Now, we are ready to present the complexity bound of Algorithm 4 for approximating the OT problem.

Theorem 4.5. The APDAMD scheme for approximating optimal transport (Algorithm 4) returns an $\varepsilon$-approximate transportation plan (cf. Definition 1) in
$$O\left(\frac{n^2\sqrt{\delta}\,\|C\|_\infty\log(n)}{\varepsilon}\right)$$
arithmetic operations.

Proof. Using the same argument as in Theorem 3.8, we have
$$\langle C, \hat X\rangle - \langle C, X^\star\rangle \le \frac{\varepsilon}{2} + 4\left(\|\tilde X\mathbf{1}_n - r\|_1 + \|\tilde X^\top\mathbf{1}_n - c\|_1\right)\|C\|_\infty,$$
where $\hat X$ is returned by Algorithm 4, $X^\star$ is a solution to the OT problem, and $\tilde X = \textsc{Apdamd}(\widetilde{\varphi}, A_{\mathrm{ot}}, b, \varepsilon'/2)$. Since $\|\tilde X\mathbf{1}_n - r\|_1 + \|\tilde X^\top\mathbf{1}_n - c\|_1 \le \varepsilon'$ and $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$, we have $\langle C, \hat X\rangle - \langle C, X^\star\rangle \le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon$. The remaining step is to analyze the complexity bound.
It follows from Lemma 4.1 and Theorem 4.4 that
$$N_t \le 4t + 4 + \frac{2\log(\frac{8}{\eta}) - 2\log(L_0)}{\log 2} \le 8 + \sqrt{\frac{2048\delta R}{\varepsilon'}} + \frac{2\log(\frac{8}{\eta}) - 2\log(L_0)}{\log 2} = 8 + 256\sqrt{\frac{\delta R\|C\|_\infty\log(n)}{\varepsilon}} + \frac{2\log(\frac{32\log(n)}{\varepsilon}) - 2\log(L_0)}{\log 2}.$$
Combining the definition of $R$ in Lemma 2.2 with the definitions of $\eta$, $\tilde r$ and $\tilde c$ in Algorithm 4, we have
$$R \le \frac{4\|C\|_\infty\log(n)}{\varepsilon} + \log(n) - 2\log\left(\frac{\varepsilon}{64n\|C\|_\infty}\right).$$
Therefore, we conclude that
$$N_t \le 256\sqrt{\frac{\delta\|C\|_\infty\log(n)}{\varepsilon}}\sqrt{\frac{4\|C\|_\infty\log(n)}{\varepsilon} + \log(n) - 2\log\left(\frac{\varepsilon}{64n\|C\|_\infty}\right)} + \frac{2\log(\frac{32\log(n)}{\varepsilon}) - 2\log(L_0)}{\log 2} + 8 = O\left(\frac{\sqrt{\delta}\|C\|_\infty\log(n)}{\varepsilon}\right).$$
The total iteration complexity in Step 3 of Algorithm 4 is thus bounded by $O(\sqrt{\delta}\|C\|_\infty\log(n)\varepsilon^{-1})$, and each iteration of Algorithm 3 requires $O(n^2)$ arithmetic operations. Moreover, $\tilde r$ and $\tilde c$ in Step 1 of Algorithm 4 can be found in $O(n)$ arithmetic operations, and Altschuler et al. [2017, Algorithm 2] requires $O(n^2)$ arithmetic operations. Therefore, we conclude that the total number of arithmetic operations is $O(n^2\sqrt{\delta}\|C\|_\infty\log(n)\varepsilon^{-1})$. □

Algorithm 5: Approximating OT by Dvurechensky et al. [2018, Algorithm 3]
  Input: $\eta = \frac{\varepsilon}{4\log(n)}$ and $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$.
  Step 1: Let $\tilde r \in \Delta^n$ and $\tilde c \in \Delta^n$ be defined by $(\tilde r, \tilde c) = (1 - \frac{\varepsilon'}{8})(r, c) + \frac{\varepsilon'}{8n}(\mathbf{1}_n, \mathbf{1}_n)$.
  Step 2: Let $A_{\mathrm{ot}} \in \mathbb{R}^{2n \times n^2}$ and $b \in \mathbb{R}^{2n}$ be defined by $A_{\mathrm{ot}}\,\mathrm{vec}(X) = \binom{X\mathbf{1}_n}{X^\top\mathbf{1}_n}$ and $b = \binom{\tilde r}{\tilde c}$.
  Step 3: Compute $\tilde X = \textsc{Apdagd}(\widetilde{\varphi}, A_{\mathrm{ot}}, b, \varepsilon'/2)$, where $\widetilde{\varphi}$ is defined by Eq. (23).
  Step 4: Round $\tilde X$ to $\hat X$ using Altschuler et al. [2017, Algorithm 2] such that $\hat X\mathbf{1}_n = r$ and $\hat X^\top\mathbf{1}_n = c$.
  Output: $\hat X$.

The complexity result in Theorem 4.5 suggests an interesting feature of the (regularized) OT problem.
Indeed, the dependence of that bound on $\delta$ manifests the necessity of the $\ell_\infty$-norm in understanding the complexity of the entropic regularized OT problem. This view is also in harmony with the proof technique of the running-time analysis for Greenkhorn in Section 3, where we rely on the $\ell_\infty$-norm of optimal solutions of the dual entropic regularized OT problem to measure the progress in objective value across successive iterates.

4.4 Revisiting APDAGD

We revisit APDAGD [Dvurechensky et al., 2018] for the entropic regularized OT problem. First, we point out, via a simple counterexample, that its current complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ is not valid. Then, we establish a new complexity bound for APDAGD using our techniques from Section 4.3. Despite the issue with entropic regularized OT, we wish to emphasize that APDAGD is still an interesting and efficient accelerated algorithm for general linearly constrained convex optimization problems, with solid theoretical guarantees. More precisely, Dvurechensky et al. [2018, Theorem 3] is not applicable to entropic regularized OT, since no dual solution exists with a constant bound in $\ell_2$-norm; however, it can be used for analyzing other problems with bounded optimal dual solutions.

To facilitate the ensuing discussion, we first present the complexity bound for entropic regularized OT in Dvurechensky et al. [2018] using our notation. Indeed, we recall that APDAGD is developed for solving the optimization problem with the objective function $\widehat{\varphi}$ defined as follows:
$$\min_{\alpha, \beta \in \mathbb{R}^n} \widehat{\varphi}(\alpha, \beta) := \eta\sum_{i,j=1}^n e^{-\frac{C_{ij} - \alpha_i - \beta_j}{\eta} - 1} - \alpha^\top r - \beta^\top c. \tag{28}$$

Theorem 4.6 (Theorem 4 in Dvurechensky et al. [2018]). Let $\bar R > 0$ be defined such that there exists an optimal solution to the dual entropic regularized OT problem in Eq.
(23), denoted by $(u^\star, v^\star)$, satisfying $\|(u^\star, v^\star)\| \le \bar R$. Then the APDAGD scheme for approximating optimal transport (cf. Algorithm 5) returns an $\varepsilon$-approximate transportation plan (cf. Definition 1) in
$$O\left(\min\left\{n^{9/4}\,\frac{\sqrt{\bar R\|C\|_\infty\log(n)}}{\varepsilon},\; n^2\,\frac{\bar R\|C\|_\infty\log(n)}{\varepsilon^2}\right\}\right)$$
arithmetic operations.

From the above theorem, Dvurechensky et al. [2018] claim that the complexity bound for APDAGD is $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$. However, there are two issues:

1. The upper bound $\bar R$ is assumed to be independent of $n$, which is not correct; see our counterexample in Proposition 4.7.

2. The known upper bound $\bar R$ for the optimal solution depends on $\min_{1 \le i,j \le n}\{r_i, c_j\}$ (cf. Dvurechensky et al. [2018, Lemma 1] or Lemma 2.2 in our paper). This implies that a valid algorithm needs to take the rounding error in $r$ and $c$ into account.

Corrected upper bound $\bar R$. Corollary 2.3 and Lemma 3.2 imply that a straightforward upper bound for $\bar R$ is $\widetilde{O}(\sqrt{n})$. Given a tolerance $\varepsilon \in (0, 1)$, we further show that $\bar R$ is indeed $\Omega(\sqrt{n})$ by means of a specific entropic regularized OT problem.

Proposition 4.7. Suppose that $C = \mathbf{1}_n\mathbf{1}_n^\top$ and $r = c = \frac{1}{n}\mathbf{1}_n$. Given a tolerance $\varepsilon \in (0, 1)$ and the regularization parameter $\eta = \frac{\varepsilon}{4\log(n)}$, all optimal solutions of the dual entropic regularized OT problem in Eq. (28) satisfy $\|(\alpha^\star, \beta^\star)\| \gtrsim \sqrt{n}$.

Proof. By the definition of $r$, $c$ and $\eta$, we rewrite the dual function $\widehat{\varphi}(\alpha, \beta)$ as follows:
$$\widehat{\varphi}(\alpha, \beta) = \frac{\varepsilon}{4e\log(n)}\sum_{1 \le i,j \le n} e^{-\frac{4\log(n)(1 - \alpha_i - \beta_j)}{\varepsilon}} - \frac{\alpha^\top\mathbf{1}_n}{n} - \frac{\beta^\top\mathbf{1}_n}{n}.$$
Since $(\alpha^\star, \beta^\star)$ is an optimal solution of the dual entropic regularized OT problem, we have
$$e^{\frac{4\log(n)\alpha_i^\star}{\varepsilon}}\sum_{j=1}^n e^{-\frac{4\log(n)(1 - \beta_j^\star)}{\varepsilon}} = e^{\frac{4\log(n)\beta_i^\star}{\varepsilon}}\sum_{j=1}^n e^{-\frac{4\log(n)(1 - \alpha_j^\star)}{\varepsilon}} = \frac{e}{n} \quad \text{for all } i \in [n]. \tag{29}$$
This implies that $\alpha_i^\star = \alpha_j^\star$ and $\beta_i^\star = \beta_j^\star$ for all $i, j \in [n]$, and that $\alpha_i^\star + \beta_i^\star$ is the same for all $i \in [n]$. Without loss of generality, we can let $\alpha_i^\star = 0$ in Eq. (29) and obtain that
$$\beta_i^\star = 1 + \frac{\varepsilon}{4\log(n)}(1 - 2\log(n)) = 1 + \frac{\varepsilon}{4\log(n)} - \frac{\varepsilon}{2},$$
which implies that $\alpha_i^\star + \beta_i^\star = 1 + \frac{\varepsilon}{4\log(n)} - \frac{\varepsilon}{2} \ge \frac{1}{2}$ for all $i \in [n]$, since $\varepsilon \in (0, 1)$. Thus, we have
$$\|(\alpha^\star, \beta^\star)\| \ge \sqrt{\frac{\sum_{i=1}^n(\alpha_i^\star + \beta_i^\star)^2}{2}} \ge \frac{1}{2}\sqrt{\frac{n}{2}} \gtrsim \sqrt{n}.$$
As a consequence, we achieve the conclusion of the proposition. □

Approximation algorithm for OT by APDAGD. It is worth noting that the rounding procedure is missing in Dvurechensky et al. [2018, Algorithm 4], and we improve it to Algorithm 5. In particular, Dvurechensky et al. [2018, Algorithm 3] is used in Step 3 of Algorithm 5 with another function $\widetilde{\varphi}$, defined in Eq. (9). Given the corrected upper bound $\bar R$ and Algorithm 5 for approximating OT, we provide a new complexity bound for Algorithm 5 in the following proposition.

Proposition 4.8. The APDAGD scheme for approximating optimal transport (Algorithm 5) returns an $\varepsilon$-approximate transportation plan (cf. Definition 1) in
$$O\left(\frac{n^{5/2}\|C\|_\infty\sqrt{\log(n)}}{\varepsilon}\right)$$
arithmetic operations.

Proof. The proof is a simple modification of the proof of Dvurechensky et al. [2018, Theorem 4], and we only give a proof sketch here. In particular, we can obtain that the number of iterations required by Algorithm 5 to reach the tolerance $\varepsilon$ is
$$t \le O\left(\max\left\{\min\left\{n^{1/4}\,\frac{\sqrt{\bar R\|C\|_\infty\log(n)}}{\varepsilon},\; \frac{\bar R\|C\|_\infty\log(n)}{\varepsilon^2}\right\},\; \frac{\bar R\sqrt{\log(n)}}{\varepsilon}\right\}\right). \tag{30}$$
Moreover, we have $\bar R \le \sqrt{n}\,\eta R$, where $R = \eta^{-1}\|C\|_\infty + \log(n) - 2\log(\min_{1 \le i,j \le n}\{r_i, c_j\})$ as in Lemma 3.2. Therefore, the total iteration complexity in Step 3 of Algorithm 5 is $O(\sqrt{n\log(n)}\|C\|_\infty\varepsilon^{-1})$. Each iteration of APDAGD requires $O(n^2)$ arithmetic operations; therefore, the total number of arithmetic operations is $O(n^{5/2}\|C\|_\infty\sqrt{\log(n)}\varepsilon^{-1})$.
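The $\Omega(\sqrt{n})$ lower bound of Proposition 4.7 is easy to confirm numerically from the closed-form dual solution derived in its proof. The script below (our illustrative check, not part of the paper) evaluates $\beta^\star$ for several values of $n$, verifies the stationarity condition (29) in log-space, and checks that $\|(\alpha^\star, \beta^\star)\| \ge \frac{1}{2}\sqrt{n/2}$.

```python
import numpy as np

def dual_counterexample(n: int, eps: float):
    """Closed-form optimal dual pair for C = 1 1^T, r = c = (1/n) 1_n,
    eta = eps / (4 log n), normalized so that alpha* = 0 (see Prop. 4.7)."""
    eta = eps / (4.0 * np.log(n))
    alpha = np.zeros(n)
    # beta_i = 1 + eta * (1 - 2 log n) = 1 + eps/(4 log n) - eps/2
    beta = np.full(n, 1.0 + eta * (1.0 - 2.0 * np.log(n)))
    return alpha, beta, eta

for n in (10, 100, 1000):
    eps = 0.5
    alpha, beta, eta = dual_counterexample(n, eps)
    # stationarity (29) in log-space: beta_i/eta + log n - 1/eta = 1 - log n
    assert np.isclose(beta[0] / eta + np.log(n) - 1.0 / eta, 1.0 - np.log(n))
    # alpha_i + beta_i >= 1/2, hence ||(alpha*, beta*)|| >= (1/2) sqrt(n/2)
    norm = np.sqrt(alpha @ alpha + beta @ beta)
    assert norm >= 0.5 * np.sqrt(n / 2.0)
```

Working in log-space avoids the overflow that evaluating $e^{4\log(n)\beta_i^\star/\varepsilon}$ directly would cause for large $n$.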
Note that $\tilde r$ and $\tilde c$ in Step 1 of Algorithm 5 can be found in $O(n)$ arithmetic operations, and Altschuler et al. [2017, Algorithm 2] requires $O(n^2)$ arithmetic operations. Therefore, we conclude that the total number of arithmetic operations is $O(n^{5/2}\|C\|_\infty\sqrt{\log(n)}\varepsilon^{-1})$. □

Remark 4.9. As indicated in Proposition 4.8, the corrected complexity bound of APDAGD for entropic regularized OT is similar to that of APDAMD when we choose $\phi(\cdot) = \frac{1}{2n}\|\cdot\|^2$ and have $\delta = n$. From this perspective, our algorithm can be viewed as a generalization of APDAGD. Since our algorithm utilizes the $\ell_\infty$-norm in the line-search criterion, it is more robust than APDAGD in practice; see Section 6 for details.

5 Accelerating Sinkhorn

In this section, we present an accelerated variant of Sinkhorn for solving the entropic regularized OT problem in Eq. (2). Combined with a rounding scheme, our algorithm can be used for solving the OT problem in Eq. (1) and achieves a complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$, which improves that of Sinkhorn in terms of $1/\varepsilon$, and those of APDAGD and AAM [Guminov et al., 2021] in terms of $n$. The idea comes from a novel combination of Nesterov's estimate sequences and the techniques for analyzing Sinkhorn.

5.1 Algorithmic procedure

We present the pseudocode of accelerated Sinkhorn in Algorithm 6. This algorithm achieves acceleration by using Nesterov's estimate sequences [Nesterov, 2018].

Algorithm 6: $\textsc{AcceleratedSinkhorn}(C, \eta, r, c, \varepsilon')$
  Input: $t = 0$, $\theta_0 = 1$ and $\check u^0 = \tilde u^0 = \check v^0 = \tilde v^0 = \mathbf{0}_n$.
  while $E_t > \varepsilon'$ do
    Compute $\binom{\bar u^t}{\bar v^t} = (1 - \theta_t)\binom{\check u^t}{\check v^t} + \theta_t\binom{\tilde u^t}{\tilde v^t}$.
    Compute $\tilde u^{t+1}$ and $\tilde v^{t+1}$ by
    $$\tilde u^{t+1} = \tilde u^t - \frac{1}{2\theta_t}\left(\frac{r(B(\bar u^t, \bar v^t))}{\|B(\bar u^t, \bar v^t)\|_1} - r\right), \qquad \tilde v^{t+1} = \tilde v^t - \frac{1}{2\theta_t}\left(\frac{c(B(\bar u^t, \bar v^t))}{\|B(\bar u^t, \bar v^t)\|_1} - c\right).$$
    Compute $\binom{\grave u^t}{\grave v^t} = \binom{\bar u^t}{\bar v^t} + \theta_t\left(\binom{\tilde u^{t+1}}{\tilde v^{t+1}} - \binom{\tilde u^t}{\tilde v^t}\right)$.
    if $t$ is even then
      $\hat u^t = \grave u^t + \log(r) - \log(r(B(\grave u^t, \grave v^t)))$ and $\hat v^t = \grave v^t$.
    else
      $\hat u^t = \grave u^t$ and $\hat v^t = \grave v^t + \log(c) - \log(c(B(\grave u^t, \grave v^t)))$.
    end if
    Compute $\binom{u^t}{v^t} = \operatorname{argmin}\left\{\varphi(u, v) \;\middle|\; \binom{u}{v} \in \left\{\binom{\check u^t}{\check v^t}, \binom{\hat u^t}{\hat v^t}\right\}\right\}$.
    if $t$ is even then
      $\check u^{t+1} = u^t + \log(r) - \log(r(B(u^t, v^t)))$ and $\check v^{t+1} = v^t$.
    else
      $\check u^{t+1} = u^t$ and $\check v^{t+1} = v^t + \log(c) - \log(c(B(u^t, v^t)))$.
    end if
    Compute $\theta_{t+1} = \frac{\theta_t(\sqrt{\theta_t^2 + 4} - \theta_t)}{2}$. Set $t = t + 1$.
  end while
  Output: $B(u^t, v^t)$.

While our algorithm can be interpreted as an accelerated block coordinate descent algorithm, it is worth mentioning that it is purely deterministic and thus differs from other accelerated randomized algorithms [Lin et al., 2015, Fercoq and Richtárik, 2015, Lu et al., 2018] in the optimization literature. Algorithm 6 is a novel combination of Nesterov's estimate sequences, a monotone search step, the greedy choice of coordinates, and two coordinate updates. It is applied to solve the dual entropic regularized OT problem in Eq. (5):
$$\min_{u, v} \varphi(u, v) := \log(\|B(u, v)\|_1) - u^\top r - v^\top c.$$
More specifically, Nesterov's estimate sequences are responsible for optimizing the dual objective function $\varphi$ at a fast rate. The coordinate update guarantees that $\varphi(\hat u^t, \hat v^t) \le \varphi(\grave u^t, \grave v^t)$ and $\|B(\hat u^t, \hat v^t)\|_1 = 1$. The monotone search step guarantees that $\varphi(u^t, v^t) \le \varphi(\hat u^t, \hat v^t)$. The greedy coordinate update guarantees that $\varphi(\check u^{t+1}, \check v^{t+1}) \le \varphi(u^t, v^t)$ with sufficient progress. Furthermore, we use the same quantity as in Greenkhorn to measure the per-iteration residue of Algorithm 6:
$$E_t = \|r(B(u^t, v^t)) - r\|_1 + \|c(B(u^t, v^t)) - c\|_1. \tag{31}$$
(31)

The computationally expensive step is to compute $r(B(\bar{u}_t, \bar{v}_t))/\|B(\bar{u}_t, \bar{v}_t)\|_1$ and $c(B(\bar{u}_t, \bar{v}_t))/\|B(\bar{u}_t, \bar{v}_t)\|_1$. Since $B(\bar{u}_t, \bar{v}_t)$ does not have any special property, it is difficult to design an implementation trick to

Algorithm 7: Approximating OT by Algorithm 6
Input: $\eta = \frac{\varepsilon}{4\log(n)}$ and $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$.
Step 1: Let $\tilde{r} \in \Delta_n$ and $\tilde{c} \in \Delta_n$ be defined by $(\tilde{r}, \tilde{c}) = (1 - \frac{\varepsilon'}{8})(r, c) + \frac{\varepsilon'}{8n}(\mathbf{1}_n, \mathbf{1}_n)$.
Step 2: Compute $\tilde{X} = \text{Accelerated Sinkhorn}(C, \eta, \tilde{r}, \tilde{c}, \frac{\varepsilon'}{2})$.
Step 3: Round $\tilde{X}$ to $\hat{X}$ using Altschuler et al. [2017, Algorithm 2] such that $\hat{X}\mathbf{1}_n = r$ and $\hat{X}^\top\mathbf{1}_n = c$.
Output: $\hat{X}$.

reduce the order of $n$. As such, the arithmetic cost of each iteration is $O(n^2)$, exactly the same as for Sinkhorn [Cuturi, 2013], APDAGD [Dvurechensky et al., 2018] and AAM [Guminov et al., 2021]. Combining Algorithm 6 with Altschuler et al. [2017, Algorithm 2], we are ready to present the pseudocode of our main algorithm in Algorithm 7. The regularization parameter $\eta$ is set as before, and Step 1 is necessary since accelerated Sinkhorn is not well behaved if the marginal distributions have sparse support.

5.2 Technical lemmas

We first present two technical lemmas which are essential in the analysis of Algorithm 6. The first lemma provides an inductive relationship on the quantity

$$\delta_t = \varphi(\check{u}_t, \check{v}_t) - \varphi(u^\star, v^\star), \qquad (32)$$

where $(u^\star, v^\star)$ is an optimal solution of the dual entropic regularized OT problem in Eq. (5) that satisfies Lemma 2.3. To facilitate the discussion, we recall Eq. (10) with $\|A\|_{1\to2} = \sqrt{2}$ as follows:

$$\varphi(u', v') - \varphi(u, v) - \begin{pmatrix} u' - u \\ v' - v \end{pmatrix}^\top \nabla\varphi(u, v) \le \left\|\begin{pmatrix} u' - u \\ v' - v \end{pmatrix}\right\|^2, \qquad (33)$$

which will be used in the proof of the first lemma.

Lemma 5.1.
Let $\{(\check{u}_t, \check{v}_t)\}_{t\ge0}$ be the iterates generated by Algorithm 6 and $(u^\star, v^\star)$ be an optimal solution of the dual entropic regularized OT problem. Then, we have

$$\delta_{t+1} \le (1-\theta_t)\delta_t + \theta_t^2\left(\left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} u^\star - \tilde{u}_{t+1} \\ v^\star - \tilde{v}_{t+1} \end{pmatrix}\right\|^2\right).$$

Proof. Using Eq. (33) with $(u', v') = (\grave{u}_t, \grave{v}_t)$ and $(u, v) = (\bar{u}_t, \bar{v}_t)$, we have

$$\varphi(\grave{u}_t, \grave{v}_t) \le \varphi(\bar{u}_t, \bar{v}_t) + \theta_t\begin{pmatrix} \tilde{u}_{t+1} - \tilde{u}_t \\ \tilde{v}_{t+1} - \tilde{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) + \theta_t^2\left\|\begin{pmatrix} \tilde{u}_{t+1} - \tilde{u}_t \\ \tilde{v}_{t+1} - \tilde{v}_t \end{pmatrix}\right\|^2.$$

After simple calculations, we find that

$$\varphi(\bar{u}_t, \bar{v}_t) = (1-\theta_t)\varphi(\bar{u}_t, \bar{v}_t) + \theta_t\varphi(\bar{u}_t, \bar{v}_t),$$

$$\begin{pmatrix} \tilde{u}_{t+1} - \tilde{u}_t \\ \tilde{v}_{t+1} - \tilde{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) = -\begin{pmatrix} \tilde{u}_t - \bar{u}_t \\ \tilde{v}_t - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) + \begin{pmatrix} \tilde{u}_{t+1} - \bar{u}_t \\ \tilde{v}_{t+1} - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t).$$

Putting these pieces together yields that

$$\varphi(\grave{u}_t, \grave{v}_t) \le \underbrace{\theta_t\left(\varphi(\bar{u}_t, \bar{v}_t) + \begin{pmatrix} \tilde{u}_{t+1} - \bar{u}_t \\ \tilde{v}_{t+1} - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) + \theta_t\left\|\begin{pmatrix} \tilde{u}_{t+1} - \tilde{u}_t \\ \tilde{v}_{t+1} - \tilde{v}_t \end{pmatrix}\right\|^2\right)}_{\mathrm{I}} \qquad (34)$$
$$+ \underbrace{(1-\theta_t)\varphi(\bar{u}_t, \bar{v}_t) - \theta_t\begin{pmatrix} \tilde{u}_t - \bar{u}_t \\ \tilde{v}_t - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t)}_{\mathrm{II}}.$$

We first bound the term I. Indeed, by the update formula for $(\tilde{u}_{t+1}, \tilde{v}_{t+1})$ and the definition of $\nabla\varphi$, we have

$$\begin{pmatrix} u - \tilde{u}_{t+1} \\ v - \tilde{v}_{t+1} \end{pmatrix}^\top \left(\nabla\varphi(\bar{u}_t, \bar{v}_t) + 2\theta_t\begin{pmatrix} \tilde{u}_{t+1} - \tilde{u}_t \\ \tilde{v}_{t+1} - \tilde{v}_t \end{pmatrix}\right) = 0 \quad \text{for all } (u, v) \in \mathbb{R}^{2n}.$$

Letting $(u, v) = (u^\star, v^\star)$ and rearranging the resulting equation yields that

$$\begin{pmatrix} \tilde{u}_{t+1} - \bar{u}_t \\ \tilde{v}_{t+1} - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) = \begin{pmatrix} u^\star - \bar{u}_t \\ v^\star - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) + \theta_t\left(\left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} u^\star - \tilde{u}_{t+1} \\ v^\star - \tilde{v}_{t+1} \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} \tilde{u}_{t+1} - \tilde{u}_t \\ \tilde{v}_{t+1} - \tilde{v}_t \end{pmatrix}\right\|^2\right).$$

Using the convexity of $\varphi$, we have

$$\begin{pmatrix} u^\star - \bar{u}_t \\ v^\star - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t) \le \varphi(u^\star, v^\star) - \varphi(\bar{u}_t, \bar{v}_t).$$

Putting these pieces together yields that

$$\mathrm{I} \le \theta_t\left(\varphi(u^\star, v^\star) + \theta_t\left(\left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} u^\star - \tilde{u}_{t+1} \\ v^\star - \tilde{v}_{t+1} \end{pmatrix}\right\|^2\right)\right).$$
(35)

We then bound the term II. Indeed, we see from the definition of $(\bar{u}_t, \bar{v}_t)$ that

$$-\theta_t\begin{pmatrix} \tilde{u}_t - \bar{u}_t \\ \tilde{v}_t - \bar{v}_t \end{pmatrix} = \theta_t\begin{pmatrix} \bar{u}_t \\ \bar{v}_t \end{pmatrix} + (1-\theta_t)\begin{pmatrix} \check{u}_t \\ \check{v}_t \end{pmatrix} - \begin{pmatrix} \bar{u}_t \\ \bar{v}_t \end{pmatrix} = (1-\theta_t)\begin{pmatrix} \check{u}_t - \bar{u}_t \\ \check{v}_t - \bar{v}_t \end{pmatrix}.$$

Combining the above equation with the convexity of $\varphi$, we have

$$\mathrm{II} = (1-\theta_t)\left(\varphi(\bar{u}_t, \bar{v}_t) + \begin{pmatrix} \check{u}_t - \bar{u}_t \\ \check{v}_t - \bar{v}_t \end{pmatrix}^\top \nabla\varphi(\bar{u}_t, \bar{v}_t)\right) \le (1-\theta_t)\varphi(\check{u}_t, \check{v}_t). \qquad (36)$$

Plugging Eq. (35) and Eq. (36) into Eq. (34) yields that

$$\varphi(\grave{u}_t, \grave{v}_t) \le (1-\theta_t)\varphi(\check{u}_t, \check{v}_t) + \theta_t\varphi(u^\star, v^\star) + \theta_t^2\left(\left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} u^\star - \tilde{u}_{t+1} \\ v^\star - \tilde{v}_{t+1} \end{pmatrix}\right\|^2\right).$$

Since $(\check{u}_{t+1}, \check{v}_{t+1})$ is obtained by a coordinate update from $(u_t, v_t)$, we have $\varphi(u_t, v_t) \ge \varphi(\check{u}_{t+1}, \check{v}_{t+1})$. By the definition of $(u_t, v_t)$, we have $\varphi(\hat{u}_t, \hat{v}_t) \ge \varphi(u_t, v_t)$. Since $(\hat{u}_t, \hat{v}_t)$ is obtained by a coordinate update from $(\grave{u}_t, \grave{v}_t)$, we have $\varphi(\grave{u}_t, \grave{v}_t) \ge \varphi(\hat{u}_t, \hat{v}_t)$. Collecting all of these results leads to

$$\varphi(\check{u}_{t+1}, \check{v}_{t+1}) - \varphi(u^\star, v^\star) \le (1-\theta_t)\big(\varphi(\check{u}_t, \check{v}_t) - \varphi(u^\star, v^\star)\big) + \theta_t^2\left(\left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} u^\star - \tilde{u}_{t+1} \\ v^\star - \tilde{v}_{t+1} \end{pmatrix}\right\|^2\right).$$

This completes the proof.

The second lemma provides an upper bound for $\delta_t$ defined by Eq. (32), where $\{(\check{u}_t, \check{v}_t)\}_{t\ge0}$ are generated by Algorithm 6 and $(u^\star, v^\star)$ is an optimal solution defined by Corollary 2.3.

Lemma 5.2. Let $\{(\check{u}_t, \check{v}_t)\}_{t\ge0}$ be the iterates generated by Algorithm 6 and $(u^\star, v^\star)$ be an optimal solution of the dual entropic regularized OT problem satisfying $\|(u^\star, v^\star)\| \le \sqrt{2n}R$, where $R$ is defined in Corollary 2.3. Then, we have

$$\delta_t \le \frac{8nR^2}{(t+1)^2}.$$

Proof. By simple calculations, we derive from the definition of $\theta_t$ that $\frac{\theta_{t+1}}{\theta_t} = \sqrt{1 - \theta_{t+1}}$.
Therefore, we conclude from Lemma 5.1 that

$$\frac{1-\theta_{t+1}}{\theta_{t+1}^2}\delta_{t+1} - \frac{1-\theta_t}{\theta_t^2}\delta_t \le \left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 - \left\|\begin{pmatrix} u^\star - \tilde{u}_{t+1} \\ v^\star - \tilde{v}_{t+1} \end{pmatrix}\right\|^2.$$

Equivalently, by telescoping, we have

$$\frac{1-\theta_t}{\theta_t^2}\delta_t + \left\|\begin{pmatrix} u^\star - \tilde{u}_t \\ v^\star - \tilde{v}_t \end{pmatrix}\right\|^2 \le \frac{1-\theta_0}{\theta_0^2}\delta_0 + \left\|\begin{pmatrix} u^\star - \tilde{u}_0 \\ v^\star - \tilde{v}_0 \end{pmatrix}\right\|^2.$$

Since $\theta_0 = 1$ and $\tilde{u}_0 = \tilde{v}_0 = \mathbf{0}_n$, we have

$$\delta_t \le \theta_{t-1}^2\|(u^\star, v^\star)\|^2 \le 2nR^2\theta_{t-1}^2.$$

The remaining step is to show that $0 < \theta_t \le \frac{2}{t+2}$. Indeed, the claim holds when $t = 0$ as we have $\theta_0 = 1$. Assuming the claim holds for $t = t_0$, i.e., $\theta_{t_0} \le \frac{2}{t_0+2}$, we have

$$\theta_{t_0+1} = \frac{2}{1 + \sqrt{1 + 4/\theta_{t_0}^2}} \le \frac{2}{t_0 + 3}.$$

Putting these pieces together yields the desired inequality for $\delta_t$.

5.3 Main results

We present an upper bound on the number of iterations required by Algorithm 6. Note that the per-iteration progress of Algorithm 6 is measured by the function $\rho: \mathbb{R}_+^n \times \mathbb{R}_+^n \to \mathbb{R}_+$ given by

$$\rho(a, b) := \mathbf{1}_n^\top(b - a) + \sum_{i=1}^n a_i\log\left(\frac{a_i}{b_i}\right).$$

Theorem 5.3. Let $\{(u_t, v_t)\}_{t\ge0}$ be the iterates generated by Algorithm 6. The number of iterations required to reach the stopping criterion $E_t \le \varepsilon'$ satisfies

$$t \le 1 + \left(\frac{16\sqrt{n}R}{\varepsilon'}\right)^{2/3},$$

where $R > 0$ is defined in Lemma 2.2.

Proof. We first claim that

$$\varphi(u_t, v_t) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) \ge \frac{1}{2}\left(\|r(B(u_t, v_t)) - r\|_1^2 + \|c(B(u_t, v_t)) - c\|_1^2\right). \qquad (37)$$

By the definition of $\varphi$, we have

$$\varphi(u_t, v_t) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) = \log(\|B(u_t, v_t)\|_1) - \log(\|B(\check{u}_{t+1}, \check{v}_{t+1})\|_1) - (u_t - \check{u}_{t+1})^\top r - (v_t - \check{v}_{t+1})^\top c. \qquad (38)$$

From the update formulas for $(\hat{u}_t, \hat{v}_t)$ and $(\check{u}_{t+1}, \check{v}_{t+1})$, it is clear that $\|B(\hat{u}_t, \hat{v}_t)\|_1 = 1$ and $\|B(\check{u}_{t+1}, \check{v}_{t+1})\|_1 = 1$ for all $t \ge 0$. Then, we derive from the update formula for $(u_t, v_t)$ that $\|B(u_t, v_t)\|_1 = 1$ for all $t \ge 1$.
Therefore, we have

$$\varphi(u_t, v_t) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) = -(u_t - \check{u}_{t+1})^\top r - (v_t - \check{v}_{t+1})^\top c = \big(\log(r) - \log(r(B(u_t, v_t)))\big)^\top r + \big(\log(c) - \log(c(B(u_t, v_t)))\big)^\top c.$$

Since $\mathbf{1}_n^\top r = \mathbf{1}_n^\top r(B(u_t, v_t)) = \mathbf{1}_n^\top c = \mathbf{1}_n^\top c(B(u_t, v_t)) = 1$, we have

$$\varphi(u_t, v_t) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) = \rho(r, r(B(u_t, v_t))) + \rho(c, c(B(u_t, v_t))).$$

Using Altschuler et al. [2017, Lemma 4], we derive Eq. (37) as desired. By the definition of $(u_t, v_t)$, we have $\varphi(\check{u}_t, \check{v}_t) \ge \varphi(u_t, v_t)$. Plugging this inequality into Eq. (37) together with the Cauchy–Schwarz inequality yields

$$\varphi(\check{u}_t, \check{v}_t) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) \ge \frac{1}{4}E_t^2.$$

Therefore, we conclude that

$$\varphi(\check{u}_j, \check{v}_j) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) \ge \frac{1}{4}\sum_{i=j}^t E_i^2 \quad \text{for any } j \in \{1, 2, \ldots, t\}.$$

Since $\varphi(\check{u}_{t+1}, \check{v}_{t+1}) \ge \varphi(u^\star, v^\star)$ for all $t \ge 1$, we have $\varphi(\check{u}_j, \check{v}_j) - \varphi(\check{u}_{t+1}, \check{v}_{t+1}) \le \delta_j$. Then, it follows from Lemma 5.2 that

$$\sum_{i=j}^t E_i^2 \le \frac{32nR^2}{(j+1)^2}.$$

Putting these pieces together with the fact that $E_i \ge \varepsilon'$ as long as the stopping criterion is not fulfilled yields

$$\frac{32nR^2}{(j+1)^2(t-j+1)} \ge (\varepsilon')^2.$$

Since this inequality holds true for all $j \in \{1, 2, \ldots, t\}$, we assume without loss of generality that $t$ is even and let $j = t/2$. Then, we obtain that

$$t \le 1 + \left(\frac{16\sqrt{n}R}{\varepsilon'}\right)^{2/3}.$$

This completes the proof of the theorem.

We are ready to present the complexity bound of Algorithm 7 for solving the OT problem in Eq. (1). Note that $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$ is defined using the desired accuracy $\varepsilon > 0$.

Theorem 5.4. The accelerated Sinkhorn scheme for approximating optimal transport (Algorithm 7) returns an $\varepsilon$-approximate transportation plan (cf. Definition 1) in

$$O\left(\frac{n^{7/3}\|C\|_\infty^{4/3}(\log(n))^{1/3}}{\varepsilon^{4/3}}\right)$$

arithmetic operations.

Proof.
Applying the same argument as in Theorem 3.8, we obtain that $\langle C, \hat{X}\rangle - \langle C, X^\star\rangle \le \varepsilon$, where $\tilde{X} = \text{Accelerated Sinkhorn}(C, \eta, \tilde{r}, \tilde{c}, \frac{\varepsilon'}{2})$ in Step 2 of Algorithm 7. It remains to bound the number of iterations required by Algorithm 6 to reach the stopping criterion $E_t \le \frac{\varepsilon'}{2}$. Using Theorem 5.3, we have

$$t \le 1 + \left(\frac{32\sqrt{n}R}{\varepsilon'}\right)^{2/3}.$$

By the definition of $R$ (cf. Lemma 2.2), $\eta = \frac{\varepsilon}{4\log(n)}$ and $\varepsilon' = \frac{\varepsilon}{8\|C\|_\infty}$, we have

$$t \le 1 + \left(\frac{32\sqrt{n}R}{\varepsilon'}\right)^{2/3} \le 1 + \left(\frac{256\sqrt{n}\|C\|_\infty}{\varepsilon}\left(\frac{\|C\|_\infty}{\eta} + \log(n) - \log\Big(\min_{1\le i,j\le n}\{r_i, c_j\}\Big)\right)\right)^{2/3}$$
$$\le 1 + \left(\frac{256\sqrt{n}\|C\|_\infty}{\varepsilon}\left(\frac{4\log(n)\|C\|_\infty}{\varepsilon} + \log(n) - \log\Big(\frac{\varepsilon}{64n\|C\|_\infty}\Big)\right)\right)^{2/3} = O\left(\frac{n^{1/3}\|C\|_\infty^{4/3}(\log(n))^{1/3}}{\varepsilon^{4/3}}\right).$$

Since each iteration of Algorithm 6 requires $O(n^2)$ arithmetic operations, the total number of arithmetic operations required by Step 2 of Algorithm 7 is $O(n^{7/3}\|C\|_\infty^{4/3}(\log(n))^{1/3}\varepsilon^{-4/3})$. Computing the two vectors $\tilde{r}$ and $\tilde{c}$ in Step 1 of Algorithm 7 requires $O(n)$ arithmetic operations, and Altschuler et al. [2017, Algorithm 2] requires $O(n^2)$ arithmetic operations. Therefore, the complexity bound of Algorithm 7 is $O(n^{7/3}\|C\|_\infty^{4/3}(\log(n))^{1/3}\varepsilon^{-4/3})$.

Remark 5.5. Theorem 5.4 shows that the complexity bound of accelerated Sinkhorn is better than that of Sinkhorn and Greenkhorn in terms of $1/\varepsilon$ but appears not to be near-linear in $n^2$. Thus, our algorithm is recommended when $n \ll 1/\varepsilon$. This occurs when the desired solution accuracy is relatively small, say $10^{-4}$; examples include application problems from economics and operations research. In contrast, Sinkhorn and Greenkhorn are recommended when $n \gg 1/\varepsilon$. This occurs when the desired solution accuracy is relatively large, say $10^{-2}$; examples include application problems from image processing.
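To make the scheme concrete before turning to experiments, the following is a simplified Python sketch of the main ingredients of Algorithms 6 and 7 under our reading of the pseudocode: the matrix $B(u, v) = \mathrm{diag}(e^u)\,e^{-C/\eta}\,\mathrm{diag}(e^v)$, the extrapolation and gradient-type update of the $(\tilde{u}, \tilde{v})$ sequence, the monotone search, the even/odd row/column updates, the recursion for $\theta_t$, and the rounding step of Altschuler et al. [2017, Algorithm 2]. This is a sketch, not the authors' implementation; all function names are ours.

```python
import numpy as np

def accelerated_sinkhorn_sketch(C, eta, r, c, eps, max_iter=5000):
    """Simplified sketch of Algorithm 6. The greedy coordinate choice is
    rendered as the even/odd row/column alternation of the pseudocode."""
    n = C.shape[0]
    K = np.exp(-C / eta)

    def Bmat(u, v):  # B(u, v) = diag(e^u) K diag(e^v)
        return np.exp(u)[:, None] * K * np.exp(v)[None, :]

    def phi(u, v):  # dual objective of Eq. (5)
        return np.log(Bmat(u, v).sum()) - u @ r - v @ c

    u_chk, v_chk = np.zeros(n), np.zeros(n)
    u_til, v_til = np.zeros(n), np.zeros(n)
    theta, X, E = 1.0, Bmat(np.zeros(n), np.zeros(n)), np.inf
    for t in range(max_iter):
        # Extrapolation: (bar) = (1 - theta)(check) + theta(tilde).
        u_bar = (1 - theta) * u_chk + theta * u_til
        v_bar = (1 - theta) * v_chk + theta * v_til
        Bb = Bmat(u_bar, v_bar); m = Bb.sum()
        # Gradient-type update of the tilde (estimate) sequence.
        u_new = u_til - (Bb.sum(axis=1) / m - r) / (2 * theta)
        v_new = v_til - (Bb.sum(axis=0) / m - c) / (2 * theta)
        u_grv = u_bar + theta * (u_new - u_til)
        v_grv = v_bar + theta * (v_new - v_til)
        u_til, v_til = u_new, v_new
        # Coordinate (row or column) update of the grave point.
        Xg = Bmat(u_grv, v_grv)
        if t % 2 == 0:
            u_hat, v_hat = u_grv + np.log(r) - np.log(Xg.sum(axis=1)), v_grv
        else:
            u_hat, v_hat = u_grv, v_grv + np.log(c) - np.log(Xg.sum(axis=0))
        # Monotone search: keep the better of (check) and (hat).
        if phi(u_hat, v_hat) <= phi(u_chk, v_chk):
            u_t, v_t = u_hat, v_hat
        else:
            u_t, v_t = u_chk, v_chk
        X = Bmat(u_t, v_t)
        E = np.abs(X.sum(axis=1) - r).sum() + np.abs(X.sum(axis=0) - c).sum()
        if E <= eps:
            break
        # Coordinate update defining the next (check) iterate.
        if t % 2 == 0:
            u_chk, v_chk = u_t + np.log(r) - np.log(X.sum(axis=1)), v_t
        else:
            u_chk, v_chk = u_t, v_t + np.log(c) - np.log(X.sum(axis=0))
        # theta_{t+1} = theta_t (sqrt(theta_t^2 + 4) - theta_t) / 2.
        theta = theta * (np.sqrt(theta**2 + 4) - theta) / 2
    return X, E

def round_to_polytope(X, r, c):
    """Rounding of Altschuler et al. [2017, Algorithm 2]: scale rows and
    columns down to feasibility, then fix the deficit by a rank-one term."""
    X = np.minimum(r / X.sum(axis=1), 1.0)[:, None] * X
    X = X * np.minimum(c / X.sum(axis=0), 1.0)[None, :]
    err_r, err_c = r - X.sum(axis=1), c - X.sum(axis=0)
    if err_r.sum() > 0:
        X = X + np.outer(err_r, err_c) / err_r.sum()
    return X
```

The rounding step guarantees exact marginals, so the returned plan is feasible for the OT problem in Eq. (1) even though Algorithm 6 only drives $E_t$ below a tolerance.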
6 Experiments

In this section, we conduct experiments to evaluate Greenkhorn, accelerated Sinkhorn and APDAMD on synthetic data and real images from the MNIST Digits dataset¹. We use

¹ http://yann.lecun.com/exdb/mnist/

[Figure 1 here. Panels (by row: Sinkhorn vs. Greenkhorn, APDAGD vs. APDAMD, Sinkhorn vs. accelerated Sinkhorn): distance to the transportation polytope $\|r(P) - r\|_1 + \|c(P) - c\|_1$ against row/column updates; spread (max/median/min) of the log competitive ratio; and value of OT against the true optimum with $1/\eta \in \{1, 5, 9\}$.]
Figure 1: Comparative performance of Sinkhorn vs. Greenkhorn, APDAGD vs. APDAMD and Sinkhorn vs. accelerated Sinkhorn on synthetic images.

Sinkhorn [Cuturi, 2013], APDAGD [Dvurechensky et al., 2018] and GCPB² [Genevay et al., 2016] as the baseline approaches. Since the focus of this paper is the entropic regularized algorithms, we exclude the combinatorial algorithms from our experiments and refer to Dong et al. [2020] for an excellent comparative study. In the literature, Greenkhorn and APDAGD were shown to outperform the Sinkhorn algorithm in terms of row/column updates [Altschuler et al., 2017, Dvurechensky et al., 2018], and we repeat the comparisons for the sake of completeness. For parameter tuning in the implementation of Greenkhorn, accelerated Sinkhorn and APDAMD, we follow most of the setups shown in Algorithms 1, 3 and 6 but employ a more aggressive choice of stepsize for the coordinate gradient updates in Algorithm 6. To obtain an optimal value of the OT problem, we employ the default LP solver in MATLAB.

² GCPB is simply an application of stochastic averaged gradient [Schmidt et al., 2017] for solving the dual entropic regularized OT problem.

6.1 Synthetic images

To generate the synthetic images, we adopt the process from Altschuler et al. [2017] and evaluate the performance of different algorithms on these synthetic images. The transportation distance is defined between two synthetic images, while the cost matrix is defined based on the $\ell_1$ distances among locations of pixels in the images. Each image is of size 20 by 20 pixels and is generated by randomly placing a foreground square in a black background. Furthermore, a uniform distribution on $[0, 1]$ is used for the intensities of the pixels in the background, while a uniform distribution on $[0, 50]$ is employed for the pixels in the foreground.
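The generation process just described can be sketched in a few lines of Python (a sketch under our reading of the setup; the function name and the way the square is placed are our assumptions):

```python
import numpy as np

def synthetic_image(size=20, fg_fraction=0.1, rng=None):
    """Generate one synthetic image as a probability measure over pixels:
    uniform [0, 1] background intensities with a randomly placed foreground
    square, covering roughly `fg_fraction` of the image, whose intensities
    are uniform on [0, 50]."""
    rng = np.random.default_rng() if rng is None else rng
    img = rng.uniform(0.0, 1.0, size=(size, size))
    side = max(1, int(round(np.sqrt(fg_fraction) * size)))  # square side
    i = rng.integers(0, size - side + 1)
    j = rng.integers(0, size - side + 1)
    img[i:i + side, j:j + side] = rng.uniform(0.0, 50.0, size=(side, side))
    # Normalize into a probability measure over the pixel grid.
    return (img / img.sum()).ravel()
```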
We fix the proportion of the size of the foreground square at 10% of the whole image and implement all candidate algorithms. We use two standard metrics to assess the performance of the candidate algorithms. The first metric $d(\mathcal{A})$ is the $\ell_1$ distance between the row/column marginals output by an algorithm $\mathcal{A}$ and the corresponding transportation polytope of the probability measures, given by $d(\mathcal{A}) := \|r(\mathcal{A}) - r\|_1 + \|c(\mathcal{A}) - c\|_1$, where $r(\mathcal{A})$ and $c(\mathcal{A})$ are the row and column marginals obtained from the output of algorithm $\mathcal{A}$, and $r$ and $c$ are the row and column marginal vectors of the original probability measures. The second metric is the competitive ratio $\log(d(\mathcal{A}_1)/d(\mathcal{A}_2))$, where $d(\mathcal{A}_1)$ and $d(\mathcal{A}_2)$ are the distances between the row/column outputs of algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$ and the transportation polytope. We perform three pairwise comparative experiments on 10 randomly generated datasets: Sinkhorn vs. Greenkhorn, APDAGD vs. APDAMD and Sinkhorn vs. accelerated Sinkhorn. To further evaluate these algorithms, we compare their performance with respect to different choices of the regularization parameter $\eta \in \{1, \frac{1}{5}, \frac{1}{9}\}$, using the value of the OT problem as the baseline. The maximum number of iterations is $T = 5$.

Figure 1 summarizes the experimental results. The images in the first row show the comparative performance of Sinkhorn and Greenkhorn in terms of row/column updates. In the leftmost image, the comparison uses the distance to the transportation polytope $d(\mathcal{A})$, where $\mathcal{A}$ is either Sinkhorn or Greenkhorn. In the middle image, the maximum, median and minimum values of the competitive ratios $\log(d(\mathcal{A}_1)/d(\mathcal{A}_2))$ on 10 images are utilized for the comparison, where $\mathcal{A}_1$ is Sinkhorn and $\mathcal{A}_2$ is Greenkhorn.
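For reference, the two evaluation metrics defined above amount to a few lines of code (a sketch; function names ours):

```python
import numpy as np

def distance_to_polytope(X, r, c):
    """d(A): l1 distance between the row/column marginals of an output
    plan X and the target marginals (r, c)."""
    return np.abs(X.sum(axis=1) - r).sum() + np.abs(X.sum(axis=0) - c).sum()

def competitive_ratio(X1, X2, r, c):
    """log(d(A1) / d(A2)); positive values favor algorithm A2."""
    return np.log(distance_to_polytope(X1, r, c)
                  / distance_to_polytope(X2, r, c))
```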
In the rightmost image, we vary the regularization parameter $\eta \in \{1, \frac{1}{5}, \frac{1}{9}\}$ for these algorithms, using the value of the unregularized OT problem as the baseline. The other rows of images present comparative results for APDAGD vs. APDAMD and Sinkhorn vs. accelerated Sinkhorn. We find that (i) Greenkhorn outperforms Sinkhorn in terms of row/column updates, illustrating the improvement from greedy coordinate descent; (ii) APDAMD with $\delta = n$ and $\varphi = \frac{1}{2n}\|\cdot\|^2$ is more robust than APDAGD, illustrating the advantage of using mirror descent and line search with $\|\cdot\|_\infty$; (iii) accelerated Sinkhorn outperforms Sinkhorn in terms of row/column updates, illustrating the improvement from the estimate sequence and monotone search.

6.2 MNIST images

We proceed to the comparison between different algorithms on real images, using essentially the same evaluation metrics as for the synthetic images. The MNIST dataset consists of 60,000 images of handwritten digits of size 28 by 28 pixels. To ensure that the masses of the probability measures are dense, which leads to a tight dependence on $n$ for our algorithms, we add a very small noise term ($10^{-6}$) to all zero elements in the measures and then normalize them so that their sum is 1. The maximum number of iterations is $T = 5$. Figures 2 and 3 summarize the experimental results on MNIST. In the first row of Figure 2, we compare Sinkhorn and Greenkhorn in terms of row/column updates.
The leftmost image specifies the distances $d(\mathcal{A})$ to the transportation polytope for the algorithm $\mathcal{A}$, which is either Sinkhorn or Greenkhorn; the middle image specifies the maximum, median and minimum of

[Figure 2 here. Panels (by row: Sinkhorn vs. Greenkhorn, APDAGD vs. APDAMD, Sinkhorn vs. accelerated Sinkhorn on MNIST): distance to the transportation polytope against row/column updates; spread (max/median/min) of the log competitive ratio; and value of OT against the true optimum with $1/\eta \in \{1, 5, 9\}$.]

Figure 2: Comparative performance of Sinkhorn vs. Greenkhorn, APDAGD vs.
APDAMD and Sinkhorn vs. accelerated Sinkhorn on the MNIST real images.

competitive ratios $\log(d(\mathcal{A}_1)/d(\mathcal{A}_2))$ on ten random pairs of MNIST images, where $\mathcal{A}_1$ and $\mathcal{A}_2$ respectively correspond to Sinkhorn and Greenkhorn; the rightmost image specifies the values of the entropic regularized OT problem with varying regularization parameter $\eta \in \{1, \frac{1}{5}, \frac{1}{9}\}$. The remaining rows present comparative results for APDAGD vs. APDAMD and Sinkhorn vs. accelerated Sinkhorn. We observe that (i) the comparative performances of Sinkhorn vs. Greenkhorn and APDAGD vs. APDAMD are consistent with those on synthetic images; (ii) accelerated Sinkhorn deteriorates but remains better than Sinkhorn; (iii) APDAMD is more robust than APDAGD and GCPB.

7 Conclusion

We first show that the complexity bound of Greenkhorn can be improved to $\widetilde{O}(n^2\varepsilon^{-2})$, which matches the best known bound of Sinkhorn. Then, we propose APDAMD by generalizing APDAGD with a prespecified mirror mapping $\varphi$ and show that it achieves the complexity bound of $\widetilde{O}(n^2\sqrt{\delta}\varepsilon^{-1})$, where $\delta > 0$ refers to the regularity of $\varphi$. We show that the complexity bound of $\widetilde{O}(\min\{n^{9/4}\varepsilon^{-1}, n^2\varepsilon^{-2}\})$ previously claimed for APDAGD is invalid and prove a refined complexity bound of $\widetilde{O}(n^{5/2}\varepsilon^{-1})$.
Moreover, we propose a deterministic accelerated variant of Sinkhorn

[Figure 3 here. Panels: value of OT against time for GCPB (stepsizes $1/(Ln)$, $3/(Ln)$, $5/(Ln)$), APDAGD and APDAMD, one panel per $\eta \in \{1, \frac{1}{5}, \frac{1}{9}\}$.]

Figure 3: Performance of GCPB, APDAGD and APDAMD in terms of time on the MNIST real images. These images specify the values of entropic regularized OT with varying regularization parameter $\eta \in \{1, \frac{1}{5}, \frac{1}{9}\}$, demonstrating the robustness of APDAMD.

via appeal to estimate sequence techniques and prove the complexity bound of $\widetilde{O}(n^{7/3}\varepsilon^{-4/3})$. As such, we see that accelerated Sinkhorn outperforms Sinkhorn and Greenkhorn in terms of $1/\varepsilon$, and APDAGD and AAM in terms of $n$. Experiments on synthetic data and real images demonstrate the efficiency of our algorithms.

There are a few promising future directions arising from this work. First, it is important to develop fast algorithms to compute dimension-reduced versions of OT. Indeed, OT suffers from the curse of dimensionality [Dudley, 1969, Fournier and Guillin, 2015], which means that a large number of samples from two continuous measures is necessary to approximate the true OT between them. This can be mitigated when data lie on low-dimensional manifolds [Weed and Bach, 2019, Paty and Cuturi, 2019], but the sample complexity still remains pessimistic even in that case.
This motivates recent works on efficient dimension-reduced OT, e.g., sliced OT [Bonneel et al., 2015], generalized sliced OT [Kolouri et al., 2019] and distributional sliced OT [Nguyen et al., 2021], further inspiring us to explore the application of our algorithms to these settings and eventually to automatic differentiation schemes. Second, there have been several application problems arising from the interplay between OT and adversarial ML; see Bhagoji et al. [2019] and Pydi and Jog [2020] for examples. However, it is known that OT has robustness issues when there are outliers in the supports of the probability measures. Robust OT has been introduced to deal with these robustness issues [Balaji et al., 2020]; the idea is to relax the marginal constraints via certain probability divergences, such as the KL divergence, so as to limit the amount of mass that the transportation plan assigns to the outliers in the supports of the measures. As with OT, a key practical question for robust OT is computational. As such, we plan to develop efficient algorithms for the robust OT problem in future work.

8 Acknowledgments

This work was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764 to MJ, and by the NSF IFML 2019844 award and research gifts by the UT Austin ML grant to NH.

References

B. K. Abid and R. M. Gower. Greedy stochastic algorithms for entropy-regularized optimal transport problems. In AISTATS, 2018. (Cited on page 15.)

Z. Allen-Zhu, Y. Li, R. Oliveira, and A. Wigderson. Much faster algorithms for matrix scaling. In FOCS, pages 890–901. IEEE, 2017. (Cited on page 3.)

J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In NeurIPS, pages 1964–1974, 2017.
(Cited on pages 1, 2, 3, 7, 9, 10, 13, 14, 15, 17, 22, 24, 26, 29, 30, and 31.)

J. Altschuler, F. Bach, A. Rudi, and J. Niles-Weed. Massively scalable Sinkhorn distances via the Nyström method. In NeurIPS, pages 4429–4439, 2019. (Cited on page 3.)

M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, pages 214–223, 2017. (Cited on page 1.)

Y. Balaji, R. Chellappa, and S. Feizi. Robust optimal transport with applications in generative modeling and domain adaptation. In NeurIPS, 2020. (Cited on pages 2, 3, and 34.)

E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. On parameter estimation with the Wasserstein distance. Information and Inference: A Journal of the IMA, 8(4):657–676, 2019. (Cited on page 2.)

D. Bertsimas and J. Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1997. (Cited on page 2.)

A. N. Bhagoji, D. Cullina, and P. Mittal. Lower bounds on adversarial robustness from optimal transport. In NeurIPS, pages 7498–7510, 2019. (Cited on page 34.)

J. Blanchet and K. Murthy. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2):565–600, 2019. (Cited on page 2.)

J. Blanchet, A. Jambulapati, C. Kent, and A. Sidford. Towards optimal running times for optimal transport. ArXiv Preprint: 1810.07717, 2018. (Cited on pages 3 and 11.)

J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019. (Cited on page 2.)

M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. In AISTATS, pages 880–889, 2018. (Cited on page 3.)

N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
(Cited on page 34.)

M. Carrière, M. Cuturi, and S. Oudot. Sliced Wasserstein kernel for persistence diagrams. In ICML, pages 1–10, 2017. (Cited on page 1.)

D. Chakrabarty and S. Khanna. Better and simpler error analysis of the Sinkhorn–Knopp algorithm for matrix scaling. In 1st Symposium on Simplicity in Algorithms, 2018. (Cited on page 2.)

L. Chizat, P. Roussillon, F. Léger, F-X. Vialard, and G. Peyré. Faster Wasserstein distance estimation with the Sinkhorn divergence. In NeurIPS, pages 2257–2269, 2020. (Cited on pages 2 and 6.)

M. B. Cohen, A. Madry, D. Tsipras, and A. Vladu. Matrix scaling and balancing via box constrained Newton's method and interior point methods. In FOCS, pages 902–913. IEEE, 2017. (Cited on page 3.)

N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017. (Cited on pages 1 and 3.)

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NeurIPS, pages 2292–2300, 2013. (Cited on pages 2, 5, 7, 9, 26, and 31.)

M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In ICML, pages 685–693, 2014. (Cited on page 1.)

M. Cuturi and G. Peyré. A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1):320–343, 2016. (Cited on page 3.)

M. Cuturi, O. Teboul, and J-P. Vert. Differentiable ranks and sorting using optimal transport. In NeurIPS, pages 6861–6871, 2019. (Cited on page 3.)

Y. Dong, Y. Gao, R. Peng, I. Razenshteyn, and S. Sawlani. A study of performance of optimal transport. ArXiv Preprint: 2005.01182, 2020. (Cited on pages 3, 4, and 31.)

R. M. Dudley. The speed of mean Glivenko–Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50, 1969. (Cited on pages 2, 6, and 34.)

P.
Dvurechenskii, D. Dvinskikh, A. Gasnikov, C. Uribe, and A. Nedich. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. In NeurIPS, pages 10783–10793, 2018. (Cited on page 1.)

P. Dvurechensky, A. Gasnikov, and A. Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In ICML, pages 1367–1376, 2018. (Cited on pages 1, 2, 3, 7, 9, 10, 11, 13, 15, 16, 22, 23, 24, 26, and 31.)

O. Fercoq and P. Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015. (Cited on page 25.)

J. Feydy, T. Séjourné, F-X. Vialard, S-I. Amari, A. Trouvé, and G. Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In AISTATS, pages 2681–2690. PMLR, 2019. (Cited on page 2.)

R. Flamary and N. Courty. POT: Python optimal transport library, 2017. URL https://github.com/rflamary/POT. (Cited on page 4.)

N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015. (Cited on pages 2, 6, and 34.)

H. N. Gabow and R. E. Tarjan. Faster scaling algorithms for general graph matching problems. Journal of the ACM (JACM), 38(4):815–853, 1991. (Cited on page 2.)

A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In NeurIPS, pages 3440–3448, 2016. (Cited on pages 3 and 31.)

A. Genevay, L. Chizat, F. Bach, M. Cuturi, and G. Peyré. Sample complexity of Sinkhorn divergences. In AISTATS, pages 1574–1583. PMLR, 2019. (Cited on pages 2 and 6.)

I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In NeurIPS, pages 5767–5777, 2017.
(Cited on page 1 .) S. Gumin o v, P . Dvurec hensky , N. T up itsa, and A. Gasniko v. On a combinatio n of alte rn ating minimization and Nestero v’s moment um . In ICML , pages 3886 –3898. PMLR, 2021. (Cited on pages 1 , 3 , 7 , 16 , 24 , and 26 .) W. Guo, N. Ho, and M. Jordan. F ast algorithms f or computational optimal transp ort and Wasserstein b arycen ter. In AIST A TS , pages 2088–2097 . PMLR, 2020. (Cited on pages 3 and 16 .) N. Ho, V. Huyn h, D. P h ung, and M. I. Jordan. Probabilistic m ultilev el clustering via com- p osite transp ortation distance. AIST A TS , pages 3149 –3157, 2019. (Cited on p age 1 .) A. Jam bu lapati, A. Sid ford, and K. Tian. A dir ect tilde { O } (1/epsilon) iteratio n parallel algorithm f or optimal transp ort. In NeurIPS , pages 11355–113 66, 2019. (Cited on p ages 4 and 16 .) B. Kalan tari and L. Khac hiya n. On the complexit y of n onnegativ e m atrix scaling. Line ar Algebr a and its A pplic ations , 240:87 –103, 1996. (Cited on page 2 .) B. Kalan tari, I. Lari, F. Ricca, and B. Simeone. O n the complexit y of general matrix scaling and entrop y minimization via the RAS algorithm. Mathematic al Pr o gr amming , 11 2(2): 371–4 01, 2008. (Cited on page 2 .) L. V. Kan toro vic h. On the translo cation of masses. In Dokl. Akad . Nauk. USSR (NS) , v olume 37, p ages 199–201, 1942. ( Cited on pages 1 and 5 .) P . A. Knigh t. Th e Sinkhorn-Kn op p algorithm: con v ergence and applications. SIAM Journal on Matrix Anal ysis and Applic ations , 30(1) :261–275, 2008. (Cited on page 2 .) S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. Roh d e. Generalized sliced Wasserstein distances. In NeurIPS , pages 261– 272, 2019. (Cited on p age 34 .) H. W. Kuhn. T he Hungarian m etho d for the assignmen t problem. Naval R ese ar ch L o gistics Quarterly , 2(1-2):83–9 7, 1955. (Cited on page 2 .) H. W. K uhn. V ariant s of the Hun garian metho d for assignmen t pr oblems. N aval R ese ar ch L o gistics Quarterly , 3(4):2 53–258, 1956. 
(Cited on page 2 .) N. Lahn, D. Mu lc handani, and S . Ragh ve nd ra. A graph theoretic additiv e app ro ximation of optimal tr ansp ort. In Neu rIPS , pages 13813–1 3823, 2019. ( Cited on pages 2 and 4 .) Y. T. Lee and A. Sidford. Pa th fin ding metho ds for linear p r ogramming: Solving linear programs in e O (sqr t(rank)) iterations and faster algorithms for maxim um flo w. In F OCS , pages 424 –433. IEEE, 2014. (Cited on p age 2 .) 37 J. Lei. Con verge nce and concen tration of empir ical m easur es un d er w asserstein d istance in unboun ded functional spaces. Bernoul li , 26(1):76 7–798, 2020. (Cited on page 2 .) Q. Lin, Z. Lu, and L. Xiao. An accelerate d randomized pro ximal co ordinate gradien t metho d and its application to regularized empirical risk minimizatio n. SIAM Journal on Optimiza- tion , 25(4):2244– 2273, 2015. (Cited on p age 25 .) T. Lin , N. Ho, and M. Jordan. On efficien t optimal tr ansp ort: An analysis of greedy and accele rated mirr or d escen t algorithms. In ICML , pages 3982–39 91, 2019a . ( Cited on pages 3 , 7 , and 10 .) T. Lin, Z. Hu, and X. Guo. Sparsemax and relaxed Wasserstein for topic sparsity . I n WSDM , pages 141 –149. A CM, 2019 b. (Cited on p age 1 .) H. Lu, R. F reund , and V. Mirrokni. Accelerating greedy co ordin ate descent metho d s. In ICML , pages 3263–327 2, 2018. (Cited on p age 25 .) G. Mena and J. Niles-W eed. Statistical b oun ds for entropic optimal transp ort: s amp le com- plexit y and the cen tral limit theorem. In NeurIPS , pages 4541–45 51, 2019. (Cited on pages 2 and 6 .) G. Monge. M ´ emoire sur la th ´ eorie des d´ eb lais et des rem blais. Histoir e de l’A c ad´ emie R oyale des Scienc es de Paris , 1781 . (Cited on page 1 .) J. Munkres. Algorithms for the assignmen t and transp ortation p roblems. Journal of the So cie ty for Industrial and Applie d M athematics , 5(1):32 –38, 1957. (Cited on page 2 .) Y. Nestero v. Smo oth minimization of non-smo oth fu nctions. 
Mathematic al Pr o gr amm ing , 103 (1):12 7–152, 2005. ( Cited on pages 7 , 8 , and 16 .) Y. Nestero v. L e ctur es on Convex Optimization , v olume 137. Springer, 2018. (Cited on page 24 .) K. Nguyen, N. Ho, T. Pham, and H. Bui. Distributional sliced-Wasserstein and applications to generativ e mo deling. In ICLR , 2021. URL https:// openrevie w.net/forum?id=QYjO70ACDK . (Cited on p age 34 .) X. Nguy en. C on v ergence of laten t mixing measures in fin ite and infinite mixture mo dels. Anna ls of Statistics , 4(1):370–4 00, 2013. (Cited on page 1 .) X. Nguyen. Borrowing strength in hierarc hical Bay es: posterior concen tration of th e Diric hlet base measure. Bernoul li , 22(3) :1535–157 1, 2016. ( Cited on page 1 .) J. B. Orlin. A p olynomial time primal n et w ork simp lex algorithm for minimum cost flo ws. Mathematic al Pr o gr amming , 78(2):1 09–129, 1997. (Cited on page 2 .) J. B. Orlin and R. K. Ah uja. New scaling algorithms for the assignment and minimum mean cycle problems. Mathematic al Pr o gr amming , 54(1):41–5 6, 1992. (Cited on p age 2 .) J. B. O rlin, S. A. Plotkin, and E. T ardos. P olynomial dual net work s implex algorithms. Mathematic al Pr o gr amming , 60(1):2 55–276, 1993. (Cited on page 2 .) F-P . P at y and M. Cuturi. Su b space robust Wasserstein distances. In ICML , p ages 5072–50 81. PMLR, 2019 . (Cited on pages 2 and 34 .) 38 O. P ele and M. W erman. F ast and robust earth mov er’s distance. In ICCV . IEEE, 2009. (Cited on p age 2 .) G. P eyr´ e and M. Cutur i. C omp utational optimal transp ort. F oundations and T r ends ® in Machine L e arning , 11(5-6):355 –607, 2019 . (Cited on page 2 .) G. P eyr ´ e, M. Cuturi, and J. Solomon. Gromo v-Wasserstein a v eraging of k ernel and distance matrices. I n ICML , pages 2664–2672 , 2016. (Cited on page 1 .) M. S. Pyd i and V. Jog. Adversarial risk via optimal transp ort and optimal couplin gs. In ICML , pages 7814–782 3. PMLR, 2020. (Cited on page 34 .) K. Quanr ud. 
Approximati ng optimal transp ort with linear programs. In SOSA , pages 61–69, 2019. (Cited on page 3 .) A. Rolet, M. Cu turi, and G. P eyr´ e. F ast d ictionary learning w ith a smo othed Wasserstein loss. In AIST A TS , pages 630– 638, 2016. (Cited on page 1 .) M. S c hmidt, N. L. Roux, and F. Bac h. Minimizing fi nite su ms with the sto c hastic a ve rage gradien t. Mathematic al Pr o gr amming , 162:83–11 2, 2017. (Cited on page 31 .) M. H. Schneider and S. A. Z enios. A comparativ e study of algorithms for matrix balancing. Op e r ations R ese ar ch , 38(3):43 9–455, 1990. (Cited on page 2 .) J. Sh erman. Area-con v exit y , ℓ ∞ regularization, and und ir ected multicommodity flow. In STOC , pages 452–460. ACM, 2017. (Cited on pages 4 and 16 .) R. Sinkhorn . Diagonal equiv alence to matrices with p rescrib ed ro w and column su ms. Pr o- c e e dings of the Americ an Mathematic al So ciety , 45(2):195 –198, 1974. (Cited on page 2 .) M. Sommerfeld and A. Mun k. Inf erence for empirical Wasserstein d istances on finite spaces. Journal of the R oyal Statistic al So c iety: Series B (Statistic al Metho dolo g y) , 80(1):219 –238, 2018. (Cited on page 2 .) M. Sommerfeld, Y. Z emel, and A. Mun k. Optimal transp ort: Fast probabilistic appro ximation with exact solve rs. Journal of M achine L e arning R ese ar ch , 20:1–23 , 2019. (Cited on p age 1 .) S. Sriv asta v a, V. Cevher, Q. Dinh, and D. Du nson. W ASP: Scalable Ba y es via barycen ters of subset p osteriors. In AIST A TS , pages 912 –920, 2015. (Cited on page 1 .) S. Sr iv astav a, C. Li, and D. Dunson. Scalable Ba y es via barycen ter in Wasserstein sp ace. Journal of Machine L e arning R ese ar c h , 19(8):1–3 5, 2018. (Cited on p age 1 .) I. T olstikhin, O. Bousqu et, S. Gelly , and B. S c ho elk opf. W asserstein auto-encoders . In ICLR , 2018. (Cited on page 1 .) J. v an den Brand , Y. T. Lee, Y. P . Liu, T. Saranurak, A. Sidford, Z. Song, and D. W ang. 
Minim um cost flows, MDPs, and ℓ 1 -regression in nearly linear time for dens e instances. In STOC , pages 859–869, 2021. (Cited on page 2 .) C. Villani. Optimal T r ansp ort: Old and Ne w , v olume 338. Spr in ger, 2009. (Cited on page 1 .) J. W eed and F. Bac h. Sharp asymptotic and finite-sample r ates of con verge nce of empirical measures in Wasserstein distance. Bernoul li , 25(4A):2 620–2648 , 2019. (Cited on pages 2 , 6 , and 34 .) 39