Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations


Authors: Debraj Basu, Deepesh Data, Can Karakus, Suhas Diggavi

Debraj Basu*, Deepesh Data†, Can Karakus‡, Suhas Diggavi§

Abstract

Communication bottleneck has been identified as a significant issue in distributed optimization of large-scale learning models. Recently, several approaches to mitigate this problem have been proposed, including different forms of gradient compression or computing local models and mixing them iteratively. In this paper, we propose the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD. We analyze convergence for Qsparse-local-SGD in the distributed setting for smooth non-convex and convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We use Qsparse-local-SGD to train ResNet-50 on ImageNet and show that it results in significant savings over the state-of-the-art in the number of bits transmitted to reach target accuracy.

Keywords: Distributed optimization and learning; stochastic optimization; communication-efficient training methods.

1 Introduction

Stochastic Gradient Descent (SGD) [HM51] and its many variants have become the workhorse for modern large-scale optimization as applied to machine learning [Bot10, BM11]. We consider a setup in which SGD is applied to the distributed setting, where R different nodes compute local stochastic gradients on their own datasets D_r.
Coordination between them is done by aggregating these local computations to update the overall parameter x_t as

x_{t+1} = x_t − (η_t/R) ∑_{r=1}^R g_t^{(r)},

where g_t^{(r)}, for r = 1, 2, ..., R, is the local stochastic gradient at the r'th machine for a local loss function f^{(r)}(x) of the parameter vector x, where f^{(r)} : R^d → R and η_t is the learning rate. Training of high-dimensional models is typically performed at a large scale over bandwidth-limited networks. Therefore, despite the distributed processing gains, it is well understood by now that the exchange of full-precision gradients between nodes causes communication to be the bottleneck for many large-scale models [AHJ+18, WXY+17, BWAA18, SYKM17]. For example, consider training the BERT architecture for language models [DCLT18], which has about 340 million parameters, implying that each full-precision exchange between workers is over 1.3 GB. Such a communication bottleneck could be significant in emerging edge-computation architectures suggested by federated learning [Kon17, MMR+17, ABC+16]. In such an architecture, data resides on, and can even be generated by, personal devices such as smartphones and other edge (IoT) devices, in contrast to data-center architectures. Learning is envisaged in such an ultra-large-scale, heterogeneous environment, with potentially unreliable or limited communication. These and other applications have led to many recently proposed methods, which are broadly based on three major approaches:

1. Quantization of gradients, where nodes locally quantize the gradient (perhaps with randomization) to a small number of bits [AGL+17, BWAA18, WHHZ18, WXY+17, SYKM17].

[*] Adobe Inc.; dbasu@adobe.com; work done while at UCLA.
[†] UCLA; deepesh.data@gmail.com
[‡] Amazon Web Services; cakarak@amazon.com; work done while at UCLA.
[§] UCLA; suhasdiggavi@ucla.edu
2. Sparsification of gradients, e.g., where nodes locally select the Top_k values of the gradient in absolute value and transmit these at full precision [Str15, AH17, SCJ18, AHJ+18, WHHZ18, LHM+18], while maintaining errors at local nodes for later compensation.

3. Skipping communication rounds, whereby nodes average their models after locally updating them for several steps [YYZ19, Cop15, ZDW13, Sti19, CH16, WJ18].

In this work we propose the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computations, along with error compensation, by keeping track of the difference between the true and compressed gradients. We propose both synchronous and asynchronous implementations of Qsparse-local-SGD in a distributed setting, where the nodes perform computations on their local datasets. In our asynchronous model, the distributed nodes' iterates evolve at the same rate, but the gradients are updated at arbitrary times; see Section 4 for more details. We analyze convergence for Qsparse-local-SGD in the distributed case, for smooth non-convex and smooth strongly convex objective functions. We demonstrate that Qsparse-local-SGD converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers. We implement Qsparse-local-SGD for ResNet-50 on the ImageNet dataset, and for a softmax multiclass classifier on the MNIST dataset, and we achieve target accuracies with about a factor of 15-20 savings over the state-of-the-art [AHJ+18, SCJ18, Sti19] in the total number of bits transmitted.

1.1 Related Work

The use of quantization for communication-efficient gradient methods has a decades-rich history [GMT73], and its recent use in training deep neural networks [SFD+14, Str15] has re-ignited interest.
Theoretically justified gradient compression using unbiased stochastic quantizers has been proposed and analyzed in [AGL+17, WXY+17, SYKM17]. Though the methods in [WWLZ18, WSL+18] use induced sparsity in the quantized gradients, explicitly sparsifying the gradients more aggressively by retaining the Top_k components, e.g., k < 1%, has been proposed [Str15, AH17, LHM+18, AHJ+18, SCJ18], combined with error compensation to ensure that all coordinates do get eventually updated as needed. [WHHZ18] analyzed error compensation for QSGD, without Top_k sparsification, while focusing on quadratic functions. Another approach for mitigating the communication bottleneck is infrequent communication, which has been popularly referred to in the literature as iterative parameter mixing, see [Cop15], and model averaging, see [Sti19, YYZ19, ZSMR16] and references therein. Our work is most closely related to, and builds on, the recent theoretical results in [AHJ+18, SCJ18, Sti19, YYZ19]. The analysis of the centralized Top_k (among other sparsifiers) was considered in [SCJ18], and [AHJ+18] analyzed a distributed version under the assumption that the aggregated Top_k gradients are close to the centralized Top_k case; see Assumption 1 in [AHJ+18]. Local-SGD, where several local iterations are performed before sending the full gradients, was studied in [Sti19, YYZ19], without any gradient compression beyond local iterations. Our work generalizes these works in several ways. We prove convergence for the distributed sparsification and error compensation algorithm, without the assumption of [AHJ+18], by using perturbed iterate methods [MPP+17, SCJ18]. We analyze non-convex as well as convex objectives for the distributed case with local computations.
A proof of sparsified SGD for convex objective functions and for the centralized case, without local computations,[1] was given in [SCJ18]. Our techniques compose a (stochastic or deterministic 1-bit sign) quantizer with sparsification and local computations using error compensation. While our focus has only been on mitigating the communication bottleneck in training high-dimensional models over bandwidth-limited networks, this technique works for any compression operator satisfying a regularity condition (see Definition 3), including our composed operators.

1.2 Contributions

We study a distributed set of R worker nodes, each of which performs computations on locally stored data, denoted by D_r. Consider the empirical-risk minimization of the loss function

f(x) = (1/R) ∑_{r=1}^R f^{(r)}(x), where f^{(r)}(x) = E_{i∼D_r}[f_i(x)],

where E_{i∼D_r}[·] denotes expectation over a random sample chosen from the local dataset D_r. Our setup can also handle different local functional forms, beyond dependence on the local dataset D_r, which we do not write explicitly for notational simplicity. For f : R^d → R, we denote x* := argmin_{x∈R^d} f(x) and f* := f(x*). The distributed nodes perform computations and provide updates to the master node, which is responsible for aggregation and the model update. We develop Qsparse-local-SGD, a distributed SGD composing gradient quantization and explicit sparsification (e.g., Top_k components), along with local iterations. We develop the algorithms and analysis for both synchronous as well as asynchronous operation, in which workers can communicate with the master at arbitrary time intervals. To the best of our knowledge, these are the first algorithms which combine quantization, aggressive sparsification, and local computations for distributed optimization.
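As a small illustration of the empirical-risk setup above (our sketch, not from the paper, using hypothetical squared-loss local datasets), averaging the R local gradients is exactly the gradient of the averaged objective f(x) = (1/R) ∑_r f^{(r)}(x):

```python
import numpy as np

rng = np.random.default_rng(0)
R, n, d = 4, 20, 3
# Hypothetical local datasets D_r with samples f_i(x) = 0.5*(a_i @ x - b_i)^2
A = [rng.standard_normal((n, d)) for _ in range(R)]
B = [rng.standard_normal(n) for _ in range(R)]

def local_grad(r, x):
    """Gradient of f^(r)(x) = E_{i~D_r}[0.5*(a_i @ x - b_i)^2]."""
    return A[r].T @ (A[r] @ x - B[r]) / n

x = rng.standard_normal(d)
# Gradient of f(x) = (1/R)*sum_r f^(r)(x): the average of the local gradients.
g_avg = np.mean([local_grad(r, x) for r in range(R)], axis=0)

# The same quantity computed from the pooled objective directly.
A_all, B_all = np.vstack(A), np.concatenate(B)
g_pooled = A_all.T @ (A_all @ x - B_all) / (R * n)
assert np.allclose(g_avg, g_pooled)
```

This is the quantity the master would reconstruct if workers sent their gradients at full precision; the rest of the paper is about communicating compressed versions of it.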
With some minor modifications, Qsparse-local-SGD can also be used in a peer-to-peer setting, where the aggregation is done without any help from the master node, and each worker exchanges its updates with all other workers. Our main theoretical results are the convergence analyses of Qsparse-local-SGD for both non-convex as well as convex objectives; see Theorem 1 and Theorem 3 for the synchronous case, as well as Theorem 4 and Theorem 6 for the asynchronous operation. Our analyses also demonstrate the natural gains in convergence that distributed, mini-batch operation affords, with convergence similar to equivalent vanilla SGD with local iterations (see Corollary 2 and Corollary 3), for both the non-convex case (with convergence rate ∼ 1/√T for a fixed learning rate) as well as the convex case (with convergence rate ∼ 1/T for a diminishing learning rate). We also demonstrate that quantizing and sparsifying the gradient, even after local iterations, asymptotically yields an almost "free" efficiency gain (also observed numerically, non-asymptotically, in Section 5). The numerical results on the ImageNet dataset with a ResNet-50 architecture, and for the convex case with multiclass logistic classification on the MNIST dataset [LBBH98], demonstrate that one can get significant communication savings, while retaining state-of-the-art performance.

[1] At the completion of our work, we found that in parallel to our work, [KRSJ19] examined the use of sign-SGD quantization, without sparsification, for the centralized model. Another recent work, [KSJ19], studies the decentralized case with sparsification for strongly convex functions. In contrast, our work, developed independently of these works, uses quantization, sparsification, and local computations for the distributed case, for both non-convex and strongly convex objectives.
The combination of quantization, sparsification, and local computations poses several challenges for the theoretical analyses, including the analysis of the impact of local iterations (block updates of parameters) on quantization and sparsification (see Lemmas 4-5 in Section 3), as well as asynchronous updates and their combination with distributed compression (see Lemmas 9-12 in Section 4).

1.3 Paper Organization

In Section 2, we demonstrate that composing certain classes of quantizers with sparsifiers satisfies a certain regularity condition that is needed for several convergence proofs for our algorithms. We describe the synchronous implementation of Qsparse-local-SGD in Section 3, outline its main convergence results in Section 3.3, and briefly give the proof ideas in Section 3.4. We describe our asynchronous implementation of Qsparse-local-SGD and provide the theoretical convergence results in Section 4. The experimental results are given in Section 5. Many of the proof details are given in the appendices, provided as part of the supplementary material.

2 Communication-Efficient Operators

Traditionally, distributed stochastic gradient descent affords to send full-precision (32- or 64-bit) unbiased gradient updates from workers to peers or to a central server that helps with aggregation. However, communication bottlenecks that arise in bandwidth-limited networks limit the applicability of such an algorithm at a large scale, when the parameter size is massive or the data is widely distributed over a very large number of worker nodes. In such settings, one could think of updates which not only result in convergence, but also require less bandwidth, thus making the training process faster. In the following sections we discuss several useful operators from the literature and enhance their use by proposing a novel class of composed operators.
We first consider two different techniques used in the literature for mitigating the communication bottleneck in distributed optimization, namely, quantization and sparsification. In quantization, we reduce the precision of the gradient vector by mapping each of its components, via a deterministic [BWAA18, KRSJ19] or randomized [AGL+17, WXY+17, SYKM17, ZDJW13] map, to a finite number of quantization levels. In sparsification, we sparsify the gradient vector before using it to update the parameter vector, by taking its top k components, denoted by Top_k, or choosing k components uniformly at random, denoted by Rand_k [SCJ18, KSJ19].

2.1 Quantization

SGD computes an unbiased estimate of the gradient, which can be used to update the model iteratively and is extremely useful in large-scale applications. It is well known that the first-order terms in the rate of convergence are affected by the variance of the gradients. While stochastic quantization of gradients could result in a variance blow-up, it preserves the unbiasedness of the gradients at low precision; therefore, when training over bandwidth-limited networks, convergence would be much faster; see [AGL+17, WXY+17, SYKM17, ZDJW13].

Definition 1 (Randomized quantizer). We say that Q_s : R^d → R^d is a randomized quantizer with s quantization levels, if the following holds for every x ∈ R^d: (i) E_Q[Q_s(x)] = x; (ii) E_Q[‖Q_s(x)‖²] ≤ (1 + β_{d,s})‖x‖², where β_{d,s} > 0 could be a function of d and s. Here the expectation is taken over the randomness of Q_s.

Examples of randomized quantizers include:

1. QSGD [AGL+17, WXY+17], which independently quantizes the components of x ∈ R^d into s levels, with β_{d,s} = min(d/s², √d/s).
2. Stochastic s-level Quantization [SYKM17, ZDJW13], which independently quantizes every component of x ∈ R^d into s levels between max_i x_i and min_i x_i, with β_{d,s} = d/(2s²).

3. Stochastic Rotated Quantization [SYKM17], which is a stochastic quantization, preprocessed by a random rotation, with β_{d,s} = 2 log₂(2d)/s².

Instead of quantizing randomly into s levels, we can take a deterministic approach and round off each component of the vector to the nearest level. In particular, we can just take the sign, which has shown promise in [BWAA18, KRSJ19].

Definition 2 (Deterministic sign quantizer). A deterministic quantizer Sign : R^d → {+1, −1}^d is defined as follows: for every vector x ∈ R^d, the i'th component of Sign(x), for i ∈ [d], is defined as 1{x_i ≥ 0} − 1{x_i < 0}.

Such methods have drawn interest since Rprop [RB93], which used only the temporal behavior of the sign of the gradient; this is an example where the biased 1-bit quantizer of Definition 2 is used. This further inspired optimizers such as RMSprop [TH12] and Adam [KB15], which incorporate appropriate adaptive scaling with momentum acceleration and have demonstrated empirical superiority in non-convex applications.

2.2 Sparsification

As mentioned earlier, we consider two important examples of sparsification operators: Top_k and Rand_k. For any x ∈ R^d, Top_k(x) is a d-length vector which has at most k non-zero components, whose indices correspond to the indices of the largest k components (in absolute value) of x. Similarly, Rand_k(x) is a d-length (random) vector, which is obtained by selecting k components of x uniformly at random. Both of these satisfy a so-called "compression" property, as defined below, with γ = k/d [SCJ18].

Definition 3 (Compression operator [SCJ18]).
A (randomized) function Comp_k : R^d → R^d is called a compression operator if there exists a constant γ ∈ (0, 1] (that may depend on k and d), such that for every x ∈ R^d, we have

E_C[‖x − Comp_k(x)‖₂²] ≤ (1 − γ)‖x‖₂²,  (1)

where the expectation is taken over the randomness of the compression operator Comp_k.

Note that stochastic quantizers, as defined in Definition 1, also satisfy the regularity condition in Definition 3 when β_{d,s} ≤ 1. Now we give a simple but important corollary, which allows us to apply different compression operators to different coordinates of a vector. As an application, in the case of training neural networks, we can apply different operators to different layers.

Corollary 1 (Piecewise compression). Let C_i : R^{d_i} → R^{d_i} for i ∈ [L] denote possibly different compression operators with compression coefficients γ_i. Let x = [x_1 x_2 ... x_L], where x_i ∈ R^{d_i} for all i ∈ [L]. Then C(x) := [C_1(x_1) C_2(x_2) ... C_L(x_L)] is a compression operator with compression coefficient equal to γ_min = min_{i∈[L]} γ_i.

Proof. Fix an arbitrary x ∈ R^d. The result follows from the following set of inequalities:

E_C‖x − C(x)‖₂² = ∑_{i=1}^L E_{C_i}‖x_i − C_i(x_i)‖₂² ≤(a) ∑_{i=1}^L (1 − γ_i)‖x_i‖₂² ≤ (1 − γ_min)‖x‖₂²,

where inequality (a) follows because each C_i is a compression operator with compression coefficient γ_i.

Corollary 1 allows us to apply different compression operators to different coordinates of the updates, based upon their dimensionality and sparsity patterns.

2.3 Composition of Quantization and Sparsification

Now we show that we can compose deterministic/randomized quantizers with sparsifiers, and the resulting operator is a compression operator.
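As an illustration (ours, not from the paper), the following NumPy sketch implements the Top_k and Rand_k sparsifiers and numerically checks the compression property (1) with γ = k/d:

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude entries of x; zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    """Keep k entries chosen uniformly at random; zero out the rest."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

# Check the compression property (1): E||x - Comp_k(x)||^2 <= (1 - k/d)||x||^2.
rng = np.random.default_rng(0)
d, k = 100, 10
x = rng.standard_normal(d)
gamma = k / d

# Top_k is deterministic, so no expectation is needed.
assert np.sum((x - top_k(x, k))**2) <= (1 - gamma) * np.sum(x**2)

# For Rand_k the inequality holds in expectation (with equality, in fact),
# so we average the error over many draws and allow a little sampling slack.
errs = [np.sum((x - rand_k(x, k, rng))**2) for _ in range(5000)]
assert np.mean(errs) <= (1 - gamma) * np.sum(x**2) * 1.02
```

Top_k satisfies the bound for every input, whereas Rand_k meets it only on average over its random index choice, which is exactly why Definition 3 takes an expectation.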
First we compose a general stochastic quantizer with an explicit sparsifier, such as Top_k or Rand_k, and show that the resulting operator is a "compression" operator. A proof is provided in Appendix A.1.

Lemma 1 (Compression of a composed operator). Let Comp_k ∈ {Top_k, Rand_k}. Let Q_s : R^d → R^d be a quantizer with parameter s that satisfies Definition 1. Let Q_s Comp_k : R^d → R^d be defined as Q_s Comp_k(x) := Q_s(Comp_k(x)) for every x ∈ R^d. If k, s are such that β_{k,s} < 1, then Q_s Comp_k : R^d → R^d is a compression operator with compression coefficient equal to γ = (1 − β_{k,s}) k/d, i.e., for every x ∈ R^d, we have

E_{C,Q}[‖x − Q_s Comp_k(x)‖₂²] ≤ (1 − (1 − β_{k,s}) k/d) ‖x‖₂²,

where the expectation is taken over the randomness of the compression operator Comp_k as well as of the quantizer Q_s.

For the different quantizers mentioned earlier, the conditions under which their composition with Comp_k gives β_{k,s} < 1 are:

1. QSGD: for k < s², we get γ = (1 − k/s²) k/d.

2. Stochastic s-level Quantization: for k < 2s², we get γ = (1 − k/(2s²)) k/d.

3. Stochastic Rotated Quantization: for k < 2^{s²/2 − 1}, we get γ = (1 − 2 log₂(2k)/s²) k/d.

Remark 1. Observe that for a given stochastic quantizer that satisfies Definition 1, we have a prescribed operating regime of β_{k,s} < 1. This results in an upper bound on the coarseness of the quantizer, which happens because the quantization leads to a blow-up of the second moment; see condition (ii) of Definition 1. However, by employing Corollary 1, we show via an example that this can be alleviated to some extent. Consider an operator as described in Lemma 1, where the quantizer Q_s : R^d → R^d in use is QSGD [AGL+17, WXY+17], and the sparsifier Comp_k is Top_k [AHJ+18, SCJ18]. Apply it to a vector x = [x_1 x_2 ...
x_L] ∈ R^d in a piecewise manner, i.e., Q_{s_i} Comp_{k_i} : R^{d_i} → R^{d_i} applied to the smaller vectors x_i ∈ R^{d_i}, as prescribed in Corollary 1. Define β_{k_i,s_i} = k_i/s_i² as the coefficient of the variance bound in Definition 1 for the quantizer Q_{s_i} used for x_i, and let k := ∑_{i=1}^L k_i. Observe that the regularity condition in Definition 3 can be satisfied by having k_i < s_i². Therefore, the piecewise compression operator allows a coarser quantizer than when the operator is applied to the entire vector together, where we require β_{k,s} = k/s² < 1, thus providing a small gain in communication efficiency. For example, consider the composed operator being applied on a per-layer basis to a deep neural network. We can now afford a much coarser quantizer than when the operator is applied to all the parameters at once.

As discussed above, stochastic quantization results in a variance blow-up, which limits our regime of operation when we combine it with sparsification. However, it turns out that we can remove this restriction on the regime of operation entirely by scaling the vector Q_s Comp_k(x) appropriately. We summarize the result in the following lemma, which is proved in Appendix A.2.

Lemma 2 (Composing sparsification with stochastic quantization). Let Comp_k ∈ {Top_k, Rand_k}. Let Q_s : R^d → R^d be a stochastic quantizer with parameter s that satisfies Definition 1. Let Q_s Comp_k : R^d → R^d be defined as Q_s Comp_k(x) := Q_s(Comp_k(x)) for every x ∈ R^d. Then Q_s Comp_k(x)/(1 + β_{k,s}) is a compression operator with compression coefficient equal to γ = k/(d(1 + β_{k,s})), i.e., for every x ∈ R^d, we have

E_{C,Q}[‖x − Q_s Comp_k(x)/(1 + β_{k,s})‖₂²] ≤ (1 − k/(d(1 + β_{k,s}))) ‖x‖₂².

Remark 2.
Note that, unlike Q_s Comp_k(x), the scaled version Q_s Comp_k(x)/(1 + β_{k,s}) is always a compression operator, for all values of β_{k,s} > 0. Furthermore, observe that if β_{k,s} < 1, then (1 − β_{k,s}) k/d < k/(d(1 + β_{k,s})), which implies that even in the operating regime β_{k,s} < 1 required in Lemma 1, the scaled composed operator Q_s Comp_k(x)/(1 + β_{k,s}) of Lemma 2 gives better compression than the unscaled composed operator Q_s Comp_k(x) of Lemma 1. So the appropriately scaled composed operator is always the better choice for compression.

We can also compose the deterministic 1-bit quantizer Sign with Comp_k. For that we first need some notation. For Comp_k ∈ {Top_k, Rand_k} and a given vector x ∈ R^d, let S_{Comp_k}(x) denote the set of k indices chosen in defining Comp_k(x). For example, if Comp_k = Top_k, then S_{Comp_k}(x) denotes the set of k indices corresponding to the largest k components (in absolute value) of x; if Comp_k = Rand_k, then S_{Comp_k}(x) denotes a set of k indices chosen at random from [d]. The composition of Sign with Comp_k ∈ {Top_k, Rand_k} is denoted by SignComp_k : R^d → R^d, and for i ∈ [d], the i'th component of SignComp_k(x) is defined as

(SignComp_k(x))_i := 1{x_i ≥ 0} − 1{x_i < 0} if i ∈ S_{Comp_k}(x), and 0 otherwise.

In the following lemma we show that SignComp_k is a compression operator; a proof is provided in Appendix A.3.

Lemma 3 (Composing sparsification with deterministic quantization). For Comp_k ∈ {Top_k, Rand_k}, the operator (‖Comp_k(x)‖_m / k) SignComp_k(x), for any m ∈ Z^+, is a compression operator with compression coefficient γ_m equal to

γ_m = max{ 1/d, (k/d) (‖Comp_k(x)‖₁ / (√d ‖Comp_k(x)‖₂))² } if m = 1, and γ_m = k/(2^{m−1} d) if m ≥ 2.

Remark 3.
Observe that for m = 1, depending on the value of k, either of the terms inside the max can be the bigger one. For example, if k = 1, then ‖Comp_k(x)‖₁ = ‖Comp_k(x)‖₂, which implies that the second term inside the max is equal to 1/d², which is much smaller than the first term. On the other hand, if k = d and the vector x is dense, then the second term may be much bigger than the first term.

3 Distributed Synchronous Operation

Let I_T^{(r)} ⊆ [T] := {1, ..., T}, with T ∈ I_T^{(r)}, denote the set of indices at which worker r ∈ [R] synchronizes with the master. In the synchronous setting, I_T^{(r)} is the same for all workers; let I_T := I_T^{(r)} for any r ∈ [R]. Every worker r ∈ [R] maintains a local parameter vector x̂_t^{(r)}, which is updated in each iteration t. If t ∈ I_T, every worker r ∈ [R] sends to the master node the compressed and error-compensated update g_t^{(r)}, computed on the net progress made since the last synchronization, and updates its local memory m_t^{(r)}. Upon receiving g_t^{(r)}, r = 1, 2, ..., R, the master aggregates them, updates the global parameter vector, and sends the new model x_{t+1} to all the workers; upon receiving it, they set their local parameter vectors x̂_{t+1}^{(r)} to be equal to the global parameter vector x_{t+1}. Our algorithm is summarized in Algorithm 1.

Algorithm 1 Qsparse-local-SGD
1: Initialize x_0 = x̂_0^{(r)} = m_0^{(r)} = 0, ∀r ∈ [R]. Suppose η_t follows a certain learning rate schedule.
2: for t = 0 to T − 1 do
3:   On Workers:
4:   for r = 1 to R do
5:     x̂_{t+1/2}^{(r)} ← x̂_t^{(r)} − η_t ∇f_{i_t^{(r)}}(x̂_t^{(r)});  i_t^{(r)} is a mini-batch of size b chosen uniformly from D_r
6:     if t + 1 ∉ I_T then
7:       x_{t+1} ← x_t, m_{t+1}^{(r)} ← m_t^{(r)}, and x̂_{t+1}^{(r)} ← x̂_{t+1/2}^{(r)}
8:     else
9:       g_t^{(r)} ← QComp_k(m_t^{(r)} + x_t − x̂_{t+1/2}^{(r)}); send g_t^{(r)} to the master
10:      m_{t+1}^{(r)} ← m_t^{(r)} + x_t − x̂_{t+1/2}^{(r)} − g_t^{(r)}
11:      Receive x_{t+1} from the master and set x̂_{t+1}^{(r)} ← x_{t+1}
12:    end if
13:  end for
14:  At Master:
15:  if t + 1 ∉ I_T then
16:    x_{t+1} ← x_t
17:  else
18:    Receive g_t^{(r)} from the R workers and compute x_{t+1} = x_t − (1/R) ∑_{r=1}^R g_t^{(r)}
19:    Broadcast x_{t+1} to all workers
20:  end if
21: end for
22: Comment: x̂_{t+1/2}^{(r)} denotes an intermediate variable between iterations t and t + 1.

3.1 Assumptions

All results in this paper use the following two standard assumptions.

1. Smoothness: The local function f^{(r)} : R^d → R at each worker r ∈ [R] is L-smooth, i.e., for every x, y ∈ R^d, we have f^{(r)}(y) ≤ f^{(r)}(x) + ⟨∇f^{(r)}(x), y − x⟩ + (L/2)‖y − x‖₂².

2. Bounded second moment: For every x̂_t^{(r)} ∈ R^d, r ∈ [R], t ∈ [T], and for some constant 0 ≤ G < ∞, we have E_{i∼D_r}[‖∇f_i(x̂_t^{(r)})‖₂²] ≤ G². This is a standard assumption in [SSS07, NJLS09, RRWN11, HK14, RSS12, SCJ18, Sti19, YYZ19, KSJ19, AHJ+18]. Relaxing the uniform boundedness of the gradient to allow arbitrarily different gradients of local functions in heterogeneous settings, as done for SGD in [NNvD+18, WJ18], is left for future work. This assumption also imposes a bound on the variance: E_{i∼D_r}[‖∇f_i(x̂_t^{(r)}) − ∇f^{(r)}(x̂_t^{(r)})‖₂²] ≤ σ_r², where σ_r² ≤ G² for every r ∈ [R].
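The control flow of Algorithm 1 can be sketched in a toy single-process simulation (ours, not from the paper). We assume deterministic local gradients from hypothetical quadratic losses f^{(r)}(x) = 0.5·‖x − c_r‖², a Top_k compressor, and a decaying learning rate; the global minimizer of f is the mean of the c_r, and the error-compensated updates recover it despite aggressive compression:

```python
import numpy as np

def top_k(x, k):
    """Top_k sparsifier: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def qsparse_local_sgd(cs, T=5000, H=5, k=2):
    """Toy simulation of Algorithm 1 for f^(r)(x) = 0.5*||x - c_r||^2,
    whose average f has minimizer cs.mean(axis=0)."""
    R, d = cs.shape
    x = np.zeros(d)               # global model x_t (held by the master)
    x_hat = np.zeros((R, d))      # local models
    mem = np.zeros((R, d))        # error memories
    for t in range(T):
        eta = 10.0 / (200 + t)    # decaying learning-rate schedule (cf. Lemma 4)
        for r in range(R):
            x_hat[r] -= eta * (x_hat[r] - cs[r])    # local step (line 5)
        if (t + 1) % H == 0:      # synchronization round: t + 1 is in I_T
            gs = []
            for r in range(R):
                update = mem[r] + x - x_hat[r]      # error-compensated net progress
                g = top_k(update, k)                # compress (line 9)
                mem[r] = update - g                 # store the residual (line 10)
                gs.append(g)
            x = x - np.mean(gs, axis=0)             # master aggregation (line 18)
            x_hat[:] = x                            # workers adopt the model (line 11)
    return x

cs = np.array([[1.0, 2.0, -1.0, 0.5],
               [3.0, 0.0,  1.0, 0.5],
               [2.0, 1.0,  0.0, 0.5]])
x_final = qsparse_local_sgd(cs)
# x_final ends up close to the minimizer cs.mean(axis=0) = [2., 1., 0., 0.5]
```

Between synchronizations the master's x_t is frozen; at a synchronization, each worker compresses its accumulated progress x_t − x̂_{t+1/2}^{(r)} plus its memory, and whatever the compressor drops is kept in the memory for later rounds.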
In this section we present our main convergence results with synchronous updates, obtained by running Algorithm 1 for smooth functions, both non-convex and strongly convex. To state our results, we need the following definition from [Sti19].

Definition 4 (Gap [Sti19]). Let I_T = {t_0, t_1, ..., t_k}, where t_i < t_{i+1} for i = 0, 1, ..., k − 1. The gap of I_T is defined as gap(I_T) := max_{i∈[k]} (t_i − t_{i−1}), which is equal to the maximum difference between any two consecutive synchronization indices.

3.2 Error Compensation

Sparsified gradient methods, where workers send the largest k coordinates of the updates based on their magnitudes, have been investigated in the literature and serve as a communication-efficient strategy for distributed training of learning models. However, their convergence rates are subpar to distributed vanilla SGD. Together with some form of error compensation, these methods have been empirically observed to converge as fast as vanilla SGD [Str15, AH17, LHM+18, AHJ+18, SCJ18]. In [AHJ+18, SCJ18], sparsified SGD with such feedback schemes has been carefully analyzed. Under analytic assumptions, [AHJ+18] proves the convergence of distributed Top_k SGD with error feedback; the net error in the system is accumulated by each worker locally on a per-iteration basis, and this is used as feedback for generating future updates. [SCJ18] carried out the analysis of centralized Top_k SGD for strongly convex objectives. In Algorithm 1, the error introduced in every iteration is accumulated into the memory of each worker, which is compensated for in future rounds of communication. This feedback is the key to recovering convergence rates matching vanilla SGD.
The operators employed provide a controlled way of using both the current update as well as the compression errors from previous rounds of communication. Under the assumption of uniform boundedness of the gradients, we analyze the controlled evolution of the memory through the optimization process; the results are summarized in Lemma 4 and Lemma 5 below.

3.2.1 Decaying Learning Rate

Here we show that if we run Algorithm 1 with a decaying learning rate η_t, then the local memory at each worker contracts and goes to zero as O(η_t²).

Lemma 4 (Memory contraction). Let gap(I_T) ≤ H and η_t = ξ/(a + t), where ξ is a constant and a > 4H/γ, with γ being the compression coefficient of the compression operator. Then there exists a constant C ≥ 4aγ(1 − γ/2)/(aγ − 4H), such that the following holds for every t ∈ Z^+ and r ∈ [R]:

E‖m_t^{(r)}‖₂² ≤ (4η_t²/γ²) C H² G².  (2)

We prove Lemma 4 in Appendix B.1. Note that for fixed γ, H, the memory decays as O(η_t²). This implies that the net error in the algorithm from the compression of updates in each round of communication is compensated for in the end.

3.2.2 Fixed Learning Rate

In the following lemma, which is proved in Appendix B.2, we show that if we run Algorithm 1 with a fixed learning rate η_t = η, ∀t, then the local memory at each worker is bounded. It can be verified that the proof of Lemma 4 also holds for a fixed learning rate, and we can trivially bound E‖m_t^{(r)}‖₂² in this case by simply putting η_t = η in (2). However, we can get a better bound (saving a factor of C/(1 − γ/2), which is bigger than 4) by working directly with a fixed learning rate.

Lemma 5 (Bounded memory). Let gap(I_T) ≤ H. Then the following holds for every worker r ∈ [R] and every t ∈ Z^+:

E‖m_t^{(r)}‖₂² ≤ (4η²(1 − γ/2)/γ²) H² G².  (3)

Note that, for fixed γ and H, the memory is upper bounded by a constant that is O(η²).
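The bounded-memory behavior of Lemma 5 can be sanity-checked numerically. The sketch below (our illustration, not from the paper, and using the bound with the (1 − γ/2) factor as stated in Lemma 5) feeds a long stream of H-step accumulated updates, each built from gradients of norm G, through the Top_k error-feedback recursion m ← (m + u) − Top_k(m + u), and verifies that the squared memory norm never exceeds the bound in (3):

```python
import numpy as np

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(1)
d, k, H, eta, G = 50, 5, 4, 0.1, 1.0
gamma = k / d                              # compression coefficient of Top_k
# Lemma 5's bound on the squared memory norm for a fixed learning rate:
bound = 4 * eta**2 * (1 - gamma / 2) * H**2 * G**2 / gamma**2

mem = np.zeros(d)
worst = 0.0
for _ in range(3000):
    # Net local progress over H steps, each step using a gradient of norm G.
    u = np.zeros(d)
    for _ in range(H):
        g = rng.standard_normal(d)
        u += eta * G * g / np.linalg.norm(g)
    v = mem + u                            # error-compensated update
    mem = v - top_k(v, k)                  # residual kept in memory
    worst = max(worst, np.sum(mem**2))

assert worst <= bound
```

In practice the observed memory is well below the worst-case bound, since for random updates Top_k captures much more than a k/d fraction of the energy.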
Observe that, since the memory accumulates the past errors due to compression and local computation, in order to asymptotically reduce the memory to zero, the learning rate has to be reduced once in a while throughout the training process.

3.3 Main Results

We leverage the perturbed iterate analysis of [MPP+17, SCJ18] to provide convergence guarantees for Qsparse-local-SGD. Under the assumptions of Section 3.1, the following theorems hold when Algorithm 1 is run with any compression operator (including our composed operators).

Theorem 1 (Smooth (non-convex) case with fixed learning rate). Let $f^{(r)}(x)$ be $L$-smooth for every $r \in [R]$. Let $QComp_k : \mathbb{R}^d \to \mathbb{R}^d$ be a compression operator whose compression coefficient is equal to $\gamma \in (0,1]$. Let $\{\hat{x}_t^{(r)}\}_{t=0}^{T-1}$ be generated according to Algorithm 1 with $QComp_k$, for step size $\eta = \frac{\hat{C}}{\sqrt{T}}$ (where $\hat{C}$ is a constant such that $\frac{\hat{C}}{\sqrt{T}} \le \frac{1}{2L}$) and $\mathrm{gap}(I_T) \le H$. Then we have

$$\mathbb{E}\|\nabla f(z_T)\|_2^2 \le \left(\frac{\mathbb{E}[f(x_0)] - f^*}{\hat{C}} + \hat{C}L\,\frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}\right)\frac{4}{\sqrt{T}} + 8\left(\frac{4(1-\gamma^2)}{\gamma^2} + 1\right)\frac{\hat{C}^2 L^2 G^2 H^2}{T}.$$

Here $z_T$ is a random variable which samples a previous parameter $\hat{x}_t^{(r)}$ with probability $1/RT$.

Corollary 2. Let $\mathbb{E}[f(x_0)] - f^* \le J^2$, where $J < \infty$ is a constant,² $\sigma_{\max} = \max_{r \in [R]} \sigma_r$, and $\hat{C}^2 = \frac{bR(\mathbb{E}[f(x_0)] - f^*)}{\sigma_{\max}^2 L}$. Then we have

$$\mathbb{E}\|\nabla f(z_T)\|_2^2 \le O\left(\frac{J\sigma_{\max}}{\sqrt{bRT}}\right) + O\left(\frac{J^2 bRG^2H^2}{\sigma_{\max}^2 \gamma^2 T}\right).$$

In order to ensure that the compression does not affect the dominating terms while converging at a rate of $O(1/\sqrt{bRT})$, we would require $H = O\left(\gamma T^{1/4}/(bR)^{3/4}\right)$.

Theorem 1 is proved in Appendix B.6 and provides non-asymptotic guarantees, in which we observe that compression does not affect the first-order term. Here, we are required to decide the horizon $T$ before running the algorithm.
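Corollary 2's condition on the synchronization gap can be checked directly: taking the hidden constant in $H = O(\gamma T^{1/4}/(bR)^{3/4})$ to be 1 (an arbitrary illustrative choice), the $H$-dependent factor $H^2 bR/(\gamma^2 T)$ of the second term collapses to exactly $1/\sqrt{bRT}$, the order of the leading term:

```python
def allowed_sync_period(T, b, R, gamma):
    """Largest synchronization gap suggested by Corollary 2 (constant factor 1):
    H = gamma * T**(1/4) / (b*R)**(3/4)."""
    return gamma * T ** 0.25 / (b * R) ** 0.75

def compression_term_order(T, b, R, gamma):
    """The H-dependent factor of the second term in Corollary 2 (dropping the
    problem constants J, G, sigma_max): H^2 * b * R / (gamma^2 * T)."""
    H = allowed_sync_period(T, b, R, gamma)
    return H ** 2 * b * R / (gamma ** 2 * T)
```

Algebraically, $H^2 bR/(\gamma^2 T) = \frac{\gamma^2\sqrt{T}}{(bR)^{3/2}}\cdot\frac{bR}{\gamma^2 T} = 1/\sqrt{bRT}$, so both terms of the bound are of the same order.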
Therefore, in order to converge to a fixed point, the learning rate needs to follow a piecewise schedule (i.e., the learning rate has to be reduced once in a while throughout the training process), which is also the case in our numerics in Section 5.1. The corresponding asymptotic result (with decaying learning rate) is given below.

Theorem 2 (Smooth (non-convex) case with decaying learning rate). Let $f^{(r)}(x)$ be $L$-smooth for every $r \in [R]$. Let $QComp_k : \mathbb{R}^d \to \mathbb{R}^d$ be a compression operator whose compression coefficient is equal to $\gamma \in (0,1]$. Let $\{\hat{x}_t^{(r)}\}_{t=0}^{T-1}$ be generated according to Algorithm 1 with $QComp_k$, for step sizes $\eta_t = \frac{\xi}{a+t}$ and $\mathrm{gap}(I_T) \le H$, where $a > 1$ is such that $a > \max\{\frac{4H}{\gamma}, 2\xi L, H\}$ and $C \ge \frac{4a\gamma(1-\gamma^2)}{a\gamma - 4H}$. Then the following holds:

$$\mathbb{E}\|\nabla f(z_T)\|^2 \le \frac{\mathbb{E}f(x_0) - f^*}{P_T} + \frac{L\xi^2}{(a-1)P_T}\,\frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2} + \left(\frac{8C}{\gamma^2} + 8\right)\frac{\xi^3 L^2 G^2 H^2}{2(a-1)^2 P_T}.$$

Here (i) $\delta_t := \frac{\eta_t}{4R}$; (ii) $P_T := \sum_{t=0}^{T-1}\sum_{r=1}^{R}\delta_t$, which is lower bounded as $P_T \ge \frac{\xi}{4}\ln\left(\frac{T+a-1}{a}\right)$; and (iii) $z_T$ is a random variable which samples a previous parameter $\hat{x}_t^{(r)}$ with probability $\delta_t/P_T$.

Note that Theorem 2 gives a convergence rate of $O(\frac{1}{\log T})$. We prove it in Appendix B.7.

²Even classical SGD requires knowing an upper bound on $\|x_0 - x^*\|$ in order to choose the learning rate. Smoothness of $f$ translates this to the difference of the function values.

Theorem 3 (Smooth and strongly convex case with a decaying learning rate). Let $f^{(r)}(x)$ be $L$-smooth and $\mu$-strongly convex. Let $QComp_k : \mathbb{R}^d \to \mathbb{R}^d$ be a compression operator whose compression coefficient is equal to $\gamma \in (0,1]$.
Let $\{\hat{x}_t^{(r)}\}_{t=0}^{T-1}$ be generated according to Algorithm 1 with $QComp_k$, for step sizes $\eta_t = \frac{8}{\mu(a+t)}$ with $\mathrm{gap}(I_T) \le H$, where $a > 1$ is such that $a > \max\{\frac{4H}{\gamma}, 32\kappa, H\}$, $\kappa = \frac{L}{\mu}$. Then the following holds:

$$\mathbb{E}[f(x_T)] - f^* \le \frac{La^3}{4S_T}\|x_0 - x^*\|_2^2 + \frac{8LT(T+2a)}{\mu^2 S_T}A + \frac{128LT}{\mu^3 S_T}B.$$

Here (i) $A = \frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}$ and $B = 4\left(\left(\frac{3\mu}{2} + 3L\right)\frac{CG^2H^2}{\gamma^2} + 3L^2G^2H^2\right)$, where $C \ge \frac{4a\gamma(1-\gamma^2)}{a\gamma - 4H}$; (ii) $x_T := \frac{1}{S_T}\sum_{t=0}^{T-1}\left[w_t\left(\frac{1}{R}\sum_{r=1}^{R}\hat{x}_t^{(r)}\right)\right]$, where $w_t = (a+t)^2$; and (iii) $S_T = \sum_{t=0}^{T-1} w_t \ge \frac{T^3}{3}$.

Corollary 3. For $a > \max\{\frac{4H}{\gamma}, 32\kappa, H\}$, $\sigma_{\max} = \max_{r \in [R]}\sigma_r$, and using $\mathbb{E}\|x_0 - x^*\|_2^2 \le \frac{4G^2}{\mu^2}$ from Lemma 2 in [RSS12], we have

$$\mathbb{E}[f(x_T)] - f^* \le O\left(\frac{G^2H^3}{\mu^2\gamma^3 T^3}\right) + O\left(\frac{\sigma_{\max}^2}{\mu^2 bRT} + \frac{H\sigma_{\max}^2}{\mu^2 bR\gamma T^2}\right) + O\left(\frac{G^2H^2}{\mu^3\gamma^2 T^2}\right).$$

In order to ensure that the compression does not affect the dominating terms while converging at a rate of $O(1/(bRT))$, we would require $H = O\left(\gamma\sqrt{T/(bR)}\right)$.

Theorem 3 is proved in Appendix B.8. For no compression and only local computations, i.e., for $\gamma = 1$, and under the same assumptions, we recover/generalize a few recent results from the literature with similar convergence rates:

1. We recover [YYZ19, Theorem 1], which does local SGD for the non-convex case;

2. We generalize [Sti19, Theorem 2.2], which does local SGD for a strongly convex case and requires the unbiasedness assumption of gradients,³ to the distributed case.

We emphasize that unlike [YYZ19, Sti19], which only consider local computation, we combine quantization and sparsification with local computation, which poses several technical challenges; see, e.g., the proofs of Lemmas 4, 5, and 6.
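The lower bound $S_T \ge T^3/3$ on the weight sum used in Theorem 3 is easy to verify numerically for $a > 1$ (a small illustrative check, not part of the paper's proofs):

```python
def weight_sum(a, T):
    """S_T = sum_{t=0}^{T-1} w_t with w_t = (a + t)^2, as in Theorem 3."""
    return sum((a + t) ** 2 for t in range(T))
```

Indeed, for $a \ge 1$ we have $S_T \ge \sum_{s=1}^{T} s^2 = \frac{T(T+1)(2T+1)}{6} \ge \frac{T^3}{3}$.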
3.4 Proof Outlines

In order to prove our results, we define virtual sequences for every worker $r \in [R]$ and for all $t \ge 0$ as follows:

$$\tilde{x}_0^{(r)} := \hat{x}_0^{(r)} \quad \text{and} \quad \tilde{x}_{t+1}^{(r)} := \tilde{x}_t^{(r)} - \eta_t \nabla f_{i_t^{(r)}}\left(\hat{x}_t^{(r)}\right). \qquad (4)$$

Here $\eta_t$ can be taken to be decaying or fixed, depending on the result we are proving. Let $i_t$ be the set of mini-batches sampled at the workers, $\{i_t^{(1)}, i_t^{(2)}, \ldots, i_t^{(R)}\}$. We define

1. $p_t := \frac{1}{R}\sum_{r=1}^{R}\nabla f_{i_t^{(r)}}(\hat{x}_t^{(r)})$ and $\bar{p}_t := \mathbb{E}_{i_t}[p_t] = \frac{1}{R}\sum_{r=1}^{R}\nabla f^{(r)}(\hat{x}_t^{(r)})$;

2. $\tilde{x}_{t+1} := \frac{1}{R}\sum_{r=1}^{R}\tilde{x}_{t+1}^{(r)} = \tilde{x}_t - \eta_t p_t$ and $\hat{x}_t := \frac{1}{R}\sum_{r=1}^{R}\hat{x}_t^{(r)}$.

³The unbiasedness of gradients at every worker can be ensured by assuming that each worker samples data points from the entire dataset.

3.4.1 Proof Outline of Theorem 1

Proof. Since $f$ is $L$-smooth, we have from (4) (with fixed learning rate $\eta_t = \eta$) that

$$f(\tilde{x}_{t+1}) - f(\tilde{x}_t) \le -\eta\langle\nabla f(\tilde{x}_t), p_t\rangle + \frac{\eta^2 L}{2}\|p_t\|_2^2. \qquad (5)$$

With some algebraic manipulations provided in Appendix B.6, for $\eta \le \frac{1}{2L}$, we arrive at

$$\frac{\eta}{4R}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\hat{x}_t^{(r)})\|_2^2 \le \left(\mathbb{E}[f(\tilde{x}_t)] - \mathbb{E}[f(\tilde{x}_{t+1})]\right) + \eta^2 L\,\mathbb{E}\|p_t - \bar{p}_t\|_2^2 + 2\eta L^2\,\mathbb{E}\|\tilde{x}_t - \hat{x}_t\|_2^2 + 2\eta L^2\,\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t - \hat{x}_t^{(r)}\|_2^2. \qquad (6)$$

Under Assumption 2, stated in Section 3.1, we have

$$\mathbb{E}\|p_t - \bar{p}_t\|_2^2 \le \frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}. \qquad (7)$$

To bound $\mathbb{E}\|\tilde{x}_t - \hat{x}_t\|_2^2$ on the RHS of (6), we first show below that $\hat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}m_t^{(r)}$, i.e., the difference of the true and the virtual sequence is equal to the average memory; then we can use the bound on the local memory terms from Lemma 5.

Lemma 6 (Memory). Let $\tilde{x}_t^{(r)}, m_t^{(r)}$, $r \in [R]$, $t \ge 0$ be generated according to Algorithm 1, with $\tilde{x}_t^{(r)}$ as defined in (4). Let $\tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}\tilde{x}_t^{(r)}$ and $\hat{x}_t = \frac{1}{R}\sum_{r=1}^{R}\hat{x}_t^{(r)}$.
Then we have

$$\hat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}m_t^{(r)},$$

i.e., the difference of the true and the virtual sequence is equal to the average memory.

A proof of Lemma 6 is provided in Appendix B.3. Since $\mathbb{E}\|\tilde{x}_t - \hat{x}_t\|_2^2 \le \frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|m_t^{(r)}\|_2^2$, by using Lemma 5 to bound the local memory terms $\mathbb{E}\|m_t^{(r)}\|_2^2$, we get

$$\mathbb{E}\|\tilde{x}_t - \hat{x}_t\|_2^2 \le \frac{4\eta^2(1-\gamma^2)}{\gamma^2}H^2G^2. \qquad (8)$$

The last term on the RHS of (6) captures the deviation of the local sequences $\hat{x}_t^{(r)}$ from the global sequence $\hat{x}_t$, which can be bounded as shown in Lemma 7. The details are provided in Appendix B.4.

Lemma 7 (Bounded deviation of local sequences). Let $\mathrm{gap}(I_T) \le H$. For $\hat{x}_t^{(r)}$ generated according to Algorithm 1 with a fixed learning rate $\eta$, and letting $\hat{x}_t = \frac{1}{R}\sum_{r=1}^{R}\hat{x}_t^{(r)}$, we have the following bound on the deviation of the local sequences:

$$\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t - \hat{x}_t^{(r)}\|_2^2 \le \eta^2G^2H^2. \qquad (9)$$

Substituting the bounds from (7)-(9) into (6) yields

$$\frac{\eta}{4R}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\hat{x}_t^{(r)})\|_2^2 \le \mathbb{E}[f(\tilde{x}_t)] - \mathbb{E}[f(\tilde{x}_{t+1})] + \frac{\eta^2 L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 + \frac{8\eta^3(1-\gamma^2)}{\gamma^2}L^2G^2H^2 + 2\eta^3L^2G^2H^2. \qquad (10)$$

Performing a telescopic sum from $t = 0$ to $T-1$ and dividing by $\frac{\eta T}{4}$ gives

$$\frac{1}{RT}\sum_{t=0}^{T-1}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\hat{x}_t^{(r)})\|_2^2 \le \frac{4(\mathbb{E}[f(\tilde{x}_0)] - f^*)}{\eta T} + \frac{4\eta L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 + \frac{32\eta^2(1-\gamma^2)}{\gamma^2}L^2G^2H^2 + 8\eta^2L^2G^2H^2. \qquad (11)$$

By letting $\eta = \hat{C}/\sqrt{T}$, where $\hat{C}$ is a constant such that $\frac{\hat{C}}{\sqrt{T}} \le \frac{1}{2L}$, we arrive at the bound stated in Theorem 1.

3.4.2 Proof Outline of Theorem 2

Proof. Observe that (6) holds irrespective of the learning rate schedule, as long as the learning rate is at most $\frac{1}{2L}$; see Appendix B.7 for details. Here $\eta_t \le \frac{1}{2L}$ follows from our assumption that $a \ge 2\xi L$.
Substituting a decaying learning rate $\eta_t$ (such that $\eta_t \le \frac{1}{2L}$ holds for every $t \ge 0$) in (6) gives

$$\frac{\eta_t}{4R}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\hat{x}_t^{(r)})\|_2^2 \le \left(\mathbb{E}[f(\tilde{x}_t)] - \mathbb{E}[f(\tilde{x}_{t+1})]\right) + \eta_t^2 L\,\mathbb{E}\|p_t - \bar{p}_t\|_2^2 + 2\eta_t L^2\,\mathbb{E}\|\tilde{x}_t - \hat{x}_t\|_2^2 + 2\eta_t L^2\,\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t - \hat{x}_t^{(r)}\|_2^2. \qquad (12)$$

We have already bounded $\mathbb{E}\|p_t - \bar{p}_t\|_2^2 \le \frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}$ in (7). Note that Lemma 6 holds irrespective of the learning rate schedule, and together with Lemma 4, we can show that

$$\mathbb{E}\|\hat{x}_t - \tilde{x}_t\|^2 \le \frac{4C\eta_t^2}{\gamma^2}G^2H^2. \qquad (13)$$

The last term on the RHS of (12) is the deviation of the local sequences, and we bound it in Lemma 8 for decaying learning rates. The details are provided in Appendix B.5.

Lemma 8 (Contracting deviation of local sequences). Let $\mathrm{gap}(I_T) \le H$. By running Algorithm 1 with a decaying learning rate $\eta_t$, we have

$$\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t - \hat{x}_t^{(r)}\|_2^2 \le 4\eta_t^2G^2H^2. \qquad (14)$$

Observe that for the case of a fixed learning rate, we could trivially bound $\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t - \hat{x}_t^{(r)}\|_2^2$ by simply putting $\eta_t = \eta$ in (14). However, in (9), we get a slightly better bound (without the factor of 4) by working directly with a fixed learning rate. Using these bounds in (12) gives

$$\frac{\eta_t}{4R}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\hat{x}_t^{(r)})\|^2 \le \mathbb{E}[f(\tilde{x}_t)] - \mathbb{E}[f(\tilde{x}_{t+1})] + \frac{\eta_t^2 L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 + \frac{8\eta_t^3 C}{\gamma^2}L^2G^2H^2 + 8\eta_t^3L^2G^2H^2.$$

Let $\delta_t := \frac{\eta_t}{4R}$ and $P_T := \sum_{t=0}^{T-1}\sum_{r=1}^{R}\delta_t$. Performing a telescopic sum from $t = 0$ to $T-1$ and dividing by $P_T$ gives

$$\frac{1}{P_T}\sum_{t=0}^{T-1}\sum_{r=1}^{R}\delta_t\,\mathbb{E}\|\nabla f(\hat{x}_t^{(r)})\|^2 \le \frac{\mathbb{E}f(x_0) - f^*}{P_T} + \frac{L\xi^2}{bR^2(a-1)}\,\frac{\sum_{r=1}^{R}\sigma_r^2}{P_T} + \left(\frac{8C}{\gamma^2} + 8\right)\frac{L^2G^2H^2\xi^3}{2P_T(a-1)^2}. \qquad (15)$$

In (15), we used the following bounds, which are shown in Appendix B.7: $P_T \ge \frac{\xi}{4}\ln\left(\frac{T+a-1}{a}\right)$, $\sum_{t=0}^{T-1}\eta_t^2 \le \frac{\xi^2}{a-1}$, and $\sum_{t=0}^{T-1}\eta_t^3 \le \frac{\xi^3}{2(a-1)^2}$.
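The three step-size sums used in (15) can be checked numerically for $\eta_t = \xi/(a+t)$ (a toy verification of the stated inequalities, with arbitrary illustrative values of $\xi$ and $a$):

```python
import math

def step_size_sums(xi, a, T):
    """For eta_t = xi/(a+t), return (P_T, sum eta_t^2, sum eta_t^3), where
    P_T = sum_{t=0}^{T-1} eta_t / 4 (the inner sum over r cancels the 1/R
    in delta_t = eta_t/(4R))."""
    etas = [xi / (a + t) for t in range(T)]
    P_T = sum(etas) / 4.0
    return P_T, sum(e ** 2 for e in etas), sum(e ** 3 for e in etas)
```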
This completes the proof of Theorem 2.

3.4.3 Proof Outline of Theorem 3

Proof. Using the definition of the virtual sequences in (4), we have

$$\|\tilde{x}_{t+1} - x^*\|_2^2 = \|\tilde{x}_t - x^* - \eta_t\bar{p}_t\|_2^2 + \eta_t^2\|p_t - \bar{p}_t\|_2^2 - 2\eta_t\langle\tilde{x}_t - x^* - \eta_t\bar{p}_t,\; p_t - \bar{p}_t\rangle. \qquad (16)$$

Note that $\eta_t \le \frac{1}{4L}$, which follows from the assumption that $a > \frac{32L}{\mu}$. Now, using $\mu$-strong convexity and $L$-smoothness of $f$, together with some algebraic manipulations provided in Appendix B.8, and letting $e_t = \mathbb{E}[f(\hat{x}_t)] - f^*$, we arrive at

$$\mathbb{E}\|\tilde{x}_{t+1} - x^*\|_2^2 \le \left(1 - \frac{\mu\eta_t}{2}\right)\mathbb{E}\|\tilde{x}_t - x^*\|_2^2 - \frac{\mu\eta_t}{2L}e_t + \eta_t\left(\frac{3\mu}{2} + 3L\right)\mathbb{E}\|\hat{x}_t - \tilde{x}_t\|_2^2 + \frac{3\eta_tL}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t - \hat{x}_t^{(r)}\|_2^2 + \eta_t^2\,\frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}. \qquad (17)$$

Note that the bounds in (13) and (14) hold irrespective of whether the function is convex or not, so we can use them here as well in (17), which gives

$$\mathbb{E}\|\tilde{x}_{t+1} - x^*\|_2^2 \le \left(1 - \frac{\mu\eta_t}{2}\right)\mathbb{E}\|\tilde{x}_t - x^*\|_2^2 - \frac{\mu\eta_t}{2L}e_t + \eta_t\left(\frac{3\mu}{2} + 3L\right)\frac{4C\eta_t^2}{\gamma^2}G^2H^2 + (3\eta_tL)\,4\eta_t^2LG^2H^2 + \eta_t^2\,\frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}. \qquad (18)$$

Employing a slightly modified version of [SCJ18, Lemma 3.3] with $a_t = \mathbb{E}\|\tilde{x}_t - x^*\|_2^2$, $A = \frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}$, and $B = 4\left(\left(\frac{3\mu}{2} + 3L\right)\frac{CG^2H^2}{\gamma^2} + 3L^2G^2H^2\right)$, we have

$$a_{t+1} \le \left(1 - \frac{\mu\eta_t}{2}\right)a_t - \frac{\mu\eta_t}{2L}e_t + \eta_t^2A + \eta_t^3B. \qquad (19)$$

For $\eta_t = \frac{8}{\mu(a+t)}$, $w_t = (a+t)^2$, and $S_T = \sum_{t=0}^{T-1}w_t \ge \frac{T^3}{3}$, we have

$$\frac{\mu}{2LS_T}\sum_{t=0}^{T-1}w_te_t \le \frac{\mu a^3}{8S_T}a_0 + \frac{4T(T+2a)}{\mu S_T}A + \frac{64T}{\mu^2S_T}B. \qquad (20)$$

From convexity, we can finally write

$$\mathbb{E}f(x_T) - f^* \le \frac{La^3}{4S_T}a_0 + \frac{8LT(T+2a)}{\mu^2S_T}A + \frac{128LT}{\mu^3S_T}B, \qquad (21)$$

where $x_T := \frac{1}{S_T}\sum_{t=0}^{T-1}\left[w_t\left(\frac{1}{R}\sum_{r=1}^{R}\hat{x}_t^{(r)}\right)\right] = \frac{1}{S_T}\sum_{t=0}^{T-1}w_t\hat{x}_t$. This completes the proof of Theorem 3.
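As a sanity check on this machinery, the identity of Lemma 6 (the gap between the averaged true and virtual iterates equals the average memory) can be verified numerically in a toy synchronous run, here with synchronization at every step (H = 1), quadratic worker objectives, and a $\mathrm{Top}_k$ compressor; all parameter values are illustrative:

```python
import numpy as np

def max_identity_gap(T=50, R=3, d=6, k=2, eta=0.1, seed=1):
    """Run a synchronous error-compensated loop and return the largest observed
    violation of  mean(x_hat) - mean(x_tilde) = mean(memory)  over T steps."""
    rng = np.random.default_rng(seed)
    targets = rng.normal(size=(R, d))      # worker r's gradient at x is x - targets[r]
    xh = np.zeros((R, d))                  # true (local) iterates, synced every step
    xt = np.zeros((R, d))                  # virtual iterates, as in (4)
    m = np.zeros((R, d))                   # error memories
    xbar = np.zeros(d)                     # master parameter
    worst = 0.0
    for _ in range(T):
        sum_g = np.zeros(d)
        for r in range(R):
            grad = xh[r] - targets[r]
            xt[r] = xt[r] - eta * grad     # virtual sequence takes the full step
            u = m[r] + eta * grad          # memory-corrected local progress
            g = np.zeros(d)
            idx = np.argsort(np.abs(u))[-k:]
            g[idx] = u[idx]                # compressed update sent to the master
            m[r] = u - g
            sum_g += g
        xbar = xbar - sum_g / R
        xh[:] = xbar                       # every worker resyncs (H = 1)
        gap = np.linalg.norm(xh.mean(axis=0) - xt.mean(axis=0) - m.mean(axis=0))
        worst = max(worst, float(gap))
    return worst
```

Both the identity gap and the average memory follow the same recursion from the same zero initialization, so the returned value stays at floating-point noise.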
4 Distributed Asynchronous Operation

We propose and analyze a particular form of asynchronous operation in which the workers synchronize with the master at arbitrary times, decided locally or by the master picking a subset of nodes as in federated learning [Kon17, MMR+17]. However, the local iterates evolve at the same rate, i.e., each worker takes the same number of steps per unit time according to a global clock. The asynchrony is therefore that updates occur after different numbers of local iterations, while the local iterations themselves are in synchrony with respect to the global clock. This is different from asynchronous algorithms studied for stragglers [WYL+18, RRWN11], where only one gradient step is taken but occurs at different times due to delays. In this asynchronous setting, the $I_T^{(r)}$'s may be different for different workers. However, we assume that $\mathrm{gap}(I_T^{(r)}) \le H$ holds for every $r \in [R]$, which means that there is a uniform bound on the maximum delay in each worker's update times. The algorithmic difference from Algorithm 1 is that here a subset of workers (possibly a single worker) can send their updates to the master at their synchronization time steps; the master aggregates them, updates the global parameter vector, and sends it only to those workers. Our algorithm is summarized in Algorithm 2.

Algorithm 2 Qsparse-local-SGD with asynchronous updates

1: Initialize $x_0 = \bar{\bar{x}}_0 = x_0^{(r)} = \hat{x}_0^{(r)} = m_0^{(r)} = 0$, $\forall r \in [R]$. Suppose $\eta_t$ follows a certain learning rate schedule.
2: for $t = 0$ to $T-1$ do
3:   On Workers:
4:   for $r = 1$ to $R$ do
5:     $\hat{x}_{t+\frac{1}{2}}^{(r)} \leftarrow \hat{x}_t^{(r)} - \eta_t\nabla f_{i_t^{(r)}}(\hat{x}_t^{(r)})$; $i_t^{(r)}$ is a mini-batch of size $b$ sampled uniformly from $D_r$
6:     if $t+1 \notin I_T^{(r)}$ then
7:       $x_{t+1}^{(r)} \leftarrow x_t^{(r)}$, $m_{t+1}^{(r)} \leftarrow m_t^{(r)}$, and $\hat{x}_{t+1}^{(r)} \leftarrow \hat{x}_{t+\frac{1}{2}}^{(r)}$
8:     else
9:       $g_t^{(r)} \leftarrow QComp_k\left(m_t^{(r)} + x_t^{(r)} - \hat{x}_{t+\frac{1}{2}}^{(r)}\right)$ and send $g_t^{(r)}$ to the master
10:      $m_{t+1}^{(r)} \leftarrow m_t^{(r)} + x_t^{(r)} - \hat{x}_{t+\frac{1}{2}}^{(r)} - g_t^{(r)}$
11:      Receive $\bar{\bar{x}}_{t+1}$ from the master and set $x_{t+1}^{(r)} \leftarrow \bar{\bar{x}}_{t+1}$ and $\hat{x}_{t+1}^{(r)} \leftarrow \bar{\bar{x}}_{t+1}$
12:    end if
13:  end for
14:  At Master:
15:  if $t+1 \notin I_T^{(r)}$ for all $r \in [R]$ then
16:    $\bar{\bar{x}}_{t+1} \leftarrow \bar{\bar{x}}_t$
17:  else
18:    Let $\mathcal{S} \subseteq [R]$ be the set of all workers $r$ such that the master receives $g_t^{(r)}$ from $r$
19:    Compute $\bar{\bar{x}}_{t+1} \leftarrow \bar{\bar{x}}_t - \frac{1}{R}\sum_{r\in\mathcal{S}}g_t^{(r)}$ and broadcast $\bar{\bar{x}}_{t+1}$ to all the workers in $\mathcal{S}$
20:  end if
21: end for

4.1 Main Results

In this section we present our main convergence results with asynchronous updates, obtained by running Algorithm 2 for smooth objectives, both non-convex and strongly convex. Under the same assumptions as in the synchronous setting of Section 3.1, the following theorems hold even if Algorithm 2 is run with an arbitrary compression operator (including our composed operators from Section 2.3) whose compression coefficient is equal to $\gamma$.

Theorem 4 (Smooth (non-convex) case with fixed learning rate). Under the same conditions as in Theorem 1, with $\mathrm{gap}(I_T^{(r)}) \le H$, if $\{\hat{x}_t^{(r)}\}_{t=0}^{T-1}$ is generated according to Algorithm 2, the following holds:

$$\mathbb{E}\|\nabla f(z_T)\|_2^2 \le \left(\frac{\mathbb{E}[f(x_0)] - f^*}{\hat{C}} + \hat{C}L\,\frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}\right)\frac{4}{\sqrt{T}} + 8\left(\frac{12(1-\gamma^2)}{\gamma^2} + (2 + 8C_1H^2)\right)\frac{\hat{C}^2L^2G^2H^2}{T}.$$
Here (i) $C_1 = \left(\frac{8}{\gamma^2} - 6\right)(4 - 2\gamma)$; (ii) $z_T$ is a random variable which samples a previous parameter $\hat{x}_t^{(r)}$ with probability $1/RT$; and (iii) $\hat{C}$ is a constant such that $\frac{\hat{C}}{\sqrt{T}} \le \frac{1}{2L}$.

Corollary 4. Let $\mathbb{E}[f(x_0)] - f^* \le J^2$, where $J < \infty$ is a constant, $\sigma_{\max} = \max_{r\in[R]}\sigma_r$, and $\hat{C}^2 = bR(\mathbb{E}[f(x_0)] - f^*)/\sigma_{\max}^2L$. We can get the simplified expression below:

$$\mathbb{E}\|\nabla f(z_T)\|_2^2 \le O\left(\frac{J\sigma_{\max}}{\sqrt{bRT}}\right) + O\left(\frac{J^2bRG^2}{\sigma_{\max}^2\gamma^2T}(H^2 + H^4)\right).$$

In order to ensure that the compression does not affect the dominating terms while converging at a rate of $O(1/\sqrt{bRT})$, we would require $H = O\left(\sqrt{\gamma}\,T^{1/8}/(bR)^{3/8}\right)$.

Theorem 4 provides non-asymptotic guarantees, where we again observe that the compression comes for "free". The corresponding asymptotic result is given below.

Theorem 5 (Smooth (non-convex) case with decaying learning rate). Under the same conditions as in Theorem 2, with $\mathrm{gap}(I_T^{(r)}) \le H$, if $\{\hat{x}_t^{(r)}\}_{t=0}^{T-1}$ is generated according to Algorithm 2, the following holds:

$$\mathbb{E}\|\nabla f(z_T)\|^2 \le \frac{\mathbb{E}f(x_0) - f^*}{P_T} + \frac{L\xi^2}{(a-1)P_T}\,\frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2} + \left(16 + \frac{24C}{\gamma^2} + 200C_0H^2\right)\frac{\xi^3L^2G^2H^2}{2(a-1)^2P_T}.$$

Here (i) $\delta_t := \frac{\eta_t}{4R}$ and $P_T := \sum_{t=0}^{T-1}\sum_{r=1}^{R}\delta_t$, which is lower bounded as $P_T \ge \frac{\xi}{4}\ln\left(\frac{T+a-1}{a}\right)$; (ii) $C_0 = (4 - 2\gamma)\left(1 + \frac{C}{\gamma^2}\right)$; and (iii) $z_T$ is a random variable which samples a previous parameter $\hat{x}_t^{(r)}$ with probability $\delta_t/P_T$.

Theorem 6 (Smooth and strongly convex case with decaying learning rate). Under the same conditions as in Theorem 3, with $\mathrm{gap}(I_T^{(r)}) \le H$, if $\{\hat{x}_t^{(r)}\}_{t=0}^{T-1}$ is generated according to Algorithm 2, the following holds:
$$\mathbb{E}[f(x_T)] - f^* \le \frac{La^3}{4S_T}\|x_0 - x^*\|_2^2 + \frac{8LT(T+2a)}{\mu^2S_T}A + \frac{128LT}{\mu^3S_T}D.$$

Here (i) $C \ge \frac{4a\gamma(1-\gamma^2)}{a\gamma - 4H}$, $C_1 = 192(4-2\gamma)\left(1 + \frac{C}{\gamma^2}\right)$, and $C_2 = 8(4-2\gamma)\left(1 + \frac{C}{\gamma^2}\right)$; (ii) $A = \frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}$ and $D = \left(\frac{3\mu}{2} + 3L\right)\left(\frac{12CG^2H^2}{\gamma^2} + C_1\eta_t^2H^4G^2\right) + 24(1 + C_2H^2)LG^2H^2$; and (iii) $x_T, S_T$ are as defined in Theorem 3.

Corollary 5. Under the same conditions as in Theorem 3, with $\mathrm{gap}(I_T^{(r)}) \le H$, $a > \max\{\frac{4H}{\gamma}, 32\kappa, H\}$, and $\sigma_{\max} = \max_{r\in[R]}\sigma_r$, if $\{\hat{x}_t^{(r)}\}_{t=0}^{T-1}$ is generated according to Algorithm 2, the following holds:

$$\mathbb{E}[f(x_T)] - f^* \le O\left(\frac{G^2H^3}{\mu^2\gamma^3T^3}\right) + O\left(\frac{\sigma_{\max}^2}{\mu^2bRT} + \frac{H\sigma_{\max}^2}{\mu^2bR\gamma T^2}\right) + O\left(\frac{G^2}{\mu^3\gamma^2T^2}(H^2 + H^4)\right),$$

where $x_T, S_T$ are as defined in Theorem 3. In order to ensure that the compression does not affect the dominating terms while converging at a rate of $O(1/(bRT))$, we would require $H = O\left(\sqrt{\gamma}\,(T/(bR))^{1/4}\right)$.

4.2 Proof Outlines

Our proofs of these results follow the same outlines as the corresponding proofs in the synchronous setting, but some technical details change significantly. The changes arise because, in our asynchronous setting, workers are allowed to update the global parameter vector in between two consecutive synchronization time steps of other workers. Specifically, in the asynchronous setting, we have to bound the deviation of the local sequences $\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t - \hat{x}_t^{(r)}\|_2^2$ and the difference between the virtual and true sequences $\mathbb{E}\|\tilde{x}_t - \hat{x}_t\|_2^2$, both with a fixed learning rate as well as with a decaying learning rate. We show these below in Lemmas 9-10 and Lemmas 11-12.

Lemma 9 (Contracting local sequence deviation). Let $\mathrm{gap}(I_T^{(r)}) \le H$ hold for every $r \in [R]$.
For $\hat{x}_t^{(r)}$ generated according to Algorithm 2 with a decaying learning rate $\eta_t$, and letting $\hat{x}_t = \frac{1}{R}\sum_{r=1}^{R}\hat{x}_t^{(r)}$, we have the following bound on the deviation of the local sequences:

$$\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t - \hat{x}_t^{(r)}\|_2^2 \le 8(1 + C''H^2)\eta_t^2G^2H^2,$$

where $C'' = 8(4-2\gamma)\left(1 + \frac{C}{\gamma^2}\right)$ and $C$ is a constant satisfying $C \ge \frac{4a\gamma(1-\gamma^2)}{a\gamma-4H}$.

Lemma 10 (Bounded local sequence deviation). Let $\mathrm{gap}(I_T^{(r)}) \le H$ hold for every $r \in [R]$. By running Algorithm 2 with a fixed learning rate $\eta$, we have

$$\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t - \hat{x}_t^{(r)}\|_2^2 \le (2 + H^2C')\eta^2G^2H^2,$$

where $C' = \left(\frac{16}{\gamma^2} - 12\right)(4-2\gamma)$.

We prove these two lemmas in Appendix C.1 and Appendix C.2, respectively. Note that the bound in Lemma 9 is $\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t - \hat{x}_t^{(r)}\|_2^2 \le O(\eta_t^2G^2(H^2 + H^4/\gamma^2))$, which is weaker than the corresponding bound $O(\eta_t^2G^2H^2)$ for the synchronous setting in Lemma 8. See Lemma 10 and Lemma 7 for a similar comparison in the case of a fixed learning rate.

Now we bound $\mathbb{E}\|\tilde{x}_t - \hat{x}_t\|_2^2$. Fix a time $t$ and consider any worker $r \in [R]$. Let $t_r \in I_T^{(r)}$ denote the last synchronization step until time $t$ for the $r$'th worker, and define $t_0' := \min_{r\in[R]}t_r$. Note that in the synchronous case we have shown in Lemma 6 that $\hat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}m_t^{(r)}$. This does not hold in the asynchronous setting, which makes upper-bounding $\mathbb{E}\|\hat{x}_t - \tilde{x}_t\|_2^2$ a bit more involved. By definition, $\hat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{x}_t^{(r)} - \tilde{x}_t^{(r)}\right)$. By the definition of the virtual sequences and the update rule for $\hat{x}_t^{(r)}$, we also have $\hat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{x}_{t_r}^{(r)} - \tilde{x}_{t_r}^{(r)}\right)$. This can be written as

$$\hat{x}_t - \tilde{x}_t = \left[\frac{1}{R}\sum_{r=1}^{R}\hat{x}_{t_r}^{(r)} - \bar{\bar{x}}_{t_0'}\right] + \left[\bar{\bar{x}}_{t_0'} - \bar{\bar{x}}_t\right] + \left[\bar{\bar{x}}_t - \frac{1}{R}\sum_{r=1}^{R}\tilde{x}_{t_r}^{(r)}\right].$$
(22)

In (22), the third term on the RHS is equal to the average memory, as shown in (96) in Appendix C.3. Unlike in the synchronous setting, Lemma 6, which states that $\hat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}m_t^{(r)}$, does not hold here. However, we can show that $\hat{x}_t - \tilde{x}_t$ is equal to the sum of $\frac{1}{R}\sum_{r=1}^{R}m_t^{(r)}$ and an additional term, which leads to a potentially weaker bound $\mathbb{E}\|\hat{x}_t - \tilde{x}_t\|_2^2 \le O\left((\eta_t^2/\gamma^2)G^2(H^2 + H^4)\right)$, proved in Lemmas 11-12 in Appendix C.3 and Appendix C.4, in comparison to $O\left((\eta_t^2/\gamma^2)G^2H^2\right)$ for the synchronous setting.

Lemma 11 (Contracting distance between virtual and true sequences). Let $\mathrm{gap}(I_T^{(r)}) \le H$ hold for every $r \in [R]$. If we run Algorithm 2 with a decaying learning rate $\eta_t$, then we have the following bound on the difference between the true and virtual sequences:

$$\mathbb{E}\|\hat{x}_t - \tilde{x}_t\|_2^2 \le C'\eta_t^2H^4G^2 + \frac{12C\eta_t^2}{\gamma^2}G^2H^2,$$

where $C' = 192(4-2\gamma)\left(1 + \frac{C}{\gamma^2}\right)$ and $C$ is a constant satisfying $C \ge \frac{4a\gamma(1-\gamma^2)}{a\gamma-4H}$.

Lemma 12 (Bounded distance between virtual and true sequences). Let $\mathrm{gap}(I_T^{(r)}) \le H$ hold for every $r \in [R]$. If we run Algorithm 2 with a fixed learning rate $\eta$, we have

$$\mathbb{E}\|\hat{x}_t - \tilde{x}_t\|_2^2 \le 6C'\eta^2H^4G^2 + \frac{12\eta^2(1-\gamma^2)}{\gamma^2}G^2H^2,$$

where $C' = (4-2\gamma)\left(\frac{8}{\gamma^2} - 6\right)$.

Summary of our results. We now give a brief summary of our convergence results in the synchronous and asynchronous settings.

1. In the synchronous setting, Qsparse-local-SGD asymptotically converges as fast as distributed vanilla SGD for $H = O\left(\gamma T^{1/4}/(bR)^{3/4}\right)$ in the smooth non-convex case and for $H = O\left(\gamma\sqrt{T/(bR)}\right)$ in the strongly convex case.

2.
In the asynchronous setting, Qsparse-local-SGD asymptotically converges as fast as distributed vanilla SGD for $H = O\left(\sqrt{\gamma}\,T^{1/8}/(bR)^{3/8}\right)$ in the smooth non-convex case and for $H = O\left(\sqrt{\gamma}\,(T/(bR))^{1/4}\right)$ in the strongly convex case.

Therefore, our algorithm provides a lot of flexibility in terms of different ways of mitigating the communication bottleneck, for example, by increasing the batch size on each node, or by increasing the maximum synchronization period $H$ up to the allowable limits. Furthermore, one could also choose different values of $k$ for the $\mathrm{Top}_k$ sparsifier, as well as adjust the configuration of the quantizer. We present numerics in Section 5 demonstrating significant savings in the number of bits exchanged over the state-of-the-art.

5 Experimental Results

In this section we give extensive experimental results validating our theoretical findings.

5.1 Non-Convex Objective

5.1.1 Experiment Setup

We train ResNet-50 [HZRS16] (which has $d = 25{,}610{,}216$ parameters) on the ImageNet dataset, using 8 NVIDIA Tesla V100 GPUs. We use a learning rate schedule consisting of 5 epochs of linear warmup, followed by a piecewise decay of 0.1 at epochs 30, 60, and 80, with a batch size of 256 per GPU. For the purpose of the experiments, we focus on SGD with momentum of 0.9, applied on the local iterations of the workers. We build our compression scheme into the Horovod framework [SB18]. We use $SignTop_k$ as defined in Lemma 3 and $QTop_k$ as defined in Lemma 1 (which has an operating regime $\beta_{k,s} < 1$), where $Q$ is from [AGL+17], as our composed operators.⁴ In $\mathrm{Top}_k$, we only update $k_t = \min(d_t, 1000)$ elements per step for each tensor $t$, where $d_t$ is the number of elements in the tensor. For the ResNet-50 architecture, this amounts to updating a total of $k = 99{,}400$ elements per step.
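The per-tensor sparsification just described can be sketched together with the sign-compression step of $SignTop_k$ (a simplified illustration: we scale the transmitted signs by the mean magnitude of the kept entries, one natural choice; the paper's exact operator and scaling are specified in Lemma 3, which we do not reproduce here):

```python
import numpy as np

def sign_top_k(tensor, k_cap=1000):
    """For one tensor, keep k_t = min(d_t, k_cap) largest-magnitude entries,
    then send only their signs times the mean kept magnitude; return the
    reconstructed dense tensor and k_t."""
    v = tensor.ravel()
    k_t = min(v.size, k_cap)
    idx = np.argsort(np.abs(v))[-k_t:]
    scale = np.abs(v[idx]).mean()          # 1 float + k_t signs + indices on the wire
    out = np.zeros_like(v)
    out[idx] = np.sign(v[idx]) * scale
    return out.reshape(tensor.shape), k_t
```

Applying such a per-tensor cap over all tensors of a model is what yields the aggregate sparsity level (e.g., the total of $k = 99{,}400$ elements per step quoted above for ResNet-50).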
Figure 1: (a) Training loss vs. epochs; (b) training loss vs. $\log_2$ of the communication budget; (c) top-1 accuracy [LHS15] for the schemes in Figure 1a; (d) top-5 accuracy [LHS15] for the schemes in Figure 1a. Figures 1a-1d demonstrate the performance gains achieved by our Qsparse operators in the non-convex setting.

5.1.2 Results

From Figure 1a, we observe that quantization and sparsification, both individually and combined, have almost no penalty in terms of convergence rate with respect to vanilla SGD when error compensation is enabled through accumulating errors. We observe that both $QTop_k$-SGD, which employs a 4-bit quantizer and the $\mathrm{Top}_k$ sparsifier, and $SignTop_k$-SGD, which employs the 1-bit sign quantizer and the $\mathrm{Top}_k$ sparsifier, demonstrate superior performance over the other schemes, both in terms of the number of communicated bits required to achieve a certain target loss and in terms of test accuracy. This is because, in $QTop_k$, the $Q$ operator from [AGL+17] further induces sparsity, which results in fewer than $k$ coordinates being transmitted, and in $SignTop_k$, we send only 1 bit for each $\mathrm{Top}_k$ coordinate.

⁴Even though the "scaled" $QTop_k$ from Lemma 2 (with a scaling factor of $(1+\beta_{k,s})$) works for all values of $\beta_{k,s}$, and also does better than the "unscaled" $QTop_k$ from Lemma 1 even when $\beta_{k,s} < 1$ (see Remark 2), we report our experimental results in the non-convex setting only with the unscaled $QTop_k$. We give some plots comparing both operators in Appendix D and observe that our algorithm with the unscaled operator performs at least as well as with the scaled operator. We attribute this to the fact that scaling the composed operator is a sufficient condition for obtaining better convergence results, which does not necessarily mean that it also does better in practice.
Figure 2: (a), (c) Training loss vs. epochs; (b), (d) training loss vs. $\log_2$ of the communication budget. Figures 2a-2d demonstrate the effect of incorporating local iterations and compare these effects across vanilla SGD, the $\mathrm{Top}_k$ sparsifier, and its composition with the Sign operator; similar comparisons are made between vanilla SGD, the quantizer QSGD with error accumulation, and its composition with the $\mathrm{Top}_k$ sparsifier. $SignTopK\_hL$, for $h = 1, 4, 8$, corresponds to running Algorithm 1 with $SignTopK$ as the composed operator and a synchronization period of at most $h$.

In Figures 2a-2d, we show how the performance of the different methods (used in Figures 1a-1d) changes when we incorporate local iterations on top of them. Observe that the incorporation of local iterations in Figures 2a and 2c has very little impact on the convergence rates, as compared to vanilla SGD with the corresponding number of local iterations. Furthermore, this provides an added advantage over the Qsparse operator alone, in terms of savings in communicated bits for achieving a target loss, as seen in Figures 2b and 2d, by a factor of 6 to 8 times on average.

Figures 3b, 3c, and 3d show the training loss, top-1, and top-5 convergence rates,⁵ respectively, with respect to the total number of bits of communication used. We observe that Qsparse-local-SGD combines the bit savings of either the deterministic sign-based

⁵Here top-$i$ refers to the accuracy of the top $i$ predictions made by the model from the list of possible classes; see [LHS15].
operator or the stochastic quantizer (QSGD) with an aggressive sparsifier and infrequent communication, thereby outperforming the cases where these techniques are used individually. In particular, the number of bits required to achieve the same loss or top-1 accuracy with Qsparse-local-SGD is around 1/16 of that of $\mathrm{Top}_k$-SGD and over $1000\times$ less than that of vanilla SGD. This also verifies that error compensation through memory can be used to mitigate not only the missing components of updates from previous synchronization rounds, but also explicit quantization error.

Figure 3: (a) Training loss vs. epochs; (b) training loss vs. $\log_2$ of the communication budget; (c) top-1 accuracy [LHS15] for the schemes in Figure 3a; (d) top-5 accuracy [LHS15] for the schemes in Figure 3a. Figures 3a-3d demonstrate the performance of our scheme in comparison with ef-signSGD [KRSJ19], TopK-SGD [SCJ18, AHJ+18], and local SGD [Sti19, YYZ19] in the non-convex setting.

5.2 Convex Objective

The experiments in Figures 4-6 are in a synchronous distributed setting with 15 worker nodes, each processing a mini-batch of 8 samples per iteration, using the MNIST [LBBH98] handwritten digits dataset. The corresponding experiments for the asynchronous operation (as in Algorithm 2) are shown in Figure 7.

Figure 4: (a) Training loss vs. epochs; (b) training loss vs. $\log_2$ of the communication budget; (c) top-1 accuracy [LHS15] for the schemes in Figure 4a. Figures 4a-4c demonstrate the performance gains achieved by our Qsparse operators in the convex setting.

5.2.1 Model Architecture

Define the softmax function as

$$h_{x,z}\left(a^{(i)}\right) = \frac{\exp\left(x_j^Ta^{(i)} + z^{(j)}\right)}{\sum_{l=1}^{L}\exp\left(x_l^Ta^{(i)} + z^{(l)}\right)}.$$

Our experiments are all for softmax regression with a standard $\ell_2$ regularizer.
The cost function is

$$-\frac{1}{n}\left[\sum_{i=1}^{n}\sum_{j=1}^{L}\mathbb{1}\{b^{(i)} = j\}\log h_{x,z}\left(a^{(i)}\right)\right] + \frac{\lambda}{2}\|x\|^2,$$

where $a^{(i)} \in \mathbb{R}^d$, $b^{(i)} \in [L]$ are the data points, which can belong to one of the $L$ classes, and $x_j \in \mathbb{R}^d$ for every $j \in [L]$ are the columns of the parameter matrix, structured as follows:

$$x = \begin{bmatrix} x_1 & x_2 & \ldots & x_L \end{bmatrix}, \quad x_j \in \mathbb{R}^d, \; \forall j \in [L],$$

and $z^{(i)}$ for every $i \in [L]$ are the biases to be learnt, corresponding to every class. We set $\lambda$ to be $1/n$.

5.2.2 Parameter Selection and Learning Rates

We use the deterministic operator from Lemma 3 and the stochastic operator QSGD, denoted by $Q$ and defined in [AGL+17], as our quantizers, and $\mathrm{Top}_k$ with error compensation as the sparsifier. The schemes with which we compare our composed operators $QTop_k$ (Lemma 2) and $SignTop_k$ (Lemma 3) are ef-QSGD [WHHZ18], ef-signSGD [KRSJ19], TopK-SGD [SCJ18, AHJ+18], and local SGD [Sti19]. The learning rate used for training is of the form $\frac{c}{\lambda(a+t)}$, where (i) $\lambda$ is the regularization parameter; (ii) $c$ is set with a careful hyperparameter sweep; (iii) $w_t = (a+t)^2$ as in Theorem 3, where $a$ is set as $\frac{dH}{k}$, with $d$ being the dimension of the gradient vector (7850 for MNIST); (iv) $k = 40$ is the sparsity; (v) $H$ is the synchronization period; (vi) $t$ is the iteration index; (vii) $b = 8$ is the batch size; and (viii) $R = 15$ is the number of workers.

5.2.3 Results

In Figure 4a, we observe that the composition of a quantizer with a sparsifier has very little effect on the rate of convergence as compared to when the techniques are used individually. Observe that the algorithm run with the 2-bit QSGD is slower than with the 4-bit quantizer, both with and without sparsification, which can be attributed to the reduction in the compression coefficient $\gamma$ in going from 4 to 2 bits; see Theorem 3.
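For reference, the regularized cost above can be written in a few lines (a sketch; the variable names `X`, `y`, `W`, `bias` are ours, and, following the formula, the $\ell_2$ penalty is applied to the weights only, not the biases):

```python
import numpy as np

def softmax_regression_loss(X, y, W, bias, lam):
    """Average cross-entropy of the softmax model plus (lam/2)*||W||^2.
    X: (n, d) features, y: (n,) labels in {0, ..., L-1},
    W: (d, L) weights (the columns x_j), bias: (L,)."""
    logits = X @ W + bias
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = X.shape[0]
    nll = -log_probs[np.arange(n), y].mean()              # -(1/n) sum_i log h(a^(i))
    return nll + 0.5 * lam * np.sum(W ** 2)
```

With all parameters at zero, every class receives probability $1/L$, so the loss equals $\log L$, a handy sanity check.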
From Figures 4b and 4c, we see that our composed operators achieve gains in communicated bits by a factor of 6-8x over the state of the art. Figure 5a demonstrates the effect of incorporating local iterations together with Qsparse operators: the rate of convergence is not significantly affected as we go from 1 to 8 local iterations. Furthermore, observe that for a fixed number of local iterations, the Qsparse operator maintains the same rates as vanilla SGD or TopK-SGD. In doing so, it is able to achieve gains in communicated bits, as seen in Figure 5b, simply by communicating infrequently with the master. On comparing Figures 5c and 5e, we observe that the $QTop_k$ operator is more sensitive to the increase in local computations for coarser quantizers (smaller values of $s$, in this case $s = 2^{\#\text{bits}} - 1 = 3$). This can be verified from Figure 5e, which uses a 4-bit quantizer (implying $s = 15$ instead), where the effect of local iterations on the convergence rate is less prominent. We make comparisons between vanilla SGD, QSGD with error accumulation (ef-QSGD), and our $QTop_k$ operator in Figures 5c and 5d, for which we do not observe much difference in performance between a finer and a coarser quantizer, even though the convergence rates with respect to iterations are affected. This can be attributed to the precision of the quantizer itself.

In Figures 6a and 6b, we compare the convergence of our proposed scheme in Algorithm 1, with $QTop_k$ and $SignTop_k$ as the composed operators, against vanilla SGD (32-bit floating point), ef-QSGD, ef-signSGD [KRSJ19], and TopK-SGD [SCJ18, AHJ+18]. Both figures follow a similar trend: we observe $QTop_k$, $SignTop_k$, and TopK-SGD converging at the same rate as vanilla SGD, which is similar to the observations in [SCJ18].
This implies that composing quantization with sparsification does not affect convergence while achieving improved communication efficiency, as can be seen in Figures 6c and 6b. Figure 6c shows that for a test error of approximately 0.1, Qsparse-local-SGD combines the benefits of the composed operator $SignTop_k$ or $QTop_k$ with local computations, and needs 10-15x fewer bits than TopK-SGD and 1000x fewer bits than vanilla SGD.

We observe similar trends in Figures 7a-7b for our asynchronous operation, where workers synchronize with the master at arbitrary time intervals as per Algorithm 2. Specifically, in our experiments, for each $r \in [R]$, the time interval for the $r$-th worker is chosen uniformly at random from $[H]$ after every synchronization by that worker. This ensures that $gap(\mathcal{I}_T^{(r)}) \leq H$ holds for every worker $r \in [R]$, and that the schedule $\mathcal{I}_T^{(r)}$ is different for each of them.

Figure 5: (a), (c), (e) training loss vs. epochs; (b), (d), (f) training loss vs. $\log_2$ of the communication budget. Figures 5a-5b demonstrate the effect of incorporating local iterations, comparing vanilla SGD, the sparsifier TopK, and its composition with the Sign operator. Similar comparisons are also made between vanilla SGD, the quantizer QSGD with error accumulation, and its composition with the TopK sparsifier.

Figure 6: (a) training loss vs. epochs; (b) training loss vs. $\log_2$ of the communication budget; (c) top-1 accuracy [LHS15] for the schemes in Figure 6a. Figures 6a-6c demonstrate the performance of our scheme in comparison with ef-QSGD, ef-signSGD [KRSJ19], and TopK-SGD [SCJ18, AHJ+18] in a convex setting with synchronous updates.
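The asynchronous schedules just described are simple to generate: draw each worker's successive synchronization gaps uniformly from $\{1, \ldots, H\}$, so that $gap(\mathcal{I}_T^{(r)}) \leq H$ holds by construction and each worker gets a different schedule. A minimal sketch (the function name is ours):

```python
import random

def make_async_schedules(H, T, R, seed=0):
    """Sample per-worker synchronization times for horizon T: each gap is
    uniform in {1, ..., H}, so consecutive sync indices are at most H apart."""
    rng = random.Random(seed)
    schedules = []
    for _ in range(R):
        times, t = [], 0
        while t < T:
            t += rng.randint(1, H)       # next synchronization after <= H steps
            times.append(min(t, T))      # clip the final index at the horizon T
        schedules.append(times)
    return schedules
```

Because every increment is at most $H$ and the final index is only clipped downward, every schedule satisfies the gap condition required by the analysis, while the shared seed is split across workers so their schedules differ.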
6 Conclusion

In this paper, we propose a gradient compression scheme that composes both unbiased and biased quantization with aggressive sparsification. Furthermore, we incorporate local computations, which, when combined with quantization and explicit sparsification, results in a highly communication-efficient distributed algorithm, which we call Qsparse-local-SGD. We developed convergence analyses of our scheme in both synchronous and asynchronous settings, for both convex and non-convex objectives, and we show that our proposed algorithm achieves the same rate as distributed vanilla SGD in each of these cases. Our schemes provide flexibility in terms of different options for mitigating the communication bottlenecks that arise in training high-dimensional learning models over bandwidth-limited networks. When run without compression, our scheme also subsumes/generalizes several recent results from the literature on local SGD, with similar convergence rates, as mentioned at the end of Section 3.3.

Our numerics incorporate momentum acceleration, whose analysis is a topic for future research (e.g., potentially by incorporating ideas from [YJY19]). Although we use momentum for each local iteration, our preliminary results suggest that our method also works with momentum applied to a block of updates, though this was not the main focus of this paper.

Figure 7: (a) training loss vs. the communication budget for our schemes against baselines; (b) test error using a model trained for a given number of iterations, as in Figure 7a. Figures 7a-7b demonstrate the performance of our scheme in comparison with ef-signSGD [KRSJ19] and TopK-SGD [SCJ18, AHJ+18] in a convex setting with asynchronous operation.

Acknowledgement

The authors gratefully thank Navjot Singh for his help with experiments in the early stages of this work.
This work was partially supported by NSF grant #1514531, by UC-NL grant LFR-18-548554, and by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

A Omitted Details from Section 2

A.1 Proof of Lemma 1

Lemma (Restating Lemma 1). Let $Comp_k \in \{Top_k, Rand_k\}$. Let $Q_s : \mathbb{R}^d \to \mathbb{R}^d$ be a quantizer with parameter $s$ that satisfies Definition 1. Let $Q_s \circ Comp_k : \mathbb{R}^d \to \mathbb{R}^d$ be defined as $Q_s \circ Comp_k(x) := Q_s(Comp_k(x))$ for every $x \in \mathbb{R}^d$. If $k, s$ are such that $\beta_{k,s} < 1$, then $Q_s \circ Comp_k$ is a compression operator with compression coefficient equal to $\gamma = (1-\beta_{k,s})\frac{k}{d}$, i.e., for every $x \in \mathbb{R}^d$, we have
$$\mathbb{E}_{C,Q}\left[\|x - Q_s \circ Comp_k(x)\|_2^2\right] \leq \left(1 - (1-\beta_{k,s})\frac{k}{d}\right)\|x\|_2^2,$$
where the expectation is taken over the randomness of the compression operator $Comp_k$ as well as the quantizer $Q_s$.

Proof. Fix an arbitrary $x \in \mathbb{R}^d$.
$$\begin{aligned}
\mathbb{E}_{C,Q}[\|x - Q_s \circ Comp_k(x)\|_2^2] &= \mathbb{E}_{C,Q}[\|x\|_2^2] + \mathbb{E}_{C,Q}[\|Q_s \circ Comp_k(x)\|_2^2] - 2\,\mathbb{E}_C[\langle x, \mathbb{E}_Q[Q_s \circ Comp_k(x)]\rangle] \\
&= \|x\|_2^2 + \mathbb{E}_{C,Q}[\|Q_s \circ Comp_k(x)\|_2^2] - 2\,\mathbb{E}_C[\langle x, Comp_k(x)\rangle]
\end{aligned}$$
In the last equality, we used that $x$ is constant with respect to the randomness of $Q_s$ and $Comp_k$, and that $\mathbb{E}_Q[Q_s \circ Comp_k(x)] = Comp_k(x)$, which follows from (i) of Definition 1. Observe that, for any $Comp_k \in \{Top_k, Rand_k\}$, we have $\langle x, Comp_k(x)\rangle = \|Comp_k(x)\|_2^2$.
Continuing from above, we get
$$\mathbb{E}_{C,Q}[\|x - Q_s \circ Comp_k(x)\|_2^2] = \|x\|_2^2 - 2\,\mathbb{E}_C[\|Comp_k(x)\|_2^2] + \mathbb{E}_{C,Q}[\|Q_s \circ Comp_k(x)\|_2^2] \quad (23)$$
Observe that for any $Comp_k \in \{Top_k, Rand_k\}$, $Comp_k(x)$ is a length-$d$ vector, but only (at most) $k$ of its components are non-zero. This implies that, by treating $Comp_k(x)$ as a length-$k$ vector whose entries correspond to the $k$ non-zero entries of $x$, we can write $\mathbb{E}_Q[\|Q_s \circ Comp_k(x)\|_2^2] \leq (1+\beta_{k,s})\|Comp_k(x)\|_2^2$; see (ii) of Definition 1. Putting this back in (23), we get
$$\begin{aligned}
\mathbb{E}_{C,Q}[\|x - Q_s \circ Comp_k(x)\|_2^2] &\leq \|x\|_2^2 - \mathbb{E}_C[\|Comp_k(x)\|_2^2] + \beta_{k,s}\,\mathbb{E}_C[\|Comp_k(x)\|_2^2] \\
&= \|x\|_2^2 - (1-\beta_{k,s})\,\mathbb{E}_C[\|Comp_k(x)\|_2^2] \quad (24)
\end{aligned}$$
Using $\mathbb{E}_C[\|Comp_k(x)\|_2^2] \geq \frac{k}{d}\|x\|_2^2$ (see (27) in Lemma 13) in (24) gives
$$\mathbb{E}_{C,Q}[\|x - Q_s \circ Comp_k(x)\|_2^2] \leq \|x\|_2^2 - (1-\beta_{k,s})\frac{k}{d}\|x\|_2^2 = \left(1 - (1-\beta_{k,s})\frac{k}{d}\right)\|x\|_2^2.$$
This completes the proof of Lemma 1.

A.2 Proof of Lemma 2

Lemma (Restating Lemma 2). Let $Comp_k \in \{Top_k, Rand_k\}$. Let $Q_s : \mathbb{R}^d \to \mathbb{R}^d$ be a stochastic quantizer with parameter $s$ that satisfies Definition 1. Let $Q_s \circ Comp_k : \mathbb{R}^d \to \mathbb{R}^d$ be defined as $Q_s \circ Comp_k(x) := Q_s(Comp_k(x))$ for every $x \in \mathbb{R}^d$. Then $\frac{Q_s \circ Comp_k(x)}{1+\beta_{k,s}}$ is a compression operator with compression coefficient equal to $\gamma = \frac{k}{d(1+\beta_{k,s})}$, i.e., for every $x \in \mathbb{R}^d$,
$$\mathbb{E}_{C,Q}\left[\left\|x - \frac{Q_s \circ Comp_k(x)}{1+\beta_{k,s}}\right\|_2^2\right] \leq \left(1 - \frac{k}{d(1+\beta_{k,s})}\right)\|x\|_2^2.$$

Proof. Fix an arbitrary $x \in \mathbb{R}^d$.
$$\begin{aligned}
\mathbb{E}_{C,Q}\left[\left\|x - \frac{Q_s \circ Comp_k(x)}{1+\beta_{k,s}}\right\|_2^2\right]
&= \|x\|_2^2 - 2\,\mathbb{E}_C\left[\left\langle x,\ \mathbb{E}_Q\left[\frac{Q_s \circ Comp_k(x)}{1+\beta_{k,s}}\right]\right\rangle\right] + \mathbb{E}_{C,Q}\left[\frac{\|Q_s \circ Comp_k(x)\|_2^2}{(1+\beta_{k,s})^2}\right] \\
&\stackrel{(a)}{=} \|x\|_2^2 - \frac{2}{1+\beta_{k,s}}\,\mathbb{E}_C[\langle x, Comp_k(x)\rangle] + \frac{1}{(1+\beta_{k,s})^2}\,\mathbb{E}_{C,Q}\left[\|Q_s \circ Comp_k(x)\|_2^2\right] \\
&\stackrel{(b)}{=} \|x\|_2^2 - \frac{2}{1+\beta_{k,s}}\,\mathbb{E}_C\left[\|Comp_k(x)\|_2^2\right] + \frac{1}{(1+\beta_{k,s})^2}\,\mathbb{E}_{C,Q}\left[\|Q_s \circ Comp_k(x)\|_2^2\right] \\
&\stackrel{(c)}{\leq} \|x\|_2^2 - \frac{2}{1+\beta_{k,s}}\,\mathbb{E}_C\left[\|Comp_k(x)\|_2^2\right] + \frac{1}{1+\beta_{k,s}}\,\mathbb{E}_C\left[\|Comp_k(x)\|_2^2\right] \\
&= \|x\|_2^2 - \frac{1}{1+\beta_{k,s}}\,\mathbb{E}_C\left[\|Comp_k(x)\|_2^2\right] \\
&\stackrel{(d)}{\leq} \left(1 - \frac{k}{d(1+\beta_{k,s})}\right)\|x\|_2^2. \quad (25)
\end{aligned}$$
In (a) we used $\mathbb{E}_Q[Q_s \circ Comp_k(x)] = Comp_k(x)$; in (b) we used $\langle x, Comp_k(x)\rangle = \|Comp_k(x)\|_2^2$; in (c) we used $\mathbb{E}_Q[\|Q_s \circ Comp_k(x)\|_2^2] \leq (1+\beta_{k,s})\|Comp_k(x)\|_2^2$; and in (d) we used $\mathbb{E}_C[\|Comp_k(x)\|_2^2] \geq \frac{k}{d}\|x\|_2^2$. This completes the proof of Lemma 2.

A.3 Proof of Lemma 3

Lemma (Restating Lemma 3). For $Comp_k \in \{Top_k, Rand_k\}$, $\frac{\|Comp_k(x)\|_m}{k}\,Sign \circ Comp_k(x)$, for any $m \in \mathbb{Z}^+$, is a compression operator with compression coefficient $\gamma_m$ equal to
$$\gamma_m = \begin{cases} \max\left\{\dfrac{1}{d},\ \dfrac{k}{d}\left(\dfrac{\|Comp_k(x)\|_1}{\sqrt{d}\,\|Comp_k(x)\|_2}\right)^2\right\} & \text{if } m = 1, \\[2ex] \dfrac{k^{\frac{2}{m}-1}}{d} & \text{if } m \geq 2. \end{cases}$$

For proving Lemma 3, we first state and prove Lemma 13 below.

Lemma 13. Let $Comp_k \in \{Top_k, Rand_k\}$. For any $x \in \mathbb{R}^d$, we have
$$\mathbb{E}[\|Comp_k(x)\|_1^2] \geq \max\left\{\frac{k}{d}\|x\|_2^2,\ \frac{k^2}{d^2}\|x\|_1^2\right\} \quad (26)$$
$$\mathbb{E}[\|Comp_k(x)\|_2^2] \geq \frac{k}{d}\|x\|_2^2. \quad (27)$$

Proof. Let $m \in \{1, 2\}$. Observe that for any $x \in \mathbb{R}^d$, we have $\mathbb{E}[\|Top_k(x)\|_m^2] = \|Top_k(x)\|_m^2$ and that $\|Top_k(x)\|_m^2 \geq \mathbb{E}[\|Rand_k(x)\|_m^2]$. So, in order to prove the lemma, it suffices to show that $\mathbb{E}[\|Rand_k(x)\|_m^2] \geq \frac{k}{d}\|x\|_2^2$ holds for any $m \in \{1, 2\}$, and that $\mathbb{E}[\|Rand_k(x)\|_1^2] \geq \frac{k^2}{d^2}\|x\|_1^2$.
Let $\Omega_k$ be the set of all $k$-element subsets of $[d]$.
$$\begin{aligned}
\mathbb{E}[\|Rand_k(x)\|_m^2] &= \sum_{\omega \in \Omega_k}\frac{1}{|\Omega_k|}\left(\sum_{i=1}^{d}|x_i|^m \cdot 1\{i \in \omega\}\right)^{2/m} \\
&\stackrel{(a)}{\geq} \sum_{\omega \in \Omega_k}\frac{1}{|\Omega_k|}\sum_{i=1}^{d}|x_i|^2 \cdot 1\{i \in \omega\} = \sum_{i=1}^{d}x_i^2 \cdot \frac{1}{|\Omega_k|}\sum_{\omega \in \Omega_k}1\{i \in \omega\} \\
&= \sum_{i=1}^{d}x_i^2 \cdot \frac{1}{|\Omega_k|}\binom{d-1}{k-1} = \frac{k}{d}\|x\|_2^2
\end{aligned}$$
Note that (a) holds only for $m \in \{1, 2\}$, and holds with equality for $m = 2$. Now we show that $\mathbb{E}[\|Rand_k(x)\|_1^2] \geq \frac{k^2}{d^2}\|x\|_1^2$:
$$\begin{aligned}
\mathbb{E}[\|Rand_k(x)\|_1^2] \geq \left(\mathbb{E}[\|Rand_k(x)\|_1]\right)^2 &= \left(\sum_{\omega \in \Omega_k}\frac{1}{|\Omega_k|}\sum_{i=1}^{d}|x_i| \cdot 1\{i \in \omega\}\right)^2 \\
&= \left(\sum_{i=1}^{d}|x_i| \cdot \frac{1}{|\Omega_k|}\binom{d-1}{k-1}\right)^2 = \frac{k^2}{d^2}\|x\|_1^2
\end{aligned}$$
This completes the proof of Lemma 13.

Proof of Lemma 3. Fix an arbitrary $x \in \mathbb{R}^d$ and consider the following:
$$\begin{aligned}
\mathbb{E}_C\left\|\frac{\|Comp_k(x)\|_m}{k}\,Sign \circ Comp_k(x) - x\right\|_2^2
&= \mathbb{E}_C\left[\frac{\|Comp_k(x)\|_m^2}{k} - 2\left\langle\frac{\|Comp_k(x)\|_m}{k}\,Sign \circ Comp_k(x),\ x\right\rangle + \|x\|_2^2\right] \\
&= \mathbb{E}_C\left[\frac{\|Comp_k(x)\|_m^2}{k} - 2\,\frac{\|Comp_k(x)\|_m\,\|Comp_k(x)\|_1}{k} + \|x\|_2^2\right] \\
&\leq \|x\|_2^2 - \mathbb{E}_C\frac{\|Comp_k(x)\|_m^2}{k} \quad (28)
\end{aligned}$$
In (28), we used the fact that $\|\cdot\|_1 \geq \|\cdot\|_m$ for every $m \geq 1$.

Case 1 ($m = 1$): Substituting $\mathbb{E}_C\|Comp_k(x)\|_1^2 \geq \max\left\{\frac{k}{d}\|x\|_2^2,\ \frac{k^2}{d^2}\|x\|_1^2\right\}$ (from (26)) in (28) gives
$$\begin{aligned}
\mathbb{E}_C\left\|\frac{\|Comp_k(x)\|_1}{k}\,Sign \circ Comp_k(x) - x\right\|_2^2 &\leq \|x\|_2^2 - \frac{1}{k}\max\left\{\frac{k}{d}\|x\|_2^2,\ \frac{k^2}{d^2}\|x\|_1^2\right\} \\
&\leq \left[1 - \max\left\{\frac{1}{d},\ \frac{k}{d}\left(\frac{\|Comp_k(x)\|_1}{\sqrt{d}\,\|Comp_k(x)\|_2}\right)^2\right\}\right]\|x\|_2^2.
\end{aligned}$$

Case 2 ($m \geq 2$): Since $\|u\|_p \leq k^{\frac{1}{p}-\frac{1}{q}}\|u\|_q$ holds for every $u \in \mathbb{R}^k$ whenever $p \leq q$, using this in (28) with $q = m$ and $p = 2$ gives
$$\begin{aligned}
\mathbb{E}_C\left\|\frac{\|Comp_k(x)\|_m}{k}\,Sign \circ Comp_k(x) - x\right\|_2^2 &\leq \|x\|_2^2 - \frac{k^{\frac{2}{m}-1}}{k}\,\mathbb{E}_C[\|Comp_k(x)\|_2^2] \\
&\leq \|x\|_2^2 - \frac{k^{\frac{2}{m}-1}}{k}\cdot\frac{k}{d}\|x\|_2^2 \quad \text{(by Lemma 13)} \\
&= \left[1 - \frac{k^{\frac{2}{m}-1}}{d}\right]\|x\|_2^2. \quad (29)
\end{aligned}$$
This completes the proof of Lemma 3.
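The $m \geq 2$ bound of Lemma 3 is easy to check numerically. The sketch below (function name is ours) implements the scaled-sign operator $\frac{\|Top_k(x)\|_m}{k}\,Sign \circ Top_k(x)$ for $m = 2$ and verifies that the compression error never exceeds $(1 - \gamma_2)\|x\|_2^2$ with $\gamma_2 = k^{2/2-1}/d = 1/d$:

```python
import numpy as np

def sign_topk(x, k, m=2):
    """Scaled-sign compression from Lemma 3:
    (||Top_k(x)||_m / k) * Sign(Top_k(x))."""
    idx = np.argsort(np.abs(x))[-k:]       # positions of the k largest magnitudes
    topk = np.zeros_like(x)
    topk[idx] = x[idx]
    scale = np.linalg.norm(topk, ord=m) / k
    return scale * np.sign(topk)

# Empirical check of the m = 2 bound: error <= (1 - gamma_2) * ||x||^2,
# where gamma_2 = k^(2/m - 1) / d = 1/d. For Top_k this holds deterministically.
rng = np.random.default_rng(1)
d, k, m = 100, 10, 2
gamma = k ** (2 / m - 1) / d
for _ in range(50):
    x = rng.standard_normal(d)
    err = np.linalg.norm(x - sign_topk(x, k, m)) ** 2
    assert err <= (1 - gamma) * np.linalg.norm(x) ** 2 + 1e-9
```

For Gaussian vectors the observed error is typically well below the worst-case bound, since $Top_k$ captures far more than a $k/d$ fraction of the energy.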
B Omitted Details from Section 3

B.1 Proof of Lemma 4

Lemma (Restating Lemma 4). Let $gap(\mathcal{I}_T) \leq H$ and $\eta_t = \frac{\xi}{a+t}$, where $\xi$ is a constant and $a > \frac{4H}{\gamma}$. Then there exists a constant $C \geq \frac{4a\gamma(1-\gamma^2)}{a\gamma-4H}$ such that the following holds for every worker $r \in [R]$ and for every $t \in \mathbb{Z}^+$:
$$\mathbb{E}\|m_t^{(r)}\|_2^2 \leq \frac{4\eta_t^2}{\gamma^2}CH^2G^2.$$

Proof. Fix an arbitrary worker $r \in [R]$. In order to prove the lemma, we need to show that $\mathbb{E}\|m_t^{(r)}\|^2 \leq \frac{4\eta_t^2}{\gamma^2}CH^2G^2$ holds for every $t \in [T]$, where $C \geq \frac{4a\gamma(1-\gamma^2)}{a\gamma-4H}$. We show this separately for two cases, depending on whether or not $t \in \mathcal{I}_T$.

First consider the case $t \in \mathcal{I}_T$. Let $\mathcal{I}_T = \{t_{(1)}, t_{(2)}, \ldots, t_{(l)} = T\}$. Fix any $i = 1, 2, \ldots, l$ and consider $\mathbb{E}\|m_{t_{(i+1)}}^{(r)}\|^2$. Note that the local memory $m_t^{(r)}$ at any worker $r$ and the global parameter vector $x_t$ do not change in between the synchronization indices. We define $m_{t_{(0)}}^{(r)} := 0$ for every $r \in [R]$.
$$\begin{aligned}
\mathbb{E}\|m_{t_{(i+1)}}^{(r)}\|^2 &= \mathbb{E}\left\|m_{t_{(i+1)}-1}^{(r)} + x_{t_{(i+1)}-1} - \widehat{x}_{t_{(i+1)}-\frac{1}{2}}^{(r)} - g_{t_{(i+1)}-1}^{(r)}\right\|^2 \\
&\stackrel{(a)}{\leq} (1-\gamma)\,\mathbb{E}\left\|m_{t_{(i+1)}-1}^{(r)} + x_{t_{(i+1)}-1} - \widehat{x}_{t_{(i+1)}-\frac{1}{2}}^{(r)}\right\|^2 \\
&\stackrel{(b)}{=} (1-\gamma)\,\mathbb{E}\left\|m_{t_{(i)}}^{(r)} + x_{t_{(i)}} - \widehat{x}_{t_{(i+1)}-\frac{1}{2}}^{(r)}\right\|^2 \\
&\stackrel{(c)}{=} (1-\gamma)\,\mathbb{E}\left\|m_{t_{(i)}}^{(r)} + \widehat{x}_{t_{(i)}}^{(r)} - \widehat{x}_{t_{(i+1)}-\frac{1}{2}}^{(r)}\right\|^2 \quad (30)
\end{aligned}$$
Here, (a) is due to the compression property, (b) holds since the memory and the master parameter remain unchanged between two rounds of synchronization, and in (c) we used $\widehat{x}_{t_{(i)}}^{(r)} = x_{t_{(i)}}$, which holds for every $r$.
Using the inequality $\|a+b\|^2 \leq (1+\tau)\|a\|^2 + (1+\frac{1}{\tau})\|b\|^2$, which holds for every $\tau > 0$, in (30) gives (take any $p > 1$ in the following):
$$\begin{aligned}
\mathbb{E}\|m_{t_{(i+1)}}^{(r)}\|^2 &\leq (1-\gamma)\left[\left(1+\frac{(p-1)\gamma}{p}\right)\mathbb{E}\|m_{t_{(i)}}^{(r)}\|^2 + \left(1+\frac{p}{(p-1)\gamma}\right)\mathbb{E}\left\|\widehat{x}_{t_{(i)}}^{(r)} - \widehat{x}_{t_{(i+1)}-\frac{1}{2}}^{(r)}\right\|^2\right] \\
&\leq \left(1-\frac{\gamma}{p}\right)\mathbb{E}\|m_{t_{(i)}}^{(r)}\|^2 + \frac{(1-\gamma)(p\gamma+p)}{(p-1)\gamma}\,\mathbb{E}\left\|\widehat{x}_{t_{(i)}}^{(r)} - \widehat{x}_{t_{(i+1)}-\frac{1}{2}}^{(r)}\right\|^2 \\
&= \left(1-\frac{\gamma}{p}\right)\mathbb{E}\|m_{t_{(i)}}^{(r)}\|^2 + \frac{p(1-\gamma^2)}{(p-1)\gamma}\,\mathbb{E}\left\|\sum_{j=t_{(i)}}^{t_{(i+1)}-1}\eta_j\nabla f_{i_j^{(r)}}\left(\widehat{x}_j^{(r)}\right)\right\|^2 \\
&\leq \left(1-\frac{\gamma}{p}\right)\mathbb{E}\|m_{t_{(i)}}^{(r)}\|^2 + \frac{p(1-\gamma^2)}{(p-1)\gamma}\,\eta_{t_{(i)}}^2H^2G^2 \quad (31)
\end{aligned}$$
In the last inequality (31), we used $\mathbb{E}\|\sum_{j=t_{(i)}}^{t_{(i+1)}-1}\eta_j\nabla f_{i_j^{(r)}}(\widehat{x}_j^{(r)})\|^2 \leq \eta_{t_{(i)}}^2H^2G^2$, which can be seen as follows:
$$\begin{aligned}
\mathbb{E}\left\|\sum_{j=t_{(i)}}^{t_{(i+1)}-1}\eta_j\nabla f_{i_j^{(r)}}\left(\widehat{x}_j^{(r)}\right)\right\|^2 &= (t_{(i+1)}-t_{(i)})^2\,\mathbb{E}\left\|\frac{1}{t_{(i+1)}-t_{(i)}}\sum_{j=t_{(i)}}^{t_{(i+1)}-1}\eta_j\nabla f_{i_j^{(r)}}\left(\widehat{x}_j^{(r)}\right)\right\|^2 \\
&\stackrel{(a)}{\leq} (t_{(i+1)}-t_{(i)})\sum_{j=t_{(i)}}^{t_{(i+1)}-1}\mathbb{E}\left\|\eta_j\nabla f_{i_j^{(r)}}\left(\widehat{x}_j^{(r)}\right)\right\|^2 \\
&\stackrel{(b)}{\leq} (t_{(i+1)}-t_{(i)})\,\eta_{t_{(i)}}^2\sum_{j=t_{(i)}}^{t_{(i+1)}-1}\mathbb{E}\left\|\nabla f_{i_j^{(r)}}\left(\widehat{x}_j^{(r)}\right)\right\|^2 \\
&\leq (t_{(i+1)}-t_{(i)})^2\,\eta_{t_{(i)}}^2G^2 \stackrel{(c)}{\leq} \eta_{t_{(i)}}^2H^2G^2
\end{aligned}$$
Here, (a) holds by Jensen's inequality, (b) holds since $\eta_t \leq \eta_{t_{(i)}}$ for all $t \geq t_{(i)}$, and (c) holds because $t_{(i+1)} - t_{(i)} \leq H$.

Define $\tilde{\eta}_t = \frac{1}{a+t}$ and $A = \xi^2H^2G^2$. Using this in (31) gives
$$\mathbb{E}\|m_{t_{(i+1)}}^{(r)}\|^2 \leq \left(1-\frac{\gamma}{p}\right)\mathbb{E}\|m_{t_{(i)}}^{(r)}\|^2 + \frac{p(1-\gamma^2)}{(p-1)\gamma}\,\tilde{\eta}_{t_{(i)}}^2A. \quad (32)$$
We want to show that $\mathbb{E}\|m_{t_{(i)}}^{(r)}\|^2 \leq \frac{4C\tilde{\eta}_{t_{(i)}}^2}{\gamma^2}A$ holds for every $i = 1, 2, \ldots$, where $C \geq \frac{4a\gamma(1-\gamma^2)}{a\gamma-4H}$.
In fact, we prove the slightly stronger bound $\mathbb{E}\|m_{t_{(i)}}^{(r)}\|^2 \leq \frac{C\tilde{\eta}_{t_{(i)}}^2}{\gamma^2}A$ for every $i = 1, 2, \ldots$. We prove this by induction on $i$.

Base case ($i = 1$): Note that $m_{t_{(1)}-1}^{(r)} = m_0^{(r)} = 0$. Consider the following:
$$\begin{aligned}
\mathbb{E}\|m_{t_{(1)}}^{(r)}\|^2 &= \mathbb{E}\left\|x_{t_{(1)}-1} - \widehat{x}_{t_{(1)}-\frac{1}{2}} - g_{t_{(1)}-1}^{(r)}\right\|^2 \leq (1-\gamma)\,\mathbb{E}\left\|x_{t_{(1)}-1} - \widehat{x}_{t_{(1)}-\frac{1}{2}}\right\|^2 \\
&\stackrel{(a)}{=} (1-\gamma)\,\mathbb{E}\left\|\widehat{x}_0^{(r)} - \widehat{x}_{t_{(1)}-\frac{1}{2}}\right\|^2 = (1-\gamma)\,\mathbb{E}\left\|\sum_{j=0}^{t_{(1)}-1}\eta_j\nabla f_{i_j^{(r)}}\left(\widehat{x}_j^{(r)}\right)\right\|^2 \\
&\leq (1-\gamma)\,\eta_0^2H^2G^2 = (1-\gamma)\,\tilde{\eta}_0^2A
\end{aligned}$$
Here, (a) holds since $x_{t_{(1)}-1} = x_0 = \widehat{x}_0^{(r)}$. It is easy to verify that $(1-\gamma)\tilde{\eta}_0^2A \leq \frac{4a\gamma(1-\gamma^2)}{a\gamma-4H}\frac{\tilde{\eta}_{t_{(1)}}^2}{\gamma^2}A$. To show this, we use $\frac{\tilde{\eta}_0}{\tilde{\eta}_{t_{(1)}}} = \frac{a+t_{(1)}}{a} \leq \frac{a+H}{a} \leq 2$, where the first inequality follows from $t_{(1)} \leq H$ and the second from $a \geq H$. Now, since $C \geq \frac{4a\gamma(1-\gamma^2)}{a\gamma-4H}$, it follows that $\mathbb{E}\|m_{t_{(1)}}^{(r)}\|^2 \leq \frac{C\tilde{\eta}_{t_{(1)}}^2}{\gamma^2}A$.

Inductive case: Assume $\mathbb{E}\|m_{t_{(i)}}^{(r)}\|^2 \leq \frac{C\tilde{\eta}_{t_{(i)}}^2}{\gamma^2}A$ for some $i \in \mathbb{Z}^+$. We need to show that $\mathbb{E}\|m_{t_{(i+1)}}^{(r)}\|^2 \leq \frac{C\tilde{\eta}_{t_{(i+1)}}^2}{\gamma^2}A$. Using the inductive hypothesis in (32), we get
$$\begin{aligned}
\mathbb{E}\|m_{t_{(i+1)}}^{(r)}\|^2 &\leq \left(1-\frac{\gamma}{p}\right)\frac{C\tilde{\eta}_{t_{(i)}}^2}{\gamma^2}A + \frac{p(1-\gamma^2)}{(p-1)\gamma}\,\tilde{\eta}_{t_{(i)}}^2A \\
&= \frac{C\tilde{\eta}_{t_{(i)}}^2}{\gamma^2}A\left(1-\frac{\gamma}{p}+\frac{p(1-\gamma^2)\gamma}{(p-1)C}\right) \\
&= \frac{C\tilde{\eta}_{t_{(i)}}^2}{\gamma^2}A\left(1-\frac{\gamma}{p}\left(1-\frac{p^2(1-\gamma^2)}{(p-1)C}\right)\right) \quad (33)
\end{aligned}$$

Claim 1. For any $p > 1$, if $\frac{\gamma}{p}\left(1-\frac{p^2(1-\gamma^2)}{(p-1)C}\right) \geq \frac{2H}{a}$, then $\tilde{\eta}_{t_{(i)}}^2\left(1-\frac{\gamma}{p}\left(1-\frac{p^2(1-\gamma^2)}{(p-1)C}\right)\right) \leq \tilde{\eta}_{t_{(i+1)}}^2$ holds.

Proof. Let $\frac{\gamma}{p}\left(1-\frac{p^2(1-\gamma^2)}{(p-1)C}\right) = \frac{\beta}{a}$. Since $t_{(i+1)} \leq t_{(i)} + H$ (which implies that $\tilde{\eta}_{t_{(i)}+H}^2 \leq \tilde{\eta}_{t_{(i+1)}}^2$), it suffices to show that $\tilde{\eta}_{t_{(i)}}^2\left(1-\frac{\beta}{a}\right) \leq \tilde{\eta}_{t_{(i)}+H}^2$ holds whenever $\beta \geq 2H$. For simplicity of notation, let $t = t_{(i)}$. Note that $\tilde{\eta}_t^2\left(1-\frac{\beta}{a}\right) = \frac{a-\beta}{a(a+t)^2}$.
We show below that if $\beta \geq 2H$, then $a(a+t)^2 \geq (a+t+H)^2(a-\beta)$. This proves our claim, because then
$$\frac{a-\beta}{a(a+t)^2} \leq \frac{a-\beta}{(a+t+H)^2(a-\beta)} = \frac{1}{(a+t+H)^2} = \tilde{\eta}_{t+H}^2.$$
It only remains to show that $(a+t+H)^2(a-\beta) \leq a(a+t)^2$ holds if $\beta \geq 2H$:
$$\begin{aligned}
(a+t+H)^2(a-\beta) &= \left[(a+t)^2 + H^2 + 2H(a+t)\right](a-\beta) \\
&= a(a+t)^2 + aH^2 + 2Ha^2 + 2Hat - \beta(a+t)^2 - \beta H^2 - 2H\beta(a+t) \\
&= a(a+t)^2 + a(H^2 + 2Ht - 2\beta t - 2H\beta) + a^2(2H-\beta) - \beta t^2 - \beta H^2 - 2H\beta t \\
&\leq a(a+t)^2.
\end{aligned}$$
The last inequality holds whenever $\beta \geq 2H$. Therefore, we need $\frac{\gamma}{p}\left(1-\frac{p^2(1-\gamma^2)}{(p-1)C}\right) \geq \frac{2H}{a}$, which is equivalent to requiring $C \geq \frac{\gamma ap^2(1-\gamma^2)}{(p-1)(a\gamma-2pH)}$, where $a > \frac{2pH}{\gamma}$. Since this holds for every $p > 1$, substituting $p = 2$ gives $C \geq \frac{4\gamma a(1-\gamma^2)}{a\gamma-4H}$. Together with (33) and Claim 1, this implies that if $C \geq \frac{4\gamma a(1-\gamma^2)}{a\gamma-4H}$, where $a > \frac{4H}{\gamma}$, then $\mathbb{E}\|m_{t_{(i+1)}}^{(r)}\|^2 \leq \frac{C\tilde{\eta}_{t_{(i+1)}}^2}{\gamma^2}A$ holds. This proves the inductive step.

We have shown that $\mathbb{E}\|m_t^{(r)}\|^2 \leq \frac{4C\tilde{\eta}_t^2}{\gamma^2}A$ holds when $t \in \mathcal{I}_T$. It only remains to show that it also holds when $t \in [T]\setminus\mathcal{I}_T$. Let $i \in \mathbb{Z}^+$ be such that $t_{(i)} \leq t < t_{(i+1)}$, which implies that $\tilde{\eta}_{t_{(i)}} \leq 2\tilde{\eta}_t$. Since the local memory does not change in between the synchronization indices, we have $m_t^{(r)} = m_{t_{(i)}}^{(r)}$. Thus,
$$\mathbb{E}\|m_t^{(r)}\|^2 = \mathbb{E}\|m_{t_{(i)}}^{(r)}\|^2 \leq \frac{C\tilde{\eta}_{t_{(i)}}^2}{\gamma^2}A \leq \frac{4C\tilde{\eta}_t^2}{\gamma^2}A.$$
This concludes the proof of Lemma 4.

B.2 Proof of Lemma 5

Lemma (Restating Lemma 5). Let $gap(\mathcal{I}_T) \leq H$. Then the following holds for every worker $r \in [R]$ and for every $t \in \mathbb{Z}^+$:
$$\mathbb{E}\|m_t^{(r)}\|_2^2 \leq \frac{4\eta^2(1-\gamma^2)}{\gamma^2}H^2G^2.$$

Proof. Observe that (31) holds irrespective of the learning rate schedule.
In particular, using a fixed learning rate $\eta_t = \eta$ for every $t$ gives
$$\mathbb{E}\|m_{t_{(i+1)}}^{(r)}\|^2 \leq \left(1-\frac{\gamma}{p}\right)\mathbb{E}\|m_{t_{(i)}}^{(r)}\|^2 + \frac{p(1-\gamma^2)}{(p-1)\gamma}\,\eta^2H^2G^2$$
Rolling this out, we see that the memory is upper-bounded by a geometric sum:
$$\mathbb{E}\|m_{t_{(i+1)}}^{(r)}\|^2 \leq \frac{p(1-\gamma^2)}{(p-1)\gamma}\,\eta^2H^2G^2\sum_{j=0}^{\infty}\left(1-\frac{\gamma}{p}\right)^j \leq \frac{p^2(1-\gamma^2)}{p-1}\frac{\eta^2}{\gamma^2}H^2G^2.$$
Note that the last inequality holds for every $p > 1$ and is minimized at $p = 2$. Plugging in $p = 2$, we get $\mathbb{E}\|m_{t_{(i+1)}}^{(r)}\|^2 \leq \frac{4(1-\gamma^2)\eta^2}{\gamma^2}H^2G^2$. Since the RHS does not depend on $t$, it follows that $\mathbb{E}\|m_t^{(r)}\|^2 \leq \frac{4(1-\gamma^2)\eta^2}{\gamma^2}H^2G^2$ holds for every $t \in [T]$. This completes the proof of Lemma 5.

B.3 Proof of Lemma 6

Lemma (Restating Lemma 6). Let $\tilde{x}_t^{(r)}, m_t^{(r)}$, $r \in [R]$, $t \geq 0$ be generated according to Algorithm 1, and let $\widehat{x}_t^{(r)}$ be as defined in (4). Let $\tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}\tilde{x}_t^{(r)}$ and $\widehat{x}_t = \frac{1}{R}\sum_{r=1}^{R}\widehat{x}_t^{(r)}$. Then we have
$$\widehat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}m_t^{(r)},$$
i.e., the difference between the true and the virtual sequence is equal to the average memory.

Proof. Consider $\widehat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}\left(\widehat{x}_t^{(r)} - \tilde{x}_t^{(r)}\right)$. For the nearest $t_r+1 \in \mathcal{I}_T$ such that $t_r+1 \leq t$, and the nearest $t_r'+1 \in \mathcal{I}_T$ such that $t_r'+1 \leq t_r$,
$$\widehat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}\left(\widehat{x}_{t_r+1}^{(r)} - \tilde{x}_{t_r+1}^{(r)}\right) = \frac{1}{R}\sum_{r=1}^{R}x_{t_r} - \frac{1}{R}\sum_{r=1}^{R}\left(g_{t_r}^{(r)} - \left(\tilde{x}_{t_r'+1}^{(r)} - \left(\widehat{x}_{t_r'+1}^{(r)} - \widehat{x}_{t_r+\frac{1}{2}}^{(r)}\right)\right)\right) \quad (34)$$
Here we used that $\widehat{x}_{t_r'+1}^{(r)} - \widehat{x}_{t_r+\frac{1}{2}}^{(r)} = \sum_{j=t_r'+1}^{t_r}\eta_j\nabla f_{i_j^{(r)}}\left(\widehat{x}_j^{(r)}\right)$. Substituting $\widehat{x}_{t_r'+1}^{(r)} = x_{t_r'+1}$, we get
$$\widehat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}x_{t_r} - \frac{1}{R}\sum_{r=1}^{R}\left(g_{t_r}^{(r)} - \left(\tilde{x}_{t_r'+1}^{(r)} - \left(x_{t_r'+1} - \widehat{x}_{t_r+\frac{1}{2}}^{(r)}\right)\right)\right)
$$
$$\begin{aligned}
&= x_{t_r'+1} - \frac{1}{R}\sum_{r=1}^{R}g_{t_r}^{(r)} - \left(\tilde{x}_{t_r'+1} - \left(x_{t_r'+1} - \widehat{x}_{t_r+\frac{1}{2}}\right)\right) \\
&= \widehat{x}_{t_r'+1} - \tilde{x}_{t_r'+1} + \left(x_{t_r'+1} - \widehat{x}_{t_r+\frac{1}{2}}\right) - \frac{1}{R}\sum_{r=1}^{R}g_{t_r}^{(r)} \quad (35)
\end{aligned}$$
Now, since $x_{t_r'+1} = x_{t_r}$, we have
$$\widehat{x}_t - \tilde{x}_t = \widehat{x}_{t_r'+1} - \tilde{x}_{t_r'+1} + \left(x_{t_r} - \widehat{x}_{t_r+\frac{1}{2}}\right) - \frac{1}{R}\sum_{r=1}^{R}g_{t_r}^{(r)} \quad (36)$$
Rolling out the expression in (36), we get
$$\widehat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}\left[\sum_{\substack{j:\, j+1 \in \mathcal{I}_T \\ j \leq t_r}}\left(x_j^{(r)} - \widehat{x}_{j+\frac{1}{2}}^{(r)} - g_j^{(r)}\right)\right] = \frac{1}{R}\sum_{r=1}^{R}m_{t_r+1}^{(r)} = \frac{1}{R}\sum_{r=1}^{R}m_t^{(r)} \quad (37)$$
Therefore, $\widehat{x}_t - \tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}m_t^{(r)}$ is the average memory. This completes the proof of Lemma 6.

B.4 Proof of Lemma 7

Lemma (Restating Lemma 7). Let $gap(\mathcal{I}_T) \leq H$. For $\widehat{x}_t^{(r)}$ generated according to Algorithm 1 with a fixed learning rate $\eta$, and letting $\widehat{x}_t = \frac{1}{R}\sum_{r=1}^{R}\widehat{x}_t^{(r)}$, we have the following bound on the deviation of the local sequences:
$$\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\widehat{x}_t - \widehat{x}_t^{(r)}\|_2^2 \leq \eta^2G^2H^2.$$

Proof. To prove this, we follow the proof of Lemma 8 until (38) and put $\eta_{t_r} = \eta$, which gives $\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\widehat{x}_t - \widehat{x}_t^{(r)}\|^2 \leq \eta^2G^2H^2$.

B.5 Proof of Lemma 8

Lemma (Restating Lemma 8). Let $gap(\mathcal{I}_T) \leq H$. By running Algorithm 1 with a decaying learning rate $\eta_t$, we have
$$\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\widehat{x}_t - \widehat{x}_t^{(r)}\|_2^2 \leq 4\eta_t^2G^2H^2.$$

Proof. We show this along the lines of the proof of [Sti19, Lemma 3.3]. We need to upper-bound $\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\widehat{x}_t - \widehat{x}_t^{(r)}\|^2$. Note that for any $R$ vectors $u_1, \ldots, u_R$, if we let $\bar{u} = \frac{1}{R}\sum_{i=1}^{R}u_i$, then $\sum_{i=1}^{R}\|u_i - \bar{u}\|^2 \leq \sum_{i=1}^{R}\|u_i\|^2$. We use this in the first inequality below.
$$\begin{aligned}
\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\widehat{x}_t - \widehat{x}_t^{(r)}\|^2 &= \frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\left\|\widehat{x}_t^{(r)} - \widehat{x}_{t_r}^{(r)} - \left(\widehat{x}_t - \widehat{x}_{t_r}^{(r)}\right)\right\|^2 \\
&\leq \frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\left\|\widehat{x}_t^{(r)} - \widehat{x}_{t_r}^{(r)}\right\|^2 \leq \eta_{t_r}^2G^2H^2 \quad (38) \\
&\leq 4\eta_t^2G^2H^2 \quad (39)
\end{aligned}$$
The last inequality (39) uses $\eta_{t_r} \leq 2\eta_{t_r+H} \leq 2\eta_t$ and $t - t_r \leq H$.

B.6 Proof of Theorem 1

Proof. Let $x^*$ be the minimizer of $f(x)$; we denote $f(x^*)$ by $f^*$. For the purpose of reusing this proof later when proving Theorem 2, we start with the decaying learning rate $\eta_t$ until (43) and then switch to the fixed learning rate $\eta$. Note that the proof remains the same until (43) irrespective of the learning rate schedule; in particular, we can take $\eta_t = \eta$ and the same proof holds until (43). By the definition of $L$-smoothness, we have
$$\begin{aligned}
f(\tilde{x}_{t+1}) - f(\tilde{x}_t) &\leq \langle\nabla f(\tilde{x}_t),\ \tilde{x}_{t+1} - \tilde{x}_t\rangle + \frac{L}{2}\|\tilde{x}_{t+1} - \tilde{x}_t\|^2 \\
&= -\eta_t\langle\nabla f(\tilde{x}_t),\ p_t\rangle + \frac{\eta_t^2L}{2}\|p_t\|^2 \\
&= -\eta_t\langle\nabla f(\tilde{x}_t),\ p_t\rangle + \frac{\eta_t^2L}{2}\|p_t - \bar{p}_t + \bar{p}_t\|^2 \\
&\leq -\eta_t\langle\nabla f(\tilde{x}_t),\ p_t\rangle + \eta_t^2L\|p_t - \bar{p}_t\|^2 + \eta_t^2L\|\bar{p}_t\|^2 \quad \text{(using Jensen's inequality)} \\
&= -\frac{\eta_t}{R}\sum_{r=1}^{R}\left\langle\nabla f(\tilde{x}_t),\ \nabla f_{i_t^{(r)}}\left(\widehat{x}_t^{(r)}\right)\right\rangle + \eta_t^2L\left\|\frac{1}{R}\sum_{r=1}^{R}\nabla f^{(r)}\left(\widehat{x}_t^{(r)}\right)\right\|^2 + \eta_t^2L\|p_t - \bar{p}_t\|^2
\end{aligned}$$
Define $i_t$ as the set of mini-batch samples at the workers, $\{i_t^{(1)}, i_t^{(2)}, \ldots, i_t^{(R)}\}$. Taking expectation w.r.t. the sampling at time $t$ (conditioned on the past), and using the Lipschitz continuity of the gradients of the local functions, gives
$$\mathbb{E}_{i_t}[f(\tilde{x}_{t+1})] - f(\tilde{x}_t) \leq -\frac{\eta_t}{2}\left(\|\nabla f(\tilde{x}_t)\|^2 + \left\|\frac{1}{R}\sum_{r=1}^{R}\nabla f^{(r)}\left(\widehat{x}_t^{(r)}\right)\right\|^2 - \left\|\nabla f(\tilde{x}_t) - \frac{1}{R}\sum_{r=1}^{R}\nabla f^{(r)}\left(\widehat{x}_t^{(r)}\right)\right\|^2\right)
$$
$$\begin{aligned}
&\qquad + \eta_t^2L\left\|\frac{1}{R}\sum_{r=1}^{R}\nabla f^{(r)}\left(\widehat{x}_t^{(r)}\right)\right\|^2 + \frac{\eta_t^2L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 \\
&\leq -\frac{\eta_t}{2R}\sum_{r=1}^{R}\left(\|\nabla f(\tilde{x}_t)\|^2 - L^2\|\tilde{x}_t - \widehat{x}_t^{(r)}\|^2\right) + \frac{2\eta_t^2L-\eta_t}{2}\left\|\frac{1}{R}\sum_{r=1}^{R}\nabla f^{(r)}\left(\widehat{x}_t^{(r)}\right)\right\|^2 + \frac{\eta_t^2L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 \\
&= -\frac{\eta_t}{2R}\sum_{r=1}^{R}\left(\|\nabla f(\tilde{x}_t)\|^2 + L^2\|\tilde{x}_t - \widehat{x}_t^{(r)}\|^2\right) + \frac{2\eta_t^2L-\eta_t}{2R}\sum_{r=1}^{R}\|\nabla f(\widehat{x}_t^{(r)})\|^2 + \frac{\eta_t^2L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 + \frac{\eta_tL^2}{R}\sum_{r=1}^{R}\|\tilde{x}_t - \widehat{x}_t^{(r)}\|^2. \quad (40)
\end{aligned}$$
We bound the first term in terms of $\|\nabla f(\widehat{x}_t^{(r)})\|^2$ as follows:
$$\|\nabla f(\widehat{x}_t^{(r)})\|^2 \leq 2\|\nabla f(\widehat{x}_t^{(r)}) - \nabla f(\tilde{x}_t)\|^2 + 2\|\nabla f(\tilde{x}_t)\|^2 \leq 2L^2\|\widehat{x}_t^{(r)} - \tilde{x}_t\|^2 + 2\|\nabla f(\tilde{x}_t)\|^2, \quad (41)$$
where the second inequality follows from the smoothness ($L$-Lipschitz gradient) assumption. Using this and $\eta_t \leq \frac{1}{2L}$ in (40), and rearranging terms, gives
$$\frac{\eta_t}{4R}\sum_{r=1}^{R}\|\nabla f(\widehat{x}_t^{(r)})\|^2 \leq f(\tilde{x}_t) - \mathbb{E}_{i_t}[f(\tilde{x}_{t+1})] + \frac{\eta_t^2L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 + \frac{\eta_tL^2}{R}\sum_{r=1}^{R}\|\tilde{x}_t - \widehat{x}_t^{(r)}\|^2 \quad (42)$$
Taking expectation w.r.t. the entire process and using the inequality $\|u+v\|^2 \leq 2\|u\|^2 + 2\|v\|^2$ gives
$$\frac{\eta_t}{4R}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2 \leq \mathbb{E}[f(\tilde{x}_t)] - \mathbb{E}[f(\tilde{x}_{t+1})] + \frac{\eta_t^2L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 + 2\eta_tL^2\,\mathbb{E}\|\tilde{x}_t - \widehat{x}_t\|^2 + 2\eta_tL^2\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\widehat{x}_t - \widehat{x}_t^{(r)}\|^2 \quad (43)$$
Observe that (43) holds irrespective of the learning rate schedule. In particular, if we take a fixed learning rate $\eta_t = \eta \leq \frac{1}{2L}$ in (43), we get
$$\frac{\eta}{4R}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2 \leq \mathbb{E}[f(\tilde{x}_t)] - \mathbb{E}[f(\tilde{x}_{t+1})] + \frac{\eta^2L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 + 2\eta L^2\,\mathbb{E}\|\tilde{x}_t - \widehat{x}_t\|^2 + 2\eta L^2\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\widehat{x}_t - \widehat{x}_t^{(r)}\|^2 \quad (44)$$
Lemma 6 and Lemma 5 together imply $\mathbb{E}\|\widehat{x}_t - \tilde{x}_t\|^2 \leq \frac{4\eta^2(1-\gamma^2)}{\gamma^2}G^2H^2$. We also have from Lemma 7 that $\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\widehat{x}_t - \widehat{x}_t^{(r)}\|^2 \leq \eta^2G^2H^2$.
Substituting these in (44) gives
$$\frac{\eta}{4R}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2 \leq \mathbb{E}[f(\tilde{x}_t)] - \mathbb{E}[f(\tilde{x}_{t+1})] + \frac{\eta^2L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 + \frac{8\eta^3(1-\gamma^2)}{\gamma^2}L^2G^2H^2 + 2\eta^3L^2G^2H^2 \quad (45)$$
Taking a telescoping sum from $t = 0$ to $t = T-1$, we get
$$\frac{1}{4RT}\sum_{t=0}^{T-1}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2 \leq \frac{\mathbb{E}[f(\tilde{x}_0)] - f^*}{\eta T} + \frac{\eta L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 + \frac{8\eta^2(1-\gamma^2)}{\gamma^2}L^2G^2H^2 + 2\eta^2L^2G^2H^2 \quad (46)$$
Take $\eta = \frac{\widehat{C}}{\sqrt{T}}$, where $\widehat{C}$ is a constant satisfying $\widehat{C} < \frac{\sqrt{T}}{2L}$; for example, we can take $\widehat{C} = \frac{1}{2L}$. This gives
$$\frac{1}{RT}\sum_{t=0}^{T-1}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2 \leq \left(\frac{\mathbb{E}[f(x_0)] - f^*}{\widehat{C}} + \frac{\widehat{C}L}{bR^2}\sum_{r=1}^{R}\sigma_r^2\right)\frac{4}{\sqrt{T}} + 8\left(\frac{4(1-\gamma^2)}{\gamma^2}+1\right)\frac{\widehat{C}^2L^2G^2H^2}{T}. \quad (47)$$
Sample a parameter $z_T$ from $\{\widehat{x}_t^{(r)}\}$, $r = 1, \ldots, R$, $t = 0, 1, \ldots, T-1$, with probability $\Pr[z_T = \widehat{x}_t^{(r)}] = \frac{1}{RT}$, which implies $\mathbb{E}\|\nabla f(z_T)\|^2 = \frac{1}{RT}\sum_{t=0}^{T-1}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2$. Using this in (47) gives
$$\mathbb{E}\|\nabla f(z_T)\|^2 \leq \left(\frac{\mathbb{E}[f(x_0)] - f^*}{\widehat{C}} + \frac{\widehat{C}L}{bR^2}\sum_{r=1}^{R}\sigma_r^2\right)\frac{4}{\sqrt{T}} + 8\left(\frac{4(1-\gamma^2)}{\gamma^2}+1\right)\frac{\widehat{C}^2L^2G^2H^2}{T}.$$
This completes the proof of Theorem 1.

B.7 Proof of Theorem 2

Proof. Observe that we can use the proof of Theorem 1 exactly until (43), for $\eta_t \leq \frac{1}{2L}$ (which follows from our assumption that $a \geq 2\xi L$), which gives
$$\frac{\eta_t}{4R}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2 \leq \mathbb{E}[f(\tilde{x}_t)] - \mathbb{E}[f(\tilde{x}_{t+1})] + \frac{\eta_t^2L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 + 2\eta_tL^2\,\mathbb{E}\|\tilde{x}_t - \widehat{x}_t\|^2 + 2\eta_tL^2\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\widehat{x}_t - \widehat{x}_t^{(r)}\|^2 \quad (48)$$
We have from Lemma 8 that $\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\widehat{x}_t - \widehat{x}_t^{(r)}\|^2 \leq 4\eta_t^2G^2H^2$. Lemma 6 and Lemma 4 together imply that $\mathbb{E}\|\widehat{x}_t - \tilde{x}_t\|^2 \leq \frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|m_t^{(r)}\|^2 \leq \frac{4C\eta_t^2}{\gamma^2}G^2H^2$.
Using these bounds in (48) gives
$$\frac{\eta_t}{4R}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2 \leq \mathbb{E}[f(\tilde{x}_t)] - \mathbb{E}[f(\tilde{x}_{t+1})] + \frac{\eta_t^2L}{bR^2}\sum_{r=1}^{R}\sigma_r^2 + \frac{8\eta_t^3C}{\gamma^2}L^2G^2H^2 + 8\eta_t^3L^2G^2H^2$$
Taking a telescoping sum from $t = 0$ to $t = T-1$ gives
$$\sum_{t=0}^{T-1}\frac{\eta_t}{4R}\sum_{r=1}^{R}\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2 \leq \mathbb{E}[f(x_0)] - f^* + \frac{L\sum_{r=1}^{R}\sigma_r^2}{bR^2}\sum_{t=0}^{T-1}\eta_t^2 + \left(\frac{8C}{\gamma^2}+8\right)L^2G^2H^2\sum_{t=0}^{T-1}\eta_t^3. \quad (49)$$
Let $\delta_t := \frac{\eta_t}{4R}$ and $P_T := \sum_{t=0}^{T-1}\sum_{r=1}^{R}\delta_t$. We show at the end of this proof that $P_T \geq \frac{\xi}{4}\ln\left(\frac{T+a-1}{a}\right)$, $\sum_{t=0}^{T-1}\eta_t^2 \leq \frac{\xi^2}{a-1}$, and $\sum_{t=0}^{T-1}\eta_t^3 \leq \frac{\xi^3}{2(a-1)^2}$. Using these in (49) yields
$$\frac{1}{P_T}\sum_{t=0}^{T-1}\sum_{r=1}^{R}\delta_t\,\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2 \leq \frac{\mathbb{E}f(x_0) - f^*}{P_T} + \frac{L\xi^2\sum_{r=1}^{R}\sigma_r^2}{bR^2(a-1)P_T} + \left(\frac{8C}{\gamma^2}+8\right)\frac{L^2G^2H^2\xi^3}{2P_T(a-1)^2} \quad (50)$$
Sample a parameter $z_T$ from $\{\widehat{x}_t^{(r)}\}$, $r = 1, \ldots, R$, $t = 0, 1, \ldots, T-1$, with probability $\Pr[z_T = \widehat{x}_t^{(r)}] = \frac{\delta_t}{P_T}$. This gives $\mathbb{E}\|\nabla f(z_T)\|^2 = \frac{1}{P_T}\sum_{t=0}^{T-1}\sum_{r=1}^{R}\delta_t\,\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2$. We therefore have, from (50),
$$\mathbb{E}\|\nabla f(z_T)\|^2 \leq \frac{\mathbb{E}f(x_0) - f^*}{P_T} + \frac{L\xi^2\sum_{r=1}^{R}\sigma_r^2}{bR^2(a-1)P_T} + \left(\frac{8C}{\gamma^2}+8\right)\frac{\xi^3L^2G^2H^2}{2(a-1)^2P_T}$$
Since $\min_{t \in \{0,\ldots,T-1\},\, r \in [R]}\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2 \leq \mathbb{E}\|\nabla f(z_T)\|^2$ and $P_T \to \infty$ as $T \to \infty$, we have a weak convergence result:
$$\min_{t \in \{0,\ldots,T-1\},\ r \in [R]}\mathbb{E}\|\nabla f(\widehat{x}_t^{(r)})\|^2 \xrightarrow{T\to\infty} 0. \quad (51)$$
Bounding the terms $P_T$, $\sum_{t=0}^{T-1}\eta_t^2$, and $\sum_{t=0}^{T-1}\eta_t^3$:
$$P_T = \frac{1}{4}\sum_{t=0}^{T-1}\eta_t = \frac{\xi}{4}\sum_{t=0}^{T-1}\frac{1}{a+t} \geq \frac{\xi}{4}\ln\left(\frac{T+a-1}{a}\right)$$
$$\sum_{t=0}^{T-1}\eta_t^2 \leq \xi^2\left(\frac{1}{a-1}-\frac{1}{T+a-1}\right) = \frac{\xi^2T}{(a-1)(T+a-1)} \leq \frac{\xi^2}{a-1}$$
$$\sum_{t=0}^{T-1}\eta_t^3 \leq \frac{\xi^3}{2}\left(\frac{1}{(a-1)^2}-\frac{1}{(T+a-1)^2}\right) \leq \frac{\xi^3}{2(a-1)^2}$$
This completes the proof of Theorem 2.

B.8 Proof of Theorem 3

Proof.
Let $x^*$ be the minimizer of $f(x)$, so that $\nabla f(x^*) = 0$; we denote $f(x^*)$ by $f^*$. Taking the average of the virtual sequences $\tilde{x}_{t+1}^{(r)} = \tilde{x}_t^{(r)} - \eta_t\nabla f_{i_t^{(r)}}\left(\widehat{x}_t^{(r)}\right)$ over the workers $r \in [R]$, and defining $p_t := \frac{1}{R}\sum_{r=1}^{R}\nabla f_{i_t^{(r)}}\left(\widehat{x}_t^{(r)}\right)$, we get
$$\tilde{x}_{t+1} = \tilde{x}_t - \eta_tp_t. \quad (52)$$
Define $i_t$ as the set of mini-batch samples at the workers, $\{i_t^{(1)}, i_t^{(2)}, \ldots, i_t^{(R)}\}$, and let $\bar{p}_t = \mathbb{E}_{i_t}[p_t]$. From (52) we get
$$\|\tilde{x}_{t+1} - x^*\|^2 = \|\tilde{x}_t - x^* - \eta_t\bar{p}_t\|^2 + \eta_t^2\|p_t - \bar{p}_t\|^2 - 2\eta_t\langle\tilde{x}_t - x^* - \eta_t\bar{p}_t,\ p_t - \bar{p}_t\rangle \quad (53)$$
Taking expectation w.r.t. the sampling $i_t$ at time $t$ (conditioned on the past), and noting that the last term in (53) becomes zero, gives
$$\mathbb{E}_{i_t}\|\tilde{x}_{t+1} - x^*\|^2 = \|\tilde{x}_t - x^* - \eta_t\bar{p}_t\|^2 + \eta_t^2\,\mathbb{E}_{i_t}\|p_t - \bar{p}_t\|^2 \quad (54)$$
It follows from Jensen's inequality and independence that $\mathbb{E}_{i_t}\|p_t - \bar{p}_t\|^2 \leq \frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}$. This gives
$$\mathbb{E}_{i_t}\|\tilde{x}_{t+1} - x^*\|^2 \leq \|\tilde{x}_t - x^* - \eta_t\bar{p}_t\|^2 + \eta_t^2\frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}. \quad (55)$$
Now we bound the first term on the RHS.

Lemma 14. If $\eta_t \leq \frac{1}{4L}$, then we have
$$\|\tilde{x}_t - x^* - \eta_t\bar{p}_t\|^2 \leq \left(1-\frac{\mu\eta_t}{2}\right)\|\tilde{x}_t - x^*\|^2 - \frac{\eta_t\mu}{2L}\left(f(\widehat{x}_t) - f^*\right) + \eta_t\left(\frac{3\mu}{2}+3L\right)\|\widehat{x}_t - \tilde{x}_t\|^2 + \frac{3\eta_tL}{R}\sum_{r=1}^{R}\|\widehat{x}_t - \widehat{x}_t^{(r)}\|^2 \quad (56)$$

Proof.
$$\|\tilde{x}_t - x^* - \eta_t\bar{p}_t\|^2 = \|\tilde{x}_t - x^*\|^2 + \eta_t^2\|\bar{p}_t\|^2 - 2\eta_t\langle\tilde{x}_t - x^*,\ \bar{p}_t\rangle \quad (57)$$
Using the definition of $\bar{p}_t$, we have
$$\begin{aligned}
\|\bar{p}_t\|^2 &= \left\|\frac{1}{R}\sum_{r=1}^{R}\left(\nabla f^{(r)}\left(\widehat{x}_t^{(r)}\right) - \nabla f^{(r)}(\tilde{x}_t)\right) + \nabla f(\tilde{x}_t) - \nabla f(x^*)\right\|^2 \\
&\leq \frac{1}{R}\sum_{r=1}^{R}2\left\|\nabla f^{(r)}\left(\widehat{x}_t^{(r)}\right) - \nabla f^{(r)}(\tilde{x}_t)\right\|^2 + 2\|\nabla f(\tilde{x}_t) - \nabla f(x^*)\|^2 \\
&\leq \frac{2L^2}{R}\sum_{r=1}^{R}\|\widehat{x}_t^{(r)} - \tilde{x}_t\|^2 + 2\|\nabla f(\tilde{x}_t) - \nabla f(x^*)\|^2 \quad (58)
\end{aligned}$$
By the definition of smoothness, we have $\|\nabla f(\tilde{x}_t) - \nabla f(x^*)\|^2 \leq 2L(f(\tilde{x}_t) - f(x^*))$, where $\nabla f(x^*) = 0$.
Substituting this in (58) gives
$$\eta_t^2\|\bar{p}_t\|^2 \le \frac{2\eta_t^2 L^2}{R}\sum_{r=1}^{R}\|\hat{x}_t^{(r)}-\tilde{x}_t\|^2 + 4\eta_t^2 L(f(\tilde{x}_t)-f(x^*)). \quad (59)$$
Now we bound the last term of (57). By definition, we have
$$-2\eta_t\langle\tilde{x}_t-x^*,\bar{p}_t\rangle = -\frac{2\eta_t}{R}\sum_{r=1}^{R}\left\langle\hat{x}_t^{(r)}-x^*,\nabla f^{(r)}(\hat{x}_t^{(r)})\right\rangle - \frac{2\eta_t}{R}\sum_{r=1}^{R}\left\langle\tilde{x}_t-\hat{x}_t^{(r)},\nabla f^{(r)}(\hat{x}_t^{(r)})\right\rangle. \quad (60)$$
For the first term on the RHS of (60), we can use strong convexity:
$$-2\left\langle\hat{x}_t^{(r)}-x^*,\nabla f^{(r)}(\hat{x}_t^{(r)})\right\rangle \le -2\left(f^{(r)}(\hat{x}_t^{(r)})-f^{(r)}(x^*)\right) - \mu\|\hat{x}_t^{(r)}-x^*\|^2. \quad (61)$$
For the second term on the RHS of (60), we can use the following, which holds by smoothness:
$$-2\left\langle\tilde{x}_t-\hat{x}_t^{(r)},\nabla f^{(r)}(\hat{x}_t^{(r)})\right\rangle \le L\|\tilde{x}_t-\hat{x}_t^{(r)}\|^2 + 2\left(f^{(r)}(\hat{x}_t^{(r)})-f^{(r)}(\tilde{x}_t)\right). \quad (62)$$
Using (61)-(62) in (60) we get
$$-2\eta_t\langle\tilde{x}_t-x^*,\bar{p}_t\rangle \le -\frac{2\eta_t}{R}\sum_{r=1}^{R}\left(f^{(r)}(\tilde{x}_t)-f^{(r)}(x^*)\right) - \frac{\eta_t\mu}{R}\sum_{r=1}^{R}\|\hat{x}_t^{(r)}-x^*\|^2 + \frac{L\eta_t}{R}\sum_{r=1}^{R}\|\tilde{x}_t-\hat{x}_t^{(r)}\|^2 = -2\eta_t(f(\tilde{x}_t)-f(x^*)) - \frac{\eta_t\mu}{R}\sum_{r=1}^{R}\|\hat{x}_t^{(r)}-x^*\|^2 + \frac{L\eta_t}{R}\sum_{r=1}^{R}\|\tilde{x}_t-\hat{x}_t^{(r)}\|^2. \quad (63)$$
Adding (59) and (63) and using $a \ge 32L/\mu$, which implies $\eta_t \le \frac{1}{4L}$, yields
$$\eta_t^2\|\bar{p}_t\|^2 - 2\eta_t\langle\tilde{x}_t-x^*,\bar{p}_t\rangle \le -2\eta_t(1-2\eta_t L)(f(\tilde{x}_t)-f^*) - \frac{\eta_t\mu}{R}\sum_{r=1}^{R}\|\hat{x}_t^{(r)}-x^*\|^2 + \frac{L\eta_t+2\eta_t^2L^2}{R}\sum_{r=1}^{R}\|\tilde{x}_t-\hat{x}_t^{(r)}\|^2 \le -\eta_t(f(\tilde{x}_t)-f^*) - \eta_t\mu\|\hat{x}_t-x^*\|^2 + \frac{3L\eta_t}{R}\sum_{r=1}^{R}\left(\|\tilde{x}_t-\hat{x}_t\|^2 + \|\hat{x}_t-\hat{x}_t^{(r)}\|^2\right). \quad (64)$$
Since $\|x+y\|^2 \le 2\|x\|^2 + 2\|y\|^2$, we have
$$-\|\hat{x}_t-x^*\|^2 \le \|\hat{x}_t-\tilde{x}_t\|^2 - \frac{1}{2}\|\tilde{x}_t-x^*\|^2. \quad (65)$$
Using (65) in (64) and then substituting (64) in (57) gives
$$\|\tilde{x}_t - x^* - \eta_t\bar{p}_t\|^2 \le \left(1-\frac{\mu\eta_t}{2}\right)\|\tilde{x}_t-x^*\|^2 - \eta_t(f(\tilde{x}_t)-f^*) + \eta_t(\mu+3L)\|\hat{x}_t-\tilde{x}_t\|^2 + \frac{3L\eta_t}{R}\sum_{r=1}^{R}\|\hat{x}_t-\hat{x}_t^{(r)}\|^2. \quad (66)$$
Using strong convexity of $f$, i.e., $-\eta_t(f(\tilde{x}_t)-f^*) \le -\frac{\eta_t\mu}{2}\|\tilde{x}_t-x^*\|^2$, we have
$$\|\tilde{x}_t - x^* - \eta_t\bar{p}_t\|^2 \le \left(1-\frac{\mu\eta_t}{2}\right)\|\tilde{x}_t-x^*\|^2 - \frac{\eta_t\mu}{2}\|\tilde{x}_t-x^*\|^2 + \eta_t(\mu+3L)\|\hat{x}_t-\tilde{x}_t\|^2 + \frac{3L\eta_t}{R}\sum_{r=1}^{R}\|\hat{x}_t-\hat{x}_t^{(r)}\|^2. \quad (67)$$
Now use $-\|\tilde{x}_t-x^*\|^2 \le \|\tilde{x}_t-\hat{x}_t\|^2 - \frac{1}{2}\|\hat{x}_t-x^*\|^2$. We get
$$\|\tilde{x}_t - x^* - \eta_t\bar{p}_t\|^2 \le \left(1-\frac{\mu\eta_t}{2}\right)\|\tilde{x}_t-x^*\|^2 - \frac{\eta_t\mu}{4}\|\hat{x}_t-x^*\|^2 + \eta_t\left(\frac{3\mu}{2}+3L\right)\|\hat{x}_t-\tilde{x}_t\|^2 + \frac{3L\eta_t}{R}\sum_{r=1}^{R}\|\hat{x}_t-\hat{x}_t^{(r)}\|^2 \le \left(1-\frac{\mu\eta_t}{2}\right)\|\tilde{x}_t-x^*\|^2 - \frac{\eta_t\mu}{2L}(f(\hat{x}_t)-f^*) + \eta_t\left(\frac{3\mu}{2}+3L\right)\|\hat{x}_t-\tilde{x}_t\|^2 + \frac{3L\eta_t}{R}\sum_{r=1}^{R}\|\hat{x}_t-\hat{x}_t^{(r)}\|^2, \quad (68)$$
where the last inequality uses smoothness of $f(x)$, i.e., $f(\hat{x}_t)-f^* \le \frac{L}{2}\|\hat{x}_t-x^*\|^2$. This completes the proof of Lemma 14.

Using (68) in (55) and then taking the expectation over the entire process gives
$$\mathbb{E}\|\tilde{x}_{t+1}-x^*\|^2 \le \left(1-\frac{\mu\eta_t}{2}\right)\mathbb{E}\|\tilde{x}_t-x^*\|^2 - \frac{\eta_t\mu}{2L}(\mathbb{E}[f(\hat{x}_t)]-f^*) + \eta_t\left(\frac{3\mu}{2}+3L\right)\mathbb{E}\|\hat{x}_t-\tilde{x}_t\|^2 + \frac{3\eta_t L}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t-\hat{x}_t^{(r)}\|^2 + \eta_t^2\frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}. \quad (69)$$
From Lemma 8, we have $\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t-\hat{x}_t^{(r)}\|^2 \le 4\eta_t^2G^2H^2$. Lemma 6 and Lemma 4 together imply that $\mathbb{E}\|\hat{x}_t-\tilde{x}_t\|^2 \le \frac{4C\eta_t^2}{\gamma^2}H^2G^2$. Substituting these back in (69), using $\eta_t \le \frac{1}{4L}$, and letting $e_t := \mathbb{E}[f(\hat{x}_t)-f^*]$ gives
$$\mathbb{E}\|\tilde{x}_{t+1}-x^*\|^2 \le \left(1-\frac{\mu\eta_t}{2}\right)\mathbb{E}\|\tilde{x}_t-x^*\|^2 - \frac{\mu\eta_t}{2L}e_t + \eta_t\left(\frac{3\mu}{2}+3L\right)\frac{4C\eta_t^2}{\gamma^2}G^2H^2 + (3L\eta_t)\,4\eta_t^2LG^2H^2 + \eta_t^2\frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}. \quad (71)$$
Employing a slightly modified Lemma 3.3 from [SCJ18] with $a_t = \mathbb{E}\|\tilde{x}_t-x^*\|^2$,
$A = \frac{\sum_{r=1}^{R}\sigma_r^2}{bR^2}$, and $B = 4\left[\left(\frac{3\mu}{2}+3L\right)\frac{CG^2H^2}{\gamma^2} + 3L^2G^2H^2\right]$, we have
$$a_{t+1} \le \left(1-\frac{\mu\eta_t}{2}\right)a_t - \frac{\mu\eta_t}{2L}e_t + \eta_t^2 A + \eta_t^3 B. \quad (72)$$
For $\eta_t = \frac{8}{\mu(a+t)}$ and $w_t = (a+t)^2$, with $S_T = \sum_{t=0}^{T-1} w_t \ge \frac{T^3}{3}$, we have
$$\frac{\mu}{2LS_T}\sum_{t=0}^{T-1} w_t e_t \le \frac{\mu a^3}{8S_T}a_0 + \frac{4T(T+2a)}{\mu S_T}A + \frac{64T}{\mu^2 S_T}B. \quad (73)$$
From convexity we can finally write
$$\mathbb{E}[f(\bar{x}_T)] - f^* \le \frac{La^3}{4S_T}a_0 + \frac{8LT(T+2a)}{\mu^2 S_T}A + \frac{128LT}{\mu^3 S_T}B, \quad (74)$$
where $\bar{x}_T := \frac{1}{S_T}\sum_{t=0}^{T-1} w_t\left[\frac{1}{R}\sum_{r=1}^{R}\hat{x}_t^{(r)}\right] = \frac{1}{S_T}\sum_{t=0}^{T-1} w_t\hat{x}_t$. This completes the proof of Theorem 3.

C Omitted Details from Section 4

As before, in order to prove our results in the asynchronous setting, we define virtual sequences for every worker $r\in[R]$ and for all $t\ge 0$ as follows:
$$\tilde{x}_0^{(r)} := \hat{x}_0^{(r)}, \qquad \tilde{x}_{t+1}^{(r)} := \tilde{x}_t^{(r)} - \eta_t\nabla f_{i_t^{(r)}}(\hat{x}_t^{(r)}).$$
Define
1. $\tilde{x}_{t+1} := \frac{1}{R}\sum_{r=1}^{R}\tilde{x}_{t+1}^{(r)} = \tilde{x}_t - \frac{\eta_t}{R}\sum_{r=1}^{R}\nabla f_{i_t^{(r)}}(\hat{x}_t^{(r)})$
2. $p_t := \frac{1}{R}\sum_{r=1}^{R}\nabla f_{i_t^{(r)}}(\hat{x}_t^{(r)})$
3. $\bar{p}_t := \mathbb{E}_{i_t}[p_t] = \frac{1}{R}\sum_{r=1}^{R}\nabla f^{(r)}(\hat{x}_t^{(r)})$
4. $\hat{x}_t := \frac{1}{R}\sum_{r=1}^{R}\hat{x}_t^{(r)}$
5. $I_T^{(r)} := \{t^{(r)}(i) : i\in\mathbb{Z}^+,\ t^{(r)}(i)\in[T],\ |t^{(r)}(i)-t^{(r)}(j)|\le H\ \forall\,|i-j|\le 1\}$

C.1 Proof of Lemma 9

Lemma (Restating Lemma 9). Let $gap(I_T^{(r)}) \le H$ hold for every $r\in[R]$. For $\hat{x}_t^{(r)}$ generated according to Algorithm 2 with decaying learning rate $\eta_t$, and letting $\hat{x}_t = \frac{1}{R}\sum_{r=1}^{R}\hat{x}_t^{(r)}$, we have the following bound on the deviation of the local sequences:
$$\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t-\hat{x}_t^{(r)}\|_2^2 \le 8(1+C''H^2)\eta_t^2G^2H^2,$$
where $C'' = 8(4-2\gamma)\left(1+\frac{C}{\gamma^2}\right)$ and $C$ is a constant satisfying $C \ge \frac{4a\gamma(1-\gamma^2)}{a\gamma-4H}$.

Proof. Fix a time $t$ and consider any worker $r\in[R]$. Let $t_r\in I_T^{(r)}$ denote the last synchronization step until time $t$ for the $r$'th worker. Define $t_0' := \min_{r\in[R]} t_r$.
We need to upper-bound $\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t-\hat{x}_t^{(r)}\|^2$. Note that for any $R$ vectors $u_1,\dots,u_R$, if we let $\bar{u} = \frac{1}{R}\sum_{i=1}^{R}u_i$, then $\sum_{i=1}^{R}\|u_i-\bar{u}\|^2 \le \sum_{i=1}^{R}\|u_i\|^2$. We use this in the first inequality below.
$$\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t-\hat{x}_t^{(r)}\|^2 = \frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t^{(r)}-\bar{\bar{x}}_{t_0'} - (\hat{x}_t-\bar{\bar{x}}_{t_0'})\|^2 \le \frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t^{(r)}-\bar{\bar{x}}_{t_0'}\|^2 \le \frac{2}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t^{(r)}-\hat{x}_{t_r}^{(r)}\|^2 + \frac{2}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_{t_r}^{(r)}-\bar{\bar{x}}_{t_0'}\|^2. \quad (75)$$
We bound both terms separately. For the first term:
$$\mathbb{E}\|\hat{x}_t^{(r)}-\hat{x}_{t_r}^{(r)}\|^2 = \mathbb{E}\Big\|\sum_{j=t_r}^{t-1}\eta_j\nabla f_{i_j^{(r)}}(\hat{x}_j^{(r)})\Big\|^2 \le (t-t_r)\sum_{j=t_r}^{t-1}\mathbb{E}\|\eta_j\nabla f_{i_j^{(r)}}(\hat{x}_j^{(r)})\|^2 \le (t-t_r)^2\eta_{t_r}^2G^2 \le 4\eta_t^2H^2G^2. \quad (76)$$
The last inequality in (76) uses $\eta_{t_r} \le 2\eta_{t_r+H} \le 2\eta_t$ and $t-t_r \le H$. To bound the second term of (75), note that we have
$$\bar{\bar{x}}_{t_r}^{(r)} = \bar{\bar{x}}_{t_0'} - \frac{1}{R}\sum_{s=1}^{R}\sum_{j=t_0'}^{t_r-1}\mathbb{1}\{j+1\in I_T^{(s)}\}\,g_j^{(s)}. \quad (77)$$
Note that $\hat{x}_{t_r}^{(r)} = \bar{\bar{x}}_{t_r}^{(r)}$, because at synchronization steps the local parameter vector becomes equal to the global parameter vector. Using this, Jensen's inequality, and the fact that $\|\mathbb{1}\{j+1\in I_T^{(s)}\}g_j^{(s)}\|^2 \le \|g_j^{(s)}\|^2$, we can upper-bound via (77):
$$\mathbb{E}\|\hat{x}_{t_r}^{(r)}-\bar{\bar{x}}_{t_0'}\|^2 \le \frac{t_r-t_0'}{R}\sum_{s=1}^{R}\sum_{j=t_0'}^{t_r}\mathbb{E}\|g_j^{(s)}\|^2. \quad (78)$$
Now we bound $\mathbb{E}\|g_j^{(s)}\|^2$ for any $j\in\{t_0',\dots,t_r\}$ and $s\in[R]$. Since $\mathbb{E}\|QComp_k(u)\|^2 \le B\|u\|^2$ holds for every $u$, with $B=(4-2\gamma)$ (see footnote 6), we have for any $s\in[R]$ that
$$\mathbb{E}\|g_j^{(s)}\|^2 \le B\,\mathbb{E}\|m_j^{(s)} + x_j^{(s)} - \hat{x}_{j+\frac{1}{2}}^{(s)}\|^2 \quad (79)$$
$$\le 2B\,\mathbb{E}\|m_j^{(s)}\|^2 + 2B\,\mathbb{E}\|x_j^{(s)} - \hat{x}_{j+\frac{1}{2}}^{(s)}\|^2. \quad (80)$$
Observe that the proof of Lemma 4 does not depend on the synchrony of the network; it only uses the fact that $gap(I_T^{(s)}) \le H$ for any worker $s\in[R]$.
Therefore, we can directly use Lemma 4 to bound the first term in (80) as $\mathbb{E}\|m_j^{(s)}\|^2 \le \frac{4C\eta_j^2}{\gamma^2}H^2G^2$. In order to bound the second term of (80), note that $x_j^{(s)} = \hat{x}_{t_s}^{(s)}$, which implies that $\|x_j^{(s)}-\hat{x}_{j+\frac{1}{2}}^{(s)}\|^2 = \|\sum_{l=t_s}^{j}\eta_l\nabla f_{i_l^{(s)}}(\hat{x}_l^{(s)})\|^2$. Taking expectation yields $\mathbb{E}\|x_j^{(s)}-\hat{x}_{j+\frac{1}{2}}^{(s)}\|^2 \le 4\eta_{t_s}^2H^2G^2 \le 4\eta_{t_0'}^2H^2G^2$, where in the last inequality we used that $t_0' \le t_s$. Using these in (80) gives
$$\mathbb{E}\|g_j^{(s)}\|^2 \le 8B\left(1+\frac{C}{\gamma^2}\right)\eta_{t_0'}^2H^2G^2. \quad (81)$$
Since $t_0' \le t \le t_0'+H$, we have $\eta_{t_0'} \le 2\eta_{t_0'+H} \le 2\eta_t$. Putting the bound on $\mathbb{E}\|g_j^{(s)}\|^2$ (after substituting $\eta_{t_0'} \le 2\eta_t$ in (81)) into (78) gives
$$\mathbb{E}\|\hat{x}_{t_r}^{(r)}-\bar{\bar{x}}_{t_0'}\|^2 \le 32B\left(1+\frac{C}{\gamma^2}\right)\eta_t^2H^4G^2. \quad (82)$$
Putting this and the bound from (76) back into (75) gives
$$\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t-\hat{x}_t^{(r)}\|^2 \le 8\eta_t^2H^2G^2 + 64B\left(1+\frac{C}{\gamma^2}\right)\eta_t^2H^4G^2 \le 8\left(1+8BH^2\left(1+\frac{C}{\gamma^2}\right)\right)\eta_t^2H^2G^2.$$
This completes the proof of Lemma 9.

C.2 Proof of Lemma 10

Lemma (Restating Lemma 10). Let $gap(I_T^{(r)}) \le H$ hold for every $r\in[R]$. By running Algorithm 2 with fixed learning rate $\eta$, we have
$$\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t-\hat{x}_t^{(r)}\|_2^2 \le (2+H^2C')\eta^2G^2H^2,$$
where $C' = \left(\frac{16}{\gamma^2}-12\right)(4-2\gamma)$.

Proof. From (79) and (80), and using the fact that $\mathbb{E}\|QComp_k(u)\|^2 \le B\|u\|^2$ for every $u$, where $B=(4-2\gamma)$, we have the following:
$$\mathbb{E}\|g_j^{(s)}\|^2 \le 2B\,\mathbb{E}\|m_j^{(s)}\|^2 + 2B\eta^2H^2G^2 \le \frac{8B(1-\gamma^2)\eta^2}{\gamma^2}H^2G^2 + 2\eta^2BH^2G^2 = 2B\left(\frac{4}{\gamma^2}-3\right)\eta^2H^2G^2. \quad (83)$$
(Footnote 6: This can be seen as follows: $\mathbb{E}\|QC(u)\|^2 \le 2\,\mathbb{E}\|u-QC(u)\|^2 + 2\|u\|^2 \le 2(1-\gamma)\|u\|^2 + 2\|u\|^2 = (4-2\gamma)\|u\|^2$.)
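As a quick numerical sanity check on footnote 6 (this snippet is ours, not part of the paper), the following standalone Python code verifies both the compression property $\|u - Comp(u)\|^2 \le (1-\gamma)\|u\|^2$ and the derived bound $\|Comp(u)\|^2 \le (4-2\gamma)\|u\|^2$ for the deterministic $Top_k$ sparsifier, for which $\gamma = k/d$; the `top_k` and `sq_norm` helpers are our own illustrative names:

```python
import random

def top_k(u, k):
    """Keep the k largest-magnitude coordinates of u; zero out the rest."""
    keep = set(sorted(range(len(u)), key=lambda i: abs(u[i]), reverse=True)[:k])
    return [u[i] if i in keep else 0.0 for i in range(len(u))]

def sq_norm(u):
    return sum(x * x for x in u)

random.seed(0)
d, k = 20, 5
gamma = k / d  # Top_k satisfies ||u - Top_k(u)||^2 <= (1 - k/d) ||u||^2

for _ in range(1000):
    u = [random.gauss(0.0, 1.0) for _ in range(d)]
    c = top_k(u, k)
    err = [ui - ci for ui, ci in zip(u, c)]
    # compression property used throughout the appendix
    assert sq_norm(err) <= (1 - gamma) * sq_norm(u) + 1e-12
    # footnote 6: ||C(u)||^2 <= 2||u - C(u)||^2 + 2||u||^2 <= (4 - 2*gamma)||u||^2
    assert sq_norm(c) <= (4 - 2 * gamma) * sq_norm(u) + 1e-12
print("footnote-6 bounds verified on 1000 random vectors")
```

Note that for the deterministic $Top_k$, the decomposition $\|u\|^2 = \|Top_k(u)\|^2 + \|u - Top_k(u)\|^2$ holds exactly, so the factor $(4-2\gamma)$ is loose in this special case; it is needed for general (possibly randomized) compressors such as the composed $QComp_k$.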
For a fixed learning rate $\eta$, using (83) and following a similar analysis as in (76), we can bound the first term in (75) as follows:
$$\mathbb{E}\|\hat{x}_t^{(r)}-\hat{x}_{t_r}^{(r)}\|^2 \le \eta^2H^2G^2. \quad (84)$$
Similarly, as in (77)-(81), we can bound the second term in (75) as follows:
$$\mathbb{E}\|\hat{x}_{t_r}^{(r)}-\bar{\bar{x}}_{t_0'}\|^2 \le 2B\left(\frac{4}{\gamma^2}-3\right)\eta^2H^4G^2. \quad (85)$$
Using (84) and (85) in (75), we can show that
$$\frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_t-\hat{x}_t^{(r)}\|^2 \le \left(2+4BH^2\left(\frac{4}{\gamma^2}-3\right)\right)\eta^2H^2G^2. \quad (86)$$
This completes the proof of Lemma 10.

C.3 Proof of Lemma 11

Lemma (Restating Lemma 11). Let $gap(I_T^{(r)}) \le H$ hold for every $r\in[R]$. If we run Algorithm 2 with a decaying learning rate $\eta_t$, then we have the following bound on the difference between the true and virtual sequences:
$$\mathbb{E}\|\hat{x}_t-\tilde{x}_t\|_2^2 \le C'\eta_t^2H^4G^2 + \frac{12C\eta_t^2}{\gamma^2}G^2H^2,$$
where $C' = 192(4-2\gamma)\left(1+\frac{C}{\gamma^2}\right)$ and $C$ is a constant satisfying $C \ge \frac{4a\gamma(1-\gamma^2)}{a\gamma-4H}$.

Proof. Fix a time $t$ and consider any worker $r\in[R]$. Let $t_r\in I_T^{(r)}$ denote the last synchronization step until time $t$ for the $r$'th worker, and define $t_0' := \min_{r\in[R]} t_r$. We want to bound $\mathbb{E}\|\hat{x}_t-\tilde{x}_t\|^2$. Note that in the synchronous case, we have shown in Lemma 6 that $\hat{x}_t-\tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}m_t^{(r)}$. This does not hold in the asynchronous setting, which makes upper-bounding $\mathbb{E}\|\hat{x}_t-\tilde{x}_t\|^2$ a bit more involved. By definition, $\hat{x}_t-\tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}(\hat{x}_t^{(r)}-\tilde{x}_t^{(r)})$. By the definition of the virtual sequences and the update rule for $\hat{x}_t^{(r)}$, we also have $\hat{x}_t-\tilde{x}_t = \frac{1}{R}\sum_{r=1}^{R}(\hat{x}_{t_r}^{(r)}-\tilde{x}_{t_r}^{(r)})$.
This can be written as
$$\hat{x}_t-\tilde{x}_t = \left[\frac{1}{R}\sum_{r=1}^{R}\hat{x}_{t_r}^{(r)} - \bar{\bar{x}}_{t_0'}\right] + \left[\bar{\bar{x}}_{t_0'} - \bar{\bar{x}}_t\right] + \left[\bar{\bar{x}}_t - \frac{1}{R}\sum_{r=1}^{R}\tilde{x}_{t_r}^{(r)}\right]. \quad (87)$$
Applying Jensen's inequality and taking expectation gives
$$\mathbb{E}\|\hat{x}_t-\tilde{x}_t\|^2 \le \left[\frac{3}{R}\sum_{r=1}^{R}\mathbb{E}\|\hat{x}_{t_r}^{(r)}-\bar{\bar{x}}_{t_0'}\|^2\right] + \left[3\,\mathbb{E}\|\bar{\bar{x}}_{t_0'}-\bar{\bar{x}}_t\|^2\right] + \left[3\,\mathbb{E}\Big\|\bar{\bar{x}}_t-\frac{1}{R}\sum_{r=1}^{R}\tilde{x}_{t_r}^{(r)}\Big\|^2\right]. \quad (88)$$
We bound each of the three terms of (88) separately. We have upper-bounded the first term earlier in (82):
$$\mathbb{E}\|\hat{x}_{t_r}^{(r)}-\bar{\bar{x}}_{t_0'}\|^2 \le 32B\left(1+\frac{C}{\gamma^2}\right)\eta_t^2H^4G^2, \quad (89)$$
where $B=(4-2\gamma)$. To bound the second term of (88), note that
$$\bar{\bar{x}}_t = \bar{\bar{x}}_0 - \frac{1}{R}\sum_{r=1}^{R}\sum_{j=0}^{t_r-1}\mathbb{1}\{j+1\in I_T^{(r)}\}\,g_j^{(r)} \quad (90)$$
$$= \bar{\bar{x}}_{t_0'} - \frac{1}{R}\sum_{r=1}^{R}\sum_{j=t_0'}^{t_r-1}\mathbb{1}\{j+1\in I_T^{(r)}\}\,g_j^{(r)}. \quad (91)$$
By applying Jensen's inequality, using $\|\mathbb{1}\{j+1\in I_T^{(r)}\}g_j^{(r)}\|^2 \le \|g_j^{(r)}\|^2$, and taking expectation, we can upper-bound via (91):
$$\mathbb{E}\|\bar{\bar{x}}_{t_0'}-\bar{\bar{x}}_t\|^2 \le \frac{t_r-t_0'}{R}\sum_{r=1}^{R}\sum_{j=t_0'}^{t_r}\mathbb{E}\|g_j^{(r)}\|^2.$$
Using the bound on the $\mathbb{E}\|g_j^{(r)}\|^2$'s from (81) gives
$$\mathbb{E}\|\bar{\bar{x}}_{t_0'}-\bar{\bar{x}}_t\|^2 \le 32B\left(1+\frac{C}{\gamma^2}\right)\eta_t^2H^4G^2. \quad (92)$$
To bound the last term of (88), note that
$$\tilde{x}_{t_r}^{(r)} = \bar{\bar{x}}_0 - \sum_{j=0}^{t_r-1}\eta_j\nabla f_{i_j^{(r)}}(\hat{x}_j^{(r)}). \quad (93)$$
From (90) and (93), we can write
$$\bar{\bar{x}}_t - \frac{1}{R}\sum_{r=1}^{R}\tilde{x}_{t_r}^{(r)} = \frac{1}{R}\sum_{r=1}^{R}\left[\sum_{j=0}^{t_r-1}\eta_j\nabla f_{i_j^{(r)}}(\hat{x}_j^{(r)}) - \sum_{j=0}^{t_r-1}\mathbb{1}\{j+1\in I_T^{(r)}\}\,g_j^{(r)}\right]. \quad (94)$$
Let $t_r^{(1)}$ and $t_r^{(2)}$ be two consecutive synchronization steps in $I_T^{(r)}$. Then, by the update rule of $\hat{x}_t^{(r)}$, we have $\hat{x}_{t_r^{(1)}}^{(r)} - \hat{x}_{t_r^{(2)}-\frac{1}{2}}^{(r)} = \sum_{j=t_r^{(1)}}^{t_r^{(2)}-1}\eta_j\nabla f_{i_j^{(r)}}(\hat{x}_j^{(r)})$.
Since $x_{t_r^{(1)}}^{(r)} = \hat{x}_{t_r^{(1)}}^{(r)}$ and the workers do not modify their local $x_t^{(r)}$'s in between the synchronization steps, we have $x_{t_r^{(2)}-1}^{(r)} = x_{t_r^{(1)}}^{(r)} = \hat{x}_{t_r^{(1)}}^{(r)}$. Therefore, we can write
$$x_{t_r^{(2)}-1}^{(r)} - \hat{x}_{t_r^{(2)}-\frac{1}{2}}^{(r)} = \sum_{j=t_r^{(1)}}^{t_r^{(2)}-1}\eta_j\nabla f_{i_j^{(r)}}(\hat{x}_j^{(r)}). \quad (95)$$
Using (95) for every pair of consecutive synchronization steps, we can equivalently write (94) as
$$\bar{\bar{x}}_t - \frac{1}{R}\sum_{r=1}^{R}\tilde{x}_{t_r}^{(r)} = \frac{1}{R}\sum_{r=1}^{R}\Bigg[\sum_{\substack{j:\,j+1\in I_T^{(r)} \\ j\le t_r-1}}\left(x_j^{(r)} - \hat{x}_{j+\frac{1}{2}}^{(r)} - g_j^{(r)}\right)\Bigg] = \frac{1}{R}\sum_{r=1}^{R}m_{t_r}^{(r)} = \frac{1}{R}\sum_{r=1}^{R}m_t^{(r)}. \quad (96)$$
In the last equality, we used the fact that the workers do not update their local memory in between the synchronization steps. For the reasons given in the proof of Lemma 9, we can directly apply Lemma 4 to bound the local memories and obtain $\mathbb{E}\|\frac{1}{R}\sum_{r=1}^{R}m_t^{(r)}\|^2 \le \frac{1}{R}\sum_{r=1}^{R}\mathbb{E}\|m_t^{(r)}\|^2 \le \frac{4C\eta_t^2}{\gamma^2}G^2H^2$. This implies
$$\mathbb{E}\Big\|\bar{\bar{x}}_t-\frac{1}{R}\sum_{r=1}^{R}\tilde{x}_{t_r}^{(r)}\Big\|^2 \le \frac{4C\eta_t^2}{\gamma^2}G^2H^2. \quad (97)$$
Putting the bounds from (89), (92), and (97) into (88) and using $B=(4-2\gamma)$ gives
$$\mathbb{E}\|\hat{x}_t-\tilde{x}_t\|^2 \le 192(4-2\gamma)\left(1+\frac{C}{\gamma^2}\right)\eta_t^2H^4G^2 + \frac{12C\eta_t^2}{\gamma^2}G^2H^2.$$
This completes the proof of Lemma 11.

C.4 Proof of Lemma 12

Lemma (Restating Lemma 12). Let $gap(I_T^{(r)}) \le H$ hold for every $r\in[R]$. If we run Algorithm 2 with a fixed learning rate $\eta$, we have
$$\mathbb{E}\|\hat{x}_t-\tilde{x}_t\|_2^2 \le 6C'\eta^2H^4G^2 + \frac{12\eta^2(1-\gamma^2)}{\gamma^2}G^2H^2,$$
where $C' = (4-2\gamma)\left(\frac{8}{\gamma^2}-6\right)$.

Proof. For a constant learning rate, the first term in (88) has been bounded earlier in (85). Following similar steps as in (91), we would have
$$\mathbb{E}\|\bar{\bar{x}}_{t_0'}-\bar{\bar{x}}_t\|^2 \le 2B\left(\frac{4}{\gamma^2}-3\right)\eta^2H^4G^2.$$
(98)

Finally, using (85), (96), Lemma 5, and (98) in (88), we have
$$\mathbb{E}\|\hat{x}_t-\tilde{x}_t\|^2 \le 12B\left(\frac{4}{\gamma^2}-3\right)\eta^2H^4G^2 + \frac{12\eta^2(1-\gamma^2)}{\gamma^2}G^2H^2, \quad (99)$$
where $B=(4-2\gamma)$. This completes the proof of Lemma 12.

D Omitted Details from Section 5

As mentioned in Footnote 4, here we compare the performance of Qsparse-local-SGD with the scaled and unscaled composed operator $QTop_k$ in the non-convex setting. We will see that even though the scaled $QTop_k$ from Lemma 2 works better than the unscaled $QTop_k$ from Lemma 1 theoretically (see Remark 2), our experiments show the opposite phenomenon: the unscaled $QTop_k$ works at least as well as the scaled $QTop_k$, and strictly better in some cases. We can attribute this to the fact that scaling the composed operator is a sufficient condition for obtaining better convergence results, which does not necessarily mean that it also does better in practice. Therefore, we perform our experiments in the non-convex setting of Section 5 with the unscaled $QTop_k$. We give plots for the above-mentioned comparison in Figure 8.

[Figure 8: panels (a)-(d) show training loss vs. epochs, training loss vs. $\log_2$ of the communication budget, and top-1/top-5 accuracy [LHS15] for the schemes in Figure 8a, demonstrating the comparable performance of Qsparse-local-SGD in the non-convex setting with the scaled and unscaled $QTop_k$ operators from Lemma 1 and Lemma 2, respectively.]

From [AGL+17], we know that for quantized SGD without any form of error compensation, the dominating term in the convergence rate is affected by the variance blow-up induced by stochastic quantization; however, with error compensation, we recover rates matching vanilla SGD despite compression and infrequent communication (Section 3.3).
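The error-compensation mechanism analyzed in this appendix, where each worker transmits $g^{(r)} = QComp_k(m^{(r)} + x^{(r)} - \hat{x}^{(r)})$ at a synchronization step and folds the untransmitted residual back into its memory, can be sketched in a few lines of Python. This is an illustrative single-worker sketch under our own naming, not the paper's implementation; the deterministic `top_k` stands in for the composed $QComp_k$ operator, and signs are simplified so that the "displacement" is a sum of learning-rate-scaled gradients:

```python
def top_k(u, k):
    """Deterministic Top_k sparsifier (stands in for the composed QComp_k)."""
    keep = set(sorted(range(len(u)), key=lambda i: abs(u[i]), reverse=True)[:k])
    return [u[i] if i in keep else 0.0 for i in range(len(u))]

def local_round(x_hat, grads, lr, memory, k):
    """One synchronization round of a single worker (illustrative sketch).

    x_hat:  local model right after the last synchronization
    grads:  the local stochastic gradients accumulated since then
    memory: error-compensation memory carried across rounds
    Returns (g, memory): the compressed update to transmit, and the
    updated memory m <- m + u - Comp(m + u), i.e. whatever was not sent.
    """
    d = len(x_hat)
    # net local displacement u = sum of lr-scaled gradient steps since last sync
    u = [lr * sum(g[i] for g in grads) for i in range(d)]
    v = [memory[i] + u[i] for i in range(d)]  # add back the accumulated error
    g = top_k(v, k)                           # transmit only k coordinates
    memory = [v[i] - g[i] for i in range(d)]  # retain the compression residual
    return g, memory
```

In the full algorithm a parameter server would average the workers' transmitted updates, apply them to the global model, and broadcast it back; the retained memory is exactly the quantity $m_t^{(r)}$ whose bounded norm (Lemmas 4-5 above) drives the recovery of the vanilla SGD rate.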
In Figure 8, $QTop_k$ refers to QSGD composed with the $Top_k$ operator as in Lemma 1, and when used with the subscript "scaled", we introduce a scaling factor of $(1+\beta_{k,s})$ as in Lemma 2. Let $L$ denote the number of local iterations in between two synchronization indices. Observe that to achieve a given target loss or accuracy, both composed operators perform almost equally in terms of the number of bits transmitted when $L = 0, 4$, but the unscaled operator performs better when $L = 8$. Therefore, we restrict our use of the composed operator in the non-convex setting to the unscaled $QTop_k$ from Lemma 1.

References

[ABC+16] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In OSDI, pages 265-283, 2016.

[AGL+17] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: communication-efficient SGD via gradient quantization and encoding. In NIPS, pages 1707-1718, 2017.

[AH17] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In EMNLP, pages 440-445, 2017.

[AHJ+18] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In NeurIPS, pages 5977-5987, 2018.

[BM11] Francis R. Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In NIPS, pages 451-459, 2011.

[Bot10] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, pages 177-186, 2010.

[BWAA18] J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar. SignSGD: compressed optimisation for non-convex problems. In ICML, pages 559-568, 2018.

[CH16] Kai Chen and Qiang Huo. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In ICASSP, pages 5880-5884, 2016.

[Cop15] Gregory F. Coppola. Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing. PhD thesis, University of Edinburgh, UK, 2015.

[DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

[GMT73] R. Gitlin, J. Mazo, and M. Taylor. On the design of gradient algorithms for digitally implemented adaptive filters. IEEE Transactions on Circuit Theory, 20(2):125-136, March 1973.

[HK14] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1):2489-2512, 2014.

[HM51] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.

[HZRS16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.

[KB15] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[Kon17] Jakub Konecný. Stochastic, distributed and federated optimization for machine learning. CoRR, abs/1707.01155, 2017.

[KRSJ19] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U. Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In ICML, pages 3252-3261, 2019.

[KSJ19] Anastasia Koloskova, Sebastian U. Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In ICML, pages 3478-3487, 2019.

[LBBH98] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[LHM+18] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In ICLR, 2018.

[LHS15] Maksim Lapin, Matthias Hein, and Bernt Schiele. Top-k multiclass SVM. In NIPS, pages 325-333, 2015.

[MMR+17] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In AISTATS, pages 1273-1282, 2017.

[MPP+17] H. Mania, X. Pan, D. S. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization, 27(4):2202-2229, 2017.

[NJLS09] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

[NNvD+18] Lam M. Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtárik, Katya Scheinberg, and Martin Takác. SGD and Hogwild! convergence without the bounded gradients assumption. In ICML, pages 3747-3755, 2018.

[RB93] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In IEEE International Conference on Neural Networks, pages 586-591 vol. 1, March 1993.

[RRWN11] Benjamin Recht, Christopher Ré, Stephen J. Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, pages 693-701, 2011.

[RSS12] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.

[SB18] A. Sergeev and M. D. Balso. Horovod: fast and easy distributed deep learning in TensorFlow. CoRR, abs/1802.05799, 2018.

[SCJ18] S. U. Stich, J. B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In NeurIPS, pages 4452-4463, 2018.

[SFD+14] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In INTERSPEECH, pages 1058-1062, 2014.

[SSS07] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, pages 807-814, 2007.

[Sti19] Sebastian U. Stich. Local SGD converges fast and communicates little. In ICLR, 2019.

[Str15] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH, pages 1488-1492, 2015.

[SYKM17] A. Theertha Suresh, F. X. Yu, S. Kumar, and H. B. McMahan. Distributed mean estimation with limited communication. In ICML, pages 3329-3337, 2017.

[TH12] T. Tieleman and G. Hinton. RMSprop. Coursera: Neural Networks for Machine Learning, Lecture 6.5, 2012.

[WHHZ18] J. Wu, W. Huang, J. Huang, and T. Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. In ICML, pages 5321-5329, 2018.

[WJ18] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. CoRR, abs/1808.07576, 2018.

[WSL+18] H. Wang, S. Sievert, S. Liu, Z. B. Charles, D. S. Papailiopoulos, and S. Wright. ATOMO: communication-efficient learning via atomic sparsification. In NeurIPS, pages 9872-9883, 2018.

[WWLZ18] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In NeurIPS, pages 1306-1316, 2018.

[WXY+17] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In NIPS, pages 1508-1518, 2017.

[WYL+18] Tianyu Wu, Kun Yuan, Qing Ling, Wotao Yin, and Ali H. Sayed. Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks, 4(2):293-307, 2018.

[YJY19] Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In ICML, pages 7184-7193, 2019.

[YYZ19] Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In AAAI, pages 5693-5700, 2019.

[ZDJW13] Y. Zhang, J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In NIPS, pages 2328-2336, 2013.

[ZDW13] Y. Zhang, J. C. Duchi, and M. J. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14(1):3321-3363, 2013.

[ZSMR16] Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Parallel SGD: when does averaging help? CoRR, abs/1606.07365, 2016.