Biased Compression in Gradient Coding for Distributed Learning
JOURNAL OF LATEX CLASS FILES

Chengxi Li, Member, IEEE, Ming Xiao, Senior Member, IEEE, and Mikael Skoglund, Fellow, IEEE

Abstract—Communication bottlenecks and the presence of stragglers pose significant challenges in distributed learning (DL). To deal with these challenges, recent advances leverage unbiased compression functions and gradient coding. However, the significant benefits of biased compression remain largely unexplored. To close this gap, we propose Compressed Gradient Coding with Error Feedback (COCO-EF), a novel DL method that combines gradient coding with biased compression to mitigate straggler effects and reduce communication costs. In each iteration, non-straggler devices encode local gradients from redundantly allocated training data, incorporate prior compression errors, and compress the results using biased compression functions before transmission. The server aggregates these compressed messages from the non-stragglers to approximate the global gradient for model updates. We provide rigorous theoretical convergence guarantees for COCO-EF and validate its superior learning performance over baseline methods through empirical evaluations. As far as we know, we are among the first to rigorously demonstrate that biased compression has substantial benefits in DL when gradient coding is employed to cope with stragglers.

I. Introduction

Distributed learning (DL) is emerging as an important paradigm for training machine learning models in a distributed fashion, which increases computation efficiency by using computational resources on edge devices simultaneously [1]–[3]. Under the DL paradigm, multiple devices coordinate with a central server. Prior to training, the training data are divided into subsets and allocated to the devices so that each device holds one or a few of the subsets [4].
Afterward, the training is implemented over multiple iterations. In each training iteration, the server transmits the current global model to the devices. With this global model, the devices compute local gradients on the local subsets and transmit them to the server. After aggregating the gradients received from the devices, the server updates the global model [5], [6]. Despite these benefits, two practical constraints are encountered in DL problems, which are introduced as follows. The first is the communication bottleneck caused by the transmission of a large number of high-dimensional, real-valued vectors from the devices to the server, given limited communication resources [7]–[11]. Existing works have aimed to address the communication bottleneck by reducing the communication overhead in each iteration, which is attained by using various compression functions to compress the gradients into specific vectors that can be represented by fewer bits. These compression functions can be classified into two categories, unbiased and biased compression functions, depending on whether the output of the compression function is an unbiased or a biased estimate of the vector before compression. For example, stochastic quantization [12], [13] and amplified rand-K sparsification [14] are unbiased compression functions, while sign-bit quantization [15] and top-K sparsification [16]–[18] are biased compression functions.

(C. Li, M. Xiao and M. Skoglund are with the Division of Information Science and Engineering, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, 10044 Stockholm, Sweden; e-mail: chengxli@kth.se; mingx@kth.se; skoglund@kth.se. Corresponding author: Chengxi Li.)
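To make the two categories concrete, the following minimal sketch (our own illustration, not code from the paper) implements one compressor of each kind: amplified rand-K sparsification, whose output equals the input in expectation, and top-K sparsification, which is biased but keeps the largest-magnitude entries.

```python
import numpy as np

def rand_k(g, k, rng):
    """Amplified rand-K sparsification: keep k uniformly random coordinates,
    scaled by D/k so that E[C(g)] = g (unbiased)."""
    d = g.size
    out = np.zeros_like(g)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = g[idx] * (d / k)
    return out

def top_k(g, k):
    """Top-K sparsification: keep the k largest-magnitude coordinates
    (biased: E[C(g)] != g, but a smaller per-call approximation error)."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out
```

Averaged over its internal randomness, `rand_k` recovers the input exactly, while a single call to `top_k` typically incurs a much smaller error than a single call to `rand_k`, reflecting the information-retention advantage of biased compression discussed below.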
Compared to biased compression functions, unbiased compression functions are more extensively studied and understood, and they guarantee convergence to a stationary point in many DL tasks [12]–[14], [19]–[21]. However, biased compression has certain advantages over unbiased compression when both are evaluated in terms of their average capacity to retain information in the gradients [22], because biased compression introduces a smaller approximation error during the compression process. Furthermore, another drawback of unbiased compression in DL is that, when stochastic noise contaminates the gradients, unbiased compression suffers from a linear deterioration in convergence performance [23]. Consequently, biased compression functions have been adopted in practice to achieve superior empirical performance compared to unbiased compression functions, with a remedy known as the error feedback mechanism used to ensure convergence [24], [25]. The error feedback mechanism was originally used in [26], where it was empirically shown that, when training deep neural networks, the loss of training accuracy caused by quantization error can be compensated by using error feedback. In [27], error feedback was analyzed for the one-device case and later extended to a distributed version with multiple devices in [22]. With this mechanism, in each iteration, each device transmits the compressed local gradient to the server and stores the compression error locally; the stored error is used in the next iteration to modify the local gradient, correcting the bias in compression and guaranteeing the correct direction of the model update [22].
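A minimal single-device sketch of the error feedback mechanism (our own illustration in the spirit of [27]; top-K stands in for a generic biased compressor, and `ef_step` is a hypothetical helper name):

```python
import numpy as np

def top_k(g, k):
    """Biased compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def ef_step(theta, grad, e, gamma, k):
    """One error-feedback step: compress the stepsize-scaled gradient plus
    the carried-over error, apply the compressed update, and store the
    new residual for the next iteration."""
    v = gamma * grad + e          # add the stored compression error back
    c = top_k(v, k)               # biased compression
    return theta - c, v - c       # updated model, updated error memory
```

The defining invariant is that the compressed trajectory plus the current error memory equals the uncompressed sum of scaled gradients, so no gradient information is permanently lost; it is only applied with delay.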
(Note that this linear deterioration in convergence performance still occurs even when unbiased compression is combined with gradient-difference compression to compensate for the compression error [23].)

Second, in practice, significant delays may occur on some devices, known as stragglers, when computing and communicating to the server [28], [29]. This issue becomes pronounced when a large number of cheap devices are deployed to reduce the overall system cost [30]. In this case, it is more desirable to wait only for messages from the non-stragglers, the devices that respond promptly, to ensure efficient training. As a result, in each iteration, only the non-stragglers participate, which degrades the learning performance due to missing information from the stragglers. To deal with the negative impact of stragglers in DL, gradient coding techniques have been developed. For example, in [28], an exact gradient coding strategy is proposed, wherein the training data are divided into subsets, replicated, and allocated redundantly to devices in a carefully planned way. In each training iteration, each non-straggler device transmits a coded vector to the server, which is a single linear combination of its local gradients computed from the local subsets. By carefully designing the encoding strategies at the devices and the decoding strategies at the server, the true global gradient can be recovered exactly without knowing which devices are stragglers. To enhance flexibility in coping with variability in the number of stragglers over iterations, stochastic gradient coding is proposed in [31]. In this approach, the server allocates the training subsets to the devices with a low level of redundancy according to a pairwise balanced scheme. The devices compute local gradients on their subsets and send a linear combination of these gradients.
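The encoding step of stochastic gradient coding can be sketched as follows (our own illustration; `coded_messages` is a hypothetical helper name). Scaling each local gradient by 1/(d_k(1−p)) makes the sum of the non-stragglers' messages an unbiased estimate of the global gradient when each device responds independently with probability 1−p.

```python
import numpy as np

def coded_messages(grads, alloc, p):
    """Stochastic-gradient-coding messages: device i sends the sum over
    its subsets k of grad_k / (d_k * (1 - p)), where d_k is the number
    of devices holding subset k (column sum of the allocation matrix)."""
    n, m = alloc.shape
    d = alloc.sum(axis=0)
    return [sum(grads[k] / (d[k] * (1 - p))
                for k in range(m) if alloc[i, k])
            for i in range(n)]
```

Since each device survives with probability 1−p, the expected aggregate is (1−p) times the sum of all messages, which collapses exactly to the full gradient: each subset k appears d_k times with weight 1/(d_k(1−p)).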
The server aggregates messages from the non-stragglers to approximately recover the global gradient and updates the model. Although these methods can effectively mitigate the negative impact of stragglers, they suffer from a significant communication bottleneck, because the non-straggler devices all transmit high-dimensional, real-valued dense vectors to the server in each iteration, which induces a heavy communication overhead [32].

There is very little work that aims to address both the communication bottleneck and the issue of stragglers simultaneously. One recent effort tackles this challenge. In [32], a DL method based on 1-bit gradient coding is proposed, where locally computed gradients are encoded and compressed into 1-bit vectors via unbiased stochastic quantization. These vectors are then transmitted from the non-straggler devices to the server. Based on the gradient coding technique under data allocation redundancy, the straggler effects can be mitigated effectively in [32]. Meanwhile, by transmitting 1-bit vectors instead of real-valued ones, the communication overhead can be reduced compared to the prior works. However, the potential advantages of biased compression functions are not exploited in [32], which could further enhance learning performance under the constraints of communication and stragglers.

In this work, to exploit the advantages of gradient coding and biased compression functions in DL under the communication bottleneck with stragglers, we propose a new DL method, compressed gradient coding with error feedback (COCO-EF). COCO-EF is a meta-algorithm, which can be applied with any biased compression function. In COCO-EF, before the training starts, the training data are divided into subsets and allocated to the devices redundantly in a pairwise balanced scheme, motivated by stochastic gradient coding in [31].
In each training iteration, the server broadcasts the current global model to all devices, and the devices compute the local gradients based on the local subsets of the training data. After that, each non-straggler device encodes the local gradients into a single vector and incorporates the compression error stored from the previous iterations into this coded vector. By using a specific type of biased compression function, the coded vector incorporating the error is compressed and transmitted to the server to reduce the communication overhead. After receiving the messages from the non-stragglers, based on the redundancy of the training data allocation, the server approximately reconstructs the global gradient for the model update. We analyze the convergence performance of COCO-EF theoretically for smooth loss functions and show that it can attain a convergence rate of $\mathcal{O}(1/\sqrt{T})$. Finally, we provide various numerical results to demonstrate the superiority of the proposed method compared to the baselines and verify our theoretical findings.

Compared with existing works related to this topic, the differences between our approach and prior studies are summarized as follows:

• Both our method and [31] adopt a stochastic gradient coding scheme to deal with the stragglers. However, in [31], local gradients are encoded and transmitted to the server without any compression, leading to a substantial communication burden. In contrast, in our proposed method, compressed messages are transmitted by the devices to reduce the communication overhead.

• Relative to [32], both our work and [32] compress coded vectors under the same stochastic gradient coding scheme before transmission to the server, in order to reduce the communication overhead and to address the stragglers. However, [32] employs a specific type of unbiased compression function.
In contrast, our approach leverages biased compression functions to achieve further improvements in communication reduction in the presence of stragglers.

• Similar to [22], [27], our method also exploits the advantages of biased compression functions with the error feedback mechanism to reduce the communication overhead for DL tasks. However, [22], [27] do not consider stragglers and incorporate the compression error directly into the local gradients. With this operation, their learning performance degrades significantly in the presence of stragglers due to the absence of information from these stragglers. Considering this, how to exploit the potential of biased compression functions with the error feedback mechanism to reduce communication overhead in the presence of stragglers remains an open problem. We address these challenges thoroughly in this work. In the presence of potential stragglers, our method combines gradient coding with biased compression and error feedback to handle stragglers while simultaneously reducing communication overhead. In this scheme, the compression error is incorporated into the coded gradients on the devices, which is a very different operation compared with [22], [27], resulting in a substantially different analysis and empirically superior learning performance.

Based on that, our contributions are stated as follows:

• We propose a new DL method, COCO-EF, to improve the learning performance of previous works under the challenges of the communication bottleneck and stragglers. On one hand, leveraging redundant training data allocation and the gradient coding scheme, COCO-EF compensates for missing information from stragglers by utilizing messages from non-stragglers, thus mitigating the negative impact of stragglers.
On the other hand, by encoding and compressing local gradients using biased compression functions before transmission to the server, the communication overhead is reduced significantly. Specifically, to offset the information loss and the bias in compression, an error feedback mechanism is incorporated into COCO-EF, ensuring satisfactory learning performance despite the reduced communication overhead.

• We rigorously characterize the convergence performance of the proposed method for smooth loss functions with constant learning rates, and show that it achieves a convergence rate of $\mathcal{O}(1/\sqrt{T})$, which improves upon the $\mathcal{O}(1/T^{1/4})$ rate established in [32].

• We present extensive numerical results to demonstrate the superiority of the proposed method over the baselines and to validate our theoretical findings, which show that COCO-EF attains significantly better learning performance under the same communication overhead in the presence of stragglers.

The rest of this paper is organized as follows. In Section II, the problem model is introduced. In Section III, we propose our method. In Section IV, we analyze the performance of the proposed method. Numerical results are provided and discussed in Section V. The conclusions are presented in Section VI. We provide all missing proofs in the appendices.

II. Problem Model

The considered problem is formally formulated as follows. Suppose there are N devices and a central server, which collaborate to train a model with a training dataset W composed of M subsets: W = {W_1, ..., W_M}. This is equivalent to solving the following optimization problem [31], [32]:

\arg\min_{\theta \in \mathbb{R}^D} F(\theta),   (1)

where θ is the model parameter vector, and F(θ) represents the overall training loss expressed as

F(\theta) = \sum_{k=1}^{M} f_k(\theta).   (2)

In (2), f_k is the training loss associated with subset W_k.
Typically, under the DL framework, before the training starts, the subsets of W are allocated to the devices [31]–[33]. Here, the data allocation is determined by the system designer and is therefore known to the server and all devices. Afterward, the training is implemented over multiple iterations. In each iteration, the server broadcasts the global model to all devices, and each device computes the local gradients based on its local training data and transmits them to the server. After aggregating the local gradients from the devices, the server computes the global gradient and updates the global model accordingly, which ends the current iteration [5], [6].

In practice, some stragglers among the devices may suffer from significant delays during local computations or communication with the server due to various incidents. In such cases, only the non-stragglers participate in the training [28]. In the absence of detailed prior information about stragglers, it is reasonable to assume that, in each iteration, any device can become a straggler with probability p, and that the straggler behavior is independent both across iterations and among different devices [10], [31], [32], [34]. It is intuitive that the missing information from the stragglers can degrade the learning performance compared to cases where no stragglers are present. Mitigating the negative impact of stragglers is necessary for DL in various applications. Additionally, considering that in each iteration multiple non-straggler devices transmit high-dimensional, real-valued vectors to the server, and that communication resources in the DL system are limited, it is crucial to reduce the communication overhead while maintaining satisfactory learning performance [32].

III. COCO-EF: Compressed Gradient Coding with Error Feedback

In this section, we propose a new method to cope with the problem introduced in Section II.
Before the training, the subsets in W = {W_1, ..., W_M} are allocated to the devices in a pairwise balanced scheme, as done in the stochastic gradient coding scheme [31]. This scheme is adopted since it can be easily approximated by a completely random distribution of the training subsets in practice. In this case, subset W_k is allocated to d_k devices, ∀k, and the number of devices that hold both W_{k_1} and W_{k_2} is d_{k_1} d_{k_2} / N for k_1 ≠ k_2. Based on that, we can use a matrix S to represent the allocation of the training subsets among the devices, with s(i,k) being the (i,k)-th element. In S, s(i,k) = 1 indicates that W_k is allocated to device i, while s(i,k) = 0 implies the opposite. Each device i initializes its local error vector as e_i^0 = 0, ∀i, which will be used to store the compression error during the training process.

During the training, in each iteration t, the server broadcasts the current global model θ^t to all devices. After that, each non-straggler device i computes the local gradients corresponding to its local subsets and obtains {∇f_k(θ^t) | k ∈ {1, ..., M}, s(i,k) = 1}. With these local gradients, device i encodes them as

g_i^t = \sum_{k \in \mathcal{S}_i} \frac{1}{d_k (1-p)} \nabla f_k(\theta^t),   (3)

where S_i ≜ {k | s(i,k) = 1}. After that, motivated by the error feedback mechanism in [27], device i incorporates its local error vector e_i^t into the coded vector and compresses the result as

\hat{g}_i^t = \mathcal{C}\left( \gamma g_i^t + e_i^t \right),   (4)

where C : R^D → R^D is a specific type of biased compression function, and γ denotes the learning rate. Here, the error vector represents the compression error accumulated over the previous iterations before obtaining the global model θ^t. By adding the error back into the coded gradient, the compression bias from the previous iterations can be compensated, which guarantees the correct direction of the model update.
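The completely random allocation that approximates the pairwise balanced scheme can be sketched as follows (our own illustration; `random_allocation` is a hypothetical helper name). With each subset placed on d devices uniformly at random, a pair of subsets is held jointly by about d·d/N devices on average, matching the pairwise balanced property.

```python
import numpy as np

def random_allocation(n, m, d, rng):
    """Allocate each of m subsets to d of the n devices uniformly at
    random; returns the allocation matrix S with S[i, k] = 1 iff
    subset k is held by device i."""
    S = np.zeros((n, m), dtype=int)
    for k in range(m):
        S[rng.choice(n, size=d, replace=False), k] = 1
    return S
```

Averaging the co-occurrence count of a fixed pair of subsets over many independent allocations recovers d²/n empirically.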
(In addition to error feedback, other techniques can be used to mitigate the negative impact of compression errors, such as compression of gradient differences [23]. However, since error feedback is specifically designed for biased compression [22], [27], whereas compression of gradient differences is designed for unbiased compression, we adopt error feedback in our method.)

Note that the length of the compressed vector ĝ_i^t is the same as that of the input vector of the compression function, and the communication is compressed based on the fact that fewer bits are required to transmit the compressed vector than the original vector. Some examples of biased compression functions C are listed as follows:

• Grouped sign-bit quantization. The input vector is g ∈ R^D. We partition the index set {1, ..., D} into M_0 non-overlapping groups. Group m is denoted by I_m with cardinality |I_m|, for m = 1, ..., M_0. For each group m, we extract the subvector g_m = (g_j)_{j ∈ I_m} ∈ R^{|I_m|}, where g_j is the j-th element of the input vector g. We perform the following compression independently on each group:

\mathcal{C}_m(g_m) = \mathrm{sign}(g_m)\, \frac{\|g_m\|_1}{|I_m|}, \quad m = 1, \dots, M_0,   (5)

where sign(·) is the sign operator and ‖·‖_1 is the ℓ_1 norm. The output of grouped sign-bit quantization is obtained by concatenating the outputs in (5):

\mathcal{C}(g) = \mathcal{C}_1(g_1) \,\|\, \mathcal{C}_2(g_2) \,\|\, \cdots \,\|\, \mathcal{C}_{M_0}(g_{M_0}).   (6)

Note that the concatenation preserves the same element order as in the original input vector g. When M_0 = 1, grouped sign-bit quantization reduces to sign-bit quantization.

• Top-K sparsification. In this case, the output of the compression function retains only the K elements of the input with the largest magnitudes while zeroing out the others.
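Grouped sign-bit quantization can be sketched as follows (our own illustration; `grouped_sign` is a hypothetical helper name): each group is replaced by its signs scaled by the group's mean absolute value, as in (5)–(6).

```python
import numpy as np

def grouped_sign(g, groups):
    """Grouped sign-bit quantization: group m of the input is replaced
    by sign(g_m) * ||g_m||_1 / |I_m|, with element order preserved.
    `groups` is a partition of the index set {0, ..., D-1}."""
    out = np.empty_like(g)
    for idx in groups:
        sub = g[idx]
        out[idx] = np.sign(sub) * (np.abs(sub).sum() / len(idx))
    return out
```

With a single group (M_0 = 1) this reduces to plain sign-bit quantization. Per group, the identity ‖C_m(g_m) − g_m‖² = ‖g_m‖² − ‖g_m‖_1²/|I_m| shows the compression error is bounded by (1 − 1/|I_m|)‖g_m‖².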
After the compression, the error vector at non-straggler device i is updated as

e_i^{t+1} = \gamma g_i^t + e_i^t - \mathcal{C}\left( \gamma g_i^t + e_i^t \right),   (7)

which is the difference between the original and the compressed vector. Let us use I_i^t = 1 to indicate that device i is not a straggler in iteration t, while I_i^t = 0 indicates the opposite, where it holds that

\Pr\left( I_i^t = 1 \right) = 1 - p, \quad \Pr\left( I_i^t = 0 \right) = p,   (8)

based on the straggler model introduced in Section II. Different from the non-stragglers, for a straggler device i, where I_i^t = 0, no message is successfully transmitted to the server. (Note that the straggling behavior can be caused either by delayed or failed computations or by communication errors.) In that case, the error vectors on the stragglers remain unchanged, i.e., e_i^{t+1} = e_i^t. After receiving the messages from all non-stragglers, the server aggregates them as

\hat{g}^t = \sum_{\{i \,|\, I_i^t = 1\}} \hat{g}_i^t,   (9)

which is an approximate version of the global gradient. Finally, the server updates the global model as

\theta^{t+1} = \theta^t - \hat{g}^t.   (10)

We describe the proposed method as Algorithm 1 and provide its flow chart in Fig. 1.

IV. Convergence Analysis

In this section, the convergence performance of COCO-EF is analyzed. Before providing the main theorem, some widely used assumptions are stated as follows.

Assumption 1. The training loss function F is L-smooth [27], [35], which implies

F(x) \le F(y) + \langle \nabla F(y), x - y \rangle + \frac{L}{2} \|x - y\|^2, \quad \forall x, y,   (11)

\|\nabla F(x) - \nabla F(y)\| \le L \|x - y\|, \quad \forall x, y.   (12)

Here, ‖·‖ denotes the ℓ_2-norm, i.e., ‖x‖² = Σ_{n=1}^D x_n², where x_n is the n-th element of x ∈ R^D.

Assumption 2. The heterogeneity among {W_1, ..., W_M} is bounded [35]–[38], which indicates

\left\| \nabla f_k(\theta) - \frac{1}{M} \nabla F(\theta) \right\|^2 \le \beta^2, \quad \forall k, \forall \theta.   (13)

Assumption 3. The training loss is lower bounded by some constant F^*, which means [39], [40]

F(\theta) \ge F^*, \quad \forall \theta.   (14)
Figure 1. The flow chart of COCO-EF.

Algorithm 1: COCO-EF
Input: initial parameter vector θ^0, stepsize γ, and number of iterations T
1. Before the training: the training subsets are allocated to the devices in a pairwise balanced scheme.
2. During training:
Initialize: t = 0, local error vectors e_i^0 = 0, ∀i
while t ≤ T do
    For the server: broadcast the global model θ^t to all devices;
    In parallel for all devices i ∈ {1, ..., N}:
        if I_i^t = 1 then
            for k ∈ {1, ..., M} do
                if s(i,k) = 1 then compute the local gradient ∇f_k(θ^t);
            end
            Encode the local gradients into g_i^t as in (3);
            Form γ g_i^t + e_i^t and compress it to obtain ĝ_i^t as in (4);
            Update e_i^{t+1} as in (7);
            Send ĝ_i^t to the server;
        else
            e_i^{t+1} = e_i^t;
        end
    For the server: receive the messages and compute ĝ^t as in (9); update the global model as in (10);
    t = t + 1;
end

Assumption 4. The compression discrepancy is bounded by [24]

\mathbb{E}\left[ \left\| \sum_{i=1}^{N} \left[ x_i - \mathcal{C}(x_i) \right] \right\|^2 \right] \le q_A \left\| \sum_{i=1}^{N} x_i \right\|^2, \quad \forall x_i,   (15)

where q_A < 1 is a constant.

Assumption 5. The biased compression function C satisfies [27]

\mathbb{E}\left[ \left\| \mathcal{C}(x) - x \right\|^2 \right] \le \delta \|x\|^2, \quad \forall x,   (16)

where 0 ≤ δ < 1 is a constant.

Note that, in Assumption 4 and Assumption 5, the expectation is taken over the randomness of the compression function C, which may be random.
Although the grouped sign-bit quantization and top-K sparsification introduced in Section III are both deterministic compression functions, we retain the expectation operator to make these assumptions applicable to a broader variety of biased compression functions. Based on Assumption 4 and Assumption 5, we can derive the following propositions.

Proposition 1. The parameter q_A in Assumption 4 depends on the value of δ, where a larger value of δ indicates a higher level of information loss caused by the compression. To illustrate this, consider the special case where δ is very close to zero. In this case, the information loss due to compression is negligible, and q_A would be close to zero as well.

Proposition 2. For the grouped sign-bit quantization introduced in Section III, the constant δ defined in Assumption 5 can be expressed as δ = 1 − min_{m ∈ {1,...,M_0}} (1/|I_m|) [24]. For top-K sparsification, δ is given as δ = 1 − K/D [24], where D is the length of the input vector of the compression function.

In [24], it has been demonstrated that the empirical value of q_A remains well bounded by 1 during training, whether sign-bit quantization or top-K sparsification is adopted as the compression function in the DL setting. Next, based on Assumptions 1–5, we present two lemmas to aid the derivation of the main theorem.

The constants ξ_1 and ξ_2, used in Lemma 2, are defined as

\xi_1 \triangleq \frac{1}{1 - \left[ \frac{(1-p)(2\delta+1)}{2} + p \right]} \left\{ \frac{8\beta^2 p \delta \vartheta}{1-p} + \frac{(4\delta+2)\, 4\delta p \beta^2 \vartheta}{2(1-p)\delta + p} \cdot \frac{1}{1 - \left[ 2(1-p)\delta + p \right]} \right\},   (20)

\xi_2 \triangleq \frac{4p\delta}{1-p} \left( \frac{1}{N} + \frac{2\vartheta}{M^2} \right) + \frac{q_A (2\delta+1)}{(1-p)(2\delta+1-2q_A)} + \frac{\left[ \frac{(1-p)(2\delta+1)}{2} + p \right] \frac{(4\delta+2)\, 2\delta p}{2(1-p)\delta + p} \left( \frac{1}{N} + \frac{2\vartheta}{M^2} \right)}{\frac{(1-p)(2\delta+1)}{2} + p - \left[ 2(1-p)\delta + p \right]} \cdot \frac{1}{1 - \left[ \frac{(1-p)(2\delta+1)}{2} + p \right]}.   (21)

Lemma 1.
The following bound can be provided:

\mathbb{E}\left[ \left\| \sum_{i=1}^{N} I_i^t g_i^t \right\|^2 \,\middle|\, \mathcal{F}^t \right] \le \frac{2p\beta^2\vartheta}{1-p} + \left[ \frac{p}{(1-p)N} + 1 + \frac{2p\vartheta}{(1-p)M^2} \right] \left\| \nabla F(\theta^t) \right\|^2,   (17)

where E[· | F^t] is the expectation conditioned on iterations 0, ..., t−1, and

\vartheta \triangleq \sum_{k=1}^{M} \left( \frac{1}{d_k} - \frac{1}{N} \right).   (18)

Proof. Please see Appendix A.

Lemma 2. For δ < 0.5 and q_A < (2δ+1)/2, the error vectors can be bounded as

\sum_{t=0}^{T} \mathbb{E}\left[ \sum_{i=1}^{N} \left\| e_i^{t+1} \right\|^2 \right] \le (T+1)\gamma^2 \xi_1 + \gamma^2 \xi_2 \sum_{t=0}^{T} \mathbb{E}\left[ \left\| \nabla F(\theta^t) \right\|^2 \right],   (19)

where ξ_1 and ξ_2 are defined in (20) and (21).

Proof. Please see Appendix B.

Based on Lemma 1 and Lemma 2, the convergence performance of COCO-EF is characterized in Theorem 1.

Theorem 1 (Convergence performance of COCO-EF). Based on Assumptions 1–5, for δ < 0.5 and q_A < (2δ+1)/2, with a constant learning rate γ = φ/√(T+1), φ > 0, and for T > (ε_0 φ)² − 1, COCO-EF converges as

\frac{1}{T+1} \sum_{t=0}^{T} \mathbb{E}\left[ \left\| \nabla F(\theta^t) \right\|^2 \right] \le \frac{\varepsilon_1 \phi}{\sqrt{T+1} - \varepsilon_0 \phi} + \frac{F(\theta^0) - F^*}{\phi\sqrt{T+1} - \varepsilon_0 \phi^2},   (22)

where

\varepsilon_0 \triangleq \frac{L}{2} \left[ \frac{p}{(1-p)N} + 1 + \frac{2p\vartheta}{(1-p)M^2} \right] + \frac{L \xi_2 \beta \sqrt{p\vartheta}}{\sqrt{2\xi_1 (1-p)}} + \frac{L \sqrt{\xi_1 (1-p)}}{2\sqrt{2p\beta^2\vartheta}} \left[ \frac{p}{(1-p)N} + 1 + \frac{2p\vartheta}{(1-p)M^2} \right],   (23)

\varepsilon_1 \triangleq \sqrt{ \frac{2L^2 \xi_1 p \beta^2 \vartheta}{1-p} } + \frac{L p \beta^2 \vartheta}{1-p},   (24)

and T is the total number of iterations.

Proof. Please see Appendix C.

Note that the condition in Theorem 1, i.e., δ < 0.5 and q_A < (2δ+1)/2, indicates that the information loss caused by compression is upper-bounded. These conditions can be readily satisfied by adjusting the compression level of the compression functions, which can be achieved by tuning their parameters, such as K in top-K sparsification and M_0 together with the group sizes in grouped sign-bit quantization. In Theorem 1, (22) can be rewritten as

\frac{1}{T+1} \sum_{t=0}^{T} \mathbb{E}\left[ \left\| \nabla F(\theta^t) \right\|^2 \right] = \mathcal{O}\left[ \frac{\varepsilon_1 \phi}{\sqrt{T+1}} + \frac{F(\theta^0) - F^*}{\phi \sqrt{T+1}} \right],   (25)

where a smaller value of ε_1 indicates better learning performance.
Based on the expression of ε_1 provided in (24), we can make the following observations:

• For a smaller value of the straggler probability p, indicating that devices are less likely to become stragglers, the value of ε_1 decreases, improving the learning performance. This aligns with the intuition that better learning performance can be achieved with less missing information from stragglers.

• If we increase the values of {d_k, k = 1, ..., M}, implying that the training subsets are allocated to the devices more redundantly, the value of ϑ in (18) decreases, leading to a reduction in ε_1. In this case, learning performance improves at the cost of increased computational and storage burdens on the devices.

• With a smaller value of L, indicating that the training loss function is smoother, the value of ε_1 decreases. Noting that the value of L affects only the term ε_1 on the right-hand side of (25), a smaller value of L implies improved learning performance. The rationale is as follows. In COCO-EF, information on the correct direction of the model update, contained in the error vectors, is utilized in a delayed manner in later iterations. For a smoother loss function, the gradients change more gradually, making the delayed information in the error vectors less outdated, which improves learning performance more effectively.

• As the value of β decreases, the value of ε_1 also decreases. In other words, if the training data are divided into more homogeneous subsets, the learning performance can be improved. The rationale is that when the subsets are more similar, missing information from stragglers can be more effectively compensated by information from non-stragglers, given that the local gradients computed from different subsets are more similar.
• When δ = 0, corresponding to the case without communication compression, ε_1 = 0, and the first term on the right-hand side of (22) equals zero. In this case, the learning performance represents the optimal performance bound for COCO-EF.

V. Numerical Results

In this section, the performance of COCO-EF is evaluated on different learning tasks.

A. Linear regression task

We consider a linear regression task with a synthetic dataset. Suppose there are N = 100 devices, and the overall dataset W is divided into M = 100 subsets, where each subset W_k contains a single data sample {z_k, y_k}. Here, y_k ∈ R represents the label of z_k ∈ R^100. The elements of {z_1, ..., z_M} are sampled independently from the normal distribution N(0, 100). To generate each y_k, we first create a random vector θ̂ consisting of 100 elements drawn from the standard normal distribution. Each y_k is then generated according to the distribution y_k ~ N(⟨z_k, θ̂⟩, 1), ∀k. Based on that, the loss function is given as

F(\theta) = \sum_{k=1}^{M} f_k(\theta), \quad f_k(\theta) = \frac{1}{2} \left( \langle \theta, z_k \rangle - y_k \right)^2,   (26)

where k = 1, ..., M and θ ∈ R^100. Before the training starts, we allocate the training subsets uniformly at random to the devices, in which case each subset W_k is allocated to d_k devices, ∀k. This can be regarded as an approximation of training data allocation in a pairwise balanced scheme, as pointed out in [31]. The parameter vector is initialized as θ^0, where each element is drawn independently from the standard normal distribution.

Figure 2. Training loss as a function of the number of iterations for COCO-EF and the baselines with various compression functions. For each method, we run 5 independent trials. The solid curve shows the mean training loss as a function of the number of iterations, and the shaded region represents the standard deviation across trials.
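The synthetic setup above can be sketched as follows (our own illustration, not the paper's experiment code; `make_dataset` and `loss` are hypothetical helper names). Note that N(0, 100) denotes variance 100, i.e., standard deviation 10.

```python
import numpy as np

def make_dataset(m=100, dim=100, rng=None):
    """Synthetic linear-regression data as in Section V-A: entries of
    z_k drawn i.i.d. from N(0, 100), labels y_k ~ N(<z_k, theta_hat>, 1)."""
    if rng is None:
        rng = np.random.default_rng(0)
    theta_hat = rng.standard_normal(dim)       # generating parameter
    Z = rng.normal(0.0, 10.0, size=(m, dim))   # std 10 -> variance 100
    y = Z @ theta_hat + rng.standard_normal(m)
    return Z, y, theta_hat

def loss(theta, Z, y):
    """F(theta) = sum_k 0.5 * (<theta, z_k> - y_k)^2, as in eq. (26)."""
    r = Z @ theta - y
    return 0.5 * float(r @ r)
```

The generating parameter θ̂ nearly minimizes this loss, since only the unit-variance label noise remains at that point.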
First, to verify the superiority of the proposed method, we compare the performance of the following methods:
1) COCO-EF (Sign): the proposed method using sign-bit quantization as the compression function. In this setting, sign-bit quantization reduces to the grouped sign-bit quantization with M_0 = 1, as described in Section III.
2) COCO-EF (Top-K): the proposed method using the top-K sparsification described in Section III as the compression function.
3) Unbiased (Sign): the 1-bit gradient coding method proposed in [32], which combines the advantages of gradient coding and 1-bit stochastic quantization to deal with stragglers under communication bottlenecks in DL.
4) Unbiased (Rand-K): this method extends the method proposed in [32] by using amplified rand-K sparsification as the compression function in place of 1-bit stochastic quantization.
5) Unbiased-diff (Sign): this method combines Unbiased (Sign) with gradient-difference compression [23] to compensate for the compression error.
6) Unbiased-diff (Rand-K): this method combines Unbiased (Rand-K) with gradient-difference compression [23] to compensate for the compression error.
Among the methods above, COCO-EF (Sign) and COCO-EF (Top-K) are different realizations of the proposed method, using two types of biased compression functions. Unbiased (Sign) and Unbiased (Rand-K) are baseline methods designed to address the same problem considered in this paper, but they use unbiased compression functions for communication compression [32]. Unbiased-diff (Sign) and Unbiased-diff (Rand-K) are two baseline methods that combine Unbiased (Sign) and Unbiased (Rand-K), respectively, with the gradient-difference compression in [23], in order to mitigate the negative impact of the compression error on the learning performance.
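For concreteness, the compression functions named above can be sketched as follows (a hedged illustration using common textbook forms; the paper's grouped sign-bit quantization and its exact scaling conventions may differ):

```python
import numpy as np

def sign_compress(x):
    """Biased 1-bit (sign) compression: transmit one sign bit per coordinate
    plus a single norm scalar. One common scaling choice; the paper's grouped
    variant may use a different scale."""
    return (np.linalg.norm(x, 1) / x.size) * np.sign(x)

def topk_compress(x, K):
    """Biased top-K sparsification: keep the K largest-magnitude entries."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -K)[-K:]
    out[idx] = x[idx]
    return out

def randk_compress(x, K, rng):
    """Unbiased (amplified) rand-K: keep K uniformly random entries and
    scale by D/K so that E[C(x)] = x."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=K, replace=False)
    out[idx] = x[idx] * (x.size / K)
    return out
```

The first two functions are biased (their expectation is not x), while rand-K is unbiased by construction, which is the distinction the comparison groups are built around.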
Note that in COCO-EF (Sign), Unbiased (Sign), and Unbiased-diff (Sign), 1-bit vectors are transmitted from the non-stragglers to the server, while in COCO-EF (Top-K), Unbiased (Rand-K), and Unbiased-diff (Rand-K), sparsified vectors are sent by the non-stragglers. Based on this, it is reasonable to compare the learning performance among COCO-EF (Sign), Unbiased (Sign), and Unbiased-diff (Sign), as well as among COCO-EF (Top-K), Unbiased (Rand-K), and Unbiased-diff (Rand-K), to demonstrate the performance gain of the proposed method resulting from the use of biased compression functions in gradient coding. Here, we only compare the proposed method with different realizations of the baseline method in [32] and omit the comparison with the method in [31] mentioned in Section I. This is because the superiority of the method in [32] over [31] has already been thoroughly demonstrated in [32]. It is therefore sufficient for the present work to show the superiority of our method over [32] in order to demonstrate its value over all these prior works.

In Fig. 2, we plot the training loss as a function of the number of iterations, where we set d_k = 5, ∀k, p = 0.2, and K = 2. The learning rate of COCO-EF is fixed at γ = 10^{-5}, and the learning rates of the baselines are fine-tuned to γ = 2 × 10^{-6}, 10^{-5}, 2 × 10^{-6}, and 6 × 10^{-6} for Unbiased (Sign), Unbiased (Rand-K), Unbiased-diff (Sign), and Unbiased-diff (Rand-K), respectively. When selecting the learning rate for each method, we fine-tune this parameter so that the best possible performance is achieved for that method, ensuring a fair comparison. From Fig. 2, we observe that the proposed method achieves a lower training loss after the same number of iterations.
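As a rough check that the two comparison groups are formed at matched communication cost, one can count the bits each message type requires per iteration (a back-of-envelope sketch assuming 32-bit floats and ceil(log2 D)-bit indices; the paper's exact encoding may differ):

```python
import math

D = 100          # model dimension in the linear-regression experiment
K = 2            # sparsification level
FLOAT_BITS = 32  # assumed float width (an assumption, not from the paper)

# 1-bit schemes: one sign bit per coordinate (some variants add one
# scaling scalar per vector, which we omit here).
sign_bits = D * 1

# K-sparse schemes: K values plus K coordinate indices.
sparse_bits = K * (FLOAT_BITS + math.ceil(math.log2(D)))

print(sign_bits, sparse_bits)
```

The point is not the exact numbers but that every method within a group sends the same payload per iteration, so comparing within a group isolates the effect of the compression function itself.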
Since the communication overhead per iteration is identical among COCO-EF (Sign), Unbiased (Sign), and Unbiased-diff (Sign), as well as among COCO-EF (Top-K), Unbiased (Rand-K), and Unbiased-diff (Rand-K), it is clear that the proposed method outperforms the baselines, regardless of whether 1-bit messages or sparsified messages are used. In other words, the proposed method delivers better learning performance under the same communication overhead in the presence of stragglers. This demonstrates that biased compression functions are superior to unbiased compression functions for communication compression in gradient coding for DL under communication bottlenecks with stragglers. Also note that, under the parameter settings used here and according to Proposition 2, the condition δ < 0.5 is not satisfied in our experiments. However, the proposed method still converges empirically under the experimental configurations we consider. This suggests that the theoretical requirement in Theorem 1 is a sufficient but not a necessary condition for convergence.

It is worth noting that when unbiased compression is adopted in the baseline methods, some existing techniques can potentially be used to compensate for the compression error, such as error feedback and gradient-difference compression. However, it has been demonstrated in [25] that error feedback provides little improvement in learning performance when used with unbiased compression in DL settings. To further verify this behavior in our context, we conducted additional simulations by combining Unbiased (Sign) and Unbiased (Rand-K) with error feedback for the considered problem. The results show that the method barely converges under this configuration.

Figure 3. Training loss as a function of the number of iterations for COCO-EF (Sign) under varying values of p (p = 0.1, 0.5, 0.9).
Therefore, we do not include these simulation results in the paper. Based on these observations, it becomes clear that the performance gain of the proposed method arises from the combination of biased compression and error feedback, rather than from the advantage of error feedback alone. In other words, the benefits of error feedback are effective only when used with biased compression, and can therefore be regarded as part of the overall advantage of biased compression in the proposed method. In addition, our experiments show that, in the baseline, the performance gain brought by gradient-difference compression is negligible, and the resulting learning performance is inferior to that of the proposed method. This is due to the inherent drawback of unbiased compression in DL settings: its average ability to retain information is lower than that of biased compression [22].

Next, we investigate the influence of the straggler probability p on the learning performance of the proposed method. To this end, in Fig. 3, we plot the training loss as a function of the number of iterations under different values of p, where COCO-EF (Sign) is implemented. Here we fix d_k = 2, ∀k, and γ = 10^{-5}. From Fig. 3, we observe that with a larger value of p, indicating that devices are more likely to be stragglers, the learning performance degrades to some extent. This aligns with the theoretical analysis in Section IV and our intuition that more stragglers lead to more missing information, which cannot be utilized by the server to update the global model effectively and accurately. However, the performance degradation only becomes noticeable when p is fairly large (i.e., close to 1). This robustness can be attributed to the gradient coding scheme adopted in COCO-EF, which is based on a redundant allocation of training data across devices and thus enhances robustness against stragglers.
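The device-side combination of error feedback and biased compression discussed above can be sketched as a single update step (an illustrative sketch using top-K as the biased compressor; the function names are ours, not the paper's):

```python
import numpy as np

def topk(x, K):
    """Biased top-K sparsification: keep the K largest-magnitude entries."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -K)[-K:]
    out[idx] = x[idx]
    return out

def ef_step(grad, error, gamma, K):
    """One device-side error-feedback step: compress the learning-rate-scaled
    gradient plus the accumulated compression error, transmit the compressed
    vector, and keep the new residual as the next error term."""
    v = gamma * grad + error  # incorporate the past compression error
    msg = topk(v, K)          # biased compression of the result
    new_error = v - msg       # residual, fed back at the next iteration
    return msg, new_error
```

By construction, msg + new_error recovers gamma * grad + error exactly, so no information is discarded permanently; what the compressor drops now is re-injected later, which is the mechanism credited for maintaining the correct update direction.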
To illustrate the influence of data allocation redundancy on robustness to stragglers in COCO-EF, Fig. 4 shows the training loss as a function of the number of iterations for COCO-EF (Sign) under varying values of d_k, with p = 0.9 and γ = 10^{-5}.

Figure 4. Training loss as a function of the number of iterations for COCO-EF (Sign) under varying values of d_k (d_k = 1, 5, 10, 15, 20).

Figure 5. Training loss as a function of the number of iterations for COCO-EF and COCO.

From Fig. 4, it is evident that as d_k increases, indicating greater redundancy in data allocation, the learning performance of COCO-EF improves, which aligns with our analysis in Section IV. The rationale is that with increased redundancy, the missing information from stragglers can be compensated to a larger extent by information from non-stragglers, enhancing learning performance. It is important to note that improving learning performance by increasing redundancy comes at the cost of greater computational and communication burdens on devices, and this trade-off should be considered in practice. Additionally, from Fig. 4, we observe that increasing d_k from 1 to 10 significantly improves learning performance, whereas further increases in d_k beyond 10 result in only marginal improvements. Note that by setting d_k = M, the effects of stragglers can be completely mitigated, since each device holds a copy of the overall training dataset. This suggests that the proposed method can effectively mitigate the impact of stragglers with a reasonable level of data allocation redundancy.
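A crude way to see why moderate redundancy already suppresses straggler effects: under independent straggling with probability p, subset k contributes nothing in an iteration only if all d_k devices holding it straggle simultaneously, which happens with probability p^{d_k} (a back-of-envelope indicator only; it ignores the coding and aggregation details):

```python
# Probability that subset k is entirely missing in one iteration,
# assuming each of its d_k holders straggles independently with probability p.
p = 0.9  # the heavy-straggling setting of Fig. 4
for d_k in [1, 5, 10, 15, 20]:
    print(d_k, p ** d_k)
```

Even at p = 0.9, the loss probability drops geometrically in d_k, which is consistent with the observation that most of the benefit is obtained at moderate redundancy.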
Even when the goal is to fully counteract the effects of stragglers, only a relatively low level of redundancy is necessary.

To verify the necessity of adopting the error feedback mechanism in the proposed method, Fig. 5 shows the training loss as a function of the number of iterations for COCO-EF (Sign), COCO (Sign), COCO-EF (Top-K), and COCO (Top-K), where we set K = 2, p = 0.2, d_k = 5, and γ = 10^{-5}. Here, COCO represents a version of the proposed method without the error feedback mechanism, implemented by fixing {e_i^t = 0, ∀i, ∀t}. COCO-EF (Sign) and COCO (Sign) use the same compression function, as do COCO-EF (Top-K) and COCO (Top-K). From Fig. 5, we observe that the learning performance of COCO-EF (Sign) is significantly better than that of COCO (Sign). Similarly, COCO-EF (Top-K) outperforms COCO (Top-K), with COCO (Top-K) particularly struggling to converge, while the proposed method consistently converges to the stationary point. This suggests that adopting the error feedback mechanism in the proposed method is essential to ensure convergence and to accelerate it by compensating for the compression bias, thus maintaining the correct direction for updating the global model.

In our theoretical analysis and in the numerical settings above, we consider the use of a constant learning rate. However, we are also interested in comparing the learning performance of the proposed method under constant and decaying learning-rate schemes. To assess the influence of these different learning-rate settings, Fig. 6 plots the training loss as a function of the number of iterations, where COCO-EF (Sign) is implemented. Here, we fix p = 0.5 and d_k = 2, ∀k. Under the constant learning-rate setting, the learning rate is fixed at γ = 2 × 10^{-5}. Under the decaying learning-rate setting, the learning rate at iteration t is set to γ_t = 2 × 10^{-5}/√(t + 1).
This ensures a fair comparison between the constant and decaying schemes, as both begin with the same initial learning rate. From Fig. 6, it can be seen that the learning performance is significantly better under the constant learning-rate setting. The rationale behind this phenomenon is as follows. From (4), the encoded vector is scaled by the learning rate at the current iteration t. In contrast, the error term in (4) is accumulated from previous iterations and was scaled by the learning rates used in those earlier iterations. Since the learning rates in the previous iterations are larger than the current learning rate under the decaying learning-rate scheme, the error term in (4) becomes more dominant than the newly encoded vector. This dominance may negatively affect the learning process, as the system tends to overemphasize compensating for past errors rather than striking an appropriate balance between correcting past errors and incorporating the current encoded vectors derived from the latest gradients.

B. Image classification task

In this subsection, we evaluate the performance of COCO-EF on an image classification task using the MNIST dataset [41], where a convolutional neural network is trained. The training loss is defined as the cross-entropy loss, which is a non-convex function. We consider a DL system with N = 100 devices. The MNIST training set contains 60,000 samples representing 10 handwritten digits, which are randomly divided into M = 100 subsets without overlap such that each subset contains the same number of samples. To simulate heterogeneity among the subsets, all samples within each subset represent the same digit, while the digits across different subsets may differ.

Figure 6.
Training loss as a function of the number of iterations for COCO-EF (Sign) under constant and decaying learning-rate schemes.

Before training begins, the subsets are allocated to the devices uniformly and randomly; as a result, each subset is assigned to d_k devices for all k. The straggler probability is fixed at p = 0.6. The MNIST test set, which contains 10,000 samples representing 10 handwritten digits, is used to evaluate the test loss and test accuracy of the trained model.

To demonstrate the superiority of the proposed method, we compare the performance of COCO-EF (Sign) with that of Unbiased (Sign). In Fig. 7, we plot the training loss, training accuracy, test loss, and test accuracy as functions of the number of iterations for both methods under varying values of d_k. The learning rates of COCO-EF (Sign) and Unbiased (Sign) are all fine-tuned to be nearly optimal. Note that, for both methods, the communication overhead per iteration is the same. Therefore, comparing their learning performance under the same number of iterations provides a fair comparison under equal communication overhead in the presence of stragglers. From Fig. 7, it can be seen that the proposed method outperforms the baseline method in all considered settings, demonstrating superior communication efficiency and robustness to stragglers. It can also be observed that the learning performance improves as the value of d_k increases. This observation aligns with our intuition that increasing data allocation redundancy enhances robustness to stragglers by compensating for the information loss caused by straggling devices through the messages received from non-straggling devices.

VI. Conclusions

In this paper, we studied the DL problem with stragglers under the communication bottleneck. We proposed COCO-EF, which combines the benefits of biased compression functions with error feedback and gradient coding.
In COCO-EF, each non-straggler device, holding redundantly allocated training data, encodes its local gradients into a single coded vector. This coded vector is combined with the accumulated compression error before being further compressed with a specified biased compression function and transmitted to the server. Using the received vectors and the redundancy in the training data allocation, the server can approximately reconstruct the global gradient to update the model. We analyzed the convergence performance of COCO-EF theoretically and presented numerical results demonstrating its superior learning performance compared to the baselines under the same communication overhead with stragglers. In our experiments, the number of devices is set to 100, which is practical in many real-world DL settings, such as edge computing systems with tens to hundreds of edge nodes [42], [43]. In the future, we plan to apply the proposed method to more complex applications with a substantially larger number of devices in very large-scale learning systems. In addition, we plan to explore more advanced error feedback mechanisms to further enhance the performance of the proposed method, such as the one designed in [44].

References
[1] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning,” ACM Computing Surveys (CSUR), vol. 53, no. 2, pp. 1–33, 2020.
[2] X. Wang, Y. Han, V. C. Leung, D. Niyato, X. Yan, and X. Chen, “Convergence of edge computing and deep learning: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 869–904, 2020.
[3] A. V. Makkuva, M. Bondaschi, T. Vogels, M. Jaggi, H. Kim, and M. Gastpar, “LASER: Linear compression in wireless distributed optimization,” in International Conference on Machine Learning. PMLR, 2024.
[4] T. Jahani-Nezhad and M. A.
Maddah-Ali, “Optimal communication-computation trade-off in heterogeneous gradient coding,” IEEE Journal on Selected Areas in Information Theory, vol. 2, no. 3, pp. 1002–1011, 2021.
[5] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.
[6] X. Cao, T. Başar, S. Diggavi, Y. C. Eldar, K. B. Letaief, H. V. Poor, and J. Zhang, “Communication-efficient distributed learning: An overview,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 4, pp. 851–873, 2023.
[7] E. Gorbunov, F. Hanzely, and P. Richtárik, “A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 680–690.
[8] Z. Tang, S. Shi, W. Wang, B. Li, and X. Chu, “Communication-efficient distributed deep learning: A comprehensive survey,” arXiv preprint arXiv:2003.06307, 2020.
[9] L. Qian, P. Yang, M. Xiao, O. A. Dobre, M. Di Renzo, J. Li, Z. Han, Q. Yi, and J. Zhao, “Distributed learning for wireless communications: Methods, applications and challenges,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 3, pp. 326–342, 2022.
[10] C. Li and M. Skoglund, “Decentralized learning based on gradient coding with compressed communication,” IEEE Transactions on Signal Processing, 2024.
[11] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright, “ATOMO: Communication-efficient learning via atomic sparsification,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[12] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[13] M. Safaryan and P.
Richtárik, “Stochastic sign descent methods: New algorithms and better theory,” in International Conference on Machine Learning. PMLR, 2021, pp. 9224–9234.
[14] J. Wangni, J. Wang, J. Liu, and T. Zhang, “Gradient sparsification for communication-efficient distributed optimization,” Advances in Neural Information Processing Systems, vol. 31, 2018.

Figure 7. Training loss, training accuracy, test loss, and test accuracy as functions of the number of iterations for COCO-EF (Sign) and Unbiased (Sign) under different values of d_k. For each method, we conduct 5 independent trials. The solid curves show the mean performance across trials, and the shaded regions represent the corresponding standard deviations.

[15] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” in International Conference on Machine Learning. PMLR, 2018, pp. 560–569.
[16] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli, “The convergence of sparsified gradient methods,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[17] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified SGD with memory,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[18] A. Sahu, A. Dutta, A. M. Abdelmoniem, T. Banerjee, M. Canini, and P. Kalnis, “Rethinking gradient sparsification as total error minimization,” Advances in Neural Information Processing Systems, vol. 34, pp. 8133–8146, 2021.
[19] Z. Li and P. Richtárik, “A unified analysis of stochastic gradient methods for nonconvex federated optimization,” arXiv preprint arXiv:2006.07013, 2020.
[20] L. Condat, A. Maranjyan, and P. Richtárik, “LoCoDL: Communication-efficient distributed learning with local training and compression,” arXiv preprint arXiv:2403.04348, 2024.
[21] Y. He, X. Huang, and K.
Yuan, “Unbiased compression saves communication in distributed optimization: When and how much?” Advances in Neural Information Processing Systems, vol. 36, 2024.
[22] A. Beznosikov, S. Horváth, P. Richtárik, and M. Safaryan, “On biased compression for distributed learning,” Journal of Machine Learning Research, vol. 24, no. 276, pp. 1–50, 2023.
[23] S. U. Stich, “On communication compression for distributed optimization on heterogeneous data,” arXiv preprint arXiv:2009.02388, 2020.
[24] X. Li and P. Li, “Analysis of error feedback in federated non-convex optimization with biased compression: Fast convergence and partial participation,” in International Conference on Machine Learning. PMLR, 2023, pp. 19638–19688.
[25] S. Horváth and P. Richtárik, “A better alternative to error feedback for communication-efficient distributed learning,” arXiv preprint arXiv:2006.11077, 2020.
[26] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Interspeech, vol. 2014. Singapore, 2014, pp. 1058–1062.
[27] S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi, “Error feedback fixes signSGD and other gradient compression schemes,” in International Conference on Machine Learning. PMLR, 2019, pp. 3252–3261.
[28] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in International Conference on Machine Learning. PMLR, 2017, pp. 3368–3376.
[29] B. Buyukates and S. Ulukus, “Timely distributed computation with stragglers,” IEEE Transactions on Communications, vol. 68, no. 9, pp. 5273–5282, 2020.
[30] A. Hard, A. M. Girgis, E. Amid, S. Augenstein, L. McConnaughey, R. Mathews, and R. Anil, “Learning from straggler clients in federated learning,” arXiv preprint arXiv:2403.09086, 2024.
[31] R. Bitar, M. Wootters, and S.
El Rouayheb, “Stochastic gradient coding for straggler mitigation in distributed learning,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 277–291, 2020.
[32] C. Li and M. Skoglund, “Distributed learning based on 1-bit gradient coding in the presence of stragglers,” IEEE Transactions on Communications, vol. 72, no. 8, pp. 4903–4916, 2024.
[33] M. Ye and E. Abbe, “Communication-computation efficient gradient coding,” in International Conference on Machine Learning. PMLR, 2018, pp. 5610–5619.
[34] T. Adikari and S. Draper, “Decentralized optimization with non-identical sampling in presence of stragglers,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 3702–3706.
[35] L. Ding, K. Jin, B. Ying, K. Yuan, and W. Yin, “DSGD-CECA: Decentralized SGD with communication-optimal exact consensus algorithm,” in International Conference on Machine Learning. PMLR, 2023, pp. 8067–8089.
[36] S. P. Karimireddy, L. He, and M. Jaggi, “Byzantine-robust learning on heterogeneous datasets via bucketing,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=jXKKDEi5vJt
[37] H. Zhu and Q. Ling, “Byzantine-robust distributed learning with compression,” IEEE Transactions on Signal and Information Processing over Networks, 2023.
[38] R. Islamov, M. Safaryan, and D. Alistarh, “AsGrad: A sharp unified analysis of asynchronous-SGD algorithms,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2024, pp. 649–657.
[39] A. Koloskova, S. Stich, and M. Jaggi, “Decentralized stochastic optimization and gossip algorithms with compressed communication,” in International Conference on Machine Learning. PMLR, 2019, pp. 3478–3487.
[40] R. Jin, Y. Liu, Y. Huang, X. He, T. Wu, and H.
Dai, “Sign-based gradient descent with heterogeneous data: Convergence and Byzantine resilience,” IEEE Transactions on Neural Networks and Learning Systems, 2024.
[41] Y. LeCun, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
[42] X. Chen, J. Cao, R. Cao, Y. Sahni, M. Zhang, and Y. Ji, “Decentralized task offloading in collaborative edge computing: A digital twin assisted multi-agent reinforcement learning approach,” IEEE Transactions on Mobile Computing, 2025.
[43] X. Tang, L. Lou, R. Peng, T. Zhang, Z. Liu, Z. Liu, J. Kang, Z. Han, and D. Niyato, “A verifiable privacy-preserving cross-chain protocol for trusted vehicle edge computing,” IEEE Transactions on Vehicular Technology, 2025.
[44] P. Richtárik, I. Sokolov, and I. Fatkhullin, “EF21: A new, simpler, theoretically better, and practically faster error feedback,” Advances in Neural Information Processing Systems, vol. 34, pp. 4384–4396, 2021.

VII. Acknowledgment

The authors gratefully acknowledge Martin Jaggi, who suggested looking at error feedback in the setting of this work.

Appendix A
Proof of Lemma 1

We can easily derive

\[
\begin{aligned}
\mathbb{E}\Big[\Big\|\sum_{i=1}^{N} I_i^t g_i^t\Big\|^2 \,\Big|\, \mathcal{F}^t\Big]
&= \sum_{j=1}^{N}\sum_{i=1}^{N} \mathbb{E}\big[\langle I_i^t g_i^t, I_j^t g_j^t\rangle \mid \mathcal{F}^t\big] \\
&= \sum_{i=1}^{N} \mathbb{E}\big[\langle I_i^t g_i^t, I_i^t g_i^t\rangle \mid \mathcal{F}^t\big]
 + \sum_{j=1}^{N}\sum_{i\neq j} \mathbb{E}\big[\langle I_i^t g_i^t, I_j^t g_j^t\rangle \mid \mathcal{F}^t\big] \\
&= (1-p)\sum_{i=1}^{N} \mathbb{E}\big[\langle g_i^t, g_i^t\rangle \mid \mathcal{F}^t\big]
 + (1-p)^2 \sum_{j=1}^{N}\sum_{i=1}^{N} \mathbb{E}\big[\langle g_i^t, g_j^t\rangle \mid \mathcal{F}^t\big]
 - (1-p)^2 \sum_{i=1}^{N} \mathbb{E}\big[\langle g_i^t, g_i^t\rangle \mid \mathcal{F}^t\big] \\
&\overset{\langle 1\rangle}{=} (1-p)\sum_{i=1}^{N} \mathbb{E}\big[\|g_i^t\|^2 \mid \mathcal{F}^t\big]
 + \|\nabla F(\theta^t)\|^2
 - (1-p)^2 \sum_{i=1}^{N} \mathbb{E}\big[\|g_i^t\|^2 \mid \mathcal{F}^t\big] \\
&= (p-p^2)\sum_{i=1}^{N} \mathbb{E}\big[\|g_i^t\|^2 \mid \mathcal{F}^t\big] + \|\nabla F(\theta^t)\|^2, \qquad (27)
\end{aligned}
\]

where ⟨1⟩ is obtained by noting that (1 − p) Σ_{i=1}^{N} g_i^t = ∇F(θ^t).
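Identity (27) can be checked numerically by Monte Carlo simulation of the Bernoulli straggler indicators (an illustrative sketch; the vectors g_i^t are treated as fixed conditionally on F^t, and the relation (1 − p) Σ_i g_i^t = ∇F(θ^t) is used to define the gradient):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, p = 20, 5, 0.3
g = rng.standard_normal((N, D))      # fixed local vectors g_i^t
grad_F = (1 - p) * g.sum(axis=0)     # since (1 - p) * sum_i g_i^t = grad F

# Right-hand side of (27).
rhs = (p - p**2) * np.sum(np.linalg.norm(g, axis=1) ** 2) \
      + np.linalg.norm(grad_F) ** 2

# Monte Carlo estimate of the left-hand side E || sum_i I_i g_i ||^2,
# with I_i = 1 w.p. 1 - p, independently across devices and trials.
T = 200_000
mask = (rng.random((T, N)) > p).astype(float)
sums = mask @ g                      # (T, D): sum_i I_i g_i for each trial
lhs = (sums ** 2).sum(axis=1).mean()
```

With 200,000 trials the estimate agrees with the closed form to within a few percent, which is the expected sampling error.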
In (27), we can express

\[
\begin{aligned}
\sum_{i=1}^{N}\mathbb{E}\big[\|g_i^t\|^2 \mid \mathcal{F}^t\big]
&\overset{\langle 1\rangle}{=} \sum_{i=1}^{N}\mathbb{E}\Big[\Big\|\sum_{k\in\mathcal{S}_i}\frac{1}{d_k(1-p)}\nabla f_k(\theta^t)\Big\|^2 \mid \mathcal{F}^t\Big] \\
&= \frac{1}{(1-p)^2}\sum_{i=1}^{N}\sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\frac{s(i,k_1)s(i,k_2)}{d_{k_1}d_{k_2}}\big\langle\nabla f_{k_1}(\theta^t),\nabla f_{k_2}(\theta^t)\big\rangle \\
&= \frac{1}{(1-p)^2}\sum_{k_1=1}^{M}\sum_{k_2=1}^{M}\frac{1}{d_{k_1}d_{k_2}}\Big(\sum_{i=1}^{N}s(i,k_1)s(i,k_2)\Big)\big\langle\nabla f_{k_1}(\theta^t),\nabla f_{k_2}(\theta^t)\big\rangle \\
&= \frac{1}{(1-p)^2}\sum_{k_1}\frac{\|\nabla f_{k_1}(\theta^t)\|^2}{d_{k_1}^2}\sum_{i=1}^{N}s(i,k_1)
 + \frac{1}{(1-p)^2}\sum_{k_1}\sum_{k_2\neq k_1}\frac{\big\langle\nabla f_{k_1}(\theta^t),\nabla f_{k_2}(\theta^t)\big\rangle}{d_{k_1}d_{k_2}}\sum_{i=1}^{N}s(i,k_1)s(i,k_2) \\
&\overset{\langle 2\rangle}{=} \frac{1}{(1-p)^2}\sum_{k}\frac{\|\nabla f_k(\theta^t)\|^2}{d_k}
 + \frac{1}{(1-p)^2}\sum_{k_1}\sum_{k_2\neq k_1}\frac{\big\langle\nabla f_{k_1}(\theta^t),\nabla f_{k_2}(\theta^t)\big\rangle}{N} \\
&= \frac{1}{(1-p)^2}\sum_{k=1}^{M}\|\nabla f_k(\theta^t)\|^2\Big(\frac{1}{d_k}-\frac{1}{N}\Big) + \frac{\|\nabla F(\theta^t)\|^2}{N(1-p)^2} \\
&= \frac{1}{(1-p)^2}\sum_{k=1}^{M}\Big\|\nabla f_k(\theta^t)-\frac{1}{M}\nabla F(\theta^t)+\frac{1}{M}\nabla F(\theta^t)\Big\|^2\Big(\frac{1}{d_k}-\frac{1}{N}\Big) + \frac{\|\nabla F(\theta^t)\|^2}{N(1-p)^2} \\
&\overset{\langle 3\rangle}{\leqslant} \frac{2}{(1-p)^2}\sum_{k=1}^{M}\Big\|\nabla f_k(\theta^t)-\frac{1}{M}\nabla F(\theta^t)\Big\|^2\Big(\frac{1}{d_k}-\frac{1}{N}\Big)
 + \frac{2}{(1-p)^2}\sum_{k=1}^{M}\Big\|\frac{1}{M}\nabla F(\theta^t)\Big\|^2\Big(\frac{1}{d_k}-\frac{1}{N}\Big) + \frac{\|\nabla F(\theta^t)\|^2}{N(1-p)^2} \\
&\overset{\langle 4\rangle}{\leqslant} \frac{2\beta^2}{(1-p)^2}\sum_{k=1}^{M}\Big(\frac{1}{d_k}-\frac{1}{N}\Big)
 + \Bigg[\frac{1}{N(1-p)^2}+\frac{2\sum_{k=1}^{M}(1/d_k-1/N)}{(1-p)^2 M^2}\Bigg]\|\nabla F(\theta^t)\|^2, \qquad (28)
\end{aligned}
\]

where ⟨1⟩ is derived from (3), ⟨2⟩ holds according to the following identities under the pairwise balanced scheme:

\[ \sum_{i=1}^{N}s(i,k_1)=d_{k_1}, \qquad \sum_{i=1}^{N}s(i,k_1)s(i,k_2)=\frac{d_{k_1}d_{k_2}}{N}, \quad k_1\neq k_2, \qquad (29) \]

⟨3⟩ holds due to the following basic inequality:

\[ \Big\|\sum_{i=1}^{n}x_i\Big\|^2 \leqslant n\sum_{i=1}^{n}\|x_i\|^2, \quad \forall x_i\in\mathbb{R}^D, \qquad (30) \]

and ⟨4⟩ is derived from Assumption 2. Substituting (28) into (27) yields (17) and completes the proof.

Appendix B
Proof of Lemma 2

Let us define

\[ \tilde{e}_i^t = \gamma g_i^t + e_i^t - \mathcal{C}\big(\gamma g_i^t + e_i^t\big), \quad i = 1, ..., N. \qquad (31) \]

Based on this definition, we have

\[ e_i^{t+1} = \begin{cases} \tilde{e}_i^t, & \text{if device } i \text{ is a non-straggler}, \\ e_i^t, & \text{otherwise}. \end{cases} \]
(32)

According to (32) and the straggler model in (8), we have

\[ \Big\|\sum_{i=1}^{N} e_i^{t+1}\Big\|^2 = \Big\|\sum_{i=1}^{N} I_i^t \tilde{e}_i^t + \sum_{i=1}^{N} (1-I_i^t) e_i^t\Big\|^2 = \Big\|\sum_{i=1}^{N} I_i^t\big(\tilde{e}_i^t - e_i^t\big) + \sum_{i=1}^{N} e_i^t\Big\|^2. \qquad (33) \]

Taking expectations on both sides of (33) conditioned on F^t, we can obtain

\[
\begin{aligned}
\mathbb{E}\Big[\Big\|\sum_{i=1}^{N} e_i^{t+1}\Big\|^2 \mid \mathcal{F}^t\Big]
&= \mathbb{E}\Big[\Big\langle \sum_{i=1}^{N} I_i^t(\tilde{e}_i^t - e_i^t),\ \sum_{i=1}^{N} I_i^t(\tilde{e}_i^t - e_i^t)\Big\rangle \mid \mathcal{F}^t\Big]
 + \mathbb{E}\Big[\Big\langle \sum_{i=1}^{N} e_i^t,\ \sum_{i=1}^{N} e_i^t\Big\rangle \mid \mathcal{F}^t\Big] \\
&\quad + 2\,\mathbb{E}\Big[\Big\langle \sum_{i=1}^{N} e_i^t,\ \sum_{i=1}^{N} I_i^t(\tilde{e}_i^t - e_i^t)\Big\rangle \mid \mathcal{F}^t\Big]. \qquad (34)
\end{aligned}
\]

In (34), the first term can be expressed as

\[
\begin{aligned}
&\mathbb{E}\Big[\Big\langle \sum_{i} I_i^t(\tilde{e}_i^t - e_i^t),\ \sum_{i} I_i^t(\tilde{e}_i^t - e_i^t)\Big\rangle \mid \mathcal{F}^t\Big] \\
&= \sum_i \mathbb{E}\big[\langle I_i^t(\tilde{e}_i^t - e_i^t), I_i^t(\tilde{e}_i^t - e_i^t)\rangle \mid \mathcal{F}^t\big]
 + \sum_{i\neq j}\mathbb{E}\big[\langle I_i^t(\tilde{e}_i^t - e_i^t), I_j^t(\tilde{e}_j^t - e_j^t)\rangle \mid \mathcal{F}^t\big] \\
&\overset{\langle 1\rangle}{=} (1-p)\sum_i \mathbb{E}\big[\|\tilde{e}_i^t - e_i^t\|^2 \mid \mathcal{F}^t\big]
 + (1-p)^2\sum_{i\neq j}\mathbb{E}\big[\langle \tilde{e}_i^t - e_i^t, \tilde{e}_j^t - e_j^t\rangle \mid \mathcal{F}^t\big] \\
&= (p-p^2)\sum_i \mathbb{E}\big[\|\tilde{e}_i^t - e_i^t\|^2 \mid \mathcal{F}^t\big]
 + (1-p)^2\sum_{i,j}\mathbb{E}\big[\langle \tilde{e}_i^t - e_i^t, \tilde{e}_j^t - e_j^t\rangle \mid \mathcal{F}^t\big] \\
&\leqslant 2(p-p^2)\sum_i \mathbb{E}\big[\|\tilde{e}_i^t\|^2 \mid \mathcal{F}^t\big] + 2(p-p^2)\sum_i \|e_i^t\|^2
 + (1-p)^2\,\mathbb{E}\Big[\Big\|\sum_i \tilde{e}_i^t\Big\|^2 \mid \mathcal{F}^t\Big] \\
&\quad + (1-p)^2\Big\|\sum_i e_i^t\Big\|^2
 - 2(1-p)^2\,\mathbb{E}\Big[\Big\langle \sum_i \tilde{e}_i^t,\ \sum_i e_i^t\Big\rangle \mid \mathcal{F}^t\Big], \qquad (35)
\end{aligned}
\]

where ⟨1⟩ is derived from (8). Substituting (35) into (34) yields

\[
\begin{aligned}
\mathbb{E}\Big[\Big\|\sum_{i=1}^{N} e_i^{t+1}\Big\|^2 \mid \mathcal{F}^t\Big]
&\overset{\langle 1\rangle}{\leqslant} 2(p-p^2)\sum_i \mathbb{E}\big[\|\tilde{e}_i^t\|^2 \mid \mathcal{F}^t\big] + 2(p-p^2)\sum_i \|e_i^t\|^2
 + (1-p)^2\,\mathbb{E}\Big[\Big\|\sum_i \tilde{e}_i^t\Big\|^2 \mid \mathcal{F}^t\Big] \\
&\quad + p^2\Big\|\sum_i e_i^t\Big\|^2
 + 2(p-p^2)\,\mathbb{E}\Big[\Big\langle \sum_{i=1}^{N} e_i^t,\ \sum_{i=1}^{N} \tilde{e}_i^t\Big\rangle \mid \mathcal{F}^t\Big] \\
&\leqslant 2(p-p^2)\sum_i \mathbb{E}\big[\|\tilde{e}_i^t\|^2 \mid \mathcal{F}^t\big] + 2(p-p^2)\sum_i \|e_i^t\|^2
 + (1-p)\,\mathbb{E}\Big[\Big\|\sum_i \tilde{e}_i^t\Big\|^2 \mid \mathcal{F}^t\Big] + p\Big\|\sum_i e_i^t\Big\|^2, \qquad (36)
\end{aligned}
\]

where ⟨1⟩ is derived from (8). For the first term in (36), we can derive

\[
\sum_i \mathbb{E}\big[\|\tilde{e}_i^t\|^2 \mid \mathcal{F}^t\big]
\overset{\langle 1\rangle}{=} \sum_i \mathbb{E}\big[\|\gamma g_i^t + e_i^t - \mathcal{C}(\gamma g_i^t + e_i^t)\|^2 \mid \mathcal{F}^t\big]
\overset{\langle 2\rangle}{\leqslant} \delta\sum_i \|\gamma g_i^t + e_i^t\|^2
\overset{\langle 3\rangle}{\leqslant} 2\delta\gamma^2\sum_i \|g_i^t\|^2 + 2\delta\sum_i \|e_i^t\|^2, \qquad (37)
\]

where ⟨1⟩ is obtained from (31), ⟨2⟩ holds due to Assumption 5, and ⟨3⟩ can be derived based on (30).
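The contraction step ⟨2⟩ in (37) can be sanity-checked numerically for a concrete biased compressor: top-K satisfies ‖x − C(x)‖² ≤ (1 − K/D)‖x‖², an instance of the compression bound that Assumption 5 is assumed to capture (the exact form of Assumption 5 in the paper may differ):

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 50, 5
delta = 1 - K / D  # standard contraction factor for top-K sparsification

def topk(x, K):
    """Keep the K largest-magnitude entries of x, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -K)[-K:]
    out[idx] = x[idx]
    return out

# The bound holds deterministically: the dropped D - K entries carry at
# most a (1 - K/D) fraction of the squared norm.
for _ in range(1000):
    x = rng.standard_normal(D)
    err = np.linalg.norm(x - topk(x, K)) ** 2
    assert err <= delta * np.linalg.norm(x) ** 2 + 1e-9
```

The bound is deterministic rather than in expectation, which is exactly why step ⟨2⟩ can drop the conditional expectation on the right-hand side.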
For the third term in (36), we have
\begin{align}
\mathbb{E}\bigg[\Big\|\sum_{i}\tilde e_i^t\Big\|^2\,\Big|\,\mathcal{F}^t\bigg]
&= \mathbb{E}\bigg[\Big\|\sum_{i}\big(\gamma g_i^t + e_i^t - \mathcal{C}\big(\gamma g_i^t + e_i^t\big)\big)\Big\|^2\,\Big|\,\mathcal{F}^t\bigg]
\overset{\langle 1\rangle}{\leqslant} q_A\,\mathbb{E}\bigg[\Big\|\sum_{i}\big(\gamma g_i^t + e_i^t\big)\Big\|^2\,\Big|\,\mathcal{F}^t\bigg] \nonumber\\
&\overset{\langle 2\rangle}{\leqslant} q_A(1+\alpha)\gamma^2\Big\|\sum_{i}g_i^t\Big\|^2 + q_A\big(1+\alpha^{-1}\big)\Big\|\sum_{i}e_i^t\Big\|^2 \nonumber\\
&\overset{\langle 3\rangle}{=} \frac{q_A\gamma^2(1+\alpha)}{(1-p)^2}\big\|\nabla F(\theta^t)\big\|^2 + q_A\big(1+\alpha^{-1}\big)\Big\|\sum_{i}e_i^t\Big\|^2, \quad \forall \alpha>0, \tag{38}
\end{align}
where $\langle 1\rangle$ holds due to Assumption 4, $\langle 2\rangle$ is derived from the following basic inequality:
\begin{align}
\|x+y\|^2 \leqslant (1+\alpha)\|x\|^2 + \big(1+\alpha^{-1}\big)\|y\|^2, \quad \forall x, y \in \mathbb{R}^D,\ \forall \alpha>0, \tag{39}
\end{align}
and $\langle 3\rangle$ can be derived from (3). Substituting (37) and (38) into (36), we have
\begin{align}
\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N} e_i^{t+1}\Big\|^2\,\Big|\,\mathcal{F}^t\bigg]
&\leqslant 4\big(p-p^2\big)\delta\gamma^2\sum_{i}\big\|g_i^t\big\|^2 + (4\delta+2)\big(p-p^2\big)\sum_{i}\big\|e_i^t\big\|^2 \nonumber\\
&\quad + \frac{q_A\gamma^2(1+\alpha)}{1-p}\big\|\nabla F(\theta^t)\big\|^2 + \big[(1-p)q_A\big(1+\alpha^{-1}\big)+p\big]\Big\|\sum_{i}e_i^t\Big\|^2. \tag{40}
\end{align}
In (40), we can bound $\sum_i \|g_i^t\|^2$ by
\begin{align}
\sum_{i}\big\|g_i^t\big\|^2 \leqslant \frac{2\beta^2\vartheta}{(1-p)^2} + \bigg[\frac{1}{N(1-p)^2} + \frac{2\vartheta}{(1-p)^2 M^2}\bigg]\big\|\nabla F(\theta^t)\big\|^2, \tag{41}
\end{align}
based on (18) and (28). For the second term in (40), we can derive
\begin{align}
\mathbb{E}\big[\big\|e_i^{t+1}\big\|^2\,\big|\,\mathcal{F}^t\big]
&= (1-p)\,\mathbb{E}\big[\big\|\tilde e_i^t\big\|^2\,\big|\,\mathcal{F}^t\big] + p\big\|e_i^t\big\|^2 \nonumber\\
&= (1-p)\,\mathbb{E}\big[\big\|\gamma g_i^t + e_i^t - \mathcal{C}\big(\gamma g_i^t + e_i^t\big)\big\|^2\,\big|\,\mathcal{F}^t\big] + p\big\|e_i^t\big\|^2 \nonumber\\
&\leqslant (1-p)\delta\big\|\gamma g_i^t + e_i^t\big\|^2 + p\big\|e_i^t\big\|^2 \nonumber\\
&\leqslant 2(1-p)\delta\gamma^2\big\|g_i^t\big\|^2 + \big[2(1-p)\delta + p\big]\big\|e_i^t\big\|^2, \tag{42}
\end{align}
by following similar steps as in (37). Summing over all devices based on (41) and (42), we have
\begin{align}
\sum_{i}\mathbb{E}\big[\big\|e_i^{t+1}\big\|^2\,\big|\,\mathcal{F}^t\big]
\leqslant \frac{4\delta\gamma^2\beta^2\vartheta}{1-p}
+ 2\delta\gamma^2\bigg[\frac{1}{N(1-p)} + \frac{2\vartheta}{(1-p)M^2}\bigg]\big\|\nabla F(\theta^t)\big\|^2
+ \big[2(1-p)\delta + p\big]\sum_{i}\big\|e_i^t\big\|^2. \tag{43}
\end{align}
Taking full expectations on both sides of (43) and unrolling the recursion (noting that $e_i^0 = \mathbf{0}$), we can derive that
\begin{align}
\mathbb{E}\bigg[\sum_{i}\big\|e_i^{t+1}\big\|^2\bigg]
&\leqslant \frac{4\delta\gamma^2\beta^2\vartheta}{1-p}
+ \frac{2\delta\gamma^2}{1-p}\bigg[\frac{1}{N} + \frac{2\vartheta}{M^2}\bigg]\mathbb{E}\big[\big\|\nabla F(\theta^t)\big\|^2\big]
+ \big[2(1-p)\delta + p\big]\,\mathbb{E}\bigg[\sum_{i}\big\|e_i^t\big\|^2\bigg] \nonumber\\
&\leqslant \frac{2\delta\gamma^2}{1-p}\sum_{\tau=0}^{t}\bigg\{2\beta^2\vartheta + \bigg[\frac{1}{N} + \frac{2\vartheta}{M^2}\bigg]\mathbb{E}\big[\big\|\nabla F(\theta^{t-\tau})\big\|^2\big]\bigg\}\big[2(1-p)\delta + p\big]^{\tau}. \tag{44}
\end{align}
From (41) and (44), we can take full expectations on both sides of (40) and obtain
\begin{align}
\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N} e_i^{t+1}\Big\|^2\bigg]
&\leqslant \frac{8\beta^2 p\delta\gamma^2\vartheta}{1-p}
+ \frac{4p\delta\gamma^2}{1-p}\bigg[\frac{1}{N} + \frac{2\vartheta}{M^2}\bigg]\mathbb{E}\big[\big\|\nabla F(\theta^t)\big\|^2\big] \nonumber\\
&\quad + \frac{(4\delta+2)\,2\delta\gamma^2 p}{2(1-p)\delta+p}\sum_{\tau=0}^{t}\bigg\{2\beta^2\vartheta + \bigg[\frac{1}{N}+\frac{2\vartheta}{M^2}\bigg]\mathbb{E}\big[\|\nabla F(\theta^\tau)\|^2\big]\bigg\}\big[2(1-p)\delta+p\big]^{t-\tau} \nonumber\\
&\quad + \frac{q_A\gamma^2(1+\alpha)}{1-p}\mathbb{E}\big[\big\|\nabla F(\theta^t)\big\|^2\big]
+ \big[(1-p)q_A\big(1+\alpha^{-1}\big)+p\big]\,\mathbb{E}\bigg[\Big\|\sum_{i}e_i^t\Big\|^2\bigg]. \nonumber
\end{align}
Unrolling this recursion over $t$ and bounding the resulting geometric sums, we obtain
\begin{align}
\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N} e_i^{t+1}\Big\|^2\bigg]
&\leqslant \frac{1}{1-\big[(1-p)q_A(1+\alpha^{-1})+p\big]}
\bigg\{\frac{8\beta^2 p\delta\gamma^2\vartheta}{1-p}
+ \frac{(4\delta+2)\,4\delta\gamma^2 p\beta^2\vartheta}{\big[2(1-p)\delta+p\big]\big(1-\big[2(1-p)\delta+p\big]\big)}\bigg\} \nonumber\\
&\quad + \Bigg\{\frac{4p\delta\gamma^2}{1-p}\bigg[\frac{1}{N}+\frac{2\vartheta}{M^2}\bigg]
+ \frac{q_A\gamma^2(1+\alpha)}{1-p}
+ \frac{(4\delta+2)\,2\delta\gamma^2 p}{2(1-p)\delta+p}\bigg[\frac{1}{N}+\frac{2\vartheta}{M^2}\bigg]\frac{1}{1-\frac{2(1-p)\delta+p}{(1-p)q_A(1+\alpha^{-1})+p}}\Bigg\} \nonumber\\
&\quad \times \big[(1-p)q_A\big(1+\alpha^{-1}\big)+p\big]^{t}\sum_{\tau=0}^{t}\big[(1-p)q_A\big(1+\alpha^{-1}\big)+p\big]^{-\tau}\mathbb{E}\big[\|\nabla F(\theta^\tau)\|^2\big]. \tag{45}
\end{align}
Based on (45), summing over the $T+1$ iterations and bounding the inner geometric sums yields
\begin{align}
\sum_{t=0}^{T}\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N} e_i^{t+1}\Big\|^2\bigg]
&\leqslant \frac{T+1}{1-\big[(1-p)q_A(1+\alpha^{-1})+p\big]}
\bigg\{\frac{8\beta^2 p\delta\gamma^2\vartheta}{1-p}
+ \frac{(4\delta+2)\,4\delta\gamma^2 p\beta^2\vartheta}{\big[2(1-p)\delta+p\big]\big(1-\big[2(1-p)\delta+p\big]\big)}\bigg\} \nonumber\\
&\quad + \Bigg\{\frac{4p\delta\gamma^2}{1-p}\bigg[\frac{1}{N}+\frac{2\vartheta}{M^2}\bigg]
+ \frac{q_A\gamma^2(1+\alpha)}{1-p}
+ \frac{(4\delta+2)\,2\delta\gamma^2 p}{2(1-p)\delta+p}\bigg[\frac{1}{N}+\frac{2\vartheta}{M^2}\bigg]\frac{1}{1-\frac{2(1-p)\delta+p}{(1-p)q_A(1+\alpha^{-1})+p}}\Bigg\} \nonumber\\
&\quad \times \frac{1}{1-\big[(1-p)q_A(1+\alpha^{-1})+p\big]}\sum_{\tau=0}^{T}\mathbb{E}\big[\|\nabla F(\theta^\tau)\|^2\big]. \tag{46}
\end{align}
By setting $\alpha = \frac{2q_A}{2\delta+1-2q_A}$ in (46), under the conditions $\delta < 0.5$ and $q_A < \frac{2\delta+1}{2}$, we can derive (19) easily.

Appendix C
Proof of Theorem 1

Let us consider the virtual sequence $\{x^t, t=0,\dots,T\}$, where
\begin{align}
x^t \triangleq \theta^t - \sum_{i=1}^{N} e_i^t. \tag{47}
\end{align}
From (47), we can derive
\begin{align}
x^{t+1} &= \theta^{t+1} - \sum_{i=1}^{N} e_i^{t+1}
\overset{\langle 1\rangle}{=} \theta^t - \hat g^t - \sum_{i=1}^{N} I_i^t e_i^{t+1} - \sum_{i=1}^{N}\big(1-I_i^t\big)e_i^{t+1} \nonumber\\
&\overset{\langle 2\rangle}{=} \theta^t - \sum_{i=1}^{N} I_i^t\,\mathcal{C}\big(\gamma g_i^t + e_i^t\big) - \sum_{i=1}^{N} I_i^t e_i^{t+1} - \sum_{i=1}^{N}\big(1-I_i^t\big)e_i^t \nonumber\\
&= \theta^t - \sum_{i=1}^{N} I_i^t\big[\mathcal{C}\big(\gamma g_i^t + e_i^t\big) + e_i^{t+1}\big] - \sum_{i=1}^{N}\big(1-I_i^t\big)e_i^t \nonumber\\
&\overset{\langle 3\rangle}{=} \theta^t - \sum_{i=1}^{N} I_i^t\big(\gamma g_i^t + e_i^t\big) - \sum_{i=1}^{N}\big(1-I_i^t\big)e_i^t
= \theta^t - \sum_{i=1}^{N} e_i^t - \gamma\sum_{i=1}^{N} I_i^t g_i^t
\overset{\langle 4\rangle}{=} x^t - \gamma\sum_{i=1}^{N} I_i^t g_i^t, \tag{48}
\end{align}
where $\langle 1\rangle$ is obtained by substituting (10) into (48), $\langle 2\rangle$ is derived from (4) and (9), $\langle 3\rangle$ holds due to (7), and $\langle 4\rangle$ is derived by (47). From Assumption 1 and (48), we have
\begin{align}
F\big(x^{t+1}\big) &\leqslant F\big(x^t\big) + \big\langle \nabla F(x^t),\, x^{t+1}-x^t\big\rangle + \frac{L}{2}\big\|x^{t+1}-x^t\big\|^2 \nonumber\\
&= F\big(x^t\big) - \Big\langle \nabla F(x^t),\, \gamma\sum_{i=1}^{N} I_i^t g_i^t\Big\rangle + \frac{L}{2}\Big\|\gamma\sum_{i=1}^{N} I_i^t g_i^t\Big\|^2. \tag{49}
\end{align}
Taking expectations on both sides of (49) conditioned on iterations $0,\dots,t-1$ yields
\begin{align}
\mathbb{E}\big[F\big(x^{t+1}\big)\,\big|\,\mathcal{F}^t\big] - F\big(x^t\big)
&\leqslant -\gamma\,\mathbb{E}\bigg[\Big\langle \nabla F(x^t),\, \sum_{i=1}^{N} I_i^t g_i^t\Big\rangle\,\Big|\,\mathcal{F}^t\bigg]
+ \frac{L\gamma^2}{2}\,\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N} I_i^t g_i^t\Big\|^2\,\Big|\,\mathcal{F}^t\bigg] \nonumber\\
&= -\gamma\,\mathbb{E}\bigg[\Big\langle \nabla F(\theta^t),\, \sum_{i=1}^{N} I_i^t g_i^t\Big\rangle\,\Big|\,\mathcal{F}^t\bigg]
+ \frac{L\gamma^2}{2}\,\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N} I_i^t g_i^t\Big\|^2\,\Big|\,\mathcal{F}^t\bigg] \nonumber\\
&\quad + \gamma\,\mathbb{E}\bigg[\Big\langle \nabla F(\theta^t)-\nabla F(x^t),\, \sum_{i=1}^{N} I_i^t g_i^t\Big\rangle\,\Big|\,\mathcal{F}^t\bigg]. \tag{50}
\end{align}
For the first term in (50), we can derive
\begin{align}
-\gamma\,\mathbb{E}\bigg[\Big\langle \nabla F(\theta^t),\, \sum_{i=1}^{N} I_i^t g_i^t\Big\rangle\,\Big|\,\mathcal{F}^t\bigg]
&\overset{\langle 1\rangle}{=} -\gamma(1-p)\,\mathbb{E}\bigg[\Big\langle \nabla F(\theta^t),\, \sum_{i=1}^{N}\sum_{k\in\mathcal{S}_i}\frac{\nabla f_k(\theta^t)}{d_k(1-p)}\Big\rangle\,\Big|\,\mathcal{F}^t\bigg] \nonumber\\
&= -\gamma\,\mathbb{E}\bigg[\Big\langle \nabla F(\theta^t),\, \sum_{k=1}^{M}\nabla f_k(\theta^t)\Big\rangle\,\Big|\,\mathcal{F}^t\bigg]
= -\gamma\big\langle \nabla F(\theta^t),\, \nabla F(\theta^t)\big\rangle
= -\gamma\big\|\nabla F(\theta^t)\big\|^2, \tag{51}
\end{align}
where $\langle 1\rangle$ holds due to (8). For the third term in (50), we have
\begin{align}
\gamma\,\mathbb{E}\bigg[\Big\langle \nabla F(\theta^t)-\nabla F(x^t),\, \sum_{i=1}^{N} I_i^t g_i^t\Big\rangle\,\Big|\,\mathcal{F}^t\bigg]
&\overset{\langle 1\rangle}{\leqslant} \frac{\gamma\rho}{2}\big\|\nabla F(\theta^t)-\nabla F(x^t)\big\|^2
+ \frac{\gamma}{2\rho}\,\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N} I_i^t g_i^t\Big\|^2\,\Big|\,\mathcal{F}^t\bigg] \nonumber\\
&\overset{\langle 2\rangle}{\leqslant} \frac{\gamma\rho L^2}{2}\big\|\theta^t-x^t\big\|^2
+ \frac{\gamma}{2\rho}\,\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N} I_i^t g_i^t\Big\|^2\,\Big|\,\mathcal{F}^t\bigg] \nonumber\\
&\overset{\langle 3\rangle}{=} \frac{\gamma\rho L^2}{2}\Big\|\sum_{i=1}^{N} e_i^t\Big\|^2
+ \frac{\gamma}{2\rho}\,\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N} I_i^t g_i^t\Big\|^2\,\Big|\,\mathcal{F}^t\bigg], \quad \forall \rho>0, \tag{52}
\end{align}
where $\langle 1\rangle$ holds due to Young's inequality, $\langle 2\rangle$ can be derived based on Assumption 1, and $\langle 3\rangle$ is obtained by substituting (47) into (52). From (51) and (52), we can rewrite (50) as
\begin{align}
\mathbb{E}\big[F\big(x^{t+1}\big)\,\big|\,\mathcal{F}^t\big] - F\big(x^t\big)
\leqslant -\gamma\big\|\nabla F(\theta^t)\big\|^2
+ \bigg(\frac{L\gamma^2}{2}+\frac{\gamma}{2\rho}\bigg)\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N} I_i^t g_i^t\Big\|^2\,\Big|\,\mathcal{F}^t\bigg]
+ \frac{\gamma\rho L^2}{2}\Big\|\sum_{i=1}^{N} e_i^t\Big\|^2, \quad \forall \rho>0. \tag{53}
\end{align}
Substituting (17) in Lemma 1 into (53), we have
\begin{align}
\mathbb{E}\big[F\big(x^{t+1}\big)\,\big|\,\mathcal{F}^t\big] - F\big(x^t\big)
&\leqslant \bigg\{\bigg(\frac{L\gamma^2}{2}+\frac{\gamma}{2\rho}\bigg)\bigg[\frac{p}{(1-p)N}+1+\frac{2p\vartheta}{(1-p)M^2}\bigg]-\gamma\bigg\}\big\|\nabla F(\theta^t)\big\|^2 \nonumber\\
&\quad + \frac{\gamma\rho L^2}{2}\Big\|\sum_{i=1}^{N} e_i^t\Big\|^2
+ \bigg(\frac{L\gamma^2}{2}+\frac{\gamma}{2\rho}\bigg)\frac{2p\beta^2\vartheta}{1-p}. \tag{54}
\end{align}
Taking full expectations on both sides of (54), we have
\begin{align}
\mathbb{E}\big[F\big(x^{t+1}\big)\big] - \mathbb{E}\big[F\big(x^t\big)\big]
&\leqslant \bigg\{\bigg(\frac{L\gamma^2}{2}+\frac{\gamma}{2\rho}\bigg)\bigg[\frac{p}{(1-p)N}+1+\frac{2p\vartheta}{(1-p)M^2}\bigg]-\gamma\bigg\}\mathbb{E}\big[\big\|\nabla F(\theta^t)\big\|^2\big] \nonumber\\
&\quad + \frac{\gamma\rho L^2}{2}\,\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N} e_i^t\Big\|^2\bigg]
+ \bigg(\frac{L\gamma^2}{2}+\frac{\gamma}{2\rho}\bigg)\frac{2p\beta^2\vartheta}{1-p}. \tag{55}
\end{align}
Taking the average over $T+1$ iterations based on (55) yields
\begin{align}
\frac{\mathbb{E}\big[F\big(x^{T+1}\big)\big]-\mathbb{E}\big[F\big(x^0\big)\big]}{T+1}
&\leqslant \bigg\{\bigg(\frac{L\gamma^2}{2}+\frac{\gamma}{2\rho}\bigg)\bigg[\frac{p}{(1-p)N}+1+\frac{2p\vartheta}{(1-p)M^2}\bigg]-\gamma\bigg\}\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\big[\big\|\nabla F(\theta^t)\big\|^2\big] \nonumber\\
&\quad + \frac{\gamma\rho L^2}{2}\cdot\frac{1}{T+1}\sum_{t=0}^{T+1}\mathbb{E}\bigg[\Big\|\sum_{i=1}^{N} e_i^t\Big\|^2\bigg]
+ \bigg(\frac{L\gamma^2}{2}+\frac{\gamma}{2\rho}\bigg)\frac{2p\beta^2\vartheta}{1-p}. \tag{56}
\end{align}
Substituting (19) in Lemma 2 into (56) and rearranging the terms, we have
\begin{align}
&\bigg\{-\bigg(\frac{L\gamma^2}{2}+\frac{\gamma}{2\rho}\bigg)\bigg[\frac{p}{(1-p)N}+1+\frac{2p\vartheta}{(1-p)M^2}\bigg]+\gamma-\frac{\gamma^3\rho L^2}{2}\xi_2\bigg\}\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\big[\big\|\nabla F(\theta^t)\big\|^2\big] \nonumber\\
&\leqslant \frac{\gamma^3\rho L^2}{2}\xi_1
+ \bigg(\frac{L\gamma^2}{2}+\frac{\gamma}{2\rho}\bigg)\frac{2p\beta^2\vartheta}{1-p}
+ \frac{-\mathbb{E}\big[F\big(x^{T+1}\big)\big]+\mathbb{E}\big[F\big(x^0\big)\big]}{T+1}. \tag{57}
\end{align}
By setting $\rho = \frac{1}{\gamma\rho_0}$ in (57) and using Assumption 3, we can further bound (57) as
\begin{align}
\big(\gamma-\tilde\varepsilon_0\gamma^2\big)\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\big[\big\|\nabla F(\theta^t)\big\|^2\big]
\leqslant \tilde\varepsilon_1\gamma^2 + \frac{\mathbb{E}\big[F\big(x^0\big)\big]-F^*}{T+1}, \quad \forall \rho_0>0, \tag{58}
\end{align}
where
\begin{align}
\tilde\varepsilon_0 \triangleq \frac{L+\rho_0}{2}\bigg[\frac{p}{(1-p)N}+1+\frac{2p\vartheta}{(1-p)M^2}\bigg]+\frac{L^2\xi_2}{2\rho_0}, \tag{59}
\end{align}
and
\begin{align}
\tilde\varepsilon_1 \triangleq \frac{L^2}{2\rho_0}\xi_1 + \frac{L+\rho_0}{2}\cdot\frac{2p\beta^2\vartheta}{1-p}. \tag{60}
\end{align}
If we set $\gamma = \frac{\phi}{\sqrt{T+1}}$, $\phi>0$, for $T > (\tilde\varepsilon_0\phi)^2-1$, it holds from (58) that
\begin{align}
\frac{1}{T+1}\sum_{t=0}^{T}\mathbb{E}\big[\big\|\nabla F(\theta^t)\big\|^2\big]
\leqslant \frac{\tilde\varepsilon_1\phi}{\sqrt{T+1}-\tilde\varepsilon_0\phi}
+ \frac{F\big(x^0\big)-F^*}{\phi\sqrt{T+1}-\tilde\varepsilon_0\phi^2}
= \frac{\tilde\varepsilon_1\phi}{\sqrt{T+1}-\tilde\varepsilon_0\phi}
+ \frac{F\big(\theta^0\big)-F^*}{\phi\sqrt{T+1}-\tilde\varepsilon_0\phi^2}. \tag{61}
\end{align}
Note that
\begin{align}
\tilde\varepsilon_1 \geqslant \varepsilon_1 \triangleq \sqrt{\frac{2L^2\xi_1 p\beta^2\vartheta}{1-p}} + \frac{Lp\beta^2\vartheta}{1-p}, \tag{62}
\end{align}
where equality can be attained by setting
\begin{align}
\rho_0 = \sqrt{\frac{L^2\xi_1(1-p)}{2p\beta^2\vartheta}}. \tag{63}
\end{align}
Substituting (63) into (59), based on (61) and (62), we can easily derive Theorem 1.
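The proof of Theorem 1 rests on the identity (48): the virtual iterate $x^t = \theta^t - \sum_i e_i^t$ takes a plain (partially participating) gradient step even though only compressed messages are transmitted. This can be checked numerically with a minimal simulation; the Bernoulli straggler model, top-k compressor, and random stand-in gradients below are illustrative assumptions used only to exercise the algebra, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, T = 5, 20, 50          # devices, model dimension, iterations
gamma, k, p = 0.1, 4, 0.3    # step size, top-k level, straggler probability

def top_k(v, k):
    """Biased top-k compressor (stand-in for the compression function C)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

theta = rng.standard_normal(D)
e = np.zeros((N, D))         # per-device compression errors, e_i^0 = 0
x = theta.copy()             # virtual sequence, x^0 = theta^0 by (47)

for t in range(T):
    g = rng.standard_normal((N, D))      # stand-in local gradients g_i^t
    I = rng.random(N) > p                # non-straggler indicators I_i^t
    msgs = np.array([top_k(gamma * g[i] + e[i], k) for i in range(N)])
    theta = theta - (I[:, None] * msgs).sum(axis=0)   # server model update
    for i in range(N):
        if I[i]:                                      # error feedback on
            e[i] = gamma * g[i] + e[i] - msgs[i]      # non-stragglers only
    x = x - gamma * (I[:, None] * g).sum(axis=0)      # eq. (48)
    assert np.allclose(theta - e.sum(axis=0), x)      # eq. (47) preserved

print("virtual sequence identity x^t = theta^t - sum_i e_i^t verified")
```

The assertion holds for any gradients and any compressor, since the compressed term cancels exactly between the model update and the error update; this cancellation is precisely what steps $\langle 2\rangle$ and $\langle 3\rangle$ of (48) exploit.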