Accelerating Byzantine-Robust Distributed Learning with Compressed Communication via Double Momentum and Variance Reduction
Authors: Yanghao Li, Changxin Liu, Yuhao Yi
Yanghao Li (Sichuan University), Changxin Liu† (East China University of Science and Technology), Yuhao Yi† (Sichuan University)

Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, Tangier, Morocco. PMLR: Volume 300. Copyright 2026 by the author(s). † Corresponding author.

Abstract

In collaborative and distributed learning, Byzantine robustness reflects a major facet of optimization algorithms. Such distributed algorithms are often accompanied by transmitting a large number of parameters, so communication compression is essential for an effective solution. In this paper, we propose Byz-DM21, a novel Byzantine-robust and communication-efficient stochastic distributed learning algorithm. Our key innovation is a novel gradient estimator based on a double-momentum mechanism, integrating recent advancements in error feedback techniques. Using this estimator, we design both standard and accelerated algorithms that eliminate the need for large batch sizes while maintaining robustness against Byzantine workers. We prove that the Byz-DM21 algorithm has a smaller neighborhood size and converges to ε-stationary points in O(ε^{-4}) iterations. To further enhance efficiency, we introduce a distributed variant called Byz-VR-DM21, which incorporates local variance reduction at each node to progressively eliminate variance from random approximations. We show that Byz-VR-DM21 provably converges to ε-stationary points in O(ε^{-3}) iterations. Additionally, we extend our results to the case where the functions satisfy the Polyak-Łojasiewicz condition. Finally, numerical experiments demonstrate the effectiveness of the proposed method.

1 Introduction

Distributed learning has garnered significant attention due to its widespread applications. Specific applications include large-scale training of deep neural networks, collaborative learning in edge computing, and distributed optimization in federated learning Haddadpour et al. (2019); Jaggi et al. (2014); Lee et al. (2017); Yu et al. (2019). Traditional distributed learning often assumes an environment free from faults or attacks. However, in certain real-world scenarios, such as edge computing Shi et al. (2016) and federated learning McMahan and Ramage (2017), service providers (also known as the server) typically have limited control over computational nodes (also known as workers). In such cases, workers may experience various software and hardware failures Xie et al. (2019). Worse still, some workers may be compromised by malicious third parties and intentionally send erroneous information to disrupt the distributed learning process Kairouz et al. (2021). Workers affected by such failures or attacks are referred to as Byzantine workers¹. Distributed learning in the presence of Byzantine workers, also known as Byzantine-Robust Distributed Learning (BRDL), has recently emerged as a prominent research topic Yin et al. (2018); Bernstein et al. (2019); Diakonikolas and Kane (2019); Konstantinidis and Ramamoorthy (2021).

A common approach to achieving Byzantine robustness is to replace the standard mean aggregator with robust alternatives, such as Krum Blanchard et al. (2017), geometric median Chen et al. (2017), coordinate-wise median Yin et al. (2018), and trimmed mean Yin et al. (2018), among others. However, in the presence of Byzantine workers, even robust aggregators inevitably introduce aggregation error, defined as the discrepancy between the aggregated result and the true mean. Moreover, even with independent and identically distributed (i.i.d.)
data, the aggregation error can be significant due to the high variance of stochastic gradients Karimireddy et al. (2021), which are typically sent from workers to the server for parameter updates. Large aggregation errors may lead to the failure of BRDL methods Xie et al. (2020).

¹ Following the standard terminology in the literature Lamport et al. (2019); Su and Vaidya (2016), we refer to a worker as Byzantine if it may, either maliciously or unintentionally, send incorrect information to other workers or to the server. Such workers are assumed to be omniscient: they can access the vectors transmitted by other workers, are aware of the server-side aggregation rule, and may coordinate their actions with one another.

In addition to Byzantine robustness, the efficiency of distributed learning systems is considered a major performance metric. Due to its distributed nature, a bottleneck lies in the communication between workers and the server, particularly the transmission of local stochastic gradients. This challenge becomes more pronounced with high-dimensional models, resulting in substantial communication overhead. To mitigate this issue, several strategies have been proposed, including reducing communication frequency by performing multiple local updates Chen et al. (2018) and compressing the transmitted messages. Common compression techniques include quantization, which encodes vectors using a limited number of bits Alistarh et al. (2017), and sparsification, which reduces the number of non-zero elements in transmitted vectors Wangni et al. (2018). In this work, we primarily focus on the latter approach. Specifically, at each iteration, workers compress their local gradients before transmission, and the server aggregates these compressed gradients to update the model parameters.
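As a concrete illustration of the two sparsifiers just mentioned (our sketch, not part of the original paper), Top-k keeps the k entries of largest magnitude, while Rand-k keeps k uniformly chosen entries and rescales them for unbiasedness:

```python
import numpy as np

# Minimal sketch of the two sparsification operators discussed above;
# this is our illustration, not the authors' implementation.  Top-k is
# biased but contractive; Rand-k is unbiased after rescaling by d/k.

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]   # indices of the k largest |x_j|
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)        # rescale so E[rand_k(x)] = x
    return out

x = np.array([3.0, -1.0, 0.5, -4.0])
print(top_k(x, 2))                     # keeps the entries -4.0 and 3.0
```

Only the k kept values and their indices need to be transmitted, which is the source of the communication savings; for Top-k the compression error satisfies the contraction bound with α = k/d.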
Byzantine robustness and communication efficiency are both crucial properties in distributed learning, yet their simultaneous exploration has been relatively limited in the existing literature, with current methods facing notable challenges. Zhu and Ling (2021) proposed Byzantine-robust variants of compressed SGD (BR-CSGD) and SAGA (BR-CSAGA), as well as BROADCAST, which integrates DIANA Mishchenko et al. (2019) with BR-CSAGA. However, their convergence analysis is confined to strongly convex problems and relies on stringent assumptions. Similarly, Gorbunov et al. (2023) studied the Byzantine-tolerant Byz-VR-MARINA, which achieves fast convergence but occasionally triggers uncompressed message communication and full gradient computation. Moreover, most existing Byzantine-robust methods utilize unbiased compressors, whereas biased contractive compressors combined with error feedback often yield superior empirical performance Rammal et al. (2024).

In this work, we comprehensively address these limitations by building on the recently proposed Byzantine-robust stochastic distributed learning method with error feedback, Byz-EF21-SGDM Liu et al. (2026). We first propose Byz-DM21, a Byzantine-robust stochastic distributed learning algorithm that utilizes Double Momentum with Error Feedback-21, an enhanced variant of Byz-EF21-SGDM, featuring improved convergence properties over the original method. A double-momentum estimator u_i^(t) has richer "memory" of past gradients compared to SGDM Fatkhullin et al. (2023). Building on Byz-DM21, we further introduce Byz-VR-DM21, which incorporates local variance reduction at each node to progressively eliminate variance arising from stochastic approximations.

We summarize our main contributions as follows:

• We propose a novel Byzantine-robust and communication-efficient stochastic distributed learning method, Byz-DM21.
Our new algorithm is batch-free and employs a double-momentum mechanism to simultaneously suppress the variance of stochastic gradients and the bias introduced by compression, which leads to improved sample complexity over Byz-EF21-SGDM in the non-asymptotic regime (see Remark 3.2). Moreover, we prove that the variance of the double-momentum estimator is strictly smaller than that of the single-momentum estimator, with an asymptotic reduction factor of approximately 1/2 when η is small. We also show that Byz-DM21 converges to an ε-stationary point in O(ε^{-4}) iterations.

• We propose Byz-VR-DM21 as an extension of Byz-DM21. The improved algorithm incorporates local variance reduction on all nodes to progressively eliminate variance from stochastic approximations. By combining worker-momentum-based variance reduction with a Byzantine-robust aggregator, we obtain a faster Byzantine-robust algorithm. We prove that Byz-VR-DM21 achieves an accelerated convergence to ε-stationary points in O(ε^{-3}) iterations. Additionally, we analyze Byz-DM21 and Byz-VR-DM21 for problems that satisfy the Polyak-Łojasiewicz condition Polyak (1963).

• We derive complexity bounds for Byz-DM21 under standard assumptions and further extend these results to Byz-VR-DM21. These complexity bounds demonstrate that our algorithm outperforms the state of the art, specifically in the full-gradient setting Rammal et al. (2024), in terms of convergence speed. Notably, our results are tight and align with established lower bounds in both stochastic and full-gradient scenarios when no Byzantine workers are present.

• Under the ζ²-heterogeneity assumption, our algorithm converges to a tighter neighborhood around the optimal solution and also matches the established lower bound Karimireddy et al. (2021); Allouah et al. (2023). For a detailed comparison, see Table 1.
Furthermore, experiments demonstrate that the proposed algorithm not only converges faster but also asymptotically reaches a model with a smaller error.

Table 1: Summary of related works on Byzantine-robust and communication-efficient distributed methods. "Complexity (NC)" and "Complexity (PŁ)" represent the total number of communication rounds required for each worker to find x such that E[||∇f(x)||] ≤ ε in the general non-convex case, and x such that E[f(x) − f(x*)] ≤ ε in the PŁ case, respectively. σ² represents the variance of local stochastic gradients, κ refers to the parameter of robust aggregators, α ∈ (0, 1] and ω ≥ 0 are parameters for biased contractive and unbiased compressors, respectively, ζ² denotes the heterogeneity bound among honest workers, and c denotes the heterogeneity constant. The parameter p ∈ (0, 1] is the sampling probability used in Byz-VR-MARINA and Byz-DASHA-PAGE. m represents the local dataset size for workers in Byz-VR-MARINA, Byz-DASHA-PAGE, and BROADCAST.

Method | Setting | Batch-free? | Complexity (NC) | Complexity (PŁ) | Accuracy
BROADCAST Zhu and Ling (2021) | finite-sum | ✗ | – | m²(1+ω)^{3/2} G / (µ²(n−2B)) ^(1) | κ(1+ω)ζ²
Byz-VR-MARINA ^(2) Gorbunov et al. (2023) | finite-sum | ✗ | (1 + √max{ω², mω}/√G + √(κ max{ω, m})) / ε² | (1 + √max{ω², mω}/√G + √(κ max{ω, m}) + m + ω) / µ | κζ² / (p − cκ)
Byz-DASHA-PAGE ^(2) Rammal et al. (2024) | finite-sum | ✗ | (1 + (ω + √(mω))/√G + √κ) / ε² | – | κζ² / (1 − cκ)
Byz-EF21 ^(2) Rammal et al. (2024) | full gradient | ✗ | (1 + √κ) / (αε²) | – | (κ + √κ)ζ² / (1 − c(κ + √κ))
Byz-EF21-SGDM Liu et al. (2026) | stochastic gradient | ✓ | σ²/(Gε⁴) + κσ²/ε⁴ | – | κζ²
Byz-DM21 (This work) | stochastic gradient | ✓ | √(κ+1) σ²/(Gε⁴) + (κ+1)^{3/2} σ²/ε⁴; √(κ+1)/(αε²) (full gradient) | (G(κ+1)+1) σ² (µ + √(κ+1)) / (µ²εG) | κζ²
Byz-VR-DM21 (This work) | stochastic gradient | ✓ | √(κ+1) σ/(√G ε³) + (κ+1) σ/ε³; √(κ+1)/(αε²) (full gradient) | (G(κ+1)+1) σ² / (µεG) | κζ²

(1) The rate is derived under the strong convexity assumption. Strong convexity implies the PŁ condition, but the converse is not true: there exist non-convex PŁ functions Karimi et al. (2016).
(2) For comparison, the complexity results of Byz-VR-MARINA and Byz-EF21 are derived by exploring the relationship between the (δ, c)-agnostic robust aggregator and the (B, κ)-robust aggregator. See Remark 3.5 for details.

2 Preliminaries

We consider a distributed learning system comprising a central server and n workers, denoted as the set [n] = G ∪ B. In this setup, G represents the subset of reliable or honest workers with G = |G|, while B consists of malicious or Byzantine workers with B = |B|. Notably, the identities of the honest workers and Byzantine workers are unknown beforehand. The Byzantine workers are assumed to be omniscient Baruch et al. (2019) and capable of colluding with each other to send arbitrary malicious messages to the server. The primary objective is to find the optimal solution to the distributed stochastic optimization problem

min_{x ∈ R^d} { f(x) = (1/G) Σ_{i∈G} f_i(x) },   (1)

where f_i(x) = E_{ξ_i ∼ D_i} f_i(x, ξ_i) for all i ∈ G. Here, x ∈ R^d represents the model parameters to be optimized, while f_i(x) denotes the (typically nonconvex) loss function of the model parameterized by x on the dataset D_i held by client i. We allow the distributions of malicious nodes, D_1, …, D_n, to vary arbitrarily.
Our objective is to solve the optimization problem (1) in the presence of arbitrary malicious messages sent by Byzantine workers, while ensuring communication efficiency. The following assumptions will be used throughout the analysis of our algorithms.

Assumption 2.1 (L-smoothness). We assume that the function f: R^d → R is L-smooth, meaning that for all x, y ∈ R^d the following inequalities hold:

||∇f_i(x) − ∇f_i(y)|| ≤ L_i ||x − y||,   (2)

and

||∇f(x) − ∇f(y)|| ≤ L ||x − y||.   (3)

Assumption 2.2 (Individual smoothness). For each i = 1, …, n and every realization of ξ_i ∼ D_i, the stochastic gradient ∇f_i(x, ξ_i) is ℓ_i-Lipschitz, i.e., for all x, y ∈ R^d we have

||∇f_i(x, ξ_i) − ∇f_i(y, ξ_i)|| ≤ ℓ_i ||x − y||.   (4)

We denote the averaged smoothness constants as L̃² = G^{-1} Σ_{i∈G} L_i² and ℓ̃² = G^{-1} Σ_{i∈G} ℓ_i². Finally, we assume that f is lower bounded, i.e., f* := min_{x ∈ R^d} f(x) > −∞.

In scenarios with arbitrary heterogeneity, distinguishing between regular and Byzantine workers becomes infeasible. Therefore, we adopt a common assumption regarding the heterogeneity of the gradients of local loss functions.

Assumption 2.3 (ζ²-heterogeneity). We assume that good workers have ζ²-heterogeneous local loss functions for some ζ ≥ 0, i.e.,

(1/G) Σ_{i∈G} ||∇f_i(x) − ∇f(x)||² ≤ ζ²,  ∀x ∈ R^d.   (5)

To model the stochastic noise introduced at each honest worker, we adopt the following assumption.

Assumption 2.4 (Bounded variance (BV)). There exists σ > 0 such that

E[||∇f_i(x, ξ_i) − ∇f_i(x)||²] ≤ σ²,  ∀x ∈ R^d,   (6)

where E[·] is defined over the randomness of the algorithm and ξ_i ∼ D_i are i.i.d. random samples for each i ∈ G.
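As a small numerical illustration (ours, not from the paper), Assumption 2.3 can be verified exactly on synthetic quadratic losses, for which the left-hand side of (5) is constant in x:

```python
import numpy as np

# Toy check of the ζ²-heterogeneity bound (5).  The quadratic losses
# f_i(x) = 0.5 ||x - b_i||² and the targets b_i are a hypothetical
# example, not data from the paper.  Here ∇f_i(x) = x - b_i, so the
# deviation (1/G) Σ_i ||∇f_i(x) - ∇f(x)||² does not depend on x and
# equals ζ² exactly.
rng = np.random.default_rng(0)
G, d = 8, 5                          # honest workers, dimension
b = rng.normal(size=(G, d))          # per-worker targets

def heterogeneity(x):
    grads = x - b                    # row i holds ∇f_i(x)
    mean_grad = grads.mean(axis=0)   # ∇f(x)
    return np.mean(np.sum((grads - mean_grad) ** 2, axis=1))

zeta_sq = heterogeneity(np.zeros(d))
# The same value is obtained at any other point x.
assert np.isclose(zeta_sq, heterogeneity(rng.normal(size=d)))
```

For general losses the supremum over x may be infinite, which is why ζ² is an assumption rather than a computable quantity; the quadratic case merely shows what the bound measures.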
We define a Byzantine-robust algorithm as an algorithm guaranteed to find an ε-approximate stationary point of f despite the presence of B Byzantine workers. In particular, we introduce the formal definition of Byzantine robustness as follows.

Definition 2.5 ((B, ε)-Byzantine robustness). A learning algorithm is said to be (B, ε)-Byzantine robust if, even in the presence of B Byzantine workers, it outputs x̂ satisfying

E[||∇f(x̂)||²] ≤ ε.   (7)

Achieving (B, ε)-Byzantine robustness is generally not feasible for any ε when the number of Byzantine workers B is at least half the total number of workers (B ≥ n/2). Therefore, in this work, we assume an upper bound on B, specifically B < n/2, to ensure robustness.

Several robust aggregation methods have been proposed, including the coordinate-wise trimmed mean (CWTM) Yin et al. (2018) and centered clipping Karimireddy et al. (2021). To formally assess the robustness of aggregation techniques, we adopt the concept of (B, κ)-robustness Allouah et al. (2023). This property ensures that for any subset of inputs of size G, the output of the aggregation rule remains close to the average of those inputs. This concept serves as a useful metric for comparing the robustness of different aggregation methods. For a detailed analysis and formal quantification of commonly used aggregation rules, refer to Allouah et al. (2023).

Definition 2.6 ((B, κ)-robustness). Given an integer B < n/2 and a real number κ ≥ 0, an aggregation rule F is (B, κ)-robust if for any set of n vectors {g_1, g_2, …, g_n} and any subset S ⊆ [n] with |S| = G,

||F(g_1, g_2, …, g_n) − ḡ_S||² ≤ (κ/|S|) Σ_{i∈S} ||g_i − ḡ_S||²,   (8)

where ḡ_S := G^{-1} Σ_{i∈S} g_i.

To facilitate communication-efficient learning, we introduce compression techniques for message transmission. A general compression operator is defined as follows.
Definition 2.7 (Contractive compressors). A (possibly randomized) mapping C: R^d → R^d is called a contractive compression operator if there exists a constant α ∈ (0, 1] such that

E[||C(x) − x||²] ≤ (1 − α) ||x||²,  ∀x ∈ R^d.   (9)

In this paper, we focus on the Top-k sparsification-based compression operator. This operator sorts the input vector, selects the largest k elements, and transmits these elements along with their original indices. Another commonly used sparsification operator is Rand-k, which randomly sets d − k elements of the input vector to zero, where k is an integer between 1 and d. This approach achieves a sparsity ratio of d/k. For a detailed overview of both biased and unbiased compressors, please refer to Beznosikov et al. (2023).

3 Byzantine-Robust Distributed Learning

In this section, we introduce our main results on methods employing biased compression.

3.1 Byzantine-DM21

We summarize our new method Byz-DM21 in Algorithm 1. At each iteration of Byz-DM21, the model parameter x^(t+1) is updated by the parameter server using the formula x^(t+1) = x^(t) − γ g^(t), where γ represents the step size and g^(t) is the gradient estimator maintained at the parameter server. Following the update, the server broadcasts the updated model x^(t+1) to all workers. Upon receiving x^(t+1), the good workers proceed with the following steps. They update their first momentum estimate v_i^(t+1) and second momentum estimate u_i^(t+1) (lines 5 and 6). The change in g_i^(t) is then compressed as c_i^(t+1) = C(u_i^(t+1) − g_i^(t)), and these compressed vectors are sent back to the server (line 7). Simultaneously, the honest workers update their local state according to g_i^(t+1) = g_i^(t) + c_i^(t+1), reflecting the adjustments made due to compression (line 8).
After that, the server gathers the results of the computations from the workers and applies a (B, κ)-robust aggregator to compute the next estimator g^(t+1).

Two crucial aspects of Byz-DM21 are introduced as follows. First, honest workers transmit only compressed updates of their local momentum variables and estimated momentum variables to the server. This approach not only reduces communication overhead but also helps to identify and exclude Byzantine workers who attempt to subvert the algorithm by transmitting dense, inconsistent vectors. Second, the double-momentum method enhances the efficiency of the optimization process by combining first and second momentum estimations. On the one hand, the first momentum helps smooth gradient fluctuations, reduces the influence of noise, and accelerates convergence. On the other hand, the second momentum further optimizes the update direction, prevents gradient oscillations, and improves local convergence.

Algorithm 1 Byzantine-DM21
Input: initial model x^(0), step size γ > 0, momentum coefficient η ∈ (0, 1], robust aggregation F, initial batch size b, number of rounds T
Initialization: for every honest worker i ∈ G, v_i^(0) = u_i^(0) = g_i^(0) = ∇f_i(x^(0), ξ_i^(0)); each worker i ∈ [n] sends g_i^(0) to the server; g^(0) = F({g_1^(0), …, g_n^(0)})
1: for t = 0, 1, …, T − 1 do
2:   Server computes x^(t+1) = x^(t) − γ g^(t) and broadcasts x^(t+1) to all workers
3:   for every honest worker i ∈ G in parallel do
4:     Compute the first momentum estimator:
5:     v_i^(t+1) = (1 − η) v_i^(t) + η ∇f_i(x^(t+1), ξ_i^(t+1))   (Byz-DM21)
       v_i^(t+1) = ∇f_i(x^(t+1), ξ_i^(t+1)) + (1 − η)(v_i^(t) − ∇f_i(x^(t), ξ_i^(t+1)))   (Byz-VR-DM21)
6:     Compute the second momentum estimator: u_i^(t+1) = (1 − η) u_i^(t) + η v_i^(t+1)
7:     Compress c_i^(t+1) = C(u_i^(t+1) − g_i^(t)) and send c_i^(t+1) to the server
8:     Update local state g_i^(t+1) = g_i^(t) + c_i^(t+1)
9:   end for
10:  Server updates g^(t+1) = F({g_1^(t+1), …, g_n^(t+1)}) via g_i^(t+1) = g_i^(t) + c_i^(t+1), ∀i ∈ [n]
11: end for
12: Return: a random x̂^(T) from {x^(t)}_{t=0}^{T−1}

This double-momentum mechanism not only accelerates the training process but also reduces the impact of gradient noise and enhances robustness.

3.2 Convergence of Byz-DM21 for General Non-Convex Problems

The convergence analysis of Byz-DM21 hinges on the monotonicity of the following Lyapunov function:

Φ^(t) = δ_t + (4γ/η) ||M̃^(t)||² + ( 48γ(8κ+1)(η⁴ + 6η²)/(ηα²G) + 8γ(28κ+3)/(ηG) ) Σ_{i∈G} ||M_i^(t)||²,   (10)

where δ_t = f(x^(t)) − f(x*), M_i^(t) = v_i^(t) − ∇f_i(x^(t)), and M̃^(t) = G^{-1} Σ_{i∈G} v_i^(t) − ∇f_G(x^(t)). The primary convergence result for general non-convex functions is articulated in Theorem 3.1, with the corresponding proof detailed in Appendix F.

Theorem 3.1. Assuming that Assumptions 2.1, 2.3, and 2.4 hold, we consider Algorithm 1 for solving the distributed learning problem (1) with B < n/2 Byzantine workers and communication compression characterized by the parameter α ∈ (0, 1] as per Definition 2.7.
If η ≤ 1 and

γ ≤ min{ α / (4L̃√(234(8κ+1))), η / (4√((56κ+6)L̃² + L²)) },

then

E[||∇f(x̂^(T))||²] ≤ Φ^(0)/(γT) + ( 48(η⁵ + 7η³)(8κ+1)/α² + 4η(64κ+7) + 4η/G + 8(8κ+1)η⁴/α ) σ² + 32κζ²,

where x̂^(T) is sampled uniformly at random from the iterates of the method and Φ^(0) is defined in (10). With the choice

η ≤ O( min{ (α²δ₀L̂/((κ+1)σ²T))^{1/6}, (α²δ₀L̂/((κ+1)σ²T))^{1/4}, (δ₀L̂/((κ+1)σ²T))^{1/2}, (δ₀GL̂/(σ²T))^{1/2}, (αδ₀L̂/((κ+1)σ²T))^{1/5} } ),

where L̂ := 8√((56κ+6)L̃² + L²), we obtain

E[||∇f(x̂^(T))||²] = O( ((κ+1)^{1/5} σ^{2/5} L̂δ₀/(α^{2/5}T))^{5/6} + κζ² + ((κ+1)^{1/3} σ^{2/3} L̂δ₀/(α^{2/3}T))^{3/4} + ((κ+1) σ² L̂δ₀/T)^{1/2} + ((κ+1)^{1/4} σ^{1/2} L̂δ₀/(α^{1/4}T))^{4/5} + Φ^(0)/(γT) + (σ² L̂δ₀/(GT))^{1/2} ).

Theorem 3.1 establishes that the proposed algorithm, Byz-DM21, converges in E[||∇f(x̂^(T))||²], adhering to the Byzantine robustness criterion specified in Definition 2.5. Consequently, Byz-DM21 achieves a convergence rate comparable to that of standard SGD with momentum Cutkosky and Mehta (2020). Furthermore, prior studies Cutkosky and Mehta (2020); Arjevani et al. (2023) have shown that, under standard assumptions, the convergence rate of O(1/T^{1/2}) is optimal for SGD. Although Byz-VR-MARINA Gorbunov et al. (2023) attains a faster convergence rate by intermittently using full gradients, this approach incurs significant computational costs, particularly in real-world scenarios involving large-scale training data.

Remark 3.2. Omitting constants and higher-order terms, Theorem 3.1 gives E[||∇f(x̂^(T))||²] ≲ σ/√(GT) + √κ σ/√T. When there is no Byzantine adversary, i.e., B = 0, employing the standard mean aggregator (which is (0, 0)-robust) yields a convergence rate of O(σ/√(GT)), which improves with the number of workers G. Additionally, Byz-DM21 attains better sample complexity than Byz-EF21-SGDM because its bound in Theorem 3.1 involves η⁴/α, which is more favorable compared to the term η²/α in Byz-EF21-SGDM. Consequently, this term becomes dominated by other terms and vanishes in Corollary 3.3.

Corollary 3.3. To guarantee E[||∇f(x̂^(T))||²] ≤ ε² for ε² ≥ 64κζ², we obtain

T = O( L̃√(κ+1)/(αε²) + (κ+1)^{1/5} σ^{2/5} L̂/(α^{2/5} ε^{12/5}) + (κ+1)^{1/4} σ^{1/2} L̂/(α^{1/4} ε^{5/2}) + (κ+1)^{1/3} σ^{2/3} L̂/(α^{2/3} ε^{8/3}) + (G(κ+1)+1) σ² L̂/(Gε⁴) ).

Next, we consider a special case where local full gradients are available to workers, i.e., σ = 0.

Corollary 3.4. If σ = 0, then E||∇f(x̂^(T))|| ≤ ε after T = O( L̃√(κ+1)/(αε²) ) iterations.

Remark 3.5. In the special case of full gradients, Rammal et al. (2024) established a complexity of O((1+√(cδ))/(αε²)), where δ = B/n and c are parameters characterizing the agnostic robust aggregator (ARAgg) Karimireddy et al. (2021). Note that a (B, κ)-robust aggregation rule also qualifies as a (δ, c)-ARAgg with c = κn/(2B) Allouah et al. (2023). As a result, Corollary 3.4 provides a slight improvement over the complexity bound O((1+√κ)/(αε²)) established in Rammal et al. (2024). Moreover, in the absence of Byzantine faults, we adopt the standard mean aggregator, which is (0, 0)-robust. Then the iteration complexity is T = O(L̃/(αε²)), achieving the lower bound for Byzantine-free distributed learning with communication compression Huang et al. (2022). We note that this asymptotic complexity bound also holds for any constant κ, achieved by many other aggregators, unless n is large and B/n → 1/2.

3.3 Convergence of Byz-DM21 under the Polyak-Łojasiewicz Condition

In this section, we provide complexity bounds for Byz-DM21 under the Polyak-Łojasiewicz (PŁ) condition.
Assumption 3.6 (Polyak-Łojasiewicz condition). The function f satisfies the Polyak-Łojasiewicz (PŁ) condition with parameter µ, i.e., for all x ∈ R^d there exists x* ∈ argmin_{x ∈ R^d} f(x) such that

2µ(f(x) − f(x*)) ≤ ||∇f(x)||²,  ∀x ∈ R^d,   (11)

where f(x*) = inf_{x ∈ R^d} f(x) > −∞. Here we use a different notion of an ε-solution: it is a (random) point x̂ such that E[f(x̂^(T)) − f(x*)] ≤ ε. Under this and the previously introduced assumptions, we derive the following result.

Theorem 3.7. Let Assumptions 2.1, 2.3, 2.4, and 3.6 be satisfied, and choose momentum

η ≤ O( min{ µεG/((G(κ+1)+1)σ²), µα²ε^{1/3}/((κ+1)σ²), µαε^{1/4}/((κ+1)σ²) } ).

Then after

T = O( (G(κ+1)+1)σ²/(µεG) + (κ+1)σ²/(µα²ε^{1/3}) + (κ+1)σ²/(µαε^{1/4}) + L/µ + Lσ²(G(κ+1)+1)√(κ+1)/(µ²εG) )

iterations with stepsize

γ ≤ min{ η/(2µ), α/(4µ), L^{-1} (1 + √((104α²(8κ+1) + 48η²(8κ+1)(2η⁴ + 28η² + α² + 48))/(α²η²)))^{-1} },

Byz-DM21 produces a point x̂^(T) for which E[f(x̂^(T)) − f(x*)] ≤ ε.

4 Incorporating Variance Reduction

In this work, we employ a gradient estimator inspired by those utilized in Cutkosky and Orabona (2019) and Tran-Dinh et al. (2019). The proposed gradient estimator at each node integrates the advantages of the widely used SARAH estimator Nguyen et al. (2017) and the unbiased SGD gradient estimator. Formally, the gradient estimator of node i ∈ G at time step t is expressed as

v_i^(t) = η ∇f_i(x^(t), ξ_i^(t)) [SGD] + (1 − η)( v_i^(t−1) + ∇f_i(x^(t), ξ_i^(t)) − ∇f_i(x^(t−1), ξ_i^(t)) ) [SARAH],   (12)

with the parameter η ∈ (0, 1] as the momentum parameter. Next, we discuss the algorithm.

We extend Byz-DM21 to Byz-VR-DM21, incorporating local variance reduction on all nodes to progressively eliminate variance from stochastic approximations. This new algorithm is depicted in Algorithm 1.
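The two line-5 update rules of Algorithm 1 and the compressed state update can be sketched as follows (a minimal illustration of ours, not the authors' code; the function names are hypothetical):

```python
import numpy as np

# A minimal sketch of one honest worker's update in Algorithm 1.
# `momentum_step` is the Byz-DM21 rule of line 5, `storm_step` the
# Byz-VR-DM21 (STORM/SARAH) rule of Eq. (12); both gradients passed to
# `storm_step` must be evaluated on the SAME sample ξ.  `worker_update`
# covers lines 6-8: second momentum plus the EF21-style compressed state.

def momentum_step(v, g_new, eta):
    # v_{t+1} = (1 - η) v_t + η ∇f_i(x_{t+1}, ξ)
    return (1 - eta) * v + eta * g_new

def storm_step(v, g_new, g_old, eta):
    # v_{t+1} = ∇f_i(x_{t+1}, ξ) + (1 - η)(v_t - ∇f_i(x_t, ξ))
    #         = η ∇f_i(x_{t+1}, ξ) + (1 - η)(v_t + ∇f_i(x_{t+1}, ξ) - ∇f_i(x_t, ξ))
    return g_new + (1 - eta) * (v - g_old)

def worker_update(v_new, u, g_state, eta, compress):
    # Second momentum (line 6), then transmit only the compressed
    # difference c = C(u_{t+1} - g_t) and update the local copy (lines 7-8).
    u_new = (1 - eta) * u + eta * v_new
    c = compress(u_new - g_state)
    return u_new, g_state + c

# With η = 1 both estimators reduce to plain SGD.
v, g = np.ones(3), np.full(3, 2.0)
assert np.allclose(momentum_step(v, g, 1.0), g)
assert np.allclose(storm_step(v, g, g, 1.0), g)
```

With the identity map in place of C, the local state g_i tracks u_i exactly; a contractive compressor such as Top-k transmits only part of the difference, and the residual is carried forward by the state, which is the error-feedback mechanism.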
Its main difference from the original Byz-DM21 is that in the momentum update rule (line 5), an additional term (1 − η)(∇f_i(x^(t+1), ξ_i^(t+1)) − ∇f_i(x^(t), ξ_i^(t+1))) is added, inspired by the STORM algorithm Cutkosky and Orabona (2019). This term corrects the bias of v_i^(t) so that it is an unbiased estimate of ∇f_i(x^(t)) conditioned on the current iterate, i.e., E[v_i^(t) | x^(t)] = ∇f_i(x^(t)). We will also show that it reduces the variance and accelerates the convergence.

4.1 Convergence of Byz-VR-DM21 for General Non-Convex Problems

We now present a variant of Byz-VR-DM21, outlined in Algorithm 1. The convergence analysis of this method hinges on the monotonicity of the following Lyapunov function:

Φ^(t) = δ_t + (4γ/η) ||M̃^(t)||² + ( 48γ(8κ+1)(η³ + 6η)/(α²G) + 8γ(28κ+3)/(ηG) ) Σ_{i∈G} ||M_i^(t)||²,   (13)

where δ_t = f(x^(t)) − f(x*), M_i^(t) = v_i^(t) − ∇f_i(x^(t)), and M̃^(t) = G^{-1} Σ_{i∈G} v_i^(t) − ∇f_G(x^(t)). The primary convergence result for general non-convex functions is articulated in Theorem 4.1, with the corresponding proof detailed in Appendix G.

Theorem 4.1. Let Assumptions 2.1, 2.2, 2.3, and 2.4 be satisfied, and consider Algorithm 1 for solving the distributed learning problem (1) with B < n/2 Byzantine workers and communication compression characterized by the parameter α ∈ (0, 1] as per Definition 2.7. If η ≤ 1 and

γ ≤ min{ α / (10√((8κ+1)(49ℓ̃² + 15L̃²))), √η / (20√((8κ+1)ℓ̃²)) },

then

E[||∇f(x̂^(T))||²] ≤ Φ^(0)/(γT) + ( 96(8κ+1)(η⁵ + 7η³)/α² + 8η(60κ+7) + 8η/G + 16η⁴(8κ+1)/α ) σ² + 32κζ²,

where x̂^(T) is sampled uniformly at random from the iterates of the method.
By setting η ≤ O min n α 2 δ 0 b ℓ ( κ +1) σ 2 T 2 / 11 , α 2 δ 0 b ℓ ( κ +1) σ 2 T 2 / 7 , δ 0 b ℓ ( κ +1) σ 2 T 2 / 3 , δ 0 G b ℓ σ 2 T 2 / 3 , αδ 0 b ℓ ( κ +1) σ 2 T 2 / 9 o , wher e b ℓ def = 40 q (8 κ + 1) e ℓ 2 , we obtain E h ∥∇ f ( ˆ x ( T ) ) ∥ 2 i = O ( ( κ + 1) 1 / 10 σ 2 / 10 δ 0 b ℓ α 2 / 10 T ) 10 / 11 + Φ (0) γ T + κζ 2 + ( ( κ + 1) 1 / 6 σ 2 / 6 δ 0 b ℓ α 2 / 6 T ) 6 / 7 + ( σ δ 0 b ℓ G 1 / 2 T ) 2 / 3 + ( ( κ + 1) 1 / 2 σ δ 0 b ℓ T ) 2 / 3 + ( ( κ + 1) 1 / 8 σ 1 / 4 δ 0 b ℓ α 1 / 8 T ) 8 / 9 . Corollary 4.2. T o guar ante e E [ ∥∇ f ( ˆ x ( T ) ) ∥ 2 ] ≤ ε 2 for ε 2 ≥ 64 κζ 2 , then T = O ( G ( κ + 1) 1 / 2 + 1) σ b ℓ G 1 / 2 ε 3 + ( κ + 1) 1 / 10 σ 1 / 5 b ℓ α 2 / 5 ε 11 / 5 + ( κ + 1) 1 / 6 σ 1 / 3 b ℓ α 1 / 3 ε 7 / 3 + ( κ + 1) 1 / 8 σ 1 / 4 b ℓ α 1 / 8 ε 9 / 4 + e L √ κ + 1 αε 2 . Corollary 4.3. If σ = 0 , then E ∥∇ f ( ˆ x ( T ) ) ∥ ≤ ε after T = O ( e L √ κ +1 / αε 2 ) iter ations. Remark 4.4. W e emphasize sever al key pr op erties of The or em 3.1 and The or em 4.1. These the or ems 0 20 40 epochs 10 −6 10 −4 10 −2 10 0 variance ALIE 0 20 40 epochs 10 −6 10 −4 10 −2 10 0 variance IPM 0 20 40 epochs 10 −6 10 −4 10 −2 10 0 variance LF 0 20 40 epochs 10 −6 10 −4 10 −2 10 0 variance SF BR-DIANA Byz-VR-MARINA Byz-EF21-SGDM Byz-DM21 Byz-VR-DM21 BR-DIANA Byz-VR-MARINA Byz-EF21-SGDM Byz-DM21 Byz-VR-DM21 Figure 1: The training v ariance of honest messages under four attack scenarios on the a9a dataset. pr ovide the the or etic al guar ante e for the c onver genc e of the err or fe e db ack metho d with sto chastic gr adi- ents in the pr esenc e of Byzantine attacks. In the heter o gene ous c ase (i.e., ζ > 0 ), the algorithm do es not ensur e that E ∥∇ f ( ˆ x ( T ) ) ∥ c an b e made arbi- tr arily smal l. This limitation is char acteristic of al l Byzantine-r obust algorithms in heter o gene ous envir on- ments. 
Specifically, with an order-optimal robustness coefficient $\kappa = O(B/n)$, as achieved by CWTM, the result is consistent with the lower bound $\Omega\big((B/n)\zeta^2\big)$ established by Allouah et al. (2023). Moreover, the highest attainable accuracy of Byz-DM21 and Byz-VR-DM21 is tighter than that of Byz-VR-MARINA and Byz-EF21 (see Table 1).

4.2 Convergence of Byz-VR-DM21 under the Polyak-Łojasiewicz Condition

Theorem 4.5. Let Assumptions 2.1, 2.2, 2.3, 2.4 and 3.6 be satisfied, and choose the momentum parameter

$$\eta \le O\left(\min\left\{\frac{\mu G\varepsilon}{(G(\kappa+1)+1)\sigma^2},\; \frac{\mu\alpha^2\varepsilon^{1/3}}{(\kappa+1)\sigma^2},\; \frac{\mu\alpha\varepsilon^{1/4}}{(\kappa+1)\sigma^2}\right\}\right).$$

Then after

$$T = O\left(\frac{(G(\kappa+1)+1)\sigma^2}{\mu\varepsilon G} + \frac{(\kappa+1)\sigma^2}{\mu\alpha^2\varepsilon^{1/3}} + \frac{(\kappa+1)\sigma^2}{\mu\alpha\varepsilon^{1/4}} + \frac{L}{\mu} + \frac{L\sigma^2(G(\kappa+1)+1)}{\mu^2\varepsilon G}\right)$$

iterations with stepsize

$$\gamma \le \min\left\{\left(L + 4\sqrt{8\kappa+1}\sqrt{\frac{16\alpha^2(L^2+7\eta\ell^2) + \eta^2\big(L^2(3\eta^2+24) + \ell^2(12\eta^3+\alpha\eta^2+156\eta)\big)}{\eta^2\alpha^2}}\right)^{-1},\; \frac{\eta}{2\mu},\; \frac{\alpha}{4\mu}\right\},$$

Byz-VR-DM21 produces a point $\hat{x}^{(T)}$ for which $\mathbb{E}[f(\hat{x}^{(T)}) - f(x^*)] \le \varepsilon$.

Figure 2: The training loss of RFA and CM under four attack scenarios (SF, IPM, LF, ALIE) on the a9a dataset in a heterogeneous setting. We use $k = 0.1d$ for both Rand-k and Top-k compressors.
5 Numerical Experiments

In this section, we demonstrate the performance of the proposed method. The goal of our experimental evaluation is to showcase the benefits of double momentum in mitigating Byzantine workers. We consider two binary classification tasks, a9a and w8a from LIBSVM Chang and Lin (2011), and image classification on CIFAR-10 Krizhevsky et al. (2009) and FEMNIST Caldas et al. (2018). Due to space limitations, we present results only on a9a and defer the rest to Appendix D.

Adversarial attacks. We address a regularized binary logistic regression problem using the a9a dataset from LIBSVM Chang and Lin (2011). The data is distributed across n = 20 workers, 8 of which are Byzantine. For aggregation, we use Robust Federated Averaging (RFA) Pillutla et al. (2022), Coordinate-wise Median (CM) Yin et al. (2018), and the NNM algorithm Allouah et al. (2023). The experiments test four Byzantine attack strategies: Sign Flipping (SF) Allen-Zhu et al. (2021), Label Flipping (LF) Allen-Zhu et al. (2021), Inner Product Manipulation (IPM) Xie et al. (2020), and A Little Is Enough (ALIE) Baruch et al. (2019) (details in Appendix C). We compare our algorithm with BR-DIANA² Mishchenko et al. (2019), Byz-VR-MARINA Gorbunov et al. (2023), and Byz-EF21-SGDM Liu et al. (2026). For the contractive compressor, we use Top-k (Byz-EF21-SGDM uses it as well), while all other algorithms use Rand-k³.

²BR-DIANA is a version of BROADCAST with the SGD estimator instead of the SAGA estimator. BROADCAST consumes a large amount of memory, which scales linearly with the number of data points. We compare with Byrd-SAGA in Appendix D.
³To ensure a fair comparison, each method employs a theoretically compatible compressor.

Empirical results. Figure 1 illustrates the training variance of honest messages for the compared algorithms. The results demonstrate that Byz-VR-DM21 effectively reduces the variance of the stochastic gradient, maintaining a consistently low level of variance even after convergence. This highlights Byz-VR-DM21's enhanced robustness in mitigating noise and interference from Byzantine nodes. On the other hand, Byz-DM21 achieves a variance level comparable to Byz-VR-MARINA despite not employing explicit variance reduction techniques, showcasing the advantages conferred by its double-momentum design. Figure 2 depicts the training loss of the compared methods under various attack scenarios. Both Byz-DM21 and Byz-VR-DM21 exhibit rapid convergence and strong resilience to Byzantine interference, outperforming the other algorithms. In contrast, Byz-VR-MARINA, although capable of quick convergence, suffers from significant fluctuations when exposed to Byzantine attacks. Notably, Byz-VR-DM21 achieves a faster convergence rate than Byz-DM21, which can be attributed to the effectiveness of its variance reduction mechanism.

Reproducibility. To ensure reproducibility, all experiments were conducted using three different random seeds. We report the mean training loss along with one standard error.

6 Conclusion

We introduce Byz-DM21 and Byz-VR-DM21, Byzantine-tolerant schemes designed to harness the empirical benefits of double momentum and variance reduction while preserving communication efficiency. Unlike most existing Byzantine-tolerant methods, our new algorithm leverages stochastic gradients and is batch-free, meaning it does not require computing full gradients or additional tuning. Through theoretical analysis, we prove that our new algorithm has tight lower bounds and a smaller neighborhood size.
It matches the upper bound results in both stochastic and full gradient scenarios when the problem is Byzantine-free, and also converges to a smaller neighborhood around the optimal solution. We further show that, under the PŁ condition, our algorithm admits a faster convergence guarantee in terms of the optimality gap. Moreover, we demonstrate the robustness of our algorithm against adversarial attacks through experiments on binary and image classification tasks.

Acknowledgements

Yanghao Li and Yuhao Yi acknowledge support from the National Natural Science Foundation of China under Grant 62303338, while Changxin Liu acknowledges support under Grant 62573196.

References

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. (2017). QSGD: Communication-efficient SGD via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30.

Alistarh, D., Hoefler, T., Johansson, M., Konstantinov, N., Khirirat, S., and Renggli, C. (2018). The convergence of sparsified gradient methods. Advances in Neural Information Processing Systems, 31.

Allen-Zhu, Z. (2018). Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research, 18(221):1–51.

Allen-Zhu, Z., Ebrahimianghazani, F., Li, J., and Alistarh, D. (2021). Byzantine-resilient non-convex stochastic gradient descent. In International Conference on Learning Representations.

Allouah, Y., Farhadkhani, S., Guerraoui, R., Gupta, N., Pinot, R., and Stephan, J. (2023). Fixing by mixing: A recipe for optimal Byzantine ML under heterogeneity. In International Conference on Artificial Intelligence and Statistics, pages 1232–1300. PMLR.

Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. (2023). Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1):165–214.
Baruch, G., Baruch, M., and Goldberg, Y. (2019). A little is enough: Circumventing defenses for distributed learning. Advances in Neural Information Processing Systems, 32.

Bernstein, J., Zhao, J., Azizzadenesheli, K., and Anandkumar, A. (2019). signSGD with majority vote is communication efficient and fault tolerant. In International Conference on Learning Representations.

Beznosikov, A., Horváth, S., Richtárik, P., and Safaryan, M. (2023). On biased compression for distributed learning. Journal of Machine Learning Research, 24(276):1–50.

Blanchard, P., El Mhamdi, E. M., Guerraoui, R., and Stainer, J. (2017). Machine learning with adversaries: Byzantine tolerant gradient descent. Advances in Neural Information Processing Systems, 30.

Caldas, S., Duddu, S. M. K., Wu, P., Li, T., Konečný, J., McMahan, H. B., Smith, V., and Talwalkar, A. (2018). LEAF: A benchmark for federated settings. arXiv preprint arXiv:1812.01097.

Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27.

Chen, T., Giannakis, G., Sun, T., and Yin, W. (2018). LAG: Lazily aggregated gradient for communication-efficient distributed learning. Advances in Neural Information Processing Systems, 31.

Chen, Y., Su, L., and Xu, J. (2017). Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1–25.

Cohen, G., Afshar, S., Tapson, J., and Van Schaik, A. (2017). EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926. IEEE.

Cutkosky, A. and Mehta, H. (2020). Momentum improves normalized SGD. In International Conference on Machine Learning, pages 2260–2268. PMLR.

Cutkosky, A. and Orabona, F. (2019).
Momentum-based variance reduction in non-convex SGD. Advances in Neural Information Processing Systems, 32.

Defazio, A., Bach, F., and Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27.

Diakonikolas, I. and Kane, D. M. (2019). Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911.

El Mhamdi, E. M., Guerraoui, R., and Rouault, S. L. A. (2021). Distributed momentum for Byzantine-resilient stochastic gradient descent. In 9th International Conference on Learning Representations (ICLR).

Fang, C., Li, C. J., Lin, Z., and Zhang, T. (2018). SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in Neural Information Processing Systems, 31.

Fatkhullin, I., Sokolov, I., Gorbunov, E., Li, Z., and Richtárik, P. (2021). EF21 with bells & whistles: Practical algorithmic extensions of modern error feedback. arXiv preprint arXiv:2110.03294.

Fatkhullin, I., Tyurin, A., and Richtárik, P. (2023). Momentum provably improves error feedback! In Thirty-seventh Conference on Neural Information Processing Systems.

Fedin, N. and Gorbunov, E. (2023). Byzantine-robust loopless stochastic variance-reduced gradient. In International Conference on Mathematical Optimization Theory and Operations Research, pages 39–53. Springer.

Gorbunov, E., Horváth, S., Richtárik, P., and Gidel, G. (2023). Variance reduction is an antidote to Byzantines: Better rates, weaker assumptions and communication compression as a cherry on the top. In The Eleventh International Conference on Learning Representations.

Gower, R. M., Schmidt, M., Bach, F., and Richtárik, P. (2020).
Variance-reduced methods for machine learning. Proceedings of the IEEE, 108(11):1968–1983.

Guerraoui, R., Gupta, N., and Pinot, R. (2024). Byzantine machine learning: A primer. ACM Computing Surveys, 56(7):1–39.

Haddadpour, F., Kamani, M. M., Mahdavi, M., and Cadambe, V. (2019). Trading redundancy for communication: Speeding up distributed SGD for non-convex optimization. In International Conference on Machine Learning, pages 2545–2554. PMLR.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Hong, S., Yang, H., and Lee, J. (2022). Hierarchical group testing for Byzantine attack identification in distributed matrix multiplication. IEEE Journal on Selected Areas in Communications, 40(3):1013–1029.

Horváth, S., Ho, C.-Y., Horváth, L., Sahu, A. N., Canini, M., and Richtárik, P. (2022). Natural compression for distributed deep learning. In Mathematical and Scientific Machine Learning, pages 129–141. PMLR.

Huang, X., Chen, Y., Yin, W., and Yuan, K. (2022). Lower bounds and nearly optimal algorithms in distributed learning with communication compression. Advances in Neural Information Processing Systems, 35:18955–18969.

Jaggi, M., Smith, V., Takáč, M., Terhorst, J., Krishnan, S., Hofmann, T., and Jordan, M. I. (2014). Communication-efficient distributed dual coordinate ascent. Advances in Neural Information Processing Systems, 27.

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. (2021). Advances and open problems in federated learning.
Foundations and Trends in Machine Learning, 14(1–2):1–210.

Karimi, H., Nutini, J., and Schmidt, M. (2016). Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pages 795–811. Springer.

Karimireddy, S. P., He, L., and Jaggi, M. (2021). Learning from history for Byzantine robust optimization. In International Conference on Machine Learning, pages 5311–5319. PMLR.

Karimireddy, S. P., He, L., and Jaggi, M. (2022). Byzantine-robust learning on heterogeneous datasets via bucketing. In International Conference on Learning Representations.

Karimireddy, S. P., Rebjock, Q., Stich, S., and Jaggi, M. (2019). Error feedback fixes SignSGD and other gradient compression schemes. In International Conference on Machine Learning, pages 3252–3261. PMLR.

Konstantinidis, K. and Ramamoorthy, A. (2021). ByzShield: An efficient and robust system for distributed training. Proceedings of Machine Learning and Systems, 3:812–828.

Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images.

Lamport, L., Shostak, R., and Pease, M. (2019). The Byzantine generals problem. In Concurrency: The Works of Leslie Lamport, pages 203–226.

Lan, G., Li, Z., and Zhou, Y. (2019). A unified variance-reduced accelerated gradient method for convex optimization. Advances in Neural Information Processing Systems, 32.

Lan, G. and Zhou, Y. (2018). An optimal randomized incremental gradient method. Mathematical Programming, 171:167–215.

Lee, J. D., Lin, Q., Ma, T., and Yang, T. (2017). Distributed stochastic variance reduced gradient methods by sampling extra data with replacement.
Journal of Machine Learning Research, 18(122):1–43.

Li, Z., Bao, H., Zhang, X., and Richtárik, P. (2021). PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization. In International Conference on Machine Learning, pages 6286–6295. PMLR.

Liu, C., Li, Y., Yi, Y., and Johansson, K. H. (2026). Byzantine-robust and communication-efficient distributed learning via compressed momentum filtering. IEEE Transactions on Neural Networks and Learning Systems.

Liu, H., Shan, L., Bao, H., You, R., Yi, Y., and Lv, J. (2025). LiD-FL: Towards list-decodable federated learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 18825–18833.

McMahan, B. and Ramage, D. (2017). Federated learning: Collaborative machine learning without centralized training data. Google Research Blog, 3.

Mishchenko, K., Gorbunov, E., Takáč, M., and Richtárik, P. (2019). Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269.

Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. (2017). SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621. PMLR.

Pillutla, K., Kakade, S. M., and Harchaoui, Z. (2022). Robust aggregation for federated learning. IEEE Transactions on Signal Processing, 70:1142–1154.

Polyak, B. T. (1963). Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3(4):864–878.

Rammal, A., Gruntkowska, K., Fedin, N., Gorbunov, E., and Richtárik, P. (2024). Communication compression for Byzantine robust learning: New efficient algorithms and improved rates. In International Conference on Artificial Intelligence and Statistics, pages 1207–1215. PMLR.

Richtárik, P., Sokolov, I., and Fatkhullin, I. (2021).
EF21: A new, simpler, theoretically better, and practically faster error feedback. Advances in Neural Information Processing Systems, 34:4384–4396.

Safaryan, M., Shulgin, E., and Richtárik, P. (2022). Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor. Information and Inference: A Journal of the IMA, 11(2):557–580.

Sahu, A., Dutta, A., M Abdelmoniem, A., Banerjee, T., Canini, M., and Kalnis, P. (2021). Rethinking gradient sparsification as total error minimization. Advances in Neural Information Processing Systems, 34:8133–8146.

Schmidt, M., Le Roux, N., and Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162:83–112.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. (2014). 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Interspeech, volume 2014, pages 1058–1062. Singapore.

Shi, W., Cao, J., Zhang, Q., Li, Y., and Xu, L. (2016). Edge computing: Vision and challenges. IEEE Internet of Things Journal, 3(5):637–646.

Su, L. and Vaidya, N. H. (2016). Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms. In Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing, pages 425–434.

Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. (2019). Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization. arXiv preprint arXiv:1905.05920.

Vogels, T., Karimireddy, S. P., and Jaggi, M. (2019). PowerSGD: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems, 32.

Wangni, J., Wang, J., Liu, J., and Zhang, T. (2018). Gradient sparsification for communication-efficient distributed optimization.
Advances in Neural Information Processing Systems, 31.

Weiszfeld, E. (1937). Sur le point pour lequel la somme des distances de n points donnés est minimum. Tohoku Mathematical Journal, First Series, 43:355–386.

Wu, Z., Ling, Q., Chen, T., and Giannakis, G. B. (2020). Federated variance-reduced stochastic gradient descent with robustness to Byzantine attacks. IEEE Transactions on Signal Processing, 68:4583–4596.

Xie, C., Koyejo, O., and Gupta, I. (2020). Fall of empires: Breaking Byzantine-tolerant SGD by inner product manipulation. In Uncertainty in Artificial Intelligence, pages 261–270. PMLR.

Xie, C., Koyejo, S., and Gupta, I. (2019). Zeno: Distributed stochastic gradient descent with suspicion-based fault-tolerance. In International Conference on Machine Learning, pages 6893–6901. PMLR.

Yang, Y.-R. and Li, W.-J. (2021). BASGD: Buffered asynchronous SGD for Byzantine learning. In International Conference on Machine Learning, pages 11751–11761. PMLR.

Yin, D., Chen, Y., Kannan, R., and Bartlett, P. (2018). Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5650–5659. PMLR.

Yu, H., Jin, R., and Yang, S. (2019). On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In International Conference on Machine Learning, pages 7184–7193. PMLR.

Zhu, H. and Ling, Q. (2021). BROADCAST: Reducing both stochastic and compression noise to robustify communication-efficient federated learning. arXiv preprint arXiv:2104.06685.

Contents

1 Introduction
2 Preliminaries
3 Byzantine-Robust Distributed Learning
3.1 Byzantine-DM21
3.2 Convergence of Byz-DM21 for General Non-Convex Problems
3.3 Convergence of Byz-DM21 under the Polyak-Łojasiewicz Condition
4 Incorporating Variance Reduction
4.1 Convergence of Byz-VR-DM21 for General Non-Convex Problems
4.2 Convergence of Byz-VR-DM21 under the Polyak-Łojasiewicz Condition
5 Numerical Experiments
6 Conclusion
A Related Work
B Intuition Behind Double Momentum
C Further Details on Robust Aggregation and Byzantine Attacks
C.1 Robust Aggregation
C.2 Byzantine Attacks
D Extra Experiments and Experimental Details
D.1 General Setup
D.2 Datasets
D.3 Experiments Setup
D.4 Empirical Results on Logistic Regression
D.5 Comparison of Variance Reduction Methods
D.6 Error Feedback Experiments
D.7 Empirical Results on CIFAR-10
D.8 Byzantine Ratio Experiments on CIFAR-10
D.9 Empirical Results on FEMNIST
D.10 Additional Experiments
E Useful Facts
F Missing Proofs of Byz-DM21 for General Non-Convex Functions
F.1 Supporting Lemmas
F.2 Proof of Theorem 3.1
G Missing Proofs of Byz-VR-DM21 for General Non-Convex Functions
G.1 Supporting Lemmas
G.2 Proof of Theorem 4.1
H Missing Proofs for Polyak-Łojasiewicz Functions
H.1 Byz-DM21
H.2 Byz-VR-DM21

A Related Work

Byzantine-robust distributed learning. A distributed learning algorithm is Byzantine-robust if its performance remains reliable in the face of Byzantine workers, which may act arbitrarily. The framework for Byzantine-robust distributed learning generally includes three main steps Guerraoui et al. (2024): (i) pre-processing of the vectors submitted by workers (e.g., models or stochastic gradients), (ii) robust aggregation of these vectors, and (iii) the application of an optimization method. Existing works in this domain often vary in their treatment of one or more of these steps. For pre-processing, techniques such as Bucketing Karimireddy et al. (2022) and Nearest Neighbor Mixing (NNM) Allouah et al. (2023) have been proposed. Many robust aggregation rules have been proposed; the most commonly used ones include coordinate-wise median (CWMed) and coordinate-wise trimmed mean (CWTM) Yin et al.
(2018), and centered clipping Karimireddy et al. (2021). Beyond the classical setting, where the server outputs a single model and robustness typically relies on a sufficiently large honest fraction, list-decodable federated learning maintains a list of candidate models and guarantees that at least one model performs well even under a malicious majority Liu et al. (2025). Additionally, Byzantine attack identification strategies have demonstrated enhanced robustness in distributed computing tasks such as matrix multiplication Hong et al. (2022). Various optimization algorithms have been applied to the considered problem, including Stochastic Gradient Descent (SGD) Yang and Li (2021), Polyak momentum Karimireddy et al. (2022); El Mhamdi et al. (2021), SAGA Wu et al. (2020), and VR-MARINA Gorbunov et al. (2023).

Variance Reduction. Variance reduction is a powerful technique for accelerating the convergence of stochastic methods, particularly when a good approximation of the solution is required. The earliest variance-reduced methods were introduced by Schmidt et al. (2017); Johnson and Zhang (2013); Defazio et al. (2014). Optimal methods for variance reduction in (strongly) convex problems were later proposed by Lan and Zhou (2018); Allen-Zhu (2018); Lan et al. (2019), while methods for non-convex optimization were developed by Nguyen et al. (2017); Fang et al. (2018); Li et al. (2021). Although variance reduction has attracted significant attention Gower et al. (2020), there has been limited research on combining it with Byzantine robustness, with only a few studies addressing this area Wu et al. (2020); Zhu and Ling (2021); Karimireddy et al. (2021); Gorbunov et al. (2023).

Compressed Communications and Error Feedback. Research on distributed methods with communication compression can generally be divided into two main categories.
The first focuses on methods utilizing unbiased compression operators, such as Rand-K sparsification (Rand-k) Horváth et al. (2022), while the second explores methods that employ biased compressors, like Top-K sparsification (Top-k) Alistarh et al. (2018). A detailed summary of the most commonly used compression operators can be found in Safaryan et al. (2022); Beznosikov et al. (2023). Biased compression combined with error feedback has demonstrated strong practical performance Seide et al. (2014); Vogels et al. (2019). In the non-convex setting, which is the focus of our work, standard error feedback has been analyzed by Karimireddy et al. (2019); Beznosikov et al. (2023); and Sahu et al. (2021). However, the complexity bounds for standard error feedback typically depend on the heterogeneity parameter $\zeta^2$ or require bounded gradients. Richtárik et al. (2021) address these challenges by introducing a novel variant of error feedback, known as EF21. This approach has since been extended in various directions by Fatkhullin et al. (2021).

B Intuition Behind Double Momentum

In this section, we present a detailed comparison between the single-momentum and double-momentum methods, thereby clarifying the advantages of the double-momentum approach. We consider a simple one-dimensional stochastic gradient model

$$g_t = \mu_t + \xi_t,$$

where $\mu_t = \mathbb{E}[g_t \mid \mathcal{F}_{t-1}]$ is the (possibly time-varying) true gradient at step $t$, and $\xi_t$ is the noise term with $\mathbb{E}[\xi_t \mid \mathcal{F}_{t-1}] = 0$ and $\mathrm{Var}(\xi_t \mid \mathcal{F}_{t-1}) \le \sigma^2$. Here $\mathcal{F}_{t-1}$ denotes the sigma-field generated by all randomness up to time $t-1$. We do not need to assume that $\mu_t$ is constant or independent of the noise. The only structural assumption is the standard one for SGD-type methods: the noise sequence $(\xi_t)$ is a martingale difference with bounded conditional variance.
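For concreteness, here are minimal NumPy sketches of the two sparsifiers discussed above (a simplified illustration with function names of our own choosing, not code from any of the cited works):

```python
import numpy as np

def top_k(x, k):
    """Biased, contractive Top-k: keep the k largest-magnitude coordinates.
    Satisfies ||C(x) - x||^2 <= (1 - k/d) ||x||^2."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]       # indices of the k largest |x_i|
    out[keep] = x[keep]
    return out

def rand_k(x, k, rng):
    """Unbiased Rand-k: keep k uniformly random coordinates, scaled by d/k
    so that E[C(x)] = x."""
    out = np.zeros_like(x)
    keep = rng.choice(x.size, size=k, replace=False)
    out[keep] = x[keep] * (x.size / k)
    return out

x = np.array([0.5, -3.0, 2.0, 0.1])
print(top_k(x, 2))                          # keeps -3.0 and 2.0, zeros the rest
# Empirical unbiasedness check for Rand-k: the average over many draws is ~x.
mean = np.mean([rand_k(x, 2, np.random.default_rng(s)) for s in range(20_000)], axis=0)
print(np.allclose(mean, x, atol=0.1))
```

This also illustrates why Top-k needs error feedback: it is deterministic and biased, whereas Rand-k trades bias for extra variance through the $d/k$ rescaling.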
Single momentum (conditional analysis). For convenience of analysis, we set $v_0 = 0$; however, choosing $v_0 = g_0$ instead does not change the asymptotic variance. The single-momentum update is

$$v_{t+1} = (1-\eta)v_t + \eta g_{t+1}, \qquad 0 < \eta < 1, \qquad v_0 = 0.$$

Unrolling the recursion yields the identity

$$v_t = \eta \sum_{k=0}^{t-1} (1-\eta)^k g_{t-k}.$$

Substituting $g_{t-k} = \mu_{t-k} + \xi_{t-k}$, we decompose

$$v_t = \underbrace{\eta \sum_{k=0}^{t-1} (1-\eta)^k \mu_{t-k}}_{\text{signal part}} + \underbrace{\eta \sum_{k=0}^{t-1} (1-\eta)^k \xi_{t-k}}_{\text{noise part}}.$$

Now fix any realization of $\{\mu_1, \dots, \mu_t\}$ and take the conditional expectation with respect to the noise:

$$\mathbb{E}[v_t \mid \mu_1, \dots, \mu_t] = \eta \sum_{k=0}^{t-1} (1-\eta)^k \mu_{t-k}.$$

This shows that in the time-varying case, momentum tracks an exponentially weighted average of past true gradients, not the instantaneous $\mu_t$. This is the usual bias of momentum methods. For the conditional variance, define the centered noise part

$$\tilde{v}_t := v_t - \mathbb{E}[v_t \mid \mu_1, \dots, \mu_t] = \eta \sum_{k=0}^{t-1} (1-\eta)^k \xi_{t-k}.$$

Under the standard assumption that the noise terms $\xi_s$ are uncorrelated in time and have conditional variance $\mathrm{Var}(\xi_s \mid \mathcal{F}_{s-1}) = \sigma^2$, we obtain

$$\mathbb{E}[\tilde{v}_t^2 \mid \mu_1, \dots, \mu_t] = \eta^2 \sigma^2 \sum_{k=0}^{t-1} (1-\eta)^{2k}.$$

Letting $t \to \infty$ gives

$$\mathrm{Var}_{\text{noise}}(v_\infty \mid \mu_{1:\infty}) = \eta^2 \sigma^2 \sum_{k=0}^{\infty} (1-\eta)^{2k} = \frac{\eta}{2-\eta}\,\sigma^2.$$

Importantly, this conditional variance formula is identical to the constant-$\mu$ case and does not depend on how $\mu_t$ is generated.

Double momentum (conditional analysis). Double momentum maintains two EMA variables:

$$v_{t+1} = (1-\eta)v_t + \eta g_{t+1}, \qquad u_{t+1} = (1-\eta)u_t + \eta v_{t+1}, \qquad v_0 = u_0 = 0.$$

Unrolling the second recursion gives

$$u_t = \eta^2 \sum_{r=0}^{t-1} (r+1)(1-\eta)^r g_{t-r}.$$

Again substituting $g_{t-r} = \mu_{t-r} + \xi_{t-r}$, we obtain

$$u_t = \underbrace{\eta^2 \sum_{r=0}^{t-1} (r+1)(1-\eta)^r \mu_{t-r}}_{\text{signal part}} + \underbrace{\eta^2 \sum_{r=0}^{t-1} (r+1)(1-\eta)^r \xi_{t-r}}_{\text{noise part}}.$$
Conditioning on $\{\mu_1, \dots, \mu_t\}$, the centered noise component is

$$\tilde{u}_t := u_t - \mathbb{E}[u_t \mid \mu_1, \dots, \mu_t] = \eta^2 \sum_{r=0}^{t-1} (r+1)(1-\eta)^r \xi_{t-r},$$

and its conditional variance is

$$\mathbb{E}[\tilde{u}_t^2 \mid \mu_1, \dots, \mu_t] = \eta^4 \sigma^2 \sum_{r=0}^{t-1} (r+1)^2 (1-\eta)^{2r}.$$

Letting $t \to \infty$ and using the standard series formula $\sum_{r=0}^{\infty} (r+1)^2 b^r = \frac{1+b}{(1-b)^3}$ with $b = (1-\eta)^2$, we get

$$\mathrm{Var}_{\text{noise}}(u_\infty \mid \mu_{1:\infty}) = \sigma^2\, \frac{\eta(2-2\eta+\eta^2)}{(2-\eta)^3}.$$

Conditional variance comparison. Therefore,

$$\mathrm{Var}_{\text{noise}}(v_\infty \mid \mu_{1:\infty}) = \frac{\eta}{2-\eta}\,\sigma^2, \qquad \mathrm{Var}_{\text{noise}}(u_\infty \mid \mu_{1:\infty}) = \sigma^2\, \frac{\eta(2-2\eta+\eta^2)}{(2-\eta)^3},$$

and the ratio

$$\frac{\mathrm{Var}_{\text{noise}}(u_\infty \mid \mu_{1:\infty})}{\mathrm{Var}_{\text{noise}}(v_\infty \mid \mu_{1:\infty})} = \frac{2-2\eta+\eta^2}{(2-\eta)^2} \in \left(\tfrac{1}{2}, 1\right), \qquad 0 < \eta < 1.$$

Thus, for any (possibly random and noise-dependent) trajectory $\{\mu_t\}$, the conditional noise variance of the double-momentum estimator is strictly smaller than that of the single-momentum estimator, with an asymptotic reduction factor of about $1/2$ when $\eta$ is small. The time variation of $\mu_t$ only affects the conditional means $\mathbb{E}[v_t \mid \mu_{1:t}]$ and $\mathbb{E}[u_t \mid \mu_{1:t}]$, i.e., the bias, but not the above variance formulas for the noise component. Notably, the "half-variance" effect observed in the 1D toy model extends to each coordinate and thus to the full high-dimensional case.

C Further Details on Robust Aggregation and Byzantine Attacks

C.1 Robust Aggregation

In Section 2, we apply robust aggregation rules satisfying Definition 2.5 after using NNM, proposed by Allouah et al. (2023) (see Algorithm 2). This algorithm can enhance the robustness of aggregation rules. In particular, Allouah et al. (2023) show that Algorithm 2 makes Robust Federated Averaging (RFA) Pillutla et al. (2022) (also known as the geometric median), Coordinate-wise Median (CM) Chen et al. (2017), and Coordinate-wise Trimmed Mean (CWTM) robust, in view of the definition from Allouah et al.
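The two closed-form limits and the ratio above can be checked numerically. The following small script (our illustration, not from the paper's codebase) compares truncated sums against the closed forms:

```python
import numpy as np

def ema_noise_variances(eta, sigma2=1.0, steps=4000):
    """Truncated noise-variance sums for the single-EMA estimator v_t and
    the double-EMA estimator u_t, assuming uncorrelated noise with
    conditional variance sigma2 at every step."""
    r = np.arange(steps)
    b = (1.0 - eta) ** 2
    var_v = eta**2 * sigma2 * np.sum(b**r)                 # single momentum
    var_u = eta**4 * sigma2 * np.sum((r + 1) ** 2 * b**r)  # double momentum
    return var_v, var_u

for eta in (0.05, 0.1, 0.5):
    var_v, var_u = ema_noise_variances(eta)
    closed_v = eta / (2 - eta)
    closed_u = eta * (2 - 2 * eta + eta**2) / (2 - eta) ** 3
    assert abs(var_v - closed_v) < 1e-8 and abs(var_u - closed_u) < 1e-8
    ratio = var_u / var_v                                  # in (1/2, 1)
    assert 0.5 < ratio < 1.0
    print(f"eta={eta}: ratio={ratio:.4f}")
```

As the derivation predicts, the printed ratio approaches $1/2$ as $\eta \to 0$ and $1$ as $\eta \to 1$.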
(2023).

Algorithm 2 NNM: Nearest Neighbor Mixing (Allouah et al., 2023)
1: Input: number of inputs $n$, number of Byzantine inputs $B < n/2$, vectors $x_1, \dots, x_n \in \mathbb{R}^d$.
2: for $i = 1, \dots, n$ do
3:   Sort the inputs to get $(x_{i:1}, \dots, x_{i:n})$ such that $\|x_{i:1} - x_i\| \le \dots \le \|x_{i:n} - x_i\|$;
4:   Average the $G$ nearest neighbors of $x_i$, i.e., $y_i = \frac{1}{G} \sum_{j=1}^{G} x_{i:j}$;
5: end for
6: Return: $y_1, \dots, y_n$;

Our main goal in this section is to show that RFA ∘ NNM, CM ∘ NNM, and CWTM ∘ NNM satisfy Definition 2.5. Before we prove this fact, we need to introduce RFA, CM, and CWTM.

Robust Federated Averaging. The RFA estimator finds a geometric median:
$$\mathrm{RFA}(x_1, \dots, x_n) \stackrel{\text{def}}{=} \operatorname*{argmin}_{x \in \mathbb{R}^d} \sum_{i=1}^{n} \|x - x_i\|. \tag{14}$$
The above problem has no closed-form solution. However, one can compute an approximate RFA using several steps of the smoothed Weiszfeld algorithm, each iteration of which has $O(n)$ computational cost (Weiszfeld, 1937; Pillutla et al., 2022).

Coordinate-wise Median. The CM estimator computes the median of each component separately. That is, for the $t$-th coordinate, it is defined as
$$[\mathrm{CM}(x_1, \dots, x_n)]_t \stackrel{\text{def}}{=} \mathrm{Median}([x_1]_t, \dots, [x_n]_t) = \operatorname*{argmin}_{u \in \mathbb{R}} \sum_{i=1}^{n} |u - [x_i]_t|. \tag{15}$$

Coordinate-wise Trimmed Mean. The CWTM estimator for input vectors $x_1, \dots, x_n \in \mathbb{R}^d$ is defined as
$$[\mathrm{CWTM}(x_1, \dots, x_n)]_t \stackrel{\text{def}}{=} \operatorname*{argmin}_{v \in \mathbb{R}} \sum_{i=B+1}^{n-B} ([x_i]_t - v)^2, \tag{16}$$
where $[x_i]_t$ represents the $t$-th coordinate values of the input vectors, sorted in non-decreasing order. CWTM finds the value $v$ that minimizes the sum of squared differences with the middle $n - 2B$ values after discarding the smallest $B$ and the largest $B$.

C.2 Byzantine Attacks

Byzantine workers send sophisticated malicious updates to the server. In this scenario, we model an omniscient attacker (Baruch et al., 2019) who knows the data of all clients.
Therefore, the attacker can mimic the statistical properties of honest updates and then craft a malicious one.

• Sign Flipping (SF) (Allen-Zhu et al., 2021): Byzantine workers compute $-c_i^{t+1}$ and send it to the server.
• Label Flipping (LF) (Allen-Zhu et al., 2021): Byzantine workers compute their gradients using poisoned labels (i.e., $y_i \to -y_i$).
• A Little Is Enough (ALIE) (Baruch et al., 2019): Byzantine workers compute the empirical mean $\mu_{\mathcal{G}}$ and standard deviation $\sigma_{\mathcal{G}}$ of $\{c_i^{t+1}\}_{i \in \mathcal{G}}$ and send $\mu_{\mathcal{G}} - z \sigma_{\mathcal{G}}$ to the server, where $z$ is a constant that controls the strength of the attack.
• Inner Product Manipulation (IPM) (Xie et al., 2020): Byzantine workers send $-\frac{z}{G} \sum_{i \in \mathcal{G}} c_i^{t+1}$ to the server, where $z > 0$ is a constant that controls the strength of the attack.

D Extra Experiments and Experimental Details

D.1 General Setup

Our running environment has the following setup:
• CPU: AMD Ryzen Threadripper PRO 5975WX 32-Cores,
• GPU: NVIDIA GeForce RTX 4090 with CUDA version 11.7,
• PyTorch version: 2.0.1.

D.2 Datasets

We apply the proposed algorithm to two types of classification tasks to evaluate its robustness, namely binary classification and image classification. For the binary classification task, we use the a9a and w8a datasets from LIBSVM (Chang and Lin, 2011); they are summarized in Table 2. For the image classification task, we use the CIFAR-10 (Krizhevsky et al., 2009) and FEMNIST (Caldas et al., 2018) datasets. The datasets and their distribution among clients are described below.

Table 2: Overview of the LIBSVM datasets used.
Dataset | N (# of datapoints) | d (# of features)
a9a     | 32,561              | 123
w8a     | 49,749              | 300

• CIFAR-10. The CIFAR-10 dataset consists of 60,000 32×32 color images across 10 classes, with 6,000 images per class. The sampled images are evenly distributed among the clients.
For CIFAR-10, we train a Residual Network with 20 layers (ResNet-20) (He et al., 2016).

• FEMNIST. The Federated Extended MNIST (FEMNIST) dataset is a widely used benchmark for federated learning, constructed by partitioning the EMNIST dataset (Cohen et al., 2017). FEMNIST consists of 805,263 images across 62 classes. For this study, we randomly sample 5% of the images from the original dataset and distribute them to clients in an independent and identically distributed (i.i.d.) manner. Note that we simulate an imbalanced data distribution, where the number of training samples varies across clients. The implementation is based on LEAF (Caldas et al., 2018). For FEMNIST, we train a Convolutional Neural Network (CNN) (Krizhevsky et al., 2009) with two convolutional layers.

In each case, the data is distributed among $n = 20$ workers, out of which 8 are Byzantine.

D.3 Experiments Setup

For each experiment, we select the step size from the candidate set {0.5, 0.05, 0.005} and fix it throughout the training process. No learning-rate warmup or decay is applied. Each experiment is repeated with three different random seeds, and we report the mean training loss or testing accuracy along with one standard error. Additionally, we adopt the same set of robust aggregation hyperparameters as outlined in Karimireddy et al. (2022), i.e.,

• Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21: the momentum parameter is set as $\eta = 0.1$,
• Byz-VR-MARINA: the probability of computing a full gradient is $p = \min\{b/m,\, 1/(1+\omega)\}$,
• BR-DIANA: the compressed-difference parameter is $\beta = 0.01$,
• RFA: the number of steps of the smoothed Weiszfeld algorithm is $T = 8$,
• ALIE: the constant $z$ that controls the strength of the attack is chosen according to Baruch et al. (2019),
• IPM: the constant that controls the strength of the attack is $\epsilon = 0.1$.
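The aggregation pipeline used above composes a robust rule with NNM (Appendix C.1). A minimal Python sketch on plain lists of floats, not an optimized implementation, illustrating CM, CWTM, and NNM:

```python
import statistics

def coordinate_median(vectors):
    # CM: per-coordinate median of the input vectors
    return [statistics.median(col) for col in zip(*vectors)]

def coordinate_trimmed_mean(vectors, B):
    # CWTM: per coordinate, drop the B smallest and B largest values, average the rest
    out = []
    for col in zip(*vectors):
        kept = sorted(col)[B:len(col) - B]
        out.append(sum(kept) / len(kept))
    return out

def nnm(vectors, B):
    # NNM (Algorithm 2): replace each x_i by the mean of its G = n - B nearest inputs
    G = len(vectors) - B
    mixed = []
    for x in vectors:
        nearest = sorted(
            vectors, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, y))
        )[:G]
        mixed.append([sum(col) / G for col in zip(*nearest)])
    return mixed
```

A composed rule such as CWTM ∘ NNM is then simply `coordinate_trimmed_mean(nnm(xs, B), B)`.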
Figure 3: The training loss of the RFA, CM, and CWTM aggregation rules under four attack scenarios (SF, IPM, LF, ALIE) on the w8a dataset in a heterogeneous setting. BR-DIANA and Byz-VR-MARINA use the Rand-k compressor with $k = 0.1d$, while Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21 use the Top-k compressor with $k = 0.1d$.

D.4 Empirical Results on Logistic Regression

For this task, we consider solving a logistic regression problem with $\ell_2$-regularization:
$$f(x, \xi_i) = \log\big(1 + \exp(-b_i\, a_i^\top x)\big) + \lambda \|x\|^2,$$
where $\xi_i = (a_i, b_i) \in \mathbb{R}^d \times \{-1, 1\}$ denotes a data point, with $a_i \in \mathbb{R}^d$ and $b_i \in \{-1, 1\}$, and $\lambda > 0$ is the regularization parameter.
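The per-sample loss and gradient of this objective can be sketched directly; a minimal Python sketch (function names are illustrative, and no guard against overflow for extreme margins is included):

```python
import math

def logistic_loss(x, a, b, lam):
    # f(x; a, b) = log(1 + exp(-b * <a, x>)) + lam * ||x||^2
    margin = b * sum(ai * xi for ai, xi in zip(a, x))
    return math.log1p(math.exp(-margin)) + lam * sum(xi * xi for xi in x)

def logistic_grad(x, a, b, lam):
    # d/dx log(1 + exp(-b <a, x>)) = -b * a / (1 + exp(b <a, x>)); plus 2 * lam * x
    margin = b * sum(ai * xi for ai, xi in zip(a, x))
    coef = -b / (1.0 + math.exp(margin))
    return [coef * ai + 2 * lam * xi for ai, xi in zip(a, x)]
```

At $x = 0$ the loss equals $\log 2$ regardless of the data point, which is a quick sanity check.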
We set $\lambda = 1/m$ for these experiments, where $m$ is the number of samples in the local datasets. For all methods, we use a batch size of $b = 1$ and select the step size from the candidates $\gamma \in \{0.5, 0.05, 0.005\}$. We use $k = 0.1d$ for both the Top-k and Rand-k compressors. Finally, the number of epochs is set to 40. To ensure reproducibility, all experiments were conducted using three different random seeds. We report the mean training loss along with one standard error.

Figure 3 illustrates the training loss of five algorithms—BR-DIANA, Byz-VR-MARINA, Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21—under a 0.4 adversarial setting on the w8a dataset in a heterogeneous setting. The results show that Byz-DM21 and Byz-VR-DM21 have a slight overall advantage, converging more quickly to the optimal solution even when there is a high proportion of Byzantine workers. The improvement in performance is attributed to the double-momentum mechanism and the variance reduction techniques, which effectively reduce the impact of malicious updates.

Figure 4: The relative error curves of 3 aggregation rules (RFA, CM, CWTM) under 4 attacks (ALIE, IPM, LF, SF) on the a9a dataset in a heterogeneous setting. The dataset is uniformly split over 12 honest workers with 8 Byzantine workers. BR-LSVRG, Byz-VR-MARINA, Byrd-SAGA, and Byz-VR-DM21 are run with batch size $b = 0.01m$ and step size $\gamma = 1/2L$.

D.5 Comparison of Variance Reduction Methods

In this experiment, we compare our Byz-VR-DM21 with several well-known variance-reduction algorithms, including BR-LSVRG (Fedin and Gorbunov, 2023), Byz-VR-MARINA (Gorbunov et al., 2023), and Byrd-SAGA (Wu et al., 2020). We evaluate these methods on the a9a dataset from the LIBSVM library, using logistic regression with $\ell_2$ regularization.
Specifically, we set the $\ell_2$-regularization parameter to $L/1000$ in our experiments, where $L$ denotes the $L$-smoothness constant. The total number of workers is $n = 20$, including 8 Byzantine workers. All methods were evaluated with a step size $\gamma = 1/2L$ and a batch size of $b = 0.01m$, i.e., $b = 325$.

Figure 4 presents the relative error curves of four algorithms—BR-LSVRG, Byz-VR-MARINA, Byrd-SAGA, and Byz-VR-DM21—under a 0.4 adversarial setting on the a9a dataset in a heterogeneous setting. The results show that, under both the Sign Flipping (SF) and Label Flipping (LF) attacks, Byz-VR-DM21 consistently outperforms the other algorithms, achieving faster convergence to the optimal solution. This underscores Byz-VR-DM21's superior robustness in minimizing the impact of adversarial updates, enabling the model to stay on course toward optimal performance despite the presence of Byzantine workers. BR-LSVRG is particularly effective under the A Little Is Enough (ALIE) attack, and Byz-VR-MARINA under the Inner Product Manipulation (IPM) attack. In all cases, Byz-VR-DM21 converges to very high accuracy, while no algorithm dominates across all attacks. In addition, Byz-VR-DM21 consistently achieves a functional suboptimality of at most $10^{-4}$, representing reasonably good accuracy. This experiment demonstrates that Byz-VR-DM21 can converge to high accuracy even with small or moderate batch sizes.

Figure 5: The communication complexity comparison under 4 attacks (ALIE, IPM, LF, SF) on the a9a dataset in a heterogeneous setting. Byz-VR-MARINA uses the Rand-k compressor, while Byz-VR-DM21 uses the Top-k compressor, with $k = 0.1d$, batch size $b = 1$, and step size $\gamma = 1/2L$.

D.6 Error Feedback Experiments

We next compare the empirical performance of Byz-VR-DM21 and Byz-VR-MARINA on the a9a dataset in a heterogeneous setting.
The total number of workers is $n = 20$, including 2 Byzantine workers. All methods were evaluated with a step size of $\gamma = 1/2L$ and a batch size of $b = 1$. We use $k = 0.1d$ for both the Top-k and Rand-k compressors. The experiments unveil promising potential of error feedback for communication improvement. As shown in Figure 5, Byz-VR-DM21 converges slightly faster than Byz-VR-MARINA before both reach a point of stagnation.

Figure 6: The testing accuracy (%) of 3 aggregation rules (RFA, CM, CWTM) under 4 attacks (ALIE, IPM, LF, SF) on the CIFAR-10 dataset. BR-DIANA and Byz-VR-MARINA use the Rand-k compressor, while Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21 use the Top-k compressor, where $k = 0.1d$.

D.7 Empirical Results on CIFAR-10

We evaluate our algorithms on an image classification task using the CIFAR-10 dataset (Krizhevsky et al., 2009) with the ResNet-20 deep neural network (He et al., 2016). For all methods, we use a batch size of $b = 64$ and select the step size from the candidates $\gamma \in \{0.5, 0.05, 0.005\}$. We use $k = 0.1d$ for both the Top-k and Rand-k compressors. The training process is carried out over 100 epochs, i.e., 5000 iterations. To ensure reproducibility, all experiments were conducted using three different random seeds. We report the mean testing accuracy along with one standard error.

Figure 6 presents the testing accuracy of five algorithms—BR-DIANA, Byz-VR-MARINA, Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21—under a 0.4 adversarial setting on the CIFAR-10 dataset. The results show that Byz-VR-DM21 demonstrates a slight advantage over the other algorithms in all the attack scenarios considered, even in the case of a high proportion of Byzantine workers.

Figure 7: The testing accuracy (%) of 3 aggregation rules (RFA, CM, CWTM) under 4 attacks (ALIE, IPM, LF, SF) on the CIFAR-10 dataset with a heterogeneity regime of $a = 0.25$ (high). BR-DIANA and Byz-VR-MARINA use the Rand-k compressor, while Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21 use the Top-k compressor, where $k = 0.1d$.
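The attack models evaluated throughout these experiments (Appendix C.2) admit compact implementations. A minimal sketch for SF, ALIE, and IPM, assuming honest updates are given as plain lists of floats; `alie` uses the population standard deviation, which may differ from the exact convention in Baruch et al. (2019):

```python
import statistics

def sign_flip(update):
    # SF: a Byzantine worker sends the negation of its own update c_i
    return [-c for c in update]

def alie(honest_updates, z):
    # ALIE: send mu_G - z * sigma_G, the per-coordinate mean minus z times the std
    cols = list(zip(*honest_updates))
    mu = [sum(c) / len(c) for c in cols]
    sd = [statistics.pstdev(c) for c in cols]  # population std (an assumption)
    return [m - z * s for m, s in zip(mu, sd)]

def ipm(honest_updates, z):
    # IPM: send -(z / G) * sum of honest updates
    G = len(honest_updates)
    return [-z * sum(c) / G for c in zip(*honest_updates)]
```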
Figure 7 presents the testing accuracy of five algorithms—BR-DIANA, Byz-VR-MARINA, Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21—under a 0.4 adversarial setting on the CIFAR-10 dataset with a heterogeneity regime of $a = 0.25$ (high). The results show that Byz-VR-DM21 demonstrates a slight advantage over the other algorithms in all the attack scenarios considered, even in the case of a high proportion of Byzantine workers and high heterogeneity.

In Table 3, we present the performance of the five algorithms—BR-DIANA, Byz-VR-MARINA, Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21—on the CIFAR-10 dataset. For each aggregation rule and under every attack, we highlight in bold the algorithm that achieves the highest accuracy in the considered scenario.

In Table 4, we present the performance of the same five algorithms on the CIFAR-10 dataset with a heterogeneity regime of $a = 0.25$ (high). For each aggregation rule and under every attack, we highlight in bold the algorithm that achieves the highest accuracy in the considered scenario.

Aggregation | Method         | ALIE          | IPM           | LF            | SF            | N.A.          | Worst Case
RFA+NNM     | BR-DIANA       | 65.86 ± 02.03 | 61.03 ± 00.83 | 51.96 ± 02.66 | 51.62 ± 02.07 | 67.64 ± 01.63 | 51.62 ± 02.07
            | Byz-VR-MARINA  | 50.76 ± 24.86 | 42.86 ± 00.79 | 69.03 ± 01.17 | 69.63 ± 02.44 | 71.07 ± 00.18 | 42.86 ± 00.79
            | Byz-EF21-SGDM  | 64.14 ± 00.33 | 64.90 ± 01.35 | 72.75 ± 01.55 | 80.12 ± 00.87 | 80.94 ± 01.12 | 64.14 ± 00.33
            | Byz-DM21       | 63.47 ± 01.41 | 64.26 ± 01.00 | 65.98 ± 02.28 | 78.42 ± 01.63 | 80.10 ± 02.14 | 63.47 ± 01.41
            | Byz-VR-DM21    | 68.63 ± 01.37 | 66.20 ± 00.64 | 73.34 ± 00.51 | 81.19 ± 01.21 | 82.33 ± 00.87 | 66.20 ± 00.64
CM+NNM      | BR-DIANA       | 64.99 ± 02.05 | 61.93 ± 01.02 | 51.27 ± 01.72 | 50.91 ± 04.73 | 67.95 ± 02.05 | 50.91 ± 04.73
            | Byz-VR-MARINA  | 43.66 ± 01.24 | 42.02 ± 00.28 | 70.71 ± 00.36 | 70.46 ± 00.37 | 70.17 ± 01.66 | 42.02 ± 00.28
            | Byz-EF21-SGDM  | 63.42 ± 00.64 | 63.88 ± 00.64 | 71.89 ± 01.54 | 80.36 ± 01.06 | 80.98 ± 01.02 | 63.42 ± 00.64
            | Byz-DM21       | 62.77 ± 00.45 | 63.96 ± 01.18 | 66.04 ± 00.49 | 78.50 ± 02.53 | 80.73 ± 00.44 | 62.77 ± 00.45
            | Byz-VR-DM21    | 66.70 ± 03.19 | 64.60 ± 00.62 | 70.08 ± 01.99 | 82.46 ± 00.59 | 82.66 ± 00.31 | 64.60 ± 00.62
CWTM+NNM    | BR-DIANA       | 64.29 ± 02.73 | 61.41 ± 00.97 | 52.09 ± 01.49 | 49.97 ± 03.31 | 68.35 ± 02.42 | 49.97 ± 03.31
            | Byz-VR-MARINA  | 42.68 ± 01.53 | 42.82 ± 01.05 | 70.71 ± 00.67 | 69.75 ± 01.93 | 70.57 ± 01.10 | 42.68 ± 01.53
            | Byz-EF21-SGDM  | 64.10 ± 02.96 | 64.56 ± 00.32 | 71.78 ± 01.75 | 79.89 ± 02.51 | 79.89 ± 02.51 | 64.10 ± 02.96
            | Byz-DM21       | 64.90 ± 00.29 | 63.86 ± 01.57 | 65.89 ± 01.14 | 80.71 ± 01.59 | 80.93 ± 01.68 | 63.86 ± 01.57
            | Byz-VR-DM21    | 68.11 ± 01.46 | 67.41 ± 00.45 | 71.04 ± 00.77 | 81.87 ± 00.88 | 83.08 ± 00.49 | 67.41 ± 00.45

Table 3: Maximum testing accuracy (%) across $T = 5000$ learning steps on the CIFAR-10 dataset, under four Byzantine attack strategies. There are $B = 8$ Byzantine workers among $n = 20$. 'N.A.' denotes the case with no Byzantine attackers. In each of the three horizontal blocks and under each attack, the best accuracy is highlighted in bold. Additionally, for every method, we report the worst-case accuracy across attacks.

Aggregation | Method         | ALIE          | IPM           | LF            | SF            | N.A.          | Worst Case
RFA+NNM     | BR-DIANA       | 60.52 ± 03.29 | 59.28 ± 01.08 | 48.47 ± 01.74 | 48.00 ± 04.43 | 62.25 ± 01.63 | 48.00 ± 04.43
            | Byz-VR-MARINA  | 42.77 ± 01.34 | 26.76 ± 01.97 | 51.72 ± 21.35 | 63.06 ± 02.42 | 64.15 ± 02.18 | 26.76 ± 01.97
            | Byz-EF21-SGDM  | 20.23 ± 04.62 | 27.41 ± 04.01 | 66.05 ± 01.52 | 65.25 ± 01.54 | 72.62 ± 00.38 | 20.23 ± 04.62
            | Byz-DM21       | 55.80 ± 00.61 | 55.64 ± 01.30 | 57.13 ± 00.72 | 68.13 ± 01.00 | 75.77 ± 00.85 | 55.64 ± 01.30
            | Byz-VR-DM21    | 59.36 ± 00.92 | 61.36 ± 01.67 | 64.15 ± 01.77 | 70.49 ± 02.92 | 78.96 ± 00.74 | 59.36 ± 00.92
CM+NNM      | BR-DIANA       | 62.34 ± 03.85 | 58.71 ± 00.96 | 48.54 ± 03.82 | 48.81 ± 05.03 | 63.24 ± 01.86 | 47.04 ± 01.86
            | Byz-VR-MARINA  | 42.94 ± 00.66 | 25.37 ± 01.86 | 64.84 ± 04.01 | 50.97 ± 25.65 | 63.04 ± 02.61 | 25.37 ± 01.86
            | Byz-EF21-SGDM  | 20.36 ± 01.84 | 25.70 ± 00.39 | 65.31 ± 02.07 | 60.27 ± 00.83 | 71.57 ± 00.97 | 20.36 ± 01.84
            | Byz-DM21       | 55.23 ± 00.42 | 55.91 ± 00.44 | 56.08 ± 00.57 | 67.49 ± 02.84 | 76.59 ± 01.72 | 55.23 ± 00.42
            | Byz-VR-DM21    | 55.44 ± 02.34 | 56.39 ± 01.93 | 59.05 ± 01.93 | 68.96 ± 01.14 | 79.57 ± 01.89 | 55.44 ± 02.34
CWTM+NNM    | BR-DIANA       | 59.71 ± 05.10 | 59.58 ± 01.54 | 46.39 ± 00.73 | 47.83 ± 07.10 | 61.00 ± 01.25 | 46.39 ± 00.73
            | Byz-VR-MARINA  | 43.28 ± 00.23 | 25.93 ± 02.38 | 57.55 ± 25.73 | 64.58 ± 12.93 | 64.74 ± 12.54 | 25.93 ± 02.38
            | Byz-EF21-SGDM  | 20.46 ± 01.78 | 25.91 ± 00.55 | 64.98 ± 00.21 | 61.98 ± 02.01 | 72.75 ± 00.77 | 20.46 ± 01.78
            | Byz-DM21       | 56.27 ± 00.27 | 56.34 ± 00.96 | 57.71 ± 01.20 | 67.55 ± 00.59 | 72.30 ± 05.23 | 56.27 ± 00.27
            | Byz-VR-DM21    | 54.31 ± 00.32 | 57.41 ± 00.86 | 59.06 ± 02.12 | 72.78 ± 02.59 | 79.23 ± 00.71 | 54.31 ± 00.32

Table 4: Maximum testing accuracy (%) across $T = 5000$ learning steps on the CIFAR-10 dataset with a heterogeneity regime of $a = 0.25$ (high), under four Byzantine attack strategies. There are $B = 8$ Byzantine workers among $n = 20$. 'N.A.' denotes the case with no Byzantine attackers. In each of the three horizontal blocks and under each attack, the best accuracy is highlighted in bold. Additionally, for every method, we report the worst-case accuracy across attacks.
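Both compressor families used throughout the experiments (Top-k for our methods, Rand-k for the baselines, both with $k = 0.1d$) are easy to state precisely. A minimal sketch on plain lists; `rand_k` applies the usual $d/k$ rescaling that makes it unbiased in expectation:

```python
import random

def top_k(x, k):
    # Top-k: keep the k largest-magnitude coordinates, zero the rest (biased, contractive)
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k]
    out = [0.0] * len(x)
    for i in keep:
        out[i] = x[i]
    return out

def rand_k(x, k, rng=random):
    # Rand-k: keep k uniformly sampled coordinates, rescaled by d/k so E[rand_k(x)] = x
    d = len(x)
    out = [0.0] * d
    for i in rng.sample(range(d), k):
        out[i] = x[i] * d / k
    return out
```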
Figure 8: The testing accuracy (%) of 3 aggregation rules (RFA, CM, CWTM) under 4 attacks (ALIE, IPM, LF, SF) on the CIFAR-10 dataset. The dataset is uniformly split among 20 workers, including 1 Byzantine worker. BR-DIANA and Byz-VR-MARINA use the Rand-k compressor, while Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21 use the Top-k compressor, where $k = 0.1d$.

D.8 Byzantine Ratio Experiments on CIFAR-10

We evaluate validation accuracy under different Byzantine worker ratios. The results cover three configurations: 19 honest workers with 1 Byzantine worker (Figure 8), 18 honest workers with 2 Byzantine workers (Figure 9), and 16 honest workers with 4 Byzantine workers (Figure 10). For all methods, we use a batch size of $b = 64$ and select the step size from the candidates $\gamma \in \{0.5, 0.05, 0.005\}$. We use $k = 0.1d$ for both the Top-k and Rand-k compressors. The training process is carried out over 100 epochs, i.e., 5000 iterations. To ensure reproducibility, all experiments were conducted using three different random seeds. We report the mean testing accuracy along with one standard error.

Figure 9: The testing accuracy (%) of 3 aggregation rules (RFA, CM, CWTM) under 4 attacks (ALIE, IPM, LF, SF) on the CIFAR-10 dataset. The dataset is uniformly split among 20 workers, including 2 Byzantine workers. BR-DIANA and Byz-VR-MARINA use the Rand-k compressor, while Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21 use the Top-k compressor, where $k = 0.1d$.
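The RFA aggregator used in these experiments is computed by a few Weiszfeld steps ($T = 8$ above). A minimal sketch of the iteration; the smoothing is handled here by a simple distance floor `eps`, which is an assumption, since the smoothing in Pillutla et al. (2022) is more refined:

```python
def smoothed_weiszfeld(vectors, steps=8, eps=1e-8):
    # Approximate the geometric median (RFA) by iteratively re-weighting points
    # with the inverse of their distance to the current estimate.
    n, d = len(vectors), len(vectors[0])
    y = [sum(col) / n for col in zip(*vectors)]  # start from the coordinate-wise mean
    for _ in range(steps):
        w = [
            1.0 / max(eps, sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5)
            for x in vectors
        ]
        s = sum(w)
        y = [sum(wi * x[j] for wi, x in zip(w, vectors)) / s for j in range(d)]
    return y
```

For two copies of 0 and one outlier at 10 in 1D, the geometric median is 0, and the iteration moves the estimate from the mean (10/3) toward 0.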
Figure 10: The testing accuracy (%) of 3 aggregation rules (RFA, CM, CWTM) under 4 attacks (ALIE, IPM, LF, SF) on the CIFAR-10 dataset. The dataset is uniformly split among 20 workers, including 4 Byzantine workers. BR-DIANA and Byz-VR-MARINA use the Rand-k compressor, while Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21 use the Top-k compressor, where $k = 0.1d$.
Figure 11: The testing accuracy (%) of 3 aggregation rules (RFA, CM, CWTM) under 4 attacks (ALIE, IPM, LF, SF) on the FEMNIST dataset. BR-DIANA and Byz-VR-MARINA use the Rand-k compressor, while Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21 use the Top-k compressor, where $k = 0.1d$.

D.9 Empirical Results on FEMNIST

We evaluate our algorithms on an image classification task using the FEMNIST dataset (Caldas et al., 2018) with a convolutional neural network (CNN) (Krizhevsky et al., 2009). For all methods, we use a batch size of $b = 32$ and select the step size from the candidates $\gamma \in \{0.5, 0.05, 0.005\}$. We use $k = 0.1d$ for both the Top-k and Rand-k compressors. The training process is carried out over 100 epochs, i.e., 5000 iterations. To ensure reproducibility, all experiments were conducted using three different random seeds. We report the mean testing accuracy along with one standard error.
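The reporting convention used throughout (mean over three seeds with one standard error) amounts to the following small helper; a sketch assuming the sample standard deviation is used:

```python
import statistics

def mean_and_stderr(per_seed_results):
    # mean over seeds and one standard error (sample std / sqrt(number of seeds))
    m = statistics.mean(per_seed_results)
    se = statistics.stdev(per_seed_results) / len(per_seed_results) ** 0.5
    return m, se
```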
Figure 11 presents the testing accuracy of five algorithms—BR-DIANA, Byz-VR-MARINA, Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21—under a 0.4 adversarial setting on the FEMNIST dataset. The results show that Byz-DM21 and Byz-VR-DM21 demonstrate a slight advantage over the other algorithms in all attack scenarios considered, achieving higher accuracy during convergence, even with a high proportion of Byzantine workers. This highlights the benefits of the double momentum and variance reduction approach.

Aggregation | Method         | ALIE          | IPM           | LF            | SF            | N.A.          | Worst Case
RFA+NNM     | BR-DIANA       | 75.74 ± 00.78 | 67.94 ± 00.60 | 61.81 ± 04.07 | 61.06 ± 02.34 | 65.16 ± 00.29 | 61.81 ± 04.07
            | Byz-VR-MARINA  | 61.18 ± 00.00 | 51.45 ± 00.83 | 68.91 ± 02.00 | 62.77 ± 16.24 | 70.74 ± 01.35 | 51.45 ± 00.83
            | Byz-EF21-SGDM  | 72.94 ± 01.07 | 75.37 ± 00.49 | 76.07 ± 05.02 | 80.23 ± 00.40 | 80.32 ± 00.94 | 72.94 ± 01.07
            | Byz-DM21       | 79.45 ± 00.34 | 79.01 ± 00.59 | 80.35 ± 00.14 | 81.49 ± 00.44 | 81.85 ± 00.60 | 79.01 ± 00.59
            | Byz-VR-DM21    | 79.40 ± 07.33 | 80.05 ± 03.27 | 80.50 ± 01.12 | 81.72 ± 00.15 | 82.02 ± 00.41 | 79.40 ± 07.33
CM+NNM      | BR-DIANA       | 75.71 ± 01.71 | 67.75 ± 00.67 | 61.61 ± 02.19 | 59.87 ± 02.81 | 73.41 ± 01.28 | 59.87 ± 02.81
            | Byz-VR-MARINA  | 60.05 ± 00.00 | 50.85 ± 00.54 | 66.04 ± 11.14 | 67.79 ± 04.05 | 70.79 ± 04.79 | 50.85 ± 00.54
            | Byz-EF21-SGDM  | 72.95 ± 03.18 | 74.71 ± 01.83 | 75.66 ± 00.23 | 80.37 ± 00.34 | 80.45 ± 00.15 | 72.95 ± 03.18
            | Byz-DM21       | 78.52 ± 00.36 | 78.07 ± 00.40 | 79.90 ± 00.52 | 81.60 ± 00.59 | 81.65 ± 00.42 | 78.07 ± 00.40
            | Byz-VR-DM21    | 78.53 ± 00.54 | 79.37 ± 00.36 | 80.29 ± 00.42 | 81.61 ± 00.51 | 81.98 ± 00.89 | 78.53 ± 00.54
CWTM+NNM    | BR-DIANA       | 69.48 ± 01.39 | 67.94 ± 00.61 | 59.11 ± 03.31 | 59.96 ± 03.74 | 72.85 ± 02.88 | 59.11 ± 03.31
            | Byz-VR-MARINA  | 60.18 ± 00.00 | 50.97 ± 00.79 | 62.90 ± 28.64 | 66.20 ± 31.62 | 69.30 ± 06.43 | 50.97 ± 00.79
            | Byz-EF21-SGDM  | 72.94 ± 03.06 | 74.54 ± 02.51 | 76.12 ± 00.09 | 80.39 ± 00.45 | 80.50 ± 01.96 | 72.94 ± 03.06
            | Byz-DM21       | 78.35 ± 01.13 | 78.00 ± 00.55 | 79.89 ± 00.80 | 81.49 ± 00.45 | 82.39 ± 00.13 | 78.00 ± 00.55
            | Byz-VR-DM21    | 78.99 ± 00.35 | 79.61 ± 01.93 | 80.35 ± 00.34 | 81.67 ± 00.66 | 81.86 ± 00.89 | 78.99 ± 00.35

Table 5: Maximum testing accuracy (%) across $T = 5000$ learning steps on the FEMNIST dataset, under four Byzantine attack strategies. There are $B = 8$ Byzantine workers among $n = 20$. 'N.A.' denotes the case with no Byzantine attackers. In each of the three horizontal blocks and under each attack, the best accuracy is highlighted in bold. Additionally, for every method, we report the worst-case accuracy across attacks.

In Table 5, we present the performance of five algorithms—BR-DIANA, Byz-VR-MARINA, Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21—on the FEMNIST dataset. For each aggregation rule and under every attack, we highlight in bold the algorithm that achieves the highest accuracy in the considered scenario.

D.10 Additional Experiments

Figure 12: The testing accuracy (%) of 3 aggregation rules (RFA,
CM, CWTM) under 4 attacks (ALIE, IPM, LF, SF) on the CIFAR-10 dataset. BR-DIANA, Byz-VR-MARINA, and Byz-DASHA-PAGE use the Rand-k compressor, while Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21 use the Top-k compressor, where $k = 0.1d$.

Figure 12 presents the testing accuracy of six algorithms—BR-DIANA, Byz-VR-MARINA, Byz-DASHA-PAGE, Byz-EF21-SGDM, Byz-DM21, and Byz-VR-DM21—under a 0.4 adversarial setting on the CIFAR-10 dataset. The results show that Byz-VR-DM21 consistently outperforms the other baselines across all considered attack scenarios, even when the proportion of Byzantine workers is high; the only exception is Byz-DASHA-PAGE. Under our current CIFAR-10 setup and hyperparameters, Byz-DASHA-PAGE is empirically very competitive, in some cases slightly better than Byz-DM21 and close to Byz-VR-DM21 in terms of test accuracy. However, this comes at a non-negligible algorithmic cost: Byz-DASHA-PAGE requires computing a full local gradient with a certain probability at each iteration, while our methods (Byz-DM21 and Byz-VR-DM21) never rely on full-gradient evaluations and only use stochastic gradients. In federated or large-scale distributed settings, full gradients can be significantly more expensive than stochastic ones, especially when each client holds a large dataset. For this reason, we view Byz-VR-DM21 as offering a more favorable trade-off between robustness, convergence guarantees, and per-iteration computational cost, even when its final accuracy is comparable to that of Byz-DASHA-PAGE.

E Useful Facts

For all $a, b, c, d \in \mathbb{R}^d$ and $\alpha > 0$, $\rho \in (0, 1]$, the following relations hold:
$$\|a + b\|^2 \le (1 + \rho)\|a\|^2 + (1 + \rho^{-1})\|b\|^2, \tag{17}$$
$$\|a + b + c\|^2 \le 3\|a\|^2 + 3\|b\|^2 + 3\|c\|^2, \tag{18}$$
$$\|a + b + c + d\|^2 \le 4\|a\|^2 + 4\|b\|^2 + 4\|c\|^2 + 4\|d\|^2.$$
(19)

Variance decomposition: For any random vector $X \in \mathbb{R}^d$ and any non-random vector $c \in \mathbb{R}^d$, we have
$$\mathbb{E}\big[\|X - c\|^2\big] = \mathbb{E}\big[\|X - \mathbb{E}[X]\|^2\big] + \|\mathbb{E}[X] - c\|^2. \quad (20)$$

Lemma E.1 (Richtárik et al. (2021)). Let $a, b > 0$. If $0 < \gamma \le \frac{1}{\sqrt{a} + b}$, then $a\gamma^2 + b\gamma \le 1$. Moreover, the bound is tight up to a factor of 2, since $\frac{1}{\sqrt{a} + b} \le \min\big\{\frac{1}{\sqrt{a}}, \frac{1}{b}\big\} \le \frac{2}{\sqrt{a} + b}$.

F Missing Proofs of Byz-DM21 for General Non-Convex Functions

Let us state the following lemmas used in the analysis of our methods.

F.1 Supporting Lemmas

We prepare the following lemmas to facilitate the proof of Theorem 3.1.

Lemma F.1 (Descent lemma from Li et al. (2021)). Let $f(x)$ be an $L$-smooth function. For the update $x^{(t+1)} = x^{(t)} - \gamma g^{(t)}$, there holds
$$f(x^{(t+1)}) \le f(x^{(t)}) - \frac{\gamma}{2}\|\nabla f(x^{(t)})\|^2 - \Big(\frac{1}{2\gamma} - \frac{L}{2}\Big)\|x^{(t+1)} - x^{(t)}\|^2 + \frac{\gamma}{2}\|g^{(t)} - \nabla f(x^{(t)})\|^2. \quad (21)$$

Lemma F.2 (Robust aggregation error). Suppose that Assumption 2.3 holds. Then, for all $t \ge 0$, the iterates generated by Byz-DM21 in Algorithm 1 satisfy the following condition:
$$\|g^{(t)} - \bar{g}^{(t)}\|^2 \le \frac{\kappa}{G}\sum_{i\in\mathcal{G}}\|g_i^{(t)} - \bar{g}^{(t)}\|^2 \le \frac{\kappa}{2G^2}\sum_{i,j\in\mathcal{G}}\|g_i^{(t)} - g_j^{(t)}\|^2 \le \frac{8\kappa(G-1)}{G^2}\sum_{i\in\mathcal{G}}\Big(\|C_i^{(t)}\|^2 + \|P_i^{(t)}\|^2 + \|M_i^{(t)}\|^2\Big) + \frac{8\kappa(G-1)}{G}\zeta^2, \quad (22)$$
where $\bar{g}^{(t)} = G^{-1}\sum_{i\in\mathcal{G}} g_i^{(t)}$, $C_i^{(t)} \coloneqq g_i^{(t)} - u_i^{(t)}$, $P_i^{(t)} \coloneqq u_i^{(t)} - v_i^{(t)}$, $M_i^{(t)} \coloneqq v_i^{(t)} - \nabla f_i(x^{(t)})$.

Proof. Define $H_i^{(t)} \coloneqq \nabla f_i(x^{(t)}) - \nabla f(x^{(t)})$.
We consider
$$\begin{aligned}
\sum_{i,j\in\mathcal{G}}\|g_i^{(t)} - g_j^{(t)}\|^2 &= \sum_{i,j\in\mathcal{G},\, i\neq j}\big\|C_i^{(t)} + P_i^{(t)} + M_i^{(t)} + H_i^{(t)} - C_j^{(t)} - P_j^{(t)} - M_j^{(t)} - H_j^{(t)}\big\|^2 \\
&\le 8\sum_{i,j\in\mathcal{G},\, i\neq j}\Big(\|C_i^{(t)}\|^2 + \|P_i^{(t)}\|^2 + \|M_i^{(t)}\|^2 + \|H_i^{(t)}\|^2 + \|C_j^{(t)}\|^2 + \|P_j^{(t)}\|^2 + \|M_j^{(t)}\|^2 + \|H_j^{(t)}\|^2\Big) \\
&= 16(G-1)\sum_{i\in\mathcal{G}}\Big(\|C_i^{(t)}\|^2 + \|P_i^{(t)}\|^2 + \|M_i^{(t)}\|^2 + \|H_i^{(t)}\|^2\Big) \\
&\le 16(G-1)\sum_{i\in\mathcal{G}}\Big(\|C_i^{(t)}\|^2 + \|P_i^{(t)}\|^2 + \|M_i^{(t)}\|^2\Big) + 16(G-1)G\zeta^2.
\end{aligned}$$
Since
$$\sum_{i\in\mathcal{G}}\|g_i^{(t)} - \bar{g}^{(t)}\|^2 = \frac{1}{2G}\sum_{i,j\in\mathcal{G}}\|g_i^{(t)} - g_j^{(t)}\|^2,$$
and the aggregation mechanism is $(B, \kappa)$-robust⁴, we have
$$\|g^{(t)} - \bar{g}^{(t)}\|^2 \le \frac{\kappa}{G}\sum_{i\in\mathcal{G}}\|g_i^{(t)} - \bar{g}^{(t)}\|^2 \le \frac{\kappa}{2G^2}\sum_{i,j\in\mathcal{G}}\|g_i^{(t)} - g_j^{(t)}\|^2 \le \frac{8\kappa(G-1)}{G^2}\sum_{i\in\mathcal{G}}\Big(\|C_i^{(t)}\|^2 + \|P_i^{(t)}\|^2 + \|M_i^{(t)}\|^2\Big) + \frac{8\kappa(G-1)}{G}\zeta^2.$$

⁴Note that $B$ represents the number of Byzantine workers in this context, and $n - B$ has been replaced with $G$.

Lemma F.3 (Accumulated compression error). Let Assumptions 2.1 and 2.4 be satisfied, and suppose $\mathcal{C}$ is a contractive compressor. For every $i = 1, \dots, G$, let the sequences $\{v_i^{(t)}\}_{t\ge 0}$, $\{u_i^{(t)}\}_{t\ge 0}$, and $\{g_i^{(t)}\}_{t\ge 0}$ be updated via
$$g_i^{(t)} = g_i^{(t-1)} + \mathcal{C}\big(u_i^{(t)} - g_i^{(t-1)}\big), \qquad u_i^{(t)} = u_i^{(t-1)} + \eta\big(v_i^{(t)} - u_i^{(t-1)}\big), \qquad v_i^{(t)} = v_i^{(t-1)} + \eta\big(\nabla f_i(x^{(t)}, \xi_i^{(t)}) - v_i^{(t-1)}\big).$$
Then for all $t \ge 0$ the iterates generated by Byz-DM21 in Algorithm 1 satisfy
$$\sum_{t=0}^{T-1}\mathbb{E}\big[\|C_i^{(t)}\|^2\big] = \sum_{t=0}^{T-1}\mathbb{E}\big[\|g_i^{(t)} - u_i^{(t)}\|^2\big] \le \frac{12\eta^4}{\alpha^2}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f_i(x^{(t)}) - v_i^{(t)}\|^2\big] + \frac{2\eta^4 T\sigma^2}{\alpha} + \frac{12\eta^4 L_i^2}{\alpha^2}\sum_{t=0}^{T-1}\mathbb{E}\big[\|x^{(t+1)} - x^{(t)}\|^2\big] + \frac{12\eta^2}{\alpha^2}\sum_{t=0}^{T-1}\mathbb{E}\big[\|u_i^{(t)} - v_i^{(t)}\|^2\big]. \quad (23)$$

Proof.
By the up date rules of g ( t ) i , u ( t ) i and v ( t ) i , we derive E h ∥ g ( t ) i − u ( t ) i ∥ 2 i = E h ∥ g ( t − 1) i − u ( t ) i + C ( u ( t ) i − g ( t − 1) i ) ∥ 2 i = E h E C h ∥ u ( t ) i − g ( t − 1) i − C ( u ( t ) i − g ( t − 1) i ) ∥ 2 ii ( i ) ≤ (1 − α ) E h ∥ u ( t ) i − g ( t − 1) i ∥ 2 i = (1 − α ) E h ∥ u ( t − 1) i − g ( t − 1) i + η ( v ( t ) i − u ( t − 1) i ) ∥ 2 i = (1 − α ) E h ∥ u ( t − 1) i − g ( t − 1) i + η v ( t − 1) i − u ( t − 1) i + η ( ∇ f i ( x ( t ) , ξ ( t ) i ) − v ( t − 1) i ) ∥ 2 i = (1 − α ) E h ∥ u ( t − 1) i − g ( t − 1) i + η v ( t − 1) i − u ( t − 1) i + η ( ∇ f i ( x ( t ) ) − v ( t − 1) i ) + η ( ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t ) ) ∥ 2 i = (1 − α ) E h E ξ ( t ) i h ∥ u ( t − 1) i − g ( t − 1) i + η ( v ( t − 1) i − u ( t − 1) i ) + η 2 ( ∇ f i ( x ( t ) ) − v ( t − 1) i ) + η 2 ( ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t ) ) ∥ 2 ii = (1 − α ) E h ∥ u ( t − 1) i − g ( t − 1) i + η ( v ( t − 1) i − u ( t − 1) i ) + η 2 ( ∇ f i ( x ( t ) ) − v ( t − 1) i ) ∥ 2 i + (1 − α ) η 4 E h ∥∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t ) ) ∥ 2 i , where ( i ) refers to the con tractive prop erty , as defined in Definition 2.7, and then w e use inequality (17) to obtain Double Momen tum for Byzantine Robust Learning ( i ) ≤ (1 − α )(1 + ρ ) E h ∥ u ( t − 1) i − g ( t − 1) i ∥ 2 i + η 4 σ 2 + (1 − α )(1 + ρ − 1 ) E h ∥ η ( v ( t − 1) i − u ( t − 1) i ) + η 2 ( ∇ f i ( x ( t ) ) − v ( t − 1) i ) ∥ 2 i = (1 − α )(1 + ρ ) E h ∥ u ( t − 1) i − g ( t − 1) i ∥ 2 i + (1 − α )(1 + ρ − 1 ) E h ∥ η ( v ( t − 1) i − u ( t − 1) i ) + η 2 ( ∇ f i ( x ( t − 1) ) − v ( t − 1) i ) + η 2 ( ∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) )) ∥ 2 i + η 4 σ 2 ≤ (1 − α )(1 + ρ ) E h ∥ u ( t − 1) i − g ( t − 1) i ∥ 2 i + η 4 σ 2 + 3(1 − α )(1 + ρ − 1 ) η 2 E h ∥ v ( t − 1) i − u ( t − 1) i ∥ 2 i + 3(1 − α )(1 + ρ − 1 ) η 4 E h ∥∇ f i ( x ( t − 1) ) − v ( t − 1) i ∥ 2 i + 3(1 − α )(1 + ρ − 1 ) η 4 E h ∥∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) ) 
∥ 2 i ( ii ) ≤ (1 − α )(1 + ρ ) E h ∥ u ( t − 1) i − g ( t − 1) i ∥ 2 i + η 4 σ 2 + 3(1 − α )(1 + ρ − 1 ) η 2 E h ∥ v ( t − 1) i − u ( t − 1) i ∥ 2 i + 3(1 − α )(1 + ρ − 1 ) η 4 E h ∥∇ f i ( x ( t − 1) ) − v ( t − 1) i ∥ 2 i + 3(1 − α )(1 + ρ − 1 ) η 4 L 2 i E h ∥ x ( t ) − x ( t − 1) ∥ 2 i , where ( i ) refers to the b ounded v ariance assumption, as stated in Definition 2.4, and ( ii ) leverages the smo othness prop ert y of f i ( · ) . By setting ρ = α/ 2 , we obtain the following result. (1 − α )(1 + α 2 ) = 1 − α 2 − α 2 2 ≤ 1 − α 2 , and (1 − α )(1 + ρ − 1 ) = 2 α − α − 1 ≤ 2 α . Therefore we attain E h ∥ g ( t ) i − u ( t ) i ∥ 2 i = E h ∥ g ( t − 1) i − u ( t ) i + C ( u ( t ) i − g ( t − 1) i ) ∥ 2 i ≤ 1 − α 2 E h ∥ u ( t − 1) i − g ( t − 1) i ∥ 2 i + 6 η 4 α E h ∥∇ f i ( x ( t − 1) ) − v ( t − 1) i ∥ 2 i + η 4 σ 2 + 6 η 4 L 2 i α E h ∥ x ( t ) − x ( t − 1) ∥ 2 i + 6 η 2 α E h ∥ u ( t − 1) i − v ( t − 1) i ∥ 2 i . Summing up the ab o ve inequalit y from t = 0 to t = T − 1 leads to T − 1 X t =0 E h ∥ g ( t ) i − u ( t ) i ∥ 2 i ≤ 12 η 4 α 2 T − 1 X t =0 E h ∥∇ f i ( x ( t ) ) − v ( t ) i ∥ 2 i + 2 η 4 T σ 2 α + 12 η 4 L 2 i α 2 T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + 12 η 2 α 2 E h ∥ u ( t ) i − v ( t ) i ∥ 2 i . Lemma F.4 (Accum ulated second momentum deviation) . L et Assumption 2.1 and 2.4 b e satisfie d, and supp ose 0 < η ≤ 1 . F or every i = 1 , . . . , G , let the se quenc es { v ( t ) i } t ≥ 0 and { u ( t ) i } t ≥ 0 b e up date d via v ( t ) i = (1 − η ) v ( t − 1) i + η ∇ f i ( x ( t ) , ξ ( t ) i ) u ( t ) i = u ( t − 1) i + η ( v ( t ) i − u ( t − 1) i ) . Y anghao Li, Changxin Liu † , Y uhao Yi † Then for al l t ≥ 0 the iter ates gener ate d by Byz-DM21 in A lgorithm 1 satisfy T − 1 X t =0 E h ∥ P ( t ) i ∥ 2 i = T − 1 X t =0 E h ∥ u ( t ) i − v ( t ) i ∥ 2 i ≤ 6 T − 1 X t =0 E h ∥ v ( t ) i − ∇ f i ( x ( t ) ∥ 2 i + 6 L 2 i T − 1 X t =0 E h ∥ x ( t ) − x ( t − 1) ∥ 2 i + η T σ 2 . (24) Pr o of. 
By the up date rule of u ( t ) i and v ( t ) i , we hav e E h ∥ u ( t ) i − v ( t ) i ∥ 2 i = E h ∥ u ( t − 1) i − v ( t ) i + η ( v ( t ) i − u ( t − 1) i ) ∥ 2 i = (1 − η ) 2 E h ∥ u ( t − 1) i − v ( t ) i ∥ 2 i = (1 − η ) 2 E h ∥ (1 − η ) v ( t − 1) i + η ∇ f i ( x ( t ) , ξ ( t ) i ) − u ( t − 1) i ∥ 2 i = (1 − η ) 2 E h ∥ ( v ( t − 1) i − u ( t − 1) i ) + η ( ∇ f i ( x ( t ) , ξ ( t ) i ) − v ( t − 1) i ) ∥ 2 i = (1 − η ) 2 E h ∥ ( u ( t − 1) i − v ( t − 1) i ) + η ( v ( t − 1) i − ∇ f i ( x ( t ) )) + η ( ∇ f i ( x ( t ) ) − ∇ f i ( x ( t ) , ξ ( t ) i )) ∥ 2 i = (1 − η ) 2 E h E ξ ( t ) i h ∥ u ( t − 1) i − v ( t − 1) i + η ( v ( t − 1) i − ∇ f i ( x ( t ) )) + η ( ∇ f i ( x ( t ) ) − ∇ f i ( x ( t ) , ξ ( t ) i )) ∥ 2 ii = (1 − η ) 2 E h ∥ u ( t − 1) i − v ( t − 1) i + η ( v ( t − 1) i − ∇ f i ( x ( t ) )) ∥ 2 i + η 2 E h ∥∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t ) ) ∥ 2 i ( i ) ≤ (1 − η ) 2 E h ∥ u ( t − 1) i − v ( t − 1) i + η ( v ( t − 1) i − ∇ f i ( x ( t ) )) ∥ 2 i + η 2 σ 2 ( ii ) ≤ (1 − η ) 2 (1 + ρ ) E h ∥ u ( t − 1) i − v ( t − 1) i ∥ 2 i + η 2 (1 + ρ − 1 ) E h ∥ v ( t − 1) i − ∇ f i ( x ( t ) ) ∥ 2 i + η 2 σ 2 , where ( i ) utilizes Assumption 2.4, ( ii ) holds by inequality (17) . Setting ρ = η / 2 , and b ecause of η ∈ (0 , 1] , we obtain (1 − η ) 2 (1 + η 2 ) = 1 − 3 η 2 + η 3 2 ≤ 1 − η , and η 2 (1 + 2 η ) = η 2 + 2 η ≤ 3 η . There holds E h ∥ u ( t ) i − v ( t ) i ∥ 2 i ≤ (1 − η ) E h ∥ u ( t − 1) i − v ( t − 1) i ∥ 2 i + 3 η E h ∥ v ( t − 1) i − ∇ f i ( x ( t ) ) ∥ 2 i + η 2 σ 2 ( i ) ≤ (1 − η ) E h ∥ u ( t − 1) i − v ( t − 1) i ∥ 2 i + 6 η E h ∥ v ( t − 1) i − ∇ f i ( x ( t − 1) ) ∥ 2 i + 6 η E h ∥∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) ) ∥ 2 i + η 2 σ 2 ( ii ) ≤ (1 − η ) E h ∥ u ( t − 1) i − v ( t − 1) i ∥ 2 i + 6 η E h ∥ v ( t − 1) i − ∇ f i ( x ( t − 1) ) ∥ 2 i + 6 η L 2 i E h ∥ x ( t ) − x ( t − 1) ∥ 2 i + η 2 σ 2 , where ( i ) uses inequalit y (17) , and ( ii ) uses smo othness of f i ( · ) . 
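The $\rho = \eta/2$ substitution above reduces to two elementary scalar estimates, $(1-\eta)^2(1+\eta/2) \le 1-\eta$ and $\eta^2(1+2/\eta) \le 3\eta$ for $\eta \in (0,1]$. They can be verified numerically; a minimal sketch in plain Python (the grid size is an arbitrary choice):

```python
# Check the scalar bounds used when setting rho = eta/2 in the proof:
#   (1 - eta)^2 * (1 + eta/2) <= 1 - eta   and   eta^2 * (1 + 2/eta) <= 3*eta
# over a fine grid of eta in (0, 1].
def check_momentum_bounds(steps=10_000):
    for k in range(1, steps + 1):
        eta = k / steps  # eta in (0, 1]
        lhs1 = (1 - eta) ** 2 * (1 + eta / 2)
        assert lhs1 <= (1 - eta) + 1e-12, (eta, lhs1)
        lhs2 = eta ** 2 * (1 + 2 / eta)
        assert lhs2 <= 3 * eta + 1e-12, (eta, lhs2)
    return True

print(check_momentum_bounds())  # prints True
```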
Summing up the ab ov e inequalit y from t = 0 to Double Momen tum for Byzantine Robust Learning t = T − 1 yields T − 1 X t =0 E h ∥ P ( t ) i ∥ 2 i ≤ 6 T − 1 X t =0 E h ∥ v ( t ) i − ∇ f i ( x ( t ) ∥ 2 i + 6 L 2 i T − 1 X t =0 E h ∥ x ( t ) − x ( t − 1) ∥ 2 i + 1 η T − 1 X t =0 E h ∥ P (0) i ∥ 2 i + η T σ 2 ≤ 6 T − 1 X t =0 E h ∥ v ( t ) i − ∇ f i ( x ( t ) ∥ 2 i + 6 L 2 i T − 1 X t =0 E h ∥ x ( t ) − x ( t − 1) ∥ 2 i + η T σ 2 , where P (0) i = u (0) i − v (0) i , b ecause of u (0) i = v (0) i , P (0) i = 0 . Lemma F.5 (A ccumulated momentum deviation) . L et Assumption 2.1 and 2.4 b e satisfie d, and supp ose 0 < η ≤ 1 . F or every i = 1 , . . . , G , let the se quenc es { v ( t ) i } t ≥ 0 b e up date d via v ( t ) i = (1 − η ) v ( t − 1) i + η ∇ f i ( x ( t ) , ξ ( t ) i ) . Then for al l t ≥ 0 the iter ates gener ate d by Byz-DM21 in A lgorithm 1 satisfy 1 G T − 1 X t =0 X i ∈G E h ∥ M ( t ) i ∥ 2 i = 1 G T − 1 X t =0 X i ∈G E h ∥ v ( t ) i − ∇ f i ( x ( t ) ) ∥ 2 i ≤ e L 2 η 2 T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + η T σ 2 + 1 η G X i ∈G E h ∥ v (0) i − ∇ f i ( x (0) ) ∥ 2 i , (25) and T − 1 X t =0 E h ∥ f M ( t ) i ∥ 2 i = T − 1 X t =0 E h ∥ v ( t ) − ∇ f ( x ( t ) ) ∥ 2 i ≤ L 2 η 2 T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + η T σ 2 G + 1 η E h ∥ v (0) − ∇ f G ( x (0) ) ∥ 2 i . (26) Pr o of. By the up date rule of v ( t ) i , and consider ∥ v ( t ) i − ∇ f i ( x ( t ) ) ∥ 2 = ∥ (1 − η ) v ( t − 1) i + η ∇ f i ( x ( t ) i , ξ ( t ) i ) − ∇ f i ( x ( t ) ) ∥ 2 = ∥ (1 − η )( v ( t − 1) i − ∇ f i ( x ( t ) )) + η ( ∇ f i ( x ( t ) i , ξ ( t ) i ) − ∇ f i ( x ( t ) )) ∥ 2 . 
T aking the exp ectation on b oth sides and using the law of total exp ectation, w e obtain E h ∥ v ( t ) i − ∇ f i ( x ( t ) ) ∥ 2 i = E h E ξ ( t ) i h ∥ (1 − η )( v ( t − 1) i − ∇ f i ( x ( t ) )) + η ( ∇ f i ( x ( t ) i , ξ ( t ) i ) − ∇ f i ( x ( t ) )) ∥ 2 ii , there holds E h ∥ v ( t ) i − ∇ f i ( x ( t ) ) ∥ 2 i = (1 − η ) 2 E h ∥ v ( t − 1) i − ∇ f i ( x ( t ) ) ∥ 2 i + η 2 E h ∥∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t ) ) ∥ 2 i ( i ) ≤ (1 − η ) 2 (1 + a ) E h ∥ v ( t − 1) i − ∇ f i ( x ( t − 1) ) ∥ 2 i + η 2 σ 2 + (1 − η ) 2 (1 + a − 1 ) E h ∥∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) ) ∥ 2 i . Y anghao Li, Changxin Liu † , Y uhao Yi † where ( i ) uses Assumption 2.4 and inequalit y (17) . for an y a > 0 , w e take a = η (1 − η ) − 1 and use the L -smo othness of f i ( · ) to obtain E h ∥ M ( t ) i ∥ 2 i = E h ∥ v ( t ) i − ∇ f i ( x ( t ) ) ∥ 2 i ≤ (1 − η ) E h ∥ M ( t − 1) i ∥ 2 i + L 2 i η E h ∥ x ( t ) − x ( t − 1) ∥ 2 i + η 2 σ 2 . Summing up the ab o ve inequalit y o ver all i ∈ G and from t = 0 to t = T − 1 yields 1 G T − 1 X t =0 X i ∈G E h ∥ M ( t ) i ∥ 2 i ≤ e L 2 η 2 T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + η T σ 2 + 1 η G X i ∈G E h ∥ v (0) i − ∇ f i ( x (0) ) ∥ 2 i . Using the same argumen ts, w e obtain E h ∥ v ( t ) − ∇ f G ( x ( t ) ) ∥ 2 i ≤ (1 − η ) E h ∥ v ( t − 1) − ∇ f G ( x ( t − 1) ) ∥ 2 i + L 2 η E h ∥ x ( t − 1) − x ( t ) ∥ 2 i + η 2 σ 2 G , and T − 1 X t =0 E h ∥ v ( t ) − ∇ f G ( x ( t ) ) ∥ 2 i ≤ L 2 η 2 T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + η T σ 2 G + 1 η E h ∥ v (0) − ∇ f G ( x (0) ) ∥ 2 i . F.2 Pro of of Theorem 3.1 Pr o of. By Lemma F.1, there holds, for any γ ≤ 1 / (2 L ) , f ( x ( t +1) ) ≤ f ( x ( t ) ) − γ 2 ∥∇ f ( x ( t ) ) ∥ 2 − 1 4 γ ∥ x ( t +1) − x ( t ) ∥ 2 + γ 2 ∥ g ( t ) − ∇ f ( x ( t ) ) ∥ 2 . 
(27) Summing the ab ov e from t = 0 to t = T − 1 and taking expectation, we define δ t def = E h ∇ f ( x ( t ) ) − f ( x ∗ ) i , then w e ha ve 1 T T − 1 X t =0 E h ∥∇ f ( x ( t ) ) ∥ 2 i ≤ 2 δ γ T − 1 2 γ 2 T T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + 1 T T − 1 X t =0 E h ∥ g ( t ) − ∇ f ( x ( t ) ) ∥ 2 i . In order to con trol the error b etw een g ( t ) and ∇ f ( x ( t ) ) , we decomp ose it into four terms ∥ g ( t ) − ∇ f ( x ( t ) ) ∥ 2 ≤ 4 ∥ g ( t ) − g ( t ) ∥ 2 + 4 ∥ g ( t ) − u ( t ) ∥ 2 + 4 ∥ u ( t ) − v ( t ) ∥ 2 + 4 ∥ v ( t ) − ∇ f ( x ( t ) ) ∥ 2 ≤ 4 ∥ g ( t ) − g ( t ) ∥ 2 + 4 G X i ∈G ∥ g ( t ) i − u ( t ) i ∥ 2 + 4 G X i ∈G ∥ u ( t ) i − v ( t ) i ∥ 2 + 4 ∥ v ( t ) − ∇ f ( x ( t ) ) ∥ 2 , where g = G − 1 P i ∈G g i , u = G − 1 P i ∈G u i and v = G − 1 P i ∈G v i . Double Momen tum for Byzantine Robust Learning Next, we apply the technical lemmas from the previous section to deriv e a b ound on the deviation b etw een g ( t ) and ∇ f ( x ( t ) ) . First, we in vok e Lemma F.2 to obtain the following result ∥ g ( t ) − ∇ f ( x ( t ) ) ∥ 2 ≤ 32 κ ( G − 1) G 2 X i ∈G ∥ C ( t ) i ∥ 2 + ∥ P ( t ) i ∥ 2 + ∥ M ( t ) i ∥ 2 + 32 κ ( G − 1) G ζ 2 + 4 G X i ∈G ∥ C ( t ) i ∥ 2 + 4 G X i ∈G ∥ P ( t ) i ∥ 2 + 4 ∥ f M ( t ) i ∥ 2 ≤ 4(8 κ + 1) G X i ∈G ∥ C ( t ) i ∥ 2 + 4(8 κ + 1) G X i ∈G ∥ P ( t ) i ∥ 2 + 32 κ G X i ∈G ∥ M ( t ) i ∥ 2 + 4 ∥ f M ( t ) i ∥ 2 + 32 κζ 2 , (28) where C ( t ) i = g ( t ) i − u ( t ) i , P ( t ) i = u ( t ) i − v ( t ) i , f M ( t ) i = v ( t ) i − ∇ f i ( x ( t ) ) . By summing up the ab ov e inequalit y from t = 0 to t = T − 1 , w e obtain T − 1 X t =0 ∥ g ( t ) − ∇ f ( x ( t ) ) ∥ 2 ≤ 4(8 κ + 1) G T − 1 X t =0 X i ∈G ∥ C ( t ) i ∥ 2 + 4(8 κ + 1) G T − 1 X t =0 X i ∈G ∥ P ( t ) i ∥ 2 + 32 κ G T − 1 X t =0 X i ∈G ∥ M ( t ) i ∥ 2 + 4 T − 1 X t =0 ∥ f M ( t ) i ∥ 2 + 32 κT ζ 2 . 
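The splitting of $\|g^{(t)} - \nabla f(x^{(t)})\|^2$ into four terms above is an instance of inequality (19). As a quick randomized sanity check (a sketch in plain Python; the dimension, trial count, and seed are arbitrary choices):

```python
import random

def sq_norm(v):
    return sum(x * x for x in v)

def check_four_term_bound(trials=1000, dim=16, seed=0):
    """Verify ||a+b+c+d||^2 <= 4(||a||^2+||b||^2+||c||^2+||d||^2) on random vectors."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b, c, d = ([rng.gauss(0, 1) for _ in range(dim)] for _ in range(4))
        lhs = sq_norm([ai + bi + ci + di for ai, bi, ci, di in zip(a, b, c, d)])
        rhs = 4 * (sq_norm(a) + sq_norm(b) + sq_norm(c) + sq_norm(d))
        assert lhs <= rhs + 1e-9
    return True

print(check_four_term_bound())  # prints True
```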
Next, by taking the exp ectation and using Lemma F.3, and define R ( t ) def = E h ∥ x ( t +1) − x ( t ) ∥ 2 i , we hav e T − 1 X t =0 E h ∥ g ( t ) − ∇ f ( x ( t ) ) ∥ 2 i ≤ 4(8 κ + 1) G X i ∈G 12 η 4 α 2 T − 1 X t =0 E h ∥ M ( t ) i ∥ 2 i + 2 η 4 T σ 2 α + 12 η 4 L 2 i α 2 T − 1 X t =0 E h ∥ R ( t ) ∥ 2 i + 12 η 2 α 2 E h ∥ P ( t ) i ∥ 2 i + 4(8 κ + 1) G T − 1 X t =0 X i ∈G E h ∥ P ( t ) i ∥ 2 i + 32 κ G T − 1 X t =0 X i ∈G E h ∥ M ( t ) i ∥ 2 i + 4 T − 1 X t =0 h ∥ f M ( t ) i ∥ 2 i + 32 κT ζ 2 ≤ 48 η 4 (8 κ + 1) α 2 G + 32 κ G T − 1 X t =0 X i ∈G E h ∥ M ( t ) i ∥ 2 i + 8(8 κ + 1) η 4 T σ 2 α + 48 η 4 e L 2 (8 κ + 1) α 2 T − 1 X t =0 E h ∥ R ( t ) ∥ 2 i + ( α 2 + 12 η 2 α 2 )( 4(8 κ + 1) G ) T − 1 X t =0 X i ∈G E h ∥ P ( t ) i ∥ 2 i + 4 T − 1 X t =0 E h ∥ f M ( t ) i ∥ 2 i + 32 κT ζ 2 . Y anghao Li, Changxin Liu † , Y uhao Yi † Then, by using Lemma F.4, we get T − 1 X t =0 E h ∥ g ( t ) − ∇ f ( x ( t ) ) ∥ 2 i ≤ 48 η 4 (8 κ + 1) α 2 G + 32 κ G T − 1 X t =0 X i ∈G E h ∥ M ( t ) i ∥ 2 i + 4 T − 1 X t =0 E h ∥ f M ( t ) i ∥ 2 i + 32 κT ζ 2 + 48 η 4 e L 2 (8 κ + 1) α 2 T − 1 X t =0 E h ∥ R ( t ) ∥ 2 i + 8(8 κ + 1) η 4 T σ 2 α + ( α 2 + 12 η 2 α 2 )( 4(8 κ + 1) G ) X i ∈G 6 T − 1 X t =0 E h ∥ M ( t ) i ∥ 2 i + 6 L 2 i T − 1 X t =0 E h ∥ R ( t ) ∥ 2 i + η T σ 2 ≤ (48 η 4 + 288 η 2 + 24 α 2 )(8 κ + 1) + 32 κα 2 α 2 G T − 1 X t =0 X i ∈G E h ∥ M ( t ) i ∥ 2 i + 8(8 κ + 1) η 4 T σ 2 α + 4 T − 1 X t =0 E h ∥ f M ( t ) i ∥ 2 i + ( 4(8 κ + 1)( α 2 + 12 η 2 ) α 2 ) η T σ 2 + 32 κT ζ 2 + 24 e L 2 (8 κ + 1)(12 η 4 + α 2 + 12 η 2 ) α 2 T − 1 X t =0 E h ∥ R ( t ) ∥ 2 i . 
F urthermore, by using Lemma F.5, we get T − 1 X t =0 E h ∥ g ( t ) − ∇ f ( x ( t ) ) ∥ 2 i ≤ (48 η 4 + 288 η 2 + 24 α 2 )(8 κ + 1) + 32 κα 2 α 2 e L 2 η 2 T − 1 X t =0 E h ∥ R ( t ) ∥ 2 i + η T σ 2 + 1 η G X i ∈G E h ∥ M (0) i ∥ 2 i + 8(8 κ + 1) η 4 T σ 2 α + 4 L 2 η 2 T − 1 X t =0 E h ∥ R ( t ) ∥ 2 i + 4 η T σ 2 G + 4 η E h ∥ f M (0) i ∥ 2 i + 32 κT ζ 2 + ( 4(8 κ + 1)( α 2 + 12 η 2 ) α 2 ) η T σ 2 + 24 e L 2 (8 κ + 1)(12 η 4 + α 2 + 12 η 2 ) α 2 T − 1 X t =0 E h ∥ R ( t ) ∥ 2 i ≤ e L 2 (48 η 4 + 288 η 2 + 24 α 2 )(8 κ + 1) + 32 κ e L 2 α 2 + 4 α 2 L 2 α 2 η 2 + 24 e L 2 (8 κ + 1)(12 η 4 + α 2 + 12 η 2 ) α 2 ! T − 1 X t =0 E h ∥ R ( t ) ∥ 2 i + 32 κT ζ 2 + 4 η E h ∥ f M (0) ∥ 2 i + (48 η 4 + 288 η 2 + 24 α 2 )(8 κ + 1) + 32 κα 2 η α 2 G E X i ∈G h ∥ M (0) i ∥ 2 i + (48 η 4 + 336 η 2 + 28 α 2 )(8 κ + 1) α 2 + 32 κ + 4 G + 8(8 κ + 1) η 3 α η T σ 2 . Subtracting f ( x ∗ ) from b oth sides of inequality (27) , taking exp ectation and defining δ t def = ∇ f ( x ( t ) ) − f ( x ∗ ) , we Double Momen tum for Byzantine Robust Learning deriv e E h ∥∇ f ( ˆ x ( T ) ) ∥ 2 i ≤ 2 δ γ T − A T T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + 32 κζ 2 + 4 η T E h ∥ f M (0) ∥ 2 i + (48 η 4 + 288 η 2 + 24 α 2 )(8 κ + 1) + 32 κα 2 η α 2 GT E X i ∈G h ∥ M (0) i ∥ 2 i + 48( η 4 + 7 η 2 )(8 κ + 1) α 2 + 4(64 κ + 7) + 4 G + 8(8 κ + 1) η 3 α η σ 2 , where ˆ x ( T ) is sampled uniformly at random from T iterates and A = 1 γ 2 1 2 − γ 2 e L 2 (48 η 4 + 288 η 2 + 24 α 2 )(8 κ + 1) α 2 η 2 − 4 γ 2 (8 κ e L 2 + L 2 ) η 2 − 24 γ 2 e L 2 (8 κ + 1)(12 η 4 + α 2 + 12 η 2 ) α 2 ! = 1 γ 2 1 2 − 48 γ 2 e L 2 ( η 2 + 6)(8 κ + 1) α 2 − 24 γ 2 e L 2 (8 κ + 1) η 2 − 4 γ 2 (8 κ e L 2 + L 2 ) η 2 − 24 γ 2 e L 2 (8 κ + 1)(12 η 4 + α 2 + 12 η 2 ) α 2 ! = 1 γ 2 1 2 − 48 γ 2 e L 2 ( η 2 + 6)(8 κ + 1) α 2 − 4 γ 2 ((56 κ + 6) e L 2 + L 2 ) η 2 − 24 γ 2 e L 2 (8 κ + 1)(12 η 4 + α 2 + 12 η 2 ) α 2 ! 
= 1 γ 2 1 2 − 24 γ 2 e L 2 (8 κ + 1)(12 η 4 + α 2 + 14 η 2 + 12) α 2 − 4 γ 2 ((56 κ + 6) e L 2 + L 2 ) η 2 ( i ) ≥ 1 γ 2 1 2 − 936 γ 2 e L 2 (8 κ + 1) α 2 − 4 γ 2 ((56 κ + 6) e L 2 + L 2 ) η 2 ( ii ) ≥ 0 , where ( i ) and (ii) are due to η ≤ 1 and the assumption on step-size. Finally , by using the c hoice of the momentum parameter, we derive η ≤ min ( 8 α 2 q (56 κ + 6) e L 2 + L 2 δ 0 48(8 κ + 1) σ 2 T 1 / 6 , 8 α 2 q (56 κ + 6) e L 2 + L 2 δ 0 336(8 κ + 1) σ 2 T 1 / 4 , 8 q (56 κ + 6) e L 2 + L 2 δ 0 (64 κ + 7) σ 2 T 1 / 2 , 8 q (56 κ + 6) e L 2 + L 2 δ 0 G 4 σ 2 T 1 / 2 , 8 α q (56 κ + 6) e L 2 + L 2 δ 0 8(8 κ + 1) σ 2 T 1 / 5 ) , ensures that 48 η 5 (8 κ +1) σ 2 α 2 ≤ 8 √ (56 κ +6) e L 2 + L 2 η T , 336 η 3 (8 κ +1) σ 2 α 2 ≤ 8 √ (56 κ +6) e L 2 + L 2 η T , 4 η (64 κ + 7) σ 2 ≤ 8 √ (56 κ +6) e L 2 + L 2 η T , 4 η σ 2 G ≤ 8 √ (56 κ +6) e L 2 + L 2 η T , and 8 η 4 (8 κ +1) σ 2 α ≤ 8 √ (56 κ +6) e L 2 + L 2 η T , we denote b L = q ( κ + 1) e L 2 + L 2 , then we obtain E h ∥∇ f ( ˆ x ( T ) ) ∥ 2 i ≤ (48(8 κ + 1)) 1 / 5 σ 2 / 5 b Lδ 0 α 2 / 5 T 5 / 6 + (336(8 κ + 1)) 1 / 3 σ 2 / 3 b Lδ 0 α 2 / 3 T 3 / 4 + 32 κζ 2 + Φ 0 γ T + 4(64 κ + 7) σ 2 b Lδ 0 T 1 / 2 + 16 σ 2 b Lδ 0 GT 1 / 2 + (8(8 κ + 1)) 1 / 4 σ 1 / 2 b Lδ 0 α 1 / 4 T 4 / 5 . This concludes the pro of. Y anghao Li, Changxin Liu † , Y uhao Yi † G Missing Pro ofs of Byz-VR-DM21 for General Non-Con vex F unctions Let us state the following lemma that is used in the analysis of our methods. G.1 Supp orting Lemmas Lemma G.1 (Robust aggregation error) . Using L emma F.2 and supp ose that Assumption 2.3 holds. 
Then, for al l t ≥ 0 the iter ates gener ate d by Byz-VR-DM21 in Algorithm 1 satisfy the fol lowing c ondition: ∥ g ( t ) − g ( t ) ∥ 2 ≤ κ G X i ∈G ∥ g ( t ) i − g ( t ) ∥ 2 ≤ κ 2 G 2 X i,j ∈G ∥ g ( t ) i − g ( t ) j ∥ 2 ≤ 8 κ ( G − 1) G 2 X i ∈G ∥ C ( t ) i ∥ 2 + ∥ P ( t ) i ∥ 2 + ∥ M ( t ) i ∥ 2 + 8 κ ( G − 1) G ζ 2 , where g ( t ) = G − 1 P i ∈G g i , C ( t ) i = g ( t ) i − u ( t ) i , P ( t ) i = u ( t ) i − v ( t ) i , M ( t ) i = v ( t ) i − ∇ f i ( x ( t ) ) . Lemma G.2 (Accum ulated compression error) . L et Assumption 2.1 and 2.4 b e satisfie d, and supp ose C is a c ontr active c ompr essor. F or every i = 1 , . . . , G , let the se quenc es { v ( t ) i } t ≥ 0 , { u ( t ) i } t ≥ 0 , and { g ( t ) i } t ≥ 0 b e up date d via g ( t ) i = g ( t − 1) i + C ( u ( t ) i − g ( t − 1) i ) u ( t ) i = u ( t − 1) i + η ( v ( t ) i − u ( t − 1) i ) v ( t ) i = (1 − η ) v ( t − 1) i + η ∇ f i ( x ( t ) , ξ ( t ) i ) + (1 − η )( ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t − 1) , ξ ( t ) i )) . Then for al l t ≥ 0 the iter ates gener ate d by Byz-VR-DM21 in A lgorithm 1 satisfy T − 1 X t =0 E h ∥ C ( t ) i ∥ 2 i = T − 1 X t =0 E h ∥ g ( t ) i − u ( t ) i ∥ 2 i ≤ 12 η 4 α 2 T − 1 X t =0 E h ∥∇ f i ( x ( t ) ) − v ( t ) i ∥ 2 i + 4 η 4 T σ 2 α + 4 η 2 ( ℓ 2 i + 3 α L 2 i ) α T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + 12 η 2 α 2 T − 1 X t =0 E h ∥ u ( t ) i − v ( t ) i ∥ 2 i . (29) Pr o of. W e define S ( t ) i def = ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t ) ) , S ( t ) def = 1 G X i ∈G S ( t ) i , M ( t ) i def = ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t ) ) + ∇ f i ( x ( t − 1) ) − ∇ f i ( x ( t − 1) , ξ ( t ) i ) , M ( t ) def = 1 G X i ∈G M ( t ) i . (30) Double Momen tum for Byzantine Robust Learning Then by Assumptions 2.4, we hav e E h S ( t ) i i = E h M ( t ) i i = E h S ( t ) i = E h M ( t ) i = 0 , E h ∥S ( t ) i ∥ 2 i ≤ σ 2 , E h ∥S ( t ) ∥ 2 i ≤ σ 2 G . 
(31) F urthermore, we can derive E h ∥M ( t ) ∥ 2 i = E h ∥ 1 G X i ∈G M ( t ) i ∥ 2 i = 1 G 2 E h ∥ X i ∈G M ( t ) i ∥ 2 i = 1 G 2 X i ∈G E h ∥M ( t ) i ∥ 2 i + 1 G 2 X i,j ∈G ,i = j E h ⟨M ( t ) i , M ( t ) j ⟩ i ( i ) = 1 G 2 X i ∈G E h ∥M ( t ) i ∥ 2 i + 1 G 2 X i,j ∈G ,i = j D E [ M ( t ) i ] , E [ M ( t ) j ] E = 1 G 2 X i ∈G E h ∥M ( t ) i ∥ 2 i ≤ 1 G 2 X i ∈G E h ∥∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t − 1) , ξ ( t ) i ) ∥ 2 i ≤ 1 G 2 X i ∈G ℓ 2 i E h ∥ x ( t ) − x ( t − 1) ∥ 2 i = e ℓ 2 G E h ∥ x ( t ) − x ( t − 1) ∥ 2 i . (32) Step ( i ) holds due to the conditional indep endence of M ( t ) i and M ( t ) j , while the final inequalit y follo ws from the smo othness of the sto chastic functions, as stated in Assumption 2.1. Consequently , we obtain the following result: E h ∥M ( t ) i ∥ 2 i ≤ ℓ 2 i E h ∥ x ( t ) − x ( t − 1) ∥ 2 i , E h ∥M ( t ) ∥ 2 i ≤ e ℓ 2 G E h ∥ x ( t ) − x ( t − 1) ∥ 2 i , (33) where the first inequalit y is obtained by using a similar deriv ation. 
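The cross terms vanish above for the same reason as in the variance decomposition (20), which holds exactly for any finite empirical distribution. A minimal numerical check (plain Python; the sample values are arbitrary):

```python
import random

def variance_decomposition_gap(xs, c):
    """Return |E||X-c||^2 - (E||X-EX||^2 + ||EX-c||^2)| for X uniform on xs."""
    n = len(xs)
    mean = sum(xs) / n
    lhs = sum((x - c) ** 2 for x in xs) / n
    rhs = sum((x - mean) ** 2 for x in xs) / n + (mean - c) ** 2
    return abs(lhs - rhs)

rng = random.Random(42)
xs = [rng.gauss(0.0, 2.0) for _ in range(1000)]
print(variance_decomposition_gap(xs, c=1.5) < 1e-9)  # identity holds up to float error
```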
Y anghao Li, Changxin Liu † , Y uhao Yi † Then by the up date rules of g ( t ) i , u ( t ) i and v ( t ) i , we derive E h ∥ g ( t ) i − u ( t ) i ∥ 2 i = E h ∥ g ( t − 1) i − u ( t ) i + C ( u ( t ) i − g ( t − 1) i ) ∥ 2 i = E h E C h ∥ u ( t ) i − g ( t − 1) i − C ( u ( t ) i − g ( t − 1) i ) ∥ 2 ii ( i ) ≤ (1 − α ) E h ∥ u ( t ) i − g ( t − 1) i ∥ 2 i = (1 − α ) E h ∥ u ( t − 1) i − g ( t − 1) i + η ( v ( t ) i − u ( t − 1) i ) ∥ 2 i = (1 − α ) E h ∥ u ( t − 1) i − g ( t − 1) i + η (1 − η )( v ( t − 1) i − ∇ f i ( x ( t − 1) , ξ ( t ) )) + ∇ f i ( x ( t ) , ξ ( t ) ) − u ( t − 1) i ∥ 2 i = (1 − α ) E " ∥ η v ( t − 1) i − u ( t − 1) i + η ( ∇ f i ( x ( t ) , ξ ( t ) ) − v ( t − 1) i ) + (1 − η )( ∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) )) + η (1 − η )( ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t ) ) + ∇ f i ( x ( t − 1) ) − ∇ f i ( x ( t − 1) , ξ ( t ) i )) + u ( t − 1) i − g ( t − 1) i ∥ 2 # = (1 − α ) E " ∥ η v ( t − 1) i − u ( t − 1) i + η ( ∇ f i ( x ( t ) , ξ ( t ) ) − ∇ f i ( x ( t ) )) + η ( ∇ f i ( x ( t − 1) ) − v ( t − 1) i ) + u ( t − 1) i − g ( t − 1) i + η ( ∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) )) + η (1 − η )( ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t ) ) + ∇ f i ( x ( t − 1) ) − ∇ f i ( x ( t − 1) , ξ ( t ) i )) ∥ 2 # = (1 − α ) E " E ξ ( t ) i h ∥ u ( t − 1) i − g ( t − 1) i + η ( v ( t − 1) i − u ( t − 1) i ) + η 2 S ( t ) i + η 2 ( ∇ f i ( x ( t − 1) ) − v ( t − 1) i ) + η ( ∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) )) + η (1 − η ) M ( t ) i ∥ 2 i # = (1 − α ) E " ∥ u ( t − 1) i − g ( t − 1) i + η ( v ( t − 1) i − u ( t − 1) i ) + η 2 ( ∇ f i ( x ( t − 1) ) − v ( t − 1) i ) + η ( ∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) )) ∥ 2 # + (1 − α ) E h ∥ η 2 S ( t ) i + η (1 − η ) M ( t ) i ∥ 2 i ≤ (1 − α ) E " ∥ u ( t − 1) i − g ( t − 1) i + η ( v ( t − 1) i − u ( t − 1) i ) + η 2 ( ∇ f i ( x ( t − 1) ) − v ( t − 1) i ) + η ( ∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) )) ∥ 2 # + 2(1 − α ) η 4 E h ∥S ( t ) i ∥ 2 i + 2(1 − α ) η 2 (1 − η ) 2 E h ∥M ( t ) i 
∥ 2 i ( ii ) ≤ (1 − α )(1 + ρ ) E h ∥ u ( t − 1) i − g ( t − 1) i ∥ 2 i + 2 η 4 σ 2 + 2 η 2 ℓ 2 i E h ∥ x ( t ) − x ( t − 1) ∥ 2 i + (1 − α )(1 + ρ − 1 ) E ∥ η ( v ( t − 1) i − u ( t − 1) i ) + η 2 ( ∇ f i ( x ( t − 1) ) − v ( t − 1) i ) + η ( ∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) )) ∥ 2 Double Momen tum for Byzantine Robust Learning E h ∥ g ( t ) i − u ( t ) i ∥ 2 i ( iii ) ≤ (1 − α )(1 + ρ ) E h ∥ u ( t − 1) i − g ( t − 1) i ∥ 2 i + 2 η 4 σ 2 + 2 η 2 ℓ 2 i E h ∥ x ( t ) − x ( t − 1) ∥ 2 i + 3 η 2 (1 − α )(1 + ρ − 1 ) E h ∥ v ( t − 1) i − u ( t − 1) i ∥ 2 i + 3 η 4 (1 − α )(1 + ρ − 1 ) E h ∥∇ f i ( x ( t − 1) ) − v ( t − 1) i ∥ 2 i + 3 η 2 (1 − α )(1 + ρ − 1 ) E h ∥∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) ) ∥ 2 i , where ( i ) refers to the contractiv e property , as defined in Definition 2.7, ( ii ) hold by inequality (17) for an y ρ > 0 , and ( iii ) leverages the smo othness property of f i ( · ) . Setting ρ = α/ 2 , we obtain (1 − α )(1 + α 2 ) = 1 − α 2 − α 2 2 ≤ 1 − α 2 , and (1 − α )(1 + 2 α ) = 2 α − α − 1 ≤ 2 α . There holds E h ∥ g ( t ) i − u ( t ) i ∥ 2 i = E h ∥ g ( t − 1) i − u ( t ) i + C ( u ( t ) i − g ( t − 1) i ) ∥ 2 i ≤ 1 − α 2 E h ∥ u ( t − 1) i − g ( t − 1) i ∥ 2 i + 6 η 4 α E h ∥∇ f i ( x ( t − 1) ) − v ( t − 1) i ∥ 2 i + 2 η 4 σ 2 + η 2 (2 ℓ 2 i + 6 α L 2 i ) E h ∥ x ( t ) − x ( t − 1) ∥ 2 i + 6 η 2 α E h ∥ u ( t − 1) i − v ( t − 1) i ∥ 2 i . Summing up the ab o ve inequalit y from t = 0 to t = T − 1 leads to T − 1 X t =0 E h ∥ g ( t ) i − u ( t ) i ∥ 2 i ≤ 12 η 4 α 2 T − 1 X t =0 E h ∥∇ f i ( x ( t ) ) − v ( t ) i ∥ 2 i + 4 η 4 T σ 2 α + 4 η 2 ( ℓ 2 i + 3 α L 2 i ) α T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + 12 η 2 α 2 T − 1 X t =0 E h ∥ u ( t ) i − v ( t ) i ∥ 2 i . Lemma G.3 (Accum ulated second momen tum deviation) . L et Assumption 2.1 and 2.4 b e satisfie d, and supp ose 0 < η ≤ 1 . F or every i = 1 , . . . 
, G , let the se quenc es { v ( t ) i } t ≥ 0 and { u ( t ) i } t ≥ 0 b e up date d via u ( t ) i = u ( t − 1) i + η ( v ( t ) i − u ( t − 1) i ) , v ( t ) i = (1 − η ) v ( t − 1) i + η ∇ f i ( x ( t ) , ξ ( t ) i ) + (1 − η )( ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t − 1) , ξ ( t ) i )) . Then for al l t ≥ 0 the iter ates gener ate d by Byz-VR-DM21 in A lgorithm 1 satisfy T − 1 X t =0 E h ∥ P ( t ) i ∥ 2 i = T − 1 X t =0 E h ∥ u ( t ) i − v ( t ) i ∥ 2 i ≤ 6 T − 1 X t =0 E h ∥ M ( t ) i ∥ 2 i + 2( ℓ 2 i + 2 η L 2 i ) η T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + 2 η T σ 2 . (34) Y anghao Li, Changxin Liu † , Y uhao Yi † Pr o of. By the up date rule of u ( t ) i and v ( t ) i , we hav e E h ∥ u ( t ) i − v ( t ) i ∥ 2 i = E h ∥ u ( t − 1) i − v ( t ) i + η ( v ( t ) i − u ( t − 1) i ) ∥ 2 i = (1 − η ) 2 E h ∥ u ( t − 1) i − v ( t ) i ∥ 2 i = (1 − η ) 2 E h ∥ (1 − η ) v ( t − 1) i + η ∇ f i ( x ( t ) , ξ ( t ) i ) − u ( t − 1) i + (1 − η )( ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t − 1) , ξ ( t ) i )) ∥ 2 i = (1 − η ) 2 E h ∥ ( v ( t − 1) i − u ( t − 1) i ) + η ( ∇ f i ( x ( t ) , ξ ( t ) i ) − v ( t − 1) i ) + (1 − η )( ∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) )) + (1 − η )( ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t ) ) + ∇ f i ( x ( t − 1) ) − ∇ f i ( x ( t − 1) , ξ ( t ) i )) ∥ 2 i = (1 − η ) 2 E h ∥ ( v ( t − 1) i − u ( t − 1) i ) + η ( ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t ) )) + η ( ∇ f i ( x ( t − 1) ) − v ( t − 1) i ) + ∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) ) + (1 − η ) M ( t ) i ∥ 2 i = (1 − η ) 2 E h E ξ ( t ) i h ∥ ( v ( t − 1) i − u ( t − 1) i ) + η ( ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t ) )) + η ( ∇ f i ( x ( t − 1) ) − v ( t − 1) i ) + ∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) ) + (1 − η ) M ( t ) i ∥ 2 ii = (1 − η ) 2 E h ∥ ( v ( t − 1) i − u ( t − 1) i ) + η ( ∇ f i ( x ( t − 1) ) − v ( t − 1) i ) + ( ∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) )) ∥ 2 i + (1 − η ) 2 E h ∥ η S t i + (1 − η ) M t i ∥ 2 i ( i ) ≤ (1 − η ) 2 (1 + 
ρ ) E h ∥ u ( t − 1) i − v ( t − 1) i ∥ 2 i + (1 − η ) 2 (1 + ρ − 1 ) E " ∥ η ( v ( t − 1) i − ∇ f i ( x ( t − 1) )) + ∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) ) ∥ 2 # + 2(1 − η ) 2 η 2 E h ∥S t i ∥ 2 i + 2(1 − η ) 4 E h ∥M t i ∥ 2 i ≤ (1 − η ) 2 (1 + ρ ) E h ∥ u ( t − 1) i − v ( t − 1) i ∥ 2 i + 2 η 2 (1 + ρ − 1 ) E h ∥ v ( t − 1) i − ∇ f i ( x ( t − 1) ) ∥ 2 i + 2 η 2 σ 2 + 2(1 − η ) 2 (1 + ρ − 1 ) E h ∥∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) ) ∥ 2 i + 2 ℓ 2 i E h ∥ x ( t ) − x ( t − 1) ∥ 2 i , where ( i ) uses inequality (32) , smo othness of f i ( · ) and inequality (17) for any ρ > 0 . Setting ρ = η / 2 , and b ecause of η ∈ (0 , 1] , we obtain (1 − η ) 2 (1 + η 2 ) = 1 − 3 η 2 + η 3 2 ≤ 1 − η , (1 − η ) 2 (1 + 2 η ) ≤ 1 + 2 η − η − 2 ≤ 2 η and η 2 (1 + 2 η ) = η 2 + 2 η ≤ 3 η . There holds E h ∥ u ( t ) i − v ( t ) i ∥ 2 i ≤ (1 − η ) 2 (1 + η 2 ) E h ∥ u ( t − 1) i − v ( t − 1) i ∥ 2 i + 2 η 2 (1 + 2 η ) E h ∥ v ( t − 1) i − ∇ f i ( x ( t − 1) ) ∥ 2 i + 2(1 − η ) 2 (1 + 2 η ) E h ∥∇ f i ( x ( t ) ) − ∇ f i ( x ( t − 1) ) ∥ 2 i + 2 η 2 σ 2 + 2 ℓ 2 i E h ∥ x ( t ) − x ( t − 1) ∥ 2 i ≤ (1 − η ) E h ∥ u ( t − 1) i − v ( t − 1) i ∥ 2 i + 6 η E h ∥ v ( t − 1) i − ∇ f i ( x ( t − 1) ) ∥ 2 i + 2( ℓ 2 i + 2 η L 2 i ) E h ∥ x ( t ) − x ( t − 1) ∥ 2 i + 2 η 2 σ 2 . Double Momen tum for Byzantine Robust Learning Summing up the ab o ve inequalit y from t = 0 to t = T − 1 yields T − 1 X t =0 E h ∥ P ( t ) i ∥ 2 i ≤ 6 T − 1 X t =0 E h ∥ v ( t ) i − ∇ f i ( x ( t ) ∥ 2 i + 2( ℓ 2 i + 2 η L 2 i ) η T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + 1 η T − 1 X t =0 E h ∥ P (0) i ∥ 2 i + 2 η T σ 2 ≤ 6 T − 1 X t =0 E h ∥ M ( t ) i ∥ 2 i + 2( ℓ 2 i + 2 η ) L 2 i η T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + 2 η T σ 2 , where P (0) i = u (0) i − v (0) i , b ecause of u (0) i = v (0) i , P (0) i = 0 . Lemma G.4 (A ccumulated momentum deviation) . L et Assumption 2.1 and 2.4 b e satisfie d, and supp ose 0 < η ≤ 1 . F or every i = 1 , . . . 
, G , let the se quenc es { v ( t ) i } t ≥ 0 b e up date d via v ( t ) i = (1 − η ) v ( t − 1) i + η ∇ f i ( x ( t ) , ξ ( t ) i ) + (1 − η )( ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t − 1) , ξ ( t ) i )) . Then for al l t ≥ 0 the iter ates gener ate d by Byz-VR-DM21 in A lgorithm 1 satisfy 1 G T − 1 X t =0 X i ∈G E h ∥ M ( t ) i ∥ 2 i = 1 G T − 1 X t =0 X i ∈G E h ∥ v ( t ) i − ∇ f i ( x ( t ) ) ∥ 2 i ≤ 2 e ℓ 2 η T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + 2 η T σ 2 + 1 η G X i ∈G E h ∥ v (0) i − ∇ f i ( x (0) ) ∥ 2 i , (35) and T − 1 X t =0 E h ∥ f M ( t ) i ∥ 2 i = T − 1 X t =0 E h ∥ v ( t ) − ∇ f G ( x ( t ) ) ∥ 2 i ≤ 2 e ℓ 2 η T − 1 X t =0 E h ∥ x ( t +1) − x ( t ) ∥ 2 i + 2 η T σ 2 G + 1 η E h ∥ v (0) − ∇ f G ( x (0) ) ∥ 2 i . (36) Pr o of. By the up date rule of v ( t ) i , and consider ∥ v ( t ) i − ∇ f i ( x ( t ) ) ∥ 2 = ∥ (1 − η )( v ( t − 1) i − ∇ f i ( x ( t ) )) + η ( ∇ f i ( x ( t ) i , ξ ( t ) i ) − ∇ f i ( x ( t ) )) + (1 − η )( ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t − 1) , ξ ( t ) i )) ∥ 2 = ∥ (1 − η )( v ( t − 1) i − ∇ f i ( x ( t ) )) + η ( ∇ f i ( x ( t − 1) i , ξ ( t ) i ) − ∇ f i ( x ( t ) )) + ∇ f i ( x ( t ) , ξ ( t ) i ) − ∇ f i ( x ( t − 1) , ξ ( t ) i ) ∥ 2 = ∥ (1 − η )( v ( t − 1) i − ∇ f i ( x ( t − 1) )) + η ( ∇ f i ( x ( t ) , ξ ( t ) ) − ∇ f i ( x ( t ) )) + (1 − η ) ∇ f i ( x ( t − 1) ) − ∇ f i ( x ( t − 1) , ξ ( t ) ) + ∇ f i ( x ( t ) , ξ ( t ) ) − ∇ f i ( x ( t ) ) ∥ 2 = ∥ (1 − η )( v ( t − 1) i − ∇ f ( x ( t − 1) ) + η S ( t ) i + (1 − η ) M ( t ) i ∥ 2 . T aking the exp ectation on b oth sides and using the law of total exp ectation, w e obtain E h ∥ v ( t ) i − ∇ f i ( x ( t ) ) ∥ 2 i = E h E ξ ( t ) i h ∥ (1 − η )( v ( t − 1) i − ∇ f ( x ( t − 1) ) + η S ( t ) i + (1 − η ) M ( t ) i ∥ 2 ii . 
Then, there holds
\[
\begin{aligned}
\mathbb{E}[\|v_i^{(t)}-\nabla f_i(x^{(t)})\|^2]
&= (1-\eta)^2\,\mathbb{E}[\|v_i^{(t-1)}-\nabla f_i(x^{(t-1)})\|^2] + \mathbb{E}[\|\eta\,\mathcal{S}_i^{(t)}+(1-\eta)\mathcal{M}_i^{(t)}\|^2] \\
&\le (1-\eta)\,\mathbb{E}[\|v_i^{(t-1)}-\nabla f_i(x^{(t-1)})\|^2] + 2\eta^2\,\mathbb{E}[\|\mathcal{S}_i^{(t)}\|^2] + 2\,\mathbb{E}[\|\mathcal{M}_i^{(t)}\|^2] \\
&\le (1-\eta)\,\mathbb{E}[\|v_i^{(t-1)}-\nabla f_i(x^{(t-1)})\|^2] + 2\eta^2\sigma^2 + 2\ell_i^2\,\mathbb{E}[\|x^{(t)}-x^{(t-1)}\|^2].
\end{aligned}
\]
Summing up the above inequality over all $i \in \mathcal G$ and from $t=0$ to $t=T-1$ yields
\[
\frac{1}{G}\sum_{t=0}^{T-1}\sum_{i\in\mathcal G}\mathbb{E}[\|M_i^{(t)}\|^2]
\le \frac{2\tilde\ell^2}{\eta}\sum_{t=0}^{T-1}\mathbb{E}[\|x^{(t+1)}-x^{(t)}\|^2] + 2\eta T\sigma^2 + \frac{1}{\eta G}\sum_{i\in\mathcal G}\mathbb{E}[\|v_i^{(0)}-\nabla f_i(x^{(0)})\|^2].
\]
Using the same arguments, we obtain
\[
\mathbb{E}[\|v^{(t)}-\nabla f_{\mathcal G}(x^{(t)})\|^2] \le (1-\eta)\,\mathbb{E}[\|v^{(t-1)}-\nabla f_{\mathcal G}(x^{(t-1)})\|^2] + 2\tilde\ell^2\,\mathbb{E}[\|x^{(t)}-x^{(t-1)}\|^2] + \frac{2\eta^2\sigma^2}{G}
\]
and
\[
\sum_{t=0}^{T-1}\mathbb{E}[\|v^{(t)}-\nabla f_{\mathcal G}(x^{(t)})\|^2] \le \frac{2\tilde\ell^2}{\eta}\sum_{t=0}^{T-1}\mathbb{E}[\|x^{(t+1)}-x^{(t)}\|^2] + \frac{2\eta T\sigma^2}{G} + \frac{1}{\eta}\,\mathbb{E}[\|v^{(0)}-\nabla f_{\mathcal G}(x^{(0)})\|^2].
\]

G.2 Proof of Theorem 4.1

Proof. By Lemma F.1, for any $\gamma \le 1/(2L)$ there holds
\[
f(x^{(t+1)}) \le f(x^{(t)}) - \frac{\gamma}{2}\|\nabla f(x^{(t)})\|^2 - \frac{1}{4\gamma}\|x^{(t+1)}-x^{(t)}\|^2 + \frac{\gamma}{2}\|g^{(t)}-\nabla f(x^{(t)})\|^2. \tag{37}
\]
Summing the above from $t=0$ to $t=T-1$ and taking expectation, we have
\[
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|\nabla f(x^{(t)})\|^2] \le \frac{2\delta_0}{\gamma T} - \frac{1}{2\gamma^2 T}\sum_{t=0}^{T-1}\mathbb{E}[\|x^{(t+1)}-x^{(t)}\|^2] + \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|g^{(t)}-\nabla f(x^{(t)})\|^2],
\]
where $\delta_t = \mathbb{E}[f(x^{(t)})-f(x^*)]$. We note that
\[
\begin{aligned}
\|g^{(t)}-\nabla f(x^{(t)})\|^2
&\le 4\|g^{(t)}-\bar g^{(t)}\|^2 + 4\|\bar g^{(t)}-\bar u^{(t)}\|^2 + 4\|\bar u^{(t)}-\bar v^{(t)}\|^2 + 4\|\bar v^{(t)}-\nabla f(x^{(t)})\|^2 \\
&\le 4\|g^{(t)}-\bar g^{(t)}\|^2 + \frac{4}{G}\sum_{i\in\mathcal G}\|g_i^{(t)}-u_i^{(t)}\|^2 + \frac{4}{G}\sum_{i\in\mathcal G}\|u_i^{(t)}-v_i^{(t)}\|^2 + 4\|\bar v^{(t)}-\nabla f(x^{(t)})\|^2,
\end{aligned}
\]
where $\bar g = G^{-1}\sum_{i\in\mathcal G} g_i$, $\bar u = G^{-1}\sum_{i\in\mathcal G} u_i$, and $\bar v = G^{-1}\sum_{i\in\mathcal G} v_i$.
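As a quick standalone sanity check (not part of the analysis), the three elementary scalar bounds invoked after setting $\rho = \eta/2$ in the proof of Lemma G.3 above can be verified numerically on a fine grid of $\eta \in (0,1]$:

```python
# Scalar bounds used with rho = eta/2, for eta in (0, 1]:
#   (1 - eta)^2 (1 + eta/2) <= 1 - eta
#   (1 - eta)^2 (1 + 2/eta) <= 2/eta
#   eta^2 (1 + 2/eta)       <= 3*eta
n = 100_000
ok = True
for k in range(1, n + 1):
    eta = k / n  # grid point in (0, 1]
    ok &= (1 - eta) ** 2 * (1 + eta / 2) <= 1 - eta + 1e-12
    ok &= (1 - eta) ** 2 * (1 + 2 / eta) <= 2 / eta + 1e-12
    ok &= eta ** 2 * (1 + 2 / eta) <= 3 * eta + 1e-12
assert ok
```

All three bounds also follow analytically from $\eta^2 \le \eta$ and $(1-\eta) \le 1$ on $(0,1]$.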
Next, we apply the technical lemmas from the previous section to derive a bound on the deviation between $g^{(t)}$ and $\nabla f(x^{(t)})$. First, we invoke Lemma F.2 to obtain
\[
\begin{aligned}
\|g^{(t)}-\nabla f(x^{(t)})\|^2
&\le \frac{32\kappa(G-1)}{G^2}\sum_{i\in\mathcal G}\Big(\|C_i^{(t)}\|^2 + \|P_i^{(t)}\|^2 + \|M_i^{(t)}\|^2\Big) + \frac{32\kappa(G-1)}{G}\zeta^2 \\
&\quad + \frac{4}{G}\sum_{i\in\mathcal G}\|C_i^{(t)}\|^2 + \frac{4}{G}\sum_{i\in\mathcal G}\|P_i^{(t)}\|^2 + 4\|\widetilde M^{(t)}\|^2 \\
&\le \frac{4(8\kappa+1)}{G}\sum_{i\in\mathcal G}\|C_i^{(t)}\|^2 + \frac{4(8\kappa+1)}{G}\sum_{i\in\mathcal G}\|P_i^{(t)}\|^2 + \frac{32\kappa}{G}\sum_{i\in\mathcal G}\|M_i^{(t)}\|^2 + 4\|\widetilde M^{(t)}\|^2 + 32\kappa\zeta^2,
\end{aligned}
\]
where $C_i^{(t)} = g_i^{(t)}-u_i^{(t)}$, $P_i^{(t)} = u_i^{(t)}-v_i^{(t)}$, $M_i^{(t)} = v_i^{(t)}-\nabla f_i(x^{(t)})$, and $\widetilde M^{(t)} = \bar v^{(t)}-\nabla f_{\mathcal G}(x^{(t)})$. By summing up the above inequality from $t=0$ to $t=T-1$, we obtain
\[
\sum_{t=0}^{T-1}\|g^{(t)}-\nabla f(x^{(t)})\|^2 \le \frac{4(8\kappa+1)}{G}\sum_{t=0}^{T-1}\sum_{i\in\mathcal G}\|C_i^{(t)}\|^2 + \frac{4(8\kappa+1)}{G}\sum_{t=0}^{T-1}\sum_{i\in\mathcal G}\|P_i^{(t)}\|^2 + \frac{32\kappa}{G}\sum_{t=0}^{T-1}\sum_{i\in\mathcal G}\|M_i^{(t)}\|^2 + 4\sum_{t=0}^{T-1}\|\widetilde M^{(t)}\|^2 + 32\kappa T\zeta^2.
\]
Next, by taking expectation, using Lemma G.2, and defining $R^{(t)} \overset{\text{def}}{=} x^{(t+1)}-x^{(t)}$, we have
\[
\begin{aligned}
\sum_{t=0}^{T-1}\mathbb{E}[\|g^{(t)}-\nabla f(x^{(t)})\|^2]
&\le \frac{4(8\kappa+1)}{G}\sum_{i\in\mathcal G}\Bigg(\frac{12\eta^4}{\alpha^2}\sum_{t=0}^{T-1}\mathbb{E}[\|M_i^{(t)}\|^2] + \frac{4\eta^2(\ell_i^2+3L_i^2/\alpha)}{\alpha}\sum_{t=0}^{T-1}\mathbb{E}[\|R^{(t)}\|^2] + \frac{4\eta^4 T\sigma^2}{\alpha} + \frac{12\eta^2}{\alpha^2}\sum_{t=0}^{T-1}\mathbb{E}[\|P_i^{(t)}\|^2]\Bigg) \\
&\quad + \frac{4(8\kappa+1)}{G}\sum_{t=0}^{T-1}\sum_{i\in\mathcal G}\mathbb{E}[\|P_i^{(t)}\|^2] + \frac{32\kappa}{G}\sum_{t=0}^{T-1}\sum_{i\in\mathcal G}\mathbb{E}[\|M_i^{(t)}\|^2] + 4\sum_{t=0}^{T-1}\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 32\kappa T\zeta^2 \\
&\le \Big(\frac{48\eta^4(8\kappa+1)}{\alpha^2 G} + \frac{32\kappa}{G}\Big)\sum_{t=0}^{T-1}\sum_{i\in\mathcal G}\mathbb{E}[\|M_i^{(t)}\|^2] + \frac{16(8\kappa+1)\eta^4 T\sigma^2}{\alpha} + \frac{16\eta^2(8\kappa+1)(\tilde\ell^2+3\tilde L^2/\alpha)}{\alpha}\sum_{t=0}^{T-1}\mathbb{E}[\|R^{(t)}\|^2] \\
&\quad + \Big(\frac{\alpha^2+12\eta^2}{\alpha^2}\Big)\frac{4(8\kappa+1)}{G}\sum_{t=0}^{T-1}\sum_{i\in\mathcal G}\mathbb{E}[\|P_i^{(t)}\|^2] + 4\sum_{t=0}^{T-1}\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 32\kappa T\zeta^2.
\end{aligned}
\]
Then, by using Lemma G.3, we get
\[
\begin{aligned}
\sum_{t=0}^{T-1}\mathbb{E}[\|g^{(t)}-\nabla f(x^{(t)})\|^2]
&\le \Big(\frac{48\eta^4(8\kappa+1)}{\alpha^2 G} + \frac{32\kappa}{G}\Big)\sum_{t=0}^{T-1}\sum_{i\in\mathcal G}\mathbb{E}[\|M_i^{(t)}\|^2] + 4\sum_{t=0}^{T-1}\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 32\kappa T\zeta^2 \\
&\quad + \frac{16\eta^2(8\kappa+1)(\tilde\ell^2+3\tilde L^2/\alpha)}{\alpha}\sum_{t=0}^{T-1}\mathbb{E}[\|R^{(t)}\|^2] + \frac{16(8\kappa+1)\eta^4 T\sigma^2}{\alpha} \\
&\quad + \Big(\frac{\alpha^2+12\eta^2}{\alpha^2}\Big)\frac{4(8\kappa+1)}{G}\sum_{i\in\mathcal G}\Bigg(6\sum_{t=0}^{T-1}\mathbb{E}[\|M_i^{(t)}\|^2] + \frac{2(\ell_i^2+2L_i^2/\eta)}{\eta}\sum_{t=0}^{T-1}\mathbb{E}[\|R^{(t)}\|^2] + 2\eta T\sigma^2\Bigg) \\
&\le \frac{48\eta^4(8\kappa+1) + 24(8\kappa+1)(\alpha^2+12\eta^2) + 32\kappa\alpha^2}{\alpha^2 G}\sum_{t=0}^{T-1}\sum_{i\in\mathcal G}\mathbb{E}[\|M_i^{(t)}\|^2] + \frac{16(8\kappa+1)\eta^4 T\sigma^2}{\alpha} + 4\sum_{t=0}^{T-1}\mathbb{E}[\|\widetilde M^{(t)}\|^2] \\
&\quad + \frac{8(\alpha^2+12\eta^2)(8\kappa+1)}{\alpha^2}\,\eta T\sigma^2 + 32\kappa T\zeta^2 + \Bigg(\frac{16\eta^2(8\kappa+1)(\tilde\ell^2+3\tilde L^2/\alpha)}{\alpha} + \frac{8(8\kappa+1)(\alpha^2+12\eta^2)(\tilde\ell^2+2\tilde L^2/\eta)}{\alpha^2\eta}\Bigg)\sum_{t=0}^{T-1}\mathbb{E}[\|R^{(t)}\|^2].
\end{aligned}
\]
Furthermore, by using Lemma G.4, we get
\[
\begin{aligned}
\sum_{t=0}^{T-1}\mathbb{E}[\|g^{(t)}-\nabla f(x^{(t)})\|^2]
&\le \frac{48\eta^4(8\kappa+1) + 24(8\kappa+1)(\alpha^2+12\eta^2) + 32\kappa\alpha^2}{\alpha^2}\Bigg(\frac{2\tilde\ell^2}{\eta}\sum_{t=0}^{T-1}\mathbb{E}[\|R^{(t)}\|^2] + 2\eta T\sigma^2 + \frac{1}{\eta G}\sum_{i\in\mathcal G}\mathbb{E}[\|M_i^{(0)}\|^2]\Bigg) \\
&\quad + \frac{16(8\kappa+1)\eta^4 T\sigma^2}{\alpha} + \frac{8\tilde\ell^2}{\eta}\sum_{t=0}^{T-1}\mathbb{E}[\|R^{(t)}\|^2] + \frac{8\eta T\sigma^2}{G} + \frac{4}{\eta}\,\mathbb{E}[\|\widetilde M^{(0)}\|^2] + \frac{8(\alpha^2+12\eta^2)(8\kappa+1)}{\alpha^2}\,\eta T\sigma^2 + 32\kappa T\zeta^2 \\
&\quad + \Bigg(\frac{16\eta^2(8\kappa+1)(\tilde\ell^2+3\tilde L^2/\alpha)}{\alpha} + \frac{8(8\kappa+1)(\alpha^2+12\eta^2)(\tilde\ell^2+2\tilde L^2/\eta)}{\alpha^2\eta}\Bigg)\sum_{t=0}^{T-1}\mathbb{E}[\|R^{(t)}\|^2] \\
&\le \Bigg(\frac{\tilde\ell^2\big(96\eta^4(8\kappa+1) + 48(8\kappa+1)(\alpha^2+12\eta^2) + 64\kappa\alpha^2\big)}{\alpha^2\eta} + \frac{8(8\kappa+1)(\alpha^2+12\eta^2)(\tilde\ell^2+2\tilde L^2/\eta)}{\alpha^2\eta} + \frac{8\tilde\ell^2}{\eta} \\
&\qquad\quad + \frac{16\eta^2(8\kappa+1)(\tilde\ell^2+3\tilde L^2/\alpha)}{\alpha}\Bigg)\sum_{t=0}^{T-1}\mathbb{E}[\|R^{(t)}\|^2] + 32\kappa T\zeta^2 + \frac{4}{\eta}\,\mathbb{E}[\|\widetilde M^{(0)}\|^2] \\
&\quad + \frac{48\eta^4(8\kappa+1) + 24(8\kappa+1)(\alpha^2+12\eta^2) + 32\kappa\alpha^2}{\alpha^2\eta G}\sum_{i\in\mathcal G}\mathbb{E}[\|M_i^{(0)}\|^2] \\
&\quad + \Bigg(\frac{96\eta^4(8\kappa+1) + 56(8\kappa+1)(\alpha^2+12\eta^2)}{\alpha^2} + 32\kappa + \frac{8}{G} + \frac{16\eta^3(8\kappa+1)}{\alpha}\Bigg)\eta T\sigma^2.
\end{aligned}
\]
Subtracting $f(x^*)$ from both sides of inequality (37), taking expectation, and defining $\delta_t \overset{\text{def}}{=} \mathbb{E}[f(x^{(t)})-f(x^*)]$, we derive
\[
\begin{aligned}
\mathbb{E}[\|\nabla f(\hat x^{(T)})\|^2]
&\le \frac{2\delta_0}{\gamma T} - \frac{A}{T}\sum_{t=0}^{T-1}\mathbb{E}[\|x^{(t+1)}-x^{(t)}\|^2] + 32\kappa\zeta^2 + \frac{4}{\eta T}\,\mathbb{E}[\|\widetilde M^{(0)}\|^2] \\
&\quad + \frac{48\eta^4(8\kappa+1) + 24(8\kappa+1)(\alpha^2+12\eta^2) + 32\kappa\alpha^2}{\alpha^2 T\eta G}\sum_{i\in\mathcal G}\mathbb{E}[\|M_i^{(0)}\|^2] \\
&\quad + \Bigg(\frac{96(8\kappa+1)(\eta^4+7\eta^2)}{\alpha^2} + 8(60\kappa+7) + \frac{8}{G} + \frac{16\eta^3(8\kappa+1)}{\alpha}\Bigg)\eta\sigma^2,
\end{aligned}
\]
where $\hat x^{(T)}$ is sampled uniformly at random from the $T$ iterates and
\[
\begin{aligned}
A &= \frac{1}{\gamma^2}\Bigg(\frac12 - \frac{16\gamma^2\tilde\ell^2\big(6\eta^4(8\kappa+1) + 3(8\kappa+1)(\alpha^2+12\eta^2) + 4\kappa\alpha^2\big)}{\alpha^2\eta} - \frac{8\gamma^2\tilde\ell^2}{\eta} - \frac{8\gamma^2(8\kappa+1)(\alpha^2+12\eta^2)(\tilde\ell^2+2\tilde L^2/\eta)}{\alpha^2\eta} \\
&\qquad\qquad - \frac{16\gamma^2\eta^2(8\kappa+1)(\tilde\ell^2+3\tilde L^2/\alpha)}{\alpha}\Bigg) \\
&= \frac{1}{\gamma^2}\Bigg(\frac12 - \frac{16\gamma^2\tilde\ell^2\big(6\eta^4(8\kappa+1) + 3(8\kappa+1)(\alpha^2+12\eta^2)\big)}{\alpha^2\eta} - \frac{64\gamma^2\tilde\ell^2\kappa}{\eta} - \frac{8\gamma^2\tilde\ell^2}{\eta} - \frac{8\gamma^2\tilde\ell^2(8\kappa+1)(\alpha^2+12\eta^2)}{\alpha^2\eta} \\
&\qquad\qquad - \frac{16\gamma^2\tilde L^2(8\kappa+1)(\alpha^2+12\eta^2)}{\alpha^2\eta^2} - \frac{16\gamma^2\tilde\ell^2\eta^2(8\kappa+1)}{\alpha} - \frac{48\gamma^2\tilde L^2\eta^2(8\kappa+1)}{\alpha^2}\Bigg) \\
&= \frac{1}{\gamma^2}\Bigg(\frac12 - \frac{96\gamma^2\tilde\ell^2\eta^3(8\kappa+1)}{\alpha^2} - \frac{48\gamma^2\tilde\ell^2(8\kappa+1)}{\eta} - \frac{576\eta\gamma^2\tilde\ell^2(8\kappa+1)}{\alpha^2} - \frac{64\gamma^2\tilde\ell^2\kappa}{\eta} - \frac{8\gamma^2\tilde\ell^2}{\eta} - \frac{8\gamma^2\tilde\ell^2(8\kappa+1)}{\eta} \\
&\qquad\qquad - \frac{96\eta\gamma^2\tilde\ell^2(8\kappa+1)}{\alpha^2} - \frac{16\gamma^2\tilde L^2(8\kappa+1)}{\eta^2} - \frac{192\gamma^2\tilde L^2(8\kappa+1)}{\alpha^2} - \frac{16\gamma^2\tilde\ell^2\eta^2(8\kappa+1)}{\alpha} - \frac{48\gamma^2\tilde L^2\eta^2(8\kappa+1)}{\alpha^2}\Bigg) \\
&= \frac{1}{\gamma^2}\Bigg(\frac12 - \frac{8\gamma^2(8\kappa+1)\big(12\eta^3\tilde\ell^2 + 84\eta\tilde\ell^2 + 24\tilde L^2 + 6\eta^2\tilde L^2 + 2\alpha\eta^2\tilde\ell^2\big)}{\alpha^2} - \frac{64\gamma^2\tilde\ell^2(8\kappa+1)}{\eta} - \frac{16\gamma^2\tilde L^2(8\kappa+1)}{\eta^2}\Bigg) \overset{(i)}{\ge} 0,
\end{aligned}
\]
where $(i)$ is due to $\eta \le 1$, $\alpha \le 1$, and the assumption on the step-size.
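Both this proof and the Polyak-Łojasiewicz proofs in Section H repeatedly unroll recursions of the form $a_{t+1} \le (1-q)a_t + c$, which gives $a_T \le (1-q)^T a_0 + c/q$. A standalone numerical sketch of this elementary fact (all constants hypothetical, chosen only for illustration):

```python
# Unrolling a_{t+1} <= (1 - q) a_t + c yields a_T <= (1 - q)^T a_0 + c / q,
# since c * sum_{k<T} (1 - q)^k = c (1 - (1 - q)^T) / q <= c / q.
# Check the bound against the worst case, where the recursion holds with equality.
q, c, a0 = 0.05, 0.2, 10.0
a = a0
ok = True
for t in range(1, 201):
    a = (1 - q) * a + c  # tight instance of the recursion
    ok &= a <= (1 - q) ** t * a0 + c / q + 1e-9
assert ok
```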
Finally, the choice of the momentum parameter
\[
\eta \le \min\Bigg\{
\Big(\frac{40\alpha^2\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}}{96(8\kappa+1)\sigma^2 T}\Big)^{2/11},
\Big(\frac{40\alpha^2\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}}{672(8\kappa+1)\sigma^2 T}\Big)^{2/7},
\Big(\frac{40\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}}{8(60\kappa+7)\sigma^2 T}\Big)^{2/3},
\Big(\frac{40\delta_0 G\sqrt{(8\kappa+1)\tilde\ell^2}}{8\sigma^2 T}\Big)^{2/3},
\Big(\frac{40\alpha\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}}{16(8\kappa+1)\sigma^2 T}\Big)^{2/9}
\Bigg\}
\]
ensures that
\[
\frac{96\eta^5(8\kappa+1)\sigma^2}{\alpha^2} \le \frac{40\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}}{\sqrt{\eta}\,T}, \quad
\frac{672\eta^3(8\kappa+1)\sigma^2}{\alpha^2} \le \frac{40\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}}{\sqrt{\eta}\,T}, \quad
8\eta(60\kappa+7)\sigma^2 \le \frac{40\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}}{\sqrt{\eta}\,T},
\]
\[
\frac{8\eta\sigma^2}{G} \le \frac{40\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}}{\sqrt{\eta}\,T}, \quad
\frac{16\eta^4(8\kappa+1)\sigma^2}{\alpha} \le \frac{40\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}}{\sqrt{\eta}\,T},
\]
and we obtain
\[
\begin{aligned}
\mathbb{E}[\|\nabla f(\hat x^{(T)})\|^2]
&\le \frac{\big(40\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}\big)^{10/11}\big(96(8\kappa+1)\sigma^2/\alpha^2\big)^{1/11}}{T^{10/11}}
+ \frac{\big(40\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}\big)^{6/7}\big(672(8\kappa+1)\sigma^2/\alpha^2\big)^{1/7}}{T^{6/7}} \\
&\quad + \frac{\big(40\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}\big)^{2/3}\big(8(60\kappa+7)\sigma^2\big)^{1/3}}{T^{2/3}}
+ \frac{\big(40\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}\big)^{2/3}\big(8\sigma^2/G\big)^{1/3}}{T^{2/3}} \\
&\quad + \frac{\big(40\delta_0\sqrt{(8\kappa+1)\tilde\ell^2}\big)^{8/9}\big(16(8\kappa+1)\sigma^2/\alpha\big)^{1/9}}{T^{8/9}}
+ 32\kappa\zeta^2 + \frac{\Phi_0}{\gamma T}.
\end{aligned}
\]
This concludes the proof.

H Missing Proofs for Polyak-Łojasiewicz Functions

In this section, we prove our convergence rates under the Polyak-Łojasiewicz (PŁ) condition.

H.1 Byz-DM21

Lemma H.1 (Descent lemma). Suppose that Assumptions 2.1, 2.3, and 3.6 hold. Then we have
\[
\mathbb{E}[f(x^{(t+1)})-f(x^*)] \le (1-\gamma\mu)\,\mathbb{E}[f(x^{(t)})-f(x^*)] - \Big(\frac{1}{2\gamma}-\frac{L}{2}\Big)\mathbb{E}[\|x^{(t+1)}-x^{(t)}\|^2] + \frac{\gamma}{2}\,\mathbb{E}[\|g^{(t)}-\nabla f(x^{(t)})\|^2]. \tag{38}
\]
Proof. The result follows from combining Lemma F.1 with Assumption 3.6.

Theorem H.2. Let Assumptions 2.1, 2.3, 2.4, and 3.6 hold.
Let us take $\eta \in (0,1]$ and
\[
0 < \gamma \le \min\Big\{\frac{1}{L+\sqrt{A}},\ \frac{\eta}{2\mu},\ \frac{\alpha}{4\mu}\Big\},
\quad\text{where } A = \frac{104L^2(8\kappa+1)}{\eta^2} + \frac{48L^2(8\kappa+1)(2\eta^4+28\eta^2+\alpha^2+48)}{\alpha^2}.
\]
Then for all $T \ge 0$ the iterates of Byz-DM21 satisfy
\[
\mathbb{E}[f(x^{(T)})-f(x^*)] \le (1-\gamma\mu)^T\,\mathbb{E}[\Psi^{(0)}] + \frac{4\eta\sigma^2(112\kappa G+13G+1)}{\mu G} + \frac{8\eta^3\sigma^2(8\kappa+1)(12\eta^2+\alpha\eta+156)}{\mu\alpha^2} + \frac{16\kappa\zeta^2}{\mu},
\]
where
\[
\begin{aligned}
\mathbb{E}[\Psi^{(0)}] &= \mathbb{E}[f(x^{(0)})-f(x^*)] + \frac{8\gamma(8\kappa+1)}{\alpha}\,\mathbb{E}[\|g^{(0)}-u^{(0)}\|^2] + \frac{4\gamma(8\kappa+1)(\alpha^2+24\eta^2)}{\eta\alpha^2}\,\mathbb{E}[\|u^{(0)}-v^{(0)}\|^2] \\
&\quad + \frac{16\gamma\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2\eta}\,\mathbb{E}[\|v^{(0)}-\nabla f(x^{(0)})\|^2] + \frac{4\gamma}{\eta}\,\mathbb{E}[\|v^{(0)}-\nabla f(x^{(0)})\|^2].
\end{aligned}
\]
Proof. Let $D, E, F, H \ge 0$ be given by
\[
D = \frac{8\gamma(8\kappa+1)}{\alpha}, \qquad
E = \frac{4\gamma(8\kappa+1)(\alpha^2+24\eta^2)}{\eta\alpha^2}, \qquad
F = \frac{16\gamma\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2\eta}, \qquad
H = \frac{4\gamma}{\eta}. \tag{39}
\]
Using inequality (28) and Assumption 3.6, we obtain
\[
\begin{aligned}
\mathbb{E}[f(x^{(t+1)})-f(x^*)]
&\le (1-\gamma\mu)\,\mathbb{E}[f(x^{(t)})-f(x^*)] - \Big(\frac{1}{2\gamma}-\frac{L}{2}\Big)\mathbb{E}[\|x^{(t+1)}-x^{(t)}\|^2] + \frac{\gamma}{2}\Psi^{(t)} \\
&\le (1-\gamma\mu)\,\mathbb{E}[f(x^{(t)})-f(x^*)] - \Big(\frac{1}{2\gamma}-\frac{L}{2}\Big)\mathbb{E}[\|x^{(t+1)}-x^{(t)}\|^2] + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 16\gamma\kappa\zeta^2 \\
&\quad + \frac{2\gamma(8\kappa+1)}{G}\sum_{i\in\mathcal G}\mathbb{E}[\|C_i^{(t)}\|^2] + \frac{2\gamma(8\kappa+1)}{G}\sum_{i\in\mathcal G}\mathbb{E}[\|P_i^{(t)}\|^2] + \frac{16\gamma\kappa}{G}\sum_{i\in\mathcal G}\mathbb{E}[\|M_i^{(t)}\|^2] \\
&\le (1-\gamma\mu)\,\mathbb{E}[f(x^{(t)})-f(x^*)] - \Big(\frac{1}{2\gamma}-\frac{L}{2}\Big)\mathbb{E}[\|x^{(t+1)}-x^{(t)}\|^2] + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 16\gamma\kappa\zeta^2 \\
&\quad + 2\gamma(8\kappa+1)\,\mathbb{E}[\|C^{(t)}\|^2] + 2\gamma(8\kappa+1)\,\mathbb{E}[\|P^{(t)}\|^2] + 16\gamma\kappa\,\mathbb{E}[\|M^{(t)}\|^2]. \tag{40}
\end{aligned}
\]
Let $\delta^{(t+1)} \overset{\text{def}}{=} f(x^{(t+1)})-f(x^*)$ and $R^{(t)} \overset{\text{def}}{=} x^{(t+1)}-x^{(t)}$.
By adding $D\cdot\mathbb{E}[\|C^{(t+1)}\|^2]$ and using Lemma F.3, we have
\[
\begin{aligned}
\mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2]
&\le (1-\gamma\mu)\,\mathbb{E}[\delta^{(t)}] - \Big(\frac{1}{2\gamma}-\frac{L}{2}\Big)\mathbb{E}[\|R^{(t)}\|^2] + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 16\gamma\kappa\zeta^2 \\
&\quad + 2\gamma(8\kappa+1)\,\mathbb{E}[\|C^{(t)}\|^2] + 2\gamma(8\kappa+1)\,\mathbb{E}[\|P^{(t)}\|^2] + 16\gamma\kappa\,\mathbb{E}[\|M^{(t)}\|^2] \\
&\quad + D\cdot\Bigg(\Big(1-\frac{\alpha}{2}\Big)\mathbb{E}[\|C^{(t)}\|^2] + \frac{6\eta^4}{\alpha}\,\mathbb{E}[\|M^{(t)}\|^2] + \eta^4\sigma^2 + \frac{6\eta^4 L^2}{\alpha}\,\mathbb{E}[\|R^{(t)}\|^2] + \frac{6\eta^2}{\alpha}\,\mathbb{E}[\|P^{(t)}\|^2]\Bigg).
\end{aligned}
\]
Using (39) and substituting $D = \frac{8\gamma(8\kappa+1)}{\alpha}$, we get
\[
\begin{aligned}
\mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2]
&\le (1-\gamma\mu)\,\mathbb{E}[\delta^{(t)}] - \Big(\frac{1}{2\gamma}-\frac{L}{2}-\frac{48\gamma\eta^4 L^2(8\kappa+1)}{\alpha^2}\Big)\mathbb{E}[\|R^{(t)}\|^2] + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 16\gamma\kappa\zeta^2 \\
&\quad + D\cdot\Big(1-\frac{\alpha}{4}\Big)\mathbb{E}[\|C^{(t)}\|^2] + \frac{2\gamma(8\kappa+1)(\alpha^2+24\eta^2)}{\alpha^2}\,\mathbb{E}[\|P^{(t)}\|^2] \\
&\quad + \frac{16\gamma\big(3\eta^4(8\kappa+1)+\kappa\alpha^2\big)}{\alpha^2}\,\mathbb{E}[\|M^{(t)}\|^2] + \frac{8\gamma\eta^4\sigma^2(8\kappa+1)}{\alpha}.
\end{aligned}
\]
Next, by adding $E\cdot\mathbb{E}[\|P^{(t+1)}\|^2]$ and using Lemma F.4, we have
\[
\begin{aligned}
\mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2] + E\cdot\mathbb{E}[\|P^{(t+1)}\|^2]
&\le (1-\gamma\mu)\,\mathbb{E}[\delta^{(t)}] - \Big(\frac{1}{2\gamma}-\frac{L}{2}-\frac{48\gamma\eta^4 L^2(8\kappa+1)}{\alpha^2}\Big)\mathbb{E}[\|R^{(t)}\|^2] + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 16\gamma\kappa\zeta^2 \\
&\quad + \frac{8\gamma\eta^4\sigma^2(8\kappa+1)}{\alpha} + \frac{16\gamma\big(3\eta^4(8\kappa+1)+\kappa\alpha^2\big)}{\alpha^2}\,\mathbb{E}[\|M^{(t)}\|^2] + D\cdot\Big(1-\frac{\alpha}{4}\Big)\mathbb{E}[\|C^{(t)}\|^2] \\
&\quad + \frac{2\gamma(8\kappa+1)(\alpha^2+24\eta^2)}{\alpha^2}\,\mathbb{E}[\|P^{(t)}\|^2] \\
&\quad + E\cdot\Big((1-\eta)\,\mathbb{E}[\|P^{(t)}\|^2] + 6\eta\,\mathbb{E}[\|M^{(t)}\|^2] + 6\eta L^2\,\mathbb{E}[\|R^{(t)}\|^2] + \eta^2\sigma^2\Big).
\end{aligned}
\]
Using (39) and substituting $E = \frac{4\gamma(8\kappa+1)(\alpha^2+24\eta^2)}{\eta\alpha^2}$, we attain
\[
\begin{aligned}
\mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2] + E\cdot\mathbb{E}[\|P^{(t+1)}\|^2]
&\le (1-\gamma\mu)\,\mathbb{E}[\delta^{(t)}] - \Bigg(\frac{1}{2\gamma}-\frac{L}{2}-\frac{48\gamma\eta^4 L^2(8\kappa+1)}{\alpha^2}-\frac{24\gamma L^2(8\kappa+1)(\alpha^2+24\eta^2)}{\alpha^2}\Bigg)\mathbb{E}[\|R^{(t)}\|^2] \\
&\quad + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + \frac{4\eta\gamma\sigma^2(8\kappa+1)(\alpha^2+24\eta^2)}{\alpha^2} + 16\gamma\kappa\zeta^2 + D\cdot\Big(1-\frac{\alpha}{4}\Big)\mathbb{E}[\|C^{(t)}\|^2] \\
&\quad + E\cdot\Big(1-\frac{\eta}{2}\Big)\mathbb{E}[\|P^{(t)}\|^2] + \frac{8\gamma\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2}\,\mathbb{E}[\|M^{(t)}\|^2] + \frac{8\gamma\eta^4\sigma^2(8\kappa+1)}{\alpha}.
\end{aligned}
\]
Then, by adding $F\cdot\mathbb{E}[\|M^{(t+1)}\|^2]$, using Lemma F.5, and substituting $F = \frac{16\gamma((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2)}{\alpha^2\eta}$, we obtain
\[
\begin{aligned}
&\mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2] + E\cdot\mathbb{E}[\|P^{(t+1)}\|^2] + F\cdot\mathbb{E}[\|M^{(t+1)}\|^2] \\
&\quad\le (1-\gamma\mu)\,\mathbb{E}[\delta^{(t)}] + D\cdot\Big(1-\frac{\alpha}{4}\Big)\mathbb{E}[\|C^{(t)}\|^2] + E\cdot\Big(1-\frac{\eta}{2}\Big)\mathbb{E}[\|P^{(t)}\|^2] + \frac{4\eta\gamma\sigma^2(8\kappa+1)(\alpha^2+24\eta^2+2\alpha\eta^3)}{\alpha^2} \\
&\qquad + F\cdot\Big(1-\frac{\eta}{2}\Big)\mathbb{E}[\|M^{(t)}\|^2] - \Bigg(\frac{1}{2\gamma}-\frac{L}{2}-\frac{24\gamma L^2(8\kappa+1)(2\eta^4+\alpha^2+24\eta^2)}{\alpha^2}-\frac{16\gamma L^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2\eta^2}\Bigg)\mathbb{E}[\|R^{(t)}\|^2] \\
&\qquad + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 16\gamma\kappa\zeta^2 + \frac{16\gamma\eta\sigma^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2}.
\end{aligned}
\]
Furthermore, by adding $H\cdot\mathbb{E}[\|\widetilde M^{(t+1)}\|^2]$, using Lemma F.5, and substituting $H = \frac{4\gamma}{\eta}$, we arrive at
\[
\begin{aligned}
&\mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2] + E\cdot\mathbb{E}[\|P^{(t+1)}\|^2] + F\cdot\mathbb{E}[\|M^{(t+1)}\|^2] + H\cdot\mathbb{E}[\|\widetilde M^{(t+1)}\|^2] \\
&\quad\le (1-\gamma\mu)\,\mathbb{E}[\delta^{(t)}] + D\cdot\Big(1-\frac{\alpha}{4}\Big)\mathbb{E}[\|C^{(t)}\|^2] + E\cdot\Big(1-\frac{\eta}{2}\Big)\mathbb{E}[\|P^{(t)}\|^2] + \frac{4\eta\gamma\sigma^2(8\kappa+1)(\alpha^2+24\eta^2+2\alpha\eta^3)}{\alpha^2} + 16\gamma\kappa\zeta^2 \\
&\qquad - \Bigg(\frac{1}{2\gamma}-\frac{L}{2}-\frac{4\gamma L^2}{\eta^2}-\frac{24\gamma L^2(8\kappa+1)(2\eta^4+\alpha^2+24\eta^2)}{\alpha^2}-\frac{16\gamma L^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2\eta^2}\Bigg)\mathbb{E}[\|R^{(t)}\|^2] \\
&\qquad + F\cdot\Big(1-\frac{\eta}{2}\Big)\mathbb{E}[\|M^{(t)}\|^2] + \frac{16\gamma\eta\sigma^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2} + H\cdot\Big(1-\frac{\eta}{2}\Big)\mathbb{E}[\|\widetilde M^{(t)}\|^2] + \frac{4\gamma\eta\sigma^2}{G}.
\end{aligned}
\]
Finally, we require the coefficient of $\mathbb{E}[\|R^{(t)}\|^2]$ to be nonnegative:
\[
\frac{1}{2\gamma}-\frac{L}{2}-\frac{4\gamma L^2}{\eta^2}-\frac{24\gamma L^2(8\kappa+1)(2\eta^4+\alpha^2+24\eta^2)}{\alpha^2}-\frac{16\gamma L^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2\eta^2} \ge 0.
\]
Let
\[
A \overset{\text{def}}{=} \frac{8L^2}{\eta^2} + \frac{48L^2(8\kappa+1)(2\eta^4+\alpha^2+24\eta^2)}{\alpha^2} + \frac{32L^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2\eta^2}
= \frac{104L^2(8\kappa+1)}{\eta^2} + \frac{48L^2(8\kappa+1)(2\eta^4+28\eta^2+\alpha^2+48)}{\alpha^2}.
\]
Taking $0 < \gamma \le \frac{1}{L+\sqrt{A}}$ and applying Lemma E.1 gives $\frac{1}{2\gamma}-\frac{L}{2}-\frac{\gamma A}{2} \ge 0$. Let
\[
\mathbb{E}[\Psi^{(t+1)}] \overset{\text{def}}{=} \mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2] + E\cdot\mathbb{E}[\|P^{(t+1)}\|^2] + F\cdot\mathbb{E}[\|M^{(t+1)}\|^2] + H\cdot\mathbb{E}[\|\widetilde M^{(t+1)}\|^2],
\]
and use the assumptions on $\eta$ and $\alpha$ to establish that $1-\frac{\alpha}{4} \le 1-\gamma\mu$ and $1-\frac{\eta}{2} \le 1-\gamma\mu$.
Applying the inequality iteratively gives
\[
\begin{aligned}
\mathbb{E}[\Psi^{(T)}]
&\le (1-\gamma\mu)^T\,\mathbb{E}[\Psi^{(0)}] + \frac{4\eta\sigma^2(8\kappa+1)(\alpha^2+24\eta^2+2\alpha\eta^3)}{\mu\alpha^2} + \frac{16\eta\sigma^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\mu\alpha^2} + \frac{16\kappa\zeta^2}{\mu} + \frac{4\eta\sigma^2}{\mu G} \\
&\le (1-\gamma\mu)^T\,\mathbb{E}[\Psi^{(0)}] + \frac{4\eta\sigma^2(8\kappa+1)}{\mu} + \frac{96\eta^3\sigma^2(8\kappa+1)}{\mu\alpha^2} + \frac{8\eta^4\sigma^2(8\kappa+1)}{\mu\alpha} + \frac{32\eta\sigma^2\kappa}{\mu} + \frac{16\kappa\zeta^2}{\mu} + \frac{4\eta\sigma^2}{\mu G} \\
&\qquad + \frac{48\eta\sigma^2(8\kappa+1)}{\mu} + \frac{96\eta\sigma^2(8\kappa+1)(12\eta^2+\eta^4)}{\mu\alpha^2} \\
&\le (1-\gamma\mu)^T\,\mathbb{E}[\Psi^{(0)}] + \frac{4\eta\sigma^2(112\kappa G+13G+1)}{\mu G} + \frac{8\eta^3\sigma^2(8\kappa+1)(12\eta^2+\alpha\eta+156)}{\mu\alpha^2} + \frac{16\kappa\zeta^2}{\mu}.
\end{aligned}
\]
Noting that $\mathbb{E}[\Psi^{(T)}] \ge \mathbb{E}[f(x^{(T)})-f(x^*)]$, we finish the proof.

Corollary H.3. Suppose that the assumptions from Theorem H.2 hold and the momentum parameter satisfies
\[
\eta \le \min\Bigg\{\frac{\mu\varepsilon G}{(G(\kappa+1)+1)\sigma^2},\ \frac{\mu\alpha^2\varepsilon^{1/3}}{(\kappa+1)\sigma^2},\ \frac{\mu\alpha\varepsilon^{1/4}}{(\kappa+1)\sigma^2}\Bigg\}.
\]
Then Algorithm 1 needs
\[
T = \widetilde{\mathcal O}\Bigg(\frac{(G(\kappa+1)+1)\sigma^2}{\mu\varepsilon G} + \frac{(\kappa+1)\sigma^2}{\mu\alpha^2\varepsilon^{1/3}} + \frac{(\kappa+1)\sigma^2}{\mu\alpha\varepsilon^{1/4}} + \frac{L}{\mu} + \frac{L\sigma^2(G(\kappa+1)+1)\sqrt{\kappa+1}}{\mu^2\varepsilon G}\Bigg) \tag{41}
\]
communication rounds to get an $\varepsilon$-solution.

Proof. Considering the choice of $\eta$, we have
\[
\frac{1}{\mu}\Bigg(\frac{4\eta(112\kappa G+13G+1)}{G} + \frac{8\eta^3(8\kappa+1)(12\eta^2+\alpha\eta+156)}{\alpha^2}\Bigg)\sigma^2 = \mathcal O(\varepsilon),
\]
which guarantees that $\mathbb{E}[f(x^{(T)})-f(x^*)] \le \varepsilon$ for $\varepsilon \ge \frac{32\kappa\zeta^2}{\mu}$. Therefore, it is sufficient to take the number of communication rounds equal to (41) to get an $\varepsilon$-solution.

H.2 Byz-VR-DM21

Theorem H.4. Let Assumptions 2.1, 2.3, 2.4, and 3.6 hold.
Let us take $\eta \in (0,1]$ and
\[
0 < \gamma \le \min\Big\{\frac{1}{L+\sqrt{A}},\ \frac{\eta}{2\mu},\ \frac{\alpha}{4\mu}\Big\},
\quad\text{where } A = 32\Bigg(\frac{(8\kappa+1)(L^2+7\eta\ell^2)}{\eta^2} + \frac{(8\kappa+1)\big(L^2(3\eta^2+24)+\ell^2(12\eta^3+\alpha\eta^2+156\eta)\big)}{\alpha^2}\Bigg).
\]
Then for all $T \ge 0$ the iterates of Byz-VR-DM21 satisfy
\[
\mathbb{E}[f(x^{(T)})-f(x^*)] \le (1-\gamma\mu)^T\,\mathbb{E}[\Psi^{(0)}] + \frac{16\eta^3\sigma^2(8\kappa+1)(12\eta^2+\alpha\eta+156)}{\mu\alpha^2} + \frac{8\eta\sigma^2(108\kappa G+13G+1)}{\mu G} + \frac{16\kappa\zeta^2}{\mu},
\]
where
\[
\begin{aligned}
\mathbb{E}[\Psi^{(0)}] &= \mathbb{E}[f(x^{(0)})-f(x^*)] + \frac{8\gamma(8\kappa+1)}{\alpha}\,\mathbb{E}[\|g^{(0)}-u^{(0)}\|^2] + \frac{4\gamma(8\kappa+1)(\alpha^2+24\eta^2)}{\eta\alpha^2}\,\mathbb{E}[\|u^{(0)}-v^{(0)}\|^2] \\
&\quad + \frac{16\gamma\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2\eta}\,\mathbb{E}[\|v^{(0)}-\nabla f(x^{(0)})\|^2] + \frac{4\gamma}{\eta}\,\mathbb{E}[\|v^{(0)}-\nabla f(x^{(0)})\|^2].
\end{aligned}
\]
Proof. Let $D, E, F, H \ge 0$ be as before:
\[
D = \frac{8\gamma(8\kappa+1)}{\alpha}, \qquad
E = \frac{4\gamma(8\kappa+1)(\alpha^2+24\eta^2)}{\eta\alpha^2}, \qquad
F = \frac{16\gamma\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2\eta}, \qquad
H = \frac{4\gamma}{\eta}. \tag{42}
\]
Using inequality (40), we obtain
\[
\begin{aligned}
\mathbb{E}[f(x^{(t+1)})-f(x^*)]
&\le (1-\gamma\mu)\,\mathbb{E}[f(x^{(t)})-f(x^*)] - \Big(\frac{1}{2\gamma}-\frac{L}{2}\Big)\mathbb{E}[\|x^{(t+1)}-x^{(t)}\|^2] + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 16\gamma\kappa\zeta^2 \\
&\quad + 2\gamma(8\kappa+1)\,\mathbb{E}[\|C^{(t)}\|^2] + 2\gamma(8\kappa+1)\,\mathbb{E}[\|P^{(t)}\|^2] + 16\gamma\kappa\,\mathbb{E}[\|M^{(t)}\|^2]. \tag{43}
\end{aligned}
\]
Let $\delta^{(t+1)} \overset{\text{def}}{=} f(x^{(t+1)})-f(x^*)$ and $R^{(t)} \overset{\text{def}}{=} x^{(t+1)}-x^{(t)}$. By adding $D\cdot\mathbb{E}[\|C^{(t+1)}\|^2]$ and using Lemma G.2, we have
\[
\begin{aligned}
\mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2]
&\le (1-\gamma\mu)\,\mathbb{E}[\delta^{(t)}] - \Big(\frac{1}{2\gamma}-\frac{L}{2}\Big)\mathbb{E}[\|R^{(t)}\|^2] + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 16\gamma\kappa\zeta^2 \\
&\quad + 2\gamma(8\kappa+1)\,\mathbb{E}[\|C^{(t)}\|^2] + 2\gamma(8\kappa+1)\,\mathbb{E}[\|P^{(t)}\|^2] + 16\gamma\kappa\,\mathbb{E}[\|M^{(t)}\|^2] \\
&\quad + D\cdot\Bigg(\Big(1-\frac{\alpha}{2}\Big)\mathbb{E}[\|C^{(t)}\|^2] + \frac{6\eta^4}{\alpha}\,\mathbb{E}[\|M^{(t)}\|^2] + 2\eta^4\sigma^2 + 2\eta^2\Big(\frac{3L^2}{\alpha}+\ell^2\Big)\mathbb{E}[\|R^{(t)}\|^2] + \frac{6\eta^2}{\alpha}\,\mathbb{E}[\|P^{(t)}\|^2]\Bigg)
\end{aligned}
\]
which, upon substituting $D = \frac{8\gamma(8\kappa+1)}{\alpha}$, gives
\[
\begin{aligned}
\mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2]
&\le (1-\gamma\mu)\,\mathbb{E}[\delta^{(t)}] - \Bigg(\frac{1}{2\gamma}-\frac{L}{2}-\frac{16\gamma\eta^2(8\kappa+1)(3L^2/\alpha+\ell^2)}{\alpha}\Bigg)\mathbb{E}[\|R^{(t)}\|^2] + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 16\gamma\kappa\zeta^2 \\
&\quad + D\cdot\Big(1-\frac{\alpha}{4}\Big)\mathbb{E}[\|C^{(t)}\|^2] + \frac{2\gamma(8\kappa+1)(\alpha^2+24\eta^2)}{\alpha^2}\,\mathbb{E}[\|P^{(t)}\|^2] + \frac{16\gamma\big(3\eta^4(8\kappa+1)+\kappa\alpha^2\big)}{\alpha^2}\,\mathbb{E}[\|M^{(t)}\|^2] + \frac{16\gamma\eta^4\sigma^2(8\kappa+1)}{\alpha}.
\end{aligned}
\]
Next, by adding $E\cdot\mathbb{E}[\|P^{(t+1)}\|^2]$, using Lemma G.3, and substituting $E = \frac{4\gamma(8\kappa+1)(\alpha^2+24\eta^2)}{\eta\alpha^2}$, we get
\[
\begin{aligned}
&\mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2] + E\cdot\mathbb{E}[\|P^{(t+1)}\|^2] \\
&\quad\le (1-\gamma\mu)\,\mathbb{E}[\delta^{(t)}] - \Bigg(\frac{1}{2\gamma}-\frac{L}{2}-\frac{16\gamma\eta^2(8\kappa+1)(3L^2/\alpha+\ell^2)}{\alpha}\Bigg)\mathbb{E}[\|R^{(t)}\|^2] + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 16\gamma\kappa\zeta^2 + \frac{16\gamma\eta^4\sigma^2(8\kappa+1)}{\alpha} \\
&\qquad + \frac{16\gamma\big(3\eta^4(8\kappa+1)+\kappa\alpha^2\big)}{\alpha^2}\,\mathbb{E}[\|M^{(t)}\|^2] + D\cdot\Big(1-\frac{\alpha}{4}\Big)\mathbb{E}[\|C^{(t)}\|^2] + \frac{2\gamma(8\kappa+1)(\alpha^2+24\eta^2)}{\alpha^2}\,\mathbb{E}[\|P^{(t)}\|^2] \\
&\qquad + E\cdot\Bigg((1-\eta)\,\mathbb{E}[\|P^{(t)}\|^2] + 6\eta\,\mathbb{E}[\|M^{(t)}\|^2] + 2\Big(\frac{2L^2}{\eta}+\ell^2\Big)\mathbb{E}[\|R^{(t)}\|^2] + 2\eta^2\sigma^2\Bigg) \\
&\quad\le (1-\gamma\mu)\,\mathbb{E}[\delta^{(t)}] + 16\gamma\kappa\zeta^2 + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] \\
&\qquad - \Bigg(\frac{1}{2\gamma}-\frac{L}{2}-\frac{16\gamma\eta^2(8\kappa+1)(3L^2/\alpha+\ell^2)}{\alpha}-\frac{8\gamma(8\kappa+1)(2L^2/\eta+\ell^2)(\alpha^2+24\eta^2)}{\eta\alpha^2}\Bigg)\mathbb{E}[\|R^{(t)}\|^2] \\
&\qquad + D\cdot\Big(1-\frac{\alpha}{4}\Big)\mathbb{E}[\|C^{(t)}\|^2] + E\cdot\Big(1-\frac{\eta}{2}\Big)\mathbb{E}[\|P^{(t)}\|^2] + \frac{8\eta\gamma\sigma^2(8\kappa+1)(\alpha^2+24\eta^2)}{\alpha^2} \\
&\qquad + \frac{8\gamma\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2}\,\mathbb{E}[\|M^{(t)}\|^2] + \frac{16\gamma\eta^4\sigma^2(8\kappa+1)}{\alpha}.
\end{aligned}
\]
Then, by adding $F\cdot\mathbb{E}[\|M^{(t+1)}\|^2]$, using Lemma G.4, and substituting $F = \frac{16\gamma((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2)}{\alpha^2\eta}$, we obtain
\[
\begin{aligned}
&\mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2] + E\cdot\mathbb{E}[\|P^{(t+1)}\|^2] + F\cdot\mathbb{E}[\|M^{(t+1)}\|^2] \\
&\quad\le (1-\gamma\mu)\,\mathbb{E}[\delta^{(t)}] + D\cdot\Big(1-\frac{\alpha}{4}\Big)\mathbb{E}[\|C^{(t)}\|^2] + E\cdot\Big(1-\frac{\eta}{2}\Big)\mathbb{E}[\|P^{(t)}\|^2] + \frac{8\eta\gamma\sigma^2(8\kappa+1)(\alpha^2+24\eta^2+2\alpha\eta^3)}{\alpha^2} \\
&\qquad + F\cdot\Big(1-\frac{\eta}{2}\Big)\mathbb{E}[\|M^{(t)}\|^2] - \Bigg(\frac{1}{2\gamma}-\frac{L}{2}-\frac{16\gamma\eta^2(8\kappa+1)(3L^2/\alpha+\ell^2)}{\alpha}-\frac{8\gamma(8\kappa+1)(2L^2/\eta+\ell^2)(\alpha^2+24\eta^2)}{\eta\alpha^2} \\
&\qquad\qquad -\frac{32\gamma\ell^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\eta\alpha^2}\Bigg)\mathbb{E}[\|R^{(t)}\|^2] \\
&\qquad + 2\gamma\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 16\gamma\kappa\zeta^2 + \frac{32\gamma\eta\sigma^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2},
\end{aligned}
\]
where we used $\mathbb{E}[\|M^{(t+1)}\|^2] \le (1-\eta)\,\mathbb{E}[\|M^{(t)}\|^2] + 2\ell^2\,\mathbb{E}[\|R^{(t)}\|^2] + 2\eta^2\sigma^2$. Furthermore, by adding $H\cdot\mathbb{E}[\|\widetilde M^{(t+1)}\|^2]$, using Lemma G.4, and substituting $H = \frac{4\gamma}{\eta}$, we arrive at
\[
\begin{aligned}
&\mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2] + E\cdot\mathbb{E}[\|P^{(t+1)}\|^2] + F\cdot\mathbb{E}[\|M^{(t+1)}\|^2] + H\cdot\mathbb{E}[\|\widetilde M^{(t+1)}\|^2] \\
&\quad\le (1-\gamma\mu)\,\mathbb{E}[\delta^{(t)}] + D\cdot\Big(1-\frac{\alpha}{4}\Big)\mathbb{E}[\|C^{(t)}\|^2] + E\cdot\Big(1-\frac{\eta}{2}\Big)\mathbb{E}[\|P^{(t)}\|^2] + \frac{8\eta\gamma\sigma^2(8\kappa+1)(\alpha^2+24\eta^2+2\alpha\eta^3)}{\alpha^2} \\
&\qquad - \Bigg(\frac{1}{2\gamma}-\frac{L}{2}-\frac{16\gamma\eta^2(8\kappa+1)(3L^2/\alpha+\ell^2)}{\alpha}-\frac{8\gamma(8\kappa+1)(2L^2/\eta+\ell^2)(\alpha^2+24\eta^2)}{\eta\alpha^2}-\frac{8\gamma\ell^2}{\eta} \\
&\qquad\qquad -\frac{32\gamma\ell^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\eta\alpha^2}\Bigg)\mathbb{E}[\|R^{(t)}\|^2] \\
&\qquad + F\cdot\Big(1-\frac{\eta}{2}\Big)\mathbb{E}[\|M^{(t)}\|^2] + 16\gamma\kappa\zeta^2 + \frac{32\gamma\eta\sigma^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\alpha^2} + H\cdot\Big(1-\frac{\eta}{2}\Big)\mathbb{E}[\|\widetilde M^{(t)}\|^2] + \frac{8\gamma\eta\sigma^2}{G},
\end{aligned}
\]
where we used $\mathbb{E}[\|\widetilde M^{(t+1)}\|^2] \le (1-\eta)\,\mathbb{E}[\|\widetilde M^{(t)}\|^2] + 2\ell^2\,\mathbb{E}[\|R^{(t)}\|^2] + \frac{2\eta^2\sigma^2}{G}$ and $2\gamma = H\cdot\frac{\eta}{2}$. Finally, we require the coefficient of $\mathbb{E}[\|R^{(t)}\|^2]$ to be nonnegative:
\[
\frac{1}{2\gamma}-\frac{L}{2}-\frac{8\gamma\ell^2}{\eta}-\frac{16\gamma\eta^2(8\kappa+1)(3L^2/\alpha+\ell^2)}{\alpha}-\frac{8\gamma(8\kappa+1)(2L^2/\eta+\ell^2)(\alpha^2+24\eta^2)}{\eta\alpha^2}-\frac{32\gamma\ell^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\eta\alpha^2} \ge 0.
\]
Let
\[
\begin{aligned}
A &\overset{\text{def}}{=} \frac{16\ell^2}{\eta} + \frac{16(8\kappa+1)(2L^2/\eta+\ell^2)(\alpha^2+24\eta^2)}{\eta\alpha^2} + \frac{64\ell^2\big((8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)+2\kappa\alpha^2\big)}{\eta\alpha^2} + \frac{32\eta^2(8\kappa+1)(3L^2/\alpha+\ell^2)}{\alpha} \\
&= 16\Bigg(\frac{\ell^2}{\eta} + \frac{(8\kappa+1)(2L^2/\eta+\ell^2)}{\eta} + \frac{24\eta(8\kappa+1)(2L^2/\eta+\ell^2)}{\alpha^2} + \frac{8\kappa\ell^2}{\eta} + \frac{4\ell^2(8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)}{\eta\alpha^2} + \frac{2\eta^2(8\kappa+1)(3L^2/\alpha+\ell^2)}{\alpha}\Bigg) \\
&= 16\Bigg(\frac{2(8\kappa+1)\ell^2}{\eta} + \frac{2(8\kappa+1)L^2}{\eta^2} + \frac{24\eta(8\kappa+1)\ell^2}{\alpha^2} + \frac{48(8\kappa+1)L^2}{\alpha^2} + \frac{12(8\kappa+1)\ell^2}{\eta} + \frac{4\ell^2(8\kappa+1)(72\eta+6\eta^3)}{\alpha^2} + \frac{6\eta^2(8\kappa+1)L^2}{\alpha^2} + \frac{2\eta^2(8\kappa+1)\ell^2}{\alpha}\Bigg) \\
&= 16\Bigg(\frac{(8\kappa+1)(2L^2+14\eta\ell^2)}{\eta^2} + \frac{2(8\kappa+1)\big((156\eta+12\eta^3+\alpha\eta^2)\ell^2 + (3\eta^2+24)L^2\big)}{\alpha^2}\Bigg).
\end{aligned}
\]
Taking $0 < \gamma \le \frac{1}{L+\sqrt{A}}$ and applying Lemma E.1 gives $\frac{1}{2\gamma}-\frac{L}{2}-\frac{\gamma A}{2} \ge 0$.
Let
\[
\mathbb{E}[\Psi^{(t+1)}] \overset{\text{def}}{=} \mathbb{E}[\delta^{(t+1)}] + D\cdot\mathbb{E}[\|C^{(t+1)}\|^2] + E\cdot\mathbb{E}[\|P^{(t+1)}\|^2] + F\cdot\mathbb{E}[\|M^{(t+1)}\|^2] + H\cdot\mathbb{E}[\|\widetilde M^{(t+1)}\|^2],
\]
and use the assumptions on $\eta$ and $\alpha$ to establish that $1-\frac{\alpha}{4} \le 1-\gamma\mu$ and $1-\frac{\eta}{2} \le 1-\gamma\mu$. Applying the inequality iteratively gives
\[
\begin{aligned}
\mathbb{E}[\Psi^{(T)}]
&\le (1-\gamma\mu)^T\,\mathbb{E}[\Psi^{(0)}] + \frac{8\eta\sigma^2(8\kappa+1)(\alpha^2+24\eta^2+2\alpha\eta^3)}{\mu\alpha^2} + \frac{32\eta\sigma^2(8\kappa+1)(3\alpha^2+72\eta^2+6\eta^4)}{\mu\alpha^2} + \frac{64\eta\sigma^2\kappa}{\mu} + \frac{16\kappa\zeta^2}{\mu} + \frac{8\eta\sigma^2}{\mu G} \\
&\le (1-\gamma\mu)^T\,\mathbb{E}[\Psi^{(0)}] + \frac{8\eta\sigma^2(8\kappa+1)}{\mu} + \frac{16\eta^3\sigma^2(8\kappa+1)(\alpha\eta+12)}{\mu\alpha^2} + \frac{96\eta\sigma^2(8\kappa+1)}{\mu} + \frac{64\eta\sigma^2\kappa}{\mu} + \frac{192\eta^3\sigma^2(8\kappa+1)(\eta^2+12)}{\mu\alpha^2} + \frac{16\kappa\zeta^2}{\mu} + \frac{8\eta\sigma^2}{\mu G} \\
&\le (1-\gamma\mu)^T\,\mathbb{E}[\Psi^{(0)}] + \frac{8\eta\sigma^2(108\kappa G+13G+1)}{\mu G} + \frac{16\eta^3\sigma^2(8\kappa+1)(12\eta^2+\alpha\eta+156)}{\mu\alpha^2} + \frac{16\kappa\zeta^2}{\mu}.
\end{aligned}
\]
Noting that $\mathbb{E}[\Psi^{(T)}] \ge \mathbb{E}[f(x^{(T)})-f(x^*)]$, we finish the proof.

Corollary H.5. Suppose that the assumptions from Theorem H.4 hold and the momentum parameter satisfies
\[
\eta \le \min\Bigg\{\frac{\mu G\varepsilon}{(G(\kappa+1)+1)\sigma^2},\ \frac{\mu\alpha^2\varepsilon^{1/3}}{(\kappa+1)\sigma^2},\ \frac{\mu\alpha\varepsilon^{1/4}}{(\kappa+1)\sigma^2}\Bigg\}.
\]
Then Algorithm 1 needs
\[
T = \widetilde{\mathcal O}\Bigg(\frac{(G(\kappa+1)+1)\sigma^2}{\mu\varepsilon G} + \frac{(\kappa+1)\sigma^2}{\mu\alpha^2\varepsilon^{1/3}} + \frac{(\kappa+1)\sigma^2}{\mu\alpha\varepsilon^{1/4}} + \frac{L}{\mu} + \frac{\sigma\sqrt{(\ell^2+L^2)(G(\kappa+1)+1)(\kappa+1)}}{\mu^{3/2}\varepsilon^{1/2}G^{1/2}}\Bigg) \tag{44}
\]
communication rounds to get an $\varepsilon$-solution.

Proof. Considering the choice of $\eta$, we have
\[
\frac{1}{\mu}\Bigg(\frac{8\eta(108\kappa G+13G+1)}{G} + \frac{16\eta^3(8\kappa+1)(12\eta^2+\alpha\eta+156)}{\alpha^2}\Bigg)\sigma^2 = \mathcal O(\varepsilon),
\]
which guarantees that $\mathbb{E}[f(x^{(T)})-f(x^*)] \le \varepsilon$ for $\varepsilon \ge \frac{32\kappa\zeta^2}{\mu}$. Therefore, it is sufficient to take the number of communication rounds equal to (44) to get an $\varepsilon$-solution.
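To illustrate the estimator analyzed throughout this appendix, the following minimal sketch (all names, constants, and the test problem are hypothetical, not the paper's implementation) runs the per-worker double-momentum update $v^{(t)} = (1-\eta)v^{(t-1)} + \eta\nabla f_i(x^{(t)},\xi^{(t)}) + (1-\eta)(\nabla f_i(x^{(t)},\xi^{(t)}) - \nabla f_i(x^{(t-1)},\xi^{(t)}))$ from Lemma G.4 on a one-dimensional noisy quadratic and checks that $v$ tracks the true gradient:

```python
import random

random.seed(0)

def grad(x, xi):
    # Stochastic gradient of f(x) = x^2 / 2; the true gradient is x, xi is zero-mean noise.
    return x + xi

eta, step, sigma = 0.1, 0.1, 0.1
x_prev = 3.0
v = grad(x_prev, random.gauss(0.0, sigma))  # v^(0) from a single stochastic gradient
x = x_prev
for t in range(300):
    x = x - step * v                  # descent step driven by the estimator
    xi = random.gauss(0.0, sigma)     # the SAME sample is used at x^(t) and x^(t-1)
    v = (1 - eta) * v + eta * grad(x, xi) + (1 - eta) * (grad(x, xi) - grad(x_prev, xi))
    x_prev = x

# The tracking error v - grad f(x) contracts at rate (1 - eta) with noise injected
# only at level eta, so it settles at order eta * sigma rather than sigma.
assert abs(v - x) < 0.2
assert abs(x) < 0.5
```

The correction term $(1-\eta)(\nabla f_i(x^{(t)},\xi^{(t)}) - \nabla f_i(x^{(t-1)},\xi^{(t)}))$ reuses one sample at both points, which is what lets the recursion $\|v^{(t)} - \nabla f_i(x^{(t)})\|$ contract without large batches.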