Stability and Generalization of Push-Sum Based Decentralized Optimization over Directed Graphs
Yifei Liang (liangyf65@mail2.sysu.edu.cn)
School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen, Guangdong 518107, China

Yan Sun (sun9899@uni.sydney.edu.au)
The University of Sydney, Sydney, NSW 2006, Australia

Xiaochun Cao (caoxiaochun@mail.sysu.edu.cn)
School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen, Guangdong 518107, China

Li Shen (shenli6@mail.sysu.edu.cn)
School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen, Guangdong 518107, China

Abstract

Push-Sum-based decentralized learning enables optimization over directed communication networks, where information exchange may be asymmetric. While the convergence properties of such methods are well understood, their finite-iteration stability and generalization behavior remain unclear due to the structural bias induced by column-stochastic mixing and asymmetric error propagation. In this work, we develop a unified uniform-stability framework for the Stochastic Gradient Push (SGP) algorithm that captures the effect of directed topology. A key technical ingredient is an imbalance-aware consistency bound for Push-Sum, which controls the consensus deviation through two quantities: the stationary distribution imbalance parameter δ and the spectral gap (1 − λ) governing the mixing speed. This decomposition enables us to disentangle statistical effects from topology-induced bias. We establish finite-iteration stability and optimization guarantees for both convex objectives and non-convex objectives satisfying the Polyak–Łojasiewicz condition. For convex problems, SGP attains an excess generalization error of order Õ(1/√(mn) + γ/(δ(1 − λ)) + γ) under suitable step-size schedules, and we characterize the corresponding optimal early stopping time that minimizes this bound.
For PL objectives, we obtain convex-like optimization and generalization rates whose dominant dependence is proportional to κ(1 + 1/(δ(1 − λ))), revealing a multiplicative coupling between problem conditioning and the directed communication topology. Our analysis clarifies when Push-Sum correction is necessary compared with standard decentralized SGD and quantifies how imbalance and mixing jointly shape the best attainable learning performance. Experiments on logistic regression and image classification benchmarks under common network topologies validate the theoretical findings.

Keywords: Generalization Analysis, Algorithmic Stability, Push-Sum, Distributed Learning.

© 2026 Yifei Liang, Yan Sun, Xiaochun Cao, and Li Shen. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.

1 Introduction

Decentralized Learning (DL) has become a standard paradigm for large-scale machine learning due to its advantages in privacy preservation (Cyffers and Bellet, 2022), training efficiency (Lian et al., 2017), and system robustness (Neglia et al., 2019). By distributing computation and data across multiple nodes, DL avoids the need for centralized data aggregation and naturally supports collaborative training in resource-constrained or privacy-sensitive environments. Classical decentralized algorithms, such as Decentralized SGD (D-SGD) (Lian et al., 2017; Koloskova et al., 2020), typically assume undirected communication graphs, where information exchange between nodes is symmetric and averaging operations are unbiased. In many practical systems, however, communication is inherently directed, and information flow may be one-way due to heterogeneous transmission power, asymmetric bandwidth, or packet loss (Figure 1).

Figure 1: Comparison of symmetric and asymmetric topologies.

The Push-Sum protocol (Kempe et al.
, 2003) was introduced to achieve average consensus over directed graphs and was later extended to distributed optimization under asymmetric communication (Tsianos et al., 2012; Nedic et al., 2016; Bénézit et al., 2010). Building on this idea, Assran et al. (2019) proposed Stochastic Gradient Push (SGP), which combines Push-Sum normalization with parallel SGD and enables decentralized learning over directed networks. While the convergence properties of Push-Sum-based algorithms have been studied (Nedic et al., 2016; Assran et al., 2019), their stability and generalization behavior at finite iterations remain largely unexplored in theory, especially in realistic learning settings.

Understanding generalization in directed decentralized learning is particularly challenging because communication asymmetry introduces structural bias that cannot be treated as a small perturbation. The main technical difficulty arises from the use of column-stochastic mixing matrices. Unlike doubly stochastic averaging, column-stochastic communication does not preserve symmetry and instead induces a non-uniform stationary distribution. As a consequence, consensus errors may persist throughout training and interact with the stochastic gradient noise in a nontrivial way over time. These effects lead to additional instability mechanisms that are absent in centralized SGD (Hardt et al., 2016) and are not captured by existing decentralized stability analyses developed for undirected D-SGD (Sun et al., 2021; Le Bars et al., 2023). In particular, it remains unclear how imbalance and slow mixing jointly shape the excess risk and whether Push-Sum correction fundamentally changes the generalization behavior compared with standard decentralized methods.
This motivates the following questions: (i) under what conditions is Push-Sum correction necessary compared with standard decentralized SGD, and (ii) how does directed network topology influence the generalization error and excess risk?

To address these questions, we develop a unified analysis of the stability and optimization behavior of SGP over directed communication topologies. Our objective is to quantify precisely how directed mixing affects generalization and excess risk at finite iterations. To provide the necessary structural context, we first distinguish balanced and imbalanced communication graphs. By a classical result (Olfati-Saber and Murray, 2004), in balanced graphs aggregation is unbiased and the dynamics reduce to the standard undirected setting; in imbalanced directed graphs, only column-stochastic mixing is possible, which induces a non-uniform stationary distribution and structural bias. Motivated by this distinction, we characterize directed mixing matrices through their stationary distributions and introduce an explicit measure of topological imbalance, denoted by δ (Definition 5). This parameter measures the deviation of the stationary distribution from uniformity and provides a concrete criterion for when unbiased aggregation is achievable. In particular, when δ = 1/m, the network is balanced and aggregation is effectively unbiased; when δ < 1/m, Push-Sum normalization is necessary to eliminate the structural bias. This characterization directly addresses question (i) by identifying the regime in which Push-Sum correction is required.

An important technical ingredient in our analysis is a refined consistency bound for Push-Sum (Lemma 9), which controls the deviation between local iterates and their network average under directed mixing.
Building on this decomposition, we disentangle two distinct topological effects: the spectral gap (1 − λ), which governs the mixing speed, and the imbalance factor δ, which captures asymmetry in aggregation (Section 3.2). By separating these quantities, we derive stability bounds (Section 4) that explicitly show how directed topology enters the excess risk through the factor 1/(δ(1 − λ)). This provides a quantitative answer to question (ii), clarifying how imbalance and slow mixing jointly affect generalization.

Lastly, we establish finite-iteration optimization guarantees for both convex and non-convex (including Polyak–Łojasiewicz) objectives under constant and diminishing step-size schedules. Combining the stability and optimization results yields explicit bounds on the excess generalization error and reveals a trade-off between optimization accuracy and algorithmic stability. In particular, we characterize the optimal early stopping time that minimizes the excess risk and derive the corresponding minimal achievable generalization error. Our analysis shows that directed topology influences not only convergence behavior but also the best attainable excess risk, through the imbalance parameter δ and the spectral gap (1 − λ). Together, these results provide a unified understanding of when directed communication degrades learning performance and when its effect reduces to that of standard undirected decentralized SGD.

Our work builds on uniform stability theory, which has been widely used to relate algorithmic sensitivity to generalization performance (Devroye and Wagner, 1979; McAllester, 1999) (Section 3.4). For centralized SGD, Hardt et al. (2016) derived stability guarantees in convex settings. These results were later extended to decentralized scenarios, including D-SGD (Sun et al., 2021) and asynchronous variants (Deng et al., 2023).
However, existing decentralized stability analyses typically assume symmetric communication and rely on doubly stochastic mixing matrices. They therefore do not isolate the role of stationary distribution imbalance in directed networks. By explicitly introducing the imbalance parameter δ and separating it from the spectral gap (1 − λ), our framework generalizes these results to directed communication and identifies the additional instability induced by asymmetric mixing. This distinction explains why directed decentralized learning may exhibit fundamentally different generalization behavior from its undirected counterpart.

Our Contributions

• Topology characterization and Push-Sum criterion. We analyze directed communication from a Markov chain perspective and introduce an explicit imbalance parameter δ, which characterizes when unbiased aggregation is achievable and when Push-Sum correction is necessary in directed networks, especially under practical asymmetric connectivity.

• Stability, optimization, and excess risk analysis. We establish unified finite-iteration bounds on the uniform stability and optimization error of SGP over directed graphs in convex settings. By combining these results, we derive guarantees on the excess generalization error and identify the optimal early stopping time that minimizes it. Our analysis disentangles the roles of the spectral gap (1 − λ) and the imbalance factor δ, and shows that the minimal excess risk decomposes into a statistical term and a topology-dependent bias term. A detailed comparison of rates is provided in Table 1.

• Non-convex and PL analysis. For general non-convex objectives, we characterize how directed mixing amplifies instability under constant step sizes and yields polynomial growth under diminishing schedules. Under the Polyak–Łojasiewicz condition, we obtain convex-like optimization rates with constants that depend explicitly on κ/(δ(1 − λ)), highlighting the coupling between problem conditioning and network topology.

• Empirical validation. We validate our theoretical predictions on logistic regression (a9a) and image classification (CIFAR-10 with LeNet) in Section 5, illustrating the influence of topology and step-size schedules in practice across diverse directed network settings.

Table 1: Comparison of stability (ϵ_stab) and optimization error (ϵ_opt) for distributed learning algorithms. These theoretical results hold after T iterations across m distributed nodes, each processing n training samples, where C_λ ≍ 1/(1 − λ) (spectral-gap constant), δ (asymmetry constant), d (input dimension), C_{w0} (initial-point constant), v (learning-rate coefficient), and L (smoothness).

Algorithm | Setting | Learning Rate | ϵ_stab | ϵ_opt
C-SGD (Sun et al., 2023) | Non-Convex | γ_t = O(1/t) | O((1/(mn)) T^{vL/(1+vL)}) | O(1/(nT))
D-SGD (Sun et al., 2021) | Convex | γ_t = γ | O((1/(mn) + C_λ) T) | O((1 + C_λ)/T)
D-SGD (Sun et al., 2021) | Convex | γ_t = O(1/t) | O((1/(mn) + C_λ) ln T) | O((1 + C_λ)/ln T)
D-SGD (Sun et al., 2021) | Non-Convex | γ_t = O(1/t) | O((C_λ + 1/(mn)) T^{vL/(1+vL)}) | –
SGP (Ours) | Convex | γ_t = γ | O((1/(mn) + C_λ/δ) T) | O((1 + C_λ C_{w0}/δ)/T)
SGP (Ours) | Convex | γ_t = O(1/t) | O((1/(mn)) ln T) | O((1 + C_λ C_{w0}/δ)/ln T)
SGP (Ours) | Non-Convex | γ_t = γ | O((1/(mn) + C_λ(C_{w0}+1)/δ) exp(LγT)) | O((1 + C_λ C_{w0}/δ)/T)
SGP (Ours) | Non-Convex | γ_t = O(1/t) | O(((C_{w0}+1)/(δmn)) T^{(1+vL)/(2+vL)} + (C_λ/δ) T^{vL/(2+vL)}) | O((1 + C_λ C_{w0}/δ)/ln T)

2 Related Work

Decentralized Learning over Directed Graphs. Decentralized learning over directed graphs builds upon foundational work in stochastic approximation (Robbins and Monro, 1951) and distributed online prediction (Agarwal and Duchi, 2011; Dekel et al., 2012). Algorithms such as D-SGD (Koloskova et al., 2020) achieve linear speedup on symmetric networks (Lian et al., 2017) but fail on directed graphs because asymmetric weight matrices break the doubly stochastic property required for consensus.
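This failure can be made concrete on a toy digraph. The sketch below is our own illustration (the 3-node edge set and the uniform out-degree weighting are assumed for the example, not taken from the paper's experiments): it builds the natural push-style weight matrix and confirms that it is column-stochastic but not doubly stochastic.

```python
import numpy as np

# Hypothetical 3-node directed graph: a ring 0 -> 1 -> 2 -> 0 plus one extra
# edge 0 -> 2.  Node j splits its message equally among itself and its
# out-neighbors, the usual push-style weighting on directed graphs.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
m = 3

out_deg = np.zeros(m, dtype=int)
for i, _ in edges:
    out_deg[i] += 1

P = np.zeros((m, m))
for j in range(m):
    share = 1.0 / (out_deg[j] + 1)      # +1 accounts for the self-loop
    P[j, j] = share
    for i, k in edges:
        if i == j:
            P[k, j] = share

print(P.sum(axis=0))   # [1. 1. 1.]  -> column-stochastic
print(P.sum(axis=1))   # rows do not all sum to 1 -> not doubly stochastic
```

Because the rows do not sum to one, plain neighborhood averaging no longer preserves the uniform average of the iterates, which is exactly the invariance that D-SGD's consensus analysis relies on.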
Early theoretical support came from consensus analysis on switching topologies (Olfati-Saber and Murray, 2004), asynchronous optimization theory (Tsitsiklis et al., 1986), and the study of nonhomogeneous Markov chains. The Push-Sum protocol (Kempe et al., 2003) resolved the asymmetry challenge by introducing auxiliary variables alongside column-stochastic matrices, enabling exact average consensus without knowledge of the network size or out-degrees. This idea was extended to weighted gossip with general stochastic matrices (Bénézit et al., 2010), to convex optimization via distributed dual averaging (Tsianos et al., 2012), and to time-varying directed graphs by Nedić and Olshevsky (Nedić and Olshevsky, 2015; Nedić et al., 2016), yielding a convergence rate of O(ln t/√t). Stochastic Gradient Push (Nedić and Olshevsky, 2016) improved this to O(ln t/t) for strongly convex objectives. To escape the sublinear convergence inherent in Push-Sum-based methods, gradient tracking emerged: EXTRA (Shi et al., 2015) achieved linear convergence on undirected graphs by tracking the average gradient, while DEXTRA (Xi and Khan, 2017) and ExtraPush (Zeng and Yin, 2017) adapted this mechanism to directed networks using constant step sizes. The Push-Pull framework (Pu et al., 2020) unified these ideas through a dual-matrix architecture, employing row-stochastic matrices to pull iterates and column-stochastic matrices to push gradients, thereby separating the consensus and optimization dynamics. This design achieves linear speedup for non-convex stochastic problems and remains stable under unidirectional communication without additional nonlinear corrections. In deep learning, Stochastic Gradient Push (Assran et al.
, 2019) combines Push-Sum with stochastic gradients to ensure consensus while preserving SGD's convergence rate; its asynchronous extension (Assran and Rabbat, 2020) tolerates communication delays. Recent progress includes quantized communication (Taheri et al., 2020), personalized federated learning via directed partial gradient push (Liu et al., 2024) or asymmetric topologies (Li et al., 2023), and B-ary tree structures for heterogeneous data (You and Pu, 2024). Ongoing research explores adaptive edge weighting, sporadic gradient tracking, generalized smoothness conditions, and topology learning informed by semantic structure, enhancing decentralized optimization in dynamic environments.

Stability and Generalization. Algorithmic stability offers a principled way to bound the generalization error of learning algorithms (Bousquet and Elisseeff, 2002; Elisseeff et al., 2005; Shalev-Shwartz et al., 2010), building on earlier frameworks such as the VC dimension (Blumer et al., 1989), Valiant's PAC learning model, and Rademacher complexity (Bartlett and Mendelson, 2002). Bousquet and Elisseeff (2002) showed that hypothesis stability suffices for generalization, and Hardt et al. (2016) applied this to stochastic gradient descent, proving that uniform stability degrades with more iterations under constant step sizes, which loosens generalization bounds over time and suggests that faster training or early stopping can improve generalization. Later work introduced data-dependent stability (Kuzborskij and Lampert, 2018), on-average stability with convergence-aware analysis (Charles and Papailiopoulos, 2018; Lei and Ying, 2020), refined bounds under weaker assumptions (Bassily et al., 2020), and extensions to nonsmooth or adversarial settings (Xiao et al., 2022; Deng et al., 2024).
In distributed learning, these ideas extend to multi-agent systems, yielding generalization comparable to centralized SGD under doubly stochastic mixing (Sun et al., 2021) and topology-dependent bounds that incorporate spectral gaps (Zhu et al., 2022). Le Bars et al. (2023) exploited the row-stochastic property of mixing matrices, which preserves the global average of the iterates, to transfer centralized stability arguments to decentralized settings without explicit dependence on the graph structure. Push-Sum and related methods for directed graphs, however, rely on column-stochastic matrices. The average-preserving invariance no longer holds because of asymmetry and the dynamic imbalance correction through auxiliary variables. Consequently, stability analyses designed for row-stochastic networks do not apply: local iterates accumulate bias, and weighted averaging breaks uniform stability across agents. This gap calls for new analytical tools tailored to column-stochastic networks. Current efforts aim to develop robust generalization guarantees that account for evolving weights and imbalance in fully directed and unbalanced communication settings.

3 Preliminaries

This section establishes the mathematical framework of our analysis. We formulate the distributed optimization problem in Subsection 3.1, describe the key topological properties of the communication graph in Subsection 3.2, and present the Stochastic Gradient Push (SGP) algorithm and its consensus dynamics in Subsection 3.3. Finally, Subsection 3.4 introduces the stability and generalization metrics, distinguishing the dynamics used for analysis from the final output model used for evaluation.

3.1 Problem Formulation

Consider a decentralized system consisting of m nodes, indexed by V = {1, . . . , m}. Each node i operates on an input space X_i ⊆ R^d and an output space Y_i ⊆ R.
The sample space is denoted by S_i = X_i × Y_i, where data samples are drawn independently and identically (i.i.d.) from a distribution D_i. The overall sample space is the union S = ∪_{i=1}^m S_i, with each node possessing a dataset of size n. The global goal is to collaboratively learn an optimal parameter w* ∈ R^d that minimizes the expected global risk F(w), given as the mean of the local expected risks:

min_{w ∈ R^d} F(w) = (1/m) Σ_{i=1}^m F_i(w),  F_i(w) = E_{ξ∼D_i}[f(w; ξ)].  (1)

Here, f(w; ξ) denotes the loss incurred on a sample ξ. Because the distributions D_i are unknown in practice, we rely on the Empirical Risk Minimization (ERM) framework to approximate the expected risk using observed data. With S_i = {ξ_{i,1}, . . . , ξ_{i,n}} as the local dataset, the global empirical risk F_S(w) and its minimizer w*_S are defined as:

F_S(w) := (1/(mn)) Σ_{i=1}^m Σ_{ζ=1}^n f(w; ξ_{i,ζ}) = (1/m) Σ_{i=1}^m F_{S_i}(w),  w*_S := argmin_w F_S(w).  (2)

3.2 Communication Topology and Structural Properties

The agents communicate over a strongly connected directed graph G = (V, E). A directed edge (j, i) ∈ E indicates that node i receives information from node j. The in- and out-neighbor sets of node i are N_i^in := {j | (j, i) ∈ E} ∪ {i} and N_i^out := {j | (i, j) ∈ E} ∪ {i}, and d_i := |{j | (i, j) ∈ E}| denotes the out-degree of node i (excluding the self-loop). The mixing matrix P ∈ R^{m×m} is defined as:

[P]_ij = 1/(d_j + 1), if j ∈ N_i^in;  0, otherwise,  (3)

which ensures column-stochasticity, i.e., 1^⊤ P = 1^⊤. The performance of distributed learning algorithms is influenced by the underlying communication topology. As discussed above, general directed graphs admit only column-stochastic mixing matrices. An important exception is the class of balanced directed graphs, defined as follows.
Definition 1 (Balanced Graph (Olfati-Saber and Murray, 2004)) A directed graph G is balanced if and only if the in-degree equals the out-degree for every node i ∈ V, i.e., |N_i^in| = |N_i^out| holds for all nodes i in the vertex set V.

Such graphs admit the following fundamental characterization.

Lemma 2 (Property of Balanced Graphs (Olfati-Saber and Murray, 2004)) There exists a non-negative matrix P compatible with G that is doubly stochastic (satisfying P1 = 1 and 1^⊤ P = 1^⊤) if and only if the graph G is balanced.

From a Markov chain perspective, the information exchange induced by the communication matrix P can be described as a random walk over the graph topology: w^(t+1) = P w^(t), where w^(t) denotes the model parameters held by the agents at iteration t. The matrix P is assumed to be nonnegative, column-stochastic, and primitive, consistent with the mixing matrices introduced above. Its long-term behavior is characterized by the Perron–Frobenius theorem (Horn and Johnson, 2012; Meyer, 2023):

lim_{t→∞} P^t = π 1^⊤,

where π ∈ R^m_{++} is the unique stationary distribution, satisfying Pπ = π and ∥π∥_1 = 1, which characterizes the limiting influence that each agent contributes to the aggregation process. The structure of the communication graph determines π, giving two representative regimes.

Balanced graph: If the underlying graph admits a doubly stochastic mixing matrix, then the stationary distribution is uniform, i.e., π = (1/m)1. In this regime, traditional D-SGD maintains unbiased optimization, as each node contributes equally to the global average.

Unbalanced graph: When the in-degree |N_i^in| and out-degree |N_i^out| of some node i differ, the matrix P is column-stochastic but not row-stochastic (P1 ≠ 1), yielding a non-uniform stationary distribution π.
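These two regimes can be checked numerically. The sketch below is our own illustration (the 4-node ring and the extra edges out of node 0 are assumed examples): it builds the mixing matrix of Eq. (3), estimates the stationary distribution π by power iteration, and reports the minimum entry of π together with the second largest eigenvalue modulus λ.

```python
import numpy as np

def mixing_matrix(edges, m):
    """Column-stochastic P of Eq. (3): node j sends the share 1/(d_j + 1)
    to itself and to each of its d_j out-neighbors."""
    P = np.zeros((m, m))
    out = [[] for _ in range(m)]
    for i, j in edges:
        out[i].append(j)
    for j in range(m):
        share = 1.0 / (len(out[j]) + 1)
        P[j, j] = share
        for k in out[j]:
            P[k, j] = share
    return P

def stationary(P, iters=5000):
    """Power iteration for the Perron vector: P pi = pi, ||pi||_1 = 1."""
    pi = np.ones(P.shape[0]) / P.shape[0]
    for _ in range(iters):
        pi = P @ pi
    return pi

m = 4
ring = [(i, (i + 1) % m) for i in range(m)]   # balanced: in-degree = out-degree
unbal = ring + [(0, 2), (0, 3)]               # node 0 pushes to all other nodes

for name, edges in [("ring", ring), ("unbalanced", unbal)]:
    P = mixing_matrix(edges, m)
    pi = stationary(P)
    lam = np.sort(np.abs(np.linalg.eigvals(P)))[-2]   # SLEM (Definition 3)
    print(f"{name:10s} pi = {np.round(pi, 3)}  min pi = {pi.min():.3f}  lambda = {lam:.3f}")
```

On the ring every node has equal authority (min_i [π]_i = 1/m = 0.25), while the extra out-edges of node 0 make π non-uniform (here min_i [π]_i = 0.125 < 1/m), the regime in which plain averaging over-weights some agents and Push-Sum correction becomes necessary.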
This imbalance introduces systematic bias into the aggregation step of decentralized learning. Understanding the distinction between balanced and unbalanced graphs is essential for characterizing how network topology influences distributed optimization algorithms. Figure 2 illustrates the classification of several commonly used decentralized network topologies.

Figure 2: Classification of Balanced and Unbalanced Graphs.

To characterize topology and balance, we introduce two fundamental parameters.

Definition 3 (Spectral Gap (Montenegro and Tetali, 2006)) Let σ(P) denote the spectrum of the primitive matrix P. The second largest eigenvalue modulus (SLEM) is defined as

λ := max{|µ| : µ ∈ σ(P), µ ≠ 1}.  (4)

It holds that 0 < λ < 1.

Remark 4 The parameter λ characterizes the convergence rate of a Markov chain to its stationary distribution. For an irreducible and aperiodic chain with transition matrix P, the total variation distance after k steps decays as O(λ^k). Hence, the spectral gap 1 − λ measures the mixing speed: smaller λ (larger gap) implies faster mixing. This property is crucial in gossip algorithms, decentralized averaging, and consensus protocols, where λ governs the exponential rate at which information homogenizes across nodes. The same analysis applies to doubly stochastic matrices, which preserve the uniform stationary distribution and are widely used in distributed systems for fairness and symmetry. Sharp bounds on λ for such matrices and their implications for algorithmic convergence are discussed in (Sun et al., 2021; Zhu et al., 2022).

Definition 5 (Topological Imbalance) By the Perron–Frobenius theorem, there exists a unique stationary distribution vector π ∈ R^m_{++} such that Pπ = π and ∥π∥_1 = 1. The topological imbalance parameter δ is defined as

δ := min_{1≤i≤m} [π]_i.
(5)

Remark 6 The topological imbalance parameter δ ∈ (0, 1/m] measures the agents' influence on the global state under repeated application of the communication matrix P. The entries of the stationary distribution π represent each agent's relative asymptotic authority: agents with larger [π]_i dominate the consensus value, while those with smaller [π]_i contribute less and propagate information more slowly. For doubly stochastic matrices, the distribution is uniform and δ = 1/m, achieving perfect balance. In highly unbalanced topologies (e.g., directed graphs with large out-degree disparities), δ → 0. Smaller δ implies greater difficulty in reaching consensus, and the error bounds typically grow as 1/δ.

3.3 Algorithm: Stochastic Gradient Push (SGP)

In unbalanced topologies, standard D-SGD is ineffective because the stationary distribution π is non-uniform, introducing systematic bias. To mitigate this, earlier works (Kempe et al., 2003; Tsianos et al., 2012) employ the Push-Sum protocol for approximate averaging. SGP (Assran et al., 2019) builds on Push-Sum by running two parallel message-passing processes. Node i keeps a proxy vector w_i^(t) that transports parameter information across the network, and a scalar weight u_i^(t) (with u_i^(0) = 1) that tracks how much mixing influence the node gradually accumulates. Because directed graphs amplify messages unevenly, the proxy w_i^(t) becomes biased on its own. The accompanying weight u_i^(t) captures this distortion, and the corrected estimate z_i^(t) = w_i^(t)/u_i^(t) recovers an unbiased representation of the parameter. In matrix form, the SGP update at each iteration is given by

W^(t+1) = P (W^(t) − γ_t ∇f(Z^(t); S^(t))).  (9)

Let w̄^(t) := (1/m) Σ_{i=1}^m w_i^(t) denote the global average of the parameter proxies, which serves as the consensus model.
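The effect of the weight correction z_i = w_i/u_i can be seen without any gradient steps. The following sketch is our toy example (the 3-node column-stochastic matrix is an assumed unbalanced digraph): iterating w ← Pw and u ← Pu, the de-biased ratios all converge to the exact uniform average of the initial values, while the raw proxies converge to π-weighted, non-uniform limits.

```python
import numpy as np

# Pure Push-Sum mixing (gradient terms switched off) on an assumed
# unbalanced 3-node digraph.
P = np.array([[1/3, 0.0, 1/2],
              [1/3, 1/2, 0.0],
              [1/3, 1/2, 1/2]])          # column-stochastic: 1^T P = 1^T
assert np.allclose(P.sum(axis=0), 1.0)

w = np.array([1.0, 5.0, 9.0])            # initial local parameters (scalars)
u = np.ones(3)                           # Push-Sum weights, u_i^(0) = 1

for t in range(200):
    w, u = P @ w, P @ u                  # both sequences mix with the same P

print(np.round(w / u, 4))   # de-biased z_i: all entries -> 5.0, the true average
print(np.round(w, 4))       # raw proxies: w_i -> [pi]_i * sum(w^(0)), non-uniform
```

In SGP the same u-sequence runs alongside the gradient updates, so the normalization continuously removes the π-induced bias that the directed topology injects into the proxies w_i.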
Then the induced evolution of this average is as follows.

Proposition 7 The uniform average update of SGP satisfies

w̄^(t+1) = w̄^(t) − (γ_t/m) 1^⊤ ∇f(Z^(t); S^(t)).  (10)

Proof See Appendix B.1 for the detailed proof.

Remark 8 Proposition 7 shows that the evolution of the consensus model w̄^(t) depends solely on the uniformly averaged stochastic gradients across all agents and is completely independent of the specific structure of the communication matrix P, despite the potentially asymmetric information flow.

Algorithm 1 Stochastic Gradient Push (SGP)
Require: step-size sequence {γ_t}; initialize w_i^(0) = z_i^(0) ∈ R^d and u_i^(0) = 1 for all i ∈ V
1: for t = 0, 1, 2, . . . , T − 1 do
2:   for each node i ∈ V in parallel do
3:     Sample ξ_i^(t) ∼ D_i from the local distribution.
4:     Compute the stochastic gradient at z_i^(t): g_i^(t) = ∇f(z_i^(t); ξ_i^(t)), and set w_i^(t+1/2) = w_i^(t) − γ_t g_i^(t).
5:     Send [P]_ji w_i^(t+1/2) and [P]_ji u_i^(t) to each j ∈ N_i^out.
6:     Receive [P]_ij w_j^(t+1/2) and [P]_ij u_j^(t) from each j ∈ N_i^in.
7:     Update the local variables:
         w_i^(t+1) = Σ_{j∈N_i^in} [P]_ij (w_j^(t) − γ_t g_j^(t))  (6)
         u_i^(t+1) = Σ_{j∈N_i^in} [P]_ij u_j^(t)  (7)
         z_i^(t+1) = w_i^(t+1) / u_i^(t+1)  (8)
8:   end for
9: end for
Ensure: Output w̄^(T).

The effectiveness of the algorithm depends on how closely the local de-biased variables z_i^(t) follow the consensus model w̄^(t). The following lemma provides a quantitative bound.

Lemma 9 (Consistency of Push-Sum) Assume the stochastic gradients are uniformly bounded, i.e., there exists G > 0 such that ∥∇f(z_i^(t); ξ_i^(t))∥ ≤ G. Define C_{w0} := (1/m) Σ_{i=1}^m ∥w_i^(0)∥. Then there exist constants C > 0 and λ ∈ (0, 1) such that for all t ≥ 1 and all i ∈ [m],

∥z_i^(t) − w̄^(t)∥ ≤ (C/δ) ( λ^t C_{w0} + Σ_{s=0}^{t−1} λ^{t−s} γ_s G ).
(11)

Proof See Appendix B.2 for the detailed proof.

Remark 10 Lemma 9 bounds the consensus error ∥z_i^(t) − w̄^(t)∥ by a transient term λ^t C_{w0} from the initial mismatch (decaying at rate λ) and a gradient-accumulation term smoothed by the same decay. The 1/δ factor captures the cost of imbalance: the penalty is modest when δ ≈ 1/m (balanced networks), but the error is significantly amplified when δ → 0 (weaker influence of low-authority agents). This highlights the core challenge of decentralized algorithms on asymmetric topologies, and shows why Push-Sum remains essential for reliable performance.

3.4 Stability and Generalization Measures

Definition 11 (Generalization and Optimization Error) Given a dataset S and a randomized algorithm A : S → W, we define:

(i) The generalization error ϵ_gen = E_S[F(A(S)) − F_S(A(S))], i.e., the expected statistical discrepancy between the population and empirical risks.

(ii) The excess generalization error ϵ_exc = E_S[F(A(S)) − F(w*)], i.e., the expected performance gap between the population risk of the output and that of the global true minimizer.

(iii) The optimization error ϵ_opt = E_S[F_S(A(S)) − F_S(w*_S)], i.e., the expected convergence gap between the empirical risk of the output and that of the empirical risk minimizer.

Furthermore, ϵ_exc can be decomposed as follows:

E_{S,A}[F(A(S)) − F(w*)] = E_{S,A}[F(A(S)) − F_S(A(S))] (= ϵ_gen) + E_{S,A}[F_S(A(S)) − F_S(w*_S)] (= ϵ_opt) + E_{S,A}[F_S(w*_S) − F(w*)] (≤ 0).

To evaluate the optimization error ϵ_opt, we first specify the algorithmic output A(S). Since the last iterate w̄^(T) is unstable under stochastic noise in non-convex settings, we follow classical convergence analyses that rely on averaged iterates (Ghadimi and Lan, 2013):

w̄_avg^(T) := ( Σ_{t=1}^T γ_t w̄^(t) ) / ( Σ_{t=1}^T γ_t ).
(12)

This construction accounts for time-varying step sizes and yields a stable surrogate for the theoretical analysis. Hence, our excess generalization error is decomposed as:

ϵ_exc ≤ ϵ_ave-stab + ϵ_opt.  (13)

To bound ϵ_gen, we employ the concept of uniform stability.

Definition 12 (Uniform Stability (Bousquet and Elisseeff, 2002)) A stochastic algorithm A is ϵ_stab-uniformly stable if, for any pair of datasets S and S′ that differ in at most one training example, the following uniform bound holds:

sup_z E_A[f(A(S); z) − f(A(S′); z)] ≤ ϵ_stab.  (14)

Lemma 13 (Generalization for Convex Objectives (Hardt et al., 2016)) Let the stochastic algorithm A be ϵ_stab-uniformly stable. Then the generalization error satisfies:

E_{S,A}[F(A(S)) − F_S(A(S))] ≤ ϵ_stab.  (15)

Lemma 14 (Generalization for Non-Convex Objectives (Sun et al., 2021)) Suppose the loss function has gradients bounded by a constant G. Let w^(t) and w′^(t) denote the outputs of the decentralized algorithm trained on S and S′, respectively, at step t, and let ∆_t := ∥w^(t) − w′^(t)∥. Then, for any time step t_0 ∈ {0, 1, . . . , T} and any ξ_i ∼ D_i, under the random update and permutation rules, the generalization error is bounded by:

E_{S,A}[F(A(S)) − F_S(A(S))] ≤ (t_0 G)/(mn) + G · E[∆_T | ∆_{t_0} = 0].  (16)

Remark 15 Lemmas 13 and 14 imply that once we have access to the uniform stability error, we can derive the generalization gap as an accompanying result.

Lemma 16 (Expansion (Hardt et al., 2016)) Fix an update sequence Φ_1, . . . , Φ_T and another sequence Φ′_1, . . . , Φ′_T. Let w_0 = w′_0 be a common starting point, and define w_{t+1} = Φ_t(w_t) and w′_{t+1} = Φ′_t(w′_t). For non-negative step sizes γ_t ≥ 0 and a loss function f, define the gradient update Φ_{f,γ_t} as Φ_{f,γ_t}(w) = w − γ_t ∇f(w). Assume f is L-smooth.
Then the following properties hold:

(i) The update $\Phi_{f,\gamma_t}$ is $(1 + L\gamma_t)$-expansive.

(ii) If $f$ is convex, then for any $\gamma_t \le 2/L$, the update $\Phi_{f,\gamma_t}$ is $1$-expansive.

Lemma 16 characterizes the evolution of the distance between two optimization trajectories after a single gradient step and provides a fundamental tool for stability-based generalization analysis (Hardt et al., 2016). (i) In the non-convex setting, the stability gap may increase after each update, which can in principle lead to exponential growth of the trajectory discrepancy if not properly controlled. (ii) Under convexity and a suitable choice of step size, the gradient update becomes non-expansive, thereby preventing exponential amplification of errors and enabling tighter stability guarantees.

4 Theoretical Analysis

In this section, we present a comprehensive theoretical analysis of SGP. We begin by outlining the necessary assumptions in Subsection 4.1. In Subsection 4.2, we analyze the uniform stability and optimization error in the convex setting, and combine them to derive the excess generalization error. We then extend these results to the non-convex case under the PL condition in Subsection 4.3. Finally, in Subsection 4.4, we discuss how network topology properties govern the learning performance. Detailed proofs are provided in Appendix C.

4.1 Basic Assumptions

The analysis requires the assumptions stated below:

Assumption 17 ($G$-Lipschitz) The function $f(x; z)$ is differentiable with respect to $x$ and $G$-Lipschitz for every $z$, i.e., there exists $G \ge 0$ such that $|f(y; z) - f(x; z)| \le G\|y - x\|$. As a consequence, the gradient is uniformly bounded: $\|\nabla f(x; z)\| \le G$.
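As a sanity check on Assumption 17, the bounded-gradient property can be verified numerically for a concrete loss. The sketch below is our own illustration (not part of the paper's experiments) and uses the logistic loss later adopted in Section 5.1, whose gradient norm is at most the data norm $\|a\|$:

```python
import numpy as np

# For the logistic loss f(x; (a, b)) = log(1 + exp(-b a.x)), the gradient is
# -b * sigmoid(-b a.x) * a, and the sigmoid factor lies in (0, 1). Hence the
# loss is G-Lipschitz with G = ||a||, illustrating Assumption 17 on bounded data.
rng = np.random.default_rng(0)
a = rng.normal(size=20)
a *= 3.0 / np.linalg.norm(a)           # enforce ||a|| = 3
b = -1.0

def grad(x):
    s = 1.0 / (1.0 + np.exp(b * (a @ x)))   # sigmoid(-b a.x)
    return -b * s * a

G = np.linalg.norm(a)                  # candidate Lipschitz constant
norms = [np.linalg.norm(grad(rng.normal(size=20))) for _ in range(200)]
# every sampled gradient norm stays below G
```

The same computation applied to other per-sample losses gives a quick way to estimate the constant $G$ that enters all the bounds below.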
Assumption 18 ($L$-Smooth) The differentiable function $f(x; z)$ is $L$-smooth for every $z$, meaning there exists a constant $L > 0$ such that $\|\nabla f(y; z) - \nabla f(x; z)\| \le L\|y - x\|$.

Assumption 19 (Bounded Space) The parameter space is bounded by a closed ball $B(O, r)$ centered at the origin with radius $r > 0$.

Remark 20 We briefly comment on the above assumptions. (i) Lipschitz continuity and smoothness. Assumptions 17 and 18 are standard in the study of algorithmic stability for gradient-based learning algorithms. They allow one to control the sensitivity of the loss and of the optimization dynamics with respect to small perturbations of the training data. Similar conditions are commonly imposed in stability analyses of SGD and related methods (Bousquet and Elisseeff, 2002; Hardt et al., 2016; Sun et al., 2021). (ii) Bounded parameter space. Assumption 19 ensures that the iterates remain in a compact domain and enables the use of projection arguments in the convergence and stability analysis. This boundedness condition is also frequently adopted in prior work (Hardt et al., 2016; Sun et al., 2021). Overall, these assumptions are consistent with the standard setup in existing stability-based generalization results and are not stronger than those typically required in the literature.

4.2 Results on the Convex Case

First, we give the definition of a convex function:

Definition 21 (Convex) The loss function $f(x; z)$ is said to be convex with respect to $x$ if for any $x, y$ in its domain, it satisfies

$$f(x; z) \ge f(y; z) + \langle \nabla f(y; z), x - y \rangle. \quad (17)$$

Under convexity, the optimization and generalization results for the convex case follow.

Theorem 22 (Uniform Stability) Assume the loss function $f$ is convex and Assumptions 17–18 hold.
Then, when $\gamma_t \le 2/L$, the uniform stability of SGP satisfies:

$$\epsilon_{\mathrm{stab}} \le \frac{2CGLC_{w_0}}{\delta} \sum_{t=0}^{T-1} \gamma_t \lambda^t + \frac{2CG^2L}{\delta(1-\lambda)} \sum_{t=0}^{T-1} \gamma_t^2 + \frac{2G^2}{mn} \sum_{t=0}^{T-1} \gamma_t. \quad (18)$$

Proof Sketch To analyze the uniform stability of SGP, we start by bounding the expected divergence between the consensus models $\bar{w}_T$ and $\bar{w}'_T$. By the $G$-Lipschitz assumption (Assumption 17), the stability $\epsilon_{\mathrm{stab}}$ is related to the expected divergence $\mathbb{E}[\Delta_T]$ as follows:

$$\epsilon_{\mathrm{stab}} \le G \cdot \mathbb{E}\|\bar{w}_T - \bar{w}'_T\| = G\,\mathbb{E}[\Delta_T],$$

where $\Delta_T := \|\bar{w}_T - \bar{w}'_T\|$ denotes the divergence between the models after $T$ iterations. Next, we analyze the evolution of $\Delta_t$ at each iteration. The gradient update step is non-expansive due to the convexity and $L$-smoothness of the loss function (Lemma 16). When the nodes sample the same data (with probability $1 - \frac{1}{n}$), the divergence evolves as:

$$\mathbb{E}[\Delta_{t+1}] \le \mathbb{E}[\Delta_t] + \frac{2L\gamma_t}{m} \sum_{i=1}^m \mathbb{E}\|\bar{w}^{(t)} - z_i^{(t)}\|,$$

where the second term can be bounded using Lemma 9. When a differing sample is selected (with probability $\frac{1}{n}$), the divergence between the models additionally increases due to the discrepancy between the gradient updates of the nodes. The total divergence is therefore a combination of the consensus error and the gradient mismatch caused by the differing samples. We express this evolution as:

$$\mathbb{E}[\Delta_{t+1}] \le \mathbb{E}[\Delta_t] + \frac{2L\gamma_t}{m} \sum_{i=1}^m \mathbb{E}\|\bar{w}^{(t)} - z_i^{(t)}\| + \frac{2G\gamma_t}{m},$$

where the final term accounts for the perturbation introduced by the differing sample. To obtain the final stability bound, we combine both cases by taking the expectation over the sampling process. Using the fact that, with probability $1 - \frac{1}{n}$, the nodes sample the same data, and with probability $\frac{1}{n}$, one node samples a different data point, we obtain the following recurrence for the expected divergence:

$$\mathbb{E}[\Delta_{t+1}] \le \mathbb{E}[\Delta_t] + \frac{2L\gamma_t}{\delta}\left(\lambda^t C_{w_0} + G \sum_{s=0}^{t-1} \lambda^{t-s}\gamma_s\right) + \frac{2G\gamma_t}{mn}.$$
Summing this recurrence from $t = 0$ to $T - 1$ and simplifying the terms, we arrive at the bound:

$$\mathbb{E}[\Delta_T] \le \frac{2CLC_{w_0}}{\delta} \sum_{t=0}^{T-1} \gamma_t \lambda^t + \frac{2CGL}{\delta(1-\lambda)} \sum_{t=0}^{T-1} \gamma_t^2 + \frac{2G}{mn} \sum_{t=0}^{T-1} \gamma_t.$$

Finally, by the $G$-Lipschitzness of the loss function, we obtain the stated stability bound. Detailed derivations are provided in Appendix C.1.

Remark 23 Theorem 22 reveals how decentralization over directed graphs affects stability beyond standard centralized SGD. The bound can be decomposed into three components. (i) Initialization and network mixing. The first term characterizes the residual effect of the initial disagreement, which decays geometrically at rate $\lambda^t$. Hence, a smaller spectral radius $\lambda$ implies faster consensus, ensuring that the influence of early-round perturbations diminishes rapidly. (ii) Directed communication penalty. The second term reflects the structural difficulty of learning over directed graphs. The factor $1/\delta$ quantifies the amplification caused by imbalance in the stationary distribution, while the factor $1/(1-\lambda)$ captures the slowdown due to imperfect connectivity and limited information propagation. (iii) Stochastic sampling effect. The final term, of order $\sum_{t=0}^{T-1} \gamma_t/(mn)$, corresponds to the intrinsic sampling noise in stochastic optimization. Notably, it matches the stability scaling of centralized SGD (Sun et al., 2023), indicating that the additional generalization cost of decentralization is entirely governed by the network-dependent terms above.

The learning rate plays a crucial role in the stability and generalization behavior of optimization algorithms.
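The summation step in the proof sketch can be checked numerically: unrolling the divergence recurrence term by term should never exceed the three-term closed form of Theorem 22. The toy check below uses illustrative constants of our own choosing and sets the Lemma 9 constant to $C = 1$:

```python
import numpy as np

# Unroll the divergence recurrence from the proof sketch of Theorem 22
# (illustrative constants; Lemma 9 constant C set to 1).
L, G, delta, lam = 1.0, 1.0, 0.25, 0.9
m, n, T = 8, 100, 200
gammas = np.full(T, 0.05)            # constant step size, gamma <= 2/L
C_w0 = 1.0

acc = 0.0                            # accumulated E[Delta_T], Delta_0 = 0
for t in range(T):
    smoothed = sum(lam ** (t - s) * gammas[s] for s in range(t))
    acc += 2 * L * gammas[t] / delta * (lam ** t * C_w0 + G * smoothed) \
           + 2 * G * gammas[t] / (m * n)

# Closed-form bound obtained by summing the recurrence (cf. the display above)
bound = (2 * L * C_w0 / delta) * float(np.sum(gammas * lam ** np.arange(T))) \
        + (2 * G * L / (delta * (1 - lam))) * float(np.sum(gammas ** 2)) \
        + (2 * G / (m * n)) * float(np.sum(gammas))
# acc <= bound: the geometric tail sum is dominated by the 1/(1-lambda) factor
```

The slack between `acc` and `bound` comes entirely from bounding the geometric tail $\sum_{s<t}\lambda^{t-s}\gamma_s$ by $\gamma/(1-\lambda)$.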
We now discuss two commonly used learning-rate schemes:

Corollary 24 (Uniform Stability under Common Learning Rates) Assume the loss function $f$ is convex and Assumptions 17–18 hold. For a constant learning rate $\gamma_t = \gamma$ satisfying $\gamma \le 2/L$, we have:

$$\epsilon_{\mathrm{stab}} \le \frac{2CGL\gamma C_{w_0}}{\delta(1-\lambda)} + \left(\frac{2CG^2L\gamma^2}{\delta(1-\lambda)} + \frac{2G^2\gamma}{mn}\right)T. \quad (19)$$

For a diminishing learning rate $\gamma_t = \frac{v}{t+1}$ satisfying $v \le 2/L$, we have:

$$\epsilon_{\mathrm{stab}} \le \frac{2G^2 v}{mn} \ln T + \frac{2vCGLC_{w_0} + 4CG^2Lv^2}{\delta(1-\lambda)} + \frac{2G^2 v}{mn}. \quad (20)$$

Proof See Appendix C.2 for the proofs.

Remark 25 Corollary 24 makes explicit how the learning-rate schedule determines the long-horizon behavior of stability. (i) Growth rate under different schedules. With a constant step size $\gamma$, the cumulative stochastic perturbation accumulates linearly, leading to $\epsilon_{\mathrm{stab}} = O(T)$. In contrast, the diminishing schedule $\gamma_t = v/(t+1)$ suppresses this accumulation and yields the milder growth rate $O(\ln T)$. This distinction reflects the classical bias–variance trade-off: constant step sizes preserve optimization speed but incur persistent instability, whereas decaying schedules gradually attenuate sensitivity to data perturbations. Moreover, these rates are known to be essentially unimprovable in general convex stochastic optimization, as matching lower bounds exist even in the centralized setting. (ii) Network-dependent degradation. The bounds further separate optimization effects from communication-induced penalties. When the graph is balanced, the dependence on $(1-\lambda)^{-1}$ matches the asymptotic scaling of undirected D-SGD (Sun et al., 2021). For general directed graphs, however, the additional factor $1/\delta$ quantifies the imbalance of the stationary distribution and leads to a strictly larger stability constant.
Hence, the generalization gap between SGP and its undirected counterpart is entirely attributable to the asymmetry of information flow.

Theorem 26 (Optimization Error) Assume the loss function $f$ is convex and Assumptions 17–19 hold. Then SGP satisfies:

$$\epsilon_{\mathrm{opt}} \le \frac{\|\bar{w}^{(0)} - w_S^*\|^2}{2\sum_{t=0}^{T-1}\gamma_t} + \frac{2rCLC_{w_0}}{\delta \sum_{t=0}^{T-1}\gamma_t} \sum_{t=0}^{T-1} \gamma_t \lambda^t + \left(\frac{2rCLG}{\delta(1-\lambda)} + \frac{G^2}{2}\right) \frac{\sum_{t=0}^{T-1}\gamma_t^2}{\sum_{t=0}^{T-1}\gamma_t}. \quad (21)$$

Proof Detailed proofs can be found in Appendix C.3.

Corollary 27 (Optimization Error under Common Learning Rates) When the loss $f$ is convex and Assumptions 17–19 hold, for a constant learning rate $\gamma_t = \gamma$ satisfying $\gamma \le 2/L$, SGP satisfies:

$$\epsilon_{\mathrm{opt}} \le \left(\frac{\|\bar{w}^{(0)} - w_S^*\|^2}{2\gamma} + \frac{2rCLC_{w_0}}{\delta(1-\lambda)}\right)\frac{1}{T} + \frac{2CGrL\gamma}{\delta(1-\lambda)} + \frac{G^2\gamma}{2}. \quad (22)$$

For a diminishing learning rate $\gamma_t = \frac{v}{t+1}$ satisfying $v \le 2/L$, SGP satisfies:

$$\epsilon_{\mathrm{opt}} \le \left(\frac{\|\bar{w}^{(0)} - w_S^*\|^2}{v} + \frac{4rCLC_{w_0}}{\delta(1-\lambda)} + \frac{8vrCGL}{\delta(1-\lambda)} + 2vG^2\right)\frac{1}{\ln T}. \quad (23)$$

Proof Detailed proofs can be found in Appendix C.4.

Remark 28 Theorem 26 and Corollary 27 characterize the optimization behavior of SGP in convex settings. (i) Step-size dependence. Under a constant step size, SGP achieves the classical $O(1/T)$ convergence rate up to a residual term proportional to $\gamma$, while diminishing step sizes yield the milder $O(1/\ln T)$ rate. These rates are consistent with standard stochastic approximation results and match the order of centralized SGD in convex problems (Hardt et al., 2016). (ii) Consistency with undirected decentralized optimization. When the communication graph is balanced, the dependence on $(1-\lambda)^{-1}$ coincides with the spectral-gap dependence appearing in undirected D-SGD analyses (Lian et al., 2017).
In this regime, the bound reduces to the classical decentralized optimization rate up to constants, indicating that SGP preserves the asymptotic order established for symmetric networks. (iii) Directed-network specific effects. For general directed graphs, the optimization error exhibits additional amplification through $1/\delta$, reflecting imbalance in the stationary distribution, as in Push-Sum based methods (Nedic et al., 2016). Moreover, the exponentially weighted term $\sum_t \gamma_t \lambda^t$ explicitly captures the transient consensus bias induced by asymmetric information flow, which is typically implicit in existing directed analyses, including SGP (Assran et al., 2019). Thus, any degradation relative to classical decentralized optimization stems from directed mixing effects rather than from the stochastic optimization mechanism.

This trade-off between stability and optimization necessitates careful selection of the number of iterations $T$ via early stopping to minimize the excess generalization error.

Corollary 29 (Excess Generalization Error under Common Learning Rates) When the loss function $f$ is convex and Assumptions 17–19 hold, SGP satisfies the following. For a constant learning rate $\gamma_t = \gamma$ with $\gamma \le 2/L$, there exists an early-stopping time

$$T^\star = \tilde{\Theta}\left(\frac{\sqrt{mn}}{\gamma}\sqrt{\frac{\delta(1-\lambda)+\gamma}{\delta(1-\lambda)+mn\gamma}}\right) \approx \tilde{\Theta}\left(\frac{1}{\gamma}\sqrt{\frac{mn}{1+\frac{mn\gamma}{\delta(1-\lambda)}}}\right), \quad (24)$$

and the corresponding excess generalization error satisfies

$$\epsilon_{\mathrm{exc}}^\star = \tilde{O}\left(\frac{1}{\sqrt{mn}} + \sqrt{\frac{\gamma}{\delta(1-\lambda)}} + \frac{\gamma}{\delta(1-\lambda)} + \gamma\right). \quad (25)$$

For a diminishing learning rate $\gamma_t = \frac{v}{t+1}$ with $v \le 2/L$, there exists

$$T^\star = \tilde{\Theta}\left(\exp\left(\frac{\sqrt{mn}}{v}\sqrt{1 + \frac{v^2}{\delta(1-\lambda)}}\right)\right), \quad (26)$$

and at this time the excess generalization error scales as

$$\epsilon_{\mathrm{exc}}^\star = \tilde{O}\left(\frac{1}{\sqrt{mn}} + \frac{v}{\delta(1-\lambda)} + \frac{v}{\sqrt{mn\,\delta(1-\lambda)}} + v^2\right).$$
(27)

Here $\tilde{\Theta}(\cdot)$ and $\tilde{O}(\cdot)$ suppress logarithmic factors and universal constants depending on $G$, $L$, $C$, $r$, and $C_{w_0}$ (as well as initialization-dependent constants).

Proof See Appendix C.5 for the detailed proofs.

Remark 30 Corollary 29 highlights the statistical and network-induced effects in the excess generalization error. (i) Statistical term. In both step-size regimes, the dominant term $\tilde{O}(1/\sqrt{mn})$ coincides with the optimal rate of centralized SGD using the total sample size $mn$. This shows that, up to logarithmic factors, SGP preserves the statistical efficiency of centralized learning and achieves linear speedup with respect to the total data volume, despite operating over an asymmetric communication topology. (ii) Topological bias. In addition to the statistical term, the bound contains network-dependent contributions proportional to $\frac{\gamma}{\delta(1-\lambda)}$ (or $\frac{v}{\delta(1-\lambda)}$ in the diminishing step-size case). These terms reflect a consensus bias induced by imperfect mixing and asymmetry in the communication graph. Unlike the statistical error, they do not vanish with increasing sample size $n$, and instead depend explicitly on the imbalance parameter $\delta$ and the spectral gap $1-\lambda$. Consequently, the achievable accuracy is fundamentally constrained by the network topology: poorly connected or highly imbalanced graphs lead to a non-negligible residual error even when $mn$ is large.

4.3 Results on the Non-convex Case

Since optimization problems in machine learning are usually non-convex, as in deep neural network training, analyzing non-convex settings is inherently more complicated but necessary.

Theorem 31 (Uniform Stability) Assume the loss function $f$ is non-convex and Assumptions 17–18 hold, and let $t_0 \in \{0, 1, \ldots, T\}$ denote the first time step at which SGP chooses the differing example.
Then the uniform stability of SGP satisfies:

$$\epsilon_{\mathrm{stab}} \le \min_{t_0} \left\{ \frac{t_0}{mn} + G \sum_{t=t_0}^{T-1} \prod_{k=t+1}^{T-1} \left(1 + L\gamma_k - \frac{L\gamma_k}{mn}\right) \times \left(\frac{2CL\gamma_t \lambda^t C_{w_0}}{\delta} + \frac{2GCL\gamma_t^2}{\delta(1-\lambda)} + \frac{2G\gamma_t}{mn}\right) \right\}. \quad (28)$$

Proof Sketch We analyze the uniform stability of SGP by bounding the divergence between the two consensus trajectories $\bar{w}^{(t)}$ and $\bar{w}'^{(t)}$ generated on neighboring datasets. Let $\Delta_t := \|\bar{w}^{(t)} - \bar{w}'^{(t)}\|$. Unlike the convex case, the gradient step is no longer non-expansive: Lemma 16 implies that in the non-convex setting the averaged update may expand distances by a factor $(1 + L\gamma_t)$. At iteration $t$, separating the gradient difference at the consensus model from the consensus errors yields

$$\Delta_{t+1} \le (1 + L\gamma_t)\Delta_t + \frac{L\gamma_t}{m} \sum_{i=1}^m \|\bar{w}^{(t)} - z_i^{(t)}\| + \frac{L\gamma_t}{m} \sum_{i=1}^m \|\bar{w}'^{(t)} - z_i'^{(t)}\| + \frac{2G\gamma_t}{mn},$$

where the last term accounts for the possible gradient mismatch when the differing sample is selected. The consensus error terms are controlled by Lemma 9, which shows that for every $t$,

$$\frac{1}{m} \sum_{i=1}^m \|\bar{w}^{(t)} - z_i^{(t)}\| \lesssim \frac{C}{\delta}\left(\lambda^t C_{w_0} + G \sum_{s=0}^{t-1} \lambda^{t-s}\gamma_s\right).$$

Substituting this bound gives a one-step recursion of the form

$$\mathbb{E}[\Delta_{t+1}] \le (1 + L\gamma_t)\,\mathbb{E}[\Delta_t] + \frac{2CL\gamma_t}{\delta}\left(\lambda^t C_{w_0} + G \sum_{s=0}^{t-1} \lambda^{t-s}\gamma_s\right) + \frac{2G\gamma_t}{mn}.$$

To handle the multiplicative expansion, we condition on the first time $t_0$ at which the two runs decouple. Since $\Delta_{t_0} = 0$, unrolling the above recursion from $t_0$ to $T-1$ shows that the accumulated perturbations are weighted by the expansion factors $\prod_k (1 + L\gamma_k)$. Finally, removing the conditioning introduces the standard decoupling term $t_0/(mn)$,

$$\epsilon_{\mathrm{stab}} \le \frac{t_0}{mn} + G\,\mathbb{E}[\Delta_T \mid \Delta_{t_0} = 0],$$

where the second term is controlled by the accumulated network and stochastic errors. Choosing $t_0$ to balance the decoupling probability against the subsequent expansion leads to the stated bound. The detailed proof is located in Appendix C.6.
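The size of the amplification factor appearing in the bound can be gauged directly. The short computation below (illustrative constants of our own) checks the elementary inequality $\prod_k(1+c) \le e^{cT}$ and shows that, for small per-step exponents, the product is essentially $e^{L\gamma T(1 - 1/(mn))}$:

```python
import numpy as np

# Amplification factor prod_k (1 + L*gamma - L*gamma/(mn)) with constant steps.
L, m, n = 2.0, 8, 100
T, gamma = 500, 1e-3
c = L * gamma * (1 - 1.0 / (m * n))   # per-step expansion exponent

amp = float(np.prod(np.full(T, 1.0 + c)))   # exact product
approx = float(np.exp(c * T))               # e^{c T} upper bound (1 + x <= e^x)
# amp <= approx, and amp is within ~0.1% of approx here
```

This is why, under a constant step size, perturbations introduced early in training are propagated with an essentially exponential weight in $T$.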
Remark 32 Theorem 31 characterizes the sensitivity of SGP under non-convex objectives. (i) Amplification in the non-convex case. The product term $\prod_{k=t+1}^{T-1}\left(1 + L\gamma_k - \frac{L\gamma_k}{mn}\right)$ acts as an amplification factor that propagates perturbations introduced after time $t_0$. In contrast to the convex case, where contractive behavior can be established through monotonic descent toward a minimizer, non-convex dynamics do not generally admit such contraction. As a result, small discrepancies may accumulate multiplicatively across iterations. The bound makes this effect explicit and quantifies how stability depends on the expansivity induced by the step-size schedule. (ii) Interaction with directed topology. The network-dependent factors $\frac{1}{\delta}$ and $\frac{1}{1-\lambda}$ appear inside the amplified summation, indicating that communication imbalance and slow mixing increase the magnitude of propagated perturbations. Unlike in convex settings, where topology contributes additively to the stability bound, here it interacts with the amplification mechanism, thereby influencing stability throughout the trajectory. This highlights that directed mixing properties play a more pronounced role in non-convex stability.

A well-designed step size can limit the rapid growth of SGP's generalization error. The following corollary considers two common learning-rate schedules:

Corollary 33 (Uniform Stability under Common Learning Rates) When $f$ is non-convex and Assumptions 17–18 hold, for a constant learning rate $\gamma_t = \gamma$, we have

$$\epsilon_{\mathrm{stab}} \le \left(\frac{2GCL\gamma C_{w_0}}{\delta(1-\lambda)} + \frac{4CG^2L\gamma^2}{\delta(1-\lambda)} + \frac{4G^2\gamma}{mn}\right)\left(1 + L\gamma - \frac{L\gamma}{mn}\right)^T. \quad (29)$$

For a diminishing learning rate $\gamma_t = \frac{v}{t+1}$, we have

$$\epsilon_{\mathrm{stab}} \le \frac{4CGv^{\frac{1}{2+vL}}C_{w_0}}{\delta}\,T^{\frac{1+vL}{2+vL}} + \frac{4G^2 v^{\frac{1}{2+vL}}}{mn}\,T^{\frac{1+vL}{2+vL}} + \frac{2CG^2 v^{\frac{1}{2+vL}}}{\delta(1-\lambda)}\,T^{\frac{vL}{2+vL}}. \quad (30)$$

Proof See Appendix C.7 for the detailed proof.
Remark 34 Corollary 33 clarifies how the learning-rate schedule governs stability in non-convex decentralized optimization and contrasts with the convex case. (i) Contrast with convex stability. In convex settings, stability grows at most linearly or logarithmically in $T$, depending on the step-size schedule, due to the contractive structure of the objective. By contrast, under a constant step size $\gamma$, the non-convex bound scales as $\left(1 + L\gamma - \frac{L\gamma}{mn}\right)^T \approx e^{L\gamma T}$, indicating exponential amplification of perturbations. This difference reflects the absence of global contraction in non-convex dynamics, where gradient updates may expand discrepancies rather than attenuate them. (ii) Effect of diminishing step sizes. Adopting a diminishing schedule $\gamma_t \propto 1/t$ reduces the exponential growth to a polynomial rate $O(T^p)$, where $p = \frac{1+vL}{2+vL} < 1$. Thus, instability accumulates sublinearly in $T$, and sensitivity to early perturbations decreases as the step size vanishes. Such behavior aligns with classical non-convex SGD analyses, where diminishing learning rates are required to control long-term variance. (iii) Comparison with existing decentralized analyses. While exponential instability under a constant step size is also observed in centralized non-convex SGD, the bound further reveals the interaction with directed topology through the factors $\frac{1}{\delta}$ and $\frac{1}{1-\lambda}$. Unlike convex decentralized results, where topology contributes additively to the stability constant, here it appears within the amplified term, magnifying perturbations throughout the trajectory. This distinguishes directed SGP from centralized and undirected settings and shows that topological imbalance directly affects non-convex stability rates.
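The contrast between the two schedules is easy to see numerically. In this small sketch (constants are illustrative), doubling the horizon multiplies the diminishing-step growth by the fixed factor $2^p$, whereas the constant-step amplification ratio itself keeps growing with $T$:

```python
import numpy as np

L, mn = 2.0, 800
gamma, v = 1e-3, 1e-3
c = L * gamma * (1 - 1.0 / mn)          # constant-step per-iteration exponent
p = (1 + v * L) / (2 + v * L)           # polynomial exponent, p < 1

const_growth = lambda T: (1 + c) ** T   # exponential amplification, eq. (29)
dimin_growth = lambda T: T ** p         # polynomial growth, eq. (30)

r1 = const_growth(2000) / const_growth(1000)   # grows with the horizon
r2 = const_growth(4000) / const_growth(2000)
s1 = dimin_growth(2000) / dimin_growth(1000)   # fixed factor 2**p
s2 = dimin_growth(4000) / dimin_growth(2000)
```

The doubling ratios `r1 < r2` confirm super-polynomial growth under the constant schedule, while `s1 == s2` reflects scale-free polynomial growth under the diminishing one.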
The Polyak–Łojasiewicz (PL) condition has been widely used in non-convex optimization (Deng et al., 2023), as it ensures convergence in function value to the global minimum, whereas general non-convex settings only guarantee convergence to a stationary point.

Definition 35 (PL Condition) Let $w^* = \operatorname{argmin}_w f(w)$. A function $f$ satisfies the PL condition with parameter $\alpha > 0$ if for all $w$,

$$2\alpha\,[f(w) - f(w^*)] \le \|\nabla f(w)\|^2.$$

When $f$ is additionally $L$-smooth, the ratio $\kappa := L/\alpha$ is called the condition number of $f$.

Remark 36 (i) Applicability of the PL condition. The PL condition is a standard relaxation of strong convexity that guarantees linear convergence of gradient descent and its stochastic variants. It holds for many practical non-convex objectives, including over-parameterized neural networks (especially with ReLU activations) and certain matrix factorization and phase retrieval problems (Song et al., 2021; Chen et al., 2023). (ii) Condition number. The quantity $\kappa = L/\alpha$ serves as an effective condition number, analogous to the ratio of the smoothness and strong convexity parameters in the convex setting (Karimi et al., 2016). Larger $\kappa$ reflects greater ill-conditioning: slower function decay away from the minimum and extended flat regions. Under PL, gradient methods typically converge linearly at rate $1 - O(1/\kappa)$, yielding iteration complexity $O(\kappa \log(1/\varepsilon))$ to reach $\varepsilon$-accuracy. Thus, $\kappa$ directly controls convergence speed and quantifies landscape difficulty.

We can now derive the optimization error in the non-convex case:

Theorem 37 (Optimization Error) Assume $f$ is non-convex and satisfies the PL condition, and Assumptions 17–19 hold.
Then the optimization error of SGP satisfies:

$$\epsilon_{\mathrm{opt}} \le \frac{Gr}{\alpha \sum_{t=0}^{T-1}\gamma_t} + \frac{CG\kappa C_{w_0}}{2\delta \sum_{t=0}^{T-1}\gamma_t} \sum_{t=0}^{T-1} \gamma_t \lambda^t + \left(\frac{CG^2\kappa}{2\delta(1-\lambda)} + \frac{G^2\kappa}{4}\right)\frac{\sum_{t=0}^{T-1}\gamma_t^2}{\sum_{t=0}^{T-1}\gamma_t}. \quad (31)$$

Proof See Appendix C.8 for the detailed proofs.

Corollary 38 (Optimization Error under Common Learning Rates) When the loss function $f$ is non-convex and satisfies the PL condition, and Assumptions 17–19 hold, for a constant learning rate $\gamma_t = \gamma$ we have:

$$\epsilon_{\mathrm{opt}} \le \left(\frac{Gr}{\alpha\gamma} + \frac{CG\kappa C_{w_0}}{2\delta(1-\lambda)}\right)\frac{1}{T} + \left(\frac{CG^2\kappa}{2\delta(1-\lambda)} + \frac{G^2\kappa}{4}\right)\gamma. \quad (32)$$

For a diminishing learning rate $\gamma_t = \frac{v}{t+1}$, we have:

$$\epsilon_{\mathrm{opt}} \le \left(\frac{CG\kappa C_{w_0} + 2vCG^2L\kappa}{\delta(1-\lambda)} + \frac{2rG + v^2G^2L}{v\alpha}\right)\frac{1}{\ln T}. \quad (33)$$

Proof See Appendix C.9 for the detailed proofs.

Remark 39 Theorem 37 and Corollary 38 characterize the convergence behavior of SGP under the Polyak–Łojasiewicz (PL) condition. (i) Topology-dependent error floor. In the constant step-size regime, Corollary 38 shows that the stationary error is dominated by $\kappa/(\delta(1-\lambda))$, revealing a coupling between problem conditioning and directed network topology. Here, $\kappa = L/\alpha$ measures the conditioning of the objective, while $1/(\delta(1-\lambda))$ captures the degradation induced by imbalance and slow mixing. Thus, directed communication can amplify the optimization difficulty through the convergence constants. (ii) Convex-like rates under PL. Under the PL condition, SGP attains $O(1/T)$ convergence with a constant step size and $O(1/\ln T)$ with a diminishing step size, matching the rates obtained in convex decentralized optimization. This indicates that the PL condition yields convex-like optimization behavior, although the constants remain more sensitive to directed settings.
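The PL condition of Definition 35 can be verified exactly on a strongly convex quadratic, a standard sanity example (our own, not from the paper): for $f(w) = \frac{1}{2}w^\top A w$ with $A \succ 0$, PL holds with $\alpha = \lambda_{\min}(A)$ and smoothness $L = \lambda_{\max}(A)$:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = M @ M.T + 0.1 * np.eye(4)        # symmetric positive definite

f = lambda w: 0.5 * w @ A @ w        # minimizer w* = 0, f(w*) = 0
grad = lambda w: A @ w
alpha = np.linalg.eigvalsh(A)[0]     # PL parameter = smallest eigenvalue
L_sm = np.linalg.eigvalsh(A)[-1]     # smoothness constant = largest eigenvalue

# Check 2*alpha*(f(w) - f(w*)) <= ||grad f(w)||^2 at random points:
# this follows from A^2 >= lambda_min(A) * A for SPD matrices.
for _ in range(100):
    w = rng.normal(size=4)
    assert 2 * alpha * f(w) <= grad(w) @ grad(w) + 1e-9

kappa = L_sm / alpha                 # condition number kappa = L / alpha
```

For such quadratics the PL inequality is tight along the eigenvector of $\lambda_{\min}(A)$, which is why $\alpha = \lambda_{\min}(A)$ is the best possible PL parameter.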
Corollary 40 (Excess Generalization Error under Common Learning Rates) Suppose the loss function $f$ is non-convex and satisfies the PL condition, and that Assumptions 17–19 hold. For a constant learning rate $\gamma_t = \gamma$, there exists an early-stopping time $T^\star$ such that

$$T^\star = \tilde{\Theta}\left(\frac{mn}{L\gamma(mn-1)}\log\frac{1}{\gamma}\right), \quad (34)$$

and the corresponding excess generalization error satisfies

$$\epsilon_{\mathrm{exc}}^\star = \tilde{O}\left(\kappa G^2\left(1 + \frac{1}{\delta(1-\lambda)}\right)\left(1 + \frac{1}{mn}\right)\gamma\right). \quad (35)$$

For a diminishing learning rate $\gamma_t = \frac{v}{t+1}$, there exists an early-stopping time of the form

$$T^\star = \tilde{\Theta}\left(\exp\left(\Theta\left(\frac{mn}{(mn-1)vL}\right)\right)\right), \quad (36)$$

and at this time the excess generalization error scales as

$$\epsilon_{\mathrm{exc}}^\star = \tilde{O}\left(\frac{\kappa(mn-1)}{mn}\left(rG + \frac{CGLC_{w_0}}{\delta(1-\lambda)} + vG^2\left(1 + \frac{CL}{\delta(1-\lambda)}\right)\right)\right), \quad (37)$$

where the notation $\tilde{\Theta}(\cdot)$ and $\tilde{O}(\cdot)$ suppresses logarithmic factors in $T$ and $1/\gamma$ as well as universal constants.

Proof See Appendix C.10 for the proofs.

Remark 41 Corollary 40 clarifies how the PL condition shapes the generalization behavior of SGP in the non-convex regime. (i) Diminishing step sizes. Under a constant step size, the optimal stopping time depends only logarithmically on $1/\gamma$, resulting in a comparatively limited stability window. In contrast, a diminishing schedule of the form $\gamma_t = v/(t+1)$ leads to an exponentially large optimal stopping horizon, $T^\star \approx \exp(\Theta(1/v))$, thereby enlarging the range of iterations over which the excess risk

Table 2: Comparison of spectral properties ($\lambda$) and topological imbalance ($\delta$) across graph topologies with $m$ nodes. $C_\lambda \triangleq 1/(1-\lambda)$ is the mixing time complexity.
| Topology | Spectral Gap $(1-\lambda)$ | Mixing Cost $C_\lambda$ | Imbalance $\delta$ |
| Fully Connected | $1$ | $O(1)$ | $O(1/m)$ |
| Di-Exp | $O(1)$ | $O(\log_2 m)$ | $O(1/m)$ |
| Bipartite | $O(1/m)$ | $O(m)$ | $O(1/m)$ |
| B-tree | $O(1/m)$ | $O(m)$ | $O(1/2^m)$ |
| Di-Ring | $O(1/m^2)$ | $O(m^2)$ | $O(1/m)$ |
| Sub-Ring | $O(1/m^2)$ | $O(m^2)$ | $O(1)^\dagger$ |
| Star Graph | $1/2$ | $2$ | $O(1)^\dagger$ |

$^\dagger$ The value of $\delta$ for the Star Graph depends on the weight matrix; here we assume a standard setup.

remains controlled. This suggests that diminishing step sizes provide greater tolerance to longer training trajectories. (ii) Geometric–topological coupling. The excess risk bound exhibits a multiplicative structure consistent with optimization under the PL inequality. In particular, the dominant dependence can be summarized as

$$\epsilon_{\mathrm{exc}}^\star \propto \kappa\left(1 + \frac{1}{\delta(1-\lambda)}\right) \times E_{\mathrm{noise}},$$

indicating that accuracy is jointly governed by the problem conditioning (via $\kappa = L/\alpha$) and the network topology (via $\delta$ and the spectral gap $1-\lambda$). Hence, weak connectivity or imbalance in the communication graph effectively amplifies the conditioning and consequently impacts both optimization and generalization.

4.4 Discussion on the Impact of Topology

Our theoretical analysis shows that the stability and optimization performance of SGP are governed by a unified scaling factor that depends on two key topological aspects of the communication graph: structural imbalance and network connectivity.

Structural imbalance is quantified by the parameter $\delta$, which reflects the influence of the stationary distribution of the communication process. When $\delta_i$ is small, node $i$ has limited outgoing connectivity relative to its incoming links. Although such nodes receive information from the network, their local gradient information contributes only weakly to the global aggregation. As a result, data and updates stored at these nodes are not effectively propagated across the network.
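The two quantities in Table 2 can be reproduced for small networks directly from a column-stochastic mixing matrix: $\lambda$ is the second-largest eigenvalue modulus, and $\delta$ is the smallest entry of the Perron (stationary) vector. A minimal NumPy sketch of our own, using lazy uniform weights as an assumed setup:

```python
import numpy as np

def mixing_stats(A):
    """Second-largest eigenvalue modulus and Perron-vector imbalance delta
    for a column-stochastic mixing matrix A (pi = A pi, sum(pi) = 1)."""
    vals, vecs = np.linalg.eig(A)
    order = np.argsort(-np.abs(vals))
    lam = float(np.abs(vals[order[1]]))
    pi = np.real(vecs[:, order[0]])
    pi = pi / pi.sum()                # stationary distribution
    return lam, float(pi.min())

m = 16
full = np.full((m, m), 1.0 / m)                               # fully connected
ring = 0.5 * np.eye(m) + 0.5 * np.roll(np.eye(m), 1, axis=0)  # directed ring

lam_f, delta_f = mixing_stats(full)   # lam ~ 0: one-step mixing
lam_r, delta_r = mixing_stats(ring)   # lam = cos(pi/m): gap of order 1/m^2
# both topologies are balanced here, so delta = 1/m in each case
```

Swapping in other column-stochastic matrices (e.g., a star with a heavy hub column) makes the imbalance $\delta$ visibly deviate from $1/m$, matching the table's footnote.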
From an optimization perspective, this weak propagation is equivalent to a reduction in the effective sample size. Our results indicate that highly imbalanced topologies lead to increased generalization error, highlighting the importance of maintaining structural balance so that all nodes exert influence on the learning dynamics.

Network connectivity is primarily captured by the spectral gap $1-\lambda$, whose inverse $1/(1-\lambda)$ characterizes the mixing time of the underlying Markov chain. This quantity determines how quickly local updates diffuse throughout the network. When the spectral gap is small, corresponding to $\lambda$ close to $1$, information propagation becomes slow. Although SGP achieves asymptotic convergence rates comparable to centralized methods, poor connectivity substantially increases the number of iterations required to reach this regime. As illustrated in Table 2, sparse topologies such as directed rings incur a quadratic mixing cost $O(m^2)$, which limits their scalability. In contrast, exponential graphs maintain a logarithmic mixing cost $O(\log m)$ and remain effective as the network size grows.

Taken together, an inherent trade-off exists in network design: improving connectivity and maintaining structural balance can be conflicting objectives. For example, star topologies mix rapidly but introduce imbalance due to the central node, whereas ring topologies are balanced but mix inefficiently. Since the topological factor enters the error bounds multiplicatively, an unfavorable topology
cannot be offset by increasing the data size. Consequently, in large-scale SGP deployments, network designs that jointly improve connectivity while preserving reasonable balance are more effective. Bounded-degree exponential graphs provide such a compromise.

5 Experiment

In this section, we present empirical studies to validate our theoretical analysis. We conduct classification experiments with classical logistic regression (Section 5.1) and LeNet (Section 5.2), respectively, and evaluate the impact of the key factors on the stability and optimization errors.

5.1 Logistic Regression on the a9a Dataset

To validate the convex case, we adopt the classical logistic regression problem (Hosmer Jr et al., 2013) to evaluate generalization and optimization performance during training. We consider the following regularized loss:

$$f(x) = \frac{1}{n}\sum_{i=1}^n \log\left(1 + \exp\left(-b_i a_i^\top x\right)\right) + \frac{\mu}{2}\|x\|^2,$$

where $a_i \in \mathbb{R}^d$ and $b_i \in \{-1, +1\}$ are data samples and $n$ is the dataset size. Experiments are conducted on the a9a dataset (Chang and Lin, 2011) with $d = 123$. We set the regularization parameter $\mu = 10^{-4}$. For decentralized training, we randomly split 32k samples across 32 clients. To study the effect of constant learning rates, we select $\gamma \in \{0.01, 0.001, 0.0001, 0.00001\}$. To study the impact of client size, we vary $m \in \{4, 8, 16, 32\}$. To examine the role of topology, we consider several directed graphs, including d-Ring, Exp, and Full.

Figure 3: Impacts of generalization and optimization errors on the convex objective. Panels (a)–(c) report $\epsilon_{\mathrm{gen}}$ and panels (d)–(f) report $\epsilon_{\mathrm{opt}}$ for different learning rates, client sizes, and topologies.

The experimental results in Figure 3 are consistent with our theoretical analysis in the convex setting.
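For concreteness, the decentralized training loop evaluated in this section can be sketched in a few lines. The code below is a minimal SGP illustration of our own on synthetic separable data; the a9a setup, client counts, and step sizes used in the actual experiments differ:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d, mu = 8, 50, 10, 1e-4          # nodes, samples per node, dim, reg.
w_true = rng.normal(size=d)
Adata = rng.normal(size=(m, n, d))
bdata = np.sign(Adata @ w_true)        # separable labels in {-1, +1}

# Column-stochastic mixing matrix for a directed ring with self-loops.
W = 0.5 * np.eye(m) + 0.5 * np.roll(np.eye(m), 1, axis=0)

def grad(w, a, b):
    s = 1.0 / (1.0 + np.exp(b * (a @ w)))   # sigmoid(-b a.w)
    return -b * s * a + mu * w

X = np.zeros((m, d))                   # Push-Sum numerators, one row per node
Y = np.ones(m)                         # Push-Sum weights
gamma = 0.05                           # constant step size
for t in range(400):
    Z = X / Y[:, None]                 # de-biased local models z_i^(t)
    for i in range(m):
        j = rng.integers(n)            # each node samples one local example
        X[i] -= gamma * grad(Z[i], Adata[i, j], bdata[i, j])
    X, Y = W @ X, W @ Y                # mix numerators and weights

w_out = (X / Y[:, None]).mean(axis=0)  # averaged de-biased model

def full_loss(w):
    margins = np.einsum('ijk,k->ij', Adata, w) * bdata
    return float(np.mean(np.log1p(np.exp(-margins))) + 0.5 * mu * w @ w)
```

Tracking `np.linalg.norm(w_out - w_out_prime)` for a run on a neighboring dataset reproduces the stability curves discussed next.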
First, Figures 3(a) and 3(d) show a clear learning-rate effect: increasing γ leads to a larger and faster-growing stability error ∥w̄^(t) − w̄′^(t)∥ while simultaneously accelerating optimization (the training loss decreases more quickly within the same iteration budget). This behavior matches our bounds: the stability recursion accumulates perturbations through step-size sums such as Σ_t γ_t and Σ_t γ_t², so a larger γ amplifies the sensitivity to a single-sample change, whereas the optimization bound improves as the step size grows (up to the admissible range). The plots thus provide an empirical illustration of the stability–optimization trade-off captured by our theory. Second, Figures 3(b) and 3(e) indicate that increasing the client size m reduces both the generalization (stability) error and the optimization error.

Figure 4: Impacts of generalization and optimization errors on the non-convex objective. Panels (a)–(c): ϵ_gen under different learning rates, client sizes, and topologies (Full, Exp, Star, B-tree, Ring, Sub-Ring, D-Ring); panels (d)–(f): ϵ_opt for the same factors.
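The stability quantity plotted in these figures can be reproduced in miniature: two runs on neighboring datasets (differing in a single sample) share the same sample order, and the parameter gap is tracked. The toy quadratic per-sample loss and all names below are assumptions of this sketch, not the paper's experimental setup.

```python
import numpy as np

# Toy illustration of uniform stability: two SGD trajectories on
# datasets S and S' that differ in one sample, driven by the same
# sample order. The per-sample loss 0.5*||w - xi||^2 is an assumed
# stand-in for the paper's objectives.

def max_stability_gap(lr, T=200, n=100, d=5, seed=0):
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((n, d))
    data_prime = data.copy()
    data_prime[0] += 1.0                  # S' replaces sample 0
    w, w_prime = np.zeros(d), np.zeros(d)
    gap = 0.0
    for t in range(T):
        i = t % n                         # shared (cyclic) sample order
        w -= lr * (w - data[i])           # grad of 0.5*||w - data[i]||^2
        w_prime -= lr * (w_prime - data_prime[i])
        gap = max(gap, float(np.linalg.norm(w - w_prime)))
    return gap

# larger step sizes amplify sensitivity to the single replaced sample
assert max_stability_gap(lr=0.1) > max_stability_gap(lr=0.01)
```

The final assertion mirrors the learning-rate trend in panels (a) and (d): the perturbation injected at the replaced sample is scaled by the step size before it enters the trajectory.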
This trend aligns with the role of the effective sample size mn in our convex excess-risk analysis: the probability that the two runs differ at an iteration scales as 1/n, while the impact of that mismatch on the network average is further diluted by the 1/m averaging across clients. As m grows, stochastic fluctuations are reduced and the influence of the single replaced sample becomes weaker, leading to uniformly smaller errors in the experiments.

Third, Figures 3(c) and 3(f) demonstrate a systematic dependence on the communication topology. Densely connected networks (e.g., Full) achieve smaller stability errors and faster loss decrease than sparse topologies (e.g., Ring). This is exactly the dependence predicted by our theory through the mixing parameters: faster mixing (larger spectral gap 1 − λ) and better balance (larger δ) shrink the consensus-induced error amplification, which appears in our bounds through factors of the form 1/(δ(1 − λ)) and through the decaying-memory term Σ_{s=0}^{t−1} λ^{t−s} γ_s. Under slower-mixing or more imbalanced topologies, the consensus error persists longer and is repeatedly injected into the gradient step, yielding more pronounced residual errors. In summary, the experimental trends with respect to the learning rate, m, and the topology are in qualitative agreement with the structure of our convex bounds: larger step sizes improve optimization but worsen stability, larger client populations improve both, and better topological balance reduces the error contributions induced by decentralized communication.

5.2 LeNet on the CIFAR-10 Dataset

We further evaluate the non-convex setting using LeNet-5 (LeCun et al., 1998) on the CIFAR-10 dataset (Krizhevsky et al., 2009). The 50,000 training samples are evenly divided among 100 clients, each holding 500 samples. We fix a batch size of 50 and adopt stage-wise learning rate decay.
To study the effect of the learning rate under stable training, we select γ ∈ {0.001, 0.002, 0.003, 0.004} with m = 32. To investigate the role of client size, we vary m ∈ {4, 8, 16, 32}. In the non-convex experiments, we consider a broader family of communication graphs than in the convex case, including Full, Exp, Star, B-tree, Ring, Sub-Ring, and D-Ring. We train for 300 iterations and report stability and loss curves.

As shown in Figure 4, the empirical behavior is consistent with our non-convex analysis. First, panels (a) and (d) illustrate a clear learning-rate effect. Larger step sizes lead to faster growth of the stability measure while also accelerating loss reduction in the early stage. This observation reflects the trajectory-level amplification described in Theorem 31: in the absence of convex contraction, perturbations may accumulate multiplicatively along the optimization path. Consequently, increasing the learning rate improves short-term optimization but amplifies sensitivity to data perturbations.

Second, panels (b) and (e) show that increasing the number of clients m improves both stability and optimization. With larger m, the stability curves grow more slowly and the loss decreases more rapidly. This is consistent with the role of the effective sample size mn in our bounds: a larger network reduces the influence of any single data point and mitigates stochastic variability across iterations.

Third, panels (c) and (f) demonstrate that topology has a pronounced impact in the non-convex regime. Well-connected graphs such as Full and Exp exhibit the smallest stability growth and the fastest convergence, whereas sparse or highly directional structures such as Ring and D-Ring show substantially larger stability accumulation and slower loss decay.
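The spectral quantities behind this ordering can be computed directly. The sketch below builds simple column-stochastic stand-ins for a directed ring and a fully connected graph (our own constructions, not the paper's exact matrices) and compares their second-largest eigenvalue moduli.

```python
import numpy as np

# Compute the mixing parameters discussed above for two assumed
# column-stochastic topologies: pi is the Perron vector (P pi = pi,
# sum(pi) = 1), delta = min_i pi_i, and lambda is the second-largest
# eigenvalue modulus, so 1 - lambda is the spectral gap.

def mixing_params(P):
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(vals))     # sort by modulus, descending
    pi = np.real(vecs[:, order[0]])
    pi = pi / pi.sum()                    # normalized Perron vector
    lam = float(np.abs(vals[order[1]]))
    return float(pi.min()), lam

m = 16
ring = np.zeros((m, m))                   # directed ring with self-loops
for i in range(m):
    ring[i, i] = 0.5                      # keep half the mass
    ring[(i + 1) % m, i] = 0.5            # push half to the next node
full = np.full((m, m), 1.0 / m)           # uniform all-to-all mixing

delta_ring, lam_ring = mixing_params(ring)
delta_full, lam_full = mixing_params(full)
# the ring mixes slowly (lambda near 1); full mixing is one-shot
assert lam_ring > 0.9 and lam_full < 1e-8
```

Both stand-ins here are balanced (π is uniform, so δ = 1/m); an imbalanced example such as a star would instead show a small δ with a large spectral gap, the other side of the trade-off discussed in Section 4.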
Intermediate graphs, including Star, B-tree, and Sub-Ring, fall between these two extremes. This empirical ordering aligns with the spectral properties summarized in Table 2. Graphs with a larger spectral gap 1 − λ mix information more rapidly, and graphs with a larger balance parameter δ distribute influence more evenly across nodes. Since our non-convex bounds depend on the factor 1/(δ(1 − λ)), poor connectivity or imbalance increases the magnitude of perturbation propagation throughout training. The experiments therefore support the theoretical conclusion that, in the non-convex setting, network topology does not merely affect constants but interacts with the optimization dynamics.

6 Conclusion

In this paper, we studied the stability and generalization behavior of the Stochastic Gradient Push (SGP) algorithm over directed communication networks. By leveraging the framework of uniform stability, we established explicit generalization and excess risk bounds for both convex objectives and non-convex objectives satisfying the PL condition. Our analysis highlights the role of column-stochastic communication in shaping learning dynamics and provides a precise characterization of how topological imbalance and the spectral gap jointly influence stability and optimization performance. These results offer a theoretical understanding of when Push-Sum correction is necessary and how directed network structure affects decentralized learning beyond asymptotic convergence.

Several directions remain open for future investigation. First, extending the current analysis to more general non-convex settings beyond the PL condition would further broaden the applicability of the theory. Second, it would be of interest to study adaptive or time-varying communication topologies, where the imbalance and spectral properties evolve over time.
Finally, incorporating additional practical considerations, such as communication compression, quantization, or partial participation, into the stability-based framework may provide deeper insight into the generalization behavior of decentralized learning systems in realistic environments and large-scale deployments.

References

A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. Advances in Neural Information Processing Systems, 24, 2011.

M. Assran, N. Loizou, N. Ballas, and M. Rabbat. Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, pages 344–353. PMLR, 2019.

M. S. Assran and M. G. Rabbat. Asynchronous gradient push. IEEE Transactions on Automatic Control, 66(1):168–183, 2020.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. Advances in Neural Information Processing Systems, 33:4381–4391, 2020.

F. Bénézit, V. Blondel, P. Thiran, J. Tsitsiklis, and M. Vetterli. Weighted gossip: Distributed averaging using non-doubly stochastic matrices. In 2010 IEEE International Symposium on Information Theory, pages 1753–1757. IEEE, 2010.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.

O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.

C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.

Z. Charles and D. Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima.
In International Conference on Machine Learning, pages 745–754. PMLR, 2018.

Y. Chen, Y. Shi, M. Dong, X. Yang, D. Li, Y. Wang, R. Dick, Q. Lv, Y. Zhao, F. Yang, et al. Over-parameterized model optimization with Polyak–Łojasiewicz condition. 2023.

E. Cyffers and A. Bellet. Privacy amplification by decentralization. In International Conference on Artificial Intelligence and Statistics, pages 5334–5353. PMLR, 2022.

O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1), 2012.

X. Deng, T. Sun, S. Li, and D. Li. Stability-based generalization analysis of the asynchronous decentralized SGD. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6):7340–7348, 2023.

X. Deng, T. Sun, S. Li, D. Li, and X. Lu. Stability and generalization of asynchronous SGD: Sharper bounds beyond Lipschitz and smoothness. Advances in Neural Information Processing Systems, 37:7675–7713, 2024.

L. Devroye and T. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, 1979.

A. Elisseeff, T. Evgeniou, M. Pontil, and L. P. Kaelbling. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6(1), 2005.

S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.

R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 2012.

D. W. Hosmer Jr, S. Lemeshow, and R. X. Sturdivant. Applied Logistic Regression. John Wiley & Sons, 2013.

H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. CoRR, abs/1608.04636, 2016.
URL https://arxiv.org/abs/1608.04636.

D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings, pages 482–491. IEEE, 2003.

A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich. A unified theory of decentralized SGD with changing topology and local updates. In International Conference on Machine Learning, pages 5381–5393. PMLR, 2020.

A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.

I. Kuzborskij and C. Lampert. Data-dependent stability of stochastic gradient descent. In International Conference on Machine Learning, pages 2815–2824. PMLR, 2018.

B. Le Bars, A. Bellet, M. Tommasi, K. Scaman, and G. Neglia. Improved stability and generalization guarantees of the decentralized SGD algorithm. In International Conference on Machine Learning, pages 18641–18663. PMLR, 2023.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Y. Lei and Y. Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pages 5809–5819. PMLR, 2020.

Q. Li, M. Zhang, N. Yin, Q. Yin, and L. Shen. Asymmetrically decentralized federated learning. arXiv preprint arXiv:2310.05093, 2023.

X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. Advances in Neural Information Processing Systems, 30, 2017.

Y. Liu, Y. Shi, Q. Li, B. Wu, X. Wang, and L. Shen. Decentralized directed collaboration for personalized federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23168–23178, 2024.

D. A. McAllester. PAC-Bayesian model averaging.
In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164–170, 1999.

C. D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, 2023.

R. Montenegro and P. Tetali. Mathematical Aspects of Mixing Times in Markov Chains. Foundations and Trends in Theoretical Computer Science, 2006.

A. Nedić and A. Olshevsky. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.

A. Nedić and A. Olshevsky. Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Transactions on Automatic Control, 61(12):3936–3947, 2016.

A. Nedić, A. Olshevsky, and C. A. Uribe. Distributed Gaussian learning over time-varying directed graphs. In 2016 50th Asilomar Conference on Signals, Systems and Computers, pages 1710–1714. IEEE, 2016.

G. Neglia, G. Calbi, D. Towsley, and G. Vardoyan. The role of network topology for distributed machine learning. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pages 2350–2358. IEEE, 2019.

R. Olfati-Saber and R. M. Murray. Consensus problems in networks of agents with switching topology and time-delays. IEEE Transactions on Automatic Control, 49(9):1520–1533, 2004.

S. Pu, W. Shi, J. Xu, and A. Nedić. Push–pull gradient methods for distributed optimization in networks. IEEE Transactions on Automatic Control, 66(1):1–16, 2020.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.

W. Shi, Q. Ling, G. Wu, and W. Yin.
EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

C. Song, A. Ramezani-Kebrya, T. Pethick, A. Eftekhari, and V. Cevher. Subquadratic overparameterization for shallow neural networks. Advances in Neural Information Processing Systems, 34:11247–11259, 2021.

T. Sun, D. Li, and B. Wang. Stability and generalization of decentralized stochastic gradient descent. Proceedings of the AAAI Conference on Artificial Intelligence, 35(11):9756–9764, 2021.

Y. Sun, L. Shen, and D. Tao. Which mode is better for federated learning? Centralized or decentralized. 2023.

H. Taheri, A. Mokhtari, H. Hassani, and R. Pedarsani. Quantized decentralized stochastic learning over directed graphs. In International Conference on Machine Learning, pages 9324–9333. PMLR, 2020.

K. I. Tsianos, S. Lawlor, and M. G. Rabbat. Push-sum distributed dual averaging for convex optimization. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 5453–5458. IEEE, 2012.

J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803–812, 1986.

C. Xi and U. A. Khan. DEXTRA: A fast algorithm for optimization over directed graphs. IEEE Transactions on Automatic Control, 62(10):4980–4993, 2017.

J. Xiao, Y. Fan, R. Sun, J. Wang, and Z.-Q. Luo. Stability analysis and generalization bounds of adversarial training. Advances in Neural Information Processing Systems, 35:15446–15459, 2022.

R. You and S. Pu. B-ary tree push-pull method is provably efficient for decentralized learning on heterogeneous data. arXiv e-prints, arXiv–2404, 2024.

J. Zeng and W. Yin. ExtraPush for convex smooth decentralized optimization over directed networks. Journal of Computational Mathematics, 35(4):383–396, 2017.

T. Zhu, F. He, L. Zhang, Z. Niu, M. Song, and D. Tao. Topology-aware generalization of decentralized SGD.
In International Conference on Machine Learning, pages 27479–27503. PMLR, 2022.

APPENDIX

Contents

A Notations and Abbreviations
B Technical Propositions and Lemmas
  B.1 Proof of Proposition 7
  B.2 Proof of Lemma 9
C Proofs of Theorems and Corollaries
  C.1 Proof of Theorem 22
  C.2 Proof of Corollary 24
  C.3 Proof of Theorem 26
  C.4 Proof of Corollary 27
  C.5 Proof of Corollary 29
  C.6 Proof of Theorem 31
  C.7 Proof of Corollary 33
  C.8 Proof of Theorem 37
  C.9 Proof of Corollary 38
  C.10 Proof of Corollary 40

Appendix A.
Notations and Abbreviations

Table 3: Notations and Abbreviations

X_i — input space of node i
Y_i — output space of node i
S_i — X_i × Y_i
S — training dataset
S′ — dataset differing from S by one sample
ξ_i^(t) — sample used by node i at round t
A — learning algorithm
w_i^(t) — model parameters of node i at round t
W^(t) — parameter matrix at round t
w̄^(t) — network average (1/m) Σ_{i=1}^m w_i^(t)
w̄_avg^(T) — weighted averaged iterate ( Σ_{t=0}^{T−1} γ_t w̄^(t) ) / ( Σ_{t=0}^{T−1} γ_t )
z_i^(t) — debiased parameter of node i at round t
Z^(t) — debiased parameter matrix at round t
u^(t) — push-sum weight vector at round t
π — stationary distribution vector of the communication matrix P
P^(t) — communication matrix at round t
p_{i,j}^(t) — entry of the communication matrix P^(t)
H — residual matrix P − π1^⊤
λ — spectral radius of H (λ < 1)
C_H — constant satisfying ∥H^t∥_∞ ≤ C_H λ^t
m — number of nodes
n — number of samples
T — total number of iterations
γ_t — learning rate (step size) at round t
v — learning-rate constant in diminishing step sizes
L — smoothness (gradient Lipschitz) constant
α — PL-condition parameter
κ — condition number L/α
r — radius of the bounded parameter space
G — uniform bound on stochastic gradients
C_{w_0} — initialization-dependent constant
Δ_t — distance ∥w̄^(t) − w̄′^(t)∥
ρ — stability growth factor 1 + Lγ − Lγ/(mn)
μ — logarithm of ρ, i.e., μ = ln ρ
f(w; ξ) — loss function
F_S(w) — empirical risk over dataset S
F(w) — population risk
w*_S — minimizer of the empirical risk F_S
w* — minimizer of the population risk F
ϵ_stab — uniform stability bound
ϵ_opt — optimization error bound
ϵ_ave-stab — averaged stability bound
ϵ_exc — excess generalization error bound
E[·] — expectation operator
∇ — gradient operator
ERM — empirical risk minimization
DL — decentralized learning
SGP — Stochastic Gradient Push
SGD — stochastic gradient descent

Appendix B.
Technical Propositions and Lemmas

B.1 Proof of Proposition 7

Proof. By the SGP update rule,

W^{(t+1)} = P^{(t)} ( W^{(t)} − γ_t ∇f(Z^{(t)}; S^{(t)}) ).   (38)

Consider the network average w̄^{(t)} = (1/m) 1^⊤ W^{(t)}. Since 1^⊤ P^{(t)} = 1^⊤, then

w̄^{(t+1)} = (1/m) 1^⊤ W^{(t+1)}
          = (1/m) 1^⊤ P^{(t)} ( W^{(t)} − γ_t ∇f(Z^{(t)}; S^{(t)}) )
          = (1/m) 1^⊤ ( W^{(t)} − γ_t ∇f(Z^{(t)}; S^{(t)}) )      (since 1^⊤ P^{(t)} = 1^⊤)
          = w̄^{(t)} − (γ_t/m) 1^⊤ ∇f(Z^{(t)}; S^{(t)})
          = w̄^{(t)} − (γ_t/m) Σ_{i=1}^m ∇f(z_i^{(t)}; ξ_i^{(t)}).   (39)

This proves Proposition 7.

B.2 Proof of Lemma 9

Proof. Since P is a nonnegative, column-stochastic, and primitive matrix, the Perron–Frobenius theorem guarantees the existence of a unique positive right eigenvector π ∈ R^m_{++} such that Pπ = π and ∥π∥_1 = 1. The corresponding left eigenvector is 1^⊤, satisfying 1^⊤ P = 1^⊤. Define

H := P − π1^⊤.   (40)

Then Hπ = 0 and 1^⊤ H = 0^⊤, and hence P^t = π1^⊤ + H^t for any t ≥ 1. Let λ be the spectral radius of H, which satisfies λ < 1 due to primitivity. Thus there exists a constant C_H > 0 such that, for the induced ∞-norm,

∥H^t∥_∞ ≤ C_H λ^t,  for all t ≥ 0.   (41)

In particular, (P^t)_{ij} = π_i + (H^t)_{ij} and |(H^t)_{ij}| ≤ ∥H^t∥_∞. We examine the evolution of the push-sum weight vector u^{(t)} = P u^{(t−1)}, u^{(0)} = 1. Therefore,

u^{(t)} = P^t 1 = (π1^⊤ + H^t) 1 = mπ + H^t 1.   (42)

For node i, u_i^{(t)} = mπ_i + (H^t 1)_i. Let δ = min_i π_i > 0. Using (41) and ∥1∥_∞ = 1,

|(H^t 1)_i| ≤ ∥H^t∥_∞ ∥1∥_∞ ≤ C_H λ^t.

Hence there exists T_0 such that for all t > T_0, |(H^t 1)_i| ≤ (1/2) mδ, implying u_i^{(t)} ≥ (1/2) mδ. For the finite interval 0 ≤ t ≤ T_0, primitivity of P and positivity of u^{(0)} ensure u_i^{(t)} > 0. Thus, by possibly enlarging constants, we may take

u_i^{(t)} ≥ (1/2) mδ,  for all t ≥ 0 and i ∈ [m].   (43)
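The decomposition P^t = π1^⊤ + H^t and the positivity of the push-sum weights can be checked numerically. The 3-node column-stochastic matrix below is an assumed example, not one taken from the paper.

```python
import numpy as np

# Numerical check of the identities used in the proof of Lemma 9 on a
# small assumed column-stochastic matrix P: H = P - pi 1^T satisfies
# P^t = pi 1^T + H^t, and u^(t) = P^t 1 approaches m*pi while staying
# bounded away from zero.

P = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.2],
              [0.2, 0.2, 0.5]])
m = P.shape[0]
assert np.allclose(P.sum(axis=0), 1.0)        # column-stochastic

vals, vecs = np.linalg.eig(P)
k = int(np.argmax(np.real(vals)))             # Perron eigenvalue 1
pi = np.real(vecs[:, k])
pi = pi / pi.sum()                            # P pi = pi, ||pi||_1 = 1

H = P - np.outer(pi, np.ones(m))
# P^t - pi 1^T equals H^t (checked here at t = 5)
lhs = np.linalg.matrix_power(P, 5) - np.outer(pi, np.ones(m))
assert np.allclose(lhs, np.linalg.matrix_power(H, 5))

# push-sum weights u^(t) = P^t 1 converge to m*pi and stay positive
u = np.linalg.matrix_power(P, 40) @ np.ones(m)
assert np.allclose(u, m * pi)
assert u.min() >= 0.5 * m * pi.min()
```

The final assertion is the numerical counterpart of the lower bound u_i^{(t)} ≥ mδ/2 established above.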
Next, the numerator iterates satisfy

W^{(t)} = P ( W^{(t−1)} − γ_{t−1} ∇f(Z^{(t−1)}; S^{(t−1)}) ).   (44)

Unrolling the recursion gives, for node i,

w_i^{(t)} = Σ_{j=1}^m (P^t)_{ij} w_j^{(0)} − Σ_{s=0}^{t−1} Σ_{j=1}^m (P^{t−s})_{ij} γ_s ∇f(z_j^{(s)}; ξ_j^{(s)}).   (45)

The global average evolves as

w̄^{(t)} = (1/m) Σ_{j=1}^m w_j^{(0)} − (1/m) Σ_{s=0}^{t−1} Σ_{j=1}^m γ_s ∇f(z_j^{(s)}; ξ_j^{(s)}).   (46)

We bound ∥z_i^{(t)} − w̄^{(t)}∥. Since z_i^{(t)} = w_i^{(t)} / u_i^{(t)},

z_i^{(t)} − w̄^{(t)} = ( w_i^{(t)} − u_i^{(t)} w̄^{(t)} ) / u_i^{(t)}.

Substituting (P^n)_{ij} = π_i + (H^n)_{ij} into (45) yields

w_i^{(t)} = π_i ( Σ_{j=1}^m w_j^{(0)} − Σ_{s=0}^{t−1} Σ_{j=1}^m γ_s ∇f(z_j^{(s)}; ξ_j^{(s)}) )
          + Σ_{j=1}^m (H^t)_{ij} w_j^{(0)} − Σ_{s=0}^{t−1} Σ_{j=1}^m (H^{t−s})_{ij} γ_s ∇f(z_j^{(s)}; ξ_j^{(s)}).   (47)

By (46), the term in parentheses equals m w̄^{(t)}. Moreover,

u_i^{(t)} w̄^{(t)} = ( mπ_i + (H^t 1)_i ) w̄^{(t)} = mπ_i w̄^{(t)} + (H^t 1)_i w̄^{(t)}.

Therefore the dominant term cancels and we obtain

w_i^{(t)} − u_i^{(t)} w̄^{(t)} = Σ_{j=1}^m (H^t)_{ij} w_j^{(0)}  (Term I)
  − Σ_{s=0}^{t−1} Σ_{j=1}^m (H^{t−s})_{ij} γ_s ∇f(z_j^{(s)}; ξ_j^{(s)})  (Term II)
  − (H^t 1)_i w̄^{(t)}  (Term III).   (48)

For Term I, using (41) and the definition of C_{w_0},

∥Term I∥ ≤ Σ_{j=1}^m |(H^t)_{ij}| ∥w_j^{(0)}∥ ≤ ∥H^t∥_∞ Σ_{j=1}^m ∥w_j^{(0)}∥ ≤ C_H λ^t Σ_{j=1}^m ∥w_j^{(0)}∥ = m C_H λ^t C_{w_0}.   (49)

For Term II, by the triangle inequality and the uniform bound ∥∇f(z_j^{(s)}; ξ_j^{(s)})∥ ≤ G,

∥Term II∥ ≤ Σ_{s=0}^{t−1} Σ_{j=1}^m |(H^{t−s})_{ij}| γ_s ∥∇f(z_j^{(s)}; ξ_j^{(s)})∥
          ≤ G Σ_{s=0}^{t−1} γ_s Σ_{j=1}^m |(H^{t−s})_{ij}|
          ≤ G Σ_{s=0}^{t−1} γ_s ∥H^{t−s}∥_∞
          ≤ G C_H Σ_{s=0}^{t−1} λ^{t−s} γ_s.   (50)
For Term III, using |(H^t 1)_i| ≤ ∥H^t∥_∞ ∥1∥_∞ ≤ C_H λ^t and (46) together with the uniform gradient bound,

∥Term III∥ = |(H^t 1)_i| · ∥w̄^{(t)}∥
           ≤ C_H λ^t ∥ (1/m) Σ_{j=1}^m w_j^{(0)} − (1/m) Σ_{s=0}^{t−1} Σ_{j=1}^m γ_s ∇f(z_j^{(s)}; ξ_j^{(s)}) ∥
           ≤ C_H λ^t ( (1/m) Σ_{j=1}^m ∥w_j^{(0)}∥ + (1/m) Σ_{s=0}^{t−1} Σ_{j=1}^m γ_s ∥∇f(z_j^{(s)}; ξ_j^{(s)})∥ )
           ≤ C_H λ^t ( C_{w_0} + G Σ_{s=0}^{t−1} γ_s ).   (51)

Combining (49)–(51) and absorbing numerical factors into a constant C′ > 0, we obtain

∥w_i^{(t)} − u_i^{(t)} w̄^{(t)}∥ ≤ m C′ ( λ^t C_{w_0} + G Σ_{s=0}^{t−1} λ^{t−s} γ_s ),   (52)

where we used λ^t Σ_{s=0}^{t−1} γ_s ≤ Σ_{s=0}^{t−1} λ^{t−s} γ_s (since λ^t ≤ λ^{t−s} for 0 ≤ s ≤ t − 1) and absorbed the remaining factors into C′. Finally, using (43),

∥z_i^{(t)} − w̄^{(t)}∥ = (1/u_i^{(t)}) ∥w_i^{(t)} − u_i^{(t)} w̄^{(t)}∥
                      ≤ (2/(mδ)) ∥w_i^{(t)} − u_i^{(t)} w̄^{(t)}∥
                      ≤ (C/δ) ( λ^t C_{w_0} + G Σ_{s=0}^{t−1} λ^{t−s} γ_s ),   (53)

where C > 0 absorbs C′ and numerical constants. This completes the proof.

Appendix C. Proofs of Theorems and Corollaries

C.1 Proof of Theorem 22

Proof. Suppose the two sample sets S and S′ differ in only one sample among the n samples. Assume that at each iteration t, a global index is sampled uniformly from {1, …, n} and broadcast to all nodes. Then with probability 1 − 1/n, the sampled indices at all nodes are identical for both runs; with probability 1/n, the sampled point differs. Let Δ_t := ∥w̄^{(t)} − w̄′^{(t)}∥ and assume w̄^{(0)} = w̄′^{(0)}, i.e., Δ_0 = 0.

Case 1: Identical samples.
Using Proposition 7 and adding and subtracting ∇f(w̄^{(t)}; ξ_i^{(t)}) and ∇f(w̄′^{(t)}; ξ_i^{(t)}), we obtain

Δ_{t+1} = ∥ w̄^{(t)} − w̄′^{(t)} − (γ_t/m) Σ_{i=1}^m ( ∇f(z_i^{(t)}; ξ_i^{(t)}) − ∇f(z′_i^{(t)}; ξ_i^{(t)}) ) ∥
  ≤ ∥ w̄^{(t)} − w̄′^{(t)} − (γ_t/m) Σ_{i=1}^m ( ∇f(w̄^{(t)}; ξ_i^{(t)}) − ∇f(w̄′^{(t)}; ξ_i^{(t)}) ) ∥
    + (γ_t/m) Σ_{i=1}^m ∥ ∇f(w̄^{(t)}; ξ_i^{(t)}) − ∇f(z_i^{(t)}; ξ_i^{(t)}) ∥
    + (γ_t/m) Σ_{i=1}^m ∥ ∇f(w̄′^{(t)}; ξ_i^{(t)}) − ∇f(z′_i^{(t)}; ξ_i^{(t)}) ∥
  ≤ Δ_t + (Lγ_t/m) Σ_{i=1}^m ∥w̄^{(t)} − z_i^{(t)}∥ + (Lγ_t/m) Σ_{i=1}^m ∥w̄′^{(t)} − z′_i^{(t)}∥.   (54)

Applying Lemma 9,

Δ_{t+1} ≤ Δ_t + (2C L γ_t/δ) ( λ^t C_{w_0} + G Σ_{s=0}^{t−1} λ^{t−s} γ_s ).   (55)

Case 2: Different sample. Without loss of generality, assume node 1 uses the replaced sample. Proceeding as above and using ∥∇f(·)∥ ≤ G,

Δ_{t+1} ≤ Δ_t + (Lγ_t/m) Σ_{i=2}^m ∥w̄^{(t)} − z_i^{(t)}∥ + (Lγ_t/m) Σ_{i=2}^m ∥w̄′^{(t)} − z′_i^{(t)}∥ + 2Gγ_t/m.   (56)

Since (1/m) Σ_{i=2}^m ∥·∥ ≤ (1/m) Σ_{i=1}^m ∥·∥, applying the same lemma gives

Δ_{t+1} ≤ Δ_t + (2C L γ_t/δ) ( λ^t C_{w_0} + G Σ_{s=0}^{t−1} λ^{t−s} γ_s ) + 2Gγ_t/m.   (57)

Expectation recursion. Taking expectations over the two cases,

E[Δ_{t+1}] ≤ E[Δ_t] + (2C L γ_t/δ) ( λ^t C_{w_0} + G Σ_{s=0}^{t−1} λ^{t−s} γ_s ) + 2Gγ_t/(mn).   (58)

Summing from t = 0 to T − 1 yields

E[Δ_T] ≤ (2C L C_{w_0}/δ) Σ_{t=0}^{T−1} γ_t λ^t + (2C L G/δ) Σ_{t=0}^{T−1} γ_t Σ_{s=0}^{t−1} λ^{t−s} γ_s + (2G/(mn)) Σ_{t=0}^{T−1} γ_t.   (59)

Assume the step sizes are nonincreasing. Then γ_t ≤ γ_s for s ≤ t − 1, and hence

Σ_{t=0}^{T−1} γ_t Σ_{s=0}^{t−1} λ^{t−s} γ_s ≤ Σ_{t=0}^{T−1} Σ_{s=0}^{t−1} λ^{t−s} γ_s² ≤ Σ_{s=0}^{T−1} γ_s² Σ_{k=1}^{∞} λ^k ≤ (1/(1−λ)) Σ_{t=0}^{T−1} γ_t².   (60)

Therefore,

E[Δ_T] ≤ (2C L C_{w_0}/δ) Σ_{t=0}^{T−1} γ_t λ^t + (2C L G/(δ(1−λ))) Σ_{t=0}^{T−1} γ_t² + (2G/(mn)) Σ_{t=0}^{T−1} γ_t.   (61)
Finally, by G-Lipschitzness,

ϵ_stab = E|f(w̄_T; z) − f(w̄′_T; z)| ≤ G E[Δ_T]
  ≤ (2C G L C_{w_0}/δ) Σ_{t=0}^{T−1} γ_t λ^t + (2C G² L/(δ(1−λ))) Σ_{t=0}^{T−1} γ_t² + (2G²/(mn)) Σ_{t=0}^{T−1} γ_t.   (62)

This completes the proof of Theorem 22.

C.2 Proof of Corollary 24

Proof. Case 1: Constant learning rate. For constant learning rates γ_t = γ ≤ 2/L, by Theorem 22 we have

ϵ_stab ≤ (2C G L C_{w_0}/δ) Σ_{t=0}^{T−1} γ λ^t + (2C G² L/(δ(1−λ))) Σ_{t=0}^{T−1} γ² + (2G²/(mn)) Σ_{t=0}^{T−1} γ
  ≤(a) (2C G L γ C_{w_0})/(δ(1−λ)) + ( (2C G² L γ²)/(δ(1−λ)) + (2G² γ)/(mn) ) T.   (63)

Case 2: Decreasing learning rate. For diminishing learning rates γ_t = v/(t+1) with v ≤ 2/L, Theorem 22 yields

ϵ_stab ≤ (2C G L C_{w_0}/δ) Σ_{t=0}^{T−1} (v λ^t)/(t+1) + (2C G² L/(δ(1−λ))) Σ_{t=0}^{T−1} (v/(t+1))² + (2G²/(mn)) Σ_{t=0}^{T−1} v/(t+1)
  ≤(a)(b)(c) (2v C G L C_{w_0})/(δ(1−λ)) + (4C G² L v²)/(δ(1−λ)) + (2G² v/(mn))(1 + ln T)
  = (2G² v/(mn)) ln T + (2v C G L C_{w_0} + 4C G² L v²)/(δ(1−λ)) + 2G² v/(mn),   (64)

where we have used:

Σ_{t=0}^{T−1} λ^t ≤ 1/(1−λ),  Σ_{t=0}^{T−1} λ^t/(t+1) ≤ Σ_{t=0}^{T−1} λ^t ≤ 1/(1−λ),   (a)
Σ_{t=0}^{T−1} 1/(t+1) = 1 + Σ_{t=1}^{T−1} 1/(t+1) ≤ 1 + ∫_1^T (1/x) dx = 1 + ln T,   (b)
Σ_{t=0}^{T−1} 1/(t+1)² ≤ 1 + ∫_1^T (1/x²) dx = 2 − 1/T ≤ 2.   (c)

This completes the proof of Corollary 24.

C.3 Proof of Theorem 26

Proof. Using the convexity of F_S, we have

F_S(w̄^{(t)}) − F_S(w*_S) ≤ ⟨ ∇F_S(w̄^{(t)}), w̄^{(t)} − w*_S ⟩.   (d)
We derive

E∥w̄^{(t+1)} − w*_S∥² = E∥ w̄^{(t)} − (γ_t/m) Σ_{i=1}^m ∇f(z_i^{(t)}; ξ_i^{(t)}) − w*_S ∥²
 ≤ E∥w̄^{(t)} − w*_S∥² + (2γ_t/m) E⟨ −Σ_{i=1}^m ∇f(z_i^{(t)}; ξ_i^{(t)}), w̄^{(t)} − w*_S ⟩ + (γ_t²/m²) E∥ Σ_{i=1}^m ∇f(z_i^{(t)}; ξ_i^{(t)}) ∥²
 ≤(Ass. 17) E∥w̄^{(t)} − w*_S∥² + (2γ_t/m) E⟨ −Σ_{i=1}^m ∇f(z_i^{(t)}; ξ_i^{(t)}), w̄^{(t)} − w*_S ⟩ + γ_t² G²
 = E∥w̄^{(t)} − w*_S∥² + Σ_{i=1}^m (2γ_t/m) E⟨ −∇f(w̄^{(t)}; ξ_i^{(t)}), w̄^{(t)} − w*_S ⟩
   + Σ_{i=1}^m (2γ_t/m) E⟨ ∇f(w̄^{(t)}; ξ_i^{(t)}) − ∇f(z_i^{(t)}; ξ_i^{(t)}), w̄^{(t)} − w*_S ⟩ + γ_t² G²
 ≤(d) E∥w̄^{(t)} − w*_S∥² − 2γ_t E[F_S(w̄^{(t)}) − F_S(w*_S)]
   + Σ_{i=1}^m (2γ_t/m) E[ ∥∇f(w̄^{(t)}; ξ_i^{(t)}) − ∇f(z_i^{(t)}; ξ_i^{(t)})∥ ∥w̄^{(t)} − w*_S∥ ] + γ_t² G²
 ≤(Ass. 19) E∥w̄^{(t)} − w*_S∥² − 2γ_t E[F_S(w̄^{(t)}) − F_S(w*_S)] + Σ_{i=1}^m (4rγ_t/m) E∥∇f(w̄^{(t)}; ξ_i^{(t)}) − ∇f(z_i^{(t)}; ξ_i^{(t)})∥ + γ_t² G²
 ≤(Ass. 18) E∥w̄^{(t)} − w*_S∥² − 2γ_t E[F_S(w̄^{(t)}) − F_S(w*_S)] + Σ_{i=1}^m (4rLγ_t/m) E∥w̄^{(t)} − z_i^{(t)}∥ + γ_t² G²
 ≤(Lem. 9) E∥w̄^{(t)} − w*_S∥² − 2γ_t E[F_S(w̄^{(t)}) − F_S(w*_S)] + Σ_{i=1}^m (4rLγ_t/m) ( (C/δ) λ^t C_{w_0} + (C G/δ) Σ_{s=0}^{t−1} λ^{t−s} γ_s ) + γ_t² G²
 ≤ E∥w̄^{(t)} − w*_S∥² − 2γ_t E[F_S(w̄^{(t)}) − F_S(w*_S)] + (4rC L γ_t λ^t C_{w_0})/δ + (4rC L G γ_t/δ) Σ_{s=0}^{t−1} λ^{t−s} γ_s + γ_t² G²
 ≤(a) E∥w̄^{(t)} − w*_S∥² − 2γ_t E[F_S(w̄^{(t)}) − F_S(w*_S)] + (4rC L γ_t λ^t C_{w_0})/δ + ( G² + (4rC L G)/(δ(1−λ)) ) γ_t².   (65)

Here, in (a) we assume {γ_t} is nonincreasing and use

γ_t Σ_{s=0}^{t−1} λ^{t−s} γ_s ≤ γ_t² Σ_{s=0}^{t−1} λ^{t−s} ≤ γ_t²/(1−λ).

Rearranging the recursion and summing over t from 0 to T − 1 yields

Σ_{t=0}^{T−1} γ_t E[F_S(w̄^{(t)}) − F_S(w*_S)] ≤ (1/2) ∥w̄^{(0)} − w*_S∥² + (2rC L C_{w_0}/δ) Σ_{t=0}^{T−1} γ_t λ^t + ( (2rC L G)/(δ(1−λ)) + G²/2 ) Σ_{t=0}^{T−1} γ_t².   (66)
Recall the definition of the averaged model w̄_avg^{(T)} = ( Σ_{t=0}^{T−1} γ_t w̄^{(t)} ) / ( Σ_{t=0}^{T−1} γ_t ). By convexity of F_S, we have

ϵ_opt = E[F_S(w̄_avg^{(T)}) − F_S(w*_S)] ≤ ( Σ_{t=0}^{T−1} γ_t E[F_S(w̄^{(t)}) − F_S(w*_S)] ) / ( Σ_{t=0}^{T−1} γ_t )
 ≤ ∥w̄^{(0)} − w*_S∥² / (2 Σ_{t=0}^{T−1} γ_t) + (2rC L C_{w_0}/(δ Σ_{t=0}^{T−1} γ_t)) Σ_{t=0}^{T−1} γ_t λ^t + ( (2rC L G)/(δ(1−λ)) + G²/2 ) ( Σ_{t=0}^{T−1} γ_t² ) / ( Σ_{t=0}^{T−1} γ_t ).   (67)

This completes the proof of Theorem 26.

C.4 Proof of Corollary 27

Proof. Case 1: Constant learning rate. For constant learning rates γ_t = γ, substituting into Theorem 26 yields

ϵ_opt ≤ ∥w̄^{(0)} − w*_S∥²/(2Tγ) + (2rC L C_{w_0}/(δT)) Σ_{t=0}^{T−1} λ^t + ( (2rC L G)/(δ(1−λ)) + G²/2 ) γ
 ≤(a) ∥w̄^{(0)} − w*_S∥²/(2Tγ) + (2rC L C_{w_0})/(δ(1−λ)T) + ( (2rC L G)/(δ(1−λ)) + G²/2 ) γ
 = ( ∥w̄^{(0)} − w*_S∥²/(2γ) + (2rC L C_{w_0})/(δ(1−λ)) ) (1/T) + (2rC L G γ)/(δ(1−λ)) + G²γ/2.   (68)

Case 2: Decreasing learning rate. For decreasing step sizes γ_t = v/(t+1), substituting into Theorem 26 yields

ϵ_opt ≤ ∥w̄^{(0)} − w*_S∥² / (2 Σ_{t=0}^{T−1} v/(t+1)) + (2rC L C_{w_0}/δ) ( Σ_{t=0}^{T−1} (v λ^t)/(t+1) ) / ( Σ_{t=0}^{T−1} v/(t+1) ) + ( (2rC L G)/(δ(1−λ)) + G²/2 ) ( Σ_{t=0}^{T−1} (v/(t+1))² ) / ( Σ_{t=0}^{T−1} v/(t+1) ).

Using the lower bound

Σ_{t=0}^{T−1} 1/(t+1) ≥ Σ_{t=1}^{T−1} 1/(t+1) ≥ (1/2) Σ_{t=1}^{T−1} 1/t ≥ (1/2) Σ_{t=1}^{T−1} ∫_t^{t+1} (1/x) dx ≥ (1/2) ln T,   (e)

we have Σ_{t=0}^{T−1} v/(t+1) ≥ (v/2) ln T. Moreover, Σ_{t=0}^{T−1} 1/(t+1)² ≤ 2 and Σ_{t=0}^{T−1} λ^t/(t+1) ≤ Σ_{t=0}^{T−1} λ^t.
Therefore,
\[
\begin{aligned}
\epsilon_{\mathrm{opt}} &\le \frac{\|w^{(0)}-w^*_S\|^2}{v\ln T} + \frac{2rCLC_{w_0}}{\delta}\cdot\frac{\sum_{t=0}^{T-1}\frac{\lambda^t}{t+1}}{\sum_{t=0}^{T-1}\frac{1}{t+1}} + \Big(\frac{2rCLG}{\delta(1-\lambda)}+\frac{G^2}{2}\Big)\cdot\frac{v\sum_{t=0}^{T-1}\frac{1}{(t+1)^2}}{\sum_{t=0}^{T-1}\frac{1}{t+1}} \\
&\le \frac{\|w^{(0)}-w^*_S\|^2}{v\ln T} + \frac{4rCLC_{w_0}}{\delta\ln T}\sum_{t=0}^{T-1}\lambda^t + \Big(\frac{2rCLG}{\delta(1-\lambda)}+\frac{G^2}{2}\Big)\frac{4v}{\ln T} \\
&\overset{(a)}{\le} \frac{\|w^{(0)}-w^*_S\|^2}{v\ln T} + \frac{4rCLC_{w_0}}{\delta(1-\lambda)\ln T} + \Big(\frac{2rCLG}{\delta(1-\lambda)}+\frac{G^2}{2}\Big)\frac{4v}{\ln T} \\
&= \Big(\frac{\|w^{(0)}-w^*_S\|^2}{v} + \frac{4rCLC_{w_0}}{\delta(1-\lambda)} + \frac{8vrCLG}{\delta(1-\lambda)} + 2vG^2\Big)\frac{1}{\ln T}. 
\end{aligned} \tag{69}
\]
This completes the proof of Corollary 27.

C.5 Proof of Corollary 29

Proof  Case 1: Constant learning rate. Suppose the learning rate is constant and satisfies $\gamma \le 2/L$. Applying the averaged model to the stability result in Theorem 22 and using the $G$-Lipschitz Assumption 17, we obtain
\[
\epsilon_{\mathrm{avg\text{-}stab}} \le G\,\mathbb{E}\big\|w^{(T)}_{\mathrm{avg}} - w'^{(T)}_{\mathrm{avg}}\big\| = G\,\frac{\sum_{t=0}^{T-1}\gamma\,\mathbb{E}\|w^{(t)}-w'^{(t)}\|}{\sum_{t=0}^{T-1}\gamma} \le \frac{CGL\gamma C_{w_0}}{\delta(1-\lambda)} + \Big(\frac{CG^2L\gamma^2}{\delta(1-\lambda)} + \frac{G^2\gamma}{mn}\Big)T. \tag{70}
\]
Combining this with the optimization error bound in Corollary 27, we obtain
\[
\epsilon_{\mathrm{exc}} \le \epsilon_{\mathrm{avg\text{-}stab}} + \epsilon_{\mathrm{opt}} \le \Big(\frac{CG^2L\gamma^2}{\delta(1-\lambda)} + \frac{G^2\gamma}{mn}\Big)T + \Big(\frac{\|w^{(0)}-w^*_S\|^2}{2\gamma} + \frac{2rCLC_{w_0}}{\delta(1-\lambda)}\Big)\frac{1}{T} + \frac{CGL\gamma C_{w_0}}{\delta(1-\lambda)} + \frac{2rCLG\gamma}{\delta(1-\lambda)} + \frac{G^2\gamma}{2}. \tag{71}
\]
For convenience, denote
\[
A := \frac{CG^2L\gamma^2}{\delta(1-\lambda)} + \frac{G^2\gamma}{mn} = G^2\gamma\Big(\frac{CL\gamma}{\delta(1-\lambda)} + \frac{1}{mn}\Big), \qquad
B := \frac{\|w^{(0)}-w^*_S\|^2}{2\gamma} + \frac{2rCLC_{w_0}}{\delta(1-\lambda)}, \qquad
C := \frac{CGL\gamma C_{w_0}}{\delta(1-\lambda)} + \frac{2rCLG\gamma}{\delta(1-\lambda)} + \frac{G^2\gamma}{2}. \tag{72}
\]
Then (71) can be rewritten as
\[
\epsilon_{\mathrm{exc}}(T) \le AT + \frac{B}{T} + C. \tag{73}
\]
The right-hand side is minimized over $T>0$ by
\[
T^* = \sqrt{\frac{B}{A}} = \frac{1}{G\sqrt{\gamma}}\sqrt{\frac{\dfrac{\|w^{(0)}-w^*_S\|^2}{2\gamma} + \dfrac{2rCLC_{w_0}}{\delta(1-\lambda)}}{\dfrac{CL\gamma}{\delta(1-\lambda)} + \dfrac{1}{mn}}}. \tag{74}
\]
Substituting $T^*$ back into the bound and using $AT^* + B/T^* = 2\sqrt{AB}$, we obtain the explicit optimal excess generalization bound
\[
\epsilon^*_{\mathrm{exc}} \le 2\sqrt{AB} + C = 2\sqrt{\Big(\frac{CG^2L\gamma^2}{\delta(1-\lambda)} + \frac{G^2\gamma}{mn}\Big)\Big(\frac{\|w^{(0)}-w^*_S\|^2}{2\gamma} + \frac{2rCLC_{w_0}}{\delta(1-\lambda)}\Big)} + \frac{CGL\gamma C_{w_0}}{\delta(1-\lambda)} + \frac{2rCLG\gamma}{\delta(1-\lambda)} + \frac{G^2\gamma}{2}. \tag{75}
\]
This gives the excess generalization error under a constant learning rate.

Case 2: Decreasing learning rate. Let $\gamma_t = \frac{v}{t+1}$ with $v \le 2/L$. Applying the averaged model to the stability result in Corollary 24 and using Assumption 17, we have
\[
\epsilon_{\mathrm{avg\text{-}stab}} \le G\,\mathbb{E}\big\|w^{(T)}_{\mathrm{avg}}-w'^{(T)}_{\mathrm{avg}}\big\| \le \frac{G\sum_{t=0}^{T-1}\gamma_t\,\mathbb{E}\|w^{(t)}-w'^{(t)}\|}{\sum_{t=0}^{T-1}\gamma_t} = \frac{G\sum_{t=0}^{T-1}\gamma_t\,\mathbb{E}[\Delta_t]}{\sum_{t=0}^{T-1}\gamma_t} \le \frac{\sum_{t=0}^{T-1}\gamma_t\cdot\frac{1}{G}\epsilon_{\mathrm{stab}}(t)}{\sum_{t=0}^{T-1}\gamma_t} \le \frac{\sum_{t=0}^{T-1}\gamma_t\Big(\frac{2GCLC_{w_0}}{\delta}\sum_{k=0}^{t-1}\gamma_k\lambda^k + \frac{2CG^2L}{\delta(1-\lambda)}\sum_{k=0}^{t-1}\gamma_k^2 + \frac{2G^2}{mn}\sum_{k=0}^{t-1}\gamma_k\Big)}{\sum_{t=0}^{T-1}\gamma_t}.
\]
Substituting $\gamma_k = \frac{v}{k+1}$ and using (a), we obtain
\[
\begin{aligned}
\epsilon_{\mathrm{avg\text{-}stab}} &\le \frac{\sum_{t=0}^{T-1}\frac{v}{t+1}\Big(\frac{2GCLvC_{w_0}}{\delta(1-\lambda)} + \frac{4CG^2Lv^2}{\delta(1-\lambda)} + \frac{2G^2v}{mn}(1+\ln t)\Big)}{\sum_{t=0}^{T-1}\frac{v}{t+1}} \\
&\overset{(e)}{\le} \frac{\frac{4G^2v}{mn}\sum_{t=1}^{T-1}\frac{\ln t}{t+1} + \Big(\frac{4GCLvC_{w_0}}{\delta(1-\lambda)} + \frac{8CG^2Lv^2}{\delta(1-\lambda)} + \frac{4G^2v}{mn}\Big)\sum_{t=0}^{T-1}\frac{1}{t+1}}{\ln T} \\
&\overset{(f),(b)}{\le} \frac{\frac{2G^2v}{mn}\ln^2 T + \Big(\frac{4GCLvC_{w_0}}{\delta(1-\lambda)} + \frac{8CG^2Lv^2}{\delta(1-\lambda)} + \frac{4G^2v}{mn}\Big)(1+\ln T)}{\ln T} \\
&\le \frac{2G^2v}{mn}\ln T + \Big(\frac{4GCLvC_{w_0}}{\delta(1-\lambda)} + \frac{8CG^2Lv^2}{\delta(1-\lambda)} + \frac{4G^2v}{mn}\Big)\frac{1}{\ln T} + \frac{4GCLvC_{w_0}}{\delta(1-\lambda)} + \frac{8CG^2Lv^2}{\delta(1-\lambda)} + \frac{4G^2v}{mn}. 
\end{aligned} \tag{76}
\]
Here we used
\[
\sum_{t=1}^{T-1}\frac{\ln t}{t+1} \le \int_1^T\frac{\ln x}{x}\,dx = \frac{\ln^2 T}{2}. \qquad (f)
\]
Thus, together with Corollary 27, the excess generalization bound becomes
\[
\begin{aligned}
\epsilon_{\mathrm{exc}} \le \epsilon_{\mathrm{avg\text{-}stab}} + \epsilon_{\mathrm{opt}} \le{}& \frac{2G^2v}{mn}\ln T + \Big(\frac{4GCLvC_{w_0}}{\delta(1-\lambda)} + \frac{8CG^2Lv^2}{\delta(1-\lambda)} + \frac{4G^2v}{mn}\Big)\frac{1}{\ln T} + \Big(\frac{\|w^{(0)}-w^*_S\|^2}{v} + \frac{4rCLC_{w_0}}{\delta(1-\lambda)} + \frac{8vrCLG}{\delta(1-\lambda)} + 2vG^2\Big)\frac{1}{\ln T} \\
&+ \frac{4GCLvC_{w_0}}{\delta(1-\lambda)} + \frac{8CG^2Lv^2}{\delta(1-\lambda)} + \frac{4G^2v}{mn}. 
\end{aligned} \tag{77}
\]
We have
\[
\epsilon_{\mathrm{exc}}(T) \le A\ln T + \frac{B}{\ln T} + C, \tag{78}
\]
where one can take
\[
A := \frac{2G^2v}{mn}, \qquad
B := \frac{4GCLvC_{w_0}}{\delta(1-\lambda)} + \frac{8CG^2Lv^2}{\delta(1-\lambda)} + \frac{4G^2v}{mn} + \frac{\|w^{(0)}-w^*_S\|^2}{v} + \frac{4rCLC_{w_0}}{\delta(1-\lambda)} + \frac{8vrCLG}{\delta(1-\lambda)} + 2vG^2,
\]
\[
C := \frac{4GCLvC_{w_0}}{\delta(1-\lambda)} + \frac{8CG^2Lv^2}{\delta(1-\lambda)} + \frac{4G^2v}{mn}.
\]
Viewing the right-hand side as a function of $x = \ln T > 0$,
\[
\phi(x) = Ax + \frac{B}{x} + C, \tag{79}
\]
standard calculus shows that $\phi$ is minimized at $x^* = \sqrt{B/A}$, i.e.,
\[
\ln T^* = \sqrt{\frac{B}{A}} \;\Longrightarrow\; T^* = \exp\Big(\sqrt{B/A}\Big), \tag{80}
\]
and the corresponding minimum value satisfies
\[
\epsilon^*_{\mathrm{exc}} \le 2\sqrt{AB} + C. \tag{81}
\]
Moreover,
\[
\sqrt{AB} = \sqrt{\frac{2G^2v}{mn}\cdot B} \le \frac{\sqrt{2}\,G}{\sqrt{mn}}\sqrt{B}, \tag{82}
\]
so, keeping the key scalings in the problem parameters, we obtain
\[
\epsilon^*_{\mathrm{exc}} = O\Bigg(\frac{G}{\sqrt{mn}}\sqrt{\frac{\|w^{(0)}-w^*_S\|^2}{v} + \frac{CLvC_{w_0}}{\delta(1-\lambda)} + \frac{CG^2Lv^2}{\delta(1-\lambda)} + \frac{G^2v}{mn}} + \frac{G}{\sqrt{mn}}\sqrt{\frac{rCLC_{w_0}}{\delta(1-\lambda)} + \frac{vrCLG}{\delta(1-\lambda)} + vG^2} + \frac{GCLvC_{w_0}}{\delta(1-\lambda)} + \frac{CG^2Lv^2}{\delta(1-\lambda)} + \frac{G^2v}{mn}\Bigg). \tag{83}
\]
This completes the proof of Corollary 29.

C.6 Proof of Theorem 31

Proof  Assume that the two sample sets $S$ and $S'$ differ in only one sample among the first $n$ samples. Let $\Delta_t := \|w^{(t)} - w'^{(t)}\|$.

Case 1: Identical samples (with probability $1-\frac{1}{n}$). In this case, the $\xi_i^{(t)}$ are identical for both runs, and
\[
\begin{aligned}
\Delta_{t+1} &= \Big\|w^{(t)} - w'^{(t)} - \frac{\gamma_t}{m}\sum_{i=1}^m\big(\nabla f(z_i^{(t)};\xi_i^{(t)}) - \nabla f(z'^{(t)}_i;\xi_i^{(t)})\big)\Big\| \\
&\le \Big\|w^{(t)} - w'^{(t)} - \frac{\gamma_t}{m}\sum_{i=1}^m\big(\nabla f(w^{(t)};\xi_i^{(t)}) - \nabla f(w'^{(t)};\xi_i^{(t)})\big)\Big\| + \frac{\gamma_t}{m}\sum_{i=1}^m\big\|\nabla f(w^{(t)};\xi_i^{(t)}) - \nabla f(z_i^{(t)};\xi_i^{(t)})\big\| + \frac{\gamma_t}{m}\sum_{i=1}^m\big\|\nabla f(w'^{(t)};\xi_i^{(t)}) - \nabla f(z'^{(t)}_i;\xi_i^{(t)})\big\| \\
&\overset{\text{Lem.~16, Ass.~18}}{\le} (1+L\gamma_t)\Delta_t + \frac{L\gamma_t}{m}\sum_{i=1}^m\big\|w^{(t)}-z_i^{(t)}\big\| + \frac{L\gamma_t}{m}\sum_{i=1}^m\big\|w'^{(t)}-z'^{(t)}_i\big\|. 
\end{aligned} \tag{84}
\]
Applying Lemma 9 to both runs and using the same $C_{w_0}$,
\[
\frac{1}{m}\sum_{i=1}^m\big\|w^{(t)}-z_i^{(t)}\big\| + \frac{1}{m}\sum_{i=1}^m\big\|w'^{(t)}-z'^{(t)}_i\big\| \le \frac{2C}{\delta}\Big(\lambda^t C_{w_0} + G\sum_{s=0}^{t-1}\lambda^{t-s}\gamma_s\Big). \tag{85}
\]
Substituting (85) into (84) gives
\[
\Delta_{t+1} \le (1+L\gamma_t)\Delta_t + \frac{2CL\gamma_t}{\delta}\Big(\lambda^t C_{w_0} + G\sum_{s=0}^{t-1}\lambda^{t-s}\gamma_s\Big). \tag{86}
\]

Case 2: Different sample (with probability $\frac{1}{n}$). Without loss of generality, assume node 1 uses the replaced sample in the primed run, i.e., $\xi'^{(t)}_1$. Then
\[
\begin{aligned}
\Delta_{t+1} &\le \Big\|w^{(t)} - w'^{(t)} - \frac{\gamma_t}{m}\sum_{i=2}^m\big(\nabla f(z_i^{(t)};\xi_i^{(t)}) - \nabla f(z'^{(t)}_i;\xi_i^{(t)})\big)\Big\| + \frac{\gamma_t}{m}\big\|\nabla f(z_1^{(t)};\xi_1^{(t)}) - \nabla f(z'^{(t)}_1;\xi'^{(t)}_1)\big\| \\
&\le \Big(1+L\gamma_t-\frac{L\gamma_t}{mn}\Big)\Delta_t + \frac{2CL\gamma_t}{\delta}\Big(\lambda^t C_{w_0} + G\sum_{s=0}^{t-1}\lambda^{t-s}\gamma_s\Big) + \frac{2G\gamma_t}{m}. 
\end{aligned} \tag{87}
\]
The last term uses $\|\nabla f\| \le G$.

Expectation recursion. Taking expectation and using $\mathbb{P}(\text{Case 1}) = 1-\frac{1}{n}$ and $\mathbb{P}(\text{Case 2}) = \frac{1}{n}$, from (86)–(87) we obtain
\[
\mathbb{E}[\Delta_{t+1}] \le \Big(1+L\gamma_t-\frac{L\gamma_t}{mn}\Big)\mathbb{E}[\Delta_t] + \frac{2CL\gamma_t}{\delta}\Big(\lambda^t C_{w_0} + G\sum_{s=0}^{t-1}\lambda^{t-s}\gamma_s\Big) + \frac{2G\gamma_t}{mn}. \tag{88}
\]
Let $t_0 \in \{1,2,\dots,n\}$ be a burn-in time to be determined, and condition on $\Delta_{t_0}=0$. Unrolling (88) yields
\[
\mathbb{E}[\Delta_T \mid \Delta_{t_0}=0] \le \sum_{t=t_0}^{T-1}\prod_{k=t+1}^{T-1}\Big(1+L\gamma_k-\frac{L\gamma_k}{mn}\Big)\times\Big[\frac{2CL\gamma_t}{\delta}\Big(\lambda^t C_{w_0} + G\sum_{s=0}^{t-1}\lambda^{t-s}\gamma_s\Big) + \frac{2G\gamma_t}{mn}\Big]. \tag{89}
\]
Finally, the uniform stability under non-convex loss functions satisfies
\[
\begin{aligned}
\epsilon_{\mathrm{stab}} &\le \frac{t_0}{mn} + G\sum_{t=t_0}^{T-1}\prod_{k=t+1}^{T-1}\Big(1+L\gamma_k-\frac{L\gamma_k}{mn}\Big)\Big[\frac{2CL\gamma_t}{\delta}\Big(\lambda^t C_{w_0} + G\sum_{s=0}^{t-1}\lambda^{t-s}\gamma_s\Big) + \frac{2G\gamma_t}{mn}\Big] \\
&\overset{(a)}{\le} \frac{t_0}{mn} + G\sum_{t=t_0}^{T-1}\prod_{k=t+1}^{T-1}\Big(1+L\gamma_k-\frac{L\gamma_k}{mn}\Big)\Big[\frac{2CL\gamma_t}{\delta}\Big(\lambda^t C_{w_0} + \frac{G\gamma_t}{1-\lambda}\Big) + \frac{2G\gamma_t}{mn}\Big]. 
\end{aligned} \tag{90}
\]
This completes the proof of Theorem 31.

C.7 Proof of Corollary 33

Proof  Case 1: Constant learning rate. For constant learning rates $\gamma_t = \gamma$, from Eq. (90) we obtain
\[
\begin{aligned}
\mathbb{E}[\Delta_T\mid\Delta_{t_0}=0] &\le \sum_{t=t_0}^{T-1}\Big(1+L\gamma-\frac{L\gamma}{mn}\Big)^{T-1-t}\Big(\frac{2CL\gamma}{\delta}\lambda^t C_{w_0} + \frac{2CGL\gamma^2}{\delta(1-\lambda)} + \frac{2G\gamma}{mn}\Big) \\
&\le \frac{2CL\gamma C_{w_0}}{\delta}\sum_{t=t_0}^{T-1}\Big(1+L\gamma-\frac{L\gamma}{mn}\Big)^{T-1-t}\lambda^t + \Big(\frac{2CGL\gamma^2}{\delta(1-\lambda)} + \frac{2G\gamma}{mn}\Big)\sum_{t=t_0}^{T-1}\Big(1+L\gamma-\frac{L\gamma}{mn}\Big)^{T-1-t} \\
&\overset{(a)}{\le} \frac{2CL\gamma C_{w_0}}{\delta}\Big(1+L\gamma-\frac{L\gamma}{mn}\Big)^{T-1}\frac{1+L\gamma-\frac{L\gamma}{mn}}{1+L\gamma-\frac{L\gamma}{mn}-\lambda} + \Big(\frac{2CGL\gamma^2}{\delta(1-\lambda)} + \frac{2G\gamma}{mn}\Big)\frac{mn}{mn-1}\Big(1+L\gamma-\frac{L\gamma}{mn}\Big)^{T-1} \\
&\le \frac{2CL\gamma C_{w_0}}{\delta(1-\lambda)}\Big(1+L\gamma-\frac{L\gamma}{mn}\Big)^{T} + \Big(\frac{2CGL\gamma^2}{\delta(1-\lambda)} + \frac{2G\gamma}{mn}\Big)\frac{mn}{mn-1}\Big(1+L\gamma-\frac{L\gamma}{mn}\Big)^{T} \\
&\le \Big(\frac{2CL\gamma C_{w_0}}{\delta(1-\lambda)} + \frac{4CGL\gamma^2}{\delta(1-\lambda)} + \frac{4G\gamma}{mn}\Big)\Big(1+L\gamma-\frac{L\gamma}{mn}\Big)^{T}. 
\end{aligned} \tag{91}
\]
When using bound (a), we rely on the fact that $\lambda\big/\big(1+L\gamma-\frac{L\gamma}{mn}\big) \le 1$. Thus, the uniform stability satisfies
\[
\epsilon_{\mathrm{stab}} \le \frac{t_0}{mn} + G\,\mathbb{E}[\Delta_T\mid\Delta_{t_0}=0] \le \frac{t_0}{mn} + G\Big(\frac{2CL\gamma C_{w_0}}{\delta(1-\lambda)} + \frac{4CGL\gamma^2}{\delta(1-\lambda)} + \frac{4G\gamma}{mn}\Big)\Big(1+L\gamma-\frac{L\gamma}{mn}\Big)^{T}. \tag{92}
\]
Letting $t_0 = 0$ gives the minimal bound in this case.

Case 2: Decreasing learning rate. From Eq. (90), for diminishing learning rates $\gamma_t = \frac{v}{t+1}$, we have
\[
\begin{aligned}
\mathbb{E}[\Delta_T\mid\Delta_{t_0}=0] &\le \sum_{t=t_0}^{T-1}\prod_{k=t+1}^{T-1}\Big(1+\Big(1-\frac{1}{mn}\Big)\frac{Lv}{k+1}\Big)\times\Big(\frac{2CLv\lambda^t C_{w_0}}{\delta(t+1)} + \frac{2CGLv^2}{\delta(1-\lambda)(t+1)^2} + \frac{2Gv}{mn(t+1)}\Big) \\
&\overset{(g)}{\le} \sum_{t=t_0}^{T-1}\exp\Big(\Big(1-\frac{1}{mn}\Big)Lv\sum_{k=t+1}^{T-1}\frac{1}{k+1}\Big)\times\Big(\frac{2CLv\lambda^t C_{w_0}}{\delta(t+1)} + \frac{2CGLv^2}{\delta(1-\lambda)(t+1)^2} + \frac{2Gv}{mn(t+1)}\Big) \\
&\overset{(h)}{\le} \sum_{t=t_0}^{T-1}\Big(\frac{T}{t+1}\Big)^{Lv(1-\frac{1}{mn})}\times\Big(\frac{2CLv\lambda^t C_{w_0}}{\delta(t+1)} + \frac{2CGLv^2}{\delta(1-\lambda)(t+1)^2} + \frac{2Gv}{mn(t+1)}\Big) \\
&\le T^{Lv(1-\frac{1}{mn})}\sum_{t=t_0}^{T-1}\Big(\frac{2CLv\lambda^t C_{w_0}}{\delta(t+1)^{Lv(1-\frac{1}{mn})+1}} + \frac{2CGLv^2}{\delta(1-\lambda)(t+1)^{Lv(1-\frac{1}{mn})+2}} + \frac{2Gv}{mn(t+1)^{Lv(1-\frac{1}{mn})+1}}\Big) \\
&\le T^{Lv(1-\frac{1}{mn})}\Big[\frac{2CLvC_{w_0}}{\delta}\sum_{t=t_0}^{T-1}(t+1)^{-Lv(1-\frac{1}{mn})-1} + \frac{2Gv}{mn}\sum_{t=t_0}^{T-1}(t+1)^{-Lv(1-\frac{1}{mn})-1} + \frac{2CGLv^2}{\delta(1-\lambda)}\sum_{t=t_0}^{T-1}(t+1)^{-Lv(1-\frac{1}{mn})-2}\Big] \\
&\overset{(i),(j)}{\le} T^{Lv(1-\frac{1}{mn})}\Big[\Big(\frac{4CLvC_{w_0}}{\delta} + \frac{4Gv}{mn}\Big)\frac{t_0^{-Lv(1-\frac{1}{mn})}}{Lv} + \frac{2CGLv^2}{\delta(1-\lambda)}\cdot\frac{t_0^{-Lv(1-\frac{1}{mn})-1}}{Lv(1-\frac{1}{mn})+1}\Big] \\
&\le \Big(\frac{4CC_{w_0}}{\delta} + \frac{4G}{mnL}\Big)\Big(\frac{T}{t_0}\Big)^{Lv(1-\frac{1}{mn})} + \frac{2CGv}{\delta(1-\lambda)}\Big(\frac{T}{t_0}\Big)^{Lv(1-\frac{1}{mn})+1}\frac{1}{T}. 
\end{aligned} \tag{93}
\]
Finally, setting $t_0 = v^{\frac{1}{2+vL}}\,T^{\frac{1+vL}{2+vL}}$, when $v$ is small enough such that $t_0 \le mn$, we have
\[
\Big(\frac{T}{t_0}\Big)^{vL+1} = \Big(\frac{T}{v^{\frac{1}{2+vL}}T^{\frac{1+vL}{2+vL}}}\Big)^{vL+1} = v^{-\frac{vL+1}{2+vL}}\,T^{\frac{vL+1}{2+vL}}, \tag{94}
\]
and
\[
\epsilon_{\mathrm{stab}} \le \frac{t_0}{mn} + G\,\mathbb{E}[\Delta_T\mid\Delta_{t_0}=0] \le \frac{v^{\frac{1}{2+vL}}}{mn}T^{\frac{1+vL}{2+vL}} + G\Big(\frac{4CC_{w_0}}{\delta} + \frac{4G}{mnL} + \frac{2CGv}{\delta(1-\lambda)}\Big)\Big(\frac{T}{t_0}\Big)^{vL+1} = \frac{v^{\frac{1}{2+vL}}}{mn}T^{\frac{1+vL}{2+vL}} + G\Big(\frac{4CC_{w_0}}{\delta} + \frac{4G}{mnL} + \frac{2CGv}{\delta(1-\lambda)}\Big)v^{-\frac{vL+1}{2+vL}}\,T^{\frac{vL+1}{2+vL}}. \tag{95}
\]
This completes the proof of Corollary 33.

C.8 Proof of Theorem 37

Proof  Let $F_S$ be $L$-smooth. By the smoothness descent inequality, for any $t \ge 0$,
\[
\mathbb{E}\big[F_S(w^{(t+1)}) - F_S(w^{(t)})\big] \le \mathbb{E}\big\langle \nabla F_S(w^{(t)}),\, w^{(t+1)}-w^{(t)}\big\rangle + \frac{L}{2}\,\mathbb{E}\big\|w^{(t+1)}-w^{(t)}\big\|^2. \tag{96}
\]
Using Proposition 7,
\[
\begin{aligned}
\mathbb{E}\big[F_S(w^{(t+1)}) - F_S(w^{(t)})\big] &\le -\frac{\gamma_t}{m}\,\mathbb{E}\Big\langle \nabla F_S(w^{(t)}),\, \sum_{i=1}^m\nabla f(z_i^{(t)};\xi_i^{(t)})\Big\rangle + \frac{L\gamma_t^2}{2}\,\mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^m\nabla f(z_i^{(t)};\xi_i^{(t)})\Big\|^2 \\
&= -\gamma_t\,\mathbb{E}\big\|\nabla F_S(w^{(t)})\big\|^2 + \frac{\gamma_t}{m}\,\mathbb{E}\Big\langle \nabla F_S(w^{(t)}),\, \sum_{i=1}^m\big(\nabla f(w^{(t)};\xi_i^{(t)}) - \nabla f(z_i^{(t)};\xi_i^{(t)})\big)\Big\rangle + \frac{L\gamma_t^2}{2}\,\mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^m\nabla f(z_i^{(t)};\xi_i^{(t)})\Big\|^2. 
\end{aligned} \tag{97}
\]
For the middle inner-product term, apply Cauchy–Schwarz and the uniform gradient bound $\|\nabla f(\cdot)\| \le G$, which implies $\|\nabla F_S(\cdot)\| \le G$:
\[
\begin{aligned}
\frac{\gamma_t}{m}\,\mathbb{E}\Big\langle \nabla F_S(w^{(t)}),\, \sum_{i=1}^m\big(\nabla f(w^{(t)};\xi_i^{(t)}) - \nabla f(z_i^{(t)};\xi_i^{(t)})\big)\Big\rangle
&\le \frac{\gamma_t}{m}\sum_{i=1}^m\mathbb{E}\big[\|\nabla F_S(w^{(t)})\|\cdot\|\nabla f(w^{(t)};\xi_i^{(t)}) - \nabla f(z_i^{(t)};\xi_i^{(t)})\|\big] \\
&\le \frac{G\gamma_t}{m}\sum_{i=1}^m\mathbb{E}\big\|\nabla f(w^{(t)};\xi_i^{(t)}) - \nabla f(z_i^{(t)};\xi_i^{(t)})\big\| \le \frac{GL\gamma_t}{m}\sum_{i=1}^m\mathbb{E}\big\|w^{(t)} - z_i^{(t)}\big\|, 
\end{aligned} \tag{98}
\]
where the last step uses $L$-smoothness of $f(\cdot;\xi)$ (gradient $L$-Lipschitz). For the last term in (97), Jensen's inequality and $\|\nabla f(\cdot)\| \le G$ give
\[
\frac{L\gamma_t^2}{2}\,\mathbb{E}\Big\|\frac{1}{m}\sum_{i=1}^m\nabla f(z_i^{(t)};\xi_i^{(t)})\Big\|^2 \le \frac{L\gamma_t^2}{2}\,\mathbb{E}\Big[\frac{1}{m}\sum_{i=1}^m\|\nabla f(z_i^{(t)};\xi_i^{(t)})\|^2\Big] \le \frac{LG^2\gamma_t^2}{2}. \tag{99}
\]
Substituting (98) and (99) into (97), we obtain
\[
\mathbb{E}\big[F_S(w^{(t+1)})-F_S(w^{(t)})\big] \le -\gamma_t\,\mathbb{E}\big\|\nabla F_S(w^{(t)})\big\|^2 + \frac{GL\gamma_t}{m}\sum_{i=1}^m\mathbb{E}\big\|w^{(t)}-z_i^{(t)}\big\| + \frac{LG^2\gamma_t^2}{2}. \tag{100}
\]
By the PL condition, $\|\nabla F_S(w)\|^2 \ge 2\alpha\big(F_S(w)-F_S(w^*_S)\big)$, hence
\[
\mathbb{E}\big[F_S(w^{(t+1)})-F_S(w^{(t)})\big] \le -2\alpha\gamma_t\,\mathbb{E}\big[F_S(w^{(t)})-F_S(w^*_S)\big] + \frac{GL\gamma_t}{m}\sum_{i=1}^m\mathbb{E}\big\|w^{(t)}-z_i^{(t)}\big\| + \frac{LG^2\gamma_t^2}{2}. \tag{101}
\]
Now apply Lemma 9:
\[
\frac{GL\gamma_t}{m}\sum_{i=1}^m\mathbb{E}\big\|w^{(t)}-z_i^{(t)}\big\| \le GL\gamma_t\cdot\frac{C}{\delta}\Big(\lambda^t C_{w_0} + G\sum_{s=0}^{t-1}\lambda^{t-s}\gamma_s\Big) = \frac{CGL\gamma_t\lambda^t C_{w_0}}{\delta} + \frac{CG^2L\gamma_t}{\delta}\sum_{s=0}^{t-1}\lambda^{t-s}\gamma_s. \tag{102}
\]
Assuming the stepsizes are nonincreasing, we have for $s \le t-1$ that $\gamma_s \ge \gamma_t$, and hence
\[
\sum_{s=0}^{t-1}\lambda^{t-s}\gamma_s \le \sum_{s=0}^{t-1}\lambda^{t-s}\gamma_t = \gamma_t\sum_{k=1}^{t}\lambda^k \le \frac{\gamma_t}{1-\lambda}.
\]
Substituting this into (102) and then into (101) yields
\[
\mathbb{E}\big[F_S(w^{(t+1)})-F_S(w^{(t)})\big] \le -2\alpha\gamma_t\,\mathbb{E}\big[F_S(w^{(t)})-F_S(w^*_S)\big] + \frac{CGL\gamma_t\lambda^t C_{w_0}}{\delta} + \Big(\frac{CG^2L}{\delta(1-\lambda)} + \frac{LG^2}{2}\Big)\gamma_t^2. \tag{103}
\]
Rearranging (103) gives
\[
\gamma_t\,\mathbb{E}\big[F_S(w^{(t)})-F_S(w^*_S)\big] \le \frac{1}{2\alpha}\,\mathbb{E}\big[F_S(w^{(t)})-F_S(w^{(t+1)})\big] + \frac{CGL}{2\delta\alpha}\gamma_t\lambda^t C_{w_0} + \Big(\frac{CG^2L}{2\delta\alpha(1-\lambda)} + \frac{LG^2}{4\alpha}\Big)\gamma_t^2. \tag{104}
\]
Summing (104) over $t = 0,\dots,T-1$ yields
\[
\sum_{t=0}^{T-1}\gamma_t\,\mathbb{E}\big[F_S(w^{(t)})-F_S(w^*_S)\big] \le \frac{1}{2\alpha}\,\mathbb{E}\big[F_S(w^{(0)})-F_S(w^{(T)})\big] + \frac{CGLC_{w_0}}{2\delta\alpha}\sum_{t=0}^{T-1}\gamma_t\lambda^t + \Big(\frac{CG^2L}{2\delta\alpha(1-\lambda)} + \frac{LG^2}{4\alpha}\Big)\sum_{t=0}^{T-1}\gamma_t^2. \tag{105}
\]
Using the Lipschitz condition to upper bound the initial function gap,
\[
\mathbb{E}\big[F_S(w^{(0)})-F_S(w^{(T)})\big] \le \mathbb{E}\big[F_S(w^{(0)})-F_S(w^*_S)\big] \le Gr,
\]
and dividing (105) by $\sum_{t=0}^{T-1}\gamma_t$ gives
\[
\epsilon_{\mathrm{opt}} = \frac{\sum_{t=0}^{T-1}\gamma_t\,\mathbb{E}\big[F_S(w^{(t)})-F_S(w^*_S)\big]}{\sum_{t=0}^{T-1}\gamma_t} \le \frac{Gr}{\alpha\sum_{t=0}^{T-1}\gamma_t} + \frac{CGLC_{w_0}}{2\delta\alpha\sum_{t=0}^{T-1}\gamma_t}\sum_{t=0}^{T-1}\gamma_t\lambda^t + \Big(\frac{CG^2L}{2\delta\alpha(1-\lambda)} + \frac{LG^2}{4\alpha}\Big)\frac{\sum_{t=0}^{T-1}\gamma_t^2}{\sum_{t=0}^{T-1}\gamma_t}. \tag{106}
\]
Finally, letting $\kappa := L/\alpha$ completes the proof.

C.9 Proof of Corollary 38

Proof  Case 1: Constant learning rate. For the case of a constant learning rate where $\gamma_t = \gamma$ for all $t$, the optimization error bound in Theorem 37 becomes
\[
\begin{aligned}
\epsilon_{\mathrm{opt}} &\le \frac{Gr}{\alpha\sum_{t=0}^{T-1}\gamma} + \frac{CG\kappa C_{w_0}}{2\delta\sum_{t=0}^{T-1}\gamma}\sum_{t=0}^{T-1}\gamma\lambda^t + \Big(\frac{CG^2\kappa}{2\delta(1-\lambda)} + \frac{G^2\kappa}{4}\Big)\frac{\sum_{t=0}^{T-1}\gamma^2}{\sum_{t=0}^{T-1}\gamma} \\
&= \frac{Gr}{T\alpha\gamma} + \frac{CG\kappa C_{w_0}}{2\delta T}\sum_{t=0}^{T-1}\lambda^t + \Big(\frac{CG^2\kappa}{2\delta(1-\lambda)} + \frac{G^2\kappa}{4}\Big)\gamma \\
&\overset{(a)}{\le} \Big(\frac{Gr}{\alpha\gamma} + \frac{CG\kappa C_{w_0}}{2\delta(1-\lambda)}\Big)\frac{1}{T} + \Big(\frac{CG^2\kappa}{2\delta(1-\lambda)} + \frac{G^2\kappa}{4}\Big)\gamma. 
\end{aligned} \tag{107}
\]

Case 2: Decreasing learning rate. For the case of decreasing learning rates, i.e., $\gamma_t = \frac{v}{t+1}$, we again start from the bound in Theorem 37:
\[
\epsilon_{\mathrm{opt}} \le \frac{Gr}{\alpha v\sum_{t=0}^{T-1}\frac{1}{t+1}} + \frac{CG\kappa C_{w_0}}{2\delta v\sum_{t=0}^{T-1}\frac{1}{t+1}}\sum_{t=0}^{T-1}\frac{v}{t+1}\lambda^t + \Big(\frac{CG^2\kappa}{2\delta(1-\lambda)} + \frac{G^2\kappa}{4}\Big)\frac{v\sum_{t=0}^{T-1}\frac{1}{(t+1)^2}}{\sum_{t=0}^{T-1}\frac{1}{t+1}}.
\]
Using the standard lower bound for the harmonic series, there exists a constant $T_0$ such that for all $T \ge T_0$,
\[
\sum_{t=0}^{T-1}\frac{1}{t+1} \ge \frac{1}{2}\ln T, \qquad\text{thus}\qquad \frac{1}{\sum_{t=0}^{T-1}\frac{1}{t+1}} \le \frac{2}{\ln T}.
\]
Applying this to the first two terms and absorbing constants into the third term, we obtain
\[
\begin{aligned}
\epsilon_{\mathrm{opt}} &\overset{(e)}{\le} \frac{2Gr}{v\alpha\ln T} + \frac{CG\kappa C_{w_0}}{\delta\ln T}\sum_{t=0}^{T-1}\frac{\lambda^t}{t+1} + \Big(\frac{CG^2\kappa}{\delta(1-\lambda)} + \frac{G^2\kappa}{2}\Big)\frac{v\sum_{t=0}^{T-1}\frac{1}{(t+1)^2}}{\ln T} \\
&\overset{(c)}{\le} \frac{2Gr}{v\alpha\ln T} + \frac{CG\kappa C_{w_0}}{\delta\ln T}\sum_{t=0}^{T-1}\frac{\lambda^t}{t+1} + \Big(\frac{2CG^2\kappa}{\delta(1-\lambda)} + G^2\kappa\Big)\frac{v}{\ln T},
\end{aligned}
\]
where in step (c) we used $\sum_{t=0}^{\infty}\frac{1}{(t+1)^2} \le 2$. Finally, since $\frac{1}{t+1} \le 1$ and $\sum_{t=0}^{T-1}\lambda^t \le \frac{1}{1-\lambda}$, we have
\[
\sum_{t=0}^{T-1}\frac{\lambda^t}{t+1} \le \sum_{t=0}^{T-1}\lambda^t \overset{(a)}{\le} \frac{1}{1-\lambda}, \tag{108}
\]
and therefore
\[
\epsilon_{\mathrm{opt}} \overset{(a)}{\le} \Big(\frac{CG\kappa C_{w_0}}{\delta(1-\lambda)} + \frac{2Gr}{v\alpha} + \Big(\frac{2CG^2\kappa}{\delta(1-\lambda)} + G^2\kappa\Big)v\Big)\frac{1}{\ln T}. \tag{109}
\]
This completes the proof of Corollary 38.

C.10 Proof of Corollary 40

Proof  Case 1: Constant learning rate. Let $\gamma_t = \gamma$ and denote
\[
\rho := 1 + L\gamma - \frac{L\gamma}{mn}, \qquad \mu := \ln\rho.
\]
Recall from Corollary 33 (constant stepsize case) that
\[
\epsilon_{\mathrm{stab}}(t) \le \Big(\frac{2GCL\gamma C_{w_0}}{\delta(1-\lambda)} + \frac{4CG^2L\gamma^2}{\delta(1-\lambda)} + \frac{4G^2\gamma}{mn}\Big)\rho^t. \tag{110}
\]
Since $\epsilon_{\mathrm{stab}}(t) \le G\,\mathbb{E}\|w^{(t)}-w'^{(t)}\| = G\,\mathbb{E}[\Delta_t]$, (110) implies
\[
\mathbb{E}[\Delta_t] \le \Big(\frac{2CL\gamma C_{w_0}}{\delta(1-\lambda)} + \frac{4CGL\gamma^2}{\delta(1-\lambda)} + \frac{4G\gamma}{mn}\Big)\rho^t. \tag{111}
\]
Applying the averaged model and Assumption 17 (the loss is $G$-Lipschitz) yields
\[
\begin{aligned}
\epsilon_{\mathrm{avg\text{-}stab}} &\le G\,\mathbb{E}\big\|w^{(T)}_{\mathrm{avg}} - w'^{(T)}_{\mathrm{avg}}\big\| = \frac{G\sum_{t=0}^{T-1}\gamma\,\mathbb{E}[\Delta_t]}{\sum_{t=0}^{T-1}\gamma} = \frac{G}{T}\sum_{t=0}^{T-1}\mathbb{E}[\Delta_t] \\
&\overset{(111)}{\le} \frac{1}{T}\Big(\frac{2GCL\gamma C_{w_0}}{\delta(1-\lambda)} + \frac{4CG^2L\gamma^2}{\delta(1-\lambda)} + \frac{4G^2\gamma}{mn}\Big)\sum_{t=0}^{T-1}\rho^t \\
&= \frac{1}{T}\Big(\frac{2GCL\gamma C_{w_0}}{\delta(1-\lambda)} + \frac{4CG^2L\gamma^2}{\delta(1-\lambda)} + \frac{4G^2\gamma}{mn}\Big)\cdot\frac{\rho^T-1}{\rho-1} \le \frac{1}{T}\Big(\frac{2GCL\gamma C_{w_0}}{\delta(1-\lambda)} + \frac{4CG^2L\gamma^2}{\delta(1-\lambda)} + \frac{4G^2\gamma}{mn}\Big)\cdot\frac{\rho^T}{\rho-1}. 
\end{aligned} \tag{112}
\]
Noting $\rho - 1 = L\gamma\big(1-\frac{1}{mn}\big)$, define the constant
\[
C_{\mathrm{stab}} := \frac{\dfrac{2GCL\gamma C_{w_0}}{\delta(1-\lambda)} + \dfrac{4CG^2L\gamma^2}{\delta(1-\lambda)} + \dfrac{4G^2\gamma}{mn}}{L\gamma\big(1-\frac{1}{mn}\big)}, \tag{113}
\]
so that (112) becomes
\[
\epsilon_{\mathrm{avg\text{-}stab}} \le C_{\mathrm{stab}}\cdot\frac{\rho^T}{T}. \tag{114}
\]
On the other hand, Corollary 38 gives (with $\kappa := L/\alpha$)
\[
\epsilon_{\mathrm{opt}} \le \Big(\frac{Gr}{\alpha\gamma} + \frac{CG\kappa C_{w_0}}{2\delta(1-\lambda)}\Big)\frac{1}{T} + \Big(\frac{CG^2\kappa}{2\delta(1-\lambda)} + \frac{G^2\kappa}{4}\Big)\gamma. \tag{115}
\]
Combining (114) and (115), we obtain
\[
\epsilon_{\mathrm{exc}}(T) \le \epsilon_{\mathrm{avg\text{-}stab}} + \epsilon_{\mathrm{opt}} \le \Big(\frac{Gr}{\alpha\gamma} + \frac{CG\kappa C_{w_0}}{2\delta(1-\lambda)}\Big)\frac{1}{T} + C_{\mathrm{stab}}\frac{\rho^T}{T} + \Big(\frac{CG^2\kappa}{2\delta(1-\lambda)} + \frac{G^2\kappa}{4}\Big)\gamma. \tag{116}
\]
We now choose $T$ to balance the two $1/T$-terms in (116). Let
\[
C_{\mathrm{opt}} := \frac{Gr}{\alpha\gamma} + \frac{CG\kappa C_{w_0}}{2\delta(1-\lambda)}.
\]
Take
\[
T^* := \frac{1}{\mu}\ln\frac{C_{\mathrm{opt}}}{C_{\mathrm{stab}}}, \qquad \mu = \ln\rho \tag{117}
\]
(assuming $C_{\mathrm{opt}} > C_{\mathrm{stab}}$; otherwise one may take $T^* = 1$). Then $\rho^{T^*} \le e\,C_{\mathrm{opt}}/C_{\mathrm{stab}}$, and hence
\[
\frac{C_{\mathrm{opt}}}{T^*} + C_{\mathrm{stab}}\frac{\rho^{T^*}}{T^*} \le \frac{C_{\mathrm{opt}}}{T^*} + \frac{e\,C_{\mathrm{opt}}}{T^*} = \frac{(1+e)\,C_{\mathrm{opt}}}{T^*} \le (1+e)\,C_{\mathrm{opt}}\cdot\frac{\mu}{\ln(C_{\mathrm{opt}}/C_{\mathrm{stab}})}. \tag{118}
\]
Substituting (118) into (116) yields the explicit bound
\[
\epsilon^*_{\mathrm{exc}} \le (1+e)\,C_{\mathrm{opt}}\cdot\frac{\ln\big(1+L\gamma-\frac{L\gamma}{mn}\big)}{\ln(C_{\mathrm{opt}}/C_{\mathrm{stab}})} + \Big(\frac{CG^2\kappa}{2\delta(1-\lambda)} + \frac{G^2\kappa}{4}\Big)\gamma, \tag{119}
\]
where $C_{\mathrm{stab}}$ is given in (113). Noting $\mu = \ln\rho = \Theta(L\gamma)$ for small $\gamma$, the choice (117) gives
\[
T^* = O\Big(\frac{1}{L\gamma}\log\frac{C_{\mathrm{opt}}}{C_{\mathrm{stab}}}\Big).
\]
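The balancing choice in (117)–(118) can be sanity-checked numerically. In the sketch below, the values of $L$, $\gamma$, $mn$, $C_{\mathrm{opt}}$, and $C_{\mathrm{stab}}$ are illustrative placeholders (not values derived from the analysis); the check confirms that with $T^* = \lceil\frac{1}{\mu}\ln(C_{\mathrm{opt}}/C_{\mathrm{stab}})\rceil$ one indeed has $\rho^{T^*} \le e\,C_{\mathrm{opt}}/C_{\mathrm{stab}}$.

```python
import math

L, gamma, mn = 1.0, 0.01, 1000           # illustrative constants, not from the paper
rho = 1 + L * gamma - L * gamma / mn     # rho = 1 + L*gamma - L*gamma/(mn)
mu = math.log(rho)

C_opt, C_stab = 40.0, 0.5                # illustrative, with C_opt > C_stab
T_star = math.ceil(math.log(C_opt / C_stab) / mu)  # choice (117), rounded up

# The fact used in (118): rho^{T*} <= e * C_opt / C_stab (valid since rho <= e)
assert rho ** T_star <= math.e * C_opt / C_stab
assert T_star >= 1
```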
Moreover, since $\ln(C_{\mathrm{opt}}/C_{\mathrm{stab}}) = \Theta(\log(1/\gamma))$ in the typical regime, the minimized excess risk satisfies
\[
\epsilon^*_{\mathrm{exc}} = O\Big(C_{\mathrm{opt}}\cdot\frac{L\gamma}{\log(1/\gamma)} + G^2\kappa\gamma\Big) = O\big(G^2\kappa\gamma\big)
\]
(up to logarithmic factors).

Case 2: Decreasing learning rate. Let $\gamma_t = \frac{v}{t+1}$. Recall that the averaged iterates are
\[
w^{(T)}_{\mathrm{avg}} = \frac{\sum_{t=0}^{T-1}\gamma_t w^{(t)}}{\sum_{t=0}^{T-1}\gamma_t}, \qquad
w'^{(T)}_{\mathrm{avg}} = \frac{\sum_{t=0}^{T-1}\gamma_t w'^{(t)}}{\sum_{t=0}^{T-1}\gamma_t}.
\]
By $G$-Lipschitzness (Assumption 17) and Jensen's inequality,
\[
\epsilon_{\mathrm{avg\text{-}stab}} \le G\,\mathbb{E}\big\|w^{(T)}_{\mathrm{avg}} - w'^{(T)}_{\mathrm{avg}}\big\| = G\,\mathbb{E}\Big\|\frac{\sum_{t=0}^{T-1}\gamma_t\big(w^{(t)}-w'^{(t)}\big)}{\sum_{t=0}^{T-1}\gamma_t}\Big\| \le \frac{G}{\sum_{t=0}^{T-1}\gamma_t}\sum_{t=0}^{T-1}\gamma_t\,\mathbb{E}\big\|w^{(t)}-w'^{(t)}\big\|. \tag{120}
\]
For each $t$, applying the non-convex stability bound in Corollary 33 at horizon $t$ (and dividing both sides by $G$), we have
\[
\mathbb{E}\big\|w^{(t)}-w'^{(t)}\big\| \le \frac{1}{G}\epsilon_{\mathrm{stab}}(t) \le \Big(\frac{4Cv^aC_{w_0}}{\delta} + \frac{v^a}{Gmn} + \frac{4G}{mn}\Big)t^{p} + \frac{2GCv^a}{\delta(1-\lambda)}t^{q}, \tag{121}
\]
where we denote
\[
a := \frac{1}{2+vL}, \qquad p := \frac{1+vL}{2+vL} = 1-a, \qquad q := \frac{vL}{2+vL} = 1-2a.
\]
Substituting (121) into (120) and using $\gamma_t = \frac{v}{t+1}$ yields
\[
\epsilon_{\mathrm{avg\text{-}stab}} \le \frac{G}{\sum_{t=0}^{T-1}\frac{v}{t+1}}\sum_{t=0}^{T-1}\frac{v}{t+1}\Big[\Big(\frac{4Cv^aC_{w_0}}{\delta} + \frac{v^a}{Gmn} + \frac{4G}{mn}\Big)t^p + \frac{2GCv^a}{\delta(1-\lambda)}t^q\Big] \le \frac{G}{\sum_{t=0}^{T-1}\frac{1}{t+1}}\Big[\Big(\frac{4Cv^aC_{w_0}}{\delta} + \frac{v^a}{Gmn} + \frac{4G}{mn}\Big)\sum_{t=1}^{T-1}t^{p-1} + \frac{2GCv^a}{\delta(1-\lambda)}\sum_{t=1}^{T-1}t^{q-1}\Big]. \tag{122}
\]
Next, we bound the denominator and the two power sums by integrals. For $T \ge 3$, the harmonic lower bound gives
\[
\sum_{t=0}^{T-1}\frac{1}{t+1} \ge \frac{1}{2}\ln T,
\]
and for any $\theta \in (0,1]$,
\[
\sum_{t=1}^{T-1}t^{\theta-1} \le 1 + \int_1^T x^{\theta-1}\,dx = 1 + \frac{T^\theta-1}{\theta} \le \frac{2}{\theta}T^\theta.
\]
Applying these with $\theta = p$ and $\theta = q$ to (122) yields
\[
\epsilon_{\mathrm{avg\text{-}stab}} \le \frac{2G}{\ln T}\Big[\Big(\frac{4Cv^aC_{w_0}}{\delta} + \frac{v^a}{Gmn} + \frac{4G}{mn}\Big)\frac{2}{p}T^p + \frac{2GCv^a}{\delta(1-\lambda)}\cdot\frac{2}{q}T^q\Big] \le \frac{K_{\mathrm{stab},1}T^p + K_{\mathrm{stab},2}T^q}{\ln T}, \tag{123}
\]
where we set the explicit constants
\[
K_{\mathrm{stab},1} := \frac{4G}{p}\Big(\frac{4Cv^aC_{w_0}}{\delta} + \frac{v^a}{Gmn} + \frac{4G}{mn}\Big), \qquad
K_{\mathrm{stab},2} := \frac{8G^2Cv^a}{q\,\delta(1-\lambda)}.
\]
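The two integral comparisons just invoked can be checked numerically; the sketch below (illustrative, with an arbitrary horizon $T$ and a few exponents $\theta \in (0,1]$) is not part of the formal argument.

```python
import math

T = 500

# Harmonic lower bound: sum_{t=0}^{T-1} 1/(t+1) >= (1/2) ln T for T >= 3
harmonic = sum(1.0 / (t + 1) for t in range(T))
assert harmonic >= 0.5 * math.log(T)

# Power-sum bound: sum_{t=1}^{T-1} t^{theta-1} <= 1 + (T^theta - 1)/theta <= (2/theta) T^theta
for theta in (0.2, 0.5, 0.9, 1.0):
    s = sum(t ** (theta - 1) for t in range(1, T))
    assert s <= 1 + (T**theta - 1) / theta <= (2.0 / theta) * T**theta
```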
Combining (123) with the optimization error bound in Corollary 38, $\epsilon_{\mathrm{opt}} \le \frac{K_{\mathrm{opt}}}{\ln T}$, we obtain the excess risk bound
\[
\epsilon_{\mathrm{exc}} \le \epsilon_{\mathrm{avg\text{-}stab}} + \epsilon_{\mathrm{opt}} \le \frac{K_{\mathrm{stab},1}T^p + K_{\mathrm{stab},2}T^q + K_{\mathrm{opt}}}{\ln T}. \tag{124}
\]
Finally, since $p > q$, the dominant stability growth is $T^p$. Define $K_{\mathrm{stab}} := K_{\mathrm{stab},1} + K_{\mathrm{stab},2}$. Then (124) implies
\[
\epsilon_{\mathrm{exc}} \le \frac{K_{\mathrm{stab}}T^p + K_{\mathrm{opt}}}{\ln T}.
\]
Treating the right-hand side as $f(T) = \big(K_{\mathrm{stab}}T^p + K_{\mathrm{opt}}\big)/\ln T$ and using the standard balance condition (at the minimizer, the two contributions are comparable up to a $\ln T$ factor),
\[
p\,K_{\mathrm{stab}}\,T^p\ln T \approx K_{\mathrm{opt}},
\]
we obtain the explicit stopping time
\[
T^* := \Bigg(\frac{K_{\mathrm{opt}}}{p\,K_{\mathrm{stab}}\,\ln\frac{K_{\mathrm{opt}}}{pK_{\mathrm{stab}}}}\Bigg)^{\frac{1}{p}} \qquad \Big(\text{for } \frac{K_{\mathrm{opt}}}{pK_{\mathrm{stab}}} \ge e\Big), \tag{125}
\]
and substituting $T^*$ back gives an explicit minimum bound
\[
\epsilon^*_{\mathrm{exc}} \le \frac{p\,K_{\mathrm{opt}}}{\ln\frac{K_{\mathrm{opt}}}{pK_{\mathrm{stab}}}}. \tag{126}
\]
Ignoring only logarithmic factors, the optimal stopping time is determined by balancing the two terms, $K_{\mathrm{stab}}T^p \approx K_{\mathrm{opt}}$, which yields
\[
T^* = \tilde{O}\Bigg(\Bigg(\frac{\frac{rG}{v\alpha} + \frac{CG\kappa C_{w_0}}{\delta(1-\lambda)} + \frac{vG^2\kappa}{\delta(1-\lambda)} + vG^2\kappa}{\frac{GC_{w_0}}{\delta} + \frac{G^2}{mn} + \frac{G^2}{\delta(1-\lambda)}}\Bigg)^{\frac{1}{p}}\Bigg).
\]
Substituting $T^*$ back, the minimized excess risk satisfies
\[
\epsilon^*_{\mathrm{exc}} = \tilde{O}\Big(\frac{rG}{v\alpha} + \frac{CG\kappa C_{w_0}}{\delta(1-\lambda)} + \frac{vG^2\kappa}{\delta(1-\lambda)} + vG^2\kappa\Big),
\]
which makes explicit the dependence on the network parameters $(\lambda,\delta)$, the initialization scale $C_{w_0}$, and the gradient bound $G$. This completes the proof of Corollary 40.
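As a closing numerical illustration (with illustrative placeholder constants $K_{\mathrm{stab}}$, $K_{\mathrm{opt}}$, $v$, $L$, not values from the analysis), the surrogate $f(T) = (K_{\mathrm{stab}}T^p + K_{\mathrm{opt}})/\ln T$ from (124) is large for both very small and very large $T$, while the stopping time of the form (125) lands in the low-risk region:

```python
import math

# Illustrative constants (hypothetical, chosen so that K_opt/(p*K_stab) >= e)
K_stab, K_opt, v, L_s = 1.0, 100.0, 0.1, 1.0
p = (1.0 + v * L_s) / (2.0 + v * L_s)   # exponent p = (1+vL)/(2+vL)

f = lambda T: (K_stab * T**p + K_opt) / math.log(T)

# Stopping time of the form (125): T* = [K_opt / (p K_stab ln(K_opt/(p K_stab)))]^{1/p}
ratio = K_opt / (p * K_stab)
T_star = (ratio / math.log(ratio)) ** (1.0 / p)

# f blows up at both ends of the horizon range and is much smaller near T*
assert f(3.0) > f(T_star) and f(1e6) > f(T_star)
```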