Event-Triggered Gossip for Distributed Learning



Zhiyuan Zhai, Xiaojun Yuan, Fellow, IEEE, Wei Ni, Fellow, IEEE, Xin Wang, Fellow, IEEE, Rui Zhang, Fellow, IEEE, and Geoffrey Ye Li, Fellow, IEEE

Abstract—While distributed learning offers a new learning paradigm for distributed networks with no central coordination, it is constrained by the communication bottleneck between nodes. We develop a new event-triggered gossip framework for distributed learning to reduce inter-node communication overhead. The framework introduces an adaptive communication control mechanism that enables each node to autonomously decide, in a fully decentralized fashion, when to exchange model information with its neighbors based on local model deviations. We analyze the ergodic convergence of the proposed framework under nonconvex objectives and interpret the convergence guarantees under different triggering conditions. Simulation results show that the proposed framework achieves substantially lower communication overhead than state-of-the-art distributed learning methods, reducing cumulative point-to-point transmissions by 71.61% with only a marginal performance loss compared with the conventional full-communication baseline.

Index Terms—Distributed learning, event-triggered gossip, communication overhead.

I. INTRODUCTION

This section provides an overview of distributed learning systems and the communication bottlenecks that limit their scalability in decentralized settings. We first discuss existing communication-reduction strategies and identify their limitations in adaptively controlling information exchange. Then, we present the motivation, contributions, and structure of the event-triggered gossip framework proposed in this paper to address these limitations.

A. Motivation and Challenges

Machine learning has been widely used for intelligent data analytics and decision-making in domains such as autonomous driving, the Internet of Things (IoT), and smart healthcare. Centralized training usually aggregates raw data at a single location and is sometimes infeasible due to communication bandwidth limitations, privacy constraints, and single-point-of-failure risks [1]–[5]. Distributed learning, such as federated learning (FL), enables multiple devices, each holding its local dataset, to collaboratively train a shared model without relying on a central coordinator. Each node communicates only with its direct neighbors and exchanges model states or gradients to achieve consensus over time [6]–[10]. This architecture improves scalability, robustness, and data privacy, making it well suited for bandwidth- and energy-limited systems.

Despite these advantages, distributed learning often suffers from a severe communication bottleneck. To maintain consensus among nodes, standard algorithms (e.g., gossip-based methods [11]–[13]) require all devices to exchange information at every iteration, leading to massive point-to-point transmissions and rapid exhaustion of bandwidth resources [14], [15]. On the other hand, excessive suppression of communication may lead to model drift across nodes and degrade learning accuracy [16], [17]. This raises a central question: How can we design communication-efficient distributed learning algorithms that preserve high model accuracy while significantly reducing inter-node transmissions?

B. Related Work

A variety of research efforts have explored methods to reduce this communication burden. We divide them into three major categories and discuss each of them.

Model and Gradient Compression: One representative line of research reduces the size of exchanged updates through quantization or sparsification.
For instance, ternary gradients (TernGrad) [18] and quantized stochastic gradient descent (QSGD) [19] quantize gradients into low-bit representations while maintaining convergence guarantees. Later works extend these ideas by transmitting only the most significant components, as in Top-k gradient sparsification [20] or memory-based error compensation methods [21]. System-level designs, such as deep gradient compression (DGC) [22], further combine momentum correction and local gradient clipping to achieve up to 600× communication reduction without accuracy loss. These compression-based methods focus on minimizing the payload of each communication round.

Periodic or Infrequent Communication: Another major direction reduces communication frequency by allowing devices to perform multiple local updates between synchronization rounds. The local stochastic gradient descent (SGD) framework [23] provides rigorous convergence analysis under delayed averaging, while [24] extends it with adaptive synchronization intervals. In FL, federated averaging (FedAvg) [25] adopts similar periodic aggregation and partial participation principles to trade off communication and computation. More recently, the hybrid FL algorithm FedGiA [26] combines periodic communication with a hybrid gradient descent and inexact alternating direction method of multipliers (ADMM) update to reduce communication rounds under mild convergence conditions. A unified analysis of such approaches is provided in cooperative SGD [27]. These approaches reduce communication by decreasing synchronization frequency, but may still suffer from model divergence under heterogeneous data distributions.

Over-the-Air Aggregation: At the wireless physical layer, an emerging body of work exploits the superposition property of multiple-access channels to aggregate model updates "over the air." Over-the-air (OTA) computation [28]–[32] enables simultaneous analog transmission of local gradients, such that the received waveform inherently represents their sum. Recent surveys [33], [34] summarize advances in channel inversion, power control, and error compensation that make OTA a promising paradigm for low-latency, large-scale distributed learning. These techniques focus on improving the efficiency of physical-layer aggregation.

Although these methods have substantially alleviated the communication burden in distributed learning, they focus primarily on optimizing the amount, frequency, or physical efficiency of information exchange. Little attention has been paid to designing a communication control mechanism at the algorithmic level that adaptively determines when devices should exchange their local models. Such a mechanism represents a complementary perspective to the above communication-reduction approaches, providing an alternative means to reduce inter-node communication while maintaining learning accuracy.

C. Contributions

This paper develops a new event-triggered gossip framework for distributed learning, which significantly reduces redundant inter-node transmissions while maintaining model accuracy. The core idea is to introduce a communication control mechanism at the algorithmic level, allowing each device to autonomously decide when to exchange its model with neighbors based on local dynamics. We prove that the proposed algorithm has the potential to achieve the same convergence rate as centralized SGD. Extensive experiments verify that our method can drastically reduce communication volume without sacrificing learning performance. The key contributions of this paper are summarized as follows:

• Event-triggered communication mechanism: We design a novel gossip-based distributed learning algorithm where each node triggers communication according to its local model deviation.
This mechanism enables asynchronous and data-dependent information exchange without any global coordination, effectively suppressing redundant transmissions while preserving consensus.

• Unified convergence analysis: For the first time, we establish the ergodic convergence bound for event-triggered distributed learning under nonconvex objectives. Our analysis reveals that the convergence rate depends jointly on the stepsize, the network spectral gap, and the triggering thresholds, providing explicit insights into the trade-off between learning accuracy and communication cost.

• Behavior under different triggering strategies: We analyze the convergence of the proposed algorithm under three representative triggering policies: a constant zero threshold, a fixed nonzero threshold, and a gradually decaying threshold. It is revealed that under a decaying threshold policy, our algorithm achieves the same convergence rate as centralized SGD, i.e., O(T^{−1/2}).

• Comprehensive experimental validation: Extensive simulations on the MNIST [35] and Fashion-MNIST [36] datasets demonstrate that the proposed framework achieves learning accuracy comparable to the full-communication scheme while substantially reducing the communication volume. The algorithm reduces the cumulative point-to-point transmissions by 71.6% on Fashion-MNIST and 69.4% on MNIST, with less than 1% accuracy degradation.

Fig. 1. An example of the event-triggered communication model.

The rest of this paper is organized as follows. Section II presents the system model. Section III introduces the proposed event-triggered gossip formulation. Section IV provides the convergence analysis under different triggering strategies. Section V reports simulation results and performance comparisons. Finally, Section VI concludes the paper.

Notation: R denotes the set of real numbers.
We represent scalars by regular letters, vectors by bold lowercase letters, and matrices by bold uppercase letters. (·)^T denotes transpose. We denote the i-th largest eigenvalue of a matrix by λ_i(·), and the (i, j)-th entry of a matrix X by X_{ij}. We denote the Euclidean norm of vectors/matrices by ∥·∥ or ∥·∥₂, and the Frobenius norm of matrices by ∥·∥_F. E[·] denotes the expectation operator. We denote the all-one vector by 1 and the identity matrix by I. O(·) denotes an upper bound up to a constant factor; Θ(·) denotes a tight bound up to constant factors; e_i ∈ R^n denotes the i-th canonical basis vector, whose i-th entry is 1 and whose remaining entries are 0.

II. SYSTEM MODEL

In this section, we present the system model of distributed learning. We start with the motivation for adopting a distributed training architecture, followed by the corresponding network topology and optimization objective.

A. Background

In many modern applications, such as edge intelligence and IoT systems, training data are generated and stored across multiple devices (e.g., mobile phones, sensors, or edge servers). A straightforward solution is centralized learning, where all devices upload their raw data to a central server, which then trains a global model using the aggregated dataset. This approach often suffers from several practical limitations: (i) transmitting all raw data leads to high communication overhead, (ii) sharing raw data may violate privacy constraints, and (iii) the central server becomes a single point of failure.

To overcome these issues, distributed learning has emerged as an attractive alternative. Instead of sending raw data to a central node, each device keeps its local dataset, performs local training, and exchanges only model parameters or gradients with neighboring devices.
In this way, the system collaboratively learns a shared model with significantly reduced communication cost and improved privacy protection.

B. Network and Learning Objective

We model the considered system as a decentralized network consisting of n computing nodes V = {1, ..., n} connected by a static, undirected, and connected graph G = (V, E), as illustrated in Fig. 1. Here, E denotes the set of communication links. If (i, j) ∈ E, nodes i and j are said to be neighbors, and N(i) denotes the neighbor set of node i.

Communication is restricted to one-hop neighbor exchanges only: each node can directly communicate only with its immediate neighbors in the graph. Multi-hop relaying, centralized coordination, and global broadcasts are not considered. This setting captures practical peer-to-peer or decentralized edge networks, where only local connectivity is available.

Each node i holds a local dataset drawn from a data distribution D_i. Based on its own data, node i minimizes the local expected loss function

f_i(x) ≜ E_{ξ∼D_i}[ℓ(x; ξ)], x ∈ R^d, (1)

where x denotes the model parameter vector, ξ represents a random data sample at node i, and ℓ(x; ξ) is the sample-wise loss (e.g., the cross-entropy loss for classification or the squared loss for regression).

The goal of distributed learning is to collaboratively optimize a shared global model by minimizing the average loss over all nodes in a peer-to-peer fashion, as given by

min_{x∈R^d} f(x) ≜ (1/n) Σ_{i=1}^n f_i(x). (2)

Since the data remain locally stored and only model information is exchanged, the above problem must be solved through decentralized cooperation among neighboring nodes.

III. PROPOSED EVENT-TRIGGERED FRAMEWORK

In this section, we introduce the proposed event-triggered communication framework and the corresponding local update rules.
We first explain the motivation for reducing communication in distributed learning, and then present the event-triggered gossip mechanism, together with a compact matrix-form description of the overall system dynamics.

A. Motivation for Event-Triggered Communication

In existing distributed learning algorithms, communication typically follows a full-communication strategy. Specifically, at every iteration, each node exchanges its current model (or gradient) with all of its neighbors and performs a consensus update. Although this approach ensures fast information mixing across the network, it requires communication at every round and over every link.

When the number of devices or the training horizon is large, such frequent transmissions lead to substantial communication overhead, which may quickly exhaust bandwidth and energy resources in practical edge networks. In many scenarios, however, consecutive local models change only slightly, making repeated transmissions largely redundant. This observation motivates the following question: Can nodes communicate only when necessary, while maintaining comparable learning accuracy and consensus performance?

To answer this question, we propose an event-triggered communication mechanism. Instead of communicating at every iteration, each node autonomously decides whether to transmit its model according to its local model evolution. Communication is triggered only when the local model deviates sufficiently from its last transmitted state, thereby suppressing redundant exchanges and significantly reducing the overall communication cost.

B. Event-Triggered Gossip

We consider an iterative distributed learning process. At iteration t, every node maintains a local model, computes a stochastic gradient based on its local data, and exchanges model information with its neighbors to promote consensus.
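The trigger rule just described — broadcast only when the local model has drifted from the last broadcast snapshot by at least the threshold — can be sketched in a few lines. This is a minimal illustration; the helper name `should_broadcast` is ours, not from the paper.

```python
import numpy as np

def should_broadcast(x_local, x_snapshot, tau_t):
    """Event trigger: fire when the local model has drifted from the
    last broadcast snapshot by at least the threshold tau_t."""
    drift = np.linalg.norm(x_local - x_snapshot)
    return drift >= tau_t

# Toy check: a small drift stays silent, a large one triggers.
x_hat = np.zeros(4)  # last broadcast snapshot
assert not should_broadcast(x_hat + 0.01, x_hat, tau_t=0.1)
assert should_broadcast(x_hat + 1.0, x_hat, tau_t=0.1)
```

A node that does not fire sends nothing; its neighbors simply keep reusing the cached copy of its last broadcast, as formalized in the cache-update rule below.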
The communication among nodes is governed by an event-triggered mechanism, which adaptively determines whether a node should transmit its local model based on its model evolution. Each iteration consists of the following three steps.

Step 1: Communication (broadcast and receive). Let x_{i,t} denote the local model of node i in the t-th iteration. Each node i maintains two states in addition to x_{i,t}:

• Broadcast snapshot x̂_{i,t} ∈ R^d: the last model of node i broadcast to its neighbors (initialized by x̂_{i,0} = x_{i,0}).

• Receive caches x̃_{j→i,t} ∈ R^d: at receiver i, the most recently received model from neighbor j (initialized by x̃_{j→i,−1} = x_{j,0}). We enforce x̃_{i→i,t} ≡ x_{i,t}.

At the beginning of round t, each node i computes the drift

e_i^t ≜ x_{i,t} − x̂_{i,t}. (3)

Given a nonnegative threshold τ_t ≥ 0, node i triggers communication if ∥e_i^t∥ ≥ τ_t, immediately broadcasting x_{i,t} to all j ∈ N(i) and setting x̂_{i,t} ← x_{i,t}. After receiving the models, each node i forms a convex combination of its own and its neighbors' models using the mixing weights W ∈ R^{n×n}:

x_{i,t,mix} = Σ_{j=1}^n W_{ji} x̃_{j→i,t}, (4)

where x̃_{j→i,t} denotes the cached model received from node j, defined as

x̃_{j→i,t} = x_{j,t} if ∥e_j^t∥ ≥ τ_t, and x̃_{j→i,t} = x̃_{j→i,t−1} otherwise (reuse cache). (5)

Moreover, W is a mixing matrix supported on the network graph G, satisfying

W_{ji} > 0 only if j ∈ N(i) ∪ {i}; (6)
W_{ji} = 0 otherwise. (7)

Step 2: Local gradient computation. Each node i computes a stochastic gradient on its current local model by sampling ξ_{i,t}:¹

g_{i,t} ≜ g_i(x_{i,t}; ξ_{i,t}) = ∇_x ℓ(x_{i,t}; ξ_{i,t}). (8)

Step 3: Model update. Each node i then updates its model using the mixed model x_{i,t,mix} and the computed stochastic gradient:

x_{i,t+1} = x_{i,t,mix} − η g_{i,t}, (9)

where η > 0 is a constant stepsize. We summarize the above operations in Algorithm 1.

Algorithm 1: Event-Triggered Gossip: Local Procedure at Node i
Input: Initial model x_{i,0} = x_0 ∈ R^d; constant stepsize η > 0; neighbor set N(i); mixing weights W; thresholds {τ_t}_{t=0}^{T−1}; rounds T.
Data: Snapshot x̂_{i,0} ← x_{i,0}; caches x̃_{j→i,−1} ← x_{j,0} for all j ∈ N(i); set x̃_{i→i,−1} ← x_{i,0}.
1: for t = 0, 1, ..., T − 1 do
2:   Trigger test and broadcast on the current model:
3:   if ∥x_{i,t} − x̂_{i,t}∥ ≥ τ_t then
4:     broadcast the vector x_{i,t} to all j ∈ N(i); x̂_{i,t} ← x_{i,t}
5:   Receive window and cache update:
6:   for j ∈ N(i) do
7:     if a message from j is received in round t then
8:       x̃_{j→i,t} ← received x_{j,t}
9:     else
10:      x̃_{j→i,t} ← x̃_{j→i,t−1}
11:  x̃_{i→i,t} ← x_{i,t}
12:  Local stochastic gradient on the current model:
13:  sample ξ_{i,t} ∼ D_i and compute g_{i,t} = g_i(x_{i,t}; ξ_{i,t})
14:  One-step gossip mixing with available models:
15:  x_{i,t,mix} ← Σ_{j=1}^n W_{ji} x̃_{j→i,t}
16:  Local update:
17:  x_{i,t+1} ← x_{i,t,mix} − η g_{i,t}
Output: Final average model x̄_T.

¹Since the stochastic gradient is evaluated at the current local model x_{i,t}, it does not depend on the newly received neighbor information. Therefore, the gradient computation does not need to wait for the completion of the communication step, and Steps 1 and 2 can be executed in parallel in practice.

C. Obsolescence Error and One-Round Update

From a global point of view, we introduce the matrix form of the system update. At round t, we stack all local models column-wise as

X_t ≜ [x_{1,t} x_{2,t} ··· x_{n,t}] ∈ R^{d×n}, (10)

where the i-th column is the model of node i at round t. Similarly, we collect the stochastic gradients as

G_t ≜ [g_{1,t} g_{2,t} ··· g_{n,t}] ∈ R^{d×n}. (11)

When mixing on X_t, the cached model x̃_{j→i,t} at node i may differ from the true current model x_{j,t} due to event-triggered communication. We define the per-edge obsolescence error as

v_{j→i,t} ≜ x̃_{j→i,t} − x_{j,t}, v_{i→i,t} = 0.
(12)

If node j triggers communication at round t, then v_{j→i,t} = 0 for all i ∈ N(j) ∪ {j}; otherwise,

∥v_{j→i,t}∥ ≤ τ_t, ∀ i, j, (13)

since the cache stores a past broadcast of j and because of the triggering rule. Aggregating the obsolescence errors in (12) across neighbors, the total error injected at node i is

v_{i,t} ≜ Σ_{j=1}^n W_{ji} v_{j→i,t}. (14)

We collect the v_{i,t} into an error matrix,

V_t ≜ [v_{1,t} v_{2,t} ··· v_{n,t}] ∈ R^{d×n}. (15)

Now, we obtain the matrix form of the system update in (9) as

X_{t+1} = X_t W − η G_t + V_t. (16)

Equation (16) compactly summarizes the overall dynamics: the first term, X_t W, corresponds to gossip mixing; the second term, −η G_t, is the local stochastic gradient update; and the last term, V_t, captures the perturbation induced by event-triggered communication.²

IV. CONVERGENCE ANALYSIS

In this section, we analyze the convergence of the proposed event-triggered gossip algorithm. We first state the assumptions and characterize the dynamics of the network-wide average iterate. We then establish a unified ergodic convergence bound under general triggering thresholds and discuss its implications under different thresholding strategies.

A. Assumptions for Convergence Analysis

Assumption 1 (Mixing Weights). The mixing matrix W ∈ R^{n×n} is nonnegative, doubly stochastic, and symmetric, i.e., W1 = 1, 1^T W = 1^T, W = W^T. Let J ≜ (1/n)11^T. The spectral contraction factor is δ ≜ ∥W − J∥₂² < 1.

Assumption 2 (Smoothness). Each local objective function f_i : R^d → R is differentiable, and its gradient ∇f_i(·) is Lipschitz continuous with parameter L > 0. That is, for all x, y ∈ R^d and all i ∈ [n],

∥∇f_i(x) − ∇f_i(y)∥ ≤ L∥x − y∥.

Assumption 3 (Bounded Variance). For each node i ∈ [n], the stochastic gradient g_{i,t} = g_i(x_{i,t}; ξ_{i,t}) is an unbiased estimator of ∇f_i(x_{i,t}), and its variance is uniformly bounded.
Moreover, the divergence of the local gradients from the global gradient is also bounded. Formally, for all x ∈ R^d,

E∥g_{i,t} − ∇f_i(x_{i,t})∥² ≤ α², ∀ i ∈ [n],
E∥∇f_i(x) − ∇f(x)∥² ≤ β², ∀ i ∈ [n].

The first expectation is taken with respect to the data sampling, ξ_{i,t} ∼ D_i. The second expectation is taken over a uniformly random choice of the node index, i ∼ U([n]), which corresponds to measuring the average gradient discrepancy across nodes.

²When V_t = 0 (i.e., all nodes communicate at every round), (16) reduces to the standard full-communication decentralized SGD update.

The above three assumptions are common in the literature on decentralized stochastic optimization, e.g., [11], [37]–[40]. In Assumption 1, symmetry and double stochasticity ensure that local updates are convex combinations of neighbor models and that the network-wide average is preserved. The spectral contraction factor δ = ∥W − J∥₂² < 1 further guarantees that the consensus error decays geometrically, which is equivalent to requiring that all eigenvalues of W have magnitude strictly less than 1, except the largest one, which equals 1. Assumption 2 imposes L-smoothness on the local objectives, meaning that their gradients are Lipschitz continuous. This condition controls how fast the gradients can change with respect to the model parameters and is a standard requirement for analyzing the stability of gradient-based methods. Assumption 3 bounds both the stochastic gradient variance and the divergence across devices: α² quantifies the noise introduced by local data sampling, and β² controls the level of heterogeneity among devices. Together, these bounds ensure that the stochasticity and non-independent and identically distributed (non-i.i.d.) effects remain well behaved and facilitate analysis.

B. Global Average Dynamics

Define the column-wise averages

x̄_t ≜ (1/n) X_t 1 ∈ R^d, ḡ_t ≜ (1/n) G_t 1 ∈ R^d, (17)

and the average perturbation

ē_t ≜ (1/n) V_t 1 ∈ R^d. (18)

Multiplying the matrix update (16) by (1/n)1 and using the stochasticity W1 = 1 in Assumption 1, we obtain

x̄_{t+1} = x̄_t − η ḡ_t + ē_t. (19)

Hence, gossip mixing (by the mixing matrix W) cancels out in the average (it preserves x̄_t), and event-triggering enters the average dynamics only through the additive perturbation ē_t. Recall v_{i,t} = Σ_{j=1}^n W_{ji} v_{j→i,t} in (14) and ∥v_{j→i,t}∥ ≤ τ_t in (13). Then, we have

∥ē_t∥ = ∥(1/n) Σ_{i=1}^n v_{i,t}∥ ≤ (1/n) Σ_{i=1}^n ∥v_{i,t}∥ ≤ (1/n) Σ_{i=1}^n Σ_{j=1}^n W_{ji} ∥v_{j→i,t}∥ ≤ τ_t. (20)

Consequently, the average iterate (19) behaves like a centralized SGD step with an additive perturbation whose magnitude is controlled by the trigger threshold τ_t (e.g., if τ_t ≡ 0, then ē_t = 0).

C. Ergodic Convergence Result

With the above assumptions and analyses, we now establish the convergence guarantee for the proposed event-triggered gossip algorithm in the following theorem, with the proof provided in Appendix A.

Theorem 1. Under Assumptions 1–3 and any stepsize η > 0 such that

Γ ≜ 1 − 27nη²L²/(1 − √δ)² > 0,
Δ ≜ 1/2 − 27nη²L²/[(1 − √δ)² Γ] > 0,

the iterates of Algorithm 1 satisfy the ergodic convergence bound

(1/T) Σ_{t=0}^{T−1} E∥∇f(x̄_t)∥² ≤ (f(x̄_0) − f*)/(ηΔT) + (ηL²/(ΓΔ)) [3n²α²/(1 − δ) + 27η²nβ²/(1 − √δ)²] + ηα²/(nΔ) + (9ηL²/[(1 − √δ)² ΓΔ]) (1/T) Σ_{t=0}^{T−1} nτ_t² + (1/(ηΔ)) (1/T) Σ_{t=0}^{T−1} τ_t². (21)

Theorem 1 provides a unified ergodic convergence guarantee for the proposed event-triggered gossip algorithm under any feasible thresholds {τ_t}; see (21).
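As a quick numerical sanity check of the stability constants in Theorem 1, the sketch below evaluates Γ and Δ for assumed toy values of n, L, and δ (not taken from the paper's experiments) and confirms that a stepsize strictly inside the admissible range keeps both positive.

```python
import math

def stability_constants(n, eta, L, delta):
    """Evaluate Gamma and Delta from Theorem 1: with
    c = 27 n eta^2 L^2 / (1 - sqrt(delta))^2,
    Gamma = 1 - c and Delta = 1/2 - c / Gamma."""
    c = 27 * n * eta**2 * L**2 / (1 - math.sqrt(delta))**2
    gamma = 1 - c
    delta_const = 0.5 - c / gamma
    return gamma, delta_const

# Assumed toy values: n = 20 nodes, L = 1, delta = 0.5. The boundary
# stepsize eta_bound = (1 - sqrt(delta)) / (9 sqrt(n) L) gives c = 1/3
# and Delta = 0 exactly; any eta strictly below it keeps Gamma, Delta > 0.
eta_bound = (1 - math.sqrt(0.5)) / (9 * math.sqrt(20) * 1.0)
g, d = stability_constants(n=20, eta=0.5 * eta_bound, L=1.0, delta=0.5)
assert g > 0 and d > 0
```

This mirrors the stepsize condition derived for Case A below: the constraint 27nη²L²/(1 − √δ)² < 1/3 is exactly what keeps Δ bounded away from zero.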
The bound makes explicit how the convergence depends jointly on the stepsize η, the spectral contraction factor (1 − √δ), the stochastic gradient variance α², the data heterogeneity β², and the triggering sequence {τ_t} through the stability constants Γ and Δ. In particular, the event-triggered mechanism injects perturbations into the average-iterate dynamics of x̄_t via the terms controlled by {τ_t}, as shown in (20). A smaller τ_t yields more frequent communication and tighter tracking of the centralized trajectory, whereas a larger τ_t reduces communication at the expense of a non-vanishing steady-state bias in (21).

D. Triggering Threshold Discussion

To further interpret (21) and clarify the accuracy–communication trade-off, we specialize Theorem 1 to three representative regimes of the threshold sequence {τ_t}. Case A corresponds to full communication (τ_t ≡ 0), which recovers the standard nonconvex rate; Case B studies a constant threshold (τ_t ≡ τ > 0), which leads to a bounded steady-state bias determined by τ; Case C considers decaying thresholds {τ_t}, where the threshold shrinks over time so that triggering gradually becomes easier.

Case A (full communication, τ_t ≡ 0). When every round involves communication (τ_t ≡ 0), the ergodic bound in (21) reduces to three dominant terms:

(1/T) Σ_{t=0}^{T−1} E∥∇f(x̄_t)∥² ≤ K₁/(ηT) + K₂η + K₃η³,

where the constants are explicitly given by

K₁ = (f(x̄_0) − f*)/Δ;
K₂ = (L²/(ΓΔ)) · 3n²α²/(1 − δ) + α²/(nΔ);
K₃ = 27L²nβ²/[(1 − √δ)² ΓΔ].

To ensure the validity of this bound, the stepsize η must satisfy three conditions simultaneously. First, the smoothness argument in the descent step requires η ≤ 1/L. Second, the stability conditions Γ > 0 and Δ > 0 are guaranteed by the sufficient inequality 27nη²L²/(1 − √δ)² < 1/3, which yields η < (1 − √δ)/(9√n L).
Third, combining η ≤ 1/L and η < (1 − √δ)/(9√n L) gives

η ≤ η_max := min{1/L, (1 − √δ)/(9√n L)}.

Using η ≤ η_max, we bound the cubic term by K₃η³ ≤ K₃η²_max η, so the right-hand side (RHS) is upper-bounded by K₁/(ηT) + K̃₂η with K̃₂ := K₂ + K₃η²_max. Minimizing this upper bound gives η* = √(K₁/(K̃₂T)) = Θ(T^{−1/2}). Combining these conditions, a convenient stepsize choice is

η = Θ(min{1/L, (1 − √δ)/(√n L), 1/√T}).

With this setting, the algorithm achieves the standard convergence rate, as given by

(1/T) Σ_{t=0}^{T−1} E∥∇f(x̄_t)∥² = O(1/√T).

Hence, with full communication, the proposed algorithm achieves the same O(1/√T) rate as centralized SGD [41]–[43] in the nonconvex setting, while the hidden constants in the convergence bound are affected by the stochastic gradient variance (α > 0), the data heterogeneity (β > 0), and the network spectral gap (1 − √δ).

Case B (constant threshold, τ_t ≡ τ > 0). Plugging τ_t ≡ τ into (21) yields, for any feasible stepsize η ∈ (0, η_max] with η_max := min{1/L, (1 − √δ)/(9√n L)},

(1/T) Σ_{t=0}^{T−1} E∥∇f(x̄_t)∥² ≤ K₁/(ηT) + K₂η + K₃/η + K₄η³,

where

K₁ = (f(x̄_0) − f*)/Δ;
K₂ = α²/(nΔ) + (L²/(ΓΔ)) · 3n²α²/(1 − δ) + 9L²nτ²/[(1 − √δ)² ΓΔ];
K₃ = τ²/Δ;
K₄ = 27L²nβ²/[(1 − √δ)² ΓΔ].

Since η ≤ η_max, we have K₄η³ ≤ K₄η²_max η. Let K̃₂ := K₂ + K₄η²_max. Then,

(1/T) Σ_{t=0}^{T−1} E∥∇f(x̄_t)∥² ≤ K₁/(ηT) + K̃₂η + K₃/η. (22)

Minimizing the RHS of (22) over (0, η_max] gives the implementable choice

η = min{√((K₁/T + K₃)/K̃₂), η_max},

and, therefore,

(1/T) Σ_{t=0}^{T−1} E∥∇f(x̄_t)∥² ≤ 2√(K̃₂(K₁/T + K₃)) = O(1/√T) + 2√(K̃₂K₃).

Hence, with a constant (non-decaying) threshold, the bound consists of a transient O(T^{−1/2}) term plus a non-vanishing steady-state bias 2√(K̃₂K₃); the ergodic gradient norm does not converge to 0 as T → ∞.
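The Case B stepsize rule and its bias floor can be checked numerically. The sketch below uses assumed placeholder constants (the values of K₁, K̃₂, K₃ are illustrative, not from the paper) to show the bound shrinking with T while never dropping below the steady-state bias 2√(K̃₂K₃).

```python
import math

def case_b_stepsize_and_bound(K1, K2t, K3, T, eta_max=0.1):
    """Implementable Case B stepsize min{sqrt((K1/T + K3)/K2t), eta_max}
    and the corresponding ergodic bound 2 sqrt(K2t (K1/T + K3)),
    where K2t stands for K2 + K4 * eta_max**2."""
    eta = min(math.sqrt((K1 / T + K3) / K2t), eta_max)
    bound = 2 * math.sqrt(K2t * (K1 / T + K3))
    return eta, bound

# Placeholder constants: the bound decays with T toward, but never below,
# the non-vanishing bias 2 sqrt(K2t * K3).
_, b_small_T = case_b_stepsize_and_bound(K1=1.0, K2t=4.0, K3=0.01, T=100)
_, b_large_T = case_b_stepsize_and_bound(K1=1.0, K2t=4.0, K3=0.01, T=10**6)
bias = 2 * math.sqrt(4.0 * 0.01)
assert bias < b_large_T < b_small_T
```

The closed-form bound 2√(K̃₂(K₁/T + K₃)) corresponds to the unconstrained minimizer; when the cap η_max binds, the realized bound can only be larger.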
The constants depend on the stochastic gradient variance (α), the data heterogeneity (β), and the spectral gap (1 − √δ) through Γ and Δ.

Case C (decaying threshold, {τ_t}). Let τ̄² := (1/T) Σ_{t=0}^{T−1} τ_t². Plugging this into (21) gives

(1/T) Σ_{t=0}^{T−1} E∥∇f(x̄_t)∥² ≤ K₁/(ηT) + (K₂ + K₄τ̄²)η + b_T/η + K₃η³,

for any feasible stepsize η ∈ (0, η_max] with η_max := min{1/L, (1 − √δ)/(9√n L)}, where

K₁ = (f(x̄_0) − f*)/Δ;
K₂ = α²/(nΔ) + 3n²L²α²/[(1 − δ)ΓΔ];
K₃ = 27L²nβ²/[(1 − √δ)² ΓΔ];
K₄ = 9L²n/[(1 − √δ)² ΓΔ];
b_T = τ̄²/Δ.

Since η ≤ η_max, we have K₃η³ ≤ K₃η²_max η. Define ã_T := K₂ + K₃η²_max + K₄τ̄². Then,

(1/T) Σ_{t=0}^{T−1} E∥∇f(x̄_t)∥² ≤ K₁/(ηT) + ã_T η + b_T/η. (23)

Minimizing the RHS of (23) over (0, η_max] yields the implementable choice

η = min{√((K₁/T + b_T)/ã_T), η_max},

and, therefore, the upper bound is given by

(1/T) Σ_{t=0}^{T−1} E∥∇f(x̄_t)∥² ≤ 2√(ã_T(K₁/T + b_T)). (24)

Two scenarios need to be considered.

Scenario (i): If τ_t = τ₀/√(t + 1), then τ̄² = Θ(log T / T), so b_T = Θ(log T/(ΔT)) and ã_T = (K₂ + K₃η²_max) + Θ(log T / T). Defining K₂′ := K₂ + K₃η²_max, we can write ã_T = K₂′ + Θ(log T / T). Note that K₁/T = o(log T / T) and hence K₁/T + b_T = Θ(log T / T), while ã_T = Θ(1). The RHS of (24) scales as 2√(Θ(1) · Θ(log T / T)) = O(√(log T / T)), which yields

(1/T) Σ_{t=0}^{T−1} E∥∇f(x̄_t)∥² = O(√(log T / T)).

Ignoring the slowly varying logarithmic factor, the convergence rate can be described in soft-O form as Õ(T^{−1/2}).

Scenario (ii): If τ_t = τ₀/(t + 1), then τ̄² = Θ(1/T), so b_T = Θ(1/(ΔT)) and ã_T = (K₂ + K₃η²_max) + Θ(1/T). Defining K₂′ := K₂ + K₃η²_max, we can write ã_T = K₂′ + Θ(1/T). Note that K₁/T = Θ(1/T) and b_T = Θ(1/T), and hence K₁/T + b_T = Θ(1/T), while ã_T = Θ(1).
The RHS of (24) scales as 2 q Θ(1) · Θ( 1 T ) = O  T − 1 / 2  , which yields 1 T T − 1 X t =0 E ∥∇ f ( ¯ x t ) ∥ 2 = O  T − 1 / 2  . This corresponds to the standard stochastic noncon ve x con v er- gence rate O ( T − 1 / 2 ) of centralized SGD. In both scenarios, the hidden constants depend on the stochastic gradient v ariance ( α ), the data heterogeneity ( β ), and the spectral gap (1 − √ δ ) through Γ and ∆ . The above discussion reveals that the triggering threshold { τ t } plays a decisi ve role in shaping both the communication cost and the achie vable accuracy . When τ t ≡ 0 , every round in volv es communication and the algorithm reduces to standard distributed learning, recov ering the O ( T − 1 / 2 ) con vergence rate of centralized SGD. When τ t is a fixed positi ve con- stant, communication is substantially reduced, but the er- godic gradient norm con v erges only up to a non-vanishing steady-state bias proportional to τ , reflecting persistent model discrepancies. In contrast, when τ t decays ov er time (e.g., τ t = Θ( t − 1 / 2 ) or τ t = Θ( t − 1 ) ), the injected perturbation gradually diminishes and the full O ( T − 1 / 2 ) rate is recov ered. This indicates that properly decaying thresholds enable the proposed ev ent-triggered gossip method to match the theo- retical performance of full-communication decentralized SGD while significantly reducing inter-node transmissions. V . S I M U L A T I O N R E S U LT S In this section, we ev aluate the performance of the proposed ev ent-triggered gossip algorithm through extensi ve simula- tions. W e first describe the experimental settings and datasets, and then examine the impact of the triggering threshold and network sparsity on learning accuracy and communication cost. Finally , we compare the proposed method with state- of-the-art distributed learning schemes. A. 
Simulation Settings

We consider a distributed learning system consisting of $M = 20$ edge devices connected through an undirected communication graph. The learning task is image classification on the MNIST and Fashion-MNIST datasets. MNIST contains grayscale handwritten digit images (0–9) and is widely regarded as a relatively simple benchmark, whereas Fashion-MNIST consists of grayscale clothing images from 10 apparel categories and is generally more challenging to learn due to higher visual complexity. Among all samples, 10,000 are used for training and 10,000 for testing. Since the MNIST and Fashion-MNIST datasets contain 10 distinct classes, we partition the devices into 10 groups of equal size, where each group is assigned a dataset corresponding to one specific class. This setup ensures that the network experiences data heterogeneity, as described in [25].

Fig. 2. Test accuracy of the averaged model under different triggering thresholds $\epsilon$ (MNIST).

The network sparsity is defined as the proportion of zero elements in the mixing matrix, specifically, the number of zero elements in $W$ divided by $n^2$. Unless otherwise specified, we conduct simulations under a network sparsity of 0.3.

Each local model is implemented as a convolutional neural network (CNN) with the following structure: two convolutional layers with $5\times 5$ kernels and channel sizes $(1,10)$ and $(10,20)$, followed by a batch-normalization (or dropout) layer and two fully connected layers with dimensions 320–50–10. The ReLU activation function is applied after each layer. The network outputs a 10-dimensional vector representing class logits. All models are initialized identically and trained using SGD with a constant learning rate $\eta = 0.02$.
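To make the sparsity definition above concrete, the following sketch builds a doubly stochastic mixing matrix on a toy ring graph and measures the fraction of zero entries. The Metropolis-Hastings weight rule, the ring topology, and all names are illustrative assumptions for exposition, not the paper's actual simulation code.

```python
import numpy as np

def metropolis_mixing_matrix(adj):
    """Build a symmetric, doubly stochastic Metropolis-Hastings mixing
    matrix W from a 0/1 adjacency matrix (no self-loops in adj)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()  # put the remaining mass on the diagonal
    return W

def sparsity(W):
    """Fraction of zero entries in W, matching the definition in the text."""
    return float((W == 0).sum()) / W.size

# Toy example: a ring of 6 nodes, so each node has exactly two neighbors.
n = 6
adj = np.zeros((n, n), dtype=int)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1

W = metropolis_mixing_matrix(adj)
assert np.allclose(W.sum(axis=1), 1.0)  # rows sum to one
```

On this ring, each row of $W$ has three nonzero entries (itself and two neighbors), so the sparsity is $1 - 18/36 = 0.5$; denser graphs drive the sparsity toward the 0.3 used in the experiments.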
In the proposed event-triggered gossip algorithm, each device $i$ decides whether to communicate at round $t$ according to the triggering threshold
$$\tau_t = \epsilon\,\|x_0\|, \qquad (25)$$
where $\epsilon > 0$ is a small relative threshold coefficient and $x_0$ is the initial model shared by all devices. The results presented are based on the average of 30 Monte Carlo experiments.

B. Impact of Trigger Threshold on Accuracy and Communication

Figs. 2–3 and Figs. 4–5 show the impact of the triggering threshold $\epsilon$ on the test performance and communication cost for the MNIST and Fashion-MNIST datasets, respectively. The test accuracy is evaluated on the averaged model, i.e.,
$$\bar{x}_t \triangleq \frac{1}{M}\sum_{i=1}^{M} x_{i,t}.$$
The communication cost is measured by the total number of point-to-point one-way transmissions among all devices; that is, each model message sent from a node to one of its neighbors is counted as one transmission. The full-communication case corresponds to $\tau_t \equiv 0$, meaning that all devices exchange local models with their neighbors at every round.

Fig. 3. Cumulative point-to-point communication volume versus $\epsilon$ (MNIST).

Fig. 4. Test accuracy of the averaged model under different triggering thresholds $\epsilon$ (Fashion-MNIST).

Fig. 5. Cumulative point-to-point communication volume versus $\epsilon$ (Fashion-MNIST).

As shown in Fig. 2, the accuracy of the averaged model $\bar{x}_t$ on MNIST gradually approaches that of the full-communication case as $\epsilon$ decreases. A smaller $\epsilon$ makes the triggering condition, i.e.,
$$\|x_{i,t} - \hat{x}_{i,t}\| \geq \tau_t = \epsilon\,\|x_0\|,$$
easier to satisfy, which results in more frequent inter-device communications, faster consensus, and higher accuracy. Correspondingly, Fig. 3 shows that the cumulative communication volume increases monotonically as $\epsilon$ decreases. When $\epsilon$ is large, triggers are rare, leading to minimal communication but slower convergence; as $\epsilon \to 0$, the behavior converges to the full-communication case, achieving the highest accuracy at the cost of the largest overhead.

The experiments on Fashion-MNIST, reported in Figs. 4 and 5, exhibit a similar trend. As $\epsilon$ decreases, the accuracy consistently improves while the total communication volume increases. However, the performance gap between different thresholds is smaller on Fashion-MNIST than on MNIST, indicating that the proposed event-triggered mechanism maintains stable accuracy even under sparser communication on more complex datasets.

Let CR $= 1 - \frac{\mathrm{Comm}(\epsilon)}{\mathrm{Comm}(\mathrm{Full})}$ denote the communication reduction ratio, and AD $= \mathrm{Acc}(\mathrm{Full}) - \mathrm{Acc}(\epsilon)$ denote the accuracy drop relative to full communication (in percentage points). In Table I, our event-triggered scheme achieves a marked reduction in point-to-point transmissions with only a marginal loss in accuracy.

TABLE I
AVERAGE ACCURACY AND CUMULATIVE POINT-TO-POINT COMMUNICATION VOLUME AFTER 150 ROUNDS

Scheme           MNIST Accuracy   MNIST Comm.   Fashion-MNIST Accuracy   Fashion-MNIST Comm.
Full Comm.       0.9219           39900         0.7422                   39900
ϵ = 3×10⁻³       0.9179           18934         0.7393                   17811
ϵ = 5×10⁻³       0.9135           12230         0.7360                   11328
ϵ = 7×10⁻³       0.9087           8893          0.7325                   8134
ϵ = 9×10⁻³       0.8711           5853          0.7292                   6203

Notes: "Comm." is the cumulative number of point-to-point transmissions. Results are averaged over trials, unless otherwise specified.

• Key operating point ($\epsilon = 5\times 10^{-3}$). On MNIST, the proposed method reaches an average accuracy of 0.9135 (AD $= 0.84$ percentage points (pp), i.e., 0.91% relative) while using CR $= 69.35\%$ fewer transmissions (12230 vs. 39900). On Fashion-MNIST, it attains 0.7360 (drop 0.62 pp, 0.84% relative) with CR $= 71.61\%$ fewer transmissions (11328 vs. 39900). These results verify that our event-triggered gossip can preserve the performance of full communication with only about $1/3$ of the communication cost.

• More aggressive saving ($\epsilon = 7\times 10^{-3}$). MNIST reduces transmissions by 77.71% (8893 vs. 39900) with an accuracy drop of only 1.32 pp (1.43%). Fashion-MNIST reduces by 79.61% (8134 vs. 39900) with a 0.97 pp (1.31%) drop.

• Conservative setting ($\epsilon = 3\times 10^{-3}$). Even with a smaller threshold, MNIST still saves 52.55% communication with merely 0.40 pp (0.43%) loss; Fashion-MNIST saves 55.36% with a 0.29 pp (0.39%) loss.

Overall, the event-triggered mechanism provides a tunable accuracy–communication trade-off: as $\epsilon$ decreases, accuracy approaches that of full communication while communication cost smoothly increases. At $\epsilon = 5\times 10^{-3}$, the method achieves near-identical performance to full communication yet consumes only 30.65% (MNIST) and 28.39% (Fashion-MNIST) of its point-to-point transmissions, i.e., communication savings of 69.35% and 71.61%, respectively.

C. Performance Analysis under Different Network Sparsity

To investigate the performance of the proposed scheme under varying network configurations, we conducted experiments with different levels of network sparsity. In Figs. 6 and 7, we observe that, as the sparsity increases, both the point-to-point communication volume and the accuracy of our algorithm are affected. In Fig. 7, we can see a noticeable decrease in the cumulative communication volume as the sparsity increases. This is because each node has fewer neighbors to communicate with in sparser topologies, leading to fewer point-to-point transmissions when communication is triggered. Consequently, as sparsity increases, the number of active communication links decreases, resulting in a reduction in overall communication volume. On the other hand, Fig.
6 demonstrates that the accuracy of the algorithm decreases as the network sparsity increases. This performance degradation can be attributed to the non-i.i.d. data partitioning deployed in the experiments. As the sparsity grows, the communication between nodes becomes more sporadic, meaning that nodes are less able to collaborate effectively and synchronize their model parameters. This leads to slower consensus, as fewer nodes communicate in each round, and the model update process becomes less efficient. The lack of communication and collaboration among nodes hampers the model's ability to converge effectively. In general, while higher sparsity reduces the communication overhead, it diminishes the accuracy of the algorithm due to the decreased collaboration between nodes.

Fig. 6. Average accuracy under different levels of network sparsity.

Fig. 7. Cumulative point-to-point communication volume under different levels of network sparsity.

D. Comparison with State-of-the-art Schemes

To comprehensively evaluate the effectiveness of the proposed event-triggered gossip algorithm, we compare it with representative distributed learning schemes that embody different communication mechanisms. These schemes include both conventional and state-of-the-art strategies widely adopted in distributed learning.

1) Full Communication [44]: This scheme performs standard gossip averaging in every iteration [44], where all devices exchange information with all of their neighbors. It serves as an upper-bound baseline, representing the best achievable performance with maximum communication overhead, corresponding to the case $\tau_t \equiv 0$.
2) Event-Triggered Gossip (Proposed): In the proposed method, each device triggers communication only when the deviation between its current model $x_{i,t}$ and its latest broadcast snapshot $\hat{x}_{i,t}$ exceeds a relative threshold, i.e., $\|x_{i,t} - \hat{x}_{i,t}\| \geq \tau_t = \epsilon\,\|x_0\|$. Only the triggered devices update their snapshots and communicate with neighbors.

3) Periodic Gossip [24]: In this time-triggered scheme, devices perform global gossip communication once every $K_p$ rounds and conduct only local updates in between [24]. The periodic exchange effectively reduces communication frequency but may slow down model consensus, especially when $K_p$ is large.

4) Probabilistic Gossip [45]: In this scheme, each directed communication link $(i,j)$ is activated independently according to a Bernoulli random variable. At round $t$, node $i$ transmits its model to neighbor $j$ with probability $p_{ij}$, i.e., the transmission event is $a_{ij,t} \sim \mathrm{Bernoulli}(p_{ij})$. If $a_{ij,t} = 1$, the message is sent; if $a_{ij,t} = 0$, no transmission occurs and node $j$ reuses its own model. See [45] for details.

5) Variable Working Nodes [46]: Inspired by the variable node activation strategy proposed in [46], this scheme allows each node to be activated with probability $p_k$ at each round. Only activated nodes perform both communication and gradient updates within the active subgraph, whereas inactive nodes remain unchanged. This joint control of communication and computation further reduces system overhead at the expense of slower convergence in highly sparse activation regimes.

These five schemes cover a broad range of distributed learning communication paradigms, including event-triggered, time-triggered, and probabilistic triggering mechanisms. We first provide one representative parameter configuration for each scheme to give a concrete comparison. In this setting, we use $\epsilon = 0.005$ for event-triggered gossip, $K_p = 5$ for periodic gossip, and $p_{ij} = 0.5, \forall i,j$ and $p_k = 0.3$ for the probabilistic and variable working-node schemes. The corresponding performance after 150 training rounds is summarized in Table II. Here, the proposed event-triggered gossip achieves comparable accuracy to full communication while reducing the total point-to-point transmissions by more than a factor of three on both datasets. Periodic gossip and variable working-node schemes consume fewer communications but suffer noticeable accuracy degradation, whereas probabilistic gossip retains reasonable accuracy but requires significantly more communications than the proposed method.

To visualize how accuracy varies with the communication budget, we sweep the key communication-related parameters of each scheme and depict their accuracy–communication trade-offs. For each scheme, we sweep its primary communication-control hyperparameters over a range of values (e.g., the relative triggering threshold $\epsilon$ for event-triggered gossip, the communication period $K_p$ for periodic gossip, and the activation probabilities $p_{ij}, \forall i,j$ and $p_k$ for the probabilistic and variable-working schemes). For each configuration, we run training for $T = 150$ rounds and record the pair consisting of the final test accuracy of the averaged model and the cumulative number of point-to-point transmissions among all devices. The resulting accuracy–communication pairs are plotted in Fig. 8 for MNIST and Fig. 9 for Fashion-MNIST. In these plots, desirable operating points lie towards the upper-left corner, corresponding to high accuracy with low communication volume.

On the MNIST dataset (Fig. 8), the full-communication scheme naturally achieves the highest accuracy but also incurs the largest communication volume, placing its operating points near the upper-right corner.
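The communication totals of the non-adaptive baselines can be sanity-checked in closed form: full communication carries 39900/150 = 266 directed transmissions per round, so periodic gossip with $K_p = 5$ should use 39900/5 transmissions, and probabilistic gossip with $p_{ij} = 0.5$ about half the full budget in expectation. A small sketch of this arithmetic (function and variable names are illustrative; the event-triggered count is data-dependent and is therefore excluded):

```python
def expected_transmissions(edges_per_round, T, scheme, K_p=None, p=None):
    """Expected point-to-point transmissions over T rounds for the
    non-adaptive baselines; event-triggered gossip has no closed form."""
    if scheme == "full":           # every directed edge fires every round
        return edges_per_round * T
    if scheme == "periodic":       # one global gossip exchange every K_p rounds
        return edges_per_round * (T // K_p)
    if scheme == "probabilistic":  # each link fires w.p. p in each round
        return edges_per_round * T * p
    raise ValueError(f"unknown scheme: {scheme}")

edges = 39900 // 150  # 266 directed transmissions per full round
full = expected_transmissions(edges, 150, "full")                  # 39900
periodic = expected_transmissions(edges, 150, "periodic", K_p=5)   # 7980
prob = expected_transmissions(edges, 150, "probabilistic", p=0.5)  # 19950.0
```

The periodic prediction matches the 7980 reported in Table II exactly, and the probabilistic expectation of 19950 is within 0.3% of the realized 19904, which is a single Monte Carlo outcome.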
In contrast, the proposed event-triggered gossip algorithm generates a suite of points that cluster close to the upper-left region, forming an approximate Pareto frontier.

TABLE II
AVERAGE ACCURACY AND CUMULATIVE POINT-TO-POINT COMMUNICATION VOLUME AFTER 150 ROUNDS

Scheme                 MNIST Accuracy   MNIST Comm.   Fashion-MNIST Accuracy   Fashion-MNIST Comm.
Full communication     0.9219           39900         0.7422                   39900
Event-triggered        0.9135           12230         0.7360                   11328
Periodic gossip        0.8454           7980          0.6884                   7980
Probabilistic gossip   0.9134           19904         0.7357                   19904
Variable working       0.7406           11939         0.6430                   11939

For a wide range of target accuracies (e.g., above 0.90), event-triggered gossip attains comparable performance to full communication while requiring significantly fewer point-to-point transmissions. In particular, for a given communication budget, the proposed method achieves higher accuracy than the periodic gossip and variable-working schemes, whose points lie further down and/or to the right, indicating either slower convergence or pronounced accuracy loss under aggressive communication reduction. Probabilistic gossip yields a set of intermediate trade-offs, but its points are dominated by those of the proposed method in the accuracy–communication plane.

A similar pattern is observed on the Fashion-MNIST dataset (Fig. 9). The event-triggered gossip scheme again occupies the upper-left region of the plot and remains close to the full-communication accuracy while substantially reducing communication. Periodic gossip and variable-working schemes exhibit lower accuracy for comparable communication volumes, reflecting the adverse impact of infrequent synchronization and partial node activation on consensus quality. Probabilistic gossip provides moderate savings but still requires noticeably more transmissions than the proposed method to reach the same accuracy level.
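The notion of an empirical Pareto frontier used here can be made mechanical: an operating point is dominated if some other point communicates no more while losing no accuracy. A minimal sketch, applied to the MNIST column of Table II (helper names are illustrative):

```python
def pareto_front(points):
    """Return the non-dominated (comm, acc) pairs, where lower comm and
    higher acc are both better; weakly dominated points are removed."""
    front = []
    for c, a in points:
        dominated = any(c2 <= c and a2 >= a and (c2, a2) != (c, a)
                        for c2, a2 in points)
        if not dominated:
            front.append((c, a))
    return sorted(front)

# (cumulative transmissions, final accuracy) on MNIST, from Table II.
points = [
    (39900, 0.9219),  # full communication
    (12230, 0.9135),  # event-triggered (proposed)
    (7980,  0.8454),  # periodic gossip
    (19904, 0.9134),  # probabilistic gossip
    (11939, 0.7406),  # variable working nodes
]
front = pareto_front(points)
```

Running this filter keeps full communication, the event-triggered point, and periodic gossip, while the probabilistic (dominated by the event-triggered point) and variable-working points drop out, consistent with the dominance relations described in the text.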
These Pareto-style results demonstrate that the proposed event-triggered gossip algorithm offers the most favorable accuracy–communication trade-off among all considered schemes. Across both MNIST and Fashion-MNIST, its operating points consistently lie on or near the empirical Pareto frontier, enabling substantial reductions in communication volume while preserving almost the same accuracy as the full-communication baseline.

Fig. 8. Test accuracy versus cumulative point-to-point communication volume for different distributed learning schemes on the MNIST dataset.

Fig. 9. Test accuracy versus cumulative point-to-point communication volume for different distributed learning schemes on the Fashion-MNIST dataset.

VI. CONCLUSIONS

In this paper, we developed a distributed learning framework based on an event-triggered gossip mechanism, which allows each device to autonomously decide when to communicate with its neighbors according to local model deviations. We conducted a rigorous convergence analysis and discussed the convergence properties of the proposed scheme under different triggering thresholds, providing theoretical insights into the trade-off between its accuracy and communication efficiency. Extensive experiments on MNIST and Fashion-MNIST verified that the proposed framework can drastically reduce point-to-point communication while maintaining high learning accuracy.

REFERENCES

[1] Z. Tang, S. Shi, W. Wang, B. Li, and X.
Chu, "Communication-efficient distributed deep learning: A comprehensive survey," arXiv preprint arXiv:2003.06307, 2020.
[2] S. Zhou and G. Y. Li, "Federated learning via inexact ADMM," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 8, pp. 9699–9708, 2023.
[3] F. Liang, Z. Zhang, H. Lu, V. Leung, Y. Guo, and X. Hu, "Communication-efficient large-scale distributed deep learning: A comprehensive survey," arXiv preprint, 2024.
[4] Z. Zhai, X. Yuan, X. Wang, and H. Yang, "UAV-enabled asynchronous federated learning," IEEE Trans. Wireless Commun., vol. 24, no. 3, pp. 2358–2372, 2025.
[5] J. Zhang, C. Qian, W. Lu, G. Deco, W. Ding, and J. Feng, "Dark signals in the brain: Augment brain network dynamics to the complex-valued field," arXiv preprint arXiv:2509.24715, 2025.
[6] T. Lin, S. P. Karimireddy, S. U. Stich, and M. Jaggi, "Quasi-global momentum: Accelerating decentralized deep learning on heterogeneous data," arXiv preprint arXiv:2102.04761, 2021.
[7] X. Cao, T. Başar, S. Diggavi, Y. C. Eldar, K. B. Letaief, H. V. Poor, and J. Zhang, "Communication-efficient distributed learning: An overview," IEEE J. Sel. Areas Commun., vol. 41, no. 4, pp. 851–873, 2023.
[8] Z. Qin, G. Y. Li, and H. Ye, "Federated learning and wireless communications," IEEE Wireless Commun., vol. 28, no. 5, pp. 134–140, 2021.
[9] H. Ye, L. Liang, and G. Y. Li, "Decentralized federated learning with unreliable communications," IEEE J. Sel. Topics Signal Process., vol. 16, no. 3, pp. 487–500, 2022.
[10] S. Sha, S. Zhou, L. Kong, and G. Y. Li, "Sparse decentralized federated learning," IEEE Trans. Signal Process., vol. 73, pp. 3406–3420, 2025.
[11] A. Koloskova, S. Stich, and M. Jaggi, "Decentralized stochastic optimization and gossip algorithms with compressed communication," in Proc. Int. Conf. Mach. Learn. (ICML), 2019, pp. 3478–3487.
[12] Z. Song, W. Li, K. Jin, L. Shi, M. Yan, W. Yin, and K.
Yuan, "Communication-efficient topologies for decentralized learning with O(1) consensus rate," Adv. Neural Inf. Process. Syst., vol. 35, pp. 1073–1085, 2022.
[13] Z. Zhai, S. Hu, W. Ni, X. Yuan, and X. Wang, "Spectral-convergent decentralized machine learning: Theory and application in space networks," arXiv preprint arXiv:2511.03291, 2025.
[14] R. Hong and A. Chandra, "DLion: Decentralized distributed deep learning in micro-clouds," in Proc. Int. Symp. High-Perform. Parallel Distrib. Comput. (HPDC), 2021, pp. 227–238.
[15] C. Renggli, S. Ashkboos, M. Aghagolzadeh, D. Alistarh, and T. Hoefler, "SparCML: High-performance sparse communication for machine learning," in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal. (SC), 2019, pp. 1–15.
[16] Z. Liu, A. Conti, S. K. Mitter, and M. Z. Win, "Communication-efficient distributed learning over networks—Part I: Sufficient conditions for accuracy," IEEE J. Sel. Areas Commun., vol. 41, no. 4, pp. 1081–1101, 2023.
[17] Y. Sun, H. Ochiai, and H. Esaki, "Decentralized deep learning for multi-access edge computing: A survey on communication efficiency and trustworthiness," IEEE Trans. Artif. Intell., vol. 3, no. 6, pp. 963–972, 2021.
[18] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, "TernGrad: Ternary gradients to reduce communication in distributed deep learning," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 1508–1518.
[19] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 1709–1720.
[20] A. F. Aji and K. Heafield, "Sparse communication for distributed gradient descent," arXiv preprint, 2017.
[21] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, "Sparsified SGD with memory," Adv. Neural Inf. Process. Syst., vol. 31, 2018.
[22] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J.
Dally, "Deep gradient compression: Reducing the communication bandwidth for distributed training," in Proc. Int. Conf. Learn. Represent. (ICLR), 2018.
[23] S. U. Stich, "Local SGD converges fast and communicates little," J. Mach. Learn. Res., vol. 20, no. 1, pp. 1–31, 2019.
[24] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, "Local SGD with periodic averaging: Tighter analysis and adaptive synchronization," Adv. Neural Inf. Process. Syst., vol. 32, 2019.
[25] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. Artif. Intell. Statist. (AISTATS), 2017, pp. 1273–1282.
[26] S. Zhou and G. Y. Li, "FedGiA: An efficient hybrid algorithm for federated learning," IEEE Trans. Signal Process., vol. 71, pp. 1493–1508, 2023.
[27] J. Wang and G. Joshi, "Cooperative SGD: A unified framework for the design and analysis of local-update SGD algorithms," J. Mach. Learn. Res., vol. 22, no. 213, pp. 1–50, 2021.
[28] K. Yang, T. Jiang, Y. Shi, and Z. Ding, "Federated learning via over-the-air computation," IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, 2020.
[29] G. Zhu, Y. Wang, and K. Huang, "Broadband analog aggregation for low-latency federated edge learning," IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, 2019.
[30] H. Guo, A. Liu, and V. K. N. Lau, "Analog gradient aggregation for federated learning over wireless networks: Customized design and convergence analysis," IEEE Internet Things J., vol. 8, no. 1, pp. 197–210, 2020.
[31] T. Sery, N. Shlezinger, K. Cohen, and Y. C. Eldar, "Over-the-air federated learning from heterogeneous data," IEEE Trans. Signal Process., vol. 69, pp. 3796–3811, 2021.
[32] Z. Zhai, X. Yuan, and X. Wang, "Decentralized federated learning via MIMO over-the-air computation: Consensus analysis and performance optimization," IEEE Trans.
Wireless Commun., vol. 23, no. 9, pp. 11847–11862, 2024.
[33] X. Cao, Z. Lyu, G. Zhu, J. Xu, L. Xu, and S. Cui, "An overview on over-the-air federated edge learning," IEEE Wireless Commun., vol. 31, no. 3, pp. 202–210, 2024.
[34] B. Xiao, X. Yu, W. Ni, X. Wang, and H. V. Poor, "Over-the-air federated learning: Status quo, open challenges, and future directions," Fundamental Research, vol. 5, no. 4, pp. 1710–1724, 2025.
[35] Y. LeCun, C. Cortes, and C. J. C. Burges, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/, 1998.
[36] H. Xiao, K. Rasul, and R. Vollgraf, "Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms," arXiv preprint arXiv:1708.07747, 2017.
[37] J. Wang and G. Joshi, "Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms," arXiv preprint arXiv:1808.07576, 2018.
[38] X. Li, Y. Xu, J. H. Wang, X. Wang, and J. Lui, "Decentralized stochastic proximal gradient descent with variance reduction over time-varying networks," arXiv preprint arXiv:2112.10389, 2021.
[39] Z. Zhai, X. Yuan, and X. Wang, "Distributed weight matrix optimization for consensus problems under unreliable communications," IEEE Trans. Cogn. Commun. Netw., vol. 12, pp. 2383–2396, 2026.
[40] S. Zhou, O. Wang, Z. Luo, Y. Zhu, and G. Y. Li, "Preconditioned inexact stochastic ADMM for deep models," arXiv preprint arXiv:2502.10784, 2025.
[41] S. Ghadimi and G. Lan, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming," SIAM J. Optim., vol. 23, no. 4, pp. 2341–2368, 2013.
[42] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Rev., vol. 60, no. 2, pp. 223–311, 2018.
[43] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," Adv. Neural Inf. Process. Syst., vol. 26, 2013.
[44] A. Koloskova, N.
Loizou, S. Boreiri, M. Jaggi, and S. Stich, "A unified theory of decentralized SGD with changing topology and local updates," in Proc. Int. Conf. Mach. Learn. (ICML), 2020, pp. 5381–5393.
[45] Z. Zhai, X. Yuan, X. Wang, and G. Y. Li, "Decentralized federated learning with distributed aggregation weight optimization," IEEE Trans. Pattern Anal. Mach. Intell., vol. 48, no. 3, pp. 3899–3910, 2026.
[46] D. Jakovetić, D. Bajović, N. Krejić, and N. Krklec Jerinkić, "Distributed gradient methods with variable number of working nodes," IEEE Trans. Signal Process., vol. 64, no. 15, pp. 4080–4095, 2016.

Zhiyuan Zhai received the B.S. degree in Communication Engineering from the School of Information and Communication Engineering, University of Electronic Science and Technology of China, in 2022. He is currently pursuing the Ph.D. degree with the Department of Communication Science and Engineering, Fudan University, Shanghai, China. His research interests include machine learning, signal processing, and mobile edge computing.

Xiaojun Yuan (Fellow, IEEE) received the PhD degree in electrical engineering from the City University of Hong Kong in 2009. From 2009 to 2011, he was a research fellow with the Department of Electronic Engineering, City University of Hong Kong. He was also a visiting scholar with the Department of Electrical Engineering, University of Hawaii at Manoa, in spring and summer 2009, as well as in the same period of 2010. From 2011 to 2014, he was a research assistant professor with the Institute of Network Coding, The Chinese University of Hong Kong. From 2014 to 2017, he was an assistant professor with the School of Information Science and Technology, ShanghaiTech University. He is now a professor with the National Key Laboratory of Wireless Communications, University of Electronic Science and Technology of China.
His research interests cover a broad range of signal processing, machine learning, and wireless communications, including but not limited to intelligent communications, structured signal reconstruction, Bayesian approximate inference, and distributed learning. He has published more than 320 peer-reviewed research papers in the leading international journals and conferences in the related areas. He has served on many technical programs for international conferences. He was an editor of leading IEEE journals, including the IEEE Transactions on Wireless Communications and the IEEE Transactions on Communications. He was a co-recipient of the IEEE Heinrich Hertz Award 2022 and a co-recipient of the IEEE Jack Neubauer Memorial Award 2025.

Wei Ni (M'09-SM'15-F'24) received the B.E. and Ph.D. degrees in Electronic Engineering from Fudan University, Shanghai, China, in 2000 and 2005, respectively. He is the Associate Dean (Research) of the School of Engineering, Edith Cowan University, Perth, and a Conjoint Professor at the University of New South Wales, Sydney, Australia. He is also a Technical Expert at Standards Australia with a focus on the international standardization of Big Data and AI. He was a Deputy Project Manager at Bell Labs, Alcatel/Alcatel-Lucent from 2005 to 2008; a Senior Research Engineer at Nokia from 2008 to 2009; and a Senior Principal Research Scientist and Group Leader at the Commonwealth Scientific and Industrial Research Organisation (CSIRO) from 2009 to 2025. His research interest lies in distributed and trusted learning with constrained resources, the quantum Internet, and their applications to system efficiency, integrity, and resilience. He is a co-recipient of the ACM Conference on Computer and Communications Security (CCS) 2025 Distinguished Paper Award and four Best Paper Awards.
He has been an Editor for the IEEE Transactions on Wireless Communications since 2018, the IEEE Transactions on Vehicular Technology since 2022, the IEEE Transactions on Information Forensics and Security and the IEEE Communications Surveys and Tutorials since 2024, and the IEEE Transactions on Network Science and Engineering and the IEEE Transactions on Cloud Computing since 2025. He was Chair of the IEEE VTS NSW Chapter (2020–2022), Track Chair for VTC-Spring 2017, Track Co-chair for IEEE VTC-Spring 2016, Publication Chair for BodyNet 2015, and Student Travel Grant Chair for WPMC 2014.

Xin Wang (Fellow, IEEE) received the BSc and MSc degrees from Fudan University, Shanghai, China, in 1997 and 2000, respectively, and the PhD degree from Auburn University, Auburn, Alabama, in 2004, all in electrical engineering. From September 2004 to August 2006, he was a postdoctoral research associate with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis. In August 2006, he joined the Department of Electrical Engineering, Florida Atlantic University, Boca Raton, Florida, as an assistant professor, and was promoted to a tenured associate professor in 2010. He is currently a distinguished professor and the chair of the Department of Communication Science and Engineering, Fudan University. His research interests include stochastic network optimization, energy-efficient communications, cross-layer design, and signal processing for communications. He is a member of the Signal Processing for Communications and Networking Technical Committee of the IEEE Signal Processing Society. He is a senior area editor of the IEEE Transactions on Signal Processing and an editor of the IEEE Transactions on Wireless Communications.
In the past, he served as an associate editor for the IEEE Transactions on Signal Processing, an editor for the IEEE Transactions on Vehicular Technology, and an associate editor for the IEEE Signal Processing Letters. He is a distinguished speaker of the IEEE Vehicular Technology Society.

Rui Zhang (S'00-M'07-SM'15-F'17) received the B.Eng. (first-class Hons.) and M.Eng. degrees from the National University of Singapore, Singapore, and the Ph.D. degree from Stanford University, Stanford, CA, USA, all in electrical engineering. From 2007 to 2009, he worked as a research scientist at the Institute for Infocomm Research, A*STAR, Singapore. In 2010, he joined the Department of Electrical and Computer Engineering of the National University of Singapore, where he is now a Provost's Chair Professor. He is also an Adjunct Professor with the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China. He has published over 600 papers, all in the field of wireless communications and networks. He has been listed as a Highly Cited Researcher by Thomson Reuters/Clarivate Analytics since 2015. His current research interests include intelligent surfaces, reconfigurable antennas, radio mapping, non-terrestrial communications, wireless power transfer, and AI and optimization methods. He was the recipient of the 6th IEEE Communications Society Asia-Pacific Region Best Young Researcher Award in 2011, the Young Researcher Award of the National University of Singapore in 2015, the Wireless Communications Technical Committee Recognition Award in 2020, the IEEE Signal Processing and Computing for Communications (SPCC) Technical Recognition Award in 2021, and the IEEE Communications Society Technical Committee on Cognitive Networks (TCCN) Recognition Award in 2023.
His works have received 18 IEEE Best Journal Paper Awards, including the IEEE Marconi Prize Paper Award in Wireless Communications in 2015 and 2020, the IEEE Signal Processing Society Best Paper Award in 2016, the IEEE Communications Society Heinrich Hertz Prize Paper Award in 2017, 2020, and 2022, and the IEEE Communications Society Stephen O. Rice Prize in 2021, among others. He served for over 30 international conferences as the TPC co-chair or an organizing committee member. He was an elected member of the IEEE Signal Processing Society SPCOM Technical Committee from 2012 to 2017 and the SAM Technical Committee from 2013 to 2015. He served as the Vice Chair of the IEEE Communications Society Asia-Pacific Board Technical Affairs Committee from 2014 to 2015, a member of the Steering Committee of the IEEE Wireless Communications Letters from 2018 to 2021, and a member of the IEEE Communications Society Wireless Communications Technical Committee (WTC) Award Committee from 2023 to 2025. He was a Distinguished Lecturer of the IEEE Signal Processing Society and the IEEE Communications Society from 2019 to 2020. He served as an Editor for several IEEE journals, including the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS from 2012 to 2016, the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS: Green Communications and Networking Series from 2015 to 2016, the IEEE TRANSACTIONS ON SIGNAL PROCESSING from 2013 to 2017, the IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING from 2016 to 2020, and the IEEE TRANSACTIONS ON COMMUNICATIONS from 2017 to 2022. He now serves as an Editorial Board Member of npj Wireless Technology and the Chair of the IEEE Communications Society Wireless Communications Technical Committee (WTC) Award Committee. He is a Fellow of the Academy of Engineering Singapore.

Geoffrey Ye Li (Fellow, IEEE) is currently a Chair Professor at Imperial College London, UK.
Before joining Imperial in 2020, he was with Georgia Tech and AT&T Labs - Research (previously Bell Labs) for 25 years in total. He made fundamental contributions to orthogonal frequency division multiplexing (OFDM) for wireless communications, established a framework on resource cooperation in wireless networks, and pioneered deep learning for communications. In these areas, he has published over 700 journal and conference papers in addition to over 40 granted patents. His publications have been cited over 85,000 times with an H-index over 130. He has been listed as a Highly Cited Researcher by Clarivate/Web of Science almost every year. Dr. Geoffrey Ye Li was elected a Fellow of the Royal Academy of Engineering (FREng), an IEEE Fellow, and an IET Fellow for his contributions to signal processing for wireless communications. He received the 2024 IEEE Eric E. Sumner Award, the 2019 IEEE ComSoc Edwin Howard Armstrong Achievement Award, and several other awards from the IEEE Signal Processing, Vehicular Technology, and Communications Societies.

APPENDIX A
PROOF OF THEOREM 1

Step 1: Average iterate. Since $W\mathbf{1} = \mathbf{1}$, right-multiplying the update $X_{t+1} = X_t W - \eta G_t + V_t$ by $\frac{\mathbf{1}}{n}$ gives
$$\bar{x}_{t+1} = \bar{x}_t - \eta \bar{g}_t + \bar{e}_t, \qquad \bar{g}_t = \frac{G_t \mathbf{1}}{n}, \qquad \bar{e}_t = \frac{V_t \mathbf{1}}{n}. \tag{26}$$

Step 2: One-step descent on $f(\cdot)$. Assume $0 < \eta \le 1/L$.
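The algebra in Step 1 is easy to check numerically. The sketch below is an illustrative toy setup (a 5-node ring gossip matrix and random stand-ins for $G_t$ and $V_t$, none of which come from the paper); it verifies that right-multiplying the stacked update by $\mathbf{1}/n$ reproduces the averaged recursion (26):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 5, 3, 0.1  # nodes, model dimension, step size (arbitrary)

# Symmetric doubly stochastic W built from a ring-graph Laplacian (toy choice).
L = 2 * np.eye(n) - np.roll(np.eye(n), 1, axis=0) - np.roll(np.eye(n), -1, axis=0)
W = np.eye(n) - 0.25 * L
assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1)

X = rng.standard_normal((d, n))  # columns are the local models x_{i,t}
G = rng.standard_normal((d, n))  # stochastic gradients
V = rng.standard_normal((d, n))  # event-trigger perturbations v_{i,t}

X_next = X @ W - eta * G + V     # stacked update X_{t+1} = X_t W - eta G_t + V_t
one = np.ones(n) / n
# Since W 1 = 1, the column average follows (26): avg_{t+1} = avg_t - eta g_t + e_t.
assert np.allclose(X_next @ one, X @ one - eta * (G @ one) + V @ one)
```

The check relies only on $W\mathbf{1} = \mathbf{1}$; any doubly stochastic mixing matrix would do in place of the ring example.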
Using $\bar{x}_{t+1} = \bar{x}_t - \eta(\bar{g}_t - \bar{e}_t/\eta)$,
$$
\begin{aligned}
\mathbb{E} f(\bar{x}_{t+1}) &= \mathbb{E} f\big(\bar{x}_t - \eta(\bar{g}_t - \bar{e}_t/\eta)\big) \\
&\overset{(a)}{\le} \mathbb{E} f(\bar{x}_t) - \eta\,\mathbb{E}\big\langle \nabla f(\bar{x}_t),\, \bar{g}_t - \bar{e}_t/\eta \big\rangle + \frac{L\eta^2}{2}\,\mathbb{E}\big\| \bar{g}_t - \bar{e}_t/\eta \big\|^2 \\
&\overset{(b)}{=} \mathbb{E} f(\bar{x}_t) - \frac{\eta}{2}\,\mathbb{E}\|\nabla f(\bar{x}_t)\|^2 - \frac{\eta}{2}\,\mathbb{E}\big\| \bar{g}_t - \bar{e}_t/\eta \big\|^2 + \frac{\eta}{2}\,\mathbb{E}\big\| \nabla f(\bar{x}_t) - \big(\bar{g}_t - \bar{e}_t/\eta\big) \big\|^2 + \frac{L\eta^2}{2}\,\mathbb{E}\big\| \bar{g}_t - \bar{e}_t/\eta \big\|^2 \\
&\overset{(c)}{\le} \mathbb{E} f(\bar{x}_t) - \frac{\eta}{2}\,\mathbb{E}\|\nabla f(\bar{x}_t)\|^2 + \frac{\eta}{2}\,\mathbb{E}\big\| \nabla f(\bar{x}_t) - \bar{g}_t + \bar{e}_t/\eta \big\|^2 \\
&\overset{(d)}{\le} \mathbb{E} f(\bar{x}_t) - \frac{\eta}{2}\,\mathbb{E}\|\nabla f(\bar{x}_t)\|^2 + \eta\,\mathbb{E}\|\nabla f(\bar{x}_t) - \bar{g}_t\|^2 + \frac{1}{\eta}\,\mathbb{E}\|\bar{e}_t\|^2 \\
&= \mathbb{E} f(\bar{x}_t) - \frac{\eta}{2}\,\mathbb{E}\|\nabla f(\bar{x}_t)\|^2 + \eta\, T_1 + \frac{1}{\eta}\,\mathbb{E}\|\bar{e}_t\|^2,
\end{aligned} \tag{27}
$$
where (a) is due to the $L$-smoothness, i.e., $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$, (b) is based on $2\langle a, b \rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$, (c) is due to $\eta \le 1/L$, and (d) is based on $\|u + v\|^2 \le 2\|u\|^2 + 2\|v\|^2$.

Step 3: Bounding $T_1$. Recall $T_1 = \mathbb{E}\|\nabla f(\bar{x}_t) - \bar{g}_t\|^2$ with $\bar{x}_t = \frac{X_t \mathbf{1}}{n}$ and $\bar{g}_t = \frac{G_t \mathbf{1}}{n} = \frac{1}{n}\sum_{i=1}^n g_{i,t}$. Adding and subtracting the average true gradient,
$$T_1 = \mathbb{E}\Big\| \nabla f(\bar{x}_t) - \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_{i,t}) + \frac{1}{n}\sum_{i=1}^n \big( \nabla f_i(x_{i,t}) - g_{i,t} \big) \Big\|^2 = \mathbb{E}\|A_t\|^2 + \mathbb{E}\|B_t\|^2 + 2\,\mathbb{E}\langle A_t, B_t \rangle, \tag{28}$$
where $A_t = \nabla f(\bar{x}_t) - \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_{i,t})$ and $B_t = \frac{1}{n}\sum_{i=1}^n \big( \nabla f_i(x_{i,t}) - g_{i,t} \big)$. By unbiasedness, $\mathbb{E}[g_{i,t}] = \nabla f_i(x_{i,t})$, hence $\mathbb{E}[B_t] = 0$ and the cross term vanishes: $\mathbb{E}\langle A_t, B_t \rangle = \mathbb{E}\langle A_t, \mathbb{E}[B_t] \rangle = 0$. Therefore,
$$T_1 = \mathbb{E}\|A_t\|^2 + \mathbb{E}\|B_t\|^2. \tag{29}$$
For $\mathbb{E}\|B_t\|^2$, we have
$$\mathbb{E}\|B_t\|^2 = \mathbb{E}\Big\| \frac{1}{n}\sum_{i=1}^n \big( \nabla f_i(x_{i,t}) - g_{i,t} \big) \Big\|^2 \le \frac{1}{n^2}\sum_{i=1}^n \mathbb{E}\big\| \nabla f_i(x_{i,t}) - g_{i,t} \big\|^2 \overset{(a)}{\le} \frac{\alpha^2}{n}, \tag{30}$$
where (a) is based on Assumption 3.
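The two elementary facts invoked in steps (b) and (d) of (27) can be confirmed directly; the vectors below are arbitrary test data, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(4), rng.standard_normal(4)
# Step (b): polarization identity 2<a,b> = ||a||^2 + ||b||^2 - ||a-b||^2.
assert np.isclose(2 * (a @ b), a @ a + b @ b - (a - b) @ (a - b))

u, v = rng.standard_normal(4), rng.standard_normal(4)
# Step (d): ||u+v||^2 <= 2||u||^2 + 2||v||^2, from 2<u,v> <= ||u||^2 + ||v||^2.
assert (u + v) @ (u + v) <= 2 * (u @ u) + 2 * (v @ v)
```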
Combining (27) with the bound $T_1 \le T_2 + \frac{\alpha^2}{n}$, where $T_2 \triangleq \mathbb{E}\big\| \nabla f(\bar{x}_t) - \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_{i,t}) \big\|^2$, we obtain
$$\mathbb{E} f(\bar{x}_{t+1}) \le \mathbb{E} f(\bar{x}_t) - \frac{\eta}{2}\,\mathbb{E}\|\nabla f(\bar{x}_t)\|^2 + \eta\, T_2 + \frac{\eta \alpha^2}{n} + \frac{1}{\eta}\,\mathbb{E}\|\bar{e}_t\|^2. \tag{31}$$
The last term on the RHS of (31) captures the average perturbation due to the event-triggered communication in round $t$.

Step 4: Bounding $T_2$ via the consensus error. By Jensen's inequality and the $L$-smoothness of $\{f_i\}$, it follows that
$$T_2 = \mathbb{E}\Big\| \frac{1}{n}\sum_{i=1}^n \big( \nabla f_i(\bar{x}_t) - \nabla f_i(x_{i,t}) \big) \Big\|^2 \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big\| \nabla f_i(\bar{x}_t) - \nabla f_i(x_{i,t}) \big\|^2 \overset{(a)}{\le} \frac{L^2}{n}\sum_{i=1}^n \mathbb{E}\big\| \bar{x}_t - x_{i,t} \big\|^2 \overset{(b)}{=} \frac{L^2}{n}\,\mathbb{E}\big\| X_t (I - J) \big\|_F^2, \tag{32}$$
where $J = \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, (a) is due to the $L$-smoothness assumption, and (b) is due to $\sum_{i=1}^n \|x_{i,t} - \bar{x}_t\|^2 = \|X_t(I - J)\|_F^2$.

Step 5: Unrolling the consensus error $Q_{t,i}$. Define the per-node disagreement energy
$$Q_{t,i} \triangleq \mathbb{E}\big\| \bar{x}_t - x_{i,t} \big\|^2 = \mathbb{E}\Big\| X_t \frac{\mathbf{1}}{n} - X_t e_i \Big\|^2,$$
with $e_i$ the $i$-th canonical basis vector. Unrolling the update $X_{s+1} = X_s W - \eta G_s + V_s$ yields
$$X_t = X_0 W^t - \eta \sum_{j=0}^{t-1} G_j W^{t-1-j} + \sum_{j=0}^{t-1} V_j W^{t-1-j}. \tag{33}$$
Hence, with $a_{k,i} \triangleq \frac{\mathbf{1}}{n} - W^k e_i$,
$$\bar{x}_t - x_{i,t} = X_0 a_{t,i} - \eta \sum_{j=0}^{t-1} G_j\, a_{t-1-j,i} + \sum_{j=0}^{t-1} V_j\, a_{t-1-j,i}. \tag{34}$$
If the initialization is in consensus (e.g., $X_0(I - J) = 0$, which includes $X_0 = 0$ as a special case), the first term on the RHS of (34) vanishes. By $\|a + b + c\|^2 \le 3(\|a\|^2 + \|b\|^2 + \|c\|^2)$, we obtain
$$Q_{t,i} \le 3\eta^2\,\mathbb{E}\Big\| \sum_{j=0}^{t-1} \big( G_j - \partial f(X_j) \big) a_{t-1-j,i} \Big\|^2 + 3\eta^2\,\mathbb{E}\Big\| \sum_{j=0}^{t-1} \partial f(X_j)\, a_{t-1-j,i} \Big\|^2 + 3\,\mathbb{E}\Big\| \sum_{j=0}^{t-1} V_j\, a_{t-1-j,i} \Big\|^2 \triangleq 3\eta^2 T_{3,i} + 3\eta^2 T_{4,i} + 3 T_{5,i}. \tag{35}$$
Here, $\partial f(X_j) \triangleq [\nabla f_1(x_{1,j}) \cdots \nabla f_n(x_{n,j})]$.
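The closed form (33) follows by induction on the linear update. A quick numeric check (toy dimensions and random $G_j$, $V_j$, all illustrative assumptions) compares the iterated recursion against the unrolled expression:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eta, T = 4, 2, 0.05, 6

# Symmetric doubly stochastic W from a ring-graph Laplacian (toy choice).
L = 2 * np.eye(n) - np.roll(np.eye(n), 1, axis=0) - np.roll(np.eye(n), -1, axis=0)
W = np.eye(n) - 0.25 * L

X0 = np.zeros((d, n))  # consensus initialization, so X_0 (I - J) = 0
Gs = [rng.standard_normal((d, n)) for _ in range(T)]
Vs = [rng.standard_normal((d, n)) for _ in range(T)]

# Iterate the update X_{s+1} = X_s W - eta G_s + V_s ...
X = X0.copy()
for s in range(T):
    X = X @ W - eta * Gs[s] + Vs[s]

# ... and compare with the closed form (33).
Wp = lambda k: np.linalg.matrix_power(W, k)
X_closed = X0 @ Wp(T) + sum((-eta * Gs[j] + Vs[j]) @ Wp(T - 1 - j) for j in range(T))
assert np.allclose(X, X_closed)
```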
The three terms in (35), i.e., $T_{3,i}$ (stochastic noise), $T_{4,i}$ (bias due to disagreement), and $T_{5,i}$ (event-trigger perturbation), can be bounded using the spectral contraction $\|W^k - J\|_2^2 \le \delta^k$, with $\delta = \|W - J\|_2^2 \in [0, 1)$.

Step 6: Bounding $T_{3,i}$ (stochastic noise). Recall
$$T_{3,i} = \mathbb{E}\Big\| \sum_{j=0}^{t-1} \big( G_j - \partial f(X_j) \big) a_{t-1-j,i} \Big\|^2,$$
with $a_{k,i} \triangleq \frac{\mathbf{1}}{n} - W^k e_i$. The stochastic-gradient errors are zero-mean and independent across rounds, so the cross terms vanish. Using $\|U a\| \le \|U\|_F \|a\|_2$ and $\|a_{k,i}\|_2^2 \le \delta^k$, we obtain
$$T_{3,i} \le \sum_{j=0}^{t-1} \mathbb{E}\big\| G_j - \partial f(X_j) \big\|_F^2\, \|a_{t-1-j,i}\|_2^2 \overset{(a)}{\le} \sum_{j=0}^{t-1} n\alpha^2\, \delta^{t-1-j} \le \frac{n\alpha^2}{1 - \delta}, \tag{36}$$
where (a) is based on Assumption 3 and Lemma 1 proved in Appendix B.

Step 7: Bounding $T_{4,i}$ (bias due to disagreement). Expanding the square,
$$T_{4,i} = \mathbb{E}\Big\| \sum_{j=0}^{t-1} \partial f(X_j)\, a_{t-1-j,i} \Big\|^2 = \sum_{j=0}^{t-1} \mathbb{E}\big\| \partial f(X_j)\, a_{t-1-j,i} \big\|^2 + \sum_{j \ne j'} \mathbb{E}\big\langle \partial f(X_j)\, a_{t-1-j,i},\, \partial f(X_{j'})\, a_{t-1-j',i} \big\rangle \triangleq \widetilde{T}_4 + \widetilde{T}_5. \tag{37}$$
Next, we proceed to bound $\widetilde{T}_4$ and $\widetilde{T}_5$. For $\widetilde{T}_4$,
$$\widetilde{T}_4 = \sum_{j=0}^{t-1} \mathbb{E}\big\| \partial f(X_j)\, a_{t-1-j,i} \big\|^2 \le \sum_{j=0}^{t-1} \mathbb{E}\|\partial f(X_j)\|_F^2\, \|a_{t-1-j,i}\|_2^2.$$
For $\mathbb{E}\|\partial f(X_j)\|_F^2$, we have
$$
\begin{aligned}
\mathbb{E}\|\partial f(X_j)\|_F^2 &\le 3\,\mathbb{E}\big\| \partial f(X_j) - \partial f(\bar{x}_j \mathbf{1}^\top) \big\|_F^2 + 3\,\mathbb{E}\big\| \partial f(\bar{x}_j \mathbf{1}^\top) - \nabla f(\bar{x}_j) \mathbf{1}^\top \big\|_F^2 + 3\,\mathbb{E}\big\| \nabla f(\bar{x}_j) \mathbf{1}^\top \big\|_F^2 \\
&\overset{(a)}{\le} 3\,\mathbb{E}\big\| \partial f(X_j) - \partial f(\bar{x}_j \mathbf{1}^\top) \big\|_F^2 + 3 n\beta^2 + 3\,\mathbb{E}\big\| \nabla f(\bar{x}_j) \mathbf{1}^\top \big\|_F^2 \\
&\overset{(b)}{\le} 3 L^2 \sum_{i=1}^n Q_{j,i} + 3 n\beta^2 + 3\,\mathbb{E}\big\| \nabla f(\bar{x}_j) \mathbf{1}^\top \big\|_F^2,
\end{aligned} \tag{38}
$$
where (a) is due to Assumption 3, and (b) is due to the $L$-smoothness. Hence,
$$\widetilde{T}_4 \le 3 L^2 \sum_{j=0}^{t-1} \sum_{h=1}^n Q_{j,h}\, \|a_{t-1-j,i}\|_2^2 + 3 \sum_{j=0}^{t-1} \mathbb{E}\big\| \nabla f(\bar{x}_j) \mathbf{1}^\top \big\|_F^2\, \|a_{t-1-j,i}\|_2^2 + 3 n\beta^2 \sum_{j=0}^{t-1} \|a_{t-1-j,i}\|_2^2. \tag{39}$$
For $\widetilde{T}_5$, we have
$$
\begin{aligned}
\widetilde{T}_5 &= \sum_{j \ne j'} \mathbb{E}\big\langle \partial f(X_j)\, a_{t-1-j,i},\, \partial f(X_{j'})\, a_{t-1-j',i} \big\rangle \\
&\le \sum_{j \ne j'} \mathbb{E}\big\| \partial f(X_j)\, a_{t-1-j,i} \big\| \big\| \partial f(X_{j'})\, a_{t-1-j',i} \big\| \\
&\le \sum_{j \ne j'} \mathbb{E}\|\partial f(X_j)\|_F\, \|a_{t-1-j,i}\|_2\, \|\partial f(X_{j'})\|_F\, \|a_{t-1-j',i}\|_2 \\
&\le \sum_{j \ne j'} \mathbb{E}\frac{\|\partial f(X_j)\|_F^2 + \|\partial f(X_{j'})\|_F^2}{2}\, \|a_{t-1-j,i}\|_2\, \|a_{t-1-j',i}\|_2 \\
&\overset{(a)}{\le} \frac{1}{2} \sum_{j \ne j'} \mathbb{E}\big[ \|\partial f(X_j)\|_F^2 + \|\partial f(X_{j'})\|_F^2 \big]\, \delta^{\,t-1-\frac{j+j'}{2}} = \sum_{j \ne j'} \mathbb{E}\|\partial f(X_j)\|_F^2\, \delta^{\,t-1-\frac{j+j'}{2}} \\
&\overset{(b)}{\le} 3 \sum_{j \ne j'} \Big( L^2 \sum_{h=1}^n Q_{j,h} + \mathbb{E}\big\| \nabla f(\bar{x}_j) \mathbf{1}^\top \big\|_F^2 \Big) \delta^{\,t-1-\frac{j+j'}{2}} + 3 n\beta^2 \sum_{j \ne j'} \delta^{\,t-1-\frac{j+j'}{2}} \\
&= 6 \sum_{j=0}^{t-1} \Big( L^2 \sum_{h=1}^n Q_{j,h} + \mathbb{E}\big\| \nabla f(\bar{x}_j) \mathbf{1}^\top \big\|_F^2 \Big) \sum_{j'=j+1}^{t-1} (\sqrt{\delta})^{\,2t-2-j-j'} + 6 n\beta^2 \sum_{j' > j} \delta^{\,t-1-\frac{j+j'}{2}} \\
&\le 6 \sum_{j=0}^{t-1} \Big( L^2 \sum_{h=1}^n Q_{j,h} + \mathbb{E}\big\| \nabla f(\bar{x}_j) \mathbf{1}^\top \big\|_F^2 \Big) \frac{(\sqrt{\delta})^{\,t-1-j}}{1 - \sqrt{\delta}} + \frac{6 n\beta^2}{(1 - \sqrt{\delta})^2},
\end{aligned} \tag{40}
$$
where (a) is from Lemma 1, and (b) stems from (38). Plugging (39) and (40) into (37) and using Lemma 1 again, we have
$$T_{4,i} \le 3 \sum_{j=0}^{t-1} \Big( L^2 \sum_{h=1}^n Q_{j,h} + \mathbb{E}\big\| \nabla f(\bar{x}_j) \mathbf{1}^\top \big\|_F^2 \Big) \left( \delta^{\,t-1-j} + \frac{2 (\sqrt{\delta})^{\,t-1-j}}{1 - \sqrt{\delta}} \right) + \frac{9 n\beta^2}{(1 - \sqrt{\delta})^2}. \tag{41}$$

Step 8: Bounding $T_{5,i}$ (event-trigger perturbation).
With $a_{k,i}$ as above and the Cauchy-Schwarz inequality, it follows that
$$T_{5,i} = \mathbb{E}\Big\| \sum_{j=0}^{t-1} V_j\, a_{t-1-j,i} \Big\|^2 \tag{42}$$
and
$$
\begin{aligned}
T_{5,i} &= \sum_{j=0}^{t-1} \mathbb{E}\| V_j\, a_{t-1-j,i} \|^2 + \sum_{j \ne j'} \mathbb{E}\big\langle V_j\, a_{t-1-j,i},\, V_{j'}\, a_{t-1-j',i} \big\rangle \\
&\overset{(a)}{\le} \sum_{j=0}^{t-1} \mathbb{E}\|V_j\|_F^2\, \delta^{\,t-1-j} + \frac{1}{2} \sum_{j \ne j'} \mathbb{E}\big[ \|V_j\|_F^2 + \|V_{j'}\|_F^2 \big]\, \delta^{\,t-1-\frac{j+j'}{2}} \\
&\le \sum_{j=0}^{t-1} \mathbb{E}\|V_j\|_F^2\, \delta^{\,t-1-j} + 2 \sum_{j=0}^{t-1} \mathbb{E}\|V_j\|_F^2 \sum_{j'=j+1}^{t-1} \delta^{\,t-1-\frac{j+j'}{2}} \\
&\le \sum_{j=0}^{t-1} \mathbb{E}\|V_j\|_F^2 \left( \delta^{\,t-1-j} + \frac{2 (\sqrt{\delta})^{\,t-1-j}}{1 - \sqrt{\delta}} \right),
\end{aligned} \tag{43}
$$
where (a) arises from Lemma 1.

Step 9: Bounding $Q_{t,i}$. Plugging the bounds on $T_{3,i}$, $T_{4,i}$, and $T_{5,i}$ into (35) (recall $Q_{t,i} \le 3\eta^2 T_{3,i} + 3\eta^2 T_{4,i} + 3 T_{5,i}$) yields
$$Q_{t,i} \le \frac{3\eta^2 n\alpha^2}{1 - \delta} + \frac{27\eta^2 n\beta^2}{(1 - \sqrt{\delta})^2} + 9\eta^2 \sum_{j=0}^{t-1} \Big( L^2 \sum_{h=1}^n Q_{j,h} + \mathbb{E}\big\| \nabla f(\bar{x}_j) \mathbf{1}^\top \big\|_F^2 \Big) \left( \delta^{\,t-1-j} + \frac{2(\sqrt{\delta})^{\,t-1-j}}{1 - \sqrt{\delta}} \right) + 3 \sum_{j=0}^{t-1} \mathbb{E}\|V_j\|_F^2 \left( \delta^{\,t-1-j} + \frac{2(\sqrt{\delta})^{\,t-1-j}}{1 - \sqrt{\delta}} \right). \tag{44}$$

Step 10: Averaging over devices. Define the averaged disagreement $M_t \triangleq \frac{1}{n}\sum_{i=1}^n Q_{t,i}$. Using $\sum_{h=1}^n Q_{j,h} = n M_j$ and averaging (44) over $i$ gives
$$M_t \le \frac{3\eta^2 n\alpha^2}{1 - \delta} + \frac{27\eta^2 n\beta^2}{(1 - \sqrt{\delta})^2} + 9\eta^2 \sum_{j=0}^{t-1} \mathbb{E}\big\| \nabla f(\bar{x}_j) \mathbf{1}^\top \big\|_F^2 \left( \delta^{\,t-1-j} + \frac{2(\sqrt{\delta})^{\,t-1-j}}{1 - \sqrt{\delta}} \right) + 9 n\eta^2 L^2 \sum_{j=0}^{t-1} M_j \left( \delta^{\,t-1-j} + \frac{2(\sqrt{\delta})^{\,t-1-j}}{1 - \sqrt{\delta}} \right) + 3 \sum_{j=0}^{t-1} \mathbb{E}\|V_j\|_F^2 \left( \delta^{\,t-1-j} + \frac{2(\sqrt{\delta})^{\,t-1-j}}{1 - \sqrt{\delta}} \right). \tag{45}$$

Step 11: Bounding $T_2$ by the average disagreement. Recall $T_2 = \mathbb{E}\big\| \nabla f(\bar{x}_t) - \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_{i,t}) \big\|^2$. By the $L$-smoothness, we have
$$T_2 \le L^2 M_t, \qquad M_t = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\| x_{i,t} - \bar{x}_t \|^2. \tag{46}$$

Step 12: Iterative inequality and telescoping. From (31),
$$\mathbb{E} f(\bar{x}_{t+1}) \le \mathbb{E} f(\bar{x}_t) - \frac{\eta}{2}\,\mathbb{E}\|\nabla f(\bar{x}_t)\|^2 + \eta L^2 M_t + \frac{1}{\eta}\,\mathbb{E}\|\bar{e}_t\|^2 + \frac{\eta \alpha^2}{n}. \tag{47}$$
Summing (47) for $t = 0, \ldots$
, $T - 1$ gives
$$\frac{\eta}{2} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{x}_t)\|^2 \le f(\bar{x}_0) - f^* + \eta L^2 \sum_{t=0}^{T-1} M_t + \frac{1}{\eta} \sum_{t=0}^{T-1} \mathbb{E}\|\bar{e}_t\|^2 + \frac{\eta \alpha^2 T}{n}. \tag{48}$$

Step 13: Bounding $\sum_{t=0}^{T-1} M_t$. From the recursion (45), using $\frac{1}{1-\delta} \le \frac{1}{(1-\sqrt{\delta})^2}$, $\sum_{k=0}^{\infty} \delta^k = \frac{1}{1-\delta}$, and $\sum_{k=0}^{\infty} (\sqrt{\delta})^k = \frac{1}{1-\sqrt{\delta}}$, we have
$$\sum_{t=0}^{T-1} M_t \le \left( \frac{3\eta^2 n\alpha^2}{1-\delta} + \frac{27\eta^2 n\beta^2}{(1-\sqrt{\delta})^2} \right) T + \frac{27\eta^2}{(1-\sqrt{\delta})^2} \sum_{t=0}^{T-1} \mathbb{E}\big\|\nabla f(\bar{x}_t)\mathbf{1}^\top\big\|_F^2 + \frac{27 n\eta^2 L^2}{(1-\sqrt{\delta})^2} \sum_{t=0}^{T-1} M_t + \frac{9}{(1-\sqrt{\delta})^2} \sum_{t=0}^{T-1} \mathbb{E}\|V_t\|_F^2. \tag{49}$$
Equivalently,
$$\left( 1 - \frac{27 n\eta^2 L^2}{(1-\sqrt{\delta})^2} \right) \sum_{t=0}^{T-1} M_t \le \left( \frac{3\eta^2 n\alpha^2}{1-\delta} + \frac{27\eta^2 n\beta^2}{(1-\sqrt{\delta})^2} \right) T + \frac{27\eta^2}{(1-\sqrt{\delta})^2} \sum_{t=0}^{T-1} \mathbb{E}\big\|\nabla f(\bar{x}_t)\mathbf{1}^\top\big\|_F^2 + \frac{9}{(1-\sqrt{\delta})^2} \sum_{t=0}^{T-1} \mathbb{E}\|V_t\|_F^2. \tag{50}$$

Step 14: Solving for $\sum_{t=0}^{T-1} M_t$. From (50), define $\Gamma \triangleq 1 - \frac{27 n \eta^2 L^2}{(1-\sqrt{\delta})^2} > 0$. Using $\|\nabla f(\bar{x}_t)\mathbf{1}^\top\|_F^2 = n \|\nabla f(\bar{x}_t)\|^2$, (50) becomes
$$\sum_{t=0}^{T-1} M_t \le \frac{1}{\Gamma}\left[ \left( \frac{3\eta^2 n\alpha^2}{1-\delta} + \frac{27\eta^2 n\beta^2}{(1-\sqrt{\delta})^2} \right) T + \frac{27 n\eta^2}{(1-\sqrt{\delta})^2} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{x}_t)\|^2 + \frac{9}{(1-\sqrt{\delta})^2} \sum_{t=0}^{T-1} \mathbb{E}\|V_t\|_F^2 \right]. \tag{51}$$

Step 15: Final error bound. Plugging (51) into (48),
$$\frac{\eta}{2}\sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{x}_t)\|^2 \le f(\bar{x}_0) - f^* + \frac{\eta L^2}{\Gamma}\left( \frac{3\eta^2 n\alpha^2}{1-\delta} + \frac{27\eta^2 n\beta^2}{(1-\sqrt{\delta})^2} \right) T + \frac{27 n\eta^3 L^2}{(1-\sqrt{\delta})^2 \Gamma} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{x}_t)\|^2 + \frac{9\eta L^2}{(1-\sqrt{\delta})^2 \Gamma} \sum_{t=0}^{T-1} \mathbb{E}\|V_t\|_F^2 + \frac{1}{\eta}\sum_{t=0}^{T-1} \mathbb{E}\|\bar{e}_t\|^2 + \frac{\eta \alpha^2 T}{n}. \tag{52}$$
Moving the gradient sum $\frac{27 n\eta^3 L^2}{(1-\sqrt{\delta})^2 \Gamma} \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{x}_t)\|^2$ from the right to the left of (52), we obtain
$$\eta\left( \frac{1}{2} - \frac{27 n\eta^2 L^2}{(1-\sqrt{\delta})^2 \Gamma} \right) \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{x}_t)\|^2 \le f(\bar{x}_0) - f^* + \frac{\eta L^2}{\Gamma}\left( \frac{3\eta^2 n\alpha^2}{1-\delta} + \frac{27\eta^2 n\beta^2}{(1-\sqrt{\delta})^2} \right) T + \frac{\eta \alpha^2 T}{n} + \frac{9\eta L^2}{(1-\sqrt{\delta})^2 \Gamma} \sum_{t=0}^{T-1} \mathbb{E}\|V_t\|_F^2 + \frac{1}{\eta}\sum_{t=0}^{T-1} \mathbb{E}\|\bar{e}_t\|^2. \tag{53}$$
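The geometric-series facts used in (43) and Step 13, namely $\sum_{k \ge 0} \delta^k = \frac{1}{1-\delta}$, $\sum_{k \ge 0} (\sqrt{\delta})^k = \frac{1}{1-\sqrt{\delta}}$, and the tail bound $\sum_{j' > j} \delta^{\,t-1-\frac{j+j'}{2}} \le \frac{(\sqrt{\delta})^{\,t-1-j}}{1-\sqrt{\delta}}$, can be sanity-checked numerically; the values of $\delta$ and the horizon below are arbitrary:

```python
import numpy as np

delta, t = 0.36, 12          # arbitrary contraction factor and horizon
sqd = np.sqrt(delta)
for j in range(t):
    # Inner sum over j' > j of delta^{t-1-(j+j')/2} ...
    s = sum(delta ** (t - 1 - (j + jp) / 2) for jp in range(j + 1, t))
    # ... is dominated by the geometric tail (sqrt(delta))^{t-1-j} / (1 - sqrt(delta)).
    assert s <= sqd ** (t - 1 - j) / (1 - sqd) + 1e-12
```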
Notice that
$$\|V_t\|_F^2 = \sum_{i=1}^n \|v_{i,t}\|^2 \le \sum_{i=1}^n \tau_t^2 = n\tau_t^2,$$
and
$$\|\bar{e}_t\| = \Big\| \frac{1}{n}\sum_{i=1}^n v_{i,t} \Big\| \le \frac{1}{n}\sum_{i=1}^n \|v_{i,t}\| \le \frac{1}{n}\sum_{i=1}^n \tau_t = \tau_t,$$
which yields $\|\bar{e}_t\|^2 \le \tau_t^2$. Plugging these bounds into (53), we obtain the final inequality:
$$\eta\left( \frac{1}{2} - \frac{27 n\eta^2 L^2}{(1-\sqrt{\delta})^2 \Gamma} \right) \sum_{t=0}^{T-1} \mathbb{E}\|\nabla f(\bar{x}_t)\|^2 \le f(\bar{x}_0) - f^* + \frac{\eta L^2}{\Gamma}\left( \frac{3\eta^2 n\alpha^2}{1-\delta} + \frac{27\eta^2 n\beta^2}{(1-\sqrt{\delta})^2} \right) T + \frac{\eta \alpha^2 T}{n} + \frac{9\eta L^2}{(1-\sqrt{\delta})^2 \Gamma} \sum_{t=0}^{T-1} n\tau_t^2 + \frac{1}{\eta}\sum_{t=0}^{T-1} \tau_t^2. \tag{54}$$

APPENDIX B

Lemma 1. Under Assumption 1, for any $i \in \{1, \ldots, n\}$ and any $t \in \mathbb{N}$,
$$\big\| a_{t,i} \big\|_2^2 = \Big\| \frac{\mathbf{1}}{n} - W^t e_i \Big\|_2^2 \le \delta^t.$$

Proof. Because $W$ is real symmetric and doubly stochastic, it is diagonalizable by an orthonormal basis: $W = U \operatorname{diag}(1, \lambda_2, \ldots, \lambda_n) U^\top$, with $|\lambda_\ell| < 1$ for $\ell \ge 2$. Moreover, $J = \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ is the orthogonal projector onto $\operatorname{span}\{\mathbf{1}\}$, hence
$$J = U \operatorname{diag}(1, 0, \ldots, 0) U^\top, \qquad W - J = U \operatorname{diag}(0, \lambda_2, \ldots, \lambda_n) U^\top.$$
Therefore,
$$W^t - J = U \operatorname{diag}(0, \lambda_2^t, \ldots, \lambda_n^t) U^\top, \qquad \|W^t - J\|_2 = \max_{\ell \ge 2} |\lambda_\ell|^t.$$
Since $\|W - J\|_2 = \max_{\ell \ge 2} |\lambda_\ell|$ and $\delta = \|W - J\|_2^2$, we obtain $\|W^t - J\|_2 = \|W - J\|_2^t = (\sqrt{\delta})^t$. Using $a_{t,i} = (J - W^t) e_i$ and $\|e_i\|_2 = 1$,
$$\|a_{t,i}\|_2 = \|(J - W^t) e_i\|_2 \le \|J - W^t\|_2 \|e_i\|_2 = (\sqrt{\delta})^t. \tag{55}$$
Squaring both sides of (55) yields $\|a_{t,i}\|_2^2 \le \delta^t$, completing the proof.
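Lemma 1 is also easy to verify numerically. The sketch below builds a symmetric doubly stochastic $W$ from a ring-graph Laplacian (a toy choice, not the paper's topology), computes $\delta = \|W - J\|_2^2$, and checks $\|\frac{1}{n}\mathbf{1} - W^t e_i\|_2^2 \le \delta^t$ for all nodes over a range of $t$:

```python
import numpy as np

n = 6
# Ring-graph Laplacian; W = I - 0.3 L is symmetric, doubly stochastic, nonnegative.
L = 2 * np.eye(n) - np.roll(np.eye(n), 1, axis=0) - np.roll(np.eye(n), -1, axis=0)
W = np.eye(n) - 0.3 * L
J = np.ones((n, n)) / n

delta = np.linalg.norm(W - J, 2) ** 2  # squared spectral norm, as in the lemma
assert delta < 1

for t in range(1, 15):
    Wt = np.linalg.matrix_power(W, t)
    for i in range(n):
        a = np.ones(n) / n - Wt[:, i]   # a_{t,i} = (1/n) 1 - W^t e_i
        assert a @ a <= delta ** t + 1e-12
```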
