Performance Limits of Stochastic Sub-Gradient Learning, Part II: Multi-Agent Case
Authors: Bicheng Ying, Ali H. Sayed
Bicheng Ying, Student Member, IEEE, and Ali H. Sayed, Fellow, IEEE

[This work was supported in part by NSF grants CIF-1524250, ECCS-1407712, and DARPA N66001-14-2-4029. A short conference version appears in [1]. The authors are with the Department of Electrical Engineering, University of California, Los Angeles, CA 90095. Emails: {ybc,sayed}@ucla.edu.]

Abstract: The analysis in Part I [2] revealed interesting properties of subgradient learning algorithms in the context of stochastic optimization when gradient noise is present. These algorithms are used when the risk functions are non-smooth and involve non-differentiable components. They have long been recognized as slowly converging methods. However, it was revealed in Part I [2] that the rate of convergence becomes linear for stochastic optimization problems, with the error iterate converging at an exponential rate α^i to within an O(µ)-neighborhood of the optimizer, for some α ∈ (0, 1) and small step-size µ. The conclusion was established under weaker assumptions than in the prior literature and, moreover, several important problems (such as LASSO, SVM, and Total Variation) were shown to satisfy these weaker assumptions automatically (but not the previously used conditions from the literature). These results revealed that sub-gradient learning methods have more favorable behavior than originally thought when used to enable continuous adaptation and learning. The results of Part I [2] were exclusive to single-agent adaptation. The purpose of the current Part II is to examine the implications of these discoveries when a collection of networked agents employs subgradient learning as its cooperative mechanism. The analysis will show that, despite the coupled dynamics that arise in a networked scenario, the agents are still able to attain linear convergence in the stochastic case; they are also able to reach agreement within O(µ) of the optimizer.

Index Terms: Sub-gradient algorithm, affine-Lipschitz, exponential rate, diffusion strategy, networked agents, SVM, LASSO.

I. INTRODUCTION AND REVIEW OF [2]

We briefly review the notation and findings from Part I [2] in preparation for examining the challenges that arise in the multi-agent scenario. In Part I [2], we considered an optimization problem of the form:

    w^⋆ = arg min_{w ∈ R^M} J(w)    (1)

where the possibly non-differentiable but strongly-convex risk function J(w) is expressed as the expectation of some convex, but also possibly non-differentiable, loss function Q(·), namely,

    J(w) ≜ E Q(w; x)    (2)

Here, the letter x represents the random data and the expectation is over the distribution of this data. The following sub-gradient algorithm was introduced and studied in Part I [2] for seeking w^⋆:

    w_i = w_{i−1} − µ ĝ(w_{i−1})    (3)
    S_i = κ S_{i−1} + 1    (4)
    w̄_i = (1 − 1/S_i) w̄_{i−1} + (1/S_i) w_i    (5)

with initial conditions S_0 = 1, w_0 = 0, and w̄_0 = 0. Boldface notation is used for w_i to highlight its stochastic nature, since the successive iterates are generated by relying on streaming data realizations for x. Moreover, the scalar κ ∈ [α, 1), where α = 1 − O(µ) is a number close to one. The term ĝ(w_{i−1}) in (3) is an approximate sub-gradient at location w_{i−1}; it is computed from the data available at time i and approximates a true sub-gradient denoted by g(w_{i−1}). This true sub-gradient is unavailable since J(w) itself is unavailable in the stochastic context. This is because the distribution of the data x is unknown beforehand, which means that the expected loss function cannot be evaluated.
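As an illustration, the recursions (3)-(5) can be sketched in a few lines. The quadratic-plus-ℓ1 loss, the synthetic data model, and all parameter values below are illustrative choices for this sketch, not taken from the paper:

```python
import numpy as np

def run_subgradient(mu=0.01, kappa=0.99, num_iter=2000, seed=0):
    """Sketch of recursions (3)-(5) on a synthetic l1-regularized
    quadratic risk; data model and parameters are illustrative."""
    rng = np.random.default_rng(seed)
    M, delta = 5, 0.01
    w_true = rng.standard_normal(M)
    w = np.zeros(M)        # w_0 = 0
    w_bar = np.zeros(M)    # averaged iterate, bar{w}_0 = 0
    S = 1.0                # S_0 = 1
    for _ in range(num_iter):
        h = rng.standard_normal(M)                         # streaming regressor
        gamma = h @ w_true + 0.1 * rng.standard_normal()   # noisy target
        g_hat = -h * (gamma - h @ w) + delta * np.sign(w)  # approximate subgradient
        w = w - mu * g_hat                                 # (3)
        S = kappa * S + 1.0                                # (4)
        w_bar = (1.0 - 1.0 / S) * w_bar + (1.0 / S) * w    # (5)
    return w_bar, w_true

w_bar, w_true = run_subgradient()
print(np.linalg.norm(w_bar - w_true))  # small: the iterate settles in a small neighborhood
```

The constant step-size keeps adaptation alive, while the κ-weighted average (4)-(5) smooths out the gradient noise.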
The difference between a true sub-gradient vector and its approximation is the gradient noise, denoted by

    s_i(w_{i−1}) ≜ ĝ(w_{i−1}) − g(w_{i−1})    (6)

A. Data Model and Assumptions

The following three assumptions were motivated in Part I [2]:

1. J(w) is η-strongly-convex, so that w^⋆ is unique. The strong convexity of J(w) means that

    J(θw_1 + (1−θ)w_2) ≤ θJ(w_1) + (1−θ)J(w_2) − (η/2)θ(1−θ)‖w_1 − w_2‖²    (7)

for any θ ∈ [0, 1], w_1, and w_2. The above condition is equivalent to requiring [3]:

    J(w_1) ≥ J(w_2) + g(w_2)^T(w_1 − w_2) + (η/2)‖w_1 − w_2‖²    (8)

2. The subgradient is affine-Lipschitz, meaning that there exist constants c ≥ 0 and d ≥ 0 such that

    ‖g(w_1) − g′(w_2)‖ ≤ c‖w_1 − w_2‖ + d,  ∀ w_1, w_2    (9)

and for any g′(·) ∈ ∂J(·). Here, the notation ∂J(w) denotes the subdifferential at location w (i.e., the set of all possible subgradient vectors at w). It was explained in Part I [2] how this affine-Lipschitz condition is weaker than conditions used before in the literature, and how important cases of interest (such as SVM, LASSO, Total Variation) satisfy it automatically (but do not satisfy the previous conditions). For later use, it is easy to verify (as was done in (50) of Part I [2]) that condition (9) implies

    ‖g(w_1) − g′(w_2)‖² ≤ e²‖w_1 − w_2‖² + f²,  ∀ w_1, w_2    (10)

for any g′(·) ∈ ∂J(·) and some constants e² ≥ 0 and f² ≥ 0.

3. The first and second-order moments of the gradient noise process satisfy the conditions:

    E[ s_i(w_{i−1}) | F_{i−1} ] = 0    (11)
    E[ ‖s_i(w_{i−1})‖² | F_{i−1} ] ≤ β²‖w^⋆ − w_{i−1}‖² + σ²    (12)

for some constants β² ≥ 0 and σ² ≥ 0, and where the notation F_{i−1} denotes the filtration (collection) corresponding to all past iterates:

    F_{i−1} = filtration by {w_j, j ≤ i−1}    (13)

It was again shown in Part I [2] how the gradient noise process in important applications (e.g., SVM, LASSO) satisfies (11)-(12) directly.

Under the three conditions 1)-3), which are automatically satisfied for important cases of interest, the following important conclusion was proven in Part I [2] for the stochastic subgradient algorithm (3)-(5) above:

    lim_{i→∞} E J(w̄_i) − J(w^⋆) ≤ µ(f² + σ²)/2    (14)

where the convergence of E J(w̄_i) to this neighborhood of J(w^⋆) occurs at an exponential rate O(α^i), with α = 1 − µη + O(µ²).

B. Interpretation of Result

For the benefit of the reader, we repeat here the interpretation that was given in Sec. IV.D of Part I [2] for the key result (14); these remarks are relevant in the networked case and are therefore useful to highlight again:

1) First, it has been observed in the optimization literature [3]-[5] that sub-gradient descent iterations can perform poorly in deterministic problems (where J(w) is known). Their convergence rate is O(1/√i) under convexity and O(1/i) under strong convexity when decaying step-sizes, µ(i) = 1/i, are used to ensure convergence [5]. Result (14) shows that the situation is different in the context of stochastic optimization, when true subgradients are approximated from streaming data, due to the different requirements. By using constant step-sizes to enable continuous learning and adaptation, the sub-gradient iteration is now able to achieve exponential convergence at the rate O(α^i) to steady-state.

2) Second, this substantial improvement in convergence rate comes at a cost, but one that is acceptable and controllable. Specifically, we can no longer guarantee convergence of the algorithm to the global minimum value, J(w^⋆), but can instead approach this optimal value with high accuracy on the order of O(µ), where the size of µ is under the designer's control and can be selected as small as desired.

3) Third, this performance level is sufficient in most cases of interest because, in practice, one rarely has an infinite amount of data and, moreover, the data is often subject to distortions not captured by any assumed models. It is increasingly recognized in the literature that it is not always necessary to ensure exact convergence towards the optimal solution, w^⋆, or the minimum value, J(w^⋆), because these optimal values may not reflect the true state accurately due to modeling errors. For example, it is explained in the works [3], [6]-[8] that it is generally unnecessary to reduce the error measures below the statistical error level that is present in the data.

C. This Work

The purpose of this work is to examine how these properties reveal themselves in the networked case, when a multitude of interconnected agents cooperate to minimize an aggregate cost function that is not generally smooth. In this case, it is necessary to examine closely the effect of the coupled dynamics and whether agents will still be able to agree fast enough under non-differentiability.

Distributed learning under non-smooth risk functions is common in many applications, including distributed estimation and distributed machine learning. For example, ℓ1-regularization or hinge-loss functions (as in SVM implementations) lead to non-smooth risks. Several useful techniques have been developed in the literature for the solution of such distributed optimization problems, including the use of consensus strategies [9]-[11] and diffusion strategies [12]-[15].
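As a concrete illustration of condition (9) (this example is mine, not the paper's), consider the ℓ1-regularized quadratic risk J(w) = (1/2)‖w − b‖² + δ‖w‖_1 with subgradient g(w) = (w − b) + δ·sgn(w). The subgradient grows unboundedly with w, yet it is affine-Lipschitz:

```python
import numpy as np

# For g(w) = (w - b) + delta*sign(w), the sign term is bounded, so
# ||g(w1) - g(w2)|| <= ||w1 - w2|| + 2*delta*sqrt(M): condition (9)
# holds with c = 1 and d = 2*delta*sqrt(M), although g itself is unbounded.
rng = np.random.default_rng(1)
M, delta = 10, 0.05
b = rng.standard_normal(M)
g = lambda w: (w - b) + delta * np.sign(w)
c, d = 1.0, 2.0 * delta * np.sqrt(M)
for _ in range(1000):
    w1, w2 = 10.0 * rng.standard_normal((2, M))
    assert np.linalg.norm(g(w1) - g(w2)) <= c * np.linalg.norm(w1 - w2) + d + 1e-12
print("affine-Lipschitz bound (9) holds on all random samples")
```

A bounded-subgradient assumption would fail here, which is precisely the limitation that condition (9) avoids.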
In this paper, we focus on the adapt-then-combine (ATC) diffusion strategy, mainly because diffusion strategies have been shown to have superior mean-square-error and stability performance in adaptive scenarios where agents are expected to learn continually from streaming data [15]. In particular, we shall examine the performance and stability behavior of networked diffusion learning under weaker conditions than previously considered in the literature. It is true that there have been several useful studies that employed sub-gradient constructions in the distributed setting before, most notably [9], [16], [17]. However, these earlier works generally assume bounded subgradients. As was already explained in Part I [2], this is a serious limitation (which does not hold even for quadratic risks, where the gradient vector is linear in w and grows unbounded). Instead, we shall consider the weaker affine-Lipschitz condition (9), which was shown in Part I [2] to be satisfied automatically by important risk functions such as those arising in popular quadratic, SVM, and LASSO formulations.

Notation: We use lowercase letters to denote vectors, uppercase letters for matrices, plain letters for deterministic variables, and boldface letters for random variables. We also use (·)^T to denote transposition, (·)^{−1} for matrix inversion, Tr(·) for the trace of a matrix, λ(·) for the eigenvalues of a matrix, ‖·‖ for the 2-norm of a matrix or the Euclidean norm of a vector, and ρ(·) for the spectral radius of a matrix. In addition, we use A ≥ B to denote that A − B is positive semi-definite, and p ≻ 0 to denote that all entries of the vector p are positive.

II. PROBLEM FORMULATION: MULTI-AGENT CASE

We now extend the single-agent analysis to multi-agent networks, where a collection of agents cooperates to seek the minimizer of a weighted aggregate cost of the form:

    min_w Σ_{k=1}^N q_k J_k(w)    (15)

where k refers to the agent index and q_k is some positive weighting coefficient added for generality. When the {q_k} are uniform and equal to each other, (15) amounts to minimizing the aggregate sum of the individual risks {J_k(w)}. We can assume, without loss of generality, that the weights {q_k} are normalized to add up to one:

    Σ_{k=1}^N q_k = 1    (16)

Each individual risk function continues to be expressed as the expected value of some loss function:

    J_k(w) ≜ E Q_k(w; x_k)    (17)

Here, the letter x_k represents the random data at agent k and the expectation is over the distribution of this data. Many problems in adaptation and learning involve risk functions of this form, including, for example, mean-square-error designs and support vector machine (SVM) solutions; see, e.g., [18]-[20]. We again allow each risk function J_k(w) to be non-differentiable. This situation is common in machine learning formulations, e.g., in SVM costs and in regularized sparsity-inducing formulations. We continue to assume that the individual costs satisfy Assumptions 1 and 2 described in the introduction, namely, conditions (8), (9), and (10), which ensure that each J_k(w) is strongly-convex and its sub-gradient vectors are affine-Lipschitz with parameters {η_k, c_k, d_k, e_k, f_k}; we attach a subscript k to these parameters to make them agent-dependent (alternatively, if desired, we can replace them by agent-independent parameters by using bounds on their values).

A. Network Model

We consider a network consisting of N separate agents connected by a topology.
As described in [12], [21], we assign a pair of nonnegative weights, {a_kℓ, a_ℓk}, to the edge connecting any two agents k and ℓ. The scalar a_ℓk is used by agent k to scale the data it receives from agent ℓ, and similarly for a_kℓ. The network is said to be connected if paths with nonzero scaling weights can be found linking any two distinct agents in both directions. The network is said to be strongly-connected if it is connected with at least one self-loop, meaning that a_kk > 0 for some agent k. Figure 1 shows one example of a strongly-connected network. For emphasis in this figure, each edge between two neighboring agents is represented by two directed arrows. The neighborhood of any agent k is denoted by N_k; it consists of all agents that are connected to k by edges, and we assume by default that this set includes agent k itself, regardless of whether agent k has a self-loop or not.

Fig. 1. Agents that are linked by edges can share information. The neighborhood of agent k is marked by the broken line and consists of the set N_k = {6, 7, ℓ, k}.

There are several strategies that the agents can employ to seek the minimizer w^⋆, including consensus and diffusion strategies [9]-[12], [21]. As noted earlier, in this work we focus on the latter class, since diffusion implementations have been shown to have superior stability and performance properties over consensus strategies when used in the context of adaptation and learning from streaming data (i.e., when the step-sizes are set to a constant value as opposed to a diminishing value) [12], [15], [21]. Although diminishing step-sizes annihilate the gradient noise term, they nevertheless disable adaptation and learning in the long run. In comparison, constant step-size updates keep adaptation alive, but they allow gradient noise to seep into the operation of the algorithm.
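For concreteness, one simple way to generate such scaling weights for a given topology is the uniform averaging rule sketched below; this is an illustrative choice of mine, since the text does not prescribe a specific rule:

```python
import numpy as np

def uniform_combination_matrix(adjacency):
    """Build a left-stochastic A = [a_lk] for a given topology: column k
    holds the weights agent k assigns to its neighborhood N_k (which
    includes k itself), so each column sums to one and a_kk > 0."""
    N = adjacency.shape[0]
    A = np.zeros((N, N))
    for k in range(N):
        neighbors = np.flatnonzero(adjacency[:, k])  # N_k, including k
        A[neighbors, k] = 1.0 / len(neighbors)       # a_lk = 1/|N_k|
    return A

# 4-agent ring with self-loops (hypothetical topology)
adj = np.eye(4, dtype=int)
for k in range(4):
    adj[k, (k + 1) % 4] = adj[(k + 1) % 4, k] = 1
A = uniform_combination_matrix(adj)
print(A.sum(axis=0))  # each column sums to one: left-stochastic
```

Since every agent keeps a positive self-weight, the resulting network is strongly-connected whenever the underlying topology is connected.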
The challenge in these scenarios is therefore to show that the dynamics of the diffusion strategy over the network are such that the gradient noise effect does not degrade performance, and that the network is still able to learn the unknown. This question has been answered before in the affirmative for smooth, twice-differentiable functions J_k(w); see [12]-[14], [21]. In this work, we want to pursue the analysis more generally for possibly non-differentiable risks, in order to encompass important applications (such as SVM learning by multi-agent networks, or LASSO and sparsity-aware learning by similar agents [22]-[25]). We also want to pursue the analysis under the weaker affine-Lipschitz assumption (9) on the sub-gradients, rather than the stronger conditions used in the prior literature, as we already explained in the earlier sections and in Part I [2].

B. Diffusion Strategy

We consider the following diffusion strategy in its adapt-then-combine (ATC) form:

    ψ_{k,i} = w_{k,i−1} − µ_k ĝ_k(w_{k,i−1})
    w_{k,i} = Σ_{ℓ ∈ N_k} a_{ℓk} ψ_{ℓ,i}    (18)

Here, the first step involves adaptation by agent k through a stochastic sub-gradient iteration, while the second step involves aggregation; we assume the gradient noise processes across all agents are independent of each other. The entries A = [a_{ℓk}] define a left-stochastic matrix, namely, the entries of A are non-negative and each of its columns adds up to one. Since the network is strongly-connected, the combination matrix A is primitive [21], [26]. This implies that A admits a Jordan decomposition of the form:

    A = V J V^{−1} ≜ [ p  V_R ] [ 1  0 ; 0  J_ε ] [ 1^T ; V_L^T ]    (19)

with a single eigenvalue at one and all other eigenvalues strictly inside the unit circle. The matrix J_ε has a Jordan structure in which the ones that would typically appear along its first sub-diagonal are replaced by a small positive number ε > 0.
Note that the eigenvectors of A corresponding to the eigenvalue at one satisfy

    Ap = p,  A^T 1 = 1    (20)

where 1 refers to a column vector with all its entries equal to one. It is further known from the Perron-Frobenius theorem [26] that the entries of p are all strictly positive; we normalize them to add up to one. We denote the individual entries of p by {p_k}, so that:

    p_k > 0,  Σ_{k=1}^N p_k = 1    (21)

Furthermore, since V V^{−1} = I, it holds that

    V_R^T 1 = 0,  V_L^T p = 0,  V_L^T V_R = I    (22)

Next, we introduce the vector

    q = col{q_1, q_2, …, q_N}    (23)

where q_k is the weight associated with J_k(w) in (15). Since the designer is free to select the step-size parameters, it turns out that we can always relate the vectors {p, q} in the following manner:

    q = ζ diag{µ_1, µ_2, …, µ_N} p    (24)

for some constant ζ > 0. Note, for instance, that for (24) to be valid the scalar ζ should satisfy ζ = q_k/(µ_k p_k) for all k. To make this expression for ζ independent of k, we may parameterize (select) the step-sizes as

    µ_k = (q_k / p_k) µ_o    (25)

for some small µ_o > 0. Then ζ = 1/µ_o, which is independent of k, and relation (24) is satisfied. Using (16) and (24), it is easy to check that

    Σ_{k=1}^N p_k µ_k = µ_o    (26)

Since the {p_k} are positive, smaller than one, and add up to one, the above expression shows that µ_o can be interpreted as a weighted average step-size parameter.

III. NETWORK PERFORMANCE

We are now ready to extend Theorem 1 from Part I [2] to the network case. The analysis is more challenging due to the coupling among the agents, but the result will establish that the distributed strategy is stable and converges exponentially fast for sufficiently small step-sizes. As was the case with Part I [2], the statement below is again in terms of pocket variables, which we define as follows. At every iteration i, the risk value that is attained by the iterate w_{k,i} is J_k(w_{k,i}).
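The step-size parameterization (25) and relations (24) and (26) can be checked numerically; the combination matrix and the weights {q_k} below are illustrative placeholders:

```python
import numpy as np

A = np.array([[0.6, 0.2, 0.1],
              [0.3, 0.5, 0.3],
              [0.1, 0.3, 0.6]])            # left-stochastic: columns sum to one
eigvals, eigvecs = np.linalg.eig(A)
p = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
p = p / p.sum()                            # Perron vector: Ap = p, p_k > 0, sum = 1
q = np.array([0.5, 0.3, 0.2])              # aggregate weights, sum to one
mu_o = 0.01
mu = q * mu_o / p                          # step-size rule (25)
print(np.allclose(q, (1.0 / mu_o) * mu * p))   # relation (24) with zeta = 1/mu_o
print(np.isclose(np.sum(p * mu), mu_o))        # weighted-average property (26)
```

The rule (25) simply compensates each agent's Perron weight p_k so that the pair {p, q} satisfies (24) exactly.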
This value is a random variable due to the randomness in the streaming data used to run the algorithm. We denote the mean risk value at agent k by E J_k(w_{k,i}). We again introduce a best pocket iterate, denoted by w^best_{k,i}. At any iteration i, the value that is saved in this pocket variable is the iterate that has generated the smallest mean risk value up to time i, i.e.,

    w^best_{k,i} ≜ arg min_{1 ≤ j ≤ i} E J_k(w_{k,j})    (27)

Observe that in the network case we now have N pocket values, one for each agent.

Theorem 1 (Network Performance): Consider using the stochastic sub-gradient diffusion algorithm (18) to seek the unique minimizer, w^⋆, of the optimization problem (15), where the risk functions, J_k(w), are assumed to satisfy assumptions (8), (10), and (12) with parameters {η_k, β_k², σ_k², e_k², f_k²}. Assume the step-size parameter is sufficiently small (see condition (111)). Then, it holds that

    E( Σ_{k=1}^N q_k J_k(w^best_{k,i}) − Σ_{k=1}^N q_k J_k(w^⋆) )
      ≤ ξ α^i Σ_{k=1}^N q_k E‖w_{k,0} − w^⋆‖²
        + (µ_o/2) Σ_{k=1}^N ( q_k f_k² + q_k σ_k² + 2h q_k [ f_k² + ‖g′_k(w^⋆)‖² + 1/2 ] )    (28)

The convergence of E Σ_{k=1}^N q_k J_k(w^best_{k,i}) towards a neighborhood of size O(µ_o) around Σ_{k=1}^N q_k J_k(w^⋆) occurs at an exponential rate, O(α^i), dictated by the parameter

    α ≜ max_k { 1 − µ_k η_k + µ_o²(2h + 1) e_k² + µ_o² β_k² } = 1 − O(µ_o)    (29)

Condition (111) further ahead ensures α ∈ (0, 1).

Proof: The argument is provided in Appendix A.

The above theorem clarifies the performance of the network in terms of the best pocket values across the agents. However, these pocket values are not readily available because the risk values, J_k(w_{k,i}), cannot be evaluated. This is due to the fact that the statistical properties of the data are not known beforehand.
As was the case with the single-agent scenario in Part I [2], a more practical conclusion can be deduced from the statement of the theorem as follows. We again introduce the geometric sum:

    S_L ≜ Σ_{j=0}^L α^{L−j} = α S_{L−1} + 1 = (1 − α^{L+1})/(1 − α)    (30)

as well as the normalized, convex-combination coefficients:

    r_L(j) ≜ α^{L−j}/S_L,  j = 0, 1, …, L    (31)

Using these coefficients, we define a weighted iterate at each agent:

    w̄_{k,L} ≜ Σ_{j=0}^L r_L(j) w_{k,j} = (1/S_L)( α^L w_{k,0} + α^{L−1} w_{k,1} + … + w_{k,L} )    (32)

and observe that w̄_{k,L} satisfies the recursive construction:

    w̄_{k,L} = (1 − 1/S_L) w̄_{k,L−1} + (1/S_L) w_{k,L}    (33)

In particular, as L → ∞, we have S_L → 1/(1 − α), and the above recursion simplifies in the limit to

    w̄_{k,L} = α w̄_{k,L−1} + (1 − α) w_{k,L}    (34)

Corollary 1 (Weighted Iterates): Under the same conditions as in Theorem 1, it holds that

    lim_{L→∞} E( Σ_{k=1}^N q_k J_k(w̄_{k,L}) − Σ_{k=1}^N q_k J_k(w^⋆) )
      ≤ (µ_o/2) Σ_{k=1}^N ( q_k f_k² + q_k σ_k² + 2h q_k [ f_k² + ‖g′_k(w^⋆)‖² + 1/2 ] ) = O(µ_o)    (35)

and convergence continues to occur at the same exponential rate, O(α^L).

Proof: The argument is provided in Appendix D.

Result (35) is an interesting conclusion. However, the statement is in terms of the averaged iterate w̄_{k,L}, whose computation requires knowledge of α. This latter parameter is global information, which is not readily available to all agents. Nevertheless, result (35) motivates the following useful distributed implementation with a similar guaranteed performance bound. We can replace α by a design parameter, θ, that is no less than α but still smaller than one, i.e., α ≤ θ < 1. Next, we introduce the weighted variable:

    w̄_{k,L} ≜ Σ_{j=0}^L r_L(j) w_{k,j}    (36)

where now

    r_L(j) = θ^{L−j}/S_L,  j = 0, 1, …, L    (37)

and

    S_L = Σ_{j=0}^L θ^{L−j}    (38)

Corollary 2 (Distributed Weighted Iterates): Under the same conditions as in Theorem 1 and α ≤ θ < 1, relation (35) continues to hold with w̄_{k,L} in (32) replaced by (36). Moreover, convergence now occurs at the exponential rate O(θ^L).

Proof: The argument is similar to the proof of Corollary 2 from Part I [2].

For ease of reference, we summarize in the listing below the stochastic subgradient learning algorithm with exponential smoothing for which Corollaries 1 and 2 hold.

Diffusion stochastic subgradient with exponential smoothing
Initialization: S_0 = 1, w̄_{k,0} = w_{k,0} = 0, θ = 1 − O(µ).
repeat for i ≥ 1:
  for each agent k:
    ψ_{k,i} = w_{k,i−1} − µ ĝ_k(w_{k,i−1})    (39)
    w_{k,i} = Σ_{ℓ ∈ N_k} a_{ℓk} ψ_{ℓ,i}    (40)
    S_i = θ S_{i−1} + 1    (41)
    w̄_{k,i} = (1 − 1/S_i) w̄_{k,i−1} + (1/S_i) w_{k,i}    (42)
  end
end

A. Interpretation of Results

Examining the bound in (35), and comparing it with result (88) from Part I [2] for the single-agent case, we observe that the topology of the network is now reflected in the bound through the weighting factors q_k and step-sizes µ_k, which can be related to the Perron entries p_k through (25). Recall from (20) that the {p_k} are the entries of the right-eigenvector of A corresponding to the eigenvalue at one. Moreover, the bound in (35) involves three terms (rather than only two, as in the single-agent case; compare with (88) from Part I [2]):

(1) q_k f_k², which arises from the non-smoothness of the risk function;
(2) q_k σ_k², which is due to gradient noise and the approximation of the true sub-gradient vector;
(3) 2h q_k [ f_k² + ‖g′_k(w^⋆)‖² + 1/2 ], which is an extra term in comparison to the single-agent case. We explain in (93) that the value of h is related to how far the error at each agent is from the weighted average error across the network. As for ‖g′_k(w^⋆)‖², this quantity reflects the disagreement among the agents over w^⋆.
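A compact sketch of the listing (39)-(42) is given below. The subgradient approximation, data stream, and all parameter values are illustrative (a quadratic-plus-ℓ1 loss shared by all agents), not the paper's simulation setup:

```python
import numpy as np

def diffusion_subgradient(A, subgrad, stream, M, mu=0.01, theta=0.995, num_iter=2000):
    """ATC diffusion with exponential smoothing, recursions (39)-(42).
    A is N x N left-stochastic; subgrad(w, sample) returns an instantaneous
    subgradient approximation; stream() yields one sample per agent.
    These interfaces are illustrative, not prescribed by the paper."""
    N = A.shape[0]
    W = np.zeros((N, M))       # w_{k,0} = 0, one row per agent
    W_bar = np.zeros((N, M))   # smoothed iterates bar{w}_{k,i}
    S = 1.0
    for _ in range(num_iter):
        samples = stream()
        psi = np.array([W[k] - mu * subgrad(W[k], samples[k]) for k in range(N)])  # (39)
        W = A.T @ psi                                    # (40): sum_l a_{lk} psi_{l,i}
        S = theta * S + 1.0                              # (41)
        W_bar = (1.0 - 1.0 / S) * W_bar + (1.0 / S) * W  # (42)
    return W_bar

# Illustrative usage: all agents observe the same sparse model w_true
rng = np.random.default_rng(2)
N, M, delta = 3, 4, 0.01
w_true = rng.standard_normal(M)
A = np.full((N, N), 1.0 / N)   # fully connected, uniform combination weights

def stream():
    H = rng.standard_normal((N, M))
    return [(H[k], H[k] @ w_true + 0.05 * rng.standard_normal()) for k in range(N)]

def g_hat(w, sample):
    h, gamma = sample
    return -h * (gamma - h @ w) + delta * np.sign(w)

W_bar = diffusion_subgradient(A, g_hat, stream, M)
```

Note that the combine step uses A.T so that row k of the result is the neighborhood sum Σ_ℓ a_{ℓk} ψ_{ℓ,i}, matching the column-stochastic convention for A.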
Because each function J_k(·) may have a different minimizer, g′_k(w^⋆) is generally nonzero.

IV. SIMULATIONS

Example 1 (Multi-agent LASSO problem). We now consider the LASSO problem with 20 agents connected according to Fig. 2. A quick review of the LASSO problem is as follows (a more detailed discussion, including the relationship between the proposed assumptions (8)-(10) and the LASSO formulation, can be found in Part I [2]). We consider the following cost function for each agent:

    J^lasso_k(w) ≜ (1/2) E‖γ_k − h_k^T w‖² + δ‖w‖_1    (43)

where δ > 0 is a regularization parameter and ‖w‖_1 denotes the ℓ1-norm of w. The variable γ_k plays the role of a desired signal for agent k, while h_k plays the role of a regression vector for the same agent. It is assumed that the regression data are zero-mean, wide-sense stationary, and Gaussian-distributed, i.e., h_k ∼ N(0, σ²_{h,k} I). We further assume that {γ_k, h_k} satisfy a linear model, with γ_k generated through:

    γ_k = h_k^T w_k^o + n_k    (44)

where n_k ∼ N(0, σ²_{n,k}) and w_k^o is some sparse random model at each agent. Each agent is allowed to have different regression and noise powers, as illustrated in Fig. 3. Under these modeling assumptions, we can determine a closed-form expression for w^⋆ as follows:

    w^⋆ = arg min_w Σ_{k=1}^N q_k J_k(w)
        = arg min_w (1/2) Σ_{k=1}^N q_k σ²_{h,k} ‖w − w_k^o‖² + δ‖w‖_1
        = arg min_w (1/2) Σ_{k=1}^N q_k σ²_{h,k} ‖w‖² − Σ_{k=1}^N q_k σ²_{h,k} [w_k^o]^T w + δ‖w‖_1
        = arg min_w (1/2) ( Σ_{k=1}^N q_k σ²_{h,k} ) ‖ w − ( Σ_{k=1}^N q_k σ²_{h,k} w_k^o )/( Σ_{k=1}^N q_k σ²_{h,k} ) ‖² + δ‖w‖_1    (45)

From the first-order optimality conditions, we obtain [27]:

    w^⋆ = S_λ( ( Σ_{k=1}^N q_k σ²_{h,k} w_k^o )/( Σ_{k=1}^N q_k σ²_{h,k} ) )    (46)

where the symbol S_λ represents the entry-wise soft-thresholding function with parameter λ, i.e.,

    S_λ(x) = sgn(x) · max{0, |x| − λ}    (47)

and

    λ = δ / Σ_{k=1}^N q_k σ²_{h,k}    (48)

where the notation sgn(a), for a scalar a, refers to the sign function:

    sgn(a) = +1 if a > 0;  0 if a = 0;  −1 if a < 0    (49)

For the stochastic sub-gradient implementation, the following instantaneous approximation for the sub-gradient is employed:

    ĝ^lasso_k(w_{k,i−1}) = −h_{k,i}( γ_k(i) − h_{k,i}^T w_{k,i−1} ) + δ · sgn(w_{k,i−1})    (50)

In Fig. 4, we compare the performance of this solution against several strategies, including standard diffusion LMS [12], [21], [28]:

    ψ_{k,i} = w_{k,i−1} + µ h_{k,i}( γ_k(i) − h_{k,i}^T w_{k,i−1} )
    w_{k,i} = Σ_{ℓ ∈ N_k} a_{ℓk} ψ_{ℓ,i}    (51)

and sparse diffusion LMS [22], [24], [25], [23, Eq. 21]:

Diffusion sparse LMS with exponential smoothing
Initialization: S_0 = 1, w̄_{k,0} = w_{k,0} = 0, θ = 1 − O(µ).
repeat for i ≥ 1:
  for each agent k:
    ψ_{k,i} = w_{k,i−1} + µ_k h_{k,i}( γ_k(i) − h_{k,i}^T w_{k,i−1} ) − µ_k δ · sgn(w_{k,i−1})    (52)
    w_{k,i} = Σ_{ℓ ∈ N_k} a_{ℓk} ψ_{ℓ,i}    (53)
    S_i = θ S_{i−1} + 1    (54)
    w̄_{k,i} = (1 − 1/S_i) w̄_{k,i−1} + (1/S_i) w_{k,i}    (55)
  end
end

The parameter setting is as follows: w_k^o ∈ R^100 has 5 random non-zero entries uniformly distributed between 0.5 and 1.5, and δ = 0.005. We simply let q_k = p_k and set the step-size for all agents at µ_k = µ_o = 0.001. From the simulations we find h = 1.24 for the factor that appears in (28). As for the exponential smoothing factor θ, we chose θ = 1 − 2µ_o( (1/N) Σ_{k=1}^N η_k ) = 0.9985.

Fig. 2. Network topology linking N = 20 agents.

Fig. 3. Feature and noise variances across the agents.

Example 2 (Multi-agent SVM learning). Next, we consider the multi-agent SVM problem. Similar to the LASSO problem, we provide a brief review of the notation; a more detailed discussion can be found in Part I [2].
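The closed form (46)-(48) is straightforward to evaluate numerically; the local models, regression powers, and weights below are randomly generated placeholders rather than the actual simulation parameters of this example:

```python
import numpy as np

def soft_threshold(x, lam):
    """Entry-wise soft-thresholding operator (47)."""
    return np.sign(x) * np.maximum(0.0, np.abs(x) - lam)

rng = np.random.default_rng(3)
N, M, delta = 5, 8, 0.005
q = np.full(N, 1.0 / N)                    # uniform aggregate weights
sigma2_h = rng.uniform(0.6, 1.0, size=N)   # regression powers (placeholders)
W_o = rng.standard_normal((N, M))          # local models w_k^o
weights = q * sigma2_h
lam = delta / weights.sum()                # threshold (48)
w_star = soft_threshold(weights @ W_o / weights.sum(), lam)  # minimizer (46)
print(w_star.shape)
```

In words, w^⋆ is the power-weighted average of the local models, with each entry shrunk toward zero by the threshold λ.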
Fig. 4. The excess-risk curves for several strategies in the multi-agent LASSO problem: diffusion LMS (51), sparse diffusion LMS [23], diffusion LMS with smoothing (51), (54), (55), sparse diffusion LMS with smoothing (52)-(55), and the upper bound (28) from Theorem 1.

The regularized SVM risk function for each agent is of the form:

    J^svm_k(w) ≜ (ρ/2)‖w‖² + E max{0, 1 − γ_k h_k^T w}    (56)

where ρ > 0 is a regularization parameter. We are generally given a collection of independent training data, {γ_k(i), h_{k,i}}, consisting of feature vectors and their class designations. We select

    q_k = 1/N  and  µ_k = µ_o/(N p_k)    (57)

One approximation for the sub-gradient construction at a generic location w, corresponding to generic data {γ, h}, is

    ĝ^svm(w) = ρw − γh I[γ h^T w ≤ 1]    (58)

where the indicator function I[a] is defined as follows:

    I[a] = 1 if statement a is true;  0 otherwise    (59)

Diffusion SVM with exponential smoothing
Initialization: S_0 = 1, w̄_{k,0} = w_{k,0} = 0, θ = 1 − O(µ).
repeat for i ≥ 1:
  for each agent k:
    ψ_{k,i} = (1 − ρµ) w_{k,i−1} + µ γ_k(i) h_{k,i} I[γ_k(i) h_{k,i}^T w_{k,i−1} ≤ 1]    (60)
    w_{k,i} = Σ_{ℓ ∈ N_k} a_{ℓk} ψ_{ℓ,i}    (61)
    S_i = θ S_{i−1} + 1    (62)
    w̄_{k,i} = (1 − 1/S_i) w̄_{k,i−1} + (1/S_i) w_{k,i}    (63)
  end
end

We distribute 32561 training samples from the Adult dataset¹ over a network consisting of 20 agents. We set ρ = 0.002 and µ_o = 0.15 for all agents. From Example 6 in Part I [2] and Theorem 1, we know that for the multi-agent SVM problem:

    α = max_k { 1 − µρ + µ²(2h + 1) e_k² } = max_k { 1 − µρ + µ²(2h + 1)·2ρ² }    (64)

We set θ = 1 − 0.9·µ_o ρ, which usually guarantees θ ≥ α. Fig. 5 (top) shows that cooperation among the agents outperforms the non-cooperative solution.
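A minimal sketch of the instantaneous SVM subgradient (58)-(59) follows; the data values are made up for illustration:

```python
import numpy as np

def svm_subgrad(w, gamma, h, rho=0.002):
    """Instantaneous subgradient (58) of the regularized hinge risk (56):
    the hinge term contributes -gamma*h only when the margin condition
    gamma*h^T w <= 1 is active, per the indicator (59)."""
    active = float(gamma * (h @ w) <= 1.0)
    return rho * w - gamma * h * active

h = np.array([1.0, -2.0, 0.5])
print(svm_subgrad(np.zeros(3), 1.0, h))  # margin active at w = 0: returns -h
```

Substituting this subgradient into the adapt step w − µĝ reproduces line (60) of the listing: the regularizer contributes the (1 − ρµ) shrinkage and the hinge term contributes +µγh when the margin is violated.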
Moreov er , the distributed network can almost match the performance of the centralized LIBSVM solution [29]. W e also examined the RCV1 dataset 2 . Here we hav e 20242 training data points and we distribute them ov er 20 agents. W e set the parameters to ρ = 1 × 10 − 5 and µ o = 0 . 5 (due to limited data). W e now use θ = 1 − 0 . 5 · µ o ρ since µ is not that small. The result is shown in Fig. 5 (right). V . C O N C L U S I O N In summary , we examined the performance of stochastic sub-gradient learning strategies over adaptive networks. W e proposed a new affine-Lipschitz condition, which is quite suitable for strongly con ve x but non-dif ferentiable cost functions and is automatically satisfied by se veral important cases including SVM, LASSO, T otal-V ariation denoising, etc. Under this weaker condition, the analysis establishes that sub-gradient strategies can attain exponential con ver gence rates, as opposed to sub-linear rates. The analysis also establishes that these strategies can approach the optimal solution within O ( µ ) , for sufficiently small step-sizes. A P P E N D I X A P R O O F O F T H E O R E M 1 Introduce the error vector , e w k,i = w ? − w k,i . W e collect the iterates and the respecti ve errors from across the network 1 https://archiv e.ics.uci.edu/ml/datasets/Adult 2 https://www .csie.ntu.edu.tw/ ∼ cjlin/libsvmtools/datasets/binary .html 8 0 200 400 600 800 1000 1200 1400 1600 50 55 60 65 70 75 80 85 Number of iterations Accuracy Multi−agent SVM problem(Adult data set) 1300 1400 1500 82 84 86 88 Diffusion SVM (60)−−(63) Non−cooperative SVM (60),(62),(63) LIBSVM [29] 0 200 400 600 800 1000 30 40 50 60 70 80 90 100 Number of iterations Accuracy Multi−agent SVM problem(RCV1 data set) 850 900 950 1000 86 88 90 92 94 96 98 Diffusion SVM (60)−−(63) Non−cooperative SVM (60),(62),(63) LIBSVM [29] Fig. 5. 
Performance of diffusion SVM (60)–(63), non-cooperative SVM (60), (62), (63), and LIBSVM [29] for the Adult dataset (top) and the RCV1 dataset (bottom); the vertical axis measures the percentage of correct predictions over the test dataset.

into block column vectors:

W_i ≜ col{w_{1,i}, w_{2,i}, ..., w_{N,i}}   (65)
W̃_i ≜ col{w̃_{1,i}, w̃_{2,i}, ..., w̃_{N,i}}.   (66)

We also define the extended quantities:

𝒜 ≜ A ⊗ I_M   (67)
G(W_{i−1}) ≜ col{g_1(w_{1,i−1}), ..., g_N(w_{N,i−1})}   (68)
S_i(W_{i−1}) ≜ col{s_{1,i}(w_{1,i−1}), ..., s_{N,i}(w_{N,i−1})}   (69)
U ≜ diag{µ_1, µ_2, ..., µ_N}/µ_o   (70)
𝒰 ≜ U ⊗ I_M   (71)

where ⊗ denotes the Kronecker product operation, and s_{k,i}(w_{k,i−1}) denotes the gradient noise at agent k. Using this notation, it is straightforward to verify that the network error vector generated by the diffusion strategy (18) evolves according to the following dynamics:

W̃_i = 𝒜^T ( W̃_{i−1} + µ_o 𝒰 G(W_{i−1}) + µ_o 𝒰 S_i(W_{i−1}) ).   (72)

Motivated by the treatment of the smooth case in [13], [14], [21], we introduce a useful change of variables. Let 𝒱 = V ⊗ I_M and 𝒥 = J ⊗ I_M. Multiplying (72) from the left by 𝒱^T gives

𝒱^T W̃_i = 𝒥^T ( 𝒱^T W̃_{i−1} + µ_o 𝒱^T 𝒰 G(W_{i−1}) + µ_o 𝒱^T 𝒰 S_i(W_{i−1}) ),   (73)

where, from (19),

𝒥 ≜ [ 1  0 ; 0  J_ε ] ⊗ I_M   (74)

and

𝒱^T 𝒰 = ( [ p^T ; V_R^T ] ⊗ I_M )( U ⊗ I_M ) = [ p^T U ; V_R^T U ] ⊗ I_M  =(24)  [ q^T ; V_R^T U ] ⊗ I_M.   (75)

To proceed, we introduce the partitioning

𝒱^T W̃_i = [ (p^T ⊗ I_M) W̃_i ; (V_R^T ⊗ I_M) W̃_i ] ≜ [ w̄_i ; W̌_i ]   (76)
𝒱^T 𝒰 G(W_{i−1}) = [ (q^T ⊗ I_M) G(W_{i−1}) ; (V_R^T U ⊗ I_M) G(W_{i−1}) ] ≜ [ ḡ(W_{i−1}) ; Ǧ(W_{i−1}) ]   (77)
𝒱^T 𝒰 S_i(W_{i−1}) = [ (q^T ⊗ I_M) S_i(W_{i−1}) ; (V_R^T U ⊗ I_M) S_i(W_{i−1}) ] ≜ [ s̄_i(W_{i−1}) ; Š_i(W_{i−1}) ]   (78)

where the quantities {w̄_i, ḡ(W_{i−1}), s̄_i(W_{i−1})} amount to the weighted averages:

w̄_i = Σ_{k=1}^N p_k w̃_{k,i}   (79)
ḡ(W_{i−1}) = Σ_{k=1}^N q_k g_k(w_{k,i−1})   (80)
s̄_i(W_{i−1}) = Σ_{k=1}^N q_k s_{k,i}(w_{k,i−1}).
(81)

It is useful to observe the asymmetry reflected in the fact that w̄_i is obtained by using the weights {p_k}, while the averages (77)–(78) are obtained by using the weights {q_k}. We can now rewrite (73) as

[ w̄_i ; W̌_i ] = [ I_M  0 ; 0  J_ε^T ⊗ I_M ] ( [ w̄_{i−1} ; W̌_{i−1} ] + µ_o [ ḡ(W_{i−1}) ; Ǧ(W_{i−1}) ] + µ_o [ s̄_i(W_{i−1}) ; Š_i(W_{i−1}) ] ).   (82)

Consider the top recursion, namely,

w̄_i = w̄_{i−1} + µ_o ḡ(W_{i−1}) + µ_o s̄_i(W_{i−1}).   (83)

Squaring and taking expectations conditioned on the past, we have

E[‖w̄_i‖² | F_{i−1}] = E[‖w̄_{i−1} + µ_o ḡ(W_{i−1}) + µ_o s̄_i(W_{i−1})‖² | F_{i−1}]
 = ‖w̄_{i−1}‖² + 2µ_o ḡ(W_{i−1})^T w̄_{i−1} + µ_o² ‖ḡ(W_{i−1})‖² + µ_o² E[‖s̄_i(W_{i−1})‖² | F_{i−1}],   (84)

since the gradient noise has zero mean conditioned on F_{i−1}. We examine the terms on the right-hand side one by one. First note that, using Jensen's inequality,

‖ḡ(W_{i−1})‖² = ‖ Σ_{k=1}^N q_k g_k(w_{k,i−1}) ‖²
 =(a) ‖ Σ_{k=1}^N q_k g_k(w_{k,i−1}) − Σ_{k=1}^N q_k g'_k(w⋆) ‖²
 ≤ Σ_{k=1}^N q_k ‖g_k(w_{k,i−1}) − g'_k(w⋆)‖²
 ≤(10) Σ_{k=1}^N q_k ( e_k² ‖w̃_{k,i−1}‖² + f_k² ).   (85)

In step (a), we exploit the fact that, by definition, w⋆ is the minimizer of (15) and, hence, there exist sub-gradients g'_k(w⋆), k = 1, 2, ..., N, satisfying Σ_{k=1}^N q_k g'_k(w⋆) = 0. Next, the noise term can be bounded by:

E[‖s̄_i(W_{i−1})‖² | F_{i−1}] = E[ ‖ Σ_{k=1}^N q_k s_{k,i}(w_{k,i−1}) ‖² | F_{i−1} ]
 ≤(a) Σ_{k=1}^N q_k E[‖s_{k,i}(w_{k,i−1})‖² | F_{i−1}]
 ≤ Σ_{k=1}^N q_k ( β_k² ‖w̃_{k,i−1}‖² + σ_k² ),   (86)

where step (a) follows from Jensen's inequality. Finally, with regard to the cross term in (84), we adapt an argument from [9] to obtain (89) by first noting that:

ḡ(W_{i−1})^T w̄_{i−1} = Σ_{k=1}^N q_k g_k^T(w_{k,i−1}) ( w̃_{k,i−1} + w̄_{i−1} − w̃_{k,i−1} )
 = Σ_{k=1}^N q_k g_k^T(w_{k,i−1}) w̃_{k,i−1} + Σ_{k=1}^N q_k g_k^T(w_{k,i−1}) ( w̄_{i−1} − w̃_{k,i−1} ).
(87)

Using the strong-convexity property (8), we have

g_k(w_{k,i−1})^T w̃_{k,i−1} ≤ J_k(w⋆) − J_k(w_{k,i−1}) − (η_k/2) ‖w̃_{k,i−1}‖².   (88)

Substituting into (87) gives

ḡ(W_{i−1})^T w̄_{i−1} ≤ Σ_{k=1}^N q_k [ J_k(w⋆) − J_k(w_{k,i−1}) − (η_k/2) ‖w̃_{k,i−1}‖² ] + Σ_{k=1}^N q_k g_k^T(w_{k,i−1}) ( w̄_{i−1} − w̃_{k,i−1} )
 ≤ Σ_{k=1}^N q_k [ J_k(w⋆) − J_k(w_{k,i−1}) − (η_k/2) ‖w̃_{k,i−1}‖² ] + Σ_{k=1}^N q_k ‖g_k(w_{k,i−1})‖ ‖w̄_{i−1} − w̃_{k,i−1}‖.   (89)

It follows, under expectation, that

E[ ḡ(W_{i−1})^T w̄_{i−1} ] ≤ Σ_{k=1}^N q_k [ J_k(w⋆) − E J_k(w_{k,i−1}) − (η_k/2) E‖w̃_{k,i−1}‖² ] + Σ_{k=1}^N q_k E( ‖g_k(w_{k,i−1})‖ ‖w̄_{i−1} − w̃_{k,i−1}‖ ).   (90)

Now, using the Cauchy-Schwarz inequality, we can bound the last expectation as

E( ‖g_k(w_{k,i−1})‖ ‖w̄_{i−1} − w̃_{k,i−1}‖ ) ≤ sqrt( E‖g_k(w_{k,i−1})‖² · E‖w̄_{i−1} − w̃_{k,i−1}‖² ).   (91)

After sufficient iterations, it will hold that (see Appendix B for the proof):

E‖w̄_{i−1} − w̃_{k,i−1}‖² = O(µ_o²).   (92)

This means that there exists an I_o large enough and a constant h such that for all i ≥ I_o:

E‖w̄_{i−1} − w̃_{k,i−1}‖² ≤ h² µ_o².   (93)

Therefore, we find that

E( ‖g_k(w_{k,i−1})‖ ‖w̄_{i−1} − w̃_{k,i−1}‖ ) ≤ h µ_o sqrt( E‖g_k(w_{k,i−1})‖² )
 ≤ h µ_o sqrt( 2 E‖g_k(w_{k,i−1}) − g'_k(w⋆)‖² + 2 ‖g'_k(w⋆)‖² )
 ≤(10) h µ_o sqrt( 2 e_k² E‖w̃_{k,i−1}‖² + 2 f_k² + 2 ‖g'_k(w⋆)‖² )
 ≤ h µ_o [ ( e_k² E‖w̃_{k,i−1}‖² + f_k² + ‖g'_k(w⋆)‖² ) / R + R/2 ],   (94)

where the last inequality follows from using

sqrt(x) ≤ x/(2R) + R/2,  x ≥ 0,   (95)

which holds because x/(2R) − sqrt(x) + R/2 = (1/2)( sqrt(x/R) − sqrt(R) )² ≥ 0 for any positive R, e.g., R = 1. This allows us to conclude that, as i → ∞:

E[ ḡ(W_{i−1})^T w̄_{i−1} ] ≤ Σ_{k=1}^N q_k [ J_k(w⋆) − E J_k(w_{k,i−1}) − (η_k/2) E‖w̃_{k,i−1}‖² ] + µ_o Σ_{k=1}^N h q_k [ e_k² E‖w̃_{k,i−1}‖² + f_k² + ‖g'_k(w⋆)‖² + 1/2 ].
(96)

Taking expectations of (84) over the filtration and substituting (85), (86), and (96), we obtain asymptotically that:

E‖w̄_i‖² ≤ E‖w̄_{i−1}‖² + 2µ_o Σ_{k=1}^N q_k [ J_k(w⋆) − E J_k(w_{k,i−1}) ] − µ_o Σ_{k=1}^N q_k η_k E‖w̃_{k,i−1}‖²
 + µ_o² Σ_{k=1}^N q_k ( e_k² E‖w̃_{k,i−1}‖² + f_k² ) + µ_o² Σ_{k=1}^N q_k ( β_k² E‖w̃_{k,i−1}‖² + σ_k² )
 + 2µ_o² Σ_{k=1}^N q_k h ( e_k² E‖w̃_{k,i−1}‖² + f_k² + ‖g'_k(w⋆)‖² + 1/2 )
 ≤ E‖w̄_{i−1}‖² + 2µ_o Σ_{k=1}^N q_k [ J_k(w⋆) − E J_k(w_{k,i−1}) ] − Σ_{k=1}^N (1 − α_k) p_k E‖w̃_{k,i−1}‖²
 + µ_o² Σ_{k=1}^N { q_k f_k² + q_k σ_k² + 2 h q_k [ f_k² + ‖g'_k(w⋆)‖² + 1/2 ] },   (97)

where we defined α_k in the second inequality as follows:

1 − α_k ≜ ( µ_o η_k − µ_o² e_k² − µ_o² β_k² − 2µ_o² h e_k² ) q_k / p_k  =(25)  µ_k ( η_k − µ_o e_k² − µ_o β_k² − 2µ_o h e_k² ).   (98)

Let α denote the largest α_k among all agents:

α ≜ max_{1≤k≤N} α_k.   (99)

Then, it holds that when α ∈ (0, 1), which will be shown later in (111):

Σ_{k=1}^N (1 − α_k) p_k E‖w̃_{k,i−1}‖² ≥ (1 − α) Σ_{k=1}^N p_k E‖w̃_{k,i−1}‖² ≥ (1 − α) E‖w̄_{i−1}‖²,   (100)

where we used Jensen's inequality to deduce that

‖w̄_{i−1}‖² = ‖ Σ_{k=1}^N p_k w̃_{k,i−1} ‖² ≤ Σ_{k=1}^N p_k ‖w̃_{k,i−1}‖².   (101)

It follows from (97) that

2µ_o Σ_{k=1}^N q_k ( E J_k(w_{k,i−1}) − J_k(w⋆) ) ≤ α E‖w̄_{i−1}‖² − E‖w̄_i‖² + µ_o² Σ_{k=1}^N { q_k f_k² + q_k σ_k² + 2 h q_k [ f_k² + ‖g'_k(w⋆)‖² + 1/2 ] }.   (102)

This inequality recursion has a form similar to the one we encountered in the single-agent case. Specifically, let us introduce the scalars:

a(i) ≜ Σ_{k=1}^N q_k ( E J_k(w_{k,i−1}) − J_k(w⋆) )   (103)
b(i) ≜ E‖w̄_i‖²   (104)
τ² ≜ Σ_{k=1}^N { q_k f_k² + q_k σ_k² + 2 h q_k [ f_k² + ‖g'_k(w⋆)‖² + 1/2 ] }.
(105)

Then, recursion (102) can be rewritten more compactly in the form:

2µ_o a(i) ≤ α b(i−1) − b(i) + µ_o² τ².   (106)

This recursion has the same format as equation (69) in Part I [2]. Lastly, notice that

Σ_{k=1}^N q_k ( E J_k(w_{k,i}^best) − J_k(w⋆) )  =(27)  Σ_{k=1}^N q_k min_{1≤i≤L} ( E J_k(w_{k,i−1}) − J_k(w⋆) )
 ≤ min_{1≤i≤L} Σ_{k=1}^N q_k ( E J_k(w_{k,i−1}) − J_k(w⋆) ) = min_{1≤i≤L} a(i).   (107)

This result ensures that w_{k,i}^best satisfies a condition similar to (76) in Part I [2]. The argument can now be continued similarly to arrive at the conclusions in the statement of the theorem. Stability is ensured by requiring α_k ∈ (0, 1), i.e.,

α_k = 1 − µ_k ( η_k − µ_o e_k² − µ_o β_k² − 2µ_o h e_k² ) ∈ (0, 1).   (108)

The condition α_k < 1 is met for

µ_o < η_k / ( β_k² + (1 + 2h) e_k² ),  ∀k,   (109)

while the condition α_k > 0 requires

µ_k ( η_k − µ_o e_k² − µ_o β_k² − 2µ_o h e_k² ) < 1.   (110)

But because η_k − µ_o e_k² − µ_o β_k² − 2µ_o h e_k² ≤ η_k, we conclude that 0 < µ_k < 1/η_k is sufficient for condition (110). Combining these conditions with (25), we establish

µ_k < min{ 1/η_k , η_k q_k / ( p_k β_k² + (1 + 2h) p_k e_k² ) },   (111)

which ensures α_k ∈ (0, 1).

APPENDIX B
PROOF OF (92)

We establish the asymptotic result (92). Let

W̄_i = col{ w̄_i, ..., w̄_i } = 1_N ⊗ w̄_i,   (112)

where the vector w̄_i is stacked N times to match the dimension of W̃_i. We start from the second relation in the error recursion (82):

W̌_i = ( J_ε^T ⊗ I_M ) ( W̌_{i−1} + µ_o Ǧ(W_{i−1}) + µ_o Š_i(W_{i−1}) ),   (113)

and first explain how to recover W̃_i − W̄_i from W̌_i.
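Since a(i) ≥ 0, recursion (106) implies the scalar bound b(i) ≤ α b(i−1) + µ_o² τ², a fixed-point iteration that contracts at the exponential rate α toward the limit µ_o² τ²/(1−α); because 1−α = O(µ_o), this limit is O(µ_o). The sketch below iterates this bound with illustrative values of η, τ², and µ_o (hypothetical, chosen only to make the rates visible):

```python
# Iterate the scalar bound implied by (106): b(i) <= alpha*b(i-1) + mu_o^2 * tau2.
# All numeric values are illustrative assumptions, not taken from the paper.
mu_o, eta, tau2 = 0.01, 1.0, 4.0
alpha = 1.0 - eta * mu_o             # 1 - alpha = O(mu_o), to first order in (98)
b = 1.0                              # b(0) = E||w_bar_0||^2 (arbitrary start)
for i in range(5000):
    b = alpha * b + mu_o**2 * tau2   # contraction at rate alpha per iteration

# fixed point: mu_o^2 * tau2 / (1 - alpha) = mu_o * tau2 / eta = O(mu_o)
limit = mu_o**2 * tau2 / (1.0 - alpha)
```

Halving µ_o halves the limit, which is precisely the O(µ_o) steady-state risk neighborhood claimed by Theorem 1.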
From (76),

W̃_i = 𝒱^{−T} [ w̄_i ; W̌_i ]  =(19)  [ 1_N ⊗ I_M , 𝒱_L ] [ w̄_i ; W̌_i ] = W̄_i + 𝒱_L W̌_i.   (114)

Next, returning to the error recursion (113) and computing the expected squared norm, we obtain:

E[‖W̌_i‖² | F_{i−1}] = ‖ (J_ε^T ⊗ I_M)( W̌_{i−1} + µ_o Ǧ(W_{i−1}) ) ‖² + µ_o² E[ ‖ (J_ε^T ⊗ I_M) Š_i(W_{i−1}) ‖² | F_{i−1} ]
 ≤ ρ(J_ε J_ε^T) ‖ W̌_{i−1} + µ_o Ǧ(W_{i−1}) ‖² + µ_o² ρ(J_ε J_ε^T) E[ ‖Š_i(W_{i−1})‖² | F_{i−1} ],   (115)

where J_ε denotes the stable lower-right block of 𝒥 in (74) and, from [21, Ch. 9], we know that for any small ε > 0:

ρ(J_ε J_ε^T) ≤ ( ρ(J_ε) + ε )² < 1.   (116)

Let us examine the terms in (115). To begin with, note that

ρ(J_ε J_ε^T) ‖ W̌_{i−1} + µ_o Ǧ(W_{i−1}) ‖² ≤ ( ρ(J_ε) + ε )² ‖ t · (1/t) W̌_{i−1} + (1 − t) · (1/(1−t)) µ_o Ǧ(W_{i−1}) ‖²
 ≤(a) ( (ρ(J_ε) + ε)² / t ) ‖W̌_{i−1}‖² + ( µ_o² (ρ(J_ε) + ε)² / (1 − t) ) ‖Ǧ(W_{i−1})‖²
 ≤(b) ( ρ(J_ε) + ε ) ‖W̌_{i−1}‖² + ( µ_o² (ρ(J_ε) + ε)² / (1 − ρ(J_ε) − ε) ) ‖Ǧ(W_{i−1})‖²,   (117)

where step (a) follows from Jensen's inequality, and in step (b) we select t = ρ(J_ε) + ε < 1. Next, we bound the squared sub-gradient term:

‖Ǧ(W_{i−1})‖² = ‖ (V_R^T U ⊗ I_M) G(W_{i−1}) ‖² ≤ ‖V_R‖² ‖U‖² Σ_{k=1}^N ‖g_k(w_{k,i−1})‖²
 ≤(a) 2 ‖V_R‖² ‖U‖² Σ_{k=1}^N ( ‖g_k(w_{k,i−1}) − g'_k(w⋆)‖² + ‖g'_k(w⋆)‖² )
 ≤ 2 ‖V_R‖² ‖U‖² Σ_{k=1}^N ( e_k² ‖w̃_{k,i−1}‖² + f_k² + ‖g'_k(w⋆)‖² )
 ≤(b) 2 ‖V_R‖² ‖U‖² ( e_max² ‖W̃_{i−1}‖² + Σ_{k=1}^N ( f_k² + ‖g'_k(w⋆)‖² ) ),   (118)

where in step (a) we subtract and add g'_k(w⋆) inside the norm and the factor 2 comes from Jensen's inequality, and in step (b) we let e_max² = max_k e_k². We can then bound (117) by

ρ(J_ε J_ε^T) ‖ W̌_{i−1} + µ_o Ǧ(W_{i−1}) ‖² ≤ ( ρ(J_ε) + ε ) ‖W̌_{i−1}‖²
 + ( 2µ_o² (ρ(J_ε) + ε)² / (1 − ρ(J_ε) − ε) ) ‖V_R‖² ‖U‖² e_max² ‖W̃_{i−1}‖²
 + ( 2µ_o² (ρ(J_ε) + ε)² / (1 − ρ(J_ε) − ε) ) ‖V_R‖² ‖U‖² Σ_{k=1}^N ( f_k² + ‖g'_k(w⋆)‖² ).
(119)

Finally, we consider the last term in (115), involving the gradient noise:

E[‖Š_i(W_{i−1})‖² | F_{i−1}] ≤ ‖V_R‖² ‖U‖² Σ_{k=1}^N ( β_k² ‖w̃_{k,i−1}‖² + σ_k² )
 ≤ ‖V_R‖² ‖U‖² β_max² ‖W̃_{i−1}‖² + ‖V_R‖² ‖U‖² Σ_{k=1}^N σ_k²,   (120)

where β_max² = max_k β_k². Now introduce the constants:

a ≜ ( 2(ρ(J_ε) + ε)² / (1 − ρ(J_ε) − ε) ) ‖V_R‖² ‖U‖² e_max² + ρ(J_ε J_ε^T) ‖V_R‖² ‖U‖² β_max²,   (121)
b ≜ ( 2(ρ(J_ε) + ε)² / (1 − ρ(J_ε) − ε) ) ‖V_R‖² ‖U‖² Σ_{k=1}^N ( f_k² + ‖g'_k(w⋆)‖² ) + ρ(J_ε J_ε^T) ‖V_R‖² ‖U‖² Σ_{k=1}^N σ_k².   (122)

Although the matrix U depends on the step-sizes µ_k, its entries are ratios relative to µ_o, so that a and b do not grow with µ_o. Then, substituting the previous results into (115), we arrive at

E‖W̌_i‖² ≤ ( ρ(J_ε) + ε ) E‖W̌_{i−1}‖² + µ_o² a E‖W̃_{i−1}‖² + µ_o² b.   (123)

In Appendix C we show that E‖W̃_{i−1}‖² is bounded by a constant value, for any iteration i, for sufficiently small step-sizes. In this case, we can conclude that

E‖W̌_i‖² ≤ ( ρ(J_ε) + ε ) E‖W̌_{i−1}‖² + µ_o² b′,   (124)

for some constant b′, so that at steady state:

limsup_{i→∞} E‖W̌_i‖² ≤ µ_o² b′ / ( 1 − ρ(J_ε) − ε ) = O(µ_o²).   (125)

Using relation (114), it then follows asymptotically that, for i ≫ 1:

E‖W̃_i − W̄_i‖² ≤ ‖𝒱_L‖² · E‖W̌_i‖² = O(µ_o²),   (126)

and, consequently,

E‖w̃_{k,i} − w̄_i‖² ≤ E‖W̃_i − W̄_i‖² = O(µ_o²).   (127)

APPENDIX C
PROOF THAT E‖W̃_i‖² IS UNIFORMLY BOUNDED

We use mathematical induction to establish that E‖W̃_i‖² is uniformly bounded by a constant value, for all i. Assume that, at the initial time instant, we have E‖w̃_{k,0}‖² ≤ c for all k and for some constant value c. Then, assuming this bound holds at iteration i−1, namely,

E‖w̃_{k,i−1}‖² ≤ c,  ∀k,   (128)

we would like to show that it also holds at iteration i. Recall from (18) that the diffusion strategy consists of two steps: an adaptation step followed by a combination step.
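The geometric recursion (124) above can also be checked numerically: its steady state is µ_o² b′/(1−λ), where λ plays the role of ρ(J_ε)+ε, so halving µ_o quarters the disagreement bound (125). In the sketch below, λ and b′ are hypothetical stand-ins, not values from the paper:

```python
# Steady state of a geometric recursion of the form (124):
#   x_i = lam * x_{i-1} + mu_o^2 * b0.
# lam and b0 are illustrative stand-ins for rho(J_eps)+eps and b' in the text.
def steady_state(mu_o, lam=0.5, b0=2.0, iters=200):
    x = 1.0                          # arbitrary initial disagreement
    for _ in range(iters):
        x = lam * x + mu_o**2 * b0
    return x

s1, s2 = steady_state(0.02), steady_state(0.01)
ratio = s1 / s2                      # O(mu_o^2) scaling: (0.02/0.01)^2 = 4
```

The transient λ^i x_0 dies out exponentially, leaving the O(µ_o²) floor µ_o² b′/(1−λ), which is exactly the scaling asserted in (125).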
The adaptation step has a structure similar to the single-agent case. Hence, the same derivation that was used for the single-agent case in Part I [2, Eq. (64)] shows that, for agent k:

2µ_k ( E J_k(w_{k,i−1}) − J_k(w_k⋆) ) ≤ α_k E‖w̃_{k,i−1}‖² − E‖ψ̃_{k,i}‖² + µ_k² ( f_k² + σ_k² ),   (129)

where

α_k = 1 − µ_k η_k + µ_k² ( e_k² + β_k² ) = 1 − O(µ_k),   (130)
w_k⋆ ≜ argmin_w J_k(w).   (131)

Now, since E J_k(w_{k,i−1}) ≥ J_k(w_k⋆), we conclude that

E‖ψ̃_{k,i}‖² ≤ α_k E‖w̃_{k,i−1}‖² + µ_k² ( f_k² + σ_k² )  ≤(128)  α_k c + µ_k² ( f_k² + σ_k² ),   (132)

where the step-size µ_k can be chosen small enough to ensure α_k ∈ (0, 1). Now, it is also clear that there exist sufficiently small values for µ_k to ensure that, for all agents k:

α_k c + µ_k² ( f_k² + σ_k² ) ≤ c,   (133)

which then guarantees that

E‖ψ̃_{k,i}‖² ≤ c.   (134)

It then follows from the combination step in (18) that

E‖w̃_{k,i}‖² = E‖ Σ_{ℓ∈N_k} a_{ℓk} ψ̃_{ℓ,i} ‖² ≤ Σ_{ℓ∈N_k} a_{ℓk} E‖ψ̃_{ℓ,i}‖² ≤ Σ_{ℓ∈N_k} a_{ℓk} c = c,  ∀k,   (135)

where the first inequality is again by Jensen's inequality and the last equality uses Σ_{ℓ∈N_k} a_{ℓk} = 1. Therefore, starting from (128), we conclude that E‖w̃_{k,i}‖² ≤ c as well, as desired. Finally, since E‖W̃_i‖² = Σ_{k=1}^N E‖w̃_{k,i}‖², we conclude that E‖W̃_i‖² is also uniformly bounded over time.

APPENDIX D
PROOF OF COROLLARY 1

Iterating (106) over 1 ≤ i ≤ L, for some interval length L, gives:

Σ_{i=1}^L α_m^{L−i} ( 2µ_o a(i) − µ_o² τ² ) ≤ α_m^L b(0).   (136)

Then, dividing both sides by the sum S_{L−1}:

Σ_{i=1}^L ( α_m^{L−i} / S_{L−1} ) ( 2µ_o a(i) − µ_o² τ² ) ≤ ( α_m^L / S_{L−1} ) b(0).   (137)

Now, because of the convexity of each J_k(·), we have

J_k(w̄_{k,L−1}) ≤ Σ_{j=0}^{L−1} r_{L−1}(j) J_k(w_{k,j}).   (138)

Thus, we can establish:

Σ_{i=1}^L ( α_m^{L−i} / S_{L−1} ) a(i) = Σ_{i=1}^L r_{L−1}(i−1) Σ_{k=1}^N q_k ( E J_k(w_{k,i−1}) − J_k(w⋆) )
 ≥ Σ_{k=1}^N q_k ( E J_k(w̄_{k,L−1}) − J_k(w⋆) ).
(139)

Substituting into (137), we establish:

2µ_o Σ_{k=1}^N q_k ( E J_k(w̄_{k,L−1}) − J_k(w⋆) ) ≤ ( α_m^L / S_{L−1} ) b(0) + µ_o² τ².   (140)

Letting L → ∞, we establish (35).

REFERENCES

[1] B. Ying and A. H. Sayed, "Performance limits of single-agent and multi-agent sub-gradient stochastic learning," in Proc. IEEE ICASSP, Shanghai, China, Mar. 2016, pp. 4905–4909.
[2] B. Ying and A. H. Sayed, "Performance limits of stochastic sub-gradient learning, Part I: Single agent case," submitted for publication, 2017.
[3] B. T. Polyak, Introduction to Optimization, Optimization Software, 1987.
[4] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, 1999.
[5] Y. Nesterov, Introductory Lectures on Convex Optimization, Springer, 2004.
[6] O. Bousquet and L. Bottou, "The tradeoffs of large scale learning," in Advances in Neural Information Processing Systems (NIPS), 20, pp. 161–168, 2008.
[7] L. Bottou, "Stochastic gradient tricks," in Neural Networks, Tricks of the Trade, Reloaded, Lecture Notes in Computer Science (LNCS 7700), pp. 430–445, Springer, 2012.
[8] Z. J. Towfic and A. H. Sayed, "Stability and performance limits of adaptive primal-dual networks," IEEE Trans. Signal Process., vol. 63, no. 11, pp. 2888–2903, June 2015.
[9] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, 2009.
[10] W. Yu, G. Chen, Z. Wang, and W. Yang, "Distributed consensus filtering in sensor networks," IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 6, pp. 1568–1577, 2009.
[11] S. Kar and J. M. F. Moura, "Distributed consensus algorithms in sensor networks with imperfect communication: Link failures and channel noise," IEEE Trans. Signal Process., vol. 57, no. 1, pp. 355–369, 2009.
[12] A. H. Sayed, "Adaptive networks," Proceedings of the IEEE, vol. 102, no. 4, pp. 460–497, 2014.
[13] J. Chen and A. H.
Sayed, "On the learning behavior of adaptive networks—Part I: Transient analysis," IEEE Trans. Inf. Theory, vol. 61, no. 6, pp. 3487–3517, June 2015.
[14] J. Chen and A. H. Sayed, "On the learning behavior of adaptive networks—Part II: Performance analysis," IEEE Trans. Inf. Theory, vol. 61, no. 6, pp. 3518–3548, June 2015.
[15] S.-Y. Tu and A. H. Sayed, "Diffusion strategies outperform consensus strategies for distributed estimation over adaptive networks," IEEE Trans. Signal Process., vol. 60, no. 12, pp. 6217–6234, 2012.
[16] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, "Robust stochastic approximation approach to stochastic programming," SIAM J. Optim., vol. 19, no. 4, pp. 1574–1609, 2009.
[17] S. S. Ram, A. Nedić, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," J. Optim. Theory Appl., vol. 147, no. 3, pp. 516–545, 2010.
[18] A. H. Sayed, Adaptive Filters, John Wiley & Sons, 2008.
[19] S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 4th edition, 2008.
[20] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[21] A. H. Sayed, "Adaptation, learning, and optimization over networks," Foundations and Trends in Machine Learning, vol. 7, no. 4-5, pp. 311–801, 2014.
[22] P. Di Lorenzo, S. Barbarossa, and A. H. Sayed, "Sparse diffusion LMS for distributed adaptive estimation," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Kyoto, Japan, Mar. 2012, pp. 3281–3284.
[23] P. Di Lorenzo and A. H. Sayed, "Sparse distributed learning based on diffusion adaptation," IEEE Trans. Signal Process., vol. 61, no. 6, pp. 1419–1433, March 2013.
[24] Y. Liu, C. Li, and Z. Zhang, "Diffusion sparse least-mean squares over networks," IEEE Trans. Signal Process., vol. 60, no. 8, pp. 4480–4485, Aug. 2012.
[25] S. Chouvardas, K. Slavakis, Y. Kopsinis, and S.
Theodoridis, "A sparsity promoting adaptive algorithm for distributed learning," IEEE Trans. Signal Process., vol. 60, no. 10, pp. 5412–5425, Oct. 2012.
[26] C. D. Meyer, Matrix Analysis and Applied Linear Algebra, SIAM, 2000.
[27] D. L. Donoho and I. M. Johnstone, "Ideal spatial adaptation by wavelet shrinkage," Biometrika, vol. 81, no. 3, pp. 425–455, 1994.
[28] F. S. Cattivelli and A. H. Sayed, "Diffusion LMS strategies for distributed estimation," IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1035–1048, 2010.
[29] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011.