A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates
Authors: Zhi Li, Wei Shi, Ming Yan
JOURNAL OF TSP LATEX CLASS FILES, VOL. X, NO. X, SEP 2018

Abstract—This paper proposes a novel proximal-gradient algorithm for a decentralized optimization problem with a composite objective containing smooth and non-smooth terms. Specifically, the smooth and nonsmooth terms are dealt with by gradient and proximal updates, respectively. The proposed algorithm is closely related to a previous algorithm, PG-EXTRA [1], but has a few advantages. First of all, agents use uncoordinated step-sizes, and the stable upper bounds on the step-sizes are independent of network topologies. The step-sizes depend on the local objective functions, and they can be as large as those of gradient descent. Secondly, for the special case without non-smooth terms, linear convergence can be achieved under the strong convexity assumption. The dependence of the convergence rate on the objective functions and on the network are separated, and the convergence rate of the new algorithm is as good as one of the two convergence rates that match the typical rates for general gradient descent and for consensus averaging. We provide numerical experiments to demonstrate the efficacy of the introduced algorithm and validate our theoretical discoveries.

Index Terms—decentralized optimization, proximal-gradient, convergence rates, network independent

I. INTRODUCTION

This paper focuses on the following decentralized optimization problem:

    minimize_{x ∈ R^p}  f̄(x) := (1/n) Σ_{i=1}^n (s_i(x) + r_i(x)),    (1)

where s_i : R^p → R and r_i : R^p → R ∪ {+∞} are two lower semi-continuous proper convex functions held privately by agent i to encode the agent's objective.

Zhi Li is with the Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.
(zhili@msu.edu) Wei Shi was with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA. Ming Yan is with the Department of Computational Mathematics, Science and Engineering and the Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA. (myan@msu.edu) This work is supported in part by the NSF grant DMS-1621798.

We assume that one function (without loss of generality, s_i) is differentiable and has a Lipschitz continuous gradient with parameter L > 0, and that the other function r_i is proximable, i.e., its proximal mapping

    prox_{λr_i}(y) = argmin_{x ∈ R^p} λr_i(x) + (1/2)||x − y||²

has a closed-form solution or can be computed easily. Examples of s_i include linear functions, quadratic functions, and logistic functions, while r_i could be the ℓ1 norm, the 1D total variation, or the indicator function of a simple convex set. In addition, we assume that the agents are connected through a fixed bi-directional communication network. Every agent in the network wants to obtain an optimal solution of (1) while it can only receive/send nonsensitive messages^1 from/to its immediate neighbors.

Specific problems of the form (1) that require a decentralized computing architecture have appeared in various areas including networked multi-vehicle coordination, distributed information processing and decision making in sensor networks, as well as distributed estimation and learning. Some examples include distributed average consensus [2]–[4], distributed spectrum sensing [5], information control [6], [7], power systems control [8], [9], and statistical inference and learning [10]–[12]. In general, decentralized optimization fits scenarios where the data is collected and/or stored in a distributed network, a fusion center is either inapplicable or unaffordable, and/or computing is required to be performed in a distributed but collaborative manner by multiple agents or by network designers.
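As a concrete illustration of the proximable terms just listed (our own sketch, not from the paper), the proximal mappings of the ℓ1 norm and of the indicator of a box both have closed forms:

```python
import numpy as np

def prox_l1(y, lam):
    """prox_{lam*||.||_1}(y): entrywise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_box(y, lo, hi):
    """prox of the indicator of the box [lo, hi]: Euclidean projection."""
    return np.clip(y, lo, hi)
```

For instance, `prox_l1(np.array([3.0, 0.5, 1.0]), 1.0)` shrinks each entry toward zero by 1 and zeroes the small ones, which is exactly the minimizer of λ||x||₁ + (1/2)||x − y||².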
A. Literature Review

The study of distributed algorithms dates back to the early 1980s [13], [14]. Since then, due to the emergence of large-scale networks, decentralized (optimization) algorithms, a special type of distributed algorithms for solving problem (1), have received much attention.

^1 We believe that agent i's instantaneous estimate of the optimal solution is not a piece of sensitive information, but s_i and r_i are.

Many efforts have been made on star networks with one master agent and multiple slave agents [15], [16]. This scheme is "centralized" due to the use of a "master" agent. It may suffer from a single point of failure and may violate the privacy requirement in certain applications. In this paper, we focus on solving (1) in a decentralized fashion, where no "master" agent is used.

Incremental algorithms [17]–[22] can solve (1) without the need for a "master" agent, but they are based on a directed ring network. To handle general (possibly time-varying) networks, the distributed sub-gradient algorithm was proposed in [23]. This algorithm and its variants [24], [25] are intuitive and simple but usually slow due to the diminishing step-size that is needed to obtain a consensual and optimal solution, even if the objective functions are differentiable and strongly convex. With a fixed step-size, these distributed methods can be fast, but they only converge to a neighborhood of the solution set whose size depends on the step-size. This phenomenon creates an exactness-speed dilemma [26]. A class of distributed approaches that bypass this dilemma is based on introducing the Lagrangian dual. The resulting algorithms include distributed dual decomposition [27] and the decentralized alternating direction method of multipliers (ADMM) [28].
The decentralized ADMM and its proximal-gradient variant can employ a fixed step-size to achieve an O(1/k) rate under general convexity assumptions [29]–[31]. Under the strong convexity assumption, the decentralized ADMM has linear convergence for time-invariant undirected graphs [32].

There exist some other distributed methods that do not (explicitly) use dual variables but can still converge to an optimal consensual solution with fixed step-sizes. In particular, the works in [33], [34] employ multi-consensus inner loops, Nesterov's acceleration, and/or the adapt-then-combine (ATC) strategy. Under the assumption that the objectives have bounded and Lipschitz continuous gradients^2, the algorithm proposed in [34] has an O(ln(k)/k²) rate. References [1], [36] use a difference structure to cancel the steady-state error in decentralized gradient descent [23], [26], thereby developing the algorithm EXTRA and its proximal-gradient variant PG-EXTRA. EXTRA converges at an O(1/k) rate when the objective function in (1) is convex, and it has a linear convergence rate when the objective function is strongly convex and r_i(x) = 0 for all i.

^2 This means that the nonsmooth terms r_i's are absent. Such an assumption is much stronger than the one used for achieving the O(1/k²) rate in Nesterov's optimal gradient method [35].

A number of recent works employed the so-called gradient tracking [37] to conquer different issues [38]–[42]. To be specific, the works [38], [42] relax the step-size rule to allow uncoordinated step-sizes across agents. Paper [39] solves non-convex optimization problems. Paper [41] aims at achieving geometric convergence over time-varying graphs. Work [40] improves the convergence rate over EXTRA, and its formulation is the same as that in [42].
Another topic of interest is decentralized optimization over directed graphs [41], [43]–[46], which is beyond the scope of this paper.

B. Proposed Algorithm and its Advantages

To proceed, let us introduce some basic notation first. Agent i holds a local variable x_i ∈ R^p, and we denote by x_i^k its value at the k-th iteration. Then, we introduce a new function that is the average of all the local functions with local variables:

    f(x) := (1/n) Σ_{i=1}^n (s_i(x_i) + r_i(x_i)),    (2)

where

    x := [x_1^⊤; x_2^⊤; ...; x_n^⊤] ∈ R^{n×p},    (3)

i.e., the i-th row of x is x_i^⊤. If all local variables are identical, i.e., x_1 = ... = x_n, we say that x is consensual. In addition, we define

    s(x) := (1/n) Σ_{i=1}^n s_i(x_i),   r(x) := (1/n) Σ_{i=1}^n r_i(x_i).    (4)

We have f(x) = s(x) + r(x). The gradient of s at x is arranged in the same way as x in (3):

    ∇s(x) := [(∇s_1(x_1))^⊤; (∇s_2(x_2))^⊤; ...; (∇s_n(x_n))^⊤] ∈ R^{n×p}.    (5)

By making a simple modification to PG-EXTRA [1], our proposed algorithm brings a big improvement in the speed and in the dependency of the convergence on the network. To better expose this simple modification, let us compare a special case of our proposed algorithm with EXTRA for the smooth case, i.e., r(x) = 0:

    (EXTRA^3)          x^{k+2} = ((I + W)/2)(2x^{k+1} − x^k) − α∇s(x^{k+1}) + α∇s(x^k),    (6a)
    (Proposed NIDS)    x^{k+2} = ((I + W)/2)(2x^{k+1} − x^k − α∇s(x^{k+1}) + α∇s(x^k)).    (6b)

Here, W ∈ R^{n×n} is a matrix that represents the information exchange between neighboring agents (more details about this matrix are in Assumption 1) and α is the step-size. The only difference between EXTRA and the proposed algorithm is the information exchanged between the agents.

^3 In the original EXTRA, two mixing matrices W and W̃ are used. For simplicity, we let W̃ = (I + W)/2 here.
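In a small sketch (ours; a NumPy mixing matrix `W`, a gradient oracle `grad_s`, and a step-size `alpha` are assumed inputs), the one-line difference between (6a) and (6b) is simply where the mixing matrix (I + W)/2 is applied:

```python
import numpy as np

def extra_step(W, x1, x0, grad_s, alpha):
    """EXTRA (6a): mixing is applied to the estimates only."""
    Wt = (np.eye(W.shape[0]) + W) / 2
    return Wt @ (2 * x1 - x0) - alpha * grad_s(x1) + alpha * grad_s(x0)

def nids_step(W, x1, x0, grad_s, alpha):
    """NIDS (6b): mixing is applied to the gradient-adapted estimates."""
    Wt = (np.eye(W.shape[0]) + W) / 2
    return Wt @ (2 * x1 - x0 - alpha * grad_s(x1) + alpha * grad_s(x0))
```

On a toy consensus problem with s_i(x) = (1/2)(x − b_i)², iterating `nids_step` from x^1 = x^0 − α∇s(x^0) drives every agent to the average of the b_i's; when the gradient field is constant, the two updates coincide, which makes the structural difference easy to see.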
EXTRA exchanges only the estimates 2x^{k+1} − x^k, while the proposed algorithm exchanges the gradient-adapted estimates, i.e., 2x^{k+1} − x^k − α∇s(x^{k+1}) + α∇s(x^k). Because of this small modification, the proposed algorithm has a Network InDependent Step-size, which will be explained later. Therefore, we name the proposed algorithm NIDS and will use this abbreviation throughout the paper. For the nonsmooth case, a more detailed comparison between PG-EXTRA and NIDS will be given in Section III.

A large and network-independent step-size for NIDS: All works mentioned above either employ pessimistic step-sizes or have network-dependent upper bounds on the step-sizes. Furthermore, the step-sizes for the strongly convex case are more conservative. For example, the step-size used to achieve linear convergence rates for EXTRA in [36], [48] is of the order O(μ/L²), where μ and L are the strong convexity constant of s(x) and the Lipschitz constant of ∇s(x), respectively. In contrast, centralized gradient descent can choose a step-size of the order O(1/L). The upper bound of the step-size for EXTRA was recently improved to (5 + 3λ_n(W))/(4L) in [47]. We can choose 1/(2L) for any W satisfying Assumption 1. Another example of employing a constant step-size in distributed optimization is DIGing [41]. Although its ATC variant in [42] was shown to converge faster than DIGing, the step-size is still very conservative compared to O(1/L). We will show that the step-size of NIDS can have the same upper bound 2/L as that of centralized gradient descent. The achievable step-sizes of NIDS for the o(1/k) rate in the general convex case and for the linear convergence rate in the strongly convex case are both of the order O(1/L). Furthermore, NIDS allows each agent to have an individual step-size.
Each agent i can choose a step-size α_i that is as large as 2/L_i on any connected network, where L_i is the Lipschitz constant of ∇s_i(x) and L_i ≤ L for all i (L = max_i L_i). Apart from the step-sizes, to run NIDS, a common/public parameter c is needed for the construction of W̃ (see (8) for the algorithm and W̃). This parameter c can be chosen without any knowledge of the network (or the mixing matrix W); for example, c = 1/(2 max_i α_i). Table I provides an overview of the algorithmic configurations for EXTRA and NIDS.

NIDS works as long as each agent can estimate its local functional parameter: No agent needs any other global information, including the number of agents in the whole network, except the largest step-size, if it is not the same for all agents.

In the line of research on optimization over heterogeneous networks, after the initial work [38] regarding uncoordinated step-sizes, references [49] and [50] introduce and analyze a diffusion strategy with corrections that achieves exact linear convergence with uncoordinated step-sizes. This exact diffusion algorithm is still related to the Lagrangian method but can be considered as having incorporated a CTA (combine-then-adapt) structure. The CTA strategy can, similar to the ATC strategy, improve the convergence speed of certain consensus optimization algorithms (see [41, Remark 3] and [46, Section II.C]). However, the analysis in [50], though allowing step-size mismatch across the network, does not take into consideration the heterogeneity of the agents' functional conditions. Furthermore, their upper bound for the step-size is of the order O(μ/L²).

Sublinear convergence rate for the general case: Under the general convexity assumption, we show that NIDS has a convergence rate of o(1/k), which is slightly better than the O(1/k) rate of PG-EXTRA.
Because the step-size of NIDS does not depend on the network topology and is much larger than that of PG-EXTRA, NIDS can be much faster than PG-EXTRA, as shown in the numerical experiments.

Linear convergence rate for the strongly convex case: Let us first define "scalability". When the iterate x^k of an algorithm converges to the optimal solution x* linearly, i.e., ||x^k − x*||² = O((1 − 1/S)^k) for some positive constant S, the algorithm needs O(S log(1/ε)) iterations to reach ε-accuracy; we call O(S) the scalability of the algorithm. For the case where the non-smooth terms are absent and the functions {s_i}_{i=1}^n are strongly convex, we show that NIDS achieves a linear convergence rate whose dependencies on the functions {s_i}_{i=1}^n and on the network topology are decoupled. To be specific, to reach ε-accuracy, the number of iterations needed for NIDS is

    O( max{ L/μ, (1 − λ_n(W))/(1 − λ_2(W)) } log(1/ε) ),

where λ_i(W) is the i-th largest eigenvalue of W. Both L/μ and (1 − λ_n(W))/(1 − λ_2(W)) are typical in the literature on optimization and average consensus, respectively. The value L/μ, also called the condition number of the objective function, is aligned with the scalability of standard gradient descent [35].

TABLE I
SUMMARY OF ALGORITHMIC PARAMETERS USED IN EXTRA AND NIDS. THE STEP-SIZE BOUND ON α FOR EXTRA COMES FROM REFERENCE [47], WHICH IMPROVES THAT GIVEN IN REFERENCE [36].

    Method                    | α or α_i                          | c
    EXTRA (λ_n(W) given)      | α < (5 + 3λ_n(W))/(4 max_i L_i)   | –
    EXTRA (λ_n(W) not given)  | α < 1/(2 max_i L_i)               | –
    NIDS (λ_n(W) given)       | α_i < 2/L_i                       | c ≤ 1/((1 − λ_n(W)) max_i α_i)
    NIDS (λ_n(W) not given)   | α_i < 2/L_i                       | c ≤ 1/(2 max_i α_i)
The value (1 − λ_n(W))/(1 − λ_2(W)) is understood as the condition number^4 of the network and is aligned with the scalability of the simplest linear iterations for distributed averaging [51]. Separating the condition numbers of the objective function and of the network provides a way to determine the bottleneck of NIDS for a specific problem on a given network. Therefore, the system designer might be able to smartly apply preconditioning on {s_i}_{i=1}^n or improve the connectivity of the network to cost-effectively obtain better convergence.

^4 When we choose W = I − τŁ, where Ł is the Laplacian of the underlying graph and τ is a positive tunable constant, we have (1 − λ_n(W))/(1 − λ_2(W)) = λ_1(Ł)/λ_{n−1}(Ł), which is the finite condition number of Ł. Note that λ_n(Ł) = 0.

Summary and comparison of state-of-the-art algorithms: We list the properties of a few relevant algorithms in Table II. We let σ := (1 − λ_n(W))/(1 − λ_2(W)). This quantity is directly affected by the network topology and by how the matrix W is defined, and thus it is also related to the consensus ability of a network. When the network is fully connected (a complete graph), we can choose W so that λ_2(W) = λ_n(W) = 0 and thus σ = 1 (the best case); in general σ ≥ 1 since 0 < 1 − λ_2(W) ≤ 1 − λ_n(W) < 2; in the worst case, we have σ ≤ 2/(1 − λ_2(W)) = O(n²) [4, Section 2.3]. We keep σ in the bounds/rates of the involved algorithms for a fair comparison, instead of focusing on the worst case, which often gives pessimistic/conservative results. We omit "O(·)" in "Bounds of step-sizes" and "Scalabilities" for brevity and only compare the effect of the functional properties (μ and L) and the network properties (σ and/or n). Before going into the details, let us clarify a few points.
In EXTRA, μ_g is a quantity associated with the strong convexity of the original function f̄(x), so it covers a larger class of problems. In DIGing, μ̄ is the mean value of the strong convexity constants of the local objectives. In Acc-DNGD, the step-size for the convex case contains k, the current number of iterations; thus it represents a diminishing step-size sequence. In Optimal [53], [54], the total number of iterations K is used to determine the step-size for the convex case. In addition, they apply to problems in which the objectives are dual friendly (see [54] for its definition). Note that some types of objectives are suitable for a gradient update, some are suitable for a dual gradient update (dual friendly), and some are suitable for a proximal update. Finding the algorithm with the lowest per-iteration cost depends on the problem (functions).

Apparently, our bounds on the step-sizes and the corresponding scalability/rate are better than those given in EXTRA and Harness (see Table II). When σ is close to 1 (the graph is well connected), the step-size bound and the scalability given in DIGing are the same as those of NIDS. However, when σ is large, their result becomes rather conservative. Acc-DNGD and Optimal have improved the scalability/rate of gradient-based distributed optimization by employing Nesterov's acceleration technique on the primal and dual problems, respectively. For the convex case, our rate is worse than theirs because our algorithm does not employ Nesterov's acceleration. For the primal distributed gradient method after acceleration [52], the scalability in σ is still worse than our result. Algorithm Optimal achieves the optimal scalability/rate for distributed optimization. However, as we have mentioned above, their algorithms are dual based and thus apply to a different class of problems.
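To make σ tangible, the following sketch (ours, assuming NumPy and a 4-node path graph) builds W = I − τŁ from the graph Laplacian Ł, as in the footnote above, and computes σ, which then equals the finite condition number λ_1(Ł)/λ_{n−1}(Ł) for any admissible τ:

```python
import numpy as np

def mixing_from_laplacian(Lap, tau):
    """W = I - tau*Lap; any tau < 2/lambda_max(Lap) keeps lambda_n(W) > -1."""
    return np.eye(Lap.shape[0]) - tau * Lap

def network_condition_number(W):
    """sigma = (1 - lambda_n(W)) / (1 - lambda_2(W)), eigenvalues in decreasing order."""
    ev = np.sort(np.linalg.eigvalsh(W))[::-1]
    return (1.0 - ev[-1]) / (1.0 - ev[1])

# Laplacian of the path graph on 4 nodes (eigenvalues 0, 2-sqrt(2), 2, 2+sqrt(2))
Lap = np.array([[ 1., -1.,  0.,  0.],
                [-1.,  2., -1.,  0.],
                [ 0., -1.,  2., -1.],
                [ 0.,  0., -1.,  1.]])
```

For this path graph, σ = (2 + √2)/(2 − √2) = 3 + 2√2 ≈ 5.83, independently of the choice of τ.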
In addition, NIDS supports proximable non-smooth functions and uncoordinated step-sizes, while these have not been considered in Acc-DNGD and Optimal. To sum up, we have reached the best possible performance of first-order algorithms for distributed optimization without acceleration. Further improving the performance by incorporating Nesterov's techniques into our algorithm is a future direction.

TABLE II
SUMMARY OF A FEW RELEVANT ALGORITHMS. HERE, μ (OR μ_g OR μ̄) IS THE STRONG CONVEXITY CONSTANT OF THE OBJECTIVE FUNCTION (OR THAT OF A MODIFIED OBJECTIVE FUNCTION); L (OR L̄) IS THE LIPSCHITZ CONSTANT OF THE OBJECTIVE GRADIENT (OR ITS MODIFIED VERSION). ALSO, σ = (1 − λ_n(W))/(1 − λ_2(W)) IS CONSIDERED AS THE CONDITION NUMBER OF THE NETWORK, WHICH SCALES AT THE ORDER OF O(n²) IN THE WORST-CASE SCENARIO. WE OMIT "O(·)" IN "ORDERS OF STEP-SIZE BOUNDS" AND "SCALABILITY" FOR BREVITY. NOTE THAT QUANTITIES INVOLVING K ONLY HOLD FOR A FINITE K.

    Algorithm          | Supports prox. | Orders of step-size bounds                                       | Uncoordinated | Scalability                                              | Rate
                       | operators      | (strongly convex; convex)                                        | step-size     | (strongly convex)                                        | (convex)
    EXTRA [1], [36]    | Yes            | μ_g/L²; 1/L                                                      | No            | (L/μ_g)²                                                 | o(1/k)
    Aug-DGM [38]       | No             | small enough                                                     | Yes           | –                                                        | converges
    Harness [40]       | No             | μ/(L²σ²); 1/(Lσ²)                                               | No            | (L/μ)σ²                                                  | O(1/k)
    Acc-DNGD [52]      | No             | ((σ−1)³/σ⁶)(μ/L)^{3/7}/L; min{(1 − σ^{−1})², σ^{−3}}/(Lk^{0.6}) | No            | (σ³/(σ−1)^{1.5})(L/μ)^{5/7}                              | O(1/k^{1.4})
    DIGing [41], [42]  | No             | min{μ̄^{0.5}/(σ(σ−1)L^{1.5}n^{0.5}), 1/L̄}; –                    | Yes           | max{L̄/μ̄, ((σ−1)²n^{0.5}L^{1.5} + σ²μ̄^{1.5})/μ̄^{1.5}}  | –
    Optimal [53], [54] | No             | μ; K²/(Lσ)                                                       | No            | (σL/μ)^{0.5}
| O(1/K²)
    NIDS               | Yes            | 1/L; 1/L                                                         | Yes           | max{L/μ, σ}                                              | o(1/k)

Finally, we note that references [49], [50], appearing simultaneously with this work, also proposed (6b) to enlarge the step-size, and they use column stochastic matrices rather than symmetric doubly stochastic matrices. However, their algorithm only works for smooth problems, and their analysis seems to be restrictive and requires twice differentiability and strong convexity of {s_i}_{i=1}^n. The step-size is also of the order μ/L² [48].

C. Future Works

The capability of our algorithm to use purely locally determined parameters increases its potential to be extended to dynamic networks with a time-varying number of nodes. Given such flexibility, we may use similar schemes to solve decentralized empirical risk minimization problems. Furthermore, it also enhances the privacy of the agents by allowing each agent to perform its own optimization procedure without negotiating any parameter.

By using Nesterov's acceleration technique, reference [4] shows that the scalability of a new average consensus protocol can be improved to O(n); when the nonsmooth terms r_i's are absent, reference [53] shows that the scalability of a new dual-based accelerated distributed gradient method can be improved to O(√(σL/μ)). One of our future works is exploring the convergence rates/scalability of the Nesterov-accelerated version of our algorithm.

D. Paper Organization

The rest of this paper is organized as follows. To facilitate the description of the technical ideas, the algorithms, and the analysis, we introduce additional notation in Subsection I-E. The intuition for the network-independent step-size is provided in Section II. In Section III, we introduce our algorithm NIDS and discuss its relation to some other existing algorithms. In Section IV, we first show that NIDS can be understood as an iterative algorithm for seeking a fixed point.
Following this, we establish that NIDS converges at an o(1/k) rate for the general convex case and at a linear rate for the strongly convex case. Then, numerical simulations are given in Section V to corroborate our theoretical claims. Final remarks are given in Section VI.

E. Notation

We use bold upper-case letters such as W to denote matrices in R^{n×n} and bold lower-case letters such as x and z to denote matrices in R^{n×p} (when p = 1, they are vectors). Let 1 and 0 be the matrices with all ones and all zeros, respectively; their dimensions are provided when necessary. For matrices x, y ∈ R^{n×p}, we define their inner product as ⟨x, y⟩ = tr(x^⊤y) and the norm as ||x|| = √⟨x, x⟩. Additionally, by an abuse of notation, we define ⟨x, y⟩_Q = tr(x^⊤Qy) and ||x||²_Q = ⟨x, x⟩_Q for any given symmetric matrix Q ∈ R^{n×n}. Note that ⟨·,·⟩_Q is an inner product on R^{n×p} if and only if Q is positive definite. However, when Q is not positive definite, ⟨x, y⟩_Q can still be an inner product on a subspace of R^{n×p}; see Lemma 3 for more details. We define the range of A ∈ R^{n×n} by range(A) := {x ∈ R^{n×p} : x = Ay, y ∈ R^{n×p}}. The largest eigenvalue of a symmetric matrix A is denoted by λ_max(A). For two symmetric matrices A, B ∈ R^{n×n}, A ≻ B (or A ≽ B) means that A − B is positive definite (or positive semidefinite). Moreover, we use N_i to represent the set of agents that can directly send messages to agent i.

II. INTUITION FOR THE NETWORK-INDEPENDENT STEP-SIZE

In this section, we provide an intuition for the network-independent step-size of NIDS with only the differentiable function s. The decentralized optimization problem is equivalent to

    minimize_x  s(x),   s.t.
(I − W)^{1/2} x = 0,

where (I − W)^{1/2} is the square root of I − W, and the constraint is equivalent to the consensual condition with the mixing matrix W given in Assumption 1. Denote Ł = (I − W)^{1/2}. The corresponding optimality condition, with the introduction of the dual variable p, is

    [0, Ł; −Ł, 0] [x*; p*] = [−∇s(x*); 0].

EXTRA is equivalent to the Condat-Vu primal-dual algorithm [47], [55], and it can be further explained as a forward-backward splitting applied to this equation, i.e.,

    ( [(1/α)I, −Ł; −Ł, 2αI] + [0, Ł; −Ł, 0] ) [x^{k+1}; p^{k+1}] = [(1/α)I, −Ł; −Ł, 2αI] [x^k; p^k] − [∇s(x^k); 0].

The update is

    x^{k+1} = x^k − αŁp^k − α∇s(x^k),
    αp^{k+1} = αp^k − (1/2)Łx^k + Łx^{k+1},

which is equivalent to EXTRA after p is eliminated. In this case, the metric is a full matrix, and therefore the upper bound of the step-size α depends on the matrix Ł. To be more specific, we need

    [(1/α)I, −Ł; −Ł, 2αI] ≽ [(L/2)I, 0; 0, 0],

which gives α ≤ 2(1 + λ_min(W))/L. A larger and optimal upper bound for the step-size of EXTRA is shown in [47] (see Table I), and it still depends on W. However, we choose a block diagonal metric and have

    ( [(1/α)I, 0; 0, α(I + W)] + [0, Ł; −Ł, 0] ) [x^{k+1}; p^{k+1}] = [(1/α)I, 0; 0, α(I + W)] [x^k; p^k] − [∇s(x^k); 0].

The update becomes

    2αp^{k+1} = α(I + W)p^k + Łx^k − αŁ∇s(x^k),
    x^{k+1} = x^k − α∇s(x^k) − αŁp^{k+1},

which is equivalent to NIDS after p is eliminated. Because the new metric is block diagonal, the nonexpansiveness of the forward step depends only on the function, i.e., α ≤ 2/L.

III. PROPOSED ALGORITHM NIDS

In this section, we describe our proposed NIDS in Algorithm 1 for solving (1) in more detail and explain the connections to other related methods.

Algorithm 1 NIDS
  Each agent i obtains its mixing values w_ij, ∀j ∈ N_i;
  Each agent i chooses its own step-size α_i > 0 and the same parameter c (e.g., c = 0.
5/max_i α_i);
  Each agent i sets the mixing values w̃_ij := cα_i w_ij, ∀j ∈ N_i, and w̃_ii := 1 − cα_i + cα_i w_ii;
  Each agent i picks an arbitrary initial x_i^0 ∈ R^p and performs
      z_i^1 = x_i^0 − α_i ∇s_i(x_i^0),
      x_i^1 = argmin_{x ∈ R^p} α_i r_i(x) + (1/2)||x − z_i^1||².
  for k = 1, 2, 3, ... do
      Each agent i performs
          z_i^{k+1} = z_i^k − x_i^k + Σ_{j ∈ N_i ∪ {i}} w̃_ij (2x_j^k − x_j^{k−1} − α_j ∇s_j(x_j^k) + α_j ∇s_j(x_j^{k−1})),
          x_i^{k+1} = argmin_{x ∈ R^p} α_i r_i(x) + (1/2)||x − z_i^{k+1}||².
  end for

The mixing matrix satisfies the following assumption, which comes from [1], [36].

Assumption 1 (Mixing matrix): The connected network G = {V, E} consists of a set of agents V = {1, 2, ..., n} and a set of undirected edges E. An undirected edge (i, j) ∈ E means that there is a connection between agents i and j and that both agents can exchange data. The mixing matrix W = [w_ij] ∈ R^{n×n} satisfies:
  1) (Decentralized property) If i ≠ j and (i, j) ∉ E, then w_ij = 0;
  2) (Symmetry) W = W^⊤;
  3) (Null space property) Null(I − W) = span(1_{n×1});
  4) (Spectral property) 2I ≽ I + W ≻ 0_{n×n}.

Remark 1: Assumption 1 implies that the eigenvalues of W lie in (−1, 1] and that the eigenvalue 1 has multiplicity one, i.e., 1 = λ_1(W) > λ_2(W) ≥ ... ≥ λ_n(W) > −1. Item 3 of Assumption 1 shows that (I − W)1_{n×1} = 0 and that the orthogonal complement of span(1_{n×1}) is the row space of I − W, which is also the column space of I − W because of the symmetry of W.

The functions {s_i}_{i=1}^n and {r_i}_{i=1}^n satisfy the following assumption.

Assumption 2: The functions {s_i(x)}_{i=1}^n and {r_i(x)}_{i=1}^n are lower semi-continuous proper convex, and {s_i(x)}_{i=1}^n have Lipschitz continuous gradients with constants {L_i}_{i=1}^n, respectively.
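A compact centralized simulation of Algorithm 1 can be written as follows (our own sketch; the function and variable names are ours, and the mixing matrix `W`, the gradient oracles `grads`, and the prox oracles `prox_rs` are assumed inputs; the per-agent sums over N_i ∪ {i} are realized at once via the matrix W̃ = I − cΛ(I − W)):

```python
import numpy as np

def nids(W, grads, prox_rs, alphas, x0, iters=500):
    """Sketch of NIDS (Algorithm 1).
    W: (n, n) mixing matrix; grads[i]: gradient of s_i; prox_rs[i](y, a): prox of a*r_i;
    alphas: per-agent step-sizes alpha_i; x0: (n, p) initial iterate."""
    n = W.shape[0]
    a = np.asarray(alphas, dtype=float).reshape(n, 1)
    c = 0.5 / a.max()                                 # c = 1/(2 max_i alpha_i)
    Wt = np.eye(n) - c * a * (np.eye(n) - W)          # row i of Wt holds the w~_ij
    grad = lambda x: np.stack([grads[i](x[i]) for i in range(n)])
    prox = lambda z: np.stack([prox_rs[i](z[i], a[i, 0]) for i in range(n)])
    x_prev = np.asarray(x0, dtype=float)
    z = x_prev - a * grad(x_prev)                     # z^1 = x^0 - Lambda grad s(x^0)
    x = prox(z)                                       # x^1
    for _ in range(iters):
        g, g_prev = grad(x), grad(x_prev)
        z = z - x + Wt @ (2 * x - x_prev - a * g + a * g_prev)
        x_prev, x = x, prox(z)
    return x
```

Running it on three quadratics s_i(x) = (1/2)(x − b_i)² with r_i(x) = λ|x| drives every agent to the common minimizer, the soft-thresholding of mean(b) by λ, even with different step-sizes α_i < 2/L_i.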
Thus, we have

    ⟨x − y, ∇s(x) − ∇s(y)⟩ ≥ ||∇s(x) − ∇s(y)||²_{L^{−1}},    (7)

where L = Diag(L_1, ..., L_n) is the diagonal matrix of the Lipschitz constants [35].

Instead of using the same step-size for all the agents, we allow agent i to choose its own step-size α_i and let Λ = Diag(α_1, ..., α_n) ∈ R^{n×n}. Then NIDS can be expressed as

    z^{k+1} = z^k − x^k + W̃(2x^k − x^{k−1} − Λ∇s(x^k) + Λ∇s(x^{k−1})),    (8a)
    x^{k+1} = argmin_{x ∈ R^{n×p}} r(x) + (1/2)||x − z^{k+1}||²_{Λ^{−1}},    (8b)

where W̃ = I − cΛ(I − W) and c is chosen such that

    Λ^{−1/2} W̃ Λ^{1/2} = I − cΛ^{1/2}(I − W)Λ^{1/2} ≽ 0.

This condition shows that the upper bound of the parameter c depends on W and Λ. When the information about W is not given, we can simply let c = 1/(2 max_i α_i) because λ_n(W) > −1. To set such a parameter, a preprocessing step is needed to obtain the maximum. However, since the maximum can be computed in a connected network in no more than n − 1 rounds of communication, wherein each agent repeatedly takes the maximum of the values of its neighbors, the cost of this preprocessing is essentially negligible compared to the worst-case running time of our optimization protocol.

If all agents choose the same step-size, i.e., Λ = αI, and we let c = 1/(2α), then (8) becomes

    z^{k+1} = z^k − x^k + ((I + W)/2)(2x^k − x^{k−1} − α∇s(x^k) + α∇s(x^{k−1})),    (9a)
    x^{k+1} = argmin_{x ∈ R^{n×p}} r(x) + (1/(2α))||x − z^{k+1}||².    (9b)

Remark 2: The update of PG-EXTRA is

    z^{k+1} = z^k − x^k + ((I + W)/2)(2x^k − x^{k−1}) − α∇s(x^k) + α∇s(x^{k−1}),    (10a)
    x^{k+1} = argmin_{x ∈ R^{n×p}} r(x) + (1/(2α))||x − z^{k+1}||².    (10b)

The only difference between NIDS and PG-EXTRA is that, in NIDS, the mixing operation is also applied to the successive difference of the gradients −α∇s(x^k) + α∇s(x^{k−1}).
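The preprocessing step mentioned above, in which every agent learns max_i α_i, can be sketched as a simple synchronous max-consensus (our own illustration; `neighbors[i]` lists the neighbors of agent i):

```python
def max_consensus(values, neighbors, rounds=None):
    """Each agent replaces its value with the max over itself and its neighbors;
    after at most n-1 rounds on a connected graph, every agent holds the global max."""
    n = len(values)
    v = list(values)
    for _ in range(rounds if rounds is not None else n - 1):
        # synchronous update: all agents read the previous round's values
        v = [max([v[i]] + [v[j] for j in neighbors[i]]) for i in range(n)]
    return v
```

On a 4-agent path 0-1-2-3, the maximum propagates one hop per round, so all agents agree after at most 3 rounds, consistent with the n − 1 bound.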
When there is no function r(x), (8) becomes

    x^{k+1} = W̃(2x^k − x^{k−1} − Λ∇s(x^k) + Λ∇s(x^{k−1})),

and it further reduces to (6b) when Λ = αI and c = 1/(2α). Note that, though (6b) appears in [49], [50], its convergence there still requires a small step-size that depends on the network topology and the strong convexity constant. In Theorem 1 of [50], the upper bound for the step-size is also O(μ/L²), which is the same as that of PG-EXTRA.

IV. CONVERGENCE ANALYSIS OF NIDS

In order to show the convergence of NIDS, we also need the following assumption.

Assumption 3 (Solution existence): Problem (1) has at least one solution.

To simplify the analysis, we introduce a new sequence {d^k}_{k≥0} defined as

    d^k := Λ^{−1}(x^{k−1} − z^k) − ∇s(x^{k−1}).    (11)

Using the sequence {x^k}_{k≥0}, we obtain a recursive (update) relation for {d^k}_{k≥0}:

    d^{k+1} = Λ^{−1}(x^k − z^{k+1}) − ∇s(x^k)
            = Λ^{−1}(x^k − z^k + x^k) − ∇s(x^k) − Λ^{−1}W̃(2x^k − x^{k−1} − Λ∇s(x^k) + Λ∇s(x^{k−1}))
            = Λ^{−1}(x^k − z^k + x^k − 2x^k + x^{k−1}) − ∇s(x^k) + ∇s(x^k) − ∇s(x^{k−1})
              + c(I − W)(2x^k − x^{k−1} − Λ∇s(x^k) + Λ∇s(x^{k−1}))
            = d^k + c(I − W)(2x^k − z^k − Λ∇s(x^k) − Λd^k),

where the second equality comes from the update of z^{k+1} in (8a) and the last one holds because of the definition of d^k in (11). Therefore, the iteration (8) is equivalent to, with the update order (x, d, z),

    x^k = argmin_{x ∈ R^{n×p}} r(x) + (1/2)||x − z^k||²_{Λ^{−1}},    (12a)
    d^{k+1} = d^k + c(I − W)(2x^k − z^k − Λ∇s(x^k) − Λd^k),    (12b)
    z^{k+1} = x^k − Λ∇s(x^k) − Λd^{k+1},    (12c)

in the sense that both (8) and (12) generate the same {x^k, z^k}_{k>0} sequence. Because x^k is determined by z^k only and can be eliminated from the iteration, iteration (12) is essentially an operator acting on (d, z).
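The claimed equivalence of (8) and (12) is easy to check numerically. The sketch below (our own construction, with r = 0 so that the prox step reduces to x = z, quadratic s_i(x) = (1/2)(x − b_i)², and a 3-agent complete graph) runs both iterations from d^1 = 0 and compares the z sequences:

```python
import numpy as np

W = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
n = 3
b = np.array([1.0, 2.0, 6.0])
grad = lambda x: x - b                      # gradient of s_i(x) = 0.5*(x - b_i)^2
alpha = np.array([1.0, 0.8, 0.9])           # uncoordinated step-sizes
c = 0.5 / alpha.max()
Wt = np.eye(n) - c * np.diag(alpha) @ (np.eye(n) - W)   # W~ = I - c*Lambda*(I - W)

def run_8(x0, iters):
    """Iteration (8) with r = 0, so the prox step is x = z."""
    x_prev, z = x0, x0 - alpha * grad(x0)   # z^1 = x^0 - Lambda grad s(x^0)
    x = z.copy()
    zs = [z.copy()]
    for _ in range(iters):
        z = z - x + Wt @ (2 * x - x_prev - alpha * grad(x) + alpha * grad(x_prev))
        x_prev, x = x, z.copy()
        zs.append(z.copy())
    return zs

def run_12(x0, iters):
    """Equivalent iteration (12), update order (x, d, z)."""
    z = x0 - alpha * grad(x0)
    d = np.zeros(n)                          # d^1 = Lambda^{-1}(x^0 - z^1) - grad s(x^0) = 0
    zs = [z.copy()]
    for _ in range(iters):
        x = z.copy()                         # (12a) with r = 0
        d = d + c * (np.eye(n) - W) @ (2 * x - z - alpha * grad(x) - alpha * d)   # (12b)
        z = x - alpha * grad(x) - alpha * d  # (12c)
        zs.append(z.copy())
    return zs
```

Both runs produce the same z sequence up to floating-point error, which is exactly the statement above that (8) and (12) generate the same {x^k, z^k} sequence.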
Note that we have $d^1 = \Lambda^{-1}(x^0 - z^1) - \nabla s(x^0) = 0$ from Algorithm 1. Therefore, from the update of $d^{k+1}$ in (12b), $d^k \in \operatorname{range}(I-W)$ for all $k$. In fact, any $z^1$ such that $d^1 \in \operatorname{range}(I-W)$ works for NIDS. The following two lemmas show the relation between fixed points of (12) and optimal solutions of (1). The proofs of all lemmas and propositions are included in the supplemental material.

Lemma 1 (Fixed point of (12)): $(d^*, z^*)$ is a fixed point of (12) if and only if there exists a subgradient $q^* \in \partial r(x^*)$ such that $z^* = x^* + \Lambda q^*$ and
$$d^* + \nabla s(x^*) + q^* = 0, \qquad (13a)$$
$$(I-W)x^* = 0. \qquad (13b)$$

Lemma 2 (Optimality condition): $x^*$ is consensual with $x_1^* = x_2^* = \cdots = x_n^* = x^*$ being an optimal solution of problem (1) if and only if there exist $p^*$ and a subgradient $q^* \in \partial r(x^*)$ such that
$$(I-W)p^* + \nabla s(x^*) + q^* = 0, \qquad (14a)$$
$$(I-W)x^* = 0. \qquad (14b)$$
In addition, $\big(d^* = (I-W)p^*,\; z^* = x^* + \Lambda q^*\big)$ is a fixed point of iteration (12).

Lemma 2 shows that we can find a fixed point of iteration (12) to obtain an optimal solution of problem (1). It also tells us that we need $d^* \in \operatorname{range}(I-W)$ to get an optimal solution of problem (1). Therefore, we need $d^1 \in \operatorname{range}(I-W)$.

Lemma 3 (Norm over range space): For any symmetric positive semidefinite matrix $A \in \mathbb{R}^{n\times n}$ with rank $r \le n$, let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$ be its $r$ positive eigenvalues. Then $\operatorname{range}(A)$, defined in Section I-E, is an $rp$-dimensional subspace of $\mathbb{R}^{n\times p}$ and has a norm defined by $\|x\|_{A^\dagger}^2 := \langle x, A^\dagger x\rangle$, where $A^\dagger$ is the pseudo-inverse of $A$. In addition, $\lambda_1^{-1}\|x\|^2 \le \|x\|_{A^\dagger}^2 \le \lambda_r^{-1}\|x\|^2$ for all $x \in \operatorname{range}(A)$.

Proposition 1: Let $M = c^{-1}(I-W)^\dagger - \Lambda$ with $I \succ c\Lambda^{1/2}(I-W)\Lambda^{1/2} \succcurlyeq 0$. Then $\|\cdot\|_M$ is a norm on $\operatorname{range}(I-W)$.

The following lemma compares the distances to a fixed point of (12) for two consecutive iterates.
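The two-sided bound of Lemma 3 is easy to sanity-check numerically. The snippet below is our own illustration (not from the paper): it builds a random rank-deficient PSD matrix and verifies the inequality for a vector in its range.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))
A = B @ B.T                                  # symmetric PSD with rank r = 3
A_dag = np.linalg.pinv(A)                    # pseudo-inverse A^dagger
eigs = np.sort(np.linalg.eigvalsh(A))[::-1]  # lambda_1 >= lambda_2 >= lambda_3 > 0
lam1, lamr = eigs[0], eigs[2]

x = A @ rng.standard_normal(5)               # guarantees x in range(A)
nrm2 = float(x @ x)
nrm2_dag = float(x @ A_dag @ x)              # ||x||^2_{A^dagger}

# Lemma 3: lambda_1^{-1}||x||^2 <= ||x||^2_{A^dagger} <= lambda_r^{-1}||x||^2
print(nrm2 / lam1, nrm2_dag, nrm2 / lamr)    # the three values are nondecreasing
```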
Lemma 4 (Fundamental inequality): Let $(d^*, z^*)$ be a fixed point of iteration (12) with $d^* \in \operatorname{range}(I-W)$. The update $(d^k, z^k) \to (d^{k+1}, z^{k+1})$ in (12) satisfies
$$\begin{aligned}
&\|z^{k+1} - z^*\|_{\Lambda^{-1}}^2 + \|d^{k+1} - d^*\|_M^2\\
\le\; &\|z^k - z^*\|_{\Lambda^{-1}}^2 + \|d^k - d^*\|_M^2 - \|z^k - z^{k+1}\|_{\Lambda^{-1}}^2 - \|d^k - d^{k+1}\|_M^2\\
&+ 2\langle \nabla s(x^k) - \nabla s(x^*), z^k - z^{k+1}\rangle - 2\langle x^k - x^*, \nabla s(x^k) - \nabla s(x^*)\rangle. \qquad (15)
\end{aligned}$$

Proof: From the update of $z^{k+1}$ in (12c), we have
$$\begin{aligned}
\langle d^{k+1} - d^*, z^{k+1} - z^k + x^k - x^*\rangle &= \langle d^{k+1} - d^*, 2x^k - z^k - \Lambda\nabla s(x^k) - \Lambda d^{k+1} - x^*\rangle\\
&= \langle d^{k+1} - d^*, c^{-1}(I-W)^\dagger(d^{k+1} - d^k) + \Lambda d^k - \Lambda d^{k+1}\rangle\\
&= \langle d^{k+1} - d^*, d^{k+1} - d^k\rangle_M, \qquad (16)
\end{aligned}$$
where the second equality comes from (12b), (14b), and $d^{k+1} - d^* \in \operatorname{range}(I-W)$. From (12a), we have
$$\langle x^k - x^*, z^k - x^k - z^* + x^*\rangle_{\Lambda^{-1}} \ge 0. \qquad (17)$$
Therefore, we have
$$\begin{aligned}
&\langle x^k - x^*, \nabla s(x^k) - \nabla s(x^*)\rangle\\
\le\; &\langle x^k - x^*, \Lambda^{-1}(z^k - x^k - z^* + x^*) + \nabla s(x^k) - \nabla s(x^*)\rangle\\
=\; &\langle x^k - x^*, \Lambda^{-1}(z^k - z^{k+1}) - d^{k+1} + d^*\rangle\\
=\; &\langle x^k - x^*, z^k - z^{k+1}\rangle_{\Lambda^{-1}} + \langle d^{k+1} - d^*, z^{k+1} - z^k\rangle - \langle d^{k+1} - d^*, d^{k+1} - d^k\rangle_M\\
=\; &\langle \Lambda^{-1}(x^k - x^*) - d^{k+1} + d^*, z^k - z^{k+1}\rangle - \langle d^{k+1} - d^*, d^{k+1} - d^k\rangle_M\\
=\; &\langle \Lambda^{-1}(z^{k+1} - z^*) + \nabla s(x^k) - \nabla s(x^*), z^k - z^{k+1}\rangle - \langle d^{k+1} - d^*, d^{k+1} - d^k\rangle_M\\
=\; &\langle z^{k+1} - z^*, z^k - z^{k+1}\rangle_{\Lambda^{-1}} + \langle \nabla s(x^k) - \nabla s(x^*), z^k - z^{k+1}\rangle + \langle d^{k+1} - d^*, d^k - d^{k+1}\rangle_M.
\end{aligned}$$
The inequality and the second equality come from (17) and (16), respectively. The first and fourth equalities hold because of the update of $z^{k+1}$ in (12c).
Using $2\langle a, b\rangle = \|a+b\|^2 - \|a\|^2 - \|b\|^2$ and rearranging the previous inequality give us
$$\begin{aligned}
&2\langle x^k - x^*, \nabla s(x^k) - \nabla s(x^*)\rangle - 2\langle \nabla s(x^k) - \nabla s(x^*), z^k - z^{k+1}\rangle\\
\le\; &2\langle z^{k+1} - z^*, z^k - z^{k+1}\rangle_{\Lambda^{-1}} + 2\langle d^{k+1} - d^*, d^k - d^{k+1}\rangle_M\\
=\; &\|z^k - z^*\|_{\Lambda^{-1}}^2 - \|z^{k+1} - z^*\|_{\Lambda^{-1}}^2 - \|z^k - z^{k+1}\|_{\Lambda^{-1}}^2\\
&+ \|d^k - d^*\|_M^2 - \|d^{k+1} - d^*\|_M^2 - \|d^k - d^{k+1}\|_M^2.
\end{aligned}$$
Therefore, (15) is obtained.

A. Sublinear convergence of NIDS

As explained in Section II, NIDS is equivalent to the primal-dual algorithm [56] applied to the problem
$$\operatorname*{minimize}_{x} \; s(x) + r(x) + \iota\big((I-W)^{1/2}x\big), \qquad (18)$$
where $\iota(\cdot)$ is the indicator function, which returns $0$ at $0$ and $+\infty$ otherwise, with the metric matrix being
$$\begin{pmatrix} \Lambda^{-1} & 0\\ 0 & c^{-1}I - (I-W)^{1/2}\Lambda(I-W)^{1/2} \end{pmatrix}.$$
We apply [56, Theorem 1] and obtain the following sublinear convergence result.

Theorem 1 (Sublinear rate): Let $(d^k, z^k)$ be the sequence generated from NIDS in (12) with $\alpha_i < 2/L_i$ for all $i$ and $I \succcurlyeq c\Lambda^{1/2}(I-W)\Lambda^{1/2}$. We have
$$\|z^k - z^{k+1}\|_{\Lambda^{-1}}^2 + \|d^k - d^{k+1}\|_M^2 \le \frac{\|z^1 - z^*\|_{\Lambda^{-1}}^2 + \|d^1 - d^*\|_M^2}{k\,\big(1 - \max_i \tfrac{\alpha_i L_i}{2}\big)}, \qquad (19)$$
$$\|z^k - z^{k+1}\|_{\Lambda^{-1}}^2 + \|d^k - d^{k+1}\|_M^2 = o\Big(\frac{1}{k+1}\Big).$$
Furthermore, $(d^k, z^k)$ converges to a fixed point $(\bar d, \bar z)$ of iteration (12) with $\bar d \in \operatorname{range}(I-W)$, if $I \succ c\Lambda^{1/2}(I-W)\Lambda^{1/2}$.

Remark 3: Note that the convergence in Theorem 1 is shown in $z$ and $d$. We now show the convergence in terms of (14). Recall that
$$z^{k+1} - z^k = x^k - \Lambda\nabla s(x^k) - \Lambda d^{k+1} - z^k = -\Lambda\big(d^{k+1} + \nabla s(x^k) + q^k\big),$$
where $q^k \in \partial r(x^k)$. Therefore, $\|z^{k+1} - z^k\|_{\Lambda^{-1}}^2 \to 0$ implies the convergence in terms of (14a). Combining (12b) and (12c), we have
$$d^{k+1} = d^k + c(I-W)(x^k - z^k + z^{k+1}) + c(I-W)\Lambda(d^{k+1} - d^k).$$
Rearranging it gives
$$\big(I - c(I-W)\Lambda\big)(d^{k+1} - d^k) = c(I-W)(x^k - z^k + z^{k+1}).$$
Then we have
$$\begin{aligned}
\big\|c(I-W)(x^k - z^k + z^{k+1})\big\|^2 &= \big\|\big(I - c(I-W)\Lambda\big)(d^{k+1} - d^k)\big\|^2\\
&= \big\|\big(I - c(I-W)\Lambda\big)M^{-1/2}\,M^{1/2}(d^{k+1} - d^k)\big\|^2\\
&\le \big\|\big(I - c(I-W)\Lambda\big)M^{-1/2}\big\|^2\, \big\|d^{k+1} - d^k\big\|_M^2,
\end{aligned}$$
where the second equality comes from $d^{k+1} - d^k \in \operatorname{range}(I-W)$, on which $M$ is positive definite. Thus $\|z^{k+1} - z^k\|_{\Lambda^{-1}}^2 + \|d^{k+1} - d^k\|_M^2 \to 0$ implies the convergence in terms of (14b).

B. Linear convergence for special cases

In this subsection, we provide the linear convergence rate for the case $r(x) = 0$, i.e., $z^k = x^k$ in NIDS.

Theorem 2: If the functions $\{s_i(x)\}_{i=1}^n$ are strongly convex with parameters $\{\mu_i\}_{i=1}^n$, then
$$\langle x - y, \nabla s(x) - \nabla s(y)\rangle \ge \|x - y\|_S^2, \qquad (20)$$
where $S = \operatorname{Diag}(\mu_1,\cdots,\mu_n) \in \mathbb{R}^{n\times n}$. Let $(d^k, x^k)$ be the sequence generated from NIDS with $\alpha_i < 2/L_i$ for all $i$ and $I \succ c\Lambda^{1/2}(I-W)\Lambda^{1/2}$. We define
$$\rho = \max\Big\{ 1 - \big(2 - \max_i(\alpha_i L_i)\big)\min_i(\mu_i\alpha_i),\; 1 - \frac{c}{\lambda_{\max}\big(\Lambda^{-1/2}(I-W)^\dagger\Lambda^{-1/2}\big)} \Big\}, \qquad (21)$$
and have
$$\|x^{k+1} - x^*\|_{\Lambda^{-1}}^2 + \|d^{k+1} - d^*\|_{M+\Lambda}^2 \le \rho\,\big( \|x^k - x^*\|_{\Lambda^{-1}}^2 + \|d^k - d^*\|_{M+\Lambda}^2 \big). \qquad (22)$$

Proof: From (15), we have
$$\begin{aligned}
&\|x^{k+1} - x^*\|_{\Lambda^{-1}}^2 + \|d^{k+1} - d^*\|_M^2\\
\le\; &\|x^k - x^*\|_{\Lambda^{-1}}^2 + \|d^k - d^*\|_M^2 - \|x^k - x^{k+1}\|_{\Lambda^{-1}}^2 - \|d^k - d^{k+1}\|_M^2 \qquad (23)\\
&+ 2\langle \nabla s(x^k) - \nabla s(x^*), x^k - x^{k+1}\rangle - 2\langle x^k - x^*, \nabla s(x^k) - \nabla s(x^*)\rangle.
\end{aligned}$$
For the two inner product terms, we have
$$\begin{aligned}
&2\langle \nabla s(x^k) - \nabla s(x^*), x^k - x^{k+1}\rangle - 2\langle x^k - x^*, \nabla s(x^k) - \nabla s(x^*)\rangle\\
=\; &-\|x^k - x^{k+1} - \Lambda\nabla s(x^k) + \Lambda\nabla s(x^*)\|_{\Lambda^{-1}}^2 + \|x^k - x^{k+1}\|_{\Lambda^{-1}}^2 + \|\nabla s(x^k) - \nabla s(x^*)\|_{\Lambda}^2\\
&- 2\langle x^k - x^*, \nabla s(x^k) - \nabla s(x^*)\rangle\\
\le\; &-\|d^{k+1} - d^*\|_{\Lambda}^2 + \|x^k - x^{k+1}\|_{\Lambda^{-1}}^2 + \|\nabla s(x^k) - \nabla s(x^*)\|_{\Lambda}^2\\
&- \max_i(\alpha_i L_i)\,\|\nabla s(x^k) - \nabla s(x^*)\|_{L^{-1}}^2 - \big(2 - \max_i(\alpha_i L_i)\big)\|x^k - x^*\|_S^2\\
\le\; &-\|d^{k+1} - d^*\|_{\Lambda}^2 + \|x^k - x^{k+1}\|_{\Lambda^{-1}}^2 - \big(2 - \max_i(\alpha_i L_i)\big)\min_i(\mu_i\alpha_i)\,\|x^k - x^*\|_{\Lambda^{-1}}^2. \qquad (24)
\end{aligned}$$
The first inequality comes from $x^{k+1} = x^k - \Lambda\nabla s(x^k) - \Lambda d^{k+1}$, $\nabla s(x^*) + d^* = 0$, (7), and (20). Combining (23) and (24), we have
$$\begin{aligned}
&\|x^{k+1} - x^*\|_{\Lambda^{-1}}^2 + \|d^{k+1} - d^*\|_M^2\\
\le\; &\|x^k - x^*\|_{\Lambda^{-1}}^2 + \|d^k - d^*\|_M^2 - \|d^{k+1} - d^*\|_{\Lambda}^2 - \big(2 - \max_i(\alpha_i L_i)\big)\min_i(\mu_i\alpha_i)\,\|x^k - x^*\|_{\Lambda^{-1}}^2.
\end{aligned}$$
Therefore,
$$\begin{aligned}
&\|x^{k+1} - x^*\|_{\Lambda^{-1}}^2 + \|d^{k+1} - d^*\|_{M+\Lambda}^2\\
\le\; &\Big(1 - \big(2 - \max_i(\alpha_i L_i)\big)\min_i(\mu_i\alpha_i)\Big)\|x^k - x^*\|_{\Lambda^{-1}}^2 + \frac{\lambda_{\max}(\Lambda^{-1/2}M\Lambda^{-1/2})}{\lambda_{\max}(\Lambda^{-1/2}M\Lambda^{-1/2}) + 1}\,\|d^k - d^*\|_{M+\Lambda}^2.
\end{aligned}$$
Since
$$\frac{\lambda_{\max}(\Lambda^{-1/2}M\Lambda^{-1/2})}{\lambda_{\max}(\Lambda^{-1/2}M\Lambda^{-1/2}) + 1} = 1 - \frac{c}{\lambda_{\max}\big(\Lambda^{-1/2}(I-W)^\dagger\Lambda^{-1/2}\big)},$$
with $\rho$ defined as in (21), we obtain (22).

Remark 4: The condition $I \succcurlyeq c\Lambda^{1/2}(I-W)\Lambda^{1/2}$ implies that $c \le \lambda_{n-1}\big(\Lambda^{-1/2}(I-W)^\dagger\Lambda^{-1/2}\big)$.

• If the agents across the whole network use an identical step-size $\alpha$, i.e., $\Lambda = \alpha I$, then
$$\rho = \max\Big\{ 1 - (2 - \alpha\max_i L_i)\,\alpha\min_i \mu_i,\; 1 - \frac{c\alpha}{\lambda_{\max}\big((I-W)^\dagger\big)} \Big\}.$$
A concise but informative expression of the rate,
$$\rho = \max\Big\{ 1 - \frac{\min_i \mu_i}{\max_i L_i},\; \frac{\lambda_2(W) - \lambda_n(W)}{1 - \lambda_n(W)} \Big\},$$
is obtained when we specifically choose $\alpha = \frac{1}{\max_i L_i}$ and $c = \frac{1}{(1-\lambda_n(W))\alpha}$. When $\lambda_n(W)$ is not given, we choose $c = 1/(2\alpha)$ and obtain the scalability $\max\big\{\frac{\max_i L_i}{\min_i \mu_i}, \frac{2}{1-\lambda_2(W)}\big\}$. In this case, the network impact and the functional impact are decoupled.

• If we let $\Lambda = L^{-1}$ and $c = \lambda_{n-1}\big(L^{1/2}(I-W)^\dagger L^{1/2}\big)$, then the rate becomes
$$\rho = \max\Big\{ 1 - \min_i \frac{\mu_i}{L_i},\; 1 - \frac{\lambda_{n-1}\big(L^{1/2}(I-W)^\dagger L^{1/2}\big)}{\lambda_{\max}\big(L^{1/2}(I-W)^\dagger L^{1/2}\big)} \Big\}.$$
When $\lambda_n(W)$ is not given, we choose $c = 1/(2\max_i\alpha_i) = \min_i L_i/2$ and obtain the scalability $\max\big\{\max_i \frac{L_i}{\mu_i},\; \frac{\max_i L_i}{\min_i L_i}\cdot\frac{2}{1-\lambda_2(W)}\big\}$. In this case, the networking impact is coupled with the function factors, i.e., the smoothness heterogeneity $\frac{\max_i L_i}{\min_i L_i}$ multiplies the networking impact.
The other term depends only on the functional condition numbers $L_i/\mu_i$.

Remark 5: Theorem 2 separates the dependence of the linear convergence rate on the functions and on the network structure. In our current scheme, all the agents perform information exchange and the proximal-gradient step once in each iteration. If the proximal-gradient step is expensive, this explicit rate formula can help us decide whether the so-called multi-step consensus can help reduce the computational time. For simplicity, assume for the moment that all the agents have the same strong convexity constant $\mu$ and gradient Lipschitz constant $L$. Suppose that the "$t$-step consensus" technique is employed, i.e., the mixing matrix $W$ in our algorithm is replaced by $W^t$, where $t$ is a positive integer. Then to reach $\varepsilon$-accuracy, the number of iterations needed is
$$O\Big( \max\Big\{ \frac{L}{\mu},\; \frac{1 - \lambda_n(W^t)}{1 - \lambda_2(W^t)} \Big\} \log\frac{1}{\varepsilon} \Big).$$
When $L/\mu = 1$ and the step-sizes are chosen as $\Lambda = L^{-1}$, this says that we should let $t \to +\infty$ if the graph is not a complete graph. This agrees with intuition: in this case, centralized gradient descent needs only one step to reach the optimum, and the bottleneck in decentralized optimization is the network. Suppose $t_{\max}$ is a reasonable upper bound on $t$, set by the system designer. It is difficult to find an optimal $t$ explicitly, but with the above analysis as evidence, we suggest that one choose $t = \min\big\{ [\log_{\lambda_2(W)}(1 - \tfrac{\mu}{L})],\; t_{\max} \big\}$ if $\lambda_2(W) > 1 - \tfrac{\mu}{L}$; otherwise $t = 1$. Here $[\,\cdot\,]$ gives the nearest integer.

If the bottleneck is on the functions, we can introduce a mapping $x = By$ and change the unknown variable from $x$ to $y$. For example, if the function $s_i(x)$ is a composition of a convex function with a linear mapping, replacing $x$ by $By$ changes the linear mapping and hence the condition number $L/\mu$ of the function.
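The suggested choice of $t$ can be written as a small helper. This snippet is our own illustration of the rule in Remark 5; the function name, the handling of the $L/\mu = 1$ corner case, and the clamping to $t \ge 1$ are our choices, not code from the paper.

```python
import math

def consensus_rounds(mu, L, lam2, t_max):
    # Rule of thumb from Remark 5: pick t so that lambda_2(W)^t roughly matches
    # the functional factor 1 - mu/L in the linear rate; cap the rounds at t_max.
    ratio = 1.0 - mu / L
    if lam2 > ratio:                 # the network is the bottleneck
        if ratio <= 0.0:             # L/mu == 1: only consensus limits progress
            return t_max
        t = round(math.log(ratio) / math.log(lam2))   # log base lambda_2(W)
        return max(1, min(t, t_max))
    return 1

print(consensus_rounds(mu=0.5, L=1.0, lam2=0.99, t_max=100))   # 69 rounds
```

For a well-connected graph ($\lambda_2(W)$ small relative to $1 - \mu/L$) the rule returns $t = 1$, i.e., plain NIDS already balances the two rate terms.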
When $B$ is diagonal, this is similar to the column normalization used in machine learning applications. There are other possible ways to reduce the condition number of the functions; they are out of the scope of this work, and we leave them as future work.

V. NUMERICAL EXPERIMENTS

In this section, we compare the performance of NIDS with several state-of-the-art algorithms for decentralized optimization. These methods are
• EXTRA/PG-EXTRA (see (10));
• DIGing-ATC [42]. For reference, the DIGing-ATC updates are
$$x^{k+1} = W(x^k - \alpha y^k), \qquad y^{k+1} = W\big(y^k + \nabla f(x^{k+1}) - \nabla f(x^k)\big);$$
• the accelerated distributed Nesterov gradient descent (Acc-DNGD-SC in [52]);
• the (dual friendly) optimal algorithm (OA) for distributed optimization (equation (7) in [54]).
Note that there are two rounds of communication in each iteration of DIGing-ATC and Acc-DNGD-SC, while there is only one round in each iteration of EXTRA/NIDS/OA. For all the experiments, we first compute the exact solution $x^*$ of (1) using the centralized (proximal) gradient descent. All networks are randomly generated with connectivity ratio $\tau$, defined as the number of actual edges divided by the total number of possible edges $\frac{n(n-1)}{2}$. We report the specific $\tau$ used in each test. The mixing matrix $W$ is always chosen with the Metropolis rule (see [57] and [36, Section 2.4]). The experiments are carried out in Matlab R2016b running on a laptop with an Intel i7 CPU @ 2.60 GHz, 16.0 GB of RAM, and the Windows 10 operating system. The source code for reproducing the numerical results can be accessed at https://github.com/mingyan08/NIDS.

A. The strongly convex case with $r(x) = 0$

Consider the decentralized problem of estimating an unknown signal $x \in \mathbb{R}^p$.
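For concreteness, the Metropolis rule of [57] can be implemented in a few lines. The sketch below is our own (the paper's experiments use Matlab) and assumes an undirected graph given as a 0/1 adjacency matrix.

```python
import numpy as np

def metropolis_weights(adj):
    # Metropolis rule: w_ij = 1/(1 + max(deg_i, deg_j)) for each edge (i, j),
    # w_ij = 0 for non-neighbors, and w_ii absorbs the rest so each row sums to 1.
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# 4-node ring; the resulting W is symmetric, doubly stochastic, lambda_n(W) > -1
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
W = metropolis_weights(adj)
```

By construction the weights are symmetric and every row sums to one, which are exactly the mixing-matrix properties the analysis of NIDS relies on.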
Each agent $i \in \{1,\cdots,n\}$ takes its own measurement via $y_i = M_i x + e_i$, where $y_i \in \mathbb{R}^{m_i}$ is the measurement vector, $M_i \in \mathbb{R}^{m_i\times p}$ is the sensing matrix, and $e_i \in \mathbb{R}^{m_i}$ is independent and identically distributed noise. To estimate $x$ collaboratively, we apply the decentralized algorithms to solve
$$\operatorname*{minimize}_{x} \; \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\|M_i x - y_i\|^2.$$
To ensure that each function $\frac{1}{2}\|M_i x - y_i\|^2$ is strongly convex, we choose $m_i = 60$ and $p = 50$ and set the number of nodes to $n = 40$. For the first experiment, we choose $M_i$ such that the Lipschitz constant of $\nabla s_i$ satisfies $L_i = 1$ and the strong convexity constant satisfies $\mu_i = 0.5$ for all $i$. Based on Remark 4, we choose $\alpha = 1/(\max_i L_i) = 1$ and $c = 1/(1-\lambda_n(W))$ for NIDS. In addition, we choose $c = 1/2$, which yields $\widetilde{W} = \frac{I+W}{2}$, the same mixing matrix as that of EXTRA. The comparison of these methods (NIDS with $c = 1/((1-\lambda_n(W))\alpha)$, NIDS with $c = 1/2$, EXTRA, DIGing-ATC, Acc-DNGD-SC, and OA) is shown in Fig. 1 for two different networks with connectivity ratios $\tau = 0.35$ (top) and $\tau = 0.45$ (bottom), respectively. It shows better performance of NIDS with both choices of $c$ (corresponding to known and unknown $W$) than that of the other algorithms. NIDS with $c = 1/((1-\lambda_n(W))\alpha)$ always takes less than half the number of iterations used by EXTRA to reach the same accuracy. In our experiments, DIGing-ATC appears to be sensitive to the network. Under a better connected network (see Fig. 1, bottom), DIGing-ATC can catch up with NIDS with $c = 1/(2\alpha)$. The theoretical step-size of Acc-DNGD-SC is too small due to a very small constant in the bound in [52], and the convergence of Acc-DNGD-SC under this theoretical step-size is slow and uncompetitive in our test. Thus we have carefully tuned its step-size. With the hand-optimized step-size, Acc-DNGD-SC can achieve performance comparable to NIDS with $c = 1/(2\alpha)$.
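One way to realize the prescribed constants $L_i = 1$ and $\mu_i = 0.5$ is to fix the singular values of each $M_i$ directly: for $s_i(x) = \frac{1}{2}\|M_i x - y_i\|^2$, the Lipschitz constant of $\nabla s_i$ is $\sigma_{\max}(M_i)^2$ and the strong convexity constant is $\sigma_{\min}(M_i)^2$. The construction below is our own sketch of such a generator; the paper does not specify how its matrices were produced.

```python
import numpy as np

def sensing_matrix(m, p, L, mu, rng):
    # Prescribe singular values in [sqrt(mu), sqrt(L)] so that the Hessian
    # M.T @ M of s(x) = 0.5*||M x - y||^2 has eigenvalues in [mu, L].
    U, _ = np.linalg.qr(rng.standard_normal((m, m)))
    V, _ = np.linalg.qr(rng.standard_normal((p, p)))
    S = np.zeros((m, p))
    S[:p, :p] = np.diag(np.sqrt(np.linspace(mu, L, p)))
    return U @ S @ V.T

rng = np.random.default_rng(1)
M = sensing_matrix(m=60, p=50, L=1.0, mu=0.5, rng=rng)
eigs = np.linalg.eigvalsh(M.T @ M)      # Hessian spectrum spans [0.5, 1.0]
```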
In the plots, we observe that OA is fast in terms of the number of iterations. However, the per-iteration cost of OA is relatively high in this case, since it requires solving a system of linear equations at each iteration (though factorization tricks may be used to save some computational time).

Next, we demonstrate the effect of uncoordinated/adaptive step-sizes. We construct the functions with $\mu_i = 0.02$ and $L_i = 1$ for each node $i$. Then we change the $L_i$ values by multiplying the functions by constants. We use the same mixing matrix for the following two experiments.

• We change half of the nodes. We randomly pick one even-numbered node and multiply its function by 4. For the remaining even-numbered nodes, we multiply their functions by a random integer, 2 or 3.

Fig. 1. Relative error $\frac{\|x - x^*\|}{\|x^*\|}$ against the number of iterations for two different networks (top: $\tau = 0.35$; bottom: $\tau = 0.45$). NIDS, EXTRA, and DIGing-ATC use the same step-size $\alpha = 1/L$, where $L = \max_i L_i$. The step-size for Acc-DNGD-SC is hand-optimized. We use the default step-sizes for OA as suggested by the authors.

Fig. 2. The relative error $\frac{\|x - x^*\|}{\|x^*\|}$ against the number of iterations. NIDS-$1/L$ uses the same step-size $1/L$, where $L = \max_i L_i$, and NIDS-adaptive uses the step-size $1/L_i$ for each node. We assume that no graph information is available, thus $c = 1/(2\max_i\alpha_i)$. The connectivity ratio of the network is set as $\tau = 0.1$.

• We change a quarter of the nodes. We randomly pick a node not in $U = \{4, 8, 16, \ldots, 40\}$ and multiply its function by 10.
Then for the other nodes in $U$, we multiply their functions by a random integer between 2 and 9.

We compare NIDS with adaptive step-sizes ($1/L_i$ for node $i$) and NIDS with the same step-size $1/\max_i L_i$ in Fig. 2. We let $c = 1/(2\max_i\alpha_i)$, so no network information is needed. As shown in Fig. 2, NIDS with adaptive step-sizes converges faster than NIDS with the same step-size.

B. The case with a nonsmooth function $r(x)$

In this subsection, we compare the performance of NIDS with PG-EXTRA [1] only, since the other methods in Section V-A, such as DIGing, cannot be applied to this nonsmooth case. We consider a decentralized compressed sensing problem. Again, each agent $i \in \{1,\cdots,n\}$ takes its own measurement via $y_i = M_i x + e_i$, where $y_i \in \mathbb{R}^{m_i}$ is the measurement vector, $M_i \in \mathbb{R}^{m_i\times p}$ is the sensing matrix, and $e_i \in \mathbb{R}^{m_i}$ is independent and identically distributed noise. Here, $x$ is a sparse signal. The optimization problem is
$$\operatorname*{minimize}_{x} \; \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\|M_i x - y_i\|^2 + \frac{1}{n}\sum_{i=1}^n \lambda_i\|x\|_1,$$
where the connectivity ratio of the network is $\tau = 0.1$. We normalize the problem so that the Lipschitz constant satisfies $L_i = 1$ for each node; we choose $m_i = 3$ and $p = 200$ and set the number of nodes to $n = 40$. Fig. 3 shows that a larger step-size in NIDS leads to faster convergence. With step-size 1, NIDS and PG-EXTRA converge at the same speed. But if we keep increasing the step-size, PG-EXTRA diverges with step-size 1.4, while the step-size of NIDS can be increased to 1.9, maintaining convergence at a faster speed.

C. An application in classification for healthcare data

We consider a decentralized sparse logistic regression problem to classify the colon-cancer data [58]. There are 62 samples, and each sample features 2,000 pieces of gene expression information (numericalized and normalized [58]) and a binary outcome.
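With $r_i = \lambda_i\|\cdot\|_1$, the proximal step (8b) decouples across agents and coordinates into soft-thresholding with per-agent threshold $\alpha_i\lambda_i$. The sketch below is our own illustration with hypothetical toy values, not code from the paper's release.

```python
import numpy as np

def prox_l1(z, alpha, lam):
    # Row i of the proximal step with r_i = lam_i * ||.||_1 is coordinatewise
    # soft-thresholding at level alpha_i * lam_i.
    thresh = (alpha * lam)[:, None]          # per-agent thresholds, broadcast over p
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

z = np.array([[ 2.0, -0.3, 0.5],             # n = 2 agents, p = 3 (toy numbers)
              [-1.5,  0.2, 3.0]])
alpha = np.array([1.0, 0.5])                 # uncoordinated step-sizes
lam = np.array([0.4, 0.4])                   # l1 weights
x = prox_l1(z, alpha, lam)                   # thresholds 0.4 and 0.2 per row
```

Because the prox is closed-form and separable, the nonsmooth term adds essentially no per-iteration cost beyond the gradient and mixing steps.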
The outcome can be normal/negative ($+1$) or tumor/positive ($-1$); the data set contains 22 normal and 40 tumor colon tissue samples. We store the gene information in $M_i \in \mathbb{R}^{1\times 2001}$ (one more dimension is augmented to take care of the linear offset/constant in the logit function model), and the outcome information is $y_i \in \{-1, 1\}$, $i \in S_1 \cup S_2$, where $S_1$ serves for training while $S_2$ serves for testing. Suppose we have a 50-node connected network where each node $i$ holds one sample $(M_i, y_i)$ (the connected network is randomly generated and its connectivity ratio is set to $0.08$; the 50 in-network samples indexed by $S_1$ are randomly drawn from the 62 samples). The decentralized logistic regression
$$\operatorname*{minimize}_{x} \; \frac{1}{|S_1|}\sum_{i\in S_1} \ln\big(1 + \exp(-M_i x_i y_i)\big) + \frac{1}{|S_1|}\sum_{i\in S_1} \hat\lambda_i\|x_i\|_2^2 + \frac{1}{|S_1|}\sum_{i\in S_1} \lambda_i\|x_i\|_1$$
is then solved over the network to train a sparse linear classifier $x^*$ for the outcome prediction of the remaining/future samples. In this optimization formulation, the $\ell_2$-norm term imposes strong convexity on $s$, while the $\ell_1$ term promotes sparsity of the solution. Aside from the 50 samples used for training, we randomly select 12 nodes from the 50 nodes to show the prediction performance on the remaining 12 samples in Fig. 4 (left).

Fig. 3. The relative error $\frac{\|x - x^*\|}{\|x^*\|}$ against the number of iterations. Different step-sizes for PG-EXTRA and NIDS are considered. For instance, "NIDS-$1/L$" is NIDS using the same step-size $1/L$ across the network of agents, where $L = \max_i L_i$. The connectivity ratio of the network is $\tau = 0.4$.

The middle and right figures in Fig.
4 show how the consensus error $\|x^k\|_{I-W}$ and the sparsity of the average solution vector $\frac{1}{50}\sum_{i=1}^{50} x_i^k$ drop, respectively.

VI. CONCLUSION

We proposed a novel decentralized consensus algorithm, NIDS, whose step-size does not depend on the network structure. In NIDS, the step-size depends only on the objective function, and it can be as large as $2/L$, where $L$ is the Lipschitz constant of the gradient of the smooth function. We showed that NIDS converges at the $o(1/k)$ rate for the general convex case and at a linear rate for the strongly convex case. For the strongly convex case, we separated the linear convergence rate's dependence on the objective function and the network. The separated convergence rates match the typical rates for the general gradient descent and the consensus averaging. Furthermore, every agent in the network can choose its own step-size independently, based on its own objective function. Numerical experiments validated the theoretical results and demonstrated better performance of NIDS over state-of-the-art algorithms. Because the step-size of NIDS does not depend on the network structure, there are many possible future extensions. One extension is to apply NIDS on dynamic networks where nodes can join and drop off.

ACKNOWLEDGEMENT

We thank the anonymous reviewers for helpful comments and suggestions to improve the clarity of this paper.

REFERENCES

[1] W. Shi, Q. Ling, G. Wu, and W. Yin, "A proximal gradient algorithm for decentralized composite optimization," IEEE Transactions on Signal Processing, vol. 63, no. 22, pp. 6013-6023, 2015.
[2] L. Xiao, S. Boyd, and S. Kim, "Distributed average consensus with least-mean-square deviation," Journal of Parallel and Distributed Computing, vol. 67, no. 1, pp. 33-46, 2007.
[3] K. Cai and H.
Ishii, "Average consensus on arbitrary strongly connected digraphs with time-varying topologies," IEEE Transactions on Automatic Control, vol. 59, no. 4, pp. 1066-1071, 2014.
[4] A. Olshevsky, "Linear time average consensus and distributed optimization on fixed graphs," SIAM Journal on Control and Optimization, vol. 55, no. 6, pp. 3990-4014, 2017.
[5] J. Bazerque and G. Giannakis, "Distributed spectrum sensing for cognitive radio networks by exploiting sparsity," IEEE Transactions on Signal Processing, vol. 58, pp. 1847-1862, 2010.
[6] W. Ren, "Consensus based formation control strategies for multi-vehicle systems," in Proceedings of the American Control Conference, 2006, pp. 4237-4242.
[7] A. Olshevsky, "Efficient Information Aggregation Strategies for Distributed Control and Signal Processing," Ph.D. dissertation, Massachusetts Institute of Technology, 2010.
[8] S. Ram, V. Veeravalli, and A. Nedić, "Distributed non-autonomous power control through distributed convex optimization," in INFOCOM. IEEE, 2009, pp. 3001-3005.
[9] L. Gan, U. Topcu, and S. Low, "Optimal decentralized protocol for electric vehicle charging," IEEE Transactions on Power Systems, vol. 28, no. 2, pp. 940-951, 2013.
[10] M. Rabbat and R. Nowak, "Distributed optimization in sensor networks," in Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks. ACM, 2004, pp. 20-27.
[11] P. Forero, A. Cano, and G. Giannakis, "Consensus-based distributed support vector machines," Journal of Machine Learning Research, vol. 59, pp. 1663-1707, 2010.
[12] A. Nedić, A. Olshevsky, and C. A. Uribe, "Fast convergence rates for distributed non-Bayesian learning," IEEE Transactions on Automatic Control, vol. 62, no. 11, pp. 5538-5553, 2017.
Fig. 4. Performance of NIDS for sparse logistic regression. Left: number of correct predictions vs. iteration. Middle: consensus error $\|x^k\|_{I-W}$ vs. iteration. Right: number of non-zero elements in $\bar x^k = \frac{1}{50}\sum_{i=1}^{50} x_i^k$ vs. iteration.

[13] D. Bertsekas, "Distributed asynchronous computation of fixed points," Mathematical Programming, vol. 27, no. 1, pp. 107-120, 1983.
[14] J. Tsitsiklis, D. Bertsekas, and M. Athans, "Distributed asynchronous deterministic and stochastic gradient optimization algorithms," IEEE Transactions on Automatic Control, vol. 31, no. 9, pp. 803-812, 1986.
[15] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, 2011.
[16] V. Cevher, S. Becker, and M. Schmidt, "Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 32-43, 2014.
[17] A. Nedić and D. Bertsekas, "Convergence rate of incremental subgradient algorithms," in Stochastic Optimization: Algorithms and Applications. Springer, 2001, pp. 223-264.
[18] A. Nedić and D. Bertsekas, "Incremental subgradient methods for nondifferentiable optimization," SIAM Journal on Optimization, vol. 12, no. 1, pp. 109-138, 2001.
[19] A. Nedić, D. P. Bertsekas, and V.
Borkar, "Distributed asynchronous incremental subgradient methods," in Proceedings of the March 2000 Haifa Workshop "Inherently Parallel Algorithms in Feasibility and Optimization and Their Applications". Elsevier, Amsterdam, 2001.
[20] S. Ram, A. Nedić, and V. Veeravalli, "Incremental stochastic subgradient algorithms for convex optimization," SIAM Journal on Optimization, vol. 20, no. 2, pp. 691-717, 2009.
[21] D. P. Bertsekas, "Incremental proximal methods for large scale convex optimization," Mathematical Programming, vol. 129, pp. 163-195, 2011.
[22] M. Wang and D. P. Bertsekas, "Incremental constraint projection-proximal methods for nonsmooth convex optimization," Lab. for Information and Decision Systems Report LIDS-P-2907, MIT, July 2013.
[23] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, pp. 48-61, 2009.
[24] S. S. Ram, A. Nedić, and V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516-545, 2010.
[25] A. Nedić, "Asynchronous broadcast-based convex optimization over a network," IEEE Transactions on Automatic Control, vol. 56, no. 6, pp. 1337-1351, 2011.
[26] K. Yuan, Q. Ling, and W. Yin, "On the convergence of decentralized gradient descent," SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835-1854, 2016.
[27] H. Terelius, U. Topcu, and R. Murray, "Decentralized multi-agent optimization via dual decomposition," IFAC Proceedings Volumes, vol. 44, no. 1, pp. 11245-11251, 2011.
[28] D. P. Bertsekas and J. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, 2nd ed. Nashua: Athena Scientific, 1997.
[29] E. Wei and A.
Ozdaglar, "On the O(1/k) convergence of asynchronous distributed alternating direction method of multipliers," in Global Conference on Signal and Information Processing (GlobalSIP), 2013 IEEE. IEEE, 2013, pp. 551-554.
[30] T.-H. Chang, M. Hong, and X. Wang, "Multi-agent distributed optimization via inexact consensus ADMM," IEEE Transactions on Signal Processing, vol. 63, no. 2, pp. 482-497, 2015.
[31] M. Hong and T.-H. Chang, "Stochastic proximal gradient consensus over random networks," IEEE Transactions on Signal Processing, vol. 65, no. 11, pp. 2933-2948, 2017.
[32] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "On the linear convergence of the ADMM in decentralized consensus optimization," IEEE Transactions on Signal Processing, vol. 62, no. 7, pp. 1750-1761, 2014.
[33] A. Chen and A. Ozdaglar, "A fast distributed proximal-gradient method," in the 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2012, pp. 601-608.
[34] D. Jakovetic, J. Xavier, and J. Moura, "Fast distributed gradient methods," IEEE Transactions on Automatic Control, vol. 59, pp. 1131-1146, 2014.
[35] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013, vol. 87.
[36] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944-966, 2015.
[37] M. Zhu and S. Martinez, "Discrete-time dynamic average consensus," Automatica, vol. 46, no. 2, pp. 322-329, 2010.
[38] J. Xu, S. Zhu, Y. Soh, and L. Xie, "Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes," in Proceedings of the 54th IEEE Conference on Decision and Control (CDC), 2015, pp. 2055-2060.
[39] P. Di Lorenzo and G.
Scutari, "NEXT: In-network nonconvex optimization," IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120-136, 2016.
[40] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," in Decision and Control (CDC), 2016 IEEE 55th Conference on. IEEE, 2016, pp. 159-166.
[41] A. Nedić, A. Olshevsky, and W. Shi, "Achieving geometric convergence for distributed optimization over time-varying graphs," SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597-2633, 2017.
[42] A. Nedić, A. Olshevsky, W. Shi, and C. A. Uribe, "Geometrically convergent distributed optimization with uncoordinated step-sizes," in American Control Conference (ACC), 2017. IEEE, 2017, pp. 3950-3955.
[43] A. Nedić and A. Olshevsky, "Distributed optimization over time-varying directed graphs," in The 52nd IEEE Annual Conference on Decision and Control, 2013, pp. 6855-6860.
[44] C. Xi and U. Khan, "On the linear convergence of distributed optimization over directed graphs," arXiv preprint arXiv:1510.02149, 2015.
[45] J. Zeng and W. Yin, "ExtraPush for convex smooth decentralized optimization over directed networks," Journal of Computational Mathematics, Special Issue on Compressed Sensing, Optimization, and Structured Solutions, vol. 35, no. 4, pp. 381-394, 2017.
[46] Y. Sun, G. Scutari, and D. Palomar, "Distributed nonconvex multiagent optimization over time-varying networks," in 2016 50th Asilomar Conference on Signals, Systems and Computers. IEEE, 2016, pp. 788-794.
[47] Z. Li and M. Yan, "A primal-dual algorithm with optimal step-sizes and its application in decentralized consensus optimization," arXiv preprint arXiv:1711.06785, 2017.
[48] S. A. Alghunaim and A. H. Sayed, "Linear convergence of primal-dual gradient methods and their performance in distributed optimization," arXiv e-prints, p.
arXiv:1904.01196, Apr 2019.
[49] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, "Exact diffusion for distributed optimization and learning, Part I: Algorithm development," IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 708–723, 2019.
[50] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, "Exact diffusion for distributed optimization and learning, Part II: Convergence analysis," IEEE Transactions on Signal Processing, vol. 67, no. 3, pp. 724–739, 2019.
[51] A. Nedic, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis, "On distributed averaging algorithms and quantization effects," IEEE Transactions on Automatic Control, vol. 54, no. 11, pp. 2506–2517, 2009.
[52] G. Qu and N. Li, "Accelerated distributed Nesterov gradient descent for convex and smooth functions," in Decision and Control (CDC), 2017 IEEE 56th Annual Conference on. IEEE, 2017, pp. 2260–2267.
[53] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié, "Optimal algorithms for smooth and strongly convex distributed optimization in networks," in International Conference on Machine Learning, 2017, pp. 3027–3036.
[54] C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedić, "Optimal algorithms for distributed optimization," arXiv preprint arXiv:1712.00232, 2017.
[55] T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed, "Decentralized consensus optimization with asynchrony and delays," IEEE Transactions on Signal and Information Processing over Networks, vol. 4, no. 2, pp. 293–307, 2018.
[56] M. Yan, "A new primal-dual algorithm for minimizing the sum of three functions with a linear operator," Journal of Scientific Computing, to appear, 2018.
[57] S. Boyd, P. Diaconis, and L. Xiao, "Fastest mixing Markov chain on a graph," SIAM Review, vol. 46, no. 4, pp. 667–689, 2004.
[58] J. Liu, J. Chen, and J. Ye, "Large-scale sparse logistic regression," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 547–556.
[59] D.
Davis and W. Yin, "Convergence rate analysis of several splitting schemes," in Splitting Methods in Communication, Imaging, Science, and Engineering. Springer, 2016, pp. 115–163.

Zhi Li received the B.S. and the M.S. degrees in Applied Mathematics from China University of Petroleum, Shandong, China, in 2007 and 2010, respectively. He then participated in a cooperative education program and received the M.S. degree in Applied Science from Saint Mary's University, Halifax, NS, Canada, in 2012. After being awarded the Hong Kong Ph.D. Fellowship, he went to Hong Kong Baptist University, HK, China, where he received his Ph.D. in Applied Mathematics in 2016, supported by the fellowship from 2013 to 2016. Since 2016 he has been a postdoctoral researcher with the Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI.

Wei Shi received the B.E. degree in automation and the Ph.D. degree in control science and engineering from the University of Science and Technology of China, Hefei, in 2010 and 2015, respectively. He was a Postdoctoral Researcher with the Coordinated Science Laboratory, the University of Illinois at Urbana-Champaign, Urbana, IL, USA, from 2015 to 2016, with Arizona State University, Tempe, AZ, USA, from 2016 to 2018, and with Princeton University from 2018 to 2019. His research interests span optimization, cyber-physical systems, and big data analytics. He was awarded the 2017 Young Author Best Paper Award from the IEEE Signal Processing Society.

Ming Yan is currently an Assistant Professor in the Department of Computational Mathematics, Science and Engineering and the Department of Mathematics, Michigan State University. He received the B.S. and M.S. degrees from the University of Science and Technology of China, and the Ph.D. degree from the University of California, Los Angeles (UCLA) in 2012.
His research interests include optimization methods and their applications in sparse recovery and regularized inverse problems, variational methods for image processing, and parallel and distributed algorithms for solving big data problems.

SUPPLEMENTARY MATERIAL FOR "A DECENTRALIZED PROXIMAL-GRADIENT METHOD WITH NETWORK INDEPENDENT STEP-SIZES AND SEPARATED CONVERGENCE RATES"
L. Zhi, W. Shi, and M. Yan

A. Proof of Lemma 1

Lemma 1 (Fixed point of (12)): (d*, z*) is a fixed point of (12) if and only if there exists a subgradient q* ∈ ∂r(x*) such that z* = x* + Λq* and

  d* + ∇s(x*) + q* = 0,
  (I − W)x* = 0.

Proof: "⇒" If (d*, z*) is a fixed point of (12), we have

  0 = c(I − W)(2x* − z* − Λ∇s(x*) − Λd*) = c(I − W)x*,

where the two equalities come from (12b) and (12c), respectively. Combining (12c) and (12a) gives

  0 = z* − x* + Λ∇s(x*) + Λd* = Λ(q* + ∇s(x*) + d*),

where q* ∈ ∂r(x*).

"⇐" In order to show that (d*, z*) is a fixed point of iteration (12), we just need to verify that (d^{k+1}, z^{k+1}) = (d*, z*) if (d^k, z^k) = (d*, z*). From (12a), we have x^k = x*; then

  d^{k+1} = d* + c(I − W)(2x* − z* − Λ∇s(x*) − Λd*) = d* + c(I − W)x* = d*,
  z^{k+1} = x* − Λ∇s(x*) − Λd* = x* + Λq* = z*.

Therefore, (d*, z*) is a fixed point of iteration (12).

B. Proof of Lemma 2

Lemma 2 (Optimality condition): x* is consensual with x*_1 = x*_2 = ··· = x*_n = x* being an optimal solution of problem (1) if and only if there exist p* and a subgradient q* ∈ ∂r(x*) such that

  (I − W)p* + ∇s(x*) + q* = 0,   (26a)
  (I − W)x* = 0.   (26b)

In addition, (d* = (I − W)p*, z* = x* + Λq*) is a fixed point of iteration (12).
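The fixed-point characterizations in Lemmas 1 and 2 can be exercised numerically. The sketch below runs iteration (12) in the (d, z) form used throughout these proofs, for the smooth case r = 0 (so the proximal step (12a) reduces to x^k = z^k), on a hypothetical four-agent quadratic instance; the mixing matrix, objectives, and step-sizes are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Toy instance (all data hypothetical): 4 agents on a ring,
# s_i(x) = 0.5 * (x - b_i)^2 with p = 1, so L_i = 1, and r_i = 0.
n = 4
W = np.array([[0.5 , 0.25, 0.0 , 0.25],
              [0.25, 0.5 , 0.25, 0.0 ],
              [0.0 , 0.25, 0.5 , 0.25],
              [0.25, 0.0 , 0.25, 0.5 ]])   # symmetric doubly stochastic mixing matrix
I = np.eye(n)
b = np.array([1.0, -2.0, 3.0, 0.5])
alpha = np.array([0.5, 0.8, 1.0, 1.2])     # uncoordinated step-sizes, alpha_i < 2/L_i
Lam = np.diag(alpha)
c = 0.4                                    # keeps c * Lam^{1/2}(I-W)Lam^{1/2} below I

z = np.zeros(n)
d = np.zeros(n)                            # d^0 = 0 lies in range(I - W)
for _ in range(5000):
    x = z                                  # (12a) with r = 0: the prox is the identity
    grad = x - b                           # gradient of s at x, componentwise
    d = d + c * (I - W) @ (2 * x - z - Lam @ grad - Lam @ d)   # (12b)
    z = x - Lam @ grad - Lam @ d           # (12c), using the updated d

# At a fixed point, (I - W) x* = 0 (consensus) and d* + grad s(x*) = 0 with
# d* in range(I - W), so summing over agents recovers the minimizer of (1).
print(x)   # all entries close to b.mean() = 0.625
```

Since the rows of d stay in range(I − W), its columns remain orthogonal to the all-ones vector, which is exactly what forces the summed gradients to vanish at the limit.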
Proof: "⇒" Because x* = 1_{n×1}(x*)^⊤, we have

  (I − W)x* = (I − W)1_{n×1}(x*)^⊤ = 0_{n×1}(x*)^⊤ = 0.

The fact that x* is an optimal solution of problem (1) means there exists q* ∈ ∂r(x*) such that (∇s(x*) + q*)^⊤ 1_{n×1} = 0. That is to say, all columns of ∇s(x*) + q* are orthogonal to 1_{n×1}. Therefore, Remark 1 shows the existence of p* such that (I − W)p* + ∇s(x*) + q* = 0.

"⇐" Equation (26b) shows that x* is consensual because of item 3 of Assumption 1, i.e., x* = 1_{n×1}(x*)^⊤ for some x*. From (26a), we have

  0 = ((I − W)p* + ∇s(x*) + q*)^⊤ 1_{n×1} = (p*)^⊤(I − W)1_{n×1} + (∇s(x*) + q*)^⊤ 1_{n×1} = (∇s(x*) + q*)^⊤ 1_{n×1}.

Thus, 0 ∈ Σ_{i=1}^n (∇s_i(x*) + ∂r_i(x*)) because x* is consensual. This completes the proof of the equivalence. Lemma 1 shows that (d* = (I − W)p*, z* = x* + Λq*) is a fixed point of iteration (12).

C. Proof of Lemma 3

Lemma 3 (Norm over range space): For any symmetric positive semidefinite matrix A ∈ R^{n×n} with rank r ≤ n, let λ_1 ≥ λ_2 ≥ ··· ≥ λ_r > 0 be its r nonzero eigenvalues. Then range(A) defined in Section I-E is an rp-dimensional subspace of R^{n×p} and has a norm defined by ‖x‖²_{A†} := ⟨x, A†x⟩, where A† is the pseudoinverse of A. In addition, λ_1^{-1}‖x‖² ≤ ‖x‖²_{A†} ≤ λ_r^{-1}‖x‖² for all x ∈ range(A).

Proof: Let A = UΣU^⊤, where Σ = Diag(λ_1, λ_2, ···, λ_r) and the columns of U are orthonormal eigenvectors for the corresponding eigenvalues, i.e., U ∈ R^{n×r} and U^⊤U = I_{r×r}. Then A† = UΣ^{-1}U^⊤, where Σ^{-1} = Diag(λ_1^{-1}, λ_2^{-1}, ···, λ_r^{-1}). Letting x = Ay, we have

  ‖x‖² = ⟨UΣU^⊤y, UΣU^⊤y⟩ = ⟨ΣU^⊤y, ΣU^⊤y⟩ = ‖ΣU^⊤y‖².

In addition,

  ⟨x, A†x⟩ = ⟨Ay, A†Ay⟩ = ⟨UΣU^⊤y, UΣ^{-1}U^⊤UΣU^⊤y⟩ = ⟨ΣU^⊤y, Σ^{-1}ΣU^⊤y⟩.
Therefore,

  λ_1^{-1}‖x‖² = λ_1^{-1}‖ΣU^⊤y‖² ≤ ⟨x, A†x⟩ ≤ λ_r^{-1}‖ΣU^⊤y‖² = λ_r^{-1}‖x‖²,   (27)

which means that ‖·‖²_{A†} = ⟨·, A†·⟩ defines a norm on range(A).

D. Proof of Proposition 1

Proposition 1: Let M = c^{-1}(I − W)† − Λ with Λ being symmetric positive definite and I ≽ cΛ^{1/2}(I − W)Λ^{1/2} ≽ 0. Then ‖·‖_M is a norm defined on range(I − W).

Proof: Rewrite the matrix M as

  M = c^{-1}(I − W)† − Λ
    = Λ^{1/2}(c^{-1}Λ^{-1/2}(I − W)†Λ^{-1/2} − I)Λ^{1/2}
    = Λ^{1/2}((cΛ^{1/2}(I − W)Λ^{1/2})† − I)Λ^{1/2}.

For any x ∈ range(I − W), we can find y ∈ R^{n×p} such that x = (I − W)Λ^{1/2}y. Then

  ⟨x, Mx⟩ = ⟨(I − W)Λ^{1/2}y, Λ^{1/2}((cΛ^{1/2}(I − W)Λ^{1/2})† − I)Λ^{1/2}(I − W)Λ^{1/2}y⟩
          = ⟨Λ^{1/2}(I − W)Λ^{1/2}y, ((cΛ^{1/2}(I − W)Λ^{1/2})† − I)Λ^{1/2}(I − W)Λ^{1/2}y⟩.

We apply Lemma 3 to Λ^{1/2}(I − W)Λ^{1/2} and obtain the result.

E. Proof of Theorem 1

Before proving Theorem 1, we present two lemmas. The first lemma shows that the distance to a fixed point of (12) is decreasing, and the second one shows that the distance between two consecutive iterates is decreasing.

Lemma 4 (A key inequality of descent): Let (d*, z*) be a fixed point of (12) with d* ∈ range(I − W). For the sequence (d^k, z^k) generated from NIDS in (12) with I ≽ cΛ^{1/2}(I − W)Λ^{1/2}, we have

  ‖z^{k+1} − z*‖²_{Λ^{-1}} + ‖d^{k+1} − d*‖²_M
  ≤ ‖z^k − z*‖²_{Λ^{-1}} + ‖d^k − d*‖²_M − (1 − max_i α_iL_i/2)(‖z^k − z^{k+1}‖²_{Λ^{-1}} + ‖d^k − d^{k+1}‖²_M).   (28)

Proof: Young's inequality and (7) give us

  2⟨∇s(x^k) − ∇s(x*), z^k − z^{k+1}⟩ − 2⟨x^k − x*, ∇s(x^k) − ∇s(x*)⟩
  ≤ ½‖z^k − z^{k+1}‖²_L + 2‖∇s(x^k) − ∇s(x*)‖²_{L^{-1}} − 2‖∇s(x^k) − ∇s(x*)‖²_{L^{-1}}
  = ½‖z^k − z^{k+1}‖²_L.
Therefore, from (15), we have

  ‖z^{k+1} − z*‖²_{Λ^{-1}} + ‖d^{k+1} − d*‖²_M
  ≤ ‖z^k − z*‖²_{Λ^{-1}} + ‖d^k − d*‖²_M − ‖z^k − z^{k+1}‖²_{Λ^{-1}} − ‖d^k − d^{k+1}‖²_M + ½‖z^k − z^{k+1}‖²_L
  ≤ ‖z^k − z*‖²_{Λ^{-1}} + ‖d^k − d*‖²_M − (1 − max_i α_iL_i/2)‖z^k − z^{k+1}‖²_{Λ^{-1}} − ‖d^k − d^{k+1}‖²_M
  ≤ ‖z^k − z*‖²_{Λ^{-1}} + ‖d^k − d*‖²_M − (1 − max_i α_iL_i/2)(‖z^k − z^{k+1}‖²_{Λ^{-1}} + ‖d^k − d^{k+1}‖²_M).

This completes the proof.

Lemma 5 (Monotonicity of the successive difference in a special norm): Let (d^k, z^k) be the sequence generated from NIDS in (12) with α_i < 2/L_i for all i and I ≽ cΛ^{1/2}(I − W)Λ^{1/2}. Then the sequence {‖z^{k+1} − z^k‖²_{Λ^{-1}} + ‖d^{k+1} − d^k‖²_M}_{k≥0} is monotonically nonincreasing.

Proof: Similarly to the proof of Lemma 4, we can show that

  ⟨d^{k+1} − d^k, z^{k+1} − z^k + x^k⟩ = ⟨d^{k+1} − d^k, d^{k+1} − d^k⟩_M,   (29)
  ⟨d^{k+1} − d^k, z^k − z^{k-1} + x^{k-1}⟩ = ⟨d^{k+1} − d^k, d^k − d^{k-1}⟩_M,   (30)
  ⟨x^k − x^{k-1}, z^k − x^k − z^{k-1} + x^{k-1}⟩_{Λ^{-1}} ≥ 0.   (31)

Subtracting (30) from (29) on both sides, we have

  ⟨d^{k+1} − d^k, x^k − x^{k-1}⟩
  = ‖d^{k+1} − d^k‖²_M − ⟨d^{k+1} − d^k, d^k − d^{k-1}⟩_M + ⟨d^{k+1} − d^k, 2z^k − z^{k-1} − z^{k+1}⟩
  ≥ ‖d^{k+1} − d^k‖²_M − ½‖d^{k+1} − d^k‖²_M − ½‖d^k − d^{k-1}‖²_M + ⟨d^{k+1} − d^k, 2z^k − z^{k-1} − z^{k+1}⟩
  = ½‖d^{k+1} − d^k‖²_M − ½‖d^k − d^{k-1}‖²_M + ⟨d^{k+1} − d^k, 2z^k − z^{k-1} − z^{k+1}⟩,   (32)

where the inequality comes from the Cauchy–Schwarz inequality. Then the previous inequality, together with (31) and the Cauchy–Schwarz inequality, gives

  ⟨x^k − x^{k-1}, ∇s(x^k) − ∇s(x^{k-1})⟩
  ≤ ⟨x^k − x^{k-1}, Λ^{-1}(z^k − x^k − z^{k-1} + x^{k-1}) + ∇s(x^k) − ∇s(x^{k-1})⟩
  = ⟨x^k − x^{k-1}, Λ^{-1}(z^k − z^{k+1} − z^{k-1} + z^k) − d^{k+1} + d^k⟩
  ≤ ⟨x^k − x^{k-1}, z^k − z^{k+1} − z^{k-1} + z^k⟩_{Λ^{-1}} − ⟨d^{k+1} − d^k, 2z^k − z^{k-1} − z^{k+1}⟩ − ½‖d^{k+1} − d^k‖²_M + ½‖d^k − d^{k-1}‖²_M
  = ⟨Λ^{-1}(x^k − x^{k-1}) − d^{k+1} + d^k, z^k − z^{k+1} − z^{k-1} + z^k⟩ − ½‖d^{k+1} − d^k‖²_M + ½‖d^k − d^{k-1}‖²_M

and consequently

  ⟨x^k − x^{k-1}, ∇s(x^k) − ∇s(x^{k-1})⟩
  ≤ ⟨Λ^{-1}(z^{k+1} − z^k) + ∇s(x^k) − ∇s(x^{k-1}), z^k − z^{k+1} − z^{k-1} + z^k⟩ − ½‖d^{k+1} − d^k‖²_M + ½‖d^k − d^{k-1}‖²_M
  ≤ ⟨z^{k+1} − z^k, z^k − z^{k+1} − z^{k-1} + z^k⟩_{Λ^{-1}} + ½‖z^k − z^{k+1} − z^{k-1} + z^k‖²_{Λ^{-1}} + ½‖∇s(x^k) − ∇s(x^{k-1})‖²_Λ − ½‖d^{k+1} − d^k‖²_M + ½‖d^k − d^{k-1}‖²_M
  = ½‖z^k − z^{k-1}‖²_{Λ^{-1}} − ½‖z^{k+1} − z^k‖²_{Λ^{-1}} + ½‖∇s(x^k) − ∇s(x^{k-1})‖²_Λ − ½‖d^{k+1} − d^k‖²_M + ½‖d^k − d^{k-1}‖²_M.

The three inequalities hold because of (31), (32), and the Cauchy–Schwarz inequality, respectively. The first and third equalities come from (12c). Rearranging the previous inequality, we obtain

  ‖z^{k+1} − z^k‖²_{Λ^{-1}} + ‖d^{k+1} − d^k‖²_M
  ≤ ‖z^k − z^{k-1}‖²_{Λ^{-1}} + ‖d^k − d^{k-1}‖²_M + ‖∇s(x^k) − ∇s(x^{k-1})‖²_Λ − 2⟨x^k − x^{k-1}, ∇s(x^k) − ∇s(x^{k-1})⟩
  ≤ ‖z^k − z^{k-1}‖²_{Λ^{-1}} + ‖d^k − d^{k-1}‖²_M + ‖∇s(x^k) − ∇s(x^{k-1})‖²_{Λ − 2L^{-1}}
  ≤ ‖z^k − z^{k-1}‖²_{Λ^{-1}} + ‖d^k − d^{k-1}‖²_M,

where the second and last inequalities come from (7) and Λ ≺ 2L^{-1}, respectively. This completes the proof.

Theorem 1 (Sublinear rate): Let (d^k, z^k) be the sequence generated from NIDS in (12) with α_i < 2/L_i for all i and I ≽ cΛ^{1/2}(I − W)Λ^{1/2}. We have

  ‖z^k − z^{k+1}‖²_{Λ^{-1}} + ‖d^k − d^{k+1}‖²_M ≤ (‖z^1 − z*‖²_{Λ^{-1}} + ‖d^1 − d*‖²_M) / (k(1 − max_i α_iL_i/2)),   (33)
  ‖z^k − z^{k+1}‖²_{Λ^{-1}} + ‖d^k − d^{k+1}‖²_M = o(1/(k+1)).

Furthermore, (d^k, z^k) converges to a fixed point (d̄, z̄) of iteration (12) with d̄ ∈ range(I − W) if I ≻ cΛ^{1/2}(I − W)Λ^{1/2}.
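Before turning to the proof, the monotonicity claimed in Lemma 5 and the decay asserted in (33) can be sanity-checked numerically. The sketch below runs the (d, z) form of iteration (12) with r = 0 on a hypothetical four-agent quadratic instance (all data illustrative) and tracks ‖z^{k+1} − z^k‖²_{Λ^{-1}} + ‖d^{k+1} − d^k‖²_M, where M = c^{-1}(I − W)† − Λ as in Proposition 1.

```python
import numpy as np

# Hypothetical toy instance: s_i(x) = 0.5 * (x - b_i)^2, so L_i = 1 and r_i = 0.
n = 4
W = np.array([[0.5 , 0.25, 0.0 , 0.25],
              [0.25, 0.5 , 0.25, 0.0 ],
              [0.0 , 0.25, 0.5 , 0.25],
              [0.25, 0.0 , 0.25, 0.5 ]])
I = np.eye(n)
b = np.array([1.0, -2.0, 3.0, 0.5])
alpha = np.array([0.5, 0.8, 1.0, 1.2])     # uncoordinated step-sizes, alpha_i < 2/L_i
Lam = np.diag(alpha)
c = 0.4                                    # I > c * Lam^{1/2}(I-W)Lam^{1/2} holds
M = np.linalg.pinv(I - W) / c - Lam        # positive semidefinite on range(I - W)

z, d = np.zeros(n), np.zeros(n)
q = []                                     # successive-difference quantity of Lemma 5
for _ in range(200):
    x = z                                  # prox step (12a) with r = 0
    g = x - b
    d_new = d + c * (I - W) @ (2 * x - z - Lam @ g - Lam @ d)   # (12b)
    z_new = x - Lam @ g - Lam @ d_new                            # (12c)
    q.append((z_new - z) @ np.diag(1 / alpha) @ (z_new - z)
             + (d_new - d) @ M @ (d_new - d))
    z, d = z_new, d_new
```

Because each d-update adds a vector in range(I − W), the M-weighted term is well defined, and the recorded sequence q should be nonnegative and nonincreasing, decaying well below the 1/k envelope of (33) on this strongly convex instance.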
Proof: Lemma 5 shows that {‖z^{k+1} − z^k‖²_{Λ^{-1}} + ‖d^{k+1} − d^k‖²_M}_{k≥0} is monotonically nonincreasing. Summing up (28) from 1 to k, we have

  Σ_{j=1}^k (‖z^j − z^{j+1}‖²_{Λ^{-1}} + ‖d^j − d^{j+1}‖²_M)
  ≤ (1/(1 − max_i α_iL_i/2)) (‖z^1 − z*‖²_{Λ^{-1}} + ‖d^1 − d*‖²_M − ‖z^{k+1} − z*‖²_{Λ^{-1}} − ‖d^{k+1} − d*‖²_M).

Therefore, we have

  ‖z^k − z^{k+1}‖²_{Λ^{-1}} + ‖d^k − d^{k+1}‖²_M
  ≤ (1/k) Σ_{j=1}^k (‖z^j − z^{j+1}‖²_{Λ^{-1}} + ‖d^j − d^{j+1}‖²_M)
  ≤ (1/(k(1 − max_i α_iL_i/2))) (‖z^1 − z*‖²_{Λ^{-1}} + ‖d^1 − d*‖²_M),

and [59, Lemma 1] gives us (33). When I ≻ cΛ^{1/2}(I − W)Λ^{1/2}, inequality (28) shows that the sequence (d^k, z^k) is bounded, so there exists a convergent subsequence (d^{k_i}, z^{k_i}) with (d^{k_i}, z^{k_i}) → (d̄, z̄). Then (33) gives the convergence of (d^{k_i+1}, z^{k_i+1}); more specifically, (d^{k_i+1}, z^{k_i+1}) → (d̄, z̄). Therefore, (d̄, z̄) is a fixed point of iteration (12). In addition, because d^k ∈ range(I − W) for all k, we have d̄ ∈ range(I − W). Finally, Lemma 4 implies the convergence of the whole sequence (d^k, z^k) to (d̄, z̄).
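As a quick numerical check of the eigenvalue bounds in Lemma 3, which underpin the norms ‖·‖_{A†} and ‖·‖_M used throughout this supplement, the sketch below builds a random rank-deficient positive semidefinite matrix and verifies the sandwich inequality on an element of its range; the instance is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 6, 3
B = rng.standard_normal((n, r))
A = B @ B.T                              # symmetric PSD with rank r (generically)
eigs = np.sort(np.linalg.eigvalsh(A))[::-1]
lam1, lamr = eigs[0], eigs[r - 1]        # largest and smallest nonzero eigenvalues

x = A @ rng.standard_normal(n)           # an element of range(A), x = A y
norm_sq = x @ np.linalg.pinv(A) @ x      # ||x||^2_{A^dagger} = <x, A^dagger x>

# Lemma 3: lam1^{-1} ||x||^2 <= ||x||^2_{A^dagger} <= lamr^{-1} ||x||^2
print(lam1**-1 * (x @ x), norm_sq, lamr**-1 * (x @ x))
```

The pseudoinverse discards the zero eigenvalues, so on range(A) the quantity ⟨x, A†x⟩ behaves exactly like a weighted norm with weights between λ_r^{-1} and λ_1^{-1}.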