Gradient tracking and variance reduction for decentralized optimization and machine learning


Authors: Ran Xin, Soummya Kar, Usman A. Khan

Abstract: Decentralized methods to solve finite-sum minimization problems are important in many signal processing and machine learning tasks where the data is distributed over a network of nodes and raw data sharing is not permitted due to privacy and/or resource constraints. In this article, we review decentralized stochastic first-order methods and provide a unified algorithmic framework that combines variance reduction with gradient tracking to achieve both robust performance and fast convergence. We provide explicit theoretical guarantees of the corresponding methods when the objective functions are smooth and strongly convex, and show their applicability to nonconvex problems via numerical experiments. Throughout the article, we provide intuitive illustrations of the main technical ideas by casting appropriate tradeoffs and comparisons among the methods of interest and by highlighting applications to decentralized training of machine learning models.

I. INTRODUCTION

In multi-agent networks and large-scale machine learning, when data is available at different devices with limited communication, it is often desirable to seek scalable learning methods that do not require bringing, storing, and processing data at one single location. In this article, we describe decentralized stochastic first-order methods, which are particularly favorable in such ad-hoc and resource-constrained settings. Specifically, we describe a unified algorithmic framework for combining different variance-reduction methods with gradient tracking in order to significantly improve upon the performance of the standard decentralized stochastic gradient descent (DSGD).
However, this improvement comes at the price of losing the simplicity of DSGD, and we study the added communication, computation, and storage requirements with the help of precise technical statements. For ease of accessibility, we restrict the theoretical arguments to smooth and strongly convex objectives, while the applicability to nonconvex problems is shown with the help of numerical experiments. We emphasize that smooth and strongly convex objectives are relevant in many machine learning applications, e.g., problems where a strongly convex regularization is added to otherwise convex costs, or problems where the objective functions are nonconvex but strongly convex in the neighborhood of the local minimizers [1]. To provide context, we start by briefly reviewing the problems of interest and their associated centralized solutions.

A. Empirical Risk Minimization

In parametric learning and inference problems, the goal of a typical machine learning system is to find a model g, parameterized by a real vector θ ∈ R^p, that maps an input data point x ∈ R^{d_x} to its corresponding output y ∈ R^{d_y}. The setup requires defining a loss function ℓ(g(θ; x, y)), which represents the loss incurred by the model g with parameter θ on the data (x, y). In the formulation of statistical machine learning, we assume that each data sample (x, y) follows a joint probability distribution P(x, y). Ideally, we would like to find the optimal model parameter θ̃* by minimizing the following risk (expected loss) function F̃(θ):

P0:  θ̃* = argmin_{θ∈R^p} F̃(θ),  F̃(θ) ≜ E_{(x,y)∼P(x,y)}[ℓ(g(θ; x, y))].

However, the true distribution P(x, y) is often hidden or intractable in practice. In supervised machine learning, one usually has access to a large set of training samples {x_i, y_i}_{i=1}^N, which can be considered as independent and identically distributed (i.i.d.)
realizations from the distribution P(x, y). The average of the losses incurred by the model θ on a finite set of training data samples {x_i, y_i}_{i=1}^N, known as the empirical risk, thus serves as an appropriate surrogate for the risk function F̃(θ). Formally, the empirical risk minimization problem is stated as

P1:  θ* = argmin_{θ∈R^p} F(θ),  F(θ) ≜ (1/N) Σ_{i=1}^N ℓ(g(θ; x_i, y_i)) ≜ (1/N) Σ_{i=1}^N f_i(θ),  (1)

where θ* is the minimizer of the empirical risk F. This finite-sum formulation captures a wide range of supervised learning problems. Examples include: handwritten character recognition with regularized logistic regression, where the objective functions are smooth and strongly convex [2]; text classification with support vector machines, where the objectives are convex but not necessarily smooth [1]; and perception tasks with deep neural networks, where the cost functions are nonconvex in general [1], [2]. Our focus in this article is on smooth and strongly convex objective functions, defined as follows. An L-smooth and μ-strongly convex function f : R^p → R is such that ∀ θ_1, θ_2 ∈ R^p and for some positive constants L, μ > 0, we have

(μ/2)‖θ_1 − θ_2‖² ≤ f(θ_2) − f(θ_1) − ∇f(θ_1)ᵀ(θ_2 − θ_1) ≤ (L/2)‖θ_1 − θ_2‖².

We define S_{μ,L} as the class of functions that are L-smooth and μ-strongly convex [3]. We note that if F ∈ S_{μ,L}, then it has a unique global minimizer, denoted as θ*. For any F ∈ S_{μ,L}, we have L ≥ μ, and we define κ ≜ L/μ as the condition number of F [3]; clearly, κ ≥ 1. For ease of accessibility, we restrict the theoretical arguments to the function class S_{μ,L}, while the applicability to nonconvex problems is shown with the help of numerical experiments.

B. Stochastic Gradient Descent

Stochastic gradient descent (SGD) is a simple yet powerful method that has been extensively used to solve the empirical risk minimization problem P1.
SGD, in its simplest form, starts with an arbitrary θ_0 ∈ R^p and performs the following iterations to learn θ* as k → ∞:

θ_{k+1} = θ_k − α_k ∇f_{s_k}(θ_k),  k ≥ 0,  (2)

where s_k is chosen uniformly at random from {1, ..., N} and {α_k}_{k≥0} is a sequence of positive step-sizes. Compared with the batch gradient method, where the descent direction ∇F(θ_k) at each iteration k is computed from the entire batch of data, SGD iteratively descends in the direction of the gradient of a randomly sampled component function. SGD is thus computationally efficient, as it evaluates one component gradient (extendable to more than one randomly selected function) at each iteration, and it is a popular alternative in problems with a large number of high-dimensional training data samples and model parameters. We note that the stochastic gradient ∇f_{s_k}(θ_k) is an unbiased estimate of the batch gradient ∇F(θ_k), i.e., E_{s_k}[∇f_{s_k}(θ_k) | θ_k] = ∇F(θ_k). Under the assumptions that each f_i ∈ S_{μ,L} and each stochastic gradient ∇f_{s_k}(θ_k) has bounded variance¹, i.e., E_{s_k}[‖∇f_{s_k}(θ_k) − ∇F(θ_k)‖² | θ_k] ≤ σ², ∀k, it can be shown that with a constant step-size α ∈ (0, 1/L], E[‖θ_k − θ*‖²] decays linearly (on the log scale), at the rate of (1 − μα)^k, to a neighborhood of θ*. Formally, we have [1]

E[‖θ_k − θ*‖²] ≤ (1 − μα)^k ‖θ_0 − θ*‖² + ασ²/μ,  ∀k ≥ 0.  (3)

This steady-state error ασ²/μ, i.e., the inexact convergence, is due to the fact that ∇f_{s_k}(θ*) ≠ 0 in general, while the step-size is constant. A diminishing step-size overcomes this issue and leads to exact convergence to the minimizer θ*, albeit at a slower rate. For example, with α_k = 1/(μ(k+1)), we have

E[‖θ_k − θ*‖²] ≤ max{2σ²/μ², ‖θ_0 − θ*‖²} / (k + 1),  (4)

for all k ≥ 0 [1].
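As a concrete illustration, the recursion in (2) can be sketched on a synthetic least-squares problem; the data, step-size, and iteration count below are illustrative choices, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 5
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=N)

# f_i(theta) = 0.5 * (x_i^T theta - y_i)^2, so F is smooth and strongly convex
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]   # minimizer of F

def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i]

theta = np.zeros(p)
alpha = 0.01    # constant step-size: linear decay to a noise ball, cf. (3)
for k in range(5000):
    s = rng.integers(N)                      # uniform sampling, as in (2)
    theta = theta - alpha * grad_i(theta, s)

# the iterate hovers in an O(alpha * sigma^2 / mu) neighborhood of theta_star
print(np.linalg.norm(theta - theta_star))
```

With a decaying schedule such as α_k = 1/(μ(k+1)) instead, the same loop converges exactly, at the O(1/k) rate in (4).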
In other words, to reach an ε-accurate solution of θ*, i.e., E[‖θ_k − θ*‖²] ≤ ε, SGD (with decaying step-sizes) requires O(1/ε) component gradient evaluations.

¹ In this article, we restrict ourselves to the bounded variance assumption for simplicity. This assumption can, however, be relaxed; see [1], [4], [5], for example.

C. Variance-Reduced Stochastic Gradient Descent

In practice, a successful implementation of SGD relies heavily on the tuning of the step-sizes, and typically a decaying step-size sequence {α_k}_{k≥0} has to be carefully chosen due to the potentially large variance in SGD, i.e., the sampled gradient ∇f_{s_k}(θ_k) at θ_k can be very far from the batch gradient ∇F(θ_k). In recent years, certain variance-reduction (VR) techniques have been developed to address this issue [6]–[9]. The key idea is to design an iterative estimator of the batch gradient whose variance progressively decays to zero as θ_k approaches θ*. Benefiting from this reduction in variance, VR methods have a low per-iteration computation cost, a key feature of SGD, and, at the same time, converge linearly to the minimizer θ*, as batch gradient descent does with a constant step-size for the objective function class S_{μ,L}. Different constructions of the aforementioned gradient estimator lead to different VR methods [6]–[9]. We focus on two popular VR methods in this article, described as follows.

SAGA [7]: The SAGA method starts with an arbitrary θ_0 ∈ R^p and maintains a table that stores all component gradients {∇f_i(θ̂_i)}_{i=1}^N, where θ̂_i denotes the most recent iterate at which ∇f_i was evaluated, initialized with {∇f_i(θ_0)}_{i=1}^N. At every iteration k ≥ 0, SAGA chooses an index s_k uniformly at random from {1, ..., N} and performs the following two updates:

g_k = ∇f_{s_k}(θ_k) − ∇f_{s_k}(θ̂_{s_k}) + (1/N) Σ_{i=1}^N ∇f_i(θ̂_i),
θ_{k+1} = θ_k − α g_k.
(5)

Subsequently, the entry ∇f_{s_k}(θ̂_{s_k}) in the gradient table is replaced by ∇f_{s_k}(θ_k), while the other entries remain unchanged. Under the assumption that each f_i ∈ S_{μ,L}, it can be shown that with α = 1/(3L), we have [7]

E[‖θ_k − θ*‖²] ≤ C (1 − min{1/(4N), 1/(3κ)})^k,  ∀k ≥ 0,  (6)

for some C > 0. In other words, SAGA achieves ε-accuracy of θ* with O(max{N, κ} log(1/ε)) component gradient evaluations, where we recall that κ = L/μ is the condition number of the global objective function F. Indeed, SAGA has a non-trivial storage cost of O(Np) due to the gradient table, which can be reduced to O(N) for certain problems of interest, for example, logistic regression and least squares, by exploiting the structure of the objective functions [6], [7].

SVRG [8]: Instead of storing the gradient table, SVRG achieves variance reduction by computing the batch gradient periodically and can be interpreted as a double-loop method, described as follows. The outer loop of SVRG, indexed by k, updates the estimate θ_k of θ*. At each outer iteration k, SVRG computes the batch gradient ∇F(θ_k) and executes a finite number T of SGD-type inner-loop iterations, indexed by t: with θ_0 = θ_k and for t = 0, ..., T − 1,

v_t = ∇f_{s_t}(θ_t) − ∇f_{s_t}(θ_0) + ∇F(θ_0),  θ_{t+1} = θ_t − α v_t,  (7)

where the index s_t is selected uniformly at random from {1, ..., N}. After the inner loop completes, θ_{k+1} can be updated in a few different ways; applicable choices include setting θ_{k+1} as θ_T, as (1/T) Σ_{t=0}^{T−1} θ_t, or choosing it randomly from the inner-loop updates {θ_t}_{t=0}^{T−1}. For instance, assuming that each f_i ∈ S_{μ,L}, it can be shown that with θ_{k+1} = (1/T) Σ_{t=0}^{T−1} θ_t, α = 1/(10L), and T = 50κ, we have [8]

E[‖θ_k − θ*‖²] ≤ D · 0.5^k,  ∀k ≥ 0,

for some D > 0. That is to say, SVRG achieves ε-accuracy with O(log(1/ε)) outer-loop iterations.
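As a concrete illustration of the SAGA update (5), the following sketch runs it on a synthetic least-squares problem with the step-size α = 1/(3L) from (6); the data and problem sizes are illustrative choices, not from the article:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 4
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + 0.05 * rng.normal(size=N)
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]   # minimizer of F

def grad_i(theta, i):
    # gradient of the component f_i(theta) = 0.5 * (x_i^T theta - y_i)^2
    return (X[i] @ theta - y[i]) * X[i]

L = np.max(np.linalg.norm(X, axis=1) ** 2)   # component smoothness bound
alpha = 1.0 / (3.0 * L)                      # the step-size used in (6)

theta = np.zeros(p)
table = np.array([grad_i(theta, i) for i in range(N)])   # the gradient table
table_avg = table.mean(axis=0)

for k in range(20000):
    s = rng.integers(N)
    g_new = grad_i(theta, s)
    g = g_new - table[s] + table_avg         # the SAGA estimator g_k in (5)
    table_avg += (g_new - table[s]) / N      # keep the table average current
    table[s] = g_new
    theta = theta - alpha * g

print(np.linalg.norm(theta - theta_star))    # decays linearly toward 0
```

Note that the running average of the table is updated in O(p) per iteration, so only the O(Np) table itself, not its average, dominates the storage cost.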
We further note that each outer-loop update requires N + 2T component gradient evaluations (7). Therefore, SVRG achieves ε-accuracy of θ* with O((N + κ) log(1/ε)) component gradient evaluations, which is comparable to the convergence rate of SAGA.

Remark 1 (SGD with decaying step-sizes vs. VR): SGD, converging at a sublinear rate O(1/k) to the minimizer (4), typically makes fast progress in its early stage for certain large-scale, complex machine learning tasks and then slows down considerably. Its complexity (4) is not explicitly dependent on the sample size N, which is a strong feature, but this comes at the price of a direct dependence on σ² (the variance of the stochastic gradient). On the other hand, the VR methods achieve fast linear convergence with the help of refined gradient estimators, for example, g_k or v_t, which approach the corresponding batch gradients as their variance diminishes. Their convergence, although dependent on the sample size N, is independent of σ².

Remark 2 (SAGA vs. SVRG): The fundamental trade-off between SAGA and SVRG is convergence speed versus storage, and it is often described as a trade-off between time and space [7]. Although SAGA and SVRG in theory achieve convergence rates of the same order, SVRG in practice requires 2-3 times more component gradient evaluations than SAGA to reach the same accuracy; however, it does so without storing all the component gradients [7].

In the rest of this article, we show how to cast SGD and VR methods in the decentralized optimization framework. Section: Problem Formulation describes the decentralized optimization problem over a network of nodes. In Section: Decentralized Stochastic Optimization, we extend centralized SGD to the decentralized problem and show that an appropriate decentralization is achieved with the help of gradient tracking.
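For completeness, the SVRG inner/outer structure in (7) can be sketched in the same way; the noiseless data and the inner-loop length T = 2N are illustrative choices (the theoretical setting quoted above uses T = 50κ), while the step-size follows the article's α = 1/(10L):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 4
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p)                  # noiseless targets, for simplicity
theta_star = np.linalg.lstsq(X, y, rcond=None)[0]

def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i]

def batch_grad(theta):
    return X.T @ (X @ theta - y) / N

L = np.max(np.linalg.norm(X, axis=1) ** 2)  # crude component smoothness bound
alpha, T = 1.0 / (10.0 * L), 2 * N          # T = 2N is an illustrative choice

theta_k = np.zeros(p)
for k in range(30):                         # outer loop
    mu_full = batch_grad(theta_k)           # one batch gradient per outer iteration
    theta_t = theta_k.copy()
    for t in range(T):                      # SGD-type inner loop, cf. (7)
        s = rng.integers(N)
        v = grad_i(theta_t, s) - grad_i(theta_k, s) + mu_full
        theta_t = theta_t - alpha * v
    theta_k = theta_t                       # update via the last inner iterate

print(np.linalg.norm(theta_k - theta_star))
```

The sketch uses the last inner iterate for θ_{k+1}; the averaged and randomized options mentioned above are one-line changes.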
Subsequently, in Section: Decentralized VR Methods, we describe recent advances in decentralized methods that combine gradient tracking and variance reduction. Section: Numerical Illustrations provides numerical experiments on strongly convex and nonconvex problems and further highlights different tradeoffs between the methods described in this article. Section: Extensions and Discussion summarizes certain extensions and communication/computation aspects of the corresponding problems that are popular in the literature. Finally, Section: Conclusions concludes the paper and briefly describes some open problems.

II. PROBLEM FORMULATION: DECENTRALIZED EMPIRICAL RISK MINIMIZATION

In this article, our focus is on solutions for optimization problems that arise in peer-to-peer decentralized networks. Unlike traditional master-worker architectures, where a central node acts as a master that coordinates communication with all workers, there is no central coordinator in peer-to-peer networks, and each node is only able to communicate with its immediate neighbors; see Fig. 1. The canonical form of decentralized optimization problems can be described as follows. Consider n nodes, such as machines, devices, or robots, that communicate over a static undirected graph G = (V, E), where V = {1, ..., n} is the set of nodes and E ⊆ V × V is the set of edges, i.e., a collection of ordered pairs (i, r), i, r ∈ V, such that nodes i and r can exchange information. Following the discussion in Section: Empirical Risk Minimization, each node i holds a local risk function f̃_i : R^p → R, not accessible by any other node in the network. The decentralized risk minimization problem can thus be defined as

P2:  θ̃* = argmin_{θ∈R^p} F̃(θ),  F̃(θ) ≜ (1/n) Σ_{i=1}^n f̃_i(θ).
As in the centralized case with Problem P0, the underlying data distributions at the nodes may not be available or tractable; we thus employ a local empirical risk function at each node as a surrogate of the local risk. Specifically, we consider each node i as a computing resource that stores/collects a local batch of m_i training samples that are possibly private (not shared with other nodes), and the corresponding local empirical risk function is decomposed over the local data samples as f_i ≜ (1/m_i) Σ_{j=1}^{m_i} f_{i,j}. The goal of the networked nodes is to agree on the optimal solution of the following decentralized empirical risk minimization problem:

P3:  θ* = argmin_{θ∈R^p} F(θ),  F(θ) = (1/n) Σ_{i=1}^n f_i(θ) ≜ (1/n) Σ_{i=1}^n [ (1/m_i) Σ_{j=1}^{m_i} f_{i,j}(θ) ].

The rest of this article is dedicated to the solutions of Problem P3.

Fig. 1. (Left) A master-worker network. (Right) Decentralized optimization in peer-to-peer networks.

III. DECENTRALIZED STOCHASTIC OPTIMIZATION

We now consider decentralized solutions of Problem P3. At each node i, given the current estimate θ_k^i of θ* at iteration k, related algorithms typically involve the following steps:
(1) Sample one or more component gradients from {∇f_{i,j}(θ_k^i)}_{j=1}^{m_i};
(2) Fuse information with the available neighbors;
(3) Compute θ_{k+1}^i according to a specific optimization protocol.
Recall that each node in the network only communicates with a few nearby nodes and only has partial knowledge of the global objective; see Fig. 1 (right). Due to this limitation, an information propagation mechanism is required that disseminates local information over the entire network. Decentralized optimization thus has two key components: (i) agreement (consensus): all nodes must agree on the same state, i.e., θ_k^i → θ_cons, ∀i; and (ii) optimality: the agreement should be on the minimizer of the global objective F, i.e., θ_cons = θ*.
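To make the decomposition in P3 concrete, the following sketch partitions a synthetic dataset across n nodes and checks that the average of the local batch gradients equals the global batch gradient; all sizes and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p = 4, 25, 3                 # n nodes, m_i = m samples per node
X = rng.normal(size=(n * m, p))
y = rng.normal(size=n * m)

# node i holds the private local batch (X_nodes[i], y_nodes[i])
X_nodes = X.reshape(n, m, p)
y_nodes = y.reshape(n, m)

def local_batch_grad(theta, i):
    Xi, yi = X_nodes[i], y_nodes[i]
    return Xi.T @ (Xi @ theta - yi) / m   # gradient of f_i = (1/m) sum_j f_{i,j}

theta = rng.normal(size=p)
global_grad = X.T @ (X @ theta - y) / (n * m)   # gradient of F in P3
avg_local = np.mean([local_batch_grad(theta, i) for i in range(n)], axis=0)

print(np.allclose(global_grad, avg_local))      # True: grad F = (1/n) sum_i grad f_i
```

The challenge addressed in the remainder of the article is that no single node can form this average directly; it must be recovered through neighbor-to-neighbor communication.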
Average-consensus algorithms [10] are information fusion protocols that enable each node to appropriately combine the vectors received from its neighbors and to agree on the average of the initial states of the nodes. They thus naturally serve as basic building blocks in decentralized optimization, added to which are local gradient corrections that steer the agreement to the global minimizer. To describe average-consensus, we first associate the undirected and connected graph G with a primitive, symmetric, and doubly stochastic n × n weight matrix W = {w_ir}, such that w_ir ≠ 0 for each (i, r) ∈ E. Clearly, we have W = Wᵀ and W 1_n = 1_n, where 1_n is the column vector of n ones. There are various ways of constructing such weights in a decentralized manner; popular choices include the Laplacian and Metropolis weights, see [11] for details. Average-consensus [10] is given as follows. Each node i starts with some vector θ_0^i ∈ R^p and updates its state according to θ_{k+1}^i = Σ_{r∈N_i} w_ir θ_k^r, ∀k ≥ 0. This update can be written in vector form as

θ_{k+1} = (W ⊗ I_p) θ_k,  (8)

where θ_k = [(θ_k^1)ᵀ, ..., (θ_k^n)ᵀ]ᵀ. Since W is primitive and doubly stochastic², the Perron-Frobenius theorem [12] gives lim_{k→∞} W^k = (1/n) 1_n 1_nᵀ, and thus

lim_{k→∞} θ_k = lim_{k→∞} (W ⊗ I_p)^k θ_0 = (1_n ⊗ I_p) θ̄_0,  where θ̄_0 ≜ (1/n)(1_nᵀ ⊗ I_p) θ_0,

at a linear rate of λ^k, where λ ∈ [0, 1) is the spectral radius of (W − (1/n) 1_n 1_nᵀ). That is to say, the protocol in (8) enables an agreement across the nodes on the average θ̄_0 of their initial states, at a linear rate. With the agreement protocol in place, we next introduce decentralized gradient descent and its stochastic variant, which build on top of average-consensus.

² In the rest of this article, W = {w_ir} denotes a collection of doubly stochastic weights and λ ∈ [0, 1) is the spectral radius of (W − (1/n) 1_n 1_nᵀ).

A. Decentralized Stochastic Gradient Descent (DSGD)

Recall that our focus is to solve Problem P3 in a decentralized manner, where the nodes exchange information over an arbitrary undirected graph. A well-known solution to this problem is decentralized gradient descent (DGD) [13], [14], described as follows. Each node i starts with an arbitrary θ_0^i ∈ R^p and performs the following update:

θ_{k+1}^i = Σ_{r∈N_i} w_ir θ_k^r − α_k ∇f_i(θ_k^i),  k ≥ 0.  (9)

Indeed, at each node i, DGD adds a local gradient correction, based on the local data batch (i.e., all the f_{i,j}'s), to average-consensus, and it is the prototype of many decentralized optimization protocols. To understand the iterations of DGD, we write them in vector form. Let θ_k and ∇f(θ_k) collect all local estimates and gradients, respectively, i.e., θ_k = [(θ_k^1)ᵀ, ..., (θ_k^n)ᵀ]ᵀ and ∇f(θ_k) ≜ [∇f_1(θ_k^1)ᵀ, ..., ∇f_n(θ_k^n)ᵀ]ᵀ, both in R^{np}. Then DGD can be compactly written as

θ_{k+1} = (W ⊗ I_p) θ_k − α_k ∇f(θ_k).  (10)

We further define the average θ̄_k ≜ (1/n)(1_nᵀ ⊗ I_p) θ_k of the local estimates at time k and multiply both sides of (10) by (1/n)(1_nᵀ ⊗ I_p) to obtain

θ̄_{k+1} = θ̄_k − α_k (1/n)(1_nᵀ ⊗ I_p) ∇f(θ_k).  (11)

Based on (10) and (11), we note that the consensus matrix W makes the estimates {θ_k^i}_{i=1}^n at the nodes approach their average θ̄_k, while the average gradient (1/n)(1_nᵀ ⊗ I_p) ∇f(θ_k) steers θ̄_k towards the minimizer θ* of F. The overall protocol thus ensures agreement and optimality, the two key components of decentralized optimization described before. DGD is a simple yet effective method for various decentralized learning tasks. To make DGD efficient for large-scale decentralized empirical risk minimization, where each m_i is very large, Refs.
[14], [15] derive a stochastic variant, known as decentralized stochastic gradient descent (DSGD), by substituting each local batch gradient with a randomly sampled component gradient. DSGD is formally described in Algorithm 1. Assuming that each f_{i,j} ∈ S_{μ,L} and each local stochastic gradient has bounded variance³, i.e., E_{s_k^i}[‖∇f_{i,s_k^i}(θ_k^i) − ∇f_i(θ_k^i)‖² | θ_k^i] ≤ σ², ∀i, k, we have [16]: under a constant step-size α_k = α ∈ (0, O((1−λ)/(Lκ))], ∀k, E[‖θ_k^i − θ*‖²] decays at a linear rate of (1 − O(μα))^k to a neighborhood of θ* such that

lim sup_{k→∞} (1/n) Σ_{i=1}^n E[‖θ_k^i − θ*‖²] = O( ασ²/(nμ) + α²κ²σ²/(1−λ) + α²κ²b/(1−λ)² ),  (12)

where b ≜ (1/n) Σ_{i=1}^n ‖∇f_i(θ*)‖² and κ = L/μ. With a diminishing step-size α_k = O(1/k), DSGD achieves exact convergence [17], [18], such that

(1/n) Σ_{i=1}^n E[‖θ_k^i − θ*‖²] = O(1/k),  ∀k ≥ 0.  (13)

³ The bounded variance assumption can also be relaxed, as noted in Footnote 1 for the centralized case; see [16].

Algorithm 1 DSGD: at each node i
Require: θ_0^i, {α_k}_{k≥0}, {w_ir}_{r∈N_i}.
1: for k = 0, 1, 2, ... do
2:   Choose s_k^i uniformly at random from {1, ..., m_i}
3:   Compute the local stochastic gradient ∇f_{i,s_k^i}(θ_k^i)
4:   Update: θ_{k+1}^i = Σ_{r∈N_i} w_ir θ_k^r − α_k ∇f_{i,s_k^i}(θ_k^i)
5: end for

Remark 3 (SGD vs. DSGD): Comparing (3) to (12), when a constant step-size α is used, the error in both SGD and DSGD decays linearly to a certain neighborhood (controlled by α) of θ*. Unlike SGD, however, the steady-state error of DSGD has an additional bias, independent of the variance σ² of the stochastic gradient, that comes from b = (1/n) Σ_{i=1}^n ‖∇f_i(θ*)‖². The constant b is not zero in general and characterizes the difference between the minimizer of each local objective f_i and that of the global objective F.
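Algorithm 1 can be sketched as follows on a ring of n nodes with Metropolis weights; the data, graph, step-size schedule, and iteration count are illustrative choices, not from the article:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, p = 5, 40, 3
X = rng.normal(size=(n, m, p))               # node i holds (X[i], y[i])
y = X @ rng.normal(size=p) + 0.05 * rng.normal(size=(n, m))

W = np.zeros((n, n))                         # Metropolis weights on a ring:
for i in range(n):                           # w_ir = 1/(1 + max(deg_i, deg_r)) = 1/3
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = W[i, i] = 1.0 / 3.0
assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)

theta_star = np.linalg.lstsq(X.reshape(-1, p), y.reshape(-1), rcond=None)[0]

theta = np.zeros((n, p))                     # one estimate per node
for k in range(4000):
    alpha = 1.0 / (0.05 * k + 20.0)          # diminishing step-size O(1/k)
    mixed = W @ theta                        # consensus (fusion) step
    s = rng.integers(m, size=n)              # one random sample per node
    for i in range(n):
        g = (X[i, s[i]] @ theta[i] - y[i, s[i]]) * X[i, s[i]]
        theta[i] = mixed[i] - alpha * g      # Algorithm 1, Step 4

print(np.mean(np.linalg.norm(theta - theta_star, axis=1)))
```

Because the nodes here share the same data distribution, the bias b is small; making the local batches heterogeneous (e.g., shifting each node's targets) makes the bias term in (12) visible under a constant step-size.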
The resulting bias O(α²κ²b/(1−λ)²) can be significantly large when the data distributions across the nodes are substantially heterogeneous or when the graph is not well-connected, a scenario that commonly arises in certain wireless networks and IoT applications; see Section: Numerical Illustrations. In the following, we describe a gradient tracking technique that eliminates the bias in DSGD due to the term b and thus can be considered a more appropriate decentralization of centralized SGD.

B. Decentralized First-Order Methods with Gradient Tracking

To present the intuition behind the gradient tracking technique, we first recall the iterations of the (non-stochastic) decentralized gradient descent (DGD) with a constant step-size in (9). Let us assume, for the sake of argument, that all nodes agree on the minimizer of F at some iteration k, i.e., θ_k^i = θ*, ∀i. Then at the next iteration k + 1, we have

θ_{k+1}^i = Σ_{r∈N_i} w_ir θ* − α ∇f_i(θ*) = θ* − α ∇f_i(θ*),  (14)

where θ* − α ∇f_i(θ*) ≠ θ* in general. In other words, the minimizer θ* is not necessarily a fixed point of (9). Of course, using the gradient ∇F(θ_k^i) of the global objective, instead of ∇f_i(θ_k^i), overcomes this issue, but the global gradient is not available at any node. The natural yet innovative idea of gradient tracking is to design a local iterative gradient tracker d_k^i that asymptotically approaches the global gradient ∇F(θ_k^i) as θ_k^i approaches θ* [19]–[23]. Gradient tracking is implemented with the help of dynamic average consensus (DAC) [24], briefly described next. In contrast to classical average-consensus [10], which learns the average of fixed initial states, DAC [24] tracks the average of time-varying signals. Formally, each node i measures a time-varying signal r_k^i, and all nodes cooperate to track the average r̄_k ≜ (1/n) Σ_{i=1}^n r_k^i of these signals. The DAC protocol is given as follows.
Each node i iteratively updates its estimate d_k^i of r̄_k as

d_{k+1}^i = Σ_{r∈N_i} w_ir d_k^r + r_{k+1}^i − r_k^i,  k ≥ 0,  (15)

where d_0^i = r_0^i, ∀i. For a doubly stochastic weight matrix W = {w_ir}, it is shown in [24] that if ‖r_{k+1}^i − r_k^i‖ → 0, then ‖d_k^i − r̄_k‖ → 0. Clearly, in the aforementioned design of gradient tracking, the time-varying signal that we intend to track is the average of the local gradients (1/n) Σ_{i=1}^n ∇f_i(θ_k^i). We thus combine DGD (9) and DAC (15) to obtain GT-DGD (DGD with gradient tracking) [19]–[23], as follows:

θ_{k+1}^i = Σ_{r∈N_i} w_ir θ_k^r − α d_k^i,  (16a)
d_{k+1}^i = Σ_{r∈N_i} w_ir d_k^r + ∇f_i(θ_{k+1}^i) − ∇f_i(θ_k^i),  (16b)

where d_0^i = ∇f_i(θ_0^i), ∀i. Intuitively, as θ_k^i → θ̄_k and d_k^i → (1/n) Σ_{i=1}^n ∇f_i(θ_k^i) → ∇F(θ̄_k), (16a) asymptotically becomes the centralized batch gradient descent. It has been shown in [21]–[23], [25] that GT-DGD converges linearly to the minimizer θ* of F under a constant step-size when each f_{i,j} ∈ S_{μ,L}, unlike DGD, which converges sublinearly to θ* with decaying step-sizes. The stochastic variant of GT-DGD is derived in [26], termed GT-DSGD (DSGD with gradient tracking), and is formally described in Algorithm 2. Under the same assumptions of smoothness, strong convexity, and bounded variance as in DSGD, the convergence of GT-DSGD is summarized as follows [26]: with a constant step-size α_k = α ∈ (0, O((1−λ)²/(Lκ))], ∀k, E[‖θ_k^i − θ*‖²] decays linearly at the rate of (1 − O(μα))^k to a neighborhood of θ* such that

lim sup_{k→∞} (1/n) Σ_{i=1}^n E[‖θ_k^i − θ*‖²] = O( ασ²/(nμ) + α²σ²κ²/(1−λ)³ ).  (17)

Note that GT-DSGD, in contrast to GT-DGD, loses the exact linear convergence to the minimizer because the gradients are now stochastic.
Exact convergence can be recovered, albeit at a slower sublinear rate; i.e., with a diminishing step-size α_k = O(1/k), we have [26]

(1/n) Σ_{i=1}^n E[‖θ_k^i − θ*‖²] = O(1/k),  ∀k ≥ 0.  (18)

Algorithm 2 GT-DSGD: at each node i
Require: θ_0^i, {α_k}_{k≥0}, {w_ir}_{r∈N_i}, d_0^i = ∇f_{i,s_0^i}(θ_0^i), where s_0^i is chosen uniformly at random from {1, ..., m_i}
1: for k = 0, 1, 2, ... do
2:   Update: θ_{k+1}^i = Σ_{r∈N_i} w_ir θ_k^r − α_k d_k^i
3:   Choose s_{k+1}^i uniformly at random from {1, ..., m_i}
4:   Compute the local stochastic gradient ∇f_{i,s_{k+1}^i}(θ_{k+1}^i)
5:   Update: d_{k+1}^i = Σ_{r∈N_i} w_ir d_k^r + ∇f_{i,s_{k+1}^i}(θ_{k+1}^i) − ∇f_{i,s_k^i}(θ_k^i)
6: end for

Remark 4 (DSGD vs. GT-DSGD): By comparing DSGD (12) and GT-DSGD (17), we note that under a constant step-size, GT-DSGD removes the bias O(α²κ²b/(1−λ)²) that comes from b ≜ (1/n) Σ_{i=1}^n ‖∇f_i(θ*)‖² in DSGD. However, the network dependence of GT-DSGD, O(1/(1−λ)³), is worse than that of DSGD, where it is O(1/(1−λ)²). A tradeoff is thus imminent, and the two approaches have their own merits depending on the relative sizes of b and λ. Clearly, when the bias b dominates, e.g., when the data across the nodes is largely heterogeneous, GT-DSGD achieves a lower steady-state error than DSGD. Under diminishing step-sizes, DSGD and GT-DSGD have comparable performance. Of relevance here are EXTRA [27] and Exact Diffusion [28], both of which eliminate the bias caused by b and are built on a principle different from gradient tracking.

Remark 5 (SGD vs. GT-DSGD): Note that with constant step-sizes, the performance of SGD in (3) and GT-DSGD in (17) is comparable. In particular, both methods converge linearly to a steady-state error that is controlled by the step-size α and the variance σ² of the stochastic gradient; see Remark 3.
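Algorithm 2 can be sketched by extending a DSGD-style loop with the tracker update of Step 5; the heterogeneous data, ring graph, and step-size schedule below are illustrative choices, not from the article:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, p = 5, 40, 3
X = rng.normal(size=(n, m, p))
# heterogeneous data: each node's targets are shifted, so the DSGD bias b > 0
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=(n, 1))

W = np.zeros((n, n))                      # Metropolis weights on a ring
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = W[i, i] = 1.0 / 3.0

theta_star = np.linalg.lstsq(X.reshape(-1, p), y.reshape(-1), rcond=None)[0]

def sgrad(i, s, th):                      # stochastic gradient at node i, sample s
    return (X[i, s] @ th - y[i, s]) * X[i, s]

theta = np.zeros((n, p))
s = rng.integers(m, size=n)
g_old = np.stack([sgrad(i, s[i], theta[i]) for i in range(n)])
d = g_old.copy()                          # d_0^i = grad f_{i,s_0^i}(theta_0^i)
for k in range(10000):
    alpha = 1.0 / (0.05 * k + 20.0)       # diminishing step-size O(1/k)
    theta = W @ theta - alpha * d         # Step 2: consensus + tracked direction
    s = rng.integers(m, size=n)
    g_new = np.stack([sgrad(i, s[i], theta[i]) for i in range(n)])
    d = W @ d + g_new - g_old             # Step 5: dynamic average consensus
    g_old = g_new

print(np.mean(np.linalg.norm(theta - theta_star, axis=1)))
```

Despite the heterogeneous local batches, the tracked direction d_k^i follows the average of the local gradients, so the iterates approach the global minimizer rather than a biased point.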
Since GT-DSGD removes the bias in DSGD that comes from the difference of the local and global objectives (see b in (12)), it may be considered a more appropriate decentralization of SGD. This argument naturally leads to the idea that one can incorporate the centralized variance reduction (VR) techniques in GT-DSGD to further improve the performance and achieve faster convergence. As we show in the following, adding variance reduction to GT-DSGD in fact leads to exact linear convergence with a constant step-size and further improves the network dependence to O(1/(1−λ)²).

Remark 6 (DSGD + VR): We emphasize that adding VR to DSGD alone does not enable exact linear convergence. Following Remark 1, VR removes the steady-state error caused by the variance of the stochastic gradient. However, in a decentralized setting, the heterogeneity across the local data batches is not accounted for unless gradient tracking is employed. This difference between the local batches across the nodes is captured by the aforementioned bias b in (12) and is removed by gradient tracking, which estimates the average of the local gradients across the nodes.

IV. DECENTRALIZED VARIANCE-REDUCED METHODS WITH GRADIENT TRACKING

We now provide a unified algorithmic framework, GT-VR, that provably improves upon DSGD and follows from Remarks 5 and 6. This framework combines variance reduction with GT-DSGD to achieve both robust performance and fast convergence. First, recall from Section: Variance-Reduced Stochastic Gradient Descent that VR methods iteratively estimate the batch gradient from randomly drawn samples. In the decentralized case, each node i thus implements VR locally to estimate its local batch gradient ∇f_i. Gradient tracking, on the other hand, estimates the average of the local VR estimators across the nodes and can be thought of as fusion in space.
Consequently, VR and gradient tracking jointly learn the global batch gradient ∇F at each node asymptotically. For definiteness, we present and analyze two instances of GT-VR, namely GT-SAGA and GT-SVRG, and show that they achieve exact linear convergence with constant step-sizes for the class of smooth and strongly convex functions. We further show that in a "big-data" regime, both GT-SAGA and GT-SVRG effectively act as means for parallel computation and achieve a linear speed-up compared with their centralized counterparts.

A. GT-SAGA

To implement the SAGA estimators locally, each node i maintains a gradient table that stores all local component gradients {∇f_{i,j}(θ̂_{i,j})}_{j=1}^{m_i}, where θ̂_{i,j} represents the most recent iterate at which the gradient of f_{i,j} was evaluated. At iteration k ≥ 0, each node i chooses an index s_k^i uniformly at random from {1, ..., m_i} and computes the local SAGA gradient g_k^i as

g_k^i = ∇f_{i,s_k^i}(θ_k^i) − ∇f_{i,s_k^i}(θ̂_{i,s_k^i}) + (1/m_i) Σ_{j=1}^{m_i} ∇f_{i,j}(θ̂_{i,j}),  (19)

where it can be shown that g_k^i is an unbiased estimator of the local batch gradient ∇f_i(θ_k^i). Next, the element ∇f_{i,s_k^i}(θ̂_{i,s_k^i}) in the gradient table is replaced by ∇f_{i,s_k^i}(θ_k^i), while the other elements remain unchanged. The gradient tracking iteration d_k^i is then implemented on the estimators g_k^i. The complete implementation of GT-SAGA [29] is summarized in Algorithm 3. Similar to centralized SAGA [7], GT-SAGA converges linearly to θ* with a constant step-size. More precisely, assuming each f_{i,j} ∈ S_{μ,L} and choosing

α = min{ O(1/(μM)), O((m/M)·(1−λ)²/(Lκ)) },

Algorithm 3 GT-SAGA: at each node i
Require: θ_0^i, α, {w_ir}_{r∈N_i}, d_0^i = g_0^i = ∇f_i(θ_0^i), gradient table {∇f_{i,j}(θ̂_{i,j})}_{j=1}^{m_i} with θ̂_{i,j} = θ_0^i, ∀j.
1: for $k = 0, 1, 2, \cdots$ do
2:  Update $\theta_{k+1}^i = \sum_{r\in\mathcal{N}_i} w_{ir}\,\theta_k^r - \alpha\, d_k^i$;
3:  Choose $s_{k+1}^i$ uniformly at random from $\{1,\cdots,m_i\}$;
4:  Compute $g_{k+1}^i = \nabla f_{i,s_{k+1}^i}\big(\theta_{k+1}^i\big) - \nabla f_{i,s_{k+1}^i}\big(\widehat\theta_{i,s_{k+1}^i}\big) + \frac{1}{m_i}\sum_{j=1}^{m_i}\nabla f_{i,j}\big(\widehat\theta_{i,j}\big)$;
5:  Replace $\nabla f_{i,s_{k+1}^i}\big(\widehat\theta_{i,s_{k+1}^i}\big)$ by $\nabla f_{i,s_{k+1}^i}\big(\theta_{k+1}^i\big)$ in the gradient table;
6:  Update $d_{k+1}^i = \sum_{r\in\mathcal{N}_i} w_{ir}\, d_k^r + g_{k+1}^i - g_k^i$;
7: end for

where $m = \min_i\{m_i\}$ and $M = \max_i\{m_i\}$, we have [29]
$$\frac{1}{n}\sum_{i=1}^n \mathbb{E}\Big[\big\|\theta_k^i - \theta^*\big\|_2^2\Big] \leq R\left(1 - \min\left\{ O\Big(\frac{1}{M}\Big),\, O\Big(\frac{m}{M}\frac{(1-\lambda)^2}{\kappa^2}\Big)\right\}\right)^{k}, \quad \forall k \geq 0, \qquad (20)$$
for some $R > 0$. In other words, GT-SAGA achieves $\epsilon$-accuracy of $\theta^*$ in $O\big(\max\big\{M, \frac{M}{m}\frac{\kappa^2}{(1-\lambda)^2}\big\}\log\frac{1}{\epsilon}\big)$ parallel local component gradient computations. We emphasize that GT-SAGA, unlike the stochastic algorithms (DSGD and GT-DSGD) discussed before, exhibits linear convergence to the global minimizer $\theta^*$ of $F$. This exact linear convergence is a consequence of both variance reduction and gradient tracking; see Remarks 7, 8, 9, and 10 for additional comments.

B. GT-SVRG

GT-SVRG, formally described in Algorithm 4, is a double-loop method that builds upon centralized SVRG; the outer loop index is $k$ and the inner loop index is $t$. At every outer iteration, each node $i$ computes a local batch gradient and then performs a finite number $T$ of inner iterations; in the inner loop, each node $i$ performs GT-DSGD-type iterations while updating the local gradient estimate $v_t^i$ (Algorithm 4, Step 7). It can be verified that $v_t^i$ is an unbiased estimator of the corresponding local batch gradient at node $i$. In practice, options (a)-(c) all work similarly well. For example, under option (a), it is shown in [29] that with $\alpha = O\big(\frac{(1-\lambda)^2}{L\kappa}\big)$ and $T = O\big(\frac{\kappa^2\log\kappa}{(1-\lambda)^2}\big)$, the outer loop of GT-SVRG follows
$$\frac{1}{n}\sum_{i=1}^n \mathbb{E}\Big[\big\|\theta_k^i - \theta^*\big\|_2^2\Big] \leq U \cdot 0.9^{k}, \qquad (21)$$
for some $U > 0$.
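The unbiasedness of the SVRG-type estimator above is easy to verify numerically: averaging the estimator over all possible sample indices recovers the local batch gradient exactly. A minimal check, using toy least-squares components and illustrative names:

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 8, 3
A = rng.normal(size=(m, p))
b = rng.normal(size=m)

# component gradients of f_ij(x) = 0.5 * (a_j' x - b_j)^2 at one node
grad_j = lambda th, j: A[j] * (A[j] @ th - b[j])
batch_grad = lambda th: np.mean([grad_j(th, j) for j in range(m)], axis=0)

theta0 = rng.normal(size=p)            # outer-loop snapshot
mu = batch_grad(theta0)                # local batch gradient at the snapshot
theta = theta0 + 0.1 * rng.normal(size=p)

# E_s[v] over a uniform sample index s equals the batch gradient at theta
v_mean = np.mean([grad_j(theta, s) - grad_j(theta0, s) + mu for s in range(m)],
                 axis=0)
assert np.allclose(v_mean, batch_grad(theta))   # unbiasedness of the estimator
```

The two snapshot terms cancel in expectation, leaving exactly the fresh batch gradient; the same cancellation underlies the variance reduction of the inner loop.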
The rate (21) implies that GT-SVRG achieves $\epsilon$-accuracy of $\theta^*$ in $O\big(\log\frac{1}{\epsilon}\big)$ outer-loop iterations. We further note that each outer-loop update requires each node $i$ to compute $m_i + 2T$ local component gradients. GT-SVRG thus achieves $\epsilon$-accuracy of $\theta^*$ in a total of $O\big(\big(M + \frac{\kappa^2\log\kappa}{(1-\lambda)^2}\big)\log\frac{1}{\epsilon}\big)$ parallel local component gradient computations.

Algorithm 4 GT-SVRG at each node $i$
Require: $\theta_0^i$, $\alpha$, $\{w_{ir}\}_{r\in\mathcal{N}_i}$, $d_0^i = v_0^i = \nabla f_i(\theta_0^i)$.
1: for $k = 0, 1, 2, \cdots$ do
2:  Initialize $\theta_0^i = \theta_k^i$;
3:  Compute $\nabla f_i(\theta_0^i) = \frac{1}{m_i}\sum_{j=1}^{m_i}\nabla f_{i,j}(\theta_0^i)$;
4:  for $t = 0, 1, 2, \cdots, T-1$ do
5:   Update $\theta_{t+1}^i = \sum_{r\in\mathcal{N}_i} w_{ir}\,\theta_t^r - \alpha\, d_t^i$;
6:   Choose $s_{t+1}^i$ uniformly at random from $\{1,\cdots,m_i\}$;
7:   Compute $v_{t+1}^i = \nabla f_{i,s_{t+1}^i}\big(\theta_{t+1}^i\big) - \nabla f_{i,s_{t+1}^i}\big(\theta_0^i\big) + \nabla f_i(\theta_0^i)$;
8:   Update $d_{t+1}^i = \sum_{r\in\mathcal{N}_i} w_{ir}\, d_t^r + v_{t+1}^i - v_t^i$;
9:  end for
10: Set $d_0^i = d_T^i$ and $v_0^i = v_T^i$;
11: Option (a): Set $\theta_{k+1}^i = \theta_T^i$;
12: Option (b): Set $\theta_{k+1}^i = \frac{1}{T}\sum_{t=0}^{T-1}\theta_t^i$;
13: Option (c): Set $\theta_{k+1}^i$ as a uniformly random selection from $\{\theta_t^i\}_{t=0}^{T-1}$;
14: end for

Remark 7 (GT-SAGA vs. GT-SVRG: Linear speedup): Both GT-SAGA and GT-SVRG have a low per-iteration computation cost and converge linearly to $\theta^*$, i.e., they reach $\epsilon$-accuracy of $\theta^*$ in $O\big(\max\big\{M, \frac{M}{m}\frac{\kappa^2}{(1-\lambda)^2}\big\}\log\frac{1}{\epsilon}\big)$ and $O\big(\big(M + \frac{\kappa^2\log\kappa}{(1-\lambda)^2}\big)\log\frac{1}{\epsilon}\big)$ parallel local component gradient computations, respectively. Interestingly, when the data sets at the nodes are large and balanced such that $M \approx m \gg \frac{\kappa^2}{(1-\lambda)^2}$, the complexities of GT-SAGA and GT-SVRG become $O(M\log\frac{1}{\epsilon})$, independent of the network, which is $n$ times faster than that of centralized SAGA and SVRG. Clearly, in this "big-data" regime, GT-SAGA and GT-SVRG each acts effectively as a means for parallel computation and achieves a linear speed-up compared with its centralized counterpart.

Remark 8 (GT-SAGA vs.
GT-SVRG: Unbalanced data): It can also be observed that when data samples are distributed over the network in an unbalanced way, i.e., when $\frac{M}{m}$ is large, GT-SVRG may achieve a lower complexity than GT-SAGA in terms of the number of component gradient evaluations. However, from a practical implementation standpoint, an unbalanced data distribution may lead to a longer wall-clock time in GT-SVRG. This is because the next inner loop cannot be executed until all nodes finish their local batch gradient computations; nodes with a large amount of data take longer to finish this computation, leading to an overall increase in runtime. Clearly, there is an inherent trade-off between network synchrony, latency, and the storage of gradients as far as the relative implementation complexities of GT-SAGA and GT-SVRG are concerned. If each node is capable of storing all local component gradients, then GT-SAGA is preferable due to its flexibility of implementation and faster convergence in practice. On the other hand, for large-scale optimization problems where each node holds a very large number of data samples, storing all component gradients may be infeasible, and GT-SVRG may therefore be preferred.

Remark 9 (Related work on decentralized VR methods): Existing decentralized VR methods include DSA [30], which combines EXTRA [27] with SAGA [7]; diffusion-AVRG, which combines exact diffusion [28] with AVRG [31]; DSBA [32], which adds a proximal mapping [33] to each iteration of DSA; ADFS [34], which applies an accelerated randomized proximal coordinate gradient method [35] to the dual formulation of Problem P3; and Network-SVRG/SARAH [36], which implements variance reduction in the decentralized DANE framework based on gradient tracking. We note that in large-scale scenarios where $M \approx m$ is very large, both GT-SAGA and GT-SVRG improve upon the convergence rates of these methods in terms of the joint dependence on $\kappa$ and $M \approx m$, with the exception of DSBA and ADFS.
Both DSBA and ADFS achieve a better iteration complexity, however, at the expense of computing the proximal mapping of a component function at each iteration. Although this proximal computation is efficient for certain function classes, it can be very expensive for general functions.

Remark 10 (Communication complexity): We now compare the communication complexities of the decentralized algorithms discussed in this article. Since the node deployment is not necessarily deterministic, we provide the expected number of communication rounds per node required to achieve an $\epsilon$-accurate solution (each communication is over a $p$-dimensional vector). Note that DSGD, GT-DSGD, and GT-SAGA all incur $O(d_{\exp})$ expected communication rounds per node at each iteration, where $d_{\exp}$ is the expected degree of the (possibly random) communication graph $\mathcal{G}$. Thus, their expected communication complexity is their iteration complexity scaled by $d_{\exp}$ and is given by $O\big(d_{\exp}\frac{1}{\epsilon}\big)$, $O\big(d_{\exp}\frac{1}{\epsilon}\big)$, and $O\big(\max\big\{M,\frac{M}{m}\frac{\kappa^2}{(1-\lambda)^2}\big\}\,d_{\exp}\log\frac{1}{\epsilon}\big)$, respectively. For GT-SVRG, a total of $O\big(\log\frac{1}{\epsilon}\big)$ outer-loop iterations are required, where each corresponding inner loop incurs $O(T\,d_{\exp}) = O\big(\frac{\kappa^2\log\kappa}{(1-\lambda)^2}\,d_{\exp}\big)$ rounds of communication, resulting in a total communication complexity of $O\big(\frac{\kappa^2\log\kappa}{(1-\lambda)^2}\,d_{\exp}\log\frac{1}{\epsilon}\big)$. Clearly, GT-SAGA and GT-SVRG, due to their fast linear convergence, improve upon the communication complexities of DSGD and GT-DSGD. It is further interesting to observe that in the big-data regime, where each node has a large number of data samples, GT-SVRG achieves a lower communication complexity than GT-SAGA. Finally, we note that all gradient-tracking-based algorithms require two consecutive rounds of communication with neighboring nodes per stochastic gradient evaluation, to update the estimate $\theta_k^i$ and the gradient tracker $d_k^i$, respectively.
This may increase the communication burden on the network, especially when $\theta_k^i$ is high-dimensional. We note that, for the sake of completeness, we include $d_{\exp}$ in the communication complexities; it is a function of the underlying graph $\mathcal{G}$. In particular, $d_{\exp} = O(1)$ for random geometric graphs (assuming a constant density of node deployment) and $d_{\exp} = O(\log n)$ for exponential graphs; see also Section: Numerical Illustrations on these graphs.

V. NUMERICAL ILLUSTRATIONS

In this section, we present numerical experiments to illustrate the convergence properties of the decentralized stochastic optimization algorithms discussed in this article, i.e., DSGD, GT-DSGD, GT-SAGA, and GT-SVRG. We show experimental results on two different types of graphs, shown in Fig. 2: (i) an exponential graph with $n = 16$ nodes, modeling a highly-structured training environment with a large number of data samples per node; and (ii) a random geometric graph with $n = 1{,}000$ nodes, modeling a large-scale, ad-hoc training scenario. Their associated doubly-stochastic weight matrices $W$ are generated by the Metropolis method, with the second largest eigenvalue $\lambda$ equal to $0.75$ in the former and $0.9994$ in the latter. The decentralized training problem we consider is the classification of hand-written digits from the MNIST dataset [37], using logistic regression (strongly-convex) and a two-layer neural network (non-convex).

Fig. 2. (Left) An exponential graph with 16 nodes. (Right) A random geometric graph with 1,000 nodes.

A. Logistic Regression: Strongly-convex

We first compare the algorithms of interest in the context of training a regularized logistic regression model [2], which is smooth and strongly-convex, to classify the two digits $\{3, 8\}$. We use a total of $N = 12{,}000$ images for training and $1{,}966$ images for testing.
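For reference, the Metropolis rule used to generate $W$ assigns $w_{ir} = 1/(1 + \max\{\deg_i, \deg_r\})$ to each edge and places the leftover mass on the diagonal, which yields a symmetric, doubly-stochastic matrix for any connected undirected graph. A minimal sketch on a toy 4-node ring (`metropolis_weights` is a hypothetical helper name):

```python
import numpy as np

def metropolis_weights(adj):
    """Doubly-stochastic Metropolis weights from a symmetric 0/1 adjacency
    matrix without self-loops: w_ij = 1 / (1 + max(deg_i, deg_j)) on edges,
    with the leftover mass placed on the diagonal."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

adj = np.array([[0, 1, 0, 1],          # 4-node ring graph
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
W = metropolis_weights(adj)
assert np.allclose(W.sum(axis=1), 1.0) # row-stochastic
assert np.allclose(W, W.T)             # symmetric, hence also column-stochastic
lam = sorted(abs(np.linalg.eigvalsh(W)))[-2]
assert lam < 1.0                       # second largest eigenvalue magnitude
```

The quantity `lam` computed this way is the network parameter $\lambda$ that enters the convergence rates and step-size rules discussed above.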
Each node $i$ holds $m_i$ training samples $\{x_{i,j}, y_{i,j}\}_{j=1}^{m_i} \subseteq \mathbb{R}^{784}\times\{-1,+1\}$, where $x_{i,j}$ is the feature vector (image) and $y_{i,j}$ is the corresponding binary label. The nodes cooperate to solve the following problem:
$$\min_{b\in\mathbb{R}^{784},\,c\in\mathbb{R}}\; F(b,c) = \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i}\sum_{j=1}^{m_i} \ln\Big[1 + \exp\big\{-(b^\top x_{i,j} + c)\,y_{i,j}\big\}\Big] + \frac{\lambda}{2}\|b\|_2^2,$$
where $\theta = [b^\top\; c]^\top$, the regularization parameter is $\lambda = 1/N$, and the features are normalized to unit vectors [6], [38]. We plot the optimality gap, i.e., $F(\theta_k) - F(\theta^*)$, versus the number of parallel component gradient evaluations, and compare the algorithms in both balanced and unbalanced data distribution scenarios; recall Remarks 7 and 8. The step-size for all algorithms is constant and chosen as $1/L$, while the inner-loop length $T$ of GT-SVRG is $N/n$ in the balanced case and $4N/n$ in the unbalanced case.

Balanced Data: To model a stable training environment with a balanced data distribution, e.g., in data centers or computing clusters, we choose a highly structured, well-connected exponential graph with $n = 16$ nodes, resulting in a relatively large number of samples ($m_i = 750$) per node. Each node has approximately the same number of images in each class, i.e., the data distribution is balanced and homogeneous, leading to similar local cost functions among the nodes; the bias term $b$ in DSGD is therefore relatively small. From Remarks 3 and 4, recall that when $b$ is small and the graph is well-connected, DSGD and GT-DSGD exhibit similar performance, which is also verified numerically in Fig. 3. Adding variance reduction to GT-DSGD, however, significantly improves the performance in terms of both the optimality gap and the test accuracy, leading to linear convergence of both GT-SAGA and GT-SVRG to the exact solution.
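For concreteness, the per-node loss and gradient of this regularized logistic model can be sketched as follows. This is a toy finite-difference check on random data; `local_loss_and_grad` is a hypothetical helper name, not code from the experiments.

```python
import numpy as np

def local_loss_and_grad(theta, X, y, lam):
    """Local objective at one node: average logistic loss over (X, y) with
    labels in {-1, +1}, plus (lam/2)||b||^2 on the linear part b only."""
    b, c = theta[:-1], theta[-1]
    z = -y * (X @ b + c)                        # negative margins
    loss = np.mean(np.log1p(np.exp(z))) + 0.5 * lam * (b @ b)
    s = -y / (1.0 + np.exp(-z))                 # per-sample derivative w.r.t. margin
    return loss, np.append(X.T @ s / len(y) + lam * b, s.mean())

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm features, as in the text
y = np.array([1., -1., 1., 1., -1., -1.])
theta = rng.normal(size=5)

loss, grad = local_loss_and_grad(theta, X, y, lam=0.1)
e = np.eye(5) * 1e-6                            # forward-difference gradient check
num = np.array([(local_loss_and_grad(theta + e[k], X, y, 0.1)[0] - loss) / 1e-6
                for k in range(5)])
assert np.allclose(num, grad, atol=1e-4)
```

Note that, as in the objective above, only the linear part $b$ is regularized and the intercept $c$ is left free.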
Fig. 3. Decentralized logistic regression with balanced data over the 16-node exponential graph, where each epoch represents $N/n = 750$ component gradient evaluations at each node. (Left) Optimality gap vs. epochs; (right) test accuracy vs. epochs, for DSGD, GT-DSGD, GT-SAGA, and GT-SVRG.

Unbalanced Data: We next compare the algorithms when the data distribution is unbalanced and the nodes interact over a random geometric graph with $n = 1{,}000$ nodes, modeling a large-scale wireless communication network. In this case, the $N = 12{,}000$ training images are randomly distributed among the nodes; see Fig. 4 (right) for the number of training samples at each node. We further restrict the training data at each node to a single class, either $3$ or $8$. This leads to unbalanced data sizes and heterogeneous data distributions across the nodes, making the local functions significantly different from each other; the bias $b$ in DSGD is thus relatively large. The performance comparison is shown in Fig. 4 (left), where it can be observed that DSGD degrades considerably in this case, and the addition of gradient tracking results in a smaller steady-state error (Remark 4). Adding variance reduction, as before, leads to linear convergence to the exact solution.

Fig. 4. Decentralized logistic regression with unbalanced data over a 1,000-node random geometric graph, where each epoch represents $N/n = 12$ component gradient evaluations at each node. (Left) Optimality gap vs. epochs; (right) number of samples at each node.

Discussion: In both the balanced and unbalanced data scenarios, the performance improvement due to gradient tracking comes at the price of one additional round of communication per iteration; see also Remark 10.
The addition of variance reduction in GT-SAGA and GT-SVRG significantly outperforms both DSGD and GT-DSGD. Their linear convergence, however, comes at the price of additional storage in GT-SAGA and a synchronization overhead in GT-SVRG. From Remark 7, we recall that when each node has roughly the same number of training samples, GT-SAGA converges faster than GT-SVRG in terms of the number of parallel component gradient computations required, as can be observed in Fig. 3. On the other hand, as discussed in Remark 8, the iteration complexity of GT-SVRG is more robust to unbalanced data, as it is independent of the $M/m$ factor that appears in GT-SAGA; this is shown in Fig. 4, where GT-SAGA and GT-SVRG exhibit similar convergence. However, GT-SVRG may incur additional latency and synchronization overhead when the data is unbalanced, due to the different computing times of the local batch gradient evaluations across the network before the execution of each inner loop.

B. Neural Network: Non-convex

We now compare the performance of the algorithms when training a neural network with a non-convex loss function. The local neural network implemented at each node has one fully-connected hidden layer with $64$ neurons and $51{,}675$ parameters in total. The goal is to train a neural network that classifies all ten digits $\{0,\ldots,9\}$ from the MNIST dataset, with $60{,}000$ training samples (around $6{,}000$ images in each class) and $10{,}000$ test images. The training dataset is divided randomly over $1{,}000$ nodes such that each node has $60$ data points. All algorithms use a constant step-size that is manually tuned for best performance. Fig. 5 shows the loss $F(\theta_k)$ and the test accuracy over epochs. We note that adding gradient tracking to DSGD improves both the transient and steady-state performance in this non-convex setting. Similarly, adding variance reduction improves the performance further.
This behavior is also notable in the test accuracy.

Fig. 5. Two-layer neural network over a 1,000-node random geometric graph, where one epoch represents $N/n = 60$ component gradient evaluations at each node. (Left) Loss vs. epochs; (right) test accuracy vs. epochs, for DSGD, GT-DSGD, and GT-SAGA.

VI. EXTENSIONS AND DISCUSSION

We now discuss some recent progress on several key aspects of decentralized optimization relevant to the first-order stochastic approaches described in this article.

Directed Graphs: The methods described in this article are restricted to undirected graphs. Over directed graphs, the main challenge is that the weight matrices are either row-stochastic (RS) or column-stochastic (CS), but cannot, in general, be doubly-stochastic (DS). A well-studied solution to this issue is based on push-sum-type algorithms [39], which enable consensus with non-DS weights with the help of eigenvector estimation. Combining push-sum with DSGD [14], [15] and with GT-DGD [20]-[22] leads, respectively, to SGP [40] and ADD-OPT [41], both of which require CS weights. A similar idea is used in FROST [42] to implement decentralized optimization with RS weights. The issue with push-sum-based extensions is that they require eigenvector estimation, which is itself an iterative procedure and may slow down the underlying algorithms, especially when the corresponding communication graphs are not well-connected. More recently, it has been shown that GT-DGD (16), ADD-OPT, and FROST are special cases of the AB algorithm [23], [43], which employs RS weights in (16a) and CS weights in (16b) and is thus immediately applicable to arbitrary directed graphs.
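The eigenvector-estimation idea behind push-sum can be seen in a minimal average-consensus sketch: each node maintains a value and a scaling sequence, both mixed with the same column-stochastic weights, and their ratio converges to the network average even though the weights are not doubly-stochastic. The sketch below assumes a directed 5-cycle with self-loops and hand-picked illustrative weights.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
# column-stochastic weights for a directed 5-cycle with self-loops
A = np.zeros((n, n))
for i in range(n):
    A[i, i] = 0.5                # keep half of the local mass
    A[(i + 1) % n, i] = 0.5      # push the other half to the single out-neighbor

x = rng.normal(size=n)           # initial values to be averaged
w = x.copy()                     # push-sum numerator
y = np.ones(n)                   # push-sum denominator (Perron eigenvector estimate)
for _ in range(200):
    w = A @ w                    # both sequences mix with the same CS weights
    y = A @ y

ratio = w / y                    # de-biased estimate at each node
assert np.allclose(ratio, x.mean(), atol=1e-8)
```

Since $\mathbf{1}^\top A = \mathbf{1}^\top$, the total mass of both sequences is preserved; the common Perron-eigenvector bias in `w` and `y` cancels in the ratio, which is the mechanism the push-sum-based methods above build upon.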
The AB framework naturally leads to stochastic optimization with gradient tracking over directed graphs; see SAB [44], which extends GT-DSGD to directed graphs, and which further opens the possibility of extending GT-SAGA and GT-SVRG to their directed counterparts.

Communication and computation aspects: Communication efficiency is an important aspect of decentralized optimization, since communication can become a bottleneck of the system when nodes frequently transmit high-dimensional vectors (model parameters) across the network. Different communication-efficient schemes [36], [45], communication/computation tradeoffs [11], asynchronous implementations [46], and quantization techniques [47], [48] have been studied with existing decentralized methods to efficiently manage the resources at each node.

Master-worker architectures: The problems described in this article have seen significant research activity because of their direct applicability to large-scale training problems in machine learning [40], [49]. Since these applications are typically hosted in controlled settings, e.g., data centers with highly sophisticated communication and a large number of highly efficient computing clusters, master-worker architectures and parameter-server models have become popular. In such architectures, see Fig. 2 (left), a central master maintains the current model parameters and communicates strategically with the workers, each of which holds a local batch of the total training data. Indeed, this architecture is not restricted to data centers alone; it is also applicable to certain Internet-of-Things (IoT) scenarios where the devices can communicate with the master either directly via the cloud or via a mesh network among the devices.
Various programming models and several variants of master-worker configurations have been proposed, such as MapReduce, All-Reduce, and federated learning [50], tailored to specific computing needs and environments. We emphasize that, in contrast, the motivation behind the decentralized methods studied in this article comes from scenarios where communication among the nodes is ad hoc and unstructured, and specialized topologies are not available.

VII. CONCLUSIONS

In this article, we discuss general formulations and solutions for decentralized, stochastic, first-order optimization methods. Our focus is on peer-to-peer networks, applicable to ad-hoc wireless communication settings where the nodes have resource constraints and limited communication capabilities. We discuss several fundamental algorithmic frameworks with a focus on gradient tracking and variance reduction. For all algorithms, we provide a detailed discussion of their convergence rates, properties, and tradeoffs, with a particular emphasis on smooth and strongly-convex objective functions. An important line of future work in decentralized machine learning is to analyze existing methods and develop new techniques for general non-convex objectives, given the tremendous success of deep neural networks.

SHORT BIOGRAPHIES

Ran Xin (ranx@andrew.cmu.edu) is a PhD candidate in the Electrical and Computer Engineering (ECE) department at Carnegie Mellon University (CMU), PA. His research interests include optimization theory and methods.

Soummya Kar (soummyak@andrew.cmu.edu) is an Associate Professor of ECE at Carnegie Mellon University, PA. His research interests include large-scale stochastic systems.

Usman A. Khan (khan@ece.tufts.edu) is an Associate Professor of ECE at Tufts University, MA. His research interests include optimization, control, and signal processing.
ACKNOWLEDGMENTS

The authors acknowledge the support of NSF under awards CCF-1350264, CCF-1513936, CMMI-1903972, and CBET-1935555. The authors would like to thank Boyue Li (CMU), Jianyu Wang (CMU), and Shuhua Yu (CMU) for their help and valuable discussions.

REFERENCES

[1] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Review, vol. 60, no. 2, pp. 223-311, 2018.
[2] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[3] Y. Nesterov, Lectures on Convex Optimization, vol. 137, Springer, 2018.
[4] D. Needell, R. Ward, and N. Srebro, "Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm," in Advances in Neural Information Processing Systems, 2014, pp. 1017-1025.
[5] R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik, "SGD: General analysis and improved rates," in International Conference on Machine Learning, 2019.
[6] M. Schmidt, N. Le Roux, and F. Bach, "Minimizing finite sums with the stochastic average gradient," Mathematical Programming, vol. 162, no. 1-2, pp. 83-112, 2017.
[7] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives," in Advances in Neural Information Processing Systems, 2014, pp. 1646-1654.
[8] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," in Advances in Neural Information Processing Systems, 2013, pp. 315-323.
[9] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč, "SARAH: A novel method for machine learning problems using stochastic recursive gradient," in 34th International Conference on Machine Learning, 2017, pp. 2613-2621.
[10] R. Olfati-Saber, J. A. Fax, and R. M. Murray, "Consensus and cooperation in networked multi-agent systems," Proceedings of the IEEE, vol. 95, no. 1, pp. 215-233, 2007.
[11] A. Nedić, A.
Olshevsky, and M. G. Rabbat, "Network topology and communication-computation tradeoffs in decentralized optimization," Proceedings of the IEEE, vol. 106, no. 5, pp. 953-976, 2018.
[12] R. A. Horn and C. R. Johnson, Matrix Analysis, Cambridge University Press, 2012.
[13] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48-61, 2009.
[14] J. Chen and A. H. Sayed, "Diffusion adaptation strategies for distributed optimization and learning over networks," IEEE Trans. Signal Process., vol. 60, no. 8, pp. 4289-4305, 2012.
[15] S. S. Ram, A. Nedić, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516-545, 2010.
[16] K. Yuan, S. A. Alghunaim, B. Ying, and A. H. Sayed, "On the performance of exact diffusion over adaptive networks," 2019.
[17] D. Jakovetic, D. Bajovic, A. K. Sahu, and S. Kar, "Convergence rates for distributed stochastic optimization over random networks," in IEEE Conference on Decision and Control, 2018, pp. 4238-4245.
[18] A. Olshevsky, I. C. Paschalidis, and S. Pu, "A non-asymptotic analysis of network independence for distributed stochastic gradient descent," 2019.
[19] P. Di Lorenzo and G. Scutari, "NEXT: In-network nonconvex optimization," IEEE Trans. Signal Inf. Process. Netw., vol. 2, no. 2, pp. 120-136, 2016.
[20] J. Xu, S. Zhu, Y. C. Soh, and L. Xie, "Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant stepsizes," in 54th IEEE Conference on Decision and Control, 2015, pp. 2055-2060.
[21] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," IEEE Trans. Control of Network Systems, vol. 5, no. 3, pp. 1245-1260, 2017.
[22] A. Nedić, A. Olshevsky, and W.
Shi, "Achieving geometric convergence for distributed optimization over time-varying graphs," SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597-2633, 2017.
[23] R. Xin and U. A. Khan, "A linear algorithm for optimization over directed graphs with geometric convergence," IEEE Control Systems Letters, vol. 2, no. 3, pp. 315-320, 2018.
[24] M. Zhu and S. Martínez, "Discrete-time dynamic average consensus," Automatica, vol. 46, no. 2, pp. 322-329, 2010.
[25] S. A. Alghunaim, K. Yuan, and A. H. Sayed, "A linearly convergent proximal gradient algorithm for decentralized optimization," 2019.
[26] S. Pu and A. Nedić, "A distributed stochastic gradient tracking method," in IEEE Conference on Decision and Control, 2018, pp. 963-968.
[27] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944-966, 2015.
[28] K. Yuan, B. Ying, X. Zhao, and A. H. Sayed, "Exact diffusion for distributed optimization and learning, Part I: Algorithm development," IEEE Trans. Signal Process., vol. 67, no. 3, pp. 708-723, 2018.
[29] R. Xin, U. A. Khan, and S. Kar, "Variance-reduced decentralized stochastic optimization with accelerated convergence," arXiv preprint arXiv:1912.04230, 2019.
[30] A. Mokhtari and A. Ribeiro, "DSA: Decentralized double stochastic averaging gradient algorithm," Journal of Machine Learning Research, vol. 17, no. 1, pp. 2165-2199, 2016.
[31] B. Ying, K. Yuan, and A. H. Sayed, "Variance-reduced stochastic learning under random reshuffling," arXiv preprint arXiv:1708.01383, 2017.
[32] Z. Shen, A. Mokhtari, T. Zhou, P. Zhao, and H. Qian, "Towards more efficient stochastic decentralized learning: Faster convergence and sparse communication," 2018.
[33] A. Defazio, "A simple practical accelerated method for finite sums," in Advances in Neural Information Processing Systems, 2016, pp. 676-684.
[34] H.
Hendrikx, F. Bach, and L. Massoulié, "Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums," 2019.
[35] Q. Lin, Z. Lu, and L. Xiao, "An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization," SIAM Journal on Optimization, vol. 25, no. 4, pp. 2244-2273, 2015.
[36] B. Li, S. Cen, Y. Chen, and Y. Chi, "Communication-efficient distributed optimization in networks with gradient tracking," 2019.
[37] Y. LeCun, "The MNIST database of handwritten digits," http://yann.lecun.com/exdb/mnist/, 1998.
[38] K. Yuan, B. Ying, J. Liu, and A. H. Sayed, "Variance-reduced stochastic learning by networked agents under random reshuffling," IEEE Trans. Signal Process., vol. 67, no. 2, pp. 351-366, 2018.
[39] D. Kempe, A. Dobra, and J. Gehrke, "Gossip-based computation of aggregate information," in 44th Annual IEEE Symposium on Foundations of Computer Science, 2003, pp. 482-491.
[40] M. Assran, N. Loizou, N. Ballas, and M. Rabbat, "Stochastic gradient push for distributed deep learning," in Proceedings of the 36th International Conference on Machine Learning, 2019, vol. 97, pp. 344-353.
[41] C. Xi, R. Xin, and U. A. Khan, "ADD-OPT: Accelerated distributed directed optimization," IEEE Trans. Autom. Control, vol. 63, no. 5, pp. 1329-1339, 2017.
[42] R. Xin, C. Xi, and U. A. Khan, "FROST: Fast row-stochastic optimization with uncoordinated step-sizes," EURASIP Journal on Advances in Signal Processing, Nov. 2018.
[43] S. Pu, W. Shi, J. Xu, and A. Nedić, "A push-pull gradient method for distributed optimization in networks," in IEEE Conference on Decision and Control, Dec. 2018, pp. 3385-3390.
[44] R. Xin, A. K. Sahu, U. A. Khan, and S. Kar, "Distributed stochastic optimization with gradient tracking over strongly-connected networks," in IEEE Conference on Decision and Control, 2019.
[45] G. Lan, S.
Lee, and Y. Zhou, "Communication-efficient algorithms for decentralized and stochastic optimization," Mathematical Programming, pp. 1-48, 2017.
[46] J. Zhang and K. You, "AsySPA: An exact asynchronous algorithm for convex optimization over digraphs," IEEE Transactions on Automatic Control, 2019.
[47] A. Reisizadeh, A. Mokhtari, H. Hassani, and R. Pedarsani, "An exact quantized decentralized gradient descent algorithm," IEEE Trans. Signal Process., vol. 67, no. 19, pp. 4934-4947, 2019.
[48] A. Koloskova, S. U. Stich, and M. Jaggi, "Decentralized stochastic optimization and gossip algorithms with compressed communication," arXiv preprint arXiv:1902.00340, 2019.
[49] X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, and J. Liu, "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," in Advances in Neural Information Processing Systems, 2017, pp. 5330-5340.
[50] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, Apr. 2017, vol. 54, pp. 1273-1282.
