GADMM: Fast and Communication Efficient Framework for Distributed Machine Learning
Anis Elgabli  anis.elgabli@oulu.fi
Jihong Park  jihong.park@oulu.fi
Amrit S. Bedi  amritbd@iitk.ac.in
Mehdi Bennis  mehdi.bennis@oulu.fi
Vaneet Aggarwal  vaneet@purdue.edu

Abstract

When the data is distributed across multiple servers, lowering the communication cost between the servers (or workers) while solving the distributed learning problem is an important problem and is the focus of this paper. In particular, we propose a fast and communication-efficient decentralized framework to solve the distributed machine learning (DML) problem. The proposed algorithm, Group Alternating Direction Method of Multipliers (GADMM), is based on the Alternating Direction Method of Multipliers (ADMM) framework. The key novelty in GADMM is that it solves the problem in a decentralized topology where at most half of the workers are competing for the limited communication resources at any given time. Moreover, each worker exchanges the locally trained model only with two neighboring workers, thereby training a global model with a lower amount of communication overhead in each exchange. We prove that GADMM converges to the optimal solution for convex loss functions, and numerically show that it converges faster and is more communication-efficient than state-of-the-art communication-efficient algorithms such as the Lazily Aggregated Gradient (LAG) and dual averaging, in linear and logistic regression tasks on synthetic and real datasets. Furthermore, we propose Dynamic GADMM (D-GADMM), a variant of GADMM, and prove its convergence under the time-varying network topology of the workers.

1. Introduction

Distributed optimization plays a pivotal role in distributed machine learning applications (Ahmed et al., 2013; Dean et al., 2012; Li et al., 2013, 2014) that commonly aim to minimize $\frac{1}{N}\sum_{n=1}^{N} f_n(\Theta)$ with $N$ workers. As illustrated in Fig. 1-(a), this problem is often solved by locally minimizing $f_n(\theta_n)$ at each worker and globally averaging the model parameters $\theta_n$ (and/or gradients) at a parameter server, thereby yielding the global model parameters $\Theta$ (Tsianos et al., 2012). Another way is to formulate the problem as an average consensus problem that minimizes $\frac{1}{N}\sum_{n=1}^{N} f_n(\theta_n)$ under the constraint $\theta_n = \Theta, \forall n$, which can be solved using dual decomposition or the Alternating Direction Method of Multipliers (ADMM). ADMM is preferable since standard dual decomposition may fail in updating the variables in some cases. For example, if the objective function $f_n(\theta_n)$ is a nonzero affine function of any component of the input parameter $\theta_n$, then the $\theta_n$-update fails, since the Lagrangian is unbounded from below in $\theta_n$ for most choices of the dual variables (Boyd et al., 2011). However, both ADMM and dual decomposition require the existence of a central entity. Such a centralized solution cannot handle a network whose size exceeds the parameter server's coverage range. Even if the parameter server has a link to each worker, communication resources may become the bottleneck since, at every iteration, all workers need to transmit their updated models to the server before the server updates the global model and sends it back to the workers. Hence, as the number of workers increases, the uplink communication resources become the bottleneck.
Because of this, we aim to develop a fast and communication-efficient decentralized algorithm, and propose Group Alternating Direction Method of Multipliers (GADMM). GADMM solves the problem $\min \frac{1}{N}\sum_{n=1}^{N} f_n(\theta_n)$ subject to $\theta_n = \theta_{n+1}, \forall n \in \{1, \cdots, N-1\}$, in which the workers are divided into two groups (head and tail), and each worker in the head (tail) group communicates only with its two neighboring workers from the tail (head) group, as shown in Fig. 1-(b). Because each worker communicates with only two neighbors rather than with all neighbors or a central entity, the communication per iteration is significantly reduced. Moreover, by dividing the workers into two equal groups, at most half of the workers compete for the communication resources in every communication round. Despite this sparse communication, in which each worker communicates with at most two neighbors, we prove that GADMM converges to the optimal solution for convex functions. We numerically show that its communication overhead is lower than that of state-of-the-art communication-efficient centralized and decentralized algorithms, including Lazily Aggregated Gradient (LAG) (Chen et al., 2018) and dual averaging (Duchi et al., 2011), for linear and logistic regression on synthetic and real datasets. Furthermore, we propose a variant of GADMM, Dynamic GADMM (D-GADMM), to account for dynamic networks in which the workers are moving objects (e.g., vehicles), so the neighbors of each worker may change over time. We prove that D-GADMM inherits the convergence guarantees of GADMM. Interestingly, we show that D-GADMM not only adjusts to dynamic networks but also improves the convergence speed of GADMM; i.e., given a static physical topology, repeatedly re-randomizing the way the connectivity chain is constructed (Fig. 1-(b)) can significantly accelerate the convergence of GADMM. It is worth mentioning that (Nedić et al., 2018) showed that as the number of links in the network graph decreases, the convergence speed becomes slower. However, we show that the slowdown of GADMM relative to the standard parameter-server-based ADMM (fully connected graph), caused by sparsifying the network graph, can be compensated by continuously changing neighbors using D-GADMM.

2. Related Works and Contributions

Distributed Optimization. A variety of distributed optimization algorithms have been proposed in the literature, such as primal methods (Jakovetić et al., 2014; Nedić and Olshevsky, 2014; Nedić and Ozdaglar, 2009; Shi et al., 2015) and primal-dual methods (Chang et al., 2014a; Koppel et al., 2017; Bedi et al., 2019).
Consensus optimization underlies most of the primal methods, while dual decomposition and ADMM are the most popular among the primal-dual algorithms (Glowinski and Marroco, 1975; Gabay and Mercier, 1975; Boyd et al., 2011; Jaggi et al., 2014; Ma et al., 2017; Deng et al., 2017).

Figure 1: An illustration of (a) distributed gradient descent (centralized) with a parameter server and (b) GADMM (decentralized) without any central entity.

The performance of distributed optimization algorithms is commonly characterized by their computation time and communication cost. The computation time is determined by the per-iteration complexity of the algorithm. The communication cost is determined by: (i) the number of communication rounds until convergence, (ii) the number of channel uses per communication round, and (iii) the bandwidth/power usage per channel use. Note that the number of communication rounds is proportional to the number of iterations; e.g., 2 rounds at every iteration $k$, for uplink and downlink transmissions in Fig. 1-(a), or for head-to-tail and tail-to-head transmissions in Fig. 1-(b).
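These three components can be read as a rough multiplicative decomposition; the following is an illustrative model introduced here for intuition, not a formula from the paper:

$$
C_{\text{total}} \;\approx\; \underbrace{R}_{\text{(i) communication rounds}} \;\times\; \underbrace{U}_{\text{(ii) channel uses per round}} \;\times\; \underbrace{B}_{\text{(iii) bandwidth/power per channel use}},
$$

so a scheme can cut its communication cost by reducing any of the three factors, which is how the related work below is grouped.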
For a large scale network, the communication cost often becomes dominant compared to the computation time, calling for communication-efficient distributed optimization (Zhang et al., 2012; McMahan et al., 2017; Park et al., 2019; Jordan et al., 2018; Liu et al., 2019; Sriranga et al., 2019).

Communication Efficient Distributed Optimization. A vast amount of work is devoted to reducing the aforementioned three communication cost components. To reduce the bandwidth/power usage per channel use, decreasing communication payload sizes is one popular solution, which is enabled by gradient quantization (Suresh et al., 2017), model parameter quantization (Zhu et al., 2016; Sriranga et al., 2019), and model output exchange for large-sized models via knowledge distillation (Jeong et al., 2018). To reduce the number of channel uses per communication round, exchanging model updates can be restricted only to the workers whose computation delays are less than a target threshold (Wang et al., 2018), or to the workers whose updates have changed sufficiently from the preceding updates, with respect to gradients (Chen et al., 2018) or model parameters (Liu et al., 2019). Despite their improved communication efficiency at every iteration $k$, most of the algorithms in this literature are based on distributed gradient descent, which ties their required number of communication rounds to the convergence rate of distributed gradient descent; that rate is $O(1/k)$ for differentiable and smooth objective functions and can be as low as $O(1/\sqrt{k})$ (e.g., when the objective function is non-differentiable everywhere (Boyd et al., 2011)). On the other hand, primal-dual decomposition methods are shown to be effective in enabling distributed optimization (Jaggi et al., 2014; Boyd et al., 2011; Ma et al., 2017; Glowinski and Marroco, 1975; Gabay and Mercier, 1975; Deng et al., 2017), among which ADMM is a compelling solution that often provides a fast convergence rate with low complexity (Glowinski and Marroco, 1975; Gabay and Mercier, 1975; Deng et al., 2017). It was shown in (Chen et al., 2016) that Gauss-Seidel ADMM (Glowinski and Marroco, 1975) achieves the convergence rate $o(1/k)$. However, this convergence rate is ensured only when the objective function is a sum of two separable convex functions. Finally, all aforementioned distributed algorithms require a parameter server connected to every worker, which may induce a costly communication link to some workers, or may not even be feasible for workers located beyond the server's coverage. In sharp contrast, we aim at developing a decentralized optimization framework ensuring fast convergence without any central entity.

Decentralized Optimization. For decentralized topologies, decentralized gradient descent (DGD) has been investigated in (Nedić et al., 2018). Since DGD has fewer connections per worker compared to parameter-server based GD, it achieves a slower convergence. Beyond GD-based approaches, several communication-efficient decentralized algorithms were proposed for both time-variant and time-invariant topologies. (Duchi et al., 2011; Scaman et al., 2018) proposed decentralized algorithms that solve the problem for a time-invariant topology at a convergence rate of $O(1/\sqrt{k})$.
On the other hand, (Lan et al., 2017) proposed a decentralized algorithm that requires each worker to transmit the updated primal and dual variables at each iteration. Note that, in GADMM, each worker shares only the primal parameters per iteration. Finally, it is worth mentioning that a decentralized algorithm was proposed in (He et al., 2018), but that algorithm was studied only for linear learning tasks.

For time-varying topologies, there are a few proposed algorithms in the literature. For instance, (Nedić and Olshevsky, 2014) proposed a sub-gradient based algorithm for a time-variant directed graph. That algorithm requires each worker to send two sets of variables to its neighboring nodes per iteration and achieves an $O(1/\sqrt{k})$ convergence rate. In contrast, in D-GADMM, only primal variables are shared with neighbors at each iteration. Finally, (Nedic et al., 2017) proposed an algorithm that achieves a linear convergence speed, but only for strongly convex functions. Moreover, it also requires each worker to send more than one set of variables per communication round.

Contribution. We formulate the decentralized machine learning (DML) problem as a constrained optimization problem that can be solved in a decentralized way. Moreover, we propose a novel algorithm that solves the formulated problem optimally for convex functions. The proposed algorithm is shown to be fast and communication-efficient, achieving significantly less communication overhead than the standard ADMM. The proposed GADMM algorithm (i) allows only half of the workers to transmit their updated parameters at each communication round, and (ii) lets the workers update their model parameters in parallel while each worker communicates with only two neighbors, which makes it communication-efficient. Moreover, we propose D-GADMM, which has two advantages: (i) it accounts for a time-varying network topology, and (ii) it improves the convergence speed of GADMM by randomly changing neighbors even when the physical topology is not time-varying. Therefore, D-GADMM combines the communication efficiency of GADMM, which uses only two links per worker (sparse graph), with the fast convergence speed of the standard ADMM with a parameter server (star topology with $N$ connections to a central entity). It is worth mentioning that GADMM is closely related to other group-based ADMM methods as in (Wang et al., 2017), but these methods require more communication links per iteration than our proposed GADMM algorithm. Notably, the algorithm in (Wang et al., 2017) still relies on multiple central entities, i.e., master workers under a master-slave architecture, whereas GADMM requires no central entity and the workers are equally divided into head and tail groups.

The rest of the paper is organized as follows. In Section 3, we describe the problem formulation. We describe our proposed variant of ADMM (GADMM) and analyze its convergence guarantees in Sections 4 and 5, respectively. In Section 6, we describe D-GADMM, which extends our proposed algorithm to time-varying networks. In Section 7, we discuss our simulation results comparing GADMM to the considered baselines. Finally, in Section 8, we conclude the paper and briefly discuss future directions.
3. Problem Formulation

We consider a network of $N$ workers, each tasked with learning a global parameter $\Theta$. The aim is to minimize the global convex loss function $F(\Theta)$, which is the sum of the local convex, proper, and closed functions $f_n(\Theta)$ for all $n$. We consider the following optimization problem

$$\min_{\Theta} F(\Theta), \qquad F(\Theta) := \sum_{n=1}^{N} f_n(\Theta), \tag{1}$$

where $\Theta \in \mathbb{R}^d$ is the global model parameter. The gradient descent algorithm can be used to solve the problem in (1) iteratively at a central entity. The goal here is to solve the problem in a distributed manner. The standard technique used in the literature for a distributed solution is the consensus formulation of (1), given by

$$\min_{\Theta, \{\theta_n\}_{n=1}^{N}} \;\sum_{n=1}^{N} f_n(\theta_n) \tag{2}$$
$$\text{s.t.} \quad \theta_n = \Theta, \;\; \forall n. \tag{3}$$

Note that with the reformulation in (2)-(3), the objective function becomes separable across the workers and hence can be solved in a distributed manner. The problem in (2)-(3) is known as the global consensus problem since the constraint forces all the variables across different workers to be equal, as detailed in (Boyd et al., 2011). The problem in (2)-(3) can be solved using primal-dual based algorithms as in (Chang et al., 2014b; Touri and Nedic, 2009; Nedić and Ozdaglar, 2009), the saddle point algorithms proposed in (Koppel et al., 2017; Bedi et al., 2019), and ADMM-based techniques such as (Glowinski and Marroco, 1975; Boyd et al., 2011; Deng et al., 2017). ADMM forms an augmented Lagrangian, which adds a quadratic term to the Lagrange function, and breaks the main problem into sub-problems that are easier to solve per iteration. Note that in the ADMM implementation (Boyd et al., 2011; Deng et al., 2017), only the primal variables $\{\theta_n\}_{n=1}^{N}$ can be updated in a distributed manner. However, the step of updating $\Theta$ requires collecting $\theta_n$ from all workers, which is communication inefficient (Boyd et al., 2011).

The problem formulation in (2)-(3) can be solved using standard ADMM (parameter-server based ADMM). The augmented Lagrangian of the optimization problem in (2)-(3) is

$$\mathcal{L}_{\rho}(\Theta, \{\theta_n\}_{n=1}^{N}, \lambda) = \sum_{n=1}^{N} f_n(\theta_n) + \sum_{n=1}^{N} \langle \lambda_n, \theta_n - \Theta \rangle + \frac{\rho}{2}\sum_{n=1}^{N} \|\theta_n - \Theta\|^2, \tag{4}$$

where $\lambda := [\lambda_1^{\mathrm{T}}, \cdots, \lambda_N^{\mathrm{T}}]^{\mathrm{T}}$ is the collection of the dual variables, and $\rho$ is a constant adjusting the penalty for the disagreement between $\theta_n$ and $\Theta$. The primal and dual variables under ADMM are updated in the following three steps.

1) At iteration $k+1$, the primal variable of each worker is updated as:

$$\theta_n^{k+1} = \arg\min_{\theta_n} f_n(\theta_n) + \langle \lambda_n^{k}, \theta_n - \Theta^{k} \rangle + \frac{\rho}{2}\|\theta_n - \Theta^{k}\|^2, \quad n \in \{1, \cdots, N\}. \tag{5}$$

2) After the update in (5), each worker sends its primal variable (updated model) to the parameter server. The primal variable of the parameter server is then updated as:

$$\Theta^{k+1} = \frac{1}{N}\sum_{n=1}^{N}\Big(\theta_n^{k+1} + \frac{1}{\rho}\lambda_n^{k}\Big). \tag{6}$$

3) After the update in (6), the parameter server broadcasts its primal variable (the updated global model) to all workers. After receiving the global model $\Theta^{k+1}$ from the parameter server, each worker locally updates its dual variable $\lambda_n$ as follows:

$$\lambda_n^{k+1} = \lambda_n^{k} + \rho(\theta_n^{k+1} - \Theta^{k+1}), \quad n \in \{1, \cdots, N\}. \tag{7}$$

Note that standard ADMM requires a parameter server that collects updates from all workers, updates the global model, and broadcasts that model back to all workers.
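To make the three steps concrete, here is a minimal sketch of parameter-server ADMM, assuming quadratic local losses $f_n(\theta) = \frac{1}{2}\|A_n\theta - b_n\|^2$ so that the $\theta_n$-update in (5) has a closed form; the data `A_list`, `b_list` and the choice of `rho` are illustrative, and this is not the paper's reference implementation.

```python
import numpy as np

def server_admm(A_list, b_list, rho=1.0, iters=100):
    """Minimal parameter-server ADMM sketch for f_n(theta) = 0.5*||A_n theta - b_n||^2.

    Follows updates (5)-(7): local theta_n-update, server averaging, local dual update.
    A_list/b_list are illustrative per-worker data, not the paper's experiments.
    """
    N, d = len(A_list), A_list[0].shape[1]
    Theta = np.zeros(d)                         # global model at the server
    theta = [np.zeros(d) for _ in range(N)]     # local primal variables
    lam = [np.zeros(d) for _ in range(N)]       # local dual variables
    for _ in range(iters):
        # (5) each worker solves its local quadratic subproblem in closed form
        for n in range(N):
            H = A_list[n].T @ A_list[n] + rho * np.eye(d)
            g = A_list[n].T @ b_list[n] - lam[n] + rho * Theta
            theta[n] = np.linalg.solve(H, g)
        # (6) server averages the received primal and (scaled) dual variables
        Theta = np.mean([theta[n] + lam[n] / rho for n in range(N)], axis=0)
        # (7) each worker updates its dual variable locally
        for n in range(N):
            lam[n] = lam[n] + rho * (theta[n] - Theta)
    return Theta
```

For general convex $f_n$, the closed-form solve in (5) would simply be replaced by any local convex solver.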
Such a scheme may not be communication-efficient because: (i) $N$ workers compete for the limited communication resources at every iteration, (ii) the worker with the weakest communication channel becomes the bottleneck for the rate of the broadcast channel from the parameter server to the workers, and (iii) some workers may not be in the coverage zone of the parameter server. In contrast to standard ADMM, we propose a decentralized algorithm that minimizes the communication cost required per worker by allowing only $N/2$ workers to transmit at every communication round, so the communication resources available to each worker are doubled compared to parameter-server based ADMM. Moreover, it limits the communication of each worker to only two neighbors. We consider the optimization problem in (2)-(3) and rewrite the constraints as follows:

$$\theta^{\star} := \arg\min_{\{\theta_n\}_{n=1}^{N}} \;\sum_{n=1}^{N} f_n(\theta_n) \tag{8}$$
$$\text{s.t.} \quad \theta_n = \theta_{n+1}, \;\; n = 1, \cdots, N-1. \tag{9}$$

Here $\theta^{\star}$ is the optimal solution, and note that $\theta^{\star}_{n-1} = \theta^{\star}_{n}$ and $\theta^{\star}_{n} = \theta^{\star}_{n+1}$ for all $n$. This implies that each worker $n$ has joint constraints with only two neighbors (except for the two end workers, which have only one). Nonetheless, ensuring $\theta_n = \theta_{n+1}$ for all $n \in \{1, \cdots, N-1\}$ at the convergence point yields convergence to a global model parameter that is shared across all workers.

4. Proposed Algorithm: GADMM

We will now describe our proposed algorithm, GADMM, which solves the optimization problem defined in (8)-(9) in a decentralized manner. The proposed algorithm is fast since it allows workers belonging to the same group to update their model parameters in parallel, and it is communication-efficient since it allows workers to exchange variables with a minimum number of neighbors while enjoying a fast convergence rate. Moreover, it allows only half of the workers to transmit their updated model parameters at each communication round. Note that when the number of workers updating their parameters per communication round is halved, the physical communication resources (e.g., bandwidth) available to each worker are doubled when those resources are shared among workers.

The main idea of the proposed algorithm is presented in Fig. 1-(b). The proposed GADMM algorithm splits the network nodes (workers) connected in a chain into two groups, head and tail, such that each worker in the head group is connected to the other workers through two tail workers. It updates the parameters in parallel for the workers in the same group. In one algorithm iteration, the workers in the head group update their model parameters, and each head worker transmits its updated model to its directly connected tail neighbors. Then, the tail workers update their model parameters to complete one iteration. In doing so, each worker (except the edge workers) communicates with only two neighbors to update its parameter, as depicted in Fig. 1-(b). Moreover, at any communication round, only half of the workers transmit their parameters, and these parameters are transmitted to only two neighbors. In contrast to the Gauss-Seidel ADMM in (Boyd et al., 2011), GADMM allows all the head (tail) workers to update their parameters in parallel and still converges to the optimal solution for convex functions, as will be shown later in this paper.
Moreover, GADMM has much less communication overhead compared to PJADMM in (Deng et al., 2017), which requires all workers to send their parameters to a central entity at every communication round. GADMM also has fewer hyperparameters to tune and less computation per iteration than PJADMM. The detailed steps of the proposed algorithm are summarized in Algorithm 1.

To intuitively describe GADMM, without loss of generality, we consider an even number $N$ of workers under the linear connectivity graph shown in Fig. 1-(b), wherein each head (or tail) worker communicates with at most two neighboring tail (or head) workers, except for the edge workers (i.e., the first and last workers). With that in mind, we start by writing the augmented Lagrangian of the optimization problem in (8)-(9) as

$$\mathcal{L}_{\rho}(\{\theta_n\}_{n=1}^{N}, \lambda) = \sum_{n=1}^{N} f_n(\theta_n) + \sum_{n=1}^{N-1} \langle \lambda_n, \theta_n - \theta_{n+1} \rangle + \frac{\rho}{2}\sum_{n=1}^{N-1} \|\theta_n - \theta_{n+1}\|^2. \tag{10}$$

Let us divide the $N$ workers into two groups: head, $\mathcal{N}_h = \{\theta_1, \theta_3, \theta_5, \cdots, \theta_{N-1}\}$, and tail, $\mathcal{N}_t = \{\theta_2, \theta_4, \theta_6, \cdots, \theta_N\}$. The primal and dual variables under GADMM are updated in the following three steps.

1) At iteration $k+1$, the primal variables of head workers are updated as:

$$\theta_n^{k+1} = \arg\min_{\theta_n} f_n(\theta_n) + \langle \lambda_{n-1}^{k}, \theta_{n-1}^{k} - \theta_n \rangle + \langle \lambda_n^{k}, \theta_n - \theta_{n+1}^{k} \rangle + \frac{\rho}{2}\|\theta_{n-1}^{k} - \theta_n\|^2 + \frac{\rho}{2}\|\theta_n - \theta_{n+1}^{k}\|^2, \quad n \in \mathcal{N}_h \setminus \{1\}. \tag{11}$$

Since the first head worker ($n = 1$) does not have a left neighbor ($\theta_{n-1}$ is not defined), its model is updated as follows:

$$\theta_n^{k+1} = \arg\min_{\theta_n} f_n(\theta_n) + \langle \lambda_n^{k}, \theta_n - \theta_{n+1}^{k} \rangle + \frac{\rho}{2}\|\theta_n - \theta_{n+1}^{k}\|^2, \quad n = 1. \tag{12}$$

2) After the updates in (11) and (12), head workers send their updates to their two tail neighbors. The primal variables of tail workers are then updated as:

$$\theta_n^{k+1} = \arg\min_{\theta_n} f_n(\theta_n) + \langle \lambda_{n-1}^{k}, \theta_{n-1}^{k+1} - \theta_n \rangle + \langle \lambda_n^{k}, \theta_n - \theta_{n+1}^{k+1} \rangle + \frac{\rho}{2}\|\theta_{n-1}^{k+1} - \theta_n\|^2 + \frac{\rho}{2}\|\theta_n - \theta_{n+1}^{k+1}\|^2, \quad n \in \mathcal{N}_t \setminus \{N\}. \tag{13}$$

Since the last tail worker ($n = N$) does not have a right neighbor ($\theta_{n+1}$ is not defined), its model is updated as follows:

$$\theta_n^{k+1} = \arg\min_{\theta_n} f_n(\theta_n) + \langle \lambda_{n-1}^{k}, \theta_{n-1}^{k+1} - \theta_n \rangle + \frac{\rho}{2}\|\theta_{n-1}^{k+1} - \theta_n\|^2, \quad n = N. \tag{14}$$

3) After receiving the updates from its neighbors, every worker locally updates its dual variables $\lambda_{n-1}$ and $\lambda_n$ as follows:

$$\lambda_n^{k+1} = \lambda_n^{k} + \rho(\theta_n^{k+1} - \theta_{n+1}^{k+1}), \quad n \in \{1, \cdots, N-1\}. \tag{15}$$

These three steps of GADMM are summarized in Algorithm 1. We remark that when $f_n(\theta_n)$ is convex, proper, closed, and differentiable for all $n$, the subproblems in (11) and (13) are convex and differentiable with respect to $\theta_n$. This holds because the additional terms in the augmented Lagrangian are sums of quadratic and linear terms, which are also convex and differentiable.
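As a concrete (if simplified) illustration of steps (11)-(15), the following sketch performs one GADMM iteration over a chain of workers, again assuming quadratic local losses $f_n(\theta) = \frac{1}{2}\|A_n\theta - b_n\|^2$ so that every subproblem has a closed form. The variable names are illustrative and this is not the authors' implementation; heads sit at even 0-based indices, matching the odd 1-based indices in the paper.

```python
import numpy as np

def gadmm_step(A, b, theta, lam, rho):
    """One GADMM iteration over a chain of N workers (N even), steps (11)-(15).

    A[n], b[n]: illustrative local data with f_n(x) = 0.5*||A[n] x - b[n]||^2,
    theta[n]: local models, lam[n]: dual for the constraint theta[n] = theta[n+1].
    """
    N, d = len(A), theta[0].shape[0]

    def local_solve(n, left, right, lam_left, lam_right):
        # Closed-form minimizer of f_n plus the linear/quadratic coupling terms.
        H = A[n].T @ A[n]
        g = A[n].T @ b[n]
        if left is not None:   # coupling with the left neighbor
            H += rho * np.eye(d); g += lam_left + rho * left
        if right is not None:  # coupling with the right neighbor
            H += rho * np.eye(d); g += -lam_right + rho * right
        return np.linalg.solve(H, g)

    # (11)-(12): head workers (even 0-based indices) update in parallel
    for n in range(0, N, 2):
        left = theta[n - 1] if n > 0 else None
        right = theta[n + 1] if n < N - 1 else None
        theta[n] = local_solve(n, left, right,
                               lam[n - 1] if n > 0 else None,
                               lam[n] if n < N - 1 else None)
    # (13)-(14): tail workers use the fresh head models received from neighbors
    for n in range(1, N, 2):
        left = theta[n - 1]
        right = theta[n + 1] if n < N - 1 else None
        theta[n] = local_solve(n, left, right, lam[n - 1],
                               lam[n] if n < N - 1 else None)
    # (15): every worker updates its dual variables locally
    for n in range(N - 1):
        lam[n] = lam[n] + rho * (theta[n] - theta[n + 1])
    return theta, lam
```

Calling `gadmm_step` repeatedly corresponds to the loop in Algorithm 1; for general convex $f_n$, `local_solve` would be replaced by a local convex solver.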
5. Convergence Analysis

In this section, we focus on the convergence analysis of the proposed algorithm. It is essential to prove that the proposed algorithm indeed converges to the optimal solution of the problem in (8)-(9) for convex, proper, and closed objective functions. The idea behind the convergence proof is related to the proof of Gauss-Seidel ADMM in (Boyd et al., 2011), while additionally accounting for the following three challenges: (i) the additional terms that appear when the problem is a sum of more than two separable functions, (ii) the fact that each worker can communicate with only two neighbors, and (iii) the parallel model parameter updates of the head (tail) workers. We show that the GADMM iterates converge to the optimal solution after addressing all the above-mentioned challenges in the proof.

Algorithm 1 Group ADMM (GADMM)
1: Input: $N$, $f_n(\theta_n)$ for all $n$, $\rho$
2: Initialization:
3: $\mathcal{N}_h = \{\theta_n \,|\, n:\text{odd}\}$, $\mathcal{N}_t = \{\theta_n \,|\, n:\text{even}\}$
4: $\theta_n^{(0)} = 0$, $\lambda_n^{(0)} = 0$ for all $n$
5: for $k = 0, 1, 2, \cdots, K$ do
6:   Head worker $n \in \mathcal{N}_h$:
7:     computes its primal variable $\theta_n^{k+1}$ via (11) in parallel; and
8:     sends $\theta_n^{k+1}$ to its neighboring workers $n-1$ and $n+1$.
9:   Tail worker $n \in \mathcal{N}_t$:
10:    computes its primal variable $\theta_n^{k+1}$ via (13) in parallel; and
11:    sends $\theta_n^{k+1}$ to its neighboring workers $n-1$ and $n+1$.
12:   Every worker updates the dual variables $\lambda_{n-1}^{k}$ and $\lambda_n^{k}$ via (15) locally.
13: end for

Before presenting the main technical lemmas and theorems, we start with the necessary and sufficient optimality conditions, which are the primal and dual feasibility conditions (Boyd et al., 2011), defined as

$$\theta_n^{\star} = \theta_{n-1}^{\star}, \quad n \in \{2, \cdots, N\} \quad \text{(primal feasibility)} \tag{16}$$

$$0 \in \partial f_n(\theta_n^{\star}) - \lambda_{n-1}^{\star} + \lambda_n^{\star}, \;\; n \in \{2, \cdots, N-1\}; \qquad 0 \in \partial f_n(\theta_n^{\star}) + \lambda_n^{\star}, \;\; n = 1; \qquad 0 \in \partial f_n(\theta_n^{\star}) - \lambda_{n-1}^{\star}, \;\; n = N \quad \text{(dual feasibility)} \tag{17}$$

We remark that the optimal values $\theta_n^{\star}$ are equal for every $n$; we denote $\theta^{\star} = \theta_n^{\star} = \theta_{n-1}^{\star}$ for all $n$. Note that, at iteration $k+1$, we calculate $\theta_n^{k+1}$ for all $n \in \mathcal{N}_t \setminus \{N\}$ as in (13); from the first-order optimality condition, it holds that

$$0 \in \partial f_n(\theta_n^{k+1}) - \lambda_{n-1}^{k} + \lambda_n^{k} + \rho(\theta_n^{k+1} - \theta_{n-1}^{k+1}) + \rho(\theta_n^{k+1} - \theta_{n+1}^{k+1}). \tag{18}$$

Next, we rewrite (18) as

$$0 \in \partial f_n(\theta_n^{k+1}) - \big(\lambda_{n-1}^{k} + \rho(\theta_{n-1}^{k+1} - \theta_n^{k+1})\big) + \big(\lambda_n^{k} + \rho(\theta_n^{k+1} - \theta_{n+1}^{k+1})\big). \tag{19}$$

From the update in (15), the equation in (19) implies that

$$0 \in \partial f_n(\theta_n^{k+1}) - \lambda_{n-1}^{k+1} + \lambda_n^{k+1}, \quad n \in \mathcal{N}_t \setminus \{N\}. \tag{20}$$

Note that for the $N$-th worker, we calculate $\theta_N^{k+1}$ as in (14); following the same steps, we get

$$0 \in \partial f_n(\theta_n^{k+1}) - \lambda_{n-1}^{k+1}, \quad n = N. \tag{21}$$

From the results in (20) and (21), it holds that the dual feasibility condition in (17) is always satisfied for all $n \in \mathcal{N}_t$. Next, consider every $\theta_n^{k+1}$ such that $n \in \mathcal{N}_h \setminus \{1\}$, which is calculated as in (11) at iteration $k+1$. Similarly, from the first-order optimality condition, we can write

$$0 \in \partial f_n(\theta_n^{k+1}) - \lambda_{n-1}^{k} + \lambda_n^{k} + \rho(\theta_n^{k+1} - \theta_{n-1}^{k}) + \rho(\theta_n^{k+1} - \theta_{n+1}^{k}). \tag{22}$$

Note that in (22), not all the primal variables are evaluated at iteration $k+1$. Hence, we add and subtract the terms $\theta_{n-1}^{k+1}$ and $\theta_{n+1}^{k+1}$ in (22) to get

$$0 \in \partial f_n(\theta_n^{k+1}) - \big(\lambda_{n-1}^{k} + \rho(\theta_{n-1}^{k+1} - \theta_n^{k+1})\big) + \big(\lambda_n^{k} + \rho(\theta_n^{k+1} - \theta_{n+1}^{k+1})\big) + \rho(\theta_{n-1}^{k+1} - \theta_{n-1}^{k}) + \rho(\theta_{n+1}^{k+1} - \theta_{n+1}^{k}). \tag{23}$$

From the update in (15), it holds that

$$0 \in \partial f_n(\theta_n^{k+1}) - \lambda_{n-1}^{k+1} + \lambda_n^{k+1} + \rho(\theta_{n-1}^{k+1} - \theta_{n-1}^{k}) + \rho(\theta_{n+1}^{k+1} - \theta_{n+1}^{k}). \tag{24}$$
(24) F ollowing the same steps for the first head work er ( n = 1) after excluding the terms λ k n − 1 and ρ ( θ k +1 n − θ k n − 1 ) from (22) (w orker 1 do es not hav e a left neighbor) gives 0 ∈ ∂ f n ( θ k +1 n ) + λ k +1 n + ρ ( θ k +1 n +1 − θ k n +1 ) . (25) Let s k +1 n ∈N h , the dual residual of w orker n ∈ N h at iteration k + 1, b e defined as follows s k +1 n = ρ ( θ k +1 n − 1 − θ k n − 1 ) + ρ ( θ k +1 n +1 − θ k n +1 ) , for n ∈ N h \ { 1 } ρ ( θ k +1 n +1 − θ k n +1 ) , for n = 1 . (26) Next, w e discuss about the primal feasibilit y condition in (16) at iteration k + 1. Let r k +1 n,n +1 = θ k +1 n − θ k +1 n +1 b e the primal residual of each work er n ∈ { 1 , · · · , N − 1 } . T o sho w the conv ergence of GADMM, we need to prov e that the conditions in (16) - (17) are satisfied for eac h work er n . W e ha ve already sho wn that the dual feasibility condition in (17) is alw ays satisfied for the tail work ers, and the dual residual of tail work ers is alwa ys zero. Therefore, to pro ve the conv ergence and the optimality of GADMM, we need to show that the r k n,n +1 for all n = 1 , · · · , N − 1 and s k n ∈N h con verge to zero, and P N n =1 f n ( θ k n ) conv erges to P N n =1 f n ( θ ? ) as k → ∞ . Now we are in p osition to in tro duce our first result in terms of Lemma 1. Lemma 1 F or the iter ates θ k +1 n gener ate d by A lgorithm 1, we have (i) Upp er b ound on the optimality gap N X n =1 [ f n ( θ k +1 n ) − f n ( θ ? )] ≤ − N − 1 X n =1 h λ k +1 n , r k +1 n,n +1 i + X n ∈N h h s k +1 n , θ ? n − θ k +1 n i . (27) (ii) L ower b ound on the optimality gap N X n =1 [ f n ( θ k +1 n ) − f n ( θ ? )] ≥ − N − 1 X n =1 h λ ? n , r k +1 n,n +1 i . (28) 10 shor t title The detailed pro of is provided in App endix A. The main idea for the pro of is to utilize the optimalit y of the up dates in (11) and (13) . W e deriv e the upp er b ound for the ob jectiv e function optimalit y gap in terms of the primal and dual residuals as stated in (27). T o get the lo wer b ound in (28) in terms of the primal residual, the definition of the Lagrangian is used at ρ = 0. The result in Lemma 1 is used to derive the main results in Theorem 2 of this pap er presented next. Theorem 2 When f n ( θ n ) is close d, pr op er, and c onvex for al l n , and the L agr angian L 0 has a sadd le p oint, for GADMM iter ates, it holds that (i) the primal r esidual c onver ges to zer o as k → ∞ . i.e., lim k →∞ r k n,n +1 = 0 , n ∈ { 1 , · · · , N − 1 } . (29) (ii) the dual r esidual c onver ges to zer o as k → ∞ . i.e., lim k →∞ s k n = 0 , n ∈ N h . (30) (iii) the optimality gap c onver ges to zer o as k → ∞ . i.e., lim k →∞ N X n =1 f n ( θ k n ) = N X n =1 f n ( θ ? ) . (31) Pro of The detailed pro of of Theorem 2 is provided in App endix B. There are three main steps to prov e conv ergence of the prop osed algorithm. F or a prop er, closed, and conv ex ob jectiv e function f n ( · ), with Lagrangian L 0 whic h has a saddle p oin t ( θ ? , { λ n } ∀ n ), w e define a Ly apunov function V k as V k = 1 /ρ N − 1 X n =1 λ k n − λ ? n 2 + ρ X n ∈N h \{ 1 } θ k n − 1 − θ ? 2 + ρ X n ∈N h θ k n +1 − θ ? 2 . (32) In the pro of, we sho w that V k is monotonically decreasing at each iteration k of the prop osed algorithm. This prop erty is then used to prov e that the primal residuals go to zero as k → ∞ whic h implies that r k n,n +1 → 0 for all n . Secondly , we prov e that the dual residuals conv erges to zero as k → ∞ whic h implies that s k n → 0 for all n ∈ N h . 
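As a practical aside, the primal residuals $r_{n,n+1}$ and the head dual residuals $s_n$ introduced above also serve as a stopping test when running Algorithm 1. A small sketch, reusing the 0-based indexing of the earlier toy loop (illustrative only, not part of the paper's analysis):

```python
# Residuals used to monitor GADMM: the primal residuals (16) and the dual
# residuals of the head workers (26); both should vanish as k grows.
import numpy as np

def residual_norms(theta_new, theta_old, rho):
    N = theta_new.shape[0]
    r = theta_new[:-1] - theta_new[1:]          # r_{n,n+1} = theta_n - theta_{n+1}
    s = []
    for n in range(0, N, 2):                    # head workers (even 0-based indices)
        s_n = rho * (theta_new[n + 1] - theta_old[n + 1])
        if n > 0:                               # interior heads also have a left neighbor
            s_n = s_n + rho * (theta_new[n - 1] - theta_old[n - 1])
        s.append(s_n)
    return np.linalg.norm(r), np.linalg.norm(np.stack(s))

# e.g. stop the loop once both norms fall below a tolerance such as 1e-6
```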
Note that the con vergence in the first and the second step implies that the ov erall constrain t violation due to the prop osed algorithm go es to zero as k → ∞ . In the final step, w e utilize statemen t (i) and (ii) of Theorem 2 into the results of Lemma 1 to pro ve that the ob jectiv e optimality gap go es to zero as k → ∞ . 11 authors Algorithm 2 Dynamic GADMM (D-GADMM) 1: Input : N , f n ( · ) for all n , ρ , and τ 2: Initialization : 3: N h = { θ n | n : o dd } , N t = { θ n | n : even } 4: θ (0) n = 0 , λ (0) n = 0, for all n 5: for k = 0 , 1 , 2 , · · · , K do 6: if k mo d τ = 0 then 7: Ev ery w orker: 8: broadcasts current mo del parameter 9: finds neigh b ors and refreshes indices { n } as explained in App endix-D. 10: sends λ k n to its righ t neighbor (work er n r,k ) 11: end if 12: Head w ork er n ∈ N h : 13: computes its primal v ariable θ k +1 n via (11) in parallel; and 14: sends θ k +1 n to its neigh b oring work ers n l,k and n r,k . 15: T ail w orker n ∈ N t : 16: computes its primal v ariable θ k +1 n via (13) in parallel; and 17: sends θ k +1 n to its neigh b or work ers n l,k and n r,k . 18: Ev ery w orker up dates the dual v ariables λ k n − 1 and λ k n via (15) lo cally . 19: end for 6. Extension to Time-V arying Net w ork: D-GADMM In this section, we present an extension of the prop osed GADMM algorithm to the scenario where the set of neighboring w orkers to eac h work er is v arying ov er time. Note that the o verla y logical top ology under consideration is still chain while the physical neigh b ors are allo wed to change. Under this dynamic setting, the execution of the prop osed GADMM in Algorithm 1 would b e disrupted. Therefore, we prop ose, D-GADMM (summarized in Algorithm 2) whic h adjusts to the changes in the set of neighbors. In D-GADMM, all the work ers p erio dically reconsider their connections after every τ iterations. if neighbors and/or work er assignmen t to head/tail group change, every w orker broadcasts its current mo del parameter to the new neigh b ors. W e assume that the work ers run an algorithm that can keep constructing a communication-efficien t logical c hain as the underlying ph ysical top ology changes, and the design of suc h an algorithm is not the main fo cus of the pap er. It is worth men tioning that a logical graph that starts at one work er and reac hes every other w orker only once in the most comm unication-efficient wa y is an NP-hard problem. It can b e easily shown that this problem can b e reduced to a T rav eling Salesman Problem (TSP). This is due to the fact that starting from one work er and choosing every next one such that the total communication cost is minimized is exactly equal to starting from one city and reaching every other cit y such that the total distance is minimized, i.e., the work ers in our problem are the cities in TSP , and the communication cost b etw een each pair of w orkers in our problem is the distance b et ween each pair of cities in TSP . Hence, prop osed heuristics to solve TSP (Lenstra and Kan, 1975; Bonomi and Lutton, 1984) can b e used to construct the c hain in our problem with the aid of a central entit y , and then 12 shor t title the algorithm contin ues working on the decen tralized wa y . Decen tralized heuristics for TSP ha ve b een prop osed whic h can also b e used (Peterson, 1990; Dorigo and Gambardella, 1997). Ho wev er, in this pap er, we use a simple decentralized heuristic that we describ e in App endix C. 
Finally , it is worth mentioning that D-GADMM can still b e utilized ev en if the physical top ology do es not c hange. In suc h a scenario, the w orkers can agree on a predefined sequence of logical chains, so changing neighbors do es not require running an online algorithm, and th us it encoun ters zero o verhead. W e observ e in section 7 that D-GADMM can improv e the con vergence sp eed when it is utilized even when the ph ysical top ology do es not change. The detailed steps of D-GADMM is describ ed in Algorithm 2. W e note that b efore the execution, w e assume that all the no des are connected to each other with a chain. Each no de is asso ciated with an index n and there exists a link from no de 1 to no de 2, no de 2 to node 3, and so on till N − 1 to N . F or each no de n there is an asso ciated primal v ariable θ n from n = 1 to N and a dual v ariable λ n from no de n = 1 to N − 1. Under the dynamic settings, we assume that the no des at p osition n = 1 and N are fixed while the other no des are allow ed to mov e in the netw ork. This means that instead of having the connection in the order 1 − 2 − 3 · · · N , the no des are allo wed to connect in any order such as 1 − 5 − 3 − · · · − 4 − N , or 1 − 4 − 2 − · · · − 5 − N , etc. Alternatively , the neighbors of eac h no de n are no longer fixed under the dynamic settings. T o denote this b ehavior, since the top ology is still the chain, we call the left neighbor to no de n as n l,k at iteration k and similarly n r,k for the right neighbor no de. Therefore, at each iteration k , each no de implemen ts the algorithm considering n l,k and n r,k as its neighbors. Note that, when the top ology changes at iteration k , every work er n transmits its right dual v ariable λ k n to its righ t neighbor in the new c hain to ensure that b oth neighbors share the same dual v ariable. Therefore, the right neighbor of eac h work er n will replace λ k n l,k with the dual v ariable that is received from its new left neigh b or. With that, we show in App endix D that the algorithm con verges to the optimal solution in a similar manner to GADMM. 7. Numerical Results T o v alidate our theoretical foundations, w e numerically ev aluate the performance of GADMM in linear and logistic regression tasks, compared with the follo wing b enchmark algorithms. • LA G-PS (Chen et al., 2018): A version of LAG where parameter server selects commu- nicating w orkers. • LA G-WK (Chen et al., 2018): A version of LA G where w orkers determine when to comm unicate with the server. • Cycle-IA G (Blatt et al., 2007; Gurbuzbalaban et al., 2017): A cyclic mo dified version of the incremen tal aggregated gradient (IAG). • R-IA G (Chen et al., 2018; Sc hmidt et al., 2017): A non-uniform sampling version of sto c hastic av erage gradient (SAG). • GD : Batch gradien t descent. • DGD (Nedi´ c et al., 2018) Decentralized gradient descent. 
• DualAvg (Duchi et al., 2011): Dual averaging.

For the tuning parameters, we use the setup in (Chen et al., 2018). For our decentralized algorithm, we consider N workers without any central entity, whereas for the centralized algorithms, a uniformly randomly selected worker acts as a central controller with a direct link to every other worker. The performance of each algorithm is measured using:

• The objective error $|\sum_{n=1}^{N} f_n(\theta_n^{(k)}) - f_n(\theta^*)|$ at iteration k.

• The total communication cost (TC). The TC of a decentralized algorithm is $\sum_{t=1}^{T_a}\sum_{n=1}^{N} \mathbb{1}_{n,t}\, L^{m}_{n,t}$, where $T_a$ is the number of iterations needed to achieve a target accuracy $a$, and $\mathbb{1}_{n,t}$ is an indicator that equals 1 if worker n sends an update at round t and 0 otherwise. The term $L^{m}_{n,t}$ is the cost of the communication link between workers n and m at communication round t. Letting $L^{c}_{n,t}$ denote the cost of the communication between worker n and the central controller at round t, the TC of a centralized algorithm is $\sum_{t=1}^{T_a}\big(L^{c}_{BC,t} + \sum_{n=1}^{N} \mathbb{1}_{n,t}\, L^{c}_{n,t}\big)$, where $L^{c}_{BC,t}$ and the $L^{c}_{n,t}$'s correspond to the downlink broadcast and uplink unicast costs, respectively. Note that the communication overhead reported in (Chen et al., 2018) accounts for uplink costs only. A sketch of this bookkeeping is given after Table 1 below.

• The total running time (clock time) to achieve objective error a. This metric accounts for both the communication and the local computation time.

We consider $L^{m}_{n,t} = L^{c}_{n,t} = L^{c}_{BC,t} = 1$ unless otherwise specified. All simulations are conducted using the synthetic and real datasets described in (Dua and Graff, 2017; Chen et al., 2018). The synthetic data for the linear and logistic regression tasks are generated as described in (Chen et al., 2018); we consider 1,200 samples with 50 features, which are evenly split across the workers.

Table 1: The required number of iterations (top) and total communication cost (bottom) to achieve the target objective error $10^{-4}$ for different numbers of workers, in linear and logistic regression with the real datasets.

             |          Linear Regression               |          Logistic Regression
Iterations   |   N=14     N=20      N=24       N=26     |   N=14      N=20      N=24      N=26
LAG-PS       |    542    8,043    54,249    141,132     |  21,183    20,038    19,871    20,544
LAG-WK       |    385    6,444    44,933    121,134     |  18,584    17,475    17,050    17,477
GADMM        |     78      292       558        550     |     120       235       112       160
GD           |    524    8,163    55,174    143,651     |   1,190     1,204     1,181     1,152

TC           |   N=14     N=20      N=24       N=26     |   N=14      N=20      N=24      N=26
LAG-PS       |  3,183   52,396   363,571  1,035,778     | 316,570   419,819   495,792   553,493
LAG-WK       |    820   12,369    82,985    241,944     |  18,786    17,835    17,432    17,915
GADMM        |  1,092    5,840    13,392     14,300     |     696     1,962     1,030     1,712
GD           |  7,860  171,423 1,379,350  3,878,577     |  17,850    25,284    29,525    31,104
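As referenced in the TC definition above, the following is a minimal sketch of the communication-cost bookkeeping (not the evaluation code used in the paper). Here `sent[t][n]` is a hypothetical boolean log that is true when worker n transmits at round t, and all link costs default to the unit-cost setting.

```python
# TC of a decentralized algorithm: sum_t sum_n 1_{n,t} * L^m_{n,t}
def tc_decentralized(sent, link_cost=lambda n, t: 1.0):
    return sum(link_cost(n, t)
               for t, row in enumerate(sent)
               for n, did_send in enumerate(row) if did_send)

# TC of a centralized algorithm: sum_t ( L^c_{BC,t} + sum_n 1_{n,t} * L^c_{n,t} )
def tc_centralized(sent, uplink_cost=lambda n, t: 1.0, broadcast_cost=lambda t: 1.0):
    return sum(broadcast_cost(t)
               + sum(uplink_cost(n, t) for n, did_send in enumerate(row) if did_send)
               for t, row in enumerate(sent))
```

Passing distance- or energy-based cost functions instead of the unit defaults recovers the generic-topology setting used later in Fig. 6.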
Figure 2: Objective error, total communication cost, and total running time comparison between GADMM and five benchmark algorithms, in linear regression with the synthetic (N = 24) dataset. [Panels: (a) objective error vs. number of iterations, (b) vs. cumulative communication cost, (c) vs. clock time; GADMM shown for ρ = 3, 5, 7.]

Figure 3: Objective error, total communication cost, and total running time comparison between GADMM and five benchmark algorithms, in linear regression with the real (N = 10) dataset. [Panels: (a) objective error vs. number of iterations, (b) vs. cumulative communication cost, (c) vs. clock time; GADMM shown for ρ = 3, 5, 7.]

Next, the real data tests linear and logistic regression tasks with the Body Fat (252 samples, 14 features) and Derm (358 samples, 34 features) datasets (Dua and Graff, 2017), respectively. As the real datasets are smaller than the synthetic dataset, we by default consider 10 workers for the real datasets and 24 for the synthetic dataset.

Figs. 2, 3, 4, and 5 corroborate that GADMM outperforms the benchmark algorithms by several orders of magnitude, thanks to the two alternating groups in which each worker communicates with at most two neighbors. For linear regression with the synthetic dataset, Fig. 2 shows that all variants of GADMM with ρ = 3, 5, and 7 achieve the target objective error of $10^{-4}$ in fewer than 1,000 iterations, whereas GD, LAG-PS, and LAG-WK (the closest among the baselines) require more than 40,000 iterations to reach the same target. Furthermore, the TC of GADMM with ρ = 3 and ρ = 5 is 6 and 9 times lower than that of LAG-WK, respectively. Table 1 shows similar results for different numbers of workers, except for linear regression with the smallest number of workers (14), in which LAG-WK achieves the lowest TC. We also observe from Figs. 2 and 3 that GADMM outperforms all baselines in terms of the total running time, thanks to its fast convergence.
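For linear regression, each local subproblem (11)/(13) admits a closed-form solution, which is where the per-iteration matrix inversion mentioned next comes from. The sketch below is an illustration only, assuming a least-squares local loss $f_n(\theta) = \frac{1}{2}\|X_n\theta - y_n\|^2$ with hypothetical local data $X_n, y_n$ and an interior worker; edge workers drop the terms of the missing neighbor, which replaces $2\rho$ by $\rho$ in the system matrix.

```python
# Closed-form local update for a least-squares loss: setting the gradient of
# the subproblem (11)/(13) to zero gives
#   (X_n^T X_n + 2*rho*I) theta = X_n^T y_n + lam_left - lam_right
#                                 + rho * (theta_left + theta_right).
import numpy as np

def local_ls_update(X_n, y_n, theta_left, theta_right, lam_left, lam_right, rho):
    d = X_n.shape[1]
    A = X_n.T @ X_n + 2 * rho * np.eye(d)
    b = X_n.T @ y_n + lam_left - lam_right + rho * (theta_left + theta_right)
    return np.linalg.solve(A, b)               # one d x d linear solve per iteration
```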
GADMM performs a matrix inversion at every iteration, which is computationally more expensive than evaluating a gradient; however, this per-iteration cost is compensated by the much faster convergence.

Figure 4: Objective error, total communication cost, and total running time comparison between GADMM and five benchmark algorithms, in logistic regression with the synthetic (N = 24) dataset. [Panels: (a) objective error vs. number of iterations, (b) vs. cumulative communication cost, (c) vs. clock time; GADMM shown for ρ = 2e-3, 3e-3.]

Figure 5: Objective error, total communication cost, and total running time comparison between GADMM and five benchmark algorithms, in logistic regression with the real (N = 10) dataset. [Panels: (a) objective error vs. number of iterations, (b) vs. cumulative communication cost, (c) vs. clock time; GADMM shown for ρ = 2e-2, 3e-2.]

For logistic regression, Figs. 4 and 5 validate that GADMM outperforms the benchmark algorithms, as in the linear regression case of Figs. 2 and 3. One point worth noting is shown in Fig. 4-(c): the total running time of GADMM is equal to that of GD, because the logistic regression subproblem is not solved in closed form at each iteration. Nevertheless, GADMM still significantly outperforms GD in communication efficiency.

Next, comparing the results in Fig. 2 and Fig. 3, we observe that the optimal ρ depends on how the data is distributed across the workers. When the local data samples of each worker are highly correlated with the other workers' samples (i.e., the Body Fat dataset, Fig. 3), the local optimum of each worker is very close to the global optimum; reducing the penalty for the disagreement between θ_n and θ_{n+1} by lowering ρ therefore yields faster convergence. By the same reasoning, a higher ρ provides faster convergence when the local data samples are independent across workers (i.e., the synthetic dataset, Fig. 2).
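As noted for Fig. 4-(c), the logistic subproblem has no closed form, so each worker runs a small inner solver at every iteration. A sketch, assuming labels in $\{-1,+1\}$, hypothetical local data $X_n, y_n$, and SciPy's L-BFGS-B as the inner solver (any smooth convex solver would do):

```python
# Local logistic subproblem (11)/(13) solved iteratively (interior worker).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit                         # numerically stable sigmoid

def local_logistic_update(X_n, y_n, theta_left, theta_right,
                          lam_left, lam_right, rho, theta_init):
    def obj_and_grad(theta):
        z = y_n * (X_n @ theta)
        loss = np.sum(np.logaddexp(0.0, -z))            # sum_i log(1 + exp(-z_i))
        loss += lam_left @ (theta_left - theta) + lam_right @ (theta - theta_right)
        loss += 0.5 * rho * (np.sum((theta_left - theta) ** 2)
                             + np.sum((theta - theta_right) ** 2))
        grad = -X_n.T @ (y_n * expit(-z))               # gradient of the logistic loss
        grad += -lam_left + lam_right + rho * (2 * theta - theta_left - theta_right)
        return loss, grad
    return minimize(obj_and_grad, theta_init, jac=True, method="L-BFGS-B").x
```

This inner loop is what makes GADMM's clock time comparable to GD's in Fig. 4-(c), even though far fewer communication rounds are needed.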
Figure 6: The cumulative distribution function (CDF) of the total communication cost (TC) in (a) linear and (b) logistic regression by 24 uniformly randomly distributed workers with 1,000 observations, and (c) the average consensus constraint violation (ACV) of GADMM in logistic regression by 4 workers. [Panels: (a) CDF of TC, linear regression, real data (Derm); (b) CDF of TC, logistic regression, real data (Derm); (c) ACV vs. iteration, logistic regression, real data (Derm).]

Fig. 6-(a) and (b) demonstrate that GADMM is communication efficient under different network topologies. The TC calculations of GADMM in Table 1 and Fig. 2 rely on a unit communication cost for all links, i.e., $L^{m}_{n,t} = L^{c}_{n,t} = L^{c}_{BC,t} = 1$, which may not capture the communication efficiency of GADMM under a generic network topology. Here we instead use the energy consumed per communication round as the communication cost metric and plot the CDF of TC over 1,000 different network topologies. At the beginning of each observation, 24 workers are randomly distributed over a 10 × 10 m² square area. In GADMM, the method described in Appendix C is used to construct the logical chain; in the centralized algorithms, the worker closest to the center becomes the central worker associated with all the other workers. We assume that the bandwidth is evenly divided among the workers and that each worker needs a bit rate of 10 Mbps to transmit its model within one time slot, so the communication cost per worker per iteration is the energy that the worker consumes to achieve a rate of 10 Mbps. According to Shannon's formula, the achievable rate is a function of the bandwidth and the transmit power, i.e., $R = B \log_2\!\big(P/(d^2 N_0 B)\big)$, where B is the bandwidth, P is the communication power, $N_0$ is the noise spectral density, and d is the distance between the transmitter and the receiver (McKeague, 1981); we assume a free-space communication link. In our simulations we set B = 2 MHz and $N_0 = 10^{-6}$, and for every link l and time slot t we compute the power (energy) required to achieve 10 Mbps, which defines the cost of using link l at time slot t.

The CDF results in Fig. 6-(a) and (b) show that, with high probability, GADMM achieves a much lower TC than the baseline algorithms in both linear and logistic regression under a generic network topology. On the other hand, Fig. 6-(c) validates that GADMM guarantees consensus on the model parameters of all workers when training converges; indeed, GADMM complies with the constraint θ_n = θ_{n+1} in (3). We observe in Fig.
4-(c) that the a verage consensus constraint violation (ACV), defined as P N − 1 n =1 | θ ( k ) n − θ ( k ) n +1 | / N , go es to zero with the n umber of iterations. Sp ecifically , A VC b ecomes 8 × 10 − 7 after 495 iterations at whic h the loss b ecomes 1 × 10 − 4 . This underpins that GADMM is robust against its consensus violations temp orarily at the early phase of training, thereby ac hieving the av erage consensus at the end. 17 authors 0 1000 2000 3000 4000 5000 Number of Iterations (a) 10 -6 10 -4 10 -2 10 0 10 2 10 4 Objective Error GADMM, ρ = 1 D-GADMM, ρ =1, coherence time =15 iter 0 0.5 1 1.5 2 Cumulative Communication Cost (b) × 10 8 10 -6 10 -4 10 -2 10 0 10 2 10 4 Objective Error GADMM, ρ =1 D-GADMM, ρ =1, coherence time =15 iter 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Clock Time (c) 10 -6 10 -4 10 -2 10 0 10 2 10 4 Objective Error GADMM, ρ =1 D-GADMM, ρ =1, coherence time =15 iter Figure 7: Ob jective error, total communication cost, and total running time of D-GADMM v ersus GADMM in linear regression with the synthetic dataset at ρ = 1, N = 50 W e now extend GADMM to D-GADMM, and ev aluate its p erformance under the time- v arying netw ork top ology . One note to mak e, in simulating D-GADMM, we do not exchange dual v ariables b etw een neighbors at every top ology change as describ ed in line 10, Algorithm 2. How ever, as w e will sho w, D-GADMM still conv erges. Therefore, the extra comm unication o verhead that might b e encoun tered in D-GADMM when work ers share their dual v ariables is a voided and the con vergence is still preserv ed. W e change the top ology ev ery 15 iterations. Therefore, w e assume that the sys tem coherence time is 15 iterations. T o sim ulate the c hange in the top ology , 50 work ers are randomly distributed ov er a 250 × 250 m 2 square area ev ery 15-th iteration. D-GADMM uses the metho d describ ed in app endix D which consumes 2 iterations (4 communication rounds) to build the chain. In contrast, GADMM keeps the logical work er connectivit y graph unchanged ev en when the underlying physical top ology c hanges. In linear regression with the synthetic dataset and 50 work ers, as observed in Fig. 7, ev en though D-GADMM consumes tw o iterations p er top ology change in building the chain, b oth the total num b er of iterations to ac hieve the ob jective error of 1 E − 4 and the TC of D-GADMM are significan tly reduced compared to GADMM. W e observe that by c hanging the neighboring set of eac h work er more frequently , the conv ergence sp eed is significan tly impro ved. Therefore, even for the static scenario in which the ph ysical top ology do es not c hange, reconstructing the logical chain every few iterations can significantly improv e the con vergence sp eed. W e finally compare b oth GADMM and D-GADMM with the standard ADMM whic h requires a parameter server (star top ology). Since the top ology do es not change, we replace “system coherence time” with “refresh rate”. Therefore, the ob jective of using D-GADMM is not to adapt to top ology changes, while to improv e the conv ergence sp eed of GADMM. T o compare the algorithms, we use 24 w orkers ( N = 24), and we randomly drop them ov er a 250 × 250 m 2 square area. F or standard ADMM, we use the work er that is closest to the cen ter of the grid as the parameter server. As observed from Fig. 8, compared to GADMM, standard ADMM requires few er iterations to ac hieve the ob jective error of 1 E − 4, but that comes at significantly higher comm unication cost as shown in Fig. 
8-(b) (4 times higher cost than GADMM). W e show that by randomly c hanging the logical connectivit y graph and utilizing D-GADMM, we can reduce the gap in the num b er of iterations b et ween GADMM and standard ADMM and significantly reduce 18 shor t title 0 500 1000 1500 2000 2500 3000 Number of Iterations (a) 10 -4 10 -3 10 -2 10 -1 10 0 10 1 10 2 10 3 Objective Error ADMM GADMM D-GADMM, refresh rate =1 D-GADMM, refresh rate =10 D-GADMM, refresh rate =50 0 1 2 3 4 5 6 7 8 9 Cumulative Communication Cost (b) × 10 10 10 -4 10 -3 10 -2 10 -1 10 0 10 1 10 2 10 3 Objective Error ADMM GADMM D-GADMM, refresh rate =1 D-GADMM, refresh rate =10 D-GADMM, refresh rate =50 Figure 8: Ob jective error, total comm unication cost, and total running time of D-GADMM, GADMM, and Standared ADMM in linear regression with the synthetic dataset at ρ = 1, N = 24 the communication cost. In fact, Fig. 8 sho ws that by c hanging the logical graph every iteration, D-GADMM conv erges faster than standard ADMM and ac hieves a communication cost that is 40 times less. It is worth mentioning that for static physical top ology , changing the logical graph comes at zero cost since work ers can agree on a predefined pseudorandom sequence in the graph changes. Therefore, every work er kno ws its neighbors in the next iteration. 8. Conclusions and F uture w ork In this pap er, we form ulate a constrained optimization problem for distributed mac hine learning applications, and prop ose a no vel decen tralized algorithm based on ADMM, termed Group ADMM (GADMM) to solv e this problem optimally for conv ex functions. GADMM is sho wn to maximize the communication efficiency of each w orker. Extensive simulations in linear and logistic regression with synthetic and real datasets show significan t improv emen ts in con vergence rate and comm unication ov erhead compared to the state-of-the-art algorithms. F urthermore, we extend GADMM to D-GADMM whic h accounts for time-v arying net work top ologies. Both analysis and sim ulations confirm that D-GADMM ac hieves the same con vergence guaran tees as GADMM with low er communication o verhead under the time- v arying top ology scenario. Constructing a comm unication-efficient logical c hain ma y not alw ays b e possible; therefore, extending the algorithm to ac hieve a lo w comm unication o verhead under an arbitrary top ology could b e an in teresting topic for future study . 19 authors App endix A. Proof of Lemma 1 Pr o of of statement (i): W e note that f n ( θ n ) for all n is closed, prop er, and con vex, hence L ρ is sub-differen tiable. Since θ k +1 n ∈N h minimizes L ρ ( θ n ∈N h , θ k n ∈N t , λ n ), the follo wing must hold true at eac h iteration k + 1 0 ∈ ∂ f n ( θ k +1 n ) − λ k +1 n − 1 + λ k +1 n + s k +1 n , n ∈ N h \ { 1 } (33) 0 ∈ ∂ f n ( θ k +1 n ) + λ k +1 n + s k +1 n , n = 1 (34) Note that we use (34) for w orker 1 since it do es not hav e a left neighbor ( i.e., λ k +1 0 is not defined). Ho wev er, for simplicity and to a void writing separate equations for the edge w orkers (work ers 1 and N ), we use: λ k +1 0 = λ k +1 N = 0 throughout the rest of the pro of. Therefore, w e can use a single equation for each group ( e.g., equation (33) for n ∈ N h ). The result in (33) implies that θ k +1 n for n ∈ N h minimizes the following con vex ob jective function f n ( θ n ) + h − λ k +1 n − 1 + λ k +1 n + s k +1 n ∈N h , θ n i . 
(35) Next, since θ k +1 n for n ∈ N h is the minimizer of (35), then, it holds that f n ( θ k +1 n ) + h − λ k +1 n − 1 + λ k +1 n + s k +1 n ∈N h , θ k +1 n i ≤ f n ( θ ? ) + h − λ k +1 n − 1 + λ k +1 n + s k +1 n ∈N h , θ ? i (36) where θ ? is the optimal v alue of the problem in (8) - (9) . Similarly for θ k +1 n for n ∈ N t satisfies (17) and it holds that f n ( θ k +1 n ) + h − λ k +1 n − 1 + λ k +1 n , θ k +1 n i ≤ f n ( θ ? ) + h − λ k +1 n − 1 + λ k +1 n , θ ? i . (37) Adding (36) and (37), and then taking the summation ov er all the w orkers, we get N X n =1 f n ( θ k +1 n ) + X n ∈N t h − λ k +1 n − 1 + λ k +1 n , θ k +1 n i + X n ∈N h h − λ k +1 n − 1 + λ k +1 n + s k +1 n ∈N h , θ k +1 n i ≤ N X n =1 f n ( θ ? ) + X n ∈N t h − λ k +1 n − 1 + λ k +1 n , θ ? i + X n ∈N h h − λ k +1 n − 1 + λ k +1 n + s k +1 n ∈N h , θ ? i (38) After rearranging the terms, w e get N X n =1 f n ( θ k +1 n ) − N X n =1 f n ( θ ? ) ≤ X n ∈N t h − λ k +1 n − 1 + λ k +1 n , θ ? i + X n ∈N h h − λ k +1 n − 1 + λ k +1 n , θ ? i − X n ∈N t h − λ k +1 n − 1 + λ k +1 n , θ k +1 n i − X n ∈N h h − λ k +1 n − 1 + λ k +1 n , θ k +1 n i + X n ∈N h h s k +1 n ∈N h , θ ? − θ k +1 n i . (39) 20 shor t title Note that, X n ∈N h h − λ k +1 n − 1 + λ k +1 n , θ n i = h λ k +1 1 , θ 1 i − h λ k +1 2 , θ 3 i + h λ k +1 3 , θ 3 i + · · · · · · − h λ k +1 N − 2 , θ N − 1 i + h λ k +1 N − 1 , θ N − 1 i , (40) and X n ∈N t h − λ k +1 n − 1 + λ k +1 n , θ n i = − h λ k +1 1 , θ 2 i + h λ k +1 2 , θ 2 i − h λ k +1 3 , θ 4 i + · · · · · · − h λ k +1 N − 1 , θ N i + h λ k +1 N − 1 , θ N i . (41) F rom (40) and (41), at θ k +1 n , it holds that X n ∈N t h − λ k +1 n − 1 + λ k +1 n , θ k +1 n i + X n ∈N h h − λ k +1 n − 1 + λ k +1 n , θ k +1 n i = h λ k +1 1 , θ k +1 1 − θ k +1 2 i + h λ k +1 2 , θ k +1 2 − θ k +1 3 i + · · · + h λ k +1 N − 1 , θ k +1 N − 1 − θ k +1 N i = h λ k +1 1 , r k +1 1 , 2 i + h λ k +1 2 , r k +1 2 , 3 i + · · · + h λ k +1 N − 1 , r k +1 N − 1 ,N i , (42) where for the second equality , we ha ve used the definition of primal residuals defined after (24). Similarly , it holds for θ ? that X n ∈N t h − λ k +1 n − 1 + λ k +1 n , θ ? i + X n ∈N h h − λ k +1 n − 1 + λ k +1 n , θ ? i (43) = h λ k +1 1 , θ ? i + h λ k +1 2 − λ k +1 1 , θ ? i + h λ k +1 3 − λ k +1 2 , θ ? i + · · · + h λ k +1 N − λ k +1 N − 1 , θ ? i = 0 . The equalit y in (43) holds since λ k +1 N = 0 . Next, substituting the results from (42) and (43) in to (39), we get N X n =1 f n ( θ k +1 n ) − N X n =1 f n ( θ ? ) ≤ − N − 1 X n =1 h λ k +1 n , r k +1 n,n +1 i + X n ∈N h h s k +1 n ∈N h , θ ? − θ k +1 n i , (44) whic h concludes the pro of of statement (i) of Lemma 1. Pr o of of statement (ii): The pro of of this Lemma is along the similar line as in (Boyd et al., 2011, A.3) but is provided here for completeness. W e note that for a saddle p oin t ( θ ? , { λ ? n } ∀ n ) of L 0 ( { θ n } ∀ n , { λ n } ∀ n ), it holds that L 0 ( θ ? , { λ ? n } ∀ n ) ≤ L 0 ( { θ k +1 n }∀ n, { λ ? n }∀ n ) (45) for all n . Substituting the expression for the Lagrangian from (10) on the b oth sides of (45) , w e get N X n =1 f n ( θ ? ) ≤ N X n =1 f n ( θ k +1 n ) + N − 1 X n =1 h λ ? n , r k +1 n,n +1 i . (46) After rearranging the terms, w e get N X n =1 f n ( θ k +1 n ) − N X n =1 f n ( θ ? ) ≥ − N − 1 X n =1 h λ ? n , r k +1 n,n +1 i (47) whic h is the statement (ii) of Lemma 1. 21 authors App endix B. Proof of Theorem 2 T o pro ceed with the analysis, add (44) and (47), multiply by 2, w e get 2 N − 1 X n =1 h λ k +1 n − λ ? 
n , r k +1 n,n +1 i + 2 X n ∈N h h s k +1 n , θ k +1 n − θ ? i ≤ 0 . (48) By applying, λ k +1 n = λ k n + ρ r k +1 n,n +1 obtained from the dual up date in (15) , (48) can b e recast as 2 N − 1 X n =1 h λ k n + ρ r k +1 n,n +1 − λ ? n , r k +1 n,n +1 i + 2 X n ∈N h h s k +1 n , θ k +1 n − θ ? i ≤ 0 . (49) Note that the first term on the left hand side of (49) can b e written as N − 1 X n =1 2 h λ k n − λ ? n , r k +1 n,n +1 i + ρ r k +1 n,n +1 2 + ρ r k +1 n,n +1 2 . (50) Replacing r k +1 n,n +1 in the first and second terms of (50) with λ k +1 n − λ k n ρ , w e get N − 1 X n =1 (2 /ρ ) h λ k n − λ ? n , λ k +1 n − λ k n i + (1 /ρ ) λ k +1 n − λ k n 2 + ρ r k +1 n,n +1 2 . (51) Using the equalit y λ k +1 n − λ k n = ( λ k +1 n − λ ? n ) − ( λ k n − λ ? n ), w e can rewrite (51) as N − 1 X n =1 (2 /ρ ) h λ k n − λ ? n , ( λ k +1 n − λ ? n ) − ( λ k n − λ ? n ) i + (1 /ρ ) ( λ k +1 n − λ ? n ) − ( λ k n − λ ? n ) 2 + ρ r k +1 n,n +1 2 = N − 1 X n =1 (2 /ρ ) h λ k n − λ ? n , λ k +1 n − λ ? n i − (2 /ρ ) λ k n − λ ? n 2 + (1 /ρ ) λ k +1 n − λ ? n 2 − (2 /ρ ) h λ k +1 n − λ ? n , λ k n − λ ? n i + 1 /ρ λ k n − λ ? n 2 + ρ r k +1 n,n +1 2 (52) = N − 1 X n =1 h (1 /ρ ) λ k +1 n − λ ? n 2 − (1 /ρ ) λ k n − λ ? n 2 + ρ r k +1 n,n +1 2 i . (53) Next, consider the second term on the left hand side of (49) . F rom the equalit y (26) , it holds that 2 X n ∈N h h s k +1 n , θ k +1 n − θ ? i (54) = X n ∈N h \{ 1 } 2 ρ h θ k +1 n − 1 − θ k n − 1 , θ k +1 n − θ ? i + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , θ k +1 n − θ ? i . 22 shor t title Note that θ k +1 n = − r k +1 n − 1 ,n + θ k +1 n − 1 = r k +1 n,n +1 + θ k +1 n +1 , ∀ n = { 2 , · · · , N − 1 } , which implies that w e can rewrite (54) as follows 2 X n ∈N h h s k +1 n , θ k +1 n − θ ? i = X n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + 2 ρ h θ k +1 n − 1 − θ k n − 1 , θ k +1 n − 1 − θ ? i + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i + 2 ρ h θ k +1 n +1 − θ k n +1 , θ k +1 n +1 − θ ? i . (55) Using the equalities, θ k +1 n − 1 − θ ? =( θ k +1 n − 1 − θ k n − 1 ) + ( θ k n − 1 − θ ? ) , ∀ n ∈ N h \ { 1 } θ k +1 n +1 − θ ? =( θ k +1 n +1 − θ k n +1 ) + ( θ k n +1 − θ ? ) , ∀ n ∈ N h (56) w e rewrite the right hand side of (55) as X n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + 2 ρ h θ k +1 n − 1 − θ k n − 1 , ( θ k +1 n − 1 − θ k n − 1 ) + ( θ k n − 1 − θ ? ) i + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i + 2 ρ h θ k +1 n +1 − θ k n +1 , ( θ k +1 n − θ k n +1 ) + ( θ k n +1 − θ ? ) i = X n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + 2 ρ θ k +1 n − 1 − θ k n − 1 2 + 2 ρ h θ k +1 n − 1 − θ k n − 1 , θ k n − 1 − θ ? i + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i + 2 ρ θ k +1 n +1 − θ k n +1 2 + 2 ρ h θ k +1 n +1 − θ k n +1 , θ k n +1 − θ ? n +1 i . (57) F urther using the equalities θ k +1 n − 1 − θ k n − 1 =( θ k +1 n − 1 − θ ? ) − ( θ k n − 1 − θ ? ) , ∀ n ∈ N h \ { 1 } θ k +1 n +1 − θ k n +1 =( θ k +1 n +1 − θ ? ) − ( θ k n +1 − θ ? ) , ∀ n ∈ N h (58) w e can write (57) as X n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + 2 ρ θ k +1 n − 1 − θ k n − 1 2 + 2 ρ h ( θ k +1 n − 1 − θ ? ) − ( θ k n − 1 − θ ? ) , θ k n − 1 − θ ? i + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i + 2 ρ θ k +1 n +1 − θ k n +1 2 + 2 ρ h ( θ k +1 n +1 − θ ? ) − ( θ k n +1 − θ ? ) , θ k n +1 − θ ? i (59) = X n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + 2 ρ θ k +1 n − 1 − θ k n − 1 2 + 2 ρ h θ k +1 n − 1 − θ ? 
, θ k n − 1 − θ ? i − 2 ρ θ k n − 1 − θ ? 2 + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i + 2 ρ θ k +1 n +1 − θ k n +1 2 + 2 ρ h θ k +1 n +1 − θ ? , θ k n +1 − θ ? i − 2 ρ θ k n +1 − θ ? 2 . (60) 23 authors After rearranging the terms, w e can write = X n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + ρ θ k +1 n − 1 − θ k n − 1 2 + ρ ( θ k +1 n − 1 − θ ? ) − ( θ k n − 1 − θ ? ) 2 + 2 ρ h θ k +1 n − 1 − θ ? , θ k n − 1 − θ ? i − 2 ρ k θ k n − 1 − θ ? k 2 2 + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i + ρ θ k +1 n +1 − θ k n +1 2 + ρ ( θ k +1 n +1 − θ ? ) − ( θ k n +1 − θ ? ) 2 + 2 ρ h θ k +1 n +1 − θ ? , θ k n +1 − θ ? i − 2 ρ θ k n +1 − θ ? 2 . (61) Next, expanding the square terms in (61), w e get 2 X n ∈N h h s k +1 n , θ k +1 n − θ ? i = X n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + ρ θ k +1 n − 1 − θ k n − 1 2 (62) + ρ θ k +1 n − 1 − θ ? 2 − ρ θ k n − 1 − θ ? 2 + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i + ρ θ k +1 n +1 − θ k n +1 2 + ρ θ k +1 n +1 − θ ? 2 − ρ θ k n +1 − θ ? 2 . Substituting the equalities from (53) and (62) to the left hand side of (49), we obtain N − 1 X n =1 h (1 /ρ ) λ k +1 n − λ ? n 2 − (1 /ρ ) λ k n − λ ? n 2 + ρ r k +1 n,n +1 2 i + X n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + ρ θ k +1 n − 1 − θ k n − 1 2 (63) + ρ θ k +1 n − 1 − θ ? 2 − ρ θ k n − 1 − θ ? 2 + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i + ρ θ k +1 n +1 − θ k n +1 2 + ρ θ k +1 n +1 − θ ? 2 − ρ θ k n +1 − θ ? 2 ≤ 0 . (64) Multiplying b oth the sides by − 1, we get N − 1 X n =1 h − (1 /ρ ) λ k +1 n − λ ? n 2 + (1 /ρ ) λ k n − λ ? n 2 − ρ r k +1 n,n +1 2 i − X n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + ρ θ k +1 n − 1 − θ k n − 1 2 (65) + ρ θ k +1 n − 1 − θ ? 2 − ρ θ k n − 1 − θ ? 2 + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i + ρ θ k +1 n +1 − θ k n +1 2 + ρ θ k +1 n +1 − θ ? 2 − ρ θ k n +1 − θ ? 2 ≥ 0 , (66) 24 shor t title After rearranging the terms in (65) and using the definition of the Ly apunov function in (32), w e get V k +1 ≤ V k − N − 1 X n =1 ρ r k +1 n,n +1 2 − h X n ∈N h \{ 1 } ρ θ k +1 n − 1 − θ k n − 1 2 + X n ∈N h ρ θ k +1 n +1 − θ k n +1 2 i − h X n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i i . (67) In order to prov e that k + 1 is a one step tow ards the optimal solution or the Lyapuno v function decreases monotonically at eac h iteration, we need to sho w that the sum of the inner pro duct terms on the righ t hand side of the inequality is p ositive. In other words, we need to pro ve that the term P n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + P n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i is alw ays p ositive. Note that this te rm can b e written as. X n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i (68) =2 ρ h h r k +1 1 , 2 , θ k +1 2 − θ k 2 i − h r k +1 2 , 3 , θ k +1 2 − θ k 2 i + h r k +1 3 , 4 , θ k +1 4 − θ k 4 i − h r k +1 4 , 5 , θ k +1 4 − θ k 4 i + · · · + r k +1 N − 1 ,N ( θ k +1 N − θ k N ) i =2 ρ h r k +1 1 , 2 − r k +1 2 , 3 , θ k +1 2 − θ k 2 i + 2 ρ h r k +1 3 , 4 − r k +1 4 , 5 , θ k +1 4 − θ k 4 i + · · · + 2 ρ h r k +1 N − 1 ,N , θ k +1 N − θ k N i . 
W e know that θ k +1 n ∈N t minimizes f n ( θ n ) + h − λ k +1 n − 1 + λ k +1 n , θ n i ; hence it holds that f n ( θ k +1 n ) + h − λ k +1 n − 1 + λ k +1 n , θ k +1 n i ≤ f n ( θ k n ) + h − λ k +1 n − 1 + λ k +1 n , θ k n i , (69) Similarly , θ k n ∈N t minimizes f n ( θ n ) + h − λ k n − 1 + λ k n , θ n i , whic h implies that f n ( θ k n ) + h − λ k n − 1 + λ k n , θ k n i ≤ f n ( θ k +1 n ) + h − λ k n − 1 + λ k n , θ k +1 n i . (70) Adding (69) and (70), we get h ( − λ k +1 n − 1 + λ k +1 n ) − ( − λ k n − 1 + λ k n ) , θ k +1 n − θ k n i ≤ 0 . (71) F urther after rearranging, we get h ( λ k n − 1 − λ k +1 n − 1 ) + ( λ k +1 n − λ k n ) , θ k +1 n − θ k n i ≤ 0 . (72) By kno wing that r k +1 n − 1 ,n = (1 /ρ )( λ k +1 n − 1 − λ k n − 1 ) and r k +1 n,n +1 = (1 /ρ )( λ k +1 n − λ k n ), (72) can b e written as − ρ h r k +1 n − 1 ,n − r k +1 n,n +1 , θ k +1 n − θ k n i ≤ 0 , ∀ n ∈ N t . (73) The ab o ve inequality implies that ρ h r k +1 n − 1 ,n − r k +1 n,n +1 , θ k +1 n − θ k n i ≥ 0 , ∀ n ∈ N t . (74) 25 authors Note that since w orker N does not hav e a righ t neighbor, r k +1 N ,N +1 = λ k +1 N = λ k N = 0. Next, for ρ > 0. Using the inequalit y in (74) into (68), we get X n ∈N h \{ 1 } − 2 ρ h θ k +1 n − 1 − θ k n − 1 , r k +1 n − 1 ,n i + X n ∈N h 2 ρ h θ k +1 n +1 − θ k n +1 , r k +1 n,n +1 i ≥ 0 . (75) Next, w e use the result in (75) into (67) to get V k +1 ≤ V k − N − 1 X n =1 ρ r k +1 n,n +1 2 − h X n ∈N h \{ 1 } ρ θ k +1 n − 1 − θ k n − 1 2 + X n ∈N h ρ θ k +1 n +1 − θ k n +1 2 i . (76) The result in (76) pro ves that V k +1 decreases with k . Now, since V k ≥ 0 and V k ≤ V 0 , it holds that " P N − 1 n =1 ρ r k +1 n,n +1 2 + h P n ∈N h \{ 1 } ρ θ k +1 n − 1 − θ k n − 1 2 + P n ∈N h ρ θ k +1 n +1 − θ k n +1 2 i # . is b ounded. T aking the telescopic sum ov er k in (76) as limit K → ∞ , we get lim K →∞ K X k =0 " N − 1 X n =1 ρ r k +1 n,n +1 2 + h X n ∈N h \{ 1 } ρ θ k +1 n − 1 − θ k n − 1 2 + X n ∈N h ρ θ k +1 n +1 − θ k n +1 2 i # ≤ V 0 . (77) The result in (77) implies that the primal residual r k +1 n,n +1 → 0 as k → ∞ for all n ∈ { 1 , · · · , N − 1 } completing the pro of of statement (i) in Theorem 2. Similarly , the norm differences θ k +1 n − 1 − θ k n − 1 and θ k +1 n +1 − θ k n +1 → 0 as k → ∞ whic h implies that the dual residual s k n → 0 as k → ∞ for all n ∈ N h stated in the statement (ii) of Theorem 2. In order to pro ve the statement (iii) of Theorem 2, consider the lo wer and the upp er b ounds on the ob jectiv e function optimality gap given by N X n =1 [ f n ( θ k +1 n ) − f n ( θ ? )] ≤ − N − 1 X n =1 h λ k +1 n , r k +1 n,n +1 i + X n ∈N h h s k +1 n , θ ? − θ k +1 n i (78) N X n =1 [ f n ( θ k +1 n ) − f n ( θ ? )] ≥ − N − 1 X n =1 h λ ? n , r k +1 n,n +1 i . (79) Note that from the results in statement (i) and (ii) of Theorem 2, it holds that the right hand side of the upp er b ound in (78) con verge to zero as k → ∞ and also the right hand side of the lo wer b ound in (79) conv erges to zero as k → ∞ . This implies that lim k →∞ N X n =1 [ f n ( θ k +1 n ) − f n ( θ ? )] = 0 (80) whic h is the statement (iii) of Theorem 2. 26 shor t title App endix C. Method for D-GADMM Chain Construction T o ensure that a chain in a given graph is found in a decen tralized wa y , we use the follo wing metho d. 
• The N w orkers ( N is assumed to b e even) share a pseudorandom co de that is used ev ery τ seconds, where τ is the system coherence time, to generate a set H con taining ( N / 2 − 2) in teger num b ers, with each num b er chosen from the set { 2 , · · · , N − 1 } . In other w ords, we assume that the top ology c hanges every τ seconds. Note that the generated num b ers at i · τ and ( i + 1) · τ time slots may differ. Ho w ever, at the i · τ -th time slot, the same set of n umbers is generated across w orkers with no communication. • Ev ery work er with physical index n ∈ H ∪ { 1 } is assigned to the head set. Note that the work er whose ph ysical index 1 is alw ays assigned to the head set. On the other hand, every work er with ph ysical index n suc h that n / ∈ H ∪ { 1 } assigns itself to the tail set. Therefore, the work er whose ph ysical index N is alwa ys assigned to the tail set. F ollowing this strategy , the num b er of heads will b e equal to the num b er of tails, and are b oth equal to N / 2. • Ev ery work er in the head set broadcasts its physical index alongside a pilot signal. Pilot signal is a signal kno wn to eac h work er. It is used to measure the signal strength and find neigh b oring work ers. • Ev ery work er in the tail set calculates its cost of communication to every head based on the received signal strength. F or example, the comm unication cost b et ween head n and tail m is equal to [1/p ow er of the received signal from head n to tail m ], in whic h the link with low er receiv ed signal lev el is more costly , as it is incuring higher transmission p o wer. • Ev ery work er in the tail set broadcasts a vector of length N / 2, containing the com- m unication cost to the N / 2 heads, i.e., the first elemen t in the vector captures the comm unication cost b et ween this tail and w orker 1, since work er 1 is alw ays in the head set, whereas the second elemen t represents the comm unication cost b et ween this tail and the first index in the head set H and so on. • Once head w orker n ∈ H ∪ { 1 } receiv es the communication cost vector from tail w orkers, it finds a comm unication-efficient chain that starts from w orker 1 and passes through every other w orker to reach w ork er N . In our sim ulations, w e use the following simple and greedy strategy that is p erformed b y every head to ensure they generate the same c hain. The strategy is as follows: – Find the tail that has the minimum comm unication cost to work er 1 and create a link b et ween this tail and work er 1. – F rom the remaining set of heads, find the head that has the minim um communi- cation cost to this tail and create a link b etw een this head and the corresp onding tail. – F ollow the same strategy un til all wor kers are connected. When ev ery head follo ws this strategy , all heads will generate the same chain. 27 authors – Under the following tw o assumptions: (i) the comm unication cost b etw een any pair of work ers is < ∞ , and (ii) no tw o tails hav e equal communication cost to the same head, this strategy guarantees that every head will generate the same c hain. • Once ev ery head finds its c hain, all neighbors share their curren t mo dels, and D- GADMM is carried out for τ seconds using the current chain. Note that, the describ ed heuristic requires 4 communication rounds (2 iterations). Finally , it is worth mentioning that this approac h has no guarantee to find the most communication- efficien t chain. 
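A minimal sketch of this greedy construction follows. It is a simplified, centralized rendering of the decentralized procedure above: the pairwise communication cost matrix is assumed given and symmetric (e.g., the inverse of the received signal power), the chain alternates head, tail, head, tail starting from worker 1 (index 0 here), and no particular terminal worker is enforced.

```python
import random

def build_chain(cost, heads, tails):
    # `heads` must contain worker 0; the chain picks the cheapest remaining tail,
    # then the cheapest remaining head, and so on, exactly as in the bullets above.
    chain = [0]
    free_heads, free_tails = set(heads) - {0}, set(tails)
    while free_heads or free_tails:
        pool = free_tails if len(chain) % 2 == 1 else free_heads
        nxt = min(pool, key=lambda j: cost[chain[-1]][j])   # cheapest next hop
        chain.append(nxt)
        pool.remove(nxt)
    return chain

# Example with N = 6 workers, heads = even indices, tails = odd indices
N = 6
cost = [[random.random() for _ in range(N)] for _ in range(N)]
print(build_chain(cost, heads=range(0, N, 2), tails=range(1, N, 2)))
```

Because every head applies the same deterministic rule to the same cost vectors, all heads reconstruct the same chain, as stated above.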
As mentioned in section 6, our fo cus in this pap er is not to design the c hain construction algorithm. App endix D. Con vergence Analysis of D-GADMM F or the dynamic settings, we assume that the first n = 1 and the last no de n = N are fixed and the others can mov e at eac h iterate. Therefore, w e denote the neigh b ors to each no de n at iteration k as n l,k and n r,k as the left and the right neighbors , resp ectiv ely . Note that when, n l,k = n − 1 and n r,k = n + 1 for all k , w e recov er the GADMM implemen tation. With that in mind, we start b y writing the augmen ted Lagrangian of the optimization problem in (8)-(9) at eac h iteration k as L k ρ ( { θ n } N n =1 , λ ) = N X n =1 f n ( θ n ) + N − 1 X n =1 h λ n , θ n − θ n r,k i + ρ 2 N − 1 X n =1 k θ n − θ n +1 k 2 , (81) where λ := [ λ T 1 , · · · , λ T N − 1 ] T is the collection of dual v ariables. Note that the set of no des in head N k h and tail N k t will change with k .6 The primal and dual v ariables under GADMM are up dated in the following three steps. The mo dified algorithm up dates are written as 1. At iteration k + 1, the primal variables of he ad workers are up dated as: θ k +1 n = arg min θ n f n ( θ n )+ h λ k n l,k , θ k n l,k − θ n i + h λ k n , θ n − θ k n r,k i + ρ 2 k θ k n l,k − θ n k 2 + ρ 2 k θ n − θ k n r,k k 2 , n ∈ N h \ { 1 } (82) Since the first head work er ( n = 1) do es not hav e a left neighbor ( θ n − 1 is not defined), its mo del is up dated as follows. θ k +1 n = arg min θ n f n ( θ n ) + h λ k n , θ n − θ k n r,k i + ρ 2 k θ n − θ k n l,k k 2 , n = 1 (83) 2. After the up dates in (82) and (83) , head w orkers send their up dates to their tw o tail neigh b ors. The primal variables of tail workers are then up dated as: θ k +1 n = arg min θ n f n ( θ n )+ h λ k n l,k , θ k +1 n l,k − θ n i + h λ k n , θ n − θ k +1 n r,k i + ρ 2 k θ k +1 n l,k − θ n k 2 + ρ 2 k θ n − θ k +1 n r,k k 2 , n ∈ N t \ { N } . (84) 28 shor t title Since the last tail w orker ( n = N ) do es not ha ve a right neighbor ( θ n +1 is not defined), its mo del is up dated as follows θ k +1 n = arg min θ n f n ( θ n )+ h λ k n l,k , θ k +1 n l,k − θ n i + ρ 2 k θ k +1 n l,k − θ n k 2 , n = N . (85) 3. After receiving the up dates from neighbors, every worker lo c al ly up dates its dual variables λ n − 1 and λ n as follo ws λ k +1 n = λ k n + ρ ( θ k +1 n − θ k +1 n r,k ) , n = { 1 , · · · , N − 1 } . (86) Note that when the top ology c hanges, λ k n of work er n is received from the left neigh b or n l,k b efore up dating λ k +1 n according to (86) . F or the pro of, we start with the necessary and sufficien t optimalit y conditions, whic h are the primal and the dual feasibility conditions (Bo yd et al., 2011) for each k are defined as θ ? n = θ ? n l,k , n ∈ { 2 , · · · , N } (primal feasibilit y) (87) 0 ∈ ∂ f n ( θ ? n ) − λ ? n l,k + λ ? n , n ∈ { 2 , · · · , N − 1 } 0 ∈ ∂ f n ( θ ? n ) + λ ? n , n = 1 (dual feasibilit y) 0 ∈ ∂ f n ( θ ? n ) + λ ? n − 1 , n = N (88) W e remark that the optimal v alues θ ? n are equal for each n , we denote θ ? = θ ? n = θ ? n − 1 for all n . Note that, at iteration k + 1, w e calculate θ k +1 n for all n ∈ N k t \ { N } as in (13) , from the first order optimalit y condition, it holds that 0 ∈ ∂ f n ( θ k +1 n ) − λ k n l,k + λ k n + ρ ( θ k +1 n − θ k +1 n l,k ) + ρ ( θ k +1 n − θ k +1 n r,k ) . (89) Next, rewrite the equation in (89) as 0 ∈ ∂ f n ( θ k +1 n ) − λ k n l,k + ρ ( θ k +1 n l,k − θ k +1 n ) + λ k n + ρ ( θ k +1 n − θ k +1 n r,k ) . 
(90) F rom the up date in (86), the equation in (90) implies that 0 ∈ ∂ f n ( θ k +1 n ) − λ k +1 n l,k + λ k +1 n , n ∈ N k t \ { N } . (91) Note that for the N -th w orker, we calculate θ k +1 N as in (14) , then we follow the same steps, and w e get 0 ∈ ∂ f N ( θ k +1 N ) − λ k +1 N l,k . (92) F rom the result in (91) and (92) , it holds that the dual feasibility condition in (88) is alwa ys satisfied for all n ∈ N k t . Next, consider ev ery θ k +1 n suc h that n ∈ N k h \ { 1 } whic h is calculated as in (82) at iteration k . Similarly from the first order optimality condition, we can write 0 ∈ ∂ f n ( θ k +1 n ) − λ k n l,k + λ k n + ρ ( θ k +1 n − θ k n l,k ) + ρ ( θ k +1 n − θ k n r,k ) . (93) 29 authors Note that in (93) , we don’t hav e all the primal v ariables calculated at k + 1 instance. Hence, w e add and subtract the terms θ k +1 n l,k and θ k +1 n r,k in (93) to get 0 ∈ ∂ f n ( θ k +1 n ) − λ k n l,k + ρ ( θ k +1 n l,k − θ k +1 n ) + λ k n + ρ ( θ k +1 n − θ k +1 n r,k ) + ρ ( θ k +1 n l,k − θ k n l,k ) + ρ ( θ k +1 n r,k − θ k n r,k ) . (94) F rom the up date in (86), it holds that 0 ∈ ∂ f n ( θ k +1 n ) − λ k +1 n l,k + λ k +1 n + ρ ( θ k +1 n l,k − θ k n l,k ) + ρ ( θ k +1 n r,k − θ k n r,k ) . (95) F ollowing the same steps for the first head w orker ( n = 1) after excluding the terms λ k 0 and ρ ( θ k +1 1 − θ k 0 ) from (93) (w orker 1 do es not hav e a left neighbor) gives 0 ∈ ∂ f 1 ( θ k +1 1 ) + λ k +1 1 + ρ ( θ k +1 1 r,k − θ k 1 r,k ) . (96) Let s k +1 n , the dual residual of w orker n ∈ N k h at iteration k + 1, b e defined as follows s k +1 n = ( ρ ( θ k +1 n l,k − θ k n l,k ) + ρ ( θ k +1 n r,k − θ k n r,k ) , for n ∈ N k h \ { 1 } ρ ( θ k +1 n r,k − θ k n r,k ) , for n = 1 . (97) Next, w e discuss about the primal feasibilit y condition in (87) at iteration k + 1. Let r k +1 n,n r,k = θ k +1 n − θ k +1 n r,k b e the primal residual of each work er n ∈ { 1 , · · · , N − 1 } . T o show the conv ergence of GADMM, we need to prov e that the conditions in (87) - (88) are satisfied for eac h work er n . W e ha ve already sho wn that the dual feasibility condition in (88) is alw ays satisfied for the tail work ers, and the dual residual of tail work ers is alwa ys zero. Therefore, to pro ve the conv ergence and the optimality of GADMM, we need to show that the r k n,n r,k for all n = 1 , · · · , N − 1 and s k n ∈N k h con verge to zero, and P N n =1 f n ( θ k n ) conv erges to P N n =1 f n ( θ ? ) as k → ∞ . W e pro ceed as follows to prov e the same. W e note that f n ( θ n ) for all n is closed, prop er, and conv ex, hence L k ρ is sub-differentiable. Since θ k +1 n for n ∈ N k h at k minimizes L k ρ ( θ n ∈N h , θ k n ∈N t , λ n ), the following must hold true at eac h iteration k + 1, which implies that 0 ∈ ∂ f n ( θ k +1 n ) − λ k +1 n l,k + λ k +1 n + s k +1 n , n ∈ N k h \ { 1 } (98) 0 ∈ ∂ f 1 ( θ k +1 1 ) + λ k +1 1 + s k +1 1 , n = 1 (99) Note that we use (99) for w orker 1 since it do es not hav e a left neighbor ( i.e., λ k +1 0 is not defined). Ho wev er, for simplicity and to a void writing separate equations for the edge w orkers (work ers 1 and N ), we use: λ k +1 0 = λ k +1 N = 0 throughout the rest of the pro of. Therefore, w e can use a single equation for each group ( e.g., equation (33) for n ∈ N k h ). The result in (98) implies that θ k +1 n for n ∈ N k h minimizes the follo wing conv ex ob jec tiv e function f n ( θ n ) + h − λ k +1 n l,k + λ k +1 n + s k +1 n , θ n i . 
(100) 30 shor t title Next, since θ k +1 n for n ∈ N k h is the minimizer of (100), then, it holds that f n ( θ k +1 n ) + h − λ k +1 n l,k + λ k +1 n + s k +1 n , θ k +1 n i ≤ f n ( θ ? ) + h − λ k +1 n l,k + λ k +1 n + s k +1 n , θ ? i (101) where θ ? is the optimal v alue of the problem in (8) - (9) . Similarly for θ k +1 n for n ∈ N k t satisfies (88) and it holds that f n ( θ k +1 n ) + h − λ k +1 n l,k + λ k +1 n , θ k +1 n i ≤ f n ( θ ? ) + h − λ k +1 n l,k + λ k +1 n , θ ? i . (102) Add (101) and (102) , and then tak e the summation ov er all the w orkers, note that for a giv en k , the top ology in the netw ork is fixed, we get N X n =1 f n ( θ k +1 n ) + X n ∈N k t h − λ k +1 n l,k + λ k +1 n , θ k +1 n i + X n ∈N k h h − λ k +1 n l,k + λ k +1 n + s k +1 n , θ k +1 n i ≤ N X n =1 f n ( θ ? ) + X n ∈N k t h − λ k +1 n l,k + λ k +1 n , θ ? i + X n ∈N k h h − λ k +1 n l,k + λ k +1 n + s k +1 n , θ ? i (103) After rearranging the terms, w e get N X n =1 f n ( θ k +1 n ) − N X n =1 f n ( θ ? ) ≤ X n ∈N k t h − λ k +1 n l,k + λ k +1 n , θ ? i + X n ∈N k h h − λ k +1 n l,k + λ k +1 n , θ ? i − X n ∈N k t h − λ k +1 n l,k + λ k +1 n , θ k +1 n i − X n ∈N k h h − λ k +1 n l,k + λ k +1 n , θ k +1 n i + X n ∈N k h h s k +1 n , θ ? − θ k +1 n i . (104) Note that w e can write X n ∈N t t h − λ k +1 n l,k + λ k +1 n , θ k +1 n i + X n ∈N k h h − λ k +1 n l,k + λ k +1 n , θ k +1 n i = N − 1 X n =1 h λ k +1 n , r k +1 n,n k r i , (105) where for the equality , we hav e used the definition of primal residuals defined after (95) . Similarly , it holds for θ ? as X n ∈N k t h − λ k +1 n l,k + λ k +1 n , θ ? i + X n ∈N k h h − λ k +1 n l,k + λ k +1 n , θ ? i = 0 . (106) The equality in (106) holds since λ k +1 N = 0 . Next, substituting the results from (105) and (106) in to (104), we get N X n =1 f n ( θ k +1 n ) − N X n =1 f n ( θ ? ) ≤ − N − 1 X n =1 h λ k +1 n , r k +1 n,n r,k i + X n ∈N k h h s k +1 n , θ ? − θ k +1 n i , (107) 31 authors whic h pro vides an upp er b ound on the optimality gap. Next, we get the low er b ound as follo ws. W e note that for a saddle p oin t ( θ ? , { λ ? n } ∀ n ) of L 0 ( { θ n } ∀ n , { λ n } ∀ n ), it holds that L 0 ( θ ? , { λ ? n } ∀ n ) ≤ L 0 ( { θ k +1 n } ∀ n , { λ ? n } ∀ n ) . (108) Substituting the expression for the Lagrangian from (81) on the b oth sides of (108) , w e get N X n =1 f n ( θ ? ) ≤ N X n =1 f n ( θ k +1 n ) + N − 1 X n =1 h λ ? n , r k +1 n,n r,k i . (109) After rearranging the terms, w e get N X n =1 f n ( θ k +1 n ) − N X n =1 f n ( θ ? ) ≥ − N − 1 X n =1 h λ ? n , r k +1 n,n r,k i (110) whic h provide the lo wer b ound on the optimality gap. Next, we show that b oth the low er and upp er b ound conv erges to zero as → ∞ . This would prov e that the optimality gap con verges to zero with k → ∞ . T o pro ceed with the analysis, add (107) and (110), multiply by 2, w e get 2 N − 1 X n =1 h λ k +1 n − λ ? n , r k +1 n,n r,k i + 2 X n ∈N k h h s k +1 n , θ k +1 n − θ ? i ≤ 0 . (111) F rom the dual up date in (86), we hav e λ k +1 n = λ k n + ρ r k +1 n,n r,k and (111) can b e written as 2 N − 1 X n =1 h λ k n + ρ r k +1 n,n r,k − λ ? n , r k +1 n,n r,k i + 2 X n ∈N k h h s k +1 n , θ k +1 n − θ ? i ≤ 0 . (112) Note that the first term on the left hand side of (112) can b e written as N − 1 X n =1 2 h λ k n − λ ? n , r k +1 n,n r,k i + ρ r k +1 n,n r,k 2 + ρ r k +1 n,n r,k 2 . (113) Replacing r k +1 n,n r,k in the first and second terms of (113) with λ k +1 n − λ k n ρ , w e get N − 1 X n =1 (2 /ρ ) h λ k n − λ ? 
n , λ k +1 n − λ k n i + (1 /ρ ) λ k +1 n − λ k n 2 + ρ r k +1 n,n r,k 2 . (114) Using the equalit y λ k +1 n − λ k n = ( λ k +1 n − λ ? n ) − ( λ k n − λ ? n ), w e can write (114) as N − 1 X n =1 (2 /ρ ) h λ k n − λ ? n , ( λ k +1 n − λ ? n ) − ( λ k n − λ ? n ) i + (1 /ρ ) ( λ k +1 n − λ ? n ) − ( λ k n − λ ? n ) 2 + ρ r k +1 n,n r,k 2 = N − 1 X n =1 (2 /ρ ) h λ k n − λ ? n , λ k +1 n − λ ? n i − (2 /ρ ) λ k n − λ ? n 2 + (1 /ρ ) λ k +1 n − λ ? n 2 − (2 /ρ ) h λ k +1 n − λ ? n , λ k n − λ ? n i + 1 /ρ λ k n − λ ? n 2 + ρ r k +1 n,n r,k 2 (115) = N − 1 X n =1 h (1 /ρ ) λ k +1 n − λ ? n 2 − (1 /ρ ) λ k n − λ ? n 2 + ρ r k +1 n,n r,k 2 i . (116) 32 shor t title Next, consider the second term on the left hand side of (112) , from the equalit y (97) , it holds that 2 X n ∈N k h h s k +1 n , θ k +1 n − θ ? i (117) = X n ∈N h \{ 1 } 2 ρ h θ k +1 n l,k − θ k n l,k , θ k +1 n − θ ? i + X n ∈N h 2 ρ h θ k +1 n r,k − θ k n r,k , θ k +1 n − θ ? i . Note that θ k +1 n = − r k +1 n l,k ,n + θ k +1 n l,k = r k +1 n,n r,k + θ k +1 n r,k , ∀ n = { 2 , · · · , N − 1 } , which implies that w e can rewrite the equation in (117) as follows 2 X n ∈N k h h s k +1 n , θ k +1 n − θ ? i = X n ∈N k h \{ 1 } − 2 ρ h θ k +1 n l,k − θ k n l,k , r k +1 n l,k ,n i + 2 ρ h θ k +1 n l,k − θ k n l,k , θ k +1 n l,k − θ ? i + X n ∈N h 2 ρ h θ k +1 n r,k − θ k n r,k , r k +1 n,n r,k i + 2 ρ h θ k +1 n r,k − θ k n r,k , θ k +1 n r,k − θ ? i . (118) Using the equalities, θ k +1 n l,k − θ ? =( θ k +1 n l,k − θ k n l,k ) + ( θ k n l,k − θ ? ) , ∀ n ∈ N k h \ { 1 } θ k +1 n r,k − θ ? =( θ k +1 n r,k − θ k n r,k ) + ( θ k n r,k − θ ? ) , ∀ n ∈ N k h (119) w e rewrite the right hand side of (118) as X n ∈N k h \{ 1 } − 2 ρ h θ k +1 n l,k − θ k n l,k , r k +1 n l,k ,n i + 2 ρ k θ k +1 n l,k − θ k n l,k k 2 + 2 ρ h θ k +1 n l,k − θ k n l,k , θ k n l,k − θ ? i + X n ∈N k h 2 ρ h θ k +1 n r,k − θ k n r,k , r k +1 n,n r,k i + 2 ρ k θ k +1 n r,k − θ k n r,k k 2 + 2 ρ h θ k +1 n r,k − θ k n r,k , θ k n r,k − θ ? n r,k i . (120) F urther using the equalities θ k +1 n l,k − θ k n l,k =( θ k +1 n l,k − θ ? ) − ( θ k n l,k − θ ? ) , ∀ n ∈ N k h \ { 1 } θ k +1 n r,k − θ k n r,k =( θ k +1 n r,k − θ ? ) − ( θ k n r,k − θ ? ) , ∀ n ∈ N k h , (121) w e can write (120) after the rearrangement as X n ∈N k h \{ 1 } − 2 ρ h θ k +1 n l,k − θ k n l,k , r k +1 n − 1 ,n i + ρ k θ k +1 n l,k − θ k n l,k k 2 + ρ k ( θ k +1 n l,k − θ ? ) − ( θ k n l,k − θ ? ) k 2 + 2 ρ h θ k +1 n l,k − θ ? , θ k n l,k − θ ? i − 2 ρ k θ k n l,k − θ ? k 2 2 + X n ∈N k h 2 ρ h θ k +1 n r,k − θ k n r,k , r k +1 n,n r,k i + ρ k θ k +1 n r,k − θ k n r,k k 2 + ρ k ( θ k +1 n r,k − θ ? ) − ( θ k n r,k − θ ? ) k 2 + 2 ρ h θ k +1 n r,k − θ ? , θ k n r,k − θ ? i − 2 ρ k θ k n r,k − θ ? | 2 . (122) 33 authors Next, expanding the square terms in (122) , w e get the upp er b ound for the term in (116) as follo ws X n ∈N k h \{ 1 } − 2 ρ h θ k +1 n l,k − θ k n l,k , r k +1 n − 1 ,n i + ρ k θ k +1 n l,k − θ k n l,k k 2 + ρ k θ k +1 n l,k − θ ? k 2 − ρ k θ k n l,k − θ ? k 2 + X n ∈N k h 2 ρ h θ k +1 n r,k − θ k n r,k , r k +1 n,n r,k i + ρ k θ k +1 n r,k − θ k n r,k k 2 + ρ k θ k +1 n r,k − θ ? k 2 − ρ k θ k n r,k − θ ? k 2 . (123) Substituting the equalities from (116) and (123) to the left hand side of (112), we obtain N − 1 X n =1 h − (1 /ρ ) k λ k +1 n − λ ? n k 2 + (1 /ρ ) k λ k n − λ ? n k 2 − ρ k r k +1 n,n r,k k 2 i − X n ∈N k h \{ 1 } − 2 ρ h θ k +1 n l,k − θ k n l,k , r k +1 n l,k ,n i + ρ k θ k +1 n l,k − θ k n l,k k 2 + ρ k θ k +1 n l,k − θ ? k 2 − ρ k θ k n l,k − θ ? 
Next, consider the second term on the left hand side of (112). From the equality (97), it holds that
\[
2\sum_{n \in \mathcal{N}_h^k} \langle s_n^{k+1},\, \theta_n^{k+1} - \theta^\star \rangle = \sum_{n \in \mathcal{N}_h^k \setminus \{1\}} 2\rho \langle \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k,\, \theta_n^{k+1} - \theta^\star \rangle + \sum_{n \in \mathcal{N}_h^k} 2\rho \langle \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k,\, \theta_n^{k+1} - \theta^\star \rangle. \tag{117}
\]
Note that $\theta_n^{k+1} = -r_{n_{l,k},n}^{k+1} + \theta_{n_{l,k}}^{k+1} = r_{n,n_{r,k}}^{k+1} + \theta_{n_{r,k}}^{k+1}$ for all $n \in \{2, \cdots, N-1\}$, which implies that we can rewrite (117) as
\[
2\sum_{n \in \mathcal{N}_h^k} \langle s_n^{k+1},\, \theta_n^{k+1} - \theta^\star \rangle = \sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \Bigl[ -2\rho \langle \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k,\, r_{n_{l,k},n}^{k+1} \rangle + 2\rho \langle \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k,\, \theta_{n_{l,k}}^{k+1} - \theta^\star \rangle \Bigr]
+ \sum_{n \in \mathcal{N}_h^k} \Bigl[ 2\rho \langle \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k,\, r_{n,n_{r,k}}^{k+1} \rangle + 2\rho \langle \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k,\, \theta_{n_{r,k}}^{k+1} - \theta^\star \rangle \Bigr]. \tag{118}
\]
Using the equalities
\[
\theta_{n_{l,k}}^{k+1} - \theta^\star = (\theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k) + (\theta_{n_{l,k}}^k - \theta^\star), \quad \forall n \in \mathcal{N}_h^k \setminus \{1\}, \qquad
\theta_{n_{r,k}}^{k+1} - \theta^\star = (\theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k) + (\theta_{n_{r,k}}^k - \theta^\star), \quad \forall n \in \mathcal{N}_h^k, \tag{119}
\]
we rewrite the right hand side of (118) as
\[
\sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \Bigl[ -2\rho \langle \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k,\, r_{n_{l,k},n}^{k+1} \rangle + 2\rho \| \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k \|^2 + 2\rho \langle \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k,\, \theta_{n_{l,k}}^k - \theta^\star \rangle \Bigr]
+ \sum_{n \in \mathcal{N}_h^k} \Bigl[ 2\rho \langle \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k,\, r_{n,n_{r,k}}^{k+1} \rangle + 2\rho \| \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k \|^2 + 2\rho \langle \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k,\, \theta_{n_{r,k}}^k - \theta^\star \rangle \Bigr]. \tag{120}
\]
Further using the equalities
\[
\theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k = (\theta_{n_{l,k}}^{k+1} - \theta^\star) - (\theta_{n_{l,k}}^k - \theta^\star), \quad \forall n \in \mathcal{N}_h^k \setminus \{1\}, \qquad
\theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k = (\theta_{n_{r,k}}^{k+1} - \theta^\star) - (\theta_{n_{r,k}}^k - \theta^\star), \quad \forall n \in \mathcal{N}_h^k, \tag{121}
\]
we can write (120) after rearrangement as
\[
\sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \Bigl[ -2\rho \langle \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k,\, r_{n_{l,k},n}^{k+1} \rangle + \rho \| \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k \|^2 + \rho \| (\theta_{n_{l,k}}^{k+1} - \theta^\star) - (\theta_{n_{l,k}}^k - \theta^\star) \|^2 + 2\rho \langle \theta_{n_{l,k}}^{k+1} - \theta^\star,\, \theta_{n_{l,k}}^k - \theta^\star \rangle - 2\rho \| \theta_{n_{l,k}}^k - \theta^\star \|^2 \Bigr]
+ \sum_{n \in \mathcal{N}_h^k} \Bigl[ 2\rho \langle \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k,\, r_{n,n_{r,k}}^{k+1} \rangle + \rho \| \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k \|^2 + \rho \| (\theta_{n_{r,k}}^{k+1} - \theta^\star) - (\theta_{n_{r,k}}^k - \theta^\star) \|^2 + 2\rho \langle \theta_{n_{r,k}}^{k+1} - \theta^\star,\, \theta_{n_{r,k}}^k - \theta^\star \rangle - 2\rho \| \theta_{n_{r,k}}^k - \theta^\star \|^2 \Bigr]. \tag{122}
\]
Next, expanding the squared terms in (122), we can simplify (122) as
\[
\sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \Bigl[ -2\rho \langle \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k,\, r_{n_{l,k},n}^{k+1} \rangle + \rho \| \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k \|^2 + \rho \| \theta_{n_{l,k}}^{k+1} - \theta^\star \|^2 - \rho \| \theta_{n_{l,k}}^k - \theta^\star \|^2 \Bigr]
+ \sum_{n \in \mathcal{N}_h^k} \Bigl[ 2\rho \langle \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k,\, r_{n,n_{r,k}}^{k+1} \rangle + \rho \| \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k \|^2 + \rho \| \theta_{n_{r,k}}^{k+1} - \theta^\star \|^2 - \rho \| \theta_{n_{r,k}}^k - \theta^\star \|^2 \Bigr]. \tag{123}
\]
Substituting the equalities from (116) and (123) into the left hand side of (112), we obtain
\[
\sum_{n=1}^{N-1} \Bigl[ -(1/\rho) \| \lambda_n^{k+1} - \lambda_n^\star \|^2 + (1/\rho) \| \lambda_n^k - \lambda_n^\star \|^2 - \rho \| r_{n,n_{r,k}}^{k+1} \|^2 \Bigr]
- \sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \Bigl[ -2\rho \langle \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k,\, r_{n_{l,k},n}^{k+1} \rangle + \rho \| \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k \|^2 + \rho \| \theta_{n_{l,k}}^{k+1} - \theta^\star \|^2 - \rho \| \theta_{n_{l,k}}^k - \theta^\star \|^2 \Bigr]
- \sum_{n \in \mathcal{N}_h^k} \Bigl[ 2\rho \langle \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k,\, r_{n,n_{r,k}}^{k+1} \rangle + \rho \| \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k \|^2 + \rho \| \theta_{n_{r,k}}^{k+1} - \theta^\star \|^2 - \rho \| \theta_{n_{r,k}}^k - \theta^\star \|^2 \Bigr] \ge 0. \tag{124}
\]
Next, consider the Lyapunov function $V^k$ defined as
\[
V^k = (1/\rho) \sum_{n=1}^{N-1} \| \lambda_n^k - \lambda_n^\star \|^2 + \rho \sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \| \theta_{n_{l,k}}^k - \theta^\star \|^2 + \rho \sum_{n \in \mathcal{N}_h^k} \| \theta_{n_{r,k}}^k - \theta^\star \|^2. \tag{125}
\]
After rearranging the terms in (124) and using the definition of the Lyapunov function in (125), we get
\[
V^{k+1} \le V^k - \sum_{n=1}^{N-1} \rho \| r_{n,n_{r,k}}^{k+1} \|^2 - \Bigl[ \sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \rho \| \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k \|^2 + \sum_{n \in \mathcal{N}_h^k} \rho \| \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k \|^2 \Bigr]
- \Bigl[ \sum_{n \in \mathcal{N}_h^k \setminus \{1\}} -2\rho \langle \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k,\, r_{n_{l,k},n}^{k+1} \rangle + \sum_{n \in \mathcal{N}_h^k} 2\rho \langle \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k,\, r_{n,n_{r,k}}^{k+1} \rangle \Bigr]. \tag{126}
\]
We rewrite (126) as
\[
V^{k+1} \le V^k - \Bigl[ \sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \rho \| r_{n_{l,k},n}^{k+1} \|^2 + \sum_{n \in \mathcal{N}_h^k} \rho \| r_{n,n_{r,k}}^{k+1} \|^2 \Bigr] - \Bigl[ \sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \rho \| \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k \|^2 + \sum_{n \in \mathcal{N}_h^k} \rho \| \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k \|^2 \Bigr]
- \Bigl[ \sum_{n \in \mathcal{N}_h^k \setminus \{1\}} -2\rho \langle \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k,\, r_{n_{l,k},n}^{k+1} \rangle + \sum_{n \in \mathcal{N}_h^k} 2\rho \langle \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k,\, r_{n,n_{r,k}}^{k+1} \rangle \Bigr]. \tag{127}
\]
Next, (127) can be rewritten as
\[
V^{k+1} \le V^k - \rho \sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \Bigl[ \| r_{n_{l,k},n}^{k+1} \|^2 - 2 \langle \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k,\, r_{n_{l,k},n}^{k+1} \rangle + \| \theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k \|^2 \Bigr]
- \rho \sum_{n \in \mathcal{N}_h^k} \Bigl[ \| r_{n,n_{r,k}}^{k+1} \|^2 + 2 \langle \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k,\, r_{n,n_{r,k}}^{k+1} \rangle + \| \theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k \|^2 \Bigr]. \tag{128}
\]
Further, we write (128) as
\[
V^{k+1} \le V^k - \rho \Bigl[ \sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \| r_{n_{l,k},n}^{k+1} - (\theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k) \|^2 + \sum_{n \in \mathcal{N}_h^k} \| r_{n,n_{r,k}}^{k+1} + (\theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k) \|^2 \Bigr]. \tag{129}
\]
The result in (129) proves that $V^k$ decreases in each iteration $k$. Now, since $V^k \ge 0$ and $V^k \le V^0$, the term
\[
\sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \| r_{n_{l,k},n}^{k+1} - (\theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k) \|^2 + \sum_{n \in \mathcal{N}_h^k} \| r_{n,n_{r,k}}^{k+1} + (\theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k) \|^2
\]
is bounded. Taking the telescoping sum over $k$ in (129) and letting $K \to \infty$, we get
\[
\lim_{K \to \infty} \sum_{k=0}^{K} \Bigl[ \sum_{n \in \mathcal{N}_h^k \setminus \{1\}} \| r_{n_{l,k},n}^{k+1} - (\theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k) \|^2 + \sum_{n \in \mathcal{N}_h^k} \| r_{n,n_{r,k}}^{k+1} + (\theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k) \|^2 \Bigr] \le V^0. \tag{130}
\]
The result in (130) implies that the primal residual $r_{n,n_{r,k}}^{k+1} \to 0$ as $k \to \infty$ for all $n \in \{1, \cdots, N-1\}$. Similarly, the differences $\theta_{n_{l,k}}^{k+1} - \theta_{n_{l,k}}^k$ and $\theta_{n_{r,k}}^{k+1} - \theta_{n_{r,k}}^k$ converge to zero as $k \to \infty$, which implies that the dual residual $s_n^k \to 0$ as $k \to \infty$ for all $n \in \mathcal{N}_h^k$. In order to prove convergence to the optimal point, consider the lower and upper bounds on the objective optimality gap,
\[
\sum_{n=1}^{N} \bigl[ f_n(\theta_n^{k+1}) - f_n(\theta^\star) \bigr] \le -\sum_{n=1}^{N-1} \langle \lambda_n^{k+1},\, r_{n,n_{r,k}}^{k+1} \rangle + \sum_{n \in \mathcal{N}_h^k} \langle s_n^{k+1},\, \theta^\star - \theta_n^{k+1} \rangle, \tag{131}
\]
\[
\sum_{n=1}^{N} \bigl[ f_n(\theta_n^{k+1}) - f_n(\theta^\star) \bigr] \ge -\sum_{n=1}^{N-1} \langle \lambda_n^\star,\, r_{n,n_{r,k}}^{k+1} \rangle. \tag{132}
\]
From the results established in this appendix, the right hand side of the upper bound in (131) converges to zero as $k \to \infty$, and the right hand side of the lower bound in (132) also converges to zero as $k \to \infty$. This implies that
\[
\lim_{k \to \infty} \sum_{n=1}^{N} \bigl[ f_n(\theta_n^{k+1}) - f_n(\theta^\star) \bigr] = 0, \tag{133}
\]
which is the required result. Hence proved.
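The following is a minimal numerical sketch of the convergence behavior established above, assuming quadratic local objectives $f_n(\theta) = \tfrac{1}{2}\|A_n\theta - b_n\|^2$ on a fixed chain of $N$ workers with an even/odd head-tail split; the closed-form updates and all names below (A, b, rho, local_update) are illustrative assumptions for this toy check, not the authors' implementation.

```python
# Toy check: GADMM-style chain consensus with quadratic local objectives.
import numpy as np

rng = np.random.default_rng(0)
N, d, m, rho, iters = 6, 5, 20, 1.0, 500

A = [rng.standard_normal((m, d)) for _ in range(N)]
b = [rng.standard_normal(m) for _ in range(N)]

# Centralized minimizer of sum_n f_n(theta), used only to report the optimality gap.
theta_star = np.linalg.solve(sum(An.T @ An for An in A),
                             sum(An.T @ bn for An, bn in zip(A, b)))
f_star = sum(0.5 * np.linalg.norm(An @ theta_star - bn) ** 2 for An, bn in zip(A, b))

theta = [np.zeros(d) for _ in range(N)]    # one local model per worker
lam = [np.zeros(d) for _ in range(N - 1)]  # lam[i] couples theta[i] = theta[i+1]

def local_update(n, left, right):
    """Closed-form minimizer of f_n plus its augmented-Lagrangian coupling terms."""
    H, g = A[n].T @ A[n], A[n].T @ b[n]
    if left is not None:                   # coupling with the left neighbor via lam[n-1]
        H, g = H + rho * np.eye(d), g + lam[n - 1] + rho * left
    if right is not None:                  # coupling with the right neighbor via lam[n]
        H, g = H + rho * np.eye(d), g - lam[n] + rho * right
    return np.linalg.solve(H, g)

for k in range(iters):
    # Head workers (even indices here) update using their neighbors' current models,
    for n in range(0, N, 2):
        theta[n] = local_update(n, theta[n - 1] if n > 0 else None,
                                theta[n + 1] if n < N - 1 else None)
    # then tail workers (odd indices) update using the freshly updated head models,
    for n in range(1, N, 2):
        theta[n] = local_update(n, theta[n - 1],
                                theta[n + 1] if n < N - 1 else None)
    # and each dual variable ascends along its primal residual theta_n - theta_{n+1}.
    for i in range(N - 1):
        lam[i] = lam[i] + rho * (theta[i] - theta[i + 1])

res = max(np.linalg.norm(theta[i] - theta[i + 1]) for i in range(N - 1))
gap = sum(0.5 * np.linalg.norm(A[n] @ theta[n] - b[n]) ** 2 for n in range(N)) - f_star
print(f"max primal residual = {res:.3e}, objective gap = {gap:.3e}")  # both approach 0
```

On this small, well-conditioned instance both printed quantities should decay by several orders of magnitude, consistent with the residuals and optimality gap vanishing as shown above; longer chains or a poorly tuned $\rho$ slow this decay down.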