Bundle EXTRA for Decentralized Optimization
Haijuan Liu, Zhuoqing Zheng, Cong Li, Wenying Xu, and Xuyang Wu

Abstract— Decentralized primal-dual methods are widely used for solving decentralized optimization problems, but their updates often rely on potentially crude first-order Taylor approximations of the objective functions, which can limit convergence speed. To overcome this, we replace the first-order Taylor approximation in the primal update of EXTRA, which can be interpreted as a primal-dual method, with a more accurate multi-cut bundle model, resulting in a fully decentralized bundle EXTRA method. The bundle model incorporates historical information to improve the approximation accuracy, potentially leading to faster convergence. Under mild assumptions, we show that a KKT residual converges to zero. Numerical experiments on decentralized least-squares problems demonstrate that, compared to EXTRA, the bundle EXTRA method converges faster and is more robust to step-size choices.

Index Terms— Bundle method, EXTRA, decentralized optimization, consensus optimization.

I. INTRODUCTION

Decentralized optimization employs a network of agents to solve a global optimization problem, where each agent can communicate only with its neighbors. This problem has attracted much attention in recent decades due to its broad applications in diverse research areas such as distributed control [1] and distributed machine learning [2]. Among decentralized optimization methods, early methods are mainly primal and based on the consensus operation, such as the decentralized gradient descent method (DGD) [3], [4], [5]. Although these methods adapt well to a wide range of scenarios, such as time-varying networks [3], stochastic optimization [2], and asynchronous optimization [6], [7], their convergence properties are unsatisfactory.
In particular, when using fixed step-sizes, these methods are only guaranteed to converge to a sub-optimal solution. Although these methods theoretically converge to the optimum when using diminishing step-sizes, the quickly vanishing step-size often yields slow convergence. To achieve faster convergence, a number of advanced methods [8], [9], [10], [11], [12], [13] have been proposed, such as EXTRA [8], DIGing [11], and several primal-dual methods [14], [15], [16], [17], where EXTRA and DIGing have also been shown to admit a primal-dual interpretation [18]. These methods converge to the exact optimum with constant step-sizes and enjoy faster convergence both theoretically and empirically. However, most of their updates approximate the primal/dual function by its first-order Taylor expansion, which can yield a crude approximation and prevent faster convergence. With this observation, a natural idea for accelerating these methods is to use a more accurate surrogate.

(Footnote: H. Liu, Z. Zheng, C. Li, and X. Wu are with the School of Automation and Intelligent Manufacturing, Southern University of Science and Technology, Shenzhen, China, and the State Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing 100081, China. Email: 12431368@mail.sustech.edu.cn; 12433033@mail.sustech.edu.cn; licong@sustech.edu.cn; wuxy6@sustech.edu.cn. W. Xu is with the School of Mathematics, Southeast University, Nanjing 211189, China. Email: wyxu@seu.edu.cn. This work is supported in part by the Guangdong Provincial Key Laboratory of Fully Actuated System Control Theory and Technology under grant No. 2024B1212010002, in part by the Shenzhen Science and Technology Program under grants No. JCYJ20241202125309014 and No. KQTD20221101093557010, and in part by the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2026A1515012017.)
In this paper, we view EXTRA as a primal-dual method and replace the first-order Taylor expansion of the objective function in its primal update with a surrogate function referred to as the bundle model, leading to a decentralized bundle EXTRA method. Our method allows for several choices of the bundle model, which incorporate lower bounds of the objective function or historical function values and gradients to improve the accuracy of the surrogate. The main contributions of this paper are summarized as follows:

1) We adapt the bundle technique from centralized optimization to decentralized optimization to accelerate the convergence of EXTRA. Although an earlier work [19] also applies the bundle technique in decentralized optimization, it focuses on non-smooth problems, allows only one specific bundle model, and does not provide any theoretical convergence guarantees.
2) Under mild assumptions, we prove an O(1/k) convergence rate of the bundle EXTRA method.
3) Numerical experiments on decentralized least squares demonstrate that the proposed method not only converges faster but is also more robust to the step-size compared to EXTRA [8].

The remainder of this paper is organized as follows: Section II introduces consensus optimization and the EXTRA algorithm. Section III develops the bundle EXTRA algorithm and Section IV analyzes its convergence. Section V presents numerical experiments and Section VI concludes the paper.

Notations and definitions. We use R^d and R^{n×d} to denote the d-dimensional Euclidean space and the set of n×d real matrices, respectively. Additionally, ⟨·,·⟩ and ∥·∥ represent the inner product and the Euclidean norm, respectively. For any matrix A ∈ R^{n×d}, Null(A) = {x ∈ R^d | Ax = 0} and Range(A) = {Ax | x ∈ R^d} are the null space and range of A, respectively, and A† denotes its Moore–Penrose pseudoinverse.
For two matrices A, B ∈ R^{n×n}, A ≻ B (A ⪰ B) means that A − B is positive definite (positive semidefinite). We use 1_d, 0_{n×d}, and I_n to denote the d-dimensional all-one vector, the n×d zero matrix, and the n-dimensional identity matrix, respectively, and omit the subscripts when they are clear from the context. For any x ∈ R^n and any symmetric positive semidefinite matrix A ∈ R^{n×n}, the weighted norm is ∥x∥_A = √(xᵀAx). For any vector v ∈ R^d, we use span(v) to denote its span. For any f : R^d → R, ∂f(·) represents the subdifferential of f, and we say f is L-smooth for some L > 0 if it is differentiable and ∥∇f(y) − ∇f(x)∥ ≤ L∥y − x∥ for all x, y ∈ R^d.

II. Consensus Optimization and EXTRA

This section first formulates the consensus optimization problem and then reviews EXTRA, which is the foundation of the algorithm development in Section III.

A. Consensus optimization

Consider a network G = (V, E) of agents, where V = {1, ..., n} is the vertex set and E is the edge set. Consensus optimization solves the following problem through the collaboration of all agents:

  minimize_{x_i ∈ R^d}  Σ_{i=1}^n f_i(x_i),   subject to  x_1 = x_2 = ··· = x_n.  (1)

To this end, we make the following assumptions.

Assumption 1. Each objective function f_i is proper, closed, convex, and L-smooth for some L > 0.

Assumption 2. There exists at least one optimal solution to problem (1).

Assumption 3. The network G is undirected and connected.

Assumptions 1–3 are standard in decentralized optimization and are required in the convergence analysis of many typical decentralized optimization methods, such as DGD [4], EXTRA [8], and DIGing [9].

B. EXTRA

EXTRA is a typical decentralized method for solving problem (1). To introduce it, we define

  x = (x_1ᵀ; ... ; x_nᵀ) ∈ R^{n×d},  f(x) = Σ_{i=1}^n f_i(x_i),  ∇f(x) = ((∇f_1(x_1))ᵀ; ... ; (∇f_n(x_n))ᵀ) ∈ R^{n×d}.  (2)

Then, EXTRA can be described as

  x^1 = W x^0 − α ∇f(x^0),  (3)

and, for each k ≥ 0,

  x^{k+2} = (I + W) x^{k+1} − W̃ x^k − α (∇f(x^{k+1}) − ∇f(x^k)),  (4)

where W, W̃ ∈ R^{n×n} are two weight matrices satisfying the following assumption.

Assumption 4. The matrices W = [w_{ij}]_{n×n} and W̃ = [w̃_{ij}]_{n×n} satisfy
(a) (Decentralized) If i ≠ j and (i, j) ∉ E, then w_{ij} = w̃_{ij} = 0.
(b) (Symmetry) W = Wᵀ, W̃ = W̃ᵀ.
(c) (Null space) Null(W − W̃) = span(1), Null(I − W̃) ⊇ span(1).
(d) (Spectral) W̃ ≻ 0 and (I + W)/2 ⪰ W̃ ⪰ W.

For simplicity, throughout this paper we set W̃ = (W + I)/2, so that the EXTRA update (4) becomes

  x^{k+2} = 2 W̃ x^{k+1} − W̃ x^k − α (∇f(x^{k+1}) − ∇f(x^k)).  (5)

With W̃ = (W + I)/2, Assumption 4 is guaranteed if Assumption 3 holds and W satisfies
1) Assumption 4(a)–(b);
2) w_{ij} > 0 for all i ≠ j with {i, j} ∈ E;
3) W 1 = 1.

Typical choices of W include:
1) Metropolis weights. Let d_i be the degree of agent i. The ij-th element of W is

  w_{ij} = 1/(1 + max{d_i, d_j}) if {i, j} ∈ E and i ≠ j;  w_{ij} = 0 if {i, j} ∉ E and i ≠ j;  w_{ii} = 1 − Σ_{j ∈ N_i} w_{ij}.

2) Laplacian-based weights [20]. Let L_G be the Laplacian matrix of G, whose ij-th element is

  [L_G]_{ij} = −1 if {i, j} ∈ E and i ≠ j;  [L_G]_{ij} = 0 if {i, j} ∉ E and i ≠ j;  [L_G]_{ii} = d_i.

The weight matrix is W = I − τ L_G, where 0 < τ < 2/λ_max(L_G) and λ_max(L_G) is the largest eigenvalue of L_G.

More options for W are discussed in [8, Section 2.4]. It is shown in [18] that the EXTRA update (5) with the initialization (3) can be rewritten as the primal-dual update

  x^{k+1} = argmin_x  f(x^k) + ⟨∇f(x^k), x − x^k⟩ + ⟨x, q^k⟩ + (1/2α) ∥x − W̃ x^k∥²,  (6)
  q^{k+1} = q^k + (1/α)(I − W̃) x^{k+1},  (7)

where q^k is the dual iterate with

  q^0 = (1/α)(I − W̃) x^0.  (8)
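To make the recursion concrete, the following minimal Python sketch runs the EXTRA update (3)–(5) on a small decentralized least-squares instance with Metropolis weights. The network (a 4-agent ring), the data, and the step size are illustrative choices, not from the paper; the check at the end only verifies that the iterates approach consensus at the centralized solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3                                 # 4 agents, 3-dimensional variable
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]    # illustrative ring network

# Local least-squares objectives: f_i(x) = 0.5 * ||P_i x - q_i||^2
P = [rng.standard_normal((5, d)) for _ in range(n)]
q = [rng.standard_normal(5) for _ in range(n)]

def grad(X):
    """Stacked gradient: row i is the gradient of f_i at x_i."""
    return np.vstack([P[i].T @ (P[i] @ X[i] - q[i]) for i in range(n)])

# Metropolis weights (Section II-B)
deg = [sum(1 for e in edges if i in e) for i in range(n)]
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
W += np.diag(1.0 - W.sum(axis=1))
Wt = (W + np.eye(n)) / 2                    # tilde W = (W + I)/2

L = max(np.linalg.norm(P[i].T @ P[i], 2) for i in range(n))
alpha = 0.5 / L                             # illustrative conservative step size

# EXTRA: x^1 = W x^0 - alpha * grad f(x^0), then recursion (5)
X0 = rng.standard_normal((n, d))
X1 = W @ X0 - alpha * grad(X0)
for _ in range(2000):
    X2 = 2 * Wt @ X1 - Wt @ X0 - alpha * (grad(X1) - grad(X0))
    X0, X1 = X1, X2

# Iterates should agree with each other and with the centralized solution
x_central = np.linalg.lstsq(np.vstack(P), np.concatenate(q), rcond=None)[0]
consensus_gap = np.linalg.norm(X1 - X1.mean(axis=0))
err = np.linalg.norm(X1.mean(axis=0) - x_central)
print(consensus_gap, err)   # both near zero
```

Note that each row update uses only the agent's own gradient and its neighbors' iterates, so the dense matrix products here are just a compact single-process simulation of the decentralized protocol.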
Fig. 1: Surrogate function f̃_i^k. (a) Polyak model; (b) cutting-plane model; (c) Polyak cutting-plane model.

III. Algorithm Development

In the primal update (6) of EXTRA, the function f is approximated by its first-order Taylor expansion which, in general, admits only a low approximation accuracy and limits faster convergence of the algorithm. To accelerate EXTRA, we replace the first-order Taylor expansion with a more accurate approximation f̃^k, leading to

  x^{k+1} = argmin_x  f̃^k(x) + ⟨x, q^k⟩ + (1/2α) ∥x − W̃ x^k∥²,  (9)
  q^{k+1} = q^k + (1/α)(I − W̃) x^{k+1}.  (10)

We require f̃^k to have a separable structure for distributed implementation: for a set of functions f̃_i^k,

  f̃^k(x) = Σ_{i=1}^n f̃_i^k(x_i).

Then, the algorithm can be implemented as

  x_i^{k+1} = argmin_{x_i} { f̃_i^k(x_i) + (q_i^k)ᵀ x_i + (1/2α) ∥x_i − Σ_{j ∈ N_i ∪ {i}} w̃_{ij} x_j^k∥² },  (11)
  q_i^{k+1} = q_i^k + (1/α) ( x_i^{k+1} − Σ_{j ∈ N_i ∪ {i}} w̃_{ij} x_j^{k+1} ),  (12)

where N_i = {j | {i, j} ∈ E} is the neighbor set of agent i. We require each f̃_i^k to satisfy the following assumption.

Assumption 5. The surrogate function f̃_i^k(x_i) satisfies
(a) f̃_i^k(x_i) is convex;
(b) f̃_i^k(x_i) ≥ f_i(x_i^k) + ⟨∇f_i(x_i^k), x_i − x_i^k⟩ for all x_i ∈ R^d;
(c) f̃_i^k(x_i) ≤ f_i(x_i) for all x_i ∈ R^d.

Assumption 5 requires each local surrogate function to be a convex minorant of f_i that dominates the first-order cutting plane at x_i^k. Moreover, Assumption 5(b) and (c) imply f̃_i^k(x_i^k) = f_i(x_i^k), i.e., the model is exact at x_i^k. Since Assumption 5 is commonly assumed in bundle methods [21], [22], [23], we refer to surrogate functions satisfying Assumption 5 as bundle models and to the algorithm (9)–(10) with a bundle model as the bundle EXTRA method. A detailed implementation is given in Algorithm 1.

Algorithm 1 Bundle EXTRA
1: Initialization: Determine the step-size α > 0, the mixing matrix W̃, and the initial variables x^0.
2: Each agent i shares x_i^0 with its neighbors j ∈ N_i = {j | {i, j} ∈ E} and computes q_i^0 = (1/α)(x_i^0 − Σ_{j ∈ N_i ∪ {i}} w̃_{ij} x_j^0).
3: for k = 0, 1, 2, ... do
4:   for all agents i ∈ V do
5:     Construct the surrogate function f̃_i^k.
6:     Update the primal variable x_i^{k+1} by (11).
7:     Share x_i^{k+1} with all its neighbors j ∈ N_i.
8:     Update q_i^{k+1} by (12) after receiving x_j^{k+1} from all j ∈ N_i.
9:   end for
10: end for

A. Candidates for f̃_i^k

We provide a list of candidates for f̃_i^k, which incorporate historical objective function values and gradients, or lower bounds of the objective function, to yield a high approximation accuracy for f_i. Under Assumption 1, all options listed below satisfy Assumption 5, and some of them are depicted in Fig. 1.

1) Polyak model: Let γ_{f_i} be a lower bound of min_{x_i} f_i(x_i). The model takes the form

  f̃_i^k(x_i) = max{ f_i(x_i^k) + ⟨∇f_i(x_i^k), x_i − x_i^k⟩, γ_{f_i} }.

We name this function the Polyak model since steepest descent with the Polyak step-size minimizes this surrogate function.

2) Cutting-plane model: The model takes the maximum of historical cutting planes:

  f̃_i^k(x_i) = max_{t ∈ S_i^k} { f_i(x_i^t) + ⟨∇f_i(x_i^t), x_i − x_i^t⟩ },

where S_i^k ⊆ [0, k] is an index set of historical iterates. This model is used in the cutting-plane method [24].

3) Polyak cutting-plane model: It takes the form

  f̃_i^k(x_i) = max_{t ∈ S_i^k} { f_i(x_i^t) + ⟨∇f_i(x_i^t), x_i − x_i^t⟩, γ_{f_i} },

which is the maximum of the Polyak and cutting-plane models and is referred to as the Polyak cutting-plane model.

4) Two-cut model:

  f̃_i^0(x_i) = f_i(x_i^0) + ⟨∇f_i(x_i^0), x_i − x_i^0⟩.
For each k ≥ 1,

  f̃_i^k(x_i) = max{ ℓ_i^{k−1,k}(x_i), f_i(x_i^k) + ⟨∇f_i(x_i^k), x_i − x_i^k⟩ },

where ℓ_i^{k−1,k}(x_i) = f̃_i^{k−1}(x_i^k) + ⟨v_i^k, x_i − x_i^k⟩ with subgradient v_i^k ∈ ∂f̃_i^{k−1}(x_i^k). This model takes the maximum of the cutting planes of both f_i and f̃_i^{k−1} at x_i^k.

The above four models approximate f_i by a piecewise-linear function, defined as the maximum of multiple affine functions, which is convex and yields a substantially tighter lower bound than f(x^k) + ⟨∇f(x^k), x − x^k⟩ used in EXTRA. This tighter approximation potentially yields faster convergence.

Remark 1. An existing work [19] also proposes a decentralized bundle-type method for solving problem (1). However, it considers the non-smooth setting (using subgradients), allows only the two-cut model, and includes no theoretical convergence analysis. In contrast, we consider smooth objective functions, allow for a broader range of surrogate functions, and theoretically analyze the convergence (see Section IV).

B. Subproblem in the primal update (11)

When using the surrogate functions in Section III-A, the subproblem in the primal update (11) can be transformed into a quadratic program and solved at low cost. We first simplify the subproblem in (11) as

  min_{x ∈ R^d}  max_{j=1,...,m} { a_jᵀ x + b_j } + (1/2α) ∥x − c∥²,  (13)

where the subscript i is omitted for simplicity, m is the number of affine functions in f̃_i^k, a_jᵀ x + b_j is the j-th affine function, and c = Σ_{j ∈ N_i ∪ {i}} w̃_{ij} x_j^k − α q_i^k. Further, the subproblem (13) can be equivalently reformulated as the quadratic program

  minimize_{x ∈ R^d, ξ ∈ R}  ξ + (1/2α) ∥x − c∥²   subject to  ξ − a_jᵀ x − b_j ≥ 0,  j = 1, ..., m.  (14)

When d is small, directly solving the above quadratic program is not expensive.
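As a concrete sketch of this subproblem, the following Python example solves an instance of (13) through the low-dimensional simplex-constrained dual (15) derived below, using projected gradient ascent with a sort-based Euclidean projection onto the simplex and recovering the primal point via x⋆ = c − αAᵀλ⋆. The data A, b, c and the sizes d, m are randomly generated illustrations, not tied to any particular f_i.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 50, 6          # variable dimension and number of cuts (illustrative)
alpha = 0.1
A = rng.standard_normal((m, d))   # row j is a_j^T
b = rng.standard_normal(m)
c = rng.standard_normal(d)

def project_simplex(v):
    """Euclidean projection of v onto {lam >= 0, sum(lam) = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

# Maximize h(lam) = -(alpha/2)||A^T lam||^2 + lam^T (A c + b) over the simplex
# by projected gradient ascent; grad h(lam) = -alpha * A A^T lam + A c + b.
G = A @ A.T
step = 1.0 / (alpha * np.linalg.norm(G, 2))   # 1 / Lipschitz constant of grad h
lam = np.ones(m) / m
for _ in range(5000):
    lam = project_simplex(lam + step * (-alpha * G @ lam + A @ c + b))

x_star = c - alpha * A.T @ lam    # primal recovery from the dual optimum

def primal(x):
    """Objective of subproblem (13)."""
    return np.max(A @ x + b) + np.linalg.norm(x - c) ** 2 / (2 * alpha)

# Sanity check: perturbing an (approximate) minimizer cannot decrease (13)
increase = primal(x_star + 1e-3 * rng.standard_normal(d)) - primal(x_star)
print(increase)   # positive
```

The projection costs O(m log m) here due to sorting; the O(m) algorithm of [26] would tighten this further. A FISTA-style accelerated loop, as used in the paper's timing experiment, follows the same pattern with an added momentum term.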
If d is large, a more efficient approach is to solve problem (14) via its dual, whose variable dimension m is typically small (around 5–20 in practical implementations):

  maximize_{λ ∈ R^m}  h(λ)   subject to  1ᵀλ = 1, λ ≥ 0,  (15)

where

  h(λ) = inf_x { (1/2α) ∥x − c∥² + λᵀ A x + λᵀ b } = −(α/2) ∥Aᵀλ∥² + λᵀ(Ac + b)

is the dual objective, A = (a_1, ..., a_m)ᵀ ∈ R^{m×d}, and b = (b_1, ..., b_m)ᵀ ∈ R^m. For any optimum λ⋆ of the dual problem (15), x⋆ = c − α Aᵀ λ⋆ is an optimum of problem (13) (see [25, Section 5.5]). Note that problem (15) is a quadratic program with variable dimension m whose constraint set is the simplex, onto which projection can be performed at a low cost of O(m) [26]. Therefore, problem (15) can be solved quickly. To test the practical effectiveness of this approach, we apply FISTA to problem (15) with randomly generated A, b, c, where d = 100,000 and m = 15; it takes only 19–40 iterations to reach a high accuracy of 10⁻⁷ (0.1–0.2 seconds on a PC with an Apple M3 8-core CPU).

IV. Convergence Analysis

This section analyzes the convergence of the bundle EXTRA method.

Theorem 1. Suppose that Assumptions 1–5 hold and let the sequence {x^k} be generated by Algorithm 1. If

  α ≤ λ_min(W̃)/L,  (16)

then the KKT residuals satisfy

  Σ_{k=1}^∞ [ (1/α)(x^k)ᵀ(I − W̃) x^k + (1/L) ∥∇f(x^k) + q⋆∥² ] ≤ (1/α) ∥x^0 − x⋆∥²_W̃ + α ∥q^0 + ∇f(x⋆)∥²_{(I−W̃)†} < ∞,  (17)

where x⋆ is an optimum of problem (1).

Proof. See Appendix.

In (17), the terms (x^k)ᵀ(I − W̃) x^k and ∥∇f(x^k) + q⋆∥² measure the KKT residual of problem (1). To see this, note that problem (1) is equivalent to

  minimize_{x ∈ R^{n×d}}  f(x)   subject to  (I − W̃)^{1/2} x = 0,  (18)

where the constraint implies that all x_i's are identical due to Assumption 4(c) and Null(I − W̃) = Null((I − W̃)^{1/2}).
By the KKT conditions of problem (18), a point x̃ is optimal if and only if there exists q̃ ∈ Range((I − W̃)^{1/2}) such that

  (I − W̃)^{1/2} x̃ = 0,  (19)
  ∇f(x̃) + q̃ = 0.  (20)

Since x⋆ is optimal for problem (18), we have ∇f(x⋆) ∈ Range((I − W̃)^{1/2}). Therefore, (x^k)ᵀ(I − W̃) x^k = 0 implies (19) with x̃ = x^k, and ∥∇f(x^k) − ∇f(x⋆)∥² = 0 implies (20) with x̃ = x^k and q̃ = −∇f(x⋆). Next, we show that (17) implies the convergence of the KKT residuals (x^k)ᵀ(I − W̃) x^k and ∥∇f(x^k) + q⋆∥² to 0.

Corollary 1. Suppose that all the conditions in Theorem 1 hold and let {x^k} be generated by Algorithm 1. It holds that

  lim_{k→∞} (x^k)ᵀ(I − W̃) x^k = 0,   lim_{k→∞} ∥∇f(x^k) + q⋆∥ = 0.

Moreover,

  min_{t≤k} (x^t)ᵀ(I − W̃) x^t = o(1/k),   min_{t≤k} ∥∇f(x^t) − ∇f(x⋆)∥² = o(1/k),
  (1/k) Σ_{t=1}^k (x^t)ᵀ(I − W̃) x^t = O(1/k),   (1/k) Σ_{t=1}^k ∥∇f(x^t) − ∇f(x⋆)∥² = O(1/k).

Proof. The results follow from (17) in Theorem 1 and [8, Proposition 3.4].

V. Numerical Experiments

To evaluate the performance of the proposed algorithm, we consider decentralized least-squares problems of the form

  min_{x ∈ R^d}  (1/2n) Σ_{i=1}^n ∥P_i x − q_i∥²,  (21)

which is equivalent to problem (1) with f_i(x) = (1/2n) ∥P_i x − q_i∥². All entries of the matrices P_i ∈ R^{η×d} and vectors q_i ∈ R^η are randomly generated from Gaussian distributions: P_i^{lj} ∼ N(2, 2) and q_i^l ∼ N(1, 0.5). The parameter settings are as follows: we set n = 20, d = 100, and η = 6. The communication network is randomly generated with 32 edges and is guaranteed to be connected. We choose the mixing matrix W as the Metropolis weights described in Section II-B. We set the bundle model in bundle EXTRA (11) to the following cutting-plane model:

  f̃_i^k(x_i) = max_{max(0, k−m) ≤ t ≤ k} { f_i(x_i^t) + ⟨∇f_i(x_i^t), x_i − x_i^t⟩ },

where m is a non-negative integer.
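This bundle model is easy to sanity-check against Assumption 5. The short Python sketch below, using an illustrative one-dimensional quadratic f_i rather than the paper's data, builds the cutting-plane model from a few stored iterates and verifies on a grid that the model minorizes f_i and is exact at the current iterate x_i^k.

```python
import numpy as np

# Illustrative smooth convex local objective: f_i(x) = 0.5 * (x - 3)^2
f = lambda x: 0.5 * (x - 3.0) ** 2
g = lambda x: x - 3.0                   # its gradient

# Iterates kept in the bundle (last m + 1 points with m = 3, illustrative)
pts = np.array([0.0, 1.0, 2.5, 4.0])    # x^{k-3}, ..., x^k
xk = pts[-1]

def model(x):
    """Cutting-plane bundle model: max of tangent lines at the kept iterates."""
    return max(f(p) + g(p) * (x - p) for p in pts)

grid = np.linspace(-5.0, 10.0, 301)
minorant = all(model(x) <= f(x) + 1e-12 for x in grid)   # Assumption 5(c)
exact_at_xk = abs(model(xk) - f(xk)) < 1e-12             # exactness at x^k
print(minorant, exact_at_xk)   # True True
```

By convexity each tangent lies below f_i, so their maximum is a convex minorant, and it touches f_i at x_i^k because the newest tangent is included; this is exactly the structure Assumption 5 demands.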
We validate the effectiveness of the proposed bundle EXTRA method through comparison with EXTRA. We first compare the convergence speed, with the step-sizes of both methods fine-tuned for best performance. Then, we test the robustness of the methods with respect to the step-size. The results are displayed in Figs. 2–3, from which we make the following observations. First, from Fig. 2, the proposed bundle EXTRA method outperforms EXTRA, and the advantage becomes more evident as the number of historical cutting planes increases. One explanation for this acceleration is that the increased number of historical cutting planes improves the approximation quality of the surrogate function, which in turn accelerates convergence. Second, Fig. 3 shows that bundle EXTRA converges for a much wider range of step-sizes than EXTRA, especially when m is large, which indicates higher robustness of bundle EXTRA in step-size selection.

Fig. 2: Convergence of bundle EXTRA and EXTRA.

Fig. 3: Parameter robustness of the bundle EXTRA and EXTRA methods with step-size α = 0.003 × 2^t.

VI. Conclusion

We have developed a bundle EXTRA method for decentralized consensus optimization. The method incorporates a bundle-based multi-cut model into the EXTRA framework, which improves the approximation accuracy of the primal objective functions while maintaining a fully decentralized structure. We have established global convergence of the KKT residual under mild assumptions. Numerical results demonstrate that our method not only converges faster but also exhibits higher robustness in step-size selection compared to EXTRA. Future work will investigate adaptive selection of the number of cuts and more refined surrogate functions to further improve performance.
Appendix: Proof of Theorem 1

We first rewrite the bundle EXTRA method in an equivalent form. Define H = (I − W̃)^{1/2}. By letting u^k = H† q^k, the algorithm can be equivalently rewritten as

  x^{k+1} = argmin_x  f̃^k(x) + ⟨x, H u^k⟩ + (1/2α) ∥x − W̃ x^k∥²,  (22)
  u^{k+1} = u^k + (1/α) H x^{k+1},  (23)

where u^0 = (1/α) H x^0. Since q^0 ∈ Range(H) and q^{k+1} − q^k ∈ Range(H) by (10), we have q^k ∈ Range(H) for all k ≥ 0, so that each q^k corresponds to a unique u^k. By the optimality conditions of problem (18), for some u⋆ it holds that

  ∇f(x⋆) + H u⋆ = 0,   H x⋆ = 0.  (24)

According to the first-order optimality conditions of (22), for some g^{k+1} ∈ ∂f̃^k(x^{k+1}) we have

  g^{k+1} + H u^k + (1/α)(x^{k+1} − W̃ x^k) = 0.  (25)

Subtracting the first equation in (24) from (25) gives

  g^{k+1} − ∇f(x⋆) + H(u^k − u⋆) + (1/α)(x^{k+1} − W̃ x^k) = 0,

or equivalently,

  H(u^k − u⋆) = −(g^{k+1} − ∇f(x⋆)) − (1/α)(x^{k+1} − W̃ x^k).  (26)

By (23),

  ∥u^{k+1} − u⋆∥² = ∥u^k − u⋆ + (1/α) H x^{k+1}∥²
  = ∥u^k − u⋆∥² + (1/α²) ∥H x^{k+1}∥² + (2/α) ⟨u^k − u⋆, H x^{k+1}⟩
  = ∥u^k − u⋆∥² + ∥u^{k+1} − u^k∥² + (2/α) ⟨u^k − u⋆, H x^{k+1}⟩.  (27)

By (26) and H x⋆ = 0, we have

  ⟨u^k − u⋆, H x^{k+1}⟩ = ⟨u^k − u⋆, H(x^{k+1} − x⋆)⟩ = ⟨H(u^k − u⋆), x^{k+1} − x⋆⟩
  = −⟨g^{k+1} − ∇f(x⋆), x^{k+1} − x⋆⟩ − (1/α) ⟨x^{k+1} − W̃ x^k, x^{k+1} − x⋆⟩.  (28)
For the terms on the right-hand side of (28), we have

  −⟨x^{k+1} − W̃ x^k, x^{k+1} − x⋆⟩
  = −⟨x^{k+1} − W̃ x^k − W̃ x^{k+1} + W̃ x^{k+1}, x^{k+1} − x⋆⟩
  = ⟨W̃(x^{k+1} − x^k), x⋆ − x^{k+1}⟩ − ∥x^{k+1} − x⋆∥²_{H²}
  = (1/2)( ∥x^k − x⋆∥²_W̃ − ∥x^{k+1} − x⋆∥²_W̃ − ∥x^k − x^{k+1}∥²_W̃ ) − ∥x^{k+1} − x⋆∥²_{H²},  (29)

where the last step follows from the basic equality 2⟨b − a, c − b⟩ = ∥c − a∥² − ∥c − b∥² − ∥a − b∥², and, by Assumption 5, the L-smoothness of f, and [27, Theorem 2.1.5],

  −⟨g^{k+1} − ∇f(x⋆), x^{k+1} − x⋆⟩
  ≤ f̃^k(x⋆) − f̃^k(x^{k+1}) − ⟨∇f(x⋆), x⋆ − x^{k+1}⟩
  ≤ f(x⋆) − f(x^k) − ⟨∇f(x^k), x^{k+1} − x^k⟩ − ⟨∇f(x⋆), x⋆ − x^{k+1}⟩
  ≤ f(x⋆) − f(x^{k+1}) + (L/2) ∥x^{k+1} − x^k∥² − ⟨∇f(x⋆), x⋆ − x^{k+1}⟩
  ≤ (L/2) ∥x^{k+1} − x^k∥² − (1/2L) ∥∇f(x^{k+1}) − ∇f(x⋆)∥².  (30)

Substituting (29) and (30) into (28) and using ∥x^{k+1} − x⋆∥²_{H²} = α² ∥u^{k+1} − u^k∥², we have

  ⟨u^k − u⋆, H x^{k+1}⟩ ≤ (1/2α)( ∥x⋆ − x^k∥²_W̃ − ∥x^{k+1} − x⋆∥²_W̃ − ∥x^k − x^{k+1}∥²_W̃ ) − α ∥u^{k+1} − u^k∥² − (1/2L) ∥∇f(x^{k+1}) − ∇f(x⋆)∥² + (L/2) ∥x^{k+1} − x^k∥²,

which, together with (27), yields

  α ∥u^{k+1} − u⋆∥² ≤ α ∥u^k − u⋆∥² − α ∥u^{k+1} − u^k∥² + L ∥x^{k+1} − x^k∥² + (1/α)( ∥x⋆ − x^k∥²_W̃ − ∥x^{k+1} − x⋆∥²_W̃ ) − (1/α) ∥x^k − x^{k+1}∥²_W̃ − (1/L) ∥∇f(x^{k+1}) − ∇f(x⋆)∥².  (31)

Define the block-diagonal matrix Q = diag(αI, W̃/α) and let z^k = ((u^k)ᵀ, (x^k)ᵀ)ᵀ. Rearranging the terms in (31), we obtain

  ∥z^k − z⋆∥²_Q − ∥z^{k+1} − z⋆∥²_Q
  = (1/α) ∥x^k − x⋆∥²_W̃ + α ∥u^k − u⋆∥² − (1/α) ∥x^{k+1} − x⋆∥²_W̃ − α ∥u^{k+1} − u⋆∥²
  ≥ ∥x^k − x^{k+1}∥²_{W̃/α − LI} + α ∥u^{k+1} − u^k∥² + (1/L) ∥∇f(x^{k+1}) − ∇f(x⋆)∥².  (32)

Since α ≤ λ_min(W̃)/L as assumed in (16), the matrix W̃/α − LI is positive semidefinite, and telescoping (32) gives

  Σ_{t=0}^k [ ∥x^{t+1} − x^t∥²_{W̃/α − LI} + α ∥u^{t+1} − u^t∥² + (1/L) ∥∇f(x^{t+1}) − ∇f(x⋆)∥² ] ≤ ∥z^0 − z⋆∥²_Q < ∞.  (33)
Moreover, by (24) we have u⋆ = −H† ∇f(x⋆), which, together with u^0 = H† q^0, yields u^0 − u⋆ = H†(q^0 + ∇f(x⋆)). Substituting this equation into (33) gives (17). This completes the proof of Theorem 1.

References

[1] R. Olfati-Saber, J. A. Fax, and R. M. Murray, "Consensus and cooperation in networked multi-agent systems," Proceedings of the IEEE, vol. 95, no. 1, pp. 215–233, 2007.
[2] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, "Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent," Advances in Neural Information Processing Systems, vol. 30, 2017.
[3] A. Nedić and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
[4] K. Yuan, Q. Ling, and W. Yin, "On the convergence of decentralized gradient descent," SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
[5] D. Jakovetić, J. Xavier, and J. M. F. Moura, "Fast distributed gradient methods," IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014.
[6] M. S. Assran and M. G. Rabbat, "Asynchronous gradient push," IEEE Transactions on Automatic Control, vol. 66, no. 1, pp. 168–183, 2020.
[7] X. Wu, C. Liu, S. Magnússon, and M. Johansson, "Asynchronous distributed optimization with delay-free parameters," IEEE Transactions on Automatic Control, vol. 71, no. 1, pp. 259–274, 2026.
[8] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
[9] A. Nedić, A. Olshevsky, and W. Shi, "Achieving geometric convergence for distributed optimization over time-varying graphs," SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
[10] W. Shi, Q. Ling, G. Wu, and W. Yin, "A proximal gradient algorithm for decentralized composite optimization," IEEE Transactions on Signal Processing, vol. 63, no. 22, pp. 6013–6023, 2015.
[11] A. Nedić, A. Olshevsky, W. Shi, and C. A. Uribe, "Geometrically convergent distributed optimization with uncoordinated step-sizes," in American Control Conference, 2017, pp. 3950–3955.
[12] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2018.
[13] C. Xi and U. A. Khan, "DEXTRA: A fast algorithm for optimization over directed graphs," IEEE Transactions on Automatic Control, vol. 62, no. 10, pp. 4980–4993, 2017.
[14] S. A. Alghunaim and A. H. Sayed, "Linear convergence of primal–dual gradient methods and their performance in distributed optimization," Automatica, vol. 117, p. 109003, 2020.
[15] A. Makhdoumi and A. Ozdaglar, "Convergence rate of distributed ADMM over networks," IEEE Transactions on Automatic Control, vol. 62, no. 10, pp. 5082–5095, 2017.
[16] J. Lei, H.-F. Chen, and H.-T. Fang, "Primal–dual algorithm for distributed constrained optimization," Systems & Control Letters, vol. 96, pp. 110–117, 2016.
[17] N. S. Aybat, Z. Wang, T. Lin, and S. Ma, "Distributed linearized alternating direction method of multipliers for composite convex consensus optimization," IEEE Transactions on Automatic Control, vol. 63, no. 1, pp. 5–20, 2017.
[18] X. Wu and J. Lu, "A unifying approximate method of multipliers for distributed composite optimization," IEEE Transactions on Automatic Control, vol. 68, no. 4, pp. 2154–2169, 2022.
[19] Z. Wang, Q. Ling, and W. Yin, "Decentralized bundle method for nonsmooth consensus optimization," in IEEE Global Conference on Signal and Information Processing, 2017, pp. 568–572.
[20] L. Xiao and S. Boyd, "Fast linear iterations for distributed averaging," Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004.
[21] M. Díaz and B. Grimmer, "Optimal convergence rates for the proximal bundle method," SIAM Journal on Optimization, vol. 33, no. 2, pp. 424–454, 2023.
[22] D. Cederberg, X. Wu, S. P. Boyd, and M. Johansson, "An asynchronous bundle method for distributed learning problems," in The Thirteenth International Conference on Learning Representations, 2025.
[23] F. Iutzeler, J. Malick, and W. de Oliveira, "Asynchronous level bundle methods," Mathematical Programming, vol. 184, no. 1, pp. 319–348, 2020.
[24] J. E. Kelley, Jr., "The cutting-plane method for solving convex programs," Journal of the Society for Industrial and Applied Mathematics, vol. 8, no. 4, pp. 703–712, 1960.
[25] S. Boyd and L. Vandenberghe, Convex Optimization, 1st ed. Cambridge University Press, 2004.
[26] L. Condat, "Fast projection onto the simplex and the ℓ1 ball," Mathematical Programming, vol. 158, no. 1, pp. 575–585, 2016.
[27] Y. Nesterov, Lectures on Convex Optimization. Springer, 2018, vol. 137.