In this paper, we present a detailed convergence analysis of Network-GIANT, a recently developed approximate Newton-type fully distributed optimization method for smooth, strongly convex local loss functions, which has been empirically shown to exhibit faster linear convergence while having the same per-iteration communication complexity as its first-order distributed counterparts. Exploiting consensus-based parameter updates and a local Hessian-based descent direction with gradient tracking at the individual nodes, we first explicitly characterize a global linear convergence rate for Network-GIANT, computed as the spectral radius of a $3 \times 3$ matrix that depends on the Lipschitz continuity ($L$) and strong convexity ($\mu$) parameters of the objective functions, and on the spectral norm ($\sigma$) of the doubly stochastic consensus matrix representing the underlying undirected graph. We provide an explicit bound on the step size $\eta$ below which this spectral radius is guaranteed to be less than $1$. Furthermore, we derive a mixed linear-quadratic inequality bounding the optimality gap norm, which allows us to conclude that, for small step sizes, Network-GIANT asymptotically achieves a local linear convergence rate of $1-\eta(1-\frac{\gamma}{\mu})$ as it approaches the global optimum, provided the Hessian approximation error $\gamma$ (between the harmonic mean of the local Hessians and the global Hessian, i.e., the arithmetic mean of the local Hessians) is smaller than $\mu$. This asymptotic linear rate of approximately $1-\eta$ explains the faster convergence of Network-GIANT for the first time. Numerical experiments on a reduced CovType dataset for binary logistic regression over a variety of graphs illustrate the above theoretical results.
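As a brief sanity check on the asymptotic rate quoted above (using only the symbols already defined there), whenever $0 \le \gamma < \mu$ and $\eta \in (0,1]$ we have
$$
0 < 1-\frac{\gamma}{\mu} \le 1 \quad \Longrightarrow \quad 1-\eta \;\le\; 1-\eta\Bigl(1-\frac{\gamma}{\mu}\Bigr) \;<\; 1,
$$
so the local contraction factor is strictly less than $1$ and tends to $1-\eta$ as the Hessian approximation error $\gamma$ vanishes.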
Distributed optimization and learning generally refer to the paradigm where multiple agents optimize a global cost function or train a global model collaboratively, often by optimizing their local cost functions or training a local model on their local data sets. In applications such as networked autonomous systems, the Internet of Things (IoT), collaborative robotics, and smart manufacturing, moving large amounts of locally collected data to a central processing unit can be expensive from both a communication and a storage point of view. Additionally, distributed optimization/learning avoids sharing of raw data, thus guaranteeing data privacy to a certain extent. Distributed/decentralized optimization or learning can occur over a network consisting of a central server with multiple local nodes (Federated Learning), or over a fully distributed setting without a central server, where nodes communicate only with their neighbours. There has been significant progress in recent years in both settings; see [1], [2], [3] for extensive surveys of these topics focusing on their key challenges and opportunities.
In this work, we primarily focus on fully distributed optimization, where nodes collaboratively optimize a global cost that is a sum of local cost functions. We also restrict ourselves to convex, and in particular strongly convex, cost functions. In this context, distributed optimization or learning algorithms proceed in an iterative fashion, where each node updates its local estimate of the global optimization variable based on consensus-type averaging of its own information and the information received from its neighbours, combined with (typically) a gradient-descent-type update with a suitably chosen step size (constant or diminishing with time). Without going into a detailed literature survey of gradient-based distributed optimization algorithms, we refer the reader to [4], [5]. For strongly convex functions, it is well known that gradient-based techniques achieve linear (exponential) convergence with a constant step size. Replicating this in a fully distributed setting requires a gradient-tracking algorithm, where each node maintains a vector that serves as a local estimate of the global gradient [6]. A detailed convergence analysis was carried out in [6] to establish linear convergence of gradient-tracking-based distributed first-order optimization algorithms for smooth and strongly convex functions over undirected graphs, whereas a sublinear convergence rate of $O(1/k)$ was guaranteed for smooth convex functions that are not strongly convex. A gradient-tracking-based distributed first-order optimization algorithm with additional distributed eigenvector tracking was presented in [7], with a provable linear convergence rate.
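To make the gradient-tracking mechanism concrete, the following is a minimal Python sketch of a generic gradient-tracking iteration of the type analysed in [6]. It is not the algorithm studied in this paper; the ring graph, quadratic local costs, step size, and iteration count are illustrative placeholders chosen only so the sketch runs as written.

import numpy as np

# Minimal sketch of distributed gradient tracking over an undirected ring graph.
# Each node i holds a local iterate x[i] and a tracker y[i] of the average gradient.
# W is a doubly stochastic consensus matrix and eta is a constant step size.
rng = np.random.default_rng(0)
N, d = 4, 3                                    # number of nodes, parameter dimension
A = [rng.standard_normal((d, d)) for _ in range(N)]
Q = [a.T @ a / d + np.eye(d) for a in A]       # local costs f_i(x) = 0.5 * x^T Q_i x
grad = lambda i, z: Q[i] @ z                   # gradient of f_i

W = np.zeros((N, N))                           # ring-graph mixing weights (illustrative)
for i in range(N):
    W[i, i] = 0.5
    W[i, (i + 1) % N] = 0.25
    W[i, (i - 1) % N] = 0.25

eta = 0.02
x = rng.standard_normal((N, d))                          # local iterates (one row per node)
y = np.array([grad(i, x[i]) for i in range(N)])          # trackers start at the local gradients

for k in range(500):
    x_new = W @ x - eta * y                              # consensus step plus descent along tracked gradient
    y = W @ y + np.array([grad(i, x_new[i]) - grad(i, x[i]) for i in range(N)])
    x = x_new                                            # invariant: mean(y) equals the mean of local gradients

print(np.linalg.norm(x))                                 # should decay toward the global minimizer x* = 0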
It is well known that for strongly convex and smooth functions, centralized optimization algorithms that use curvature (or Hessian) information of the cost function, such as the Newton-Raphson method, achieve faster, locally quadratic convergence in the vicinity of the optimum. In the context of fully distributed optimization algorithms utilizing Newton-type methods, achieving quadratic convergence is difficult, primarily because communicating local Hessian information to neighbouring nodes is infeasible for large model sizes or parameter dimensions, and the underlying consensus algorithm essentially slows down the convergence to at best a linear rate. Computation of the Hessian and its inversion is another bottleneck, which has motivated a plethora of recent approximate Newton-type Federated Learning algorithms [8], [9], [10], [11], which can also display a mixed linear-quadratic convergence rate. In the fully distributed regime, earlier examples of second-order optimization algorithms include [12], [13], [14], [15], which approximate the Hessian based on Taylor series expansions and historical data. More recent works, such as [16], developed a fully distributed version of DANE [17] (termed Network-DANE) to overcome the high computational costs of existing algorithms, and [18] captured the second-order information of the objective functions via an augmented Lagrangian function along with a gradient diffusion technique to obtain a distributed algorithm leveraging the Hessian information of the objective function.
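To make the kind of Hessian approximation referenced above (and in the abstract) concrete, the following is a minimal Python sketch of a GIANT-style approximate Newton iteration on synthetic quadratic costs. The exact averaging shown here is precisely what a fully distributed method such as Network-GIANT replaces with consensus and gradient tracking; the problem data, step size, and iteration count are illustrative placeholders.

import numpy as np

# GIANT-style approximate Newton step on synthetic quadratics
# f_i(x) = 0.5 * x^T Q_i x - b_i^T x, with local Hessians Q_i kept similar to one
# another to mimic the data-homogeneity regime where the approximation is accurate.
rng = np.random.default_rng(1)
N, d = 4, 3
A0 = rng.standard_normal((d, d))
Q0 = A0.T @ A0 / d + np.eye(d)                 # common "base" Hessian
Q = []
for _ in range(N):
    S = rng.standard_normal((d, d))
    Q.append(Q0 + 0.05 * (S + S.T))            # small symmetric perturbation keeps Q_i positive definite
b = [rng.standard_normal(d) for _ in range(N)]

x = np.zeros(d)
eta = 1.0
for k in range(15):
    g_avg = np.mean([Q[i] @ x - b[i] for i in range(N)], axis=0)    # exact global (average) gradient
    # Each node solves its *local* Newton system against the averaged gradient;
    # averaging the resulting directions amounts to a Newton step that uses the
    # harmonic mean of the local Hessians in place of the true arithmetic-mean Hessian.
    newton_dirs = [np.linalg.solve(Q[i], g_avg) for i in range(N)]
    x = x - eta * np.mean(newton_dirs, axis=0)

g_final = np.mean([Q[i] @ x - b[i] for i in range(N)], axis=0)
print(np.linalg.norm(g_final))                  # global gradient norm, should be near zero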
Recent papers such as [19] and [20] have shown, however, that the superlinear convergence properties of Federated second-order optimization algorithms such as GIANT can be recovered only when exact averages of gradients or parameters can be computed via finite-time exact consensus [19] or through distributed finite-time set consensus [20], both of which can require up to $O(N)$ consensus rounds (where $N$ is the number of nodes) between two successive optimization iterations, thus making them infeasible for large networks.
In this article, we focus on a recently developed approximate Newton-type fully distributed optimization algorithm called Network-GIANT.