Proximal Newton-type methods for minimizing composite functions
We generalize Newton-type methods for minimizing smooth functions to handle a sum of two convex functions: a smooth function and a nonsmooth function with a simple proximal mapping. We show that the resulting proximal Newton-type methods inherit the desirable convergence behavior of Newton-type methods for minimizing smooth functions, even when search directions are computed inexactly. Many popular methods tailored to problems arising in bioinformatics, signal processing, and statistical learning are special cases of proximal Newton-type methods, and our analysis yields new convergence results for some of these methods.
💡 Research Summary
The paper introduces a unified framework called proximal Newton methods for minimizing composite convex functions of the form
F(x) = g(x) + h(x),
where g is a smooth, twice‑differentiable convex function (with a Lipschitz‑continuous gradient, and a Lipschitz‑continuous Hessian for the fastest local rates), and h is a possibly nonsmooth convex regularizer whose proximal operator can be evaluated cheaply. Traditional Newton methods excel on smooth problems but cannot directly handle the nonsmooth term h; conversely, proximal gradient methods handle the composite structure but converge at best linearly (and in general only sublinearly). The authors bridge this gap by constructing, at each iterate x_k, a second‑order model of g
m_k(d) = ∇g(x_k)ᵀ d + ½ dᵀ H_k d
and then solving a proximal subproblem that adds h evaluated at the trial point:
d_k = arg min_d m_k(d) + h(x_k + d).
When H_k is positive definite, this subproblem can be rewritten as a scaled proximal mapping
d_k = prox_{h, H_k}(x_k – H_k⁻¹∇g(x_k)) – x_k,
where
prox_{h, H}(v) = arg min_u h(u) + ½ (u – v)ᵀ H (u – v).
Thus each Newton‑type step reduces to computing a proximal operator under a quadratic metric defined by the (possibly approximate) Hessian. The algorithm proceeds by (1) computing ∇g(x_k) and forming a Hessian approximation H_k (exact, limited‑memory BFGS, subsampled, etc.), (2) solving the proximal subproblem (exactly or inexactly), and (3) performing a line search or using a fixed step size α_k to update x_{k+1}=x_k+α_k d_k.
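As a concrete illustration, the three steps above can be sketched for the lasso objective g(x) = ½‖Ax − b‖², h(x) = λ‖x‖₁. Everything here is an illustrative assumption rather than the paper's implementation: the diagonal Hessian choice H_k = diag(AᵀA) (under which the scaled proximal mapping separates into coordinate‑wise soft‑thresholding) and the backtracking parameters are picked for simplicity.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_newton_lasso(A, b, lam, iters=1000, sigma=1e-4):
    """Proximal Newton-type sketch for F(x) = 0.5*||Ax - b||^2 + lam*||x||_1,
    using the diagonal Hessian approximation H_k = diag(A^T A), under which
    the scaled prox has the closed form soft_threshold(v, lam / H)."""
    F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
    x = np.zeros(A.shape[1])
    H = np.sum(A * A, axis=0)                  # diag(A^T A); assumed positive
    for _ in range(iters):
        grad = A.T @ (A @ x - b)               # gradient of the smooth part g
        # scaled proximal mapping: d_k = prox_{h,H}(x_k - H^{-1} grad) - x_k
        d = soft_threshold(x - grad / H, lam / H) - x
        if np.linalg.norm(d) < 1e-10:          # d = 0 <=> first-order stationarity
            break
        t, Fx = 1.0, F(x)                      # backtracking line search enforcing
        while t > 1e-12 and F(x + t * d) > Fx - sigma * t * np.dot(d, H * d):
            t *= 0.5                           # a sufficient-decrease condition
        x = x + t * d
    return x
```

With the exact Hessian the same loop would need an inner solver for the subproblem; the diagonal choice trades model accuracy for a closed‑form step.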
Convergence Theory
The authors develop two complementary convergence results. First, under a sufficient‑decrease condition (e.g., F(x_{k+1}) ≤ F(x_k) – σ‖d_k‖²) that can be guaranteed by a backtracking line search, the sequence {x_k} is shown to converge globally to a minimizer of F. Second, they prove local superlinear or quadratic convergence when the subproblem is solved with controlled inexactness. Specifically, if the computed direction d_k satisfies
‖d_k – d_k*‖ ≤ η_k ‖d_k*‖, with η_k → 0,
where d_k* is the exact solution of the proximal subproblem, and if H_k approximates the true Hessian sufficiently well along the step directions, then the algorithm inherits the classic Newton‑type rates. Importantly, the required accuracy is modest: η_k can be a diminishing sequence tied to a computable optimality measure (e.g., the norm of the composite gradient step; ∇F itself is undefined because F is nonsmooth) rather than a fixed stringent tolerance, which makes the method practical for large‑scale problems.
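One way to realize this controlled inexactness is to run an inner first‑order method on the subproblem and stop once a fixed‑point residual falls below the tolerance. The sketch below uses a hypothetical stopping rule of this shape (the paper's adaptive criterion differs in its details); `prox_h_shifted` is an assumed helper that applies the prox of h composed with the shift by the current iterate x_k.

```python
import numpy as np

def inexact_subproblem(grad, H, prox_h_shifted, eta, max_inner=2000):
    """Approximately solve  min_d  grad^T d + 0.5 d^T H d + h(x_k + d)
    by proximal gradient on the quadratic model, stopping once the
    fixed-point residual drops below eta.
    prox_h_shifted(u, t) must return prox_{t*h}(x_k + u) - x_k."""
    L = np.linalg.norm(H, 2)           # Lipschitz constant of the model gradient
    d = np.zeros_like(grad)
    for _ in range(max_inner):
        model_grad = grad + H @ d      # gradient of the quadratic model
        d_next = prox_h_shifted(d - model_grad / L, 1.0 / L)
        res = L * np.linalg.norm(d_next - d)   # fixed-point (optimality) residual
        d = d_next
        if res <= eta:                 # inexactness controlled by eta = eta_k
            break
    return d
```

Driving η_k → 0 across outer iterations (rather than solving every subproblem to machine precision) is what keeps the total inner work modest while preserving the fast local rates.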
Hessian Approximation Flexibility
A major contribution is the demonstration that the Hessian need not be computed exactly. The paper analyzes three practical strategies:
- Exact Hessian – yields the standard Newton behavior.
- Limited‑memory BFGS (L‑BFGS) – stores a small number of curvature pairs, giving a compact (low‑rank plus scaled‑identity) approximation that remains accurate enough along the step directions to preserve fast convergence.
- Subsampled or stochastic Hessian‑vector products – useful when the data size is massive; the authors show that with appropriate variance control the convergence guarantees remain intact.
These options allow the proximal Newton framework to scale to high‑dimensional machine‑learning problems while preserving fast convergence.
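To make the quasi‑Newton option concrete, a single BFGS update of the Hessian approximation might look as follows. This is the generic textbook update with a standard curvature safeguard, not code from the paper:

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the Hessian approximation H from the curvature pair
    s = x_{k+1} - x_k,  y = grad_g(x_{k+1}) - grad_g(x_k).
    The update is skipped when s^T y <= 0 (curvature condition violated),
    a standard safeguard that keeps H positive definite."""
    sy = s @ y
    if sy <= 1e-12:
        return H                       # skip: no positive curvature observed
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / sy
```

The updated matrix satisfies the secant condition H⁺s = y, which ties the approximation to the true curvature along the most recent step; L‑BFGS applies the same formula implicitly to only the m most recent pairs instead of storing H.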
Relation to Existing Algorithms
The paper systematically maps several well‑known methods onto the proximal Newton template:
- glmnet (for ℓ₁‑penalized logistic regression) corresponds to forming the quadratic model with the exact Hessian of the logistic loss and solving the proximal subproblem by coordinate descent, with the ℓ₁ term handled exactly via soft‑thresholding.
- OWL‑QN and prox‑Newton are identified as instances where a full or quasi‑Newton Hessian is employed together with an inner Newton or conjugate‑gradient routine to solve the proximal subproblem.
- Proximal gradient descent (ISTA) emerges as a first‑order special case where H_k is replaced by a scalar multiple of the identity; accelerated variants such as FISTA add a momentum step on top of this template.
By placing these algorithms in a common theoretical setting, the authors obtain new convergence results for several of them (e.g., superlinear local convergence for certain quasi‑Newton variants that previously carried only linear‑rate guarantees).
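The scalar‑metric special case is easy to verify directly: with H_k = c·I the scaled prox collapses to the ordinary proximal operator with step 1/c, so the proximal Newton step is exactly a proximal gradient step. The check below uses h = λ‖·‖₁, whose scaled prox has a closed form, and confirms the optimality condition 0 ∈ ∂h(u) + c(u − v) of the scaled prox numerically (the specific numbers are arbitrary):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# For h = lam*||.||_1 and H = c*I:  prox_{h,H}(v) = soft_threshold(v, lam/c)
c, lam = 4.0, 0.5
v = np.array([1.0, -0.05, 0.3])
u = soft_threshold(v, lam / c)

# Optimality condition of  min_u h(u) + (c/2)*||u - v||^2 :
#   u_i != 0:  lam*sign(u_i) + c*(u_i - v_i) = 0
#   u_i == 0:  |c*(v_i - u_i)| <= lam
for ui, vi in zip(u, v):
    if ui != 0.0:
        assert abs(lam * np.sign(ui) + c * (ui - vi)) < 1e-12
    else:
        assert abs(c * vi) <= lam + 1e-12
```

This is precisely why a scalar H_k recovers a first‑order method: the quadratic metric carries no curvature information beyond a step size.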
Empirical Evaluation
Experiments on three representative domains illustrate the practical impact:
- High‑dimensional genomic classification with ℓ₁‑regularized logistic loss.
- Image deblurring using total variation regularization (a nonsmooth convex h).
- Large‑scale matrix completion with nuclear‑norm regularization.
Across all tasks, the proximal Newton method required dramatically fewer outer iterations (often 5–10× fewer) than proximal gradient or FISTA to reach the same objective tolerance. When combined with L‑BFGS Hessian approximations, the wall‑clock time was reduced by a factor of 2–4 while keeping memory consumption comparable to first‑order methods. The authors also report robustness to the choice of line‑search parameters and demonstrate that even coarse Hessian approximations suffice for achieving superlinear convergence once the iterates enter a neighborhood of the optimum.
Conclusions and Future Directions
The study establishes that Newton‑type curvature information can be seamlessly integrated with proximal operators, yielding algorithms that retain the fast local convergence of Newton methods while handling nonsmooth regularizers common in modern statistical learning. The theoretical analysis accommodates inexact subproblem solutions and approximate Hessians, making the approach viable for massive datasets. The authors suggest several extensions: handling nonconvex h, developing fully stochastic proximal Newton variants, and designing distributed implementations that exploit the separable structure of the proximal step. Overall, the paper provides a comprehensive, mathematically rigorous, and practically relevant contribution to the toolbox of convex optimization for data‑intensive applications.