Optimistic Bilevel Optimization with Composite Lower-Level Problem


This paper introduces a novel double regularization scheme for bilevel optimization problems whose lower-level problem is composite and convex, but not necessarily strongly convex, in the lower-level variable. The analysis focuses on the primal-dual solution mapping of the regularized lower-level problem and exploits its properties to derive an almost-everywhere formula for the gradient of the regularized hyper-objective under mild assumptions. The paper then establishes conditions under which the hyper-objective of the actual problem is well defined and shows that its gradient can be approximated by the gradient of the regularized hyper-objective. Building on these results, a gradient-sampling algorithm is proposed that computes approximately stationary points of the regularized hyper-objective, and its convergence to stationary points of the actual problem is proven. Two numerical examples from machine learning demonstrate the proposed approach.


💡 Research Summary

The paper addresses optimistic bilevel optimization where the lower‑level problem is composite, convex (but not necessarily strongly convex), and may involve nonsmooth epi‑polyhedral functions. To overcome the difficulties caused by set‑valued lower‑level solution maps, the authors introduce a double regularization scheme. The first regularizer is a quadratic term β‖y‖²/2 that makes the lower‑level objective strongly convex in y. The second regularizer applies a Moreau envelope e_α h to the composite term h∘G, smoothing the possibly nonsmooth function h while preserving its convexity and monotonicity.
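To make the two regularizers concrete, the sketch below instantiates them for the special case h = ‖·‖₁, where the Moreau envelope has a closed form (the Huber function) via the soft-thresholding proximal operator. The function names, the choice of h, and the generic smooth part `g` are illustrative assumptions of this sketch, not taken from the paper.

```python
import numpy as np

def prox_l1(z, t):
    # Proximal operator of t * ||.||_1 (componentwise soft-thresholding).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def moreau_env_l1(z, alpha):
    # Moreau envelope e_alpha h(z) for h = ||.||_1:
    #   e_alpha h(z) = min_u ||u||_1 + ||u - z||^2 / (2 * alpha),
    # attained at u = prox_l1(z, alpha); this equals the Huber function,
    # a smooth lower approximation of ||z||_1.
    u = prox_l1(z, alpha)
    return np.sum(np.abs(u)) + np.sum((u - z) ** 2) / (2.0 * alpha)

def regularized_lower_objective(y, g, G, alpha, beta):
    # Double regularization: the quadratic term beta/2 * ||y||^2 makes the
    # objective strongly convex in y, and the composite term h(G y) is
    # replaced by its Moreau envelope e_alpha h(G y).
    return g(y) + 0.5 * beta * np.dot(y, y) + moreau_env_l1(G @ y, alpha)
```

For |z| larger than α the envelope behaves like |z| − α/2, and for |z| ≤ α like z²/(2α), so smoothing only perturbs the composite term by at most α/2 per component.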

With this regularization, the lower‑level problem becomes strongly convex, guaranteeing that both the primal solution map Y_{α,β}(x) and the primal‑dual map S_{α,β}(x) are single‑valued for all α, β > 0. The authors formulate the optimality condition as a generalized equation 0∈ψ_x(y,p)+∂ĥ(y,p) and invoke an implicit‑function theorem for generalized equations (Theorem 2.1) to prove that S_{α,β}(·) is locally C¹ under a mild regularity condition: the critical cone K associated with ψ_x must satisfy K⊥∩K={0}. This condition does not require a constant‑rank assumption on ∇_yG, making it substantially weaker than assumptions used in earlier works.
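A minimal toy example of why the quadratic regularizer restores single-valuedness: a rank-deficient least-squares lower level has a whole affine set of minimizers, while adding β‖y‖²/2 yields a unique (ridge) solution. The specific matrices are illustrative, not from the paper.

```python
import numpy as np

# Rank-deficient lower level: min_y ||A y - b||^2 / 2 with A of rank 1
# has infinitely many minimizers, so the solution map is set-valued.
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

def y_beta(beta):
    # With the regularizer beta/2 * ||y||^2 the problem is strongly convex
    # and its unique minimizer is the ridge solution.
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + beta * np.eye(n), A.T @ b)
```

As β → 0 these unique solutions converge to the minimum-norm minimizer of the original problem (here (0.5, 0.5)), a behavior consistent with the convergence of the regularized solutions discussed later in the summary.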

The C¹ property yields an explicit Jacobian for the primal solution map: ∇Y_{α,β}(x)=B(Bᵀ∇ψ_x B)^{-1}Bᵀ, where B’s columns span K. Consequently, the gradient of the regularized hyper‑objective Φ_{α,β}(x)=f(x,Y_{α,β}(x)) can be written in closed form: ∇Φ_{α,β}(x)=∇_x f(x,y)+∇_y f(x,y)·∇Y_{α,β}(x) with y=Y_{α,β}(x). The authors show that this formula holds almost everywhere, providing a practically computable hyper‑gradient even when the original hyper‑objective Φ(x)=f(x,Y(x)) is set‑valued or nonsmooth.
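The Jacobian formula and the resulting chain rule can be sketched in a few lines. This is a shape-level illustration only: it assumes Y maps into the same dimension as x so that the Jacobian is square, takes H = ∇ψ_x and a basis matrix B as given inputs, and does not reproduce the paper's full derivation of these quantities.

```python
import numpy as np

def solution_map_jacobian(H, B):
    # Jacobian formula dY = B (B^T H B)^{-1} B^T, where the columns of B
    # span the critical cone K (here a linear subspace) and H = grad psi_x.
    M = B.T @ H @ B
    return B @ np.linalg.solve(M, B.T)

def hyper_gradient(grad_x_f, grad_y_f, H, B):
    # Chain rule for Phi(x) = f(x, Y(x)) at y = Y(x):
    #   grad Phi = grad_x f + dY^T grad_y f.
    dY = solution_map_jacobian(H, B)
    return grad_x_f + dY.T @ grad_y_f
```

With H = 2I and B = I the solution-map Jacobian is simply (2I)⁻¹ = 0.5·I, which makes the formula easy to sanity-check by hand.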

Section 4 proves that as α,β→0, the regularized primal‑dual solutions converge (in the Painlevé‑Kuratowski sense) to solutions of the original lower‑level problem, and the corresponding hyper‑gradients converge to a Clarke subgradient of Φ. This establishes that Φ is well defined and locally Lipschitz under the same mild conditions.

Building on these analytical results, the paper proposes a gradient‑sampling algorithm for the regularized hyper‑objective. At each iteration, a set of random perturbations within a ball of radius ε is generated around the current x. For each perturbed point, the regularized lower‑level problem is solved (yielding y_{α,β}) and the exact gradient ∇Φ_{α,β} is computed using the Jacobian formula. The convex hull of these gradients provides an ε‑Goldstein subgradient approximation, which is then used in a line‑search step (Armijo rule) to update x. The algorithm gradually shrinks ε, α, and β according to a prescribed schedule.
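The iteration described above can be sketched as follows. The minimum-norm element of the convex hull of sampled gradients is approximated by projected gradient descent on the simplex of convex weights; the sample size, Armijo constant, and iteration counts are assumptions of this sketch, not values prescribed by the paper.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * (np.arange(len(v)) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def min_norm_element(G):
    # Approximate the minimum-norm element of conv{rows of G} by
    # projected gradient descent on the convex weights.
    k = G.shape[0]
    lam = np.full(k, 1.0 / k)
    Q = G @ G.T
    step = 0.5 / (np.linalg.norm(Q, 2) + 1e-12)
    for _ in range(500):
        lam = project_simplex(lam - step * (Q @ lam))
    return G.T @ lam

def gradient_sampling_step(phi, grad, x, eps, m=10, c=1e-4, rng=None):
    # One iteration: sample m points in the eps-ball around x, evaluate
    # the hyper-gradient at each, take the negative min-norm element of
    # their convex hull (an eps-Goldstein descent direction), and update
    # x with an Armijo backtracking line search.
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    pts = [x]
    for _ in range(m):
        u = rng.standard_normal(n)
        pts.append(x + eps * rng.random() ** (1.0 / n) * u / np.linalg.norm(u))
    G = np.array([grad(p) for p in pts])
    d = -min_norm_element(G)
    t, f0 = 1.0, phi(x)
    while phi(x + t * d) > f0 - c * t * np.dot(d, d) and t > 1e-12:
        t *= 0.5  # Armijo backtracking
    return x + t * d
```

In the full method, `grad` would solve the regularized lower-level problem at each sampled point and apply the Jacobian-based hyper-gradient formula, and the outer loop would shrink ε, α, and β between calls.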

The convergence analysis proceeds in two stages. First, standard results for gradient sampling guarantee that the iterates converge to an ε‑Goldstein stationary point of Φ_{α,β}. Second, using the continuity of the solution maps and the convergence of the regularized problems, the authors show that any limit point of the iterates as ε,α,β→0 is a Clarke stationary point of the original hyper‑objective Φ. Notably, the analysis does not rely on the strict complementarity (NE) or constant‑rank (CR) assumptions required in earlier bilevel methods, thanks to the regularity of the primal‑dual mapping.

Numerical experiments on two machine‑learning tasks illustrate the practical impact. In a regularized logistic regression hyper‑parameter tuning problem, the proposed method achieves smoother and faster convergence compared with classical hyper‑gradient approaches that differentiate through the lower‑level solver. In a deep‑learning learning‑rate and weight‑decay tuning problem, where the lower‑level loss may have multiple minimizers, the double regularization prevents abrupt changes in the hyper‑objective, leading to stable descent. The experiments confirm that the algorithm can handle realistic composite lower‑level structures and that the theoretical guarantees translate into empirical robustness.

In summary, the paper makes four key contributions: (1) a double regularization that renders the lower‑level solution map globally piecewise smooth, (2) a rigorous C¹ analysis of the primal‑dual mapping and an explicit Jacobian formula, (3) a convergence proof that bridges regularized stationary points to Clarke stationary points of the original bilevel problem, and (4) an algorithmic framework that avoids restrictive assumptions of prior work while being applicable to a broad class of composite, nonsmooth lower‑level problems. The work opens avenues for stochastic extensions, large‑scale implementations, and handling nonconvex lower‑level objectives.

