Topological Exploration of High-Dimensional Empirical Risk Landscapes: general approach, and applications to phase retrieval

We consider the landscape of empirical risk minimization for high-dimensional Gaussian single-index models (generalized linear models). The objective is to recover an unknown signal $\boldsymbol{\theta}^\star \in \mathbb{R}^d$ (where $d \gg 1$) from a loss …

Authors: Antoine Maillard, Tony Bonnaire, Giulio Biroli

Antoine Maillard*¹, Tony Bonnaire²,³, and Giulio Biroli³

¹ INRIA Paris & DI ENS, PSL University, Paris, France.
² Université Paris-Saclay, CNRS, Institut d'Astrophysique Spatiale, 91405 Orsay, France.
³ Laboratoire de Physique de l'École normale supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université Paris Cité, F-75005 Paris, France.

February 23, 2026

Abstract

We consider the landscape of empirical risk minimization for high-dimensional Gaussian single-index models (generalized linear models). The objective is to recover an unknown signal $\boldsymbol{\theta}^\star \in \mathbb{R}^d$ (where $d \gg 1$) from a loss function $\hat{R}(\boldsymbol{\theta})$ that depends on pairs of labels $(\mathbf{x}_i \cdot \boldsymbol{\theta}, \mathbf{x}_i \cdot \boldsymbol{\theta}^\star)_{i=1}^n$, with $\mathbf{x}_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$, in the proportional asymptotic regime $n \asymp d$. Using the Kac-Rice formula, we analyze different complexities of the landscape (defined as the expected number of critical points) corresponding to various types of critical points, including local minima. We first show that some variational formulas previously established in the literature for these complexities can be drastically simplified, reducing to explicit variational problems over a finite number of scalar parameters that we can efficiently solve numerically. Our framework also provides detailed predictions for properties of the critical points, including the spectral properties of the Hessian and the joint distribution of labels. We apply our analysis to the real phase retrieval problem, for which we derive complete topological phase diagrams of the loss landscape, characterizing notably BBP-type transitions where the Hessian at local minima (as predicted by the Kac-Rice formula) becomes unstable in the direction of the signal.
We test the predictive power of our analysis to characterize gradient flow dynamics, finding excellent agreement with finite-size simulations of local optimization algorithms, and capturing fine-grained details such as the empirical distribution of labels. Overall, our results open new avenues for the asymptotic study of loss landscapes and topological trivialization phenomena in high-dimensional statistical models.

* antoine.maillard@inria.fr

Contents

1 Introduction
1.1 The setting: single/multi-index models and phase retrieval
1.2 Summary of contributions
1.3 Related works
1.4 Organization of the paper
1.5 Notations
2 Landscape analysis via the Kac-Rice approach
2.1 Setting and objectives of the Kac-Rice method
2.2 Sketch of the derivation
2.3 Main theoretical results from the Kac-Rice approach
2.4 The BBP-Kac-Rice method, and the instability of local minima
2.5 Applications: exploring the landscape of phase retrieval
3 Probing the theory: gradient descent dynamics
3.1 Settings of the experiments
3.2 Properties of minima: Kac-Rice predictions vs numerical experiments
3.3 Phase diagram from Kac-Rice vs gradient descent algorithmic thresholds
4 Conclusion
A The Kac-Rice formula: technicalities and numerical solutions
B Derivation of the BBP transition condition
C Further exploration of the phase retrieval landscape
D Extension of the dynamical comparison to $a = 1.0$

1 Introduction

Minimization of non-convex, high-dimensional landscapes with potentially many local minima is routinely carried out in machine learning and optimization with remarkable success using simple local iterative procedures such as gradient descent and its stochastic variants. Understanding why, and under what conditions, these algorithms work so well remains a major open challenge, and has motivated a large body of recent work. Without aiming to be exhaustive, representative contributions include [Ney+17; BMM18; MBB18; VBB19; MVZ20; MBB23; Ann+23].

Over the past decade, considerable effort has been devoted to analyzing the geometry of loss landscapes (focusing on properties such as the presence of spurious minima, saddle points, and connectivity) and to elucidating their relationship with training dynamics. Several works have shown that, in specific parameter regimes, spurious local minima disappear, for instance when the signal-to-noise ratio is sufficiently large [SC16; Cai+22]. In such cases, despite their intrinsic non-convexity, landscapes effectively become easy to optimize. This observation has led to explanations of the success of gradient-based methods in terms of a “trivialization” of the loss landscape [Fyo04], associated with the absence of bad minima. Yet this perspective leaves out a salient empirical fact: substantial evidence now shows that suboptimal minima may still exist even in parameter regimes where optimization performs well in practice [Bai+19; LPA20].
Among the various approaches developed to study high-dimensional loss landscapes, the one most closely related to the present work is the rigorous characterization of their geometric structure. This line of research [Fyo04; AAC10] originated in the study of spin-glass models (physical systems characterized by highly rugged energy landscapes) and was later extended to problems in statistical inference and machine learning [GM17; MBB20; AMS25]. The main analytical tool in this context is the Kac-Rice method, which has been reviewed recently in [RF22]. Using this framework, Gaussian high-dimensional landscapes have been thoroughly characterized, including those arising in spherical spin-glass models and inference problems such as tensor PCA [Ros+19; Aro+19]. Connections between landscape geometry and training dynamics have also been investigated using statistical-physics techniques [Man+19a; Man+19b; Man+20b].

In contrast, typical loss landscapes encountered in machine learning are defined as empirical sums over training data and are therefore not Gaussian. Extensions of the Kac-Rice method to this more general setting were first developed in [MBB20] for generalized linear models, and further advanced in the recent work [AMS25] to multi-index models. Nevertheless, a detailed characterization of the geometry of non-convex landscapes arising in generalized linear models, and of its evolution as a function of the signal-to-noise ratio, is still lacking. Given the large number of recent studies devoted to the training dynamics of generalized linear models, see for instance [Col+24; BBP23; BGJ22; Lee+24; Arn+23; MW26], this represents a timely and important challenge.
Addressing it would make it possible to establish a theoretical connection between loss-landscape geometry and optimization dynamics beyond the Gaussian case, and in settings directly relevant to realistic machine learning models.

The aim of this work is to initiate a systematic investigation along this direction. We focus on phase retrieval, a generalized linear model of practical relevance that has been extensively studied from the perspective of training dynamics. Our approach can be straightforwardly generalized to other generalized linear models.

1.1 The setting: single/multi-index models and phase retrieval

In this work we focus on learning a single-index model through empirical risk minimization. Formally, we consider a general form of empirical risk function:

$$
\hat{R}(\boldsymbol{\theta}) := \frac{1}{n} \sum_{i=1}^n \ell(\mathbf{x}_i \cdot \boldsymbol{\theta}, \mathbf{x}_i \cdot \boldsymbol{\theta}^\star). \tag{1}
$$

The loss function $\ell : \mathbb{R}^2 \to \mathbb{R}$ is taken so far to be smooth but generic. The parameters $\boldsymbol{\theta} \in \mathbb{R}^d$ and the ground truth $\boldsymbol{\theta}^\star \in \mathbb{R}^d$ are taken to be unit normed, $\|\boldsymbol{\theta}\|_2 = \|\boldsymbol{\theta}^\star\|_2 = 1$, and we assume that the data/sensing vectors $\{\mathbf{x}_i\}_{i=1}^n$ are drawn i.i.d. from the standard Gaussian distribution $\mathcal{N}(0, I_d)$. We consider the minimization of $\hat{R}(\boldsymbol{\theta})$ in a high-dimensional regime where both the number of parameters $d$ and the sample size $n = n(d)$ go to infinity, at a fixed ratio:

$$
\lim_{d \to \infty} \frac{n}{d} = \alpha > 1. \tag{2}
$$

While we will state our theoretical results for a generic loss function $\ell$, the main focus of our applications is the so-called phase retrieval¹ problem, which is relevant for accurately reconstructing signals and images and enabling reliable measurements in fields like optics, imaging, and quantum science [Don+23]. Concretely, following [BBC25; Cai+21a; Cai+21b], we consider the following loss function

$$
\ell_a(y, y^\star) := \frac{(y^2 - [y^\star]^2)^2}{a + [y^\star]^2}, \tag{3}
$$

for some normalization parameter $a > 0$.
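To make the setting of eqs. (1)-(3) concrete, here is a minimal NumPy sketch that draws Gaussian sensing vectors and evaluates the empirical risk. The function names (`loss_a`, `empirical_risk`, `random_unit_vector`), the seed, and the small values of `d` and `alpha` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def loss_a(y, y_star, a=1.0):
    """Normalized loss of eq. (3): (y^2 - y*^2)^2 / (a + y*^2)."""
    return (y**2 - y_star**2) ** 2 / (a + y_star**2)


def empirical_risk(theta, theta_star, X, a=1.0):
    """Empirical risk of eq. (1): average loss over the n rows x_i of X."""
    return float(np.mean(loss_a(X @ theta, X @ theta_star, a)))


def random_unit_vector(d, rng):
    """Uniform sample on the unit sphere of R^d (hypothetical helper)."""
    v = rng.standard_normal(d)
    return v / np.linalg.norm(v)


rng = np.random.default_rng(0)      # fixed seed, for illustration only
d, alpha = 200, 3.0                 # small d for speed; the theory is for d -> infinity
n = int(alpha * d)                  # proportional regime n = alpha * d, eq. (2)
X = rng.standard_normal((n, d))     # rows x_i ~ N(0, I_d)
theta_star = random_unit_vector(d, rng)
theta = random_unit_vector(d, rng)

risk_random = empirical_risk(theta, theta_star, X)
risk_signal = empirical_risk(theta_star, theta_star, X)
risk_flipped = empirical_risk(-theta_star, theta_star, X)
```

Since the loss depends on $y$ only through $y^2$, the risk vanishes at both $\boldsymbol{\theta}^\star$ and $-\boldsymbol{\theta}^\star$: with real variables the signal can only be recovered up to a global sign.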
The reason behind this choice is to evaluate a relative error: without the denominator, rare but very large values of $[y^\star]^2$ can have a strong influence on the loss. Further, this loss function has a second derivative (w.r.t. $y$) bounded from below, which will be a useful property in what follows.

¹ (On the term “phase retrieval” in Section 1.1.) This is a slight abuse of terminology: we are rather considering a “sign retrieval” problem, since the variables are real.

1.2 Summary of contributions

The main contributions of this work are as follows.

A systematic framework for loss landscape analysis beyond Gaussian settings – We initiate a systematic investigation of the geometry of non-convex empirical risk landscapes in high dimension, going substantially beyond what was previously available. Building on the Kac-Rice approach of [MBB20; AMS25], we derive tractable scalar variational formulas for the so-called annealed complexities¹ of generic critical points, saddles of sub-extensive index, and an upper bound for the one of local minima. While we focus on phase retrieval as our illustrating example, we emphasize that our approach is not limited to this problem: it applies broadly to other single-index models (e.g. logistic regression, other generalized linear models), and can be extended to multi-index models using the framework of [AMS25], or to other high-dimensional estimation tasks such as Gaussian mixture classification, as discussed in [MBB20]. Furthermore, although in this work we restrict to parameters constrained to the unit sphere $\mathcal{S}^{d-1} := \{\boldsymbol{\theta} \in \mathbb{R}^d : \|\boldsymbol{\theta}\| = 1\}$, the same methodology can be applied to other geometries and regularization schemes, as considered in [AMS25].
Detailed characterization of critical points – A key feature of our approach is that the variational formulas yield not only the complexities of the different types of critical points, but also precise predictions for their statistical properties. More precisely, we obtain:
• The spectral density of the Hessian of $\hat{R}(\boldsymbol{\theta})$ at a typical critical point of a given type (local minimum, sub-extensive-index saddle, or generic saddle), possibly at a fixed value of the loss (or “energy”) $\hat{R}(\boldsymbol{\theta})$, and fixed overlap $q := \boldsymbol{\theta} \cdot \boldsymbol{\theta}^\star$ with the signal;
• The limiting empirical distribution of the labels $(y_i, y_i^\star)_{i=1}^n = (\mathbf{x}_i \cdot \boldsymbol{\theta}, \mathbf{x}_i \cdot \boldsymbol{\theta}^\star)_{i=1}^n$, at typical critical points of a given type (as above).

Our characterization of local minima goes through a complexity measure $\widetilde{\Sigma}_0$, which is a tighter upper bound on the annealed complexity of local minima than the sub-extensive-index complexity (as studied in [AMS25]). Compared to the latter, it includes an additional constraint, encoding the requirement that the signal direction is not a descent direction at a local minimum. This constraint proves crucial, as it enables us to establish a finite trivialization threshold $\alpha_{\rm triv.}$ for the landscape of phase retrieval, while we find that the complexity $\Sigma_{\rm sub.}$ of sub-extensive-index saddles never trivializes for any finite $\alpha$.

A BBP-Kac-Rice method connecting landscape geometry to dynamics – We combine the Kac-Rice analysis with random matrix theory (specifically, the Baik–Ben Arous–Péché transition [BBP05; BN11]) to determine whether a negative outlier eigenvalue, aligned with the signal, emerges from the Hessian at a typical critical point. This “BBP-Kac-Rice” method extends the analysis of [BBC25] from the equator ($q = 0$) to arbitrary overlaps $q$.
It reveals that the instability of local minima towards the signal can occur in a wide range of $\alpha$ where the Kac-Rice upper bound still predicts a positive complexity, providing an accurate prediction of the algorithmic success of gradient descent. This allows us to compute precise topological phase diagrams for phase retrieval in the $(\alpha, q)$ plane, identifying the thresholds for trivialization of local minima, sub-extensive-index saddles, and generic critical points, as well as BBP instability thresholds for high-energy, typical-energy, and low-energy minima.

¹ (On “annealed complexities”.) I.e. the average number of these points.

Numerical analysis of gradient descent dynamics and comparison with analytical landscape predictions – We present a thorough numerical study of gradient descent using full-batch gradient-descent (GD) simulations at dimension $d = 512$ and for many $(\alpha, q)$, and compare the results to the analytical landscape predictions. A key challenge in such comparisons is the presence of strong finite-size corrections to the infinite-dimensional limit, as shown in [BBC25]. To circumvent these corrections while faithfully probing the large-$d$ limit, we adopt a modified dynamics introduced in [BBC25] and extend it to arbitrary overlaps $q > 0$: the dynamics is first run under a constraint that pins the overlap $\boldsymbol{\theta} \cdot \boldsymbol{\theta}^\star$ to a prescribed value $q_0$ for a burn-in phase, before being released to evolve freely. The numerical comparison yields the following results:
• Energy bands – For $q \in \{0.0, 0.1, 0.2\}$ and a range of $\alpha$, the average energies of minima in which GD gets trapped lie consistently inside the energy band predicted by the Kac-Rice method, and closely track the typical energy $e^\star = \arg\max_e \widetilde{\Sigma}_0(q, e)$.
• Hessian spectral density – The empirical distribution of Hessian eigenvalues at minima in which GD gets trapped is in excellent agreement with the annealed Kac-Rice prediction $\rho_0$, both in the bulk and in the tails, for all tested values of $(\alpha, q)$.
• Distribution of second derivatives and joint label distribution – The empirical distribution of the second derivatives $F(y_i, y_i^\star) := \partial_1^2 \ell(y_i, y_i^\star)$ at minima in which GD gets trapped matches the theoretical prediction closely. The joint law $\nu(y, y^\star)$ of predicted and ground-truth labels is also visually indistinguishable from the annealed Kac-Rice prediction $\nu_0$. Crucially, both theory and experiment exhibit a markedly non-Gaussian structure: even at $q = 0$ (zero overlap with the signal), the predicted labels $y_i = \boldsymbol{\theta} \cdot \mathbf{x}_i$ align closely with the ground-truth labels $y_i^\star = \boldsymbol{\theta}^\star \cdot \mathbf{x}_i$, in sharp contrast to what a generic random configuration would produce.
• Algorithmic phase diagram – We overlay the empirical GD success rates in the $(\alpha, q)$ plane on top of the theoretical phase diagram. For uncorrelated initialization, the empirical transition is in very good agreement with the BBP-KR instability threshold for typical-energy minima. Extending this comparison to the full $(\alpha, q)$ plane, the empirical success boundary aligns with the theoretical prediction but also shows some discrepancy, which we link to numerical difficulties in establishing the correct algorithmic threshold at increasingly larger values of $q$, and possibly to the need for a quenched Kac-Rice computation [Ros+19].

Taken together, these comparisons demonstrate that the theoretical framework we develop provides a highly accurate and interpretable picture of the minima encountered by gradient descent in large dimensions.
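The two-stage dynamics described above (a burn-in phase at pinned overlap $q_0$, followed by free evolution on the sphere) can be sketched as follows. This is only an illustration under stated assumptions: the way the constraint is enforced in `pinned_step` (projecting the gradient and renormalizing the component orthogonal to the signal), the step size, the number of steps, and the small dimension are all choices made here, and need not match the scheme of [BBC25] or the experiments of this paper.

```python
import numpy as np


def d1_loss(y, y_star, a=1.0):
    """Derivative of the loss of eq. (3) with respect to its first argument."""
    return 4.0 * y * (y**2 - y_star**2) / (a + y_star**2)


def risk(theta, theta_star, X, a=1.0):
    y, y_star = X @ theta, X @ theta_star
    return float(np.mean((y**2 - y_star**2) ** 2 / (a + y_star**2)))


def euclidean_grad(theta, theta_star, X, a=1.0):
    """Gradient of the empirical risk in R^d, before projection on the sphere."""
    return X.T @ d1_loss(X @ theta, X @ theta_star, a) / X.shape[0]


def free_step(theta, theta_star, X, lr):
    """Tangent-space gradient step on the unit sphere, then renormalization."""
    g = euclidean_grad(theta, theta_star, X)
    g = g - (g @ theta) * theta
    theta = theta - lr * g
    return theta / np.linalg.norm(theta)


def pinned_step(theta, theta_star, X, q0, lr):
    """Gradient step restricted to {theta . theta_star = q0} (assumed scheme)."""
    s = np.sqrt(1.0 - q0**2)
    w = (theta - q0 * theta_star) / s          # unit vector orthogonal to theta_star
    g = euclidean_grad(theta, theta_star, X)
    g = g - (g @ theta_star) * theta_star      # drop the signal direction
    g = g - (g @ w) * w                        # drop the direction of w itself
    w = s * w - lr * g
    w = w / np.linalg.norm(w)
    return q0 * theta_star + s * w


rng = np.random.default_rng(0)
d, alpha, q0, lr = 64, 8.0, 0.3, 0.01          # illustrative values only
n = int(alpha * d)
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
w0 = rng.standard_normal(d)
w0 -= (w0 @ theta_star) * theta_star
w0 /= np.linalg.norm(w0)
theta = q0 * theta_star + np.sqrt(1.0 - q0**2) * w0   # random start at overlap q0

risk_init = risk(theta, theta_star, X)
for _ in range(300):                           # burn-in at pinned overlap
    theta = pinned_step(theta, theta_star, X, q0, lr)
overlap_after_burn_in = float(theta @ theta_star)
for _ in range(300):                           # released dynamics
    theta = free_step(theta, theta_star, X, lr)
risk_final = risk(theta, theta_star, X)
overlap_final = float(theta @ theta_star)
```

Comparing `overlap_final` to $q_0$ over many samples and values of $(\alpha, q_0)$ is the kind of statistic from which an empirical success rate can be built; the sketch makes no claim about which outcome occurs for the parameters chosen here.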
We plan to release an extended version of this work, which will include a complementary analysis based on replica-symmetry breaking (1RSB) theory, as well as further numerical experiments exploring the Kac-Rice predictions across a broader range of models and parameters.

The analytical results presented in this work are derived at the level of rigor customary in theoretical physics. We believe that a fully rigorous mathematical proof of our results is within reach by leveraging the analysis of [MBB20; AMS25], but we leave this task for future research. We have intentionally adopted a presentation tailored to mathematically inclined readers, with the aim of stimulating interest within the mathematics community and encouraging the development of this line of research on a fully rigorous basis. More specifically, we begin from the annealed formulas for the complexities, which are essentially rigorous thanks to [MBB20; AMS25]. The simplification into scalar variational principles is done at the theoretical physics level of rigor. While we expect that these results can be established rigorously, we defer this question to future work. Finally, while we cannot guarantee mathematically that these scalar variational principles do not admit several solutions, we numerically never found more than one solution for local minima and saddles of sub-extensive index. This is however not the case for the complexity of all critical points, for which we exhibited the presence of two fixed points in a (very limited) region of parameters, associated to a first-order phase transition, see Appendix C.3.

1.3 Related works

There are only a few works applying the Kac-Rice method to non-Gaussian landscapes, such as those arising from empirical risk functions. The first analysis of this type was carried out in [MBB20], which the present work builds upon.
Shortly thereafter, [FT22] extended the Kac-Rice framework beyond purely Gaussian random landscapes. More recently, [TM25] derived equations for the average number of critical points in perceptron and generalized linear models in the presence of structured data, and [Cai+21a; Cai+21b] showed that trivialization occurs in phase retrieval with the loss of eq. (3) at large enough $\alpha$ (not scaling with $n$). Even more relevant for our work, [AMS25] applies the Kac-Rice method to high-dimensional estimation. In particular, it generalizes the results of [MBB20] to so-called multi-index models (and removes technical assumptions used in [MBB20]) and derives sharp results in convex settings using the Kac-Rice formalism.

The instability of local minima induced by a BBP transition toward the signal, and its implications for gradient-flow dynamics, were first discussed in [Man+20b; Man+19b] in the context of tensor-matrix PCA, and later extended to phase retrieval in [Man+20a]. Our work is closely related to the recent study [BBC25], which investigates the role of the BBP transition in the time-dependent Hessian for phase retrieval. See also the very recent preprints [MW26], which analyzes a broader class of models, and [ABC25], which highlights the role of over-parameterization in the BBP transition at the beginning of the dynamics.

1.4 Organization of the paper

Section 2 is dedicated to the presentation of the Kac-Rice approach: we introduce the principles of the method and its application to single-index models, as well as tractable asymptotic formulas for the complexities of different types of saddle points in the high-dimensional limit. We finish with a thorough numerical exploration of these predictions, showing in particular predictions for the spectral density of the Hessian, for the joint law of the labels $(\mathbf{x}_i \cdot \boldsymbol{\theta}, \mathbf{x}_i \cdot \boldsymbol{\theta}^\star)$, and trivialization transitions in the landscape.
In Section 3, we probe our theory by comparing the Kac-Rice predictions to the results of finite-size simulations of gradient descent dynamics. We show a strong qualitative agreement in terms of thresholds, and a very good agreement in terms of predictions of the laws of the labels and of the spectrum of the Hessian. We finally discuss the outcomes of our analysis in Section 4, as well as its limitations and future directions.

1.5 Notations

Throughout this paper, we use the following notations:
• $\mathcal{S}^{d-1}$: Euclidean unit sphere in $\mathbb{R}^d$.
• $\mathrm{grad}$, $\mathrm{Hess}$: Riemannian gradient and Hessian on the sphere $\mathcal{S}^{d-1}$.
• Index $\mathrm{i}(H)$: number of negative eigenvalues of a matrix $H$.
• $\partial_1 \ell$, $\partial_1^2 \ell$: successive derivatives of $\ell$ with respect to its first variable.
• $\mathcal{P}(E)$: probability measures on $E$.

2 Landscape analysis via the Kac-Rice approach

We describe here our main theoretical results regarding the landscape complexity of generalized linear models. After introducing the setting and our main goals in Section 2.1, we sketch in Section 2.2 the analytical derivation of the asymptotics of the average log-number of critical points (so-called complexities) using the Kac-Rice method. This approach was developed in [MBB20; AMS25] in our context: it yields variational formulas over spaces of probability measures for these asymptotic complexities, and provides bounds on the average log-number of local minima. These formulas are presented in Section 2.3, where we also show how they can be greatly reduced to variational formulas over a finite number of scalar parameters: we then present simple algorithms for approximating their solution. The Kac-Rice method provides access to the bulk eigenvalue density of the Hessian at critical points, but it does not capture potential outlier eigenvalues associated with directions aligned with the signal.
To investigate whether such outliers emerge as a function of the signal-to-noise ratio, we apply and extend the analysis of [BBC25]. This approach allows us to determine whether a Baik–Ben Arous–Péché (BBP) transition occurs as a function of $\alpha$, for typical critical points and minima at fixed loss value and fixed overlap with the signal. Finally, in Section 2.5 we leverage these algorithms to numerically explore the landscape complexities in high-dimensional (real) phase retrieval, a flagship example of a non-convex landscape in high-dimensional statistics. We probe the properties of various types of critical points, such as the Hessian's spectral density, and present sharp results for the trivialization of the landscape (i.e. the disappearance of all local minima) as a function of the sample complexity.

2.1 Setting and objectives of the Kac-Rice method

2.1.1 Critical points and complexities

For any two non-empty open intervals $Q \subseteq (-1, 1)$ and $B \subseteq \mathbb{R}$, let us denote

$$
\mathcal{E}(Q, B) := \{\boldsymbol{\theta} \in \mathcal{S}^{d-1} : \boldsymbol{\theta} \cdot \boldsymbol{\theta}^\star \in Q \ \text{and} \ \hat{R}(\boldsymbol{\theta}) \in B\}. \tag{4}
$$

We will separate the counting of different types of critical points in the landscape, according to the number of descending directions in the Hessian:

$$
\begin{cases}
\Sigma_{\rm tot.}(Q, B) := \lim_{d \to \infty} \dfrac{1}{d} \log \mathbb{E}\, \#\{\boldsymbol{\theta} \in \mathcal{E}(Q, B) : \mathrm{grad}\, \hat{R}(\boldsymbol{\theta}) = 0\}, \\[2mm]
\Sigma_k(Q, B) := \lim_{d \to \infty} \dfrac{1}{d} \log \mathbb{E}\, \#\{\boldsymbol{\theta} \in \mathcal{E}(Q, B) : \mathrm{grad}\, \hat{R}(\boldsymbol{\theta}) = 0 \ \text{and} \ \mathrm{i}[\mathrm{Hess}\, \hat{R}(\boldsymbol{\theta})] \leq k\}.
\end{cases} \tag{5}
$$

These complexities count respectively the critical points and the saddles of index at most $k$ as $d \to \infty$ (i.e. with at most $k$ descending directions), that have overlap $\boldsymbol{\theta} \cdot \boldsymbol{\theta}^\star \in Q$ with the signal, and loss, or energy¹, value $\hat{R}(\boldsymbol{\theta}) \in B$. Importantly, $\Sigma_0(Q, B)$ is the complexity of local minima, of particular relevance to the training dynamics. Notice that $\Sigma_{\rm sub.}(Q, B) := \lim_{\varepsilon \to 0} \Sigma_{\varepsilon d}(Q, B)$ counts critical points of sub-extensive index as $d \to \infty$ (i.e. with a $o(d)$ number of descending directions).
With respect to other works, $\Sigma_{\rm tot.}$ was analyzed first in [MBB20], while $\Sigma_{\rm sub.}$ is the quantity characterized in [AMS25] as a proxy for the complexity of local minima.

A bound on the complexity of minima – Finally, we define a refined bound for the complexity of minima, by counting critical points for which (i) the Hessian has a sub-extensive index, and (ii) the signal $\boldsymbol{\theta}^\star$ is not a descending direction, i.e. $(\boldsymbol{\theta}^\star)^\top [\mathrm{Hess}\, \hat{R}(\boldsymbol{\theta})] \boldsymbol{\theta}^\star \geq 0$:

$$
\widetilde{\Sigma}_0(Q, B) := \lim_{\varepsilon \to 0} \lim_{d \to \infty} \frac{1}{d} \log \mathbb{E}\, \#\big\{\boldsymbol{\theta} \in \mathcal{E}(Q, B) : \mathrm{grad}\, \hat{R}(\boldsymbol{\theta}) = 0 \ \text{and} \ \mathrm{i}[\mathrm{Hess}\, \hat{R}(\boldsymbol{\theta})] \leq \varepsilon d \ \text{and} \ (\boldsymbol{\theta}^\star)^\top [\mathrm{Hess}\, \hat{R}(\boldsymbol{\theta})] \boldsymbol{\theta}^\star \geq 0\big\}. \tag{6}
$$

Notice that we have the series of bounds

$$
\Sigma_0(Q, B) \leq \widetilde{\Sigma}_0(Q, B) \leq \Sigma_{\rm sub.}(Q, B) \leq \Sigma_{\rm tot.}(Q, B). \tag{7}
$$

In what follows, we provide exact asymptotics for $(\widetilde{\Sigma}_0, \Sigma_{\rm sub.}, \Sigma_{\rm tot.})$. As we will see, the upper bound $\Sigma_0 \leq \widetilde{\Sigma}_0$ is already sufficient to establish a topological trivialization of the landscape (i.e. $\widetilde{\Sigma}_0(Q, B) < 0$ for any $Q, B$) at a finite value of $\alpha = n/d$.

¹ (On “energy”.) We use loss or energy interchangeably to denote the value of $\hat{R}$.

2.2 Sketch of the derivation

Here we sketch the derivation of asymptotic formulas for $(\widetilde{\Sigma}_0, \Sigma_{\rm sub.}, \Sigma_{\rm tot.})$. It uses the Kac-Rice formula (see e.g. [AW09; AT07]), and is a straightforward adaptation of the computation detailed in [MBB20], and generalized in [AMS25]. For pedagogical reasons, we present a brief and heuristic sketch of the derivation in the case of $\Sigma_{\rm tot.}(Q, B)$, hence following closely [MBB20], to which we refer for more details. We indicate in the end how to generalize this computation to $\Sigma_{\rm sub.}$ and $\widetilde{\Sigma}_0$. We note that the computation of $\Sigma_{\rm sub.}$ can be seen as a special case of the results of [AMS25].
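The Kac-Rice formula – The first step of the computation is to apply the celebrated Kac-Rice formula [Kac43; Ric45] to compute the first moment of the number of critical points satisfying certain properties. Denote $C_d(Q, B) := \#\{\boldsymbol{\theta} \in \mathcal{E}(Q, B) : \mathrm{grad}\, \hat{R}(\boldsymbol{\theta}) = 0\}$. The Kac-Rice formula reads (see [AT07; AW09], or [MBB20, Lemma 3] for a detailed mathematical statement in our precise setting):

$$
\mathbb{E}\, C_d(Q, B) = \int_{\mathcal{S}^{d-1}} \varphi_{\mathrm{grad}\, \hat{R}(\boldsymbol{\theta})}(0)\, \mathbb{E}\Big[ \mathbf{1}\{\boldsymbol{\theta} \cdot \boldsymbol{\theta}^\star \in Q;\ \hat{R}(\boldsymbol{\theta}) \in B\}\, \big|\det \mathrm{Hess}\, \hat{R}(\boldsymbol{\theta})\big| \,\Big|\, \mathrm{grad}\, \hat{R}(\boldsymbol{\theta}) = 0 \Big]\, \sigma(\mathrm{d}\boldsymbol{\theta}), \tag{8}
$$

where $\sigma$ is the usual uniform surface measure on $\mathcal{S}^{d-1}$, and $\varphi_{\mathrm{grad}\, \hat{R}(\boldsymbol{\theta})}$ is the density of the gradient (importantly, this is the spherical gradient, which lives in the tangent space $T_{\boldsymbol{\theta}} \mathcal{S}^{d-1} \simeq \mathbb{R}^{d-1}$).

The law of the Hessian and gradient – From eq. (1) we can write the spherical gradient and Hessian as (recall $y_i = \boldsymbol{\theta} \cdot \mathbf{x}_i$ and $y_i^\star = \boldsymbol{\theta}^\star \cdot \mathbf{x}_i$):

$$
\begin{cases}
\mathrm{grad}\, \hat{R}(\boldsymbol{\theta}) = \dfrac{1}{n} \sum_{i=1}^n (P_{\boldsymbol{\theta}}^\perp \mathbf{x}_i)\, \partial_1 \ell(y_i, y_i^\star), \\[2mm]
\mathrm{Hess}\, \hat{R}(\boldsymbol{\theta}) = \dfrac{1}{n} \sum_{i=1}^n (P_{\boldsymbol{\theta}}^\perp \mathbf{x}_i)(P_{\boldsymbol{\theta}}^\perp \mathbf{x}_i)^\top\, \partial_1^2 \ell(y_i, y_i^\star) - \left( \dfrac{1}{n} \sum_{i=1}^n y_i\, \partial_1 \ell(y_i, y_i^\star) \right) P_{\boldsymbol{\theta}}^\perp.
\end{cases} \tag{9}
$$

Here $P_{\boldsymbol{\theta}}^\perp$ is the orthogonal projector on $T_{\boldsymbol{\theta}} \mathcal{S}^{d-1}$.

Conditioning on the law of the labels – Crucially, for a fixed value $q = \boldsymbol{\theta} \cdot \boldsymbol{\theta}^\star$, the joint law of $(\mathrm{grad}\, \hat{R}(\boldsymbol{\theta}), \mathrm{Hess}\, \hat{R}(\boldsymbol{\theta}))$ depends only on the empirical law of the labels $(y_i, y_i^\star)_{i=1}^n$. Notice that for a fixed $q$, these variables are sampled i.i.d. from $\mu_q$, where

$$
\mu_q := \mathcal{N}\left( 0, \begin{pmatrix} 1 & q \\ q & 1 \end{pmatrix} \right). \tag{10}
$$

Conditioning on $(u_i)_{i=1}^n := (y_i, y_i^\star)_{i=1}^n$, one can then obtain from eq. (8) (we refer to [MBB20] for more details):

$$
\mathbb{E}\, C_d(Q, B) = \int_Q \mathrm{d}q\, \omega_d(q)\, \mathbb{E}_{u_i \overset{\text{i.i.d.}}{\sim} \mu_q} \Big[ \varphi_{g(U)}(0)\, \delta[C(U)]\, \mathbf{1}\{\hat{R}(U) \in B\}\, \mathbb{E}_{\{\mathbf{z}_i\}} \big[ |\det H(U)| \,\big|\, g(U) = 0 \big] \Big]. \tag{11}
$$

In eq. (11) we denoted $U := (u_i)_{i=1}^n$.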
Moreover:

(i) $\omega_d(q)$ is a volumetric factor (arising from the volume integral in eq. (8)). It can be easily computed as:

$$
\omega_d(q) = c_d \exp\left\{ \frac{d}{2} \left[ 1 + \log \alpha + \log(1 - q^2) \right] \right\},
$$

where $(1/d) \log c_d \to 0$ as $d \to \infty$.

(ii) We denoted $\hat{R}(U) = (1/n) \sum_{i=1}^n \ell(y_i, y_i^\star)$, and

$$
C(U) := \frac{1}{n} \sum_{i=1}^n \left( \frac{P_{\boldsymbol{\theta}}^\perp \boldsymbol{\theta}^\star}{\|P_{\boldsymbol{\theta}}^\perp \boldsymbol{\theta}^\star\|_2} \cdot P_{\boldsymbol{\theta}}^\perp \mathbf{x}_i \right) \partial_1 \ell(y_i, y_i^\star) = \frac{1}{n} \sum_{i=1}^n \frac{y_i^\star - q y_i}{\sqrt{1 - q^2}}\, \partial_1 \ell(y_i, y_i^\star).
$$

Notice that $C(U)$ is the projection of the gradient in the direction of $P_{\boldsymbol{\theta}}^\perp \boldsymbol{\theta}^\star$. Intuitively, one needs to impose that the gradient in this direction is zero since, as detailed below, $g(U)$ is the component of the gradient only in the $\{\boldsymbol{\theta}, \boldsymbol{\theta}^\star\}^\perp$ space.

(iii) Finally, for $\mathbf{z}_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_{d-2})$, we have

$$
\begin{cases}
g(U) := \dfrac{1}{n} \sum_{i=1}^n \partial_1 \ell(y_i, y_i^\star)\, \mathbf{z}_i, \\[2mm]
H(U) := \dfrac{1}{n} \sum_{i=1}^n \begin{pmatrix} b_i^2 & b_i \mathbf{z}_i^\top \\ b_i \mathbf{z}_i & \mathbf{z}_i \mathbf{z}_i^\top \end{pmatrix} \partial_1^2 \ell(y_i, y_i^\star) - \left( \dfrac{1}{n} \sum_{i=1}^n y_i\, \partial_1 \ell(y_i, y_i^\star) \right) I_{d-1},
\end{cases} \tag{12}
$$

with $b_i := (y_i^\star - q y_i) / \sqrt{1 - q^2}$. Note that $g(U)$ is the projection of the gradient in $\{\boldsymbol{\theta}, \boldsymbol{\theta}^\star\}^\perp$. In $H(U)$ we explicitly separated the direction $\boldsymbol{\theta}^\star$ (corresponding to the first direction in the block-decomposition of $H$) from $\{\boldsymbol{\theta}, \boldsymbol{\theta}^\star\}^\perp$.

Concentration of the determinant, and large deviations – The key next step to simplify eq. (11) is the computation of $\mathbb{E}_{\mathbf{z}_i}[|\det H(U)| \,|\, g(U) = 0]$. We first notice that the conditioning at fixed $y_i, y_i^\star$ reduces to a linear conditioning on $\mathbf{z}_i$, which can be easily taken into account and leads to a slightly modified random matrix $H'(U)$, which is obtained from $H(U)$ by doing the replacement:

$$
\mathbf{z}_i \to \mathbf{z}_i - \partial_1 \ell(y_i, y_i^\star)\, \frac{\sum_j \partial_1 \ell(y_j, y_j^\star)\, \mathbf{z}_j}{\sum_j (\partial_1 \ell(y_j, y_j^\star))^2}.
$$

The key feature of $H'(U)$ is that it is simply a low-rank perturbation of $H(U)$.
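The objects of eq. (12) and the low-rank nature of the conditioning can be checked numerically at small size. The sketch below samples labels from $\mu_q$, builds $g(U)$ and $H(U)$ for the phase retrieval loss of eq. (3), applies the replacement of $\mathbf{z}_i$ written above, and inspects the rank of $H'(U) - H(U)$. Function names, the seed, and the values of `d`, `alpha`, `q` are illustrative assumptions; the closed forms of $\partial_1 \ell_a$ and $\partial_1^2 \ell_a$ follow by differentiating eq. (3).

```python
import numpy as np


def d1_loss(y, y_star, a=1.0):
    """First derivative in y of the loss of eq. (3)."""
    return 4.0 * y * (y**2 - y_star**2) / (a + y_star**2)


def d2_loss(y, y_star, a=1.0):
    """Second derivative in y of the loss of eq. (3)."""
    return (12.0 * y**2 - 4.0 * y_star**2) / (a + y_star**2)


def sample_labels(n, q, rng):
    """Draw (y_i, y*_i) i.i.d. from mu_q of eq. (10)."""
    y = rng.standard_normal(n)
    y_star = q * y + np.sqrt(1.0 - q**2) * rng.standard_normal(n)
    return y, y_star


def gradient_and_hessian(y, y_star, z, q):
    """Finite-size g(U) and H(U) of eq. (12); z has shape (n, d - 2)."""
    n = y.size
    b = (y_star - q * y) / np.sqrt(1.0 - q**2)
    V = np.column_stack([b, z])                      # rows (b_i, z_i) in R^{d-1}
    t = np.mean(y * d1_loss(y, y_star))
    H = (V.T * d2_loss(y, y_star)) @ V / n - t * np.eye(V.shape[1])
    g = z.T @ d1_loss(y, y_star) / n
    return g, H


rng = np.random.default_rng(0)
d, alpha, q = 60, 4.0, 0.2                           # illustrative values only
n = int(alpha * d)
y, y_star = sample_labels(n, q, rng)
z = rng.standard_normal((n, d - 2))
g, H = gradient_and_hessian(y, y_star, z, q)

# Conditioning on g(U) = 0: the replacement of z_i described in the text.
c = d1_loss(y, y_star)
z_cond = z - np.outer(c, c @ z) / (c @ c)
g_cond, H_cond = gradient_and_hessian(y, y_star, z_cond, q)
rank_of_difference = int(np.linalg.matrix_rank(H_cond - H, tol=1e-8))
```

After the replacement, `g_cond` vanishes up to rounding, and `H_cond - H` has rank at most 2 regardless of `d`, which is the sense in which the conditioning only perturbs the Hessian by a low-rank term.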
Following [MBB20], we assume that the large deviations of the spectral distribution of $H'(U)$ have rate $d^2$, which is the usual scaling for this kind of random matrices [BG97; Gui22]. Using this result one obtains:

$$
\lim_{d \to \infty} \frac{1}{d} \log \mathbb{E}_{\mathbf{z}_i} \big| \det H'(U) \big| = \lim_{d \to \infty} \frac{1}{d}\, \mathbb{E}_{\mathbf{z}_i} \big[ \log \big| \det H'(U) \big| \big]. \tag{13}
$$

The right-hand side can be expressed in terms of the spectral density of $H'(U)$, which coincides with the one of $H(U)$ (since these two matrices only differ by a low-rank term). In consequence, one obtains the final simple expression:

$$
\kappa_\alpha(\nu) := \lim_{d \to \infty} \frac{1}{d}\, \mathbb{E}_{\mathbf{z}_i} \big[ \log \big| \det H'(U) \big| \big] = \int \mu_\alpha[\nu](\mathrm{d}w)\, \log |w - t(\nu)|, \tag{14}
$$

where $t(\nu) := \int \nu(\mathrm{d}y, \mathrm{d}y^\star)\, y\, \partial_1 \ell(y, y^\star)$ and, moreover, $\mu_\alpha[\nu]$ is the asymptotic spectral distribution of $\mathbf{z} D \mathbf{z}^\top / n$, with $\mathbf{z} \in \mathbb{R}^{d \times n}$ with i.i.d. $\mathcal{N}(0, 1)$ elements, and $D = \mathrm{Diag}(\{\partial_1^2 \ell(u_i)\}_{i=1}^n)$, for $\{u_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} \nu$. Eq. (14) is valid for any $U$ whose empirical distribution $(1/n) \sum_{i=1}^n \delta_{u_i}$ converges, as $n \to \infty$, to the probability measure $\nu$.

Coming back to eq. (11), since both $\omega_d(q)$ and $\mathbb{E}_{\mathbf{z}_i}[|\det H(U)| \,|\, g(U) = 0]$ scale exponentially with $d$, eq. (11) can be evaluated using Laplace's method on $q$, and Sanov's large deviation principle and Varadhan's lemma¹ on $\nu := (1/n) \sum_{i=1}^n \delta_{u_i}$ [DZ09]. Combining it with the different results mentioned above, this allows one to obtain the result for the complexity of critical points which was derived in [MBB20]¹:

$$
\Sigma_{\rm tot.}(Q, B) = \sup_{q \in Q} \sup_{e \in B} \left\{ \frac{1 + \log \alpha}{2} + \frac{1}{2} \log(1 - q^2) + \sup_{\nu \in \mathcal{M}_\ell(q, e)} \left[ -\frac{1}{2} \log \int_{\mathbb{R}^2} \nu(\mathrm{d}u)\, A(u) + \kappa_\alpha(\nu) - \alpha H(\nu | \mu_q) \right] \right\}, \tag{15}
$$

where $\mathcal{M}_\ell(q, e)$ is the set of probability measures on $\mathbb{R}^2$ such that $\int \nu(\mathrm{d}u)\, \ell(u) = e$ and $\int \nu(\mathrm{d}u)\, c_q(u) = 0$, with $u := (y, y^\star)$. Secondly, $H(\nu | \mu) := \int \log(\mathrm{d}\nu / \mathrm{d}\mu)\, \mathrm{d}\nu$ is the relative entropy, or Kullback-Leibler divergence, which appears from the use of Sanov's large deviation principle. Finally, recall that $\kappa_\alpha(\nu)$ is given in eq. (14), and we defined the two functions:

$$
\begin{cases}
A(y, y^\star) := (\partial_1 \ell[y, y^\star])^2, \\[1mm]
c_q(y, y^\star) := \dfrac{y^\star - q y}{\sqrt{1 - q^2}} \cdot \partial_1 \ell[y, y^\star].
\end{cases} \tag{16}
$$

¹ (On the use of Sanov's principle and Varadhan's lemma.) For a rigorous proof, one would also need to control the constraint involved by $\delta[C(U)]$ at a finer level than what is sketched in [MBB20]. At the heuristic level of our presentation, one can replace $\delta[C(U)]$ by $(2\varepsilon)^{-1} \mathbf{1}\{|C(U)| \leq \varepsilon\}$ for a small $\varepsilon > 0$, which one takes to $0$ after $d \to \infty$. This issue is resolved in the more general results of [AMS25]: we nevertheless present here the derivation of [MBB20], as it matches more closely our setup.

Generalization to $\Sigma_{\rm sub.}$ and $\widetilde{\Sigma}_0$ – Let us now detail how to generalize the computations above to $\Sigma_{\rm sub.}$ and $\widetilde{\Sigma}_0$.
• The computation for $\Sigma_{\rm sub.}$ is similar to the above, with an important addition: when computing the complexity of critical points with index at most $k = \varepsilon d$, one imposes the constraint $\mathbf{1}\{\mathrm{i}[\mathrm{Hess}\, \hat{R}(\boldsymbol{\theta})] \leq k\}$ in eq. (8). This results in a constraint $\mathbf{1}\{\mathrm{i}[H(U)] \leq k\}$ in the expectation in eq. (11). We give here a heuristic account for how this additional constraint influences the computation. Recalling that the law of $(g(U), H(U))$ only depends on the empirical law $\nu$ of $(u_i)_{i=1}^n$, we separate two cases:
  – If $\mathrm{supp}\, \mu_\alpha[\nu] \subseteq [t(\nu), \infty)$, then we simply upper bound:

$$
\mathbb{E}_{\{\mathbf{z}_i\}} \big[ |\det H(U)|\, \mathbf{1}\{\mathrm{i}[H(U)] \leq k\} \,\big|\, g(U) = 0 \big] \leq \mathbb{E}_{\{\mathbf{z}_i\}} \big[ |\det H(U)| \,\big|\, g(U) = 0 \big]. \tag{17}
$$

  – If $\mathrm{supp}\, \mu_\alpha[\nu] \not\subseteq [t(\nu), \infty)$, then the asymptotic spectral law of $H(U)$ puts positive mass on $(-\infty, 0)$.
F or sufficiently small ε > 0 , the even t 1 { i[ H ( U )] ≤ εd } has then probabilit y exp {− Θ( d 2 ) } since it requires mo ving the whole spectral density , while the large deviations of the spectral densit y has rate d 2 . The conclusion is that the computation of the complexity Σ sub . ( Q, B ) yields an upp er b ound whic h is again eq. ( 15 ), with the additional constrain t supp µ α [ ν ] ⊆ [ t ( ν ) , ∞ ) in the suprem um o ver ν . This is the result obtained in [ AMS25 ], where it is used as an upp er b ound on the complexit y of local minima. Finally , we notice that eq. ( 17 ) should b e asymptotically tight for an y k = εd as d → ∞ . Indeed since the asymptotic spectral la w of H ( U ) is supp orted on [0 , ∞ ) in this case, the probabilit y that H ( U ) has a num b er of negativ e eigen v alues greater than εd has probability e − Θ( d 2 ) . In a nutshell, w e counted critical p oints whose Hessian’s asymptotic spectral densit y is non-negatively supp orted: since the asymptotic sp ectral density is insensitive to perturbations in the Hessian with rank sub-extensiv e in d , one effectiv ely coun ts all sub-extensiv e-index saddle p oin ts. 1 [ MBB20 ] uses ν to denote the asymptotic la w of  y i , [ y ⋆ i − q y i ] / p 1 − q 2  n i =1 . Here w e use ν to denote the asymptotic law of ( y i , y ⋆ i ) n i =1 , which induces a slight c hange in notations. 10 • Regarding e Σ 0 , the condition ( θ ⋆ ) ⊥ Hess ˆ R ( θ ) θ ⋆ ≥ 0 can b e written as a function of the empirical la w ν as well, as it reads: 1 n n X i =1 ( y ⋆ i − q y i ) 2 1 − q 2 ∂ 2 1 ℓ ( y i , y ⋆ i ) − 1 n n X i =1 y i ∂ 1 ℓ ( y i , y ⋆ i ) = Z ν (d y , d y ⋆ ) " ∂ 2 1 ℓ ( y , y ⋆ ) ( y ⋆ − q y ) 2 1 − q 2 − y ∂ 1 ℓ ( y , y ⋆ ) # ≥ 0 . With resp ect to Σ sub . , this simply accoun ts for an additional linear constrain t on ν in the v ariational principle. 
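The two constraints discussed above (the support condition $\mathrm{supp}\,\mu_\alpha[\nu] \subseteq [t(\nu), \infty)$ and the linear condition along the signal direction) can be probed on a finite sample of labels. The sketch below is illustrative only and rests on several assumptions that are ours: the loss is taken to be $\ell_a(y, y^\star) = (y^2 - [y^\star]^2)^2/(a + [y^\star]^2)$, which is our reading of eq. (3); the labels are drawn as standard Gaussians with correlation $q$, i.e. as at a random point on the sphere rather than at a critical point; and the smallest eigenvalue of a finite matrix is used as a crude proxy for the left edge of $\mu_\alpha[\nu]$. All function names are hypothetical.

```python
import numpy as np

def d1_loss(y, ys, a=0.01):
    """First derivative in y of the assumed loss (y^2 - ys^2)^2 / (a + ys^2)."""
    return 4.0 * y * (y**2 - ys**2) / (a + ys**2)

def d2_loss(y, ys, a=0.01):
    """Second derivative in y of the assumed loss."""
    return (12.0 * y**2 - 4.0 * ys**2) / (a + ys**2)

def constraint_summary(y, ys, q, d, rng, a=0.01):
    """Finite-size estimates of t(nu), kappa_alpha(nu) and both constraints."""
    n = y.size
    F = d2_loss(y, ys, a)
    t = np.mean(y * d1_loss(y, ys, a))                   # t(nu)
    K = np.mean(F * (ys - q * y) ** 2 / (1 - q**2)) - t  # integral of K_q against nu
    z = rng.standard_normal((d, n))
    eig = np.linalg.eigvalsh((z * F) @ z.T / n)          # spectrum of z D z^T / n
    return {
        "t": t,
        "K_mean": K,
        "lambda_min": eig[0],
        "support_ok": bool(eig[0] >= t),            # proxy for supp mu ⊆ [t, ∞)
        "signal_direction_ok": bool(K >= 0),        # extra constraint for minima
        "kappa": np.mean(np.log(np.abs(eig - t))),  # estimate of eq. (14)
    }

rng = np.random.default_rng(0)
q, d, n = 0.0, 100, 400
y = rng.standard_normal(n)
ys = q * y + np.sqrt(1 - q**2) * rng.standard_normal(n)  # correlation q with y
summary = constraint_summary(y, ys, q, d, rng)
```

In the variational formulas these quantities are functionals of the optimizing measure $\nu$, not of a random sample; the snippet only shows how each constraint is evaluated once a label law is given.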
Final results – The arguments above form the sketch of the derivation of asymptotic formulas, which we present in detail in Section 2.3. In particular, notice that the formulas obtained are of the form $\Sigma(Q, B) = \sup_{q\in Q} \sup_{e\in B} \Sigma(q, e)$. In what follows, we therefore present asymptotic formulas for the functions $\tilde{\Sigma}_0(q, e)$, $\Sigma_{\rm sub.}(q, e)$ and $\Sigma_{\rm tot.}(q, e)$.

2.3 Main theoretical results from the Kac-Rice approach

2.3.1 Asymptotic formulas involving a supremum over probability measures

We state first the asymptotic formulas (as $n, d \to \infty$) of the complexities defined above, obtained through the derivation sketched in Section 2.2, and which are natural transpositions and extensions of the results of [MBB20; AMS25]. We state them for the general loss function of eq. (1):
$$\hat{R}(\theta) = \frac{1}{n}\sum_{i=1}^n \ell(x_i\cdot\theta,\, x_i\cdot\theta^\star).$$

The total number of critical points – For completeness, let us recall eq. (15):
$$\Sigma_{\rm tot.}(q, e) = \frac{1+\log\alpha}{2} + \frac{1}{2}\log(1-q^2) + \sup_{\nu\in\mathcal{M}_\ell(q,e)} \left[ -\frac{1}{2}\log\int_{\mathbb{R}^2} \nu(\mathrm{d}u)\, A(u) + \kappa_\alpha(\nu) - \alpha H(\nu|\mu_q) \right]. \tag{18}$$
Recall that $\mathcal{M}_\ell(q,e)$ is the set of probability measures on $\mathbb{R}^2$ such that $\int \nu(\mathrm{d}u)\, \ell(u) = e$ and $\int \nu(\mathrm{d}u)\, c_q(u) = 0$, with $u := (y, y^\star)$. Moreover, we defined $\mu_q$ in eq. (10), and $A, c_q$ in eq. (16). Finally, we have
$$\begin{cases} \kappa_\alpha(\nu) := \displaystyle\int \mu_\alpha[\nu](\mathrm{d}w)\, \log|w - t(\nu)|, \\[2mm] t(\nu) := \displaystyle\int \nu(\mathrm{d}y, \mathrm{d}y^\star)\, y\, \partial_1\ell(y, y^\star). \end{cases} \tag{19}$$
Recall also that $\mu_\alpha[\nu]$ is the asymptotic spectral distribution of $z D z^\top/n$, with $z \in \mathbb{R}^{d\times n}$ having i.i.d. $\mathcal{N}(0,1)$ elements, and $D = \mathrm{Diag}(\{\partial_1^2\ell(u_i)\}_{i=1}^n)$, for $\{u_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} \nu$. Importantly, $\mu_\alpha[\nu]$ can be characterized straightforwardly via its Stieltjes transform [MP67; SB95], see Appendix A.1.

Local minima – Equipped with the definitions above, we can generalize the formula obtained for $\Sigma_{\rm tot.}(q, e)$ to the other types of complexities we defined in Section 2.1.1. As discussed above, we do not provide a sharp result for the annealed complexity of local minima, but rather an upper bound $\tilde{\Sigma}_0(q, e)$. It reads:
$$\tilde{\Sigma}_0(q, e) = \frac{1+\log\alpha}{2} + \frac{1}{2}\log(1-q^2) + \sup_{\nu\in\mathcal{M}^{(0)}_\ell(q,e)} \left[ -\frac{1}{2}\log\int_{\mathbb{R}^2} \nu(\mathrm{d}u)\, A(u) + \kappa_\alpha(\nu) - \alpha H(\nu|\mu_q) \right]. \tag{20}$$
Here:
$$\mathcal{M}^{(0)}_\ell(q, e) := \left\{ \nu \in \mathcal{M}_\ell(q, e) \,:\, \mathrm{supp}(\mu_\alpha[\nu]) \subseteq [t(\nu), +\infty) \ \text{ and } \ \int \nu(\mathrm{d}y, \mathrm{d}y^\star) \left[ \partial_1^2\ell(y, y^\star)\, \frac{(y^\star - q y)^2}{1-q^2} - y\, \partial_1\ell(y, y^\star) \right] \ge 0 \right\}. \tag{21}$$

Saddles of sub-extensive index – For saddles of sub-extensive index on the other hand, we do reach an exact asymptotic formula for their annealed complexity. It is almost identical to eq. (20):
$$\Sigma_{\rm sub.}(q, e) = \frac{1+\log\alpha}{2} + \frac{1}{2}\log(1-q^2) + \sup_{\nu\in\mathcal{M}^{\rm sub.}_\ell(q,e)} \left[ -\frac{1}{2}\log\int_{\mathbb{R}^2} \nu(\mathrm{d}u)\, A(u) + \kappa_\alpha(\nu) - \alpha H(\nu|\mu_q) \right]. \tag{22}$$
The only difference with respect to eq. (20) is the lack of the last constraint on $\nu$ in the variational principle, as:
$$\mathcal{M}^{\rm sub.}_\ell(q, e) := \left\{ \nu \in \mathcal{M}_\ell(q, e) \,:\, \mathrm{supp}(\mu_\alpha[\nu]) \subseteq [t(\nu), +\infty) \right\}.$$

Connection with previous works – Eq. (18) is exactly the result of [MBB20], up to a reparametrization. On the other hand, the main result of [AMS25] reduces to the formula we present for sub-extensive-index critical points $\Sigma_{\rm sub.}(q, e)$, see eq. (22). The formula for $\tilde{\Sigma}_0(q, e)$ on the other hand is new: while it is a simple variation on the one for $\Sigma_{\rm sub.}(q, e)$, we will see that the additional constraint proves crucial in order to establish topological trivialization from our results.

Characterization of critical points – Crucially, the derivation in Section 2.2 shows that the variational formulas above provide access not only to the complexity, but also to other relevant statistical properties of the critical points.
In particular, the measure $\nu$ achieving the supremum corresponds to the prediction for the joint law of $(y, y^\star)$ for local minima, sub-extensive-index saddles, or generic critical points as predicted by the annealed Kac–Rice formalism. Moreover, the asymptotic spectral density of the Hessian at these points can also be characterized: it is given by $\mu_\alpha[\nu]$, up to a shift $t(\nu)$ induced by the spherical constraint. We emphasize that these predictions should be regarded as approximations, since we consider only the annealed complexity (i.e. its first moment), and our computation for local minima is an upper bound.

2.3.2 Simplifying the problem: asymptotic formulas involving only scalar parameters

The variational formulation over probability measures can be greatly simplified and recast into a variational principle involving only scalar parameters. The detailed derivation of these results is deferred to Appendix A.2. Here we just present the final outcome.

Local minima – We start with local minima, i.e. eq. (20). To slightly lighten the statement of the formula, let us first define three additional functions (recall that $u = (y, y^\star)$):
$$\begin{cases} F(u) := \partial_1^2\ell(y, y^\star), \\[1mm] t(u) := y\, \partial_1\ell(y, y^\star), \\[1mm] K_q(u) := \partial_1^2\ell(y, y^\star)\, \dfrac{(y^\star - q y)^2}{1-q^2} - y\, \partial_1\ell(y, y^\star). \end{cases} \tag{23}$$
Recall also the definition of $\mu_q$ in eq. (10). The formula for $\tilde{\Sigma}_0(q, e)$ can be recast into the following variational principle:
$$\begin{aligned} \tilde{\Sigma}_0(q, e) = {} & \frac{-1 + (1-2\alpha)\log\alpha}{2} + \frac{1}{2}\log(1-q^2) \\ & + \sup_{\substack{g>0 \\ A>0}}\ \inf_{\substack{\lambda_c, \lambda_e, \lambda_t, \lambda_A \in \mathbb{R} \\ \lambda_h, \lambda_\star \ge 0}} \Bigg\{ -\frac{1}{2}\log A + \alpha(\lambda_A A + \lambda_e e) - \log g - \frac{\lambda_t}{g} + \lambda_h \\ & + \alpha \log \int_{B_g} \mu_q(\mathrm{d}u)\, \exp\bigg[ -\lambda_c c_q(u) - \lambda_A A(u) - \lambda_e \ell(u) + \log(\alpha + F(u) g) - \frac{g + \lambda_t}{\alpha}\, t(u) \\ & \qquad + \frac{\lambda_t F(u)}{\alpha + F(u) g} - \lambda_h \left( \frac{F(u) g}{\alpha + F(u) g} \right)^2 + \frac{\lambda_\star}{\alpha} K_q(u) \bigg] \Bigg\}. \end{aligned} \tag{24}$$
Here $B_g := \{u \in \mathbb{R}^2 : \alpha + g F(u) > 0\}$. Some important remarks are in order:

1. While eq. (24) appears involved, the different Lagrange multipliers and parameters appearing in it have natural interpretations, which we discuss in Appendix A.2.

2. Notice that the integral in the last term might be infinite for some values of the parameters and Lagrange multipliers. Moreover, as we detail in the derivation, if $B_g \neq \mathbb{R}^2$ (i.e. if there is $u$ such that $\alpha + g F(u) = 0$) then we must have either $\lambda_h > 0$, or $\lambda_h = 0$ and $\lambda_t \ge 0$: this ensures in particular that the integrand always goes to $0$ at the boundary of $B_g$.

3. The inner minimization over the set of Lagrange multipliers in eq. (24) is a convex problem. While we cannot guarantee that the outer supremum over $(g, A)$ admits a unique global maximum, in our numerical applications (see Section 2.5) we never encountered several local maxima of this functional, even though we varied the initial conditions.

As we discussed above, this formula also provides an estimate for the asymptotic empirical law $\nu_0$ of $(x_i\cdot\theta,\, x_i\cdot\theta^\star)_{i=1}^n$, for $\theta$ uniformly sampled from the set of critical points satisfying the properties in eq. (6). This prediction reads:
$$\nu_0(\mathrm{d}u) = \frac{\mu_q(\mathrm{d}u)\, \exp\left[ -\lambda_c c_q(u) - \lambda_A A(u) - \lambda_e \ell(u) + \log(\alpha + F(u) g) - \frac{g + \lambda_t}{\alpha} t(u) + \frac{\lambda_t F(u)}{\alpha + F(u) g} - \lambda_h \left( \frac{F(u) g}{\alpha + F(u) g} \right)^2 + \frac{\lambda_\star}{\alpha} K_q(u) \right]}{\displaystyle\int_{B_g} \mu_q(\mathrm{d}u)\, \exp\left[ -\lambda_c c_q(u) - \lambda_A A(u) - \lambda_e \ell(u) + \log(\alpha + F(u) g) - \frac{g + \lambda_t}{\alpha} t(u) + \frac{\lambda_t F(u)}{\alpha + F(u) g} - \lambda_h \left( \frac{F(u) g}{\alpha + F(u) g} \right)^2 + \frac{\lambda_\star}{\alpha} K_q(u) \right]}. \tag{25}$$
Here, all parameters and Lagrange multipliers are taken at their respective optimum in eq. (24). Moreover, the asymptotic spectral density of the Hessian at a typical critical point satisfying eq. (6) is given by $\rho_0(\mathrm{d}w)$, which satisfies for any function $f$:
$$\int \rho_0(\mathrm{d}w)\, f(w) = \int \mu_\alpha[\nu_0](\mathrm{d}w)\, f(w - t(\nu_0)). \tag{26}$$

Saddles of sub-extensive index – Given the similarity of eqs.
(22) and (20), we expect and find that $\Sigma_{\rm sub.}(q, e)$ satisfies essentially eq. (24), up to removing the Lagrange multiplier $\lambda_\star$:
$$\begin{aligned} \Sigma_{\rm sub.}(q, e) = {} & \frac{-1 + (1-2\alpha)\log\alpha}{2} + \frac{1}{2}\log(1-q^2) \\ & + \sup_{\substack{g>0 \\ A>0}}\ \inf_{\substack{\lambda_c, \lambda_e, \lambda_t, \lambda_A \in \mathbb{R} \\ \lambda_h \ge 0}} \Bigg\{ -\frac{1}{2}\log A + \alpha(\lambda_A A + \lambda_e e) - \log g - \frac{\lambda_t}{g} + \lambda_h \\ & + \alpha \log \int_{B_g} \mu_q(\mathrm{d}u)\, \exp\bigg[ -\lambda_c c_q(u) - \lambda_A A(u) - \lambda_e \ell(u) + \log(\alpha + F(u) g) - \frac{g + \lambda_t}{\alpha}\, t(u) \\ & \qquad + \frac{\lambda_t F(u)}{\alpha + F(u) g} - \lambda_h \left( \frac{F(u) g}{\alpha + F(u) g} \right)^2 \bigg] \Bigg\}. \end{aligned} \tag{27}$$
One can directly transpose the predictions of eq. (25) to the law $\nu_{\rm sub.}$ of the labels $(y_i, y_i^\star)_{i=1}^n$ at a typical sub-extensive-index saddle by setting $\lambda_\star = 0$, and the one of eq. (26) to the spectral density of the Hessian $\rho_{\rm sub.}$.

Critical points – We finally consider generic critical points. In this setting, we follow the heuristic approach suggested in [MBB20]. We obtain an asymptotic formula that is similar to the ones above, with a few key differences:
$$\begin{aligned} \Sigma_{\rm tot.}(q, e) = {} & \frac{-1 + (1-2\alpha)\log\alpha}{2} + \frac{1}{2}\log(1-q^2) + \underset{\substack{A>0,\ g\in\mathbb{C}_+ \\ \lambda_A, \lambda_e, \lambda_c \in \mathbb{R}}}{\mathrm{extr}} \Bigg[ -\frac{1}{2}\log A - \log|g| + \varepsilon g_i + \alpha(\lambda_A A + \lambda_e e) \\ & + \alpha \log \int_{\mathbb{R}^2} \mu_q(\mathrm{d}u)\, \exp\left( -\lambda_c c_q(u) - \lambda_A A(u) - \lambda_e \ell(u) + \log|\alpha + F(u) g| - \frac{g_r}{\alpha}\, t(u) \right) \Bigg]. \end{aligned} \tag{28}$$
Here $\mathbb{C}_+ := \{z \in \mathbb{C} : \mathrm{Im}(z) > 0\}$, we denoted $g = g_r + i g_i$, and we fixed a given small $\varepsilon \ll 1$ (eq. (28) is understood in the limit $\varepsilon \downarrow 0$). An important difference with respect to eqs. (24) and (27) is that the extremum is not given as a sup-inf. Rather, one should find the saddle point of the functional that maximizes the complexity. A second difference is that the integral over $u$ is now taken over the whole of $\mathbb{R}^2$, as the integrand is well-defined for all $u \in \mathbb{R}^2$. Obtaining the predictions for $(\nu_{\rm tot.}, \rho_{\rm tot.})$ from eq. (28) is again immediate: e.g. we have
$$\nu_{\rm tot.}(\mathrm{d}u) = \frac{\mu_q(\mathrm{d}u)\, \exp\left( -\lambda_c c_q(u) - \lambda_A A(u) - \lambda_e \ell(u) + \log|\alpha + F(u) g| - \frac{g_r}{\alpha} t(u) \right)}{\displaystyle\int_{\mathbb{R}^2} \mu_q(\mathrm{d}u)\, \exp\left( -\lambda_c c_q(u) - \lambda_A A(u) - \lambda_e \ell(u) + \log|\alpha + F(u) g| - \frac{g_r}{\alpha} t(u) \right)}. \tag{29}$$

Numerical solutions – The relatively simple form of eqs. (24), (27) and (28) suggests natural algorithms to evaluate these complexity functions. In Appendix A.3 we detail the practical implementations we used for their evaluation: we use in particular the remark above on the convexity of the inner minimization problem. A complete implementation in Python using adaptive quadrature routines accelerated by Numba's just-in-time (JIT) compilation [LPS15] will be provided in a Github repository.

Final remarks – In what follows, we will sometimes drop the dependency of the complexity functions on $(q, e)$, and write e.g. $\tilde{\Sigma}_0(q)$ or $\Sigma_{\rm tot.}$. In general, the value of the overlap $q$ will always be fixed and specified. On the other hand, if the value of the energy/loss $e$ is not given, we are considering the maximal complexity across possible values of $e$: equivalently, this is achieved by fixing the Lagrange multiplier $\lambda_e = 0$ in the variational formulas written above.

2.4 The BBP-Kac-Rice method, and the instability of local minima

The Kac–Rice analysis provides access to the asymptotic eigenvalue density of the Hessian at the critical points under consideration. However, to relate the landscape properties to training dynamics, such as gradient flow, it is crucial to determine whether critical points (particularly local minima) may develop a descending direction, i.e. an instability, aligned with the signal as $\alpha$ varies. Such a descending direction would be a low-rank change in the Hessian, which is not captured by the Kac–Rice method. Instead, we address this problem with random matrix theory, determining the presence of such an outlier via the so-called "BBP" transition [BBP05; BN11].
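For readers unfamiliar with the BBP mechanism, the following toy sketch illustrates it in the simplest textbook setting, which is not the Hessian studied in this paper: a Wigner-type matrix with bulk spectrum on $[-2, 2]$, perturbed by a rank-one term of strength `theta`. In that standard model an outlier detaches from the bulk only for `theta > 1`, and then sits near `theta + 1/theta`. The function name and parameter choices are ours; in the paper the relevant outlier is a negative eigenvalue below the bulk, whereas this toy example places it above.

```python
import numpy as np

def top_eigenvalue_spiked_goe(n, theta, rng):
    """Largest eigenvalue of W + theta * v v^T, with W a GOE-like matrix.

    W is symmetric with off-diagonal variance 1/n, so its bulk spectrum
    converges to the semicircle law on [-2, 2]; v is a random unit vector.
    """
    g = rng.standard_normal((n, n))
    w = (g + g.T) / np.sqrt(2.0 * n)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    return np.linalg.eigvalsh(w + theta * np.outer(v, v))[-1]

rng = np.random.default_rng(0)
# Below the threshold the top eigenvalue sticks to the bulk edge (about 2).
top_weak = top_eigenvalue_spiked_goe(600, 0.5, rng)
# Above the threshold an outlier appears near theta + 1/theta = 2.5.
top_strong = top_eigenvalue_spiked_goe(600, 2.0, rng)
```

The analysis in Appendix B plays the same role for the Hessian of eq. (30): it determines when the low-rank part associated with the signal direction is strong enough to expel an eigenvalue from the bulk.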
[Figure 1 plot. Horizontal axis: $w$; vertical axis: $\rho_0(w)$; inset zooming on negative $w$. Legend: Hessian density, BBP outlier (typical), BBP outlier (lowest), BBP outlier (highest).]

Figure 1: In the phase retrieval problem (see eq. (3)), with $a = 0.01$, $q = 0.0$, and $\alpha = 6.5$, we show the annealed Kac-Rice prediction for the Hessian density of typical-energy minima, see eq. (26). The predicted complexity is $\tilde{\Sigma}_0(q) \simeq 7\cdot 10^{-3} > 0$ (see also Fig. 5 below). The green arrow represents the position of the negative "BBP" outlier in the spectrum predicted by our theory for this Hessian density, see eq. (67). The red-shaded area delimits the prediction for the location of the outlier for all minima with positive complexity, from the highest-energy to the lowest-energy ones. The histograms correspond to numerical results obtained using finite-size gradient descent simulations, see Section 3.1 for more details.

Concretely, the spherical Hessian reads here (see Section 2.2):
$$\mathrm{Hess} = \frac{1}{n}\sum_{i=1}^n \partial_1^2\ell(y_i, y_i^\star)\, (P_\theta^\perp x_i)(P_\theta^\perp x_i)^\top - \left( \frac{1}{n}\sum_{i=1}^n y_i\, \partial_1\ell(y_i, y_i^\star) \right) P_\theta^\perp, \tag{30}$$
where $P_\theta^\perp$ is the orthogonal projection on $\{\theta\}^\perp$. Crucially, one can develop an analytical theory for whether the Hessian given by eq. (30) develops an informative outlier, for any estimator $\nu$ of the asymptotic joint law of the labels $(y_i, y_i^\star)$. This extends a previous analysis of [BBC25], which only considered the Hessian at the equator ($q = \theta\cdot\theta^\star = 0$). The derivation and main equations are provided in Appendix B.

The BBP-KR method – The above point is the basis of the "BBP-Kac-Rice" prediction: namely, the BBP instability for the Hessian in eq. (30) can be derived from the label law $\nu$ predicted by the Kac–Rice analysis, e.g. eq. (25).
Strikingly, as we will see in concrete examples, this analysis predicts that a BBP instability of the Hessian can occur while the Kac-Rice method still predicts a positive complexity, and hence the presence of exponentially many local minima. We illustrate this phenomenon in Fig. 1 for phase retrieval, where we plot the eigenvalue density obtained from the Kac–Rice bound on the complexity of minima for $a = 0.01$, $q = 0.0$, and $\alpha = 6.5$. While the bound suggests here a positive complexity of minima, the BBP–KR analysis instead reveals the presence of a negative outlier. The associated eigenvector is correlated with the signal, indicating that the putative minima are in fact unstable towards the signal.¹

¹ The fact that $\theta^\star$ is not a descending direction is not in contradiction with the presence of a negative BBP eigenvalue, since the corresponding eigenvector has only a finite overlap with the signal $\theta^\star$.

[Figure 2 plots. Left: $\tilde{\Sigma}_0(q=0, e)$ against $e$ for $\alpha = 6.0, 7.5, 8.0$. Middle: $\rho_0(\lambda)$ against $\lambda$ for the same values of $\alpha$. Right: heat maps of $\nu_0(y, y^\star)$ in the $(y, y^\star)$ plane.]

Figure 2: For $a = 0.01$, $q = 0$, and three values $\alpha = 6.0 < \alpha_{\rm triv.}$, $\alpha = 7.5 \simeq \alpha_{\rm triv.}$ and $\alpha = 8.0 > \alpha_{\rm triv.}$, we show: (left) the complexity $\tilde{\Sigma}_0(q = 0, e)$ for different values of $e$ around the maximal complexity, (middle) the density of the Hessian at local minima of typical energy (at $e^\star := \arg\max_e \tilde{\Sigma}_0(q = 0, e)$), and (right) the corresponding law $\nu(y, y^\star)$. The results are obtained by solving the optimization problem of eq. (24).
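The spherical Hessian of eq. (30) is straightforward to assemble numerically at a given point $\theta$, which is how spectra such as the histograms of Fig. 1 can be probed in finite dimension. The sketch below is a minimal illustration and not the authors' implementation: it assumes the phase retrieval loss $\ell_a(y, y^\star) = (y^2 - [y^\star]^2)^2/(a + [y^\star]^2)$ (our reading of eq. (3)), unit-norm `theta` and `theta_star`, and hypothetical function names.

```python
import numpy as np

def spherical_hessian(theta, theta_star, x, a=0.01):
    """Spherical Hessian of eq. (30) for the assumed phase retrieval loss.

    theta, theta_star: unit-norm vectors of shape (d,).
    x: data matrix of shape (n, d) with rows x_i.
    """
    n, d = x.shape
    y, ys = x @ theta, x @ theta_star
    d1 = 4.0 * y * (y**2 - ys**2) / (a + ys**2)       # first derivative in y
    d2 = (12.0 * y**2 - 4.0 * ys**2) / (a + ys**2)    # second derivative in y
    p = np.eye(d) - np.outer(theta, theta)            # projector on {theta}^perp
    xp = x @ p                                        # rows are P x_i
    return (xp.T * d2) @ xp / n - np.mean(y * d1) * p

rng = np.random.default_rng(1)
n, d = 300, 40
x = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)

# Sanity check at the global minimum theta = theta_star: the first derivative
# vanishes and the second derivative is non-negative, so the Hessian is
# positive semi-definite there.
h_star = spherical_hessian(theta_star, theta_star, x)
eigs_star = np.linalg.eigvalsh(h_star)
```

At a point reached by gradient descent, the same function returns the matrix whose bulk is compared with $\rho_0$ and whose smallest eigenvalue is compared with the predicted BBP outlier.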
2.5 Applications: exploring the landscape of phase retrieval

We now present the applications of our asymptotic formulas to the problem of phase retrieval, with the loss of eq. (3):
$$\ell_a(y, y^\star) := \frac{\left(y^2 - [y^\star]^2\right)^2}{a + [y^\star]^2}.$$
From now on, we fix $a = 0.01$. Nevertheless, our results can be straightforwardly extended to any value of $a > 0$: as examples, we extend some of the forthcoming results to $a \in \{0.1, 1.0\}$ in Appendix C. An interesting property of this loss is that the denominator ensures that the second derivative $\partial_1^2\ell$ is bounded from below: in turn this ensures that the Hessian's asymptotic spectral density is always bounded from below. In the following we present explicit results obtained from our variational formulas applied to the phase retrieval problem. As we aim to keep a concise presentation in the main text, we defer some additional results and a deeper exploration to Appendix C.

2.5.1 Complexity of local minima

We first consider in Fig. 2 our upper bound on the complexity of local minima, i.e. the function $\tilde{\Sigma}_0(q, e)$. We plot this complexity as a function of $e$, for $q = 0$ and different values of $\alpha = n/d$, and show corresponding predictions for the label distribution $\nu_0(y, y^\star)$ and the Hessian spectral density $\rho_0(w)$. Several important comments can be made on these results:

• For small $\alpha$, there is a band of energy $[e_{\min}, e_{\max}]$ where local minima are located¹, i.e. where $\tilde{\Sigma}_0(q, e) > 0$. The most numerous of them (with maximal complexity) have an energy $e^\star$.

• As $\alpha$ increases, the size of this band diminishes: it eventually disappears at $\alpha = \alpha_{\rm triv.} \simeq 7.5$.

• The typical Hessian of local minima varies relatively little close to this transition. In particular, it always remains marginally stable (the bulk of eigenvalues touches $0$). This is an interesting and remarkable property of the dominant minima. It is possible that stable ones exist but are less numerous (sub-dominant). We leave this issue for future work, where a more refined computation, targeting also sub-dominant minima, will be performed.

• We emphasize that a random point $\theta \in \mathbb{S}^{d-1}$, such as the one from which the gradient flow dynamics starts, would produce a standard Gaussian law for $\nu(y, y^\star)$: in contrast, the right plots of Fig. 2 show that, for $\theta$ a local minimum, the estimated $y_i = \theta\cdot x_i$ manage to align themselves very well with the ground-truth labels $y_i^\star = \theta^\star\cdot x_i$ despite $\theta$ not aligning itself at all with $\theta^\star$ (recall $q = 0$).

¹ Except the trivial global minimum $\theta = \theta^\star$, located exactly at $e = 0$.

[Figure 3 plots. Left: $\Sigma_{\rm tot.}(q, e)$ against $e$ for $\alpha = 4.0, 4.55, 5.0$. Middle: $\rho(\lambda)$ against $\lambda$ for the same values of $\alpha$. Right: heat maps of $\nu(y, y^\star)$ in the $(y, y^\star)$ plane.]

Figure 3: For $a = 0.01$, $q = 0.4$, and three values $\alpha = 4.0 < \alpha_{\rm triv.}$, $\alpha = 4.55 \simeq \alpha_{\rm triv.}$ and $\alpha = 5.0 > \alpha_{\rm triv.}$, we show: (left) the complexity $\Sigma_{\rm tot.}(q, e)$ for different values of $e$ around the maximal complexity, (middle) the density of the Hessian at critical points of typical energy (at $e^\star := \arg\max_e \Sigma_{\rm tot.}(q, e)$), and (right) the corresponding law $\nu(y, y^\star)$. The results are obtained by solving the optimization problem of eq. (28).

2.5.2 Complexity of critical points and saddles of sub-extensive index

In Fig. 3, we show a similar analysis to Fig. 2, but for generic critical points. We focus on a value $q = 0.4$ which shows a trivialization transition. Contrary to the case of minima, we do not find such a transition for $q = 0$ at any finite $\alpha$, a point that we will discuss further in Section 2.5.4. As shown in Fig.
3, the critical points exhibit features similar to those previously discussed for local minima. In particular, they concentrate within a well-defined energy band, and the law $\nu(y, y^\star)$ displays a clear alignment between the estimated labels and the ground-truth ones. The Hessian is no longer positive definite: its spectrum contains a small but nonzero fraction of negative eigenvalues, both below and above the trivialization threshold in $\alpha$.

Sub-extensive-index saddles – We do not discuss $\Sigma_{\rm sub.}$ here in order to keep the exposition concise. Its behavior closely parallels that of the critical points; in particular, there is no trivialization of sub-extensive-index saddles at $q = 0$ for any finite value of $\alpha$. This point will be illustrated and discussed in Section 2.5.4 (see Fig. 5).

[Figure 4 plots. Left: energy (loss value) against $\alpha$ for typical-energy, lowest-energy and highest-energy minima, with the negative-complexity region shaded; inset: $\tilde{\Sigma}_0$ against $\alpha$. Right: $\rho_0(w)$ against $w$ for $\alpha = 6.0$; inset: $\lambda_\star$ against $\alpha$.]

Figure 4: For $a = 0.01$, $q = 0.0$, we show (left) the energy of the typical, lowest, and highest-energy minima as a function of $\alpha$. The inset shows the value of the complexity $\tilde{\Sigma}_0$ (notice that highest and lowest-energy minima are always at zero complexity by definition). In the red region the complexity of local minima is negative. (Right) The spectral density $\rho_0(w)$ of the Hessian at these minima, for $\alpha = 6.0$. We show in dotted red the line $w = 0$. In the inset, we show the Lagrange multiplier $\lambda_\star$ as a function of $\alpha$. The results are obtained by solving the Kac-Rice formula of eq. (24).

2.5.3 Properties of typical-energy, low-energy, and high-energy minima

As we saw above in Fig. 2, for fixed $q \in [0, 1)$ and $\alpha < \alpha_{\rm triv.}(q)$ the annealed Kac-Rice formalism predicts the existence of a band of local minima. In Fig. 4 we show the width of this band for $q = 0$, as well as the Hessian density of highest and lowest-energy minima. We find that the width of this energy band diminishes as $\alpha$ increases. Around a value of $\alpha_\star \simeq 7.0$, one can observe a "kink" in the value of the typical energy: it corresponds to the activation of the last constraint in eq. (21), and is apparent from the onset of $\lambda_\star$ away from $0$ (see top-right inset, and the discussion in Appendix A). For $\alpha \lesssim 7.0$ local minima are more numerous than other saddles of sub-extensive index. Instead, for $\alpha \gtrsim 7.0$ the complexity of local minima becomes strictly smaller and then vanishes at $\alpha_{\rm triv.}$, whereas the complexity of saddles of finite index remains positive at any finite $\alpha > 0$. Nevertheless, we recall that these claims concern the annealed complexity (i.e. the first moment of the number of critical points): while a negative annealed complexity implies the absence of the corresponding critical points, a positive annealed complexity is not sufficient to establish their existence with high probability, and should rather be seen as an indication of this existence. Finally, we also show in Fig. 4 the Hessian density for the lowest-energy, typical-energy, and highest-energy minima. As we will see in Section 2.5.4, lowest-energy minima are the "most stable" under the BBP instability described in Section 2.4. This was also shown on a specific example in Fig. 1, and we give further numerical simulations in Appendix C.4.

2.5.4 Global phase diagrams

Finally, we show in Fig. 5 the complete topological phase diagram of the phase retrieval problem (with $a = 0.01$). In Fig. 5a we showcase the different thresholds for $q = 0$ as a function of $\alpha$. We see that the BBP-KR instability appears first for high-energy minima at $\alpha = \alpha^{\rm (high)}_{\rm BBP} \simeq 2.76$, before propagating to all local minima, and finally lowest-energy minima at $\alpha = \alpha^{\rm (low)}_{\rm BBP} \simeq 4.39$.
This is intuitive given that the low-energy minima are expected to be more stable than the highest-energy ones. At this value of $\alpha$, while our Kac-Rice formalism still predicts a positive annealed complexity of local minima, the BBP-KR analysis allows us to unveil that minima undergo a BBP instability towards the signal. The Kac-Rice formalism is unable to detect this transition because it does not focus on low-rank contributions to the Hessian. Only for a larger value $\alpha_{\rm triv.} \simeq 7.49$ does the Kac-Rice formalism predict a negative complexity. With some abuse of notation we still call this the "trivialization threshold", but emphasize that the true trivialization transition takes place before, when all minima become unstable due to the BBP instability. An example of this phenomenon was already illustrated in Fig. 1, which was made for $q = 0$ and $\alpha = 6.5 \in (\alpha_{\rm BBP}, \alpha_{\rm triv.})$.

[Figure 5a plot. The $\alpha$ axis at $q = 0$ is split into regions A, B, C, D by the thresholds $\alpha^{\rm (high)}_{\rm BBP} \simeq 2.76$, $\alpha^{\rm (typ)}_{\rm BBP} \simeq 3.81$, $\alpha^{\rm (low)}_{\rm BBP} \simeq 4.39$ and $\alpha_{\rm triv} \simeq 7.49$.]

(a) For $q = 0$, the BBP thresholds for highest, typical, and lowest-energy minima, and the trivialization of local minima. As $\alpha$ increases, the BBP instability propagates through the energy landscape: it first emerges in high-energy minima (A), then spreads to the majority of minima (B), and eventually encompasses even the lowest-energy states (C). Finally, in (D), the Kac-Rice upper bound implies the absence of any local minima.

[Figure 5b plot. Horizontal axis: $\alpha$; vertical axis: $q$. Legend: trivialization of all critical points; trivialization of sub-extensive-index saddles; trivialization of minima (Kac-Rice upper bound); BBP-KR instability (lowest-energy minima); BBP-KR instability (typical minima).]

(b) A complete phase diagram in the $(\alpha, q)$ plane, showing regions of trivialization of the landscape, where our Kac-Rice analysis implies the absence of critical points, of saddles of sub-extensive index, and of local minima. We also draw the BBP-instability regions for typical-energy minima, and for the lowest-energy minima: in the light green region, the annealed Kac-Rice formalism predicts that all minima have a BBP instability towards the signal, before our upper bound shows the absence of minima in the dark green area.

Figure 5: Phase diagram predicted by the Kac-Rice method, for $a = 0.01$.

In Fig. 5b we show the same thresholds in the $(\alpha, q)$ plane. We observe several trivialization regions, corresponding either to a BBP instability or to the disappearance of some types of critical points. Notably, there is a blue region where the landscape does not possess any critical point (besides the trivial minimum $\theta = \theta^\star$ at $q = 1$). Moreover, we emphasize that the orange region (the trivialization of finite-index saddles) never reaches $q = 0$ for any finite $\alpha > 0$: we therefore expect that the computations of [MBB20; AMS25], based on the characterization of $\Sigma_{\rm sub.}$, would not yield a finite trivialization threshold $\alpha_{\rm triv.}$ in this problem, while it is known that local minima disappear at a finite $\alpha$ [Cai+21b].

2.5.5 Final remarks

We conclude this analysis with a couple of remarks. First of all, while we discussed some illustrative results above, we present in Appendix C many other results and a more systematic exploration of the empirical risk landscape using the Kac-Rice formalism. We also discuss there a possible limitation of our computations: as we discussed in Section A.2, in general we cannot guarantee uniqueness of a solution to the scalar variational problems of eqs. (24), (27) and (28). In practice, we never found several distinct local maxima of the complexity formulas of local minima and sub-extensive-index saddles. However, we did exhibit in one limited regime the coexistence of two saddle points for the complexity formula of critical points (eq.
(28)), and it is associated with a first-order phase transition that we discuss in Appendix C.3. A more systematic study of such phenomena, and a proof of the uniqueness of the solutions to the complexity formulas for local minima and sub-extensive-index saddles, are interesting directions for future research.

3 Probing the theory: gradient descent dynamics

The purpose of this section is to compare the predictions of the annealed Kac–Rice analysis with the properties of minima obtained from numerical simulations of finite-dimensional gradient-descent (GD) dynamics. We first describe the numerical setup in Section 3.1. We then present a detailed statistical analysis of the GD minima, focusing on their energies, Hessian spectra, and second derivatives of the loss. All of these observables show a remarkably close agreement with the annealed theoretical predictions. Finally, in Section 3.3, we confront the theoretical phase diagram with the empirical outcomes of GD in phase retrieval across varying values of $q$ and $\alpha$, thereby directly connecting landscape structure to the observed success or failure of GD.

3.1 Settings of the experiments

Given a dataset $\{x_i\}_{i=1}^n$ with $x_i \sim \mathcal{N}(0, I_d)$ and a signal vector $\theta^\star \sim \mathcal{N}(0, I_d/d)$, we aim to minimize the empirical risk of eq. (1) starting from an initialization $\theta^{(0)}$, and using full-batch GD updates,
$$\theta^{(t+1)} = \theta^{(t)} - \eta\, \nabla\hat{R}\left(\theta^{(t)}\right), \tag{31}$$
for $t \in \{0, 1, \ldots, T-1\}$ with fixed learning rate $\eta$. Throughout this section, we consider the loss function defined in eq. (3) and fix the normalization parameter to $a = 0.01$. We focus on two kinds of initialization: (1) uncorrelated with the signal: $\theta \sim \mathcal{N}(0, I_d/d)$; (2) warm start: the component of $\theta$ along $\theta^\star$ is fixed to $q_0$, and all the other components (in orthogonal directions) are i.i.d. Gaussian variables with mean zero and variance $1/d$.
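The setup just described can be sketched in a few lines. The snippet below is a minimal illustration rather than the authors' code: it assumes the loss $\ell_a(y, y^\star) = (y^2 - [y^\star]^2)^2/(a + [y^\star]^2)$ (our reading of eq. (3)), implements the plain update of eq. (31) exactly as written, and uses a small dimension and few steps so that it runs quickly. Function names and the toy sizes are hypothetical.

```python
import numpy as np

def risk_and_grad(theta, theta_star, x, a=0.01):
    """Empirical risk of eq. (1) and its gradient for the assumed loss."""
    y, ys = x @ theta, x @ theta_star
    denom = a + ys**2
    risk = np.mean((y**2 - ys**2) ** 2 / denom)
    grad = x.T @ (4.0 * y * (y**2 - ys**2) / denom) / x.shape[0]
    return risk, grad

def gradient_descent(theta0, theta_star, x, eta=2e-4, steps=200):
    """Full-batch GD, eq. (31); returns the final iterate and the risk trace."""
    theta, risks = theta0.copy(), []
    for _ in range(steps):
        r, g = risk_and_grad(theta, theta_star, x)
        risks.append(r)
        theta = theta - eta * g
    risks.append(risk_and_grad(theta, theta_star, x)[0])
    return theta, np.array(risks)

def warm_start(theta_star, q0, rng):
    """Initialization (2): component q0 along the signal, Gaussian elsewhere."""
    d = theta_star.size
    u = theta_star / np.linalg.norm(theta_star)
    noise = rng.standard_normal(d) / np.sqrt(d)
    noise -= (noise @ u) * u          # keep only the orthogonal directions
    return q0 * u + noise

rng = np.random.default_rng(0)
d, n = 64, 256                        # toy sizes, alpha = n / d = 4
x = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d) / np.sqrt(d)

theta_cold = rng.standard_normal(d) / np.sqrt(d)   # initialization (1)
theta_warm = warm_start(theta_star, 0.2, rng)      # initialization (2)
_, risks = gradient_descent(theta_cold, theta_star, x)
```

With the small learning rate used in the paper, the risk trace should decrease slowly, which is consistent with the large number of iterations reported in the experiments.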
Finite-dimension effects, and modified dynamics – Our goal is to compare the Kac-Rice predictions, derived in the limit $d \to \infty$, with numerical experiments representative of the high-dimensional behavior. In the phase retrieval problem, however, the gradient-descent dynamics (31) is strongly affected by finite-dimensional effects. As shown in [BBC25], convergence to the infinite-dimensional limit at finite but large times $t$ requires, for certain values of $\alpha$, dimensions that grow exponentially with $t$. This phenomenon originates from a time-dependent BBP transition in the Hessian over a range of $\alpha$: at early times, the Hessian exhibits a negative outlier eigenvalue whose eigenvector is aligned with the signal, while this outlier disappears at later times (still at times of order one with respect to $d$). In the infinite-dimensional limit, the initial overlap with the signal is of order $1/\sqrt{d}$, which prevents this instability from growing over any finite time: the outlier disappears before becoming relevant. In finite dimensions, by contrast, the instability induces an overlap with the signal that grows exponentially in time, so that a timescale of order $\log d$ is sufficient to generate an $O(1)$ overlap. This mechanism is responsible for the strong finite-dimensional corrections.

[Figure 6 (three panels; horizontal axis $\alpha$, vertical axis $\langle e \rangle$; curves: typical, lowest and highest energy; markers: experiments at $d = 512$).]
Figure 6: Evolution of the average energy $\langle e \rangle$ of the minima for $a = 0.01$ with $\alpha$ at fixed (left) $q = 0.0$, (middle) $q = 0.1$, and (right) $q = 0.2$ obtained from the experiments (in black) and the band of energy predicted by the annealed Kac-Rice (in shaded red).
Since simulating dimensions that grow exponentially with $t$ is computationally infeasible, [BBC25] introduced a modified dynamics that suppresses these corrections while preserving the same infinite-dimensional asymptotic limit. For completeness, we now explain this modified dynamics, which we adapt to also probe minima at $q > 0$. First, we draw $\boldsymbol{\theta}^{(0)}$ at random conditioned on having an initial overlap $q_0$ with the signal, i.e. $\boldsymbol{\theta}^{(0)} \cdot \boldsymbol{\theta}^\star = q_0$. Then, we perform $t_C$ full-batch GD steps with a constraint enforcing that the overlap $\boldsymbol{\theta}^{(t)} \cdot \boldsymbol{\theta}^\star$ remains equal to $q_0$ during the descent. In practice, after each update with $t < t_C$, we project $\boldsymbol{\theta}^{(t)}$ back onto the manifold $\{\boldsymbol{\theta} \in \mathcal{S}^{d-1} : \boldsymbol{\theta} \cdot \boldsymbol{\theta}^\star = q_0\}$ by writing
$$\boldsymbol{\theta}^{(t)} \leftarrow q_0 \boldsymbol{\theta}^\star + \sqrt{1 - q_0^2}\, \frac{\boldsymbol{\theta}^{(t)}_\perp}{\|\boldsymbol{\theta}^{(t)}_\perp\|_2}, \qquad \boldsymbol{\theta}^{(t)}_\perp = \boldsymbol{\theta}^{(t)} - \big(\boldsymbol{\theta}^{(t)} \cdot \boldsymbol{\theta}^\star\big)\, \boldsymbol{\theta}^\star, \tag{32}$$
where the normalization by $\|\boldsymbol{\theta}^{(t)}_\perp\|_2$ keeps the iterate on the unit sphere (taking $\|\boldsymbol{\theta}^\star\|_2 = 1$). The constrained iterate $\boldsymbol{\theta}^{(t_C)}$ therefore corresponds to a low-energy state with an overlap of exactly $q_0$ with the signal, and with near-zero gradients in all directions except $\boldsymbol{\theta}^\star$. In a last stage, we use $\boldsymbol{\theta}^{(t_C)}$ to initialize a standard (unconstrained) GD dynamics. This procedure has the effect of considerably shifting the success rates to larger values of $\alpha$, more representative of the large-$d$ limit [BBC25]. To make the comparison between the dynamics and the predictions meaningful, we focus on minima reached at the end of the dynamics that fall in a narrow latitude band around a fixed $q$, i.e. satisfying $q^{(T)} \in [q - 0.05, q + 0.05]$.

Hyperparameters and success criterion – In all our experiments, we fix the dimension to $d = 512$ and the learning rate to $\eta = 2 \times 10^{-4}$. The considered loss function is the one from eq. (3) with $a = 0.01$. Additional comparisons of the dynamics with the Kac-Rice predictions are shown for $a = 1$ in Appendix D. The maximum number of GD iterations in the two phases is set to $t_C = 60{,}000$ and $T = 12{,}000 \log_2(d)$.
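The two-stage procedure described above (constrained descent at fixed overlap, then unconstrained GD) can be sketched as follows. This is a minimal sketch, not the implementation of [BBC25]: the function names are hypothetical, `grad` is a user-supplied gradient of the empirical risk, and the signal is normalized explicitly so that the projection lands exactly on the unit sphere.

```python
import numpy as np

def project_to_overlap(theta, theta_star, q0):
    """Projection in the spirit of eq. (32): return the unit-norm vector with
    overlap q0 along the signal direction, keeping the orthogonal direction."""
    u = theta_star / np.linalg.norm(theta_star)  # assumption: normalized signal
    theta_perp = theta - np.dot(theta, u) * u
    return q0 * u + np.sqrt(1.0 - q0**2) * theta_perp / np.linalg.norm(theta_perp)

def two_stage_gd(theta0, theta_star, grad, eta, q0, t_c, T):
    """t_c constrained GD steps at fixed overlap q0, then T plain GD steps."""
    theta = project_to_overlap(theta0, theta_star, q0)
    for _ in range(t_c):
        # gradient step followed by projection back onto the constraint set
        theta = project_to_overlap(theta - eta * grad(theta), theta_star, q0)
    for _ in range(T):
        # unconstrained dynamics, eq. (31)
        theta = theta - eta * grad(theta)
    return theta
```

With the settings quoted in the text, $d = 512$ gives $T = 12{,}000 \log_2(512) = 108{,}000$ unconstrained steps after the $t_C = 60{,}000$ constrained ones.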
We classify a run as successful if the final absolute overlap satisfies $|\boldsymbol{\theta}^{(T)} \cdot \boldsymbol{\theta}^\star| > 0.99$, and as a failure otherwise. For each set of $(\alpha, q_0)$, we draw between $1{,}000$ and $2{,}400$ independent realizations (by resampling both the dataset and the initial condition $\boldsymbol{\theta}^{(0)}$) to compute means and error bars. We made sure that $\eta$ is sufficiently small to approximate gradient flow by observing convergence of the success rate at fixed $(\alpha, q)$ as $\eta$ is decreased.

[Figure 7 (two rows of three panels: (a) for $\alpha = 3.5$, (b) for $\alpha = 4.5$; left: $\rho(\lambda)$ versus $\lambda$; middle: $p(F)$ versus $F(u)$; right: $\nu(y, y^\star)$ in the $(y, y^\star)$ plane).]
Figure 7: Comparisons of the predicted Kac-Rice properties for the minima at $q = 0.0$ and typical energy $e^\star$ and the empirical minima found by the gradient descent dynamics at $d = 512$ and $a = 0.01$. We compare: (left) the eigenvalue distribution of the Hessian $\rho(\lambda)$, (middle) the distribution of the Hessian weights $F(u) = \partial_1^2 \ell(y, y^\star)$, and (right) the joint label distributions $\nu(y, y^\star)$. In red are the Kac-Rice predictions described in Section 2.

3.2 Properties of minima: Kac-Rice predictions vs numerical experiments

We begin our evaluation of the annealed Kac-Rice predictions by comparing the energies of minima reached by our finite-size dynamics at $d = 512$. Figure 6 reports, for $q \in \{0.0, 0.1, 0.2\}$, the average energies of minima $\langle e \rangle$ reached by our numerical procedure as a function of $\alpha$, together with the energy band predicted analytically by Kac-Rice.
We observe an excellent agreement across the entire explored range of $\alpha$: the energies of GD minima lie inside the band predicted by the Kac-Rice method for lowest, typical and highest minima. The energies of GD minima even closely align with the typical energies predicted by Kac-Rice. This observation supports the view that, when GD fails, its dynamics does not converge to exotic, low-probability minima, but rather gets trapped in typical regions of the loss landscape.

We then probe more properties of these minima in Fig. 7 by comparing, for $q = 0$ and $\alpha \in \{3.5, 4.5\}$, three observables evaluated for typical minima at energy $e^\star$. From left to right, we show: (i) the spectral density of Hessian eigenvalues at converged minima; (ii) the distribution of $F(u) = \partial_1^2 \ell(y, y^\star)$; and (iii) the joint distribution $\nu(y, y^\star)$ of predicted and true labels, which is the central object from which all the quantities are derived. For all three observables, the theoretical predictions (in red) match the empirical distributions remarkably well, not only around their typical values, but also in the tails. A minor discrepancy is visible for the singular behavior of $F(u)$ at zero, which is not resolved by our finite-sample analysis. The theoretical and experimental distributions $\nu(y, y^\star)$ are visually indistinguishable, both being similarly sharp and exhibiting the same structure. We emphasize that this match is highly non-trivial: a generic random configuration at fixed $q$ would produce a jointly Gaussian distribution by standard high-dimensional arguments, whereas the minima selected by the dynamics display a clear non-Gaussian shape. That the predicted non-Gaussian structure cleanly matches the finite-$d$ experiments is a particularly strong validation of the ability of our theoretical method to capture properties of minima of the loss landscape.
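The three empirical observables compared above can be extracted from a converged iterate along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the Hessian is taken as $\frac{1}{n}\sum_i F(u_i)\, \mathbf{x}_i \mathbf{x}_i^\top$ (so the global shift due to the spherical constraint is ignored), and the second derivative shown is that of a stand-in quartic loss rather than the loss of eq. (3).

```python
import numpy as np

def quartic_second_derivative(y, y_star):
    """Stand-in F(u): second derivative in y of (y^2 - y_star^2)^2 / 4."""
    return 3.0 * y**2 - y_star**2

def minimum_observables(X, theta, theta_star, d2_loss):
    """Return (Hessian eigenvalues, Hessian weights F(u_i), label pairs).

    X has shape (n, d) with rows x_i; d2_loss(y, y_star) evaluates the second
    derivative of the loss with respect to its first argument, elementwise.
    """
    n = X.shape[0]
    y, y_star = X @ theta, X @ theta_star
    weights = d2_loss(y, y_star)                  # F(u_i), one per sample
    hessian = (X.T * weights) @ X / n             # (1/n) sum_i F(u_i) x_i x_i^T
    eigenvalues = np.linalg.eigvalsh(hessian)     # ascending order
    labels = np.column_stack([y, y_star])         # samples of (y, y_star)
    return eigenvalues, weights, labels
```

Histograms of the three returned arrays, pooled over independent runs, give finite-$d$ estimates of $\rho(\lambda)$, $p(F)$ and $\nu(y, y^\star)$ respectively.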
Overall, these comparisons indicate that, even at a moderate dimension $d = 512$, the annealed Kac-Rice predictions provide an accurate picture of the minima properties in the energy landscape as a function of $(\alpha, q)$.

3.3 Phase diagram from Kac-Rice vs gradient descent algorithmic thresholds

Following [Man+19b; Man+20a; BBC25], the success or failure of gradient descent in phase retrieval, and more generally in generalized linear models, can be related to the landscape properties discussed in Section 2.5.4. In particular, the algorithmic recovery threshold$^1$ is conjectured to coincide with the BBP instability of the local minima located on the equator ($q = 0$), which trap the dynamics at low signal-to-noise ratio. Since we observe a band of marginally stable minima, it remains unclear which of them constitute the “bad minima” responsible for trapping the dynamics. Addressing this question would require characterizing their basins of attraction, a highly challenging and, at present, open problem. A plausible working hypothesis is that the dynamically relevant minima are the typical ones, i.e., those that are most numerous$^2$. The aim of this section is therefore to compare the algorithmic thresholds found by numerical experiments on GD with the landscape phase diagrams obtained by the Kac-Rice approach.

For convenience, Fig. 8 reproduces the phase diagrams restricted to minima and overlays the finite-size experimental results. Our Kac-Rice analysis predicts that the signal-to-noise ratio $\alpha_{\mathrm{triv}}$ at which minima disappear from the landscape is larger than the BBP threshold $\alpha^{(\mathrm{typ})}_{\mathrm{BBP}}$ at which typical-energy minima start to acquire an instability aligned with the signal $\boldsymbol{\theta}^\star$.
One then expects three dynamical regimes in the $d \to \infty$ limit:
• For $\alpha < \alpha^{(\mathrm{typ})}_{\mathrm{BBP}}$, the typical minima are (marginally) stable and GD starting from a random initial condition is trapped in them at long times,
• For $\alpha \in \big[\alpha^{(\mathrm{typ})}_{\mathrm{BBP}}, \alpha_{\mathrm{triv}}\big]$, these minima turn into saddles with a negative curvature towards $\boldsymbol{\theta}^\star$, enabling GD to escape and reach the signal in finite time,
• For $\alpha > \alpha_{\mathrm{triv}}$, all the minima disappear and the landscape becomes trivial to descend, so GD converges reliably to the signal.
The threshold values change if the initial condition has a finite overlap with the signal (warm start).

Let us first focus on the case of an uncorrelated initial condition ($q = 0$). Figure 8a shows that the empirical success transition, shown as an error bar, is in very good agreement with the BBP-KR transition predicted for typical minima. The error bar is obtained by considering the range of $\alpha$ for which the success rate is $0.5 \pm 0.2$. It is also consistent with the finite-size analysis of [BBC25], reporting a BBP transition at $\alpha \approx 4.0$ that we display as a star on the diagram. This highlights that GD can indeed succeed in a regime where an exponential number of minima are still present in the landscape, well before the trivialization threshold, as previously discussed in [Man+20a].

$^1$ Strictly speaking, the Kac-Rice BBP method should be informative on the weak recovery threshold for a random initial condition ($q_0 = 0$). However, in our simulations we always find strong recovery when recovery is achieved. Therefore, we do not distinguish between weak and strong recovery in what follows. For warm starts ($q_0 > 0$), our numerical results show that, at sufficiently small $\alpha$, gradient descent is trapped in minima with overlap larger than $q_0$. As $\alpha$ increases, we observe a transition toward strong recovery. We identify this transition with the one predicted by the BBP-Kac-Rice analysis.
$^2$ If the basins of attraction were independent of the energy of the minima, this would indeed be the case.

[Figure 8, panel (a): axis $\alpha$ with marks $\alpha^{(\mathrm{high})}_{\mathrm{BBP}} \simeq 2.76$, $\alpha^{(\mathrm{typ})}_{\mathrm{BBP}} \simeq 3.81$, $\alpha^{(\mathrm{low})}_{\mathrm{BBP}} \simeq 4.39$, $\alpha_{\mathrm{triv}} \simeq 7.49$. Panel (b): $(\alpha, q)$ plane; legend: trivialization of minima, BBP-KR instability (typical-energy minima), BBP-KR instability (lowest-energy minima), failure-conditioned recovery rate, strong recovery ($d = 512$).]
(a) Same as Fig. 5a with the experimental success transition for $q = 0.0$ (error bar). The star corresponds to the finite-size analysis of the same loss function in [BBC25].
(b) Minima-only phase diagram from Fig. 5 with the strong recovery rates obtained from our numerical experiments ($d = 512$). Success is observed empirically in the region consistently below the predicted BBP and trivialization transitions at all $(\alpha, q)$.
Figure 8: Minima phase diagram predicted by the Kac-Rice method for $a = 0.01$ along with the experimental results.

We extend this comparison to the full $(\alpha, q)$ plane in Fig. 8b, completing the phase diagram from Fig. 5b with the success-rate line extracted from GD experiments at $d = 512$. We find that GD succeeds in the lower-left region of the diagram where Kac-Rice predicts the existence of stable spurious minima (including marginally stable ones), consistent with the idea that the practical success is primarily controlled by a BBP-like transition which is not visible using the Kac-Rice bound, but requires the BBP-Kac-Rice method we developed. As $q$ increases away from 0, the success rate (dashed line and error bars) rapidly shifts to smaller values of $\alpha$ and progressively deviates from the theoretical BBP prediction for typical-energy minima. In particular, when $q_0 > 0$, we systematically observe that unsuccessful runs do not remain at their initial overlap: even when GD fails to reach the signal, $\langle q^{(T)} \rangle_{\mathrm{fail}} > q_0$ for large-enough $\alpha$.
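The empirical quantities used in this comparison (success rate, failure-conditioned overlap, and the window of $\alpha$ where the success rate is $0.5 \pm 0.2$) can be estimated as sketched below. The function names are hypothetical, and several choices are assumptions rather than the procedure actually used: a binomial standard error for the success rate, absolute overlaps in the failure-conditioned average, and linear interpolation of a success rate assumed to increase with $\alpha$.

```python
import numpy as np

def recovery_statistics(final_overlaps, threshold=0.99):
    """Success rate, its binomial standard error (an assumption), and the
    average absolute overlap over failed runs (nan if every run succeeded)."""
    q = np.abs(np.asarray(final_overlaps, dtype=float))
    success = q > threshold                       # |theta_T . theta_star| > 0.99
    rate = success.mean()
    stderr = np.sqrt(rate * (1.0 - rate) / q.size)
    q_fail = q[~success].mean() if np.any(~success) else np.nan
    return rate, stderr, q_fail

def threshold_window(alphas, rates, center=0.5, half_width=0.2):
    """Values of alpha where the success rate crosses center - half_width,
    center and center + half_width, by linear interpolation.
    Assumes `rates` is increasing in alpha."""
    targets = [center - half_width, center, center + half_width]
    lo, mid, hi = np.interp(targets, rates, alphas)
    return float(lo), float(mid), float(hi)
```

For a warm start at $q_0$, the value `q_fail` returned for the failed runs plays the role of $\langle q^{(T)} \rangle_{\mathrm{fail}}$.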
After a warm start at $q_0$, GD is expected to be trapped by minima with higher overlap before the instability towards the signal takes place. Consequently, in order to compare theory and numerical experiments more fairly, we report a second experimental line in which we use the failure-conditioned average $\langle q^{(T)} \rangle_{\mathrm{fail}}$ in place of $q_0$ when assessing strong recovery. This correction shifts the empirical contour to larger $\alpha$, but a clear quantitative gap with the theoretical boundary remains. We interpret this discrepancy as likely due to (i) unresolved finite-size effects; and (ii) the need to go beyond the annealed Kac-Rice prediction and perform a quenched computation [Ros+19; MBB20]. In particular, a quenched result is expected to shift the theoretical transition to lower $\alpha$ with respect to the annealed one.

4 Conclusion

This work presents a systematic theoretical and numerical investigation of the geometry of non-convex empirical risk landscapes for generalized linear models, with an emphasis on phase retrieval. Combining the Kac-Rice method with random matrix theory, we derive tractable scalar variational formulas for the annealed complexities of local minima, sub-extensive-index saddles, and generic critical points, and we use them to build detailed phase diagrams for the phase retrieval landscape as a function of the sample complexity $\alpha = n/d$ and the overlap $q$ with the ground truth. We introduce a BBP-Kac-Rice method to further identify when local minima acquire a signal-aligned instability before the Kac-Rice trivialization threshold is reached, thereby providing a theoretical connection between landscape geometry and gradient-descent dynamics.
These predictions are supported by extensive numerical experiments: the resulting energy bands, Hessian spectral properties, and label distributions are in excellent quantitative agreement with finite-size gradient-descent simulations at $d = 512$, validating the practical accuracy of the formalism we developed.

Limitations – Several limitations of the present work deserve mention. First, the complexities we compute are annealed (i.e. based on the first moment $\mathbb{E}[\#\{\text{critical points}\}]$) rather than quenched: a positive annealed complexity does not guarantee the existence of exponentially many critical points with high probability, and whether the two notions coincide remains an open problem. Second, we cannot guarantee in general the uniqueness of the solution to the scalar variational problems of eqs. (24)–(28): while we never encountered multiple local maxima for the complexity of local minima or sub-extensive-index saddles, we did find coexisting solutions for $\Sigma_{\mathrm{tot.}}$ in a limited regime, associated with a first-order phase transition (see Appendix C.3). Third, our numerical implementation of the Kac-Rice formulas becomes unreliable as $q \to 1$, where the reference measure $\mu_q$ grows increasingly concentrated and the integrands develop sharp features. Another difficulty arises on the dynamical side when considering large $q$: at large warm-start overlaps, finite-size effects become harder to control, making it difficult to reliably estimate the large-dimensional algorithmic threshold in this regime.

Finally, let us mention that it would be interesting to compare the results of our approach with the ones that can be obtained by the replica method, which has often been used in the past to perform landscape analysis. In an extended version of this work we shall complement the Kac-Rice analysis with a one-step replica-symmetry-breaking (1RSB) approach.
The 1RSB method yields less accurate results, even at a qualitative level, for key observables such as the joint label distribution $\nu(y, y^\star)$ and the Hessian spectral density at local minima; we discuss possible explanations in the extended version.

Acknowledgements

During the completion of this work we became aware of a concurrent work [MS26] that also investigates the geometrical properties of non-convex empirical loss landscapes of generalized linear models by the Kac-Rice method. A detailed comparison will be presented in an extended version of this work. GB acknowledges support from the French government under the management of the Agence Nationale PR[AI]RIE-PSAI (ANR-23-IACL-0008).

References

[AAC10] A. Auffinger, G. Ben Arous, and J. Cerny. “Random Matrices and complexity of Spin Glasses”. (Mar. 2010).
[ABC25] Brandon Livio Annesi, Dario Bocchi, and Chiara Cammarota. “Overparametrization bends the landscape: BBP transitions at initialization in simple Neural Networks”. In: arXiv preprint arXiv:2510.18435 (2025).
[AMS25] Kiana Asgari, Andrea Montanari, and Basil Saeed. “Local minima of the empirical risk in high dimension: General theorems and convex examples”. In: arXiv preprint arXiv:2502.01953 (2025).
[Ann+23] Brandon Livio Annesi et al. “Star-Shaped Space of Solutions of the Spherical Negative Perceptron”. In: Phys. Rev. Lett. 131 (22 Nov. 2023), p. 227301. doi: 10.1103/PhysRevLett.131.227301. url: https://link.aps.org/doi/10.1103/PhysRevLett.131.227301.
[Arn+23] Luca Arnaboldi et al. “Escaping mediocrity: how two-layer networks learn hard generalized linear models with SGD”. In: arXiv preprint arXiv:2305.18502 (2023).
[Aro+19] Gérard Ben Arous et al. “The Landscape of the Spiked Tensor Model”. In: Communications on Pure and Applied Mathematics 72 (11 2019), pp. 2282–2330. issn: 10970312. doi: 10.1002/cpa.21861.
[AT07] Robert J Adler and Jonathan E Taylor. Random fields and geometry. Springer, 2007.
[AW09] Jean-Marc Azais and Mario Wschebor. Level sets and extrema of random processes and fields. John Wiley & Sons, 2009.
[Bai+19] Marco Baity-Jesi et al. “Comparing dynamics: deep neural networks versus glassy systems”. In: Journal of Statistical Mechanics: Theory and Experiment 12 (12 2019), p. 124013. issn: 02017563. doi: 10.1088/1742-5468/ab3281. url: https://ui.adsabs.harvard.edu/abs/2019JSMTE..12.4013B.
[BBC25] Tony Bonnaire, Giulio Biroli, and Chiara Cammarota. “The role of the time-dependent Hessian in high-dimensional optimization”. In: Journal of Statistical Mechanics: Theory and Experiment 2025.8 (2025), p. 083401.
[BBP05] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. “Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices”. In: Ann. Probab. 33.1 (2005), pp. 1643–1697.
[BBP23] Alberto Bietti, Joan Bruna, and Loucas Pillaud-Vivien. “On learning gaussian multi-index models with gradient flow”. In: arXiv preprint arXiv:2310.19793 (2023).
[BG97] Gérard Ben Arous and Alice Guionnet. “Large deviations for Wigner’s law and Voiculescu’s non-commutative entropy”. In: Probability theory and related fields 108.4 (1997), pp. 517–542.
[BGJ22] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. “High-dimensional limit theorems for sgd: Effective dynamics and critical scaling”. In: Advances in neural information processing systems 35 (2022), pp. 25349–25362.
[BMM18] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. “To Understand Deep Learning We Need to Understand Kernel Learning”. In: Proceedings of the 35th International Conference on Machine Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of Machine Learning Research. PMLR, Oct. 2018, pp. 541–549. url: https://proceedings.mlr.press/v80/belkin18a.html.
[BN11] Florent Benaych-Georges and Raj Rao Nadakuditi. “The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices”. In: Advances in Mathematics 227.1 (2011), pp. 494–521.
[Cai+21a] Jian-Feng Cai et al. “The global landscape of phase retrieval I: perturbed amplitude models”. In: arXiv preprint arXiv:2112.07993 (2021).
[Cai+21b] Jian-Feng Cai et al. “The global landscape of phase retrieval II: quotient intensity models”. In: arXiv preprint arXiv:2112.07997 (2021).
[Cai+22] Jianfeng Cai et al. “Solving phase retrieval with random initial guess is nearly as good as by spectral initialization”. In: Applied and Computational Harmonic Analysis 58 (2022), pp. 60–84.
[CL22] Romain Couillet and Zhenyu Liao. Random matrix methods for machine learning. Cambridge University Press, 2022.
[Col+24] Elizabeth Collins-Woodfin et al. “Hitting the high-dimensional notes: An ode for sgd learning dynamics on glms and multi-index models”. In: Information and Inference: A Journal of the IMA 13.4 (2024), iaae028.
[Don+23] Jonathan Dong et al. “Phase retrieval: From computational imaging to machine learning: A tutorial”. In: IEEE Signal Processing Magazine 40.1 (2023), pp. 45–57.
[DZ09] Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications. Vol. 38. Springer Science & Business Media, 2009.
[FT22] Yan V Fyodorov and Rashel Tublin. “Optimization landscape in the simplest constrained random least-square problem”. In: Journal of Physics A: Mathematical and Theoretical 55.24 (2022), p. 244008.
[Fyo04] Yan V. Fyodorov. “Complexity of random energy landscapes, glass transition, and absolute value of spectral determinant of random matrices”. In: Physical Review Letters 93 (14 2004), pp. 149901–149901. issn: 00319007. doi: 10.1103/PhysRevLett.93.149901.
[GM17] Rong Ge and Tengyu Ma. “On the optimization landscape of tensor decompositions”.
In: Advances in neural information processing systems 30 (2017).
[Gui22] Alice Guionnet. “Rare events in random matrix theory”. In: Proc. Int. Cong. Math. Vol. 2. 2022, pp. 1008–1052.
[Kac43] M Kac. “A correction to “On the average number of real roots of a random algebraic equation””. In: Bulletin of the American Mathematical Society 49.12 (1943), pp. 938–939.
[Lee+24] Jason D Lee et al. “Neural network learns low-dimensional polynomials with sgd near the information-theoretic limit”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 58716–58756.
[LN89] Dong C Liu and Jorge Nocedal. “On the limited memory BFGS method for large scale optimization”. In: Mathematical programming 45.1 (1989), pp. 503–528.
[LPA20] Shengchao Liu, Dimitris Papailiopoulos, and Dimitris Achlioptas. “Bad global minima exist and SGD can reach them”. In: Advances in Neural Information Processing Systems 2020-Decem (2020). issn: 10495258.
[LPS15] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. “Numba: A llvm-based python jit compiler”. In: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. 2015, pp. 1–6.
[LS16] Ji Lee and Kevin Schnelli. “Tracy-widom distribution for the largest eigenvalue of real sample covariance matrices with general population”. In: Annals of Applied Probability 26.6 (2016).
[Mai21] Antoine Maillard. “Large deviations of extreme eigenvalues of generalized sample covariance matrices”. In: Europhysics Letters 133.2 (2021), p. 20005.
[Man+19a] Stefano Sarao Mannelli et al. “Passed and Spurious: Descent Algorithms and Local Minima in Spiked Matrix-Tensor Models”. In: Proceedings of the 36th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri and Ruslan Salakhutdinov. Vol. 97. Proceedings of Machine Learning Research. PMLR, Sept. 2019, pp. 4333–4342. url: https://proceedings.mlr.press/v97/mannelli19a.html.
[Man+19b] Stefano Sarao Mannelli et al. “Who is afraid of big bad minima? Analysis of gradient-flow in a spiked matrix-tensor model”. In: Advances in Neural Information Processing Systems 32 (2019), pp. 1–28. issn: 10495258.
[Man+20a] Stefano Sarao Mannelli et al. “Complex dynamics in simple neural networks: Understanding gradient flow in phase retrieval”. In: Advances in Neural Information Processing Systems (2020), pp. 1–17. issn: 10495258.
[Man+20b] Stefano Sarao Mannelli et al. “Marvels and Pitfalls of the Langevin Algorithm in Noisy High-Dimensional Inference”. In: Physical Review X 10 (1 2020), pp. 1–45. issn: 21603308. doi: 10.1103/PhysRevX.10.011057.
[MBB18] Siyuan Ma, Raef Bassily, and Mikhail Belkin. “The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning”. In: International Conference on Machine Learning. PMLR. 2018, pp. 3325–3334.
[MBB20] Antoine Maillard, Gérard Ben Arous, and Giulio Biroli. “Landscape complexity for the empirical risk of generalized linear models”. In: Mathematical and Scientific Machine Learning. PMLR. 2020, pp. 287–327.
[MBB23] Simon Martin, Francis Bach, and Giulio Biroli. “On the Impact of Overparameterization on the Training of a Shallow Neural Network in High Dimensions”. In: arXiv preprint arXiv:2311.03794 (2023).
[MP67] Vladimir Alexandrovich Marchenko and Leonid Andreevich Pastur. “Distribution of eigenvalues for some sets of random matrices”. In: Matematicheskii Sbornik 114.4 (1967), pp. 507–536.
[MS26] Andrea Montanari and Basil Saeed. Topological trivialization in non-convex empirical risk minimization. 2026. arXiv: 2602.14969 [math.ST]. url: https://arxiv.org/abs/2602.14969.
[MVZ20] Stefano Sarao Mannelli, Eric Vanden-Eijnden, and Lenka Zdeborová. “Optimization and generalization of shallow neural networks with quadratic activation functions”.
In: Advances in Neural Information Processing Systems 2020-Decem (2020), pp. 1–26. issn: 10495258.
[MW26] Andrea Montanari and Zihao Wang. “Phase transitions for feature learning in neural networks”. In: arXiv preprint arXiv:2602.01434 (2026).
[Ney+17] Behnam Neyshabur et al. “Exploring generalization in deep learning”. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Long Beach, California, USA: Curran Associates Inc., 2017, pp. 5949–5958. isbn: 9781510860964.
[RF22] Valentina Ros and Yan V Fyodorov. “The high-d landscapes paradigm: spin-glasses, and beyond”. In: arXiv preprint arXiv:2209.07975 (2022).
[Ric45] Stephen O Rice. “Mathematical analysis of random noise”. In: The Bell System Technical Journal 24.1 (1945), pp. 46–156.
[Ros+19] Valentina Ros et al. “Complex Energy Landscapes in Spiked-Tensor and Simple Glassy Models: Ruggedness, Arrangements of Local Minima, and Phase Transitions”. In: Physical Review X 9 (1 2019), p. 11003. issn: 21603308. doi: 10.1103/PhysRevX.9.011003. url: https://doi.org/10.1103/PhysRevX.9.011003.
[SB95] Jack W Silverstein and Zhi Dong Bai. “On the empirical distribution of eigenvalues of a class of large dimensional random matrices”. In: Journal of Multivariate analysis 54.2 (1995), pp. 175–192.
[SC16] Daniel Soudry and Yair Carmon. “No bad local minima: Data independent training error guarantees for multilayer neural networks”. In: arXiv preprint (2016).
[SC95] Jack W Silverstein and Sang-Il Choi. “Analysis of the limiting spectral distribution of large dimensional random matrices”. In: Journal of Multivariate Analysis 54.2 (1995), pp. 295–309.
[TM25] Theodoros G Tsironis and Aris L Moustakas. “Landscape Complexity for the Empirical Risk of Generalized Linear Models: Discrimination between Structured Data”. In: arXiv preprint arXiv:2503.14403 (2025).
[VBB19] Luca Venturi, Afonso S.
Bandeira, and Joan Bruna. “Spurious Valleys in One-hidden-layer Neural Network Optimization Landscapes”. In: Journal of Machine Learning Research 20.133 (2019), pp. 1–34. url: http://jmlr.org/papers/v20/18-674.html.

A The Kac-Rice formula: technicalities and numerical solutions

A.1 Reminders on the Marchenko-Pastur equation

We recall here how to compute $\mu_\alpha[\nu]$ (see Section 2.3.1), both analytically and with numerical procedures. As we saw, it is (up to a global shift due to the spherical constraint) the asymptotic spectral density of the Hessian. Recall that $\mu = \mu_\alpha[\nu]$ is the limiting spectral distribution (LSD) of $\mathbf{z}\mathbf{D}\mathbf{z}^\top/n$, with $\mathbf{z} \in \mathbb{R}^{d \times n}$ an i.i.d. $\mathcal{N}(0, 1)$ matrix, and $\mathbf{D} = \mathrm{Diag}(\{F(u_i)\}_{i=1}^n)$ with $\{u_i\}_{i=1}^n \overset{\mathrm{i.i.d.}}{\sim} \nu$.

Bulk density – The Stieltjes transform $g(z) := \int \mu(\mathrm{d}t)\,(t - z)^{-1}$ is given by the unique solution in $\mathbb{C}_+$ to the equation, for all $z \in \mathbb{C}_+$ [MP67; SB95]:
$$g(z) = -\left[z - \alpha \int \nu(\mathrm{d}u)\, \frac{F(u)}{\alpha + g(z) F(u)}\right]^{-1}. \tag{33}$$
In practice, to compute $\mu(x)$ we solve eq. (33) for $z = x + i\varepsilon$ with $x \in \mathbb{R}$, $\varepsilon > 0$ and $\varepsilon \ll 1$. By the Stieltjes-Perron inversion theorem, this indeed gives access to the density of $\mu$:
$$\frac{\mathrm{d}\mu}{\mathrm{d}x} = \lim_{\varepsilon \to 0} \frac{1}{\pi} \mathrm{Im}[g(x + i\varepsilon)]. \tag{34}$$
After evaluating the law of $X := F(u)$ for $u \sim \nu$, one can solve eq. (33) numerically very efficiently by using a simple root solver, as explained e.g. in [CL22]. More precisely, we solve $G(g) = 0$, with
$$G(g) := g + \left[z - \alpha \int \nu(\mathrm{d}u)\, \frac{F(u)}{\alpha + g F(u)}\right]^{-1}. \tag{35}$$
Note as well that we can rewrite eq. (33) in terms of the inverse $g^{-1}(s)$ of the Stieltjes transform:
$$g^{-1}(s) = -\frac{1}{s} + \alpha \int \frac{F(u)}{\alpha + s F(u)}\, \nu(\mathrm{d}u). \tag{36}$$

Edge of the spectrum – In full generality, the support of the density can be computed by investigating regions where the right-hand side of eq. (36) is real and increasing.
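As an illustration of eqs. (33), (34) and (36), the sketch below replaces $\nu$ by the empirical measure of samples $F(u_i)$ and solves eq. (33) by a damped fixed-point iteration. This is a swapped-in technique, not the root solver on $G(g) = 0$ mentioned in the text; the function names, the damping value and the tolerances are assumptions.

```python
import numpy as np

def stieltjes_transform(z, F, alpha, damping=0.5, tol=1e-12, max_iter=100_000):
    """Solve eq. (33) for g(z), Im z > 0, with nu replaced by the empirical
    measure of the samples F = [F(u_1), ..., F(u_n)]."""
    g = 1j  # any starting point in the upper half-plane
    for _ in range(max_iter):
        g_new = -1.0 / (z - alpha * np.mean(F / (alpha + g * F)))
        if abs(g_new - g) < tol:
            return g_new
        # damped update: a convex combination stays in the upper half-plane
        g = (1.0 - damping) * g + damping * g_new
    raise RuntimeError("fixed-point iteration did not converge")

def spectral_density(x, F, alpha, eps=1e-4):
    """Approximate d mu / d x via eq. (34), evaluated at z = x + i * eps."""
    return stieltjes_transform(x + 1j * eps, F, alpha).imag / np.pi

def stieltjes_inverse(s, F, alpha):
    """Right-hand side of eq. (36), with the same empirical measure."""
    return -1.0 / s + alpha * np.mean(F / (alpha + s * F))
```

For constant weights $F \equiv 1$ the matrix reduces to a Wishart matrix, so the output can be checked against the standard Marchenko-Pastur density with ratio $1/\alpha$; for $\alpha = 4$ this density equals $\sqrt{(2.25 - x)(x - 0.25)}/(0.5\pi x)$ on $[0.25, 2.25]$.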
Under mild assumptions (see [SC95; LS16; CL22] for more details), the Stieltjes transform at the left edge of the spectrum is given by the smallest positive solution to $[g^{-1}]'(s) = 0$, i.e. the unique solution to
$$\alpha \int \left(\frac{s F(u)}{\alpha + s F(u)}\right)^2 \nu(\mathrm{d}u) = 1. \tag{37}$$
More generally, by [SC95, Theorems 4.1 and 4.2], the Stieltjes transform at the left edge is the largest $S \geq 0$ such that both of the following are true:
(i) If $X = F(u)$ for $u \sim \nu$, then $\mathrm{supp}(\alpha + S X) \subseteq (0, \infty)$. In particular the right-hand side of eq. (37) is well-defined.
(ii) $\displaystyle \alpha \int \left(\frac{S F(u)}{\alpha + S F(u)}\right)^2 \nu(\mathrm{d}u) \leq 1$.
This then yields the left edge of the support as $x_{\min} = g^{-1}[S]$, given by eq. (36).

A.2 From functional to scalar variational formulas

A.2.1 Local minima and saddles of sub-extensive index

We start from eq. (20). The derivation for $\Sigma_{\mathrm{sub.}}$ follows exactly the same lines, removing a single constraint.
$$\widetilde{\Sigma}_0(q, e) = \frac{1 + \log \alpha}{2} + \frac{1}{2} \log(1 - q^2) + \sup_{\nu \in \mathcal{M}^{(0)}_\ell(q, e)} \left[-\frac{1}{2} \log \int_{\mathbb{R}^2} \nu(\mathrm{d}u)\, A(u) + \kappa_\alpha(\nu) - \alpha H(\nu | \mu_q)\right].$$
With some notations introduced in eq. (23), we have
$$\mathcal{M}^{(0)}_\ell(q, e) := \big\{\nu : \mathbb{E}_\nu[\ell(u)] = e,\ \mathbb{E}_\nu[c_q(u)] = 0,\ \mathrm{supp}(\mu_\alpha[\nu]) \subseteq [\mathbb{E}_\nu[t(u)], +\infty) \text{ and } \mathbb{E}_\nu[K_q(u)] \geq 0\big\}.$$
Let $t(\nu) := \mathbb{E}_\nu[t(u)]$. We denote by $g_\mu(z) := \int \mu(\mathrm{d}x)/(x - z)$ the Stieltjes transform of a probability measure $\mu$. Importantly, one can show (cf. e.g. [MBB20]):
$$\kappa_\alpha(\nu) = -\log g - t(\nu)\, g + \alpha\, \mathbb{E}_\nu \log(\alpha + F(u) g) - 1 - \alpha \log \alpha, \tag{38}$$
with $g = g_{\mu_\alpha[\nu]}[t(\nu)] > 0$ (since $\nu \in \mathcal{M}^{(0)}_\ell(q, e)$). Thus we get:
$$\widetilde{\Sigma}_0(q, e) = \frac{-1 + (1 - 2\alpha) \log \alpha}{2} + \frac{1}{2} \log(1 - q^2) + \sup_{\substack{g > 0 \\ A > 0}}\ \sup_{\substack{\nu \in \mathcal{M}^{(0)}_\ell(q, e) \\ g_{\mu_\alpha[\nu]}(t(\nu)) = g \\ \mathbb{E}_\nu[A(u)] = A}} \Big\{-\log g - \frac{1}{2} \log A - g\, \mathbb{E}_\nu[t(u)] + \alpha\, \mathbb{E}_\nu \log[\alpha + g F(u)] - \alpha H(\nu | \mu_q)\Big\}. \tag{39}$$
Using eq.
( 33 ) and the c haracterization of the left edge of the supp ort of µ α [ ν ] which we recalled in Section A.1 , for an y ν and g the condition C ( ν , g ) :  supp( µ α [ ν ]) ⊆ [ t ( ν ) , + ∞ ) and g µ α [ ν ] ( t ( ν )) = g  (40) can be recasted as the following three sim ultaneous conditions: ( i ) α + g F ( u ) ≥ 0 , for all u ∈ supp( ν ) . ( ii ) W e ha ve α E ν "  g F ( u ) α + g F ( u )  2 # ≤ 1 . (41) ( iii ) g is the Stieltjes transform of µ α [ ν ] in t ( ν ) , i.e.: − 1 g − E ν [ t ( u )] + α E ν  F ( u ) α + g F ( u )  = 0 . (42) The first condition constrains supp( ν ) ⊆ B g : = { u ∈ R 2 : α + g F ( u ) ≥ 0 } , while crucially the other t wo constraints are line ar in ν . All in all, plugging this in eq. ( 39 ) w e ha v e e Σ 0 ( q , e ) = − 1 + (1 − 2 α ) log α 2 + 1 2 log(1 − q 2 ) (43) + sup g > 0 A> 0 sup ν : supp( ν ) ⊆ B g P ( ν )  − log g − 1 2 log A − g E ν [ t ( u )] + α E ν log[ α + g F ( u )] − α H ( ν | µ q )  , 31 where P ( ν ) – whic h also dep ends on all ( A, g , q , e ) – inv olves only line ar (equalit y and inequality) constrain ts ov er ν . W e can thus introduce Lagrange m ultipliers for all these constrain ts. W e reac h: e Σ 0 ( q , e ) = − 1 + (1 − 2 α ) log α 2 + 1 2 log(1 − q 2 ) (44) + sup g > 0 A> 0 sup ν : supp( ν ) ⊆ B g inf λ A ,λ c ,λ e ,λ t ∈ R λ h ,λ ⋆ ≥ 0 " − log g − 1 2 log A − g E ν [ t ( u )] + α E ν log[ α + g F ( u )] + αλ A ( A − E ν [ A ( u ]) − αλ c E ν [ c q ( u )] + αλ e ( e − E ν [ ℓ ( u )]) + λ ⋆ E ν [ K q ( u )] + λ t  − 1 g − E ν [ t ( u )] + α E ν  F ( u ) α + g F ( u )  − λ h α E ν "  g F ( u ) α + g F ( u )  2 # − 1 ! − αH ( ν | µ q ) # . F or an y fixed ( A, g ) , the function inside the v ariational problem in eq. ( 44 ) is: • Strictly concav e as a function of ν (it is the sum of a linear functional and of − αH ( ν | µ q ) , whic h is strictly conca ve). Moreov er, notice that the constrain t supp( ν ) ⊆ B g is con vex. 
• Convex over $(\lambda_A, \lambda_c, \lambda_e, \lambda_t, \lambda_h, \lambda_\star)$, since it is linear. Moreover, the constraints $\lambda_h, \lambda_\star \geq 0$ are convex.

Therefore one can exchange the corresponding supremum and infimum in eq. (44). As a last step, we use the Gibbs-Boltzmann variational formulation, which we recall here for a generic function $L$:
$$\begin{cases} \displaystyle \operatorname*{arg\,max}_{\nu \in \mathcal{M}_1^+(B)} \left[ \int \nu(\mathrm{d}x)\, L(x) - \alpha H(\nu | \mu) \right] = \frac{\mu(\mathrm{d}x)\, e^{L(x)/\alpha}}{\int_B \mu(\mathrm{d}x)\, e^{L(x)/\alpha}}, \\[3ex] \displaystyle \sup_{\nu \in \mathcal{M}_1^+(B)} \left[ \int \nu(\mathrm{d}x)\, L(x) - \alpha H(\nu | \mu) \right] = \alpha \log \int_B \mu(\mathrm{d}x)\, e^{L(x)/\alpha}. \end{cases} \tag{45}$$
All in all, we reach eq. (24):
$$\begin{aligned} \widetilde{\Sigma}_0(q, e) = {} & \frac{-1 + (1 - 2\alpha) \log \alpha}{2} + \frac{1}{2} \log(1 - q^2) \\ & + \sup_{\substack{g > 0 \\ A > 0}}\ \inf_{\substack{\lambda_A, \lambda_c, \lambda_e, \lambda_t \in \mathbb{R} \\ \lambda_h, \lambda_\star \geq 0}} \Bigg\{ -\frac{1}{2} \log A + \alpha (\lambda_A A + \lambda_e e) - \log g - \frac{\lambda_t}{g} + \lambda_h \\ & + \alpha \log \int_{B_g} \mu_q(\mathrm{d}u)\, \exp\bigg[ -\lambda_c c_q(u) - \lambda_A A(u) - \lambda_e \ell(u) + \log(\alpha + F(u) g) - \frac{g + \lambda_t}{\alpha}\, t(u) \\ & \qquad + \frac{\lambda_t F(u)}{\alpha + F(u) g} - \lambda_h \left( \frac{F(u) g}{\alpha + F(u) g} \right)^2 + \frac{\lambda_\star}{\alpha} K_q(u) \bigg] \Bigg\}. \end{aligned} \tag{46}$$
We emphasize that the different scalar parameters appearing in eq. (46) have natural interpretations. For instance, $g > 0$ represents the Stieltjes transform of the Hessian at $0$, i.e. the Stieltjes transform of $\mu_\alpha[\nu]$ at $z = t(\nu)$ (which is real by hypothesis, since the Hessian spectral density is non-negatively supported). The Lagrange (KKT) multiplier $\lambda_h \geq 0$ enforces the constraint of eq. (41): as such, it ensures that $g$ is the solution to eq. (42) that indeed corresponds to the Stieltjes transform of $\mu_\alpha[\nu]$ evaluated at $t(\nu)$. This matters because, as we discuss below (see the end of Section A.2.2), this equation might have several solutions. Finally, $\lambda_\star \geq 0$ enforces the additional constraint $\mathbb{E}_\nu[K_q(u)] \geq 0$ that we impose in order to get a bound on the complexity of local minima which is tighter than the complexity of sub-extensive-index saddles. The activation of this constraint is signaled by $\lambda_\star > 0$: when this is the case we have $\widetilde{\Sigma}_0 < \Sigma_{\mathrm{sub.}}$.
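The Gibbs-Boltzmann formula of eq. (45) is easy to check numerically on a finite set $B$. The sketch below is only an illustration under stated assumptions: $H(\nu | \mu)$ is taken to be the relative entropy (Kullback-Leibler divergence), measures are vectors of weights, and the function names are hypothetical.

```python
import numpy as np

def gibbs_optimum(L, mu, alpha):
    """Closed-form maximiser and optimal value of eq. (45) on a finite set."""
    L_max = L.max()
    w = mu * np.exp((L - L_max) / alpha)  # shift by max(L) for numerical stability
    Z = w.sum()
    return w / Z, alpha * np.log(Z) + L_max

def objective(nu, L, mu, alpha):
    """<nu, L> - alpha * KL(nu | mu), the functional maximised in eq. (45)."""
    mask = nu > 0
    kl = np.sum(nu[mask] * np.log(nu[mask] / mu[mask]))
    return np.dot(nu, L) - alpha * kl

rng = np.random.default_rng(0)
mu = rng.random(20)
mu /= mu.sum()
L = rng.normal(size=20)
alpha = 0.7

nu_star, value = gibbs_optimum(L, mu, alpha)
# The tilted measure attains the closed-form value ...
gap_at_optimum = abs(objective(nu_star, L, mu, alpha) - value)
# ... and random competitors never exceed it.
competitors = rng.random((200, 20))
competitors /= competitors.sum(axis=1, keepdims=True)
best_competitor = max(objective(nu, L, mu, alpha) for nu in competitors)
```

In the derivation above, $L$ collects all terms of eq. (44) that are linear in $\nu$, which is how the supremum over $\nu$ turns into the logarithm of an integral in eq. (46).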
A.2.2 Generic critical points

We start from eq. (18):
$$\Sigma_{\mathrm{tot.}}(q, e) = \frac{1 + \log \alpha}{2} + \frac{1}{2} \log(1 - q^2) + \sup_{\nu \in \mathcal{M}_\ell(q, e)} \left[ -\frac{1}{2} \log \int_{\mathbb{R}^2} \nu(\mathrm{d}u)\, A(u) + \kappa_\alpha(\nu) - \alpha H(\nu | \mu_q) \right].$$
Similarly to the above derivation for local minima, we introduce Lagrange multipliers to fix the conditions in $\mathcal{M}_\ell(q, e)$. We reach:
$$\begin{aligned} \Sigma_{\mathrm{tot.}}(q, e) = {} & \frac{1 + \log \alpha}{2} + \frac{1}{2} \log(1 - q^2) + \sup_{A > 0}\ \sup_{\nu \in \mathcal{P}(\mathbb{R}^2)}\ \inf_{\lambda_A, \lambda_e, \lambda_c \in \mathbb{R}} \Big[ -\frac{1}{2} \log A + \alpha \lambda_A (A - \mathbb{E}_\nu[A(u)]) \\ & - \alpha \lambda_c\, \mathbb{E}_\nu[c_q(u)] + \alpha \lambda_e (e - \mathbb{E}_\nu[\ell(u)]) + \kappa_\alpha(\nu) - \alpha H(\nu | \mu_q) \Big]. \end{aligned} \tag{47}$$
We now tackle the term $\kappa_\alpha(\nu)$. One has a formula similar to eq. (38), see [MBB20, Section 3.2]. Precisely, as $\varepsilon \downarrow 0$, and with $g = g_r + i g_i$:
$$\kappa_\alpha(\nu) = \operatorname*{extr}_{g \in \mathbb{C}_+} \left[ -\log |g| - t(\nu)\, g_r + \varepsilon g_i + \alpha\, \mathbb{E}_\nu \log |\alpha + F(u) g| - 1 - \alpha \log \alpha \right] + o_\varepsilon(1). \tag{48}$$
One can easily check that the extremization condition in eq. (48) reduces to the Marchenko-Pastur equation, eq. (33): the corresponding $g \in \mathbb{C}_+$ is thus the Stieltjes transform of $\mu_\alpha[\nu]$ taken at $t(\nu) + i\varepsilon$. Following [MBB20], we plug eq. (48) into eq. (47), and look for a saddle point of the resulting functional. Again, the saddle point equation on $\nu$ can be solved exactly, as we are extremizing a functional of the type of eq. (45). We then reach eq. (28):
$$\begin{aligned} \Sigma_{\mathrm{tot.}}(q, e) = {} & \frac{-1 + (1 - 2\alpha) \log \alpha}{2} + \frac{1}{2} \log(1 - q^2) + \operatorname*{extr}_{\substack{A > 0,\ g \in \mathbb{C}_+ \\ \lambda_A, \lambda_e, \lambda_c \in \mathbb{R}}} \Bigg[ -\frac{1}{2} \log A - \log |g| + \varepsilon g_i + \alpha (\lambda_A A + \lambda_e e) \\ & + \alpha \log \int_{\mathbb{R}^2} \mu_q(\mathrm{d}u)\, \exp\left( -\lambda_c c_q(u) - \lambda_A A(u) - \lambda_e \ell(u) + \log |\alpha + F(u) g| - \frac{g_r}{\alpha}\, t(u) \right) \Bigg]. \end{aligned} \tag{49}$$

Remark: a difference with the previous formulas – Our formula for the complexity of all critical points (eq. (49)) is written as a global extremum, while the formula for local minima and sub-extensive saddles is instead written as a sup-inf involving more parameters, see eq. (46). The reason for this is eq. (48): indeed, while for any $\varepsilon > 0$ eq. (48) is known to have a single saddle point, on the other hand the equation
$$-\frac{1}{g} - z + \alpha\, \mathbb{E}_\nu \left[ \frac{F(u)}{\alpha + g F(u)} \right] = 0$$
can admit more than one solution $g > 0$ if $z \notin \mathrm{supp}\, \mu_\alpha[\nu]$ (sometimes referred to as different “branches” of solutions to the Marchenko-Pastur equation, see e.g. [Mai21] or [CL22, Remark 2.6]). Notice that this second branch is precisely ruled out by the condition of eq. (41). The presence of this second branch motivated us to consider a nested max-min solver for the complexity of local minima (and sub-extensive-index saddles), as we detail in Appendix A.3. Since this problem is absent for the complexity of all critical points, we instead consider there a simpler fixed-point solver for eq. (49). Nevertheless, we did encounter (at least) two fixed points of eq. (49) in a limited region of parameters, which we naturally associate to a phase transition in the complexity, see Section C.3, while we never observed such a phenomenon for minima and sub-extensive-index saddles. Investigating other forms of solvers for eq. (49), including ones closer to the nested max-min solvers we use for local minima and sub-extensive-index saddles, is thus a very natural direction, which we plan to tackle in the future.

A.3 From variational formulas to algorithms

We show here how we compute in practice the solution to the variational principles for the complexities, eqs. (24), (27) and (28).

A.3.1 Local minima and saddles of sub-extensive index

The algorithm – We focus on $\widetilde{\Sigma}_0(q, e)$ for concreteness, the case of saddles of sub-extensive index being very similar, and we comment on it afterwards. Recall eq. (24):
$$\begin{aligned} \widetilde{\Sigma}_0(q, e) = {} & \frac{-1 + (1 - 2\alpha) \log \alpha}{2} + \frac{1}{2} \log(1 - q^2) \\ & + \sup_{\substack{g > 0 \\ A > 0}}\ \inf_{\substack{\lambda_A, \lambda_c, \lambda_e \in \mathbb{R} \\ \lambda_h, \lambda_t, \lambda_\star \geq 0}} \Bigg\{ -\frac{1}{2} \log A + \alpha (\lambda_A A + \lambda_e e) - \log g - \frac{\lambda_t}{g} + \lambda_h \\ & + \alpha \log \int_{B_g} \mu_q(\mathrm{d}u)\, \exp\bigg[ -\lambda_c c_q(u) - \lambda_A A(u) - \lambda_e \ell(u) + \log(\alpha + F(u) g) - \frac{g + \lambda_t}{\alpha}\, t(u) \\ & \qquad + \frac{\lambda_t F(u)}{\alpha + F(u) g} - \lambda_h \left( \frac{F(u) g}{\alpha + F(u) g} \right)^2 + \frac{\lambda_\star}{\alpha} K_q(u) \bigg] \Bigg\}. \end{aligned} \tag{50}$$
As we emphasized in the derivation (and as can be checked easily), the inner infimum over the Lagrange multipliers $\Lambda := (\lambda_A, \lambda_c, \lambda_e, \lambda_h, \lambda_t, \lambda_\star)$ is convex, and can be solved efficiently for any value of $(A, g)$. In practice, we use a convex minimization procedure (the limited-memory BFGS method [LN89]) to solve it efficiently. We then apply another local maximization algorithm for the outer maximum in $(A, g)$, see Algorithm 1.

Gradients – Notice that in Step 1 of the algorithm, the gradient $G_\theta$ of the functional with respect to the Lagrange multiplier $\lambda_\theta$ (for $\theta \in \{A, c, e, t, h, \star\}$) is simply:
$$\begin{cases} G_A = \alpha \left[ A - \mathbb{E}_{\hat{\nu}}[A(u)] \right], \\ G_c = -\alpha\, \mathbb{E}_{\hat{\nu}}[c_q(u)], \\ G_e = \alpha \left[ e - \mathbb{E}_{\hat{\nu}}[\ell(u)] \right], \\ G_t = \displaystyle -\frac{1}{g} - \mathbb{E}_{\hat{\nu}}[t(u)] + \alpha\, \mathbb{E}_{\hat{\nu}} \left[ \frac{F(u)}{\alpha + F(u) g} \right], \\ G_h = \displaystyle 1 - \alpha\, \mathbb{E}_{\hat{\nu}} \left[ \left( \frac{F(u) g}{\alpha + F(u) g} \right)^2 \right], \\ G_\star = \mathbb{E}_{\hat{\nu}}[K_q(u)]. \end{cases} \tag{51}$$
The gradients with respect to $(A, g)$ are given in Algorithm 1. Notice that the condition imposed by $\lambda_h$ (eq. (41)) implies that if $u \sim \nu$, then $X := \alpha + g F(u) \geq 0$ either: (i) is not supported around $0$, or (ii) has a quickly decaying density around $0$ such that $\mathbb{E}[F^2 / (\alpha + g F)^2] \leq 1$. This implies that when differentiating the complexity functional in eq. (50) with respect to $g$ there is no contribution of the boundary term $\alpha\, \mathbb{E}_\nu[F(u)\, \delta(\alpha + g F(u))]$.
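To make the inner convex minimization concrete, here is a deliberately reduced sketch of it. It is not the authors' implementation: it keeps a single multiplier $\lambda_e$, replaces the two-dimensional integral over $\mu_q$ by a sum over a one-dimensional grid, and uses SciPy's `L-BFGS-B` routine (the bound-constrained variant of L-BFGS). The names `dual_and_grad` and `solve_inner`, and the toy choice $\ell(u) = u^2$, are assumptions of this example.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def dual_and_grad(lam, ell, log_mu, e, alpha):
    """Toy inner functional alpha * (lam * e + log sum_u mu(u) exp(-lam * ell(u)))."""
    log_w = log_mu - lam[0] * ell
    log_Z = logsumexp(log_w)
    nu_hat = np.exp(log_w - log_Z)             # tilted measure (discrete analogue of nu-hat)
    value = alpha * (lam[0] * e + log_Z)
    grad = alpha * (e - np.dot(nu_hat, ell))   # same structure as G_e in eq. (51)
    return value, np.array([grad])

def solve_inner(ell, mu, e, alpha):
    """Minimise the convex toy functional over lambda_e and return (lambda_e, nu_hat)."""
    log_mu = np.log(mu)
    res = minimize(dual_and_grad, x0=np.zeros(1), args=(ell, log_mu, e, alpha),
                   jac=True, method="L-BFGS-B",
                   options={"ftol": 1e-15, "gtol": 1e-10})
    log_w = log_mu - res.x[0] * ell
    return res.x[0], np.exp(log_w - logsumexp(log_w))

# Toy data: discretised standard Gaussian reference measure, ell(u) = u^2
u = np.linspace(-6.0, 6.0, 2001)
mu = np.exp(-u**2 / 2)
mu /= mu.sum()
lam_e, nu_hat = solve_inner(u**2, mu, e=0.5, alpha=2.0)
constraint_residual = np.dot(nu_hat, u**2) - 0.5   # vanishes at the minimiser
```

At the minimiser the gradient vanishes, so the tilted measure satisfies the constraint $\mathbb{E}_{\hat{\nu}}[\ell(u)] = e$; the full problem has the same structure with six multipliers and the sign constraints on $\lambda_h, \lambda_\star$ passed as bounds.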
In practice, we also use the L-BFGS algorithm for the outer maximization, re-parametrizing the functional in terms of $\log g$ and $\log A$ to ensure positivity of $A, g$. While such a change of variables may affect concavity properties, we never encountered different local maxima (when varying initial conditions), and it allows us to impose the positivity constraint in a straightforward manner. We emphasize, however, that these empirical observations do not constitute a formal convergence guarantee: nevertheless, within the scope of our experiments, the method proved sufficiently stable and efficient to explore the landscape of solutions of interest.

Inner integrals and computation time – The inner integrals appearing in the objective function are evaluated numerically using SciPy's adaptive quadrature routines. Since these two-dimensional integrals must be recomputed repeatedly during the optimization process, their evaluation is accelerated using Numba's just-in-time (JIT) compilation [LPS15]. In particular, the integrand functions are compiled to optimized machine code prior to the numerical integration, significantly reducing the per-evaluation computational cost. These numerically evaluated integrals are embedded within the outer maximization procedure used to solve eq. (50). Owing to the efficiency gains provided by JIT compilation and adaptive quadrature, the outer maximization converges rapidly, typically requiring on the order of ten iterations. Overall, the full solution of eq. (50) is typically obtained in around one minute on a standard laptop-class CPU.

Algorithm 1: An algorithm to compute $\widetilde{\Sigma}_0(q, e)$.

Initialize variables $(A^{(0)}, g^{(0)})$;
while not converged do

• Step 1: find $(\lambda_A^{(k)}, \lambda_c^{(k)}, \lambda_e^{(k)}, \lambda_t^{(k)}, \lambda_h^{(k)}, \lambda_\star^{(k)})$ minimizing the convex functional
$$\begin{aligned} & \alpha (\lambda_A A^{(k)} + \lambda_e e) - \frac{\lambda_t}{g^{(k)}} + \lambda_h + \alpha \log \int \mu_q(\mathrm{d}u)\, \exp\bigg[ -\lambda_c c_q(u) - \lambda_A A(u) - \lambda_e \ell(u) + \log(\alpha + F(u) g^{(k)}) \\ & \qquad - \frac{g^{(k)} + \lambda_t}{\alpha}\, t(u) + \frac{\lambda_t F(u)}{\alpha + F(u) g^{(k)}} - \lambda_h \left( \frac{F(u) g^{(k)}}{\alpha + F(u) g^{(k)}} \right)^2 + \frac{\lambda_\star}{\alpha} K_q(u) \bigg] \end{aligned}$$
over $(\lambda_A, \lambda_c, \lambda_e, \lambda_t) \in \mathbb{R}$ and $(\lambda_h, \lambda_\star) \geq 0$.

• Step 2: update $(A^{(k)}, g^{(k)})$, using e.g. a projected gradient ascent algorithm or the L-BFGS algorithm. The gradient $(L_A, L_g)$ of the complexity functional in eq. (50) is
$$\begin{cases} L_A = \displaystyle -\frac{1}{2A} + \alpha \lambda_A, \\[2ex] L_g = \displaystyle -\frac{1}{g} + \frac{\lambda_t^{(k)}}{g^2} - \mathbb{E}_{\hat{\nu}^{(k)}}[t(u)] + \alpha\, \mathbb{E}_{\hat{\nu}^{(k)}} \left[ \frac{F(u)}{\alpha + g F(u)} \right] - \alpha \lambda_t^{(k)}\, \mathbb{E}_{\hat{\nu}^{(k)}} \left[ \frac{F(u)^2}{(\alpha + g F(u))^2} \right] - 2 \alpha^2 g \lambda_h^{(k)}\, \mathbb{E}_{\hat{\nu}^{(k)}} \left[ \frac{F(u)^2}{(\alpha + g F(u))^3} \right]. \end{cases}$$
Here $\hat{\nu}^{(k)}$ is the estimate of the joint law of the labels (eq. (25)) with the current estimate of the Lagrange multipliers and the given values of $(A, g)$.

$k = k + 1$;
end

Saddles of sub-extensive index – The computation of the complexity $\Sigma_{\mathrm{sub.}}(q, e)$ of sub-extensive-index saddles (eq. (27)) is extremely similar to the above: one simply imposes $\lambda_\star = 0$ in Algorithm 1.

A.3.2 Critical points

On the other hand, for the complexity of all critical points $\Sigma_{\mathrm{tot.}}(q, e)$ (eq. (28)), we simply use an iterative procedure, as we discussed in Section A.2.2. Each step of the procedure proceeds as follows. We denote by $\hat{\nu}^{(k)}$ the measure $\nu_{\mathrm{tot.}}$ in eq. (29) with the current estimate of the variables at iteration $k$.

(i) Update $\displaystyle g^{(k+1)} = -\left[ \mathbb{E}_{\hat{\nu}^{(k)}}[t(u)] + i\varepsilon - \alpha\, \mathbb{E}_{\hat{\nu}^{(k)}} \left[ \frac{F(u)}{\alpha + g^{(k)} F(u)} \right] \right]^{-1}$.

(ii) Update $A^{(k+1)} = \mathbb{E}_{\hat{\nu}^{(k)}}[A(u)]$.

(iii) Find $(\lambda_c^{(k+1)}, \lambda_e^{(k+1)})$ satisfying the constraints
$$\begin{cases} \mathbb{E}_{\hat{\nu}[\lambda_c^{(k+1)}, \lambda_e^{(k+1)}]}(c_q(u)) = 0, \\ \mathbb{E}_{\hat{\nu}[\lambda_c^{(k+1)}, \lambda_e^{(k+1)}]}(\ell(u)) = e. \end{cases}$$

(iv) Update $\lambda_A^{(k+1)} = 1 / (2 \alpha A^{(k+1)})$.

These equations correspond to iterations aiming to find a zero of the gradient of the functional with respect to (respectively) $g, \lambda_A, \lambda_c, \lambda_e, A$, and are similar to the ones stated in [MBB20] in a simpler model. Notice that in the penultimate step we actually require $(\lambda_c^{(k+1)}, \lambda_e^{(k+1)})$ to be exact solutions to the constraints (which may require several iterations). As we saw, this is a convex optimization problem and does not cause any numerical issues. As an alternative to (iv), one can also update $\lambda_A$ together with these other parameters (leading to a benign 3-dimensional convex optimization problem): we did not find it to affect convergence properties, but it slightly increased the time per iteration. Although this iterative scheme is not derived from a local optimization principle (since the extremum in eq. (28) is not in the form of a sup-inf), and thus remains somewhat heuristic, we found it to possess good convergence properties. Using procedures similar to the above for the computation of the two-dimensional integrals, we found the scheme to typically converge in a few hundred to a few thousand iterations, which can take from a few seconds to a few minutes on a standard laptop-class CPU, depending on the parameters of the problem and the required stopping tolerance.

A.3.3 Further details and remarks

We give here a few other remarks regarding the implementation of Algorithm 1 (and its counterpart for sub-extensive-index saddles).

• Consider the gradient $L_g$ of the complexity functional given in Algorithm 1. At a global maximizer $(A, g)$, we have $L_g = 0$. However, the constraint of eq. (42) then imposes:
$$\lambda_t \left( \frac{1}{g^2} - \alpha\, \mathbb{E}_{\hat{\nu}} \left[ \frac{F(u)^2}{(\alpha + g F(u))^2} \right] \right) - 2 \alpha^2 g \lambda_h\, \mathbb{E}_{\hat{\nu}} \left[ \frac{F(u)^2}{(\alpha + g F(u))^3} \right] = 0. \tag{52}$$
One can separate two cases:

(i) Either the constraint of eq. (41) is not saturated (i.e. the left-hand side is strictly smaller than $1$): then the KKT multiplier $\lambda_h = 0$. By eq. (52) we then reach that $\lambda_t = 0$. In this setting the positivity constraint on the Hessian's density in the supremum over the law $\nu$ of the labels is not saturated at the maximum.

(ii) Or the constraint of eq. (41) is saturated: the prefactor of $\lambda_t$ in eq. (52) then vanishes, so eq. (52) implies $\lambda_h = 0$.

We therefore always have $\lambda_h = 0$ at the optimal value of the variational principle. Notice that the presence of the KKT multiplier $\lambda_h \geq 0$ during the optimization process is still very important: as we saw, it allows e.g. to remove spurious local maxima associated to the second branch of solutions to the Marchenko-Pastur equation.

• According to the remark above, at a global maximizer $(A, g)$, if there exists $u \in \mathbb{R}^2$ such that $\alpha + g F(u) < 0$, then we must have $\lambda_t \geq 0$ at this value of $(A, g)$, in order for the density of $X = \alpha + g F(u)$ (for $u \sim \nu$) to decay as $X \downarrow 0$. Similarly, at a maximizer, we also have $L_A = 0$, and thus $\lambda_A = (2 \alpha A)^{-1} > 0$. In practice, we impose $\lambda_A \geq 0$ when computing Step 1 in Algorithm 1, and $\lambda_t \geq 0$ when there exists $u$ with $\alpha + g F(u) < 0$, as we found it to greatly improve numerical stability. Moreover, by convexity, these additional constraints do not affect the solution to the variational problem as long as they are not saturated at the optimum (i.e. $\lambda_A > 0$, $\lambda_t > 0$), which we always found to hold.

B Derivation of the BBP transition condition

We derive here the condition for the “BBP transition” in the Hessian's spectrum, i.e. the appearance of an isolated eigenvalue to the left of the density's bulk, whose eigenvector is positively correlated to the signal $\theta^\star$.
We assume that we are given a law $\nu(y, y^\star)$, corresponding to the limiting empirical law of the labels on the type of points we are considering (e.g. saddles of finite index, or critical points at a given loss value). We also assume we are given an overlap value $q \in [0, 1)$: in this regard, this computation generalizes the derivation of [BBC25] that holds at $q = 0$.

Let us recall some notations: $q = \theta \cdot \theta^\star$ and $\|\theta\| = \|\theta^\star\| = 1$. The data vectors are $x_1, \cdots, x_n \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, I_d)$, and we denote $y_i^\star := x_i \cdot \theta^\star$ and $y_i := x_i \cdot \theta$. Recall $n/d \to \alpha > 1$. Up to an additive shift (which is irrelevant in terms of the transition for an isolated eigenvalue), the Hessian is (see eq. (9)):
$$H = \frac{1}{n} \sum_{i=1}^n \partial_1^2 \ell(y_i, y_i^\star)\, (P_\theta^\perp x_i)(P_\theta^\perp x_i)^\top, \tag{53}$$
where $P_\theta^\perp$ is the orthogonal projection on $\{\theta\}^\perp$. Without loss of generality, by rotational invariance of the Gaussian distribution we can assume that
$$\begin{cases} \theta^\star = (1, 0, \cdots, 0), \\ \theta = (q, \sqrt{1 - q^2}, 0, \cdots, 0). \end{cases} \tag{54}$$
Notice that we can decompose $x_i$ as
$$x_i = \left( y_i^\star,\ \frac{y_i - q y_i^\star}{\sqrt{1 - q^2}},\ 0, \cdots, 0 \right) + v_i, \tag{55}$$
where $v_i$ is a standard Gaussian vector in $\{\theta^\star, \theta\}^\perp$, independent of $(y_i, y_i^\star)$. Moreover,
$$z_i := P_\theta^\perp x_i = \frac{y_i^\star - q y_i}{\sqrt{1 - q^2}}\, \widetilde{w} + v_i, \tag{56}$$
where
$$\widetilde{w} := \frac{P_\theta^\perp(\theta^\star)}{\| P_\theta^\perp(\theta^\star) \|} = \left( \sqrt{1 - q^2},\ -q,\ 0, \cdots, 0 \right) \in \{\theta\}^\perp. \tag{57}$$
Let $G(z) := (H - z I_d)^{-1}$ be the resolvent of $H$, and $g_d(z) := (1/d)\, \mathrm{Tr}\, G(z)$ its Stieltjes transform. Using the relation $(H - z I_d) G = I_d$, we get the equation
$$-z\, g_d(z) + \frac{1}{n} \sum_{i=1}^n \partial_1^2 \ell(y_i, y_i^\star)\, \frac{z_i^\top G z_i}{d} = 1. \tag{58}$$

The bulk – The computation of the bulk of the Hessian is classical [MP67; SB95], and can be obtained by the cavity method from eq. (58).
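As a quick finite-size sanity check of the geometry in eqs. (54)–(57) and of the exact resolvent identity of eq. (58), one can run the following sketch. It is only an illustration: the weights `f` are a hypothetical stand-in for $\partial_1^2 \ell(y_i, y_i^\star)$ (any weights satisfy eq. (58)), and the sizes `d`, `n` and the point `z` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, q = 50, 150, 0.3

# Frame of eq. (54) and unit vector w-tilde of eq. (57)
theta_star = np.zeros(d); theta_star[0] = 1.0
theta = np.zeros(d); theta[0], theta[1] = q, np.sqrt(1 - q**2)
w = np.zeros(d); w[0], w[1] = np.sqrt(1 - q**2), -q

X = rng.standard_normal((n, d))          # rows are the data vectors x_i
y, y_star = X @ theta, X @ theta_star
P = np.eye(d) - np.outer(theta, theta)   # orthogonal projector on {theta}^perp
Z = X @ P                                # rows are z_i = P x_i (P is symmetric)

# eq. (56): z_i = coef_i * w + v_i, with v_i orthogonal to theta and theta_star
coef = (y_star - q * y) / np.sqrt(1 - q**2)
V = Z - np.outer(coef, w)
err_56 = max(np.abs(Z @ w - coef).max(),
             np.abs(V @ theta).max(),
             np.abs(V @ theta_star).max())

# eq. (58), with hypothetical weights f_i in place of the second derivative of the loss
f = y**2
H = (Z.T * f) @ Z / n                    # eq. (53)
z = 0.5 + 1.0j
G = np.linalg.inv(H - z * np.eye(d))     # resolvent
g_d = np.trace(G) / d
quad_forms = np.einsum("ij,jk,ik->i", Z, G, Z)   # z_i^T G z_i for each i
lhs_58 = -z * g_d + np.sum(f * quad_forms) / (n * d)
```

Both `err_56` and `abs(lhs_58 - 1)` are at the level of floating-point rounding, since eqs. (56) and (58) are exact identities at any finite $d$; the asymptotic content of the derivation only enters in the steps that follow.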
One finds that the limiting Stieltjes transform $g(z) = \lim_{d \to \infty} g_d(z)$ of the limiting spectral density of $H$ (denoted $\sigma(x)$) satisfies, for any $z \in \mathbb{C}_+$, and assuming the empirical law of $(y_i, y_i^\star)$ converges to $\nu$:
$$z = -\frac{1}{g(z)} + \alpha\, \mathbb{E}_{(y, y^\star) \sim \nu} \left[ \frac{f(y, y^\star)}{\alpha + g(z) f(y, y^\star)} \right], \tag{59}$$
with $f(y, y^\star) := \partial_1^2 \ell(y, y^\star)$. Eq. (59) is sometimes called the Marchenko-Pastur equation. Recall that the density $\sigma(x)$ is directly related to the asymptotic spectral density $\rho(w)$ of the spherical Hessian of the empirical loss, with a simple additive shift: $\sigma(x) = \rho(x - t(\nu))$, with $t(\nu)$ defined in eq. (19).

The left edge of the bulk $x_{\min}$ is also obtained in a classical fashion, see Appendix A.1. Essentially, under mild conditions on the behavior of the law of $f(y, y^\star)$ near its minimal value (for $(y, y^\star) \sim \nu$), the Stieltjes transform $g_{\min} = g(x_{\min})$ of the left edge of the bulk is the positive solution to the equation
$$1 = \alpha\, \mathbb{E}_{y, y^\star} \left[ \left( \frac{g_{\min} f(y, y^\star)}{\alpha + g_{\min} f(y, y^\star)} \right)^2 \right]. \tag{60}$$
One can then obtain the bottom edge $x_{\min}$ by plugging $g_{\min} = g(x_{\min})$ into eq. (59).

The outlier – We follow a computation close to the one of [BBC25]. Notice that the Hessian lives in the tangent space to $\theta$: therefore, an outlier correlated with $\theta^\star$ actually means an outlier in the direction $\widetilde{w}$ defined in eq. (57). We denote $\widetilde{g} := \widetilde{w}^\top G(z)\, \widetilde{w}$. In a similar fashion to eq. (58), we get
$$-z\, \widetilde{g} + \frac{1}{n} \sum_{i=1}^n f(y_i, y_i^\star)\, (\widetilde{w}^\top z_i) \cdot (z_i^\top G \widetilde{w}) = 1. \tag{61}$$
From eq. (56), we have $\widetilde{w}^\top z_i = (y_i^\star - q y_i) / \sqrt{1 - q^2}$. Further, denoting
$$G_{-i} := \left( \frac{1}{n} \sum_{j (\neq i)} f(y_j, y_j^\star)\, z_j z_j^\top - z I_d \right)^{-1}, \tag{62}$$
we get using the Sherman-Morrison formula:
$$z_i^\top G \widetilde{w} = \frac{z_i^\top G_{-i} \widetilde{w}}{1 + \frac{1}{n} f(y_i, y_i^\star)\, z_i^\top G_{-i} z_i}. \tag{63}$$
We have $z_i^\top G_{-i} z_i \simeq d\, g(z)$ at leading order as $d \to \infty$ (since $v_i$ and $G_{-i}$ are independent, and the other term of $z_i$ in eq. (56) only contributes at sub-leading order). Moreover, again as $d \to \infty$:
$$z_i^\top G_{-i} \widetilde{w} = \frac{y_i^\star - q y_i}{\sqrt{1 - q^2}}\, \underbrace{\widetilde{w}^\top G_{-i} \widetilde{w}}_{\simeq\, \widetilde{g}} + v_i^\top G_{-i} \widetilde{w}. \tag{64}$$
Plugging eqs. (63) and (64) into eq. (61), we get
$$-z\, \widetilde{g} + \frac{1}{n} \sum_{i=1}^n \frac{f(y_i, y_i^\star)}{1 + \frac{g(z)}{\alpha} f(y_i, y_i^\star)}\, \frac{y_i^\star - q y_i}{\sqrt{1 - q^2}} \left[ \frac{y_i^\star - q y_i}{\sqrt{1 - q^2}}\, \widetilde{g} + v_i^\top G_{-i} \widetilde{w} \right] = 1 + o_d(1). \tag{65}$$
Notice that $v_i$ is a random Gaussian vector, independent of $(y_i, y_i^\star)$ and of $G_{-i}$. Taking the expectation in eq. (65) with respect to $v_i$ (again, we assume concentration of $g, \widetilde{g}$ as $d \to \infty$) yields that the last term in eq. (65) only contributes at sub-leading order as $d \to \infty$. In the end we get, taking $d \to \infty$:
$$-\frac{1}{\widetilde{g}} = z - \frac{\alpha}{1 - q^2}\, \mathbb{E}_{(y, y^\star) \sim \nu} \left[ \frac{(y^\star - q y)^2 f(y, y^\star)}{\alpha + g(z) f(y, y^\star)} \right]. \tag{66}$$
The presence of an outlier correlated with $\widetilde{w}$ is signaled by the singularity of the resolvent, i.e. the divergence of $|\widetilde{g}|$ (as argued in [BBC25]). Using this criterion in eq. (66), we reach that the equation satisfied by an outlier eigenvalue $x^\star$ is
$$x^\star = \frac{\alpha}{1 - q^2}\, \mathbb{E}_{(y, y^\star) \sim \nu} \left[ \frac{(y^\star - q y)^2 f(y, y^\star)}{\alpha + g(x^\star) f(y, y^\star)} \right], \tag{67}$$
where $g(x^\star)$, the limiting density's Stieltjes transform, can be computed by solving eq. (59).

Conclusion – For a given estimate of the asymptotic empirical law of $(y_i, y_i^\star)_{i=1}^n$, and given values of $(\alpha, q)$, the threshold $\alpha_{\mathrm{BBP}}$ is given by the smallest value of $\alpha$ such that $x^\star = x_{\min}$. In particular, at this point we have, with $g_{\min}$ the solution of eq. (60), that $x_{\min} = x_2$, with
$$\begin{cases} x_{\min} := \displaystyle -\frac{1}{g_{\min}} + \alpha\, \mathbb{E}_{y, y^\star} \left[ \frac{f(y, y^\star)}{\alpha + g_{\min} f(y, y^\star)} \right], \\[2ex] x_2 := \displaystyle \frac{\alpha}{1 - q^2}\, \mathbb{E}_{y, y^\star} \left[ \frac{(y^\star - q y)^2 f(y, y^\star)}{\alpha + g_{\min} f(y, y^\star)} \right]. \end{cases} \tag{68}$$
The equality of the two terms in eq.
(68) is what we numerically evaluate to probe the BBP transition, see Section C.4. Notice that generically $x_2 \neq x^\star$ (because of the different Stieltjes transform used in the denominator): they only coincide at the BBP transition point.

C Further exploration of the phase retrieval landscape

In this appendix, we present a deeper exploration of the landscape of phase retrieval using the Kac-Rice formalism, of which we showed some essential results in Section 2:

• In Section C.1, we show the generalization of the topological phase diagram (Fig. 5) to values of $a \in \{0.1, 1.0\}$, where the same phenomenology remains.

• In Section C.2, we uncover loss values below which all local minima are located in a band with positive correlation with the signal $\theta^\star$, a phenomenon previously observed in Gaussian landscapes [Ros+19].

• In Section C.3, we describe a first-order phase transition observed for the complexity of all critical points.

• Finally, in Section C.4, we provide examples of the computation of the BBP transition in the “BBP-KR” formalism.

C.1 The phase diagram for other values of a

In Fig. 9 we generalize the topological phase diagram of Fig. 5 to other values of $a \in \{0.1, 1.0\}$. The same phenomenology is retained: we simply observe that the sample complexity thresholds for trivialization of different types of critical points, and for the BBP transitions, are generically larger as $a$ increases.

[Figure 9, two panels in the $(\alpha, q)$ plane: (a) the phase diagram for $a = 0.1$; (b) the phase diagram for $a = 1.0$. Curves: trivialization of all critical points; trivialization of sub-extensive-index saddles; trivialization of minima (Kac-Rice upper bound); BBP-KR instability (lowest-energy minima); BBP-KR instability (typical minima).]

Figure 9: Phase diagram predicted by the Kac-Rice method, for $a \in \{0.1, 1.0\}$. The conventions and colors are the same as in Fig. 5 in the main text.

C.2 Appearance of local minima at high overlap

We did not find evidence in the annealed complexity for the emergence of minima at high overlap with the signal when there are no local minima at $q = 0$: we always found the complexities $\Sigma(q)$ to decrease monotonically with $q$, contrary e.g. to what happens in some Gaussian models of random landscapes [Ros+19]. On the other hand, we did exhibit examples where the complexity $\Sigma(q, e)$ has a non-monotonic behavior with $q$, if the loss (or energy) value $e$ is small enough. We illustrate it in Fig. 10: we consider $a = 1.0$ and $\alpha = 3.0$. At this value the complexity $\widetilde{\Sigma}_0(q = 0) > 0$, see Fig. 9b, and the Kac-Rice formula predicts that local minima have a typical energy $e^\star(q = 0)$. We then fix $e = e^\star(q = 0)/2$, and show the function $q \mapsto \Sigma(q, e)$.

[Figure 10: plot of $\widetilde{\Sigma}_0(q, e = e^\star/2)$ as a function of $q$.]

Figure 10: For $a = 1.0$ and $\alpha = 3.0$, the annealed complexity $\widetilde{\Sigma}_0(q, e)$ as a function of $q$, for $e = e^\star(q = 0)/2 \simeq 0.048$, half of the loss value of typical minima. In red we show the overlap band of positive complexity.

Interestingly, the annealed complexity predicts that for these low loss values (i.e. further down in the landscape), local minima are located at high overlap, a phenomenon which is not visible when counting all minima: in this case, the most numerous are always located at the equator (i.e. at $q = 0$).

C.3 A phase transition for the complexity of all critical points

In Fig.
11, we display the evolution of the total complexity $\Sigma_{\mathrm{tot.}}$, the energy and the total number of steps as functions of $\alpha$ for two starting points used to solve the equations numerically, either $\alpha = 3.0$ or $\alpha = 3.5$. It clearly shows that the iterative algorithm for critical points (eq. (28)) might have more than one fixed point, as we briefly pointed out in Section 2.5.5. Indeed, starting from $\alpha = 3.0$, the algorithm stays on a low-energy branch with decreasing complexity until $\alpha \approx 3.4$, where it jumps to the high-energy branch; this jump is associated with a critical slowing down, visible in the total number of steps shown on the right. This suggests the existence of a first-order phase transition for the complexity of all critical points. However, in our investigations we did not find any numerical evidence of the presence of multiple local maxima in $(g, A)$ of eqs. (24) and (27), our formulas for the complexity of local minima and sub-extensive-index saddles.

[Figure 11, three panels as functions of $\alpha \in [3.0, 3.5]$: $\Sigma_{\mathrm{tot.}}$, energy / loss, and $T_{\max}$, each for the two curves “Starting from $\alpha = 3.0$” and “Starting from $\alpha = 3.5$”.]

Figure 11: Evolution of (left) the total complexity $\Sigma_{\mathrm{tot.}}$, (middle) the energy, and (right) total number of steps as functions of $\alpha$ for two starting points: $\alpha \in \{3.0, 3.5\}$.

C.4 The BBP-KR instability

In Fig. 12 we show an example of a computation of the BBP transition, whose theory is given in Section B. We plot $d(\alpha) := w_{\min} - w_2 = x_{\min} - x_2$: we see it crosses zero at a finite $\alpha$, which is larger as we consider larger-energy minima. Recall that $w_{\min} = x_{\min} - t(\nu)$ is really the left edge of the Hessian's asymptotic spectral density. We emphasize in particular the top-right plot in Fig. 12: it shows that despite the appearance of the BBP instability, the bulk of the Hessian touches $0$ for all values of $\alpha$ considered.

[Figure 12, titled “BBP instability for local minima: a = 1.0, q = 0.0”: $d(\alpha) = w_{\min} - w_2$, $w_{\min}$ (in units of $10^{-5}$) and $w_2$ as functions of $\alpha$, for typical-energy, lowest-energy and highest-energy minima.]

Figure 12: For $a = 1.0$ and $q = 0.0$, we show the computation of the BBP transition point for typical-energy, lowest-energy and highest-energy minima.

D Extension of the dynamical comparison to a = 1.0

While the main text focuses on comparing results from gradient descent dynamics at fixed loss normalization $a = 0.01$ in eq. (3), the present appendix reproduces the comparison from Section 3.2 of the minima properties with the Kac-Rice prediction for $a = 1$. In Fig. 13 we show the comparison of the predicted and observed energies for the minima in the landscape, showing again a very good agreement with the typical energy predicted by the annealed Kac-Rice computation. In Fig. 14, we display the comparison at fixed $\alpha \in \{3.5, 4.5\}$ of the spectral properties of the minima, $F(u)$, and the signal-label joint probability distribution $\nu(y, y^\star)$. Again, the predictions at the typical energy fit the shape of all the distributions remarkably well, in great detail.

A word about the BBP transition for a = 1 – For $q = 0$, our theory predicts that $\alpha^{(\mathrm{typ})}_{\mathrm{BBP}} = 5.94$, whereas our $d = 512$ simulations have their transition in the interval $[4.15, 5.20]$. However, we observe that the transition line for the success rate at $d = 512$ for this value of $a$ is not sharp. Our definition for the critical $\alpha$ might not be suitable for this case.
As an example, the more involved finite-size analysis from [BBC25] reports a transition at $\alpha \approx 5.55$, much closer to our annealed prediction. We leave for a future version of this work the computationally-intensive task of evaluating the full phase diagram, similar to Fig. 8b, for $a = 1$.

[Figure 13, three panels of $\langle e \rangle$ as a function of $\alpha$, at $q = 0.0$, $q = 0.1$ and $q = 0.2$. Legend: typical energy, lowest energy, highest energy, experiments ($d = 512$).]

Figure 13: Evolution of the average energy $\langle e \rangle$ of the minima for $a = 1$ with $\alpha$ at fixed (left) $q = 0.0$, (middle) $q = 0.1$, and (right) $q = 0.2$ obtained from the experiments (in black) and the band of energy predicted by the annealed Kac-Rice (in shaded red).

[Figure 14, two rows of panels: (a) for $\alpha = 3.5$; (b) for $\alpha = 4.5$. Each row shows $\rho(\lambda)$ versus $\lambda$, $p(F)$ versus $F(u)$, and the joint distributions $\nu(y, y^\star)$.]

Figure 14: Comparisons of the predicted Kac-Rice properties for the minima at $q = 0.0$ and typical energy $e^\star$ and the empirical minima found by the gradient descent dynamics at $d = 512$ and $a = 1$. We compare: (left) the eigenvalue distribution of the Hessian $\rho(\lambda)$, (middle) the distribution of the Hessian weights $F(u) = \partial_1^2 \ell(y, y^\star)$, and (right) the joint label distributions $\nu(y, y^\star)$.
In red are the Kac-Rice predictions described in Section 2.
