Generalized Random Direction Newton Algorithms for Stochastic Optimization



Soumen Pachal, Indian Institute of Technology Madras, cs22d009@smail.iitm.ac.in
Prashanth L. A., Indian Institute of Technology Madras, prashla@cse.iitm.ac.in
Shalabh Bhatnagar, Indian Institute of Science, Bangalore, shalabh@iisc.ac.in
Avinash Achar, TCS Research, Chennai, India, achar.avinash@tcs.com

Abstract

We present a family of generalized Hessian estimators of the objective using random direction stochastic approximation (RDSA), utilizing only noisy function measurements. The form of each estimator and the order of its bias depend on the number of function measurements. In particular, we demonstrate that estimators with more function measurements exhibit lower-order estimation bias. We show the asymptotic unbiasedness of the estimators. We also perform asymptotic and non-asymptotic convergence analyses for stochastic Newton methods that incorporate our generalized Hessian estimators. Finally, we perform numerical experiments to validate our theoretical findings.

Keywords: Stochastic optimization, stochastic approximation, random direction stochastic approximation (RDSA), Newton-based algorithms, gradient and Hessian estimation.

1 Introduction

Optimization under uncertainty has been studied comprehensively in many engineering applications, including neural networks, reinforcement learning, finance, operations research, energy systems, and signal processing, to name a few. Uncertainty may arise due to randomness in the data or the environment. Stochastic optimization is a mathematical tool for solving such optimization problems under uncertainty, and it is the framework we adopt in this paper. The primary goal is to find the minima of an objective whose analytical form is not known; instead, noisy function measurements, or samples, are available. The noise-corrupted measurements are accessible either through an oracle or through real data. Such a setting is usually referred to as zeroth-order stochastic optimization. Here, the goal is to find the minima of a real-valued function $F : \mathbb{R}^d \to \mathbb{R}$. More formally, our aim reduces to finding a parameter θ* using only noisy measurements of F, i.e.,
$$\theta^* \in \operatorname{argmin}_{\theta \in \mathbb{R}^d} F(\theta). \qquad (1)$$
A general stochastic approximation (SA) algorithm was first proposed by Robbins and Monro (RM) [26] for finding the zeros of a function using noisy measurements. Under some conditions, the RM algorithm converges to the root of the function in an almost sure sense. This method can also be applied to gradient search, where only noisy measurements of the gradient are available. Infinitesimal perturbation analysis (IPA) [10, 13] and likelihood ratio (LR) [18, 27] are gradient search methods of this kind that require only one function measurement. The SA algorithm finds several applications, including finding the minima of an objective function [3, 23], finding the fixed point of a function [23], reinforcement learning [31], neural networks [12], fine-tuning large language models [19], and many more.
The update rule of a stochastic gradient search algorithm is the following:
$$\theta(n+1) = \theta(n) - a(n)\,\widehat{\nabla}F(\theta(n)), \qquad (2)$$
where a(n), n ≥ 0, are step sizes that satisfy the stochastic approximation conditions, and $\widehat{\nabla}F(\theta(n))$ is an estimate of the true gradient ∇F(θ(n)) of the objective function F. The two-sided (respectively, one-sided) finite difference stochastic approximation (FDSA) or Kiefer-Wolfowitz (K-W) scheme [15] is a gradient estimation method that requires 2d (respectively, d+1) function measurements per iteration. For large d, this method is computationally expensive, as it perturbs along each coordinate direction one at a time. To mitigate this dimension dependency, random direction stochastic approximation (RDSA) was proposed in [16], [17], in which perturbation variables that are uniformly distributed over the surface of the unit sphere in $\mathbb{R}^d$ are employed. This method requires only two function measurements per iteration. Many random perturbation-based methods have been studied in the literature, including smoothed functional (SF) schemes [14], [5], RDSA with uniform and asymmetric Bernoulli perturbations [25], simultaneous perturbation stochastic approximation (SPSA) [28], and many more. All these methods deal with continuous variables. Discrete simultaneous perturbation stochastic approximation (DSPSA) was proposed in [35] to handle optimization over discrete variables. Recently, a general SPSA-based method, MSPSA [34], has also been proposed for mixed variables that are both discrete and continuous. The key idea behind these approaches is to reduce the number of function measurements needed to obtain zeroth-order estimates of the gradient and higher-order derivatives, while achieving a desired level of accuracy. Among the most popular methods in the class of random perturbation approaches are SPSA and SF, as they are easy to implement, exhibit promising performance, and normally require just two function measurements per iteration. In order to generate the two perturbed parameters at each iterate for which the simulations are run, SPSA methods randomly perturb all the parameter components simultaneously using symmetric ±1-valued Bernoulli random variables. A family of generalized simultaneous perturbation stochastic approximation (GSPSA) gradient estimators has recently been proposed in [22] that helps to significantly reduce the bias at the cost of a few more function measurements. Here, given a desired level of accuracy, one can construct a zeroth-order gradient estimator with the aforementioned few extra function measurements that achieves that level of accuracy. Newton-based simultaneous perturbation algorithms have also been proposed in the literature; see [29], [4], [5], [30], [25], and [24]. In practice, it has been observed that stochastic gradient algorithms have many limitations, such as (i) slow convergence near optima, (ii) not being scale invariant, and (iii) being sensitive to the choice of the initial learning rate. Newton algorithms would in general require knowledge of the minimum eigenvalue of the Hessian for the optimal convergence rate (see [25], [8]). In stochastic optimization settings, the minimum eigenvalue is typically not known. Stochastic Newton methods mitigate this eigenvalue dependence.
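Before turning to Newton methods, the following minimal Python sketch illustrates the gradient-only update (2) driven by a two-measurement random-direction gradient estimate. The Gaussian perturbation, the step-size and perturbation schedules, and the noisy quadratic oracle are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def rdsa_gradient_estimate(f, theta, delta, rng):
    """Two-measurement random-direction gradient estimate of f at theta.

    A standard Gaussian direction is used here for illustration; other
    perturbation distributions (unit sphere, uniform, Bernoulli) also fit
    the update rule (2)."""
    Delta = rng.standard_normal(theta.shape[0])
    # one-sided (unbalanced) finite difference along the random direction
    return Delta * (f(theta + delta * Delta) - f(theta)) / delta

def stochastic_gradient_search(f, theta0, n_iter=5000, seed=0):
    """Gradient search (2) with RDSA gradient estimates and simple schedules."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for n in range(1, n_iter + 1):
        a_n = 1.0 / n              # step size a(n), illustrative schedule
        delta_n = 1.0 / n ** 0.25  # perturbation constant delta(n), illustrative
        theta -= a_n * rdsa_gradient_estimate(f, theta, delta_n, rng)
    return theta

if __name__ == "__main__":
    noise_rng = np.random.default_rng(1)
    f = lambda th: np.sum(th ** 2) + 0.01 * noise_rng.standard_normal()  # noisy oracle
    print(stochastic_gradient_search(f, np.ones(5)))
```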
Stochastic Newton methods require the computation of the Hessian in addition to the gradient, and perform the following iterative update:
$$\theta(n+1) = \theta(n) - a(n)\,\Theta\!\left(\bar H_n\right)^{-1}\widehat{\nabla}F(\theta(n)), \qquad (3)$$
where $\widehat{\nabla}F(\theta(n))$ is an estimate of the true gradient ∇F(θ(n)), a(n), n ≥ 0, are the step sizes, $\bar H_n$ is an estimate of the true Hessian ∇²F(θ), and Θ is an operator that projects onto the set of positive definite matrices. The projection is needed to ensure that the algorithm proceeds along a descent direction at each step. We consider an RDSA-based adaptive Newton method similar to [25]. In this paper, our goal is to generalize the RDSA-based Hessian estimation method of [25] to achieve a lower-order bias, at the cost of a slight increase in the number of function measurements used in the Hessian estimator. A similar generalization for gradient estimation was carried out recently in [22], where a family of generalized simultaneous perturbation-based gradient estimators, including SPSA, SF, and RDSA, was proposed to achieve a reduced bias. We now briefly describe our key contributions below.

1. Generalized Hessian estimators: We present a family of generalized RDSA-based Hessian estimators obtained from noisy function measurements. The estimators differ from one another in the required number of function measurements per iteration. Estimators with more function measurements result in a bias of reduced order.

2. Estimation bounds: We provide bounds on the bias and variance of our proposed Hessian estimators under standard assumptions. These bounds show that our proposed family of Hessian estimators has lower bias than 2SPSA [29], 2RDSA [25], and other simultaneous perturbation-based Hessian estimators in the literature.

3. Stationary convergence: For a stochastic Newton method that incorporates our generalized Hessian estimator and the generalized gradient estimator from [22], we establish asymptotic convergence to the set of first-order stationary points (FOSPs), i.e., points where the gradient of the objective vanishes.

4. Escaping saddle points: At second-order stationary points (SOSPs), the gradient vanishes and the Hessian is positive semi-definite. Such points coincide with local minima if all saddle points are strict, cf. [23, Chapter 7]. We analyze a zeroth-order variant of the well-known cubic-regularized Newton method that incorporates our generalized Hessian estimator in conjunction with the gradient estimator from [22]. We establish a non-asymptotic bound that quantifies the convergence of the zeroth-order cubic Newton algorithm to an ϵ-SOSP, which is an approximation to an SOSP (see Section 4 for details). For a given ϵ > 0, there exists a δ > 0 (the perturbation parameter) such that, for gradient and Hessian estimators with a bias bound of O(δ^k), the non-asymptotic bound on the sample complexity is $O(\epsilon^{-(7/2 + 2/k)})$. The latter gradient/Hessian estimation bounds are achievable with generalized estimators formed using 2k+1 function measurements. To the best of our knowledge, we are the first to provide a non-asymptotic bound for a zeroth-order stochastic Newton algorithm with gradient/Hessian estimators formed using the simultaneous perturbation method.

5. Experiments: We show the results of simulation experiments on the Rastrigin objective.
Our numerical results show that the Hessian estimators with more function measurements provide better performance for a given simulation budget.

Related work. We now compare our proposed method with closely related existing approaches for Hessian estimation. In [9], the classical FDSA-based Hessian estimation method was proposed; this scheme requires O(d²) function measurements, making it computationally expensive in higher dimensions. To overcome this issue, a simultaneous perturbation-based Hessian estimation method, in the spirit of [28], was proposed in [29]. The key advantage of this approach is that it requires only four function measurements per iteration, irrespective of d. A modified Hessian estimation method was proposed in [36], where the eigenvalues are projected onto the half-plane. In [4], zeroth-order stochastic Newton algorithms with Hessian estimates formed using four, three, two, and one function measurements, respectively, are proposed, while in [6], Newton algorithms with three-simulation balanced Hessian estimators are proposed, where the Hessian inversion at each step is carried out using an iterative method. Smoothed functional-based second-order methods with one and two measurements have been explored in [5] with Gaussian random perturbations; there, the same simulations are used to estimate both the gradient and the Hessian. More recently, an RDSA-based Hessian estimator with uniform and asymmetric Bernoulli perturbation parameters has been proposed in [25], which requires three function measurements per iteration. Further, in [24], Hessian estimators based on deterministic perturbation sequence constructions are presented and the resulting algorithms analyzed. All these random perturbation-based Hessian estimation methods have bias of either O(δ) or O(δ²), where δ is the perturbation constant. In contrast, our Hessian estimators achieve a lower bias of O(δ^k) at the additional cost of 2k+1 function measurements. In [2], a zeroth-order variant of the well-known cubic-regularized Newton method with SF-based gradient and Hessian estimators was studied under an additional smoothness assumption on the sample performance. In comparison, we do not make this assumption, and, more importantly, we analyze a cubic-regularized Newton method with a generalized Hessian estimator that has a tunable bias. While our non-asymptotic bound is weaker than that of [2] owing to the more general setting that we consider, for large values of k our bound is very close to the one in the aforementioned reference.

Organization. The rest of this article is organized as follows. In Section 2, we first formulate the Hessian operator and then present Hessian estimators in multiple scenarios. In Section 3, we analyze the asymptotic behaviour of the stochastic Newton method that incorporates our proposed Hessian estimators. In Section 4, we provide the non-asymptotic analysis of a cubic-regularized Newton method. In Section 5, we provide the convergence proofs for all the claims in the prior sections. We show the results of numerical experiments in Section 6, which are seen to validate our theoretical findings. Finally, Section 7 presents concluding remarks and outlines a few future research directions.
2 Generalized RDSA-based Hessian Estimation

A family of generalized SPSA-based gradient estimators was proposed in [22] that provides the flexibility of lowering the estimator bias at the cost of extra function measurements. In particular, [22] proposes two families of estimators, which generalize the one-sided SPSA estimate [3, Section 5.3] and the balanced SPSA estimate [28], respectively. The two families are essentially based on two different infinite series expansions of the differentiation operator. An order-1 truncation of either infinite series expansion gives the unbalanced and balanced SPSA estimators, respectively. In this section, we introduce a family of novel estimators for the Hessian ∇²F, based on the family of gradient estimators in [22] that generalizes the unbalanced SPSA operator. The estimator bias can be made arbitrarily small (i.e., of proportionately high order in the perturbation constant, a quantity typically less than 1) by considering a sufficiently large number of function measurements.

Consider the multi-dimensional differentiation operator $D^\beta$, where $\beta = [\beta_1, \beta_2, \ldots, \beta_d]$, $\beta_i \in \mathbb{N}_0$ (the set of non-negative integers), defined as follows:
$$D^\beta F(\theta) \triangleq \frac{\partial^{|\beta|} F(\theta)}{\partial\theta_1^{\beta_1}\cdots\partial\theta_d^{\beta_d}}, \quad \text{with } |\beta| = \beta_1 + \cdots + \beta_d.$$
As an example, if β = [1, 0, ..., 0], then $D^\beta F(\theta) = \partial F(\theta)/\partial\theta_1$. Likewise, if β = [2, 3, 0, ..., 0], then $D^\beta F(\theta) = \partial^5 F(\theta)/(\partial\theta_1^2\,\partial\theta_2^3)$. Now let $\Delta^\beta = \Delta_1^{\beta_1}\Delta_2^{\beta_2}\cdots\Delta_d^{\beta_d}$ and $\beta! = \beta_1!\,\beta_2!\cdots\beta_d!$. Using the above operator, the multivariate Taylor expansion is as follows:
$$F(\theta + \delta\Delta) = \sum_{|\beta|=0}^{\infty}\frac{D^\beta F(\theta)}{\beta!}(\delta\Delta)^\beta = \sum_{|\beta|=0}^{\infty}\left[\frac{(\delta\Delta D)^\beta}{\beta!}\right]F(\theta), \qquad (4)$$
where F is assumed to be infinitely continuously differentiable. Using the exponentiation operator and the shift operator $\tau_{\delta\Delta}F(\theta) \equiv F(\theta + \delta\Delta)$, we can compactly rewrite (4) as $\tau_{\delta\Delta} = \exp(\delta\Delta D)$. Rearranging the equation above and using the logarithm expansion (as shown in [22]), we arrive at the following series expansion of the derivative operator:
$$D = \frac{1}{\delta\Delta}\sum_{j=1}^{\infty}(-1)^{j+1}\frac{(\tau_{\delta\Delta} - I)^j}{j}. \qquad (5)$$
It is evident that truncating the above summation to just one term yields the unbalanced SPSA operator. In this sense, the series expansion (5) neatly generalizes the unbalanced SPSA operator. In [22], it was shown that the gradient estimate obtained by truncating D to the first k terms yields a progressively reduced bias of O(δ^k), k ≥ 1. The extra cost that one pays for this reduced bias is a progressively increasing number of function measurements, namely k+1. In this paper, we explore such families of estimators for the Hessian.

By definition, the Hessian is obtained by applying the D operator (5) twice. The idea now is to consider a truncated version of the D operator as in [22] and apply it twice to F(θ). We denote the version of D truncated to the first k terms by $D_k$, which is of the form below:
$$D_k = \frac{1}{\delta\Delta}\sum_{j=1}^{k}(-1)^{j+1}\frac{(\tau_{\delta\Delta} - I)^j}{j}. \qquad (6)$$
On expanding the inner term $(\tau_{\delta\Delta} - I)^j$ using the binomial theorem and rearranging based on powers of $\tau_{\delta\Delta}$ [22], we obtain the following form for $D_k$:
$$D_k = \frac{1}{\delta\Delta}\sum_{l=0}^{k}(-1)^{1-l}\,c^k_l\,\frac{\tau_{l\delta\Delta}}{l!}, \qquad (7)$$
where
$$c^k_l = \begin{cases} \dfrac{1}{l}\displaystyle\prod_{j=0}^{l-1}(k-j), & l \geq 1, \\[2mm] \displaystyle\sum_{j=1}^{k}\frac{1}{j}, & l = 0. \end{cases} \qquad (8)$$
Note that since the shift operator is linear, $D_k$ is also linear for any finite k.
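The coefficients $c^k_l$ in (8) and the action of the truncated operator (7) on F are straightforward to compute; the resulting (k+1)-measurement gradient estimate along a random direction Δ is sketched below in Python. The helper names, the standard Gaussian direction, and the noise-free quadratic check at the end are illustrative assumptions of ours, not fixed by the text.

```python
import numpy as np
from math import factorial

def c_coeff(k, l):
    """Coefficient c_l^k from (8): c_0^k = sum_{j=1}^k 1/j and, for l >= 1,
    c_l^k = (1/l) * k (k-1) ... (k-l+1)."""
    if l == 0:
        return sum(1.0 / j for j in range(1, k + 1))
    prod = 1.0
    for j in range(l):
        prod *= (k - j)
    return prod / l

def gradient_estimate_Dk(f, theta, delta, k, rng):
    """(k+1)-measurement GRDSA gradient estimate from the truncated operator D_k, cf. (7)-(8)."""
    Delta = rng.standard_normal(theta.shape[0])
    total = 0.0
    for l in range(k + 1):
        sign = -1.0 if l % 2 == 0 else 1.0      # (-1)^{1-l}
        total += sign * c_coeff(k, l) * f(theta + l * delta * Delta) / factorial(l)
    return Delta * total / delta

if __name__ == "__main__":
    # sanity check on a noise-free quadratic: the averaged estimate tracks the gradient
    rng = np.random.default_rng(0)
    A = np.diag([1.0, 2.0, 3.0])
    f = lambda th: 0.5 * th @ A @ th
    theta = np.array([1.0, -1.0, 0.5])
    est = np.mean([gradient_estimate_Dk(f, theta, delta=1e-2, k=3, rng=rng)
                   for _ in range(20000)], axis=0)
    print(est, A @ theta)
```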
Since $D_k$ needs to be applied twice, more possibilities arise here than in the gradient case. When $D_k$ is applied a second time, one can use a different perturbation constant and perturbation vector (stochastically independent of the first one), as in [29]. Another possibility is that $D_k$, when applied twice, can be truncated up to a different order (say $k_1$ the first time and $k_2$ the second, for some $k_1, k_2 \geq 1$). We first consider the simple case where, for both truncated operators, (i) the perturbation vector and constant are the same, and (ii) both truncations are of the same order (Section 2.1). Under these conditions, we prove an interesting bias-reduction structure with increasing order of truncation; the bias decays polynomially in the perturbation constant δ.

2.1 Hessian estimate with equal truncation

In this section, we consider a truncated Hessian operator of the form $H_i = D_i \circ D_i$, where the operator D is truncated at the i-th position in the series (6) (i.e., with k = i). We choose $\Delta = (\Delta_1, \Delta_2, \ldots, \Delta_d)^{T}$ to be a d-vector of independent standard Gaussian random variables. Before tackling the general case of the Hessian estimate for an arbitrary order i, we first demonstrate the calculation for smaller truncation orders for ease of illustration.

2.1.1 Hessian estimate with O(δ) bias

Using the function values at (θ + δΔ) and θ, we form a Hessian estimate as explained below. First, let
$$D^1_i F(\theta) = \Delta_i\left[\frac{F(\theta+\delta\Delta) - F(\theta)}{\delta}\right]. \qquad (9)$$
This estimate coincides with the two-measurement unbalanced gradient estimate proposed in [3]. Applying the operator $D^1_j$ to (9), we obtain
$$\widehat H^1_{i,j}F(\theta) = D^1_j \circ D^1_i F(\theta) = \Delta_i\left[\frac{D^1_jF(\theta+\delta\Delta) - D^1_jF(\theta)}{\delta}\right] = \Delta_j\Delta_i\left[\frac{F(\theta+2\delta\Delta) - 2F(\theta+\delta\Delta) + F(\theta)}{\delta^2}\right].$$
Using matrix notation, we have
$$\widehat H^1F(\theta) = \Delta\Delta^{T}\left[\frac{F(\theta+2\delta\Delta) - 2F(\theta+\delta\Delta) + F(\theta)}{\delta^2}\right].$$
Using a Taylor series expansion of F(θ + 2δΔ) and F(θ + δΔ), we obtain
$$\widehat H^1F(\theta) = \frac{\Delta\Delta^{T}}{\delta^2}\left[\delta^2\,\Delta^{T}\nabla^2F(\theta)\Delta + \mathbf 1_{d\times d}\,O(\delta^3)\right] = (\Delta\Delta^{T})\left[\Delta^{T}\nabla^2F(\theta)\Delta\right] + \mathbf 1_{d\times d}\,O(\delta), \qquad (10)$$
where $\mathbf 1_{d\times d}$ denotes the d × d matrix with all entries equal to one. Taking the expectation over Δ, the first term on the RHS above does not simplify to ∇²F(θ). More precisely,
$$\mathbb E\left[(\Delta\Delta^{T})(\Delta^{T}\nabla^2F(\theta)\Delta)\right] = 2\nabla^2F(\theta) + \mathrm{tr}(\nabla^2F(\theta))\,I_d,$$
where tr(A) denotes the trace of a matrix A and $I_d$ denotes the d × d identity matrix. Note that this does not give the true Hessian. To overcome the bias in the first term of (10), we propose a slightly modified Hessian estimator $H^1$ that uses $(\Delta\Delta^{T} - I)$ instead of $\Delta\Delta^{T}$ to recover the true Hessian. The modified Hessian estimator is given as follows:
$$H^1F(\theta) = (\Delta\Delta^{T} - I)\left[\frac{F(\theta+2\delta\Delta) - 2F(\theta+\delta\Delta) + F(\theta)}{\delta^2}\right].$$
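The identity $\mathbb E[(\Delta\Delta^{T})(\Delta^{T}A\Delta)] = 2A + \mathrm{tr}(A)I_d$ for a standard Gaussian Δ, which motivates replacing $\Delta\Delta^{T}$ by a corrected scaling matrix, is easy to verify numerically. The short Monte Carlo check below is purely illustrative; the symmetric matrix A stands in for ∇²F(θ).

```python
import numpy as np

# Monte Carlo check of E[(Delta Delta^T)(Delta^T A Delta)] = 2A + tr(A) I_d
# for a standard Gaussian direction Delta; A plays the role of nabla^2 F(theta).
rng = np.random.default_rng(0)
d = 4
B = rng.standard_normal((d, d))
A = (B + B.T) / 2.0                    # an arbitrary symmetric matrix

n_samples = 200_000
acc = np.zeros((d, d))
for _ in range(n_samples):
    Delta = rng.standard_normal(d)
    acc += np.outer(Delta, Delta) * (Delta @ A @ Delta)
estimate = acc / n_samples

target = 2.0 * A + np.trace(A) * np.eye(d)
print(np.max(np.abs(estimate - target)))   # small, and shrinks as n_samples grows
```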
2.1.2 Hessian estimate with O(δ²) bias

The three-measurement GRDSA is defined as follows:
$$D^2_i F(\theta) = \Delta_i\left[\frac{4F(\theta+\delta\Delta) - 3F(\theta) - F(\theta+2\delta\Delta)}{2\delta}\right]. \qquad (11)$$
Applying $D^2_j$ to (11), we obtain
$$\widehat H^2_{ij}F(\theta) = D^2_j \circ D^2_i F(\theta) = \Delta_i\left[\frac{4D^2_jF(\theta+\delta\Delta) - 3D^2_jF(\theta) - D^2_jF(\theta+2\delta\Delta)}{2\delta}\right]$$
$$= \frac{\Delta_j\Delta_i}{4\delta^2}\Big[22F(\theta+2\delta\Delta) - 24F(\theta+\delta\Delta) - 8F(\theta+3\delta\Delta) + F(\theta+4\delta\Delta) + 9F(\theta)\Big].$$
In matrix form, we have
$$\widehat H^2F(\theta) = \frac{\Delta\Delta^{T}}{4\delta^2}\Big[22F(\theta+2\delta\Delta) - 24F(\theta+\delta\Delta) - 8F(\theta+3\delta\Delta) + F(\theta+4\delta\Delta) + 9F(\theta)\Big].$$
Therefore, by using Taylor series expansions of F(θ + 2δΔ), F(θ + δΔ), and F(θ), we get
$$\widehat H^2F(\theta) = \frac{\Delta\Delta^{T}}{2!\,4\delta^2}\Big[8\delta^2\,\Delta^{T}\nabla^2F(\theta)\Delta + \mathbf 1_{d\times d}\,O(\delta^4)\Big] = \Delta\Delta^{T}\big[\Delta^{T}\nabla^2F(\theta)\Delta\big] + \mathbf 1_{d\times d}\,O(\delta^2).$$
As discussed above, the first term is not an unbiased estimate of the true Hessian. To mitigate this issue, we use the following modified Hessian estimator:
$$H^2F(\theta) = \frac{(\Delta\Delta^{T} - I)}{4\delta^2}\Big[22F(\theta+2\delta\Delta) - 24F(\theta+\delta\Delta) - 8F(\theta+3\delta\Delta) + F(\theta+4\delta\Delta) + 9F(\theta)\Big].$$

2.1.3 Hessian estimate with O(δ³) bias

The four-measurement GRDSA is defined as follows:
$$D^3_iF(\theta) = \frac{\Delta_i}{6\delta}\Big[2F(\theta+3\delta\Delta) - 9F(\theta+2\delta\Delta) + 18F(\theta+\delta\Delta) - 11F(\theta)\Big].$$
Applying $D^3_j$ to the above, we get
$$\widehat H^3_{ij}F(\theta) = D^3_j \circ D^3_iF(\theta) = \frac{\Delta_i}{6\delta}\Big[2D^3_jF(\theta+3\delta\Delta) - 9D^3_jF(\theta+2\delta\Delta) + 18D^3_jF(\theta+\delta\Delta) - 11D^3_jF(\theta)\Big]$$
$$= \frac{\Delta_j\Delta_i}{36\delta^2}\Big[4F(\theta+6\delta\Delta) - 36F(\theta+5\delta\Delta) + 153F(\theta+4\delta\Delta) - 368F(\theta+3\delta\Delta) + 121F(\theta) + 522F(\theta+2\delta\Delta) - 396F(\theta+\delta\Delta)\Big].$$
In matrix notation, we have
$$\widehat H^3F(\theta) = \frac{\Delta\Delta^{T}}{36\delta^2}\Big[4F(\theta+6\delta\Delta) - 36F(\theta+5\delta\Delta) + 153F(\theta+4\delta\Delta) - 368F(\theta+3\delta\Delta) + 121F(\theta) + 522F(\theta+2\delta\Delta) - 396F(\theta+\delta\Delta)\Big].$$
Using a Taylor series expansion, we obtain
$$\widehat H^3F(\theta) = \frac{\Delta\Delta^{T}}{2!\,36\delta^2}\Big[72\delta^2\,\Delta^{T}\nabla^2F(\theta)\Delta + \mathbf 1_{d\times d}\,O(\delta^5)\Big] = (\Delta\Delta^{T})\big[\Delta^{T}\nabla^2F(\theta)\Delta\big] + \mathbf 1_{d\times d}\,O(\delta^3).$$
In a similar manner, we use the following modified Hessian estimator:
$$H^3F(\theta) = \frac{(\Delta\Delta^{T} - I)}{36\delta^2}\Big[4F(\theta+6\delta\Delta) - 36F(\theta+5\delta\Delta) + 153F(\theta+4\delta\Delta) - 368F(\theta+3\delta\Delta) + 121F(\theta) + 522F(\theta+2\delta\Delta) - 396F(\theta+\delta\Delta)\Big].$$

2.1.4 Hessian estimate with O(δ^k) bias

Recall that the GRDSA operator (6), formed using k+1 function measurements, is given by
$$D^k_iF(\theta) = \frac{\Delta_i}{\delta}\sum_{l=0}^{k}(-1)^{1-l}\,c^k_l\,\frac{F(\theta+l\delta\Delta)}{l!}, \qquad (12)$$
with $c^k_l$ as defined in (8). Applying $D^k_j$ to (12), we get
$$\widehat H^k_{ij}F(\theta) = D^k_j \circ D^k_iF(\theta) = \Delta_i\left[\sum_{l=0}^{k}(-1)^{1-l}\,c^k_l\,\frac{D^k_jF(\theta+l\delta\Delta)}{\delta\,l!}\right] = \frac{\Delta_j\Delta_i}{\delta^2}\sum_{l=0}^{k}\sum_{m=0}^{k}(-1)^{-l-m}\,c^k_l c^k_m\,\frac{F(\theta+(l+m)\delta\Delta)}{l!\,m!}.$$
Using matrix notation, we get
$$\widehat H^kF(\theta) = \frac{\Delta\Delta^{T}}{\delta^2}\sum_{l=0}^{k}\sum_{m=0}^{k}(-1)^{-l-m}\,c^k_l c^k_m\,\frac{F(\theta+(l+m)\delta\Delta)}{l!\,m!}. \qquad (13)$$
As discussed earlier, $\widehat H^k$ fails to recover the true Hessian, so we use the slightly modified Hessian estimator $H^k$ defined as follows:
$$H^kF(\theta) = \frac{(\Delta\Delta^{T} - I)}{\delta^2}\sum_{l=0}^{k}\sum_{m=0}^{k}(-1)^{-l-m}\,c^k_l c^k_m\,\frac{F(\theta+(l+m)\delta\Delta)}{l!\,m!}. \qquad (14)$$
To characterize the bias of the Hessian estimator, we make the following assumption.

(A1). The function f is (k+1)-times continuously differentiable with
$$L_{k+2}(\theta) \triangleq \nabla^{k+2}_{i_1,i_2,\ldots,i_{k+2}}f(\theta) < \infty, \quad i_1, i_2, \ldots, i_{k+2} \in \{1, 2, \ldots, d\}, \ k \geq 1,$$
for any $\theta \in \mathbb{R}^d$.

Lemma 1. Under assumption (A1), we have
$$H^kF(\theta) = (\Delta\Delta^{T} - I)\big[\Delta^{T}\nabla^2F(\theta)\Delta\big] + \mathbf 1_{d\times d}\,O(\delta^k), \quad \text{for any } k \geq 1.$$
Proof. See Section 5.1.

2.1.5 Generalized Hessian Estimation with noisy measurements

In the presence of noisy observations, the (2k+1)-measurement Hessian estimator takes the form
$$\widetilde H^kF(\theta, \xi) = \frac{(\Delta\Delta^{T} - I)}{\delta^2}\sum_{l=0}^{k}\sum_{m=0}^{k}(-1)^{-l-m}\,c^k_l c^k_m\,\frac{f(\theta+(l+m)\delta\Delta,\ \xi_l)}{l!\,m!}, \qquad (15)$$
where $\xi_l(n)$ denotes an i.i.d. family of zero-mean random variables. Here f is the noise-corrupted sample of the performance objective, and Δ is defined as before.
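The estimator (15) is simple to realize in code: draw one Gaussian direction, query the noisy oracle at the 2k+1 points θ + pδΔ, p = 0, ..., 2k, and combine the values with the coefficients $c^k_l$. The Python sketch below is illustrative; the helper names are ours, and the final loop is only a usage demonstration on a noise-free quadratic (see Lemma 4 below for the actual bias and variance characterization).

```python
import numpy as np
from math import factorial

def c_coeff(k, l):
    """Coefficient c_l^k from (8)."""
    if l == 0:
        return sum(1.0 / j for j in range(1, k + 1))
    out = 1.0
    for j in range(l):
        out *= (k - j)
    return out / l

def hessian_estimate_Hk(f, theta, delta, k, rng):
    """(2k+1)-measurement generalized Hessian estimate, cf. (14)-(15).

    The oracle f is queried once per distinct point theta + p*delta*Delta,
    p = 0, ..., 2k, and the stored values are reused across the double sum."""
    d = theta.shape[0]
    Delta = rng.standard_normal(d)
    values = [f(theta + p * delta * Delta) for p in range(2 * k + 1)]
    total = 0.0
    for l in range(k + 1):
        for m in range(k + 1):
            sign = 1.0 if (l + m) % 2 == 0 else -1.0   # (-1)^{-l-m}
            total += sign * c_coeff(k, l) * c_coeff(k, m) * values[l + m] / (factorial(l) * factorial(m))
    return (np.outer(Delta, Delta) - np.eye(d)) * total / delta ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    f = lambda th: 0.5 * th @ A @ th                    # noise-free quadratic oracle
    estimates = [hessian_estimate_Hk(f, np.zeros(2), delta=0.01, k=2, rng=rng)
                 for _ in range(10000)]
    print(np.mean(estimates, axis=0))                   # averaged estimate, for inspection
```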
Remark 1. In our proposed Hessian estimators, $(\Delta\Delta^{T} - I)$ is used as a scaling matrix to obtain an unbiased Hessian estimator under expectation. This scaling matrix is used when the perturbation components are distributed according to a zero-mean Gaussian distribution. One can also consider other perturbation distributions to obtain different estimators; we provide a few such examples below.
RDSA-Unif: Our proposed method can be readily applied when each perturbation component is chosen from a uniform distribution over [−η, η]. In this case, the scaling matrix is the same as the matrix M defined in [25].
RDSA-Asym: Our method can also be used when the perturbation components are drawn from an asymmetric Bernoulli distribution. In this case, the scaling matrix is again the matrix M defined in [25].

2.2 Hessian Estimate with Unequal Truncation

In this section, we consider unequal truncations of the gradient operators to form the following truncated Hessian operator: $H^{k_1,k_2}_{ij} = D^{k_1}_j \circ D^{k_2}_i$, for any $k_1, k_2 \geq 1$ with $k_1 \neq k_2$. First, we demonstrate one special case ($k_1 = k$, $k_2 = 1$) for ease of illustration; then we consider the general case.

Special case: $k_1 = k$, $k_2 = 1$. The two-measurement GRDSA is
$$D^1_iF(\theta) = \Delta_i\left[\frac{F(\theta+\delta\Delta) - F(\theta)}{\delta}\right]. \qquad (16)$$
The (k+3)-measurement Hessian estimator is then obtained as follows:
$$\widehat H^k_{ij}F(\theta) = D^k_j \circ D^1_iF(\theta) = \Delta_i\left[\frac{D^k_jF(\theta+\delta\Delta) - D^k_jF(\theta)}{\delta}\right] = \Delta_j\Delta_i\left[\sum_{l=0}^{k}(-1)^{1-l}c^k_l\frac{F(\theta+(l+1)\delta\Delta)}{\delta^2\,l!} - \sum_{l=0}^{k}(-1)^{1-l}c^k_l\frac{F(\theta+l\delta\Delta)}{\delta^2\,l!}\right]. \qquad (17)$$
In a similar way as discussed in Section 2.1, we have
$$H^kF(\theta) = (\Delta\Delta^{T} - I)\left[\sum_{l=0}^{k}(-1)^{1-l}c^k_l\frac{F(\theta+(l+1)\delta\Delta)}{\delta^2\,l!} - \sum_{l=0}^{k}(-1)^{1-l}c^k_l\frac{F(\theta+l\delta\Delta)}{\delta^2\,l!}\right]. \qquad (18)$$

Lemma 2. Under assumption (A1), we have
$$H^kF(\theta) = (\Delta\Delta^{T} - I)\big[\Delta^{T}\nabla^2F(\theta)\Delta\big] + O(\delta).$$
Proof. See Section 5.2.

The above result shows that the bias of the Hessian estimator defined in (18) is O(δ). We now discuss the general case.

General case: $k_1, k_2 \geq 1$, $k_1 \neq k_2$.
$$H^{k_1,k_2}_{ij}F(\theta) = D^{k_2}_j \circ D^{k_1}_iF(\theta) = \frac{\Delta_j}{\delta}\sum_{l=0}^{k_2}(-1)^{1-l}c^{k_2}_l\frac{D^{k_1}_iF(\theta+l\delta\Delta)}{l!} = \frac{\Delta_i\Delta_j}{\delta^2}\sum_{l=0}^{k_2}\sum_{m=0}^{k_1}(-1)^{2-l-m}c^{k_2}_l c^{k_1}_m\frac{F(\theta+(l+m)\delta\Delta)}{l!\,m!}$$
$$= \frac{\Delta_i\Delta_j}{\delta^2}\left[c^{k_1}_0c^{k_2}_0F(\theta) + c^{k_2}_0\sum_{m=1}^{k_1}(-1)^{2-m}\binom{k_1}{m}\frac{F(\theta+m\delta\Delta)}{m} + c^{k_1}_0\sum_{l=1}^{k_2}(-1)^{2-l}\binom{k_2}{l}\frac{F(\theta+l\delta\Delta)}{l} + \sum_{l=1}^{k_2}\sum_{m=1}^{k_1}(-1)^{2-l-m}\binom{k_2}{l}\binom{k_1}{m}\frac{F(\theta+(l+m)\delta\Delta)}{lm}\right]. \qquad (19)$$
As discussed in Section 2.1, we therefore use
$$H^{k_1,k_2}F(\theta) = \frac{(\Delta\Delta^{T} - I)}{\delta^2}\left[c^{k_1}_0c^{k_2}_0F(\theta) + c^{k_2}_0\sum_{m=1}^{k_1}(-1)^{2-m}\binom{k_1}{m}\frac{F(\theta+m\delta\Delta)}{m} + c^{k_1}_0\sum_{l=1}^{k_2}(-1)^{2-l}\binom{k_2}{l}\frac{F(\theta+l\delta\Delta)}{l} + \sum_{l=1}^{k_2}\sum_{m=1}^{k_1}(-1)^{2-l-m}\binom{k_2}{l}\binom{k_1}{m}\frac{F(\theta+(l+m)\delta\Delta)}{lm}\right]. \qquad (20)$$

Lemma 3. Under assumption (A1), we have
$$H^{k_1,k_2}F(\theta) = (\Delta\Delta^{T} - I)\big[\Delta^{T}\nabla^2F(\theta)\Delta\big] + O(\delta^k).$$
In particular,
$$H^{k_1,k_2}F(\theta) = (\Delta\Delta^{T} - I)\left[\Delta^{T}\nabla^2F(\theta)\Delta + O\big(\delta^{k'}\big) + \sum_{i=k}^{k'-1}\Big\langle\nabla^{i+2}F(\theta),\ \underbrace{\Delta\otimes\cdots\otimes\Delta}_{i+2\ \text{times}}\Big\rangle\,\delta^i\,\frac{\sum_{m=1}^{k}(-1)^{1-m}\binom{k}{m}m^i}{(i+1)!}\right], \qquad (21)$$
where $k = \min(k_1, k_2)$, $k' = \max(k_1, k_2)$, $k_1, k_2 \geq 1$, $k_1 \neq k_2$.
Proof. See Section 5.3.
Remark 2. The term $\sum_{m=1}^{k}(-1)^{1-m}\binom{k}{m}m^i$ is non-zero because i ≥ k, and hence the identity $\sum_{j=0}^{k}(-1)^{k-j}\binom{k}{j}j^q = 0$ for any 0 < q < k cannot be invoked. For instance, if k = 1, this term equals 1 for all i. If k = 2, this term can be shown to be $2 - 2^i$ for all i ≥ 2.

Remark 3. The above result shows that unequal truncation $(k_1, k_2)$ does not lead to any bias-reduction advantage: the order of the bias is the same as that obtained by keeping the truncation order equal to $\min(k_1, k_2)$ for both D operators of the Hessian estimate. Furthermore, recall that an equal-truncation estimator requires 2k+1 measurements, whereas the unequal-truncation operator needs $k_1 + k_2 + 1$ measurements; this is easy to see from the double summation term in (20). Clearly $k_1 + k_2 + 1 \geq 2\min(k_1, k_2) + 1$, which is the number of measurements needed by the equal-truncation scheme. Hence, unequal truncation $(k_1, k_2)$ also requires relatively more function measurements. Overall, there appears to be neither a statistical nor a computational advantage in using unequal-truncation-based Hessian estimates.

2.3 Hessian estimation bounds

In this section, we first provide bias and variance bounds for our proposed Hessian estimators under assumptions similar to those made for simultaneous perturbation-based Hessian estimation, cf. [23]. Subsequently, we provide an asymptotic convergence guarantee for a stochastic Newton method that uses our proposed Hessian estimators. For the sake of the analysis, we make the following additional assumption.

(A2). For all n ≥ 1 and l = 0, 1, ..., 2k+1, there exist $\alpha_1, \alpha_2 > 0$ such that $\mathbb E|\xi_l(n)| < \alpha_1$ and $\mathbb E|f(\theta(n) + l\delta(n)\Delta(n))|^2 < \alpha_2$.

The result below provides the bias and variance of our proposed Hessian estimators in the equal-truncation scenario, i.e., $k_1 = k_2 = k$. Let $\mathcal F_n = \sigma(\theta(j), j \leq n;\ \Delta(j), j < n;\ \xi_0(n), \ldots, \xi_{k_1}(n))$, n ≥ 1, denote the underlying sigma-field.

Lemma 4 (Bias Lemma). Under assumptions (A1)-(A2), for all i, j = 1, ..., d, the Hessian estimate with O(δ^k) bias defined in (15) almost surely (a.s.) satisfies
$$\Big|\,\mathbb E\big[\widetilde H^k_{ij}F(\theta,\xi)\,\big|\,\mathcal F_n\big] - \nabla^2_{ij}F(\theta)\,\Big| = O(\delta^k), \quad \text{and} \quad \mathbb E\Big\|\widetilde H^kF(\theta,\xi) - \mathbb E\big[\widetilde H^kF(\theta,\xi)\,\big|\,\mathcal F_n\big]\Big\|^2 = O\!\left(\frac{1}{\delta^4}\right).$$
Proof. See Section 5.4.

The above result shows that the bias of our proposed Hessian estimators is O(δ^k) when 2k+1 function measurements are used, for k ≥ 1. This is clearly superior to 2SPSA [29] and 2RDSA [25], which only provide O(δ²) bias with four and three function measurements, respectively.

Remark 4. For the unequal-truncation scenario, one can obtain bias and variance bounds in a similar fashion using Lemma 3. In this case, one obtains a bias corresponding to the smaller truncation parameter.

3 Zeroth-order stochastic Newton algorithm

We consider the following update rule, similar to [23], except that we use the generalized Hessian estimators proposed here along with generalized gradient estimators:
$$\theta(n+1) = \Gamma\!\left(\theta(n) - a(n)\,\Theta\!\big(\bar H^k_n\big)^{-1}\,\widehat\nabla f(\theta(n))\right), \qquad (22)$$
$$\bar H^k_n = \bar H^k_{n-1} + b(n)\big(H^k_n - \bar H^k_{n-1}\big),$$
where $H^k_n$ is a Hessian estimate with bias O(δ^k) as defined in (14), for any k ≥ 1, and $\widehat\nabla f(\theta(n))$ corresponds to any of the gradient estimators proposed in [22]. Each generalized gradient estimate requires k+1 function measurements, while a generalized Hessian estimate requires 2k+1 function measurements. These function measurements are independent and can be reused; in other words, 2k+1 function measurements suffice to form both the gradient and the Hessian estimates. Here Γ is the projection operator onto a convex and compact set, and Θ is a matrix operator that projects any d × d matrix onto the space of positive definite and symmetric matrices.
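A common concrete choice for the operator Θ in (22) is to symmetrize the running Hessian estimate and floor its eigenvalues at a small positive constant; the sketch below illustrates this choice together with the resulting Newton direction. It is only one admissible projection consistent with (A4)-(A5) below, not necessarily the one used in our experiments, and the threshold eps is an illustrative parameter.

```python
import numpy as np

def project_pd(A, eps=1e-4):
    """One possible Theta: symmetrize A and floor its eigenvalues at eps > 0,
    returning a symmetric positive definite matrix."""
    S = (A + A.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(S)
    return eigvecs @ np.diag(np.maximum(eigvals, eps)) @ eigvecs.T

def newton_direction(H_bar, grad_est):
    """Descent direction Theta(H_bar)^{-1} * (gradient estimate) used in (22)."""
    return np.linalg.solve(project_pd(H_bar), grad_est)

if __name__ == "__main__":
    H_bar = np.array([[1.0, 2.0], [2.0, -3.0]])   # an indefinite running estimate
    g = np.array([0.5, -1.0])
    print(np.linalg.eigvalsh(project_pd(H_bar)))  # all eigenvalues >= eps
    print(newton_direction(H_bar, g))
```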
For asymptotic convergence, we require the following assumptions in addition to (A1)-(A2).

(A3). The step-size and perturbation sequences a(n), b(n), and δ(n) satisfy the following conditions:
$$\sum_{n=1}^{\infty} a(n) = \sum_{n=1}^{\infty} b(n) = \infty; \qquad \lim_{n\to\infty}\frac{a(n)}{b(n)} = 0, \qquad (23)$$
$$\sum_{n=1}^{\infty}\left(\frac{a(n)}{\delta(n)}\right)^2 < \infty; \qquad \sum_{n=1}^{\infty}\left(\frac{b(n)}{\delta(n)^2}\right)^2 < \infty. \qquad (24)$$

(A4). Let $\{A_n\}$ and $\{B_n\}$ be any two d × d matrix sequences. If $\lim_{n\to\infty}\|A_n - B_n\| = 0$, then $\lim_{n\to\infty}\|\Theta(A_n) - \Theta(B_n)\| = 0$.

(A5). Let $\{C_n\}$ be a sequence of d × d matrices with $\sup_n\|C_n\| < \infty$. Then
$$\sup_n\|\Theta(C_n)\| < \infty; \qquad \sup_n\|(\Theta(C_n))^{-1}\| < \infty. \qquad (25)$$

(A6). $\sup_n\|\theta(n)\| < \infty$ w.p. 1.

Theorem 1 (Asymptotic convergence). Under assumptions (A1)-(A6), the iterates θ(n) governed by (22) converge to the set $\{\theta' : \nabla F(\theta') = 0\}$ as n → ∞.
Proof. See Section 5.5.

4 Zeroth-order cubic-regularized Newton algorithm

In general, the standard Newton method does not escape saddle points for non-convex objectives [1]. On the other hand, the cubic-regularized Newton method [21] manages to escape saddle points by incorporating a cubic term in the optimization process. We adopt this technique in a zeroth-order optimization setting and derive a non-asymptotic bound that quantifies the rate of convergence to a local minimum. We present the pseudocode in Algorithm 1.

Algorithm 1: Cubic-regularized zeroth-order stochastic Newton algorithm (CRZON)
Input: initial point $\theta_0 \in \mathbb{R}^d$, a non-negative sequence $\{\alpha_n\}$, and number of iterations N ≥ 1.
for n ← 1 to N do
    Generate $m_n$ and $b_n$ measurements for gradient and Hessian estimation, respectively.
    /* Gradient estimation */
    $\bar G^k = \frac{1}{m_n}\sum_{i=1}^{m_n}\widehat D^kF(\theta_{n-1}, \xi_{i,n})$, where $\widehat D^k_iF(\theta, \xi) = \frac{\Delta_i}{\delta}\sum_{l=0}^{k}(-1)^{1-l}c^k_l\frac{f(\theta + l\delta\Delta,\,\xi_l)}{l!}$.
    /* Hessian estimation */
    $\bar H^k = \frac{1}{b_n}\sum_{j=1}^{b_n}\widetilde H^kF(\theta_{n-1}, \xi_{j,n})$, where $\widetilde H^kF(\theta, \xi)$ is as defined in (15).
    /* Cubic-regularized Newton step */
    $\theta_n = \operatorname{argmin}_{\theta\in\mathbb R^d}\ \tilde\rho_h(\theta, \theta_{n-1}, \bar G^k, \bar H^k, \alpha_n)$, where $\tilde\rho_h(x, y, g, \mathcal H, \alpha) = \langle g, x - y\rangle + \frac12\langle\mathcal H(x-y), x-y\rangle + \frac{\alpha}{6}\|x-y\|^3$.
end for
Output: Choose R uniformly at random from {1, ..., N} and return $\theta_R$.
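The only non-trivial computation inside Algorithm 1 is the cubic-regularized Newton step, i.e., minimizing the model $\tilde\rho_h$ over the displacement $s = \theta - \theta_{n-1}$. The sketch below approximates this argmin by plain gradient descent on the model; this inner solver, its step size, and the placeholder estimates standing in for $\bar G^k$ and $\bar H^k$ are illustrative choices of ours, since the algorithm only specifies the argmin, not how it is computed.

```python
import numpy as np

def cubic_model(s, g, H, alpha):
    """Cubic-regularized model rho_h of Algorithm 1, written in s = theta - theta_prev."""
    return g @ s + 0.5 * s @ H @ s + (alpha / 6.0) * np.linalg.norm(s) ** 3

def cubic_newton_step(g, H, alpha, n_steps=500, lr=0.05):
    """Approximate argmin of the cubic model via gradient descent on s.

    The gradient of the model is g + H s + (alpha/2) ||s|| s."""
    s = np.zeros_like(g)
    for _ in range(n_steps):
        s -= lr * (g + H @ s + 0.5 * alpha * np.linalg.norm(s) * s)
    return s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G_bar = rng.standard_normal(3)            # stands in for the averaged gradient estimate
    H_bar = np.diag([1.0, -0.5, 2.0])         # stands in for the averaged Hessian estimate
    theta_prev = np.zeros(3)
    s_star = cubic_newton_step(G_bar, H_bar, alpha=3.0)
    theta_next = theta_prev + s_star
    print(theta_next, cubic_model(s_star, G_bar, H_bar, 3.0))
```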
We begin by discussing first- and second-order stationary points. In a deterministic optimization setting, a point $\bar\theta$ is said to be an FOSP if $\nabla F(\bar\theta) = 0$. However, convergence to FOSPs does not guarantee convergence to local minima, as such points could be saddle points. At an SOSP, the gradient vanishes and the Hessian is positive semi-definite. If every saddle point θ is strict, i.e., ∇F(θ) = 0 and $\lambda_{\min}(\nabla^2F(\theta)) < 0$, then SOSPs coincide with local minima. Moreover, it is NP-hard to distinguish between strict and non-strict saddle points, cf. [1]. Thus, SOSPs have been adopted as the goal for saddle-escaping algorithms in the non-convex optimization literature.

In the zeroth-order optimization setting that we consider, it is difficult to find an SOSP directly, since the exact form of the objective is unavailable; only noisy function measurements can be accessed. Moreover, in the non-asymptotic regime, an algorithm can find an approximate FOSP/SOSP, which are standard notions in the machine learning literature. We define these quantities below.

Definition 1 (ϵ-FOSP). Let ϵ > 0. A random point $\theta_R$ is said to be an ϵ-FOSP of the objective F if $\mathbb E[\|\nabla F(\theta_R)\|] \leq \epsilon$.

Definition 2 (ϵ-SOSP). Let ϵ > 0. A random point $\theta_R$ is said to be an ϵ-SOSP of the objective function F if
$$\max\left\{\sqrt{\mathbb E[\|\nabla F(\theta_R)\|]},\ -\frac{1}{\sqrt{\xi}}\,\mathbb E\big[\lambda_{\min}(\nabla^2F(\theta_R))\big]\right\} \leq \sqrt{\epsilon},$$
for some ξ > 0. Here $\lambda_{\min}(A)$ denotes the minimum eigenvalue of A.

For our non-asymptotic analysis, we require the following additional assumption.

(A7). The objective F has an $L_H$-Lipschitz continuous Hessian, i.e.,
$$\|\nabla^2F(\theta_1) - \nabla^2F(\theta_2)\| \leq L_H\,\|\theta_1 - \theta_2\|. \qquad (26)$$

The main result establishing the convergence of Algorithm 1 to an approximate SOSP is given below.

Theorem 2 (Convergence to ϵ-SOSP). Suppose assumptions (A1)-(A2) and (A6)-(A7) hold. Let the iterates $\{\theta_n\}$ be obtained by running Algorithm 1 with the following parameters: for n = 1, ..., N, set
$$\alpha_n = 3L_H; \quad \delta = \frac{\epsilon^{1/k}}{\max\{2C_1^2,\ 8C_3^4\}}; \quad N = \frac{12\sqrt{L_H}\,[F(\theta_0) - F^*]}{\epsilon^{3/2}}; \quad m_n = \frac{2C_2}{\epsilon^{2+2/k}}; \quad b_n = \frac{\max\{4C_4C(d),\ 24C_4^2\}}{\epsilon^{1+4/k}}, \qquad (27)$$
where $C_1, C_2, C_3, C_4$ are constants specified in Section 5.6. Then,
$$35\sqrt{\epsilon} \ \geq\ \max\left\{\sqrt{\mathbb E[\|\nabla F(\theta_R)\|]},\ -\frac{1}{50\sqrt{L_H}}\,\mathbb E\big[\lambda_{\min}(\nabla^2F(\theta_R))\big]\right\}. \qquad (28)$$
Proof. See Section 5.6.

Remark 5. For finding an ϵ-SOSP, the number of gradient estimates in the CRZON algorithm is $\sum_{n=1}^{N} m_n = O(\epsilon^{-(7/2 + 2/k)})$, while the number of Hessian estimates is $\sum_{n=1}^{N} b_n = O(\epsilon^{-(5/2 + 4/k)})$. Each gradient estimate uses k+1 function measurements and each Hessian estimate requires 2k+1 measurements. These function measurements need not be separate: a subset of the measurements used in forming the Hessian estimate can be reused for gradient estimation. In any case, the overall sample complexity of the CRZON algorithm is $O(\epsilon^{-(7/2 + 2/k)})$. With unbiased gradient and Hessian information, cubic Newton finds an ϵ-SOSP with sample complexity $O(\epsilon^{-3.5})$, cf. [32]. In comparison, our bound for CRZON in the zeroth-order setting is weaker; this gap is to be expected, considering that the gradient and Hessian estimates are both biased. As k grows large, our estimators have lower bias, and our sample complexity bound approaches $O(\epsilon^{-3.5})$.

Remark 6. In [2], where the authors consider a specialized zeroth-order setting, the corresponding bound is $O(\epsilon^{-3.5})$. The specialization in their setting is that $F(\theta) = \mathbb E_\xi[f(\theta, \xi)]$ and f are both L-smooth. We assume that F alone is L-smooth, which is more general than [2], as L-smoothness of f implies smoothness of the objective F as well, while the converse is not true in general.

Remark 7. The sample complexity of a stochastic gradient algorithm to find an ϵ-FOSP with unbiased gradient information is $O(\epsilon^{-4})$, cf. [11].
On the other hand, the gener alize d SPSA, se e [ 22 ], with a gr adient estimate forme d using k + 1 function me asur ements, finds an ϵ -FOSP in a zeroth-or der optimization setting with sample c omplexity O ( ϵ − (4+ 4 k ) ) . This sample c omplexity appr o aches O ( ϵ − 4 ) as k b e c omes lar ge. Our r esults ar e in a similar spirit, alb eit for finding SOSPs. In p articular, our b ound O ( ϵ − ( 7 2 + 2 k ) ) for CRZON algorithm appr o aches O ( ϵ − 3 . 5 ) as k b e c omes lar ge. Remark 8. F or the sp e cial c ase of thr e e me asur ements-b ase d Hessian estimator, i.e., k = 1 , we obtain a sample c omplexity b ound of O ( ϵ − 5 . 5 ) for finding an ϵ -SOSP. This r esult is of indep endent inter est, as such an estimator is closer to the thr e e me asur ements-b ase d b alanc e d Hessian estimators studie d in [ 7 ],[ 25 ],[ 2 ]. T o the b est of our know le dge, ther e ar e no sample c omplexity b ounds for sto chastic Newton algorithms using wel l-known simultane ous p erturb ation-b ase d Hessian estimates in the afor ementione d r efer enc es. Our r esults close this gap and we also pr ovide gener alize d Hessian estimators that c an appr o ach the sample c omplexity of a sto chastic Newton algorithm with unbiase d gr adient/Hessian estimates. 13 5 Pro ofs 5.1 Pro of of Lemma 1 Pr o of. W e use the following combinatorial identities in the pro of: 1 1  k 1  − 1 2  k 2  + · · · + ( − 1) k +1 1 k  k k  = k X j =1 1 j , (29)  k 1  −  k 2  +  k 3  − . . . ( − 1) k +1  k k  = 1 , (30) k X j =0 ( − 1) k − j  k j  j q = 0 for an y 0 < q < k . (31) Recall that H k F ( θ ) = (∆∆ T − I ) δ 2 " c 2 k 0 F ( θ ) + 2 k X m =1 ( − 1) − m c k 0  k m  F ( θ + mδ ∆) m + k X l =1 k X m =1 ( − 1) 2 − l − m  k l  k m  F ( θ + ( l + m ) δ ∆) lm # . Based on T aylor’s expansion, we analyze the v arious coefficients. The co efficient of F ( θ ) can b e simplified as follo ws. ( c k 0 ) 2 + 2 k X m =1 ( − 1) 2 − m c k 0  k m  m + k X l =1 k X m =1 ( − 1) 2 − l − m  k l  k m  lm (32) =   k X j =1 1 j   2 + 2   k X j =1 1 j   k X m =1 ( − 1) 2 − m  k m  m ! + k X l =1 ( − 1) 1 − l  k l  l ! k X m =1 ( − 1) 1 − m  k m  m ! . In the first term of (32) , we expand c 0 k from (8) . In the second term, we pull out c 0 k and expand it. In the third double summation term, we first pull out terms indep enden t of m from the inner summation. But no w what is left in the inner summation with index m is indep endent of the outer index l . Hence, the double summation can b e simplified as a pro duct of tw o single summations. F urther using Identit y (29) in the ab o v e three summations, we hav e =   k X j =1 1 j   2 − 2   k X j =1 1 j     k X j =1 1 j   +   k X j =1 1 j     k X j =1 1 j   = 2   k X j =1 1 j     k X j =1 1 j   − 2   k X j =1 1 j     k X j =1 1 j   = 0 . The co efficien t of D ∇ F ( θ ) , ∆ E in the aforemen tioned T aylor’s expansion is given as follows: 2 k X m =1 ( − 1) 2 − m c k 0  k m  + k X l =1 k X m =1 ( − 1) 2 − l − m  k l  k m  lm ( l + m ) = 2 c k 0 k X m =1 ( − 1) 2 − m  k m  + k X l =1 k X m =1 ( − 1) 2 − l − m  k l  k m  m + k X l =1 k X m =1 ( − 1) 2 − l − m  k l  k m  l 14 = − 2 k X j =1 1 j + k X l =1 ( − 1) 1 − l  k l  ! k X m =1 ( − 1) 1 − m  k m  m ! + k X l =1 ( − 1) 1 − l  k l  l ! k X m =1 ( − 1) 1 − m  k m  ! . In the previous step, the t wo double summations can b e written as a pro duct of single summations, as in the previous s implification. 
No w using Identities (29) and (30) on the ab ov e single summations, we obtain = − 2   k X j =1 1 j   +   k X j =1 1 j   +   k X j =1 1 j   = 0 . Next, the co efficien t of D ∇ 2 F ( θ ) , ∆ ⊗ ∆ E in the T aylor’s expansion is given as follows: 1 2! h 2 k X m =1 ( − 1) 2 − m c k 0  k m  m m 2 + k X l =1 k X m =1 ( − 1) 2 − l − m  k l  k m  lm ( l + m ) 2 i = c k 0 k X m =1 ( − 1) 2 − m  k m  m + 1 2 k X l =1 k X m =1 ( − 1) 2 − l − m  k l  k m  lm l 2 + k X l =1 k X m =1 ( − 1) 2 − l − m  k l  k m  + 1 2 k X l =1 k X m =1 ( − 1) 2 − l − m  k l  k m  lm m 2 = c k 0 k X m =1 ( − 1) 2 − m  k m  m + 1 2 k X l =1 ( − 1) 1 − l  k l  l ! k X m =1 ( − 1) 1 − m  k m  m ! + k X l =1 ( − 1) 1 − l  k l  ! k X m =1 ( − 1) 1 − m  k m  ! + 1 2 k X l =1 ( − 1) 1 − l  k l  l ! k X m =1 ( − 1) 1 − m  k m  m ! (33) = 1 . In (33) ab o v e, the first, third, and fourth terms v anish as a consequence of the Identit y (31) with q = 1. The third term is 1 from the Iden tity (30), and hence the result. The co efficien t of D ∇ q F ( θ ) , ∆ ⊗ · · · ⊗ ∆ E , 3 ≤ q ≤ k in the T aylor’s expansion is as follows: 1 q ! " 2 k X m =1 ( − 1) 2 − m c k 0  k m  m m q + k X l =1 k X m =1 ( − 1) 2 − l − m  k l  k m  lm ( l + m ) q # = 0 . Since q ≥ 3, the first term ab o ve v anishes as a consequence of the Identit y (29) for q ≥ 2. In the second term, the Binomial expansion of ( l + m ) q will yield ( q + 1) terms, each ha ving the form  q i  l i m q − i . Since q ≥ 3, for any i s.t. 0 ≤ i ≤ q , max ( i, q − i ) will b e at least 2. This means at least one of the exp onents of l and m will b e 2 or more in each term of the Binomial e xpansion. Without loss of generality , lets assume l is the index whose exp onen t is at least 2. The same argument will hold for m . After cancellation with l in the denominator, the l ’s exp onen t will b e 1 or more. Now this double summation term can be split into a pro duct of single summations, where the index l summation term will b e of the form  P k l =1 ( − 1) 1 − l  k l  l p  , p ≥ 1. F rom Identit y (31) , index ℓ summation term will b e 0. Hence, we ha ve shown that every double summation term from the binomial expansion is 0. 5.2 Pro of of Lemma 2 Pr o of. Recall that H k F ( θ ) = D k ◦ D 1 F ( θ ) 15 = (∆∆ T − I ) " k X l =0 ( − 1) 1 − l c k l D k F ( θ + ( l + 1) δ ∆) δ 2 l ! − k X l =0 ( − 1) 1 − l c k l D k F ( θ + l δ ∆) δ 2 l ! # . (34) No w we apply the T aylor’s expansion to each term in (34) and derive the co efficien ts of differen t-order deriv ativ es. The co efficient of F ( θ ) is given as follows: = k X l =0 ( − 1) 1 − l c k 0 − k X l =0 ( − 1) 1 − l c k 0 = 0 . Next, the co efficien t of ∇ F ( θ ) is obtained as follows: = − c k 0 + k X l =1 ( − 1) 1 − l 1 l  k l  ( l + 1) − k X l =1 ( − 1) 1 − l 1 l  k l  l = − c k 0 + k X l =1 ( − 1) 1 − l 1 l  k l  = − k X j =1 1 j + k X j =1 1 j = 0 . The co efficien t of ∇ 2 F ( θ ) is given as follows: 1 2! h − c k 0 + k X l =1 ( − 1) 1 − l 1 l  k l  ( l + 1) 2 − k X l =1 ( − 1) 1 − l 1 l  k l  l 2 i = 1 2! h − c k 0 + k X l =2 ( − 1) 1 − l 1 l  k l  (2 l + 1) i = k X l =2 ( − 1) 1 − l  k l  = 1 . F or any q > 2, the co efficien t of ∇ q F ( θ ) is given as follows: 1 2 .q ! " − c k 0 + k X l =1 ( − 1) 1 − l 1 l  k l  ( l + 1) q # − 1 2 .q ! " c k 0 ( − 1) q +1 k X l =2 ( − 1) 1 − l 1 l  k l  ( l − 1) q # = 1 ( q − 1)! . Th us for any q > 2, the co efficient is not equal to zero. This completes the pro of. 5.3 Pro of of Lemma 3 Pr o of. 
As b efore, we apply T aylor’s expansion to each term in (20) and combine the co efficien ts of terms with similar order deriv ativ es in this new Hessian op erator. The co efficien t of F ( θ ) δ − 2 is the follo wing: c k 1 0 c k 2 0 + c k 2 0 k 1 X m =1 ( − 1) 2 − m  k 1 m  m + c k 1 0 k 2 X l =1 ( − 1) 2 − m  k 2 m  l + k 2 X l =1 ( − 1) 1 − l  k 2 l  l ! k 1 X m =1 ( − 1) 1 − m  k 2 m  m ! ( a ) = c k 1 0 c k 2 0 + c k 2 0 ( − c k 1 0 ) + c k 1 0 ( − c k 2 0 ) + c k 1 0 c k 2 0 16 = 0 , where ( a ) follo ws from (29). Co efficien t of D ∇ F ( θ ) , ∆ E δ − 1 is the follo wing: c k 2 0 k 1 X m =1 ( − 1) 2 − m  k 1 m  + c k 1 0 k 2 X l =1 ( − 1) 2 − l  k 2 l  + k 1 X l =1 k 2 X m =1 ( − 1) 2 − l − m  k 2 l  k 1 m  (1 /l + 1 /m ) = − c k 2 0 − c k 1 0 + " k 2 X l =1 ( − 1) 1 − l  k 2 l  # " k 1 X m =1 ( − 1) 1 − m  k 1 m  m # ] + " k 2 X l =1 ( − 1) 1 − l  k 2 l  l # " k 1 X m =1 ( − 1) 1 − m  k 1 m  # ( b ) = − c k 2 0 − c k 1 0 + c k 2 0 . (1) + 1 . ( c k 2 0 ) = 0 , where ( b ) follo ws from (29) and (30). Co efficien t of D ∇ 2 F ( θ ) , ∆ ⊗ ∆ E is giv en as follows: 1 2! h c k 2 0 k 1 X m =1 ( − 1) 2 − m  k 1 m  m m 2 + c k 1 0 k 2 X l =1 ( − 1) 2 − l  k 2 l  l l 2 + k 2 X l =1 k 1 X m =1 ( − 1) 2 − l − m  k 2 l  k 1 m  lm ( l + m ) 2 i . The first tw o terms are 0 (by Iden tity 31). Along similar lines as the pro of of Lemma 2, the double summation term can b e expanded into three terms. Each of these terms can b e written as a pro duct of single summations. The only non-zero term here corresp onds to the 2 lm factor term and turns out to be 1 (from Iden tity (30)). Co efficien t of D ∇ q F ( θ ) , ∆ ⊗ . . . ∆ E δ q − 2 , 3 ≤ q ≤ k , is as follows: 1 q ! " c k 2 0 k 1 X m =1 ( − 1) 2 − m  k 1 m  m m q + c k 1 0 k 2 X l =1 ( − 1) 2 − l  k 2 l  l l q + k 1 X l =1 k 2 X m =1 ( − 1) 2 − l − m  k 2 l  k 1 m  lm ( l + m ) q # (35) = 0 . The first tw o terms in (35) are zero from (31) . The third term can b e shown to b e 0 along similar lines as the pro of of Lemma 1. W e no w consider the q = k + 1 case. Contin uing from (35), one of the first tw o terms will b e zero and the other non-zero, unlike the previous case. W e consider b oth these terms (as this algebra will b e useful for the range of q w e consider next in the pro of ) and the terms coming from l q and m q from the double summation in (35) b elo w first. W e obtain c k 2 0 k 1 X m =1 ( − 1) 2 − m  k 1 m  m k + c k 1 0 k 2 X l =1 ( − 1) 2 − l  k 2 l  l k k 2 X l =1 ( − 1) 1 − l  k 2 l  l k ! k 1 X m =1 ( − 1) 1 − m  k 1 m  m ! + + k 2 X l =1 ( − 1) 1 − l  k 2 l  ℓ ! k 1 X m =1 ( − 1) 1 − m  k 1 m  m k ! (36) = − c k 2 0 k 1 X m =1 ( − 1) 1 − m  k 1 m  m k − c k 1 0 k 2 X l =1 ( − 1) 1 − l  k 2 l  l k + " k 2 X l =1 ( − 1) 1 − l  k 2 l  l k # c k 1 0 + c k 2 0 " k 2 X m =1 ( − 1) 1 − m  k 2 m  m k # (37) = 0 . 17 The remaining terms of the Binomial expansion from the double summation of (35) can b e shown to b e 0 along similar lines as the proof of Lemma 2. There arises a single sum of the form  P k 2 l =1 ( − 1) 1 − l  k 2 l  l p  , 1 ≤ p ≤ ( k − 1) OR  P k 1 m =1 ( − 1) 1 − m  k 1 m  m k − 1  , b oth of which are 0 from (31). The next case that we consider is k + 2 ≤ q ≤ k ′ + 1: Contin uing from (35), we consider the first tw o terms and the terms coming from l q and m q from the double summation in (35) first. Note that the first t wo terms can b oth b e non-zero in this range of q (sp ecifically for q = k ′ + 1). 
The simplification of these four terms will b e exactly similar to (36) and (33), where m k and l k m ust b e replaced by m q − 1 and l q − 1 resp ectiv ely and hence will b e 0. Next, we consider the remaining terms of the Binomial expansion from the double summation of (35). Case 1: k = k 1 Let us first consider the term  q 1  lm q − 1 . The asso ciated double summation term gives us the following pro duct of single sums. q q ! k 2 X l =1 ( − 1) 1 − l  k 2 l  ! | {z } 1 k 1 X m =1 ( − 1) 1 − m  k 1 m  m q − 2 ! = 1 ( q − 1)! k X m =1 ( − 1) 1 − m  k m  m q − 2 ! . (38) As q ≥ k 1 + 2 = k + 2, the exp onen t of m is ≥ k 1 = k . Hence the ab o ve sum will b e non-zero. Consider the remaining q − 2 terms of the form  q i  l i m q − i , for i = 2 , . . . , q − 1. The asso ciated typical double summation term giv es us the following pro duct of single sums.  q i  " k 2 X l =1 ( − 1) 1 − l  k 2 l  l i − 1 # " k 1 X m =1 ( − 1) 1 − m  k 1 m  m q − i − 1 # . Consider the l summation ab ov e. The exp onen t of l is at least 1 as i ≥ 2. The exp onen t can b e at most q − 2. Since q ≤ k ′ + 1, i.e. q ≤ k 2 + 1, we hav e ( q − 2) ≤ ( k 2 − 1). Hence the exp onent of l is b et ween 1 and k 2 − 1. F rom Identit y 31, this sum is alwa ys zero. Hence, w e are only left with the term in (38). This exactly matches the co efficient in the summation of (21) with i replaced by q − 2. Case 2: k = k 2 Based on an analogous argument, we can arrive at the same expression as (38) for this case also. W e briefly outline this. Here, we need to consider the double summation term asso ciated with q l q − 1 m 1 , which is as follo ws. q q ! k 2 X l =1 ( − 1) 1 − l  k 2 l  l q − 2 ! k 1 X m =1 ( − 1) 1 − m  k 1 m  ! | {z } 1 = 1 ( q − 1)! k X l =1 ( − 1) 1 − l  k l  l q − 2 ! . (39) Consider the remaining q − 2 terms of the form  q j  l q − j m j , for j = 2 , . . . , q − 1. The asso ciated t ypical double summation term giv es us the following pro duct of single sums.  q j  k 2 X l =1 ( − 1) 1 − l  k 2 l  l q − j − 1 × k 1 X m =1 ( − 1) 1 − m  k 1 m  m j − 1 . The m -summation term can b e argued to v anish for all j = 2 , . . . , q − 1 in exactly the similar manner as the ℓ -summation in the previous case. Hence, we are left with the same expression as the previous case. This completes the pro of. 18 5.4 Pro of of Lemma 4 Pr o of. F rom Lemma 1, we ha ve H k F ( θ ) = (∆∆ T − I )  ∆ T ∇ 2 F ( θ )∆  + O ( δ k ) . The rest of the pro of follo ws from [ 23 ] (Chapter 6, Lemma 6.3 ) by taking the expectation on both sides. By using (A2), it is easy to see that E ∥ e H k − E [ e H k |F n ] ∥ 2 ≤ E [ ∥ ˜ H k ∥ 2 ] = O  1 δ 4  . 5.5 Pro of of Theorem 1 Pr o of. The pro of follo ws from Theorem 6.7 in [ 23 ] by satisfying all assumptions except A.6.11. The assumption A.6.11 follo ws from Lemma 4. 5.6 Pro of of Theorem 2 The pro of pro ceeds through a sequence of lemmas. W e follow the pro of technique from [ 2 ]. Unlik e [ 2 ], we do not assume f ( · , ξ ) is smo oth. Instead, w e work with a smo oth ob jective F and this leads to significant deviations in the pro of. Lemma 5. L et ¯ G k b e c ompute d as in Algorithm 1 with m n samples. Under assumptions (A1) - (A2), we have E h   ¯ G k − ∇ f ( θ n − 1 , ξ n )   2 i ≤ 2 C 2 m n δ 2 + 2 C 2 1 δ 2 k , (40) for some c onstants C 1 , C 2 > 0 . Pr o of. 
F rom Lemma 1 in [22], we ha ve    E [ b D k F ( θ n − 1 )] − ∇ F ( θ n − 1 )    ≤ C 1 δ k and E h ∥ b D k F ( θ n − 1 ) − E [ b D k F ( θ n − 1 )] ∥ 2 i ≤ C 2 δ 2 , (41) for some constan ts C 1 , C 2 . Now we calculate E [ ∥ ¯ G k − ∇ F ( θ n − 1 ) ∥ 2 ] = E [ ∥ ¯ G k − E [ ¯ G k ] + E [ ¯ G k ] − ∇ F ( θ n − 1 ) ∥ 2 ] ( a ) ≤ 2  E [ ∥ ¯ G k − E [ ¯ G k ] ∥ 2 + ∥ E [ ¯ G k ] − ∇ F ( θ n − 1 ) ∥ 2 ]  ( b ) ≤ 2 m 2 n C 2 δ 2 m n + 2 C 2 1 δ 2 k = 2 C 2 m n δ 2 + 2 C 2 1 δ 2 k . In the ab o v e, (a) follows from the fact that ∥ x + y ∥ 2 ≤ 2( ∥ x ∥ 2 + ∥ y ∥ 2 ) and (b) follo ws from (41). Lemma 6. L et ¯ H k b e c ompute d as in Algorithm 1 with b n samples and let b n ≥ C ( d ) = 4(1 + 2 log 2 d ) . Then under assumptions (A1) - (A2), we have E  ∥ ¯ H k − ∇ 2 F ( θ n − 1 ) ∥ 3  ≤ s  4 C 4 C ( d ) b n δ 4 + 2 C 2 3 δ k  24 C 2 4 b 2 n δ 8 + 8 C 4 3 δ 2 k  , for some c onstants C 3 , C 4 . 19 Pr o of. F rom Lemma 4, we ha ve    E h ˜ H k F ( θ, ξ ) i − ∇ 2 F ( θ )    ≤ C 3 δ k and E h ∥ ˜ H k F ( θ, ξ ) − E [ ˜ H k F ( θ, ξ )] ∥ 2 i ≤ C 4 δ 4 , (42) for some constan ts C 3 , C 4 . Note that E  ∥ ¯ H k − ∇ 2 F ( θ n − 1 ) ∥ 2  = E h ∥ ¯ H k − E [ ˜ H k F ( θ n − 1 , ξ i,n )] + E [ ˜ H k F ( θ n − 1 , ξ i,n )] − ∇ 2 F ( θ n − 1 ) ∥ 2 i ≤ 2  E h ∥ ¯ H k − E [ ˜ H k F ( θ n − 1 , ξ i,n )] ∥ 2 + E [ ∥ [ E [ ˜ H k F ( θ n − 1 , ξ i,n )] − ∇ 2 F ( θ n − 1 )] ∥ 2 i ≤ 2 E h ∥ ¯ H k − E [ ˜ H k F ( θ n − 1 , ξ i,n )] ∥ 2 i + 2 C 2 3 δ 2 k . (43) W e know from Theorem 1 in [33] that E h ∥ ¯ H k − E [ ˜ H k F ( θ n − 1 , ξ i,n )] ∥ 2 i ≤ 2 C ( d ) b 2 n     b n X i =1 E [∆ 2 i,n ]    + C ( d ) E h max i ∥ ∆ i,n ∥ 2 i , where ∆ i,n = ˜ H k F ( θ n − 1 , ξ i,n ) − E [ ˜ H k F ( θ n − 1 , ξ i,n )] and C ( d ) = 4(1 + 2 log 2 d ). Using (42), we hav e E [ ∥ ∆ i,n ∥ 2 ] ≤ C 4 δ 4 . Thus from (43), we get E h ∥ ¯ H k − ∇ 2 F ( θ k − 1 ) ∥ 2 i ≤ 4 C 4 C ( d ) b n δ 4 + 2 C 2 3 δ 2 k . (44) No w E h ∥ ¯ H k − ∇ 2 F ( θ n − 1 ) ∥ 4 i = E [ ∥ ¯ H k − E [ ˜ H k ( θ n − 1 , ξ i,n )] + E [ ˜ H k ( θ n − 1 , ξ i,n ) − ∇ 2 F ( θ n − 1 ) ∥ 4 ] ( c ) ≤ 8  E h ∥ ¯ H k − E [ ˜ H k ( θ n − 1 , ξ i,n ) ∥ 4 + ∥ E [ ˜ H k ( θ n − 1 , ξ i,n ) − ∇ 2 F ( θ n − 1 )] ∥ 4 i ( d ) ≤ 8 E h ∥ ¯ H k − E [ ˜ H k ( θ n − 1 , ξ i,n ) ∥ 4 i + 8 C 4 3 δ 4 k . (45) Here (c) follo ws from the fact that ∥ x + y ∥ 4 ≤ 8( ∥ x ∥ 4 + ∥ y ∥ 4 ) and (d) follo ws from (42). Now consider E h    ¯ H k − E [ ˜ H k F ( θ n − 1 , ξ i,n )]    4 F i = E h    1 b n b n X i =1 ˜ H k F ( θ n − 1 , ξ i,n ) − E [ ˜ H k ( θ n − 1 , ξ i,n )]    4 F i ( e ) = 1 b 4 n E h    b n X i =1 h ˜ H k ( θ n − 1 , ξ i,n ) − E [ ˜ H k ( θ n − 1 , ξ i,n )] i    4 F i ( f ) ≤ 1 b 4 n 3 b 2 n E h ∥ ∆ i,n ∥ 4 i ≤ 3 C 2 4 δ 8 b 2 n , where ∥ . ∥ F denotes the F rob enius norm. In the ab o ve, (e) follows from the norm prop ert y and (f ) follows from Lemma 16 in [20]. Thus from (45), we hav e E h ∥ ¯ H k − ∇ 2 F ( θ n − 1 ) ∥ 4 i = 24 C 2 4 δ 8 b 2 n + 8 C 4 3 δ 4 k . (46) 20 No w E h    ¯ H k − ∇ 2 F ( θ n − 1 )    3 i ( g ) ≤ E h    ¯ H k − ∇ 2 F ( θ n − 1 )    2 i E h    ¯ H k − ∇ 2 F ( θ n − 1 )    4 F i ! 1 2 ( h ) ≤ s  4 C 4 C ( d ) b n δ 4 + 2 C 2 3 δ 2 k  24 C 2 4 b 2 n δ 8 + 8 C 4 3 δ 4 k  ( i ) ≤ s  4 C 4 C ( d ) b n δ 4 + 2 C 2 3 δ k  24 C 2 4 b 2 n δ 8 + 8 C 4 3 δ 2 k  . In the ab o ve, (g) follows from the Holder inequality , (h) follows from (44) and (46) , and (i) holds as δ < 1. No w we state the following tw o lemmas from [2] to prov e the main result. Lemma 7. 
L et the iter ates { θ n } b e c ompute d by Algorithm 1. Then we have r E h ∥ θ n − θ n − 1 ∥ 2 i ≥ max ( s E [ ∥∇ F ( θ n − 1 ) ∥ ] − δ g n − δ H n L H + α k , − 2 α n + L H [ E [ λ min ( ∇ 2 F ( θ n ))] + q 2( α n + L H ) δ H n ] ) , wher e   E [ ¯ G k F ( θ n − 1 )] − ∇ F ( θ n − 1 )   ≤ ( δ g n ) 2 and E  ∥ ¯ H k − ∇ 2 F ( θ n − 1 ) ∥ 3  ≤ [2( α n + L H ) δ H n ] 3 / 2 , for some δ g n , δ H n > 0 . Pr o of. See Lemma 9 in [2]. Lemma 8. L et the iter ates { θ n } N n =1 b e c ompute d by Algorithm 1. Then we have E h ∥ θ R − θ R − 1 ∥ 3 i ≤ 36 P N k =1 α n h F ( θ 0 ) − F ∗ + N X k =1 4( δ g n ) 3 / 2 √ 3 α n + N X k =1 18 4 √ 2 α n ! 2  ( L H + α n ) δ H n  3 / 2 i , wher e θ R is chosen uniformly at r andom fr om { θ 1 , θ 2 , . . . , θ N } . Pr o of. See Lemma 10 in [2]. W e now prov e the main claim in Theorem 2. Pr o of. (The or em 2) With the choice of h yp erparameters defined in (27) , Lemma 5 and Lemma 6 are satisfied with δ g n = δ H n = ϵ . Now from Lemma 8, we obtain E [ ∥ θ R − θ R − 1 ∥ 3 ] ≤ 12 L H h F ( θ 0 ) − F ( θ ∗ ) N + 4 ϵ 3 / 2 3 √ L H + 288 √ 2 ϵ 3 / 2 √ L H i (47) ≤ 5000 ϵ 3 / 2 L 3 / 2 H . (48) Using Ly apunov inequality , we hav e [ E [ ∥∇ F ( θ R − θ R − 1 ) ∥ 2 ]] 1 / 2 ≤ [ E [ ∥∇ F ( θ R − θ R − 1 ) ∥ 3 ]] 1 / 3 21 ≤ 5000 1 / 3 ϵ 1 / 2 L 1 / 2 H . F rom Lemma 7, w e obtain p ∥ E [ ∇ F ( θ R )] ∥ ≤ 35 √ ϵ, E h − λ min ( ∇ 2 F ( θ R )) √ L H i ≤ 50 √ ϵ. The last inequalit y follows from the b elo w, which was given in the pro of of Lemma 9 in [2]. p E [ ∥ θ n − θ n − 1 ∥ 2 ] ≥ − 2 α n + 2 L H  q 2( α n + L H ) δ H n + E [ λ min ( ∇ 2 F ( θ n ))]  . (49) Th us 35 √ ϵ ≥ max  p E [ ∥∇ F ( θ R ) ∥ ] , − 1 50 √ L H E [ λ min ( ∇ 2 F ( θ R ))]  . (50) This concludes the pro of. 6 Sim ulation Exp eriments In this section, we implement the stochastic Newton metho d using our prop osed Hessian estimators with differen t v alues of k to ev aluate the p erformance on the well-kno wn Rastrigin ob jective defined b elo w. W e are in terested in solving the following d -dimensional minimization problem: min θ ∈ R d F ( θ ) . (51) The exact form of F ( θ ) is not kno wn, instead, a noisy observ ation f ( θ , ξ ) = F ( θ ) + ξ is made av ailable. W e tak e the noise term ξ as [ θ T , 1] z , where z is distributed ov er a ( d + 1)-dimensional multiv ariate Gaussian with zero mean and co v ariance σ 2 I d +1 . In our exp eriments, w e take σ = 0 . 001. Rastrigin F unction: W e use the d -dimensional Rastrigin ob jective function in our exp erimen t, which is defined as follo ws: f ( θ , ξ ) = 10 d + d X n =1 h θ 2 n − 10 cos (2 πθ n ) i + ξ , (52) where ξ is defined as b efore. The optimal v alue θ ∗ for the Rastrigin function is the d -dimensional v ector of zeros. W e use the follo wing metric similar to [25], [22] to test the p erformance of our prop osed metho d: P arameter error = ∥ θ T − θ ∗ ∥ 2 ∥ θ 0 − θ ∗ ∥ 2 , (53) where T is the final ep och of the simulation experiments. Note that T is different for different Hessian estimators for a given fixed budget (total num b er of function measurements). Th us, metho ds that hav e lo wer parameter errors would b e preferred o ver others. W e implement the following algorithms: 1. 2SPSA : This is an SPSA-based Newton metho d studied in [ 29 ], which requires four function measuremen ts p er iteration. Here, the p erturbation parameters are chosen from a symmetric ± 1-v alued Bernoulli distribution. 2. 
We implement the following algorithms:

1. 2SPSA: An SPSA-based Newton method studied in [29], which requires four function measurements per iteration. Here, the perturbation parameters are chosen from a symmetric $\pm 1$-valued Bernoulli distribution.

2. 2RDSA-Unif: An RDSA-based adaptive Newton method proposed in [25] with independent uniform random variables as random perturbations. This method requires three function measurements per iteration.

3. G2SF: The adaptive Newton method with our proposed Gaussian-perturbation-based Hessian estimators described in Section 2.2.

4. GSF: A gradient-based method with Gaussian perturbation parameters proposed in [22].

Experimental setup: In our experiments, we set $\delta(n) = \frac{0.9}{n^{0.16667}}$, $a(n) = \frac{0.9}{(n+20)^{0.9}}$, and $b(n) = \frac{0.9}{(n+10)^{0.56}}$. We test the performance of our proposed algorithm for $d = 5, 10, 50, 100$. All results are averaged over 10 independent runs.
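For reference, the step-size and perturbation schedules above can be written as simple functions of the iteration index n, as in the sketch below. The function names are ours, and the assignment of a(n) and b(n) to the individual recursions of Algorithm 1 is not repeated here.

```python
def delta(n):
    """Perturbation-size sequence delta(n) = 0.9 / n^0.16667, for n >= 1."""
    return 0.9 / n ** 0.16667

def a_step(n):
    """Step-size sequence a(n) = 0.9 / (n + 20)^0.9."""
    return 0.9 / (n + 20) ** 0.9

def b_step(n):
    """Step-size sequence b(n) = 0.9 / (n + 10)^0.56."""
    return 0.9 / (n + 10) ** 0.56
```

All three sequences decay to zero, with the perturbation size delta(n) decaying the slowest.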
Experimental results: In Table 1, we compare the second-order methods with the first-order methods for a varied number of function measurements. The table shows that the second-order methods outperform the first-order methods. In Tables 2 and 3, we run our proposed methods with different numbers of measurements, using Gaussian and uniform perturbation parameters, respectively. These results clearly show that second-order methods with more function measurements perform better, which validates our theoretical findings. Finally, in Table 4 we compare our proposed methods with the existing methods of [29] and [25], and observe that our methods exhibit superior performance.

Table 1: Parameter error (± standard deviation) for the GSF and G2SF algorithms under the Rastrigin objective defined in (52) with budgets of 5,000 and 10,000. GSF-5 corresponds to the five-measurement GSF estimator proposed in [22], and G2SF-9 denotes our nine-measurement Hessian estimator.

            Budget 5 × 10^3                 Budget 1 × 10^4
            GSF-5          G2SF-9           GSF-5          G2SF-9
  d = 5     0.35 ± 0.13    0.15 ± 0.07      0.40 ± 0.15    0.15 ± 0.17
  d = 10    0.41 ± 0.09    0.22 ± 0.06      0.41 ± 0.11    0.15 ± 0.05
  d = 50    0.61 ± 0.09    0.35 ± 0.04      0.48 ± 0.04    0.37 ± 0.17
  d = 100   0.81 ± 0.09    0.44 ± 0.11      0.67 ± 0.07    0.40 ± 0.13

Table 2: Parameter error (± standard deviation) for our algorithm under the Rastrigin objective defined in (52) with the three- and nine-measurement Hessian estimators G2SF-3 and G2SF-9, respectively, and with budgets of 2,000 and 30,000.

            Budget 2 × 10^3                 Budget 3 × 10^4
            G2SF-3         G2SF-9           G2SF-3         G2SF-9
  d = 5     0.40 ± 0.14    0.32 ± 0.12      0.55 ± 0.10    0.11 ± 0.28
  d = 10    0.46 ± 0.15    0.31 ± 0.15      0.57 ± 0.18    0.17 ± 0.16
  d = 50    0.48 ± 0.10    0.43 ± 0.05      0.52 ± 0.14    0.34 ± 0.15
  d = 100   0.45 ± 0.05    0.45 ± 0.04      0.48 ± 0.08    0.34 ± 0.10

Table 3: Parameter error (± standard deviation) under the Rastrigin objective defined in (52) with the three-measurement G2RDSA (G2R-3) and nine-measurement G2RDSA (G2R-9) Hessian estimators, with budgets of 5,000 and 10,000.

            Budget 5 × 10^3                 Budget 1 × 10^4
            G2R-3          G2R-9            G2R-3          G2R-9
  d = 5     0.52 ± 0.20    0.20 ± 0.14      0.63 ± 0.19    0.22 ± 0.13
  d = 10    0.52 ± 0.13    0.29 ± 0.11      0.47 ± 0.14    0.26 ± 0.06
  d = 50    0.46 ± 0.10    0.44 ± 0.05      0.45 ± 0.07    0.37 ± 0.05
  d = 100   0.48 ± 0.06    0.45 ± 0.03      0.51 ± 0.11    0.41 ± 0.02

Table 4: Parameter error (± standard deviation) comparison with existing approaches for the Rastrigin objective defined in (52) with d = 10 and budgets of 5,000 and 10,000.

  Budget       2SPSA [29]     2RDSA [25]     G2R-9          G2SF-9
  5 × 10^3     0.58 ± 0.15    0.44 ± 0.14    0.29 ± 0.11    0.22 ± 0.06
  1 × 10^4     0.55 ± 0.18    0.45 ± 0.14    0.26 ± 0.06    0.15 ± 0.05

7 Conclusions

We proposed a family of Hessian estimators that use only function measurements and demonstrated that estimators with more function measurements yield lower-order bias. We analyzed the asymptotic convergence of the stochastic Newton method with our proposed estimators. We also established a non-asymptotic bound for the proposed algorithm, which is effective at escaping saddle points. As future work, it would be interesting to explore stochastic Newton methods with our proposed estimators in real-world applications such as natural language processing and navigation for autonomous systems.

References

[1] Animashree Anandkumar and Rong Ge. Efficient approaches for escaping higher order saddle points in non-convex optimization. In Conference on Learning Theory, pages 81-102. PMLR, 2016.
[2] Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Foundations of Computational Mathematics, 22(1):35-76, 2022.
[3] S. Bhatnagar, H. L. Prasad, and L. A. Prashanth. Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods. Springer, 2013.
[4] Shalabh Bhatnagar. Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization. ACM Transactions on Modeling and Computer Simulation (TOMACS), 15(1):74-107, 2005.
[5] Shalabh Bhatnagar. Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization. ACM Transactions on Modeling and Computer Simulation (TOMACS), 18(1):1-35, 2007.
[6] Shalabh Bhatnagar and L. A. Prashanth. Simultaneous perturbation Newton algorithms for simulation optimization. Journal of Optimization Theory and Applications, 164(2):621-643, 2015.
[7] Shalabh Bhatnagar and L. A. Prashanth. Simultaneous perturbation Newton algorithms for simulation optimization. Journal of Optimization Theory and Applications, 164(2):621-643, 2015.
[8] Vaclav Fabian. Stochastic approximation of minima with improved asymptotic speed. The Annals of Mathematical Statistics, pages 191-200, 1967.
[9] Václav Fabian. Stochastic approximation. In Optimizing Methods in Statistics, pages 439-470. Elsevier, 1971.
[10] Michael C. Fu et al. Handbook of Simulation Optimization, volume 216. Springer, 2015.
[11] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.
[12] Kevin Gurney. An Introduction to Neural Networks. CRC Press, 2018.
[13] Yu-Chi Larry Ho and Xi-Ren Cao. Perturbation Analysis of Discrete Event Dynamic Systems, volume 145. Springer Science & Business Media, 2012.
[14] V. Ya. Katkovnik and O. Y. Kulchitsky. Convergence of a class of random search algorithms. Automation and Remote Control, 33(8):1321-1326, 1972.
[15] Jack Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, pages 462-466, 1952.
[16] Jacek Koronacki. Random-seeking methods for the stochastic unconstrained optimization. International Journal of Control, 21(3):517-527, 1975.
[17] H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer, 1978.
[18] Pierre L'Ecuyer and Peter W. Glynn. Stochastic optimization by simulation: Convergence proofs for the GI/G/1 queue in steady-state. Management Science, 40(11):1562-1578, 1994.
[19] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36:53038-53075, 2023.
[20] Mizhaan P. Maniyar, L. A. Prashanth, Akash Mondal, and Shalabh Bhatnagar. A cubic-regularized policy Newton algorithm for reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 4708-4716. PMLR, 2024.
[21] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.
[22] Soumen Pachal, Shalabh Bhatnagar, and Prashanth L. A. Generalized simultaneous perturbation-based gradient search with reduced estimator bias. IEEE Transactions on Automatic Control, 2025.
[23] L. A. Prashanth and Shalabh Bhatnagar. Gradient-based algorithms for zeroth-order optimization. Foundations and Trends in Optimization, 8(1-3):1-332, 2025.
[24] L. A. Prashanth, Shalabh Bhatnagar, Nirav Bhavsar, Michael Fu, and Steven I. Marcus. Random directions stochastic approximation with deterministic perturbations. IEEE Transactions on Automatic Control, 65(6):2450-2465, 2019.
[25] L. A. Prashanth, Shalabh Bhatnagar, Michael Fu, and Steve Marcus. Adaptive system optimization using random directions stochastic approximation. IEEE Transactions on Automatic Control, 62(5):2223-2238, 2016.
[26] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951.
[27] Reuven Y. Rubinstein and Alexander Shapiro. Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method, volume 13. Wiley, 1993.
[28] James C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332-341, 1992.
[29] James C. Spall. Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45(10):1839-1853, 2000.
[30] James C. Spall. Feedback and weighting mechanisms for improving Jacobian estimates in the adaptive simultaneous perturbation algorithm. IEEE Transactions on Automatic Control, 54(6):1216-1229, 2009.
[31] Richard S. Sutton, Andrew G. Barto, et al. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[32] Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I. Jordan. Stochastic cubic regularization for fast nonconvex optimization. Advances in Neural Information Processing Systems, 31, 2018.
[33] Joel A. Tropp. The expected norm of a sum of independent random matrices: An elementary approach. In High Dimensional Probability VII: The Cargèse Volume, pages 173-202. Springer, 2016.
[34] Long Wang, Qi Wang, James C. Spall, and Jingyi Zhu. Simultaneous perturbation stochastic approximation for mixed variables. IEEE Transactions on Automatic Control, 2025.
[35] Qi Wang and James C. Spall. Discrete simultaneous perturbation stochastic approximation on loss function with noisy measurements. In Proceedings of the 2011 American Control Conference, pages 4520-4525. IEEE, 2011.
[36] Xun Zhu and James C. Spall. A modified second-order SPSA optimization algorithm for finite samples. International Journal of Adaptive Control and Signal Processing, 16(5):397-409, 2002.
