Stochastic gradient descent on Riemannian manifolds
Authors: Silvere Bonnabel
Abstract

Stochastic gradient descent is a simple approach to find the local minima of a cost function whose evaluations are corrupted by noise. In this paper, we develop a procedure extending stochastic gradient descent algorithms to the case where the function is defined on a Riemannian manifold. We prove that, as in the Euclidean case, the gradient descent algorithm converges to a critical point of the cost function. The algorithm has numerous potential applications, and is illustrated here by four examples. In particular, a novel gossip algorithm on the set of covariance matrices is derived and tested numerically.

1 Introduction

Stochastic approximation provides a simple approach, of great practical importance, to find the local minima of a function whose evaluations are corrupted by noise. It has had a long history in optimization and control, with numerous applications (e.g. [19, 6, 36]). To demonstrate the main ideas on a toy example, we briefly mention a traditional procedure to optimize the ballistic trajectory of a projectile in a fluctuating wind. Successive gradient corrections (i.e. corrections proportional to the distance between the projectile impact and the target) are performed on the angle at which the projectile is launched. With a decreasing step size tending to zero, one can reasonably hope that the launching angle will converge to a fixed value such that the corresponding impacts are centered on the target on average. One of the first formal algorithms of this kind is the Robbins-Monro algorithm [29], which dates back to the 1950s.
It proves that for a smooth cost function $C(w)$ having a unique minimum, the algorithm $w_{t+1} = w_t - \gamma_t h_t(w_t)$, where $h_t(w_t)$ is a noisy evaluation of the gradient of $C$ at $w_t$, converges in quadratic mean to the minimum, under specific conditions on the sequence $\gamma_t$. Although stochastic gradient has found applications in control, system identification, and filtering theories (for instance, a Kalman filter for noisy observations of a constant process is a Robbins-Monro algorithm), new challenging applications stem from the active machine learning community. The work of L. Bottou a decade ago [9] has popularized the stochastic gradient approach, both to address the online learning problem (identification of a constant parameter in real time from noisy output measurements) and large-scale learning (with ever-increasing data sets, approximating the cost function with a simpler appropriate stochastic function can lead to a reduced numerical complexity). Some recent problems have been strong drivers for the development of new estimation methods, such as the one proposed in the present paper, dealing with stochastic gradient descent on manifolds.

(* Robotics lab, Mathématiques et systèmes, Mines ParisTech, 75272 Paris CEDEX, France; e-mail: silvere.bonnabel@mines-paristech.fr)

The paper is organized as follows. In Section 2, general stochastic gradient descent algorithms on Riemannian manifolds are introduced. The algorithms already used in [23, 22, 3, 4, 26] can all be cast in the proposed general framework. The main algorithms are completely intrinsic, i.e. they do not depend on a specific embedding of the manifold or on a choice of local coordinates. In Section 3, the convergence properties of the algorithms are analyzed. In the Euclidean case, almost sure (a.s.)
convergence of the parameter to a critical point of the cost function is well established under reasonable assumptions (see e.g. [9]), but this result has never been proven to hold for non-Euclidean spaces. In this paper, almost sure convergence of the proposed algorithms is obtained under several assumptions, extending the results of the Euclidean case to the Riemannian case.

In Section 4, the algorithms and the convergence results of the preceding sections are applied to four examples. The first example revisits the celebrated Oja algorithm [26] for online principal component analysis (PCA). This algorithm can be cast in our versatile framework, and its convergence properties immediately follow from the theorems of Section 3. Moreover, the other results of the present paper allow one to define alternative algorithms for online PCA with guaranteed convergence properties. The second example is concerned with the randomized computation of intrinsic means on a hyperbolic space, the Poincaré disk. This is a rather tutorial example, meant to illustrate the assumptions and results of the third theorem of Section 3. The convergence follows from this theorem. The last two examples are more detailed and include numerical experiments. The third example is concerned with a particular algorithm of [23]. The goal is to identify a positive semi-definite matrix (a kernel or a Mahalanobis distance) from noisy measurements. The theoretical convergence results of Section 3 allow us to complete the work of [23], and simulations illustrate the convergence properties. The last example is concerned with a consensus application on the set of covariance matrices (see e.g. [20]). A novel randomized gossip algorithm based on the Fisher Information Metric is proposed.
The algorithm has a meaningful statistical interpretation, and admits several invariance and guaranteed convergence properties that follow from the results of Section 3. As the state space is convex, the usual gossip algorithm [10] is well defined and can be implemented on this space. Simulations indicate the proposed Riemannian consensus algorithm converges faster than the usual gossip algorithm. Appendix A briefly presents some links with information geometry and Amari's natural gradient. Appendix B contains a brief recap of differential geometry. Preliminary results can be found in [8, 7].

2 Stochastic gradient on Riemannian manifolds

2.1 Standard stochastic gradient in R^n

Let $C(w) = \mathbb{E}_z Q(z,w) = \int Q(z,w)\,dP(z)$ be a three times continuously differentiable cost function, where $w \in \mathbb{R}^n$ is a minimization parameter and $dP$ is a probability measure on a measurable space $Z$. Consider the optimization problem

$\min_w C(w)$   (1)

In stochastic approximation, the cost function cannot be computed explicitly, as the distribution $dP$ is assumed to be unknown. Instead, one has access to a sequence of independent observations $z_1, z_2, \dots$ of a random variable drawn with probability law $dP$. At each time step $t$, the user can compute the so-called loss function $Q(z_t, w)$ for any parameter $w \in \mathbb{R}^n$. The loss can be viewed as an approximation of the (average) cost function $C(w)$ evaluated under the input $z_t \in Z$. Stochastic gradient descent is a standard technique to treat this problem. At each step the algorithm receives an input $z_t$ drawn according to $dP$, and performs a gradient descent on the approximated cost

$w_{t+1} = w_t - \gamma_t H(z_t, w_t)$

where $H(z,w)$ can be viewed as the gradient of the loss, i.e., on average $\mathbb{E}_z H(z,w) = \int H(z,w)\,dP(z) = \nabla C(w)$. As $C$ is not convex in many applications, one cannot hope for a much better result than almost sure (a.s.)
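As a concrete illustration of the Euclidean update above, the sketch below (our toy construction; the quadratic loss and the data model are assumed, not taken from the paper) runs stochastic gradient descent on $C(w) = \mathbb{E}_z \|w - z\|^2/2$, whose minimizer is the mean of the observations:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])      # unknown minimizer of C(w) = E_z ||w - z||^2 / 2

def H(z, w):
    # Gradient of the loss Q(z, w) = ||w - z||^2 / 2; its average is grad C(w) = w - mu.
    return w - z

w = np.zeros(3)
for t in range(5000):
    z = mu + rng.normal(size=3)      # noisy observation z_t drawn with law dP
    gamma = 1.0 / (t + 1)            # step sizes with sum = infinity, sum of squares finite
    w = w - gamma * H(z, w)

print(np.round(w, 2))                # close to mu
```

With this particular step size choice, $w_t$ is simply the running average of the observations, so the iterate approaches $\mu$ at the Monte Carlo rate.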
convergence of $C(w_t)$ to some value $C_\infty$, and convergence of $\nabla C(w_t)$ to $0$. Such a result holds under a set of standard assumptions, summarized in e.g. [9]. Note that a.s. convergence is a very desirable property, for instance in online estimation, as it ensures asymptotic convergence is always achieved in practice.

2.2 Limits of the approach: a motivating example

A topical problem that has attracted a lot of attention in the machine learning community over the last years is low-rank matrix estimation (or matrix completion, which can be viewed as the matrix counterpart of sparse approximation problems), and in particular the collaborative filtering problem: given a matrix $W^*_{ij}$ containing the preference ratings of users about items (movies, books), the goal is to compute personalized recommendations of these items. Only a small subset of entries $(i,j) \in \Xi$ is known, and there are many ways to complete the matrix. A standard approach to overcome this ambiguity, and to filter the noise, is to constrain the state space by assuming the tastes of the users are explained by a reduced number of criteria (say, $r$). This yields the following non-linear optimization problem

$\min_{W \in \mathbb{R}^{d_1 \times d_2}} \sum_{(i,j) \in \Xi} (W^*_{ij} - W_{ij})^2$ s.t. $\mathrm{rank}(W) = r$

The matrix being potentially of high dimension ($d_1 \simeq 10^5$, $d_2 \simeq 10^6$ in the so-called Netflix prize problem), a standard method to reduce the computational burden is to draw random elements of $\Xi$ and perform gradient descent ignoring the remaining entries. Unfortunately, the updated matrix $W - \gamma_t \nabla_W (W^*_{ij} - W_{ij})^2$ does not have rank $r$. Seeking the matrix of rank $r$ which best approximates it can be numerically costly, especially for very large $d_1, d_2$. A more natural way to enforce the rank constraint is to endow the parameter space with a Riemannian metric, and to perform a gradient step within the manifold of fixed-rank matrices.
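The rank obstruction just described is easy to observe numerically. The snippet below (a toy illustration, not the paper's algorithm; the dimensions, entry, and step size are arbitrary choices of ours) takes a rank-$r$ matrix, performs one Euclidean gradient step on a single observed entry, and checks that the iterate leaves the rank-$r$ set:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 10, 8, 2

# A rank-r parameter matrix W and a generic target W*.
W = rng.normal(size=(d1, r)) @ rng.normal(size=(r, d2))
W_star = rng.normal(size=(d1, d2))

# The Euclidean gradient of (W*_ij - W_ij)^2 with respect to W touches only entry (i, j).
i, j, gamma = 3, 5, 0.1
G = np.zeros((d1, d2))
G[i, j] = -2.0 * (W_star[i, j] - W[i, j])
W_new = W - gamma * G

print(np.linalg.matrix_rank(W))      # 2
print(np.linalg.matrix_rank(W_new))  # 3: the rank-one correction breaks the constraint
```

Projecting `W_new` back to rank $r$ requires a truncated SVD at every step, which motivates the intrinsic fixed-rank updates discussed next.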
In [22], this approach has led to stochastic gradient algorithms that compete with state-of-the-art methods. Yet a convergence proof is still lacking. The convergence results below are general, and in Section 4.3 they will be shown to apply to this problem in the particular case where $W^*$ is symmetric positive definite.

2.3 Proposed general stochastic gradient algorithm on Riemannian manifolds

In this paper we propose a new procedure to address problem (1), where $C(w) = \mathbb{E}_z Q(z,w)$ is a three times continuously differentiable cost function and where $w$ is now a minimization parameter belonging to a smooth connected Riemannian manifold $M$. On $M$, we propose to replace the usual update with the following update

$w_{t+1} = \exp_{w_t}(-\gamma_t H(z_t, w_t))$   (2)

where $\exp_w$ is the exponential map at $w$, and $H(z,w)$ can be viewed as the Riemannian gradient of the loss, i.e., we have on average $\mathbb{E}_z H(z,w) = \int H(z,w)\,dP(z) = \nabla C(w)$, where $\nabla C(w)$ denotes the Riemannian gradient of $C$ at $w \in M$. The proposed update (2) is a straightforward transposition of the standard gradient update in the Euclidean case. Indeed, $H(z,w)$ is a tangent vector to the manifold that describes the direction of steepest descent for the loss. In update (2), the parameter moves along the geodesic emanating from the current parameter position $w_t$, in the direction defined by $H(z_t, w_t)$ and with intensity $\|H(z_t, w_t)\|$. If the manifold at hand is $\mathbb{R}^n$ equipped with the usual Euclidean scalar product, the geodesics are straight lines, and the definitions coincide. Note that the procedure here is totally intrinsic, i.e. the algorithm is completely independent of the choice of local coordinates on the manifold.
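A minimal instance of update (2) can be written down on the unit sphere, where the exponential map is known in closed form (follow the great circle). The construction below is our illustration, not an example from the paper (though it is essentially the $r = 1$ case of the subspace-tracking example of Section 4.1): the loss $Q(z,w) = -(z^T w)^2/2$ has Riemannian gradient $H(z,w) = -(z^T w)(I - ww^T)z$, and minimizing $C(w) = \mathbb{E}_z Q(z,w)$ over the sphere estimates the dominant eigenvector of $\mathbb{E}(zz^T)$:

```python
import numpy as np

rng = np.random.default_rng(1)

def exp_sphere(w, v):
    # Exponential map on the unit sphere: follow the great circle from w in direction v.
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return w
    return np.cos(nv) * w + np.sin(nv) * v / nv

def H(z, w):
    # Riemannian gradient of Q(z, w) = -(z.w)^2 / 2: project the Euclidean
    # gradient -(z.w) z onto the tangent space at w.
    return -(z @ w) * (z - (z @ w) * w)

# Samples with covariance A = diag(3, 1, 0.5); the minimizer of C is the top eigenvector.
sqrtA = np.sqrt(np.array([3.0, 1.0, 0.5]))
w = rng.normal(size=3)
w /= np.linalg.norm(w)
for t in range(5000):
    z = sqrtA * rng.normal(size=3)
    w = exp_sphere(w, -1.0 / (1 + t) * H(z, w))

print(abs(np.linalg.norm(w) - 1.0) < 1e-8)  # True: iterates stay on the manifold
print(abs(w[0]))                            # alignment with the top eigenvector
```

Note that, unlike the extrinsic update of Section 2.2, the iterate never leaves the manifold, so no projection step is needed.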
In many cases, the exponential map is not easy to compute (a calculus of variations problem must be solved, or the Christoffel symbols need to be known), and it is much easier and much faster to use a first-order approximation of the exponential, called a retraction. Indeed, a retraction $R_w : T_wM \to M$ maps the tangent space at $w$ to the manifold, and is such that $d(R_w(tv), \exp_w(tv)) = O(t^2)$. It yields the alternative update

$w_{t+1} = R_{w_t}(-\gamma_t H(z_t, w_t))$   (3)

Let us give a simple example to illustrate the ideas: if the manifold were the sphere $S^{n-1}$ endowed with the natural metric inherited through immersion in $\mathbb{R}^n$, a retraction would consist of a simple addition in the ambient space $\mathbb{R}^n$ followed by a projection onto the sphere. This is a numerically very simple operation that avoids calculating the geodesic explicitly. See the Appendix for more details on Riemannian manifolds.

3 Convergence results

In this section, the convergence of the proposed algorithms (2) and (3) is analyzed. The parameter is proved to converge almost surely to a critical point of the cost function in various cases and under various conditions. More specifically, three general results are derived.

In Subsection 3.1, a first general result is derived: when the parameter $w \in M$ is proved to remain in a compact set, the algorithm (2) converges a.s. under standard conditions on the step size sequence. This theorem applies in particular to all connected compact manifolds. Important examples of such manifolds in applications are the orthogonal group, the group of rotations, the sphere, the real projective space, and the Grassmann and Stiefel manifolds. In Subsection 3.2, the result is proved to hold when a twice continuously differentiable retraction is used instead of the exponential map.
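For the sphere example just given, the projection retraction and the exact exponential map can be compared directly. The check below (our illustration, with an arbitrary base point and tangent direction) verifies the defining property $d(R_w(tv), \exp_w(tv)) = O(t^2)$:

```python
import numpy as np

def exp_sphere(w, v):
    # Exact exponential map on the unit sphere (v a nonzero tangent vector at w).
    nv = np.linalg.norm(v)
    return np.cos(nv) * w + np.sin(nv) * v / nv

def retract_sphere(w, v):
    # Projection retraction: add in the ambient space, then renormalize.
    x = w + v
    return x / np.linalg.norm(x)

w = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])      # unit tangent vector at w

for t in [0.3, 0.1, 0.03]:
    a = exp_sphere(w, t * v)
    b = retract_sphere(w, t * v)
    dist = np.arccos(np.clip(a @ b, -1.0, 1.0))   # geodesic distance on the sphere
    print(t, dist <= t**2)                        # True: within the O(t^2) bound
```

In this particular case the discrepancy is in fact of order $t^3$ (here it equals $|t - \arctan t|$), comfortably within the $O(t^2)$ requirement.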
Finally, in Subsection 3.3, we consider a slightly modified version of algorithm (2) on specific non-positively curved Riemannian manifolds. The step size $\gamma_t$ is adapted at each step in order to take into account the effects of negative curvature that tend to destabilize the algorithm. Under a set of mild assumptions naturally extending those of the Euclidean case, the parameter is proved to a.s. remain in a compact set, and thus a.s. convergence is proved. Important examples of such manifolds are the Poincaré disk or the Poincaré half-plane, and the space of real symmetric positive definite matrices $P_+(n)$.

The sequence of step sizes $(\gamma_t)_{t \ge 0}$ will satisfy the usual condition in stochastic approximation:

$\sum \gamma_t^2 < \infty$ and $\sum \gamma_t = +\infty$   (4)

3.1 Convergence on compact sets

The following theorem proves the a.s. convergence of the algorithm under some assumptions, when the trajectories have been proved to remain in a predefined compact set at all times. This is of course the case if $M$ is compact.

Theorem 1. Consider the algorithm (2) on a connected Riemannian manifold $M$ with injectivity radius uniformly bounded from below by $I > 0$. Assume the sequence of step sizes $(\gamma_t)_{t \ge 0}$ satisfies the standard condition (4). Suppose there exists a compact set $K$ such that $w_t \in K$ for all $t \ge 0$. We also suppose that the gradient is bounded on $K$, i.e. there exists $A > 0$ such that for all $w \in K$ and $z \in Z$ we have $\|H(z,w)\| \le A$. Then $C(w_t)$ converges a.s. and $\nabla C(w_t) \to 0$ a.s.

Proof. The proof builds upon the usual proof in the Euclidean case (see e.g. [9]). As the parameter is proved to remain in a compact set, all continuous functions of the parameter can be bounded. Moreover, as $\gamma_t \to 0$, there exists $t_0$ such that for $t \ge t_0$ we have $\gamma_t A < I$. Suppose now that $t \ge t_0$; then there exists a geodesic $s \mapsto \exp_{w_t}(-s\gamma_t H(z_t, w_t))$, $0 \le s \le 1$, linking $w_t$ and $w_{t+1}$, as $d(w_t, w_{t+1}) < I$.
We have $C(\exp_{w_t}(-\gamma_t H(z_t, w_t))) = C(w_{t+1})$, and thus the Taylor formula implies that (see Appendix)

$C(w_{t+1}) - C(w_t) \le -\gamma_t \langle H(z_t, w_t), \nabla C(w_t)\rangle + \gamma_t^2 \|H(z_t, w_t)\|^2 k_1$   (5)

where $k_1$ is an upper bound on the Riemannian Hessian of $C$ in the compact set $K$. Let $\mathcal{F}_t$ be the increasing sequence of $\sigma$-algebras generated by the variables available just before time $t$:

$\mathcal{F}_t = \{z_0, \cdots, z_{t-1}\}$

$w_t$, being computed from $z_0, \cdots, z_{t-1}$, is $\mathcal{F}_t$-measurable. As $z_t$ is independent from $\mathcal{F}_t$, we have $\mathbb{E}[\langle H(z_t, w_t), \nabla C(w_t)\rangle \mid \mathcal{F}_t] = \mathbb{E}_z[\langle H(z, w_t), \nabla C(w_t)\rangle] = \|\nabla C(w_t)\|^2$. Thus

$\mathbb{E}(C(w_{t+1}) - C(w_t) \mid \mathcal{F}_t) \le -\gamma_t \|\nabla C(w_t)\|^2 + \gamma_t^2 A^2 k_1$   (6)

as $\|H(z_t, w_t)\| \le A$. As $C(w_t) \ge 0$, this proves $C(w_t) + \sum_{k=t}^{\infty} \gamma_k^2 A^2 k_1$ is a nonnegative supermartingale, hence it converges a.s., implying that $C(w_t)$ converges a.s. Moreover, summing the inequalities we have

$\sum_{t \ge t_0} \gamma_t \|\nabla C(w_t)\|^2 \le -\sum_{t \ge t_0} \mathbb{E}(C(w_{t+1}) - C(w_t) \mid \mathcal{F}_t) + \sum_{t \ge t_0} \gamma_t^2 A^2 k_1$   (7)

Here we would like to prove the right term is bounded, so that the left term converges. But the fact that $C(w_t)$ converges does not imply it has bounded variations. However, as in the Euclidean case, we can use a theorem by D.L. Fisk [15] ensuring that $C(w_t)$ is a quasi-martingale, i.e., it can be decomposed into a sum of a martingale and a process whose trajectories are of bounded variation. For a random variable $X$, let $X^+$ denote the quantity $\max(X, 0)$.

Proposition 1 (Fisk (1965)). Let $(X_n)_{n \in \mathbb{N}}$ be a non-negative stochastic process with bounded positive variations, i.e., such that $\sum_{0}^{\infty} \mathbb{E}([\mathbb{E}(X_{n+1} - X_n \mid \mathcal{F}_n)]^+) < \infty$. Then the process is a quasi-martingale, i.e.

$\sum_{0}^{\infty} |\mathbb{E}[X_{n+1} - X_n \mid \mathcal{F}_n]| < \infty$ a.s., and $X_n$ converges a.s.
Summing (6) over $t$, it is clear that $C(w_t)$ satisfies the proposition's assumptions, and thus $C(w_t)$ is a quasi-martingale, implying $\sum_{t \ge t_0} \gamma_t \|\nabla C(w_t)\|^2$ converges a.s. because of inequality (7), where the central term can be bounded by its absolute value, which is convergent thanks to the proposition. But, as $\gamma_t \to 0$, this does not prove $\|\nabla C(w_t)\|$ converges a.s. However, if $\|\nabla C(w_t)\|$ is proved to converge a.s., it can only converge to 0 a.s. because of condition (4). Now consider the nonnegative process $p_t = \|\nabla C(w_t)\|^2$. Bounding the second derivative of $\|\nabla C\|^2$ by $k_2$ along the geodesic linking $w_t$ and $w_{t+1}$, a Taylor expansion yields

$p_{t+1} - p_t \le -2\gamma_t \langle \nabla C(w_t), (\nabla^2_{w_t} C)\, H(z_t, w_t)\rangle + \gamma_t^2 \|H(z_t, w_t)\|^2 k_2$

and thus, bounding the Hessian of $C$ on the compact set from below by $-k_3$, we have

$\mathbb{E}(p_{t+1} - p_t \mid \mathcal{F}_t) \le 2\gamma_t \|\nabla C(w_t)\|^2 k_3 + \gamma_t^2 A^2 k_2.$

We just proved the sum of the right term is finite. It implies $p_t$ is a quasi-martingale, and thus implies a.s. convergence of $p_t$ towards a value $p_\infty$, which can only be 0.

3.2 Convergence with a retraction

In this section, we prove that Theorem 1 still holds when a retraction is used instead of the exponential map.

Theorem 2. Let $M$ be a connected Riemannian manifold with injectivity radius uniformly bounded from below by $I > 0$. Let $R_w$ be a twice continuously differentiable retraction, and consider the update (3). Assume the sequence of step sizes $(\gamma_t)_{t \ge 0}$ satisfies the standard condition (4). Suppose there exists a compact set $K$ such that $w_t \in K$ for all $t \ge 0$. We suppose also that the gradient is bounded in $K$, i.e. for $w \in K$ we have $\|H(z,w)\| \le A$ for all $z$ and some $A > 0$. Then $C(w_t)$ converges a.s. and $\nabla C(w_t) \to 0$ a.s.

Proof. Let $w^{\exp}_{t+1} = \exp_{w_t}(-\gamma_t H(z_t, w_t))$.
The proof essentially relies on the fact that the points $w_{t+1}$ and $w^{\exp}_{t+1}$ are close to each other on the manifold for sufficiently large $t$. Indeed, as the retraction is twice continuously differentiable, there exists $r > 0$ such that $d(R_w(sv), \exp_w(sv)) \le r s^2$ for $s$ sufficiently small, $\|v\| = 1$, and $w \in K$. As for $t$ sufficiently large $\gamma_t A$ can be made arbitrarily small (in particular smaller than the injectivity radius), this implies $d(w^{\exp}_{t+1}, w_{t+1}) \le \gamma_t^2 r A^2$. We can now reiterate the proof of Theorem 1. We have

$C(w_{t+1}) - C(w_t) \le |C(w_{t+1}) - C(w^{\exp}_{t+1})| + C(w^{\exp}_{t+1}) - C(w_t).$

The term $C(w^{\exp}_{t+1}) - C(w_t)$ can be bounded as in (5), whereas we have just proved $|C(w_{t+1}) - C(w^{\exp}_{t+1})|$ is bounded by $k_1 r \gamma_t^2 A^2$, where $k_1$ is a bound on the Riemannian gradient of $C$ in $K$. Thus $C(w_t)$ is a quasi-martingale and $\sum_1^{\infty} \gamma_t \|\nabla C(w_t)\|^2 < \infty$. It means that if $\|\nabla C(w_t)\|$ converges, it can only converge to zero. Let us consider the variations of the function $p(w) = \|\nabla C(w)\|^2$. Writing

$p(w_{t+1}) - p(w_t) \le |p(w_{t+1}) - p(w^{\exp}_{t+1})| + p(w^{\exp}_{t+1}) - p(w_t)$

and bounding the first term of the right-hand side by $k_3 r \gamma_t^2 A^2$, where $k_3$ is a bound on the gradient of $p$, we see the inequalities of Theorem 1 are unchanged up to second-order terms in $\gamma_t$. Thus $p(w_t)$ is a quasi-martingale and thus converges.

3.3 Convergence on Hadamard manifolds

In the previous section, we proved convergence as long as the parameter is known to remain in a compact set. For some manifolds, the algorithm can be proved to converge without this assumption. This is the case for instance in the Euclidean space, where the trajectories can be proved to be confined to a compact set under a set of conditions [9].
In this section, we extend those conditions to the important class of Hadamard manifolds, and we prove convergence. Hadamard manifolds are complete, simply-connected Riemannian manifolds with nonpositive sectional curvature. In order to account for curvature effects, the step size must be slightly adapted at each iteration. This step adaptation yields a more flexible algorithm, and allows one to relax one of the standard conditions even in the Euclidean case. Hadamard manifolds have strong properties. In particular, the exponential map at any point is globally invertible (e.g. [27]). Let $D(w_1, w_2) = d^2(w_1, w_2)$ be the squared geodesic distance. Consider the following assumptions, which can be viewed as an extension of the usual ones in the Euclidean case:

1. There is a point $v \in M$ and $S > 0$ such that the opposite of the gradient points towards $v$ when $d(w,v)$ becomes larger than $\sqrt{S}$, i.e.

$\sup_{D(w,v) > S} \langle \exp_w^{-1}(v), \nabla C(w)\rangle < 0$

2. There exists a lower bound on the sectional curvature, denoted by $\kappa < 0$.

3. There exists a continuous function $f : M \to \mathbb{R}$ that satisfies

$f(w)^2 \ge \max\Bigl\{1,\ \mathbb{E}_z\bigl[\|H(z,w)\|^2\bigl(1 + \sqrt{|\kappa|}\,(\sqrt{D(w,v)} + \|H(z,w)\|)\bigr)\bigr],\ \mathbb{E}_z\bigl[\bigl(2\|H(z,w)\|\sqrt{D(w,v)} + \|H(z,w)\|^2\bigr)^2\bigr]\Bigr\}$

Theorem 3. Let $M$ be a Hadamard manifold. Consider the optimization problem (1). Under Assumptions 1-3, the modified algorithm

$w_{t+1} = \exp_{w_t}\bigl(-\tfrac{\gamma_t}{f(w_t)} H(z_t, w_t)\bigr)$   (8)

is such that $C(w_t)$ converges a.s. and $\nabla C(w_t) \to 0$ a.s.

Assumptions 1-3 are mild assumptions that encompass the Euclidean case. In this latter case, Assumption 3 is usually replaced with the stronger condition $\mathbb{E}_z(\|H(z,w)\|^k) \le A + B\|w\|^k$ for $k = 2, 3, 4$ (note that this condition immediately implies the existence of the function $f$).
Indeed, on the one hand, our general procedure based on the adaptive step $\gamma_t / f(w_t)$ allows one to relax this standard condition, also in the Euclidean case, as will be illustrated by the example of Section 4.3. On the other hand, contrarily to the Euclidean case, one could object that the user must provide at each step an upper bound on a function of $D(w,v)$, where $v$ is the point appearing in Assumption 1, which requires some knowledge of $v$. This can appear to be a limitation, but in fact finding a point $v$ fulfilling Assumption 1 may be quite obvious in practice, and may be far from requiring direct knowledge of the point the algorithm is supposed to converge to, as illustrated by the example of Section 4.2.

Proof. The following proof builds upon the Euclidean case [9]. We are first going to prove that the trajectories asymptotically remain in a compact set; Theorem 1 will then easily apply. A second-order Taylor expansion yields

$D(w_{t+1}, v) - D(w_t, v) \le 2\tfrac{\gamma_t}{f(w_t)} \langle H(z_t, w_t), \exp_{w_t}^{-1}(v)\rangle + \bigl(\tfrac{\gamma_t}{f(w_t)}\bigr)^2 \|H(z_t, w_t)\|^2 k_1$   (9)

where $k_1$ is an upper bound on the operator norm of half of the Riemannian Hessian of $D(\cdot, v)$ along the geodesic joining $w_t$ to $w_{t+1}$ (see the Appendix). If the sectional curvature is bounded from below by $\kappa < 0$, we have ([12], Lemma 3.12)

$\lambda_{\max} \nabla^2_w (D(w,v)/2) \le \sqrt{|\kappa| D(w,v)}\ \coth\bigl(\sqrt{|\kappa| D(w,v)}\bigr)$

where $\nabla^2_w (D(w,v)/2)$ is the Hessian of the squared half distance and $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of an operator. This implies that $\lambda_{\max} \nabla^2_w (D(w,v)/2) \le \sqrt{|\kappa| D(w,v)} + 1$. Moreover, along the geodesic linking $w_t$ and $w_{t+1}$, the triangle inequality implies $\sqrt{D(w,v)} \le \sqrt{D(w_t,v)} + \|H(z_t,w_t)\|$, as $f(w_t) \ge 1$ and there exists $t_0$ such that $\gamma_t \le 1$ for $t \ge t_0$.
Thus $k_1 \le \beta(z_t, w_t)$ for $t \ge t_0$, where $\beta(z_t, w_t) = 1 + \sqrt{|\kappa|}\bigl(\sqrt{D(w_t,v)} + \|H(z_t,w_t)\|\bigr)$. Let $\mathcal{F}_t$ be the increasing sequence of $\sigma$-algebras generated by the several variables available just before time $t$: $\mathcal{F}_t = \{z_0, \cdots, z_{t-1}\}$. As $z_t$ is independent from $\mathcal{F}_t$, and $w_t$ is $\mathcal{F}_t$-measurable, we have

$\mathbb{E}\bigl[\bigl(\tfrac{\gamma_t}{f(w_t)}\bigr)^2 \|H(z_t,w_t)\|^2 k_1 \,\big|\, \mathcal{F}_t\bigr] \le \bigl(\tfrac{\gamma_t}{f(w_t)}\bigr)^2\, \mathbb{E}_z\bigl[\|H(z,w_t)\|^2 \beta(z,w_t)\bigr].$

Conditioning (9) on $\mathcal{F}_t$, and using Assumption 3:

$\mathbb{E}[D(w_{t+1},v) - D(w_t,v) \mid \mathcal{F}_t] \le 2\tfrac{\gamma_t}{f(w_t)} \langle \nabla C(w_t), \exp_{w_t}^{-1}(v)\rangle + \gamma_t^2$   (10)

Let $\varphi : \mathbb{R}^+ \to \mathbb{R}^+$ be a smooth function such that

- $\varphi(x) = 0$ for $0 \le x \le S$
- $0 < \varphi''(x) \le 2$ for $S < x \le S+1$
- $\varphi'(x) = 1$ for $x \ge S+1$

and let $h_t = \varphi(D(w_t, v))$. Let us prove it converges a.s. to 0. As $\varphi''(x) \le 2$ for all $x \ge 0$, a second-order Taylor expansion of $\varphi$ yields

$h_{t+1} - h_t \le [D(w_{t+1},v) - D(w_t,v)]\,\varphi'(D(w_t,v)) + (D(w_{t+1},v) - D(w_t,v))^2$

Because of the triangle inequality, we have $d(w_{t+1}, v) \le d(w_t, v) + \tfrac{\gamma_t}{f(w_t)}\|H(z_t,w_t)\|$. Thus

$D(w_{t+1},v) - D(w_t,v) \le 2\, d(w_t,v)\,\tfrac{\gamma_t}{f(w_t)}\|H(z_t,w_t)\| + \bigl(\tfrac{\gamma_t}{f(w_t)}\bigr)^2\|H(z_t,w_t)\|^2$

which is less than $\tfrac{\gamma_t}{f(w_t)}\bigl[2\, d(w_t,v)\|H(z_t,w_t)\| + \|H(z_t,w_t)\|^2\bigr]$ for $t \ge t_0$. Using Assumption 3 and the fact that $w_t$ is $\mathcal{F}_t$-measurable, we have

$\mathbb{E}[h_{t+1} - h_t \mid \mathcal{F}_t] \le \varphi'(D(w_t,v))\,\mathbb{E}[D(w_{t+1},v) - D(w_t,v) \mid \mathcal{F}_t] + \gamma_t^2.$

Using (10) we have

$\mathbb{E}[h_{t+1} - h_t \mid \mathcal{F}_t] \le 2\tfrac{\gamma_t}{f(w_t)}\langle\nabla C(w_t), \exp_{w_t}^{-1}(v)\rangle\,\varphi'(D(w_t,v)) + 2\gamma_t^2$   (11)

as $\varphi'$ is positive and less than 1. Either $D(w_t,v) \le S$, and then we have $\varphi'(D(w_t,v)) = 0$ and thus $\mathbb{E}[h_{t+1} - h_t \mid \mathcal{F}_t] \le 2\gamma_t^2$; or $D(w_t,v) > S$, and Assumption 1 ensures $\langle\nabla C(w_t), \exp_{w_t}^{-1}(v)\rangle$ is negative.
As $\varphi' \ge 0$, (11) implies $\mathbb{E}[h_{t+1} - h_t \mid \mathcal{F}_t] \le 2\gamma_t^2$. In both cases $\mathbb{E}[h_{t+1} - h_t \mid \mathcal{F}_t] \le 2\gamma_t^2$, proving $h_t + 2\sum_{k=t}^{\infty}\gamma_k^2$ is a positive supermartingale; hence it converges a.s. Let us prove it necessarily converges to 0. We have $\sum_{t_0}^{\infty} \mathbb{E}\bigl([\mathbb{E}(h_{t+1} - h_t \mid \mathcal{F}_t)]^+\bigr) \le 2\sum_t \gamma_t^2 < \infty$. Proposition 1 proves that $h_t$ is a quasi-martingale. Using (11) we have the inequality

$-2\sum_{t_0}^{\infty} \tfrac{\gamma_t}{f(w_t)}\langle\nabla C(w_t), \exp_{w_t}^{-1}(v)\rangle\,\varphi'(D(w_t,v)) \le 2\sum_{t_0}^{\infty}\gamma_t^2 - \sum_{t_0}^{\infty}\mathbb{E}[h_{t+1} - h_t \mid \mathcal{F}_t]$

and, as $h_t$ is a quasi-martingale, we have a.s.

$-\sum_{t_0}^{\infty} \tfrac{\gamma_t}{f(w_t)}\langle\nabla C(w_t), \exp_{w_t}^{-1}(v)\rangle\,\varphi'(D(w_t,v)) \le 2\Bigl|\sum_{t_0}^{\infty}\gamma_t^2\Bigr| + \sum_{t_0}^{\infty}\bigl|\mathbb{E}[h_{t+1} - h_t \mid \mathcal{F}_t]\bigr| < \infty$   (12)

Consider a sample trajectory for which $h_t$ converges to $\alpha > 0$. It means that for $t$ large enough, $D(w_t,v) > S$ and thus $\varphi'(D(w_t,v)) > \epsilon_1 > 0$. Because of Assumption 1, we also have $\langle\nabla C(w_t), \exp_{w_t}^{-1}(v)\rangle < -\epsilon_2 < 0$. This contradicts (12), as $\sum_{t_0}^{\infty} \tfrac{\gamma_t}{f(w_t)} = \infty$. The last equality comes from (4) and the fact that $f$ is continuous and thus bounded along the trajectory.

It has been proved that almost every trajectory asymptotically enters the ball of center $v$ and radius $\sqrt{S}$ and stays inside it. Let us prove that we can work on a fixed compact set. Let $G_n = \bigcap_{t > n} \{D(w_t, v) \le S\}$. We have just proved $P(\cup G_n) = 1$. Thus, to prove a.s. convergence, it is sufficient to prove a.s. convergence on each of those sets. We assume from now on that the trajectories all belong to the ball of center $v$ and radius $\sqrt{S}$. As this is a compact set, all continuous functions of the parameter can be bounded. In particular, $\gamma_t / k_3 \le \gamma_t / f(w_t) \le \gamma_t$ for some $k_3 > 0$, and thus the modified step size verifies the conditions of Theorem 1. Moreover, $\mathbb{E}_z(\|H(z,w)\|^2) \le A^2$ for some $A > 0$ on the compact set, as it is dominated by $f(w)^2$.
As there is no cut locus, this weaker condition is sufficient, since it implies that (6) holds. The proof follows from a mere application of Theorem 1 on this compact set.

Note that it would be possible to derive an analogous result when a retraction is used instead of the exponential, using the ideas of the proof of Theorem 2. However, due to a lack of relevant examples, this result is not presented.

4 Examples

Four application examples are presented. The first two examples are rather tutorial. The first one illustrates Theorems 1 and 2. The second one allows us to provide a graphical interpretation of Theorem 3 and its assumptions. The third and fourth examples are more detailed and include numerical experiments. Throughout this section, $\gamma_t$ is a sequence of positive step sizes satisfying the usual condition (4).

4.1 Subspace tracking

We propose to first revisit, in the light of the preceding results, the well-known subspace tracking algorithm of Oja [26], which is a generalization of the power method for computing the dominant eigenvector. In several applications, one wants to compute the $r$ principal eigenvectors, i.e. perform principal component analysis (PCA), of an $n \times n$ covariance matrix $A$, where $r \le n$. Furthermore, for computational reasons or for adaptiveness, the measurements are supposed to be a stream of $n$-dimensional data vectors $z_1, \cdots, z_t, \cdots$ where $\mathbb{E}(z_t z_t^T) = A$ (online estimation). The problem boils down to estimating an element of the Grassmann manifold $\mathrm{Gr}(r, n)$ of $r$-dimensional subspaces in an $n$-dimensional ambient space, which can be identified with the set of rank-$r$ projectors:

$\mathrm{Gr}(r,n) = \{P \in \mathbb{R}^{n \times n}\ \mathrm{s.t.}\ P^T = P,\ P^2 = P,\ \mathrm{Tr}(P) = r\}.$

Those projectors can be represented by matrices $WW^T$, where $W$ belongs to the Stiefel manifold $\mathrm{St}(r,n)$, i.e., matrices of $\mathbb{R}^{n \times r}$ whose columns are orthonormal.
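The representation just described is easy to check numerically. The snippet below (our illustration; the dimensions are arbitrary) builds a random Stiefel point $W$ by orthonormalizing a Gaussian matrix and verifies that $P = WW^T$ satisfies the three defining properties of $\mathrm{Gr}(r,n)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2

# A point of the Stiefel manifold St(r, n): orthonormalize a random n x r matrix.
W, _ = np.linalg.qr(rng.normal(size=(n, r)))

P = W @ W.T  # the corresponding rank-r projector, a point of Gr(r, n)
print(np.allclose(P.T, P))          # True: symmetric
print(np.allclose(P @ P, P))        # True: idempotent
print(np.isclose(np.trace(P), r))   # True: trace equals r
```

Since $W^T W = I_r$, all three properties follow algebraically; the projector $P$ is the same for every representative $WO$, $O \in O(r)$, which is the quotient structure used below.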
Define the cost function

$C(W) = -\tfrac12 \mathbb{E}_z[z^T WW^T z] = -\tfrac12 \mathrm{Tr}(W^T A W)$

which is minimal when $W$ is a basis of the dominant subspace of the covariance matrix $A$. It is invariant to rotations $W \mapsto WO$, $O \in O(r)$. The state space is therefore the set of equivalence classes $[W] = \{WO\ \mathrm{s.t.}\ O \in O(r)\}$. This set is denoted by $\mathrm{St}(r,n)/O(r)$. It is a quotient representation of the Grassmann manifold $\mathrm{Gr}(r,n)$. This quotient geometry has been well studied in e.g. [13]. The Riemannian gradient of the loss under the event $z$ is $H(z, W) = -(I - WW^T)zz^TW$. We have the following result.

Proposition 2. Suppose $z_1, z_2, \cdots$ are uniformly bounded. Consider the stochastic Riemannian gradient algorithm

$W_{t+1} = W_t V_t \cos(\gamma_t \Theta_t) V_t^T + U_t \sin(\gamma_t \Theta_t) V_t^T$   (13)

where $U_t \Theta_t V_t^T$ is the compact SVD of the matrix $(I - W_t W_t^T) z_t z_t^T W_t$. Then $W_t$ converges a.s. to an invariant subspace of the covariance matrix $A$.

Proof. The proof is a straightforward application of Theorem 1. Indeed, the update (13) corresponds to (2), as it states that $W_{t+1}$ is on the geodesic emanating from $W_t$ with tangent vector $-H(z_t, W_t)$, at a distance $\gamma_t \|H(z_t, W_t)\|$ from $W_t$. As the input sequence is bounded, so is the sequence of gradients. The injectivity radius of the Grassmann manifold is $\pi/2$, and is thus bounded away from zero, and the Grassmann manifold is compact. Thus Theorem 1 proves that $W_t$ a.s. converges to a point such that $\nabla C(W) = 0$, i.e. $AW = WW^TAW$. For such points there exists $M$ such that $AW = WM$, proving $W$ is an invariant subspace of $A$. A local analysis proves the dominant subspace of $A$ (i.e. the subspace associated with the first $r$ eigenvalues) is the only stable subspace of the averaged algorithm [26] under basic assumptions.

We also have the following result.

Proposition 3. Consider a twice differentiable retraction $R_W$.
The algorithm

W_{t+1} = R_{W_t}(γ_t (I − W_t W_t^T) z_t z_t^T W_t)   (14)

converges a.s. to an invariant subspace of the covariance matrix A.

The result is a mere application of Theorem 2. Consider in particular the retraction R_W(γH) = qf(W + γH), where qf() extracts the orthogonal factor in the QR decomposition of its argument. For small γ_t, this retraction amounts to following the gradient in the Euclidean ambient space R^{n×r}, and then orthonormalizing the matrix at each step. It is an infinitely differentiable retraction [1]. Algorithm (14) with this particular retraction is known as Oja's vector field for subspace tracking, and has already been proved to converge in [25]. Using the general framework proposed in the present paper, we see this convergence result stems directly from Theorem 2.

This example clearly illustrates the benefits of using a retraction. Indeed, from a numerical viewpoint, the geodesic update (13) requires performing an SVD at each time step, i.e. O(nr²) + O(r³) operations, whereas update (14) is only an orthonormalization of the vectors, with a lower computational cost of order O(nr²), which can be very advantageous, especially when r is large.

Figure 1: The Poincaré disk. The boundary is at infinite distance from the center. The geodesics (solid lines) are either arcs of circles perpendicular to the boundary of the disk, or diameters. The dashed circle is the boundary of a geodesic ball centered at 0. Assumption 1 is obviously verified: if a point w_t outside the ball makes a small move towards any point z_t inside the ball along the geodesic linking them, its distance to 0 decreases.

4.2 Randomized computation of a Karcher mean on a hyperbolic space

We propose to illustrate Theorem 3, and the assumptions it relies on, on a well-known and tutorial manifold.
Consider the unit disk D = { x ∈ R² : ||x|| < 1 } with the Riemannian metric defined on the tangent plane at x by

⟨ξ, η⟩_x = 4 ξ·η / (1 − ||x||²)²

where "·" represents the conventional scalar product in R². The metric tensor is thus diagonal, so the angles between two intersecting curves in the Riemannian metric are the same as in the Euclidean space. However, the distances differ: as a point moves closer to the boundary of the disk, the distances are dilated, so that the boundary cannot be reached in finite time. As illustrated on the figure, the geodesics are either arcs of circles that are orthogonal to the boundary circle, or diameters. The Poincaré disk equipped with this metric is a Hadamard manifold.

The Karcher (or Fréchet) mean on a Riemannian manifold is defined as the minimizer of w ↦ Σ_{i=1}^N d²(w, z_i). It can be viewed as a natural extension of the usual Euclidean barycenter to the Riemannian case. It is intrinsically defined, and it is unique on Hadamard manifolds. There has been growing interest in computing Karcher means recently, in particular for filtering on manifolds, see e.g. [2, 5, 4]. On the Poincaré disk we propose to compute the mean of N points in a randomized way. The method is as follows, and is closely related to the approach of [4]. Let w_t be the optimization parameter. The goal is to find the minimum of the cost function

C(w) = (1/2N) Σ_{i=1}^N d²(w, z_i).

At each time step, a point z_i is randomly picked with a uniform probability law. The loss function is Q(z_t, w_t) = (1/2) d²(w_t, z_t), and H(z_t, w_t) is the Riemannian gradient of the half squared geodesic distance (1/2) d²(z_t, w_t). On the Poincaré disk, the distance function is defined by d(z, w) = cosh⁻¹(1 + δ(z, w)), where

δ(z, w) = 2 ||z − w||² / ((1 − ||z||²)(1 − ||w||²)).
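The distance formula just given is easy to evaluate numerically. The following sketch (our illustration, with made-up points) also exhibits the dilation of distances near the boundary.

```python
import numpy as np

def poincare_dist(z, w):
    """Geodesic distance on the Poincare disk:
    d(z, w) = arccosh(1 + delta), with
    delta = 2 ||z - w||^2 / ((1 - ||z||^2)(1 - ||w||^2))."""
    z, w = np.asarray(z, float), np.asarray(w, float)
    delta = 2.0 * np.sum((z - w) ** 2) / ((1.0 - z @ z) * (1.0 - w @ w))
    return np.arccosh(1.0 + delta)

# The same Euclidean displacement costs far more hyperbolic distance
# near the boundary than near the center of the disk.
near_center = poincare_dist([0.0, 0.0], [0.05, 0.0])
near_boundary = poincare_dist([0.9, 0.0], [0.95, 0.0])
print(near_center, near_boundary)  # the second is several times larger
```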
As the metric tensor is diagonal, the Riemannian gradient and the Euclidean gradient have the same direction. Moreover, its norm is simply d(z_t, w_t) (see the Appendix). It is thus easy to prove that

H(z_t, w_t) = [(1 − ||w_t||²)(w_t − z_t) + ||w_t − z_t||² w_t] / ||(1 − ||w_t||²)(w_t − z_t) + ||w_t − z_t||² w_t|| · d(z_t, w_t).   (15)

When there is a lot of redundancy in the data, i.e. when some points are very close to each other, a randomized algorithm may be much more efficient numerically than a batch algorithm. This becomes obvious in the extreme case where the z_i's are all equal: the approximated gradient H(z_t, w_t) then coincides with the (Riemannian) gradient of the cost C(w_t), yet computing the latter quantity requires N times more operations than computing the approximated gradient. When N is large and there is a lot of redundancy in the data, a randomized algorithm can thus lead to a drastic reduction of the computational burden. Besides, the stochastic algorithm can also be used to filter a stream of noisy measurements of a single point on the manifold (and thus to track this point in case it slowly moves). Indeed, it is easily seen that if M = R^n and d is the Euclidean distance, the proposed update boils down to a first-order discrete low-pass filter, as it computes a weighted mean between the current update w_t and the new measurement z_t.

Proposition 4. Suppose at each time a point z_t is randomly drawn. Let S > 0 be such that S > (max{d(z_1, 0), ..., d(z_N, 0)})², and let α(w_t) = d(w_t, 0) + √S. Consider the algorithm (8) where H(z_t, w_t) is given by (15) and

f(w_t)² = max{1, α(w_t)²(1 + d(w_t, 0) + α(w_t)), (2α(w_t) d(w_t, 0) + α(w_t)²)²}.

Then w_t converges a.s. to the Karcher mean of the points z_1, ..., z_N.

Proof. The conditions of Theorem 3 are easily checked.
Assumption 1: it is easy to see on the figure that Assumption 1 is verified with v = 0, and S being the radius of an open geodesic ball centered at 0 and containing all the points z_1, ..., z_N. More technically, suppose d(w, 0) > √S. The quantity ⟨exp_w^{−1}(0), H(z_i, w)⟩_w is equal to

−d(w, 0) H(z_i, w) · w / (1 − ||w||²)² = −λ((1 − ||w||²)(||w||² − z_i·w) + ||w − z_i||² ||w||²),

where λ is a positive quantity bounded away from zero for d(w, 0) > √S. As there exists β > 0 such that ||w|| − ||z_i|| ≥ β, and ||w − z_i|| ≥ ||w|| − ||z_i||, the term ⟨exp_w^{−1}(0), H(z_i, w)⟩_w is negative and bounded away from zero, and so is its average over the z_i's.

Assumption 2: in dimension 2, the sectional curvature is known to be identically equal to −1.

Assumption 3 is obviously satisfied, as ||H(z, w)|| = d(z, w) ≤ d(z, 0) + d(0, w) ≤ √S + d(0, w) = α(w) by the triangle inequality.

Note that one could object that, in general, finding a function f(w_t) satisfying Assumption 3 of Theorem 3 requires knowing d(w_t, v), and thus requires some knowledge of the point v of Assumption 1. However, we claim (without proof) that in applications on Hadamard manifolds there may be obvious choices for v. This is the case in the present example, where finding v such that Assumptions 1-3 are satisfied requires very little (or no) knowledge of the point the algorithm is supposed to converge to. Indeed, v = 0 is a straightforward choice that always fulfills the assumptions. This choice is convenient for calculations, as the geodesics emanating from 0 are radii of the disk, but many other choices would have been possible.
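As a small numerical sanity check of (15) (our sketch, not part of the paper), a small Euclidean step against H(z, w) must decrease the distance to z, since the Riemannian and Euclidean gradients share the same direction under this conformal metric.

```python
import numpy as np

def poincare_dist(z, w):
    delta = 2.0 * np.sum((z - w) ** 2) / ((1.0 - z @ z) * (1.0 - w @ w))
    return np.arccosh(1.0 + delta)

def karcher_grad(z, w):
    """Gradient (15): direction (1 - ||w||^2)(w - z) + ||w - z||^2 w,
    normalized, then scaled by d(z, w)."""
    v = (1.0 - w @ w) * (w - z) + np.sum((w - z) ** 2) * w
    return v / np.linalg.norm(v) * poincare_dist(z, w)

z = np.array([0.5, 0.2])
w = np.array([-0.3, 0.4])
H = karcher_grad(z, w)

# Stepping against the gradient decreases the distance to z.
print(poincare_dist(z, w - 1e-3 * H) < poincare_dist(z, w))  # True
```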
4.3 Identification of a fixed-rank symmetric positive semi-definite matrix

To illustrate the benefits of the approach on a recent non-linear problem, we focus in this section on an algorithm of [23], and we prove new rigorous convergence results. Least Mean Squares (LMS) filters have been extensively utilized in adaptive filtering for online regression. Let x_t ∈ R^n be the input vector, and y_t be the output, defined by y_t = w^T x_t + ν_t, where the unknown vector w ∈ R^n is to be identified (the filter weights), and ν_t is a noise. At each step we let z_t = (x_t, y_t), and the approximated cost function is Q(z_t, w_t) = (1/2)(w_t^T x_t − y_t)². Applying steepest descent leads to the stochastic gradient algorithm known as LMS:

w_{t+1} = w_t − γ_t (w_t^T x_t − y_t) x_t.

We now consider a non-linear generalization of this problem coming from the machine learning field (see e.g. [37]), where x_t ∈ R^n is the input, y_t ∈ R is the output, and the matrix counterpart of the linear model is

y_t = Tr(W x_t x_t^T) = x_t^T W x_t   (16)

where W ∈ R^{n×n} is an unknown positive semi-definite matrix to be identified. In data mining, positive semi-definite matrices W represent a kernel or a Mahalanobis distance, i.e. W_ij is the scalar product, or the distance, between instances i and j. We assume that at each step an expert provides an estimation of W_ij, which can be viewed as a random output. The goal is to estimate the matrix W online. We let z_t = (x_t, y_t), and we apply our stochastic gradient method to the cost function C(W) = E_z Q(z, W), where Q(z_t, W_t) = (1/2)(x_t^T W_t x_t − y_t)² = (1/2)(ŷ_t − y_t)².

Due to the large amount of data available nowadays, matrix classification algorithms tend to be applied to computational problems of ever-increasing size. Yet they need to be adapted to remain tractable, and the matrices' dimensions need to be reduced so that the matrices are storable.
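Before turning to the low-rank machinery, the classical LMS recursion recalled above can be sketched in a few lines (a toy illustration with made-up dimensions, noise level, and gains, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
w_true = rng.standard_normal(n)   # unknown filter weights to identify

w = np.zeros(n)
for t in range(20000):
    x = rng.standard_normal(n)
    y = w_true @ x + 0.1 * rng.standard_normal()   # noisy output y_t = w^T x_t + nu_t
    gamma = 0.1 / (1.0 + 0.01 * t)                 # decreasing step size
    w = w - gamma * (w @ x - y) * x                # LMS update

print(np.linalg.norm(w - w_true))  # small residual error
```

With a step-size sequence whose sum diverges while the sum of squares converges, the residual error becomes small despite the noise, which is the Euclidean instance of the behavior studied in this paper.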
A widespread topical method consists of working with low-rank approximations. Any rank-r approximation of a positive definite matrix can be factored as A = GG^T, where G ∈ R^{n×r}. It is then greatly reduced in size if r ≪ n, leading to a reduction of the numerical cost of typical matrix operations from O(n³) to O(nr²), i.e. linear complexity. This fact has motivated the development of low-rank kernel and Mahalanobis distance learning [18], and a geometric understanding of the set of positive semi-definite matrices of fixed rank:

S⁺(r, n) = { W ∈ R^{n×n} s.t. W = W^T ⪰ 0, rank(W) = r }.

4.3.1 Proposed algorithm and convergence results

To endow S⁺(r, n) with a metric we start from the square-root factorization W = GG^T, where G ∈ R_*^{n×r}, i.e. G has rank r. Because the factorization is invariant by rotation, the search space is identified with the quotient S⁺(r, n) ≃ R_*^{n×r}/O(r), which represents the set of equivalence classes [G] = { GO s.t. O ∈ O(r) }. The Euclidean metric ḡ_G(Δ₁, Δ₂) = Tr(Δ₁^T Δ₂), for Δ₁, Δ₂ ∈ R^{n×r} tangent vectors at G, is invariant along the equivalence classes. It thus induces a well-defined metric g_[G](ξ, ζ) on the quotient, i.e. for ξ, ζ tangent vectors at [G] in S⁺(r, n). Classically [1], the tangent vectors of the quotient space S⁺(r, n) are identified with the projection onto the horizontal space (the space orthogonal to the equivalence class [G]) of tangent vectors of the total space R_*^{n×r}. So tangent vectors at [G] are represented by the set of horizontal tangent vectors { Sym(Δ)G, Δ ∈ R^{n×n} }, where Sym(A) = (A + A^T)/2. The horizontal gradient of Q(z_t, G_t) is the unique horizontal vector H(z_t, G_t) that satisfies the definition of the Riemannian gradient.
In the sequel we systematically identify an element G with its equivalence class [G], which is a matrix of S⁺(r, n). For more details on this manifold see [17]. Elementary computations yield H(z_t, G_t) = 2(ŷ_t − y_t) x_t x_t^T G_t, and (8) writes

G_{t+1} = G_t − (γ_t / f(G_t)) (||G_t^T x_t||² − y_t) x_t x_t^T G_t   (17)

where we choose f(G_t) = max(1, ||G_t||⁶), and where the sequence (γ_t)_{t≥0} satisfies condition (4). This non-linear algorithm is well defined on the set of equivalence classes, and automatically enforces the rank and positive semi-definiteness constraints of the parameter G_t G_t^T = W_t.

Proposition 5. Let (x_t)_{t≥0} be a sequence of zero-centered random vectors of R^n with independent, identically and normally distributed components. Suppose y_t = x_t^T V x_t is generated by some unknown parameter V ∈ S⁺(r, n). The Riemannian gradient descent (17) is such that G_t G_t^T = W_t → W_∞ and ∇C(W_t) → 0 a.s. Moreover, if W_∞ has rank r = n, necessarily W_∞ = V a.s. If r < n, W_∞ a.s. spans an invariant subspace of V of dimension r. If this is the dominant subspace of V, then W_∞ = V.

The last proposition can be completed by the following fact: it can easily be proved that the dominant subspace of V is a stable equilibrium of the averaged algorithm. As for the other invariant subspaces of V, simulations indicate they are unstable equilibria. Convergence to V is thus always expected in simulations.

Proof. As the Euclidean gradient of the loss with respect to the parameter G_t coincides here with its projection onto the horizontal space H(z_t, G_t), and is thus a tangent vector to the manifold, we propose to apply Theorem 3 to the Euclidean space R^{n×r}, which is of course a Hadamard manifold. This is a simple way to avoid computing the sectional curvature of S⁺(r, n).
Note that the adaptive step f(G_t) introduced in Theorem 3, and the results of Theorem 3, are nevertheless needed, as the usual assumption E_x ||H(x, G)||^k ≤ A + B||G||^k of the Euclidean case is violated. In fact, the proposition can be proved under slightly more general assumptions: suppose the components of the input vectors have moments up to order 8, with second and fourth moments denoted by a = E(x_i²) and b = E(x_i⁴) for 1 ≤ i ≤ n, such that b > a² > 0. We begin with a preliminary result:

Lemma 1. Consider the linear (matrix) map U: M ↦ E_x((x^T M x) x x^T). U(M) is the matrix whose coordinates are a²(M_ij + M_ji) for i ≠ j, and Tr(M)a² + M_ii(b − a²) for i = j.

Assumption 1: we let v = 0. Let "·" denote the usual scalar product in R^{n×r}. For ||G||² sufficiently large,

⟨E_x H(x, G), v − G⟩ = −2 Tr(U(GG^T − V) GG^T) < −ε < 0,

which means the gradient tends to make the norm of the parameter decrease on average when it is far from the origin. Indeed, let P = GG^T. We want to prove that Tr(U(P)P) > Tr(U(V)P) + ε for sufficiently large ||G||² = Tr(P). If we choose a basis in which P is diagonal, we have Tr(U(P)P) = a² Tr(P)² + (b − a²) Tr(P²) = a²(Σλ_i)² + (b − a²)(Σλ_i²), where λ_1, ..., λ_n are the eigenvalues of P. We also have Tr(U(V)P) = a² Tr(P) Tr(V) + (b − a²) Σ(λ_i V_ii). For sufficiently large Tr(P), Tr(P)² is arbitrarily larger than Tr(P)Tr(V). We also have (Σλ_i² · ΣV_ii²)^{1/2} ≥ Σλ_i V_ii and (Σλ_i²)^{1/2} ≥ (1/n)Tr(P) by the Cauchy-Schwarz inequality. Thus, for Tr(P) ≥ n(ΣV_ii²)^{1/2}, we have (Σλ_i²)^{1/2} ≥ (ΣV_ii²)^{1/2}, and thus Σλ_i² ≥ (Σλ_i² · ΣV_ii²)^{1/2} ≥ Σλ_i V_ii.

Assumption 2 is satisfied, as the curvature of a Euclidean space is zero.
For Assumption 3, using the fact that for P, Q positive semi-definite Tr(PQ) ≤ (Tr(P²)Tr(Q²))^{1/2} ≤ Tr(P)Tr(Q), and that Tr(GG^T) = ||G||² and Tr(xx^T) = ||x||², it is easy to prove there exists B > 0 such that

||H(x, G)||² = (||G^T x||² − y)² ||x x^T G||² ≤ (||G||⁶ + B||G||²)||x||⁸.

Thus there exists µ > 0 such that [max(1, ||G||³)]² ≥ µ E_x||H(x, G)||². On the other hand, there exists λ such that λ E_x(2||H(x, G)|| ||G|| + ||H(x, G)||²)² ≤ max(1, ||G||⁶). But the alternative step max(µ, λ)γ_t satisfies condition (4).

Let us analyze the set of possible asymptotic values. It is characterized by U(GG^T − V)G = 0. Let M be the symmetric matrix GG^T − V. If G is invertible, this means U(M) = 0. Using the lemma above, we see that the off-diagonal terms of M are equal to 0, and summing the diagonal terms, ((n−1)a² + b)Tr(M) = 0, thus Tr(M) = 0, which then implies M = 0 as b > a². Now suppose r < n. If b = 3a², as for the normal distribution, U(M) = a²(2M + Tr(M)I), and U(GG^T − V)G = 0 implies G(kI + 2G^TG) = 2VG for some k ∈ R. Thus (as in Example 4.1) G spans an invariant subspace of V. Similarly to the full-rank case, it is easy to prove that W_∞ = V when V and G span the same subspace.

4.3.2 Gain tuning

The condition (4) is common in stochastic approximation. As in the standard filtering problem, or in Kalman filter theory, the more noisy observations of a constant process one gets, the weaker the gain of the filter becomes. It is generally recommended to set γ_t = a/(1 + b t^{1/2+ε}), where in theory ε > 0, but in practice we propose to take ε = 0, leading to the family of gains

γ_t = a / (1 + b t^{1/2}).   (18)

If the gain remains too high, the noise will make the estimator oscillate around the solution, but a low gain leads to slow convergence. The coefficient a represents the initial gain.
It must be high enough to ensure sufficiently fast convergence, but not excessive, to avoid amplifying the noise. The coefficient b governs the asymptotic behavior of the algorithm, and must be set so that the algorithm is insensitive to noise in the final iterations (a high noise could destabilize the final matrix identified over a given training set). In practice, a is generally set experimentally using a reduced number of iterations, and b must be such that the variance of the gradient is very small compared to the entries of G_t for large t.

4.3.3 Simulation results

Asymptotic convergence of GG^T to the true value V is always achieved in simulations. When t becomes large, the behavior of the stochastic algorithm is very close to the behavior of the averaged gradient descent algorithm J_{t+1} = J_t − (γ_t/f(J_t)) E_z H(z, J_t), as illustrated in Figure 2. This latter algorithm has a well-characterized behavior in simulations: in a first phase the estimation error decreases rapidly, and in a second phase it slowly converges to zero. As the number of iterations increases, the estimation error becomes arbitrarily small. In all the experiments, the estimated matrices have an initial norm equal to ||V||. This is because a large initial norm discrepancy induces a rapid decrease of the estimation error, which would then seem to tend very quickly to zero compared to its initial value. A fair experiment thus requires comparable initial norms. In the first set of numerical experiments, Gaussian input vectors x_t ∈ R^100 with zero mean and identity covariance matrix are generated. The output is generated via model (16), where V ∈ R^{100×100} is a symmetric positive semi-definite matrix with rank r = 3. The results are illustrated in Figure 2, and indicate that the matrix V is asymptotically well identified.
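Update (17) with the gain family (18) can be sketched as follows. This is our toy re-implementation with arbitrary small dimensions and gains, not the paper's 100 × 100 experiment; the rank and positivity of W_t = G_t G_t^T are enforced automatically by the factored parametrization.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 5, 2
Gv = rng.standard_normal((n, r))
V = Gv @ Gv.T / n                     # unknown rank-r PSD matrix, y = x^T V x

G = 0.3 * rng.standard_normal((n, r))
err0 = np.linalg.norm(G @ G.T - V)    # initial estimation error

for t in range(50000):
    x = rng.standard_normal(n)
    y = x @ V @ x
    gamma = 0.005 / (1.0 + 0.02 * np.sqrt(t))   # gain family (18)
    f = max(1.0, np.linalg.norm(G) ** 6)        # adaptive normalization
    # Update (17): G stays n x r, so W = G G^T stays rank-r PSD by construction.
    G = G - (gamma / f) * (np.sum((G.T @ x) ** 2) - y) * np.outer(x, x) @ G

err1 = np.linalg.norm(G @ G.T - V)
print(err0, err1)  # the estimation error decreases
```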
In order to compare the proposed method with another algorithm, we focus on the full-rank case, and compare the algorithm with a naive but efficient technique. Indeed, when r = n, the cost function C(W) becomes convex in the parameter W ∈ P⁺(n), and the only difficulty is to numerically maintain W positive semi-definite. Thus, a simple method to attack the problem of identifying V is to derive a stochastic gradient algorithm in R^{n×n}, and to project the iterate at each step onto the cone of positive semi-definite matrices:

P_0 ∈ S⁺(n, n),   P_{t+1} = π(P_t − γ_t ∇Q(z_t, P_t))   (19)

where π is the projection onto the cone. It has been proved in [16] that this projection can be performed by diagonalizing the matrix and setting all the negative eigenvalues equal to zero. Figure 3 illustrates the results. Both algorithms have comparable performances. However, the proposed algorithm (17) is a little slower than the stochastic algorithm (19), which is as expected, since the latter algorithm takes advantage of the convexity of the averaged cost function in the full-rank case.

However, the true advantage of the proposed approach is essentially computational, and becomes more apparent when the rank is low. Indeed, when n is very large and r ≪ n, (17) has linear complexity in n, whereas a method based on diagonalization requires at least O(n²) operations and may become intractable. Moreover, the problem is no longer convex due to the rank constraint, and an approximate-then-project technique can lead to degraded performance. Thus, comparing technique (17) with diagonalization-based techniques such as (19) is pointless for low-rank applications, and finding relevant algorithms is an involved task that has recently been addressed in several papers, see e.g. [23, 33]. Since the emphasis of the present paper is on mathematical convergence results, the interested reader can refer to [21], where (17) and its variants have been extensively tested on several databases. They are shown to compete with state-of-the-art methods, and to scale very well when the matrix is of very large dimension (a variant was tested on the Netflix prize database). They have also recently been compared to more involved Riemannian methods in [33].

Figure 2: Identification of a rank-3 matrix V of dimension 100 × 100 with algorithm (17). Top plot: output (or classification) error versus the number of iterations. Bottom plot: estimation error ||G_t G_t^T − V|| for the stochastic algorithm (solid line) and estimation error for the deterministic averaged algorithm J_{t+1} = J_t − (γ_t/f(J_t)) E_z H(z, J_t) (dashed line). The curves nearly coincide. The chosen gain is γ_t = 0.001/(1 + t/5000)^{1/2}.

Figure 3: Full-rank case with r = n = 20. Plot of the estimation error for algorithm (17) (solid line) and (19) (dashed line). The gain is γ_t = 0.01/(1 + t/500)^{1/2}.

4.4 Non-linear gossip algorithm for decentralized covariance matrix estimation

The problem of computing distributed averages over a network appears in many applications, such as multi-agent systems, distributed data fusion, and decentralized optimization. The underlying idea is to replace expensive wiring by a network where the information exchange is reduced, owing to various limitations in data communication.
A way to compute distributed averages that has gained popularity over the last years is the so-called gossip algorithm [10]. It is a randomized procedure where, at each iteration, a node communicates with one neighbor, and both nodes set their value equal to the average of their current values. The goal is for all nodes to reach a common intermediate value as quickly as possible, with little computational power. Gossip algorithms are a special case of distributed optimization algorithms, where stochastic gradient descent plays a central role (see e.g. [36]). When applied to multi-agent systems, this algorithm allows the agents to reach a consensus, i.e. agree on a common quantity. This is the well-known consensus problem [24]. When the consensus space is not linear (for instance, a group of agents wants to agree on a common direction of motion, or a group of oscillators on a common phase), the methods need to be adapted to the non-linearities of the problem (see e.g. [32]). Consensus on manifolds has recently received increasing attention, see e.g. the recent work of [35, 31] for deterministic procedures. In [30], a gossip algorithm on the circle has been proposed and analyzed.

In this section, we address the problem of estimating a covariance matrix W on a sensor network in a decentralized way (see e.g. [20]): suppose each node i provides measurements y_i = [y_i^1, ..., y_i^m] ∈ R^{n×m}, where the vectors y_i^j ∈ R^n are zero-centered normally distributed random vectors with a covariance matrix to be estimated. After a local computation, each node is assumed to possess an initial estimated covariance matrix W_{i,0}. Neighboring nodes are allowed to exchange information at random times. We assume the nodes are labeled according to their proximity as follows: for i ≤ m − 1, the nodes i and i + 1 are neighbors.
At each time step t, we suppose a node i < m is picked randomly with probability p_i > 0 (where p_i represents, for instance, the frequency of availability of the communication channel between nodes i and i + 1), and the nodes in question update their covariance estimates W_{i,t} and W_{i+1,t}. Our goal is that they reach consensus on a common intermediate value. To do so, the procedure (2) is implemented using an interesting alternative metric on the cone of positive definite matrices P⁺(n).

4.4.1 A statistically meaningful distance on P⁺(n)

Information geometry allows one to define Riemannian distances between probability distributions that are very meaningful from a statistical point of view. On the manifold of symmetric positive definite matrices P⁺(n), the so-called Fisher information metric for two tangent vectors X₁, X₂ at P ∈ P⁺(n) is given by

⟨X₁, X₂⟩_P = Tr(X₁ P^{−1} X₂ P^{−1}).   (20)

It defines an infinitesimal distance that agrees, up to a constant factor and third-order terms in a small symmetric matrix X, with the celebrated Kullback-Leibler divergence between probability distributions:

KL(N(0, P) || N(0, P + X)) ∝ ⟨X, X⟩_P for any P ∈ P⁺(n),

where N(0, P) denotes a Gaussian distribution of zero mean and covariance matrix P, and KL denotes the Kullback-Leibler divergence. The geodesic distance writes

d(P, Q) = (Σ_{k=1}^n log²(λ_k))^{1/2},

where λ₁, ..., λ_n are the eigenvalues of the matrix PQ^{−1}, and it represents the amount of information that separates P from Q.

The introduced notion of statistical information is easily understood from a simple variance estimation problem with n = 1. Indeed, consider the problem of estimating the variance of a random variable y ∼ N(0, σ). In statistics, the Cramér-Rao bound provides a lower bound on the accuracy of any unbiased estimator σ̂ of the variance σ: here it states E(σ̂ − σ)² ≥ σ².
Thus, the smaller σ is, the more potential information the distribution contains about σ. As a result, two samples drawn respectively from, say, the distributions N(0, 1000) and N(0, 1001) look much more similar than samples drawn respectively from the distributions N(0, 0.1) and N(0, 1.1). In other words, a unit increment of the variance will have a small impact on the corresponding distribution if initially σ = 1000, whereas it will have a high impact if σ = 0.1. Identifying zero-mean Gaussian distributions with their variances, the Fisher information metric accounts for this statistical discrepancy, as (20) writes ⟨dσ, dσ⟩_σ = (dσ/σ)². But the Euclidean distance does not, as the Euclidean distance between the variances is equal to 1 in both cases.

The metric (20) is also known as the natural metric on P⁺(n), and admits strong invariance properties (see Proposition 6 below). These properties make the Karcher mean associated with this distance more robust to outliers than the usual arithmetic mean. This is the main reason why this distance has attracted ever-increasing attention in medical imaging applications (see e.g. [28]) and radar processing [5] over the last few years.

4.4.2 A novel randomized algorithm

We propose the following randomized procedure to tackle the problem above. At each step t, a node i < m is picked randomly with probability p_i > 0, and both neighboring nodes i and i + 1 move their values towards each other along the geodesic linking them, to a distance γ_t d(W_{i,t}, W_{i+1,t}) from their current position. Note that for γ_t = 1/2 the updated matrix W_{i,t+1} is at exactly half Fisher information (geodesic) distance between W_{i,t} and W_{i+1,t}.
This is an application of update (2), where z_t denotes the node selected at time t, with probability distribution (p₁, ..., p_{m−1}), and where the average cost function writes

C(W₁, ..., W_m) = Σ_{i=1}^{m−1} p_i d²(W_i, W_{i+1})

on the manifold P⁺(n) × ... × P⁺(n). Using the explicit expression of the geodesics [14], update (2) writes

W_{i,t+1} = W_{i,t}^{1/2} exp(γ_t log(W_{i,t}^{−1/2} W_{i+1,t} W_{i,t}^{−1/2})) W_{i,t}^{1/2},
W_{i+1,t+1} = W_{i+1,t}^{1/2} exp(γ_t log(W_{i+1,t}^{−1/2} W_{i,t} W_{i+1,t}^{−1/2})) W_{i+1,t}^{1/2}.   (21)

This algorithm has several theoretical advantages. First, it is based on the Fisher information metric, and is thus natural from a statistical viewpoint. It also has several nice properties, as illustrated by the following two results.

Proposition 6. The algorithm (21) is invariant to the action of GL(n) on P⁺(n) by congruence.

Proof. This is merely a consequence of the invariance of the metric, and of the geodesic distance: d(GPG^T, GQG^T) = d(P, Q) for any P, Q ∈ P⁺(n) and G ∈ GL(n).

The proposition has the following meaning: after a linear change of coordinates, if all the measurements y_i^j are transformed into new measurements G y_i^j, where G is an invertible matrix on R^n, and the corresponding initial estimated covariance matrices are accordingly transformed into G W_{i,0} G^T, then the algorithm (21) is unchanged, i.e., for any node i, the algorithm with initial values G W_{i,0} G^T will yield at time t the matrix G W_{i,t} G^T, where W_{i,t} is the updated value corresponding to the initial values W_{i,0}.
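The geodesic distance and the congruence invariance underlying Proposition 6 can be checked numerically. The sketch below (our code and naming) computes d(P, Q) through the symmetric matrix Q^{-1/2} P Q^{-1/2}, which has the same eigenvalues as P Q^{-1}.

```python
import numpy as np

def fisher_dist(P, Q):
    """Fisher (natural) distance on P+(n): d(P, Q) = (sum_k log^2 lambda_k)^{1/2},
    where lambda_k are the eigenvalues of P Q^{-1}."""
    lam_q, U = np.linalg.eigh(Q)
    Q_isqrt = (U / np.sqrt(lam_q)) @ U.T           # Q^{-1/2}
    lam = np.linalg.eigvalsh(Q_isqrt @ P @ Q_isqrt)
    return np.sqrt(np.sum(np.log(lam) ** 2))

P, Q = np.diag([1.0, 4.0]), np.eye(2)
d = fisher_dist(P, Q)          # eigenvalues of P Q^{-1} are 1 and 4
print(d, np.log(4.0))          # both equal log 4

# Congruence invariance (Proposition 6): d(G P G^T, G Q G^T) = d(P, Q).
G = np.array([[2.0, 1.0], [0.0, 1.0]])
print(np.isclose(fisher_dist(G @ P @ G.T, G @ Q @ G.T), d))  # True
```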
As a result, the algorithm will perform equally well for a given problem independently of the choice of coordinates in which the covariance matrices are expressed (this implies in particular invariance to the orientation of the axes, and invariance to changes of units, such as meters versus feet, which can be desirable and physically meaningful in some applications).

Figure 4: Entries of the matrices at each node versus the number of iterations over a single run. The matrices are of dimension 2 × 2 and the graph has 6 nodes. Convergence to a common (symmetric) matrix is observed.

The most important result is merely an application of the theorems of the present paper:

Proposition 7. If the sequence γ_t satisfies the usual assumption (4), and is upper bounded by 1/2, the covariance matrices at each node converge a.s. to a common value.

Proof. This is simply an application of Theorem 1. Indeed, P⁺(n) endowed with the natural metric is a complete manifold, and thus one can find a geodesic ball containing the initial values W₁, ..., W_m. In this manifold, geodesic balls are convex (see e.g. [2]). At each time step, two points move towards each other along the geodesic linking them, but as γ_t ≤ 1/2 their updated values lie between their current values, so they remain in the ball by convexity. Thus, the values belong to a compact set at all times. Moreover, the injectivity radius is bounded away from zero, and the gradient is bounded in the ball, so Theorem 1 can be applied. ∇_{W_m} C = 0 implies that d(W_{m−1}, W_m) = 0, as ∇_{W_m} C = −2 p_{m−1} exp_{W_m}^{−1}(W_{m−1}) (see the Appendix). Thus W_{m−1} and W_m a.s. converge to the same value. But as ∇_{W_{m−1}} C = 0, this implies W_{m−2} converges a.s. to the same value.
By the same token, we see all nodes converge to the same value a.s.

4.4.3 Simulation results

As the cone of positive definite matrices P^+(n) is convex, the standard gossip algorithm is well defined on P^+(n). If node i is drawn, it simply consists of the update

W_{i,t+1} = W_{i+1,t+1} = (W_{i,t} + W_{i+1,t}) / 2    (22)

In the following numerical experiments we let n = 10, m = 6, and nodes are drawn with uniform probability. The step γ_t is fixed equal to 1/2 over the experiment, so that (21) can be viewed as a Riemannian gossip algorithm (condition (4) only concerns the asymptotic behavior of γ_t and can thus be satisfied even if γ_t is fixed over a finite number of iterations). Simulations show that both algorithms always converge. Convergence of the Riemannian algorithm (21) is illustrated in Figure 4.

In Figures 5 and 6, the two algorithms are compared. Due to the stochastic nature of the algorithms, the simulation results are averaged over 50 runs. Simulations show that the Riemannian algorithm (21) converges faster on average than the usual gossip algorithm (22). The convergence is only slightly faster when the initial matrices W_{1,0}, · · ·, W_{m,0} have approximately the same norm, but when the initial matrices are far from each other, the Riemannian consensus algorithm clearly outperforms the usual gossip algorithm. In Figure 5, the evolution of the cost C(W_{1,t}, · · ·, W_{m,t})^{1/2} is plotted versus the number of iterations. In Figure 6, the diameter of the convex hull of the matrices W_{1,t}, · · ·, W_{m,t} is considered as an alternative convergence criterion. The superiority of the Riemannian algorithm is particularly striking with respect to this convergence criterion. It can also be observed in simulations that the Riemannian algorithm is more robust to outliers.
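The experiment above can be sketched compactly in NumPy (our own helper names and iteration budget; the Euclidean variant (22) is indicated in a comment):

```python
import numpy as np

rng = np.random.default_rng(42)

def sym_fun(A, f):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * f(vals)) @ vecs.T

def riemannian_step(Wi, Wj, gamma):
    """Update (21): move Wi a fraction gamma along the geodesic toward Wj."""
    Wi_h = sym_fun(Wi, np.sqrt)
    Wi_mh = sym_fun(Wi, lambda v: 1.0 / np.sqrt(v))
    L = sym_fun(Wi_mh @ Wj @ Wi_mh, np.log)
    return Wi_h @ sym_fun(gamma * L, np.exp) @ Wi_h

def diameter(Ws):
    """Convergence criterion of Figure 6: max_ij ||W_i - W_j|| (Frobenius)."""
    return max(np.linalg.norm(A - B) for A in Ws for B in Ws)

n, m, gamma = 10, 6, 0.5
Ws = []                                    # SPD initial values, norms ~ 0.1 to 10
for s in np.logspace(-1, 1, m):
    X = rng.standard_normal((n, n))
    Ws.append(s * (X @ X.T + np.eye(n)))

d0 = diameter(Ws)
for _ in range(2000):
    i = rng.integers(m - 1)                # draw a node uniformly
    # Riemannian gossip (21); the Euclidean variant (22) would instead set
    # Ws[i] = Ws[i+1] = (Ws[i] + Ws[i+1]) / 2
    Wi_new = riemannian_step(Ws[i], Ws[i + 1], gamma)
    Ws[i + 1] = riemannian_step(Ws[i + 1], Ws[i], gamma)
    Ws[i] = Wi_new
assert diameter(Ws) < 1e-2 * d0            # consensus: the diameter collapses
```

With γ_t = 1/2 both selected nodes land on the geodesic midpoint, so each drawn edge merges its two endpoints, and the diameter of the node set shrinks toward zero.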
Together with its statistical motivation, its invariance, and its guaranteed convergence properties, this makes the algorithm an interesting procedure for decentralized covariance estimation, or more generally for randomized consensus on P^+(n).

Figure 5: Comparison of Riemannian (solid line) and Euclidean (dashed line) gossip for covariance matrices of dimension 10 × 10 with a 6-node graph. The plotted curves represent the square root of the averaged cost, C(W_{1,t}, · · ·, W_{m,t})^{1/2}, versus the number of iterations, averaged over 50 runs. The Riemannian algorithm converges faster (top plot). Its superiority is particularly striking when the nodes have heterogeneous initial values (bottom plot).

5 Conclusion

In this paper we proposed a stochastic gradient algorithm on Riemannian manifolds. Under reasonable assumptions, the convergence of the algorithm was proved. Moreover, the convergence results were proved to hold when a retraction is used, a feature of great practical interest. The approach is versatile, and potentially applicable to numerous non-linear problems in several fields of research, such as control, machine learning, and signal processing, where the manifold approach is often used either to enforce a constraint or to derive an intrinsic algorithm.
Figure 6: Comparison of Riemannian (solid line) and Euclidean (dashed line) gossip for covariance matrices of dimension 10 × 10 with a 6-node graph, with another convergence criterion. The plotted curves represent the diameter of the convex hull, max_{i,j} ||W_{i,t} − W_{j,t}||, versus the number of iterations, averaged over 50 runs. The Riemannian algorithm converges faster (top plot). It outperforms the Euclidean algorithm when the nodes have heterogeneous initial values (bottom plot).

Another important connection with the literature concerns Amari's natural gradient [3], a technique that has led to substantial gains in the blind source separation problem, and that can be cast in our framework. Indeed, the idea is to consider successive realizations z_1, z_2, · · · of a parametric model with parameter w ∈ R^n and joint probability p(z, w). The goal is to estimate the parameter w online. Amari proposes to use algorithm (3), where the Riemannian metric is the Fisher information metric associated with the parametric model, the loss is the negative log-likelihood Q(z, w) = −log p(z, w), and the retraction is mere addition in R^n. The resulting algorithm, the so-called natural gradient, is proved to be asymptotically efficient, i.e., to reach an asymptotic Cramér-Rao lower bound. Using the true exponential map, and thus algorithm (2), would result in a different (intrinsic) update. In [34], S. Smith has proposed an intrinsic Cramér-Rao bound based on the Fisher metric.
In future work, one could explore whether the intrinsic algorithm (2) asymptotically reaches the intrinsic Cramér-Rao bound. More details are given in Appendix A.

In the future we would also like to explore two of the aforementioned applications. First, the matrix completion problem: proving the convergence of the stochastic gradient algorithm of [22] requires studying the critical points of the averaged cost function. This leads to proving mathematically involved results on low-rank matrix identifiability, possibly extending the non-trivial work of [11]. Second, we would like to prove more general results for non-linear consensus on complete manifolds with a stochastic communication graph. In particular, we hope to extend or improve the convergence bounds of the gossip algorithms in the Euclidean case [10] to problems such as the ones described in [32, 30], and also to understand to what extent gossip algorithms for consensus can be faster in a hyperbolic geometry.

Acknowledgements

The author would like to thank Gilles Meyer and Rodolphe Sepulchre for early collaboration on the subject, and Léon Bottou for interesting discussions.

Appendix A: Links with information geometry

An important concept in information geometry is the natural gradient. Let us show how it is related to the method proposed in this paper. Suppose now that the z_t are realizations of a parametric model with parameter w ∈ R^n and joint probability density function p(z, w). Now let Q(z, w) = l(z, w) = −log(p(z, w)) be the negative log-likelihood of the parametric law p, so that minimizing the expected loss amounts to maximizing the likelihood.
If ŵ is an estimator of the true parameter w* based on k realizations z_1, · · ·, z_k of the process, its covariance matrix is lower bounded by the Cramér-Rao bound:

E[(ŵ − w*)(ŵ − w*)^T] ≥ (1/k) G(w*)^{−1}

with G the Fisher information matrix

G(w) = E_z[(∇^E_w l(z, w))(∇^E_w l(z, w))^T],

where ∇^E denotes the conventional gradient in Euclidean spaces. As G(w) is a positive definite matrix, it defines a Riemannian structure on the state space M = R^n, known as the Fisher information metric. In this chart, the Riemannian gradient of Q(z, w) writes G^{−1}(w) ∇^E_w l(z, w). As M = R^n, a simple retraction is the addition R_w(u) = w + u. Taking γ_t = 1/t, which is compatible with Assumption 1, update (3) writes

w_{t+1} = w_t − (1/t) G^{−1}(w_t) ∇^E_w l(z_t, w_t).

This is Amari's celebrated natural gradient [3]. Assuming w_t converges to the true parameter w* generating the data, Amari proves it is an asymptotically efficient estimator. Indeed, letting V_t = E[(w_t − w*)(w_t − w*)^T], we have

V_{t+1} = V_t − 2 E[(1/t) G^{−1} ∇^E_w l(z_t, w_t)(w_t − w*)^T] + (1/t²) G^{−1} G G^{−1} + O(1/t³).

But up to second-order terms, ∇^E_w l(z_t, w_t) = ∇^E_w l(z_t, w*) + (∇^E_w)² l(z_t, w*)(w_t − w*), with E[∇^E_w l(z, w*)] = 0 (the score has zero mean), as w* achieves a minimum of the expected loss, where the expectation is with respect to the law p(z, w*), and G(w) = E[(∇^E_w)² l(z, w)] by the standard identity between the two expressions of the Fisher information. Finally, V_{t+1} = V_t − 2V_t/t + G^{−1}/t², up to terms whose average can be neglected. The asymptotic solution of this equation is V_t = G^{−1}/t + O(1/t²), proving statistical efficiency.
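As a sanity check of the efficiency claim on a toy model of our own choosing (not from the paper): for Gaussian observations z ~ N(w, Σ) with known covariance Σ, the Fisher matrix is G = Σ^{−1}, and the natural gradient step with γ_t = 1/t collapses exactly to the running sample mean, which attains the Cramér-Rao bound Σ/T:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
Sigma = np.diag([1.0, 4.0, 9.0])         # known covariance (toy assumption)
Sigma_inv = np.linalg.inv(Sigma)
w_true = np.array([1.0, -2.0, 0.5])

T = 5000
zs = rng.multivariate_normal(w_true, Sigma, size=T)

w = np.zeros(d)                          # initial estimate
for t, z in enumerate(zs, start=1):
    grad = -Sigma_inv @ (z - w)          # Euclidean gradient of l = -log p(z, w)
    w = w - (1.0 / t) * Sigma @ grad     # natural gradient step: G^{-1} = Sigma
# with gamma_t = 1/t the iterates coincide exactly with the running sample
# mean, whose covariance Sigma/T is the Cramer-Rao bound
assert np.allclose(w, zs.mean(axis=0))
```

Here the preconditioning by G^{−1} = Σ cancels the Σ^{−1} in the gradient, so the update reads w ← w + (z − w)/t, i.e., the online average of the samples.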
This completes our convergence result, showing that when the space is endowed with the Fisher metric and the trivial retraction is used, the stochastic gradient method proposed in this paper provides an asymptotically efficient estimator.

The natural gradient has been applied to blind source separation (BSS), where it has proved to lead to substantial performance gains. [34] has recently derived an intrinsic Cramér-Rao bound. The bound does not depend on any non-trivial choice of coordinates, i.e., the estimation error ||ŵ − w|| is replaced with the Riemannian distance associated with the Fisher information metric. In the same way, the usual natural gradient update w_{t+1} = w_t − (1/t) G^{−1} ∇^E_w l(z_t, w_t) could be replaced with its intrinsic version (2) proposed in this paper. It can be conjectured that this estimator achieves Fisher efficiency, i.e., reaches the intrinsic Cramér-Rao bound as defined in [34]. Such a result in the theory of information geometry goes beyond the scope of this paper and is left for future research.

Appendix B: Riemannian geometry background

Let (M, g) be a connected Riemannian manifold (see e.g. [1] for basic Riemannian geometry definitions). It carries the structure of a metric space whose distance function is the arc length of a minimizing path between two points. The length L of a curve c(t) ∈ M is defined by

L = ∫_a^b √(g(ċ(t), ċ(t))) dt = ∫_a^b ||ċ(t)|| dt

If y is sufficiently close to x ∈ M, there is a unique path of minimal length linking x and y, called a geodesic. The exponential map is defined as follows: exp_x(v) is the point z ∈ M situated on the geodesic with initial position-velocity (x, v), at distance ||v|| from x. We also define exp_x^{−1}(z) = v. The cut locus of x is, roughly speaking, the set where the geodesics starting at x stop being paths of minimal length (for example, the point at distance π on the circle for x = 0).
The least distance to the cut locus is the so-called injectivity radius I at x. A geodesic ball is a ball with radius less than the injectivity radius at its center.

For f : M → R twice continuously differentiable, one can define the Riemannian gradient as the tangent vector at x satisfying

(d/dt)|_{t=0} f(exp_x(tv)) = ⟨v, ∇f(x)⟩_g

and the hessian as the operator such that

(d/dt)|_{t=0} ⟨∇f(exp_x(tv)), ∇f(exp_x(tv))⟩_g = 2⟨∇f(x), (∇²_x f) v⟩_g.

For instance, if f(x) = (1/2) d²(p, x) is half the squared distance to a point p, the Riemannian gradient is ∇_x f = −exp_x^{−1}(p), i.e., it is a tangent vector at x collinear to the geodesic linking x and p, with norm d(p, x). Letting c(t) = exp_x(tv), we have

f(c(t)) = f(x) + t⟨v, ∇f(x)⟩_g + ∫_0^t (t − s) ⟨(d/ds) c(s), (∇²_{c(s)} f)(d/ds) c(s)⟩_g ds,

and thus

f(exp_x(tv)) − f(x) ≤ t⟨v, ∇f(x)⟩_g + (t²/2) ||v||²_g k,

where k is a bound on the hessian along the geodesic.

References

[1] P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2007.

[2] B. Afsari. Riemannian L^p center of mass: existence, uniqueness, and convexity. Proceedings of the American Mathematical Society, 139:655–673, 2011.

[3] S.I. Amari. Natural gradient works efficiently in learning. Neural Computation, 1998.

[4] M. Arnaudon, C. Dombry, A. Phan, and L. Yang. Stochastic algorithms for computing means of probability measures. Stochastic Processes and their Applications, 122:1437–1455, 2012.

[5] F. Barbaresco. Innovative tools for radar signal processing based on Cartan's geometry of symmetric positive-definite matrices and information geometry. In IEEE Radar Conference, 2008.

[6] A. Benveniste, M. Goursat, and G. Ruget.
Analysis of stochastic approximation schemes with discontinuous and dependent forcing terms, with applications to data communication algorithms. IEEE Trans. on Automatic Control, 25(6):1042–1058, 1980.

[7] S. Bonnabel. Convergence des méthodes de gradient stochastique sur les variétés riemanniennes. In XIIIème Colloque GRETSI, Bordeaux, 2011.

[8] S. Bonnabel, G. Meyer, and R. Sepulchre. Adaptive filtering for estimation of a low-rank positive semidefinite matrix. In International Symposium on Mathematical Theory of Networks and Systems (MTNS), 2010.

[9] L. Bottou. Online Algorithms and Stochastic Approximations. In Online Learning and Neural Networks, edited by David Saad, Cambridge University Press, 1998.

[10] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Trans. on Information Theory, 52(6):2508–2530, 2006.

[11] E.J. Candès and Y. Plan. Tight oracle bounds for low-rank matrix recovery from a minimal number of random measurements. IEEE Trans. on Information Theory, 57(4):2342–2359, 2009.

[12] D. Cordero-Erausquin, R.J. McCann, and M. Schmuckenschläger. A Riemannian interpolation inequality à la Borell, Brascamp and Lieb. Invent. Math., 146:219–257, 2001.

[13] A. Edelman, T.A. Arias, and S.T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

[14] J. Faraut and A. Korányi. Analysis on Symmetric Cones. Oxford Univ. Press, London, U.K., 1994.

[15] D.L. Fisk. Quasi-martingales. Trans. of the American Mathematical Society, 120(3), 1965.

[16] N.J. Higham. Matrix nearness problems and applications. Oxford University Press, 1989.

[17] M. Journée, F. Bach, P.-A. Absil, and R. Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.

[18] B. Kulis, M.
Sustik, and I.S. Dhillon. Learning low-rank kernel matrices. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.

[19] L. Ljung. Analysis of recursive stochastic algorithms. IEEE Trans. on Automatic Control, 22(4):551–575, 1977.

[20] L. Ljung, H. Hjalmarsson, and H. Ohlsson. Four encounters with system identification. European Journal of Control, pages 449–471, 2011.

[21] G. Meyer. Geometric optimization algorithms for linear regression on fixed-rank matrices. PhD thesis, University of Liège, 2011.

[22] G. Meyer, S. Bonnabel, and R. Sepulchre. Linear regression under fixed-rank constraints: a Riemannian approach. In Proc. of the 28th International Conference on Machine Learning (ICML), 2011.

[23] G. Meyer, S. Bonnabel, and R. Sepulchre. Regression on fixed-rank positive semidefinite matrices: a Riemannian approach. Journal of Machine Learning Research (JMLR), 12:593–625, 2011.

[24] L. Moreau. Stability of multiagent systems with time-dependent communication links. IEEE Trans. on Automatic Control, 50(2):169–182, 2005.

[25] E. Oja. Subspace Methods of Pattern Recognition. Research Studies Press, 1983.

[26] E. Oja. Principal components, minor components, and linear neural networks. Neural Networks, 5:927–935, 1992.

[27] B. O'Neill. Semi-Riemannian Geometry. Pure and Applied Mathematics. Academic Press Inc., New York, 1983.

[28] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. International Journal of Computer Vision, 66:41–66, 2006.

[29] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22, 1951.

[30] A. Sarlette, S.E. Tuna, V.D. Blondel, and R. Sepulchre. Global synchronization on the circle. In Proceedings of the 17th IFAC World Congress, 2008.

[31] A. Sarlette, S. Bonnabel, and R. Sepulchre. Coordinated motion design on Lie groups. IEEE Trans.
on Automatic Control, 55(5):1047–1058, 2010.

[32] R. Sepulchre, D.A. Paley, and N.E. Leonard. Stabilization of planar collective motion with limited communication. IEEE Trans. on Automatic Control, 53(3):706–719, 2008.

[33] U. Shalit, D. Weinshall, and G. Chechik. Online learning in the embedded manifold of low-rank matrices. The Journal of Machine Learning Research (JMLR), 13:429–458, 2012.

[34] S.T. Smith. Covariance, subspace, and intrinsic Cramér-Rao bounds. IEEE Trans. on Signal Processing, 53(5):1610–1629, 2005.

[35] R. Tron, B. Afsari, and R. Vidal. Average consensus on Riemannian manifolds with bounded curvature. In Proceedings of the 50th IEEE Conference on Decision and Control, pages 7855–7862, 2011.

[36] J.N. Tsitsiklis, D.P. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Trans. Autom. Control, 31(9):803–812, 1986.

[37] K. Tsuda, G. Rätsch, and M.K. Warmuth. Matrix exponentiated gradient updates for online learning and Bregman projection. Journal of Machine Learning Research, 6:995–1018, 2005.