Projection Pursuit through $\Phi$-Divergence Minimisation



PROJECTION PURSUIT THROUGH $\Phi$-DIVERGENCE MINIMISATION

Jacques Touboul
Université Pierre et Marie Curie, Laboratoire de Statistique Théorique et Appliquée
175 rue du Chevaleret, 75013 Paris, France
jack_touboul@hotmail.com

Abstract

Consider a density defined on a set of very large dimension. It is quite difficult to find an estimate of this density from a data set. However, it is possible to solve this problem through a projection pursuit methodology. In his seminal article ("Projection pursuit", Annals of Statistics, 1985), Huber demonstrates the interest of his method in a very simple given case: he considers the factorisation of a density through a Gaussian component and some residual density. Huber's work is based on maximising relative entropy. Our proposal leads to a new algorithm. Furthermore, we will also consider the case when the density to be factorised is estimated from an i.i.d. sample. We will then propose a test for the factorisation of the estimated density. Applications include a new test of fit pertaining to Elliptical copulas.

Keywords: Projection Pursuit; minimum $\Phi$-divergence; Elliptical distribution; goodness-of-fit; copula; regression.
2000 MSC: 94A17, 62F05, 62J05, 62G08.

1. Outline of the article

The objective of Projection Pursuit is to generate one or several projections providing as much information as possible about the structure of the data set, regardless of its size. Once a structure has been isolated, the corresponding data are eliminated from the data set. Through a recursive approach, this process is iterated to find another structure in the remaining data, until no further structure can be evidenced in the data left at the end. Friedman (1984 and 1987) and Huber (1985) count among the first authors to have introduced this type of approach for evidencing structures. They each describe, with many examples, how to evidence such a structure and consequently how to estimate the density of such data, through two different methodologies each. Their work is based on maximising relative entropy. For a very long time, the two methodologies exposed by each of the above authors were thought to be equivalent, but Mu Zhu (2004) showed that this is in fact not the case when the number of iterations in the algorithms exceeds the dimension of the space containing the data. In the present article, we will therefore only focus on Huber's study, while taking Mu Zhu's remarks into account. At present, let us briefly introduce Huber's methodology; we will then expose our approach and objective.

1.1. Huber's analytic approach

Let $f$ be a density on $\mathbb{R}^d$. We define an instrumental density $g$ with the same mean and variance as $f$. Huber's methodology requires us to start by performing the $K(f, g) = 0$ test - with $K$ being the relative entropy. Should this test turn out to be positive, then $f = g$ and the algorithm stops. If the test is not verified, the first step of Huber's algorithm amounts to defining a vector $a_1$ and a density $f^{(1)}$ by

$$a_1 = \arg\inf_{a \in \mathbb{R}^d_*} K\Big(f \frac{g_a}{f_a}, g\Big) \quad \text{and} \quad f^{(1)} = f \frac{g_{a_1}}{f_{a_1}}, \qquad (1.1)$$

where $\mathbb{R}^d_*$ is the set of non-null vectors of $\mathbb{R}^d$, and where $f_a$ (resp. $g_a$) stands for the density of $a^\top X$ (resp. $a^\top Y$) when $f$ (resp. $g$) is the density of $X$ (resp. $Y$).
More exactly, this results from the maximisation of $a \mapsto K(f_a, g_a)$, since $K(f, g) = K(f_a, g_a) + K(f \frac{g_a}{f_a}, g)$ and it is assumed that $K(f, g)$ is finite. In a second step, Huber replaces $f$ with $f^{(1)}$ and goes through the first step again. By iterating this process, Huber thus obtains a sequence $(a_1, a_2, \ldots)$ of vectors of $\mathbb{R}^d_*$ and a sequence of densities $f^{(i)}$.

Remark 1.1. Huber stops his algorithm when the relative entropy equals zero or when his algorithm reaches the $d$-th iteration; he then obtains an approximation of $f$ from $g$. When there exists an integer $j$ such that $K(f^{(j)}, g) = 0$ with $j \le d$, he obtains $f^{(j)} = g$, i.e. $f = g \prod_{i=1}^{j} \frac{f^{(i-1)}_{a_i}}{g_{a_i}}$, since by induction $f^{(j)} = f \prod_{i=1}^{j} \frac{g_{a_i}}{f^{(i-1)}_{a_i}}$. Similarly, when, for all $j \le d$, Huber gets $K(f^{(j)}, g) > 0$, he assumes $g = f^{(d)}$ in order to derive $f = g \prod_{i=1}^{d} \frac{f^{(i-1)}_{a_i}}{g_{a_i}}$. He can also stop his algorithm when the relative entropy equals zero without the condition $j \le d$ being met. Therefore, since by induction we have $f^{(j)} = f \prod_{i=1}^{j} \frac{g_{a_i}}{f^{(i-1)}_{a_i}}$ with $f^{(0)} = f$, we obtain $g = f \prod_{i=1}^{j} \frac{g_{a_i}}{f^{(i-1)}_{a_i}}$. Consequently, we derive a representation of $f$ as $f = g \prod_{i=1}^{j} \frac{f^{(i-1)}_{a_i}}{g_{a_i}}$. Finally, he obtains $K(f^{(0)}, g) \ge K(f^{(1)}, g) \ge \cdots \ge 0$, with $f^{(0)} = f$.

1.2. Huber's synthetic approach

Keeping the notations of the above section, we start by performing the $K(f, g) = 0$ test; should this test turn out to be positive, then $f = g$ and the algorithm stops. Otherwise, the first step of his algorithm consists in defining a vector $a_1$ and a density $g^{(1)}$ by

$$a_1 = \arg\inf_{a \in \mathbb{R}^d_*} K\Big(f, g \frac{f_a}{g_a}\Big) \quad \text{and} \quad g^{(1)} = g \frac{f_{a_1}}{g_{a_1}}. \qquad (1.2)$$

More exactly, this optimisation results from the maximisation of $a \mapsto K(f_a, g_a)$, since $K(f, g) = K(f_a, g_a) + K(f, g \frac{f_a}{g_a})$ and it is assumed that $K(f, g)$ is finite. In a second step, Huber replaces $g$ with $g^{(1)}$ and goes through the first step again. By iterating this process, Huber thus obtains a sequence $(a_1, a_2, \ldots)$ of vectors of $\mathbb{R}^d_*$ and a sequence of densities $g^{(i)}$.

Remark 1.2. First, in a similar manner to the analytic approach, this methodology enables us to approximate and even to represent $f$ from $g$. To obtain an approximation of $f$, Huber either stops his algorithm when the relative entropy equals zero, i.e. $K(f, g^{(j)}) = 0$ implies $g^{(j)} = f$ with $j \le d$, or when his algorithm reaches the $d$-th iteration, i.e. he approximates $f$ with $g^{(d)}$. To obtain a representation of $f$, Huber stops his algorithm when the relative entropy equals zero, since $K(f, g^{(j)}) = 0$ implies $g^{(j)} = f$. Therefore, since by induction we have $g^{(j)} = g \prod_{i=1}^{j} \frac{f_{a_i}}{g^{(i-1)}_{a_i}}$ with $g^{(0)} = g$, we then obtain $f = g \prod_{i=1}^{j} \frac{f_{a_i}}{g^{(i-1)}_{a_i}}$. Second, he gets $K(f, g^{(0)}) \ge K(f, g^{(1)}) \ge \cdots \ge 0$, with $g^{(0)} = g$.

1.3. Proposal

Let us first introduce the concept of $\Phi$-divergence. Let $\varphi$ be a strictly convex function defined by $\varphi : \mathbb{R}^+ \to \mathbb{R}^+$ and such that $\varphi(1) = 0$. We define a $\Phi$-divergence of $P$ from $Q$ - where $P$ and $Q$ are two probability distributions over a space $\Omega$ such that $Q$ is absolutely continuous with respect to $P$ - by

$$\Phi(Q, P) = \int \varphi\Big(\frac{dQ}{dP}\Big)\, dP.$$
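As a concrete illustration of this definition, the following minimal sketch evaluates $\Phi(Q, P) = \int \varphi(dQ/dP)\, dP$ numerically for two one-dimensional Gaussian densities, with $\varphi(x) = x \ln x - x + 1$ associated with the relative entropy (see Annex A.1). It is our own illustration, not part of the paper's method; the grid bounds and the example densities are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

def phi_divergence(q_pdf, p_pdf, phi, grid):
    """Numerically evaluate Phi(Q, P) = int phi(dQ/dP) dP over a fine grid."""
    p, q = p_pdf(grid), q_pdf(grid)
    dx = grid[1] - grid[0]
    return np.sum(phi(q / p) * p) * dx   # dQ/dP = q/p, assuming p > 0 on the grid

phi_kl = lambda x: x * np.log(x) - x + 1   # phi associated with the relative entropy

grid = np.linspace(-12.0, 12.0, 24001)
print(phi_divergence(norm(1, 1).pdf, norm(0, 1).pdf, phi_kl, grid))
# close to the closed form K(N(1,1), N(0,1)) = 1/2
```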
Throughout this article, we will also assume that $\varphi(0) < \infty$, that $\varphi'$ is continuous and that this divergence is greater than the $L_1$ distance - see also Annex A.1.

Now, let us introduce our algorithm. We start by performing the $\Phi(g, f) = 0$ test; should this test turn out to be positive, then $f = g$ and the algorithm stops. Otherwise, the first step of our algorithm consists in defining a vector $a_1$ and a density $g^{(1)}$ by

$$a_1 = \arg\inf_{a \in \mathbb{R}^d_*} \Phi\Big(g \frac{f_a}{g_a}, f\Big) \quad \text{and} \quad g^{(1)} = g \frac{f_{a_1}}{g_{a_1}}. \qquad (1.3)$$

Later on, we will prove that $a_1$ simultaneously optimises (1.1), (1.2) and (1.3). In our second step, we will replace $g$ with $g^{(1)}$ and repeat the first step. And so on; by iterating this process, we will end up obtaining a sequence $(a_1, a_2, \ldots)$ of vectors in $\mathbb{R}^d_*$ and a sequence of densities $g^{(i)}$. We will thus prove that the underlying structures of $f$ evidenced through this method are identical to the ones obtained through Huber's method. We will also evidence the above structures, which will enable us to infer more information on $f$ - see the examples below.

Remark 1.3. As in the previous algorithm, we first provide an approximation and even a representation of $f$ from $g$. To obtain an approximation of $f$, we stop our algorithm when the divergence equals zero, i.e. $\Phi(g^{(j)}, f) = 0$ implies $g^{(j)} = f$ with $j \le d$, or when our algorithm reaches the $d$-th iteration, i.e. we approximate $f$ with $g^{(d)}$. To obtain a representation of $f$, we stop our algorithm when the divergence equals zero; therefore, since by induction we have $g^{(j)} = g \prod_{i=1}^{j} \frac{f_{a_i}}{g^{(i-1)}_{a_i}}$ with $g^{(0)} = g$, we then obtain $f = g \prod_{i=1}^{j} \frac{f_{a_i}}{g^{(i-1)}_{a_i}}$. Second, we get $\Phi(g^{(0)}, f) \ge \Phi(g^{(1)}, f) \ge \cdots \ge 0$, with $g^{(0)} = g$. Finally, the specific form of relationship (1.3) establishes that we deal with M-estimation. We can therefore state that our method is more robust than Huber's - see Yohai (2008), Toma (2009) as well as Huber (2004).

At present, let us study two examples:

Example 1.1. Let $f$ be a density defined on $\mathbb{R}^3$ by $f(x_1, x_2, x_3) = n(x_1, x_2)\, h(x_3)$, with $n$ being a bi-dimensional Gaussian density and $h$ being a non-Gaussian density. Let us also consider $g$, a Gaussian density with the same mean and variance as $f$. Since $g(x_1, x_2 / x_3) = n(x_1, x_2)$, we then have $\Phi(g \frac{f_3}{g_3}, f) = \Phi(n . f_3, f) = \Phi(f, f) = 0$, as $f_3 = h$, i.e. the function $a \mapsto \Phi(g \frac{f_a}{g_a}, f)$ reaches zero for $e_3 = (0, 0, 1)'$. We therefore obtain $g(x_1, x_2 / x_3) = f(x_1, x_2 / x_3)$.

Example 1.2. Assume that the $\Phi$-divergence is greater than the $L_2$ norm. Let us consider $(X_n)_{n \ge 0}$, a Markov chain with continuous state space $E$. Let $f$ be the density of $(X_0, X_1)$ and let $g$ be the normal density with the same mean and variance as $f$. Let us now assume that $\Phi(g^{(1)}, f) = 0$ with $g^{(1)}(x) = g(x) \frac{f_1}{g_1}$, i.e. let us assume that our algorithm stops for $a_1 = (1, 0)'$. Consequently, if $(Y_0, Y_1)$ is a random vector with density $g$, then the distribution law of $X_1$ given $X_0$ is Gaussian and is equal to the distribution law of $Y_1$ given $Y_0$. And then, for any sequence $(A_i)$ - where $A_i \subset E$ - we have
$$P(X_{n+1} \in A_{n+1} \mid X_0 \in A_0, X_1 \in A_1, \ldots, X_{n-1} \in A_{n-1}, X_n \in A_n) = P(X_{n+1} \in A_{n+1} \mid X_n \in A_n),$$
based on the very definition of a Markov chain,
$$= P(X_1 \in A_1 \mid X_0 \in A_0),$$
through the Markov property,
$$= P(Y_1 \in A_1 \mid Y_0 \in A_0),$$
as a consequence of the above nullity of the $\Phi$-divergence.
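To make the first step (1.3) concrete, here is a minimal numerical sketch in the ideal case where $f$ and $g$ are known, for $d = 2$ and the relative entropy. The target $f$ is a Gumbel $\times$ Normal product in the spirit of Simulation 4.3 below; the criterion is estimated by Monte Carlo, $f_a$ is replaced by a kernel estimate of the projected sample, and the minimisation over $a$ is a simple scan over angles rather than the paper's simulated annealing. All function names, the sample size and the grid of angles are our own illustrative choices.

```python
import numpy as np
from scipy.stats import gumbel_r, norm, multivariate_normal, gaussian_kde

rng = np.random.default_rng(0)

# Toy target: f(x) = Gumbel(x_0) * Normal(x_1)
n = 5000
X = np.column_stack([gumbel_r(-5, 1).rvs(n, random_state=rng),
                     norm(0, 1).rvs(n, random_state=rng)])
f = lambda x: gumbel_r(-5, 1).pdf(x[:, 0]) * norm(0, 1).pdf(x[:, 1])

# Instrumental g: Gaussian with the same mean and covariance as f
mu, Sigma = X.mean(0), np.cov(X.T)
g = multivariate_normal(mu, Sigma)

phi = lambda t: t * np.log(t) - t + 1          # relative entropy

def criterion(a):
    """Monte Carlo estimate of Phi(g f_a / g_a, f) = E_f[phi(g f_a / (g_a f))]."""
    proj = X @ a
    f_a = gaussian_kde(proj)                   # kernel estimate of the projected density f_a
    g_a = norm(a @ mu, np.sqrt(a @ Sigma @ a)) # exact Gaussian projection g_a
    r = g.pdf(X) * f_a(proj) / (g_a.pdf(proj) * f(X))
    return np.mean(phi(r))

thetas = np.linspace(0, np.pi, 60)
values = [criterion(np.array([np.cos(t), np.sin(t)])) for t in thetas]
best = thetas[int(np.argmin(values))]
print(np.array([np.cos(best), np.sin(best)]))  # expected close to (1, 0), the Gumbel direction
```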
To recapitulate our method: if $\Phi(g, f) = 0$, we derive $f$ from the relationship $f = g$; should there exist a sequence $(a_i)_{i=1,\ldots,j}$, $j < d$, of vectors in $\mathbb{R}^d_*$ defining $g^{(j)}$ and such that $\Phi(g^{(j)}, f) = 0$, then $f(./a_i^\top x, 1 \le i \le j) = g(./a_i^\top x, 1 \le i \le j)$, i.e. $f$ coincides with $g$ on the complement of the vector subspace generated by the family $\{a_i\}_{i=1,\ldots,j}$ - see also section 2 for a more detailed explanation.

In this paper, after having clarified the choice of $g$, we will consider the statistical solution to the representation problem, assuming that $f$ is unknown and $X_1, X_2, \ldots, X_m$ are i.i.d. with density $f$. We will provide asymptotic results pertaining to the family of optimising vectors $a_{k,m}$ - which we will define more precisely below - as $m$ goes to infinity. Our results also prove that the empirical representation scheme converges towards the theoretical one. As an application, section 3.4 permits a new test of fit pertaining to the copula of an unknown density $f$, section 3.5 gives us an estimate of a density deconvoluted with a Gaussian component, and section 3.6 presents some applications to regression analysis. Finally, we will present simulations.

2. The algorithm

2.1. The model

As explained by Friedman (1984 and 1987) and Diaconis (1984), the choice of $g$ depends on the family of distributions one wants to find in $f$. Until now, the choice has only been to use the class of Gaussian distributions. This can be extended to the class of elliptical distributions with almost all $\Phi$-divergences.

2.1.1. Elliptical laws

The interest of this class lies in the fact that conditional densities with elliptical distributions are also elliptical - see Cambanis (1981), Landsman (2003). This very property allows us to use this class in our algorithm.

Definition 2.1. $X$ is said to abide by a multivariate elliptical distribution - noted $X \sim E_d(\mu, \Sigma, \xi_d)$ - if $X$ presents the following density, for any $x$ in $\mathbb{R}^d$:
$$f_X(x) = \frac{c_d}{|\Sigma|^{1/2}}\, \xi_d\Big(\frac{1}{2}(x - \mu)' \Sigma^{-1} (x - \mu)\Big),$$
• with $\Sigma$ being a $d \times d$ positive-definite matrix and $\mu$ being a $d$-column vector,
• with $\xi_d$ being referred to as the "density generator",
• with $c_d$ being a normalisation constant, such that $c_d = \frac{\Gamma(d/2)}{(2\pi)^{d/2}} \big(\int_0^\infty x^{d/2-1} \xi_d(x)\, dx\big)^{-1}$, with $\int_0^\infty x^{d/2-1} \xi_d(x)\, dx < \infty$.
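Definition 2.1 is easy to check numerically; the following sketch (our own illustration, with crude numerical integration and arbitrary parameters) evaluates the elliptical density for the Gaussian generator $\xi_d(x) = e^{-x}$ of Remark 2.1 below, and confirms that it reproduces the multivariate normal density.

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import multivariate_normal

def elliptical_pdf(x, mu, Sigma, xi, d):
    """Density of E_d(mu, Sigma, xi_d) as in Definition 2.1 (crude numerics, d >= 2)."""
    t = np.linspace(1e-9, 60.0, 600000)
    integral = np.sum(t ** (d / 2 - 1) * xi(t)) * (t[1] - t[0])
    c_d = gamma(d / 2) / (2 * np.pi) ** (d / 2) / integral
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff
    return c_d / np.sqrt(np.linalg.det(Sigma)) * xi(0.5 * quad)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])
xi_gauss = lambda t: np.exp(-t)   # Gaussian generator (Remark 2.1)
print(elliptical_pdf(x, mu, Sigma, xi_gauss, 2),
      multivariate_normal(mu, Sigma).pdf(x))   # the two values should agree
```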
Property 2.1.
1/ For any $X \sim E_d(\mu, \Sigma, \xi_d)$, for any $A$, an $m \times d$ matrix with rank $m \le d$, and for any $b$, an $m$-dimensional vector, we have $AX + b \sim E_m(A\mu + b, A \Sigma A', \xi_m)$. Therefore, any marginal density of a multivariate elliptical distribution is elliptical, i.e.
$$X = (X_1, X_2, \ldots, X_d) \sim E_d(\mu, \Sigma, \xi_d) \;\Rightarrow\; X_i \sim E_1(\mu_i, \sigma_i^2, \xi_1), \quad f_{X_i}(x) = \frac{c_1}{\sigma_i}\, \xi_1\Big(\frac{1}{2}\Big(\frac{x - \mu_i}{\sigma_i}\Big)^2\Big), \quad 1 \le i \le d.$$
2/ Corollary 5 of Cambanis (1981) states that conditional densities with elliptical distributions are also elliptical. Indeed, if $X = (X_1, X_2)' \sim E_d(\mu, \Sigma, \xi_d)$, with $X_1$ (resp. $X_2$) of size $d_1 < d$ (resp. $d_2 < d$), then $X_1 / (X_2 = a) \sim E_{d_1}(\mu', \Sigma', \xi_{d_1})$, with $\mu' = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(a - \mu_2)$ and $\Sigma' = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$, where $\mu = (\mu_1, \mu_2)$ and $\Sigma = (\Sigma_{ij})_{1 \le i,j \le 2}$.

Remark 2.1. Landsman (2003) shows that multivariate Gaussian distributions derive from $\xi_d(x) = e^{-x}$. He also shows that if $X = (X_1, \ldots, X_d)$ has an elliptical density such that its marginals verify $E(X_i) < \infty$ and $E(X_i^2) < \infty$ for $1 \le i \le d$, then $\mu$ is the mean of $X$ and $\Sigma$ is the covariance matrix of $X$. Consequently, from now on, we will assume that we are in this case.

Definition 2.2. Let $t$ be an elliptical density on $\mathbb{R}^k$ and let $q$ be an elliptical density on $\mathbb{R}^{k'}$. The elliptical densities $t$ and $q$ are said to belong to the same family - or class - of elliptical densities if their generating densities, $\xi_k$ and $\xi_{k'}$ respectively, belong to a common given family of densities.

Example 2.1. Consider two Gaussian densities $N(0, 1)$ and $N((0, 0), Id_2)$. They are said to belong to the same elliptical family, as they both present $x \mapsto e^{-x}$ as generating density.
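Property 2.1/2 can also be illustrated directly in the Gaussian case; the following minimal sketch (our own, with arbitrary parameters and a hypothetical helper name) computes the conditional parameters $\mu'$ and $\Sigma'$.

```python
import numpy as np

def gaussian_conditional(mu, Sigma, d1, a):
    """Parameters of X_1 | X_2 = a for X ~ N(mu, Sigma), X_1 of size d1
    (Property 2.1/2 with the Gaussian generator)."""
    mu1, mu2 = mu[:d1], mu[d1:]
    S11, S12 = Sigma[:d1, :d1], Sigma[:d1, d1:]
    S21, S22 = Sigma[d1:, :d1], Sigma[d1:, d1:]
    S22_inv = np.linalg.inv(S22)
    mu_cond = mu1 + S12 @ S22_inv @ (a - mu2)   # mu' = mu_1 + S12 S22^-1 (a - mu_2)
    Sigma_cond = S11 - S12 @ S22_inv @ S21      # Sigma' = S11 - S12 S22^-1 S21
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
print(gaussian_conditional(mu, Sigma, 1, np.array([1.0, 0.0])))
```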
2.1.2. Choice of g

Let us begin by studying the following case: let $f$ be a density on $\mathbb{R}^d$. Let us assume there exist $d$ non-null independent vectors $a_j$, $1 \le j \le d$, of $\mathbb{R}^d$, such that
$$f(x) = n(a_{j+1}^\top x, \ldots, a_d^\top x)\, h(a_1^\top x, \ldots, a_j^\top x), \qquad (2.1)$$
with $j < d$, with $n$ being an elliptical density on $\mathbb{R}^{d-j}$ and with $h$ being a density on $\mathbb{R}^j$ which does not belong to the same family as $n$. Let $X = (X_1, \ldots, X_d)$ be a vector with density $f$. Define $g$ as an elliptical distribution with the same mean and variance as $f$.

For simplicity, let us assume that the family $\{a_j\}_{1 \le j \le d}$ is the canonical basis of $\mathbb{R}^d$: the very definition of $f$ implies that $(X_{j+1}, \ldots, X_d)$ is independent from $(X_1, \ldots, X_j)$. Hence, the density of $(X_{j+1}, \ldots, X_d)$ given $(X_1, \ldots, X_j)$ is $n$. Let us assume that $\Phi(g^{(j)}, f) = 0$ for some $j \le d$. We then get
$$\frac{f(x)}{f_{a_1} f_{a_2} \cdots f_{a_j}} = \frac{g(x)}{g^{(0)}_{a_1} g^{(1)}_{a_2} \cdots g^{(j-1)}_{a_j}},$$
since, by induction, we have $g^{(j)}(x) = g(x)\, \frac{f_{a_1}}{g^{(0)}_{a_1}} \frac{f_{a_2}}{g^{(1)}_{a_2}} \cdots \frac{f_{a_j}}{g^{(j-1)}_{a_j}}$. Consequently, the fact that conditional densities with elliptical distributions are also elliptical, together with the above relationship, enables us to infer that $n(a_{j+1}^\top x, \ldots, a_d^\top x) = f(./a_i^\top x, 1 \le i \le j) = g(./a_i^\top x, 1 \le i \le j)$. In other words, $f$ coincides with $g$ on the complement of the vector subspace generated by the family $\{a_i\}_{i=1,\ldots,j}$.

Now, if the family $\{a_j\}_{1 \le j \le d}$ is no longer the canonical basis of $\mathbb{R}^d$, it is still a basis of $\mathbb{R}^d$. Hence, lemma F.1 implies that
$$g(./a_1^\top x, \ldots, a_j^\top x) = n(a_{j+1}^\top x, \ldots, a_d^\top x) = f(./a_1^\top x, \ldots, a_j^\top x), \qquad (2.2)$$
which is equivalent to having $\Phi(g^{(j)}, f) = 0$ - since, by induction, $g^{(j)} = g\, \frac{f_{a_1}}{g^{(0)}_{a_1}} \cdots \frac{f_{a_j}}{g^{(j-1)}_{a_j}}$. The end of our algorithm implies that $f$ coincides with $g$ on the complement of the vector subspace generated by the family $\{a_i\}_{i=1,\ldots,j}$. Therefore, the nullity of the $\Phi$-divergence provides us with information on the density structure. In summary, the following proposition clarifies our choice of $g$, which depends on the family of distributions one wants to find in $f$:

Proposition 2.1. With the above notations, $\Phi(g^{(j)}, f) = 0$ is equivalent to $g(./a_1^\top x, \ldots, a_j^\top x) = f(./a_1^\top x, \ldots, a_j^\top x)$.

More generally, the above proposition leads us to define the co-support of $f$ as the vector space generated by the vectors $a_1, \ldots, a_j$.

Definition 2.3. Let $f$ be a density on $\mathbb{R}^d$. We define the co-vectors of $f$ as the sequence of vectors $a_1, \ldots, a_j$ which solves the problem $\Phi(g^{(j)}, f) = 0$, where $g$ is an elliptical distribution with the same mean and variance as $f$. We define the co-support of $f$ as the vector space generated by the vectors $a_1, \ldots, a_j$.

Remark 2.2. Any $(a_i)$ family defining $f$ as in (2.1) is an orthogonal basis of $\mathbb{R}^d$ - see lemma F.2.

2.2. Stochastic outline of our algorithm

Let $X_1, X_2, \ldots, X_m$ (resp. $Y_1, Y_2, \ldots, Y_m$) be a sequence of $m$ independent random vectors with the same density $f$ (resp. $g$). As is customary in nonparametric $\Phi$-divergence optimisation, all estimates of $f$ and $f_a$, as well as all uses of Monte Carlo methods, are performed using subsamples $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$ - extracted respectively from $X_1, X_2, \ldots, X_m$ and $Y_1, Y_2, \ldots, Y_m$ - since the estimates are bounded below by some positive deterministic sequence $\theta_m$ - see Annex B. Let $P_n$ be the empirical measure of the subsample $X_1, X_2, \ldots, X_n$. Let $f_n$ (resp. $f_{a,n}$ for any $a \in \mathbb{R}^d_*$) be the kernel estimate of $f$ (resp. $f_a$), which is built from $X_1, X_2, \ldots, X_n$ (resp. $a^\top X_1, a^\top X_2, \ldots, a^\top X_n$).

As defined in section 1.3, we introduce the following sequences $(a_k)_{k \ge 1}$ and $(g^{(k)})_{k \ge 1}$:
• $a_k$ is a non-null vector of $\mathbb{R}^d$ such that
$$a_k = \arg\min_{a \in \mathbb{R}^d_*} \Phi\Big(g^{(k-1)} \frac{f_a}{g^{(k-1)}_a}, f\Big), \qquad (2.3)$$
• $g^{(k)}$ is the density such that $g^{(k)} = g^{(k-1)} \frac{f_{a_k}}{g^{(k-1)}_{a_k}}$, with $g^{(0)} = g$.

The stochastic setting-up of the algorithm uses $f_n$ and $g^{(0)}_n = g$ instead of $f$ and $g^{(0)} = g$ - since $g$ is known. Thus, at the first step, we build the vector $\check{a}_1$, which minimises the $\Phi$-divergence between $f_n$ and $g \frac{f_{a,n}}{g_a}$ and which estimates $a_1$: Proposition B.1 and lemma F.6 enable us to minimise the $\Phi$-divergence between $f_n$ and $g \frac{f_{a,n}}{g_a}$. Defining $\check{a}_1$ as the argument of this minimisation, proposition 3.3 shows that this vector tends to $a_1$. Finally, we define the density $\check{g}^{(1)}_m$ as $\check{g}^{(1)}_m = g \frac{f_{\check{a}_1,m}}{g_{\check{a}_1}}$, which estimates $g^{(1)}$ through theorem 3.1.
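In this stochastic setting, the kernel estimate of the projected density $f_a$ is simply a one-dimensional kernel density estimate of the projected subsample. A minimal sketch of the resulting update $\check{g}^{(1)}$ follows, under our own simplifications: the $\theta_m$-truncation of Annex B is omitted, the sample is a placeholder, and the bandwidth is scipy's default rather than the conditions of Annex B.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm, multivariate_normal

def g_update(g_pdf, g_a_pdf, X, a):
    """x -> g(x) * f_{a,n}(a'x) / g_a(a'x): the update defining g^(1) (section 2.2)."""
    f_a_n = gaussian_kde(X @ a)      # kernel estimate of the projected density f_a
    return lambda x: g_pdf(x) * f_a_n(x @ a) / g_a_pdf(x @ a)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))    # placeholder sample from f
mu, Sigma = X.mean(0), np.cov(X.T)
g = multivariate_normal(mu, Sigma)   # instrumental elliptical density

a1 = np.array([0.0, 0.0, 1.0])       # a candidate projection vector
g_a1 = norm(a1 @ mu, np.sqrt(a1 @ Sigma @ a1))
g1_n = g_update(g.pdf, g_a1.pdf, X, a1)
print(g1_n(np.zeros((1, 3))))        # evaluate the estimate of g^(1) at the origin
```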
Now, from the second step onwards and as defined in section 1.3, the density $g^{(k-1)}$ is unknown. Consequently, once again, we have to truncate the samples: all estimates of $f$ and $f_a$ (resp. $g^{(1)}$ and $g^{(1)}_a$) are performed using a subsample $X_1, X_2, \ldots, X_n$ (resp. $Y^{(1)}_1, Y^{(1)}_2, \ldots, Y^{(1)}_n$) extracted from $X_1, X_2, \ldots, X_m$ (resp. $Y^{(1)}_1, Y^{(1)}_2, \ldots, Y^{(1)}_m$ - a sequence of $m$ independent random vectors with the same density $g^{(1)}$), such that the estimates are bounded below by some positive deterministic sequence $\theta_m$ - see Annex B. Let $P_n$ be the empirical measure of the subsample $X_1, X_2, \ldots, X_n$. Let $f_n$ (resp. $g^{(1)}_n$, $f_{a,n}$, $g^{(1)}_{a,n}$ for any $a \in \mathbb{R}^d_*$) be the kernel estimate of $f$ (resp. $g^{(1)}$, $f_a$, $g^{(1)}_a$), built from $X_1, X_2, \ldots, X_n$ (resp. $Y^{(1)}_1, \ldots, Y^{(1)}_n$, $a^\top X_1, \ldots, a^\top X_n$, $a^\top Y^{(1)}_1, \ldots, a^\top Y^{(1)}_n$). The stochastic setting-up of the algorithm uses $f_n$ and $g^{(1)}_n$ instead of $f$ and $g^{(1)}$. Thus, we build the vector $\check{a}_2$, which minimises the $\Phi$-divergence between $f_n$ and $g^{(1)}_n \frac{f_{a,n}}{g^{(1)}_{a,n}}$ - since $g^{(1)}$ and $g^{(1)}_a$ are unknown - and which estimates $a_2$. Proposition B.1 and lemma F.6 enable us to minimise the $\Phi$-divergence between $f_n$ and $g^{(1)}_n \frac{f_{a,n}}{g^{(1)}_{a,n}}$. Defining $\check{a}_2$ as the argument of this minimisation, proposition 3.3 shows that this vector tends to $a_2$ in $n$. Finally, we define the density $\check{g}^{(2)}_n$ as $\check{g}^{(2)}_n = g^{(1)}_n \frac{f_{\check{a}_2,n}}{g^{(1)}_{\check{a}_2,n}}$, which estimates $g^{(2)}$ through theorem 3.1. And so on; we end up obtaining a sequence $(\check{a}_1, \check{a}_2, \ldots)$ of vectors in $\mathbb{R}^d_*$ estimating the co-vectors of $f$, and a sequence of densities $(\check{g}^{(k)}_n)_k$ such that $\check{g}^{(k)}_n$ estimates $g^{(k)}$ through theorem 3.1.

3. Results

3.1. Convergence results

3.1.1. Hypotheses on f

In this paragraph, we define the set of hypotheses on $f$ which could possibly be of use in our work. A discussion of several of these hypotheses can be found in Annex E. In this section, to be more legible, we write $g$ for $g^{(k-1)}$. Let $\Theta = \mathbb{R}^d$,
$$\Theta_\Phi = \Big\{ b \in \Theta \;\Big|\; \int \varphi^*\Big(\varphi'\Big(\frac{g(x)}{f(x)} \frac{f_b(b^\top x)}{g_b(b^\top x)}\Big)\Big)\, dP < \infty \Big\},$$
$$M(b, a, x) = \int \varphi'\Big(\frac{g(x)}{f(x)} \frac{f_b(b^\top x)}{g_b(b^\top x)}\Big)\, g(x) \frac{f_a(a^\top x)}{g_a(a^\top x)}\, dx - \varphi^*\Big(\varphi'\Big(\frac{g(x)}{f(x)} \frac{f_b(b^\top x)}{g_b(b^\top x)}\Big)\Big),$$
$$P_n M(b, a) = \int M(b, a, x)\, dP_n, \qquad P M(b, a) = \int M(b, a, x)\, dP,$$
where $P$ is the probability measure with density $f$. Similarly to chapter V of Van der Vaart (1998), let us define:

(H1): For all $\varepsilon > 0$, there is $\eta > 0$ such that, for all $c \in \Theta_\Phi$ verifying $\|c - a_k\| \ge \varepsilon$, we have $P M(c, a) - \eta > P M(a_k, a)$, with $a \in \Theta$.

(H2): There exist $Z < 0$ and $n_0 > 0$ such that $n \ge n_0 \Rightarrow \sup_{a \in \Theta} \sup_{c \in \{\Theta_\Phi\}^c} P_n M(c, a) < Z$.

(H3): There is a neighbourhood $V$ of $a_k$ and a positive function $H$ such that, for all $c \in V$, we have $|M(c, a_k, x)| \le H(x)$ ($P$-a.s.), with $P H < \infty$.

(H4): There is a neighbourhood $V$ of $a_k$ such that, for all $\varepsilon$, there is an $\eta$ such that, for all $c \in V$ and $a \in \Theta$ verifying $\|a - a_k\| \ge \varepsilon$, we have $P M(c, a_k) < P M(c, a) - \eta$.

Putting $I_{a_k} = \frac{\partial^2}{\partial a^2} \Phi(g \frac{f_{a_k}}{g_{a_k}}, f)$ and $x \mapsto \rho(b, a, x) = \varphi'\big(\frac{g(x) f_b(b^\top x)}{f(x) g_b(b^\top x)}\big)\, g(x) \frac{f_a(a^\top x)}{g_a(a^\top x)}$, let us now consider three new hypotheses:

(H5): The function $\varphi$ is $C^3$ on $(0, +\infty)$ and there is a neighbourhood $V'_k$ of $(a_k, a_k)$ such that, for all $(b, a)$ of $V'_k$, the gradient $\nabla\big(g(x) \frac{f_a(a^\top x)}{g_a(a^\top x)}\big)$ and the Hessian $H\big(g(x) \frac{f_a(a^\top x)}{g_a(a^\top x)}\big)$ exist ($\lambda$-a.s.), and the first-order partial derivatives of $g(x) \frac{f_a(a^\top x)}{g_a(a^\top x)}$ and the first- and second-order derivatives of $(b, a) \mapsto \rho(b, a, x)$ are dominated ($\lambda$-a.s.) by $\lambda$-integrable functions.

(H6): The function $(b, a) \mapsto M(b, a, x)$ is $C^3$ in a neighbourhood $V_k$ of $(a_k, a_k)$ for all $x$, and the partial derivatives of $(b, a) \mapsto M(b, a, x)$ are all dominated on $V_k$ by a $P$-integrable function $H(x)$.

(H7): $P\|\frac{\partial}{\partial b} M(a_k, a_k)\|^2$ and $P\|\frac{\partial}{\partial a} M(a_k, a_k)\|^2$ are finite, and the expressions $P \frac{\partial^2}{\partial b_i \partial b_j} M(a_k, a_k)$ and $I_{a_k}$ exist and are invertible.

Finally, we define:

(H8): There exists $k$ such that $P M(a_k, a_k) = 0$.

(H9): $(\mathrm{Var}_P(M(a_k, a_k)))^{1/2}$ exists and is invertible.

(H0): $f$ and $g$ are assumed to be positive and bounded.

3.1.2.
Estimation of the first co-vector of f

Let $\mathcal{R}$ be the class of all positive functions $r$ defined on $\mathbb{R}$ and such that $g(x) r(a^\top x)$ is a density on $\mathbb{R}^d$ for all $a$ belonging to $\mathbb{R}^d_*$. The following proposition shows that there exists a vector $a$ such that $\frac{f_a}{g_a}$ minimises $\Phi(gr, f)$ in $r$:

Proposition 3.1. There exists a vector $a$ belonging to $\mathbb{R}^d_*$ such that $\arg\min_{r \in \mathcal{R}} \Phi(gr, f) = \frac{f_a}{g_a}$ and $r(a^\top x) = \frac{f_a(a^\top x)}{g_a(a^\top x)}$.

Remark 3.1. This proposition proves that $a_1$ simultaneously optimises (1.1), (1.2) and (1.3). In other words, it proves that the underlying structures of $f$ evidenced through our method are identical to the ones obtained through Huber's methods - see also Annex D.

Following Broniatowski (2009), let us introduce the estimate of $\Phi(g \frac{f_{a,n}}{g_a}, f_n)$ through
$$\check{\Phi}\Big(g \frac{f_{a,n}}{g_a}, f_n\Big) = \int M(a, a, x)\, dP_n(x).$$

Proposition 3.2. Let $\check{a}$ be such that $\check{a} := \arg\inf_{a \in \mathbb{R}^d_*} \check{\Phi}(g \frac{f_{a,n}}{g_a}, f_n)$. Then $\check{a}$ is a strongly convergent estimate of $a$, as defined in proposition 3.1.

Let us also introduce the following sequences $(\check{a}_k)_{k \ge 1}$ and $(\check{g}^{(k)}_n)_{k \ge 1}$, for any given $n$ - see section 2.2:
• $\check{a}_k$ is an estimate of $a_k$ as defined in proposition 3.2, with $\check{g}^{(k-1)}_n$ instead of $g$,
• $\check{g}^{(k)}_n$ is such that $\check{g}^{(0)}_n = g$ and $\check{g}^{(k)}_n(x) = \check{g}^{(k-1)}_n(x) \frac{f_{\check{a}_k,n}(\check{a}_k^\top x)}{[\check{g}^{(k-1)}]_{\check{a}_k,n}(\check{a}_k^\top x)}$, i.e. $\check{g}^{(k)}_n(x) = g(x) \prod_{j=1}^{k} \frac{f_{\check{a}_j,n}(\check{a}_j^\top x)}{[\check{g}^{(j-1)}]_{\check{a}_j,n}(\check{a}_j^\top x)}$.

We also note that $\check{g}^{(k)}_n$ is a density.

3.1.3. Convergence study at the k-th step of the algorithm

In this paragraph, we will show that the sequence $(\check{a}_k)_n$ converges towards $a_k$ and that the sequence $(\check{g}^{(k)}_n)_n$ converges towards $g^{(k)}$. Let $\check{c}_n(a) = \arg\sup_{c \in \Theta} P_n M(c, a)$, with $a \in \Theta$, and let $\check{\gamma}_n = \arg\inf_{a \in \Theta} \sup_{c \in \Theta} P_n M(c, a)$. We state

Proposition 3.3. $\sup_{a \in \Theta} \|\check{c}_n(a) - a_k\|$ converges to 0 and $\check{\gamma}_n$ converges towards $a_k$, a.s.

Finally, the following theorem shows that $\check{g}^{(k)}_n$ converges almost everywhere towards $g^{(k)}$:

Theorem 3.1. It holds that $\check{g}^{(k)}_n \to_n g^{(k)}$ a.s.

3.2. Asymptotic inference at the k-th step of the algorithm

The following theorem shows that $\check{g}^{(k)}_n$ converges towards $g^{(k)}$ at the rate $O_P(n^{-\frac{2}{2+d}})$ in three different cases, namely for any given $x$, in $L_1$ distance and in relative entropy:

Theorem 3.2. It holds that $|\check{g}^{(k)}_n(x) - g^{(k)}(x)| = O_P(n^{-\frac{2}{2+d}})$, $\int |\check{g}^{(k)}_n(x) - g^{(k)}(x)|\, dx = O_P(n^{-\frac{2}{2+d}})$ and $|K(\check{g}^{(k)}_n, f) - K(g^{(k)}, f)| = O_P(n^{-\frac{2}{2+d}})$.

Remark 3.2. With the relative entropy, we have $n = O(m^{1/2})$ - see lemma F.13. The above rates consequently become $O_P(m^{-\frac{1}{2+d}})$.

The following theorem shows that the laws of our estimators of $a_k$, namely $\check{c}_n(a_k)$ and $\check{\gamma}_n$, converge towards a linear combination of Gaussian variables.

Theorem 3.3. It holds that
$$\sqrt{n}\, A\, (\check{c}_n(a_k) - a_k) \xrightarrow{Law} B\, \mathcal{N}_d\big(0, P\|\tfrac{\partial}{\partial b} M(a_k, a_k)\|^2\big) + C\, \mathcal{N}_d\big(0, P\|\tfrac{\partial}{\partial a} M(a_k, a_k)\|^2\big)$$
and
$$\sqrt{n}\, A\, (\check{\gamma}_n - a_k) \xrightarrow{Law} C\, \mathcal{N}_d\big(0, P\|\tfrac{\partial}{\partial b} M(a_k, a_k)\|^2\big) + C\, \mathcal{N}_d\big(0, P\|\tfrac{\partial}{\partial a} M(a_k, a_k)\|^2\big),$$
where $A = P\frac{\partial^2}{\partial b \partial b} M(a_k, a_k)\big(P\frac{\partial^2}{\partial a_i \partial a_j} M(a_k, a_k) + P\frac{\partial^2}{\partial a_i \partial b_j} M(a_k, a_k)\big)$, $C = P\frac{\partial^2}{\partial b \partial b} M(a_k, a_k)$ and $B = P\frac{\partial^2}{\partial b \partial b} M(a_k, a_k) + P\frac{\partial^2}{\partial a_i \partial a_j} M(a_k, a_k) + P\frac{\partial^2}{\partial a_i \partial b_j} M(a_k, a_k)$.
3.3. A stopping rule for the procedure

In this paragraph, we will call $\check{g}^{(k)}_n$ (resp. $\check{g}^{(k)}_{a,n}$) the kernel estimator of $\check{g}^{(k)}$ (resp. $\check{g}^{(k)}_a$). We will first show that $\check{g}^{(k)}_n$ converges towards $f$ in $k$ and $n$. Then, we will provide a stopping rule for this identification procedure.

3.3.1. Estimation of f

The following theorem provides us with an estimate of $f$:

Theorem 3.4. We have $\lim_n \lim_k \check{g}^{(k)}_n = f$ a.s.

Consequently, the following corollary shows that $\Phi\big(\check{g}^{(k)}_n \frac{f_{a_k,n}}{[\check{g}^{(k)}]_{a_k,n}}, f_n\big)$ converges towards zero as $k$ and then as $n$ go to infinity:

Corollary 3.1. We have $\lim_n \lim_k \Phi\big(\check{g}^{(k)}_n \frac{f_{a_k,n}}{[\check{g}^{(k)}]_{a_k,n}}, f_n\big) = 0$ a.s.

3.3.2. Testing of the criterion

In this paragraph, through a test of our criterion, namely $a \mapsto \Phi\big(\check{g}^{(k)}_n \frac{f_{a,n}}{[\check{g}^{(k)}]_{a,n}}, f_n\big)$, we will build a stopping rule for this procedure. First, the next theorem enables us to derive the law of our criterion:

Theorem 3.5. For a fixed $k$, we have
$$\sqrt{n}\,\big(\mathrm{Var}_P(M(\check{c}_n(\check{\gamma}_n), \check{\gamma}_n))\big)^{-1/2}\big(P_n M(\check{c}_n(\check{\gamma}_n), \check{\gamma}_n) - P_n M(a_k, a_k)\big) \xrightarrow{Law} \mathcal{N}(0, I),$$
where $k$ represents the $k$-th step of our algorithm and where $I$ is the identity matrix in $\mathbb{R}^d$.

Note that $k$ is fixed in theorem 3.5, since $\check{\gamma}_n = \arg\inf_{a \in \Theta} \sup_{c \in \Theta} P_n M(c, a)$, where $M$ is a known function of $k$ - see section 3.1.1. Thus, in the case where $\Phi(g^{(k-1)} \frac{f_{a_k}}{g^{(k-1)}_{a_k}}, f) = 0$, we obtain

Corollary 3.2. We have $\sqrt{n}\,(\mathrm{Var}_P(M(\check{c}_n(\check{\gamma}_n), \check{\gamma}_n)))^{-1/2}\, P_n M(\check{c}_n(\check{\gamma}_n), \check{\gamma}_n) \xrightarrow{Law} \mathcal{N}(0, I)$.

Hence, we propose the test of the null hypothesis $(\mathcal{H}_0)$: $\Phi(g^{(k-1)} \frac{f_{a_k}}{g^{(k-1)}_{a_k}}, f) = 0$ versus the alternative $(\mathcal{H}_1)$: $\Phi(g^{(k-1)} \frac{f_{a_k}}{g^{(k-1)}_{a_k}}, f) \ne 0$. Based on this result, we stop the algorithm; then, defining $a_k$ as the last vector generated, we derive from corollary 3.2 an $\alpha$-level confidence ellipsoid around $a_k$, namely
$$E_k = \big\{ b \in \mathbb{R}^d;\; \sqrt{n}\,(\mathrm{Var}_P(M(b, b)))^{-1/2} P_n M(b, b) \le q^{\mathcal{N}(0,1)}_\alpha \big\},$$
where $q^{\mathcal{N}(0,1)}_\alpha$ is the quantile of an $\alpha$-level reduced centered normal distribution and where $P_n$ is the empirical measure arising from a realisation of the sequences $(X_1, \ldots, X_n)$ and $(Y_1, \ldots, Y_n)$. Consequently, the following corollary provides us with a confidence region for the above test:

Corollary 3.3. $E_k$ is a confidence region for the test of the null hypothesis $(\mathcal{H}_0)$ versus $(\mathcal{H}_1)$.
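A rough idea of how membership in $E_k$ could be checked in practice is given by the sketch below. It assumes that the values $M(b, b, X_i)$ have already been computed (building $M$ itself requires the kernel estimates of section 2.2), and it reduces the matrix normalisation $(\mathrm{Var}_P M)^{-1/2}$ to a scalar studentisation; both the helper name and this simplification are our own.

```python
import numpy as np
from scipy.stats import norm

def in_confidence_ellipsoid(M_values, alpha=0.9):
    """Check sqrt(n) * Var(M)^(-1/2) * P_n M <= q_alpha for a candidate b,
    given M_values[i] = M(b, b, X_i). A scalar simplification of E_k."""
    n = len(M_values)
    stat = np.sqrt(n) * M_values.mean() / M_values.std(ddof=1)
    return stat <= norm.ppf(alpha)

rng = np.random.default_rng(2)
M_values = rng.normal(0.0, 1.0, size=100)   # placeholder values of M(b, b, X_i)
print(in_confidence_ellipsoid(M_values))
```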
3.4. Goodness-of-fit test for copulas

Let us begin by studying the following case: let $f$ be a density defined on $\mathbb{R}^2$ and let $g$ be an elliptical distribution with the same mean and variance as $f$. Assume first that our algorithm leads us to having $\Phi(g^{(2)}, f) = 0$, where the family $(a_i)$ is the canonical basis of $\mathbb{R}^2$. Hence, we have $g^{(2)}(x) = g(x) \frac{f_1}{g_1} \frac{f_2}{g^{(1)}_2} = g(x) \frac{f_1}{g_1} \frac{f_2}{g_2}$ - through lemma F.7 - and $g^{(2)} = f$. Therefore, $f = g(x) \frac{f_1}{g_1} \frac{f_2}{g_2}$, i.e. $\frac{f}{f_1 f_2} = \frac{g}{g_1 g_2}$, and then $\frac{\partial^2}{\partial x \partial y} C_f = \frac{\partial^2}{\partial x \partial y} C_g$, where $C_f$ (resp. $C_g$) is the copula of $f$ (resp. $g$).

At present, let $f$ be a density on $\mathbb{R}^d$ and let $g$ be the density defined in section 2.1.2. Let us assume that our algorithm implies that $\Phi(g^{(d)}, f) = 0$. Hence, for any $x \in \mathbb{R}^d$, we have $g(x) \prod_{k=1}^{d} \frac{f_{a_k}(a_k^\top x)}{[g^{(k-1)}]_{a_k}(a_k^\top x)} = f(x)$, i.e. $\frac{g(x)}{\prod_{k=1}^{d} g_{a_k}(a_k^\top x)} = \frac{f(x)}{\prod_{k=1}^{d} f_{a_k}(a_k^\top x)}$, since lemma F.7 implies that $g^{(k-1)}_{a_k} = g_{a_k}$ if $k \le d$. Moreover, the family $(a_i)_{i=1,\ldots,d}$ is a basis of $\mathbb{R}^d$ - see lemma F.8. Hence, putting $A = (a_1, \ldots, a_d)$ and defining the vector $y$ (resp. the density $\tilde{f}$, the copula $\tilde{C}_f$ of $\tilde{f}$, the density $\tilde{g}$, the copula $\tilde{C}_g$ of $\tilde{g}$) as the expression of the vector $x$ (resp. of the density $f$, the copula $C_f$ of $f$, the density $g$, the copula $C_g$ of $g$) in the basis $A$, the above equality implies $\frac{\partial^d}{\partial y_1 \cdots \partial y_d} \tilde{C}_f = \frac{\partial^d}{\partial y_1 \cdots \partial y_d} \tilde{C}_g$. Finally, we perform a statistical test of the null hypothesis $(\mathcal{H}_0)$: $\frac{\partial^d}{\partial y_1 \cdots \partial y_d} \tilde{C}_f = \frac{\partial^d}{\partial y_1 \cdots \partial y_d} \tilde{C}_g$ versus the alternative $(\mathcal{H}_1)$: $\frac{\partial^d}{\partial y_1 \cdots \partial y_d} \tilde{C}_f \ne \frac{\partial^d}{\partial y_1 \cdots \partial y_d} \tilde{C}_g$. Since, under $(\mathcal{H}_0)$, we have $\Phi(g^{(d)}, f) = 0$, then, as explained in section 3.3.2, corollary 3.3 provides us with a confidence region for our test.

Theorem 3.6. Keeping the notations of corollary 3.3, we infer that $E_d$ is a confidence region for the test of the null hypothesis $(\mathcal{H}_0)$ versus the alternative hypothesis $(\mathcal{H}_1)$.

3.5. Rewriting of the convolution product

In the present paper, we first elaborated an algorithm aiming at isolating several known structures from the initial data. Our objective was to verify whether, for a known density $f$ on $\mathbb{R}^d$ and a known density $n$ on $\mathbb{R}^{d-j}$, with $d > 1$,
$$f(x) = n(a_{j+1}^\top x, \ldots, a_d^\top x)\, h(a_1^\top x, \ldots, a_j^\top x) \qquad (3.1)$$
did indeed hold, with $j < d$, with $(a_1, \ldots, a_d)$ being a basis of $\mathbb{R}^d$ and with $h$ being a density on $\mathbb{R}^j$. Secondly, our next step consisted in building an estimate (resp. a representation) of $f$ without necessarily assuming that $f$ meets relationship (3.1) - see theorem 3.4.

Consequently, let us consider $Z_1$ and $Z_2$, two random vectors on $\mathbb{R}^d$ with respective densities $h_1$ and $h_2$ - the latter being elliptical. Let us consider a random vector $X$ such that $X = Z_1 + Z_2$ and let $f$ be its density. This density can then be written as
$$f(x) = h_1 * h_2(x) = \int_{\mathbb{R}^d} h_1(t)\, h_2(x - t)\, dt.$$
Then, the following proposition enables us to represent $f$ in the form of a product and without the integral sign.

Proposition 3.4. Let $\phi$ be a centered elliptical density with covariance matrix $\sigma^2 I_d$, $\sigma^2 > 0$, such that it is a product density in all orthogonal coordinate systems and such that its characteristic function $s \mapsto \Psi(\frac{1}{2}|s|^2 \sigma^2)$ is integrable - see Landsman (2003). Let $f$ be a density on $\mathbb{R}^d$ which can be deconvoluted with $\phi$, i.e. $f = \bar{f} * \phi = \int_{\mathbb{R}^d} \bar{f}(t)\, \phi(\cdot - t)\, dt$, where $\bar{f}$ is some density on $\mathbb{R}^d$. Let $g^{(0)}$ be the elliptical density belonging to the same elliptical family as $f$ and having the same mean and variance as $f$. Then, the sequence $(g^{(k)})_k$ converges uniformly a.s. and in $L_1$ towards $f$ in $k$, i.e.
$$\lim_{k \to \infty} \sup_{x \in \mathbb{R}^d} |g^{(k)}(x) - f(x)| = 0 \quad \text{and} \quad \lim_{k \to \infty} \int_{\mathbb{R}^d} |g^{(k)}(x) - f(x)|\, dx = 0.$$

Finally, with the notations of section 3.3 and of proposition 3.4, the following theorem enables us to estimate any convolution product of a multivariate elliptical density $\phi$ with a continuous density $f$:

Theorem 3.7. It holds that $\lim_n \lim_k \check{g}^{(k)}_n = f * \phi$ a.s.
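The convolution product above is straightforward to evaluate numerically in one dimension; the following sketch (our own illustration, with an arbitrary Gumbel $h_1$ and standard normal $h_2$) computes $f = h_1 * h_2$ on a grid and sanity-checks it against a histogram estimate from simulated sums $Z_1 + Z_2$.

```python
import numpy as np
from scipy.stats import gumbel_r, norm

# Evaluate f(x) = (h1 * h2)(x) = int h1(t) h2(x - t) dt on a grid
t = np.linspace(-30.0, 30.0, 6001)
dt = t[1] - t[0]
h1 = gumbel_r(-1, 1).pdf(t)

def f_conv(x):
    return np.sum(h1 * norm(0, 1).pdf(x - t)) * dt

# Sanity check: histogram estimate of the density of Z1 + Z2 at 0
rng = np.random.default_rng(3)
z = gumbel_r(-1, 1).rvs(10**5, random_state=rng) + rng.standard_normal(10**5)
print(f_conv(0.0), np.mean(np.abs(z) < 0.05) / 0.1)   # the two values should be close
```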
3.6. On the regression

In this section, we will study several applications of our algorithm pertaining to regression analysis. We define $(X_1, \ldots, X_d)$ (resp. $(Y_1, \ldots, Y_d)$) as a vector with density $f$ (resp. $g$ - see section 2.1.2).

Remark 3.3. In this paragraph, we will work in the $L_2$ space. We will therefore only consider the $\Phi$-divergences which are greater than or equal to the $L_2$ distance - see Vajda (1973). Note also that the co-vectors of $f$ can be obtained in the $L_2$ space - see lemma F.6 and proposition B.1.

3.6.1. The basic idea

In this paragraph, we will assume that $\Theta = \mathbb{R}^2_*$ and that our algorithm stops for $j = 1$ and $a_1 = (0, 1)'$. The following theorem provides us with the regression of $X_1$ on $X_2$:

Theorem 3.8. The probability measure of $X_1$ given $X_2$ is the same as the probability measure of $Y_1$ given $Y_2$. Moreover, the regression between $X_1$ and $X_2$ is $X_1 = E(Y_1/Y_2) + \varepsilon$, where $\varepsilon$ is a centered random variable orthogonal to $E(X_1/X_2)$.

Remark 3.4. This theorem implies that $E(X_1/X_2) = E(Y_1/Y_2)$. This equation can be used in many fields of research; the Markov chain theory has been used, for instance, in example 1.2. Moreover, if $g$ is a Gaussian density with the same mean and variance as $f$, then Saporta (2006) implies that $E(Y_1/Y_2) = E(Y_1) + \frac{Cov(Y_1, Y_2)}{Var(Y_2)}(Y_2 - E(Y_2))$, and then $X_1 = E(Y_1) + \frac{Cov(Y_1, Y_2)}{Var(Y_2)}(Y_2 - E(Y_2)) + \varepsilon$.

3.6.2. General case

In this paragraph, we will assume that $\Theta = \mathbb{R}^d_*$ and that our algorithm stops with $j$, for $j < d$. Lemma F.9 implies the existence of an orthogonal and free family $(b_i)_{i=j+1,\ldots,d}$ of $\mathbb{R}^d_*$ such that $\mathbb{R}^d = Vect\{a_i\} \overset{\perp}{\oplus} Vect\{b_k\}$ and such that
$$g(b_{j+1}^\top x, \ldots, b_d^\top x / a_1^\top x, \ldots, a_j^\top x) = f(b_{j+1}^\top x, \ldots, b_d^\top x / a_1^\top x, \ldots, a_j^\top x). \qquad (3.2)$$
Hence, the following theorem provides us with the regression of $b_k^\top X$, $k = j+1, \ldots, d$, on $(a_1^\top X, \ldots, a_j^\top X)$:

Theorem 3.9. The probability measure of $(b_{j+1}^\top X, \ldots, b_d^\top X)$ given $(a_1^\top X, \ldots, a_j^\top X)$ is the same as the probability measure of $(b_{j+1}^\top Y, \ldots, b_d^\top Y)$ given $(a_1^\top Y, \ldots, a_j^\top Y)$. Moreover, the regression of $b_k^\top X$, $k = j+1, \ldots, d$, on $(a_1^\top X, \ldots, a_j^\top X)$ is $b_k^\top X = E(b_k^\top Y / a_1^\top Y, \ldots, a_j^\top Y) + b_k^\top \varepsilon$, where $\varepsilon$ is a centered random vector such that $b_k^\top \varepsilon$ is orthogonal to $E(b_k^\top X / a_1^\top X, \ldots, a_j^\top X)$.

Corollary 3.4. If $g$ is a Gaussian density with the same mean and variance as $f$, and if $Cov(X_i, X_j) = 0$ for any $i \ne j$, then the regression of $b_k^\top X$, $k = j+1, \ldots, d$, on $(a_1^\top X, \ldots, a_j^\top X)$ is $b_k^\top X = E(b_k^\top Y) + b_k^\top \varepsilon$, where $\varepsilon$ is a centered random vector such that $b_k^\top \varepsilon$ is orthogonal to $E(b_k^\top X / a_1^\top X, \ldots, a_j^\top X)$.
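Remark 3.4 turns theorem 3.8 into an explicit regression line computed from the moments of $g$, which equal those of $f$. A minimal sketch of this computation follows, assuming a sample from $f$; the helper name and the toy data are our own.

```python
import numpy as np

def regression_line_from_g(X):
    """Coefficients of X_1 = E(Y_1) + Cov(Y_1, Y_2)/Var(Y_2) * (Y_2 - E(Y_2)) + eps
    (Remark 3.4): since g shares the mean and covariance of f, the coefficients
    are computed from the sample moments of X."""
    mu = X.mean(0)
    S = np.cov(X.T)
    slope = S[0, 1] / S[1, 1]
    intercept = mu[0] - slope * mu[1]
    return intercept, slope

rng = np.random.default_rng(4)
X2 = rng.normal(0, 1, 1000)
X1 = -4.5 + 0.07 * X2 + rng.normal(0, 1, 1000)   # toy data in the spirit of Simulation 4.3
print(regression_line_from_g(np.column_stack([X1, X2])))
```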
4. Simulations

Let us study four examples. The first involves a $\chi^2$-divergence, the second a Hellinger distance, the third a Cressie-Read divergence (with $\gamma = 1.25$) and the fourth a Kullback-Leibler divergence. In each example, our program will follow our algorithm and will aim at creating a sequence of densities $(g^{(j)})$, $j = 1, \ldots, k$, $k < d$, such that $g^{(0)} = g$, $g^{(j)} = g^{(j-1)} f_{a_j} / [g^{(j-1)}]_{a_j}$ and $\Phi(g^{(k)}, f) = 0$, with $\Phi$ being a divergence and $a_j = \arg\inf_b \Phi(g^{(j-1)} f_b / [g^{(j-1)}]_b, f)$ for all $j = 1, \ldots, k$. Moreover, in the second example, we will study the robustness of our method with two outliers. In the third example, defining $(X_1, X_2)$ as a vector with density $f$, we will study the regression of $X_1$ on $X_2$. And finally, in the fourth example, we will perform our goodness-of-fit test for copulas.

Simulation 4.1 (with the $\chi^2$ divergence). We are in dimension 3 ($= d$), and we consider a sample of 50 ($= n$) values of a random variable $X$ with a density $f$ defined by
$$f(x) = Gaussian(x_1 + x_2)\, Gaussian(x_0 + x_2)\, Gumbel(x_0 + x_1),$$
where the normal law parameters are $(-5, 2)$ and $(1, 1)$ and where the Gumbel distribution parameters are $-3$ and $4$. Let us then generate a Gaussian random variable $Y$ - whose density we will name $g$ - with the same mean and variance as $f$. We theoretically obtain $k = 1$ and $a_1 = (1, 1, 0)$. To get this result, we perform the following test: $(\mathcal{H}_0)$: $a_1 = (1, 1, 0)$ versus $(\mathcal{H}_1)$: $a_1 \ne (1, 1, 0)$. Then, corollary 3.3 enables us to estimate $a_1$ by the following 0.9 ($= \alpha$) level confidence ellipsoid
$$E_1 = \{ b \in \mathbb{R}^3;\; (\mathrm{Var}_P(M(b, b)))^{-1/2} P_n M(b, b) \le q^{\mathcal{N}(0,1)}_\alpha / \sqrt{n} \simeq 0.2533 / 7.0710678 = 0.03582203 \}.$$
And we obtain:

Our Algorithm
Projection study 0: minimum 0.0201741, at point (1.00912, 1.09453, 0.01893)
P-value: 0.81131
Test $\mathcal{H}_0$: $a_1 \in E_1$: True
$\chi^2$(kernel estimate of $g^{(1)}$, $g^{(1)}$): 6.1726

Therefore, we conclude that $f = g^{(1)}$.

Simulation 4.2 (with the Hellinger distance $H$). We are in dimension 20 ($= d$). We first generate a sample of 100 ($= n$) observations, namely two outliers $x = (2, 0, \ldots, 0)$ and 98 values of a random variable $X$ with a density $f$ defined by $f(x) = Gumbel(x_0)\, Normal(x_1, \ldots, x_9)$, where the Gumbel law parameters are $-5$ and $1$ and where the normal distribution is reduced and centered. Our reasoning is the same as in Simulation 4.1. In the first part of the program, we theoretically obtain $k = 1$ and $a_1 = (1, 0, \ldots, 0)$. To get this result, we perform the following test: $(\mathcal{H}_0)$: $a_1 = (1, 0, \ldots, 0)$ versus $(\mathcal{H}_1)$: $a_1 \ne (1, 0, \ldots, 0)$. We estimate $a_1$ by the following 0.9 ($= \alpha$) level confidence ellipsoid
$$E_1 = \{ b \in \mathbb{R}^{20};\; (\mathrm{Var}_P(M(b, b)))^{-1/2} P_n M(b, b) \le q^{\mathcal{N}(0,1)}_\alpha / \sqrt{n} \simeq 0.02533 \}.$$
And we obtain:

Our Algorithm
Projection study 0: minimum 0.002692, at point (1.01326, 0.0657, 0.0628, 0.1011, 0.0509, 0.1083, 0.1261, 0.0573, 0.0377, 0.0794, 0.0906, 0.0356, 0.0012, 0.0292, 0.0737, 0.0934, 0.0286, 0.1057, 0.0697, 0.0771)
P-value: 0.80554
Test $\mathcal{H}_0$: $a_1 \in E_1$: True
$H$(estimate of $g^{(1)}$, $g^{(1)}$): 3.042174

Therefore, we conclude that $f = g^{(1)}$.

Simulation 4.3 (with the Cressie-Read divergence $\Phi$). We are in dimension 2 ($= d$), and we consider a sample of 50 ($= n$) values of a random variable $X = (X_1, X_2)$ with a density $f$ defined by $f(x) = Gumbel(x_0)\, Normal(x_1)$, where the Gumbel law parameters are $-5$ and $1$ and where the normal distribution parameters are $(0, 1)$. Let us then generate a Gaussian random variable $Y$ - whose density we will name $g$ - with the same mean and variance as $f$. We theoretically obtain $k = 1$ and $a_1 = (1, 0)$. To get this result, we perform the following test: $(\mathcal{H}_0)$: $a_1 = (1, 0)$ versus $(\mathcal{H}_1)$: $a_1 \ne (1, 0)$.
Then, corollary 3.3 enables us to estimate $a_1$ by the following 0.9 ($= \alpha$) level confidence ellipsoid
$$E_1 = \{ b \in \mathbb{R}^2;\; (\mathrm{Var}_P(M(b, b)))^{-1/2} P_n M(b, b) \le q^{\mathcal{N}(0,1)}_\alpha / \sqrt{n} \simeq 0.03582203 \}.$$
And we obtain:

Our Algorithm
Projection study 0: minimum 0.0210058, at point (1.001, 0.0014)
P-value: 0.989552
Test $\mathcal{H}_0$: $a_1 \in E_1$: True
$\Phi$(kernel estimate of $g^{(1)}$, $g^{(1)}$): 6.47617

Therefore, we conclude that $f = g^{(1)}$.

Figure 1: Graph of the distribution to estimate (red) and of our own estimate (green).

Figure 2: Graph of the distribution to estimate (red) and of Huber's estimate (green).

At present, keeping the notations of this simulation, let us study the regression of $X_1$ on $X_2$. Our algorithm leads us to infer that the density of $X_1$ given $X_2$ is the same as the density of $Y_1$ given $Y_2$. Moreover, property A.1 implies that the co-vectors of $f$ are the same for all divergences. Consequently, we can use theorem 3.8, i.e. it implies that $X_1 = E(Y_1/Y_2) + \varepsilon$, where $\varepsilon$ is a centered random variable orthogonal to $E(X_1/X_2)$. Thus, since $g$ is a Gaussian density, remark 3.4 implies that $X_1 = E(Y_1) + \frac{Cov(Y_1, Y_2)}{Var(Y_2)}(Y_2 - E(Y_2)) + \varepsilon$. Now, using the least squares method, we estimate $\alpha_1$ and $\alpha_2$ such that $X_1 = \alpha_1 + \alpha_2 X_2 + \varepsilon$. The following table presents the results of our regression and of the least squares method, if we assume that $\varepsilon$ is Gaussian.

Our regression:
$E(Y_1)$ = -4.545483
$Cov(Y_1, Y_2)$ = 0.0380534
$Var(Y_2)$ = 0.9190052
$E(Y_2)$ = 0.3103752
correlation coefficient $(Y_1, Y_2)$ = 0.02158213

Least squares method:
$\alpha_1$ = -4.34159227
$\alpha_2$ = 0.06803317
correlation coefficient $(X_1, X_2)$ = 0.04888484

Figure 3: Graph of the regression of $X_1$ on $X_2$ based on the least squares method (red) and based on our theory (green).

Simulation 4.4 (with the relative entropy $K$). We are in dimension 2 ($= d$), and we use the relative entropy to perform our optimisations. Let us consider a sample of 50 ($= n$) values of a random variable $X$ with a density $f$ defined by
$$f(x) = c_\rho(F_{Gumbel}(x_0), F_{Exponential}(x_1))\, Gumbel(x_0)\, Exponential(x_1),$$
where:
• $c$ is the Gaussian copula with correlation coefficient $\rho = 0.5$,
• the Gumbel distribution parameters are $-1$ and $1$, and
• the Exponential density parameter is $2$.

Let us then generate a Gaussian random variable $Y$ - whose density we will name $g$ - with the same mean and variance as $f$. We theoretically obtain $k = 2$ and $(a_1, a_2) = ((1, 0), (0, 1))$. To get this result, we perform the following test: $(\mathcal{H}_0)$: $(a_1, a_2) = ((1, 0), (0, 1))$ versus $(\mathcal{H}_1)$: $(a_1, a_2) \ne ((1, 0), (0, 1))$. Then, theorem 3.6 enables us to verify $(\mathcal{H}_0)$ by the following 0.9 ($= \alpha$) level confidence ellipsoid
$$E_2 = \{ b \in \mathbb{R}^2;\; (\mathrm{Var}_P(M(b, b)))^{-1/2} P_n M(b, b) \le q^{\mathcal{N}(0,1)}_\alpha / \sqrt{n} \simeq 0.2533 / 7.0710678 = 0.0358220 \}.$$
And we obtain:

Our Algorithm
Projection study number 0: minimum 0.445199, at point (1.0142, 0.0026)
P-value: 0.94579
Test $\mathcal{H}_0$: $a_1 \in E_1$: False
Projection study number 1: minimum 0.0263, at point (0.0084, 0.9006)
P-value: 0.97101
Test $\mathcal{H}_0$: $a_2 \in E_2$: True
$K$(kernel estimate of $g^{(2)}$, $g^{(2)}$): 4.0680

Therefore, we can conclude that $\mathcal{H}_0$ is verified.

Figure 4: Graph of the estimate of $(x_0, x_1) \mapsto c_\rho(F_{Gumbel}(x_0), F_{Exponential}(x_1))$.

Critique of the simulations

In the case where $f$ is unknown, we will never be sure to have reached the minimum of the $\Phi$-divergence: we have indeed used the simulated annealing method to solve our optimisation problem, and therefore it is only when the number of random jumps tends in theory towards infinity that the probability of reaching the minimum tends to 1. We also note that no theory on the optimal number of jumps to implement exists, as this number depends on the specificities of each particular problem. Moreover, we chose $50^{-\frac{4}{4+d}}$ (resp. $100^{-\frac{4}{4+d}}$) for the AMISE of simulations 4.1, 4.2 and 4.3 (resp. simulation 4.4). This choice leads us to simulate 50 (resp. 100) random variables - see Scott (1992), page 151 - none of which have been discarded to obtain the truncated sample. Finally, we remark that some of the key advantages of our method over Huber's are that - since there exist divergences smaller than the relative entropy - our method requires a considerably shorter computation time, and that our method is superior in terms of robustness.
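Since the optimisation over $a$ is described above as being handled by simulated annealing, here is a minimal sketch of such an annealing loop over the unit sphere, for an arbitrary criterion function; the cooling schedule, step size and default jump count are our own illustrative choices, not the paper's settings.

```python
import numpy as np

def anneal_on_sphere(criterion, d, n_jumps=2000, step=0.3, t0=1.0, seed=0):
    """Minimise a |-> criterion(a) over unit vectors of R^d by simulated annealing."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(d); a /= np.linalg.norm(a)
    cur, cur_val = a, criterion(a)
    best, best_val = cur, cur_val
    for i in range(n_jumps):
        temp = t0 / (1 + i)                  # cooling schedule (illustrative)
        cand = cur + step * rng.standard_normal(d)
        cand /= np.linalg.norm(cand)         # random jump, projected back onto the sphere
        val = criterion(cand)
        if val < cur_val or rng.random() < np.exp(-(val - cur_val) / temp):
            cur, cur_val = cand, val
        if cur_val < best_val:
            best, best_val = cur, cur_val
    return best, best_val

# e.g. with the Monte Carlo criterion of the earlier sketch:
# a1, value = anneal_on_sphere(criterion, d=2)
```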
Conclusion

Projection Pursuit is useful for evidencing characteristic structures, as well as one-dimensional projections and their associated distributions, in multivariate data. Huber (1985) shows us how to achieve this through maximisation of the relative entropy. The present article shows that our $\Phi$-divergence method constitutes a good alternative to Huber's, particularly in terms of regression and robustness, as well as in terms of the study of copulas. Indeed, the convergence results and the simulations we carried out convincingly fulfilled our expectations regarding our methodology.

A. Reminders

A.1. $\Phi$-divergence

Let us call $h_a$ the density of $a^\top Z$ if $h$ is the density of $Z$. Let $\varphi$ be a strictly convex function defined by $\varphi : \mathbb{R}^+ \to \mathbb{R}^+$ and such that $\varphi(1) = 0$.

Definition A.1. We define the $\Phi$-divergence of $P$ from $Q$, where $P$ and $Q$ are two probability distributions over a space $\Omega$ such that $Q$ is absolutely continuous with respect to $P$, by
$$\Phi(Q, P) = \int \varphi\Big(\frac{dQ}{dP}\Big)\, dP. \qquad (A.1)$$
The above expression (A.1) is also valid if $P$ and $Q$ are both dominated by the same probability measure.

The most commonly used distances (Kullback, Hellinger or $\chi^2$) belong to the Cressie-Read family - see Cressie-Read (1984), Csiszár (1967), and the books of Friedrich and Igor (1987), Pardo (2006) and Zografos (1990). They are defined by a specific $\varphi$. Indeed:
• with the relative entropy, we associate $\varphi(x) = x \ln(x) - x + 1$;
• with the Hellinger distance, we associate $\varphi(x) = 2(\sqrt{x} - 1)^2$;
• with the $\chi^2$ distance, we associate $\varphi(x) = \frac{1}{2}(x - 1)^2$;
• more generally, with power divergences, we associate $\varphi(x) = \frac{x^\gamma - \gamma x + \gamma - 1}{\gamma(\gamma - 1)}$, where $\gamma \in \mathbb{R} \setminus \{0, 1\}$;
• and, finally, with the $L_1$ norm, which is also a divergence, we associate $\varphi(x) = |x - 1|$.

In particular, we have the following inequalities: $d_{L_1}(g, f) \le K(g, f) \le \chi^2(g, f)$.

Let us now present some well-known properties of divergences.

Property A.1. We have $\Phi(P, Q) = 0 \Leftrightarrow P = Q$.

Property A.2. The application $Q \mapsto \Phi(Q, P)$ is greater than the $L_1$ distance, convex, lower semi-continuous (l.s.c.) - for the topology that makes all the applications of the form $Q \mapsto \int f\, dQ$ continuous, where $f$ is bounded and continuous - as well as l.s.c. for the topology of uniform convergence.

Property A.3 (corollary (1.29), page 19, of Friedrich and Igor (1987)). If $T : (\mathcal{X}, \mathcal{A}) \to (\mathcal{Y}, \mathcal{B})$ is measurable and if $K(P, Q) < \infty$, then $K(P, Q) \ge K(P T^{-1}, Q T^{-1})$, with equality being reached when $T$ is surjective for $(P, Q)$.

Theorem A.1 (theorem III.4 of Azé (1997)). Let $f : I \to \mathbb{R}$ be a convex function. Then $f$ is a Lipschitz function on all compact intervals $[a, b] \subset int\{I\}$. In particular, $f$ is continuous on $int\{I\}$.

A.2. Useful lemmas

Through a reductio ad absurdum argument, we derive lemmas A.1 and A.2:

Lemma A.1. Let $f$ be a bounded and positive density on $\mathbb{R}^d$. Then any projection density of $f$ - which we will name $f_a$, with $a \in \mathbb{R}^d_*$ - is also bounded and positive on $\mathbb{R}$.

Lemma A.2. Let $f$ be a bounded and positive density on $\mathbb{R}^d$. Then any density $f(./a^\top x)$, for any $a \in \mathbb{R}^d_*$, is also bounded and positive.

By induction and from lemmas A.1 and A.2, we have

Lemma A.3. If $f$ and $g$ are positive and bounded densities, then $g^{(k)}$ is positive and bounded.

Finally, we introduce a last lemma:

Lemma A.4. Let $f$ be an absolutely continuous density. Then, for all sequences $(a_n)$ tending to $a$ in $\mathbb{R}^d_*$, the sequence $f_{a_n}$ uniformly converges towards $f_a$.

Proof. For all $a$ in $\mathbb{R}^d_*$, let $F_a$ be the cumulative distribution function of $a^\top X$, and let $\psi_a$ be a complex function defined by $\psi_a(u, v) = F_a(Re(u + iv)) + i\, F_a(Re(v + iu))$, for all $u$ and $v$ in $\mathbb{R}$. First, the function $\psi_a(u, v)$ is an analytic function, because $x \mapsto f_a(a^\top x)$ is continuous, and, as a result of the corollary of Dini's second theorem - according to which "a sequence of cumulative distribution functions which pointwise converges on $\mathbb{R}$ towards a continuous cumulative distribution function $F$ on $\mathbb{R}$ uniformly converges towards $F$ on $\mathbb{R}$" - we deduce that, for all sequences $(a_n)$ converging towards $a$, $\psi_{a_n}$ uniformly converges towards $\psi_a$. Finally, the Weierstrass theorem (see proposal (10.1), page 220, of the "Calcul infinitésimal" book of Jean Dieudonné) implies that all sequences $\psi'_{a,n}$ uniformly converge towards $\psi'_a$, for all $a_n$ tending to $a$. We can therefore conclude.

B. Study of the sample

Let $X_1, X_2, \ldots, X_m$ be a sequence of independent random vectors with the same density $f$. Let $Y_1, Y_2, \ldots, Y_m$ be a sequence of independent random vectors with the same density $g$.
Then, the kernel estimators $f_m$, $g_m$, $f_{a,m}$ and $g_{a,m}$ of $f$, $g$, $f_a$ and $g_a$, for all $a \in \mathbb{R}^d_*$, almost surely and uniformly converge, since we assume that the bandwidth $h_m$ of these estimators meets the following conditions (see Bosq (1999)):
$$(Hyp): \quad h_m \searrow_m 0, \quad m h_m \nearrow_m \infty, \quad m h_m / L(h_m^{-1}) \to_m \infty \quad \text{and} \quad L(h_m^{-1}) / LL(m) \to_m \infty,$$
with $L(u) = \ln(u \vee e)$.

Let us consider
$$B_1(n, a) = \frac{1}{n} \sum_{i=1}^{n} \varphi'\Big(\frac{f_{a,n}(a^\top Y_i)}{g_{a,n}(a^\top Y_i)} \frac{g_n(Y_i)}{f_n(Y_i)}\Big) \frac{f_{a,n}(a^\top Y_i)}{g_{a,n}(a^\top Y_i)}
\quad \text{and} \quad
B_2(n, a) = \frac{1}{n} \sum_{i=1}^{n} \varphi^*\Big(\varphi'\Big(\frac{f_{a,n}(a^\top X_i)}{g_{a,n}(a^\top X_i)} \frac{g_n(X_i)}{f_n(X_i)}\Big)\Big).$$

Our goal is to estimate the minimum of $\Phi(g \frac{f_a}{g_a}, f)$. To do this, it is necessary for us to truncate our samples. Let us consider a positive sequence $\theta_m$ such that:
• $\theta_m \to 0$;
• $y_m / \theta_m^2 \to 0$, where $y_m$ is the almost sure convergence rate of the kernel density estimator - $y_m = O_P(m^{-\frac{2}{4+d}})$, see lemma F.10;
• $y^{(1)}_m / \theta_m^2 \to 0$, where $y^{(1)}_m$ is defined by $\big|\varphi\big(\frac{g_m(x)}{f_m(x)} \frac{f_{b,m}(b^\top x)}{g_{b,m}(b^\top x)}\big) - \varphi\big(\frac{g(x)}{f(x)} \frac{f_b(b^\top x)}{g_b(b^\top x)}\big)\big| \le y^{(1)}_m$ for all $b$ in $\mathbb{R}^d_*$ and all $x$ in $\mathbb{R}^d$;
• $y^{(2)}_m / \theta_m^2 \to 0$, where $y^{(2)}_m$ is defined by $\big|\varphi'\big(\frac{g_m(x)}{f_m(x)} \frac{f_{b,m}(b^\top x)}{g_{b,m}(b^\top x)}\big) - \varphi'\big(\frac{g(x)}{f(x)} \frac{f_b(b^\top x)}{g_b(b^\top x)}\big)\big| \le y^{(2)}_m$ for all $b$ in $\mathbb{R}^d_*$ and all $x$ in $\mathbb{R}^d$.

We will generate $f_m$, $g_m$ and $g_{b,m}$ from the starting sample, and we will select the $X_i$ and $Y_i$ vectors such that $f_m(X_i) \ge \theta_m$ and $g_{b,m}(b^\top Y_i) \ge \theta_m$, for all $i$ and for all $b \in \mathbb{R}^d_*$. The vectors meeting these conditions will be called $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$, as sketched below.
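A minimal sketch of this truncation step follows, assuming kernel density estimates built with scipy; the paper's simultaneous condition over all $b$ is reduced here to a fixed candidate $b$, and the samples and the value of $\theta_m$ are placeholders.

```python
import numpy as np
from scipy.stats import gaussian_kde

def truncate_samples(X, Y, b, theta_m):
    """Keep the X_i with f_m(X_i) >= theta_m and the Y_i with g_{b,m}(b'Y_i) >= theta_m."""
    f_m = gaussian_kde(X.T)          # kernel estimate of f (dataset shape: (d, m))
    g_b_m = gaussian_kde(Y @ b)      # kernel estimate of the projected density g_b
    X_kept = X[f_m(X.T) >= theta_m]
    Y_kept = Y[g_b_m(Y @ b) >= theta_m]
    return X_kept, Y_kept

rng = np.random.default_rng(5)
X = rng.standard_normal((200, 3))    # placeholder samples
Y = rng.standard_normal((200, 3))
theta_m = 0.005                      # illustrative; Remark B.1 suggests theta_m = m^(-nu), 0 < nu < 1/(4+d)
Xn, Yn = truncate_samples(X, Y, np.array([1.0, 0.0, 0.0]), theta_m)
print(len(Xn), len(Yn))
```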
Consequently, the next proposition provides us with the condition required for us to derive our estimates.

Proposition B.1. Using the notations introduced in Broniatowski (2009) and in section 3.1.1, it holds that
$$\lim_{n \to \infty} \sup_{a \in \mathbb{R}^d_*} \Big| (B_1(n, a) - B_2(n, a)) - \Phi\Big(g \frac{f_a}{g_a}, f\Big) \Big| = 0.$$

Remark B.1. With the relative entropy, we can take for $\theta_m$ the expression $m^{-\nu}$, with $0 < \nu < \frac{1}{4+d}$.

C. Case study: f is known

In this annex, we will study the case where $f$ and $g$ are known. We will then use the notations introduced in sections 3.1.1 and 3.1.2 with $f$ and $g$, i.e. no longer with their kernel estimates.

C.1. Convergence study and asymptotic inference at the k-th step of the algorithm

In this paragraph, when $k$ is less than or equal to $d$, we will show that the sequence $(\check{a}_k)_n$ converges towards $a_k$ and that the sequence $(\check{g}^{(k)})_n$ converges towards $g^{(k)}$. Both $\check{\gamma}_n$ and $\check{c}_n(a)$ are M-estimators and estimate $a_k$ - see Broniatowski (2009). We state

Proposition C.1. Assuming (H1) to (H3) hold, $\sup_{a \in \Theta} \|\check{c}_n(a) - a_k\|$ converges to 0 and $\check{\gamma}_n$ tends to $a_k$, a.s.

Finally, the following theorem shows us that $\check{g}^{(k)}$ converges uniformly almost everywhere towards $g^{(k)}$, for any $k = 1, \ldots, d$.

Theorem C.1. Assuming (H1) to (H3) hold, $\check{g}^{(k)} \to_n g^{(k)}$ a.s. and uniformly a.e.

The following theorem shows that $\check{g}^{(k)}$ converges at the rate $O_P(n^{-1/2})$ in three different cases, namely for any given $x$, in $L_1$ distance and in $\Phi$-divergence:

Theorem C.2. Assuming (H0) to (H3) hold, for any $k = 1, \ldots, d$ and any $x \in \mathbb{R}^d$, we have
$$|\check{g}^{(k)}(x) - g^{(k)}(x)| = O_P(n^{-1/2}), \qquad (C.1)$$
$$\int |\check{g}^{(k)}(x) - g^{(k)}(x)|\, dx = O_P(n^{-1/2}), \qquad (C.2)$$
$$|K(\check{g}^{(k)}, f) - K(g^{(k)}, f)| = O_P(n^{-1/2}). \qquad (C.3)$$

The following theorem shows that the laws of our estimators of $a_k$, namely $\check{c}_n(a_k)$ and $\check{\gamma}_n$, converge towards a linear combination of Gaussian variables.

Theorem C.3. Assuming that conditions (H1) to (H6) hold, then
$$\sqrt{n}\, A\, (\check{c}_n(a_k) - a_k) \xrightarrow{Law} B\, \mathcal{N}_d\big(0, P\|\tfrac{\partial}{\partial b} M(a_k, a_k)\|^2\big) + C\, \mathcal{N}_d\big(0, P\|\tfrac{\partial}{\partial a} M(a_k, a_k)\|^2\big)$$
and
$$\sqrt{n}\, A\, (\check{\gamma}_n - a_k) \xrightarrow{Law} C\, \mathcal{N}_d\big(0, P\|\tfrac{\partial}{\partial b} M(a_k, a_k)\|^2\big) + C\, \mathcal{N}_d\big(0, P\|\tfrac{\partial}{\partial a} M(a_k, a_k)\|^2\big),$$
where $A = P\frac{\partial^2}{\partial b \partial b} M(a_k, a_k)\big(P\frac{\partial^2}{\partial a_i \partial a_j} M(a_k, a_k) + P\frac{\partial^2}{\partial a_i \partial b_j} M(a_k, a_k)\big)$, $C = P\frac{\partial^2}{\partial b \partial b} M(a_k, a_k)$ and $B = P\frac{\partial^2}{\partial b \partial b} M(a_k, a_k) + P\frac{\partial^2}{\partial a_i \partial a_j} M(a_k, a_k) + P\frac{\partial^2}{\partial a_i \partial b_j} M(a_k, a_k)$.

C.2. A stopping rule for the procedure

We now assume that the algorithm does not stop after $d$ iterations. We then remark that, for any $i > d$, it still holds that:
• $g^{(i)}(x) = g(x) \prod_{k=1}^{i} \frac{f_{a_k}(a_k^\top x)}{[g^{(k-1)}]_{a_k}(a_k^\top x)}$, with $g^{(0)} = g$;
• $K(g^{(0)}, f) \ge K(g^{(1)}, f) \ge K(g^{(2)}, f) \ge \cdots \ge 0$;
• theorems C.1, C.2 and C.3.

Moreover, as explained in section 14 of Huber (1985) for the relative entropy, the sequence $(\Phi(g^{(k-1)} \frac{f_{a_k}}{g^{(k-1)}_{a_k}}, f))_{k \ge 1}$ converges towards zero. Then, in this paragraph, we will show that $g^{(i)}$ converges towards $f$ in $i$. And finally, we will provide a stopping rule for this identification procedure.

C.2.1. Representation of f

Under (H0), the following proposition shows us that the probability measure with density $g^{(k)}$ converges towards the probability measure with density $f$:

Proposition C.2 (Representation of f). We have $\lim_k g^{(k)} = f$ a.s.

C.2.2. Testing of the criterion

Through a test of the criterion, namely $a \mapsto \Phi(g^{(k-1)} \frac{f_a}{g^{(k-1)}_a}, f)$, we will build a stopping rule for this procedure. First, the next theorem enables us to derive the law of the criterion.

Theorem C.4. Assuming that (H1) to (H3), (H6) and (H8) hold, then
$$\sqrt{n}\,(\mathrm{Var}_P(M(\check{c}_n(\check{\gamma}_n), \check{\gamma}_n)))^{-1/2}\big(P_n M(\check{c}_n(\check{\gamma}_n), \check{\gamma}_n) - P_n M(a_k, a_k)\big) \xrightarrow{Law} \mathcal{N}(0, I),$$
where $k$ represents the $k$-th step of the algorithm and with $I$ being the identity matrix in $\mathbb{R}^d$.

Note that $k$ is fixed in theorem C.4, since $\check{\gamma}_n = \arg\inf_{a \in \Theta} \sup_{c \in \Theta} P_n M(c, a)$, where $M$ is a known function of $k$ - see section 3.1.1. Thus, in the case where $\Phi(g^{(k-1)} \frac{f_{a_k}}{g^{(k-1)}_{a_k}}, f) = 0$, we obtain

Corollary C.1. Assuming that (H1) to (H3), (H6), (H7) and (H8) hold, then
$$\sqrt{n}\,(\mathrm{Var}_P(M(\check{c}_n(\check{\gamma}_n), \check{\gamma}_n)))^{-1/2}\, P_n M(\check{c}_n(\check{\gamma}_n), \check{\gamma}_n) \xrightarrow{Law} \mathcal{N}(0, I).$$

Hence, we propose the test of the null hypothesis $(\mathcal{H}_0)$: $K(g^{(k-1)} \frac{f_{a_k}}{g^{(k-1)}_{a_k}}, f) = 0$ versus $(\mathcal{H}_1)$: $K(g^{(k-1)} \frac{f_{a_k}}{g^{(k-1)}_{a_k}}, f) \ne 0$. Based on this result, we stop the algorithm; then, defining $a_k$ as the last vector generated, we derive from corollary C.1 an $\alpha$-level confidence ellipsoid around $a_k$, namely
$$E_k = \{ b \in \mathbb{R}^d;\; \sqrt{n}\,(\mathrm{Var}_P(M(b, b)))^{-1/2} P_n M(b, b) \le q^{\mathcal{N}(0,1)}_\alpha \},$$
where $q^{\mathcal{N}(0,1)}_\alpha$ is the quantile of an $\alpha$-level reduced centered normal distribution.
D. The first co-vector of f simultaneously optimizes four problems

Let us first study Huber's analytic approach. Let $\mathcal R'$ be the class of all positive functions $r$ defined on $\mathbb{R}$ and such that $f(x)\,r^{-1}(a^\top x)$ is a density on $\mathbb{R}^d$ for all $a$ belonging to $\mathbb{R}^d_*$. The following proposition shows that there exists a vector $a$ such that $\frac{f_a}{g_a}$ minimizes $K(f r^{-1}, g)$ in $r$:

Proposition D.1 (Analytic approach). There exists a vector $a$ belonging to $\mathbb{R}^d_*$ such that $\arg\min_{r\in\mathcal R'} K(f r^{-1}, g) = \frac{f_a}{g_a}$, i.e. $r(a^\top x) = \frac{f_a(a^\top x)}{g_a(a^\top x)}$, as well as $K(f,g) = K(f_a, g_a) + K(f\frac{g_a}{f_a}, g)$.

Let us also study Huber's synthetic approach. Let $\mathcal R$ be the class of all positive functions $r$ defined on $\mathbb{R}$ and such that $g(x)\,r(a^\top x)$ is a density on $\mathbb{R}^d$ for all $a$ belonging to $\mathbb{R}^d_*$. The following proposition shows that there exists a vector $a$ such that $\frac{f_a}{g_a}$ minimizes $K(f, gr)$ in $r$:

Proposition D.2 (Synthetic approach). There exists a vector $a$ belonging to $\mathbb{R}^d_*$ such that $\arg\min_{r\in\mathcal R} K(f, gr) = \frac{f_a}{g_a}$, i.e. $r(a^\top x) = \frac{f_a(a^\top x)}{g_a(a^\top x)}$, as well as $K(f,g) = K(f_a, g_a) + K(f, g\frac{f_a}{g_a})$.

In the meanwhile, the following proposition shows that there exists a vector $a$ such that $\frac{f_a}{g_a}$ minimizes $K(g, f r^{-1})$ in $r$:

Proposition D.3. There exists a vector $a$ belonging to $\mathbb{R}^d_*$ such that $\arg\min_{r\in\mathcal R'} K(g, f r^{-1}) = \frac{f_a}{g_a}$, i.e. $r(a^\top x) = \frac{f_a(a^\top x)}{g_a(a^\top x)}$, as well as $K(g,f) = K(g_a, f_a) + K(g, f\frac{g_a}{f_a})$.

Remark D.1. First, through property A.3 page 18, we get $K(f, g\frac{f_a}{g_a}) = K(g, f\frac{g_a}{f_a}) = K(f\frac{g_a}{f_a}, g)$ and $K(f_a, g_a) = K(g_a, f_a)$. Thus, proposition D.3 implies that finding the argument of the maximum of $K(g_a, f_a)$ amounts to finding the argument of the maximum of $K(f_a, g_a)$. Consequently, the criterion of Huber's methodologies is $a \mapsto K(g_a, f_a)$. Second, if the $\Phi$-divergence is the relative entropy, then our criterion is $a \mapsto K(g\frac{f_a}{g_a}, f)$, and property A.3 implies $K(g, f\frac{g_a}{f_a}) = K(g\frac{f_a}{g_a}, f)$.

To recapitulate, the choice of $r = \frac{f_a}{g_a}$ enables us to simultaneously solve the following four optimisation problems, for $a \in \mathbb{R}^d_*$:
First, find $a$ such that $a = \arg\inf_{a\in\mathbb{R}^d_*} K(f\frac{g_a}{f_a}, g)$;
Second, find $a$ such that $a = \arg\inf_{a\in\mathbb{R}^d_*} K(f, g\frac{f_a}{g_a})$;
Third, find $a$ such that $a = \arg\sup_{a\in\mathbb{R}^d_*} K(g_a, f_a)$;
Fourth, find $a$ such that $a = \arg\inf_{a\in\mathbb{R}^d_*} K(g\frac{f_a}{g_a}, f)$.
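To make the third problem concrete, the following sketch estimates the criterion $a \mapsto K(g_a, f_a)$ by a Monte-Carlo average of $\ln(g_a/f_a)$ over the instrumental sample and maximises it over a grid of directions in $\mathbb{R}^2$. The toy densities, the angular grid and all names are our own assumptions, not the article's code.

```python
# Hedged sketch of the criterion a -> K(g_a, f_a): estimate it by
# E_g[ln(g_a(a'Y)/f_a(a'Y))] and maximise over unit directions in R^2.
# Toy data and the grid search are ours; in higher dimension one would
# optimise over the unit sphere instead.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
m = 3000
X = rng.standard_normal((m, 2))
X[:, 0] = np.sign(X[:, 0]) * np.abs(X[:, 0]) ** 1.3   # f non-Gaussian along e_1
Y = rng.standard_normal((m, 2))                        # instrumental g: N(0, I)

def kl_projected(a):
    """Monte-Carlo estimate of K(g_a, f_a)."""
    fa, ga = gaussian_kde(X @ a), gaussian_kde(Y @ a)
    z = Y @ a                                          # z is a sample from g_a
    return np.mean(np.log(ga(z) / fa(z)))

angles = np.linspace(0.0, np.pi, 60, endpoint=False)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
a1 = dirs[np.argmax([kl_projected(a) for a in dirs])]
print("estimated first direction:", a1)               # expected close to (1, 0)
```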
E. Hypotheses' discussion

E.1. Discussion of (H2)

Let us work with the relative entropy and with $g$ and $a_1$. For all $b \in \mathbb{R}^d_*$, we have
$$\int \varphi^*\Big(\varphi'\Big(\frac{g(x)\,f_b(b^\top x)}{f(x)\,g_b(b^\top x)}\Big)\Big) f(x)\,dx = \int \Big(\frac{g(x)\,f_b(b^\top x)}{f(x)\,g_b(b^\top x)} - 1\Big) f(x)\,dx = 0,$$
since, for any $b$ in $\mathbb{R}^d_*$, the function $x \mapsto g(x)\frac{f_b(b^\top x)}{g_b(b^\top x)}$ is a density. The complement of $\Theta_\Phi$ in $\mathbb{R}^d_*$ is $\emptyset$, and then the supremum looked for in $\overline{\mathbb{R}}$ is $-\infty$. We can therefore conclude. It is interesting to note that we obtain the same verification with $f$, $g^{(k-1)}$ and $a_k$.

E.2. Discussion of (H4)

This hypothesis consists in the following assumptions:
• we work with the relative entropy, (0)
• we have $f(\cdot|a_1^\top x) = g(\cdot|a_1^\top x)$, i.e. $K(g\frac{f_1}{g_1}, f) = 0$ — we could also derive the same proof with $f$, $g^{(k-1)}$ and $a_k$. (1)

Preliminary (A): let us show that
$$A = \Big\{(c,x) \in \mathbb{R}^d_*\setminus\{a_1\} \times \mathbb{R}^d;\ \frac{f_{a_1}(a_1^\top x)}{g_{a_1}(a_1^\top x)} > \frac{f_c(c^\top x)}{g_c(c^\top x)},\ g(x)\frac{f_c(c^\top x)}{g_c(c^\top x)} > f(x)\Big\} = \emptyset$$
through a reductio ad absurdum, i.e. assuming $A \ne \emptyset$. Our hypothesis then enables us to derive $f(x) = f(\cdot|a_1^\top x)\,f_{a_1}(a_1^\top x) = g(\cdot|a_1^\top x)\,f_{a_1}(a_1^\top x) > g(\cdot|c^\top x)\,f_c(c^\top x) > f(x)$, since $\frac{f_{a_1}(a_1^\top x)}{g_{a_1}(a_1^\top x)} \ge \frac{f_c(c^\top x)}{g_c(c^\top x)}$ implies $g(\cdot|a_1^\top x)\,f_{a_1}(a_1^\top x) = g(x)\frac{f_{a_1}(a_1^\top x)}{g_{a_1}(a_1^\top x)} \ge g(x)\frac{f_c(c^\top x)}{g_c(c^\top x)} = g(\cdot|c^\top x)\,f_c(c^\top x)$ — i.e. $f > f$, a contradiction. We can therefore conclude.

Preliminary (B): let us similarly show that
$$B = \Big\{(c,x) \in \mathbb{R}^d_*\setminus\{a_1\} \times \mathbb{R}^d;\ \frac{f_{a_1}(a_1^\top x)}{g_{a_1}(a_1^\top x)} < \frac{f_c(c^\top x)}{g_c(c^\top x)},\ g(x)\frac{f_c(c^\top x)}{g_c(c^\top x)} < f(x)\Big\} = \emptyset$$
through a reductio ad absurdum, i.e. assuming $B \ne \emptyset$. Our hypothesis then enables us to derive $f(x) = f(\cdot|a_1^\top x)\,f_{a_1}(a_1^\top x) = g(\cdot|a_1^\top x)\,f_{a_1}(a_1^\top x) < g(\cdot|c^\top x)\,f_c(c^\top x) < f(x)$. We can therefore conclude as above.

Let us now verify (H4). We have
$$P M(c, a_1) - P M(c, a) = \int \ln\Big(\frac{g(x)\,f_c(c^\top x)}{g_c(c^\top x)\,f(x)}\Big)\Big\{\frac{f_{a_1}(a_1^\top x)}{g_{a_1}(a_1^\top x)} - \frac{f_c(c^\top x)}{g_c(c^\top x)}\Big\}\,g(x)\,dx.$$
Moreover, the logarithm is negative on $\{x;\ \frac{g(x) f_c(c^\top x)}{g_c(c^\top x) f(x)} < 1\}$ and positive on $\{x;\ \frac{g(x) f_c(c^\top x)}{g_c(c^\top x) f(x)} \ge 1\}$. Thus, the preliminary studies (A) and (B) show that $\ln(\frac{g(x) f_c(c^\top x)}{g_c(c^\top x) f(x)})$ and $\{\frac{f_{a_1}(a_1^\top x)}{g_{a_1}(a_1^\top x)} - \frac{f_c(c^\top x)}{g_c(c^\top x)}\}$ always present a negative product. We can therefore conclude, since $(c,a) \mapsto P M(c,a_1) - P M(c,a)$ is not null for all $c$ and for all $a$, with $a \ne a_1$.

F. Proofs

This last section includes the proofs of most of the lemmas, propositions, theorems and corollaries contained in the present article.

Remark F.1. 1/ (H0) — according to which $f$ and $g$ are assumed to be positive and bounded — through lemma A.3 (see page 19) implies that $\check g^{(k)}$ and $\hat g^{(k)}$ are positive and bounded. 2/ Remark 2.1 page 5 implies that $f_n$, $g_n$, $\check g^{(k)}$ and $\hat g^{(k)}$ are positive and bounded, since we consider a Gaussian kernel.

Proof of propositions D.1 and D.2. Let us first study proposition D.2. Without loss of generality, we will prove this proposition with $x_1$ in lieu of $a^\top x$. Let us define $g^* = gr$. We remark that $g$ and $g^*$ present the same density conditionally on $x_1$. Indeed, $g^*_1(x_1) = \int g^*(x)\,dx_2...dx_d = \int r(x_1)\,g(x)\,dx_2...dx_d = r(x_1)\int g(x)\,dx_2...dx_d = r(x_1)\,g_1(x_1)$. Thus, we can demonstrate this proposition. We have $g(\cdot|x_1) = \frac{g(x_1,...,x_d)}{g_1(x_1)}$, and $g_1(x_1)\,r(x_1)$ is the marginal density of $g^*$. Hence, with $r = f_1/g_1$,
$$\int g^*\,dx = \int g_1(x_1)\,r(x_1)\,g(\cdot|x_1)\,dx = \int g_1(x_1)\frac{f_1(x_1)}{g_1(x_1)}\Big(\int g(\cdot|x_1)\,dx_2..dx_d\Big)dx_1 = \int f_1(x_1)\,dx_1 = 1,$$
and since $g^*$ is positive, $g^*$ is a density. Moreover,
$$K(f, g^*) = \int f\,\{\ln(f) - \ln(g^*)\}\,dx \qquad (F.1)$$
$$= \int f\,\{\ln(f(\cdot|x_1)) - \ln(g^*(\cdot|x_1)) + \ln(f_1(x_1)) - \ln(g_1(x_1)\,r(x_1))\}\,dx$$
$$= \int f\,\{\ln(f(\cdot|x_1)) - \ln(g(\cdot|x_1)) + \ln(f_1(x_1)) - \ln(g_1(x_1)\,r(x_1))\}\,dx, \qquad (F.2)$$
as $g^*(\cdot|x_1) = g(\cdot|x_1)$. Since the minimum of this last equation (F.2) is reached through the minimization of $\int f\,\{\ln(f_1(x_1)) - \ln(g_1(x_1)\,r(x_1))\}\,dx = K(f_1, g_1 r)$, property A.1 necessarily implies that $f_1 = g_1 r$, hence $r = f_1/g_1$. Finally, we have $K(f,g) - K(f,g^*) = \int f\,\{\ln(f_1(x_1)) - \ln(g_1(x_1))\}\,dx = K(f_1, g_1)$, which completes the demonstration of proposition D.2. Similarly, working with $f^* = f r^{-1}$ in lieu of $f$ and with $g$ in lieu of $g^*$, we obtain proposition D.1. ✷

Proof of proposition D.3. The demonstration is very similar to the one for proposition D.2, save for the fact that we now base our reasoning at row (F.1) on $\int g\,\{\ln(g^*) - \ln(f)\}\,dx$ instead of $K(f, g^*) = \int f\,\{\ln(f) - \ln(g^*)\}\,dx$. ✷

Proof of proposition 3.1. Without loss of generality, we reason with $x_1$ in lieu of $a^\top x$. Let us define $g^* = gh$. We remark that $g$ and $g^*$ present the same density conditionally on $x_1$. Indeed, $g^*_1(x_1) = \int g^*(x)\,dx_2...dx_d = \int h(x_1)\,g(x)\,dx_2...dx_d = h(x_1)\int g(x)\,dx_2...dx_d = h(x_1)\,g_1(x_1)$. We can therefore prove this proposition. First, since $f$ and $g$ are known, then, for any given function $h: x_1 \mapsto h(x_1)$, the application $T$, defined by
$$T:\ g(\cdot|x_1)\,g_1(x_1)\,h(x_1) \mapsto g(\cdot|x_1)\,f_1(x_1), \qquad T:\ f(\cdot|x_1)\,f_1(x_1) \mapsto f(\cdot|x_1)\,f_1(x_1),$$
is measurable. Second, the above remark implies that $\Phi(g^*, f) = \Phi(g^*(\cdot|x_1)\,g_1(x_1)\,h(x_1),\ f(\cdot|x_1)\,f_1(x_1)) = \Phi(g(\cdot|x_1)\,g_1(x_1)\,h(x_1),\ f(\cdot|x_1)\,f_1(x_1))$. Consequently, property A.3 page 18 infers:
$$\Phi(g(\cdot|x_1)\,g_1(x_1)\,h(x_1),\ f(\cdot|x_1)\,f_1(x_1)) \ge \Phi\big(T^{-1}(g(\cdot|x_1)\,g_1(x_1)\,h(x_1)),\ T^{-1}(f(\cdot|x_1)\,f_1(x_1))\big) = \Phi(g(\cdot|x_1)\,f_1(x_1),\ f(\cdot|x_1)\,f_1(x_1)),$$
by the very definition of $T$, and this last quantity is $\Phi(g\frac{f_1}{g_1}, f)$, which completes the proof of this proposition. ✷

Proof of lemma F.1.

Lemma F.1. We have $g(\cdot|a_1^\top x, ..., a_j^\top x) = n(a_{j+1}^\top x, ..., a_d^\top x) = f(\cdot|a_1^\top x, ..., a_j^\top x)$.

Putting $A = (a_1, .., a_d)$, let us determine $f$ in basis $A$. Let us first study the function defined by $\psi: \mathbb{R}^d \to \mathbb{R}^d$, $x \mapsto (a_1^\top x, .., a_d^\top x)$. We can immediately say that $\psi$ is continuous, and since $A$ is a basis, its bijectivity is obvious. Moreover, let us study its Jacobian. By definition, its matrix is
$$J_\psi(x_1, \dots, x_d) = \begin{pmatrix} \frac{\partial\psi_1}{\partial x_1} & \cdots & \frac{\partial\psi_1}{\partial x_d} \\ \vdots & & \vdots \\ \frac{\partial\psi_d}{\partial x_1} & \cdots & \frac{\partial\psi_d}{\partial x_d} \end{pmatrix} = \begin{pmatrix} a_{1,1} & \cdots & a_{1,d} \\ \vdots & & \vdots \\ a_{d,1} & \cdots & a_{d,d} \end{pmatrix},$$
whose determinant is $|A| \ne 0$, since $A$ is a basis. We can therefore infer: $\forall x \in \mathbb{R}^d$, $\exists!\, y \in \mathbb{R}^d$ such that $f(x) = |A|^{-1}\,\Psi(y)$, i.e. $\Psi$ (resp. $y$) is the expression of $f$ (resp. of $x$) in basis $A$, namely $\Psi(y) = \tilde n(y_{j+1}, ..., y_d)\,\tilde h(y_1, ..., y_j)$, with $\tilde n$ and $\tilde h$ being the expressions of $n$ and $h$ in basis $A$. Consequently, our results in the case where the family $\{a_j\}_{1\le j\le d}$ is the canonical basis of $\mathbb{R}^d$ still hold for $\Psi$ in basis $A$ — see section 2.1.2.
And then, if $\tilde g$ is the expression of $g$ in basis $A$, we have $\tilde g(\cdot|y_1, ..., y_j) = \tilde n(y_{j+1}, ..., y_d) = \Psi(\cdot|y_1, ..., y_j)$, i.e. $g(\cdot|a_1^\top x, ..., a_j^\top x) = n(a_{j+1}^\top x, ..., a_d^\top x) = f(\cdot|a_1^\top x, ..., a_j^\top x)$. ✷

Proof of lemma F.2.

Lemma F.2. Should there exist a family $(a_i)_{i=1...d}$ such that $f(x) = n(a_{j+1}^\top x, ..., a_d^\top x)\,h(a_1^\top x, ..., a_j^\top x)$, with $j < d$ and with $f$, $n$ and $h$ being densities, then this family is an orthogonal basis of $\mathbb{R}^d$.

Using a reductio ad absurdum, we would have $\int f(x)\,dx = 1 \ne +\infty = \int n(a_{j+1}^\top x, ..., a_d^\top x)\,h(a_1^\top x, ..., a_j^\top x)\,dx$. We can therefore conclude. ✷

Proof of proposition B.1. Let us note first that we will prove this proposition for $k \ge 2$, i.e. in the case where $g^{(k-1)}$ is not known. The initial case, using the known density $g^{(0)} = g$, will be an immediate consequence of the above. Moreover, going forward, to be more legible, we will use $g$ (resp. $g_n$) in lieu of $g^{(k-1)}$ (resp. $g^{(k-1)}_n$). We can therefore remark that we have $f(X_i) \ge \theta_n - y_n$, $g(Y_i) \ge \theta_n - y_n$ and $g_b(b^\top Y_i) \ge \theta_n - y_n$, for all $i$ and for all $b \in \mathbb{R}^d_*$, thanks to the uniform convergence of the kernel estimators. Indeed, we have $f(X_i) = f(X_i) - f_n(X_i) + f_n(X_i) \ge -y_n + f_n(X_i)$, by definition of $y_n$, and then $f(X_i) \ge -y_n + \theta_n$, by hypothesis on $f_n(X_i)$. This is also true for $g_n$ and $g_{b,n}$. This entails
$$\sup_{a\in\mathbb{R}^d_*}\Big|\frac1n\sum_{i=1}^n \varphi'\Big(\frac{f_{a,n}(a^\top Y_i)}{g_{a,n}(a^\top Y_i)}\frac{g_n(Y_i)}{f_n(Y_i)}\Big)\frac{f_{a,n}(a^\top Y_i)}{g_{a,n}(a^\top Y_i)} - \int \varphi'\Big(\frac{g(x)}{f(x)}\frac{f_a(a^\top x)}{g_a(a^\top x)}\Big)\,g(x)\frac{f_a(a^\top x)}{g_a(a^\top x)}\,dx\Big| \to 0 \quad a.s.$$
Indeed, the triangle inequality bounds the quantity inside the supremum by
$$\Big|\frac1n\sum_{i=1}^n \varphi'\Big(\frac{f_{a,n}(a^\top Y_i)}{g_{a,n}(a^\top Y_i)}\frac{g_n(Y_i)}{f_n(Y_i)}\Big)\frac{f_{a,n}(a^\top Y_i)}{g_{a,n}(a^\top Y_i)} - \frac1n\sum_{i=1}^n \varphi'\Big(\frac{f_a(a^\top Y_i)}{g_a(a^\top Y_i)}\frac{g(Y_i)}{f(Y_i)}\Big)\frac{f_a(a^\top Y_i)}{g_a(a^\top Y_i)}\Big|$$
$$+ \Big|\frac1n\sum_{i=1}^n \varphi'\Big(\frac{f_a(a^\top Y_i)}{g_a(a^\top Y_i)}\frac{g(Y_i)}{f(Y_i)}\Big)\frac{f_a(a^\top Y_i)}{g_a(a^\top Y_i)} - \int \varphi'\Big(\frac{g(x)}{f(x)}\frac{f_a(a^\top x)}{g_a(a^\top x)}\Big)\,g(x)\frac{f_a(a^\top x)}{g_a(a^\top x)}\,dx\Big|.$$
Moreover, since $\int |\varphi'(\frac{g(x)}{f(x)}\frac{f_a(a^\top x)}{g_a(a^\top x)})\,g(x)\frac{f_a(a^\top x)}{g_a(a^\top x)}|\,dx < \infty$, as implied by lemma A.3, and since we assumed $g$ such that $\Phi(g,f) < \infty$ and $\Phi(f,g) < \infty$, and since $a \in \Theta_\Phi$, the law of large numbers enables us to state that the second term above tends to $0$ a.s.
Furthermore, the first term is bounded by
$$\frac1n\sum_{i=1}^n \Big|\varphi'\Big(\frac{f_{a,n}(a^\top Y_i)}{g_{a,n}(a^\top Y_i)}\frac{g_n(Y_i)}{f_n(Y_i)}\Big)\frac{f_{a,n}(a^\top Y_i)}{g_{a,n}(a^\top Y_i)} - \varphi'\Big(\frac{f_a(a^\top Y_i)}{g_a(a^\top Y_i)}\frac{g(Y_i)}{f(Y_i)}\Big)\frac{f_a(a^\top Y_i)}{g_a(a^\top Y_i)}\Big|,$$
and each summand tends to $0$ as a result of the hypotheses initially introduced on $\theta_n$. Consequently, this bound also tends to $0$, as it is a Cesàro mean. This enables us to conclude. Similarly, we obtain
$$\sup_{a\in\mathbb{R}^d_*}\Big|\frac1n\sum_{i=1}^n \varphi^*\Big\{\varphi'\Big\{\frac{f_{a,n}(a^\top X_i)}{g_{a,n}(a^\top X_i)}\frac{g_n(X_i)}{f_n(X_i)}\Big\}\Big\} - \int \varphi^*\Big(\varphi'\Big(\frac{g(x)}{f(x)}\frac{f_a(a^\top x)}{g_a(a^\top x)}\Big)\Big) f(x)\,dx\Big| \to 0 \quad a.s. \qquad ✷$$

Proof of lemma F.3. By definition of the closure of a set, we have

Lemma F.3. The set $\Gamma_c$ is closed in $L_1$ for the topology of the uniform convergence.

Proof of lemma F.4. Since $\Phi$ is greater than the $L_1$ distance, we have

Lemma F.4. For all $c > 0$, we have $\Gamma_c \subset B_{L_1}(f,c)$, where $B_{L_1}(f,c) = \{p \in L_1;\ \|f - p\|_1 \le c\}$.

Proof of lemma F.5. The definition of the closure of a set and lemma A.4 (see page 19) imply

Lemma F.5. $G$ is closed in $L_1$ for the topology of the uniform convergence.

Proof of lemma F.6.

Lemma F.6. $\inf_{a\in\mathbb{R}^d_*} \Phi(g^*, f)$ is reached when the $\Phi$-divergence is greater than the $L_1$ distance as well as the $L_2$ distance.

Proof. Indeed, let $G$ be $\{g\frac{f_a}{g_a};\ a \in \mathbb{R}^d_*\}$ and let $\Gamma_c$ be $\Gamma_c = \{p;\ K(p,f) \le c\}$, for all $c > 0$. From lemmas F.3, F.4 and F.5, we get that $\Gamma_c \cap G$ is a compact for the topology of the uniform convergence, if $\Gamma_c \cap G$ is not empty. Hence, and since property A.2 (see page 18) implies that $Q \mapsto \Phi(Q, P)$ is lower semi-continuous in $L_1$ for the topology of the uniform convergence, the infimum is reached in $L_1$. (Taking for example $c = \Phi(g, f)$, the set $\Gamma_c \cap G$ is necessarily not empty, because we always have $\Phi(g\frac{f_a}{g_a}, f) \le \Phi(g, f)$.) Moreover, when the $\Phi$-divergence is greater than the $L_2$ distance, the very definition of the $L_2$ space enables us to provide the same proof as for the $L_1$ distance. ✷

Proof of lemma F.7.

Lemma F.7. For any $p \le d$, we have $f^{(p-1)}_{a_p} = f_{a_p}$ — see Huber's analytic method —, $g^{(p-1)}_{a_p} = g_{a_p}$ — see Huber's synthetic method — and $g^{(p-1)}_{a_p} = g_{a_p}$ — see our algorithm.

Proof. As it is equivalent to prove this for our algorithm or for Huber's, we will only develop here the proof for our algorithm. Assuming, without any loss of generality, that the $a_i$, $i = 1, .., p$, are the vectors of the canonical basis, since
$$g^{(p-1)}(x) = g(x)\,\frac{f_1(x_1)}{g_1(x_1)}\frac{f_2(x_2)}{g_2(x_2)}\cdots\frac{f_{p-1}(x_{p-1})}{g_{p-1}(x_{p-1})},$$
we derive immediately that $g^{(p-1)}_p = g_p$. We note that it is sufficient to operate a change of basis on the $a_i$ to obtain the general case. ✷
Proof of lemma F.8.

Lemma F.8. If there exists $p$, $p \le d$, such that $\Phi(g^{(p)}, f) = 0$, then the family $(a_i)_{i=1,..,p}$ — derived from the construction of $g^{(p)}$ — is free and orthogonal.

Proof. Without any loss of generality, let us assume that $p = 2$ and that the $a_i$ are vectors of the canonical basis. Using a reductio ad absurdum with the hypotheses $a_1 = (1, 0, ..., 0)$ and $a_2 = (\alpha, 0, ..., 0)$, where $\alpha \in \mathbb{R}$, we get $g^{(1)}(x) = g(x_2, .., x_d|x_1)\,f_1(x_1)$ and
$$f = g^{(2)}(x) = g(x_2, .., x_d|x_1)\,f_1(x_1)\,\frac{f_{\alpha a_1}(\alpha x_1)}{[g^{(1)}]_{\alpha a_1}(\alpha x_1)}.$$
Hence $f(x_2, .., x_d|x_1) = g(x_2, .., x_d|x_1)\,\frac{f_{\alpha a_1}(\alpha x_1)}{[g^{(1)}]_{\alpha a_1}(\alpha x_1)}$. It consequently implies that $f_{\alpha a_1}(\alpha x_1) = [g^{(1)}]_{\alpha a_1}(\alpha x_1)$, since $1 = \int f(x_2, .., x_d|x_1)\,dx_2...dx_d = (\int g(x_2, .., x_d|x_1)\,dx_2...dx_d)\,\frac{f_{\alpha a_1}(\alpha x_1)}{[g^{(1)}]_{\alpha a_1}(\alpha x_1)} = \frac{f_{\alpha a_1}(\alpha x_1)}{[g^{(1)}]_{\alpha a_1}(\alpha x_1)}$. Therefore $g^{(2)} = g^{(1)}$, i.e. $p = 1$, which leads to a contradiction. Hence, the family is free. Moreover, using a reductio ad absurdum, we get the orthogonality: indeed, we would have $\int f(x)\,dx = 1 \ne +\infty = \int n(a_{j+1}^\top x, ..., a_d^\top x)\,h(a_1^\top x, ..., a_j^\top x)\,dx$. The use of the same argument as in the proof of lemma F.2 enables us to infer the orthogonality of $(a_i)_{i=1,..,p}$. ✷

Proof of lemma F.9.

Lemma F.9. If there exists $p$, $p \le d$, such that $\Phi(g^{(p)}, f) = 0$, where $g^{(p)}$ is built from the free and orthogonal family $a_1, ..., a_j$, then there exists a free and orthogonal family $(b_k)_{k=j+1,...,d}$ of vectors of $\mathbb{R}^d_*$ such that $g^{(p)}(x) = g(b_{j+1}^\top x, ..., b_d^\top x\,|\,a_1^\top x, ..., a_j^\top x)\,f_{a_1}(a_1^\top x)\cdots f_{a_j}(a_j^\top x)$ and such that $\mathbb{R}^d = \mathrm{Vect}\{a_i\} \overset{\perp}{\oplus} \mathrm{Vect}\{b_k\}$.

Proof. Through the incomplete basis theorem, and similarly as in lemma F.8, we obtain the result thanks to Fubini's theorem. ✷

Proof of lemma F.10.

Lemma F.10. For any continuous density $f$, we have $y_m = |f_m(x) - f(x)| = O_P(m^{-\frac{2}{4+d}})$.

Defining $b_m(x)$ as $b_m(x) = |E(f_m(x)) - f(x)|$, we have $y_m \le |f_m(x) - E(f_m(x))| + b_m(x)$. Moreover, from page 150 of Scott (1992), we derive that $b_m(x) = O_P(\sum_{j=1}^d h_j^2)$, where $h_j = O_P(m^{-\frac{1}{4+d}})$. Then we obtain $b_m(x) = O_P(m^{-\frac{2}{4+d}})$. Finally, since the central limit theorem rate is $O_P(m^{-\frac12})$, we infer that $y_m \le O_P(m^{-\frac12}) + O_P(m^{-\frac{2}{4+d}}) = O_P(m^{-\frac{2}{4+d}})$. ✷

Proof of proposition 3.3. Proposition 3.3 comes immediately from proposition B.1 and lemma C.1. ✷
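The rate of lemma F.10 is easy to probe numerically. The sketch below — ours, with arbitrary sample sizes and evaluation point — estimates a one-dimensional Gaussian density at $x_0 = 0$ with the bandwidth $h_m \propto m^{-1/(4+d)}$ used in the lemma, and prints the error next to the theoretical rate $m^{-2/(4+d)}$.

```python
# Hedged numerical probe (ours) of lemma F.10: pointwise kernel error at
# x0 = 0 for d = 1, with bandwidth h = m^(-1/(4+d)), against m^(-2/(4+d)).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
d, x0 = 1, 0.0
for m in (500, 5000, 50000):
    h = m ** (-1.0 / (4 + d))                       # bandwidth of the lemma
    sample = rng.standard_normal(m)
    f_m = np.mean(norm.pdf((x0 - sample) / h)) / h  # Gaussian kernel estimate
    print(m, abs(f_m - norm.pdf(x0)), m ** (-2.0 / (4 + d)))
```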
Proof of proposition 3.4. Let us first show by induction the following assertion:
$$P(k) = \{\bar g^{(k)} \text{ allows a deconvolution } \bar g^{(k)} = g^{(k)} * \phi\}.$$
Initialisation: for $k = 0$, we get the result, since $g = g^{(0)}$ is elliptic. Going from $k$ to $k+1$: let us assume $P(k)$ is true; we then show $P(k+1)$. Since the family of the $a_i$, $i \le k+1$, is free — see lemma F.8 —, we define $B$ as a basis of $\mathbb{R}^d$ such that its $k+1$ first vectors are the $a_i$, $i \le k+1$ — see the incomplete basis theorem for its existence. Thus, in $B$ and using the same procedure as in the proof of lemma F.1, we have $\bar g^{(k)}(x) = \bar g^{(k)}(\cdot|x_{k+1})\,\bar g^{(k)}_{k+1}(x_{k+1})$. Consequently, the very definition of the convolution product, Fubini's theorem and the hypothesis made on the Elliptical family imply that $\bar g^{(k)}(x) = \bar g^{(k)}(\cdot|x_{k+1})\,\bar g^{(k)}_{k+1}(x_{k+1})$, with $\bar g^{(k)}(\cdot|x_{k+1}) = g^{(k)}(\cdot|x_{k+1}) * E_{d-1}(0, \sigma^2 I_{d-1}, \xi_{d-1})$ and with $\bar g^{(k)}_{k+1}(x_{k+1}) = g^{(k)}_{k+1}(x_{k+1}) * E_1(0, \sigma^2, \xi_1)$. Finally, replacing $\bar g^{(k)}_{k+1}$ with $\bar f_{k+1} = f_{k+1} * E_1(0, \sigma^2, \xi_1)$, we conclude this induction with $\bar g^{(k+1)} = \bar g^{(k)}(\cdot|x_{k+1})\,\bar f_{k+1}(x_{k+1})$.

Now, let us consider $\bar\psi$ (resp. $\psi$, $\bar\psi^{(k)}$, $\psi^{(k)}$) the characteristic function of $\bar f$ (resp. $f$, $\bar g^{(k)}$, $g^{(k)}$). We then have $\bar\psi(s) = \psi(s)\,\Psi(\frac12\sigma^2|s|^2)$ and $\bar\psi^{(k)}(s) = \psi^{(k)}(s)\,\Psi(\frac12\sigma^2|s|^2)$. Hence, $\bar\psi$ and $\bar\psi^{(k)}$ are less than or equal to $\Psi(\frac12\sigma^2|s|^2)$, which is integrable by hypothesis, i.e. $\bar\psi$ and $\bar\psi^{(k)}$ are absolutely integrable. We then obtain $\bar g^{(k)}(x) = (2\pi)^{-d}\int \bar\psi^{(k)}(s)\,e^{-is^\top x}\,ds$ and $\bar f(x) = (2\pi)^{-d}\int \bar\psi(s)\,e^{-is^\top x}\,ds$. Moreover, since the sequence $(\bar\psi^{(k)})$ uniformly converges and since $\bar\psi$ and $\bar\psi^{(k)}$ are less than or equal to $\Psi(\frac12\sigma^2|s|^2)$, the dominated convergence theorem implies that $\lim_k |\bar f(x) - \bar g^{(k)}(x)| \le (2\pi)^{-d}\int \lim_k |\bar\psi(s) - \bar\psi^{(k)}(s)|\,ds = 0$ a.s., i.e. $\lim_k \sup_x |\bar f(x) - \bar g^{(k)}(x)| = 0$ a.s. Finally, since, by hypothesis, $(2\pi)^{-d}\int |\bar\psi(s) - \bar\psi^{(k)}(s)|\,ds \le 2(2\pi)^{-d}\int \Psi(\frac12\sigma^2|s|^2)\,ds < \infty$, the above limit and the dominated convergence theorem imply that $\lim_k \int |\bar f(x) - \bar g^{(k)}(x)|\,dx = 0$. ✷
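In the Gaussian case of the elliptical family, where the generator is $\Psi(u) = e^{-u}$, the identity $\bar\psi(s) = \psi(s)\,\Psi(\frac12\sigma^2|s|^2)$ used in this proof reads $\bar\psi(s) = \psi(s)\,e^{-\sigma^2 s^2/2}$ in dimension one. The following sketch — ours: the choice of $f$, of $\sigma$ and of the grid are arbitrary — checks it by Monte Carlo.

```python
# Hedged check (ours) of the characteristic-function identity of the proof
# in the Gaussian case Psi(u) = exp(-u), d = 1: convolving f with N(0, s^2)
# multiplies its characteristic function by exp(-sigma^2 s^2 / 2).
import numpy as np

rng = np.random.default_rng(4)
sigma, s = 0.7, np.linspace(-3.0, 3.0, 7)
Z = rng.exponential(size=200_000) - 1.0            # sample from some density f
Zbar = Z + sigma * rng.standard_normal(Z.size)     # sample from f * N(0, sigma^2)

psi     = np.array([np.exp(1j * si * Z).mean()    for si in s])
psi_bar = np.array([np.exp(1j * si * Zbar).mean() for si in s])
print(np.abs(psi_bar - psi * np.exp(-0.5 * sigma**2 * s**2)).max())  # ~ 1e-3
```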
Proof of corollary 3.1. Through the dominated convergence theorem and through theorem 3.4, we get the result using a reductio ad absurdum. ✷

Proof of lemma F.11.

Lemma F.11. Let us consider the sequence $(a_i)$ defined in (2.3). We then have $\lim_n \lim_k K(\check g^{(k)}_n \frac{f_{a_k,n}}{[\check g^{(k)}]_{a_k,n}}, f_n) = 0$ a.s.

Proof. Through the relationship (2.3) and through remark D.1, as well as through the additive relation of proposition D.1, we can say that $0 \le .. \le K(g^{(\infty)}, f) \le .. \le K(g^{(k)}, f) \le .. \le K(g, f)$, where $g^{(\infty)} = \lim_k g^{(k)}$, which is a density by construction. And, through proposition C.2, we obtain that $K(g^{(\infty)}, f) = 0$, i.e.
$$0 = K(g^{(\infty)}, f) \le \dots \le K(g^{(k)}, f) \le \dots \le K(g, f). \qquad (*)$$
Moreover, let $(g^{(k)}_n)_k$ be the sequence of densities such that $g^{(k)}_n$ is the kernel estimate of $g^{(k)}$. Since we derive from remark F.1 an integrable upper bound of $g^{(k)}_n$, for all $k$, which is greater than $f$ — see also the definition of $\varphi$ in the proof of theorem 3.4 —, the dominated convergence theorem implies that, for any $k$, $\lim_n K(g^{(k)}_n, f_n) = K(g^{(k)}, f)$, i.e., from a certain given rank $n_0$, we have
$$0 \le .. \le K(g^{(\infty)}_n, f_n) \le .. \le K(g^{(k)}_n, f_n) \le .. \le K(g_n, f_n). \qquad (**)$$
Consequently, through lemma F.12, there exists a $k$ such that
$$0 \le .. \le K(\Psi^{(\infty)}_{n,k}, f_n) \le .. \le K(g^{(\infty)}_n, f_n) \le .. \le K(\Psi^{(\infty)}_{n,k-1}, f_n) \le .. \le K(g_n, f_n), \qquad (***)$$
where $\Psi^{(\infty)}_{n,k}$ is a density such that $\Psi^{(\infty)}_{n,k} = \lim_k g^{(k)}_n$. Finally, through the dominated convergence theorem and taking the limit in $n$ in $(***)$, we get $0 = K(g^{(\infty)}, f) = \lim_n K(g^{(\infty)}_n, f_n) \ge \lim_n K(\Psi^{(\infty)}_{n,k}, f_n) \ge 0$. The dominated convergence theorem then enables us to conclude: $0 = \lim_n K(\Psi^{(\infty)}_{n,k}, f_n) = \lim_n \lim_k K(g^{(k)}_n, f_n)$. ✷

Proof of lemma F.12.

Lemma F.12. With the notation of the proof of lemma F.11, we have $(***)$: $0 \le .. \le K(\Psi^{(\infty)}_{n,k}, f_n) \le .. \le K(g^{(\infty)}_n, f_n) \le .. \le K(\Psi^{(\infty)}_{n,k-1}, f_n) \le .. \le K(g_n, f_n)$.

Proof. First, as explained in section D, we have $K(f^{(k)}, g) - K(f^{(k+1)}, g) = K(f^{(k)}_{a_{k+1}}, g_{a_{k+1}})$. Moreover, through remark D.1, we also derive that $K(f^{(k)}, g) = K(g^{(k)}, f)$. Then $K(f^{(k)}_{a_{k+1}}, g_{a_{k+1}})$ is the decreasing step of the relative entropies in $(*)$ leading to $0 = K(g^{(\infty)}, f)$. Similarly, the very construction of $(**)$ implies that $K(f^{(k)}_{a_{k+1},n}, g_{a_{k+1},n})$ is the decreasing step of the relative entropies in $(**)$ leading to $K(g^{(\infty)}_n, f_n)$. Second, through the conclusion of section D and lemma 14.2 of Huber's article, we obtain that $K(f^{(k)}_{a_{k+1},n}, g_{a_{k+1},n})$ converges — decreasingly in $k$ — towards a positive function of $n$, which we will call $\xi_n$. Third, the convergence of $(g^{(k)})_k$ — see proposition C.2 — implies that, for any given $n$, the sequence $(K(g^{(k)}_n, f_n))_k$ has infinitely many terms. Then, through relationship $(**)$, there exists a $k$ such that $0 < K(g^{(k-1)}_n, f_n) - K(g^{(\infty)}_n, f_n) < \xi_n$. Thus, since $Q \mapsto K(Q, P)$ is l.s.c. — see property A.2 —, relationship $(**)$ implies $(***)$. ✷

Proof of theorem 3.1. First, by the very definition of the kernel estimator, $\check g^{(0)}_n = g_n$ converges towards $g$. Moreover, the continuity of $a \mapsto f_{a,n}$ and $a \mapsto g_{a,n}$ and proposition 3.3 imply that $\check g^{(1)}_n = \check g^{(0)}_n \frac{f_{\check a_1,n}}{\check g^{(0)}_{\check a_1,n}}$ converges towards $g^{(1)}$. Finally, since, for any $k$, $\check g^{(k)}_n = \check g^{(k-1)}_n \frac{f_{\check a_k,n}}{\check g^{(k-1)}_{\check a_k,n}}$, we conclude by an immediate induction. ✷
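The recursion $\check g^{(k)}_n = \check g^{(k-1)}_n\,f_{\check a_k,n}/\check g^{(k-1)}_{\check a_k,n}$ just used can be carried out numerically by tracking the density ratio $\check g^{(k)}_n/g_n$ as importance weights on the instrumental sample, the projected marginal being a weighted one-dimensional kernel estimate. This is one possible implementation, not the article's code: the weighted-KDE device and all names are our assumptions.

```python
# Hedged sketch (ours) of one step of the recursion
# g^(k) = g^(k-1) f_{a_k} / g^(k-1)_{a_k}: the ratio g^(k)/g is kept as
# importance weights on the g-sample Y, and g^(k-1)_a is estimated by a
# weighted 1-D kernel density of the projections a'Y.
import numpy as np
from scipy.stats import gaussian_kde

def update_weights(X, Y, w, a):
    """Multiply the weights by f_a(a'Y) / g^(k-1)_a(a'Y) for direction a."""
    fa = gaussian_kde(X @ a)                      # estimate of f_a
    ga_prev = gaussian_kde(Y @ a, weights=w)      # weighted marginal of g^(k-1)
    w_new = w * fa(Y @ a) / ga_prev(Y @ a)
    return w_new / w_new.mean()                   # renormalise the weights

rng = np.random.default_rng(5)
X = rng.standard_normal((2000, 2)) * np.array([1.5, 1.0])  # f stretched along e_1
Y = rng.standard_normal((2000, 2))                          # instrumental g
w = update_weights(X, Y, np.ones(len(Y)), np.array([1.0, 0.0]))
print("weight dispersion after the first step:", w.std())   # > 0: g was updated
```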
Proof of theorem C.2, relationship (C.1). Let us consider $\Psi_j = \{\frac{f_{\check a_j}(\check a_j^\top x)}{[\check g^{(j-1)}]_{\check a_j}(\check a_j^\top x)} - \frac{f_{a_j}(a_j^\top x)}{[g^{(j-1)}]_{a_j}(a_j^\top x)}\}$. Since $f$ and $g$ are bounded, it is easy to prove that, from a certain rank, we get, for any given $x$ in $\mathbb{R}^d$,
$$|\Psi_j| \le \max\Big(\frac{1}{[\check g^{(j-1)}]_{\check a_j}(\check a_j^\top x)},\ \frac{1}{[g^{(j-1)}]_{a_j}(a_j^\top x)}\Big)\,\big|f_{\check a_j}(\check a_j^\top x) - f_{a_j}(a_j^\top x)\big|.$$

Remark F.2. First, based on what we stated earlier, for any given $x$ and from a certain rank, there is a constant $R > 0$, independent from $n$, such that $\max(\frac{1}{[\check g^{(j-1)}]_{\check a_j}(\check a_j^\top x)}, \frac{1}{[g^{(j-1)}]_{a_j}(a_j^\top x)}) \le R = R(x) = O(1)$. Second, since $\check a_k$ is an M-estimator of $a_k$, its convergence rate is $O_P(n^{-1/2})$.

Thus, using simple functions, we infer an upper and a lower bound for $f_{\check a_j}$ and for $f_{a_j}$. We therefore reach the following conclusion:
$$|\Psi_j| \le O_P(n^{-1/2}). \qquad (F.3)$$
We finally obtain
$$\Big|\prod_{j=1}^k \frac{f_{\check a_j}(\check a_j^\top x)}{[\check g^{(j-1)}]_{\check a_j}(\check a_j^\top x)} - \prod_{j=1}^k \frac{f_{a_j}(a_j^\top x)}{[g^{(j-1)}]_{a_j}(a_j^\top x)}\Big| = \prod_{j=1}^k \frac{f_{a_j}(a_j^\top x)}{[g^{(j-1)}]_{a_j}(a_j^\top x)}\,\Big|\prod_{j=1}^k \frac{f_{\check a_j}(\check a_j^\top x)}{[\check g^{(j-1)}]_{\check a_j}(\check a_j^\top x)}\frac{[g^{(j-1)}]_{a_j}(a_j^\top x)}{f_{a_j}(a_j^\top x)} - 1\Big|.$$
Based on relationship (F.3), each factor $\frac{f_{\check a_j}(\check a_j^\top x)}{[\check g^{(j-1)}]_{\check a_j}(\check a_j^\top x)}\frac{[g^{(j-1)}]_{a_j}(a_j^\top x)}{f_{a_j}(a_j^\top x)}$ tends towards $1$ at a rate of $O_P(n^{-1/2})$. Consequently, the product over $j = 1, ..., k$ of these factors also tends towards $1$ at a rate of $O_P(n^{-1/2})$. Thus, from a certain rank, we get $|\prod_{j=1}^k \frac{f_{\check a_j}}{[\check g^{(j-1)}]_{\check a_j}} - \prod_{j=1}^k \frac{f_{a_j}}{[g^{(j-1)}]_{a_j}}| = O_P(n^{-1/2})\,O_P(1) = O_P(n^{-1/2})$. In conclusion, we obtain $|\check g^{(k)}(x) - g^{(k)}(x)| = g(x)\,|\prod_{j=1}^k \frac{f_{\check a_j}}{[\check g^{(j-1)}]_{\check a_j}} - \prod_{j=1}^k \frac{f_{a_j}}{[g^{(j-1)}]_{a_j}}| \le O_P(n^{-1/2})$.

Relationship (C.2). Relationship (C.1) of theorem C.2 implies that $|\frac{\check g^{(k)}(x)}{g^{(k)}(x)} - 1| = O_P(n^{-1/2})$ because, for any given $x$, $g^{(k)}(x)\,|\frac{\check g^{(k)}(x)}{g^{(k)}(x)} - 1| = |\check g^{(k)}(x) - g^{(k)}(x)|$. Consequently, there exists a smooth function $C$ from $\mathbb{R}^d$ to $\mathbb{R}_+$ such that $\lim_{n\to\infty} n^{-1/2} C(x) = 0$ and $|\frac{\check g^{(k)}(x)}{g^{(k)}(x)} - 1| \le n^{-1/2} C(x)$, for any $x$. We then have $\int |\check g^{(k)}(x) - g^{(k)}(x)|\,dx = \int g^{(k)}(x)\,|\frac{\check g^{(k)}(x)}{g^{(k)}(x)} - 1|\,dx \le \int g^{(k)}(x)\,C(x)\,n^{-1/2}\,dx$. Moreover, $\sup_{x\in\mathbb{R}^d} |\check g^{(k)}(x) - g^{(k)}(x)| = \sup_{x\in\mathbb{R}^d} g^{(k)}(x)\,|\frac{\check g^{(k)}(x)}{g^{(k)}(x)} - 1| = \sup_{x\in\mathbb{R}^d} g^{(k)}(x)\,C(x)\,n^{-1/2} \to 0$ a.s., by theorem C.1. This implies that $\sup_{x\in\mathbb{R}^d} g^{(k)}(x)\,C(x) < \infty$ a.s., i.e. $\sup_{x\in\mathbb{R}^d} C(x) < \infty$ a.s., since $g^{(k)}$ has been assumed to be positive and bounded — see remark F.1. Thus, $\int g^{(k)}(x)\,C(x)\,dx \le \sup C \cdot \int g^{(k)}(x)\,dx = \sup C < \infty$, since $g^{(k)}$ is a density; therefore we can conclude: $\int |\check g^{(k)}(x) - g^{(k)}(x)|\,dx \le \sup C \cdot n^{-1/2} = O_P(n^{-1/2})$. ✷

Relationship (C.3). We have $K(\check g^{(k)}, f) - K(g^{(k)}, f) = \int f\,(\varphi(\frac{\check g^{(k)}}{f}) - \varphi(\frac{g^{(k)}}{f}))\,dx \le \int f\,S\,|\frac{\check g^{(k)}}{f} - \frac{g^{(k)}}{f}|\,dx = S\int |\check g^{(k)} - g^{(k)}|\,dx$, the last inequality being derived from theorem A.1, where $\varphi: x \mapsto x\ln(x) - x + 1$ is a convex function and where $S > 0$. We get the same expression as the one found in the proof of relationship (C.2); we then obtain $K(\check g^{(k)}, f) - K(g^{(k)}, f) \le O_P(n^{-1/2})$. Similarly, we get $K(g^{(k)}, f) - K(\check g^{(k)}, f) \le O_P(n^{-1/2})$. We can therefore conclude. ✷

Proof of lemma F.13.

Lemma F.13. We keep the notations introduced in Appendix B. It holds $n = O(m^{\frac12})$.

Proof. Let $N$ be the random variable such that $N = \sum_{j=1}^m \mathbf 1\{f_m(X_j) \ge \theta_m,\ g(Y_j) \ge \theta_m\}$. Since the events $\{f_m(X_j) \ge \theta_m\}$ and $\{g(Y_j) \ge \theta_m\}$ are independent from one another, and since $\{g(Y_j) \ge \theta_m\} \subset \{g_m(Y_j) \ge -y_m + \theta_m\}$, we can say that $n = m\,P(f_m(X_j) \ge \theta_m,\ g(Y_j) \ge \theta_m) \le m\,P(f_m(X_j) \ge \theta_m)\,P(g_m(Y_j) \ge -y_m + \theta_m)$. Consequently, let us study $P(f_m(X_i) \ge \theta_m)$. Let $(\xi_i)_{i=1...m}$ be the sequence such that, for any $i$ and any $x$ in $\mathbb{R}^d$,
$$\xi_i(x) = \prod_{l=1}^d \frac{1}{(2\pi)^{1/2} h_l}\,e^{-\frac12\big(\frac{x_l - X_{il}}{h_l}\big)^2} - \int \prod_{l=1}^d \frac{1}{(2\pi)^{1/2} h_l}\,e^{-\frac12\big(\frac{x_l - X_{il}}{h_l}\big)^2}\,f(x)\,dx.$$
Hence, for any given $j$ and conditionally on $X_1, \dots, X_{j-1}, X_{j+1}, \dots, X_m$, the variables $(\xi_i(X_j))_{i\ne j}$, $i = 1...m$, are i.i.d. and centered, have the same second moment, and are such that $|\xi_i(X_j)| \le \prod_{l=1}^d \frac{1}{(2\pi)^{1/2} h_l} + \prod_{l=1}^d \frac{1}{(2\pi)^{1/2} h_l}\int |f(x)|\,dx = 2\,(2\pi)^{-d/2}\prod_{l=1}^d h_l^{-1}$, since $\sup_x e^{-\frac12 x^2} \le 1$.
Moreover, noting that $f_m(x) = \frac1m\sum_{i=1}^m \xi_i(x) + (2\pi)^{-d/2}\frac1m\sum_{i=1}^m \prod_{l=1}^d h_l^{-1}\int e^{-\frac12(\frac{x_l - X_{il}}{h_l})^2}\,f(x)\,dx$, we have
$$f_m(X_j) \ge \theta_m \iff \frac{1}{m-1}\sum_{i=1,\,i\ne j}^m \xi_i(X_j) \ge \Big(\theta_m - (2\pi)^{-d/2}\frac1m\sum_{i=1}^m \prod_{l=1}^d h_l^{-1}\int e^{-\frac12\big(\frac{x_l - X_{il}}{h_l}\big)^2}\,f(x)\,dx - \frac1m\,\xi_j(X_j)\Big)\frac{m}{m-1},$$
with $\xi_j(X_j) = 0$. Then, defining $t$ (resp. $\varepsilon$) as $t = 2\,(2\pi)^{-d/2}\prod_{l=1}^d h_l^{-1}$ (resp. $\varepsilon = (\theta_m - (2\pi)^{-d/2}\prod_{l=1}^d h_l^{-1}\,\frac1m\sum_{i=1}^m \prod_{l=1}^d \int e^{-\frac12(\frac{x_l - X_{il}}{h_l})^2}\,f(x)\,dx)\,\frac{m}{m-1}$), Bennett's inequality — Devroye (1985), page 160 — implies that
$$P\Big(\frac{1}{m-1}\sum_{i=1,\,i\ne j}^m \xi_i(X_j) \ge \varepsilon\ \Big|\ X_1, \dots, X_{j-1}, X_{j+1}, \dots, X_m\Big) \le 2\exp\Big(-\frac{(m-1)\,\varepsilon^2}{4t^2}\Big).$$
Finally, since the $X_i$ are i.i.d. and since $\int(\int \prod_{l=1}^d e^{-\frac12(\frac{x_l - y_l}{h_l})^2} f(x)\,dx)\,f(y)\,dy < 1$, the law of large numbers implies that $\frac1m\sum_{i=1}^m \int \prod_{l=1}^d e^{-\frac12(\frac{x_l - X_{il}}{h_l})^2} f(x)\,dx \to_m \int\!\!\int \prod_{l=1}^d e^{-\frac12(\frac{x_l - y_l}{h_l})^2} f(x)\,f(y)\,dx\,dy$ a.s. Consequently, since $0 < \nu < \frac{1}{4+d}$ — see remark B.1 — and since $e^{-x} \le x^{-\frac12}$ when $x > 0$, we obtain, after calculation, that, from a certain rank, $\exp(-\frac{(m-1)\varepsilon^2}{4t^2}) = O(m^{-\frac14})$, i.e., from a certain rank, $P(f_m(X_j) \ge \theta_m) = O(m^{-\frac14})$. Similarly, we infer $P(g(Y_j) \ge \theta_m) = O(m^{-\frac14})$. In conclusion, we can say that $n = m\,P(f_m(X_j) \ge \theta_m)\,P(g_m(Y_j) \ge \theta_m) = O(m^{\frac12})$. Similarly, we derive the same result at any step of our method. ✷

Proof of theorem 3.2. First, from lemma F.10, we derive that, for any $x$, $\sup_{a\in\mathbb{R}^d_*} |f_{a,n}(a^\top x) - f_a(a^\top x)| = O_P(n^{-\frac{2}{4+d}})$. Then, let us consider $\Psi_j = \frac{f_{\check a_j,n}(\check a_j^\top x)}{\check g^{(j-1)}_{\check a_j,n}(\check a_j^\top x)} - \frac{f_{a_j}(a_j^\top x)}{g^{(j-1)}_{a_j}(a_j^\top x)}$. We have
$$\Psi_j = \frac{1}{\check g^{(j-1)}_{\check a_j,n}(\check a_j^\top x)\,g^{(j-1)}_{a_j}(a_j^\top x)}\Big(\big(f_{\check a_j,n}(\check a_j^\top x) - f_{a_j}(a_j^\top x)\big)\,g^{(j-1)}_{a_j}(a_j^\top x) + f_{a_j}(a_j^\top x)\,\big(g^{(j-1)}_{a_j}(a_j^\top x) - \check g^{(j-1)}_{\check a_j,n}(\check a_j^\top x)\big)\Big),$$
i.e. $|\Psi_j| = O_P(n^{-\frac12\mathbf 1_{d=1} - \frac{2}{4+d}\mathbf 1_{d>1}})$, since $f_{a_j}(a_j^\top x) = O(1)$ and $g^{(j-1)}_{a_j}(a_j^\top x) = O(1)$. We can therefore conclude similarly as in theorem C.2. ✷

Proof of theorem 3.3. We get the theorem through theorem C.3 and proposition B.1. ✷

Proof of theorem C.3. First of all, let us remark that hypotheses (H1) to (H3) imply that $\check\gamma_n$ and $\check c_n(a_k)$ converge towards $a_k$ in probability.
Hypothesis (H4) enables us to derive, by differentiation under the integral sign and after calculation,
$$P\frac{\partial}{\partial b} M(a_k,a_k) = P\frac{\partial}{\partial a} M(a_k,a_k) = 0,$$
$$P\frac{\partial^2}{\partial a_i\partial b_j} M(a_k,a_k) = P\frac{\partial^2}{\partial b_j\partial a_i} M(a_k,a_k) = \int \varphi''\Big(\frac{g f_{a_k}}{f g_{a_k}}\Big)\frac{\partial}{\partial a_i}\Big(\frac{g f_{a_k}}{f g_{a_k}}\Big)\frac{\partial}{\partial b_j}\Big(\frac{g f_{a_k}}{f g_{a_k}}\Big)\,f\,dx,$$
$$P\frac{\partial^2}{\partial a_i\partial a_j} M(a_k,a_k) = \int \varphi'\Big(\frac{g f_{a_k}}{f g_{a_k}}\Big)\frac{\partial^2}{\partial a_i\partial a_j}\Big(\frac{g f_{a_k}}{f g_{a_k}}\Big)\,f\,dx,$$
$$P\frac{\partial^2}{\partial b_i\partial b_j} M(a_k,a_k) = -\int \varphi''\Big(\frac{g f_{a_k}}{f g_{a_k}}\Big)\frac{\partial}{\partial b_i}\Big(\frac{g f_{a_k}}{f g_{a_k}}\Big)\frac{\partial}{\partial b_j}\Big(\frac{g f_{a_k}}{f g_{a_k}}\Big)\,f\,dx,$$
and consequently $P\frac{\partial^2}{\partial b_i\partial b_j} M(a_k,a_k) = -P\frac{\partial^2}{\partial a_i\partial b_j} M(a_k,a_k) = -P\frac{\partial^2}{\partial b_j\partial a_i} M(a_k,a_k)$, which implies
$$\frac{\partial^2}{\partial a_i\partial a_j} K\Big(g\frac{f_{a_k}}{g_{a_k}}, f\Big) = P\frac{\partial^2}{\partial a_i\partial a_j} M(a_k,a_k) - P\frac{\partial^2}{\partial b_i\partial b_j} M(a_k,a_k) = P\frac{\partial^2}{\partial a_i\partial a_j} M(a_k,a_k) + P\frac{\partial^2}{\partial a_i\partial b_j} M(a_k,a_k) = P\frac{\partial^2}{\partial a_i\partial a_j} M(a_k,a_k) + P\frac{\partial^2}{\partial b_j\partial a_i} M(a_k,a_k).$$
The very definition of the estimators $\check\gamma_n$ and $\check c_n(a_k)$ implies that
$$\begin{cases} P_n \frac{\partial}{\partial b} M(b,a) = 0\\[2pt] P_n \frac{\partial}{\partial a} M(b(a),a) = 0 \end{cases}\quad\text{i.e.}\quad\begin{cases} P_n \frac{\partial}{\partial b} M(\check c_n(a_k), \check\gamma_n) = 0\\[2pt] P_n \frac{\partial}{\partial a} M(\check c_n(a_k), \check\gamma_n) + P_n \frac{\partial}{\partial b} M(\check c_n(a_k), \check\gamma_n)\,\frac{\partial}{\partial a}\check c_n(a_k) = 0, \end{cases}$$
i.e. $P_n \frac{\partial}{\partial b} M(\check c_n(a_k), \check\gamma_n) = 0$ (E0) and $P_n \frac{\partial}{\partial a} M(\check c_n(a_k), \check\gamma_n) = 0$ (E1). Under (H5) and (H6), and using a Taylor expansion of equation (E0) (resp. (E1)), we infer that there exists $(\bar c_n, \bar\gamma_n)$ (resp. $(\tilde c_n, \tilde\gamma_n)$) on the interval $[(\check c_n(a_k), \check\gamma_n), (a_k, a_k)]$ such that
$$-P_n \frac{\partial}{\partial b} M(a_k,a_k) = \Big[\Big(P\frac{\partial^2}{\partial b\partial b} M(a_k,a_k)\Big)^\top + o_P(1),\ \Big(P\frac{\partial^2}{\partial a\partial b} M(a_k,a_k)\Big)^\top + o_P(1)\Big]\,a_n$$
(resp. $-P_n \frac{\partial}{\partial a} M(a_k,a_k) = [(P\frac{\partial^2}{\partial b\partial a} M(a_k,a_k))^\top + o_P(1),\ (P\frac{\partial^2}{\partial a^2} M(a_k,a_k))^\top + o_P(1)]\,a_n$), with $a_n = ((\check c_n(a_k) - a_k)^\top, (\check\gamma_n - a_k)^\top)^\top$. Thus we get
$$\sqrt n\,a_n = \sqrt n\begin{pmatrix} P\frac{\partial^2}{\partial b^2} M(a_k,a_k) & P\frac{\partial^2}{\partial a\partial b} M(a_k,a_k)\\ P\frac{\partial^2}{\partial b\partial a} M(a_k,a_k) & P\frac{\partial^2}{\partial a^2} M(a_k,a_k) \end{pmatrix}^{-1}\begin{pmatrix} -P_n\frac{\partial}{\partial b} M(a_k,a_k)\\ -P_n\frac{\partial}{\partial a} M(a_k,a_k) \end{pmatrix} + o_P(1)$$
$$= \sqrt n\,\Big(P\frac{\partial^2}{\partial b\partial b} M(a_k,a_k)\,\frac{\partial^2}{\partial a\partial a} K\Big(g\frac{f_{a_k}}{g_{a_k}}, f\Big)\Big)^{-1}\begin{pmatrix} P\frac{\partial^2}{\partial b\partial b} M(a_k,a_k) + \frac{\partial^2}{\partial a\partial a} K(g\frac{f_{a_k}}{g_{a_k}}, f) & P\frac{\partial^2}{\partial b\partial b} M(a_k,a_k)\\ P\frac{\partial^2}{\partial b\partial b} M(a_k,a_k) & P\frac{\partial^2}{\partial b\partial b} M(a_k,a_k) \end{pmatrix}\begin{pmatrix} -P_n\frac{\partial}{\partial b} M(a_k,a_k)\\ -P_n\frac{\partial}{\partial a} M(a_k,a_k) \end{pmatrix} + o_P(1).$$
Moreover, the central limit theorem implies $P_n\frac{\partial}{\partial b} M(a_k,a_k) \xrightarrow{Law} \mathcal N_d(0, P\|\frac{\partial}{\partial b} M(a_k,a_k)\|^2)$ and $P_n\frac{\partial}{\partial a} M(a_k,a_k) \xrightarrow{Law} \mathcal N_d(0, P\|\frac{\partial}{\partial a} M(a_k,a_k)\|^2)$, since $P\frac{\partial}{\partial b} M(a_k,a_k) = P\frac{\partial}{\partial a} M(a_k,a_k) = 0$, which leads us to the result. ✷

Proof of proposition C.2. Let us consider $\psi$ (resp. $\psi^{(k)}$) the characteristic function of $f$ (resp. $g^{(k)}$). Let us also consider the sequence $(a_i)$ defined in (2.3). We have $|\psi(t) - \psi^{(k)}(t)| \le \int |f(x) - g^{(k)}(x)|\,dx \le K(g^{(k)}, f)$. As explained in section 14 of Huber's article, and through remark D.1 as well as the additive relation of proposition D.1, we can say that $\lim_k K(g^{(k-1)}\frac{f_{a_k}}{[g^{(k-1)}]_{a_k}}, f) = 0$. Consequently, we get $\lim_k g^{(k)} = f$. ✷
Proof of theorem 3.4. We recall that $g^{(k)}_n$ is the kernel estimator of $\check g^{(k)}$. Since the relative entropy is greater than the $L_1$ distance, we have $\lim_n \lim_k K(g^{(k)}_n, f_n) \ge \lim_n \lim_k \int |g^{(k)}_n(x) - f_n(x)|\,dx$. Moreover, Fatou's lemma implies that $\lim_k \int |g^{(k)}_n(x) - f_n(x)|\,dx \ge \int \lim_k |g^{(k)}_n(x) - f_n(x)|\,dx = \int |[\lim_k g^{(k)}_n(x)] - f_n(x)|\,dx$ and $\lim_n \int |[\lim_k g^{(k)}_n(x)] - f_n(x)|\,dx \ge \int \lim_n |[\lim_k g^{(k)}_n(x)] - f_n(x)|\,dx = \int |[\lim_n \lim_k g^{(k)}_n(x)] - \lim_n f_n(x)|\,dx$. Through lemma F.11, we then obtain that $0 = \lim_n \lim_k K(g^{(k)}_n, f_n) \ge \int |[\lim_n \lim_k g^{(k)}_n(x)] - \lim_n f_n(x)|\,dx \ge 0$, i.e. that $\int |[\lim_n \lim_k g^{(k)}_n(x)] - \lim_n f_n(x)|\,dx = 0$. Moreover, for any given $k$ and any given $n$, the function $g^{(k)}_n$ is a convex combination of multivariate Gaussian distributions. As derived in remark 2.1, for all $k$, the determinant of the covariance of the random vector with density $g^{(k)}$ is greater than or equal to the product of a positive constant times the determinant of the covariance of the random vector with density $f$. The form of the kernel estimate therefore implies that there exists an integrable function $\varphi$ such that, for any given $k$ and any given $n$, we have $|g^{(k)}_n| \le \varphi$. Finally, the dominated convergence theorem enables us to say that $\lim_n \lim_k g^{(k)}_n = \lim_n f_n = f$, since $f_n$ converges towards $f$ and since $\int |[\lim_n \lim_k g^{(k)}_n(x)] - \lim_n f_n(x)|\,dx = 0$. ✷

Proof of theorem C.4. Through a Taylor expansion of $P_n M(\check c_n(a_k), \check\gamma_n)$ of order 2 at point $(a_k, a_k)$, we get
$$P_n M(\check c_n(a_k), \check\gamma_n) = P_n M(a_k,a_k) + P_n \frac{\partial}{\partial a} M(a_k,a_k)^\top(\check\gamma_n - a_k) + P_n \frac{\partial}{\partial b} M(a_k,a_k)^\top(\check c_n(a_k) - a_k)$$
$$+ \frac12\Big\{(\check\gamma_n - a_k)^\top P_n \frac{\partial^2}{\partial a\partial a} M(a_k,a_k)\,(\check\gamma_n - a_k) + (\check c_n(a_k) - a_k)^\top P_n \frac{\partial^2}{\partial b\partial a} M(a_k,a_k)\,(\check\gamma_n - a_k)$$
$$+ (\check\gamma_n - a_k)^\top P_n \frac{\partial^2}{\partial a\partial b} M(a_k,a_k)\,(\check c_n(a_k) - a_k) + (\check c_n(a_k) - a_k)^\top P_n \frac{\partial^2}{\partial b\partial b} M(a_k,a_k)\,(\check c_n(a_k) - a_k)\Big\}.$$
The lemma below enables us to conclude.

Lemma F.14. Let $H$ be an integrable function and let $C = \int H\,dP$ and $C_n = \int H\,dP_n$; then $C_n - C = O_P(\frac{1}{\sqrt n})$.

Thus we get $P_n M(\check c_n(a_k), \check\gamma_n) = P_n M(a_k, a_k) + O_P(\frac1n)$, i.e. $\sqrt n\,(P_n M(\check c_n(a_k), \check\gamma_n) - P M(a_k,a_k)) = \sqrt n\,(P_n M(a_k,a_k) - P M(a_k,a_k)) + o_P(1)$. Hence $\sqrt n\,(P_n M(\check c_n(a_k), \check\gamma_n) - P M(a_k,a_k))$ abides by the same limit distribution as $\sqrt n\,(P_n M(a_k,a_k) - P M(a_k,a_k))$, which is $\mathcal N(0, \mathrm{Var}_P(M(a_k,a_k)))$. ✷

Proof of theorem 3.5. Through proposition B.1 and theorem C.4, we derive theorem 3.5. ✷

Proof of theorem 3.7. We immediately get the proof from theorem 3.4. ✷

Proof of theorem 3.8. Since $\Phi(g^{(1)}, f) = 0$, then, through lemma F.9, we deduce that the density of $b_2^\top X\,|\,a_1^\top X$, with $a_1 = (0,1)'$ and $b_2 = (1,0)'$, is the same as the one of $b_2^\top Y\,|\,a_1^\top Y$. Hence, we derive that $E(X_1|X_2) = E(Y_1|Y_2)$, and also that the regression between $X_1$ and $X_2$ is $X_1 = E(Y_1) + \frac{\mathrm{Cov}(Y_1,Y_2)}{\mathrm{Var}(Y_2)}(Y_2 - E(Y_2)) + \varepsilon$, where $\varepsilon$ is a centered random variable orthogonal to $E(X_1|X_2)$. ✷

Proof of theorem 3.9. We infer this proof similarly to the proof of theorem 3.8. ✷
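As an elementary sanity check of lemma F.14 above, one can verify numerically that the empirical mean of an integrable $H$ deviates from its expectation at the rate $O_P(n^{-1/2})$. The choice $H(x) = x^2$ under a standard normal law below is an arbitrary illustration of ours.

```python
# Hedged numerical check (ours) of lemma F.14: C_n - C = O_P(n^{-1/2}).
# Here H(x) = x^2 under N(0, 1), so C = E[X^2] = 1, and the scaled error
# sqrt(n) * |C_n - C| stays of order one as n grows.
import numpy as np

rng = np.random.default_rng(7)
for n in (100, 10_000, 1_000_000):
    x = rng.standard_normal(n)
    print(n, np.sqrt(n) * abs((x ** 2).mean() - 1.0))
```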
Proof of corollary 3.4. Let us first assume that the $b_k$ and the $a_i$ form the canonical basis of $\mathbb{R}^d$. Then, for any $i \ne j$, $Y_i$ is independent from $Y_j$, i.e. $E(Y_k|Y_1,...,Y_j) = E(Y_k)$. Consequently, the regression between $X_k$ and $(X_1,...,X_j)$ is given by $X_k = E(Y_k) + \varepsilon_k$, where $\varepsilon_k$ is a centered random variable orthogonal to $E(X_k|X_1,...,X_j)$. At present, we derive the general case thanks to the methodology used in the proof of lemma F.1, with the transformation matrix $B = (a_1,...,a_j,b_{j+1},...,b_d)$. ✷

References

Azé, D. Eléments d'analyse convexe et variationnelle. Ellipses, 1997.
Bosq, D., Lecoutre, J.-P. Théorie de l'estimation fonctionnelle. Economica, 1999.
Broniatowski, M., Keziou, A. Parametric estimation and tests through divergences and the duality technique. J. Multivariate Anal. 100 (2009), no. 1, 16–36.
Cambanis, S., Huang, S., Simons, G. On the theory of elliptically contoured distributions. J. Multivariate Anal. 11 (1981), no. 3, 368–385.
Cressie, N., Read, T.R.C. Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. Ser. B 46 (1984), no. 3, 440–464.
Csiszár, I. On topology properties of f-divergences. Studia Sci. Math. Hungar. 2 (1967), 329–339.
Devroye, L., Györfi, L. Distribution-free exponential bound for the L1 error of partitioning-estimates of a regression function. In Probability and Statistical Decision Theory, Vol. A (Bad Tatzmannsdorf, 1983), pp. 67–76. Reidel, Dordrecht, 1985.
Diaconis, P., Freedman, D. Asymptotics of graphical projection pursuit. Ann. Statist. 12 (1984), no. 3, 793–815.
Dieudonné, J. Calcul infinitésimal. Hermann, 1980.
Friedman, J.H. Exploratory projection pursuit. J. Amer. Statist. Assoc. 82 (1987), no. 397, 249–266.
Friedman, J.H., Stuetzle, W., Schroeder, A. Projection pursuit density estimation. J. Amer. Statist. Assoc. 79 (1984), no. 387, 599–608.
Huber, P.J. Robust Statistics. Wiley, 1981 (republished in paperback, 2004).
Huber, P.J. Projection pursuit. Ann. Statist. 13 (1985), no. 2, 435–525 (with discussion).
Landsman, Z.M., Valdez, E.A. Tail conditional expectations for elliptical distributions. N. Am. Actuar. J. 7 (2003), no. 4, 55–71.
Liese, F., Vajda, I. Convex Statistical Distances. Teubner-Texte zur Mathematik, vol. 95. BSB B.G. Teubner Verlagsgesellschaft, 1987.
Pardo, L. Statistical Inference Based on Divergence Measures. Statistics: Textbooks and Monographs, vol. 185. Chapman & Hall/CRC, Boca Raton, FL, 2006.
Rockafellar, R.T. Convex Analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton, N.J., 1970.
Saporta, G. Probabilités, analyse des données et statistique. Technip, 2006.
Scott, D.W. Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons, New York, 1992.
Toma, A. Optimal robust M-estimators using divergences. Statist. Probab. Lett. 79 (2009), no. 1, 1–5.
Vajda, I. $\chi^\alpha$-divergence and generalized Fisher's information. In Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes (Tech. Univ. Prague, 1971; dedicated to the memory of Antonín Špaček), pp. 873–886. Academia, Prague, 1973.
Van der Vaart, A.W. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics, vol. 3. Cambridge University Press, Cambridge, 1998.
Yohai, V.J. Optimal robust estimates using the Kullback-Leibler divergence. Statist. Probab. Lett. 78 (2008), no. 13, 1811–1816.
Zhu, M. On the forward and backward algorithms of projection pursuit. Ann. Statist. 32 (2004), no. 1, 233–244.
Zografos, K., Ferentinos, K., Papaioannou, T. $\varphi$-divergence statistics: sampling properties and multinomial goodness of fit and divergence tests. Comm. Statist. Theory Methods 19 (1990), no. 5, 1785–1802.
