Approximating $L_1$-distances between mixture distributions using random projections

Satyaki Mahalanabis, Daniel Štefankovič
Department of Computer Science, University of Rochester, Rochester, NY 14627
{smahalan,stefanko}@cs.rochester.edu

Abstract

We consider the problem of computing $L_1$-distances between every pair of probability densities from a given family. We point out that the technique of Cauchy random projections [Ind06] in this context turns into stochastic integrals with respect to Cauchy motion. For piecewise-linear densities these integrals can be sampled from if one can sample from the stochastic integral of the function $x \mapsto (1,x)$. We give an explicit density function for this stochastic integral and present an efficient (exact) sampling algorithm. As a consequence we obtain an efficient algorithm to approximate the $L_1$-distances with a small relative error.

For piecewise-polynomial densities we show how to approximately sample from the distributions resulting from the stochastic integrals. This also results in an efficient algorithm to approximate the $L_1$-distances, although our inability to get exact samples worsens the dependence on the parameters.

1 Introduction

Consider a finite class $F = \{f_1, f_2, \ldots, f_m\}$ of probability densities. We want to compute the distance between every pair of members of $F$. We are interested in the case where each member of $F$ is a mixture of finitely many probability density functions, each having a particular functional form (e.g., uniform, linear, exponential, normal, etc.). Such classes of distributions are frequently encountered in machine learning (e.g., mixture models, see [Bis06]) and nonparametric density estimation (e.g., histograms, kernels, see [DL01]).
The number of distributions in a mixture gives a natural measure of complexity which we use to express the running time of our algorithms. For some classes of distributions exact algorithms are possible; for example, if each distribution in $F$ is a piecewise linear function consisting of $n$ pieces, then we can compute the distances between all pairs in time $\Theta(m^2 n)$. For other classes of distributions (for example, mixtures of normal distributions) exact computation of the distances might not be possible. Thus we turn to randomized approximation algorithms. A $(\delta,\varepsilon)$-relative-error approximation scheme computes $D_{jk}$, $j,k \in [m]$, such that with probability at least $1-\delta$ we have
$$(\forall j,k \in [m]) \quad (1-\varepsilon)\,D_{jk} \le \|f_j - f_k\|_1 \le (1+\varepsilon)\,D_{jk}.$$
A $(\delta,\varepsilon)$-absolute-error approximation scheme computes $D_{jk}$, $j,k \in [m]$, such that with probability at least $1-\delta$ we have
$$(\forall j,k \in [m]) \quad D_{jk} - \varepsilon \le \|f_j - f_k\|_1 \le D_{jk} + \varepsilon.$$
A direct application of the Monte Carlo method ([MU49], see [Met87]) immediately yields the following absolute-error approximation scheme. Let $X_{jk}$ be sampled according to $f_j$ and let $Y_{jk} = \mathrm{sgn}(f_j(X_{jk}) - f_k(X_{jk}))$, where $\mathrm{sgn} : \mathbb{R} \to \{-1,0,1\}$ is the sign function. The expected value of $Y_{jk} + Y_{kj}$ is equal to $\|f_j - f_k\|_1$; indeed
$$\mathrm{E}[Y_{jk} + Y_{kj}] = \int (f_j(x) - f_k(x))\,\mathrm{sgn}(f_j(x) - f_k(x))\,\mathrm{d}x = \|f_j - f_k\|_1.$$
Thus, to obtain a $(\delta,\varepsilon)$-absolute-error approximation scheme it is enough to approximate each $Y_{jk}$ with absolute error $\varepsilon/2$ and confidence $1 - \delta/m^2$. By the Chernoff bound, $O(\varepsilon^{-2}\ln(m^2/\delta))$ samples from each $Y_{jk}$ are enough. (The total number of samples from the $f_j$ is $O(m\varepsilon^{-2}\ln(m^2/\delta))$, since we can use the same sample from $f_j$ for $Y_{j1}, \ldots, Y_{jm}$. The total number of evaluations is $O(m^2\varepsilon^{-2}\ln(m^2/\delta))$.)
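To make the scheme concrete, here is a minimal sketch (ours, not the paper's; the class and function names are illustrative) for piecewise-constant densities, which support both oracles the scheme needs: sampling and $O(\log n)$ evaluation by binary search.

```python
import bisect
import random

def sign(z):
    # the sign function sgn : R -> {-1, 0, 1}
    return (z > 0) - (z < 0)

class PiecewiseConstantDensity:
    """A density constant on each cell [a_i, a_{i+1}); supports the two
    oracles the Monte Carlo scheme needs: evaluation and sampling."""
    def __init__(self, breakpoints, values):
        self.a, self.v = breakpoints, values
        self.cum = [0.0]                       # CDF values at the breakpoints
        for i, val in enumerate(values):
            self.cum.append(self.cum[-1] + (breakpoints[i + 1] - breakpoints[i]) * val)

    def eval(self, x):
        # O(log n) evaluation by binary search over the breakpoints
        if x < self.a[0] or x >= self.a[-1]:
            return 0.0
        return self.v[bisect.bisect_right(self.a, x) - 1]

    def sample(self, rng):
        # O(log n) sampling by inverting the CDF with binary search
        u = rng.random() * self.cum[-1]
        i = min(bisect.bisect_right(self.cum, u) - 1, len(self.v) - 1)
        frac = (u - self.cum[i]) / (self.cum[i + 1] - self.cum[i])
        return self.a[i] + frac * (self.a[i + 1] - self.a[i])

def l1_estimate(fj, fk, t, rng):
    # average of Y_jk + Y_kj over t trials; its expectation is ||f_j - f_k||_1
    total = 0
    for _ in range(t):
        x, y = fj.sample(rng), fk.sample(rng)
        total += sign(fj.eval(x) - fk.eval(x)) + sign(fk.eval(y) - fj.eval(y))
    return total / t
```

For two unit-width uniform densities overlapping on half of their support the true distance is 1, and the estimate concentrates around it at the $\varepsilon^{-2}$ rate given by the Chernoff bound.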
The running time of this algorithm will compare favorably with the exact algorithm if sampling from the densities and evaluation of the densities at a point can be done fast. (For example, for piecewise linear densities both sampling and evaluation can be done in $O(\log n)$ time, using binary search.) Note that the evaluation oracle is essential (cf. [BFR+00], who only allow the use of sampling oracles). In the rest of the paper we will focus on the harder relative-error approximation schemes (since the $L_1$-distance between two distributions is at most 2, a relative-error approximation scheme immediately yields an absolute-error approximation scheme). Our motivation comes from an application (density estimation) which requires a relative-error scheme [MŠ07].

Now we outline the rest of the paper. In Section 2 we review Cauchy random projections; in Section 3 we point out that for density functions Cauchy random projections become stochastic integrals; in Section 4 we show that for piecewise linear functions we can sample from these integrals (using rejection sampling, with the bivariate Student distribution as the envelope) and as a consequence we obtain an efficient approximation algorithm for relative-error all-pairs-$L_1$-distances. Finally, in Section 5, we show that for piecewise polynomial functions one can approximately sample from the integrals, leading to slightly less efficient approximation algorithms.

2 Cauchy random projections

Dimension reduction (the most well-known example is the Johnson-Lindenstrauss lemma for $L_2$-spaces [JL84]) is a natural technique to use here. We are interested in $L_1$-spaces, for which the analogue of the Johnson-Lindenstrauss lemma is not possible [BC05, NL04] (that is, one cannot project points into a low-dimensional $L_1$-space and preserve distances with a small relative error).
However, one can still project points to short vectors from which the $L_1$-distances between the original points can be approximately recovered using non-linear estimators [LHC07, Ind06].

A particularly fruitful view of the dimensionality "reduction" (with non-linear estimators) is through stable distributions ([JS82, Ind06]): given vectors $v_1, \ldots, v_m$ one defines (dependent) random variables $X_1, \ldots, X_m$ such that the distance of $v_j$ and $v_k$ can be recovered from $X_j - X_k$ (for all $j,k \in [m]$). For example, in the case of $L_1$-distances $X_j - X_k$ will be from the Cauchy distribution $C(0, \|v_j - v_k\|_1)$, and hence the recovery problem is to estimate the scale parameter $R$ of a Cauchy distribution $C(0,R)$. This is a well-studied problem (see, e.g., [HBA70]). We can, for example, use the following non-linear estimator (other estimators, e.g., the median, are also possible [Ind06]):

Lemma 2.1 (Lemma 7 of [LHC07]). Let $X_1, X_2, \ldots, X_t$ be independent samples from the Cauchy distribution $C(0,D)$. Define the geometric mean estimator without bias-correction $\hat{D}_{gm}$ as
$$\hat{D}_{gm} = \prod_{j=1}^{t} |X_j|^{1/t}.$$
Then for each $\varepsilon \in [0, 1/2]$, we have
$$\mathrm{P}\Big(\hat{D}_{gm} \in [(1-\varepsilon)D, (1+\varepsilon)D]\Big) \ge 1 - 2\exp(-t\varepsilon^2/8).$$

We first illustrate how Cauchy random projections immediately give an efficient relative-error approximation scheme for piecewise uniform distributions. Let $F$ consist of $m$ piecewise uniform densities, that is, each member of $F$ is a mixture of $n$ distributions, each uniform on an interval. Let $a_1, \ldots, a_s$ be the endpoints of all the intervals that occur in $F$, sorted in increasing order (note that $s \le 2mn$). Without loss of generality, we can assume that each distribution $f_j \in F$ is specified by $n$ pairs $(b_{j1}, c_{j1}), \ldots, (b_{jn}, c_{jn})$, where $1 \le b_{j1} < c_{j1} < \cdots < b_{jn} < c_{jn} \le s$, and for each pair $(b_{j\ell}, c_{j\ell})$ we are also given a number $\alpha_{j\ell}$ which is the value of $f_j$ on the interval $[a_{b_{j\ell}}, a_{c_{j\ell}})$.

Now we will use Cauchy random projections to compute the pairwise $L_1$-distances between the $f_j$ efficiently. For $\ell \in \{1, \ldots, s-1\}$ let $Z_\ell$ be independent from the Cauchy distribution $C(0, a_{\ell+1} - a_\ell)$. Let $Y_\ell = Z_1 + \cdots + Z_{\ell-1}$, for $\ell = 1, \ldots, s$. Finally, let
$$X_j := \sum_{\ell=1}^{n} \alpha_{j\ell}\,(Y_{c_{j\ell}} - Y_{b_{j\ell}}) = \sum_{\ell=1}^{n} \alpha_{j\ell}\,(Z_{b_{j\ell}} + \cdots + Z_{c_{j\ell}-1}). \qquad (1)$$
Note that $X_j$ is a sum of Cauchy random variables and hence has a Cauchy distribution (in fact it is from $C(0,1)$). Thus $X_j - X_k$ will be from a Cauchy distribution as well. The coefficient of $Z_\ell$ in $X_j - X_k$ is the difference of $f_j$ and $f_k$ on the interval $[a_\ell, a_{\ell+1})$. Hence the contribution of $Z_\ell$ to $X_j - X_k$ is from the Cauchy distribution $C\big(0, \int_{a_\ell}^{a_{\ell+1}} |f_j(x) - f_k(x)|\,\mathrm{d}x\big)$, and thus $X_j - X_k$ is from the Cauchy distribution $C(0, \|f_j - f_k\|_1)$.

Remark 2.2. In the next section we will generalize the above approach to piecewise degree-$d$-polynomial densities. In this case for each $(b_{j\ell}, c_{j\ell})$ we are given a vector $\alpha_{j\ell} \in \mathbb{R}^{d+1}$ such that the value of $f_j$ on the interval $[a_{b_{j\ell}}, a_{c_{j\ell}})$ is given by the following polynomial (written as an inner product): $f_j(x) = (1, x, \ldots, x^d) \cdot \alpha_{j\ell}$.

3 Cauchy motion

A natural way of generalizing the algorithm from the previous section to arbitrary density functions is to take infinitesimal intervals. This leads one to the well-studied area of stochastic integrals w.r.t. symmetric 1-stable Lévy motion (also called Cauchy motion). Cauchy motion is a stochastic process $\{X(t),\, t \in \mathbb{R}\}$ such that $X(0) = 0$, $X$ has independent increments (i.e., for any $t_1 \le t_2 \le \cdots \le t_k$ the random variables $X(t_2) - X(t_1), \ldots, X(t_k) - X(t_{k-1})$ are independent), and $X(t) - X(s)$ is from the Cauchy distribution $C(0, |t-s|)$. Intuitively, a stochastic integral of a deterministic function w.r.t. Cauchy motion is like a regular integral, except one uses $X(t) - X(s)$ instead of $t - s$ for the length of an interval (see Section 3.4 of [ST94] for a readable formal treatment). We will only need the following basic facts about stochastic integrals of deterministic functions w.r.t. Cauchy motion (which we will denote $\mathrm{d}L(x)$); see [ST94], Chapter 3.

Fact 3.1. Let $f : \mathbb{R} \to \mathbb{R}$ be a (Riemann) integrable function. Let $X = \int_a^b f(x)\,\mathrm{d}L(x)$. Then $X$ is a random variable from the Cauchy distribution $C(0,R)$, where
$$R = \int_a^b |f(x)|\,\mathrm{d}x. \qquad (2)$$

Fact 3.2. Let $f_1, \ldots, f_d : \mathbb{R} \to \mathbb{R}$ be (Riemann) integrable functions. Let $\varphi = (f_1, \ldots, f_d) : \mathbb{R} \to \mathbb{R}^d$. Let $(X_1, \ldots, X_d) = \int_a^b \varphi(x)\,\mathrm{d}L(x)$. Then $(X_1, \ldots, X_d)$ is a random variable with characteristic function
$$\hat{f}(c_1, \ldots, c_d) = \exp\Big(-\int_a^b |c_1 f_1(x) + \cdots + c_d f_d(x)|\,\mathrm{d}x\Big).$$

Fact 3.3. Let $f, g : \mathbb{R} \to \mathbb{R}$ be (Riemann) integrable functions. Let $a < b$, $\alpha, \beta \in \mathbb{R}$. Then
$$\int_a^b (\alpha f + \beta g)\,\mathrm{d}L(x) = \alpha \int_a^b f\,\mathrm{d}L(x) + \beta \int_a^b g\,\mathrm{d}L(x).$$
Let $h(x) = f(a + (b-a)x)$. Then
$$\int_a^b f(x)\,\mathrm{d}L(x) = (b-a)\int_0^1 h(x)\,\mathrm{d}L(x).$$

From Facts 3.1 and 3.3 it follows that the problem of approximating the $L_1$-distances between densities can be solved if we can evaluate stochastic integrals w.r.t. Cauchy motion; we formalize this in the following observation.

Observation 3.4. Let $f_1, \ldots, f_m : \mathbb{R} \to \mathbb{R}$ be probability densities. Let $\varphi : \mathbb{R} \to \mathbb{R}^m$ be defined by $\varphi(x) = (f_1(x), \ldots, f_m(x))$. Consider
$$(X_1, \ldots, X_m) = \int_{-\infty}^{\infty} \varphi(x)\,\mathrm{d}L(x). \qquad (3)$$
For all $j,k \in [m]$ we have that $X_j - X_k$ is from the Cauchy distribution $C(0, \|f_j - f_k\|_1)$.

Note that the $X_j$ defined by (1) are in fact computing the integral in (3). For piecewise uniform densities it was enough to sample from the Cauchy distribution to compute the integral. For piecewise degree-$d$-polynomial densities it will be enough to sample from the following distribution.

Definition 3.5. Let $\varphi : \mathbb{R} \to \mathbb{R}^{d+1}$ be defined by $\varphi(x) = (1, x, x^2, \ldots, x^d)$. Let $CI_d(a,b)$ be the distribution of $Z$, where
$$Z := (Z_0, \ldots, Z_d) := \int_a^b \varphi(x)\,\mathrm{d}L(x).$$
Note that given a sample from $CI_d(0,1)$, using $O(d^2)$ arithmetic operations we can obtain a sample from $CI_d(a,b)$, using Fact 3.3.

Lemma 3.6. Let $F$ consist of $m$ piecewise degree-$d$-polynomial densities, each consisting of $n$ pieces (given as in Remark 2.2). Let $t \ge (8/\varepsilon)^2 \ln(m^2/\delta)$ be an integer. Assume that we can sample from $CI_d(0,1)$ using $T_d$ operations. We can obtain a $(\delta,\varepsilon)$-relative-error approximation of the $L_1$-distances between all pairs in $F$, using $O((d^2 + T_d)mnt + m^2 t)$ arithmetic operations.

Proof: For $\ell \in \{1, \ldots, s-1\}$ let $Z_\ell$ be independent from the $CI_d(a_\ell, a_{\ell+1})$ distribution. Let $Y_\ell = Z_1 + \cdots + Z_{\ell-1}$, for $\ell = 1, \ldots, s$. Finally, for each $j \in [m]$, let
$$X_j := \sum_{\ell=1}^{n} \alpha_{j\ell} \cdot (Y_{c_{j\ell}} - Y_{b_{j\ell}}) = \sum_{\ell=1}^{n} \alpha_{j\ell} \cdot (Z_{b_{j\ell}} + \cdots + Z_{c_{j\ell}-1}). \qquad (4)$$
Note that $Y_{c_{j\ell}} - Y_{b_{j\ell}}$ is from $CI_d(a_{b_{j\ell}}, a_{c_{j\ell}})$ and hence
$$\alpha_{j\ell} \cdot (Y_{c_{j\ell}} - Y_{b_{j\ell}}) = \int_{a_{b_{j\ell}}}^{a_{c_{j\ell}}} f_j(x)\,\mathrm{d}L(x).$$
Thus $(X_1, \ldots, X_m)$ defined by (4) compute (3). For every $j,k \in [m]$ we have that $X_j - X_k$ is from the Cauchy distribution $C(0, \|f_j - f_k\|_1)$. If we have $t$ samples from each $X_1, \ldots, X_m$, then using Lemma 2.1 and the union bound, with probability $\ge 1-\delta$ we recover all $\|f_j - f_k\|_1$ with relative error $\varepsilon$.
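As an aside, for the piecewise-uniform case the whole pipeline — the projection (1) together with the geometric-mean estimator of Lemma 2.1 — fits in a few lines. The sketch below is ours (names are illustrative); for simplicity it computes $X_j$ as the direct sum of cell value times $Z_\ell$, which equals the prefix-sum form in (1).

```python
import math
import random

def cauchy(rng, scale):
    # a sample from C(0, scale), by inverting the Cauchy CDF
    return scale * math.tan(math.pi * (rng.random() - 0.5))

def project(a, densities, rng):
    """One Cauchy projection.  a[0..s-1] are the sorted endpoints a_1..a_s,
    densities[j][l] is the value of f_j on the cell [a_l, a_{l+1}).
    Returns (X_1, ..., X_m) as in equation (1)."""
    Z = [cauchy(rng, a[l + 1] - a[l]) for l in range(len(a) - 1)]
    return [sum(v * z for v, z in zip(vals, Z)) for vals in densities]

def all_pairs_l1(a, densities, t, rng):
    # t independent projections, then the geometric-mean estimator of Lemma 2.1
    m = len(densities)
    X = [project(a, densities, rng) for _ in range(t)]
    D = [[0.0] * m for _ in range(m)]
    for j in range(m):
        for k in range(j + 1, m):
            logsum = sum(math.log(abs(x[j] - x[k])) for x in X)
            D[j][k] = D[k][j] = math.exp(logsum / t)
    return D
```

With two uniform densities whose supports overlap on half their length, the true distance is 1 and the estimate concentrates around it as guaranteed by Lemma 2.1.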
Note that $s \le 2mn$ and hence for the $Z_\ell$ we used $\le 2mnt$ samples from the $CI_d(0,1)$ distribution, costing us $O((d^2 + T_d)mnt)$ arithmetic operations. Computing the $Y_\ell$ takes $O(mnt)$ operations. Computing the $X_j$ takes $O(mnt)$ operations. The final estimation of the distances takes $O(m^2 t)$ operations. ∎

4 Piecewise linear functions

The density function of $CI_1(0,1)$ can be computed explicitly, using the inverse Fourier transform; the proof is deferred to the appendix. The expression for the density allows us to construct an efficient sampling algorithm, which in turn yields an efficient approximation algorithm for all-pairs-$L_1$-distances for piecewise linear densities. We obtain the following result.

Theorem 4.1. Let $F$ consist of $m$ piecewise linear densities, each consisting of $n$ pieces (given as in Remark 2.2). We can obtain a $(\delta,\varepsilon)$-relative-error approximation of the $L_1$-distances between all pairs in $F$, using $O(m(m+n)\varepsilon^{-2}\ln(m/\delta))$ arithmetic operations.

Now we state the density of $CI_1(0,1)$. In the following, $\Re(x)$ denotes the real part of a complex number $x$.

Theorem 4.2. Let $\varphi : \mathbb{R} \to \mathbb{R}^2$ be the function $\varphi(x) = (1,x)$. Let
$$Z = (X_1, X_2) = \int_0^1 \varphi(z)\,\mathrm{d}L(z).$$
For $x_1 \ne 2x_2$ the density function of $Z$ is given by
$$f(x_1, x_2) = \frac{4/\pi^2}{1 + 6x_1^2 + x_1^4 - 16x_1x_2 + 16x_2^2} + \frac{2}{\pi^2}\,\Re\!\left(\frac{\operatorname{atan}\big(i\sqrt{Q}/(x_1 - 2x_2)\big)}{Q^{3/2}}\right), \qquad (5)$$
where
$$Q = 1 - 2ix_1 + x_1^2 + 4ix_2. \qquad (6)$$
For $x_1 = 2x_2$ the density is given by
$$f(x_1, x_2) = \frac{4/\pi^2}{(1+x_1^2)^2} + \frac{1}{\pi\,(1+x_1^2)^{3/2}}. \qquad (7)$$

[Figure 1: The density plot of $(X_1, X_2) = \int_0^1 (1,z)\,\mathrm{d}L(z)$. The contours are at levels $2^{-15}, 2^{-14}, \ldots, 2^{-1}$.]

Next we show how to efficiently sample from the $CI_1(0,1)$ distribution by rejection sampling, using the bivariate Student distribution as the envelope.
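The envelope itself is easy to sample exactly. The sketch below is ours: it uses the standard normal/chi-squared construction of a bivariate Student sample, specialized to the particular envelope density (8) via the linear map $x_1 = t_1$, $x_2 = (t_1+t_2)/2$ (Jacobian 2), under which (8) is exactly the spherical bivariate Student density with 1 degree of freedom.

```python
import math
import random

def student_envelope(rng):
    """One sample from g(x) = (1/pi) (1 + x1^2 + (2 x2 - x1)^2)^(-3/2).
    (t1, t2) = Y / sqrt(W), with Y ~ N_2(0, I) and W ~ chi^2(1), is spherical
    bivariate Student with 1 degree of freedom; the linear map x1 = t1,
    x2 = (t1 + t2)/2 has Jacobian 2 and yields the tilted density above."""
    y1, y2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    w = rng.gauss(0.0, 1.0) ** 2        # chi-squared with 1 degree of freedom
    t1, t2 = y1 / math.sqrt(w), y2 / math.sqrt(w)
    return t1, (t1 + t2) / 2.0
```

A quick self-check: the first coordinate is a ratio of independent standard normals, i.e., standard Cauchy, so the empirical median of $|x_1|$ should be close to 1.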
Let $\Sigma$ be a positive-definite $2 \times 2$ matrix. The bivariate Student distribution with 1 degree of freedom is given by the following formula (see, e.g., [ES00], p. 50):
$$g(x) = \frac{|\det(\Sigma)|^{-1/2}}{2\pi}\,\big(1 + x^T \Sigma^{-1} x\big)^{-3/2}.$$
It is well known how to sample $X$ from this distribution: let $X = \Sigma^{1/2} Y/\sqrt{W}$, where $Y, W$ are independent with $Y \sim N_2(0, I)$ (the two-dimensional gaussian) and $W \sim \chi^2(1)$ (chi-squared distribution with 1 degree of freedom). We are going to use the bivariate Student distribution with the following density:
$$g(x) = \frac{1}{\pi}\,\big(1 + x_1^2 + (2x_2 - x_1)^2\big)^{-3/2}. \qquad (8)$$
We show that the density function of the $CI_1(0,1)$ distribution is bounded by a constant multiple of (8) (the proof is deferred to the appendix).

Lemma 4.3. Let $f(x)$ be given by (5) and (7). Let $g(x)$ be given by (8). For every $x \in \mathbb{R}^2$ we have
$$f(x) \le \frac{C}{\pi}\cdot g(x), \quad \text{where } C = 25.$$

As an immediate corollary of Lemma 4.3 we obtain an efficient sampling algorithm for the $CI_1(0,1)$ distribution, using rejection sampling (see, e.g., [ES00]).

Corollary 4.4. There is a sampler for $CI_1(0,1)$ which uses a constant number of samples from $N(0,1)$ and $\chi^2(1)$ (in expectation).

Proof of Theorem 4.1: The theorem follows from Corollary 4.4 and Lemma 3.6. ∎

Remark 4.5. Lemma 4.3 is true with $C = \pi 2^{3/2}$ (we skip the technical proof). The constant $\pi 2^{3/2}$ is tight (see equation (40) with $\alpha \to 0$ and $T \to 1$).

5 Piecewise polynomial functions

Some kernels used in machine learning (e.g., the Epanechnikov kernel, see [DL01], p. 85) are piecewise polynomial. Thus it is of interest to extend the result from the previous section to higher-degree polynomials.

For $d > 1$ we do not know how to sample from the distribution $CI_d(0,1)$ exactly. However, we can still approximately sample from this distribution, as follows. Let $r$ be an integer. Let $Z_1, \ldots, Z_r$ be independent from the Cauchy distribution $C(0, 1/r)$. Consider the following distribution, which we call the $r$-approximation of $CI_d(0,1)$:
$$(X_0, \ldots, X_d) = \sum_{j=1}^{r} Z_j \cdot \big(1, (j/r), (j/r)^2, \ldots, (j/r)^d\big). \qquad (9)$$
Now we show that if $r$ is large enough then the distribution given by (9) can be used instead of the distribution $CI_d(0,1)$ for our purpose. As a consequence we will obtain the following.

Theorem 5.1. Let $F$ consist of $m$ piecewise degree-$d$-polynomial densities, each consisting of $n$ pieces (given as in Remark 2.2). We can obtain a $(\delta,\varepsilon)$-relative-error approximation of the $L_1$-distances between all pairs in $F$, using $O(m(m+n)d^3\varepsilon^{-3}\ln(m/\delta))$ arithmetic operations.

Remark 5.2. Note that for $d = 1$ Theorem 5.1 gives a worse (in $\varepsilon$) running time than Theorem 4.1. This slowdown is caused by the additional integration used to simulate $CI_d(0,1)$.

The proof of Theorem 5.1 will be based on the following result, which shows that (9) is in some sense close to $CI_d(0,1)$.

Lemma 5.3. Let $p = a_0 + a_1x + \cdots + a_dx^d$ be a polynomial of degree $d$. Let $(X_0, \ldots, X_d)$ be sampled from the distribution given by (9), with $r \ge cd^2/\varepsilon$ (where $c$ is an absolute constant). Let $W = a_0X_0 + \cdots + a_dX_d$. Then $W$ is from the Cauchy distribution $C(0,R)$, where
$$(1-\varepsilon)\int_0^1 |p(x)|\,\mathrm{d}x \le R \le (1+\varepsilon)\int_0^1 |p(x)|\,\mathrm{d}x. \qquad (10)$$

We defer the proof of Lemma 5.3 to the end of this section. Note that having (10) instead of (2) (which sampling from $CI_d(0,1)$ would yield) will introduce a small relative error to the approximation of the $L_1$-distances.

Proof of Theorem 5.1: The proof is analogous to the proof of Lemma 3.6. Let $r \ge cd^2/\varepsilon$. For $\ell \in \{1, \ldots, s-1\}$ let $Z_\ell$ be independent from the $r$-approximation of the $CI_d(a_\ell, a_{\ell+1})$ distribution. Let $Y_\ell = Z_1 + \cdots + Z_{\ell-1}$, for $\ell = 1, \ldots, s$.
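(As an aside, the $r$-approximation (9) is straightforward to simulate; the sketch below is ours, with illustrative names.)

```python
import math
import random

def approx_CI(d, r, rng):
    """One sample from the r-approximation (9) of CI_d(0,1):
    the sum of Z_j * (1, j/r, ..., (j/r)^d) with Z_j ~ C(0, 1/r)."""
    X = [0.0] * (d + 1)
    for j in range(1, r + 1):
        z = math.tan(math.pi * (rng.random() - 0.5)) / r   # Z_j ~ C(0, 1/r)
        u = j / r
        for i in range(d + 1):
            X[i] += z * u ** i
    return X
```

As in the proof of Lemma 5.3 below, each coordinate is Cauchy with an explicit scale: $X_0$ is exactly $C(0,1)$, and for $d = 1$ the coordinate $X_1$ is $C\big(0, \frac{1}{r}\sum_{j=1}^r (j/r)\big) = C\big(0, \frac{r+1}{2r}\big)$, close to $\int_0^1 x\,\mathrm{d}x = 1/2$. Since the median of $|C(0,s)|$ is $s$, this is easy to check empirically.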
Finally, for each $j \in [m]$, let
$$X_j := \sum_{\ell=1}^{n} \alpha_{j\ell} \cdot (Y_{c_{j\ell}} - Y_{b_{j\ell}}) = \sum_{\ell=1}^{n} \alpha_{j\ell} \cdot (Z_{b_{j\ell}} + \cdots + Z_{c_{j\ell}-1}).$$
By Lemma 5.3, for every $j,k \in [m]$ we have that $X_j - X_k$ is from the Cauchy distribution $C(0,R)$, where $(1-\varepsilon)\|f_j - f_k\|_1 \le R \le (1+\varepsilon)\|f_j - f_k\|_1$. If we have $t \ge (8/\varepsilon)^2\ln(m^2/\delta)$ samples from each $X_1, \ldots, X_m$, then using Lemma 2.1 and the union bound, with probability $\ge 1-\delta$ we recover all $\|f_j - f_k\|_1$ with relative error $\approx 2\varepsilon$.

Note that $s \le 2mn$ and hence for the $Z_\ell$ we used $\le 2mnt$ samples from the $r$-approximation of the $CI_d(0,1)$ distribution, costing us $O((d^3/\varepsilon)mnt)$ arithmetic operations. Computing the $Y_\ell$ takes $O(mnt)$ operations. Computing the $X_j$ takes $O(mnt)$ operations. The final estimation of the distances takes $O(m^2t)$ operations. ∎

To prove Lemma 5.3 we will use the following Bernstein-type inequality from [Erd00].

Theorem 5.4 (Theorem 3.1 of [Erd00]). There exists a constant $c > 0$ such that for any degree-$d$ polynomial $p$,
$$\int_0^1 |p'(x)|\,\mathrm{d}x \le cd^2 \int_0^1 |p(x)|\,\mathrm{d}x.$$

We have the following corollary of Theorem 5.4.

Lemma 5.5. There exists a constant $c$ such that for any polynomial $p$ of degree $d$, any $r \ge cd^2$, any $0 = x_0 < x_1 < x_2 < \cdots < x_t = 1$ with $\max_j |x_j - x_{j-1}| \le 1/r$, and any $\theta_1 \in [x_0, x_1],\ \theta_2 \in [x_1, x_2],\ \ldots,\ \theta_t \in [x_{t-1}, x_t]$, we have
$$(1 - cd^2/r)\int_0^1 |p(x)|\,\mathrm{d}x \le \sum_{j=1}^t (x_j - x_{j-1})\,|p(\theta_j)| \le (1 + cd^2/r)\int_0^1 |p(x)|\,\mathrm{d}x. \qquad (11)$$

Proof: We will use induction on the degree $d$ of the polynomial. For $d = 0$ the sum and the integrals in (11) are equal. Now assume $d \ge 1$. For each $j \in [t]$, we use the Taylor expansion of $p(x)$ about $\theta_j$ for $x \in (x_{j-1}, x_j]$. This yields, for each $x \in (x_{j-1}, x_j]$, $p(x) = p(\theta_j) + (x - \theta_j)p'(\theta'_{j,x})$, where $\theta'_{j,x} \in (x_{j-1}, x_j]$.
Let $\beta_j$ be the point $y \in (x_{j-1}, x_j]$ that maximizes $|p'(y)|$. We have
$$\Big|\sum_{j=1}^t (x_j - x_{j-1})\,|p(\theta_j)| - \int_0^1 |p(x)|\,\mathrm{d}x\Big| \le \sum_{j=1}^t \int_{x_{j-1}}^{x_j} |p(x) - p(\theta_j)|\,\mathrm{d}x \le \sum_{j=1}^t \int_{x_{j-1}}^{x_j} |(x - \theta_j)\,p'(\theta'_{j,x})|\,\mathrm{d}x \le \frac{1}{2r}\sum_{j=1}^t (x_j - x_{j-1})\,|p'(\beta_j)|. \qquad (12)$$
Since $p'$ is of degree $d-1$, by the induction hypothesis the right-hand side of (12) is bounded as follows:
$$\frac{1}{2r}\sum_{j=1}^t (x_j - x_{j-1})\,|p'(\beta_j)| \le \frac{1}{2r}\big(1 + c(d-1)^2/r\big)\int_0^1 |p'(x)|\,\mathrm{d}x \le \frac{1}{r}\int_0^1 |p'(x)|\,\mathrm{d}x \le \frac{cd^2}{r}\int_0^1 |p(x)|\,\mathrm{d}x,$$
where the middle step uses $r \ge cd^2 \ge c(d-1)^2$, and in the last inequality we used Theorem 5.4. Hence the lemma follows. ∎

Proof of Lemma 5.3: We have
$$W = (a_0, \ldots, a_d) \cdot \sum_{j=1}^r Z_j\big(1, (j/r), (j/r)^2, \ldots, (j/r)^d\big) = \sum_{j=1}^r Z_j\,p(j/r),$$
where the $Z_j$ are from the Cauchy distribution $C(0, 1/r)$. Thus $W$ is from the Cauchy distribution $C(0,R)$, where
$$R = \frac{1}{r}\sum_{j=1}^r |p(j/r)|.$$
Using Lemma 5.5 we obtain (10). ∎

Remark 5.6. An alternate view of Lemma 5.5 is that a piecewise degree-$d$-polynomial density with $n$ pieces can be approximated by a piecewise uniform density with $O(nd^2/\varepsilon)$ pieces. The approximation distorts the $L_1$-distances between any pair of such densities by a factor of at most $1 \pm \varepsilon$. To obtain a relative approximation of the $L_1$-distances in a family $F$, one can now directly use the algorithm from Section 2, without going through the stochastic integrals approach (for $d = 1$ the price for this method is a $1/\varepsilon$ factor slowdown).

Remark 5.7 (on $L_2$-distances). For $L_2$-distances the dimension reduction uses the normal distribution instead of the Cauchy distribution. For infinitesimal intervals the corresponding process is Brownian motion, which is much better understood than Cauchy motion. Evaluation of a stochastic integral of a deterministic function $\mathbb{R} \to \mathbb{R}^d$ w.r.t. Brownian motion is a $d$-dimensional gaussian (whose covariance matrix is easy to obtain); for example,
$$\int_0^1 (1, x, \ldots, x^d)\,\mathrm{d}L_{\mathrm{Brown}}(x)$$
is from $N(0,\Sigma)$, where $\Sigma$ is the $(d+1)\times(d+1)$ Hilbert matrix (that is, the $ij$-th entry of $\Sigma$ is $1/(i+j-1)$).

Question 5.8. How efficiently can one sample from the $CI_d(0,1)$ distribution? A reasonable guess seems to be that one can sample from a distribution within $L_1$-distance $\delta$ from $CI_d(0,1)$ using $d^2\ln(1/\delta)$ samples.

Acknowledgement

The authors would like to thank Carl Mueller for advice on stochastic integrals.

References

[BC05] Bo Brinkman and Moses Charikar. On the impossibility of dimension reduction in l1. Journal of the ACM, 52(5):766–788, 2005.

[BFR+00] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing that distributions are close. In FOCS, pages 259–269, 2000.

[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer-Verlag, New York, 2006.

[DL01] Luc Devroye and Gábor Lugosi. Combinatorial Methods in Density Estimation. Springer Series in Statistics. Springer-Verlag, New York, 2001.

[Erd00] Tamás Erdélyi. Markov- and Bernstein-type inequalities for Müntz polynomials and exponential sums in Lp. J. Approx. Theory, 104(1):142–152, 2000.

[ES00] Michael Evans and Tim Swartz. Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford University Press, 2000.

[GR07] Israil S. Gradshteyn and Iosif M. Ryzhik. Table of Integrals, Series, and Products, 7th edition. Academic Press, New York, 2007.

[HBA70] Gerald Haas, Lee Bain, and Charles Antle. Inferences for the Cauchy distribution based on maximum likelihood estimators. Biometrika, 57:403–408, 1970.

[Ind06] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM, 53(3):307–323, 2006.

[JL84] William B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

[JS82] William B. Johnson and Gideon Schechtman. Embedding $l_p^m$ into $l_1^n$. Acta Math., 149(1-2):71–85, 1982.

[LHC07] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Nonlinear estimators and tail bounds for dimension reduction. Journal of Machine Learning Research, 8:2497–2532, 2007.

[Met87] Nicholas Metropolis. The beginning of the Monte Carlo method. Los Alamos Sci., (15, Special Issue):125–130, 1987. Stanislaw Ulam 1909–1984.

[MŠ07] Satyaki Mahalanabis and Daniel Štefankovič. Density estimation in linear time. arXiv.org, http://arxiv.org/abs/0712.2869, December 2007.

[MU49] Nicholas Metropolis and S. Ulam. The Monte Carlo method. J. Amer. Statist. Assoc., 44:335–341, 1949.

[NL04] Assaf Naor and James R. Lee. Embedding the diamond graph in $L_p$ and dimension reduction in $L_1$. Geometric and Functional Analysis, 14(4):745–747, 2004.

[ST94] Gennady Samorodnitsky and Murad S. Taqqu. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Stochastic Modeling. Chapman & Hall, New York, 1994.

6 Appendix

6.1 Stochastic integral of (constant, linear) function

In this section we give an explicit formula for the density function of the random variable
$$(X, Y) = \int_0^1 \varphi(z)\,\mathrm{d}L(z),$$
where $\varphi(z) = (1, z)$, and $\mathrm{d}L(z)$ is the Cauchy motion. We will obtain the density function from the characteristic function. The following result will be used in the inverse Fourier transform. (We use $\Re$ to denote the real part of a complex number.)

Lemma 6.1. Let $\varphi = (\varphi_1, \ldots, \varphi_n) : \mathbb{R} \to \mathbb{R}^n$. Let
$$Z = (X_1, \ldots, X_n) = \int_0^1 \varphi(x)\,\mathrm{d}L(x),$$
where $L$ is the Cauchy motion. The density function $f$ of $Z$ is given by
$$\Re\,\frac{(n-1)!}{(2\pi)^n}\int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty}\frac{2}{(A+iB)^n}\,\mathrm{d}b_1\ldots\mathrm{d}b_{n-1}, \qquad (13)$$
where
$$A = A(b_1, \ldots, b_{n-1}) := \int_0^1 |b_1\varphi_1(x) + \cdots + b_{n-1}\varphi_{n-1}(x) + \varphi_n(x)|\,\mathrm{d}x, \qquad (14)$$
and
$$B = B(b_1, \ldots, b_{n-1}, x_1, \ldots, x_n) := b_1x_1 + \cdots + b_{n-1}x_{n-1} + x_n. \qquad (15)$$

Proof: The characteristic function of $Z$ is (see, e.g., Proposition 3.2.2 of [ST94]):
$$\hat{f}(a_1, \ldots, a_n) = \mathrm{E}[\exp(i(a_1X_1 + \cdots + a_nX_n))] = \exp\Big(-\int_0^1 |a_1\varphi_1(x) + \cdots + a_n\varphi_n(x)|\,\mathrm{d}x\Big).$$
We will use the following integral, valid for any $A > 0$ (see, e.g., [GR07]):
$$\int_0^\infty t^{n-1}\exp(-At)\cos(Bt)\,\mathrm{d}t = \frac{(n-1)!}{2}\left(\frac{1}{(A-iB)^n} + \frac{1}{(A+iB)^n}\right). \qquad (16)$$
We would like to compute the inverse Fourier transform of $\hat{f}$, which, since $\hat{f}$ is symmetric about the origin, is given by
$$f(x_1, \ldots, x_n) = \frac{2}{(2\pi)^n}\int_0^\infty\int_{-\infty}^\infty\!\!\cdots\!\int_{-\infty}^\infty \hat{f}(a_1, \ldots, a_n)\cos(a_1x_1 + \cdots + a_nx_n)\,\mathrm{d}a_1\ldots\mathrm{d}a_{n-1}\,\mathrm{d}a_n. \qquad (17)$$
The substitution $a_n = t$, $a_{n-1} = b_{n-1}t$, $\ldots$, $a_1 = b_1t$ into (17) yields
$$f(x_1, \ldots, x_n) = \frac{2}{(2\pi)^n}\int_{-\infty}^\infty\!\!\cdots\!\int_{-\infty}^\infty\left(\int_0^\infty t^{n-1}\exp\Big(-t\int_0^1 |b_1\varphi_1(x) + \cdots + b_{n-1}\varphi_{n-1}(x) + \varphi_n(x)|\,\mathrm{d}x\Big)\cos\big(t(b_1x_1 + \cdots + b_{n-1}x_{n-1} + x_n)\big)\,\mathrm{d}t\right)\mathrm{d}b_1\ldots\mathrm{d}b_{n-1}.$$
Note that the inner integral has the same form as (16) and hence we have
$$f(x_1, \ldots, x_n) = \frac{(n-1)!}{(2\pi)^n}\int_{-\infty}^\infty\!\!\cdots\!\int_{-\infty}^\infty\left(\frac{1}{(A-iB)^n} + \frac{1}{(A+iB)^n}\right)\mathrm{d}b_1\ldots\mathrm{d}b_{n-1} = \Re\,\frac{(n-1)!}{(2\pi)^n}\int_{-\infty}^\infty\!\!\cdots\!\int_{-\infty}^\infty\frac{2}{(A+iB)^n}\,\mathrm{d}b_1\ldots\mathrm{d}b_{n-1}, \qquad (18)$$
where $A$ and $B$ are given by (14) and (15). The last equality in (18) follows from the fact that the two summands in the integral are conjugate complex numbers. ∎

Now we apply Lemma 6.1 for the case of two functions, one constant and one linear.
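Before specializing the lemma, here is a quick numerical sanity check (ours, not part of the paper) of the ingredient (16), at sample values $n = 2$, $A = 1$, $B = 2$, where the right-hand side evaluates to $\Re[(1-2i)^{-2}] = -3/25$.

```python
import math

# right-hand side of (16): ((n-1)!/2) [ (A-iB)^{-n} + (A+iB)^{-n} ];
# the two summands are conjugates, so their sum is 2 Re[(A-iB)^{-n}]
n, A, B = 2, 1.0, 2.0
rhs = math.factorial(n - 1) * (complex(A, -B) ** (-n)).real

# left-hand side by trapezoidal quadrature; exp(-A t) makes [0, 40] plenty
N, T = 400_000, 40.0
h = T / N
f = lambda t: t ** (n - 1) * math.exp(-A * t) * math.cos(B * t)
lhs = (sum(f(i * h) for i in range(1, N)) + (f(0.0) + f(T)) / 2) * h
```

The quadrature agrees with the closed form to well below the tolerance of the trapezoidal rule at this step size.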
Proof of Theorem 4.2: Plugging $n = 2$, $\varphi_1(x) = 1$, and $\varphi_2(x) = x$ into (14) and (15) we obtain
$$B(b_1, x_1, x_2) = b_1x_1 + x_2 \qquad (19)$$
and
$$A(b_1) = \begin{cases} b_1 + 1/2 & \text{if } b_1 \ge 0,\\ -b_1 - 1/2 & \text{if } b_1 \le -1,\\ b_1^2 + b_1 + 1/2 & \text{otherwise}.\end{cases} \qquad (20)$$
Our goal now is to evaluate the integral (13). We split the integral into 3 parts according to the behavior of $A(b_1)$. We will use the following integral:
$$\int \frac{\mathrm{d}z}{(Sz+T)^2} = -\frac{1}{S(T+Sz)}. \qquad (21)$$
For $B = b_1x_1 + x_2$ and $A = b_1 + 1/2$ we have $A + iB = b_1(1+ix_1) + (1/2+ix_2)$. Using (21) for $A$ and $B$ given by (19) and (20) we obtain
$$\int_0^\infty \frac{\mathrm{d}b_1}{(A+iB)^2} = \frac{2}{(ix_1+1)(2ix_2+1)}, \qquad (22)$$
and
$$\int_{-\infty}^{-1} \frac{\mathrm{d}b_1}{(A+iB)^2} = \frac{2}{(ix_1-1)(2i(x_1-x_2)-1)}. \qquad (23)$$
We have (see, e.g., [GR07])
$$\int \frac{\mathrm{d}z}{(z^2+Sz+T)^2} = \frac{S+2z}{(4T-S^2)(T+Sz+z^2)} + \frac{4\operatorname{atan}\!\big((S+2z)/\sqrt{4T-S^2}\big)}{(4T-S^2)^{3/2}}. \qquad (24)$$
For $A = b_1^2 + b_1 + 1/2$ and $B = b_1x_1 + x_2$ we have $A + iB = b_1^2 + b_1(1+ix_1) + (1/2+ix_2)$. Using (24) we obtain
$$\int_{-1}^0 \frac{\mathrm{d}b_1}{(A+iB)^2} = \frac{2(ix_1+1)}{(2ix_2+1)Q} + \frac{2(ix_1-1)}{(2i(x_1-x_2)-1)Q} + \frac{4\Big(\operatorname{atan}\frac{ix_1+1}{\sqrt{Q}} - \operatorname{atan}\frac{ix_1-1}{\sqrt{Q}}\Big)}{Q^{3/2}}, \qquad (25)$$
where $Q$ is given by (6). Summing (22), (23), and (25) we obtain
$$\int_{-\infty}^\infty \frac{\mathrm{d}b_1}{(A+iB)^2} = \frac{8}{Q(1+x_1^2)} + \frac{4\Big(\operatorname{atan}\frac{ix_1+1}{\sqrt{Q}} - \operatorname{atan}\frac{ix_1-1}{\sqrt{Q}}\Big)}{Q^{3/2}}. \qquad (26)$$
We have
$$\left|\frac{ix_1 \pm 1}{\sqrt{Q}}\right|^4 = \frac{(1+x_1^2)^2}{(1+x_1^2)^2 + (2x_1-4x_2)^2} \le 1,$$
with equality only if $x_1 = 2x_2$. Hence if $x_1 \ne 2x_2$ then using (42) we have
$$\operatorname{atan}\frac{ix_1+1}{\sqrt{Q}} - \operatorname{atan}\frac{ix_1-1}{\sqrt{Q}} = \operatorname{atan}\big(i\sqrt{Q}/(x_1-2x_2)\big),$$
and by applying $\Re$, using
$$\Re\,\frac{8}{Q(1+x_1^2)} = \frac{8}{1+6x_1^2+x_1^4-16x_1x_2+16x_2^2},$$
in (26) we obtain (5). If $x_1 = 2x_2$ then $Q = 1+x_1^2$ and using
$$\operatorname{atan}\frac{ix_1+1}{\sqrt{Q}} - \operatorname{atan}\frac{ix_1-1}{\sqrt{Q}} = \pi/2$$
in (26) we obtain (7). ∎
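As a numerical cross-check (ours) of the computation: at $(x_1,x_2) = (0,0)$ we have $B = 0$, so (13) reduces to a one-dimensional integral of $2/A(b_1)^2$ with $A$ from (20), and the resulting value must match formula (7), i.e., $4/\pi^2 + 1/\pi$.

```python
import math

def A(b):
    # A(b1) from (20): the integral of |b1 + x| over x in [0, 1]
    if b >= 0:
        return b + 0.5
    if b <= -1:
        return -b - 0.5
    return b * b + b + 0.5

# trapezoidal quadrature of 1/A(b)^2 over [-L, L], plus the exact tails
L, N = 200.0, 400_000
h = 2 * L / N
g = lambda b: 1.0 / A(b) ** 2
integral = (sum(g(-L + i * h) for i in range(1, N)) + (g(-L) + g(L)) / 2) * h
integral += 1.0 / (L + 0.5) + 1.0 / (L - 0.5)   # int_L^inf (b+1/2)^-2 db, etc.

f00_numeric = 2 * integral / (2 * math.pi) ** 2   # (13) with n = 2, B = 0
f00_closed = 4 / math.pi ** 2 + 1 / math.pi       # formula (7) at x1 = 0
```

The one-dimensional integral evaluates to $8 + 2\pi$ in closed form, and the two values agree to the accuracy of the quadrature.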
13 6.2 Bounding the CI 1 (0 , 1) -distribution No w w e pr o v e that the m ultiv ariate stud en t distribu tio n giv es an efficien t env elop e for the CI 1 (0 , 1)-distribution. Pro of of Lemma 4.3: T o simplify the formulas we u se the follo wing substitutions: x 1 = u and x 2 = w + u/ 2. The density g b ecomes g ′ ( u, v ) := 1 π 1 + u 2 + 4 w 2 − 3 / 2 . F or w = 0 (whic h corresp ond s to x 1 = 2 x 2 ) the den s it y f b ecomes 4 /π 2 (1 + u 2 ) 2 + 1 π (1 + u 2 ) 3 / 2 , (27) and hence Lemma 4.3 is tr ue, as (27) ≤ (4 /π + 1) 1 π 1 + u 2 − 3 / 2 . F or w 6 = 0, d ensit y (5) b ecomes f ′ ( u, v ) := 1 π 2 4 (1 + u 2 ) 2 + (4 w ) 2 + atan( iM / (2 w )) M 3 − atan( iM ′ / (2 w )) M ′ 3 , where M = (1 + u 2 − 4 iw ) 1 / 2 and M ′ = (1 + u 2 + 4 iw ) 1 / 2 . W e are going to sh o w π 2 f ′ ( u, v ) ≤ C π g ′ ( u, v ) . (28) Note that b oth sides of (28) are unchanged when w e flip the sign of u or the sign of w . Hence we can, without loss of generalit y , assume u ≥ 0 and w > 0. There are unique a > 0 and b > 0 suc h that w = ab/ 2 and u = √ a 2 − b 2 − 1 (to see this notice that su b stituting b = 2 w /a into th e second equation yields u 2 + 1 = a 2 − 4 w 2 /a 2 , where the right -h and side is a strictly increasing fu nction going fr om −∞ to ∞ ). Not e that M = a − ib and M ′ = a + ib . Also note that a 2 ≥ b 2 + 1 . (29) After the su bstitution equation (28) simplifi es as follo ws 4 ( a 2 + b 2 ) 2 + 1 ( a 2 + b 2 ) 3 ( a + ib ) 3 atan 1 a + i b +( a − ib ) 3 atan 1 a − i b ! ≤ C ( a 2 − b 2 + a 2 b 2 ) 3 / 2 . (30) No w w e expand ( a + ib ) 3 and ( a − ib ) 3 and simplify (30) into 4 ( a 2 + b 2 ) 2 + 1 ( a 2 + b 2 ) 3 ( a 3 − 3 ab 2 ) atan 1 a + i b + atan 1 a − i b − i ( b 3 − 3 a 2 b ) atan 1 a + i b − atan 1 a − i b ! ≤ C ( a 2 − b 2 + a 2 b 2 ) 3 / 2 . 
(31)

Now we substitute a = 1/A and b = 1/B into (31) and obtain

    4A^4 B^4/(A^2 + B^2)^2 + (A^3 B^3/(A^2 + B^2)^3)·((B^3 − 3A^2 B)(atan(A + iB) + atan(A − iB))
                                                     − i(A^3 − 3AB^2)(atan(A + iB) − atan(A − iB))) ≤ C·A^3 B^3/(B^2 − A^2 + 1)^{3/2}.    (32)

Note that A > 0 and B > 0, and the constraint (29) becomes

    B^2 ≥ A^2 (1 + B^2).    (33)

Multiplying both sides of (32) by (A^2 + B^2)^3/(AB)^3 we obtain

    4AB(A^2 + B^2) + (B^3 − 3A^2 B)(atan(A + iB) + atan(A − iB)) − i(A^3 − 3AB^2)(atan(A + iB) − atan(A − iB)) ≤ C·(A^2 + B^2)^3/(B^2 − A^2 + 1)^{3/2}.    (34)

Finally, we substitute A = T sin α and B = T cos α with T ≥ 0. Note that the constraint (33) becomes

    (T sin α)^2 ≤ cos(2α)/(cos α)^2,    (35)

and hence α is restricted to [0, π/4). Equation (34) then becomes

    2T^4 sin(2α) + T^3 cos(3α)(atan(A + iB) + atan(A − iB)) + iT^3 sin(3α)(atan(A + iB) − atan(A − iB)) ≤ C·T^6/(T^2 cos(2α) + 1)^{3/2}.    (36)

We prove (36) by considering three cases.

CASE: T < 1. We can use (42) to simplify (36) as follows:

    2T sin(2α) + cos(3α) atan(2T sin(α)/(1 − T^2)) − sin(3α) atanh(2T cos(α)/(1 + T^2)) ≤ C·T^3/(T^2 cos(2α) + 1)^{3/2}.    (37)

For z ≥ 0 we have atanh(z) ≥ z ≥ atan(z), and hence to prove (37) it is enough to show

    2T sin(2α)(1 − T^4) + (1 + T^2) cos(3α)(2T sin(α)) − (1 − T^2) sin(3α)(2T cos(α)) ≤ C·T^3 (1 − T^4)/(T^2 cos(2α) + 1)^{3/2},    (38)

which is implied by the following inequality, which holds for T ≤ 8/9:

    −2T^2 sin(2α) + 2 sin(4α) ≤ 2 ≤ C·(2465/6561)/2^{3/2} ≤ C·(1 − T^4)/(T^2 cos(2α) + 1)^{3/2}.    (39)

For 1 > T ≥ 8/9 we directly prove (37):

    2T sin(2α) + cos(3α) atan(2T sin(α)/(1 − T^2)) − sin(3α) atanh(2T cos(α)/(1 + T^2)) ≤ 2 + π/2 ≤ C·(512/729)/2^{3/2} ≤ C·T^3/(T^2 cos(2α) + 1)^{3/2}.

CASE: T > 1.
We can use (43) and (44) to simplify (36) as follows:

    2T sin(2α) + cos(3α)(π + atan(2T sin(α)/(1 − T^2))) − sin(3α) atanh(2T cos(α)/(1 + T^2)) ≤ C·T^3/(T^2 cos(2α) + 1)^{3/2}.    (40)

From (35) we have T sin(α) ≤ 1 and hence 2T sin(2α) ≤ 4. Therefore (40) can be proved as follows:

    2T sin(2α) + cos(3α)(π + atan(2T sin(α)/(1 − T^2))) − sin(3α) atanh(2T cos(α)/(1 + T^2)) ≤ 4 + 3π/2 ≤ C/2^{3/2} ≤ C·T^3/(T^2 cos(2α) + 1)^{3/2}.

CASE: T = 1. Equation (36) simplifies as follows:

    2 sin(2α) + (π/2) cos(3α) − sin(3α) atanh(cos(α)) ≤ C/(cos(2α) + 1)^{3/2}.    (41)

The left-hand side is bounded from above by 2 + π/2, which is less than C/2^{3/2}, which lower-bounds the right-hand side of (41). ∎

6.3 Basic properties of trigonometric functions

In this section we list the basic properties of trigonometric functions that we used. For complex arguments these are multi-valued functions, for which we choose the branch in the standard way. The logarithm of a complex number z = (cos α + i sin α)e^t, where α ∈ (−π, π] and t ∈ ℝ, is iα + t. The inverse tangent of a complex number z ∈ ℂ \ {±i} is the solution of tan(x) = z with ℜ(x) ∈ (−π/2, π/2). In terms of the logarithm we have

    atan(z) := (1/(2i))(ln(1 + iz) − ln(1 − iz)).

The inverse hyperbolic tangent function is defined analogously; for z ∈ ℂ \ {±1} we have

    atanh(z) := (1/2)(ln(1 + z) − ln(1 − z)) = −i atan(iz).

For non-negative real numbers z we have the following inequality:

    atanh(z) ≥ z ≥ atan(z).

The atan function (even as a multi-valued function) satisfies

    tan(atan(x) + atan(y)) = (x + y)/(1 − xy)

for any values of x, y ∈ ℂ \ {±i} with xy ≠ 1. For a^2 + b^2 < 1 the real part of atan(a + bi) is in (−π/4, π/4). Hence

    |x| < 1 ∧ |y| < 1 ⟹ atan(x) + atan(y) = atan((x + y)/(1 − xy)).
(42)

For a ≥ 0 and a^2 + b^2 ≥ 1 the real part of atan(a + bi) is in [π/4, π/2). Hence

    a ≥ 0 ∧ a^2 + b^2 > 1 ⟹ atan(a + bi) + atan(a − bi) = π + atan(2a/(1 − a^2 − b^2)).    (43)

For a ≥ 0 the real part of atan(a + bi) is in [0, π/2). Hence for any a, b with a + ib ≠ ±i we have

    atan(a + bi) − atan(a − bi) = atan(2ib/(1 + a^2 + b^2)).    (44)
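As an informal aside (not part of the paper), identities (42)–(44) can be spot-checked numerically with Python's `cmath.atan`, which uses the same principal branch as above; the helper names and sample points below are our own illustrative choices.

```python
import cmath

def check_42(x, y, tol=1e-9):
    """Identity (42): atan(x) + atan(y) = atan((x+y)/(1-xy)) for |x| < 1, |y| < 1."""
    assert abs(x) < 1 and abs(y) < 1
    lhs = cmath.atan(x) + cmath.atan(y)
    rhs = cmath.atan((x + y) / (1 - x * y))
    return abs(lhs - rhs) < tol

def check_43(a, b, tol=1e-9):
    """Identity (43): for a >= 0 and a^2 + b^2 > 1,
    atan(a+bi) + atan(a-bi) = pi + atan(2a/(1 - a^2 - b^2))."""
    assert a >= 0 and a * a + b * b > 1
    lhs = cmath.atan(complex(a, b)) + cmath.atan(complex(a, -b))
    rhs = cmath.pi + cmath.atan(2 * a / (1 - a * a - b * b))
    return abs(lhs - rhs) < tol

def check_44(a, b, tol=1e-9):
    """Identity (44): for a >= 0 (and a + ib != +-i),
    atan(a+bi) - atan(a-bi) = atan(2bi/(1 + a^2 + b^2))."""
    lhs = cmath.atan(complex(a, b)) - cmath.atan(complex(a, -b))
    rhs = cmath.atan(2j * b / (1 + a * a + b * b))
    return abs(lhs - rhs) < tol
```

Since atan(a − bi) is the complex conjugate of atan(a + bi) away from the branch cuts, the sum in (43) is real and the difference in (44) is purely imaginary, which the checks above confirm at the chosen points.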