VC dimension of ellipsoids
Yohji Akama^a,* , Kei Irie^b

^a Mathematical Institute, Tohoku University, Aoba-ku, Sendai, Miyagi 980-8578, Japan. Tel: +81-(0)22-795-7708. Fax: +81-(0)22-795-6400.
^b Department of Mathematics, Kyoto University, Kyoto 606, Japan.
* Corresponding author. Email addresses: akama@m.tohoku.ac.jp (Yohji Akama), iriek@math.kyoto-u.ac.jp (Kei Irie).

Abstract

We establish that the VC dimension of the class of d-dimensional ellipsoids is (d² + 3d)/2, and that maximum likelihood estimation with N-component d-dimensional Gaussian mixture models induces a geometric class having VC dimension at least N(d² + 3d)/2.

Keywords: VC dimension; finite dimensional ellipsoid; Gaussian mixture model

1. Introduction

For sets X ⊆ R^d and Y ⊆ X, we say that a set B ⊆ R^d cuts Y out of X if Y = X ∩ B. A class C of subsets of R^d is said to shatter a set X ⊆ R^d if every Y ⊆ X is cut out of X by some B ∈ C. The VC dimension of C, denoted by VCdim(C), is defined to be the maximum n (or ∞ if no such maximum exists) for which some subset of R^d of cardinality n is shattered by C. The VC dimension measures the complexity of a class; it is employed in empirical process theory [4], statistical and computational learning theory [8, 3], and discrete geometry [6]. Although asymptotic estimates of VC dimensions are available for many classes, exact values are known for only a few (e.g. the class of Euclidean balls [10] and the class of halfspaces [6]). In Section 2, we prove:

Theorem 1. The class of d-dimensional ellipsoids has VC dimension (d² + 3d)/2.

Here, by a d-dimensional ellipsoid we mean an open set {x ∈ R^d ; ᵗ(x − μ) A (x − μ) < 1}, where μ ∈ R^d and A ∈ R^{d×d} is positive definite.

In Section 3, we use a part of Theorem 1 (Lemma 3) to study statistical models. In statistics and statistical learning theory, the class of d-dimensional ellipsoids is induced from the class G_d of d-dimensional Gaussian distributions: a d-dimensional Gaussian distribution with mean μ ∈ R^d and covariance matrix Σ ∈ R^{d×d} is, by definition, the probability density function

  (2π)^{−d/2} (det Σ)^{−1/2} exp(−ᵗ(x − μ) Σ^{−1} (x − μ)/2)   (x ∈ R^d),

where a covariance matrix of size d is, by definition, a real, positive definite matrix. As in statistical learning theory [8], for a class P of probability density functions we consider the class D(P) of sets {x ∈ R^d ; f(x) > s} such that f is any probability density function in P and s is any positive real number. Then D(G_d) is the class of d-dimensional ellipsoids.

For a positive integer N, an N-component d-dimensional Gaussian mixture model [7] ((N, d)-GMM) is, by definition, any probability distribution belonging to the convex hull of some N d-dimensional Gaussian distributions. Suppose we are given a sample from a population (N, d)-GMM but the number N of components is unknown. Selecting N from the sample is an instance of Akaike's model selection problem [1] (see [5] for a recent approach). The authors of [9] proposed to choose N by the structural risk minimization principle [8], where an important role is played by the VC dimension of the class D((G_d)^N), with (G_d)^N being the class of (N, d)-GMMs.
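Remark (numerical check; not in the original). The claim that D(G_d) is the class of d-dimensional ellipsoids can be verified mechanically: f(x) > s unfolds to ᵗ(x − μ) Σ^{−1} (x − μ) < c_s with c_s = −2 log s − d log 2π − log det Σ, which describes an ellipsoid whenever s lies below the maximum (2π)^{−d/2}(det Σ)^{−1/2} of the density. The following Python sketch checks this identity on random points; all parameter choices are ours and arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
mu = rng.normal(size=d)
G = rng.normal(size=(d, d))
Sigma = G @ G.T + np.eye(d)            # a covariance matrix: real, positive definite

def density(x):
    """d-dimensional Gaussian density with mean mu and covariance Sigma."""
    q = (x - mu) @ np.linalg.solve(Sigma, x - mu)
    return np.exp(-q / 2) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

s = density(mu) / 2                    # any level below the density's maximum works
c = -2 * np.log(s) - d * np.log(2 * np.pi) - np.log(np.linalg.det(Sigma))
assert c > 0                           # here c = 2 log 2, by the choice of s
A = np.linalg.inv(Sigma) / c           # {f > s} = {x : (x - mu)^T A (x - mu) < 1}
for _ in range(1000):
    x = mu + 3 * rng.normal(size=d)
    assert (density(x) > s) == ((x - mu) @ A @ (x - mu) < 1)
```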
Our result is that the VC dimension of D((G_d)^N) is greater than or equal to N(d² + 3d)/2.

2. VC dimension of ellipsoids

We will prove Theorem 1. For a positive integer B, a vector a ∈ R^B \ {0}, and c ∈ R, we write ℓ_{a,c}(x) := ᵗa x + c (x ∈ R^B) for the associated affine function and H_{a,c} := {x ∈ R^B ; ℓ_{a,c}(x) < 0} for the associated open halfspace. We say a set W ⊆ R^B spans an affine subspace H ⊆ R^B if H is the smallest affine subspace that contains W. The cardinality of a set S is denoted by |S|. For a vector a = ᵗ(a₁, …, a_B) ∈ R^B, let ‖a‖_∞ be max{|a_i| ; 1 ≤ i ≤ B}.

Lemma 2. For any a ∈ R^B \ {0} and any S ⊂ R^B with |S| = B, if S spans a hyperplane {x ∈ R^B ; ℓ_{a,−1}(x) = 0}, then S is shattered by the class {H_{b,−1} ; b ∈ R^B \ {0}, ‖b − a‖_∞ < ε} for any ε > 0.

Proof. By an affine transformation we can assume without loss of generality that all the components of the vector a are 1 and that S is the canonical basis {e₁, …, e_B} of R^B. Suppose ‖b − a‖_∞ < ε. Since b ≠ 0, we have H_{b,−1} ≠ R^B. The vector e_i belongs to the open halfspace H_{b,−1} if and only if the i-th component of b is less than 1. Hence, for each Y ⊆ S, the vector b with b_i := 1 − η for e_i ∈ Y and b_i := 1 + η otherwise, where η := min{ε, 1}/2, satisfies ‖b − a‖_∞ < ε, and H_{b,−1} cuts Y out of S.

Lemma 3. The class of d-dimensional ellipsoids has VC dimension greater than or equal to (d² + 3d)/2.

Proof. Let B be the right-hand side. Let φ be the map S^{d−1} → R^B which maps x = ᵗ(x₁, …, x_d) to ᵗ(x₁², …, x_d², x₁x₂, …, x_{d−1}x_d, x₁, …, x_d). Let ᵗ(ξ₁, …, ξ_B) be coordinates on R^B. Then the image φ(S^{d−1}) spans the hyperplane ξ₁ + ⋯ + ξ_d − 1 = 0. So there is some set S ⊂ S^{d−1} such that |S| = B and φ(S) spans the hyperplane. Let a ∈ R^B be the vector whose first d components are 1 and whose other components are 0. By Lemma 2, for any ε > 0 the family {H_{b,−1} ; b ∈ R^B \ {0}, ‖b − a‖_∞ < ε} shatters φ(S). By the definition of φ, the class of sets defined by the quadratic inequalities

  b₁x₁² + ⋯ + b_d x_d² + b_{d+1} x₁x₂ + ⋯ + b_{B−d} x_{d−1}x_d + b_{B−d+1} x₁ + ⋯ + b_B x_d − 1 < 0   (‖b − a‖_∞ < ε)

shatters S. But, when ε is sufficiently small, all of these sets are ellipsoids: for such b, the quadratic part is a small perturbation of x₁² + ⋯ + x_d² and hence positive definite, and each set contains the origin, so it is a nonempty open ellipsoid.
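Remark (numerical illustration; not in the original). For d = 2 we have B = 5, and the lift φ of Lemma 3 together with the perturbation argument of Lemma 2 can be exercised directly. The Python sketch below lifts B points of the unit circle, checks that their images span the hyperplane ξ₁ + ξ₂ − 1 = 0, and cuts every one of the 2^B subsets out with a halfspace H_{b,−1} whose normal b stays in a small sup-norm ball around a = ᵗ(1, 1, 0, 0, 0). The sample angles are an assumption of ours (any generic choice making the lifted matrix nonsingular works); for ‖b − a‖_∞ < 0.1 the quadratic part b₁x₁² + b₂x₂² + b₃x₁x₂ is positive definite, so each set cut out is indeed an ellipse.

```python
import numpy as np
from itertools import product

d, B = 2, 5                            # B = (d**2 + 3*d) // 2

def phi(x):
    """The lift of Lemma 3 for d = 2: squares, cross term, linear terms."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, x1 * x2, x1, x2])

angles = 0.3 + 2 * np.pi * np.arange(B) / B        # generic points on S^1
S = [np.array([np.cos(t), np.sin(t)]) for t in angles]
M = np.array([phi(x) for x in S])      # each row satisfies xi_1 + xi_2 = 1
assert np.linalg.matrix_rank(M) == B   # phi(S) spans the hyperplane

a = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
eta = 0.05 / np.linalg.norm(np.linalg.inv(M), np.inf)  # keeps ||b - a||_inf < 0.1
for pattern in product([0, 1], repeat=B):
    # prescribe  b . phi(x_i) < 1  exactly for the points x_i in the subset Y
    v = np.array([1 - eta if in_Y else 1 + eta for in_Y in pattern])
    b = np.linalg.solve(M, v)          # b -> a as eta -> 0, since M a = (1, ..., 1)
    assert np.max(np.abs(b - a)) < 0.1
    cut = M @ b < 1                    # which lifted points H_{b,-1} contains
    assert all(bool(c) == bool(y) for c, y in zip(cut, pattern))
```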
We verify the converse inequality.

Lemma 4. VCdim({H_{a,c} ; a = ᵗ(a₁, …, a_B) ∈ R^B, a_B > 0, c ∈ R}) ≤ B for any positive integer B.

Below, the convex hull of a set A is denoted by conv(A).

Proof. Let C be {H_{a,c} ; a = ᵗ(a₁, …, a_B) ∈ R^B, a_B > 0, c ∈ R}. Assume VCdim(C) > B. Then C shatters some set S ⊂ R^B such that |S| = B + 1. If there are x = (u, x_B), y = (u, y_B) ∈ S such that x_B < y_B, then for any a ∈ R^B with positive last component and for any c ∈ R we have ℓ_{a,c}(x) < ℓ_{a,c}(y), and thus x ∈ H_{a,c} = {x ∈ R^B ; ℓ_{a,c}(x) < 0} whenever y ∈ H_{a,c}. This contradicts the assumption that C shatters S. Therefore, for the canonical projection π : R^B → R^{B−1}, (x, z) ↦ x, we have |π(S)| = B + 1. By applying Radon's theorem [6] (any set of (B − 1) + 2 points in R^{B−1} can be partitioned into two disjoint sets whose convex hulls intersect) to the set π(S) ⊂ R^{B−1}, there is a partition (T₁, T₂) of S such that we can take y from conv(π(T₁)) ∩ conv(π(T₂)). Then we see that there are z, z′ ∈ R such that (y, z) ∈ conv(T₁) and (y, z′) ∈ conv(T₂). Because C shatters S, there are some a ∈ R^B and some c ∈ R such that the last component a_B of a is positive and the halfspace H_{a,c} ∈ C cuts T₁ out of S. Thus we have ℓ_{a,c}(x) < 0 for all x ∈ conv(T₁), while ℓ_{a,c}(x) ≥ 0 for all x ∈ conv(T₂), where T₂ = S \ T₁. Therefore ℓ_{a,c}(y, z) < ℓ_{a,c}(y, z′); since a_B > 0, we have z′ > z. On the other hand, some member H_{a′,c′} ∈ C cuts T₂ out of S. By similar reasoning, we have z > z′, which is a contradiction.

Corollary 5. If A ⊂ R^B \ {0} and VCdim({H_{a,c}}_{a ∈ A, c ∈ R}) > B, then 0 ∈ conv(A).

Proof. Suppose 0 ∉ conv(A). Then for every finite subset A′ of A we have 0 ∉ conv(A′), and there is a hyperplane J through 0 such that conv(A′) is contained in one of the two open halfspaces determined by J. So there is a new rectangular coordinate system with the same origin as the older one such that one of the new coordinate axes is normal to J and every a ∈ A′ is represented as (a₁, …, a_B) with a_B > 0. So VCdim({H_{a,c}}_{a ∈ A′, c ∈ R}) ≤ B by Lemma 4, and thus VCdim({H_{a,c}}_{a ∈ A, c ∈ R}) ≤ B.

The proof of Theorem 1 is as follows. By Lemma 3, we only have to establish that the class of d-dimensional ellipsoids has VC dimension less than or equal to B := (d² + 3d)/2. Assume otherwise. For a = ᵗ(a₁, …, a_B) ∈ R^B and x = ᵗ(x₁, …, x_d), define a quadratic form q_a(x) and a quadratic polynomial p_a(x) by

  q_a(x) := a₁x₁² + ⋯ + a_d x_d² + a_{d+1} x₁x₂ + ⋯ + a_{B−d} x_{d−1}x_d,
  p_a(x) := q_a(x) + a_{B−d+1} x₁ + ⋯ + a_B x_d.

Let A be the set of a ∈ R^B such that q_a is positive definite. Obviously, A is convex and 0 ∉ A. Our assumption implies VCdim({H_{a,c}}_{a ∈ A, c ∈ R}) > B: for any ellipsoid E there exist a ∈ A and c ∈ R such that E = {x ∈ R^d ; p_a(x) < −c}, so if a set of B + 1 points in R^d is shattered by ellipsoids, then its image under the injective lift x ↦ ᵗ(x₁², …, x_{d−1}x_d, x₁, …, x_d) is shattered by {H_{a,c}}_{a ∈ A, c ∈ R}. Hence Corollary 5 shows that 0 ∈ conv(A) = A, which is a contradiction.
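Remark (coefficient bookkeeping; not in the original). The correspondence E = {x ∈ R^d ; p_a(x) < −c} used above can be made concrete. The Python sketch below (the helper names are ours) encodes an ellipsoid {x ; ᵗ(x − μ) A (x − μ) < 1} into the coefficient vector a ∈ R^B, laid out as in the definition of q_a and p_a, together with the constant c = ᵗμ A μ − 1, and verifies the set equality on random points.

```python
import numpy as np

def ellipsoid_coefficients(mu, A):
    """Encode {x : (x-mu)^T A (x-mu) < 1} as {x : p_a(x) < -c}.

    Layout of a in R^B, B = (d^2 + 3d)/2: squares x_i^2,
    cross terms x_i x_j (i < j), then linear terms x_i."""
    d = len(mu)
    squares = np.diag(A)
    cross = np.array([A[i, j] + A[j, i] for i in range(d) for j in range(i + 1, d)])
    linear = -2.0 * (A @ mu)
    a = np.concatenate([squares, cross, linear])
    c = float(mu @ A @ mu) - 1.0
    return a, c

def p_a(a, x):
    """The quadratic polynomial p_a of the proof of Theorem 1."""
    d = len(x)
    cross_terms = np.array([x[i] * x[j] for i in range(d) for j in range(i + 1, d)])
    return a[:d] @ (x ** 2) + a[d:d + len(cross_terms)] @ cross_terms + a[-d:] @ x

rng = np.random.default_rng(0)
d = 3
G = rng.normal(size=(d, d))
A = G @ G.T + np.eye(d)                 # positive definite, so a genuine ellipsoid
mu = rng.normal(size=d)
a, c = ellipsoid_coefficients(mu, A)
for _ in range(1000):
    x = mu + 2.0 * rng.normal(size=d)
    assert ((x - mu) @ A @ (x - mu) < 1) == (p_a(a, x) < -c)
```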
3. A lower bound on the VC dimension of GMMs

For a positive integer N and a class P of probability density functions, let (P)^N be the class of probability density functions p₁f₁ + ⋯ + p_N f_N such that f₁, …, f_N ∈ P, p_i ≥ 0, and p₁ + ⋯ + p_N = 1. For X ⊂ R^d and t ∈ R^d, put X + t := {x + t ; x ∈ X}. The Euclidean norm of a vector x is denoted by ‖x‖. Let diam X = sup{‖x − x′‖ ; x, x′ ∈ X}.

Lemma 6. If a class P of probability density functions on R^d satisfies
1. for all f(x) ∈ P and t ∈ R^d we have f(x + t) ∈ P; and
2. for each f ∈ P and any ε > 0 there exists a > 0 such that f(x) < ε whenever ‖x‖ > a,
then VCdim(D((P)^N)) ≥ N × VCdim(D(P)).

Proof. Suppose X ⊂ R^d is shattered by D(P). Then for each Y ⊆ X there exist g_Y ∈ P and r_Y ∈ R such that

  Y = X ∩ D_Y,  D_Y = {x ∈ R^d ; g_Y(x) > e^{−r_Y}}.  (1)

When there is z ∈ X \ Y such that −log g_Y(z) is equal to r_Y, we take a smaller r_Y > max{−log g_Y(x) ; x ∈ Y} with condition (1) kept. Then

  q := min{−r_Y − log g_Y(z) ; z ∈ X \ Y, Y ⊊ X}  (2)

is well-defined and positive. Let δ > 0 be smaller than q and than every r_Y + log g_Y(x) with x ∈ Y ⊆ X. By assumptions 1 and 2, for any sufficiently small ε > 0 we can choose t₁, …, t_N ∈ R^d with ‖t_i − t_j‖ > diam X (i ≠ j) such that (i) U := ⋃_{i=1}^N (X + t_i) has cardinality N|X|, and (ii) for any Y₁ ⊆ X, …, Y_N ⊆ X, any j ∈ {1, …, N}, and any x ∈ X + t_j, putting p_{Y_i} := exp(r_{Y_i}) / Σ_{k=1}^N exp(r_{Y_k}) (1 ≤ i ≤ N),

  Σ_{1 ≤ i ≤ N, i ≠ j} p_{Y_i} g_{Y_i}(x − t_i) < ε < p_{Y_j} g_{Y_j}(x − t_j).

The sum of the leftmost and the rightmost terms is f(x), where f := Σ_{i=1}^N p_{Y_i} g_{Y_i}(· − t_i) is a member of (P)^N, and f satisfies

  0 < log f(x + t_j) − log(p_{Y_j} g_{Y_j}(x)) < ε / (p_{Y_j} g_{Y_j}(x))   (x ∈ X),

since log(1 + u) ≤ u for any u > 0. Because U is the disjoint union of the X + t_i over 1 ≤ i ≤ N, every subset V of U has a unique sequence (Y_i)_{i=1}^N of subsets of X such that V = ⋃_{i=1}^N (Y_i + t_i). So we can define r_V := log Σ_{i=1}^N exp(r_{Y_i}), and then log p_{Y_i} = r_{Y_i} − r_V. Hence, taking ε small enough, there exist t₁, …, t_N ∈ R^d such that ‖t_i − t_j‖ > diam(X) (i ≠ j) and, for any x ∈ X, Y₁ ⊆ X, …, Y_N ⊆ X, and j ∈ {1, …, N},

  0 < (r_V + log f(x + t_j)) − (r_{Y_j} + log g_{Y_j}(x)) < δ.  (3)

Define C_V := {x ∈ R^d ; f(x) > exp(−r_V)}, and suppose x ∈ X and j ∈ {1, …, N}. Assume x ∈ Y_j. By (1), 0 < r_{Y_j} + log g_{Y_j}(x). By (3), 0 < r_{Y_j} + log g_{Y_j}(x) < r_V + log f(x + t_j). Therefore x + t_j ∈ C_V. On the other hand, assume x ∈ X \ Y_j. By (3) and (2), r_V + log f(x + t_j) < r_{Y_j} + log g_{Y_j}(x) + δ < r_{Y_j} + log g_{Y_j}(x) + q ≤ 0. Thus x + t_j ∉ C_V. To sum up, for any x ∈ X and any j ∈ {1, …, N}, we have x ∈ Y_j if and only if x + t_j ∈ C_V, and hence U ∩ C_V = V. Thus U is shattered by D((P)^N).

By Lemma 3 and Lemma 6, we have:

Corollary 7. The VC dimension associated with (N, d)-GMMs is greater than or equal to N(d² + 3d)/2. In other words, for the class (G_d)^N of (N, d)-GMMs, the class D((G_d)^N) has VC dimension greater than or equal to N(d² + 3d)/2.
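Remark (numerical illustration; not in the original). The translation trick in the proof of Lemma 6 can be exercised for N = 2 and d = 1, where D(G₁) shatters a two-point set (one-dimensional ellipsoids are bounded open intervals). In the Python sketch below, the means, standard deviations, levels, and the translation distance are our own illustrative choices; for each of the 4² pairs (Y₁, Y₂), the script builds the mixture with weights p_{Y_i} ∝ exp(r_{Y_i}) and threshold exp(−r_V) exactly as in the proof, and checks that U ∩ C_V = V for all 16 subsets V of the four-point set U.

```python
import numpy as np
from itertools import product

def gauss(x, m, s):
    """1-d Gaussian density with mean m and standard deviation s."""
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

X = [0.0, 1.0]                          # a set shattered by D(G_1)
# For each Y ⊆ X: a Gaussian g_Y and a level exp(-r_Y) with
# Y = X ∩ {g_Y > exp(-r_Y)}, as in condition (1).  Parameters are ours.
params = {                              # Y (frozenset of indices into X) -> (m, s, r_Y)
    frozenset():       (0.5, 0.05, -np.log(0.1)),
    frozenset({0}):    (0.0, 0.3,  -np.log(0.1)),
    frozenset({1}):    (1.0, 0.3,  -np.log(0.1)),
    frozenset({0, 1}): (0.5, 1.0,  -np.log(0.2)),
}
for Y, (m, s, r) in params.items():     # sanity check of condition (1)
    assert {i for i in range(2) if gauss(X[i], m, s) > np.exp(-r)} == set(Y)

N, t = 2, [0.0, 10.0]                   # translations farther apart than diam X = 1
U = [x + ti for ti in t for x in X]     # four points, shattered by the mixtures

for Ys in product(params, repeat=N):    # (Y_1, Y_2) encodes V = (Y_1+t_1) ∪ (Y_2+t_2)
    r = np.array([params[Y][2] for Y in Ys])
    rV = np.log(np.sum(np.exp(r)))      # r_V := log sum_i exp(r_{Y_i})
    p = np.exp(r - rV)                  # mixture weights p_{Y_i} of the proof

    def f(x):                           # the mixture density, a member of (G_1)^2
        return sum(p[i] * gauss(x - t[i], *params[Ys[i]][:2]) for i in range(N))

    V = {2 * j + i for j in range(N) for i in Ys[j]}
    cut = {k for k in range(len(U)) if f(U[k]) > np.exp(-rV)}
    assert cut == V
print("all", len(params) ** N, "subsets of U are cut out")
```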
4. Conclusion

An asymptotically tight estimate for the class of d-dimensional ellipsoids can be obtained easily by combining a naive linearization argument [6] with an approximation argument of "affine subspaces" (bands [2], more precisely) by ellipsoids. In Section 2, however, we have provided the exact value of the VC dimension by combining a linearization argument [6, 10] with an argument about convex bodies. Our argument seems useful for establishing the VC dimension of the class of bounded sets {x ∈ R^d ; p(x) > 0} such that p is any real polynomial of bounded degree.

Acknowledgements

The first author is partially supported by a Grant-in-Aid for Scientific Research (C) (21540105) of the Ministry of Education, Culture, Sports, Science and Technology (MEXT). The second author is supported by a Grant-in-Aid for JSPS Fellows.

References

[1] H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971), pp. 267–281. Akadémiai Kiadó, Budapest, 1973.
[2] Yohji Akama, Kei Irie, Akitoshi Kawamura, and Yasutaka Uwano. VC dimensions of principal component analysis. Discrete and Computational Geometry, 44:589–598, 2010.
[3] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik–Chervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929–965, 1989.
[4] R. M. Dudley. Uniform Central Limit Theorems, volume 63 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, 1999.
[5] Pascal Massart. Concentration Inequalities and Model Selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003, with a foreword by Jean Picard.
[6] Jiří Matoušek. Lectures on Discrete Geometry, volume 212 of Graduate Texts in Mathematics. Springer-Verlag, New York, 2002.
[7] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons Ltd., Chichester, 1985.
[8] Vladimir N. Vapnik. Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons Inc., New York, 1998.
[9] Li-Wei Wang and Ju-Fu Feng. Learning Gaussian mixture models by structural risk minimization. In Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, pages 18–21. IEEE, August 2005.
[10] R. S. Wenocur and R. M. Dudley. Some special Vapnik–Chervonenkis classes. Discrete Math., 33(3):313–318, 1981.