Lower Bounds on Performance of Metric Tree Indexing Schemes for Exact Similarity Search in High Dimensions

Vladimir Pestov
Department of Mathematics and Statistics, University of Ottawa, 585 King Edward Avenue, Ottawa, Ontario K1N 6N5, Canada. E-mail: vpest283@uottawa.ca

Abstract. Within a mathematically rigorous model, we analyse the curse of dimensionality for deterministic exact similarity search in the context of a popular class of indexing schemes: metric trees. The datasets X are sampled randomly from a domain Ω, equipped with a distance ρ and an underlying probability distribution µ. While performing an asymptotic analysis, we send the intrinsic dimension d of Ω to infinity, and assume that the size of a dataset, n, grows superpolynomially yet subexponentially in d. Exact similarity search refers to finding the nearest neighbour in the dataset X to a query point ω ∈ Ω, where the query points are subject to the same probability distribution µ as datapoints. Let F denote a class of all 1-Lipschitz functions on Ω that can be used as decision functions in constructing a hierarchical metric tree indexing scheme. Suppose the VC dimension of the class of all sets {ω : f(ω) ≥ a}, f ∈ F, a ∈ R, is o(n^{1/4}/log² n). (In view of a 1995 result of Goldberg and Jerrum, even a stronger complexity assumption d^{O(1)} is reasonable.) We deduce an Ω(n^{1/4}) lower bound on the expected average case performance of hierarchical metric-tree based indexing schemes for exact similarity search in (Ω, X). In particular, this bound is superpolynomial in d.

Introduction

Every similarity query in a dataset with n points can be answered in time O(n) through a simple linear scan, and in practice such a scan sometimes outperforms the best known indexing schemes for high-dimensional workloads. This is known as the curse of dimensionality; cf. e.g. Chapter 9 in [36], as well as [4, 44].

Paradoxically, there is no known mathematical proof that the above phenomenon is in the nature of high-dimensional datasets. While the concept of intrinsic dimension of data is open to discussion (see e.g. [12, 32]), even in cases commonly accepted as "high-dimensional" (e.g. uniformly distributed data in the Hamming cube {0,1}^d as d → ∞), the "curse of dimensionality conjecture" for proximity search remains unproven [17]. Diverse results in this direction [5, 3, 8, 37, 1, 30, 28, 43] are still preliminary.

Here we will verify the curse of dimensionality for a particular class of indexing schemes widely used in similarity search and going back to [39]: metric trees. This is the name given to hierarchical partitioning indexing schemes equipped with a 1-Lipschitz (non-expanding) decision function f_C at every inner node C. The value of f_C at the query point q determines which child node to follow. If f_C(q) > ε, where ε > 0 is the range query radius, we can be sure that the solution to the range similarity problem is not in the region C− = {x : f_C(x) ≤ 0}; similarly for f_C(q) < −ε. However, if q lies in the decision margin {−ε ≤ f_C ≤ ε}, no child node can be discarded, and branching occurs.
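To make the branching rule concrete, here is a minimal sketch in Python (our own illustration; the function name and toy values are not from the paper) of the decision taken at a single inner node: the query descends into one child when it clears the margin, and into both children when it falls inside it.

def children_to_visit(f_value, eps):
    """Branching at an inner node C of a metric tree.

    f_value: the value f_C(q) of the node's 1-Lipschitz decision function
             at the query point q.
    eps:     the range query radius.
    Returns the children that must still be visited; "left" stands for
    C- = {f_C <= 0} and "right" for C+ = {f_C >= 0}.
    """
    visit = []
    if f_value < eps:      # cannot certify that the answer avoids C-
        visit.append("left")
    if f_value > -eps:     # cannot certify that the answer avoids C+
        visit.append("right")
    return visit

# A query clearing the margin prunes one child; a query inside the
# margin {-eps <= f_C <= eps} forces branching into both children.
assert children_to_visit(0.5, eps=0.1) == ["right"]
assert children_to_visit(0.05, eps=0.1) == ["left", "right"]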
Choosing a decision function when an indexing scheme is being constructed thus becomes an unsupervised soft margin classification problem.

Fig. 1 Constructing a decision function.

Assuming the domain is high-dimensional, the well-known concentration of measure phenomenon implies that the measure of the margin approaches one as dimension grows. And under the assumption that the combinatorial dimension of the class of all available classifiers (decision functions) grows not too fast (say, polynomially in the dimension of the domain), standard methods of statistical learning imply that randomly sampled data is concentrated on the margin as well, making efficient indexing impossible.

To be more precise, we assume that the domain (Ω, ρ) is a metric space equipped with a probability distribution µ, and that the datapoints are drawn randomly with regard to µ. The intrinsic dimension of Ω is defined in terms of concentration of measure as in [32]. This concept agrees with the usual notion of dimension for such important domains as the Euclidean space R^d with the gaussian measure γ_d, the cube [0,1]^d with the uniform measure, the Euclidean sphere S^n with the Haar (Lebesgue) measure, and the Hamming cube {0,1}^n with the Hamming distance and the counting measure.

Following [17], we require the number of datapoints n to grow with regard to the dimension d superpolynomially, yet subexponentially: n = d^{ω(1)} and d = ω(log n).

It is clear that the computational complexity of the decision functions used in constructing a metric tree is a major factor in a scheme's performance. We take this into account in the form of a combinatorial restriction on the subclass F of all functions on Ω that are allowed to be used as decision functions. Namely, we require a well-known parameter of statistical learning theory, the Vapnik–Chervonenkis dimension [40] of the class of all binary functions of the form θ(f − a), f ∈ F, where θ is the Heaviside function, to be o(n^{1/4}/log² n). This is in particular satisfied if the VC dimension in question is polynomial in d. A very general class of functions satisfying this VC dimension bound is provided by a theorem of Goldberg and Jerrum [14], and apparently the decision functions of all indexing schemes used in practice so far in Euclidean (and Hamming cube) domains fall into this class.

Under the above assumptions, we prove a lower bound Ω(n^{1/4}) on the expected average performance of a metric tree. This bound is in particular superpolynomial in d.

It is probably hard to argue that real data can be simulated by random sampling from a high-dimensional distribution. The present author happily concedes that the implications of the above result for high-dimensional similarity search are only indirect: our work underscores the importance of further developing a relevant theory of intrinsic dimensionality of data [12], which would equate indexability with low dimension.

A shorter conference version of the paper (with a weaker bound d^{ω(1)}) appears in: Proc. 4th Int. Conf. on Similarity Search and Applications (SISAP 2011), Lipari, Italy, ACM, New York, NY, pp. 25–32. The author is thankful to the anonymous referee for a number of useful remarks; in particular, the present lower bound Ω(n^{1/4}) was obtained in response to one of them.
1 General framework for similarity search

We follow the formalism of [16] as adapted for similarity search in [31, 34]. A workload is a triple W = (Ω, X, Q), where Ω is the domain, whose elements can occur both as datapoints and as query points, X ⊆ Ω is a finite subset (dataset, or instance), and Q ⊆ 2^Ω is a family of queries. Answering a query Q ∈ Q means listing all datapoints x ∈ X ∩ Q.

A (dis)similarity measure on Ω is a function of two arguments ρ: Ω × Ω → R, which we assume to be a metric, as in [47]. (Sometimes one needs to consider more general similarity measures, cf. [13, 34].) A range similarity query centred at ω ∈ Ω is a ball of radius ε around the query point:

Q = B_ε(ω) = {x ∈ Ω : ρ(ω, x) < ε}.

Equipped with such balls as queries, the triple W = (Ω, ρ, X) forms a range similarity workload. The k-nearest neighbours (k-NN) query centred at ω ∈ Ω, where k ∈ N, can be reduced to a sequence of range queries. This is discussed in detail in [8], Sect. 5.2.

A workload is inner if X = Ω and outer if |X| ≪ |Ω|. Most workloads of practical interest are outer ones. Cf. [34].

2 Hierarchical tree index structures

An access method is an algorithm that correctly answers every range query. Examples of access methods are given by indexing schemes. In particular, a hierarchical tree-based indexing scheme is a sequence of refining partitions of the domain labelled with a finite rooted tree. (For simplicity, we will assume all trees to be binary: this is not really restrictive.) Cf. Figure 2. Such a scheme takes storage space O(n).

Fig. 2 A refining sequence of partitions of Ω.

To process a range query B_ε(ω), we traverse the tree recursively to the leaf level. Once a leaf B is reached, its contents (datapoints x ∈ X ∩ B) are accessed, and the condition x ∈ B_ε(ω) is verified for each one of them. Of main interest is what happens at each internal node C. Let us identify C with the corresponding element C ⊆ Ω of the partition, and suppose that A and B are child nodes of C, so that C = A ∪ B. A branch descending from B can be pruned provided B_ε(ω) ∩ B = ∅, because then datapoints contained in B are of no further interest. This is the case where one can certify that ω is not contained in the ε-neighbourhood of B:

ω ∉ B_ε = {x ∈ Ω : ρ(x, B) < ε}.

(Cf. Fig. 3, l.h.s.) Similarly, if ω ∉ A_ε, then the subtree descending from A can be pruned. However, if the open ball B_ε(ω) meets both A and B or, equivalently, ω belongs to the intersection of the ε-neighbourhoods of A and B, pruning is impossible and the search branches out. (Cf. Fig. 3, r.h.s.)

Fig. 3 Pruning is possible (l.h.s.), and impossible (r.h.s.).

In order to efficiently certify that B_ε(ω) ∩ B = ∅, one employs the technique of decision functions. A function f: Ω → R is called 1-Lipschitz if

∀x, y ∈ Ω, |f(x) − f(y)| ≤ ρ(x, y).

Assign to every internal node C a 1-Lipschitz function f = f_C so that f_C ↾ B ≤ 0 and f_C ↾ A ≥ 0. It is easily seen that f_C ↾ B_ε < ε, and so the fact that f_C(ω) ≥ ε serves as a certificate for B_ε(ω) ∩ B = ∅, assuring that the subtree descending from B can be pruned. Similarly, if f_C(ω) ≤ −ε, the subtree descending from A can be pruned.

Fig. 4 Graph of a decision function f = f_C.
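The certificate just described can be sanity-checked numerically. Below is a minimal sketch (our own illustration: the point set, the Euclidean metric and the vp-style decision function are toy choices, not the paper's): f is 1-Lipschitz and ≤ 0 on B, so whenever f(ω) ≥ ε the ε-ball around ω must indeed miss B, which the brute-force assertion confirms.

import math, random

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

random.seed(0)
d, n, eps = 10, 2000, 0.25
pts = [[random.random() for _ in range(d)] for _ in range(n)]
p_minus, p_plus = pts[0], pts[1]          # two toy "vantage" points

def f(x):
    # Half the difference of two distance functions is 1-Lipschitz.
    return 0.5 * (dist(p_plus, x) - dist(p_minus, x))

B = [x for x in pts if f(x) <= 0]         # the region with f <= 0

for omega in pts:
    if f(omega) >= eps:                   # the certificate fires ...
        # ... and brute force confirms that B_eps(omega) misses B.
        assert min(dist(omega, b) for b in B) >= eps
print("pruning certificate verified")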
Of course, decision functions should have sufficiently low computational complexity in order for the indexing scheme to be efficient.

A hierarchical indexing structure employing 1-Lipschitz decision functions at every node is known as a metric tree.

3 Metric trees

Here is a formal definition. A metric tree for a metric similarity workload (Ω, ρ, X) consists of

– a finite binary rooted tree T,
– a collection of (possibly partially defined) real-valued 1-Lipschitz functions f_t: B_t → R for every inner node t (decision functions), where B_t ⊆ Ω,
– a collection of bins B_t ⊆ Ω for every leaf node t, containing pointers to the elements of X ∩ B_t,

so that

– B_root(T) = Ω,
– for every internal node t and child nodes t−, t+, one has B_t ⊆ B_{t−} ∪ B_{t+},
– f_t ↾ B_{t−} ≤ 0, f_t ↾ B_{t+} ≥ 0.

When processing a range query B_ε(ω),

– t− is accessed ⟺ f_t(ω) < ε, and
– t+ is accessed ⟺ f_t(ω) > −ε.

Here is the search algorithm in pseudocode.

Algorithm 1
on input (ω, ε) do
  set A_0 = {root(T)}
  for each i = 0, 1, ..., depth(T) − 1 do
    if A_i ≠ ∅ then
      for each t ∈ A_i do
        if t is an internal node then do
          if f_t(ω) < ε then A_{i+1} ← A_{i+1} ∪ {t−}
          if f_t(ω) > −ε then A_{i+1} ← A_{i+1} ∪ {t+}
        else
          for each x ∈ B_t do
            if x ∈ B_ε(ω) then A ← A ∪ {x}
return A ⊓⊔

Under our assumptions on the metric tree, it can be proved (cf. [34], Theorem 3.3) that Algorithm 1 correctly answers every range similarity query for the workload (Ω, ρ, X), and so together with an indexing scheme it forms an access method.

4 Examples of metric tree indexing schemes

Example 1 (vp-tree) The vp-tree [46] uses decision functions of the form

f_t(ω) = (1/2)(ρ(x_{t+}, ω) − ρ(x_{t−}, ω)),

where t± are the two children of t and x_{t±} are the vantage points for the node t.

Example 2 (M-tree) The M-tree [9] employs decision functions

f_t(ω) = ρ(x_t, ω) − sup_{τ ∈ B_t} ρ(x_t, τ),

where B_t is the block corresponding to the node t, x_t is a datapoint chosen for each node t, and the suprema on the r.h.s. are precomputed and stored.

For differing perspectives on metric trees, see [34, 8]. Each of the books [35, 36, 47] is an excellent reference to indexing structures in metric spaces.
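The definition, Algorithm 1 and Example 1 combine naturally into a few dozen lines of code. The following is a minimal sketch, not the production vp-tree of [46]: splits use randomly chosen vantage points, the metric is Euclidean, and the exact partition at each node is a special case of the covering B_t ⊆ B_{t−} ∪ B_{t+} allowed by the definition.

import math, random

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

class Node:
    """A metric-tree node with the decision function of Example 1."""
    def __init__(self, points, leaf_size=8):
        if len(points) <= leaf_size:
            self.bin, self.left, self.right = points, None, None
            return
        self.bin = None
        self.x_minus, self.x_plus = random.sample(points, 2)  # vantage points
        # Both children are nonempty, since f(x_plus) < 0 and f(x_minus) > 0.
        self.left = Node([p for p in points if self.f(p) <= 0], leaf_size)
        self.right = Node([p for p in points if self.f(p) > 0], leaf_size)

    def f(self, x):
        # f_t(x) = (1/2)(rho(x_{t+}, x) - rho(x_{t-}, x)), a 1-Lipschitz function.
        return 0.5 * (dist(self.x_plus, x) - dist(self.x_minus, x))

def range_query(node, omega, eps, out):
    """Algorithm 1, written recursively."""
    if node.bin is not None:                  # a leaf: scan the bin
        out.extend(x for x in node.bin if dist(omega, x) < eps)
        return
    v = node.f(omega)
    if v < eps:                               # t- is accessed iff f_t(omega) < eps
        range_query(node.left, omega, eps, out)
    if v > -eps:                              # t+ is accessed iff f_t(omega) > -eps
        range_query(node.right, omega, eps, out)

random.seed(1)
X = [[random.random() for _ in range(8)] for _ in range(500)]
tree, q, eps = Node(X), [0.5] * 8, 0.4
found = []
range_query(tree, q, eps, found)
assert sorted(found) == sorted(x for x in X if dist(q, x) < eps)  # matches linear scan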
5 Curse of dimensionality

In recent years the research emphasis has shifted away from exact towards approximate similarity search:

– given ε > 0 and ω ∈ Ω, return a point x ∈ X that is [with confidence > 1 − δ] at a distance < (1 + ε) d_NN(ω) from ω.

This has led to many impressive achievements, particularly [20, 18]; see also the survey [17] and Chapter 7 in [41]. At the same time, research in exact similarity search, especially concerning deterministic algorithms, has slowed down. At a theoretical level, the following unproved conjecture helps to keep research efforts in focus.

Conjecture 1 (The curse of dimensionality conjecture, cf. [17]) Let X ⊆ {0,1}^d be a dataset with n points, where the Hamming cube {0,1}^d is equipped with the Hamming (ℓ¹) distance:

d(x, y) = ♯{i : x_i ≠ y_i}.

Suppose d = n^{o(1)}, but d = ω(log n). (That is, the number of points in X has intermediate growth with regard to the dimension d: it is superpolynomial in d, yet subexponential.) Then any data structure for exact nearest neighbour search in X, with d^{O(1)} query time, must use n^{ω(1)} space within the cell probe model of computation.

The best lower bound currently known is Ω(d / log(sd/n)), where s is the number of cells used by the data structure [30]. In particular, this implies the earlier bound Ω(d / log n) for polynomial space data structures [3], as well as the bound Ω(d / log d) for near linear space (namely n log^{O(1)} n). See also [1, 28, 29]. A general reference for the cell probe model of computation is [24], while in the context of similarity search the model is discussed in [33].

6 Concentration of measure

As in [10], we assume the existence of an unknown probability measure µ on Ω, such that both the datapoints X and the query points ω are being sampled with regard to µ. On the one hand, this assumption is open to debate: for instance, it is said that in a typical university library most books (75% or more) are never borrowed a single time, so it is reasonable to assume that the distribution of queries in a large dataset will be skewed equally heavily away from the data distribution. On the other hand, there is no obvious alternative way of making an a priori assumption about the query distribution, and in some situations the assumption makes sense indeed, e.g. in the context of a large biological database where a newly-discovered protein fragment has to be matched against every previously known sequence.

The triple (Ω, ρ, µ) is known as a metric space with measure. This concept opens the way to systematically using the phenomenon of concentration of measure on high-dimensional structures, also known as the "Geometric Law of Large Numbers" [23, 21]. This phenomenon can be informally summarized as follows:

for a typical "high-dimensional" structure Ω, if A is a subset containing at least half of all points, then the measure of the ε-neighbourhood A_ε of A is overwhelmingly close to 1 already for small ε > 0.

Here is a rigorous way for dealing with the phenomenon. Define the concentration function α_Ω of a metric space with measure Ω by

α_Ω(ε) = 1/2 if ε = 0,
α_Ω(ε) = 1 − inf{µ(A_ε) : A ⊆ Ω, µ(A) ≥ 1/2} if ε > 0.

The value of α_Ω(ε) gives an upper bound on the measure of the complement of the ε-neighbourhood A_ε of every subset A of measure ≥ 1/2. For high-dimensional spaces the values of the concentration function often admit gaussian upper bounds of the form

α_Ω(ε) = exp(−Θ(d)ε²),     (1)

where d is a dimension parameter. For instance, the concentration function of the d-dimensional Hamming cube {0,1}^d with the normalized Hamming metric and uniform measure satisfies a Chernoff bound α(ε) ≤ exp(−2ε²d); cf. Fig. 5. Similar bounds hold for Euclidean spheres S^n, cubes I^n, and many other structures of both continuous and discrete mathematics, equipped with suitably normalized distances and canonical probability measures. The concentration phenomenon can now be expressed by saying that for "typical" high-dimensional metric spaces with measure Ω, the concentration function α_Ω(ε) drops off sharply as d → ∞ [23, 21].

Fig. 5 Concentration function of {0,1}^50 vs the Chernoff bound.
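Fig. 5 is easy to reproduce. Here is a minimal sketch (our own illustration) comparing the two curves for d = 50: by Harper's isoperimetric inequality, half-cubes are extremal among subsets of measure ≥ 1/2, so, up to integer rounding, the concentration function of the normalized Hamming cube is an explicit binomial tail.

import math

d = 50

def binom_tail(k):
    """P(S >= k) for S ~ Binomial(d, 1/2)."""
    return sum(math.comb(d, j) for j in range(k, d + 1)) / 2 ** d

print("  eps    alpha(eps)    Chernoff bound")
for i in range(1, 11):
    eps = i / 25.0
    # Points at normalized distance >= eps from the half-cube
    # {x : weight(x) <= d/2} are those of weight >= d/2 + eps*d;
    # Harper's inequality makes the half-cube extremal.
    alpha = binom_tail(d // 2 + math.ceil(eps * d))
    print(f"{eps:5.2f}   {alpha:11.3e}   {math.exp(-2 * eps ** 2 * d):11.3e}")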
If now f: Ω → R is a 1-Lipschitz function, denote by M = M_f the median value of f, that is, a (non-uniquely defined) real number with the property that each of the events [f ≥ M] and [f ≤ M] occurs with probability at least one half. One can prove without much difficulty:

µ{x ∈ Ω : |f(x) − M_f| > ε} < 2α_Ω(ε).     (2)

Thus, every 1-Lipschitz function on a high-dimensional metric space with measure concentrates near one value.

7 Workload assumptions

Here are our standing assumptions for the rest of the article. Let (Ω, ρ, µ) be a domain equipped with a metric ρ and a probability measure µ. We assume that the expected distance between two points of Ω is normalized so as to become asymptotically constant:

E ρ(x, y) = Θ(1).     (3)

We further assume that Ω has "concentration dimension d" in the sense that the concentration function α_Ω is gaussian with exponent Θ(d):

α_Ω(ε) = exp(−Θ(ε²d)).     (4)

(This approach to intrinsic dimension is developed in [32].) A dataset X ⊆ Ω contains n points, where n and d are related as follows:

n = d^{ω(1)},     (5)
d = ω(log n).     (6)

In other words, asymptotically n grows faster than any polynomial function Cd^k, C > 0, k ∈ N, but slower than any exponential function e^{cd}, c > 0. (An example of such a rate of growth is n = 2^{√d}.) For the purposes of asymptotic analysis of search algorithms such assumptions are natural [17].

Datapoints are modelled by a sequence of i.i.d. random variables distributed according to the measure µ:

X_1, X_2, ..., X_n ∼ µ.

The instances of datapoints will be denoted with the corresponding lower case letters x_1, x_2, ..., x_n. Finally, the query centres ω ∈ Ω follow the same distribution µ: ω ∼ µ.

8 Query radius

It is known that in high-dimensional domains the distance to the nearest neighbour approaches the average distance between two points (cf. e.g. [4] for a particular case). This is a consequence of concentration of measure, and the result can be stated and proved in a rather general situation.

Denote by ε_NN(ω) the distance from ω ∈ Ω to the nearest point in X. The function ε_NN is easily verified to be 1-Lipschitz, and so it concentrates near its median value. From here, one deduces:

Lemma 1 Under our assumptions on the domain Ω and a random sample X, with confidence approaching 1 one has for all ε

µ{ω : |ε_NN(ω) − E ρ(x, y)| > ε} < exp(−Θ(ε²d)). ⊓⊔

Remark 1 The result should be understood in the asymptotic sense, as follows. We deal with a family of domains Ω_d, d ∈ N, and the sampling is performed in each of them in an independent fashion, so that "confidence" refers to the probability that the infinite sample path belonging to the infinite product Ω_1^{n_1} × Ω_2^{n_2} × ... × Ω_d^{n_d} × ... satisfies the desired properties. For a proof of Lemma 1, see Appendix A in [33].

This effect is already noticeable in medium dimensions. Let us draw a dataset X with 10,000 points randomly from the Euclidean cube [0,1]^50 with regard to the uniform measure. Then, with respect to the usual Euclidean distance, the median value of the distance to the nearest neighbour is ε_M = 1.9701, while the expected value of the distance between two points of X is E d(x, y) = 2.872. Cf. Fig. 6 for the distribution of values of ε_NN.

Fig. 6 Distances from 2,000 random query points to their nearest neighbours in a dataset of 10,000 random points in the Euclidean cube [0,1]^50. The lower horizontal line marks ε_M = 1.9701, the upper E d(x, y) = 2.872.
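The experiment behind Fig. 6 can be re-run in a few lines. A minimal sketch (our own code; the numbers will vary slightly with the random seed):

import numpy as np

rng = np.random.default_rng(0)
d, n, n_queries = 50, 10_000, 2_000
X = rng.random((n, d))
Q = rng.random((n_queries, d))

X2 = (X ** 2).sum(axis=1)                     # precomputed squared norms
nn = np.empty(n_queries)
for i in range(0, n_queries, 500):            # chunked to bound memory
    C = Q[i:i + 500]
    d2 = (C ** 2).sum(axis=1)[:, None] + X2[None, :] - 2.0 * (C @ X.T)
    nn[i:i + 500] = np.sqrt(np.maximum(d2, 0.0).min(axis=1))

# Estimate E d(x, y) from random pairs (rare i = j collisions are negligible).
pairs = rng.integers(0, n, size=(10_000, 2))
mean_pair = np.linalg.norm(X[pairs[:, 0]] - X[pairs[:, 1]], axis=1).mean()

print(f"median NN distance ~ {np.median(nn):.4f}")   # paper reports eps_M = 1.9701
print(f"mean pair distance ~ {mean_pair:.4f}")       # paper reports E d(x,y) = 2.872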
9 A "naive" O(n) lower bound

As a first approximation to our analysis, we present a heuristic argument, allowing asymptotic lower bounds linear in n on the search performance of a metric tree.

What happens at an internal node C when a metric tree is being traversed? Note that C itself becomes a metric space with measure if equipped with the metric induced from Ω and the probability measure µ_C which is the normalized restriction of the measure µ from Ω: for A ⊆ C,

µ_C(A) = µ(A)/µ(C).

Let α_C denote the concentration function of C. Suppose for the moment that our tree is perfectly balanced: µ_C(A) = µ_C(B) = 1/2. Then the size of the ε-neighbourhood of A is at least 1 − α_C(ε), and the same is true of B_ε. For all query points ω ∈ C except a set of measure ≤ 2α_C(ε), the search algorithm (Algorithm 1) branches out at the node C. (Cf. Fig. 7.)

Fig. 7 The search algorithm branches out for most query points ω at a node C if the value α_C(ε) is small.

Lemma 2 Let C be a subset of a metric space with measure (Ω, ρ, µ). Denote by α_C the concentration function of C with regard to the induced metric ρ ↾ C and the induced probability measure µ/µ(C). Then for all ε > 0

α_C(ε) ≤ α_Ω(ε/2)/µ(C).

Proof Let ε > 0 be arbitrary, and let δ < α_C(ε). Then there are subsets D, E ⊆ C at a distance ≥ ε from each other, satisfying µ(D) ≥ µ(C)/2 and µ(E) ≥ δµ(C); in particular, the measure of either set is at least δµ(C). Since the ε/2-neighbourhoods of D and E in Ω cannot meet, by the triangle inequality, the complement F, taken in Ω, of at least one of these neighbourhoods has the property µ(F) ≥ 1/2, while µ(F_{ε/2}) ≤ 1 − δµ(C), because F_{ε/2} does not meet the corresponding original set, D or E. We conclude: α_Ω(ε/2) ≥ δµ(C), and taking suprema over all δ < α_C(ε),

α_Ω(ε/2) ≥ α_C(ε)µ(C),

that is, α_C(ε) ≤ α_Ω(ε/2)/µ(C), as required. ⊓⊔

Since the size of the indexing scheme is O(n), a typical size of a set C will be of the order Ω(n^{−1}), while α_Ω(ε) will go to zero as o(n^{−1}). Let a workload (Ω, ρ, X) be indexed with a balanced metric tree of depth O(log n), having O(n) bins of roughly equal µ-measure. For at least half of all query points, the distance ε_NN to the nearest neighbour in X is at least as large as ε_M, the median NN distance. Let ω be such a query centre. For every element C of the level-t partition of Ω, one has, using Lemmas 2 and 1 and the assumption in Eq. (4),

α_C(ε_M) ≤ α_Ω(ε_M/2)µ(C)^{−1} = Θ(2^t)e^{−Θ(1)ε_M²d} = e^{−Θ(d)},

where the constants do not depend on a particular internal node C.
An argument as in Section 8 implies that branching at every internal node occurs for all ω except a set of measure

≤ ♯(nodes) × 2 sup_C α_C(ε) = O(n²)e^{−Θ(d)} = o(1),

because d = ω(log n) and so e^{Θ(d)} is superpolynomial in n. Thus, the expected average performance of an indexing scheme as above is linear in n.

There are two problems with this argument. Firstly, it has been observed and confirmed experimentally that unbalanced metric trees can be more efficient than balanced ones [7, 26]. Secondly, and more importantly, we have replaced the value of the empirical measure,

µ_n(C) = |X ∩ C|/n,

with the value of the underlying measure µ(C), implicitly assuming that the two are close to each other:

µ_n(C) ≈ µ(C).

But the scheme is being chosen after seeing an instance X, and it is reasonable to assume that indexing partitions will take advantage of the random clusters always present in i.i.d. data. (Fig. 8 illustrates this point in dimension d = 2.) Some elements of indexing partitions, while having large µ-measure, may contain few datapoints, and vice versa.

Fig. 8 1000 points randomly and uniformly distributed in the square [0,1]².

An equivalent consideration is that we only know the concentration function of the domain Ω, but not of a randomly chosen dataset X. It seems the problem of estimating the concentration function of a random sample has not been systematically treated.

In order to be able to estimate the empirical measure in terms of the underlying distribution, one needs to invoke an approach of statistical learning.

10 Vapnik–Chervonenkis theory

Let C be a family of subsets of a set Ω (a concept class). One says that a subset A ⊆ Ω is shattered by C if for each B ⊆ A there is C ∈ C such that C ∩ A = B. The Vapnik–Chervonenkis dimension VC(C) of a class C is the supremum of the sizes of finite subsets A ⊆ Ω shattered by C.

Here are some examples.

1. The VC dimension of the class of all Euclidean balls in R^d is d + 1.
2. The class of all parallelepipeds in R^d has VC dimension 2d + 2.
3. The VC dimension of the class of all balls in the Hamming cube {0,1}^d is bounded from above by d + ⌊log₂ d⌋.

(As every ball is determined by its centre and radius, the total number of pairwise different balls in {0,1}^d is d2^d. Now one uses an obvious observation: the VC dimension of a finite concept class A is bounded above by log₂|A|.)

Here is a deeper result.

Theorem 2 (Goldberg and Jerrum [14], Theorem 2.3) Let

F = {x ↦ f(θ, x) : θ ∈ R^s}

be a parametrized class of {0,1}-valued functions. Suppose that, for each input x ∈ R^n, there is an algorithm that computes f(θ, x), and this computation takes no more than t operations of the following types:

– the arithmetic operations +, −, × and / on real numbers,
– jumps conditioned on >, ≥, <, ≤, =, and ≠ comparisons of real numbers, and
– output 0 or 1.

Then VC(F) ≤ 4s(t + 2). ⊓⊔
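For very small d, the finite-class bound in example 3 can be checked exhaustively. A minimal sketch (our own illustration): it enumerates all Hamming balls of {0,1}^3, finds the size of the largest shattered subset by brute force, and compares it with d + ⌊log₂ d⌋.

import itertools, math

d = 3
cube = list(itertools.product([0, 1], repeat=d))

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

# All Hamming balls, as sets of points (centre x radius, duplicates collapsed).
balls = {frozenset(p for p in cube if hamming(p, c) <= r)
         for c in cube for r in range(d + 1)}

def shattered(A):
    """Is the tuple of points A shattered by the class of balls?"""
    return all(any(ball.intersection(A) == set(B) for ball in balls)
               for k in range(len(A) + 1) for B in itertools.combinations(A, k))

vc = max(len(A) for k in range(1, 6)
         for A in itertools.combinations(cube, k) if shattered(A))
bound = d + math.floor(math.log2(d))
print(f"largest shattered set in {{0,1}}^{d}: {vc}; upper bound from the text: {bound}")
assert vc <= bound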
Here is a typical result of statistical learning theory, which we quote from [42], Theorem 7.8.

Theorem 3 Let C ⊆ 2^Ω be a concept class of finite VC dimension d. Then for all ε, δ > 0 and every probability measure µ on Ω, if n datapoints in X are drawn randomly and independently according to µ, then with confidence 1 − δ

∀C ∈ C, |µ(C) − |X ∩ C|/n| < ε,

provided

n ≥ max{(8d/ε) lg(8e/ε), (4/ε) lg(2/δ)}.

Let F be a class of (possibly partially defined) real-valued functions on Ω. Define F_≥ as the family of all sets of the form

{ω ∈ dom f : f(ω) ≥ a}, a ∈ R.

The value of VC(F_≥) is bounded above by the Pollard dimension (pseudodimension) of F (cf. [42], 4.1.2), but is in general smaller.

Example 3 (Pivots) If F is the class of all distance functions to points of R^d, then VC(F_≥) = d + 1. (The family F_≥ consists of complements to open balls, and the VC dimension is invariant under passing to complements.) For the Hamming cube, VC(F_≥) ≤ d + ⌊log₂ d⌋.

Example 4 (vp-tree) See Example 1. If Ω = R^d, then F_≥ consists of all half-spaces, and the VC dimension of this family is well known to equal d + 1.

Example 5 (M-tree) See Example 2. The dimension estimates are the same as in Example 3. For both schemes, if Ω = R^d, then VC(F_≥) equals d + 1. A similar conclusion holds for the Hamming cube.
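A one-dimensional numerical illustration of Theorem 3 (our own, not from the paper): for the class of initial segments {x ≤ t} of [0,1], of VC dimension 1, the supremum over the whole infinite class of the deviation between the true and the empirical measure is the classical Kolmogorov–Smirnov statistic, and it visibly decays as n grows.

import random

random.seed(0)

def sup_deviation(n):
    """sup over t of |mu([0,t]) - empirical measure of [0,t]|, n uniform points.

    For C = {[0,t] : t in [0,1]} the supremum is attained at data points,
    giving the one-sample Kolmogorov-Smirnov statistic.
    """
    xs = sorted(random.random() for _ in range(n))
    return max(max(abs((i + 1) / n - x), abs(i / n - x)) for i, x in enumerate(xs))

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}: sup deviation ~ {sup_deviation(n):.4f}")
# The deviation decays roughly like 1/sqrt(n), uniformly over the infinite
# class, in line with the sample-size bound of Theorem 3.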
11 Rigorous lower bounds

In this Section we prove the following theorem under the general assumptions of Section 7.

Theorem 4 Let the domain Ω, equipped with a metric ρ and a probability measure µ, have concentration dimension Θ(d) (cf. Eq. (4)) and expected distance between two points E d(x, y) = 1. Let F be a class of all 1-Lipschitz functions on the domain Ω that can be used as decision functions for metric tree indexing schemes of a given type. Suppose

VC(F_≥) = o(n^{1/4}/log² n).

Let X = {x_1, x_2, ..., x_n} be an instance of an i.i.d. random sample of Ω following the distribution µ, where d = n^{o(1)} and d = ω(log n). Then an optimal metric tree indexing scheme for the similarity workload (Ω, ρ, X) has expected average runtime Ω(n^{1/4}).

The following is a direct application of Lemma 4.2 in [31].

Lemma 3 ("Bin Access Lemma") Let ε > 0 and m ≥ 4 be such that α_Ω(ε) ≤ m^{−1}, and let γ be a collection of subsets A ⊆ Ω of measure µ(A) ≤ m^{−1} each, satisfying µ(∪γ) ≥ 1/2. Then the 2ε-neighbourhood of every point ω ∈ Ω, apart from a set of measure at most (1/2)m^{−1/2}, meets at least (1/2)m^{1/2} elements of γ.

Here is the next step in the proof.

Lemma 4 Let F be a family of real-valued functions satisfying VC(F_≥) ≤ p. Denote by B the class of all subsets B ⊆ Ω appearing as intersections of ≤ h sets of the form [f ≥ a] or [f ≤ a], f ∈ F. Then

VC(B) ≤ 4hp log(2hp).

Proof Use Th. 4.5 in [42]: if A is a concept class of VC dimension ≤ p, then the VC dimension of the class of all sets obtained as intersections of ≤ h sets from A is bounded by 2hp log(hp). ⊓⊔

Proof (of Theorem 4) We can suppose that the expected average depth of a tree traversal is o(n^{1/4}), for otherwise there is nothing to prove. Using Eq. (3) and Lemma 1, pick any ε′ > 0 such that, for sufficiently high values of d, for most points ω (that is, for a set of µ-measure 1 − o(1)) the value of ε_NN(ω) exceeds ε′.

Similarly, we can assume that query points of µ-measure 1 − o(1) have the property that their ε′-neighbourhood only meets bins with fewer than n^{1/4} datapoints. (Otherwise, already scanning the contents of large bins would result in an expected running time Ω(n^{1/4}).) Combining the two assumptions together, we deduce that for a set Ω′ of query centres ω of µ-measure 1 − o(1) the following are true: (1) the ε′-ball around ω only meets bins with fewer than n^{1/4} points, and (2) the depth of the search tree on input ω does not exceed n^{1/4}.

Let b = {t_0, t_1, ..., t_k = t} be a branch of the search tree corresponding to a query point ω ∈ Ω′. Let Ω_b denote the set of all ω ∈ Ω′ for which the branch b has to be followed. Then Ω_b ⊆ B_t, and so Ω_b contains fewer than n^{1/4} datapoints. Also, Ω_b is the intersection of a family of ≤ n^{1/4} sets of the form [f ≥ a] or [f ≤ a], f ∈ F. By Lemma 4 and our assumption on F, the VC dimension of the collection B of all possible sets Ω_b emerging in this fashion is o(n^{1/2}/log n).

Apply Theorem 3 to the concept class B with ε = n^{−1/2}. If n is sufficiently large, then with high confidence the µ-measure of every element of B does not differ from its empirical measure (which is ≤ n^{−3/4}) by more than ε = n^{−1/2}. One concludes: with high confidence, the sets Ω_b have µ-measure ≤ 2n^{−1/2}.

The Bin Access Lemma 3, applied with m = 2n^{1/2} and ε = ε′/2, implies that for all ω ∈ Ω′, apart from a set of measure O(n^{−1/4}), the ε′-neighbourhood of ω meets at least Ω(n^{1/4}) pairwise different sets of the form Ω_b as above. Since µ(Ω′) = 1 − o(1), this implies the need to traverse on average Ω(n^{1/4}) distinct branches of the search tree, establishing the claim. ⊓⊔

Combining our Theorem 4 with Theorem 2 of Goldberg and Jerrum shows that for all practical purposes the expected average performance of metric trees is superpolynomial in the dimension of the domain.

Corollary 1 Let the domain Ω = R^d be equipped with a probability measure µ_d in such a way that the concentration function of (R^d, µ_d) admits a gaussian upper bound and the µ_d-expected value of the Euclidean distance is Θ(1). Let F_d denote a class of functions f(θ, x) on R^d parametrized with θ taking values in a space R^{poly(d)} and such that computing each value f(θ, x) takes d^{O(1)} operations of the type described in Thm. 2. Let X be an i.i.d. random sample of R^d according to µ_d, having n points, where d = n^{o(1)} and d = ω(log n). Then, with confidence asymptotically approaching 1, an optimal metric tree indexing scheme for the similarity workload (Ω, ρ, X) whose decision functions belong to the parametrized class F_d has expected average runtime d^{ω(1)}. ⊓⊔

Several remarks are in order to explain the strength of the above results.

(1) Measures µ_d satisfying the above assumption include, for instance, the gaussian distribution, the uniform measure on the unit ball, on the unit sphere, on the unit cube, etc.

(2) The polynomial upper bound on the size of the parameter θ for F is dictated by the obvious restriction that reading off a parameter of superpolynomial length leads to a superpolynomial lower bound on the length of computation.
(3) In the situations of interest, one can verify that the expected number of datapoints x ∈ X contained in the smallest query ball meeting X is O(1). For continuous measures on R^d such as the gaussian measure or the uniform measure on the cube, this number is obviously 1. For the Hamming cube, the upper limit of this number as d → ∞ is bounded by e ≈ 2.7182.... Thus, the lower bound does not come from the fact that there are simply too many valid near neighbours.

(4) We do not know the answer to the following.

Question. Cost of computing the values of decision functions aside, can a dataset X ⊂ {0,1}^d, n = |X|, d = ω(log n), d = n^{o(1)}, be indexed with a metric tree performing in time poly(d)?

12 Conclusion

In this Section, written in response to the referee's comments, the author will try to outline his understanding of the applicability of the method of proof to other indexing paradigms.

The approach to obtaining lower bounds on the performance of indexing schemes adopted in this paper consists in combining simple concentration of measure considerations with the basic techniques of statistical learning (VC theory). The argument is applicable to situations of the following kind. Let W = (Ω, ρ, X) denote a similarity workload. An indexing scheme for W consists of a family of real-valued 1-Lipschitz functions f_i, i ∈ I, on Ω, which are in general partially defined: dom(f_i) ⊆ Ω. Given a query (ω, ε), where ω ∈ Ω and ε > 0, the algorithm recursively chooses a sequence of indices i_n, based on the previous values f_{i_k}(ω), k < n. At some point, the computation is terminated, and the values f_{i_k}(ω) point to a collection of bins, whose contents are read off.

The role of the functions f_i is to discard those datapoints (or entire bins) which cannot possibly answer the query. Namely, if |f_i(ω) − f_i(x)| ≥ ε, then, since f_i is a 1-Lipschitz function, one has d(ω, x) ≥ ε, and so the point x is irrelevant. All the points (or entire bins) which cannot be discarded are returned and their contents checked against the condition d(x, ω) < ε.

On spaces of high dimension, every 1-Lipschitz function concentrates sharply near its mean (or median) value. If in addition we assume that the class F of all functions used for a particular indexing scheme has low complexity in the sense of VC dimension, we can conclude that the number of points discarded by every function f_i drops off fast as the dimension d of the domain grows, resulting in degrading performance.

So far, we are aware of essentially two different types of such indexing schemes: metric trees (treated in the present paper) and pivot tables [6]. For pivots, the methods of the present paper have subsequently been used to derive an expected average performance lower bound Ω(n/d log n) [43]. It is not clear to the author how to state a more general result from which both estimates would follow, nor whether such a result would be useful in view of the lack of other examples.
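For concreteness, here is a minimal sketch of the pivot-table mechanism just outlined (our own toy code, not the pivot selection algorithm of [6]): the table stores the values of the 1-Lipschitz functions f_i = ρ(p_i, ·) on the dataset, and at query time a point is discarded as soon as some pivot certifies |f_i(ω) − f_i(x)| ≥ ε.

import math, random

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

random.seed(2)
d, n, k, eps = 12, 3000, 8, 0.6
X = [[random.random() for _ in range(d)] for _ in range(n)]
pivots = random.sample(X, k)
# Precomputed table: f_i(x) = dist(pivot_i, x) is 1-Lipschitz for each pivot.
table = [[dist(p, x) for p in pivots] for x in X]

def pivot_range_query(omega, eps):
    f_omega = [dist(p, omega) for p in pivots]
    out, checked = [], 0
    for x, row in zip(X, table):
        # Discard x as soon as one pivot certifies dist(omega, x) >= eps.
        if any(abs(fo - fx) >= eps for fo, fx in zip(f_omega, row)):
            continue
        checked += 1                          # survivor: check the distance directly
        if dist(omega, x) < eps:
            out.append(x)
    return out, checked

q = [random.random() for _ in range(d)]
res, checked = pivot_range_query(q, eps)
assert res == [x for x in X if dist(q, x) < eps]   # agrees with a linear scan
print(f"{checked} of {n} points survived the pivot filter")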
Even if the cell-probe model has some formal similarities with the metric tree scheme (a hierarchical tree structure, a collection of cells as an indexing scheme, computations performed at each node with a limited number of cells accessed, etc.), it is not clear whether the partially defined functions determined by the algorithm at each node will be 1-Lipschitz (they take values in the Hamming cube). The examples of implemented indexing schemes for exact nearest neighbour search known to this author seem to be using 1-Lipschitz functions, but of course this does not preclude the existence of schemes based on other ideas.

Furthermore, assuming that an indexing scheme consists of a family of 1-Lipschitz functions whose values are recursively computed by the algorithm does not necessarily imply that the role of the functions is reduced to certifying that a certain point is not in the ε-ball around the query point. As an example, consider the indexing scheme [11] based on a walk on the Delaunay graph of X in Ω and called spatial approximation in [25]. For every datapoint x ∈ X, the scheme stores a list of datapoints whose Voronoi cells are adjacent to the cell containing x. At the search phase, a sequence of datapoints x_1, x_2, ..., x_n is chosen, where each x_{i+1} is the closest point to ω on the list of points Delaunay-adjacent to x_i. If choosing x_{i+1} so as to get closer to ω is impossible, one backtracks. In practice, the scheme performs on a par with the state of the art pivot or metric tree based schemes [27]. We do not know whether our methods can be employed to prove the curse of dimensionality for this particular scheme in the same general setting.

It appears that attempting to extend the method to randomized, approximate NN search stands no chance either. Firstly, the dimensionality reduction-type methods often present in randomized algorithms for approximate search [20, 18, 1] mean that instead of 1-Lipschitz functions, one is using what may be called "probably approximately 1-Lipschitz" ones. For instance, a random projection from a high-dimensional Euclidean space to a subspace of smaller dimension, appropriately rescaled, will have the property that for most pairs of points x, y the distance between them is approximately preserved, to within a factor of 1 ± ε. This property is in itself a consequence of concentration of measure, but such maps do not exhibit a strong concentration property, rendering our methods inapplicable.

Chapter 4 in [47] discusses algorithms for approximate similarity search based on a traditional metric tree, equipped with 1-Lipschitz decision functions, but employing aggressive pruning, either randomized or deterministic. Even here, our proof does not seem to be readily transferable. Indeed, it is based on the basic premise that every bin meeting the ε-neighbourhood of the query point needs to be examined in a deterministic fashion. A randomized algorithm, on the contrary, avoids opening bins which are deemed unlikely to contain relevant datapoints. Experiments confirm that some of the algorithms in question perform up to 300 times faster than the corresponding algorithms for exact search using the same indexing structure (loc. cit.), and provide circumstantial evidence that the situation here is indeed fundamentally different and possibly not amenable to the same methods of analysis.
While the setting of artificially high-dimensional synthetic i.i.d. data fed to a scheme is not realistic, our results provide a theoretical validation of the known simulation results on the poor performance in medium to high dimensions of metric-tree type indexing schemes, such as the SS-tree [45] and the SR-tree [19], on such data inputs. Some data practitioners believe that the intrinsic dimension of real-life datasets does not exceed as few as perhaps seven or ten dimensions. A deeper understanding of the underlying geometry of workloads and its interplay with complexity is called for in order to learn to detect and use this low dimensionality efficiently, and asymptotic analysis of algorithm performance in an artificial setting of very high dimensions contributes towards this goal.

References

1. A. Andoni, P. Indyk, M. Pătraşcu, On the optimality of the dimensionality reduction method, in: Proc. 47th IEEE Symp. on Foundations of Computer Science, pp. 449–458, 2006.
2. M. Anthony and P. Bartlett, Neural Network Learning: Theoretical Foundations, Cambridge University Press, Cambridge, 1999.
3. O. Barkol and Y. Rabani, Tighter lower bounds for nearest neighbor search and related problems in the cell probe model, in: Proc. 32nd ACM Symp. on the Theory of Computing, pp. 388–396, 2000.
4. K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, When is "nearest neighbor" meaningful?, in: Proc. 7th Intern. Conf. on Database Theory (ICDT-99), Jerusalem, pp. 217–235, 1999.
5. A. Borodin, R. Ostrovsky, and Y. Rabani, Lower bounds for high-dimensional nearest neighbor search and related problems, in: Proc. 31st Annual ACM Sympos. Theory Comput., pp. 312–321, 1999.
6. B. Bustos, G. Navarro, E. Chávez, Pivot selection techniques for proximity searching in metric spaces, Pattern Recognition Lett. 24 (2003), 2357–2366.
7. E. Chávez, G. Navarro, A compact space decomposition for effective metric indexing, Pattern Recognition Letters 26 (2005), 1363–1376.
8. E. Chávez, G. Navarro, R. Baeza-Yates and J. L. Marroquín, Searching in metric spaces, ACM Computing Surveys 33 (2001), 273–321.
9. P. Ciaccia, M. Patella and P. Zezula, M-tree: An efficient access method for similarity search in metric spaces, in: Proc. 23rd Int. Conf. on Very Large Data Bases (VLDB'97), Athens, Greece, pp. 426–435, 1997.
10. P. Ciaccia, M. Patella and P. Zezula, A cost model for similarity queries in metric spaces, in: Proc. 17th ACM Symposium on Principles of Database Systems (PODS'98), Seattle, WA, pp. 59–68, 1998.
11. K.L. Clarkson, An algorithm for approximate closest-point queries, in: Proc. 10th Symp. Comp. Geom., Stony Brook, NY, pp. 160–164, 1994.
12. K.L. Clarkson, Nearest-neighbor searching and metric space dimensions, in: Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, MIT Press, 2006, pp. 15–59.
13. A. Faragó, T. Linder, and G. Lugosi, Fast nearest neighbor search in dissimilarity spaces, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, pp. 957–962, 1993.
14. P.W. Goldberg and M.R. Jerrum, Bounding the Vapnik–Chervonenkis dimension of concept classes parametrized by real numbers, Machine Learning 18 (1995), 131–148.
15. M. Gromov and V.D. Milman, A topological application of the isoperimetric inequality, Amer. J. Math. 105 (1983), 843–854.
16. J.M. Hellerstein, E. Koutsoupias, D.P. Miranker, C. Papadimitriou, and V. Samoladas, On a model of indexability and its bounds for range queries, Journal of the ACM 49 (2002), 35–55.
17. P. Indyk, Nearest neighbours in high-dimensional spaces, in: J.E. Goodman, J. O'Rourke, Eds., Handbook of Discrete and Computational Geometry, Chapman and Hall/CRC, Boca Raton–London–New York–Washington, D.C., pp. 877–892, 2004.
18. P. Indyk, R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, in: Proc. 30th Annual ACM Symposium on Theory of Computing, Dallas, TX, pp. 604–613, 1998.
19. N. Katayama and S. Satoh, The SR-tree: An index structure for high-dimensional nearest neighbour queries, in: Proc. 16th Symposium on PODS, Tucson, AZ, pp. 369–380, 1997.
20. E. Kushilevitz, R. Ostrovsky, Y. Rabani, Efficient search for approximate nearest neighbor in high dimensional spaces, SIAM Journal on Computing 30 (2000), 457–474.
21. M. Ledoux, The Concentration of Measure Phenomenon, Mathematical Surveys and Monographs 89, American Mathematical Society, Providence, RI, 2001.
22. S. Mendelson, A few notes on statistical learning theory, in: S. Mendelson, A.J. Smola, Eds., Advanced Lectures in Machine Learning, LNCS 2600, pp. 1–40, Springer, 2003.
23. V.D. Milman and G. Schechtman, Asymptotic Theory of Finite Dimensional Normed Spaces, Lecture Notes in Mathematics 1200, Springer, 1986.
24. P.B. Miltersen, Cell probe complexity — a survey, in: 19th Conference on the Foundations of Software Technology and Theoretical Computer Science (FSTTCS), 1999, Advances in Data Structures Workshop.
25. G. Navarro, Searching in metric spaces by spatial approximation, The VLDB Journal 11 (2002), 28–46.
26. G. Navarro, Analysing metric space indexes: what for? Invited paper, in: Proc. 2nd Int. Workshop on Similarity Search and Applications (SISAP 2009), Prague, Czech Republic, pp. 3–10, 2009.
27. G. Navarro, N. Reyes, Dynamic spatial approximation trees for massive data, in: Proc. 2nd Int. Workshop on Similarity Search and Applications (SISAP 2009), Prague, Czech Republic, pp. 81–88, 2009.
28. R. Panigrahy, K. Talwar, U. Wieder, A geometric approach to lower bounds for approximate near-neighbor search and partial match, in: Proc. 49th IEEE Symp. on Foundations of Computer Science, pp. 414–423, 2008.
29. R. Panigrahy, K. Talwar, U. Wieder, Lower bounds on near neighbor search via metric expansion, in: Foundations of Computer Science (FOCS 2010), pp. 805–814.
30. M. Pătraşcu, M. Thorup, Higher lower bounds for near-neighbor and further rich problems, in: Proc. 47th IEEE Symp. on Foundations of Computer Science, pp. 646–654, 2006.
31. V. Pestov, On the geometry of similarity search: dimensionality curse and concentration of measure, Inform. Process. Lett. 73 (2000), 47–51.
32. V. Pestov, An axiomatic approach to intrinsic dimension of a dataset, Neural Networks 21 (2008), 204–213.
33. V. Pestov, Indexability, concentration, and VC theory, Journal of Discrete Algorithms, doi:10.1016/j.jda.2011.10.002.
34. V. Pestov and A. Stojmirović, Indexing schemes for similarity search: an illustrated paradigm, Fund. Inform. 70 (2006), 367–385.
35. H. Samet, Foundations of Multidimensional and Metric Data Structures, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2005.
36. S. Santini, Exploratory Image Databases: Content-Based Retrieval, Academic Press, Inc., Duluth, MN, USA, 2001.
37. U. Shaft and R. Ramakrishnan, Theory of nearest neighbors indexability, ACM Transactions on Database Systems (TODS) 31 (2006), 814–838.
38. A. Stojmirović and V. Pestov, Indexing schemes for similarity search in datasets of short protein fragments, Information Systems 32 (2007), 1145–1165.
39. J.K. Uhlmann, Satisfying general proximity/similarity queries with metric trees, Information Processing Letters 40 (1991), 175–179.
40. V.N. Vapnik, Statistical Learning Theory, John Wiley & Sons, Inc., New York, 1998.
41. S.S. Vempala, The Random Projection Method, DIMACS Series in Discrete Mathematics and Theoretical Computer Science 65, Amer. Math. Soc., Providence, RI, 2004.
42. M. Vidyasagar, Learning and Generalization, With Applications to Neural Networks, Second Ed., Springer-Verlag, London, 2003.
43. I. Volnyansky and V. Pestov, Curse of dimensionality in pivot-based indexes, in: Proc. 2nd Int. Workshop on Similarity Search and Applications (SISAP 2009), Prague, Czech Republic, pp. 39–46, 2009.
44. R. Weber, H.-J. Schek, and S. Blott, A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces, in: Proceedings of the 24th VLDB Conference, New York, pp. 194–205, 1998.
45. D.A. White and R. Jain, Similarity indexing with the SS-tree, in: Proc. 12th Conf. on Data Engineering (ICDE'96), La Jolla, CA, pp. 516–523, 1996.
46. P. Yianilos, Data structures and algorithms for nearest neighbor search in general metric spaces, in: Proc. 3rd Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 311–321, 1993.
47. P. Zezula, G. Amato, V. Dohnal, and M. Batko, Similarity Search: The Metric Space Approach, Springer Science + Business Media, New York, 2006.
