Partial match queries in random quadtrees

Nicolas Broutin
nicolas.broutin@inria.fr
Projet Algorithms, INRIA Rocquencourt, 78153 Le Chesnay, France

Ralph Neininger and Henning Sulzbach
{neiningr, sulzbach}@math.uni-frankfurt.de
Institute for Mathematics (FB 12), J.W. Goethe University, 60054 Frankfurt am Main, Germany

October 29, 2018

Abstract

We consider the problem of recovering items matching a partially specified pattern in multidimensional trees (quadtrees and $k$-d trees). We assume the traditional model where the data consist of independent and uniform points in the unit square. For this model, in a structure on $n$ points, it is known that the number of nodes $C_n(\xi)$ to visit in order to report the items matching a random query $\xi$, independent of the data and uniformly distributed on $[0,1]$, satisfies $\mathbf{E}[C_n(\xi)] \sim \kappa n^\beta$, where $\kappa$ and $\beta$ are explicit constants. We develop an approach based on the analysis of the cost $C_n(x)$ of any fixed query $x \in [0,1]$, and give precise estimates for the variance and limit distribution of the cost $C_n(x)$. Our results make it possible to describe a limit process for the costs $C_n(x)$ as $x$ varies in $[0,1]$; one of the consequences is that $\mathbf{E}[\max_{x \in [0,1]} C_n(x)] \sim \gamma n^\beta$; this settles a question of Devroye [Pers. Comm., 2000].

1 Introduction

Multidimensional databases arise in a number of contexts such as computer graphics, management of geographical data or statistical analysis. The question of retrieving the data matching a specified pattern is then of prime importance. If the pattern specifies all the data fields, the query can generally be answered in logarithmic time, and a great deal of precise analyses are available in this case [11, 13, 15, 18, 19]. We will be interested in the case when the pattern only constrains some of the data fields; we then talk of a partial match query. The first investigations of partial match queries, by Rivest [28], were based on digital structures.
In a comparison-based setting, a few general-purpose data structures generalizing binary search trees make it possible to answer partial match queries, namely the quadtree [10], the $k$-d tree [1] and the relaxed $k$-d tree [7]. Aside from the interest that one might have in partial match for its own sake, there are numerous reasons that justify the precise quantification of the cost of such general search queries in comparison-based data structures. High-dimensional trees are indeed a data structure of choice for applications that range from collision detection in motion planning to mesh generation, which takes advantage of the adaptive partition of space that is produced [17, 35]. For general references on multidimensional data structures and more details about their various applications, see the series of monographs by Samet [32, 33, 34]. The cost of partial match queries also appears in (hence influences) the complexity of a number of other geometrical search questions such as range search [6] or rank selection [8]. In spite of the importance of the problem, the complexity results about partial match queries are not as precise as one could expect. In this paper, we provide novel analyses of the costs of partial match queries in some of the most important two-dimensional data structures. Most of the document will focus on the special case of quadtrees; in a final section, we discuss the case of $k$-d trees [1] and relaxed $k$-d trees [7].

QUADTREES AND MULTIDIMENSIONAL SEARCH. The quadtree [10] makes it possible to manage multidimensional data by extending the divide-and-conquer approach of the binary search tree. Consider the point sequence $p_1, p_2, \dots, p_n \in [0,1]^2$. As we build the tree, regions of the unit square are associated with the nodes where the points are stored. Initially, the root is associated with the region $[0,1]^2$ and the data structure is empty. The first point $p_1$ is stored at the root, and divides the unit square into four regions $Q_1, \dots, Q_4$. Each region is assigned to a child of the root. More generally, when $i$ points have already been inserted, we have a set of $1 + 3i$ (lower-level) regions that cover the unit square. The point $p_{i+1}$ is stored in the node (say $u$) that corresponds to the region it falls in, and divides it into four new regions that are assigned to the children of $u$. See Figure 1.

Figure 1: An example of a (point) quadtree: on the left the partition of the unit square induced by the tree data structure on the right (the children are ordered according to the numbering of the regions on the left). Answering the partial match query materialized by the dashed line on the left requires visiting the points/nodes coloured in red. Note that each of the visited nodes corresponds to a horizontal line that is crossed by the query.

ANALYSIS OF PARTIAL MATCH RETRIEVAL. For the analysis, we will focus on the model of random quadtrees, where the data points are uniformly distributed in the unit square. In the present case, the data are just points, and the problem of partial match retrieval consists in reporting all the data with one of the coordinates (say the first) being $s \in [0,1]$. It is a simple observation that the number of nodes of the tree visited when performing the search is precisely $C_n(s)$, the number of regions in the quadtree that intersect a vertical line at $s$. The first analysis of partial match in quadtrees is due to Flajolet et al. [14] (after the pioneering work of Flajolet and Puech [12] in the case of $k$-d trees).
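The construction and the cost measure described above can be explored by simulation. The following is a minimal sketch, not the authors' code: the names `Node`, `build_quadtree` and `partial_match_cost` are hypothetical, the ordering of the four children is an arbitrary choice (the numbering of Figure 1 is not assumed), and the cost is taken to be the number of non-empty nodes visited when searching for first coordinate $s$.

```python
import random


class Node:
    """A node of a point quadtree storing one data point (x, y)."""
    __slots__ = ("x", "y", "children")

    def __init__(self, x, y):
        self.x, self.y = x, y
        # Children indexed by 2 * (east?) + (north?): SW=0, NW=1, SE=2, NE=3.
        # This ordering is an assumption made for the sketch.
        self.children = [None, None, None, None]


def build_quadtree(points):
    """Insert the points one by one, in the given order, and return the root."""
    root = None
    for x, y in points:
        if root is None:
            root = Node(x, y)
            continue
        node = root
        while True:
            # Quadrant of the new point relative to the point stored at node.
            q = 2 * int(x >= node.x) + int(y >= node.y)
            if node.children[q] is None:
                node.children[q] = Node(x, y)
                break
            node = node.children[q]
    return root


def partial_match_cost(root, s):
    """Number of nodes visited when reporting all points with first coordinate s."""
    cost, stack = 0, [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        cost += 1
        # The vertical line at s only meets the two regions on one side of node.x.
        side = 2 if s >= node.x else 0
        stack.append(node.children[side])
        stack.append(node.children[side + 1])
    return cost


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only to make the demo reproducible
    n = 2000
    tree = build_quadtree([(rng.random(), rng.random()) for _ in range(n)])
    for s in (0.0, 0.25, 0.5):
        print(s, partial_match_cost(tree, s))
```

Averaging `partial_match_cost` over many independent trees, for a fixed `s` or for a uniform random `s`, gives empirical estimates that can be compared with the asymptotic results discussed in this section.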
Flajolet et al. [14] studied the singularities of a differential system for the generating functions of the partial match cost to prove that, for a random query $\xi$ independent of the tree and uniformly distributed on $[0,1]$,
$$\mathbf{E}[C_n(\xi)] \sim \kappa n^\beta \quad \text{where} \quad \kappa = \frac{\Gamma(2\beta+2)}{2\Gamma(\beta+1)^3}, \qquad \beta = \frac{\sqrt{17}-3}{2}, \tag{1}$$
and $\Gamma(x)$ denotes the Gamma function $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\,dt$. This has since been strengthened by Chern and Hwang [3], who provided the order of the error term (together with the values of the leading constant in all dimensions). The most precise result is (6.2) there, which says that
$$\mathbf{E}[C_n(\xi)] = \kappa n^\beta - 1 + O(n^{\beta-1}). \tag{2}$$

To gain a refined understanding of the cost beyond the level of expectations, we pursue two directions. First, to justify that the expected value is a reasonable estimate of the cost, one would like a guarantee that the costs of partial match retrieval are actually close to their mean. However, deriving higher moments turns out to be more subtle than it seems. In particular, when the query line is random (as in the uniform case), although the four subtrees at the root really are independent given their sizes, the contributions of the two subtrees that do hit the query line are dependent! The relative location of the query line inside these two subtrees is again uniform, but unfortunately it is the same in both regions. This issue has not yet been addressed appropriately, and there is currently no result on the variance or higher moments of $C_n(\xi)$.

The second issue lies in the very definition of the cost measure: even if the data follow some distribution (here uniform), should one really assume that the query also follows this distribution? In other words, should we focus on $C_n(\xi)$? Maybe not. But then, what distribution should one use for the query line?

One possible approach to overcome both problems is to consider the query line to be fixed and to study $C_n(s)$ for $s \in [0,1]$.
This raises another problem: even if $s$ is fixed at the top level, as the search is performed, the relative location of the queries in the recursive calls varies from one node to another! Thus, in following this approach, one is led to consider the entire process $(C_n(s), s \in [0,1])$; this is the method we use here.

Recently Curien and Joseph [4] obtained some results in this direction. They proved that for every fixed $s \in (0,1)$,
$$\mathbf{E}[C_n(s)] \sim K_1 (s(1-s))^{\beta/2} n^\beta, \qquad K_1 = \frac{\Gamma(2\beta+2)\,\Gamma(\beta+2)}{2\,\Gamma(\beta+1)^3\,\Gamma(\beta/2+1)^2}. \tag{3}$$
On the other hand, Flajolet et al. [14, 15] prove that, along the edge, one has $\mathbf{E}[C_n(0)] = \Theta(n^{\sqrt{2}-1}) = o(n^\beta)$ (see also [4]). The behaviour about the $x$-coordinate $U$ of the first data point certainly resembles that along the edge, so that one has $\mathbf{E}[C_n(U)] = o(n^\beta)$. This suggests that $C_n(s)$ should not be concentrated around its mean, and that $n^{-\beta} C_n(s)$ should converge to a non-trivial random variable as $n \to \infty$. This random variable would of course carry much information about the asymptotic properties of the cost of partial match queries in quadtrees. Below, we identify these limit random variables and use them to obtain refined asymptotic information on the complexity of partial match queries in quadtrees.

2 Main results and implications

Our main contribution is to prove the following convergence result:

Theorem 1. Let $C_n(s)$ be the cost of a partial match query at a fixed line $s$ in a random quadtree. Then, there exists a random continuous function $Z$ such that, as $n \to \infty$,
$$\left( \frac{C_n(s)}{K_1 n^\beta},\ s \in [0,1] \right) \xrightarrow{d} (Z(s),\ s \in [0,1]). \tag{4}$$
This convergence in distribution holds in the Banach space $(D[0,1], \|\cdot\|)$ of right-continuous functions with left limits (càdlàg) equipped with the supremum norm defined by $\|f\| = \sup_{s \in [0,1]} |f(s)|$.
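As a numerical sanity check on the constants in (1) and (3), one can evaluate $\beta$, $\kappa$ and $K_1$ and compare the two results: averaging the right-hand side of (3) over a uniform $s$ gives $K_1 B(\beta/2+1, \beta/2+1)\, n^\beta$, which a short computation with the Beta integral shows to equal $\kappa n^\beta$. The sketch below is only illustrative; the names `BETA`, `KAPPA`, `K1`, `beta_function`, `mean_profile` and `averaged_profile` are hypothetical and not taken from the paper.

```python
from math import gamma, sqrt

# Exponent beta from (1); it is the positive root of beta^2 + 3*beta - 2 = 0.
BETA = (sqrt(17) - 3) / 2

# Leading constant kappa for a uniform random query, as in (1).
KAPPA = gamma(2 * BETA + 2) / (2 * gamma(BETA + 1) ** 3)

# Leading constant K_1 for a fixed query line, as in (3).
K1 = (gamma(2 * BETA + 2) * gamma(BETA + 2)
      / (2 * gamma(BETA + 1) ** 3 * gamma(BETA / 2 + 1) ** 2))


def beta_function(a, b):
    """Euler Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)


def mean_profile(s):
    """Coefficient of n^beta in (3): K_1 * (s (1 - s))^(beta / 2)."""
    return K1 * (s * (1 - s)) ** (BETA / 2)


def averaged_profile(steps=100_000):
    """Midpoint-rule approximation of the integral of mean_profile over [0, 1]."""
    h = 1.0 / steps
    return h * sum(mean_profile((i + 0.5) * h) for i in range(steps))


if __name__ == "__main__":
    print("beta  =", BETA)
    print("kappa =", KAPPA)
    print("K_1   =", K1)
    print("K_1 * B(beta/2+1, beta/2+1) =",
          K1 * beta_function(BETA / 2 + 1, BETA / 2 + 1))
    print("numerical average of (3)    =", averaged_profile())
```

The closed-form product and the numerical average should both agree with `KAPPA` up to floating-point and quadrature error, which is a quick way to catch a mistyped constant.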
Note that the convergence in (4) above is stronger than the convergence in distribution of the finite-dimensional marginals
$$\left( \frac{C_n(s_1)}{K_1 n^\beta}, \frac{C_n(s_2)}{K_1 n^\beta}, \dots, \frac{C_n(s_k)}{K_1 n^\beta} \right) \xrightarrow{d} (Z(s_1), Z(s_2), \dots, Z(s_k))$$
as $n \to \infty$, for any natural number $k$ and points $s_1, s_2, \dots, s_k \in [0,1]$ [see, e.g., 2].

Theorem 1 has a myriad of consequences in terms of estimates of the costs of partial match queries in random quadtrees. Of course, Theorem 1 would be of less practical interest if we could not characterize the distribution of the random function $Z$ (see Figure 2 for a simulation):

Proposition 2. The distribution of the random function $Z$ in (4) is a fixed point of the following recursive functional equation
$Z(s) \stackrel{d}{=} \mathbf{1}\{s$ [...] $0)$.

One of the most striking consequences concerns the cost of the worst query in a random plane quadtree. Note in particular that the supremum does not induce any extra logarithmic terms in the asymptotic cost.

Theorem 5. Let $S_n = \sup_{s \in [0,1]} C_n(s)$. Then, as $n \to \infty$,
$$\frac{S_n}{K_1 n^\beta} \xrightarrow{d} S \stackrel{d}{=} \sup_{s \in [0,1]} Z(s)$$
and
$$\mathbf{E}[S_n] \sim K_1 n^\beta\, \mathbf{E}[S], \qquad \mathrm{Var}(S_n) \sim K_1^2 n^{2\beta}\, \mathrm{Var}(S).$$

Finally we note that the one-dimensional marginals of the limit process $(Z(s), s \in [0,1])$ are all the same up to a multiplicative constant.

Theorem 6. There is a random variable $Z \ge 0$ such that for all $s \in [0,1]$,
$$Z(s) \stackrel{d}{=} (s(1-s))^{\beta/2} Z. \tag{7}$$
The distribution of $Z$ is characterized by its moments $c_m := \mathbf{E}[Z^m]$, $m \in \mathbb{N}$. They are given by $c_1 = 1$ and the recurrence
$$c_m = \frac{\beta m + 1}{(m-1)\left(m + 1 - \frac{3}{2}\beta m\right)} \sum_{\ell=1}^{m-1} \binom{m}{\ell} B(\beta \ell + 1, \beta(m-\ell) + 1)\, c_\ell\, c_{m-\ell}, \qquad m \ge 2.$$

PLAN OF THE PAPER. Our approach requires us to work with random functions; as one might expect, proving convergence in a space of functions involves a fair amount of unavoidable technicalities.
Here, we try to keep the discussion at a rather high level, to avoid diluting the main ideas in an ocean of intricate details. In Section 3, we give an overview of our main tool, the contraction method. In Section 4, we identify the variance and the supremum of the limit process $Z$, and deduce the large-$n$ asymptotics for $C_n(s)$ in Theorems 3 and 5.

3 Contraction method: from the real line to functional spaces

3.1 Overview

The aim of this section is to give an overview of the method we employ to prove Theorem 1. The idea is very natural and relies on a contraction argument in a certain space of probability distributions. In the context of the analysis of performance of algorithms, the method was first employed by Rösler [29], who proved convergence in distribution for the rescaled total cost of the randomized version of quicksort. The method was then further developed by Rachev and Rüschendorf [27], Rösler [30], and later on in [5, 9, 22, 24, 25, 31], and has permitted numerous analyses in distribution for random discrete structures.

So far, the method has mostly been used to analyze random variables taking real values, though a few applications on function spaces have been made, see [5, 9, 16]. Here we are interested in the function space $D[0,1]$ with the uniform topology, but the main idea persists: (1) devise a recursive equation for the quantity of interest (here the process $(C_n(s), s \in [0,1])$), and (2) prove that a properly rescaled version of the quantity converges to a fixed point of a certain map related to the recursive equation; (3) if the map is a contraction in a certain metric space, then a fixed point is unique and may be obtained by iteration.

We now move on to the first step of this program. Write $I_1^{(n)}, \dots, I_4^{(n)}$ for the number of points falling in the four regions created by the point stored at the root.
Then, given the coordinates of the first data point $(U, V)$, we have, cf. Figure 1,
$$(I_1^{(n)}, \dots, I_4^{(n)}) \stackrel{d}{=} \mathrm{Mult}(n-1;\ UV,\ U(1-V),\ (1-U)(1-V),\ (1-U)V).$$
Observe that, for the cost inside a subregion, what matters is the location of the query line relative to the region. Thus a decomposition at the root yields the following recursive relation, for any $n \ge 1$,
$C_n(s) \stackrel{d}{=} 1 + \mathbf{1}\{s$ [...] $0$ such that
$$\sup_{s \in [0,1]} \left| t^{-\beta} \mathbf{E}[P_t(s)] - \mu_1(s) \right| = O(t^{-\varepsilon}).$$

The proof of Proposition 11 relies crucially on two main ingredients: first, a strengthening of the arguments developed by Curien and Joseph [4], and second, the speed of convergence of $\mathbf{E}[C_n(\xi)]$ to $\mathbf{E}[\mu_1(\xi)]$ for a uniform query line $\xi$, see (2), by Chern and Hwang [3]. By symmetry, we write for any $\delta \in (0, 1/2)$
$$\sup_{s \in [0,1]} \left| t^{-\beta} \mathbf{E}[P_t(s)] - \mu_1(s) \right| \le \sup_{s \le \delta} \left| t^{-\beta} \mathbf{E}[P_t(s)] - \mu_1(s) \right| + \sup_{s \in (\delta, 1/2]} \left| t^{-\beta} \mathbf{E}[P_t(s)] - \mu_1(s) \right|. \tag{19}$$
The two terms on the right-hand side above are controlled by the following two lemmas.

Lemma 12 (Behaviour on the edge). There exists a constant $C_1$ such that
$$\limsup_{t \to \infty}\ \sup_{s \le \delta} \left| t^{-\beta} \mathbf{E}[P_t(s)] - \mu_1(s) \right| \le C_1 \delta^{\beta/2}. \tag{20}$$

Lemma 13 (Behaviour away from the edge). There exist constants $C_2$, $C_3$, $\eta$ with $0 < \eta < \beta$ and $\gamma \in (0,1)$ such that, for any integer $k$ and real number $\delta \in (0, 1/2)$, we have, for any $t > 0$,
$$\sup_{s \ge \delta} \left| t^{-\beta} \mathbf{E}[P_t(s)] - \mu_1(s) \right| \le C_2 \delta^{-1} (1-\gamma)^k + C_3\, k\, 2^k (\beta - \eta)^{-2k}\, t^{-\eta}.$$

BEHAVIOUR ALONG THE EDGE. The behaviour away from the edge is rather involved and we do not describe how the bound in Lemma 13 is obtained. To deal with the term involving the values of $s \in [0, \delta]$, we relate the value $\mathbf{E}[P_t(s)]$ to $\mathbf{E}[P_t(\delta)]$. Note that the limit first moment $\mu_1(s) = \lim_{t \to \infty} t^{-\beta} \mathbf{E}[P_t(s)]$ is monotonic for $s \in [0, 1/2]$.
It seems, at least intuitively, that for any fixed real number $t > 0$, $\mathbf{E}[P_t(s)]$ should also be monotonic for $s \in [0, 1/2]$, but we were unable to prove it. The following weaker version is sufficient for our purpose.

Proposition 14 (Almost monotonicity). For any $s < 1/2$ and $\varepsilon \in [0, 1-2s)$, we have
$$\mathbf{E}[P_t(s)] \le \mathbf{E}\left[ P_{t(1+\varepsilon)}\left( \frac{s+\varepsilon}{1+\varepsilon} \right) \right].$$

4 Second moment and supremum

In this section, we obtain explicit expressions for the limit, showing that our general approach also yields effective and computable results.

VARIANCE OF THE COST. We first focus on the result in Theorem 3. Our main result implies the convergence $(K_1 n^\beta)^{-2}\, \mathbf{E}[C_n(s)^2] \to \mathbf{E}[Z(s)^2]$. Write $h(s) = \mathbf{E}[Z(s)] = (s(1-s))^{\beta/2}$. Taking second moments in (9) and writing the result as an integral in terms of $\mu_2(s) = \mathbf{E}[Z(s)^2]$ yields the following integral equation, for every $s \in [0,1]$,
$$\mu_2(s) = \frac{2}{2\beta+1} \left\{ \int_s^1 x^{2\beta} \mu_2\!\left( \frac{s}{x} \right) dx + \int_0^s (1-x)^{2\beta} \mu_2\!\left( \frac{1-s}{1-x} \right) dx \right\} + 2 B(\beta+1, \beta+1) \cdot \frac{h(s)^2}{\beta+1}.$$
One easily verifies that the function $f$ given by $f(s) = c_2 h(s)^2$ solves the above equation provided that the constant $c_2$ satisfies
$$c_2 = \frac{2}{(2\beta+1)(\beta+1)}\, c_2 + \frac{2 B(\beta+1, \beta+1)}{\beta+1}, \quad \text{that is} \quad c_2 = 2 B(\beta+1, \beta+1)\, \frac{2\beta+1}{3(1-\beta)},$$
since $\beta^2 = 2 - 3\beta$. So if we were sure that $\mu_2(s)$ is indeed $c_2 h(s)^2$, we would have by integration
$$\mathrm{Var}(Z(\xi)) = c_2 B(\beta+1, \beta+1) - B(\beta/2+1, \beta/2+1)^2.$$
To complete the proof, it suffices to show that the integral equation satisfied by $\mu_2$ actually admits a unique solution. To this end, we show that the map $K$ defined below is a contraction for the supremum norm (the details are omitted):
$$K f(s) = \frac{2}{2\beta+1} \left\{ \int_s^1 x^{2\beta} f\!\left( \frac{s}{x} \right) dx + \int_0^s (1-x)^{2\beta} f\!\left( \frac{1-s}{1-x} \right) dx \right\} + 2 B(\beta+1, \beta+1)\, \frac{[s(1-s)]^\beta}{\beta+1}.$$

COST OF THE WORST QUERY.
The uniform convergence of $C_n(\cdot)/(K_1 n^\beta)$ to the process $Z(\cdot)$ directly implies (by the continuous mapping theorem) the first claim of Theorem 5,
$$\frac{S_n}{K_1 n^\beta} \xrightarrow{d} S := \sup_{s \in [0,1]} Z(s). \tag{21}$$
The convergence in the Zolotarev metric $\zeta_2$, on which the contraction method is based here, is strong enough to imply convergence of the first two moments of the rescaled $S_n$ to the corresponding moments of $S$.

5 Concluding remarks

The method we presented here to obtain refined results about the costs of partial match queries in quadtrees also applies to other geometric data structures based on the divide-and-conquer approach. In particular, similar results can be obtained for the $k$-d trees of Bentley [1] or the relaxed $k$-d trees of Duch et al. [7].

We conclude by mentioning some open questions. The supremum of the process is of great interest since it bounds from above the cost of any query. Can one identify the moments of the supremum $\sup_{s \in [0,1]} Z(s)$ (first and second)? In the course of our proof, we had to construct a continuous solution of the fixed point equation. We prove convergence in distribution, but conjecture that the convergence actually holds almost surely.

References

[1] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18:509–517, 1975.
[2] P. Billingsley. Convergence of Probability Measures. Wiley Series in Probability and Mathematical Statistics. Wiley, second edition, 1999.
[3] H. Chern and H. Hwang. Partial match queries in random quadtrees. SIAM Journal on Computing, 32:904–915, 2003.
[4] N. Curien and A. Joseph. Partial match queries in two-dimensional quadtrees: A probabilistic approach. Advances in Applied Probability, 43:178–194, 2011.
[5] M. Drmota, S. Janson, and R. Neininger. A functional limit theorem for the profile of search trees. Ann. Appl. Probab., 18(1):288–333, 2008. ISSN 1050-5164.
[6] A. Duch and C. Martínez. On the average performance of orthogonal range search in multidimensional data structures. Journal of Algorithms, 44(1):226–245, 2002.
[7] A. Duch, V. Estivill-Castro, and C. Martínez. Randomized k-dimensional binary search trees. In K.-Y. Chwa and O. Ibarra, editors, Proc. of the 9th International Symposium on Algorithms and Computation (ISAAC'98), volume 1533 of Lecture Notes in Computer Science, pages 199–208. Springer Verlag, 1998.
[8] A. Duch, R. Jiménez, and C. Martínez. Rank selection in multidimensional data. In A. López-Ortiz, editor, Proceedings of LATIN, volume 6034 of Lecture Notes in Computer Science, pages 674–685, Berlin, 2010. Springer.
[9] K. Eickmeyer and L. Rüschendorf. A limit theorem for recursively defined processes in L^p. Statist. Decisions, 25(3):217–235, 2007. ISSN 0721-2631.
[10] R. A. Finkel and J. L. Bentley. Quad trees, a data structure for retrieval on composite keys. Acta Informatica, 4:1–19, 1974.
[11] P. Flajolet and T. Lafforgue. Search costs in quadtrees and singularity perturbation asymptotics. Discrete and Computational Geometry, 12:151–175, 1994.
[12] P. Flajolet and C. Puech. Partial match retrieval of multidimensional data. Journal of the ACM, 33(2):371–407, 1986.
[13] P. Flajolet and R. Sedgewick. Analytic Combinatorics. Cambridge University Press, Cambridge, UK, 2009.
[14] P. Flajolet, G. H. Gonnet, C. Puech, and J. M. Robson. Analytic variations on quadtrees. Algorithmica, 10:473–500, 1993.
[15] P. Flajolet, G. Labelle, L. Laforest, and B. Salvy. Hypergeometrics and the cost structure of quadtrees. Random Structures and Algorithms, 7:117–144, 1995.
[16] R. Grübel. On the silhouette of binary search trees. Ann. Appl. Probab., 19(5):1781–1802, 2009. ISSN 1050-5164.
[17] K. Ho-Le. Finite element mesh generation methods: a review and classification. Computer-Aided Design, 20:27–38, 1988.
[18] D. E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison-Wesley, 2nd edition, 1998.
[19] H. Mahmoud. Evolution of Random Search Trees. Wiley, New York, 1992.
[20] C. Martínez, A. Panholzer, and H. Prodinger. Partial match in relaxed multidimensional search trees. Algorithmica, 29(1–2):181–204, 2001.
[21] R. Neininger. Asymptotic distributions for partial match queries in K-d trees. Random Structures and Algorithms, 17:403–427, 2000.
[22] R. Neininger. On a multivariate contraction method for random recursive structures with applications to Quicksort. Random Structures Algorithms, 19(3-4):498–524, 2001. ISSN 1042-9832. Analysis of algorithms (Krynica Morska, 2000).
[23] R. Neininger and L. Rüschendorf. Limit laws for partial match queries in quadtrees. The Annals of Applied Probability, 11:452–469, 2001.
[24] R. Neininger and L. Rüschendorf. A general limit theorem for recursive algorithms and combinatorial structures. Ann. Appl. Probab., 14(1):378–418, 2004. ISSN 1050-5164.
[25] R. Neininger and L. Rüschendorf. On the contraction method with degenerate limit equation. Ann. Probab., 32(3B):2838–2856, 2004. ISSN 0091-1798.
[26] R. Neininger and H. Sulzbach. On a functional contraction method. 2011. Manuscript in preparation.
[27] S. Rachev and L. Rüschendorf. Probability metrics and recursive algorithms. Advances in Applied Probability, 27:770–799, 1995.
[28] R. Rivest. Partial-match retrieval algorithms. SIAM Journal on Computing, 5:19–50, 1976.
[29] U. Rösler. A limit theorem for "quicksort". RAIRO Informatique théorique et Applications, 25:85–100, 1991.
[30] U. Rösler. A fixed point theorem for distributions. Stochastic Processes and their Applications, 37:195–214, 1992.
[31] U. Rösler. On the analysis of stochastic divide and conquer algorithms. Algorithmica, 29(1-2):238–261, 2001. ISSN 0178-4617. Average-case analysis of algorithms (Princeton, NJ, 1998).
[32] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA, 1990.
[33] H. Samet. Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison-Wesley, Reading, MA, 1990.
[34] H. Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006.
[35] M. Yerry and M. Shephard. A modified quadtree approach to finite element mesh generation. IEEE Computer Graphics and Applications, 3:39–46, 1983.
