Range Medians

We study a generalization of the classical median finding problem to the batched query case: given an unsorted array of $n$ items and $k$ (not necessarily disjoint) intervals in the array, the goal is to determine the median in {\em each} of the interval…

Authors: Sariel Har-Peled, S. Muthukrishnan

Range Medians

Sariel Har-Peled∗ S. Muthukrishnan†

October 30, 2018

Abstract

We study a generalization of the classical median finding problem to the batched query case: given an unsorted array of $n$ items and $k$ (not necessarily disjoint) intervals in the array, the goal is to determine the median in each of the intervals in the array. We give an algorithm that uses $O(n \log k + k \log k \log n)$ comparisons and show a lower bound of $\Omega(n \log k)$ comparisons for this problem. This is optimal for $k = O(n/\log n)$.

1 Introduction

The classical median finding problem is to find the median item, that is, the item of rank $\lceil n/2 \rceil$ in an unsorted array of size $n$. We focus on the comparison model, where items in the array can be accessed only through pairwise comparisons, and we count the number of comparisons performed by any algorithm.¹ It has been known since the 70's that this problem can be solved using $O(n)$ comparisons in the worst case [BFP+73]. Later research [BJ85, SPP76, DZ99, DZ01] showed that the number of comparisons needed for solving the median finding problem is between $(2+\varepsilon)n$ and $2.95n$ in the worst case (in the deterministic case). Closing this gap for a deterministic algorithm is an open problem, but surprisingly, one can find the median using $1.5n + o(n)$ comparisons using a randomized algorithm [MR95].

We study the following generalization of the median problem.

The k-range-medians Problem. The input is an unsorted array $S$ with $n$ entries. A sequence of $k$ queries $Q_1, \ldots, Q_k$ is provided. A query $Q_j = [l_j, r_j]$ is an interval of the array, and the output is $x_1, \ldots, x_k$, where
$$x_j = \mathrm{median}\left\{ S[l_j], S[l_j+1], \ldots, S[r_j] \right\}$$
for $j = 1, \ldots, k$. We refer to this as the k-range-medians problem. The problem is to build a data-structure for $S$ such that it can answer this kind of query quickly.
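To make the problem statement concrete, here is a minimal baseline sketch that answers each query by sorting the queried slice. It is not the algorithm of this paper; the function name `range_medians_naive` is a hypothetical choice, queries are assumed to be 1-indexed inclusive pairs as in the definition above, and the median is taken to be the element of rank $\lceil m/2 \rceil$ among the $m$ items of the interval.

```python
def range_medians_naive(S, queries):
    """Answer each query (l, r), 1-indexed and inclusive, by sorting the slice.

    Baseline only: it costs O(m log m) comparisons per query of length m,
    far more than the bounds discussed in the paper.
    """
    answers = []
    for l, r in queries:
        window = sorted(S[l - 1:r])
        m = len(window)
        # The element of rank ceil(m/2) sits at 0-based index (m - 1) // 2.
        answers.append(window[(m - 1) // 2])
    return answers


if __name__ == "__main__":
    S = [5, 1, 4, 2, 3]
    print(range_medians_naive(S, [(1, 5), (2, 4), (3, 3), (1, 2)]))
```

Any faster implementation can be checked against this baseline on small random inputs.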
Notice that the intervals are possibly overlapping.

∗ Department of Computer Science; University of Illinois; 201 N. Goodwin Avenue; Urbana, IL, 61801, USA; sariel@uiuc.edu; http://www.uiuc.edu/~sariel/.
† Google Inc., 76 9th Av, 4th Fl., New York, NY, 10011. muthu@google.com
¹ In the algorithms discussed in this paper, the computation performed beyond the comparisons will be linear in the number of comparisons.

This is the interval version of the classical median finding problem, and it is interesting on its own merit. In addition, there are many motivating scenarios where they arise.

Examples. A motivation arises in analyzing logs of internet advertisements (aka ads). We have the log of clicks on ads on the internet: each record gives the time of the click as well as the varying price paid by the advertiser for the click, and the log is arranged in time-indexed order. Then, $S[i]$ is the price for the $i$th click. Any given advertiser runs several ad campaigns simultaneously spread over different intervals of time. The advertiser then wishes to compare his cost to the general ad market during the period his campaigns ran, and a typical comparison is to the median price paid for clicks during those time intervals. This yields an instance of the k-range-medians problem, for a possibly intersecting set of intervals.

As another example, consider IP networks where one collects what are known as SNMP logs: for each link that connects two routers, one collects the total bytes sent on that link in each fixed length duration like, say, 5 minutes [KMZ03]. Then, $S[i]$ is the number of bytes sent on that link in the $i$th time duration. A traffic analyst is interested in finding the median value of the traffic level within a specific time window such as a week, office hours, or weekends, or the median within each such time window.
Equally, the analyst is sometimes interested in median traffic levels during specific external events such as the time duration when an attack happened or a new network routing strategy was tested.

There are other attributes in addition to time where applications may solve range median problems. For example, $S[i]$ may be the total value of real estate sold in postal zip code area $i$ arranged in sorted order, and an analyst may be interested in the median value for a borough or a city represented by a consecutive set of zip codes.

One can ask similar interval versions of other problems too, for example, the median may be replaced by (say) the maximum, minimum, mode or even the sum.

• For sum, a trivial $O(n)$ preprocessing to compute all the prefix sums $P[j] = \sum_{i \le j} S[i]$ suffices to answer any interval query $Q_j = [l_j, r_j]$ in optimal $O(1)$ time using $P[r_j] - P[l_j - 1]$.
• If the summation operator (i.e., $\sum$) is replaced by a semigroup operator (where the subtraction operator is absent), then $S$ can be preprocessed in $O(nk)$ space and time and each query can be answered in $O(\alpha_k(n))$, where $\alpha_k$ is a slow growing function [Yao82], and this is optimal under general semigroup conditions [Yao85].
• For the special cases of the semigroup operator such as the maximum or minimum, a somewhat nontrivial algorithm is needed to get the same optimal bounds as for the $\sum$ case (see for example [BFC04]).

The median operator is not a semigroup operator and presents a more difficult problem. The only prior results we know are obtained by using the various tradeoffs shown in [KMS05]. For the case when $k = 1$, the interesting tradeoffs for preprocessing time and query times are respectively, roughly, $O(n \log^2 n)$ and $O(\log n)$, or $O(n^2)$ and $O(1)$, or $O(n)$ and $O(n^{\varepsilon})$ for constant fraction $\varepsilon$ [KMS05].
These bounds for individual queries can be directly applied to each of the $k$ interval queries in our problem, resulting in a multiplicative $k$ factor in the query complexity. In particular, the work of Krizanc et al. [KMS05] implies an $O(n \log^2 n + k \log n)$ time algorithm for our problem.

Our main result is as follows.

Theorem 1.1 There is a deterministic algorithm to solve the k-range-medians problem in $O(n \log k + k \log k \log n)$ time. Furthermore, in the comparison model, any algorithm that solves this problem requires $\Omega(n \log k)$ comparisons.

The k-range-medians problem seems to be a fairly basic problem and it is worthwhile to have tight bounds for it. In particular, $\Theta(n \log k)$ may not be the bound one suspects at first glance to be tight for this problem. For $k = O(n/\log n)$, our algorithm is optimal. It also improves [KMS05] for $k = O(n)$. The lower bound holds even if the set of intervals is hierarchical, that is, for any two intervals in the set, either one of them is contained in the other, or they are disjoint. On the other hand, the upper bound holds even if the queries arrive online, in the amortized sense.

Our algorithm uses relaxed sorting on pieces of the array, where only a subset of items in a piece is in their correct sorted location. Relaxed sorting like this has been used before for other problems, for example, see [AY89].

In the following, the $k$th element of a set $S$ (or element of rank $k$) would refer to the $k$th smallest element in the set $S$. For simplicity, we assume the elements of $S$ are all unique.

2 The Lower Bound

Recall that $S$ is an unsorted array of $n$ elements. Assume that $n$ is a multiple of $k$. Let
$$\Psi(n, k) = \left\{ \frac{in}{k} \;\middle|\; i = 1, \ldots, k \right\}, \quad \text{for } n > k > 0.$$
We will say an element of $S$ is the $i$th element of $S$ if its rank in $S$ is $i$.
Claim 2.1 Any algorithm MedianAlg that computes all the elements of rank in $\Psi(n, k)$ from $S$ needs to perform $\Omega(n \log k)$ comparisons in the worst case.

Proof: Let $m_i = in/k$, for $i = 0, \ldots, k$. An element would be labeled $i$ if it is larger than the $m_{i-1}$th element of $S$ and smaller than the $m_i$th element of $S$ (note, that the $m_k$th element of $S$ is the largest element in $S$). An element would be unlabeled if its rank in $S$ is in $\Psi(n, k)$. Note, that the output of the algorithm is the indices of the $k$ unlabeled elements. We will argue that just computing these $k$ numbers requires $\Omega(n \log k)$ time.

Consider an execution of MedianAlg on $S$. We consider the comparison tree model, where the input travels down the decision tree from the root, at any vertex a comparison is being made, and the input is directed either to the right or left child depending on the result of the comparison. A labeling (at a vertex $v$ of the decision tree) is consistent with the comparisons seen so far by the algorithm if there is an input with this labeling, such that it agrees with all the comparisons seen so far and it reaches $v$ during the execution. Let $Z$ be the set of labelings of $S$ consistent with the comparisons seen so far at this vertex $v$.

We claim that if $|Z| > 1$ then the algorithm cannot yet terminate. Indeed, in such a case there are at least two different labelings that are consistent with the comparisons seen so far. If not all the labelings of $Z$ have the same set of $k$ elements marked as unlabeled, then the algorithm has different outputs (i.e., the output is just the indices of the unlabeled elements), and as such the algorithm cannot terminate. So, let $S[\alpha]$ be an element that has two different labels in two labelings of $Z$. There exist two distinct inputs $B = [b_1, \ldots, b_n]$ and $C = [c_1, \ldots, c_n]$ that realize these two labelings.
Now consider the input $D(t) = [d_1(t), \ldots, d_n(t)]$, where $d_i(t) = b_i(1 - t) + t c_i$, for $t \in [0, 1]$ and $i = 1, \ldots, n$. We can perturb the numbers $b_1, \ldots, b_n$ and $c_1, \ldots, c_n$ so that there is never a $t \in [0, 1]$ for which three entries of $D(\cdot)$ are equal to each other (this can be guaranteed by adding random infinitesimal noise to each number, and observing that the probability of this bad event has measure zero). Note that $D(0) = B$ and $D(1) = C$. Furthermore, since for the inputs $B$ and $C$ our algorithm had reached the same node (i.e., $v$) in the decision tree, it holds that for all the comparisons the algorithm performed so far, it got exactly the same results for both inputs.

Now, assume without loss of generality, that the label for $b_\alpha$ in $B$ is strictly smaller than the label for $c_\alpha$ in $C$. Clearly, for some value of $t$ in this range, denoted by $t^*$, $d_\alpha(t^*)$ must be of rank in the set $\{m_1, \ldots, m_k\}$. Indeed, as $t$ increases from 0 to 1, the rank of $d_\alpha(t)$ starts at the rank of $b_\alpha$ in $B$, and ends up with the rank of $c_\alpha$ in $C$. But $D(t^*)$ agrees with all the comparisons seen by the algorithm so far (since if $b_i < b_j$ and $c_i < c_j$ then $d_i(t) < d_j(t)$, for $t \in [0, 1]$). We conclude that the assignment that realizes $D(t^*)$ must leave $d_\alpha(t^*)$ unlabeled. Namely, the set $Z$ has two labelings with different sets of $k$ elements that are unlabeled, and as such the algorithm cannot terminate and must perform some more comparisons if it reached $v$ (i.e., $v$ is not a leaf of the decision tree).

Thus, the algorithm can terminate only when $|Z| = 1$. Let $\beta = n/k - 1$, and observe that in the beginning of the execution of MedianAlg, it has
$$M = \frac{n!}{k! \, (\beta!)^k}$$
possible labelings for the output. Indeed, a consistent labeling is made out of $k$ unlabeled elements, and then $\beta$ elements are labeled by $i$, for $i = 1, \ldots, k$.
Now, by Stirling's approximation, we have
$$M \ge \frac{(\beta k)!}{(\beta!)^k} \approx \frac{\sqrt{2\pi\beta k} \, (\beta k)^{\beta k} / e^{\beta k}}{\left( \sqrt{2\pi\beta} \, \beta^{\beta} / e^{\beta} \right)^k} = \frac{\sqrt{2\pi\beta k} \, (\beta k)^{\beta k}}{\left( \sqrt{2\pi\beta} \, \beta^{\beta} \right)^k} = k^{\beta k} \frac{\sqrt{2\pi\beta k}}{\left( \sqrt{2\pi\beta} \right)^k}.$$
Each comparison performed can only halve this set of possible labelings, in the worst case. It follows, that in the worst case, the algorithm needs
$$\Omega(\log M) = \Omega\left( \beta k \log k - \frac{k}{2} \log(2\pi\beta) \right) = \Omega(\beta k \log k) = \Omega(n \log k)$$
comparisons, as claimed.

Lemma 2.2 Solving the k-range-medians problem requires $\Omega(n \log k)$ comparisons.

Proof: We will show that given an algorithm for the k-range-medians problem, one can use it to solve, via a linear-time reduction, the problem of Claim 2.1. That would immediately imply the lower bound.

Given an input array $S$ of size $n$, construct a new array $T$ of size $4n$ where the first $n$ elements of $T$ are $-\infty$, $T[n+1, \ldots, 2n] = S$, and $T[j] = +\infty$, for $j = 2n+1, \ldots, 4n$. Clearly, the $\ell$th element of $S$ is the median of the range $[1, 2n + 2\ell - 1]$ in $T$. Thus, we can solve the problem of Claim 2.1 using $k$ median range queries, implying the lower bound.

Observe that the lower bound holds even for the case when the intervals are hierarchical.

3 Our Algorithm

We first consider the case when all the query intervals are provided ahead of time. We will present a slow algorithm first, and later show how to make it faster to get our bounds.

Our algorithm uses the following folklore result.

Theorem 3.1 Given $\ell$ sorted arrays with total size $n$, there is a deterministic algorithm to determine the median of the set formed by the union of these arrays using $O(\ell \log(n/\ell))$ comparisons.

Since we were unable to find a reference to precisely this result beyond [KMS05], where a slightly weaker result is stated as a folklore claim, we describe this algorithm in Appendix A.
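As an aside on the proof of Lemma 2.2 above, the padding construction can be checked on a small input. The sketch below builds the array $T$ exactly as described and recovers the element of rank $\ell$ in $S$ from a single range-median query. The helper names are hypothetical, and `median_of_range` is a brute-force stand-in for whatever range-median routine is being reduced to.

```python
import math


def median_of_range(T, l, r):
    """Brute-force range median over positions l..r (1-indexed, inclusive)."""
    window = sorted(T[l - 1:r])
    return window[(len(window) - 1) // 2]  # element of rank ceil(m/2)


def rank_via_range_median(S, ell, range_median=median_of_range):
    """Return the element of rank ell in S using one range-median query.

    T holds n copies of -inf, then S, then 2n copies of +inf, as in the proof.
    The range [1, 2n + 2*ell - 1] has 2n + 2*ell - 1 items, so its median has
    rank n + ell in the range, which is the ell-th smallest element of S.
    """
    n = len(S)
    T = [-math.inf] * n + list(S) + [math.inf] * (2 * n)
    return range_median(T, 1, 2 * n + 2 * ell - 1)


if __name__ == "__main__":
    S = [7, 3, 9, 1]
    print([rank_via_range_median(S, ell) for ell in range(1, len(S) + 1)])
```

All the query ranges start at position 1, so they are nested, which matches the remark that the lower bound holds for hierarchical intervals.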
3.1 A Slow Algorithm

Here we show how to solve the k-range-medians problem. Let $I_1, \ldots, I_k$ be the given (not necessarily disjoint) $k$ intervals in the array $S[1..n]$. We break $S$ into (at most) $2k - 1$ atomic disjoint intervals labeled in the sorted order $B_1, \ldots, B_m$, such that an atomic interval does not have an endpoint of any $I_i$ inside it. Next, we sort each one of the $B_i$'s, and build a balanced binary tree having $B_1, \ldots, B_m$ as the leaves in this order. In a bottom-up fashion we merge the sorted arrays stored in the leaves, so that each node $v$ stores a sorted array $S_v$ of all the elements stored in its subtree. Let $T$ denote this tree, which has height $O(\log k)$.

Now, computing the median of an interval $I_j$ is done by extracting the $O(\log k)$ suitable nodes in $T$ that cover $I_j$. Next, we apply Theorem 3.1, and using $O(\log n \log k)$ comparisons, we get the desired median. We now apply this to the $k$ given intervals. Observe that sorting the atomic intervals takes $O(n \log n)$ comparisons and merging them in $O(\log k)$ levels takes $O(n \log k)$ comparisons in all. This gives:

Lemma 3.2 The algorithm above uses $O(n \log n + k \log n \log k)$ comparisons.

Note, that this algorithm is still mildly interesting. Indeed, if the intervals $I_1, \ldots, I_k$ are all "large", then the running time of the naive algorithm is $O(nk)$, and the above algorithm is faster for $k > \log n$.

3.2 Our Main Algorithm

The main bottleneck in the above solution was the presorting of the pieces of the array corresponding to atomic intervals. In the optimal algorithm below, we do not fully sort them.

Definition 3.3 A subarray $X$ is u-sorted if there is a sorted list $L_X$ of at most (say) $20u$ elements of $X$ such that these elements appear in this sorted order in $X$ (not necessarily as consecutive elements).
Furthermore, for an element $\alpha$ of $L_X$, all the elements of $X$ smaller than it appear before it in $X$ and all the elements larger than $\alpha$ appear after $\alpha$ in $X$. Finally, we require that the distance between two consecutive elements of $L_X$ in $X$ is at most $|X|/u$, where $|X|$ denotes the size of $X$. We will refer to the elements of $X$ between two consecutive elements of $L_X$ as a segment.

An array $X$ of $n$ elements that is $n$-sorted is just sorted, and a 0-sorted array is unsorted. Another way to look at it is that the elements of $L_X$ are in their final position in the sorted order, and the elements within each segment are in an arbitrary order.

Lemma 3.4 Given an unsorted array $X$, it can be u-sorted using $O(|X| \log u)$ comparisons, where $|X|$ denotes the number of elements of $X$.

Proof: We just find the median of $X$, partition $X$ into two equal size subarrays, and continue recursively on the two subarrays. The depth of the recursion is $O(\log u)$, and the work at each level of the recursion is linear, which implies the claim.

Lemma 3.5 Given two u-sorted arrays $X$ and $Y$, they can be merged into a u-sorted array using $O(|X| + |Y|)$ comparisons.

Proof: Convert $Y$ into a linked list. Insert the elements of $L_X$ into $Y$. This can be done by scanning the list of $Y$ until we arrive at the segment $Y_i$ of $Y$ that should contain an element $b$ of $L_X$ that we need to insert. We partition this segment using $b$ into two intervals, add $b$ to $L_Y$, and continue in this fashion with each such $b$. This takes $O(|Y_i|) = O(|Y|/u)$ comparisons per $b$ (ignoring the scanning cost, which is $O(|Y|)$ overall). Let $Z$ be the resulting u-sorted array, which contains all the elements of $Y$ and all the elements of $L_X$, and $L_Z = L_X \cup L_Y$. Computing $Z$ takes
$$O\left( |Y| + |L_X| \frac{|Y|}{u} \right) = O(|Y|)$$
comparisons.
We now need to insert the elements of $X \setminus L_X$ into $Z$. Clearly, if a segment $X_i$ of $X$ has $\alpha_i$ elements of $L_Z$ in its range, then inserting the elements of $X_i$ would take $O(|X_i| \log \alpha_i)$ comparisons. Thus, the total number of comparisons is
$$O\left( \sum_i |X_i| \log \alpha_i \right) = O\left( \sum_i \frac{|X|}{u} \log \alpha_i \right) = O\left( \frac{|X|}{u} \sum_i \alpha_i \right) = O(|X|),$$
since $|X_i| \le |X|/u$, $\log \alpha_i \le \alpha_i$ and $\sum_i \alpha_i = O(u)$.

The final step is to scan over $Z$, and merge consecutive intervals that are too small (removing the corresponding elements from $L_Z$), such that each interval is of length at most $|Z|/u$. Clearly, this can be done in linear time. The resulting $Z$ is u-sorted since its sorted list contains at most $2u + 1$ elements, and every interval is of length at most $|Z|/u$.

Note, that the final filtering stage in the above algorithm is needed to guarantee that the size of the resulting list $L_Z$ is not too large, if we were to use this merging step several times.

In the following, we need a modified version of Theorem 3.1 that works for u-sorted arrays.

Theorem 3.6 Given $\ell$ u-sorted arrays $A_1, \ldots, A_\ell$ with total size $n$ and a rank $k$, there is a deterministic algorithm that returns $\ell$ subintervals $B_1, \ldots, B_\ell$ of these arrays and a number $k'$, such that the following properties hold.
(i) The $k'$th ranked element of $B_1 \cup \cdots \cup B_\ell$ is the $k$th ranked element of $A_1 \cup \cdots \cup A_\ell$.
(ii) The running time is $O(\ell \log(n/\ell))$.
(iii) $\sum_{i=1}^{\ell} |B_i| = O(\ell \cdot (n/u))$.

Proof: For every element of $L_{A_i}$ realizing the u-sorting of the array $A_i$, we assume we have its rank in $A_i$ precomputed. Now, we execute the algorithm of Theorem 3.1 on these (representative) sorted arrays (taking into account their associated rank). (Note that the required modifications of the algorithm of Theorem 3.1 are tedious but straightforward, and we omit the details.)
The main problem is that now the rank of an element is only estimated approximately up to an (additive) error of $n/u$. At the end of the process of trimming down the representative arrays, we might still have active intervals of total length $2n/u$ in each one of these arrays, resulting in the bound on the size of the computed intervals.

Using the theorem above as well as the two lemmas above, we get the following result, which is building up to the algorithmic part of Theorem 1.1.

Lemma 3.7 There is a deterministic algorithm to solve the k-range-medians problem in $O(n \log k + k \log k \log n)$ time, when the $k$ query intervals are provided in advance.

Proof: We repeat the algorithm of Section 3.1 using u-sorting instead of sorting, for $u$ to be specified shortly. Building the data-structure (i.e., the tree over the atomic intervals) takes $O(n \log u)$ comparisons. Indeed, we first u-sort the atomic intervals, and then we merge them as we go up the tree.

A query of finding the median of array elements in an interval is now equivalent to finding the median for $m = O(\log k)$ u-sorted arrays $A_1, \ldots, A_m$. Using the algorithm of Theorem 3.6 results in $m$ intervals $B_1, \ldots, B_m$ that belong to $A_1, \ldots, A_m$, respectively, such that we need to find the $k'$th smallest element in $B_1 \cup \ldots \cup B_m$. The total length of the $B_i$s is $O(mn/u)$. Now we can just use the brute force method. Merge $B_1, \ldots, B_m$ into a single array and find the $k'$th smallest element using the classical algorithm. This takes $O(mn/u)$ comparisons.

We have to repeat this $k$ times, and the number of comparisons we need is
$$O\left( k m \frac{n}{u} + k m \log n \right) = O(n + k \log k \log n),$$
for $u = k^2$, since $m = O(\log k)$. Thus, in all, the number of comparisons used by the algorithm is $O(n \log k + k \log k \log n)$.
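The tree over atomic intervals that this proof reuses from Section 3.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's optimal algorithm: leaves are fully sorted rather than u-sorted, the selection step merges the covering arrays with `heapq.merge` as a simple stand-in for Theorem 3.1, and all names (`Node`, `build_atomic_intervals`, `build_tree`, `cover`, `range_medians`) are hypothetical. Queries are 1-indexed inclusive pairs and must be known before the tree is built.

```python
import heapq


class Node:
    def __init__(self, lo, hi, items, left=None, right=None):
        self.lo, self.hi = lo, hi      # array positions covered (1-indexed, inclusive)
        self.items = items             # sorted list of S[lo..hi]
        self.left, self.right = left, right


def build_atomic_intervals(n, queries):
    """Cut positions 1..n at every query endpoint; return (lo, hi) pairs."""
    cuts = {1, n + 1}
    for l, r in queries:
        cuts.add(l)        # an atomic interval starts at l
        cuts.add(r + 1)    # and another one starts right after r
    cuts = sorted(cuts)
    return [(a, b - 1) for a, b in zip(cuts, cuts[1:])]


def build_tree(S, atoms):
    """Balanced tree over the atoms; each node stores its elements sorted."""
    if len(atoms) == 1:
        lo, hi = atoms[0]
        return Node(lo, hi, sorted(S[lo - 1:hi]))
    mid = len(atoms) // 2
    left, right = build_tree(S, atoms[:mid]), build_tree(S, atoms[mid:])
    merged = list(heapq.merge(left.items, right.items))
    return Node(left.lo, right.hi, merged, left, right)


def cover(node, l, r, out):
    """Collect the sorted arrays of the maximal nodes lying inside [l, r]."""
    if r < node.lo or node.hi < l:
        return
    if l <= node.lo and node.hi <= r:
        out.append(node.items)
        return
    cover(node.left, l, r, out)
    cover(node.right, l, r, out)


def range_medians(S, queries):
    root = build_tree(S, build_atomic_intervals(len(S), queries))
    answers = []
    for l, r in queries:
        pieces = []
        cover(root, l, r, pieces)
        merged = list(heapq.merge(*pieces))  # stand-in for Theorem 3.1
        answers.append(merged[(len(merged) - 1) // 2])
    return answers


if __name__ == "__main__":
    print(range_medians([5, 1, 4, 2, 3], [(1, 5), (2, 4), (3, 3), (1, 2)]))
```

Because the atoms are cut at every query endpoint, a leaf is never partially covered by a query, so `cover` never recurses below a leaf for the queries used to build the tree.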
We can extend this bound to the case when the intervals are presented in an online manner, and we get amortized bounds.

Lemma 3.8 (When $k$ is known in advance.) There is a deterministic algorithm to solve the k-range-medians problem in $O(n \log k + k \log k \log n)$ time, when the $k$ query intervals are provided in an online fashion, but $k$ is known in advance.

Proof: The idea is to partition the array into $u$, $u \le k^2$, atomic intervals all of the same length, and build the data-structure of these atomic intervals. The above algorithm would work verbatim, except that for every query interval $I$, there would be two "dangling" atomic intervals that are of size $n/u$ that contain the two endpoints of $I$.

Specifically, to perform the query for $I$, we compute $m = O(\log k)$ u-sorted arrays using our data-structure. We also take these two atomic intervals, clip them into the query interval, u-sort them, and add them to the $m$ u-sorted arrays we already have. Now, we need to perform the median query over these $O(\log k)$ u-sorted arrays, which we can do, as described above. Clearly, the resulting algorithm has running time
$$O\left( n \log u + k \log u \log n + k \frac{n}{u} \log u \right) = O(n \log k + k \log k \log n),$$
since $u = k^2$.

Lemma 3.9 (When $k$ is not known in advance.) There is a deterministic algorithm to solve the k-range-medians problem in $O(n \log k + k \log k \log n)$ time, when the $k$ query intervals are provided in an online fashion.

Proof: We will use the algorithm of Lemma 3.8. At each stage, we have a current guess for the number of queries to be performed. In the beginning this guess is a constant, say 10. When this number of queries is exceeded, we square our guess, rebuild our data-structure from scratch for this new guess, and continue. Let $k_1 = 10$ and $k_i = (k_{i-1})^2$ be the sequence of guesses, for $i = 1, \ldots, \beta$, where $\beta = O(\log \log k)$. We have that the total running time of the algorithm is
$$\sum_{i=1}^{\beta} O(n \log k_i + k_i \log k_i \log n) = O(n \log k + k \log k \log n),$$
since $\log k_{i-1} = (\log k_i)/2$, for all $i$.

Lemma 3.9 implies the algorithmic part of Theorem 1.1.

4 Concluding Remarks

The k-range-medians problem is a natural interval generalization of the classical median finding problem: unlike interval generalizations of other problems such as max, min or sum, which can be solved in linear time, our problem (surprisingly) needs $\Omega(n \log k)$ comparisons, and we present an algorithm that solves this problem with running time (and number of comparisons) $O(n \log k + k \log k \log n)$.

A number of technical problems remain and we list them below.

• Currently, our algorithm uses $O(n \log k)$ space. It would be interesting to reduce this to linear space.
• Say the elements are from an integer range $1, \ldots, U$. Can we design $o(n)$ time algorithms in that case using word operations? For the classical median finding problem, both comparison-based and word-based algorithms take $O(n)$ time. But given that the comparison-based algorithm needs $\Omega(n \log k)$ comparisons for our k-range-medians problem, it now becomes interesting if word-based algorithms can do better for an integer alphabet.
• Say one wants to only answer median queries approximately for each interval (see [BKMT05] for some relevant results). Can one design $o(n \log k)$ algorithms? Suppose the elements are integers in the range $1, \ldots, U$. We define an approximate version where the goal is to return an element within $(1 \pm \varepsilon)$ of the correct median in value, for some fixed $\varepsilon$, $0 < \varepsilon < 1$.
Then we can keep an exponential histogram with each atomic interval of the number of elements in the range $[(1 + \varepsilon)^i, (1 + \varepsilon)^{i+1})$ for each $i$, and follow the algorithm outline here, constructing them for all the suitably chosen intervals on the balanced binary tree atop these atomic intervals. For each interval in the query, one can easily merge the exponential histograms corresponding to it and obtain an algorithm that takes time $O(n + k \log k \log U)$, since any two exponential histograms can be merged in $O(\log U)$ time. If the elements are not integers in the range $1, \ldots, U$ and one worked in the comparison model, similar results may be obtained using [GK01, GK04], or $\varepsilon$-nets. It is not clear if these bounds are optimal.
• We believe extending the problem to two (or more) dimensions is also of interest. There is prior work for range sums and minimums, but tight bounds for $k$ range medians will be interesting.

Acknowledgements. The authors would like to thank the anonymous referees for their careful reading, useful comments and references. In particular, they identified mistakes in an earlier version of this paper.

References

[AY89] T. Altman and I. Yoshihide. Roughly sorting: Sequential and parallel approach. Journal of Information Processing, 12(2):154–158, 1989.
[BFC04] M. A. Bender and M. Farach-Colton. The level ancestor problem simplified. Theo. Comp. Sci., 321(1):5–12, 2004.
[BFP+73] M. Blum, R. W. Floyd, V. R. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. J. Comput. Sys. Sci., 7(4):448–461, 1973.
[BJ85] S. W. Bent and J. W. John. Finding the median requires 2n comparisons. In Proc. 17th Annu. ACM Sympos. Theory Comput., pages 213–216, 1985.
[BKMT05] P. Bose, E. Kranakis, P. Morin, and Y. Tang. Approximate range mode and range median queries. In Proc. 22nd Internat. Sympos. Theoret. Asp. Comp. Sci., pages 377–388, 2005.
[DZ99] D. Dor and U. Zwick. Selecting the median. SIAM J. Comput., 28(5):1722–1758, 1999.
[DZ01] D. Dor and U. Zwick. Median selection requires (2 + ε)n comparisons. SIAM J. Discret. Math., 14(3):312–325, 2001.
[GK01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proc. 2001 ACM SIGMOD Conf. Mang. Data, pages 58–66, 2001.
[GK04] M. Greenwald and S. Khanna. Power-conserving computation of order-statistics over sensor networks. In Proc. 23rd ACM Sympos. Principles Database Syst., pages 275–285, 2004.
[KMS05] D. Krizanc, P. Morin, and M. Smid. Range mode and range median queries on lists and trees. Nordic J. Comput., 12(1):1–17, 2005.
[KMZ03] F. Korn, S. Muthukrishnan, and Y. Zhu. Checks and balances: Monitoring data quality problems in network traffic databases. In Proc. 29th Intl. Conf. Very Large Data Bases, pages 536–547, 2003.
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, New York, NY, 1995.
[SPP76] A. Schönhage, M. Paterson, and N. Pippenger. Finding the median. J. Comput. Sys. Sci., 13(2):184–199, 1976.
[Yao82] A. C. Yao. Space-time tradeoff for answering range queries. In Proc. 14th Annu. ACM Sympos. Theory Comput., pages 128–136, 1982.
[Yao85] A. C. Yao. On the complexity of maintaining partial sums. SIAM J. Comput., 14(2):277–288, 1985.

A Choosing median from sorted arrays

In this section, we prove Theorem 3.1 by providing a fast deterministic algorithm for choosing the median element of $\ell$ sorted arrays. As we mentioned before, this result seems to be known, but we are unaware of a direct reference to it, and as such we provide a detailed algorithm.

A.1 The algorithm

Let $A_1, \ldots, A_\ell$ be the given sorted arrays of total size $n$.
We maintain $\ell$ active ranges $[l_i, r_i]$ of the array $A_i$ where the required element (i.e., "median") lies, for $i = 1, \ldots, \ell$. Let $k$ denote the rank of the required median. Let $n_{\mathrm{curr}} = \sum_i (r_i - l_i + 1)$ be the total number of currently active elements.

If $n_{\mathrm{curr}} \le 32\ell$, then we find the median in linear time, using the standard deterministic algorithm. Otherwise, let $\Delta = \lfloor n_{\mathrm{curr}} / (32\ell) \rfloor$. Pick $u_i - 1$ equally spaced elements from the active range of $A_i$, where
$$u_i = 4 + \left\lceil \frac{r_i - l_i + 1}{\Delta} \right\rceil.$$
Let $L_i$ be the resulting list of representatives, for $i = 1, \ldots, \ell$. Note that $L_i$ breaks the active range of $A_i$ into blocks of size
$$\nu_i \le \left\lceil \frac{r_i - l_i + 1}{u_i} \right\rceil.$$
For each element of $L_i$ we know exactly how many elements are smaller than it and larger than it in the $i$th array. Merge the lists $L_1, \ldots, L_\ell$ into one sorted list $L$.

For an element $x$, let $\mathrm{rank}(x)$ denote the rank of $x$ in the set $A_1 \cup \ldots \cup A_\ell$. Note, that now for every element $x$ of $L$ we can estimate its $\mathrm{rank}(x)$ to lie within an interval of length $T = \sum_{i=1}^{\ell} \nu_i$. Indeed, we know for an element $x \in L$ between what two consecutive representatives it lies for all $\ell$ arrays. For an element $x \in L$, let $R(x)$ denote this range where the rank of $x$ might lie.

Now, given two consecutive representatives $x$ and $y$ in the $i$th array, if $k \notin R(x)$ and $k \notin R(y)$ then the required median cannot lie between $x$ and $y$, and we can shrink the active range not to include this portion. In particular, the new active range spans all the blocks which might contain the median. The algorithm now updates the value of $k$ and continues recursively on the new active ranges.
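The rank estimate $R(x)$ used above can be illustrated with a short sketch. It covers only one ingredient of the algorithm (bounding the rank of an element from the representatives alone), not the full recursive shrinking, and it assumes whole arrays are active, all elements are distinct, and every array is non-empty. The names `pick_representatives` and `rank_range` are hypothetical.

```python
import bisect


def pick_representatives(A, blocks):
    """0-based indices of blocks - 1 roughly equally spaced elements of sorted A."""
    n = len(A)
    return [(j * n) // blocks for j in range(1, blocks)]


def rank_range(x, arrays, rep_indices):
    """Return (lo, hi) such that the 1-based rank of x in the union lies in [lo, hi].

    Only representatives are compared against x. In each array, x falls between
    two consecutive representatives, which bounds how many elements are below x.
    """
    fewest = most = 0
    for A, idx in zip(arrays, rep_indices):
        rep_values = [A[i] for i in idx]
        p = bisect.bisect_left(rep_values, x)   # number of representatives < x
        # Everything up to the last representative below x is smaller than x.
        fewest += idx[p - 1] + 1 if p > 0 else 0
        # Everything from the first representative >= x onward is not smaller.
        most += idx[p] if p < len(idx) else len(A)
    return fewest + 1, most + 1


if __name__ == "__main__":
    arrays = [[1, 4, 7, 10, 13, 16], [2, 5, 8, 11, 14, 17], [3, 6, 9, 12, 15, 18]]
    reps = [pick_representatives(A, 3) for A in arrays]
    print(rank_range(7, arrays, reps))
```

The width of the returned interval is at most the sum of the block sizes, which corresponds to the quantity $T$ in the text.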
A.2 Analysis

The error estimate for the rank of a representative is bounded by
$$U = \sum_{i=1}^{\ell} \nu_i \le \ell + \sum_{i=1}^{\ell} \frac{r_i - l_i + 1}{u_i} \le \ell + \sum_{i=1}^{\ell} \frac{r_i - l_i + 1}{4 + \frac{r_i - l_i + 1}{\Delta}} = \ell + \Delta \sum_{i=1}^{\ell} \frac{r_i - l_i + 1}{4\Delta + r_i - l_i + 1} \le \ell + \ell\Delta \le \frac{n_{\mathrm{curr}}}{32} + \ell \left\lfloor \frac{n_{\mathrm{curr}}}{32\ell} \right\rfloor \le \frac{n_{\mathrm{curr}}}{16},$$
since $\ell \le n_{\mathrm{curr}}/32$ and by the choice of $\Delta$.

Consider the sorted merged array $B$ of all the active elements. The length of $B$ is $n_{\mathrm{curr}}$, and assume, for the sake of simplicity of exposition, that the desired median is in the second half of $B$ (the other case follows by a symmetric argument). Note, that any representative $x$ that falls in the first quarter of $B$ has a rank that lies in a range shorter than $T < n_{\mathrm{curr}}/4$, and as such it cannot include $k$. In particular, let $t_i$ be the index in $A_i$ of the first representative in the active range (of $A_i$) that does not fall in the first quarter of $B$. Observe that $\sum_i (t_i - l_i + 1) \ge n_{\mathrm{curr}}/4$. The total number of elements that are being eliminated by the algorithm (in the top of the recursion) is at least
$$\sum_i \left( (t_i - l_i + 1) - 2\nu_i \right) = \sum_i (t_i - l_i + 1) - 2 \sum_i \nu_i \ge \frac{n_{\mathrm{curr}}}{4} - 2U \ge \frac{n_{\mathrm{curr}}}{8}.$$
Namely, each recursive call continues on a total length of all active ranges smaller by a factor of $7/8$ from the original array.

The total length of $L_1, \ldots, L_\ell$ is $O(\ell)$, and as such the total work (ignoring the recursive call) is bounded by $O(\ell \log \ell)$. The running time is bounded by
$$T(n_{\mathrm{curr}}) = O(\ell \log \ell) + T\left( (7/8) \, n_{\mathrm{curr}} \right),$$
where $T(\ell) = O(\ell \log \ell)$. Thus, the total running time is $O(\ell \log \ell \log(n_{\mathrm{curr}}/\ell))$.

A.3 Doing even better - a faster algorithm

Observe, that the bottleneck in the above algorithm is the merger of the representative lists $L_1, \ldots, L_\ell$. Instead of merging them, we will compute the median $x$ of $L = L_1 \cup \ldots \cup L_\ell$.
If $R(x)$ does not contain $k$, then we can throw away at least $n_{\mathrm{curr}}/4$ elements in the current active ranges and continue recursively. Otherwise, compute the element $z$ of rank $n_{\mathrm{curr}}/4$ in $L$. Clearly, $k \notin R(z)$ and one can throw away, as above, a constant fraction of the active ranges. The resulting running time (ignoring the recursive call) is $O(\ell)$ (instead of $O(\ell \log \ell)$). Thus, the running time of the resulting algorithm is $O(\ell \log(n_{\mathrm{curr}}/\ell))$.
