Succinct Coverage Oracles

Ioannis Antonellis^1, Anish Das Sarma^2, Shaddin Dughmi^1
^1 Stanford University, ^2 Yahoo Research
^1 {antonell,shaddin}@cs.stanford.edu, ^2 anishdas@yahoo-inc.com

Abstract

In this paper, we identify a fundamental algorithmic problem that we term succinct dynamic covering (SDC), arising in many modern-day web applications, including ad-serving and online recommendation systems in eBay and Netflix. Roughly speaking, SDC applies two restrictions to the well-studied Max-Coverage problem [9]: Given an integer $k$, $\mathcal{X} = \{1, 2, \ldots, n\}$ and $\mathcal{I} = \{S_1, \ldots, S_m\}$, $S_i \subseteq \mathcal{X}$, find $\mathcal{J} \subseteq \mathcal{I}$ such that $|\mathcal{J}| \le k$ and $|\bigcup_{S \in \mathcal{J}} S|$ is as large as possible. The two restrictions applied by SDC are: (1) Dynamic: At query time, we are given a query $Q \subseteq \mathcal{X}$, and our goal is to find $\mathcal{J}$ such that $|Q \cap (\bigcup_{S \in \mathcal{J}} S)|$ is as large as possible; (2) Space-constrained: We do not have enough space to store (and process) the entire input; specifically, we have $o(mn)$, and maybe as little as $O((m+n)\,\mathrm{polylog}(mn))$ space. A solution to SDC maintains a small data structure, and uses this data structure to answer most dynamic queries with high accuracy. We call such a scheme a Coverage Oracle.

We present algorithms and complexity results for coverage oracles. We present deterministic and probabilistic near-tight upper and lower bounds on the approximation ratio of SDC as a function of the amount of space available to the oracle. Our lower bound results show that to obtain constant-factor approximations we need $\Omega(mn)$ space. Fortunately, our upper bounds present an explicit tradeoff between space and approximation ratio, allowing us to determine the amount of space needed to guarantee a certain accuracy.

1 Introduction

The explosion of data and applications on the web over the last decade has given rise to many new data management challenges.
This paper identifies a fundamental subproblem inherent in several Web applications, including online recommendation systems and serving advertisements on webpages. Let us begin with a motivating example.

Example 1.1. Consider the online movie rental and streaming website, Netflix^1, and one of their users, Alice. Based on Alice's movie viewing (and rating) history, Netflix would like to recommend new movies to Alice for watching. (Indeed, Netflix threw open a million-dollar challenge on significantly improving their movie recommendations^2.) Conceivably, there are many ways of devising algorithms for recommendation, ranging from data mining to machine learning techniques, and indeed there has been a great deal of such work on providing personalized recommendations (see [1] for a survey). Regardless of the specific technique, an important subproblem that arises is finding users "similar" to Alice, i.e., finding users who have independently or in conjunction viewed (and liked) movies seen (and liked) by Alice. Abstractly speaking, we are given a universal set of all Netflix movies, and Netflix users identified by the subset of movies they have viewed (and liked or disliked). Given a specific user Alice, we are interested in finding (say $k$) other users, who together cover a large set of Alice's likes and dislikes. Note that for each user, the set of movies that need to be covered is different, and therefore the covering cannot be performed statically, independent of the user. In fact, Netflix dynamically provides movie recommendations as users rate movies in a particular genre (say comedy), or request movies in specific languages or time periods.
Providing recommendations at interactive speed, based on user queries (such as for a particular genre), rules out computationally expensive processing over the entire Netflix data, which is very large.^3 Therefore, we are interested in approximately solving the aforementioned covering problem based on a subset of the data. The main challenge that arises is to statically identify a subset of the data that would provide good approximations to the covering problem for any dynamic user query.

Note that a very similar challenge arises in other recommendation systems, such as when Alice visits an online shopping website like eBay^4 or Amazon^5, and the website is interested in recommending products to Alice based on her current query for a particular brand or product, and her prior purchasing (and viewing) history.

The example above can be formulated as an instance of a simple algorithmic covering problem, generalizing the NP-hard optimization problem max k-cover [9]. The input to this problem is an integer $k$, a set $\mathcal{X} = \{1, \ldots, n\}$, a family $\mathcal{I} \subseteq 2^{\mathcal{X}}$ of subsets of $\mathcal{X}$, and a query $Q \subseteq \mathcal{X}$. Here $(\mathcal{X}, \mathcal{I})$ is called a set system, $\mathcal{X}$ is called the ground set of the set system, and members of $\mathcal{X}$ are called elements or items. We make no assumptions on how the set system is represented in the input, though the reader can think of the obvious representation by an $n \times m$ bipartite graph for intuition. This $n \times m$ bipartite graph can be stored in $O(nm)$ bits, which is in fact information-theoretically optimal for storing an arbitrary set system on $n$ items and $m$ sets.
The objective of the problem is to return $\mathcal{J} \subseteq \mathcal{I}$ with $|\mathcal{J}| \le k$ that collectively cover as much of $Q$ as possible. Since this problem is a generalization of max k-cover, it is NP-hard. Nevertheless, absent any additional constraints this problem can be approximated in polynomial time by a straightforward adaptation of the greedy algorithm for max k-cover^6, which attains a constant-factor $\frac{e}{e-1}$ approximation in $O(mn)$ time [11]. However, we further constrain solutions to the problem as follows, rendering new techniques necessary.

^1 www.netflix.com
^2 http://www.netflixprize.com/
^3 Netflix currently has over 10 million users and over 100,000 movies; obviously some of the popular movies have been viewed by many users, and movie buffs have rated a large number of movies. Netflix owns over 55 million discs.
^4 www.ebay.com
^5 www.amazon.com

From the above example, we identify two properties that we require of any system that solves this covering problem:

1. Space Constrained: We need to (statically) preprocess the set system $(\mathcal{X}, \mathcal{I})$ and store a small sketch (much smaller than $O(mn)$), in the form of a data structure, and discard the original representation of $(\mathcal{X}, \mathcal{I})$. This can be thought of as a form of lossy compression. We do not require the data structure to take any particular form; it need only be a sequence of bits that allows us to extract information about the original set system $(\mathcal{X}, \mathcal{I})$. For instance, any statistical summary, a subgraph of the bipartite graph representing the set system, or other representation is acceptable.

2. Dynamic: The query $Q$ is not known a priori, but arrives dynamically. More precisely, $Q$ arrives after the data structure is constructed and the original data discarded. It is at that point that the data structure must be used to compute a solution $\mathcal{J}$ to the covering problem.
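As an illustration of the unconstrained baseline mentioned above (the greedy adaptation of max k-cover to a query $Q$), the following is a minimal Python sketch. The function name `greedy_cover`, the list-of-sets input representation, and the tie-breaking rule (lowest index wins) are assumptions made for illustration and do not come from the paper.

```python
def greedy_cover(sets, query, k):
    """Pick up to k set indices, each time taking the set that covers
    the most still-uncovered items of the query (ties: lowest index)."""
    uncovered = set(query)
    chosen = []
    for _ in range(k):
        remaining = [i for i in range(len(sets)) if i not in chosen]
        if not remaining:
            break
        # max() returns the first maximal element, so ties go to the lowest index.
        best = max(remaining, key=lambda i: len(uncovered & sets[i]))
        if not (uncovered & sets[best]):
            break  # no remaining set covers anything new
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


if __name__ == "__main__":
    sets = [{1, 2, 3}, {3, 4}, {5, 6}, {1, 5}]
    print(greedy_cover(sets, query={1, 2, 5, 6}, k=2))
```

This sketch reads the whole set system at query time, which is exactly what the two requirements above rule out; it serves only as the reference point that a coverage oracle must approximate from a much smaller data structure.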
We call this covering problem (formalized in the next section) the Succinct Dynamic Covering (SDC) problem. Moreover, we call a solution to SDC a Coverage Oracle. A coverage oracle consists of a static stage that constructs a data structure, and a dynamic stage that uses the data structure to answer queries.

Next we briefly present another, entirely different, Web application that also needs to confront SDC. In addition, we note that there are several other applications facing similar covering problems, including gene identification [8], searching domain-specific aggregator sites like Yelp^7, topical query decomposition [4], and search-result diversification [5, 7].

Example 1.2. Online advertisers (1) bid on webpages matching relevancy criteria and (2) typically target a certain user demographic. Advertisements are served based on a combination of the two criteria above. When a user visits a particular webpage, there is usually no precise information about the user's demographic, i.e., age, location, interests, gender, etc. Instead, there is a range of possible values for each of these attributes, determined based on the search query the user issued or session information. Ad-servers therefore attempt to pick a set of advertisements that would be of interest to (i.e., "cover") a large number of users; the user demographic that needs to be covered is determined by the page on which the advertisement is being placed, the user query, and session information. Therefore, ad-serving is faced with the SDC problem. The space constraint arises because the set system consisting of all webpages, and each user identified by the set of webpages visited by the user, is prohibitively large to store in memory and process in real time for every single page view.
The dynamic aspect arises because each user view of each page is associated with a different user demographic that needs to be covered.

^6 The greedy algorithm for max k-cover, adapted to our problem, is simple: Find the set in $\mathcal{I}$ covering as many uncovered items in $Q$ as possible, and repeat this $k$ times. This can clearly be implemented in $O(mn)$ time, and has been shown to yield an $e/(e-1)$ approximation.
^7 www.yelp.com

1.1 Contributions and Outline

Next we outline the main contributions of this paper.

• In Section 2 we formally define the succinct dynamic covering (SDC) problem, and summarize our results.

• In Section 3 we present a randomized coverage oracle for SDC. The oracle is presented as a function of the available space, thus allowing us to trade off space for accuracy based on the specific application. Unfortunately, the approximation ratio of this oracle degrades rapidly as space decreases; however, the next section shows that this is in fact unavoidable.

• In Section 4 we present a lower bound on the best possible approximation attainable as a function of the space allowed for the data structure. This lower bound essentially matches the upper bound of Section 3, though with the caveat that the lower bound is for oracles that do not use randomization. We expect the lower bound to hold more generally for randomized oracles, though we leave this as an open question.

Related work and future directions are presented in Section 5.

1.2 Related Work

Our study of the tradeoff between space and approximation ratio is in the spirit of the work of Thorup and Zwick [14] on distance oracles. They considered the problem of compressing a graph $G$ into a small data structure, in such a way that the data structure can be used to approximately answer queries for the distance between pairs of nodes in $G$.
Similar to our results, they showed matching upper and lower bounds on the space needed for compressing the graph subject to preserving a certain approximation ratio. Moreover, similarly to our upper bounds for SDC, their distance oracles benefit from a speedup at query time as approximation ratio is sacrificed for space.

Previous work has studied the set cover problem under streaming models. One model studied in [3, 10] assumes that the sets are known in advance, only elements arrive online, and the algorithms do not know in advance which subset of elements will arrive. An alternative model assumes that elements are known in advance and sets arrive in a streaming fashion [13]. Our work differs from these works in that SDC operates under a storage budget, so all sets cannot be stored; moreover, SDC needs to provide a good cover for all possible dynamic query inputs.

Another related area is that of nearest neighbor search. It is easy to see that the SDC problem with $k = 1$ corresponds to nearest neighbor search using the dot product similarity measure, i.e., $\mathrm{sim}_{dot}(x, y) = \frac{\mathrm{dot}(x, y)}{n}$. However, it follows from a result of Charikar [6] that there exists no locality-sensitive hash function family for the dot product similarity function. Thus, there is no hope that signature schemes (like minhashing for the Jaccard distance) can be used for SDC.

2 SDC

We start by defining the succinct dynamic covering (SDC) problem in Section 2.1. Then, in Section 2.2 we summarize the main technical results achieved by this paper.

2.1 Problem Definition

We now formally define the SDC problem.

Definition 2.1 (SDC).
Given an offline input consisting of a set system $(\mathcal{X}, \mathcal{I})$ with $n$ elements (a.k.a. items) $\mathcal{X}$ and $m$ sets $\mathcal{I}$, and an integer $k \ge 1$, devise a coverage oracle such that given a dynamic query $Q \subseteq \mathcal{X}$, the oracle finds a $\mathcal{J} \subseteq \mathcal{I}$ such that $|\mathcal{J}| \le k$ and $|(\bigcup_{S \in \mathcal{J}} S) \cap Q|$ is as large as possible.

Definition 2.2 (Coverage Oracle). A Coverage Oracle for SDC consists of two stages:

1. Static Stage: Given integers $m$, $n$, $k$, and set system $(\mathcal{X}, \mathcal{I})$ with $|\mathcal{X}| = n$ and $|\mathcal{I}| = m$, build a data structure $D$.

2. Dynamic Stage: Given a dynamic query $Q \subseteq \mathcal{X}$, use $D$ to return $\mathcal{J} \subseteq \mathcal{I}$ with $|\mathcal{J}| \le k$ as a solution to SDC.

Note that our two constraints on a solution for SDC are illustrated by the two stages above. (1) We are interested in building an offline data structure $D$, and only use $D$ to answer queries. Typically, we want to maintain a small data structure, certainly $o(mn)$, and maybe as little as $O((m+n)\,\mathrm{polylog}(mn))$ or even $O(m+n)$. Therefore, we cannot store the entire set system. (2) Unlike the traditional max-coverage problem where the entire set of elements $\mathcal{X}$ needs to be covered, in SDC we are given queries dynamically. Therefore, we want a coverage oracle that returns good solutions for all queries.

Given the space limitation of SDC, we cannot hope to exactly solve SDC (for all dynamic input queries). The goal of this paper is to explore approximate solutions for SDC, given a specific space constraint on the offline data structure $D$.

We define the approximation ratio of an oracle as the worst case, taken over all inputs, of the ratio between the coverage of $Q$ by the optimal solution and the coverage of $Q$ by the output of the oracle. We allow the approximation ratio to be a function of $n$, $m$, and $k$, and denote it by $\alpha(n, m, k)$.
More precisely, given a coverage oracle $A$, if on inputs $k$, $\mathcal{X}$, $\mathcal{I}$, $Q$ (where implicitly $n = |\mathcal{X}|$ and $m = |\mathcal{I}|$) the oracle $A$ returns $\mathcal{J} \subseteq \mathcal{I}$, we denote the size of the coverage as $A(k, \mathcal{X}, \mathcal{I}, Q) := |(\bigcup_{S \in \mathcal{J}} S) \cap Q|$. Similarly, we denote the coverage of the optimal solution by $OPT(k, \mathcal{X}, \mathcal{I}, Q) := \max \{ |(\bigcup_{S \in \mathcal{J}^*} S) \cap Q| : \mathcal{J}^* \subseteq \mathcal{I}, |\mathcal{J}^*| \le k \}$. We then express the approximation ratio $\alpha(n, m, k)$ as follows.

\[ \alpha(n, m, k) = \max \frac{OPT(k, \mathcal{X}, \mathcal{I}, Q)}{A(k, \mathcal{X}, \mathcal{I}, Q)} \]

Here the maximum is taken over set systems $(\mathcal{X}, \mathcal{I})$ with $|\mathcal{X}| = n$ and $|\mathcal{I}| = m$, and queries $Q \subseteq \mathcal{X}$.

We will also be concerned with randomized coverage oracles. Note that, when we devise randomized coverage oracles, we use randomization only in the static stage, i.e., in the construction of the data structure. We then let the expected approximation ratio be the worst-case expected performance of the oracle as compared to the optimal solution.

\[ \alpha(n, m, k) = \max \, \mathbb{E}\left[ \frac{OPT(k, \mathcal{X}, \mathcal{I}, Q)}{A(k, \mathcal{X}, \mathcal{I}, Q)} \right] \qquad (1) \]

The expectation in the above expression is over the random coins flipped by the static stage of the oracle, and the maximization is over $\mathcal{X}$, $\mathcal{I}$, $Q$ as before. We elaborate on this benchmark in Section 3.

Table 1: Summary of results for SDC, giving the approximation ratio, the space constraint on the coverage oracle, and the nature of the bound: upper bound (UB) or lower bound (LB), and deterministic (Det.) or randomized (Rand.).

Approximation Ratio                                                                                  Storage                           Bound
$O(\min(\frac{m}{k}, \sqrt{\frac{n}{k}}))$                                                           $\widetilde{O}(n)$                Det. UB
$O(\min(\frac{m^{\epsilon}}{\sqrt{k}}, \sqrt{\frac{n}{k}}))$                                         $\widetilde{O}(nm^{1-2\epsilon})$ Rand. UB
$\Omega(\min(\frac{m^{\epsilon-\delta_1}}{k\sqrt{k}}, \frac{n^{1/2-\delta_2}}{k\sqrt{k}}))$         $\widetilde{O}(nm^{1-2\epsilon})$ Det. LB

We study the space-approximation tradeoff, i.e., how the (expected) approximation ratio improves as the amount of space allowed for $D$ is increased.
In our lower bounds, we are not specifically concerned with the time taken to compute the data structure or answer queries. Therefore, our lower bounds are purely information-theoretic: we calculate the amount of information we are required to store if we are to guarantee a specific approximation ratio, independent of computational concerns. Our lower bounds are particularly novel and striking in that they assume nothing about the data structure, which may be an arbitrary sequence of bits. We establish our lower bounds via a novel application of the probabilistic method that may be of independent interest.

Even though we focus on space versus approximation, and not on runtime, fortunately the coverage oracles in our upper bounds can be implemented efficiently (both static and dynamic stage). Moreover, using our upper bounds to trade approximation for space yields, as a side effect, an improvement in runtime when answering a query. In particular, observe that if no sparsification of the data is done up front, then answering each query using the standard greedy approximation algorithm for max k-cover [11] takes $O(mn)$ time. Our oracles, presented in Section 3, spend $O(mn)$ time up front building a data structure of size $O(b)$, where $b$ is a parameter of the oracle between $n$ and $nm$. In the dynamic stage, however, answering a query now takes $O(b)$ time, since we use the greedy algorithm for max k-cover on a "sparse" set system. Therefore, the dynamic stage becomes faster as we decrease the size of the data structure. In fact, this increase in speed is not restricted to an algorithmic speedup as described above. It is likely that there will also be a speedup due to architectural reasons, since a smaller amount of data needs to be kept in memory.
Therefore, trading off approximation for space yields an incidental speedup in runtime which bodes well for the dynamic nature of the queries.

2.2 Summary of results

Table 1 summarizes the main results obtained in this paper for SDC input with $n$ elements, $m$ sets, and integer $k \ge 1$. The lower bound in the table is for any nonnegative constants $\delta_1$, $\delta_2$ not both 0, and the randomized upper bound is parameterized by $\epsilon$ with $0 \le \epsilon \le 1/2$. The upper and lower bounds are developed in Sections 3 and 4 respectively.

3 Upper Bounds

In this section, we show a coverage oracle that trades off space and approximation ratio. We designate a tradeoff parameter $\epsilon$, where $0 \le \epsilon \le 1/2$. For any such $\epsilon$, we get an $O\left(\frac{\min(m^{\epsilon}, \sqrt{n})}{\sqrt{k}}\right)$-approximate coverage oracle that stores $\widetilde{O}(nm^{1-2\epsilon})$ bits. Therefore, setting a small value of $\epsilon$ achieves a better approximation ratio, at the expense of storage space. As is common practice, we use $\widetilde{O}(\cdot)$ to denote suppressing polylogarithmic factors in $n$ and $m$; this is reasonable when the guarantees are super-polylogarithmic, as is the case here.

The oracle we show is randomized, in the sense that the static stage flips some random coins. The data structure constructed is a random variable in the internal coin flips of the static stage of the oracle. We measure the expected approximation ratio (a.k.a. approximation ratio, when clear from context) of the oracle, as defined in Equation (1). For every fixed query $Q$ independent of the random coins used in constructing the data structure, this ratio is attained in expectation. In other words, our adversarial model is that of an oblivious adversary: someone trying to fool our oracle may choose any query they like, but their choice cannot depend on knowledge of the random choices made in constructing the data structure.
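To get a feel for the tradeoff stated at the start of this section, the following Python sketch evaluates the two expressions for a few values of $\epsilon$. It is only illustrative: the function names are hypothetical, all hidden constants and polylogarithmic factors are dropped, and the clamp at 1 is an added assumption reflecting that an approximation ratio is never below 1.

```python
import math


def approx_ratio_bound(n, m, k, eps):
    """min(m^eps, sqrt(n)) / sqrt(k), with constants dropped.
    Clamped at 1 (an assumption added here, since a ratio cannot be below 1)."""
    return max(1.0, min(m ** eps, math.sqrt(n)) / math.sqrt(k))


def space_bound_bits(n, m, eps):
    """n * m^(1 - 2*eps), with constants and polylog factors dropped."""
    return n * m ** (1 - 2 * eps)


if __name__ == "__main__":
    n, m, k = 10_000, 10_000, 4
    for eps in (0.0, 0.25, 0.5):
        print(eps, approx_ratio_bound(n, m, k, eps), space_bound_bits(n, m, eps))
```

With these example parameters, $\epsilon = 0$ corresponds to storing on the order of $nm$ bits, while $\epsilon = 1/2$ corresponds to on the order of $n$ bits with a ratio that grows to about $\sqrt{m}/\sqrt{k}$.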
In Section 4 we will see that our oracle attains a space-approximation tradeoff that is essentially optimal when compared with oracles that are deterministic. In other words, no deterministic oracle can do substantially better. We leave open the questions of whether a better randomized oracle is possible, and whether an equally good deterministic oracle exists.

3.1 Main Result and Roadmap

The following theorem states the main result of this section.

Theorem 3.1. For every $\epsilon$ with $0 \le \epsilon \le 1/2$, there is a randomized coverage oracle for SDC that achieves an $O\left(\frac{\min(m^{\epsilon}, \sqrt{n})}{\sqrt{k}}\right)$ approximation and stores $\widetilde{O}(nm^{1-2\epsilon})$ bits.

The remainder of this section, leading up to the above result, is organized as follows. Before proving Theorem 3.1, to build intuition we show in Section 3.2 (Remark 3.2) a much simpler deterministic oracle, with a much weaker approximation guarantee. Then, we prove Theorem 3.1 in two parts. First, in Section 3.3, we show a randomized coverage oracle that stores $\widetilde{O}(nm^{1-2\epsilon})$ bits and achieves an $O(m^{\epsilon}/\sqrt{k})$ approximation in expectation. Then, in Section 3.4, we show a deterministic oracle that achieves an $O(\sqrt{n}/\sqrt{k})$ approximation and stores $\widetilde{O}(n)$ bits. Combining the two oracles into a single oracle in the obvious way yields Theorem 3.1.

3.2 Simple Deterministic Oracle

Remark 3.2. There is a simple deterministic oracle that attains an $m/k$ approximation with $\widetilde{O}(n)$ space. The static stage proceeds as follows: Given set system $(\mathcal{X}, \mathcal{I})$, for each $i \in \mathcal{X}$ we "remember" one set $S \in \mathcal{I}$ with $i \in S$ (breaking ties arbitrarily). In other words, for each $S \in \mathcal{I}$ we define $\widehat{S} \subseteq S$ such that $\{\widehat{S} : S \in \mathcal{I}\}$ is a partition of $\mathcal{X}$. We then store the "sparsified" set system $(\mathcal{X}, \widehat{\mathcal{I}} = \{\widehat{S} : S \in \mathcal{I}\})$. It is clear that this can be done in linear time by a trivial greedy algorithm.
Moreover, $(\mathcal{X}, \widehat{\mathcal{I}})$ can be stored in $\widetilde{O}(n)$ space as an $n \times m$ bipartite graph with $n$ edges.

The dynamic stage is straightforward: when given a query $Q$, we simply return the indices of the $k$ sets in $\widehat{\mathcal{I}}$ that collectively cover as much of $Q$ as possible. It is clear that this gives an $m/k$ approximation. Moreover, since $\widehat{\mathcal{I}}$ is a partition of $\mathcal{X}$, it can be accomplished by a trivial greedy algorithm in polynomial time.

Next we use randomization to show a much better, and much more involved, upper bound that trades off approximation and space.

3.3 An $O(m^{\epsilon}/\sqrt{k})$ Approximation with $\widetilde{O}(nm^{1-2\epsilon})$ Space

Consider the set system $(\mathcal{X}, \mathcal{I})$, where $\mathcal{X}$ is the set of items and $\mathcal{I}$ is the family of sets. We assume without loss of generality that each item is in some set. We define a randomized oracle for building a data structure, which is a "sparsified" version of $(\mathcal{X}, \mathcal{I})$. Namely, for every $S \in \mathcal{I}$ we define $\widehat{S} \subseteq S$, and store the set system $(\mathcal{X}, \widehat{\mathcal{I}} = \{\widehat{S}\}_{S \in \mathcal{I}})$. We require that $(\mathcal{X}, \widehat{\mathcal{I}})$ can be stored in $\widetilde{O}(nm^{1-2\epsilon})$ space. We construct the data structure in two stages, as follows.

• Label all items in $\mathcal{X}$ "uncovered" and all sets in $\mathcal{I}$ "unchosen"

• Stage 1: While there exists an unchosen set $S \in \mathcal{I}$ containing at least $\frac{n}{m^{\epsilon}\sqrt{k}}$ uncovered items
  – Let $\widehat{S}$ be the set of uncovered items in $S$.
  – Relabel all items in $\widehat{S}$ as "covered" and "significant"
  – Relabel $S$ as "chosen" and "significant"

• Stage 2: For every remaining "unchosen" set $S$
  – Choose $\frac{n}{m^{2\epsilon}}$ "uncovered" items $\widehat{S} \subseteq S$ uniformly at random from the uncovered items in $S$ (if there are fewer than $\frac{n}{m^{2\epsilon}}$ such items, then let $\widehat{S}$ be all of them).
  – Relabel each item in $\widehat{S}$ as "covered" and "insignificant"
  – Relabel $S$ as "chosen" and "insignificant"

• Label every uncovered item as "uncovered" and "insignificant"

When presented with a query $Q \subseteq \mathcal{X}$, we use the stored data structure $(\mathcal{X}, \widehat{\mathcal{I}})$ in the obvious way: namely, we find $\widehat{S}_1, \ldots, \widehat{S}_k \in \widehat{\mathcal{I}}$ maximizing $|(\bigcup_{i=1}^{k} \widehat{S}_i) \cap Q|$, and return the names of the corresponding original sets $S_1, \ldots, S_k$. However, this problem is NP-hard, so it cannot be solved exactly in polynomial time in general unless P = NP. Nevertheless, we can instead use the greedy algorithm for max k-cover to get a constant-factor approximation [11]; this will not affect our asymptotic guarantee on the approximation ratio. The following two lemmas complete the proof that the above oracle achieves an $O(m^{\epsilon}/\sqrt{k})$ approximation with $\widetilde{O}(nm^{1-2\epsilon})$ space.

Lemma 3.3. The data structure $(\mathcal{X}, \widehat{\mathcal{I}})$ can be stored using $\widetilde{O}(nm^{1-2\epsilon})$ bits.

Proof. We store the set system as a bipartite graph representing the containment relation between items and sets. To show that the bipartite graph can be stored in the required space, it suffices to show that $(\mathcal{X}, \widehat{\mathcal{I}})$ is "sparse"; namely, that the total number of edges $(x, \widehat{S}) \in \mathcal{X} \times \widehat{\mathcal{I}}$ such that $x \in \widehat{S}$ is $O(nm^{1-2\epsilon})$. We account for the edges created in stages 1 and 2 separately.

1. Every significant item is connected to a single set. This creates at most $n$ edges.

2. For every insignificant set, we store at most $nm^{-2\epsilon}$ items. This creates at most $m \cdot nm^{-2\epsilon} = nm^{1-2\epsilon}$ edges.

Since $\epsilon \le 1/2$, we have $n \le nm^{1-2\epsilon}$, so the total number of edges is $O(nm^{1-2\epsilon})$.

Lemma 3.4. For every query $Q$, the oracle returns sets $S_1, \ldots, S_k$ such that

\[ \mathbb{E}\left[ \left| \left( \bigcup_{i=1}^{k} S_i \right) \cap Q \right| \right] \ge \frac{\left| \left( \bigcup_{i=1}^{k} S_i^* \right) \cap Q \right|}{O(m^{\epsilon}/\sqrt{k})} \]

for any $S_1^*, \ldots, S_k^* \in \mathcal{I}$.

Note that $S_1, \ldots, S_k$ are random variables in the internal coin flips of the static stage that constructs the data structure.
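The two-stage construction and the greedy query step described above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the names `build_sparsified` and `answer_query` are hypothetical, items are assumed to be integers, the Stage 2 sample size $n/m^{2\epsilon}$ is rounded up, sets are scanned in index order, and a fixed seed is used by default for reproducibility.

```python
import math
import random


def build_sparsified(sets, n, k, eps, rng=None):
    """Static stage sketch: returns sparse[i], a subset of sets[i]."""
    rng = rng or random.Random(0)              # fixed default seed: illustrative choice
    m = len(sets)                              # assumes m >= 1
    threshold = n / (m ** eps * math.sqrt(k))  # Stage 1 cutoff n / (m^eps * sqrt(k))
    budget = math.ceil(n / m ** (2 * eps))     # Stage 2 sample size (rounding up is an assumption)
    uncovered = set().union(*sets)             # every item is assumed to lie in some set
    sparse = [set() for _ in sets]
    significant = set()

    # Stage 1: one pass suffices, because the number of uncovered items in a
    # set only shrinks, so a set that fails the threshold never passes later.
    for i, s in enumerate(sets):
        fresh = s & uncovered
        if len(fresh) >= threshold:
            sparse[i] = fresh
            uncovered -= fresh
            significant.add(i)

    # Stage 2: each remaining set keeps a uniform sample of its uncovered items.
    for i, s in enumerate(sets):
        if i in significant:
            continue
        fresh = sorted(s & uncovered)          # random.sample needs a sequence
        picked = set(rng.sample(fresh, min(budget, len(fresh))))
        sparse[i] = picked
        uncovered -= picked
    return sparse


def answer_query(sparse, query, k):
    """Dynamic stage sketch: greedy max k-cover on the sparsified sets only."""
    uncovered = set(query)
    picked = []
    for _ in range(k):
        rest = [i for i in range(len(sparse)) if i not in picked]
        if not rest:
            break
        best = max(rest, key=lambda i: len(uncovered & sparse[i]))
        if not (uncovered & sparse[best]):
            break
        picked.append(best)
        uncovered -= sparse[best]
    return picked
```

A possible use is `sparse = build_sparsified(sets, n, k, eps)` once, followed by `answer_query(sparse, Q, k)` per query; the returned indices name the original sets $S_1, \ldots, S_k$, while only the sparsified sets are consulted at query time.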
The expectation in the statement of the lemma is over these random coins.

Proof. We fix an optimal choice for $S_1^*, \ldots, S_k^* \in \mathcal{I}$, and denote $OPT = |(\bigcup_{i=1}^{k} S_i^*) \cap Q|$. Since, by construction, $\widehat{S} \subseteq S$ for all $S \in \mathcal{I}$, it suffices to show that the output of the oracle satisfies $|(\bigcup_{i=1}^{k} \widehat{S}_i) \cap Q| \ge \frac{OPT}{O(m^{\epsilon}/\sqrt{k})}$ in expectation. Moreover, since the dynamic stage algorithm finds a constant-factor approximation to $\max \{ |(\bigcup_{i=1}^{k} \widehat{S}_i) \cap Q| : \widehat{S}_1, \ldots, \widehat{S}_k \in \widehat{\mathcal{I}} \}$, it is sufficient to show that there exist $S_1, \ldots, S_k \in \mathcal{I}$ with $\mathbb{E}[|(\bigcup_{i=1}^{k} \widehat{S}_i) \cap Q|] \ge \frac{OPT}{O(m^{\epsilon}/\sqrt{k})}$.

We distinguish two cases, based on whether most of the items $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ covered by the optimal solution are in significant or insignificant sets. We use the "significant" and "insignificant" designations as used in the static stage algorithm. Moreover, we refer to $\widehat{S} \in \widehat{\mathcal{I}}$ as significant (insignificant, resp.) when the corresponding $S \in \mathcal{I}$ is significant (insignificant, resp.).

1. At least half of $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ are significant items: Notice that, by construction, there are at most $m^{\epsilon}\sqrt{k}$ significant sets in $\widehat{\mathcal{I}}$. Moreover, the significant items are precisely those covered by the significant sets of $\widehat{\mathcal{I}}$, and those sets form a partition of the significant items. Therefore, by the pigeonhole principle there are some $\widehat{S}_1, \ldots, \widehat{S}_k \in \widehat{\mathcal{I}}$ such that $\bigcup_{i=1}^{k} \widehat{S}_i$ contains at least a $\frac{k}{m^{\epsilon}\sqrt{k}} = \frac{\sqrt{k}}{m^{\epsilon}}$ fraction of the significant items in $(\bigcup_{i=1}^{k} S_i^*) \cap Q$. This gives the desired $O(m^{\epsilon}/\sqrt{k})$ approximation.

2. At least half of $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ are insignificant items: In this case, at least half the items $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ covered by the optimal solution are contained in the insignificant members of $\{S_1^*, \ldots, S_k^*\}$. Recall that any insignificant set in $\mathcal{I}$ contains at most $\frac{n}{m^{\epsilon}\sqrt{k}}$ insignificant items.
Therefore, the algorithm includes each element of an insignificant $S_i^*$ in $\widehat{S_i^*}$ with probability at least $\frac{n}{m^{2\epsilon}} \big/ \frac{n}{m^{\epsilon}\sqrt{k}}$, which is at least $\sqrt{k}/m^{\epsilon}$. Thus, every insignificant item in $(\bigcup_{i=1}^{k} S_i^*)$ is in $(\bigcup_{i=1}^{k} \widehat{S_i^*})$ with probability at least $\sqrt{k}/m^{\epsilon}$. This gives that the expected size of $(\bigcup_{i=1}^{k} \widehat{S_i^*}) \cap Q$ is at least $\frac{OPT}{O(m^{\epsilon}/\sqrt{k})}$. Taking $S_i = S_i^*$ completes the proof.

3.4 An $O(\sqrt{n/k})$ Approximation with $\widetilde{O}(n)$ Space

This coverage oracle is similar to the one in the previous section, though it is much simpler. Moreover, it is deterministic. Indeed, we construct the data structure by the following greedy algorithm, which resembles the greedy algorithm for max k-cover.

• Label all items in $\mathcal{X}$ "uncovered" and all sets in $\mathcal{I}$ "unchosen"

• While there are unchosen sets
  – Find the unchosen set $S \in \mathcal{I}$ containing the most uncovered items
  – Let $\widehat{S}$ be the set of uncovered items in $S$.
  – Relabel all items in $\widehat{S}$ as "covered"
  – Relabel $S$ as "chosen"

Observe that $\widehat{\mathcal{I}}$ is a partition of $\mathcal{X}$. When presented with a query $Q \subseteq \mathcal{X}$, we use the data structure $(\mathcal{X}, \widehat{\mathcal{I}} = \{\widehat{S} : S \in \mathcal{I}\})$ in the obvious way. Namely, we find the sets $\widehat{S}_1, \ldots, \widehat{S}_k \in \widehat{\mathcal{I}}$ maximizing $|(\bigcup_{i=1}^{k} \widehat{S}_i) \cap Q|$, and output the corresponding non-sparse sets $S_1, \ldots, S_k$. This can easily be done in polynomial time by using the obvious greedy algorithm, since $\widehat{\mathcal{I}}$ is a partition of $\mathcal{X}$.

Note that the oracle described above is very similar to the oracle from Section 3.2: the dynamic stage is identical. The static stage, however, needs to build the partition using a specific greedy ordering, as opposed to the arbitrary ordering used in Section 3.2. The following two lemmas complete the proof that the oracle achieves an $O(\sqrt{n/k})$ approximation with $\widetilde{O}(n)$ space.

Lemma 3.5. The data structure $(\mathcal{X}, \widehat{\mathcal{I}})$ can be stored using $\widetilde{O}(n)$ bits.

Proof.
Observe that each item is contained in exactly one $\widehat{S} \in \widehat{\mathcal{I}}$. Therefore, the bipartite graph representing the set system $(\mathcal{X}, \widehat{\mathcal{I}})$ has at most $n$ edges. This establishes the lemma.

Lemma 3.6. For every query $Q$, the oracle returns sets $S_1, \ldots, S_k$ with

\[ \left| \left( \bigcup_{i=1}^{k} S_i \right) \cap Q \right| \ge \frac{\left| \left( \bigcup_{i=1}^{k} S_i^* \right) \cap Q \right|}{O(\sqrt{n/k})} \]

for any $S_1^*, \ldots, S_k^* \in \mathcal{I}$.

Proof. Fix an optimal choice of $S_1^*, \ldots, S_k^*$, and denote $OPT = |(\bigcup_{i=1}^{k} S_i^*) \cap Q|$. Recall that the oracle finds $\widehat{S}_1, \ldots, \widehat{S}_k \in \widehat{\mathcal{I}}$ maximizing $|(\bigcup_{i=1}^{k} \widehat{S}_i) \cap Q|$, and then outputs the corresponding original sets $S_1, \ldots, S_k$. It suffices to show that there are some $\widehat{S}_1, \ldots, \widehat{S}_k \in \widehat{\mathcal{I}}$ with $|(\bigcup_{i=1}^{k} \widehat{S}_i) \cap Q| \ge OPT / O(\sqrt{n/k})$.

We distinguish two cases, based on whether most of $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ are in big or small sets in $\widehat{\mathcal{I}}$. Recall that $\widehat{\mathcal{I}}$ forms a partition of $\mathcal{X}$. We say $\widehat{S} \in \widehat{\mathcal{I}}$ is "significant" if $|\widehat{S}| \ge \sqrt{n/k}$, otherwise $\widehat{S}$ is "insignificant". Similarly, we say an item $i \in \mathcal{X}$ is "significant" if it falls in a significant set in $\widehat{\mathcal{I}}$, otherwise it is "insignificant". Notice that there are at most $\frac{n}{\sqrt{n/k}} = \sqrt{nk}$ significant sets.

First, we consider the case where at least half the items in $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ are significant. Since there are at most $\sqrt{nk}$ significant sets in $\widehat{\mathcal{I}}$, by the pigeonhole principle there are $k$ of them that collectively cover a $k/\sqrt{nk} = \sqrt{k/n}$ fraction of all significant items in $(\bigcup_{i=1}^{k} S_i^*) \cap Q$. This would guarantee the $O(\sqrt{n/k})$ approximation, as needed.

Next, we consider the case where at least half of $(\bigcup_{i=1}^{k} S_i^*) \cap Q$ are insignificant. By examining the greedy algorithm of the static stage, it is easy to see that each $S \in \mathcal{I}$ contains at most $\sqrt{n/k}$ insignificant items. Therefore, there are at most $k \cdot \sqrt{n/k} = \sqrt{nk}$ insignificant items in $(\bigcup_{i=1}^{k} S_i^*)$. Therefore we deduce that $OPT = |(\bigcup_{i} S_i^*) \cap Q| \le 2\sqrt{kn}$.
Since the optimal covers $O(\sqrt{kn})$ items in $Q$, it suffices for an $O(\sqrt{n/k})$ approximation to show that there are $\widehat{S}_1, \ldots, \widehat{S}_k \in \widehat{\mathcal{I}}$ that collectively cover $k$ items of $Q$ (or all of $Q$ when $|Q| < k$). It is easy to see that this is indeed the case, since $\widehat{\mathcal{I}}$ is a partition of $X$. This completes the proof.

4 Lower Bounds

This section develops lower bounds for the SDC problem. We consider deterministic oracles that store a datastructure of size $b(n, m, k)$ for set systems with $n$ items, $m$ sets, and maximum number of allowed sets $k$. Moreover, we assume that $n \le b(n, m, k) \le nm$, since no nontrivial positive result is possible when $b(n, m, k) = o(n)$, and a perfect approximation ratio of 1 is possible when $b(n, m, k) = \Omega(nm)$.

4.1 Main Result and Roadmap

The main result of this section is stated in the following theorem, which says that our randomized oracle in the previous section achieves a space-approximation tradeoff that essentially matches the best possible for any deterministic oracle.

Theorem 4.1. Consider any deterministic oracle that stores a datastructure of size at most $b(n, m, k)$ bits, where $n \le b(n, m, k) \le nm$. Let $\epsilon(n, m, k)$ be such that $b(n, m, k) = nm^{1 - 2\epsilon(n,m,k)}$. When $m^{\epsilon(n,m,k)} \le \sqrt{n}$, the oracle does not attain an approximation ratio of $O\!\left(\frac{m^{\epsilon(n,m,k) - \delta}}{k\sqrt{k}}\right)$ for any constant $\delta > 0$. Moreover, when $\sqrt{n} \le m^{\epsilon(n,m,k)}$ the oracle does not attain an approximation ratio of $O\!\left(\frac{n^{1/2 - \delta}}{k\sqrt{k}}\right)$ for any $\delta > 0$.

The proof of the theorem above is somewhat involved. Therefore, to simplify the presentation, we prove in Section 4.2 a slight simplification of Theorem 4.1 that captures all the main ideas: our simplification sets $k = 1$, and proves the $O\!\left(\frac{m^{\epsilon(n,m,k) - \delta}}{k\sqrt{k}}\right)$ bound on the approximation ratio for $m^{\epsilon(n,m,k)} \le \sqrt{n}$. Then, in Section 4.3 we prove the bound on the approximation ratio for the case $\sqrt{n} \le m^{\epsilon(n,m,k)}$, still maintaining $k = 1$.
Finally, in Section 4.4, we demonstrate how to modify our proofs for any $k$, yielding Theorem 4.1.

We fix $\delta > 0$. For the remainder of the section, we use $b$ and $\epsilon$ as shorthand for $b(n, m, k)$ and $\epsilon(n, m, k)$, respectively. We let $\alpha(n, m, k)$ be the approximation ratio of the oracle, and use $\alpha$ as shorthand. Observe that $0 \le \epsilon \le 1/2$.

4.2 Proof of a Simpler Lower Bound

We simplify Theorem 4.1 by assuming $k = 1$ and $m^{\epsilon} \le \sqrt{n}$. The result is the following proposition, stated using the shorthand notation described above.

Proposition 4.2. Fix $k = 1$ and parameter $\epsilon$ with $0 \le \epsilon \le 1/2$. Assume $m^{\epsilon} \le \sqrt{n}$. Consider any deterministic oracle that stores a datastructure of size at most $b = nm^{1-2\epsilon}$ bits. The oracle does not attain an approximation ratio of $O(m^{\epsilon - \delta})$ for any constant $\delta > 0$.

We assume the approximation ratio $\alpha$ attained by the oracle is $O(m^{\epsilon - \delta})$ and derive a contradiction. The proof uses the probabilistic method (see [2]). We begin by defining a distribution on set systems, and then go on to show that this distribution “fools” a small coverage oracle with positive probability.

4.2.1 Defining a Distribution $\mathcal{D}$ on Set Systems

We will show that there is a set system $(X, \mathcal{I})$ and a query $Q$ that forces the algorithm to output a set $S \in \mathcal{I}$ that is not within $\alpha$ of optimal. We use the probabilistic method. Namely, we exhibit a distribution $\mathcal{D}$ over set systems $(X, \mathcal{I})$ such that, for every deterministic oracle storing a datastructure of size $b$, there exists with non-zero probability a query $Q$ for which the oracle outputs a set with approximation ratio worse than $\alpha$.
To show this, we draw two set systems i.i.d. from $\mathcal{D}$, and show that with non-zero probability both of the following hold: the two set systems are not distinguished by the coverage oracle, and moreover there exists a query $Q$ that requires the algorithm to return different answers for the two set systems in order to achieve an $O(m^{\epsilon - \delta})$ approximation.

We define $\mathcal{D}$ as follows. Given the ground set $X = \{1, \ldots, n\}$, we let $\mathcal{I} = \{A_i\}_{i=1}^{m}$ and draw $A_1, \ldots, A_m$ i.i.d. as follows: we let $A_i$ be a subset of $X$ of size $nm^{-\epsilon}$ drawn uniformly at random.

4.2.2 Sampling Twice from $\mathcal{D}$ and Collisions

Next, we draw two set systems $(X, \mathcal{I} = \{A_i\}_{i=1}^{m})$ and $(X, \mathcal{I}' = \{A_i'\}_{i=1}^{m})$ i.i.d. from $\mathcal{D}$, as discussed above. First, we lower bound the probability that $(X, \mathcal{I})$ and $(X, \mathcal{I}')$ are not distinguished by the coverage oracle. We call such an occurrence a “collision”.

Lemma 4.3. The probability that the same datastructure is stored for $(X, \mathcal{I})$ and $(X, \mathcal{I}')$ is at least $2^{-b}$.

Proof. There are $2^{b}$ possible datastructures. Let $p_i$ denote the probability that, when presented with random $(X, \mathcal{I}) \sim \mathcal{D}$, the oracle stores the $i$'th datastructure. We can write the probability of “collision” of the two i.i.d. samples $(X, \mathcal{I})$ and $(X, \mathcal{I}')$ as $\sum_{i=1}^{2^{b}} p_i^2$. However, since $\sum_i p_i = 1$, this expression is minimized (by convexity) when $p_i = 2^{-b}$ for all $i$. Plugging into the above expression gives a lower bound of $2^{b} \cdot 2^{-2b} = 2^{-b}$, as required.

4.2.3 Fooling Queries and Candidates

Next, we lower bound the probability that a query $Q$ exists requiring two different answers for $(X, \mathcal{I})$ and $(X, \mathcal{I}')$ in order to get the desired $\alpha = O(m^{\epsilon - \delta})$ approximation. We call such a query $Q$ a fooling query. We define a set of queries that are “candidates” for being a fooling query: a set $Q \subseteq X$ is called a candidate query if $Q = A_i \cup A_{i'}'$ for some $i \ne i'$.
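The distribution $\mathcal{D}$ and the candidate queries just defined can be illustrated with a short Python sketch. The names `sample_set_system` and `candidate_queries` are hypothetical, indices are 0-based, and rounding $nm^{-\epsilon}$ to the nearest integer is an assumption made here (the text treats this quantity as an integer).

```python
import random
from itertools import product
from typing import FrozenSet, Iterator, List, Tuple


def sample_set_system(n: int, m: int, eps: float, rng: random.Random) -> List[FrozenSet[int]]:
    """Draw one set system: m i.i.d. uniform subsets of {1..n} of size n * m**(-eps)."""
    size = max(1, round(n * m ** (-eps)))  # assumed rounding of n * m^(-eps)
    ground = range(1, n + 1)
    return [frozenset(rng.sample(ground, size)) for _ in range(m)]


def candidate_queries(
    system: List[FrozenSet[int]], system_prime: List[FrozenSet[int]]
) -> Iterator[Tuple[int, int, FrozenSet[int]]]:
    """Yield (i, i', Q) with Q = A_i ∪ A'_{i'} for every pair of distinct indices."""
    for i, j in product(range(len(system)), range(len(system_prime))):
        if i != j:
            yield i, j, system[i] | system_prime[j]


# Two independent draws, as in Section 4.2.2 (toy parameters chosen for illustration).
rng = random.Random(0)
system = sample_set_system(n=100, m=16, eps=0.5, rng=rng)
system_prime = sample_set_system(n=100, m=16, eps=0.5, rng=rng)
candidates = list(candidate_queries(system, system_prime))
```

With $m$ sets per system there are $m(m-1)$ candidate queries, each of size between $nm^{-\epsilon}$ and $2nm^{-\epsilon}$.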
In other words, a query is a candidate if it is the union of a set from $(X, \mathcal{I})$ and a set from $(X, \mathcal{I}')$ with different indices.

Ideally, a candidate $Q = A_i \cup A_{i'}'$ would be a fooling query by forcing the oracle to output $i$ for $(X, \mathcal{I})$ and $i'$ for $(X, \mathcal{I}')$ in order to guarantee the desired approximation. However, this need not be the case: consider for instance the case when, for some $j \ne i, i'$, both $A_j$ and $A_j'$ have large intersection with $Q$, making it acceptable to output $j$ for both. We will show that the probability that none of the candidate queries is a fooling query is strictly less than $2^{-b}$ when $n$ and $m$ are sufficiently large. Doing so would complete the proof: collision occurs with probability $\ge 2^{-b}$, and a fooling query exists with probability $> 1 - 2^{-b}$, and therefore both occur simultaneously with positive probability. This would yield the desired contradiction.

4.2.4 The Probability that None of the Candidates is Fooling is Small

We now upper bound the probability that none of the candidates is a fooling query. Observe that if candidate $Q = A_i \cup A_{i'}'$ is not a fooling query, then there exists $A \in (\mathcal{I} \cup \mathcal{I}') \setminus \{A_i, A_{i'}'\}$ with $|A \cap Q| \ge nm^{-\epsilon}/\alpha$. Therefore one of the following must be true:

1. There exists $A \in (\mathcal{I} \cup \mathcal{I}') \setminus \{A_i, A_{i'}'\}$ with $|A \cap A_i| \ge nm^{-\epsilon}/2\alpha = \Omega(nm^{-2\epsilon + \delta})$.
2. There exists $A \in (\mathcal{I} \cup \mathcal{I}') \setminus \{A_i, A_{i'}'\}$ with $|A \cap A_{i'}'| \ge nm^{-\epsilon}/2\alpha = \Omega(nm^{-2\epsilon + \delta})$.

Therefore, if none of the candidates were fooling queries, then there would be many “pairs” of sets in $\mathcal{I} \cup \mathcal{I}'$ that have an intersection substantially larger than the expected size of $nm^{-2\epsilon}$. This seems very unlikely. Indeed, the remainder of this proof will demonstrate just that.

If none of the candidates are fooling queries, then by examining (1) and (2) above we deduce the following.
There exists⁸ a set of pairs $P \subseteq (\mathcal{I} \cup \mathcal{I}') \times (\mathcal{I} \cup \mathcal{I}')$ such that:

1. $|P| \ge m - 2 = \Omega(m)$
2. The undirected graph with nodes $\mathcal{I} \cup \mathcal{I}'$ and edges $P$ is bipartite. Moreover, every node in the left part has degree at most 1. Thus $P$ is acyclic.
3. If $(B, C) \in P$ then $|B \cap C| \ge \Omega(nm^{-2\epsilon + \delta})$

⁸ Consider constructing $P$ as follows: for candidate query $Q = A_1 \cup A_2'$, find the set in $(\mathcal{I} \cup \mathcal{I}') \setminus \{A_1, A_2'\}$ with a large intersection with one of $A_1$ or $A_2'$ as in (1) or (2). Say for instance we find that $A_7$ has a large intersection with $A_1$. We include $(A_1, A_7)$ in $P$, mark both $A_1$ and $A_7$ as “touched”, and designate $A_1$ a “left” node and $A_7$ a “right” node. Then, we repeat the process with some candidate $Q' = A_i \cup A_{i'}'$ for some “untouched” $A_i$ and $A_{i'}'$. We keep repeating until there are no such candidates. Throughout this greedy process, we mark at most two members of $\mathcal{I} \cup \mathcal{I}'$ as “touched” for every pair we include in $P$. Note that some $A_i$ may be “touched” more than once. As long as there are at least 2 untouched sets in each of $\mathcal{I}$ and $\mathcal{I}'$, the algorithm may continue.

We now proceed to bound the probability of existence of such a $P$, and in the process also bound the probability that none of the candidate queries are fooling. Recall that members of $\mathcal{I} \cup \mathcal{I}'$ are drawn i.i.d. from the uniform distribution on subsets of $X$ of size $nm^{-\epsilon}$. For every pair of sets $B, C \in \mathcal{I} \cup \mathcal{I}'$, we let $\mathcal{R}(B, C) = |B \cap C|$ denote the size of their intersection. It is easy to see that the random variables $\{\mathcal{R}(B, C)\}_{B, C \in \mathcal{I} \cup \mathcal{I}'}$ are pairwise independent. Therefore, the variables indexed by any acyclic set of pairs are mutually independent, by basic probability theory. Thus, if we fix a particular $P$ satisfying (1) and (2), the probability that $P$ satisfies condition (3) is at most
$$\prod_{(B, C) \in P} \Pr\left[\mathcal{R}(B, C) \ge \Omega(nm^{-2\epsilon + \delta})\right].$$

We now want to estimate the probability that the intersection of $B$ and $C$ is a factor $\Omega(m^{\delta})$ more than its expectation of $nm^{-2\epsilon}$. Therefore, we consider an indicator random variable $Y_i$ for each $i \in X$, designating whether $i \in B \cap C$. If the $Y_i$ were independent, we could use Chernoff bounds to bound the probability that $\mathcal{R}(B, C)$ is large. Fortunately, it is easy to see that the $Y_i$'s are negatively correlated: i.e., for any $L \subseteq \{1, \ldots, n\}$, we have
$$\Pr\left[\bigwedge_{i \in L} Y_i = 1\right] \le \prod_{i \in L} \Pr[Y_i = 1].$$
Therefore, by the result of [12], if we “pretend” that they are independent by approximating their joint distribution by i.i.d. Bernoulli random variables, we can still use Chernoff bounds to bound the upper-tail probability. Therefore, using Chernoff bounds⁹ we deduce that the probability that the intersection of $B$ and $C$ is a factor $\Omega(m^{\delta})$ more than the expectation of $nm^{-2\epsilon}$ is at most
$$2^{-(\Omega(m^{\delta}) - 1)\, nm^{-2\epsilon}} \le 2^{-\Omega(nm^{-2\epsilon + \delta})}.$$
Therefore, the probability that the fixed $P$ satisfies condition (3) is at most
$$\prod_{(B, C) \in P} 2^{-\Omega(nm^{-2\epsilon + \delta})} \le \left(2^{-\Omega(nm^{-2\epsilon + \delta})}\right)^{|P|} \le 2^{-\Omega(nm^{1 - 2\epsilon + \delta})}.$$

Now, we can sum over all possible choices for $P$ satisfying (1) and (2) to get a bound on the probability of existence of a $P$ satisfying (1), (2) and (3). It is easy to see that there are at most $m^{m}$ choices for $P$ that satisfy (1) and (2). Using the union bound, we get the following bound on the probability of existence of such a $P$:
$$m^{m} \cdot 2^{-\Omega(nm^{1 - 2\epsilon + \delta})} \le 2^{m \log m - \Omega(nm^{1 - 2\epsilon + \delta})} \le 2^{-\Omega(nm^{1 - 2\epsilon + \delta})},$$
where the last inequality follows by simple algebraic manipulation from our assumption that $m^{\epsilon} \le \sqrt{n}$ and $\delta > 0$, when $n$ and $m$ are sufficiently large. Recall that, by our previous discussion, this expression also upper bounds the probability that none of the candidate queries are fooling queries. But, when $n$ and $m$ are sufficiently large, this is strictly smaller than $2^{-b} = 2^{-nm^{1 - 2\epsilon}}$.
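The algebra behind the last two inequalities is left implicit in the text. One way to spell it out, writing $c > 0$ for the constant hidden in the $\Omega(\cdot)$ (a name introduced here purely for illustration), is the following sketch:

```latex
\begin{align*}
m^{\epsilon} \le \sqrt{n}
  &\;\Longrightarrow\; n \ge m^{2\epsilon}
   \;\Longrightarrow\; n\,m^{1-2\epsilon+\delta} \ge m^{1+\delta}, \\
m\log m - c\,n\,m^{1-2\epsilon+\delta}
  &\le -\tfrac{c}{2}\,n\,m^{1-2\epsilon+\delta}
  \quad\text{once } m^{\delta} \ge \tfrac{2}{c}\log m, \\
\tfrac{c}{2}\,n\,m^{1-2\epsilon+\delta}
  = \tfrac{c}{2}\,m^{\delta}\,b
  &> b
  \quad\text{once } m^{\delta} > \tfrac{2}{c}.
\end{align*}
```

Combining the three lines gives $m^{m} \cdot 2^{-c\,nm^{1-2\epsilon+\delta}} \le 2^{-(c/2)\,m^{\delta} b} < 2^{-b}$ for all sufficiently large $m$, using $b = nm^{1-2\epsilon}$.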
Thus, by our previous discussion, this completes the proof of Proposition 4.2.

⁹ We use the following version of the Chernoff bound: let $X_1, \ldots, X_n$ be independent Bernoulli random variables, and let $X = \sum_i X_i$. If $E[X] = \mu$ and $\Delta > 2e - 1$, then $\Pr[X > (1 + \Delta)\mu] \le 2^{-\Delta\mu}$.

4.3 Modifying the Proof for the Case $\sqrt{n} \le m^{\epsilon}$

We maintain the assumption that $k = 1$, and show how to modify the proof of Proposition 4.2 for the case when $\sqrt{n} \le m^{\epsilon}$.

Proposition 4.4. Fix $k = 1$ and parameter $\epsilon$ with $0 \le \epsilon \le 1/2$. Assume $\sqrt{n} \le m^{\epsilon}$. Consider any deterministic oracle that stores a datastructure of size at most $b = nm^{1 - 2\epsilon}$ bits. The oracle does not attain an approximation ratio of $O(n^{1/2 - \delta})$ for any constant $\delta > 0$.

Instead of replicating almost the entire proof of Proposition 4.2, we point out the key changes necessary to yield a proof of Proposition 4.4 and leave the rest as an easy exercise for the reader. The proof proceeds almost identically to the proof of Proposition 4.2, with the following main changes:

• Modifications to Section 4.2.1: When defining $\mathcal{D}$, we let each $A_i$ be a subset of $X$ of size $\sqrt{n}$ instead of $nm^{-\epsilon}$.
• We perform similar calculations throughout, accommodating the above modification to the size of $A_i$.
• Modifications to Section 4.2.4: We eventually arrive at an upper bound of $2^{-mn^{\delta}}$ on the probability that none of the candidate queries are fooling. Using the assumption $m^{\epsilon} \ge \sqrt{n}$ and the fact that $b = nm^{1 - 2\epsilon}$, a simple algebraic manipulation shows that this bound is strictly less than $2^{-b}$: indeed, $n \le m^{2\epsilon}$ gives $b \le m < mn^{\delta}$ for $n > 1$. This completes the proof, as before.

4.4 Modifying the Proof for Arbitrary $k$

In this section, we generalize Proposition 4.2 to arbitrary $k$. The generalization of Proposition 4.4 to arbitrary $k$ is essentially identical, and therefore we leave it as an exercise for the reader.
We now state the generalization of Proposition 4.2 to arbitrary $k$.

Proposition 4.5. Let parameter $\epsilon$ be such that $0 \le \epsilon \le 1/2$. Assume $m^{\epsilon} \le \sqrt{n}$. Consider any deterministic oracle that stores a datastructure of size at most $b = nm^{1 - 2\epsilon}$ bits. The oracle does not attain an approximation ratio of $O\!\left(\frac{m^{\epsilon - \delta}}{k\sqrt{k}}\right)$ for any constant $\delta > 0$.

The proof of Proposition 4.5 follows the outline of the proof of Proposition 4.2. The necessary modifications to the proof of Proposition 4.2 are as follows:

• Modifications to Section 4.2.1: We define distribution $\mathcal{D}$ as before, except that we let each $A_i$ be a subset of $X$ of size $nm^{-\epsilon}\sqrt{k}$.
• Modifications to Section 4.2.2: Instead of sampling from $\mathcal{D}$ twice, we sample $2k + 1$ times to get set systems $(X, \mathcal{I}_1), (X, \mathcal{I}_2), \ldots, (X, \mathcal{I}_{2k+1})$. This changes the probability of collision of Lemma 4.3 to $2^{-2kb}$. Here, collision means that all $2k + 1$ samples from $\mathcal{D}$ are stored as the same datastructure by the static stage of the oracle.
• Modifications to Section 4.2.3: We now define a fooling query analogously for general $k$: a query $Q$ is fooling if there is no single index $i$ such that returning the $i$'th set gives a good approximation for all the set systems $(X, \mathcal{I}_1), \ldots, (X, \mathcal{I}_{2k+1})$. Moreover, we analogously define candidate queries: we use $A^{a}_{b}$ to denote the $b$'th set in set system $(X, \mathcal{I}_a)$. We say $Q \subseteq X$ is a candidate if $Q = A^{\ell_1}_{i_1} \cup A^{\ell_2}_{i_2} \cup \ldots \cup A^{\ell_{k+1}}_{i_{k+1}}$, where indices $\ell_1, \ldots, \ell_{k+1}$ are distinct, and indices $i_1, \ldots, i_{k+1}$ are distinct. In other words, $Q$ is a candidate query if it is the union of $k + 1$ sets taken from $k + 1$ distinct set systems, with $k + 1$ distinct indices in those set systems.
• Modifications to Section 4.2.4: Similarly, if a candidate $Q = A^{\ell_1}_{i_1} \cup \ldots \cup A^{\ell_{k+1}}_{i_{k+1}}$ is not a fooling query, then there is some $A \in (\mathcal{I}_1 \cup \ldots \cup \mathcal{I}_{2k+1}) \setminus \{A^{\ell_j}_{i_j}\}_j$ with $|A \cap Q| \ge nm^{-\epsilon}/\alpha$. Therefore, for one of the components $A^{\ell_j}_{i_j}$ of $Q$ we have that $|A \cap A^{\ell_j}_{i_j}| \ge nm^{-\epsilon}/k\alpha$. Plugging in the approximation ratio $\alpha = m^{\epsilon - \delta}/k\sqrt{k}$ we have that $|A \cap A^{\ell_j}_{i_j}| \ge nm^{-2\epsilon + \delta}\sqrt{k}$. It is not too hard to see that we can construct $P$ similarly with

  1. $|P| \ge k(m - k) = \Omega(km)$.¹⁰
  2. The undirected graph with nodes $\mathcal{I}_1 \cup \ldots \cup \mathcal{I}_{2k+1}$ and edges $P$ is bipartite. Moreover, every node in the left part has degree at most 1. Thus $P$ is acyclic.
  3. If $(B, C) \in P$ then $|B \cap C| \ge \Omega(nm^{-2\epsilon + \delta}\sqrt{k})$

  Continuing with the remaining calculations in this section almost identically gives a bound of $2^{-\Omega(knm^{1 - 2\epsilon + \delta})}$ on the probability that a fixed $P$ satisfies condition (3). The number of such $P$ is at most $(km)^{km}$, therefore a similar calculation gives a bound of $2^{-\Omega(knm^{1 - 2\epsilon + \delta})} = 2^{-\Omega(kbm^{\delta})}$ on the probability of existence of any such $P$. As before, this completes the proof.

¹⁰ This is not true when $k$ is almost equal to $m$. However, the theorem becomes trivially true when $k > m^{1/6}$, so we can without loss of generality assume that $k$ is not too large.

5 Conclusions and Future Work

This paper introduced and studied a fundamental problem, called SDC, arising in many large-scale Web applications. A summary of the results obtained in the paper appears in Table 1 (Section 2.2). The main specific open question that arises is whether there is a deterministic oracle that is as good as the randomized oracle proposed in Section 3. More generally, a detailed analysis of practical subclasses of SDC seems to hold promise.

Acknowledgements

We thank Philip Bohannon, Hector Garcia-Molina, Ashwin Machanavajjhala, Tim Roughgarden, and Elad Verbin for insightful discussions.

References

[1] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions.
IEEE TKDE, 17, 2005.
[2] N. Alon and J. Spencer. The Probabilistic Method. John Wiley, 1992.
[3] Y. Azar, N. Alon, and B. Awerbuch. The online set cover problem. In STOC, 2003.
[4] Francesco Bonchi, Carlos Castillo, Debora Donato, and Aristides Gionis. Topical query decomposition. In KDD, 2008.
[5] Jaime G. Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR, 1998.
[6] M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[7] Harr Chen and David R. Karger. Less is more: probabilistic models for retrieving fewer relevant documents. In SIGIR, 2006.
[8] Nello Cristianini and Matthew W. Hahn. Introduction to Computational Genomics. Cambridge University Press, 2006.
[9] M. R. Garey and D. S. Johnson. Computers and Intractability. W. H. Freeman and Company, 1979.
[10] YJ. Naor and N. Buchbinder. Online primal-dual algorithms for covering and packing problems. In ESA, 2005.
[11] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(3):265-294, 1978.
[12] Alessandro Panconesi and Aravind Srinivasan. Randomized distributed edge coloring via an extension of the Chernoff-Hoeffding bounds. SIAM J. Comput., 26(2):350-368, 1997.
[13] B. Saha and L. Getoor. On maximum coverage in the streaming model and application to multi-topic blog-watch. In SDM, 2009.
[14] Mikkel Thorup and Uri Zwick. Approximate distance oracles. J. ACM, 52(1):1-24, 2005.
