Succinct Coverage Oracles
In this paper, we identify a fundamental algorithmic problem that we term succinct dynamic covering (SDC), arising in many modern-day web applications, including ad-serving and online recommendation systems in eBay and Netflix. Roughly speaking, SDC …
Authors: Ioannis Antonellis, Anish Das Sarma, Shaddin Dughmi
Ioannis Antonellis^1, Anish Das Sarma^2, Shaddin Dughmi^1
^1 Stanford University, ^2 Yahoo Research
^1 {antonell,shaddin}@cs.stanford.edu, ^2 anishdas@yahoo-inc.com

Abstract

In this paper, we identify a fundamental algorithmic problem that we term succinct dynamic covering (SDC), arising in many modern-day web applications, including ad-serving and online recommendation systems in eBay and Netflix. Roughly speaking, SDC applies two restrictions to the well-studied Max-Coverage problem [9]: Given an integer $k$, $X = \{1, 2, \ldots, n\}$ and $\mathcal{I} = \{S_1, \ldots, S_m\}$, $S_i \subseteq X$, find $\mathcal{J} \subseteq \mathcal{I}$, such that $|\mathcal{J}| \le k$ and $\left(\bigcup_{S \in \mathcal{J}} S\right)$ is as large as possible. The two restrictions applied by SDC are: (1) Dynamic: At query-time, we are given a query $Q \subseteq X$, and our goal is to find $\mathcal{J}$ such that $Q \cap \left(\bigcup_{S \in \mathcal{J}} S\right)$ is as large as possible; (2) Space-constrained: We don't have enough space to store (and process) the entire input; specifically, we have $o(mn)$, and maybe as little as $O((m + n)\,\mathrm{polylog}(mn))$ space. A solution to SDC maintains a small data structure, and uses this data structure to answer most dynamic queries with high accuracy. We call such a scheme a Coverage Oracle.

We present algorithms and complexity results for coverage oracles. We present deterministic and probabilistic near-tight upper and lower bounds on the approximation ratio of SDC as a function of the amount of space available to the oracle. Our lower bound results show that to obtain constant-factor approximations we need $\Omega(mn)$ space. Fortunately, our upper bounds present an explicit tradeoff between space and approximation ratio, allowing us to determine the amount of space needed to guarantee certain accuracy.

1 Introduction

The explosion of data and applications on the web over the last decade has given rise to many new data management challenges.
This paper identifies a fundamental subproblem inherent in several Web applications, including online recommendation systems, and serving advertisements on webpages. Let us begin with a motivating example.

Example 1.1. Consider the online movie rental and streaming website, Netflix^1, and one of their users Alice. Based on Alice's movie viewing (and rating) history, Netflix would like to recommend new movies to Alice for watching. (Indeed, Netflix threw open a million-dollar challenge on significantly improving their movie recommendations^2.) Conceivably, there are many ways of devising algorithms for recommendation ranging from data mining to machine learning techniques, and indeed there has been a great deal of such work on providing personalized recommendations (see [1] for a survey). Regardless of the specific technique, an important subproblem that arises is finding users "similar" to Alice, i.e., finding users who have independently or in conjunction viewed (and liked) movies seen (and liked) by Alice. Abstractly speaking, we are given a universal set of all Netflix movies, and Netflix users identified by the subset of movies they have viewed (and liked or disliked). Given a specific user Alice, we are interested in finding (say $k$) other users, who together cover a large set of Alice's likes and dislikes. Note that for each user, the set of movies that need to be covered is different, and therefore the covering cannot be performed statically, independent of the user. In fact, Netflix dynamically provides movie recommendations as users rate movies in a particular genre (say comedy), or request movies in specific languages, or time periods.
Providing recommendations at interactive speed, based on user queries (such as for a particular genre), rules out computationally-expensive processing over the entire Netflix data, which is very large.^3 Therefore, we are interested in approximately solving the aforementioned covering problem based on a subset of the data. The main challenge that arises is to statically identify a subset of the data that would provide good approximations to the covering problem for any dynamic user query.

Note that a very similar challenge arises in other recommendation systems, such as when Alice visits an online shopping website like eBay^4 or Amazon^5, and the website is interested in recommending products to Alice based on her current query for a particular brand or product, and her prior purchasing (and viewing) history.

The example above can be formulated as an instance of a simple algorithmic covering problem, generalizing the NP-hard optimization problem max k-cover [9]. The input to this problem is an integer $k$, a set $X = \{1, \ldots, n\}$, a family $\mathcal{I} \subseteq 2^X$ of subsets of $X$, and a query $Q \subseteq X$. Here $(X, \mathcal{I})$ is called a set system, $X$ is called the ground set of the set system, and members of $X$ are called elements or items. We make no assumptions on how the set system is represented in the input, though the reader can think of the obvious representation by an $n \times m$ bipartite graph for intuition. This $n \times m$ bipartite graph can be stored in $O(nm)$ bits, which is in fact information-theoretically optimal for storing an arbitrary set system on $n$ items and $m$ sets.
The objective of the problem is to return $\mathcal{J} \subseteq \mathcal{I}$ with $|\mathcal{J}| \le k$ that collectively cover as much of $Q$ as possible. Since this problem is a generalization of max k-cover, it is NP-hard. Nevertheless, absent any additional constraints this problem can be approximated in polynomial time by a straightforward adaptation of the greedy algorithm for max k-cover^6, which attains a constant factor $\frac{e}{e-1}$ approximation in $O(mn)$ time [11]. However, we further constrain solutions to the problem as follows, rendering new techniques necessary.

^1 www.netflix.com
^2 http://www.netflixprize.com/
^3 Netflix currently has over 10 million users, over 100,000 movies, and obviously some of the popular movies have been viewed by many users, and movie buffs have rated a large number of movies; Netflix owns over 55 million discs.
^4 www.ebay.com
^5 www.amazon.com

From the above example, we identify two properties that we require of any system that solves this covering problem:

1. Space Constrained: We need to (statically) preprocess the set system $(X, \mathcal{I})$ and store a small sketch (much smaller than $O(mn)$), in the form of a data structure, and discard the original representation of $(X, \mathcal{I})$. This can be thought of as a form of lossy compression. We do not require the data structure to take any particular form; it need only be a sequence of bits that allows us to extract information about the original set system $(X, \mathcal{I})$. For instance, any statistical summary, a subgraph of the bipartite graph representing the set system, or other representation is acceptable.

2. Dynamic: The query $Q$ is not known a priori, but arrives dynamically. More precisely: $Q$ arrives after the data structure is constructed and the original data discarded. It is at that point that the data structure must be used to compute a solution $\mathcal{J}$ to the covering problem.
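The unconstrained greedy baseline mentioned above (pick the set covering the most still-uncovered query items, and repeat $k$ times) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the dictionary representation of the set system and the name `greedy_query_cover` are assumptions made here for the example.

```python
def greedy_query_cover(sets, query, k):
    """Greedy heuristic: choose up to k sets covering as much of `query` as possible.

    `sets` maps a set name to a Python set of items. This encoding is a
    hypothetical choice; the paper leaves the input representation open.
    """
    uncovered = set(query)
    chosen = []
    for _ in range(k):
        # Pick the not-yet-chosen set covering the most uncovered query items.
        best = max(
            (name for name in sets if name not in chosen),
            key=lambda name: len(sets[name] & uncovered),
            default=None,
        )
        if best is None or not (sets[best] & uncovered):
            break  # no remaining set adds coverage
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


sets = {"S1": {1, 2, 3}, "S2": {3, 4}, "S3": {5, 6, 7, 8}}
query = {1, 2, 4, 5}
picked = greedy_query_cover(sets, query, k=2)
covered = set().union(*(sets[name] for name in picked)) & query
```

Note that this baseline reads the full set system at query time, which is exactly what the two constraints above rule out.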
We call this covering problem (formalized in the next section) the Succinct Dynamic Covering (SDC) problem. Moreover, we call a solution to SDC a Coverage Oracle. A coverage oracle consists of a static stage that constructs a data structure, and a dynamic stage that uses the data structure to answer queries.

Next we briefly present another, entirely different, Web application that also needs to confront SDC. In addition, we note that there are several other applications facing similar covering problems, including gene identification [8], searching domain-specific aggregator sites like Yelp^7, topical query decomposition [4], and search-result diversification [5, 7].

Example 1.2. Online advertisers bid on (1) webpages matching relevancy criteria and (2) typically target a certain user demographic. Advertisements are served based on a combination of the two criteria above. When a user visits a particular webpage, there is usually no precise information about the user's demographic, i.e., age, location, interests, gender, etc. Instead, there is a range of possible values for each of these attributes, determined based on the search query the user issued or session information. Ad-servers therefore attempt to pick a set of advertisements that would be of interest to (i.e., "cover") a large number of users; the user demographic that needs to be covered is determined by the page on which the advertisement is being placed, the user query, and session information. Therefore, ad-serving is faced with the SDC problem. The space constraint arises because the set system consisting of all webpages, and each user identified by the set of webpages visited by the user, is prohibitively large to store in memory and process in real-time for every single page view.
The dynamic aspect arises because each user view of each page is associated with a different user demographic that needs to be covered.

^6 The greedy algorithm for max k-cover, adapted to our problem, is simple: Find the set in $\mathcal{I}$ covering as many uncovered items in $Q$ as possible, and repeat this $k$ times. This can clearly be implemented in $O(mn)$ time, and has been shown to yield an $e/(e-1)$ approximation.
^7 www.yelp.com

1.1 Contributions and Outline

Next we outline the main contributions of this paper.

• In Section 2 we formally define the succinct dynamic covering (SDC) problem, and summarize our results.

• In Section 3 we present a randomized coverage oracle for SDC. The oracle is presented as a function of the available space, thus allowing us to trade off space for accuracy based on the specific application. Unfortunately, the approximation ratio of this oracle degrades rapidly as space decreases; however, the next section shows that this is in fact unavoidable.

• In Section 4 we present a lower bound on the best possible approximation attainable as a function of the space allowed for the data structure. This lower bound essentially matches the upper bound of Section 3, though with the caveat that the lower bound is for oracles that do not use randomization. We expect the lower bound to hold more generally for randomized oracles, though we leave this as an open question.

Related work and future directions are presented in Section 5.

1.2 Related Work

Our study of the tradeoff between space and approximation ratio is in the spirit of the work of Thorup and Zwick [14] on distance oracles. They considered the problem of compressing a graph $G$ into a small data structure, in such a way that the data structure can be used to approximately answer queries for the distance between pairs of nodes in $G$.
Similar to our results, they showed matching upper and lower bounds on the space needed for compressing the graph subject to preserving a certain approximation ratio. Moreover, similarly to our upper bounds for SDC, their distance oracles benefit from a speedup at query time as approximation ratio is sacrificed for space.

Previous work has studied the set cover problem under streaming models. One model studied in [3, 10] assumes that the sets are known in advance, only elements arrive online, and the algorithms do not know in advance which subset of elements will arrive. An alternative model assumes that elements are known in advance and sets arrive in a streaming fashion [13]. Our work differs from these works in that SDC operates under a storage budget, so all sets cannot be stored; moreover, SDC needs to provide a good cover for all possible dynamic query inputs.

Another related area is that of nearest neighbor search. It is easy to see that the SDC problem with $k = 1$ corresponds to nearest neighbor search using the dot product similarity measure, i.e., $\mathrm{sim}_{dot}(x, y) = \frac{\mathrm{dot}(x, y)}{n}$. However, following from a result of Charikar [6], there exists no locality sensitive hash function family for the dot product similarity function. Thus, there is no hope that signature schemes (like minhashing for the Jaccard distance) can be used for SDC.

2 SDC

We start by defining the succinct dynamic covering (SDC) problem in Section 2.1. Then, in Section 2.2 we summarize the main technical results achieved by this paper.

2.1 Problem Definition

We now formally define the SDC problem.

Definition 2.1 (SDC).
Given an offline input consisting of a set system $(X, \mathcal{I})$ with $n$ elements (a.k.a. items) $X$ and $m$ sets $\mathcal{I}$, and an integer $k \ge 1$, devise a coverage oracle such that given a dynamic query $Q \subseteq X$, the oracle finds a $\mathcal{J} \subseteq \mathcal{I}$ such that $|\mathcal{J}| \le k$ and $\left(\bigcup_{S \in \mathcal{J}} S\right) \cap Q$ is as large as possible.

Definition 2.2 (Coverage Oracle). A Coverage Oracle for SDC consists of two stages:

1. Static Stage: Given integers $m$, $n$, $k$, and set system $(X, \mathcal{I})$ with $|X| = n$ and $|\mathcal{I}| = m$, build a data structure $D$.

2. Dynamic Stage: Given a dynamic query $Q \subseteq X$, use $D$ to return $\mathcal{J} \subseteq \mathcal{I}$ with $|\mathcal{J}| \le k$ as a solution to SDC.

Note that our two constraints on a solution for SDC are illustrated by the two stages above. (1) We are interested in building an offline data structure $D$, and only use $D$ to answer queries. Typically, we want to maintain a small data structure, certainly $o(mn)$, and maybe as little as $O((m + n)\,\mathrm{polylog}(mn))$ or even $O(m + n)$. Therefore, we cannot store the entire set system. (2) Unlike the traditional max-coverage problem where the entire set of elements $X$ needs to be covered, in SDC we are given queries dynamically. Therefore, we want a coverage oracle that returns good solutions for all queries.

Given the space limitation of SDC, we cannot hope to exactly solve SDC (for all dynamic input queries). The goal of this paper is to explore approximate solutions for SDC, given a specific space constraint on the offline data structure $D$.

We define the approximation ratio of an oracle as the worst case, taken over all inputs, of the ratio between the coverage of $Q$ by the optimal solution and the coverage of $Q$ by the output of the oracle. We allow the approximation ratio to be a function of $n$, $m$, and $k$, and denote it by $\alpha(n, m, k)$.
More precisely, given a coverage oracle $A$, if on inputs $k, X, \mathcal{I}, Q$ (where implicitly $n = |X|$ and $m = |\mathcal{I}|$) the oracle $A$ returns $\mathcal{J} \subseteq \mathcal{I}$, we denote the size of the coverage as $A(k, X, \mathcal{I}, Q) := \left|\left(\bigcup_{S \in \mathcal{J}} S\right) \cap Q\right|$. Similarly, we denote the coverage of the optimal solution by $OPT(k, X, \mathcal{I}, Q) := \max\left\{\left|\left(\bigcup_{S \in \mathcal{J}^*} S\right) \cap Q\right| : \mathcal{J}^* \subseteq \mathcal{I}, |\mathcal{J}^*| \le k\right\}$. We then express the approximation ratio $\alpha(n, m, k)$ as follows.

$$\alpha(n, m, k) = \max \frac{OPT(k, X, \mathcal{I}, Q)}{A(k, X, \mathcal{I}, Q)}$$

where the maximum above is taken over set systems $(X, \mathcal{I})$ with $|X| = n$ and $|\mathcal{I}| = m$, and queries $Q \subseteq X$.

We will also be concerned with randomized coverage oracles. Note that, when we devise a randomized coverage oracle, we use randomization only in the static stage; i.e., in the construction of the data structure. We then let the expected approximation ratio be the worst case expected performance of the oracle as compared to the optimal solution.

$$\alpha(n, m, k) = \max \mathbb{E}\left[\frac{OPT(k, X, \mathcal{I}, Q)}{A(k, X, \mathcal{I}, Q)}\right] \qquad (1)$$

The expectation in the above expression is over the random coins flipped by the static stage of the oracle, and the maximization is over $X, \mathcal{I}, Q$ as before. We elaborate on this benchmark in Section 3.

Table 1: Summary of results for SDC giving the approximation ratio, the space constraint on the coverage oracle, and the nature of the bound: upper bound (UB) or lower bound (LB), and deterministic (Det.) or randomized (Rand.)

Approximation Ratio | Storage | Bound
$O\left(\min\left(\frac{m}{k}, \sqrt{\frac{n}{k}}\right)\right)$ | $\tilde{O}(n)$ | Det. UB
$O\left(\min\left(\frac{m^{\epsilon}}{\sqrt{k}}, \sqrt{\frac{n}{k}}\right)\right)$ | $\tilde{O}(nm^{1-2\epsilon})$ | Rand. UB
$\Omega\left(\min\left(\frac{m^{\epsilon-\delta_1}}{k\sqrt{k}}, \frac{n^{1/2-\delta_2}}{k\sqrt{k}}\right)\right)$ | $\tilde{O}(nm^{1-2\epsilon})$ | Det. LB

We study the space-approximation tradeoff; i.e., how the (expected) approximation ratio improves as the amount of space allowed for $D$ is increased.
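To make the benchmark defined earlier in this section concrete, the following Python sketch computes $OPT$ by brute force and evaluates the ratio $OPT / A$ for one fixed input and one fixed oracle answer. It is only an illustration under stated assumptions: the function names and the dictionary encoding of the set system are hypothetical, the enumeration is exponential in $k$ and therefore only usable on tiny instances, and the true approximation ratio is the maximum of this quantity over all inputs, which the sketch does not attempt to compute.

```python
from itertools import combinations


def coverage(sets, names, query):
    """Number of query items covered by the union of the named sets."""
    covered = set()
    for name in names:
        covered |= sets[name]
    return len(covered & set(query))


def opt_coverage(sets, query, k):
    """Brute-force OPT: best coverage of `query` by at most k sets."""
    names = list(sets)
    best = 0
    for size in range(1, min(k, len(names)) + 1):
        for combo in combinations(names, size):
            best = max(best, coverage(sets, combo, query))
    return best


def ratio_on_instance(sets, answer, query, k):
    """OPT / A for a single instance and a single oracle answer."""
    achieved = coverage(sets, answer, query)
    optimum = opt_coverage(sets, query, k)
    if optimum == 0:
        return 1.0  # nothing can be covered, so any answer is optimal
    return float("inf") if achieved == 0 else optimum / achieved


sets = {"A": {1, 2, 3, 4}, "B": {1, 2}, "C": {5}}
query = {1, 2, 3, 4, 5}
ratio = ratio_on_instance(sets, ["B"], query, k=1)
```

Here the hypothetical oracle answer `["B"]` covers two query items while the best single set covers four, so the ratio on this instance is 2.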
In our lower bounds, we are not specifically concerned with the time taken to compute the data structure or answer queries. Therefore, our lower bounds are purely information-theoretic: we calculate the amount of information we are required to store if we are to guarantee a specific approximation ratio, independent of computational concerns. Our lower bounds are particularly novel and striking in that they assume nothing about the data structure, which may be an arbitrary sequence of bits. We establish our lower bounds via a novel application of the probabilistic method that may be of independent interest.

Even though we focus on space vs. approximation, and not on runtime, fortunately the coverage oracles in our upper bounds can be implemented efficiently (both static and dynamic stage). Moreover, using our upper bounds to trade approximation for space yields, as a side effect, an improvement in runtime when answering a query. In particular, observe that if no sparsification of the data is done up-front, then answering each query using the standard greedy approximation algorithm for max k-cover [11] takes $O(mn)$ time. Our oracles, presented in Section 3, spend $O(mn)$ time up-front building a data structure of size $O(b)$, where $b$ is a parameter of the oracle between $n$ and $nm$. In the dynamic stage, however, answering a query now takes $O(b)$ time, since we use the greedy algorithm for max k-cover on a "sparse" set system. Therefore, the dynamic stage becomes faster as we decrease the size of the data structure. In fact, this increase in speed is not restricted to an algorithmic speedup as described above. It is likely that there will also be a speedup due to architectural reasons, since a smaller amount of data needs to be kept in memory.
Therefore, trading off approximation for space yields an incidental speedup in runtime which bodes well for the dynamic nature of the queries.

2.2 Summary of results

Table 1 summarizes the main results obtained in this paper for SDC input with $n$ elements, $m$ sets, and integer $k \ge 1$. The lower bound in the table is for any nonnegative constants $\delta_1, \delta_2$ not both 0, and the randomized upper bound is parameterized by $\epsilon$ with $0 \le \epsilon \le 1/2$. The upper and lower bounds are developed in Sections 3 and 4 respectively.

3 Upper Bounds

In this section, we show a coverage oracle that trades off space and approximation ratio. We designate a tradeoff parameter $\epsilon$, where $0 \le \epsilon \le 1/2$. For any such $\epsilon$, we get an $O\left(\frac{\min(m^{\epsilon}, \sqrt{n})}{\sqrt{k}}\right)$-approximate coverage oracle that stores $\tilde{O}(nm^{1-2\epsilon})$ bits. Therefore, setting a small value of $\epsilon$ achieves a better approximation ratio, at the expense of storage space. As is common practice, we use $\tilde{O}(\cdot)$ to denote suppressing polylogarithmic factors in $n$ and $m$; this is reasonable when the guarantees are super-polylogarithmic, as is the case here.

The oracle we show is randomized, in the sense that the static stage flips some random coins. The data structure constructed is a random variable in the internal coin flips of the static stage of the oracle. We measure the expected approximation ratio (a.k.a. approximation ratio, when clear from context) of the oracle, as defined in Equation (1). For every fixed query $Q$ independent of the random coins used in constructing the data structure, this ratio is attained in expectation. In other words, our adversarial model is that of an oblivious adversary: someone trying to fool our oracle may choose any query they like, but their choice cannot depend on knowledge of the random choices made in constructing the data structure.
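As a purely illustrative calculation, the tradeoff between $\epsilon$, space, and approximation ratio described in this section can be instantiated with round numbers. The values of $n$, $m$, and $k$ below are assumptions chosen for easy arithmetic and do not come from the paper; constants and the polylogarithmic factors hidden by $O$ and $\tilde{O}$ are ignored.

```latex
% Assumed instance: n = 10^6 items, m = 10^4 sets, k = 4,
% so that \sqrt{n} = 10^3 and \sqrt{k} = 2.
\begin{align*}
\epsilon = \tfrac{1}{2}: &\quad
  nm^{1-2\epsilon} = 10^{6}\cdot 10^{0} = 10^{6},
  &\frac{\min(m^{\epsilon},\sqrt{n})}{\sqrt{k}}
  &= \frac{\min(10^{2},10^{3})}{2} = 50,\\
\epsilon = \tfrac{1}{4}: &\quad
  nm^{1-2\epsilon} = 10^{6}\cdot 10^{2} = 10^{8},
  &\frac{\min(m^{\epsilon},\sqrt{n})}{\sqrt{k}}
  &= \frac{\min(10,10^{3})}{2} = 5,\\
\epsilon = 0: &\quad
  nm^{1-2\epsilon} = 10^{6}\cdot 10^{4} = 10^{10} = nm,
  &\frac{\min(m^{\epsilon},\sqrt{n})}{\sqrt{k}}
  &= \frac{\min(1,10^{3})}{2} = \tfrac{1}{2}.
\end{align*}
```

At $\epsilon = 0$ the space budget is the full $nm$ and the bound degenerates to a constant, consistent with the observation that the entire set system can be stored in that regime.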
In Section 4 we will see that our oracle attains a space-approximation tradeoff that is essentially optimal when compared with oracles that are deterministic. In other words, no deterministic oracle can do substantially better. We leave open the questions of whether a better randomized oracle is possible, and whether an equally good deterministic oracle exists.

3.1 Main Result and Roadmap

The following theorem states the main result of this section.

Theorem 3.1. For every $\epsilon$ with $0 \le \epsilon \le 1/2$, there is a randomized coverage oracle for SDC that achieves an $O\left(\frac{\min(m^{\epsilon}, \sqrt{n})}{\sqrt{k}}\right)$ approximation and stores $\tilde{O}(nm^{1-2\epsilon})$ bits.

The remainder of this section, leading up to the above result, is organized as follows. Before proving Theorem 3.1, to build intuition we show in Section 3.2 (Remark 3.2) a much simpler deterministic oracle, with a much weaker approximation guarantee. Then, we prove Theorem 3.1 in two parts. First, in Section 3.3, we show a randomized coverage oracle that stores $\tilde{O}(nm^{1-2\epsilon})$ bits and achieves an $O(m^{\epsilon}/\sqrt{k})$ approximation in expectation. Then, in Section 3.4, we show a deterministic oracle that achieves an $O(\sqrt{n}/\sqrt{k})$ approximation and stores $\tilde{O}(n)$ bits. Combining the two oracles into a single oracle in the obvious way yields Theorem 3.1.

3.2 Simple Deterministic Oracle

Remark 3.2. There is a simple deterministic oracle that attains an $m/k$ approximation with $\tilde{O}(n)$ space. The static stage proceeds as follows: Given set system $(X, \mathcal{I})$, for each $i \in X$ we "remember" one set $S \in \mathcal{I}$ with $i \in S$ (breaking ties arbitrarily). In other words, for each $S \in \mathcal{I}$ we define $\hat{S} \subseteq S$ such that $\{\hat{S} : S \in \mathcal{I}\}$ is a partition of $X$. We then store the "sparsified" set system $\left(X, \hat{\mathcal{I}} = \{\hat{S} : S \in \mathcal{I}\}\right)$. It is clear that this can be done in linear time by a trivial greedy algorithm.
Moreover, $(X, \hat{\mathcal{I}})$ can be stored in $\tilde{O}(n)$ space as an $n \times m$ bipartite graph with $n$ edges.

The dynamic stage is straightforward: when given a query $Q$, we simply return the indices of the $k$ sets in $\hat{\mathcal{I}}$ that collectively cover as much of $Q$ as possible. It is clear that this gives an $m/k$ approximation. Moreover, since $\hat{\mathcal{I}}$ is a partition of $X$, it can be accomplished by a trivial greedy algorithm in polynomial time.

Next we use randomization to show a much better, and much more involved, upper bound that trades off approximation and space.

3.3 An $O(m^{\epsilon}/\sqrt{k})$ Approximation with $\tilde{O}(nm^{1-2\epsilon})$ Space

Consider the set system $(X, \mathcal{I})$, where $X$ is the set of items and $\mathcal{I}$ is the family of sets. We assume without loss of generality that each item is in some set. We define a randomized oracle for building a data structure, which is a "sparsified" version of $(X, \mathcal{I})$. Namely, for every $S \in \mathcal{I}$ we define $\hat{S} \subseteq S$, and store the set system $\left(X, \hat{\mathcal{I}} = \{\hat{S}\}_{S \in \mathcal{I}}\right)$. We require that $(X, \hat{\mathcal{I}})$ can be stored in $\tilde{O}(nm^{1-2\epsilon})$ space. We construct the data structure in two stages, as follows.

• Label all items in $X$ "uncovered" and all sets in $\mathcal{I}$ "unchosen"
• Stage 1: While there exists an unchosen set $S \in \mathcal{I}$ containing at least $\frac{n}{m^{\epsilon}\sqrt{k}}$ uncovered items
  – Let $\hat{S}$ be the set of uncovered items in $S$.
  – Relabel all items in $\hat{S}$ as "covered" and "significant"
  – Relabel $S$ as "chosen" and "significant"
• Stage 2: For every remaining "unchosen" set $S$
  – Choose $\frac{n}{m^{2\epsilon}}$ "uncovered" items $\hat{S} \subseteq S$ uniformly at random from the uncovered items in $S$ (if there are fewer than $\frac{n}{m^{2\epsilon}}$ such items, then let $\hat{S}$ be all of them).
  – Relabel each item in $\hat{S}$ as "covered" and "insignificant"
  – Relabel $S$ as "chosen" and "insignificant"
• Label every uncovered item as "uncovered" and "insignificant"

When presented with a query $Q \subseteq X$, we use the stored data structure $(X, \hat{\mathcal{I}})$ in the obvious way: namely, we find $\hat{S}_1, \ldots, \hat{S}_k \in \hat{\mathcal{I}}$ maximizing $\left|\left(\bigcup_{i=1}^{k} \hat{S}_i\right) \cap Q\right|$, and return the names of the corresponding original sets $S_1, \ldots, S_k$. However, this problem cannot be solved exactly in polynomial time in general. Nevertheless, we can instead use the greedy algorithm for max-k-cover to get a constant-factor approximation [11]; this will not affect our asymptotic guarantee on the approximation ratio.

The following two lemmas complete the proof that the above oracle achieves an $O(m^{\epsilon}/\sqrt{k})$ approximation with $\tilde{O}(nm^{1-2\epsilon})$ space.

Lemma 3.3. The data structure $(X, \hat{\mathcal{I}})$ can be stored using $\tilde{O}(nm^{1-2\epsilon})$ bits.

Proof. We store the set system as a bipartite graph representing the containment relation between items and sets. To show that the bipartite graph can be stored in the required space, it suffices to show that $(X, \hat{\mathcal{I}})$ is "sparse"; namely, that the total number of edges $(x, \hat{S}) \in X \times \hat{\mathcal{I}}$ such that $x \in \hat{S}$ is $O(nm^{1-2\epsilon})$. We account for the edges created in stages 1 and 2 separately.

1. Every significant item is connected to a single set. This creates at most $n$ edges.

2. For every insignificant set, we store at most $nm^{-2\epsilon}$ items. This creates at most $m \cdot nm^{-2\epsilon} = nm^{1-2\epsilon}$ edges.

Lemma 3.4. For every query $Q$, the oracle returns sets $S_1, \ldots, S_k$ such that

$$\mathbb{E}\left[\left|\left(\bigcup_{i=1}^{k} S_i\right) \cap Q\right|\right] \ge \frac{\left|\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q\right|}{O(m^{\epsilon}/\sqrt{k})}$$

for any $S_1^*, \ldots, S_k^* \in \mathcal{I}$.

Note that $S_1, \ldots, S_k$ are random variables in the internal coin flips of the static stage that constructs the data structure.
The expectation in the statement of the lemma is over these random coins.

Proof. We fix an optimal choice for $S_1^*, \ldots, S_k^* \in \mathcal{I}$, and denote $OPT = \left|\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q\right|$. Since, by construction, $\hat{S} \subseteq S$ for all $S \in \mathcal{I}$, it suffices to show that the output of the oracle satisfies $\left|\left(\bigcup_{i=1}^{k} \hat{S}_i\right) \cap Q\right| \ge \frac{OPT}{O(m^{\epsilon}/\sqrt{k})}$ in expectation. Moreover, since the dynamic stage algorithm finds a constant factor approximation to $\max\left\{\left|\left(\bigcup_{i=1}^{k} \hat{S}_i\right) \cap Q\right| : \hat{S}_1, \ldots, \hat{S}_k \in \hat{\mathcal{I}}\right\}$, it is sufficient to show that there exist $S_1, \ldots, S_k \in \mathcal{I}$ with $\mathbb{E}\left[\left|\left(\bigcup_{i=1}^{k} \hat{S}_i\right) \cap Q\right|\right] \ge \frac{OPT}{O(m^{\epsilon}/\sqrt{k})}$.

We distinguish two cases, based on whether most of the items $\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q$ covered by the optimal solution are in significant or insignificant sets. We use the "significant" and "insignificant" designation as used in the static stage algorithm. Moreover, we refer to $\hat{S} \in \hat{\mathcal{I}}$ as significant (insignificant, resp.) when the corresponding $S \in \mathcal{I}$ is significant (insignificant, resp.).

1. At least half of $\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q$ are significant items: Notice that, by construction, there are at most $m^{\epsilon}\sqrt{k}$ significant sets in $\hat{\mathcal{I}}$. Moreover, the significant items are precisely those covered by the significant sets of $\hat{\mathcal{I}}$, and those sets form a partition of the significant items. Therefore, by the pigeonhole principle there are some $\hat{S}_1, \ldots, \hat{S}_k \in \hat{\mathcal{I}}$ such that $\bigcup_{i=1}^{k} \hat{S}_i$ contains at least a $\frac{k}{m^{\epsilon}\sqrt{k}} = \frac{\sqrt{k}}{m^{\epsilon}}$ fraction of the significant items in $\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q$. This gives the desired $O(m^{\epsilon}/\sqrt{k})$ approximation.

2. At least half of $\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q$ are insignificant items: In this case, at least half the items $\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q$ covered by the optimal solution are contained in the insignificant members of $\{S_1^*, \ldots, S_k^*\}$.
Recall that any insignificant set in $\mathcal{I}$ contains at most $\frac{n}{m^{\epsilon}\sqrt{k}}$ insignificant items. Therefore, the algorithm includes each element of an insignificant $S_i^*$ in $\hat{S}_i^*$ with probability at least $\frac{n}{m^{2\epsilon}} \Big/ \frac{n}{m^{\epsilon}\sqrt{k}}$, which is at least $\sqrt{k}/m^{\epsilon}$. Thus, every insignificant item in $\left(\bigcup_{i=1}^{k} S_i^*\right)$ is in $\left(\bigcup_{i=1}^{k} \hat{S}_i^*\right)$ with probability at least $\sqrt{k}/m^{\epsilon}$. This gives that the expected size of $\left(\bigcup_{i=1}^{k} \hat{S}_i^*\right) \cap Q$ is at least $\frac{OPT}{O(m^{\epsilon}/\sqrt{k})}$. Taking $S_i = S_i^*$ completes the proof.

3.4 An $O(\sqrt{n/k})$ Approximation with $\tilde{O}(n)$ Space

This coverage oracle is similar to the one in the previous section, though it is much simpler. Moreover, it is deterministic. Indeed, we construct the data structure by the following greedy algorithm that resembles the greedy algorithm for max-k-cover.

• Label all items in $X$ "uncovered" and all sets in $\mathcal{I}$ "unchosen"
• While there are unchosen sets
  – Find the unchosen set $S \in \mathcal{I}$ containing the most uncovered items
  – Let $\hat{S}$ be the set of uncovered items in $S$.
  – Relabel all items in $\hat{S}$ as "covered"
  – Relabel $S$ as "chosen"

Observe that $\hat{\mathcal{I}}$ is a partition of $X$. When presented with a query $Q \subseteq X$, we use the data structure $\left(X, \hat{\mathcal{I}} = \{\hat{S} : S \in \mathcal{I}\}\right)$ in the obvious way. Namely, we find the sets $\hat{S}_1, \ldots, \hat{S}_k \in \hat{\mathcal{I}}$ maximizing $\left|\left(\bigcup_{i=1}^{k} \hat{S}_i\right) \cap Q\right|$, and output the corresponding non-sparse sets $S_1, \ldots, S_k$. This can easily be done in polynomial time by using the obvious greedy algorithm, since $\hat{\mathcal{I}}$ is a partition of $X$.

Note that the oracle described above is very similar to the oracle from Section 3.2: The dynamic stage is identical. The static stage, however, needs to build the partition using a specific greedy ordering, as opposed to the arbitrary ordering used in Section 3.2. The following two lemmas complete the proof that the oracle achieves an $O(\sqrt{n/k})$ approximation with $\tilde{O}(n)$ space.
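The deterministic oracle of Section 3.4 can be sketched in Python as follows. This is a minimal illustration under assumptions made here: the set system is a dictionary from set names to Python sets, ties in the greedy order are broken alphabetically, and the function names `build_partition` and `answer_query` are hypothetical. Because the stored sets are pairwise disjoint, ranking them by overlap with the query and keeping the top $k$ maximizes coverage over the stored structure.

```python
def build_partition(sets):
    """Static stage: greedily assign each item to exactly one set.

    Repeatedly takes the unchosen set with the most uncovered items and
    stores only those uncovered items, so the stored sets are disjoint.
    """
    uncovered = set().union(*sets.values())
    unchosen = set(sets)
    sparse = {}
    while unchosen:
        # sorted() makes tie-breaking deterministic (an illustrative choice).
        best = max(sorted(unchosen), key=lambda name: len(sets[name] & uncovered))
        sparse[best] = sets[best] & uncovered
        uncovered -= sparse[best]
        unchosen.remove(best)
    return sparse


def answer_query(sparse, query, k):
    """Dynamic stage: names of the k stored sets overlapping the query most."""
    query = set(query)
    ranked = sorted(sparse, key=lambda name: len(sparse[name] & query), reverse=True)
    return ranked[:k]


sets = {"S1": {1, 2, 3, 4}, "S2": {3, 4, 5}, "S3": {5, 6}}
sparse = build_partition(sets)
answer = answer_query(sparse, query={3, 5, 6}, k=1)
```

In this toy instance the stored structure has one edge per item, and the stored version of S2 is empty because all of its items were claimed by sets chosen earlier.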
Lemma 3.5. The data structure $(X, \hat{\mathcal{I}})$ can be stored using $\tilde{O}(n)$ bits.

Proof. Observe that each item is contained in exactly one $\hat{S} \in \hat{\mathcal{I}}$. Therefore, the bipartite graph representing the set system $(X, \hat{\mathcal{I}})$ has at most $n$ edges. This establishes the lemma.

Lemma 3.6. For every query $Q$, the oracle returns sets $S_1, \ldots, S_k$ with

$$\left|\left(\bigcup_{i=1}^{k} S_i\right) \cap Q\right| \ge \frac{\left|\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q\right|}{O(\sqrt{n/k})}$$

for any $S_1^*, \ldots, S_k^* \in \mathcal{I}$.

Proof. Fix an optimal choice of $S_1^*, \ldots, S_k^*$, and denote $OPT = \left|\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q\right|$. Recall that the oracle finds $\hat{S}_1, \ldots, \hat{S}_k \in \hat{\mathcal{I}}$ maximizing $\left|\left(\bigcup_{i=1}^{k} \hat{S}_i\right) \cap Q\right|$, and then outputs the corresponding original sets $S_1, \ldots, S_k$. It suffices to show that there are some $\hat{S}_1, \ldots, \hat{S}_k \in \hat{\mathcal{I}}$ with $\left|\left(\bigcup_{i=1}^{k} \hat{S}_i\right) \cap Q\right| \ge OPT / O(\sqrt{n/k})$.

We distinguish two cases, based on whether most of $\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q$ are in big or small sets in $\hat{\mathcal{I}}$. Recall that $\hat{\mathcal{I}}$ forms a partition of $X$. We say $\hat{S} \in \hat{\mathcal{I}}$ is "significant" if $|\hat{S}| \ge \sqrt{n/k}$, otherwise $\hat{S}$ is "insignificant". Similarly, we say an item $i \in X$ is "significant" if it falls in a significant set in $\hat{\mathcal{I}}$, otherwise it is "insignificant". Notice that there are at most $\frac{n}{\sqrt{n/k}} = \sqrt{nk}$ significant sets.

First, we consider the case where at least half the items in $\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q$ are significant. Since there are at most $\sqrt{nk}$ significant sets in $\hat{\mathcal{I}}$, by the pigeonhole principle there are $k$ of them that collectively cover a $k/\sqrt{nk} = \sqrt{k/n}$ fraction of all significant items in $\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q$. This would guarantee the $O(\sqrt{n/k})$ approximation, as needed.

Next, we consider the case where at least half of $\left(\bigcup_{i=1}^{k} S_i^*\right) \cap Q$ are insignificant. By examining the greedy algorithm of the static stage, it is easy to see that each $S \in \mathcal{I}$ contains at most $\sqrt{n/k}$ insignificant items.
Therefore, there are at most $k \cdot \sqrt{n/k} = \sqrt{nk}$ insignificant items in $\bigcup_{i=1}^{k} S_i^*$. Therefore we deduce that $OPT = \left|\left(\bigcup_{i} S_i^*\right) \cap Q\right| \le 2\sqrt{kn}$. Since the optimal solution covers $O(\sqrt{kn})$ items in $Q$, it suffices for an $O(\sqrt{n/k})$ approximation to show that there are $\widehat{S}_1, \ldots, \widehat{S}_k \in \widehat{\mathcal{I}}$ that collectively cover $k$ items of $Q$. It is easy to see that this is indeed the case, since $\widehat{\mathcal{I}}$ is a partition of $X$. This completes the proof.

4 Lower Bounds

This section develops lower bounds for the SDC problem. We consider deterministic oracles that store a datastructure of size $b(n, m, k)$ for set systems with $n$ items, $m$ sets, and maximum number of allowed sets $k$. Moreover, we assume that $n \le b(n, m, k) \le nm$, since no nontrivial positive result is possible when $b(n, m, k) = o(n)$, and a perfect approximation ratio of 1 is possible when $b(n, m, k) = \Omega(nm)$.

4.1 Main Result and Roadmap

The main result of this section is stated in the following theorem, which says that our randomized oracle in the previous section achieves a space-approximation tradeoff that essentially matches the best possible for any deterministic oracle.

Theorem 4.1. Consider any deterministic oracle that stores a datastructure of size at most $b(n, m, k)$ bits, where $n \le b(n, m, k) \le nm$. Let $\epsilon(n, m, k)$ be such that $b(n, m, k) = nm^{1 - 2\epsilon(n, m, k)}$. When $m^{\epsilon(n, m, k)} \le \sqrt{n}$, the oracle does not attain an approximation ratio of $O\!\left(\frac{m^{\epsilon(n, m, k) - \delta}}{k\sqrt{k}}\right)$ for any constant $\delta > 0$. Moreover, when $\sqrt{n} \le m^{\epsilon(n, m, k)}$, the oracle does not attain an approximation ratio of $O\!\left(\frac{n^{1/2 - \delta}}{k\sqrt{k}}\right)$ for any $\delta > 0$.

The proof of the theorem above is somewhat involved.
Therefore, to simplify the presentation we prove in Section 4.2 a slight simplification of Theorem 4.1 that captures all the main ideas: our simplification sets $k = 1$, and proves the $O\!\left(\frac{m^{\epsilon(n, m, k) - \delta}}{k\sqrt{k}}\right)$ bound on the approximation ratio, for $m^{\epsilon(n, m, k)} \le \sqrt{n}$. Then, in Section 4.3 we prove the bound on the approximation ratio for the case of $\sqrt{n} \le m^{\epsilon(n, m, k)}$, still maintaining $k = 1$. Finally, in Section 4.4, we demonstrate how to modify our proofs for any $k$, yielding Theorem 4.1.

We fix $\delta > 0$. For the remainder of the section, we use $b$ and $\epsilon$ as shorthand for $b(n, m, k)$ and $\epsilon(n, m, k)$, respectively. We let $\alpha(n, m, k)$ be the approximation ratio of the oracle, and use $\alpha$ as shorthand. Observe that $0 \le \epsilon \le 1/2$.

4.2 Proof of a Simpler Lowerbound

We simplify Theorem 4.1 by assuming $k = 1$ and $m^{\epsilon} \le \sqrt{n}$. The result is the following proposition, stated using the shorthand notation described above.

Proposition 4.2. Fix $k = 1$ and parameter $\epsilon$ with $0 \le \epsilon \le 1/2$. Assume $m^{\epsilon} \le \sqrt{n}$. Consider any deterministic oracle that stores a datastructure of size at most $b = nm^{1 - 2\epsilon}$ bits. The oracle does not attain an approximation ratio of $O(m^{\epsilon - \delta})$ for any constant $\delta > 0$.

We assume the approximation ratio $\alpha$ attained by the oracle is $O(m^{\epsilon - \delta})$ and derive a contradiction. The proof uses the probabilistic method (see [2]). We begin by defining a distribution on set systems, and then go on to show that this distribution “fools” a small coverage oracle with positive probability.

4.2.1 Defining a Distribution $D$ on Set Systems

We will show that there is a set system $(X, \mathcal{I})$ and a query $Q$ that forces the algorithm to output a set $S \in \mathcal{I}$ that is not within $\alpha$ from optimal. We use the probabilistic method.
Namely, we exhibit a distribution $D$ over set systems $(X, \mathcal{I})$ such that, for every deterministic oracle storing a datastructure of size $b$, there exists with non-zero probability a query $Q$ for which the oracle outputs a set of approximation worse than $\alpha$. To show this, we draw two set systems i.i.d. from $D$, and show that with non-zero probability both of the following hold: the two set systems are not distinguished by the coverage oracle, and moreover there exists a query $Q$ that requires the algorithm to return different answers for the two set systems in order to achieve an $O(m^{\epsilon - \delta})$ approximation.

We define $D$ as follows. Given the ground set $X = \{1, \ldots, n\}$, we let $\mathcal{I} = \{A_i\}_{i=1}^{m}$ and draw $A_1, \ldots, A_m$ i.i.d. as follows: we let $A_i$ be a subset of $X$ of size $nm^{-\epsilon}$ drawn uniformly at random.

4.2.2 Sampling twice from $D$ and collisions

Next, we draw two set systems $(X, \mathcal{I} = \{A_i\}_{i=1}^{m})$ and $(X, \mathcal{I}' = \{A'_i\}_{i=1}^{m})$ i.i.d. from $D$, as discussed above. First, we lowerbound the probability that $(X, \mathcal{I})$ and $(X, \mathcal{I}')$ are not distinguished by the coverage oracle. We call such an occurrence a “collision”.

Lemma 4.3. The probability that the same datastructure is stored for $(X, \mathcal{I})$ and $(X, \mathcal{I}')$ is at least $2^{-b}$.

Proof. There are $2^{b}$ possible datastructures. Let $p_i$ denote the probability that, when presented with random $(X, \mathcal{I}) \sim D$, the oracle stores the $i$'th datastructure. We can write the probability of “collision” of the two i.i.d. samples $(X, \mathcal{I})$ and $(X, \mathcal{I}')$ as $\sum_{i=1}^{2^{b}} p_i^2$. However, since $\sum_i p_i = 1$, this expression is minimized when $p_i = 2^{-b}$ for all $i$. Plugging into the above expression gives a lowerbound of $2^{-b}$, as required.
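As a sanity check of Lemma 4.3, the following sketch enumerates a tiny instance of $D$ exactly and computes the collision probability $\sum_i p_i^2$ for two toy static stages. Everything here is an illustrative assumption rather than part of the paper: the helper names, the parameter `set_size` (standing in for $nm^{-\epsilon}$), and the two toy static stages, each of which stores $b = m$ bits.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, product


def datastructure_distribution(n, m, set_size, static_stage):
    """Exact distribution of the stored datastructure when (X, I) ~ D, where D
    draws m subsets of {1..n} of the given size independently and uniformly."""
    subsets = [frozenset(c) for c in combinations(range(1, n + 1), set_size)]
    systems = list(product(subsets, repeat=m))  # all equally likely under D
    counts = Counter(static_stage(system) for system in systems)
    return {ds: Fraction(c, len(systems)) for ds, c in counts.items()}


def collision_probability(dist):
    """Probability that two i.i.d. set systems map to the same datastructure."""
    return sum(p * p for p in dist.values())


# Toy static stage 1 (hypothetical): one bit per set, "does A_i contain item 1?"
def balanced_stage(system):
    return tuple(1 in a for a in system)


# Toy static stage 2 (hypothetical): one bit per set, "is A_i equal to {1, 2}?"
def skewed_stage(system):
    return tuple(a == frozenset({1, 2}) for a in system)


if __name__ == "__main__":
    n, m, set_size, b = 4, 2, 2, 2
    for stage in (balanced_stage, skewed_stage):
        dist = datastructure_distribution(n, m, set_size, stage)
        print(stage.__name__, collision_probability(dist), ">=", Fraction(1, 2 ** b))
```

The balanced stage spreads the probability evenly over the $2^{b}$ datastructures and meets the bound with equality, while the skewed stage collides more often, in line with the minimization argument in the proof.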
4.2.3 Fooling Queries and Candidates

Next, we lowerbound the probability that a query $Q$ exists requiring two different answers for $(X, \mathcal{I})$ and $(X, \mathcal{I}')$ in order to get the desired $\alpha = O(m^{\epsilon - \delta})$ approximation. We call such a query $Q$ a fooling query. We define a set of queries that are “candidates” for being a fooling query: a set $Q \subseteq X$ is called a candidate query if $Q = A_i \cup A'_{i'}$ for some $i \ne i'$. In other words, a query is a candidate if it is the union of a set from $(X, \mathcal{I})$ and a set from $(X, \mathcal{I}')$ with different indices.

Ideally, candidate $Q = A_i \cup A'_{i'}$ would be a fooling query by forcing the oracle to output $i$ for $(X, \mathcal{I})$ and $i'$ for $(X, \mathcal{I}')$ in order to guarantee the desired approximation. However, this need not be the case: consider for instance the case when, for some $j \ne i, i'$, both $A_j$ and $A'_j$ have large intersection with $Q$, making it acceptable to output $j$ for both. We will show that the probability that none of the candidate queries is a fooling query is strictly less than $2^{-b}$ when $n$ and $m$ are sufficiently large. Doing so would complete the proof: collision occurs with probability $\ge 2^{-b}$, and a fooling query exists with probability $> 1 - 2^{-b}$, and therefore both occur simultaneously with positive probability. This would yield the desired contradiction.

4.2.4 The Probability that None of the Candidates is Fooling is Small

We now upperbound the probability that none of the candidates is a fooling query. Observe that if candidate $Q = A_i \cup A'_{i'}$ is not a fooling query, then there exists $A \in (\mathcal{I} \cup \mathcal{I}') \setminus \{A_i, A'_{i'}\}$ with $|A \cap Q| \ge nm^{-\epsilon}/\alpha$. Therefore one of the following must be true:

1. There exists $A \in (\mathcal{I} \cup \mathcal{I}') \setminus \{A_i, A'_{i'}\}$ with $|A \cap A_i| \ge nm^{-\epsilon}/(2\alpha) = \Omega(nm^{-2\epsilon + \delta})$.
2. There exists $A \in (\mathcal{I} \cup \mathcal{I}') \setminus \{A_i, A'_{i'}\}$ with $|A \cap A'_{i'}| \ge nm^{-\epsilon}/(2\alpha) = \Omega(nm^{-2\epsilon + \delta})$.
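For $k = 1$, the notions of candidate and fooling queries can be made concrete with a short sketch. The function names, the list-of-frozensets representation, and the two small examples below are hypothetical and chosen only for illustration; the check treats a query as fooling when no single index gives an $\alpha$-approximation in both set systems, which is the situation a collided oracle cannot handle.

```python
from typing import FrozenSet, Iterator, List, Tuple

SetSystem = List[FrozenSet[int]]


def candidate_queries(sys_a: SetSystem,
                      sys_b: SetSystem) -> Iterator[Tuple[int, int, FrozenSet[int]]]:
    """Yield (i, i', A_i union A'_{i'}) for every pair of distinct indices."""
    for i, a in enumerate(sys_a):
        for j, b in enumerate(sys_b):
            if i != j:
                yield i, j, a | b


def is_fooling(sys_a: SetSystem, sys_b: SetSystem,
               query: FrozenSet[int], alpha: float) -> bool:
    """True if no single index j is an alpha-approximation on `query`
    for both set systems at once (the k = 1 case)."""
    opt_a = max(len(s & query) for s in sys_a)
    opt_b = max(len(s & query) for s in sys_b)
    for a_j, b_j in zip(sys_a, sys_b):
        good_for_a = alpha * len(a_j & query) >= opt_a
        good_for_b = alpha * len(b_j & query) >= opt_b
        if good_for_a and good_for_b:
            return False  # returning index j is acceptable in both systems
    return True


if __name__ == "__main__":
    # Example where the candidate A_0 union A'_1 is fooling.
    sys_a = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
    sys_b = [frozenset({5, 6}), frozenset({7, 8}), frozenset({1, 3})]
    print(is_fooling(sys_a, sys_b, sys_a[0] | sys_b[1], alpha=2))

    # Example of the caveat in the text: index 2 works for both systems.
    sys_c = [frozenset({1, 2}), frozenset({3, 4}), frozenset({1, 7})]
    sys_d = [frozenset({5, 6}), frozenset({7, 8}), frozenset({2, 8})]
    print(is_fooling(sys_c, sys_d, sys_c[0] | sys_d[1], alpha=1))
```

The second example mirrors the caveat of Section 4.2.3: a third index whose sets both intersect the query heavily prevents the candidate from being fooling.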
Therefore, if none of the candidates were fooling queries, then there are many “pairs” of sets in $\mathcal{I} \cup \mathcal{I}'$ that have an intersection substantially larger than the expected size of $nm^{-2\epsilon}$. This seems very unlikely. Indeed, the remainder of this proof will demonstrate just that.

If none of the candidates are fooling queries, then by examining (1) and (2) above we deduce the following. There exists (see footnote 8) a set of pairs $P \subseteq (\mathcal{I} \cup \mathcal{I}') \times (\mathcal{I} \cup \mathcal{I}')$ such that:

1. $|P| \ge m - 2 = \Omega(m)$.
2. The undirected graph with nodes $\mathcal{I} \cup \mathcal{I}'$ and edges $P$ is bipartite. Moreover, every node in the left part has degree at most 1. Thus $P$ is acyclic.
3. If $(B, C) \in P$ then $|B \cap C| \ge \Omega(nm^{-2\epsilon + \delta})$.

Footnote 8. Consider constructing $P$ as follows: for candidate query $Q = A_1 \cup A'_2$, find the set in $(\mathcal{I} \cup \mathcal{I}') \setminus \{A_1, A'_2\}$ with a large intersection with one of $A_1$ or $A'_2$ as in (1) or (2). Say for instance we find that $A_7$ has a large intersection with $A_1$. We include $(A_1, A_7)$ in $P$, mark both $A_1$ and $A_7$ as “touched”, and designate $A_1$ a “left” node and $A_7$ a “right” node. Then, we repeat the process with some candidate $Q' = A_i \cup A'_{i'}$ for some “untouched” $A_i$ and $A'_{i'}$. We keep repeating until there are no such candidates. Throughout this greedy process, we mark at most two members of $\mathcal{I} \cup \mathcal{I}'$ as “touched” for every pair we include in $P$. Note that some $A_i$ may be “touched” more than once. As long as there are at least 2 untouched sets in each of $\mathcal{I}$ and $\mathcal{I}'$, the algorithm may continue.

We now proceed to bound the probability of existence of such a $P$, and in the process also bound the probability that none of the candidate queries are fooling. Recall that members of $\mathcal{I} \cup \mathcal{I}'$ are drawn i.i.d. from the uniform distribution on subsets of $X$ of size $nm^{-\epsilon}$. For every pair of sets $B, C \in \mathcal{I} \cup \mathcal{I}'$, we let $R(B, C) = |B \cap C|$ denote the size of their intersection. It is easy to see that the random variables $\{R(B, C)\}_{B, C \in \mathcal{I} \cup \mathcal{I}'}$ are pairwise independent. Therefore, any acyclic set of pairs is mutually independent, by basic probability theory. Thus, if we fix a particular $P$ satisfying (1) and (2), the probability that $P$ satisfies condition (3) is at most

$$\prod_{(B, C) \in P} \Pr\left[R(B, C) \ge \Omega(nm^{-2\epsilon + \delta})\right].$$

We now want to estimate the probability that the intersection of $B$ and $C$ is a factor $\Omega(m^{\delta})$ more than its expectation of $nm^{-2\epsilon}$. Therefore, we consider an indicator random variable $Y_i$ for each $i \in X$, designating whether $i \in B \cap C$. If the $Y_i$ were independent, we could use Chernoff bounds to bound the probability that $R(B, C)$ is large. Fortunately, it is easy to see that the $Y_i$'s are negatively correlated: i.e., for any $L \subseteq \{1, \ldots, n\}$, we have

$$\Pr\left[\bigwedge_{i \in L} Y_i = 1\right] \le \prod_{i \in L} \Pr[Y_i = 1].$$

Therefore, by the result of [12], if we “pretend” that they are independent by approximating their joint distribution by i.i.d. Bernoulli random variables, we can still use Chernoff bounds to bound the upper-tail probability. Therefore, using Chernoff bounds (see footnote 9) we deduce that the probability that the intersection of $B$ and $C$ is a factor $\Omega(m^{\delta})$ more than the expectation of $nm^{-2\epsilon}$ is at most

$$2^{-(\Omega(m^{\delta}) - 1)\, nm^{-2\epsilon}} \le 2^{-\Omega(nm^{-2\epsilon + \delta})}.$$

Therefore, the probability that the fixed $P$ satisfies condition (3) is at most

$$\prod_{(B, C) \in P} 2^{-\Omega(nm^{-2\epsilon + \delta})} \le \left(2^{-\Omega(nm^{-2\epsilon + \delta})}\right)^{|P|} \le 2^{-\Omega(nm^{1 - 2\epsilon + \delta})}.$$

Now, we can sum over all possible choices for $P$ satisfying (1) and (2) to get a bound on the existence of a $P$ satisfying (1), (2) and (3). It is easy to see that there are at most $m^{m}$ choices for $P$ that satisfy (1) and (2). Using the union bound, we get the following bound on the existence of such a $P$.
$$m^{m} \cdot 2^{-\Omega(nm^{1 - 2\epsilon + \delta})} \le 2^{m \log m - \Omega(nm^{1 - 2\epsilon + \delta})} \le 2^{-\Omega(nm^{1 - 2\epsilon + \delta})},$$

where the last inequality follows by simple algebraic manipulation from our assumption that $m^{\epsilon} \le \sqrt{n}$ and $\delta > 0$, when $n$ and $m$ are sufficiently large. Recall that, by our previous discussion, this expression also upperbounds the probability that none of the candidate queries are fooling queries. But, when $n$ and $m$ are sufficiently large, this is strictly smaller than $2^{-b} = 2^{-nm^{1 - 2\epsilon}}$. Thus, by our previous discussion, this completes the proof of Proposition 4.2.

Footnote 9. We use the following version of the Chernoff bound: let $X_1, \ldots, X_n$ be independent Bernoulli random variables, and let $X = \sum_i X_i$. If $E[X] = \mu$ and $\Delta > 2e - 1$, then $\Pr[X > (1 + \Delta)\mu] \le 2^{-\Delta\mu}$.

4.3 Modifying the proof for the case $\sqrt{n} \le m^{\epsilon}$

We maintain the assumption that $k = 1$, and show how to modify the proof of Proposition 4.2 for the case when $\sqrt{n} \le m^{\epsilon}$.

Proposition 4.4. Fix $k = 1$ and parameter $\epsilon$ with $0 \le \epsilon \le 1/2$. Assume $\sqrt{n} \le m^{\epsilon}$. Consider any deterministic oracle that stores a datastructure of size at most $b = nm^{1 - 2\epsilon}$ bits. The oracle does not attain an approximation ratio of $O(n^{1/2 - \delta})$ for any constant $\delta > 0$.

Instead of replicating almost the entire proof of Proposition 4.2, we point out the key changes necessary to yield a proof of Proposition 4.4 and leave the rest as an easy exercise for the reader. The proof proceeds almost identically to the proof of Proposition 4.2, with the following main changes:

• Modifications to Section 4.2.1: When defining $D$, we let each $A_i$ be a subset of $X$ of size $\sqrt{n}$ instead of $nm^{-\epsilon}$.
• We perform similar calculations throughout, accommodating the above modification to the size of $A_i$.
• Modifications to Section 4.2.4: We eventually arrive at an upper bound of $2^{-mn^{\delta}}$ on the probability that none of the candidate queries are fooling.
  Using the assumption $m^{\epsilon} \ge \sqrt{n}$ and the fact that $b = nm^{1 - 2\epsilon}$, a simple algebraic manipulation shows that this bound is strictly less than $2^{-b}$. This completes the proof, as before.

4.4 Modifying the proof for arbitrary $k$

In this section, we generalize Proposition 4.2 to arbitrary $k$. The generalization of Proposition 4.4 to arbitrary $k$ is essentially identical, and therefore we leave it as an exercise for the reader. We now state the generalization of Proposition 4.2 to arbitrary $k$.

Proposition 4.5. Let parameter $\epsilon$ be such that $0 \le \epsilon \le 1/2$. Assume $m^{\epsilon} \le \sqrt{n}$. Consider any deterministic oracle that stores a datastructure of size at most $b = nm^{1 - 2\epsilon}$ bits. The oracle does not attain an approximation ratio of $O\!\left(\frac{m^{\epsilon - \delta}}{k\sqrt{k}}\right)$ for any constant $\delta > 0$.

The proof of Proposition 4.5 follows the outline of the proof of Proposition 4.2. The necessary modifications to the proof of Proposition 4.2 are as follows:

• Modifications to Section 4.2.1: We define distribution $D$ as before, except that we let each $A_i$ be a subset of $X$ of size $nm^{-\epsilon}\sqrt{k}$.
• Modifications to Section 4.2.2: Instead of sampling from $D$ twice, we sample $2k + 1$ times to get set systems $(X, \mathcal{I}_1), (X, \mathcal{I}_2), \ldots, (X, \mathcal{I}_{2k+1})$. This changes the probability of collision of Lemma 4.3 to $2^{-2kb}$. Here, collision means that all $2k + 1$ samples from $D$ are stored as the same datastructure by the static stage of the oracle.
• Modifications to Section 4.2.3: We now define a fooling query analogously for general $k$: a query $Q$ is fooling if there is no single index $i$ such that returning the $i$'th set gives a good approximation for all the set systems $(X, \mathcal{I}_1), \ldots, (X, \mathcal{I}_{2k+1})$. Moreover, we analogously define candidate queries: we use $A^{a}_{b}$ to denote the $b$'th set in set system $(X, \mathcal{I}_a)$. We say $Q \subseteq X$ is a candidate if $Q = A^{\ell_1}_{i_1} \cup A^{\ell_2}_{i_2} \cup \cdots \cup A^{\ell_{k+1}}_{i_{k+1}}$, where indices $\ell_1, \ldots, \ell_{k+1}$ are distinct, and indices $i_1, \ldots, i_{k+1}$ are distinct. In other words, $Q$ is a candidate query if it is the union of $k + 1$ sets from $k + 1$ distinct set systems and $k + 1$ distinct indices in those set systems.
• Modifications to Section 4.2.4: Similarly, if a candidate $Q = A^{\ell_1}_{i_1} \cup \cdots \cup A^{\ell_{k+1}}_{i_{k+1}}$ is not a fooling query, then there is some $A \in (\mathcal{I}_1 \cup \cdots \cup \mathcal{I}_{2k+1}) \setminus \{A^{\ell_j}_{i_j}\}_j$ with $|A \cap Q| \ge nm^{-\epsilon}/\alpha$. Therefore, for one of the components $A^{\ell_j}_{i_j}$ of $Q$ we have that $|A \cap A^{\ell_j}_{i_j}| \ge nm^{-\epsilon}/(k\alpha)$. Plugging in the approximation ratio $\alpha = m^{\epsilon - \delta}/(k\sqrt{k})$ we have that $|A \cap A^{\ell_j}_{i_j}| \ge nm^{-2\epsilon + \delta}\sqrt{k}$. It is not too hard to see that we can construct $P$ similarly with:
  1. $|P| \ge k(m - k) = \Omega(km)$ (see footnote 10).
  2. The undirected graph with nodes $\mathcal{I}_1 \cup \cdots \cup \mathcal{I}_{2k+1}$ and edges $P$ is bipartite. Moreover, every node in the left part has degree at most 1. Thus $P$ is acyclic.
  3. If $(B, C) \in P$ then $|B \cap C| \ge \Omega(nm^{-2\epsilon + \delta}\sqrt{k})$.
  Continuing with the remaining calculations in this section almost identically gives a bound of $2^{-\Omega(knm^{1 - 2\epsilon + \delta})}$ on the probability that a fixed $P$ satisfies condition (3). The number of such $P$ is at most $(km)^{km}$, therefore a similar calculation gives a bound of $2^{-\Omega(knm^{1 - 2\epsilon + \delta})} = 2^{-\Omega(kbm^{\delta})}$ on the existence of any such $P$. As before, this completes the proof.

Footnote 10. This is not true when $k$ is almost equal to $m$. However, the theorem becomes trivially true when $k > m^{1/6}$, so we can without loss of generality assume that $k$ is not too large.

5 Conclusions and Future Work

This paper introduced and studied a fundamental problem, called SDC, arising in many large-scale Web applications. A summary of the results obtained by the paper appears in Table 1 (Section 2.2). The main specific open question that arises is whether there is a deterministic oracle that is as good as the randomized oracle proposed in Section 3.
More generally, a detailed analysis of practical subclasses of SDC seems to hold promise.

Acknowledgements

We thank Philip Bohannon, Hector Garcia-Molina, Ashwin Machavanajhhaala, Tim Roughgarden, and Elad Verbin for insightful discussions.

References

[1] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE TKDE, 17, 2005.
[2] N. Alon and J. Spencer. The Probabilistic Method. John Wiley, 1992.
[3] Y. Azar, N. Alon, and B. Awerbuch. The online set cover problem. In STOC, 2003.
[4] Francesco Bonchi, Carlos Castillo, Debora Donato, and Aristides Gionis. Topical query decomposition. In KDD, 2008.
[5] Jaime G. Carbonell and Jade Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In SIGIR, 1998.
[6] M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[7] Harr Chen and David R. Karger. Less is more: probabilistic models for retrieving fewer relevant documents. In SIGIR, 2006.
[8] Nello Cristianini and Matthew W. Hahn. Introduction to Computational Genomics. Cambridge University Press, 2006.
[9] M. R. Garey and D. S. Johnson. Computers and Intractability. W. H. Freeman and Company, 1979.
[10] YJ. Naor and N. Buchbinder. Online primal-dual algorithms for covering and packing problems. In ESA, 2005.
[11] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions – I. Mathematical Programming, 14(3):265–294, 1978.
[12] Alessandro Panconesi and Aravind Srinivasan. Randomized distributed edge coloring via an extension of the Chernoff-Hoeffding bounds. SIAM J. Comput., 26(2):350–368, 1997.
[13] B. Saha and L. Getoor. On maximum coverage in the streaming model and application to multi-topic blog-watch. In SDM, 2009.
[14] Mikkel Thorup and Uri Zwick. Approximate distance oracles. J. ACM, 52(1):1–24, 2005.