A Unified Approach to Ranking in Probabilistic Databases

Jian Li∗  Barna Saha†  Amol Deshpande‡

Abstract

Ranking is a fundamental operation in data analysis and decision support, and plays an even more crucial role if the dataset being explored exhibits uncertainty. This has led to much work in recent years on understanding how to rank the tuples in a probabilistic dataset. In this article, we present a unified approach to ranking and top-k query processing in probabilistic databases by viewing it as a multi-criteria optimization problem, and by deriving a set of features that capture the key properties of a probabilistic dataset that dictate the ranked result. We contend that a single, specific ranking function may not suffice for probabilistic databases, and we instead propose two parameterized ranking functions, called PRF^ω and PRF^e, that generalize or can approximate many of the previously proposed ranking functions. We present novel generating functions-based algorithms for efficiently ranking large datasets according to these ranking functions, even if the datasets exhibit complex correlations modeled using probabilistic and/xor trees or Markov networks. We further propose that the parameters of the ranking function be learned from user preferences, and we develop an approach to learn those parameters. Finally, we present a comprehensive experimental study that illustrates the effectiveness of our parameterized ranking functions, especially PRF^e, at approximating other ranking functions, and the scalability of our proposed algorithms for exact or approximate ranking.

1 Introduction

Recent years have seen a dramatic increase in the number of application domains that naturally generate uncertain data and that demand support for executing complex decision support queries over them.
These include information retrieval [21], data integration and cleaning [2, 18], text analytics [25, 31], social network analysis [1], sensor data management [12, 17], financial applications, biological and scientific data management, etc. Uncertainty arises in these environments for a variety of reasons. Sensor data typically contains noise and measurement errors, and is often incomplete because of sensor faults or communication link failures. In social networks and scientific domains, the observed interaction or experimental data is often very noisy, and the ubiquitous use of predictive models adds a further layer of uncertainty. The use of automated tools in data integration and information extraction can introduce significant uncertainty in the output.

By their very nature, many of these applications require support for ranking or top-k query processing over large volumes of data. For instance, consider a House Search application where a user is searching for a house using a real estate sales dataset that lists the houses for sale. Such a dataset, which may be constructed by crawling and combining data from multiple sources, is inherently uncertain and noisy. In fact, the houses that the user prefers the most are also the most likely to have been sold by now. We may denote such uncertainty by associating with each advertisement a probability that it is still valid. Incorporating such uncertainties into the returned answers is, however, a challenge, considering the complex interplay between the relevance of a house by itself and the probability that the advertisement is still valid.

∗ Computer Science Department, University of Maryland, College Park 20742, MD, USA. Email: lijian@cs.umd.edu
† Computer Science Department, University of Maryland, College Park 20742, MD, USA. Email: barna@cs.umd.edu
‡ Computer Science Department, University of Maryland, College Park 20742, MD, USA.
Email: amol@cs.umd.edu

Many other application domains also exhibit resource constraints of some form, and we must somehow rank the entities or tuples under consideration to select the most relevant objects to focus our attention on. For example, in financial applications, we may want to choose the best stocks in which to invest, given their expected performance in the future (which is uncertain at best). In learning or classification tasks, we often need to choose the best "k" features to use [56]. In sensor networks or scientific databases, we may not know the "true" values of the physical properties being measured because of measurement noise or failures [17], but we may still need to choose a set of sensors or entities in response to a user query.

Ranking in the presence of uncertainty is non-trivial even if the relevance scores can be computed easily (the main challenge in the deterministic case), mainly because of the complex trade-offs introduced by the score distributions and the tuple uncertainties. This has led to many ranking functions being proposed in recent years for combining scores and probabilities, all of which appear quite natural at first glance (we review several of them in detail later). We begin with a systematic exploration of these issues by recognizing that ranking in probabilistic databases is inherently a multi-criteria optimization problem, and by deriving a set of features, the key properties of a probabilistic dataset that influence the ranked result. We empirically illustrate the diverse and conflicting behavior of several natural ranking functions, and argue that a single specific ranking function may not be appropriate to rank the different uncertain databases that we may encounter in practice. Furthermore, different users may weigh the features differently, resulting in different rankings over the same dataset.
We then define a general and powerful ranking function, called PRF, that allows us to explore the space of possible ranking functions. We discuss its relationship to previously proposed ranking functions, and also identify two specific parameterized ranking functions, called PRF^ω and PRF^e, as being interesting. The PRF^ω ranking function is essentially a linear, weighted ranking function that resembles the scoring functions typically used in information retrieval, web search, data integration, keyword query answering, etc. [9, 16, 27, 34, 52]. We observe that PRF^ω may not be suitable for ranking large datasets due to its high running time, and instead propose PRF^e, which uses a single parameter and can effectively approximate previously proposed ranking functions for probabilistic databases very well.

We then develop novel algorithms based on generating functions to efficiently rank the tuples in a probabilistic dataset using any PRF ranking function. Our algorithm can handle a probabilistic dataset with arbitrary correlations; however, it is particularly efficient when the probabilistic database contains only mutual exclusivity and/or co-existence correlations (called probabilistic and/xor trees [42]). Our main contributions can be summarized as follows:

• We develop a framework for learning ranking functions over probabilistic databases by identifying a set of key features, by proposing several parameterized ranking functions over those features, and by choosing the parameters based on user preferences or feedback.

• We present novel algorithms based on generating functions that enable us to efficiently rank very large datasets. Our key algorithm is an O(n log(n)) algorithm for evaluating a PRF^e function over datasets with low correlations (specifically, constant-height probabilistic and/xor trees). The algorithm runs in O(n) time if the dataset is pre-sorted by score.
• We present a polynomial-time algorithm for ranking a correlated dataset when the correlations are captured using a bounded-treewidth graphical model. The algorithm we present actually computes the probability that a given tuple is ranked at a given position across all the possible worlds, and is of independent interest.

• We develop a novel, DFT-based algorithm for approximating an arbitrary weighted ranking function using a linear combination of PRF^e functions.

• We show that a PRF^ω ranked result can be seen as a consensus answer under a suitably defined distance function; a consensus answer is defined to be the answer that is closest in expectation to the answers over the possible worlds.

• We present a comprehensive experimental study over several real and synthetic datasets, comparing the behavior of the ranking functions and the effectiveness of our proposed algorithms.

Outline: We begin with a brief discussion of the related work (Section 2). In Section 3, we review our probabilistic database model and the prior work on ranking in probabilistic databases, and propose two parameterized ranking functions. In Section 4, we present our generating functions-based algorithms for ranking. We then present an approach to approximate different ranking functions using our parameterized ranking functions, and to learn a ranking function from user preferences (Section 5). In Section 6, we explore the connection between PRF^ω and consensus top-k query results. In Section 7, we observe an interesting property of the PRF^e function that helps us gain better insight into its behavior. We then present a comprehensive experimental study in Section 8. Finally, in Section 9, we develop an algorithm for handling correlated datasets where the correlations are captured using bounded-treewidth graphical models.
2 Related Work

There has been much work on managing probabilistic, uncertain, incomplete, and/or fuzzy data in database systems (see, e.g., [12, 14, 21, 24, 38, 41, 53]). The work in this area has spanned a range of issues from the theoretical development of data models and data languages to practical implementation issues such as indexing techniques; several research efforts are underway to build systems to manage uncertain data (e.g., MYSTIQ [14], Trio [53], ORION [12], MayBMS [38], PrDB [49]). The approaches can be differentiated based on whether they support tuple-level uncertainty, where "existence" probabilities are attached to the tuples of the database, or attribute-level uncertainty, where (possibly continuous) probability distributions are attached to the attributes, or both. The proposed approaches differ further based on whether they consider correlations or not. Most work in probabilistic databases has either assumed independence [14, 21] or has restricted the correlations that can be modeled [2, 41, 48]. More recently, several approaches have been presented that allow representation of arbitrary correlations and querying over correlated databases [24, 39, 49].

The area of ranking and top-k query processing has also seen much work in databases (see, e.g., the survey by Ilyas et al. [29]). More recently, several researchers have considered top-k query processing in probabilistic databases. Soliman et al. [50] defined the problem of ranking over probabilistic databases, and proposed two ranking functions to combine tuple scores and probabilities. Yi et al. [54] present improved algorithms for the same ranking functions. Zhang and Chomicki [55] present desiderata for ranking functions, and propose the notion of Global Top-k answers. Ming Hua et al. [28] propose probabilistic threshold ranking, which is quite similar to Global Top-k. Cormode et al. [13] also present a semantics of ranking functions and a new ranking function called expected rank.
Liu et al. [45] propose the notion of k-selection queries; unlike most of the above definitions, the result here is sensitive to the actual tuple scores. We review these ranking functions in detail in the next section. Ge et al. [22] propose the notion of typical answers, where they propose returning a collection of typical answers instead of just one answer. This can be seen as complementary to our approach here; one could show the typical answers to the user to understand the user preferences during an exploratory phase, and then learn a single ranking function to rank using the techniques developed in this article. There has also been work on top-k query processing in probabilistic databases where the ranking is by the result tuple probabilities (i.e., probability and score are identical) [46]. The main challenge in that work is efficient computation of the probabilities, whereas we assume that the probability and score are either given or can be computed easily.

The aforementioned work has focused mainly on tuple uncertainty and discrete attribute uncertainty. Soliman and Ilyas [51] were the first to consider the problem of handling continuous distributions. Recently, in a follow-up work [43], we extended the algorithm for PRF to arbitrary continuous distributions.

Time   Car Loc   Plate No   Speed   ...   Prob   Tuple
11:40  L1        X-123      120     ...   0.4    t1
11:55  L2        Y-245      130     ...   0.7    t2
11:35  L3        Y-245      80      ...   0.3    t3
12:10  L4        Z-541      95      ...   0.4    t4
12:25  L5        Z-541      110     ...   0.6    t5
12:15  L6        L-110      105     ...   1.0    t6

Possible Worlds              Prob
pw1 = {t2, t1, t6, t4}       .112
pw2 = {t2, t1, t5, t6}       .168
pw3 = {t1, t6, t4, t3}       .048
pw4 = {t1, t5, t6, t3}       .072
pw5 = {t2, t6, t4}           .168
pw6 = {t2, t5, t6}           .252
pw7 = {t6, t4, t3}           .072
pw8 = {t5, t6, t3}           .108

[And/xor tree: an ∧ root over ∨ nodes, with edge probabilities .4 (t1: 120), .7/.3 (t2: 130 / t3: 80), .4/.6 (t4: 95 / t5: 110), and 1 (t6: 105).]
Figure 1: Example of a probabilistic database which contains automatically captured information about speeding cars; here the Plate No is the possible worlds key, and the speed is the score attribute that we will use for ranking. Tuples t2 and t3 (similarly, t4 and t5) are mutually exclusive. The second table lists all possible worlds. Note that the tuples are sorted according to their speeds in each possible world. The corresponding and/xor tree compactly encodes these correlations.

We were able to obtain exact polynomial-time algorithms for some classes of continuous probability distributions, and efficient approximation schemes with provable guarantees for arbitrary probability distributions. One important ingredient of those algorithms is an extension of the generating function used in this article.

Recently, there has also been much work on nearest neighbor-style queries over uncertain datasets [6, 10, 11, 40]. In fact, a nearest neighbor query (or a k-nearest neighbor query) can be seen as a ranking query where the score of a point is the distance of that point to the given query point. Thus, our new ranking semantics and algorithms can be directly used for nearest neighbor queries over uncertain points with discrete probability distributions.

There is a tremendous body of work on ranking documents in information retrieval, and on learning how to rank documents given user preferences (see Liu [44] for a comprehensive survey). That work has considered aspects such as different ranking models, loss functions, different scoring techniques, etc. The techniques developed there tend to be specific to document retrieval (focusing on keywords, terms, and relevance), and usually do not deal with existence uncertainty (although they often do model document relevance as a random variable). Furthermore, our work here primarily focuses on highly efficient algorithms for ranking using a spectrum of different ranking functions.
Exploring and understanding the connections between the two research areas is a fruitful direction for further research.

Finally, we note that one PRF function is only able to model the preferences of one user. There is increasing interest in finding a ranking that satisfies multiple users having diverse preferences and intents. Several new theoretical models have been proposed recently [3-5]. However, all the inputs are assumed to be certain in those models. Incorporating uncertainty into those models, or introducing the notion of diversity into our model, is an interesting research direction.

3 Problem Formulation

We begin by defining our model of a probabilistic database, called the probabilistic and/xor tree [42], which captures several common types of correlations. We then review the prior work on top-k query processing in probabilistic databases, and argue that a single specific ranking function may not capture the intricacies of ranking with uncertainty. We then present our parameterized ranking functions, PRF^ω and PRF^e.

3.1 Probabilistic Database Model

We use the prevalent possible worlds semantics for probabilistic databases [14]. We denote a probabilistic relation with tuple uncertainty by D^T, where T denotes the set of tuples (in Section 4.4, we present extensions to handle attribute uncertainty). The set of all possible worlds is denoted by PW = {pw_1, pw_2, ..., pw_n}. Each tuple t_i ∈ T is associated with an existence probability Pr(t_i) and a score score(t_i), computed based on a scoring function score: T → R. Usually score(t) is computed based on the tuple attribute values, and measures the relative user preference for different tuples. In a deterministic database, tuples with higher scores should be ranked higher. We use r_pw: T → {1, ..., n} ∪ {∞} to denote the rank of the tuple t in a possible world pw according to score. If t does not appear in the possible world pw, we let r_pw(t) = ∞.
We say t_1 ranks higher than t_2 in the possible world pw if r_pw(t_1) < r_pw(t_2). For each tuple t, we define a random variable r(t) that denotes the rank of t in D^T.

Definition 1 The positional probability of a tuple t being ranked at position k, denoted Pr(r(t) = k), is the total probability of the possible worlds where t is ranked at position k. The rank distribution of a tuple t, denoted Pr(r(t)), is simply the probability distribution of the random variable r(t).

Probabilistic And/Xor Tree Model: Our algorithms can handle arbitrarily correlated relations where the correlations are modeled using Markov networks (Section 9). However, in most of this article, we focus on the probabilistic and/xor tree model, introduced in our prior work [42], which captures only a more restricted set of correlations, but admits highly efficient query processing algorithms. More specifically, an and/xor tree captures two types of correlations: (1) mutual exclusivity (denoted ∨ (xor)) and (2) mutual co-existence (denoted ∧ (and)). Two events satisfy the mutual co-existence correlation if, in any possible world, either both events occur or neither occurs. Similarly, two events are mutually exclusive if there is no possible world where both happen. Now, let us formally define a probabilistic and/xor tree. In a tree T, we denote the set of children of node v by Ch_T(v), and the least common ancestor of two leaves l_1 and l_2 by LCA_T(l_1, l_2). We omit the subscript if the context is clear. For simplicity, we separate the attributes of the relation into two groups: (1) a possible worlds key, denoted K, which is unique in any possible world (i.e., two tuples that agree on K are mutually exclusive), and (2) the value attributes, denoted A. If the relation does not have any key attributes, K = ∅.
Definition 2 A probabilistic and/xor tree T represents the mutual exclusion and co-existence correlations in a probabilistic relation R^P(K; A), where K is the possible worlds key and A denotes the value attributes. In T, each leaf denotes a tuple, and each inner node has a mark, ∨ or ∧. For each ∨ node u and each of its children v ∈ Ch(u), there is a nonnegative value p_(u,v) associated with the edge (u, v). Moreover, we require:

• (Probability Constraint) Σ_{v ∈ Ch(u)} p_(u,v) ≤ 1.

• (Key Constraint) For any two different leaves l_1, l_2 holding the same key, LCA(l_1, l_2) is a ∨ node.¹

Let T_v be the subtree rooted at v, and Ch(v) = {v_1, ..., v_ℓ}. The subtree T_v inductively defines a random subset S_v of its leaves by the following independent process:

• If v is a leaf, S_v = {v}.

• If T_v is rooted at a ∨ node, then S_v = S_{v_i} with probability p_(v,v_i), and S_v = ∅ with probability 1 − Σ_{i=1}^{ℓ} p_(v,v_i).

• If T_v is rooted at a ∧ node, then S_v = ∪_{i=1}^{ℓ} S_{v_i}.

x-tuples (which can be used to specify mutual exclusivity correlations between tuples) correspond to the special case where we have a tree of height 2, with a ∧ node as the root and only ∨ nodes in the second level. Figure 2 shows an example of an and/xor tree that models the data from a traffic monitoring

¹ The key constraint is imposed to avoid two leaves with the same key but different attribute values coexisting in a possible world.

Possible Worlds                         Prob
pw1 = {(t3, 6), (t2, 5), (t1, 1)}       .3
pw2 = {(t3, 9), (t1, 7)}                .3
pw3 = {(t2, 8), (t4, 4), (t5, 3)}       .4

[And/xor tree: a ∨ root with edge probabilities .3, .3, .4 to three ∧ nodes, whose leaves are the tuples of pw1, pw2, and pw3, respectively.]

Figure 2: Example of a highly correlated probabilistic database with 3 possible worlds, and the and/xor tree that captures the correlation.
The inherent uncertainty in the monitoring infrastructure is captured using an and/xor tree, that encodes the tuple existence probabilities as well as the correlations between the tuples. For example, the leftmost ∨ node indicates t 1 is present with probability . 4 and the second ∨ node dictates that exactly one of t 2 and t 3 should appear . The topmost ∧ node tells us the random sets deri ved from these ∨ nodes coexist. W e note that and/xor trees are able to represent any finite set of possible worlds. This can be done by listing all possible worlds, creating one ∧ node for each world, and using a ∨ node as the root to capture that these worlds are mutual e xclusiv e. Figure 2 sho ws an example of this. Probabilistic and/xor trees significantly generalize x-tuples [48, 54], block-independent disjoint tuples model, and p -or-sets [15], and as discussed above, can represent a finite set of arbitrary possible worlds. The correlations captured by such a tree can be represented by probabilistic c-tables [24] and provenance semirings [23]. Howe ver , that does not directly imply an ef ficient algorithm for ranking. W e remark that Marko v or Bayesian network models are able to capture more general correlations in a compact way [49], ho wev er, the structure of the model is more complex and probability computations on them (inference) is typically e xponential in the treewidth of the model. The treewidth of an and/xor tree (viewing it as a Marko v network) is not bounded, and hence the techniques dev eloped for those models can not be used to obtain polynomial time algorithms for and/xor trees. And/xor trees also exhibit superficial similarities to ws-trees [39], which can also capture mutual exclusi vity and coexistence between tuples. W e note that no prior work on ranking in pr obabilistic databases has consider ed more comple x correlations than x-tuples. 
3.2 Ranking over Probabilistic Data: Definitions and Prior Work

The interplay between probabilities and scores complicates the semantics of ranking in probabilistic databases. This was observed by Soliman et al. [50], who first considered this problem and presented two definitions of top-k queries in probabilistic databases. Several other definitions of ranking have been proposed since then. We briefly review the ranking functions we consider in this work.

– Uncertain Top-k (U-Top) [50]: Here the query returns the k-tuple set that appears as the top-k answer in the most possible worlds (weighted by the probabilities of the worlds).

– Uncertain Rank-k (U-Rank) [50]: At each rank i, we return the tuple with the maximum probability of being at the i'th rank across all possible worlds. In other words, U-Rank returns: {t*_i, i = 1, 2, ..., k}, where t*_i = argmax_t (Pr(r(t) = i)). Note that, under these semantics, the same tuple may be ranked at multiple positions. In our experiments, we use a slightly modified version that enforces distinct tuples in the answer (by not choosing a tuple at a position if it is already chosen at a higher position).

– Probabilistic Threshold Top-k (PT(h)) [28]²: The original definition of a probabilistic threshold query asks for all tuples whose probability of being in the top-h answer is larger than a pre-specified threshold, i.e., all tuples t such that Pr(r(t) ≤ h) > threshold. For consistency with the other ranking functions, we slightly modify the definition and instead ask for the k tuples with the largest Pr(r(t) ≤ h) values.

– Expected Ranks (E-Rank) [13]: The tuples are ranked in increasing order of the expected value of their ranks across the possible worlds, i.e., by: Σ_{pw ∈ PW} Pr(pw) r_pw(t), where r_pw(t) is defined to be |pw| if t ∉ pw.
– Expected Score (E-Score): Another natural ranking function, also considered by [13], is simply to rank the tuples by their expected score, Pr(t) · score(t).

– k-selection Query [45]: A k-selection query returns the set of k tuples such that the expected score of the best available tuple across the possible worlds is maximized.

– Consensus Top-k (Con-Topk): This is a semantics for top-k queries developed under the framework of consensus answers in probabilistic databases [42]. We defer its definition to Section 6, where we discuss in detail its relationship with the PRF function proposed in this article.

Normalized Kendall Distance: To compare different ranking functions or criteria, we need a distance measure to evaluate the closeness of two top-k answers. We use the prevalent Kendall tau distance defined for comparing top-k answers for this purpose [20]. It is also called the Kemeny distance in the literature, and is considered to have many advantages over other distance metrics [19]. Let R_1 and R_2 denote two full ranked lists, and let K_1 and K_2 denote the top-k ranked tuples in R_1 and R_2, respectively. Then the Kendall tau distance between K_1 and K_2 is defined to be: dis(K_1, K_2) = Σ_{(i,j) ∈ P(K_1,K_2)} K̂(i, j), where P(K_1, K_2) is the set of all unordered pairs of K_1 ∪ K_2; K̂(i, j) = 1 if it can be inferred from K_1 and K_2 that i and j appear in opposite orders in the two full ranked lists R_1 and R_2, and K̂(i, j) = 0 otherwise. Intuitively, the Kendall distance measures the number of inversions or flips between the two rankings. For ease of comparison, we divide the Kendall distance by k² to obtain the normalized Kendall distance, which always lies in [0, 1]. A higher value of the Kendall distance indicates a larger disagreement between the two top-k lists.
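The definition above can be implemented directly: a pair is charged only when the opposite order in R_1 and R_2 can be inferred from the two top-k lists alone, using the fact that a tuple absent from a top-k list must rank below every tuple in that list. A sketch of that reading (the case analysis is our interpretation of the "can be inferred" condition):

```python
def kendall_topk(K1, K2):
    """Normalized Kendall tau distance between two top-k lists: count the
    unordered pairs of K1 ∪ K2 whose relative order in the two underlying
    full rankings is inferably opposite, then divide by k^2."""
    k = len(K1)
    pos1 = {t: r for r, t in enumerate(K1)}
    pos2 = {t: r for r, t in enumerate(K2)}
    items = list(set(K1) | set(K2))
    flips = 0
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            i, j = items[a], items[b]
            if i in pos1 and j in pos1 and i in pos2 and j in pos2:
                # both lists rank both tuples: a flip iff the orders disagree
                flips += (pos1[i] < pos1[j]) != (pos2[i] < pos2[j])
            elif i in pos1 and j in pos1 and (i in pos2 or j in pos2):
                # the tuple missing from K2 ranks below all of K2 in R2
                kept = i if i in pos2 else j
                gone = j if kept == i else i
                flips += pos1[gone] < pos1[kept]
            elif i in pos2 and j in pos2 and (i in pos1 or j in pos1):
                kept = i if i in pos1 else j
                gone = j if kept == i else i
                flips += pos2[gone] < pos2[kept]
            elif (i in pos1) != (j in pos1):
                # each list contains exactly one of the pair: always opposite
                flips += 1
            # remaining case (pair confined to one list): not inferable
    return flips / k ** 2
```

For example, dis([a, b, c], [b, a, d]) = 2/9: the pair (a, b) is swapped, and c (present only in the first list) must rank above d in R_1 but below it in R_2.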
It is easy to see that if the Kendall distance between two top-k answers is δ, then the two answers must share at least a 1 − √δ fraction of their tuples: if each list contains m tuples that are missing from the other, then the m² cross pairs are all inferable inversions, so δ ≥ m²/k² and hence m ≤ √δ · k. So if the distance is 0.09, the top-k answers share at least 70%, and typically 90% or more, of their tuples. The distance is 0 if the two top-k answers are identical, and 1 if they are disjoint.

Comparing Ranking Functions: We compared the top-100 answers returned by five of the ranking functions with each other using the normalized Kendall distance, for two datasets with 100,000 independent tuples each (see Section 8 for a description of the datasets). Table 1 shows the results of this experiment. As we can see, the five ranking functions return wildly different top-k answers for the two datasets, with no obvious trends. For the first dataset, E-Rank behaves very differently from all the other functions, whereas for the second dataset, E-Rank happens to be quite close to E-Score. However, both of them deviate largely from U-Top, PT(h), and U-Rank. The behavior of E-Score is very sensitive to the dataset, especially the score distribution: it is close to PT(h) and U-Rank for the first dataset, but far away from all of them in

² This is quite similar to the Global Top-k semantics [55].
           E-Score  PT(100)  U-Rank   E-Rank   U-Top
E-Score      –      0.1241   0.3027   0.7992   0.2760
PT(100)    0.1241     –      0.3324   0.9290   0.3674
U-Rank     0.3027   0.3324     –      0.9293   0.2046
E-Rank     0.7992   0.9290   0.9293     –      0.9456
U-Top      0.2760   0.3674   0.2046   0.9456     –
IIP-100,000 (k = 100)

           E-Score  PT(100)  U-Rank   E-Rank   U-Top
E-Score      –      0.8642   0.8902   0.0044   0.9258
PT(100)    0.8642     –      0.3950   0.8647   0.5791
U-Rank     0.8902   0.3950     –      0.8907   0.3160
E-Rank     0.0044   0.8647   0.8907     –      0.9263
U-Top      0.9258   0.5791   0.3160   0.9263     –
Syn-IND Dataset with 100,000 tuples (k = 100)

Table 1: Normalized Kendall distance between the top-k answers according to various ranking functions, for two datasets.

the second dataset (by looking into the results, it shares fewer than 15 tuples with the top-100 answers of the others). We observed similar behavior for other datasets, and for datasets with correlations.

This simple experiment illustrates the issues with ranking in probabilistic databases: although several of these definitions seem natural, the wildly different answers they return indicate that none of them can be the "right" definition. We also observe that in large datasets, E-Rank tends to give very high priority to a tuple with a high probability even if it has a low score. In our synthetic dataset Syn-IND-100,000, with expected size ≈ 50000, t_2 (the tuple with the 2nd highest score) has probability approximately 0.98, and t_1000 (the tuple with the 1000th highest score) has probability 0.99. The expected ranks of t_2 and t_1000 are approximately 10000 and 6000, respectively, and hence t_1000 is ranked above t_2, even though t_1000 is only slightly more probable. As mentioned above, the original U-Rank function may return the same tuple at different ranks (also observed by the authors [50]), which is usually undesirable. This problem becomes even more severe when the dataset and k are both large. For example, in RD-100,000, the same tuple is ranked at positions 67895 to 100000.
In the table, we show a slightly modified version of U-Rank that enforces distinct tuples in the answer.

3.3 Parameterized Ranking Functions

Ranking in uncertain databases is inherently a multi-criteria optimization problem, and it is not always clear how to rank two tuples that dominate each other along different axes. Consider a database with two tuples, t_1 (score = 100, Pr(t_1) = 0.5) and t_2 (score = 50, Pr(t_2) = 1.0). Even in this simple case, it is not clear whether to rank t_1 above t_2 or vice versa. This is an instance of the classic risk-reward trade-off, and the choice between these two options largely depends on the application domain and/or user preferences. We propose to follow the traditional approach to dealing with such trade-offs, by identifying a set of features, by defining a parameterized ranking function over these features, and by learning the parameters (weights) themselves using user preferences [9, 16, 27, 34]. To achieve this, we propose a family of ranking functions, parameterized by one or more parameters, and design algorithms to efficiently find the top-k answer according to any ranking function from these families. Our general ranking function, PRF, directly subsumes some of the previously proposed ranking functions, and can also be used to approximate other ranking functions. Moreover, the parameters can be learned from user preferences, which allows us to adapt to different scenarios and different application domains.

Pr(r(t_i) = j)   Positional probability of t_i being ranked at position j
Pr(r(t_i))       Rank distribution of t_i
PRF              Parameterized ranking function: Υ_ω(t) = Σ_{i>0} ω(t, i) Pr(r(t) = i)
PRF^ω(h)         Special case of PRF: ω(t, i) = w_i, with w_i = 0, ∀ i > h
PRF^e(α)         Special case of PRF^ω: w_i = α^i, α ∈ C
PRF^ℓ            Special case of PRF^ω: w_i = −i
δ(p)             Delta function: δ(p) = 1 if p is true, δ(p) = 0 otherwise
Table 2: Notation

Features: Although it is tempting to use the tuple probability and the tuple score as the features, a ranking function based on just those two would be highly sensitive to the actual values of the scores; further, such a ranking function would be insensitive to the correlations in the database, and hence could not capture the rich interactions between ranking and possible worlds. Instead, we propose to use the positional probabilities as the features: for each tuple t, we have n features, Pr(r(t) = i), i = 1, ..., n, where n is the number of tuples in the database. This set of features succinctly captures the possible worlds. Further, correlations among tuples, if any, are naturally accounted for when computing the features. We note that in most cases we do not explicitly compute all the features, and instead design algorithms that directly compute the value of the overall ranking function.

Ranking Functions: Next we define a general ranking function that allows exploring the trade-offs discussed above.

Definition 3 Let ω : T × N → C be a weight function that maps a tuple-rank pair to a complex number. The parameterized ranking function (PRF), Υ_ω : T → C, in its most general form is defined to be:

    Υ_ω(t) = Σ_{pw : t ∈ pw} ω(t, r_pw(t)) · Pr(pw)
           = Σ_{pw : t ∈ pw} Σ_{i>0} ω(t, i) Pr(pw ∧ r_pw(t) = i)
           = Σ_{i>0} ω(t, i) · Pr(r(t) = i).

A top-k query returns the k tuples with the highest |Υ_ω| values.

In most cases, ω is a real positive function, and we simply need to find the k tuples with the highest Υ_ω values. However, we allow ω to be a complex function in order to approximate other functions efficiently (see Section 5.1). Depending on the actual function ω, we get different ranking functions with diverse behaviors. Before discussing the relationship to prior ranking functions, we define two special cases.
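To make Definition 3 concrete, the following is a minimal Python sketch (over a hypothetical three-tuple independent relation; the ids and values are illustrative) that computes the positional-probability features by brute-force enumeration of possible worlds and then evaluates Υ_ω from them. With ω(t, i) = 1, it recovers ranking by tuple probability, since Σ_i Pr(r(t) = i) = Pr(t).

```python
from itertools import product

# Hypothetical tuple-independent relation, sorted by decreasing score:
# (id, score, existence probability)
tuples = [("t1", 100, 0.5), ("t2", 80, 0.6), ("t3", 50, 0.4)]

def positional_probs(tuples):
    """The features Pr(r(t) = j), computed by enumerating all 2^n possible worlds."""
    n = len(tuples)
    pr = {tid: [0.0] * (n + 1) for tid, _, _ in tuples}   # pr[t][j], ranks 1..n
    for world in product([0, 1], repeat=n):
        w_prob = 1.0
        for present, (_, _, p) in zip(world, tuples):
            w_prob *= p if present else 1 - p
        rank = 0
        for present, (tid, _, _) in zip(world, tuples):
            if present:                  # input is score-sorted, so rank = position
                rank += 1
                pr[tid][rank] += w_prob
    return pr

def prf(tuples, omega):
    """Upsilon_omega(t) = sum_{i>0} omega(t, i) * Pr(r(t) = i)   (Definition 3)."""
    pr = positional_probs(tuples)
    n = len(tuples)
    return {tid: sum(omega(tid, j) * pr[tid][j] for j in range(1, n + 1))
            for tid, _, _ in tuples}

# omega(t, i) = 1 recovers ranking by probability: Upsilon(t) = Pr(t).
by_prob = prf(tuples, lambda t, i: 1.0)
```

This exponential-time enumeration is only a specification of the semantics; the algorithms of Section 4 compute the same quantities without materializing possible worlds.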
PRF^ω(h): One important class of ranking functions arises when ω(t, i) = w_i (i.e., independent of t) and w_i = 0 for all i > h, for some positive integer h (typically h ≪ n). This is one of the prevalent classes of ranking functions used in domains such as information retrieval and machine learning, with the weights typically learned from user preferences [9, 16, 27, 34]. Also, the weight function ω(i) = ln 2 / ln(i + 1) (called the discount factor) is often used in the context of ranking documents in information retrieval [30].

PRF^e(α): This is a special case of PRF^ω(h) where w_i = ω(i) = α^i, with α a constant that may be a real or a complex number. Here h = n (no weights are 0). Typically we expect |α| ≤ 1; otherwise we get the counterintuitive behavior that tuples with lower scores are preferred.

PRF^ω and PRF^e form the two parameterized ranking functions that we propose in this work. Although PRF^ω is the more natural ranking function and has been used elsewhere, PRF^e is more suitable for ranking in probabilistic databases for several reasons. First, the features as we have defined them above are not completely arbitrary: the features Pr(r(t) = i) for small i are clearly more important than those for large i. Hence in most cases we would like the weight function ω(i) to be monotonically non-increasing, and PRF^e naturally captures this behavior (as long as |α| ≤ 1). More importantly, we can compute the PRF^e function in O(n log n) time (O(n) time if the dataset is pre-sorted by score), even for datasets with low degrees of correlation (i.e., modeled by and/xor trees with low heights). This makes it significantly more attractive for ranking over large datasets. Furthermore, ranking by PRF^e(α), with suitably chosen α, can approximate rankings by many other functions reasonably well, even with only real α.
Finally, a linear combination of exponential functions with complex bases is known to be very expressive in representing other functions [7]. We make use of this fact to approximate many ranking functions by linear combinations of a small number of PRF^e functions, thus significantly speeding up the running time (Section 5.1).

Relationship to other ranking functions: We illustrate some of the choices of weight function, and relate them to prior ranking functions. (The definition of U-Top introduced in [50] requires the retrieved k tuples to belong to a valid possible world; this is not required in our definitions, and hence U-Top cannot be simulated using PRF.) We omit the subscript ω when the context is clear. Let δ(p) denote a delta function, where p is a boolean predicate: δ(p) = 1 if p = true, and δ(p) = 0 otherwise.

– Ranking by probabilities: If ω(t, i) = 1, the result is the set of k tuples with the highest probabilities [46].

– Expected Score: By setting ω(t, i) = score(t), we get E-Score:

    Υ(t) = Σ_{pw : t ∈ pw} score(t) Pr(pw) = score(t) Pr(t) = E[score(t)]

– Probabilistic Threshold Top-k (PT(h)): If we choose ω(i) = δ(i ≤ h), i.e., ω(i) = 1 for i ≤ h and ω(i) = 0 otherwise, then we obtain exactly the answer for PT(h).

– Uncertain Rank-k (U-Rank): Let ω_j(i) = δ(i = j), for some 1 ≤ j ≤ k. The tuple with the largest Υ_{ω_j} value is the rank-j answer to the U-Rank query [50]. This allows us to compute the U-Rank answer by evaluating Υ_{ω_j}(t) for all t ∈ T and j = 1, ..., k.

– Expected ranks (E-Rank): Let PRF^ℓ (PRF-linear) be another special case of the PRF^ω function, with w_i = ω(i) = −i. The PRF^ℓ function bears a close similarity to the notion of expected ranks. Recall that the expected rank of a tuple t is defined to be:

    E[r_pw(t)] = Σ_{pw ∈ PW} Pr(pw) r_pw(t),    where r_pw(t) = |pw| if t ∉ pw.

Let C denote the expected size of a possible world; by linearity of expectation, C = Σ_{i=1}^n p_i.
Then the expected rank of t can be seen to consist of two parts:

(1) The contribution of the possible worlds where t exists:

    er_1(t) = Σ_{i>0} i × Pr(r(t) = i) = −Υ(t)

where Υ(t) is the PRF^ℓ value of tuple t. (Note that the expected-rank approach picks the k tuples with the lowest expected rank, while our approach picks the tuples with the highest PRF values; hence the negation.)

(2) The contribution of the worlds where t does not exist:

    er_2(t) = Σ_{pw : t ∉ pw} Pr(pw) |pw| = (1 − p(t)) ( Σ_{t_i ≠ t} Pr(t_i | t does not exist) )

If the tuples are independent of each other, then we have:

    Σ_{t_i ≠ t} Pr(t_i | t does not exist) = C − p(t)

Thus, the expected ranks can be computed in the same time as PRF^ℓ on tuple-independent datasets. This term can also be computed efficiently in many other cases, including datasets where only mutual-exclusion correlations are permitted. If the correlations are represented using a probabilistic and/xor tree (see Section 4.2) or a low-treewidth graphical model (see Section 9), then we can compute this term efficiently as well, thus generalizing the prior algorithms for computing expected ranks.

– k-selection Query [45]: It is easy to see that a k-selection query is equivalent to setting ω(t, i) = δ(i = 1) score(t).

As we can see, many different ranking functions can be seen as special cases of the general PRF ranking function, supporting our claim that PRF can effectively unify these different approaches to ranking uncertain datasets.

4 Ranking Algorithms

We next present an algorithm for efficiently ranking according to a PRF function. We first present the basic idea behind our algorithm assuming mutual independence, and then consider correlated tuples with correlations represented using an and/xor tree.
We then present a very efficient algorithm for ranking using a PRF^e function, and then briefly discuss how to handle attribute uncertainty.

4.1 Assuming Tuple Independence

First we show how the PRF function can be computed in O(n²) time for a general weight function ω and a given set of tuples T = {t_1, ..., t_n}. In all our algorithms, we assume that ω(t, i) can be computed in O(1) time. Clearly, it is sufficient to compute Pr(r(t) = j) for every tuple t and 1 ≤ j ≤ n in O(n²) time: given these values, we can directly compute the values of Υ(t) in O(n²) time. (Later, we present several algorithms which run in O(n) or O(n log n) time by combining these two steps for some special ω functions.)

We first sort the tuples in non-increasing order of their scores (which are assumed to be deterministic); let t_1, ..., t_n denote this sorted order. Suppose now we want to compute Pr(r(t_i) = j). Let T_i = {t_1, t_2, ..., t_i}, and let σ_i be an indicator variable that takes value 1 if t_i is present in a possible world and 0 otherwise. Further, let σ = ⟨σ_1, ..., σ_n⟩ denote a vector containing all the indicator variables. Then we can write Pr(r(t_i) = j) as follows:

    Pr(r(t_i) = j) = Pr(t_i) Σ_{pw : |pw ∩ T_{i−1}| = j−1} Pr(pw)
                   = Pr(t_i) Σ_{σ : Σ_{l=1}^{i−1} σ_l = j−1} Π_{l<i} Pr(σ_l)

This is exactly the coefficient of x^j in the generating function

    F_i(x) = Pr(t_i) · x · Π_{l=1}^{i−1} (1 − Pr(t_l) + Pr(t_l) x).

Hence, we can expand F_i to compute the coefficients in O(i²) time. This allows us to compute Pr(r(t_i) = j) for t_i in O(i²) time; Υ(t_i), in turn, can be written as:

    Υ(t_i) = Σ_j ω(t_i, j) · Pr(r(t_i) = j) = Σ_j ω(t_i, j) c_j    (1)

which can be computed in O(i²) time.

Example 1 Consider a relation with 3 independent tuples t_1, t_2, t_3 (already sorted according to the score function) with existence probabilities 0.5, 0.6, 0.4, respectively. The generating function for t_3 is:

    F_3(x) = (.5 + .5x)(.4 + .6x)(.4x) = .12x³ + .2x² + .08x

This gives us: Pr(r(t_3) = 1) = .08, Pr(r(t_3) = 2) = .2, Pr(r(t_3) = 3) = .12.

[Figure 3: PRF computation on and/xor trees. (i) The left tree corresponds to the database in Figure 2; the generating function obtained by assigning the same variable x to all leaves gives the distribution over the sizes of the possible worlds. (ii) The right tree illustrates the construction of the generating function for computing Pr(r(t_4) = 3) for the and/xor tree in Figure 1.]

Algorithm 1: IND-PRF-RANK(D_T)
   F_0(x) = 1
   for i = 1 to n do
      F_i(x) = (Pr(t_i) / Pr(t_{i−1})) · F_{i−1}(x) · (1 − Pr(t_{i−1}) + Pr(t_{i−1}) x)
      Expand F_i(x) in the form Σ_j c_j x^j
      Υ(t_i) = Σ_{j=1}^n ω(t_i, j) c_j
   return the k tuples with the largest |Υ| values
(For i = 1, the recurrence is simply F_1(x) = Pr(t_1) · x.)

If we expand each F_i for 1 ≤ i ≤ n from scratch, we need O(n²) time for each F_i and O(n³) time in total.
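A minimal Python sketch of Algorithm 1 (the tuple ids and probabilities below are illustrative): it maintains the coefficient list of F_i, updating it incrementally from F_{i−1}, and also includes the Section 4.3 shortcut that evaluates F_i(α) numerically for PRF^e instead of expanding the polynomial.

```python
def ind_prf_rank(tuples, omega, k):
    """Sketch of Algorithm 1. `tuples` is a list of (id, prob) pairs, already
    sorted by non-increasing (deterministic) score; omega(tid, j) is the weight
    function. Returns (top-k ids by |Upsilon|, all Upsilon values)."""
    coeffs = [1.0]            # F_0(x) = 1; coeffs[j] = coefficient of x^j
    prev_p, ups = None, {}
    for tid, p in tuples:
        if prev_p is None:
            coeffs = [0.0, p]                       # F_1(x) = Pr(t_1) * x
        else:
            # F_i = (p_i / p_{i-1}) * F_{i-1} * (1 - p_{i-1} + p_{i-1} x)
            scaled = [c * p / prev_p for c in coeffs]
            nxt = [0.0] * (len(scaled) + 1)
            for j, c in enumerate(scaled):
                nxt[j] += c * (1 - prev_p)          # multiply by (1 - p_{i-1})
                nxt[j + 1] += c * prev_p            # multiply by p_{i-1} * x
            coeffs = nxt
        # coefficient of x^j in F_i is Pr(r(t_i) = j)
        ups[tid] = sum(omega(tid, j) * c for j, c in enumerate(coeffs) if j >= 1)
        prev_p = p
    order = sorted(ups, key=lambda t: abs(ups[t]), reverse=True)
    return order[:k], ups

def prf_e_values(tuples, alpha):
    """Section 4.3 shortcut: Upsilon(t_i) = F_i(alpha); no expansion needed."""
    vals, f, prev_p = {}, None, None
    for tid, p in tuples:
        f = p * alpha if f is None else (p / prev_p) * f * (1 - prev_p + prev_p * alpha)
        vals[tid] = f
        prev_p = p
    return vals
```

On the relation of Example 1 (probabilities 0.5, 0.6, 0.4), `prf_e_values` with α = 0.6 returns 0.14592 for t_3, matching Example 5 in Section 4.3.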
However, the expansion of F_i can be obtained from the expansion of F_{i−1} in O(i) time by observing that:

    F_i(x) = (Pr(t_i) / Pr(t_{i−1})) · F_{i−1}(x) · (1 − Pr(t_{i−1}) + Pr(t_{i−1}) x)    (2)

This trick gives us an O(n²) time complexity for computing the values of the ranking function for all tuples. See Algorithm 1 for the pseudocode. Note that O(n²) time is asymptotically optimal in general, since the computation involves at least O(n²) probabilities, namely Pr(r(t_i) = j) for all 1 ≤ i, j ≤ n.

For some specific ω functions, we may be able to achieve a faster running time. For PRF^ω(h) functions, we only need to expand each F_i up to the x^h term, since ω(i) = 0 for i > h. Then the expansion from F_{i−1}(x) to F_i(x) takes only O(h) time. This yields an O(n·h + n log n) time algorithm. We note that the above technique also gives an O(nk + n log n) time algorithm for answering the U-Rank top-k query (all the needed probabilities can be computed in that time), thus matching the best known upper bound by Yi et al. [54] (the original algorithm in [50] runs in O(n²k) time). We remark that the generating function technique can be seen as a variant of dynamic programming in some sense; however, using it explicitly in place of an obscure recursion formula gives us a much cleaner view and allows us to generalize it to handle more complicated tuple correlations. It also leads to an algorithm for extremely efficient evaluation of PRF^e functions (Section 4.3).

4.2 Probabilistic And/Xor Trees

Next we generalize our algorithm to handle a correlated database where the correlations can be captured using an and/xor tree. In fact, many types of probability computations on and/xor trees can be done efficiently and elegantly using generating functions. Here we first provide a general result and then specialize it for PRF computation.

As before, let T = {t_1, t_2, ..., t_n} denote the tuples sorted in non-increasing order of their score, and let T_i = {t_1, t_2, ..., t_i}. Let T denote the and/xor tree. Suppose X = {x_1, x_2, ...} is a set of variables. Define a mapping π which associates each leaf l ∈ T with a variable π(l) ∈ X. Let T_v denote the subtree rooted at v, and let v_1, ..., v_h be v's children. For each node v ∈ T, we define a generating function F_v(X) = F_v(x_1, x_2, ...) recursively:

• If v is a leaf, F_v(X) = π(v).
• If v is a ∨ node, F_v(X) = (1 − Σ_{l=1}^h p_{(v,v_l)}) + Σ_{l=1}^h p_{(v,v_l)} F_{v_l}(X).
• If v is a ∧ node, F_v(X) = Π_{l=1}^h F_{v_l}(X).

The generating function F(X) for the tree T is the one defined above for the root. It is easy to see that, if we have a constant number of variables, the polynomial can be expanded in the form Σ_{i_1,i_2,...} c_{i_1,i_2,...} x_1^{i_1} x_2^{i_2} ... in polynomial time. Now recall that each possible world pw contains a subset of the leaves of T (as dictated by the ∨ and ∧ nodes). The following theorem characterizes the relationship between the coefficients of F and the probabilities we are interested in.

Theorem 1 The coefficient of the term Π_j x_j^{i_j} in F(X) is the total probability of the possible worlds that contain, for every j, exactly i_j leaves associated with variable x_j.

See Appendix A for the proof. We first provide two simple examples showing how to use Theorem 1 to compute the probabilities of two events related to the size of the possible world, and then show how to use the same idea to compute Pr(r(t) = i).

Example 2 If we associate all leaves with the same variable x, the coefficient of x^i is equal to Pr(|pw| = i). This can be used to obtain a distribution over the possible world sizes (Figure 3(i)).
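As an illustration of Theorem 1 and Example 2, here is a self-contained Python sketch that evaluates the generating function of a small, hypothetical and/xor tree (two mutually exclusive alternatives plus one independent tuple, i.e., an ∧ root over two ∨ nodes), mapping every leaf to the same variable x; the resulting coefficient list is the distribution over possible-world sizes.

```python
# Node encodings: ("and", [children]), ("xor", [(edge_prob, child), ...]),
# or ("leaf", tag). Polynomials in one variable x are coefficient lists.

def poly_mul(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_add(a, b):
    out = [0.0] * max(len(a), len(b))
    for i, c in enumerate(a):
        out[i] += c
    for i, c in enumerate(b):
        out[i] += c
    return out

def gene(node):
    """Generating function of an and/xor tree (Section 4.2), every leaf -> x."""
    kind = node[0]
    if kind == "leaf":
        return [0.0, 1.0]                          # pi(leaf) = x
    if kind == "and":                              # product over children
        f = [1.0]
        for child in node[1]:
            f = poly_mul(f, gene(child))
        return f
    # xor node: (1 - sum of edge probs) + sum of p * F_child
    f = [1.0 - sum(p for p, _ in node[1])]
    for p, child in node[1]:
        f = poly_add(f, [p * c for c in gene(child)])
    return f

# Hypothetical database: alternatives a1/a2 are mutually exclusive
# (probs 0.3/0.5), tuple b1 (prob 0.6) is independent of them.
tree = ("and", [
    ("xor", [(0.3, ("leaf", "a1")), (0.5, ("leaf", "a2"))]),
    ("xor", [(0.6, ("leaf", "b1"))]),
])
size_dist = gene(tree)   # size_dist[j] = Pr(|pw| = j), as in Example 2
```

For this tree, the coefficients come out to Pr(|pw| = 0) = 0.08, Pr(|pw| = 1) = 0.44, Pr(|pw| = 2) = 0.48, which sums to 1 as expected.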
Example 3 If we associate a subset S of the leaves with variable x and the remaining leaves with the constant 1, the coefficient of x^i is equal to Pr(|pw ∩ S| = i).

Algorithm 2: ANDXOR-PRF-RANK(T)
   π(t_i) ← 1 for all i    {π(t_i) is the variable associated with leaf t_i}
   for i = 1 to n do
      if i ≠ 1 then π(t_{i−1}) ← x
      π(t_i) ← y
      F_i(x, y) = GENE(T_i, π)
      Expand F_i(x, y) in the form Σ_j c'_j x^j + (Σ_j c_j x^{j−1}) y
      Υ(t_i) = Σ_{j=1}^n ω(t_i, j) c_j
   return the k tuples with the largest |Υ| values

Subroutine GENE(T, π):
   r is the root of tree T
   if T is a singleton node then return π(r)
   else, letting T_i denote the subtree rooted at r_i for each r_i ∈ Ch(r), and p = Σ_{r_i ∈ Ch(r)} p_{(r,r_i)}:
      if r is a ∨ node then return (1 − p) + Σ_{r_i ∈ Ch(r)} p_{(r,r_i)} · GENE(T_i, π)
      if r is a ∧ node then return Π_{r_i ∈ Ch(r)} GENE(T_i, π)

Next we show how to compute Pr(r(t_i) = j) (i.e., the probability that t_i is ranked at position j). Let s denote the score of the tuple. In the and/xor tree T, we associate all leaves with score value larger than s with the variable x, the leaf (t_i, s) with the variable y, and the rest of the leaves with the constant 1. Let the resulting generating function be F_i. By Theorem 1, the coefficient of x^{j−1} y in F_i is exactly Pr(r(t_i) = j). See Algorithm 2 for the pseudocode.

Example 4 Consider the database in Figure 1. Suppose we want to compute Pr(r(t_4) = 3). We associate the variable x with t_1, t_2, t_5 and t_6, since their scores are larger than t_4's score. We also associate y with t_4 itself, and 1 with t_3, whose score is less than t_4's. The generating function for the right-hand tree in Figure 3 is

    (.6 + .4x)(.3 + .7x)(.6x + .4y)x = .168x⁴ + .112x³y + .324x³ + .216x²y + .108x² + .072xy.

So Pr(r(t_4) = 3) is the coefficient of x²y, which is 0.216.
From Figure 1, we can also verify that Pr(r(t_4) = 3) = Pr(pw_3) + Pr(pw_5) = .048 + .168 = .216.

If we expand F_v^i for each internal node v in a naive way (i.e., performing the polynomial multiplications one by one), the running time is O(n²) at each internal node, O(n³) for each tree F_i, and thus O(n⁴) overall. If we use divide-and-conquer at each internal node together with FFT-based (Fast Fourier Transform) polynomial multiplication, the running time for each F_i can be improved to O(n² log²(n)). See Appendix B.1 for the details. In fact, we can further improve the running time to O(n²) for each F_i and O(n³) overall; we outline two such algorithms in Appendix B.2.

4.3 Computing a PRF^e Function

Next we present an O(n log n) algorithm to evaluate a PRF^e function (the algorithm runs in linear time if the dataset is pre-sorted by score). If ω(i) = α^i, then we observe that:

    Υ(t_i) = Σ_{j=1}^n Pr(r(t_i) = j) α^j = F_i(α)    (3)

This surprisingly simple relationship suggests that we do not have to expand the polynomial F_i(x) at all; instead, we can evaluate the numerical value of F_i(α) directly. Again, we note that the value F_i(α) can be computed from the value of F_{i−1}(α) in O(1) time using Equation (2). Thus we have an O(n) time algorithm to compute Υ(t_i) for all 1 ≤ i ≤ n if the tuples are pre-sorted.

Example 5 Consider Example 1 and the PRF^e function for t_3. We choose ω(i) = .6^i. Then we can see that F_3(x) = (.5 + .5x)(.4 + .6x)(.4x). So Υ(t_3) = F_3(.6) = (.5 + .5 × .6)(.4 + .6 × .6)(.4 × .6) = .14592.

We can use a similar idea to speed up the computation when the tuples are correlated and the correlations are represented using an and/xor tree. Let T_i be the and/xor tree where π(t_j) = x for 1 ≤ j < i, π(t_i) = y, and π(t_j) = 1 for j > i.
Suppose the generating function for T_i is F_i(x, y) = Σ_j c'_j x^j + (Σ_j c_j x^{j−1}) y, so that Υ(t_i) = Σ_{j=1}^n α^j c_j. We observe an intriguing relationship between the PRF^e value and the generating function:

    Υ(t_i) = Σ_j c_j α^j = Σ_j c'_j α^j + (Σ_j c_j α^{j−1}) α − Σ_j c'_j α^j = F_i(α, α) − F_i(α, 0).

Given this, Υ(t_i) can be computed in linear time by bottom-up evaluation of F_i(α, α) and F_i(α, 0) in T_i. Simply repeating this n times, once for each t_i, gives an O(n²) total running time. By carefully sharing the intermediate results among the computations of the Υ(t_i), we can improve the running time to O(n log n + nd), where d is the height of the and/xor tree.

The improved algorithm runs in iterations. Suppose the tuples are already pre-sorted by their scores. Initially, the label π(t_i) of every leaf is 1. In iteration i, we change the label of leaf t_{i−1} from y to x and the label of t_i from 1 to y. The algorithm maintains, at each inner node v, the numerical values of F_v^i(α, α) and F_v^i(α, 0). The values at a node v need to be updated only when the value of one of its children changes. Therefore, in each iteration, the computation happens only along two paths: one from t_{i−1} to the root and one from t_i to the root. Since we update at most O(d) nodes in each iteration, the running time is O(nd).

Suppose we want to update the information on the path from t_{i−1} to the root. We first update the F^i values for the leaf t_{i−1}: since F^i_{t_{i−1}} = π(t_{i−1}) = x, we have F^i_{t_{i−1}}(α, α) = α and F^i_{t_{i−1}}(α, 0) = α. Now suppose v's child u has just had its values changed. The updating rule for F_v^i(·,·) (both F_v^i(α, α) and F_v^i(α, 0)) at node v is as follows:

1. If v is a ∧ node, F_v^i(·,·) ← F_v^{i−1}(·,·) · F_u^i(·,·) / F_u^{i−1}(·,·).
2. If v is a ∨ node, F_v^i(·,·) ← F_v^{i−1}(·,·) + p_{(v,u)} F_u^i(·,·) − p_{(v,u)} F_u^{i−1}(·,·).

The values at the other nodes are not affected. The updating rule for the path from t_i to the root is the same, except that for the leaf t_i we have F^i_{t_i}(α, α) = α and F^i_{t_i}(α, 0) = 0, since F^i_{t_i}(x, y) = π(t_i) = y. See Algorithm 3 for the pseudocode. We note that, for the case of x-tuples, which can be represented using a two-level tree, this gives an O(n log n) algorithm for ranking according to PRF^e.

Algorithm 3: ANDXOR-PRF^e-RANK(T)
   F_{t_i}(α, α) = 1, F_{t_i}(α, 0) = 1, for all i
   for i = 1 to n do
      if i ≠ 1 then
         F_{t_{i−1}}(α, α) = α; F_{t_{i−1}}(α, 0) = α
         UPDATE(T, t_{i−1})
      F_{t_i}(α, α) = α; F_{t_i}(α, 0) = 0
      UPDATE(T, t_i)
      Υ(t_i) = F_r(α, α) − F_r(α, 0)
   return the k tuples with the largest |Υ| values

Subroutine UPDATE(T, v):
   while v is not the root do
      u ← v; v ← parent(v)
      if v is a ∧ node then F_v(·,·) ← F_v(·,·) · F_u^new(·,·) / F_u^old(·,·)
      if v is a ∨ node then F_v(·,·) ← F_v(·,·) + p_{(v,u)} (F_u^new(·,·) − F_u^old(·,·))
   (F_u^old and F_u^new denote the values at u before and after its update in the current iteration.)

4.4 Attribute Uncertainty or Uncertain Scores

We briefly describe how we can rank over tuples with discrete attribute uncertainty, where the uncertain attributes are part of the tuple scoring function (if the uncertain attributes do not affect the tuple score, they can be ignored for ranking purposes). More generally, this approach can handle the case where there is a discrete probability distribution over the score of a tuple. Assume Σ_j p_{i,j} ≤ 1 for all i.
The score score_i of tuple t_i takes value v_{i,j} with probability p_{i,j}, and t_i does not appear in the database with probability 1 − Σ_j p_{i,j}. It is easy to see that the PRF value of t_i is:

    Υ(t_i) = Σ_{k>0} ω(t_i, k) Pr(r(t_i) = k)
           = Σ_{k>0} ω(t_i, k) Σ_j Pr(r(t_i) = k ∧ score_i = v_{i,j})
           = Σ_j Σ_{k>0} ω(t_i, k) Pr(r(t_i) = k ∧ score_i = v_{i,j})

The algorithm works by treating the alternatives of a tuple (with a separate alternative for each different possible score) as different tuples. In other words, we create a new tuple t_{i,j}, with existence probability p_{i,j}, for each value v_{i,j}. Then we add an xor constraint over the alternatives {t_{i,j}}_j of each tuple t_i. We can then use the algorithm for the probabilistic and/xor tree model to find the value of the PRF function for each t_{i,j} separately. Note that Pr(r(t_i) = k ∧ score_i = v_{i,j}) is exactly the probability that r(t_{i,j}) = k in the and/xor tree. Thus, by the above equation, we have Υ(t_{i,j}) = Σ_{k>0} ω(t_i, k) Pr(r(t_i) = k ∧ score_i = v_{i,j}) and Υ(t_i) = Σ_j Υ(t_{i,j}). Therefore, in a final step, we calculate the Υ score of each original tuple t_i by adding up the Υ scores of its alternatives {t_{i,j}}_j. If the original tuples were independent, the complexity of this algorithm is O(n²) for computing the PRF function and O(n log n) for computing the PRF^e function, where n is the size of the input, i.e., the total number of different possible scores.

Table 3: Summary of the running times (n = number of tuples; d_i = depth of tuple t_i in the and/xor tree):
– Independent tuples: PRF in O(n²); PRF^ω(h) in O(nh + n log n); PRF^e in O(n log n).
– And/xor tree of height d: PRF in O(n³) or O(n² log²(n) d); PRF^ω(h) in O(n³) or O(n² log²(n) d); PRF^e in O(nd + n log n).
– General and/xor tree: PRF in O(n³); PRF^ω(h) in O(n³); PRF^e in O(Σ_i d_i + n log n).
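Returning to the uncertain-score reduction of Section 4.4, the following is a brute-force Python sketch of the expansion step (the two-tuple relation is hypothetical): each (tuple, score) pair becomes an alternative, the alternatives of one tuple are mutually exclusive, and Υ(t_i) is accumulated over its alternatives. A real implementation would use the and/xor-tree algorithms instead of enumerating worlds.

```python
from itertools import product

# Hypothetical relation with uncertain scores: tuple -> [(score, prob), ...],
# with sum of probs <= 1; the remaining mass means the tuple is absent.
db = {
    "t1": [(100, 0.4), (70, 0.3)],   # t1 absent with probability 0.3
    "t2": [(90, 0.8)],               # t2 absent with probability 0.2
}

def prf_uncertain_scores(db, omega):
    """Expand each (tuple, score) pair into an alternative t_{i,j}; alternatives
    of one tuple are mutually exclusive (xor); Upsilon(t_i) = sum_j Upsilon(t_{i,j})."""
    names = sorted(db)
    ups = {t: 0.0 for t in names}
    # choice = index of the chosen alternative per tuple, or None (absent)
    options = [[None] + list(range(len(db[t]))) for t in names]
    for choice in product(*options):
        w_prob, present = 1.0, []
        for t, c in zip(names, choice):
            if c is None:
                w_prob *= 1.0 - sum(p for _, p in db[t])
            else:
                score, p = db[t][c]
                w_prob *= p
                present.append((score, t))
        present.sort(reverse=True)       # rank by score, descending (scores distinct here)
        for rank, (_, t) in enumerate(present, start=1):
            ups[t] += omega(t, rank) * w_prob
    return ups

# omega = 1 gives Upsilon(t) = Pr(t); omega = delta(rank = 1) gives
# the probability of being the top tuple.
totals = prf_uncertain_scores(db, lambda t, r: 1.0)
top1 = prf_uncertain_scores(db, lambda t, r: 1.0 if r == 1 else 0.0)
```

For this toy database, `totals` is {t1: 0.7, t2: 0.8} and `top1` is {t1: 0.46, t2: 0.48}; the latter two sum to 0.94 = 1 − 0.3 × 0.2, the probability that at least one tuple is present.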
4.5 Summary

We summarize the complexities of the algorithms for the different models in Table 3. We now explain the entries in the table that have not yet been discussed. The first is the PRF computation over an and/xor tree of height d. We have two choices here. One is simply to use the algorithm for arbitrary and/xor trees, i.e., the algorithm in Appendix B.2 that expands F_i(x, y) for each i in O(n²) time; the overall running time is then O(n³). The other is to use the divide-and-conquer algorithm in Appendix B.1 to expand the polynomial at each ∧ node of F_i(x, y). We can easily see that expanding the nodes at each level of the tree requires at most O(n log²(n)) time. Therefore, the running time for expanding F_i(x, y) is at most O(n log²(n) d), and the overall running time is O(n² log²(n) d), which is much better than O(n³) when d ≪ n. For PRF^ω(h) computation over and/xor trees, we do not know how to achieve a better bound, as we did for tuple-independent datasets; we leave this as an interesting open problem. For PRF^e computation on and/xor trees, we use ANDXOR-PRF^e-RANK. Now the procedure UPDATE(T, t_i) runs in O(d_i) time, where d_i is the depth of tuple t_i in the and/xor tree, i.e., the length of the path from the root to t_i. Therefore, the total running time is O(Σ_i d_i + n log n). If the height of the and/xor tree is bounded by d, the running time is simply O(nd + n log n).

5 Approximating and Learning Ranking Functions

In this section, we discuss how to choose the PRF functions and their parameters. Depending on the application domain and the scenario, there are two approaches:

• If we know the ranking function we would like to use (say PT(h)), we can either simulate or approximate it using appropriate PRF functions.
• If we are instead provided with user preference data, we can learn the parameters from it.
Clearly, we would prefer to use a PRF^e function whenever possible, since it admits highly efficient ranking algorithms. For this purpose, we begin by presenting an algorithm to approximate an arbitrary PRF^ω function by a linear combination of PRF^e functions. We then discuss how to learn a PRF^ω function from user preferences, and finally present an algorithm for learning a single PRF^e function.

5.1 Approximating PRF^ω using PRF^e Functions

A linear combination of complex exponential functions is known to be very expressive, and can approximate many other functions well [7]. Specifically, given a PRF^ω function, if we can write ω(i) as ω(i) ≈ Σ_{l=1}^L u_l α_l^i, then we have:

    Υ(t) = Σ_i ω(i) Pr(r(t) = i) ≈ Σ_{l=1}^L u_l ( Σ_i α_l^i Pr(r(t) = i) )

This reduces the computation of Υ(t) to L individual PRF^e computations, each of which takes only linear time. This gives an O(n log n + nL) time algorithm for approximate ranking by a PRF^ω function over independent tuples (as opposed to O(n²) for exact ranking).

Several techniques have been proposed for finding such approximations using complex exponentials [7, 26]. Those techniques are, however, computationally inefficient, involving the inversion of large matrices and the computation of roots of high-order polynomials. In this section, we present a clean and efficient algorithm, based on the Discrete Fourier Transform (DFT), for approximating a function ω(i) that approaches zero for large values of i (in other words, ω(i) ≥ ω(i+1) for all i, and ω(i) = 0 for i > h). As we noted earlier, this captures the typical behavior of the ω(i) function. An example of such a function is the step function (ω(i) = 1 for all i ≤ h, ω(i) = 0 for all i > h), which corresponds to the ranking function PT(h).
At a high level, our algorithm starts with a DFT approximation of ω(i) and then adapts it by adding several damping, scaling and shifting factors. The DFT is a well-known technique for representing a function as a linear combination of complex exponentials (also called the frequency-domain representation). More specifically, a discrete function ω(i) defined on a finite domain [0, N−1] can be decomposed into exactly N exponentials as:

    ω(i) = (1/N) Σ_{k=0}^{N−1} ψ(k) e^{(2π/N) ι k i},    i = 0, ..., N−1,    (4)

where ι is the imaginary unit and ψ(0), ..., ψ(N−1) denote the DFT of ω(0), ..., ω(N−1). If we want to approximate ω by fewer, say L, exponentials, we can instead use the L DFT coefficients with maximum absolute value. Assume that ψ(0), ..., ψ(L−1) are those coefficients. Then our approximation ω̃_L^DFT of ω by L exponentials is given by:

    ω̃_L^DFT(i) = (1/N) Σ_{k=0}^{L−1} ψ(k) e^{(2π/N) ι k i},    i = 0, ..., N−1.    (5)

[Figure 4: Illustrating the effect of the approximation steps: ω(i) = step function, with N = 1000 and L = 20.]

[Figure 5: Approximating functions using linear combinations of complex exponentials, with an increasing number of coefficients: (i) ω(i) = step function; (ii) ω(i) = 1000 − i for i ≤ 1000, 0 for i > 1000; (iii) ω(i) = an arbitrary smooth function.]

However, DFT uses only complex exponentials of unit norm, i.e., e^{ιr} with r real, which makes this approximation periodic (with period N). This is not suitable for approximating an ω function used in PRF, which is typically monotonically non-increasing.
If we make N sufficiently large, say larger than the total number of tuples, then we usually need a large number of exponentials (L) to get a reasonable approximation. Moreover, computing the DFT for very large N is computationally non-trivial. Furthermore, the number of tuples n may not be known in advance. We next present a set of non-trivial tricks that adapt the base DFT approximation to overcome these shortcomings. We assume ω(i) takes non-zero values within the interval [0, N−1], and that the absolute values of both ω(i) and ω̃_L^DFT(i) are bounded by B. To illustrate the method and the specific shortcomings each step addresses, we use the step function ω(i) = 1 for i < N, ω(i) = 0 for i ≥ N, with N = 1000, as a running example. Figure 4 illustrates the effect of each of these adaptations.

1. (DFT) We perform a pure DFT on the domain [1, aN], where a is a small integer constant (typically < 10). As we can see in Figure 4 (where N = 1000 and a = 2), this results in a periodic approximation with a period of 2000. Although the approximation is reasonable for x < 2000, the periodicity is unacceptable if the number of tuples is larger than 2000 (since the positions between 2000 and 3000, and similarly between 4000 and 5000, would be given high weights).

2. (Damping Factor (DF)) To address this issue, we introduce a damping factor η ≤ 1 such that B η^{aN} ≤ ε, where ε is a small positive real (for example, 10^{−5}). Our new approximation becomes:

    ω̃_L^{DFT+DF}(i) = η^i · ω̃_L^DFT(i) = (1/N) Σ_{k=0}^{L−1} ψ(k) (η e^{(2π/N) ι k})^i.    (6)

By incorporating this damping factor, the periodicity is mitigated, since lim_{i→+∞} ω̃_L^{DFT+DF}(i) = 0; in particular, ω̃_L^{DFT+DF}(i) ≤ ε for i > aN.

3. (Initial Scaling (IS)) However, the damping factor introduces another problem: it gives a biased approximation when i is small (see Figure 4).
Taking the step function as an example, ω̃_L^{DFT+DF}(i) is approximately η^i for 0 ≤ i < N instead of 1. To rectify this, we initially perform DFT on a different sequence ω̂(i) = η^{−i}ω(i) (rather than ω(i)) on the domain [0, aN]. Therefore, ω̃^{DFT+IS} is a reasonable approximation of ω̂. Then, applying the damping factor gives us an unbiased approximation of ω, which we denote by ω̃^{DFT+DF+IS}.

4. (Extending and Shifting (ES)) This step is tailored specifically to optimizing the approximation performance for ranking functions. DFT does not perform well at discontinuous points, in particular at i = 0 (the left boundary), which can significantly affect the ranking approximation. To handle this, we extrapolate ω to make it continuous around 0. Let the resulting function be ω̄, defined on [−bN, +∞) for a small b > 0. Again taking the step function as an example, we let ω̄(i) = 1 for −bN ≤ i < N, and ω̄(i) = 0 for i ≥ N. Then, we shift ω̄(i) rightwards by bN to make its domain lie entirely on the positive axis, do initial scaling, and perform DFT on the resulting sequence. We denote the approximation of the resulting sequence by ω̃₀(i) (obtained by applying (6)). To approximate the original ω(i) values, we only need to do the corresponding leftward shifting, namely ω̃^{DFT+DF+IS+ES}(i) = ω̃₀(i + bN). Figure 4 shows that DFT+DF+IS+ES gives a much better approximation than the others around i = 0.

Figures 4 and 5(i) illustrate the efficacy of our approximation technique for the step function. As we can see, we are able to approximate that function very well with just 20 or 30 coefficients. Figures 5(ii) and (iii) show the approximations for a piecewise linear function and an arbitrarily generated continuous function, respectively, both of which are much easier to approximate than the step function.
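The four adaptations can be combined into a single sketch as follows. This is our illustrative reconstruction, not the authors' implementation: the helper `adapted_dft`, the constant extrapolation used for the ES step, and the parameter defaults (a = 2, b = 0.1, and ε = 10⁻³ rather than 10⁻⁵, to keep the toy example's dynamic range small) are all our assumptions:

```python
import numpy as np

def adapted_dft(omega_fn, N, L, a=2, b=0.1, B=1.0, eps=1e-3):
    """Sketch of the DFT+DF+IS+ES approximation of a weight function omega."""
    eta = (eps / B) ** (1.0 / (a * N))     # damping factor: B * eta^(aN) <= eps
    shift = int(b * N)                     # ES: shift amount bN
    M = a * N + shift                      # DFT domain size
    i = np.arange(M)
    # ES: extend omega leftwards by constant extrapolation of omega(0),
    # then shift rightwards by bN so the domain is non-negative.
    extended = np.where(i < shift, omega_fn(0), omega_fn(i - shift))
    # IS: pre-scale by eta^{-i} so that the damping applied later is unbiased.
    scaled = extended * eta ** (-i.astype(float))
    psi = np.fft.fft(scaled)
    keep = np.argsort(np.abs(psi))[-L:]    # L largest-magnitude coefficients

    def approx(j):
        jj = j + shift                     # undo the shift (move leftwards)
        # DF: each exponential is damped by eta, as in equation (6).
        return (sum(psi[k] * (eta * np.exp(2j * np.pi * k / M)) ** jj
                    for k in keep) / M).real
    return approx

# Running example: the step function with N = 1000.
step = lambda i: np.where(np.asarray(i) < 1000, 1.0, 0.0)
w = adapted_dft(step, N=1000, L=100)
print(w(0), w(500), w(1500))               # roughly 1, 1, 0
```

With the damping in place, the approximation decays past aN instead of repeating periodically, and the initial scaling and shifting keep it close to ω on the region that matters for ranking.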
5.2 Learning a PRF^ω or PRF^e Function

Next, we address the question of how to learn the weights of a PRF^ω function, or the α for a single PRF^e function, from user preferences. To learn a linear combination of PRF^e functions, we first learn a PRF^ω function and then approximate it as above. Prior work on learning ranking functions (e.g., [9, 16, 27, 34]) assumes that the user preferences are provided as a set of pairs of tuples, and for each pair, we are told which tuple is ranked higher. Our problem differs slightly from this prior work in that the features we use to rank the tuples (i.e., Pr(r(t) = i), i = 1, ..., n) cannot be computed for each tuple individually, but must be computed over the entire dataset (since the values of the features for a tuple depend on the other tuples in the dataset). Hence, we assume that we are instead given a small sample of the tuples, along with the user ranking for all those tuples. We compute the features assuming this sample constitutes the entire relation, and learn a ranking function accordingly, with the goal of finding the parameters (the weights w_i for PRF^ω, or the parameter α for PRF^e) that minimize the number of disagreements with the provided ranking over the samples.

Given this, the problem of learning PRF^ω is identical to the problem addressed in the prior work, and we utilize the algorithm based on support vector machines (SVM) [34] in our experiments. On the other hand, we are not aware of any work that has addressed learning a ranking function like PRF^e. We use a simple binary search-like heuristic to find the optimal real value of α that minimizes the Kendall distance between the user-specified ranking and the ranking according to PRF^e(α).
In other words, we try to find arg min_{α∈[0,1]} dis(σ, σ(α)), where dis() is the Kendall distance between two rankings, σ is the ranking for the given sample, and σ(α) is the ranking obtained by using the PRF^e(α) function. Suppose we want to find the optimal α within the interval [L, U]. We first compute dis(σ, σ(L + i·(U−L)/10)) for i = 1, ..., 9 and find the i for which the distance is smallest. Then we reduce our search range to [max(L, L + (i−1)·(U−L)/10), min(U, L + (i+1)·(U−L)/10)] and repeat the above recursively. Although this algorithm can only converge to a local minimum, in our experimental study we observed that all of the prior ranking functions exhibit a uni-valley behavior (Section 8), and in such cases this algorithm finds the global optimum.

6 PRF as a Consensus Top-k Answer

In this section, we show that there is a close connection between PRF^ω and the notion of a consensus top-k answer (Con-Topk) proposed in [42]. We first review the definition of a consensus top-k ranking.

Definition 4 Let dis() denote a distance function between two top-k rankings. Then the most consensus answer τ is defined to be the top-k ranking such that the expected distance between τ and the answer τ_pw of the (random) world pw is minimized, i.e., τ = arg min_{τ′∈Ω} { E[dis(τ′, τ_pw)] }.

Here dis() may be any distance function defined on pairs of top-k answers. In [42], we discussed how to compute or approximate Con-Topk under a number of distance functions, such as Spearman's rho, Kendall's tau, and the intersection metric [20].

Example 6 Consider the example in Figure 1. Assume k = 2 and the distance function is the symmetric difference metric dis_∆ = |(τ₁ \ τ₂) ∪ (τ₂ \ τ₁)|. The most consensus top-2 answer is {t₂, t₅} and the expected distance is E[dis(τ′, τ_pw)] = .112 × 2 + .168 × 2 + .048 × 4 + .072 × 4 + .168 × 2 + .252 × 0 + .072 × 4 + .108 × 2 = 1.88.
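The recursive grid search described above can be sketched as follows. The function names, the stopping tolerance, and the toy ranker, which uses the closed form of the PRF^e value for independent tuples (Υ_α(t_i) = ∏_{t∈T_{i−1}}(1 − Pr(t) + Pr(t)α)·Pr(t_i)α, cf. Section 7), are our choices:

```python
def kendall_distance(r1, r2):
    """Number of discordant pairs between two rankings (lists of tuple ids)."""
    p1 = {t: i for i, t in enumerate(r1)}
    p2 = {t: i for i, t in enumerate(r2)}
    ids = list(p1)
    return sum(1 for a in range(len(ids)) for b in range(a + 1, len(ids))
               if (p1[ids[a]] < p1[ids[b]]) != (p2[ids[a]] < p2[ids[b]]))

def prfe_rank(scores, probs, alpha):
    """Rank tuple ids by PRF^e value, for independent tuples (cf. Section 7)."""
    order = sorted(range(len(scores)), key=lambda t: -scores[t])
    vals, prefix = {}, 1.0
    for t in order:                        # process in decreasing score order
        vals[t] = prefix * probs[t] * alpha
        prefix *= 1 - probs[t] + probs[t] * alpha
    return sorted(vals, key=vals.get, reverse=True)

def learn_alpha(rank_by_alpha, user_ranking, lo=0.0, hi=1.0, tol=1e-4):
    """Recursive 10-way grid search for the alpha minimizing Kendall distance."""
    while hi - lo > tol:
        step = (hi - lo) / 10
        grid = [lo + i * step for i in range(1, 10)]
        best = min(grid, key=lambda a: kendall_distance(user_ranking,
                                                        rank_by_alpha(a)))
        lo, hi = max(lo, best - step), min(hi, best + step)
    return (lo + hi) / 2

# Toy usage: recover an alpha that reproduces a "user" ranking.
scores, probs = [5, 4, 3, 2, 1], [0.2, 0.9, 0.5, 0.8, 0.3]
user = prfe_rank(scores, probs, 0.37)
alpha = learn_alpha(lambda a: prfe_rank(scores, probs, a), user)
print(alpha, kendall_distance(user, prfe_rank(scores, probs, alpha)))
```

Because the distance is piecewise constant in α with a plateau around the true value, the search converges to some α producing the same ranking, not necessarily the exact α that generated it.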
We first show that a Con-Topk answer under symmetric difference is equivalent to PT(h)(k), a special case of PRF^ω. Then, we generalize the result and show that any PRF^ω function is in fact equivalent to some Con-Topk answer under a suitably defined distance function that generalizes symmetric difference. This new connection further justifies the semantics of PRF^ω from an optimization point of view, in that the top-k answer obtained by PRF^ω minimizes the expected value of some distance function, and it may shed some light on designing the weight function for PRF^ω in particular applications.

6.1 Symmetric Difference and PT-k Ranking Function

Recall that a PT(h)(k) query returns the k tuples with the largest Pr(r(t) ≤ k) values. We show that the answer returned is the Con-Topk under the symmetric difference metric dis_∆, where dis_∆(τ₁, τ₂) = |τ₁ ∆ τ₂| = |(τ₁ \ τ₂) ∪ (τ₂ \ τ₁)|⁵. For ease of notation, we let Pr(r(t) > i) include both the probability that t's rank is larger than i and the probability that t does not exist. We use the symbol τ to denote a top-k ranked list, τ(i) to denote the i-th item in the list τ for a positive integer i, and τ(t) to denote the position of t ∈ T in τ.

⁵The result of this subsection has appeared in [42].

Theorem 2 If τ = {τ(1), τ(2), ..., τ(k)} is the set of k tuples with the largest Pr(r(t) ≤ k), then τ is the Con-Topk answer under the metric dis_∆, i.e., the answer minimizes E[dis_∆(τ, τ_pw)].

Proof: Suppose τ is fixed. We write E[dis_∆(τ, τ_pw)] as follows:

$$\begin{aligned} \mathrm{E}[dis_\Delta(\tau, \tau_{pw})] &= \mathrm{E}\Big[\sum_{t\in T}\delta(t\in\tau \wedge t\notin\tau_{pw}) + \delta(t\in\tau_{pw} \wedge t\notin\tau)\Big] \\ &= \sum_{t\in T\setminus\tau}\mathrm{E}[\delta(t\in\tau_{pw})] + \sum_{t\in\tau}\mathrm{E}[\delta(t\notin\tau_{pw})] \\ &= \sum_{t\in T\setminus\tau}\Pr(r(t)\le k) + \sum_{t\in\tau}\Pr(r(t) > k) \\ &= k + \sum_{t\in T}\Pr(r(t)\le k) - 2\sum_{t\in\tau}\Pr(r(t)\le k) \end{aligned}$$

The first two terms are invariant with respect to τ.
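The claim of Theorem 2 can be sanity-checked by brute force on a tiny tuple-independent instance. The probabilities below are made-up, and the tuples are indexed in decreasing score order, so a world's top-k is just its first k present tuples:

```python
from itertools import product, combinations

probs = [0.7, 0.4, 0.8, 0.5, 0.6]        # made-up existence probabilities
n, k = len(probs), 2

worlds = []                              # (probability, top-k set) per world
for mask in product([0, 1], repeat=n):
    p = 1.0
    for t in range(n):
        p *= probs[t] if mask[t] else 1 - probs[t]
    present = [t for t in range(n) if mask[t]]
    worlds.append((p, set(present[:k])))

# Pr(r(t) <= k) for each tuple, and Theorem 2's candidate answer.
pr_topk = [sum(p for p, top in worlds if t in top) for t in range(n)]
best = set(sorted(range(n), key=lambda t: -pr_topk[t])[:k])

def exp_dist(tau):                       # E[ dis_Delta(tau, tau_pw) ]
    return sum(p * len(tau ^ top) for p, top in worlds)

# Exhaustive check: no k-subset beats the k tuples with largest Pr(r(t) <= k).
others = min(exp_dist(set(c)) for c in combinations(range(n), k))
print(best, exp_dist(best), others)
```

Enumerating all 2ⁿ worlds is of course only feasible for such toy instances; the point is that the minimizer over all k-subsets coincides with the Pr(r(t) ≤ k)-based answer.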
Therefore, it is clear that the set of k tuples with the largest Pr(r(t) ≤ k) minimizes the expectation.

6.2 Weighted Symmetric Difference and PRF^ω

We present a generalization of Theorem 2 that shows the equivalence between any PRF^ω function and Con-Topk under weighted symmetric difference distance functions, which generalize the symmetric difference. Suppose ω is a positive function defined on Z⁺ with ω(i) = 0 for all i > k.

Definition 5 The weighted symmetric difference with weight ω of two top-k answers τ₁ and τ₂ is defined to be

$$dis_\omega(\tau_1, \tau_2) = \sum_{i=1}^{k}\omega(i)\,\delta(\tau_2(i)\notin\tau_1).$$

Intuitively, if the i-th item of τ₂ cannot be found in τ₁, we pay a penalty of ω(i), and the distance is simply the total penalty. If ω is a decreasing function, the distance function captures the intuition that top-ranked items should carry more weight. If ω is a constant function, it reduces to the ordinary symmetric difference distance. Note that dis_ω is not necessarily symmetric⁶. Now, we present the theorem, which is a generalization of Theorem 2.

Theorem 3 Suppose ω is a positive function defined on Z⁺ with ω(i) = 0 for all i > k. If τ = {τ(1), τ(2), ..., τ(k)} is the set of k tuples with the largest Υ_ω(t) values, then τ is the Con-Topk answer under the weighted symmetric difference dis_ω, i.e., the answer minimizes E[dis_ω(τ, τ_pw)].

Proof: The proof mimics the one for Theorem 2. Suppose τ is fixed. We can write E[dis_ω(τ, τ_pw)] as follows:

$$\begin{aligned} \mathrm{E}[dis_\omega(\tau, \tau_{pw})] &= \mathrm{E}\Big[\sum_{t\in T}\omega(\tau_{pw}(t))\,\delta(t\in\tau_{pw}\wedge t\notin\tau)\Big] \\ &= \sum_{t\in T\setminus\tau}\mathrm{E}[\omega(\tau_{pw}(t))\,\delta(t\in\tau_{pw})] \\ &= \sum_{t\in T\setminus\tau}\sum_{i=1}^{k}\omega(i)\Pr(r(t)=i) \\ &= \sum_{t\in T\setminus\tau}\Upsilon_\omega(t) \end{aligned}$$

⁶Rigorously, a distance function (or metric) should satisfy positive definiteness, symmetry, and the triangle inequality. Here we abuse the term a bit.

Therefore, it is clear that the set of k tuples with the largest Υ_ω(t) values minimizes the above quantity.
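The same brute-force check extends to Theorem 3: with made-up probabilities and a made-up weight vector ω, the k tuples with the largest Υ_ω(t) minimize the expected weighted symmetric difference. The instance below is ours, for illustration only:

```python
from itertools import product, combinations

probs = [0.7, 0.4, 0.8, 0.5]          # made-up; decreasing score order
n, k = len(probs), 2
w = [1.0, 0.5]                        # omega(1), omega(2); omega(i) = 0 for i > k

worlds = []                           # (probability, ranked top-k list) per world
for mask in product([0, 1], repeat=n):
    p = 1.0
    for t in range(n):
        p *= probs[t] if mask[t] else 1 - probs[t]
    worlds.append((p, [t for t in range(n) if mask[t]][:k]))

# Upsilon_omega(t) = sum_i omega(i) * Pr(r(t) = i)
ups = [sum(p * w[top.index(t)] for p, top in worlds if t in top)
       for t in range(n)]
best = set(sorted(range(n), key=lambda t: -ups[t])[:k])

def exp_dis_w(tau):                   # E[ dis_omega(tau, tau_pw) ]
    return sum(p * sum(w[i] for i, t in enumerate(top) if t not in tau)
               for p, top in worlds)

others = min(exp_dis_w(set(c)) for c in combinations(range(n), k))
print(best, exp_dis_w(best), others)
```

With a decreasing ω, a world's highly-ranked tuples that are missing from τ cost more, exactly as Definition 5 prescribes.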
Although the weighted symmetric difference appears to be a very rich class of distance functions, its relationship with other well-studied distance functions, such as Spearman's rho and Kendall's tau, is still not well understood. We leave this as an interesting open problem.

7 An Interesting Property of PRF^e

We have seen that PRF^e(α) admits very efficient evaluation algorithms. We have also suggested that the parameter α should be learned from samples or user feedback. We do so because, by changing the parameter α, PRF^e can span a spectrum of rankings, and the true ranking should be part of this spectrum or close to some point in it. We provide empirical support for this claim in the next section (Section 8). In this section, we make some theoretical observations about PRF^e that help us further understand its behavior. First, we observe that for α = 1, the PRF^e ranking is equivalent to the ranking of tuples by their existence probabilities (the PRF^e value in that case is simply the total probability). On the other hand, when α approaches 0, PRF^e tends to rank the tuples by their probabilities of being the top-1 answer, i.e., Pr(r(t) = 1). Thus, it is natural to ask how the ranking changes as we vary α from 0 to 1. We now prove the following theorem, which gives an important characterization of the behavior of PRF^e on tuple-independent databases. Let τ^α denote the ranking obtained by PRF^e(α). For simplicity, we ignore the possibility of ties and assume this ranking is unique. As two special cases, let τ⁰ and τ¹ denote the rankings obtained by sorting the tuples in decreasing Pr(r(t) = 1) and Pr(t) order, respectively.

Theorem 4
1. If t_i >_{τ⁰} t_j (t_i is ranked higher than t_j in τ⁰) and t_i >_{τ¹} t_j, then t_i >_{τ^α} t_j for any 0 ≤ α ≤ 1.
2.
If t_i >_{τ⁰} t_j and t_i <_{τ¹} t_j, then there is exactly one point β such that t_i >_{τ^α} t_j for α < β and t_i <_{τ^α} t_j for α > β.

Proof: Let Υ_α(t_i) be the PRF^e(α) value of tuple t_i. Then:

$$\Upsilon_\alpha(t_i) = F_i(\alpha) = \prod_{t\in T_{i-1}}\big(1 - \Pr(t) + \Pr(t)\,\alpha\big)\cdot\Pr(t_i)\,\alpha.$$

Assume that i < j. Dividing Υ_α(t_j) by Υ_α(t_i), we get

$$\rho_{j,i}(\alpha) = \frac{\Upsilon_\alpha(t_j)}{\Upsilon_\alpha(t_i)} = \frac{\prod_{t\in T_{j-1}}\big(1-\Pr(t)+\Pr(t)\alpha\big)}{\prod_{t\in T_{i-1}}\big(1-\Pr(t)+\Pr(t)\alpha\big)}\cdot\frac{\Pr(t_j)}{\Pr(t_i)} = \frac{\Pr(t_j)}{\Pr(t_i)}\cdot\prod_{l=i}^{j-1}\big(1-\Pr(t_l)+\Pr(t_l)\alpha\big).$$

Notice that 1 − Pr(t) + Pr(t)α is always non-negative and an increasing function of α. Therefore, ρ_{j,i}(α) is increasing in α. If i > j, the same argument shows that ρ_{j,i}(α) is decreasing in α. In either case, the ratio is monotone in α. If ρ_{j,i}(0) < 1 and ρ_{j,i}(1) < 1, then ρ_{j,i}(α) < 1 for all 0 < α ≤ 1; therefore, the first half of the theorem holds. If ρ_{j,i}(0) < 1 and ρ_{j,i}(1) > 1, then there is exactly one point 0 < β < 1 such that ρ_{j,i}(β) = 1, with ρ_{j,i}(α) < 1 for all 0 < α < β and ρ_{j,i}(α) > 1 for all β < α ≤ 1. This proves the second half.

[Figure 6: Illustration of Example 7. f_i(α) = Υ_α(t_i) for i = 1, 2, 3, 4; the marked point is the intersection of f₁ and f₄.]

Some nontrivial questions can be immediately answered by the theorem. For example, one may ask: "Is it possible that we get some ranking τ₁, increase α a bit and get another ranking τ₂, and then increase α further and get τ₁ back?" We can quickly see that the answer is no: if two tuples change positions, they never change back. Another observation we can make is that if t₁ dominates t₂ (i.e., t₁ has both a higher score and a higher probability), then t₁ ranks above t₂ for any α (this is because t₁ ranks above t₂ in both τ⁰ and τ¹).
Interestingly, the way the ranking changes as α is increased from 0 to 1 is reminiscent of the execution of the bubble sort algorithm. Assume the true order of the tuples is τ¹ and the initial order is τ⁰, and increase α from 0 to 1 gradually. Each change in the ranking is a swap of a pair of adjacent tuples that are not in the right relative order initially, and the number of swaps is exactly the number of reversed pairs. This is just like bubble sort; the only difference is that the order of those swaps may not be the same.

Example 7 Suppose we have four independent tuples: (t₁: 100, .4), (t₂: 80, .6), (t₃: 50, .5), (t₄: 30, .9). Using (3), it is easy to see that Υ_α(t₁) = .4α, Υ_α(t₂) = (.6 + .4α) · .6α, Υ_α(t₃) = (.6 + .4α)(.4 + .6α) · .5α, and Υ_α(t₄) = (.6 + .4α)(.4 + .6α)(.5 + .5α) · .9α. In Figure 6, each curve corresponds to one tuple. In the interval (0, 1], any two curves intersect at most once. Changes in the ranking happen exactly at the intersection points, where one adjacent pair of tuples swap their positions. For instance, the + sign in the figure marks the intersection point of f₁ and f₄: the rank list is {t₂, t₁, t₄, t₃} right before the point and {t₂, t₄, t₁, t₃} right after it.

In fact, if we think of h as a parameter of PT(h) and vary h from 1 to n, the way the rank list changes is quite similar to the process for PRF^e: at one extreme, where h = 1, the rank list is τ⁰, i.e., the tuples are sorted by Pr(r(t) = 1), and at the other extreme, where h = n, the rank list is τ¹, i.e., the tuples are sorted by Pr(r(t) ≤ n) = Pr(t). However, PT(h) is only able to explore at most n different rankings (one for each h) "between" τ⁰ and τ¹, while PRF^e may explore O(n²) of them.
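The single-crossing behavior of Theorem 4 can be verified numerically on the tuples of Example 7. The sketch below uses the closed form of Υ_α for independent tuples; the grid resolution is our choice:

```python
# Upsilon_alpha for the four tuples of Example 7 (probabilities .4, .6, .5, .9,
# already listed in decreasing score order).
def upsilons(alpha):
    vals, prefix = [], 1.0
    for p in [0.4, 0.6, 0.5, 0.9]:
        vals.append(prefix * p * alpha)
        prefix *= 1 - p + p * alpha
    return vals

# Sweep alpha over a fine grid and count, for every pair of tuples, how many
# times their relative order flips; Theorem 4 says each pair flips at most once.
flips, prev = {}, None
for step in range(1, 1001):
    v = upsilons(step / 1000)
    order = [v[i] > v[j] for i in range(4) for j in range(i + 1, 4)]
    if prev is not None:
        for idx in range(len(order)):
            if order[idx] != prev[idx]:
                flips[idx] = flips.get(idx, 0) + 1
    prev = order
print(flips)   # every pair flips at most once
```

Five of the six pairs swap exactly once in (0, 1]; the pair (t₂, t₃) keeps its order throughout, since the ratio of their curves stays below 1 for all α.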
8 Experimental Study

We conducted an extensive empirical study over several real and synthetic datasets to illustrate: (a) the diverse and conflicting behavior of different ranking functions proposed in the prior literature, (b) the effectiveness of our parameterized ranking functions, especially PRF^e, at approximating other ranking functions, and (c) the scalability of our new generating functions-based algorithms for exact and approximate ranking. We discussed the results supporting (a) in Section 3.2. In this section, we focus on (b) and (c).

Datasets: We mainly use the International Ice Patrol (IIP) Iceberg Sighting Dataset⁷ for our experiments. This dataset was also used in prior work on ranking in probabilistic databases [28, ?]. The database contains a set of iceberg sighting records, each of which contains the location (latitude, longitude) of the iceberg and the number of days the iceberg has drifted, among other attributes. Detecting the icebergs that have been drifting for long periods is crucial, and hence we use the number of days drifted as the ranking score. Each sighting record is also associated with a confidence-level attribute according to the source of the sighting: R/V (radar and visual), VIS (visual only), RAD (radar only), SAT-LOW (low earth orbit satellite), SAT-MED (medium earth orbit satellite), SAT-HIGH (high earth orbit satellite), and EST (estimated). We converted these seven confidence levels into probabilities 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, and 0.4, respectively. We added a very small Gaussian noise to each probability so that ties could be broken. There are nearly a million records available from 1960 to 2007; we created 10 different datasets for our experimental study, containing 100,000 (IIP-100,000) to 1,000,000 (IIP-1,000,000) records, by uniformly sampling from the original dataset.
Along with the real datasets, we also use several synthetic datasets with varying degrees of correlation, where the correlations are captured using probabilistic and/xor trees. The tuple scores (for ranking) were chosen uniformly at random from [0, 10000]. The corresponding and/xor trees were generated randomly starting with the root, by controlling the height (L), the maximum degree of the non-root nodes (d), and the proportion of ∨ and ∧ nodes (X/A) in the tree. Specifically, we use five such datasets:
1. Syn-IND (independent tuples): the tuple existence probabilities were chosen uniformly at random from [0, 1].
2. Syn-XOR (L=2, X/A=∞, d=5): with height set to 2 and no ∧ nodes, this dataset exhibits only mutual exclusivity correlations (mimicking the x-tuples model [48, 54]).
3. Syn-LOW (L=3, X/A=10, d=2).
4. Syn-MED (L=5, X/A=3, d=5).
5. Syn-HIGH (L=5, X/A=1, d=10).

Setup: We use the normalized Kendall distance (Section ??) for comparing two top-k rankings. All the algorithms were implemented in C++, and the experiments were run on a 2.4GHz Linux PC with 2GB memory.

8.1 Approximability of Ranking Functions

We begin with a set of experiments illustrating the effectiveness of our parameterized ranking functions at approximating other ranking functions. Due to space constraints, we focus on PRF^e here, because it is significantly faster to rank according to a PRF^e function (or a linear combination of several PRF^e functions) than according to a PRF^ω function. Figures 7(i) and (ii) show the Kendall distance between the Top-100 answers computed using a specific ranking function and PRF^e for varying values of α, for the IIP-100,000 and Syn-IND-1000 datasets. For better visualization, we plot i on the x-axis, where α = 1 − 0.9^i; we do this because the behavior of the PRF^e function changes rather drastically, and spans a spectrum of rankings, as α approaches 1.
First, as we can see, the PRF^e ranking is close to the ranking by Score alone for small values of α, whereas it is close to the ranking by Probability when α is close to 1 (in fact, for α = 1, the PRF^e ranking is equivalent to the ranking of tuples by their existence probabilities)⁸. Second, we see that for all the other functions (E-Score, PT(h), U-Rank, E-Rank), there exists a value of α for which the distance of that function to PRF^e is very small, indicating that PRF^e can indeed approximate those functions quite well. Moreover, we observe that this "uni-valley" behavior of the curves justifies the binary search algorithm we advocate for learning the value of α in Section 5.2. Our experiments with other synthetic and real datasets indicated very similar behavior by the ranking functions.

⁷http://nsidc.org/data/g00807.html
⁸On the other hand, for α = 0, PRF^e ranks the tuples by their probabilities of being the Top-1 answer.

[Figure 7: Comparing PRF^e with other ranking functions for varying values of α (α = 1 − 0.9^i): (i) IIP-100,000, k = 100; (ii) Syn-IND-1000, k = 100.]

[Figure 8: (i) Approximating PT(1000) using a linear combination of PRF^e functions; (ii) approximation quality for three ranking functions for a varying number of exponentials.]
Next, we evaluate the effectiveness of the approximation technique presented in Section 5. In Figure 8(i), we show the Kendall distance between the top-k answers obtained using PT(h) (for h = 1000, k = 1000) and using a linear combination of PRF^e functions found by our algorithms. As expected, the approximation using the vanilla DFT technique is very bad, with a Kendall distance close to 0.8, indicating little similarity between the top-k answers. However, the approximation obtained using our proposed algorithm (the DFT+DF+IS+ES curve) achieves a Kendall distance of less than 0.1 with just L = 20 exponentials. In Figure 8(ii), we compare the approximation quality (using DFT+DF+IS+ES) for three ranking functions on two datasets: IIP-100,000 with k = 1000, and IIP-1,000,000 with k = 10000. The ranking functions we compared were: (1) PT(h) (h = 1000), (2) an arbitrary smooth function, sfunc, and (3) a linear function. We see that L = 40 suffices to bring the Kendall distance below 0.1 in all cases. We also observe that smooth functions (for which the absolute value of the first derivative of the underlying continuous function is bounded by a small value) are usually easier to approximate: we only need L = 20 exponentials to achieve a Kendall distance of less than 0.05 for sfunc, and the linear function is even easier to approximate.

[Figure 9: (i) Learning PRF^e from user preferences; (ii) learning PRF^ω from user preferences.]

8.2 Learning Ranking Functions

Next, we consider the issue of learning ranking functions from user preferences.
Lacking real user preference data, we instead assume that the user's ranking function, denoted user-func, is identical to one of: E-Score, PT(h), U-Rank, E-Rank, or PRF^e(α = 0.95). We generate a set of user preferences by ranking a random sample of the dataset using user-func (thus generating five sets of user preferences). These are then fed to the learning algorithm, and finally we compare the Kendall distance between the learned ranking and the true ranking for the entire dataset. In Figure 9(i), we plot the results for learning a single PRF^e function (i.e., for learning the value of α) using the binary search-like algorithm presented in Section 5.2. The experiment reveals that when the underlying ranking is done by PRF^e, the value of α can be learned perfectly. When one of PT(h) or U-Rank is the underlying ranking function, the correct value of α can be learned with a fairly small sample size, and increasing the number of samples does not help in finding a better α. On the other hand, E-Rank cannot be learned well by PRF^e unless the sample size approaches the total size of the whole dataset. This phenomenon can be partly explained using Figures 7(i) and (ii), in which the curves for PT(h) and UTop-k have a fairly smooth valley, while the one for E-Rank is very sharp and the region of α values where the distance is low is extremely small ([1 − 0.9⁹⁰, 1 − 0.9¹¹⁰]); hence, the minimum point for E-Rank is harder to reach. Another reason is that E-Rank is quite sensitive to the size of the dataset, which makes it hard to learn using a smaller sample. We also observe that while extremely large samples are able to learn E-Score well, the behavior of E-Score is quite unstable when the sample size is smaller.
Note that if we already know the form of the ranking function, we do not need to learn it in this fashion; we can instead directly find an approximation for it using our DFT-based algorithm. In Figure 9(ii), we show the results of an experiment where we tried to learn a PRF^ω function (using the SVM-light package [34]). We keep our sample size ≤ 200, since SVM-light becomes drastically slower with larger sample sizes. First, we observe that PT(h) and PRF^e can be learned very well from a small sample (distance < 0.2 in most cases), and increasing the sample size does not help significantly. U-Rank can also be learned, but the approximation is not nearly as good; this is because U-Rank cannot be written as a single PRF^ω function. We observed similar behavior in our experiments with other datasets. Due to space constraints, we omit a further discussion of learning a PRF^ω function; the issues in learning such weighted functions have been investigated in the prior literature, and if the true ranking function can be written as a PRF^ω function, then the above algorithm is expected to learn it well given a reasonable number of samples.

[Figure 10: (i) Effect of correlations on the PRF^e ranking as α varies; (ii) effect of correlations on PRF^e, U-Rank, and PT(h).]

8.3 Effect of Correlations

Next, we evaluate the behavior of ranking functions over probabilistic datasets modeled using probabilistic and/xor trees. We use the four synthetic correlated datasets, Syn-XOR, Syn-LOW, Syn-MED, and Syn-HIGH, for these experiments.
For each dataset and each ranking function considered, we compute the rankings both by considering the correlations and by ignoring them, and then compute the Kendall distance between the two (e.g., for PRF^e, we compute the rankings using the PROB-ANDOR-PRF-RANK and IND-PRF-RANK algorithms). Figure 10(i) shows the results for the PRF^e ranking function for varying α, whereas in Figure 10(ii) we plot the results for PRF^e(α = 0.9), PT(100), and U-Rank. As we can see, on highly correlated datasets, ignoring the correlations can result in significantly inaccurate top-k answers. The effect is not as pronounced for the Syn-XOR dataset. This is because, in any group of mutually exclusive tuples, there are typically only a few tuples with sufficiently high probabilities to be part of the top-k answer; the rest of the tuples may be ignored for ranking purposes. Because of this, assuming the tuples to be independent of each other does not result in significant errors. As α approaches 1, PRF^e tends to sort the tuples by probabilities, so all four curves in Figure 10(i) approach 0. We note that ranking by E-Score is invariant to the correlations, which is a significant drawback of that function.

8.4 Execution Times

Figure 11(i) shows the execution times of four ranking functions, PRF^e, PT(h), U-Rank and E-Rank, on the IIP datasets, for different dataset sizes and k. We note that the running time for PRF^ω is similar to that of PT(h). As expected, ranking by PRF^e or E-Rank is very efficient (1,000,000 tuples can be ranked within 1 or 2 seconds). Indeed, after sorting the dataset in a non-increasing score order, PRF^e needs only a single scan of the dataset, and E-Rank needs to scan the dataset twice. Execution times for PT(h) and U-Rank increase linearly with h and k, respectively, and the algorithms become very slow for high h and k.
[Figure 11: Experiments comparing the execution times of the ranking algorithms (note that the y-axis is log-scale for (ii) and (iii)).]

The running times of both PRF^e and E-Rank are not significantly affected by k. Figure 11(ii) compares the execution times of PT(h) and its approximation using a linear combination of PRF^e functions (see Figure 8(i)), for two different values of k; w50 indicates that 50 exponentials were used in the approximation (note that the approximate ranking, based on PRF^e, is insensitive to the value of k). As we can see, for large datasets and higher values of k, exact computation takes several orders of magnitude more time than the approximation. For example, the exact algorithm takes nearly 1 hour for n = 500,000 and h = 10,000, while the approximate answer obtained using L = 50 PRF^e functions takes only 24 seconds and achieves a Kendall distance of 0.09. For correlated datasets, the effect is even more pronounced. In Figure 11(iii), we plot the results of a similar experiment, but using two correlated datasets, Syn-XOR and Syn-HIGH (note that the number of tuples in these datasets is smaller by a factor of 10). As we can see, our generating functions-based algorithms for computing PRF^e are highly efficient, even for datasets with high degrees of correlation.
As above, approximating the PT(h) ranking function using a linear combination of PRF^e functions is significantly cheaper than computing it exactly. Combined with the previous results illustrating that a linear combination of PRF^e functions can approximate other ranking functions very well, this validates the unified ranking approach that we propose in this paper.

9 PRF Computation for Arbitrary Correlations

Among the many models for capturing the correlations in a probabilistic database, graphical models (Markov or Bayesian networks) perhaps represent the most systematic approach [49]. The appeal of graphical models stems both from the pictorial representation of the dependencies and from a rich literature on doing inference over them. In this section, we present an algorithm for computing the PRF function values for all tuples of a correlated dataset when the correlations are represented using a graphical model. The resulting algorithm is a non-trivial dynamic program over the junction tree of the graphical model. Our main result is that we can compute the PRF function in polynomial time if the junction tree of the graphical model has bounded treewidth. It is worth noting that this result does not subsume our algorithm for and/xor trees (Section 4.2), since the treewidth of the moralized graph of a probabilistic and/xor tree may not be bounded. In some sense, this is close to instance-optimal, since the complexity of the underlying inference problem is itself exponential in the treewidth of the graphical model (this, however, does not preclude the possibility that the ranking itself could be done more efficiently without computing the PRF function explicitly, although such an algorithm is unlikely to exist).

9.1 Definitions

We start by briefly reviewing some notation and definitions related to graphical models and junction trees. Let T = {t₁, t₂, ...
, t_n} be the set of tuples in the dataset, sorted in non-increasing order of their score values. With each tuple t in T, we associate an indicator random variable X_t, which is 1 if t is present and 0 otherwise. Let X = {X_{t_1}, . . . , X_{t_n}} and X_i = {X_{t_1}, . . . , X_{t_i}}. For a set of variables S, we use Pr(S) to denote the joint probability distribution over those variables. So Pr(X) denotes the joint probability distribution that we are trying to reason about. This joint distribution captures all the correlations in the dataset. However, representing it directly would take O(2^n) space, which is clearly infeasible. Probabilistic graphical models allow us to represent this joint distribution compactly by exploiting the conditional independences present among the variables. Given three disjoint sets of random variables A, B, C, we say that A is conditionally independent of B given C if and only if:

Pr(A, B | C) = Pr(A | C) Pr(B | C)

We assume that we are provided with a junction tree over the variables X that captures the correlations among them. A junction tree can be constructed from a graphical model using standard algorithms [32]. Recently, junction trees have also been used as an internal representation for probabilistic databases, and have been shown to be quite effective at handling lightly correlated probabilistic databases [36]. We describe the key properties of junction trees next.

Junction tree: Let T be a tree with each node v associated with a subset C_v ⊆ X. We say T is a junction tree if the intersection C_u ∩ C_v, for any u, v ∈ T, is contained in C_w for every node w on the unique path between u and v in T (this is called the running intersection property). The treewidth tw of a junction tree is defined to be max_{v ∈ T} |C_v| − 1. Denote S_{u,v} = C_u ∩ C_v for each edge (u, v) ∈ T. We call S_{u,v} a separator, since removal of S_{u,v} disconnects the graphical model.
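As a minimal illustration, the running intersection property and the treewidth of a small junction tree (matching the example that follows) can be checked mechanically. The representation and function names below are our own sketch, not part of the paper's algorithms:

```python
from itertools import combinations

# A junction tree as node -> clique (frozenset of variable names) plus a tree edge list.
cliques = {
    1: frozenset({"X4", "X5"}),
    2: frozenset({"X3", "X4"}),
    3: frozenset({"X1", "X3"}),
    4: frozenset({"X2", "X3"}),
}
edges = [(1, 2), (2, 3), (2, 4)]   # separator S_uv = C_u & C_v for each edge

def path(u, v, adj):
    """Unique path between u and v in a tree, found by DFS."""
    stack = [(u, [u])]
    while stack:
        node, p = stack.pop()
        if node == v:
            return p
        for w in adj[node]:
            if w not in p:
                stack.append((w, p + [w]))
    return None

def check_running_intersection(cliques, edges):
    """C_u & C_v must be contained in every clique on the u-v path."""
    adj = {v: set() for v in cliques}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for u, v in combinations(cliques, 2):
        shared = cliques[u] & cliques[v]
        for w in path(u, v, adj):
            if not shared <= cliques[w]:
                return False
    return True

treewidth = max(len(c) for c in cliques.values()) - 1   # max clique size minus 1
assert check_running_intersection(cliques, edges)
```

For this tree the check succeeds and the treewidth is 1, as in the example below.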
The set of conditional independences embodied by a junction tree can be read off using the Markov property:

(Markov Property) Given variable sets A, B, C, if C separates A and B (i.e., removal of the variables in C disconnects the variables in A from the variables in B in the junction tree), then A is conditionally independent of B given C.

Example 8 Let T = {t_1, t_2, t_3, t_4, t_5}. Figure 12 (i) and (ii) show the (undirected) graphical model and the corresponding junction tree T. T has four nodes: C_1 = {X_{t_4}, X_{t_5}}, C_2 = {X_{t_4}, X_{t_3}}, C_3 = {X_{t_3}, X_{t_1}}, and C_4 = {X_{t_3}, X_{t_2}}. The treewidth of T is 1. We have S_{1,2} = {X_4}, S_{2,3} = {X_3}, and S_{2,4} = {X_3}. Using the Markov property, we observe that X_5 is independent of X_1, X_2, X_3 given X_4.

[Figure 12: (i) A graphical model over X_1, . . . , X_5; (ii) A junction tree for the model along with the (calibrated) potentials Pr(X_5, X_4), Pr(X_4, X_3), Pr(X_3, X_2), and Pr(X_3, X_1).]

Clique and Separator Potentials: With each clique C_v in the junction tree, we associate a potential π_v(C_v), which is a function over all variables X_{t_i} ∈ C_v and captures the correlations among those variables. Similarly, with each separator S_{u,v}, we associate a potential μ_{u,v}(S_{u,v}). Without loss of generality, we assume that the potentials are calibrated, i.e., the potential corresponding to a clique (or a separator) is exactly the joint probability distribution over the variables in that clique (or separator). Given a junction tree with arbitrary potentials, calibrated potentials can be computed using a standard message passing algorithm [32]. The complexity of this algorithm is O(n · 2^{tw}).
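One consequence of calibration is that two adjacent clique potentials must induce the same marginal over their shared separator, which gives a simple sanity check. The sketch below uses hypothetical numbers, not the values from Figure 12:

```python
# Calibration check: cliques {X4, X5} and {X3, X4} share the separator {X4},
# so both calibrated potentials must marginalize to the same Pr(X4).
# Potentials are dicts: assignment tuple -> probability. Values are illustrative.
pi_45 = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.3}  # (X4, X5)
pi_34 = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.2}  # (X3, X4)

def marginal(pot, axis):
    """Sum a two-variable potential down to the variable at position `axis`."""
    out = {0: 0.0, 1: 0.0}
    for assign, p in pot.items():
        out[assign[axis]] += p
    return out

mu_4_from_45 = marginal(pi_45, 0)   # marginalize X5 out of (X4, X5)
mu_4_from_34 = marginal(pi_34, 1)   # marginalize X3 out of (X3, X4)
assert all(abs(mu_4_from_45[v] - mu_4_from_34[v]) < 1e-9 for v in (0, 1))
```

Here both marginals give Pr(X4 = 0) = Pr(X4 = 1) = 0.5, so the pair is consistent; message passing is what enforces this agreement for arbitrary starting potentials.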
The joint probability distribution of X, whose correlations are captured by a calibrated junction tree T, can then be written as:

Pr(X) = ∏_{v ∈ T} π_v(C_v) / ∏_{(u,v) ∈ T} μ_{u,v}(S_{u,v}) = ∏_{v ∈ T} Pr(C_v) / ∏_{(u,v) ∈ T} Pr(S_{u,v})

9.2 Problem Simplification

We begin by describing the first step of our algorithm and defining a reduced, simpler-to-state problem. Recall that our goal is to rank the tuples according to Υ(t_i) = Σ_{j > 0} ω(j) Pr(r(t_i) = j). For this purpose, we first compute the positional probabilities Pr(r(t_i) = j), for all j and all t_i, using the algorithms presented in the next two subsections. Given those, the values of Υ(t_i) can be computed in O(n^2) time for all tuples, and the ranking itself can be done in O(n log n) time (by sorting). The positional probabilities Pr(r(t_i) = j) may also be of interest in their own right.

For each tuple t_i, we compute Pr(r(t_i) = j) for all j at once. Recall that Pr(r(t_i) = j) is the probability that t_i exists (i.e., X_i = 1) and exactly j − 1 tuples with scores higher than t_i are present (i.e., Σ_{l=1}^{i−1} X_l = j − 1). In other words:

Pr(r(t_i) = j) = Pr(X_i = 1 ∧ Σ_{l=1}^{i−1} X_l = j − 1) = Pr(Σ_{l=1}^{i−1} X_l = j − 1 | X_i = 1) Pr(X_i = 1)

Hence, we begin by conditioning the junction tree on X_i = 1 and re-calibrating. This is done by identifying all cliques and separators that contain X_i, and updating the corresponding probability distributions by removing the entries corresponding to X_i = 0. More precisely, we replace a probability distribution Pr(X_{i_1}, . . . , X_{i_k}, X_i) by a potential π(X_{i_1}, . . . , X_{i_k}) computed as:

π(X_{i_1} = v_1, . . . , X_{i_k} = v_k) = Pr(X_{i_1} = v_1, . . . , X_{i_k} = v_k, X_i = 1)

π is not a probability distribution, since its entries may not sum to 1. Further, the potentials may no longer be consistent with each other.
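The conditioning step just described, restricting each potential that contains X_i to the slice X_i = 1, can be sketched on a single potential table as follows (variable names and probabilities are illustrative, not from the paper's figures):

```python
# Restrict a joint Pr(X_j, X_i) to the slice X_i = 1, producing the
# (unnormalized) potential pi(X_j) = Pr(X_j, X_i = 1).

def condition_on_one(pot, axis):
    """Keep only entries with value 1 at position `axis`, then drop that coordinate."""
    out = {}
    for assign, p in pot.items():
        if assign[axis] == 1:
            reduced = assign[:axis] + assign[axis + 1:]
            out[reduced] = out.get(reduced, 0.0) + p
    return out

pr_45 = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}  # Pr(X4, X5)
pi_4 = condition_on_one(pr_45, 1)    # pi(X4) = Pr(X4, X5 = 1)
# pi_4 == {(0,): 0.3, (1,): 0.4}; its entries sum to Pr(X5 = 1) = 0.7, not 1,
# which is exactly why the tree must be recalibrated afterwards.
```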
Hence, we need to recalibrate the junction tree using message passing [32]. As mentioned earlier, this takes O(n · 2^{tw}) time. Figure 13 shows the resulting (uncalibrated) junction tree after conditioning on X_5 = 1.

[Figure 13: Conditioning on X_5 = 1 results in a smaller junction tree, with uncalibrated potentials π(X_4), π(X_4, X_3), π(X_3, X_2), and π(X_3, X_1), that captures the distribution over X_1, X_2, X_3, X_4 given X_5 = 1.]

If X_i is a separator in the junction tree, then we get more than one junction tree after conditioning on X_i = 1. Figure 14 shows the two junction trees we would get after conditioning on X_4 = 1. The variables in these junction trees are independent of each other (this follows from the Markov property), and the junction trees can be processed separately.

Since the resulting junction tree (or trees) captures the probability distribution conditioned on the event X_i = 1, our problem now reduces to finding the probability distribution of Σ_{l=1}^{i−1} X_l in those junction trees. For a cleaner description of the algorithm, we associate an indicator δ_{X_l} with each variable X_l in the junction tree: δ_{X_l} is set to 1 if l ≤ i − 1, and to 0 otherwise. This allows us to state the key problem to be solved as follows:

Redefined Problem: Given a junction tree over m binary variables Y_1, . . . , Y_m, where each variable Y_j is associated with an indicator δ_{Y_j} ∈ {0, 1}, find the probability distribution of the random variable P = Σ_{l=1}^m δ_l Y_l. (We rename the variables to avoid confusion.)

If the result of the conditioning was a single junction tree (over m = n − 1 variables), we multiply the resulting probabilities by Pr(X_i = 1) to get the rank distribution of t_i.
However, if we get k > 1 junction trees, then we need one additional step. Let P_1, . . . , P_k be the random variables denoting the partial sums for the k junction trees. We need to combine the probability distributions over these partial sums, Pr(P_1), . . . , Pr(P_k), into a single probability distribution over P_1 + · · · + P_k. This can be done by repeatedly applying the following general formula:

Pr(P_1 + P_2 = a) = Σ_{j=0}^{a} Pr(P_1 = j) Pr(P_2 = a − j)

[Figure 14: Conditioning on X_4 = 1 results in two junction trees.]

A naive implementation of the above takes O(n^2) time. Although this can be improved using the ideas presented in Appendix B, the complexity of computing the Pr(P_i) themselves is much higher and dominates the overall complexity. Next, we present algorithms for solving the redefined problem.

9.3 Algorithm for Markov Sequences

We first describe an algorithm for Markov chains, a special yet important case of graphical models. Markov chains appear naturally in many settings, and have been studied in the probabilistic database literature as well [35, 37, 47]. Any finite-length Markov chain is a Markov network whose underlying graph is simply a path: each variable directly depends only on its predecessor and successor. The junction tree for a Markov chain is also a path, in which each node corresponds to an edge of the Markov chain. The treewidth of such a junction tree is one. Without loss of generality, we assume that the Markov chain is Y_1, . . . , Y_m (Figure 15(i)). The corresponding junction tree T is a path with cliques C_j = {Y_j, Y_{j+1}}, as shown in the figure. We compute the distribution Pr(Σ_{l=1}^m Y_l δ_l) recursively. Let P_j = Σ_{l=1}^j Y_l δ_l denote the partial sum over the first j variables Y_1, . . . , Y_j.
At the clique {Y_{j−1}, Y_j}, j ≥ 2, we recursively compute the joint probability distribution Pr(Y_j, P_{j−1}). The initial distribution Pr(Y_2, P_1), with P_1 = δ_1 Y_1, is computed directly:

Pr(Y_2, P_1 = 0) = Pr(Y_2, Y_1 = 0) + (1 − δ_1) Pr(Y_2, Y_1 = 1)
Pr(Y_2, P_1 = 1) = δ_1 Pr(Y_2, Y_1 = 1)

Given Pr(Y_j, P_{j−1}), we compute Pr(Y_{j+1}, P_j) as follows. Observe that P_{j−1} and Y_{j+1} are conditionally independent given the value of Y_j (by the Markov property). Thus we have:

Pr(Y_{j+1}, Y_j, P_{j−1}) = Pr(Y_{j+1}, Y_j) Pr(Y_j, P_{j−1}) / Pr(Y_j)

Using Pr(Y_{j+1}, Y_j, P_{j−1}), we can compute:

Pr(Y_{j+1}, P_j = a) = Pr(Y_{j+1}, Y_j = 0, P_{j−1} = a) + Pr(Y_{j+1}, Y_j = 1, P_{j−1} = a − δ_j)

At the end, we have the joint distribution Pr(Y_m, P_{m−1}), from which we compute the distribution over P_m as:

Pr(P_m = a) = Pr(Y_m = 0, P_{m−1} = a) + Pr(Y_m = 1, P_{m−1} = a − δ_m)

Complexity: The complexity of the above algorithm for computing Pr(P_m) is O(m^2): although we perform only m steps, Pr(Y_{j+1}, P_j) contains 2(j + 1) terms, each of which takes O(1) time to compute. Since we have to repeat this for every tuple, the overall complexity of ranking the dataset can be seen to be O(n^3).

[Figure 15: (i) A Markov chain Y_1, . . . , Y_m and the corresponding junction tree, a path of cliques {Y_j, Y_{j+1}}; (ii) Illustrating the recursion for general junction trees: a clique C with parent separator S and child separators S_1, . . . , S_k, with the distributions Pr(S_i, P_{S_i}) computed recursively over the subtrees T_{S_i}.]

9.4 General Junction Trees

We follow the same general idea for general junction trees. Let T denote the junction tree over the variables Y = {Y_1, . . . , Y_m}. We begin by rooting T at an arbitrary clique, and recurse down the tree. For a separator S, let T_S denote the subtree rooted at S.
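Before moving to general junction trees, the Markov-chain recursion of Section 9.3 can be sketched in a few lines. The interface below (calibrated pairwise joints as dicts) is our own illustration, not the paper's implementation:

```python
def partial_sum_distribution(pair_joints, delta):
    """Distribution of P_m = sum_l delta_l * Y_l over a Markov chain Y_1..Y_m, m >= 2.
    pair_joints[l-1] is a dict (y_l, y_{l+1}) -> Pr(Y_l, Y_{l+1}), assumed calibrated;
    delta is a list of m indicator bits."""
    m = len(delta)
    # state[(y, a)] = Pr(Y_{l+1} = y, P_l = a); initialize at l = 1.
    state = {}
    for (y1, y2), p in pair_joints[0].items():
        key = (y2, delta[0] * y1)
        state[key] = state.get(key, 0.0) + p
    for l in range(2, m):                        # advance from Y_l to Y_{l+1}
        joint = pair_joints[l - 1]               # Pr(Y_l, Y_{l+1})
        pr_y = {0: 0.0, 1: 0.0}                  # marginal Pr(Y_l)
        for (y, _), q in joint.items():
            pr_y[y] += q
        new = {}
        for (y, a), p in state.items():
            if pr_y[y] == 0.0:
                continue
            for y_next in (0, 1):
                q = joint.get((y, y_next), 0.0)
                if q == 0.0:
                    continue
                # Pr(Y_{l+1}, Y_l, P_{l-1}) = Pr(Y_{l+1}, Y_l) Pr(Y_l, P_{l-1}) / Pr(Y_l)
                key = (y_next, a + delta[l - 1] * y)
                new[key] = new.get(key, 0.0) + p * q / pr_y[y]
        state = new
    dist = {}
    for (y, a), p in state.items():              # finally fold in Y_m itself
        key = a + delta[m - 1] * y
        dist[key] = dist.get(key, 0.0) + p
    return dist

# Three independent fair coins, written as a (degenerate) Markov chain:
uniform = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
dist = partial_sum_distribution([uniform, uniform], [1, 1, 1])
# dist -> {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}, the Binomial(3, 1/2) pmf
```

Each step touches O(j) entries of the state, matching the O(m^2) bound above.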
Denote by P_S the partial sum over the variables in the subtree T_S that are not present in S, i.e.:

P_S = Σ_{j : Y_j ∈ T_S, Y_j ∉ S} δ_j Y_j

Consider a clique node C, and let S denote the separator between C and its parent node (S = ∅ for the root clique node). We will recursively compute the joint probability distribution Pr(S, P_S) for each such separator S. Since the root clique node has no parent, at the end we are left with precisely the probability distribution that we need, i.e., Pr(Σ_{j=1}^m δ_j Y_j).

C is an interior or root node: Let the separators to the children of C be S_1, . . . , S_k (see Figure 15(ii)). We recursively compute Pr(S_i, P_{S_i}), i = 1, . . . , k. Let Z = C \ S. We observe that Z is precisely the set of variables that contribute to the partial sum P_S but do not contribute to any of the partial sums P_{S_1}, . . . , P_{S_k}, i.e.:

P_S = P_{S_1} + · · · + P_{S_k} + Σ_{Z_i ∈ Z} δ_{Z_i} Z_i

We begin by computing Pr(C, P_{S_1} + · · · + P_{S_k}). Observe that the variable set C \ S_1 is independent of P_{S_1} given the values of the variables in S_1 (by the Markov property). Note that it was critical that the variables in S_1 not contribute to the partial sum P_{S_1}; otherwise this independence would not hold. Given that, we have:

Pr(C, P_{S_1}) = Pr(C \ S_1, S_1, P_{S_1}) = Pr(C \ S_1, S_1) Pr(S_1, P_{S_1}) / Pr(S_1)

Using the fact that P_{S_2} is independent of C ∪ {P_{S_1}} given S_2, we get:

Pr(C, P_{S_1}, P_{S_2}) = Pr(C, P_{S_1}) Pr(S_2, P_{S_2}) / Pr(S_2)

Now we can compute the probability distribution Pr(C, P_{S_1} + P_{S_2}) as follows:

Pr(C, P_{S_1} + P_{S_2} = a) = Σ_{j=0}^{a} Pr(C, P_{S_1} = j, P_{S_2} = a − j) = Σ_{j=0}^{a} Pr(C, P_{S_1} = j) Pr(S_2, P_{S_2} = a − j) / Pr(S_2)

By repeating this process for S_3 to S_k, we get the probability distribution Pr(C, P_{S_1} + · · · + P_{S_k}). Next, we need to add in the contributions of the variables in Z to the partial sum P_{S_1} + · · · + P_{S_k}. Let Z contain l variables Z_1,
. . . , Z_l, and let δ_{Z_1}, . . . , δ_{Z_l} denote the corresponding indicators. It is easy to see that:

Pr(C \ Z, Z_1 = v_1, . . . , Z_l = v_l, Σ_{j=1}^{k} P_{S_j} + Σ_{j=1}^{l} δ_{Z_j} Z_j = a) = Pr(C \ Z, Z_1 = v_1, . . . , Z_l = v_l, Σ_{j=1}^{k} P_{S_j} = a − Σ_{j=1}^{l} δ_{Z_j} v_j)

where v_j ∈ {0, 1}. Although this looks complex, we only need to touch every entry of the probability distribution Pr(C, P_{S_1} + · · · + P_{S_k}) once to compute Pr(C, P_S). All that remains is marginalizing that distribution to sum out the variables in C \ S, giving us Pr(S, P_S).

C is a leaf node (i.e., k = 0): This is similar to the final step above. Let Z = C \ S denote the variables that contribute to the partial sum P_S. We apply the same procedure as above to compute Pr(C, P_S = Σ_{Z_i ∈ Z} δ_{Z_i} Z_i), which we marginalize to obtain Pr(S, P_S).

Overall Complexity: The complexity of the above algorithm for a specific clique C is dominated by the cost of computing the different probability distributions of the form Pr(C, P), where P is a partial sum. We have to compute O(n) such probability distributions in total, and each of those computations takes O(n^2 · 2^{|C|}) time. Since there are at most n cliques, and since we have to repeat this process for every tuple, the overall complexity of ranking the dataset can be seen to be O(n^4 · 2^{tw}), where tw denotes the treewidth of the junction tree, i.e., the size of the maximum clique minus 1.

10 Conclusions

In this article we presented a unified framework for ranking over probabilistic databases, along with several novel and highly efficient algorithms for answering top-k queries. Considering the complex interplay between probabilities and scores, instead of proposing a specific ranking function, we propose using two parameterized ranking functions, called PRF^ω and PRF^e, which allow the user to control the tuples that appear in the top-k answers.
We developed novel algorithms for evaluating these ranking functions over large, possibly correlated, probabilistic datasets. We also developed an approach for approximating a ranking function using a linear combination of PRF^e functions, thus enabling highly efficient, albeit approximate, computation, and an approach for learning a ranking function from user preferences. Our work opens up many avenues for further research. There may be other non-trivial subclasses of PRF functions, aside from PRF^e, that can be computed efficiently. Understanding the behavior of various ranking functions and their relationships across probabilistic databases with diverse uncertainties and correlation structures also remains an important open problem in this area. Finally, the issues of ranking have been studied for many years in disciplines ranging from economics to information retrieval; better understanding the connections between that work and ranking in probabilistic databases remains a fruitful direction for further research.

References

[1] E. Adar and C. Re. Managing uncertainty in social networks. IEEE Data Eng. Bull., 2007.
[2] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases. In ICDE, 2006.
[3] Y. Azar and I. Gamzu. Ranking with submodular valuations. Arxiv preprint, 2010.
[4] Y. Azar, I. Gamzu, and X. Yin. Multiple intents re-ranking. In STOC, pages 669–678, 2009.
[5] N. Bansal, K. Jain, A. Kazeykina, and J. Naor. Approximation algorithms for diversified search ranking. In ICALP, pages 273–284, 2010.
[6] G. Beskales, M. Soliman, and I. Ilyas. Efficient search for the top-k probable nearest neighbors in uncertain databases. In VLDB, 2008.
[7] G. Beylkin and L. Monzon. On approximation of functions by exponential sums. Applied and Computational Harmonic Analysis, 19:17–48, 2005.
[8] A. Bjorck and V. Pereyra. Solution of Vandermonde systems of equations. Mathematics of Computation, 24(112):893–903, 1970.
[9] C.
Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In ICML, pages 89–96, 2005.
[10] R. Cheng, J. Chen, M. Mokbel, and C. Chow. Probabilistic verifiers: Evaluating constrained nearest-neighbor queries over uncertain data. In ICDE, 2008.
[11] R. Cheng, L. Chen, J. Chen, and X. Xie. Evaluating probability threshold k-nearest-neighbor queries over uncertain data. In EDBT, 2009.
[12] R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD, 2003.
[13] G. Cormode, F. Li, and K. Yi. Semantics of ranking queries for probabilistic data and expected ranks. In ICDE, 2009.
[14] N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.
[15] N. Dalvi and D. Suciu. Management of probabilistic data: Foundations and challenges. In PODS, 2007.
[16] O. Dekel, C. Manning, and Y. Singer. Log-linear models for label-ranking. In NIPS 16, 2004.
[17] A. Deshpande, C. Guestrin, and S. Madden. Using probabilistic models for data management in acquisitional environments. In CIDR, 2005.
[18] X. L. Dong, A. Halevy, and C. Yu. Data integration with uncertainty. In VLDB, 2007.
[19] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In WWW, 2001.
[20] R. Fagin, R. Kumar, and D. Sivakumar. Comparing top-k lists. In SODA, 2003.
[21] N. Fuhr and T. Rolleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. on Info. Syst., 1997.
[22] T. Ge, S. Zdonik, and S. Madden. Top-k queries on uncertain data: On score distribution and typical answers. In SIGMOD, pages 375–388, 2009.
[23] T. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31–40, 2007.
[24] T. Green and V. Tannen. Models for incomplete and probabilistic information. In EDBT, 2006.
[25] R. Gupta and S.
Sarawagi. Creating probabilistic databases from information extraction models. In VLDB, 2006.
[26] J. F. Hauer, C. J. Demeure, and L. L. Scharf. Initial results in Prony analysis of power system response signals. IEEE Transactions on Power Systems, 5(1):80–89, 1990.
[27] R. Herbrich, T. Graepel, P. Bollmann-Sdorra, and K. Obermayer. Learning preference relations for information retrieval. In ICML-98 Workshop: Text Categorization and Machine Learning, pages 80–84, 1998.
[28] M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking queries on uncertain data: A probabilistic threshold approach. In SIGMOD, 2008.
[29] I. Ilyas, G. Beskales, and M. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys, 2008.
[30] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4), 2002.
[31] T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan, and H. Zhu. Avatar information extraction system. IEEE Data Eng. Bull., 29(1), 2006.
[32] F. Jensen and F. Jensen. Optimal junction trees. In UAI, pages 360–366, 1994.
[33] C. Jin, K. Yi, L. Chen, J. Xu Yu, and X. Lin. Sliding-window top-k queries on uncertain streams. In VLDB, 2008.
[34] T. Joachims. Optimizing search engines using clickthrough data. In Proc. SIGKDD, pages 133–142, 2002.
[35] B. Kanagal and A. Deshpande. Efficient query evaluation over temporally correlated probabilistic streams. In ICDE, 2009.
[36] B. Kanagal and A. Deshpande. Indexing correlated probabilistic databases. In SIGMOD, 2009.
[37] B. Kimelfeld and C. Ré. Transducing Markov sequences. In PODS, pages 15–26, 2010.
[38] C. Koch. MayBMS: A System for Managing Large Uncertain and Probabilistic Databases. Managing and Mining Uncertain Data, Charu Aggarwal ed., 2009.
[39] C. Koch and D. Olteanu. Conditioning probabilistic databases. PVLDB, 1(1):313–325, 2008.
[40] H. P. Kriegel, P. Kunath, and M. Renz.
Probabilistic nearest-neighbor query on uncertain objects. In DASFAA, 2007.
[41] L. Lakshmanan, N. Leone, R. Ross, and V. S. Subrahmanian. ProbView: a flexible probabilistic database system. TODS, 1997.
[42] J. Li and A. Deshpande. Consensus answers for queries over probabilistic databases. In PODS, 2009.
[43] J. Li and A. Deshpande. Ranking continuous probabilistic datasets. In VLDB, 2010.
[44] T. Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331, 2009.
[45] X. Liu, M. Ye, J. Xu, Y. Tian, and W. Lee. k-selection query over uncertain data. In DASFAA (1), pages 444–459, 2010.
[46] C. Ré, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probabilistic data. In ICDE, 2007.
[47] C. Ré, J. Letchner, M. Balazinska, and D. Suciu. Event queries on correlated probabilistic streams. In SIGMOD Conference, 2008.
[48] A. Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In ICDE, 2006.
[49] P. Sen, A. Deshpande, and L. Getoor. PrDB: managing and exploiting rich correlations in probabilistic databases. VLDB J., 18(5):1065–1090, 2009.
[50] M. Soliman, I. Ilyas, and K. C. Chang. Top-k query processing in uncertain databases. In ICDE, 2007.
[51] M. Soliman and I. Ilyas. Ranking with uncertain scores. In ICDE, pages 317–328, 2009.
[52] P. Talukdar, M. Jacob, M. Mehmood, K. Crammer, Z. Ives, F. Pereira, and S. Guha. Learning to create data-integrating queries. PVLDB, 1(1):785–796, 2008.
[53] J. Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, 2005.
[54] K. Yi, F. Li, D. Srivastava, and G. Kollios. Efficient processing of top-k queries in uncertain databases. In ICDE, 2008.
[55] X. Zhang and J. Chomicki. On the semantics and evaluation of top-k queries in probabilistic databases. In DBRank, 2008.
[56] O. Zuk, L. Ein-Dor, and E. Domany. Ranking under uncertainty.
In UAI, pages 466–473, 2007.

A Proofs

Theorem 1 The coefficient of the term ∏_j x_j^{i_j} in F(X) is the total probability of the possible worlds in which, for all j, there are exactly i_j leaves associated with variable x_j.

Proof: Suppose T is rooted at r, r_1, . . . , r_h are r's children, and T_l is the subtree rooted at r_l. We denote by S (resp. S_l) the random set of leaves generated according to the model T (resp. T_l). We let F (resp. F_l) be the generating function corresponding to T (resp. T_l). For ease of notation, we use i to denote the index vector ⟨i_1, i_2, . . .⟩, I to denote the set of all such vectors, and X^i to denote ∏_j x_j^{i_j}. Therefore, we can write F(X) = Σ_{i_1, i_2, . . .} c_{i_1, i_2, . . .} x_1^{i_1} x_2^{i_2} · · · = Σ_{i ∈ I} c_i X^i. We use the notation S ≅ i, for some i = ⟨i_1, i_2, . . .⟩ ∈ I, to denote the event that S contains i_j leaves associated with variable x_j for all j. Given this notation, we need to show that c_i = Pr(S ≅ i). We prove this by induction on the height of the and/xor tree, and consider two cases.

If r is a ∧ node, we know from Definition 2 that S = ∪_{l=1}^h S_l. First, it is not hard to see that, given S_l ≅ i_l for 1 ≤ l ≤ h, the event S ≅ i happens if and only if Σ_l i_l = i. Therefore,

Pr(S ≅ i) = Σ_{Σ_l i_l = i} ∏_{l=1}^h Pr(S_l ≅ i_l).   (7)

Assume F_l can be written as Σ_{i_l ∈ I} c_{l, i_l} X^{i_l}. From the construction of the generating function, we know that

F(X) = ∏_{l=1}^h F_l = ∏_{l=1}^h Σ_{i_l ∈ I} c_{l, i_l} X^{i_l} = Σ_{i ∈ I} ( Σ_{Σ_l i_l = i} ∏_{l=1}^h c_{l, i_l} ) X^i.   (8)

By the induction hypothesis, we have Pr(S_l ≅ i_l) = c_{l, i_l} for any l and i_l. Therefore, we can conclude from (7) and (8) that F(X) = Σ_i Pr(S ≅ i) X^i. Now let us consider the other case, where r is a ∨ node.
From Definition 2, it is not hard to see that

Pr(S ≅ i) = Σ_{l=1}^h Pr(S_l ≅ i) p_{(r, r_l)}.   (9)

Moreover, we have

F(X) = Σ_{l=1}^h p_{(r, r_l)} F_l(X) = Σ_{l=1}^h p_{(r, r_l)} Σ_{i_l} c_{l, i_l} X^{i_l} = Σ_i ( Σ_{l=1}^h p_{(r, r_l)} c_{l, i} ) X^i = Σ_i Pr(S ≅ i) X^i,

where the last equality follows from (9) and the induction hypothesis. This completes the proof.

B Expanding Polynomials

This section is devoted to several algorithms for expanding polynomials into their standard forms.

B.1 Multiplication of a Set of Polynomials

Given a set of polynomials of the form P_i = Σ_{j ≥ 0} c_{ij} x^j for 1 ≤ i ≤ k, we want to compute the product P = ∏_{i=1}^k P_i written in the standard form P = Σ_{j ≥ 0} c_j x^j, i.e., we need to compute the coefficients c_j. Let d(P_i) be the degree of the polynomial P_i, and let n = Σ_{i=1}^k d(P_i) be the degree of P.

Naive Method: First we note that the naive method (multiplying the P_i one by one) gives us an O(n^2) time algorithm, by a simple counting argument. Let P̄_i = ∏_{j=1}^i P_j. It is easy to see that d(P̄_i) = Σ_{j=1}^i d(P_j), so the time to multiply P̄_i and P_{i+1} is O(d(P̄_i) · d(P_{i+1})). Then the total time complexity is:

Σ_{i=1}^{k−1} O(d(P̄_i) · d(P_{i+1})) = O(n) · Σ_{i=1}^{k−1} d(P_{i+1}) = O(n^2).

Divide-and-Conquer: Now we show how to use divide-and-conquer and the FFT (Fast Fourier Transform) to achieve an O(n log^2 n) time algorithm. It is well known that the product of two polynomials of degree O(n) can be computed in O(n log n) time using the FFT. The divide-and-conquer algorithm is as follows. If there exists any P_i such that d(P_i) ≥ (1/3) d(P), we evaluate ∏_{j : j ≠ i} P_j recursively and then multiply it with P_i using the FFT. If not, we partition the P_i into two sets S_1 and S_2 such that (1/3) d(P) ≤ d(∏_{i ∈ S_j} P_i) ≤ (2/3) d(P) for j = 1, 2. Then we evaluate S_1 and S_2 separately and multiply the results together using the FFT.
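The divide-and-conquer scheme above can be sketched as follows. For brevity the sketch multiplies pairs with numpy.convolve (a quadratic-time product); swapping in an FFT-based pairwise product would give the stated O(n log^2 n) bound. The function name and structure are our own illustration:

```python
import numpy as np

def multiply_all(polys):
    """Product of polynomials given as coefficient lists (lowest degree first)."""
    if not polys:
        return np.array([1.0])                       # empty product = 1
    if len(polys) == 1:
        return np.asarray(polys[0], dtype=float)
    total = sum(len(p) - 1 for p in polys)           # degree of the product
    # If one factor dominates (degree >= total/3), split it off and recurse.
    for i, p in enumerate(polys):
        if len(p) - 1 >= total / 3:
            rest = polys[:i] + polys[i + 1:]
            return np.convolve(p, multiply_all(rest))
    # Otherwise, greedily build a balanced split S1, S2 with
    # total/3 <= deg(S1) <= 2*total/3, and recurse on both halves.
    s1, d1 = [], 0
    for i, p in enumerate(polys):
        s1.append(p)
        d1 += len(p) - 1
        if d1 >= total / 3:
            return np.convolve(multiply_all(s1), multiply_all(polys[i + 1:]))

# (1 + x)^3 computed from its three factors:
coeffs = multiply_all([[1, 1], [1, 1], [1, 1]])      # -> coefficients 1, 3, 3, 1
```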
It is easy to see that the time complexity T(n) of the algorithm on input size n satisfies

T(n) ≤ max { T((2/3) n) + O(n log n), T(n_1) + T(n_2) + O(n log n) },

where n_1 + n_2 = n and (1/3) n ≤ n_1 ≤ n_2 ≤ (2/3) n. Solving this recurrence gives T(n) = O(n log^2 n).

B.2 Expanding a Nested Formula

We consider the more general problem of expanding a nested expression of a univariate polynomial (with variable x) into its standard form Σ c_i x^i. Here a nested expression refers to a formula that involves only constants, the variable x, addition (+), multiplication (×), and parentheses; for example, f(x) = ((1 + x + x^2)(x^2 + 2x^3) + x^3(2 + 3x^4))(1 + 2x). Formally, we define an expression recursively to be either

1. a constant or the variable x, or
2. the sum of two expressions, or
3. the product of two expressions.

We assume that the degree of the polynomial and the length of the expression are both O(n). The naive method runs in time O(n^3) (each inner node needs O(n^2) time, as shown in the previous subsection). Using the above divide-and-conquer method for expanding each inner node yields O(n^2 log^2 n). We now sketch two improved algorithms with running time O(n^2). The first is conceptually simpler, while the second is much easier to implement.

Algorithm 1:
1. Choose n + 1 distinct numbers x_0, . . . , x_n.
2. Evaluate the polynomial at these points, i.e., compute f(x_i). Each evaluation takes linear time (bottom-up over the tree), so this step takes O(n^2) time in total.
3. Use any O(n^2) polynomial interpolation algorithm to find the coefficients. In fact, the interpolation reduces to solving the following linear system:

[ x_0^n  x_0^{n−1}  x_0^{n−2}  . . .  x_0  1 ]   [ c_n     ]   [ f(x_0) ]
[ x_1^n  x_1^{n−1}  x_1^{n−2}  . . .  x_1  1 ]   [ c_{n−1} ] = [ f(x_1) ]
[   ·       ·          ·       . . .   ·   · ]   [    ·    ]   [    ·   ]
[ x_n^n  x_n^{n−1}  x_n^{n−2}  . . .  x_n  1 ]   [ c_0     ]   [ f(x_n) ]
The commonly used Gaussian elimination approach for solving such a system requires O(n^3) operations. However, the matrix here is of a special type, commonly referred to as a Vandermonde matrix, and there exist numerical algorithms that can invert a Vandermonde matrix in O(n^2) time; see, e.g., [8]. A small drawback of this algorithm is that the algorithms for inverting a Vandermonde matrix are nontrivial to implement. The next algorithm does not need to invert a matrix, is much simpler to implement, and has the same O(n^2) running time.

Algorithm 2: We need some notation first. Suppose the polynomial is f(x) = Σ_{j=0}^n c_j x^j (the c_j are as yet unknown). Let e_i be the (n+1)-dimensional vector that is zero everywhere except that its i-th entry is 1, i.e., e_i = ⟨0, 0, . . . , 1, . . . , 0, 0⟩. Let d_i = ⟨1, e^{2π√−1 · i/(n+1)}, e^{2π√−1 · 2i/(n+1)}, . . .⟩ be the (n+1)-dimensional vector which is the DFT (Discrete Fourier Transform) of e_i. Let u = e^{−2π√−1/(n+1)} be an (n+1)-th root of unity, and let u = ⟨1, u, u^2, . . . , u^n⟩ and u_k = ⟨1, u^k, u^{2k}, u^{3k}, . . .⟩. By definition, e_i = 1/(n+1) Σ_k d_{ik} u_k, where d_{ik} is the k-th entry of d_i. Let c = ⟨c_0, . . . , c_n⟩ be the coefficient vector of f. It is trivial to see that c_i = c · e_i (the inner product). Therefore, we have

c_i = c · e_i = 1/(n+1) Σ_k d_{ik} (c · u_k) = 1/(n+1) Σ_k d_{ik} f(u^k).   (10)

The last equality holds by the definition of f(x). If we use f to denote the vector ⟨f(u^0), . . . , f(u^n)⟩ and D to denote the matrix {d_{ij}}_{0 ≤ i,j ≤ n}, the above equation can be written simply as c = 1/(n+1) D f. Now we are ready to describe the algorithm:
1. Compute f(u^k) for all k. This consists of evaluating f(x) at n + 1 complex points, which takes O(n^2) time.
2. Use (10) to compute the coefficients. This again takes O(n^2) time.
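Formula (10) can be sketched directly, given only an evaluation oracle for the nested expression; the function name and the example polynomial below are our own illustration:

```python
import cmath

def coefficients_via_dft(f, n):
    """Recover c_0..c_n of a degree-<= n polynomial from an evaluation oracle f,
    using formula (10): c_i = 1/(n+1) * sum_k d_ik * f(u^k)."""
    m = n + 1
    u = cmath.exp(-2j * cmath.pi / m)            # u = e^{-2*pi*sqrt(-1)/(n+1)}
    samples = [f(u ** k) for k in range(m)]      # f(u^k): n+1 evaluations
    coeffs = []
    for i in range(m):
        # d_ik = e^{2*pi*sqrt(-1)*ik/(n+1)}, the k-th entry of the DFT of e_i
        c_i = sum(cmath.exp(2j * cmath.pi * i * k / m) * samples[k]
                  for k in range(m)) / m
        coeffs.append(c_i.real)                  # real polynomial assumed
    return coeffs

# f(x) = (1 + x)(1 + 2x) = 1 + 3x + 2x^2, given as a nested expression:
f = lambda x: (1 + x) * (1 + 2 * x)
coeffs = [round(c, 6) for c in coefficients_via_dft(f, 2)]   # -> [1.0, 3.0, 2.0]
```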
In fact, the above algorithm can be seen as a specialization of the first algorithm: instead of picking arbitrary real points x_0, . . . , x_n at which to evaluate the polynomial, we pick the n + 1 complex points 1, u, u^2, . . . , u^n. The Vandermonde matrix formed by these points, i.e.,

F = [ u^{0·0}  u^{0·1}  . . .  u^{0·n} ]
    [ u^{1·0}  u^{1·1}  . . .  u^{1·n} ]
    [    ·        ·     . . .     ·    ]
    [ u^{n·0}  u^{n·1}  . . .  u^{n·n} ]

has the very nice property that F^{−1} = 1/(n+1) F*, where F* is the conjugate of F (this can be verified easily). Therefore, we obtain F^{−1} for free. In fact, it is easy to see that F* is exactly D.