Query Evaluation and Optimization in the Semantic Web

Under consideration for publication in Theory and Practice of Logic Programming

Edna Ruckhaus, Eduardo Ruiz, María-Esther Vidal
Computer Science Department, Universidad Simón Bolívar
(e-mail: {ruckhaus, eruiz, mvidal}@ldc.usb.ve)

submitted 20 June 2006; revised 3 January 2007; accepted 18 October 2007

Abstract

We address the problem of answering Web ontology queries efficiently. An ontology is formalized as a Deductive Ontology Base (DOB), a deductive database that comprises the ontology's inference axioms and facts. A cost-based query optimization technique for DOB is presented. A hybrid cost model is proposed to estimate the cost and cardinality of basic and inferred facts. Cardinality and cost of inferred facts are estimated using an adaptive sampling technique, while techniques of traditional relational cost models are used for estimating the cost of basic facts and conjunctive ontology queries. Finally, we implement a dynamic-programming optimization algorithm to identify query evaluation plans that minimize the number of intermediate inferred facts. We modeled a subset of the Web ontology language OWL Lite as a DOB, and performed an experimental study to analyze the predictive capacity of our cost model and the benefits of the query optimization technique. Our study has been conducted over synthetic and real-world OWL ontologies, and shows that the techniques are accurate and improve query performance.

To appear in Theory and Practice of Logic Programming (TPLP)

1 Introduction

Ontology systems usually provide reasoning and retrieval services that identify the basic facts that satisfy a requirement, and derive implicit knowledge using the ontology's inference axioms. In the context of the Semantic Web, the number of inferred facts can be extremely large.
On one hand, the number of basic ontology facts (domain concepts and Web source annotations) can be considerable, and on the other hand, Open World reasoning in Web ontologies may yield a large space of choices. Therefore, efficient evaluation strategies are needed in Web ontology inference engines.

In our approach, ontologies are formalized as a deductive database called a Deductive Ontology Base (DOB). The extensional database comprises all the ontology language's statements that represent the explicit ontology knowledge. The intensional database corresponds to the set of deductive rules which define the semantics of the ontology language. We provide a cost-based optimization technique for Web ontologies represented as a DOB.

Traditional query optimization techniques for deductive database systems include join-ordering strategies, and techniques that combine a bottom-up evaluation with top-down propagation of query variable bindings in the spirit of the Magic-Sets algorithm (Ramakrishnan and Ullman 1993). Join-ordering strategies may be heuristic-based or cost-based; some cost-based approaches depend on the estimation of the join selectivity; others rely on the fan-out of a literal (Staudt et al. 1999). Cost-based query optimization has been successfully used by relational database management systems; however, these optimizers are not able to estimate the cost or cardinality of data that do not exist a priori, which is the case of intensional predicates in a DOB.

We propose a hybrid cost model that combines two techniques for cardinality and cost estimation: (1) the sampling technique proposed in (Lipton and Naughton 1990; Lipton et al.
1990) is applied for the estimation of the evaluation cost and cardinality of intensional predicates, and (2) a cost model à la System R is used for the estimation of the cost and cardinality of extensional predicates and the cost of conjunctive queries.

Three evaluation strategies are considered for "joining" predicates in conjunctive queries. They are based on the Nested-Loop, Block Nested-Loop, and Hash Join operators of relational databases (Ramakrishnan and Gehrke 2003). To identify a good evaluation plan, we provide a dynamic-programming optimization algorithm that orders subgoals in a query, considering estimates of the subgoals' evaluation cost.

We modeled a subset of the Web ontology language OWL Lite (McGuinness and Harmelen 2004) as a DOB, and performed experiments to study the predictive capacity of the cost model and the benefits of the ontology query optimization techniques. The study has been conducted over synthetic and real-world OWL ontologies. Preliminary results show that the cost-model estimates are fairly accurate and that optimized queries are significantly less expensive than non-optimized ones.

Our current formalism does not represent the OWL built-in constructor ComplementOf. We stress that in practice this is not a severe limitation. For example, this operator is not used in any of the three real-world ontologies that we have studied in our experiments; and in the survey reported in (Wang 2006), only 21 ontologies out of 688 contain this constructor.

Our work differs from other systems in the Semantic Web that combine a Description Logics (DL) reasoner with a relational DBMS in order to solve the scalability problems for reasoning with individuals (Calvanese et al. 2005; Haarslev and Moller 2004; Horrocks and Turi 2005; Pan and Heflin 2003).
Clearly, all of these systems use the query optimization component embedded in the relational DBMS; however, they do not develop cost-based optimization for the implicit knowledge, that is, there is no estimation of the cost of data not known a priori.

Other systems use Logic Programming (LP) to reason on large-scale ontologies. This is the case of the projects described in (Grosof et al. 2003; Hustadt and Motik 2005; Motik et al. 2003). In Description Logic Programs (DLP) (Grosof et al. 2003), the expressive intersection between DL and LP without function symbols is defined. DL queries are reduced to LP queries and efficient LP algorithms are explored. The project described in (Hustadt and Motik 2005; Motik et al. 2003) reduces a SHIQ knowledge base to a Disjunctive Datalog program. Both projects apply Magic-Sets rewriting techniques but, to the best of our knowledge, no cost-based optimization techniques have been developed. The OWL Lite⁻ species of the OWL language proposed in (Bruijn et al. 2004) is based on the DLP project; it corresponds to the portion of the OWL Lite language that can be translated to Datalog. All of these systems develop LP reasoning with individuals, whereas in the DOB model we develop Datalog reasoning with both domain concepts and individuals.

In (Eiter et al. 2006), an efficient bottom-up evaluation strategy for HEX-programs based on the theory of splitting sets is described. In the context of the Semantic Web, these non-monotonic logic programs contain higher-order atoms and external atoms that may represent RDF and OWL knowledge. However, their approach does not include determining the best evaluation strategy according to a certain cost metric.

In the next section we describe our DOB formalism.
Following this, we describe the DOB-S System architecture. Then, we model a subset of OWL Lite as a DOB and present a motivating example. Next, we develop our hybrid cost model and query optimization algorithm. We describe our experimental study and, finally, we point out our conclusions and future work.

2 The Deductive Ontology Base (DOB)

In general, an ontology knowledge base can be defined as:

Definition 1 (Ontology Knowledge Base)
An ontology knowledge base O is a pair $O = \langle F, I \rangle$, where F is a set of ontology facts that represent the explicit ontology structure (domain) and source annotations (individuals), and I is a set of axioms that allow the inference of new ontology facts regarding both domain and individuals.

We will model O as a deductive database which we call a Deductive Ontology Base (DOB). A DOB is composed of an Extensional Ontology Base (EOB) and an Intensional Ontology Base (IOB). Formally, a DOB is defined as:

Definition 2 (DOB)
Given an ontology knowledge base $O = \langle F, I \rangle$, a DOB is a deductive database composed of a set of built-in EOB ground predicates representing F and a set of IOB built-in predicates representing I, i.e., predicates that define the semantics of the EOB built-in predicates.

The IOB predicate and DOB query definitions follow the Datalog language formalism (Abiteboul et al. 1995). Next, we provide the definitions related to query answering for DOBs.

Definition 3 (Valid Instantiation)
Given a Deductive Ontology Base O, a set of constants C in O, a set of variables V, a rule R, and an interpretation I of O that corresponds to its Minimal Perfect Model (Abiteboul et al. 1995), a valuation γ (see footnote 1) is a valid instantiation of R if and only if γ(R) evaluates to true in I.

Definition 4 (Intermediate Inferred Facts)
Given a Deductive Ontology Base O, and a query $q: Q(X) \leftarrow \exists Y\, B(X, Y)$.
A proof tree for q with respect to O is defined as follows:

• Each node in the tree is labeled by a predicate in O.
• Each leaf in the tree is labeled by a predicate in O's EOB.
• The root of the tree is labeled by Q.
• For each internal node N including the root, if N is labeled by a predicate A defined by the rule R, $A(X) \leftarrow \exists Y\, C(X, Y)$, where $C(X, Y)$ is the conjunction of the predicates $C_1, \ldots, C_n$, then, for each valid instantiation γ of R, the node N has a sub-tree whose root is $\gamma(A(X))$ and whose children are respectively labeled $\gamma(C_1), \ldots, \gamma(C_n)$.

The valuations needed to define all the valid instantiations in the proof tree correspond to the Intermediate Inferred Facts of q.

The number of intermediate inferred facts measures the evaluation cost of the query Q. Additionally, since the valid instantiations of Q in the proof tree correspond to the answers of the query, the cardinality of Q corresponds to the number of such instantiations.

Note that the sets of EOB and IOB built-in predicates of a DOB define an ontology framework, so our model is not tied to any particular ontology language. To illustrate the use of our approach we focus on OWL Lite ontologies.

3 The DOB-S System's Architecture

DOB-S is a system that allows an agent to pose efficient conjunctive queries against a set of ontologies. The system's architecture can be seen in Figure 1. A subset of a given OWL ontology is translated into a DOB using an OWL Lite to DOB translator. EOB and IOB predicates are stored as a deductive database. Next, an analyzer generates the ontology's statistics: for each EOB predicate, the analyzer computes the number of facts or valid instantiations in the DOB (cardinality), and the number of different values for each of its arguments (nKeys); for each IOB predicate, an adaptive sampling algorithm (Lipton and Naughton 1990) is applied to compute cardinality and cost estimates.
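
The EOB part of the analyzer step can be illustrated with a small sketch. It is not the DOB-S implementation (which is written in SWI-Prolog); it only shows, under assumptions, how cardinality and nKeys could be computed from ground facts. The function name `eob_statistics`, the `(name, args)` fact encoding, and the sample facts (borrowed from the cars-and-dealers example of Section 5) are illustrative choices, not part of the paper.

```python
def eob_statistics(facts):
    """Compute per-predicate statistics for ground EOB facts.

    `facts` is an iterable of (predicate_name, argument_tuple) pairs.
    Returns {(name, arity): {"cardinality": int, "nKeys": [int, ...]}}.
    The encoding of facts as Python tuples is an assumption of this sketch.
    """
    grouped = {}
    for name, args in set(facts):  # ground facts form a set, so drop duplicates
        entry = grouped.setdefault(
            (name, len(args)),
            {"count": 0, "values": [set() for _ in args]},
        )
        entry["count"] += 1
        for position, value in enumerate(args):
            entry["values"][position].add(value)
    return {
        key: {
            "cardinality": entry["count"],                    # number of facts
            "nKeys": [len(vals) for vals in entry["values"]],  # distinct values per argument
        }
        for key, entry in grouped.items()
    }


# Hypothetical input: a few facts from the example EOB of Section 5.
EXAMPLE_EOB = [
    ("isDProperty", ("model", "vehicle")),
    ("isDProperty", ("price", "vehicle")),
    ("isDProperty", ("traction", "suv")),
    ("subClassOf", ("car", "vehicle")),
    ("subClassOf", ("suv", "vehicle")),
]

stats = eob_statistics(EXAMPLE_EOB)
print(stats[("isDProperty", 2)])
```

For these facts, isDProperty has three facts, three distinct property names and two distinct domain classes; statistics of this kind are the input that a System R style cost model needs for extensional predicates.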
When an agent formulates a conjunctive query, the DOB-S system's optimizer generates an efficient query evaluation plan. A dynamic-programming optimizer is based on a hybrid cost model: it uses the ontology's EOB and IOB statistics, and estimates the cost of a query according to the different evaluation strategies implemented. Finally, an execution engine evaluates the query plan and produces a query answer.

Footnote 1: Given a set of variables V and a set of constants C, a mapping or valuation γ is a function γ: V → C.

[Fig. 1. DOB-S System Architecture. Components: OWL Lite to DOB Translator; OWL Lite DOB (EOB and IOB predicates); Analyzer (EOB: System R, IOB: adaptive sampling); Ontology Statistics (EOB: cardinality, nKeys; IOB: cardinality, cost); Optimizer (dynamic programming, hybrid cost model); Execution Engine (nested-loop, block nested-loop, and hash join).]

4 OWL Lite DOB

An OWL Lite ontology contains: (1) a set of axioms that provides information about classes and properties, and (2) a set of facts that represents individuals in the ontology, the classes they belong to, and the properties they participate in. Restrictions allow the construction of class definitions by restricting the values of their properties and their cardinality. Classes may also be defined through the intersection of other classes. Object properties represent binary relationships between individuals; datatype properties correspond to relationships between individuals and data values belonging to primitive datatypes.

The subset of OWL Lite represented as a DOB does not include domain and range class intersection.
Also, primitive datatypes are not handled; therefore, we do not represent ranges for Datatype properties (see footnote 2).

Footnote 2: EquivalentClasses, EquivalentProperties, and allDifferent axioms, and the cardinality restriction are not represented because they are syntactic sugar for other language constructs.

4.1 OWL Lite DOB Syntax

Our formalism, DOB, provides a set of EOB built-in predicates that represents all the axioms and restrictions of an OWL Lite subset. EOB predicates are ground, i.e., no variables are allowed as arguments. A set of IOB built-in predicates represents the semantics of the EOB predicates. We have followed the OWL Web Ontology Language Overview presented in (McGuinness and Harmelen 2004). Table 1 illustrates the EOB and IOB built-in predicates for an OWL Lite subset (see footnote 3). Note that some predicates refer to domain concepts (e.g. isClass, areClasses), and some to instance concepts (e.g. isIndividual, areIndividuals).

Footnote 3: We assume that the class owl:Thing is the default value for the domain and range of a property.

Table 1. Some built-in EOB and IOB Predicates for a subset of OWL Lite

EOB PREDICATE               DESCRIPTION
isOntology(O)               An ontology has a URI O
isImpOntology(O1,O2)        Ontology O1 imports ontology O2
isClass(C,O)                C is a class in ontology O
isOProperty(P,D,R)          P is an object property with domain D and range R
isDProperty(P,D)            P is a datatype property with domain D
isTransitive(P)             P is a transitive property
subClassOf(C1,C2)           C1 is a subclass of C2
allValuesFrom(C,P,D)        C has property P with all values in D
isIndividual(I,C)           I is an individual belonging to class C
isStatement(I,P,J)          I is an individual that has property P with value J

IOB PREDICATE               DESCRIPTION
areSubClasses(C1,C2)        C1 are the direct and indirect subclasses of C2
areImpOntologies(O1,O2)     O1 import the ontologies O2 directly and indirectly
areClasses(C,O)             C are all the classes of an ontology and its imported ontologies O
areIndividuals(I,C)         I are the individuals of a class and all of its direct and indirect
                            superclasses C; or I are the individuals that participate in a
                            property and belong to its domain or range C, or are values of a
                            property with all values in C

Table 2. Mapping OWL Lite subset to EOB Predicates

OWL ABSTRACT SYNTAX                                   EOB PREDICATES
Ontology(O)                                           isOntology(O)
Individual(O1 value(owl:imports O2))                  impOntology(O1,O2)
Ontology(O), Class(C partial Thing)                   isClass(C,O)
Class(A partial C)                                    subClassOf(A,C)
Class(C1 partial restriction(P allValuesFrom(C2)))    allValuesFrom(C1,P,C2)
Class(A partial C1 ... Cn)                            subClassOf(A,C1),...,subClassOf(A,Cn)
ObjectProperty(P domain(D)),                          isOProperty(P,D,R)
ObjectProperty(P range(R))
DatatypeProperty(P domain(D))                         isDProperty(P,D)
Property(P Transitive)                                isTransitive(P)
Individual(I type(C))                                 isIndividual(I,C)
Individual(I value(P J))                              isStatement(I,P,J)

Table 3. Mapping OWL Lite subset Inference Rules to IOB Predicates

OWL LITE INFERENCE RULES                              IOB RULE DEFINITIONS
If subClassOf(C1,C2) and subClassOf(C2,C3)            areSubClasses(C1,C2):-subClassOf(C1,C2).
then subClassOf(C1,C3)                                areSubClasses(C1,C2):-subClassOf(C1,C3),
                                                          areSubClasses(C3,C2).
If impOntology(O1,O2) and impOntology(O2,O3)          areImpOntologies(O1,O2):-impOntology(O1,O2).
then impOntology(O1,O3)                               areImpOntologies(O1,O2):-impOntology(O1,O3),
                                                          areImpOntologies(O3,O2).
If isClass(C1,O2) and impOntology(O1,O2)              areClasses(C,O):-isClass(C,O).
then isClass(C1,O1)                                   areClasses(C,O1):-isClass(C,O2),
                                                          areImpOntologies(O1,O2).
If subClassOf(C1,C2) and isIndividual(I,C1)           areIndividuals(I,C):-isIndividual(I,C).
then isIndividual(I,C2)                               areIndividuals(I,C2):-isIndividual(I,C1),
                                                          areSubClasses(C1,C2).
If isStatement(I,P,J) and isOProperty(P,C,R)          areIndividuals(I,C):-isOProperty(P,C,R),
then isIndividual(I,C)                                    areStatements(I,P,J).
If isStatement(I,P,J) and isOProperty(P,D,C)          areIndividuals(J,C):-isOProperty(P,D,C),
then isIndividual(J,C)                                    areStatements(I,P,J).
If isStatement(I,P,J) and isDProperty(P,C)            areIndividuals(I,C):-isDProperty(P,C),
then isIndividual(I,C)                                    areStatements(I,P,J).
If allValuesFrom(C1,P,C) and isStatement(I,P,J)       areIndividuals(J,C):-isIndividual(I,C1),
and isIndividual(I,C1) then isIndividual(J,C)             allValuesFrom(C1,P,C),
                                                          areStatements(I,P,J).

4.2 OWL Lite DOB Semantics

A model-theoretic semantics for an OWL Lite (subset) DOB is as follows:

Definition 5 (Interpretation)
An Interpretation $I = (\Delta^{I}, P^{I}, \cdot^{I})$ consists of:

• A non-empty interpretation domain $\Delta^{I}$ corresponding to the union of the sets of valid URIs of ontologies, classes, object and datatype properties, and individuals. These sets are pairwise disjoint.
• A set of interpretations $P^{I}$ of the EOB and IOB built-in predicates in Table 1.
• An interpretation function $\cdot^{I}$ which maps each n-ary built-in predicate $p^{I} \in P^{I}$ to an n-ary relation over $\prod_{i=1}^{n} \Delta^{I}$.

Definition 6 (Satisfiability)
Given an OWL Lite DOB D, an interpretation I, and a predicate $p \in D$, $I \models p$ iff:

• p is an EOB predicate $p(t_1, \ldots, t_n)$ and $(t_1, \ldots, t_n) \in p^{I}$.
• p is an IOB predicate $R: H(X) \leftarrow \exists Y\, B(X, Y)$, and whenever I satisfies each predicate in the body B, I also satisfies the predicate in the head H.

Definition 7 (Model)
Given an OWL Lite DOB D and an interpretation I, I is a model of D iff for every predicate $p \in D$, $I \models p$.

4.3 Translation of OWL Lite to OWL Lite DOB

A definition of a translation map from OWL Lite to OWL Lite DOB is the following:

Definition 8 (Translation)
Given an OWL Lite theory O and an OWL Lite DOB theory D, an OWL Lite to DOB Translation T is a function $T: O \rightarrow D$.
Given an OWL Lite ontology O, an OWL Lite DOB ontology D is defined as follows:

• (Base Case) If o is an axiom or fact belonging to the sets of axioms or facts of O, then an EOB predicate T(o) is defined according to the EOB mappings in Table 2.
• If o is an OWL Lite inference rule, then an IOB predicate T(o) is defined according to the IOB mappings in Table 3.

The translation ensures that the following theorem holds:

Theorem 1
Let O and D be OWL Lite and OWL Lite DOB theories respectively, and T be an OWL Lite to DOB Translation such that $T(O) = D$; then $D \models O$.

5 A Motivating Example

Consider a 'cars and dealers' domain ontology carsOnt and Web source ontologies source1 and source2. Source source1 publishes information about all types of vehicles and dealers, whereas source2 is specialized in SUVs.

The OWL Lite ontologies can be seen in Table 4. A portion of the example's EOB can be seen in Table 5.

Table 4. Example OWL Lite ontology

Ontology carsOnt
    Class (vehicle partial Thing)
    Class (suv partial vehicle)
    Class (car partial vehicle)
    Class (dealer partial Thing)
    DataProperty(price domain(vehicle))
    ObjectProperty(sells domain(dealer))
    ObjectProperty(sells range(vehicle))
    DataProperty(traction domain(suv))
    DataProperty(model domain(vehicle))
Ontology source1
    imports carsOnt
Ontology source2
    imports carsOnt
Individual declared in a Web source ontology
    individual(s123 type(suv))

Table 5. Example DOB ontology

EOB PREDICATES
isOntology(carsOnt)                 isOntology(source1)              isOntology(source2)
impOntology(source1,carsOnt)        impOntology(source2,carsOnt)
isClass(vehicle,carsOnt)            isClass(vehicle,carsOnt)         isClass(dealer,carsOnt)
subClassOf(car,vehicle)             subClassOf(suv,vehicle)
isOProperty(sells,dealer,vehicle)
isDProperty(model,vehicle)          isDProperty(price,vehicle)       isDProperty(traction,suv)
isIndividual(s123,suv)

To illustrate a rule evaluation, we will take a query q that asks for the Web sources that publish information about 'traction':

q(O):-areClasses(C,O),isDProperty(traction,C).

The answer to this query corresponds to all the ontologies with classes characterized by the property traction, i.e., ontologies source1, source2 and carsOnt.

If we invert the ordering of the two predicates in the body of q, we obtain an equivalent query q':

q'(O):-isDProperty(traction,C),areClasses(C,O).

The cost, or total number of inferred facts, for q is larger than the cost for q'. In q, the number of instantiations or cardinality for the first intensional predicate areClasses(C,O) is twelve, four for each ontology, as source1 and source2 inherit the classes in carsOnt. The cost of inferring these facts depends on the cost of evaluating the areClasses rule.

In q', for the first subgoal isDProperty(traction,C), we have one instantiation: isDProperty(traction,suv). Again, the cost of inferring this fact depends on the cost of the isDProperty predicate.

Note that statistics on the size and argument values of the EOB isDProperty predicate can be computed, whereas statistics for the IOB areClasses predicate will have to be estimated as data is not known a priori. Once the cost of each query predicate is determined, we may apply a cost-based join-ordering optimization strategy.

6 DOB Hybrid Cost Model

The process of answering a query relies on inferring facts from the predicates in the DOB. Our cost metric is focused on the number of intermediate facts that need to be inferred in order to answer the query.
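
The effect of subgoal ordering in the motivating example of Section 5 can be reproduced with a short sketch. Several things here are assumptions rather than the paper's method: the rules for areImpOntologies and areClasses are materialized bottom-up for simplicity (the paper evaluates Datalog top-down), the isClass facts for suv and car are taken from the class axioms of Table 4, and only the number of instantiations of the first subgoal is counted, which is a simplified proxy for the intermediate-inferred-facts metric.

```python
# EOB facts of the cars-and-dealers example. The four isClass facts are
# derived from the Class axioms of Table 4 (an assumption of this sketch).
IMP_ONTOLOGY = {("source1", "carsOnt"), ("source2", "carsOnt")}
IS_CLASS = {("vehicle", "carsOnt"), ("suv", "carsOnt"),
            ("car", "carsOnt"), ("dealer", "carsOnt")}
IS_DPROPERTY = {("model", "vehicle"), ("price", "vehicle"), ("traction", "suv")}


def are_imp_ontologies():
    """areImpOntologies(O1,O2): transitive closure of impOntology (Table 3)."""
    closure = set(IMP_ONTOLOGY)
    changed = True
    while changed:
        changed = False
        for (o1, o3) in IMP_ONTOLOGY:
            for (o3b, o2) in list(closure):
                if o3 == o3b and (o1, o2) not in closure:
                    closure.add((o1, o2))
                    changed = True
    return closure


def are_classes():
    """areClasses(C,O): classes of O plus classes of the ontologies O imports."""
    imports = are_imp_ontologies()
    inherited = {(c, o1) for (c, o2) in IS_CLASS
                 for (o1, o2b) in imports if o2 == o2b}
    return set(IS_CLASS) | inherited


def evaluate_q():
    """q(O) :- areClasses(C,O), isDProperty(traction,C)."""
    first = are_classes()  # instantiations of the first subgoal
    answers = {o for (c, o) in first if ("traction", c) in IS_DPROPERTY}
    return len(first), answers


def evaluate_q_prime():
    """q'(O) :- isDProperty(traction,C), areClasses(C,O)."""
    first = {c for (p, c) in IS_DPROPERTY if p == "traction"}
    answers = {o for (c, o) in are_classes() if c in first}
    return len(first), answers


print(evaluate_q())        # first-subgoal instantiations and answers for q
print(evaluate_q_prime())  # same answers, far fewer first-subgoal instantiations
```

Both orderings return the same three ontologies, but the first subgoal of q produces twelve instantiations while the first subgoal of q' produces one, which matches the counts discussed in the example.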
The objective is to find an order of the predicates in the body of the query, such that the number of intermediate inferred facts is reduced. We will apply a join-ordering optimization strategy à la System R using Datalog-relational equivalences (Abiteboul et al. 1995). To estimate the cardinality and evaluation cost of the intensional predicates, we have applied an adaptive sampling technique. Thus, we propose a hybrid cost model which combines adaptive sampling and traditional relational cost models.

6.1 Adaptive Sampling Technique

We have developed a sampling technique that is based on the adaptive sampling method proposed by Lipton, Naughton, and Schneider (Lipton and Naughton 1990; Lipton et al. 1990). This technique assumes that there is a population $\mathcal{P}$ of all the different valid instantiations of a predicate P, and that $\mathcal{P}$ is divided into n partitions according to the n possible instantiations of one or more arguments of P. Each element in $\mathcal{P}$ is related to its evaluation cost and cardinality, and the population $\mathcal{P}$ is characterized by the statistics mean and variance.

The objective of the sampling is to identify a sample of the population $\mathcal{P}$, called EP, such that the mean and variance of the cardinality (resp. evaluation cost) of EP are valid to within a predetermined accuracy and confidence level.

To estimate the mean of the cardinality (resp. cost) of EP, say $\bar{Y}$, within $\bar{Y}/d$ with probability p, where $0 \le p < 1$ and $d > 0$, the sampling method assumes an urn model.

The urn has n balls from which m samplings are repeatedly taken, until the sum z of the cardinalities (resp. costs) of the samples is greater than $\alpha \times (S/\bar{Y})$, where $\alpha = \frac{d(d+1)}{1-\sqrt{p}}$. The estimated mean of the cardinality (resp. cost) is $\bar{Y} = z/m$.

The values d and $\frac{1}{1-\sqrt{p}}$ are associated with the relative error and the confidence level, and S and $\bar{Y}$ represent the cardinality (resp. cost) variance and mean of $\mathcal{P}$.
Since the statistics of $\mathcal{P}$ are unknown, the upper bound $\alpha \times (S/\bar{Y})$ is replaced by $\alpha \times b(n)$.

To approximate b(n) for cost and cardinality estimates, we apply Double Sampling (Ling and Sun 1992). In the first stage we randomly evaluate k samples and take the maximum value among them:

$b(n) = \max_{i=1}^{k} card(P_i)$ (resp. $b(n) = \max_{i=1}^{k} cost(P_i)$), where $1 \le k \le n$.

It has been shown that only a few samples are necessary in order for the distribution of the sum to begin to look normal. Thus, the factor $1/(1-\sqrt{p})$ may be improved by the central limit theorem (Lipton et al. 1990). This improvement allows us to achieve accurate estimations and lower bounds.

6.1.1 Estimating cardinality.

Given an intensional predicate P, the cardinality of P corresponds to the number of valid instantiations of P (Definition 3). In our previous example, the number of ontology values obtained in the answer of the query is estimated using this metric.

To estimate the cardinality of P, we execute the adaptive sampling algorithm explained before, by selecting any argument of P, and partitioning P according to the chosen argument. The cardinality estimation will be $card(P) = \bar{Y} \times n$, where n is the number of partitions, i.e. the number of different instantiations for the chosen argument.

Note that once the cardinality of the non-instantiated P is estimated, we can estimate the cardinality of the instantiated predicate by using the selectivity value(s) of the instantiated argument(s).

6.1.2 Estimating cost.

The cost of P measures the number of intermediate inferred facts (Definition 4).
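
The sampling loop of Section 6.1 and the cardinality estimate of Section 6.1.1 can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: partitions are drawn uniformly with replacement, `evaluate_partition` is a hypothetical callback that returns the cardinality (or cost) of one partition, and `max_samples` is a safeguard added here so that the loop always terminates.

```python
import math
import random


def alpha(d, p):
    """alpha = d(d+1) / (1 - sqrt(p)), as defined in Section 6.1."""
    return d * (d + 1) / (1 - math.sqrt(p))


def estimate_mean(evaluate_partition, n, d=0.2, p=0.7, k=7,
                  rng=None, max_samples=10_000):
    """Estimate the mean cardinality (or cost) per partition.

    evaluate_partition(i) returns the value of partition i, 0 <= i < n.
    d, p and k default to the values used in the paper's experiments.
    """
    rng = rng or random.Random(0)
    # First stage (double sampling): b(n) is the maximum over k random partitions.
    b = max(evaluate_partition(rng.randrange(n)) for _ in range(min(k, n)))
    threshold = alpha(d, p) * b
    z, m = 0, 0
    # Second stage: sample until the running sum z exceeds alpha * b(n).
    # max_samples is not part of the method; it only guards against
    # populations whose partitions are all empty.
    while z <= threshold and m < max_samples:
        z += evaluate_partition(rng.randrange(n))
        m += 1
    return z / m


def estimate_cardinality(evaluate_partition, n, **kwargs):
    """card(P) = (estimated mean per partition) x (number of partitions n)."""
    return estimate_mean(evaluate_partition, n, **kwargs) * n


# Hypothetical use: areClasses(C,O) partitioned by its ontology argument,
# with three ontologies that each hold four classes.
partition_sizes = [4, 4, 4]
print(estimate_cardinality(lambda i: partition_sizes[i], n=3))
```

With uniform partitions the estimate is exact regardless of which partitions are drawn; with skewed partitions the result depends on the random draws and on the parameters d, p and k.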
For instance, to estimate the cost of a predicate P(X, Y), we consider the different instantiation patterns that the predicate can have, i.e., we independently estimate the cost for $P(X^b, Y^b)$, $P(X^b, Y^f)$, $P(X^f, Y^b)$ and $P(X^f, Y^f)$, where b and f indicate that the argument is bound and free, respectively.

The computation of several cost estimates is necessary because in Datalog top-down evaluation (Abiteboul et al. 1995), the cost of an instantiated intensional predicate cannot be accurately estimated from the cost of a non-instantiated predicate (using selectivity values). Instantiated arguments will propagate in the IOB rule's body through sideways-passing, and cost varies according to the binding patterns. For example, the cost of $areSubClasses(C1^b, C2^f)$ may be smaller than the cost of $areSubClasses(C1^f, C2^b)$, i.e., the bound argument C1 "pushes" instantiations in the definition of the rule:

areSubClasses(C1,C2):-subClassOf(C1,C3),areSubClasses(C3,C2).

making its body predicates more selective.

For $P(X^b, Y^b)$, $P(X^b, Y^f)$ and $P(X^f, Y^b)$, we partition P according to the bound arguments. In these cases we are estimating the cost of one partition. Therefore, $cost(P) = \frac{\bar{Y} \times n}{n} = \bar{Y}$.

Finally, to estimate the cost of $P(X^f, Y^f)$, we choose an argument of P and partition P according to the chosen argument. To reduce the cost of computing the estimate, we choose the most selective argument. The cost estimate is $cost(P) = \bar{Y} \times n$.

6.1.3 Determining the number of partitions n.

For both cost and cardinality estimates, we need to determine the number of possible instantiations, n, of the chosen argument. This value depends on the semantics of the particular predicate.
For instance, for an interpretation I, $areClasses(Class, Ont)^{I} \subseteq \mathcal{C} \times \mathcal{O}$, where $\mathcal{C}$ is the set of valid class URIs and $\mathcal{O}$ is the set of valid ontology URIs. $|\mathcal{C}|$ corresponds to the number of EOB predicates isClass(Class, Ont), i.e., $|\mathcal{C}| = Card(isClass(Class, Ont))$. Similarly, $|\mathcal{O}| = Card(isOntology(Ont))$; these cardinalities are pre-computed offline. We assume that the values are uniformly distributed.

6.2 System R Technique

To estimate the cardinality and cost of two or more predicates, we use the cost model proposed in System R. The cardinality of the conjunction of predicates $P_1, P_2$ is described by the following expression:

$card(P_1, P_2) = card(P_1) \times card(P_2) \times reductionFactor(P_1, P_2)$

$reductionFactor(P_1, P_2)$ reflects the impact of the sideways passing variables in reducing the cardinality of the result. This value is computed assuming that sideways passing variables are independent and each is uniformly distributed (Selinger et al. 1979).

For cost estimation, we consider three evaluation strategies:

1. Nested-Loop Join
Following a Nested-Loop Join evaluation strategy, for each valid instantiation in $P_1$, we retrieve a valid instantiation in $P_2$ with a matching "join" argument value:

$cost(P_1, P_2) = cost(P_1) + card(P_1) \times cost_{inst}(P_2)$

$cost_{inst}(P_2)$ corresponds to the estimate of the cost of the predicate $P_2$ where the "join" arguments are instantiated in $P_2$, i.e., all the sideways passing variables from $P_1$ to $P_2$ are bound in $P_2$. These binding patterns were considered during the sampling-based estimation of the cost of $P_2$.

2. Block Nested-Loop Join
Predicate $P_1$ is evaluated into blocks of fixed size, and then each block is "joined" with $P_2$:

$cost(P_1, P_2) = cost(P_1) + \left\lceil \frac{card(P_1)}{BlockSize} \right\rceil \times cost(P_2)$

3. Hash Join
A hash table is built for each predicate according to their join argument.
The valid instantiations of predicates $P_1$ and $P_2$ with the same hash key will be joined together:

$cost(P_1, P_2) = cost(P_1) + cost(P_2)$

Although the sampling technique is appropriate for estimating a single predicate, it may be inefficient for estimating the size of a conjunction of more than two predicates.

The sampling algorithm in (Lipton and Naughton 1990) suggests that for a conjunction of two predicates, P, Q, if the size of P is n, the query is n-partitionable, i.e., for each valid instantiation p in P, the corresponding partition of Q contains all the valid instantiations q in Q such that q "joins" p. Therefore, when the size of the first predicate in a query is small, its sample size may be larger. This problem can be extended to conjunctive queries with several subgoals, so when the number of intermediate results is small, sampling time may be as large as evaluation time.

Table 6. Query Optimization Algorithm

Algorithm Dynamic Programming
INPUT: Predicate: a set of predicates, P_1,...,P_n.
OUTPUT: OrderedPredicate: an ordering of Predicate
1. SubPaths = Predicate;
2. For i=1 to n
   (a) For each solution Sub_j in SubPaths
       i. For each predicate P_z in Predicate
          • If there are sideways passing variables from Sub_j to P_z,
            then add Sub = Sub_j, P_z to NewSubPaths
   (b) Remove from NewSubPaths any subpath Sub_k iff there is another subpath
       Sub_l in NewSubPaths, such that Sub_l and Sub_k are equivalent,
       and Sub_l is better than Sub_k.
   (c) SubPaths = NewSubPaths
   (d) Reset NewSubPaths
3. Return the path in SubPaths with lowest cost.

6.3 Query Optimization

In Table 6, we present the algorithm used to optimize the body of a query. The proposed optimization algorithm extends the System R dynamic-programming algorithm by identifying orderings of the n EOB and IOB predicates in a query.
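
A compact sketch of the dynamic-programming ordering of Table 6, combined with the Nested-Loop Join cost and cardinality formulas of Section 6.2, is given below. It is an illustration under assumptions rather than the DOB-S optimizer: the `Pred` and `Plan` classes, the constant `reduction_factor`, and the cost figures in the usage example are hypothetical (only the cardinalities 12 and 1 come from the motivating example), and domination is tested with non-strict comparisons so that plans with equal cardinality can still be pruned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pred:
    name: str
    variables: frozenset  # variables that appear in the subgoal
    card: float           # estimated cardinality
    cost: float           # estimated cost with no argument bound
    cost_inst: float      # estimated cost with the join arguments bound


@dataclass(frozen=True)
class Plan:
    order: tuple
    card: float
    cost: float


def extend(plan, pred, reduction_factor):
    """Append pred to plan using the nested-loop formulas of Section 6.2."""
    card = plan.card * pred.card * reduction_factor
    cost = plan.cost + plan.card * pred.cost_inst
    return Plan(plan.order + (pred,), card, cost)


def optimize(preds, reduction_factor=0.1):
    """Return the cheapest ordering found; assumes a connected query."""
    # Equivalence classes are keyed by the set of predicates in the subplan.
    subpaths = {frozenset([p]): [Plan((p,), p.card, p.cost)] for p in preds}
    for _ in range(len(preds) - 1):
        new_subpaths = {}
        for plans in subpaths.values():
            for plan in plans:
                bound = frozenset().union(*(q.variables for q in plan.order))
                for p in preds:
                    # Extend only with unused predicates that share a variable.
                    if p in plan.order or not (bound & p.variables):
                        continue
                    cand = extend(plan, p, reduction_factor)
                    kept = new_subpaths.setdefault(frozenset(cand.order), [])
                    if any(o.cost <= cand.cost and o.card <= cand.card for o in kept):
                        continue  # an equivalent subplan is at least as good
                    kept[:] = [o for o in kept
                               if not (cand.cost <= o.cost and cand.card <= o.card)]
                    kept.append(cand)  # incomparable subplans are kept together
        subpaths = new_subpaths
    finals = [plan for plans in subpaths.values() for plan in plans]
    if not finals:
        raise ValueError("query is not connected by sideways passing variables")
    return min(finals, key=lambda plan: plan.cost)


# Hypothetical estimates for q(O) :- areClasses(C,O), isDProperty(traction,C).
are_classes = Pred("areClasses", frozenset({"C", "O"}), card=12, cost=24, cost_inst=3)
is_dproperty = Pred("isDProperty", frozenset({"C"}), card=1, cost=1, cost_inst=1)
best = optimize([are_classes, is_dproperty])
print([p.name for p in best.order], best.cost)
```

Under these made-up estimates the ordering that starts with isDProperty costs 1 + 1 × 3 = 4, against 24 + 12 × 1 = 36 for the ordering that starts with areClasses, so the sketch selects the former.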
During each iteration of the algorithm, the best intermediate sub-plans are chosen based on cost and cardinality. In the last iteration, final plans are constructed and the best plan is selected in terms of the cost metric. During each iteration $i$ between 2 and $n-1$, different orderings of the predicates are analyzed. Two subplans are considered equivalent if and only if they are composed of the same predicates. A subplan $SP_i$ is better than a subplan $SP_j$ if and only if the cost and cardinality of $SP_j$ are greater than the cost and cardinality of $SP_i$, respectively. If the cost of $SP_i$ is greater than the cost of $SP_j$, but the cardinality of $SP_j$ is greater than the cardinality of $SP_i$, i.e., they are incomparable, then the equivalence class is annotated with the two subplans.

7 Experimental Results

An experimental study was conducted for synthetic and real-world ontologies. Experiments on synthetic ontologies were executed on a SunBlade 150 (650MHz) with 1GB RAM; experiments on real-world ontologies were executed on a SunFire V440 (1281MHz) with 16GB RAM. Our system was implemented in SWI-Prolog 5.6.1.

We have studied three real-world ontologies: Travel (Shell 2002), EHR RM (Protege staff 1999), and GALEN (Open Clinical Organization 2001). Our cost metrics are the number of intermediate facts for synthetic and real-world ontologies, and the evaluation time for real-world ontologies.

In our experiments, the sampling parameters $d$ (the error), $p$ (the confidence level), and $k$ (the size of the sample for the first stage) were set to 0.2, 0.7 and 7, respectively. We developed two sets of experiments according to the evaluation strategies considered: (1) the Nested-Loop join evaluation strategy, and (2) the combination of Nested-Loop, Block Nested-Loop and Hash join evaluation strategies.
Our study consisted of the following:

• Cost Model Predictive Capability: In Figure 2a, we report the correlation between the estimated values and the actual cost for synthetic ontologies considering the Nested-Loop Join evaluation strategy. Synthetic ontologies were randomly generated following a uniform distribution. We generated ten ontology documents and three chain and star queries with three subgoals for each ontology; the cost of each ordering was estimated with our cost model, and each ordering was then evaluated against the ontology; this gives us a total of six hundred queries. The correlation is 0.92.

[Figure 2: panels (a) and (b); axes: Actual Cost, Estimated Cost]
Fig. 2. (a) Correlation of estimated cost to actual cost (log. scale) - nested-loop join - Synt. ontologies; (b) Correlation of estimated cost to actual cost (log. scale) - nested-loop join - GALEN

Table 7. Correlation values for real-world ontologies

            Nested-Loop Join    Three Evaluation Strategies
Travel      0.96                0.94
EHR RM      0.98                0.92

[Figure 3: panel (a) axes: Query, Cost Optimal / Cost Worst; panel (b) axes: Query, Cost Optimal / Cost Median]
Fig. 3. (a) #Pred. optimal ordering vs. #Pred. worst ordering - nested-loop join - Synt. Ontologies; (b) #Pred. optimal ordering vs. #Pred. median ordering - nested-loop join - Synt. Ontologies

[Figure 4: panels (a) and (b); axes: Query, Cost Optimal / Cost Worst]
Fig. 4. (a) #Pred. optimal ordering vs. #Pred. worst ordering - nested-loop join - EHR RM; (b) #Pred. optimal ordering vs. #Pred. worst ordering - combination evaluation strategies - EHR RM

In Figure 2b, we report the same correlation metric for the real-world ontology GALEN, and the value is 0.62.
In Table 7, we present correlation values for the real-world ontologies Travel and EHR RM for our two sets of experiments: the accuracy of the Nested-Loop join cost model is similar to the accuracy of the cost model that considers the combination of the three evaluation strategies.

• Cost improvements: We also conducted experiments to study cost improvement using the optimizer. We evaluated all the orderings of each query, then we ran the optimizer and evaluated the optimized query. Figure 3a reports the ratio of the cost of the optimal ordering to the cost of the worst ordering considering only nested-loop join, $\frac{costOptimalOrdering}{costWorstOrdering}$, for queries against synthetic ontologies. For synthetic ontologies, this ratio is less than 10% for most of the queries. We also computed the proportion of the optimal ordering cost with respect to the median ordering cost. The results for synthetic ontologies show that the optimal ordering cost is less than 40% of the median for fifteen of twenty queries; this result can be observed in Figure 3b. In Figure 4a, we report the ratio of the cost of the optimal ordering to the cost of the worst ordering considering only nested-loop join for EHR RM. Additionally, Figure 4b reports the same metric considering the combination of the three evaluation strategies. We can observe that the ratio improves when the combination of the different strategies is considered: for nested-loop join the mean of this ratio is 0.10, whereas for the combination of strategies the mean is 0.07; this is because the optimizer searches in a larger space of possibilities, increasing the chance of finding better query plans.

In general, we may state that the results show a significant improvement in the evaluation cost for the optimized queries with respect to the worst-case and median-case query orderings.
This property holds for synthetic and real-world ontologies. However, for synthetic ontologies we notice that for star-shaped queries, the difference between the median cost and the optimal cost is very small; this indicates that the form of the query may influence the cost improvement achieved by the optimizer.

Finally, we would like to point out that we also studied the use of an adaptive sampling technique for the cost estimation of the conjunction of two or more predicates (instead of the System R cost model). Although the sampling technique gives a correlation similar to that of the combination of sampling and the System R cost model, the time required to compute the cost estimation may be as large as the time needed to evaluate the query. In Figure 5, we can observe that the time difference is marginal.

[Figure 5: axes: Number of Inferred Predicates, Time ms. (log-scale); series: Query Evaluation, Sample evaluation]
Fig. 5. Sampling Conjunctions - Query Eval. time and Sample Eval. time vs. # Inf. Pred.

8 Conclusions and Future Work

We have developed a cost model that combines System R and adaptive sampling techniques. Adaptive sampling is used to estimate data that do not exist a priori, namely data related to the cardinality and cost of intensional rules in the DOB. The experimental results show that our proposed techniques produce, in general, a significant improvement in the evaluation cost for the optimized query.

Currently, we are developing a hybrid optimization mechanism that combines Magic Sets and our cost-based technique; the idea is to first identify a good ordering, and then apply Magic Sets rewritings to reduce the program that evaluates the query. Initial experiments show that this combined solution outperforms the behavior of each individual technique.
We plan to apply similar optimization techniques for conjunctive queries to DL ontologies. Initially, we will work on ABox queries, extending the techniques proposed in (Sirin and Parsia 2006). In a next stage, we will consider mixed TBox and ABox conjunctive queries.

References

Abiteboul, S., Hull, R., and Vianu, V. 1995. Foundations of Databases. Addison-Wesley.
Bruijn, J., Polleres, A., and Fensel, D. 2004. OWL Lite- WSML Working Draft. DERI Institute. http://www.wsmo.org/2004/d20/v0.1/20040629/.
Calvanese, D., Giacomo, G. D., Lembo, D., Lenzerini, M., and Rosati, R. 2005. Tailoring OWL for Data Intensive Ontologies. In Proceedings of the Workshop on OWL: Experiences and Directions.
Eiter, T., Ianni, G., Schindlauer, R., and Tompits, H. 2006. Towards Efficient Evaluation of HEX-Programs. In Proceedings of the NMR International Workshop on Non-Monotonic Reasoning.
Grosof, B., Horrocks, I., Volz, R., and Decker, S. 2003. Description Logic Programs: Combining Logic Programs with Description Logic. In Proceedings of the WWW International World Wide Web Conference.
Haarslev, V. and Moller, R. 2004. Optimization Techniques for Retrieving Resources Described in OWL/RDF Documents: First results. In Proceedings of the KR Knowledge Reasoning Conference.
Horrocks, I. and Turi, D. 2005. The OWL Instance Store: System Description. In Proceedings of the CADE International Conference on Automated Deduction. 177–181.
Hustadt, U. and Motik, B. 2005. Description Logics and Disjunctive Datalog: The Story so Far. In Proceedings of the DL International Workshop on Description Logics.
Ling, Y. and Sun, W. 1992. A Supplement to Sampling-based Methods for Query Size Estimation in a Database System. SIGMOD Record 21, 4, 12–15.
Lipton, R. and Naughton, J. 1990. Query Size Estimation by Adaptive Sampling (Extended Abstract). Proceedings of the ACM SIGMOD International Conference on Management of Data, 40–46.
Lipton, R., Naughton, J., and Schneider, D. 1990. Practical Selectivity Estimation through Adaptive Sampling. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1–10.
McGuinness, D. and Harmelen, F. V. 2004. OWL Web Ontology Language Overview. http://www.w3.org/tr/owl-features/.
Motik, B., Volz, R., and Maedche, A. 2003. Optimizing Query Answering in Description Logics using Disjunctive Deductive Databases. In Proceedings of the KRDB International Workshop on Knowledge Representation meets Databases. 39–50.
Open Clinical Organization. 2001. GALEN Common Reference Model. http://www.openclinical.org/.
Pan, Z. and Heflin, J. 2003. DLDB: Extending Relational Databases to Support Semantic Web Queries. In Proceedings of the PSSS Workshop on Practical and Scalable Semantic Systems.
Protege staff. 1999. Protege OWL: Ontology Editor for the Semantic Web. http://protege.stanford.edu/plugins/owl/owl-library/.
Ramakrishnan, R. and Gehrke, J. 2003. Database Management Systems. McGraw Hill.
Ramakrishnan, R. and Ullman, J. D. 1993. A survey of research on deductive database systems. Journal of Logic Programming 23, 2, 125–149.
Selinger, P., Astrahan, M., Chamberlin, D., Lorie, R., and Price, T. 1979. Access Path Selection in a Relational Database Management System. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 23–34.
Shell, M. 2002. SchemaWeb Website. http://www.schemaweb.info.
Sirin, E. and Parsia, B. 2006. Optimizations for Answering Conjunctive ABox Queries. In Proceedings of the DL International Workshop on Description Logics.
Staudt, M., Soiron, R., Quix, C., and Jarke, M. 1999. Query Optimization for Repository-Based Applications. In Selected Areas in Cryptography. 197–203.
Wang, T. 2006. Gauging Ontologies and Schemas by Numbers. In Proceedings of the EON Workshop on Evaluation of Ontologies for the Web.
