A role-free approach to indexing large RDF data sets in secondary memory for efficient SPARQL evaluation

Massive RDF data sets are becoming commonplace. RDF data is typically generated in social semantic domains (such as personal information management) wherein a fixed schema is often not available a priori. We propose a simple Three-way Triple Tree (Tr…

Authors: George H. L. Fletcher, Peter W. Beck

A role-free approach to indexing large RDF data sets in secondary memory   for efficient SPARQL evaluation
A Role-F ree Approac h to Indexing Large RDF Data Sets in Secondary Memory for Efficien t SP AR QL Ev aluation George H. L. Fletc her and Peter W. Bec k Sc ho ol of Engineering and Computer Science W ashington State Univ ersit y , V ancouv er, USA { fletcher , pwbeck } @wsu.edu 1 In tro duction Massiv e RDF data sets are b ecoming commonplace. RDF data is typically generated in so cial seman tic domains (suc h as p ersonal information man- agemen t [2, 11, 13]) wherein a fixed schema is often not av ailable a priori. W e prop ose a simple Thr e e-way T riple T r e e (T ripleT) secondary-memory indexing technique to facilitate efficien t SP ARQL query ev aluation on such data sets. The no v elty of T ripleT is that (1) the in dex is built o v er the atoms o ccurring in the data set, rather than at a coarser granularit y , such as whole triples o ccurring in the data set; and (2) the atoms are indexed regardless of the roles (i.e., sub jects, predicates, or ob jects) they pla y in the triples of the data set. W e show through extensiv e empirical ev aluation that T ripleT exhibits m ultiple orders of magnitude improv emen t ov er the state of the art on RDF indexing, in terms of b oth storage and query pro cessing costs. Preliminary Notions. W e assume familiarit y with the RDF and SP AR QL standards [8, 12, 15], the B+tree data structure [4, 16], and the basics of conjunctiv e query pro cessing [3, 16, 18]. Let A b e an enumerable set of atoms (e.g., Unico de strings). A triple is an element of A × A × A . An RDF gr aph is a finite set of triples. F or graph G , let S ( G ) = { s | ( s, p, o ) ∈ G } P ( G ) = { p | ( s, p, o ) ∈ G } O ( G ) = { o | ( s, p, o ) ∈ G } A ( G ) = S ( G ) ∪ P ( G ) ∪ O ( G ) . 1 ˘ h Yamada, authored, doc1 i , h Yamada, knows, McShea i , h knows, is a kind of, social action i , h Herzog, authored, doc2 i , h Herzog, authored, doc3 i , h McShea, performed, doc3 i , h McShea, past action, authored i , h doc1, type, PDF i , h doc1, rating, 4/5 i , h doc2, type, MP3 i , h doc3, type, MP3 i , h doc3, created on, 26.10.08 i ¯ Figure 1: A triple graph. The atoms app earing in S ( G ) are called the subje cts of G ; the atoms ap- p earing in P ( G ) are called the pr e dic ates of G ; and, the atom s app earing in O ( G ) are called the obje cts of G . 2 The Problem The problem we consider in this pap er is ho w to index a graph G to supp ort efficien t ev aluation of b asic gr aph p atterns (BGP) ov er G . BGPs, which are conjunctions of simple ac c ess p atterns (SAP), form the heart of all SP AR QL queries. Example 1 Consider the query “What ar e the dates and typ es of do cuments on which McShe a was a p erformer?” over the triple stor e given in Figur e 1. In SP ARQL, wher e variables ar e identifie d by a le ading ? , this query c an b e formulate d as fol lows: SELECT ?date ?type WHERE { McShea performed ?doc . ?doc created_on ?date . ?doc type ?type } The WHERE clause of a SP AR QL query sp e cifies a BGP, which in this c ase c onsists of the c onjunction of the fol lowing thr e e SAPs: ( McShea , performed , ?do c) , (?do c , created on , ?date) , (?do c , type , ?t yp e) . 2 Conc eptual ly, the evaluation of a BGP on a gr aph G c onsists of finding al l variable bindings such that e ach of the BGP’s c onstituent SAPs simultane- ously holds in G . In our example, ther e is only one set of valid variable bindings: ?do c ?date ?t yp e doc1 26.10.08 MP3 The SELECT clause indic ates that only the bindings for ?date and ?type ar e r eturne d in the query r esult. The reader will recognize that BGPs are essentially conjunctiv e queries ev aluated o ver a single ternary relation [3, 7, 9, 18, 21]. Joins b et ween the SAPs of a BGP are induced by the co-o ccurrence of v ariables and atoms. There are six native BGP join t yp es: sub ject-sub ject, sub ject- predicate, sub ject-ob ject, predicate-predicate, predicate-ob ject, and ob ject- ob ject joins. In Example 1, there is a sub ject-ob ject join b et w een the first SAP and b oth the second and third SAPs, due to the co-o ccurrence of v ari- able ?do c. F urthermore, there is a sub ject-sub ject join b et w een the second and third SAPs. W e sp ecifically focus on the problem of designing native RDF index data structures to accelerate BGP ev aluation. By nativ e, w e mean data structures whic h supp ort the full range of BGP join patterns. 3 The Solution Let G b e a fixed RDF graph. In what follo ws, w e use the B+tree secondary- memory data structure [4] to implemen t the v arious indexing techniques considered. Ho w ever, any of a v ariet y of appropriate secondary-memory data structures (e.g., linear hashing [16]) could also be also hav e b een used. 3.1 State of the Art T o the b est of our knowledge, the tw o ma jor comp etitiv e proposals for nativ e RDF indexing are m ultiple access patterns (MAP) and HexT ree. • MAP . In this approac h, all three p ositions of triples are indexed: sub- jects (S), predicates (P), and ob jects (O), for some p erm utation of S, P , and O. MAP requires up to six separate indexes, corresp onding to the six p ossible orderings of roles: SPO, SOP , PSO, POS, OSP , OPS. F or example, for each ( s, p, o ) ∈ G , it is the case that o # p # s is a 3 ... ... ... (a) MAP ... ... k . . . ... (b) HexT ree ... ... k 1 k 2 . . . ... (c) T ripleT Figure 2: V arieties of T riple T rees. k ey in the OPS index on G ; see Figure 2(a). 1 A BGP join ev aluation requires tw o or more lo ok-ups, p oten tially in differen t trees, follow ed b y merge-joins. Ma jor systems employing this technique include Vir- tuoso, Y ARS, RDF-3X, Kow ari, and System-Π [6, 10, 14, 26, 27]. In the present inv estigation we use the B+tree data structure for eac h of the MAP indexes (Figure 2(a)). • HexT ree. Recen tly in the Hexstore system, W eiss et al. [24] ha ve prop osed indexing t wo roles at a time. This approac h requires up to six separate indexes corresponding to the six p ossible orderings of roles: SO, OS, SP , PS, OP , PO. P ayloads are shared b etw een indexes with symmetric orderings. F or example, for eac h ( s, p, o ) ∈ G , it is the case that s # p is a k ey in the SP index on G , p # s is a key in the PS index on G , and b oth of these k eys p oint to a pa yload of { o ∈ O ( G ) | ( s, p, o ) ∈ G } ; see Figure 2(b). As with MAP , join ev aluation requires tw o or more lo ok-ups, p oten tially in differen t trees, follo wed by merge-joins. Hexstore has only been prop osed and ev aluated as a main-memory data structure [24]. W e propose HexT ree as an effectiv e secondary- memory realization of the He xstore prop osal using the B+tree data structure (Figure 2(b)). Note that tec hniques ha ve also been dev elop ed for indexing heuristically- selected classes of larger graph patterns, e.g., [23]. Such tec hniques, ho w ever, do not supp ort pro cessing of the full range of native BGP join patterns. 1 Where “#” is some reserved separator sym b ol. 4 o 1 p 1 . . . s 1 o 1 . . . p 1 s 1 . . . s p o Figure 3: T ripleT payload for atom k . 3.2 Our Proposal W e prop ose indexing the key-space A ( G ), regardless of the particular roles the atoms of A ( G ) play in the triples of G . F or a key k , the payload is all triples of G in whic h atom k o ccurs (see Figure 2(c)). In particular, the pa yload for k consists of three “buck ets”: one for all pairs ( p, o ) where ( k , p, o ) ∈ G , one for all pairs ( s, o ) where ( s, k , o ) ∈ G , and one for all pairs ( s, p ) where ( s, p, k ) ∈ G , (see Figure 3). In other words, there is one buck et apiece for all those triples where k o ccurs as a sub ject, for all those triples where k o ccurs as a predicate, and for all those triples where k app ears as an ob ject. F or example, on the graph of Figure 1, the pa yload for doc1 w ould consist of an ob ject buck et h ( Yamada , authored ) i , a sub ject buc ket h ( 4 / 5 , rating ) , ( PDF , type ) i , and a predicate buck et hi . 2 T ripleT requires just one index, while efficiently supporting all join patterns native to SP AR QL. F or example, a sub ject-ob ject join induced b y the co-o ccurrence of an atom k can be ev aluated b y a single look-up on k follow ed b y a merge-join b et w een the sub ject and ob ject buc kets of k ’s pa yload. A join induced by the co-o ccurrence of a v ariable is implemen ted as multiple lo ok-ups follo wed b y merge-joins, as with MAP and HexT ree. Ho wev er, since the k eys in T ripleT are 1/3 the length of those in MAP and 1/2 those in HexT ree, there is a significan t increase in the branc hing factor of the T ripleT B+tree, whic h leads to a significan t reduction in cost for these lo ok-ups. T ripleT do es not fav or any particular join types, supp orting the full range of join patterns nativ e to RDF data. The recen tly prop osed “v ertical- partitioning” approach [1] can b e viewed as a sp ecial restricted case of T ripleT where (1) only the atoms of P ( G ) are indexed and (2) only the predicate payload buc ket for each key is maintained. In this sense, v ertical- 2 T o facilitate query pro cessing, note that we k eep the pairs in each of the buck ets sorted. By default, the sub ject buck et is sorted in OP order, the predicate buc ket in SO order, and the ob ject buc ket in SP order. 5 partitioning is not a fully native RDF indexing tec hnique; indeed, recent researc h has demonstrated practical limitations of this approac h [19, 17, 24]. This researc h has also demonstrated similar limitations of the related “prop- ert y table” RDF storage techniques [5, 20, 22, 25]. 4 Empirical Ev aluation W e implemented all three approac hes using 8K blo c ks and 32-bit references, in virtual memory , using Python 2.5.2. All exp erimen ts were executed on a pair of 2.66 GHz dual-core Intel Xeon pro cessors with 16 GB RAM running Mac OS X 10.4.11. Eac h exp erimen t w as p erformed using (1) simple syn- thetic data; (2) the DBPedia RDF data set; and, (3) the Uniprot RDF data set. F urther details of these data sets are provided in the Appendix. As mentioned ab o ve, in T ripleT we only materialized the OP , SO, and SP sort orderings for the sub ject, predicate, and ob ject payload buck ets, resp ectiv ely . 3 Consequen tly , we only built the corresp onding SOP , PSO, and OSP trees for MAP and the SO, PS, and OS trees for HexT ree. In all of our exp erimen ts, the T ripleT payloads o ccupied on av erage only one disk blo c k. Hence, if a symmetric sort ordering w as necessary for a merge join (e.g., if the PO ordering was necessary for the sub ject buc ket while using T ripleT or if the SPO ordering w as necessary while doing a lo okup in MAP), the sort w as p erformed in main-memory without p enalt y . 4.1 Index size In incremen ts of 1 million triples, from 1 to 6 million triples, w e built the three index types. The plots of the index sizes, in 8K blo c ks, are shown in Figures 4(a)-4(c). T ripleT was up to eight orders of magnitude smaller, with a typical tw o orders of magnitude savings in storage cost. The reason for this can b e attributed to (1) T ripleT uses just one B+tree, whereas MAP and HexT ree b oth require three B+trees, and (2) the key size in T ripleT is 1/3 that of MAP and 1/2 that of HexT ree, leading to significantly higher branc hing factor of the B+tree (and hence shallow er trees). 3 If necessary , each of the tw o p ossible sort orderings for each of the three T ripleT buc kets could be materialized. In this case, we would of course still need just one B+tree to index pa yloads. 6 (a) Syn thetic (b) DBP edia (c) Uniprot Figure 4: Index sizes, in 8K blo cks. 4.2 Query performance W e use the classic I/O cost mo del for query ev aluation, i.e., we use the n umber of blo ck reads as our p erformance metric [16], as w e are interested in comparing the technology-independent b eha vior of MAP , HexT ree, and T ripleT. W e considered t wo query scenarios: • A single SAP without v ariables, whic h w e denote as a “ k = 0” join scenario. F or eac h dataset, and for eac h size, we randomly selected ten triples from the dataset and recorded the costs of lo oking them up in MAP , HexT ree, and T ripleT. The av erage I/O cost of performing these lo okups is given in Figures 5(a)-5(c). • Basic BGP join patterns, whic h we denote as a “ k = 1” join scenario. W e considered four sub-scenarios, co vering the basic wa ys in which SAPs ma y b e joined. 1. Computing the join of t wo v ariable-free SAPs ha ving one atom in common. 2. Computing the join of t wo SAPs ha ving one atom in common, one SAP ha ving a single v ariable and the other v ariable-free. 3. Computing the join of tw o SAPs ha ving no atoms in common, eac h having a single v ariable, whic h they share. 4. Computing the join of t wo SAPs ha ving one atom in common, eac h having one v ariable, whic h they also share. F or eac h data set, for each size, we generated ten random BGPs of eac h of these four scenarios and recorded the cost of their ev aluation 7 (a) Syn thetic, k = 0 (b) DBP edia, k = 0 (c) Uniprot, k = 0 (d) Syn thetic, k = 1 (e) DBP edia, k = 1 (f ) Uniprot, k = 1 Figure 5: Cost of Query Pro cessing. using MAP , HexT ree, and T ripleT. The av erage I/O costs are given in Figures 5(d)-5(f ). W e observ e from these exp erimen ts that (1) for k = 0 T ripleT never p erformed w orse than MAP or HexT ree, and usually b etter; and, (2) for k = 1, T ripleT alw ays out-p erformed MAP and HexT ree, with up to tw o orders of magnitude impro vemen t in I/O costs. 5 Concluding remarks It is clear from this extensive ev aluation of the full range of BGP join sce- narios on b oth synthetic and real-world data sets that T ripleT is a serious con tender for indexing massive RDF data stores in secondary memory . Our prop osal is conceptually quite simple, and hence straight forward to imple- men t. F urthermore, T ripleT exhibits m ultiple orders of magnitude impro ve- men t o ver the state of the art for b oth storage cost and query ev aluation cost. In closing, we note that the man y optimizations (such as v arious key compression sc hemes) which hav e b een used in implemen tations of MAP and HexT ree rep orted in the literature can equally b e applied to T ripleT. 8 References [1] Daniel J. Abadi, Adam Marcus, Sam uel Madden, and Katherine J. Hollen bach. Scalable Semantic W eb Data Management Using V ertical P artitioning. In VLDB , pages 411–422, Vienna, 2007. [2] Karl Ab erer. Data Management in the Social W eb. In EDBT , pages 1203–1204, Munic h, 2006. [3] Ashok K. Chandra and Philip M. Merlin. Optimal Implementation of Conjunctiv e Queries in Relational Data Bases. In ACM STOC , pages 77–90, Boulder, CO, USA, 1977. [4] Douglas Comer. The Ubiquitous B-T ree. ACM Comput. Surv. , 11(2):121–137, 1979. [5] Mar ´ ıa del Mar Rold´ an Garc ´ ıa and Jos ´ e F rancisco Aldana Mon tes. A Surv ey on Disk Oriented Querying and Reasoning on the Semantic W eb. In IEEE ICDE Workshop SWDB , A tlanta, 2006. [6] Orri Erling. T o wards W eb Scale RDF. In SSWS , Karlsruhe, Germany , 2008. [7] George H. L. Fletcher. An Algebra for Basic Graph Patterns. In LID , Rome, 2008. [8] Tim F urche, Benedikt Linse, F ran¸ cois Bry , Dimitris Plexousakis, and Georg Gottlob. RDF Querying: Language Constructs and Ev aluation Metho ds Compared. In R e asoning Web , pages 1–52, Lisb on, Portugal, 2006. [9] Claudio Guti ´ errez, Carlos A. Hurtado, and Alb erto O. Mendelzon. F oundations of Semantic W eb Databases. In A CM PODS , pages 95– 106, P aris, 2004. [10] Andreas Harth, J ¨ urgen Umbric h, Aidan Hogan, and Stefan Deck er. Y ARS2: A F ederated Repository for Querying Graph Structured Data from the W eb. In ISWC , Busan, Korea, 2007. [11] Da vid R. Karger, Karun Bakshi, Da vid Huynh, Dennis Quan, and Vi- neet Sinha. Haystac k: A General-Purp ose Information Management T o ol for End Users Based on Semistructured Data. In CIDR , pages 13–26, 2005. 9 [12] Graham Klyne and Jeremy J. Carroll. Resource Description F ramew ork (RDF): Concepts and Abstract Syn tax. W3C Recommendation, 2004. [13] m c sc hraefel. What is an Analogue for the Semantic W eb and Wh y is Ha ving One Imp ortan t? In ACM Hyp ertext , pages 123–132, Manch- ester, UK, 2007. [14] Thomas Neumann and Gerhard W eikum. RDF-3X: A RISC-Style En- gine for RDF. In VLDB , Auckland, New Zealand, 2008. [15] Eric Prud’hommeaux and Andy Seab orne. SP ARQL Query Language for RDF. W3C Recommendation, 2008. [16] Ragh u Ramakrishnan and Johannes Gehrk e. Datab ase Management Systems, 3r d Ed. McGraw Hill, Boston, 2003. [17] Mic hael Sc hmidt, Thomas Horn ung, Norb ert K ¨ uc hlin, Georg Lausen, and Christoph Pink el. An Exp erimen tal Comparison of RDF Data Managemen t Approac hes in a SP AR QL Benc hmark Scenario. In ISWC , pages 82–97, Karlsruhe, German y , 2008. [18] P atricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Ra y- mond A. Lorie, and Thomas G. Price. Access Path Selection in a Rela- tional Database Management System. In ACM SIGMOD , pages 23–34, Boston, 1979. [19] Lefteris Sidirourgos, Rom ulo Goncalv es, Martin Kersten, Niels Nes, and Stefan Manigold. Column-Store Supp ort for RDF Data Management: Not All Sw ans are White. In VLDB , Auckland, New Zealand, 2008. [20] Mic hael Sin tek and Malte Kiesel. RDFBroker: A Signature-Based High- P erformance RDF Store. In ESWC , pages 363–377, Budv a, Montene- gro, 2006. [21] Markus Sto c k er, Andy Seab orne, Abraham Bernstein, Christoph Kiefer, and Da ve Reynolds. SP ARQL Basic Graph Pattern Optimiza- tion Using Selectivity Estimation. In ACM WWW , pages 595–604, Beijing, 2008. [22] Y annis Theoharis, V assilis Christophides, and Gregory Karv ounarakis. Benc hmarking Database Represen tations of RDF/S Stores. In ISWC , pages 685–701, Galw ay , Ireland, 2005. 10 [23] Octa vian Udrea, Andrea Pugliese, and V. S. Subrahmanian. GRIN: A Graph Based RDF Index. In AAAI , pages 1465–1470, V ancouv er, B.C., 2007. [24] Cathrin W eiss, Panagiotis Karras, and Abraham Bernstein. Hexastore: Sextuple Indexing for Seman tic W eb Data Management. In VLDB , Auc kland, New Zealand, 2008. [25] Kevin Wilkinson. Jena Prop ert y T able Implementation. In SSWS , pages 35–46, A thens, Georgia, USA, 2006. [26] Da vid W o od, P aul Gearon, and T om Adams. Kow ari: A Platform for Seman tic W eb Storage and Analysis. In XT e ch , Amsterdam, 2005. [27] Gang W u, Juanzi Li, and Kehong W ang. System Π: a Hyp ergraph Based Native RDF Rep ository. In WWW , pages 1035–1036, Beijing, 2008. App endix In this section we pro vide details of the data sets used in the exp erimen ts discussed in Section 4: (1) syn thetic data, (2) the DBPedia RDF data set; 4 and (3) the Uniprot RDF data set. 5 F or (1), w e built tw o synthetic data sets of size 6 million (the results of Section 4 are the av erages ov er these t wo sets). In the first set, w e randomly generated n triples ov er n 1 / 3 unique atoms, for n = 1 , 000 , 000, to n = 6 , 000 , 000, in incremen ts of one million, where rep etitions of atoms w ere allo wed within triples. In the second set, we randomly generated n triples o ver ceil ing ( n 1 / 3 ) + 2 unique atoms, for n = 1 , 000 , 000, to n = 6 , 000 , 000, in incremen ts of one million, where rep etitions of atoms within triples were disallo wed. F or (2) and (3), we to ok an arbitrary sample of 10,000,000 triples from eac h data collection (treating the DBPedia infob o x and pagelinks as one collection) — see T able 1. After cleaning and duplicate elimination, w e kept 6,000,000 triples in eac h collection. In this cleaned data, w e use only the first 400 (DBPedia) or 150 (Uniprot) c haracters of atoms (note that these are the basis of the fixed k ey sizes for the B+trees w e built). This truncation only affected a few extremely long atoms app earing exclusiv ely in the ob ject p osition. Final statistics for these data sets are given in T able 2. 4 http://wiki.dbpedia.org 5 http://dev.isb-sib.ch/projects/uniprot-rdf 11 G | G | a verage atom length DBP edia 82,701,339 34.2 Uniprot 956,915,180 29.0 T able 1: Data sets G |S ( G ) | |P ( G ) | |O ( G ) | |A ( G ) | |S ( G ) ∩ O ( G ) | |S ( G ) ∩ P ( G ) | |P ( G ) ∩ O ( G ) | DBP edia 1,370,679 20,873 1,848,114 2,852,484 387,182 0 0 Uniprot 4,357,005 81 1,734,176 5,644,939 446,311 0 12 T able 2: Basic statistics of sampled data sets 12

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment