The SPARQL query language is a recent W3C standard for processing RDF data, a format that has been developed to encode information in a machine-readable way. We investigate the foundations of SPARQL query optimization and (a) provide novel complexity results for the SPARQL evaluation problem, showing that the main source of complexity is operator OPTIONAL alone; (b) propose a comprehensive set of algebraic query rewriting rules; (c) present a framework for constraint-based SPARQL optimization based upon the well-known chase procedure for Conjunctive Query minimization. In this line, we develop two novel termination conditions for the chase. They subsume the strongest conditions known so far and do not increase the complexity of the recognition problem, thus making a larger class of both Conjunctive and SPARQL queries amenable to constraint-based optimization. Our results are of immediate practical interest and might empower any SPARQL query optimizer.
Deep Dive into Foundations of SPARQL Query Optimization.
The SPARQL Protocol and Query Language is a recent W3C recommendation that has been developed to extract information from data encoded using the Resource Description Framework (RDF) [14]. From a technical point of view, RDF databases are collections of (subject, predicate, object) triples. Each triple encodes the binary relation predicate between subject and object, i.e. it represents a single knowledge fact. Due to their homogeneous structure, RDF databases can be seen as labeled directed graphs, where each triple defines an edge from the subject node to the object node, labeled with the predicate. While originally designed to encode knowledge in the Semantic Web in a machine-readable format, RDF has found its way out of the Semantic Web community and entered the wider discourse of Computer Science. Along with its adoption in other areas, such as bioinformatics, data publishing, and data integration, large RDF repositories have been created (cf. [29]). It has repeatedly been observed that the database community faces new challenges in coping with the specifics of the RDF data format [7,18,21,31].
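To make the graph view concrete, the following minimal Python sketch stores a few triples and derives the adjacency structure of the corresponding labeled directed graph. The resource names (alice, knows, acme, and so on) and the helper to_graph are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical toy data: each tuple is one (subject, predicate, object) triple.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "worksAt", "acme"),
]


def to_graph(triples):
    """Map each subject node to its outgoing (predicate label, object node) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = to_graph(TRIPLES)
```

Here graph["bob"] lists the two edges leaving node bob, one labeled knows and one labeled worksAt, while carol has no entry because it never occurs as a subject.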
With SPARQL, the W3C has recommended a declarative query language for extracting data from RDF graphs. SPARQL comes with a powerful graph matching facility, whose basic constructs are so-called triple patterns. During query evaluation, variables inside these patterns are matched against the RDF input graph. The solution of the evaluation process is then described by a set of mappings, where each mapping associates a set of variables with RDF graph components. SPARQL additionally provides a set of operators (namely And, Filter, Optional, Select, and Union), which can be used to compose more expressive queries.
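The matching of a single triple pattern can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the W3C semantics in full: terms are plain strings, variables are the strings starting with "?", and mappings are Python dictionaries. The data and the function name match_pattern are hypothetical.

```python
# Hypothetical toy data: each tuple is one (subject, predicate, object) triple.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "worksAt", "acme"),
]


def match_pattern(pattern, triples):
    """Return one mapping (variable -> value) per triple that matches the pattern."""
    mappings = []
    for triple in triples:
        mapping = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable binds to the value; repeated occurrences must agree.
                if mapping.setdefault(term, value) != value:
                    break
            elif term != value:
                # A constant must equal the corresponding triple component.
                break
        else:
            # No position failed, so the triple matches the pattern.
            mappings.append(mapping)
    return mappings


solutions = match_pattern(("?x", "knows", "?y"), TRIPLES)
```

For the pattern (?x, knows, ?y), the result contains two mappings, one binding ?x to alice and ?y to bob, the other binding ?x to bob and ?y to carol.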
One key contribution of this paper is a comprehensive complexity analysis for fragments of SPARQL. We follow previous approaches [26] and use the complexity of the Evaluation problem as a yardstick: given query Q, data set D, and candidate solution S as input, check if S is contained in the result of evaluating Q on D. In [26] it has been shown that full SPARQL is PSpace-complete, which is bad news from a complexity point of view. We show that operator Optional alone already makes the Evaluation problem PSpace-hard. Motivated by this result, we further refine our analysis and prove better complexity bounds for fragments with restricted nesting depth of Optional expressions.
Having established this theoretical background, we turn towards SPARQL query optimization. The semantics of SPARQL is formally defined on top of a compact algebra over mapping sets. In the evaluation process, the SPARQL operators are first translated into algebraic operations, which are then directly evaluated on the data set. The SPARQL Algebra (SA) comprises operations such as join, union, left outer join, difference, projection, and selection, akin to the operators defined in Relational Algebra (RA). At first glance, there are many parallels between SA and RA; in fact, the study in [1] reveals that SA and RA have exactly the same expressive power. However, the technically involved proof in [1] indicates that a semantics-preserving SA-to-RA translation is far from trivial (cf. [6]). Hence, although both algebras provide similar operators, there are still fundamental differences between the two. One of the most striking discrepancies, as also argued in [26], is that joins in RA are rejecting over null-values, whereas in SA, where the schema is loose in the sense that mappings may bind an arbitrary set of variables, joins over unbound variables (essentially the equivalent of RA null-values) are always accepting.
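The difference in join behavior can be illustrated with a small Python sketch. It assumes that mappings are dictionaries that simply omit unbound variables, and that relational rows use None for null-values; the function names sa_join and null_rejecting_join as well as the sample data are hypothetical.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on every variable both of them bind."""
    return all(m1[var] == m2[var] for var in m1.keys() & m2.keys())


def sa_join(omega1, omega2):
    """SA-style join: merge every pair of compatible mappings."""
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2 if compatible(m1, m2)]


def null_rejecting_join(rows1, rows2, shared):
    """RA-style join on the shared columns: a null (None) never satisfies the join."""
    return [
        {**r1, **r2}
        for r1 in rows1
        for r2 in rows2
        if all(r1[c] is not None and r1[c] == r2[c] for c in shared)
    ]


# SA: the second mapping leaves ?y unbound, so it is compatible with anything on ?y.
omega1 = [{"?x": "alice", "?y": "bob"}, {"?x": "carol"}]
omega2 = [{"?y": "bob", "?z": "acme"}]
sa_result = sa_join(omega1, omega2)

# RA: the same data over a fixed schema, with None standing in for the unbound ?y.
rows1 = [{"x": "alice", "y": "bob"}, {"x": "carol", "y": None}]
rows2 = [{"y": "bob", "z": "acme"}]
ra_result = null_rejecting_join(rows1, rows2, shared=["y"])
```

In this sketch sa_result contains two mappings, because the mapping for carol is joined with the right-hand mapping and thereby acquires bindings for ?y and ?z, whereas ra_result contains only the row for alice.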
One direct implication is that not all equivalences that hold in RA also hold in SA, and vice versa, which calls for a study of SA in its own right. In response, we present an elaborate study of SA in the second part of the paper. We survey existing and develop new algebraic equivalences, covering various SA operators, their interaction, and their relation to the RA counterparts. When interpreted as rewriting rules, these equivalences form the theoretical foundations for transferring established RA optimization techniques, such as filter pushing, into the SPARQL context. Going beyond the adaptation of existing techniques, we also address SPARQL-specific issues, e.g. we provide rules for simplifying expressions involving (closed-world) negation, which can be expressed in SPARQL syntax using a combination of Optional and Filter.
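The encoding of negation mentioned above can be sketched at the level of mapping sets. The sketch assumes the commonly used pattern in which an Optional part introduces a fresh variable and a Filter then keeps only the mappings where that variable stayed unbound; it illustrates the idea and is not one of the rewriting rules developed in the paper. All names and data are hypothetical.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on every variable both of them bind."""
    return all(m1[var] == m2[var] for var in m1.keys() & m2.keys())


def left_outer_join(omega1, omega2):
    """Join compatible pairs and keep left mappings that have no compatible partner."""
    joined = [{**m1, **m2} for m1 in omega1 for m2 in omega2 if compatible(m1, m2)]
    unmatched = [m1 for m1 in omega1 if not any(compatible(m1, m2) for m2 in omega2)]
    return joined + unmatched


def filter_unbound(omega, var):
    """Selection that keeps only the mappings in which var is not bound."""
    return [m for m in omega if var not in m]


# Hypothetical inputs: all persons, and the persons that have an email address.
people = [{"?p": "alice"}, {"?p": "bob"}]
emails = [{"?p": "alice", "?e": "mailto:alice@example.org"}]

# "Persons without an email address": Optional followed by a Filter on unbound ?e.
without_email = filter_unbound(left_outer_join(people, emails), "?e")
```

Only the mapping for bob survives the selection, since the left outer join extends the mapping for alice with a binding for ?e.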
We note that in the past much research effort has been spent on processing RDF data with traditional systems, such as relational DBMSs or datalog engines [7,18,31,25,12,21,27], thus falling back on established optimization strategies. Some of them (e.g. [31,12]) work well in practice, but are limited to small fragments, such as And-only queries. More complete approaches (e.g. [7]) suffer from performance bottlenecks for complex queries, often caused by poor optimization results (cf. [18,21,22]). For instance, [21
…(Full text truncated)…