S2RDF: RDF Querying with SPARQL on Spark

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

RDF has become very popular for semantic data publishing due to its flexible and universal graph-like data model. Yet the ever-increasing size of RDF data collections makes it more and more infeasible to store and process them on a single machine, raising the need for distributed approaches. Instead of building a standalone but closed distributed RDF store, we endorse the usage of existing infrastructures for Big Data processing, e.g., Hadoop. However, SPARQL query performance is a major challenge, as these platforms are not designed for RDF processing from the ground up. Thus, existing Hadoop-based approaches often favor certain query pattern shapes, while performance drops significantly for other shapes. In this paper, we describe a novel relational partitioning schema for RDF data called ExtVP that uses semi-join based preprocessing, akin to the concept of Join Indices in relational databases, to efficiently minimize query input size regardless of the query's pattern shape and diameter. Our prototype system S2RDF is built on top of Spark and uses its relational interface to execute SPARQL queries over ExtVP. We demonstrate its superior performance in comparison to state-of-the-art SPARQL-on-Hadoop approaches using the recent WatDiv test suite. S2RDF achieves sub-second runtimes for the majority of queries on a billion-triple RDF graph.


💡 Research Summary

The paper addresses the growing challenge of storing and querying massive RDF datasets that no longer fit on a single machine. While many distributed solutions rely on Hadoop‑based platforms, their SPARQL performance is often limited because they are optimized for specific query shapes (typically star patterns) and suffer severe slowdowns on other shapes such as long linear chains or snowflake structures. To overcome these limitations, the authors introduce S2RDF, a SPARQL engine built on top of Apache Spark, and a novel data layout called ExtVP (Extended Vertical Partitioning).

ExtVP extends the classic vertical partitioning (VP) approach, where each distinct predicate is stored in its own two-column table (subject, object). In ExtVP, for every pair of predicates the system pre-computes semi-join reductions for the subject-subject, object-subject, and subject-object join correlations; object-object correlations are not pre-computed. These semi-join results are materialized as additional tables that contain only those rows that can actually participate in a join with the counterpart predicate. This pre-filtering reduces the amount of data that must be read and processed during query execution, regardless of the query's diameter or shape. The layout also stores basic statistics (row counts, join cardinalities) for each ExtVP table, enabling the query compiler to select the most selective table for each triple pattern at compile time. An optional selectivity threshold allows the system to skip ExtVP tables that are not much smaller than the VP table they are derived from and whose benefit would therefore be marginal, keeping storage overhead low while preserving most of the performance gains.
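The idea can be illustrated with a minimal, pure-Python sketch. It is not the S2RDF implementation: the toy graph, the function names (`build_vp`, `semi_join`, `build_extvp`, `pick_table`), and the definition of selectivity as reduced size divided by VP size are assumptions made for illustration. The sketch covers three correlation types (SS, OS, SO) and skips reductions that are empty or that remove nothing.

```python
from collections import defaultdict

# Toy RDF graph as (subject, predicate, object) triples; purely illustrative data.
TRIPLES = [
    ("A", "follows", "B"), ("B", "follows", "C"),
    ("B", "follows", "D"), ("C", "follows", "D"),
    ("A", "likes", "I1"), ("A", "likes", "I2"), ("C", "likes", "I2"),
]

def build_vp(triples):
    """Vertical partitioning: one list of (subject, object) rows per predicate."""
    vp = defaultdict(list)
    for s, p, o in triples:
        vp[p].append((s, o))
    return dict(vp)

# Join column index in (left row, right row) for each correlation type.
# "OS" means: object of the left table equals subject of the right table.
CORRELATIONS = {"SS": (0, 0), "OS": (1, 0), "SO": (0, 1)}

def semi_join(left, right, left_col, right_col):
    """Keep the rows of `left` that have at least one join partner in `right`."""
    keys = {row[right_col] for row in right}
    return [row for row in left if row[left_col] in keys]

def build_extvp(vp, threshold=1.0):
    """Materialize non-empty reductions whose selectivity is below `threshold`.

    Selectivity here is len(reduced) / len(VP table) (an assumption of this
    sketch); 1.0 means the reduction removes nothing, so VP is just as good.
    """
    extvp, stats = {}, {}
    for p1 in vp:
        for p2 in vp:
            for kind, (lc, rc) in CORRELATIONS.items():
                if p1 == p2 and kind == "SS":
                    continue  # self semi-join on the same column changes nothing
                reduced = semi_join(vp[p1], vp[p2], lc, rc)
                selectivity = len(reduced) / len(vp[p1])
                stats[(kind, p1, p2)] = selectivity
                if reduced and selectivity < threshold:
                    extvp[(kind, p1, p2)] = reduced
    return extvp, stats

def pick_table(vp, extvp, predicate, neighbours):
    """Pick the smallest available table for a triple pattern.

    `neighbours` lists (kind, other_predicate) correlations between this
    pattern and the other patterns of the query.
    """
    best = vp[predicate]
    for kind, other in neighbours:
        candidate = extvp.get((kind, predicate, other))
        if candidate is not None and len(candidate) < len(best):
            best = candidate
    return best

vp = build_vp(TRIPLES)
extvp, stats = build_extvp(vp)
# Pattern "?x follows ?y" joined with "?y likes ?z" is an OS correlation.
chosen = pick_table(vp, extvp, "follows", [("OS", "likes")])
print(len(vp["follows"]), "rows in VP vs", len(chosen), "rows in ExtVP")
```

In this toy graph the reduced table for the `follows`/`likes` OS correlation holds a single row instead of four, which is the kind of input reduction the layout aims for. Lowering `threshold` trades some of that reduction for fewer materialized tables.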

S2RDF translates SPARQL queries into Spark SQL DataFrame operations. By leveraging Spark’s Catalyst optimizer, the engine automatically applies join reordering, column pruning, and partition pruning. The underlying data is persisted in Parquet, a column‑oriented format that provides schema preservation, dictionary and run‑length encoding, and Snappy compression. When cached in memory, Parquet’s columnar layout further reduces memory footprint and I/O.
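To make the translation step concrete, here is a hedged sketch of how a basic graph pattern could be turned into a single SQL join query. Python's built-in `sqlite3` is used only as a stand-in for Spark SQL so the example runs without a cluster. The function `bgp_to_sql`, the table names, and the toy data are hypothetical and do not come from S2RDF. The sketch handles plain triple patterns only, with no FILTER, OPTIONAL, or UNION.

```python
import sqlite3

def bgp_to_sql(patterns, table_for):
    """Translate a basic graph pattern into one SQL query.

    `patterns` is a list of (subject, predicate, object) strings; terms that
    start with '?' are variables. `table_for(i)` returns the name of the
    table to scan for pattern i (a VP table or a smaller ExtVP table).
    """
    first_seen = {}  # variable -> "alias.column" of its first occurrence
    from_parts, conditions, params = [], [], []
    for i, (s, _p, o) in enumerate(patterns):
        alias = f"t{i}"
        from_parts.append(f"{table_for(i)} AS {alias}")
        for term, column in ((s, "s"), (o, "o")):
            ref = f"{alias}.{column}"
            if not term.startswith("?"):
                conditions.append(f"{ref} = ?")  # constant term becomes a filter
                params.append(term)
            elif term in first_seen:
                conditions.append(f"{ref} = {first_seen[term]}")  # shared variable becomes a join
            else:
                first_seen[term] = ref
    select = ", ".join(f'{ref} AS "{var[1:]}"' for var, ref in first_seen.items())
    sql = f"SELECT {select} FROM {', '.join(from_parts)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vp_follows (s TEXT, o TEXT);
    CREATE TABLE vp_likes (s TEXT, o TEXT);
    INSERT INTO vp_follows VALUES ('A','B'), ('B','C'), ('B','D'), ('C','D');
    INSERT INTO vp_likes VALUES ('A','I1'), ('A','I2'), ('C','I2');
    -- Semi-join reduction: follows rows whose object also occurs as a likes subject.
    CREATE TABLE extvp_os_follows_likes AS
        SELECT f.s, f.o FROM vp_follows AS f
        WHERE EXISTS (SELECT 1 FROM vp_likes AS l WHERE l.s = f.o);
""")

# SPARQL pattern: ?x follows ?y . ?y likes ?z
patterns = [("?x", "follows", "?y"), ("?y", "likes", "?z")]
reduced_tables = ["extvp_os_follows_likes", "vp_likes"]
plain_tables = ["vp_follows", "vp_likes"]

sql, params = bgp_to_sql(patterns, lambda i: reduced_tables[i])
rows = sorted(conn.execute(sql, params).fetchall())

plain_sql, plain_params = bgp_to_sql(patterns, lambda i: plain_tables[i])
plain_rows = sorted(conn.execute(plain_sql, plain_params).fetchall())
print(sql)
print(rows)
```

Both table choices return the same bindings; the reduced table simply feeds fewer rows into the join. In a Spark setting the generated SQL string would be handed to the cluster's SQL interface instead of `sqlite3`.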

The authors evaluate S2RDF using the WatDiv benchmark, which generates a diverse set of 108 queries covering star, linear, snowflake, and mixed patterns on a synthetic RDF graph containing one billion triples. Compared with state‑of‑the‑art Hadoop‑based SPARQL systems such as Hive‑RDF, Sempala, and H2RDF, S2RDF achieves sub‑second response times for the majority of queries, with an average latency around 0.5 seconds. Even the most demanding linear and snowflake queries run orders of magnitude faster than the competitors, demonstrating that ExtVP’s semi‑join based reduction effectively curtails intermediate result explosion. Storage experiments show that applying a selectivity threshold of 0.01 reduces the ExtVP footprint to roughly 30 % of the full VP layout while retaining most of the speedup.

In summary, the paper contributes (1) the ExtVP partitioning scheme that brings join‑index‑like benefits to RDF without requiring a native RDF store, (2) a selective materialization strategy that balances storage cost and query performance, (3) a Spark‑SQL based query compiler that exploits ExtVP statistics for optimal table selection, and (4) an extensive empirical validation confirming that S2RDF outperforms existing Hadoop‑based SPARQL engines across a wide range of query shapes. The work demonstrates that combining semi‑join pre‑filtering with modern in‑memory cluster computing can deliver interactive SPARQL performance on billion‑triple graphs, opening avenues for further research in dynamic ExtVP maintenance, support for more complex SPARQL operators, and cost‑aware deployment in cloud environments.

