A Scalable Framework for Quality Assessment of RDF Datasets
Large amounts of data are being published openly as Linked Data by different data providers. A multitude of applications such as semantic search, query answering, and machine reading depend on these large-scale RDF datasets. The quality of the underlying RDF data plays a fundamental role in large-scale data-consuming applications. Measuring the quality of Linked Data spans a number of dimensions including, but not limited to, accessibility, interlinking, performance, syntactic validity, and completeness. Each of these dimensions can be expressed through one or more quality metrics. Since each quality metric tries to capture a particular aspect of the underlying data, numerous metrics are usually provided for the given data, and they may or may not be processed simultaneously.
On the other hand, the few existing techniques for quality assessment of RDF datasets are not adequate to assess data quality at large scale, and these approaches mostly fail to cope with the increasing volume of big data. To date, a limited number of solutions have been conceived to offer quality assessment of RDF datasets. However, these methods can either be used only on a small portion of large datasets or are narrowed down to specific problems, e.g., syntactic accuracy of literal values or accessibility of resources. In general, these existing efforts show severe deficiencies in terms of performance when data grows beyond the capabilities of a single machine. This limits the applicability of existing solutions to medium-sized datasets only, which in turn prevents applications from embracing the increasing volumes of available datasets.
To deal with big data, tools like Apache Spark have recently gained a lot of interest. Apache Spark provides scalability, resilience, and efficiency for dealing with large-scale data. Spark uses the concepts of Resilient Distributed Datasets (RDDs) and performs operations like transformations and actions on this data in order to effectively deal with large-scale data.
To handle large-scale RDF data, it is important to develop flexible and extensible methods that can assess the quality of data at scale. At the same time, due to the broadness and variety of quality assessment domain and resulting metrics, there is a strong need to provide a generic pattern to characterize the quality assessment of RDF data in terms of scalability and applicability to big data.
In this paper, we borrow the concepts of data transformation and action from Spark and present a pattern for designing quality assessment metrics over large RDF datasets, which is inspired by design patterns. In software engineering, design patterns are general and reusable solutions to common problems. Akin to a design pattern, which acts like a blueprint that can be customized to solve a particular design problem, the introduced concept of a Quality Assessment Pattern ($`\mathcal{QAP}`$) represents a generalized blueprint of scalable quality assessment metrics. In this way, quality metrics designed following $`\mathcal{QAP}`$ can scale to large data and work in a distributed manner. In addition, we provide an open source implementation and assessment of these quality metrics in Apache Spark following the proposed $`\mathcal{QAP}`$.
Our contributions can be summarized in the following points:
- We present a Quality Assessment Pattern $`\mathcal{QAP}`$ to characterize scalable quality metrics.
- We provide DistQualityAssessment – a distributed (open source) implementation of quality metrics using Apache Spark.
- We perform an analysis of the complexity of the metric evaluation in the cluster.
- We evaluate our approach and demonstrate empirically its superiority over a previous centralized approach.
- We integrated the approach into the SANSA framework. SANSA is actively maintained and uses the community ecosystem (mailing list, issue trackers, continuous integration, website, etc.).
- We briefly present three use cases where DistQualityAssessment has been used.
The paper is structured as follows: Our approach for the computation of RDF dataset quality metrics is detailed in Section 2 and evaluated in Section 3. Related work on the computation of quality metrics for RDF datasets is discussed in Section 5. Finally, we conclude and suggest planned extensions of our approach in Section 6.
In this section, we first introduce basic notions used in our approach, the formal definition of the proposed quality assessment pattern and then describe the workflow.
Quality Assessment Pattern
Data quality is commonly conceived as a multi-dimensional construct with a popular notion of 'fitness for use' and can be measured along many dimensions $`\mathcal{D}`$ such as accuracy ($`d_{accu} \in \mathcal{D}`$), completeness ($`d_{comp} \in \mathcal{D}`$) and timeliness ($`d_{tmls} \in \mathcal{D}`$). The assessment of a quality dimension $`d`$ is based on quality metrics $`\mathcal{QM} = \{m_1, m_2, \ldots, m_k\}`$ where $`m_i`$ is a heuristic that is designed to fit a specific assessment dimension. The following definitions form the basis of $`\mathcal{QAP}`$.
Let $`\mathcal{F} = \{f_1, f_2, \ldots, f_l\}`$ be a set of filters where each filter $`f_i`$ sets a criterion for extracting predicates, objects, subjects, or their combination. A filter $`f_i`$ takes a set of RDF triples as input and returns a subgraph that satisfies the filtering criterion.
Let $`\mathcal{R} = \{r_1, r_2, \ldots, r_j\}`$ be a set of rules where each rule $`r_i`$ sets a conditional criterion. A rule takes a subgraph as input and returns a new subgraph that satisfies the conditions posed by the rule $`r_i`$.
A transformation $`\tau:\mathcal{G} \rightarrow \mathcal{G'}`$ is an operation that applies rules defined by $`\mathcal{R}`$ on the RDF graph $`\mathcal{G}`$ and returns an RDF subgraph $`\mathcal{G'}`$. A transformation $`\tau`$ can be a union $`\cup`$ or intersection $`\cap`$ of other transformations.
An action $`\alpha: \mathcal{G}\rightarrow \mathbb{R}`$ is an operation that triggers the transformation of rules on the filtered RDF graph $`\mathcal{G'}`$ and generates a numerical value. Action $`\alpha`$ is the count of elements obtained after performing a $`\tau`$ operation.
The Quality Assessment Pattern $`\mathcal{QAP}`$ is a reusable template to implement and design scalable quality metrics. The $`\mathcal{QAP}`$ is composed of transformations and actions. The output of a $`\mathcal{QAP}`$ is the outcome of an action returning a numeric value against the particular metric.
$`\mathcal{QAP}`$ is inspired by Apache Spark operations and designed to fit different data quality metrics (for more details see 1). Each data quality metric can be defined following the $`\mathcal{QAP}`$. Any given data quality metric $`m_i`$ that is represented through the $`\mathcal{QAP}`$ using transformation $`\tau`$ and action $`\alpha`$ operations can be easily transformed into Spark code to achieve scalability.
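The definitions above can be sketched on a single machine with plain Python lists standing in for distributed collections. This is only an illustration under stated assumptions: the names `is_uri`, `transformation` and `action`, the string-based triple representation, and the example IRIs are hypothetical and not part of DistQualityAssessment.

```python
from typing import Callable, Iterable, List, Tuple

# Assumption: a triple is a (subject, predicate, object) tuple of plain strings.
Triple = Tuple[str, str, str]


def is_uri(term: str) -> bool:
    # Simplified rule: anything starting with http(s):// counts as a URI.
    return term.startswith(("http://", "https://"))


def transformation(graph: Iterable[Triple],
                   rule: Callable[[Triple], bool]) -> List[Triple]:
    # tau: G -> G', keeps only the triples that satisfy the rule.
    return [t for t in graph if rule(t)]


def action(subgraph: Iterable[Triple]) -> int:
    # alpha: counts the elements obtained after a transformation.
    return sum(1 for _ in subgraph)


graph: List[Triple] = [
    ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
    ("http://example.org/a", "http://example.org/name", "Alice"),
    ("http://example.org/b", "http://example.org/name", "Bob"),
]

# Rule isURI applied to the object filter (?o), followed by two count actions.
uri_objects = transformation(graph, lambda t: is_uri(t[2]))
ratio = action(uri_objects) / action(graph)
print(ratio)
```

The metric value is an arithmetic combination of two actions, which mirrors the shape `Action OP Action` used by the pattern.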
| Symbol | | Definition |
|---|---|---|
| Quality Metric | := | Action $`\mid`$ (Action $`\mathcal{OP}`$ Action) |
| $`\mathcal{OP}`$ | := | $`*`$ $`\mid`$ $`-`$ $`\mid`$ $`/`$ $`\mid`$ $`+`$ |
| Action | := | Count(Transformation) |
| Transformation | := | Rule(Filter) $`\mid`$ (Transformation BOP Transformation) |
| Filter | := | getPredicates $`\sim ?p`$ $`\mid`$ getSubjects $`\sim ?s`$ $`\mid`$ getObjects $`\sim ?o`$ $`\mid`$ getDistinct(Filter) $`\mid`$ Filter or Filter $`\mid`$ Filter && Filter |
| Rule | := | isURI(Filter) $`\mid`$ isIRI(Filter) $`\mid`$ isInternal(Filter) $`\mid`$ isLiteral(Filter) $`\mid`$ !isBroken(Filter) $`\mid`$ hasPredicateP $`\mid`$ hasLicenceAssociated(Filter) $`\mid`$ hasLicenceIndications(Filter) $`\mid`$ isExternal(Filter) $`\mid`$ hasType(Filter) $`\mid`$ isLabeled(Filter) |
| BOP | := | $`\cap`$ $`\mid`$ $`\cup`$ |
Quality Assessment Pattern
[tab:MetricRules] demonstrates a few selected quality metrics defined against the proposed $`\mathcal{QAP}`$. As shown in [tab:MetricRules], each quality metric can contain multiple rules, filters or actions. It is worth mentioning that the action count(triples) returns the total number of triples in the given data. It can also be seen that an action can be an arithmetic combination of multiple actions, e.g., a ratio or a sum. We illustrate our proposed approach on a selection of metrics. Given that the aim of this paper is to show the applicability of the proposed approach and to compare it with existing methods, we have only selected metrics that are already provided out of the box in Luzzu.
| Metric | Transformation $`\tau`$ | Action $`\alpha`$ |
|---|---|---|
| Detection of a Machine Readable License | $`r`$ = hasLicenceAssociated(?p) | $`\alpha`$ = count($`r`$); $`\alpha`$ > 0 ? 1 : 0 |
| Detection of a Human Readable License | $`r`$ = isURI(?s) $`\cap`$ hasLicenceIndications(?p) $`\cap`$ isLiteral(?o) $`\cap`$ isLicenseStatement(?o) | $`\alpha`$ = count($`r`$); $`\alpha`$ > 0 ? 1 : 0 |
| Linkage Degree of Linked External Data Providers | $`r_1`$ = isIRI(?s) $`\cap`$ internal(?s) $`\cap`$ isIRI(?o) $`\cap`$ external(?o); $`r_2`$ = isIRI(?s) $`\cap`$ external(?s) $`\cap`$ isIRI(?o) $`\cap`$ internal(?o); $`r_3`$ = $`r_1 \cup r_2`$ | $`\alpha_1`$ = count($`r_3`$); $`\alpha_2`$ = count(triples); $`\alpha`$ = $`\alpha_1 / \alpha_2`$ |
| Detection of a Human Readable Labels | $`r_1`$ = isURI(?s) $`\cap`$ isInternal(?s) $`\cap`$ isLabeled(?p); $`r_2`$ = isInternal(?p) $`\cap`$ isLabeled(?p); $`r_3`$ = isURI(?o) $`\cap`$ isInternal(?o) $`\cap`$ isLabeled(?p) | $`\alpha_1`$ = count($`r_1`$) + count($`r_2`$) + count($`r_3`$); $`\alpha_2`$ = count(triples); $`\alpha_1 / \alpha_2`$ |
| Short URIs | $`r_1`$ = isURI(?s) $`\cup`$ isURI(?p) $`\cup`$ isURI(?o); $`r_2`$ = resTooLong(?s, ?p, ?o) | $`\alpha_1`$ = count($`r_2`$); $`\alpha_1`$ / count(triples) |
| Identification of Literals with Malformed Datatypes | $`r`$ = isLiteral(?o) $`\cap`$ getDatatype(?o) $`\cap`$ isLexicalFormCompatibleWithDatatype(?o) | $`\alpha`$ = count($`r`$) |
| Extensional Conciseness | $`r`$ = isURI(?s) $`\cap`$ isURI(?o) | $`\alpha_1`$ = count($`r`$); $`\alpha_2`$ = count(triples); $`(\alpha_2 - \alpha_1) / \alpha_2`$ |
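To make two rows of the table concrete, the following sketch evaluates the machine-readable license indicator and the extensional conciseness ratio on a toy graph. It is a minimal single-machine illustration: the set `LICENCE_PREDICATES`, the helper names, and the example IRIs are assumptions made for this example, not the predicates or functions used by the actual implementation.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Assumed, illustrative set of predicates that indicate a licence.
LICENCE_PREDICATES = {
    "http://purl.org/dc/terms/license",
    "http://creativecommons.org/ns#license",
}


def is_uri(term: str) -> bool:
    # Simplified check used only for this sketch.
    return term.startswith(("http://", "https://"))


def machine_readable_license(triples: List[Triple]) -> int:
    r = [t for t in triples if t[1] in LICENCE_PREDICATES]  # r = hasLicenceAssociated(?p)
    alpha = len(r)                                          # alpha = count(r)
    return 1 if alpha > 0 else 0                            # alpha > 0 ? 1 : 0


def extensional_conciseness(triples: List[Triple]) -> float:
    # Assumes a non-empty list of triples.
    r = [t for t in triples if is_uri(t[0]) and is_uri(t[2])]  # isURI(?s) and isURI(?o)
    alpha_1 = len(r)
    alpha_2 = len(triples)
    return (alpha_2 - alpha_1) / alpha_2


triples: List[Triple] = [
    ("http://example.org/ds", "http://purl.org/dc/terms/license",
     "http://creativecommons.org/licenses/by/4.0/"),
    ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
    ("http://example.org/a", "http://example.org/name", "Alice"),
    ("http://example.org/b", "http://example.org/name", "Bob"),
]

license_flag = machine_readable_license(triples)
conciseness = extensional_conciseness(triples)
print(license_flag, conciseness)
```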
System Overview
In this section, we give an overall description of the data model and the architecture of DistQualityAssessment. We model and store RDF graphs $`\mathcal{G}`$ based on the basic building block of the Spark framework, RDDs. RDDs are in-memory collections of records that can be operated on in parallel on a large distributed cluster. RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and reduce): operations applied to an entire RDD. A map function transforms each value from an input RDD into another value while applying $`\tau`$ rules. A filter transforms an input RDD into an output RDD that contains only the elements that satisfy a given condition. Reduce aggregates the RDD elements using a specific function from $`\tau`$.
The computation of the set of quality metrics $`\mathcal{QM}`$ is performed using Spark as depicted in Figure 1. Our approach consists of four steps:
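The three coarse-grained operations can be illustrated with Python's built-in `map`, `filter` and `functools.reduce`, which have the same semantics on a local list as the corresponding operations have on a distributed collection. The toy triples and the "non-URI means literal" test are assumptions made for this sketch.

```python
from functools import reduce

# Assumption: triples are (subject, predicate, object) tuples of strings.
triples = [
    ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
    ("http://example.org/a", "http://example.org/name", "Alice"),
    ("http://example.org/b", "http://example.org/name", "Bob"),
]

# map: transform each triple into its object term.
objects = list(map(lambda t: t[2], triples))

# filter: keep only the objects that are not URIs (treated as literals here).
literals = list(filter(lambda o: not o.startswith("http"), objects))

# reduce: aggregate the remaining elements, here by summing their lengths.
total_length = reduce(lambda acc, o: acc + len(o), literals, 0)

print(objects, literals, total_length)
```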
Defining quality metrics parameters (step 1)
The metric definitions are kept in a dedicated file which contains most of the configurations needed for the system to evaluate quality metrics and gather result sets.
Retrieving the RDF data (step 2)
RDF data first needs to be loaded into a large-scale storage that Spark can efficiently read from. We use the Hadoop Distributed File System (HDFS). HDFS is able to store any type of data in its Hadoop-native format and distributes it across a cluster while replicating it for fault tolerance. In such a distributed environment, Spark automatically adopts different data locality strategies to perform computations as close to the needed data as possible in HDFS and thus avoids data transfer overhead.
Parsing and mapping RDF into the main dataset (step 3)
We first create a distributed dataset called the main dataset, which represents the HDFS file as a collection of triples. In Spark, this dataset is parsed and loaded into an RDD of triples having the format Triple$`<`$(s,p,o)$`>`$.
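The parsing step can be sketched as follows for N-Triples input. The regular expression is a deliberate simplification (it does not validate the full N-Triples grammar), and `parse_line`, `Triple` and the sample lines are hypothetical names and data chosen for this illustration; the actual system relies on an RDF parser and Spark RDDs.

```python
import re
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    s: str
    p: str
    o: str


# Simplified pattern: subject is an IRI or blank node, predicate is an IRI,
# and the object is everything up to the terminating " ." of the line.
LINE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')


def parse_line(line: str) -> Optional[Triple]:
    # Skip blank lines and comments; return None for lines that do not match.
    if not line.strip() or line.lstrip().startswith("#"):
        return None
    match = LINE.match(line)
    return Triple(*match.groups()) if match else None


lines = [
    '# a comment',
    '<http://example.org/a> <http://example.org/name> "Alice" .',
    '',
    '<http://example.org/a> <http://example.org/knows> <http://example.org/b> .',
]

# The "main dataset": one Triple per successfully parsed line.
main_dataset: List[Triple] = [t for t in map(parse_line, lines) if t is not None]
print(main_dataset)
```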
Quality metric evaluation (step 4)
Considering the particular quality metric, Spark generates an execution plan, which is composed of one or more $`\tau`$ transformations and $`\alpha`$ actions. The numerical output of the final action is the quality of the input RDF data with respect to the given metric.
Implementation
We have used the Scala programming language API in Apache Spark to provide the distributed implementation of the proposed approach.
DistQualityAssessment (see [alg:DistQualityAssessment]) constructs the main dataset ([line:rdf2rdd]) by reading RDF data (e.g., an N-Triples file or any other RDF serialization format) and converting it into an RDD of triples. The latter undergoes a transformation that applies the filtering through the rules in $`\mathcal{R}`$ and produces a new filtered RDD ($`\mathcal{G'}`$) ([line:filter]). $`\mathcal{G'}`$ then serves as input to the next step, which applies a set of $`\alpha`$ actions ([line:action]). The output of this step is the metric value, represented as a number ([line:action]). The result set of the different quality metrics ([line:result]) can be further visualized and monitored using SANSA-Notebooks. The user can also choose to extract the output in a machine-readable format ([line:dqvify]). We use the Data Quality Vocabulary (DQV) to represent the quality metrics.
$`\textit{triples} = spark.\textbf{rdf}(lang)(input)`$
$`\textit{triples}.persist()`$
$`dqv \leftarrow \emptyset`$
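The overall workflow described above (load the triples once, evaluate each metric, and export the results using DQV terms) might be sketched as follows. This is not the algorithm listing itself but a single-machine approximation: `assess`, `dqvify`, the metric IRIs under `http://example.org/`, and the blank-node naming are assumptions of this sketch; only the DQV terms `QualityMeasurement`, `isMeasurementOf`, `computedOn` and `value` come from the W3C vocabulary.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
Metric = Callable[[List[Triple]], float]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DQV = "http://www.w3.org/ns/dqv#"


def assess(triples: List[Triple], metrics: Dict[str, Metric]) -> Dict[str, float]:
    # Evaluate every metric against the same in-memory collection of triples
    # (the list plays the role of the persisted dataset).
    return {name: metric(triples) for name, metric in metrics.items()}


def dqvify(dataset_iri: str, results: Dict[str, float]) -> List[Triple]:
    # Emit one dqv:QualityMeasurement (four triples) per metric result.
    dqv: List[Triple] = []
    for i, (metric_iri, value) in enumerate(sorted(results.items())):
        node = f"_:measurement{i}"
        dqv.append((node, RDF_TYPE, DQV + "QualityMeasurement"))
        dqv.append((node, DQV + "isMeasurementOf", metric_iri))
        dqv.append((node, DQV + "computedOn", dataset_iri))
        dqv.append((node, DQV + "value", str(value)))
    return dqv


triples: List[Triple] = [
    ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
    ("http://example.org/a", "http://example.org/name", "Alice"),
    ("http://example.org/b", "http://example.org/name", "Bob"),
]

# Hypothetical metric identifiers used only for this example.
metrics: Dict[str, Metric] = {
    "http://example.org/metric/tripleCount": lambda ts: float(len(ts)),
    "http://example.org/metric/uriObjectRatio":
        lambda ts: sum(1 for t in ts if t[2].startswith("http")) / len(ts),
}

results = assess(triples, metrics)
dqv = dqvify("http://example.org/dataset", results)
for statement in dqv:
    print(statement)
```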
Furthermore, we also provide a Docker image of the system integrated within the BDE platform - an open source Big Data processing platform allowing users to install numerous big data processing tools and frameworks and create working data flow applications.
The work done here (available under Apache License 2.0) has been integrated into SANSA, an open source data flow processing engine for scalable processing of large-scale RDF datasets. SANSA uses Spark, offering fault-tolerant, highly available and scalable approaches to process massive datasets efficiently. SANSA provides facilities for semantic data representation, querying, inference, and analytics at scale. Through this integration, DistQualityAssessment can take advantage of the same user community and infrastructure built by the SANSA project. It also helps ensure the sustainability of the tool, given that SANSA is supported by several grants until at least 2021.
Complexity Analysis
We deem that the overall time complexity of the distributed quality assessment evaluation is $`O(n)`$. The performance of metrics computation depends on data shuffling (while filtering using rules in $`\mathcal{R}`$) and data scanning. Our approach performs a direct mapping of any quality metric designed using $`\mathcal{QAP}`$ into a sequence of Spark-compliant Scala commands; as a consequence, most of the operators used are a series of transformations like $`map`$, $`filter`$ and $`reduce`$. The complexity of $`map`$ and $`filter`$ is considered to be linear with respect to the number of triples associated with it. The complexity of a metric then depends on the $`\alpha`$ operation that returns the count of the filtered output. This latter step works on the RDD distributed between $`p`$ nodes, which implies that the complexity on each node becomes $`O(n/p)`$, where $`n`$ is the number of input triples. Let $`O(\tau)`$ be the complexity of $`\tau`$; then the complexity of the metric will be $`O(n/p \cdot O(\tau))`$. This indicates that the runtime increases linearly when the size of an RDD increases and decreases linearly when more nodes $`p`$ are added to the cluster.
The main aim of DistQualityAssessment is to serve massive, large-scale, real-life RDF datasets. We are interested in addressing the following additional questions.
- Flexibility: How fast does our approach process different types of metrics?
- Scalability: How large are the RDF datasets that DistQualityAssessment can scale to? What is the system speedup w.r.t. the number of nodes in cluster mode?
- Efficiency: How well does our approach perform compared with other state-of-the-art systems on real-world datasets?
In the following, we present our experimental setup including the datasets used. Thereafter, we give an overview of our results.
Experimental Setup
We chose two real-world datasets and one synthetic dataset for our experiments:
- DBpedia (v 3.9) – a cross-domain dataset. DBpedia is a knowledge base with a large ontology. We build a set of 3 pipelines of increasing complexity: (i) $`M_{DBpedia}^{en}`$ ($`\approx`$ 813M triples); (ii) $`M_{DBpedia}^{de}`$ ($`\approx`$ 337M triples); (iii) $`M_{DBpedia}^{fr}`$ ($`\approx`$ 341M triples). DBpedia has been chosen because of its popularity in the Semantic Web community.
- LinkedGeoData – a spatial RDF knowledge base derived from OpenStreetMap.
- Berlin SPARQL Benchmark (BSBM) – a synthetic dataset based on an e-commerce use case containing a set of products that are offered by different vendors and reviews posted by consumers about products. The benchmark provides a data generator, which can be used to create sets of connected triples of any particular size.
Properties of the considered datasets are given in [tab:dataset_info].
| | LinkedGeoData | DBpedia en | DBpedia de | DBpedia fr | BSBM 2GB | BSBM 20GB | BSBM 200GB |
|---|---|---|---|---|---|---|---|
| Number of triples | 1,292,933,812 | 812,545,486 | 336,714,883 | 340,849,556 | 8,289,484 | 81,980,472 | 817,774,057 |
| Size (GB) | 191.17 | 114.4 | 48.6 | 49.77 | 2 | 20 | 200 |
We implemented DistQualityAssessment using Spark 2.4.0, Scala 2.11.11 and Java 8, and all the data were stored on the HDFS cluster using Hadoop 2.8.0. The experiments in local mode are all performed on a single instance of the cluster. Specifically, we compare our approach with Luzzu v4.0.0, a state-of-the-art quality assessment system. All distributed experiments were carried out on a small cluster of 7 nodes (1 master, 6 workers): Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (32 Cores), 128 GB RAM, 12 TB SATA RAID-5. The machines were connected via a Gigabit network. All experiments have been executed three times and the average value is reported in the results.
Results
We evaluate the proposed approach using the above datasets to compare it against Luzzu. We carry out two sets of experiments. First, we evaluate the runtime of our distributed approach in contrast to Luzzu. Second, we evaluate the horizontal scalability by increasing the number of nodes in the cluster. Results of the experiments are presented in [tbl:performance-evaluation], Figure 2 and Figure 3. Based on their definition, some metrics make use of external access (e.g. Dereferenceability of Forward Links), which leads to a significant increase in Spark processing time due to network latency. For the sake of the evaluation we have excluded such metrics. Consequently, we chose seven metrics (see [tab:MetricRules] for more details) whose level of difficulty varies from simple to complex according to the combination of transformation/action operations involved.
We started our experiments by evaluating the speedup gained by adopting a distributed implementation of quality assessment metrics using our approach, and compared it against Luzzu. We ran the experiments on five datasets ($`DBpedia_{en}`$, $`DBpedia_{de}`$, $`DBpedia_{fr}`$, $`LinkedGeoData`$ and $`BSBM_{200GB}`$). Local mode represents a single instance of the cluster without any tuning of the Spark configuration, and the cluster mode includes further tuning. Luzzu was run in a local environment on a single machine with two strategies: (1) streaming the data for each metric separately, and (2) one stream/load, with all metrics evaluated just once.
Runtime (m) (mean/std)

| Luzzu a) single | Luzzu b) joint | DistQualityAssessment c) local | DistQualityAssessment d) cluster | e) speedup ratio w.r.t. Luzzu (w.r.t. DistQualityAssessment c)) |
|---|---|---|---|---|
| Fail | Fail | 446.9/63.34 | 7.79/0.54 | n/a (56.4x) |
| Fail | Fail | 274.31/38.17 | 1.99/0.04 | n/a (136.8x) |
| Fail | Fail | 161.4/24.18 | 0.46/0.04 | n/a (349.9x) |
| Fail | Fail | 195.3/26.16 | 0.38/0.04 | n/a (512.9x) |
| Fail | Fail | 454.46/78.04 | 7.27/0.64 | n/a (61.5x) |
| 2.64/0.02 | 2.65/0.01 | 0.04/0.0 | 0.42/0.04 | 65x (-0.9x) |
| 5.9/0.16 | 5.66/0.02 | 0.04/0.0 | 0.43/0.03 | 146.5x (-0.9x) |
| 16.38/0.44 | 15.39/0.21 | 0.05/0.0 | 0.46/0.02 | 326.6x (-0.9x) |
| 40.59/0.56 | 37.94/0.28 | 0.06/0.0 | 0.44/0.05 | 675.5x (-0.9x) |
| 101.8/0.72 | 101.78/0.64 | 0.07/0.0 | 0.4/0.03 | 1453.3x (-0.8x) |
| 459.19/18.72 | 468.64/21.7 | 0.15/0.01 | 0.48/0.03 | 3060.3x (-0.7x) |
| 1454.16/10.55 | 1532.95/51.6 | 0.4/0.02 | 0.56/0.02 | 3634.4x (-0.3x) |
| Timeout | Timeout | 3.19/0.16 | 0.62/0.04 | n/a (4.1x) |
| Timeout | Timeout | 29.44/0.14 | 0.52/0.01 | n/a (55.6x) |
| Fail | Fail | 34.32/9.22 | 0.75/0.29 | n/a (44.8x) |
[tbl:performance-evaluation] shows the performance of the two approaches applied to five datasets. In [tbl:performance-evaluation] we indicate “Timeout” whenever the process did not complete within a certain amount of time (we set the timeout delay to 24 hours for the quality assessment evaluation stage) and “Fail” when the system crashed before this timeout delay. Column Luzzu$`^{a)}`$ reports the performance of Luzzu when the data is streamed separately for each metric, so that the metrics are evaluated in sequence, whereas column Luzzu$`^{b)}`$ reports the performance of Luzzu using a joint load, in which all metrics are evaluated on a single load. The last columns report the performance of DistQualityAssessment in local mode $`c)`$ and in cluster mode $`d)`$. Column $`e)`$ reports the speedup ratio of our approach in local mode compared to Luzzu$`^{a)}`$ ($`a)/c)-1`$) and, in parentheses, the ratio between the local and cluster modes of our approach ($`c)/d)-1`$). We observe that our approach finishes on all datasets, whereas Luzzu either times out or fails at some point.
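As a worked check of column $`e)`$, the tabulated values appear consistent with reading the first ratio as $`a)/c)-1`$ and the parenthesized ratio as $`c)/d)-1`$. The sketch below recomputes three entries from mean runtimes copied from the table; the helper name `ratio` is hypothetical.

```python
def ratio(slower: float, faster: float) -> float:
    # Relative speedup: how many times faster, minus one.
    return slower / faster - 1


# Mean runtimes in minutes, copied from one row of the second part of the table.
luzzu_single, local, cluster = 1454.16, 0.4, 0.56
vs_luzzu = round(ratio(luzzu_single, local), 1)   # Luzzu a) against local mode c)
vs_local = round(ratio(local, cluster), 1)        # local mode c) against cluster mode d)

# Mean runtimes in minutes, copied from the first row of the table.
large_local, large_cluster = 446.9, 7.79
large_vs_local = round(ratio(large_local, large_cluster), 1)

print(vs_luzzu, vs_local, large_vs_local)
```

Under this reading the sketch yields 3634.4, -0.3 and 56.4, matching the corresponding table cells; a negative value means that the cluster mode was slower than the local mode for that dataset.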
Unfortunately, Luzzu was not capable of evaluating the metrics over the large-scale RDF datasets from [tbl:performance-evaluation] (part one). For that reason we ran another set of experiments on very small datasets that Luzzu was able to handle. The second part of [tbl:performance-evaluation] shows a performance evaluation of our approach compared with Luzzu on very small RDF datasets. In some cases (e.g. [qm:RC1], [qm:SV3]) for a very small dataset Luzzu performs better than our approach in local mode by a small margin. This is due to the fact that in streaming mode, when Luzzu$`^{a)}`$ finds the first statement that fulfills the condition (e.g. finding the shortest URIs), it stops the evaluation and returns the results. In contrast, our approach evaluates the metrics over the whole dataset, exploiting the fault-tolerance and resilience features built into Spark. In other cases Luzzu suffers from significant slowdowns and is several orders of magnitude slower. Therefore, its average runtime over all metrics is worse than that of our approach. It is important to note that the performance of our approach on these very small datasets degrades when running in cluster mode. This is because of the network overhead while shuffling the data, but it still outperforms Luzzu$`^{a),b)}`$ when considering the average runtime over all the metrics (even for very small datasets).
Findings shown in [tbl:performance-evaluation] indicate that our approach starts outperforming Luzzu when the size of the dataset grows (e.g. $`BSBM_{2GB}`$). The runtime in cluster mode stays constant when the size of the data fits into the main memory of the cluster. On the other hand, Luzzu is not able to evaluate the metrics when the size of the data starts increasing; the time taken exceeds the timeout delay we set, even for small datasets. Because of the large differences, we have used a logarithmic scale to better visualize these results.
In this experiment we evaluate the efficiency of our approach. Figures 2 and 3 illustrate the results of the comparative efficiency analysis.
Data scalability. To measure the size-up scalability of our approach, we run experiments on five different sizes. We fix the number of nodes to 6 and grow the size of the datasets to measure whether DistQualityAssessment can deal with larger datasets. For this set of experiments we use the BSBM benchmark tool to generate synthetic datasets of different sizes, since the real-world datasets are unique in their size and attributes.
We start by generating a dataset of 2GB. Then, we iteratively increase the size of the datasets. On each dataset, we run our approach, and the runtime is reported in Figure 2. The $`x`$-axis shows the size of the BSBM dataset, increasing by a factor of 10 at each step.
By comparing the runtimes (see Figure 2), we note that the execution time increases linearly with the size of the dataset and, as expected, stays near-constant as long as the data fits in memory. This demonstrates one of the advantages of utilizing the in-memory approach for performing the quality assessment computation. The overall time spent on data read/write and network communication found in disk-based approaches is saved. However, when the data overflows the memory and is spilled to disk, the performance degrades. These results show the scalability of our algorithm in the context of size-up.
Node scalability. In order to measure node scalability, we vary the number of workers on our cluster from 1 to 6.
Figure 3 shows the speedup for $`BSBM_{200GB}`$ with a varying number of worker nodes. We can see that as the number of workers increases, the execution time decreases almost linearly. The execution time decreases about 14 times (from 433.31 minutes down to 28.8 minutes) as the cluster grows from one to six worker nodes. The results shown here imply that our approach can achieve near-linear scalability in performance in the context of speedup.
Furthermore, we evaluate the effectiveness of our approach. Speedup $`S`$ is an important metric for evaluating a parallel algorithm. It is defined as the ratio $`S = T_s/T_n`$, where $`T_s`$ represents the execution time of the algorithm run on a single node and $`T_n`$ represents the execution time required for the same algorithm on $`n`$ nodes with the same configuration and resources. Efficiency is defined as the ratio $`E = S/n = T_s/(n\,T_n)`$, which measures the processing power being used, in our case the speedup per node. The speedup and efficiency curves of DistQualityAssessment are shown in Figure 4. The trend shows that it achieves almost linear speedup and even super-linear speedup in some cases. The upper curve in Figure 4 indicates super-linear speedup: the speedup grows faster than the number of worker nodes. This is due to the metric computation being computationally intensive and the data not fitting in the cache when executed on a single node, whereas it fits into the caches of several machines when the workload is divided amongst the cluster for parallel evaluation. While using Spark, the super-linear speedup is an outcome of the improved complexity and runtime, in addition to the efficient memory management behavior of the parallel execution environment.
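The two definitions can be applied directly to the runtimes reported above for $`BSBM_{200GB}`$ (433.31 minutes on one worker and 28.8 minutes on six workers). The function names below are hypothetical; the sketch only restates the formulas for $`S`$ and $`E`$.

```python
def speedup(t_single: float, t_n: float) -> float:
    # S = T_s / T_n
    return t_single / t_n


def efficiency(t_single: float, t_n: float, n: int) -> float:
    # E = S / n = T_s / (n * T_n)
    return speedup(t_single, t_n) / n


# Runtimes in minutes for one and six worker nodes, as reported in the text.
t_s, t_6 = 433.31, 28.8
s = speedup(t_s, t_6)
e = efficiency(t_s, t_6, 6)
print(round(s, 2), round(e, 2))
```

With these inputs $`S \approx 15.05`$, which corresponds to roughly 14 times faster under the ratio-minus-one convention of the performance table, and $`E \approx 2.51`$. An efficiency above 1 is what the text refers to as super-linear speedup.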
In order to test the correctness of the implemented metrics, we assess the numerical values for metrics like [qm:L1], [qm:L2], and [qm:RC1] on very small datasets, and the results were found to be correct w.r.t. Luzzu. For metrics like [qm:I2] and [qm:CN2], Luzzu uses approximate values for faster performance, which are not the same as the exact numbers obtained by our implementation.
We analyze the overall run-time of the metric evaluation. [fig:overall-analysis] reports on the run-time of each metric considered in this paper (see [tab:MetricRules]) on both $`BSBM_{20GB}`$ and $`BSBM_{200GB}`$ datasets.
DistQualityAssessment implements predefined quality assessment metrics. We have implemented these metrics in a distributed manner such that most of them have a run-time complexity of $`O(n)`$, where $`n`$ is the number of input triples. The overall performance analysis for the two instances of the BSBM dataset is shown in [fig:overall-analysis]. The results obtained show that the execution is sometimes a little longer when shuffling is involved in the cluster compared to when data is processed without movement, e.g. for metrics [qm:L2] and [qm:L1]. Metrics [qm:SV3] and [qm:CN2] are the most expensive ones in terms of runtime. This is due to the extra overhead caused by extracting the literals for objects and checking the lexical form of their datatypes.
Overall, the evaluation study carried out in this paper demonstrates that distributed computation of different quality measures is scalable and the execution ends in reasonable time given the large volume of data.
Data quality assessment becomes challenging with the increasing sizes of data. Many existing tools mostly contain customized data quality functionality to detect and analyze data quality issues within their own domain. However, this process is both data-intensive and computing-intensive, and it is a challenge to develop fast and efficient algorithms that can handle large-scale RDF datasets.
In this paper, we have introduced DistQualityAssessment, a novel approach for distributed in-memory evaluation of RDF quality assessment metrics implemented on top of the Spark framework. The presented approach offers generic features to solve common data quality checks. As a consequence, this can enable further applications to build trusted data utilities.
We have demonstrated empirically that our approach improves upon the previous centralized approach that we have compared against. The benefit of using Spark is that its core concepts (RDDs) are designed to scale horizontally. Users can adapt the cluster size to the data size, dropping nodes when they are not needed and adding more when there is a need for them.
Although we have achieved reasonable results in terms of scalability, we plan to further improve time efficiency by applying intelligent partitioning strategies, persisting the data in memory to an even higher extent, and performing dependency analysis in order to evaluate multiple metrics simultaneously. We also plan to explore near real-time interactive quality assessment of large-scale RDF data using Spark Streaming. Finally, in the future we intend to develop a declarative plugin for the current work using the Quality Metric Language (QML), which gives users the ability to express, customize and enhance quality metrics.
The proposed quality assessment tool is being used in many use cases. These include the projects QROWD and SLIPO, and an industrial application by Alethio.
QROWD is a cross-sectoral streaming Big Data integration project including geographic, transport, meteorological, cross-domain and news data, aiming to capitalize on hybrid Big Data integration and analytics methods. One of the major challenges faced in QROWD is to investigate options for effective and scalable data quality assessment on integrated (RDF) datasets using their crowdsourcing platform. In order to perform this task efficiently and effectively, QROWD uses DistQualityAssessment as an underlying quality assessment framework.
Alethio has built an Ethereum analytics platform that strives to provide transparency over the transaction pool of the whole Ethereum ecosystem. Their 18 billion triple dataset contains large-scale blockchain transaction data modelled as RDF according to the structure of the Ethereum ontology. Alethio is using SANSA in general and DistQualityAssessment in particular for performing large-scale batch quality checks, e.g. analysing the quality while merging new data, computing attack pattern frequencies and fraud detection. Alethio uses DistQualityAssessment on a cluster of 100 worker nodes to assess the quality of their $`\approx`$7 TB of data.
SLIPO is a project which leverages semantic web technologies for scalable and quality assured integration of large Point of Interest (POI) datasets. One of the key features of the project is the fusion process. SLIPO-fusion receives two different RDF datasets containing POIs and their properties, as well as a set of links between POI entities of the two datasets. SLIPO is using DistQualityAssessment to assess the quality of both input datasets. The SLIPO-fusion produces a third, final dataset, containing consolidated descriptions of the linked POIs. This process is often data and processing intensive, therefore, it requires a scalable mechanism for data quality check. SLIPO uses DistQualityAssessment for fusion validation and quality statistics/assessment to facilitate and assure the quality of the fusion process.