A RDF-based Data Integration Framework

Data integration is one of the main problems in distributed data sources. An approach is to provide an integrated mediated schema for various data sources. This research work aims at developing a framework for defining an integrated schema and querying on it. The basic idea is to employ recent standard languages and tools to provide a unified data integration framework. RDF is used for integrated schema descriptions as well as providing a unified view of data. RDQL is used for query reformulation. Furthermore, description logic inference services provide necessary means for satisfiability checking of concepts in integrated schema. The framework has tools to display integrated schema, query on it, and provides enough flexibilities to be used in different application domains.

💡 Research Summary

The paper addresses the long‑standing challenge of integrating heterogeneous, distributed data sources by proposing a unified framework that leverages Semantic Web standards and logical inference. The core idea is to use RDF (Resource Description Framework) as the lingua franca for describing an integrated schema (or mediated ontology) and to employ RDQL (RDF Data Query Language) for query reformulation across the underlying sources. In addition, description‑logic (DL) reasoning services are incorporated to automatically check the satisfiability and consistency of concepts defined in the integrated schema.

The framework consists of four tightly coupled components. First, an RDF/OWL schema editor allows developers to model classes, properties, and instances in a graph‑based representation. By mapping each legacy source’s metadata into RDF triples, the editor builds a single, coherent ontology that captures the semantic relationships among concepts from all sources. Second, a DL reasoner (e.g., Pellet) processes the ontology to infer subclass hierarchies, detect disjointness violations, and verify that the overall schema does not contain logical contradictions. This early‑stage validation reduces the risk of costly redesign later in the integration lifecycle.

Third, the query module accepts user‑level RDQL queries written against the integrated ontology. Because RDQL operates on graph patterns, users can express high‑level, domain‑oriented questions without worrying about the physical layout of each source. The system then performs query reformulation: a mapping matrix, derived from the original source‑to‑RDF correspondences, translates the abstract RDQL query into concrete queries (SQL, XQuery, etc.) that each source can execute. The DL reasoner assists this translation by ensuring that semantic constraints (e.g., class equivalence or subsumption) are respected, thereby preserving meaning across heterogeneous back‑ends.

Fourth, a visualization tool renders the integrated schema and the source‑specific mappings as interactive graphs, helping stakeholders understand the integration topology and troubleshoot mapping errors. The entire stack is built on the Apache Jena framework for RDF handling and the OWL API for ontology manipulation, with a plug‑in architecture that permits domain‑specific mapping rules to be added without recompiling the core system.

The authors evaluate the framework in three distinct domains—e‑commerce product catalogs, bioinformatics protein databases, and library bibliographic records. In each case, the RDF‑based ontology reduced schema definition time by roughly 30 % compared with manual ETL approaches, while the DL‑driven consistency checks caught 12 out of 15 deliberately introduced modeling errors. Query reformulation succeeded in delivering correct results for over 95 % of test queries, and the plug‑in mapping mechanism allowed new data sources to be incorporated with minimal code changes.

Key contributions of the work include: (1) Demonstrating that RDF can serve as a practical, expressive medium for mediated schema definition across diverse data models. (2) Showing how description‑logic inference can be seamlessly integrated into the data‑integration pipeline to provide automatic consistency verification. (3) Providing a concrete RDQL‑based query reformulation strategy that bridges the gap between high‑level semantic queries and low‑level source‑specific query languages. (4) Delivering a modular, extensible toolset that can be adapted to a wide range of application areas.

Overall, the proposed RDF‑based data integration framework offers a more flexible, semantically rich alternative to traditional database‑centric integration methods. By unifying schema definition, logical validation, and query translation under open standards, it reduces development effort, improves data quality, and facilitates the evolution of integrated systems as new sources emerge.