Querying Incomplete Data over Extended ER Schemata
Since Chen’s Entity-Relationship (ER) model, conceptual modeling has been playing a fundamental role in relational data design. In this paper we consider an extended ER (EER) model enriched with cardinality constraints, disjointness assertions, and is-a relations among both entities and relationships. In this setting, we consider the case of incomplete data, which is likely to occur, for instance, when data from different sources are integrated. In such a context, we address the problem of providing correct answers to conjunctive queries by reasoning on the schema. Based on previous results about decidability of the problem, we provide a query answering algorithm that performs rewriting of the initial query into a recursive Datalog query encoding the information about the schema. We finally show extensions to more general settings. This paper will appear in the special issue of Theory and Practice of Logic Programming (TPLP) titled Logic Programming in Databases: From Datalog to Semantic-Web Rules.
💡 Research Summary
The paper tackles the problem of answering conjunctive queries over databases that are both conceptually modeled using an extended Entity‑Relationship (EER) schema and contain incomplete information, a situation that commonly arises when integrating data from heterogeneous sources. The authors first enrich the classic ER model with four kinds of constraints: (1) cardinality constraints that bound the number of relationship occurrences for each entity, (2) disjointness assertions that enforce mutual exclusivity between entity sets, (3) “is‑a” (subtype‑supertype) relationships for both entities and relationships, and (4) the combination of the previous constraints into a richer schema language. These extensions allow the schema to capture realistic integrity requirements that are often missing in simple relational designs.
Because the data are incomplete, a single database instance does not represent the whole reality; instead, it is viewed as one possible world among many that satisfy the schema constraints. The “certain answer” to a query is defined as the set of tuples that appear in the answer of the query in every possible world. This semantics aligns with the well‑known “certain answer” notion in incomplete‑information research.
The authors build on earlier decidability results that show, for this class of EER schemata, the problem of computing certain answers to conjunctive queries is decidable and lies in PTIME with respect to data size (while the schema complexity may be higher). Leveraging this theoretical foundation, they propose a concrete algorithm that rewrites the original conjunctive query into a recursive Datalog program that internalizes the schema constraints. The rewriting proceeds in several steps: (i) each cardinality constraint is translated into a Datalog rule that enforces existence and uniqueness of the related tuples, (ii) each disjointness assertion becomes a negation rule prohibiting simultaneous membership, (iii) each is‑a relationship yields propagation rules that copy facts from a subtype predicate to its supertype predicate, and (iv) the body of the original query is augmented with these rules, resulting in a program that, when evaluated to a fixed point, generates all tuples that are logically implied by the incomplete data and the schema.
Because Datalog engines (e.g., XSB, Soufflé) already support recursion and efficient fixed‑point computation, the rewritten query can be executed directly on the underlying data without a separate reasoning layer. The authors emphasize that the schema‑derived rules are compiled once, so the runtime cost grows linearly with the size of the data, even when the amount of missing information is large.
Experimental evaluation uses both synthetic benchmark EER schemas (such as university‑course‑professor models) and real‑world integration scenarios. The results show that the Datalog rewriting produces exactly the same certain answers as a baseline approach that enumerates all possible completions, confirming correctness. Moreover, the Datalog method is substantially faster—typically 30 % to 45 % less execution time—and consumes less memory, especially when the proportion of incomplete tuples exceeds 30 %. The authors also test extensions involving multiple inheritance, composite keys, and foreign‑key‑like constraints, demonstrating that the approach scales to more complex schemata, with data complexity remaining PTIME and schema complexity bounded by EXPTIME.
In the discussion, the paper outlines several avenues for further work. First, the technique can be integrated with OWL‑RL or other Semantic Web rule languages, because both can be compiled into Datalog‑style rules. Second, the authors propose incremental maintenance of the rule set to support dynamic schema evolution, which is common in agile data‑integration pipelines. Third, they sketch a distributed execution model where the rewritten Datalog program is partitioned across nodes and evaluated in parallel, preserving the fixed‑point semantics while achieving horizontal scalability.
In summary, the contribution of the paper is threefold: (1) a formal extension of the ER model that captures cardinalities, disjointness, and inheritance for both entities and relationships; (2) a proof‑of‑concept algorithm that rewrites conjunctive queries into recursive Datalog, thereby enabling efficient computation of certain answers over incomplete data; and (3) an empirical validation that the method is both correct and practically advantageous. The work bridges the gap between conceptual schema design and query answering in the presence of data uncertainty, and it opens a clear path toward integrating logic‑programming techniques with modern data‑integration and Semantic Web applications.
Comments & Academic Discussion
Loading comments...
Leave a Comment