Repairing Property Graphs under PG-Constraints
Recent standardization efforts for graph databases have led to standard query languages such as GQL and SQL/PGQ, and to constraint languages such as Property Graph Constraints (PG-Constraints). In this paper, we embark on the study of repairing property graphs under PG-Constraints. We identify a significant subset of PG-Constraints that encodes denial constraints and includes recursion as a key feature, while still permitting automata-based structural analyses of errors. We present a comprehensive repair pipeline for these constraints, involving changes in the graph topology and leading to node, edge, and, optionally, label deletions. We investigate three algorithmic strategies for the repair procedure, based on Integer Linear Programming (ILP), a naive greedy algorithm, and an LP-guided greedy algorithm. Our experiments on various real-world datasets reveal that repairing with label deletions can achieve a 59% reduction in deletions compared to node/edge deletions alone. Moreover, the LP-guided greedy algorithm offers a runtime advantage of up to 97% over the ILP strategy while matching its solution quality.
💡 Research Summary
The paper tackles the problem of repairing property graphs that violate constraints expressed in the emerging PG‑Constraints language, which is part of the broader PG‑Schema standard for graph databases. Recognizing that the full PG‑Constraints formalism is too expressive for efficient automated repair, the authors isolate a significant fragment that captures denial constraints and supports recursion. They call this fragment Regular Graph Pattern Calculus (RGPC). RGPC patterns consist of node and edge label predicates combined with logical operators (∧, ∨, ¬) and Kleene‑star/plus operators, allowing the description of arbitrarily long paths while keeping the pattern regular.
The authors first formalize RGPC syntax and semantics, showing how it can encode realistic constraints such as “if a person works on a task that (directly or indirectly) references an important document, then the person’s access level must be at least the document’s access level.” They then propose an automata‑based model that translates RGPC patterns into finite state machines, enabling systematic detection of all matches that violate a given constraint.
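To make the automata idea concrete, here is a minimal sketch of how a recursive path pattern could be matched by running a product search of a graph and an automaton. The graph, the pattern `worksOn · references+`, and all identifiers are illustrative assumptions, not the paper's actual encoding or API:

```python
from collections import deque

# Toy property graph: adjacency list of (edge_label, target) pairs.
# Node and label names are hypothetical.
graph = {
    "alice": [("worksOn", "task1")],
    "task1": [("references", "doc1")],
    "doc1":  [("references", "doc2")],
    "doc2":  [],
}

# Hand-built NFA for the pattern `worksOn · references+`:
# state 0 --worksOn--> 1 --references--> 2, with a `references`
# self-loop on the accepting state 2 (this realizes the `+`).
delta = {(0, "worksOn"): {1}, (1, "references"): {2}, (2, "references"): {2}}
accepting = {2}

def matches(graph, start):
    """BFS over the product of the graph and the NFA; return every node
    reachable from `start` while the automaton is in an accepting state."""
    seen = {(start, 0)}
    queue = deque(seen)
    found = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            found.add(node)
        for label, succ in graph[node]:
            for nxt in delta.get((state, label), ()):
                if (succ, nxt) not in seen:
                    seen.add((succ, nxt))
                    queue.append((succ, nxt))
    return found

# Documents alice reaches via a worksOn edge followed by one or more
# references edges, i.e. directly or indirectly referenced documents.
print(sorted(matches(graph, "alice")))  # → ['doc1', 'doc2']
```

In the paper's pipeline such matches would then be checked against the constraint's condition (e.g., comparing access levels) to decide which of them are actual violations.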
Repair is modeled exclusively through deletions: nodes, edges, and optionally individual labels attached to nodes or edges may be removed. Deleting a node implicitly removes all incident edges, preserving graph consistency. The goal is to find a minimal set of deletions that eliminates every violating match. This is formulated as a combinatorial optimization problem.
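Viewed combinatorially, this is a minimum hitting set problem: each violating match yields the set of objects whose deletion would resolve it, and a repair must delete at least one object from every such set. A brute-force sketch with hypothetical object identifiers (exponential, so only a stand-in for the ILP optimum on tiny instances):

```python
from itertools import combinations

# Each violation lists the deletable objects (nodes, edges, or labels)
# whose removal resolves it. Identifiers are made up for illustration.
violations = [
    {"edge:e1", "node:n2"},
    {"node:n2", "label:n3.Admin"},
    {"edge:e4", "label:n3.Admin"},
]

def minimum_hitting_set(violations):
    """Smallest set of deletions intersecting every violation, found by
    enumerating candidate sets in order of increasing size."""
    universe = sorted(set.union(*violations))
    for k in range(len(universe) + 1):
        for cand in combinations(universe, k):
            s = set(cand)
            if all(s & v for v in violations):
                return s
    return set()

repair = minimum_hitting_set(violations)
print(repair)  # two deletions suffice; no single deletion hits all three violations
```

Deleting a node would also delete its incident edges, so a faithful model would additionally propagate node choices to edge variables; that cascade is omitted here for brevity.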
Three algorithmic strategies are explored.
- Naïve Greedy – processes violations sequentially, deleting the cheapest object for each violation unless it has already been resolved by earlier deletions. This approach is simple but does not guarantee a globally optimal solution.
- Integer Linear Programming (ILP) – introduces a binary variable for each deletable object and a constraint for each violation requiring that at least one object in the violation’s match be deleted. The objective minimizes the sum of selected variables, yielding an optimal solution when solved with a commercial ILP solver. However, the approach scales poorly with graph size and number of violations.
- LP‑Guided Greedy – solves the linear‑relaxation of the ILP (allowing variables to take fractional values) and uses the resulting fractional values as a priority score for a greedy deletion process. This hybrid method retains the quality of the ILP solution while dramatically reducing runtime.
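The contrast between the first and third strategies can be sketched on the hitting-set view of the problem. In this toy example (object names hypothetical), the naive pass resolves each violation in isolation, while the LP-guided pass prefers objects with high fractional value; the fractional scores are hard-coded here as an assumed solver output, since an actual LP solve would need an external solver:

```python
# Violations as sets of deletable objects; n2 occurs in all three.
violations = [
    {"a1", "n2"},
    {"l3", "n2"},
    {"e4", "n2"},
]

def naive_greedy(violations):
    """Process violations in order, deleting one object from each
    still-unresolved violation (unit deletion costs assumed)."""
    deleted = set()
    for v in violations:
        if not v & deleted:           # already resolved by an earlier deletion?
            deleted.add(min(v))       # deterministic tie-break
    return deleted

def lp_guided_greedy(violations, lp_values):
    """Greedy pass preferring objects with high fractional value in the
    LP relaxation; `lp_values` stands in for a solver's output."""
    deleted = set()
    for v in violations:
        if not v & deleted:
            deleted.add(max(v, key=lambda o: lp_values[o]))
    return deleted

# Assumed LP relaxation output: all the mass sits on n2, the one
# object that covers every violation.
lp_values = {"a1": 0.0, "l3": 0.0, "e4": 0.0, "n2": 1.0}

print(naive_greedy(violations))                 # three deletions
print(lp_guided_greedy(violations, lp_values))  # one deletion: n2
```

On this instance the naive pass over-deletes because it never notices that one shared object covers all violations, whereas the fractional scores steer the guided pass straight to it, mirroring the quality gap reported in the paper.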
The experimental evaluation uses several real‑world datasets (organizational graphs, supply‑chain graphs, social networks) and a variety of RGPC constraints, including those with deep recursion. The authors compare the three algorithms under two repair settings: (a) allowing only node and edge deletions, and (b) additionally permitting label deletions. Key findings include:
- Allowing label deletions reduces the total number of deletions by up to 59% compared with node/edge‑only repairs, because many violations can be resolved by simply removing a problematic label rather than an entire object.
- The LP‑guided greedy algorithm achieves runtime improvements of up to 97% over the ILP approach while producing solutions within 1–2% of the optimal deletion count.
- The naïve greedy algorithm is the fastest of the three but typically incurs 10–15% more deletions than the optimal baseline.
- Approximate variants (e.g., limiting the depth of recursive pattern exploration or sampling matches) can further accelerate execution by up to 89% with only modest impact on solution quality.
The paper’s contributions are threefold: (1) identification of a practically useful, regular fragment of PG‑Constraints (RGPC) that supports recursion and denial constraints; (2) a comprehensive repair pipeline that integrates automata‑based violation detection with a deletion‑only repair model, including optional label deletions; and (3) a thorough algorithmic study demonstrating that LP‑guided greedy offers an excellent trade‑off between optimality and performance for large‑scale property graph repair.
Future work suggested by the authors includes extending the repair model to support insertions and property value updates, handling conflicting constraints, and scaling the approach to distributed graph processing environments. The presented methodology provides a solid foundation for automated data quality management in the emerging ecosystem of standardized graph databases.