RDFGraphGen: An RDF Graph Generator based on SHACL Shapes

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Developing and testing modern RDF-based applications often requires access to RDF datasets with certain characteristics. Unfortunately, it is very difficult to find publicly available domain-specific knowledge graphs that conform to a particular set of characteristics. Hence, in this paper we propose RDFGraphGen, an open-source RDF graph generator that uses characteristics provided in the form of SHACL (Shapes Constraint Language) shapes to generate synthetic RDF graphs. RDFGraphGen is domain-agnostic, with configurable graph structure, value constraints, and distributions. It also comes with a number of predefined values for popular schema.org classes and properties, for more realistic graphs. Our results show that RDFGraphGen is scalable and can generate small, medium, and large RDF graphs in any domain.


💡 Research Summary

The paper addresses a practical problem faced by developers and researchers of RDF‑based systems: the scarcity of publicly available knowledge graphs that exhibit specific structural, cardinality, and value‑distribution characteristics required for benchmarking, testing, or quality‑control purposes. To fill this gap, the authors introduce RDFGraphGen, an open‑source, domain‑agnostic RDF graph generator that takes a SHACL (Shapes Constraint Language) shapes graph as its sole input and produces a synthetic RDF dataset that conforms to the supplied constraints.

Key Contributions

  1. SHACL‑Driven Generation – Unlike prior generators that rely on OWL ontologies, tabular inputs, or ad‑hoc rule sets, RDFGraphGen treats SHACL shapes as a complete blueprint. It parses node shapes, property shapes, and logical constraints (sh:and, sh:or, sh:xone) and translates them directly into triples. This enables fine‑grained control over both graph topology (which nodes are linked to which) and literal/value constraints (datatype, min/max count, enumeration, etc.).

  2. Algorithmic Core – The generation process is formalised in three recursive procedures:

    • GENERATE_FROM_NODE_SHAPE walks through each constraint component of a node shape, delegating to property‑shape handling or logical‑constraint handling as appropriate.
    • GENERATE_FROM_PROP_SHAPE determines cardinality using sh:minCount/sh:maxCount, then either (a) creates literal values via createLiteralValues (respecting XSD datatypes) or (b) generates realistic values for supported schema.org predicates via createSchemaValues. If the property points to another node shape, the algorithm recurses, creating new IRIs for the target entities.
    • GENERATE_LOGICAL resolves logical constraints by selecting the required subset of shapes (all for and, a random non‑empty subset for or, exactly one for xone).

  3. Scale‑Factor Control – Users specify a single numeric scale factor that determines how many instances are generated for each top‑level node shape (a shape that does not occur solely as the object of another shape's property). The number of lower‑level entities is derived automatically from the cardinalities of the outgoing properties, ensuring a predictable overall graph size while preserving the intended structural ratios.

  4. Multi‑Ontology Support – Because SHACL itself is expressed in RDF, RDFGraphGen can mix classes and properties from any vocabularies (e.g., schema.org, FOAF, Dublin Core) within a single shapes graph. This flexibility surpasses tools such as GAIA (OWL‑only) or GRR (single‑ontology SPARQL‑like specifications).

  5. Implementation & Availability – The tool is released under the MIT license, hosted on GitHub, and distributed via PyPI as a Python package (rdf-graph-gen). The implementation is lightweight, requiring only standard RDF libraries and a small set of predefined schema.org value tables.
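
The three procedures and the scale factor described above can be illustrated with a minimal sketch. It is not the RDFGraphGen implementation: it assumes a hypothetical dictionary encoding of shapes instead of a parsed SHACL graph, treats each logical branch as a plain list of property shapes, and uses placeholder literals. All names (`SHAPES`, `generate`, `from_node_shape`, `from_prop_shape`, `select_branches`) are illustrative only.

```python
import random

# Hypothetical in-memory stand-in for a parsed SHACL shapes graph (assumption).
SHAPES = {
    "PersonShape": {
        "class": "schema:Person",
        "properties": [
            {"path": "schema:name", "datatype": "xsd:string", "min": 1, "max": 1},
            {"path": "schema:address", "node": "AddressShape", "min": 1, "max": 2},
        ],
        # xone: exactly one of the two branches is applied per entity.
        "xone": [
            [{"path": "schema:email", "datatype": "xsd:string", "min": 1, "max": 1}],
            [{"path": "schema:telephone", "datatype": "xsd:string", "min": 1, "max": 1}],
        ],
    },
    "AddressShape": {
        "class": "schema:PostalAddress",
        "properties": [
            {"path": "schema:postalCode", "datatype": "xsd:integer", "min": 1, "max": 1},
        ],
    },
}

LOGICAL_KINDS = ("and", "or", "xone")


def all_props(shape):
    """Collect property shapes, including those nested in logical branches."""
    props = list(shape.get("properties", []))
    for kind in LOGICAL_KINDS:
        for branch in shape.get(kind, []):
            props.extend(branch)
    return props


def top_level_shapes(shapes):
    """Shapes that no other shape points to through a property."""
    referenced = {p["node"] for s in shapes.values() for p in all_props(s) if "node" in p}
    return [name for name in shapes if name not in referenced]


def select_branches(kind, branches, rng):
    """All branches for and, a random non-empty subset for or, one for xone."""
    if kind == "and":
        return list(branches)
    if kind == "or":
        return rng.sample(branches, rng.randint(1, len(branches)))
    if kind == "xone":
        return [rng.choice(branches)]
    raise ValueError(f"unknown logical constraint: {kind}")


def literal(datatype, rng):
    """Placeholder literal generator; a real tool would respect XSD datatypes fully."""
    number = rng.randint(0, 99999)
    return number if datatype == "xsd:integer" else f"value-{number}"


def from_prop_shape(subject, prop, shapes, rng, triples, counter):
    """Pick a cardinality, then emit literals or recurse into the target node shape."""
    for _ in range(rng.randint(prop["min"], prop["max"])):
        if "node" in prop:
            obj = from_node_shape(prop["node"], shapes, rng, triples, counter)
        else:
            obj = literal(prop["datatype"], rng)
        triples.append((subject, prop["path"], obj))


def from_node_shape(name, shapes, rng, triples, counter):
    """Create one entity for a node shape and return its identifier."""
    shape = shapes[name]
    counter[0] += 1
    subject = f"ex:{name}-{counter[0]}"
    triples.append((subject, "rdf:type", shape["class"]))
    props = list(shape.get("properties", []))
    for kind in LOGICAL_KINDS:
        if kind in shape:
            for branch in select_branches(kind, shape[kind], rng):
                props.extend(branch)
    for prop in props:
        from_prop_shape(subject, prop, shapes, rng, triples, counter)
    return subject


def generate(shapes, scale_factor, seed=0):
    """Generate scale_factor entities for every top-level node shape."""
    rng = random.Random(seed)
    triples, counter = [], [0]
    for name in top_level_shapes(shapes):
        for _ in range(scale_factor):
            from_node_shape(name, shapes, rng, triples, counter)
    return triples
```

With this encoding, `generate(SHAPES, scale_factor=5)` produces five `schema:Person` entities, each with one or two addresses and exactly one of `schema:email` or `schema:telephone`. The number of addresses is not set directly; it follows from the cardinality of `schema:address`, which mirrors how the summary describes lower-level entity counts being derived.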

Evaluation
The authors benchmarked RDFGraphGen across three size regimes: small (≈10⁴ triples), medium (≈10⁶ triples), and large (≈10⁷ triples). Generation time scaled linearly with the number of triples, and memory consumption remained within acceptable bounds for a single‑machine execution. When using schema.org predicates, the generated literals exhibited distributions (e.g., realistic URLs, dates, person names) comparable to real‑world web data, demonstrating the practical utility of the built‑in value tables.

Comparison with Related Work

  • TAB2KG infers an ontology from tabular data and then maps rows to triples; it does not allow explicit value constraints.
  • GAIA generates data from OWL ontologies but lacks fine‑grained literal constraints.
  • GRR uses a SPARQL‑like DSL for shape description but requires users to supply concrete values beforehand.
  • PyGraft offers sophisticated distribution modelling but still depends on an input RDF schema rather than SHACL constraints.

RDFGraphGen differentiates itself by (i) using SHACL as the only specification, (ii) supporting both structural and value constraints, (iii) allowing multi‑vocabulary mixing, and (iv) providing a simple scale‑factor interface.

Limitations

  • The current implementation assumes well‑formed and non‑recursive SHACL shapes; recursive shapes are not supported.
  • When contradictory constraints appear (e.g., sh:minCount > sh:maxCount), the generator defaults to the minimum count, which may not reflect the user’s intent.
  • Advanced SHACL features such as custom functions, SPARQL‑based constraints, or complex path expressions are not yet handled.
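
Two of these limitations can be made concrete with a short sketch. The first function mirrors the fallback described above (use the minimum count when the bounds contradict each other). The second is a pre-check a user might run to spot recursive shapes before generation; it is an assumption for illustration, not something the summary attributes to the tool. Both function names and the `references` mapping (shape name to the shape names it points to) are hypothetical.

```python
import random


def resolve_count(min_count, max_count, rng):
    """Choose how many values to generate for one property shape."""
    if min_count > max_count:
        # Contradictory bounds: fall back to the minimum, as the summary describes.
        return min_count
    return rng.randint(min_count, max_count)


def find_recursive_shapes(references):
    """Return the shape names that can reach themselves through references."""
    recursive = set()
    for start in references:
        stack = list(references.get(start, []))
        seen = set()
        while stack:
            current = stack.pop()
            if current == start:
                recursive.add(start)
                break
            if current in seen:
                continue
            seen.add(current)
            stack.extend(references.get(current, []))
    return recursive
```

For example, with `{"A": ["B"], "B": ["A"], "C": ["A"], "D": []}` the check reports `A` and `B` as recursive, while `C` only points into the cycle and `D` is unaffected.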

Future Work
The authors outline several extensions: adding support for recursive shapes, richer logical operators, automatic detection and resolution of contradictory constraints, and integrating quality metrics that compare generated graphs against real datasets (e.g., distribution similarity, schema compliance scores). They also plan to expose a plug‑in mechanism for user‑defined value generators, enabling domain‑specific realism beyond the built‑in schema.org tables.

Conclusion
RDFGraphGen is a practical contribution to the RDF ecosystem. By repurposing SHACL, originally designed for validation, as a generative blueprint, the tool lets developers synthesize realistic, constraint‑compliant RDF graphs at small, medium, and large scales without hand‑crafted datasets. Its open‑source license, Python implementation, and linearly scaling generation time make it useful for benchmarking, testing, and teaching in any domain that uses RDF.

