Ontology Building vs Data Harvesting and Cleaning for Smart-City Services
Presently, a very large number of public and private data sets are available to local governments. In most cases they are not semantically interoperable, and a huge human effort is needed to create integrated ontologies and knowledge bases for smart cities. Smart-city ontologies are not yet standardized, and considerable research is still needed to identify models that can easily support data reconciliation, the management of complexity, and reasoning. In this paper, a system is proposed for the ingestion and reconciliation of data covering smart-city aspects such as the road graph, services available along the roads, and traffic sensors. The system manages a large volume of data coming from a variety of sources, considering both static and dynamic data. These data are mapped to a smart-city ontology and stored in an RDF store, where they are available to applications via SPARQL queries that provide new services to users. The paper presents the process adopted to produce the ontology and the knowledge base, and the mechanisms adopted for verification, reconciliation, and validation. Some examples of possible uses of the resulting coherent knowledge base, accessible from the RDF store, are also provided.
💡 Research Summary
The paper addresses a fundamental obstacle in the development of smart‑city services: the abundance of heterogeneous public and private data sets that are not semantically interoperable. To bridge this gap, the authors propose a complete end‑to‑end pipeline that ingests, cleans, reconciles, and semantically integrates data related to road networks, traffic sensors, and road‑side services into a unified smart‑city ontology stored in an RDF triple store.
The work begins with a domain analysis that identifies the core concepts required for a smart‑city knowledge base—roads, intersections, sensors, service locations (e.g., bus stops, parking lots, charging stations), and temporal traffic measurements. These concepts are formalized in an OWL‑DL ontology, deliberately designed to be expressive enough for reasoning while remaining compatible with existing standards such as CityGML and NGSI‑LD.
Data acquisition is split into static and dynamic streams. Static data (GIS shapefiles, CSV registries, JSON service catalogs) are harvested via REST APIs, bulk downloads, or direct file imports. Dynamic data (real‑time sensor feeds, traffic flow APIs, IoT platforms) are captured using a combination of Apache Kafka for high‑throughput streaming and Apache NiFi for flow orchestration. The authors note that raw inputs frequently suffer from schema mismatches, duplicate records, missing values, and inconsistent coordinate reference systems.
Cleaning and normalization are performed by a Python‑based ETL layer augmented with NiFi processors. Key steps include schema alignment, string sanitization, date‑time standardization to ISO‑8601, and coordinate transformation to a common WGS‑84 CRS. Geospatial attributes are encoded using GeoSPARQL‑compatible Well‑Known Text (WKT) literals, while temporal attributes are stored as typed literals to enable time‑based queries.
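Two of the normalization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' ETL code: the source timestamp format (`DD/MM/YYYY HH:MM`) and the function names are assumptions; the WKT point literal follows the GeoSPARQL longitude-first convention mentioned in the summary.

```python
from datetime import datetime, timezone

def normalize_timestamp(raw: str) -> str:
    """Parse a source timestamp (assumed 'DD/MM/YYYY HH:MM' here) and
    re-emit it as an ISO-8601 UTC string suitable for a typed literal."""
    dt = datetime.strptime(raw, "%d/%m/%Y %H:%M").replace(tzinfo=timezone.utc)
    return dt.isoformat()

def to_wkt_point(lon: float, lat: float) -> str:
    """Encode a WGS-84 coordinate pair as a GeoSPARQL-style WKT literal
    (longitude first, per the WKT convention)."""
    return f"POINT({lon} {lat})"

# Example record after schema alignment (values are illustrative)
record = {"ts": "05/03/2015 14:30", "lon": 11.2558, "lat": 43.7696}
clean = {
    "observedAt": normalize_timestamp(record["ts"]),
    "geometry": to_wkt_point(record["lon"], record["lat"]),
}
print(clean["observedAt"])  # → 2015-03-05T14:30:00+00:00
print(clean["geometry"])    # → POINT(11.2558 43.7696)
```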
Mapping to the ontology is achieved through R2RML mapping files and SPARQL CONSTRUCT queries. Each data source is associated with a set of mapping rules that translate rows or JSON objects into RDF triples, linking them to the appropriate classes and properties (e.g., a sensor reading becomes an instance of TrafficSensorReading with hasLocation and observedAt properties). Duplicate entity detection employs a hybrid similarity algorithm that combines Levenshtein distance on textual identifiers with geographic distance thresholds; resolved duplicates are linked via owl:sameAs.
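The hybrid duplicate-detection idea (Levenshtein distance on identifiers combined with a geographic distance threshold) can be sketched as follows. The thresholds (`max_edits`, `max_dist_m`) and the sample values are hypothetical; the paper does not specify them.

```python
from math import radians, sin, cos, asin, sqrt

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def haversine_m(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance in metres on a spherical Earth."""
    R = 6371000.0
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    h = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(h))

def same_entity(name1, pos1, name2, pos2, max_edits=2, max_dist_m=50.0) -> bool:
    """Flag two records as duplicates when their names are within max_edits
    edit operations AND their locations are within max_dist_m metres."""
    return (levenshtein(name1.lower(), name2.lower()) <= max_edits
            and haversine_m(pos1[0], pos1[1], pos2[0], pos2[1]) <= max_dist_m)

print(same_entity("Via Roma 1", (43.77, 11.25), "via roma 1", (43.7701, 11.2501)))
# → True (candidate pair to be linked via owl:sameAs)
```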
The resulting triples are loaded into an Apache Jena Fuseki RDF store. To cope with the expected volume, the store is partitioned by data type (static vs. dynamic) and indexed on predicates, timestamps, and spatial coordinates. A SPARQL endpoint is exposed for client applications, and performance is boosted through query caching, pre‑fetching of frequently accessed patterns, and selective materialization of inference results.
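A client query against the exposed SPARQL endpoint might be assembled as below. The endpoint URL, namespace prefix, and property names (`hasReading`, `value`, `observedAt`) are illustrative placeholders, not the paper's actual vocabulary; the point is only the shape of a parameterized query a client application would send.

```python
from urllib.parse import urlencode

# Hypothetical Fuseki endpoint; a real deployment would use its own dataset name.
ENDPOINT = "http://localhost:3030/smartcity/sparql"

def recent_readings_query(sensor_uri: str, since_iso: str) -> str:
    """Build a SPARQL SELECT retrieving one sensor's readings after a cutoff."""
    return f"""
PREFIX km4c: <http://example.org/smartcity#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
SELECT ?value ?t WHERE {{
  <{sensor_uri}> km4c:hasReading ?r .
  ?r km4c:value ?value ; km4c:observedAt ?t .
  FILTER (?t >= "{since_iso}"^^xsd:dateTime)
}} ORDER BY DESC(?t)
""".strip()

q = recent_readings_query("http://example.org/sensor/42", "2015-03-05T00:00:00Z")
body = urlencode({"query": q})  # form body for an HTTP POST to ENDPOINT
print(q)
```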
Quality assurance is performed on two levels. Structural validation uses SHACL shapes to enforce class‑property constraints, cardinality, and datatype compliance. Logical consistency is checked with an OWL reasoner that flags unsatisfiable classes or contradictory property assertions. Additionally, domain‑specific business rules—such as “a sensor must be attached to an existing road segment” —are encoded as SHACL rules or custom SPARQL ASK queries, and violations trigger alerts for manual review.
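The business rule quoted above ("a sensor must be attached to an existing road segment") can be illustrated with an in-memory stand-in for the ASK-style check. The class and predicate names are invented for the sketch; in the described system this would run as a SHACL rule or SPARQL ASK query over the RDF store.

```python
# Toy knowledge base: (subject, predicate, object) triples.
triples = {
    ("sensor:1", "rdf:type", "TrafficSensor"),
    ("sensor:1", "onSegment", "road:A"),
    ("sensor:2", "rdf:type", "TrafficSensor"),
    ("sensor:2", "onSegment", "road:Z"),   # dangling reference: road:Z missing
    ("road:A",  "rdf:type", "RoadSegment"),
}

def dangling_sensors(kb):
    """Return sensors violating the rule: no onSegment link, or a link
    pointing at a segment that does not exist in the knowledge base."""
    segments = {s for (s, p, o) in kb if p == "rdf:type" and o == "RoadSegment"}
    sensors = {s for (s, p, o) in kb if p == "rdf:type" and o == "TrafficSensor"}
    attached = {(s, o) for (s, p, o) in kb if p == "onSegment"}
    return sorted(s for s in sensors
                  if not any(seg in segments for (x, seg) in attached if x == s))

print(dangling_sensors(triples))  # → ['sensor:2'], flagged for manual review
```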
To demonstrate the practical value of the knowledge base, the authors present two use cases. The first is a real‑time route‑optimization service: a mobile application submits a SPARQL query that combines the user’s origin/destination, current traffic sensor readings, and nearby public‑transport facilities to compute the fastest multimodal itinerary. The second is a city‑planning simulation: planners can virtually add new road segments or sensor deployments, then issue SPARQL queries that simulate traffic redistribution and service accessibility, enabling data‑driven decision making before any physical construction.
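In the first use case, the itinerary is ultimately computed over a road graph whose edge costs reflect current sensor readings. As a rough sketch of that final routing step only (the paper obtains the graph and weights via SPARQL; the graph and travel times below are invented), a standard Dijkstra search suffices:

```python
import heapq

def fastest_route(edges, src, dst):
    """Dijkstra over a road graph whose edge weights are current travel
    times in minutes, e.g. derived from traffic-sensor readings.
    edges: {node: [(neighbor, minutes), ...]}"""
    dist, prev = {src: 0.0}, {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in edges.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    path.append(src)
    return list(reversed(path)), dist[dst]

# Toy graph: congestion on the direct A->C link makes the A->B->C detour faster.
g = {"A": [("B", 4), ("C", 12)], "B": [("C", 3)], "C": []}
print(fastest_route(g, "A", "C"))  # → (['A', 'B', 'C'], 7.0)
```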
In summary, the paper delivers a robust, reproducible methodology for turning fragmented smart‑city data into a coherent, queryable knowledge graph. By integrating data harvesting, extensive cleaning, ontology‑driven mapping, and rigorous validation, the authors provide a blueprint that can be adopted by municipalities and private operators alike. The demonstrated applications illustrate how such a semantically enriched data layer can unlock new services, improve operational efficiency, and support evidence‑based urban planning.