Km4City Ontology Building vs Data Harvesting and Cleaning for Smart-city Services
Presently, a very large number of public and private data sets are available from local governments. In most cases, they are not semantically interoperable, and a huge human effort would be needed to create integrated ontologies and a knowledge base for a smart city. No standard smart-city ontology yet exists, and substantial research is needed to identify models that support data reconciliation, manage complexity, and enable reasoning over the data. In this paper, a system is proposed for the ingestion and reconciliation of smart-city data covering aspects such as the road graph, services available along the roads, and traffic sensors. The system manages a large volume of data coming from a variety of sources, considering both static and dynamic data. These data are mapped to a smart-city ontology, called Km4City (Knowledge Model for City), and stored in an RDF store, where they are available via SPARQL queries to applications through which public administrations and enterprises provide new services to users. The paper presents the process adopted to produce the ontology, the big-data architecture that feeds the knowledge base from open and private data, and the mechanisms adopted for data verification, reconciliation, and validation. Examples of possible uses of the resulting coherent big-data knowledge base, accessible from the RDF store and related services, are also offered. The article also presents the work performed on reconciliation algorithms, including their comparative assessment and selection.
💡 Research Summary
The paper addresses a fundamental obstacle in the development of smart‑city applications: the lack of semantic interoperability among the myriad public and private data sets that municipalities publish. While many datasets are openly available, they are expressed in heterogeneous formats, schemas, and vocabularies, making it impractical to build a unified knowledge base without massive manual effort. To overcome this, the authors propose a complete end‑to‑end framework that (1) designs a comprehensive smart‑city ontology called Km4City, (2) implements a large‑scale data ingestion pipeline capable of handling both static (e.g., road networks, facility registries) and dynamic (e.g., traffic sensor streams) sources, (3) applies systematic data cleaning, deduplication, and reconciliation techniques, and (4) stores the resulting RDF triples in a high‑performance triplestore that can be queried via SPARQL for downstream services.
Ontology Design
Km4City is built on a set of top‑level classes such as CityEntity, SpatialFeature, Service, Sensor, Observation, and Event. These classes are linked by relationships like locatedOn, provides, monitoredBy, adjacentTo, and hasObservation. Spatial information is modeled using the WGS84 coordinate system and the GeoSPARQL standard, enabling direct integration with GIS tools. The ontology deliberately reuses concepts from existing standards (e.g., SOSA/SSN for sensors, CityGML for geometry) while extending them to capture city‑wide services (public transport stops, electric‑vehicle charging stations, road‑side amenities) that are often omitted in domain‑specific models.
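The class and relationship sketch above can be rendered, in a deliberately simplified form, as plain Python data structures. This is only an illustration of how the concepts relate; the real ontology is expressed in OWL/RDF, and the attribute names, URIs, and coordinates below are assumptions chosen to mirror the relationships named in the text (locatedOn, monitoredBy):

```python
from dataclasses import dataclass

# Illustrative, simplified Python rendering of the Km4City class sketch.
# The km4c: identifiers and WKT geometry are made up for this example.

@dataclass
class RoadSegment:
    """A SpatialFeature; geometry is a GeoSPARQL-style WKT literal (WGS84)."""
    uri: str
    wkt: str

@dataclass
class Service:
    """A city-wide service, e.g. an EV charging station; located on a road segment."""
    uri: str
    located_on: RoadSegment   # mirrors the locatedOn relationship

@dataclass
class Sensor:
    """Reuses SOSA/SSN-style concepts; observes a road segment."""
    uri: str
    monitors: RoadSegment     # mirrors the monitoredBy relationship (inverse view)

segment = RoadSegment("km4c:road_segment_03", "LINESTRING(11.25 43.77, 11.26 43.77)")
station = Service("km4c:ev_station_42", located_on=segment)
sensor = Sensor("km4c:traffic_sensor_7", monitors=segment)
```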
Data Sources and Acquisition
The authors categorize data sources into three groups: (a) open government portals that publish static CSV/JSON files, (b) private APIs offering near‑real‑time information (e.g., parking availability), and (c) IoT streams from traffic cameras, air‑quality sensors, and Bluetooth beacons. For batch data they employ Python‑based crawlers and Apache NiFi workflows that run on a nightly schedule. For streaming data they rely on Apache Kafka and Kafka Connect to ingest millions of records per day with low latency. All raw inputs are first stored in a staging area (HDFS for batch, Kafka topics for streams) before entering the transformation stage.
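The routing of raw inputs into the two staging areas can be sketched as a small dispatcher. This is a minimal stand-in for what NiFi/Kafka Connect do in the described architecture; the function name, source-type tags, and staging URIs are illustrative assumptions, not the authors' configuration:

```python
# Hypothetical sketch of the staging-area routing step: batch sources land in
# HDFS, streaming sources in Kafka topics, as described in the text.
def route_to_staging(record: dict) -> str:
    """Return the staging destination for a raw input record."""
    kind = record.get("source_type")
    if kind in ("open_data_portal", "private_api_batch"):
        return "hdfs://staging/batch"        # nightly crawler / NiFi output
    if kind in ("traffic_camera", "air_quality", "bluetooth_beacon", "parking_api"):
        return "kafka://staging-stream"      # low-latency stream topic
    raise ValueError(f"unknown source type: {kind!r}")
```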
Transformation, Cleaning, and Reconciliation
A dedicated “schema‑mapping layer” translates each source field into an RDF predicate defined in Km4City. During this phase the pipeline performs:
- Normalization (e.g., date formats, unit conversions);
- Deduplication using three alternative algorithms:
  - String‑based address matching (Levenshtein distance with a custom address dictionary);
  - Coordinate‑based clustering (DBSCAN on latitude/longitude to group entities that refer to the same physical location);
  - Linked Open Data (LOD) linking (matching against DBpedia, GeoNames, and OpenStreetMap identifiers).
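The string-based variant in the list above rests on Levenshtein edit distance, which can be sketched in a few lines. The `addresses_match` helper and its threshold are illustrative assumptions; the paper's pipeline additionally applies a custom address dictionary before comparing:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute = 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def addresses_match(a: str, b: str, max_dist: int = 2) -> bool:
    """Hypothetical matcher: case-fold, then accept small edit distances.
    The real pipeline would first normalize via an address dictionary."""
    return levenshtein(a.lower(), b.lower()) <= max_dist
```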
The authors evaluate these methods on a ground‑truth set of 5,000 entities. Coordinate‑based clustering achieves the highest F1‑score (0.92), while address matching suffers from multilingual variations and yields an F1 of 0.71.
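The F1-score used in this evaluation combines precision and recall over the ground-truth matches. A minimal sketch, with illustrative counts chosen so the result lands at the reported 0.92 (the actual confusion counts are not given in the summary):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall over matched entity pairs."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative example: 92 true matches, 8 spurious, 8 missed -> F1 = 0.92
score = f1_score(92, 8, 8)
```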
To guarantee semantic quality, the pipeline validates every generated triple against a set of SHACL shapes. For instance, a Sensor must have a hasLocation property of type geo:Point, and a Service must be linked to a RoadSegment. Violations are logged and sent to a human‑in‑the‑loop reviewer via a Slack webhook.
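The Sensor constraint above would be expressed as a `sh:NodeShape` and checked by a SHACL engine; as a hand-rolled stand-in, the same check can be sketched over a bag of triples. Everything here (triple encoding, property names prefixed `km4c:`) is an assumption for illustration:

```python
# Simplified stand-in for a SHACL shape: a Sensor must have a hasLocation
# whose object is typed geo:Point. Triples are (subject, predicate, object).
def validate_sensor(triples: set, sensor: str) -> list:
    """Return a list of violation messages (empty list = shape satisfied)."""
    locations = [o for s, p, o in triples
                 if s == sensor and p == "km4c:hasLocation"]
    if not locations:
        return ["missing km4c:hasLocation"]
    violations = []
    for loc in locations:
        if (loc, "rdf:type", "geo:Point") not in triples:
            violations.append(f"{loc} is not a geo:Point")
    return violations

EXAMPLE = {
    ("ex:sensor1", "km4c:hasLocation", "ex:point1"),
    ("ex:point1", "rdf:type", "geo:Point"),
    ("ex:sensor2", "km4c:hasLocation", "ex:somewhere"),  # untyped location
}
```

In the described pipeline, a non-empty violation list would be logged and forwarded to the human-in-the-loop reviewer.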
RDF Store and Query Service
All cleaned triples are loaded into an Apache Jena Fuseki triplestore, configured with GeoSPARQL indexes and full‑text Lucene indexes. The store currently holds over 10 billion triples (≈ 2 TB) and supports SPARQL 1.1 queries with spatial functions (geof:distance, geof:within). An HTTP REST layer exposes the endpoint with OAuth2 authentication, enabling both internal municipal applications and external partners to retrieve data securely.
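A spatial query against such an endpoint might look like the sketch below. The `geo:`/`geof:` prefixes follow the GeoSPARQL standard; the `km4c:` namespace, class name, and coordinates are illustrative assumptions, and in practice the string would be sent to the Fuseki endpoint over its HTTP SPARQL protocol:

```python
# Hedged sketch of a GeoSPARQL query: services within 500 m of a point.
QUERY = """
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX km4c: <http://example.org/km4city#>

SELECT ?service ?wkt WHERE {
  ?service a km4c:Service ;
           geo:hasGeometry/geo:asWKT ?wkt .
  FILTER (geof:distance(?wkt,
                        "POINT(11.2558 43.7731)"^^geo:wktLiteral,
                        <http://www.opengis.net/def/uom/OGC/1.0/metre>) < 500)
}
"""
```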
Demonstrator Applications
Three pilot services illustrate the practical value of the knowledge base:
- Municipal Dashboard – visualizes real‑time congestion, road‑work alerts, and sensor health metrics; average query latency is 120 ms.
- Logistics Route Optimizer – combines live traffic speeds with road restrictions to suggest fuel‑efficient routes for delivery fleets; performance testing shows a 15 % reduction in travel time compared with a baseline GIS‑only solution.
- Tourist Assistant Chatbot – answers natural‑language queries such as “Find electric‑vehicle charging stations near the historic cathedral” by translating the request into a SPARQL query; response time stays under 200 ms even under 500 concurrent users.
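The chatbot's translation of a natural-language request into SPARQL can be sketched as template matching. This toy version is an assumption about the approach (the summary does not describe the implementation); the intent patterns, class names, and query skeleton are all illustrative:

```python
import re

# Toy template-based NL -> SPARQL translation in the spirit of the chatbot pilot.
# Patterns and the km4c: vocabulary below are hypothetical.
TEMPLATES = [
    (re.compile(r"charging stations? near (?P<place>.+)", re.I),
     'SELECT ?s WHERE {{ ?s a km4c:ChargingStation ; km4c:near ?p . '
     '?p rdfs:label "{place}" . }}'),
]

def to_sparql(question: str):
    """Return a SPARQL query string for a recognized question, else None."""
    for pattern, template in TEMPLATES:
        match = pattern.search(question)
        if match:
            return template.format(**match.groupdict())
    return None
```

A real deployment would add entity linking (resolving "the historic cathedral" to a geolocated resource) before the query is executed.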
Evaluation and Discussion
The ingestion pipeline processes up to 5,000 events per second for streaming sources and 2 GB of batch data per hour. After the reconciliation step, the proportion of duplicate or inconsistent triples drops from 4.3 % to 1.2 %. The authors discuss scalability, noting that the current architecture can be expanded horizontally by adding more Kafka partitions and Fuseki clusters. Limitations include the geographic focus on the city of Florence, which raises questions about the ontology's portability to regions with different administrative hierarchies or address conventions. The paper also acknowledges that the address‑matching component could benefit from machine‑learning models trained on multilingual corpora.
Conclusion
The study demonstrates that a well‑engineered ontology combined with a robust data‑pipeline can turn fragmented municipal data into a coherent, queryable knowledge graph. By automating data validation with SHACL and providing concrete SPARQL‑based services, the authors show a viable path toward scalable smart‑city applications. Future work will explore international standardization of Km4City, integration of predictive analytics (e.g., traffic forecasting using graph neural networks), and the extension of the reconciliation framework to incorporate probabilistic entity linking techniques.