Making sense of Open Data Statistics with Information from Wikipedia
Today, more and more open data statistics are published by governments, statistical offices and organizations like the United Nations, The World Bank or Eurostat. This data is freely available and can be consumed by end users in interactive visualizations. However, additional information is needed to enable laymen to interpret these statistics in order to make sense of the raw data. In this paper, we present an approach to combine open data statistics with historical events. In a user interface we have integrated interactive visualizations of open data statistics with a timeline of thematically appropriate historical events from Wikipedia. This can help users to explore statistical data in several views and to get related events for certain trends in the timeline. Events include links to Wikipedia articles, where details can be found and the search process can be continued. We have conducted a user study to evaluate if users can use the interface intuitively, if relations between trends in statistics and historical events can be found and if users like this approach for their exploration process.
💡 Research Summary
The paper addresses the growing gap between the abundance of open‑data statistics released by governments and international organizations and the general public’s ability to interpret these numbers meaningfully. While many visualisation tools exist for displaying raw time‑series data, they often lack contextual information that helps lay users understand why a metric rises or falls. To bridge this gap, the authors propose an integrated system that automatically links open‑data statistics with historically relevant events extracted from Wikipedia.
Data collection begins with the acquisition of time‑series indicators from sources such as Eurostat, the World Bank, and the United Nations via public APIs. The raw series are cleaned, missing values are interpolated, and all variables are normalised to a common scale. In parallel, the system harvests event data from DBpedia and Wikidata, focusing on attributes like date, thematic category (economics, politics, health, etc.), and a set of keywords derived from article titles and abstracts. After standard text preprocessing (tokenisation, stop‑word removal, stemming), each event is represented as a TF‑IDF vector.
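The paper describes this preprocessing pipeline only at a high level; the following minimal sketch shows one way the event texts could be turned into TF‑IDF vectors. The stop‑word list, the regex tokeniser, and the omission of stemming are simplifying assumptions, not details from the paper.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list; a real system would use a full list per language.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to", "for", "across"}

def preprocess(text):
    """Tokenise, lower-case, and drop stop words (stemming omitted for brevity)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    token_lists = [preprocess(d) for d in documents]
    n = len(token_lists)
    df = Counter()                      # document frequency of each term
    for tokens in token_lists:
        df.update(set(tokens))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({t: (tf[t] / total) * idf[t] for t in tf})
    return vectors

# Two toy event abstracts standing in for harvested DBpedia/Wikidata text.
events = [
    "Global financial crisis triggers bank failures",
    "Parliamentary elections held across the country",
]
vecs = tfidf_vectors(events)
```

Terms that occur in every document get an IDF of zero and thus no weight, which is the standard TF‑IDF behaviour of discounting uninformative words.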
The core linking algorithm computes cosine similarity between the TF‑IDF representation of a statistical variable’s label and description and the vector of each candidate event. Events are ranked by similarity score, and the top‑N events for each year are attached to the corresponding point on the time‑axis. This similarity‑based approach is lightweight, interpretable, and can be executed on‑the‑fly for new data streams.
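As a rough illustration of the linking step, the sketch below ranks candidate event vectors by cosine similarity to an indicator's vector and keeps the top N. The sparse‑dict representation and the zero‑score cutoff are assumptions for this example; the paper only specifies cosine similarity and top‑N ranking.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as term->weight dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_n_events(indicator_vec, event_vecs, n=3):
    """Rank candidate events by similarity to the indicator; return top-n indices."""
    scored = [(cosine(indicator_vec, ev), i) for i, ev in enumerate(event_vecs)]
    scored.sort(reverse=True)
    return [i for score, i in scored[:n] if score > 0]

# Hypothetical TF-IDF vectors for an indicator label and two candidate events.
indicator = {"unemployment": 1.0, "rate": 0.5}
events = [{"unemployment": 0.8, "crisis": 0.4}, {"festival": 1.0}]
matches = top_n_events(indicator, events, n=2)  # -> [0]
```

Because only a dot product over shared terms and two norms are needed per pair, this ranking is cheap enough to run on the fly for newly loaded indicators, as the paragraph above notes.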
The user interface follows a three‑panel layout. The left panel displays a selectable list of statistical indicators and a line chart of the chosen series. The central panel contains a brushable timeline that synchronises with the chart; dragging a time window highlights all events linked to that interval. The right panel shows detailed event cards, each with a brief description and a direct hyperlink to the full Wikipedia article, allowing users to continue their exploration outside the tool. Interaction design emphasises an “explore‑connect‑deep‑dive” workflow: users first identify a trend, then instantly see which historical events coincide, and finally follow a Wikipedia link for richer context.
To evaluate usability and effectiveness, the authors conducted a user study with 30 participants (students and professionals) over two sessions. Quantitative metrics included the System Usability Scale (SUS), task completion time, and event‑matching accuracy compared with expert‑curated ground truth. The system achieved an average SUS score of 82, indicating high perceived usability, and participants completed trend‑analysis tasks 35 % faster than when using a baseline chart‑only tool. Event‑matching accuracy reached 71 % overall, though it varied by domain; for example, economic indicators aligned well with political events, while environmental statistics suffered from sparse relevant events. Qualitative feedback highlighted the intuitive nature of the timeline, the value of immediate Wikipedia links, and a desire for more domain‑specific event coverage.
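The SUS score mentioned above is computed with the standard System Usability Scale formula (odd items contribute their rating minus one, even items contribute five minus their rating, and the 0–40 sum is scaled by 2.5). The sketch below implements that standard formula; it is not code from the paper.

```python
def sus_score(responses):
    """Standard SUS scoring for one participant's ten 1-5 Likert responses.

    Odd-numbered items contribute (r - 1), even-numbered items (5 - r);
    the sum over all ten items is multiplied by 2.5 to yield a 0-100 score.
    """
    assert len(responses) == 10, "SUS requires exactly ten item responses"
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# A participant answering every item neutrally (3) scores 50.0.
neutral = sus_score([3] * 10)
```

A study-level SUS score such as the 82 reported above would then be the mean of the per‑participant scores.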
The discussion acknowledges limitations such as reliance on keyword similarity, which can produce false positives in domains where events are less directly described by statistical labels. The authors suggest incorporating domain ontologies and machine‑learning classifiers trained on expert‑annotated pairs to improve precision. They also propose extending the system with personalised recommendation (e.g., prioritising events matching a user’s declared interests) and multilingual support to broaden accessibility.
In conclusion, the study demonstrates that coupling open‑data statistics with Wikipedia‑derived historical events can substantially aid non‑expert users in constructing narratives around raw numbers. The proposed pipeline—data acquisition, similarity‑based linking, and synchronised visualisation—offers a scalable framework for contextual data exploration. Future work will focus on (1) refining the linking algorithm with deep‑learning models, (2) supporting real‑time data feeds and automatic event updates, and (3) integrating the tool into educational curricula and policy‑making dashboards to promote data‑driven storytelling at a broader societal level.