Enriching Bibliographic Data by Combining String Matching and the Wikidata Knowledge Graph to Improve the Measurement of International Research Collaboration
Measuring international research collaboration is necessary when evaluating, for example, the efficacy of policies meant to increase cooperation between countries. It is currently very difficult, however, because bibliographic records contain only affiliation data, and there is no standard method for identifying the relevant countries from that data. In this paper we describe a method to address this difficulty and evaluate it using both general and domain-specific data sets.
💡 Research Summary
The paper tackles a fundamental obstacle in bibliometric analysis: the lack of a reliable, standardized way to extract country information from the free‑text affiliation fields that appear in scholarly records. Accurate identification of the countries associated with each author is essential for measuring international research collaboration, evaluating the impact of policies aimed at fostering cross‑border cooperation, and constructing collaboration networks. Existing approaches typically rely on simple regular‑expression matching or manually curated lookup tables, which suffer from low recall, poor handling of multilingual institution names, and an inability to disambiguate institutions that exist in multiple locations.
To overcome these limitations, the authors propose a hybrid pipeline that combines fuzzy string matching with semantic queries against the Wikidata knowledge graph. The workflow consists of four main stages. First, raw affiliation strings undergo normalization (case folding, removal of punctuation, whitespace standardization) and linguistic preprocessing (tokenization, stemming, part‑of‑speech tagging) to isolate key tokens such as institution names, city names, and possible country mentions. Second, a fuzzy matching module generates candidate institution identifiers by comparing the extracted tokens against a pre‑compiled global list of research institutions using Levenshtein distance, Jaro‑Winkler similarity, and configurable thresholds. Third, each candidate institution is sent to Wikidata via a SPARQL query that retrieves the property “located in the administrative territorial entity” (P131) and the associated country (P17). Because Wikidata stores multilingual labels, aliases, and hierarchical relationships, the system can resolve institution names written in English, Korean, French, or other languages, and can also follow “part of” chains to reach the ultimate sovereign state. Fourth, the candidate list is re‑ranked using a composite score that incorporates fuzzy similarity, the presence of a Wikidata mapping, and consistency checks on country codes. When multiple countries are returned (e.g., a university with campuses in several nations), the pipeline records all of them to avoid under‑counting. Conflict resolution rules prioritize Wikidata mappings over ambiguous string matches, and heuristics such as institution size or research budget are used when necessary.
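The first three stages can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes only the standard library, uses plain Levenshtein similarity (no Jaro‑Winkler, stemming, or part‑of‑speech tagging), and splits affiliations on commas as a stand‑in for token isolation. The function names, the 0.85 threshold, and the two‑entry institution list are hypothetical, and the QIDs in that list are illustrative values that should be verified against Wikidata before use. The query builder only produces SPARQL text; sending it to an endpoint is left out.

```python
import re
import unicodedata

# Hypothetical toy list (name -> Wikidata QID). Illustrative only: verify the
# QIDs against Wikidata before relying on them.
TOY_INSTITUTIONS = {
    "Harvard University": "Q13371",
    "Stanford University": "Q41506",
}

def normalize(text: str) -> str:
    """Case-fold, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def candidates(affiliation: str, institutions: dict, threshold: float = 0.85):
    """Return (qid, score) pairs whose best segment score meets the threshold."""
    segments = [normalize(part) for part in affiliation.split(",")]
    scored = []
    for name, qid in institutions.items():
        target = normalize(name)
        best = max(similarity(segment, target) for segment in segments)
        if best >= threshold:
            scored.append((qid, best))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def country_query(qid: str) -> str:
    """Build SPARQL text that follows zero or more P131 links, then reads P17."""
    if not re.fullmatch(r"Q\d+", qid):
        raise ValueError(f"not a Wikidata item identifier: {qid!r}")
    return (
        "SELECT DISTINCT ?country ?countryLabel WHERE { "
        f"wd:{qid} wdt:P131*/wdt:P17 ?country . "
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } }'
    )
```

For example, `candidates("Dept. of Physics, Harvard Univeristy, Cambridge, MA, USA", TOY_INSTITUTIONS)` keeps the misspelled Harvard segment and drops the Stanford entry, after which `country_query` can be called on the surviving identifier. A real pipeline would add the re‑ranking and conflict‑resolution rules described above.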
The authors evaluate the method on two datasets. The first is a general sample of 10,000 bibliographic records drawn randomly from major international journals across all fields. The second is a domain‑specific set of 2,500 records from the life‑sciences literature, where multilingual affiliations and large collaborative consortia are common. For both datasets, domain experts manually annotated the correct country list for each record, providing a gold‑standard reference. Compared with a baseline that uses only regular‑expression based matching, the hybrid approach achieves substantial gains: precision improves from 0.92 to 0.97, recall from 0.85 to 0.94, and the F1 score from 0.88 to 0.95. The most pronounced improvements occur for non‑English affiliations and for institutions with ambiguous names (e.g., “University of California” or “Institute of Technology”), where Wikidata’s disambiguation capability eliminates many false positives.
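How such scores could be computed from a gold standard can be sketched as follows. The summary does not state whether the authors averaged per record or pooled all country assignments, so this example assumes micro‑averaging over (record, country) pairs; the function names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def micro_prf(gold: list, predicted: list) -> tuple:
    """Micro-averaged precision, recall, and F1 over per-record country sets.

    gold and predicted are parallel lists of sets of country codes.
    """
    tp = fp = fn = 0
    for gold_set, pred_set in zip(gold, predicted):
        tp += len(gold_set & pred_set)   # countries correctly identified
        fp += len(pred_set - gold_set)   # countries wrongly assigned
        fn += len(gold_set - pred_set)   # countries missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)
```

The reported figures are internally consistent with the harmonic‑mean definition: `f1_score(0.92, 0.85)` rounds to 0.88 and `f1_score(0.97, 0.94)` rounds to 0.95.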
The study also discusses limitations. The approach depends on the coverage and currency of Wikidata; small research labs, startups, or newly established institutes may not yet be represented, leading to missed mappings. Moreover, the current handling of multi‑country affiliations simply assigns equal weight to each country, which may not reflect the true contribution of each partner in a collaborative project. The authors propose future work that integrates ORCID identifiers to link authors directly to their institutional histories, implements a dynamic update mechanism to keep the Wikidata cache synchronized with the live endpoint, and develops a weighted collaboration model that accounts for author order, funding amounts, and other contribution metrics.
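The equal‑weight treatment of multi‑country affiliations can be made concrete with a short sketch. It assumes each record has already been reduced to a list of country codes and that every distinct country on a record receives the same share; the function names are hypothetical, and the pair counter is an illustrative way to turn the same data into collaboration counts rather than something the summary attributes to the authors.

```python
from collections import Counter
from itertools import combinations

def fractional_country_counts(records: list) -> Counter:
    """Give each distinct country on a record an equal share of that record."""
    totals = Counter()
    for countries in records:
        unique = set(countries)
        if not unique:
            continue  # record with no resolved country contributes nothing
        share = 1.0 / len(unique)
        for country in unique:
            totals[country] += share
    return totals

def collaboration_pairs(records: list) -> Counter:
    """Count records in which each unordered pair of countries co-occurs."""
    pairs = Counter()
    for countries in records:
        for a, b in combinations(sorted(set(countries)), 2):
            pairs[(a, b)] += 1
    return pairs
```

A weighted model of the kind proposed as future work would replace the constant `share` with a value derived from author order, funding, or another contribution signal.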
In conclusion, by marrying robust fuzzy string matching with the rich, multilingual entity relationships stored in Wikidata, the paper delivers a scalable, high‑accuracy solution for extracting country information from bibliographic affiliation strings. This advancement enables more reliable measurement of international research collaboration, supports evidence‑based policy assessment, and provides a foundation for downstream network analyses and visualizations. The methodology is openly described and can be adapted by bibliometric service providers, research institutions, and policymakers seeking to monitor and promote global scientific cooperation.