Discovering Beaten Paths in Collaborative Ontology-Engineering Projects using Markov Chains
Biomedical taxonomies, thesauri and ontologies, such as the International Classification of Diseases (ICD) as a taxonomy or the National Cancer Institute Thesaurus as an OWL-based ontology, play a critical role in acquiring, representing and processing information about human health. With increasing adoption and relevance, biomedical ontologies have also significantly increased in size. For example, the 11th revision of the ICD, which is currently under active development by the WHO, contains nearly 50,000 classes representing a vast variety of different diseases and causes of death. This growth in size was accompanied by a change in the way ontologies are engineered. Because no single individual has the expertise to develop such large-scale ontologies, ontology-engineering projects have evolved from small-scale efforts involving just a few domain experts to large-scale projects that require effective collaboration between dozens or even hundreds of experts, practitioners and other stakeholders. Understanding how these stakeholders collaborate will enable us to improve editing environments that support such collaborations. We uncover how large ontology-engineering projects, such as the ICD in its 11th revision, unfold by using Markov chains to analyze the usage logs of five biomedical ontology-engineering projects of varying sizes and scopes. We discover intriguing interaction patterns (e.g., which properties users change in succession) that suggest that large collaborative ontology-engineering projects are governed by a few general principles that determine and drive development. From our analysis, we identify commonalities and differences between projects that have implications for project managers, ontology editors, developers and contributors working on collaborative ontology-engineering projects and tools in the biomedical domain.
💡 Research Summary
The paper investigates how large‑scale collaborative ontology‑engineering projects evolve by mining sequential editing patterns from change logs using Markov chain models. Five biomedical ontology projects of varying size and scope—ICD‑11, ICTM, NCIt, BRO, and OPL—serve as the empirical basis. Each project provides a structured change log (ChAO) that records, for every edit, the user, the class, the property modified, and the timestamp. The authors transform these logs into two kinds of sequences: (1) user‑based paths, which list the properties a single user modifies over time across any classes, and (2) class‑based paths, which list the properties edited on a particular class by any user.
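The two sequence types described above can be illustrated with a minimal Python sketch. The `Change` record, its field names, and the sample log are hypothetical stand-ins for ChAO entries, not the actual log schema; the sketch only assumes that each edit carries a user, a class, a property, and a timestamp.

```python
from collections import defaultdict
from typing import NamedTuple


class Change(NamedTuple):
    """Hypothetical, simplified change-log record (not the real ChAO schema)."""
    user: str
    cls: str
    prop: str
    timestamp: int


def build_paths(changes, group_by):
    """Return {group value: [properties in chronological order]}.

    group_by="user" yields user-based paths (across any classes);
    group_by="cls" yields class-based paths (across any users).
    """
    paths = defaultdict(list)
    for change in sorted(changes, key=lambda c: c.timestamp):
        paths[getattr(change, group_by)].append(change.prop)
    return dict(paths)


# Invented sample log, for illustration only.
log = [
    Change("alice", "C1", "title", 1),
    Change("bob", "C1", "definition", 2),
    Change("alice", "C2", "title", 3),
    Change("alice", "C1", "definition", 4),
]

user_paths = build_paths(log, "user")
class_paths = build_paths(log, "cls")
```

Grouping the same log by a different key is enough to switch between the two views, which is why a single helper is used here.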
These sequences are modeled as discrete‑time Markov chains. Both first‑order (the next property depends only on the immediately preceding property) and second‑order (the next property depends on the two most recent properties) chains are fitted. Model order is selected using Akaike and Bayesian information criteria, with second‑order models generally providing a better fit, indicating that the immediate editing context influences subsequent actions. Transition probability matrices are estimated for each dataset and examined to identify the most likely “next‑step” after any given property edit.
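As a rough illustration of the fitting step, the sketch below estimates maximum-likelihood transition probabilities for a chain of a given order and computes AIC and BIC from the resulting log-likelihood. The function names, the sample paths, and the parameter count of n^k (n - 1) for n states and order k are assumptions of this sketch, not details taken from the paper, and the authors' actual estimation procedure may differ.

```python
import math
from collections import Counter, defaultdict


def fit_markov(paths, order=1):
    """Count transitions and return MLE estimates of P(next | last `order` states)."""
    counts = defaultdict(Counter)
    for path in paths:
        for i in range(order, len(path)):
            context = tuple(path[i - order:i])
            counts[context][path[i]] += 1
    probs = {
        context: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for context, nexts in counts.items()
    }
    return counts, probs


def information_criteria(counts, probs, n_states, order):
    """Return (AIC, BIC), assuming n_states**order * (n_states - 1) free parameters."""
    loglik = sum(
        c * math.log(probs[context][nxt])
        for context, nexts in counts.items()
        for nxt, c in nexts.items()
    )
    n_obs = sum(sum(nexts.values()) for nexts in counts.values())
    n_params = (n_states ** order) * (n_states - 1)
    aic = 2 * n_params - 2 * loglik
    bic = n_params * math.log(n_obs) - 2 * loglik
    return aic, bic


# Invented property sequences, for illustration only.
paths = [
    ["title", "definition", "classification"],
    ["title", "definition", "relationship"],
    ["title", "title", "definition"],
]
n_states = len({prop for path in paths for prop in path})

counts1, probs1 = fit_markov(paths, order=1)
aic1, bic1 = information_criteria(counts1, probs1, n_states, order=1)
```

Lower AIC or BIC indicates a preferred model. Note that this simple version scores each order on a slightly different number of observed transitions, so a careful comparison would evaluate all candidate orders on the same set of positions.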
Across all datasets, a consistent pattern emerges: edits to high‑level descriptive properties such as “title” or “definition” are frequently followed by edits to structural properties like “classification” or “relationship”. This suggests a common workflow where contributors first establish the semantic core of a concept and then integrate it into the ontology’s hierarchy and relational network. Additional recurring patterns include “annotation → status/review” and “comment → validation”, reflecting typical quality‑control cycles.
Project‑specific differences are also observed. ICD‑11 and ICTM, both massive international efforts using the web‑based iCAT editor, exhibit multiple review cycles and a higher proportion of “review → comment” transitions, reflecting complex governance structures. In contrast, the smaller OPL project shows a strong “annotation → relationship” transition, likely due to its focused domain and limited contributor pool. The BRO dataset, built with a customized version of Web‑Protégé, displays a pronounced “status → comment” pattern, possibly driven by UI design that emphasizes state changes.
From these findings the authors derive several practical recommendations. First, ontology editors could incorporate predictive assistance that suggests the most probable next property to edit, streamlining the workflow. Second, frequent transition patterns could trigger automated validation rules or notifications to catch errors early. Third, tailoring the interface to user roles (e.g., domain expert, reviewer, casual contributor) could improve efficiency and reduce conflict. Fourth, regular log‑based analytics can monitor project health, identify bottlenecks, and inform resource allocation.
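The first recommendation, predictive assistance, could be prototyped along the following lines. This is only a sketch: `suggest_next` is a hypothetical helper, and the transition probabilities are invented for illustration rather than taken from the paper's results.

```python
def suggest_next(transitions, last_prop, top_k=2):
    """Return up to `top_k` most probable next properties after `last_prop`.

    `transitions` maps a property to {next property: probability}.
    Ties are broken alphabetically so the output is deterministic.
    """
    candidates = transitions.get(last_prop, {})
    ranked = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))
    return [prop for prop, _ in ranked[:top_k]]


# Hypothetical first-order transition probabilities, for illustration only.
transitions = {
    "title": {"definition": 0.6, "classification": 0.3, "comment": 0.1},
    "definition": {"classification": 0.5, "relationship": 0.5},
}

suggestions = suggest_next(transitions, "title")
```

An editor could surface such a ranked list as shortcuts after each edit; a property with no observed outgoing transitions simply yields no suggestions.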
In sum, the study demonstrates that collaborative ontology engineering, despite differences in domain, tooling, and team size, is governed by a small set of general principles that can be captured through Markov chain analysis of edit logs. These insights provide a data‑driven foundation for improving collaborative tools, managing large‑scale ontology projects, and ultimately enhancing the quality and sustainability of biomedical knowledge representations. Future work may explore higher‑order or deep‑learning sequence models to capture more complex decision processes and extend the methodology to other collaborative domains such as software development or wiki‑based knowledge bases.