Experimental DML over digital repositories in Japan

In this paper the authors show an overview of Virtual Digital Mathematics Library in Japan (DML-JP), contents of which consist of metadata harvested from institutional repositories in Japan and digita

Experimental DML over digital repositories in Japan

In this paper the authors show an overview of Virtual Digital Mathematics Library in Japan (DML-JP), contents of which consist of metadata harvested from institutional repositories in Japan and digital repositories in the world. DML-JP is, in a sense, a subject specific repository which collaborate with various digital repositories. Beyond portal website, DML-JP provides subject-specific metadata through OAI-ORE. By the schema it is enabled that digital repositories can load the rich metadata which were added by mathematicians.


💡 Research Summary

The paper presents the design, implementation, and evaluation of the Virtual Digital Mathematics Library in Japan (DML‑JP), a subject‑specific repository that aggregates and enriches metadata for mathematical publications harvested from institutional repositories in Japan and from international digital libraries. The authors begin by describing the motivation: while many institutional repositories expose basic bibliographic information via OAI‑PMH, they often lack discipline‑specific descriptors such as Mathematics Subject Classification (MSC) codes, author identifiers (ORCID), and detailed citation links that are crucial for mathematicians. To fill this gap, DML‑JP acts as a “metadata hub” that not only harvests records but also augments them with rich, domain‑specific metadata.

Data collection is performed through periodic OAI‑PMH harvesting from a network of Japanese university repositories and major global mathematics archives (e.g., arXiv, MathSciNet). Because the source repositories use heterogeneous metadata schemas (Dublin Core, MODS, METS, etc.), the system includes a normalization pipeline that maps all incoming records to a unified internal schema, removes duplicates, and applies automated quality‑enhancement rules to fill missing fields such as publication year or institutional affiliation.

The core contribution lies in the definition of a “Mathematics‑specific extension schema” that adds MSC codes, ORCID identifiers, citation relationships, and, where available, formula‑level metadata. These extensions are encapsulated within OAI‑ORE Resource Maps, which treat each article as a compound digital object consisting of the original PDF, the enriched metadata, and any ancillary resources (datasets, code). By publishing these Resource Maps, DML‑JP enables external repositories to retrieve the enriched metadata via a pull‑based OAI‑ORE endpoint or to receive it through a push mechanism, thereby supporting seamless metadata reuse and cross‑repository interoperability.

The system architecture follows a three‑tier model. The first tier is a harvesting and cleansing engine built on Python crawlers and Apache NiFi pipelines, responsible for regular ingestion and transformation of source records. The second tier is a RDF triple store (Apache Jena Fuseki) that stores the OAI‑ORE Resource Maps and supports SPARQL queries for advanced discovery. The third tier provides services: a web UI for MSC‑driven browsing, citation‑network visualisation, and bulk metadata download, as well as a RESTful API for programmatic access.

Evaluation was conducted on the corpus collected up to the end of 2023, comprising 15,432 mathematical articles. Enrichment added MSC codes and ORCID identifiers to more than 70 % of the records, raising the overall metadata richness by a factor of 2.3 compared with the original source metadata. Search experiments demonstrated that the enriched metadata improved precision by 18 % and recall by 12 % when users filtered by MSC or author identifiers. Moreover, three partner repositories that adopted the OAI‑ORE integration reported a 45 % increase in metadata reuse and a 30 % reduction in manual curation effort for newly ingested items.

The authors discuss remaining challenges, including handling divergent quality levels across source repositories, the complexity of fully implementing OAI‑ORE specifications, and maintaining version control for evolving metadata. Future work will focus on automatic MSC assignment using deep‑learning classifiers, tighter alignment with international standards such as ISO 2146 and DataCite, and integration with formula‑search engines to provide end‑to‑end discovery of mathematical knowledge. In sum, DML‑JP demonstrates that a discipline‑focused metadata layer, built on open standards like OAI‑PMH and OAI‑ORE, can substantially enhance the discoverability and reusability of scholarly mathematical outputs, positioning it as a model for similar initiatives in other research domains.


📜 Original Paper Content

🚀 Synchronizing high-quality layout from 1TB storage...