Bibliographic reference parsing refers to extracting machine-readable metadata, such as author names, the title, or the journal name, from bibliographic reference strings. Many approaches to this problem have been proposed, including regular expressions, knowledge bases and supervised machine learning, and many open source reference parsers based on various algorithms are available. In this paper, we apply, evaluate and compare ten reference parsing tools in a specific business use case. The tools are Anystyle-Parser, Biblio, CERMINE, Citation, Citation-Parser, GROBID, ParsCit, PDFSSA4MET, Reference Tagger and Science Parse, and we compare them in both their out-of-the-box versions and versions tuned to the project-specific data. According to our evaluation, the best performing out-of-the-box tool is GROBID (F1 0.89), followed by CERMINE (F1 0.83) and ParsCit (F1 0.75). We also found that even though machine learning-based tools and tools based on rules or regular expressions achieve on average similar precision (0.77 for ML-based tools vs. 0.76 for non-ML-based tools), the ML-based tools achieve a recall three times higher than the non-ML-based tools (0.66 vs. 0.22). Our study also confirms that tuning the models to task-specific data increases the quality of the results. The retrained versions of the reference parsers are in all cases better than their out-of-the-box counterparts: for GROBID, F1 increased by 3% (0.92 vs. 0.89), for CERMINE by 11% (0.92 vs. 0.83), and for ParsCit by 16% (0.87 vs. 0.75).
Within the past decades there has been an exponential increase in the volume of available scientific literature [1]. This has resulted in a scientific information overload problem: interested readers face far more information than they can realistically consume. Scientific information systems and digital libraries help researchers tackle this problem by providing intelligent information retrieval and recommendation services. These services need machine-readable, rich bibliographic metadata of the stored documents to function correctly, but this requirement is not always met in practice. As a consequence, there is a huge demand for automated methods and tools able to extract high-quality, machine-readable bibliographic metadata directly from unstructured scientific data.
Reference parsing is one important task in this research area. In reference parsing, the input is a single reference string, usually formatted in a specific bibliography style (Figure 1). The output is a machine-readable representation of the input string, typically called a parsed reference (Figure 2). Such a parsed representation is a collection of metadata fields, each of which is composed of a field type (e.g. “volume” or “journal”) and a value (e.g. “12” or “Nature”). Bibliographic reference parsing is important for tasks such as matching citations to cited documents [2], assessing the impact of researchers [3,4], journals [5,6] and research institutions [7,8], and calculating document similarity [9,10], in the context of academic search engines [11,12] and recommender systems [13,14].
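To make the input-output contract concrete, a parsed reference can be represented as a simple key-value structure. The following minimal Python sketch is illustrative only; the field names and the example string are our own assumptions and do not follow the output schema of any particular tool.

```python
# Hypothetical input reference string (illustrative, not from our dataset):
#   "Smith, J. (2010). On reference parsing. Nature, 12(3), 45-67."
# A parsed reference maps field types to values:
parsed_reference = {
    "author": ["Smith, J."],
    "year": "2010",
    "title": "On reference parsing",
    "journal": "Nature",
    "volume": "12",
    "issue": "3",
    "pages": "45-67",
}
```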
Reference parsing can be viewed as reversing the process of formatting a bibliographic record into a string. During formatting some information is lost, and thus the reverse process is not a trivial task and usually introduces errors.
There are a few challenges related to reference parsing. First, the type of the referenced object (a journal article, a conference publication, a patent, etc.) is typically not known, so we do not know which metadata fields can be extracted. Second, the reference style is unknown, so we do not know where in the string specific metadata fields appear. Finally, it is common for a reference string to contain errors, introduced either by humans while adding the references to the paper, or by the process of extracting the string itself from the scientific publication. These errors include, for example, OCR errors, unexpected spaces inside words, missing spaces, typos, and errors in style-specific punctuation.
The most popular approaches to reference parsing include regular expressions, template matching, knowledge bases and supervised machine learning. A number of ready-to-use open source reference parsers also exist. It is unknown, however, which approaches and which open source parsers give the best results for given metadata field types. Moreover, some of the existing parsers can be tuned to the data of interest. In theory, this process should improve the quality of the results, but it is also time consuming and requires training data, which is typically expensive to obtain. An important question is therefore how large an increase in quality should be expected after retraining. These aspects matter for researchers and programmers developing larger information extraction systems for scientific data, as well as for digital library practitioners wishing to use existing bibliographic reference parsers within their infrastructures.
In this study we apply, evaluate and compare a number of existing reference parsing tools, both their out-of-the-box and retrained versions, in the context of a real business project involving data from the chemical domain. Specifically, we are interested in the following questions:
1. How good are reference parsing tools for our use case?
2. How do the results of machine learning-based approaches compare to the results of more static, non-trainable approaches, such as regular expressions or rules?
3. How much impact does retraining the machine learning models using project-specific data have on the parsing results?

In the following sections, we describe the state of the art, give the larger context of the business case, list the tools we evaluated, describe our evaluation setup and report the results. Finally, we discuss the findings and present conclusions.
Reference parsing is a well-known research problem, and many techniques have been proposed for solving it over the years, including regular expressions, template matching, knowledge bases and supervised machine learning.
Regular expressions are a simple way of approaching the task of reference parsing. This approach is typically based on a set of manually developed regular expressions able to capture single or multiple metadata fields in different reference styles. Such a strategy works best if the reference styles to process are known in advance and if the data contains little noise.
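As a rough illustration of this approach, the sketch below shows a single hand-written regular expression for one hypothetical APA-like style; the pattern and example string are our own assumptions and are not taken from any of the evaluated tools. It also hints at the fragility of the strategy: a missing period, a reordered field, or an unanticipated style breaks the match entirely.

```python
import re

# Hypothetical pattern for one APA-like reference style:
#   "Author(s) (Year). Title. Journal, Volume(Issue), Pages."
PATTERN = re.compile(
    r"(?P<authors>.+?)\s+\((?P<year>\d{4})\)\.\s+"  # authors and year
    r"(?P<title>.+?)\.\s+"                          # title, up to the next period
    r"(?P<journal>.+?),\s+"                         # journal name
    r"(?P<volume>\d+)\((?P<issue>\d+)\),\s+"        # volume and issue
    r"(?P<pages>\d+-\d+)\.?"                        # page range
)

def parse_reference(reference: str):
    """Return a dict of metadata fields, or None if the style does not match."""
    match = PATTERN.match(reference)
    return match.groupdict() if match else None

print(parse_reference(
    "Smith, J. (2010). On reference parsing. Nature, 12(3), 45-67."
))
# -> {'authors': 'Smith, J.', 'year': '2010', 'title': 'On reference parsing',
#     'journal': 'Nature', 'volume': '12', 'issue': '3', 'pages': '45-67'}
```

A production system of this kind would need one such pattern per supported style, plus fallbacks for partial matches, which is precisely why the approach scales poorly to heterogeneous reference data.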