Studies and analysis of reference management software: a literature review
Reference management software is a well-established tool in scientific research. Since the 1980s it has been the subject of reviews and evaluations in the library and information science literature. This paper presents a systematic review of published studies that evaluate reference management software comparatively. The objective is to identify the study types, models, and evaluation criteria that authors have adopted, and to determine whether the methods used offer adequate methodological rigor and a useful contribution to the field.
💡 Research Summary
The paper conducts a systematic literature review of studies that evaluate reference management software (RMS) from the 1980s to the present, focusing on comparative research designs. Using a comprehensive search strategy across major databases (Scopus, Web of Science, LISA, Google Scholar) with keywords such as “reference management,” “citation manager,” and “bibliographic software,” the authors retrieved over 1,200 records. After de‑duplication and abstract screening, 112 peer‑reviewed articles met the inclusion criteria: (1) a comparative or experimental design, (2) explicit evaluation metrics, and (3) a formal peer‑review process.
The review identifies three dominant research streams. The first stream examines functional features: data entry methods (manual entry, DOI lookup, PDF metadata extraction), supported citation styles, search and sorting algorithms, and collaboration tools (shared libraries, real-time sync). Studies in this stream typically report quantitative metrics such as duplicate-removal rates, metadata accuracy, and processing speed, often comparing legacy desktop solutions (EndNote, RefWorks) with newer cloud-based platforms.

The second stream centers on user experience (UX). Researchers employ surveys, the System Usability Scale (SUS), and task-based simulations to assess cognitive load, learning curves, interface intuitiveness, and mobile app usability. Findings consistently show that free or low-cost tools such as Zotero and Mendeley achieve higher satisfaction scores among graduate students and early-career researchers, whereas feature-rich but paid solutions (e.g., EndNote) present steeper learning curves. Sample sizes in UX studies are generally modest (20-50 participants), limiting statistical power.
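Since several UX studies in this stream rely on the System Usability Scale, a minimal Python sketch of the standard SUS scoring procedure may help readers interpret those numbers; the example responses are invented and are not drawn from any study in the review.

```python
# Standard SUS scoring: 10 items on a 1-5 Likert scale.
# Odd-numbered items are positively worded (contribution = response - 1),
# even-numbered items are negatively worded (contribution = 5 - response);
# the summed contributions are scaled by 2.5 to give a 0-100 score.

def sus_score(responses: list[int]) -> float:
    """Compute the System Usability Scale score for one respondent."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects ten responses on a 1-5 scale")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # items 1,3,5,... sit at even indices
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

# Hypothetical respondent rating a reference manager
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # -> 85.0
```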
The third stream investigates integration and sustainability. Here the focus shifts to interoperability with external scholarly systems (PubMed, Scopus, ORCID, institutional repositories), API robustness, data security protocols, and long‑term preservation policies. Recent work highlights the advantages of cloud‑based RMS in reducing data‑loss risk and enabling seamless multi‑device synchronization, while noting that older desktop products often lag in supporting emerging citation formats due to slower update cycles.
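As an illustration of the kind of DOI-based metadata retrieval and API interoperability this stream evaluates, the following sketch queries the public Crossref REST API. It is a generic example, not the internal lookup mechanism of any particular RMS, and error handling is kept minimal.

```python
import requests

def fetch_crossref_metadata(doi: str) -> dict:
    """Retrieve basic bibliographic metadata for a DOI from the Crossref REST API."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    resp.raise_for_status()
    msg = resp.json()["message"]
    return {
        "title": (msg.get("title") or [""])[0],
        "authors": [f"{a.get('given', '')} {a.get('family', '')}".strip()
                    for a in msg.get("author", [])],
        "container": (msg.get("container-title") or [""])[0],
        "year": msg.get("issued", {}).get("date-parts", [[None]])[0][0],
        "doi": msg.get("DOI", doi),
    }

# Example call; substitute a real DOI when testing.
# print(fetch_crossref_metadata("10.1000/xyz123"))
```

In an evaluation setting, the fields returned by such a lookup could then be compared against a manually verified gold-standard record to quantify metadata accuracy.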
Across all streams, the authors distill four principal evaluation dimensions: accuracy (duplicate removal efficiency, DOI‑based metadata matching), efficiency (time saved relative to manual entry, degree of automation), user satisfaction (Likert‑scale averages, SUS scores), and cost‑effectiveness (annual licensing fees versus functional breadth). However, many primary studies treat these dimensions in isolation, lack rigorous statistical testing (e.g., confidence intervals, effect sizes), and rely on single‑institution samples, raising concerns about external validity.
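To make the statistical gap concrete, the sketch below shows the kind of effect size (Cohen's d) and approximate confidence interval that a tool comparison could report. The SUS scores are invented, and the 95% interval uses a normal approximation rather than a t-distribution.

```python
from statistics import mean, stdev

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d for two independent groups, using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

def diff_ci95(a: list[float], b: list[float]) -> tuple[float, float]:
    """Approximate 95% CI for the difference in means (normal approximation)."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    diff = mean(a) - mean(b)
    return diff - 1.96 * se, diff + 1.96 * se

# Hypothetical SUS scores for two tools (invented data, n = 12 per group)
tool_a = [82.5, 77.5, 90.0, 85.0, 72.5, 80.0, 87.5, 75.0, 82.5, 70.0, 92.5, 85.0]
tool_b = [65.0, 70.0, 72.5, 60.0, 77.5, 67.5, 62.5, 75.0, 70.0, 65.0, 72.5, 60.0]
print(round(cohens_d(tool_a, tool_b), 2), [round(x, 1) for x in diff_ci95(tool_a, tool_b)])
```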
Methodologically, the review uncovers several recurring shortcomings. First, selection bias is prevalent because most investigations draw participants from a single university or research group, limiting generalizability. Second, experimental designs often omit cross-validation; for example, the same bibliographic dataset is processed by multiple RMS without statistical comparison of outcomes. Third, the rapid evolution of cloud-based RMS (continuous feature roll-outs, AI-driven metadata extraction) is insufficiently captured, as many studies still evaluate outdated versions. Consequently, the authors recommend that future research adopt large-scale, multi-institutional samples, employ meta-analytic techniques, and develop multi-criteria decision-analysis (MCDA) frameworks that integrate quantitative and qualitative indicators. Moreover, they advocate incorporating emerging AI capabilities, such as automated citation correction and predictive reference suggestions, into evaluation schemas.
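A weighted-sum model is one simple way such an MCDA framework could combine the four evaluation dimensions. The criteria weights and tool scores below are purely hypothetical, and a real study would elicit weights with a formal method such as AHP or TOPSIS.

```python
# Minimal weighted-sum MCDA sketch: criteria, weights, and tool scores are
# hypothetical; scores are assumed pre-normalized to a 0-1 scale where
# higher is better (cost already inverted so cheaper scores higher).
weights = {"accuracy": 0.35, "efficiency": 0.25, "satisfaction": 0.25, "cost": 0.15}

tools = {
    "Tool A": {"accuracy": 0.90, "efficiency": 0.70, "satisfaction": 0.85, "cost": 0.40},
    "Tool B": {"accuracy": 0.75, "efficiency": 0.80, "satisfaction": 0.70, "cost": 0.95},
    "Tool C": {"accuracy": 0.80, "efficiency": 0.60, "satisfaction": 0.90, "cost": 0.80},
}

def weighted_score(scores: dict[str, float]) -> float:
    """Aggregate criterion scores into a single 0-1 value using the weights above."""
    return sum(weights[c] * scores[c] for c in weights)

# Rank the hypothetical tools from best to worst aggregate score
for name, scores in sorted(tools.items(), key=lambda kv: weighted_score(kv[1]), reverse=True):
    print(f"{name}: {weighted_score(scores):.3f}")
```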
In conclusion, while the existing body of literature provides valuable insights into functional performance and user preferences of RMS, it falls short in methodological rigor and comprehensive coverage of modern, AI‑enhanced platforms. By addressing these gaps, librarians, information professionals, and research administrators can make more evidence‑based decisions when selecting, deploying, or recommending reference management tools for their institutions.