Enhancing Invenio Digital Library With An External Relevance Ranking Engine

Invenio is a comprehensive web-based free digital library software suite originally developed at CERN. In order to improve its information retrieval and word similarity ranking capabilities, the goal of this thesis is to enhance Invenio by bridging it with modern external information retrieval systems. In the first part a comparison of various information retrieval systems such as Solr and Xapian is made. In the second part a system-independent bridge for word similarity ranking is designed and implemented. Subsequently, Solr and Xapian are integrated in Invenio via adapters to the bridge. In the third part scalability tests are performed. Finally, a future outlook is briefly discussed.

💡 Research Summary

The paper presents a comprehensive effort to modernize the search and relevance ranking capabilities of Invenio, the open‑source digital library platform originally created at CERN, by integrating it with contemporary external information‑retrieval engines. The authors begin by diagnosing the limitations of Invenio’s native ranking subsystem, which relies on a relatively simple TF‑IDF weighting scheme and a monolithic index structure that struggles with large‑scale collections and complex query patterns. To address these shortcomings, they evaluate two mature open‑source search engines—Apache Solr and Xapian—against a set of criteria that include indexing speed, query latency, support for advanced linguistic processing, scalability, and ease of integration with Python‑based applications. Solr is highlighted for its robust schema definition, distributed indexing, and rich query syntax, while Xapian is praised for its lightweight C++ core, low memory footprint, and straightforward Python bindings.

Building on this comparative analysis, the authors design a system‑independent “bridge” layer that abstracts the interaction between Invenio and any external engine. The bridge defines a uniform interface for preprocessing (tokenization, stemming, stop‑word removal), index management (create, update, delete), and relevance scoring. Engine‑specific adapters implement this interface, translating Invenio’s internal data model into the API calls required by Solr or Xapian and converting the returned scores into a common format that Invenio’s front‑end can render without modification. This modular architecture enables administrators to switch engines, run them in parallel, or add new ones with only configuration changes, preserving uptime and minimizing code churn.

The implementation phase adds Python adapter classes to Invenio’s codebase and extends the configuration system with a “search_engine” option. The authors then conduct extensive scalability experiments using a synthetic dataset of several million bibliographic records and a simulated workload of up to ten thousand queries per second. Results show that Solr, when deployed in a clustered mode, sustains high query throughput while keeping index update latency low, thanks to its sharding and replication mechanisms. Xapian, operating on a single node, demonstrates impressive memory efficiency and still delivers competitive throughput, especially for simple term‑frequency based queries. Crucially, the bridge layer proves resilient: swapping the underlying engine or adding a second engine does not interrupt service, confirming the design’s fault‑tolerance.

In the discussion, the paper outlines future directions such as incorporating machine‑learning‑based ranking models (e.g., learning‑to‑rank), extending multilingual support, and moving toward containerized, cloud‑native deployments for elastic scaling. By providing a clear architectural blueprint, thorough performance evaluation, and a practical integration pathway, the study convincingly shows that Invenio can evolve from a closed‑source search stack to a flexible, high‑performance digital library platform that leverages the best features of modern IR systems.