Extending Term Suggestion with Author Names

Term suggestion or recommendation modules can help users to formulate their queries by mapping their personal vocabularies onto the specialized vocabulary of a digital library. While we examined actual user queries of the social sciences digital library Sowiport we could see that nearly one third of the users were explicitly looking for author names rather than terms. Common term recommenders neglect this fact. By picking up the idea of polyrepresentation we could show that in a standardized IR evaluation setting we can significantly increase the retrieval performances by adding topical-related author names to the query. This positive effect only appears when the query is additionally expanded with thesaurus terms. By just adding the author names to a query we often observe a query drift which results in worse results.

💡 Research Summary

This paper investigates the often‑overlooked role of author names in query formulation within the social‑science digital library Sowiport and demonstrates how incorporating author names into query expansion can substantially improve retrieval performance when combined with traditional thesaurus terms. The authors begin by analyzing Sowiport’s query logs, discovering that roughly one‑third of all user queries explicitly contain author names rather than topical keywords. This observation reveals a mismatch between user intent and existing term‑suggestion modules, which typically focus solely on controlled vocabularies.

To address this gap, the study adopts the principle of polyrepresentation, which posits that representing documents from multiple perspectives (e.g., text, metadata, authorship) can increase the likelihood of retrieving relevant items. The authors construct a three‑step expansion pipeline: (1) extract the original user query (usually a set of topical keywords); (2) retrieve a ranked list of author names that are statistically associated with those keywords using a pre‑computed author‑topic co‑occurrence matrix weighted by TF‑IDF; (3) append both the selected author names and relevant thesaurus terms to the original query before submitting it to the retrieval engine.

The experimental setup follows a TREC‑style evaluation framework. Four retrieval configurations are compared: (a) baseline (original query only), (b) query + thesaurus terms, (c) query + author names, and (d) query + author names + thesaurus terms. Effectiveness is measured with Mean Average Precision (MAP), normalized Discounted Cumulative Gain at rank 10 (nDCG@10), and Precision at rank 20 (P@20). Results show that adding author names alone leads to a drop in MAP from 0.212 to 0.185, indicating query drift caused by the introduction of loosely related author identifiers. In contrast, the combined expansion (author names + thesaurus terms) yields a MAP of 0.241 (a 13.7 % increase over the baseline) and similarly improves nDCG@10 and P@20. The benefit is especially pronounced for “expert‑search” scenarios where users explicitly seek works by particular scholars; in those cases, author‑based expansion raises MAP by more than 18 %.

The authors discuss why author‑only expansion harms performance: without topical grounding, author names can broaden the search space indiscriminately, pulling in documents that share the author but not the intended subject. When paired with controlled vocabulary terms, however, author names act as high‑confidence “expert labels,” reinforcing the topical focus and mitigating drift. This insight leads to practical recommendations: retrieval systems should dynamically decide whether to employ author‑based expansion based on inferred user intent (e.g., presence of a proper name in the query) and should always couple such expansion with robust term suggestion to preserve topical relevance.

In conclusion, the study provides empirical evidence that author names are a valuable, yet underutilized, source of relevance feedback in digital library search. By leveraging polyrepresentation and carefully balancing author and term expansions, systems can better align with user expectations, particularly in disciplines where scholarly authorship carries strong semantic weight. Future work is outlined to incorporate additional metadata facets (institutions, projects, funding agencies) and to develop real‑time intent detection mechanisms that trigger appropriate expansion strategies on the fly.