Recommender Systems by means of Information Retrieval

In this paper we present a method for reformulating the Recommender Systems problem as an Information Retrieval one. In our tests we use a dataset of users who rate movies; we hide some values from the dataset and try to predict them from the remaining portion (the so-called “leave-n-out” approach). To use an Information Retrieval algorithm, we reformulate the Recommender Systems problem as follows: a user corresponds to a document, a movie corresponds to a term, the active user (whose rating we want to predict) plays the role of the query, and the ratings are used as weights in place of the weighting schema of the original IR algorithm. The output is the ranked list of the documents (“users”) relevant to the query (“active user”). We use the ratings of these users, weighted according to rank, to predict the rating of the active user. We evaluate the approach with a typical metric, namely the accuracy of the predictions returned by the algorithm, comparing them to the real ratings from users. In our first tests, we use two different Information Retrieval algorithms: LSPR, a recently proposed model based on the Discrete Fourier Transform, and a simple vector space model.

💡 Research Summary

The paper proposes a novel reformulation of the classic collaborative‑filtering recommendation problem as an information‑retrieval (IR) task. In this mapping, each user is treated as a document, each movie as a term, and the active user whose rating we wish to predict becomes the query. The user‑movie rating matrix is reinterpreted as a term‑document weight matrix: a rating value directly replaces the traditional TF‑IDF weight for the corresponding term‑document pair. This simple yet powerful transformation enables the direct application of any IR ranking engine to the recommendation setting.
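
The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the toy ratings, the 1 to 5 scale, and the helper name `build_term_document_matrix` are all assumptions made for the example. Unrated movies are stored as 0, which is also an assumption.

```python
import numpy as np

# Hypothetical toy data: (user, movie, rating) triples on an assumed 1-5 scale.
ratings = [
    ("alice", "Alien", 5), ("alice", "Brazil", 3),
    ("bob", "Alien", 4), ("bob", "Casablanca", 2),
    ("carol", "Brazil", 5), ("carol", "Casablanca", 4),
]

def build_term_document_matrix(ratings):
    """Rows are users (documents), columns are movies (terms).

    Each cell holds the rating, used directly as the term weight.
    Unrated movies are left at 0 (an assumption of this sketch).
    """
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    user_index = {u: i for i, u in enumerate(users)}
    movie_index = {m: j for j, m in enumerate(movies)}
    matrix = np.zeros((len(users), len(movies)))
    for user, movie, rating in ratings:
        matrix[user_index[user], movie_index[movie]] = rating
    return matrix, users, movies

matrix, users, movies = build_term_document_matrix(ratings)
```

Each row of `matrix` can then be handed to an IR model as a weighted document vector, and the active user's row serves as the query.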

Two IR models are examined. The first is LSPR, a recently introduced model that applies the Discrete Fourier Transform to compress the term‑document matrix while preserving its spectral structure. By operating in the frequency domain, LSPR can compute similarity between the query and the document collection efficiently, even when the matrix is high‑dimensional and sparse. The second model is the classic Vector Space Model (VSM), which uses cosine similarity on the raw weighted vectors. Both models receive the active user’s rated movies (the query) and return a ranked list of “relevant” users (documents).
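
For the vector space model, the ranking step might look like the sketch below. It covers only the cosine-similarity baseline, not LSPR, and the function name `rank_users_by_cosine` and the toy vectors are hypothetical.

```python
import numpy as np

def rank_users_by_cosine(matrix, query):
    """Rank user rows by cosine similarity to the active user's vector.

    Returns (order, scores): `order` lists row indices, best match first.
    Rows with zero norm get a score of 0 instead of a division error.
    """
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    scores = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    order = np.argsort(-scores, kind="stable")
    return order, scores

# Hypothetical user vectors (documents) and active user (query).
docs = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 2.0],
    [0.0, 5.0, 4.0],
])
query = np.array([5.0, 4.0, 0.0])
order, scores = rank_users_by_cosine(docs, query)
```

In this toy case the first user shares both rated movies with the query and is ranked first.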

The ranking is then used to predict the hidden ratings. The top‑k users in the list are selected, and each of their ratings for the target movie is weighted according to its rank (typically the inverse of the rank). A weighted average of these values yields the final prediction for the active user. This procedure is analogous to a k‑nearest‑neighbour collaborative filter, but the neighbourhood is defined by an IR relevance score rather than a raw similarity metric.
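
A minimal sketch of this prediction step follows, assuming inverse-rank weights (1/rank) and assuming that neighbours among the top k who have not rated the target movie are simply skipped. The function name `predict_rating` and the sample data are hypothetical.

```python
def predict_rating(ranked_users, ratings_for_movie, k=3):
    """Predict a rating as the inverse-rank weighted average of neighbours.

    ranked_users: user ids ordered by relevance, best first.
    ratings_for_movie: dict mapping user id -> rating for the target movie.
    Neighbours without a rating are skipped (assumption of this sketch).
    Returns None when none of the top-k neighbours rated the movie.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for rank, user in enumerate(ranked_users[:k], start=1):
        rating = ratings_for_movie.get(user)
        if rating is None:
            continue
        weight = 1.0 / rank  # rank 1 counts most, rank 2 half as much, ...
        weighted_sum += weight * rating
        weight_total += weight
    if weight_total == 0:
        return None
    return weighted_sum / weight_total

# Hypothetical example: "carol" (rank 2) has not rated the target movie.
prediction = predict_rating(["bob", "carol", "dave"], {"bob": 4, "dave": 2})
```

Here the weights are 1 for `bob` and 1/3 for `dave`, so the prediction is (4 + 2/3) / (4/3) = 3.5.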

Experiments are conducted on a movie‑rating dataset using a leave‑n‑out protocol: a subset of known ratings is deliberately removed and later reconstructed. Accuracy (the proportion of correctly predicted rating values) is used as the primary evaluation metric, complemented by standard error measures such as RMSE and MAE. Results show that LSPR consistently outperforms the VSM baseline, albeit by a modest margin, and both IR‑based methods achieve competitive performance compared with traditional collaborative‑filtering algorithms. The rank‑based weighting scheme is identified as a key factor in boosting prediction quality.
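
The evaluation protocol can be sketched as below. The split helper and metric function are illustrative only: the names `leave_n_out` and `evaluate` are hypothetical, and counting a prediction as correct when it rounds to the true integer rating is an assumption about how accuracy is computed.

```python
import math
import random

def leave_n_out(ratings, n, seed=0):
    """Hide n randomly chosen ratings; return (remaining, hidden)."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    return shuffled[n:], shuffled[:n]

def evaluate(true_ratings, predicted_ratings):
    """Return accuracy, MAE and RMSE for paired true/predicted ratings."""
    pairs = list(zip(true_ratings, predicted_ratings))
    errors = [pred - true for true, pred in pairs]
    # Assumption: a prediction is "correct" if it rounds to the true rating.
    accuracy = sum(round(pred) == true for true, pred in pairs) / len(pairs)
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"accuracy": accuracy, "mae": mae, "rmse": rmse}

# Hypothetical hidden ratings and the values predicted for them.
metrics = evaluate([4, 3, 5, 2], [3.8, 3.4, 4.2, 2.1])

# Hypothetical split of six (user, movie, rating) triples, hiding two.
sample = [("u1", "m1", 5), ("u1", "m2", 3), ("u2", "m1", 4),
          ("u2", "m3", 2), ("u3", "m2", 5), ("u3", "m3", 4)]
remaining, hidden = leave_n_out(sample, n=2)
```

The hidden triples are the ones to predict from `remaining`; comparing the predictions with the hidden ratings yields the metrics.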

Beyond performance, the authors highlight practical advantages. Because the reformulation yields a standard term‑document index, existing open‑source IR platforms (e.g., Apache Lucene, Elasticsearch) can be reused without substantial engineering effort. This reduces implementation complexity and leverages mature indexing, caching, and distributed search capabilities already available in production environments.

The paper also discusses limitations. Extremely sparse rating matrices may lead to large inverted indexes and slower query processing; additional compression or dimensionality‑reduction techniques may be required. LSPR’s performance is sensitive to hyper‑parameters such as the number of Fourier components retained, demanding careful tuning for each dataset.

In conclusion, the study demonstrates that viewing recommendation as an IR problem is not merely a metaphor but a concrete methodological bridge. It opens the door to hybrid systems that combine IR’s sophisticated ranking heuristics, language‑model smoothing, and relevance feedback with the personalization strengths of collaborative filtering. Future work could explore deep‑learning‑based IR models (e.g., neural ranking, transformer encoders) or integrate content‑based features into the term space, further enriching the recommendation pipeline while retaining the scalability and robustness of modern search engines.
