Recommender systems are crucial tools for overcoming the information overload brought about by the Internet. Rigorous tests are needed to establish to what extent sophisticated methods can improve the quality of predictions. Here we analyse a refined correlation-based collaborative filtering algorithm and compare it with a novel spectral method for recommending. We test them on two databases with different statistical properties (MovieLens and Jester), without filtering out the less active users and, whenever possible, ordering the opinions in time. We find that, when the distribution of user-user correlations is narrow, simple averages work nearly as well as advanced methods. On the other hand, recommender systems can exploit a great deal of additional information in systems where external influence is negligible and people's tastes fully emerge. These findings are validated by simulations with artificially generated data.
One of the most amazing trends of today's globalized economy is peer production [Anderson 2006]. An unprecedented mass of unpaid workers is contributing to the growth of the World Wide Web: some build entire pages, some only drop casual comments, having no other reward than reputation [Masum and Zhang 2004]. Many successful websites (e.g. Blogger and MySpace) are just platforms holding user-generated content. The information thus conveyed is particularly valuable because it contains personal opinions, with no specific corporate interest. It is, at the same time, very hard to go through it and judge its degree of reliability. If you want to use it, you need to filter this information, select what is relevant and aggregate it; you need to reduce the information overload [Maes 1994].
As a matter of fact, opinion filtering has become rather common on the web. There exist search engines (e.g. Googlenews) that extract news from journals, websites (e.g. Digg) that harvest them from blogs, and platforms (e.g. Epinions) that collect and aggregate votes on products. The basic version of these systems ranks the objects once and for all, assuming they have an intrinsic value independent of the personal taste of the demander [Laureti et al. 2006]. They lack personalisation [Kelleher 2006], which constitutes the new frontier of online services.
Users need only browse the web to leave recorded traces, and any comments they drop add to them. The more information you release, the better the service you receive. Personal information can, in fact, be exploited by recommender systems. The deal is, at the same time, beneficial to the community, as every piece of information can potentially improve the filtering procedures. Amazon.com, for instance, uses one's purchase history to provide individual suggestions. If you have bought a physics book, Amazon recommends other physics books to you: this is called item-based recommendation [Breese et al. 1998; Sarwar et al. 2001]. Those who have experience with it know that this system works fairly well, but it is conservative, as it rarely dares to suggest books on subjects you have never explored. We believe a good recommender system should sometimes help uncover people's hidden wants [Maslov and Zhang 2001].
Collaborative filtering is currently the most successful implementation of recommender systems. It essentially consists in recommending to you items that users whose tastes are similar to yours have liked. In order to do that, one needs to collect taste information from many users and define a measure of similarity between them. The easiest and most common way to do this is to use either correlations or Euclidean distances.
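As a minimal sketch of the two similarity measures named above, restricted to items both users have voted on (the small vote vectors and the particular distance-to-similarity mapping are illustrative assumptions, not the paper's definitions):

```python
import numpy as np

def pearson_sim(u, w, empty=0):
    """Pearson correlation computed on the items both users have voted on."""
    mask = (u != empty) & (w != empty)
    if mask.sum() < 2:
        return 0.0  # not enough overlap to estimate a correlation
    a = u[mask] - u[mask].mean()
    b = w[mask] - w[mask].mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float(a @ b / denom) if denom > 0 else 0.0

def euclidean_sim(u, w, empty=0):
    """A similarity decreasing with the Euclidean distance on shared items."""
    mask = (u != empty) & (w != empty)
    if not mask.any():
        return 0.0
    return 1.0 / (1.0 + float(np.linalg.norm(u[mask] - w[mask])))

# Hypothetical vote vectors over 4 items; 0 marks a missing vote
u = np.array([5.0, 3.0, 0.0, 1.0])
w = np.array([4.0, 2.0, 1.0, 0.0])
print(pearson_sim(u, w), euclidean_sim(u, w))
```

Only the first two items are co-rated here, so the correlation is computed on that overlap alone; with real sparse data one would also discount similarities estimated from very few common items.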
Here we test a correlation-based algorithm and a spectral method for making predictions. We describe these two families of recommender systems in section 2 and propose some improvements to currently used algorithms. In section 3 we present the results of our prediction methods on the MovieLens and Jester data sets, as well as on artificial data. We argue that the distribution of correlations in the system is the key ingredient in determining whether or not sophisticated recommendations outperform simple averages. Finally, we draw some conclusions in section 4.
Our aim here is to test two methods for recommending, spectral and correlation-based, on different data sets. The starting point is data collection. One typically has a system of $N$ users, $M$ items and $n$ evaluations. Opinions, books, restaurants or any other objects can be treated, although we shall examine in detail two fundamentally different examples: movies and jokes. Each user $i$ evaluates a pool of $n_i$ items and each item $\alpha$ receives $n_\alpha$ evaluations, with $n = \sum_{i=1}^{N} n_i = \sum_{\alpha=1}^{M} n_\alpha$. The votes $v_{i\alpha}$ can be gathered in a matrix $V$. If a user $j$ has not voted on item $\beta$, the corresponding matrix element takes a constant value $v_{j\beta} = \text{EMPTY}$, usually set to zero.
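The setup above can be illustrated by assembling a voting matrix from (user, item, vote) triples; the toy data below are made up, and, as in the text, missing votes take the constant EMPTY = 0:

```python
import numpy as np

EMPTY = 0  # placeholder value for missing votes

# Hypothetical (user, item, vote) triples: N = 3 users, M = 4 items
ratings = [(0, 0, 5), (0, 2, 3), (1, 0, 4), (1, 1, 2), (2, 3, 1)]

N, M = 3, 4
V = np.full((N, M), EMPTY, dtype=float)
for i, alpha, v in ratings:
    V[i, alpha] = v

n_i = (V != EMPTY).sum(axis=1)      # n_i: votes cast by each user
n_alpha = (V != EMPTY).sum(axis=0)  # n_alpha: votes received by each item
# n = sum_i n_i = sum_alpha n_alpha, each evaluation counted once
assert n_i.sum() == n_alpha.sum() == len(ratings)
print(V)
```

Note that with EMPTY = 0 a legitimate vote of value zero would be indistinguishable from a missing one, which is why rating scales on real sites typically start at 1.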
Once the data are collected into the voting matrix, we aim to predict votes before they are expressed. That is, we would like to predict whether agent $j$ would appreciate the movie, book or food $\beta$ before she actually watches, reads or eats it. Say we predict that user $j$ would give a very high vote to item $\beta$ if she were exposed to it; we can then recommend $\beta$ to $j$ and verify her appreciation a posteriori. Ideally, we would like to have a prediction for every EMPTY element of $V$.
Most websites only allow votes to be chosen from a finite set of values. In order to take into account the fact that each person adopts an individual scale, we compute each user $i$'s average expressed vote $\bar{v}_i$ and subtract it from the non-empty $v_{i\alpha}$'s. The methods we analyse give predictions of the following form [Delgado 1999]:
$$ v'_{j\beta} = \bar{v}_j + \frac{\sum_{i \neq j} S_{ji}\,(v_{i\beta} - \bar{v}_i)}{\sum_{i \neq j} |S_{ji}|}, $$

where $v'_{j\beta}$ is the predicted vote and $S$ is a similarity matrix. The choice of $S$ is the crucial issue of collaborative filtering. One very often has, in fact, to face a lack of data, which makes it difficult to estimate the similarity between non
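A prediction of this weighted form can be sketched as follows: centre each user's votes on her own average, then combine the other users' centred votes on the target item through the similarity matrix $S$. The toy matrices and the normalisation by the sum of $|S_{ji}|$ are illustrative assumptions here, not the paper's exact algorithm:

```python
import numpy as np

EMPTY = 0  # placeholder value for missing votes

def predict(V, S, j, beta, empty=EMPTY):
    """Predict user j's vote on item beta from other users' centred votes."""
    vbar_j = V[j][V[j] != empty].mean()   # user j's average expressed vote
    num, den = 0.0, 0.0
    for i in range(V.shape[0]):
        if i == j or V[i, beta] == empty:
            continue                      # only users who voted on beta count
        vbar_i = V[i][V[i] != empty].mean()
        num += S[j, i] * (V[i, beta] - vbar_i)
        den += abs(S[j, i])
    # fall back on user j's own average when no similar voter is available
    return vbar_j + num / den if den > 0 else vbar_j

# Toy data: 3 users, 3 items; user 0 has not voted on item 2
V = np.array([[5.0, 3.0, 0.0],
              [4.0, 2.0, 5.0],
              [1.0, 5.0, 2.0]])
# Hypothetical symmetric similarity matrix (zero diagonal)
S = np.array([[ 0.0,  0.9, -0.5],
              [ 0.9,  0.0, -0.4],
              [-0.5, -0.4,  0.0]])
print(predict(V, S, j=0, beta=2))
```

Negative similarities contribute too: a strongly anti-correlated user who disliked the item pushes the prediction up, which is one reason the shape of the correlation distribution matters so much in the results that follow.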