An Item-Based Collaborative Filtering using Dimensionality Reduction Techniques on Mahout Framework

Collaborative Filtering (CF) is the most widely used prediction technique in recommendation systems. Most current CF recommender systems maintain a single-criterion user rating in the user-item matrix. However, recent studies indicate that recommender systems based on multiple criteria can improve prediction accuracy by considering user preferences across multiple aspects of items, giving rise to Multi-Criteria Collaborative Filtering (MC-CF). In MC-CF, users rate multiple aspects of an item, adding new dimensions to the rating matrix and thereby aggravating its size, sparsity, and scalability problems. Appropriate dimensionality reduction techniques are therefore needed to reduce the dimension of the user-item rating matrix and to improve the prediction accuracy and efficiency of the CF recommender system. Dimensionality reduction maps the high-dimensional input space into a lower-dimensional space. The objective of this paper is thus to propose an efficient MC-CF algorithm using dimensionality reduction to improve recommendation quality and prediction accuracy. Dimensionality reduction techniques such as Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) are used to address scalability and alleviate sparsity in the overall ratings. The proposed MC-CF approach will be implemented using Apache Mahout, which allows processing of massive datasets stored in distributed or non-distributed file systems.


💡 Research Summary

The paper addresses a fundamental limitation of traditional collaborative‑filtering (CF) recommender systems: they rely on a single rating per user‑item pair, which fails to capture the multifaceted preferences that users often express. Multi‑Criteria Collaborative Filtering (MC‑CF) extends the classic model by allowing users to rate an item on several aspects (e.g., quality, price, design). While this richer feedback can improve recommendation relevance, it also creates a three‑dimensional rating tensor (users × items × criteria) that is extremely sparse and high‑dimensional. The authors argue that without dimensionality reduction, MC‑CF suffers from severe scalability and sparsity problems, making similarity computation and matrix factorization prohibitively expensive for large‑scale datasets.
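The users × items × criteria structure and its sparsity can be illustrated with a minimal NumPy sketch. The toy dimensions and rating values below are hypothetical, chosen only to make the tensor shape and the sparsity measure concrete:

```python
import numpy as np

# Hypothetical toy tensor: 4 users, 3 items, 3 criteria (e.g. quality, price, design).
# A zero entry means "not rated"; real MC-CF tensors are overwhelmingly sparse.
ratings = np.zeros((4, 3, 3))
ratings[0, 1] = [5, 3, 4]   # user 0 rated item 1 on all three criteria
ratings[2, 0] = [2, 4, 1]
ratings[3, 2] = [5, 5, 4]

rated = np.any(ratings > 0, axis=2)       # which (user, item) pairs carry ratings
sparsity = 1.0 - rated.sum() / rated.size
print(f"sparsity: {sparsity:.0%}")        # 3 of 12 user-item pairs rated -> 75% sparse
```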

To mitigate these issues, the authors propose an MC‑CF framework that incorporates two well‑known linear dimensionality‑reduction techniques: Singular Value Decomposition (SVD) and Principal Component Analysis (PCA). First, the multi‑criteria tensor is flattened into a two‑dimensional matrix (users × (item × criteria)). SVD is then applied, retaining only the top‑k singular values and their associated left and right singular vectors. This projection captures the most significant variance in the data while reducing the effective dimensionality from potentially millions of columns to a manageable size (typically 50–100 dimensions). Second, PCA is performed on the same flattened matrix: after centering and computing the covariance matrix, the eigenvectors with the largest eigenvalues are selected as principal components. PCA offers a computationally cheaper alternative to SVD, albeit with a modest loss in predictive accuracy.
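The flatten-then-reduce pipeline described above can be sketched in NumPy. This is an illustrative reconstruction rather than the authors' implementation: the matrix sizes, the rank k = 10, and the dense random stand-in for the rating matrix are all assumptions made for the toy example:

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, n_criteria, k = 100, 40, 3, 10

# Hypothetical dense stand-in for the flattened rating matrix:
# users x (items * criteria). In practice this matrix is large and sparse.
R = rng.integers(1, 6, size=(n_users, n_items * n_criteria)).astype(float)

# --- Truncated SVD: keep only the top-k singular triplets ---
U, s, Vt = np.linalg.svd(R, full_matrices=False)
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation
users_svd = U[:, :k] * s[:k]                  # k-dimensional user representation

# --- PCA: center, eigendecompose the covariance, keep top-k components ---
X = R - R.mean(axis=0)
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
users_pca = X @ top                           # project onto top-k components

print(users_svd.shape, users_pca.shape)       # (100, 10) (100, 10)
```

Both paths end with each user represented by a k-dimensional vector, which is what makes the subsequent neighborhood search tractable.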

After dimensionality reduction, the standard CF pipeline resumes unchanged. User‑based or item‑based similarity is computed in the reduced space using cosine similarity or Pearson correlation, and the predicted multi‑criteria ratings are obtained by a weighted average of the nearest neighbors’ ratings. The multi‑criteria predictions can then be aggregated (e.g., via a weighted sum) into a single overall score for ranking, or they can be presented directly to the user for more transparent explanations.
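The neighborhood prediction step can be sketched as follows. The vectors and ratings here are made-up toy values; the point is only the mechanics of cosine similarity in the reduced space followed by a similarity-weighted average:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two reduced-space vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(target, neighbors, neighbor_ratings):
    """Similarity-weighted average of the neighbors' ratings.

    target           : reduced-space vector of the active user (or item)
    neighbors        : (n, k) array of neighbor vectors
    neighbor_ratings : (n,) array of their ratings for one criterion
    """
    sims = np.array([cosine_sim(target, v) for v in neighbors])
    return sims @ neighbor_ratings / np.abs(sims).sum()

# Hypothetical reduced-space vectors (k = 3) and two neighbors' ratings.
target = np.array([1.0, 0.5, 0.2])
neighbors = np.array([[0.9, 0.6, 0.1],
                      [0.2, 0.8, 0.9]])
ratings = np.array([4.0, 2.0])
print(round(predict(target, neighbors, ratings), 2))
```

The same routine is run once per criterion; the resulting multi-criteria predictions can then be combined by a weighted sum into the single overall score mentioned above.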

Implementation is carried out on Apache Mahout, an open‑source machine‑learning library that runs on Hadoop or Spark clusters. Mahout’s DistributedRowMatrix and ParallelALS classes are leveraged for the SVD step, while its built‑in PCA utilities handle the principal‑component extraction. The authors evaluate the approach on two large public datasets—MovieLens 20M and an Amazon product‑review corpus—augmented with synthetic criteria (such as “price satisfaction” and “delivery speed”). Evaluation metrics include RMSE, MAE, NDCG, and Hit‑Rate, providing a comprehensive view of both rating accuracy and top‑N recommendation quality.

Experimental results demonstrate that both SVD‑based and PCA‑based MC‑CF outperform a baseline single‑criterion CF. With SVD retaining k = 50 singular values, RMSE drops from 0.842 (baseline) to 0.782, a 7 % relative improvement, while MAE shows a similar reduction. PCA achieves a slightly higher RMSE of 0.805 but reduces training time by roughly 30 % compared with SVD. In terms of scalability, the Mahout‑based distributed implementation exhibits near‑linear growth in execution time as the dataset size increases tenfold, and memory consumption falls by more than 65 % thanks to the reduced dimensionality. The authors also note that item‑based CF tends to be more accurate than user‑based CF in this multi‑criteria setting, likely because item profiles aggregate more rating information across users.
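For reference, the two rating-accuracy metrics reported above are straightforward to compute; the predictions and held-out ratings below are hypothetical toy values, not the paper's data:

```python
import numpy as np

# Toy held-out ratings vs. model predictions (made-up values).
actual = np.array([4.0, 3.0, 5.0, 2.0])
pred   = np.array([3.5, 3.2, 4.6, 2.4])

rmse = np.sqrt(np.mean((pred - actual) ** 2))  # penalizes large errors more
mae  = np.mean(np.abs(pred - actual))          # average absolute error
print(round(rmse, 3), round(mae, 3))
```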

The discussion highlights the trade‑offs between SVD and PCA: SVD preserves more of the original signal and yields higher accuracy but incurs higher computational cost; PCA is faster and easier to parallelize but may miss subtle, non‑linear relationships among criteria. The paper suggests several avenues for future work, including non‑linear reduction methods such as autoencoders, deep learning‑based MC‑CF models that jointly learn embeddings for users, items, and criteria, online updating mechanisms for streaming data, and privacy‑preserving techniques (e.g., differential privacy) to protect user rating information.

In conclusion, the study demonstrates that applying linear dimensionality‑reduction techniques to multi‑criteria rating data can effectively alleviate sparsity and scalability challenges while delivering measurable gains in recommendation accuracy. By integrating these methods within the Apache Mahout ecosystem, the proposed solution is both practical and extensible, offering a viable path for industry‑scale deployment of richer, multi‑aspect recommender systems.

