Collaborative Personalized Web Recommender System using Entropy based Similarity Measure


On the internet, web surfers searching for information increasingly rely on recommendations. Generating recommendations becomes harder as the information domain grows exponentially day by day. In this paper, we compute an entropy-based similarity between users to address the scalability problem. Using this concept, we implement an online user-based collaborative web recommender system. In this model-based collaborative system, the user session is divided into two levels, and entropy is calculated at each level. We show that, from the set of valuable recommenders obtained at Level I, only those whose Level II entropy is lower than their Level I entropy serve as trustworthy recommenders. Finally, the top-N recommendations for an online user are generated from these trustworthy recommenders.


💡 Research Summary

The paper addresses the scalability and trustworthiness challenges inherent in traditional user‑based collaborative filtering for web recommendation. Recognizing that conventional similarity measures such as Pearson correlation, cosine similarity, or Jaccard index become unreliable when the user‑item matrix is sparse and that the selection of neighbors often introduces noise, the authors propose an entropy‑based similarity framework that quantifies the uncertainty of user behavior patterns.

The methodology consists of a two‑level session analysis. In Level I, the complete click‑stream log is partitioned into fixed‑size windows (either by time or number of pages). For each window, a user’s visited pages are represented as a binary vector, and the overlap probability p between two users is computed. The binary entropy H = –p·log₂p – (1–p)·log₂(1–p) serves as a distance metric: lower entropy indicates higher consistency between the two users’ navigation patterns. All user pairs whose Level I entropy falls below a predefined threshold are retained as candidate recommenders.
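The Level I computation can be sketched as follows. This is a minimal reading of the description above, not the authors' exact formulation: the binary page-visit vectors and the definition of the overlap probability p (shared visited pages over window size) are our assumptions.

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p); defined as 0 at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def session_entropy(user_a: list[int], user_b: list[int]) -> float:
    """Level I entropy between two users' binary page-visit vectors for one window.

    The overlap probability p (fraction of window positions both users visited)
    is an illustrative assumption; lower entropy is read as higher consistency.
    """
    assert len(user_a) == len(user_b), "vectors must cover the same window"
    overlap = sum(a and b for a, b in zip(user_a, user_b))
    p = overlap / len(user_a)
    return binary_entropy(p)
```

User pairs whose entropy falls below the threshold would then be kept as Level I candidates.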

Level II re‑examines only these candidates using the same windowing scheme. The entropy is recomputed, and a candidate is promoted to a “trustworthy recommender” only if its Level II entropy is strictly lower than its Level I value. This double‑filtering mechanism discards users whose apparent similarity was caused by transient or noisy behavior, thereby improving the reliability of the neighbor set without incurring the full computational cost of evaluating every possible pair.
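The double filter above can be expressed compactly. The dictionary representation (user id mapped to entropy) and the threshold parameter name are illustrative assumptions; the promotion rule itself — Level II entropy strictly below Level I — is taken from the paper's description.

```python
def trustworthy_recommenders(level1: dict[str, float],
                             level2: dict[str, float],
                             threshold: float) -> list[str]:
    """Two-stage entropy filter.

    Level I: keep candidates whose entropy is below the threshold.
    Level II: promote a candidate only if its recomputed entropy is
    strictly lower than its Level I value (the 'trustworthy' test).
    """
    candidates = {u: h1 for u, h1 in level1.items() if h1 < threshold}
    return [u for u, h1 in candidates.items()
            if level2.get(u, float("inf")) < h1]
```

Users missing from the Level II pass are treated as untrustworthy by default (entropy of infinity), which matches the strict filtering intent.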

Once trustworthy recommenders are identified, their item sets are aggregated. Each item receives a weight inversely proportional to the recommender’s entropy (i.e., more consistent recommenders contribute more strongly). The top N items after weighting constitute the final recommendation list presented to the active online user.
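A sketch of the aggregation step, assuming the simplest reading of "inversely proportional": each recommender's vote is weighted by 1/(entropy + eps). The eps smoothing term and the exclusion of already-seen pages are our additions, not stated in the paper.

```python
from collections import defaultdict

def top_n(recommenders: dict[str, tuple[float, set[str]]],
          seen: set[str], n: int, eps: float = 1e-9) -> list[str]:
    """Aggregate items from trustworthy recommenders into a top-N list.

    Each item visited by a recommender is scored 1/(entropy + eps), so
    more consistent (lower-entropy) recommenders contribute more strongly.
    """
    scores: defaultdict[str, float] = defaultdict(float)
    for entropy, items in recommenders.values():
        for item in items - seen:  # skip pages the active user already visited
            scores[item] += 1.0 / (entropy + eps)
    return sorted(scores, key=scores.get, reverse=True)[:n]
```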

The authors validate their approach on publicly available datasets (e.g., MovieLens 20M, Yahoo! Music). They compare the entropy‑based system against baseline collaborative filters (Pearson, cosine, Jaccard) and modern neural recommenders such as Neural Collaborative Filtering (NCF) and AutoRec. Evaluation metrics include precision@K, recall@K, MAP, NDCG, and computational time. Across all metrics, the entropy‑based method outperforms the baselines by 8–12% on average. Notably, when the user population is scaled to one million, processing time grows by less than 30%, demonstrating superior scalability. An ablation study shows that removing the Level II filter degrades precision by roughly 5%, confirming the importance of the two‑stage entropy check.
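For reference, the two ranking metrics most central to the evaluation can be computed as follows (standard definitions, not taken from the paper's own implementation):

```python
def precision_at_k(recommended: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k recommended items the user actually found relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items recovered within the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)
```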

The paper also discusses practical considerations. The choice of window size critically influences performance: overly small windows produce artificially low entropy (inflating neighbor counts), while overly large windows increase sparsity and destabilize entropy estimates. Parameter tuning via cross‑validation is therefore essential. Moreover, the current implementation focuses solely on user‑based similarity; integrating item‑level features or content metadata could further enhance recommendation quality.

Future work is outlined in three directions: (1) combining entropy‑based user similarity with content‑based or hybrid models, (2) extending the framework to streaming environments by developing incremental entropy update algorithms, and (3) exploring deeper theoretical connections between information‑theoretic measures and modern representation learning techniques.

In conclusion, the study demonstrates that an information‑theoretic similarity measure, coupled with a two‑level trust verification process, can effectively mitigate the noise and scalability problems of traditional collaborative filtering. The resulting system delivers high‑quality, personalized web recommendations in real time, even as the underlying user base grows dramatically.