Towards Personalized Bangla Book Recommendation: A Large-Scale Multi-Entity Book Graph Dataset

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Personalized book recommendation in Bangla literature has been constrained by the lack of structured, large-scale, and publicly available datasets. This work introduces RokomariBG, a large-scale, multi-entity heterogeneous book graph dataset designed to support research on personalized recommendation in a low-resource language setting. The dataset comprises 127,302 books, 63,723 users, 16,601 authors, 1,515 categories, 2,757 publishers, and 209,602 reviews, connected through eight relation types and organized as a comprehensive knowledge graph. To demonstrate the utility of the dataset, we provide a systematic benchmarking study on the Top-N recommendation task, evaluating a diverse set of representative recommendation models, including classical collaborative filtering methods, matrix factorization models, content-based approaches, graph neural networks, a hybrid matrix factorization model with side information, and a neural two-tower retrieval architecture. The benchmarking results highlight the importance of leveraging multi-relational structure and textual side information, with neural retrieval models achieving the strongest performance (NDCG@10 = 0.204). Overall, this work establishes a foundational benchmark and a publicly available resource for Bangla book recommendation research, enabling reproducible evaluation and future studies on recommendation in low-resource cultural domains. The dataset and code are publicly available at https://github.com/backlashblitz/Bangla-Book-Recommendation-Dataset


💡 Research Summary

The paper addresses a critical gap in Bangla‑language recommendation research by introducing RokomariBG, a large‑scale heterogeneous graph dataset harvested from Rokomari.com, Bangladesh’s leading online bookstore. The dataset contains 127,302 distinct books, 63,723 anonymized users, 16,601 authors, 2,757 publishers, 1,515 categories, and 209,602 user‑generated reviews. These six entity types (books, users, authors, categories, publishers, reviews) are linked through eight relation types, forming a knowledge graph with 403,714 nodes and over two million edges. Each entity is richly annotated: books include ISBN, page count, average rating, rating count, and textual title and summary; authors have biographical text and follower counts; publishers provide descriptive text and aggregate statistics; categories carry names, descriptions, and book counts; reviews contain rating, full text, timestamps, up‑vote/down‑vote counts, and a verified‑purchase flag. All user‑identifying information has been removed and replaced with sequential anonymous IDs to preserve privacy.
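To make the graph structure concrete, the following is a minimal sketch of a typed (heterogeneous) graph container, not the dataset's actual loader. The class name `HeteroGraph`, the relation names `written_by` and `reviewed`, and all attribute values are hypothetical; the summary states that there are eight relation types but does not name them.

```python
from collections import defaultdict


class HeteroGraph:
    """Toy typed graph: nodes grouped by type, edges grouped by relation."""

    def __init__(self):
        # node_type -> {node_id: attribute dict}
        self.nodes = defaultdict(dict)
        # (src_type, relation, dst_type) -> list of (src_id, dst_id)
        self.edges = defaultdict(list)

    def add_node(self, node_type, node_id, **attrs):
        self.nodes[node_type][node_id] = attrs

    def add_edge(self, src_type, src_id, relation, dst_type, dst_id):
        # Referential integrity: reject edges whose endpoints are unknown.
        for node_type, node_id in ((src_type, src_id), (dst_type, dst_id)):
            if node_id not in self.nodes.get(node_type, {}):
                raise KeyError(f"unknown {node_type} node: {node_id}")
        self.edges[(src_type, relation, dst_type)].append((src_id, dst_id))

    def neighbors(self, src_type, src_id, relation, dst_type):
        # Follow one relation type outward from a single source node.
        pairs = self.edges.get((src_type, relation, dst_type), [])
        return [dst for src, dst in pairs if src == src_id]


# Hypothetical records; real attribute names and values may differ.
g = HeteroGraph()
g.add_node("book", "b1", title="Example title", page_count=208)
g.add_node("author", "a1", follower_count=1500)
g.add_node("user", "u1")
g.add_edge("book", "b1", "written_by", "author", "a1")
g.add_edge("user", "u1", "reviewed", "book", "b1")
print(g.neighbors("user", "u1", "reviewed", "book"))
```

Keying edges by the full `(source type, relation, destination type)` triple keeps each relation separate, which is the layout that relation-aware graph models typically expect.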

The authors detail a systematic data‑collection pipeline that uses BeautifulSoup for HTML parsing, leverages JSON‑LD when available, normalizes Bangla numerals, removes duplicates (13.1 % of raw records), and enforces referential integrity across all edges. Metadata completeness is high, and missing values are explicitly marked as null, facilitating downstream processing.
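Two of the cleaning steps above, numeral normalization and duplicate removal, can be sketched as follows. This is an illustration under assumptions rather than the authors' pipeline: the function names, the `id` key, and the sample records are hypothetical, and the rule of keeping the first occurrence of a duplicate is a choice made here.

```python
# Bangla digits U+09E6..U+09EF mapped to their ASCII equivalents.
BANGLA_DIGITS = "০১২৩৪৫৬৭৮৯"
_TO_ASCII = str.maketrans(BANGLA_DIGITS, "0123456789")


def normalize_numerals(text):
    """Replace Bangla digits with ASCII digits, leaving other text untouched."""
    return text.translate(_TO_ASCII)


def parse_count(raw):
    """Parse a scraped count; return None (null) when the value is missing or malformed."""
    if raw is None or not raw.strip():
        return None
    digits = normalize_numerals(raw).replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


def deduplicate(records, key="id"):
    """Keep the first record seen for each key value, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] in seen:
            continue
        seen.add(record[key])
        unique.append(record)
    return unique


raw_records = [
    {"id": "b1", "pages": "১,২৩৪"},
    {"id": "b2", "pages": ""},
    {"id": "b1", "pages": "১,২৩৪"},  # duplicate of the first record
]
cleaned = [
    {"id": r["id"], "pages": parse_count(r["pages"])}
    for r in deduplicate(raw_records)
]
print(cleaned)
```

Returning `None` for an empty field mirrors the explicit null marking described above, so downstream code can tell a missing value apart from a true zero.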

To demonstrate the dataset’s utility, the paper conducts an extensive benchmark on the Top‑N recommendation task. Twelve representative models spanning five families are evaluated: (1) popularity‑based and Item‑KNN baselines; (2) collaborative filtering models, including matrix factorization (BPR‑MF) and the graph‑based LightGCN; (3) content‑based approaches that encode book summaries and review texts via TF‑IDF or pretrained embeddings; (4) hybrid models that combine side information (author follower counts, publisher statistics, etc.) with interaction data, exemplified by a factorization‑machine‑style Hybrid‑FM; and (5) a neural two‑tower retrieval architecture that independently encodes users (via their review histories) and items (via textual and structural features) and matches them using a cosine‑similarity loss. Evaluation metrics include NDCG@10, NDCG@50, and Hit‑Rate@10.
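The retrieval step of a two-tower model can be illustrated with a small sketch. It assumes the user and item towers have already produced embedding vectors; the toy vectors, the item IDs, and the function names are hypothetical, and the code shows only cosine-similarity ranking, not the training procedure used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n(user_vec, item_vecs, n=10, exclude=()):
    """Rank items by cosine similarity to the user vector, skipping excluded items."""
    scored = [
        (cosine(user_vec, vec), item_id)
        for item_id, vec in item_vecs.items()
        if item_id not in exclude
    ]
    # Highest score first; ties are broken by item id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:n]]


# Toy embeddings standing in for the outputs of the two towers.
user_embedding = [1.0, 0.0]
item_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_n(user_embedding, item_embeddings, n=2))
```

Because the two towers encode users and items independently, item embeddings can be computed once and reused for every user, which is what makes this family of models practical for retrieval over a large catalogue.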

Results reveal that models exploiting both relational structure and textual side information outperform those relying solely on interaction matrices. The neural two‑tower model achieves the highest performance (NDCG@10 = 0.204, NDCG@50 = 0.276), confirming the advantage of joint representation learning in a low‑resource setting. LightGCN, which leverages the graph of user‑item edges but ignores side text, attains moderate scores, while pure collaborative filtering (BPR‑MF) suffers from data sparsity. The analysis also highlights a pronounced rating bias (65.8 % of reviews are five‑star) and a concentration of user activity in categories such as career/education, contemporary novels, patriotism, and religion. Author popularity (follower count) shows a weak positive correlation with book ratings, suggesting that author‑level side features can aid cold‑start mitigation.
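For readers unfamiliar with the reported metrics, the sketch below computes NDCG@K and Hit‑Rate@K for a single user with binary relevance. The function names and the toy ranking are hypothetical. The summary does not specify the paper's exact evaluation protocol (for example, how held-out items or negatives are chosen), so this shows only the standard definitions, and it assumes the ranked list contains no duplicate items.

```python
import math


def hit_rate_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    # Position 0 gets discount 1/log2(2) = 1, position 1 gets 1/log2(3), and so on.
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    if ideal_hits == 0:
        return 0.0
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg


ranking = ["a", "b", "c"]
print(ndcg_at_k(ranking, {"b"}, k=3))      # relevant item at rank 2
print(hit_rate_at_k(ranking, {"b"}, k=3))
```

A benchmark score such as NDCG@10 is then the average of this per-user value over all evaluated users.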

The paper concludes with several research directions enabled by RokomariBG: (i) designing relation‑aware heterogeneous GNNs or meta‑path strategies to better capture multi‑type interactions; (ii) cross‑lingual recommendation by aligning Bangla textual embeddings with English or other languages; (iii) reinforcement‑learning‑based ranking and exploration; (iv) explainable recommendation through path extraction in the heterogeneous graph; and (v) scalable indexing and retrieval for real‑time deployment. By releasing both the dataset and the benchmarking code (https://github.com/backlashblitz/Bangla-Book-Recommendation-Dataset), the authors provide a reproducible foundation for future work on personalized recommendation in Bangla literature and, more broadly, for recommendation research in low‑resource cultural domains.

