A NoSQL Data-based Personalized Recommendation System for C2C e-Commerce
With the considerable growth of customer-to-customer (C2C) e-commerce in recent years, there is strong demand for an effective recommendation system that suggests suitable websites on which users can sell their items according to their specified needs. However, e-commerce recommendation systems are mostly designed for business-to-customer (B2C) websites, where they offer consumers products that they might like to buy. Almost no related research focuses on choosing selling sites for target items. In this paper, we introduce an approach that recommends selling websites based on the item’s description, category, and desired selling price. This approach employs NoSQL data-based machine learning techniques for building and training topic models and classification models. The trained models can then be used to rank the websites dynamically with respect to the user’s needs. Experimental results with real-world datasets from Vietnamese C2C websites demonstrate the effectiveness of our proposed method.
Research Summary
The paper addresses a practical problem in the rapidly expanding Vietnamese customer‑to‑customer (C2C) e‑commerce market: helping sellers decide which online marketplace is most suitable for listing a particular item. While most recommendation research focuses on business‑to‑customer (B2C) scenarios that suggest products to buyers, this work targets the opposite direction—recommending selling sites to sellers.
The authors propose a two‑stage recommendation engine that takes three user‑provided inputs: the free‑text description of the item, its categorical label, and the desired selling price. The system is built on a NoSQL data store—Elasticsearch—chosen for its schema‑less nature, horizontal scalability, and powerful inverted‑index search and aggregation capabilities. Data from multiple Vietnamese C2C platforms (e.g., Cho Tot, Nhat Tao, Vat Gia) are crawled, normalized, and indexed, allowing fast retrieval and statistical computation.
Textual descriptions are first pre‑processed (lower‑casing, tokenization, stop‑word removal, rare‑term filtering) and represented as TF‑IDF weighted bag‑of‑words vectors. Because raw BOW vectors are high‑dimensional and lack semantic insight, the authors apply topic modeling to obtain a compact, meaning‑oriented representation. Both probabilistic Latent Dirichlet Allocation (LDA) and non‑probabilistic Non‑Negative Matrix Factorization (NMF) are evaluated; each yields a document‑to‑topic matrix that serves as a low‑dimensional semantic embedding of the description.
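The pre-processing and TF-IDF weighting described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the stop-word list, the `min_df` threshold for rare-term filtering, the smoothed IDF formula, and the sample descriptions are all assumptions made for the example. The topic-modeling step (LDA or NMF) would then take the resulting matrix as input and is not shown here.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the paper does not publish the one it used.
STOP_WORDS = {"the", "a", "for", "and", "with", "in", "of"}


def tokenize(text):
    """Lower-case the text, split on non-word characters, drop stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs, min_df=1):
    """Return (vocabulary, TF-IDF vectors) for a list of description strings."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Rare-term filtering: keep terms that occur in at least min_df documents.
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    n_docs = len(docs)
    # One common IDF variant (log ratio plus one); an assumption, not the paper's formula.
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty descriptions
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


# Made-up sample descriptions, for illustration only.
docs = [
    "Used iPhone for sale, good battery",
    "Selling a used mountain bike",
    "iPhone case and charger",
]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` is one description in bag-of-words space. Its length equals the vocabulary size, which is why a lower-dimensional topic representation becomes attractive once the corpus grows.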
Two recommendation strategies are then defined.
- Description‑Similarity Ranking: Using the topic vectors, cosine similarity identifies items with comparable descriptions across the dataset. For each similar item set, the system aggregates site‑level statistics—either the count of matching items (quantity‑based ranking) or the average price derived from inter‑quartile price ranges (price‑based ranking). This provides a quick, interpretable ranking based on market supply and price trends.
- Ensemble‑Learning Ranking: To incorporate all three features (topic vector, categorical one‑hot, price) and to handle the fact that a single item may be listed on multiple sites (multi‑label scenario), the authors train a Random Forest ensemble. Each tree is built on a bootstrap sample (bagging) and selects a random subset of features at each split, ensuring diversity among learners. The forest’s voting mechanism aggregates predictions, producing a ranked list of candidate sites for the new item. This approach naturally accommodates overlapping labels and mitigates over‑fitting on the informal training data.
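The description-similarity strategy can be illustrated with a small stdlib-only sketch. It makes several assumptions that the summary does not confirm: the listing schema (`site`, `price`, `topics`), the `top_k` cut-off, the reading of the inter-quartile price statistic as "mean of the prices between the first and third quartiles", and ranking sites with higher average prices first. The Random Forest strategy is not covered here.

```python
import math
from collections import defaultdict
from statistics import mean, quantiles


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def iqr_mean(prices):
    """Mean of the prices between the first and third quartiles (assumed interpretation)."""
    if len(prices) < 4:
        return mean(prices)  # too few points for meaningful quartiles
    q1, _, q3 = quantiles(prices, n=4)
    inside = [p for p in prices if q1 <= p <= q3]
    return mean(inside or prices)


def rank_sites(query_vec, listings, top_k=5, by="quantity"):
    """Rank sites using the top_k listings most similar to the query topic vector."""
    similar = sorted(
        listings, key=lambda item: cosine(query_vec, item["topics"]), reverse=True
    )[:top_k]
    prices_by_site = defaultdict(list)
    for item in similar:
        prices_by_site[item["site"]].append(item["price"])
    if by == "quantity":
        scores = {site: len(prices) for site, prices in prices_by_site.items()}
    else:  # price-based ranking
        scores = {site: iqr_mean(prices) for site, prices in prices_by_site.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up listings with two-dimensional topic vectors, for illustration only.
listings = [
    {"site": "site_a", "price": 100, "topics": [0.9, 0.1]},
    {"site": "site_a", "price": 120, "topics": [0.8, 0.2]},
    {"site": "site_b", "price": 150, "topics": [0.85, 0.15]},
    {"site": "site_c", "price": 90, "topics": [0.1, 0.9]},
]
by_quantity = rank_sites([1.0, 0.0], listings, top_k=3)
by_price = rank_sites([1.0, 0.0], listings, top_k=3, by="price")
```

With this toy data the quantity-based ranking puts `site_a` first because two of the three most similar listings come from it, while the price-based ranking puts `site_b` first. In a deployed system the per-site counts and price statistics would more plausibly be computed by the data store's aggregation queries than in application code.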
The experimental evaluation uses real‑world listings from the aforementioned Vietnamese platforms. Standard recommendation metrics—precision, recall, and Mean Average Precision (MAP)—show that the combined topic‑model + Random Forest pipeline outperforms a baseline that relies solely on categorical matching or raw TF‑IDF similarity, achieving roughly a 12 % improvement in MAP. Moreover, both ranking modes meet low‑latency requirements (sub‑second response times), indicating feasibility for deployment in an online service.
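For readers unfamiliar with the metric, Mean Average Precision over ranked site lists can be computed as in the sketch below. The function names and the toy rankings are illustrative assumptions; the numbers are unrelated to the results reported in the paper.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant sites."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, site in enumerate(ranked, start=1):
        if site in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(rankings, relevants):
    """Mean of the per-query average precisions."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps) if aps else 0.0


# Toy example: two test items, each with a predicted ranking and its true sites.
rankings = [["a", "b", "c"], ["b", "a"]]
relevants = [{"a", "c"}, {"a"}]
map_score = mean_average_precision(rankings, relevants)
```

The first item scores (1/1 + 2/3) / 2 and the second scores 1/2, so `map_score` is their mean, roughly 0.667.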
In summary, the paper contributes: (1) a flexible NoSQL‑based data architecture for heterogeneous C2C data; (2) a semantic text representation via topic modeling that reduces dimensionality while preserving meaning; (3) an ensemble classification framework that solves a non‑standard multi‑label ranking problem; and (4) empirical evidence of superior recommendation quality on real C2C marketplaces. The authors suggest future extensions such as incorporating user feedback for online model updates, dynamic price prediction, and testing the approach in other cultural or regional contexts.