Do Reviews Matter for Recommendations in the Era of Large Language Models?


With the advent of large language models (LLMs), the landscape of recommender systems is undergoing a significant transformation. Traditionally, user reviews have served as a critical source of rich, contextual information for enhancing recommendation quality. However, the unprecedented ability of LLMs to understand and generate human-like text raises the question of whether explicit user reviews remain essential in the era of LLMs. In this paper, we provide a systematic investigation of the evolving role of text reviews in recommendation by comparing deep learning methods and LLM approaches. In particular, we conduct extensive experiments on eight public datasets with LLMs and evaluate their performance in zero-shot, few-shot, and fine-tuning scenarios. We further introduce RAREval, a benchmarking evaluation framework that comprehensively assesses the contribution of textual reviews to the performance of review-aware recommender systems. Our framework examines various scenarios, including the removal of some or all textual reviews, random distortion of review text, and recommendation performance under data sparsity and cold-start user settings. Our findings demonstrate that LLMs can function as effective review-aware recommendation engines, generally outperforming traditional deep learning approaches, particularly under data sparsity and cold-start conditions. In addition, removing some or all textual reviews, or randomly distorting them, does not necessarily degrade recommendation accuracy. These findings motivate a rethinking of how user preferences expressed in text reviews can be more effectively leveraged. All code and supplementary materials are available at: https://github.com/zhytk/RAREval-data-processing.


💡 Research Summary

This paper investigates the role of user reviews in recommendation systems in the era of large language models (LLMs). While traditional recommender models have relied heavily on textual reviews to enrich user and item representations, recent advances in LLMs raise the question of whether explicit reviews remain indispensable. To answer this, the authors conduct a systematic empirical study comparing state‑of‑the‑art deep learning review‑aware models (e.g., DeepCoNN, NARRE, RGCL, DIRECT) with LLM‑based approaches across eight public Amazon datasets. Three LLM usage paradigms are explored: (i) zero‑shot, where a carefully crafted prompt containing user ID, item ID, user reviews, and item reviews is fed to a frozen LLM to predict a rating; (ii) few‑shot, where a small number of demonstration examples are appended to the prompt; and (iii) fine‑tuning via a novel method called REVLoRA, which applies Low‑Rank Adaptation (LoRA) to adapt only a subset of LLM parameters using review and rating data, dramatically reducing training time and memory while preserving performance.
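The paper's exact prompt template is not reproduced in this summary. As an illustration of paradigm (i), the sketch below assembles a hypothetical zero-shot rating-prediction prompt from user and item reviews; the function name, wording, and rating scale are assumptions for illustration, not the authors' template:

```python
def build_zero_shot_prompt(user_id, item_id, user_reviews, item_reviews):
    """Assemble an illustrative rating-prediction prompt for a frozen LLM."""
    user_part = "\n".join(f"- {r}" for r in user_reviews)
    item_part = "\n".join(f"- {r}" for r in item_reviews)
    return (
        "You are a review-aware recommender system.\n"
        f"User {user_id} has written the following reviews:\n{user_part}\n"
        f"Other users wrote these reviews about item {item_id}:\n{item_part}\n"
        "Predict the rating (1-5) this user would give this item. "
        "Respond with a single number."
    )
```

For few-shot prompting (paradigm ii), a handful of solved demonstration examples would simply be prepended to this string before querying the model.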

To rigorously assess the contribution of reviews, the authors introduce RAREval, a benchmarking framework that evaluates models under five distinct scenarios: complete review removal, partial removal, random textual distortion, data sparsity (k‑core pruning), and cold‑start users (CS_k). For each scenario, standard metrics such as MAE, RMSE, and ranking‑based measures are reported.
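RAREval's concrete implementation lives in the linked repository; as a rough sketch of what the review-removal and distortion scenarios could look like, the function below perturbs a list of review strings (the mode names, `keep_frac` parameter, and word-shuffling distortion are assumptions, not the paper's exact procedure):

```python
import random

def perturb_reviews(reviews, mode, keep_frac=0.5, seed=0):
    """Apply an illustrative RAREval-style perturbation to review strings."""
    rng = random.Random(seed)
    if mode == "remove_all":
        # Complete review removal: the model sees no text at all.
        return []
    if mode == "remove_partial":
        # Partial removal: keep a random fraction of the reviews.
        k = int(len(reviews) * keep_frac)
        return rng.sample(reviews, k)
    if mode == "distort":
        # Random textual distortion: shuffle words within each review,
        # preserving vocabulary but destroying sentence structure.
        out = []
        for r in reviews:
            words = r.split()
            rng.shuffle(words)
            out.append(" ".join(words))
        return out
    raise ValueError(f"unknown mode: {mode}")
```

The sparsity (k-core) and cold-start (CS_k) scenarios operate on the interaction graph rather than the text, so they are not shown here.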

Experimental results reveal several key findings. First, LLM‑based models consistently outperform the deep learning baselines in terms of lower MAE and RMSE. Second, removing all reviews or keeping only a fraction of them does not lead to a systematic degradation of accuracy; in some cases, random distortion even mitigates over‑fitting. Third, the performance gap widens as data become sparse (e.g., k‑core ≤ 5), indicating that LLMs can better generalize from limited textual signals. Fourth, in cold‑start settings, REVLoRA fine‑tuning improves MAE by roughly 10% compared with zero‑shot or few‑shot LLMs. Fifth, few‑shot prompting with as few as five to ten demonstrations yields modest gains over pure zero‑shot, confirming that LLMs can leverage a small amount of task‑specific context.
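For reference, the two error metrics quoted throughout these results are standard point-prediction errors over predicted ratings; minimal implementations:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average |true - predicted| rating gap."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large rating misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

Lower is better for both; RMSE is the more sensitive of the two to occasional large errors, which is why papers typically report both.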

The analysis suggests that the strength of LLMs lies not in the raw presence of reviews but in their ability to extract latent semantic patterns from both reviews and auxiliary metadata. Consequently, the conventional assumption that reviews are a critical bottleneck for recommendation quality is challenged. The paper also discusses practical considerations: LLM inference remains computationally expensive, prompt design is model‑specific, and scaling to real‑time production requires further optimization. Future work is outlined, including multimodal extensions (incorporating images or audio), automated prompt optimization, and lightweight distillation techniques for deployment.

In summary, this study provides the first comprehensive evaluation of review contribution within LLM‑driven recommender systems, introduces a versatile evaluation suite (RAREval), and demonstrates that LLMs—especially when fine‑tuned with REVLoRA—offer a robust alternative to traditional review‑aware models, even when textual reviews are partially missing or noisy.

