SUMFORU: An LLM-Based Review Summarization Framework for Personalized Purchase Decision Support
Online product reviews contain rich but noisy signals that overwhelm users and hinder effective decision-making. Existing LLM-based summarizers remain generic and fail to account for individual preferences, limiting their practical utility. We propose SUMFORU, a steerable review summarization framework that aligns outputs with explicit user personas to support personalized purchase decisions. Our approach integrates a high-quality data pipeline built from the Amazon 2023 Review Dataset with a two-stage alignment procedure: (1) persona-aware Supervised Fine-Tuning (SFT) via asymmetric knowledge distillation, and (2) Reinforcement Learning with AI Feedback (RLAIF) using a preference estimator to capture fine-grained, persona-relevant signals. We evaluate the model across rule-based, LLM-based, and human-centered metrics, demonstrating gains in consistency, grounding, and preference alignment. Our framework achieves the highest performance across all evaluation settings and generalizes effectively to unseen product categories. Our results highlight the promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.
💡 Research Summary
The paper introduces SUMFORU, a steerable review‑summarization framework designed to produce personalized purchase‑decision support for e‑commerce shoppers. The authors observe that existing large‑language‑model (LLM) summarizers generate generic, one‑size‑fits‑all outputs that ignore individual user preferences, limiting their practical usefulness. To address this gap, they formulate the problem as a human‑centered alignment task and propose a “steerable pluralistic alignment” paradigm that accepts an explicit user persona (or query) and synthesizes a customized summary together with a suitability score (1‑10) indicating how well the product matches the persona.
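The persona-in, summary-plus-score-out contract described above can be illustrated with a minimal hypothetical schema (the names below are ours for illustration, not the paper's actual API):

```python
from dataclasses import dataclass

@dataclass
class SummaryRequest:
    persona: str        # explicit user persona, e.g. "budget-conscious hiker who values durability"
    reviews: list[str]  # the reviews available at purchase time

@dataclass
class PersonalizedSummary:
    summary: str        # persona-conditioned review summary
    suitability: int    # 1-10: how well the product matches the persona

def validate(result: PersonalizedSummary) -> PersonalizedSummary:
    """Enforce the paper's 1-10 suitability scale on model output."""
    if not 1 <= result.suitability <= 10:
        raise ValueError("suitability must be on the 1-10 scale")
    return result
```

This makes explicit that, unlike generic summarizers, the model is conditioned on the persona and must commit to a scalar suitability judgment alongside the text.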
Data pipeline
The system is built on the Amazon 2023 Review Dataset (≈635 K reviews, 107 K products, 585 K users). The authors filter for “active users” (≥3 historical reviews) and “golden products” (≥20 reviews, at least one from an active user). For each active user–product pair they retain only reviews posted before the user’s own review timestamp, thereby simulating the information available at purchase time. They clean the text (remove <5‑word reviews, discard high‑rating reviews with zero helpful votes), enforce a minimum of 15 and a maximum of 50 reviews per pair (the latter via stratified sampling preserving the original rating distribution). A concise persona description for each active user is automatically generated using the Qwen‑3‑30B model based on the user’s past reviews. The final dataset contains 3 000 training pairs and 1 000 test pairs across multiple product categories.
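The temporal and quality filters above can be sketched as follows. This is a minimal illustration under our own assumptions: the field names, the threshold for "high-rating" (≥4 stars), and the sampling routine are ours, not the authors' code.

```python
import random
from collections import defaultdict

MIN_REVIEWS_PER_PAIR = 15   # thresholds from the paper
MAX_REVIEWS_PER_PAIR = 50
MIN_WORDS = 5
HIGH_RATING = 4             # assumed cutoff for "high-rating" reviews

def build_pair(reviews, product_id, user_review_ts):
    """Build the review context for one active user-product pair,
    keeping only reviews posted before the user's own review
    (simulating the information available at purchase time)."""
    context = [
        r for r in reviews
        if r["product_id"] == product_id
        and r["timestamp"] < user_review_ts                     # temporal filter
        and len(r["text"].split()) >= MIN_WORDS                 # drop very short reviews
        and not (r["rating"] >= HIGH_RATING and r["helpful_votes"] == 0)
    ]
    if len(context) < MIN_REVIEWS_PER_PAIR:
        return None  # too little purchase-time evidence; pair is discarded
    if len(context) > MAX_REVIEWS_PER_PAIR:
        context = stratified_sample(context, MAX_REVIEWS_PER_PAIR)
    return context

def stratified_sample(context, k):
    """Downsample to k reviews while preserving the rating distribution."""
    by_rating = defaultdict(list)
    for r in context:
        by_rating[r["rating"]].append(r)
    sampled = []
    for group in by_rating.values():
        share = round(k * len(group) / len(context))
        sampled.extend(random.sample(group, min(share, len(group))))
    return sampled[:k]
```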
Two‑stage alignment
- Persona‑aware Supervised Fine‑Tuning (SFT) – The student model (Qwen‑3‑4B‑Instruct‑2507) is fine‑tuned via asymmetric knowledge distillation. A larger teacher model (Qwen‑3‑235B‑A22B‑Instruct‑2507) generates “Golden Summaries” for each persona–review set using temperature 0.7. These synthetic summaries serve as supervision; human‑written references are deliberately excluded to avoid copying subjective writing styles. The SFT loss is standard cross‑entropy, providing a stable initialization that captures the teacher’s conditional generation behavior while embedding implicit persona signals (the teacher also receives the raw reviews as context).
- Reinforcement Learning with AI Feedback (RLAIF) – To further align outputs with persona preferences, the SFT‑initialized model is fine‑tuned with PPO. Multiple candidate summaries are sampled at a higher temperature to encourage diversity. An AI Preference Estimator (the same teacher model) evaluates candidates pairwise, assigning reward scores that reflect three dimensions: consistency with the input, factual grounding to the reviews, and alignment with the persona. LoRA adapters are used during PPO to keep training efficient. This stage improves fine‑grained persona relevance that SFT alone cannot achieve.
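The reward step of the RLAIF stage can be sketched as follows: candidates sampled from the policy are compared pairwise by the AI judge, and each candidate's win rate becomes its scalar PPO reward. This is a simplified sketch; the paper's actual prompting and reward shaping may differ, and `judge` is a hypothetical callable standing in for the teacher-model preference estimator.

```python
import itertools

def pairwise_rewards(candidates, judge):
    """Convert pairwise AI-judge preferences into scalar rewards.

    `judge(a, b)` is assumed to return 0 or 1 (the index of the preferred
    summary), conditioned on the persona, the reviews, and the paper's three
    criteria: consistency, grounding, and persona alignment. Each candidate's
    reward is its win rate over all pairings, suitable as a PPO reward signal.
    """
    wins = [0] * len(candidates)
    for i, j in itertools.combinations(range(len(candidates)), 2):
        winner = judge(candidates[i], candidates[j])
        wins[i if winner == 0 else j] += 1
    n_pairs = len(candidates) - 1  # each candidate appears in n-1 pairings
    return [w / n_pairs for w in wins]
```

Using win rates rather than raw scores keeps the reward scale bounded in [0, 1] regardless of how many candidates are sampled.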
Evaluation
Three evaluation families are employed:
- Rule‑based metrics – BertScore‑based Recall/Precision for reference reviews, input reviews, and persona information (RefBS‑R, RevBS‑P, PersBS‑R). Suitability score accuracy is measured by MAE, Spearman correlation, and Within‑1 accuracy.
- LLM‑based metrics – Two large judges (Qwen‑3‑235B and GPT‑OSS‑120B) score each summary on Consistency, Grounding, and Persona Alignment, producing an overall composite score.
- Human metrics – A small user study with three annotators evaluates 10 cases. Annotators rank the four system outputs (Base, IPT, SFT, RL) for usefulness, then rate the top‑ranked summary on a 5‑point Likert scale for Persona Alignment, Decision Utility, and Factual Trustworthiness.
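The three suitability-score metrics above (MAE, Spearman correlation, Within‑1 accuracy) can be computed with a short stdlib-only sketch; this is our implementation for illustration, not the authors' evaluation code:

```python
def _ranks(xs):
    """Average ranks (ties share their mean rank), as used by Spearman."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def suitability_metrics(pred, gold):
    """MAE, Spearman rank correlation, and Within-1 accuracy
    for predicted vs. reference suitability scores (1-10)."""
    n = len(pred)
    mae = sum(abs(p - g) for p, g in zip(pred, gold)) / n
    within1 = sum(abs(p - g) <= 1 for p, g in zip(pred, gold)) / n
    rp, rg = _ranks(pred), _ranks(gold)
    mp, mg = sum(rp) / n, sum(rg) / n
    cov = sum((a - mp) * (b - mg) for a, b in zip(rp, rg))
    sd = (sum((a - mp) ** 2 for a in rp) ** 0.5) * (sum((b - mg) ** 2 for b in rg) ** 0.5)
    return {"mae": mae, "spearman": cov / sd, "within1": within1}
```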
Results
Across all metrics, the RL‑enhanced model (the “RL” condition) outperforms the baseline, instruction‑prompt‑tuned (IPT), and SFT‑only variants. Rule‑based suitability scores improve from MAE 1.236 (Base) to 1.078 (RL), Spearman from 0.423 to 0.563, and Within‑1 accuracy from 0.701 to 0.764. LLM judges consistently rank RL highest (overall scores 0.892 for Qwen, 0.620 for GPT). Human evaluation shows an 80 % win rate for RL, mean rank 1.2, and high Likert scores (Persona Alignment 4.875, Decision Utility 4.917, Trustworthiness 4.792). Kendall’s W = 0.787 indicates strong inter‑annotator agreement.
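Kendall's W, used above to quantify inter‑annotator agreement, can be computed directly from the annotators' rank lists; the standard formula (shown here without the tie-correction term) is:

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for m annotators ranking n items.

    `rankings` is a list of m rank lists (rank 1 = best). W = 1 means perfect
    agreement, W = 0 means none; the paper reports W = 0.787 for three
    annotators ranking the four system outputs.
    """
    m, n = len(rankings), len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]  # rank sum per item
    mean = sum(totals) / n
    s = sum((t - mean) ** 2 for t in totals)  # spread of the rank sums
    return 12 * s / (m ** 2 * (n ** 3 - n))
```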
Insights and contributions
The paper’s key contributions are: (1) a high‑quality, persona‑conditioned dataset derived from real e‑commerce reviews; (2) a two‑stage alignment strategy that combines asymmetric knowledge distillation with AI‑feedback‑driven reinforcement learning; (3) a comprehensive multi‑dimensional evaluation framework that validates both objective performance and subjective human preference. The authors demonstrate that SFT provides a stable foundation, while RLAIF is essential for capturing nuanced, persona‑specific signals.
Limitations and future work
The human study is limited in scale, which may affect statistical generalizability. Persona generation relies on an LLM, potentially propagating model biases into the downstream summarizer. Future directions include handling multiple simultaneous personas, incorporating real‑time user feedback loops, extending the framework to other domains (travel, healthcare), and developing bias‑mitigation techniques for persona creation.
Conclusion
SUMFORU showcases that a carefully engineered data pipeline, coupled with a two‑stage alignment (SFT → RLAIF), can produce personalized, factually grounded review summaries and actionable suitability scores. The system consistently outperforms generic baselines across rule‑based, LLM‑based, and human evaluations, and generalizes to unseen product categories. This work paves the way for next‑generation decision‑support tools that respect individual user preferences while maintaining factual reliability.