A Hybrid Supervised-LLM Pipeline for Actionable Suggestion Mining in Unstructured Customer Reviews


Extracting actionable suggestions from customer reviews is essential for operational decision-making, yet these directives are often embedded within mixed-intent, unstructured text. Existing approaches either classify suggestion-bearing sentences or generate high-level summaries, but rarely isolate the precise improvement instructions businesses need. We evaluate a hybrid pipeline that combines a high-recall RoBERTa classifier, trained with a precision-recall surrogate to reduce unrecoverable false negatives, with a controlled, instruction-tuned LLM for suggestion extraction, categorization, clustering, and summarization. Across real-world hospitality and food datasets, the hybrid system outperforms prompt-only, rule-based, and classifier-only baselines in extraction accuracy and cluster coherence. Human evaluations further confirm that the resulting suggestions and summaries are clear, faithful, and interpretable. Overall, our results show that hybrid reasoning architectures achieve meaningful improvements in fine-grained actionable suggestion mining while highlighting challenges in domain adaptation and efficient local deployment.


💡 Research Summary

The paper tackles the problem of extracting concrete, actionable suggestions from unstructured customer reviews—a task that is crucial for operational decision‑making but remains under‑explored. Existing work typically focuses on binary detection of suggestion‑bearing sentences or on high‑level opinion summarization, neither of which isolates the precise improvement directives businesses need. To bridge this gap, the authors propose a two‑stage hybrid pipeline that combines a high‑recall supervised classifier with a controlled, instruction‑tuned large language model (LLM) to perform extraction, categorization, clustering, and summarization in a fully end‑to‑end fashion.

Stage 1 – High‑Recall Classification
A RoBERTa‑base model is fine‑tuned on a proprietary dataset of 1,110 reviews (440 positive, 670 negative). The training objective augments standard cross‑entropy with a differentiable surrogate for the precision‑recall curve, encouraging the model to maximize recall while keeping calibrated probabilities. This “precision‑recall surrogate loss” yields a recall of 0.9221 and a precision of 0.9039, outperforming lexical, rule‑based, and prompt‑only LLM baselines. The authors emphasize that false negatives are unrecoverable downstream, making high recall a non‑negotiable requirement.
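The paper does not publish the exact form of its precision-recall surrogate loss; the sketch below illustrates the general idea under stated assumptions: standard cross-entropy is augmented with a differentiable "soft recall" term computed from probabilities rather than thresholded predictions, and a hypothetical `recall_weight` knob (not from the paper) tilts training toward recall.

```python
import math

def soft_pr_loss(probs, labels, recall_weight=2.0, eps=1e-8):
    """Cross-entropy plus a differentiable (soft) recall penalty.

    probs: predicted positive-class probabilities; labels: 0/1 gold labels.
    Soft true-positive counts replace hard thresholding, so the recall
    term stays differentiable. This is an illustrative sketch, not the
    paper's exact objective.
    """
    n = len(labels)
    # Mean binary cross-entropy over the batch.
    ce = -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    ) / n
    # Soft recall: sum of probabilities over true positives / #positives.
    soft_tp = sum(p for p, y in zip(probs, labels) if y == 1)
    positives = sum(labels)
    soft_recall = soft_tp / (positives + eps)
    # Penalize missed positives more heavily than the plain CE would.
    return ce + recall_weight * (1.0 - soft_recall)
```

Because the recall penalty is computed from raw probabilities, gradients flow through it directly, which is what makes the surrogate usable inside ordinary fine-tuning.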

Stage 2 – LLM‑Driven Structured Processing
For downstream processing, the pipeline employs an instruction‑tuned, quantized Ollama Gemma‑3 27B model. Prompt templates guide the model through five sub‑tasks: (1) isolate explicit suggestions from each review, (2) rewrite them into concise, context‑complete statements, (3) assign each to a canonical category (e.g., menu, service, facilities), (4) cluster semantically similar suggestions within each category, dynamically determining the number of clusters, and (5) generate short, coherent summaries for each cluster. The rewriting step normalizes phrasing, reduces lexical variance, and improves downstream clustering stability.
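The paper's exact prompt wording is not reproduced in the summary, so the templates below are purely illustrative. They show how the five sub-tasks can be driven from a small template table, with each prompt filled per review, suggestion, or cluster before being sent to the model:

```python
# Hypothetical prompt templates for the five sub-tasks; the paper's
# actual prompts are not published, so treat these as placeholders.
TEMPLATES = {
    "extract": "List every explicit improvement suggestion in this review:\n{review}",
    "rewrite": "Rewrite this suggestion as one concise, self-contained statement:\n{suggestion}",
    "categorize": "Assign this suggestion to exactly one category ({categories}):\n{suggestion}",
    "cluster": "Group these suggestions by shared intent; choose the number of groups yourself:\n{suggestions}",
    "summarize": "Write a one-sentence summary of this cluster of suggestions:\n{suggestions}",
}

def build_prompt(task, **fields):
    """Fill the template for one sub-task; unknown task names raise KeyError."""
    return TEMPLATES[task].format(**fields)
```

Keeping the templates in one table makes the controlled-generation contract explicit: each stage sees only the fields it needs, which is part of what keeps the LLM's output on-task.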

Evaluation
The authors evaluate three research questions across hospitality and food (ice‑cream) domains, and additionally test cross‑domain robustness on four other industries (retail, travel, healthcare, e‑commerce).

RQ1 – Classifier Performance: Compared against lexical heuristics, rule‑based patterns, and a prompt‑only LLM classifier, the RoBERTa model achieves the best balance of precision and recall.
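For context, a lexical baseline of the kind the classifier is compared against can be as simple as matching suggestion cue phrases. The cue list below is illustrative (not the paper's), which is exactly why such heuristics trade recall for brittleness:

```python
import re

# Minimal lexical baseline: flag a sentence as suggestion-bearing if it
# contains a common suggestion cue. The cue inventory is illustrative only.
SUGGESTION_CUES = re.compile(
    r"\b(should|could|would be (nice|great|better)|please|needs? to|"
    r"recommend|suggest|wish|hope)\b",
    re.IGNORECASE,
)

def lexical_is_suggestion(sentence):
    """True if the sentence matches any suggestion cue phrase."""
    return bool(SUGGESTION_CUES.search(sentence))
```

Cue lists miss implicit suggestions ("the queue at check-in was endless") and fire on non-directive uses of the cue words, which is the gap the trained classifier closes.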

RQ2 – Effect of the PR Surrogate: Removing the surrogate loss reduces recall by roughly 3.5 points (from 0.9221 to 0.8873) with negligible change in precision; the difference is statistically significant (p < 0.01).

RQ3 – End‑to‑End Pipeline: The full hybrid system is benchmarked against three end‑to‑end baselines: (a) prompt‑only LLM (extraction + rewriting only), (b) classifier‑only pipeline (classifier + rule‑based extraction), and (c) rule‑based end‑to‑end. Extraction quality is measured with semantic metrics (BERTScore, BLEURT) because the pipeline outputs rewritten suggestions rather than raw spans. The hybrid pipeline attains BERTScore 0.92 and BLEURT 0.89, surpassing the prompt‑only LLM (0.87/0.84) and a T5‑base span model (0.78/0.76). Exact and fuzzy F1 scores are lower for the hybrid system, which the authors explain as a consequence of rewriting rather than copying gold spans.

Cluster coherence is assessed with Adjusted Mutual Information (AMI). The hybrid pipeline achieves AMI 0.67, markedly higher than the prompt‑only LLM (0.49) and classifier‑only baseline (0.38). Human evaluators rate the hybrid outputs as clearer, more faithful, and more interpretable than baselines.
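For readers unfamiliar with these clustering metrics, the sketch below implements plain normalized mutual information (NMI) in pure Python. The paper reports the *adjusted* variant (AMI), which additionally subtracts the expected MI of random labelings (available as `sklearn.metrics.adjusted_mutual_info_score`); this simpler cousin conveys the core idea:

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two flat clusterings.

    Pure-Python sketch for illustration; the paper's AMI additionally
    corrects for chance agreement.
    """
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from the joint and marginal label counts.
    mi = sum(
        (nij / n) * math.log((nij * n) / (ca[a] * cb[b]))
        for (a, b), nij in joint.items()
    )
    # Entropy of one labeling.
    entropy = lambda c: -sum((m / n) * math.log(m / n) for m in c.values())
    denom = math.sqrt(entropy(ca) * entropy(cb)) or 1.0
    return mi / denom
```

NMI is 1.0 when the two clusterings agree up to relabeling and 0.0 when they are independent, so the reported AMI gap (0.67 vs. 0.49/0.38) reflects substantially better agreement with gold clusters.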

Efficiency and Deployment
The quantized Gemma‑3 model runs on a modest GPU (≤8 GB) and processes a full review (including extraction, clustering, and summarization) in under one second, a 5× speedup over larger, non‑quantized LLMs. This makes the approach viable for large‑scale, real‑time deployment in production environments.

Insights and Future Directions
The study demonstrates that pairing a recall‑oriented classifier with a tightly controlled LLM mitigates the typical weaknesses of each component: the classifier prevents missed suggestions, while the LLM supplies robust extraction, semantic normalization, and generative summarization without hallucination. Cross‑domain experiments show that the classifier generalizes well, and the LLM adapts via prompt engineering alone. The authors suggest extending the framework to multimodal reviews (e.g., images), incorporating user profile data for personalized prioritization, and exploring online learning to continuously adapt to evolving vocabularies.

In summary, the paper provides a comprehensive, empirically validated hybrid architecture that significantly improves actionable suggestion mining from noisy, unstructured reviews, delivering higher recall, better semantic extraction, more coherent clustering, and concise summaries—all while remaining computationally efficient for real‑world deployment.

