FactSim: Fact-Checking for Opinion Summarization


We explore the need for more comprehensive and precise evaluation techniques for generative artificial intelligence (GenAI) in text summarization tasks, specifically in the area of opinion summarization. Traditional methods, which use automated metrics to compare machine-generated summaries against a collection of opinion pieces, e.g. product reviews, have shown limitations since the paradigm shift introduced by large language models (LLMs). This paper addresses these shortcomings by proposing a novel, fully automated methodology for assessing the factual consistency of such summaries. The method measures the similarity between the claims in a given summary and those in the original reviews, quantifying both the coverage and the consistency of the generated summary. To do so, we rely on a simple approach that extracts factual claims from texts, compares them, and aggregates the comparisons into a suitable score. We demonstrate that the proposed metric assigns higher scores to similar claims, regardless of whether a claim is negated, paraphrased, or expanded, and that the score correlates more strongly with human judgment than state-of-the-art metrics.


💡 Research Summary

FactSim introduces a novel, fully automated metric for evaluating the factual consistency of opinion summaries, particularly those generated from collections of product reviews. The authors begin by highlighting the inadequacy of traditional n‑gram based metrics such as ROUGE, BLEU, and even neural‑based scores like BERTScore when applied to summaries produced by large language models (LLMs). In opinion summarization, the core challenge is not merely reproducing exact wording but faithfully representing the consensus of multiple subjective sources. Existing reference‑based evaluations often penalize paraphrased or abstracted summaries, while reference‑free self‑checking approaches lack explainability and can inherit the biases of the underlying LLM.

To address these gaps, FactSim proceeds in three stages. First, it extracts “fact tuples” from both the source reviews and the candidate summary. A tuple is defined as a (subject, description) pair where the subject is a single‑word identifier of the product or a product feature, and the description is a single‑word attribute. Extraction is performed via prompt‑engineered LLM calls (e.g., GPT‑4), which are instructed to paraphrase multi‑word expressions, resolve negations (e.g., “not fast at all” → “slow”), and output a normalized list of tuples. This step converts diverse linguistic realizations of the same claim into a uniform representation.
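The extraction step can be sketched as a prompt plus a parser for the model's output. The prompt wording and the one-tuple-per-line output format below are illustrative assumptions, not the paper's actual prompt:

```python
# Sketch of the fact-tuple extraction step. The prompt text and the
# "subject, description" line format are assumptions for illustration;
# the paper performs extraction via prompt-engineered LLM calls (e.g., GPT-4).

EXTRACTION_PROMPT = (
    "Extract (subject, description) fact tuples from the text below. "
    "Use a single word for each field, paraphrase multi-word expressions, "
    "and resolve negations (e.g., 'not fast at all' -> 'slow'). "
    "Output one tuple per line as: subject, description\n\nText: {text}"
)

def parse_tuples(llm_output: str) -> list[tuple[str, str]]:
    """Parse a line-based model response into normalized (subject, description) pairs."""
    tuples = []
    for line in llm_output.splitlines():
        parts = [p.strip().lower() for p in line.split(",")]
        if len(parts) == 2 and all(parts):
            tuples.append((parts[0], parts[1]))
    return tuples

# Hypothetical model response for "The battery is not fast at all."
print(parse_tuples("battery, slow"))  # [('battery', 'slow')]
```

Lines that do not match the expected format are silently dropped, which keeps the parser robust to chatty model output.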

Second, each tuple is embedded into a dense vector space using a pre‑trained encoder such as Sentence‑BERT. Cosine similarity between vectors serves as the semantic similarity function. The authors define two complementary scores:
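Assuming each tuple has already been encoded into a vector (e.g., with Sentence-BERT), the similarity function is plain cosine similarity; a dependency-free sketch:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # guard against degenerate zero vectors
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```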

  • Coverage (f_V) – For every tuple extracted from the reviews (N of them), the maximum similarity to any tuple in the summary (M of them) is computed, and the average over all N review tuples is taken. This measures how well the summary captures the facts present in the source set; more frequent review claims are automatically weighted higher because they appear multiple times in the concatenated list of review tuples.

  • Consistency (f_N) – For each tuple in the summary, the maximum similarity to any review tuple is computed, and the average over the M summary tuples is taken. This quantifies whether the summary introduces claims that are not supported by any source review (i.e., hallucinations).
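Given a precomputed N×M similarity matrix between review tuples (rows) and summary tuples (columns), the two scores above reduce to max-then-mean operations over rows and columns; a minimal sketch (function names are mine, not the paper's):

```python
def coverage(sim: list[list[float]]) -> float:
    """f_V: average over the N review tuples (rows) of the best match in the summary."""
    return sum(max(row) for row in sim) / len(sim)

def consistency(sim: list[list[float]]) -> float:
    """f_N: average over the M summary tuples (columns) of the best match in the reviews."""
    cols = list(zip(*sim))
    return sum(max(col) for col in cols) / len(cols)

# 3 review tuples x 2 summary tuples (toy similarity values)
sim = [[0.9, 0.1],
       [0.2, 0.8],
       [0.7, 0.3]]
print(coverage(sim))     # ≈ 0.8
print(consistency(sim))  # ≈ 0.85
```

A summary tuple with a low column maximum is an unsupported claim, so low f_N flags likely hallucinations, while low f_V flags missing content.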

The final FactSim score is the harmonic mean of f_V and f_N, yielding a value in [0, 1].
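The final combination is a standard harmonic mean, analogous to an F1 score over coverage and consistency; a sketch:

```python
def factsim_score(f_v: float, f_n: float) -> float:
    """Harmonic mean of coverage (f_V) and consistency (f_N)."""
    if f_v + f_n == 0.0:
        return 0.0  # avoid division by zero when both scores vanish
    return 2.0 * f_v * f_n / (f_v + f_n)

print(factsim_score(0.8, 0.85))  # ≈ 0.824
```

As with F1, the harmonic mean punishes imbalance: a summary with perfect consistency but poor coverage (or vice versa) still receives a low overall score.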

