LFQA-E: Carefully Benchmarking Long-form QA Evaluation
Long-Form Question Answering (LFQA) involves generating comprehensive, paragraph-level responses to open-ended questions, which poses a significant challenge for evaluation due to the richness of information and flexible response format. Existing LFQA-evaluation benchmarks often lack reference answers and are limited in size and topic coverage, reducing their reliability. To address this gap, we introduce LFQA-E, a well-constructed, multilingual, and reference-based benchmark designed to rigorously evaluate automatic metrics for LFQA. LFQA-E comprises 1618 questions and 7323 pairwise comparisons across 15 topics, drawn from diverse sources such as online queries and examination questions, thereby enabling a comprehensive assessment of evaluation metrics. We examine five categories of metrics, encompassing 17 specific methods, using LFQA-E. The results demonstrate that none of the existing automatic metrics perform comparably to human judgments, highlighting their inability to capture the dense information in long-form responses. Furthermore, we present a detailed analysis of the failure cases and the generalization capacity of these metrics, offering insights to guide the future development of LFQA evaluation methods. The benchmark and code are available at https://github.com/YuchenFan48/LFQA-E.
💡 Research Summary
Long‑Form Question Answering (LFQA) requires models to generate paragraph‑level, information‑dense answers to open‑ended questions. Existing evaluation benchmarks suffer from two major shortcomings: they either lack reference answers or are limited in scale, language, and topical diversity, which hampers reliable assessment of automatic metrics. To fill this gap, the authors introduce LFQA‑E, a multilingual, reference‑based benchmark specifically designed to rigorously evaluate LFQA metrics.
LFQA‑E contains 1,618 questions and 7,323 pairwise comparisons drawn from 15 distinct domains (engineering, medicine, law, history, etc.) and two languages (English and Chinese). Questions originate from recent online forums (Reddit/ELI5) and recently released examination materials (College Entrance Examination Simulation Questions and Postgraduate Entrance Examination Questions), ensuring minimal overlap with publicly available training data. For each question, expert‑curated reference answers are provided; these references were double‑annotated (Cohen’s κ = 0.78) and verified to cover all key points. Human responses are drawn from answers that score highly but are closely ranked, so that pairs are genuinely difficult to separate, while model responses are generated by two comparable LLMs (Llama‑3‑8B‑Instruct and GPT‑3.5‑turbo) using a “generate reasonable answers” prompt with temperature = 1.0 to encourage diversity.
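To make the agreement figure concrete: Cohen’s κ is computed from two annotators’ labels over the same items. A minimal sketch using scikit-learn, with purely illustrative label values rather than the benchmark’s actual annotations:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical parallel labels from two annotators over the same items;
# the three-way label set here is illustrative only.
annotator_a = ["answer_1", "tie", "answer_2", "answer_1", "tie"]
annotator_b = ["answer_1", "tie", "answer_1", "answer_1", "tie"]

# Cohen's kappa corrects raw agreement for the agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```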
The benchmark defines three evaluation settings: human vs. human (h‑v‑h), human vs. model (h‑v‑m), and model vs. model (m‑v‑m). Each comparison uses a three‑choice format (first answer better, second answer better, or tie) to capture the subtle differences that often separate high‑quality long‑form answers.
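A comparison item in this format can be represented as a small record. The schema below is an illustrative sketch only; the field names and values are assumptions, not the released data format:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class PairwiseComparison:
    """One pairwise LFQA comparison (illustrative schema, not the official release format)."""
    question: str
    reference: str                                  # expert-curated reference answer
    answer_a: str
    answer_b: str
    setting: Literal["h-v-h", "h-v-m", "m-v-m"]     # which sources produced the two answers
    label: Literal["a_better", "b_better", "tie"]   # three-choice human judgment

example = PairwiseComparison(
    question="Why does ice float on water?",
    reference="Ice is less dense than liquid water because ...",
    answer_a="Water expands when it freezes, so ...",
    answer_b="Hydrogen bonds arrange the molecules into an open lattice ...",
    setting="h-v-m",
    label="tie",
)
```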
Seventeen automatic metrics spanning five categories are evaluated on LFQA‑E:
- Static metrics – length, ROUGE, BERTScore (a minimal scoring sketch follows this list).
- LLM‑based metrics – Qwen2.5‑32B/72B, Llama‑3.1‑70B, GPT‑4o, DeepSeek‑V3.
- Reward‑model (RM) based metrics – Skywork‑Reward‑Llama/Gemma, RM‑R1‑Qwen2.5‑Instruct‑14B, RM‑R1‑DeepSeek‑Distilled‑Qwen‑14B.
- Large reasoning model (LRM) based metrics – o1‑mini, DeepSeek‑R1.
- Trained evaluation models – Auto‑J‑6B‑bilingual, Prometheus‑7B‑v2.0, M‑Prometheus‑14B.
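As referenced above, the static metrics score each candidate against the reference, and the higher-scoring candidate in a pair is predicted as the winner. A minimal sketch of length, ROUGE-L, and BERTScore, assuming the `rouge-score` and `bert-score` packages; this is an illustration, not the authors’ evaluation code:

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def static_scores(candidate: str, reference: str, lang: str = "en") -> dict:
    """Score a candidate answer against the reference with simple static metrics."""
    # Length: whitespace token count, a crude proxy for informativeness.
    length = len(candidate.split())

    # ROUGE-L: longest-common-subsequence overlap with the reference.
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, candidate)["rougeL"].fmeasure

    # BERTScore: semantic similarity from contextual token embeddings.
    _, _, f1 = bert_score([candidate], [reference], lang=lang, verbose=False)

    return {"length": length, "rougeL": rouge_l, "bertscore_f1": f1.item()}
```

In a pairwise setting, whichever candidate scores higher on a given metric is taken as that metric’s verdict, with near-equal scores mapped to a tie.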
Performance is measured by pairwise accuracy and F1 against the human baseline (≈77 % accuracy). The best automatic metric, Auto‑J‑6B‑bilingual, attains only 66.8 % accuracy, still more than 10 percentage points below human performance. Across the board, metrics struggle to distinguish the better answer when both candidates are high quality, and often default to a tie or arbitrarily favor one side.
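Agreement with the human labels can then be summarized with pairwise accuracy and F1. The sketch below assumes a three-class encoding and macro-averaged F1; the paper’s exact averaging scheme is not specified here, so treat this as illustrative:

```python
from sklearn.metrics import accuracy_score, f1_score

LABELS = ["a_better", "b_better", "tie"]

def evaluate_metric(predictions: list[str], gold: list[str]) -> dict:
    """Agreement between a metric's pairwise verdicts and the human judgments."""
    return {
        "accuracy": accuracy_score(gold, predictions),
        "macro_f1": f1_score(gold, predictions, labels=LABELS, average="macro"),
    }

# Toy example: the metric misjudges one of three comparisons.
print(evaluate_metric(
    predictions=["tie", "a_better", "b_better"],
    gold=["tie", "a_better", "a_better"],
))
```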
Detailed failure analysis reveals three dominant error patterns: (i) omission of core factual details, (ii) inclusion of excessively verbose or irrelevant information that dilutes the signal, and (iii) mismatches caused by lexical or structural divergence from the reference, which hurts similarity‑based measures. Language‑transfer experiments show a consistent drop when an English‑oriented metric is applied to Chinese data, highlighting the current reliance of LLM‑based evaluators on language‑specific prompting and pre‑training corpora.
The authors also conduct a contamination study, confirming that LFQA‑E’s sources (recent exams, recent forum posts) have negligible overlap with public model training sets, achieving a contamination rate 15 % lower than prior benchmarks. To explore possible improvements, they experiment with test‑time reinforcement learning (TTRL) to fine‑tune reward models, but even the enhanced versions remain far from human‑level alignment.
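The summary does not spell out how contamination was measured; a common approach is verbatim n-gram overlap between benchmark questions and candidate training corpora. The sketch below illustrates that idea and should not be read as the paper’s actual procedure:

```python
def ngram_set(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Word n-grams of a text; 13-grams are a common choice in contamination checks."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, corpus_docs: list[str], n: int = 13) -> bool:
    """Flag a question if any of its n-grams appears verbatim in a corpus document."""
    q_ngrams = ngram_set(question, n)
    return any(q_ngrams & ngram_set(doc, n) for doc in corpus_docs)

def contamination_rate(questions: list[str], corpus_docs: list[str]) -> float:
    """Fraction of benchmark questions flagged as overlapping with the corpus."""
    flagged = sum(is_contaminated(q, corpus_docs) for q in questions)
    return flagged / len(questions)
```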
In summary, LFQA‑E provides the first large‑scale, multilingual, reference‑grounded benchmark for long‑form QA evaluation. Its extensive analysis demonstrates that existing automatic metrics—whether static, LLM‑based, reward‑based, or trained—are insufficient for capturing the dense, factual content of LFQA outputs. The benchmark thus offers a valuable testbed for future research aimed at developing more reliable, human‑aligned evaluation methods for long‑form question answering.