Assessing LLM Reliability on Temporally Recent Open-Domain Questions

Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv source.

Large Language Models (LLMs) are increasingly deployed for open-domain question answering, yet their alignment with human perspectives on temporally recent information remains underexplored. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a benchmark dataset of 15,000 recent Reddit questions from September 2025 paired with community-derived reference answers. We investigate how four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) respond to these questions, evaluating alignment using lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). Our central finding is a striking semantic-lexical paradox: all models achieve over 99% cosine similarity with references despite less than 8% BLEU-1 overlap, a 90+ percentage point gap indicating that models preserve meaning through extensive paraphrasing rather than lexical reproduction. MoverScore (51-53%) confirms this pattern, occupying an intermediate position that reflects the optimal transport cost of semantic alignment. Furthermore, model scale does not predict performance: Mistral-7B (7B parameters) outperforms GPT-OSS-20B (20B parameters) across all metrics. NLI analysis reveals that contradiction rates remain below 7%, suggesting models rarely generate content that directly conflicts with human consensus. These findings challenge the reliability of lexical metrics for evaluating abstractive generation and argue for multi-dimensional evaluation frameworks that capture semantic fidelity beyond surface-level text matching. The RECOM dataset is publicly available at https://anonymous.4open.science/r/recom-D4B0


💡 Research Summary

This paper introduces RECOM (Reddit Evaluation for Correspondence of Models), a benchmark designed to assess how well open‑source large language models (LLMs) handle temporally recent, open‑domain questions. The authors harvested 132,728 posts from Reddit’s r/AskReddit during September 2025, filtered for high engagement, and randomly sampled 15,000 questions that are likely to involve information emerging after the models’ training cut‑off. For each question, community answers were aggregated and summarized using Llama‑3.1‑8B‑Instruct, producing a consensus‑style reference answer.
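The collection pipeline described above can be sketched as follows. Note that `min_comments` is a hypothetical engagement threshold, since the summary does not state the paper's exact cutoff:

```python
import random

def build_benchmark(posts, min_comments=10, k=15_000, seed=0):
    """Filter Reddit posts for engagement, then sample k questions.

    min_comments is a hypothetical threshold; the summary does not
    state the paper's actual engagement cutoff.
    """
    eligible = [p for p in posts if p["num_comments"] >= min_comments]
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    return rng.sample(eligible, min(k, len(eligible)))
```

Any deterministic sampling scheme would do; the fixed seed simply makes the benchmark reproducible across runs.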

Four publicly available LLMs—Llama‑3.1‑8B, Mistral‑7B, Gemma‑2‑9B, and GPT‑OSS‑20B—were prompted with a strict “answer‑only, ≤50‑word” instruction to generate concise responses. After removing refusal outputs (e.g., “I’m an AI”), 11,515 question‑response pairs per model remained for evaluation.
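The refusal filter can be sketched with a simple pattern match. The patterns below are illustrative guesses; the paper's actual filtering rules are not specified in this summary:

```python
import re

# Illustrative refusal openings only; the paper's actual filter is unspecified.
REFUSAL_PATTERNS = re.compile(
    r"^(i['’]?m (just )?an ai|as an ai( language model)?|i cannot answer)",
    re.IGNORECASE,
)

def is_refusal(response: str) -> bool:
    """Return True if a model response opens with a refusal phrase."""
    return bool(REFUSAL_PATTERNS.match(response.strip()))
```

Filtering on response openings keeps legitimate answers that merely mention AI in passing, at the cost of missing refusals phrased mid-response.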

The evaluation framework spans three complementary dimensions: (1) lexical overlap (BLEU‑1‑4, ROUGE‑1/2/L), (2) semantic similarity (BERTScore, MoverScore, and cosine similarity of RoBERTa‑large embeddings), and (3) logical consistency via natural‑language inference (NLI) classification into entailment, contradiction, or neutral. Pairwise statistical significance was tested with Wilcoxon signed‑rank, and effect sizes reported as Cohen’s d.
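Of the lexical metrics, BLEU‑1 reduces to clipped unigram precision times a brevity penalty. The following is a minimal sketch with naive whitespace tokenization and a single reference; production scorers such as sacreBLEU differ in tokenization and smoothing:

```python
from collections import Counter
import math

def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision times brevity penalty (single reference)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate unigram is credited at most as often as it occurs in the reference
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty punishes candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

Because credit requires exact token matches, any paraphrase, however faithful, scores near zero on this metric.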

Results reveal a striking “semantic‑lexical paradox.” Lexical metrics are uniformly low: BLEU‑1 ranges from 0.57% (Gemma‑2‑9B) to 7.58% (Llama‑3.1‑8B), BLEU‑4 stays below 0.1%, and ROUGE‑1 never exceeds 19%. In stark contrast, cosine similarity between model outputs and references exceeds 99% for all models, and BERTScore F1 clusters between 83.3% and 84.8%. MoverScore occupies an intermediate band (≈51‑53%). Thus, models preserve meaning almost perfectly while rephrasing it extensively.
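The cosine measure underlying the paradox is straightforward. In the sketch below, the two short vectors are hypothetical stand-ins for RoBERTa‑large sentence embeddings (not actual model output); the point is that lexically disjoint paraphrases can still map to nearby points in embedding space:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical 4-d embeddings of a reference answer and a paraphrase
# with zero word overlap; contextual encoders place them close together.
ref_vec = [0.61, 0.33, 0.70, 0.12]
para_vec = [0.59, 0.35, 0.69, 0.15]
```

A BLEU‑1 of ~0% and a cosine above 0.99 are therefore not contradictory: they measure surface overlap and direction in embedding space, respectively.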

Model scale does not predict alignment. The 7‑billion‑parameter Mistral‑7B consistently outperforms the 20‑billion‑parameter GPT‑OSS‑20B across lexical, semantic, and NLI metrics, challenging the assumption that larger models are automatically better at recent‑information QA.

NLI analysis shows low contradiction rates (<7%) and high entailment rates (>73%), indicating that generated answers rarely conflict with the community consensus and usually convey logically consistent information.
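The rate computation behind these percentages is a simple aggregation over per-pair verdicts. The labels in the sketch below are illustrative; in the actual pipeline they would come from an NLI classifier (e.g., a model fine-tuned on MNLI):

```python
from collections import Counter

def nli_rates(labels):
    """Fraction of question-answer pairs per NLI verdict.

    labels: one of "entailment", "neutral", "contradiction" per pair,
    as assigned by an NLI classifier (illustrative input here).
    """
    counts = Counter(labels)
    total = len(labels)
    return {k: counts[k] / total for k in ("entailment", "neutral", "contradiction")}
```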

Statistical testing confirms that all observed differences are significant (p < 0.001). Effect sizes are large for lexical measures (d > 1.4) and medium for semantic measures (d ≈ 0.9), underscoring the independence of these evaluation axes.
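For reference, Cohen's d in its common pooled-standard-deviation form is sketched below; for paired per-question scores the paper may instead compute d on the score differences, which this summary does not specify:

```python
import math
import statistics

def cohens_d(x, y):
    """Cohen's d via pooled standard deviation (one common formulation)."""
    nx, ny = len(x), len(y)
    pooled_var = (
        (nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)
    ) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)
```

By the usual rule of thumb, |d| ≈ 0.5 is a medium effect and |d| ≥ 0.8 a large one, so d > 1.4 on the lexical axis is very large indeed.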

The authors argue that reliance on BLEU/ROUGE alone severely underestimates LLM performance on real‑world, time‑sensitive queries. A multi‑dimensional evaluation—combining lexical overlap, embedding‑based similarity, optimal‑transport metrics like MoverScore, and logical inference—offers a more faithful picture of model reliability.

RECOM is released publicly (https://anonymous.4open.science/r/recom-D4B0) to enable further research on temporally dynamic QA, model calibration, and the development of evaluation protocols that capture both meaning fidelity and logical soundness. The paper concludes that future work should prioritize such holistic metrics, especially when deploying LLMs in environments where up‑to‑date information and alignment with human consensus are critical.

