Can LLMs Automate Fact-Checking Article Writing?
Automatic fact-checking aims to support professional fact-checkers by offering tools that can help speed up manual fact-checking. Yet, existing frameworks fail to address the key step of producing output suitable for broader dissemination to the general public: while human fact-checkers communicate their findings through fact-checking articles, automated systems typically produce little or no justification for their assessments. Here, we aim to bridge this gap. In particular, we argue for the need to extend the typical automatic fact-checking pipeline with automatic generation of full fact-checking articles. We first identify key desiderata for such articles through a series of interviews with experts from leading fact-checking organizations. We then develop QRAFT, an LLM-based agentic framework that mimics the writing workflow of human fact-checkers. Finally, we assess the practical usefulness of QRAFT through human evaluations with professional fact-checkers. Our evaluation shows that while QRAFT outperforms several previously proposed text-generation approaches, it lags considerably behind expert-written articles. We hope that our work will enable further research in this new and important direction. The code for our implementation is available at https://github.com/mbzuai-nlp/qraft.git.
💡 Research Summary
The paper addresses a critical gap in the automatic fact‑checking literature: while most research focuses on claim detection, evidence retrieval, verification, and brief explanation generation, it neglects the final and highly visible step of producing a full‑length fact‑checking article for public consumption. To fill this void, the authors first conduct semi‑structured interviews with professional fact‑checkers from leading organizations (Full Fact, Newtral, Pagella Politica, etc.) and extract six core desiderata for such articles: precise claim clarification, contextual background, transparent sourcing, logical structure, importance justification, and verifiable references. These insights reveal that a fact‑checking article is far more demanding than a short justification, requiring nuanced argumentation, multiple interpretations, and a clear narrative for lay readers.
Based on these requirements, the authors propose QRAFT (Question‑Answer‑Driven Fact‑checking Article Generation Toolkit), a multi‑agent framework built on large language models (LLMs). QRAFT decomposes the writing process into two main stages. In the first “Draft Planning” stage, a Planner agent extracts concise evidence nuggets from the supplied evidence set, creates a structured outline guided by the expert‑derived preferences, and passes this outline to a Writer agent that drafts the article. In the second “Iterative Rewriting” stage, an Editor agent engages in a conversational QA loop with the Writer, providing concrete edit suggestions (e.g., add missing citations, improve logical flow). The loop repeats up to a predefined maximum number of iterations, with automatic metrics (ROUGE‑L, BERTScore, FactCC) monitoring convergence.
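The two-stage workflow described above can be sketched as a simple agent loop. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical stand-in for an LLM API call, and the convergence check on the Editor's feedback is an assumed heuristic.

```python
# Sketch of QRAFT's Draft Planning + Iterative Rewriting stages.
# `call_llm` is a placeholder; the real system prompts GPT-3.5-Turbo
# with role-specific templates and monitors automatic metrics.

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for a role-conditioned LLM API call."""
    return f"[{role} output for: {prompt[:40]}...]"

def generate_article(claim: str, evidence: list[str], max_iters: int = 3) -> str:
    # Stage 1: Draft Planning -- Planner extracts nuggets and an outline,
    # then the Writer produces a first draft from the outline.
    nuggets = [call_llm("planner", f"Extract key evidence nuggets from: {doc}")
               for doc in evidence]
    outline = call_llm("planner",
                       "Propose an article outline from these nuggets: "
                       + "; ".join(nuggets))
    draft = call_llm("writer",
                     f"Write a fact-checking article for '{claim}' "
                     f"following this outline: {outline}")

    # Stage 2: Iterative Rewriting -- Editor critiques, Writer revises,
    # up to a fixed maximum number of iterations.
    for _ in range(max_iters):
        feedback = call_llm("editor", f"Critique this draft: {draft}")
        if "no further edits" in feedback.lower():  # assumed stop signal
            break
        draft = call_llm("writer",
                         f"Revise the draft per this feedback: {feedback}\n"
                         f"Draft: {draft}")
    return draft
```

In practice the stop condition would combine the Editor's judgment with the automatic metrics mentioned above, rather than a fixed string match.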
Technically, each agent is instantiated with a role‑specific prompt template for GPT‑3.5‑Turbo. The Planner’s prompt emphasizes “summarize each document into 3‑5 bullet‑point nuggets and propose an outline that respects the following high‑level guidelines.” The Writer’s prompt focuses on fluent, coherent prose that follows the outline. The Editor’s prompt asks the model to critique the draft, point out factual inconsistencies, and request missing evidence or clearer explanations. Crucially, the Editor’s feedback is fed back to the Planner, which can re‑extract or re‑prioritize nuggets, thereby mitigating hallucinations.
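The role-specific prompts might be organized as templates like the following. The wording here merely paraphrases the roles described above; the exact prompts used in QRAFT may differ, and the placeholder field names are assumptions.

```python
# Illustrative role-specific prompt templates for the three agents
# (paraphrased from the roles described in the summary, not verbatim).

PROMPTS = {
    "planner": (
        "Summarize each document into 3-5 bullet-point nuggets and "
        "propose an article outline that respects the following "
        "high-level guidelines:\n{guidelines}\n\nDocuments:\n{documents}"
    ),
    "writer": (
        "Write a fluent, coherent fact-checking article for the claim "
        "'{claim}', strictly following this outline:\n{outline}"
    ),
    "editor": (
        "Critique the draft below: point out factual inconsistencies, "
        "request missing evidence or citations, and suggest clearer "
        "explanations for lay readers.\n\nDraft:\n{draft}"
    ),
}

# Filling a template before sending it to the model:
filled = PROMPTS["writer"].format(claim="X happened in 2023",
                                  outline="1. Background\n2. Verdict")
```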
The authors evaluate QRAFT on two fronts. First, automatic evaluation against reference human‑written articles shows that QRAFT outperforms several strong baselines (single‑prompt generation, chain‑of‑thought prompting) by 10–15% on ROUGE‑L and BERTScore, and improves FactCC scores, indicating better factual grounding. Second, a human study with four professional fact‑checkers rates QRAFT‑generated articles on dimensions such as structure, clarity, source transparency, and overall trustworthiness. While QRAFT achieves respectable scores (averaging 3.8/5 on structure and clarity), it lags behind expert‑written pieces (4.6/5 on average), especially on source verifiability (3.1/5) and contextual accuracy (3.0/5). The expert evaluators also note recurring issues: occasional hallucinated URLs, insufficient world knowledge to assess the relevance of certain evidence, and difficulty capturing implicit claim meanings.
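To make the automatic evaluation concrete, ROUGE‑L scores a generated article against a human reference by the longest common subsequence (LCS) of their tokens. The following is a minimal, self-contained sketch of ROUGE‑L F1; real evaluations would use a standard package such as `rouge-score` (with stemming and proper tokenization) rather than whitespace splitting.

```python
# Minimal ROUGE-L F1: F-measure over the longest common subsequence
# of whitespace tokens. Illustrative only; standard implementations
# add tokenization and stemming.

def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l_f1(reference: str, candidate: str) -> float:
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return 2 * precision * recall / (precision + recall)
```

BERTScore replaces the LCS overlap with contextual-embedding similarity, and FactCC checks entailment of the generated text against the source evidence, so the three metrics probe complementary aspects of quality.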
The paper concludes that QRAFT represents a promising first step toward automating the full fact‑checking article pipeline, demonstrating that a coordinated multi‑agent LLM system can produce more structured and fact‑rich drafts than single‑model baselines. However, the authors acknowledge persistent limitations: LLM hallucinations, lack of up‑to‑date world knowledge, challenges in multi‑source cross‑validation, and the necessity of final human editorial oversight. Future work is outlined to integrate external verification APIs, employ retrieval‑augmented generation for better grounding, support multilingual article generation, and refine the human‑AI collaborative interface to further close the gap between AI‑generated and expert‑crafted fact‑checking articles.