Compendia: Automated Visual Storytelling Generation from Online Article Collection

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

In the digital age, readers value quantitative journalism that is clear, concise, analytical, and human-centred. To understand complex topics, they often piece together scattered facts from multiple articles. Visual storytelling can transform fragmented information into clear, engaging narratives, yet its use with unstructured online articles remains largely unexplored. To fill this gap, we present Compendia, an automated system that analyzes online articles in response to a user’s query and generates a coherent data story tailored to the user’s informational needs. Compendia addresses key challenges of storytelling from unstructured text through two modules covering four stages: Online Article Retrieval, which gathers relevant articles; Data Fact Extraction, which identifies, validates, and refines quantitative facts; Fact Organization, which clusters and merges related facts into coherent thematic groups; and Visual Storytelling, which transforms the organized facts into narratives with visualizations in an interactive scrollytelling interface. We evaluated Compendia through a quantitative analysis, confirming its accuracy in fact extraction and organization, and through two user studies with 16 participants, demonstrating its usability, effectiveness, and ability to produce engaging visual stories for open-ended queries.


💡 Research Summary

The paper introduces Compendia, an end‑to‑end system that automatically generates visual data stories from collections of online articles in response to a user’s query. The authors begin by highlighting the growing demand for quantitative journalism that is clear, concise, analytical, and human‑centred. Readers often need to piece together fragmented facts from multiple sources to understand complex topics such as AI’s carbon footprint or global poverty. Existing aggregators (Google News, Newsblaster) and AI‑driven search tools (Perplexity) provide surface‑level summaries but lack the ability to synthesize quantitative information and present it with visual scaffolding.

Compendia addresses three core challenges: (1) Fact discovery in noisy, unstructured text, where numbers appear in diverse formats and units; (2) Thematic diversity across articles, which makes organizing facts into coherent narratives difficult; and (3) Presentation overload, where a large number of extracted facts must be displayed without overwhelming the reader. To solve these, the system is built around two major modules:

  1. Data Fact Extraction & Organization

    • Online Article Retrieval expands the user query with lexical variations, then scrapes a bulk set of relevant articles.
    • Fact Extraction uses large language models (LLMs, specifically GPT‑4) guided by carefully engineered prompts to filter paragraphs, identify quantitative statements, and capture value, unit, temporal context, and surrounding narrative. The approach normalizes disparate expressions (e.g., “3.7K”, “3,700”) and assigns a confidence score based on cross‑article corroboration.
    • Fact Organization clusters semantically related facts using embedding‑based similarity and LLM‑driven reasoning. It merges duplicate or near‑duplicate facts, resolves unit inconsistencies, and distinguishes core facts from supporting details, producing thematic groups that serve as narrative units.

  2. Visual Storytelling

    • The interface follows an “overview‑first, details‑on‑demand” design that addresses the design requirements labelled R1‑R5. The Thematic Overview visualizes each cluster as a “thematic circle”, providing a high‑level map of topics, their inter‑connections, and fact counts.
    • The Story View employs scrollytelling: as users scroll, the interface transitions smoothly between clusters, revealing detailed fact panels, interactive charts (line, bar, pie), and source citations. Users can also filter, drill down, or jump to related articles directly from the interface, ensuring transparency (R5).
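
The normalization and corroboration steps described for Fact Extraction can be illustrated with a small sketch. This is not the authors' implementation: the summary attributes these steps to prompted LLMs, whereas the code below substitutes a plain regular expression and a simple counting rule. The function names (`normalize_value`, `corroboration_scores`), the suffix table, and the scoring rule (share of articles that state the same normalized fact) are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from decimal import Decimal

# Hypothetical, incomplete table of magnitude suffixes (assumption).
_SUFFIXES = {"k": 10**3, "m": 10**6, "b": 10**9}

# Digits with optional thousands separators and decimals, then an optional suffix.
_NUMBER = re.compile(r"^\s*([0-9][0-9,]*(?:\.[0-9]+)?)\s*([kKmMbB]?)\s*$")


def normalize_value(text: str) -> Decimal:
    """Convert expressions such as '3.7K' or '3,700' to one exact number."""
    match = _NUMBER.match(text)
    if match is None:
        raise ValueError(f"unrecognized numeric expression: {text!r}")
    digits, suffix = match.groups()
    # Decimal avoids floating-point drift, so '3.7K' and '3,700' compare equal.
    value = Decimal(digits.replace(",", ""))
    return value * _SUFFIXES.get(suffix.lower(), 1)


def corroboration_scores(mentions):
    """Merge duplicate mentions and score each fact by cross-article support.

    `mentions` is an iterable of (article_id, raw_value, unit) tuples.
    The score is the fraction of distinct articles stating the same
    normalized (value, unit) pair; this rule is an assumption, since the
    summary does not say how the confidence score is computed.
    """
    articles = set()
    supporters = defaultdict(set)
    for article_id, raw_value, unit in mentions:
        articles.add(article_id)
        key = (normalize_value(raw_value), unit.strip().lower())
        supporters[key].add(article_id)
    return {fact: len(ids) / len(articles) for fact, ids in supporters.items()}


# Made-up mentions for illustration only; they are not data from the paper.
example_mentions = [
    ("article-1", "3.7K", "tonnes"),
    ("article-2", "3,700", "Tonnes"),
    ("article-3", "4.1K", "tonnes"),
]
example_scores = corroboration_scores(example_mentions)
```

In this toy input, the first two mentions collapse into one fact supported by two of three articles, while the third stays a separate, weakly corroborated fact. A real pipeline would also need unit conversion and tolerance for rounded values, which this sketch leaves out.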

The authors evaluate Compendia through (a) a quantitative benchmark where extracted facts are compared against a manually curated gold set, achieving 92% extraction accuracy and an F1 of 0.87 for clustering; and (b) two user studies with 16 participants. The perception study shows high usability scores, while the open‑ended query exploration study demonstrates that participants can answer complex questions faster (≈35% time reduction) and with lower cognitive load than using traditional search tools. Participants also reported higher engagement and trust due to the direct linking of facts to original sources.
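
The summary does not say how the clustering F1 is computed. One common formulation is pairwise F1, in which every pair of facts placed in the same cluster counts as a positive prediction and is checked against the gold grouping. The sketch below shows that formulation under this assumption; the function names and the toy labels are hypothetical and do not come from the paper.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of item pairs that share a cluster label.

    `labels` maps each item id to its cluster label.
    """
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def pairwise_f1(predicted, gold):
    """Pairwise F1 between a predicted and a gold clustering of the same items."""
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    if not pred_pairs and not gold_pairs:
        return 0.0  # no same-cluster pairs on either side; score is undefined, return 0
    true_positives = len(pred_pairs & gold_pairs)
    # Equivalent to the harmonic mean of pairwise precision and recall.
    return 2 * true_positives / (len(pred_pairs) + len(gold_pairs))


# Toy labels for illustration only.
gold_labels = {"f1": "A", "f2": "A", "f3": "B", "f4": "B"}
predicted_labels = {"f1": 0, "f2": 0, "f3": 0, "f4": 1}
toy_score = pairwise_f1(predicted_labels, gold_labels)
```

Here the prediction yields three same-cluster pairs, of which one matches the two gold pairs, giving precision 1/3, recall 1/2, and a pairwise F1 of 0.4.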

Key contributions include: (i) a novel pipeline that transforms unstructured web text into structured, validated quantitative facts; (ii) an interactive scrollytelling interface that integrates thematic overviews with detailed visualizations; and (iii) a comprehensive evaluation confirming both technical correctness and user‑centric effectiveness. Limitations are acknowledged: reliance on LLMs may introduce hallucinations; the system currently processes articles in batches rather than real‑time streams; and domain‑specific unit conversion rules are limited. Future work proposes real‑time article ingestion, multilingual support, and a feedback loop where user corrections refine the LLM prompts and clustering models.

Overall, Compendia pushes the frontier of automated visual storytelling by bridging the gap between noisy, multi‑source textual data and polished, interactive data narratives, offering a promising tool for journalists, analysts, and any audience seeking data‑driven insights from the web.

