TEGRA: Text Encoding With Graph and Retrieval Augmentation for Misinformation Detection
Misinformation detection is a critical task that can benefit significantly from the integration of external knowledge, much like manual fact-checking. In this work, we propose a novel method for representing textual documents that facilitates the incorporation of information from a knowledge base. Our approach, Text Encoding with Graph (TEG), processes documents by extracting structured information in the form of a graph and encoding both the text and the graph for classification purposes. Through extensive experiments, we demonstrate that this hybrid representation enhances misinformation detection performance compared to using language models alone. Furthermore, we introduce TEGRA, an extension of our framework that integrates domain-specific knowledge, further enhancing classification accuracy in most cases.
💡 Research Summary
The paper introduces TEGRA, a novel framework that augments text representations with structured graph information and external knowledge to improve misinformation detection. The core idea is to convert each news article into an Open Information Extraction (OpenIE) graph, where nodes represent entities and edges capture relations and actions extracted as subject‑predicate‑object triples. Two extraction pipelines are explored: OpenIE6, a BERT‑based end‑to‑end model, and an LLM‑prompted approach (KGI) using the LlamaIndex library. Although the extracted triples are noisy, they provide an entity‑centric view that mitigates the long‑range dependency problems inherent in pure text models.
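The triple-to-graph conversion can be sketched with plain Python. This is a minimal illustration of the entity-centric structure described above, not the OpenIE6 or KGI pipeline itself; the triples are invented examples.

```python
# Build an entity-centric graph from OpenIE-style
# (subject, predicate, object) triples: entities become nodes,
# predicates become edge labels between subject and object.
from collections import defaultdict

def build_graph(triples):
    """Return (nodes, edges): node strings and (subj, obj) -> [predicates]."""
    nodes = set()
    edges = defaultdict(list)
    for subj, pred, obj in triples:
        nodes.update([subj, obj])
        edges[(subj, obj)].append(pred)
    return nodes, dict(edges)

# Invented example triples, standing in for extractor output.
triples = [
    ("vaccine", "causes", "immunity"),
    ("study", "published in", "journal"),
    ("vaccine", "tested in", "study"),
]
nodes, edges = build_graph(triples)
```

Note how "vaccine" appears in two triples but becomes a single node, which is what gives the graph its entity-centric view of the document.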
In the basic TEG (Text Encoding with Graph) stage, the raw text is encoded with a fine‑tuned transformer (RoBERTa) while the graph nodes and edges are embedded with fastText for efficiency. A Graph Attention Network (GAT) propagates information across the graph, after which max and mean pooling produce a fixed‑size graph vector. The text and graph vectors are concatenated and fed to a two‑layer perceptron classifier.
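The fusion step can be sketched in numpy. A single dot-product attention layer stands in for the GAT, and random vectors stand in for the fastText and transformer embeddings; the dimensions are illustrative, not the paper's settings.

```python
# Simplified sketch of the TEG stage: one attention-weighted
# message-passing layer, then max + mean pooling into a graph vector,
# then concatenation with the text vector for the MLP classifier.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_propagate(h, adj):
    """One simplified attention layer: dot-product scores over neighbors."""
    scores = h @ h.T                           # (n, n) pairwise scores
    scores = np.where(adj > 0, scores, -1e9)   # mask non-edges
    alpha = softmax(scores, axis=1)            # attention over neighbors
    return alpha @ h                           # weighted neighbor aggregation

rng = np.random.default_rng(0)
n_nodes, d_node, d_text = 6, 8, 16
h = rng.normal(size=(n_nodes, d_node))   # fastText-style node embeddings
adj = np.eye(n_nodes)                    # self-loops so isolated nodes survive
adj[0, 1] = adj[1, 0] = adj[1, 2] = adj[2, 1] = 1

h = attention_propagate(h, adj)
graph_vec = np.concatenate([h.max(axis=0), h.mean(axis=0)])  # (2 * d_node,)
text_vec = rng.normal(size=d_text)       # transformer sentence vector stand-in
fused = np.concatenate([text_vec, graph_vec])  # classifier input
```

The max/mean pooling pair is what makes the graph vector fixed-size regardless of how many entities a given article produces.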
TEGRA extends this pipeline by linking graph nodes to URIs using DBpedia Spotlight and BLINK, then retrieving additional triples from external knowledge bases via SPARQL. Separate class‑specific knowledge graphs are built from training documents: KG_true (from legitimate articles) and KG_misinfo (from misinformation articles). The original OpenIE graph is duplicated and enriched with triples from each KG, yielding two enriched graphs G_true and G_misinfo. To control the influence of potentially irrelevant or noisy added triples, a Triple Selection (TS) module computes a relevance score µ for each triple. TS projects fastText embeddings of the text and the triple (averaged over subject, predicate, object) into a shared space, takes a dot product, and applies a sigmoid to obtain µ∈(0,1). The score scales the embeddings of added nodes and edges before GAT processing, and the module is trained end‑to‑end with the rest of the model.
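The TS scoring described above reduces to a projection, a dot product, and a sigmoid. The sketch below follows that description, with random matrices standing in for the learned projection parameters (which the paper trains end-to-end); shapes and values are illustrative assumptions.

```python
# Sketch of the Triple Selection (TS) relevance score: project text and
# triple embeddings into a shared space, take a dot product, apply a
# sigmoid to get mu in (0, 1), then scale the added embeddings by mu.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def triple_score(text_emb, triple_embs, W_text, W_triple):
    """Relevance mu of one retrieved triple to the document text."""
    triple_emb = triple_embs.mean(axis=0)  # average subj/pred/obj vectors
    t = W_text @ text_emb                  # text in shared space
    r = W_triple @ triple_emb              # triple in shared space
    return sigmoid(t @ r)

rng = np.random.default_rng(1)
d, d_shared = 10, 6
W_text = rng.normal(scale=0.1, size=(d_shared, d))    # learned in practice
W_triple = rng.normal(scale=0.1, size=(d_shared, d))  # learned in practice

text_emb = rng.normal(size=d)            # fastText-style document embedding
triple_embs = rng.normal(size=(3, d))    # subject, predicate, object vectors
mu = triple_score(text_emb, triple_embs, W_text, W_triple)
scaled_node = mu * rng.normal(size=d)    # down-weight noisy additions pre-GAT
```

Because the sigmoid keeps µ strictly between 0 and 1, an irrelevant retrieved triple is attenuated rather than hard-filtered, which is what keeps the selection differentiable.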
The authors evaluate the approach on four public misinformation datasets: PolitiFact, GossipCop, CoAID (COVID‑19), and Horne2017 (US 2016 election). For each dataset, five random 80/10/10 splits are used; models are trained with Adam (lr = 1e‑5), early stopping (patience = 20), and the best validation F1 checkpoint is tested. Baselines include a fine‑tuned RoBERTa, the Gemma‑3‑12B LLM in zero‑shot and three‑shot settings, a Tsetlin Machine, and DeClarE. Results show that TEG consistently outperforms the text‑only baseline by 2–3 percentage points in accuracy and macro‑F1, while TEGRA adds another 1–2 points, demonstrating the benefit of class‑specific knowledge augmentation. Importantly, the dual‑graph strategy (G_true vs. G_misinfo) enables the model to detect consistency or contradiction between the article’s content and the knowledge base, providing an interpretable signal absent in prior KG‑enhanced methods such as CompareNet or DDGCN.
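The split protocol (five random 80/10/10 partitions per dataset) can be sketched with the standard library; the document list here is a placeholder, not one of the actual datasets.

```python
# Sketch of the evaluation protocol: five random 80/10/10
# train/validation/test splits, each driven by a distinct seed.
import random

def make_split(items, seed):
    """Shuffle a copy of items and cut it 80/10/10."""
    items = items[:]
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

docs = list(range(100))                  # placeholder document IDs
splits = [make_split(docs, seed) for seed in range(5)]
train, val, test = splits[0]
```

Reporting over several random splits rather than one fixed split reduces the variance of the comparison between TEG, TEGRA, and the baselines.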
Key contributions are: (1) a hybrid text‑graph representation that explicitly models entities and relations, (2) a lightweight pipeline for linking to external URIs and retrieving triples, (3) a differentiable Triple Selection mechanism that mitigates noise, (4) class‑specific knowledge graphs that allow comparative reasoning, and (5) empirical validation across diverse domains. Limitations include dependence on the quality of OpenIE extraction, computational overhead of URI linking and SPARQL queries for large‑scale deployment, and focus on English‑centric resources. Future work will explore more robust OpenIE models, efficient caching of knowledge retrieval, multilingual extensions, and integration of multimodal evidence (images, video) to further strengthen misinformation detection.