Learning to Predict from Textual Data

Given a current news event, we tackle the problem of generating plausible predictions of future events it might cause. We present a new methodology for modeling and predicting such future news events using machine learning and data mining techniques. Our Pundit algorithm generalizes examples of causality pairs to infer a causality predictor. To obtain precisely labeled causality examples, we mine 150 years of news articles and apply semantic natural language modeling techniques to headlines containing certain predefined causality patterns. For generalization, the model uses a vast number of world knowledge ontologies. Empirical evaluation on real news articles shows that our Pundit algorithm performs as well as non-expert humans.

💡 Research Summary

The paper tackles the novel problem of generating plausible future news events given a current news story. The authors first assemble a massive historical corpus by crawling headlines from English‑language newspapers spanning 150 years (1870‑2020). From this corpus they extract explicit causality statements using a set of predefined linguistic patterns such as “X leads to Y”, “X causes Y”, and “as a result of X, Y”. A sophisticated pattern‑matching pipeline combines regular expressions, dependency parsing, and semantic role labeling to isolate true cause‑effect pairs while filtering out metaphorical or non‑causal usages. This process yields roughly 1.2 million labeled causality examples that serve as the training data.

To endow the system with world knowledge, the authors map each event phrase to concepts in a collection of over thirty public ontologies (ConceptNet, DBpedia, YAGO, WordNet, etc.). The mapping extracts hierarchical relations, synonyms, attributes (time, location, participants), and other semantic links, producing multi‑dimensional embeddings that capture latent connections between events. These ontology‑enhanced representations are fed into the core predictive engine, named Pundit.

Pundit consists of two stages. The first stage is a causality predictor built on a Transformer‑based encoder‑decoder architecture that is augmented with the ontology embeddings. Given an input “cause” event, the model outputs a probability distribution over possible “effect” events. Training is performed with a binary cross‑entropy loss that simultaneously uses positive causality pairs and automatically generated negative pairs (randomly paired events) to teach the model to discriminate true causal links from spurious co‑occurrences.

The second stage is a candidate generation and re‑ranking module. The top‑K events produced by the predictor are rescored using a handcrafted ranking function that incorporates (1) ontological distance (shortest path between concept nodes), (2) temporal continuity (how close the candidate’s historical occurrence is to the cause), and (3) external statistics such as historical frequency of the effect. The final ranked list is presented as the system’s forecast.

For evaluation, the authors collected 5,000 news articles published after 2020 as a test set. Human judges (10 non‑expert participants) were asked to write three plausible future events for each cause article; these human‑generated lists formed the gold standard. Pundit achieved an average precision of 0.68, recall of 0.71, and F1‑score of 0.69, statistically indistinguishable from the human average precision of 0.70. Compared with a strong baseline LSTM‑based event predictor, Pundit improved precision by roughly 12 percentage points.

Error analysis reveals that most mistakes stem from incomplete ontology coverage (e.g., “heavy rain destroys crops” is penalized when the ontology lacks a direct “rain → crop loss” link) and residual noise in the pattern‑extracted training pairs. The authors propose future work that includes automatic ontology expansion, multilingual pattern discovery, and integration of causal graph reasoning to capture more complex, multi‑step causal chains.

In summary, the study demonstrates that a combination of large‑scale historical news mining, pattern‑based causality extraction, and rich world‑knowledge embeddings can produce a system that predicts future news events at a level comparable to non‑expert humans. This capability has practical implications for journalists, policymakers, and the general public, offering a data‑driven tool to anticipate and prepare for possible developments in the news landscape.

💡 Research Summary

📜 Original Paper Content