PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Predicting future events based on news on the Web stands as one of the ultimate aspirations of artificial intelligence. Recent advances in large language model (LLM)-based systems have shown remarkable potential in forecasting future events, thereby garnering significant interest in the research community. Several benchmarks have been established to evaluate these forecasting capabilities by formalizing event prediction as a retrieval-augmented generation (RAG)-and-reasoning task. In these benchmarks, each prediction question is answered with relevant news articles retrieved from the Web. However, because these benchmarks do not consider whether the questions can be supported by valid or sufficient rationales, some of their questions may be inherently non-inferable. To address this issue, we introduce a new benchmark, PROPHET, which comprises inferable forecasting questions paired with relevant news for retrieval. To ensure the inferability of the benchmark, we propose Causal Intervened Likelihood (CIL), a statistical measure that assesses inferability through causal inference. In constructing this benchmark, we first collected recent trend forecasting questions and then filtered the data using CIL, resulting in an inferable benchmark for future forecasting. Through extensive experiments, we first demonstrate the validity of CIL and conduct in-depth investigations into future forecasting with its aid. Subsequently, we evaluate several representative prediction methods on PROPHET. The overall results yield valuable insights into future directions for this task.


💡 Research Summary

The paper introduces PROPHET, a new benchmark for future event forecasting that explicitly addresses the problem of “inferability” – whether a given prediction question can be answered based on the supporting evidence available in a retrieved news corpus. Existing forecasting benchmarks simply collect real‑world prediction questions and pair them with news articles, but they do not verify that the articles contain sufficient rationales. Consequently, many questions are non‑inferable, leading to unfair or misleading evaluations of large language model (LLM)‑based forecasting systems.

To solve this, the authors propose a novel statistical metric called Causal Intervened Likelihood (CIL). Each news article is modeled as a binary variable Xᵢ indicating whether the event described in the article occurs. The target binary outcome Y indicates whether the future event described by the question actually happens (ground‑truth label Ŷ). CIL for article i is defined as the difference between the probability of the correct answer when we intervene to force Xᵢ = 1 and the probability when we intervene to force Xᵢ = 0:
CILᵢ = P(Y = Ŷ | do(Xᵢ = 1)) − P(Y = Ŷ | do(Xᵢ = 0)).
A high CIL value means the article strongly supports the correct answer; a low or negative value indicates little or even contradictory support.
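Read as code, the definition is simply a difference of two interventional probabilities. The sketch below uses hypothetical probability values, not estimates from the paper:

```python
def cil(p_do_1: float, p_do_0: float) -> float:
    """Causal Intervened Likelihood for article i.

    p_do_1: P(Y = Y_hat | do(X_i = 1)), the probability of the correct
            answer when the article's event is forced to occur.
    p_do_0: P(Y = Y_hat | do(X_i = 0)), the same probability when the
            event is forced not to occur.
    """
    return p_do_1 - p_do_0

# Hypothetical values: this article strongly supports the correct answer.
print(round(cil(0.9, 0.4), 2))  # 0.5
# A negative CIL would mean the article points away from the correct answer.
```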

Because a full structural causal model (SCM) of all news events is infeasible, the authors introduce two realistic assumptions to make CIL computable from observational data:

  1. Temporality – later events cannot causally affect earlier ones, ensuring a directed acyclic graph aligned with time.
  2. w‑day Dependency Window – direct causal links only exist between events occurring within a fixed window w days of each other, reflecting the intuition that distant news influences are mediated through more recent events.

Under these assumptions, the interventional probabilities can be expressed purely in terms of observable joint probabilities (Equation 6). This enables efficient estimation of CIL for each article without explicit expert‑crafted SCMs.
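Equation 6 itself is not reproduced in this summary; as a rough illustration, under the two assumptions the intervention can be evaluated with a standard back-door adjustment over the article's parent events Z inside the w-day window. The `p_do` helper, the joint table layout, and all numbers below are hypothetical stand-ins, not the paper's estimator:

```python
from itertools import product

def p_do(joint, x, outcome, n_parents):
    """Estimate P(Y = outcome | do(X = x)) by back-door adjustment:
    sum over z of P(Y = outcome | X = x, Z = z) * P(Z = z),
    where Z are the article's parent events within the w-day window.
    `joint` maps (x, z_tuple, y) -> probability mass of the full joint."""
    total = 0.0
    for z in product((0, 1), repeat=n_parents):
        p_z = sum(joint.get((xi, z, y), 0.0) for xi in (0, 1) for y in (0, 1))
        den = sum(joint.get((x, z, y), 0.0) for y in (0, 1))
        num = joint.get((x, z, outcome), 0.0)
        if den > 0:  # skip parent configurations never observed with X = x
            total += (num / den) * p_z
    return total

# Toy joint distribution with a single parent event Z (all numbers made up).
pz = {0: 0.5, 1: 0.5}
px1_given_z = {0: 0.2, 1: 0.8}                                      # P(X=1 | Z)
py1_given_xz = {(0, 0): 0.3, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.9}  # P(Y=1 | X, Z)
joint = {}
for z in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            px = px1_given_z[z] if x else 1 - px1_given_z[z]
            py = py1_given_xz[(x, z)] if y else 1 - py1_given_xz[(x, z)]
            joint[(x, (z,), y)] = pz[z] * px * py

cil_i = p_do(joint, 1, 1, 1) - p_do(joint, 0, 1, 1)  # 0.80 - 0.35
print(f"{cil_i:.2f}")  # 0.45
```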

Data collection proceeds as follows: prediction questions are harvested from the Polymarket forecasting platform, focusing on those resolved between 2025‑01‑01 and 2025‑01‑31. For each question, three LLM‑generated search queries (entity‑based, resolution‑step, and historical‑event) are issued to MediaCloud, retrieving a large set of news URLs. Articles are downloaded via Newspaper3k, filtered for correct publication dates, and scored for relevance using an LLM. The highest‑scoring (potentially answer‑leaking) and lowest‑scoring (irrelevant) articles are removed. The remaining articles form a candidate knowledge base for each question.
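The date and relevance filtering step can be sketched in plain Python. The field names and relevance cut-offs below are hypothetical, and the actual retrieval (MediaCloud queries, Newspaper3k downloads, LLM scoring) is omitted:

```python
from datetime import date

def build_candidate_base(articles, resolve_date, low=0.2, high=0.9):
    """Filter one question's retrieved articles (hypothetical thresholds).

    Each article is a dict with 'url', 'pub_date' (datetime.date), and an
    LLM-assigned 'relevance' score in [0, 1]. Articles published on or
    after the resolution date are dropped; then the highest-scoring
    (potential answer leakage) and lowest-scoring (irrelevant) articles
    are removed.
    """
    dated = [a for a in articles if a["pub_date"] < resolve_date]
    return [a for a in dated if low <= a["relevance"] <= high]

articles = [
    {"url": "a", "pub_date": date(2025, 1, 10), "relevance": 0.95},  # leaky
    {"url": "b", "pub_date": date(2025, 1, 12), "relevance": 0.60},  # kept
    {"url": "c", "pub_date": date(2025, 2, 1), "relevance": 0.70},   # too late
    {"url": "d", "pub_date": date(2025, 1, 5), "relevance": 0.05},   # irrelevant
]
kept = build_candidate_base(articles, resolve_date=date(2025, 1, 31))
print([a["url"] for a in kept])  # ['b']
```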

CIL is then computed for every article; questions whose average CIL falls below a pre‑defined threshold are deemed non‑inferable and excluded. The final PROPHET benchmark comprises 612 questions, each paired with roughly 12 filtered articles, all verified to be sufficiently inferable.
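The inferability filter itself is a one-liner over each question's article CILs; the 0.1 threshold below is an illustrative stand-in for the paper's pre-defined value:

```python
def is_inferable(article_cils, threshold=0.1):
    """Keep a question iff the mean CIL over its candidate articles
    clears the threshold (threshold value is illustrative)."""
    return sum(article_cils) / len(article_cils) >= threshold

print(is_inferable([0.4, 0.2, -0.1]))   # True  (mean ~= 0.167)
print(is_inferable([0.05, -0.2, 0.0]))  # False (mean = -0.05)
```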

Extensive experiments validate CIL’s usefulness. The authors show a strong correlation between a question’s CIL score and the performance of several LLM‑based forecasting pipelines (including LLaMA‑2, GPT‑4, and FLAN‑T5) measured by Brier Score. Higher CIL questions consistently yield lower Brier Scores across models, confirming that CIL captures the true difficulty of the inference task. Moreover, CIL enables nuanced analyses such as difficulty stratification, retrieval module tuning, and error attribution.
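The Brier Score used in these comparisons is the mean squared error between a forecast probability and the 0/1 outcome, so lower is better. A minimal sketch with made-up forecasts:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

# Hypothetical forecasts for three resolved questions.
probs = [0.9, 0.2, 0.65]
outcomes = [1, 0, 1]
print(round(brier_score(probs, outcomes), 4))  # 0.0575
```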

Baseline evaluations reveal that retrieval‑augmented generation (RAG) approaches outperform pure prompting, yet overall performance remains modest, highlighting persistent challenges: news bias, temporal dynamics, and limited coverage of relevant rationales. The paper concludes with future directions: refining CIL estimation with more sophisticated causal inference techniques, incorporating domain‑expert feedback to enrich the underlying SCM, extending the benchmark to multilingual and cross‑cultural domains, and continuously updating the dataset to mitigate model leakage as LLMs evolve.

In summary, PROPHET establishes a rigorously filtered, inferability‑aware benchmark for future forecasting, and CIL provides a principled, causal‑theoretic tool to assess and improve the quality of prediction questions and their supporting evidence. This work paves the way for more reliable evaluation of AI systems tasked with anticipating real‑world events.

