All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection in LLM Backtesting
To evaluate whether LLMs can accurately predict future events, we need the ability to *backtest* them on events that have already resolved. This requires models to reason only with information available at a specified past date. Yet LLMs may inadvertently leak post-cutoff knowledge encoded during training, undermining the validity of retrospective evaluation. We introduce a claim-level framework for detecting and quantifying this *temporal knowledge leakage*. Our approach decomposes model rationales into atomic claims and categorizes them by temporal verifiability, then applies *Shapley values* to measure each claim's contribution to the prediction. This yields the **Shapley**-weighted **D**ecision-**C**ritical **L**eakage **R**ate (**Shapley-DCLR**), an interpretable metric that captures what fraction of decision-driving reasoning derives from leaked information. Building on this framework, we propose **Time**-**S**upervised **P**rediction with **E**xtracted **C**laims (**TimeSPEC**), which interleaves generation with claim verification and regeneration to proactively filter temporal contamination, producing predictions in which every supporting claim can be traced to sources available before the cutoff date. Experiments on 350 instances spanning U.S. Supreme Court case prediction, NBA salary estimation, and stock return ranking reveal substantial leakage in standard prompting baselines. TimeSPEC reduces Shapley-DCLR while preserving task performance, demonstrating that explicit, interpretable claim-level verification outperforms prompt-based temporal constraints for reliable backtesting.
💡 Research Summary
The paper tackles a fundamental obstacle in evaluating large language models (LLMs) on forward‑looking prediction tasks: temporal knowledge leakage. When a model is asked to forecast an event as if it were situated at a past reference date t_ref, any information that became publicly available after t_ref but before the model's training cutoff can inadvertently be used, inflating backtest performance. Existing work either flags whole‑model contamination or checks for test‑set overlap, but neither pinpoints which pieces of reasoning are contaminated nor how much they influence the final decision.
To address this, the authors propose a claim‑level framework. A model’s free‑form rationale is first broken down into atomic, verifiable statements called “claims.” Each claim is assigned a temporal grounding τ(c), the earliest time at which the claim could be known from public sources. Claims are organized into a taxonomy:
- Group A (time‑dependent) – A1 (discrete events), A2 (measurements at a specific date), A3 (dated publications), A4 (the target outcome itself), A5 (post‑event consequences).
- Group B (time‑independent) – B1 (domain knowledge) and B2 (definitions or logical truths).
By definition, A4 and A5 are always leaked (they refer to the future), while B1 and B2 are never leaked. For A1–A3 the system performs an external search (with strict date filters) to discover the earliest public appearance of the fact, thereby determining whether τ(c) > t_ref.
The next key component is importance weighting via Shapley values. For a set of claims S, a characteristic function v(S) returns the model’s prediction when only those claims are supplied. The Shapley value ϕ_i measures the average marginal contribution of claim c_i across all possible coalitions, capturing how much that claim drives the final output. Exact computation is feasible for typical rationales (10–20 claims); otherwise Monte‑Carlo sampling provides unbiased estimates.
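For claim sets of this size, exact Shapley values can be computed by enumerating coalitions directly. A self-contained sketch, assuming a caller-supplied characteristic function `v` (in the paper, a re-query of the model on the claim subset; here, any scalar-valued function over frozensets of claims):

```python
from itertools import combinations
from math import factorial

def shapley_values(claims, v):
    """Exact Shapley values via subset enumeration.

    claims: list of claim identifiers.
    v: characteristic function mapping a frozenset of claims to a
       scalar (e.g., the model's predicted probability given only
       those claims). Enumeration is feasible for ~10-20 claims;
       beyond that, Monte Carlo sampling over permutations is used.
    """
    n = len(claims)
    phi = {c: 0.0 for c in claims}
    for c in claims:
        others = [x for x in claims if x != c]
        for k in range(n):  # coalition sizes excluding c
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                S = frozenset(subset)
                # Average marginal contribution of c to coalition S.
                phi[c] += weight * (v(S | {c}) - v(S))
    return phi
```

A quick sanity check: for an additive game (v(S) is a sum of per-claim weights), each claim's Shapley value recovers its own weight exactly.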
Combining leakage status ℓ(c_i) (1 if the claim is leaked, 0 otherwise) with Shapley importance yields the Shapley‑Weighted Decision‑Critical Leakage Rate (Shapley‑DCLR):
Shapley‑DCLR = ( Σ_i |ϕ_i| · ℓ(c_i) ) / ( Σ_i |ϕ_i| )
This metric has two complementary interpretations. First, it reflects “influence‑weighted leakage”: a high value means the most influential claims are temporally contaminated, rendering the prediction unreliable. Second, it can be seen as the proportion of decision‑relevant information that originates from post‑cutoff sources. Thus Shapley‑DCLR distinguishes superficial leakage (e.g., irrelevant background facts) from critical leakage that actually determines the answer.
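The metric itself is a direct ratio of leak-weighted to total absolute Shapley mass. A minimal implementation, assuming the Shapley values and leak flags from the previous steps are available as dictionaries keyed by claim:

```python
def shapley_dclr(phi, leaked):
    """Shapley-weighted Decision-Critical Leakage Rate.

    phi: dict mapping claim -> Shapley value (may be negative).
    leaked: dict mapping claim -> bool leak status.
    Returns a value in [0, 1]; 0 if no claim carries any influence.
    """
    total = sum(abs(p) for p in phi.values())
    if total == 0.0:
        return 0.0
    return sum(abs(p) for c, p in phi.items() if leaked[c]) / total
```

Because the weights are |ϕ_i|, a leaked claim that the model essentially ignores contributes almost nothing, while a leaked claim that drives the decision dominates the score; this is exactly the superficial-versus-critical distinction described above.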
To proactively prevent leakage, the authors introduce TimeSPEC (Time‑Supervised Prediction with Extracted Claims), a five‑phase pipeline:
- Generator – receives the task input and t_ref, performs date‑filtered web search, gathers pre‑t_ref evidence, and produces an initial prediction with a rationale.
- Supervisor – extracts atomic claims, assigns taxonomy labels, and checks each claim for temporal violation using the rules above and external verification.
- Regenerator – if any violation is found, re‑generates the prediction using only the validated claims (or by prompting the model to replace the offending parts).
- Resupervisor – re‑evaluates the regenerated output to ensure all remaining claims satisfy the temporal constraint.
- Aggregator – consolidates the verified claims into the final answer.
TimeSPEC therefore moves the temporal guard from the prompt (which relies on the model’s self‑awareness) to an external, programmatic verification loop that can reliably enforce the cutoff date.
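The five phases above compose into a simple verify-and-regenerate loop. In the sketch below, every callable is a hypothetical stand-in for one of the paper's components (Generator, Supervisor, Regenerator, Resupervisor, Aggregator), passed in as parameters rather than implemented:

```python
def timespec(task_input, t_ref, generate, extract_claims, is_violating,
             regenerate, aggregate, max_rounds=3):
    """Sketch of the TimeSPEC control flow under stated assumptions.

    generate(input, t_ref)        -> (prediction, rationale)   # Generator
    extract_claims(rationale)     -> list of claims            # Supervisor
    is_violating(claim, t_ref)    -> bool                      # Supervisor check
    regenerate(input, t_ref, ok)  -> (prediction, rationale)   # Regenerator
    aggregate(prediction, claims) -> final answer              # Aggregator
    """
    prediction, rationale = generate(task_input, t_ref)
    claims = extract_claims(rationale)
    for _ in range(max_rounds):  # Resupervisor: re-check after each rewrite
        valid = [c for c in claims if not is_violating(c, t_ref)]
        if len(valid) == len(claims):
            break  # every remaining claim satisfies the temporal constraint
        prediction, rationale = regenerate(task_input, t_ref, valid)
        claims = extract_claims(rationale)
    return aggregate(prediction, claims)
```

The `max_rounds` cap is an assumption for termination; the paper's loop structure (how many regeneration passes are allowed) is not specified in this summary.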
The authors evaluate the framework on 350 instances across three domains: (i) U.S. Supreme Court case outcome prediction, (ii) NBA player salary estimation, and (iii) ranking of S&P 500 stocks by future return. Baseline prompting strategies—including explicit “predict as of t_ref” instructions and chain‑of‑thought prompting—exhibit substantial leakage, with Shapley‑DCLR ranging from 0.35 to 0.62. Despite this, their raw accuracy or ranking metrics appear competitive, highlighting the mirage effect of temporal contamination.
Applying TimeSPEC reduces Shapley‑DCLR by 75%–99% across all tasks. Crucially, task performance remains on par with the baselines: Supreme Court prediction accuracy stays around 78% (matching the best baseline), NBA salary RMSE changes by less than 2%, and stock-ranking NDCG is unchanged within statistical noise. In the legal domain, the model never cites the actual post-decision ruling yet still captures the relevant legal reasoning, demonstrating that genuine forward-looking prediction is feasible without leakage.
The paper’s contributions are threefold: (1) a claim‑level taxonomy and detection pipeline that isolates temporally contaminated reasoning; (2) the Shapley‑DCLR metric that quantifies not just the presence but the decision‑critical impact of leakage; and (3) the TimeSPEC architecture that operationalizes proactive leakage prevention in a model‑agnostic manner. The work bridges gaps between attribution research (Shapley values for LLMs), fact‑checking (temporal verification of claims), and forecasting evaluation (contamination‑free backtesting).
Limitations and future directions are acknowledged. Claim extraction currently relies on heuristic parsing and may miss nuanced statements; improving automated extraction with fine‑tuned parsers could raise coverage. The external search step incurs latency and may be incomplete for niche domains; integrating structured knowledge bases or domain‑specific archives could mitigate this. Finally, extending the framework to multimodal evidence (tables, figures) and to continuous‑time forecasting (e.g., weather) would broaden its applicability.
Overall, the study provides a rigorous, interpretable methodology for detecting and eliminating temporal knowledge leakage in LLM backtesting, offering both a diagnostic metric (Shapley‑DCLR) and a practical mitigation pipeline (TimeSPEC) that together enable trustworthy forward‑looking evaluation of language models.