Creating a Causally Grounded Rating Method for Assessing the Robustness of AI Models for Time-Series Forecasting

Notice: This research summary and analysis were automatically generated using AI. For complete accuracy, please refer to the original arXiv source.

AI models, including both time-series-specific and general-purpose Foundation Models (FMs), have demonstrated strong potential in time-series forecasting across sectors like finance. However, these models are highly sensitive to input perturbations, which can lead to prediction errors and undermine trust among stakeholders, including investors and analysts. To address this challenge, we propose a causally grounded rating framework to systematically evaluate model robustness by analyzing statistical and confounding biases under various noisy and erroneous input scenarios. Our framework is applied to a large-scale experimental setup involving stock price data from multiple industries and evaluates both uni-modal and multi-modal models, including Vision Transformer-based (ViT) models and FMs. We introduce six types of input perturbations and twelve data distributions to assess model performance. Results indicate that multi-modal and time-series-specific FMs demonstrate greater robustness and accuracy compared to general-purpose models. Further, to validate our framework’s usability, we conduct a user study showcasing time-series models’ prediction errors along with our computed ratings. The study confirms that our ratings reduce the difficulty for users in comparing the robustness of different models. Our findings can help stakeholders understand model behaviors in terms of robustness and accuracy for better decision-making even without access to the model weights and training data, i.e., black-box settings.


💡 Research Summary

The paper addresses a critical gap in the deployment of AI models for time‑series forecasting, especially in high‑stakes domains such as finance, where even small input perturbations can cause large prediction errors and erode stakeholder trust. To this end, the authors propose a causally grounded rating framework that quantifies model robustness and fairness under a systematic set of input disturbances. The framework consists of three main components: (1) a taxonomy of six perturbations—two input‑specific, two semantic, one syntactic, and one composite—that mimic realistic data‑entry errors, market‑driven anomalies, formatting glitches, and multimodal attacks; (2) a causal graph that captures the relationships among a sensitive attribute Z (e.g., industry sector), the perturbation P, and the model’s maximum residual R_max, allowing the use of do‑interventions and back‑door adjustment to isolate the direct causal effect of P on performance while controlling for confounding bias; and (3) a set of evaluation metrics that combine traditional statistical measures (WRS, APE, PIE %) with two novel causality‑based indices—Causal Effect of Perturbation Index (CAEI) and Confounding‑Bias Intensity (CBI).
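The back-door adjustment described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the records, perturbation names, and sector labels are all hypothetical, and the summary's CAEI is simplified here to the difference between the adjusted expectation under a perturbation and under clean input, i.e. E[R_max | do(P=p)] − E[R_max | do(P=none)], where each expectation is computed as Σ_z E[R_max | P=p, Z=z] · P(Z=z).

```python
from collections import defaultdict

# Hypothetical records: (perturbation P, sensitive attribute Z, max residual R_max).
# Values are illustrative, not results from the paper.
records = [
    ("gaussian_noise", "tech",    0.40),
    ("gaussian_noise", "tech",    0.44),
    ("gaussian_noise", "finance", 0.30),
    ("none",           "tech",    0.20),
    ("none",           "finance", 0.10),
    ("none",           "finance", 0.12),
]

def backdoor_effect(records, p):
    """E[R_max | do(P=p)] = sum_z E[R_max | P=p, Z=z] * P(Z=z)."""
    # Marginal distribution of the sensitive attribute Z
    z_counts = defaultdict(int)
    for _, z, _ in records:
        z_counts[z] += 1
    n = len(records)

    # Weight each stratum's conditional mean by P(Z=z)
    effect = 0.0
    for z, count in z_counts.items():
        rs = [r for (pp, zz, r) in records if pp == p and zz == z]
        if rs:  # skip strata with no support for this perturbation
            effect += (sum(rs) / len(rs)) * (count / n)
    return effect

# Simplified CAEI-style quantity: adjusted effect of the perturbation
# relative to clean input.
caei = backdoor_effect(records, "gaussian_noise") - backdoor_effect(records, "none")
```

Adjusting over Z blocks the back-door path Z → P and Z → R_max, so `caei` reflects only the direct influence of the perturbation on worst-case error.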

The experimental study uses one year of daily stock‑price data from six leading companies across three industries. Nine forecasting models are evaluated: two general‑purpose foundation models (Gemini‑V and Phi‑3), two time‑series‑specific foundation models (Chronos and MOMENT), two multimodal Vision‑Transformer variants (ViT‑num‑spec), and three conventional baselines (ARIMA, LSTM, Prophet). For each model, the authors compute residuals, accuracy metrics, and the causal indices under each of twelve data‑distribution scenarios (varying volatility, sector composition, and noise levels).
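The residual computation above can be illustrated with a toy sketch. Everything here is an assumption for demonstration: `perturb_spike` stands in for one of the paper's semantic perturbations, `naive_forecast` is a placeholder for any black-box model, and `max_residual` computes R_max as the worst-case absolute error over the forecast horizon.

```python
def perturb_spike(series, index, magnitude):
    """Illustrative semantic perturbation: inject a price spike,
    mimicking a data-entry error or a market-driven anomaly."""
    out = list(series)
    out[index] += magnitude
    return out

def max_residual(y_true, y_pred):
    """R_max: worst-case absolute forecast error over the horizon."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))

def naive_forecast(history, horizon):
    """Placeholder black-box model: repeat the last observed value."""
    return [history[-1]] * horizon

# Toy daily prices; the first six points are history, the rest are held out.
prices = [100.0, 101.5, 102.0, 101.0, 103.0, 104.5, 104.0, 105.0]
history, actual = prices[:6], prices[6:]

r_clean = max_residual(actual, naive_forecast(history, len(actual)))
r_pert = max_residual(
    actual, naive_forecast(perturb_spike(history, 5, 10.0), len(actual))
)
```

Comparing `r_clean` against `r_pert` for each perturbation-and-distribution pair is the raw material from which the framework's statistical and causal indices are aggregated.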

Results show that time‑series‑specific foundation models consistently achieve lower R_max values (≈30 % reduction) compared with general‑purpose models, indicating superior robustness to both semantic and composite perturbations. Multimodal ViT‑num‑spec models further improve resilience by 15‑20 % relative to purely numeric models, suggesting that spectrogram and line‑plot representations provide complementary information that mitigates the impact of corrupted inputs. Causal analysis reveals that the sensitive attribute Z introduces significant back‑door paths for general‑purpose models, leading to spurious correlations between perturbations and errors; this effect is markedly weaker for the specialized foundation models.
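The confounding effect described above can be made concrete with a small sketch. This is not the paper's CBI formula, only an assumed analogue: it measures the gap between the observational mean error under a perturbation and the back-door-adjusted mean that controls for the sensitive attribute Z. All records and labels are hypothetical, constructed so that the perturbation is over-represented in one sector.

```python
def mean(xs):
    return sum(xs) / len(xs)

def confounding_gap(records, p):
    """Illustrative CBI-style quantity: observational mean R_max under
    perturbation p minus the back-door-adjusted mean. A large positive
    gap flags a spurious correlation routed through Z."""
    # Observational estimate: ignore Z entirely
    obs = mean([r for (pp, _, r) in records if pp == p])

    # Back-door adjusted estimate: weight per-stratum means by P(Z=z)
    zs = [z for (_, z, _) in records]
    adjusted = 0.0
    for z in set(zs):
        rs = [r for (pp, zz, r) in records if pp == p and zz == z]
        if rs:
            adjusted += mean(rs) * zs.count(z) / len(zs)
    return obs - adjusted

# Hypothetical records (perturbation P, sector Z, R_max); the perturbation
# hits the volatile sector more often, so Z confounds the P -> R_max link.
records = [
    ("noise", "volatile", 0.50),
    ("noise", "volatile", 0.54),
    ("noise", "stable",   0.20),
    ("none",  "volatile", 0.30),
    ("none",  "stable",   0.10),
    ("none",  "stable",   0.12),
]

gap = confounding_gap(records, "noise")  # positive: naive comparison overstates the effect
```

A model for which this gap stays near zero behaves like the specialized foundation models described above, whose errors are driven by the perturbation itself rather than by sector membership.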

To validate the practical usefulness of the rating system, a user study with 45 finance analysts and data scientists was conducted. Participants were shown forecast plots together with the generated robustness and fairness scores. The presence of the ratings reduced perceived difficulty in comparing models by an average of 27 % and accelerated decision‑making time, while 84 % of respondents reported increased confidence in selecting “black‑box” models based on the provided scores.

The authors conclude that a causally informed rating framework can serve as a third‑party certification tool, enabling stakeholders to assess model robustness and bias without access to model weights or training data. They outline future work directions, including extending the perturbation taxonomy to streaming data, enriching the causal graph with additional latent variables (e.g., macro‑economic news), and integrating the rating system into regulatory compliance pipelines for standardized AI model certification. Overall, the study demonstrates that combining causal inference with systematic perturbation testing yields a comprehensive, interpretable, and actionable measure of robustness for modern time‑series forecasting AI systems.

