EVEREST: An Evidential, Tail-Aware Transformer for Rare-Event Time-Series Forecasting
Forecasting rare events in multivariate time-series data is challenging due to severe class imbalance, long-range dependencies, and distributional uncertainty. We introduce EVEREST, a transformer-based architecture for probabilistic rare-event forecasting that delivers calibrated predictions and tail-aware risk estimation, with auxiliary interpretability via attention-based signal attribution. EVEREST integrates four components: (i) a learnable attention bottleneck for soft aggregation of temporal dynamics; (ii) an evidential head for estimating aleatoric and epistemic uncertainty via a Normal–Inverse–Gamma distribution; (iii) an extreme-value head that models tail risk using a Generalized Pareto Distribution; and (iv) a lightweight precursor head for early-event detection. These modules are jointly optimized with a composite loss (focal loss, evidential NLL, and a tail-sensitive EVT penalty) and act only at training time; deployment uses a single classification head with no inference overhead (approximately 0.81M parameters). On a decade of space-weather data, EVEREST achieves state-of-the-art True Skill Statistic (TSS) of 0.973/0.970/0.966 at 24/48/72-hour horizons for C-class flares. The model is compact, efficient to train on commodity hardware, and applicable to high-stakes domains such as industrial monitoring, weather, and satellite diagnostics. Limitations include reliance on fixed-length inputs and exclusion of image-based modalities, motivating future extensions to streaming and multimodal forecasting.
💡 Research Summary
The paper introduces EVEREST, a novel transformer‑based architecture designed for rare‑event forecasting in multivariate time‑series data. The authors identify three intertwined challenges that plague existing methods: (1) extreme class imbalance, which makes standard cross‑entropy loss ineffective for the scarce positive class; (2) long‑range temporal dependencies, where early precursors are diluted across long sequences; and (3) the need for calibrated probabilities and explicit tail‑risk assessment for high‑stakes decision making. EVEREST tackles all three simultaneously while keeping the inference footprint minimal.
The core of the model consists of a standard six‑layer transformer encoder that processes embedded time‑step tokens with scaled sinusoidal positional encodings. After the encoder, a single‑query attention bottleneck aggregates the sequence into a single latent vector z. The bottleneck learns a query vector w whose dot products with the per‑step encoder outputs are softmax‑normalized over time, focusing the representation on the most informative, often weak, early signals. It adds only d parameters (d ≈ 128) and O(T·d) operations, yet yields a substantial performance boost over naïve mean pooling.
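The pooling step above can be sketched in a few lines. This is an illustrative NumPy formulation, not the authors' code; the function name `attention_bottleneck` is hypothetical:

```python
import numpy as np

def attention_bottleneck(h, w):
    """Single-query attention pooling: a learned vector w scores each
    time step of the encoder output h (shape (T, d)); a softmax over
    those scores weights the steps into one latent vector z (shape (d,)).
    Adds only the d parameters of w on top of the encoder."""
    scores = h @ w                      # (T,) one score per time step
    scores = scores - scores.max()      # subtract max for numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()  # softmax over time
    z = attn @ h                        # weighted average of time steps
    return z, attn
```

Note that a zero query vector recovers plain mean pooling (uniform weights), which makes the ablation comparison against mean pooling a strict special case of this module.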
Four parallel heads are attached to z: (i) a primary binary classification head that outputs a logit l and probability p̂ = σ(l); (ii) an evidential head that predicts Normal‑Inverse‑Gamma (NIG) parameters (μ, v, α, β) over the logit, providing a closed‑form predictive mean and variance and thereby decomposing aleatoric and epistemic uncertainty without Monte‑Carlo sampling; (iii) an extreme‑value theory (EVT) head that fits a Generalized Pareto Distribution (GPD) to logit exceedances above a high batchwise quantile u (90 % by default), maximizing the GPD log‑likelihood to shift gradient mass toward the far tail; and (iv) a lightweight precursor head trained with binary cross‑entropy on the same label, encouraging the backbone to encode early discriminative cues. Crucially, the evidential, EVT, and precursor heads are used only during training; at test time the model reduces to the single classification head, preserving the same runtime cost as a vanilla transformer.
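The closed‑form quantities used by the evidential and EVT heads can be sketched as follows. This assumes the positivity constraints common in deep evidential regression (softplus mappings, α > 1) and a GPD with positive shape parameter; the helper names and exact parameter mappings are assumptions, not the paper's implementation:

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + e^x), used to enforce positivity."""
    return np.logaddexp(0.0, x)

def nig_moments(mu, logv, logalpha, logbeta):
    """Map raw evidential-head outputs to NIG parameters and return the
    closed-form uncertainty decomposition over the logit:
      aleatoric  E[sigma^2] = beta / (alpha - 1)
      epistemic  Var[mu]    = beta / (v * (alpha - 1))
    No Monte-Carlo sampling is needed."""
    v = softplus(logv)                  # v > 0
    alpha = softplus(logalpha) + 1.0    # alpha > 1 so moments exist
    beta = softplus(logbeta)            # beta > 0
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (v * (alpha - 1.0))
    return mu, aleatoric, epistemic

def gpd_nll(excess, xi, sigma):
    """Mean negative log-likelihood of the Generalized Pareto Distribution
    for exceedances excess = logit - u > 0 above a high quantile u.
    Assumes shape xi > 0 and scale sigma > 0."""
    t = 1.0 + xi * excess / sigma
    return np.mean(np.log(sigma) + (1.0 + 1.0 / xi) * np.log(t))
```

Minimizing `gpd_nll` over the batch's top‑decile logits is what shifts gradient mass toward the far tail during training.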
Training optimizes a composite loss L = λ_f L_focal + λ_e L_evid + λ_t L_evt + λ_p L_prec. The focal component addresses class imbalance by re‑weighting hard examples, with the focusing parameter γ annealed from 0 to 2 over the first 50 epochs. The evidential term regularizes calibration via the NIG negative log‑likelihood. The EVT term encourages tail‑aware learning by fitting the GPD to exceedances, and the precursor term enriches the information bottleneck by adding anticipatory supervision. The authors set λ_f = 0.8, λ_e = 0.1, λ_t = 0.1, λ_p = 0.05 in their reference configuration, noting that only relative ratios matter.
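The composite objective and the γ annealing schedule can be sketched as below, using the reference weights from the text; the helper functions are illustrative stand‑ins for the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def focal_loss(logits, targets, gamma):
    """Binary focal loss: scales the per-example BCE by (1 - p_t)^gamma,
    so easy, well-classified examples contribute less. gamma = 0
    recovers plain binary cross-entropy."""
    p = sigmoid(logits)
    p_t = np.where(targets > 0.5, p, 1.0 - p)   # prob. of the true class
    bce = -np.log(np.clip(p_t, 1e-12, 1.0))
    return np.mean((1.0 - p_t) ** gamma * bce)

def annealed_gamma(epoch, max_gamma=2.0, warmup=50):
    """Linearly anneal the focusing parameter from 0 to max_gamma over
    the first `warmup` epochs, as described in the text."""
    return max_gamma * min(epoch / warmup, 1.0)

def composite_loss(l_focal, l_evid, l_evt, l_prec,
                   lf=0.8, le=0.1, lt=0.1, lp=0.05):
    """Weighted sum of the four training objectives; only the relative
    ratios of the weights matter."""
    return lf * l_focal + le * l_evid + lt * l_evt + lp * l_prec
```

Annealing γ from 0 means training starts as ordinary cross‑entropy and only gradually concentrates on hard (typically rare‑positive) examples.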
Experiments are conducted on a decade of SHARP‑GOES solar‑flare data (2010‑2023) covering six flare classes and three forecast horizons (24, 48, 72 h), as well as on the industrial anomaly dataset SKAB. EVEREST achieves state‑of‑the‑art True Skill Statistic (TSS) scores: 0.973/0.970/0.966 for C‑class flares at 24/48/72 h, and 0.907/0.936/0.966 for ≥ M5 flares. Calibration is similarly strong, with Expected Calibration Error (ECE) as low as 0.016 on the hardest tasks. On SKAB, the model reaches an F1 of 98.16 % and a TSS of 0.964, surpassing strong baselines such as TranAD.
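For reference, the TSS reported throughout is computed from the confusion matrix as hit rate minus false‑alarm rate:

```python
def true_skill_statistic(tp, fn, fp, tn):
    """TSS = hit rate - false-alarm rate
           = TP / (TP + FN) - FP / (FP + TN).
    Ranges from -1 to 1, with 0 meaning no skill. Unlike accuracy or F1,
    it is insensitive to the event/non-event class ratio, which is why it
    is the standard score for rare-event forecasts."""
    return tp / (tp + fn) - fp / (fp + tn)
```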
Ablation studies demonstrate the contribution of each component: replacing the attention bottleneck with mean pooling drops TSS by up to 0.427 on the hardest task; removing the precursor head reduces TSS by 0.65 for M5‑72 h; the evidential head contributes a modest TSS gain of 0.064, and omitting it slightly degrades calibration; and excluding the EVT head harms tail sensitivity. Sensitivity analyses over the loss weights and the exceedance quantile show that performance remains robust across wide hyperparameter ranges, confirming that the auxiliary heads act as regularizers rather than fragile knobs.
The model is lightweight, containing roughly 8.14 × 10⁵ parameters (≈0.81 M) and 1.66 × 10⁷ FLOPs per forward pass. Because auxiliary heads are stripped at inference, deployment incurs no additional memory or latency overhead compared to a standard transformer of similar size.
Limitations are acknowledged: EVEREST assumes fixed‑length input windows and does not process image‑based modalities, which restricts its applicability to streaming or multimodal scenarios. The authors propose future work on streaming attention mechanisms, multimodal fusion (e.g., satellite imagery + telemetry), and contrastive precursor objectives to further enhance early detection.
In summary, EVEREST presents a compact, end‑to‑end solution that simultaneously improves discrimination, probability calibration, and tail‑risk awareness for rare‑event forecasting. Its training‑only auxiliary architecture offers a practical pathway to high‑performance, uncertainty‑aware models suitable for high‑stakes domains such as space weather prediction, industrial monitoring, and satellite health diagnostics.