Forecasting Epileptic Seizures from Contactless Camera via Cross-Species Transfer Learning
Epileptic seizure forecasting is a clinically important yet challenging problem in epilepsy research. Existing approaches predominantly rely on neural signals such as electroencephalography (EEG), which require specialized equipment and limit long-term deployment in real-world settings. In contrast, video data provide a non-invasive and accessible alternative, yet existing video-based studies mainly focus on post-onset seizure detection, leaving seizure forecasting largely unexplored. In this work, we formulate a novel task of video-based epileptic seizure forecasting, where short pre-ictal video segments (3-10 seconds) are used to predict whether a seizure will occur within the subsequent 5 seconds. To address the scarcity of annotated human epilepsy videos, we propose a cross-species transfer learning framework that leverages large-scale rodent video data for auxiliary pretraining. This enables the model to capture seizure-related behavioral dynamics that generalize across species. Experimental results demonstrate that our approach achieves over 70% prediction accuracy under a strictly video-only setting and outperforms existing baselines. These findings highlight the potential of cross-species learning for building non-invasive, scalable early-warning systems for epilepsy.
💡 Research Summary
This paper introduces a novel, clinically relevant task: video‑only epileptic seizure forecasting. Instead of detecting seizures after they start, the authors aim to predict whether a seizure will occur within a 5‑second horizon using only a short pre‑ictal video clip of 3–10 seconds. The key challenge is the extreme scarcity of annotated human epilepsy videos, which the authors address by leveraging large‑scale rodent video recordings through a cross‑species transfer learning framework.
The technical core is a two‑stage pipeline built on VideoMAE (Video Masked Autoencoder). In Stage 1, a self‑supervised pre‑training phase, the model learns spatio‑temporal representations by randomly masking “tubes” of video patches (spanning both spatial and temporal dimensions) and reconstructing the missing pixels with a mean‑squared‑error loss. The pre‑training dataset combines the publicly available RodEpil mouse dataset (≈13,000 ten‑second clips, including 2,952 seizure episodes) and a modest collection of 1,870 non‑seizure human clips recorded in a hospital setting. This domain‑specific continual pre‑training forces the encoder to capture motion patterns characteristic of seizures while preserving generic human pose information.
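The masking-and-reconstruction objective can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the `tube_mask` and `masked_mse` names and the 14×14 patch grid (196 patches per frame, typical of a ViT-B backbone) are assumptions for the example.

```python
import numpy as np

def tube_mask(num_patches_per_frame, mask_ratio, rng):
    """VideoMAE-style tube masking: the SAME spatial patches are masked in
    every frame, so the model cannot trivially copy pixels across time."""
    num_masked = int(num_patches_per_frame * mask_ratio)
    perm = rng.permutation(num_patches_per_frame)
    mask = np.zeros(num_patches_per_frame, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask  # shape (P,), broadcast over the time axis

def masked_mse(pred, target, mask):
    """Mean-squared error computed only on masked patches, as in masked
    autoencoding. pred/target: (T, P, D) arrays of per-patch pixels."""
    diff = (pred - target)[:, mask, :]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
P = 196                       # assumed 14x14 patch grid per frame
mask = tube_mask(P, 0.6, rng)
print(int(mask.sum()))        # 117 masked patches per frame
```

Tube masking (rather than per-frame random masking) is what forces the encoder to model motion dynamics, which is plausibly why the learned features transfer to seizure-related movement patterns.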
After pre‑training, the decoder is discarded and the encoder weights are transferred to the downstream forecasting task (Stage 2). Each short human clip is fed through the encoder, and the global CLS token summarizing the clip’s dynamics is passed to a lightweight linear classification head with a sigmoid activation, yielding a probability of seizure onset. To mimic realistic clinical constraints, the authors evaluate the model under few‑shot conditions (2‑, 3‑, and 4‑shot fine‑tuning), where only a handful of labeled human pre‑ictal examples are available for each run. Training uses a binary cross‑entropy loss, 16‑bit mixed‑precision arithmetic, and gradient checkpointing for memory efficiency.
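The forecasting head itself is tiny. The sketch below shows the general shape of such a head in NumPy, under assumptions not stated in the summary: a 768-dimensional CLS embedding (ViT-B width) and illustrative names (`forecast_prob`, `bce_loss`).

```python
import numpy as np

def sigmoid(z):
    """Logistic squashing to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forecast_prob(cls_token, w, b):
    """A single linear layer on the encoder's CLS token, followed by a
    sigmoid, giving the probability that a seizure starts within 5 s."""
    return float(sigmoid(cls_token @ w + b))

def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy between predicted probability p and label y."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

rng = np.random.default_rng(0)
d = 768                               # assumed ViT-B embedding width
cls = rng.standard_normal(d)          # stand-in for an encoder output
w = rng.standard_normal(d) / np.sqrt(d)
p = forecast_prob(cls, w, 0.0)        # probability of imminent seizure
loss = bce_loss(p, 1)                 # loss if the true label is "seizure"
```

Keeping the trainable head this small is what makes 2- to 4-shot fine-tuning feasible: almost all capacity lives in the frozen-or-lightly-tuned pretrained encoder.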
Experimental results are compelling. Across all few‑shot settings, the proposed method achieves an average balanced accuracy of 0.71, ROC‑AUC of 0.75, and PR‑AUC of 0.71, substantially outperforming strong video‑understanding baselines such as CSN, X3D, and SlowFast (which hover around 0.35–0.55 balanced accuracy). Ablation studies reveal that the cross‑species pre‑training is the primary driver of performance: using only human data for pre‑training yields a balanced accuracy of ~0.48, whereas adding rodent data raises it to >0.70. Moreover, a masking ratio between 0.5 and 0.7 provides the best trade‑off between forcing high‑level semantic learning and retaining sufficient visual information.
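Balanced accuracy is the headline metric here because pre-ictal clips are rare, and plain accuracy would reward a classifier that never predicts a seizure. A minimal NumPy implementation (illustrative, not the authors' evaluation code) makes the point:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: each class counts equally regardless of
    how many examples it has, unlike plain accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# With 90% negatives, always predicting "no seizure" is 90% accurate...
y_true = [0] * 90 + [1] * 10
y_neg  = [0] * 100
print(balanced_accuracy(y_true, y_neg))   # 0.5: the trivial baseline is exposed
```

Under this metric a chance-level predictor scores 0.5, so the reported 0.71 for the proposed method versus 0.35–0.55 for the CSN/X3D/SlowFast baselines means some baselines sit at or below chance in this few-shot regime.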
The paper also discusses limitations. The human dataset comprises recordings from only six patients, raising concerns about generalizability across ages, skin tones, lighting conditions, and camera viewpoints. The 5‑second prediction window, while useful for proof‑of‑concept, may be too short for practical interventions such as medication administration. Finally, robustness to environmental variations (e.g., background clutter, occlusions) is not thoroughly evaluated.
In conclusion, the authors demonstrate that cross‑species self‑supervised pre‑training can endow video models with seizure‑relevant behavioral priors, enabling accurate few‑shot seizure forecasting from purely visual input. This opens a promising pathway toward non‑invasive, scalable early‑warning systems for epilepsy. Future work should expand the human dataset, explore longer prediction horizons, integrate multimodal signals (EEG, heart rate, accelerometry), and test the approach in real‑time clinical deployments.