Exploring EEG and Eye Movement Fusion for Multi-Class Target RSVP-BCI

Notice: This research summary and analysis were generated automatically using AI technology. For full accuracy, please refer to the original arXiv paper.

Rapid Serial Visual Presentation (RSVP)-based Brain-Computer Interfaces (BCIs) facilitate high-throughput target image detection by identifying event-related potentials (ERPs) evoked in EEG signals. RSVP-BCI systems effectively detect single-class targets within a stream of images but have limited applicability in scenarios that require detecting multiple target categories. Multi-class RSVP-BCI systems address this limitation by simultaneously identifying the presence of a target and distinguishing its category. However, existing multi-class RSVP decoding algorithms predominantly rely on single-modality EEG decoding, which limits further performance gains due to the high similarity between ERPs evoked by different target categories. In this work, we introduce the eye movement (EM) modality into multi-class RSVP decoding and explore EEG and EM fusion to enhance decoding performance. First, we design three independent multi-class target RSVP tasks and build an open-source dataset comprising EEG and EM signals from 43 subjects. Then, we propose the Multi-class Target RSVP EEG and EM fusion Network (MTREE-Net) to enhance multi-class RSVP decoding. Specifically, a dual-complementary module is proposed to strengthen the differentiation of uni-modal features across categories. To improve multi-modal fusion performance, we adopt a dynamic reweighting fusion strategy guided by theoretically derived modality contribution ratios. Furthermore, we reduce the misclassification of non-target samples through knowledge transfer between two hierarchical classifiers. Extensive experiments demonstrate the feasibility of integrating EM signals into multi-class RSVP decoding and highlight the superior performance of MTREE-Net compared to existing RSVP decoding methods. The proposed MTREE-Net and open-source dataset provide a promising framework for developing practical multi-class RSVP-BCI systems.


💡 Research Summary

Rapid Serial Visual Presentation (RSVP) has become a popular paradigm for high‑throughput brain‑computer interfaces (BCIs) because it elicits clear event‑related potentials (ERPs), especially the P300 component, when a rare target appears among a stream of images. Most existing RSVP‑BCI systems, however, are limited to binary classification: they detect whether a target is present or not. Real‑world applications such as remote‑sensing surveillance, security screening, or assistive communication often require the system to recognize multiple target categories simultaneously. This paper addresses that gap by (1) constructing the first open‑source multi‑class RSVP dataset that includes both electroencephalography (EEG) and eye‑movement (EM) recordings, and (2) proposing a novel deep‑learning architecture, the Multi‑class Target RSVP EEG‑EM Fusion Network (MTREE‑Net), which fuses the two modalities in a principled way to improve classification performance.

Dataset
The authors recruited 43 healthy participants (average age 23.8 ± 2.4 years, 24 females) and simultaneously recorded EEG (64 channels, 500 Hz) and EM signals (eye position and pupil diameter, 250 Hz) while participants performed three independent RSVP tasks (Tasks A, B, and C). Each task used remote-sensing images from the DIOR dataset and defined two visually similar but semantically distinct target classes:

  • Task A: non‑civil vs. civil aircraft,
  • Task B: storage tanks vs. centers,
  • Task C: harbors vs. parking lots.
Target images appeared sparsely (≈5 % of the stream) and at random positions. The resulting dataset, publicly available at DOI 10.57760/sciencedb.17705, contains synchronized EEG‑EM trials for each subject and each task, providing a valuable benchmark for multi‑modal, multi‑class RSVP research.

MTREE‑Net Architecture
MTREE‑Net consists of four main components:

  1. Modality‑specific feature extractors

    • EEG branch: multi‑scale 1‑D convolutions with kernel sizes 3, 7, 15 capture short‑ and long‑range temporal ERP patterns.
    • EM branch: a single‑layer 1‑D convolution (kernel 5) efficiently extracts eye‑movement dynamics such as saccades, fixations, and pupil dilations.
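As a rough illustration of the two branches, the multi‑scale temporal filtering can be sketched in NumPy with fixed averaging kernels standing in for the network's learned convolution weights (a simplified, single‑channel sketch, not the authors' implementation; the function names are placeholders):

```python
import numpy as np

def conv1d_same(x, kernel):
    """'Same'-padded 1-D convolution of a single-channel signal."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad, mode="edge")
    return np.convolve(xp, kernel, mode="valid")[: len(x)]

def multiscale_eeg_features(eeg, kernel_sizes=(3, 7, 15)):
    """EEG branch: concatenate features from several temporal scales.

    eeg: (time,) single-channel signal; the averaging kernels are
    illustrative stand-ins for learned filters.
    """
    feats = [conv1d_same(eeg, np.ones(k) / k) for k in kernel_sizes]
    return np.stack(feats)              # (num_scales, time)

def em_features(em, kernel_size=5):
    """EM branch: a single temporal scale (kernel 5)."""
    return conv1d_same(em, np.ones(kernel_size) / kernel_size)

rng = np.random.default_rng(0)
eeg = rng.standard_normal(500)          # one second of EEG at 500 Hz
em = rng.standard_normal(250)           # one second of EM at 250 Hz
print(multiscale_eeg_features(eeg).shape)   # (3, 500)
print(em_features(em).shape)                # (250,)
```

The short kernels respond to transient ERP components while the longer ones capture slower trends, which is the intuition behind using multiple kernel sizes in parallel.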
  2. Dual‑Complementary Module
    This module explicitly encourages cross‑modal complementarity. The EEG feature map is multiplied by a learned attention derived from EM features, and vice‑versa, thereby amplifying discriminative cues that are weak in one modality but strong in the other. The result is a set of enriched representations that are more separable across target categories.
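The cross‑modal gating described above can be sketched as follows (a minimal NumPy sketch assuming simple sigmoid‑gated linear projections; the matrices `W_eeg` and `W_em` are hypothetical stand‑ins for the module's learned parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_complementary(eeg_feat, em_feat, W_eeg, W_em):
    """Cross-modal gating: each modality is rescaled by an attention map
    derived from the other, amplifying cues that are weak in one modality
    but strong in the other.

    eeg_feat, em_feat: (d,) feature vectors; W_eeg, W_em: (d, d) projections.
    """
    attn_from_em = sigmoid(em_feat @ W_em)    # gate applied to the EEG branch
    attn_from_eeg = sigmoid(eeg_feat @ W_eeg) # gate applied to the EM branch
    eeg_enhanced = eeg_feat * attn_from_em
    em_enhanced = em_feat * attn_from_eeg
    return eeg_enhanced, em_enhanced
```

Because each gate is computed from the *other* modality, a category cue that is faint in the EEG but salient in the eye movements (or vice versa) can still boost the corresponding feature dimensions.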

  3. Contribution‑Guided Dynamic Reweighting Fusion
    Existing multimodal fusion methods typically learn a static weighting based only on the final classification loss, ignoring the intrinsic discriminative power of each modality. The authors analytically derive a contribution ratio γ for EEG and EM by measuring each modality’s marginal impact on the loss during early training epochs. These ratios are fed into a dynamic reweighting block that adaptively scales the fused representation at each training step. Consequently, when EEG is more informative (e.g., strong P300), its weight increases, while EM gains prominence when eye‑movement cues dominate.
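The reweighting idea can be illustrated with a toy rule in which a modality's weight grows as its loss shrinks (an illustrative inverse‑loss normalization, not the paper's exact derivation of γ):

```python
import numpy as np

def contribution_ratios(loss_eeg, loss_em, eps=1e-8):
    """Turn per-modality losses into contribution ratios gamma: the
    modality with the lower loss (i.e., the more informative one)
    receives the larger weight. The ratios sum to 1."""
    inv = np.array([1.0 / (loss_eeg + eps), 1.0 / (loss_em + eps)])
    return inv / inv.sum()

def reweighted_fusion(eeg_feat, em_feat, gamma):
    """Fuse the two feature vectors with the contribution-guided weights."""
    return gamma[0] * eeg_feat + gamma[1] * em_feat
```

For example, if the EEG branch's loss is 0.2 and the EM branch's is 0.8, the ratios come out to (0.8, 0.2), so the fused representation leans on the EEG features, matching the behavior described above when a strong P300 is present.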

  4. Hierarchical Self‑Distillation
    Multi‑class RSVP data contain three logical classes: target‑1, target‑2, and non‑target. MTREE‑Net first trains a binary classifier (target vs. non‑target). Its logits are then used as soft teacher signals for a subsequent three‑class classifier. A KL‑divergence loss aligns the two classifiers, encouraging the fine‑grained classifier to respect the coarse decision boundary. This hierarchy reduces the rate at which non‑target trials are mistakenly assigned to a target class.
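The distillation loss between the two classifiers can be sketched by pooling the fine classifier's two target probabilities into a binary distribution and measuring its KL divergence from the coarse classifier (a simplified NumPy sketch of the idea; the exact loss formulation in the paper may differ):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def hierarchical_distill_loss(binary_logits, triple_logits):
    """Align the coarse teacher (target vs. non-target) with the fine
    three-class student: the student's target-1/target-2 mass is pooled
    into a binary distribution and pulled toward the teacher via KL."""
    teacher = softmax(binary_logits)    # [p(target), p(non-target)]
    student = softmax(triple_logits)    # [p(t1), p(t2), p(non-target)]
    pooled = np.array([student[0] + student[1], student[2]])
    return kl_div(teacher, pooled)
```

When the pooled student distribution matches the teacher exactly, the loss is zero; any student that assigns target probability to a trial the teacher calls non‑target is penalized, which is how the hierarchy suppresses false target assignments.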

Experimental Evaluation
The authors benchmark MTREE‑Net against several strong baselines: EEG‑only networks (EEGNet, MS‑CNN, MDCNet) and existing EEG‑EM fusion models (CMGFNet, FGFRNet). Evaluation metrics include overall accuracy, macro‑averaged F1‑score, area under the ROC curve (AUC), and non‑target misclassification rate. Across all three tasks, MTREE‑Net achieves an average accuracy of 92.3 % (vs. 84.7 % for the best baseline), a macro‑F1 improvement of ~5 %, and a 30 % reduction in non‑target misclassification. Ablation studies confirm that each component—dual‑complementary module, contribution‑guided reweighting, and hierarchical distillation—contributes significantly; removing dynamic reweighting drops accuracy by ~2 %, while omitting the dual‑complementary module reduces F1 by ~3 %. Subject‑wise analysis shows reduced inter‑subject variability, indicating that the model robustly handles individual differences in ERP amplitude and eye‑movement patterns.
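Two of the metrics above are worth spelling out, since the non‑target misclassification rate is less standard than accuracy or AUC. A minimal NumPy sketch (using class labels 0 = target‑1, 1 = target‑2, 2 = non‑target as an assumed convention):

```python
import numpy as np

def macro_f1(y_true, y_pred, classes=(0, 1, 2)):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight,
    so rare target classes count as much as the abundant non-target class."""
    f1s = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

def nontarget_misclassification_rate(y_true, y_pred, nontarget=2):
    """Fraction of true non-target trials assigned to either target class."""
    mask = y_true == nontarget
    return float(np.mean(y_pred[mask] != nontarget))
```

Macro averaging matters here because non‑target trials dominate the stream (≈95 %), so a plain accuracy score would reward a classifier that rarely predicts targets at all.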

Discussion and Limitations
The study demonstrates that (i) eye‑movement signals contain complementary information useful for distinguishing multiple target categories in RSVP, (ii) theoretically derived contribution ratios provide a principled way to balance modalities during fusion, and (iii) hierarchical self‑distillation effectively mitigates the common problem of non‑target samples being classified as targets. Limitations include the offline nature of the experiments; real‑time deployment would require latency optimization and possibly model compression. Moreover, the current tasks involve only two target classes; extending to three or more categories and to more complex visual scenes will be necessary to fully validate scalability. Future work could also explore additional physiological modalities (e.g., fNIRS, GSR) and investigate adaptive online learning to personalize the system to each user.

Conclusion
By releasing a high‑quality multi‑modal, multi‑class RSVP dataset and introducing MTREE‑Net—a network that combines dual‑complementary feature enhancement, contribution‑guided dynamic fusion, and hierarchical self‑distillation—the authors substantially advance the state of the art in multi‑class RSVP‑BCI. The proposed framework not only outperforms existing EEG‑only and EEG‑EM fusion methods but also provides a flexible foundation for future research on real‑time, multi‑category BCI applications in fields such as remote sensing, security monitoring, and assistive technology.

