MEGConformer: Conformer-Based MEG Decoder for Robust Speech and Phoneme Classification


Decoding speech-related information from non-invasive MEG is a key step toward scalable brain-computer interfaces. We present compact Conformer-based decoders on the LibriBrain 2025 PNPL benchmark for two core tasks: Speech Detection and Phoneme Classification. Our approach adapts a compact Conformer to raw 306-channel MEG signals, with a lightweight convolutional projection layer and task-specific heads. For Speech Detection, a MEG-oriented SpecAugment provided a first exploration of MEG-specific augmentation. For Phoneme Classification, we used inverse-square-root class weighting and a dynamic grouping loader to handle 100-sample averaged examples. In addition, a simple instance-level normalization proved critical to mitigate distribution shifts on the holdout split. Using the official Standard track splits and F1-macro for model selection, our best systems achieved 88.9% (Speech) and 65.8% (Phoneme) on the leaderboard, winning the Phoneme Classification Standard track. Technical documentation, source code, and checkpoints are available at https://github.com/neural2speech/libribrain-experiments.


💡 Research Summary

The paper presents a novel approach for decoding speech-related information from non‑invasive magnetoencephalography (MEG) recordings by adapting a compact Conformer architecture—originally designed for automatic speech recognition—to raw 306‑channel MEG data. The authors evaluate their method on the LibriBrain 2025 PNPL benchmark, which provides over 50 hours of within‑subject MEG recordings collected while a participant listened to the Sherlock Holmes audiobooks. Two core tasks are defined: Speech Detection (binary classification of speech versus silence) and Phoneme Classification (39‑class phoneme labeling).

Model Architecture
Both tasks share a single backbone: a Conformer encoder preceded by a lightweight 1‑D convolutional projection that maps the 306 sensor channels to a 144‑dimensional feature space compatible with the Conformer. For Speech Detection, the authors adopt a “Conformer Small” configuration (16 layers, 4 attention heads, feed‑forward dimension 576, depthwise convolution kernel size 31). For Phoneme Classification, a custom Conformer with fewer layers (7) but more heads (12) and a larger feed‑forward dimension (2048) and kernel size (127) is used to better fit the shorter 0.5 s windows. Dropout (p = 0.1) and a final linear classifier complete the architecture.
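The two configurations described above can be summarized side by side. This is an illustrative sketch only: the field names are ours, not identifiers from the authors' released code.

```python
# Hyperparameters of the two Conformer variants as described in the text.
# Dictionary keys are illustrative names, not the authors' actual config schema.

speech_cfg = {
    "input_channels": 306,    # raw MEG sensor channels
    "proj_dim": 144,          # output of the 1-D conv projection
    "num_layers": 16,         # "Conformer Small"
    "num_heads": 4,
    "ffn_dim": 576,
    "conv_kernel_size": 31,   # depthwise convolution kernel
    "dropout": 0.1,
}

phoneme_cfg = {
    "input_channels": 306,
    "proj_dim": 144,
    "num_layers": 7,          # fewer layers for the shorter 0.5 s windows
    "num_heads": 12,
    "ffn_dim": 2048,
    "conv_kernel_size": 127,  # larger kernel than the speech variant
    "dropout": 0.1,
}
```

Both configurations end in a final linear classifier over the task's label set (2 classes for speech vs. silence, 39 for phonemes).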

Task‑Specific Enhancements

Speech Detection
– Input windows of 2.5 s (625 samples at 250 Hz) are extracted with a stride of 60 samples during training, providing overlapping segments and increasing data diversity.
– A MEG‑specific variant of SpecAugment, called MEGAugment, is introduced. It applies two random time masks (max width 180 samples) and band‑stop masks that notch out narrow frequency bands corresponding to canonical EEG/MEG rhythms (Theta, Alpha, Beta, Gamma, HGA) using fourth‑order IIR filters (probability = 0.4).
– Binary cross‑entropy with logits is used as the loss, combined with label smoothing (ε = 0.1).
– At inference, a post‑processing step removes speech predictions shorter than 60 samples (≈240 ms) to smooth the output.
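The post-processing step in the last bullet can be sketched as a minimal run-length filter over frame-level predictions (our own sketch of the described behavior, not the authors' implementation):

```python
def suppress_short_speech_runs(preds, min_len=60):
    """Zero out runs of consecutive speech predictions (1s) shorter than
    min_len samples. At 250 Hz, 60 samples is roughly 240 ms, matching the
    smoothing step described above. `preds` is a list of 0/1 frame labels."""
    out = list(preds)
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 1:
            # find the end of this run of 1s
            j = i
            while j < n and out[j] == 1:
                j += 1
            if j - i < min_len:
                for k in range(i, j):
                    out[k] = 0  # run too short: relabel as silence
            i = j
        else:
            i += 1
    return out
```

For example, `suppress_short_speech_runs([0, 1, 1, 0], min_len=3)` relabels the two-sample speech burst as silence, while runs of at least `min_len` samples pass through unchanged.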

Phoneme Classification
– Input windows are 0.5 s (125 samples). Because the competition provides holdout examples averaged over 100 raw samples, the training pipeline includes a dynamic grouping loader that randomly assembles 100‑sample averages per class each epoch, preserving temporal locality while exposing the model to many distinct averages.
– Class imbalance is mitigated by inverse‑square‑root class weighting (w_c ∝ 1/√n_c).
– Instance‑level normalization (InstanceNorm1d without running statistics or affine parameters) is applied per window, normalizing each channel using its own mean and variance computed over the time axis. This simple step dramatically reduces the distribution shift observed between the public splits and the hidden holdout set.
– Cross‑entropy loss is used for training.
– Five best seeds are ensembled at inference, and majority voting determines the final phoneme label.
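Two of the ingredients above, the per-window normalization and the inverse-square-root class weights, are simple enough to sketch directly. This is our own minimal rendering of the described operations; in particular, the normalization of the weights to mean 1 is our choice, since the paper states only the proportionality w_c ∝ 1/√n_c.

```python
import math

def instance_normalize(window, eps=1e-5):
    """Normalize each channel of one window by its own mean and variance
    computed over the time axis, with no running statistics and no affine
    parameters (the InstanceNorm1d behavior described above).
    `window` is a list of channels, each a list of time samples."""
    normed = []
    for channel in window:
        t = len(channel)
        mean = sum(channel) / t
        var = sum((x - mean) ** 2 for x in channel) / t
        std = math.sqrt(var + eps)
        normed.append([(x - mean) / std for x in channel])
    return normed

def inverse_sqrt_class_weights(class_counts):
    """Weights w_c proportional to 1/sqrt(n_c), rescaled so the weights
    average to 1 (rescaling is our assumption, not stated in the paper)."""
    raw = {c: 1.0 / math.sqrt(n) for c, n in class_counts.items()}
    scale = len(raw) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}
```

Because each window is normalized against its own statistics, the model never relies on absolute signal scale, which is what makes this step effective against the distribution shift between the public splits and the hidden holdout set.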

Training Details
All experiments follow the official LibriBrain data splits (train, validation, test, holdout). The AdamW optimizer (lr = 1e‑4, weight decay = 5e‑2) is used with a batch size of 256. Early stopping monitors validation F1‑macro with a patience of 10 epochs. Ten random seeds are trained for each configuration; the best checkpoint per seed is saved.
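The early-stopping rule above can be captured by a small patience tracker; this is a generic sketch of the stated setup (patience of 10 epochs on validation F1-macro), not the authors' actual training loop.

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs without an
    improvement in validation F1-macro, as in the setup described above."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_f1_macro):
        """Record one epoch's validation score; return True to stop."""
        if val_f1_macro > self.best:
            self.best = val_f1_macro   # best checkpoint would be saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

One such tracker would run per seed, with the best checkpoint of each of the ten seeds retained for model selection.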

Results
Speech Detection achieves an F1‑macro of 88.9 % on the holdout set, far surpassing the baseline (68.0 %). Phoneme Classification reaches 65.8 % F1‑macro on holdout, winning the Standard track and outperforming many Extended‑track submissions that were allowed additional data. Ablation studies, evaluated with Wilcoxon signed‑rank tests (p ≤ 0.01 for significance), reveal that:
– Extending the window from 0.5 s to 2.5 s yields the largest gain (+10.8 % relative).
– Reducing the training stride to 60 samples improves performance (+2.8 %).
– MEGAugment provides a modest but statistically significant boost in early model versions (+1.8 %).
– For phonemes, dynamic grouping contributes the biggest improvement (+13.3 % relative).
– Instance‑level normalization is essential for holdout generalization, delivering >200 % relative improvement compared to batch or layer normalization.

Additional analyses include data‑size scaling (showing diminishing returns beyond ~30 h of training data), frequency‑band contribution studies (highlighting the importance of high‑gamma activity), and a distribution‑shift assessment confirming that the holdout set exhibits distinct statistical properties that are largely mitigated by instance normalization.

Conclusion
The work demonstrates that modern ASR architectures, when appropriately adapted, can serve as powerful decoders for non‑invasive neural recordings. By combining a compact Conformer backbone with MEG‑specific preprocessing, augmentation, and normalization strategies, the authors achieve state‑of‑the‑art performance on both speech detection and phoneme classification tasks. This establishes a solid baseline for future brain‑computer interface research that aims to decode fine‑grained linguistic information from MEG or similar high‑dimensional neurophysiological signals.

