Brain-Informed Speech Separation for Cochlear Implants
We propose a brain-informed speech separation method for cochlear implants (CIs) that uses electroencephalography (EEG)-derived attention cues to guide enhancement toward the attended speaker. An attention-guided network fuses audio mixtures with EEG features through a lightweight fusion layer, producing attended-source electrodograms for CI stimulation while resolving the label-permutation ambiguity of audio-only separators. Robustness to degraded attention cues is improved with a mixed curriculum that varies cue quality during training, yielding stable gains even when EEG-speech correlation is moderate. In multi-talker conditions, the model achieves higher signal-to-interference ratio improvements than an audio-only electrodogram baseline while remaining slightly smaller (167k vs. 171k parameters). With 2 ms algorithmic latency and comparable cost, the approach highlights the promise of coupling auditory and neural cues for cognitively adaptive CI processing.
💡 Research Summary
The paper introduces a novel brain‑informed speech‑separation system specifically designed for cochlear implant (CI) processors. Conventional audio‑only separation front‑ends map acoustic mixtures to electrodograms but emit one output stream per competing talker. This creates a label‑permutation problem and, more importantly, gives the device no way of knowing which speaker the user is trying to attend to. The authors address both issues by integrating an EEG‑derived attention cue into the separation pipeline, thereby steering the enhancement toward the attended speaker and outputting a single attended electrodogram.
The architecture builds on the lightweight DeepACE family. An audio encoder (E_a) transforms the raw mixture waveform into a 64‑channel feature map (F_audio). In parallel, an EEG encoder (E_e) processes the attention cue (derived from EEG) into a feature map of identical dimensions (F_EEG). A fusion layer performs element‑wise multiplication of these two maps (F_fused = F_audio ⊙ F_EEG), a computationally cheap operation that nevertheless enables strong multimodal interaction. The fused representation is fed into a temporal convolutional network (TCN) that predicts a single mask M. This mask is applied to the audio features, and a decoder reconstructs the attended electrodogram (ˆp_att). Post‑processing (band selection and mapping to stimulation levels) follows the same procedure as the baseline, ensuring a fair comparison. Because only one output is produced, the system eliminates the permutation ambiguity inherent in dual‑output audio‑only separators.
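The fusion step described above can be sketched in a few lines. This is a minimal NumPy illustration of the data flow only: the random filter matrices stand in for the trained encoders E_a and E_e, the toy strided-frame encoder stands in for the learned convolutional encoders, and a sigmoid stands in for the TCN mask estimator; filter sizes and strides are assumptions, not the paper's DeepACE configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_encode(signal, weights, kernel=16, stride=8):
    """Toy strided encoder: slide a window over the 1-D signal and apply
    64 linear filters followed by ReLU. Stands in for a learned Conv1d."""
    n_frames = (len(signal) - kernel) // stride + 1
    frames = np.stack([signal[i * stride : i * stride + kernel]
                       for i in range(n_frames)])          # (n_frames, kernel)
    return np.maximum(frames @ weights.T, 0.0)             # (n_frames, 64)

# Random stand-ins for the trained audio encoder E_a and cue encoder E_e.
W_audio = rng.standard_normal((64, 16))
W_cue = rng.standard_normal((64, 16))

def fuse_and_mask(mixture, cue):
    """Element-wise multiplicative fusion followed by single-mask estimation."""
    f_audio = frame_encode(mixture, W_audio)               # F_audio
    f_eeg = frame_encode(cue, W_cue)                       # F_EEG
    f_fused = f_audio * f_eeg                              # F_fused = F_audio ⊙ F_EEG
    mask = 1.0 / (1.0 + np.exp(-f_fused))                  # stand-in for the TCN mask M
    return mask * f_audio                                  # masked features -> decoder
```

Because both encoders run at the same frame rate and channel count, the element-wise product costs only one multiply per feature, which is why the fusion adds essentially no parameters or latency.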
Parameter count is 167,405, roughly 2% fewer than the audio‑only baseline's 171,409. Both models are causal with a 2 ms algorithmic latency, making them suitable for real‑time, hardware‑constrained CI devices.
A major contribution is the mixed curriculum learning strategy designed to improve robustness to noisy or unreliable EEG cues. Three curricula are examined: (1) No Curriculum (oracle cue throughout training), (2) Plain Curriculum (gradually increasing Gaussian noise), and (3) Mixed Curriculum (a stochastic mixture of clean, scheduled‑noise, and uniformly sampled intermediate noise at each epoch). The mixed approach retains 30 % clean cues, 65 % cues at the current noise level, and 5 % intermediate cues throughout training. This exposure prevents over‑fitting to a single cue quality and yields stable performance across a wide range of cue‑correlation values (ρ).
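The mixed-curriculum draw can be made concrete. The 30/65/5 split is from the paper; the linear noise schedule over epochs and the additive-Gaussian corruption model are assumptions for illustration.

```python
import numpy as np

def sample_cue_noise(epoch, n_epochs, rng):
    """Mixed-curriculum noise level for one training example:
    30% clean cue, 65% at the current scheduled level, 5% at a uniformly
    sampled intermediate level (schedule shape is an assumption)."""
    scheduled = epoch / (n_epochs - 1)      # assumed: noise std ramps 0 -> 1
    u = rng.random()
    if u < 0.30:
        return 0.0                          # clean cue
    elif u < 0.95:
        return scheduled                    # current curriculum level
    else:
        return rng.uniform(0.0, scheduled)  # intermediate level

def corrupt_cue(cue, sigma, rng):
    """Assumed degradation model: additive Gaussian noise at level sigma."""
    return cue + sigma * rng.standard_normal(cue.shape)
```

Keeping a fixed fraction of clean and intermediate cues at every epoch is what prevents the late-training over-exposure to noisy cues that degrades the plain curriculum.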
Experiments use the Libri2Mix two‑speaker dataset (16 kHz) with input SIRs uniformly sampled between 0 and 10 dB. Performance is measured in the electrodogram domain using signal‑to‑interference ratio improvement (SIRi) and electrode‑wise linear cross‑correlation (LCC). With oracle cues, the brain‑informed model consistently outperforms the audio‑only baseline across all input SIRs, achieving average SIRi gains of about 2.5–3 dB. When cue reliability varies (ρ from 0.1 to 0.5), the mixed curriculum model maintains gains of 1.5–2.2 dB, whereas the plain curriculum model shows degradation at high ρ due to late‑stage over‑exposure to noisy cues. LCC analysis mirrors these findings: the mixed curriculum yields the highest electrode‑wise correlations, indicating better preservation of the temporal envelope that is crucial for CI speech perception.
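The two evaluation metrics can be sketched as follows. LCC is a straightforward per-electrode Pearson correlation; for SIRi, this sketch uses a least-squares projection of the estimate onto the target and interferer (a common BSS-eval-style convention) and subtracts the input SIR — the paper's exact electrodogram-domain definition may differ.

```python
import numpy as np

def lcc(p_ref, p_est):
    """Electrode-wise linear cross-correlation, averaged over electrodes.
    Expects electrodograms shaped (n_electrodes, n_frames)."""
    scores = [np.corrcoef(ref, est)[0, 1]
              for ref, est in zip(p_ref, p_est)
              if ref.std() > 0 and est.std() > 0]
    return float(np.mean(scores))

def sir_db(target, interference, eps=1e-12):
    """Signal-to-interference ratio in dB from component energies."""
    return 10.0 * np.log10((np.sum(target**2) + eps) /
                           (np.sum(interference**2) + eps))

def sir_improvement(att, interf, est):
    """SIRi: output SIR minus input SIR, with the estimate split into
    target and interference components by least-squares projection."""
    basis = np.vstack([att.ravel(), interf.ravel()]).T     # (N, 2)
    coeff, *_ = np.linalg.lstsq(basis, est.ravel(), rcond=None)
    s_target = coeff[0] * att.ravel()
    s_interf = coeff[1] * interf.ravel()
    return sir_db(s_target, s_interf) - sir_db(att, interf)
```

A perfect estimate projects entirely onto the attended source, driving SIRi high, while simply passing the mixture through yields roughly 0 dB improvement.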
The study deliberately uses a proxy attention cue—rectified, block‑averaged speech envelope down‑sampled to 64 Hz—rather than a full EEG‑to‑attention decoding pipeline. This simplification isolates the downstream question: how effectively can a CI front‑end exploit an attention cue of varying reliability? The authors acknowledge that real‑world deployment will require integration of multi‑channel EEG preprocessing, an auditory‑attention‑decoding (AAD) model, and handling of artifacts, which are left for future work.
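The proxy cue itself is simple to construct. The sketch below follows the description in the text (rectification, block averaging, 64 Hz output rate); the exact windowing of the paper's pipeline is not specified, so non-overlapping blocks are an assumption.

```python
import numpy as np

def proxy_attention_cue(speech, fs=16000, cue_rate=64):
    """Proxy attention cue: rectified, block-averaged envelope of the
    attended speech, down-sampled to cue_rate Hz."""
    block = fs // cue_rate                   # 250 samples per cue frame at 16 kHz
    n = (len(speech) // block) * block       # drop the ragged tail
    rectified = np.abs(speech[:n])           # full-wave rectification
    return rectified.reshape(-1, block).mean(axis=1)   # block average -> 64 Hz
```

In a real deployment this envelope would be replaced by the output of an auditory-attention-decoding model operating on multi-channel EEG, which is exactly the integration the authors defer to future work.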
Limitations include the lack of real EEG data, the absence of user‑specific AAD models, and no in‑vivo testing with CI patients. Future directions outlined are: (i) replacing the proxy cue with genuine AAD‑derived signals from actual EEG recordings, (ii) evaluating the end‑to‑end system with CI users, (iii) exploring low‑power ASIC implementations for on‑device deployment, and (iv) extending the framework to handle more than two concurrent talkers.
In summary, the paper demonstrates that a compact, low‑latency multimodal fusion network, when trained with a mixed curriculum, can leverage EEG‑derived attention information to produce a single, high‑quality attended electrodogram. This yields consistent SIR improvements and better temporal envelope fidelity compared with a strong audio‑only baseline, highlighting a promising path toward cognitively adaptive cochlear implant processors.