EMG-to-Speech with Fewer Channels
Surface electromyography (EMG) is a promising modality for silent speech interfaces, but its effectiveness depends heavily on sensor placement and channel availability. In this work, we investigate the contribution of individual and combined EMG channels to speech reconstruction performance. Our findings reveal that while certain EMG channels are individually more informative, the highest performance arises from subsets that leverage complementary relationships among channels. We also analyze phoneme classification accuracy under channel ablations and observe interpretable patterns reflecting the anatomical roles of the underlying muscles. To address performance degradation from channel reduction, we pretrained models on full 8-channel data using random channel dropout and fine-tuned them on reduced-channel subsets. Fine-tuning consistently outperformed training from scratch for 4–6-channel settings, with the best dropout strategy depending on the number of channels. These results suggest that performance degradation from sensor reduction can be mitigated through pretraining and channel-aware design, supporting the development of lightweight and practical EMG-based silent speech systems.
💡 Research Summary
This paper tackles a central obstacle to the practical deployment of silent‑speech interfaces based on surface electromyography (sEMG): the need for a dense array of facial and neck sensors. Using the widely adopted single‑speaker, eight‑channel EMG dataset introduced by Gaddy et al., the authors systematically examine how individual channels and channel subsets contribute to speech‑synthesis performance.
First, a greedy backward‑elimination experiment starts from the full 8‑channel configuration and removes one channel at a time, always keeping the subset that yields the lowest word‑error rate (WER). The removal order is 6 → 7 → 8 → 5 → 4 → 1, leaving channels 2 and 3 as the most critical. Interestingly, some smaller subsets outperform larger ones, suggesting that certain channels may introduce redundant or noisy information that harms generalization.
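The greedy backward‑elimination procedure can be sketched as follows. This is a minimal illustration, not the authors' code: `evaluate_wer` is a hypothetical stand‑in for training a model on a given channel subset and measuring its WER.

```python
def greedy_backward_elimination(channels, evaluate_wer):
    """Greedy backward elimination: starting from all channels, repeatedly
    drop the channel whose removal yields the lowest WER, keeping the
    best-performing subset at each step.

    `evaluate_wer(subset)` is a hypothetical callable that trains/evaluates
    a model on that channel subset and returns its word-error rate.
    Returns the removal order and the surviving channels.
    """
    current = list(channels)
    removal_order = []
    while len(current) > 1:
        best_subset, best_wer = None, float("inf")
        for ch in current:
            candidate = [c for c in current if c != ch]
            wer = evaluate_wer(tuple(candidate))
            if wer < best_wer:
                best_subset, best_wer = candidate, wer
        # record the channel that was dropped this round
        removal_order.append((set(current) - set(best_subset)).pop())
        current = best_subset
    return removal_order, current
```

Note that greedy elimination evaluates only O(n²) subsets, which is why it can miss the complementary combinations found by the exhaustive search described next.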
Second, the authors exhaustively evaluate all 70 possible 4‑channel combinations. Ten of these beat the greedy‑selected (1, 2, 3, 4) set, revealing strong complementary relationships: when channel 2 is absent, channels 5 or 6 frequently appear as substitutes; when channel 1 is missing, channel 7 often steps in. Spatially, the high‑performing channels are spread across the face and neck, indicating that diverse muscle‑activation zones capture distinct articulatory cues. Averaging over all combinations yields a channel‑importance ranking of 3 > 2 > 5 > 1 > 6 > 4 > 7 > 8.
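The exhaustive evaluation and the averaged importance ranking can be sketched in a few lines. Again, `evaluate_wer` is a hypothetical placeholder for the train-and-score step; the ranking logic (mean WER over all subsets containing each channel, lower being more important) is what the sketch illustrates.

```python
from itertools import combinations

def rank_channels(evaluate_wer, channels=range(1, 9), k=4):
    """Evaluate every k-channel subset exhaustively and rank channels by
    the mean WER of subsets that include them (lower mean WER = more
    important). `evaluate_wer` is a hypothetical stand-in for training
    and evaluating a model on the given subset.
    """
    subsets = list(combinations(channels, k))  # C(8, 4) = 70 for defaults
    wers = {s: evaluate_wer(s) for s in subsets}
    mean_wer = {
        ch: sum(w for s, w in wers.items() if ch in s)
            / sum(1 for s in wers if ch in s)
        for ch in channels
    }
    return sorted(channels, key=lambda ch: mean_wer[ch])
```

With 8 channels and subsets of size 4, each channel appears in C(7, 3) = 35 of the 70 subsets, so every channel's mean is estimated from the same number of evaluations.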
Third, a phoneme‑error analysis on all 7‑channel subsets isolates the contribution of each channel to specific phoneme categories. Removing channel 8 (located over the posterior masseter) degrades bilabial consonants, central vowels, and silence detection. Channel 7 (near the zygomaticus major) mainly affects high‑front vowels, while channel 3 (sternohyoid region) impacts voiceless fricatives and low vowels. Channel 2 (depressor anguli oris) influences labiodental sounds, and channel 6 (orbicularis oris) is important for rounded vowels. These findings map EMG signal relevance directly onto known articulatory functions.
Finally, to mitigate the inevitable performance loss when fewer sensors are available, the authors propose a two‑stage training scheme. During pre‑training on the full 8‑channel data, they apply random channel dropout with probabilities p ∈ {0, 0.125, 0.25}, effectively masking each channel independently per training example. This forces the encoder to learn representations that are robust to missing inputs. The pre‑trained model is then fine‑tuned on reduced‑channel data (4–6 channels). Across all reduced‑channel settings, fine‑tuning consistently outperforms training from scratch. The optimal dropout probability varies with the target channel count: p = 0.125 (average 7 channels retained) works best for 4‑channel fine‑tuning, while p = 0.25 (average 6 channels retained) is best for 5–6 channels.
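The channel-dropout augmentation amounts to independently zeroing out whole channels per training example. A minimal NumPy sketch (the array layout and function name are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def channel_dropout(emg, p, rng):
    """Zero out entire EMG channels, each with probability p, independently
    per training example. `emg` has shape (batch, channels, time); the
    per-channel mask is broadcast across the time axis, so a dropped
    channel is silenced for the whole example.
    """
    keep = rng.random((emg.shape[0], emg.shape[1], 1)) >= p
    return emg * keep

rng = np.random.default_rng(0)
x = np.ones((2, 8, 100))                    # batch of 2, 8 channels, 100 frames
x_aug = channel_dropout(x, p=0.25, rng=rng)
# with p = 0.25, an average of 6 of the 8 channels survive per example
```

Because the mask is resampled per example, the encoder sees many different channel subsets during pre-training, which is what makes it robust when whole channels are removed at fine-tuning time.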
In sum, the paper delivers three major contributions: (1) a comprehensive, multi‑method assessment of EMG channel importance and complementarity; (2) a detailed phoneme‑level analysis linking channel locations to articulatory roles; and (3) a practical pre‑training + fine‑tuning strategy with channel dropout that substantially narrows the performance gap caused by sensor reduction. These insights pave the way for lightweight, user‑friendly EMG‑based silent‑speech devices that retain high synthesis quality while requiring far fewer electrodes.