PAS-SE: Personalized Auxiliary-Sensor Speech Enhancement for Voice Pickup in Hearables

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Speech enhancement for voice pickup in hearables aims to improve the user’s voice by suppressing noise and interfering talkers, while maintaining own-voice quality. For single-channel methods, it is particularly challenging to distinguish the target from interfering talkers without additional context. In this paper, we compare two strategies to resolve this ambiguity: personalized speech enhancement (PSE), which uses enrollment utterances to represent the target, and auxiliary-sensor speech enhancement (AS-SE), which uses in-ear microphones as additional input. We evaluate the strategies on two public datasets, employing different auxiliary sensor arrays, to investigate their cross-dataset generalization. We propose training-time augmentations to facilitate cross-dataset generalization of AS-SE systems. We also show that combining PSE and AS-SE (PAS-SE) provides complementary performance benefits, especially when enrollment speech is recorded with the in-ear microphone. We further demonstrate that PAS-SE personalized with noisy in-ear enrollments maintains performance benefits over the AS-SE system.


💡 Research Summary

This paper addresses the problem of enhancing a user’s own speech captured by hearable devices in the presence of environmental noise and interfering talkers. While single‑channel deep learning speech enhancement (SE) methods can suppress background noise, they struggle to separate competing speakers because they lack any additional context about the target voice. Modern hearables often include an in‑ear microphone that records the user’s speech via body conduction with a high signal‑to‑noise ratio (SNR), but the signal is band‑limited, distorted, and contaminated by body‑produced noises, making it unsuitable for direct communication. The authors therefore explore two complementary strategies: (1) Personalized Speech Enhancement (PSE), which conditions the SE network on a speaker embedding derived from enrollment utterances of the device user, and (2) Auxiliary‑Sensor Speech Enhancement (AS‑SE), which feeds the in‑ear microphone signal as an additional input channel. They further propose a combined approach, PAS‑SE (Personalized Auxiliary‑Sensor Speech Enhancement), that merges both sources of information.

The signal model assumes two microphones: an outer microphone (OM) and an in‑ear microphone (IM). At each time‑frequency bin (k, l) the observed signals are Y_o(k,l) = S_o(k,l) + N_o(k,l) + V_o(k,l) and Y_i(k,l) = S_i(k,l) + N_i(k,l) + V_i(k,l), where S denotes the user’s speech, N environmental noise, and V interfering speech. The key challenge is that V_o can be strong at the OM while V_i is typically weak but not zero due to leakage.
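A minimal NumPy sketch of this two‑microphone mixture model, using synthetic placeholder spectra (all sizes and scales are illustrative assumptions, chosen only to mimic the high‑SNR in‑ear pickup and the weak interferer leakage described above):

```python
import numpy as np

rng = np.random.default_rng(0)
K, L = 257, 100  # frequency bins, time frames (illustrative sizes)

def synth(scale):
    """Random complex spectrogram with a given magnitude scale."""
    return scale * (rng.standard_normal((K, L)) + 1j * rng.standard_normal((K, L)))

# Outer microphone (OM): own speech plus strong noise and interferer.
S_o, N_o, V_o = synth(1.0), synth(0.5), synth(0.8)
Y_o = S_o + N_o + V_o

# In-ear microphone (IM): body-conducted speech arrives at high SNR,
# and the interferer leaks in only weakly (V_i small but nonzero).
S_i, N_i, V_i = synth(1.0), synth(0.1), synth(0.05)
Y_i = S_i + N_i + V_i

def snr_db(s, rest):
    return 10 * np.log10(np.mean(np.abs(s) ** 2) / np.mean(np.abs(rest) ** 2))

print(f"OM SNR: {snr_db(S_o, N_o + V_o):.1f} dB")
print(f"IM SNR: {snr_db(S_i, N_i + V_i):.1f} dB")
```

With these placeholder scales the IM signal comes out far cleaner than the OM signal, which is exactly the asymmetry AS‑SE exploits.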

The backbone architecture is the FT‑JNF model, originally proposed for joint frequency‑time filtering. It processes the magnitude spectra of one or more microphones with a frequency‑wise LSTM (512 units) followed by a time‑wise LSTM (128 units), a linear projection, and a tanh activation to produce a magnitude mask M(k,l). The mask is applied to the noisy OM signal to obtain the enhanced speech estimate. The base SE model contains ≈1.38 M parameters, making it suitable for real‑time deployment on low‑power hearables.
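A hedged PyTorch sketch of this frequency‑then‑time LSTM stack follows. Only the layer sizes (512‑unit frequency LSTM, 128‑unit time LSTM, linear projection, tanh mask) come from the summary; the axis reshaping, class name, and input layout are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class FTJNFSketch(nn.Module):
    """Illustrative FT-JNF-style masking network (sizes from the summary;
    implementation details are assumptions, not the original model)."""
    def __init__(self, n_mics=2, f_units=512, t_units=128):
        super().__init__()
        # Frequency-wise LSTM: runs along the frequency axis per time frame.
        self.f_lstm = nn.LSTM(n_mics, f_units, batch_first=True)
        # Time-wise LSTM: runs along the time axis per frequency bin.
        self.t_lstm = nn.LSTM(f_units, t_units, batch_first=True)
        self.proj = nn.Linear(t_units, 1)

    def forward(self, mag):                 # mag: (batch, mics, K, L)
        B, C, K, L = mag.shape
        x = mag.permute(0, 3, 2, 1).reshape(B * L, K, C)  # frequency as sequence
        x, _ = self.f_lstm(x)
        x = x.reshape(B, L, K, -1).permute(0, 2, 1, 3).reshape(B * K, L, -1)
        x, _ = self.t_lstm(x)                             # time as sequence
        m = torch.tanh(self.proj(x)).reshape(B, K, L)
        return m                                          # magnitude mask M(k, l)

net = FTJNFSketch()
mag = torch.rand(1, 2, 257, 100)   # |Y_o|, |Y_i| magnitude spectra
mask = net(mag)
enhanced = mask * mag[:, 0]        # apply the mask to the noisy OM magnitude
print(mask.shape, enhanced.shape)
```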

For personalization, a speaker encoder based on the time‑domain SpeakerBeam architecture is attached. An enrollment utterance (either recorded at OM or IM) is passed through a learnable filterbank, a 1‑D convolutional block, and temporal averaging to produce a 128‑dimensional embedding e. A dense layer projects e to the same dimension as the FT‑JNF frequency‑LSTM output, and multiplicative conditioning (element‑wise product) injects the speaker information into the main network. The encoder adds ≈1.81 M parameters.
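The multiplicative conditioning step can be sketched in a few lines. The 128‑dimensional embedding and the match to the 512‑dimensional frequency‑LSTM output come from the text; the tensor layout and variable names are assumptions:

```python
import torch
import torch.nn as nn

emb_dim, feat_dim = 128, 512          # embedding size / frequency-LSTM output size
project = nn.Linear(emb_dim, feat_dim)

e = torch.randn(1, emb_dim)           # speaker embedding from an enrollment utterance
feats = torch.randn(1, 257, feat_dim) # frequency-LSTM output over K frequency bins

# Element-wise product injects the speaker identity into the main network.
conditioned = feats * project(e).unsqueeze(1)
print(conditioned.shape)              # same shape as feats
```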

A major practical obstacle is that public datasets (e.g., Vibravox) contain only clean in‑ear recordings; they lack realistic in‑ear noise or interferer signals. To train AS‑SE and PAS‑SE without such data, the authors devise four data‑augmentation configurations:
(A) Add noise N_o and interferer V_o only to OM; IM remains clean.
(B) Add noise N_o and N_i to both microphones, but no interferers.
(C) Add noise to both microphones and interferer V_o only to OM.
(D) Same as (C) but approximate the in‑ear interferer as a scaled version of the outer interferer (V_i ≈ a·V_o, a∈
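The four configurations can be sketched as follows. The scaling factor `a` for configuration (D) is a free placeholder here, since its range is truncated in the source; all signal variables are synthetic stand‑ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(S_o, S_i, N_o, N_i, V_o, config, a=0.1):
    """Sketch of augmentation configs (A)-(D); returns (Y_o, Y_i).
    `a` is a placeholder scaling factor for the approximate in-ear interferer."""
    if config == "A":   # noise and interferer on OM only; IM stays clean
        return S_o + N_o + V_o, S_i
    if config == "B":   # noise on both microphones, no interferers
        return S_o + N_o, S_i + N_i
    if config == "C":   # noise on both, interferer on OM only
        return S_o + N_o + V_o, S_i + N_i
    if config == "D":   # like (C), plus scaled outer interferer on IM
        return S_o + N_o + V_o, S_i + N_i + a * V_o
    raise ValueError(f"unknown config: {config}")

S_o, S_i, N_o, N_i, V_o = (rng.standard_normal(16000) for _ in range(5))
for cfg in "ABCD":
    Y_o, Y_i = augment(S_o, S_i, N_o, N_i, V_o, cfg)
```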

