Lend me an Ear: Speech Enhancement Using a Robotic Arm with a Microphone Array
Speech enhancement performance degrades significantly in noisy environments, limiting the deployment of speech-controlled technologies in industrial settings, such as manufacturing plants. Existing speech enhancement solutions primarily rely on advanced digital signal processing techniques, deep learning methods, or complex software optimization techniques. This paper introduces a novel enhancement strategy that incorporates a physical optimization stage by dynamically modifying the geometry of a microphone array to adapt to changing acoustic conditions. A sixteen-microphone array is mounted on a robotic arm manipulator with seven degrees of freedom, with microphones divided into four groups of four, including one group positioned near the end-effector. The system reconfigures the array by adjusting the manipulator joint angles to place the end-effector microphones closer to the target speaker, thereby improving the reference signal quality. The proposed method integrates sound source localization techniques, computer vision, inverse kinematics, a minimum variance distortionless response beamformer, and time-frequency masking using a deep neural network. Experimental results demonstrate that this approach outperforms other traditional recording configurations, achieving a higher scale-invariant signal-to-distortion ratio and a lower word error rate across multiple input signal-to-noise ratio conditions.
💡 Research Summary
The paper presents a novel speech‑enhancement system designed for noisy industrial environments where voice‑controlled human‑robot interaction (HRI) is required. The authors mount a sixteen‑microphone omnidirectional array on a Kinova Gen3 seven‑degree‑of‑freedom robotic arm, dividing the microphones into four sub‑arrays of four elements each. One sub‑array is positioned near the end‑effector, allowing the arm to bring a microphone physically close to the speaking operator.
The enhancement pipeline consists of five modules. First, a deep neural network (BLSTM‑RNN) predicts an ideal‑ratio mask (IRM) for each channel from the log‑magnitude STFT of the noisy signal. Second, a sound‑source‑localization (SSL) block applies a spatially whitened SRP‑PHAT algorithm, using the predicted masks to compute online speech and noise spatial covariance matrices and estimate the azimuth of the target. Third, a vision module (MediaPipe face detection) combined with an Intel RealSense depth sensor estimates the 3‑D coordinates of the speaker’s face. Fourth, an inverse‑kinematics (IK) solver provided by Kinova computes the joint angles that move the arm so that the end‑effector (and its attached sub‑array) is optimally positioned relative to the speaker. Finally, a minimum‑variance distortionless‑response (MVDR) beamformer is applied, using the 16th microphone as the reference channel and the previously estimated masks to form speech and noise covariance matrices. The MVDR weights are computed analytically, guaranteeing unit gain for the reference while minimizing output variance.
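The final stage above, a mask-based MVDR with a reference channel, can be sketched as follows. This is a minimal illustration of the general technique (mask-weighted covariance estimation plus the reference-channel MVDR solution of the Souden et al. form), not the authors' implementation; array shapes, the regularization term, and the function name are assumptions.

```python
import numpy as np

def mask_based_mvdr(stft, speech_mask, noise_mask, ref_ch=15):
    """Mask-based MVDR beamformer (illustrative sketch).

    stft:        complex STFT, shape (channels, freq_bins, frames)
    speech_mask: real mask in [0, 1], shape (freq_bins, frames)
    noise_mask:  real mask in [0, 1], shape (freq_bins, frames)
    ref_ch:      reference channel index (the 16th microphone in the paper)
    """
    C, F, T = stft.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        X = stft[:, f, :]  # (C, T) for this frequency bin
        # Mask-weighted spatial covariance matrices for speech and noise
        phi_s = (speech_mask[f] * X) @ X.conj().T / max(speech_mask[f].sum(), 1e-8)
        phi_n = (noise_mask[f] * X) @ X.conj().T / max(noise_mask[f].sum(), 1e-8)
        # Diagonal loading keeps the noise covariance invertible
        phi_n += 1e-6 * np.trace(phi_n).real / C * np.eye(C)
        # Reference-channel MVDR: w = (Phi_n^{-1} Phi_s u_ref) / tr(Phi_n^{-1} Phi_s),
        # which gives unit gain on the reference while minimizing output variance
        num = np.linalg.solve(phi_n, phi_s)
        w = num[:, ref_ch] / np.trace(num)
        out[f] = w.conj() @ X
    return out
```

The division by the trace is what enforces the distortionless (unit-gain) constraint on the reference channel, which is why a cleaner reference signal, here obtained by moving the end-effector microphones closer to the speaker, directly improves the beamformer output.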
Experiments are carried out in two stages. For SSL, the authors record three‑second speech utterances at 18 azimuths and mix them with six types of industrial noise (vacuum pump, drill, engine, electric hum, compressed air, mine noise) at four SNR levels (‑5, 0, 5, 10 dB), yielding 1,944 mixtures per SNR. Using a single arm pose, they evaluate localization with a ±15° tolerance, obtaining correct‑localization rates (ER15) above 80 % even at the lowest SNR, and showing that the DNN‑predicted mask approaches oracle performance as SNR improves.
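Mixing speech and noise at a target SNR, as in the evaluation setup above, amounts to scaling the noise so the power ratio matches the desired level. A minimal sketch, assuming power is measured over the whole utterance and the noise is tiled to the speech length (the paper's exact scaling convention may differ):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix `speech` and `noise` so the resulting speech-to-noise
    power ratio equals `snr_db` (illustrative sketch)."""
    # Tile or truncate the noise to match the speech length
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that makes p_speech / (gain^2 * p_noise) == 10^(snr_db/10)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

Sweeping `snr_db` over {-5, 0, 5, 10} for every speech/noise/azimuth combination reproduces the kind of mixture grid described above.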
For full speech enhancement, the optimized dynamic array is compared against three static configurations mounted on the arm, a conventional static array from prior work, and a small shotgun microphone placed near the end‑effector. Recordings are made from four speech directions and eighteen noise directions. Performance metrics are scale‑invariant signal‑to‑distortion ratio (SI‑SDR) and word error rate (WER) obtained with the base Whisper ASR model. Results show that the dynamic configuration consistently outperforms all baselines, achieving an average SI‑SDR gain of roughly 2–3 dB and a WER reduction of 10–15 percentage points. The improvement is attributed to the reduced distance between the reference microphone and the speaker, which yields a higher‑quality reference signal for the MVDR beamformer.
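The SI-SDR metric used above is scale-invariant by construction: the estimate is first projected onto the reference so that any global gain applied by the beamformer does not affect the score. A small self-contained sketch of the standard (Le Roux et al.) definition:

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB.
    Both inputs are 1-D signals of equal length."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Orthogonal projection of the estimate onto the reference
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference          # scaled target component
    residual = estimate - target        # everything else is distortion
    return 10 * np.log10(np.dot(target, target) / np.dot(residual, residual))
```

Because of the projection, `si_sdr(3 * est, ref)` equals `si_sdr(est, ref)` exactly, which is what makes the metric fair across beamformer outputs with different overall gains.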
Key contributions of the work are: (1) introducing physical array reconfiguration via a robotic manipulator as a real‑time optimization variable for speech enhancement; (2) demonstrating that a hybrid of deep‑learning‑based mask estimation and classical MVDR beamforming remains robust to rapid geometry changes; (3) integrating vision‑based speaker localization, depth sensing, and inverse kinematics into a closed‑loop system that autonomously positions the microphone array for optimal listening. The authors discuss future directions, including handling multiple simultaneous speakers, incorporating online ASR feedback to adapt masks on the fly, and extending the approach to highly reverberant spaces where far‑field assumptions break down. Overall, the study provides a compelling proof‑of‑concept that mechanical adaptation of microphone geometry can substantially boost speech‑driven HRI performance in harsh acoustic settings.