Machines hear better when they have ears

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Deep-neural-network (DNN) based noise suppression systems yield significant improvements over conventional approaches such as spectral subtraction and non-negative matrix factorization, but do not generalize well to noise conditions they were not trained for. In comparison to DNNs, humans show remarkable noise suppression capabilities that yield successful speech intelligibility under various adverse listening conditions and negative signal-to-noise ratios (SNRs). Motivated by the excellent human performance, this paper explores whether numerical models that simulate human cochlear signal processing can be combined with DNNs to improve the robustness of DNN based noise suppression systems. Five cochlear models were coupled to fully-connected and recurrent NN-based noise suppression systems and were trained and evaluated for a variety of noise conditions using objective metrics: perceptual speech quality (PESQ), segmental SNR and cepstral distance. The simulations show that biophysically-inspired cochlear models improve the generalizability of DNN-based noise suppression systems for unseen noise and negative SNRs. This approach thus leads to robust noise suppression systems that are less sensitive to the noise type and noise level. Because cochlear models capture the intrinsic nonlinearities and dynamics of peripheral auditory processing, it is shown here that accounting for their deterministic signal processing improves machine hearing and avoids overtraining of multi-layer DNNs. We hence conclude that machines hear better when realistic cochlear models are used at the input of DNNs.


💡 Research Summary

The paper investigates whether incorporating biologically‑inspired cochlear models into deep‑neural‑network (DNN) based speech‑enhancement pipelines can improve robustness to unseen noise types and low signal‑to‑noise ratios (SNRs). Conventional DNN speech‑enhancement systems typically operate on engineered representations such as short‑time Fourier transform (STFT) or mel‑filterbank magnitudes, which, while effective on training conditions, degrade sharply when presented with novel noises or negative SNRs. Human listeners, by contrast, maintain intelligibility under a wide range of adverse acoustic environments, thanks to the nonlinear, level‑dependent filtering performed by the cochlea. The authors therefore hypothesize that feeding DNNs with a signal representation that mimics cochlear processing will yield a more generalizable system.

Five cochlear models of increasing physiological realism are evaluated:

1. Gammatone (GT): a simple linear filterbank.
2. Dynamic Compressive Gammachirp (dcGC): adds a level-dependent asymmetric filter that introduces compression.
3. Dual-Resonance Non-Linear (DRNL) filterbank: incorporates outer- and middle-ear transfer functions and a parallel linear/non-linear path.
4. Cascade of Asymmetric Resonators with Fast-Acting Compression (CARFAC): uses a serial chain of second-order resonators to emulate basilar-membrane mechanics.
5. Transmission-Line (TL) model: simulates the cochlea as a cascade of shunt admittances and serial impedances, capturing both mechanical and fluid coupling.

Each model processes the noisy speech waveform and outputs a time-frequency representation (essentially a cochleagram) that serves as the DNN input.
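The summary does not specify how the filterbank channels are spaced. Gammatone-style front ends conventionally place channel center frequencies uniformly on the ERB-rate (equivalent rectangular bandwidth) scale of Glasberg and Moore; the sketch below illustrates that convention only, and the function name and 21-channel example are illustrative, not taken from the paper.

```python
import math

def erb_space(f_low, f_high, n):
    """Return n center frequencies (Hz) equally spaced on the ERB-rate
    scale, as commonly used for gammatone-style cochlear filterbanks."""
    def hz_to_erb_rate(f):
        # Glasberg & Moore ERB-rate formula
        return 21.4 * math.log10(4.37e-3 * f + 1.0)

    def erb_rate_to_hz(e):
        # Inverse of the formula above
        return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

    lo, hi = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    return [erb_rate_to_hz(lo + i * (hi - lo) / (n - 1)) for i in range(n)]

# Example: 21 channels between 80 Hz and 8 kHz
cfs = erb_space(80.0, 8000.0, 21)
```

Spacing channels on the ERB-rate scale mirrors the cochlea's roughly logarithmic frequency map, giving denser coverage at low frequencies where speech energy is concentrated.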

Two DNN architectures are trained to predict the Ideal Ratio Mask (IRM) for each time‑frequency bin: a fully‑connected feed‑forward network (FC‑DNN) that uses frame‑expansion to provide limited temporal context, and a Long Short‑Term Memory recurrent network (LSTM‑RNN) that explicitly models long‑range dependencies. The predicted mask is multiplied with the noisy Gammatone spectrogram, and the enhanced speech is reconstructed via an inverse Gammatone synthesis stage.
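For concreteness, the masking step can be sketched as follows, assuming the common square-root form of the Ideal Ratio Mask; the paper may use a different exponent or mask parameterization, and the function names here are illustrative.

```python
def ideal_ratio_mask(speech_energy, noise_energy, beta=0.5):
    """IRM for one time-frequency bin: (S / (S + N)) ** beta,
    where S and N are nonnegative speech and noise energies.
    beta = 0.5 gives the common square-root form."""
    return (speech_energy / (speech_energy + noise_energy)) ** beta

def apply_mask(noisy_mag, mask):
    """Point-wise multiply a predicted mask with the noisy
    time-frequency magnitudes (lists of rows)."""
    return [[m * g for m, g in zip(row_m, row_g)]
            for row_m, row_g in zip(noisy_mag, mask)]
```

At training time the IRM is computed from the known clean and noise signals as the regression target; at test time the DNN's predicted mask is applied to the noisy representation before resynthesis.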

Experiments use three noise types (babble, ICRA, factory) and three SNR levels (–3 dB, 3 dB, 9 dB). Performance is measured with three objective metrics: Perceptual Evaluation of Speech Quality (PESQ, reported as MOS), segmental SNR (segSNR), and Cepstral Distance (CD). Improvements over the unprocessed noisy signal are reported as ΔPESQ, ΔsegSNR, and ΔCD.
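Of the three metrics, segmental SNR is the simplest to state: the per-frame SNR in dB, clipped to a fixed range, averaged over frames. The frame length and clipping bounds below are conventional assumptions, not values stated in the paper.

```python
import math

def seg_snr(clean, enhanced, frame_len=256, floor=-10.0, ceil=35.0):
    """Mean per-frame SNR (dB) between clean and enhanced signals,
    with each frame's SNR clipped to [floor, ceil] as is conventional."""
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        sig = sum(x * x for x in s)
        err = sum((x - y) ** 2 for x, y in zip(s, e))
        if sig == 0.0 or err == 0.0:
            continue  # skip silent or perfectly reconstructed frames
        snr_db = 10.0 * math.log10(sig / err)
        snrs.append(min(max(snr_db, floor), ceil))
    return sum(snrs) / len(snrs) if snrs else 0.0
```

PESQ, by contrast, is a full perceptual model standardized as ITU-T P.862 and is normally computed with a reference implementation rather than reimplemented.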

Results show a clear trend: the more physiologically accurate the cochlear front‑end, the larger the performance gain. The Transmission‑Line model consistently yields the highest ΔPESQ (≈+0.8 MOS), the largest ΔsegSNR (≈+3 dB), and the lowest ΔCD across all noise‑SNR combinations. DCGC and DRNL also outperform the simple GT baseline, while CARFAC provides intermediate gains. The GT model, which lacks nonlinear compression, performs similarly to conventional STFT‑based systems, confirming that linear filtering alone does not capture the robustness of human hearing.

Both FC‑DNN and LSTM‑RNN benefit similarly from the cochlear preprocessing, indicating that the improvement stems primarily from the input representation rather than the network architecture. This suggests that a well‑designed front‑end can reduce the need for overly deep or complex networks and mitigate over‑training on limited noise conditions.

The authors discuss the implications: (i) biologically‑inspired preprocessing acts as a deterministic, noise‑robust feature extractor, (ii) it reduces the reliance on massive, diverse training corpora, and (iii) it opens a pathway toward more generalizable speech‑enhancement systems for real‑world applications such as hearing aids, teleconferencing, and automatic speech recognition. Limitations include the focus on monaural (single‑channel) signals, lack of real‑time implementation analysis, and the absence of subjective listening tests. Future work is proposed on multi‑channel extensions, low‑latency deployment, and integration with auditory prostheses.

In conclusion, the study provides empirical evidence that “machines hear better when they have ears”: augmenting DNN‑based noise suppression with realistic cochlear models yields substantial gains in speech quality and intelligibility, especially under unseen noise conditions and negative SNRs, thereby establishing a promising new direction for robust audio‑signal processing.

