HiFi-Glot: High-Fidelity Neural Formant Synthesis with Differentiable Resonant Filters

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Formant synthesis aims to generate speech with controllable formant structures, enabling precise control of vocal resonance and phonetic features. However, while existing formant synthesis approaches enable precise formant manipulation, they often yield an impoverished speech signal, failing to capture the complex co-occurring acoustic cues essential for naturalness. To address this issue, this letter presents HiFi-Glot, an end-to-end neural formant synthesis system that achieves both precise formant control and high-fidelity speech synthesis. Specifically, the proposed model adopts a source–filter architecture inspired by classical formant synthesis, where a neural vocoder generates the glottal excitation signal and differentiable resonant filters model the formants to produce the speech waveform. Experimental results demonstrate that the proposed HiFi-Glot model can generate speech with higher perceptual quality and naturalness while exhibiting more precise control over formant frequencies, outperforming industry-standard formant manipulation tools such as Praat. Code, checkpoints, and representative audio samples are available at https://www.yichenggu.com/HiFi-Glot/.


💡 Research Summary

HiFi‑Glot introduces a novel end‑to‑end neural formant synthesis framework that merges the interpretability of classical source‑filter speech production models with the expressive power of modern neural vocoders. The system receives a set of interpretable speech parameters—fundamental frequency (F0), the first three formants (F1‑F3), spectral tilt, spectral centroid, energy, and voicing boundaries—and processes them through an 8‑layer gated convolutional neural network (GCNN). The GCNN outputs two streams: (1) a compact representation of an all‑pole filter (51 values: 50 log‑area ratios and a gain) and (2) a 128‑dimensional latent vector. The latent vector feeds an NSF‑HiFiGAN decoder that generates a glottal excitation signal.
The all‑pole filter parameters are first mapped through a tanh activation to reflection coefficients with magnitude below one, guaranteeing filter stability, and then transformed into direct‑form LPC coefficients using a differentiable Levinson recursion. For efficient parallel synthesis, the all‑pole frequency response is approximated by sampling it on an N‑point FFT grid, H(e^{jω_k}) ≈ g / (FFT(a, N)_k + ε), where the small constant ε guards against division by zero. The excitation is transformed to the STFT domain, multiplied by this sampled response, and converted back to the time domain with an inverse FFT and overlap‑add, yielding the final waveform. This design ensures full differentiability with respect to the filter parameters while enabling fast GPU‑based batch synthesis, overcoming the speed and gradient‑flow limitations of prior approaches such as GELP and LPCNet.
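The filter path can be sketched as follows. This is a minimal NumPy illustration, not the paper's batched, differentiable PyTorch implementation; the function names `rc_to_lpc` and `allpole_freq_response` are chosen here for clarity and do not appear in the paper.

```python
import numpy as np

def rc_to_lpc(k):
    """Step-up (Levinson) recursion: reflection coefficients -> direct-form
    LPC coefficients. |k_i| < 1 for every i guarantees a stable all-pole
    filter, which is why raw network outputs are squashed through tanh."""
    a = np.zeros(0)
    for m, km in enumerate(k, start=1):
        a_prev = a
        a = np.empty(m)
        a[: m - 1] = a_prev + km * a_prev[::-1]   # a_j <- a_j + k_m * a_{m-j}
        a[m - 1] = km
    return a  # A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p

def allpole_freq_response(a, gain, n_fft=1024, eps=1e-6):
    """Sample H(z) = g / A(z) on an n_fft-point grid; multiplying the
    excitation STFT frame-wise by this response approximates time-domain
    all-pole filtering while staying differentiable and parallel."""
    A = np.fft.rfft(np.concatenate(([1.0], a)), n=n_fft)
    return gain / (A + eps)

# raw network outputs -> bounded reflection coefficients -> stable LPC filter
raw = np.array([1.2, -0.7, 0.4])
k = np.tanh(raw)
a = rc_to_lpc(k)
H = allpole_freq_response(a, gain=1.0)
```

A quick stability check: because every reflection coefficient lies in (−1, 1), the roots of A(z) lie inside the unit circle, so the resulting filter cannot blow up during synthesis regardless of what the network emits.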
Training combines several loss components: a multi‑scale mel‑spectrogram loss, a log‑magnitude envelope loss that aligns predicted filter spectra with ground‑truth LPC envelopes, and adversarial plus feature‑matching losses from four discriminators—multi‑period (MPD), multi‑scale (MSD), multi‑scale STFT (MS‑STFTD), and multi‑scale sub‑band CQT (MS‑SB‑CQTD). The adversarial framework encourages realistic fine‑grained temporal and spectral details. The model is trained on 1664 h of speech data (44.1 kHz) using AdamW (β1 = 0.8, β2 = 0.99) with a learning rate of 2e‑5 for one million steps.
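The multi-scale reconstruction term can be illustrated with a minimal NumPy sketch. For brevity it compares linear log-magnitude STFTs rather than mel spectrograms, and the three resolutions are illustrative defaults, not the paper's configuration.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT with a Hann window (no padding, for brevity)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    return np.stack([np.abs(np.fft.rfft(win * x[i * hop : i * hop + n_fft]))
                     for i in range(n_frames)])

def multiscale_spec_loss(y, y_hat,
                         resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average L1 distance between log-magnitude spectrograms at several
    FFT resolutions; small windows catch temporal detail, large windows
    catch spectral envelope errors."""
    loss = 0.0
    for n_fft, hop in resolutions:
        s, s_hat = stft_mag(y, n_fft, hop), stft_mag(y_hat, n_fft, hop)
        loss += np.mean(np.abs(np.log(s + 1e-5) - np.log(s_hat + 1e-5)))
    return loss / len(resolutions)
```

Combining several resolutions is the standard trick for sidestepping the time–frequency trade-off of any single STFT, which is also the rationale behind the multi-scale discriminators listed above.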
Evaluation uses 1000 utterances from HiFi‑TTS2. Objective metrics compute the RMSE between the requested (scaled) speech parameters and those re‑extracted from the manipulated, re‑synthesized speech; subjective quality is measured with a MUSHRA‑style listening test (20 native English listeners, 35 trials, attention checks). Baselines include the DSP‑based Praat pipeline (which retains the original excitation and LPC envelope) and the neural formant synthesis model NFS‑HiFiGAN (same decoder but without differentiable filters). Results show that HiFi‑Glot consistently achieves lower RMSE across all scaling factors (0.7–1.3) for F1‑F3, tilt, centroid, and energy, especially when scaling downwards. In the MUSHRA tests, HiFi‑Glot attains higher quality and naturalness scores than both baselines across all formant‑scaling conditions; Praat performs best only at the neutral (1.0) scale but degrades sharply when formants are shifted. The neural models, despite generating speech directly from parameters, maintain competitive or superior perceptual quality, demonstrating that differentiable all‑pole filters provide precise control without sacrificing naturalness.
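The objective protocol reduces to a simple check: request a scaled parameter track, re-extract the track from the synthesized audio, and measure the deviation. A toy sketch with made-up numbers (the real pipeline extracts formant tracks from audio; the values below are purely illustrative):

```python
import numpy as np

def manipulation_rmse(requested, realized):
    """RMSE (in the parameter's own units, e.g. Hz for formants) between the
    requested track and the one re-extracted from the synthesized speech;
    lower means the synthesizer followed the control more faithfully."""
    requested = np.asarray(requested, dtype=float)
    realized = np.asarray(realized, dtype=float)
    return float(np.sqrt(np.mean((requested - realized) ** 2)))

f1 = np.array([700.0, 710.0, 695.0])                # original F1 track (Hz), toy values
requested = 0.8 * f1                                # ask for F1 scaled by 0.8
realized = requested + np.array([5.0, -3.0, 4.0])   # pretend re-extracted track
err = manipulation_rmse(requested, realized)
```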
The paper concludes that integrating differentiable resonant filters into a neural source‑filter architecture yields a system capable of high‑fidelity speech synthesis with fine‑grained, accurate formant manipulation. Limitations include focus on a single speaker/language dataset, lack of explicit evaluation on unvoiced segments, and no real‑time latency analysis. Future work may explore multi‑speaker, multilingual extensions, real‑time deployment, and applications such as voice conversion, speech therapy tools, and expressive speech synthesis.

