Toward Complex-Valued Neural Networks for Waveform Generation
Neural vocoders have recently advanced waveform generation, yielding natural and expressive audio. Among these approaches, iSTFT-based vocoders have gained attention. They predict a complex-valued spectrogram and then synthesize the waveform via iSTFT, thereby avoiding learned upsampling stages that can increase computational cost. However, current approaches use real-valued networks that process the real and imaginary parts independently. This separation limits their ability to capture the inherent structure of complex spectrograms. We present ComVo, a Complex-valued neural Vocoder whose generator and discriminator use native complex arithmetic. This enables an adversarial training framework that provides structured feedback in complex-valued representations. To guide phase transformations in a structured manner, we introduce phase quantization, which discretizes phase values and regularizes the training process. Finally, we propose a block-matrix computation scheme to improve training efficiency by reducing redundant operations. Experiments demonstrate that ComVo achieves higher synthesis quality than comparable real-valued baselines, and that its block-matrix scheme reduces training time by 25%. Audio samples and code are available at https://hs-oh-prml.github.io/ComVo/.
💡 Research Summary
The paper introduces ComVo, a novel neural vocoder that operates entirely in the complex domain for iSTFT‑based waveform generation. Traditional iSTFT‑based vocoders (e.g., iSTFTNet, Vocos) predict complex‑valued spectrograms but use real‑valued neural networks (RVNNs) that treat the real and imaginary parts as separate channels. This separation prevents the model from learning the intrinsic coupling between magnitude and phase that is naturally expressed in complex numbers.
ComVo addresses this limitation by employing complex‑valued neural networks (CVNNs) for both the generator and the discriminator. The generator is built upon a Vocos‑style feed‑forward architecture, but every convolution, normalization, and activation is implemented with native complex arithmetic. A split‑GELU activation is used to preserve the ConvNeXt‑style block while remaining compatible with complex inputs.
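A common way to make GELU compatible with complex inputs, and plausibly what "split-GELU" refers to here, is to apply the activation independently to the real and imaginary parts. The sketch below illustrates that interpretation in NumPy; the function names and the tanh approximation of GELU are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (assumption: the paper may use the exact erf form)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def split_gelu(z):
    # "Split" activation: apply GELU to real and imaginary parts separately,
    # so the nonlinearity stays well-defined on complex-valued features.
    return gelu(z.real) + 1j * gelu(z.imag)

z = np.array([1.0 + 2.0j, -0.5 - 1.5j])
out = split_gelu(z)
```

A split activation like this keeps each ConvNeXt-style block structurally unchanged while accepting complex features, at the cost of not being holomorphic.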
A key contribution is the phase quantization layer. After the first complex convolution, the phase of each complex feature is discretized into a fixed number of levels (Nq). The forward pass uses the quantized phase, while the backward pass applies a straight‑through estimator (STE) that treats the quantization as an identity for gradient flow. This regularizes phase evolution, mitigates phase drift, and stabilizes GAN training.
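The forward pass of such a quantization layer can be sketched as snapping each feature's phase to the nearest of Nq uniform levels while leaving its magnitude untouched. The NumPy sketch below shows the forward computation only; in training, the straight-through estimator would copy gradients through as if the rounding were the identity. Function and argument names are illustrative.

```python
import numpy as np

def quantize_phase(z, n_q=16):
    # Forward pass of a phase-quantization layer: keep the magnitude,
    # round the phase to the nearest of n_q uniform levels in (-pi, pi].
    # In training, a straight-through estimator (STE) would treat this
    # rounding as the identity in the backward pass.
    mag = np.abs(z)
    step = 2.0 * np.pi / n_q
    q_phase = np.round(np.angle(z) / step) * step
    return mag * np.exp(1j * q_phase)

z = np.array([1.0, 2.0, 0.5]) * np.exp(1j * np.array([0.1, 1.0, -2.5]))
zq = quantize_phase(z, n_q=16)
```

Because only the phase is discretized, the layer acts as a structured nonlinearity on the phase component while remaining magnitude-preserving.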
The discriminator side introduces a complex multi‑resolution discriminator (cMRD) that processes complex spectrograms directly at several STFT resolutions. Unlike prior discriminators that either use only magnitude or concatenate real and imaginary channels, cMRD computes adversarial losses on both real and imaginary components, providing structured feedback that respects complex algebra. In addition, a conventional multi‑period discriminator (MPD) operates on the waveform level (real‑valued) to preserve fine‑grained temporal cues.
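One way to realize "adversarial losses on both real and imaginary components" is to evaluate a standard least-squares GAN objective on each component of the complex discriminator outputs and sum the terms. The sketch below is a hypothetical formulation under that assumption; the paper's exact loss may differ.

```python
import numpy as np

def complex_lsgan_d_loss(d_real_out, d_fake_out):
    # Illustrative discriminator loss: apply the LSGAN objective to the
    # real and imaginary components of complex discriminator outputs
    # separately, then sum. (Assumed formulation, not the paper's exact loss.)
    loss = 0.0
    for comp in (np.real, np.imag):
        loss += np.mean((comp(d_real_out) - 1.0) ** 2)  # real samples -> target 1
        loss += np.mean(comp(d_fake_out) ** 2)          # fake samples -> target 0
    return loss

loss = complex_lsgan_d_loss(np.array([1.0 + 1.0j]), np.array([0.0 + 0.0j]))
```

Scoring both components independently gives the generator feedback on each axis of the complex plane rather than on magnitude alone.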
To make complex operations computationally efficient, the authors propose a block-matrix computation scheme. By representing a complex weight matrix W = Wr + iWi as the 2×2 real block matrix [[Wr, −Wi], [Wi, Wr]] and an input z = x + iy as the stacked real vector [x; y], the forward and backward passes reduce to a single real matrix multiplication rather than four separate real multiplications. Custom autograd functions implement this formulation, reducing redundant memory accesses and improving GPU parallelism. Empirically, this yields a 25% reduction in training time without sacrificing accuracy.
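The algebraic identity behind this scheme can be verified directly: multiplying by the 2×2 real block matrix reproduces the complex matrix-vector product. A minimal NumPy check (illustrative only; the paper implements this as custom autograd functions on GPU):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
Wr, Wi = rng.standard_normal((n, n)), rng.standard_normal((n, n))
x, y = rng.standard_normal(n), rng.standard_normal(n)

# Native complex multiply: (Wr + i*Wi)(x + i*y)
out_complex = (Wr + 1j * Wi) @ (x + 1j * y)

# Equivalent block formulation: one real matmul with [[Wr, -Wi], [Wi, Wr]]
W_block = np.block([[Wr, -Wi], [Wi, Wr]])
out_block = W_block @ np.concatenate([x, y])
# out_block[:n] holds the real part, out_block[n:] the imaginary part
```

Packing the four real products into one larger matmul is what lets a single fused GPU kernel replace four smaller ones.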
Experiments are conducted on the LibriTTS corpus (24 kHz) and the MUSDB18‑HQ dataset. Baselines include HiFi‑GAN, iSTFTNet, BigVGAN, and Vocos. Objective metrics (UTMOS, PESQ, MR‑STFT, Periodicity, V/UV F1) and subjective mean opinion scores (MOS) are reported. ComVo consistently outperforms all baselines: on LibriTTS it achieves UTMOS 3.69 (vs. 3.60 for the best baseline), PESQ 3.82, and MOS 4.07; on MUSDB18‑HQ it records the lowest MR‑STFT (0.878) and highest PESQ (3.522). A controlled synthetic experiment comparing a lightweight MLP‑GAN implemented as an RVNN versus a CVNN shows that the complex model attains significantly lower Jensen–Shannon divergence for both magnitude and phase distributions, confirming the representational advantage of complex‑valued modeling.
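The Jensen–Shannon divergence used in that controlled comparison can be computed from normalized histograms of the magnitude or phase values. A small NumPy sketch (the smoothing constant `eps` is an implementation detail assumed here):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions
    # (e.g., normalized histograms of magnitude or phase values).
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A lower divergence between generated and ground-truth distributions indicates a closer match, which is the sense in which the CVNN was found to outperform the RVNN.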
The paper’s analysis highlights three main insights: (1) native complex‑valued adversarial training captures real‑imaginary dependencies that real‑valued networks miss; (2) phase quantization acts as a structured non‑linearity and regularizer, improving training stability; (3) block‑matrix reformulation makes complex networks practical for large‑scale audio tasks by cutting redundant computation.
Limitations are acknowledged: the study focuses on GAN‑based vocoders and does not explore complex diffusion or flow models; real‑time inference and model compression are left for future work; and the impact of the quantization level Nq on performance warrants deeper investigation.
Overall, ComVo demonstrates that complex‑valued neural networks can substantially improve the quality and efficiency of iSTFT‑based waveform generation, opening a promising direction for future audio synthesis research.