EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Audio codecs power discrete generative music modelling, music streaming and immersive media by compressing PCM audio to bandwidth-friendly bit-rates. Recent work has gravitated toward processing in the spectral domain; however, spectrogram-domain models typically struggle with phase, which is naturally complex-valued. Most frequency-domain neural codecs either disregard phase information or encode it as two separate real-valued channels, limiting reconstruction fidelity. To compensate for this impoverished representation of the audio signal, such codecs introduce adversarial discriminators at the expense of convergence speed and training stability. In this work we introduce an end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling across the entire analysis-quantization-synthesis pipeline and removes adversarial discriminators and diffusion post-filters. Without GANs or diffusion, we match or surpass much longer-trained baselines in-domain and reach state-of-the-art out-of-domain performance. Compared with standard baselines that train for hundreds of thousands of steps, our model cuts the training budget by an order of magnitude, making it markedly more compute-efficient while preserving high perceptual quality.


💡 Research Summary

The paper introduces EuleroDec, the first fully end‑to‑end complex‑valued neural audio codec that integrates a complex‑valued VQ‑VAE with residual vector quantization (RVQ). Traditional spectral‑domain codecs either discard phase information or split the complex STFT into separate real‑valued channels, which destroys the intrinsic magnitude‑phase coupling. To compensate, many recent works add adversarial discriminators or diffusion‑based post‑filters, incurring instability, slower convergence, and higher computational cost.

EuleroDec avoids all of these by keeping the entire analysis‑quantization‑synthesis pipeline in the complex domain. The input waveform is transformed with a 24 kHz STFT (512‑point FFT, Hann window, hop 64). The encoder consists of four down‑sampling stages, each built from complex convolutions, complex batch/RMS normalization, modReLU or GELU activations, and complex axial attention along the time axis. Skip connections use gated complex average pooling. After the down‑sampling hierarchy, a complex axial attention across frequency and a feed‑forward block prepare the latent representation for quantization.
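The modReLU activation mentioned above has a simple closed form from the complex-valued network literature: it thresholds the magnitude while leaving the phase untouched, which is exactly the property a phase-preserving pipeline needs. A minimal numpy sketch (the bias value here is illustrative, not the paper's):

```python
import numpy as np

def modrelu(z, b=-0.1):
    """modReLU: f(z) = ReLU(|z| + b) * z / |z|, with f(0) = 0.
    A negative bias b zeroes out small-magnitude components while
    preserving the phase of the components that survive."""
    mag = np.abs(z)
    gated = np.maximum(mag + b, 0.0)
    # Unit-phase factor; the max() guard avoids division by zero at z = 0.
    unit = np.where(mag > 0, z / np.maximum(mag, 1e-12), 0.0 + 0.0j)
    return gated * unit

z = np.array([0.05 + 0.05j, 1.0 + 1.0j, -2.0j])
out = modrelu(z)
# |z[0]| ~ 0.07 < 0.1, so it is gated to zero; the other two keep
# their original phase with magnitudes shrunk by 0.1.
```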

Quantization is performed directly on complex vectors. The latent tensor is reshaped, linearly projected to a D‑dimensional complex space, and then passed through a multi‑stage residual vector quantizer with 12 codebooks (2048 entries each). Codebooks are initialized by sampling from the current encoder embeddings plus a small complex Gaussian perturbation, ensuring diversity from the start. At each RVQ stage the nearest complex centroid is selected using the Hermitian‑induced Euclidean distance, and the residual is passed to the next stage. A commitment loss pulls the encoder output toward its assigned centroids, while EMA updates with a warm‑up decay schedule keep the codebooks stable and prevent “code‑book collapse”. Dead codes are refreshed with a low‑probability reseeding strategy, achieving 100 % code utilization and an effective perplexity of 73 % of the available codes.
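The per-stage selection and residual hand-off described above can be sketched in a few lines of numpy. This is a simplified inference-time view with toy dimensions (the paper uses 12 codebooks of 2048 entries); EMA updates, the commitment loss, and dead-code reseeding are training-time mechanisms and are omitted:

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Quantize a complex latent vector with multi-stage residual VQ.
    Each stage picks the codeword minimizing the Hermitian-induced
    Euclidean distance sum_i |r_i - c_i|^2, then passes the residual
    r - c to the next stage's codebook."""
    residual = z.copy()
    recon = np.zeros_like(z)
    indices = []
    for cb in codebooks:                       # cb shape: (K, D), complex
        dists = np.sum(np.abs(residual[None, :] - cb) ** 2, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        recon = recon + cb[k]                  # reconstruction = sum of picks
        residual = residual - cb[k]
    return indices, recon

# Toy setup: 3 stages of K = 8 codewords in a D = 4 complex space.
rng = np.random.default_rng(0)
D = 4
codebooks = [rng.standard_normal((8, D)) + 1j * rng.standard_normal((8, D))
             for _ in range(3)]
z = rng.standard_normal(D) + 1j * rng.standard_normal(D)
indices, z_hat = rvq_encode(z, codebooks)
```

Each successive stage refines the approximation left by the previous one, which is what lets a stack of small codebooks cover a large effective codebook.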

The decoder mirrors the encoder without the pooling branch, employing complex transposed convolutions, complex axial attention, and feed‑forward blocks to reconstruct the full‑resolution complex spectrogram, which is finally transformed back to waveform via inverse STFT.
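The analysis/synthesis ends of the pipeline use the STFT settings stated earlier (512-point FFT, Hann window, hop 64 at 24 kHz). These settings can be sanity-checked with SciPy's `stft`/`istft` round trip; this illustrates only the transform pair, not the learned codec in between:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 24_000
t = np.arange(fs) / fs                         # 1 s of audio
x = np.sin(2 * np.pi * 440.0 * t)

# Analysis: 512-point FFT, Hann window, hop 64
# (i.e. an overlap of 512 - 64 = 448 samples).
f, frames, S = stft(x, fs=fs, window="hann",
                    nperseg=512, noverlap=512 - 64)

# S is complex-valued: magnitude and phase travel together,
# which is what the codec quantizes instead of splitting channels.
_, x_rec = istft(S, fs=fs, window="hann",
                 nperseg=512, noverlap=512 - 64)
x_rec = x_rec[: len(x)]                        # trim synthesis padding
```

With this window/hop combination the NOLA condition holds, so the round trip reconstructs the waveform to numerical precision.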

Training uses AdamW (β1 = 0.9, β2 = 0.99, weight decay = 7e‑4) with a batch size of 16. The loss combines multi‑resolution mel‑L1, complex L1, a spectral convergence term, and the quantization commitment term (β = 0.05). Learning rate follows a linear warm‑up then cosine decay, reduced by a factor of 100 at convergence. Training converges in only 35–41 k iterations (≈50 k steps), a ten‑fold reduction compared with baselines that require 500–700 k steps.
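The non-mel loss terms could look like the following sketch. The exact definitions and relative weights (beyond β = 0.05) are assumptions here; the multi-resolution mel-L1 term and the stop-gradient bookkeeping of the commitment loss are omitted:

```python
import numpy as np

def codec_losses(S_ref, S_hat, z_e, z_q, beta=0.05):
    """Sum of three sketched loss terms:
    complex_l1 : mean |S_ref - S_hat| over the complex spectrogram,
                 penalizing magnitude and phase errors jointly.
    sc         : spectral convergence, || |S_ref| - |S_hat| ||_F
                 normalized by || |S_ref| ||_F.
    commit     : beta * mean |z_e - z_q|^2, pulling encoder outputs
                 toward their assigned (gradient-stopped) centroids."""
    complex_l1 = np.mean(np.abs(S_ref - S_hat))
    sc = (np.linalg.norm(np.abs(S_ref) - np.abs(S_hat))
          / max(np.linalg.norm(np.abs(S_ref)), 1e-12))
    commit = beta * np.mean(np.abs(z_e - z_q) ** 2)
    return complex_l1 + sc + commit

S = np.array([[1.0 + 1.0j, 2.0 - 1.0j],
              [0.0 + 0.5j, -1.0 + 0.0j]])
z = np.array([0.3 + 0.4j, -0.2 + 0.1j])
perfect = codec_losses(S, S, z, z)             # all terms vanish
degraded = codec_losses(S, 0.9 * S, z, 1.1 * z)
```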

Evaluation on LibriTTS (clean and other splits) at 6 kbps and 12 kbps uses four metrics: SI‑SDR (waveform fidelity), PESQ (perceptual quality), STOI (intelligibility), and Group‑Delay Distortion (GDD, measuring phase accuracy). At 6 kbps in‑domain, EuleroDec achieves SI‑SDR = 10.5 dB, PESQ = 2.47, STOI = 0.842, GDD = 264 ms, outperforming AudioDec, EnCodec, and APCodec. In the out‑of‑domain “other” set, where APCodec’s performance collapses, EuleroDec maintains the best SI‑SDR and second‑best PESQ, with markedly lower GDD, indicating superior phase reconstruction under distribution shift. At 12 kbps, similar trends hold, with EuleroDec achieving the highest SI‑SDR (13.67 dB) and competitive PESQ (2.91) while preserving low GDD.
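Of the four metrics, SI-SDR has a standard closed form that is easy to reproduce: project the (mean-centered) estimate onto the reference, then compare the projected target energy to the residual noise energy. A minimal numpy version:

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB (standard definition)."""
    ref = reference - reference.mean()
    est = estimate - estimate.mean()
    alpha = np.dot(est, ref) / np.dot(ref, ref)   # optimal scaling
    target = alpha * ref
    noise = est - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))

fs = 8_000
t = np.arange(fs) / fs
ref = np.sin(2 * np.pi * 50.0 * t)                # 50 full cycles
est = ref + 0.01 * np.cos(2 * np.pi * 50.0 * t)   # orthogonal distortion
score = si_sdr(est, ref)                          # ~40 dB
```

Because the metric projects out the optimal gain, rescaling the estimate leaves the score unchanged, which is what makes it a fairer waveform-fidelity measure than plain SNR.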

Ablation studies confirm the importance of complex‑valued processing: removing time‑axial attention reduces SI‑SDR and PESQ slightly, and a real‑valued AE variant performs substantially worse across all metrics. Parameter counts are modest (≈2.35 M for the full model), and the real‑time factor on an RTX 3090 is 0.344, suitable for high‑throughput offline generation.

In summary, EuleroDec demonstrates that a fully complex‑valued neural codec can preserve magnitude‑phase coupling, eliminate the need for adversarial or diffusion components, achieve fast and stable convergence, and deliver state‑of‑the‑art quality at low bitrates. This work opens a new direction for efficient, robust audio compression, especially for applications requiring faithful phase reconstruction such as music streaming, immersive media, and speech coding under bandwidth constraints.

