Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling requires fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates, slow encoding, and separate architectures for discrete vs. continuous latents and for different audio channel formats, hindering workflows from preprocessing to inference conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations, as well as common audio channel formats, in one model. By balancing compression, quality, and speed, it delivers 10x faster encoding and 1.6x lower latent rates, and eliminates channel-format-specific variants, while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.
Latent generative models have revolutionized audio synthesis, enabling applications from music generation [1,2,3] to source separation [4], upmixing [5], and understanding tasks [6]. These models fundamentally depend on neural audio autoencoders to compress raw waveforms into tractable latent representations. However, existing autoencoders are primarily designed as quantized variational autoencoders for reconstruction tasks and then only mildly adapted for generative modeling through post hoc modifications. This reconstruction-first design philosophy creates fundamental mismatches with generative requirements, leading to inefficient tokenization rates (we use "tokenize" to describe converting audio to any latent representation), fragmented architectures across audio channel formats, and computational bottlenecks that limit practical deployment at scale. The discrete-continuous latent representation divide further compounds these issues: discrete methods lack continuous latents for diffusion, and continuous methods lack discrete tokens for language models. These challenges are particularly acute for high-fidelity music processing at 44.1 kHz, where complex content amplifies the tension between compression, quality, and complexity.
These limitations are evident across established audio codecs. SoundStream, EnCodec, and DAC [7,8,9] exemplify this paradigm, operating at 75-150 Hz with only quantized latents, which are unsuitable for diffusion training. A 4-minute song requires over 18,000 tokens, creating memory bottlenecks, while slow encoding can account for 30% of training time, limiting data augmentation and throughput. Recent approaches make progress but with significant trade-offs: Stable Audio Open [2] reduces rates to 21.5 Hz but increases encoding cost and provides only continuous latents unsuitable for language model training, SpectroStream [10] maintains low rates but requires 64 codebooks, and HILCodec [11] improves speed but keeps high token rates. The Music2Latent [12] line of work, culminating in CoDiCodec [13], achieves an impressive 11 Hz rate with both continuous and discrete latents, at the cost of signal-level reconstruction metrics. None of these explicitly account for different audio channel formats. Speech-focused approaches [14,15,16] reach 12.5 Hz but target low-bandwidth applications unsuited to high-fidelity music, with its broader frequency content and complex stereo imaging. These developments point toward the need for unified architectures designed specifically for generative modeling.
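To make these sequence-length figures concrete, the back-of-the-envelope arithmetic below (a minimal sketch in Python; the rates are those quoted above, and 13.125 Hz is the target rate of the proposed model introduced in the method section) compares token counts for a 4-minute song:

```python
# Sequence lengths implied by the quoted latent rates -- a minimal sketch.
# Rates are those cited above; 13.125 Hz is the target rate of the proposed model.
duration_s = 4 * 60  # a 4-minute song

rates_hz = {
    "SoundStream/EnCodec/DAC": 75.0,   # lower bound of the 75-150 Hz range
    "Stable Audio Open":       21.5,
    "CoDiCodec":               11.0,
    "GenAE (target)":          13.125,
}

for name, rate in rates_hz.items():
    print(f"{name:>24}: {rate * duration_s:7.0f} latent frames")
# 75 Hz already yields 18,000 frames, before counting multiple codebooks per frame,
# which is the memory bottleneck described above.
```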
To address these limitations, we propose the Generative-First Autoencoder (GenAE), a generative-first architecture that rethinks prior autoencoder designs for generation. GenAE provides a single architecture and training scheme that supports continuous and discrete latents and all common audio channel formats (mono, stereo, mid/side). Although we ablate and evaluate on 44.1 kHz music, the design is not music-specific. Our contributions are threefold: (1) encoder architectural modifications, including efficient activations, early downsampling, and strategic attention placement, that enable aggressive tokenization with substantial computational speedups; (2) training improvements, with audio-channel-format data augmentation and loss functions that enhance generalization and robustness; and (3) an optional post-training step that discretizes a trained continuous model to support both continuous and discrete latents without retraining the backbone. Together, these choices yield a unified model that balances compression rate, reconstruction quality, and processing speed for generative workflows.
Our method redesigns the autoencoder architecture for generation to balance three competing objectives: compression rate, reconstruction quality, and processing speed. Our base model is an encoder-bottleneck-decoder model, as proposed in SoundStream [7]. We specifically use the most modern architecture, proposed in DAC [9], with 5 blocks and hop sizes set for a target rate of 13.125 Hz. In this section, we describe each change to DAC that leads to our proposed GenAE, ensuring measurable and cumulative performance gains. We sort our contributions into architectural, training, and post-training categories, and within each group order them by intent. A diagram of the cumulative model architecture is shown in Fig. 2.
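As a concrete illustration of how the hop sizes relate to the target latent rate, the sketch below assumes a hypothetical 5-block stride pattern; the text only fixes the overall downsampling factor (44,100 Hz / 13.125 Hz = 3360x), so the actual per-block strides may differ:

```python
# How hop sizes determine the latent (token) rate -- a minimal sketch.
# The per-block strides below are illustrative; only their product (3360) is fixed
# by the target rate, and the paper's actual strides may differ.
import math

sample_rate_hz = 44_100
strides = [2, 4, 6, 7, 10]                 # hypothetical 5-block hop sizes, product 3360

downsample = math.prod(strides)            # 3360x temporal downsampling
latent_rate_hz = sample_rate_hz / downsample
print(downsample, latent_rate_hz)          # 3360 13.125

# A 60-second mono clip then yields ~788 latent frames, matching the abstract.
print(math.ceil(latent_rate_hz * 60))      # 788
```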
The first four changes are for efficiency, ordered by measured impact, and the last three are for quality. Efficient activations: Snake activations, $x + \frac{1}{\beta}\sin^2(\beta x)$, excel in audio tasks [9] but incur significant memory costs and are the memory bottleneck in our tasks. We use ELU in the encoder and introduce SnakeLite for the decoder. SnakeLite is a periodically wrapped Taylor approximation of $\sin^2(\cdot)$. We wrap the argument of Snake into $(-\pi/2, \pi/2]$ using the round function, $a(x, \beta) = \beta x - \pi \, \mathrm{round}\!\left(\frac{\beta x}{\pi}\right)$, which we provide to the Taylor polynomial

$$u^2 - \frac{u^4}{3} + \frac{2u^6}{45} - \frac{u^8}{315} \approx \sin^2(u).$$
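A minimal PyTorch sketch of the two activations follows. The wrapping and the eighth-order Taylor polynomial mirror the equations above; treating $\beta$ as a learnable per-channel parameter is an assumption carried over from DAC-style Snake and is not fixed by this excerpt:

```python
# Snake vs. SnakeLite -- a minimal sketch, not the paper's exact implementation.
import torch

def snake(x: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Standard Snake activation: x + sin^2(beta * x) / beta."""
    return x + torch.sin(beta * x).pow(2) / beta

def snake_lite(x: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """SnakeLite: sin^2 replaced by its Taylor polynomial, evaluated on the
    argument wrapped into (-pi/2, pi/2], exploiting the pi-periodicity of sin^2."""
    u = beta * x - torch.pi * torch.round(beta * x / torch.pi)   # a(x, beta)
    u2 = u * u
    # Taylor series of sin^2(u) about 0: u^2 - u^4/3 + 2u^6/45 - u^8/315
    sin2 = u2 - u2.pow(2) / 3 + 2 * u2.pow(3) / 45 - u2.pow(4) / 315
    return x + sin2 / beta

# Quick check of the approximation quality (beta = 1, per-channel broadcast):
x = torch.randn(2, 64, 1024)                 # (batch, channels, time)
beta = torch.ones(1, 64, 1)
assert torch.allclose(snake(x, beta), snake_lite(x, beta), atol=2e-2)
```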