Time-Varying Audio Effect Modeling by End-to-End Adversarial Training
Deep learning has become a standard approach for the modeling of audio effects, yet strictly black-box modeling remains problematic for time-varying systems. Unlike with time-invariant effects, training models of devices with internal modulation typically requires recording or extracting control signals to ensure the time-alignment required by standard loss functions. This paper introduces a Generative Adversarial Network (GAN) framework to model such effects using only input-output audio recordings, removing the need for modulation signal extraction. We propose a convolutional-recurrent architecture trained via a two-stage strategy: an initial adversarial phase allows the model to learn the distribution of the modulation behavior without strict phase constraints, followed by a supervised fine-tuning phase where a State Prediction Network (SPN) estimates the initial internal states required to synchronize the model with the target. Additionally, a new objective metric based on chirp-train signals is developed to quantify modulation accuracy. Experiments modeling a vintage hardware phaser demonstrate the method’s ability to capture time-varying dynamics in a fully black-box context.
💡 Research Summary
The paper tackles the challenging problem of black‑box modeling of time‑varying audio effects—such as phasers, choruses, and flangers—without requiring the extraction of internal modulation signals (e.g., LFO waveforms). Traditional supervised approaches assume that input‑output pairs are time‑aligned; however, when the effect’s internal modulator starts at different phases for each recording, standard loss functions penalize phase mismatches heavily, causing the model to learn an averaged behavior rather than the true dynamic modulation.
To overcome this, the authors propose a two‑stage training pipeline built around a Generative Adversarial Network (GAN). In the first stage, only adversarial loss (hinge formulation) and a multi‑resolution STFT spectral loss are used. The generator receives a random initial LSTM state h₀ sampled from a prior distribution, which injects stochasticity and allows the network to capture the distribution of modulation patterns without enforcing a specific phase. This stage learns the overall timbre and modulation statistics of the target effect.
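The core of the first stage is that the generator's LSTM starts from a randomly sampled initial state rather than zeros, so each forward pass realizes a different modulation phase. A minimal PyTorch sketch of that idea (the tiny generator, hidden size, and the standard-normal prior are illustrative stand-ins; the summary does not specify the paper's exact architecture or prior):

```python
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Toy stand-in for the paper's generator: an LSTM whose initial
    state h0 is supplied externally instead of defaulting to zeros."""
    def __init__(self, hidden=16):
        super().__init__()
        self.hidden = hidden
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x, h0):
        # x: (batch, time, 1); h0: tuple of (1, batch, hidden) tensors
        out, _ = self.lstm(x, h0)
        return self.proj(out)

def sample_h0(batch, hidden, device="cpu"):
    # Stage 1: draw the initial LSTM state from a simple prior
    # (standard normal here, as an assumed choice) so that each
    # forward pass corresponds to a different modulation phase.
    h = torch.randn(1, batch, hidden, device=device)
    c = torch.randn(1, batch, hidden, device=device)
    return (h, c)

gen = TinyGenerator()
x = torch.randn(4, 256, 1)                 # batch of dry-audio windows
y_hat = gen(x, sample_h0(4, gen.hidden))   # (4, 256, 1)
```

Because no two sampled states match, the adversarial and spectral losses can only reward outputs that look like plausible draws from the target's modulation distribution, not a single phase-locked answer.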
In the second stage, a State Prediction Network (SPN) predicts the optimal initial LSTM states from the current input‑output window (x, y, φ). The SPN reuses the discriminator’s feature‑extraction backbone (FeatBlocks) to keep the parameter count low and to benefit from transfer learning; the discriminator’s weights are frozen while the SPN is trained. By feeding the predicted h₀ back into the generator, the model aligns its internal modulation phase with the ground‑truth signal, effectively solving the phase‑alignment problem.
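The SPN's job reduces to mapping an (input, output) window to LSTM state tensors of the right shape. A toy sketch, assuming a small convolutional feature stack in place of the shared discriminator FeatBlocks (all layer sizes here are hypothetical):

```python
import torch
import torch.nn as nn

class TinySPN(nn.Module):
    """Toy State Prediction Network: maps a dry/wet audio pair to an
    initial LSTM state (h0, c0). The real SPN reuses the frozen
    discriminator FeatBlocks; a small conv stack stands in for them."""
    def __init__(self, hidden=16):
        super().__init__()
        self.feat = nn.Sequential(   # stand-in for shared FeatBlocks
            nn.Conv1d(2, 16, 9, stride=4, padding=4), nn.PReLU(),
            nn.Conv1d(16, 32, 9, stride=4, padding=4), nn.PReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.to_h = nn.Linear(32, hidden)
        self.to_c = nn.Linear(32, hidden)

    def forward(self, x, y):
        # x, y: (batch, time) dry and wet audio for the current window
        z = self.feat(torch.stack([x, y], dim=1)).squeeze(-1)  # (batch, 32)
        # nn.LSTM expects states shaped (num_layers, batch, hidden)
        return (self.to_h(z).unsqueeze(0), self.to_c(z).unsqueeze(0))

spn = TinySPN()
h0, c0 = spn(torch.randn(4, 1024), torch.randn(4, 1024))
```

With the predicted (h0, c0) fed to the generator, ordinary time-aligned losses become meaningful again, since the model's modulation phase now tracks the ground truth for that window.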
The generator architecture, named Series‑Parallel Time‑Varying Modulation (SPTVMod), extends the earlier SPTMod design. It consists of alternating ModBlocks and FXBlocks. Each ModBlock contains a convolution‑PReLU front‑end, down‑sampling, an LSTM (initialized with h₀), up‑sampling, and two 1×1 convolutions that output a modulation tensor µ_j (applied to the audio path) and a residual s_j for the next block. FXBlocks apply depth‑wise dilated convolutions, modulated by µ_j via FiLM, and include residual connections. This design yields a large receptive field with modest computational cost, suitable for real‑time processing.
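The FXBlock's combination of a depth-wise dilated convolution with FiLM modulation and a residual path can be sketched as follows (channel count, kernel size, and dilation are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class TinyFXBlock(nn.Module):
    """Toy FXBlock: depth-wise dilated convolution whose output is
    modulated by a FiLM (scale, shift) pair derived from the modulation
    tensor mu, plus a residual connection around the whole block."""
    def __init__(self, channels=8, dilation=4):
        super().__init__()
        # groups=channels makes the convolution depth-wise;
        # padding=dilation keeps the sequence length unchanged
        self.dw = nn.Conv1d(channels, channels, kernel_size=3,
                            dilation=dilation, padding=dilation,
                            groups=channels)
        self.act = nn.PReLU()

    def forward(self, audio, mu):
        # audio: (batch, channels, time)
        # mu:    (batch, 2*channels, time), split into FiLM scale/shift
        scale, shift = mu.chunk(2, dim=1)
        h = self.act(self.dw(audio))
        return audio + scale * h + shift   # FiLM modulation + residual

blk = TinyFXBlock()
out = blk(torch.randn(2, 8, 512), torch.randn(2, 16, 512))
```

Stacking such blocks with growing dilations is what gives the large receptive field at low cost: each block only touches three (dilated) taps per channel, yet the composition spans many samples.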
The discriminator mirrors the SPN’s feature extractor: a stack of FeatBlocks (convolution, PReLU, optional FiLM, average pooling) produces hierarchical features z_j (real) or ẑ_j (generated). These are globally pooled, summed, and passed through a small fully‑connected network to output a scalar score.
Loss functions combine hinge adversarial terms for generator and discriminator with a Multi‑Resolution STFT loss (windows of 512, 1024, 2048 samples). An adaptive loss‑balancing scheme rescales gradients of each loss component, stabilizing training and preventing the adversarial term from dominating. Feature‑matching losses were tested but discarded because they degraded performance in this context.
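The hinge adversarial terms and the multi-resolution STFT loss are both standard formulations; a compact sketch (hop length and Hann windowing are common defaults assumed here, not necessarily the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def hinge_d_loss(d_real, d_fake):
    # Discriminator hinge loss: push real scores above +1, fake below -1
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def hinge_g_loss(d_fake):
    # Generator hinge loss: push discriminator scores on fakes upward
    return -d_fake.mean()

def mr_stft_loss(y_hat, y, fft_sizes=(512, 1024, 2048)):
    # Multi-resolution STFT magnitude loss over the paper's window sizes
    loss = 0.0
    for n_fft in fft_sizes:
        win = torch.hann_window(n_fft)
        s_hat = torch.stft(y_hat, n_fft, hop_length=n_fft // 4,
                           window=win, return_complex=True).abs()
        s = torch.stft(y, n_fft, hop_length=n_fft // 4,
                       window=win, return_complex=True).abs()
        loss = loss + F.l1_loss(s_hat, s)
    return loss / len(fft_sizes)

l = mr_stft_loss(torch.randn(2, 8192), torch.randn(2, 8192))
```

Averaging the L1 magnitude error over several FFT sizes trades off time and frequency resolution, which matters for an effect whose notches sweep continuously.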
Experiments focus on a vintage analog phaser. A dataset of paired dry‑wet audio snippets was created with random LFO phases and rates, while user‑control parameters were held constant (snapshot mode). The authors introduce a novel “chirp‑train” metric that uses frequency‑swept chirps to quantify how accurately the model reproduces the time‑varying modulation envelope. Compared to baseline methods that either fix the LFO phase or pre‑extract it with a separate network, the proposed GAN + SPN approach achieves significantly lower chirp‑train error, higher spectral similarity, and comparable perceptual scores in listening tests. The model reproduces subtle phase‑dependent characteristics such as the non‑linear sweep of the phaser’s notches and the dynamic resonance peaks, which are often lost in averaged models.
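The intuition behind a chirp-train probe is that each short sweep samples the effect's frequency response at a different instant, so per-chirp spectra trace the modulation over time. A NumPy sketch under assumed parameters (sample rate, chirp length, and sweep range are illustrative; the summary does not give the paper's exact metric definition):

```python
import numpy as np

def chirp_train(sr=44100, dur=2.0, chirp_len=0.05, f0=100.0, f1=10000.0):
    """Back-to-back short exponential sweeps; parameters are illustrative."""
    n = int(sr * chirp_len)
    t = np.arange(n) / sr
    # Exponential sweep from f0 to f1 within each chirp
    k = (f1 / f0) ** (1.0 / chirp_len)
    phase = 2 * np.pi * f0 * (k ** t - 1.0) / np.log(k)
    one = np.sin(phase) * np.hanning(n)   # window to avoid clicks
    reps = round(dur / chirp_len)
    return np.tile(one, reps)

def per_chirp_spectra(y, chirp_len_samples):
    """Magnitude spectrum of each chirp-long segment; comparing these
    between target and model output tracks how the notches move."""
    segs = y[: len(y) - len(y) % chirp_len_samples]
    segs = segs.reshape(-1, chirp_len_samples)
    return np.abs(np.fft.rfft(segs * np.hanning(chirp_len_samples), axis=1))

sig = chirp_train()
spec = per_chirp_spectra(sig, 2205)   # one spectrum per chirp
```

A model that has only learned an averaged response produces nearly identical rows in `spec`, while a correctly modulated one reproduces the target's row-to-row notch movement.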
In summary, the paper makes three key contributions: (1) a GAN‑based adversarial phase that learns modulation distributions without explicit phase alignment, (2) a state‑prediction mechanism that restores precise temporal alignment by estimating LSTM initial states, and (3) a new objective metric for evaluating time‑varying modulation accuracy. The combined framework is generic and can be applied to any LFO‑driven or more complex time‑varying effect, opening the door to high‑quality, fully black‑box emulations of classic hardware for plugins, virtual instruments, and real‑time audio processing.