Query-based Deep Improvisation
In this paper we explore techniques for generating new music using a Variational Autoencoder (VAE) neural network that was trained on a corpus of a specific style. Instead of randomly sampling the latent states of the network to produce free improvisation, we generate new music by querying the network with musical input in a style different from the training corpus. This allows us to produce new musical output with longer-term structure that blends aspects of the query with the style of the network. In order to control the level of this blending, we add a noisy channel between the VAE encoder and decoder using a bit-allocation algorithm from communication rate-distortion theory. Our experiments provide new insight into the relations between the representational and structural information of latent states and the query signal, suggesting their possible use for composition purposes.
💡 Research Summary
The paper introduces a novel method for controllable music generation using a pre‑trained Variational Autoencoder (VAE). Instead of sampling the latent space at random, the authors feed the VAE with a musical query that belongs to a style different from the one used during training. To regulate how much of the query’s characteristics are transferred to the generated output, they insert a noisy communication channel between the encoder and decoder. This channel is implemented via a bit‑allocation algorithm derived from Shannon’s rate‑distortion theory.
The theoretical foundation starts from the ELBO (Evidence Lower Bound) formulation of VAE training, which can be expressed as a trade‑off between data entropy (H), reconstruction distortion (D), and the KL‑divergence term (R) that measures the encoding rate. While β‑VAE and InfoVAE control the balance between D and R during training, the proposed approach keeps the VAE weights fixed and manipulates the information flow at generation time. Assuming the latent variables follow a multivariate uncorrelated Gaussian distribution, the authors derive closed‑form expressions for the conditional mean μ_d and variance σ²_d of the decoder’s input as functions of an allocated bit‑rate R. When R = 0, the decoder receives only the mean (maximum compression); when R → ∞, the decoder receives the exact encoder output (no compression).
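The rate-limited channel described above can be sketched in code. For a Gaussian source with variance σ², the rate-distortion function gives distortion D = σ²·2^(−2R), and the optimal "test channel" shrinks the deviation from the mean toward μ and adds matched noise, so R = 0 passes only the mean and large R passes the encoder output nearly unchanged. The sketch below is a minimal stdlib-only illustration of that behavior, together with a simple reverse water-filling allocation of a total bit budget across latent dimensions; the function names and the binary search on the water level θ are this sketch's assumptions, not the paper's implementation.

```python
import math
import random


def gaussian_test_channel(z, mu, sigma2, rate_bits, rng=None):
    """Pass latent samples through a rate-limited Gaussian test channel.

    Per dimension i, distortion d_i = sigma2_i * 2**(-2 * R_i); the channel
    output is mu_i + a_i * (z_i - mu_i) + noise, with a_i = 1 - d_i / sigma2_i
    and noise variance d_i * a_i.  At R_i = 0 the decoder receives only mu_i;
    as R_i grows it receives z_i almost exactly.
    """
    rng = rng or random.Random(0)
    out = []
    for z_i, mu_i, s2, r in zip(z, mu, sigma2, rate_bits):
        d = s2 * 2.0 ** (-2.0 * r)        # distortion achievable at rate r
        a = 1.0 - d / s2                  # shrinkage toward the mean
        noise_sd = math.sqrt(max(d * a, 0.0))
        out.append(mu_i + a * (z_i - mu_i) + rng.gauss(0.0, noise_sd))
    return out


def allocate_bits(sigma2, total_bits):
    """Reverse water-filling: spend the bit budget on high-variance dims.

    Each dimension gets R_i = max(0, 0.5 * log2(sigma2_i / theta)), with the
    water level theta found by binary search so the rates sum to total_bits.
    """
    lo, hi = 1e-12, max(sigma2)
    for _ in range(200):
        theta = 0.5 * (lo + hi)
        used = sum(max(0.0, 0.5 * math.log2(s2 / theta)) for s2 in sigma2)
        if used > total_bits:
            lo = theta                    # too many bits: raise the water level
        else:
            hi = theta
    return [max(0.0, 0.5 * math.log2(s2 / theta)) for s2 in sigma2]
```

With a budget like the paper's 256 bits per frame, `allocate_bits` would concentrate rate on the most informative latent dimensions before the channel is applied.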
Experimentally, a VAE is trained on a modest pop‑music dataset (126 clips, each represented as 16‑step, 16th‑note bars). As a query, a MIDI file from the anime “Naruto Shippuden”—stylistically distinct from the training set—is encoded. By limiting the bit‑rate to 256 bits per frame, the generated music shows a clear departure from the query’s original texture while preserving harmonic relationships. For instance, a G‑A‑D chord progression appears over a query that originally contains a D chord, creating new harmonic tension. Further reduction of the bit‑rate leads to increasingly simplified structures that converge toward the tonic C, reflecting the dominance of the training corpus’s style.
To quantify these effects, the authors employ Music Information Dynamics, specifically the Information Rate (IR) metric, estimated with the Variable Markov Oracle (VMO). IR measures the mutual information between the present and past of a quantized signal, capturing both unpredictability (high entropy) and predictability (low conditional entropy). VMO searches over similarity thresholds to find the maximal IR, simultaneously revealing recurring motifs. The analysis shows that lower bit‑rates reduce IR and the number of detected motifs, indicating that the noisy channel diminishes temporal predictability and structural repetition in the output.
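The IR quantity above can be made concrete with a toy estimator. The paper estimates IR with the VMO over variable-length contexts; the sketch below is a simplified stdlib-only surrogate that uses a single-step past, computing IR = H(x_n) − H(x_n | x_{n−1}) from empirical symbol and bigram counts on a quantized sequence. The function name and the first-order restriction are this sketch's assumptions.

```python
import math
from collections import Counter


def information_rate(symbols):
    """Estimate IR = H(x_n) - H(x_n | x_{n-1}) of a quantized sequence.

    High IR means the present is individually uncertain (high marginal
    entropy) yet well predicted by the past (low conditional entropy),
    which is the signature of repeated structure.
    """
    def entropy(counts, total):
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    n = len(symbols)
    h_marginal = entropy(Counter(symbols), n)
    m = n - 1
    h_joint = entropy(Counter(zip(symbols, symbols[1:])), m)  # H(x_{n-1}, x_n)
    h_past = entropy(Counter(symbols[:-1]), m)                # H(x_{n-1})
    h_cond = h_joint - h_past                                 # H(x_n | x_{n-1})
    return h_marginal - h_cond
```

On a strictly alternating sequence such as `A B A B ...` this returns 1 bit (maximally varied yet fully predictable), while a constant sequence yields 0, matching the intuition that the noisy channel's erosion of repetition should lower IR.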
The paper’s contributions are threefold: (1) it demonstrates that a pre‑trained VAE’s latent space can be externally modulated to achieve fine‑grained style blending without retraining; (2) it introduces a principled, rate‑distortion‑based bit‑allocation mechanism for generation‑time control, offering a complementary approach to β‑VAE’s training‑time regularization; and (3) it couples this control with a rigorous information‑theoretic analysis of the resulting music, linking the allocated bitrate to measurable changes in musical structure.
Limitations include the reliance on Gaussian latent assumptions and a linear bit‑allocation scheme; extending the method to non‑Gaussian or hierarchical latent models remains an open challenge. Moreover, the current setup is offline and not yet suited for real‑time interactive composition. Future work is suggested to integrate VMO‑derived symbolic motifs directly into the VAE pipeline, potentially enabling a hybrid system that combines human‑engineered musical features with deep generative models for richer, controllable improvisation.