D3PIA: A Discrete Denoising Diffusion Model for Piano Accompaniment Generation From Lead sheet
Generating piano accompaniments in the symbolic music domain is a challenging task that requires producing a complete piece of piano music from given melody and chord constraints, such as those provided by a lead sheet. In this paper, we propose a discrete diffusion-based piano accompaniment generation model, D3PIA, leveraging local alignment between lead sheet and accompaniment in piano-roll representation. D3PIA incorporates Neighborhood Attention (NA) to both encode the lead sheet and condition it for predicting note states in the piano accompaniment. This design enhances local contextual modeling by efficiently attending to nearby melody and chord conditions. We evaluate our model using the POP909 dataset, a widely used benchmark for piano accompaniment generation. Objective evaluation results demonstrate that D3PIA preserves chord conditions more faithfully than continuous diffusion-based and Transformer-based baselines. Furthermore, a subjective listening test indicates that D3PIA generates more musically coherent accompaniments than the comparison models.
💡 Research Summary
The paper introduces D3PIA, a novel discrete denoising diffusion model designed specifically for generating piano accompaniments from lead sheets (melody + chord information). The authors argue that existing symbolic music generation methods—largely Transformer‑based language models—suffer from complex tokenization, error accumulation in autoregressive decoding, and limited controllability. Recent diffusion models have shown strong generative capabilities across domains, but prior music diffusion work has treated piano rolls as continuous images, which discards the inherently binary nature of note events.
To address this, D3PIA adopts an entirely discrete diffusion process that operates directly on four possible piano‑roll states per time‑pitch cell: onset, off, sustain, and MASK. The forward diffusion gradually perturbs these states using a transition matrix parameterized by preservation (α), perturbation (β), and masking (γ) probabilities. The reverse process predicts the conditional distribution $p_\theta(y_{\tau-1} \mid y_\tau, x)$, where x is the lead‑sheet piano roll. Training minimizes the variational lower‑bound (VLB) loss, augmented with an absorbing‑state (AS) sampling technique that improves refinement in later diffusion steps.
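The forward perturbation described above can be illustrated with a small Python sketch. It is not the authors' implementation: the state ordering, the function names `transition_matrix` and `forward_step`, and the exact parameterization are assumptions made for illustration. Here a cell keeps its state with probability α, moves to each other ordinary state with probability β, and becomes MASK with probability γ, so that α + 2β + γ = 1, and MASK is treated as absorbing. The paper's matrix may distribute these probabilities differently.

```python
import random

# Assumed state ordering; the paper's indexing may differ.
STATES = ["onset", "sustain", "off", "MASK"]
MASK = 3  # index of the absorbing state


def transition_matrix(alpha, beta, gamma):
    """Return a row-stochastic 4x4 matrix q with q[i][j] = P(next=j | current=i)."""
    k = len(STATES) - 1  # number of ordinary (non-MASK) states
    if abs(alpha + (k - 1) * beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + (k - 1) * beta + gamma must equal 1")
    q = [[0.0] * (k + 1) for _ in range(k + 1)]
    for i in range(k):
        for j in range(k):
            q[i][j] = alpha if i == j else beta  # preserve vs. perturb
        q[i][MASK] = gamma  # masking probability
    q[MASK][MASK] = 1.0  # once masked, a cell stays masked
    return q


def forward_step(roll, q, rng):
    """Apply one forward diffusion step to every cell of a piano roll.

    roll is a list of rows, each row a list of state indices.
    """
    population = range(len(q))
    return [
        [rng.choices(population, weights=q[state])[0] for state in row]
        for row in roll
    ]
```

Calling `forward_step` repeatedly with, for example, `transition_matrix(0.9, 0.02, 0.06)` and a seeded `random.Random` drives more and more cells into MASK, which is the behaviour the reverse model is trained to undo.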
A key architectural contribution is the use of Neighborhood Attention (NA), a local‑attention mechanism that efficiently focuses on nearby pitch‑time neighborhoods rather than the full self‑attention matrix. The model consists of two main components:
- Lead‑sheet encoder – receives the combined melody‑and‑chord piano roll, processes each pitch with a bidirectional LSTM, and then applies several dilated NA blocks (window size 5, dilation pattern
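To make the local-attention idea concrete, the following is a minimal Python sketch of single-head windowed attention along one axis with an optional dilation. It is a simplification and not the model's code: it works in one dimension rather than on the pitch-time grid, the names `neighbor_indices` and `local_attention` are hypothetical, and neighbors that fall outside the sequence are simply dropped, whereas Neighborhood Attention as originally formulated keeps the neighborhood size fixed near the edges by shifting the window.

```python
import math


def neighbor_indices(i, length, window=5, dilation=1):
    """Return the in-range positions that query i may attend to."""
    half = window // 2
    candidates = (i + offset * dilation for offset in range(-half, half + 1))
    return [j for j in candidates if 0 <= j < length]


def local_attention(queries, keys, values, window=5, dilation=1):
    """Single-head attention where each query sees only its local neighborhood.

    queries, keys, values are lists of equal length holding float vectors.
    """
    dim = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        idx = neighbor_indices(i, len(keys), window, dilation)
        # Scaled dot-product scores against the neighbors only.
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(dim)
            for j in idx
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * values[j][d] for w, j in zip(weights, idx))
            for d in range(len(values[0]))
        ])
    return outputs
```

Each query scores at most `window` keys instead of every position, so the cost grows linearly with sequence length, and increasing `dilation` widens the receptive field without adding keys.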