Joint Estimation of Piano Dynamics and Metrical Structure with a Multi-task Multi-Scale Network
Estimating piano dynamics from audio recordings is a fundamental challenge in computational music analysis. In this paper, we propose an efficient multi-task network that jointly predicts dynamic levels, dynamic change points, beats, and downbeats from a shared latent representation. Together, these four targets form the metrical structure of dynamics in the music score. Inspired by recent work on vocal dynamics, we use a multi-scale network as the backbone, which takes Bark-scale specific loudness as the input feature. Compared to a log-Mel input, this reduces the model size from 14.7 M to 0.5 M parameters, enabling long sequential inputs: we segment the audio into 60-second windows, double the length commonly used in beat tracking. Evaluated on the public MazurkaBL dataset, our model achieves state-of-the-art results across all tasks. This work sets a new benchmark for piano dynamics estimation and delivers a powerful yet compact tool, paving the way for large-scale, resource-efficient analysis of musical expression.
💡 Research Summary
The paper tackles the long‑standing problem of estimating piano dynamics directly from audio recordings by proposing a compact, multi‑task deep learning framework that simultaneously predicts four interrelated musical attributes: dynamic level (e.g., pp, p, mf, f, ff), dynamic change points, beat positions, and downbeat positions. The authors argue that these four targets together constitute the metrical structure of dynamics in a musical score, and that jointly learning them can exploit shared acoustic cues while respecting their distinct temporal requirements.
Input representation – Instead of the ubiquitous log‑Mel spectrogram, the authors adopt Bark‑scale specific loudness (BSSL) as the front‑end feature. BSSL is derived from a psychoacoustic model that aggregates spectral energy into critical bands, applies outer‑ and middle‑ear weighting, models masking, and maps the result to the perceptual sone scale. This representation yields a 22‑band matrix (22 × T) per 60‑second audio segment, dramatically reducing input dimensionality compared with 128‑band log‑Mel. The reduction enables a model size of only 0.5 M parameters (versus 14.7 M for a comparable log‑Mel baseline) while preserving, and even improving, performance on dynamics‑related tasks.
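The band aggregation and sone mapping can be illustrated with a simplified sketch. This is not the authors' exact psychoacoustic front end: it uses Zwicker's Bark-frequency formula to assign FFT bins to 22 critical bands, then applies a crude dB-to-sone rule (40 dB ≈ 1 sone, +10 dB doubles loudness), and it omits the outer/middle-ear weighting and masking stages described above.

```python
import numpy as np

def bark_band_loudness(power_spec, sr, n_fft, n_bands=22):
    """Aggregate an FFT power spectrogram (bins x frames) into Bark
    critical bands and map band levels to an approximate sone scale.

    Simplified stand-in for the paper's BSSL front end: no ear
    weighting, no masking model.
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)             # bin centre frequencies
    # Zwicker's Bark-scale mapping of frequency to critical-band rate
    bark = 13.0 * np.arctan(0.00076 * freqs) \
         + 3.5 * np.arctan((freqs / 7500.0) ** 2)
    band_idx = np.clip(bark.astype(int), 0, n_bands - 1)   # bin -> band assignment
    bands = np.zeros((n_bands, power_spec.shape[1]))
    for b in range(n_bands):
        bands[b] = power_spec[band_idx == b].sum(axis=0)   # energy per band
    level_db = 10.0 * np.log10(bands + 1e-10)              # band level in dB
    # crude dB -> sone approximation: each +10 dB doubles perceived loudness
    return np.power(2.0, (level_db - 40.0) / 10.0)
```

For a 60-second segment this yields the 22 × T matrix described above, versus 128 × T for log-Mel.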
Network architecture – The backbone is a multi‑scale convolutional network adapted from prior work. It processes the input through three parallel branches operating at temporal resolutions T, T/s, and T/s², where the scaling factor s is a tunable hyper‑parameter (empirically set to 5). Each branch consists of residual blocks and self‑attention modules, producing a shared latent sequence Z of shape T × 8. To prevent negative transfer among the four tasks, the authors employ a Multi‑gate Mixture‑of‑Experts (MMoE) decoder. Eight lightweight 1‑D convolutional “experts” process Z in parallel; for each task k (dynamics, change point, beat, downbeat) a dedicated gating network computes a softmax weight vector w_k(t) at every time step, which linearly combines the expert outputs into a task‑specific feature y_k,t. Separate linear heads then map these features to frame‑wise logits.
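The MMoE routing described above can be sketched in a few lines of NumPy. For clarity the eight experts are reduced to frame-wise linear maps (a simplification of the paper's 1-D convolutional experts), and all weights are random placeholders rather than learned parameters; the dimensions (T frames, latent dim 8, 8 experts, 4 tasks) follow the text, while the expert output width H is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, E, K, H = 50, 8, 8, 4, 16     # frames, latent dim, experts, tasks, expert width

# Hypothetical parameters (random here; learned jointly in the real model).
W_exp = rng.normal(size=(E, D, H)) * 0.1    # frame-wise linear "experts"
W_gate = rng.normal(size=(K, D, E)) * 0.1   # one gating network per task

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mmoe(Z):
    """Multi-gate Mixture-of-Experts over a latent sequence Z (T x D)."""
    experts = np.einsum('td,edh->eth', Z, W_exp)          # (E, T, H) expert outputs
    task_feats = []
    for k in range(K):                                    # one gate per task
        w = softmax(Z @ W_gate[k], axis=-1)               # (T, E) per-frame weights
        task_feats.append(np.einsum('te,eth->th', w, experts))
    return task_feats                                     # K tensors of shape (T, H)

Z = rng.normal(size=(T, D))
feats = mmoe(Z)       # feats[0..3]: dynamics, change point, beat, downbeat
```

Because each task learns its own gate, tasks that need different temporal cues (e.g., sparse beats versus slowly varying dynamics) can weight the shared experts differently, which is how MMoE mitigates negative transfer.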
Post‑processing – Beat and downbeat logits are thresholded at 0.5 and peak‑picked within a ±70 ms window (±3 frames at 50 fps). Dynamic labels are obtained by taking the argmax of the dynamic logits at the detected beat times. Change points are first identified by a high‑confidence (≥ 75 %) threshold, then snapped to the nearest beat to respect the dataset’s annotation convention (all dynamics are beat‑aligned).
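The thresholding, peak picking, and beat snapping steps above can be sketched as follows. This is a minimal reimplementation, not the authors' code; plateaus of exactly equal probability would be double-counted here, which a production peak picker would handle.

```python
import numpy as np

def pick_peaks(logits, threshold=0.5, window=3):
    """Threshold frame-wise activations and keep local maxima within
    +/- `window` frames (±3 frames ≈ ±70 ms at 50 fps)."""
    probs = 1.0 / (1.0 + np.exp(-logits))                 # sigmoid
    peaks = []
    for t in range(len(probs)):
        lo, hi = max(0, t - window), min(len(probs), t + window + 1)
        if probs[t] >= threshold and probs[t] == probs[lo:hi].max():
            peaks.append(t)
    return np.array(peaks)

def snap_to_beats(change_frames, beat_frames):
    """Snap each detected change point to the nearest beat frame,
    matching the dataset's beat-aligned annotation convention."""
    beat_frames = np.asarray(beat_frames)
    return np.array([beat_frames[np.argmin(np.abs(beat_frames - c))]
                     for c in change_frames])
```

Dynamic labels are then read off as the argmax of the dynamics logits at the snapped frames.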
Loss function – A composite multi‑task loss L_MTL = L_Dyn + L_CP + L_Beat + L_Dbt is used. Binary tasks (beat, downbeat, change point) employ a shift‑tolerant weighted binary cross‑entropy that up‑weights sparse positive frames and allows a ±3‑frame tolerance. The dynamics loss is a standard cross‑entropy masked by ground‑truth beat positions, enforcing the prior that dynamics occur only on beats.
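The two loss ingredients can be sketched as below. One simple way to realize shift tolerance is to dilate the sparse binary targets by ±3 frames before the weighted BCE; the `pos_weight` value and the dilation-based formulation are illustrative assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def dilate(targets, tol=3):
    """Mark every frame within +/- tol of a positive frame as positive."""
    out = targets.copy()
    for i in np.flatnonzero(targets):
        out[max(0, i - tol):i + tol + 1] = 1.0
    return out

def weighted_bce(probs, targets, pos_weight=10.0, tol=3):
    """Shift-tolerant weighted BCE: targets dilated by +/- tol frames,
    sparse positives up-weighted by pos_weight (illustrative value)."""
    t = dilate(targets, tol)
    w = np.where(t > 0, pos_weight, 1.0)
    p = np.clip(probs, 1e-7, 1 - 1e-7)
    return float(np.mean(w * -(t * np.log(p) + (1 - t) * np.log(1 - p))))

def masked_ce(logits, labels, beat_mask):
    """Dynamics loss: cross-entropy evaluated only on beat frames."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(labels)), labels]
    m = beat_mask.astype(float)
    return float((nll * m).sum() / max(m.sum(), 1.0))
```

The total objective is then the unweighted sum of the four task losses, as given above.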
Experimental setup – Evaluation is performed on the MazurkaBL corpus, a score‑aligned collection of 1,999 Chopin mazurkas (after discarding two problematic pieces). A 5‑fold cross‑validation stratified by piece is used. Audio is resampled to 22.05 kHz, segmented into 60‑second windows with 50 % overlap during training, and fed either BSSL or log‑Mel features. The model is trained for 120 epochs with AdamW (lr = 3e‑4), batch size = 10, and a fixed random seed.
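The segmentation scheme (60-second windows with 50 % overlap, i.e. a 30-second hop) can be sketched as follows; zero-padding the final partial window is an assumption, not stated in the paper.

```python
import numpy as np

def segment_audio(y, sr=22050, win_s=60.0, overlap=0.5):
    """Cut a waveform into fixed-length training windows.

    60-s windows with 50% overlap -> 30-s hop, matching the setup above.
    The last partial window is zero-padded (an assumption here).
    """
    win = int(win_s * sr)
    hop = int(win * (1.0 - overlap))
    segments = []
    for start in range(0, max(len(y) - 1, 1), hop):
        seg = y[start:start + win]
        if len(seg) < win:
            seg = np.pad(seg, (0, win - len(seg)))   # pad the final window
        segments.append(seg)
        if start + win >= len(y):
            break
    return np.stack(segments)
```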
Results – The proposed multi‑task BSSL model achieves state‑of‑the‑art performance across all four tasks: dynamics F1 54.4 ± 8.9 %, change‑point F1 26.1 ± 9.7 %, beat F1 84.1 ± 1.3 %, downbeat F1 55.2 ± 4.2 %. Compared with a single‑task version using the same features, the multi‑task model improves dynamics by +3.8 pp, change points by +5.1 pp, beats by +0.1 pp, and downbeats by +10.2 pp, while using only 0.5 M parameters (versus 4 × 0.4 M for four separate single‑task models). Ablation with log‑Mel inputs shows comparable dynamics and change‑point performance but better downbeat scores, indicating that downbeat detection may benefit from higher spectral resolution.
Significance and implications – By jointly learning dynamics and its metrical scaffolding, the system can enrich scores that lack expressive markings, a common scenario for digitized archives and for outputs of score‑level automatic music transcription (AMT) pipelines. The compact BSSL‑based front‑end enables processing of long audio contexts (60 s) on modest GPU memory (≈ 4 GiB), facilitating large‑scale batch analysis or even real‑time deployment. The MMoE gating mechanism demonstrates an effective way to balance shared representation learning with task‑specific specialization, mitigating negative transfer in multi‑task audio tasks.
Future directions suggested by the authors include extending the approach to other instruments and genres, integrating the model into real‑time performance analysis tools, and leveraging the predicted dynamics for downstream applications such as expressive music generation, automatic score editing, or pedagogical feedback systems. Overall, the paper presents a well‑engineered, empirically validated solution that advances both the methodological toolkit for music information retrieval and the practical capabilities for large‑scale expressive music analysis.