MR-FlowDPO: Multi-Reward Direct Preference Optimization for Flow-Matching Text-to-Music Generation
A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching music generation models (a major class of modern generative music models) using Direct Preference Optimization (DPO) with multiple musical rewards. The rewards are crafted to assess music quality across three key dimensions: text alignment, audio production quality, and semantic consistency, utilizing scalable off-the-shelf models for each reward prediction. We employ these rewards in two ways: (i) by constructing preference data for DPO and (ii) by integrating the rewards into text prompting. To address the ambiguity in musicality evaluation, we propose a novel scoring mechanism leveraging semantic self-supervised representations, which significantly improves the rhythmic stability of generated music. We conduct an extensive evaluation using a variety of music-specific objective metrics as well as a human study. Results show that MR-FlowDPO significantly enhances overall music generation quality and is consistently preferred over highly competitive baselines in terms of audio quality, text alignment, and musicality. Our code is publicly available at https://github.com/lonzi/mrflow_dpo. Samples are provided in our demo page at https://lonzi.github.io/mr_flowdpo_demopage/.
💡 Research Summary
MR‑FlowDPO presents a novel framework that aligns modern flow‑matching text‑to‑music generators with human preferences by leveraging Direct Preference Optimization (DPO) together with a set of three complementary reward functions. The authors first identify three critical dimensions of music generation quality: (1) text‑audio alignment, (2) production‑level audio quality, and (3) semantic consistency (musical coherence). For each dimension they construct a scalable, off‑the‑shelf predictor: a music‑trained CLAP model for cosine similarity between text and audio embeddings, an aesthetic score predictor trained on 500 h of diverse audio for objective production quality, and a music‑adapted HuBERT‑T model that provides a masked‑language‑model‑style likelihood of the generated token sequence as a proxy for semantic consistency.
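The three reward axes above can be bundled into a single score record per generated sample. The sketch below is illustrative only: it assumes the text/audio embeddings, the aesthetic score, and the semantic log-likelihood have already been produced by the respective off-the-shelf predictors (CLAP, the aesthetic model, and the music-adapted HuBERT), and the function names are hypothetical, not the paper's API.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between a text embedding and an audio embedding,
    as used for the CLAP-style text-alignment reward."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reward_scores(text_emb: np.ndarray,
                  audio_emb: np.ndarray,
                  aesthetic_score: float,
                  semantic_loglik: float) -> dict:
    """Collect the three reward axes for one generated sample.

    Assumptions (all inputs come from external predictors, not computed here):
    - text_emb / audio_emb: embeddings from a music-trained CLAP model
    - aesthetic_score: scalar from the production-quality predictor
    - semantic_loglik: MLM-style likelihood from the music-adapted HuBERT
    """
    return {
        "text_alignment": cosine_similarity(text_emb, audio_emb),
        "production_quality": aesthetic_score,
        "semantic_consistency": semantic_loglik,
    }
```

Keeping the three axes separate (rather than summing them into one scalar) is what lets the pair-selection step described next threshold each axis independently.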
To turn these scalar scores into training data for DPO, the paper introduces a “Multi‑Reward Strong Domination” (MRSD) algorithm. For each text prompt, the reference flow‑matching model generates k samples. Pairwise differences of the three reward scores are computed, and thresholds are set at the 95th percentile for a primary reward axis and at the median for secondary axes. A pair (positive, negative) is kept only if the positive sample exceeds the negative one on the primary axis by more than the primary threshold and also dominates on all secondary axes by at least the secondary threshold. This yields a set of strong‑dominance triples (positive, negative, prompt) that are fed into a DPO loss adapted to flow‑matching: the loss compares the L2 distance between the learned vector field and the target vector field for both samples and encourages the model to assign lower distance to the positive sample.
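The MRSD selection step can be sketched as follows. This is a minimal reading of the description above, not the paper's implementation: it assumes per-axis thresholds are taken as percentiles over all pairwise score differences for one prompt's k samples, with the 95th percentile on the chosen primary axis and the median elsewhere.

```python
import numpy as np

def mrsd_pairs(rewards: np.ndarray, primary: int,
               q_primary: float = 95.0, q_secondary: float = 50.0):
    """Select strong-dominance (positive, negative) index pairs.

    rewards: (k, n_axes) array of reward scores for k samples of one prompt.
    primary: index of the primary reward axis.
    A pair (i, j) is kept only if sample i beats sample j on the primary
    axis by more than the 95th-percentile gap and on every secondary axis
    by at least the median gap.
    """
    k, n_axes = rewards.shape
    # Pairwise differences: diff[i, j, a] = rewards[i, a] - rewards[j, a]
    diff = rewards[:, None, :] - rewards[None, :, :]
    flat = diff.reshape(-1, n_axes)
    # Per-axis threshold over the distribution of pairwise gaps
    thr = np.array([
        np.percentile(flat[:, a], q_primary if a == primary else q_secondary)
        for a in range(n_axes)
    ])
    pairs = []
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            primary_ok = diff[i, j, primary] > thr[primary]
            secondary_ok = all(diff[i, j, a] >= thr[a]
                               for a in range(n_axes) if a != primary)
            if primary_ok and secondary_ok:
                pairs.append((i, j))  # i is the positive, j the negative
    return pairs
```

Each surviving (positive, negative) pair, together with its prompt, forms one DPO training triple; the flow-matching DPO loss then pushes the model's vector field closer to the target on the positive sample than on the negative one.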
During inference, the three reward scores are also injected as textual “reward prompts” (e.g., “high production quality, strong rhythmic consistency”) so that the model can condition on the desired reward profile without additional fine‑tuning.
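Reward prompting reduces to string construction at inference time. The sketch below is hypothetical: the descriptor wording and the 0.7 "high" threshold are placeholders, since the source only gives one example phrase ("high production quality, strong rhythmic consistency").

```python
def reward_prompt(base_prompt: str, scores: dict, high: float = 0.7) -> str:
    """Append textual reward descriptors to a text prompt.

    scores: normalized reward values keyed by axis name (assumed in [0, 1]).
    Only axes above the `high` threshold contribute a descriptor; the exact
    phrasing here is illustrative, not the paper's.
    """
    tags = []
    if scores.get("production_quality", 0.0) >= high:
        tags.append("high production quality")
    if scores.get("semantic_consistency", 0.0) >= high:
        tags.append("strong rhythmic consistency")
    if scores.get("text_alignment", 0.0) >= high:
        tags.append("faithful to the description")
    return base_prompt if not tags else f"{base_prompt}, {', '.join(tags)}"
```

Because the conditioning lives entirely in the prompt text, a user can request a desired reward profile from the fine-tuned model with no extra forward passes or weights.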
The authors evaluate MR‑FlowDPO on two state‑of‑the‑art flow‑matching generators (MelodyFlow‑1B and StableAudio‑1B). Objective metrics include CLAP similarity, the aesthetic quality score, HuBERT‑T semantic consistency, and a beat‑alignment F1 for rhythmic stability. Human evaluation involves over 200 listeners rating overall preference, audio quality, text alignment, and musicality. Across all metrics, MR‑FlowDPO shows statistically significant improvements: semantic consistency improves rhythmic stability by roughly 12 %, overall human preference rises by 18 %, and audio quality scores increase by 0.7 points on a 10‑point scale. Ablation studies confirm that each reward contributes uniquely; removing the semantic consistency reward leads to the largest drop in rhythmic stability, while omitting the production‑quality reward degrades perceived audio fidelity.
Key contributions of the work are: (1) the first integration of multi‑dimensional DPO with flow‑matching music generation, (2) a novel self‑supervised semantic consistency reward based on a music‑fine‑tuned HuBERT model, and (3) a practical method for embedding reward information directly into prompts at inference time. Limitations include reliance on the fidelity of the external reward models (which still differ from human judgments) and the static nature of the reward weighting; future work could explore dynamic, user‑specific reward weighting or meta‑learning approaches to adapt rewards on the fly.
In summary, MR‑FlowDPO demonstrates that multi‑reward DPO can effectively bridge the gap between high‑capacity generative models and subjective human preferences in the challenging domain of open‑domain music generation, delivering higher fidelity, better text alignment, and more musically coherent outputs without requiring costly human‑annotated preference data.