LyriCAR: A Difficulty-Aware Curriculum Reinforcement Learning Framework For Controllable Lyric Translation
💡 Research Summary
LyriCAR addresses the notoriously difficult task of lyric translation, where preserving musicality (rhythm, rhyme, format) must be balanced against semantic fidelity. Existing approaches rely heavily on handcrafted constraints or sentence-level modeling, or require large amounts of parallel lyric-melody data, limiting their ability to capture paragraph-level rhyme patterns and to scale efficiently. LyriCAR proposes a fully unsupervised, end-to-end framework that integrates a difficulty-aware curriculum, multi-dimensional reward-based reinforcement learning, and an adaptive curriculum-switching mechanism guided by reward convergence.
The curriculum designer quantifies the intrinsic linguistic complexity of each raw English lyric paragraph using BERT perplexity and LIWC-inspired lexical, syntactic, and rhyme-density features. Paragraphs are stratified into Easy, Medium, and Hard tiers, with 9,600 samples per tier. This stratification enables a progressive "easy-to-hard" training schedule without any manual annotations or alignment information.
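The tiering step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature weights in `difficulty_score` are invented for the example, and the tercile split is one plausible way to produce three equal-sized tiers.

```python
import numpy as np

def difficulty_score(perplexity: float, rhyme_density: float,
                     avg_sentence_len: float) -> float:
    """Combine complexity features into one scalar difficulty score.
    The weights here are illustrative placeholders, not from the paper.
    Higher perplexity and longer sentences -> harder; denser rhyme -> easier."""
    return 0.5 * perplexity + 0.3 * avg_sentence_len + 0.2 * (1.0 - rhyme_density)

def stratify(paragraphs, scores):
    """Split paragraphs into Easy/Medium/Hard tiers by score terciles."""
    lo, hi = np.quantile(scores, [1 / 3, 2 / 3])
    tiers = {"easy": [], "medium": [], "hard": []}
    for para, score in zip(paragraphs, scores):
        if score <= lo:
            tiers["easy"].append(para)
        elif score <= hi:
            tiers["medium"].append(para)
        else:
            tiers["hard"].append(para)
    return tiers
```

Because the split uses empirical quantiles of the score distribution, the tiers stay balanced regardless of how the raw features are scaled.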
For reinforcement learning, the large language model Qwen-3-8B is fine-tuned. Four reward functions are defined: (i) format compliance (preserving special sentence-boundary tokens), (ii) rhythm compliance (matching target syllable count), (iii) rhyme compliance (encouraging consistent rhyme across adjacent lines), and (iv) text-quality compliance (evaluated by a prompt-based Judge LLM that outputs a discrete score of -1, 0, or 1). The overall reward is a weighted sum of these components. Rather than applying these signals as external penalties, LyriCAR adopts Group Relative Policy Optimization (GRPO), which computes a relative advantage for each candidate within a sampled group by subtracting the group's mean reward. The policy is then updated to maximize the expected relative advantage, allowing the model to internally learn trade-offs among competing objectives.
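The reward aggregation and the group-relative advantage can be sketched like this. The component weights are placeholders (the summary does not give the paper's values), and the mean-subtraction shown is the core GRPO idea; common GRPO variants additionally divide by the group's reward standard deviation.

```python
import numpy as np

# Illustrative weights for the four reward components (assumed, not from the paper).
WEIGHTS = {"format": 0.25, "rhythm": 0.25, "rhyme": 0.25, "quality": 0.25}

def total_reward(format_r, rhythm_r, rhyme_r, quality_r):
    """Weighted sum of format, rhythm, rhyme, and text-quality rewards."""
    return (WEIGHTS["format"] * format_r + WEIGHTS["rhythm"] * rhythm_r
            + WEIGHTS["rhyme"] * rhyme_r + WEIGHTS["quality"] * quality_r)

def group_relative_advantages(group_rewards):
    """GRPO-style advantage: each candidate's reward minus the group mean.
    Candidates above the group average get positive advantage, below get negative."""
    rewards = np.asarray(group_rewards, dtype=float)
    return rewards - rewards.mean()
```

Because advantages are centered within each sampled group, they sum to zero: the policy gradient pushes probability mass from below-average candidates toward above-average ones without needing a separate value network.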
Training efficiency is further enhanced by a convergence-guided curriculum adaptation strategy. After every fixed interval of epochs, the variance of validation rewards over a sliding window is monitored. If the variance stays below a pre-determined threshold for a patience of k epochs, the current curriculum stage is considered "converged" and training proceeds to the next, more difficult tier. This self-paced progression mirrors human tutoring: the model spends most of its capacity on easy data until mastery, then shifts resources to harder data precisely when needed.
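A minimal sketch of this convergence check, assuming a sliding window of validation rewards and a patience counter; the window size, variance threshold, and patience values below are illustrative, not the paper's hyperparameters.

```python
from collections import deque
import statistics

class ConvergenceMonitor:
    """Signal curriculum advancement once the variance of recent validation
    rewards stays below `threshold` for `patience` consecutive checks.
    All hyperparameter defaults here are illustrative assumptions."""

    def __init__(self, window: int = 5, threshold: float = 1e-3, patience: int = 3):
        self.rewards = deque(maxlen=window)  # sliding window of val rewards
        self.threshold = threshold
        self.patience = patience
        self.calm_checks = 0  # consecutive low-variance checks so far

    def update(self, val_reward: float) -> bool:
        """Record one validation reward; return True when the current
        curriculum stage should be considered converged."""
        self.rewards.append(val_reward)
        if len(self.rewards) < self.rewards.maxlen:
            return False  # window not yet full
        if statistics.pvariance(self.rewards) < self.threshold:
            self.calm_checks += 1
        else:
            self.calm_checks = 0  # variance spiked; reset patience
        return self.calm_checks >= self.patience
```

The patience counter guards against advancing on a momentary plateau: a single noisy window resets it, so the stage switch fires only after the reward signal has genuinely flattened.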
Experiments use the DALI dataset, extracting English lyrics from 6,984 songs and constructing EN-ZH parallel test sets. Training runs on eight NVIDIA A800 (80 GB) GPUs with stage-specific learning rates (1e-6 → 5e-7 → 1e-7) and KL-loss coefficients (0.01 → 0.05 → 0.1). Results show that LyriCAR outperforms strong baselines, including the base Qwen-3-8B model, on BLEU (up to 21.37 vs. 16.87) and COMET (81.12 vs. 77.37). In the multi-dimensional reward evaluation, LyriCAR-SD (dynamic curriculum) achieves the highest scores for rhythm (0.70), rhyme (0.77), and text quality (0.7), while reducing total training steps by roughly 34% compared to a static curriculum. Ablation studies confirm that full-data training without a curriculum plateaus at a reward of ~0.5, whereas curriculum-based training reaches ~0.7 and converges faster.
In summary, LyriCAR demonstrates that (1) difficulty-aware data stratification, (2) carefully designed multi-dimensional rewards, (3) group-relative policy optimization, and (4) reward-convergence-driven adaptive curricula together enable a model to internalize complex musical-linguistic patterns without any supervision. This yields state-of-the-art lyric translation quality while markedly improving computational efficiency, offering a scalable blueprint for future research in musically informed language generation.