From Utterance to Vividity: Training Expressive Subtitle Translation LLM via Adaptive Local Preference Optimization

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv paper.

The rapid development of Large Language Models (LLMs) has significantly enhanced the general capabilities of machine translation. However, as application scenarios grow more complex, the limitations of LLMs in vertical-domain translation are becoming apparent. In this study, we focus on how to construct translation LLMs that meet the needs of domain customization, taking visual-media subtitle translation as our topic and exploring how to train expressive, vivid translation LLMs. We examine how literal and liberal translation manifest in subtitle translation and other domains, and verify the reliability of LLMs as reward models and evaluators for translation. In addition, to train an expressive translation LLM, we construct and release a multidirectional subtitle parallel corpus and propose the Adaptive Local Preference Optimization (ALPO) method for fine-grained preference alignment. Experimental results demonstrate that ALPO achieves outstanding performance in multidimensional evaluation of translation quality.


💡 Research Summary

The paper tackles the problem of building a large‑language‑model (LLM) that can produce expressive, vivid subtitles for visual media—a domain that demands liberal translation, i.e., conveying atmosphere, emotion, and tone rather than strict literal fidelity. The authors first argue that subtitle translation sits at the liberal end of a spectrum ranging from highly literal domains such as legislation, news, and medicine to more creative ones like literature and visual media. By performing back‑translation experiments on a newly constructed multilingual subtitle corpus (MuSC) across several language pairs (EN↔ZH, EN↔DE, ZH↔TH), they demonstrate that subtitles yield significantly lower BLEU and ChrF++ scores than literal domains, confirming a higher degree of liberal translation.
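The back-translation probe can be illustrated with a toy character n-gram F-score in the spirit of ChrF (a simplified sketch, not the ChrF++ implementation the paper uses, and the example sentences are invented): a round trip through a literal domain preserves surface wording and scores high, while a liberal round trip preserves meaning but not wording and scores low.

```python
from collections import Counter

def char_ngrams(text, n):
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified ChrF: mean char n-gram F-beta over n = 1..max_n.
    (Real ChrF++ also mixes in word n-grams; this is only a sketch.)"""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())
        p, r = overlap / sum(hyp.values()), overlap / sum(ref.values())
        scores.append(0.0 if p + r == 0 else
                      (1 + beta**2) * p * r / (beta**2 * p + r))
    return 100 * sum(scores) / len(scores) if scores else 0.0

# Back-translation probe: liberal domains should score lower.
original   = "break a leg out there tonight"
literal_bt = "break a leg out there tonight"            # round trip kept the wording
liberal_bt = "wish you great success on stage tonight"  # round trip kept only the meaning

print(round(chrf(literal_bt, original), 1))                       # 100.0
print(chrf(liberal_bt, original) < chrf(literal_bt, original))    # True
```

The same logic, run with real BLEU/ChrF++ over whole corpora, is what lets the authors rank domains from literal (legislation, news, medicine) to liberal (literature, subtitles).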

A central methodological contribution is the validation of “LLM‑as‑Judge”: the 14‑billion‑parameter Qwen3‑14B model is used both as a reward model and as an evaluator. Spearman rank correlation between its scores and those of human annotators reaches ρ ≥ 0.82 across all language pairs, and Bland‑Altman analysis shows negligible systematic bias. This establishes that a relatively modest‑size LLM can reliably emulate human judgments, enabling automated preference data generation for alignment training.
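The Spearman check behind the LLM-as-Judge validation is straightforward to reproduce; the sketch below implements rank correlation in plain Python (the score lists are invented placeholders, not the paper's annotations):

```python
def rank(values):
    """Assign 1-based ranks, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of the tied positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = rank(xs), rank(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

human = [4, 5, 2, 3, 1, 5, 3]  # hypothetical annotator scores
judge = [4, 5, 1, 3, 2, 4, 3]  # hypothetical Qwen3-14B scores
print(round(spearman(human, judge), 3))
```

A rho at or above the paper's reported 0.82 threshold on such paired scores is what licenses replacing human annotators with the LLM judge during preference-data generation.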

The authors then compare “chat” style LLMs (GPT‑4o, Qwen‑Max, Claude Opus) with “reasoning” style LLMs (GPT‑5 Thinking, DeepSeek‑R1). Using 2,000 subtitle lines per direction, they compute pairwise BLEU similarity between each model’s outputs and human references. Chat models cluster together with higher mutual similarity, indicating a tendency toward literal translation, whereas reasoning models produce more diverse outputs that align better with human liberal translations. This suggests that inference‑time reasoning mechanisms promote expressive translation.
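The clustering analysis boils down to a pairwise similarity matrix over system outputs. The sketch below uses a smoothed sentence-level BLEU (not sacreBLEU, and the system outputs are invented one-line stand-ins) to show the expected pattern: two literal "chat" outputs score high against each other, while a liberal "reasoning" output diverges.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    """Sentence BLEU with add-one smoothing (a sketch, not sacreBLEU)."""
    h, r = hyp.split(), ref.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        hn, rn = ngrams(h, n), ngrams(r, n)
        match, total = sum((hn & rn).values()), max(sum(hn.values()), 1)
        log_p += math.log((match + 1) / (total + 1))   # smoothed n-gram precision
    bp = min(1.0, math.exp(1 - len(r) / max(len(h), 1)))  # brevity penalty
    return 100 * bp * math.exp(log_p / max_n)

# Hypothetical outputs for the same source line.
systems = {
    "chat_a":   "i will break your legs",
    "chat_b":   "i will break your leg",
    "reason_a": "you are going to regret this",
}
names = list(systems)
sim = {(a, b): bleu(systems[a], systems[b])
       for a in names for b in names if a != b}

# Literal chat models cluster: high mutual similarity.
print(sim[("chat_a", "chat_b")] > sim[("chat_a", "reason_a")])  # True
```

Scaled to 2,000 lines per direction and including the human references as one more "system", low similarity between reasoning models and chat models, combined with higher similarity to humans, is the paper's evidence that inference-time reasoning promotes liberal translation.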

To address the fine‑grained nature of subtitle translation—where each line must be aligned with a local stylistic preference—the paper introduces Adaptive Local Preference Optimization (ALPO). Traditional RLHF pipelines (e.g., PPO) and Direct Preference Optimization (DPO) operate on whole‑sequence outputs and thus cannot enforce line‑level preferences. ALPO proceeds as follows:

(1) The parallel subtitle corpus is split into a supervised fine‑tuning (SFT) set (80%) and an alignment set (20%).
(2) An SFT model is first trained on the larger set.
(3) For each line i in the alignment set, the model generates k = 15 candidate translations conditioned on the previously generated prefix (lines 1…i−1).
(4) Duplicates are removed; if a human reference exists, it is added to the candidate pool.
(5) The Qwen3‑14B evaluator scores each candidate, producing a local preference distribution.
(6) A segment‑wise ranking loss (e.g., pairwise logistic loss) is back‑propagated, updating the model to respect the local ordering while preserving global context.

This "process‑supervised" paradigm enables precise alignment of each subtitle line with the desired expressive style.
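The ranking objective in the final step can be sketched as a pairwise logistic loss over one line's candidate pool (a minimal illustration, assuming a DPO-style formulation with a scaling factor beta; the actual ALPO loss and all numbers below are stand-ins):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def local_pairwise_loss(logps, judge_scores, beta=0.1):
    """Pairwise logistic ranking loss over one subtitle line's candidates.

    logps        : model log-probabilities of each candidate given the prefix
    judge_scores : LLM-evaluator scores defining the local preference order

    For every pair where the judge prefers candidate i over candidate j,
    penalize the model unless it assigns i a higher log-probability than j.
    """
    loss, pairs = 0.0, 0
    for i in range(len(logps)):
        for j in range(len(logps)):
            if judge_scores[i] > judge_scores[j]:   # i is locally preferred
                loss += -math.log(sigmoid(beta * (logps[i] - logps[j])))
                pairs += 1
    return loss / max(pairs, 1)

# Three candidates for one line; the judge prefers the vivid translation.
judge            = [9.0, 6.5, 3.0]
aligned_model    = [-5.0, -7.0, -9.0]  # ranks candidates like the judge
misaligned_model = [-9.0, -7.0, -5.0]  # ranks them in reverse

print(local_pairwise_loss(aligned_model, judge) <
      local_pairwise_loss(misaligned_model, judge))   # True
```

Because the loss is computed per line rather than per sequence, gradient updates push the model toward the judge's local ordering while the conditioning prefix keeps the surrounding dialogue context intact.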

Experiments show that a 14‑B LLM fine‑tuned with ALPO outperforms state‑of‑the‑art baselines (including GPT‑4o and Claude Opus) on multiple metrics: BLEU and ChrF++ improve by 3–7 percentage points, and LLM‑as‑Judge scores for vividness and expressiveness increase substantially. Human evaluation confirms that the ALPO‑trained model’s subtitles better capture emotional nuance and cultural tone, narrowing the gap to professional human translators.

The paper also contributes a publicly released multidirectional subtitle parallel corpus covering high‑resource (EN↔ZH), medium‑resource (EN↔DE), and low‑resource (ZH↔TH) language pairs, along with a multidimensional evaluation framework based on LLM‑as‑Judge. By providing data, evaluation tools, and the ALPO algorithm, the authors lay a solid foundation for future research on domain‑specific translation LLMs.

In summary, the work demonstrates that (1) subtitle translation is a distinct, liberal translation task; (2) LLMs can serve as reliable evaluators and reward models; (3) line‑level preference alignment is essential for expressive subtitle generation; and (4) the proposed Adaptive Local Preference Optimization effectively trains a 14‑B LLM to produce high‑quality, vivid subtitles that surpass existing models across a range of languages and evaluation dimensions. Future directions include extending ALPO to other short‑text generation tasks (e.g., social media posts, ad copy) and integrating multimodal cues (video, audio) for even richer subtitle translation.

