Beyond a Single Reference: Training and Evaluation with Paraphrases in Sign Language Translation
Most Sign Language Translation (SLT) corpora pair each signed utterance with a single written-language reference, despite the highly non-isomorphic relationship between sign and spoken languages, where multiple translations can be equally valid. This limitation constrains both model training and evaluation, particularly for n-gram-based metrics such as BLEU. In this work, we investigate the use of Large Language Models to automatically generate paraphrased variants of written-language translations as synthetic alternative references for SLT. First, we compare multiple paraphrasing strategies and models using an adapted ParaScore metric. Second, we study the impact of paraphrases on both training and evaluation of the pose-based T5 model on the YouTubeASL and How2Sign datasets. Our results show that naively incorporating paraphrases during training does not improve translation performance and can even be detrimental. In contrast, using paraphrases during evaluation leads to higher automatic scores and better alignment with human judgments. To formalize this observation, we introduce BLEUpara, an extension of BLEU that evaluates translations against multiple paraphrased references. Human evaluation confirms that BLEUpara correlates more strongly with perceived translation quality. We release all generated paraphrases, together with our generation and evaluation code, to support reproducible and more reliable evaluation of SLT systems.
💡 Research Summary
This paper tackles a fundamental limitation in current Sign Language Translation (SLT) corpora: the reliance on a single written‑language reference for each signed utterance. Because sign languages and spoken languages are highly non‑isomorphic, a single sign sequence can legitimately be rendered in many different textual forms that vary in word order, explicitness, and syntactic structure. The authors investigate whether large language models (LLMs) can be used to automatically generate high‑quality paraphrases of the existing references, thereby providing synthetic multi‑reference data for both model training and evaluation.
First, a wide range of modern LLMs (including GPT‑4o‑mini and LLaMA variants) are prompted with a uniform instruction to produce five meaning‑preserving rewrites for each English reference sentence. Decoding is performed with temperature 0.7 and top‑p 0.95 to ensure diversity while maintaining fluency. The generated paraphrases are assessed with a customized ParaScore metric that combines BERTScore‑based semantic similarity and a normalized Levenshtein distance (NLD) to capture lexical divergence. Hyper‑parameters γ = 0.35 and ω = 0.5 balance the two components, yielding a score ranging from 0 to 1 after normalization. Manual inspection identifies a quality threshold of 0.7; paraphrases above this threshold are considered reliable. Among the models tested, GPT‑4o‑mini achieves the highest average ParaScore and its paraphrase sets are used for subsequent experiments.
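The adapted ParaScore described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the score is a weighted combination of a semantic-similarity term (here passed in as a precomputed number standing in for BERTScore) and a lexical-diversity term derived from normalized Levenshtein distance, capped at γ and weighted by ω, then rescaled to [0, 1]. The exact combination rule in the paper may differ; the function name `parascore` and the capping/normalization choices are assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length (NLD)."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


def parascore(semantic_sim: float, reference: str, paraphrase: str,
              gamma: float = 0.35, omega: float = 0.5) -> float:
    """Hypothetical adapted ParaScore: semantic similarity plus a lexical
    diversity bonus capped at gamma and weighted by omega, normalized so
    the maximum attainable value is 1.0."""
    diversity = min(normalized_levenshtein(reference, paraphrase), gamma)
    raw = semantic_sim + omega * diversity
    return raw / (1.0 + omega * gamma)
```

Under this sketch, a paraphrase that is semantically close but lexically distant from the reference scores highest, matching the stated goal of balancing fidelity and diversity; a paraphrase scoring above the 0.7 threshold would be kept.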
The authors then explore two research questions: (1) Does augmenting the training data with paraphrased targets improve SLT performance? (2) Does evaluating against multiple paraphrases yield automatic scores that better correlate with human judgments?
For training, three configurations are compared on a pose‑based T5 architecture that processes key‑point sequences extracted from the YouTubeASL and How2Sign datasets. (a) Baseline: train on the original single reference only. (b) Random‑sample: for each training instance, randomly select one sentence from the set consisting of the original reference plus five paraphrases. (c) Minimum‑loss: compute the loss for all available paraphrases and back‑propagate only the gradient from the paraphrase that yields the smallest loss. All models are pretrained on YouTubeASL (≈400 k steps) and fine‑tuned on How2Sign (≈10 k steps) with identical hyper‑parameters. Evaluation metrics include sacreBLEU, ROUGE‑L, and BLEURT‑20. Results show that both paraphrase‑augmented strategies underperform the baseline across all metrics, indicating that exposing the model to multiple equally valid targets introduces ambiguity that harms learning.
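The two paraphrase-augmented target-selection strategies reduce to a small piece of logic in the training loop. The sketch below is an assumption-laden simplification: `loss_fn` stands in for a forward pass of the pose-based T5 model returning the loss for one candidate target, and the function name `select_target` is invented for illustration; in the actual minimum-loss setup, the gradient is then back-propagated only through the chosen target.

```python
import random


def select_target(candidates: list[str], loss_fn, strategy: str = "min_loss") -> str:
    """Choose one training target from [original reference] + paraphrases.

    strategy="random"   -> uniform choice over all candidates (random-sample).
    strategy="min_loss" -> candidate with the smallest model loss (minimum-loss).
    loss_fn(text) -> float is a stand-in for the model's per-target loss.
    """
    if strategy == "random":
        return random.choice(candidates)
    return min(candidates, key=loss_fn)
```

With `strategy="min_loss"`, the model repeatedly trains on whichever target it already finds easiest, which the paper's results suggest reduces target ambiguity less effectively than simply keeping a single fixed reference.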
For evaluation, the authors propose BLEUpara, an extension of BLEU that computes n‑gram overlap against all paraphrased references and selects the highest score for each test instance. Human evaluation is conducted on a subset of the test set, and Pearson correlation between automatic scores and human ratings is measured. BLEUpara demonstrates a substantially higher correlation than standard BLEU, especially for examples where the reference exhibits high lexical variability. This confirms that multi‑reference evaluation better reflects perceived translation quality.
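The core of BLEUpara, scoring a hypothesis against every available reference and keeping the best match, can be sketched as below. This uses a simplified, add-one-smoothed sentence-level BLEU for self-containment; the paper reports sacreBLEU, whose tokenization and smoothing differ, so the absolute numbers here would not match published scores.

```python
import math
from collections import Counter


def _ngrams(tokens: list[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp: str, ref: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU with add-one smoothing and brevity penalty.
    For real evaluations, use sacreBLEU instead."""
    hyp_t, ref_t = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = _ngrams(hyp_t, n), _ngrams(ref_t, n)
        overlap = sum((h & r).values())          # clipped n-gram matches
        total = sum(h.values())
        log_prec += math.log((overlap + 1) / (total + 1))
    bp = 1.0 if len(hyp_t) >= len(ref_t) else math.exp(1 - len(ref_t) / max(len(hyp_t), 1))
    return bp * math.exp(log_prec / max_n)


def bleu_para(hyp: str, references: list[str]) -> float:
    """BLEUpara (sketch): score against each paraphrased reference, keep the max."""
    return max(sentence_bleu(hyp, r) for r in references)
```

Because the maximum over references can only match or exceed the single-reference score, BLEUpara rewards a hypothesis that is a valid rendering of the signing even when it diverges lexically from the original reference, which is exactly the case where standard BLEU under-credits correct translations.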
Additional experiments examine the effect of providing video‑level textual context in the prompting stage. Supplying preceding sentences from the same video surprisingly degrades ParaScore, suggesting current LLMs struggle to incorporate longer discourse cues for short, context‑dependent sign language sentences. The authors also compare sequential versus iterative prompting for generating multiple paraphrases; sequential prompting yields slightly higher average ParaScore with lower computational cost.
In summary, the paper’s contributions are threefold: (1) a systematic comparison of LLM‑based paraphrase generation for SLT, introducing an adapted ParaScore metric to balance semantic fidelity and lexical diversity; (2) an empirical finding that naïvely adding paraphrased targets during training does not improve, and may even hurt, translation performance; (3) the introduction of BLEUpara, a multi‑reference evaluation metric that aligns more closely with human judgments. All generated paraphrases, code, and evaluation scripts are released publicly to promote reproducibility and encourage the community to adopt more reliable evaluation practices for sign language translation systems.