Can Linguistically Related Languages Guide LLM Translation in Low-Resource Settings?

Aishwarya Ramasethu, Prediction Guard
Rohin Garg*, Scale AI
Niyathi Allu*, Independent
Harshwardhan Fartale*, Independent
Dun Li Chan, INTI International College Penang

*Equal contribution

Abstract

Large Language Models (LLMs) have achieved strong performance across many downstream tasks, yet their effectiveness in extremely low-resource machine translation remains limited. Standard adaptation techniques typically rely on large-scale parallel data or extensive fine-tuning, which are infeasible for the long tail of underrepresented languages. In this work, we investigate a more constrained question: in data-scarce settings, to what extent can linguistically similar pivot languages and few-shot demonstrations provide useful guidance for on-the-fly adaptation in LLMs? We study a data-efficient experimental setup that combines linguistically related pivot languages with few-shot in-context examples, without any parameter updates, and evaluate translation behavior under controlled conditions. Our analysis shows that while pivot-based prompting can yield improvements in certain configurations, particularly in settings where the target language is less well represented in the model's vocabulary, the gains are often modest and sensitive to few-shot example construction. For closely related or better represented varieties, we observe diminishing or inconsistent gains. Our findings provide empirical guidance on how and when inference-time prompting and pivot-based examples can be used as a lightweight alternative to fine-tuning in low-resource translation settings.

1 Introduction

The advent of transformer-based architectures and general-purpose Large Language Models (LLMs) such as ChatGPT (OpenAI, 2024), DeepSeek-R1 (DeepSeek-AI, 2025), Mistral (Jiang et al., 2023), and Llama 3 (AI@Meta, 2024) has led to substantial advances in machine translation over the past decade. These models exhibit strong multilingual capabilities and, for many high-resource languages, approach expert-level translation quality. However, this performance remains highly uneven across languages.

Despite the existence of over 7,000 languages worldwide (Eberhard et al., 2024), NLP research and model development remain heavily skewed toward a small set of high-resource languages (Joshi et al., 2021; Pakray et al., 2025). Prior work has documented significant disparities in LLM translation performance between English and low-resource languages (Choudhury, 2023), and recent surveys show that even state-of-the-art models such as GPT-4 often fail to outperform specialized systems on languages using non-Latin scripts (Ataman et al., 2025).

To address these disparities, substantial effort has gone into expanding multilingual datasets and models. Foundational work on massively multilingual representation learning, such as mBERT and XLM-R (Arivazhagan et al., 2019; Conneau et al., 2020), enabled cross-lingual transfer across hundreds of languages. More recent initiatives, including No Language Left Behind (NLLB) (Team, 2024) and the FLORES benchmark (Goyal et al., 2022), aim to scale multilingual machine translation to previously underrepresented languages, while projects such as Aya (Üstün et al., 2024), BLOOM (Leong et al., 2022), and Masakhane (Nekoto et al., 2020) emphasize broader linguistic coverage and community-driven data creation. Despite these efforts, coverage remains uneven, and many languages with low digital presence are still only partially supported in widely deployed translation systems.
Rather than focusing on resource-intensive data collection or training new language-specific models, we investigate whether the few-shot instruction-following capabilities of existing LLMs can be leveraged for extremely low-resource machine translation. We study an inference-time approach that combines linguistically related few-shot examples with a pivot language (a higher-resource language closely related to the target) to provide additional contextual grounding during generation.

Our experiments focus on two linguistically distinct yet underrepresented languages: Tunisian Arabic (aeb) (Mahdi, 2025) and Konkani (gom) (Rajan et al., 2020). Both languages have substantial regional and cultural significance but receive limited coverage in multilingual benchmarks and are only partially supported in large pretrained translation systems such as NLLB (Team, 2024). This makes them representative of practical low-resource scenarios where parallel data is scarce and model support varies across dialects and scripts.

We evaluate our approach using In-Context Learning (ICL) with frozen, decoder-only LLMs. We find that incorporating linguistically and semantically related few-shot examples can improve translation behavior in certain configurations, particularly when the target language appears weakly represented in the model's pretraining distribution. For Konkani, pivot-augmented prompting yields moderate gains in chrF++ relative to direct prompting, while for Tunisian Arabic the improvements are smaller and less consistent across models. These results suggest that the effectiveness of pivot-guided prompting depends strongly on language relatedness, representational coverage, and interactions between pivot and target varieties, rather than offering a universally reliable translation strategy.
2 Related Work

2.1 In-Context Learning

Prior work shows that multilingual LLM translation performance under few-shot in-context learning (ICL) depends strongly on prompt example quality (Chowdhery et al., 2022), although substantial gains are observed mainly for high-resource language pairs. In addition, Agrawal et al. (2022) confirm that even a single noisy or semantically unrelated demonstration can drastically reduce translation quality, whereas a well-formed, equivalent-meaning example is often sufficient to elicit better translations from pretrained LLMs.

Further work by Vilar et al. (2023) demonstrates that translation quality depends on the domain quality rather than the lexical similarity of the in-context examples, and that translation quality degrades with poorly selected in-context examples. However, their evaluation is limited to translations between English and a small set of relatively high-resource languages (French, German, and Chinese). The work of Garcia et al. (2023) also supports the finding that the quality of few-shot in-context examples is crucial. Puduppully et al. (2023) introduce DecoMT, a few-shot prompting approach that decomposes the translation process into a sequence of word-chunk translations.

Large-scale analyses show that ICL performance in MT is driven primarily by example quality and target-side distribution rather than prompt structure or ordering (Zhu et al., 2024b; Chitale et al., 2024). Zhu et al. (2024a) investigate robustness in ICL by introducing a dual-view demonstration selection strategy: they combine margin-based sentence-level similarity to avoid semantic noise with word-embedding-based token weighting to refine the influence of demonstrations.
Taken together, these studies show that ICL can improve machine translation under favourable conditions, but they also highlight its sensitivity to demonstration quality and distributional coverage. Importantly, most of this work evaluates languages with comparatively rich digital resources, leaving open the question of how reliably ICL-based MT behaviour transfers to truly low-resource languages with sparse data and unstable tokenization.

Recent work has explored structured linguistic scaffolding as a complement to standard few-shot prompting. Lu et al. (2024) propose Chain-of-Dictionary Prompting (CoD), which augments prompts with chained multilingual dictionary hints and reports large gains on FLORES-200. While effective, CoD relies on proprietary models and dictionary resources that may not exist for many low-resource languages. In contrast, our approach uses open 7B-8B models and small parallel corpora, providing pivot translations as broader contextual scaffolding rather than word-level lexical hints.

Other work addresses low-resource adaptation through training-time methods. Yong et al. (2023) show that adapter-based finetuning can outperform continued pretraining when adding new languages, with gains driven primarily by data availability. Muennighoff et al. (2023) introduce multitask prompted finetuning for multilingual models and demonstrate improved zero-shot generalization when the prompt language aligns with the target. Longpre et al. (2025) analyze multilingual scaling laws and argue that at very low data scales, neither pretraining nor finetuning is computationally efficient. These approaches require supervised data and training compute, whereas our work targets inference-time prompting without parameter updates.

2.2 Pivot-Language-Aided LLM Translation

Pivot strategies introduce an intermediate language to support translation in low-resource settings.
Prior work has demonstrated that the choice of pivot language can have a significant impact on translation quality.

Work by Imamura et al. (2023) shows that the poor zero-shot translation performance of multilingual NMT models can be improved by using a pivot language. In this work, they compare pivot and direct translation using English as the pivot language. Their study also investigates which kinds of parallel corpora are most effective for enhancing multilingual pivot translation.

Jiao et al. (2023) also evaluate ChatGPT for machine translation and introduce a pivot prompting strategy, in which the model first translates a source sentence into a high-resource pivot language before translating into the target language. They find that pivot prompting noticeably improves translation quality for distant or low-resource languages, and that with GPT-4, ChatGPT achieves performance comparable to commercial translation systems even on some challenging language pairs.

Extending these ideas, Elmadani and Buys (2024) introduce synthetic pivoting, where pivot sentences are generated from both the target and source languages using sequence-level knowledge distillation. This approach reduces pivot translation complexity and improves BLEU scores for low-resource Southern African languages by up to 5.6 points.

Recent work by Talwar and Laasri (2025) highlights this in their study on Nepali-English translation, where Hindi is chosen as a pivot language due to its linguistic proximity to Nepali and the greater availability of Hindi parallel corpora. By employing both fully supervised transfer learning and semi-supervised back-translation, they show that using Hindi as a pivot improves Nepali-English translation baselines, emphasizing how a well-chosen pivot language can compensate for limited data availability.

Lim et al. (2025) reformulate low-resource translation as a post-editing task, where a teacher model generates auxiliary translations and a student model is finetuned to correct them, achieving strong gains on FLORES-200/NTREX. Their results suggest that even imperfect auxiliary translations can provide useful scaffolding. While their Mufu approach relies on supervised finetuning, our work adapts this post-editing insight to pure ICL by using a single pivot translation combined with retrieved few-shot examples, without parameter updates or multi-model pipelines.

Collectively, these findings motivate our investigation into pivot-language strategies for LLM translation. Our work builds on these insights by examining whether integrating pivot-language examples and leveraging few-shot ICL can further enhance translation performance for languages like Konkani and Tunisian Arabic. In doing so, we aim to clarify the mechanisms by which pivot languages facilitate knowledge transfer in LLMs, while also extending the adaptation capabilities of models to new languages. This helps clarify when such approaches may, or may not, be effective for low-resource languages.

3 Methodology

We explore an inference-time translation technique for settings where data, compute, and model scale are limited. Our goal is to examine what kinds of evidence (such as linguistically related pivot languages and semantically retrieved few-shot examples) can be leveraged to support translation in an ICL setting using small (≈8B) decoder-only models, without fine-tuning or large parallel corpora. In particular, we investigate whether these signals provide useful guidance when translating into previously unseen or under-represented languages, and under what conditions they help, fail, or produce inconsistent behavior.
To support semantic retrieval, we construct a datastore of parallel translations organized as triplets consisting of an English source sentence, its pivot-language translation, and the corresponding target-language translation. These triplets are derived exclusively from the training split. We index the datastore using the English source sentences, as English is the input language at inference time. Sentence embeddings are computed using the all-MiniLM-L12-v2 sentence transformer, which maps text into a dense vector space suitable for semantic similarity search. This representation allows semantically related translation examples to be clustered and retrieved efficiently.

[Figure 1: The Training Data Structure. The input x contains the system instruction, retrieved semantic context (top-5), and the pivot scaffolding. The target y contains only the model's generated translation.]

At inference time, we generate an embedding for each input source sentence and query the vector datastore using cosine similarity. The top-k most semantically similar triplets are retrieved and used as in-context demonstrations. These demonstrations are formatted as English-pivot-target examples and combined with the pivot translation corresponding to the same input sentence from the parallel corpus; no pivot translations are generated by the model or obtained from external sources. The resulting prompt is structured in ChatML format (see Appendix A.11).

To mitigate contamination and retrieval leakage, all retrieved examples are drawn strictly from the training datastore, while evaluation is performed on a held-out test set that is never indexed or queried during retrieval. No test sentences or paraphrases are included in the datastore. We treat the number of in-context examples k as a controllable parameter and evaluate its effect through a systematic ablation study (see Appendix A.8).
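The retrieval step above (embed the English source, rank the datastore by cosine similarity, take the top-k triplets) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the paper uses all-MiniLM-L12-v2 dense embeddings, whereas here a toy bag-of-words embedder stands in so the sketch runs without model weights, and the pivot/target strings are hypothetical placeholders.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in embedder: the paper uses all-MiniLM-L12-v2 sentence
    # embeddings; a bag-of-words Counter is used here only so the
    # example is self-contained.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Datastore of (English, pivot, target) triplets from the training
# split, indexed by the English side only, since English is the input
# language at inference time. Contents are placeholders.
datastore = [
    ("the weather is nice today", "<pivot-1>", "<target-1>"),
    ("I am going to the market", "<pivot-2>", "<target-2>"),
    ("the market opens early today", "<pivot-3>", "<target-3>"),
]
index = [(embed(t[0]), t) for t in datastore]

def retrieve(query, k=2):
    """Return the top-k most semantically similar triplets."""
    q = embed(query)
    scored = sorted(index, key=lambda e: cosine(q, e[0]), reverse=True)
    return [triplet for _, triplet in scored[:k]]

demos = retrieve("the market opens today", k=2)
```

The retrieved `demos` are then formatted as English-pivot-target demonstrations and prepended to the prompt; the held-out test set is never added to `datastore`, mirroring the leakage controls described above.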
This retrieval-augmented prompting workflow is illustrated in Figure 1.

To isolate the contribution of the pivot language itself, we conduct controlled comparisons between pivot-augmented prompts and prompts constructed using identical retrieval procedures but excluding the pivot translation. This allows us to disentangle gains arising from semantic retrieval alone from those attributable to the linguistic bridge provided by the pivot language. The ablation study results without the pivot language are shown in Table 10 and Table 11.

4 Experiments

4.1 Languages and Data

For this experiment, we focused on two low-resource languages. The first is Konkani, an Indian language spoken in the western part of India, with approximately 2.35 million speakers as of 2011. The second is Tunisian Arabic, spoken in Tunisia, with around 12 million speakers as of 2021. A motivation for selecting these languages was their use of non-Latin scripts. In addition, Tunisian Arabic is written right to left, unlike Latin-script languages, which read left to right.

To provide a quantitative sanity check on pivot selection, we also compute word-level Jaccard similarity between the pivot and target languages in our datasets. This allows us to verify that the chosen pivots are lexically closer to the target languages than English. A detailed description and full results are provided in Appendix A.6.

4.2 Models

In this experiment, we evaluate the performance of Unbabel's TowerInstruct-7B-v0.1 (Alves et al., 2024) and NousResearch's Hermes-2-Pro-Llama-3-8B (Teknium et al., 2024). The Hermes-2-Pro-Llama-3 model is an instruction-tuned version of Llama 3 (AI@Meta, 2024) known for its multilingual capabilities.
Llama 3 supports 8 languages: English, German (deu), French (fra), Italian (ita), Portuguese (por), Hindi (hin), Spanish (spa), and Thai (tha), although the underlying foundation model has been trained on over 176 languages. Our aim is to see whether we can make use of the latent knowledge alignment of the model while translating into low-resource languages.

The TowerInstruct-7B-v0.1 model is finetuned from the TowerBase-7B-v0.1 model. TowerBase-7B-v0.1 is continually pretrained from Llama 2 on a mixture of monolingual and parallel data comprising 20B tokens, and is then finetuned on an instruction dataset relevant to the translation process; these instruction tasks include Automatic Post Edition, Context-aware Translation, and Named-Entity Recognition, among others. The languages supported by the TowerBase-7B-v0.1 model are English (eng), German (deu), French (fra), Dutch (nld), Italian (ita), Spanish (spa), Portuguese (por), Korean (kor), Russian (rus), and Chinese (zho). Although TowerInstruct-7B-v0.1 performs well on translation tasks, it is not expected to excel in languages it was not exposed to during training.

Our strategic selection of these models is designed to assess their effectiveness in translating languages outside their initial training sets. Despite TowerInstruct-7B-v0.1's specialized training in translation, it has not been directly exposed to the specific low-resource languages focused on in this experiment, offering a unique test of its adaptability to unseen languages.

4.3 Datasets

We utilized two distinct multiparallel datasets, effectively organizing the data into aligned triplets (Source-Pivot-Target) to support our retrieval-augmented pipeline.

Konkani: We constructed a dataset of English-Marathi-Konkani triplets using the open-source corpus from AI4Bharat (Gala et al., 2023).
Marathi was selected as the pivot language due to its linguistic similarity to Konkani and wider prevalence in western India. We created a distinct split of 800 examples for the training (retrieval) datastore and 200 examples for the held-out test set.

Tunisian Arabic: We derived a similar corpus of English-MSA-Tunisian triplets from the work described by Bouamor et al. (2014), with Modern Standard Arabic (MSA) chosen as the pivot language. This dataset consists of 900 examples for the training datastore and 100 examples for the held-out test set.

In total, our study operates on small training sets of approximately 1,000 records per language. This constraint was chosen specifically to simulate realistic low-resource scenarios where large-scale parallel data is unavailable.

5 Results

5.1 Does Pivot-Based Prompting Improve Translation?

To establish a reference point, we first evaluate a direct prompting baseline, where the model is given only the English source sentence and instructed to translate directly into the target language, without access to a pivot language. In this setting, chrF++ scores are often extremely low (in some cases close to 1) because the models do not reliably generate text in the intended target language or script. Instead, the output frequently drifts toward better-represented neighboring languages (e.g., producing Marathi- or Hindi-like text when the target is Konkani, or MSA-like text for Tunisian Arabic). This behavior is observed across both Hermes and Tower, indicating that, without grounding signals, the model does not consistently infer the correct output language from the instruction alone.

We then compare this to our pivot-augmented prompting condition, in which the same input is supplied along with a translation into a linguistically related pivot language.
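A pivot-augmented prompt of the kind compared here might be assembled as follows. The role tags follow ChatML, as stated in Section 3, but the instruction wording, field labels, and placeholder strings are illustrative assumptions, not the exact template from Appendix A.11.

```python
def build_prompt(source, pivot_translation, demos,
                 pivot_lang="Marathi", target_lang="Konkani"):
    """Assemble a pivot-augmented, ChatML-style prompt.

    `demos` is a list of (english, pivot, target) triplets retrieved
    from the training datastore; `pivot_translation` comes from the
    parallel corpus, not from the model.
    """
    lines = [
        "<|im_start|>system",
        f"Translate English into {target_lang}. "
        f"A {pivot_lang} translation is provided as a hint.",
        "<|im_end|>",
    ]
    # Retrieved demonstrations, formatted as English-pivot-target turns.
    for eng, piv, tgt in demos:
        lines += [
            "<|im_start|>user",
            f"English: {eng}",
            f"{pivot_lang}: {piv}",
            "<|im_end|>",
            "<|im_start|>assistant",
            f"{target_lang}: {tgt}",
            "<|im_end|>",
        ]
    # The actual input, with its pivot translation as scaffolding;
    # the prompt ends at the assistant turn for the model to complete.
    lines += [
        "<|im_start|>user",
        f"English: {source}",
        f"{pivot_lang}: {pivot_translation}",
        "<|im_end|>",
        "<|im_start|>assistant",
    ]
    return "\n".join(lines)

prompt = build_prompt(
    "good morning", "<marathi translation>",
    demos=[("hello", "<marathi hello>", "<konkani hello>")],
)
```

The direct (no-pivot) condition corresponds to the same builder with the pivot lines omitted, which is how the controlled comparison in Section 3 isolates the pivot's contribution.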
In this setting, the few-shot demonstrations and pivot translation act as grounding signals that stabilize generation toward the intended script and language family. Tables 1 and 2 report BLEU and chrF++ scores across three conditions: zero-shot (k=0), direct few-shot prompting without a pivot, and pivot-augmented prompting. For each configuration, we report the best-performing number of in-context examples (k), as determined in the ablations in Appendix 12 and 13.

For Konkani, introducing few-shot demonstrations, even without a pivot, leads to a substantial improvement in both chrF++ and BLEU, indicating that the examples themselves provide a strong anchoring effect for this previously unseen language. Adding the pivot language on top of these examples results in only small or mixed additional gains: for Hermes, the pivot condition yields a modest improvement over direct few-shot prompting (29.62→30.34 chrF++, 7.35→7.77 BLEU), whereas for Tower the pivot improves BLEU from 3.67 to 5.68 but does not improve chrF++. This suggests that, in this setting, most of the benefit arises from example-driven stabilization rather than from the pivot language itself.

For Tunisian Arabic, zero-shot scores are already relatively high, and both chrF++ and BLEU change only marginally across the direct and pivot conditions, with no consistent advantage for either model. Here, few-shot prompting provides limited additional benefit, and the pivot language does not substantially alter model behavior, consistent with the interpretation that Tunisian Arabic is already better represented in the underlying pretrained models.

We additionally evaluate whether using a pivot language that is explicitly supported by the model leads to improved translation quality.
Given the constraints of our setup, the only configuration that satisfies this condition is Hindi as a pivot for Konkani using the Hermes-2-Pro-Llama-3-8B model. We analyze this setting in detail in Appendix A.9, including token-to-word ratios and Jaccard similarity between Hindi and Konkani. Across these experiments, we find that using a model-supported pivot language does not yield systematic improvements over linguistically motivated pivots such as Marathi. In several cases, performance degrades as the number of in-context examples increases, suggesting that native model support alone is insufficient to improve or stabilize low-resource translation.

To ensure that pivot-augmented prompting does not simply cause the model to reproduce pivot-language translations, we measure chrF overlap between the pivot outputs and the final generated translations. This analysis, reported in Appendix A.5, shows consistently low chrF scores for both Konkani and Tunisian Arabic, indicating limited surface-level overlap between pivot and generated outputs. These results suggest that the model does not merely copy or lightly edit the pivot translation, but instead produces outputs that are substantially distinct from the pivot language.

One possible explanation, which we treat as hypothesis-generating rather than conclusive, comes from the token-to-word analysis in Table 6. For Tunisian Arabic, both models exhibit substantially lower token-to-word ratios (e.g., 4.96 vs. 7.65 for Tower; 2.16 vs. 4.09 for Hermes, comparing Aeb vs. Gom), indicating that the models segment Tunisian Arabic into fewer subword units than Konkani. Because Modern Standard Arabic (MSA) is well represented in most pretrained corpora, Tunisian Arabic, which shares script and lexical characteristics with MSA, may benefit indirectly from this representation.
This would help explain why few-shot prompting and pivot augmentation yield smaller or inconsistent gains in this setting. In contrast, the much higher token-to-word ratios for Konkani suggest a weaker lexical footprint in the pretrained vocabulary. Here, the few-shot examples and the pivot appear to act less as a source of additional translation competence and more as basic scaffolding for language identification, script adherence, and output stability.

However, we emphasize that this relationship is correlational rather than causal: tokenization efficiency alone does not fully explain performance differences, and other factors may contribute to the observed behavior.

5.2 Comparison to NLLB Reference Baselines

As a point of external reference, we compare the best-performing few-shot and pivot-augmented scores in Tables 1 and 2 with the NLLB-200 distilled baselines in Table 5.

For Konkani, NLLB does not provide native support, and the baseline remains relatively low (26.82 chrF++, 7.51 BLEU). Our best Hermes pivot-augmented configuration attains 30.34 chrF++ and 7.77 BLEU, while the Tower pivot setting reaches 17.66 chrF++ and 5.68 BLEU. Thus, Hermes slightly exceeds the NLLB baseline on both metrics, whereas Tower remains below it.

For Tunisian Arabic, NLLB does include explicit support, but the baseline scores remain modest (10.42 chrF++, 4.20 BLEU). In contrast, both decoder-only LLMs achieve substantially higher performance even without fine-tuning: Hermes reaches 24.32 chrF++ and 6.27 BLEU (direct few-shot), and Tower reaches 20.63 chrF++ and 4.99 BLEU (pivot). This indicates that, even relative to a supervised MT system trained with explicit support for the language, few-shot prompting can yield stronger performance in this setting.
This contrast highlights a practical trade-off: improving NLLB performance for unsupported or weakly supported languages would typically require collecting supervised training data and fine-tuning the model, whereas our approach obtains measurable gains using only few-shot prompting with no parameter updates.

Model           Setting               BLEU   chrF++
Baseline        NLLB-200              7.51   26.82
Hermes-2-Pro    Zero-shot (k=0)       1.49   1.30
                Direct (Best k)       7.35   29.62
                With Pivot (Best k)   7.77   30.34
TowerInstruct   Zero-shot (k=0)       1.28   0.69
                Direct (Best k)       3.67   21.25
                With Pivot (Best k)   5.68   17.66

Table 1: Select Konkani translation results (Eng→Gom)

Model           Setting               BLEU   chrF++
Baseline        NLLB-200              4.20   10.42
Hermes-2-Pro    Zero-shot (k=0)       4.62   24.32
                Direct (Best k)       6.27   24.32
                With Pivot (Best k)   5.06   24.31
TowerInstruct   Zero-shot (k=0)       4.19   17.62
                Direct (Best k)       4.46   20.74
                With Pivot (Best k)   4.99   20.63

Table 2: Select Tunisian Arabic results (Eng→Aeb)

5.3 Effect of Increasing the Number of In-Context Examples

We next examine whether translation quality improves simply by increasing the number of retrieved demonstrations (k), independent of the pivot signal. For each model-language pair, we evaluate chrF++ and BLEU across multiple values of k in both the direct and pivot-based translation settings (full results in Appendix 12-13 and 10-11).

For Konkani, increasing k does not produce monotonic gains in either metric. In the direct translation setting, Hermes reaches its strongest chrF++ at k=3 (29.62) with relatively low BLEU (2.33), while performance drops at both smaller and larger k values. Tower shows a similar pattern: chrF++ peaks at k=2 (21.25), while BLEU remains in the 3-4 range and collapses to 0 beyond k=3. In the pivot-based setting, Hermes attains its best chrF++ at k=1 (30.34) and best BLEU at k=4 (7.77), whereas Tower peaks at k=2 in BLEU (5.68) but achieves higher chrF++ at k=3 (17.66).
Thus, for Konkani, both metrics improve relative to k=0, but additional demonstrations beyond the best-performing k generally degrade performance.

A similar trend appears in Tunisian Arabic, although the baseline (k=0) performance is much stronger. In the direct setting, Hermes achieves its best BLEU at k=1 (6.27), while chrF++ remains highest at k=0 (24.32) and declines as k increases. Tower exhibits modest variation across k, with chrF++ peaking at k=4 (20.74) despite little corresponding change in BLEU. In the pivot condition, Hermes again shows small fluctuations around its k=0 baseline (24.31 chrF++), while Tower reaches its highest BLEU at k=2 (4.99) and highest chrF++ at k=5 (20.63), before declining at larger k.

Across both languages and models, these results indicate that:

- Performance improves substantially when moving from k=0 to small k, but gains do not scale with additional demonstrations.
- ChrF++ and BLEU often peak at different values of k.

We hypothesize that one contributing factor is the interaction between k and model context capacity. TowerInstruct operates effectively within a 4K-token window, and performance often declines once prompts approach this length, suggesting truncation or overwriting effects. Hermes supports a larger context window, yet its performance likewise plateaus or degrades beyond moderate k, implying that the limitation is not purely architectural but also behavioral: models may underutilize long-range prompt structure or overweight spurious correlations from loosely related examples.

Taken together, these findings suggest that the gains observed in our experiments are not simply an artifact of "more examples." Instead, a small number of semantically aligned demonstrations appears to provide most of the benefit, while additional examples can introduce noise that reduces both BLEU and chrF++.
In settings where pivot-based prompting yields improvements, these effects should therefore be interpreted as complementary to, rather than interchangeable with, the contribution of the few-shot demonstrations themselves.

6 Limitations

Many machine learning breakthroughs are enabled by an abundance of computational resources. However, access to large-scale compute is not uniformly available, including to most authors of this work. This disparity becomes even more apparent when working with communities that speak low-resource languages. Within these constraints, we aimed to rigorously test our hypotheses about pivot-based translation using the resources available to us. Importantly, these constraints also reflect realistic deployment conditions for many low-resource language communities, where access to large-scale compute, extensive annotation, and proprietary models is limited.

The primary limitation of this work is that, while we build on prior research on pivot languages to investigate whether linguistically related languages provide any useful signal for inference-time translation under resource constraints, the performance gains we observe are modest and often inconsistent. Working within our computational budget, we evaluated open-weight models in the 7B parameter range. While larger models may yield stronger performance, our results indicate that pivot-augmented prompting can sometimes improve performance, but its effects are highly sensitive to language characteristics and example selection, suggesting that further study is needed before drawing strong conclusions.

Additionally, much of the existing research on multilinguality and machine translation relies on human evaluation, which was not feasible in our setting.
Under these constraints, and with respect for the communities that speak these languages, we evaluate how language models adapt to previously unseen languages in low-resource conditions using automatic metrics. We report BLEU and chrF++ scores computed with SacreBLEU (Post, 2018) for reproducibility (see Appendix A.2 for scoring signatures).

However, these metrics have known limitations in low-resource and morphologically rich settings. As illustrated in Appendix A.7.1, we observe cases where the generated Konkani translation is linguistically plausible and semantically related to the reference, yet differs substantially in surface form, resulting in very low BLEU and chrF++ scores. This highlights the brittleness of n-gram-based metrics for evaluating low-resource translation quality and motivates the need for human evaluation by native speakers to better capture semantic adequacy, pragmatic meaning, and dialectal correctness.

Another limitation is that our methodology depends on the availability of a high-resource pivot language that is linguistically similar to the target language, which restricts its applicability to languages without closely related pivots. While the approach is data-efficient, it also assumes access to high-quality parallel corpora; translation quality may degrade when there is a domain mismatch between the retrieved examples and the input text.

Given the promising results observed under these constrained settings, natural extensions of this work include scaling experiments to larger open-source models, conducting human-in-the-loop evaluations with native speakers, and exploring additional language pairs to better characterize the conditions under which pivot-augmented prompting helps, fails, or produces negligible effects.

References

Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, and Marjan Ghazvininejad. 2022.
In-context examples selection for machine translation. Preprint, arXiv:2212.02437.

AI@Meta. 2024. Llama 3 model card.

Duarte M. Alves, José Pombal, Nuno M. Guerreiro, Pedro H. Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, Pierre Colombo, José G. C. de Souza, and André F. T. Martins. 2024. Tower: An open multilingual large language model for translation-related tasks. In Proceedings of the Conference on Language Modeling (COLM) 2024.

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively multilingual neural machine translation in the wild: Findings and challenges. Preprint.

Duygu Ataman, Alexandra Birch, Nizar Habash, Marcello Federico, Philipp Koehn, and Kyunghyun Cho. 2025. Machine translation in the era of large language models: a survey of historical and emerging problems. Information, 16(9).

Houda Bouamor, Nizar Habash, and Kemal Oflazer. 2014. A multidialectal parallel corpus of Arabic. In LREC, pages 1240–1245.

Pranjal Chitale, Jay Gala, and Raj Dabre. 2024. An empirical study of in-context learning in LLMs for machine translation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7384–7406, Bangkok, Thailand. Association for Computational Linguistics.

Monojit Choudhury. 2023. Generative AI has a language problem. Nature Human Behaviour, 7(11):1802–1803.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, and 48 others. 2022. PaLM: Scaling language modeling with pathways.
Preprint.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. Preprint, arXiv:1911.02116.

DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Preprint, arXiv:2501.12948.

David M. Eberhard, Gary F. Simons, and Charles D. Fennig, editors. 2024. Ethnologue: Languages of the World, twenty-seventh edition. SIL International, Dallas, Texas.

Khalid N. Elmadani and Jan Buys. 2024. Neural machine translation between low-resource languages with synthetic pivoting. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 12144–12158, Torino, Italia. ELRA and ICCL.

Jay Gala, Pranjal A. Chitale, Raghavan AK, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, and Anoop Kunchukuttan. 2023. IndicTrans2: Towards high-quality and accessible machine translation models for all 22 scheduled Indian languages. Transactions on Machine Learning Research, 2023.

Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Melvin Johnson, and Orhan Firat. 2023. The unreasonable effectiveness of few-shot learning for machine translation. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 10867–10878. PMLR.

Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. 2022.
The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538.

Kenji Imamura, Masao Utiyama, and Eiichiro Sumita. 2023. Pivot translation for zero-resource language pairs based on a multilingual pretrained model. In Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track, pages 348–359, Macau SAR, China. Asia-Pacific Association for Machine Translation.

Albert Q. Jiang, Alexandre Sablayrolles, and 1 others. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? Yes with GPT-4 as the engine. Preprint, arXiv:2301.08745.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2021. The state and fate of linguistic diversity and inclusion in the NLP world. Preprint, arXiv:2004.09095.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395. Association for Computational Linguistics.

Colin Leong, Joshua Nemecek, Jacob Mansdorfer, Anna Filighera, Abraham Owodunni, and Daniel Whitenack. 2022. Bloom Library: Multimodal datasets in 300+ languages for a variety of downstream tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8608–8621, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Zheng Wei Lim, Nitish Gupta, Honglin Yu, and Trevor Cohn. 2025. Multilingual fused learning for low-resource translation with LLMs. In International Conference on Learning Representations.

Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell, Alex Pentland, Sercan Ö. Arik, Chen-Yu Lee, and Sayna Ebrahimi.
2025. ATLAS: Adaptive transfer scaling laws for multilingual pretraining and finetuning. Preprint.

Hongyuan Lu, Haoran Yang, Haoyang Huang, Dongdong Zhang, Wai Lam, and Furu Wei. 2024. Chain-of-dictionary prompting elicits translation in large language models. Preprint, arXiv:2305.06575.

Mohamed Mahdi. 2025. How well do LLMs understand Tunisian Arabic? Preprint.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual generalization through multitask finetuning. Preprint.

Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Hassan Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, and 28 others. 2020. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2144–2160. Association for Computational Linguistics.

OpenAI. 2024. GPT-4.1 technical report.

Partha Pakray, Alexander Gelbukh, and Sivaji Bandyopadhyay. 2025. Natural language processing applications for low-resource languages. Natural Language Processing, 31(2):183–197.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Ratish Puduppully, Anoop Kunchukuttan, Raj Dabre, Ai Ti Aw, and Nancy F. Chen. 2023.
Decomposed prompting for machine translation between related languages using large language models. Preprint.

Annie Rajan, Ambuja Salgaonkar, and Ramprasad Joshi. 2020. A survey of Konkani NLP resources. Computer Science Review, 38:100299.

Abhimanyu Talwar and Julien Laasri. 2025. Pivot language for low-resource machine translation. Preprint, arXiv:2505.14553.

NLLB Team. 2024. Scaling neural machine translation to 200 languages. Nature, 630:841–846.

Ryan Teknium, Jeffrey Quesnelle, and Chen Guang. 2024. Hermes 3 technical report. arXiv preprint arXiv:2408.11857.

Ahmet Üstün, Viraat Aryabumi, Zheng Yong, Wei-Yin Ko, Daniel D'souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, and 1 others. 2024. Aya model: An instruction finetuned open-access multilingual language model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15894–15939.

David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, and George Foster. 2023. Prompting PaLM for translation: Assessing strategies and performance. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15406–15427, Toronto, Canada. Association for Computational Linguistics.

Zheng Xin Yong, Hailey Schoelkopf, Niklas Muennighoff, Alham Fikri Aji, David Ifeoluwa Adelani, Khalid Almubarak, M Saiful Bari, Lintang Sutawika, Jungo Kasai, Ahmed Baruwa, Genta Winata, Stella Biderman, Edward Raff, Dragomir Radev, and Vassilina Nikoulina. 2023. BLOOM+1: Adding language support to BLOOM for zero-shot prompting. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11682–11703, Toronto, Canada. Association for Computational Linguistics.

Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. 2024.
When scaling meets LLM finetuning: The effect of data, model and finetuning method. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024).

Shaolin Zhu, Menglong Cui, and Deyi Xiong. 2024a. Towards robust in-context learning for machine translation with large language models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16619–16629, Torino, Italia. ELRA and ICCL.

Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. 2024b. Multilingual machine translation with large language models: Empirical results and analysis. Preprint.

A Appendix

A.1 Statistical Significance

We conduct paired bootstrap resampling (Koehn, 2004) (n=10,000, p < 0.05) to test whether pivot prompting significantly outperforms direct translation. As shown in Table 3, no comparison reaches significance, indicating that observed trends are suggestive rather than conclusive.

Lang.  Model  k   BLEU ∆   p     chrF++ ∆   p
Gom    Her    0   +0.12    .08   +0.22      .20
Gom    Her    1   −0.18    .88   −0.89      1.0
Gom    Her    2   −0.12    .75   −0.33      .89
Gom    Tow    1   −0.02    .57   −0.84      1.0
Gom    Tow    2   +0.07    .23   −1.11      1.0
Aeb    Her    0   +0.23    .07   −0.05      .56
Aeb    Her    1   +0.38    .10   −0.37      .71
Aeb    Her    2   −0.26    .78   −1.23      .98
Aeb    Tow    0   −0.02    .55   +0.02      .48
Aeb    Tow    1   −0.20    .72   +0.34      .23
Aeb    Tow    2   +0.13    .31   +0.25      .22

Table 3: Paired bootstrap significance (pivot − direct). No comparison reaches p < 0.05. Gom: Konkani (n=205), Aeb: Tunisian Arabic (n=100). Her: Hermes, Tow: Tower.

A.2 SacreBLEU Signatures and Reproducibility

All BLEU and chrF++ scores were computed using SacreBLEU (Post, 2018). Following their recommendations, we report scoring signatures below for full reproducibility.
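The paired bootstrap procedure of Appendix A.1 can be sketched as follows. This is an illustrative simplification operating on per-sentence metric scores; common implementations instead recompute corpus-level BLEU/chrF++ on each resample, so treat this as a sketch of the resampling logic rather than our exact test.

```python
import random

def paired_bootstrap_p(pivot_scores, direct_scores, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap (Koehn, 2004): resample sentence indices
    with replacement and return the fraction of resamples in which the
    pivot system fails to beat the direct system on the same sentences."""
    assert len(pivot_scores) == len(direct_scores)
    n = len(pivot_scores)
    rng = random.Random(seed)
    not_better = 0
    for _ in range(n_resamples):
        idx = rng.choices(range(n), k=n)  # resample test sentences with replacement
        delta = sum(pivot_scores[i] - direct_scores[i] for i in idx) / n
        if delta <= 0:
            not_better += 1
    return not_better / n_resamples
```

A returned value below 0.05 would correspond to the significance threshold used in Table 3.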
Statistical significance was assessed via paired bootstrap resampling (Koehn, 2004) (n=10,000, p < 0.05); see Section A.1.

Metric   Signature
BLEU     nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.5.1
chrF++   nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.5.1

Table 4: SacreBLEU scoring signatures.

A.3 NLLB baselines

Table 5 reports reference translation scores from the NLLB-200 distilled model. NLLB is an encoder-decoder neural machine translation system trained for supervised MT, whereas the models in our study (Hermes and Tower) are decoder-only LLMs used in a few-shot, in-context prompting setting with no task-specific parameter updates. Accordingly, these numbers are provided only as contextual reference points rather than as directly comparable baselines. We also note that NLLB does not natively support Konkani; the scores reported for this variety in Table 5 reflect zero-shot transfer behavior rather than a tuned dialect model.

Language Pair   BLEU   chrF++
Eng-Gom         7.51   26.82
Eng-Aeb         4.20   10.42

Table 5: NLLB-200 distilled reference baseline results for our evaluation datasets.

A.4 Token analysis

As shown in Table 6, Hermes consistently exhibits lower token fertility than Tower across all non-English languages, particularly for low-resource and dialectal varieties, indicating more efficient subword representations.

Dataset   Language   Tower   Hermes
Gom       Eng        1.59    1.34
Gom       Mar        7.73    4.08
Gom       Gom        7.65    4.09
Aeb       Eng        1.27    1.21
Aeb       MSA        4.74    2.12
Aeb       Aeb        4.96    2.16

Table 6: Tokens per word across languages for Tower and Hermes models. Lower values indicate more efficient tokenization.

A.5 Deviation from Pivot Translations

To assess whether the model simply reproduces pivot-language translations or instead generates genuinely distinct target-language outputs, we compute chrF scores between pivot translations and the final generated outputs for different values of k.
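In practice these chrF scores are computed with SacreBLEU. Purely to illustrate the character n-gram overlap idea behind this copy-detection analysis, a minimal chrF-style score can be sketched as follows (simplified: character n-grams only, uniform averaging over orders, no word n-grams):

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams, whitespace removed (cf. chrF's space:no)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf_like(hyp, ref, max_n=6, beta=2.0):
    """Minimal chrF-style score: average character n-gram precision and
    recall over orders 1..max_n, combined as F_beta (beta=2 weights recall,
    as in chrF). Returns a value in [0, 100]."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(h.values()), 1))
        recalls.append(overlap / max(sum(r.values()), 1))
    p = sum(precisions) / max_n
    r = sum(recalls) / max_n
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

Scoring a generated output against the pivot translation with such a function yields low values when the output diverges from the pivot, which is the signal reported in Table 7.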
chrF is well suited for this analysis, as it measures character-level overlap and is sensitive to direct copying, while remaining robust to morphological variation.

Table 7 shows consistently low chrF scores across models, languages, and values of k, indicating limited surface-level overlap between pivot translations and generated output. This suggests that the models are not merely copying or lightly editing the pivot translations but are instead producing substantially different outputs.

Notably, the Tower model exhibits particularly low chrF scores compared to Hermes for both Arabic and Konkani, with scores for Konkani remaining below 11 across all values of k. This behavior indicates an even stronger departure from the pivot translations, reinforcing the conclusion that the generated outputs are not simple transcriptions or reformulations of the pivot language.

The stability of chrF scores across different values of k further suggests that this divergence is systematic rather than an artifact of sampling variability. Overall, these results provide evidence that the generation step does not collapse to reproducing pivot-language translations, but instead yields outputs that are meaningfully distinct from the pivot representations.

Language   Model    k=1     k=2     k=3     k=4     k=5
Arabic     Hermes   24.09   24.89   25.17   23.79   24.06
Arabic     Tower    13.08   12.70   11.85   12.06   11.68
Konkani    Hermes   28.96   27.45   27.82   25.95   26.80
Konkani    Tower    10.78   8.71    9.69    9.16    8.24

Table 7: chrF scores between pivot translations and generated translations for different values of k. Lower scores indicate greater divergence from the pivot output, suggesting that the model is not simply reproducing the pivot translations.

A.6 Jaccard similarity

Pairwise lexical similarity between the languages in our corpus is reported in Tables 8 and 9.
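The word-level Jaccard similarity behind these tables (vocabulary intersection over union) can be sketched as follows; this is a minimal illustration assuming whitespace tokenization, not our exact preprocessing:

```python
def jaccard_similarity(corpus_a, corpus_b):
    """Word-level Jaccard similarity: |V_a ∩ V_b| / |V_a ∪ V_b|, where each
    vocabulary is the set of whitespace-separated tokens in a corpus
    (a list of sentence strings)."""
    vocab_a = {tok for line in corpus_a for tok in line.split()}
    vocab_b = {tok for line in corpus_b for tok in line.split()}
    if not vocab_a and not vocab_b:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)
```

Applied to parallel corpora of two languages, this yields the values in Tables 8 and 9 (e.g., 0.1054 for Marathi-Konkani).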
Marathi and Konkani exhibit substantially higher lexical overlap than English with either language, while Tunisian Arabic shows moderate overlap with Modern Standard Arabic (MSA). These values are not used as a selection criterion, but rather serve as supporting evidence that our chosen pivots are linguistically closer to the target languages than English.

For each pair of languages, we compute word-level Jaccard similarity by treating the vocabulary extracted from each corpus as a set. The similarity is defined as the size of the intersection divided by the size of the union, yielding a value between 0 and 1, where higher scores indicate greater lexical overlap.

In our datasets, Marathi (Mar) and Konkani (Gom) show moderate lexical similarity (10.6%), while Tunisian Arabic (Aeb) and Modern Standard Arabic (Msa) exhibit higher overlap (16.5%). Notably, the Arabic variants show greater lexical closeness than the Indo-Aryan language pair, reflecting the stronger typological affinity among Arabic dialects. A detailed breakdown of vocabulary sizes and pairwise similarity scores is presented in Tables 8 and 9. In future work, we plan to explore whether lexical similarity correlates with translation performance.

Language Pair   Jaccard Similarity
Eng-Mar         0.0002
Eng-Gom         0.0121
Mar-Gom         0.1054

Table 8: Word-level Jaccard similarity scores for the Konkani corpus. Marathi and Konkani show substantially higher lexical overlap than English with either language.

Language Pair   Jaccard Similarity
Eng-MSA         0.0010
Eng-Aeb         0.0010
MSA-Aeb         0.1646

Table 9: Word-level Jaccard similarity scores for the Arabic corpus. MSA and Tunisian Arabic show moderate lexical overlap, while both exhibit minimal overlap with English.

A.7 Ablation on the Number of In-Context Examples (k) in the Direct Translation Setting

For completeness, we report ablations on the number of in-context examples (k) in the direct English→Target setting, i.e., without the use of a pivot language. These results complement the pivot-based experiments and allow us to isolate the marginal effect of the pivot signal from the effect of in-context demonstrations alone. Tables 10 and 11 show results for Konkani and Tunisian Arabic respectively, using the same retrieval and prompting setup as in the pivot configuration, but omitting the pivot translation from the prompt.

A.7.1 Zero-BLEU but Non-Zero chrF++ Cases

During our direct translation experiments, we observed instances where BLEU = 0 despite non-zero chrF++, particularly for Konkani. This occurs when the model output diverges lexically from the reference despite partial topical or semantic alignment. A representative example is shown below.

Ground Truth (Konkani): .
Model Translation:

A.8 Ablation on the Number of In-Context Examples (k) in the Pivot-Based Setting

For completeness, we report ablations on the number of in-context examples (k) in the pivot-based setting, i.e., where the model is provided with a linguistically related pivot translation alongside the retrieved few-shot examples; these results allow us to examine the marginal contribution of the pivot signal. The performance of the models with pivot is shown in Table 12 for Konkani and in Table 13 for Tunisian Arabic.

A.9 Ablations with LLM-Supported Pivot Languages

Our experimental design selects pivot languages based on two primary criteria: (1) linguistic similarity to the target low-resource language, and (2) higher expected digital presence relative to the target. While we attempt to quantify pivot relevance using Jaccard similarity, this metric only imperfectly captures linguistic suitability, leaving gaps in systematic pivot selection.
As an additional analysis, we consider pivot languages that are explicitly supported by the model, in order to examine whether native model support leads to improved translation performance.

A key limitation of this approach is that most general-purpose LLMs support only a narrow subset of languages, which substantially restricts coverage for low-resource targets. This constraint is evident even in our experimental setup: neither model explicitly supports Tunisian Arabic or closely related Arabic varieties, and for Konkani, only Hindi is supported, and only by the Hermes-2-Pro-Llama-3-8B model. Consequently, we evaluate this supported-pivot configuration only for Konkani and only under the Hermes-2-Pro-Llama-3-8B setting.

The Jaccard similarity between Hindi and Konkani (0.090) is slightly lower than that between Marathi and Konkani (0.105). However, because Hindi is explicitly supported by the model, this difference is reflected in tokenization behavior: the token-to-word ratio for Hindi under Hermes is substantially lower (2.85) than for Marathi and Konkani (both approximately 7; refer to Table 6), consistent with stronger lexical coverage in the pretrained vocabulary.

We report BLEU and chrF++ scores for this configuration in Table 14. When comparing chrF++ scores against the corresponding Marathi-pivot setting, we do not observe systematic improvements from using a model-supported pivot language.
In several cases, performance degrades substantially as the number of in-context examples increases, suggesting that native model support alone is insufficient to guarantee stable or improved pivot-based translation in this low-resource setting.

Model                                  Source  Target  k   BLEU   chrF++
Unbabel/TowerInstruct-v0.1             Eng     Gom     0   1.28   0.69
Unbabel/TowerInstruct-v0.1             Eng     Gom     1   3.67   21.01
Unbabel/TowerInstruct-v0.1             Eng     Gom     2   3.38   21.25
Unbabel/TowerInstruct-v0.1             Eng     Gom     3   3.39   19.30
Unbabel/TowerInstruct-v0.1             Eng     Gom     4   0.0    19.78
Unbabel/TowerInstruct-v0.1             Eng     Gom     5   0.0    19.38
NousResearch/Hermes-2-Pro-Llama-3-8B   Eng     Gom     0   1.49   1.30
NousResearch/Hermes-2-Pro-Llama-3-8B   Eng     Gom     1   2.70   23.68
NousResearch/Hermes-2-Pro-Llama-3-8B   Eng     Gom     2   2.72   23.87
NousResearch/Hermes-2-Pro-Llama-3-8B   Eng     Gom     3   2.33   29.62
NousResearch/Hermes-2-Pro-Llama-3-8B   Eng     Gom     4   1.90   28.75
NousResearch/Hermes-2-Pro-Llama-3-8B   Eng     Gom     5   7.35   25.78

Table 10: Ablation on the number of in-context examples (k) for English → Konkani direct translation.

A.10 Fine-Tuning Impact

Our fine-tuning experiments were limited in scope and not comprehensive. Fine-tuning was performed on the same small training sets (900 samples) used for few-shot example retrieval, without extensive hyperparameter tuning or architectural variations. While results show promise for Konkani, comprehensive fine-tuning ablations including varied training set sizes, learning rates, and LoRA configurations remain as future work.

Konkani: In the finetuning experiments, we treat the zero-shot finetuned model (English→Konkani without pivot and without in-context demonstrations) as the reference baseline. For Hermes, the zero-shot finetuned condition achieves a chrF++ of 36.61, which increases to 40.17 when a Marathi pivot is introduced.
For TowerInstruct, chrF++ increases from 17.39 (zero-shot finetuned without pivot) to 31.91 with pivot. For completeness, we also report few-shot finetuned results in Table 18; across both settings, we observe consistent gains when the pivot language is incorporated during prompting.

Tunisian Arabic: As in the Konkani setting, we interpret the zero-shot finetuned model without a pivot as the reference baseline. For Hermes, the zero-shot finetuned condition achieves a chrF++ of 18.07, which increases to 21.87 when an MSA pivot is included during prompting. For TowerInstruct, chrF++ improves from 14.83 (zero-shot finetuned without pivot) to 19.16 with pivot. Few-shot finetuned results are also reported in Table 19; we again observe gains when the pivot language is incorporated.

Below we describe the experiment setting in detail:

Hyperparameters: With limited data, finetuning methods like prompt tuning (where embeddings are adjusted) or LoRA (Low-Rank Adaptation) prove particularly effective (Zhang et al., 2024). With Parameter-Efficient Fine-Tuning (PEFT), even increasing the data yielded modest performance improvements. For instance, using LoRA on the Hermes-2-Pro-Llama-3-8B model brought the trainable parameters down to 176,242,688, or just 2% of the model's total parameters. PEFT is computationally more efficient than pure ICL, which led us to adopt PEFT for our model fine-tuning process. We used the Hugging Face Transformers library.

In addition, the model was loaded in 4-bit precision using the BitsAndBytes library with the nf4 quantization type. For fine-tuning, we employed the LoRA configuration detailed in Table 16. The parameters in Table 17 were used to generate the output from the finetuned model during evaluation.
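A minimal sketch of this setup with the Hugging Face Transformers/PEFT/bitsandbytes stack is given below, assuming the hyperparameters of Tables 15 and 16. Two assumptions on our part: Table 16 lists target modules in abbreviated form, which we expand here to the usual Llama projection names, and we map the "PagedAdam" optimizer of Table 15 to the `paged_adamw_32bit` option; `output_dir` is a placeholder. Running it requires a GPU and downloads the model.

```python
# Sketch only: mirrors the LoRA/4-bit setup described above; requires
# transformers, peft, and bitsandbytes, plus a GPU to actually train.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # nf4 quantization, as described in the text
)

model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Hermes-2-Pro-Llama-3-8B",
    quantization_config=bnb_config,
)

# Table 16; projection names expanded from the abbreviated forms (assumption).
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1,
    bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj", "lm_head"],
)
model = get_peft_model(model, lora_config)

# Table 15; "PagedAdam" mapped to paged_adamw_32bit (assumption).
training_args = TrainingArguments(
    output_dir="outputs",  # placeholder
    per_device_train_batch_size=1,
    num_train_epochs=1.5,
    warmup_ratio=0.03,
    warmup_steps=100,
    logging_steps=25,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.001,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    save_strategy="no",
    bf16=True,
)
```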
Model                                  Source  Target  k   BLEU   chrF++
Unbabel/TowerInstruct-v0.1             Eng     Aeb     0   4.19   17.62
Unbabel/TowerInstruct-v0.1             Eng     Aeb     1   4.46   19.49
Unbabel/TowerInstruct-v0.1             Eng     Aeb     2   4.46   16.23
Unbabel/TowerInstruct-v0.1             Eng     Aeb     3   4.07   15.59
Unbabel/TowerInstruct-v0.1             Eng     Aeb     4   4.37   20.74
Unbabel/TowerInstruct-v0.1             Eng     Aeb     5   4.37   18.61
NousResearch/Hermes-2-Pro-Llama-3-8B   Eng     Aeb     0   4.62   24.32
NousResearch/Hermes-2-Pro-Llama-3-8B   Eng     Aeb     1   6.27   23.96
NousResearch/Hermes-2-Pro-Llama-3-8B   Eng     Aeb     2   5.06   20.35
NousResearch/Hermes-2-Pro-Llama-3-8B   Eng     Aeb     3   5.93   20.84
NousResearch/Hermes-2-Pro-Llama-3-8B   Eng     Aeb     4   6.27   20.99
NousResearch/Hermes-2-Pro-Llama-3-8B   Eng     Aeb     5   5.52   20.60

Table 11: Ablation on the number of in-context examples (k) for English → Tunisian Arabic (Aeb) direct translation.

A.11 Prompt Template

Both the TowerInstruct-7B-v0.1 model and the Hermes-2-Pro-Llama-3-8B model use a similar prompt format. The full prompt format is below.

<|im_start|>user
APE is a task designed to enhance the quality of the translation by performing minor adjustments
Original (English): [Original text]
Translation: [Pivot language]
Post-edited:
<|im_end|>
<|im_start|>assistant
[LLM translation]
<|im_end|>

The prompt includes the source sentence in English and its translation in a pivot language. For in-context learning, the prompt contains five demonstrations. In each demonstration, the assistant field is pre-filled with the target-language translation. These demonstrations are carefully selected sentences that closely resemble the sentence to be translated. In the final instance, the assistant field is left blank. This prompt structure proved to be highly effective for translation tasks of this nature.
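A prompt of this shape can be assembled programmatically. The following sketch is illustrative only; the helper name and the demonstration-triple representation are ours, not from our codebase:

```python
def build_ape_prompt(demos, source, pivot_translation):
    """Assemble the APE-style chat prompt described above. `demos` is a list
    of (source, pivot, target) demonstration triples; the final user turn
    opens an assistant turn that is left blank for the model to complete."""
    instruction = ("APE is a task designed to enhance the quality of the "
                   "translation by performing minor adjustments")
    blocks = []
    for demo_src, demo_pivot, demo_tgt in demos:
        blocks.append(
            "<|im_start|>user\n"
            f"{instruction}\n"
            f"Original (English): {demo_src}\n"
            f"Translation: {demo_pivot}\n"
            "Post-edited:\n<|im_end|>\n"
            f"<|im_start|>assistant\n{demo_tgt}\n<|im_end|>"
        )
    # Final instance: assistant field is left blank.
    blocks.append(
        "<|im_start|>user\n"
        f"{instruction}\n"
        f"Original (English): {source}\n"
        f"Translation: {pivot_translation}\n"
        "Post-edited:\n<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    return "\n".join(blocks)
```

With k=5 retrieved demonstration triples, this reproduces the five-demonstration structure described above.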
However, when using this format with the base model, the outputs often included elements like "Note," gibberish, and repetitions. After fine-tuning the model with this format, the generated translations adhered to the expected structure and consistently produced Konkani sentences.

A.12 Translation APE Examples

• Tunisian Example Prompt:

<|begin_of_text|><|im_start|>user: APE is a task designed to enhance the quality of the translation by performing only minor adjustments to fix any existing translation mistakes. If the translation is already correct, you should retain it as is.
Original (English): always and always
Translation (Modern Standard Arabic): ً ﺍﺩ
Post-edited (Tunisian): <|im_end|>
<|im_start|>assistant: ﺍﺍ ﺍﺍ <|im_end|>
<|im_start|>user: APE is a task designed to enhance the quality of the translation by performing only minor adjustments to fix any existing translation mistakes. If the translation is already correct, you should retain it as is.
Original (English): there a lot of things that tell us shut up and brake us...
Translation (Modern Standard Arabic): ﺽ ﻭ ﻥﺍ ﻝ ﺍ ﺀﺍ ـﺍ ﻙ
Post-edited (Tunisian): <|im_end|>

Model                                  Source   Pivot  Target  k   BLEU   chrF++
Unbabel/TowerInstruct-v0.1             English  Mar    Gom     0   2.07   1.30
Unbabel/TowerInstruct-v0.1             English  Mar    Gom     1   2.58   16.03
Unbabel/TowerInstruct-v0.1             English  Mar    Gom     2   5.68   8.94
Unbabel/TowerInstruct-v0.1             English  Mar    Gom     3   4.11   17.66
Unbabel/TowerInstruct-v0.1             English  Mar    Gom     4   2.84   4.90
Unbabel/TowerInstruct-v0.1             English  Mar    Gom     5   2.84   6.11
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Mar    Gom     0   2.35   24.9
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Mar    Gom     1   3.49   30.34
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Mar    Gom     2   2.36   27.59
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Mar    Gom     3   2.72   25.89
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Mar    Gom     4   7.77   27.53
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Mar    Gom     5   5.73   28.65

Table 12: Ablation on the number of in-context examples (k) for English → Marathi → Konkani translation (no fine-tuning).

<|im_start|>assistant: ﻭ ﺕ <|im_end|>
<|im_start|>user: APE is a task designed to enhance the quality of the translation by performing only minor adjustments to fix any existing translation mistakes. If the translation is already correct, you should retain it as is.
Original (English): And sometime no
Translation (Modern Standard Arabic): ً ﻭ
Post-edited (Tunisian): <|im_end|>
<|im_start|>assistant: ﺕ ﻭ <|im_end|>
<|im_start|>user: APE is a task designed to enhance the quality of the translation by performing only minor adjustments to fix any existing translation mistakes. If the translation is already correct, you should retain it as is.
Original (English): like I said before, in good and in bad
Translation (Modern Standard Arabic): ﺍ ﻙﻭ ﺍ ﻙ ﺫ
Post-edited (Tunisian): <|im_end|>
<|im_start|>assistant: ﻭ ﺍ ﺍ <|im_end|>
<|im_start|>user: APE is a task designed to enhance the quality of the translation by performing only minor adjustments to fix any existing translation mistakes. If the translation is already correct, you should retain it as is.
Original (English): let us be really happy away from standard stuffs
Translation (Modern Standard Arabic): ﺕﺍ ً ﺍ
Post-edited (Tunisian): <|im_end|>
<|im_start|>assistant: ﺡ <|im_end|>
<|im_start|>user: APE is a task designed to enhance the quality of the translation by performing only minor adjustments to fix any existing translation mistakes. If the translation is already correct, you should retain it as is.
Original (English): we shouldn't be negative all the time
Translation (Modern Standard Arabic): .ﻡﺍﻭﺍ ﻝ ﺍ ﻩ ﻥ ﻥﺍ
Post-edited (Tunisian): <|im_end|>
<|im_start|>assistant: Translation: <|im_end|>

Response from the model setting with the highest chrF++ score: ﺍ ﻝ ﻥ ﻡﺯ

Model                                  Source   Pivot  Target  k   BLEU   chrF++
Unbabel/TowerInstruct-v0.1             Eng      Msa    Aeb     0   4.37   16.45
Unbabel/TowerInstruct-v0.1             Eng      Msa    Aeb     1   3.46   18.74
Unbabel/TowerInstruct-v0.1             Eng      Msa    Aeb     2   4.99   16.57
Unbabel/TowerInstruct-v0.1             Eng      Msa    Aeb     3   4.77   17.32
Unbabel/TowerInstruct-v0.1             Eng      Msa    Aeb     4   3.09   19.80
Unbabel/TowerInstruct-v0.1             Eng      Msa    Aeb     5   3.75   20.63
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Msa    Aeb     0   5.06   24.31
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Msa    Aeb     1   4.93   21.27
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Msa    Aeb     2   3.74   18.18
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Msa    Aeb     3   4.20   20.17
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Msa    Aeb     4   4.93   19.42
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Msa    Aeb     5   4.77   16.32

Table 13: Ablation on the number of in-context examples (k) for English → MSA → Aeb translation (no fine-tuning).

Model                                  Source   Pivot  Target  k   BLEU   chrF++   ∆ chrF++
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Hin    Gom     0   2.86   25.39    +0.49
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Hin    Gom     1   2.47   24.12    −6.22
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Hin    Gom     2   2.47   23.96    −3.63
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Hin    Gom     3   2.41   23.69    −2.20
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Hin    Gom     4   0.04   3.09     −24.44
NousResearch/Hermes-2-Pro-Llama-3-8B   English  Hin    Gom     5   0.02   2.49     −26.16

Table 14: Ablation on the number of in-context examples (k) for English → Hindi → Konkani translation using Nous Hermes (no fine-tuning). ∆ chrF++ is computed relative to the Marathi-pivot setting at the same k. Scores computed with SacreBLEU (Post, 2018); signatures in Appendix A.2.

Parameter                 Value
batch_size                1
num_train_epochs          1.5
warmup_ratio              0.03
logging_steps             25
learning_rate             2e-4
gradient_checkpointing    True
lr_scheduler_type         Cosine
weight_decay              0.001
save_strategy             No
optim                     PagedAdam
warmup_steps              100
bf16                      True

Table 15: Training parameters used in the model training process.

Parameter        Value
r                64
lora_alpha       16
lora_dropout     0.1
bias             none
task_type        CAUSAL_LM
target_modules   ['q', 'k', 'v', 'o', 'up', 'down', 'gate', 'lm_head']

Table 16: LoRA configuration parameters.

Parameter              Value
do_sample              True
temperature            0.1
num_return_sequences   1
max_new_tokens         200
return_full_text       False

Table 17: Inference parameters used for text generation.

• Konkani Example Prompt:

<|begin_of_text|><|im_start|>user: APE is a task designed to enhance the quality of the translation by performing only minor adjustments to fix any existing translation mistakes.
If the translation is al- ready correct, y ou should retain it as is. Original (English): Great was his compas- sion f or the tw o dear ones at this par ting mo- ment. T ranslation (Marathi): - . P ost-edited (K onkani): <|im_end|> <|im_start|>assistant: - . <|im_end|> <|im_start|>user: APE is a task designed to enhance the quality of the translation by per - f or ming onl y minor adjustments to x an y e x- isting translation mistakes. If the translation is already cor rect, y ou should retain it as is. Original (English): Suleman ’ s parents were quite tall. T ranslation (Marathi): . P ost-edited (K onkani): <|im_end|> <|im_start|>assistant: . <|im_end|> <|im_start|>user: APE is a task designed to enhance the quality of the translation by per - f or ming onl y minor adjustments to x an y e x- isting translation mistakes. If the translation is already cor rect, y ou should retain it as is. Original (English): Our countr y o wes a deep debt of gratitude to our valiant e x- Servicemen. T ranslation (Marathi): . P ost-edited (K onkani): <|im_end|> <|im_start|>assistant: - . <|im_end|> <|im_start|>user: APE is a task designed to enhance the quality of the translation by per - f or ming onl y minor adjustments to x an y e x- isting translation mistakes. If the translation is already cor rect, y ou should retain it as is. Original (English): Bo y s are equall y vulner - able to sexual abuse. T ranslation (Marathi): . P ost-edited (K onkani): <|im_end|> <|im_start|>assistant: - . <|im_end|> <|im_start|>user: APE is a task designed to enhance the quality of the translation by per - f or ming onl y minor adjustments to x an y e x- isting translation mistakes. If the translation is already cor rect, y ou should retain it as is. Original (English): But Mangal Pande y’ s bra ve deed was done through de v otion to a high and noble principle. T ranslation (Marathi): - - . P ost-edited (K onkani): <|im_end|> <|im_start|>assistant: - . 
<|im_end|> <|im_start|>user: APE is a task designed to enhance the quality of the translation by per - f or ming onl y minor adjustments to x an y e x- isting translation mistakes. If the translation is already cor rect, y ou should retain it as is. Original (English): The brothers were deepl y attac hed to each other . T ranslation (Marathi): . P ost-edited (K onkani): <|im_end|> <|im_start|>assistant: T ranslation: <|im_end|> Response from the model setting with the highest Chrf++ scor e: . Model Source Piv ot T arget BLEU CHRF++ F e w-shot Finetuned U nbabel/T o werIns tr uct-v0.1 English - Konkani 4.18 31.57 NousR esearch/Hermes-2-Pro-Llama-3-8B English - K onkani 3.49 31.49 U nbabel/T o werIns tr uct-v0.1 English Marathi K onkani 7.80 17.60 NousR esearch/Hermes-2-Pro-Llama-3-8B English Marathi Konkani 12.14 34.92 Zero-sho t Finetuned U nbabel/T o werIns tr uct-v0.1 English - Konkani 1.89 17.39 NousR esearch/Hermes-2-Pro-Llama-3-8B English - K onkani 4.01 36.61 U nbabel/T o werIns tr uct-v0.1 English Marathi K onkani 7.94 31.91 NousR esearch/Hermes-2-Pro-Llama-3-8B English Marathi Konkani 8.38 40.17 T able 18: P erformance comparison of netuned models in fe w-shot and zero-shot settings f or Konkani translation. Model Source Piv ot T arget BLEU CHRF++ F e w-shot Finetuned U nbabel/T o werIns tr uct-v0.1 English - Tn 3.3 21.05 NousR esearch/Hermes-2-Pro-Llama-3-8B English - Tn N A N A U nbabel/T o werIns tr uct-v0.1 English Msa Tn 2.82 17.12 NousR esearch/Hermes-2-Pro-Llama-3-8B English Msa Tn 8.02 35.99 Zero-sho t Finetuned U nbabel/T o werIns tr uct-v0.1 English - Tn 1.48 14.83 NousR esearch/Hermes-2-Pro-Llama-3-8B English - Tn 5.02 18.07 U nbabel/T o werIns tr uct-v0.1 English Msa Tn 2.09 19.16 NousR esearch/Hermes-2-Pro-Llama-3-8B English Msa Tn 4.62 21.87 T able 19: Perf or mance comparison of netuned models in fe w-shot and zero-shot settings f or T unisian Arabic translation.
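The chat-formatted APE prompts shown above can be assembled programmatically. The sketch below reproduces the <|im_start|>/<|im_end|> turn structure used in the example prompts; the helper name `build_ape_prompt` and the placeholder example strings are illustrative, not part of the paper's released code.

```python
# Build a few-shot APE prompt: each demonstration is a user turn (APE
# instruction + English source + pivot translation) followed by an assistant
# turn containing the post-edited target; the query is a final user turn.
INSTRUCTION = (
    "APE is a task designed to enhance the quality of the translation by "
    "performing only minor adjustments to fix any existing translation "
    "mistakes. If the translation is already correct, you should retain it as is."
)

def build_ape_prompt(examples, query, pivot_lang, target_lang):
    """examples: list of (source, pivot_translation, post_edit) triples;
    query: (source, pivot_translation) pair to be post-edited."""
    parts = ["<|begin_of_text|>"]
    for src, pivot, post in examples:
        parts.append(
            f"<|im_start|>user: {INSTRUCTION}\n"
            f"Original (English): {src}\n"
            f"Translation ({pivot_lang}): {pivot}\n"
            f"Post-edited ({target_lang}): <|im_end|>"
        )
        parts.append(f"<|im_start|>assistant: {post} <|im_end|>")
    src, pivot = query
    parts.append(
        f"<|im_start|>user: {INSTRUCTION}\n"
        f"Original (English): {src}\n"
        f"Translation ({pivot_lang}): {pivot}\n"
        f"Post-edited ({target_lang}): <|im_end|>"
    )
    return "\n".join(parts)

prompt = build_ape_prompt(
    examples=[("The brothers were deeply attached to each other.",
               "<Marathi translation>", "<Konkani post-edit>")],
    query=("Boys are equally vulnerable to sexual abuse.",
           "<Marathi translation>"),
    pivot_lang="Marathi",
    target_lang="Konkani",
)
```

Varying the length of `examples` from 0 to 5 corresponds to the k ablations in Tables 12-14.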
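The configurations in Tables 15-17 map naturally onto the Hugging Face `peft` and `transformers` APIs. The sketch below mirrors the tabulated values; it is an assumption that this stack was used, and note that the short module names in Table 16 are reproduced as printed (standard Llama checkpoints usually spell them `q_proj`, `k_proj`, etc.). "PagedAdam" is rendered here as the `paged_adamw_32bit` optimizer option, which is also an assumption.

```python
# Sketch of the fine-tuning and inference configuration from Tables 15-17,
# assuming the Hugging Face transformers + peft stack; argument names may
# differ slightly across library versions.
from peft import LoraConfig
from transformers import TrainingArguments

# Table 16: LoRA configuration.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q", "k", "v", "o", "up", "down", "gate", "lm_head"],
)

# Table 15: training hyperparameters.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    num_train_epochs=1.5,
    warmup_ratio=0.03,
    warmup_steps=100,
    logging_steps=25,
    learning_rate=2e-4,
    gradient_checkpointing=True,
    lr_scheduler_type="cosine",
    weight_decay=0.001,
    save_strategy="no",
    optim="paged_adamw_32bit",  # Table 15 lists "PagedAdam"
    bf16=True,
)

# Table 17: generation parameters (return_full_text is a pipeline-level flag).
generation_kwargs = dict(
    do_sample=True,
    temperature=0.1,
    num_return_sequences=1,
    max_new_tokens=200,
    return_full_text=False,
)
```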
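The BLEU and chrF++ scores in the tables are computed with SacreBLEU (Post, 2018), with metric signatures given in Appendix A.2. A minimal scoring sketch using the `sacrebleu` Python package (the sentences are hypothetical stand-ins for system outputs and references):

```python
# Corpus-level BLEU and chrF++ with sacrebleu; CHRF(word_order=2) yields
# chrF++ rather than plain chrF.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["the cat sat on the mat"]    # system outputs, one per segment
references = [["the cat sat on the mat"]]  # one reference stream

bleu = BLEU()
chrf_pp = CHRF(word_order=2)

bleu_score = bleu.corpus_score(hypotheses, references).score     # 100.0 here
chrf_score = chrf_pp.corpus_score(hypotheses, references).score  # 100.0 here
```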