No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the obtained weights.


💡 Research Summary

The paper investigates machine translation (MT) for five low‑resource Turkic language pairs: Russian‑Bashkir, Russian‑Kazakh, Russian‑Kyrgyz, English‑Tatar, and English‑Chuvash. The authors explore three broad strategies: (1) data augmentation with synthetic parallel sentences, (2) fine‑tuning a multilingual pretrained model (facebook/nllb‑200‑distilled‑600M) using LoRA adapters, and (3) retrieval‑augmented prompting of large language models (LLMs).

Data augmentation
Publicly available parallel corpora for the five pairs vary widely in size (from ~35 k to over 1 M sentence pairs), leaving several pairs severely under-resourced. To overcome this scarcity, the authors employed Yandex.Translate to generate synthetic translations. For pairs where Russian was not the source, they pivoted through Russian: first translating English to Russian, then Russian to the target Turkic language. Data were processed in large chunks (50 k–200 k samples), and any sentences that appeared in the test set were filtered out. After augmentation, each language pair contained roughly 2.46 M training examples. Additional translations of the MASSIVE dataset were generated but used only for prompting experiments, not for model training. The final dataset, named YaTURK‑7lang, is released on HuggingFace.
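The test-set filtering step above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the exact normalization (lowercasing, whitespace stripping) are assumptions.

```python
def filter_leakage(pairs, test_sources):
    """Drop synthetic (source, target) pairs whose source sentence
    also occurs in the held-out test set, to avoid data leakage."""
    # Normalize for a robust match; the paper's exact matching rule is unknown.
    blocked = {s.strip().lower() for s in test_sources}
    return [(src, tgt) for src, tgt in pairs
            if src.strip().lower() not in blocked]

pairs = [("Hello.", "Sälam."), ("Good morning.", "Xäyerle irtä.")]
test_sources = ["good morning."]
clean = filter_leakage(pairs, test_sources)  # the second pair is dropped
```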

LoRA‑based fine‑tuning
The base model is NLLB‑200‑distilled‑600M, a 600 M‑parameter multilingual encoder‑decoder pretrained on 200 languages. The authors added ten language‑pair tokens to the tokenizer (e.g., <prefix_rus_bash>). Two fine‑tuning regimes were tested: (a) independent fine‑tuning for each language pair (2 epochs each) and (b) a multi‑task pre‑training phase (1 epoch on the combined data of all five pairs) followed by LoRA adapter training for each language. The adapters used the DoRA variant of LoRA (r = 64, α = 64, dropout = 0.2), applied to the query, key, value, output projection, and feed‑forward layers. Training used 8‑bit AdamW, batch size 16 with gradient accumulation of 8 (effective batch size 128), learning rate 5e‑4, weight decay 1e‑2, and a cosine learning‑rate schedule.
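The adapter hyperparameters above can be collected into a single configuration. This is a sketch under stated assumptions: the key names mirror Hugging Face PEFT's `LoraConfig` (where `use_dora=True` enables the DoRA variant), and the module names follow the NLLB/M2M100 layer naming; neither is confirmed by the paper.

```python
# Hyperparameters from the paper, expressed as a PEFT-style config dict.
dora_config = dict(
    r=64,                 # LoRA rank
    lora_alpha=64,        # scaling factor alpha
    lora_dropout=0.2,
    use_dora=True,        # DoRA variant of LoRA
    # query/key/value/output projections and both feed-forward layers:
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "fc1", "fc2"],
)
```

In actual training this dict would be passed to `peft.LoraConfig(**dora_config)` and wrapped around the NLLB checkpoint; that step is omitted here to keep the sketch dependency-free.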

Results (Table 1) show that the multi‑task + LoRA configuration dramatically outperforms single‑task fine‑tuning for the better‑resourced languages. For Bashkir, chrF++ rose from 22.32 (single‑task) to 49.53 (multi‑task + LoRA). For Kazakh, the increase was from 40.96 to 49.93. Kyrgyz, Tatar, and Chuvash also benefited, though to a lesser extent. The best submitted models were the LoRA adapters for Bashkir and Kazakh, achieving test chrF++ scores of 46.94 and 49.71 respectively. Model weights are publicly released.

Retrieval‑augmented prompting
For Chuvash and Tatar, the authors turned to LLMs because NLLB performed poorly on Chuvash (the language was not present in its pre‑training). They built an ANNOY index over source‑language sentences from the augmented dataset. For each new source sentence, the most similar existing sentences (up to 7 000 examples) were retrieved and appended to a prompt that instructed the LLM to translate only the target language text. Two embedding models were used: thenlper/gte‑small (384‑dim) for English‑Chuvash and sentence‑transformers/paraphrase‑multilingual‑MiniLM‑L12‑v2 for the other pairs.
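The retrieval-then-prompt loop above can be sketched in a few lines. The paper uses an ANNOY index over sentence embeddings; here a brute-force cosine search over toy vectors stands in for the index and the embedding model, and the prompt wording is a hypothetical stand-in for the authors' actual instruction.

```python
import math

def top_k(query_vec, corpus_vecs, k):
    """Indices of the k corpus vectors most cosine-similar to the query
    (a brute-force stand-in for an ANNOY nearest-neighbour lookup)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    scores = [(cos(query_vec, v), i) for i, v in enumerate(corpus_vecs)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

def build_prompt(source, examples):
    """Assemble a few-shot prompt from retrieved (source, target) pairs."""
    shots = "\n".join(f"{s} => {t}" for s, t in examples)
    return f"Translate only the final sentence.\n{shots}\n{source} =>"

# Toy 2-d vectors standing in for real sentence embeddings:
corpus = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
neighbours = top_k((1.0, 0.05), corpus, k=2)  # the two closest sentences
```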

Five LLMs were evaluated: four via OpenRouter (DeepSeek‑R1‑0528, DeepSeek‑N1, MiMo‑V2, and Gemma‑3‑27B) and DeepSeek‑V3.2 via the official API (used in “reasoning” mode). Temperature was set to 0 for all models except DeepSeek‑V3.2 (which used its default of 0.7). When a model returned an empty output, the request was retried.
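The retry-on-empty behaviour can be captured in a small wrapper. This is a sketch: `call_model` is a hypothetical stand-in for the actual API client, and the retry limit is an assumption (the paper does not state one).

```python
def translate_with_retry(call_model, prompt, max_retries=3):
    """Re-issue the request when the model returns an empty output,
    up to max_retries attempts; return '' if all attempts are empty."""
    for _ in range(max_retries):
        out = call_model(prompt)
        if out and out.strip():
            return out.strip()
    return ""
```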

Key findings:

  • For English‑Chuvash, retrieval‑augmented prompting with DeepSeek‑V3.2 achieved chrF++ 39.47 on the test set, a substantial gain over zero‑shot scores (≈22).
  • For English‑Tatar, zero‑shot DeepSeek‑R1 gave 38.04, improved to 41.11 with a larger context window, but DeepSeek‑V3.2 zero‑shot reached 43.66, the highest.
  • For Kyrgyz, zero‑shot MiMo‑V2 performed best (chrF++ ≈ 46.6), and expanding the context window actually reduced performance.
  • For Bashkir and Kazakh, prompting was inferior to LoRA‑fine‑tuned models; expanding the context window sometimes caused large drops (e.g., Bashkir from 39.55 to 33.31).

Stacking attempts
The authors tried to combine multiple system outputs by selecting the translation with the highest semantic similarity to the source, using LaBSE embeddings. This yielded a slight degradation (Kazakh validation chrF++ 49.93 → 49.08) and was not pursued further.
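The selection heuristic can be sketched as follows. The paper scores candidates with LaBSE embeddings; here `embed` is a stand-in for any sentence-embedding function, and the helper name is hypothetical.

```python
import math

def pick_best(source, candidates, embed):
    """Among candidate translations from different systems, return the one
    whose embedding is most cosine-similar to the source sentence."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    src_vec = embed(source)
    return max(candidates, key=lambda c: cos(src_vec, embed(c)))
```

With a real multilingual encoder such as LaBSE, `embed` would map both the source and each candidate into a shared vector space; as the paper notes, this selection slightly underperformed the single best system.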

Discussion and conclusions
The study demonstrates that the optimal MT strategy depends heavily on the amount of available parallel data and the presence of the language in the pretrained model.

  • When synthetic data can be generated at scale and the target language is represented in the multilingual pretrained model (Bashkir, Kazakh), LoRA‑based fine‑tuning yields the best results.
  • For languages absent from the pretrained vocabulary and with extremely scarce data (Chuvash), retrieval‑augmented prompting dramatically improves quality.
  • For languages where zero‑shot performance is already strong (Kyrgyz, Tatar), simple prompting may be sufficient, and additional context can even hurt performance.

The authors release the augmented dataset (YaTURK‑7lang) and all model checkpoints, providing a valuable resource for future Turkic MT research. They suggest future work on fine‑tuning models that were originally pretrained on low‑resource languages (e.g., AI‑Forever mGPT‑kirgiz) and on hybrid approaches that combine adapters with retrieval‑augmented prompting.

