Luth: Efficient French Specialization for Small Language Models and Cross-Lingual Transfer
The landscape of Large Language Models remains predominantly English-centric, resulting in a significant performance gap for other major languages, such as French, especially in the context of Small Language Models (SLMs). Existing multilingual models demonstrate considerably lower performance in French compared to English, and research on efficient adaptation methods for French remains limited. To address this, we introduce Luth, a family of French-specialized SLMs: through targeted post-training on curated, high-quality French data, our models outperform all open-source counterparts of comparable size on multiple French benchmarks while retaining their original English capabilities. We further show that strategic model merging enhances performance in both languages, establishing Luth as a new state of the art for French SLMs and a robust baseline for future French-language research.
💡 Research Summary
The paper “Luth: Efficient French Specialization for Small Language Models and Cross‑Lingual Transfer” tackles the persistent English‑centric bias of large language models (LLMs) by focusing on French, a major world language spoken by over 280 million people, and on small language models (SLMs) with fewer than 2 billion parameters. The authors argue that while multilingual models such as BLOOM, LLaMA, and AYA cover many languages, they are not optimized for any single language and thus underperform on French compared to English. Existing French‑focused efforts (e.g., CroissantLLM, Gaperon, Pensez) either target large, resource‑intensive models or lack reproducible, low‑cost adaptation recipes.
To fill this gap, the authors introduce a family of French‑specialized SLMs called Luth. The work is organized around three core contributions: (1) the creation of a high‑quality French instruction‑response dataset named Luth‑SFT, containing 570 k samples (≈338 million tokens); (2) the construction of a suite of Luth models ranging from 350 M to 1.7 B parameters that achieve state‑of‑the‑art performance on French benchmarks within their size class; and (3) an efficient, reproducible methodology for language‑specific adaptation that preserves, and even improves, performance on other languages through strategic model merging.
Dataset Construction
The Luth‑SFT dataset is built in several stages. First, French‑language samples are extracted from existing multilingual corpora (AYA, Smoltalk, CroissantLLM) using the langdetect library. To diversify the instruction set, two high‑quality English instruction datasets—Tülu 3 and Open‑Hermes—are translated into French using strong multilingual models (GPT‑4o and Qwen3 32B) in a “translate‑prompt, generate‑answer” fashion, thereby avoiding direct answer translation and reducing semantic drift. A two‑stage filtering pipeline follows: (i) linguistic validation enforces grammatical correctness, coherence, and pure French usage; (ii) content filtering removes programming‑related, tool‑calling, and logically inconsistent samples.
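The first stage above, keeping only genuinely French samples, can be sketched as a small filter. This is a hypothetical helper, not the paper's actual pipeline code: the detector is passed in as a callable (the paper uses the langdetect library's `detect` function in this role), and samples on which detection fails are simply dropped.

```python
def keep_french(samples, detect):
    """Keep samples whose instruction and response are both detected as French.

    `detect` is any callable mapping text -> ISO 639-1 language code
    (e.g. langdetect.detect). Samples that raise during detection
    (empty or ambiguous text) are discarded.
    """
    kept = []
    for s in samples:
        try:
            if detect(s["instruction"]) == "fr" and detect(s["response"]) == "fr":
                kept.append(s)
        except Exception:
            # langdetect raises on undetectable text; drop such samples
            pass
    return kept
```

In practice one would also deduplicate and length-filter at this stage; the sketch shows only the language gate.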
A special “Scholar” subset is also curated to address the scarcity of French scientific resources. Over 14 k PDFs of French high‑school and preparatory school exam papers (Baccalauréat, CPGE) from 1980‑2025 are harvested, parsed with regular expressions and LLM‑assisted tools (Gemini 2.5 Flash/Pro), and refined to produce 30 k high‑quality question‑answer pairs across mathematics, physics‑chemistry, computer science, engineering, biology, and other domains. This subset is heavily weighted toward mathematics (≈67 %).
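The regular-expression parsing step for exam papers might look like the following toy sketch. The `split_exercises` helper and the sample text are illustrative assumptions; the paper's actual pipeline combines regex parsing with LLM-assisted extraction (Gemini 2.5 Flash/Pro) over raw PDF text.

```python
import re

def split_exercises(text):
    """Split exam text into exercise bodies on 'Exercice N' header lines."""
    parts = re.split(r"(?m)^Exercice\s+\d+\s*$", text)
    return [p.strip() for p in parts if p.strip()]

SAMPLE = """Exercice 1
Résoudre l'équation x^2 - 4 = 0.
Exercice 2
Calculer la dérivée de f(x) = x^3."""
```

Each extracted exercise body would then be paired with its answer and refined by an LLM into a clean question-answer sample.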
Model Selection and Fine‑Tuning
The authors evaluate several SLM candidates in the sub‑2 B regime, focusing on French and English capabilities in math, general knowledge, and instruction following. The best performers are the Qwen3 series (0.6 B and 1.7 B) and the LFM2 series (350 M, 700 M, 1.2 B). These models are fully fine‑tuned (not LoRA) on the Luth‑SFT dataset using the Axolotl framework, FlashAttention for memory‑efficient computation, and sequence packing up to 16 384 tokens. Training runs on a single NVIDIA H100 (80 GB) for three epochs, with modest hyper‑parameters (e.g., learning rate 2e‑5, effective batch size 24). Loss curves show stable convergence for both Qwen3 variants.
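Sequence packing, mentioned above, concatenates several short samples into one long training window so that little of the 16 384-token context is wasted on padding. A minimal first-fit sketch over per-sample token counts (an illustrative helper, not Axolotl's actual packing implementation):

```python
def pack_sequences(lengths, max_len=16384):
    """Greedy first-fit packing: group sample token counts into windows
    whose total length stays within max_len."""
    bins = []
    for n in lengths:
        for b in bins:
            if sum(b) + n <= max_len:
                b.append(n)
                break
        else:
            bins.append([n])
    return bins
```

With per-sample lengths `[10000, 8000, 6000, 300]`, the first, third, and fourth samples share one window (16 300 tokens) and the second gets its own, instead of four padded windows.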
Model Merging for Cross‑Lingual Retention
Fine‑tuning on a French‑only corpus improves French performance but slightly degrades English abilities. To mitigate this, the authors employ model merging, combining the fine‑tuned model with its original base using either spherical linear interpolation (SLERP) or linear interpolation (LERP). MergeKit facilitates this process, and the mixing coefficient α (proportion of the fine‑tuned model) is tuned per size (e.g., 0.7 for 0.6 B, 0.5 for 1.7 B). Empirically, these simple blending methods yield the most stable results, recovering English scores while preserving French gains, and even providing modest cross‑lingual improvements—demonstrating that the two weight sets occupy complementary regions in parameter space.
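The two interpolation schemes can be written per weight tensor as follows. This is a minimal pure-Python sketch over flattened weight vectors, assuming `a` is the base model's tensor and `b` the fine-tuned one; MergeKit's real implementation operates over full state dicts and handles shape and dtype details.

```python
import math

def lerp(a, b, alpha):
    """Linear interpolation: (1 - alpha) * base + alpha * fine-tuned."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]

def slerp(a, b, alpha, eps=1e-8):
    """Spherical linear interpolation: rotate from a toward b along the
    great circle spanned by the two vectors, by fraction alpha."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    dot = sum(x * y for x, y in zip(a, b)) / max(na * nb, eps)
    theta = math.acos(max(-1.0, min(1.0, dot)))
    if abs(math.sin(theta)) < eps:
        # Nearly colinear weights: SLERP degenerates to LERP
        return lerp(a, b, alpha)
    wa = math.sin((1 - alpha) * theta) / math.sin(theta)
    wb = math.sin(alpha * theta) / math.sin(theta)
    return [wa * x + wb * y for x, y in zip(a, b)]
```

With α = 0.7, as used for the 0.6 B model, SLERP weights the fine-tuned tensor more heavily while following the sphere rather than the chord between the two weight vectors, which tends to preserve weight norms better than plain LERP.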
Evaluation
Six benchmark suites, each available in both English and French, are used: IFEval (instruction following), Math500 (mathematical reasoning), GPQA‑Diamond (graduate‑level science questions), MMLU (broad subject knowledge), ARC‑Challenge (science reasoning), and HellaSwag (commonsense reasoning). The authors extend LightEval with a toggle that disables "thinking" mode, which is essential for evaluating Qwen3 in its non‑reasoning configuration, since the model defaults to chain‑of‑thought generation.
Results (Table 3) show that Luth‑1.7 B‑Instruct achieves 58.53 % on English IFEval and 49.75 % on French IFEval, outperforming all open‑source models of comparable size. Across the six French benchmarks, Luth models improve over their base counterparts by up to +11 % absolute, with particularly strong gains in mathematics due to the Scholar subset. The merged models (e.g., Luth‑0.6 B‑Instruct via SLERP 0.7) not only recover English performance lost during French‑only fine‑tuning but also slightly surpass the original base on several tasks, confirming the efficacy of the merging strategy.
Limitations and Future Work
The study is confined to sub‑2 B models, so direct comparison with larger LLMs is absent. Hyper‑parameter sweeps are limited due to computational constraints, leaving potential performance gains unexplored. The translation‑based data generation may inherit biases from the translation models, and the reliance on publicly available French exam materials could limit domain diversity. Future directions include exploring parameter‑efficient adapters (LoRA, adapters), scaling the methodology to larger models, automating optimal merging coefficient search, and extending the pipeline to other under‑represented languages.
Conclusion
Luth demonstrates that with a carefully curated French instruction dataset, modest full‑parameter fine‑tuning, and straightforward model merging, small language models can achieve state‑of‑the‑art French performance while retaining English capabilities. This provides a practical, reproducible blueprint for building high‑quality, language‑specialized LLMs in resource‑constrained settings, and establishes a strong baseline for subsequent French‑language research.