EuroLLM-22B: Technical Report
This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B’s development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pre-training data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.
💡 Research Summary
EuroLLM‑22B is a 22‑billion‑parameter, open‑source large language model (LLM) explicitly built to serve all 24 official European Union (EU) languages and 11 additional languages, addressing the chronic under‑representation of European languages in the current open‑LLM ecosystem. The model follows the architectural lineage of EuroLLM‑1.7B and EuroLLM‑9B, employing a transformer stack with 54 layers, 48 attention heads, a hidden size of 6,144, and a BPE tokenizer with a 128k vocabulary that covers a broad spectrum of European and global scripts. Key architectural choices include grouped‑query attention (GQA), RMSNorm, SwiGLU activation, and rotary positional embeddings (RoPE) with a base frequency of θ = 10⁶, enabling a context window of up to 32K tokens, eight times larger than the 4K window used in earlier versions. This extended context is crucial for tasks that require long‑range dependencies, such as document summarisation, code analysis, and complex reasoning.
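The stated hyperparameters can be collected into a small configuration sketch. This is illustrative only: it includes just the values reported above (the number of key/value heads used by GQA, for example, is not given in this summary and is therefore omitted):

```python
from dataclasses import dataclass

@dataclass
class EuroLLMConfig:
    """Hyperparameters reported for EuroLLM-22B (illustrative sketch)."""
    n_layers: int = 54          # transformer layers
    n_heads: int = 48           # attention heads (GQA group count not reported)
    hidden_size: int = 6144     # model width
    vocab_size: int = 128_000   # 128k BPE vocabulary
    rope_theta: float = 1e6     # RoPE base frequency
    max_context: int = 32_768   # 32K-token context window

    @property
    def head_dim(self) -> int:
        # Per-head dimension implied by the reported width and head count.
        return self.hidden_size // self.n_heads
```

With the reported values, each attention head operates on a 6144 / 48 = 128-dimensional slice of the hidden state.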
The pre‑training corpus, dubbed EuroWeb, contains roughly 4 trillion tokens and is processed in three distinct phases. Phase 1 (3.6T tokens) mixes high‑quality English educational web data from FineWeb‑edu (documents with an educational score ≥ 2) with multilingual web data drawn from RedPajama‑Data‑v2 for high‑resource languages (German, Spanish, French, Italian) and from HPLT, MADLAD‑400, CulturaX, and mC4 for the remaining languages. A rigorous filtering pipeline removes noisy content through language identification, deduplication, KenLM‑based perplexity filtering, and heuristic rules (minimum length, upper‑case ratio, symbol‑to‑word ratio, etc.). All web records are then scored by EuroFilter, a classifier fine‑tuned on the FineWeb‑edu quality annotations, which assigns a quality score from 0 to 5. The corpus is split into three quality tiers, with higher tiers reserved for later training phases, so that the highest‑quality data are presented last.
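A minimal sketch of the heuristic filters and quality tiering described above. All thresholds and tier cut‑offs here are illustrative assumptions, not the values used in the actual pipeline:

```python
def passes_heuristics(text: str,
                      min_words: int = 50,
                      max_upper_ratio: float = 0.4,
                      max_symbol_ratio: float = 0.5) -> bool:
    """Toy versions of the reported heuristic rules (thresholds assumed)."""
    words = text.split()
    if len(words) < min_words:                      # minimum-length rule
        return False
    letters = [c for c in text if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > max_upper_ratio:
        return False                                # upper-case-ratio rule
    symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
    if symbols / max(len(words), 1) > max_symbol_ratio:
        return False                                # symbol-to-word-ratio rule
    return True

def quality_tier(eurofilter_score: float) -> int:
    """Map a 0-5 EuroFilter-style score to one of three tiers (cut-offs assumed).
    Higher tiers are used in later training phases."""
    if eurofilter_score >= 4:
        return 3
    if eurofilter_score >= 2:
        return 2
    return 1
```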
Parallel data are collected at both the sentence level (xx → en and en → xx) and the document level (Europarl, ParaDocs) from a wide range of public sources listed in Table 2. Duplicates are removed with Bifixer, and translation quality is enforced with Bicleaner (threshold 0.5–0.6) and CometKiwi‑22 (threshold 0.7). In the second and third phases, additional document‑level parallel corpora are incorporated, filtered with the same quality thresholds.
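The quality gating on parallel sentence pairs reduces to a conjunction of the two reported thresholds; a minimal sketch, with the lower end of the reported Bicleaner range as the default:

```python
def keep_pair(bicleaner_score: float,
              cometkiwi_score: float,
              bicleaner_threshold: float = 0.5,   # report gives 0.5-0.6
              cometkiwi_threshold: float = 0.7) -> bool:
    """Retain a sentence pair only if both quality scores clear their thresholds."""
    return (bicleaner_score >= bicleaner_threshold
            and cometkiwi_score >= cometkiwi_threshold)
```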
Code and mathematics data are sourced from The Stack, AlgebraicStack, OpenWebMath, GSM8k, the Mathematics Aptitude Test, and the newly introduced FineMath dataset, which specifically targets mathematical reasoning. In the final phase, synthetic math data are generated by prompting Qwen‑2.5‑Math‑7B with rewritten questions from MathInstruct and MetaMathQA; answers are judged by Qwen‑2.5‑32B in an LLM‑as‑a‑Judge framework, retaining only samples scoring ≥ 9/10. Additional synthetic multiple‑choice items are created with Gemma‑2‑9B and Gemma‑2‑27B, with best‑of‑N selection performed by Qwen‑2.5‑32B.
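The LLM‑as‑a‑Judge retention rule can be sketched as a simple filter. Here `judge` is a hypothetical callable standing in for a wrapper around Qwen‑2.5‑32B that returns a 0–10 score:

```python
def filter_synthetic(samples, judge, min_score: float = 9.0):
    """Keep only synthetic samples the judge model scores at least min_score/10."""
    return [s for s in samples if judge(s) >= min_score]
```

For example, with a toy judge that reads a precomputed score from each sample, only the items at 9/10 or above survive.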
Long‑context data are bolstered by an extra 60 B tokens in the final phase, split evenly between books and code. Books are up‑sampled, and code snippets are filtered to retain only repositories with ≥ 500 stars and ≥ 100 forks, ensuring high‑quality programming examples.
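The code‑repository filter amounts to a popularity check on the stated star and fork thresholds:

```python
def keep_repo(stars: int, forks: int) -> bool:
    """Retain code only from repositories with >= 500 stars and >= 100 forks,
    as reported for the final-phase long-context code data."""
    return stars >= 500 and forks >= 100
```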
Training proceeds with a three‑phase schedule. Phase 1 uses a 10 % linear warm‑up to a peak learning rate of 1.5 × 10⁻⁴, which is then held constant. Phase 2 linearly decays the learning rate over 400 B tokens to 10 % of the peak, and Phase 3 continues the decay to zero. The overall token count reaches ~4 T. The schedule is implemented on NVIDIA’s Megatron‑LM, extended with a custom scheduler to accommodate the multi‑phase regime and the 32 K context scaling.
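The three‑phase schedule can be sketched as a piecewise function of tokens seen. The warm‑up span (read here as 10% of the phase‑1 tokens) and the phase‑3 length are assumptions, since the summary gives only the warm‑up fraction, the 400B‑token decay, and the ~4T total:

```python
PEAK_LR = 1.5e-4  # reported peak learning rate

def learning_rate(tokens_seen: float,
                  warmup_tokens: float = 0.36e12,  # assumed: 10% of phase-1 tokens
                  phase1_tokens: float = 3.6e12,   # phase-1 length (from the report)
                  phase2_tokens: float = 0.4e12,   # 400B-token decay to 10% of peak
                  phase3_tokens: float = 0.04e12): # assumed final-decay length
    """Piecewise schedule: linear warm-up, constant, linear decay to 10% of
    peak, then linear decay to zero."""
    if tokens_seen < warmup_tokens:
        return PEAK_LR * tokens_seen / warmup_tokens
    if tokens_seen < phase1_tokens:
        return PEAK_LR
    if tokens_seen < phase1_tokens + phase2_tokens:
        frac = (tokens_seen - phase1_tokens) / phase2_tokens
        return PEAK_LR * (1 - 0.9 * frac)          # ends at 0.1 * peak
    end = phase1_tokens + phase2_tokens + phase3_tokens
    if tokens_seen < end:
        frac = (tokens_seen - phase1_tokens - phase2_tokens) / phase3_tokens
        return 0.1 * PEAK_LR * (1 - frac)          # ends at zero
    return 0.0
```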
Post‑training (instruction‑tuning) leverages EuroBlocks‑22B, a newly compiled multilingual instruction‑response dataset of ~10.6 M examples. EuroBlocks‑22B builds on prior EuroBlocks releases by adding STEM‑focused prompts from Hermes‑3, Tülu 3, and Nemotron V2, and by regenerating answers with stronger models (Qwen‑2.5, DeepSeek‑AI, Llama, etc.). The best answer for each prompt is selected using Skywork‑Gemma2‑27B as a reward model. Structured reasoning traces are stripped to produce a pure instruction‑response corpus. Language distribution is roughly 60 % English, 20 % multilingual general text, and 20 % code/math/STEM.
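Best‑answer selection with a reward model reduces to an argmax over candidate responses. Here `reward_model` is a placeholder for a hypothetical wrapper around the Skywork scorer:

```python
def best_answer(prompt: str, candidates: list[str], reward_model) -> str:
    """Return the candidate response the reward model scores highest
    for the given prompt."""
    return max(candidates, key=lambda answer: reward_model(prompt, answer))
```

With a toy reward that favors longer answers, the longest candidate is selected.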
Instruction‑tuning is performed with Axolotl coupled with Liger‑Kernel, using bfloat16 mixed precision, sequence packing, and a cosine learning‑rate schedule (maximum 1 × 10⁻⁵, 125 warm‑up steps). Training runs for five epochs at a 32,768‑token context length, with optimized kernels for RoPE, RMSNorm, GLU, layer norm, and fused cross‑entropy, dramatically reducing memory consumption and training time.
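The reported fine‑tuning schedule (maximum 1 × 10⁻⁵, 125 warm‑up steps) can be sketched as a standard warm‑up‑plus‑cosine curve; decaying all the way to zero is an assumption:

```python
import math

def cosine_lr(step: int, total_steps: int,
              max_lr: float = 1e-5, warmup_steps: int = 125) -> float:
    """Linear warm-up to max_lr, then cosine decay toward zero (assumed floor)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * max_lr * (1 + math.cos(math.pi * progress))
```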
Evaluation spans both English‑only and multilingual benchmark suites, covering instruction‑following, general knowledge, STEM, and translation tasks. While detailed numeric results are not reproduced in the report, the authors claim EuroLLM‑22B‑Instruct achieves competitive performance relative to other open models of similar scale (e.g., Llama 3, Mistral, Mixtral) across all evaluated metrics, particularly excelling in multilingual reasoning and long‑context tasks. The evaluation framework and code are released to ensure reproducibility.
All assets—base and instruction‑tuned models, the EuroWeb pre‑training corpus, the EuroBlocks‑22B instruction set, and the modified Megatron‑LM and evaluation code—are publicly available on HuggingFace and GitHub. This openness promotes transparency, enables community‑driven improvements, and provides a solid foundation for future European AI research and product development.
Strengths of EuroLLM‑22B include: (1) comprehensive coverage of EU languages, offering native‑level capabilities across a linguistically diverse region; (2) a 32K‑token context window that unlocks new use cases involving long documents; (3) a meticulous, tiered data‑filtering pipeline that maximises training‑data quality; and (4) a full open‑source release, fostering reproducibility and ecosystem growth. Limitations are: (1) the lack of detailed benchmark tables makes precise performance comparison difficult; (2) the massive compute and energy requirements for training such a model may limit accessibility to well‑funded institutions; (3) the 128k‑vocabulary tokenizer may still produce sub‑optimal sub‑word splits for low‑resource languages; and (4) the report provides limited analysis of the memory/compute trade‑offs introduced by the 32K context.
In summary, EuroLLM‑22B represents a significant milestone for multilingual, open‑source LLM development in Europe. By delivering a high‑capacity model that natively supports all EU official languages, extending context length, and releasing all components under permissive licenses, the project lays a robust groundwork for future research, multilingual applications, and the democratization of AI technology across the continent.