Dicta-LM 3.0: Advancing The Frontier of Hebrew Sovereign LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Open-weight LLMs have been released by frontier labs; however, sovereign Large Language Models (for languages other than English) remain low in supply yet high in demand. Training large language models (LLMs) for low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce Dicta-LM 3.0: an open-weight collection of LLMs trained on substantial corpora of Hebrew and English text. The model is released in three sizes: 24B, adapted from the Mistral-Small-3.1 base model; 12B, adapted from the NVIDIA Nemotron Nano V2 model; and 1.7B, adapted from the Qwen3-1.7B base model. We release multiple variants of each model, each with a native context length of 65k tokens: a base model and a chat model with tool-calling support. To rigorously evaluate our models, we introduce a new benchmark suite for evaluating Hebrew chat-LLMs, covering a diverse set of tasks including Translation, Summarization, Winograd, Israeli Trivia, and Diacritization (nikud). Our work not only addresses the intricacies of training LLMs for low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.


💡 Research Summary

Dicta‑LM 3.0 is an open‑weight collection of Hebrew‑focused large language models (LLMs) that aims to fill the gap of sovereign, high‑quality LLMs for low‑resource languages. The authors release three model sizes—24 B, 12 B, and 1.7 B—each adapted from a strong English‑centric base (Mistral‑Small‑3.1, NVIDIA Nemotron‑Nano V2, and Qwen‑3‑1.7B respectively). All variants support a native 65 k token context window and are provided in two forms: a base model for general-purpose generation and a chat model equipped with tool‑calling capabilities.

Data composition
The pre‑training corpus consists of roughly 100 billion Hebrew tokens (≈75 % of the total) and 30 billion English tokens (≈25 %). Hebrew data are drawn from a wide spectrum: web crawls (C4, OSCAR, FineWeb2, Wikipedia), social media, news & legal transcripts, academic & literary sources (including the Ben‑Yehuda and Sefaria projects), and a small fraction of manually tagged resources (NER, UD dependencies, diacritics). English data are assembled from Nemotron‑CC, FineWeb‑Edu, and SlimPajama, with the exact mix varying per model size.
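The roughly 75/25 Hebrew-to-English split can be sketched as proportional sampling over corpus sources. The per-source token counts below are illustrative assumptions; the paper only reports the aggregate ~100 B Hebrew and ~30 B English totals, and states that the exact mix varies per model size.

```python
import random

# Illustrative token counts in billions: the Hebrew/English totals follow
# the paper (~100B vs ~30B), but the per-source breakdown is an assumption.
corpus_tokens = {
    "hebrew_web":     60.0,  # C4, OSCAR, FineWeb2, Wikipedia, ...
    "hebrew_curated": 40.0,  # news, legal, academic, literary sources
    "english_mix":    30.0,  # Nemotron-CC, FineWeb-Edu, SlimPajama
}

total = sum(corpus_tokens.values())
weights = {name: n / total for name, n in corpus_tokens.items()}

def sample_source(rng: random.Random) -> str:
    """Pick a corpus source proportionally to its token count."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
print({name: round(draws.count(name) / len(draws), 2) for name in weights})
```

With these counts, the English share works out to 30/130 ≈ 23%, matching the paper's "≈25%" figure up to rounding.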

Continuous pre‑training
Training proceeds in two phases. Phase 1 iterates over the entire 130 B token dataset with a 4 k token context, using model‑specific hyper‑parameters (global batch size, learning rate, tensor parallelism). Phase 2 expands the context to 65 k tokens, samples longer documents (>6 k tokens) for 75 % of the steps and shorter ones for the remaining 25 %, and continues training for an additional 18 B tokens. This two‑stage approach allows the models to first learn broad language patterns and then specialize in handling very long contexts, which is crucial for tasks like summarization and multi‑turn reasoning. Training is performed on an NVIDIA DGX Cloud Lepton cluster with 80 H200 GPUs, leveraging the NeMo framework, distributed AdamW, and cosine learning‑rate scheduling.
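The Phase 2 sampling rule (75% of steps drawn from documents longer than 6 k tokens, 25% from shorter ones) can be sketched as a two-pool sampler. This is a minimal illustration of the stated split, not the paper's actual data-loading code; the document dictionaries and thresholds as variable names are assumptions.

```python
import random

LONG_DOC_THRESHOLD = 6_000  # tokens; Phase 2 treats >6k-token docs as "long"
LONG_FRACTION = 0.75        # 75% of steps sample from the long-document pool

def split_by_length(docs):
    """Partition documents into long (>6k tokens) and short pools."""
    long_pool = [d for d in docs if d["n_tokens"] > LONG_DOC_THRESHOLD]
    short_pool = [d for d in docs if d["n_tokens"] <= LONG_DOC_THRESHOLD]
    return long_pool, short_pool

def sample_steps(docs, rng, steps):
    """Draw one document per training step, 75% long / 25% short."""
    long_pool, short_pool = split_by_length(docs)
    picks = []
    for _ in range(steps):
        pool = long_pool if rng.random() < LONG_FRACTION else short_pool
        picks.append(rng.choice(pool))
    return picks

# Toy corpus with a mix of short and long documents (token counts assumed).
docs = [{"id": i, "n_tokens": n}
        for i, n in enumerate([500, 2_000, 3_000, 8_000, 12_000, 60_000])]
rng = random.Random(42)
batch = sample_steps(docs, rng, steps=10_000)
long_share = sum(d["n_tokens"] > LONG_DOC_THRESHOLD for d in batch) / len(batch)
print(f"long-doc share ~ {long_share:.2f}")
```

Over many steps, the empirical long-document share converges to the configured 0.75.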

Base model evaluation
The authors evaluate the three base models on the Hebrew LLM‑Leaderboard, which aggregates a suite of Hebrew benchmarks (translation, summarization, Winograd, Israeli trivia, diacritization). The 24 B model matches or exceeds the performance of models four times larger, especially excelling in the Israeli Trivia category. The 12 B and 1.7 B models also show sizable gains (up to +30 percentage points on certain tasks) compared with their original counterparts. To verify that English capabilities are retained, the models are tested on Commonsense QA, WinoGrande, and ARC‑Challenge, achieving >98 % of the original English performance.
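The ">98% of the original English performance" criterion amounts to a per-benchmark score ratio between the adapted and original models. The scores below are made-up placeholders for illustration only; the paper reports the retention threshold, not these numbers.

```python
# Placeholder scores (NOT the paper's results) for the three English
# benchmarks used in the retention check.
original = {"CommonsenseQA": 80.0, "WinoGrande": 75.0, "ARC-Challenge": 60.0}
adapted  = {"CommonsenseQA": 79.2, "WinoGrande": 74.5, "ARC-Challenge": 59.0}

# Retention = adapted score / original score, per benchmark and overall.
retention = {name: adapted[name] / original[name] for name in original}
overall = sum(adapted.values()) / sum(original.values())
print({name: round(r, 4) for name, r in retention.items()})
print(f"overall retention: {overall:.1%}")
```

A model "passes" this check when every ratio (or the aggregate) stays above 0.98, i.e. adaptation to Hebrew cost at most ~2% of the original English accuracy.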

Chat model fine‑tuning (SFT)
To create conversational variants, the authors assemble two supervised fine‑tuning (SFT) corpora: an “instruct” set (direct answer style) and a “thinking” set (explicit reasoning step before the answer). Both are built from a mixture of publicly available English dialogue datasets (Hermes 3, Math, rStarCoder, SmolTalk v2, etc.) and are translated into Hebrew using the newly trained base model, followed by filtering to remove low‑quality translations. The final SFT data comprise roughly 2 B tokens (1.5 M dialogues) for instruct and 3.2 B tokens (725 k dialogues) for thinking. Special tokens (<|im_start|>, <tool_call>, etc.) are added to the tokenizer to support tool‑calling and chain‑of‑thought reasoning, following conventions from Qwen‑3, DeepSeek‑R1, and Hermes. The chat models are fine‑tuned on the same 65 k token sequences using the same GPU cluster.
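The special-token conventions cited (Qwen‑3 / Hermes style) can be illustrated with a minimal ChatML-style formatter. This is a sketch under assumptions: the paper confirms <|im_start|> and <tool_call> tokens, but the exact template, the closing markers, and the tool-call JSON payload shape shown here follow the common Qwen/Hermes convention rather than the paper's text.

```python
import json

def format_turn(role: str, content: str) -> str:
    """Wrap one chat turn in ChatML-style delimiters."""
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"

def format_tool_call(name: str, arguments: dict) -> str:
    """Emit a Hermes/Qwen-style tool call as a JSON payload."""
    payload = json.dumps({"name": name, "arguments": arguments},
                         ensure_ascii=False)
    return f"<tool_call>\n{payload}\n</tool_call>"

# Assemble a short bilingual dialogue ending in a tool call.
prompt = (
    format_turn("system", "You answer in Hebrew and may call tools.")
    + format_turn("user", "מה מזג האוויר בתל אביב?")  # "What's the weather in Tel Aviv?"
    + format_turn("assistant",
                  format_tool_call("get_weather", {"city": "Tel Aviv"}))
)
print(prompt)
```

During SFT the loss is typically computed only on the assistant turns, so the delimiter tokens let the model learn where answers (and tool calls) begin and end.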

Results and insights
Key take‑aways include: (1) continuous pre‑training on a large Hebrew corpus dramatically improves performance despite the language’s low‑resource status; (2) extending the context window to 65 k tokens yields substantial gains on long‑form tasks; (3) mixing English data mitigates catastrophic forgetting and preserves reasoning abilities; (4) leveraging the base model for self‑translation enables rapid creation of high‑quality multilingual SFT data. The authors argue that this pipeline is readily transferable to other under‑represented languages.

Limitations and future work
The paper acknowledges potential biases inherited from web‑scraped Hebrew data, the high computational cost of 65 k token training, and the limited generality of the current tool‑calling schema. Future directions include bias mitigation, more memory‑efficient long‑context architectures, expanding the tool‑calling framework, and scaling the approach to additional languages.

In summary, Dicta‑LM 3.0 demonstrates that a pragmatic combination of (i) strong English‑centric base models, (ii) large‑scale Hebrew pre‑training, (iii) a two‑stage long‑context training regime, and (iv) carefully constructed bilingual SFT data can produce state‑of‑the‑art Hebrew LLMs that rival much larger multilingual models. The work provides a concrete, reproducible blueprint for building sovereign LLMs for other low‑resource languages, advancing the broader goal of multilingual AI equity.

