Goldfish: Monolingual Language Models for 350 Languages
For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. Despite state-of-the-art performance on reasoning tasks, we find that these models still struggle with basic grammatical text generation in many languages. First, large multilingual models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B) using FLORES perplexity as an evaluation metric. Second, when we train small monolingual models with only 125M parameters on 1GB or less data for 350 languages, these small models outperform large multilingual models both in perplexity and on a massively multilingual grammaticality benchmark. To facilitate future work on low-resource language modeling, we release Goldfish, a suite of over 1,000 small monolingual language models trained comparably for 350 languages. These models represent the first publicly-available monolingual language models for 215 of the languages included.
💡 Research Summary
The paper “Goldfish: Monolingual Language Models for 350 Languages” addresses a fundamental limitation in the current state of low‑resource language modeling: the reliance on massive multilingual models such as XGLM, BLOOM, and MaLA‑500, which are trained on highly imbalanced data and often underperform even simple bigram baselines on many languages. The authors first demonstrate, using FLORES‑200 perplexity, that large multilingual models perform worse than bigrams for a substantial fraction of languages (24% of languages covered by XGLM 4.5B; 43% for BLOOM 7.1B). This motivates the creation of a new suite of monolingual models called Goldfish.
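The bigram baseline referenced above is straightforward to reproduce. The sketch below is a minimal add‑one (Laplace) smoothed bigram model; the paper's exact smoothing scheme is not specified here, so Laplace smoothing is an assumption for illustration:

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Fit an add-one (Laplace) smoothed bigram LM; returns a log-prob function."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)

    def logprob(prev, tok):
        # Laplace smoothing: every bigram gets a pseudo-count of 1.
        return math.log((bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab))

    return logprob

def perplexity(tokens, logprob):
    """Perplexity of a token sequence under a bigram log-prob function."""
    lp = sum(logprob(a, b) for a, b in zip(tokens, tokens[1:]))
    return math.exp(-lp / (len(tokens) - 1))
```

In practice a stronger smoothing scheme (e.g. Kneser–Ney) would typically be used, but even this simple form conveys how weak a baseline the multilingual models are losing to.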
Goldfish consists of 1,154 transformer language models covering 350 languages. Each language receives a model of either 125 M parameters (GPT‑2‑style) or a smaller 39 M‑parameter variant for the smallest data regimes. Training data are drawn from three large multilingual corpora (Chang 2024a, Glot500, MADLAD‑400) and are carefully deduplicated, filtered (Bible‑only corpora removed), and split into four size buckets: 5 MB, 10 MB, 100 MB, and 1 GB of text. To make data quantities comparable across scripts, the authors introduce a “byte‑premium” scaling factor that measures how many UTF‑8 bytes are needed to encode the same amount of content in a given language relative to English. Dataset sizes are therefore expressed in “equivalent English bytes,” ensuring that a 1 GB bucket for a language with a high byte‑premium actually contains more raw bytes but comparable information content.
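In practice, the byte‑premium correction is just a per‑language scaling in each direction. A minimal sketch (the premium values in the test below are hypothetical illustrations, not the paper's measured factors):

```python
GB = 10**9

def raw_bucket_bytes(equiv_english_bytes, byte_premium):
    """Raw UTF-8 bytes to collect so a bucket holds the target amount of
    content, where the target is measured in equivalent English bytes.
    A script needing more bytes per unit of content (byte_premium > 1)
    gets a proportionally larger raw-byte budget."""
    return equiv_english_bytes * byte_premium

def equiv_english_bytes(raw_bytes, byte_premium):
    """Inverse direction: convert a raw corpus size into equivalent
    English bytes for cross-language comparison."""
    return raw_bytes / byte_premium
```

This keeps the "information content" of each bucket roughly constant across scripts, which is what makes the per-bucket comparisons between languages meaningful.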
Each model uses a custom 50K Unigram tokenizer trained on the same data bucket, preserving language‑specific morphology. Training proceeds for ten epochs, with early stopping for the smallest buckets to avoid overfitting. The total compute cost is 1.65 × 10²⁰ FLOPs, roughly 1/1900 of the compute used for GPT‑3, highlighting the efficiency of the approach.
Evaluation is twofold. First, on FLORES‑200 log‑perplexity (computed on the second half of each sentence given the first half), Goldfish achieves lower perplexities than all four multilingual baselines on 98 of the 204 languages for which comparison data exist. On average, Goldfish reduces perplexity by 13 % relative to XGLM 4.5B and 11 % relative to MaLA‑500 10B. Moreover, the bigram baselines themselves beat the multilingual models on a non‑trivial number of languages, underscoring the data imbalance problem.
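The scoring protocol, evaluating only the second half of each FLORES sentence conditioned on the first half, can be sketched generically. Here `logprob` stands in for any language model's conditional token log‑probability; the function name and interface are assumptions for illustration, not the paper's code:

```python
import math

def half_sentence_logppl(tokens, logprob):
    """Log-perplexity of the second half of a tokenized sentence given
    the first half. `logprob(prefix, tok)` returns log p(tok | prefix)
    under some language model (any scoring backend can be plugged in)."""
    mid = len(tokens) // 2
    scored = tokens[mid:]
    total = sum(logprob(tokens[:i], tokens[i]) for i in range(mid, len(tokens)))
    return -total / len(scored)
```

Averaging this quantity over a language's FLORES sentences gives a per‑language score; conditioning on the first half keeps the comparison focused on in‑context prediction rather than sentence‑initial guessing.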
Second, the authors assess grammatical knowledge using MultiBLiMP, covering 74 of the Goldfish languages. Goldfish attains the highest average accuracy among all tested multilingual models (BLOOM 560M, XGLM 564M/1.7B, Gemma 3, Llama 3.2) and is the top performer on 25 languages. This demonstrates that small, language‑specific models capture syntactic regularities better than large, shared‑parameter models when the target language is under‑represented in the multilingual training mix.
Conversely, on three multilingual reasoning benchmarks—Belebele (reading comprehension), XCOPA (commonsense), and XStoryCloze (story reasoning)—all models, including Goldfish, perform near chance, indicating that the current scale of pre‑training alone is insufficient for higher‑order reasoning tasks in low‑resource settings.
The paper also details extensive data‑handling procedures: contamination checks against FLORES‑200 (less than 10 overlapping sentences for 98 % of languages), release of the curated training corpora, and open‑source code and models on Hugging Face. The authors argue that the combination of (1) language‑specific data scaling via byte‑premium, (2) dedicated tokenizers, and (3) modest‑size transformer architectures yields a cost‑effective solution that outperforms far larger multilingual systems on core language modeling metrics for low‑resource languages.
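The contamination check amounts to counting exact sentence overlaps between a language's training corpus and its FLORES‑200 sentences. A minimal sketch (the whitespace normalization here is an assumption; the paper's exact matching criterion may differ):

```python
def overlap_count(train_sentences, flores_sentences):
    """Number of FLORES sentences appearing verbatim in the training
    data, after trivial whitespace normalization."""
    norm = lambda s: " ".join(s.split())
    train_set = {norm(s) for s in train_sentences}
    return sum(1 for s in flores_sentences if norm(s) in train_set)
```

Keeping this count below a small threshold (fewer than 10 overlapping sentences for 98 % of languages, per the paper) guards against the perplexity results being inflated by memorized evaluation data.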
In conclusion, Goldfish provides the first publicly available monolingual language models for 215 of the 350 languages covered, setting a new baseline for low‑resource language modeling. The work highlights that, for many languages, “bigger is not better” and that careful data curation and language‑focused modeling can deliver superior performance with a fraction of the compute and parameter budget. Future directions include instruction‑tuning, few‑shot prompting, and integrating Goldfish models into downstream applications such as machine translation, speech recognition, and culturally‑aware NLP tools.