OpenSeal: Good, Fast, and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data
Large language models (LLMs) have proven to be effective tools for a wide range of natural language processing (NLP) applications. Although many LLMs are multilingual, most remain English-centric and perform poorly on low-resource languages. Recently, several Southeast Asia-focused LLMs have been developed, but none are truly open source, as they do not publicly disclose their training data. Truly open-source models are important for transparency and for enabling a deeper and more precise understanding of LLM internals and development, including biases, generalization, and multilinguality. Motivated by recent advances demonstrating the effectiveness of parallel data in improving multilingual performance, we conduct controlled and comprehensive experiments to study the effectiveness of parallel data in continual pretraining of LLMs. Our findings show that using only parallel data is the most effective way to extend an LLM to new languages. Using just 34.7B tokens of parallel data and 180 hours on 8x NVIDIA H200 GPUs, we built OpenSeal, the first truly open Southeast Asian LLM that rivals the performance of existing models of similar size.
💡 Research Summary
This paper investigates how the composition and ordering of training data affect continual pre‑training (CPT) of large language models (LLMs) for new languages. The authors focus on Southeast Asian languages, which are typically low‑resource, and examine whether parallel corpora (sentence‑aligned translations) can replace or complement monolingual corpora during CPT. They build on OLMo 2, an open‑source decoder‑only transformer, and experiment with two model sizes: 1 B and 7 B parameters. Both models are first pretrained on four trillion tokens of general‑domain text, then adapted using a large SEA‑English parallel dataset containing 403.8 million sentence pairs (≈17.2 B tokens) covering ten languages: Indonesian, Khmer, Lao, Malay, Burmese, Tamil, Thai, Tagalog, Vietnamese, and Chinese.
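To make the role of sentence‑aligned pairs concrete, here is a minimal sketch of how a parallel pair could be serialized into a single training sequence. The tag‑based template and function name are hypothetical, since the summary does not specify how OpenSeal formats its data; the point is simply that both sides of a translation share one context window during CPT.

```python
def format_parallel_pair(src: str, tgt: str, src_lang: str, tgt_lang: str) -> str:
    """Pack one sentence-aligned pair into a single training example.

    The language-tag template here is an illustrative assumption, not
    the paper's actual serialization. It shows the key idea: the model
    sees the source sentence and its translation in one sequence, so
    next-token prediction carries a cross-lingual alignment signal.
    """
    return f"<{src_lang}> {src} <{tgt_lang}> {tgt}"


# Example: one English-Indonesian pair packed into one sequence.
example = format_parallel_pair(
    "The weather is nice today.",
    "Cuaca hari ini cerah.",
    "en", "id",
)
```

Many sequences of this form would then be concatenated and chunked into fixed-length training blocks, just like monolingual text.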
A key design choice is the replay ratio: 25 % of training blocks are replayed from the original pre‑training data to mitigate catastrophic forgetting, following recommendations from Ibrahim et al. (2024). The authors define five CPT strategies:
- Multilingual – only monolingual blocks from all languages, sampled uniformly.
- Mixed – monolingual and parallel blocks interleaved, guaranteeing at least one of each type between two replay blocks.
- Parallel‑First – all parallel blocks are consumed first, then monolingual blocks. This mirrors the hypothesis that early exposure to aligned sentences accelerates cross‑lingual alignment.
- Parallel‑Last – monolingual blocks are presented first, followed by parallel blocks, reflecting the belief that fluency should be built before alignment.
- Parallel‑Only – training uses only parallel blocks plus replay data, completely omitting monolingual data.
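The five strategies above differ only in how blocks are ordered, which can be sketched as a schedule builder. Everything here is illustrative: the block labels, the simple shuffle standing in for Mixed's "at least one of each type between replay blocks" guarantee, and the even spacing of replay blocks are assumptions, not the paper's implementation.

```python
import random


def build_schedule(n_parallel: int, n_mono: int, strategy: str,
                   replay_ratio: float = 0.25, seed: int = 0) -> list:
    """Sketch of the five CPT data-ordering strategies.

    Each label ("parallel", "mono", "replay") stands in for a block of
    training tokens. Counts and interleaving are illustrative only.
    """
    rng = random.Random(seed)
    n_new = n_parallel + n_mono
    if strategy == "multilingual":          # only monolingual blocks
        new = ["mono"] * n_new
    elif strategy == "parallel_only":       # only parallel blocks
        new = ["parallel"] * n_new
    elif strategy == "parallel_first":      # all parallel, then monolingual
        new = ["parallel"] * n_parallel + ["mono"] * n_mono
    elif strategy == "parallel_last":       # all monolingual, then parallel
        new = ["mono"] * n_mono + ["parallel"] * n_parallel
    elif strategy == "mixed":               # interleaved (simple shuffle here)
        new = ["parallel"] * n_parallel + ["mono"] * n_mono
        rng.shuffle(new)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # Space replay blocks evenly so they form ~replay_ratio of the run,
    # mitigating catastrophic forgetting (Ibrahim et al., 2024).
    n_replay = int(len(new) * replay_ratio / (1 - replay_ratio))
    stride = max(1, len(new) // max(1, n_replay))
    schedule = []
    for i, block in enumerate(new):
        schedule.append(block)
        if n_replay and (i + 1) % stride == 0:
            schedule.append("replay")
            n_replay -= 1
    return schedule
```

With the default 25 % replay ratio, one replay block is inserted for every three new-data blocks; under Parallel-First the non-replay portion of the schedule is all parallel blocks followed by all monolingual blocks.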
All other training hyper‑parameters, including the WSD learning‑rate scheduler, are kept identical across settings to ensure a controlled comparison. The 1 B model is trained for 10 B tokens to identify the most effective strategy; the best‑performing configuration is then scaled to the 7 B model and trained for 34.7 B tokens. Evaluation covers translation quality (BLEU), multilingual question answering (MultiQA), cross‑lingual natural language inference (XNLI), and other standard multilingual benchmarks.
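The WSD scheduler mentioned above follows a warmup–stable–decay shape: a short linear warmup, a long constant plateau, and a final decay to a low rate. A minimal sketch follows; the phase fractions and the linear decay shape are illustrative defaults, not the values used in the paper.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           min_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay (WSD) learning-rate schedule (illustrative).

    Phase 1: linear warmup from ~0 to peak_lr.
    Phase 2: constant at peak_lr (the "stable" plateau).
    Phase 3: linear decay from peak_lr down to min_lr.
    Fractions and decay shape are assumptions, not the paper's settings.
    """
    warmup = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    if step < decay_start:
        return peak_lr
    frac = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * frac
```

A plateau-heavy schedule like this is convenient for controlled comparisons: every strategy trains at the same constant rate for most of the run, so differences in outcomes can be attributed to data ordering rather than the learning-rate trajectory.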
Results show that the Parallel‑Only strategy matches or exceeds the traditional Multilingual baseline on most metrics. Specifically, Parallel‑Only improves BLEU scores by an average of 2–3 % and raises multilingual QA accuracy by about 3 % compared to Multilingual. The Parallel‑First approach yields the fastest early‑stage performance gains, especially for the lowest‑resource languages (Khmer, Lao), where BLEU jumps by more than 10 % after the first few billion tokens. Conversely, Parallel‑Last achieves higher monolingual fluency early on but lags behind in final cross‑lingual transfer. The Mixed setting provides stable but not optimal performance, indicating that simply mixing data without careful ordering does not fully exploit the alignment signal. The 25 % replay ratio proves sufficient to preserve previously learned capabilities while allowing substantial adaptation to new languages.
These findings have two major implications. First, high‑quality parallel corpora are a powerful driver of cross‑lingual alignment; when available in sufficient quantity, they can replace monolingual data for CPT, simplifying data pipelines for low‑resource regions. Second, the order in which parallel and monolingual data are presented critically shapes learning dynamics. Presenting parallel data early (Parallel‑First) accelerates alignment and yields better final multilingual performance, while delaying alignment (Parallel‑Last) may be beneficial when the goal is to first solidify language‑specific fluency.
All data, code, and model checkpoints are released publicly, enabling the community to reproduce the experiments and explore extensions such as additional language pairs, higher‑quality human‑translated parallel data, and integration with instruction‑tuning stages. The authors suggest that future work should investigate how CPT strategies interact with instruction fine‑tuning and preference learning, as well as how to balance computational cost with performance gains for truly multilingual LLM deployment.