
๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.18399
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on the SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized baselines. Furthermore, we introduce the Language Extension Pipeline (LEP), a method for integrating the optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective transformer layer unfreezing. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. We release our tokenizer, training scripts, and model checkpoints to facilitate Arabic NLP research.

📄 Full Content

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks [1,16,17]. However, the effectiveness of these models is fundamentally constrained by their tokenization strategy. Tokenizers trained on predominantly English corpora often exhibit poor compression efficiency for non-Latin scripts and morphologically rich languages [12,11].

Arabic presents unique challenges for tokenization due to several linguistic characteristics. First, Arabic is a highly inflected language where words carry extensive morphological information through prefixes, suffixes, and infixes [4]. Second, Arabic orthography exhibits significant variability, particularly in the representation of Alif variants ([Hamza-above], [Hamza-below], [Madda], [Alif]) and the optional nature of diacritical marks (harakat). Third, Arabic text frequently contains Arabic-Indic numerals and specialized punctuation that require explicit normalization.

These challenges result in general-purpose tokenizers producing excessively fragmented token sequences for Arabic text, leading to: (1) increased computational costs during training and inference, (2) reduced effective context length, and (3) potential degradation in model performance on Arabic tasks.

In this paper, we address these challenges through a two-pronged approach:

• Arabic-Optimized Tokenizer: We develop AraToken, a SentencePiece Unigram tokenizer trained on Arabic corpora with a comprehensive normalization pipeline that unifies orthographic variations and removes optional diacritics.

• Language Extension Pipeline (LEP): We propose a method for integrating the optimized tokenizer into existing LLMs (specifically Qwen3) through vocabulary extension, mean subtoken initialization, and selective layer unfreezing.

Our experiments demonstrate that the normalized SentencePiece tokenizer achieves a fertility of 1.199 tokens per word, representing an 18% improvement over unnormalized baselines. When integrated into Qwen3-0.6B via LEP, the model achieves an evaluation loss of 2.43 after only 800 training steps, compared to 8.28 without adaptation.

Figure 1 illustrates our overall approach, combining tokenizer training with model adaptation through LEP.

The remainder of this paper is organized as follows: Section 2 reviews related work on tokenization and language adaptation. Section 3 describes our normalization pipeline, tokenizer training, and LEP architecture. Section 4 presents our experimental setup, and Section 5 discusses the results. We conclude in Section 7 with limitations and future directions.

Modern LLMs predominantly employ subword tokenization to balance vocabulary size with coverage. Byte Pair Encoding (BPE) [15] iteratively merges the most frequent character pairs to construct a vocabulary. WordPiece [13] uses a likelihood-based criterion for merge decisions, while the Unigram algorithm [6] learns a probabilistic language model over subword sequences using the EM algorithm.

SentencePiece [7] provides a language-agnostic implementation supporting both BPE and Unigram algorithms, operating directly on raw text without pre-tokenization. This is particularly advantageous for languages like Arabic that do not use whitespace consistently.

Arabic NLP has received significant attention due to the language’s morphological complexity and dialectal variation [2]. CAMeL Tools [10] provides comprehensive utilities for Arabic preprocessing including morphological analysis and normalization. AraBART [5] and AraT5 [9] are pretrained transformer models specifically designed for Arabic, employing custom tokenization strategies.

Normalization is a critical preprocessing step for Arabic text [4]. Common normalization operations include Alif unification (collapsing [Hamza-above], [Hamza-below], [Madda] to [Alif]), Hamza normalization, Ta Marbuta/Ha unification, and diacritic removal. The optimal normalization strategy depends on the downstream task, with some applications benefiting from preserved orthographic distinctions.

Extending pretrained LLMs to new languages has been explored through several approaches. BLOOM+1 [18] investigates language adaptation strategies including continued pretraining and adapter-based methods, finding that adapters outperform continued pretraining for larger models. LLaMA Beyond English [19] studies vocabulary extension for Chinese, demonstrating that effective transfer can be achieved with less than 1% of the original pretraining data. WECHSEL [8] proposes cross-lingual embedding initialization for vocabulary extension, while FOCUS [3] introduces a method for initializing new token embeddings based on semantic similarity. Our work builds on these approaches by combining vocabulary extension with selective layer unfreezing for Arabic adaptation.

We implement a comprehensive Arabic normalization pipeline designed to reduce orthographic variability while preserving semantic content. The pipeline is integrated into the tokenizer’s preprocessing stage using the HuggingFace Tokenizers library.

Unicode Normalization We apply NFKC normalization as the first step to decompose compatibility characters and ensure consistent Unicode representation.

Alif Variant Unification Arabic exhibits four common Alif variants that are often used interchangeably: [Hamza-above], [Hamza-below], [Madda], and the bare [Alif]. In normalized configurations, the first three are collapsed to the bare [Alif].

We also experiment with preserving Alif variants (Alif4 configuration) to evaluate the trade-off between normalization and linguistic fidelity.

Numeral and Punctuation Normalization Arabic-Indic numerals ([0-9 Arabic]) are mapped to their Western Arabic equivalents (0-9). Arabic-specific punctuation marks ([?-ar], [;-ar], [,-ar]) are normalized to their Latin counterparts.

Tatweel Removal The Tatweel character ([tatweel]) used for text justification is removed entirely.

Diacritic Handling We provide two configurations: (1) drop diacritics, which removes all harakat for maximum normalization, and (2) keep diacritics, which preserves vowel marks for applications requiring phonetic information.

Table 1 summarizes the character replacement rules implemented in our normalization pipeline. Figure 2 shows examples of text before and after normalization.
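For concreteness, the pipeline can be approximated in plain Python with the standard unicodedata and re modules, as in the minimal sketch below; the function name and exact rule set are illustrative, and the released implementation integrates these rules into the HuggingFace Tokenizers preprocessing stage instead.

```python
import re
import unicodedata

# Illustrative rule set; the released AraToken pipeline may differ in detail.
ALIF_VARIANTS = re.compile("[\u0622\u0623\u0625]")   # Madda, Hamza-above, Hamza-below
HARAKAT = re.compile("[\u064B-\u0652\u0670]")        # optional diacritics (tanween, shadda, sukun, ...)
DIGITS = str.maketrans("\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669", "0123456789")
PUNCT = str.maketrans("\u061F\u061B\u060C", "?;,")   # Arabic question mark, semicolon, comma

def normalize_arabic(text: str, drop_diacritics: bool = True) -> str:
    text = unicodedata.normalize("NFKC", text)       # Unicode compatibility normalization
    text = ALIF_VARIANTS.sub("\u0627", text)         # unify Alif variants to the bare Alif
    text = text.replace("\u0640", "")                # remove Tatweel (kashida)
    if drop_diacritics:                              # "drop diacritics" configuration
        text = HARAKAT.sub("", text)
    text = text.translate(DIGITS)                    # Arabic-Indic digits -> Western digits
    return text.translate(PUNCT)                     # Arabic punctuation -> Latin counterparts
```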

We train tokenizers using three algorithms: BPE, WordPiece, and SentencePiece Unigram. For each algorithm, we explore configurations with and without normalization, and with dropped or retained diacritics.

[Table 1 groups the replacement rules into Alif+Hamza, Alif-variant, and punctuation categories.]

Vocabulary Size We train tokenizers with a target vocabulary size of 80,000 tokens for base experiments and 150,000 tokens for normalized variants, matching the vocabulary scale of Qwen3.
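A hedged sketch of the corresponding training call with the sentencepiece Python package is shown below; only the model type and vocabulary sizes come from the text, while the input path and character coverage are placeholders.

```python
import sentencepiece as spm

# Train a Unigram model on the normalized corpus. The file name and
# character_coverage value are assumptions, not settings reported here.
spm.SentencePieceTrainer.train(
    input="arabic_normalized.txt",      # corpus after the normalization pipeline
    model_prefix="aratoken_unigram",
    model_type="unigram",
    vocab_size=150_000,                 # 80K for base experiments, 150K for normalized variants
    character_coverage=0.9995,
)
sp = spm.SentencePieceProcessor(model_file="aratoken_unigram.model")
```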

Vocabulary Pruning Following training, we apply frequency-based pruning to remove tokens covering less than 0.01% of the corpus. We experiment with retention thresholds of 95% and 99% cumulative frequency coverage, resulting in pruned vocabularies of approximately 42K and 76K tokens respectively.
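One way to realize this pruning, sketched under the assumption that token frequencies have already been counted on the training corpus (the helper below is illustrative, not the released script):

```python
from collections import Counter

def prune_vocab(token_counts: Counter, coverage: float = 0.99) -> set:
    """Keep the most frequent tokens until `coverage` of all token occurrences is covered."""
    total = sum(token_counts.values())
    kept, covered = set(), 0
    for token, count in token_counts.most_common():
        if covered / total >= coverage:
            break
        kept.add(token)
        covered += count
    return kept

# coverage=0.99 corresponds to the ~76K pruned vocabulary, coverage=0.95 to the ~42K one.
```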

Evaluation Metrics We evaluate tokenizers using three intrinsic metrics, computed as in the sketch after this list:

• Fertility: Average number of tokens produced per word, where lower values indicate more efficient encoding.

• Compression Ratio: Ratio of characters to tokens, where higher values indicate better compression.

• OOV Rate: Percentage of words containing unknown tokens after tokenization.
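The sketch below computes the three metrics for an arbitrary tokenizer callable; the whitespace-based word splitting and the unknown-token convention are assumptions of this illustration.

```python
def tokenizer_metrics(texts, tokenize, unk_token="<unk>"):
    """Fertility, compression ratio, and OOV rate for a `tokenize(word) -> list[str]` callable."""
    n_tokens = n_words = n_chars = n_oov = 0
    for text in texts:
        for word in text.split():
            tokens = tokenize(word)
            n_tokens += len(tokens)
            n_chars += len(word)
            n_words += 1
            n_oov += any(t == unk_token for t in tokens)
    return {
        "fertility": n_tokens / n_words,            # tokens per word (lower is better)
        "compression_ratio": n_chars / n_tokens,    # characters per token (higher is better)
        "oov_rate": 100.0 * n_oov / n_words,        # % of words containing an unknown token
    }
```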

The Language Extension Pipeline (LEP) integrates the Arabic-optimized tokenizer into Qwen3 through vocabulary extension and targeted fine-tuning. Figure 3 illustrates the overall architecture.

Vocabulary Extension Given the trained SentencePiece vocabulary V_ar and the Qwen3 tokenizer vocabulary V_qwen, we extract Arabic tokens from V_ar that are not present in V_qwen. Tokens beginning with the SentencePiece word boundary marker are added with the lstrip=True flag to handle whitespace correctly.

We filter tokens matching undesirable patterns (Latin characters, digits, Cyrillic, common punctuation) to ensure only Arabic-specific tokens are added.
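Sketched below is one way to perform this extension with the HuggingFace tokenizer API; the filtering regex and the arabic_pieces variable (subword pieces exported from the trained SentencePiece model) are illustrative assumptions, not the released code.

```python
import re
from tokenizers import AddedToken
from transformers import AutoTokenizer

qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")
UNDESIRABLE = re.compile(r"[A-Za-z0-9\u0400-\u04FF.,;:!?]")  # Latin letters, digits, Cyrillic, common punctuation

new_tokens = []
for piece in arabic_pieces:                     # pieces from the trained SentencePiece model (assumed variable)
    surface = piece.lstrip("\u2581")            # drop the SentencePiece word-boundary marker
    if not surface or UNDESIRABLE.search(surface):
        continue
    # lstrip=True lets tokens that began with the boundary marker absorb a leading space.
    new_tokens.append(AddedToken(surface, lstrip=piece.startswith("\u2581")))

num_added = qwen_tok.add_tokens(new_tokens)     # tokens already in the Qwen3 vocabulary are skipped
```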

Mean Subtoken Initialization New token embeddings are initialized using the mean of their constituent subtoken embeddings from the original tokenizer:

e_new = (1 / |S|) · Σ_{i ∈ S} e_i

where S is the set of token IDs produced by encoding the new token string with the original Qwen3 tokenizer, and e_i are the corresponding embeddings. This initialization provides a semantically meaningful starting point, as the new token's embedding is positioned near the centroid of its subtokens in embedding space.
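Continuing the sketch above, the initialization can be written roughly as follows; model loading and the bookkeeping of new token IDs are assumptions of this illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base")
original_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base")  # unextended copy for subtoken lookup

model.resize_token_embeddings(len(qwen_tok))     # qwen_tok already holds the added Arabic tokens
emb = model.get_input_embeddings().weight        # shape: (extended_vocab_size, hidden_dim)

with torch.no_grad():
    for tok in new_tokens:                       # AddedToken objects from the extension sketch
        new_id = qwen_tok.convert_tokens_to_ids(tok.content)
        sub_ids = original_tok(tok.content, add_special_tokens=False).input_ids  # S in the equation
        if sub_ids:
            emb[new_id] = emb[sub_ids].mean(dim=0)   # e_new = (1/|S|) * sum_{i in S} e_i
```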

During training, we freeze the embeddings of original Qwen3 tokens to prevent catastrophic forgetting. This is implemented through gradient hooks that zero out gradients for token indices below the vocabulary extension threshold.
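A minimal PyTorch version of such a hook is sketched below; the function name and the bookkeeping of the old vocabulary size are assumptions of this illustration.

```python
import torch

def freeze_original_embeddings(embedding: torch.nn.Embedding, old_vocab_size: int) -> None:
    """Zero out gradients for embedding rows that belong to the original vocabulary."""
    def hook(grad: torch.Tensor) -> torch.Tensor:
        mask = torch.zeros_like(grad)
        mask[old_vocab_size:] = 1.0        # only newly added rows receive gradient
        return grad * mask
    embedding.weight.register_hook(hook)

# old_vocab_size would be the embedding row count before extension (assumed bookkeeping).
freeze_original_embeddings(model.get_input_embeddings(), old_vocab_size=len(original_tok))
```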

Selective Layer Unfreezing While the majority of transformer layers remain frozen, we unfreeze the last k layers to allow the model to adapt its representations to the new tokenization. In our experiments, we unfreeze layers 24-27 (the final 4 layers of Qwen3-0.6B’s 28-layer architecture).
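A sketch of the corresponding freezing logic; the model.model.layers path assumes the standard HuggingFace Qwen-style module layout.

```python
# Freeze everything, then re-enable the embedding matrix (new rows only, via the
# gradient hook above) and the last four transformer blocks.
for param in model.parameters():
    param.requires_grad = False

model.get_input_embeddings().weight.requires_grad = True
for layer in model.model.layers[24:28]:          # layers 24-27 of the 28-layer Qwen3-0.6B
    for param in layer.parameters():
        param.requires_grad = True
```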

Experimental Setup The text is preprocessed using our normalization pipeline before tokenization.

We use Qwen3-0.6B-Base as our foundation model.

Table 2 summarizes our LEP training configuration. We use a linear learning rate schedule with 10% warmup ratio and AdamW optimizer with weight decay 0.01. Gradient checkpointing is enabled to reduce memory consumption.

For tokenizer evaluation, we compute fertility, compression ratio, and OOV rate on held-out Arabic text. For LEP training, we track training loss and evaluation loss on the validation set. Evaluation is performed every 125 steps.
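As an illustration, the configuration described above maps onto HuggingFace TrainingArguments roughly as follows; the learning rate, batch size, and dataset variables are placeholders not taken from the paper.

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="qwen3-0.6b-aratoken-lep",
    max_steps=800,                       # evaluation loss reaches 2.43 by this point
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    optim="adamw_torch",
    weight_decay=0.01,
    gradient_checkpointing=True,
    eval_strategy="steps",               # `evaluation_strategy` in older transformers releases
    eval_steps=125,                      # evaluation every 125 steps
    learning_rate=2e-4,                  # placeholder; the ablation compares a higher rate against 8e-5
    per_device_train_batch_size=8,       # placeholder; not reported in the extracted text
)

# train_ds / eval_ds: tokenized Arabic datasets (assumed to exist).
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```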

Table 3 presents the intrinsic evaluation metrics for all tokenizer configurations. We observe several key findings:

SentencePiece outperforms BPE and WordPiece Across all configurations, SentencePiece Unigram achieves the lowest fertility and highest compression ratio. With normalization, SentencePiece achieves fertility of 1.199 compared to 1.243 for BPE and 1.244 for WordPiece.

Normalization reduces fertility Applying our normalization pipeline reduces fertility by 8-9% across all algorithms. For SentencePiece, normalization reduces fertility from 1.311 to 1.199 (an 8.5% improvement).

Dropping diacritics further improves compression Configurations that remove diacritics achieve better compression than those retaining them, as the reduced character set allows for more efficient subword segmentation.

OOV rates are negligible All byte-level tokenizers (BPE, WordPiece) achieve 0% OOV rate. SentencePiece shows a small OOV rate of approximately 0.1%, which is acceptable for practical applications. Figures 4 and 5 visualize the algorithm comparison with and without normalization.

Table 4 and Figure 6 show the effect of vocabulary pruning on tokenizer metrics. Pruning to 99% coverage (76K vocabulary) maintains comparable performance to the full 150K vocabulary, while pruning to 95% coverage (42K vocabulary) incurs a modest increase in fertility.

Figure 7 shows the training and evaluation loss curves for LEP training. The model rapidly adapts to Arabic text, with evaluation loss decreasing from 8.28 to 2.43 within 800 training steps.

Our ablations highlight the following factors:

• Learning Rate: The higher learning rate outperforms the lower rate (8e-5), enabling faster adaptation within the limited training budget.

• Layer Unfreezing: Unfreezing the last 4 layers is essential for adaptation; freezing all transformer layers results in significantly higher loss.

Figure 8 provides a visual comparison of ablation configurations.

SentencePiece Outperforms BPE/WordPiece SentencePiece’s Unigram algorithm achieves superior compression for Arabic due to its probabilistic approach to segmentation. Unlike BPE’s greedy merge strategy, Unigram considers the likelihood of the entire token sequence, enabling more globally optimal segmentations. This is particularly beneficial for Arabic’s rich morphology, where multiple valid segmentations exist for inflected forms.

Our experiments reveal a surprising finding: preserving Alif variants (Alif4 configuration) leads to lower language modeling loss compared to aggressive normalization. This suggests that Alif variants carry disambiguating information that aids language modeling. For example, the Hamza placement ([Hamza-above] vs [Hamza-below]) often indicates grammatical case or word origin.

This finding aligns with recent work questioning aggressive text normalization for neural models [14]. We recommend that practitioners carefully consider the trade-off between tokenization efficiency and linguistic fidelity based on their downstream tasks.

The Language Extension Pipeline demonstrates remarkable efficiency, achieving significant adaptation within only 800 training steps on 100K samples. This represents less than 0.01% of a typical LLM pretraining budget. Key factors contributing to this efficiency include:

  1. Mean subtoken initialization: Provides semantically meaningful starting points for new embeddings.

  2. Gradient masking: Prevents catastrophic forgetting of existing knowledge.

  3. Selective unfreezing: Focuses adaptation capacity on the most relevant parameters.

We presented AraToken, an Arabic-optimized tokenizer achieving 18% lower fertility than unnormalized baselines through SentencePiece Unigram training with comprehensive Arabic normalization. We further introduced the Language Extension Pipeline (LEP) for efficiently integrating the tokenizer into Qwen3, reducing evaluation loss from 8.28 to 2.43 within 800 training steps.

Limitations

Qwen3 employs byte-level BPE tokenization with a vocabulary size of 151,646 tokens.

