Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs


Authors: Yara Alakeel, Chatrine Qwaider, Hanan Aldarmaki, Sawsan Alqahtani

Yara Alakeel¹, Chatrine Qwaider², Hanan Aldarmaki², Sawsan Alqahtani³,¹*
¹Saudi Data & AI Authority (SDAIA), ²Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), ³Princess Nourah Bint Abdulrahman University (PNU)
yalakeel@ncai.gov.sa, {chatrine.qwaider, hanan.aldarmaki}@mbzuai.ac.ae, saalqhtani@pnu.edu.sa
*Corresponding author

Abstract

This work investigates how effectively large language models (LLMs) and their tokenization schemes represent and generate Arabic root-pattern morphology, probing whether they capture genuine morphological structure or rely on surface memorization. The Arabic morphological system provides a rich testbed for analyzing how LLMs handle complex, non-concatenative forms and how tokenization choices influence this process. Our study begins with an evaluation of morphological fidelity across Arabic and multilingual tokenizers against gold-standard segmentation, followed by an analysis of LLM performance in productive root-pattern generation using a newly developed test set. Our findings across seven Arabic-centric and multilingual LLMs and their respective tokenizers reveal that tokenizer morphological alignment is neither necessary nor sufficient for morphological generation, which calls into question the role of morphological tokenization in downstream performance.

Keywords: Arabic, non-concatenative, morphology, tokenization, LLMs, roots and patterns

1. Introduction

Large language models (LLMs) have achieved impressive performance across many natural language processing tasks, yet their success is uneven across languages (Grattafiori et al., 2024; Hui et al., 2025; Achiam et al., 2024).
Prior work suggests that typological complexity is one factor contributing to these disparities, with evidence that LLMs tend to perform worse on morphologically rich languages, partly due to tokenization inefficiency and data sparsity (Hofmann et al., 2025). Fertility is often used as a measure of tokenization effectiveness, with higher scores assumed to be generally worse for both performance (Ali et al., 2024) and cost (Petrov et al., 2023). Languages with non-concatenative morphology, such as Arabic, pose particular challenges for current language models, which are based on contiguous tokenization schemes (Beesley and Karttunen, 2003; Habash, 2010).

Arabic morphology operates through a root-and-pattern system in which consonantal roots combine with templatic vowel patterns to produce derivational and inflectional forms. For example, applying the pattern مفعول /mafʕuːl/ to the root كتب /ktb/ yields the passive participle مكتوب /maktuːb/, where the prefix م /m/ and infix و /uː/ are inserted into the root structure to form the word. The root encodes core lexical meaning (e.g., /ktb/ 'write'), while the pattern contributes morpho-syntactic and lexical information through templatic material that may include prefixes, infixes, or suffixes.

While frequent word forms may be memorized, genuinely productive morphological learning requires that models identify roots and apply interleaving patterns to generalize to unseen forms (Ismayilzada et al., 2025; Hofmann et al., 2025). Prior studies have attempted to incorporate morphological information into LLMs, typically through morphology-based tokenization or architectural adjustments. Yet their effects remain limited, and their implications for non-concatenative languages like Arabic are still not well understood (Jabbar, 2024; Gazit et al., 2025; Asgari et al.
, 2025). In this paper, we investigate how pre-trained LLMs and their tokenizers handle Arabic morphology, focusing on two key dimensions: tokenizer morphological alignment, and LLMs' root-pattern morphological generation. Our goal is to understand how existing tokenization schemes align with Arabic morphological structures, and how this influences LLMs' ability to generalize productively. We focus not on efficiency or compression, but on the linguistic adequacy and representational behavior of tokenizers, examining how they support or hinder novel word construction in morphologically rich settings. Specifically, we ask: (1) to what extent do current tokenizers preserve Arabic morphological patterns, and (2) does morphological tokenization correlate positively with LLMs' morphological generation performance?

Our contributions can be summarized as follows:¹

1. We thoroughly evaluate the morphological alignment of various Arabic-centric and multilingual LLM tokenizers against ground-truth segmentation, focusing on both morpheme boundaries and morpheme integrity.
2. We construct a dataset for probing Arabic root-pattern morphological productivity using both real and nonce roots combined with various patterns or morphemes.
3. We evaluate instruction-tuned LLMs based on the various tokenizers in (1) and compare their performance on the morphological productivity task in (2).

¹The dataset and code are available at https://github.com/YaraAlakeel/morphems_without_borders.

Surprisingly, we find that tokenizer morphological alignment does not predict effective morphological generation. Models whose tokenizers align poorly with true morpheme boundaries can still perform competitively, while models based on morphological tokenization can fail to generalize.
In fact, the second-highest scoring model, GPT4, has the highest fertility score and low morphological alignment scores on Arabic text, both of which are presumed to have a negative impact on performance. This raises new questions about the role of tokenization in downstream performance.

2. Related Work

Current LLMs typically employ surface-based subword tokenizers such as BPE (Gage, 1994; Sennrich et al., 2016), UnigramLM (Kudo, 2018), and WordPiece (Schuster and Nakajima, 2012; Wu et al., 2016). These methods balance character- and word-level representations, reduce vocabulary size, and efficiently handle out-of-vocabulary items. BPE, in particular, remains widely adopted for its simplicity and scalability (Achiam et al., 2024; Grattafiori et al., 2024; Jiang et al., 2023; Hui et al., 2025). However, numerous studies show that such tokenizers often misalign with morphological structure in languages like Turkish and Korean, where subword segmentation frequently distorts morpheme boundaries and captures only partial linguistic regularities (Schmidt et al., 2025; Pagnoni et al., 2024; Rust et al., 2021; Mielke et al., 2021; Bostrom and Durrett, 2020).

Morphology-Aware Tokenizers for Arabic. To address this, several linguistically informed tokenization methods have been proposed to better capture the morphology of Arabic and other morphologically rich languages. Yet their empirical benefits remain insufficiently explored. MorphBPE (Asgari et al., 2025) extends BPE with linguistic supervision, using Arabic Treebank and Farasa segmentations to align merges with morpheme boundaries. Many other tokenization schemes have been proposed to handle concatenative (Hofmann et al., 2022; Jabbar, 2024) and non-concatenative (Gazit et al., 2025) morphology.
Despite this, these approaches remain largely unused in large-scale LLM training, with MorphBPE being the only one integrated into practice, specifically in the Fanar model's tokenizer (Abbas et al., 2025).

Morphological Generalization. Evaluating morphological generalization in LLMs requires both structural and behavioral assessments. Structural metrics such as the Morphological Alignment Score (MAS) (Abbas et al., 2025) quantify token–morpheme correspondence through sequence alignment. While informative, MAS conflates partial overlaps and boundary mismatches, which reduces interpretability. Similarly, MorphScore (Asgari et al., 2025) measures morpheme boundary accuracy but limits comparisons to two morphemes per word and functions primarily as a recall-based measure, sometimes rewarding spurious or linguistically implausible segmentations. Beyond alignment-based metrics, behavioral probing tasks complement these approaches by testing whether LLMs can use morphological structure productively and differentiate between related morphological patterns (Ismayilzada et al., 2025; Weissweiler et al., 2025).

Effects of Morphological Complexity on LLM Performance. A growing body of work investigates how morphological complexity influences LLM performance (Arnett and Bergen, 2025; Ismayilzada et al., 2025; Seo et al., 2025). Findings remain mixed; some studies report degraded performance on morphologically rich languages such as Turkish, attributing this to tokenization inefficiency and data sparsity rather than inadequate morphological modeling (Arnett and Bergen, 2025; Seo et al., 2025). Others show that even state-of-the-art models struggle with compositional generalization, particularly when exposed to novel roots or high morphological complexity (Ismayilzada et al., 2025). Complementary evidence from Arnett et al.
(2025) suggests that morphological alignment explains little variance in downstream performance, indicating that morpheme-level alignment alone is not a reliable proxy for tokenization quality. Large-scale multilingual studies further corroborate these limitations: while LLMs handle surface morphology reasonably well, they fail to generalize productively across languages, especially for irregular or low-frequency forms (Hofmann et al., 2025; Dang et al., 2024; Weissweiler et al., 2023). Ismayilzada et al. (2025) underscore this gap through productivity and systematicity tasks involving nonce roots in Turkish and Finnish.

Building on this line of work, we analyze how various LLMs handle Arabic morphology, evaluating morphological alignment and integrity in tokenization as well as morphological productivity as a downstream task. We evaluate both morphological and purely statistical tokenizers from Arabic-centric and multilingual LLMs.

3. Methodology

3.1. Investigated Tokenizers and LLMs

We evaluate tokenizers from both multilingual and Arabic-centric large language models. The multilingual set includes GPT-4 (Achiam et al., 2024), GPT-4o (Hurst et al., 2024), LLaMA 3 (Grattafiori et al., 2024), Qwen 3 (Hui et al., 2025), and Cohere (Ahmadian et al., 2025). The Arabic-centric models include Fanar (Abbas et al., 2025) and ALLaM (Bari et al., 2025).² These tokenizers differ in their training data coverage, vocabulary construction strategies, and architectural design. Among them, Fanar is the only model whose tokenizer explicitly integrates morphemic information during development, as discussed in §2.

We maintain consistent configuration settings across all runs and generations. For every task, we used the same prompt across models, chosen as the best-performing prompt from an extensive prompt-engineering phase.
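As a sketch of this setup (not the authors' released code), a minimal harness applies one fixed prompt template identically to every model under test; the per-model `generate` callables and the field names are assumptions of this illustration:

```python
# Hypothetical evaluation harness: the same prompt template is filled in
# and sent to every model. `models` maps a model name to a callable that
# takes a prompt string and returns a completion string (assumed, not a
# real API).
PROMPT = ("Given the root {root} and the target morphological pattern "
          "{template}, generate the corresponding Arabic word.")

def run_all(models, testset):
    """models: {name: generate_fn}; testset: list of {'root', 'template'} dicts."""
    return {name: [gen(PROMPT.format(**item)) for item in testset]
            for name, gen in models.items()}
```

With a stub model, `run_all({"stub": lambda p: "x"}, [{"root": "كتب", "template": "مفعول"}])` returns one completion per test item under the key "stub".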
For the temperature, we tried values in the range 0.0 to 0.6; we used 0.6, as multiple generations produced more diverse, non-deterministic outputs. Models such as GPT and ALLaM tend to follow the prompt instructions and provide only the desired word, so their max_token is set to 8. For other models, whose typical outputs are much longer, max_token was set to 80.

3.2. Tokenizer-Morphology Alignment

Datasets. We evaluate on two Arabic corpora with gold-standard morphological segmentation: (1) the Arabic Treebank Part 3, or ATB3 for short (LDC2010T08) (Maamouri et al., 2010), covering Modern Standard Arabic, and (2) the BOLT Egyptian Arabic corpus (LDC2021T12) (Maamouri et al., 2021), covering a colloquial dialectal variety. Together they capture both formal and spoken registers of Arabic morphology. We remove diacritics to match the undiacritized Arabic text used in real-world settings. This choice simplifies token matching but makes some templatic contrasts, those expressed only by short vowels or gemination, unobservable. We removed punctuation marks, numerical tokens, and English characters from both corpora, retaining only Arabic words for morphological evaluation. The resulting statistics are shown in Table 1.

Morphological Analyzers.
To establish linguistically grounded baselines, we include two morphology-guided segmenters that explicitly model Arabic word structure. From CAMeL Tools (Obeid et al., 2020), we use the maximum-likelihood (MLE) segmenter (CAMEL), which combines lexicon- and rule-based heuristics with probabilistic scoring.³ We additionally include the Farasa segmenter (Darwish and Mubarak, 2016), a fast, supervised system widely adopted in Arabic NLP. We use only their segmentation outputs and treat these analyzers as morphology-aware preprocessing baselines for subword tokenization.

²Tokenizer sources: https://huggingface.co/Xenova/gpt-4, https://huggingface.co/Xenova/gpt-4o, https://huggingface.co/meta-llama/Meta-Llama-3-8B, https://huggingface.co/Qwen/Qwen3-8B, https://huggingface.co/CohereLabs/c4ai-command-r7b-12-2024, https://huggingface.co/QCRI/Fanar-1-9B, https://huggingface.co/humain-ai/ALLaM-7B-Instruct-preview

Dataset   #Sents   #Words    #Tokens   AvgTok/Sent
ATB3      12,626   337,312   571,449   45.3
ATB3_c    12,587   292,552   526,745   41.85
BOLT      19,994   149,940   223,326   11.2
BOLT_c    19,453   128,271   201,668   10.37

Table 1: Statistics of the evaluation datasets before and after cleaning (removal of numbers, punctuation, and English characters, denoted by '_c').

3.2.1. Alignment Metrics

We quantify how closely tokenizer outputs align with Arabic morphological structure. Our evaluation targets two complementary aspects: (i) concatenative alignment, which measures correspondence between token and morpheme boundaries, and (ii) morphological integrity, which assesses the preservation of whole morphemes within token boundaries. Let W be the set of words.
For each word w ∈ W, let T(w) denote the sequence of predicted tokens, G(w) the gold-standard morpheme segmentation, B(w) and \hat{B}(w) the sets of gold and predicted morpheme-internal boundaries, and M(w) and \hat{M}(w) the sets of gold and predicted morpheme spans, respectively.

Fertility. Following Rust et al. (2021), fertility measures the average number of tokens per word:

\mathrm{Fert} = \frac{1}{|W|} \sum_{w \in W} |T(w)|

Lower fertility indicates more compact segmentation and a higher compression rate.

Morpheme Boundary Precision and Recall. Boundary metrics evaluate the alignment between predicted and gold morpheme boundaries:

\mathrm{BRecall} = \frac{\sum_{w} |B(w) \cap \hat{B}(w)|}{\sum_{w} |B(w)|}, \qquad \mathrm{BPrecision} = \frac{\sum_{w} |B(w) \cap \hat{B}(w)|}{\sum_{w} |\hat{B}(w)|}

Boundary recall captures how many true morpheme breaks are recovered, while boundary precision penalizes spurious splits. Their harmonic mean defines the Boundary F1 score, a balance between coverage and linguistic validity.

Morpheme F1. This stricter metric assesses complete morpheme recovery by requiring both start and end boundaries to match:

\mathrm{F1} = \frac{1}{|W|} \sum_{w} \frac{2 \cdot |M(w) \cap \hat{M}(w)|}{|M(w)| + |\hat{M}(w)|}

Unlike boundary-level measures, Morpheme F1 rewards only fully correct morpheme spans.

Morpheme Coverage Rate (MCR). MCR quantifies the proportion of gold morphemes preserved as intact units within tokens. More concretely, we calculate MCR as follows:

\mathrm{MCR} = \frac{1}{|W|} \sum_{w \in W} \frac{\left|\{\, m \in M(w) : \exists\, \hat{m} \in \hat{M}(w),\; m \subseteq \hat{m} \,\}\right|}{|M(w)|}

A high MCR indicates that tokens maintain internal morphological integrity, while a low value reflects fragmentation across tokens.

³We also experimented with the BERT-based segmenter, but observed comparable performance to the MLE-based version.

Interpretation. Together, these metrics capture complementary aspects of morphological sensitivity.
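As a concrete illustration, the metrics above can be computed from gold and predicted word segmentations as follows. This is a minimal sketch under one assumption: morphemes and tokens are compared as character spans, matching the boundary and span definitions above.

```python
def boundaries(segments):
    """Internal boundary positions (character offsets) of a segmentation."""
    pos, cuts = 0, set()
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def spans(segments):
    """(start, end) character spans of each segment."""
    out, pos = [], 0
    for seg in segments:
        out.append((pos, pos + len(seg)))
        pos += len(seg)
    return out

def evaluate(gold, pred):
    """gold/pred: parallel lists of segmentations (list of strings per word)."""
    n_words = len(gold)
    fert = sum(len(p) for p in pred) / n_words          # Fertility
    tp = gold_b = pred_b = 0
    f1_sum = mcr_sum = 0.0
    for g, p in zip(gold, pred):
        gb, pb = boundaries(g), boundaries(p)
        tp += len(gb & pb); gold_b += len(gb); pred_b += len(pb)
        gm, pm = set(spans(g)), set(spans(p))
        f1_sum += 2 * len(gm & pm) / (len(gm) + len(pm))  # Morpheme F1
        # MCR: a gold morpheme counts if it sits intact inside some token.
        covered = sum(any(ps <= gs and ge <= pe for ps, pe in spans(p))
                      for gs, ge in spans(g))
        mcr_sum += covered / len(spans(g))
    return {"fertility": fert,
            "b_recall": tp / gold_b if gold_b else 0.0,
            "b_precision": tp / pred_b if pred_b else 0.0,
            "morpheme_f1": f1_sum / n_words,
            "mcr": mcr_sum / n_words}
```

Note how a tokenizer that never splits a word gets fertility 1 and full MCR but zero boundary recall; this is the pattern discussed for ALLaM in §4.1.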
Boundary-based measures reflect segmentation granularity, whereas morpheme-level scores evaluate linguistic coherence. MCR provides an interpretable upper bound on morphological faithfulness, allowing fine-grained comparison between analyzers and data-driven tokenizers.

3.3. Morphological Productivity Tasks

To evaluate whether language models capture the systematic mechanisms of Arabic word formation, we design a set of controlled morphological productivity tasks targeting three core capacities: (1) pattern transformation, reflecting non-concatenative root-pattern mapping; (2) morpheme attachment, capturing concatenative affixal composition; and (3) generalization to unseen forms, testing morphological productivity on nonce roots. These tasks serve as a diagnostic of linguistic generalization, evaluating whether a model can apply morphological rules productively rather than relying on memorized lexical associations (Weissweiler et al., 2025; Ismayilzada et al., 2025).

3.3.1. Dataset Construction

We designed a dataset to enable controlled evaluation of Arabic derivational morphology across multiple tasks. The dataset integrates both attested and synthetic (nonce) examples to assess model generalization rather than mere morphological memorization. The real-root subset is derived from the Arabic Billion Words corpus⁴, and the words were first processed using CAMEL to extract their roots and patterns, from which affixal information (prefixes and suffixes) was identified. Overall, this subset contains 13 different patterns and 130 unique root-pattern forms. Further, each root–pattern pair appears in three surface forms: one unaffixed and two affixed variants. The examples were all manually selected and verified to ensure correctness. The nonce subset provides a controlled test for generalization beyond seen words, as the resulting words are non-existent but morphologically valid.
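To make the root-pattern mechanics concrete, here is an illustrative sketch (not the authors' pipeline) of interleaving a triliteral root into an undiacritized template and of sampling nonce roots. The digit-slot template encoding (digits 1-3 marking root-consonant positions) is an assumption of this sketch:

```python
import random

# Arabic consonants used for sampling nonce roots (illustrative subset).
ALPHABET = "بتثجحخدذرزسشصضطظعغفقكلمنهوي"

def apply_pattern(root, template):
    """Interleave a triliteral root into a template whose digits 1-3 mark
    the root-consonant slots, e.g. 'م12و3' for undiacritized مفعول."""
    assert len(root) == 3
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in template)

def nonce_roots(n, known_roots, seed=0):
    """Sample triliteral letter sequences not attested as real roots.
    (In the paper, native speakers verify non-existence manually.)"""
    rng = random.Random(seed)
    roots = set()
    while len(roots) < n:
        candidate = "".join(rng.sample(ALPHABET, 3))
        if candidate not in known_roots:
            roots.add(candidate)
    return sorted(roots)

print(apply_pattern("كتب", "م12و3"))  # root /ktb/ + pattern مفعول → مكتوب
```

Real nonce-root sampling would also need to exclude phonotactically impossible sequences, which is why the paper relies on manual verification.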
The set consists of 20 synthetic roots generated randomly from Arabic alphabet characters and manually checked by native speakers to ensure they do not correspond to any valid roots in Arabic. Each root was manually combined with five different patterns, resulting in 100 unique combinations. An example of a real entry from our dataset is shown below.

{
  "root": "ثمر",
  "template": "فعال",
  "base_form": "ثمار",
  "prefix": "ال",
  "suffix": "",
  "full_form": "الثمار",
  "has_affix": "true",
  "root_category": "high_frequency"
}

Figure 1: Example dataset instance for the pattern فعال /fiʕaːl/. The instance shows the root ثمر /θmr/, for the full form الثمار /alθimaːr/. Morphological features (e.g., root, pattern) were obtained using CAMeL Tools, from which other fields (e.g., affixes and template) were identified.

⁴https://huggingface.co/datasets/oserikov/arabic_billion_words

3.3.2. Task Definition

We implement a prompt-based generation framework comprising two main conditions, each defined by the morphological information provided to the model, enabling targeted evaluation of its morphological knowledge. We test both Arabic and English prompts under 0-shot and 1-shot settings. We evaluate root-pattern and concatenative morphology (affixation) separately and in combination. More specifically, we define the following tasks:

Root-Pattern: The model receives a triliteral root and a derivational pattern ("template") and must generate the corresponding derived form ("base_form"). We test with both real roots and nonce roots. The English prompt is shown below:

In Arabic, words are formed by applying a morphological pattern to a triliteral root. Each root consists of three consonants and follows the abstract root pattern فعل (fa'ala).
Given the root {root} and the target morphological pattern {template}, generate the corresponding Arabic word by correctly applying the root to the specified pattern. Respond with only the fully-formed Arabic word; no transliteration, spaces, punctuation, or explanation.

Affix-Build: The model receives the unaffixed base_word and an unordered set of affixes and must produce the correctly ordered derived full_form. This task tests whether the model captures concatenative affix attachment and ordering.

Unaffixed base form: {base_form}
Apply the following affixes to produce the final form:
Affixes: {prefix} {suffix}
Return ONE Arabic word only (no spaces, no punctuation).

One-Shot Prompts: We also test all the above prompts with the addition of a one-shot example. For instance, we add the following to the Root-Pattern prompt:

Example (one-shot):
Root: زرع | Template: {template} → Target form: {base_form(زرع)}
Now answer for the requested root and pattern.

3.3.3. Evaluation

We evaluate model performance using generation accuracy. In practice, some language models tend to produce multiple-word outputs or additional context despite explicit prompt constraints. To account for this behavior, we adopt a lenient matching criterion: an output is considered correct if it contains a correctly formed word that conforms to the specified derivational pattern, even when embedded within a longer response.

4. Results & Analysis

Our results on token-morpheme alignment, along with probing task performance, are summarized in Tables 2 and 3. Overall, we find no consistent relationship between morpheme alignment scores and morphological generation accuracy. GPT4 and GPT4o exhibit opposite alignment patterns yet achieve the highest scores across all probing tasks.
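The lenient criterion of §3.3.3 can be approximated programmatically. In this sketch, an assumption simplifies "conforms to the pattern" to an exact match against the gold form after stripping diacritics:

```python
import re

# Arabic harakat (U+064B-U+0652) plus tatweel (U+0640), removed before matching.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

def normalize(s):
    return DIACRITICS.sub("", s)

def lenient_correct(output, target):
    """True if the gold form appears as a standalone word anywhere in the
    (possibly multi-word) model response."""
    return normalize(target) in normalize(output).split()

def accuracy(outputs, targets):
    hits = sum(lenient_correct(o, t) for o, t in zip(outputs, targets))
    return 100.0 * hits / len(targets)
```

For example, `lenient_correct("الكلمة هي مكتوب", "مكتوب")` is True even though the response contains extra context, mirroring the lenient criterion.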
In contrast, the Arabic-centric models ALLaM and Fanar fail to generalize to nonce words, indicating reliance on lexeme memorization rather than true morphological generalization. In the following sections, we discuss these aspects in detail and conclude with broader insights and limitations.

4.1. Token-Morpheme Alignment Results

As can be seen from the bottom part of Table 2, the morphological analyzers, namely CAMEL and Farasa, achieve fertility scores between 1 and 2, providing compression while preserving morphemes with high precision, recall, and MCR. This suggests that accurate morphemic segmentation can, in principle, be obtained without excessive splitting, serving as an oracle boundary, even if it is not practical for LLM development.

GPT4 exhibits clear over-segmentation (fertility > 3), producing many short tokens that artificially inflate boundary recall (85%), but this also results in very low precision (23%). The remaining tokenizers maintain fertility scores of approximately 1-2, similar to those of the morphological analyzers, indicating greater compression in input segmentation. At the morpheme level, most tokenizers (except GPT4) perform comparably, with ALLaM achieving the highest F1 scores. Their performance, however, remains well below that of the morphological analyzers, as tokenizers are optimized for statistical efficiency rather than morpheme preservation. Boundary metrics reveal greater variation in how tokenizers align with morphemes: ALLaM, an Arabic-centric model lacking morpheme-based pre-tokenization, exhibits very low boundary recall. In contrast, Fanar, also Arabic-centric but explicitly designed to preserve morphemes, achieves higher boundary recall, though its low precision (17%) and MCR suggest that it tends to over-segment morphemes.
Tokenizers
Data  Model    Fertility  #Tokens    F1     BoundaryP  BoundaryR  BoundaryF1  MCR
ATB3  ALLaM    1.24       364,189    39.39  18.00      5.51       8.43        83.66
ATB3  Fanar    1.91       561,373    34.32  17.96      20.61      19.19       49.46
ATB3  GPT4     4.01       1,175,386  24.59  23.07      85.21      36.31       33.04
ATB3  GPT4o    1.76       516,062    35.28  17.64      16.83      17.22       56.44
ATB3  LLaMA3   2.15       631,730    32.85  18.75      27.15      22.18       42.62
ATB3  Qwen3    2.11       617,279    36.02  23.96      33.18      27.83       53.48
ATB3  Cohere   1.92       564,098    33.30  19.33      22.38      20.74       49.65
BOLT  ALLaM    1.23       158,348    57.23  38.68      15.85      22.49       86.43
BOLT  Fanar    1.77       227,951    43.99  26.23      35.63      30.22       55.26
BOLT  GPT4     3.33       428,143    21.28  22.90      92.40      36.70       26.66
BOLT  GPT4o    1.69       216,801    45.17  26.54      32.01      29.02       60.38
BOLT  LLaMA3   1.94       248,937    39.01  23.00      37.81      28.60       46.38
BOLT  Qwen3    1.96       252,335    39.66  26.13      44.16      32.83       52.09
BOLT  Cohere   1.84       237,063    40.15  23.77      35.22      28.38       50.35

Morphological Analyzers
ATB3  CAMEL    1.49       437,563    73.24  98.41      60.94      75.27       97.07
ATB3  Farasa   1.74       510,214    93.78  98.92      91.94      95.30       99.26
BOLT  CAMEL    1.27       163,073    66.17  76.60      36.32      49.27       85.29
BOLT  Farasa   1.40       180,279    77.71  84.02      59.53      69.69       91.80

Table 2: Results of evaluating tokenizers and morphological analyzers. Evaluation metrics are described in §3.2.1. Bold values indicate the highest score for each dataset across tokenizers or analyzers.

Figure 2: Comparison of Boundary F1 and Morpheme F1 scores on the ATB3 and BOLT datasets.

Multilingual tokenizers show similar precision to Arabic-centric ones, except for Qwen3 and GPT-4, which perform slightly better. As shown in Figure 2, most tokenizers achieve higher morpheme F1 but lower boundary F1, while GPT4 exhibits the reverse trend. The figure also shows that the morphological analyzers have inconsistent performance between MSA and dialectal data, being biased towards MSA, while all tokenizers show a consistent performance or a slightly
better performance on dialectal data. This is likely a result of the distribution of data used to train these tokenizers, compared to the MSA-centric design of traditional morphological analyzers.

Models show comparable fertility values but differ in their handling of morphemes, indicating that similar fertility does not ensure linguistic coherence. As shown in Figure 3, most tokenizers cluster within a narrow fertility range yet vary substantially in morphological fidelity.

Figure 3: Fertility scores vs. morpheme F1 scores for tokenizers and analyzers on the ATB3 and BOLT datasets.

Estimation of Root Preservation. Overall, ALLaM exhibits strong morphological integrity compared to all other tokenizers, as indicated by its high MCR score; it rarely splits morphemes internally, thus maintaining complete morphemes and roots. This behavior accounts for its low boundary recall (e.g., producing "wasa" instead of "wa sa"). All other tokenizers exhibit rather low MCR rates, with GPT4 being the lowest as a result of over-segmentation.

4.2. Morphological Productivity Results

As can be seen from Table 3, GPT4o consistently achieves the highest scores across all probing tasks. GPT4 performs similarly, demonstrating strong generalization despite its higher fertility. High performance on both real words (Root-Pattern Real) and nonce words (Root-Pattern Nonce) suggests that these models apply morphological transformation rules rather than rely on lexical memorization. In contrast, other models show clear limitations in productive morphology: their accuracies rarely exceed 60% and drop sharply on nonce words, reflecting possible over-reliance on memorized lexical patterns. The Arabic-centric models perform better than the multilingual models (other than the GPT4 series), which could be attributed to the size of their Arabic training data. Fanar performs consistently across all probing tasks, while ALLaM shows a significant drop on nonce words. It is worth noting that Fanar incorporates a morpheme-informed design, which may be a factor contributing to its steadier performance, but it could also be attributed to better instruction-following performance.

Model    Root-Pattern Real (%)  Root-Pattern Nonce (%)  Affix Build (%)
ALLaM    66.92                  20.00                   69.23
Fanar    56.92                  52.00                   44.23
GPT4     94.62                  92.00                   88.46
GPT4o    96.92                  97.00                   91.92
LLaMA3   26.15                  10.00                   68.08
Qwen3    43.08                  30.00                   17.69
Cohere   43.85                  29.00                   60.00

Table 3: Generation accuracy across models. Bold indicates the highest score in each column, while underlined indicates the second highest. Results are shown for the English version of the 1-shot prompt, which yielded the highest or near-highest scores across most language models.

A Note on Prompt Variations. Table 3 presents the results for the English version of the 1-shot prompt. Before obtaining these scores, we explored several prompt variations to more effectively elicit the morphological transformation capabilities of the evaluated language models. The impact of one-shot prompting is evident in several models, which show clear improvements compared to their zero-shot counterparts.⁵ These gains suggest that in-context examples help the models infer morphological relations more effectively. In contrast, GPT4 and GPT4o exhibit minimal differences across shot settings, indicating that they can apply the specified morphological rules without explicit exemplars. A cross-linguistic comparison reveals another pattern: most models perform substantially better with English prompts than Arabic, likely reflecting both the greater morphological complexity of Arabic and the relative scarcity of Arabic data in instruction fine-tuning.

⁵A comparison of zero-shot and one-shot prompts, and English vs.
Arabic prompts, can be found in T able 5 in the Appendix. Figure 4: P earson correlation between morphologi- cal alignment and generation scores. 4.3. Qualitative Analysis T able 6 in the Appendix shows a breakdown of mor phological generation accuracy across pat- tern types. W e conducted a manual qualitative error analysis and examined outputs from each model, f ocusing on three patterns: É« A  ¯ /f a Q il/ , ɪ  ®  J @ /istaf Q ala/ , and È Aª  ¯ /fi Q a : l/ . T able 4 summa- rizes the main error categor ies obser ved across models. Most models (e xcept G P T 4 and G P T 4 O ) occasionally produce correct f orms, but their ov er- all behavior reflects partial, surface-le vel sensitivity to the instr uction rather than consistent application of mor phological rules. 4.4. Correlation with Internal Metrics. As shown in Figure 4 , we do not observe an y con- sistent correlation between tokeniz er alignment metrics such as boundar y F1, or mor pheme cov- erage rate, and model performance on the mor- phological probing or productivity tasks. Fertility scores are the only ones with consistent, but weak, positive correlation with generation tasks. This is likely due to the uniquely high fer tility score of G P T 4 , which achie ves the second highest gen- ration accuracy in probing tasks . Strong mor pho- logical alignment metrics, such as Mor pheme F1 or MCR, have zero or weak negativ e correlation with generation performance. These results sug- gest that surface tokenization quality in ter ms of mor phological alignment is neither necessary nor sufficient f or mor phological productivity . Root–Pattern Err ors Pattern Misapplication: The correct root was identified but the pattern was misapplied. GT  A  ®  k (/xif : a : s/) PRED   ®  k (/xafas/) P A TTERN È Aª  ¯ (/fi Q a : l/) LLM Q W E N Root Deformation: Root consonants were altered (added, dropped, replaced). 
    GT /tˤiɣaːd/ · PRED /ɣtˤaːs/ · PATTERN فعال /fiʕaːl/ · LLM ALLaM

Wrong Pattern Selection: a different morphological pattern was used instead of the target one.
    GT /diɣaːz/ · PRED /daːɣz/ · PATTERN فعال /fiʕaːl/ · LLM Cohere

Real Word Substitution: the output is a valid Arabic word instead of the pattern applied to the nonce root.
    GT /nifaːɣ/ · PRED /naːfiʕ/ · PATTERN فعال /fiʕaːl/ · LLM ALLaM

Incorrect Affixation: affixes merged in an incorrect order (prefix, root, infix misaligned).
    GT /istaxdam/ · PRED /xdmaːst/ · PATTERN استفعل /istafʕala/ · LLM LLaMA-3

Morpheme-Based Errors

Missing Affix: the model omits a required prefix or suffix.
    GT /dʒibaːlhum/ · PRED /dʒibaːl/ · PATTERN فعال /fiʕaːl/ · LLM Cohere

Wrong Base Selection: affixation is correct but applied to an incorrect base word.
    GT /alʕumaːl/ · PRED /alʕaːmil/ · PATTERN فعال /fiʕaːl/ · LLM ALLaM

Wrong Morpheme Attachment: incorrect affixes are added instead of the intended ones.
    GT /altˤaːlibaːt/ · PRED /altˤulaːb/ · PATTERN فاعل /faːʕil/ · LLM ALLaM

Partial Truncation: only part of the target word is emitted.
    GT /istamsak/ · PRED /istams/ · PATTERN استفعل /istafʕala/ · LLM GPT-4

Agreement / Category Error: incorrect number, gender, or case.
    GT /alħaːlimiːn/ · PRED /alħaːlimuːn/ · PATTERN فاعل /faːʕil/ · LLM ALLaM

Table 4: Common qualitative error types observed in Arabic morphological generation across evaluated models. GT = ground truth; PRED = model prediction; PATTERN = target morphological pattern; LLM = language model.

5.
Discussion & Implications

From our findings, we advocate a functional rather than structural view of morphological representation in LLMs. Morphological competence should be defined by a model's ability to generalize morphological transformations beyond the training vocabulary, rather than by surface morpheme segmentation. From this perspective, LLMs can exhibit morphological productivity even without aligned morpheme boundaries. There appears to be a level of subword segmentation that enables the model to represent words efficiently while remaining flexible enough to capture novel compositional structure within the framework of instruction tuning. Our results demonstrate that explicit morphological alignment in the tokenizer is neither necessary nor sufficient for productive generalization. Instead, compositional reasoning and instruction-following capabilities serve as functional substitutes for explicit morphological parsing, enabling rule-like transformations without predefined segmentation. Crucially, morpheme alignment does not predict morphological productivity: models with high alignment scores (e.g., ALLaM) do not achieve superior derivational performance, whereas GPT-4, despite its high fertility, generalizes more effectively. This challenges the assumption that morphology-aware tokenization is essential for modeling morphologically rich languages like Arabic.

The success of models like GPT-4 in producing correct forms for nonce words, in spite of poor morphological alignment, supports this hypothesis: productive patterns arise from statistical regularities in subword sequences rather than from explicit, well-formed morphemes. Character-level interpolation and context-based reasoning allow models to manipulate internal consonantal templates even when token boundaries are poorly aligned with morphological units.
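The root-pattern interleaving that models must learn implicitly can be made concrete with a small sketch. The function below is a hypothetical illustration, not the paper's evaluation code: the romanized roots and the "C" slot convention for templates are our own assumptions.

```python
def apply_pattern(root: str, template: str) -> str:
    """Interdigitate a triliteral root into an Arabic-style template.

    `root` is a sequence of three consonants; `template` uses 'C' as a
    slot for the next root consonant (e.g. 'CaaCiC' for a fa'il-like
    agent-noun pattern). This is non-concatenative: the root is
    interleaved with, not appended to, the pattern material.
    """
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

# Real root k-t-b ("write") under two patterns:
print(apply_pattern("ktb", "CaaCiC"))     # kaatib
print(apply_pattern("ktb", "istaCCaCa"))  # istaktaba

# The same rule applies productively to a nonce root such as z-q-m:
print(apply_pattern("zqm", "CaaCiC"))     # zaaqim
```

The last call illustrates the productivity probe: a model that has learned the rule, rather than memorized the lexicon, can fill a nonce root into the template just as easily as a real one.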
Using instruction-following capabilities, morphology can thus be dynamically reconstructed rather than statically encoded. Practically, our results suggest that the role of linguistically informed tokenizers in large-scale LLM training may be questionable. Given the cost of developing language-specific tokenizers, future work could instead pursue tokenization-agnostic morphology learning, where LLMs acquire productive morphological rules through large-scale exposure and task-specific tuning. Adaptive or hybrid tokenization, allowing fine-grained character-level processing only when morphologically necessary, offers a promising direction.

5.1. Limitations

Our experimental setup cannot directly probe deep morphological representation and reasoning, and may instead conflate them with instruction-following ability, which could be what primarily drives morphological generation in our experiments. While we ensured that all evaluated models can follow the given instructions at least some of the time, the examined models vary in their instruction-following consistency, with some models generating unpredictably long text streams. To mitigate this effect, we tested multiple prompt formulations during probing and adopted a lenient evaluation criterion that accepts any correct target word among the top predictions, even when accompanied by other words. Considering the difficulty of tracing the cause of performance variations across models that have different architectures, training data, and optimization mechanisms, we based our analysis on correlation scores instead. However, we concede that such analysis is still limited and does not provide a basis for establishing causality. Our results should be used as a basis for further analysis and evaluation rather than as a definitive conclusion about the relationship between tokenization and morphological generation.
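The lenient criterion described above amounts to a containment check over the leading words of the output. The sketch below is an illustrative reconstruction under our own assumptions (whitespace/punctuation word splitting and a top-k cutoff); it is not the paper's actual scoring script.

```python
import re

def lenient_match(output_text: str, gold: str, k: int = 5) -> bool:
    """Accept a model output if the gold word appears among the first k
    word tokens, even when it is accompanied by extra text."""
    words = re.findall(r"\w+", output_text)
    return gold in words[:k]

# A verbose but correct output is still accepted:
print(lenient_match("The answer is kaatib, meaning writer.", "kaatib"))  # True
# A wrong form is rejected:
print(lenient_match("The answer is kutub.", "kaatib"))  # False
```

The point of the leniency is to separate morphological ability from instruction-following consistency: a model that produces the right form buried in chatter is still credited.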
Controlled experiments examining the impact of tokenizer design on model performance and morphological representation could provide stronger insights into the causal relationship between tokenizer design and morphological productivity.

6. Bibliographical References

Ummar Abbas et al. 2025. Fanar: An Arabic-centric multimodal generative AI platform. arXiv preprint arXiv:2501.13944.
Josh Achiam et al. 2024. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
Arash Ahmadian et al. 2025. Command A: An enterprise-ready large language model. arXiv preprint arXiv:2504.00698.
Mehdi Ali, Michael Fromm, Klaudia Thellmann, Richard Rutmann, Max Lübbering, Johannes Leveling, Katrin Klug, Jan Ebert, Niclas Doll, Jasper Buschhoff, et al. 2024. Tokenizer choice for LLM training: Negligible or crucial? In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3907–3924.
Catherine Arnett and Benjamin Bergen. 2025. Why do language models perform worse for morphologically complex languages? In Proceedings of the 31st International Conference on Computational Linguistics, pages 6607–6623, Abu Dhabi, UAE. Association for Computational Linguistics.
Catherine Arnett, Marisa Hudspeth, and Brendan O'Connor. 2025. Evaluating morphological alignment of tokenizers in 70 languages. In Proceedings of the ICML 2025 Tokenization Workshop (TokShop). International Conference on Machine Learning.
Ehsaneddin Asgari, Yassine El Kheir, and Mohammad Ali Sadraei Javaheri. 2025. MorphBPE: A morpho-aware tokenizer bridging linguistic complexity for efficient LLM training across morphologies. arXiv preprint.
M Saiful Bari, Yazeed Alnumay, Norah Alzahrani, Nouf Alotaibi, Hisham Alyahya, AlRashed AlRashed, Faisal Mirza, Shaykhah Alsubaie, Hassan Alahmed, Ghadah Alabduljabbar, Raghad Alkhathran, Yousef Almushayqih, Raneem Alnajim, Salman I Alsubaihi, Maryam Al Mansour, Saad Hassan, Majed Alrubaian, Ali Alammari, Zaki Alawami, Abdulmohsen Al-Thubaity, Ahmed Abdelali, Jeril Kuriakose, Abdalghani Abujabal, Nora Al-Twairesh, Areeb Alowisheq, and Haidar Khan. 2025. ALLaM: Large language models for Arabic and English. In International Conference on Learning Representations, volume 2025, pages 34179–34214.
Kenneth R Beesley and Lauri Karttunen. 2003. Finite-state morphology: Xerox tools and techniques. CSLI, Stanford, pages 359–375.
Kaj Bostrom and Greg Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4617–4624, Online. Association for Computational Linguistics.
Anh Dang, Limor Raviv, and Lukas Galke. 2024. Morphology matters: Probing the cross-linguistic morphological generalization abilities of large language models through a wug test. In 13th edition of the Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2024), pages 177–188. Association for Computational Linguistics (ACL).
Kareem Darwish and Hamdy Mubarak. 2016. Farasa: A new fast and accurate Arabic word segmenter. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1070–1074.
Philip Gage. 1994. A new algorithm for data compression. C Users J., 12(2):23–38.
Bar Gazit, Shaltiel Shmidman, Avi Shmidman, and Yuval Pinter. 2025. Splintering nonconcatenative languages for better tokenization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22405–22417, Vienna, Austria. Association for Computational Linguistics.
Aaron Grattafiori et al.
2024. The Llama 3 herd of models. arXiv preprint.
Nizar Y Habash. 2010. Introduction to Arabic Natural Language Processing. Morgan & Claypool Publishers.
Valentin Hofmann, Hinrich Schuetze, and Janet Pierrehumbert. 2022. An embarrassingly simple method to mitigate undesirable properties of pretrained language model tokenizers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 385–393, Dublin, Ireland. Association for Computational Linguistics.
Valentin Hofmann, Leonie Weissweiler, David R Mortensen, Hinrich Schütze, and Janet B Pierrehumbert. 2025. Derivational morphology reveals analogical generalization in large language models. Proceedings of the National Academy of Sciences, 122(19):e2423232122.
Binyuan Hui et al. 2025. Qwen2.5 technical report. arXiv preprint arXiv:2409.12186.
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276.
Mete Ismayilzada, Defne Circi, Jonne Sälevä, Hale Sirin, Abdullatif Köksal, Bhuwan Dhingra, Antoine Bosselut, Duygu Ataman, and Lonneke Van Der Plas. 2025. Evaluating morphological compositional generalization in large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.
Haris Jabbar. 2024. MorphPiece: A linguistic tokenizer for large language models. arXiv preprint arXiv:2307.07262.
Albert Q.
Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B.
Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.
Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, and Samson Tan. 2021. Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP.
Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima Taji, Mai Oudah, Bashar Alhafni, Go Inoue, Fadhl Eryani, Alexander Erdmann, and Nizar Habash. 2020. CAMeL Tools: An open source Python toolkit for Arabic natural language processing. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 7022–7032, Marseille, France. European Language Resources Association.
Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, et al. 2024. Byte latent transformer: Patches scale better than tokens. arXiv preprint arXiv:2412.09871.
Aleksandar Petrov, Emanuele La Malfa, Philip Torr, and Adel Bibi. 2023. Language model tokenizers introduce unfairness between languages. Advances in Neural Information Processing Systems, 36:36963–36990.
Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. 2021. How good is your tokenizer?
On the monolingual performance of multilingual language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3118–3135, Online. Association for Computational Linguistics.
Craig W Schmidt, Varshini Reddy, Chris Tanner, and Yuval Pinter. 2025. Boundless byte pair encoding: Breaking the pre-tokenization barrier. arXiv preprint arXiv:2504.00178.
Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Jean Seo, Jaeyoon Kim, SungJoo Byun, and Hyopil Shin. 2025. How does a language-specific tokenizer affect LLMs? arXiv preprint arXiv:2502.12560.
Leonie Weissweiler, Valentin Hofmann, Anjali Kantharuban, Anna Cai, Ritam Dutt, Amey Hengle, Anubha Kabra, Atharva Kulkarni, Abhishek Vijayakumar, Haofei Yu, Hinrich Schuetze, Kemal Oflazer, and David Mortensen. 2023. Counting the bugs in ChatGPT's wugs: A multilingual investigation into the morphological capabilities of a large language model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6508–6524, Singapore. Association for Computational Linguistics.
Leonie Weissweiler, Kyle Mahowald, and Adele Goldberg. 2025. Linguistic generalizations are not rules: Impacts on evaluation of LMs. arXiv preprint arXiv:2502.13195.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V.
Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation.

7. Language Resource References

Maamouri, Mohamed and Bies, Ann and Buckwalter, Tim and Mekki, Wigdan. 2010. Arabic Treebank: Part 3 v 3.2 LDC2010T08. ISLRN 770-467-034-042-0. Web Download. Philadelphia: Linguistic Data Consortium.
Maamouri, Mohamed and et al. 2021. BOLT Egyptian Arabic Treebank - Conversational Telephone Speech LDC2021T12. ISLRN 430-645-589-448-0. Web Download. Philadelphia: Linguistic Data Consortium.

A. Additional Results

Model     Prompt  Shots  Root-Pattern Real (%)  Root-Pattern Nonce (%)  Affix Build (%)
ALLaM     EN      0      53.08                  30.00                   43.08
ALLaM     EN      1      66.92                  20.00                   69.23
ALLaM     AR      0      58.46                  23.00                   40.77
ALLaM     AR      1      80.77                  15.00                   58.85
Fanar     EN      0      31.54                  14.00                   37.69
Fanar     EN      1      56.92                  52.00                   44.23
Fanar     AR      0       0.77                   2.00                   36.15
Fanar     AR      1      32.31                  21.00                    7.31
GPT-4     EN      0      82.31                  89.00                   71.15
GPT-4     EN      1      94.62                  92.00                   88.46
GPT-4     AR      0      86.15                  72.00                   73.85
GPT-4     AR      1      84.62                  78.00                   31.54
GPT-4o    EN      0      96.92                  96.00                   80.00
GPT-4o    EN      1      96.92                  97.00                   91.92
GPT-4o    AR      0      96.15                  96.00                   72.31
GPT-4o    AR      1      97.69                  95.00                   90.77
LLaMA-3   EN      0      26.92                   2.00                   46.92
LLaMA-3   EN      1      26.15                  10.00                   68.08
LLaMA-3   AR      0      20.00                   0.00                   20.00
LLaMA-3   AR      1      16.15                   7.00                   57.69
Qwen-3    EN      0      26.15                  16.00                   12.31
Qwen-3    EN      1      43.08                  30.00                   17.69
Qwen-3    AR      0      21.54                   5.00                   23.46
Qwen-3    AR      1      37.69                  40.00                   67.69
Cohere    EN      0      37.69                  16.00                   35.77
Cohere    EN      1      43.85                  29.00                   60.00
Cohere    AR      0      21.54                   3.00                   11.54
Cohere    AR      1       7.69                  11.00                   28.46

Table 5: Generation
accuracy across models and prompting settings (zero-shot vs. one-shot / English vs. Arabic prompts).

(a) Patterns 1–7
mafʿūl (مفعول):     ALLaM 70/10/70 · Fanar 80/40/40 · GPT-4 100/80/90 · GPT-4o 100/95/90 · LLaMA-3 10/5/60 · Qwen-3 60/5/40 · Cohere 30/15/50
fāʿil (فاعل):       ALLaM 100/30/80 · Fanar 90/45/10 · GPT-4 100/100/100 · GPT-4o 100/100/100 · LLaMA-3 50/20/75 · Qwen-3 30/20/5 · Cohere 70/25/50
fiʿāla (فعالة):     ALLaM 70/…/45 · Fanar 50/…/30 · GPT-4 90/…/80 · GPT-4o 90/…/85 · LLaMA-3 60/…/55 · Qwen-3 50/…/0 · Cohere 40/…/60
istafʿala (استفعل): ALLaM 20/40/100 · Fanar 100/75/55 · GPT-4 100/95/100 · GPT-4o 100/95/100 · LLaMA-3 0/10/90 · Qwen-3 80/85/10 · Cohere 90/55/30
faʿīl (فعيل):       ALLaM 80/…/60 · Fanar 40/…/75 · GPT-4 100/…/95 · GPT-4o 100/…/90 · LLaMA-3 40/…/55 · Qwen-3 50/…/15 · Cohere 50/…/70
fuʿlān (فعلان):     ALLaM 90/…/45 · Fanar 80/…/30 · GPT-4 100/…/70 · GPT-4o 100/…/70 · LLaMA-3 40/…/50 · Qwen-3 90/…/10 · Cohere 50/…/45
mifʿāl (مفعال):     ALLaM 30/…/60 · Fanar 50/…/35 · GPT-4 90/…/95 · GPT-4o 90/…/100 · LLaMA-3 30/…/90 · Qwen-3 20/…/20 · Cohere 30/…/85

(b) Patterns 8–13
infaʿala (انفعل):   ALLaM 80/…/90 · Fanar 60/…/75 · GPT-4 100/…/70 · GPT-4o 100/…/100 · LLaMA-3 0/…/75 · Qwen-3 50/…/0 · Cohere 60/…/50
muftaʿil (مفتعل):   ALLaM 60/…/75 · Fanar 20/…/55 · GPT-4 50/…/90 · GPT-4o 100/…/95 · LLaMA-3 10/…/90 · Qwen-3 0/…/25 · Cohere 0/…/50
iftiʿāl (افتعال):   ALLaM 40/…/85 · Fanar 20/…/80 · GPT-4 100/…/90 · GPT-4o 80/…/95 · LLaMA-3 0/…/60 · Qwen-3 10/…/30 · Cohere 10/…/65
fuʿūl (فعول):       ALLaM 100/15/75 · Fanar 20/35/30 · GPT-4 100/90/100 · GPT-4o 100/100/100 · LLaMA-3 30/5/50 · Qwen-3 10/25/25 · Cohere 50/35/85
fiʿāl (فعال):       ALLaM 60/5/70 · Fanar 50/65/45 · GPT-4 100/95/100 · GPT-4o 100/95/100 · LLaMA-3 30/10/75 · Qwen-3 20/15/25 · Cohere 60/15/85
fuʿalāʾ (فعلاء):    ALLaM 70/…/45 · Fanar 80/…/15 · GPT-4 100/…/70 · GPT-4o 100/…/70 · LLaMA-3 40/…/60 · Qwen-3 90/…/25 · Cohere 30/…/55

Table 6: Generation accuracy obtained with English one-shot prompts across 13 Arabic derivational patterns and models. Scores are reported as Real/Nonce/Affix (Root-Pattern on real words, Root-Pattern on nonce words, and Affix Build, respectively). "…" indicates that this pattern is not present for the nonce words in our dataset.
