From Language to Action in Arabic: Reliable Structured Tool Calling via Data-Centric Fine-Tuning



Omer Nacar¹, Deema Alquffari¹, Saleh Alsharideh¹, Adeem AlOtaibi³, Abdulaziz Alabdulkarim², Leen Alhazmi¹, Nada Alomar¹, Wareef Alzubaidi¹, Nada Alsultan¹, Ahmed Alrabghi¹, Demah Alhoshan¹, Rana Alsayyari⁵, Hamed Alruwaili¹, Albaraa Jaafar⁴, Khaled Alusmani¹, Abdulaziz Alsohimy¹, Munirah Alsubaie³, Shahd Aldukhayil¹, Arwa Alali¹, Yazeed BinShihah¹, Razan Alsulaymi¹, Nourah Alhumaid¹, Razan Abdulsalam¹, Reem Alamoudi⁶, Mohammed Alkhalifa¹

¹Tuwaiq Academy, Riyadh, Saudi Arabia; ²Tahakom; ³Wakeb Company; ⁴Qatmeer Co.; ⁵NCGR; ⁶Vision Bank
o.najar@tuwaiq.edu.sa

March 19, 2026

Abstract

Function-calling language models are essential for agentic AI systems that translate natural language into executable structured actions, yet existing models exhibit severe structural instability when applied to Arabic. We present AISA-AR-FunctionCall, a production-oriented Arabic function-calling framework built on a 270M-parameter FunctionGemma backbone and trained through systematic dataset auditing, schema repair, tool-aware prompt restructuring, and full-parameter supervised fine-tuning. On a held-out test set, fine-tuning reduces parse failures from 87% to below 1%, improves function name accuracy by more than eightfold, and substantially enhances argument alignment across dialects and domains. Error analysis reveals a transition from structural collapse to semantic misalignment, suggesting that serialization stability and decision-level reasoning are separable challenges. We further explore a reasoning-augmented LoRA variant that introduces explicit intermediate reasoning prior to tool invocation. All datasets and models are publicly released under the AISA framework.
1 Introduction

Large language models (LLMs) are increasingly deployed not merely as text generators, but as decision-making components in agentic systems that translate natural language intent into executable actions. This capability, commonly referred to as function calling or tool use, sits at the boundary between language understanding and software execution. Instead of responding with free-form text, the model emits a structured representation of an API call, which an external runtime validates and executes before returning results to the model for final user-facing synthesis. Such patterns underpin modern assistants, enterprise workflow agents, and local-first automation systems [1, 2, 3].

Despite rapid methodological progress, tool calling introduces new reliability and safety challenges. Failures often stem from malformed arguments, incorrect tool selection, schema violations, or brittle orchestration logic across multiple system layers. Importantly, these errors are rarely attributable to the base model alone; rather, they emerge from interactions between prompting formats, schema design, runtime validation, and evaluation blind spots [4, 5]. As evaluation frameworks have matured, from ToolQA and ToolLLM to StableToolBench and BFCL, they reveal that structured execution remains substantially more difficult than text generation alone [6, 7, 5, 4].

Recent open releases aim to address format reliability directly. FunctionGemma, built on Gemma 3 270M, introduces a dedicated control-token interface for tool declaration, invocation, and response handling, alongside structured delimiters to reduce ambiguity between natural language and executable artifacts [8, 9, 10].
¹ https://huggingface.co/collections/AISA-Framework/aisa-arabic-functioncall-datasets-and-models

The Gemma family emphasizes lightweight, deployable architectures capable of specialization for on-device and privacy-preserving use cases [11, 12]. However, FunctionGemma is explicitly designed as a base for domain- or language-specific fine-tuning, rather than as a production-ready multilingual agent out of the box.

A critical gap remains in multilingual and Arabic tool-calling performance. While multilingual benchmarks such as MASSIVE-Agents reformulate datasets across 52 languages and demonstrate standardized evaluation pipelines, they reveal substantial cross-lingual disparities in function-call correctness [13]. Performance outside English drops markedly even for strong multilingual models, reinforcing that structured execution does not transfer reliably across languages. In parallel, Arabic NLP research has produced strong localized language models, including AraBERT, ARBERT/MARBERT, Jais, and AceGPT [14, 15, 16, 17], yet tool-calling datasets and agentic evaluation resources in Arabic remain underdeveloped relative to English ecosystems.

This paper addresses that gap by presenting an Arabic-first function-calling dataset and a fully fine-tuned execution model, developed through a community-driven effort to localize and specialize FunctionGemma for Arabic structured action generation. We introduce (i) a large-scale Arabic dataset pairing natural-language requests with structured tool schemas and executable tool-call annotations, (ii) reasoning supervision in a reason-before-call format inspired by chain-of-thought training [18, 1], and (iii) an evaluation protocol combining structure-level correctness metrics with Arabic-specific robustness tests for ambiguity, slot-filling, and refusal behavior.
Beyond dataset and model contributions, we frame the work as a systems-level instantiation grounded in AISA (Agentic AI Systems Architecture) [19]. AISA separates concerns across foundational models, tool interfaces, orchestration infrastructure, evaluation layers, deployment controls, and governance mechanisms. Rather than treating function calling as a prompt-engineering artifact, we implement explicit cross-layer contracts: schema validation and safe dispatch at the tool layer, structured parsing and retry logic at the orchestration layer, versioned datasets and release gating at the deployment layer, and policy enforcement aligned with AI risk management guidance [20, 21, 22]. This architecture-first perspective enables reproducible evaluation, auditability, and production-readiness for Arabic agentic systems.

In summary, our work positions Arabic tool calling at the intersection of three threads: (1) structured execution research in LLMs, (2) Arabic NLP localization and cultural alignment, and (3) governance-aware agent system engineering. By grounding Arabic function calling in both empirical fine-tuning and explicit architectural design, we aim to move from isolated model adaptation toward reliable, deployable Arabic agentic systems.

2 Related Work

The integration of external tools into language model reasoning has evolved from prompt-based experimentation to structured, format-controlled execution. Early paradigms such as ReAct interleaved reasoning traces with actions, demonstrating that explicit reasoning combined with external tool interaction improves task completion and interpretability [1]. Similarly, MRKL systems proposed modular architectures in which language models route queries to symbolic or external components, emphasizing separation between reasoning and execution [2].
Toolformer further showed that language models can self-supervise tool invocation behavior by learning when and how to call APIs during generation [3].

As tool ecosystems expanded, large-scale datasets and benchmarks emerged to evaluate tool-use capabilities. ToolLLM introduced ToolBench, scaling training and evaluation to thousands of real-world APIs [7]. ToolQA focused specifically on question answering tasks requiring external tool usage rather than memorized knowledge [6]. StableToolBench emphasized stability and reproducibility through API simulation and caching mechanisms, highlighting evaluation brittleness in real API-dependent setups [5]. UltraTool further expanded benchmarking to complex, real-world multi-step tool utilization scenarios [23]. The Berkeley Function Calling Leaderboard (BFCL) formalized structured evaluation of tool-calling correctness via abstract syntax tree (AST) comparisons and extended evaluation toward agentic behaviors [4]. However, most of these resources remain English-centric, leaving multilingual and morphologically rich languages underrepresented.

Reliable tool invocation depends on structured, parseable outputs. In practice, failures often arise from invalid JSON formatting, missing arguments, or schema mismatches. To mitigate such issues, recent systems adopt explicit structured-output enforcement. Closed APIs provide schema-constrained JSON generation and function-calling interfaces to ensure machine-readable outputs [24, 25, 26]. FunctionGemma represents an open-model effort toward format-controlled execution [8, 9]. Built on Gemma 3 270M [12], it introduces six dedicated control tokens for tool lifecycle management (declaration, call, response) and a specialized delimiter token to disambiguate structured string fields.
Importantly, the model card explicitly states that it is designed for specialization through fine-tuning and supports single-turn and parallel tool calls natively, while multi-turn and multi-step reasoning require further adaptation [10]. Our work builds directly on this interface, extending it to Arabic and incorporating reasoning supervision to improve semantic disambiguation prior to tool invocation.

Recent research demonstrates that tool-calling performance does not transfer uniformly across languages. MASSIVE-Agents reformats a multilingual dataset into a BFCL-style function-calling benchmark spanning 52 languages and reports substantial cross-lingual disparities in correctness [13]. Even strong multilingual models exhibit significant degradation outside English under standardized evaluation. These findings underscore that structured execution requires language-specific training signals and schema adaptation.

Beyond individual models, production agent systems require orchestration frameworks capable of tool routing, state management, and multi-step execution. AutoGen provides a programmable multi-agent conversation framework where agents collaborate with tools and humans-in-the-loop [27]. AgentBench evaluates LLMs as agents across interactive environments, emphasizing reasoning and decision-making under multi-turn conditions [28]. Frameworks such as LangChain/LangGraph and CrewAI operationalize stateful graphs and multi-agent workflows in open ecosystems, though governance and evaluation discipline are often left to implementers. The Model Context Protocol (MCP) proposes interoperability standards for connecting models to tools and data sources, including explicit safety considerations [29, 30]. These developments highlight the necessity of architectural abstractions that treat telemetry, policy enforcement, and evaluation as first-class concerns rather than post-hoc additions.
As agentic systems transition from research prototypes to real-world deployment, governance and risk management become central. The NIST AI Risk Management Framework (AI RMF 1.0) formalizes lifecycle risk governance for AI systems [20]. ISO standards such as ISO/IEC 42001 and ISO/IEC 23894 define organizational controls for AI management and risk mitigation [21, 22]. Observability frameworks like OpenTelemetry provide standardized tracing mechanisms that can instrument agent execution flows in production environments [31]. AISA (Agentic AI Systems Architecture) proposes a layered reference architecture that elevates evaluation, deployment, and governance as core system components rather than peripheral engineering tasks [19]. Our work operationalizes these architectural principles in the context of Arabic tool calling, demonstrating how localized datasets and models can be integrated within structured governance-aware pipelines.

3 Methodology

We describe the end-to-end pipeline used to construct, repair, and fine-tune an Arabic function-calling model based on unsloth/functiongemma-270m-it². The methodology follows a data-centric and systems-aware approach: (i) structural auditing and schema repair of the dataset, (ii) prompt-length reduction via tool sampling, (iii) format-aligned chat construction compatible with FunctionGemma control tokens, and (iv) full-parameter supervised fine-tuning under completion-only masking.

3.1 Model Foundation

We initialize from unsloth/functiongemma-270m-it, a 270M-parameter variant of FunctionGemma optimized for structured function calling. Unlike parameter-efficient fine-tuning (e.g., LoRA), we adopt full fine-tuning, updating all parameters:

    θ* = argmin_θ E_{(x,y)∼D} [ −log P_θ(y | x) ],   (1)

where x is the formatted chat prompt (developer turn + tool declarations + user query), y is the assistant's structured function call, and D is the curated Arabic function-calling dataset.
Completion-only masking ensures gradients are computed only over the assistant's function-call tokens.

3.2 Dataset Auditing and Structural Repair

The Arabic Function Calling dataset³ serves as the primary data source for this study. It comprises 50,810 samples spanning 36 tools, five major Arabic dialects (MSA, Egyptian, Gulf, Levantine, and Maghrebi), and eight real-world domains. While the dataset provides broad dialectal and functional coverage, preliminary fine-tuning experiments revealed several structural limitations that materially affected training stability and evaluation reliability.

Figure 1 presents the complete transformation pipeline adopted in this work. As illustrated, the process begins with a structural audit phase, where empty queries, enum violations, and duplicated tool definitions are identified.

[Figure 1: End-to-end transformation pipeline for AISA-AR-FunctionCall. The process includes structural auditing (empty queries, enum violations, tool duplication), schema repair (empty normalization and the None-is-valid fix), tool optimization (pruning and enum flattening), stochastic tool sampling (1 correct + 4 distractors), chat serialization in the FunctionGemma format with completion masking, and stratified train/validation/test splitting into 41,104 / 4,568 / 5,079 examples.]

This is followed by schema repair, including normalization of argument values and correction of enum constraints (notably the None-is-valid fix for optional parameters). Next, tool optimization is performed through pruning of unstable or redundant tools and flattening of enum constraints into descriptive fields to reduce schema rigidity.

² https://huggingface.co/unsloth/functiongemma-270m-it
³ https://huggingface.co/datasets/HeshamHaroon/Arabic_Function_Calling
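The structural audit stage can be illustrated with a minimal sketch. The field names (`query`, `tools`) and flag labels here are hypothetical stand-ins for the dataset's actual schema, not the paper's implementation:

```python
def audit_sample(sample):
    """Return audit flags for one raw sample (empty query, duplicated tools)."""
    flags = []
    # Empty or whitespace-only user queries carry no supervision signal.
    if not sample.get("query", "").strip():
        flags.append("empty_query")
    # Duplicated tool definitions fragment the learning signal.
    names = [tool["name"] for tool in sample.get("tools", [])]
    if len(names) != len(set(names)):
        flags.append("duplicated_tools")
    return flags

samples = [
    {"query": "ما هو الطقس في الرياض؟", "tools": [{"name": "get_weather"}]},
    {"query": "   ", "tools": [{"name": "get_weather"}]},
    {"query": "حوّل 100 دولار", "tools": [{"name": "convert"}, {"name": "convert"}]},
]
# Keep only samples that raise no audit flags.
clean = [s for s in samples if not audit_sample(s)]
```

In the actual pipeline, flagged samples are routed to the schema-repair stage rather than simply dropped.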
These steps collectively stabilize the supervision signal before downstream prompt construction and model training. This staged repair process converts the original raw dataset into AISA-AR-FunctionCall, a schema-consistent and production-ready corpus specifically engineered for structured function-calling fine-tuning.

Empirical inspection revealed four dominant failure modes: (i) silent outputs for negative samples, (ii) enum constraint violations, (iii) duplicated or semantically overlapping tool definitions, and (iv) systematic prompt truncation caused by excessive tool declarations. These deficiencies directly impaired supervision quality and necessitated structured repair prior to model optimization.

Enum Compliance Correction. A critical source of data loss was the validation logic applied to enum-constrained parameters. The original filtering rule considered a sample valid only if the parameter value v belonged to the predefined enum set:

    valid = (v ∈ Enum),   (2)

thereby incorrectly treating None values, used to indicate optional or unspecified arguments, as violations. This resulted in systematic exclusion of otherwise valid samples, particularly for tools with optional enum fields. The validation rule was corrected to explicitly allow null assignments:

    valid = (v = None) ∨ (v ∈ Enum),   (3)

ensuring that absence of a value is interpreted as permissible when the parameter is not required. This modification restored thousands of previously discarded training instances and reactivated six tools that had effectively become "dead" due to complete sample exclusion.

Enum Normalization and Tool Pruning. In addition to correcting enum validation logic, we performed systematic normalization of enum values and structural consolidation of the tool inventory.
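The enum compliance correction of Eq. (3) amounts to a one-line change in the validator. The function name and the `required` flag below are illustrative, not the paper's actual code:

```python
def enum_valid(value, enum_set, required=False):
    """Corrected enum compliance check (Eq. 3): a None value is treated as
    permissible when the parameter is optional, instead of as a violation."""
    if value is None:
        return not required  # absence is valid only for optional parameters
    # Eq. (2)'s original membership test still applies to concrete values.
    return value in enum_set
```

Under the original rule of Eq. (2), `enum_valid(None, ...)` would always have returned False, which is exactly the behavior that silently discarded samples with unspecified optional arguments.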
Several enum-constrained parameters contained heterogeneous representations, including Arabic surface forms, variant English spellings, or free-text values that did not match the canonical schema. To enforce consistent supervision signals, these variants were mapped to standardized enum values defined in the tool schema, thereby reducing label fragmentation and improving alignment between arguments and schema definitions.

Concurrently, an error-driven audit revealed that a subset of tools contributed disproportionate noise due to severe schema inconsistencies, duplicated functional intent, or unstable argument structures. For example, duplicated currency-conversion tools and overlapping time-retrieval functions fragmented the learning signal across semantically equivalent operations. After consolidation and removal of high-noise tools, the effective tool inventory was reduced from 36 to 27 tools. This pruning step decreased schema variability, reduced parameter-type violations, and stabilized supervision across domains, resulting in a more coherent and learnable action space for fine-tuning.

3.3 Prompt Length Reduction via Tool Sampling

Algorithm 1: Stochastic Tool Sampling for Structured Function Calling

    Require: full tool inventory T, ground-truth tool t* (optional), flag requires_function
    Ensure: sampled tool subset S of size 5
    1: if requires_function = True then
    2:     D ← T \ {t*}
    3:     R ← UniformSample(D, 4)
    4:     S ← {t*} ∪ R
    5: else
    6:     S ← UniformSample(T, 5)    # UniformSample(·, k) samples without replacement
    7: end if
    8: S ← RandomShuffle(S)
    9: return S

A major bottleneck in early training experiments was prompt truncation. When all available tool declarations were included in every training instance (originally 36 tools, later 27 after pruning), the serialized chat prompt frequently
Given a maximum sequence length of 2048 tokens, this resulted in systematic truncation before the assistant’ s function-call response, ef fecti vely pre venting the model from observing the supervision signal. Consequently , gradient updates were dominated by prompt tokens rather than the structured action output. T o address this issue, we introduce a stochastic tool sampling strategy that constrains each training instance to a fixed-size subset of tools. Each example contains exactly five tool declarations. For positive samples (i.e., requires_function=True ), the sampled set consists of the ground-truth tool t ∗ and four randomly selected distractor tools dra wn without replacement from the remaining tool in ventory . For negativ e samples, fi ve tools are sampled uniformly at random, with no designated correct tool. The final subset is randomly permuted to eliminate positional bias. Formally , for a positive training instance i with correct tool t ∗ and full tool in ventory T , the sampled tool set is: T i = π ( { t ∗ } ∪ Sample ( T \ { t ∗ } , 4)) , (4) where π ( · ) denotes a random permutation operator . For negati ve instances: T i = π ( Sample ( T , 5)) . (5) This mechanism reduces the median prompt length from approximately 4,900 tokens to approximately 793 tokens, ensuring all examples fit within the 2048-token context windo w . Be yond prev enting truncation, stochastic sampling introduces implicit data augmentation: across epochs, each instance is paired with dif ferent distractor combinations, encouraging the model to discriminate the correct tool under v arying contextual alternatives. Empirically , this substantially improv es stability and con ver gence in lightweight (270M parameter) structured function-calling models. After tool subset construction, each instance is serialized into the FunctionGemma-compatible chat format described below . 
3.4 Chat Template Construction

Each training instance is serialized using the native FunctionGemma control-token format to preserve structural alignment between tool declarations and assistant outputs. Concretely, every example follows a four-part structure: (i) a developer turn containing the system instruction and a dynamically injected timestamp (to support temporal reasoning for expressions such as "tomorrow" or "next Monday"), (ii) a sampled set of tool declarations encoded with control tokens, (iii) the user query in Arabic, and (iv) the assistant's structured function call.

Completion-only masking is enabled by specifying dataset_text_field="text" during training, which automatically masks all prompt tokens (developer, tool declarations, and user turns) and computes loss exclusively over the assistant's function-call output. This ensures that gradients optimize structured action generation rather than prompt reproduction.

3.5 Dataset Splitting and Training Configuration

All experiments are conducted on AISA-AR-FunctionCall⁴, a production-ready Arabic function-calling dataset released under the AISA framework. The dataset is fully schema-validated, tool-normalized, and formatted for direct fine-tuning of structured function-calling models such as FunctionGemma.

⁴ https://huggingface.co/datasets/AISA-Framework/AISA-AR-FunctionCall

[Figure 2: Example of a serialized training instance using the FunctionGemma control-token format.]

The data split follows the original metadata annotations to avoid distributional drift between training and evaluation. After formatting, tool sampling, and filtering, the corpus was partitioned into 41,104 training examples, 4,568 validation examples, and a held-out test set of 5,079 examples. The split preserves domain and dialect distributions across partitions, and stratification is applied to prevent tool or domain leakage.
The test set remains strictly unseen during training and hyperparameter tuning.

Training is performed via full-parameter fine-tuning of all 268M model weights. We train for two epochs using a per-device batch size of 4 and gradient accumulation of 8, resulting in an effective batch size of 32. Optimization is carried out using 8-bit AdamW with a cosine learning rate scheduler and an initial learning rate of 2 × 10⁻⁵. Gradient checkpointing is enabled to reduce memory overhead during backpropagation, enabling stable full-parameter training within hardware constraints. This configuration provides a balanced trade-off between convergence stability and computational efficiency for lightweight structured execution models.

3.6 Full-Parameter Fine-Tuning Protocol

We adopt full-parameter supervised fine-tuning of the 270M-parameter FunctionGemma model, updating all trainable weights rather than using parameter-efficient adaptation methods (e.g., LoRA). Let θ denote the full parameter set of the model. Given a serialized training instance x and structured function-call target y, optimization follows the standard causal language modeling objective:

    L(θ) = − Σ_{t ∈ Y} log P_θ(y_t | x, y_{<t}),   (6)

where Y indexes the assistant's function-call tokens.

Reasoning-Augmented Variant. As an exploratory extension, we additionally train a variant in which the assistant first emits a short reasoning trace wrapped in <think> and </think> tags, followed by the structured tool call. The modified generation pattern becomes:

    x = [x_prompt, r, …],   (11)

where r denotes a short reasoning trace explaining tool selection and argument extraction. During inference, the model is primed to begin its turn with <think>\n, ensuring that reasoning cannot be skipped.

Unlike the primary production model, which employs full-parameter fine-tuning, the reasoning-augmented variant is trained using LoRA adaptation with increased capacity. Specifically, we use a LoRA rank of r = 64 (increased from 16), α = 64, and dropout of 0.05, resulting in approximately 5.36% trainable parameters. This configuration enables targeted behavioral adaptation while preserving the underlying base model weights.
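Completion-only masking, used for both the production model and the reasoning variant, amounts to ignoring prompt positions in the loss. A framework-agnostic sketch, assuming the conventional −100 ignore index used by PyTorch cross-entropy and Hugging Face collators:

```python
IGNORE_INDEX = -100  # conventional ignore index for cross-entropy loss

def build_masked_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion token ids; mask every prompt position
    so the loss is computed only over the assistant's function-call (and, for
    the reasoning variant, <think>) tokens."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Toy token ids: three prompt tokens, two completion tokens.
input_ids, labels = build_masked_labels([101, 102, 103], [201, 202])
```

With labels built this way, gradient updates flow only through the structured action output, matching the completion-only objective above.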
Completion-only masking is retained, such that gradients propagate exclusively through the reasoning (<think>) segment and the subsequent structured function-call tokens:

    L(θ) = − Σ_{t ∈ {think, tool_call}} log P_θ(x_t | x_{<t}),   (12)

so that the model learns to produce a complete reasoning segment prior to emitting the function call. This reasoning block is supervised during training and enforced during inference, ensuring that tool selection is preceded by an explicit decision trace.

Under strict evaluation, where both reasoning presence and correct tool invocation are required, the reasoning model demonstrates near-perfect alignment. The model consistently emits a structured reasoning block prior to the tool call and achieves flawless argument extraction on a stratified evaluation subset of 240 samples.

Table 3: Reasoning Model Results (Strict Evaluation, n = 240)

    Metric                    Score
    Tool Call Rate            0.992
    Think-Before-Call Rate    1.000
    Function Name Accuracy    0.992
    Argument F1               1.000
    Decision Accuracy         0.992
    Hallucination Rate        0.000

It is important to note that strict formatting validators classify many reasoning outputs as parse failures because the serialized output now includes <think> tokens before the function-call marker. This does not reflect structural instability, but rather a difference in output serialization. Under deployment-aware evaluation, where reasoning segments are permitted, the model maintains near-perfect tool invocation correctness.

This reasoning-augmented model is presented as an exploratory extension to analyze structured reasoning behavior. The primary production-ready system remains the fully fine-tuned AISA-AR-FunctionCall-FT model.

5 Discussion

This work demonstrates that reliable Arabic function calling is not primarily a model-size limitation, but a data and supervision alignment problem. The baseline results reveal systemic structural collapse, with the majority of outputs failing to produce valid function-call formats.
This confirms that multilingual pretraining alone does not guarantee executable structured behavior in morphologically rich and dialectally diverse languages such as Arabic.

The fully fine-tuned AISA-AR-FunctionCall-FT model shows that structured dataset repair, schema normalization, and tool-aware sampling are sufficient to restore stable function-calling behavior within a lightweight 270M-parameter model. Parse failures are nearly eliminated, format validity approaches 100%, and function name accuracy increases by more than eightfold. These improvements indicate that structural serialization learning can be reliably achieved when prompt construction and supervision are carefully engineered.

However, the failure-mode analysis highlights a second-stage challenge: once structural collapse is resolved, remaining errors shift toward semantic misalignment. Tool hallucination, incorrect function selection, and argument mismatches become the dominant error types. This suggests that structured learning and decision-level reasoning are separable phenomena. While serialization stability can be enforced through format-aware training, accurate tool selection requires deeper semantic grounding and possibly contrastive or ranking-based supervision.

Dialect-level results further indicate that structured supervision reduces multilingual execution bias. After fine-tuning, performance disparities between dialects narrow substantially, suggesting that schema-aligned training promotes robustness across linguistic variation rather than amplifying language imbalance.

The reasoning-augmented variant provides additional insight. When explicit reasoning traces are supervised, the model achieves near-perfect structured alignment within the evaluated subset. This suggests that intermediate reasoning can improve tool selection consistency and argument extraction fidelity.
Nevertheless, reasoning supervision alters output serialization and introduces deployment considerations regarding formatting validation. Consequently, while reasoning improves decision alignment, it must be carefully integrated into production pipelines.

Overall, the findings emphasize that production-grade multilingual tool calling requires a layered approach: structural reliability first, followed by semantic calibration and decision refinement. The AISA-AR-FunctionCall framework provides an empirical demonstration of this progression.

6 Conclusion

This paper presents AISA-AR-FunctionCall, a production-oriented Arabic function-calling framework built through systematic dataset auditing, schema repair, tool-aware prompt restructuring, and full-parameter fine-tuning. We demonstrate that reliable Arabic tool invocation can be achieved within a lightweight 270M-parameter model when structural supervision is carefully engineered. Baseline results reveal severe structural collapse under multilingual pretraining alone, while the fine-tuned model nearly eliminates parse failures and substantially improves function selection and argument alignment across dialects and domains.

Our analysis further shows that once structural reliability is restored, remaining limitations shift toward semantic decision-level errors, such as incorrect tool disambiguation and argument mismatch. This suggests a two-stage progression for multilingual agentic systems: first ensuring format and schema stability, then improving semantic calibration. The reasoning-augmented variant provides additional evidence that explicit intermediate reasoning can enhance tool-selection alignment, although integration into production pipelines requires careful serialization management. Overall, the results highlight that multilingual structured execution is primarily a data and supervision alignment challenge rather than a model-scale limitation.
The AISA-AR-FunctionCall framework offers both a production-ready training corpus and a research testbed for advancing structured tool use in Arabic and other morphologically rich languages. Future work will explore contrastive supervision, tool-ranking refinement, and confidence-based calibration to further close the semantic gap toward deployment-grade reliability.

References

[1] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
[2] Ehud Karpas, Omri Abend, Yonatan Belinkov, et al. MRKL systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. arXiv preprint arXiv:2205.00445, 2022.
[3] Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
[4] Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models. In ICML Posters, 2025.
[5] Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu. StableToolBench: Towards stable large-scale benchmarking on tool learning of large language models. arXiv preprint arXiv:2403.07714, 2024.
[6] Yuchen Zhuang, Ding Yu, et al. ToolQA: A dataset for LLM question answering with external tools. arXiv preprint arXiv:2306.13304, 2023.
[7] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789, 2023.