Enhancing Tool Calling in LLMs with the International Tool Calling Dataset

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Tool calling allows large language models (LLMs) to interact with external systems like APIs, enabling applications in customer support, data analysis, and dynamic content generation. While recent benchmarks have advanced tool-use research, they suffer from key limitations, including reliance on simulated or restricted APIs, limited reproducibility, and a lack of cultural and geographic diversity. To address these gaps, we introduce International Tool Calling (ITC), a large-scale, multilingual benchmark designed for realistic, globally distributed tool-calling scenarios. ITC includes 3,571 real APIs and 17,540 tool calling tasks across 20 categories and 40 countries. Experiments reveal substantial performance gaps between open- and closed-source LLMs, while fine-tuning on ITC yields significant improvements, particularly for non-English queries, enhancing cross-lingual generalization, reasoning consistency, and robustness to out-of-domain tools. ITC provides a valuable benchmark for advancing LLM robustness and performance in complex, multi-tool, and international scenarios. Dataset: https://anonymous.4open.science/r/International-Tool-Calling-ITC-dataset-FAF4/.


💡 Research Summary

The paper introduces the International Tool Calling (ITC) dataset, a large‑scale, multilingual benchmark designed to evaluate and improve large language models’ (LLMs) ability to invoke real‑world APIs. Existing tool‑calling benchmarks suffer from three major drawbacks: (1) reliance on simulated or synthetic APIs that do not capture the variability of production services, (2) limited reproducibility because many datasets require paid API keys, strict quotas, or are otherwise inaccessible, and (3) a lack of cultural and geographic diversity, which hampers the generalization of models to non‑Western or region‑specific services.

To address these gaps, the authors collected 49,937 public REST APIs from five major sources (RapidAPI, Juhe Data, Public‑apis, Xiarou API, Free‑api). After extensive automated monitoring (weekly health checks) and manual verification (live calls, parameter validation), only 3,571 stable APIs were retained—approximately 7 % of the original pool. These APIs span 20 functional categories (Finance, Data, Communication, Entertainment, etc.) and originate from 40 countries, covering 29 languages.
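The stability filtering described above can be pictured with a short sketch. The record shape and the retention threshold below are assumptions for illustration; the authors' actual health-check scripts are released with the dataset:

```python
from collections import defaultdict

def retain_stable_apis(checks, min_success_rate=1.0):
    """Filter APIs by their health-check history.

    checks: iterable of (api_id, ok) tuples, one per weekly probe,
    where ok is True if the live call succeeded and its parameters
    validated. Only APIs whose success rate meets the threshold are
    kept (requiring a perfect record, i.e. 1.0, is an assumption).
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for api_id, ok in checks:
        totals[api_id] += 1
        successes[api_id] += bool(ok)
    return {a for a in totals if successes[a] / totals[a] >= min_success_rate}
```

Applied to the 49,937 collected APIs, a filter of this kind would retain only endpoints that pass every probe, which is consistent with the roughly 7 % survival rate reported.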

Task generation proceeds in four stages; the fourth, answer generation, is described below. First, a seed set of 36 high‑quality examples (covering single‑tool, repeated, parallel, and nested calls) is curated, and GPT‑4o generates three user queries per seed‑API pair, yielding 44,198 candidate queries. Second, each query is scored on relevance, practicality, linguistic applicability, clarity, and specificity by two independent LLM judges (Claude‑3.5‑Sonnet and Gemini‑1.5‑Pro); only queries scoring ≥ 4 from both models are kept, discarding 58.4 % of candidates. Third, a crowd‑sourced verification round with 100 annotators (Fleiss’ κ = 0.68) removes another 4.5 % of low‑quality items, leaving 17,540 final QA pairs.
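The second-stage filter can be sketched as follows. Reading "≥ 4 from both models" as a per-judge mean over the rubric criteria is an assumption; the paper may apply the threshold per criterion instead:

```python
def judge_mean(scores):
    """Mean over one judge's rubric scores (a 1-5 scale is assumed)."""
    return sum(scores.values()) / len(scores)

def keep_query(judge_scores, threshold=4.0):
    """judge_scores: one dict per LLM judge, mapping rubric criteria
    (relevance, practicality, linguistic applicability, clarity,
    specificity) to scores. The candidate query survives only if
    every judge's mean score reaches the threshold."""
    return all(judge_mean(s) >= threshold for s in judge_scores)
```

Requiring agreement from both judges is what makes the filter aggressive: a single low mean from either model discards the candidate, matching the 58.4 % rejection rate in spirit.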

Answer generation employs a tri‑model approach: GPT‑4o, Gemini‑1.5‑Pro, and Claude‑3.5‑Sonnet each produce a candidate answer for every query. The other two models evaluate each candidate on reasoning‑to‑API consistency, solution validity, and linguistic quality. Human experts then perform a final audit, especially for complex nested‑tool scenarios, to eliminate model‑specific hallucinations.
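One way to sketch this peer-review loop is below. Selecting the highest peer-rated candidate is an assumption (the paper only states that the other two models evaluate each candidate before the human audit), and `models` and `judge` are hypothetical stand-ins for the actual LLM calls:

```python
def cross_evaluate(query, models, judge):
    """models: dict mapping model name -> answer_fn(query).
    judge(evaluator_name, query, answer) -> numeric score combining
    reasoning-to-API consistency, solution validity, and linguistic
    quality. Each candidate answer is scored only by the *other*
    models; the best peer-rated candidate is passed on to the
    human audit."""
    candidates = {name: fn(query) for name, fn in models.items()}
    scored = {}
    for name, answer in candidates.items():
        peers = [m for m in models if m != name]
        scored[name] = sum(judge(p, query, answer) for p in peers) / len(peers)
    best = max(scored, key=scored.get)
    return best, candidates[best]
```

Excluding a model from judging its own answer is the key design choice here: it reduces the chance that one model's hallucination pattern both produces and endorses a flawed answer.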

The benchmark is split at the API level: 15,790 training tasks and 1,750 test tasks, ensuring that many test‑time APIs are unseen during training. The authors evaluate 16 open‑source LLMs and 8 closed‑source commercial models on the test set. Closed‑source models achieve, on average, 12.4 percentage points higher accuracy, with pronounced gaps in handling “non‑existent tool” errors, missing parameters, and incorrectly formatted parameters.
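An API-level split can be sketched as follows. The exact procedure and fraction are assumptions; note that this sketch makes the separation strict, whereas the paper states that many (not necessarily all) test-time APIs are unseen:

```python
import random

def split_by_api(tasks, test_fraction=0.1, seed=0):
    """tasks: list of (api_id, task) pairs. All tasks for a given API
    land entirely in train or entirely in test, so the held-out APIs
    never appear during training."""
    apis = sorted({api for api, _ in tasks})
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(apis)
    n_test = max(1, round(len(apis) * test_fraction))
    test_apis = set(apis[:n_test])
    train = [t for t in tasks if t[0] not in test_apis]
    test = [t for t in tasks if t[0] in test_apis]
    return train, test
```

Splitting on API identity rather than on individual tasks is what makes the test set a measure of generalization to unfamiliar tools, not memorization of seen ones.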

Fine‑tuning on the full multilingual ITC dataset yields substantial gains: overall accuracy improves by 9.7 percentage points, while non‑English queries (Japanese, Spanish, Arabic, etc.) see improvements exceeding 15 percentage points. Moreover, models fine‑tuned on ITC demonstrate better out‑of‑domain robustness: on external benchmarks such as APIBench and ToolBench, tool‑selection precision rises by 6.3 percentage points and invocation success by 8.1 points. These results indicate that exposure to a diverse, real‑world API pool enhances cross‑lingual generalization, reasoning consistency, and resilience to unseen tools.

Beyond the dataset itself, the authors release the entire construction pipeline: automated API health‑check scripts, prompt templates for query and answer generation, scoring rubrics, and evaluation metrics, all under an open‑source license to promote reproducibility. Limitations are acknowledged: the current focus is on HTTP/REST endpoints, excluding multimodal tools (image, audio) and more complex authentication schemes; future work will aim to broaden modality coverage and maintain the dataset as APIs evolve.

In sum, ITC provides a realistic, globally representative benchmark that pushes LLMs toward reliable, culturally aware tool‑calling capabilities, and serves as an effective fine‑tuning resource for improving LLM performance in complex, multi‑tool, and multilingual real‑world scenarios.

