ToolForge: A Data Synthesis Pipeline for Multi-Hop Search
without Real-World APIs
Hao Chen1,2, Zhexin Hu2,3, Jiajun Chai2, Haocheng Yang2,4, Hang He2,5, Xiaohan Wang2,
Wei Lin2, Luhang Wang1, Guojun Yin2†, Zhuofeng Zhao1†
1North China University of Technology, 2Meituan, 3Institute of Software, Chinese Academy of Sciences
4National University of Singapore, 5East China Normal University
†Corresponding author.
Abstract
Training LLMs to invoke tools and leverage retrieved information necessitates high-quality, diverse data. However, existing pipelines for synthetic data generation often rely on tens of thousands of real API calls to enhance generalization, incurring prohibitive costs while lacking multi-hop reasoning and self-reflection. To address these limitations, we introduce ToolForge, an automated synthesis framework that achieves strong real-world tool-calling performance by constructing only a small number of virtual tools, eliminating the need for real API calls. ToolForge leverages a (question, golden context, answer) triple to synthesize large-scale tool-learning data specifically designed for multi-hop search scenarios, further enriching the generated data through multi-hop reasoning and self-reflection mechanisms. To ensure data fidelity, we employ a Multi-Layer Validation Framework that integrates both rule-based and model-based assessments. Empirical results show that a model with only 8B parameters, when trained on our synthesized data, outperforms GPT-4o on multiple benchmarks. Our code and dataset are publicly available at https://github.com/Buycar-arb/ToolForge.
1 Introduction
In recent years, large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding [19, 46], particularly in search and tool learning. By integrating external tool APIs [5, 30], tool-augmented LLMs have achieved a qualitative leap in practical applicability, enabling them to tackle complex real-world scenarios [30, 33]. Tool-calling mechanisms allow LLMs to interact with the external environment and dynamically retrieve up-to-date information [51]. This strategy effectively mitigates the inherent limitations of static pretraining data and substantially expands the practical value of LLMs across application domains. Notable examples include workflow automation [53] and travel planning [10].
Training LLMs for tool-calling typically requires large-scale, high-quality synthetic data covering diverse scenarios [20]. However, since training LLMs on a limited set of tools is insufficient for robust generalization [24, 34], existing data synthesis pipelines [20, 39, 42] commonly compensate for this deficiency by designing templates and executing tens of thousands of real-world API calls to obtain results [29] and construct training samples. To enhance the generalization capability of LLMs without this cost, we instead utilize 19 virtual tools as surrogates for real-world API calls.
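To make the virtual-tool idea concrete, the sketch below shows one way a virtual tool could serve responses directly from the golden context of a (question, golden context, answer) triple instead of querying a live endpoint. All names, fields, and the overlap-based retrieval are illustrative assumptions, not ToolForge's actual schema.

```python
from dataclasses import dataclass

@dataclass
class QAExample:
    question: str
    golden_context: list[str]  # supporting passages, one per hop
    answer: str

@dataclass
class VirtualTool:
    name: str
    description: str
    parameters: dict  # JSON-Schema-style parameter spec

    def call(self, example: QAExample, query: str) -> str:
        # Return the golden passage most relevant to the query.
        # A real pipeline might use lexical or embedding similarity;
        # naive token overlap serves as a stand-in here.
        def overlap(passage: str) -> int:
            return len(set(query.lower().split()) & set(passage.lower().split()))
        return max(example.golden_context, key=overlap)

web_search = VirtualTool(
    name="web_search",
    description="Search the web and return the most relevant passage.",
    parameters={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
)

ex = QAExample(
    question="Which country is the director of Parasite from?",
    golden_context=[
        "Parasite (2019) was directed by Bong Joon-ho.",
        "Bong Joon-ho is a South Korean film director.",
    ],
    answer="South Korea",
)
print(web_search.call(ex, "who directed Parasite"))
# -> "Parasite (2019) was directed by Bong Joon-ho."
```

Because the tool's behavior is fully determined by the triple, no live API, rate limit, or network cost enters the synthesis loop.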
Real-world tasks often require multi-hop reasoning [43], i.e., a reasoning process that derives the final answer through multiple intermediate steps and logical chains [12, 47]. Yet most existing work concentrates on text-based multi-hop reasoning [28, 41] and lacks the capability to integrate with external tools. Concurrently, while research indicates that reflection enables models to inspect and revise their own reasoning processes [35], its potential in complex scenarios involving multi-hop reasoning and tool-calling remains underexplored. To bridge this gap, we devise four tool-calling paradigms and three error perturbation classes, from which we derive 29 distinct interaction patterns that encompass complex, multi-turn tool-calling scenarios.
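As a rough illustration of how paradigms and perturbations might compose, the snippet below enumerates paradigm/error combinations. The labels are hypothetical stand-ins, and a naive cross product yields only 16 combinations, so the paper's 29 patterns must come from a richer derivation than what is shown here.

```python
from pprint import pprint

# Hypothetical labels; ToolForge's actual paradigm and perturbation
# taxonomy is not spelled out in this excerpt.
PARADIGMS = ["single_call", "parallel_calls", "sequential_calls", "no_tool_needed"]
PERTURBATIONS = ["wrong_arguments", "wrong_tool_choice", "malformed_output"]

patterns = []
for paradigm in PARADIGMS:
    # One error-free baseline trajectory per paradigm.
    patterns.append({"paradigm": paradigm, "error": None})
    # Perturbed variants, where the model must notice the failure,
    # reflect, and retry with a corrected call.
    for error in PERTURBATIONS:
        patterns.append({"paradigm": paradigm, "error": error})

print(len(patterns))  # 16 -- a naive lower bound, not the paper's 29
pprint(patterns[:4])
```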
Ensuring the fidelity of complex, automatically synthesized data presents a significant challenge [3, 39]. Existing validation approaches are often superficial, primarily verifying the syntactic correctness of tool calls and the consistency of final answers [27, 42]. They largely overlook the semantic and logical integrity of intermediate reasoning steps, allowing subtle errors in multi-step chains to go undetected and thus compromising overall data quality [20]. To address this, we introduce a Multi-Layer Validation Framework that combines rule-based heuristics with model-based assessments and uses Monte Carlo Tree Search (MCTS) [36] for hard negative mining, substantially enhancing validation robustness and coverage.
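A minimal sketch of what such layered validation could look like, assuming hypothetical record fields and a caller-supplied LLM judge; ToolForge's actual rule set, judge prompts, and MCTS-based hard-negative mining are not reproduced here.

```python
import json

def rule_based_layer(record: dict) -> bool:
    """Cheap syntactic checks, run before any model call is paid for."""
    try:
        for step in record["tool_calls"]:
            args = json.loads(step["arguments"])  # arguments must parse as JSON
            if step["tool_name"] not in record["available_tools"]:
                return False  # every call must target a declared tool
            if not isinstance(args, dict):
                return False
    except (KeyError, TypeError, json.JSONDecodeError):
        return False
    return bool(record.get("final_answer"))

def model_based_layer(record: dict, judge) -> bool:
    """Ask an LLM judge whether the intermediate reasoning steps are
    semantically and logically faithful, not just the final answer."""
    verdict = judge(
        f"Question: {record['question']}\n"
        f"Reasoning trace: {record['reasoning']}\n"
        f"Answer: {record['final_answer']}\n"
        "Is every intermediate step consistent with the retrieved "
        "evidence? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def validate(record: dict, judge) -> bool:
    # Layers run cheapest-first: most malformed samples are rejected
    # by the rules before an LLM judgment is ever requested.
    return rule_based_layer(record) and model_based_layer(record, judge)
```

Ordering the layers cheapest-first keeps the expensive model-based checks focused on samples that have already passed the syntactic gate.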
In summary, our key contributions are fourfold:
• We introduce ToolForge, a novel automated synthesis framework that, given only a (question, golden context, answer) triple, can generate large-scale tool-calling data featuring multi-hop reasoning and self-reflection.
• We enhance the generalization capability of LLMs by leveraging virtual tools instead of real APIs, and by incorporating reflection-driven multi-turn interactions to generate diverse reasoning-tool interactions.