MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

February 23, 2026

Reading time: 6 minutes

📝 Original Info

Title: MedAI: Evaluating TxAgent’s Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition
ArXiv ID: 2512.11682
Date: 2025-12-12
Authors: Tim Cofala, Christian Kalfar, Jingge Xiao, Johanna Schrader, Michelle Tang, Wolfgang Nejdl

📝 Abstract

Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance interacts with complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at https://curebench.ai/.

💡 Deep Analysis

Deep Dive into MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition.
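
The abstract's central claim is that the quality of tool retrieval, meaning which functions are surfaced to the agent before it issues a call, influences overall performance. The snippet below is a minimal sketch of that retrieval step only, not TxAgent's actual retriever: it ranks a small catalog of tool descriptions against a question using bag-of-words cosine similarity and returns the top-k candidates. The tool names, descriptions, and the `retrieve_tools` helper are hypothetical stand-ins for illustration and are not taken from ToolUniverse.

```python
import math
import re
from collections import Counter

# Hypothetical tool catalog. Names and descriptions are illustrative
# assumptions, not real ToolUniverse entries.
TOOLS = {
    "get_drug_label_warnings": "return boxed warnings and contraindications from a drug label",
    "get_drug_interactions": "list known drug drug interactions for a given drug name",
    "get_disease_targets": "find gene targets associated with a disease",
    "get_phenotypes_for_disease": "list phenotypes linked to a disease in an ontology",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_tools(query, tools, k=2):
    """Return up to k tool names whose descriptions best match the query.

    Tools with zero similarity are dropped, so an unrelated query yields
    an empty list instead of arbitrary candidates.
    """
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(desc))), name) for name, desc in tools.items()
    ]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


if __name__ == "__main__":
    question = "Which drug interactions are known for warfarin?"
    print(retrieve_tools(question, TOOLS, k=2))
    # prints ['get_drug_interactions', 'get_drug_label_warnings']
```

In an agentic loop, only the retrieved candidates would be offered to the model for its next function call, so a miss at this stage removes the correct tool from consideration regardless of how well the model reasons afterwards. A production retriever would more plausibly use learned embeddings; the lexical scorer here is chosen only to keep the sketch dependency-free.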

📄 Full Content

MedAI: Evaluating TxAgent’s Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Tim Cofala, L3S Research Center, tim.cofala@l3s.de
Christian Kalfar, L3S Research Center, christian.kalfar@l3s.de
Jingge Xiao, L3S Research Center, xiao@l3s.de
Johanna Schrader, L3S Research Center, schrader@l3s.de
Michelle Tang, L3S Research Center, tang@l3s.de
Wolfgang Nejdl, L3S Research Center, nejdl@L3S.de

Abstract

Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance interacts with complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science.
Complete information can be found at curebench.ai.

1 Introduction

Therapeutic decision-making in clinical medicine presents a demanding environment for artificial intelligence. Clinicians routinely integrate heterogeneous information on patient characteristics, disease pathophysiology, comorbid conditions and the pharmacological properties of candidate treatments. AI systems intended to support such decisions must therefore demonstrate not only competent prediction capabilities but also the capacity to reason through multi-step therapeutic processes in a manner that is grounded in reliable biomedical knowledge.

Recent advances in agentic AI (Tang et al., 2024; Bran et al., 2024; Yao et al., 2023) and retrieval-augmented generation (RAG) (Wei et al., 2025; Sun et al., 2025; Wang et al., 2025) have introduced new opportunities for building systems that can navigate complex biomedical toolchains. Rather than relying solely on parametric knowledge, agentic approaches iteratively retrieve, evaluate, and integrate external information sources through orchestrated tool use. These methods are promising for therapeutic applications, where accurate access to up-to-date drug and disease information is necessary for safe model operation (Gao et al., 2025a). However, general-purpose agentic frameworks are not by themselves sufficient: medical contexts impose stringent constraints on verifiability, and errors in either the reasoning trace or the sequence of tool calls can propagate to clinically significant mistakes (Gorenshtein et al., 2025; Asgari et al., 2025; Singhal et al., 2022; Thirunavukarasu et al., 2023). As a result, evaluating therapeutic-reasoning systems requires protocols that directly assess reasoning quality, tool utilization, and the correctness of answers and intermediate steps.

The Agentic Tool-Augmented Reasoning track of the CURE-Bench NeurIPS 2025 Challenge establishes a rigorous framework for evaluating these capabilities.
By combining metrics for answer accuracy, tool utilization, and reasoning validity with expert human review, the challenge ensures that agentic systems for therapeutic reasoning are assessed with the necessary precision and care.

arXiv:2512.11682v1 [cs.AI] 12 Dec 2025

Building on the framework of this challenge, our effort focused on enhancing TxAgent (Gao et al., 2025a), an agentic therapeutic-reasoning system built on a fine-tuned Llama-3.1-8B model equipped with the ToolUniverse (Gao et al., 2025b), a unified suite of biomedical resources integrating FDA drug data, OpenTargets associations, and Monarch ontologies. We conducted a detailed analysis of TxAgent’s performance on the CURE-Bench challenge, with a particular focus on retrieval quality for function (tool) calls, as retrieval failures were frequently responsible for downstream reasoning errors. To address these issues, we investigated approaches to improve the system by:

1. Integrating DailyMed to access up-to-date drug label info

…(Full text truncated)…
