AutoMedic: An Automated Evaluation Framework for Clinical Conversational Agents with Medical Dataset Grounding
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Evaluating large language models (LLMs) has recently emerged as a critical issue for the safe and trustworthy application of LLMs in the medical domain. Although a variety of static medical question-answering (QA) benchmarks have been proposed, many aspects remain underexplored, such as the effectiveness of LLMs at generating responses in dynamic, interactive, multi-turn clinical conversations and the identification of multi-faceted evaluation strategies beyond simple accuracy. However, formal evaluation of dynamic, interactive clinical situations is hindered by the vast combinatorial space of possible patient states and interaction trajectories, which makes such scenarios difficult to standardize and measure quantitatively. Here, we introduce AutoMedic, a multi-agent simulation framework that enables automated evaluation of LLMs as clinical conversational agents. AutoMedic transforms off-the-shelf static QA datasets into virtual patient profiles, enabling realistic and clinically grounded multi-turn dialogues between LLM agents. The performance of various clinical conversational agents is then assessed with our CARE metric, which provides a multi-faceted evaluation standard covering clinical conversational accuracy, efficiency/strategy, empathy, and robustness. Our findings, validated by human experts, demonstrate the validity of AutoMedic as an automated evaluation framework for clinical conversational agents, offering practical guidelines for the effective development of LLMs in conversational medical applications.


💡 Research Summary

The paper “AutoMedic: An Automated Evaluation Framework for Clinical Conversational Agents with Medical Dataset Grounding” addresses a critical gap in the assessment of Large Language Models (LLMs) for medical applications. While LLMs have shown impressive performance on static medical question-answering (QA) benchmarks, these evaluations fail to capture the dynamic, interactive, and nuanced nature of real-world clinical conversations. Evaluating an AI’s ability to act as a clinical conversational agent—actively gathering information through dialogue, demonstrating empathy, and making robust decisions—remains a significant challenge due to the vast combinatorial space of possible patient interactions.

To overcome this, the authors introduce AutoMedic, a fully automated, multi-agent simulation framework designed to evaluate LLMs as clinical conversational agents. The core innovation of AutoMedic lies in its ability to repurpose existing, off-the-shelf static medical QA datasets (like those for medical exam preparation) into rich, interactive evaluation scenarios. The framework operates in three main stages.

First, in the Patient Profile Generation stage, a “profile generator” LLM agent filters and transforms a standard medical QA item. It assesses whether the item describes a specific patient case suitable for simulation, filtering out abstract or research-oriented questions. For suitable items, it extracts and structures information into a virtual patient profile containing demographics, basic information (chief complaint, history), and optional information (test results, imaging findings). Missing basic details are plausibly imputed to enhance realism without altering the correct medical answer.
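The profile-generation stage can be sketched in code. The snippet below is a minimal, rule-based stand-in for the paper's LLM-driven profile generator: the `PatientProfile` structure, the heuristic suitability filter, and all field names are illustrative assumptions, not the framework's actual API.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class PatientProfile:
    demographics: dict   # e.g. age, sex
    basic_info: dict     # chief complaint, history
    optional_info: dict  # test results, imaging findings
    answer: str          # gold answer, hidden from the doctor agent

def is_patient_case(question: str) -> bool:
    # Heuristic stand-in for the profile generator's suitability filter:
    # keep items that describe a concrete patient, filter abstract ones.
    return bool(re.search(r"\b\d{1,3}-year-old\b", question))

def build_profile(qa_item: dict) -> Optional[PatientProfile]:
    if not is_patient_case(qa_item["question"]):
        return None  # abstract or research-oriented item: filtered out
    m = re.search(r"(\d{1,3})-year-old (man|woman|boy|girl)",
                  qa_item["question"])
    demographics = {"age": int(m.group(1)), "sex": m.group(2)} if m else {}
    return PatientProfile(
        demographics=demographics,
        basic_info={"vignette": qa_item["question"]},
        optional_info=qa_item.get("labs", {}),
        answer=qa_item["answer"],
    )
```

In the actual framework this filtering and structuring is performed by an LLM agent, which can also impute plausible missing details; a regex cannot do that, so the sketch only illustrates the data flow.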

Second, the Multi-Agent Conversation Simulation stage brings this profile to life. Three specialized LLM agents interact: a doctor agent (the system under evaluation), a patient agent (embodying the virtual patient), and a clinical staff agent (holding test results). The doctor agent begins with only demographic data and must strategically converse with the patient agent (using <patient> tags) to elicit symptoms and history, and request specific tests from the clinical staff agent (using <clinical> tags). This mimics the real-world process of information gathering. The conversation proceeds until the doctor agent decides to conclude or a turn limit is reached, after which the doctor must answer the original medical question based solely on the information gathered.
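The tag-based routing described above can be illustrated with a small simulation loop. This is a hypothetical sketch, assuming each agent is a callable that returns a string; the function names and turn-limit logic are not from the paper, only the `<patient>`/`<clinical>` tag convention is.

```python
import re

def route_turn(doctor_utterance, patient_agent, clinical_agent):
    """Route a doctor-agent turn to the simulated patient or clinical
    staff, based on the <patient>/<clinical> tags the doctor emits."""
    m = re.search(r"<(patient|clinical)>(.*?)</\1>", doctor_utterance, re.S)
    if m is None:
        return None  # no tagged request: the doctor is concluding
    target, message = m.group(1), m.group(2).strip()
    return patient_agent(message) if target == "patient" else clinical_agent(message)

def run_dialogue(doctor_agent, patient_agent, clinical_agent, max_turns=10):
    transcript = []
    for _ in range(max_turns):
        utterance = doctor_agent(transcript)
        reply = route_turn(utterance, patient_agent, clinical_agent)
        transcript.append((utterance, reply))
        if reply is None:  # doctor concluded before the turn limit
            break
    return transcript
```

After the loop ends, the doctor agent would be asked the original QA item using only what appears in `transcript`, mirroring the paper's information-gathering constraint.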

Third, the Automated Evaluation stage assesses the doctor agent’s performance using the novel CARE metric, a multi-faceted scoring system. CARE evaluates:

  • Clinical conversational Accuracy: The correctness of the final diagnosis or recommendation.
  • Efficiency/Strategy: The number of turns taken relative to an optimal information-gathering path.
  • Empathy: The use of empathetic and supportive language during the conversation.
  • Robustness: The ability to maintain correct reasoning when presented with misleading or contradictory information.

The paper presents experiments evaluating several state-of-the-art LLMs (e.g., GPT-4, Claude 3) as the doctor agent within the AutoMedic framework. Results demonstrate that AutoMedic provides a more comprehensive assessment than static QA, revealing differences in models’ conversational proficiency that pure knowledge benchmarks miss. For instance, a model might achieve high accuracy but do so through inefficient questioning or with a lack of empathetic rapport. The CARE metric scores effectively quantify these trade-offs. Furthermore, human expert validation confirmed the alignment between AutoMedic’s automated scores and human judgment, establishing the framework’s validity.

In conclusion, AutoMedic offers a scalable, automated, and ecologically valid method for evaluating LLMs in a role closer to real clinical practice. By transforming abundant static data into interactive tests and providing a multi-dimensional evaluation metric, it moves beyond simple accuracy checks towards assessing the holistic competency required for safe and effective deployment of conversational AI in medicine.
