Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond
Large Language Models (LLMs) like ChatGPT have demonstrated amazing capabilities in comprehending user intents and generating reasonable and useful responses. Besides their ability to chat, their capabilities in various natural language processing (NLP) tasks are of interest to the research community. In this paper, we focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis by measuring ChatGPT’s performance, explainability, confidence, faithfulness, and uncertainty. Our experiments reveal that: (a) ChatGPT’s performance scores on MedIE tasks fall behind those of the fine-tuned baseline models. (b) ChatGPT can provide high-quality explanations for its decisions; however, ChatGPT is over-confident in its predictions. (c) ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. (d) The uncertainty in generation causes uncertainty in information extraction results, and thus may hinder its applications in MedIE tasks.
💡 Research Summary
This paper conducts a comprehensive evaluation of ChatGPT’s capabilities on medical information extraction (MedIE) tasks. Four fine‑grained tasks—medical named entity recognition (NER), triple extraction (TE), clinical event extraction (CEE), and ICD‑10 coding—are tested across six publicly available benchmark datasets. Because ChatGPT cannot be fine‑tuned, the authors design task‑specific prompts that include a description, label set with explanations, output format specifications, and a few demonstration examples. Using the official OpenAI gpt‑3.5‑turbo API, they query the model for each test instance and repeat the query five times to assess generation uncertainty.
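The prompt layout described above (task description, label set with explanations, output format, and demonstrations) and the five-fold repeated querying can be sketched roughly as follows. This is only an illustration: the function names, the section wording, and the `ask_model` callable are hypothetical and are not taken from the paper, and no real API client is assumed.

```python
def build_prompt(task_description, label_set, output_format, demonstrations, input_text):
    """Assemble a few-shot prompt from the four components plus the test input.

    label_set maps each label to a short explanation; demonstrations is a
    list of (input_text, expected_output) pairs. The layout is an assumption.
    """
    label_lines = [f"- {label}: {explanation}" for label, explanation in label_set.items()]
    demo_blocks = [f"Input: {text}\nOutput: {answer}" for text, answer in demonstrations]
    sections = [
        task_description,
        "Labels:\n" + "\n".join(label_lines),
        "Output format: " + output_format,
        "Examples:\n" + "\n\n".join(demo_blocks),
        f"Input: {input_text}\nOutput:",
    ]
    return "\n\n".join(sections)


def query_repeatedly(ask_model, prompt, n_runs=5):
    """Call a user-supplied model function n_runs times and collect the raw outputs.

    ask_model is a hypothetical callable (prompt -> str); plug in whatever
    client is actually used to reach the model.
    """
    return [ask_model(prompt) for _ in range(n_runs)]
```

Keeping the model call behind a plain callable means the same harness can be run against a stub during testing and against a real endpoint later.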
Performance is measured with strict instance‑level F1 scores and compared against three baselines: BERT fine‑tuning, the UIE unified IE framework, and the state‑of‑the‑art (SOTA) model reported for each dataset. Across all tasks, ChatGPT lags behind the fine‑tuned baselines, especially on complex NER with nested or discontinuous entities and on the large‑label ICD‑10 coding task.
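As a rough illustration of what a strict instance-level F1 involves, the sketch below counts a prediction as correct only when it matches a gold instance exactly (for NER, the same boundaries and the same label). Micro-averaging over documents and the tuple representation are assumptions made here; the summary does not state how the authors aggregate scores.

```python
def strict_f1(gold, predicted):
    """Micro-averaged F1 over exact-match instances.

    gold and predicted are lists (one entry per document) of sets of
    hashable instances, e.g. (start, end, label) tuples for NER.
    A predicted instance counts only if it is identical to a gold one.
    """
    true_positives = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this criterion a span that is off by one token earns no partial credit, which is one reason nested or discontinuous entities are penalized heavily.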
Explainability is examined by requesting both sample‑level and instance‑level rationales. Human expert evaluation finds the explanations generally logical, coherent, and faithful to the source text, demonstrating that large language models can articulate their reasoning in a medically understandable way.
Faithfulness is assessed in two dimensions: adherence to task instructions (instruction‑following) and alignment of the provided reasoning with the original input (faithful reasoning). ChatGPT performs well on both, indicating it respects the prompt constraints and does not hallucinate unrelated content.
Confidence analysis reveals a pronounced over‑confidence phenomenon: the model often assigns high certainty to its predictions even when its measured F1 against the ground truth is low. This mismatch suggests that raw confidence scores from ChatGPT should not be taken at face value in clinical settings.
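One simple way to make such a mismatch concrete is to compare the average stated confidence with the fraction of predictions that are actually correct. The sketch below is a generic calibration check, not the paper's procedure; the function name and the assumption that confidences are reported in the range 0 to 1 are illustrative.

```python
def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus empirical accuracy.

    confidences: list of floats in [0, 1], one per prediction (assumed scale).
    correct: list of booleans, True where the prediction matches the gold label.
    A positive result means the model is, on average, over-confident.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("confidences and correct must be non-empty and equal in length")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(1 for is_correct in correct if is_correct) / len(correct)
    return mean_confidence - accuracy
```

A gap near zero indicates confidence that tracks accuracy; a large positive gap corresponds to the over-confidence pattern described above.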
Uncertainty is quantified by measuring variability across the five repeated queries. The study observes noticeable output fluctuations, particularly in entity boundaries, relation types, and selected ICD codes, highlighting the stochastic nature of top‑p sampling and its impact on reproducibility.
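Variability across repeated queries can be summarized in many ways; one minimal option, assumed here purely for illustration, is the mean pairwise Jaccard overlap between the sets of instances extracted in each run. The metric choice and function name are not taken from the paper.

```python
from itertools import combinations


def mean_pairwise_jaccard(runs):
    """Average Jaccard overlap between every pair of runs.

    runs: list of sets of extracted instances, one set per repeated query.
    Returns 1.0 when all runs agree exactly; lower values mean less stable output.
    """
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure agreement")
    scores = []
    for first, second in pairs:
        union = first | second
        # Two empty outputs are treated as perfect agreement.
        scores.append(len(first & second) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```

With five runs per test instance this averages over ten pairs, so a single divergent run lowers the score without dominating it.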
The authors conclude that while ChatGPT can generate high‑quality explanations and generally follows instructions, its extraction accuracy, over‑confidence, and generation uncertainty limit its readiness for direct deployment in medical IE applications. Future work should explore domain‑specific pre‑training, prompt engineering, and mechanisms for calibrating confidence and reducing stochastic variance to harness LLMs more safely in healthcare NLP.