EPPCMinerBen: A Novel Benchmark for Evaluating Large Language Models on Electronic Patient-Provider Communication via the Patient Portal

Effective communication in health care is critical for treatment outcomes and adherence. With patient-provider exchanges shifting to secure messaging, analyzing electronic patient-provider communication (EPPC) data is both essential and challenging. We introduce EPPCMinerBen, a benchmark for evaluating LLMs in detecting communication patterns and extracting insights from electronic patient-provider messages. EPPCMinerBen includes three sub-tasks: Code Classification, Subcode Classification, and Evidence Extraction. Using 1,933 expert-annotated sentences from 752 secure messages sent via the patient portal at Yale New Haven Hospital, it evaluates LLMs on identifying communicative intent and supportive text. Benchmarks span various LLMs under zero-shot and few-shot settings, with data to be released via the NCI Cancer Data Service. Model performance varied across tasks and settings. Llama-3.1-70B led in evidence extraction (F1: 82.84%) and performed well in classification. Llama-3.3-70B-Instruct outperformed all models in code classification (F1: 67.03%). DeepSeek-R1-Distill-Qwen-32B excelled in subcode classification (F1: 48.25%), while sdoh-llama-3-70B showed consistent performance. Smaller models underperformed, especially in subcode classification (F1 below 30%). Few-shot prompting improved most tasks. Our results show that large, instruction-tuned models generally perform better on EPPCMinerBen tasks, particularly evidence extraction, while smaller models struggle with fine-grained reasoning. EPPCMinerBen provides a benchmark for discourse-level understanding, supporting future work on model generalization and patient-provider communication analysis.

Keywords: Electronic Patient-Provider Communication, Large language models, Data collection, Prompt engineering


💡 Research Summary

The paper introduces EPPCMinerBen, a novel benchmark designed to evaluate large language models (LLMs) on electronic patient‑provider communication (EPPC) extracted from secure patient portal messages. Using 752 de‑identified secure messages from Yale New Haven Hospital (1,933 sentences, 27,849 words), the authors created a richly annotated dataset with a hierarchical coding scheme: nine high‑level communication codes (e.g., Information Giving, Patient Partnership) and multiple sub‑codes (e.g., Salutation, Diagnostics, Drugs). Three tasks are defined: (1) Code Classification – multi‑label sentence‑level classification of the high‑level codes; (2) Subcode Classification – conditional selection of the most appropriate sub‑code for each predicted code; (3) Evidence Extraction – extraction of minimal text spans that justify each code‑subcode pair.
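The three tasks can be illustrated with a small data-model and prompt sketch. The field names, code list, and prompt wording below are hypothetical (the paper's actual annotation schema and prompts are not reproduced in this summary); the example only shows how a sentence-level annotation carrying code, subcode, and evidence span might be represented and fed to a zero-shot classifier:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical representation of one annotated sentence; the names are
# illustrative, not the paper's actual schema.
@dataclass
class AnnotatedSentence:
    text: str
    codes: List[str]      # Task 1: multi-label high-level communication codes
    subcodes: List[str]   # Task 2: one subcode per predicted code
    evidence: List[str]   # Task 3: minimal text spans justifying each pair

# Two of the nine high-level codes named in the paper (list truncated here).
CODES = ["Information Giving", "Patient Partnership"]

def build_code_prompt(sentence: str) -> str:
    """Assemble a simple zero-shot prompt for Task 1 (Code Classification)."""
    labels = "; ".join(CODES)
    return (
        "Classify the communicative intent of the sentence below.\n"
        f"Possible codes: {labels}.\n"
        f"Sentence: {sentence}\n"
        "Answer with all applicable codes."
    )

example = AnnotatedSentence(
    text="Your lab results came back normal.",
    codes=["Information Giving"],
    subcodes=["Diagnostics"],
    evidence=["lab results came back normal"],
)
prompt = build_code_prompt(example.text)
```

Subcode Classification would then condition on each predicted code (e.g., selecting "Diagnostics" under "Information Giving"), and Evidence Extraction would return the minimal supporting span.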

The benchmark fills a gap in existing biomedical NLP resources, which focus on structured clinical notes or simulated dialogues and lack support for bidirectional, informal, and socio‑emotional aspects of real‑world EPPC. Annotation was performed by experts using a RIAS‑inspired coding book, resulting in a distribution heavily weighted toward informational exchanges and patient partnership, with relatively few emotional support cues.

The authors evaluated twelve contemporary LLMs—including Llama‑3.1‑70B, Llama‑3.3‑70B‑Instruct, DeepSeek‑R1‑Distill‑Qwen‑32B, sdoh‑llama‑3‑70B, and several smaller models—under zero‑shot and few‑shot prompting. Performance was measured with F1 scores. Llama‑3.3‑70B‑Instruct achieved the highest code‑classification score (F1 = 67.03%). DeepSeek‑R1‑Distill‑Qwen‑32B led subcode classification (F1 = 48.25%). Llama‑3.1‑70B excelled in evidence extraction (F1 = 82.84%). Smaller models (<7B parameters) struggled, especially on subcode classification, where F1 fell below 30%. Few‑shot prompting (2–3 exemplars) consistently improved results by 5–10 percentage points, highlighting the importance of prompt engineering for hierarchical tasks.
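The summary reports F1 scores without specifying the averaging scheme; for a multi-label task like Code Classification, a micro-averaged F1 is a common choice and can be sketched as follows (a minimal illustration, not necessarily the paper's exact evaluation code):

```python
from typing import List, Set

def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-sentence multi-label predictions."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels correctly predicted
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious labels
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy gold/predicted label sets for two sentences:
gold = [{"Information Giving"}, {"Patient Partnership", "Information Giving"}]
pred = [{"Information Giving"}, {"Patient Partnership"}]
# tp=2, fp=0, fn=1 -> precision=1.0, recall=2/3, F1=0.8
score = micro_f1(gold, pred)
```

The same function applies to subcode predictions once each code is paired with its selected subcode (e.g., treating `"Information Giving/Diagnostics"` as a single label).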

Key insights include: (1) large, instruction‑tuned models are better at hierarchical reasoning and grounding in EPPC contexts; (2) the scarcity of emotional sub‑codes in the data limits model learning of affective communication; (3) the benchmark’s multi‑task design enables assessment of discourse understanding, fine‑grained labeling, and evidence grounding—capabilities not captured by existing benchmarks.

Limitations are acknowledged: the dataset originates from a single academic health system and focuses on oncology patients, which may affect generalizability. The annotation process, while rigorous, still reflects a limited set of socio‑emotional behaviors.

Future work suggested includes expanding to multi‑institutional, multilingual corpora, enriching SDoH and emotional annotations, and exploring graph‑based models to capture code‑subcode relationships. EPPCMinerBen is positioned as a valuable resource for developing and evaluating NLP tools that can improve patient‑provider communication, support personalized care, and advance clinical dialogue research.
