A scoping review of using Large Language Models (LLMs) to investigate Electronic Health Records (EHRs)
Electronic Health Records (EHRs) play an important role in the healthcare system. However, their complexity and vast volume pose significant challenges to data interpretation and analysis. Recent advancements in Artificial Intelligence (AI), particularly the development of Large Language Models (LLMs), open up new opportunities for researchers in this domain. Although prior studies have demonstrated their potential in language understanding and processing in the context of EHRs, a comprehensive scoping review is lacking. This study aims to bridge this research gap by conducting a scoping review based on 329 related papers collected from OpenAlex. We first performed a bibliometric analysis to examine paper trends, model applications, and collaboration networks. Next, we manually reviewed and categorized each paper into one of the seven identified topics: named entity recognition, information extraction, text similarity, text summarization, text classification, dialogue system, and diagnosis and prediction. For each topic, we discussed the unique capabilities of LLMs, such as their ability to understand context, capture semantic relations, and generate human-like text. Finally, we highlighted several implications for researchers from the perspectives of data resources, prompt engineering, fine-tuning, performance measures, and ethical concerns. In conclusion, this study provides valuable insights into the potential of LLMs to transform EHR research and discusses their applications and ethical considerations.
💡 Research Summary
This paper presents the first comprehensive scoping review of how large language models (LLMs) are being applied to electronic health records (EHRs). The authors harvested 329 peer‑reviewed articles from the OpenAlex database covering publications from 2018 to early 2024. A bibliometric analysis using Bibliometrix and VOSviewer mapped temporal trends, leading journals, and international collaboration networks, revealing a sharp increase in output after 2021 and a concentration of activity in the United States, China, and the United Kingdom.
Each article was then manually examined and assigned to one of seven thematic categories: (1) named entity recognition, (2) information extraction, (3) text similarity, (4) text summarization, (5) text classification, (6) dialogue systems, and (7) diagnosis and prediction. For every category the review details the specific capabilities that LLMs bring—contextual understanding, semantic relation modeling, and human‑like generation—and cites representative studies that demonstrate performance gains over earlier domain‑specific models such as BioBERT or ClinicalBERT.
In the NER domain, zero‑shot prompting with GPT‑4 or similar models improved detection of clinical entities (e.g., diagnoses, medications, abbreviations) by 5–12% relative to baselines. Information extraction studies showed that LLM‑driven pipelines could map unstructured notes to standardized common data model (CDM) schemas such as OMOP with an 8% increase in relation‑extraction accuracy, while dramatically reducing manual annotation effort. Text similarity work leveraged LLM embeddings to create more semantically meaningful patient similarity matrices, supporting cohort selection and readmission risk stratification. Summarization research demonstrated that LLM‑generated discharge and progress notes achieved ROUGE‑L and BERTScore improvements of 10–15% over extractive baselines, and were judged clinically useful in pilot deployments.
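The patient-similarity idea above reduces to pairwise cosine similarity over per-patient note embeddings. A minimal sketch, assuming the embeddings have already been produced by some LLM encoder (the toy 3-dimensional vectors below are placeholders, not real model output):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(embeddings):
    """Pairwise patient-similarity matrix from per-patient note embeddings."""
    n = len(embeddings)
    return [[cosine(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

# Toy vectors standing in for LLM-derived note embeddings of three patients.
emb = [[1.0, 0.0, 0.0],   # patient A
       [0.9, 0.1, 0.0],   # patient B: clinically close to A
       [0.0, 0.0, 1.0]]   # patient C: unrelated
sim = similarity_matrix(emb)
```

Rows of `sim` can then drive cohort selection or nearest-neighbor risk stratification; in practice the embeddings would be high-dimensional and the matrix computed with a vectorized library rather than nested loops.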
Classification tasks—including ICD‑10 coding, adverse drug event detection, and risk‑group assignment—benefited from few‑shot fine‑tuning of LLMs, attaining high F1 scores even with limited labeled data and capturing inter‑label dependencies more effectively than traditional classifiers. Dialogue system papers described LLM‑powered chatbots that collect symptoms, provide medication guidance, and manage appointments, emphasizing the need for safety filters and regulatory compliance in prompt design. Finally, diagnosis and prediction studies illustrated that multimodal LLMs (text + image) could raise AUC values by 0.03–0.07 for disease onset prediction, and that longitudinal note analysis with LLMs can generate early warning signals for clinical deterioration.
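The multi-label F1 scores reported for ICD-10 coding are typically micro-averaged over the predicted code sets. A minimal sketch of that metric; the gold and predicted code sets below are hypothetical examples, not data from the reviewed studies:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over multi-label code sets (e.g., ICD-10 codes per note).

    gold, pred: parallel lists of sets of label strings.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted codes
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious predictions
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed gold codes
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold vs. predicted ICD-10 code sets for three notes.
gold = [{"E11.9", "I10"}, {"J45.909"}, {"I10"}]
pred = [{"E11.9", "I10"}, {"J45.909", "I10"}, set()]
```

With these toy sets, `micro_f1(gold, pred)` evaluates to 0.75 (tp=3, fp=1, fn=1); micro-averaging weights frequent codes more heavily, which is one reason papers often report macro-F1 alongside it.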
Beyond technical performance, the review synthesizes practical implications across five dimensions. First, high‑quality, publicly available EHR corpora (e.g., MIMIC‑IV, eICU) remain essential for reproducible research. Second, prompt engineering emerges as a decisive factor; the authors advocate for automated prompt‑optimization tools to reduce trial‑and‑error effort. Third, fine‑tuning on domain data yields substantial gains but incurs significant computational cost, prompting calls for parameter‑efficient adaptation methods. Fourth, evaluation metrics should extend beyond accuracy and F1 to include clinical utility measures such as decision‑support impact and net‑benefit analyses. Fifth, ethical and legal considerations—patient privacy, model bias, explainability, and accountability—are highlighted as critical barriers to deployment; the paper stresses the importance of bias mitigation pipelines, transparent reporting, and adherence to emerging AI‑in‑healthcare regulations.
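Prompt engineering for tasks like zero-shot clinical NER usually amounts to a reusable template with slots for the entity types and the note text. An illustrative sketch, assuming a generic instruction-following LLM; the entity types, output format, and note are invented for illustration and are not drawn from any reviewed study:

```python
# Hypothetical zero-shot clinical NER prompt template (illustrative only).
PROMPT_TEMPLATE = """You are a clinical NLP assistant.
Extract all entities of the following types from the note below:
{entity_types}

Return one line per entity in the format: <type>: <exact text span>.
If no entity of a type is present, omit that type.

Note:
{note}
"""

def build_prompt(note, entity_types=("diagnosis", "medication", "abbreviation")):
    """Fill the template for one de-identified note."""
    return PROMPT_TEMPLATE.format(
        entity_types="\n".join(f"- {t}" for t in entity_types),
        note=note,
    )

prompt = build_prompt("Pt with T2DM, started metformin 500 mg BID.")
```

Keeping the template as data rather than inline strings is what makes automated prompt optimization practical: a search procedure can mutate the instruction wording or entity descriptions and re-score outputs without touching the calling code.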
In conclusion, the authors argue that LLMs have the potential to transform EHR research by linking deep contextual comprehension with scalable text generation. They recommend that future work focus on (1) expanding curated, de‑identified datasets; (2) developing systematic prompt‑tuning and parameter‑efficient fine‑tuning frameworks; (3) establishing standardized benchmark suites for EHR‑LLM tasks; and (4) crafting robust ethical guidelines that ensure patient safety and equitable outcomes. By addressing these challenges, the community can harness LLMs to improve clinical documentation, decision support, and ultimately patient care.