Ranking medical jargon in electronic health record notes by adapted distant supervision

Reading time: 6 minutes

📝 Abstract

Objective: Allowing patients to access their own electronic health record (EHR) notes through online patient portals has the potential to improve patient-centered care. However, medical jargon, which abounds in EHR notes, has been shown to be a barrier to patient EHR comprehension. Existing knowledge bases that link medical jargon to lay terms or definitions play an important role in alleviating this problem but have low coverage of the medical jargon found in EHRs. We developed a data-driven approach that mines EHRs to identify and rank medical jargon based on its importance to patients, to support the building of EHR-centric lay language resources. Methods: We developed an innovative adapted distant supervision (ADS) model based on support vector machines to rank medical jargon from EHRs. For distant supervision, we utilized the open-access, collaborative consumer health vocabulary, a large, publicly available resource that links lay terms to medical jargon. We explored both knowledge-based features from the Unified Medical Language System and distributed word representations (word embeddings) learned from large unlabeled corpora. We evaluated the ADS model using physician-identified important medical terms. Results: Our ADS model significantly surpassed two state-of-the-art automatic term recognition methods, TF*IDF and C-Value, yielding 0.810 ROC-AUC versus 0.710 and 0.667, respectively. Our model identified over 10K important medical jargon terms after ranking over 100K candidate terms mined from over 7,500 EHR narratives. Conclusion: Our work is an important step towards enriching lexical resources that link medical jargon to lay terms/definitions to support patient EHR comprehension. The identified medical jargon terms and their rankings are available upon request.
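As a sketch of the pipeline the abstract describes: distant labels come from a CHV-style lexicon rather than manual annotation, and an SVM's decision scores are used to rank all candidate terms. Everything below (the lexicon entries, the two-dimensional features, the candidate terms) is a toy stand-in for illustration; the paper's model uses UMLS-based features and word embeddings, and adapts the distant labels rather than using them as-is.

```python
# Minimal distant-supervision sketch: label candidate terms via a
# CHV-style lexicon, train a linear SVM, rank all candidates by score.
import numpy as np
from sklearn.svm import LinearSVC

chv_jargon = {"tachycardia", "hyperlipidemia", "dyspnea"}  # toy "CHV" lexicon
lay_words = {"blood", "pain"}                              # toy non-jargon terms

# Toy 2-d feature vectors for candidate terms mined from notes.
features = {
    "tachycardia":    [0.9, 0.8],
    "hyperlipidemia": [0.8, 0.9],
    "dyspnea":        [0.7, 0.7],
    "blood":          [0.1, 0.2],
    "pain":           [0.2, 0.1],
    "stenosis":       [0.8, 0.7],   # unlabeled candidate to be ranked
}

# Distant labels come from the lexicon, not manual annotation.
train_terms = sorted(chv_jargon | lay_words)
X_train = np.array([features[t] for t in train_terms])
y_train = np.array([1 if t in chv_jargon else 0 for t in train_terms])

svm = LinearSVC(C=1.0).fit(X_train, y_train)

# Rank every candidate (labeled or not) by its SVM decision score.
all_terms = sorted(features)
scores = dict(zip(all_terms, svm.decision_function(
    np.array([features[t] for t in all_terms]))))
for term in sorted(scores, key=scores.get, reverse=True):
    print(f"{term:15s} {scores[term]:+.3f}")
```

Ranking by decision score, rather than by hard 0/1 predictions, is what lets the model surface unlabeled terms (like "stenosis" here) whose features resemble known jargon.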

📄 Content

Ranking medical jargon in electronic health record notes by adapted distant supervision

Jinying Chen¹, Abhyuday N. Jagannatha², Samah J. Jarad³, Hong Yu⁴,¹

¹ Department of Quantitative Health Sciences, University of Massachusetts Medical School, Worcester, MA, USA
² School of Computer Science, University of Massachusetts, Amherst, MA, USA
³ Yale Center for Medical Informatics, Yale University, New Haven, CT, USA
⁴ Bedford Veterans Affairs Medical Center, Center for Healthcare Organization and Implementation Research, Bedford, MA, USA
{jinying.chen, hong.yu}@umassmed.edu, abhyuday@cs.umass.edu, samah.fodeh@yale.edu

Keywords: natural language processing, electronic health record, information extraction, distant supervision, ranking

INTRODUCTION

Patient portals, including My HealtheVet [1], have been embraced by many healthcare organizations for patient-clinician communication.
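Of the two baseline term-recognition methods named in the abstract, TF*IDF ranks terms by frequency weighted against document rarity, while C-Value additionally discounts terms that mostly occur nested inside longer candidate terms. A toy C-Value sketch follows; it uses the common log2(length + 1) smoothing so single-word terms get nonzero weight, and is an illustration, not the paper's implementation:

```python
# Toy C-Value: weight a term's frequency by its length, discounting
# occurrences explained by longer candidate terms that contain it.
import math
from collections import Counter

# Candidate-term frequencies from a toy corpus.
freq = Counter({
    "artery": 10,
    "coronary artery": 6,
    "coronary artery disease": 4,
})
# For each term, the longer candidate terms that contain it.
nests = {
    "artery": ["coronary artery", "coronary artery disease"],
    "coronary artery": ["coronary artery disease"],
    "coronary artery disease": [],
}

def c_value(term):
    length = len(term.split())
    weight = math.log2(length + 1)  # +1 so single words get nonzero weight
    containers = nests[term]
    if not containers:
        return weight * freq[term]
    penalty = sum(freq[t] for t in containers) / len(containers)
    return weight * (freq[term] - penalty)

for t in freq:
    print(t, round(c_value(t), 3))
```

Here "coronary artery disease" outranks the more frequent "artery", because most occurrences of "artery" are nested inside the longer terms; frequency-only rankers like TF*IDF lack this correction.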
Allowing patients to access their EHR notes helps improve their disease understanding, self-management, and outcomes [1,2]. However, studies have shown that patients often have difficulty comprehending medical jargon [3–7] (here "medical jargon" is defined as "technical terminology or special words that are used by medical professionals and are difficult for others to understand"), limiting their ability to understand their clinical status [5,6]. Figure 1 shows a sample text found in a typical clinical note. The medical terms that may hinder patients' comprehension are italicized. In addition, those medical jargon terms judged by physicians to be important for patient understanding are also underlined.

Figure 1. Illustration of medical jargon in a clinical note

To reduce the communication gap between patients and clinicians, there have been decades of research efforts in creating medical resources for lay people [8]. Natural language processing methods have also been developed to automatically substitute medical jargon with lay terms [9–11] or to link jargon to consumer-oriented definitions [12]. These approaches require high-quality lexical resources of medical jargon and lay terms/definitions. The open-access collaborative consumer health vocabulary (CHV) is one such resource [13]. It has been incorporated into the Unified Medical Language System and has also been used in EHR simplification [9,10].
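The jargon-substitution approach cited above can be sketched with a small jargon-to-lay dictionary; the entries below are illustrative stand-ins for a CHV-style resource, not actual CHV mappings:

```python
# Dictionary-based jargon substitution: replace each known jargon term
# with its lay equivalent, matching longest terms first on word boundaries.
import re

lay_terms = {  # toy jargon -> lay mapping
    "hypertension": "high blood pressure",
    "dyspnea": "shortness of breath",
    "edema": "swelling",
}

def simplify(text, lexicon):
    """Replace each jargon term with its lay equivalent, case-insensitively."""
    alternation = "|".join(
        sorted(map(re.escape, lexicon), key=len, reverse=True))
    pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: lexicon[m.group(0).lower()], text)

print(simplify("Patient reports dyspnea and pedal edema.", lay_terms))
# → Patient reports shortness of breath and pedal swelling.
```

Matching longest terms first prevents a short entry from clobbering a multi-word one; real substitution systems also need sense disambiguation, which simple dictionary lookup cannot provide.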

Research in CHV has been motivated by the vocabulary discrepancies between lay people and health care professionals [14–17]. CHV incorporates terms extracted from various consumer health sources, such as queries submitted to MedlinePlus, a consumer-oriented online knowledge resource maintained by the National Library of Medicine, and postings in health-focused online discussion forums [18,19]. CHV contains 152,338 terms, most of which are consumer health terms [18–20]. Zeng et al. [18] mapped these consumer health terms to the Unified Medical Language System with a semi-automatic approach. As a result of this work, CHV encompasses lay terms as well as their corresponding medical jargon.

From our current work, we found that CHV alone is not sufficient as a lexical resource for comprehending EHR notes.

This content is AI-processed based on ArXiv data.
