Reading time: 10 minutes

๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.22795
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Clinical notes are often stored in unstructured or semi-structured formats after extraction from electronic medical record (EMR) systems, which complicates their use for secondary analysis and downstream clinical applications. Reliable identification of section boundaries is a key step toward structuring these notes, as sections such as history of present illness, medications, and discharge instructions each provide distinct clinical contexts. In this work, we evaluate rule-based baselines, domain-specific transformer models, and large language models for clinical note segmentation using a curated dataset of 1,000 notes from MIMIC-IV. Our experiments show that large API-based models achieve the best overall performance, with GPT-5-mini reaching a best average F1 of 72.4 across sentence-level and freetext segmentation. Lightweight baselines remain competitive on structured sentence-level tasks but falter on unstructured freetext. Our results provide guidance for method selection and lay the groundwork for downstream tasks such as information extraction, cohort identification, and automated summarization.

📄 Full Content

EHR data is often processed and presented in plain-text format for secondary use in modeling and data retrieval tasks. Clinical notes contain a wide range of information, such as chief complaints, physician observations, and past medical history, that can supplement structured data like lab values or medications. However, the note text itself is often difficult to process with traditional or out-of-the-box models because of domain-specific issues, including medical abbreviations, semi-structured text, unrelated or redundant documentation, and information that varies by patient or encounter.

The first step in analyzing clinical notes is to identify sections from the full text and segment notes into distinct categories. Our goal in this work is to evaluate the note segmentation effectiveness of different models from open source libraries and prior research using a curated dataset.

Several prior studies have examined the problem of clinical note segmentation. Davis et al. (2025) introduced MedSlice, a pipeline that uses fine-tuned large language models from open source libraries to securely segment clinical notes (Davis et al., 2025). Their work highlights the effectiveness of LLMs in capturing section boundaries in a domain where notes often vary in length and structure.

Earlier work by Ganesan and Subotin (2014) presented a supervised approach to clinical text segmentation. They evaluated the Essie 4 NLM library for its ability to identify and retrieve relevant sections from documents, demonstrating the utility of domain-specific tools for segmentation tasks (Ganesan and Subotin, 2014). Edinger et al. (2018) evaluated multiple modeling approaches including regularized logistic regression, support vector machines, Naive Bayes, and conditional random fields for clinical text segmentation. Their study focused on improving cohort retrieval through accurate identification of note sections, emphasizing the importance of segmentation as a foundation for downstream clinical informatics applications (Edinger et al., 2018).

We evaluate whether different modeling approaches can improve clinical note segmentation compared to naive baselines.

We use four main datasets in this project, derived from the MIMIC-IV corpus (Johnson et al., 2023) and supplemented with additional clinical text resources. MIMIC-IV is a publicly available, de-identified dataset containing a wide range of clinical notes, including discharge summaries, physician notes, and semi-structured chart data. For evaluation across all datasets, we apply an 80/10/10 train/validation/test split using k-fold cross-validation (a split sketch follows the dataset descriptions below). A distribution of the most frequent tags in these datasets is displayed in Appendix B.

MIMIC Hospital. From MIMIC-IV, we construct two custom datasets organized by note type using the labeled “Hospital Course” subset (Aali et al., 2025):

• MIMIC Hospital Sentences: 1,000 unlabeled free-text clinical notes, split into 17,487 labeled clinical note sections segmented at the sentence level.

• MIMIC Hospital Freetext: 1,000 unlabeled free-text clinical notes, generated from MIMIC Hospital Sentences, which preserve the original narrative structure of the documentation.

Freetext. This dataset consists of unlabeled free-text clinical notes drawn directly from MIMIC-IV (Johnson et al., 2022). It provides a large, unstructured resource of clinical language for representation learning.

Augmented Clinical Notes. This dataset consists of additional unlabeled free-text clinical notes curated for this project (Bonnet and Boulenger, 2024). It is used to increase the diversity and size of the training corpus for representation learning.
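
The 80/10/10 split mentioned above can be sketched as follows. This is a minimal illustration only: whether partitioning is done per note or per sentence, and how the k-fold procedure interacts with the held-out portions, are assumptions here rather than details taken from the paper, and the note identifiers are placeholders.

```python
# Minimal sketch of an 80/10/10 note-level split with k-fold cross-validation
# on the training portion. Identifiers are placeholders, not MIMIC data.
from sklearn.model_selection import KFold, train_test_split

notes = [f"note_{i:04d}" for i in range(1000)]  # hypothetical note IDs

train_notes, heldout = train_test_split(notes, test_size=0.2, random_state=42)
val_notes, test_notes = train_test_split(heldout, test_size=0.5, random_state=42)

# Cross-validation folds over the training notes (k assumed to be 5).
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, dev_idx) in enumerate(kfold.split(train_notes)):
    print(f"fold {fold}: {len(train_idx)} train / {len(dev_idx)} dev notes")
```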

As baselines, we include a multinomial logistic regression classifier, a regex-based header matcher, and MedSpaCy (Eyre et al., 2021), a clinical NLP toolkit for section detection and rule-based information extraction. These methods provide interpretable and lightweight points of comparison.
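
To make the rule-based baseline concrete, the sketch below shows one way a regex header matcher could segment a note into labeled sections. The header inventory and matching rules are illustrative assumptions, not the exact patterns or label set used in the paper.

```python
import re

# Hypothetical header inventory; the paper's actual header list is not specified here.
SECTION_HEADERS = [
    "history of present illness",
    "past medical history",
    "medications",
    "allergies",
    "family history",
    "discharge instructions",
]

# Match a line consisting only of a known header, optionally followed by a colon.
HEADER_PATTERN = re.compile(
    r"^\s*(" + "|".join(re.escape(h) for h in SECTION_HEADERS) + r")\s*:?\s*$",
    re.IGNORECASE,
)

def segment_note(note_text: str) -> list[tuple[str, str]]:
    """Split a note into (section_label, section_text) pairs at matched headers."""
    sections, current_label, buffer = [], "OTHER", []
    for line in note_text.splitlines():
        match = HEADER_PATTERN.match(line)
        if match:
            if buffer:
                sections.append((current_label, "\n".join(buffer).strip()))
                buffer = []
            current_label = match.group(1).upper()
        else:
            buffer.append(line)
    if buffer:
        sections.append((current_label, "\n".join(buffer).strip()))
    return sections

# Example usage:
# segment_note("Medications:\naspirin 81mg daily\nAllergies:\nNKDA")
# -> [("MEDICATIONS", "aspirin 81mg daily"), ("ALLERGIES", "NKDA")]
```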

For domain-specific transformer models, we evaluate LLaMA-2-7B (Touvron et al., 2023), MedAlpaca-7B (Han et al., 2023), and Meditron-7B (Chen et al., 2023). These open-source models vary in their degree of biomedical adaptation and represent mid-scale transformer approaches tailored to clinical language tasks. Finally, we benchmark three API-based large language models: GPT-5-mini (OpenAI, 2025), Gemini 2.5 Flash (DeepMind, 2025), and Claude 4.5 Haiku (Anthropic, 2025). These state-of-the-art systems offer broad-domain generalization and strong zero- and few-shot performance.
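
The paper does not reproduce its prompts here, so the following is only a hedged sketch of how an API-based model might be asked to assign a section label to a clinical sentence. The model identifier, label set, and instruction wording are all assumptions for illustration.

```python
# Hypothetical zero-shot prompt for sentence-level section labeling via the
# OpenAI API; the actual prompts and label set used in the paper may differ.
from openai import OpenAI

LABELS = ["HISTORY OF PRESENT ILLNESS", "MEDICATIONS", "ALLERGIES",
          "FAMILY HISTORY", "DISCHARGE INSTRUCTIONS", "OTHER"]

def classify_sentence(client: OpenAI, sentence: str) -> str:
    """Ask the model to return exactly one section label for a clinical sentence."""
    response = client.chat.completions.create(
        model="gpt-5-mini",  # assumed model identifier
        messages=[
            {"role": "system",
             "content": "Label the user's clinical-note sentence with exactly one "
                        "section name from this list and reply with the label only: "
                        + ", ".join(LABELS)},
            {"role": "user", "content": sentence},
        ],
    )
    return response.choices[0].message.content.strip()

# Example usage (requires OPENAI_API_KEY in the environment):
# print(classify_sentence(OpenAI(), "Patient denies any drug allergies."))
```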

Model performance is assessed using token-level Precision, Recall, and F1. For clinical note segmentation, we treat each predicted section boundary token as a classification decision. In this setup, a True Positive (TP) is a predicted boundary token that exactly matches a gold-standard boundary token, a False Positive (FP) is a predicted boundary token that does not correspond to any gold-standard boundary (i.e., a spurious split), and a False Negative (FN) is a gold-standard boundary token that the model fails to predict (i.e., a missed split).

For the sentence classification task, we report weighted F1, which accounts for class imbalance by averaging per-class performance proportional to class frequency. This choice reflects the fact that some section types occur more frequently than others in the labeled dataset. For the freetext segmentation task, we instead report micro-averaged F1, which aggregates decisions across all boundaries. This metric provides a clearer measure of overall segmentation quality when sections are highly variable in length and frequency.
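
As a concrete reading of these definitions, the sketch below computes weighted F1 over per-sentence labels and boundary precision/recall/F1 from the TP/FP/FN counts described above. The toy labels and the alignment of boundary tokens are illustrative assumptions, not the paper's data.

```python
# Sketch of the two evaluation settings, using illustrative toy data.
from sklearn.metrics import f1_score

# Sentence classification: weighted F1 over per-sentence section labels.
gold_labels = ["MEDICATIONS", "ALLERGIES", "MEDICATIONS", "OTHER"]
pred_labels = ["MEDICATIONS", "MEDICATIONS", "MEDICATIONS", "OTHER"]
weighted_f1 = f1_score(gold_labels, pred_labels, average="weighted")

# Freetext segmentation: boundary-token F1 from the TP/FP/FN definitions above
# (1 marks a section boundary at a token position, 0 otherwise).
gold_boundary = [1, 0, 0, 1, 0, 1]
pred_boundary = [1, 0, 1, 1, 0, 0]
tp = sum(g == p == 1 for g, p in zip(gold_boundary, pred_boundary))
fp = sum(p == 1 and g == 0 for g, p in zip(gold_boundary, pred_boundary))
fn = sum(g == 1 and p == 0 for g, p in zip(gold_boundary, pred_boundary))
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
boundary_f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(weighted_f1, precision, recall, boundary_f1)
```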

In Table 1, we analyze three API-based LLMs, three locally hosted HuggingFace models, and two baselines for comparison. A summary of model performance is shown in Figure 2.

Traditional baselines achieve solid but limited performance on sentence-based analysis. The multinomial logistic regression classifier (Baseline 1) achieves strong performance on the sentence classification task, with a 74.3 F1. While this lags behind the best API-based LLMs by 4-6 points, it is still competitive with weaker commercial systems. MedSpaCy (Baseline 2) goes even further, reaching a 78.0 F1 on sentences, just 0.5 points below Gemini 2.5 Flash (78.5) and 1.5 points above Claude 4.5 Haiku (76.5). This demonstrates that rule-based, domain-specific NLP tools remain surprisingly strong in structured settings.

Traditional baselines surpass LLMs on precision in freetext-based analysis. MedSpaCy (Baseline 2) achieves the highest precision on the freetext task, at 94.4, which is over 6 points higher than GPT-5-mini.

Across the board, GPT-5-mini performed best. On both the sentence classification and freetext tasks, it achieved the highest F1 of any system, with 80.8 on sentences and 63.9 on freetext, demonstrating not only high precision but also consistently strong recall and outperforming the other commercial LLMs by several points. This indicates that GPT-5-mini is more robust at identifying relevant spans without overpredicting, balancing sensitivity with specificity in the clinical setting.

Gemini 2.5 Flash and Claude 4.5 Haiku trail closely but unevenly. Gemini performs competitively on both tasks, ranking second on freetext classification (60.8 F1) and maintaining strong performance on sentences (78.5 F1). Claude 4.5 Haiku exhibits high precision (83.3) but suffers from lower recall (72.9) on sentence classification, indicating a conservative prediction style. On freetext, its performance drops significantly (47.8 F1), underscoring difficulty in handling unstructured, noisy inputs.

Small open-source LLMs underperform substantially. Both LLaMA-2-7B and MedAlpaca-7B struggle, with sentence-level F1 scores of 8.9 and 0.6 respectively, reflecting their inability to capture clinical context given their limited scale and training. Although LLaMA-2-7B reaches relatively high precision on freetext (87.3), its recall (37.5) lags, leading to mediocre overall performance (52.4 F1). Abbreviations and domain-specific shorthand often caused these models to miss boundaries, particularly in medication and lab result sections where formatting was dense. Multi-line headers and irregular spacing also confused both rule-based baselines and smaller LLMs, leading to false positives where extraneous breaks were inserted.

Meditron-7B underperforms expectations. Despite being designed as a biomedical-focused LLM, Meditron-7B fails to produce competitive results, with F1-scores of 0.3 on sentences and 0.0 on freetext. This suggests either significant mismatches between training and evaluation domains or limitations in model scale and optimization. Unlike MedAlpaca, which shows partial utility in freetext, Meditron’s outputs approach non-functional levels of performance.

ANOVA Testing. We conducted a one-way ANOVA on model F1 scores across the two tasks. The test yielded F = 2.45 with p = 0.18, indicating that observed variations among models are not statistically significant at the 0.05 level. This suggests that while performance differences across tasks are visible, they should be interpreted with caution.
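
A one-way ANOVA of this kind can be reproduced with scipy, as in the sketch below. The grouping of F1 scores by task mirrors the description above, but the score lists are illustrative placeholders rather than the paper's full result table.

```python
# Sketch of a one-way ANOVA comparing model F1 scores grouped by task.
# The score lists are illustrative placeholders, not the paper's exact data.
from scipy.stats import f_oneway

sentence_f1 = [80.8, 78.5, 76.5, 74.3, 78.0]   # per-model sentence-classification F1
freetext_f1 = [63.9, 60.8, 47.8, 52.4, 55.0]   # per-model freetext-segmentation F1

f_stat, p_value = f_oneway(sentence_f1, freetext_f1)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```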

To better understand how humans compare to automated systems on difficult sentence classification tasks, we conducted a small human evaluation using an unlabeled subset of clinical notes. Since it is challenging to obtain large expert-annotated corpora, our goal was not to create a new gold-standard dataset but to examine how general human annotators behave when presented with ambiguous or weakly structured inputs. This allows us to compare annotator behavior to the behavior of both baseline models and large language models and to evaluate where automated methods diverge from human judgment.

We sampled sentences from an unlabeled MIMIC-derived corpus by splitting raw notes into individual units. These sentences were then provided to nonexpert annotators who selected a section label from the same set of categories used in our supervised experiments. Because many extracted sentences contain partial headers, demographic fragments, or minimal context, this setup reflects a more challenging classification scenario than the segmented hospital course dataset used for model evaluation.

We report Cohen’s Kappa and Percent Agreement between each system and the majority human label (Table 2). Agreement between human annotators and API-based LLMs is moderate, with Claude and Gemini reaching Kappa values of 0.46 and agreement above 52 percent. GPT-5-mini is slightly lower, with 49.5 percent agreement. Baseline models show substantially weaker alignment with humans: logistic regression reaches a Kappa of 0.19, while the embedding-based and blank MedSpaCy variants reach only 0.14 and 0.13, respectively.
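
For reference, Cohen’s Kappa and percent agreement against the majority human label can be computed as in the minimal sketch below; the aligned label lists are illustrative, not the study's annotations.

```python
# Sketch of agreement metrics between the majority human label and one system.
# The label sequences are illustrative toy data.
from sklearn.metrics import cohen_kappa_score

human_majority = ["OTHER", "ALLERGIES", "MEDICATIONS", "OTHER", "FAMILY HISTORY", "OTHER"]
system_labels  = ["OTHER", "ALLERGIES", "ALLERGIES", "MEDICATIONS", "FAMILY HISTORY", "OTHER"]

kappa = cohen_kappa_score(human_majority, system_labels)
percent_agreement = sum(h == s for h, s in zip(human_majority, system_labels)) / len(human_majority)
print(f"Cohen's kappa = {kappa:.2f}, agreement = {percent_agreement:.1%}")
```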

To further understand disagreement patterns, we analyzed the labels assigned to the most ambiguous subset of sentences, which primarily consisted of header-like fragments or demographic rows. As shown in Table 3, humans label 66 percent of these sentences as OTHER, recognizing that many of these items have no single correct section label. Automated systems behave very differently. The logistic regression model strongly defaults to ALLERGIES, embedding-based MedSpaCy predicts FAMILY HISTORY for nearly all items, and blank MedSpaCy heavily favors ALLERGIES. API-based LLMs also exhibit biases but with less extreme concentration in a single class.

These results indicate that human annotators demonstrate flexibility when dealing with ambiguous clinical text and do not force a strict interpretation of section categories. Automated models, in contrast, tend to collapse uncertainty into a small set of high-probability labels. This evaluation was therefore used not as ground truth for training or benchmarking but as a way to measure differences in behavior between humans and automated systems when faced with difficult and underspecified sentences.

In this study, we evaluated a wide range of clinical note segmentation approaches, including rule-based systems, traditional classifiers, domain-specific transformer models, and state-of-the-art large language models. Our results show that lightweight baselines like MedSpaCy remain competitive on structured sentence-level tasks and retain high precision on unstructured freetext, but they fall behind on overall freetext segmentation quality. API-based large language models achieve the highest overall performance, with GPT-5-mini providing the most consistent gains across both sentence classification and freetext segmentation.

The human evaluation highlights an additional dimension of this problem. When presented with difficult or weakly contextualized clinical sentences, human annotators show flexible judgment and frequently select ambiguous categories rather than forcing a narrow interpretation. In contrast, automated systems tend to collapse uncertainty into a small set of high-probability labels. This pattern is especially visible in header and demographic fragments, where baselines and LLMs often apply strong but inaccurate defaults. These findings indicate that segmentation models are sensitive not only to domain structure but also to the way ambiguous input is represented and labeled.

Overall, our study demonstrates that high-quality segmentation of clinical notes depends on both model capability and the nature of the input text. Large models offer strong generalization to structured documentation, while custom rule-based and classical methods remain valuable for unstructured formats. The human evaluation further underscores the importance of understanding annotation behavior when designing or interpreting segmentation systems. Future work should explore methods that better capture uncertainty in ambiguous sections and that leverage human-like flexibility in classification decisions.

