Almost Clinical: Linguistic properties of synthetic electronic health records
This study evaluates the linguistic and clinical suitability of synthetic electronic health records in mental health. First, we describe the rationale and methodology for creating the synthetic corpus. Second, we examine expressions of agency, modality, and information flow across four clinical genres (Assessments, Correspondence, Referrals, and Care plans) with the aim of understanding how LLMs grammatically construct medical authority and patient agency through linguistic choices. While LLMs produce coherent, terminology-appropriate texts that approximate clinical practice, systematic divergences remain, including registerial shifts, insufficient clinical specificity, and inaccuracies in medication use and diagnostic procedures. The results show both the potential and the limitations of synthetic corpora for enabling large-scale linguistic research otherwise impossible with genuine patient records.
💡 Research Summary
This paper investigates whether large language models (LLMs) can generate synthetic electronic health records (EHRs) that are linguistically and clinically suitable for research, focusing on mental‑health documentation. The authors first outline a systematic pipeline for creating a massive synthetic corpus: they define four clinically relevant genres—Initial Assessments, GP Correspondence, Referral/Handover letters, and Care Plans—each with a detailed prompt and a system prompt that forces the model to adopt the role of a psychiatrist. To capture the diversity of real‑world cases, eight demographic and clinical variables (age, gender, sexual orientation, ethnicity, diagnosis, medication, risk factors, and admission status) are combined, yielding 12 960 distinct patient stories (2 × 2 × 3 × 3 × 6 × 3 × 5 × 4).
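The combinatorial design behind the 12 960 patient stories can be sketched as a Cartesian product of the eight variables. The value labels below are illustrative placeholders, not the study's actual category labels; only the cardinalities (2 × 2 × 3 × 3 × 6 × 3 × 5 × 4) follow the paper.

```python
from itertools import product

# Cardinalities follow the paper's 2 x 2 x 3 x 3 x 6 x 3 x 5 x 4 design;
# the value labels are placeholders, not the study's actual categories.
variables = {
    "age": [f"age_{i}" for i in range(2)],                  # 2
    "gender": [f"gen_{i}" for i in range(2)],               # 2
    "sexual_orientation": [f"so_{i}" for i in range(3)],    # 3
    "ethnicity": [f"eth_{i}" for i in range(3)],            # 3
    "diagnosis": [f"dx_{i}" for i in range(6)],             # 6
    "medication": [f"med_{i}" for i in range(3)],           # 3
    "risk_factors": [f"risk_{i}" for i in range(5)],        # 5
    "admission_status": [f"adm_{i}" for i in range(4)],     # 4
}

# Each combination of values defines one distinct patient profile.
profiles = [dict(zip(variables, combo)) for combo in product(*variables.values())]
print(len(profiles))  # 12960
```

Each profile would then be interpolated into the genre-specific prompt, yielding 12 960 records per genre-prompting run and per model.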
Two instruction‑tuned LLMs, Llama 3.2 (3 B) and Mistral v0.3 (7 B), were selected after a preliminary comparison with DeepSeek V2 and MediPhi, because expert judges found their outputs closest to authentic clinical language. The full corpus therefore consists of 12 960 records per model, with genre‑specific statistics showing that Llama tends to produce longer texts (median ≈ 800 words, some outliers > 8 000 words) while Mistral generates more uniformly sized documents (median ≈ 600 words).
For linguistic analysis the authors adopt Systemic Functional Linguistics (SFL) as a theoretical framework, examining three metafunctions—field, tenor, and mode—through three clause‑level clusters: agency (who does what), modality (deontic, epistemic, volitional), and information flow (textual themes such as “arguing”, “extending”, “structuring”). A random subset of 24 texts (six per genre per model) is annotated using CorpusTool and manually validated.
Key findings:
- Transitivity (clause types) – Care Plans are dominated by material processes (≈ 83 % for Llama, ≈ 89 % for Mistral), reflecting an "enabling‑doing" register that issues instructions and actions. Initial Assessments and Referrals show higher proportions of relational and existential processes, aligning with a "categorising‑inventory" register that lists diagnoses, histories, and new entities. Llama rarely uses existential clauses, whereas Mistral explicitly introduces entities (e.g., "there is a heightened concern for her safety").
- Modality – Deontic requirements are the most frequent across all genres, especially obligations in Care Plans and Referrals. Llama exhibits more patient‑oriented volition ("I will adhere to my medication regimen"), while Mistral's volitional expressions are scarce and, when present, doctor‑centric ("I will collaborate with pain specialists").
- Information flow – The "arguing" theme, signalled mainly by "however", appears most often in Referrals (Llama) and GP correspondence (Mistral), marking contrast or complication. The "extending" theme, using connectors like "additionally" or "furthermore", is prevalent in Care Plans and GP letters, supporting incremental accumulation of clinical detail. Structuring connectors are rare.
- Length and register shifts – Llama's longer outputs sometimes contain informal or overly verbose phrasing, leading to register mismatches (e.g., an occasional colloquial tone). Mistral's more concise style stays closer to typical clinical brevity but may omit nuanced patient‑agency cues.
- Clinical accuracy – Both models produce terminology‑appropriate language but make systematic errors in medication dosing, diagnostic naming, and procedural description. For instance, dosage details are sometimes omitted or mis‑stated, and certain psychotic features are incorrectly linked to non‑psychotic diagnoses.
The authors conclude that synthetic EHRs generated by LLMs are promising for large‑scale linguistic research that would otherwise be impossible due to privacy constraints. However, they caution that the corpora are not yet fit for direct clinical decision‑support or training of diagnostic NLP systems without further validation. Potential biases linked to demographic‑clinical variable combinations were observed, underscoring the need for bias‑mitigation strategies.
Future work is proposed in three directions: (1) creating a gold‑standard, expert‑validated synthetic corpus to correct factual inaccuracies; (2) systematic comparison of additional LLM architectures and prompt engineering techniques to improve fidelity; (3) quantitative similarity assessments between synthetic and real EHRs using lexical, semantic, and clinical ontology metrics. By addressing these issues, synthetic corpora could become reliable, shareable resources for both linguistic inquiry and safe development of mental‑health NLP applications.
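For the third direction, the paper does not specify which similarity metrics would be used; one minimal lexical baseline, assuming simple whitespace tokenisation, could be a Jaccard overlap between token sets of a synthetic and a real record:

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Lexical overlap between two documents, measured as Jaccard
    similarity of their lowercased token sets (illustrative baseline;
    not a metric named in the paper)."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty documents are trivially identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Hypothetical snippets for illustration only
synthetic = "patient reports low mood and poor sleep plan review sertraline dose"
real = "patient reports low mood anxiety and poor sleep plan continue sertraline"
print(jaccard_similarity(synthetic, real))
```

Semantic (embedding-based) and ontology-grounded comparisons would complement such surface measures, since lexical overlap alone cannot detect the clinical inaccuracies the paper documents.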