Reliable Curation of EHR Dataset via Large Language Models under Environmental Constraints


Electronic health records (EHRs) are central to modern healthcare delivery and research; yet, many researchers lack the database expertise necessary to write complex SQL queries or generate effective visualizations, limiting efficient data use and scientific discovery. To address this barrier, we introduce CELEC, a large language model (LLM)-powered framework for automated EHR data extraction and analytics. CELEC translates natural language queries into SQL using a prompting strategy that integrates schema information, few-shot demonstrations, and chain-of-thought reasoning, which together improve accuracy and robustness. CELEC also adheres to strict privacy protocols: the LLM accesses only database metadata (e.g., table and column names), while all query execution occurs securely within the institutional environment, ensuring that no patient-level data is ever transmitted to or shared with the LLM. On a subset of the EHRSQL benchmark, CELEC achieves execution accuracy comparable to prior systems while maintaining low latency and cost efficiency. Ablation studies confirm that each component of the SQL generation pipeline, particularly the few-shot demonstrations, plays a critical role in performance. By lowering technical barriers and enabling medical researchers to query EHR databases directly, CELEC streamlines research workflows and accelerates biomedical discovery.


💡 Research Summary

The paper introduces CELEC, a framework that enables biomedical researchers to query electronic health record (EHR) databases using natural language without exposing patient‑level data to external large language models (LLMs). The system follows a strict privacy‑by‑design principle: the LLM receives only schema metadata (table and column names, data types) while all SQL execution and subsequent data handling occur locally within the institution’s secure environment.
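The paper does not publish CELEC's code, but the metadata-only principle is easy to illustrate: the prompt-building step reads only table names, column names, and declared types, never rows. The sketch below uses Python's stdlib `sqlite3` as a stand-in for the institutional database (CELEC itself runs on DuckDB); the table and column names are illustrative, not taken from the paper.

```python
import sqlite3

def extract_schema_metadata(conn):
    """Collect only table names, column names, and declared types.

    No SELECT against data rows is ever issued, so nothing
    patient-level can leak into an LLM prompt built from this dict.
    """
    meta = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, default, pk).
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        meta[table] = [(c[1], c[2]) for c in cols]
    return meta

# Toy admissions table; real deployments would point at the local EHR copy.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (subject_id INTEGER, admittime TEXT)")
print(extract_schema_metadata(conn))
```

Only the returned dictionary is serialized into the prompt; query execution stays behind the institutional boundary.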

CELEC’s core consists of two LLM‑driven modules. First, a text‑to‑SQL component translates a user’s natural‑language question into a SQL statement. The prompt combines three techniques: (1) explicit schema information to ground the model, (2) few‑shot demonstrations drawn from medical literature and the EHRSQL benchmark (a total of 4,761 high‑quality NL‑SQL pairs), and (3) chain‑of‑thought (CoT) reasoning that forces the model to first identify relevant tables before generating the final query. Demonstrations are selected dynamically: the user query is embedded with MiniLM‑L6‑v2, compared to an indexed demo pool via cosine similarity, and the top‑k (k = 2) examples are inserted into the prompt.
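The dynamic demonstration selection can be sketched as a nearest-neighbor lookup over the demo pool. In the real system the query is embedded with MiniLM-L6-v2 (e.g., via a sentence-embedding library) and compared against a pre-built index; the minimal version below uses toy 2-d vectors and a plain cosine similarity so the ranking logic stands on its own. All names here are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_demos(query_vec, demo_pool, k=2):
    """Return the k NL-SQL pairs whose embeddings are closest to the query.

    demo_pool is a list of (embedding, nl_question, sql) tuples; CELEC's
    pool holds 4,761 such pairs and uses k = 2.
    """
    ranked = sorted(demo_pool, key=lambda d: cosine(query_vec, d[0]),
                    reverse=True)
    return [(nl, sql) for _, nl, sql in ranked[:k]]

# Toy 2-d "embeddings" standing in for 384-d MiniLM vectors.
pool = [
    ([1.0, 0.0], "count admissions", "SELECT COUNT(*) FROM admissions"),
    ([0.0, 1.0], "list lab items", "SELECT label FROM d_labitems"),
    ([0.9, 0.1], "count patients", "SELECT COUNT(*) FROM patients"),
]
print(top_k_demos([1.0, 0.2], pool, k=2))
```

The selected pairs are then interpolated into the prompt ahead of the user's question.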

After the LLM produces a SQL query, CELEC executes it against a local DuckDB instance of the MIMIC‑III/MIMIC‑IV de‑identified datasets. If execution fails (e.g., due to hallucinated aliases), the error message is fed back to the LLM for up to two retries, substantially improving robustness.
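The retry mechanism amounts to a short feedback loop: execute, and on failure hand the database's error message back to the generator. A minimal sketch follows, again using stdlib `sqlite3` in place of DuckDB; `fake_llm` is a hypothetical stand-in for the actual text-to-SQL call.

```python
import sqlite3

def execute_with_retry(conn, question, generate_sql, max_retries=2):
    """Run generated SQL; on failure, feed the error back for regeneration.

    generate_sql(question, error) returns a SQL string; error is None on
    the first pass and the database's error message on retries.
    """
    error = None
    for _ in range(max_retries + 1):
        sql = generate_sql(question, error)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)
    raise RuntimeError(f"query still failing after {max_retries} retries: {error}")

# Stub "LLM": hallucinates a table name first, corrects itself once it
# sees the error message (mimicking the alias-hallucination case).
def fake_llm(question, error):
    return "SELECT * FROM admissionz" if error is None else "SELECT * FROM admissions"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (subject_id INTEGER)")
conn.execute("INSERT INTO admissions VALUES (1)")
print(execute_with_retry(conn, "show all admissions", fake_llm))
```

Capping the loop at two retries bounds both latency and API cost per question.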

The second module creates visualizations. Once a dataframe is retrieved, the LLM receives the original natural‑language question and the column metadata of the result set. It outputs a structured specification of the chart type (histogram, bar, line, scatter, etc.) and aesthetic mappings. The actual rendering is performed by hard‑coded TextScript functions, guaranteeing that no patient data ever leaves the secure environment.
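Because rendering is done by fixed local functions, the LLM's chart specification can be validated against an allow-list before anything is drawn. The sketch below shows one plausible shape for that check; the spec keys and allowed chart set are assumptions for illustration, not the paper's exact schema.

```python
ALLOWED_CHARTS = {"histogram", "bar", "line", "scatter"}

def validate_chart_spec(spec, result_columns):
    """Reject LLM-produced specs with unknown chart types or columns.

    spec is a dict such as {"chart": "bar", "x": "diagnosis", "y": "count"};
    result_columns are the column names of the locally retrieved dataframe.
    """
    if spec.get("chart") not in ALLOWED_CHARTS:
        raise ValueError(f"unsupported chart type: {spec.get('chart')!r}")
    for axis in ("x", "y"):
        col = spec.get(axis)
        if col is not None and col not in result_columns:
            raise ValueError(f"{axis}-axis refers to unknown column {col!r}")
    return spec  # safe to hand to the local, hard-coded renderer

spec = {"chart": "bar", "x": "diagnosis", "y": "count"}
print(validate_chart_spec(spec, ["diagnosis", "count"]))
```

Since the LLM sees only the question and the result set's column metadata, even a malformed spec can reference nothing beyond column names.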

Evaluation uses a modified version of the EHRSQL‑2024 benchmark. After removing unanswerable queries, the test set contains 707 executable questions. CELEC achieves an RS(0) score of 81.05 %, comparable to the top‑performing systems on the original leaderboard (the best of which reported 88.17 %). Latency averages 6.0 seconds per query, and the API cost is about $0.0152 per question, demonstrating practical feasibility for interactive use.

Ablation studies reveal the importance of each design choice. Removing schema information drops accuracy to 77.93 %; reducing few‑shot examples from two to one lowers performance to 73.97 %; omitting demonstrations entirely collapses accuracy to 50.21 %. The retry mechanism contributes a modest but consistent gain (≈2 % increase).

Overall, CELEC showcases how careful prompt engineering, metadata‑only LLM interaction, and lightweight error‑handling can deliver high‑quality, privacy‑preserving text‑to‑SQL translation for clinical data. The system lowers the technical barrier for researchers, enabling rapid cohort definition, exploratory analysis, and immediate visual insight without requiring SQL expertise. Limitations include reliance on the MIMIC schema and a limited set of chart types; future work should test generalization to other hospital data models and expand visualization capabilities. The paper contributes a concrete, deployable solution that balances the power of LLMs with the stringent privacy demands of healthcare data.

