Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages

Background: Large language models (LLMs) are increasingly explored for extracting structured information from free-text clinical records. However, most studies focus on single tasks, limited models, and English-language reports, leaving gaps in our understanding of performance across diseases, languages, prompting strategies, and report types.

Methods: We retrospectively evaluated 15 open-weight LLMs for structured information extraction from pathology and radiology reports across six use cases at three institutes in the Netherlands, the United Kingdom, and the Czech Republic, covering colorectal liver metastases, liver tumours, neurodegenerative diseases, soft-tissue tumours, melanomas, and sarcomas. Reports were manually annotated by one or more raters, with consensus annotations where applicable. Models included general-purpose and medical-specialised LLMs across four scales: large, medium, small, and tiny. Six prompting strategies were compared: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. Performance was assessed using metrics appropriate for each variable type, summarised via macro-averages, and, where available, compared with inter-rater agreement.
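The macro-averaging used to summarise per-variable performance can be sketched as follows; this is a minimal illustration, and the variable names and scores below are hypothetical, not taken from the study:

```python
def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean over variables, so each extracted field
    contributes equally regardless of how often it occurs."""
    return sum(scores.values()) / len(scores)

# Hypothetical per-variable F1 scores for one model on one use case.
per_variable_f1 = {
    "tumour_size": 0.90,
    "margin_status": 0.80,
    "histological_grade": 0.70,
}
print(round(macro_average(per_variable_f1), 2))  # 0.8
```

Because the mean is unweighted, rarely reported fields carry the same weight as common ones, which is why macro-averages are a natural summary when variables differ in prevalence across reports.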

