OrthoInsight: Rib Fracture Diagnosis and Report Generation Based on Multi-Modal Large Models
The growing volume of medical imaging data has increased the need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures, commonly detected via CT scans. Manual interpretation is time-consuming and error-prone. We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation. It integrates a YOLOv9 model for fracture detection, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports. OrthoInsight combines visual features from CT images with expert textual data to deliver clinically useful outputs. Evaluated on 28,675 annotated CT images and expert reports, it achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28, outperforming models like GPT-4 and Claude-3. This study demonstrates the potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists.
💡 Research Summary
The paper introduces OrthoInsight, a comprehensive multimodal framework designed to automatically detect rib fractures on computed tomography (CT) scans and generate clinically useful diagnostic reports. The system integrates three core components: (1) a YOLOv9‑based object detector that localizes fractures and predicts attributes such as type, segment, and healing status; (2) a curated orthopedic knowledge graph (KG) that stores etiological information, potential complications, management options, and follow‑up recommendations; and (3) a fine‑tuned LLaVA multimodal large language model (LLM) that fuses visual features with KG‑derived textual context to produce structured reports consisting of imaging findings, diagnostic conclusions, and clinical recommendations.
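The three-stage flow described above can be sketched as plain Python with stubbed components. This is an illustrative sketch only: the function names, the `Detection` fields, and the stubbed outputs are assumptions for demonstration, not the authors' actual API (the paper's stages use a trained YOLOv9 detector, a real knowledge graph, and a fine-tuned LLaVA model).

```python
from dataclasses import dataclass

# Hypothetical sketch of the OrthoInsight pipeline: detector -> KG lookup ->
# report generator. All names and values below are illustrative stand-ins.

@dataclass
class Detection:
    fracture_type: str   # e.g. "displaced"
    rib_segment: str     # e.g. "left rib 5"
    healing_status: str  # e.g. "fresh"

def detect_fractures(ct_slice) -> list[Detection]:
    """Stage 1: YOLOv9-style detector (stubbed with a fixed prediction)."""
    return [Detection("displaced", "left rib 5", "fresh")]

def query_knowledge_graph(det: Detection) -> list[str]:
    """Stage 2: retrieve clinical context keyed on detector attributes (stubbed)."""
    kg = {
        ("displaced", "fresh"): [
            "Risk of pneumothorax and hemothorax.",
            "Consider analgesia and respiratory physiotherapy.",
        ],
    }
    return kg.get((det.fracture_type, det.healing_status), [])

def generate_report(detections: list[Detection], snippets: list[list[str]]) -> str:
    """Stage 3: fuse findings with KG context (a template here; the paper
    uses a fine-tuned multimodal LLaVA model for this step)."""
    findings = "; ".join(
        f"{d.fracture_type} fracture of {d.rib_segment} ({d.healing_status})"
        for d in detections
    )
    context = " ".join(s for per_det in snippets for s in per_det)
    return f"Findings: {findings}. Recommendations: {context}"

def orthoinsight(ct_slice) -> str:
    dets = detect_fractures(ct_slice)
    snips = [query_knowledge_graph(d) for d in dets]
    return generate_report(dets, snips)
```

The key design point the sketch preserves is that the language model never sees the image alone: the detector's structured attributes select KG text first, and both are handed to the generator.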
A large dataset of 28,675 CT slices with expert annotations and corresponding radiology reports was assembled from a single academic hospital. The KG contains over 3,400 triples extracted from textbooks and up‑to‑date clinical guidelines. The authors define four clinically oriented evaluation metrics—Diagnostic Accuracy (DA), Content Completeness (CC), Logical Coherence and Consistency (LCC), and Clinical Guidance Value (CGV)—each scored on a 5‑point scale by orthopedic specialists.
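Since each report is scored on four 5-point metrics, an overall score is presumably a simple mean over DA, CC, LCC, and CGV. A minimal sketch of that aggregation follows; the per-metric values used here are made-up illustrations, not the paper's reported numbers.

```python
# Average the four expert-rated metrics (each on a 1-5 scale) into one score.
# The example scores are hypothetical, chosen only to demonstrate the math.
def average_score(scores: dict[str, float]) -> float:
    return round(sum(scores.values()) / len(scores), 2)

example = {"DA": 4.4, "CC": 4.2, "LCC": 4.3, "CGV": 4.1}
print(average_score(example))  # 4.25 for these illustrative values
```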
Training proceeds in stages: the YOLOv9 detector is first fine‑tuned on the fracture annotations, achieving a mean average precision of 0.972 at an IoU threshold of 0.5 (mAP@0.5) and a recall of 94.1%. The KG is then queried with the detector's outputs to retrieve relevant textual snippets. These snippets, together with the image embeddings, are fed to LLaVA, which has been further fine‑tuned on a corpus of paired images and reports. A final end‑to‑end fine‑tuning step aligns the three modules and reduces error propagation between stages.
Experimental results show that OrthoInsight outperforms state‑of‑the‑art multimodal large language models such as GPT‑4V and Claude‑3 on all four metrics, achieving an average score of 4.28/5 (GPT‑4V = 3.91, Claude‑3 = 3.84). Ablation studies confirm the contribution of each component: removing the KG drops CGV by 0.73 points; using a vanilla LLaVA instead of the fine‑tuned version reduces LCC by more than one point; and substituting YOLOv9 with Faster R-CNN lowers the overall score by 0.45 points. Qualitative case studies demonstrate that the system can correctly identify subtle multi‑site fractures, describe healing status, and suggest appropriate follow‑up (e.g., "re‑examination in 4 weeks, pain management, respiratory physiotherapy").
The authors acknowledge limitations, including the single‑center data source, a static KG that does not capture patient‑specific variables, and inference latency (~1.8 seconds per slice) that may hinder real‑time deployment. Ethical considerations are discussed, emphasizing the need for rigorous validation, bias audits, and clear responsibility frameworks before clinical adoption.
In conclusion, OrthoInsight represents a novel integration of vision‑only detection, domain‑specific knowledge grounding, and multimodal language generation, offering a promising pathway toward AI‑augmented radiology workflows that not only detect pathology but also deliver actionable, context‑aware reports. Future work will focus on multi‑institutional validation, dynamic KG updates, model compression for faster inference, and designing human‑AI interaction protocols to ensure safe and effective clinical use.