A Unified Multimodal Framework for Dataset Construction and Model-Based Diagnosis of Ameloblastoma
Artificial intelligence (AI)-enabled diagnostics in maxillofacial pathology require structured, high-quality multimodal datasets. However, existing resources provide limited ameloblastoma coverage and lack the format consistency needed for direct model training. We present a newly curated multimodal dataset focused specifically on ameloblastoma, integrating annotated radiological, histopathological, and intraoral clinical images with structured data derived from case reports. Natural language processing techniques were employed to extract clinically relevant features from textual reports, while image data underwent domain-specific preprocessing and augmentation. Using this dataset, a multimodal deep learning model was developed to classify ameloblastoma variants, assess behavioral patterns such as recurrence risk, and support surgical planning. The model is designed to accept clinical inputs such as presenting complaint, age, and gender during deployment to enhance personalized inference. Quantitative evaluation demonstrated substantial improvements: variant classification accuracy increased from 46.2 percent to 65.9 percent, and the abnormal tissue detection F1-score improved from 43.0 percent to 90.3 percent. Benchmarked against resources such as MultiCaRe, this work advances patient-specific decision support by providing both a robust dataset and an adaptable multimodal AI framework.
💡 Research Summary
The paper addresses a critical gap in maxillofacial pathology: the lack of a large, well‑structured multimodal dataset for the rare odontogenic tumor ameloblastoma. Existing public resources either contain very few ameloblastoma cases or suffer from inconsistent formatting that prevents direct model training. To overcome these limitations, the authors introduce a newly curated multimodal dataset focused exclusively on ameloblastoma and a deep‑learning framework that leverages both image and textual modalities for diagnosis, variant classification, recurrence risk assessment, and surgical planning.
Dataset Construction
The image component was built by first extracting a subset of 161 case reports from the publicly available MultiCaRe dataset, yielding 552 raw images. Recognizing that many reports lacked associated images, the authors queried the PubMed Central (PMC) API using PMCID identifiers to retrieve missing figures. Composite figures containing multiple sub‑images were automatically split using a two‑pass OpenCV algorithm, improving granularity but occasionally producing blank or over‑split segments. Images were then labeled into three categories—radiology, pathology, and clinical photographs—using a pre‑trained MultiCaRe classifier. Because the classifier mis‑assigned several images (e.g., clinical photos labeled as pathology), a custom ReactJS web interface was developed for semi‑automated validation, allowing reviewers to edit metadata, upload corrected images, and delete erroneous entries. After this manual curation, nine invalid entries were removed, leaving 152 unique patients and a total of 1,152 images (radiology = 548, pathology = 477, clinical = 421). All images were upscaled fourfold with the open‑source “upscayl” tool and normalized to 512 × 512 pixels.
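The two-pass splitting step can be sketched with projection profiles: split on horizontal gutters first, then on vertical gutters within each row band. The paper used OpenCV; the NumPy-only sketch below (function names, thresholds, and the gutter-merging heuristic are illustrative, not the authors') applies the same idea, with narrow gutters merged to reduce the over-splitting the authors observed.

```python
import numpy as np

def split_axis(img, axis, gap=20, white=245):
    """Split an image along one axis wherever a run of near-white
    rows/columns (a separator gutter) is at least `gap` pixels wide."""
    gray = img.min(axis=2) if img.ndim == 3 else img
    profile = gray.min(axis=1 - axis)          # darkest pixel per row/column
    is_blank = profile >= white
    segments, start = [], None
    for i, blank in enumerate(np.append(is_blank, True)):
        if not blank and start is None:
            start = i
        elif blank and start is not None:
            segments.append((start, i))
            start = None
    # merge content segments separated by gutters narrower than `gap`,
    # so thin internal whitespace does not cause over-splitting
    merged = []
    for s, e in segments:
        if merged and s - merged[-1][1] < gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return [img[s:e] if axis == 0 else img[:, s:e] for s, e in merged]

def split_composite(img):
    """Two-pass split: row bands first, then columns within each band."""
    panels = []
    for band in split_axis(img, axis=0):
        panels.extend(split_axis(band, axis=1))
    return panels
```

A second validation pass (e.g., dropping segments whose area or ink content falls below a floor) would be needed to filter the blank segments mentioned above.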
The textual component was derived from the same case reports. Four extraction pipelines were implemented and compared:
- Rule‑based keyword matching – a deterministic pipeline that uses curated medical term lists and regular expressions to capture entities such as presenting complaint, age, gender, radiological findings, and histopathological diagnosis.
- Word2Vec semantic similarity – embeddings from the Google News 300‑dimensional model were averaged per sentence and compared to centroid vectors of predefined categories. A cosine similarity threshold of 0.65 determined inclusion; below this threshold the system fell back to the rule‑based method.
- BioBERT contextual embeddings – a biomedical‑specific transformer model that generates sentence‑level contextual vectors, which are then matched against curated examples for each information category.
- Gemini LLM with structured prompting – a large language model accessed via Google’s Gemini API. Detailed prompts instructed the model to extract specific fields and return them in a JSON schema. Extensive prompt engineering and error‑handling were employed to minimize hallucinations and ensure consistent output.
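The interplay between the first two pipelines, i.e. assigning a sentence embedding to the nearest category centroid and falling back to deterministic rules below the 0.65 cosine threshold, can be sketched as follows. The threshold and the fallback behavior mirror the description above; the patterns, centroids, and function names are illustrative.

```python
import re
import numpy as np

# hypothetical rule-based patterns for simple entities
CATEGORY_PATTERNS = {
    "age": re.compile(r"\b(\d{1,3})[- ]year[- ]old\b", re.I),
    "gender": re.compile(r"\b(male|female)\b", re.I),
}

def rule_based_extract(sentence):
    """Deterministic fallback: regex capture of simple entities."""
    out = {}
    for field, pat in CATEGORY_PATTERNS.items():
        m = pat.search(sentence)
        if m:
            out[field] = m.group(1).lower()
    return out

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify_sentence(vec, centroids, threshold=0.65):
    """Assign a sentence embedding to the closest category centroid;
    return None below the threshold so the caller falls back to rules."""
    best, score = None, -1.0
    for cat, c in centroids.items():
        s = cosine(vec, c)
        if s > score:
            best, score = cat, s
    return best if score >= threshold else None
```

In the actual pipeline the sentence vector would be the average of Google News Word2Vec embeddings rather than the toy vectors used here.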
Empirical evaluation showed that the Gemini approach achieved the highest precision and recall across all fields, particularly for the four ameloblastoma variants (solid/multicystic, unicystic, peripheral, desmoplastic). The rule‑based method performed adequately for simple entities but struggled with complex phrasing; Word2Vec suffered from domain mismatch (general‑news corpus vs. medical terminology); BioBERT offered solid contextual understanding but lacked generative flexibility.
Multimodal Model Architecture
Using the curated dataset, the authors built a multimodal deep‑learning model. Image data passed through a convolutional backbone (ResNet‑50 variant) after augmentation (rotation, color jitter, random cropping). Textual data—structured JSON fields plus free‑text descriptions—were tokenized and embedded via a lightweight transformer encoder. Patient metadata (age, gender, presenting complaint) were concatenated as additional features. The two modality streams were fused using a cross‑modal attention block, producing a joint representation that fed into two heads: (a) multi‑label classification for ameloblastoma variants and (b) binary detection of abnormal tissue (e.g., tumor vs. healthy). The model was trained with weighted binary cross‑entropy to address class imbalance and evaluated using 5‑fold cross‑validation.
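The fusion step can be illustrated with single-head scaled dot-product cross-attention in which text tokens query image tokens. This toy NumPy sketch omits the learned Q/K/V projections, the backbone, and the two output heads, so it is a structural illustration rather than the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(img_tokens, txt_tokens, d):
    """Text tokens attend over image tokens (single head; Q/K/V are
    identity projections here for illustration)."""
    q, k, v = txt_tokens, img_tokens, img_tokens
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (n_txt, n_img)
    fused = attn @ v                                 # (n_txt, d)
    # joint representation: mean-pool each stream and concatenate;
    # the real model feeds this into the variant and detection heads
    return np.concatenate([fused.mean(axis=0), txt_tokens.mean(axis=0)])
```

Patient metadata (age, gender, presenting complaint) would simply be appended to the joint vector before the classification heads.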
Performance Gains
Compared with a baseline model trained on the original MultiCaRe subset (without the extensive preprocessing and multimodal fusion), the proposed system achieved:
- Variant classification accuracy: 46.2 % → 65.9 % (↑ 19.7 percentage points)
- Abnormal tissue detection F1‑score: 43.0 % → 90.3 % (↑ 47.3 percentage points)
Statistical testing (paired t‑tests) confirmed that these improvements were significant (p < 0.01). Across all evaluated metrics (precision, recall, AUROC), the new model exhibited lower variance across folds and greater robustness, indicating that the curated dataset and the multimodal architecture contributed synergistically to performance.
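For reference, the paired t statistic over fold-level scores can be computed as in this minimal sketch; in practice `scipy.stats.ttest_rel` also returns the two-sided p-value, and the data below are illustrative, not the paper's fold scores.

```python
import numpy as np

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples
    (e.g., per-fold scores of two models on the same CV splits)."""
    d = np.asarray(a, float) - np.asarray(b, float)
    n = d.size
    t = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    return t, n - 1
```

The resulting t is then compared against the t-distribution with n - 1 degrees of freedom to obtain the significance level.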
Case‑Based Retrieval System
Beyond classification, the authors implemented a case‑based retrieval engine to support clinical decision‑making. Structured embeddings of each case (derived from the same text pipeline) are stored in a vector database. When a clinician inputs a new patient’s description—either as free text or via a structured form—the system computes cosine similarity to retrieve the most similar historical cases, displaying their imaging, pathology, treatment, and outcomes. This facilitates evidence‑based recommendations for diagnosis and surgical planning.
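The retrieval step reduces to nearest-neighbor search over normalized case embeddings. Below is a minimal in-memory stand-in for the vector database (class and method names are illustrative; a production system would use a dedicated vector store).

```python
import numpy as np

class CaseStore:
    """Toy vector store: one embedding per historical case,
    retrieved by cosine similarity (dot product of unit vectors)."""

    def __init__(self):
        self.ids, self.vecs = [], []

    def add(self, case_id, vec):
        v = np.asarray(vec, float)
        self.ids.append(case_id)
        self.vecs.append(v / np.linalg.norm(v))   # normalize once at insert

    def query(self, vec, k=3):
        """Return the k most similar cases as (case_id, similarity)."""
        q = np.asarray(vec, float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vecs) @ q
        order = np.argsort(-sims)[:k]
        return [(self.ids[i], float(sims[i])) for i in order]
```

A new patient's description, embedded by the same text pipeline, would be passed to `query`, and the clinician would then inspect the imaging, pathology, treatment, and outcomes of the returned cases.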
Limitations and Future Directions
The authors acknowledge several constraints:
- Label validation still relies on expert review; fully automated, high‑confidence labeling remains an open challenge.
- Incomplete modality coverage: some cases lack one or more image types, limiting the model’s ability to learn cross‑modal correlations for those instances.
- Absence of EHR data: electronic health records (e.g., lab values, longitudinal follow‑up) were not incorporated, which could further improve recurrence risk prediction.
- Domain‑specific embeddings: while BioBERT performed well, training a transformer on a large corpus of oral pathology literature could yield superior semantic representations.
Future work will focus on expanding the dataset with additional peer‑reviewed case reports, integrating EHR streams, developing semi‑supervised labeling tools, and exploring domain‑adapted language models (e.g., ClinicalBERT, DentalBERT). The authors also plan to open‑source the dataset and code to encourage community validation and extension.
Conclusion
This study delivers the first publicly available, high‑quality multimodal dataset dedicated to ameloblastoma and demonstrates that careful data curation, combined with a multimodal deep‑learning framework, can substantially improve diagnostic accuracy and tissue detection performance. By coupling the model with a case‑based retrieval system and providing detailed methodological documentation, the work lays a solid foundation for AI‑augmented decision support in maxillofacial pathology and sets a precedent for similar efforts in other under‑represented medical domains.