A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology


Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modeling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero’s latent representations are interpretable: sparse autoencoders disentangle clinically meaningful concepts without supervision, outperforming predefined-vocabulary approaches and enabling targeted suppression of artifact-induced biases, which enhances robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.


💡 Research Summary

The paper introduces DermFM‑Zero, a vision‑language foundation model specifically designed for zero‑shot clinical collaboration in dermatology. Trained on more than four million multimodal data points—including dermoscopic images, clinical photographs, textual descriptions, patient demographics, and medical histories—the model employs a two‑stage pre‑training pipeline. First, masked latent modeling extracts fine‑grained morphological features from three million unlabeled images. Second, bootstrapped contrastive learning aligns these visual embeddings with a domain‑specialized text encoder based on PubMedBERT using one million image‑text pairs. The resulting 304 M‑parameter vision encoder and 110 M‑parameter text encoder together form a compact yet powerful system that outperforms much larger general‑domain models.
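The paper does not reproduce the training code, but the second pre-training stage, aligning visual embeddings with a text encoder via contrastive learning, follows the familiar CLIP-style recipe. A minimal NumPy sketch of a symmetric InfoNCE objective (function names and the temperature value are illustrative assumptions, not the paper's implementation) shows the core idea: matched image–text pairs sit on the diagonal of a batch similarity matrix, and the loss pulls each pair together while pushing mismatched pairs apart.

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalise each embedding so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (B, B): matched pairs on the diagonal
    labels = np.arange(len(img))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)                     # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()                 # diagonal = correct pair

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In practice such a loss is computed on encoder outputs within an autodiff framework; the sketch only illustrates the alignment objective itself.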

Performance is evaluated across three dimensions.

(1) Benchmarking on 20 public datasets covering multi‑class skin‑cancer diagnosis (HAM‑10000, PAD‑UFES‑20), melanoma detection (ISIC2020, PH2), extensive skin‑condition classification (SNU‑134, SD‑128), and rare disease identification (DAF‑ODIL‑5) shows that DermFM‑Zero achieves state‑of‑the‑art zero‑shot accuracy and retrieval metrics. Notably, for rare, life‑threatening conditions the model reaches a balanced accuracy of 0.893 despite the scarcity of training examples.

(2) Three multinational reader studies involving over 1,100 clinicians assess real‑world utility. In primary‑care simulations, 30 general practitioners see their top‑3 diagnostic accuracy jump from 0.266 to 0.482 and their rate of appropriate management decisions rise from 0.504 to 0.592 when assisted by the model. In a specialist benchmark with 1,090 clinicians, DermFM‑Zero attains an overall diagnostic accuracy of 0.717, surpassing the collective clinician average (0.663) and outperforming board‑certified dermatologists by 2.3 percentage points. A subsequent multimodal collaboration study with 34 specialists shows diagnostic accuracy improving from 0.50 to 0.61 and management appropriateness from 0.70 to 0.73 when the AI’s suggestions are incorporated.

(3) Interpretability is addressed through sparse autoencoders (SAEs) that discover clinically meaningful concepts (e.g., pigment patterns, vascular structures) in the latent space without supervision. These concepts outperform predefined‑vocabulary baselines and enable targeted suppression of artifact‑induced biases (such as device‑specific color shifts) without any retraining, thereby enhancing robustness and transparency.
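The bias-suppression mechanism can be sketched in miniature. The toy sparse autoencoder below uses random stand-in weights (a real SAE is trained on the model's activations, and all names here are hypothetical); it shows the mechanics of how zeroing a concept's activation in the sparse code before decoding removes that concept's contribution to the representation, with no retraining of the underlying model.

```python
import numpy as np

class SparseAutoencoder:
    """Toy SAE: a ReLU code over an overcomplete dictionary.
    Weights are random stand-ins for trained ones."""
    def __init__(self, d_model, d_code, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=d_model ** -0.5, size=(d_model, d_code))
        self.b_enc = np.zeros(d_code)
        self.W_dec = self.W_enc.T.copy()        # tied decoder, for simplicity
        self.b_dec = np.zeros(d_model)

    def encode(self, h):
        # ReLU gives a sparse, non-negative concept activation vector
        return np.maximum(h @ self.W_enc + self.b_enc, 0.0)

    def decode(self, z):
        return z @ self.W_dec + self.b_dec

    def suppress(self, h, concept_ids):
        """Reconstruct h with the chosen concept activations zeroed out."""
        z = self.encode(h)
        z[:, concept_ids] = 0.0                 # ablate e.g. an artifact concept
        return self.decode(z)
```

A downstream classifier would then consume `suppress(h, artifact_ids)` instead of `h`, which is the sense in which artifact-induced biases can be removed without retraining.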

Additional analyses demonstrate label efficiency: linear probing with only 10 % of annotated data yields a 9.6 % average improvement over domain‑specific baselines, and the model remains superior to a 7 B‑parameter general‑domain CLIP variant across most data fractions. Multimodal fine‑tuning that integrates clinical photographs, dermoscopy, and text further boosts performance (e.g., +8.1 % macro F1 on the Derm7pt task).
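Linear probing itself is a simple protocol: the encoder is frozen, its embeddings are extracted once, and only a linear classifier is fit on the labeled fraction. A closed-form ridge-regression probe in NumPy (an illustrative sketch, not the paper's exact protocol) makes the procedure concrete:

```python
import numpy as np

def fit_linear_probe(features, labels, n_classes, reg=1e-2):
    """Fit a ridge-regression linear probe on frozen embeddings (closed form)."""
    X = np.hstack([features, np.ones((len(features), 1))])   # append a bias column
    Y = np.eye(n_classes)[labels]                            # one-hot targets
    # Ridge solution: W = (X^T X + reg*I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    return W

def probe_predict(W, features):
    X = np.hstack([features, np.ones((len(features), 1))])
    return (X @ W).argmax(axis=1)                            # highest-scoring class
```

Because only `W` is learned, label efficiency comparisons like the 10 % experiment reduce to fitting this small system on subsamples of the annotated set.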

In summary, DermFM‑Zero showcases that a well‑scaled, multimodal vision‑language foundation model can deliver zero‑shot diagnostic and retrieval capabilities, provide measurable benefits in human‑AI collaborative workflows across primary and specialist settings, and offer interpretable, bias‑mitigating mechanisms through unsupervised concept discovery. These findings argue for a shift from task‑specific fine‑tuning toward zero‑shot, collaborative AI as a practical, safe, and transparent decision‑support paradigm in dermatology and potentially other medical specialties.

