Colorimeter-Supervised Skin Tone Estimation from Dermatoscopic Images for Fairness Auditing
Neural-network-based diagnosis from dermatoscopic images is increasingly used for clinical decision support, yet studies report performance disparities across skin tones. Fairness auditing of these models is limited by the lack of reliable skin-tone annotations in public dermatoscopy datasets. We address this gap with neural networks that predict Fitzpatrick skin type via ordinal regression and the Individual Typology Angle (ITA) via color regression, using in-person Fitzpatrick labels and colorimeter measurements as targets. We further leverage extensive pretraining on synthetic and real dermatoscopic and clinical images. The Fitzpatrick model achieves agreement comparable to human crowdsourced annotations, and ITA predictions show high concordance with colorimeter-derived ITA, substantially outperforming pixel-averaging approaches. Applying these estimators to ISIC 2020 and MILK10k, we find that fewer than 1% of subjects belong to Fitzpatrick types V and VI. We release code and pretrained models as an open-source tool for rapid skin-tone annotation and bias auditing. This is, to our knowledge, the first dermatoscopic skin-tone estimation neural network validated against colorimeter measurements, and it supports growing evidence of clinically relevant performance gaps across skin-tone groups.
💡 Research Summary
This paper addresses a critical gap in fairness auditing of deep‑learning models for dermatoscopic image analysis: the lack of reliable skin‑tone annotations in public datasets. The authors develop two neural‑network pipelines that predict skin tone directly from dermatoscopic images. One network predicts the categorical Fitzpatrick (FP) skin type (six classes) using ordinal regression, while the other predicts the continuous Individual Typology Angle (ITA) by regressing the three CIELAB color channels (L*, a*, b*) and converting them to ITA. Both networks share an EfficientNet‑B0 backbone pre‑trained on ImageNet and further pre‑trained on a large collection of clinical and synthetic dermatoscopic datasets (ISIC 2020, SCIN, PAD‑UFES‑20, Fitzpatrick17k, DDI, MRA‑MIDAS, and a synthetic melanin‑controlled set S‑SYNTH).
The core training data is the MSKCC Skin Tone Labeling dataset, which contains 4,878 dermatoscopic and clinical close‑up images from 64 subjects, each annotated with expert in‑person FP labels and triplicate colorimeter measurements. This provides a gold‑standard physical reference for both the ordinal (FP) and continuous (ITA) targets. The authors perform patient‑level 5‑fold cross‑validation to avoid subject leakage, so that no subject's images appear in both the training and validation folds.
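The essence of a patient-level split can be sketched in plain Python. The function name and the toy subject IDs below are illustrative, not from the paper's code; a real pipeline would typically use a grouped splitter such as scikit-learn's `GroupKFold`:

```python
def patient_level_folds(subject_ids, k=5):
    """Assign each unique subject to one of k folds (round-robin),
    so every image of a subject lands in the same fold."""
    unique = sorted(set(subject_ids))
    subject_fold = {s: i % k for i, s in enumerate(unique)}
    return [subject_fold[s] for s in subject_ids]

# Toy data: 6 subjects, several images each
subjects = ["a", "a", "b", "b", "b", "c", "d", "d", "e", "f"]
folds = patient_level_folds(subjects, k=5)

# Verify no subject leakage: each subject maps to exactly one fold
for s in set(subjects):
    assert len({f for x, f in zip(subjects, folds) if x == s}) == 1
```

Splitting at the image level instead would let near-duplicate images of the same subject leak across folds and inflate validation metrics.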
For FP classification, they employ the CORAL ordinal regression loss, which respects the ordered nature of the classes. The model achieves a linear‑weighted Cohen’s κ of 52.98 % (95 % CI 51.54–54.40), lower than the 66.08 % κ obtained from crowdsourced human annotations but still useful. The mean absolute error is 0.84 and the “within‑one” accuracy (prediction within one FP class of the ground truth) is 84.84 %, surpassing the crowdsourced within‑one accuracy of 80.77 %. Errors are predominantly adjacent‑class confusions, especially between FP I and FP II, with minimal systematic bias.
For ITA estimation, the network minimizes the ΔE₇₆ color difference between predicted and measured CIELAB values. The resulting ITA predictions achieve an ICC(3) of 93.88 % (95 % CI 93.31–94.40), closely approaching the repeatability of the colorimeter itself (ICC ≈ 98.38 %). This performance markedly exceeds previously published pixel‑averaging baselines that rely on hand‑crafted skin‑pixel selection and white‑balancing.
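Both quantities have simple closed forms: ITA = arctan((L* − 50) / b*) · 180/π, and ΔE₇₆ is the Euclidean distance between two points in CIELAB space. A minimal sketch (function names are illustrative):

```python
import math

def ita(L, b):
    """Individual Typology Angle (degrees) from CIELAB L* and b*."""
    return math.degrees(math.atan2(L - 50.0, b))

def delta_e76(lab1, lab2):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(lab1, lab2)))

# Lighter skin has high L* and moderate b*, giving a large ITA:
print(round(ita(70.0, 15.0), 1))  # → 53.1

# Training loss between a predicted and a measured (L*, a*, b*) triple:
print(delta_e76((70.0, 5.0, 15.0), (67.0, 9.0, 15.0)))  # → 5.0
```

Regressing CIELAB with a ΔE₇₆ loss and converting to ITA afterward keeps the optimization target in a perceptually meaningful color space, rather than regressing the angle directly.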
The authors then apply the trained models to two widely used public dermatoscopic benchmarks that lack skin‑tone labels: ISIC 2020 (33,126 images) and MILK10k (5,240 lesions). The inference reveals a severe under‑representation of darker skin tones: fewer than 1 % of subjects are estimated as Fitzpatrick V or VI in both datasets. This quantitative evidence supports earlier concerns about dataset composition bias and highlights the need for more diverse data.
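An audit of this kind reduces to tallying predicted FP types across subjects and reporting per-class fractions. A toy sketch with made-up predictions (the actual distributions come from the paper's released inference results):

```python
from collections import Counter

# Hypothetical per-subject FP predictions; note the absence of types V-VI,
# mirroring the < 1% representation the paper reports for ISIC 2020 / MILK10k.
preds = [1, 2, 2, 3, 2, 1, 4, 2, 3, 2]

counts = Counter(preds)
total = len(preds)
for fp in range(1, 7):
    print(f"FP {fp}: {100 * counts.get(fp, 0) / total:.1f}%")
```

For real datasets the predictions would be aggregated at the subject level (majority vote over a subject's images) before computing fractions, since image counts per subject vary.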
All code, pretrained weights, and inference results are released at https://github.com/marinbenc/nn_colorimetry_dermatoscopy, providing the community with a ready‑to‑use tool for rapid skin‑tone annotation and systematic bias audits. The paper’s contributions are threefold: (1) a validated neural‑network method for ITA estimation from dermatoscopic images, the first to be benchmarked against colorimeter ground truth; (2) an ordinal‑regression FP classifier with performance comparable to human raters; (3) a large‑scale analysis of skin‑tone distributions in public dermatoscopic datasets, confirming substantial under‑representation of darker skin. The work paves the way for more equitable AI in dermatology by enabling scalable, physically grounded skin‑tone annotation and rigorous fairness evaluation.