Random forest-based out-of-distribution detection for robust lung cancer segmentation
Accurate detection and segmentation of cancerous lesions from computed tomography (CT) scans are essential for automated treatment planning and cancer treatment response assessment. Transformer-based models with self-supervised pretraining can produce reliably accurate segmentations on in-distribution (ID) data but degrade when applied to out-of-distribution (OOD) datasets. We address this challenge with RF-Deep, a random forest classifier that uses deep features from the segmentation model's pretrained transformer encoder to detect OOD scans and enhance segmentation reliability. The segmentation model comprises a Swin Transformer encoder, pretrained with masked image modeling (SimMIM) on 10,432 unlabeled 3D CT scans covering cancerous and non-cancerous conditions, and a convolutional decoder, trained to segment lung cancers in 317 3D scans. Independent testing was performed on 603 public 3D CT scans comprising one ID dataset and four OOD datasets: chest CTs with pulmonary embolism (PE) and COVID-19, and abdominal CTs with kidney cancers and healthy volunteers. RF-Deep detected OOD cases with FPR95 values of 18.26%, 27.66%, and less than 0.1% on PE, COVID-19, and abdominal CTs, respectively, consistently outperforming established OOD approaches. The RF-Deep classifier thus provides a simple and effective way to enhance the reliability of cancer segmentation in ID and OOD scenarios.
💡 Research Summary
The paper addresses the critical problem of out‑of‑distribution (OOD) failure in deep learning models for lung cancer segmentation on CT scans. While transformer‑based segmentation networks pretrained with self‑supervised learning (SSL) achieve high accuracy on in‑distribution (ID) data, their performance degrades sharply when presented with scans that differ in pathology or anatomy. To mitigate this, the authors propose RF‑Deep, a lightweight OOD detector that leverages deep features extracted from a pretrained Swin‑Transformer encoder and a random‑forest (RF) classifier.
First, a Swin‑Transformer encoder is pretrained on 10,432 unlabeled 3‑D CT volumes using SimMIM, a masked‑image‑modeling SSL approach that masks 75 % of patches and forces the network to reconstruct them. This pretraining endows the encoder with robust, modality‑agnostic representations. The encoder is then fine‑tuned together with a convolutional decoder on 317 labeled lung‑cancer CTs to produce a segmentation model. After fine‑tuning, the encoder weights are frozen. For each new scan, the model predicts a tumor mask; the authors then crop the tumor‑centered region and extract multi‑scale feature vectors from the patch‑embedding layer and all Swin stages (0‑4). Features from eight crops per scan are averaged to obtain a compact representation.
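The multi-scale pooling step above can be sketched as follows. The stage names, channel widths, and crop shapes are illustrative assumptions (the real values depend on the encoder configuration), and random arrays stand in for activations from the frozen encoder:

```python
import numpy as np

# Hypothetical channel widths for the patch-embedding layer and the
# Swin stages; the actual dimensions depend on the encoder config.
STAGE_CHANNELS = {"patch_embed": 48, "stage1": 96, "stage2": 192,
                  "stage3": 384, "stage4": 768}

def pool_stage_features(stage_maps):
    """Global-average-pool each stage's feature map and concatenate."""
    pooled = [fmap.reshape(fmap.shape[0], -1).mean(axis=1)
              for fmap in stage_maps.values()]
    return np.concatenate(pooled)

def scan_representation(n_crops, rng):
    """Average pooled multi-scale features over tumor-centered crops."""
    vectors = []
    for _ in range(n_crops):
        # Stand-in for running one crop through the frozen encoder:
        # each stage yields a (C, D, H, W) activation volume.
        stage_maps = {name: rng.standard_normal((c, 4, 8, 8))
                      for name, c in STAGE_CHANNELS.items()}
        vectors.append(pool_stage_features(stage_maps))
    return np.mean(vectors, axis=0)

rng = np.random.default_rng(0)
feature = scan_representation(n_crops=8, rng=rng)
print(feature.shape)  # one compact vector per scan
```

The result is a single fixed-length vector per scan, which is what the downstream random-forest classifier consumes.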
A random‑forest classifier (1000 trees, max depth 20, class‑balanced) is trained on a modest OOD exposure set: 140 ID lung‑cancer scans and 442 OOD scans covering four distinct domains—pulmonary embolism (PE), COVID‑19, kidney cancer, and healthy abdominal CTs. The RF learns non‑linear decision boundaries on the deep features, while remaining interpretable via SHAP values.
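A minimal scikit-learn sketch of such a classifier, using the configuration reported above; the feature dimension and the random vectors standing in for encoder features are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-ins for pooled encoder features: 140 ID scans vs. 442 OOD
# scans, shifted apart so the toy problem is separable.
X_id = rng.standard_normal((140, 64))
X_ood = rng.standard_normal((442, 64)) + 1.5
X = np.vstack([X_id, X_ood])
y = np.concatenate([np.zeros(140), np.ones(442)])  # 1 = OOD

# Configuration from the summary: 1000 trees, max depth 20, and
# class-balanced weighting to offset the ID/OOD imbalance.
rf = RandomForestClassifier(n_estimators=1000, max_depth=20,
                            class_weight="balanced", random_state=0)
rf.fit(X, y)

# The OOD score for a new scan is the predicted probability of class 1.
ood_score = rf.predict_proba(X[:1])[0, 1]
print(round(ood_score, 3))
```

At inference time only the frozen encoder forward pass and this classifier are needed, which is what keeps the method lightweight.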
The authors evaluate RF‑Deep against standard OOD baselines (MaxSoftmax, MaxLogits, Energy, Entropy) and a radiomics‑based RF (RF‑Radiomics). Performance is measured by AUROC and FPR95 (the false‑positive rate at the threshold where 95 % of OOD samples are correctly detected). RF‑Deep achieves AUROC = 95.16 % and FPR95 = 18.26 % on PE, AUROC = 92.88 % and FPR95 = 27.66 % on COVID‑19, and near‑perfect scores on the abdominal datasets (AUROC ≈ 99.8 %, FPR95 < 1 %). In contrast, the best baseline (MaxLogits) reaches only AUROC ≈ 91 % and FPR95 ≈ 34 % on PE, with similar gaps on the other datasets. The radiomics RF performs worse than RF‑Deep, highlighting the advantage of deep, task‑specific features over handcrafted texture descriptors.
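The confidence-based baselines and the FPR95 metric are simple to compute from a model's logits. A hedged numpy sketch follows; the "higher score = more OOD" sign convention and the synthetic logits are assumptions for illustration:

```python
import numpy as np

def max_softmax_score(logits):
    # MaxSoftmax (MSP): negate the top class probability so that a
    # less confident prediction yields a higher (more OOD-like) score.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -probs.max(axis=1)

def energy_score(logits):
    # Energy: negative log-sum-exp of the logits; higher = more OOD.
    m = logits.max(axis=1)
    return -(m + np.log(np.exp(logits - m[:, None]).sum(axis=1)))

def fpr_at_95(ood_scores, id_scores):
    # Threshold set so 95% of OOD samples score at or above it; FPR95
    # is the fraction of ID samples wrongly flagged at that threshold.
    thr = np.percentile(ood_scores, 5)
    return float(np.mean(id_scores >= thr))

rng = np.random.default_rng(0)
id_logits = rng.standard_normal((200, 2)) * 5.0   # confident ID outputs
ood_logits = rng.standard_normal((200, 2)) * 0.5  # low-confidence OOD
fpr = fpr_at_95(energy_score(ood_logits), energy_score(id_logits))
print(round(fpr, 3))
```

The same `fpr_at_95` helper applies unchanged to the RF-Deep probability scores, which is what makes the baseline comparison direct.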
Visualization with t‑SNE shows partial clustering of ID versus OOD samples in the encoder feature space, confirming separability. SHAP analysis reveals that early‑ and mid‑stage Swin features contribute most to OOD discrimination, aligning with an ablation study that demonstrates the importance of multi‑scale representations.
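A minimal version of the t-SNE separability check, with synthetic ID/OOD clusters standing in for the pooled encoder features (cluster sizes and the perplexity value are illustrative choices):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic clusters standing in for ID and OOD scan features.
feats = np.vstack([rng.standard_normal((40, 32)),
                   rng.standard_normal((40, 32)) + 2.0])
labels = np.array([0] * 40 + [1] * 40)  # 0 = ID, 1 = OOD

# Project to 2-D; perplexity must stay below the sample count.
emb = TSNE(n_components=2, perplexity=15,
           random_state=0).fit_transform(feats)
print(emb.shape)  # one 2-D point per scan, scatter-plotted by label
```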
Overall, RF‑Deep provides a simple, computationally inexpensive, and interpretable solution to flag scans that lie outside the training distribution, thereby improving the safety and reliability of automated lung‑cancer segmentation pipelines. The method can be integrated into existing workflows with minimal overhead, as it only requires a frozen encoder and a lightweight RF at inference time. Future work will explore scaling to multiple disease sites, multi‑class OOD detection, and evaluation across different foundation‑model pretraining strategies.