Reliable Mislabel Detection for Video Capsule Endoscopy Data
The classification performance of deep neural networks relies strongly on access to large, accurately annotated datasets. In medical imaging, however, obtaining such datasets is particularly challenging since annotations must be provided by specialized physicians, which severely limits the pool of annotators. Furthermore, class boundaries can often be ambiguous or difficult to define, which further complicates machine learning-based classification. In this paper, we address this problem and introduce a framework for mislabel detection in medical datasets. The framework is validated on the two largest publicly available datasets for Video Capsule Endoscopy, an important imaging procedure for examining the gastrointestinal tract based on a video stream of low-resolution images. In addition, potentially mislabeled samples identified by our pipeline were reviewed and re-annotated by three experienced gastroenterologists. Our results show that the proposed framework successfully detects incorrectly labeled data and, after cleaning the datasets, yields improved anomaly detection performance compared to current baselines.
💡 Research Summary
The paper addresses a critical bottleneck in medical image analysis—noisy or incorrect labels—by proposing a systematic mislabel detection and cleaning framework tailored to Video Capsule Endoscopy (VCE) datasets. VCE generates massive amounts of low‑resolution video frames of the gastrointestinal tract, but annotating these frames requires expert gastroenterologists, leading to limited, costly, and potentially error‑prone labeling. Moreover, the inherent class imbalance (few pathological frames amid many normal frames) and ambiguous boundaries between normal and abnormal appearances exacerbate the impact of mislabeled samples on deep learning models, which tend to overfit noisy labels and suffer degraded generalization.
The authors design a two‑stage pipeline. First, they train a lightweight MobileNetV3 convolutional neural network (CNN) three times independently on the raw dataset, using focal loss to mitigate class imbalance. During training, they record per‑sample loss values across epochs and runs, and combine these with prediction confidence and entropy to compute an uncertainty score that reflects how difficult each sample is to classify. Second, they fit a three‑component Gaussian Mixture Model (GMM) to the distribution of average losses. The component with the highest mean is interpreted as the “noisy” cluster, the lowest‑mean component as the “clean” cluster, and the intermediate component as “hard‑to‑learn” samples. For each frame i, the probability of belonging to the noisy component (p_i) serves as a noise probability estimate. By also estimating a corrected‑label probability (p_c_i) from the CNN’s confidence, they compute a noise‑reduction score (p_i – p_c_i). The top k_c samples with the largest reduction are relabeled (binary flips or multi‑label adjustments), after which the CNN‑GMM training loop is repeated. Finally, the top k_f samples with the highest remaining noise probability are filtered out of the dataset. This correction‑then‑filtering strategy follows recent work on combined noise handling but adapts it to the VCE context.
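The GMM step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the per-sample losses here are synthetic stand-ins for the averages recorded across the three MobileNetV3 runs, and the corrected-label confidence `p_corrected` is filled with placeholder values.

```python
# Sketch of the loss-based GMM noise estimation (synthetic losses; in the
# paper these come from three independent MobileNetV3 training runs).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic per-sample average losses: clean, hard-to-learn, noisy clusters.
losses = np.concatenate([
    rng.normal(0.1, 0.03, 800),   # clean samples (low loss)
    rng.normal(0.8, 0.10, 150),   # hard-to-learn samples (intermediate loss)
    rng.normal(2.0, 0.20, 50),    # likely mislabeled samples (high loss)
]).reshape(-1, 1)

# Fit a three-component GMM; the highest-mean component is the "noisy" one.
gmm = GaussianMixture(n_components=3, random_state=0).fit(losses)
noisy_comp = int(np.argmax(gmm.means_.ravel()))
p_noise = gmm.predict_proba(losses)[:, noisy_comp]  # p_i per frame

# Placeholder corrected-label confidence p_c_i (would come from the CNN).
p_corrected = rng.uniform(0.0, 1.0, len(losses))
reduction = p_noise - p_corrected  # noise-reduction score p_i - p_c_i

k_c = 20
relabel_idx = np.argsort(reduction)[-k_c:]  # top-k_c candidates to relabel
```

After relabeling these candidates, the CNN-GMM loop would be rerun, and the `k_f` frames with the highest remaining `p_noise` would be filtered out.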
To evaluate the method, the authors conduct controlled experiments on the Kvasir‑Capsule dataset, where they artificially inject label noise at rates of 1 %, 5 %, 10 %, 15 % and 20 % by flipping labels preferentially among high‑uncertainty samples while preserving class distribution. Because the true corrupted labels are known, they can directly measure detection accuracy. Results show that for 5 % injected noise the pipeline correctly identifies 2 262 out of 2 360 noisy frames (≈95.9 % recall) and filters out only a small fraction of clean frames. Even at 10 % noise the recall remains above 92 %, demonstrating robustness.
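The injection protocol can be expressed compactly. The sketch below uses synthetic stand-ins for the Kvasir-Capsule labels and the per-sample uncertainty scores; only the mechanism (uncertainty-biased flipping and direct recall measurement) mirrors the experiment.

```python
# Sketch of the controlled noise-injection protocol (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
labels = rng.integers(0, 2, n)       # illustrative binary labels
uncertainty = rng.uniform(0, 1, n)   # per-sample uncertainty (assumed given)

rate = 0.05                          # 5 % injected noise
k = int(rate * n)
flip_idx = np.argsort(uncertainty)[-k:]  # flip where uncertainty is highest
noisy_labels = labels.copy()
noisy_labels[flip_idx] = 1 - noisy_labels[flip_idx]

# Because flip_idx is known, detection recall is directly measurable for
# any set of indices flagged by the pipeline:
def detection_recall(flagged, true_noisy=frozenset(flip_idx)):
    return len(set(flagged) & true_noisy) / len(true_noisy)
```

With the corrupted indices recorded at injection time, the reported recall values (e.g. ≈95.9 % at 5 % noise) follow directly from this kind of comparison.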
The second stage applies the pipeline to the real‑world Galar dataset, the largest publicly available multi‑label VCE collection (≈3.5 M labeled frames). Since ground‑truth noise is unknown, the authors rank all frames by noise‑reduction score, select the top 500 candidates, and then randomly sample 100 frames (70 normal, 30 pathological) for expert review by three gastroenterologists (two of whom contributed to the original dataset). The clinicians confirm that many of the flagged frames were indeed mislabeled, leading to a substantial number of label corrections. After cleaning, the authors retrain the MobileNetV3 model and evaluate anomaly detection on the original test split. Compared to baselines trained on the uncleaned data, the cleaned model achieves markedly higher F1‑scores across pathologies (e.g., polyp detection improves from ~5 % to ~37 %, blood detection from ~14 % to ~54 %). This demonstrates that cleaning the dataset directly translates into better clinical performance.
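The candidate-selection step for expert review can be sketched as below. All quantities are illustrative assumptions: `reduction` stands in for the noise-reduction scores on Galar, and `is_normal` for each frame's original label.

```python
# Hedged sketch of ranking by noise-reduction score and stratified sampling
# of frames for expert review (synthetic scores and class flags).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
reduction = rng.uniform(0, 1, n)     # noise-reduction score per frame
is_normal = rng.random(n) < 0.7      # illustrative "normal" flag per frame

top500 = np.argsort(reduction)[-500:]        # highest-ranked candidates
normal_pool = top500[is_normal[top500]]
patho_pool = top500[~is_normal[top500]]
review_set = np.concatenate([
    rng.choice(normal_pool, 70, replace=False),  # 70 normal frames
    rng.choice(patho_pool, 30, replace=False),   # 30 pathological frames
])
```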
The paper also provides qualitative analysis using t‑SNE and PCA visualizations of the latent feature space before and after cleaning. The visualizations reveal that noisy labels cause pathological samples to be scattered within the normal cluster, while after correction the two classes form more distinct, compact clusters, confirming that the GMM‑based loss clustering aligns with true semantic structure.
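A minimal version of this latent-space inspection can be reproduced with PCA (t-SNE is used analogously in the paper). The 64-dimensional features below are synthetic placeholders for the CNN's latent representations.

```python
# Sketch of the 2-D latent-space projection step (synthetic features).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal_feats = rng.normal(0.0, 1.0, (500, 64))  # "normal" cluster
patho_feats = rng.normal(3.0, 1.0, (100, 64))   # "pathological" cluster
feats = np.vstack([normal_feats, patho_feats])

coords = PCA(n_components=2).fit_transform(feats)  # 2-D points to plot
# Separation of the two class centroids along the first principal component:
sep = abs(coords[:500, 0].mean() - coords[500:, 0].mean())
```

On cleaned data the two clusters should project to clearly separated regions, which is the qualitative effect the visualizations confirm.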
Key contributions include: (1) a loss‑based GMM approach for estimating per‑sample label noise probability, (2) an integrated correction‑and‑filtering workflow that balances retaining useful hard examples with discarding likely mislabeled data, (3) extensive validation on both synthetic and real VCE datasets, and (4) clinical verification by expert gastroenterologists, establishing practical relevance. The use of MobileNetV3 ensures that the entire pipeline remains feasible for deployment on low‑power embedded devices, a crucial requirement for on‑capsule real‑time anomaly detection.
In conclusion, the study demonstrates that systematic detection and removal/correction of mislabeled frames can substantially improve deep‑learning‑based anomaly detection in VCE, addressing a major obstacle in medical AI where high‑quality annotated data are scarce. Future work may extend the framework to multi‑label settings, incorporate iterative re‑estimation loops, and explore on‑device implementation of the GMM noise estimator for real‑time quality control during capsule operation.