When CNNs Outperform Transformers and Mambas: Revisiting Deep Architectures for Dental Caries Segmentation

Reading time: 5 minutes

📝 Original Info

  • Title: When CNNs Outperform Transformers and Mambas: Revisiting Deep Architectures for Dental Caries Segmentation
  • ArXiv ID: 2511.14860
  • Date: 2025-11-18
  • Authors: Jun Zeng et al. (the full author list should be verified against the original paper)

📝 Abstract

Accurate identification and segmentation of dental caries in panoramic radiographs are critical for early diagnosis and effective treatment planning. Automated segmentation remains challenging due to low lesion contrast, morphological variability, and limited annotated data. In this study, we present the first comprehensive benchmarking of convolutional neural networks, vision transformers, and state-space Mamba architectures for automated dental caries segmentation on panoramic radiographs using the DC1000 dataset. Twelve state-of-the-art architectures, including VMUnet, MambaUNet, VMUNetv2, RMAMamba-S, TransNetR, PVTFormer, DoubleU-Net, and ResUNet++, were trained under identical configurations. Results reveal that, contrary to the growing trend toward complex attention-based architectures, the CNN-based DoubleU-Net achieved the highest Dice coefficient of 0.7345, mIoU of 0.5978, and precision of 0.8145, outperforming all transformer and Mamba variants. In the study, the top 3 results across all performance metrics were achieved by CNN-based architectures. Mamba and transformer-based methods, despite their theoretical advantage in global context modeling, underperformed due to limited data and weaker spatial priors. These findings underscore the importance of architecture-task alignment over model complexity in domain-specific medical image segmentation. Our code is available at: https://github.com/JunZengz/dental-caries-segmentation.

💡 Deep Analysis

📄 Full Content

Dental caries is among the most common chronic diseases worldwide (Abdalla, Elsayed, and Ahmed 2022; Lee, Kim, and Jeong 2021). Although largely preventable, dental caries remains a leading cause of tooth loss and oral discomfort (Hirata et al. 2023). Epidemiological data show that untreated dental caries affects over one-third of the global population. In children, tooth decay is the most common chronic dental condition, impacting approximately 514 million individuals worldwide (GBD 2017 Disease and Injury Incidence and Prevalence Collaborators 2018). Therefore, early and accurate detection of carious lesions is crucial. Conventional diagnostic methods, such as visual-tactile examination, intraoral radiography, approximal tooth separation, caries-detection dyes, fiber-optic transillumination (FOTI), and indices such as DMFT, are widely used. These approaches, however, are limited by subjective interpretation, inter-examiner variability, low sensitivity to early enamel lesions, false-positive staining, and difficulties in detecting early-stage or overlapping lesions (Abdalla, Elsayed, and Ahmed 2022; Srilatha et al. 2019; Abdelaziz 2023).

Deep neural networks have transformed medical image analysis in recent years. Convolutional Neural Networks (CNNs) have been especially successful in segmentation tasks (e.g., U-Net (Ronneberger, Fischer, and Brox 2015) and its variants), due to their ability to learn rich hierarchical features from images. More recently, Vision Transformers (ViTs) (Dosovitskiy et al. 2021; Chen et al. 2021) with self-attention mechanisms have emerged, showing promise in capturing global context dependencies and improving long-range reasoning. In parallel, a new family of state-space models has emerged; its medical adaptations, such as VM-UNet (Ruan, Li, and Xiang 2024a), VM-UNetV2 (Zhang et al. 2024), and RMAMamba-S (Zeng et al. 2025), aim to balance efficiency and global modeling through selective state-space recurrence.

There are studies in the literature that have exhibited the potential of deep learning for dental image analysis.

Deep learning is one of the fundamental components of automated diagnosis in dental imaging, especially for segmenting dental caries in panoramic X-rays. Initial research in the field has focused on convolutional architectures such as U-Net (Ronneberger, Fischer, and Brox 2015) and its variants. Table 1 summarizes representative DL studies on dental image segmentation across various image modalities, detailing their datasets, model types, and reported performance. Many studies, however, employ differing protocols and do not offer standardized toolkits or unified benchmarks, thereby limiting reproducibility and generalization. Additionally, the literature indicates that benchmarking efforts on the publicly available DC1000 dataset remain sparse and lack systematic cross-architectural evaluation. To address these gaps, we present the first unified benchmark of 12 diverse DL models, including CNN-based, transformer-based, and Mamba-based architectures. All models are evaluated under a consistent training pipeline. Our study aims to serve as a foundational resource guiding the development of robust, generalizable, and clinically relevant caries segmentation models.

Most medical image segmentation methods still depend on CNNs as their underlying foundation. Stacking convolutional layers while gradually reducing spatial resolution allows these networks to extract both local and global information. Encoder-decoder architectures such as U-Net (Ronneberger, Fischer, and Brox 2015), ResUNet++ (Jha et al. 2019), DoubleU-Net (Jha et al. 2020), and ColonSegNet (Jha et al. 2021) are enhanced with skip connections, enabling them to reconstruct fine detail while also learning increasingly abstract semantic representations. These designs are characterized by effective training dynamics, stability on heterogeneous data, and high spatial resolution, which makes them well suited to detecting small or low-contrast caries.
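The encoder-decoder-with-skip-connection pattern described above can be sketched in a few lines of PyTorch. This is a minimal illustrative toy, not the authors' actual models; layer counts and channel sizes (`base=16`, a single down/up stage) are placeholder assumptions, whereas real architectures such as DoubleU-Net stack many more stages.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style sketch: one encoder stage, one decoder stage,
    and a skip connection carrying high-resolution features across."""

    def __init__(self, in_ch=1, out_ch=1, base=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)  # halve spatial resolution
        self.mid = nn.Sequential(nn.Conv2d(base, base, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # the decoder sees upsampled coarse features concatenated with the skip
        self.dec = nn.Conv2d(base * 2, out_ch, 3, padding=1)

    def forward(self, x):
        skip = self.enc(x)               # high-resolution local features
        mid = self.mid(self.down(skip))  # coarser, more abstract features
        up = self.up(mid)
        return self.dec(torch.cat([up, skip], dim=1))  # skip connection
```

The concatenation in the last line is the mechanism the paragraph refers to: spatial detail lost during downsampling is reinjected into the decoder, which is why such models can recover small, low-contrast lesion boundaries.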

Transformer-based models replace localized convolutional processing with self-attention and, in this way, allow the model to capture global context and long-range dependencies across the image.

In this study, we used the DC1000 dataset, which consists of 597 high-resolution panoramic images, each annotated with pixel-level segmentation masks for dental caries (Wang et al. 2023). The radiographs were annotated by experienced dentists and obtained from clinical sources, ensuring the dataset's reliability and clinical relevance.

To ensure consistency in runtime, throughput, and reproducibility, all models were trained with the PyTorch framework on a single NVIDIA V100 GPU with 32 GB of memory. The performance of all models was assessed using standard segmentation metrics, including mean Intersection over Union (mIoU), Dice coefficient (mDSC), precision, recall, and F2-score. To ensure a fair comparison across architectures, all models were trained and evaluated under a unified experimental configuration with identical hyperparameters.
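For readers unfamiliar with these metrics, the sketch below computes them from binary masks using only the true-positive/false-positive/false-negative counts. The function name and the flat 0/1 list representation are illustrative assumptions, not the authors' evaluation code, but the formulas (Dice = 2TP/(2TP+FP+FN), IoU = TP/(TP+FP+FN), F2 weighting recall over precision) are the standard definitions.

```python
def binary_seg_metrics(pred, target, eps=1e-7):
    """Compute overlap metrics for binary segmentation.

    pred, target: flat iterables of 0/1 pixel labels (same length).
    eps guards against division by zero on empty masks.
    """
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    dice = 2 * tp / (2 * tp + fp + fn + eps)   # equivalent to F1
    iou = tp / (tp + fp + fn + eps)            # Jaccard index
    f2 = 5 * precision * recall / (4 * precision + recall + eps)
    return {"dice": dice, "iou": iou,
            "precision": precision, "recall": recall, "f2": f2}
```

Note that Dice is always at least as large as IoU for the same prediction, which is consistent with the paper's reported values (Dice 0.7345 vs. mIoU 0.5978 for DoubleU-Net).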


This content is AI-processed based on open access ArXiv data.
