3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation
3D teeth segmentation, involving the localization of tooth instances and their semantic categorization in 3D dental models, is a critical yet challenging task in digital dentistry due to the complexity of real-world dentition. In this paper, we propose 3DTeethSAM, an adaptation of the Segment Anything Model 2 (SAM2) for 3D teeth segmentation. SAM2 is a pretrained foundation model for image and video segmentation that has proven to be a strong backbone in various downstream scenarios. To adapt SAM2 to 3D teeth data, we render images of 3D teeth models from predefined views, apply SAM2 for 2D segmentation, and reconstruct 3D results using 2D-3D projections. Since SAM2’s performance depends on input prompts, its initial outputs often have deficiencies, and it is class-agnostic, we introduce three lightweight learnable modules: (1) a prompt embedding generator that derives prompt embeddings from image embeddings for accurate mask decoding, (2) a mask refiner that enhances SAM2’s initial segmentation results, and (3) a mask classifier that categorizes the generated masks. Additionally, we incorporate Deformable Global Attention Plugins (DGAP) into SAM2’s image encoder, which improve both segmentation accuracy and training speed. Our method has been validated on the 3DTeethSeg benchmark, achieving an IoU of 91.90% on high-resolution 3D teeth meshes and establishing a new state of the art in the field.
💡 Research Summary
The paper introduces 3DTeethSAM, a novel framework that adapts the Segment Anything Model 2 (SAM2), a large‑scale 2‑D vision foundation model, for high‑resolution 3‑D teeth segmentation. Traditional 3‑D dental segmentation methods (e.g., PointNet++, TSGCNet, TSRNet) operate directly on meshes or point clouds and struggle with complex dentition, high‑resolution geometry, and limited annotated data. By leveraging SAM2’s powerful pretrained knowledge, the authors aim to overcome these limitations while keeping the adaptation lightweight and data‑efficient.
Key components of the method
- Multi‑view rendering and 2‑D‑to‑3‑D voting – Each 3‑D dental mesh is normalized, oriented, and rendered into a set of 512 × 512 RGB images from several fixed camera angles (front, back, and side views). SAM2 processes each view to produce 2‑D masks. A voting scheme aggregates mask predictions across views, and a graph‑cut post‑processing step refines the aggregated mask before lifting it back into 3‑D space. This multi‑view strategy ensures that regions occluded or ambiguous in one view are compensated for by other views, enabling accurate segmentation of high‑resolution meshes.
- Three lightweight learnable adapters – Because SAM2 is prompt‑driven, class‑agnostic, and trained on generic images, the authors introduce three small modules that sit on top of the frozen SAM2 backbone:
  - Prompt Embedding Generator – A Transformer decoder receives randomly initialized query vectors (one per possible tooth, up to 16) together with the image embeddings from SAM2. Through self‑attention and cross‑attention, it produces a prompt embedding and a confidence score for each query, effectively generating automatic prompts without human input. This module also captures spatial relationships among teeth.
  - Mask Refiner – The coarse 16‑channel mask output from SAM2 often has blurry boundaries. A UNet‑style convolutional network takes the original image, the coarse mask, and SAM2’s image embedding as inputs, fusing low‑level texture with high‑level semantics to sharpen tooth boundaries and correct small segmentation errors.
  - Mask Classifier – To make the system class‑aware, another Transformer decoder maps the 16 mask channels to tooth IDs. Instead of a fixed channel‑to‑ID mapping (which fails when teeth are missing), the classifier predicts a probability distribution over the 16 tooth classes for each channel, using the same query‑based mechanism as the Prompt Embedding Generator.
- Deformable Global Attention Plugin (DGAP) – SAM2’s image encoder is a Vision Transformer that uses global self‑attention, which is inefficient for focusing on the relatively small region of interest (the teeth) within a full image. DGAP replaces the standard global attention block with a deformable attention mechanism that learns offset vectors to sample features adaptively across the image. This concentrates computational resources on tooth regions, accelerates convergence (≈ 15 % faster), and yields a modest IoU gain (≈ 0.7 %).
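The 2‑D‑to‑3‑D voting step above can be sketched as a per‑face majority vote over all rendered views. The function name, array layouts, and label conventions here are illustrative assumptions, not the authors' implementation; the graph‑cut refinement is omitted:

```python
import numpy as np

def aggregate_votes(face_ids_per_view, labels_per_view, n_faces, n_classes=17):
    """Fuse per-view 2-D tooth masks into per-face 3-D labels by majority vote.

    face_ids_per_view: list of (H, W) int arrays mapping each rendered pixel
        to a mesh face index (-1 for background pixels), one array per view.
    labels_per_view:   list of (H, W) int arrays with the predicted tooth
        label per pixel (0 = gingiva/background), same views.
    """
    votes = np.zeros((n_faces, n_classes), dtype=np.int64)
    for face_ids, labels in zip(face_ids_per_view, labels_per_view):
        visible = face_ids >= 0                      # pixels that hit the mesh
        np.add.at(votes, (face_ids[visible], labels[visible]), 1)
    # Each face takes the label that collected the most pixel votes across
    # views; a graph-cut smoothing step (omitted) would refine this result.
    return votes.argmax(axis=1)
```

Faces occluded in one view still receive votes from the views where they are visible, which is what makes the multi‑view aggregation robust.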
Training strategy – The SAM2 backbone remains frozen; only the adapters and DGAP are fine‑tuned. This keeps the number of trainable parameters low (on the order of a few hundred thousand) while preserving the extensive knowledge encoded in SAM2’s pretraining on billions of masks.
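A minimal PyTorch sketch of this frozen‑backbone training setup; the tiny placeholder modules stand in for SAM2's pretrained encoder/decoder and for the adapters plus DGAP, and are assumptions for illustration only:

```python
import torch
import torch.nn as nn

# Placeholders: `backbone` stands in for SAM2's pretrained weights,
# `adapters` for the three learnable modules and DGAP.
backbone = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 8, 3))
adapters = nn.ModuleList([nn.Linear(8, 17), nn.Linear(8, 8)])

for p in backbone.parameters():          # freeze the pretrained backbone
    p.requires_grad_(False)

# Only the adapter parameters are handed to the optimizer.
trainable = [p for p in adapters.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

Freezing the backbone keeps the trainable parameter count small while preserving the knowledge from SAM2's large‑scale pretraining.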
Experimental results – On the 3DTeethSeg benchmark (high‑resolution dental meshes with 16‑tooth instance labels), 3DTeethSAM achieves a mean Intersection‑over‑Union of 91.90 %, surpassing prior state‑of‑the‑art 3‑D methods by a sizable margin. Ablation studies demonstrate that each component (multi‑view voting, each adapter, DGAP) contributes positively: the Prompt Generator eliminates the need for manual prompts, the Mask Refiner improves boundary IoU by ~2 %, the Mask Classifier resolves channel‑ID mismatches especially in cases with missing teeth, and DGAP speeds up training while slightly boosting accuracy.
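For reference, the mean IoU reported on the benchmark can be computed per tooth class over per‑face label arrays as in this sketch; the class layout (0 = gingiva/background) and the skipping of absent classes are assumptions about the evaluation protocol:

```python
import numpy as np

def mean_iou(pred, gt, n_classes=17):
    """Mean IoU over tooth classes for flat per-face label arrays."""
    ious = []
    for c in range(1, n_classes):             # skip background class 0
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:                         # class absent in both: skip
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```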
Insights and broader impact – The work shows that a large 2‑D foundation model can be repurposed for a 3‑D domain by (i) bridging the dimensional gap with multi‑view rendering, (ii) providing automatic prompt generation and class awareness through lightweight adapters, and (iii) refining the attention mechanism to focus on the region of interest. This recipe is potentially transferable to other 3‑D segmentation problems in medical imaging (CT/MRI organ segmentation), industrial inspection (CAD part segmentation), and robotics (object manipulation from point clouds).
Future directions suggested by the authors include: dynamic query allocation for varying numbers of teeth, incorporating illumination and material cues during rendering to improve texture‑based segmentation, and exploring end‑to‑end architectures that bypass rendering altogether by feeding point clouds directly into a SAM‑style decoder.
In summary, 3DTeethSAM successfully merges the generalization power of SAM2 with domain‑specific adaptations, delivering state‑of‑the‑art performance on high‑resolution dental meshes while maintaining a compact, data‑efficient training regime. This represents a significant step toward universal foundation models that can operate across both 2‑D and 3‑D visual domains.