Contour Refinement using Discrete Diffusion in Low Data Regime
Boundary detection of irregular and translucent objects is an important problem with applications in medical imaging, environmental monitoring, and manufacturing, where labeled data are often scarce and in situ computational resources are limited. While recent image segmentation studies focus on aligning segmentation masks with ground truth, the task of boundary detection remains understudied, especially in the low data regime. In this work, we present a lightweight discrete diffusion contour refinement pipeline for robust boundary detection in the low data regime. At the core of our pipeline is a Convolutional Neural Network (CNN) with self-attention layers that, conditioned on a segmentation mask, iteratively denoises a sparse contour representation. We introduce multiple novel adaptations for improved low-data efficacy and inference efficiency, including a simplified diffusion process, a customized model architecture, and minimal post-processing to produce a dense, isolated contour from fewer than 500 training images. Our method outperforms several SOTA baselines on the medical imaging dataset KVASIR and is competitive on HAM10K and our custom wildfire dataset, Smoke, while improving inference framerate by 3.5X.
💡 Research Summary
The paper tackles the problem of precise boundary detection for irregular and translucent objects in scenarios where labeled data are scarce and computational resources are limited. While most recent segmentation research concentrates on mask‑level accuracy, the authors argue that boundary detection remains under‑explored, especially in low‑data regimes. To address this gap, they propose a lightweight, discrete diffusion‑based contour refinement pipeline that operates on a sparse contour representation conditioned on an initial segmentation mask.
Core Architecture
The backbone is an attention‑enhanced DUCKNet, a residual encoder‑decoder network originally designed for polyp segmentation. Self‑attention blocks (8‑head, 2‑layer repetitions) are inserted in the middle of the network to capture global context without sacrificing the multi‑scale detail preservation inherent to DUCKNet. The model receives a concatenated input of the raw image and a coarse mask produced by a separate lightweight detector (YOLOv11s, DeepLab‑v3+, or SAM2.1). The output is a set of discrete confidence tokens representing the contour, with the number of classes adapted per dataset (8 for KVASIR, 11 for HAM10K, 32 for the custom Smoke dataset).
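The bottleneck attention insertion can be sketched as follows. This is a minimal PyTorch stand-in for the 8-head, 2-layer attention blocks described above; the DUCKNet encoder/decoder, the exact placement, and the conditioning-mask concatenation are omitted, so treat all names and hyperparameters here as illustrative assumptions:

```python
import torch
import torch.nn as nn

class BottleneckAttention(nn.Module):
    """Self-attention over flattened spatial positions at a CNN bottleneck.

    Sketch of the summary's "8-head, 2-layer repetitions"; the surrounding
    DUCKNet residual encoder-decoder is not reproduced here.
    """
    def __init__(self, channels: int, heads: int = 8, layers: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(channels, heads, batch_first=True)
            for _ in range(layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(channels) for _ in range(layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)            # (B, H*W, C) token sequence
        for attn, norm in zip(self.blocks, self.norms):
            out, _ = attn(seq, seq, seq)              # global spatial context
            seq = norm(seq + out)                     # residual + layer norm
        return seq.transpose(1, 2).reshape(b, c, h, w)
```

Flattening the feature map into a sequence lets the attention layers relate distant boundary pixels while the convolutional stages keep multi-scale detail.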
Discrete Diffusion Process
Instead of the continuous Gaussian noise used in standard diffusion models (DDPM, DDIM), the authors adopt a simplified discrete diffusion. Each pixel is treated as a one‑hot vector; noise is injected by multiplying with a transition matrix Qₜ defined by a schedule βₜ (ranging from 0.0001 to 0.02). This yields a categorical distribution over a small set of confidence levels, drastically reducing memory and compute requirements. During training, Gumbel noise is added to the clean contour and a softmax is applied, following the approach of Austin et al. (2021). The loss function is limited to a Dice loss between the predicted and ground‑truth contour tokens, avoiding the KL‑matching term that would otherwise demand large datasets and cause artifacts. An exponential moving average (EMA) with decay 0.98, gradient clipping (200), and a modest attention dropout (0.01) further stabilize training.
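A minimal sketch of the forward corruption and the Dice loss, assuming the uniform-transition variant from Austin et al. (the paper's exact transition matrix, schedule, and class counts may differ):

```python
import torch

def uniform_transition(num_classes: int, beta: float) -> torch.Tensor:
    """Single-step uniform transition matrix Q_t = (1 - beta) I + (beta / K) 11^T."""
    K = num_classes
    return (1.0 - beta) * torch.eye(K) + (beta / K) * torch.ones(K, K)

def q_sample(x0_tokens: torch.Tensor, q_bar: torch.Tensor) -> torch.Tensor:
    """Corrupt clean tokens via the cumulative transition q_bar, sampled
    with the Gumbel-max trick (equivalent to categorical sampling).

    x0_tokens: integer class indices; q_bar: (K, K) row-stochastic matrix.
    """
    probs = q_bar[x0_tokens]                                  # (..., K)
    gumbel = -torch.log(-torch.log(torch.rand_like(probs).clamp_min(1e-9)))
    return (probs.clamp_min(1e-9).log() + gumbel).argmax(dim=-1)

def dice_loss(pred_probs: torch.Tensor, target_onehot: torch.Tensor,
              eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss between predicted class probabilities and one-hot targets."""
    inter = (pred_probs * target_onehot).sum()
    return 1.0 - (2.0 * inter + eps) / (pred_probs.sum() + target_onehot.sum() + eps)
```

Chaining `uniform_transition` over the βₜ schedule gives the cumulative matrix used to jump directly from clean tokens to any noise level.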
Training Regime
Training is performed on batches of size 15 for at most 100 epochs (50 for HAM10K and KVASIR, 80 for Smoke). The optimizer is AdamW with a learning rate of 1e‑4. Because the diffusion is discrete and the loss is simple, convergence is rapid even with only 200‑400 training images per dataset. The authors also fine‑tune a lightweight base segmentation model to generate the conditioning mask, experimenting with three architectures to verify robustness.
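The EMA stabilization mentioned above is a standard parameter-wise moving average; a sketch with the reported decay of 0.98:

```python
import copy
import torch

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module,
               decay: float = 0.98) -> None:
    """In-place EMA of parameters: ema = decay * ema + (1 - decay) * param."""
    for e, p in zip(ema_model.parameters(), model.parameters()):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```

The EMA copy (created once with `copy.deepcopy(model)`) is updated after each optimizer step and used for evaluation, smoothing the noisy updates that small datasets produce.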
Inference Procedure
The reverse diffusion step is replaced by an iterative denoising loop: starting from pure categorical noise, the model’s output at step t is fed back as input for step t‑1, repeated for 10 deterministic steps with a linear noise schedule. This deterministic sampling eliminates the stochasticity of traditional diffusion sampling and enables real‑time performance.
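The deterministic loop can be sketched as below; `model(tokens, mask, t)` is a hypothetical signature, assumed here to return per-pixel logits over the confidence classes:

```python
import torch

@torch.no_grad()
def refine_contour(model, mask, num_classes: int = 8, steps: int = 10,
                   shape=(256, 256)) -> torch.Tensor:
    """Deterministic iterative denoising: start from categorical noise and
    feed each step's argmax prediction back in as the next step's input."""
    tokens = torch.randint(0, num_classes, (1, *shape))   # pure categorical noise
    for t in reversed(range(steps)):
        logits = model(tokens, mask, t)                   # (1, K, H, W)
        tokens = logits.argmax(dim=1)                     # greedy, no resampling
    return tokens
```

Taking the argmax instead of resampling from the predicted categorical distribution is what removes the stochasticity of standard diffusion sampling.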
Post‑Processing
The raw output often contains thick or fragmented lines due to up‑sampling. The authors apply a Gaussian blur, then morphological skeletonization (reducing the contour to a width of one pixel), followed by a morphological closing to fill small gaps. For the Smoke dataset, the longest closed contour is selected and masked with a “fire front” truncation mask derived from the segmentation model.
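Assuming the raw output is a per-pixel confidence map, the blur → skeletonize → close chain can be sketched with SciPy. The Lantuéjoul morphological skeleton stands in for the paper's unspecified thinning routine, and the parameter values are illustrative:

```python
import numpy as np
from scipy.ndimage import (gaussian_filter, binary_erosion,
                           binary_opening, binary_closing)

def morph_skeleton(img: np.ndarray,
                   structure=np.ones((3, 3), bool)) -> np.ndarray:
    """Lantuejoul morphological skeleton: union over n of
    erode^n(A) minus opening(erode^n(A))."""
    skel = np.zeros_like(img, dtype=bool)
    cur = img.astype(bool)
    while cur.any():
        opened = binary_opening(cur, structure)
        skel |= cur & ~opened
        cur = binary_erosion(cur, structure)
    return skel

def clean_contour(raw: np.ndarray, blur_sigma: float = 1.0,
                  thresh: float = 0.5) -> np.ndarray:
    """Blur -> threshold -> skeletonize -> close, per the pipeline above."""
    blurred = gaussian_filter(raw.astype(np.float32), sigma=blur_sigma)
    skel = morph_skeleton(blurred > thresh)          # thin to ~1-pixel lines
    return binary_closing(skel, structure=np.ones((3, 3), bool))
```

The blur merges fragmented strokes before thinning, and the final closing bridges the small gaps that skeletonization can leave behind.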
Experimental Evaluation
Three datasets are used:
- KVASIR (gastrointestinal endoscopy, 200 train / 40 test)
- HAM10K (dermatoscopic skin lesions, 200 train / 40 test)
- Smoke (aerial fire‑smoke imagery, 389 train / 32 test)
Metrics include pixel‑tolerant F1 (tolerance = 10) and shape‑similarity (IoU‑based). The proposed method outperforms SegRefiner and MedSegDiff by 2–5 percentage points on F1 and shows comparable or better shape similarity. Importantly, inference speed improves by a factor of 3.5, achieving roughly 100 FPS on an RTX 5090, making the approach viable for edge devices.
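A pixel-tolerant boundary F1 of this kind is commonly computed with distance transforms; the sketch below is one standard formulation and an assumption about the paper's exact definition:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_f1(pred: np.ndarray, gt: np.ndarray, tol: float = 10.0) -> float:
    """Boundary F1 where a predicted pixel counts as correct if it lies
    within `tol` pixels of the ground-truth contour (and vice versa)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    if not pred.any() or not gt.any():
        return 0.0
    dist_to_gt = distance_transform_edt(~gt)      # distance from any pixel to gt
    dist_to_pred = distance_transform_edt(~pred)
    precision = float((dist_to_gt[pred] <= tol).mean())
    recall = float((dist_to_pred[gt] <= tol).mean())
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

With tolerance 10, a contour shifted by a few pixels still scores perfectly, while a grossly misplaced one scores zero.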
Contributions
- A computationally efficient discrete diffusion pipeline tailored for contour refinement under severe data scarcity.
- Low‑data training tricks: quantized confidence scores, Dice‑only loss, Gumbel noise, EMA, and minimal dropout.
- Demonstrated superiority on medical, dermatological, and wildfire monitoring tasks, with a strong emphasis on translucent objects where traditional methods struggle.
Limitations and Future Work
The current system processes only 2‑D static images; extending to 3‑D volumes or video streams is an open direction. The pipeline also relies heavily on the quality of the initial mask, suggesting a potential end‑to‑end joint training of mask generation and contour refinement. Moreover, exploring higher‑dimensional confidence spaces or adaptive noise schedules could further improve performance on highly complex scenes with multiple overlapping objects.
In summary, the paper presents a novel, low‑resource‑friendly approach to contour refinement that leverages discrete diffusion and an attention‑augmented DUCKNet backbone, achieving state‑of‑the‑art boundary detection in low‑data regimes while maintaining real‑time inference capabilities.