Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Synthetic dataset generation in Computer Vision, particularly for industrial applications, is still underexplored. Industrial defect segmentation, for instance, requires highly accurate labels, yet acquiring such data is costly and time-consuming. To address this challenge, we propose a novel diffusion-based pipeline for generating high-fidelity industrial datasets with minimal supervision. Our approach conditions the diffusion model on enriched bounding box representations to produce precise segmentation masks, ensuring realistic and accurately localized defect synthesis. Compared to existing layout-conditioned generative methods, our approach improves defect consistency and spatial accuracy. We introduce two quantitative metrics to evaluate the effectiveness of our method and assess its impact on a downstream segmentation task trained on real and synthetic data. Our results demonstrate that diffusion-based synthesis can bridge the gap between artificial and real-world industrial data, fostering more reliable and cost-efficient segmentation models. The code is publicly available at https://github.com/covisionlab/diffusion_labeling.


💡 Research Summary

The paper addresses the costly and time‑consuming nature of acquiring pixel‑accurate annotations for industrial defect segmentation. It proposes a conditional diffusion pipeline that requires only cheap bounding‑box annotations to generate paired high‑fidelity RGB images and corresponding segmentation masks. The key innovation lies in two enriched box encodings: (1) a Bounding‑Box‑Aware Signed Distance (BASD) map that stores signed distances to the nearest box edge, and (2) a class‑aware version (C‑BASD) in which the positive region is encoded with analog‑bit vectors representing the defect class. These maps, together with an analog‑bit representation of the ground‑truth segmentation, are concatenated with the image and fed into a UNet‑based DDPM. During training, the model learns to predict the injected Gaussian noise conditioned on the box maps, thereby enforcing spatial consistency between generated defects and their prescribed boxes.
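The BASD encoding can be illustrated with a short sketch. The exact normalization and sign convention are not given in this summary, so the version below is an assumption: pixels inside any box receive the positive distance to the nearest box edge, pixels outside receive the negative Euclidean distance to the nearest box, and overlapping boxes are merged by taking the maximum.

```python
import numpy as np

def basd_map(height, width, boxes):
    """Bounding-Box-Aware Signed Distance map (illustrative sketch).

    boxes: list of (x0, y0, x1, y1) in pixel coordinates.
    Pixels inside a box get the positive distance to the nearest box
    edge; pixels outside get the negative Euclidean distance to the
    nearest box. Overlapping boxes are merged via the maximum.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    sd = np.full((height, width), -np.inf)
    for x0, y0, x1, y1 in boxes:
        # Distance components when the pixel lies outside the box.
        dx = np.maximum(np.maximum(x0 - xs, xs - x1), 0)
        dy = np.maximum(np.maximum(y0 - ys, ys - y1), 0)
        outside = -np.sqrt(dx ** 2 + dy ** 2)
        # Distance to the nearest edge when the pixel lies inside.
        inside = np.minimum(np.minimum(xs - x0, x1 - xs),
                            np.minimum(ys - y0, y1 - ys))
        box_sd = np.where((dx == 0) & (dy == 0), inside, outside)
        sd = np.maximum(sd, box_sd)  # union over all boxes
    return sd
```

The class‑aware C‑BASD variant would then replace the scalar positive values with per‑class analog‑bit channels; the single‑channel version above only conveys the spatial idea.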

Experiments use the Wood Defect Detection dataset, containing over 20,000 images of five defect types. The diffusion model is trained for 300 epochs (≈1 day on two RTX 4090 GPUs) with 1000 denoising steps. Quantitative evaluation compares the proposed method against a state‑of‑the‑art layout‑conditioned diffusion model. While standard image quality metrics (FID, KID, LPIPS) are comparable, the two new metrics, Segmentation Alignment Error (SAE) and Empty Bounding‑Box Rate (EBR), show dramatic improvements: mean SAE drops from 46.77 % to 4.99 % and EBR from over 13 % to below 6 %. Downstream segmentation experiments demonstrate that models trained with the synthetic data achieve higher mIoU on a real test set, confirming the practical utility of the generated samples.
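As a rough illustration of what these two metrics measure (the paper's exact formulas are not reproduced in this summary, so the definitions below are assumptions), SAE can be read as the fraction of generated defect‑mask pixels falling outside every conditioning box, and EBR as the fraction of boxes whose interior contains no generated defect at all:

```python
import numpy as np

def sae(mask, boxes):
    """Segmentation Alignment Error (illustrative, assumed definition):
    share of generated defect pixels lying outside all conditioning boxes."""
    inside = np.zeros_like(mask, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        inside[y0:y1, x0:x1] = True
    defect = mask.astype(bool)
    if defect.sum() == 0:
        return 0.0
    return float((defect & ~inside).sum() / defect.sum())

def ebr(mask, boxes):
    """Empty Bounding-Box Rate (illustrative, assumed definition):
    share of conditioning boxes containing no generated defect pixels."""
    defect = mask.astype(bool)
    empty = sum(1 for x0, y0, x1, y1 in boxes
                if not defect[y0:y1, x0:x1].any())
    return float(empty / len(boxes)) if boxes else 0.0
```

Under these readings, a low SAE means generated defects stay within their prescribed boxes, and a low EBR means few boxes are "wasted" without any synthesized defect, which matches the reported improvements.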

The authors release code and pretrained models, enabling reproducibility and facilitating adoption in real industrial inspection pipelines. Overall, the work demonstrates that bounding‑box‑guided diffusion can produce realistic, accurately localized defect images and masks with minimal supervision, substantially reducing annotation costs while maintaining or improving downstream performance.

