A Comparative Analysis of Semiconductor Wafer Map Defect Detection with Image Transformer

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Predictive maintenance is an important practice in modern industry, improving fault detection and reducing costs. Machine learning algorithms can make the defect detection process considerably smoother. Semiconductor manufacturing is a particularly sensitive field that demands predictable maintenance. While convolutional neural networks (CNNs) such as VGG-19, Xception, and SqueezeNet have demonstrated solid performance in image classification for the semiconductor wafer industry, their effectiveness often declines in scenarios with limited and imbalanced data. This study investigates the use of the Data-Efficient Image Transformer (DeiT) for classifying wafer map defects under data-constrained conditions. Experimental results reveal that the DeiT model achieves the highest classification accuracy, 90.83%, outperforming CNN models such as VGG-19 (65%), SqueezeNet (82%), Xception (66%), and a hybrid model (67%). DeiT also demonstrated a superior F1-score (90.78%) and faster training convergence, with enhanced robustness in detecting minority defect classes. These findings highlight the potential of transformer-based models such as DeiT for semiconductor wafer defect detection and support predictive maintenance strategies within semiconductor fabrication processes.


💡 Research Summary

The paper investigates the use of a Data‑Efficient Image Transformer (DeiT) for classifying semiconductor wafer map defects under conditions of limited and imbalanced data, and compares its performance against several convolutional neural network (CNN) architectures. The authors focus on predictive maintenance (PdM) in semiconductor manufacturing, where early detection of wafer defects can reduce equipment downtime and improve yield.

The dataset employed is the publicly available WM‑811k augmented and pre‑processed wafer map collection, which contains 9,000 images evenly distributed across nine defect categories: Center, Donut, Edge‑Localized, Edge‑Ring, Local, Near‑Full, Random, Scratch, and None (no defect). The authors manually split the data into training, validation, and test sets (approximately 70%/15%/15%). All models receive the same preprocessing (normalization, resizing to 224 × 224) to ensure a fair comparison.
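As a rough sketch of the split described above (the file names, seed, and helper below are illustrative assumptions, not details from the paper), a 70%/15%/15% partition of the 9,000-image set could look like this:

```python
import random

def split_dataset(samples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a list of samples and split it into train/val/test subsets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder (~15%) becomes the test set
    return train, val, test

# 9,000 wafer maps (1,000 per class) as in the augmented WM-811k subset
samples = [f"wafer_{i:04d}.png" for i in range(9000)]
train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))  # 6300 1350 1350
```

Keeping the seed fixed ensures every model in the comparison sees exactly the same training, validation, and test images.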

Five models are evaluated:

  1. DeiT‑Base – a Vision Transformer that adds a distillation token to the standard self‑attention architecture, designed for data‑efficient training.
  2. VGG‑19 – a deep CNN, fine‑tuned from ImageNet weights.
  3. Xception – a depthwise‑separable convolution network.
  4. SqueezeNet – an ultra‑compact CNN whose weights fit in roughly 0.5 MB.
  5. Hybrid – a custom architecture that combines convolutional blocks with a small Transformer encoder.
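DeiT's distinguishing feature (model 1 above) is the distillation token, which is trained against a teacher network's predictions rather than the ground-truth label. A minimal, framework-free sketch of the hard-distillation objective, assuming per-sample logit lists and integer labels (all function names here are illustrative, not from the paper):

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class (stable form)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def deit_hard_distillation_loss(cls_logits, dist_logits, label, teacher_logits):
    """Average of (a) CE of the class token vs. the true label and
    (b) CE of the distillation token vs. the teacher's hard prediction."""
    # The teacher's argmax acts as a pseudo-label for the distillation token
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return (0.5 * cross_entropy(cls_logits, label)
            + 0.5 * cross_entropy(dist_logits, teacher_label))
```

In the original DeiT recipe the teacher is a CNN, which injects a convolutional inductive bias into the transformer and is a key reason it trains well on small datasets.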

Performance metrics include overall accuracy, macro‑averaged precision, recall, F1‑score, and confusion matrices. Training efficiency is also reported as average time per training step.
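The macro-averaged metrics listed above can all be derived from a single confusion matrix. A minimal sketch (rows = true class, columns = predicted class; dictionary-free for clarity):

```python
def macro_metrics(cm):
    """Macro-averaged precision, recall, and F1 from a square confusion
    matrix, where cm[i][j] = number of class-i samples predicted as class j."""
    k = len(cm)
    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(k)) - tp  # column sum minus diagonal
        fn = sum(cm[c]) - tp                        # row sum minus diagonal
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    # Macro averaging weights every class equally, regardless of support
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k
```

Because macro averaging weights every class equally, it penalizes a model that sacrifices minority defect classes, which is exactly the failure mode this study is probing.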

Results:

  • DeiT achieves the highest overall accuracy (90.83 %) and F1‑score (90.78 %). Its macro‑averaged precision and recall are both around 0.91, indicating balanced performance across all classes.
  • SqueezeNet follows with 82 % accuracy, while VGG‑19, Xception, and the Hybrid model reach 65 %, 66 %, and 67 % respectively.
  • In terms of training speed, DeiT requires only 0.035 seconds per step, markedly faster than VGG‑19 (0.263 s/step) and comparable to the other lightweight models.
  • Confusion‑matrix analysis shows DeiT correctly classifies the majority of samples in each class, including the minority classes (Donut, Edge‑Ring, Scratch) where CNNs suffer notable drops in recall.

Interpretation:
The self‑attention mechanism in DeiT captures global patterns in wafer maps more effectively than the locality‑biased convolutions of CNNs, which explains its robustness to class imbalance. The data‑efficient training strategy (distillation token, reduced parameter count) also prevents over‑fitting when the training set is small, leading to rapid convergence.
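The "global patterns" argument can be made concrete: in self-attention, every patch attends to every other patch in a single layer, whereas a convolution only mixes a local neighborhood. A dependency-free sketch of scaled dot-product attention over patch embeddings (toy dimensions, illustrative only):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(q, k, v):
    """Scaled dot-product attention: each query row is a weighted mix of
    *all* value rows, i.e. a global receptive field in one layer."""
    d = len(q[0])
    out = []
    for qi in q:
        # Similarity of this patch's query to every patch's key
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[t] for w, vj in zip(weights, v))
                    for t in range(len(v[0]))])
    return out
```

A wafer-map defect like Edge-Ring is defined by pixels far apart on the wafer; this global mixing is the plausible reason DeiT handles such classes better than locality-biased convolutions.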

Limitations and Future Work:

  • The dataset is artificially augmented; real‑world production data with noise and varying illumination were not tested.
  • Real‑time inference latency on edge hardware (e.g., FPGA, ASIC) was not measured, despite the importance of low‑latency predictions for on‑line PdM.
  • The study does not explore newer lightweight Vision Transformer variants (Tiny‑ViT, MobileViT) that could further reduce computational load.

Future research directions suggested include: (a) deploying online learning pipelines for streaming sensor data, (b) evaluating transformer‑based models on edge devices, (c) integrating multimodal inputs (thermal, current) for richer defect characterization, and (d) conducting ablation studies on the impact of data augmentation strategies.

Conclusion:
The study provides empirical evidence that a data‑efficient transformer model can outperform traditional CNNs in semiconductor wafer defect classification, especially when training data are scarce and class distributions are skewed. This finding supports the adoption of transformer‑based architectures in predictive maintenance systems for semiconductor fabrication, offering both higher accuracy and faster training convergence.

