From Images to Decisions: Assistive Computer Vision for Non-Metallic Content Estimation in Scrap Metal


Scrap quality directly affects energy use, emissions, and safety in steelmaking. Today, the share of non-metallic inclusions (contamination) is judged visually by inspectors, an approach that is subjective and hazardous due to dust and moving machinery. We present an assistive computer vision pipeline that estimates contamination (in percent) from images captured during railcar unloading and also classifies the scrap type. The method formulates contamination assessment as a regression task at the railcar level and leverages sequential data through multi-instance learning (MIL) and multi-task learning (MTL). The best MIL model achieves MAE 0.27 and R² 0.83, while an MTL setup reaches MAE 0.36 with F1 0.79 for scrap class. We also deploy the system in near real time within the acceptance workflow: magnet/railcar detection segments temporal layers, a versioned inference service produces railcar-level estimates with confidence scores, and operators review results with structured overrides; corrections and uncertain cases feed an active-learning loop for continual improvement. The pipeline reduces subjective variability, improves human safety, and enables integration into acceptance and melt-planning workflows.


💡 Research Summary

The paper addresses a critical bottleneck in steelmaking: the manual, subjective, and hazardous assessment of non‑metallic inclusions (contamination) in scrap metal during railcar unloading. The authors propose an end‑to‑end computer‑vision pipeline that automatically estimates the percentage of contamination and simultaneously classifies the scrap grade from high‑resolution video captured at the unloading point.

Key contributions include: (1) framing contamination assessment as a regression problem at the railcar level while treating each unload “layer” (magnet grab) as an instance within a bag, enabling the use of Multi‑Instance Learning (MIL); (2) extending the model to Multi‑Task Learning (MTL) so that the same backbone predicts both contamination (regression) and scrap class (classification); (3) a thorough experimental comparison of CNN backbones (EfficientNet, ResNet‑50, ResNeXt‑101) and Transformer‑based backbones (ViT‑B, Swin‑Transformer‑B), all pre‑trained on ImageNet‑1K; (4) a production‑ready architecture that includes magnet‑railcar detection, temporal segmentation, versioned inference services, confidence scoring, operator review with structured overrides, and an active‑learning loop for continual improvement.

The dataset comprises 58,574 annotated images from 40 cameras covering 2,000 railcars (≈90,000 t of scrap). Each railcar is labeled by three independent inspectors; the mean of their scores is used as ground truth, and samples with high inter-annotator variance are flagged for re-annotation. The data are split by railcar to avoid leakage (44,092 training layers, 8,575 validation, and 5,907 test).
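The annotation protocol above (mean of three inspector scores as ground truth, high-spread samples flagged for re-annotation) can be sketched in a few lines; note that the flagging threshold here is a hypothetical value, not one taken from the paper:

```python
from statistics import mean, pstdev

def aggregate_label(scores, flag_threshold=1.0):
    """Return (ground_truth, needs_reannotation) for one railcar.

    `scores` are the contamination estimates (in %) from the three
    inspectors; `flag_threshold` is an assumed cutoff on the spread.
    """
    gt = mean(scores)              # mean score becomes the ground truth
    spread = pstdev(scores)        # inter-annotator spread (population std dev)
    return gt, spread > flag_threshold

# Tight agreement: keep the label as-is.
label, flagged = aggregate_label([2.0, 2.4, 2.2])
# Strong disagreement: send the railcar back for re-annotation.
label2, flagged2 = aggregate_label([1.0, 4.0, 2.5])
```

Splitting by railcar (rather than by image) then guarantees that layers from the same railcar never appear in both training and test sets.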

In the MIL setting, each railcar bag contains a variable number of layers. An encoder extracts features per layer, an attention module computes instance weights, and a weighted sum yields a bag representation that feeds a regression head. The Swin‑Transformer MIL model achieves MAE = 0.27 % and R² = 0.83, outperforming CNN baselines (e.g., EfficientNet MAE = 1.43 %).
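The attention-based pooling step can be illustrated with a minimal plain-Python sketch. This is a toy version, not the authors' implementation: the dot-product scoring vector `w_score` is a simplifying assumption, and in the paper the per-layer features come from a learned encoder (e.g., a Swin Transformer) rather than being hand-supplied:

```python
import math

def attention_mil_pool(instance_feats, w_score):
    """Attention-pool a bag of per-layer feature vectors.

    instance_feats: list of feature vectors, one per unload layer.
    w_score: toy attention parameter vector (an assumption for illustration).
    Returns the bag representation and the per-instance attention weights.
    """
    # Raw attention score per instance: dot product with w_score.
    scores = [sum(f_i * w_i for f_i, w_i in zip(f, w_score))
              for f in instance_feats]
    # Numerically stable softmax over instances.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    # Weighted sum of instance features -> bag representation.
    dim = len(instance_feats[0])
    bag = [sum(a * f[d] for a, f in zip(alphas, instance_feats))
           for d in range(dim)]
    return bag, alphas

bag, alphas = attention_mil_pool([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

The bag representation would then feed the regression head that outputs the railcar-level contamination percentage.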

For MTL, the same attention‑pooled representation is fed to both a regression head and a classification head. The Swin‑MTL model reaches MAE = 0.36 % (R² = 0.78) and a scrap‑grade F1 = 0.79, surpassing single‑task models and demonstrating the benefit of shared representations. Grad‑CAM visualizations show that Transformers focus tightly on metal pieces while ignoring dust and lighting, explaining their superior regression performance.
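A common way to train such a two-headed model is a weighted sum of the two task losses. The sketch below assumes an L1 regression term and a cross-entropy classification term with a weighting coefficient `lam`; both the loss choices and `lam` are illustrative assumptions, not details confirmed by the paper:

```python
import math

def mtl_loss(pred_pct, true_pct, class_logits, true_class, lam=0.5):
    """Joint loss for the two heads sharing one bag representation.

    pred_pct / true_pct: predicted and true contamination percentage.
    class_logits: raw scores from the scrap-grade classification head.
    lam: assumed task-weighting coefficient.
    """
    reg_loss = abs(pred_pct - true_pct)  # L1 term for the regression head
    # Cross-entropy over a numerically stable softmax of the logits.
    m = max(class_logits)
    exps = [math.exp(l - m) for l in class_logits]
    z = sum(exps)
    ce_loss = -math.log(exps[true_class] / z)
    return reg_loss + lam * ce_loss
```

Because both heads backpropagate into the same attention-pooled representation, the classification signal can regularize the regression features, which is consistent with the MTL gains reported above.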

The system operates in near real‑time: a magnet position detector segments the video into grab intervals, keyframes are sampled, and the bag is sent to a versioned ML service. Results with confidence scores are stored in a database, displayed to operators for verification, and any corrections are fed back through an active‑learning pipeline that updates the model registry and experiment tracking.
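The temporal-segmentation step can be sketched as grouping per-frame magnet detections into contiguous "grab" intervals, from which keyframes are later sampled. This is a hypothetical simplification: the paper's detector localizes the magnet and railcar in each frame, whereas here the detector output is reduced to a boolean per frame:

```python
def grab_intervals(magnet_over_car):
    """Group per-frame booleans into (start, end) grab intervals.

    magnet_over_car: one boolean per video frame, True while the magnet
    is over the railcar. `end` is exclusive.
    """
    intervals, start = [], None
    for i, present in enumerate(magnet_over_car):
        if present and start is None:
            start = i                     # a new grab begins
        elif not present and start is not None:
            intervals.append((start, i))  # the grab ended on frame i-1
            start = None
    if start is not None:                 # grab still open at end of video
        intervals.append((start, len(magnet_over_car)))
    return intervals

# Two grabs: frames 1-2 and frame 4.
print(grab_intervals([False, True, True, False, True]))
```

Each interval then becomes one instance ("layer") in the railcar's bag, so the MIL model sees exactly the sequential structure the unloading process produces.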

Industrial impact is twofold: (a) it dramatically reduces inter‑inspector variability (down to ~0.2 % MAE) and eliminates the need for personnel to work in dusty, high‑magnetic‑field zones, improving safety; (b) accurate contamination estimates can be directly fed into Electric Arc Furnace (EAF) load‑planning, lowering energy consumption and CO₂ emissions.

Overall, the paper delivers a comprehensive, data‑driven solution that bridges the gap between research‑grade computer vision and real‑world steel recycling operations, demonstrating that Transformer‑based MIL/MTL architectures are well‑suited for sequential, weakly‑labeled industrial vision tasks.

