MSEG-VCUQ: Multimodal SEGmentation with Enhanced Vision Foundation Models, Convolutional Neural Networks, and Uncertainty Quantification for High-Speed Video Phase Detection Data

High-speed video (HSV) phase detection (PD) segmentation is crucial for monitoring vapor, liquid, and microlayer phases in industrial processes. While CNN-based models like U-Net have shown success in simplified shadowgraphy-based two-phase flow (TPF) analysis, their application to complex HSV PD tasks remains unexplored, and vision foundation models (VFMs) have yet to address the complexities of either shadowgraphy-based or PD TPF video segmentation. Existing uncertainty quantification (UQ) methods lack pixel-level reliability for critical metrics like contact line density and dry area fraction, and the absence of large-scale, multimodal experimental datasets tailored to PD segmentation further impedes progress. To address these gaps, we propose MSEG-VCUQ, a hybrid framework that integrates U-Net CNNs with the transformer-based Segment Anything Model (SAM) to achieve enhanced segmentation accuracy and cross-modality generalization. Our approach incorporates systematic UQ for robust error assessment and introduces the first open-source multimodal HSV PD datasets. Empirical results demonstrate that MSEG-VCUQ outperforms baseline CNNs and VFMs, enabling scalable and reliable PD segmentation for real-world boiling dynamics.


💡 Research Summary

The paper introduces MSEG‑VCUQ, a comprehensive framework for segmenting high‑speed video (HSV) phase‑detection (PD) data and quantifying the associated uncertainties. Traditional convolutional neural networks such as U‑Net have demonstrated success on simplified shadowgraphy images but have not been applied to the more complex PD imagery that contains overlapping bubbles, microlayer phases, and rapid dynamics. Likewise, Vision Foundation Models (VFMs) like the Segment Anything Model (SAM) excel in general‑purpose segmentation of natural images but have not been tailored to scientific boiling experiments. To bridge these gaps, the authors propose a hybrid architecture called VideoSAM that combines the strengths of both approaches.

In the first stage, a U‑Net model pre‑trained on biological cell images is fine‑tuned for each fluid modality (water, FC‑72, nitrogen, argon). This network produces an initial mask that captures primary liquid‑vapor boundaries with high spatial fidelity. In the second stage, the initial mask and the original frame are fed into a SAM‑based transformer that refines the mask using self‑attention mechanisms, thereby improving robustness to overlapping bubbles, varying illumination, and diverse heat‑flux conditions. The resulting two‑step pipeline yields pixel‑level segmentation that outperforms either component alone.
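The two-stage composition can be sketched in a few lines. This is a minimal structural sketch only: the real stage one is a fine-tuned U-Net and the real stage two is a SAM-based transformer; here a simple intensity threshold and a 3×3 majority vote stand in as placeholders so the pipeline shape is runnable.

```python
import numpy as np

def coarse_segment(frame, thresh=0.5):
    # Stage-1 stand-in: the paper uses a fine-tuned U-Net here;
    # a fixed intensity threshold is our placeholder assumption.
    return (frame > thresh).astype(np.uint8)

def refine_mask(frame, mask):
    # Stage-2 stand-in: SAM-style refinement is approximated by a
    # 3x3 majority vote over the coarse mask (placeholder only).
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="edge")
    votes = sum(padded[i:i + h, j:j + w]
                for i in range(3) for j in range(3))
    return (votes >= 5).astype(np.uint8)

def videosam_pipeline(frame):
    # Coarse CNN-style mask first, then transformer-style refinement.
    return refine_mask(frame, coarse_segment(frame))

frame = np.random.default_rng(0).random((64, 64))
mask = videosam_pipeline(frame)
```

The key design point carried over from the paper is that the refinement stage sees both the frame and the initial mask, so the second model corrects boundaries rather than segmenting from scratch.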

A major contribution of the work is the creation and open‑source release of a large multimodal HSV PD dataset. The dataset comprises 25,500 frame‑mask pairs collected under saturated pool boiling (SPB) and flow boiling (FB) conditions across four fluids, with heat fluxes ranging from 120 kW m⁻² to 3,000 kW m⁻². Each video is pre‑processed through grayscale conversion, patch extraction, and semi‑automated annotation to ensure consistency. This dataset addresses the scarcity of realistic, high‑resolution boiling data and provides a solid foundation for training and evaluating both CNN‑based and transformer‑based models.
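The grayscale-conversion and patch-extraction steps can be illustrated as follows. The luminance weights and the 128-pixel patch size are our illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def to_grayscale(rgb):
    # Rec. 601 luminance weights (an assumption; the paper's exact
    # conversion is not specified in this summary).
    return rgb @ np.array([0.299, 0.587, 0.114])

def extract_patches(img, size=128, stride=128):
    # Tile a frame into non-overlapping square patches; patch size
    # and stride here are illustrative only.
    h, w = img.shape
    return [img[y:y + size, x:x + size]
            for y in range(0, h - size + 1, stride)
            for x in range(0, w - size + 1, stride)]

frame = np.random.default_rng(1).random((256, 384, 3))
patches = extract_patches(to_grayscale(frame))
```

A 256×384 frame yields a 2×3 grid of six 128×128 patches; the semi-automated annotation step in the paper then pairs each patch with a mask.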

Beyond segmentation, the framework incorporates a rigorous uncertainty quantification (UQ) module that evaluates discretization errors for key boiling metrics: dry‑area fraction and contact‑line density. By performing Monte Carlo simulations and weighted frequency analysis on pixel‑level predictions, the authors generate confidence intervals for these metrics, offering a level of interpretability that generic UQ methods lack.
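A minimal Monte Carlo sketch of this idea, for the dry-area fraction, is shown below. Treating each pixel's predicted probability as an independent Bernoulli draw is our simplifying assumption; the paper's weighted frequency analysis is more involved.

```python
import numpy as np

def mc_confidence_interval(probs, n_samples=2000, alpha=0.05, seed=0):
    # Monte Carlo UQ sketch: sample binary masks from per-pixel
    # probabilities (assumed independent -- our assumption, not the
    # paper's), compute the dry-area fraction of each sample, and
    # take percentile bounds as a confidence interval.
    rng = np.random.default_rng(seed)
    draws = rng.random((n_samples,) + probs.shape) < probs
    fractions = draws.reshape(n_samples, -1).mean(axis=1)
    lo, hi = np.quantile(fractions, [alpha / 2, 1 - alpha / 2])
    return fractions.mean(), (lo, hi)

# Hypothetical per-pixel dry-area probabilities for one frame.
probs = np.clip(np.random.default_rng(2).normal(0.3, 0.1, (64, 64)), 0, 1)
mean_daf, (lo, hi) = mc_confidence_interval(probs)
```

The same resampling loop applies to any pixel-derived metric, which is what lets the framework attach intervals to contact-line density as well.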

Experimental results demonstrate three key findings. First, when heat flux exceeds ~140 kW m⁻², the U‑Net‑based component of VideoSAM detects finer vapor structures, leading to higher measured dry‑area fractions and contact‑line densities compared with traditional adaptive thresholding. Second, three‑dimensional histograms of bubble‑size versus heat flux reveal that VideoSAM captures a smooth transition from small to large bubbles as flux increases, whereas thresholding fails to identify many small bubbles at low fluxes. Third, the pixel‑level UQ analysis shows increasing uncertainty in metric estimates at higher fluxes, providing actionable insight for experimental design and model refinement.
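For concreteness, the two metrics compared in these experiments can be computed from a binary mask as sketched below. The transition-counting proxy for interface length is our discretization choice for illustration; the paper's exact estimator may differ.

```python
import numpy as np

def dry_area_fraction(mask):
    # Fraction of pixels classified as dry (vapor-contact) area.
    return mask.mean()

def contact_line_density(mask, pixel_size=1.0):
    # Proxy for contact-line density: count dry/wet transitions along
    # rows and columns as interface length, divided by imaged area.
    # (Illustrative discretization, not the paper's exact scheme.)
    dx = np.abs(np.diff(mask.astype(int), axis=1)).sum()
    dy = np.abs(np.diff(mask.astype(int), axis=0)).sum()
    length = (dx + dy) * pixel_size
    area = mask.size * pixel_size ** 2
    return length / area

# A single 32x32 square dry patch in a 64x64 frame.
mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 16:48] = 1
daf = dry_area_fraction(mask)
cld = contact_line_density(mask)
```

Because a finer segmentation resolves more small vapor structures, both metrics grow when VideoSAM replaces adaptive thresholding at high heat flux, which is the effect reported in the first finding above.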

Overall, MSEG‑VCUQ advances the state of the art in boiling heat‑transfer research by (i) integrating CNN and VFM technologies into a unified, high‑performance segmentation pipeline, (ii) supplying the community with the first large‑scale multimodal HSV PD dataset, and (iii) delivering metric‑specific uncertainty estimates that enhance the reliability of derived physical parameters. The authors suggest future work on real‑time inference optimization, incorporation of additional surface‑property variables, and multi‑task learning to further broaden the applicability of the framework in autonomous experimentation and high‑fidelity thermal modeling.

