Data-driven inverse uncertainty quantification: application to Chemical Vapor Deposition Reactor Modeling
This study presents a Bayesian framework for (inverse) uncertainty quantification and parameter estimation in a two-step Chemical Vapor Deposition coating process using production data. We develop an XGBoost surrogate model that maps reactor setup parameters to coating thickness measurements, enabling efficient Bayesian analysis while reducing sampling costs. The methodology handles mixed data types, including continuous, discrete integer, binary, and encoded categorical variables. We establish parameter prior distributions through Bayesian Model Selection and perform Inverse Uncertainty Quantification via weighted Approximate Bayesian Computation with summary statistics, providing robust parameter credible intervals while filtering measurement noise across multiple reactor locations. Furthermore, we employ clustering methods guided by geometry embeddings to focus analysis within homogeneous production groups. This integrated approach provides a validated tool for improving industrial process control under uncertainty.
💡 Research Summary
The paper presents a comprehensive Bayesian framework for inverse uncertainty quantification (UQ) and parameter estimation in a two‑step chemical vapor deposition (CVD) coating process, leveraging real production data. Recognizing the challenges posed by high‑dimensional, mixed‑type input spaces (continuous, integer, binary, and encoded categorical variables), the authors adopt a data‑driven approach that integrates machine learning surrogate modeling, information‑theoretic prior selection, and likelihood‑free inference.
First, an XGBoost regressor is trained to map reactor setup parameters to measured coating thicknesses. XGBoost’s ability to handle heterogeneous inputs without extensive preprocessing, combined with its built‑in feature‑importance (total_gain) metrics, enables rapid dimensionality reduction and identification of the most influential process variables.
Second, the prior distributions for each input parameter are derived through Bayesian model selection. Candidate probability families (normal, log‑normal, gamma, etc.) are fitted to the empirical data, and the Akaike Information Criterion (AIC) is used to select the best model. The selected distribution’s hyper‑parameters are then estimated via Markov Chain Monte Carlo (MCMC), providing a data‑driven prior π₀(x|θ̂).
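The family-selection step can be sketched in plain Python: fit each candidate family by maximum likelihood and compare AIC = 2k − 2 ln L. This toy version covers only normal and log-normal families, uses direct MLE rather than the paper's MCMC hyper-parameter estimation, and runs on synthetic data.

```python
# Hedged sketch of AIC-based prior family selection (normal vs. log-normal only;
# the paper also considers gamma and other families and refines via MCMC).
import math
import random

def normal_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def fit_aic(samples):
    """MLE-fit candidate families and return AIC = 2k - 2*lnL for each."""
    n = len(samples)
    # Normal family: MLE mean and standard deviation
    mu = sum(samples) / n
    sigma = math.sqrt(sum((x - mu)**2 for x in samples) / n)
    ll_norm = sum(normal_logpdf(x, mu, sigma) for x in samples)
    # Log-normal family: normal MLE on log-data, plus the Jacobian term -log(x)
    logs = [math.log(x) for x in samples]
    m = sum(logs) / n
    s = math.sqrt(sum((l - m)**2 for l in logs) / n)
    ll_lognorm = sum(normal_logpdf(l, m, s) for l in logs) - sum(logs)
    k = 2  # both families have two parameters
    return {"normal": 2 * k - 2 * ll_norm, "lognormal": 2 * k - 2 * ll_lognorm}

random.seed(0)
data = [math.exp(random.gauss(1.0, 0.6)) for _ in range(500)]  # truly log-normal
aics = fit_aic(data)
best = min(aics, key=aics.get)  # lowest AIC wins
```

Since both candidates here have the same number of parameters, the comparison reduces to likelihood; AIC's penalty term matters when families of different complexity compete.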
Third, because the surrogate model’s likelihood is intractable, the authors employ a weighted Approximate Bayesian Computation (ABC) scheme. For each draw from the prior, synthetic observations y_sim are generated using the XGBoost surrogate and residuals sampled from an error distribution π_ε estimated on a test set. Summary statistics (mean, standard deviation, median, quartiles) are computed for both simulated and real data, and Euclidean distances between these statistic vectors are calculated. A kernel function, parameterized by π_ε, assigns a weight w_i = K(d_i,π_ε) to every simulation, rather than discarding those beyond a fixed tolerance as in rejection ABC. Normalized weights produce a weighted empirical posterior, and the effective sample size (ESS = 1/∑w_i²) quantifies posterior concentration.
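The weighted ABC loop can be illustrated end to end with a one-parameter toy problem. A linear function stands in for the XGBoost surrogate, π_ε is assumed Gaussian, and the kernel is a Gaussian in the summary-statistic distance; all of these are simplifying assumptions for the sketch.

```python
# Hedged sketch of weighted ABC with summary statistics and ESS.
# surrogate() is a stand-in for the trained XGBoost model; the Gaussian kernel
# bandwidth reuses the error scale eps_sigma as a stand-in for pi_eps.
import math
import random
import statistics

random.seed(1)

def surrogate(theta):
    # Toy thickness response to one setup parameter
    return 3.0 * theta + 1.0

def summaries(ys):
    ys = sorted(ys)
    q1, q3 = ys[len(ys) // 4], ys[3 * len(ys) // 4]
    return (statistics.mean(ys), statistics.stdev(ys), statistics.median(ys), q1, q3)

# "Observed" thicknesses at several reactor locations (synthetic here)
true_theta, eps_sigma = 2.0, 0.3
observed = [surrogate(true_theta) + random.gauss(0, eps_sigma) for _ in range(40)]
s_obs = summaries(observed)

# Every simulation keeps a kernel weight instead of being hard-rejected
draws, weights = [], []
for _ in range(5000):
    theta = random.uniform(0.0, 4.0)  # prior draw
    y_sim = [surrogate(theta) + random.gauss(0, eps_sigma) for _ in range(40)]
    d = math.dist(summaries(y_sim), s_obs)          # Euclidean distance of stats
    w = math.exp(-0.5 * (d / eps_sigma) ** 2)       # Gaussian kernel K(d, pi_eps)
    draws.append(theta)
    weights.append(w)

total = sum(weights)
weights = [w / total for w in weights]
post_mean = sum(t * w for t, w in zip(draws, weights))
ess = 1.0 / sum(w * w for w in weights)             # ESS = 1 / sum(w_i^2)
```

Because every draw contributes according to its kernel weight, no simulation is wasted; a rejection-ABC run would discard all draws whose distance exceeds a fixed tolerance, which is the efficiency gap the ESS comparison in the results quantifies.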
To handle complex categorical information (e.g., geometric descriptions of trays and inserts), the framework incorporates Doc2Vec embeddings. These dense vectors allow similarity‑based clustering of production runs; homogeneous clusters are then analyzed separately, reducing noise and improving the precision of posterior estimates.
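Downstream of the embedding step, the clustering is similarity-based. The sketch below assumes the Doc2Vec vectors have already been computed (the run names and 2-D vectors are hypothetical) and groups runs greedily by cosine similarity; the paper's actual clustering method may differ.

```python
# Hedged sketch: similarity-based grouping of production runs, given
# precomputed (here invented, 2-D) Doc2Vec-style geometry embeddings.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_cluster(embeddings, threshold=0.95):
    """Attach each run to the first cluster whose representative (its first
    member) is similar enough; otherwise open a new cluster."""
    clusters = []  # each cluster: list of (run_id, vector)
    for run_id, vec in embeddings.items():
        for cluster in clusters:
            if cosine(vec, cluster[0][1]) >= threshold:
                cluster.append((run_id, vec))
                break
        else:
            clusters.append([(run_id, vec)])
    return clusters

# Hypothetical embeddings for four production runs, two tray designs
runs = {
    "tray_A1": [0.90, 0.10],
    "tray_A2": [0.85, 0.15],  # geometry similar to A1
    "tray_B1": [0.10, 0.95],
    "tray_B2": [0.12, 0.90],  # geometry similar to B1
}
clusters = greedy_cluster(runs, threshold=0.95)
```

Running the inverse-UQ pipeline separately inside each resulting cluster is what reduces the cross-design noise described above.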
The methodology is validated on a commercial SCOTEC SCT600TH CVD reactor used to deposit a 9 µm Ti(C,N) base layer followed by an α‑Al₂O₃ top layer on cemented‑carbide cutting inserts. Approximately 30 process variables—including tray geometry, gas flow rates, temperature, and pressure—were collected across multiple production batches. The XGBoost surrogate achieved R² ≈ 0.92 and RMSE ≈ 3 µm (≈10 % of the average coating thickness) in 10‑fold cross‑validation. Prior selection favored log‑normal and gamma families for most variables, and MCMC converged after a few thousand iterations. Weighted ABC generated 50,000 simulations, yielding an ESS of about 8,200—more than five times the efficiency of a comparable rejection ABC run. Posterior distributions aligned well with observed variability, especially highlighting the impact of tray geometry on thickness uniformity. Clustering based on embeddings identified groups of trays with similar designs; within each cluster, posterior uncertainty narrowed substantially, enabling the derivation of refined control rules that reduced thickness deviation from an average of 12 % to 6 %.
In conclusion, the authors demonstrate that (1) XGBoost can serve as an accurate, computationally cheap surrogate for complex industrial processes; (2) information‑theoretic model selection yields realistic, data‑driven priors; (3) weighted ABC provides an efficient, likelihood‑free inference mechanism that fully exploits simulated data; and (4) embedding‑driven clustering isolates homogeneous production regimes, enhancing inference robustness. Limitations include the black‑box nature of tree ensembles, sensitivity to the choice of summary statistics, and the need for further automation of embedding and clustering pipelines. Future work is suggested to explore deep‑learning surrogates, adaptive summary statistic selection, and real‑time deployment in manufacturing environments.