Predictive Inorganic Synthesis Based on Machine Learning Using Small Data Sets: A Case Study of Size-Controlled Cu Nanoparticles
Copper nanoparticles (Cu NPs) have broad applicability, yet their synthesis is sensitive to subtle changes in reaction parameters. This sensitivity, combined with the time- and resource-intensive nature of experimental optimization, poses a major challenge to achieving reproducible, size-controlled synthesis. While Machine Learning (ML) shows promise in materials research, its application is often limited by the scarcity of large, high-quality experimental data sets. This study explores ML to predict the size of Cu NPs from microwave-assisted polyol synthesis using a small data set of 25 syntheses performed in-house. Latin Hypercube Sampling is used to efficiently cover the parameter space when creating the experimental data set. Ensemble regression models successfully predict particle sizes with high accuracy ($R^2 = 0.74$), outperforming classical statistical approaches ($R^2 = 0.60$). Additionally, classification models using both random forests and Large Language Models (LLMs) are evaluated to distinguish between large and small particles. While random forests show moderate performance, LLMs offer no significant advantage under data-scarce conditions. Overall, this study demonstrates that carefully curated small data sets, paired with robust classical ML, can effectively predict the synthesis of Cu NPs, and it highlights that for lab-scale studies, complex models such as LLMs may offer limited benefit over simpler techniques.
💡 Research Summary
This paper addresses the challenge of achieving reproducible, size‑controlled synthesis of copper nanoparticles (Cu NPs) in a laboratory setting where only a limited number of experiments can be afforded. The authors construct a high‑quality, small‑scale dataset of 25 microwave‑assisted polyol syntheses by employing Latin Hypercube Sampling (LHS) to uniformly explore a three‑dimensional parameter space: copper precursor concentration (1–10 mM), reaction time (1–20 min), and temperature (175–200 °C). Dynamic light scattering (DLS) and UV‑Vis spectroscopy provide the hydrodynamic diameter and surface plasmon resonance (SPR) peak for each experiment. To ensure consistency, only the 18 experiments that yielded a monomodal particle‑size distribution are retained for model development.
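The LHS design described above can be sketched with SciPy's quasi-Monte-Carlo module. This is a minimal illustration, not a reproduction of the authors' actual design: the seed is arbitrary, and only the sample count (25) and the three parameter ranges are taken from the summary.

```python
import numpy as np
from scipy.stats import qmc

# Latin Hypercube Sampling: 25 points in the unit cube, one per stratum
# along each of the three dimensions (seed chosen arbitrarily here).
sampler = qmc.LatinHypercube(d=3, seed=0)
unit_samples = sampler.random(n=25)

# Scale to the stated ranges:
# precursor concentration (1-10 mM), time (1-20 min), temperature (175-200 degC)
lower = [1.0, 1.0, 175.0]
upper = [10.0, 20.0, 200.0]
design = qmc.scale(unit_samples, lower, upper)

print(design.shape)  # (25, 3)
```

Compared with a uniform random draw, LHS guarantees that each parameter's range is covered evenly even with only 25 experiments, which is why it suits small-budget campaigns like this one.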
Two modeling pathways are pursued. First, an ensemble regression framework built on the AMADEUS platform is used. One hundred base learners are trained on random 80/20 splits (in-bag/out-of-bag) of the data. Each base learner is a linear or polynomial (degree 1–5) model regularized with LASSO. The hyper-parameter α is manually set to 0.1 after visual inspection of performance across four candidate values (1.0, 0.1, 0.01, 0.001). To avoid over-fitting, the polynomial degree is limited to three, and feature engineering is guided by "ensemble importance" – the proportion of base models in which a feature retains a non-zero coefficient. Seven features survive this filter.