Exploring the Limitations of kNN Noisy Feature Detection and Recovery for Self-Driving Labs
Self-driving laboratories (SDLs) show promise for accelerating materials discovery by integrating machine learning with automated experimental platforms. However, errors in the capture of input parameters may corrupt the features used to model system performance, compromising current and future campaigns. This study develops an automated workflow that systematically detects noisy features, determines which sample-feature pairings can be corrected, and finally recovers the correct feature values. A systematic study then examines how dataset size, noise intensity, noise type, and feature value distribution affect both the detectability and recoverability of noisy features on Density Functional Theory (DFT) and SDL datasets. In general, high-intensity noise and large training datasets are conducive to the detection and correction of noisy features. Low-intensity noise reduces detection and recovery but can be compensated for by larger clean training datasets. Detection and correction results vary between features: continuous, dispersed feature distributions show greater recoverability than discrete or narrow ones. This systematic study not only demonstrates a model-agnostic framework for rational data recovery in the presence of noise, limited data, and differing feature distributions but also provides a tangible benchmark of kNN imputation on materials datasets. Ultimately, it aims to enhance data quality and experimental precision in automated materials discovery.
💡 Research Summary
This paper addresses a critical bottleneck in self‑driving laboratories (SDLs): the degradation of machine‑learning‑driven experimental design caused by noisy input features. The authors develop a fully automated, model‑agnostic workflow that (1) detects which feature has been corrupted, (2) determines which individual samples are recoverable, and (3) restores the original feature values using k‑Nearest Neighbors (kNN) imputation. Two representative datasets are used to benchmark the approach: a large computational materials database (JARVIS‑DFT, 71 571 entries, 273 original features reduced to 46 after correlation and importance filtering) and a real‑world 3‑D‑printing SDL dataset (13 250 runs, 12 process/measurement features).
Three realistic noise models are injected into a single feature at a time: additive Gaussian noise (thermal/electronic fluctuations), Poisson noise (count‑based measurement uncertainty), and drift noise (slow systematic bias modeled as a random walk). Noise intensity is varied to span signal‑to‑noise ratios from well‑below to well‑above unity.
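The three noise models can be sketched as follows. This is a minimal illustration that assumes noise intensity is scaled by each feature's standard deviation; the paper's exact parameterization may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_noise(x, kind="gaussian", sigma=0.3):
    """Corrupt one feature column x with one of the three noise models.
    `sigma` scales intensity relative to the feature's own standard
    deviation (an assumption made for this sketch)."""
    scale = sigma * x.std()
    if kind == "gaussian":        # additive thermal/electronic fluctuations
        return x + rng.normal(0.0, scale, size=x.shape)
    if kind == "poisson":         # count-based measurement uncertainty
        unit = max(scale, 1e-12)  # pseudo-count size (an assumption)
        counts = rng.poisson(np.clip(x - x.min(), 0.0, None) / unit)
        return counts * unit + x.min()
    if kind == "drift":           # slow systematic bias as a random walk
        step = scale / np.sqrt(len(x))
        return x + np.cumsum(rng.normal(0.0, step, size=x.shape))
    raise ValueError(f"unknown noise kind: {kind}")
```

Scaling `sigma` relative to the feature's own spread is one convenient way to sweep signal-to-noise ratios from below to above unity, as the study does.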
The workflow proceeds in four stages. First, a baseline kNN model is tuned via GridSearchCV (5‑fold) to use Manhattan distance, k = 5, leaf size = 30, and distance‑inverse weighting. For each target feature, the model is trained on the remaining N‑1 features and the prediction error (Δbase) is recorded. Second, after noise injection, the same imputation is performed, yielding Δnoise. The similarity between the Δbase and Δnoise distributions for every feature is quantified with Earth Mover’s Distance (EMD); the feature with the largest EMD is flagged as the noisy one. Successful detection is counted when this matches the ground‑truth noisy feature, allowing a detection rate (detectability) to be computed across noise levels, dataset sizes, and noise types. Third, a sample is deemed recoverable if its Δnoise exceeds the 95th percentile of the baseline error distribution. The recoverability metric is the fraction of such samples. Finally, for recoverable samples the kNN imputed values are compared to the true (pre‑noise) values, and performance is reported using R² and other regression metrics.
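The stages above can be sketched with scikit-learn and SciPy. The synthetic four-feature dataset, the train/test split, and the choice of corrupted feature here are illustrative assumptions standing in for the paper's JARVIS-DFT setup:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)

# Synthetic stand-in: four mutually correlated features (illustrative only)
n = 2000
base = rng.normal(size=(n, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(n, 1)) for _ in range(4)])

def impute_errors(X):
    """Stages 1-2: leave-one-feature-out kNN imputation, returning the
    per-sample absolute prediction error for each target feature."""
    half = X.shape[0] // 2
    errs = []
    for j in range(X.shape[1]):
        knn = KNeighborsRegressor(n_neighbors=5, metric="manhattan",
                                  leaf_size=30, weights="distance")
        others = np.delete(X, j, axis=1)     # train on the remaining features
        knn.fit(others[:half], X[:half, j])
        errs.append(np.abs(knn.predict(others[half:]) - X[half:, j]))
    return np.array(errs)

delta_base = impute_errors(X)                # baseline error distributions

# Inject Gaussian noise into feature 2 (the ground-truth noisy feature)
X_noisy = X.copy()
X_noisy[:, 2] += rng.normal(0.0, 0.3 * X[:, 2].std(), size=n)
delta_noise = impute_errors(X_noisy)

# Detection: flag the feature whose error distribution shifted most (EMD)
emd = [wasserstein_distance(delta_base[j], delta_noise[j])
       for j in range(X.shape[1])]
detected = int(np.argmax(emd))

# Recoverability: samples whose noisy-run error exceeds the 95th-percentile
# baseline error are deemed recoverable
threshold = np.percentile(delta_base[detected], 95)
recoverable = delta_noise[detected] > threshold
```

The hyperparameters (Manhattan distance, k = 5, leaf size 30, distance-inverse weighting) match those reported from the paper's GridSearchCV tuning; everything else in the sketch is a simplification.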
Key findings: (i) Larger training sets dramatically improve both baseline kNN accuracy and noise detection. With the full JARVIS‑DFT set, most features achieve R² > 0.8; reducing the training set to 0.1 k causes R² to drop below 0.6 for many features. (ii) High‑intensity noise (e.g., Gaussian σ ≈ 0.3) yields detection rates above 90 % and recoverability R² of 0.85–0.91. Low‑intensity noise (σ ≈ 0.05) reduces detection to ~45 % and recoverability R² to <0.6. (iii) Feature distribution matters: continuous, broadly dispersed features (e.g., “MagpieData Minimum GSbandgap”) are recovered with high fidelity, whereas discrete or narrowly distributed features (categorical process parameters) show poor recovery (R² ≈ 0.4). (iv) The average Pearson correlation of a target feature with the remaining features strongly predicts kNN recovery success (correlation coefficient ≈ 0.78), confirming that kNN’s reliance on neighbor similarity makes inter‑feature correlation a decisive factor. (v) Across noise types, Gaussian and Poisson behave similarly, while drift noise—being temporally correlated—slightly lowers detection rates but still permits robust recovery when enough data are available.
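Finding (iv) suggests a cheap pre-screen: rank features by their mean absolute Pearson correlation with the remaining features before trusting kNN recovery on them. A sketch using a hypothetical toy feature matrix:

```python
import numpy as np

def mean_abs_correlation(X):
    """For each feature, the mean |Pearson r| with every other feature —
    a quick proxy for how well kNN imputation is likely to recover it."""
    C = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(C, np.nan)              # exclude self-correlation
    return np.nanmean(C, axis=1)

# Illustrative data: features 0 and 1 are strongly related; 2 is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = np.column_stack([a, a + 0.2 * rng.normal(size=500),
                     rng.normal(size=500)])
scores = mean_abs_correlation(X)             # feature 2 scores lowest
```

Under the paper's findings, a feature like index 2 here, nearly uncorrelated with everything else, would be a poor candidate for kNN-based recovery.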
The authors conclude that kNN‑based noisy‑feature detection and correction is viable for SDLs provided (a) sufficient clean training data exist, (b) the noise level is moderate to high, and (c) the affected features have strong statistical relationships with other measured variables. The workflow is model‑agnostic, requiring only a distance metric and a set of clean features, making it readily deployable in diverse materials‑science and manufacturing automation contexts. By systematically quantifying the limits of kNN imputation, the study offers practical guidelines for data‑quality assurance in autonomous laboratories, ultimately enhancing experimental precision and accelerating discovery cycles.