Automated Modernization of Machine Learning Engineering Notebooks for Reproducibility
Interactive computational notebooks (e.g., Jupyter notebooks) are widely used in machine learning engineering (MLE) to program and share end-to-end pipelines, from data preparation to model training and evaluation. However, environment erosion, the rapid evolution of machine learning hardware and software ecosystems, has rendered many published MLE notebooks non-reproducible in contemporary environments, hindering code reuse and scientific progress. To quantify this gap, we study 12,720 notebooks mined from 79 popular Kaggle competitions: only 35.4% remain reproducible today. Crucially, we find that environment backporting, i.e., downgrading dependencies to match the submission time, does not improve reproducibility but rather introduces additional failure modes. To address environment erosion, we design and implement MLEModernizer, an LLM-driven agentic framework that treats the contemporary environment as a fixed constraint and modernizes notebook code to restore reproducibility. MLEModernizer iteratively executes notebooks, collects execution feedback, and applies targeted fixes of three types: error-repair, runtime-reduction, and score-calibration. Evaluated on 7,402 notebooks that are non-reproducible under the baseline environment, MLEModernizer makes 5,492 (74.2%) reproducible. MLEModernizer enables practitioners to validate, reuse, and maintain MLE artifacts as the hardware and software ecosystems continue to evolve.
💡 Research Summary
The paper tackles the growing reproducibility crisis of machine‑learning‑engineering (MLE) notebooks, which are the de‑facto medium for sharing end‑to‑end pipelines on platforms such as Kaggle. By mining 12,720 Python notebooks from 79 popular Kaggle competitions and re‑executing them in a single, up‑to‑date Kaggle Docker container, the authors establish a baseline reproducibility rate of only 35.4 % when allowing a 10 % relative score deviation and requiring a valid CSV submission. The failures stem from broken APIs, dependency conflicts, runtime overruns, and subtle changes in hardware‑accelerated libraries.
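The reproducibility criterion above (a valid CSV submission plus a score within a 10 % relative band) can be sketched as a small predicate. This is an illustrative reconstruction, not the paper's actual evaluation code; the function name and signature are assumptions.

```python
def is_reproducible(original_score: float, rerun_score: float,
                    produced_valid_csv: bool, tolerance: float = 0.10) -> bool:
    """Hypothetical check mirroring the criterion described above: the re-run
    must emit a valid submission CSV and land within a 10% relative score band."""
    if not produced_valid_csv:
        return False
    if original_score == 0:
        # Degenerate case: with a zero reference score, relative deviation
        # is undefined, so require an exact match.
        return rerun_score == 0
    relative_deviation = abs(rerun_score - original_score) / abs(original_score)
    return relative_deviation <= tolerance
```

A re-run scoring 0.85 against an original 0.90 (about 5.6 % deviation) would count as reproduced; one scoring 0.70 (about 22 % deviation) would not.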
A natural remedy—environment back‑porting—was implemented by inferring the submission timestamp, extracting the exact package list with an AST‑based tool, and downgrading each dependency accordingly. Surprisingly, this strategy did not improve reproducibility; the success rate fell to 35.1 %, highlighting that older container images are unavailable, that older library versions may still contain bugs, and that the “reconstruct the historic environment, keep the code unchanged” paradigm is fundamentally flawed for modern MLE workflows.
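The dependency-extraction step in the back-porting pipeline can be illustrated with Python's standard `ast` module: walk the syntax tree of each cell and collect top-level package names from `import` statements. This is a simplified sketch of that step, not the paper's actual tool.

```python
import ast

def extract_imports(source: str) -> set[str]:
    """Collect top-level package names referenced by notebook code,
    a minimal sketch of AST-based dependency extraction."""
    packages = set()
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Skip relative imports (node.level > 0); keep absolute ones.
            packages.add(node.module.split(".")[0])
    return packages

cell = "import numpy as np\nfrom sklearn.model_selection import train_test_split"
# extract_imports(cell) -> {'numpy', 'sklearn'}
```

In the back-porting experiment, each extracted package would then be pinned to the version current at the inferred submission timestamp.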
In response, the authors propose MLEModernizer, an LLM‑driven agentic framework that treats the contemporary environment as a fixed constraint and modernizes the notebook code itself. The system operates iteratively: (1) execute the notebook, capture errors, runtime, and score deviation; (2) based on the execution context, formulate a targeted prompt for a large language model (LLM) in one of three fix categories—error‑repair, runtime‑reduction, or score‑calibration; (3) ask the LLM (GPT‑5.2) to generate a patch at the whole‑file level, allowing it to reason across cells and propose comprehensive changes; (4) re‑run the patched notebook and repeat until the reproducibility criteria are met.
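The four-step loop above can be sketched as a simple driver. The `run` and `ask_llm` callables are placeholders for notebook execution and the whole-file LLM patch request; their names and the report-dictionary keys are assumptions for illustration, not the paper's interfaces.

```python
def modernize(notebook_source: str, run, ask_llm, max_iters: int = 10) -> str:
    """Illustrative sketch of the iterative modernization loop.
    `run` executes the notebook and returns an execution report;
    `ask_llm` requests a whole-file patch in a given fix category."""
    source = notebook_source
    for _ in range(max_iters):
        report = run(source)  # errors, runtime, score deviation
        if report["reproducible"]:
            return source
        # Pick one of the three fix categories from the execution context.
        if report.get("error"):
            category = "error-repair"
        elif report.get("over_time_limit"):
            category = "runtime-reduction"
        else:
            category = "score-calibration"
        # Whole-file patch: the LLM sees the entire notebook, so it can
        # reason across cells rather than fixing one cell in isolation.
        source = ask_llm(category, source, report)
    return source  # best effort after the iteration budget is spent
```

Patching at the whole-file level is the design choice the ablation study credits with reducing iteration counts relative to cell-level fixes.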
Evaluated on the 7,402 notebooks that were non‑reproducible under the baseline environment, MLEModernizer succeeded in making 5,492 (74.2 %) fully reproducible. An additional 1,284 notebooks (17.3 %) became “error‑reproducible”: they still contain runtime errors but produce a valid CSV and a score within the acceptable band, indicating that some errors are benign for the downstream task. The system averages 5.70 iterations per notebook and costs roughly $0.31 per notebook, demonstrating economic viability for large‑scale industrial deployment. Ablation studies show that whole‑file patches outperform cell‑level fixes by reducing the number of required iterations.
The contributions are threefold: (1) a large‑scale reproducibility audit of real‑world Kaggle notebooks, confirming that environment erosion is a dominant failure mode and that naïve back‑porting is ineffective; (2) the design of MLEModernizer, an LLM‑based agent that integrates execution feedback with targeted prompt engineering to modernize code for a fixed environment; (3) an extensive empirical evaluation showing high success rates, low cost, and practical applicability.
The work has broad implications: researchers can automatically revive legacy code to validate prior results, and engineers can keep production pipelines up‑to‑date without manually tracking deprecations. Limitations include reliance on LLM correctness (risk of hallucinations), exclusion of notebooks that depend on external or evolving datasets, and potential challenges with highly specialized libraries (e.g., computer‑vision frameworks). Future directions suggested are incorporating multimodal LLMs, generating automated unit tests to further guard against hallucinations, extending the framework to handle external data dependencies, and exploring human‑in‑the‑loop workflows for higher‑stakes domains.