bigMICE: Multiple Imputation of Big Data
Missing data is a prevalent issue in many applications, including large medical registries such as the Swedish Healthcare Quality Registries, and can lead to biased or inefficient analyses if not handled properly. Multiple Imputation by Chained Equations (MICE) is a popular and versatile method for handling multivariate missing data, but traditional implementations face significant challenges on big data sets due to computational time and memory limitations. To address this, we developed the bigMICE package, which adapts the MICE framework to big data using Apache Spark MLlib and Spark ML. Our implementation allows the maximum memory usage during execution to be controlled, enabling very large data sets to be processed on hardware with limited memory, such as ordinary laptops. The package was tested on a large Swedish medical registry to measure memory usage and runtime, and to assess how imputation quality depends on sample size and on the proportion of missingness in the data. In conclusion, our method is generally more memory efficient and faster on large data sets than a commonly used MICE implementation. We also demonstrate that working with very large data sets can yield high-quality imputations even when a variable has a large proportion of missing data. The paper also provides guidelines and recommendations for installing and using our open-source package.
💡 Research Summary
The paper addresses the pervasive problem of missing data in large‑scale datasets, particularly in medical registries such as the Swedish Healthcare Quality Registries, where traditional Multiple Imputation by Chained Equations (MICE) implementations struggle with prohibitive memory consumption and long runtimes. To overcome these limitations, the authors developed the bigMICE package, which re‑implements the MICE algorithm on top of Apache Spark and Spark MLlib, thereby exploiting distributed computing and on‑disk storage to keep RAM usage within user‑specified bounds.
Key technical contributions include:

1. A Spark‑based data pipeline that loads raw data into Spark DataFrames, initializes missing values with simple random draws, and then iteratively fits conditional regression models for each incomplete variable using Spark MLlib's linear, logistic, and multinomial regression algorithms.
2. A memory‑management scheme that lets the user set a maximum RAM limit (maxMemory); the package automatically adjusts partition sizes, caching, and spill‑to‑disk behavior so that the iterative MICE chain can run without triggering out‑of‑memory errors, even on modest laptops with 4–8 GB of RAM.
3. Parallelization of the multiple imputation repetitions (the "m" imputations) as independent Spark jobs, which yields near‑linear speed‑up with the number of imputations.
4. Seamless integration with the R ecosystem via the sparklyr interface, enabling R users to call bigMICE with a familiar function signature while the heavy lifting occurs in Spark.
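The chained-equations loop described in contribution (1) can be sketched in plain Python. This is a minimal single-chain illustration of the algorithm only, not bigMICE's actual code: the package runs the analogous steps on Spark DataFrames with MLlib regressions, whereas here NumPy least squares stands in for the conditional models, and the function name `mice_chain_sketch` is our own.

```python
import numpy as np

def mice_chain_sketch(X, n_iter=10, rng=None):
    """One MICE chain over a numeric matrix with NaNs marking missing cells.

    Illustrative only: initialize missing values with random observed draws,
    then cycle a linear regression over each incomplete column, refilling its
    missing entries with predictions plus residual noise.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    X = X.astype(float).copy()
    miss = np.isnan(X)
    # Step 1: initialize each missing value with a random observed draw.
    for j in range(X.shape[1]):
        obs = X[~miss[:, j], j]
        X[miss[:, j], j] = rng.choice(obs, size=int(miss[:, j].sum()))
    # Step 2: iterate conditional models, one incomplete column at a time.
    for _ in range(n_iter):
        for j in np.where(miss.any(axis=0))[0]:
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])  # add intercept
            fit_rows = ~miss[:, j]
            beta, *_ = np.linalg.lstsq(A[fit_rows], X[fit_rows, j], rcond=None)
            resid_sd = np.std(X[fit_rows, j] - A[fit_rows] @ beta)
            # Refill missing entries with noisy predictions from the fit.
            X[miss[:, j], j] = A[miss[:, j]] @ beta + rng.normal(
                0, resid_sd, int(miss[:, j].sum()))
    return X
```

Running m independent copies of such a chain (with different seeds) is what contribution (3) parallelizes as separate Spark jobs.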
The authors evaluated bigMICE on a real‑world Swedish medical registry containing roughly two million rows and three hundred variables, varying both sample size and missingness proportion (10% to 80%). Compared with the widely used R‑mice package, bigMICE consistently used far less memory (often under 4 GB, whereas R‑mice could exceed available RAM) and ran 2–3× faster. Importantly, statistical quality was preserved: mean absolute error (MAE) and root‑mean‑square error (RMSE) of imputed values remained low even for high‑missingness variables, and Rubin's pooling rules produced unbiased parameter estimates with appropriate standard errors.
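The pooling step mentioned above follows Rubin's rules: the m completed-data estimates are averaged, and the total variance combines the average within-imputation variance with the between-imputation variance. A minimal sketch in plain Python (the function `rubin_pool` is ours, not part of the bigMICE API):

```python
from statistics import mean, variance

def rubin_pool(estimates, variances):
    """Pool m completed-data estimates of one parameter via Rubin's rules.

    estimates: the per-imputation point estimates
    variances: the corresponding squared standard errors
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # average within-imputation variance
    b = variance(estimates)        # between-imputation (sample) variance
    t = w + (1 + 1 / m) * b        # Rubin's total variance
    return q_bar, t

# e.g. five imputations of a regression coefficient (made-up numbers)
est, total_var = rubin_pool([1.02, 0.97, 1.05, 0.99, 1.01],
                            [0.04, 0.05, 0.04, 0.05, 0.04])
```

The pooled standard error is then the square root of the total variance, which is what yields the "appropriate standard errors" noted above.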
Practical guidance is provided for deploying bigMICE: configure Spark executors and partitions to match data size, select appropriate regression models for each variable type, and tune the maxMemory parameter to balance RAM usage against disk I/O overhead. The package is open‑source (GitHub link) and includes comprehensive documentation and reproducible examples.
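As a rough starting point for the tuning advice above, a Spark configuration might look like the following. The property names are standard Spark settings; the values are purely illustrative and should be matched to the data size and hardware, and note that maxMemory is a bigMICE argument rather than a Spark property:

```
# spark-defaults.conf -- illustrative values only
spark.driver.memory           4g     # cap driver heap for a laptop-class machine
spark.executor.memory         4g     # per-executor heap
spark.memory.fraction         0.6    # share of heap for execution and storage
spark.sql.shuffle.partitions  200    # raise for larger data to shrink partitions
```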
In conclusion, bigMICE fills a critical gap in the statistical software landscape by delivering a memory‑efficient, scalable, and statistically sound implementation of MICE for big data. Future work may extend the framework with newer Spark ML models (e.g., gradient‑boosted trees, deep learning) and explore Bayesian extensions to handle Missing Not At Random (MNAR) mechanisms. The work demonstrates that high‑quality multiple imputation is feasible even on ordinary hardware when leveraging modern distributed computing platforms.