Comparison of Different Methods for Univariate Time Series Imputation in R


Missing values in datasets are a well-known problem, and there are quite a lot of R packages offering imputation functions. But while imputation in general is well covered within R, it is hard to find functions for the imputation of univariate time series. The problem is that most standard imputation techniques cannot be applied directly: most algorithms rely on inter-attribute correlations, while univariate time series imputation needs to employ time dependencies. This paper provides an overview of univariate time series imputation in general and an in-detail insight into the respective implementations within R packages. Furthermore, we experimentally compare the R functions on different time series using four different ratios of missing data. Our results show that either an interpolation with a seasonal Kalman filter from the zoo package or a linear interpolation on seasonally loess-decomposed data from the forecast package was the most effective method for dealing with missing data in most of the scenarios assessed in this paper.


💡 Research Summary

Missing data are a pervasive challenge in time‑series analysis, and while the R ecosystem offers a plethora of imputation tools for multivariate datasets, dedicated functions for univariate series are scattered and often overlook temporal dependencies. This paper fills that gap by systematically reviewing the imputation functions available in major R packages (zoo, forecast, imputeTS, and tsibble, among others) and by conducting a comprehensive empirical evaluation. Five real‑world monthly series (unemployment rates, average temperature, electricity consumption, a stock index, and precipitation) serve as test beds. For each series, missing values are introduced under two mechanisms: completely at random (MCAR) and missing at random in contiguous blocks (MAR). Four missing‑data proportions (10%, 20%, 30%, 40%) are examined, yielding a total of 40 experimental scenarios (5 series × 2 mechanisms × 4 proportions).
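The gap-insertion procedure described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the paper's actual benchmark code: the function names, random seed, and block length are made up, and `AirPassengers` (a monthly series shipped with base R) stands in for the five real-world series.

```r
set.seed(42)

# MCAR: knock out a given ratio of observations at random positions.
insert_mcar <- function(x, ratio) {
  idx <- sample(seq_along(x), size = floor(ratio * length(x)))
  x[idx] <- NA
  x
}

# MAR-style contiguous gaps: knock out runs of consecutive observations.
# Blocks may overlap, so the realized ratio is at most the requested one.
insert_blocks <- function(x, ratio, block_len = 5) {
  n_missing <- floor(ratio * length(x))
  n_blocks  <- ceiling(n_missing / block_len)
  starts    <- sample(seq_len(length(x) - block_len), n_blocks)
  for (s in starts) x[s:(s + block_len - 1)] <- NA
  x
}

airp      <- as.numeric(AirPassengers)  # 144 monthly observations
with_gaps <- insert_mcar(airp, 0.20)
mean(is.na(with_gaps))                  # roughly 0.20
```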

The imputation methods fall into five conceptual categories: (1) simple interpolation (linear, spline, polynomial), (2) moving‑average or exponential smoothing, (3) seasonal‑trend decomposition followed by interpolation, (4) state‑space models with Kalman filtering, and (5) machine‑learning‑based prediction. Twelve specific functions are benchmarked, including zoo’s na.approx, na.spline, and na.StructTS; forecast’s na.interp; imputeTS’s na.interpolation and na.kalman; and several custom wrappers for random‑forest and XGBoost models. Performance is assessed using mean absolute error (MAE), root‑mean‑square error (RMSE), Pearson correlation with the original series, and a spectral measure of seasonal‑pattern preservation.
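The first three metrics are standard and can be computed in a few lines of base R. A minimal sketch with made-up toy numbers (the `truth`/`imputed` vectors below are illustrative, not from the paper):

```r
# MAE and RMSE between the original values and the imputed series.
mae  <- function(truth, imputed) mean(abs(truth - imputed))
rmse <- function(truth, imputed) sqrt(mean((truth - imputed)^2))

truth   <- c(10, 12, 14, 13, 15)
imputed <- c(10, 12.5, 14, 12, 15)  # pretend positions 2 and 4 were imputed

mae(truth, imputed)    # 0.3
rmse(truth, imputed)   # 0.5
cor(truth, imputed)    # Pearson correlation with the original series
```

In practice these would be evaluated only at the positions that were set to missing, so that untouched observations do not dilute the error.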

Results consistently highlight the superiority of methods that explicitly model seasonality. The zoo package’s seasonal Kalman‑filter implementation (na.StructTS), which embeds seasonal components into a state‑space framework, delivers the lowest MAE and RMSE across almost all missing‑data ratios, especially when gaps are long. The forecast package’s na.interp, which first applies STL (Seasonal‑Trend decomposition using Loess) and then performs linear interpolation on the deseasonalized series, performs almost as well and excels at preserving the original series’ variance and autocorrelation structure. Simple linear or spline interpolation (na.approx, na.spline) works adequately for very short gaps (≤5 months) but deteriorates sharply as the missing proportion rises. Moving‑average and exponential‑smoothing methods capture trends but fail to retain seasonal cycles, leading to higher errors on strongly seasonal series. Machine‑learning approaches (random forest, XGBoost) can be competitive when ample complete data are available for training, yet they suffer from over‑fitting and increased computational cost when the missing proportion is high, making them less practical for routine use.
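Assuming the zoo and forecast packages are installed, the two seasonally aware approaches can be applied roughly as follows. In these packages the seasonal Kalman filter is exposed as `zoo::na.StructTS` and the STL-plus-interpolation method as `forecast::na.interp`; the series and gap positions below are illustrative.

```r
library(zoo)       # na.StructTS: seasonal state-space model + Kalman smoother
library(forecast)  # na.interp: STL decomposition + interpolation

x <- AirPassengers                 # monthly ts, frequency 12
x[c(20:24, 70, 100:105)] <- NA     # illustrative gaps, incl. a 6-month run

filled_kalman <- zoo::na.StructTS(x)    # seasonal Kalman filter
filled_stl    <- forecast::na.interp(x) # STL + interpolation

# Both methods should leave no interior gaps unfilled.
sum(is.na(filled_kalman))
sum(is.na(filled_stl))
```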

The study also examines non‑seasonal series. In those cases, the Kalman filter remains robust, while STL‑based methods revert to simple trend interpolation, yielding performance comparable to basic linear methods. Overall, the findings suggest a pragmatic hierarchy for practitioners: first, test for seasonality; if it is present, prioritize zoo’s na.StructTS or forecast’s na.interp. For non‑seasonal series or very short gaps, simple interpolation (na.approx) is sufficient and computationally cheap.
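This pragmatic hierarchy can be expressed as a small helper. The wrapper name below is made up for illustration; it assumes only the zoo package (whose seasonal Kalman filter is `na.StructTS`) and uses the declared `ts` frequency as a crude stand-in for a proper seasonality test.

```r
library(zoo)

# Hypothetical convenience wrapper implementing the suggested hierarchy:
# seasonal series -> seasonal Kalman filter; otherwise -> linear interpolation.
impute_univariate <- function(x) {
  if (frequency(x) > 1) {
    zoo::na.StructTS(x)               # seasonally aware Kalman smoothing
  } else {
    zoo::na.approx(x, na.rm = FALSE)  # cheap linear interpolation
  }                                   # (keeps any leading/trailing NAs)
}

x <- AirPassengers
x[50:53] <- NA
sum(is.na(impute_univariate(x)))      # interior gaps should all be filled
```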

In conclusion, the paper provides clear guidance for R users dealing with univariate time‑series imputation, demonstrating that seasonally aware Kalman filtering and STL‑based linear interpolation are the most reliable and generally applicable solutions. Future work should extend the benchmark to irregularly spaced series, multi‑seasonal patterns, and deep‑learning‑based imputation models to further enrich the toolbox for time‑series practitioners.