Evaluating the Impact of Missing Data Imputation through the use of the Random Forest Algorithm


📝 Abstract

This paper presents an impact assessment for the imputation of missing data. The data set used is HIV Seroprevalence data from an antenatal clinic study survey performed in 2001. Data imputation is performed through five methods: Random Forests, Autoassociative Neural Networks with Genetic Algorithms, Autoassociative Neuro-Fuzzy configurations, and two Random Forest and Neural Network based hybrids. Results indicate that Random Forests are superior in imputing missing data in terms both of accuracy and of computation time, with accuracy increases of up to 32% on average for certain variables when compared with autoassociative networks. While the hybrid systems have significant promise, they are hindered by their Neural Network components. The imputed data is used to test for impact in three ways: through statistical analysis, HIV status classification and through probability prediction with Logistic Regression. Results indicate that these methods are fairly immune to imputed data, and that the impact is not highly significant, with linear correlations of 96% between HIV probability prediction and a set of two imputed variables using the logistic regression analysis.


📄 Content

Missing data are a common difficulty encountered in many real-world situations and studies, and create difficulties with data analysis, study and visualization [1], [2]. Missing information also reduces insight into the data, and the reason the data are missing may itself make them of particular interest. Furthermore, a decision-making system often cannot reach a decision without all the information at hand. For this reason, it is important to find effective and viable methods of imputing data. The effect of this imputation should also be considered, so that insight is gained into the validity of the decisions made by such systems. The problem is assessed in the context of a real-world data set taken from an HIV sero-prevalence survey performed in South Africa in 2001 [3].

This paper evaluates the concept, classification, problem and treatment of missing data. Background on the various methods and paradigms used in the paper is then considered, followed by a description of the implementation of these concepts. The data set is considered, and thereafter feature selection on the data is described. Comparisons are drawn between the paradigms used, and an impact and sensitivity analysis is performed. Finally, a discussion is presented and conclusions are drawn.

I. MISSING DATA

Missing data are a problem inherent and common in data collection, especially when dealing with large, real-world data sets. Missing data are a problem since statistical methods have difficulty in performing when data are unknown. Studies have highlighted the need to research decision support systems when key information is missing or inaccessible [4]. The effect of missing data on such decision support systems is marked, and it is shown that results are degraded by simply assigning an arbitrary value to the missing data elements.

In the context of surveys, missing data may result for a number of reasons. Incomplete variable collection from subjects, non-response from subjects declining to provide information, poorly defined surveys, or data being removed for reasons such as confidentiality are some of the explanations for missing data [1], [2].

Missing data can be categorized based on the pattern of missingness and the missingness mechanism. The methods used to deal with missing data depend on the category into which the data fall. Three broad categories of missingness pattern are defined: monotone missingness, file matching, and general (arbitrary) missingness [5], [6]. If the variables for a given instance are $y_1, \ldots, y_k$, monotone missingness occurs if the variables can be ordered such that, whenever a value $y_j$ is missing, $y_{j+1}, \ldots, y_k$ are also missing. The pattern of file matching occurs when two variables are never jointly observed. General missingness is the pattern that occurs when neither of the former two patterns applies.
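The monotone pattern can be checked mechanically. The following is a minimal sketch, assuming missing entries are encoded as NaN in a numeric array; the function name `is_monotone` and the toy arrays are illustrative and do not come from the paper. It orders the variables from fewest to most missing values and then verifies that no row goes from missing back to observed.

```python
import numpy as np

def is_monotone(data):
    """Return True if the columns can be ordered so that, in every row,
    a missing value is followed only by missing values."""
    mask = np.isnan(np.asarray(data, dtype=float))
    # If a monotone ordering exists, sorting columns by their number of
    # missing entries (fewest first) recovers one such ordering.
    order = np.argsort(mask.sum(axis=0), kind="stable")
    ordered = mask[:, order].astype(int)
    # A row is monotone if its mask never steps from 1 (missing) to 0 (observed).
    return bool(np.all(np.diff(ordered, axis=1) >= 0))

nan = np.nan
# Toy data (assumption): once a value is missing, all later ones are too.
monotone_example = [[1.0, 2.0, 3.0],
                    [1.0, 2.0, nan],
                    [1.0, nan, nan]]
# Toy data (assumption): gaps are scattered, so no column ordering works.
general_example = [[1.0, nan, 3.0],
                   [nan, 2.0, 3.0],
                   [1.0, 2.0, nan]]

print(is_monotone(monotone_example))
print(is_monotone(general_example))
```

Sorting by missing count is sufficient because, under a monotone pattern, the sets of rows in which each variable is missing are nested.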

Missing data are often classified into one of three mechanisms, as defined by Little and Rubin [5]. The mechanisms are listed as follows in order from least to most dependent on other information.

1) Missing Completely At Random (MCAR) arises if the probability of a missing value is unrelated to the variable value itself or any other variable in the data set.
2) Missing At Random (MAR) arises if the probability of missing data of a variable could depend on other variables in the data set, but not on the variable’s own value.
3) The non-ignorable case arises if the probability of missing data is related to the value of the variable itself, even if other variables are known or controlled.

In the MCAR case, data cannot be predicted using any information in the set, known or unknown. For the MAR mechanism, the missingness is correlated with the observed data, but not necessarily with the value of the missing data itself [7].
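The three mechanisms can be illustrated with a small simulation. This is a sketch on synthetic data only: the variables `x` (always observed) and `y` (subject to missingness), the logistic missingness probabilities, and all numeric constants are assumptions chosen for illustration, not values from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic variables (assumption): x is fully observed, y depends on x.
x = rng.normal(30.0, 6.0, n)
y = 100.0 + 5.0 * (x - 30.0) + rng.normal(0.0, 20.0, n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: every y has the same 30% chance of being missing.
mcar = rng.random(n) < 0.3
# MAR: the chance that y is missing depends only on the observed x.
mar = rng.random(n) < sigmoid((x - 30.0) / 3.0)
# Non-ignorable: the chance that y is missing depends on y itself.
nonignorable = rng.random(n) < sigmoid((y - 100.0) / 15.0)

def observed_mean_shift(missing):
    """Mean of the observed y values minus the mean of all y values."""
    return y[~missing].mean() - y.mean()

for name, missing in [("MCAR", mcar), ("MAR", mar), ("non-ignorable", nonignorable)]:
    print(name, round(observed_mean_shift(missing), 2))

# Under MAR the relationship between y and x is still recoverable from
# the observed cases, which is what model-based imputation relies on.
slope_mar = np.polyfit(x[~mar], y[~mar], 1)[0]
print("slope of y on x from observed MAR cases:", round(slope_mar, 2))
```

In this setup the observed mean of `y` stays close to the full mean only under MCAR, while under MAR the bias can be corrected using `x` because the regression of `y` on `x` is unaffected.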

A number of strategies have been devised for dealing with missing data. The simplest is to discard the instances in which missing data occur (a complete-case method), which is inefficient and can lead to biased conclusions and observations. It is also impractical if a large proportion of the data are missing, and it wastes the observed information in every discarded instance [1]. Despite this, the method is commonly used in practice [2]. Other techniques include available-case procedures, weighting procedures and imputation-based procedures [7]. The latter is discussed further here, since imputation methods can be applied to the MCAR and MAR cases [8].
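The information waste of the complete-case method is easy to quantify. The sketch below uses synthetic data and assumed settings (10 variables, each value missing independently with probability 0.05 under MCAR); none of these numbers come from the paper. With independent gaps, the expected fraction of fully observed rows is $(1 - p)^k$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_cols, p_missing = 5000, 10, 0.05  # assumed settings

# Synthetic data with values removed completely at random.
data = rng.normal(size=(n_rows, n_cols))
data[rng.random((n_rows, n_cols)) < p_missing] = np.nan

observed = ~np.isnan(data)
complete = observed.all(axis=1)          # rows with no missing value
complete_cases = data[complete]          # what complete-case analysis keeps

frac_cells_missing = 1.0 - observed.mean()
frac_rows_kept = complete.mean()
# Share of observed values thrown away along with the incomplete rows.
frac_observed_discarded = observed[~complete].sum() / observed.sum()

print("cells missing:           ", round(frac_cells_missing, 3))
print("rows kept:               ", round(frac_rows_kept, 3))
print("expected rows kept:      ", round((1 - p_missing) ** n_cols, 3))
print("observed values discarded:", round(frac_observed_discarded, 3))
```

Even with only about 5% of cells missing, a large share of rows, and with them many observed values, is lost in this setup.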

Imputation techniques involve predicting the values of the data which are missing. Two categories of techniques exist, model-based techniques and non-model based techniques. Non-model based approaches include mean imputation and hot-deck imputation. These techniques have been said to decrease the variance estimates in statistical procedures [7]. Furthermore, such techniques may result in standard errors and bias on resul
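The variance reduction attributed to mean imputation can be seen in a short sketch. The data here are synthetic and the 30% MCAR missingness rate is an assumption for illustration. Every missing value is replaced by the mean of the observed values, so the imputed entries add no spread.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic variable (assumption) with 30% of values missing at random.
values = rng.normal(50.0, 10.0, 5000)
missing = rng.random(values.size) < 0.3

# Mean imputation: fill every gap with the mean of the observed values.
imputed = values.copy()
imputed[missing] = values[~missing].mean()

n_obs = int((~missing).sum())
var_observed = values[~missing].var(ddof=1)
var_imputed = imputed.var(ddof=1)

print("variance of observed values:", round(var_observed, 2))
print("variance after imputation:  ", round(var_imputed, 2))
# The shrinkage factor is exactly (n_obs - 1) / (n - 1).
print("ratio:", round(var_imputed / var_observed, 3))
```

Because the filled-in values sit exactly at the mean, the sample variance shrinks by the factor $(n_{\text{obs}} - 1)/(n - 1)$, which understates the true variability of the variable.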

This content is AI-processed based on ArXiv data.
