The random forest algorithm is a popular non-parametric alternative to parametric approaches for building risk prediction models. In short, the algorithm averages predictions from multiple classification and regression trees, each built on a different bootstrap sample. This thesis describes several ways of handling missing data when implementing the random forest algorithm: multiple imputation, inverse probability weighting, complete case analysis, mean imputation, and the default method in the randomForestSRC package proposed by Ishwaran. Simulations and an analysis of data on diabetes in women of Pima Indian heritage are used to compare the prediction accuracy of the random forest algorithm under the different missing data techniques. The simulations vary the level of missingness and the missing data mechanism. Finally, we describe a future research direction that involves extending methods for variance estimation for random forest predictions to the setting in which missing data are present.
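As a rough illustration of the two simplest techniques named in the abstract, the sketch below applies complete case analysis and mean imputation to a small invented dataset. It is not taken from the thesis: the toy values, the column meanings, and the function names complete_cases and mean_impute are all assumptions made for this example, and the thesis itself works with the randomForestSRC package in R rather than with this code.

```python
from statistics import mean

# Toy rows of (glucose, bmi); None marks a missing value.
# These values are invented for illustration only.
rows = [
    (148.0, 33.6),
    (85.0, None),
    (None, 23.3),
    (89.0, 28.1),
]


def complete_cases(data):
    """Complete case analysis: keep only rows with no missing values."""
    return [row for row in data if all(v is not None for v in row)]


def mean_impute(data):
    """Mean imputation: fill each missing value with its column's observed mean."""
    columns = list(zip(*data))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        tuple(col_means[j] if v is None else v for j, v in enumerate(row))
        for row in data
    ]


cc = complete_cases(rows)
imputed = mean_impute(rows)

print(len(cc), "complete rows out of", len(rows))
for row in imputed:
    print(row)
```

Complete case analysis discards half of this toy dataset, while mean imputation keeps every row but fills the gaps with a single value per column. Either result could then be passed to a random forest fit, which is the comparison the thesis carries out across missingness levels and mechanisms.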
All rights reserved. Collection is open to the Brown community for research.
Citation
Zhang, Yimo, "Accounting for Missing Data in the Random Forest Algorithm" (2019). Biostatistics Theses and Dissertations. Brown Digital Repository. Brown University Library. https://doi.org/10.26300/b90x-hc38