Brown University

Accounting for Missing Data in the Random Forest Algorithm

Description

Abstract:
The random forest algorithm is a popular non-parametric alternative to parametric approaches for building risk prediction models. In short, the algorithm averages predictions from multiple classification and regression trees, each built on a different bootstrap sample. This thesis describes several ways of handling missing data when implementing the random forest algorithm. The methods described include multiple imputation, inverse probability weighting, complete case analysis, mean imputation, and the default method in the randomForestSRC package proposed by Ishwaran. Simulations and an analysis of data on diabetes in women of Pima Indian heritage are used to compare the prediction accuracy of the random forest algorithm under the different missing data techniques. The simulations vary the level of missingness and the missing data mechanisms. Finally, we describe a future research direction that involves extending methods for variance estimation for random forest predictions to the setting where missing data are present.
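As a rough illustration of the ideas named in the abstract, the following Python sketch shows two of the simpler missing data techniques (complete case analysis and mean imputation) and the bootstrap averaging step. It is not the thesis's code and does not use randomForestSRC. All function names and the toy data are hypothetical, missing values are assumed to be encoded as None, and one-split "stumps" on a single predictor stand in for full classification and regression trees. A real random forest also samples candidate predictors at each split, which this sketch omits.

```python
import random
from statistics import mean


def complete_cases(rows):
    """Complete case analysis: keep only rows with no missing (None) values."""
    return [r for r in rows if all(v is not None for v in r)]


def mean_impute(rows):
    """Mean imputation: replace each None with the observed mean of its column."""
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing the sum of squared errors."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:  # a split at the largest x leaves nothing on the right
            continue
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # every x in the sample is identical: predict the mean
        m = mean(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def bagged_predict(xs, ys, x_new, n_trees=50, seed=0):
    """Average the predictions of stumps fit on bootstrap samples of (xs, ys)."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(stump(x_new))
    return mean(preds)


# Toy data: column 0 is a predictor, column 1 is the outcome.
toy = [[1.0, 0.0], [2.0, 0.0], [None, 0.0], [4.0, 1.0], [5.0, None], [6.0, 1.0]]
for name, data in [("complete case", complete_cases(toy)),
                   ("mean imputation", mean_impute(toy))]:
    xs = [r[0] for r in data]
    ys = [r[1] for r in data]
    print(name, bagged_predict(xs, ys, x_new=1.5), bagged_predict(xs, ys, x_new=5.5))
```

Complete case analysis discards information from partially observed rows, while mean imputation keeps every row but shrinks the variability of the imputed column. The sketch makes neither trade-off visible at this scale; it only shows where each technique sits relative to the bootstrap averaging step.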
Notes:
Thesis (Sc. M.)--Brown University, 2019

Access Conditions

Rights
In Copyright
Restrictions on Use
All rights reserved. Collection is open to the Brown community for research.

Citation

Zhang, Yimo, "Accounting for Missing Data in the Random Forest Algorithm" (2019). Biostatistics Theses and Dissertations. Brown Digital Repository. Brown University Library. https://doi.org/10.26300/b90x-hc38

Relations

Collection: