Accounting for Missing Data in the Random Forest Algorithm

Zhang, Yimo

Overview

Year:: 2019
Contributor:: Zhang, Yimo (creator); Liu, Tao (Reader); Steingrimsson, Jon (Advisor); Brown University. Department of Biostatistics (sponsor)
Genre:: theses
Subject:: Machine learning--Statistical methods
DOI: https://doi.org/10.26300/b90x-hc38

Description

Abstract:: The random forest algorithm is a popular non-parametric alternative to parametric approaches for building risk prediction models. In short, the algorithm averages predictions from multiple classification and regression trees, each built on a different bootstrap sample. This thesis describes several ways of handling missing data when implementing the random forest algorithm: multiple imputation, inverse probability weighting, complete case analysis, mean imputation, and the default method in the randomForestSRC package proposed by Ishwaran. Simulations and an analysis of data on diabetes in women of Pima Indian heritage are used to compare the prediction accuracy of the random forest algorithm under the different missing data techniques. The simulations vary the level of missingness and the missing data mechanism. Finally, we describe a future research direction: extending methods for variance estimation for random forest predictions to the setting where missing data are present.
Notes:: Thesis (Sc. M.)--Brown University, 2019
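
Illustration (not part of the thesis record): the two simplest techniques named in the abstract, complete case analysis and mean imputation, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function names `complete_cases` and `mean_impute`, the use of `None` to mark a missing entry, and the toy data are all hypothetical. The resulting data set would then be passed to a random forest fit.

```python
from statistics import mean

def complete_cases(rows):
    """Complete case analysis: keep only rows with no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row)]

def mean_impute(rows):
    """Mean imputation: replace each missing value with the mean of the
    observed values in the same column.

    Assumes every column has at least one observed value; otherwise
    statistics.mean raises StatisticsError.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]

# Hypothetical toy data: two covariates, None marks a missing entry.
toy = [[1.0, 10.0], [None, 20.0], [3.0, None], [5.0, 30.0]]

cc = complete_cases(toy)   # drops the two rows that contain a None
mi = mean_impute(toy)      # keeps all four rows, filling in column means
```

Complete case analysis shrinks the training set (here from four rows to two), while mean imputation keeps every row but replaces each gap with a single fixed value per column. This is the kind of trade-off the thesis's simulations compare across levels and mechanisms of missingness.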

Content

Access Conditions

Rights: In Copyright

Citation

Zhang, Yimo, "Accounting for Missing Data in the Random Forest Algorithm" (2019). Biostatistics Theses and Dissertations. Brown Digital Repository. Brown University Library. https://doi.org/10.26300/b90x-hc38

Relations

Collection:

Biostatistics Theses and Dissertations

Theses and Dissertations for the Biostatistics department.
...