AUTHORIZATION TO LEND AND REPRODUCE THE THESIS As the sole author of this thesis, I authorize Brown University to lend it to other Institutions or individuals of the purpose of scholarly research. ____________ _________________________________________ Date Author: Yimo Zhang I further authorize Brown University to reproduce this thesis by photocopying or other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research. ____________ _________________________________________ Date Author: Yimo Zhang iii Accounting for Missing Data in the Random Forest Algorithm By Yimo Zhang Thesis Submitted in partial fulfillment of the requirements for the Degree of Master of Science in the Department of Biostatistics at Brown University PROVIDENCE, RHODE ISLAND MAY 2019 This thesis by Yimo Zhang is accepted in its present form by the Department of Biostatistics as satisfying the thesis requirement for the degree of Master of Science. Date: 05/01/2019 Advisor: Jon Steingrimsson, PhD Reader: Tao Liu, PhD Master’s Program Director: Adam J. Sullivan, PhD Approved by the Graduate Council Date: 05/01/2019 Andrew G. Campbell, PhD, Dean of the Graduate School ii Acknowledgements I would like to take this opportunity to express my gratitude to my advisor, Dr. Jon Steingrimsson, who has been patient and supportive during the last two years. He guided me through the whole research process and gave constructive suggestions on my thesis and study. I would also like to acknowledge my thesis reader, Dr. Tao Liu, who has been giving enlightening comments on my thesis. Finally, thank my parents for their endless love, support and encouragement that make me who I am today. iii Table of contents List of tables vii List of figures ix 1 Introduction 1 1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Missing Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Missing Completely at Random . . . . . . . . . . . . . . . . . . 3 1.2.2 Missing at Random . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.3 Missing Not at Random . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Handling Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Methods 9 2.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.1 Building a Single Tree . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.2 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1.3 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2 Handling missing data in the Random Forest algorithm . . . . . . . . . 14 2.2.1 Multiple Imputation by Chain Equation . . . . . . . . . . . . . 14 2.2.2 Inverse Probability Weighting . . . . . . . . . . . . . . . . . . . 15 v Table of contents 3 Simulation 17 3.1 Set Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3.1 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4 Discussion 27 4.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2.1 Subsampled Random Forest and U-statistic . . . . . . . . . . . 
28 4.2.2 Subsampled Random Forest and Missing Data . . . . . . . . . . 31 References 33 vi List of tables 3.1 Variable description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2 Mean MSE for continuous outcome of 100 simulations . . . . . . . . . . 20 3.3 CE for binary outcome of 100 simulations . . . . . . . . . . . . . . . . 20 3.4 Variable description [13] . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.5 Results of diabetes prediction . . . . . . . . . . . . . . . . . . . . . . . 25 vii List of figures 2.1 Tree Built by Partitioning a Two-Dimensional Feature Space (adopted from [14]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.1 Prediction accuracy under MCAR. The upper figure shows CE associated with binary outcome; the lower figure shows MSE associated with continuous outcome (scaled by log) . . . . . . . . . . . . . . . . . . . . 21 3.2 Prediction accurary under MAR. The upper figure shows CE associated with binary outcome; the lower figure shows MSE associated with continuous outcome (scaled by log) . . . . . . . . . . . . . . . . . . . . 22 3.3 Prediction accuracy under MNAR. The upper figure shows CE associated with binary outcome; the lower figure shows MSE associated with continuous outcome (scaled by log) . . . . . . . . . . . . . . . . . . . . 23 3.4 ROC curves for the five different methods to deal with missing data . . 26 4.1 Prediction accuracy of binary outcome when the outcome can be missing. 35 ix Chapter 1 Introduction It is well known that failing to appropriately account for missing data can lead to bias and inefficiency [7]. Several approaches have been proposed to deal with missing data which can roughly be classified into likelihood based methods, imputation based methods, and weighing based methods [15]. Risk prediction models based on regression trees are becoming popular non- parametric alternatives to parametric regression models. The Classification and Regression Trees (CART) algorithm builds prediction models by recursively parti- tioning the covariate space by maximizing within group homogeneity with respect to the outcome of interest Breiman [3]. The covariate space is recursively partitioned until a pre-determined stopping criteria is met, creating a large tree (model) usually substantially overfitting the data. Pruning is used to create a sequence of candidate trees for being the final model, and cross-validation is employed to select the final tree from the candidate trees. CART trees are easily visualized and interpretable risk prediction models. But, the hierarchical nature of the model can lead to instability as minor changes in the data can lead to different partitions high up the tree. Different splits high up the tree results in the following splits potentially being drastically different. This instability 1 Introduction and the simplicity of the final piece-wise constant prediction model can lead to low prediction accuracy. To improve prediction accuracy, [1] proposed bagging which averages several different regression trees built on different bootstrap samples. To further improve performance, Breiman [2] proposed the random forest algorithm. The random forest algorithm de-correlates the different trees in the bagging procedure by randomly selecting only a subset of variables to be considered for splitting. 
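To make the de-correlation idea concrete, the following R sketch (illustrative only: the simulated data, variable names, and the choice of the ranger package are ours, not the thesis') contrasts bagging, obtained by letting every split consider all features, with a random forest that considers only a random subset of features at each split.

```r
library(ranger)

set.seed(1)
n   <- 500
dat <- data.frame(matrix(rnorm(n * 10), n, 10))   # columns X1, ..., X10
dat$y <- dat$X1 + dat$X2^2 + rnorm(n)

# mtry = number of predictors -> every split sees all features: bagging.
# mtry < number of predictors -> random forest's de-correlated trees.
bagged <- ranger(y ~ ., data = dat, num.trees = 500, mtry = 10)
forest <- ranger(y ~ ., data = dat, num.trees = 500, mtry = 3)

c(bagging = bagged$prediction.error, rf = forest$prediction.error)  # OOB MSE
```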
Although extensively evaluated in the context of traditional parametric inference, little is known about how different methods for handling missing data perform when used in conjunction with the random forest algorithm. In this thesis we use simulations, as well as an analysis of data on diabetes in Pima Indian women, to evaluate the performance of several methods for handling missing data.

1.1 Notation

Throughout this thesis, capital letters X, Y, Z, ... refer to random variables or fixed constants; lower case letters x, y, z, ... refer to realizations of the corresponding random variables. We introduce the following notation:

1. N, the sample size;
2. M, the number of covariates/features;
3. Z = (Z_1, Z_2, ..., Z_N), the full data we intend to collect;
4. X = (X_1, X_2, ..., X_N), the covariates/features, with X_i = (X_{i1}, X_{i2}, ..., X_{iM});
5. Y = (Y_1, Y_2, ..., Y_N), the outcomes, and Z_i = (Y_i, X_i);
6. R = (R_1, R_2, ..., R_N), where R_i = (R_{i0}, R_{i1}, ..., R_{iM}) is the missing-data indicator for Z_i = (Y_i, X_{i1}, ..., X_{iM}): R_{ij} = 0 indicates that the corresponding component of (Y_i, X_{i1}, ..., X_{iM}) is missing (j = 0 corresponds to the outcome and j = k corresponds to X_{ik}), and R_{ij} = 1 indicates that it is observed;
7. \bar{R} = (\bar{R}_1, ..., \bar{R}_N), where \bar{R}_i = (\bar{R}_{i0}, \bar{R}_{i1}, ..., \bar{R}_{iM}) and \bar{R}_{ij} = 1 - R_{ij};
8. L = (L_1, ..., L_N), the missing-data indicator for Z_i as a whole, where L_i = 1 if Z_i is fully observed and L_i = 0 otherwise;
9. O = \sum_{i=1}^{N} L_i, the number of fully observed individuals.

1.2 Missing Mechanism

Missingness can occur in the covariates or in the outcome. The missingness is said to be univariate if all missingness occurs in a single covariate (i.e., the outcome and all other covariates are fully observed). Missingness is monotone if R_{ij} = 0 implies R_{ik} = 0 for all k > j, and nonmonotone otherwise. We now describe three types of missingness mechanisms.

1.2.1 Missing Completely at Random

Missing completely at random (MCAR) refers to situations where the missingness has nothing to do with the data itself; that is, the missingness is independent of the full data, so P(R = r | Z) does not depend on Z. The reason behind this kind of missingness often has little to do with the study itself. For example, records may be documented with random errors and the erroneous records marked as missing.

1.2.2 Missing at Random

Missing at random (MAR) refers to situations where the missingness depends only on the data that are observed,

P(R = r | Z) = P(R = r | Z_r) = \pi(r, Z_r),

where \pi is some function and Z_r denotes the observed components of Z. For example, in a non-blinded study, patients who are not assigned to the treatment group may drop out, leaving their outcome measurements unobserved.

1.2.3 Missing Not at Random

Missing not at random (MNAR) refers to situations where the probability of being missing depends on data that are not observed; that is, P(R = r | Z) depends on Z_{\bar{r}}, the missing components of Z. This happens, for example, when patients learn in advance that their outcome is likely to be unfavorable and choose to leave the study for that reason.
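To make the three mechanisms concrete, the following R sketch (an illustration of ours, not part of the thesis; the variable names and logistic missingness models are arbitrary choices) generates a small data set and deletes values in one covariate under each mechanism, using the convention of Section 1.1 that R = 1 means observed.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)                            # always observed
x2 <- rnorm(n)                            # subject to missingness
y  <- 1 + x1 + x2 + rnorm(n)

# R = 1 means observed, R = 0 means missing.
r_mcar <- rbinom(n, 1, 0.7)               # MCAR: P(observed) ignores the data
r_mar  <- rbinom(n, 1, plogis(1 - x1))    # MAR: depends only on the observed x1
r_mnar <- rbinom(n, 1, plogis(1 - x2))    # MNAR: depends on x2, which may be missing

x2_mcar <- ifelse(r_mcar == 1, x2, NA)
x2_mar  <- ifelse(r_mar  == 1, x2, NA)
x2_mnar <- ifelse(r_mnar == 1, x2, NA)
```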
1.3 Handling Missing Data

Many methods can be adopted to address missing data problems. The most straightforward one is to completely ignore observations that are not fully observed. However, this method is not recommended in practical settings, as it throws away information carried by the data set and can lead to bias and inefficiency.

To make use of the information in partially observed individuals, two popular ways of handling missing data are imputation and weighting. We now describe these methods in general terms; in Chapter 2 we describe them in the context of the random forest algorithm.

Imputing the Mean

The simplest way to impute missing data is to fill in the mean of the observed values, that is,

X_{ij} = \frac{1 - r_{ij}}{\sum_{k=1}^{N} r_{kj}} \sum_{k=1}^{N} r_{kj} X_{kj} + r_{ij} X_{ij}, \quad j = 1, 2, ..., M,   (1.1)

Y_i = \frac{1 - r_{i0}}{\sum_{k=1}^{N} r_{k0}} \sum_{k=1}^{N} r_{k0} Y_k + r_{i0} Y_i.   (1.2)

However, despite its simplicity, this method can lead to problematic results: the variance is underestimated, and the missing values are imputed using a complete-case estimator.

A way to allow partially missing observations to behave differently from fully observed observations is to add a missing-data indicator when fitting a model to the mean-imputed data. For example, in the context of a linear model,

E[Y | X_i] = \gamma_0 L_i + \sum_{j=1}^{M} \gamma_j X_{ij},

where the X_i conditioned on above are the covariates after mean imputation. An alternative is to add, for each variable, an indicator of whether that variable was missing for that participant. This allows the effect of missingness to differ depending on which variable is missing.

Multiple Imputation

The basic idea of multiple imputation (MI) is to impute the data from some model several times to obtain several completed data sets, conduct the analysis on each completed data set to obtain an estimator, and take the final estimator to be the average of these estimators, with a variance estimator that is adjusted for the missingness. Assume the number of imputations is K. The multiple imputation procedure is given by Algorithm 1.

Algorithm 1 General multiple imputation
for i in 1 to K do
  1. Impute the missing values based on some method and create the i-th completed data set.
  2. Conduct the data analysis on the i-th data set and obtain an estimator \hat{\theta}_i of the parameter \theta.
end for
3. The final estimator of \theta is \hat{\theta} = \frac{1}{K} \sum_{i=1}^{K} \hat{\theta}_i.
4. The estimator of the variance of \hat{\theta} is
\hat{\Sigma} = \frac{1}{K} \sum_{i=1}^{K} \widehat{Var}(\theta)_i + \frac{K+1}{K(K-1)} \sum_{i=1}^{K} (\hat{\theta}_i - \hat{\theta})(\hat{\theta}_i - \hat{\theta})^T,
where \widehat{Var}(\theta)_i is the variance estimator from the i-th imputed data set.

Several MI methods have been proposed that differ in how the data are imputed (how the missing values are filled in in Step 1), including proper imputation, improper imputation and multivariate imputation [11]. The method we adopt is Multivariate Imputation by Chained Equations (MICE) [17], which imputes the data iteratively until convergence to create a single imputed data set. Assuming there are B (B <= M) variables in the data set that contain missing values, the algorithm is as follows (a small illustrative sketch is given after the algorithm):

Algorithm 2 MICE algorithm for multiple imputation
1. Fill in all missing values by random sampling from the observed values.
for j in 1 to B do
  2. Regress the j-th variable with missing values, say X^j, on all other variables, using only the individuals whose X^j is observed. Note that values imputed in previous steps are also used in the regression.
  3. Impute the missing values of X^j using the regression model obtained in the previous step.
end for
4. Repeat the loop over j several times until the results become stable.
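As a concrete, deliberately simplified sketch of Algorithm 2, the following R function performs chained-equation imputation for a fully numeric data frame using plain linear regressions and a fixed number of sweeps. The function name and the use of lm are our own choices; a real analysis would use the mice package, which the thesis relies on and which handles mixed variable types and proper imputation uncertainty.

```r
# Minimal chained-equation imputation for a numeric data frame `dat`
# (illustrative only; real analyses would use the mice package).
impute_chained <- function(dat, n_iter = 10) {
  miss <- is.na(dat)                             # remember where the holes are
  # Step 1: initialize by random sampling from the observed values
  for (j in seq_along(dat)) {
    obs <- dat[[j]][!miss[, j]]
    dat[[j]][miss[, j]] <- sample(obs, sum(miss[, j]), replace = TRUE)
  }
  vars_with_na <- which(colSums(miss) > 0)
  for (it in seq_len(n_iter)) {                  # Step 4: sweep until stable
    for (j in vars_with_na) {
      # Step 2: regress variable j on all others, using rows where j was observed
      fit <- lm(reformulate(names(dat)[-j], names(dat)[j]),
                data = dat[!miss[, j], ])
      # Step 3: re-impute the originally missing entries from the fitted model
      dat[[j]][miss[, j]] <- predict(fit, newdata = dat[miss[, j], , drop = FALSE])
    }
  }
  dat
}
```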
One key advantage of this method is its compatibility with different types of variables, whether continuous, binary or categorical; the regression model used in Step 2 (linear, logistic or multinomial regression) should be chosen according to the type of the variable being imputed.

Inverse Probability Weighting

The basic idea of inverse probability weighting [10] is to model the probability of missingness and weight the observations by the inverse of their probability of being fully observed. The intuition is that the less likely an observed individual was to be observed, the more individuals similar to it are likely to be missing from the data set, and the more weight it should therefore receive. To implement inverse probability weighting, we must:

1. Fit a model for the probability of being fully observed using the variables that are fully observed. Assume that a total of J variables, {W^1, W^2, ..., W^J} \subseteq {X^1, X^2, ..., X^M, Y}, are fully observed. Fit a logistic regression model for the probability that L = 1,

P(L_i = 1 | X_i, Y_i) = \pi(Z_i; \beta), \qquad \mathrm{logit}\, \pi(Z_i; \beta) = \beta_0 + \sum_{k=1}^{J} \beta_k W_{ik}.   (1.3)

Other prediction models can also be used.

2. Use the fitted model to estimate the parameter of interest by solving a weighted estimating equation (or, more generally, a weighted version of the analysis of interest).

Chapter 2

Methods

2.1 Random Forest

2.1.1 Building a Single Tree

The classification and regression tree algorithm is a supervised learning algorithm. To build a tree, we partition the feature space into small regions and then fit a simple model within each region. In Figure 2.1, a tree is built by partitioning a two-dimensional feature space into five regions, R_1, R_2, R_3, R_4, R_5, using the splits X_1 = t_1, X_1 = t_3, X_2 = t_2, X_2 = t_4.

Fig. 2.1 Tree built by partitioning a two-dimensional feature space (adapted from [14]).

Once the tree is built, a prediction for a new feature vector x = (x_1, x_2) is made by

T(x) = \sum_{i=1}^{5} \hat{f}_i(x) I\{x \in R_i\},   (2.1)

where I\{x \in R_i\} is the indicator function, equal to 1 if x \in R_i and 0 otherwise, and \hat{f}_i(x) is a model fitted using the data from region R_i. When the outcome is continuous and squared error loss is used, \hat{f}_i is the constant obtained by averaging the outcomes of all observations whose features fall in region R_i, that is,

\hat{f}_i = \hat{C}_i = \frac{1}{\sum_{j=1}^{N} I\{X_j \in R_i\}} \sum_{j=1}^{N} I\{X_j \in R_i\} Y_j,   (2.2)

which is the constant minimizing the squared error within each region:

\hat{C}_i = \arg\min_{C_i \in \mathbb{R}} \sum_{j=1}^{N} I\{X_j \in R_i\} (Y_j - C_i)^2.

After choosing the model to be fit within each region, we grow the tree by deciding (a sketch of the split search is given at the end of this subsection):

1. which feature/variable and split point to use;
2. when to stop.

The first step is solved by going over all possible splitting variables and split points and selecting the pair (j, s) that minimizes the total sum of squares,

\min_{j,s} \Big[ \sum_{k: x_k \in R_1} (y_k - \hat{f}_1)^2 + \sum_{k: x_k \in R_2} (y_k - \hat{f}_2)^2 \Big],   (2.3)

where \hat{f}_i, i = 1, 2, is calculated by (2.2) with regions R_1 = \{X_k : X_{kj} < s\} and R_2 = \{X_k : X_{kj} \geq s\}. After making the first split, this process is continued in every resulting region. The second step is achieved when some pre-specified stopping criterion is met (e.g., the depth of the tree is large enough, or there is not enough data in a node to warrant further splitting).
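The split search in (2.3) can be written in a few lines of R. The following brute-force sketch (our own illustration, with no attempt at efficiency) scans every variable and every observed split point and returns the pair that minimizes the total within-region sum of squares.

```r
# Find the best split (variable j, threshold s) by minimizing the total
# within-region sum of squares, as in (2.3). Illustrative only.
best_split <- function(X, y) {
  best <- list(sse = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[, j]))) {
      left  <- X[, j] < s
      right <- !left
      if (!any(left) || !any(right)) next       # skip degenerate splits
      sse <- sum((y[left]  - mean(y[left]))^2) +
             sum((y[right] - mean(y[right]))^2)
      if (sse < best$sse) best <- list(sse = sse, j = j, s = s)
    }
  }
  best
}
```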
2.1.2 Bagging

After building a tree T, a prediction for a new individual x can be made as

T(x) = \sum_{k=1}^{|T|} \hat{f}_k I\{x \in R_k\},

where |T| is the number of regions of the tree T. However, basing the prediction on only one tree can be unreliable, as the splitting criterion is easily affected by outliers. Bagging is a technique for reducing the variance of the prediction by averaging the predictions of models built on bootstrap samples. It has been argued that bagging reduces the mean squared error while leaving the bias essentially unchanged [1]. Assume we draw a total of B bootstrap samples from (Z_1, Z_2, ..., Z_N) and denote the tree built from the b-th bootstrap sample by T^{*b}; the final prediction is then

T_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} T^{*b}(x).

2.1.3 Random Forest

Bagging improves on the prediction accuracy of a single tree. However, the variance of the bagged prediction can still be considerable, because the trees built on bootstrap samples are highly correlated: they use the same splitting criterion and almost the same data. Assume the B bootstrap trees are identically distributed with variance \sigma^2 and pairwise correlation \rho; the variance of the bagged average (the final prediction) is then

\mathrm{var}\big(T_{bag}(x)\big) = \mathrm{var}\Big(\frac{1}{B} \sum_{b=1}^{B} T^{*b}(x)\Big) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2,

and

\lim_{B \to \infty} \mathrm{var}\big(T_{bag}(x)\big) = \rho\sigma^2.

The idea of the random forest is to reduce the correlation \rho between trees without increasing the variance \sigma^2 too much. To achieve this, instead of considering all features at every split of every bootstrap tree, we randomly choose P of the M features as candidates for each split. In this way, even though the splitting criterion is the same, the bootstrap trees will likely look quite different.

Algorithm 3 Random forest algorithm for fully observed data
for b in 1 to B do
  1. Draw the b-th bootstrap sample Z^{*b} of size N from Z.
  2. Build the b-th tree T^{*b} of the form (2.1) on the bootstrap sample from the previous step by repeating the following steps at each node until the number of nodes of T^{*b} reaches some threshold:
     i. Select P features at random from the M features.
     ii. Find the best splitting pair of feature and split point (j, s) by solving (2.3), and split that node.
end for
3. The final prediction is T_{rf}(x) = \frac{1}{B} \sum_{b=1}^{B} T^{*b}(x).

2.2 Handling Missing Data in the Random Forest Algorithm

When training a random forest on an incomplete data set, we first address the missing data problem, by either imputation or inverse probability weighting, to obtain a complete data set, and then run the random forest algorithm on that data set.

2.2.1 Multiple Imputation by Chained Equations

A data set produced by imputation can be treated as an ordinary complete data set, on which the random forest algorithm is simply run (a sketch of this workflow follows the algorithm).

Algorithm 4 Random forest algorithm with MICE
for m in 1 to K do
  1. Impute the data set using Algorithm 2 to obtain the m-th completed data set Z_{(m)}.
  for b in 1 to B do
    2. Draw the b-th bootstrap sample Z^{*b}_{(m)} of size N from Z_{(m)}.
    3. Build the b-th tree T^{*b}_{(m)} of the form (2.1) on the bootstrap sample from the previous step by repeating the following steps at each node until the number of nodes of T^{*b}_{(m)} reaches some threshold:
       i. Select P features at random from the M features.
       ii. Find the best splitting pair of feature and split point (j, s) by solving (2.3), and split that node.
  end for
  4. The prediction based on the m-th imputed data set is T^{(m)}_{rf}(x) = \frac{1}{B} \sum_{b=1}^{B} T^{*b}_{(m)}(x).
end for
5. The final prediction is the average over the K imputed data sets: T_{rf}(x) = \frac{1}{K} \sum_{m=1}^{K} T^{(m)}_{rf}(x).
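One plausible implementation of Algorithm 4, sketched here for a continuous outcome, uses the mice and ranger packages: impute K times, fit a forest to each completed training set, and average the predictions. The package choices and the assumption that the outcome column is named y are ours; the thesis does not prescribe a particular implementation for this step.

```r
library(mice)
library(ranger)

# Algorithm 4, sketched: K imputations, one forest per completed data set,
# predictions averaged across imputations. `train` has NAs; `test` is complete.
rf_mice_predict <- function(train, test, K = 5, num.trees = 1000) {
  imp <- mice(train, m = K, printFlag = FALSE)         # chained-equation imputations
  preds <- sapply(seq_len(K), function(m) {
    completed <- complete(imp, m)                      # the m-th completed data set
    fit <- ranger(y ~ ., data = completed, num.trees = num.trees)
    predict(fit, data = test)$predictions
  })
  rowMeans(preds)                                      # average over the K forests
}
```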
2.2.2 Inverse Probability Weighting

To incorporate inverse probability weighting into the random forest algorithm, some modifications are needed. There are two ways to incorporate the weights. The first is to use a weighted bootstrap instead of the regular bootstrap (Præstgaard and Wellner [9]):

Algorithm 5 IPW random forest algorithm
1. Estimate the weight for each fully observed individual as \hat{\omega}_i = 1/\hat{\pi}_i using (1.3). Observations that are not fully observed (L_i = 0) receive weight 0, so they are never selected into a bootstrap sample.
for b in 1 to B do
  2. Draw the b-th weighted bootstrap sample Z^{*b} of size O from {Z_i : L_i = 1}, selecting individual i with probability \hat{\omega}_i / \sum_{i: L_i = 1} \hat{\omega}_i.
  3. Build the b-th tree T^{*b} of the form (2.1) on the bootstrap sample from the previous step by repeating the following steps at each node until the number of nodes of T^{*b} reaches some threshold:
     i. Select P features at random from the M features.
     ii. Find the best splitting pair of feature and split point (j, s) by solving (2.3), and split that node.
end for
4. The final prediction is T_{rf}(x) = \frac{1}{B} \sum_{b=1}^{B} T^{*b}(x).

By putting positive weights only on fully observed observations, Algorithm 5 ensures that the individual trees are built only on fully observed data (an R sketch of this weighted-bootstrap variant is given at the end of this subsection).

The other way to use inverse probability weighting to account for missing data in the random forest algorithm is to apply a different splitting criterion:

Algorithm 6 Random forest algorithm using an IPW-based splitting criterion
1. Estimate the weight for each fully observed individual as \hat{\omega}_i = 1/\hat{\pi}_i using (1.3).
for b in 1 to B do
  2. Draw the b-th bootstrap sample Z^{*b} of size N from the fully observed observations in Z.
  3. Build the b-th tree T^{*b} of the form (2.1) on the bootstrap sample from the previous step by repeating the following steps at each node until the number of nodes of T^{*b} reaches some threshold:
     i. Select P features at random from the M features.
     ii. Find the best splitting pair of feature and split point (j, s) by solving

\min_{j,s} \Big[ \sum_{k: x_k \in R_1} \hat{\omega}_k (y_k - \hat{f}_1)^2 + \sum_{k: x_k \in R_2} \hat{\omega}_k (y_k - \hat{f}_2)^2 \Big],   (2.4)

where \hat{f}_i = \frac{1}{\sum_{k: x_k \in R_i} \hat{\omega}_k} \sum_{k: x_k \in R_i} \hat{\omega}_k Y_k with regions R_1 = \{X_k : X_{kj} < s\} and R_2 = \{X_k : X_{kj} \geq s\}, and split that node.
end for
4. The final prediction is T_{rf}(x) = \frac{1}{B} \sum_{b=1}^{B} T^{*b}(x).
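As a rough illustration of the weighted-bootstrap variant (Algorithm 5), the following sketch estimates the observation probabilities with a logistic regression as in (1.3) and passes the inverse-probability weights to ranger, whose case.weights argument controls how likely each row is to be drawn into a tree's bootstrap sample. The data frame and column names (y, x1, x2), and the choice of ranger, are our own assumptions, not specified by the thesis.

```r
library(ranger)

# IPW + random forest along the lines of Algorithm 5 (weighted bootstrap).
# Assumes `dat` has outcome `y`; x1 and x2 are fully observed, other columns may have NAs.
dat$L <- as.integer(complete.cases(dat))               # L = 1 if fully observed

# Step 1: model P(L = 1) using the fully observed variables, as in (1.3)
pi_fit <- glm(L ~ x1 + x2, data = dat, family = binomial)
pi_hat <- predict(pi_fit, type = "response")
w      <- ifelse(dat$L == 1, 1 / pi_hat, 0)            # weight 0 for incomplete cases

# Steps 2-4: case.weights govern the bootstrap sampling probabilities, so
# incomplete cases (weight 0) are never drawn into a tree.
cc <- dat[dat$L == 1, setdiff(names(dat), "L")]
rf <- ranger(y ~ ., data = cc, num.trees = 1000,
             case.weights = w[dat$L == 1])
```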
Chapter 3

Simulation

3.1 Set Up

We conducted 100 simulations to evaluate the performance of the random forest algorithm when combined with the methods for addressing missing data discussed in the previous chapter. The simulation settings included both continuous and binary outcomes.

In each simulation, a data set of size 1000 was created with 15 covariates, X_1, X_2, ..., X_{15}, generated from a multivariate normal distribution with mean 0. The covariance of any pair of covariates (X_i, X_j) was set to 0.9^{|i-j|}. The continuous outcome was then calculated from five of the covariates through the formula

Y_{con} = 5 + 4X_1 + 3e^{X_2} + 3X_3^3 - X_{11}^4 + 2X_{12}^3 + \varepsilon,   (3.1)

where \varepsilon ~ N(0, 1) is the error term. The data set was then split into a training set and a test set of size 600 and 400, respectively. Missing values were then created in the training set under each of the three missing mechanisms. More specifically, we used the function ampute in the R package mice to create missing data with arguments prop = 0.3 (30% of the individuals have missing data) and pattern = c(rep(1, 5), rep(0, 10), 1) (the missing pattern: the first five covariates are always observed, the last ten covariates can be missing, and the outcome is always observed). For Missing Completely at Random we set the argument mech to 'MCAR'; for Missing at Random we set weights to c(rep(1, 5), rep(0, 11)), so that the missingness is related only to observed covariates; for Missing Not at Random we set weights to rep(1, 16), so that the missingness is related to the observed covariates, the missing covariates and the outcome. A sketch of this data-generation and amputation step is given at the end of this section.

The outcome generation equations use constant coefficients chosen at random from 1 to 5, constructed in such a way that, among the covariates from which the outcome is generated, some can be missing while others are always observed. This makes the simulation more generalizable, in the sense that the missing variables are neither all related nor all unrelated to the outcome.

For the binary outcome, we used a slightly different outcome generation equation,

Y = 20 + X_1 + 5e^{X_2} + X_3^4 + 5X_4^2 + X_{11}^3, \qquad Y_{bi} = I\{Y > 35\}.   (3.2)

The coefficients are again randomly chosen, and the covariates are generated in the same way as in the continuous case. The threshold is set to 35 to mimic the structure of the real data analyzed later in this chapter, so that the ratio of label-0 to label-1 outcomes is roughly 2:1.

After creating the missing values, we used the following methods to deal with the missing data in the random forest algorithm: multiple imputation by chained equations (Algorithm 4), inverse probability weighting (Algorithm 6, i.e. the second approach, in which the weights enter the splitting criterion and the prediction function), mean imputation, complete-case analysis (ignoring individuals with missing values), the missing data method developed by [5], and mean imputation with a missing indicator. 1000 trees were built in each random forest, and splitrule was set to mse for continuous outcomes and gini for binary outcomes.

Prediction accuracy was measured by the mean squared error (MSE) for continuous outcomes,

MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2,

and by the cross entropy (CE) for binary outcomes,

CE = -\frac{1}{n} \sum_{i=1}^{n} \Big[ Y_i \log \hat{Y}_i + (1 - Y_i) \log(1 - \hat{Y}_i) \Big],

where \hat{Y}_i denotes the prediction for individual i (for binary outcomes, the predicted probability). Lower values of MSE and CE indicate better prediction accuracy.
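The following R sketch reproduces the data-generation and amputation step described above. The seed and object names are ours, and the pattern argument of ampute is written as patterns here (the spelling used by current versions of mice), so the exact call should be checked against the package documentation.

```r
library(MASS)   # mvrnorm
library(mice)   # ampute

set.seed(2019)
n <- 1000; p <- 15
Sigma <- 0.9 ^ abs(outer(1:p, 1:p, "-"))            # cov(X_i, X_j) = 0.9^|i-j|
X   <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
eps <- rnorm(n)
y   <- 5 + 4 * X[, 1] + 3 * exp(X[, 2]) + 3 * X[, 3]^3 -
       X[, 11]^4 + 2 * X[, 12]^3 + eps              # equation (3.1)
dat <- data.frame(X, y)

train <- dat[1:600, ]; test <- dat[601:1000, ]

# First five covariates and the outcome stay observed, covariates 6-15 can be
# missing; 30% of training rows receive missing values.
pat <- c(rep(1, 5), rep(0, 10), 1)
amp_mcar <- ampute(train, prop = 0.3, patterns = pat, mech = "MCAR")$amp
amp_mar  <- ampute(train, prop = 0.3, patterns = pat, mech = "MAR",
                   weights = c(rep(1, 5), rep(0, 11)))$amp
amp_mnar <- ampute(train, prop = 0.3, patterns = pat, mech = "MNAR",
                   weights = rep(1, 16))$amp
```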
3.2 Results

The prediction accuracy was calculated by either MSE or CE for the six missing data techniques under the three missing data mechanisms. The methods and their corresponding abbreviations are shown below.

Table 3.1 Methods and their abbreviations

Abbreviation   Method
mi             multiple imputation by chained equations
ipw            inverse probability weighting
ignore         complete case
mean           mean imputation
survival       survival random forest
mean_ind       mean imputation with missing indicator

Table 3.2 Mean MSE for the continuous outcome over 100 simulations

        MI       IPW      Ignore    Mean      Survival   Mean_ind
MCAR    96.142   91.630   107.543   102.675   101.117    101.193
MAR     91.547   97.819   115.960   91.601    94.003     90.410
MNAR    91.617   96.092   114.500   90.664    92.744     88.716

Table 3.3 CE for the binary outcome over 100 simulations

        MI       IPW      Ignore   Mean     Survival   Mean_ind
MCAR    0.1533   0.1569   0.1632   0.1573   0.1656     0.1591
MAR     0.1532   0.1574   0.1628   0.1566   0.1666     0.1596
MNAR    0.1582   0.1578   0.1624   0.1595   0.1685     0.1631

Fig. 3.1 Prediction accuracy under MCAR. The upper panel shows the CE associated with the binary outcome; the lower panel shows the MSE associated with the continuous outcome (log scale).

Fig. 3.2 Prediction accuracy under MAR. The upper panel shows the CE associated with the binary outcome; the lower panel shows the MSE associated with the continuous outcome (log scale).

Fig. 3.3 Prediction accuracy under MNAR. The upper panel shows the CE associated with the binary outcome; the lower panel shows the MSE associated with the continuous outcome (log scale).

For the continuous outcome, the differences in prediction accuracy among the six methods are quite small relative to their scale, as shown in Figures 3.1, 3.2 and 3.3. As shown in Table 3.2, complete-case analysis is the worst method under all three missing mechanisms, having the largest MSE. Under MCAR, IPW has the smallest prediction error; under MAR and MNAR, MI, mean imputation and mean imputation with a missing indicator have lower MSE than the others and thus have the potential to yield more precise predictions. Overall, however, the differences among the methods are small.

For the binary outcome, the differences in prediction accuracy among the six methods are also small according to the figures. However, as shown in Table 3.3, under all three missing mechanisms MI, IPW and mean imputation are associated with slightly lower CE than complete-case analysis, survival random forest and mean imputation with a missing indicator.

3.3 Data Analysis

3.3.1 Data Set

The data we used contain medical information on women from the Pima Indian population near Phoenix, Arizona, who were at risk of diabetes [13]. The data set contains 7 covariates and 1 outcome variable:

Table 3.4 Variable description [13]

Variable   Description
npreg      number of times pregnant
glu        plasma glucose concentration (mg/dl)
bp         diastolic blood pressure (mm Hg)
skin       triceps skin fold thickness (mm)
bmi        body mass index (kg/m^2)
ped        diabetes pedigree function
age        age in years
type       indicator of diabetes, Yes (1) or No (0)

The variable ped is calculated as a measure of familial diabetes occurrence; the exact way it is calculated is described in [13]. The data set was split into a training set and a test set. The training set contains 300 observations, 100 of which contain missing values. The missingness occurs in only 3 covariates, skin, bp and bmi, with 98, 3 and 13 missing values, respectively. Because bp and bmi contain few missing values, we simply removed the observations with missing values in these two variables to avoid imbalance. After this, the training set has 284 observations, 84 of which have a missing value for skin. The test set contains 332 observations with no missing values.
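The thesis does not state where the data were obtained, but the described sample sizes (300 training rows with 100 incomplete, 332 complete test rows) and variable names match the Pima.tr2 and Pima.te data frames in the R package MASS, so a plausible preparation step looks like the following sketch; the counts in the comments are those reported above and should be verified after running it.

```r
library(MASS)   # Pima.tr2 (training data with NAs) and Pima.te (complete test data)

# Assumption: the thesis data correspond to the MASS versions of the Pima data.
train <- Pima.tr2
test  <- Pima.te

# Drop the few rows with missing bp or bmi, keeping the missingness in `skin`.
train <- train[!is.na(train$bp) & !is.na(train$bmi), ]

nrow(train)             # expected to be 284, per the description above
sum(is.na(train$skin))  # expected to be 84 missing skin values
```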
3.3.2 Results

Following the procedure of the simulation study, we implemented five missing data techniques with the random forest on the training set to build five models, made predictions on the test set with each model, calculated the prediction accuracy, and compared the models by CE and AUC. The results are as follows:

Table 3.5 Results of diabetes prediction

       MI       IPW      Ignore   Mean     Survival
CE     0.4819   0.4818   0.4940   0.4828   0.4766
AUC    0.8239   0.8185   0.8149   0.8246   0.8272

Fig. 3.4 ROC curves for the five different methods for dealing with missing data (AUC: mi 0.8239, ipw 0.8185, ignore 0.8149, mean 0.8246, survival 0.8272).

By Table 3.5, survival random forest is the best missing data technique in this case, as it has the highest AUC (0.8272) and the lowest CE (0.4766). Complete-case analysis, on the other hand, is the least desirable, being associated with the highest CE and the lowest AUC. However, the differences in CE and AUC among the five methods are quite small.

Chapter 4

Discussion

4.1 Conclusion

From the simulations we see that, apart from complete-case analysis performing poorly, no method dominates the others in terms of prediction accuracy. Several factors can affect the performance of a method:

1. Missing mechanism: in the binary-outcome simulation, MI was the best choice under both MCAR and MAR, but under MNAR, IPW was slightly better. However, since the missing data mechanism is an assumption that cannot be verified from the data, in practice we can try each method by splitting the data into a training set and a test set and checking the prediction accuracy on the test set. Cross-validation can further reduce the bias induced by a single split.

2. Type of outcome: in the simulations, survival random forest has relatively low MSE when the outcome is continuous, but when the outcome is binary it has the highest CE under all three missing mechanisms.

3. Missing pattern: the continuous outcome is constructed from five covariates; under the missing pattern used, the first three are always observed and the last two are sometimes missing, so the underlying relationship is only partially affected by missingness. If all covariates that affect the outcome were always observed and only nuisance covariates were missing, the results would be different.

4. Location of missingness: although only missingness in the covariates is considered in this thesis, missingness in the outcome can also change the results. For example, in the binary simulation, if we allow the outcome to be missing (with the same outcome construction and missingness parameters), MI behaves much worse than the other methods, including complete-case analysis (Figure 4.1).

4.2 Future Work

Estimating the variability associated with a prediction is an important component of prediction. [8] derived variance estimators for the random forest algorithm using subsampled random forests. The core idea of the subsampled random forest [8] is to build each tree in a particular manner so that the random forest can be viewed as a random kernel U-statistic; an asymptotic normal distribution then follows, and a variance estimator can be constructed from this distributional result.
4.2.1 Subsampled Random Forest and U-statistic

A U-statistic is defined as

U_n = \binom{n}{k}^{-1} \sum_{(i_1, \ldots, i_k)} h(Z_{i_1}, Z_{i_2}, \ldots, Z_{i_k}),   (4.1)

where (i_1, ..., i_k) are k distinct elements of (1, 2, ..., n) (which implies n \geq k), the summation is over all \binom{n}{k} such combinations, h(\cdot) is a symmetric function of k arguments, and Z_1, Z_2, ..., Z_n are n independent and identically distributed random variables (or random vectors) [12].

A random forest built through Algorithm 3 almost fits the structure of a U-statistic. First, the observed data (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) are iid (or at least assumed to be iid); second, when a sample of size k (k < n) is used to build a tree, the order of the data has no impact on the result, which corresponds to the symmetry of h(\cdot); finally, the prediction of the random forest is the average of the predictions from the \binom{n}{k} trees.

However, as n grows, it becomes computationally infeasible to build \binom{n}{k} trees. Therefore, instead of building \binom{n}{k} trees, we limit the number of trees to m_n (m_n < \binom{n}{k}). Also, as n increases, the prediction from each tree can become inaccurate if the number of samples k used to build each tree stays the same, simply because each tree would then be built on insufficient information. Hence, [8] allow the subsample size k_n to grow with n.

The statistic corresponding to these changes is called a resampled U-statistic [4]. It is built in a similar manner to (4.1), except that instead of averaging over all combinations (i_1, ..., i_k), it averages over m_n (m_n < \binom{n}{k_n}) index sets of size k_n, selected uniformly at random with replacement from the \binom{n}{k_n} combinations; that is,

U_{n, m_n, k_n} = \frac{1}{m_n} \sum_{(i)} h_{k_n}(Z_{i_1}, \ldots, Z_{i_{k_n}}).   (4.2)

The only feature of Algorithm 3 that has no correspondence in (4.2) is the random selection of variables. To incorporate it, an additional randomization parameter \omega is added to the statistic to capture the randomness introduced by the random selection of splitting variables. Write

U_{\omega; n, k_n, m_n} = \frac{1}{m_n} \sum_{(i)} h^{(\omega_i)}_{k_n}(Z_{i_1}, Z_{i_2}, \ldots, Z_{i_{k_n}}),   (4.3)

which is called a random kernel U-statistic. Because the variables used to build each tree are selected uniformly at random, the \omega's can be regarded as iid (\omega_1, \omega_2, \ldots, \omega_{m_n} \overset{iid}{\sim} F_\omega) and independent of the data. [8] proved that such a random kernel U-statistic has an asymptotic normal distribution under some mild conditions. The proof is long and relies on techniques and theorems from other sources [6], [16], so we break it into two steps and state them succinctly without proof.

Step 1: Let U^{*}_{n, k_n, m_n} = E_\omega(U_{\omega; n, k_n, m_n}); then U^{*}_{n, k_n, m_n} is a resampled U-statistic with a non-random kernel by definition (just view h^{*}_{k_n}(\cdot) = E_\omega[h^{(\omega_i)}_{k_n}(\cdot)] as another symmetric function). It can be shown that a resampled U-statistic with a non-random kernel has an asymptotic normal distribution. More specifically, one of the distributional results in Theorem 1 of [8] is

\frac{\sqrt{n}\,(U^{*}_{n, k_n, m_n} - \theta_{k_n})}{\sqrt{k_n^2 \zeta_{1, k_n}}} \xrightarrow{d} N(0, 1),   (4.4)

where

\alpha = \lim_{n \to \infty} \frac{n}{m_n} = 0, \qquad
\theta_{k_n} = E\, h^{*}_{k_n}(Z_1, \ldots, Z_{k_n}), \qquad
\zeta_{1, k_n} = \mathrm{cov}\big(h^{*}_{k_n}(Z_1, Z_2, \ldots, Z_{k_n}),\, h^{*}_{k_n}(Z_1, Z_2', \ldots, Z_{k_n}')\big).

Step 2: Show that

\sqrt{n}\,(U_{\omega; n, k_n, m_n} - U^{*}_{n, k_n, m_n}) \xrightarrow{P} 0;   (4.5)

it then follows that

\frac{\sqrt{n}\,(U_{\omega; n, k_n, m_n} - \theta_{k_n})}{\sqrt{k_n^2 \zeta_{1, k_n}}}
= \frac{\sqrt{n}\,(U_{\omega; n, k_n, m_n} - U^{*}_{n, k_n, m_n})}{\sqrt{k_n^2 \zeta_{1, k_n}}}
+ \frac{\sqrt{n}\,(U^{*}_{n, k_n, m_n} - \theta_{k_n})}{\sqrt{k_n^2 \zeta_{1, k_n}}}
\xrightarrow{d} 0 + N(0, 1) = N(0, 1).   (4.6)

That is, a random kernel U-statistic (corresponding to a random forest prediction) has an asymptotic normal distribution. With this result, the variance of the prediction (more specifically, \zeta_{1, k_n}) can be estimated using Monte Carlo simulation, and a 95% confidence interval can be constructed accordingly.
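As a toy illustration of (4.2), not tied to trees, the following R sketch approximates a complete U-statistic by averaging its kernel over m_n randomly drawn subsets. With the kernel h(z_1, z_2) = (z_1 - z_2)^2 / 2 the complete U-statistic is the usual sample variance, so the resampled version should land close to var(z). The kernel choice and constants are ours, purely for illustration.

```r
# Resampled (incomplete) U-statistic as in (4.2): average the kernel over
# m_n randomly chosen subsets instead of all choose(n, k_n) of them.
set.seed(3)
z  <- rnorm(200)
kn <- 2
mn <- 5000

u_hat <- mean(replicate(mn, {
  idx <- sample(length(z), kn)        # one subset of size k_n, drawn at random
  diff(z[idx])^2 / 2                  # kernel h evaluated on that subset
}))

c(resampled_U = u_hat, sample_variance = var(z))   # the two should be close
```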
4.2.2 Subsampled Random Forest and Missing Data

Returning to the context of missing data: in the simulation study we saw that multiple imputation and inverse probability weighting are the two best methods for handling missing data when combined with the random forest. Multiple imputation is too complicated and algorithm-dependent to study theoretically, so we are interested in how to make predictions and variance estimates under inverse probability weighting using the subsampled random forest. To do so, we need to extend the subsampled random forest theory to the inverse probability weighting setting.

Some problems must be solved before inverse probability weighting fits into the subsampled random forest framework. First, after applying inverse probability weighting, every fully observed individual is associated with a weight w_i = 1/\hat{\pi}_i, where \hat{\pi}_i is the estimate of the true probability of being fully observed obtained from (1.3). The data used to build the random forest, (X_1, \hat{\pi}_1, Y_1), ..., (X_m, \hat{\pi}_m, Y_m), are then no longer iid, simply because each \hat{\pi}_i is estimated using all of the data, and this does not fit into the U-statistic framework. A potential solution may be to prove that

\sqrt{m}\,\big(U_{\hat{\pi}}(Z_1, \ldots, Z_m) - U_{\pi}(Z_1, \ldots, Z_m)\big) \xrightarrow{P} 0.   (4.7)

Because U_{\pi}(Z_1, \ldots, Z_m) has the form of a U-statistic, it has an asymptotic normal distribution. If (4.7) holds, then U_{\hat{\pi}}(Z_1, \ldots, Z_m) also has an asymptotic normal distribution, and inference can be made in a similar way. We plan to explore this in future work.
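The thesis leaves this extension as future work. Purely as a computational illustration of the direction sketched above (not the proposed methodology), the following R code combines the two ingredients for a continuous outcome: each tree is grown on a subsample of size k_n drawn from the complete cases with probability proportional to the IPW weights, and the predictions are averaged. The between-tree spread reported at the end is only a Monte Carlo summary, not the variance estimator of [8], which requires a more structured subsampling scheme; the function name, the use of rpart, and the assumption that the outcome column is named y are all ours.

```r
library(rpart)

# Illustrative subsampled forest with IPW-weighted subsampling.
# `cc`: complete cases with outcome column y; `w`: IPW weights for those rows;
# `test`: complete test data; k_n: subsample size; m_n: number of trees.
subsampled_ipw_forest <- function(cc, w, test, kn = 100, mn = 500) {
  preds <- replicate(mn, {
    idx <- sample(nrow(cc), kn, replace = FALSE, prob = w)  # one weighted subsample
    fit <- rpart(y ~ ., data = cc[idx, ])                   # one tree per subsample
    predict(fit, newdata = test)
  })
  list(prediction = rowMeans(preds),        # subsampled-forest prediction
       mc_sd      = apply(preds, 1, sd))    # between-tree Monte Carlo spread only
}
```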
References

[1] Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.
[2] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
[3] Breiman, L. (2017). Classification and Regression Trees. Routledge.
[4] Frees, E. W. (1989). Infinite order U-statistics. Scandinavian Journal of Statistics, 16(1):29–45.
[5] Ishwaran, H., Kogalur, U. B., Blackstone, E. H., Lauer, M. S., et al. (2008). Random survival forests. The Annals of Applied Statistics, 2(3):841–860.
[6] Lee, A. J. (1990). U-Statistics: Theory and Practice. CRC Press.
[7] Little, R. J. and Rubin, D. B. (2014). Statistical Analysis with Missing Data, volume 333. John Wiley & Sons.
[8] Mentch, L. and Hooker, G. (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research, 17(1):841–881.
[9] Præstgaard, J. and Wellner, J. A. (1993). Exchangeably weighted bootstraps of the general empirical process. The Annals of Probability, pages 2053–2086.
[10] Robins, J. M. and Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In AIDS Epidemiology, pages 297–331. Springer.
[11] Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434):473–489.
[12] Shao, J. (2006). Mathematical Statistics: Exercises and Solutions. Springer Science & Business Media.
[13] Smith, J. W., Everhart, J., Dickson, W., Knowler, W., and Johannes, R. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care, page 261. American Medical Informatics Association.
[14] Hastie, T., Tibshirani, R., and Friedman, J. (2008). The Elements of Statistical Learning, 2nd edition. Springer.
[15] Tsiatis, A. (2007). Semiparametric Theory and Missing Data. Springer Science & Business Media.
[16] van der Vaart, A. W. (2000). Asymptotic Statistics, volume 3. Cambridge University Press.
[17] White, I. R., Royston, P., and Wood, A. M. (2011). Multiple imputation using chained equations: issues and guidance for practice. Statistics in Medicine, 30(4):377–399.

Appendix

Fig. 4.1 Prediction accuracy for the binary outcome when the outcome can also be missing.