AUTHORIZATION TO LEND AND REPRODUCE THE THESIS As the sole author of this thesis, I authorize Brown University to lend it to other Institutions or individuals of the purpose of scholarly research. ____________ _________________________________________ Date Author: Yimo Zhang I further authorize Brown University to reproduce this thesis by photocopying or other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research. ____________ _________________________________________ Date Author: Yimo Zhang iii Accounting for Missing Data in the Random Forest Algorithm By Yimo Zhang Thesis Submitted in partial fulfillment of the requirements for the Degree of Master of Science in the Department of Biostatistics at Brown University PROVIDENCE, RHODE ISLAND MAY 2019 This thesis by Yimo Zhang is accepted in its present form by the Department of Biostatistics as satisfying the thesis requirement for the degree of Master of Science. Date: 05/01/2019 Advisor: Jon Steingrimsson, PhD Reader: Tao Liu, PhD Master’s Program Director: Adam J. Sullivan, PhD Approved by the Graduate Council Date: 05/01/2019 Andrew G. Campbell, PhD, Dean of the Graduate School ii Acknowledgements I would like to take this opportunity to express my gratitude to my advisor, Dr. Jon Steingrimsson, who has been patient and supportive during the last two years. He guided me through the whole research process and gave constructive suggestions on my thesis and study. I would also like to acknowledge my thesis reader, Dr. Tao Liu, who has been giving enlightening comments on my thesis. Finally, thank my parents for their endless love, support and encouragement that make me who I am today. iii Table of contents List of tables vii List of figures ix 1 Introduction 1 1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Missing Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Missing Completely at Random . . . . . . . . . . . . . . . . . . 3 1.2.2 Missing at Random . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.3 Missing Not at Random . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Handling Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Methods 9 2.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.1 Building a Single Tree . . . . . . . . . . . . . . . . . . . . . . . 9 2.1.2 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1.3 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2 Handling missing data in the Random Forest algorithm . . . . . . . . . 14 2.2.1 Multiple Imputation by Chain Equation . . . . . . . . . . . . . 14 2.2.2 Inverse Probability Weighting . . . . . . . . . . . . . . . . . . . 15 v Table of contents 3 Simulation 17 3.1 Set Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3.1 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4 Discussion 27 4.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.2.1 Subsampled Random Forest and U-statistic . . . . . . . . . . . 
28 4.2.2 Subsampled Random Forest and Missing Data . . . . . . . . . . 31 References 33 vi List of tables 3.1 Variable description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2 Mean MSE for continuous outcome of 100 simulations . . . . . . . . . . 20 3.3 CE for binary outcome of 100 simulations . . . . . . . . . . . . . . . . 20 3.4 Variable description [13] . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.5 Results of diabetes prediction . . . . . . . . . . . . . . . . . . . . . . . 25 vii List of figures 2.1 Tree Built by Partitioning a Two-Dimensional Feature Space (adopted from [14]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.1 Prediction accuracy under MCAR. The upper figure shows CE associated with binary outcome; the lower figure shows MSE associated with continuous outcome (scaled by log) . . . . . . . . . . . . . . . . . . . . 21 3.2 Prediction accurary under MAR. The upper figure shows CE associated with binary outcome; the lower figure shows MSE associated with continuous outcome (scaled by log) . . . . . . . . . . . . . . . . . . . . 22 3.3 Prediction accuracy under MNAR. The upper figure shows CE associated with binary outcome; the lower figure shows MSE associated with continuous outcome (scaled by log) . . . . . . . . . . . . . . . . . . . . 23 3.4 ROC curves for the five different methods to deal with missing data . . 26 4.1 Prediction accuracy of binary outcome when the outcome can be missing. 35 ix Chapter 1 Introduction It is well known that failing to appropriately account for missing data can lead to bias and inefficiency [7]. Several approaches have been proposed to deal with missing data which can roughly be classified into likelihood based methods, imputation based methods, and weighing based methods [15]. Risk prediction models based on regression trees are becoming popular non- parametric alternatives to parametric regression models. The Classification and Regression Trees (CART) algorithm builds prediction models by recursively parti- tioning the covariate space by maximizing within group homogeneity with respect to the outcome of interest Breiman [3]. The covariate space is recursively partitioned until a pre-determined stopping criteria is met, creating a large tree (model) usually substantially overfitting the data. Pruning is used to create a sequence of candidate trees for being the final model, and cross-validation is employed to select the final tree from the candidate trees. CART trees are easily visualized and interpretable risk prediction models. But, the hierarchical nature of the model can lead to instability as minor changes in the data can lead to different partitions high up the tree. Different splits high up the tree results in the following splits potentially being drastically different. This instability 1 Introduction and the simplicity of the final piece-wise constant prediction model can lead to low prediction accuracy. To improve prediction accuracy, [1] proposed bagging which averages several different regression trees built on different bootstrap samples. To further improve performance, Breiman [2] proposed the random forest algorithm. The random forest algorithm de-correlates the different trees in the bagging procedure by randomly selecting only a subset of variables to be considered for splitting. 
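To make the de-correlation idea concrete, the following R sketch (illustrative only: the simulated data, variable names, and the choice of the ranger package are ours, not the thesis') contrasts bagging, obtained by letting every split consider all features, with a random forest that considers only a random subset of features at each split.

```r
library(ranger)

set.seed(1)
n   <- 500
dat <- data.frame(matrix(rnorm(n * 10), n, 10))   # columns X1, ..., X10
dat$y <- dat$X1 + dat$X2^2 + rnorm(n)

# mtry = number of predictors -> every split sees all features: bagging.
# mtry < number of predictors -> random forest's de-correlated trees.
bagged <- ranger(y ~ ., data = dat, num.trees = 500, mtry = 10)
forest <- ranger(y ~ ., data = dat, num.trees = 500, mtry = 3)

c(bagging = bagged$prediction.error, rf = forest$prediction.error)  # OOB MSE
```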
Although extensively evaluated in the context of traditional parametric inference, little is known about how different methods for handling missing data perform when used in conjunction with the random forest algorithm. In this thesis we use simulations, as well as an analysis of data on diabetes in Pima Indian women, to evaluate the performance of several methods for handling missing data.

1.1 Notation

Throughout this thesis, capital letters X, Y, Z, ... refer to random variables or fixed constants; lower case letters x, y, z, ... refer to realizations of the corresponding random variables. We introduce the following notation:

1. N, the sample size;
2. M, the number of covariates/features;
3. Z = (Z_1, Z_2, ..., Z_N), the full data we intend to collect;
4. X = (X_1, X_2, ..., X_N), the covariates/features, with X_i = (X_{i1}, X_{i2}, ..., X_{iM});
5. Y = (Y_1, Y_2, ..., Y_N), the outcomes, and Z_i = (Y_i, X_i);
6. R = (R_1, R_2, ..., R_N), where R_i = (R_{i0}, R_{i1}, ..., R_{iM}) is the missing-data indicator for Z_i = (Y_i, X_{i1}, ..., X_{iM}): R_{ij} = 0 indicates that the corresponding component of (Y_i, X_{i1}, ..., X_{iM}) is missing (j = 0 corresponds to the outcome and j = k corresponds to X_{ik}), and R_{ij} = 1 indicates that it is observed;
7. \bar{R} = (\bar{R}_1, ..., \bar{R}_N), where \bar{R}_i = (\bar{R}_{i0}, \bar{R}_{i1}, ..., \bar{R}_{iM}) and \bar{R}_{ij} = 1 - R_{ij};
8. L = (L_1, ..., L_N), the missing-data indicator for Z_i as a whole, where L_i = 1 if Z_i is fully observed and L_i = 0 otherwise;
9. O = \sum_{i=1}^{N} L_i, the number of fully observed individuals.

1.2 Missing Mechanism

Missingness can occur in the covariates or in the outcome. The missingness is said to be univariate if all missingness occurs in a single covariate (i.e., the outcome and all other covariates are fully observed). Missingness is monotone if R_{ij} = 0 implies R_{ik} = 0 for all k > j, and nonmonotone otherwise. We now describe three types of missingness mechanisms.

1.2.1 Missing Completely at Random

Missing completely at random (MCAR) refers to situations where the missingness has nothing to do with the data itself; that is, the missingness is independent of the full data, so P(R = r | Z) does not depend on Z. The reason behind this kind of missingness often has little to do with the study itself. For example, records may be documented with random errors and the erroneous records marked as missing.

1.2.2 Missing at Random

Missing at random (MAR) refers to situations where the missingness depends only on the data that are observed,

P(R = r | Z) = P(R = r | Z_r) = \pi(r, Z_r),

where \pi is some function and Z_r denotes the observed components of Z. For example, in a non-blinded study, patients who are not assigned to the treatment group may drop out, leaving their outcome measurements unobserved.

1.2.3 Missing Not at Random

Missing not at random (MNAR) refers to situations where the probability of being missing depends on data that are not observed; that is, P(R = r | Z) depends on Z_{\bar{r}}, the missing components of Z. This happens, for example, when patients learn in advance that their outcome is likely to be unfavorable and choose to leave the study for that reason.
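To make the three mechanisms concrete, the following R sketch (an illustration of ours, not part of the thesis; the variable names and logistic missingness models are arbitrary choices) generates a small data set and deletes values in one covariate under each mechanism, using the convention of Section 1.1 that R = 1 means observed.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)                            # always observed
x2 <- rnorm(n)                            # subject to missingness
y  <- 1 + x1 + x2 + rnorm(n)

# R = 1 means observed, R = 0 means missing.
r_mcar <- rbinom(n, 1, 0.7)               # MCAR: P(observed) ignores the data
r_mar  <- rbinom(n, 1, plogis(1 - x1))    # MAR: depends only on the observed x1
r_mnar <- rbinom(n, 1, plogis(1 - x2))    # MNAR: depends on x2, which may be missing

x2_mcar <- ifelse(r_mcar == 1, x2, NA)
x2_mar  <- ifelse(r_mar  == 1, x2, NA)
x2_mnar <- ifelse(r_mnar == 1, x2, NA)
```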
1.3 Handling Missing Data

Many methods can be adopted to address missing data problems. The most straightforward one is to completely ignore observations that are not fully observed. However, this method is not recommended in practical settings, as it throws away information carried by the data set and can lead to bias and inefficiency.

To make use of the information in partially observed individuals, two popular ways of handling missing data are imputation and weighting. We now describe these methods in general terms; in Chapter 2 we describe them in the context of the random forest algorithm.

Imputing the Mean

The simplest way to impute missing data is to fill in the mean of the observed values, that is,

X_{ij} = \frac{1 - r_{ij}}{\sum_{k=1}^{N} r_{kj}} \sum_{k=1}^{N} r_{kj} X_{kj} + r_{ij} X_{ij}, \quad j = 1, 2, ..., M,   (1.1)

Y_i = \frac{1 - r_{i0}}{\sum_{k=1}^{N} r_{k0}} \sum_{k=1}^{N} r_{k0} Y_k + r_{i0} Y_i.   (1.2)

However, despite its simplicity, this method can lead to problematic results: the variance is underestimated, and the missing values are imputed using a complete-case estimator.

A way to allow partially missing observations to behave differently from fully observed observations is to add a missing-data indicator when fitting a model to the mean-imputed data. For example, in the context of a linear model,

E[Y | X_i] = \gamma_0 L_i + \sum_{j=1}^{M} \gamma_j X_{ij},

where the X_i conditioned on above are the covariates after mean imputation. An alternative is to add, for each variable, an indicator of whether that variable was missing for that participant. This allows the effect of missingness to differ depending on which variable is missing.

Multiple Imputation

The basic idea of multiple imputation (MI) is to impute the data from some model several times to obtain several completed data sets, conduct the analysis on each completed data set to obtain an estimator, and take the final estimator to be the average of these estimators, with a variance estimator that is adjusted for the missingness. Assume the number of imputations is K. The multiple imputation procedure is given by Algorithm 1.

Algorithm 1 General multiple imputation
for i in 1 to K do
  1. Impute the missing values based on some method and create the i-th completed data set.
  2. Conduct the data analysis on the i-th data set and obtain an estimator \hat{\theta}_i of the parameter \theta.
end for
3. The final estimator of \theta is \hat{\theta} = \frac{1}{K} \sum_{i=1}^{K} \hat{\theta}_i.
4. The estimator of the variance of \hat{\theta} is
\hat{\Sigma} = \frac{1}{K} \sum_{i=1}^{K} \widehat{Var}(\theta)_i + \frac{K+1}{K(K-1)} \sum_{i=1}^{K} (\hat{\theta}_i - \hat{\theta})(\hat{\theta}_i - \hat{\theta})^T,
where \widehat{Var}(\theta)_i is the variance estimator from the i-th imputed data set.

Several MI methods have been proposed that differ in how the data are imputed (how the missing values are filled in in Step 1), including proper imputation, improper imputation and multivariate imputation [11]. The method we adopt is Multivariate Imputation by Chained Equations (MICE) [17], which imputes the data iteratively until convergence to create a single imputed data set. Assuming there are B (B <= M) variables in the data set that contain missing values, the algorithm is as follows (a small illustrative sketch is given after the algorithm):

Algorithm 2 MICE algorithm for multiple imputation
1. Fill in all missing values by random sampling from the observed values.
for j in 1 to B do
  2. Regress the j-th variable with missing values, say X^j, on all other variables, using only the individuals whose X^j is observed. Note that values imputed in previous steps are also used in the regression.
  3. Impute the missing values of X^j using the regression model obtained in the previous step.
end for
4. Repeat the loop over j several times until the results become stable.
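As a concrete, deliberately simplified sketch of Algorithm 2, the following R function performs chained-equation imputation for a fully numeric data frame using plain linear regressions and a fixed number of sweeps. The function name and the use of lm are our own choices; a real analysis would use the mice package, which the thesis relies on and which handles mixed variable types and proper imputation uncertainty.

```r
# Minimal chained-equation imputation for a numeric data frame `dat`
# (illustrative only; real analyses would use the mice package).
impute_chained <- function(dat, n_iter = 10) {
  miss <- is.na(dat)                             # remember where the holes are
  # Step 1: initialize by random sampling from the observed values
  for (j in seq_along(dat)) {
    obs <- dat[[j]][!miss[, j]]
    dat[[j]][miss[, j]] <- sample(obs, sum(miss[, j]), replace = TRUE)
  }
  vars_with_na <- which(colSums(miss) > 0)
  for (it in seq_len(n_iter)) {                  # Step 4: sweep until stable
    for (j in vars_with_na) {
      # Step 2: regress variable j on all others, using rows where j was observed
      fit <- lm(reformulate(names(dat)[-j], names(dat)[j]),
                data = dat[!miss[, j], ])
      # Step 3: re-impute the originally missing entries from the fitted model
      dat[[j]][miss[, j]] <- predict(fit, newdata = dat[miss[, j], , drop = FALSE])
    }
  }
  dat
}
```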
One key advantage of this method is its compatibility with different types of variables, whether continuous, binary or categorical; the regression model used in Step 2 (linear, logistic or multinomial regression) should be chosen according to the type of the variable being imputed.

Inverse Probability Weighting

The basic idea of inverse probability weighting [10] is to model the probability of missingness and weight the observations by the inverse of their probability of being fully observed. The intuition is that the less likely an observed individual was to be observed, the more individuals similar to it are likely to be missing from the data set, and the more weight it should therefore receive. To implement inverse probability weighting, we must:

1. Fit a model for the probability of being fully observed using the variables that are fully observed. Assume that a total of J variables, {W^1, W^2, ..., W^J} \subseteq {X^1, X^2, ..., X^M, Y}, are fully observed. Fit a logistic regression model for the probability that L = 1,

P(L_i = 1 | X_i, Y_i) = \pi(Z_i; \beta), \qquad \mathrm{logit}\, \pi(Z_i; \beta) = \beta_0 + \sum_{k=1}^{J} \beta_k W_{ik}.   (1.3)

Other prediction models can also be used.

2. Use the fitted model to estimate the parameter of interest by solving a weighted estimating equation (or, more generally, a weighted version of the analysis of interest).

Chapter 2

Methods

2.1 Random Forest

2.1.1 Building a Single Tree

The classification and regression tree algorithm is a supervised learning algorithm. To build a tree, we partition the feature space into small regions and then fit a simple model within each region. In Figure 2.1, a tree is built by partitioning a two-dimensional feature space into five regions, R_1, R_2, R_3, R_4, R_5, using the splits X_1 = t_1, X_1 = t_3, X_2 = t_2, X_2 = t_4.

Fig. 2.1 Tree built by partitioning a two-dimensional feature space (adapted from [14]).

Once the tree is built, a prediction for a new feature vector x = (x_1, x_2) is made by

T(x) = \sum_{i=1}^{5} \hat{f}_i(x) I\{x \in R_i\},   (2.1)

where I\{x \in R_i\} is the indicator function, equal to 1 if x \in R_i and 0 otherwise, and \hat{f}_i(x) is a model fitted using the data from region R_i. When the outcome is continuous and squared error loss is used, \hat{f}_i is the constant obtained by averaging the outcomes of all observations whose features fall in region R_i, that is,

\hat{f}_i = \hat{C}_i = \frac{1}{\sum_{j=1}^{N} I\{X_j \in R_i\}} \sum_{j=1}^{N} I\{X_j \in R_i\} Y_j,   (2.2)

which is the constant minimizing the squared error within each region:

\hat{C}_i = \arg\min_{C_i \in \mathbb{R}} \sum_{j=1}^{N} I\{X_j \in R_i\} (Y_j - C_i)^2.

After choosing the model to be fit within each region, we grow the tree by deciding (a sketch of the split search is given at the end of this subsection):

1. which feature/variable and split point to use;
2. when to stop.

The first step is solved by going over all possible splitting variables and split points and selecting the pair (j, s) that minimizes the total sum of squares,

\min_{j,s} \Big[ \sum_{k: x_k \in R_1} (y_k - \hat{f}_1)^2 + \sum_{k: x_k \in R_2} (y_k - \hat{f}_2)^2 \Big],   (2.3)

where \hat{f}_i, i = 1, 2, is calculated by (2.2) with regions R_1 = \{X_k : X_{kj} < s\} and R_2 = \{X_k : X_{kj} \geq s\}. After making the first split, this process is continued in every resulting region. The second step is achieved when some pre-specified stopping criterion is met (e.g., the depth of the tree is large enough, or there is not enough data in a node to warrant further splitting).
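The split search in (2.3) can be written in a few lines of R. The following brute-force sketch (our own illustration, with no attempt at efficiency) scans every variable and every observed split point and returns the pair that minimizes the total within-region sum of squares.

```r
# Find the best split (variable j, threshold s) by minimizing the total
# within-region sum of squares, as in (2.3). Illustrative only.
best_split <- function(X, y) {
  best <- list(sse = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[, j]))) {
      left  <- X[, j] < s
      right <- !left
      if (!any(left) || !any(right)) next       # skip degenerate splits
      sse <- sum((y[left]  - mean(y[left]))^2) +
             sum((y[right] - mean(y[right]))^2)
      if (sse < best$sse) best <- list(sse = sse, j = j, s = s)
    }
  }
  best
}
```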
2.1.2 Bagging

After building a tree T, a prediction for a new individual x can be made as

T(x) = \sum_{k=1}^{|T|} \hat{f}_k I\{x \in R_k\},

where |T| is the number of regions of the tree T. However, basing the prediction on only one tree can be unreliable, as the splitting criterion is easily affected by outliers. Bagging is a technique for reducing the variance of the prediction by averaging the predictions of models built on bootstrap samples. It has been argued that bagging reduces the mean squared error while leaving the bias essentially unchanged [1]. Assume we draw a total of B bootstrap samples from (Z_1, Z_2, ..., Z_N) and denote the tree built from the b-th bootstrap sample by T^{*b}; the final prediction is then

T_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} T^{*b}(x).

2.1.3 Random Forest

Bagging improves on the prediction accuracy of a single tree. However, the variance of the bagged prediction can still be considerable, because the trees built on bootstrap samples are highly correlated: they use the same splitting criterion and almost the same data. Assume the B bootstrap trees are identically distributed with variance \sigma^2 and pairwise correlation \rho; the variance of the bagged average (the final prediction) is then

\mathrm{var}\big(T_{bag}(x)\big) = \mathrm{var}\Big(\frac{1}{B} \sum_{b=1}^{B} T^{*b}(x)\Big) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2,

and

\lim_{B \to \infty} \mathrm{var}\big(T_{bag}(x)\big) = \rho\sigma^2.

The idea of the random forest is to reduce the correlation \rho between trees without increasing the variance \sigma^2 too much. To achieve this, instead of considering all features at every split of every bootstrap tree, we randomly choose P of the M features as candidates for each split. In this way, even though the splitting criterion is the same, the bootstrap trees will likely look quite different.

Algorithm 3 Random forest algorithm for fully observed data
for b in 1 to B do
  1. Draw the b-th bootstrap sample Z^{*b} of size N from Z.
  2. Build the b-th tree T^{*b} of the form (2.1) on the bootstrap sample from the previous step by repeating the following steps at each node until the number of nodes of T^{*b} reaches some threshold:
     i. Select P features at random from the M features.
     ii. Find the best splitting pair of feature and split point (j, s) by solving (2.3), and split that node.
end for
3. The final prediction is T_{rf}(x) = \frac{1}{B} \sum_{b=1}^{B} T^{*b}(x).

2.2 Handling Missing Data in the Random Forest Algorithm

When training a random forest on an incomplete data set, we first address the missing data problem, by either imputation or inverse probability weighting, to obtain a complete data set, and then run the random forest algorithm on that data set.

2.2.1 Multiple Imputation by Chained Equations

A data set produced by imputation can be treated as an ordinary complete data set, on which the random forest algorithm is simply run (a sketch of this workflow follows the algorithm).

Algorithm 4 Random forest algorithm with MICE
for m in 1 to K do
  1. Impute the data set using Algorithm 2 to obtain the m-th completed data set Z_{(m)}.
  for b in 1 to B do
    2. Draw the b-th bootstrap sample Z^{*b}_{(m)} of size N from Z_{(m)}.
    3. Build the b-th tree T^{*b}_{(m)} of the form (2.1) on the bootstrap sample from the previous step by repeating the following steps at each node until the number of nodes of T^{*b}_{(m)} reaches some threshold:
       i. Select P features at random from the M features.
       ii. Find the best splitting pair of feature and split point (j, s) by solving (2.3), and split that node.
  end for
  4. The prediction based on the m-th imputed data set is T^{(m)}_{rf}(x) = \frac{1}{B} \sum_{b=1}^{B} T^{*b}_{(m)}(x).
end for
5. The final prediction is the average over the K imputed data sets: T_{rf}(x) = \frac{1}{K} \sum_{m=1}^{K} T^{(m)}_{rf}(x).
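One plausible implementation of Algorithm 4, sketched here for a continuous outcome, uses the mice and ranger packages: impute K times, fit a forest to each completed training set, and average the predictions. The package choices and the assumption that the outcome column is named y are ours; the thesis does not prescribe a particular implementation for this step.

```r
library(mice)
library(ranger)

# Algorithm 4, sketched: K imputations, one forest per completed data set,
# predictions averaged across imputations. `train` has NAs; `test` is complete.
rf_mice_predict <- function(train, test, K = 5, num.trees = 1000) {
  imp <- mice(train, m = K, printFlag = FALSE)         # chained-equation imputations
  preds <- sapply(seq_len(K), function(m) {
    completed <- complete(imp, m)                      # the m-th completed data set
    fit <- ranger(y ~ ., data = completed, num.trees = num.trees)
    predict(fit, data = test)$predictions
  })
  rowMeans(preds)                                      # average over the K forests
}
```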
2.2.2 Inverse Probability Weighting

To incorporate inverse probability weighting into the random forest algorithm, some modifications are needed. There are two ways to incorporate the weights. The first is to use a weighted bootstrap instead of the regular bootstrap (Præstgaard and Wellner [9]):

Algorithm 5 IPW random forest algorithm
1. Estimate the weight for each fully observed individual as \hat{\omega}_i = 1/\hat{\pi}_i using (1.3). Observations that are not fully observed (L_i = 0) receive weight 0, so they are never selected into a bootstrap sample.
for b in 1 to B do
  2. Draw the b-th weighted bootstrap sample Z^{*b} of size O from {Z_i : L_i = 1}, selecting individual i with probability \hat{\omega}_i / \sum_{i: L_i = 1} \hat{\omega}_i.
  3. Build the b-th tree T^{*b} of the form (2.1) on the bootstrap sample from the previous step by repeating the following steps at each node until the number of nodes of T^{*b} reaches some threshold:
     i. Select P features at random from the M features.
     ii. Find the best splitting pair of feature and split point (j, s) by solving (2.3), and split that node.
end for
4. The final prediction is T_{rf}(x) = \frac{1}{B} \sum_{b=1}^{B} T^{*b}(x).

By putting positive weights only on fully observed observations, Algorithm 5 ensures that the individual trees are built only on fully observed data (an R sketch of this weighted-bootstrap variant is given at the end of this subsection).

The other way to use inverse probability weighting to account for missing data in the random forest algorithm is to apply a different splitting criterion:

Algorithm 6 Random forest algorithm using an IPW-based splitting criterion
1. Estimate the weight for each fully observed individual as \hat{\omega}_i = 1/\hat{\pi}_i using (1.3).
for b in 1 to B do
  2. Draw the b-th bootstrap sample Z^{*b} of size N from the fully observed observations in Z.
  3. Build the b-th tree T^{*b} of the form (2.1) on the bootstrap sample from the previous step by repeating the following steps at each node until the number of nodes of T^{*b} reaches some threshold:
     i. Select P features at random from the M features.
     ii. Find the best splitting pair of feature and split point (j, s) by solving

\min_{j,s} \Big[ \sum_{k: x_k \in R_1} \hat{\omega}_k (y_k - \hat{f}_1)^2 + \sum_{k: x_k \in R_2} \hat{\omega}_k (y_k - \hat{f}_2)^2 \Big],   (2.4)

where \hat{f}_i = \frac{1}{\sum_{k: x_k \in R_i} \hat{\omega}_k} \sum_{k: x_k \in R_i} \hat{\omega}_k Y_k with regions R_1 = \{X_k : X_{kj} < s\} and R_2 = \{X_k : X_{kj} \geq s\}, and split that node.
end for
4. The final prediction is T_{rf}(x) = \frac{1}{B} \sum_{b=1}^{B} T^{*b}(x).
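As a rough illustration of the weighted-bootstrap variant (Algorithm 5), the following sketch estimates the observation probabilities with a logistic regression as in (1.3) and passes the inverse-probability weights to ranger, whose case.weights argument controls how likely each row is to be drawn into a tree's bootstrap sample. The data frame and column names (y, x1, x2), and the choice of ranger, are our own assumptions, not specified by the thesis.

```r
library(ranger)

# IPW + random forest along the lines of Algorithm 5 (weighted bootstrap).
# Assumes `dat` has outcome `y`; x1 and x2 are fully observed, other columns may have NAs.
dat$L <- as.integer(complete.cases(dat))               # L = 1 if fully observed

# Step 1: model P(L = 1) using the fully observed variables, as in (1.3)
pi_fit <- glm(L ~ x1 + x2, data = dat, family = binomial)
pi_hat <- predict(pi_fit, type = "response")
w      <- ifelse(dat$L == 1, 1 / pi_hat, 0)            # weight 0 for incomplete cases

# Steps 2-4: case.weights govern the bootstrap sampling probabilities, so
# incomplete cases (weight 0) are never drawn into a tree.
cc <- dat[dat$L == 1, setdiff(names(dat), "L")]
rf <- ranger(y ~ ., data = cc, num.trees = 1000,
             case.weights = w[dat$L == 1])
```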
Chapter 3

Simulation

3.1 Set Up

We conducted 100 simulations to evaluate the performance of the random forest algorithm when combined with the methods for addressing missing data discussed in the previous chapter. The simulation settings included both continuous and binary outcomes.

In each simulation, a data set of size 1000 was created with 15 covariates, X_1, X_2, ..., X_{15}, generated from a multivariate normal distribution with mean 0. The covariance of any pair of covariates (X_i, X_j) was set to 0.9^{|i-j|}. The continuous outcome was then calculated from five of the covariates through the formula

Y_{con} = 5 + 4X_1 + 3e^{X_2} + 3X_3^3 - X_{11}^4 + 2X_{12}^3 + \varepsilon,   (3.1)

where \varepsilon ~ N(0, 1) is the error term. The data set was then split into a training set and a test set of size 600 and 400, respectively. Missing values were then created in the training set under each of the three missing mechanisms. More specifically, we used the function ampute in the R package mice to create missing data with arguments prop = 0.3 (30% of the individuals have missing data) and pattern = c(rep(1, 5), rep(0, 10), 1) (the missing pattern: the first five covariates are always observed, the last ten covariates can be missing, and the outcome is always observed). For Missing Completely at Random we set the argument mech to 'MCAR'; for Missing at Random we set weights to c(rep(1, 5), rep(0, 11)), so that the missingness is related only to observed covariates; for Missing Not at Random we set weights to rep(1, 16), so that the missingness is related to the observed covariates, the missing covariates and the outcome. A sketch of this data-generation and amputation step is given at the end of this section.

The outcome generation equations use constant coefficients chosen at random from 1 to 5, constructed in such a way that, among the covariates from which the outcome is generated, some can be missing while others are always observed. This makes the simulation more generalizable, in the sense that the missing variables are neither all related nor all unrelated to the outcome.

For the binary outcome, we used a slightly different outcome generation equation,

Y = 20 + X_1 + 5e^{X_2} + X_3^4 + 5X_4^2 + X_{11}^3, \qquad Y_{bi} = I\{Y > 35\}.   (3.2)

The coefficients are again randomly chosen, and the covariates are generated in the same way as in the continuous case. The threshold is set to 35 to mimic the structure of the real data analyzed later in this chapter, so that the ratio of label-0 to label-1 outcomes is roughly 2:1.

After creating the missing values, we used the following methods to deal with the missing data in the random forest algorithm: multiple imputation by chained equations (Algorithm 4), inverse probability weighting (Algorithm 6, i.e. the second approach, in which the weights enter the splitting criterion and the prediction function), mean imputation, complete-case analysis (ignoring individuals with missing values), the missing data method developed by [5], and mean imputation with a missing indicator. 1000 trees were built in each random forest, and splitrule was set to mse for continuous outcomes and gini for binary outcomes.

Prediction accuracy was measured by the mean squared error (MSE) for continuous outcomes,

MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2,

and by the cross entropy (CE) for binary outcomes,

CE = -\frac{1}{n} \sum_{i=1}^{n} \Big[ Y_i \log \hat{Y}_i + (1 - Y_i) \log(1 - \hat{Y}_i) \Big],

where \hat{Y}_i denotes the prediction for individual i (for binary outcomes, the predicted probability). Lower values of MSE and CE indicate better prediction accuracy.
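The following R sketch reproduces the data-generation and amputation step described above. The seed and object names are ours, and the pattern argument of ampute is written as patterns here (the spelling used by current versions of mice), so the exact call should be checked against the package documentation.

```r
library(MASS)   # mvrnorm
library(mice)   # ampute

set.seed(2019)
n <- 1000; p <- 15
Sigma <- 0.9 ^ abs(outer(1:p, 1:p, "-"))            # cov(X_i, X_j) = 0.9^|i-j|
X   <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
eps <- rnorm(n)
y   <- 5 + 4 * X[, 1] + 3 * exp(X[, 2]) + 3 * X[, 3]^3 -
       X[, 11]^4 + 2 * X[, 12]^3 + eps              # equation (3.1)
dat <- data.frame(X, y)

train <- dat[1:600, ]; test <- dat[601:1000, ]

# First five covariates and the outcome stay observed, covariates 6-15 can be
# missing; 30% of training rows receive missing values.
pat <- c(rep(1, 5), rep(0, 10), 1)
amp_mcar <- ampute(train, prop = 0.3, patterns = pat, mech = "MCAR")$amp
amp_mar  <- ampute(train, prop = 0.3, patterns = pat, mech = "MAR",
                   weights = c(rep(1, 5), rep(0, 11)))$amp
amp_mnar <- ampute(train, prop = 0.3, patterns = pat, mech = "MNAR",
                   weights = rep(1, 16))$amp
```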
3.2 Results

The prediction accuracy was calculated by either MSE or CE for the six missing data techniques under the three missing data mechanisms. The methods and their corresponding abbreviations are shown below.

Table 3.1 Methods and their abbreviations

Abbreviation   Method
mi             multiple imputation by chained equations
ipw            inverse probability weighting
ignore         complete case
mean           mean imputation
survival       survival random forest
mean_ind       mean imputation with missing indicator

Table 3.2 Mean MSE for the continuous outcome over 100 simulations

        MI       IPW      Ignore    Mean      Survival   Mean_ind
MCAR    96.142   91.630   107.543   102.675   101.117    101.193
MAR     91.547   97.819   115.960   91.601    94.003     90.410
MNAR    91.617   96.092   114.500   90.664    92.744     88.716

Table 3.3 CE for the binary outcome over 100 simulations

        MI       IPW      Ignore   Mean     Survival   Mean_ind
MCAR    0.1533   0.1569   0.1632   0.1573   0.1656     0.1591
MAR     0.1532   0.1574   0.1628   0.1566   0.1666     0.1596
MNAR    0.1582   0.1578   0.1624   0.1595   0.1685     0.1631

Fig. 3.1 Prediction accuracy under MCAR. The upper panel shows the CE associated with the binary outcome; the lower panel shows the MSE associated with the continuous outcome (log scale).

Fig. 3.2 Prediction accuracy under MAR. The upper panel shows the CE associated with the binary outcome; the lower panel shows the MSE associated with the continuous outcome (log scale).

Fig. 3.3 Prediction accuracy under MNAR. The upper panel shows the CE associated with the binary outcome; the lower panel shows the MSE associated with the continuous outcome (log scale).

For the continuous outcome, the differences in prediction accuracy among the six methods are quite small relative to their scale, as shown in Figures 3.1, 3.2 and 3.3. As shown in Table 3.2, complete-case analysis is the worst method under all three missing mechanisms, having the largest MSE. Under MCAR, IPW has the smallest prediction error; under MAR and MNAR, MI, mean imputation and mean imputation with a missing indicator have lower MSE than the others and thus have the potential to yield more precise predictions. Overall, however, the differences among the methods are small.

For the binary outcome, the differences in prediction accuracy among the six methods are also small according to the figures. However, as shown in Table 3.3, under all three missing mechanisms MI, IPW and mean imputation are associated with slightly lower CE than complete-case analysis, survival random forest and mean imputation with a missing indicator.

3.3 Data Analysis

3.3.1 Data Set

The data we used contain medical information on women from the Pima Indian population near Phoenix, Arizona, who were at risk of diabetes [13]. The data set contains 7 covariates and 1 outcome variable:

Table 3.4 Variable description [13]

Variable   Description
npreg      number of times pregnant
glu        plasma glucose concentration (mg/dl)
bp         diastolic blood pressure (mm Hg)
skin       triceps skin fold thickness (mm)
bmi        body mass index (kg/m^2)
ped        diabetes pedigree function
age        age in years
type       indicator of diabetes, Yes (1) or No (0)

The variable ped is calculated as a measure of familial diabetes occurrence; the exact way it is calculated is described in [13]. The data set was split into a training set and a test set. The training set contains 300 observations, 100 of which contain missing values. The missingness occurs in only 3 covariates, skin, bp and bmi, with 98, 3 and 13 missing values, respectively. Because bp and bmi contain few missing values, we simply removed the observations with missing values in these two variables to avoid imbalance. After this, the training set has 284 observations, 84 of which have a missing value for skin. The test set contains 332 observations with no missing values.
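The thesis does not state where the data were obtained, but the described sample sizes (300 training rows with 100 incomplete, 332 complete test rows) and variable names match the Pima.tr2 and Pima.te data frames in the R package MASS, so a plausible preparation step looks like the following sketch; the counts in the comments are those reported above and should be verified after running it.

```r
library(MASS)   # Pima.tr2 (training data with NAs) and Pima.te (complete test data)

# Assumption: the thesis data correspond to the MASS versions of the Pima data.
train <- Pima.tr2
test  <- Pima.te

# Drop the few rows with missing bp or bmi, keeping the missingness in `skin`.
train <- train[!is.na(train$bp) & !is.na(train$bmi), ]

nrow(train)             # expected to be 284, per the description above
sum(is.na(train$skin))  # expected to be 84 missing skin values
```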
3.3.2 Results

Following the procedure of the simulation study, we implemented five missing data techniques with the random forest on the training set to build five models, made predictions on the test set with each model, calculated the prediction accuracy, and compared the models by CE and AUC. The results are as follows:

Table 3.5 Results of diabetes prediction

       MI       IPW      Ignore   Mean     Survival
CE     0.4819   0.4818   0.4940   0.4828   0.4766
AUC    0.8239   0.8185   0.8149   0.8246   0.8272

Fig. 3.4 ROC curves for the five different methods for dealing with missing data (AUC: mi 0.8239, ipw 0.8185, ignore 0.8149, mean 0.8246, survival 0.8272).

By Table 3.5, survival random forest is the best missing data technique in this case, as it has the highest AUC (0.8272) and the lowest CE (0.4766). Complete-case analysis, on the other hand, is the least desirable, being associated with the highest CE and the lowest AUC. However, the differences in CE and AUC among the five methods are quite small.

Chapter 4

Discussion

4.1 Conclusion

From the simulations we see that, apart from complete-case analysis performing poorly, no method dominates the others in terms of prediction accuracy. Several factors can affect the performance of a method:

1. Missing mechanism: in the binary-outcome simulation, MI was the best choice under both MCAR and MAR, but under MNAR, IPW was slightly better. However, since the missing data mechanism is an assumption that cannot be verified from the data, in practice we can try each method by splitting the data into a training set and a test set and checking the prediction accuracy on the test set. Cross-validation can further reduce the bias induced by a single split.

2. Type of outcome: in the simulations, survival random forest has relatively low MSE when the outcome is continuous, but when the outcome is binary it has the highest CE under all three missing mechanisms.

3. Missing pattern: the continuous outcome is constructed from five covariates; under the missing pattern used, the first three are always observed and the last two are sometimes missing, so the underlying relationship is only partially affected by missingness. If all covariates that affect the outcome were always observed and only nuisance covariates were missing, the results would be different.

4. Location of missingness: although only missingness in the covariates is considered in this thesis, missingness in the outcome can also change the results. For example, in the binary simulation, if we allow the outcome to be missing (with the same outcome construction and missingness parameters), MI behaves much worse than the other methods, including complete-case analysis (Figure 4.1).

4.2 Future Work

Estimating the variability associated with a prediction is an important component of prediction. [8] derived variance estimators for the random forest algorithm using subsampled random forests. The core idea of the subsampled random forest [8] is to build each tree in a particular manner so that the random forest can be viewed as a random kernel U-statistic; an asymptotic normal distribution then follows, and a variance estimator can be constructed from this distributional result.
4.2.1 Subsampled Random Forest and U-statistic

A U-statistic is defined as

U_n = \binom{n}{k}^{-1} \sum_{(i_1, \ldots, i_k)} h(Z_{i_1}, Z_{i_2}, \ldots, Z_{i_k}),   (4.1)

where (i_1, ..., i_k) are k distinct elements of (1, 2, ..., n) (which implies n \geq k), the summation is over all \binom{n}{k} such combinations, h(\cdot) is a symmetric function of k arguments, and Z_1, Z_2, ..., Z_n are n independent and identically distributed random variables (or random vectors) [12].

A random forest built through Algorithm 3 almost fits the structure of a U-statistic. First, the observed data (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) are iid (or at least assumed to be iid); second, when a sample of size k (k < n) is used to build a tree, the order of the data has no impact on the result, which corresponds to the symmetry of h(\cdot); finally, the prediction of the random forest is the average of the predictions from the \binom{n}{k} trees.

However, as n grows, it becomes computationally infeasible to build \binom{n}{k} trees. Therefore, instead of building \binom{n}{k} trees, we limit the number of trees to m_n (m_n < \binom{n}{k}). Also, as n increases, the prediction from each tree can become inaccurate if the number of samples k used to build each tree stays the same, simply because each tree would then be built on insufficient information. Hence, [8] allow the subsample size k_n to grow with n.

The statistic corresponding to these changes is called a resampled U-statistic [4]. It is built in a similar manner to (4.1), except that instead of averaging over all combinations (i_1, ..., i_k), it averages over m_n (m_n < \binom{n}{k_n}) index sets of size k_n, selected uniformly at random with replacement from the \binom{n}{k_n} combinations; that is,

U_{n, m_n, k_n} = \frac{1}{m_n} \sum_{(i)} h_{k_n}(Z_{i_1}, \ldots, Z_{i_{k_n}}).   (4.2)

The only feature of Algorithm 3 that has no correspondence in (4.2) is the random selection of variables. To incorporate it, an additional randomization parameter \omega is added to the statistic to capture the randomness introduced by the random selection of splitting variables. Write

U_{\omega; n, k_n, m_n} = \frac{1}{m_n} \sum_{(i)} h^{(\omega_i)}_{k_n}(Z_{i_1}, Z_{i_2}, \ldots, Z_{i_{k_n}}),   (4.3)

which is called a random kernel U-statistic. Because the variables used to build each tree are selected uniformly at random, the \omega's can be regarded as iid (\omega_1, \omega_2, \ldots, \omega_{m_n} \overset{iid}{\sim} F_\omega) and independent of the data. [8] proved that such a random kernel U-statistic has an asymptotic normal distribution under some mild conditions. The proof is long and relies on techniques and theorems from other sources [6], [16], so we break it into two steps and state them succinctly without proof.

Step 1: Let U^{*}_{n, k_n, m_n} = E_\omega(U_{\omega; n, k_n, m_n}); then U^{*}_{n, k_n, m_n} is a resampled U-statistic with a non-random kernel by definition (just view h^{*}_{k_n}(\cdot) = E_\omega[h^{(\omega_i)}_{k_n}(\cdot)] as another symmetric function). It can be shown that a resampled U-statistic with a non-random kernel has an asymptotic normal distribution. More specifically, one of the distributional results in Theorem 1 of [8] is

\frac{\sqrt{n}\,(U^{*}_{n, k_n, m_n} - \theta_{k_n})}{\sqrt{k_n^2 \zeta_{1, k_n}}} \xrightarrow{d} N(0, 1),   (4.4)

where

\alpha = \lim_{n \to \infty} \frac{n}{m_n} = 0, \qquad
\theta_{k_n} = E\, h^{*}_{k_n}(Z_1, \ldots, Z_{k_n}), \qquad
\zeta_{1, k_n} = \mathrm{cov}\big(h^{*}_{k_n}(Z_1, Z_2, \ldots, Z_{k_n}),\, h^{*}_{k_n}(Z_1, Z_2', \ldots, Z_{k_n}')\big).

Step 2: Show that

\sqrt{n}\,(U_{\omega; n, k_n, m_n} - U^{*}_{n, k_n, m_n}) \xrightarrow{P} 0;   (4.5)

it then follows that

\frac{\sqrt{n}\,(U_{\omega; n, k_n, m_n} - \theta_{k_n})}{\sqrt{k_n^2 \zeta_{1, k_n}}}
= \frac{\sqrt{n}\,(U_{\omega; n, k_n, m_n} - U^{*}_{n, k_n, m_n})}{\sqrt{k_n^2 \zeta_{1, k_n}}}
+ \frac{\sqrt{n}\,(U^{*}_{n, k_n, m_n} - \theta_{k_n})}{\sqrt{k_n^2 \zeta_{1, k_n}}}
\xrightarrow{d} 0 + N(0, 1) = N(0, 1).   (4.6)

That is, a random kernel U-statistic (corresponding to a random forest prediction) has an asymptotic normal distribution. With this result, the variance of the prediction (more specifically, \zeta_{1, k_n}) can be estimated using Monte Carlo simulation, and a 95% confidence interval can be constructed accordingly.
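As a toy illustration of (4.2), not tied to trees, the following R sketch approximates a complete U-statistic by averaging its kernel over m_n randomly drawn subsets. With the kernel h(z_1, z_2) = (z_1 - z_2)^2 / 2 the complete U-statistic is the usual sample variance, so the resampled version should land close to var(z). The kernel choice and constants are ours, purely for illustration.

```r
# Resampled (incomplete) U-statistic as in (4.2): average the kernel over
# m_n randomly chosen subsets instead of all choose(n, k_n) of them.
set.seed(3)
z  <- rnorm(200)
kn <- 2
mn <- 5000

u_hat <- mean(replicate(mn, {
  idx <- sample(length(z), kn)        # one subset of size k_n, drawn at random
  diff(z[idx])^2 / 2                  # kernel h evaluated on that subset
}))

c(resampled_U = u_hat, sample_variance = var(z))   # the two should be close
```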
4.2.2 Subsampled Random Forest and Missing Data

Returning to the context of missing data: in the simulation study we saw that multiple imputation and inverse probability weighting are the two best methods for handling missing data when combined with the random forest. Multiple imputation is too complicated and algorithm-dependent to study theoretically, so we are interested in how to make predictions and variance estimates under inverse probability weighting using the subsampled random forest. To do so, we need to extend the subsampled random forest theory to the inverse probability weighting setting.

Some problems must be solved before inverse probability weighting fits into the subsampled random forest framework. First, after applying inverse probability weighting, every fully observed individual is associated with a weight w_i = 1/\hat{\pi}_i, where \hat{\pi}_i is the estimate of the true probability of being fully observed obtained from (1.3). The data used to build the random forest, (X_1, \hat{\pi}_1, Y_1), ..., (X_m, \hat{\pi}_m, Y_m), are then no longer iid, simply because each \hat{\pi}_i is estimated using all of the data, and this does not fit into the U-statistic framework. A potential solution may be to prove that

\sqrt{m}\,\big(U_{\hat{\pi}}(Z_1, \ldots, Z_m) - U_{\pi}(Z_1, \ldots, Z_m)\big) \xrightarrow{P} 0.   (4.7)

Because U_{\pi}(Z_1, \ldots, Z_m) has the form of a U-statistic, it has an asymptotic normal distribution. If (4.7) holds, then U_{\hat{\pi}}(Z_1, \ldots, Z_m) also has an asymptotic normal distribution, and inference can be made in a similar way. We plan to explore this in future work.
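The thesis leaves this extension as future work. Purely as a computational illustration of the direction sketched above (not the proposed methodology), the following R code combines the two ingredients for a continuous outcome: each tree is grown on a subsample of size k_n drawn from the complete cases with probability proportional to the IPW weights, and the predictions are averaged. The between-tree spread reported at the end is only a Monte Carlo summary, not the variance estimator of [8], which requires a more structured subsampling scheme; the function name, the use of rpart, and the assumption that the outcome column is named y are all ours.

```r
library(rpart)

# Illustrative subsampled forest with IPW-weighted subsampling.
# `cc`: complete cases with outcome column y; `w`: IPW weights for those rows;
# `test`: complete test data; k_n: subsample size; m_n: number of trees.
subsampled_ipw_forest <- function(cc, w, test, kn = 100, mn = 500) {
  preds <- replicate(mn, {
    idx <- sample(nrow(cc), kn, replace = FALSE, prob = w)  # one weighted subsample
    fit <- rpart(y ~ ., data = cc[idx, ])                   # one tree per subsample
    predict(fit, newdata = test)
  })
  list(prediction = rowMeans(preds),        # subsampled-forest prediction
       mc_sd      = apply(preds, 1, sd))    # between-tree Monte Carlo spread only
}
```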
References

[1] Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.
[2] Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
[3] Breiman, L. (2017). Classification and Regression Trees. Routledge.
[4] Frees, E. W. (1989). Infinite order U-statistics. Scandinavian Journal of Statistics, 16(1):29–45.
[5] Ishwaran, H., Kogalur, U. B., Blackstone, E. H., Lauer, M. S., et al. (2008). Random survival forests. The Annals of Applied Statistics, 2(3):841–860.
[6] Lee, A. J. (1990). U-Statistics: Theory and Practice. CRC Press.
[7] Little, R. J. and Rubin, D. B. (2014). Statistical Analysis with Missing Data, volume 333. John Wiley & Sons.
[8] Mentch, L. and Hooker, G. (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research, 17(1):841–881.
[9] Præstgaard, J. and Wellner, J. A. (1993). Exchangeably weighted bootstraps of the general empirical process. The Annals of Probability, pages 2053–2086.
[10] Robins, J. M. and Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In AIDS Epidemiology, pages 297–331. Springer.
[11] Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434):473–489.
[12] Shao, J. (2006). Mathematical Statistics: Exercises and Solutions. Springer Science & Business Media.
[13] Smith, J. W., Everhart, J., Dickson, W., Knowler, W., and Johannes, R. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care, page 261. American Medical Informatics Association.
[14] Hastie, T., Tibshirani, R., and Friedman, J. (2008). The Elements of Statistical Learning, 2nd edition. Springer.
[15] Tsiatis, A. (2007). Semiparametric Theory and Missing Data. Springer Science & Business Media.
[16] van der Vaart, A. W. (2000). Asymptotic Statistics, volume 3. Cambridge University Press.
[17] White, I. R., Royston, P., and Wood, A. M. (2011). Multiple imputation using chained equations: issues and guidance for practice. Statistics in Medicine, 30(4):377–399.

Appendix

Fig. 4.1 Prediction accuracy for the binary outcome when the outcome can also be missing.