
Multicenter Comparison of Machine Learning Methods and Conventional Regression for Predicting Clinical Deterioration on the Wards

Matthew M Churpek, MD, MPH, PhD1,*; Trevor C Yuen, MS1; Christopher Winslow, MD2; David O Meltzer, MD, PhD1; Michael W Kattan, MBA, PhD3; Dana P Edelson, MD, MS1

SUPPLEMENTARY APPENDIX


eMETHODS

Overview

This Appendix Methods section provides more rigorous definitions and details of the methods used in the study. As noted in the main text, model tuning was completed in the training (derivation) data only, and predicted probabilities from the final models were used in the testing (validation) data to compare the accuracy of the different methods. Finally, there are many variations of each of these methods described in the literature. For the purposes of this study, the default settings for binary classification problems were used to fit each model unless otherwise stated.

Logistic regression

Logistic regression is a linear statistical model that utilizes the logit (log odds) link to predict a binary outcome from one or more predictors using maximum likelihood estimation.1 Logistic regression has been extended for use with survival analysis data in the discrete-time framework, as described elsewhere.2,3 Importantly, predictor variables entered into the model are assumed to have a linear relationship to the outcome on the log odds scale, unless coded otherwise. In this study, two different logistic regression models were compared. The first model was fit using linear terms for each predictor. The second model relaxed the linearity assumption by using restricted cubic splines with three knots for all continuous variables, with knot placement specified a priori as recommended by Harrell.1 This allowed predictor variables to have increased risk at both the low and high ends of their range (e.g., low and high respiratory rates). A post-hoc sensitivity analysis was performed by fitting the spline regression model with four knots to determine whether this improved model accuracy. The logistic regression models were fit using the "glm" function in R.
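
To make the two specifications concrete, a minimal sketch is shown below (illustrative only, not the authors' code). The data frames train and test, the outcome event, and the predictor names are hypothetical; the spline model uses the rcs() function from Harrell's rms package.

```r
# Hypothetical data frames 'train' and 'test' with a binary outcome 'event'
# and continuous predictors; not the study's actual variable set.
library(rms)

# Model 1: linear terms for each predictor (base R glm)
fit_linear <- glm(event ~ resp_rate + heart_rate + sbp + age,
                  data = train, family = binomial)

# Model 2: restricted cubic splines with three knots for continuous predictors
fit_spline <- lrm(event ~ rcs(resp_rate, 3) + rcs(heart_rate, 3) +
                    rcs(sbp, 3) + rcs(age, 3), data = train)

# Predicted probabilities in the validation (testing) data
p_linear <- predict(fit_linear, newdata = test, type = "response")
p_spline <- predict(fit_spline, newdata = test, type = "fitted")
```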

Tree-based models

Tree-based models partition the feature space into a set of rectangles and are often presented as a series of "if-then" statements in the form of a tree.4,5 For this study, the Gini index was used to determine the optimal variable and location of the split at each node in the tree.4 In order to optimize the area under the curve (AUC) of the resulting tree, a cost complexity parameter, which penalizes larger trees, was used to control the size of the final tree. The tree with the highest ten-fold cross-validation AUC in the training set was used as the final tree.
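
A sketch of this tuning step using caret is shown below; the cross-validation setup and the hypothetical train data frame (with the outcome coded as a two-level factor) are illustrative assumptions rather than the authors' exact code.

```r
library(caret)

# Ten-fold cross-validation, selecting models by AUC ("ROC" in caret)
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Tune the cost complexity parameter (cp) of a single decision tree;
# the outcome 'event' must be a factor with valid level names (e.g., "yes"/"no")
tree_fit <- train(event ~ ., data = train, method = "rpart",
                  metric = "ROC", trControl = ctrl, tuneLength = 10)
```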

Decision trees are created one variable at a time, and thus splits earlier in the tree affect which splits are optimal later in the tree. Thus, there is no guarantee that the resulting splits will lead to an "optimal tree" overall, only that the split at each node is optimal at that point for the data at hand. In addition, small changes in the data can lead to different variables or cut-offs being used in the tree and can thus have a large effect on the tree itself. These issues, and others, often lead decision trees to have suboptimal accuracy compared to other methods.

To improve the accuracy and stability of the decision tree model, a procedure called bagging was used to fit a bagged tree model. This involved taking random bootstrap samples of the patient data, with replacement, and fitting an unpruned tree model to each sample.4 The number of bagged trees in the final model was determined in the training dataset using ten-fold cross-validation to maximize the training set AUC, with 500, 1000, and 2000 trees considered.
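
One way to sketch this step (hypothetical object names, not the authors' code) is to use the fact that a bagged tree model is a random forest in which every predictor is eligible at every split, so the candidate ensemble sizes can be compared directly:

```r
library(randomForest)

x_train <- train[, setdiff(names(train), "event")]  # hypothetical predictor set
y_train <- train$event                              # two-level factor outcome

# Bagging = random forest with mtry equal to the number of predictors
bag_fits <- lapply(c(500, 1000, 2000), function(n_trees) {
  randomForest(x = x_train, y = y_train, mtry = ncol(x_train), ntree = n_trees)
})
# The ensemble size with the highest cross-validated AUC would be kept as final
```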

In addition to a bagged tree model, a random forest model was also fit. Random forests modify the bagged tree procedure by allowing only a random subset of the predictor variables to be considered at each split of each tree.4,6 This results in trees that are less correlated with each other than bagged trees, potentially increasing accuracy. The optimal number of trees and the number of predictor variables considered at each split were determined using ten-fold cross-validation, and the combination with the highest training set AUC was denoted the final model.
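
A caret sketch of this tuning follows; the mtry grid mirrors the values reported in eTable 3, while the remaining settings (and the assumption that train contains the full study predictor set) are illustrative.

```r
library(caret)
library(randomForest)

ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# caret tunes mtry (variables considered at each split); the number of trees
# is fixed per fit, so candidate forest sizes are compared in an outer loop
rf_fits <- lapply(c(500, 1000, 2000), function(n_trees) {
  train(event ~ ., data = train, method = "rf", metric = "ROC",
        trControl = ctrl, ntree = n_trees,
        tuneGrid = data.frame(mtry = c(3, 5, 9, 15, 20, 25)))
})
```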

The final tree-based model fit was a gradient boosted machine (i.e., boosted tree). This algorithm fits one tree at a time, first to all the outcomes in the training data and then to the residuals of the previous models, thus creating a combination of trees that increasingly weights the "difficult to predict" events to a greater degree.4,7 The optimal number of splits for each individual tree, the total number of trees, and an additional shrinkage factor, which weights the contribution from each individual tree, were determined using ten-fold cross-validation.
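
This tuning could be expressed with caret's "gbm" method roughly as follows; the grid values are hypothetical assumptions, since the text does not report the exact candidate values.

```r
library(caret)
library(gbm)

ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Tune tree depth (number of splits), total number of trees, and shrinkage
gbm_grid <- expand.grid(interaction.depth = c(1, 3, 5),
                        n.trees = c(500, 1000, 2000),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)

gbm_fit <- train(event ~ ., data = train, method = "gbm", metric = "ROC",
                 trControl = ctrl, tuneGrid = gbm_grid, verbose = FALSE)
```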

All of the tree-based models were run using the "caret" package with the "rpart" and "rf" methods and the "randomForest" package in R.

K-nearest neighbors

K-nearest neighbors (KNN) models use local geographic information in the predictor space to predict the outcome of a new sample.4,5 For example, a KNN model utilizing five neighbors uses the classes of the five closest observations in multidimensional space, based on a distance measure, to predict the outcome of a new observation. In this study, Euclidean distance was used to determine the nearest neighbors for each observation, and ten-fold cross-validation was used to determine the number of neighbors for the final model that maximized the training AUC. Of note, because distance is scale dependent, all predictors were rescaled to have a mean of zero and a standard deviation of one when fitting this model. The R package "caret" with the "knn" method was used to fit the K-nearest neighbors model.
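
A minimal sketch of this model in caret is shown below (hypothetical data frame and grid of candidate neighbor counts).

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Center and scale predictors (mean 0, SD 1) before computing Euclidean distances
knn_fit <- train(event ~ ., data = train, method = "knn", metric = "ROC",
                 trControl = ctrl, preProcess = c("center", "scale"),
                 tuneGrid = data.frame(k = seq(5, 105, by = 10)))
```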

Support vector machines

Support vector machines are a generalization of the maximal margin classifier and the support vector classifier, which fit a separating hyperplane in high-dimensional space to the data in order to separate the two classes.5 Specifically, support vector machines generalize the idea of creating a hyperplane that separates the two classes of interest (e.g., event vs. no event) with the farthest minimum distance to the training observations to the situation where the two classes are not completely separable and the hyperplane is allowed to be non-linear. This class of methods is unique in that it ignores data points from each outcome class that are "easy" to predict when determining the structure of the boundary and instead focuses on the points that are hardest to predict (i.e., closest to the boundary) and those points that are misclassified.4,5 In order to allow the boundary separating the classes to be non-linear, the radial basis function, a commonly used flexible kernel function, was used for the support vector machine model in this study. The main tuning parameter for support vector machines is the cost penalty, with higher values penalizing misclassified observations to a greater degree. The optimal value was determined using ten-fold cross-validation, and the scale parameter was determined computationally using the method of Caputo et al.4 Similar to K-nearest neighbors, this method is scale dependent, so all predictors were rescaled to have a mean of zero and a standard deviation of one. The R package "caret" with the "svmRadial" method was used to fit the support vector machine model.
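
A hedged sketch follows; with the "svmRadial" method, caret estimates the kernel scale parameter (sigma) analytically via kernlab's sigest() (the approach of Caputo et al.) and cross-validates the cost penalty over a default-style grid. The data frame and grid length are assumptions.

```r
library(caret)
library(kernlab)

ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Radial basis function kernel; sigma is estimated once via sigest() and held
# fixed while the cost penalty C is tuned by ten-fold cross-validation
svm_fit <- train(event ~ ., data = train, method = "svmRadial", metric = "ROC",
                 trControl = ctrl, preProcess = c("center", "scale"),
                 tuneLength = 8)
```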

Neural networks

Neural networks are non-linear models that involve creating a set of linear combinations of the original predictor variables and then using those as inputs into a hidden layer (or layers) of units, which then create new combinations of these inputs to finally output the probability of the event of interest (after a suitable transformation).4,5 A feed-forward multi-layer perceptron neural network was fit for this study. A penalty term, known as weight decay, and the number of hidden units in the model were determined using ten-fold cross-validation in order to maximize the training set AUC. The "caret" package in R with the "nnet" method was used to fit the neural network model.
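
The sketch below illustrates this tuning; the candidate values for the number of hidden units and the weight decay penalty are illustrative assumptions.

```r
library(caret)
library(nnet)

ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Tune the number of hidden units (size) and the weight decay penalty
nnet_grid <- expand.grid(size = c(3, 5, 10), decay = c(0, 0.01, 0.1))

nnet_fit <- train(event ~ ., data = train, method = "nnet", metric = "ROC",
                  trControl = ctrl, tuneGrid = nnet_grid,
                  trace = FALSE, maxit = 500)
```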


REFERENCES

1. Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York, NY: Springer; 2001.
2. Singer JD, Willett JB. It's about time: using discrete-time survival analysis to study duration and the timing of events. Journal of Educational Statistics. 1993;18(2):155-195.
3. Efron B. Logistic regression, survival analysis, and the Kaplan-Meier curve. Journal of the American Statistical Association. 1988;83(402):414.
4. Kuhn M, Johnson K. Applied Predictive Modeling. New York, NY: Springer; 2013.
5. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York, NY: Springer; 2009.
6. Boulesteix AL, Janitza S, Kruppa J, Konig IR. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2012;2(6):493-507.
7. Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of Statistics. 2001;29(5):1189-1232.


eTables

eTable 1. Hospital and patient characteristics

| Characteristic | Hosp A (n=83,948) | Hosp B (n=85,787) | Hosp C (n=52,610) | Hosp D (n=33,066) | Hosp E (n=14,588) | Total (n=269,999) |
|---|---|---|---|---|---|---|
| Dates included | 11/2008-1/2013 | 1/2006-11/2011 | 1/2006-11/2011 | 1/2006-11/2011 | 11/2009-11/2011 | |
| Number of beds | 500 | 350 | 150 | 120 | 135 | |
| Teaching status | Major teaching | Major teaching | Major teaching | Minor teaching | Non-teaching | |
| Age, mean (SD), years | 54 (18) | 56 (21) | 71 (17) | 67 (18) | 71 (16) | 60 (20) |
| Race, n (%) | | | | | | |
| Black | 39,461 (47) | 7,744 (9) | 803 (2) | 1,098 (3) | 564 (4) | 49,670 (18) |
| White | 31,208 (37) | 42,076 (49) | 35,359 (67) | 21,840 (66) | 9,947 (68) | 140,430 (52) |
| Other/Unknown | 13,279 (16) | 35,967 (42) | 16,448 (31) | 10,128 (31) | 4,077 (28) | 79,899 (30) |
| Female sex, n (%) | 48,021 (57) | 57,640 (67) | 30,218 (57) | 18,200 (55) | 8,202 (56) | 162,281 (60) |


eTable 2. Model calibration in the validation dataset*

| Model | HL chi-squared | HL p-value | Intercept | Slope |
|---|---|---|---|---|
| Random forest | 1503.8 | <0.001 | 2.2 | 1.42 |
| GBM | 7.44 | 0.68 | -0.1 | 0.98 |
| Bagged trees | 82.9 | <0.001 | -0.35 | 0.92 |
| SVM | 30.4 | <0.001 | -0.32 | 0.94 |
| Neural network | 120.3 | <0.001 | -0.55 | 0.89 |
| Logistic regression (splines) | 987.5 | <0.001 | -1.8 | 0.65 |
| KNN | 3489.4 | <0.001 | 2.3 | 1.32 |
| Logistic regression (linear) | 25.0 | 0.005 | -0.56 | 0.89 |
| Decision tree | 114.3 | <0.001 | -0.17 | 0.97 |

*A well-calibrated model should have a non-significant H-L test, a slope near 1, and an intercept near 0.
Abbreviations: HL = Hosmer-Lemeshow goodness of fit test
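
For context, calibration intercepts and slopes such as those reported above are commonly estimated by regressing the observed outcome on the log-odds of the predicted probability in the validation data; the sketch below illustrates one such approach and is not necessarily the authors' exact procedure.

```r
# Hypothetical vectors: y = observed 0/1 outcomes in the validation data,
# p = predicted probabilities from a given model
logit_p <- qlogis(p)

# Calibration intercept = model intercept; calibration slope = coefficient on logit(p)
calib_fit <- glm(y ~ logit_p, family = binomial)
coef(calib_fit)

# Hosmer-Lemeshow goodness-of-fit test across deciles of predicted risk,
# e.g. ResourceSelection::hoslem.test(y, p, g = 10)
```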

eTable 3. Association between the number of variables allowed to be considered at each split in the random forest model and model calibration*

| Number of variables allowed | Intercept | Slope |
|---|---|---|
| 3 | 0.037 | 1.27 |
| 5 | 0.023 | 1.20 |
| 9 | 0.011 | 1.12 |
| 15 | 0.009 | 1.05 |
| 20 | 0.006 | 1.03 |
| 25 | 0.003 | 1.01 |

*A well-calibrated model will have an intercept near 0 and a slope near 1.


eFigures

eFigure 1. Discrete-time data format for model development.*

*As shown, time is separated into discrete intervals (i.e., every eight hours for this study) and the predictor values closest to the beginning of each time block are used to predict whether the outcome occurs during that interval.
Abbreviations: SBP = systolic blood pressure, RR = respiratory rate, HR = heart rate


eFigure 2. Area under the receiver operating characteristic curves of the compared methods for cardiac arrest in the validation cohort.*

[Bar chart of validation AUCs, shown in ascending order: MEWS 0.714; Decision tree 0.739; Logistic (spline) 0.776; Neural network 0.790; Bagged trees 0.795; Support vector machine 0.806; Gradient boosted machine 0.808; K-nearest neighbors 0.809; Logistic (linear) 0.814; Random forest 0.825.]

*Error bars indicate the upper 95% confidence intervals. Abbreviations: MEWS: Modified Early Warning Score, AUC: area under the receiver operating characteristic curve


eFigure 3. Area under the receiver operating characteristic curves of the compared methods for intensive care unit transfer in the validation cohort.*

[Bar chart of validation AUCs, shown in ascending order: MEWS 0.678; Logistic (linear) 0.712; Decision tree 0.719; K-nearest neighbors 0.733; Logistic (spline) 0.754; Neural network 0.767; Support vector machine 0.770; Bagged trees 0.775; Gradient boosted machine 0.779; Random forest 0.785.]

*Error bars indicate the upper 95% confidence intervals. Abbreviations: MEWS: Modified Early Warning Score, AUC: area under the receiver operating characteristic curve


eFigure 4. Area under the receiver operating characteristic curves of the compared methods for death on the wards in the validation cohort.*

[Bar chart of validation AUCs, shown in ascending order: Decision tree 0.870; MEWS 0.880; Logistic (spline) 0.909; Bagged trees 0.916; Neural network 0.917; Support vector machine 0.920; Logistic (linear) 0.921; Gradient boosted machine 0.926; K-nearest neighbors 0.931; Random forest 0.936.]

*Error bars indicate the upper 95% confidence intervals. Abbreviations: MEWS: Modified Early Warning Score, AUC: area under the receiver operating characteristic curve


eFigure 5. Partial plot of the interaction between heart rate, systolic blood pressure, and risk of an adverse event.*

*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk. Abbreviations: SBP, systolic blood pressure (in mm Hg)


eFigure 6. Partial plot of the interaction between heart rate, respiratory rate, and risk of an adverse event.*

*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk.


eFigure 7. Partial plot of the interaction between heart rate, age, and risk of an adverse event.*

*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk.


eFigure 8. Partial plot of the interaction between respiratory rate, systolic blood pressure, and risk of an adverse event.*

*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk. Abbreviations: SBP, systolic blood pressure (in mm Hg)


eFigure 9. Partial plot of the interaction between respiratory rate, age, and risk of an adverse event.*

*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk.


eFigure 10. Partial plot of the interaction between age, systolic blood pressure, and risk of an adverse event.*

*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk. Abbreviations: SBP, systolic blood pressure (in mm Hg)


eFigure 11. Calibration plot across deciles of risk for selected models in the validation dataset