Multicenter Comparison of Machine Learning Methods and Conventional Regression for
Predicting Clinical Deterioration on the Wards
Matthew M Churpek, MD, MPH, PhD1,*; Trevor C Yuen, MS1; Christopher Winslow, MD2; David O
Meltzer, MD, PhD1; Michael W Kattan, MBA, PhD3; Dana P Edelson, MD, MS1
SUPPLEMENTARY APPENDIX
eMETHODS
Overview
This Appendix Methods section provides more rigorous definitions and details of the methods
used in the study. As noted in the main text, model tuning was completed in the training
(derivation) data only, and predicted probabilities from the final models were used in the testing
(validation) data to compare the accuracy of the different methods. Finally, there are many
variations of each of these methods described in the literature. For the purposes of this study, the
default settings for binary classification problems were used to fit each model unless otherwise
stated.
Logistic regression
Logistic regression is a linear statistical model that utilizes the logit (log odds) link to predict a
binary outcome from one or more predators using maximum likelihood estimation.1 Logistic
regression has been extended for use with survival analysis data in the discrete-time framework,
as described elsewhere.2,3 Importantly, predictor variables entered into the model are assumed to
have a linear relationship to the outcome on the log odds scale, unless coded otherwise. In this
study, two different logistic regression models were compared. The first model was fit using linear
terms for each predictor. The second model relaxed the linear assumption by using restricted
cubic splines with three knots for all continuous variables, with placement of the knots made a
priori as recommended by Harrell.1 This allowed predictor variables to have increased risk at both
low and high ends of the range (e.g. low and high respiratory rates). A post-hoc sensitivity
analysis was performed by fitting the spline regression model with four knots to determine if this
improved model accuracy. The logistic regression models were fit using the “glm” package in R.
Tree-based models
Tree-based models partition the feature space into a set of rectangles, and are often presented
as a series of “if-then” statements in the form of a tree.4, 5 For this study, the Gini index was used
to determine the optimal variable and location of the split at each node in the tree.4 In order to
optimize the area under the curve (AUC) of the resulting tree, a cost complexity parameter, which
penalizes larger trees, was used to control the size of the final tree. The tree with the highest ten-
fold cross-validation AUC in the training set was used as the final tree.
Decision trees are created one variable at a time, and thus splits earlier in the tree affect which
splits are optimal later in the tree. Thus, there is no guarantee that the resulting splits will lead to
an “optimal tree” overall, only that the split at each node is optimal at that point for the data at
hand. In addition, small changes in the data can lead to different variables or cut-offs used in the
tree and can thus have a large effect on the tree itself. These issues, and others, often lead
decision trees to have suboptimal accuracy compared to other methods.
To improve the accuracy and stability of the decision tree model, a procedure called bagging was
used to fit a bagged tree model. This involved taking random bootstrap samples of patient data,
with replacement, and fitting an unpruned tree model to each sample.4 The number of bagged
trees in the final model was determined in the training dataset using ten-fold cross-validation to
maximize the training set AUC, with 500, 1000, and 2000 trees considered.
In addition to a bagged tree model, a random forest model was also fit. Random forests modify
the bagged tree procedure by only allowing a random number of the predictor variables to be
considered at each split of each tree.4,6 This results in trees that are less correlated with each
other compared to bagged trees, and thus potentially increasing accuracy. The optimal number of
trees and predictor variables to be considered at each split were determined using ten-fold cross-
validation, and the combination with the highest training set AUC was denoted the final model.
The final tree-based model fit was a gradient boosted machine (i.e. boosted tree). This algorithm
fits one tree at a time, first to all the outcomes in the training data and then to the residuals of the
previous models, thus creating a combination of trees that increasingly weight the “difficult to
predict” events to a greater degree.4,7 The optimal number of splits for each individual tree, the
total number of trees, and an additional shrinkage factor, which weights the contribution from
each individual tree, were determined using ten-fold cross-validation.
All the tree-based models were run using the “caret” with the “rpart” and rf” methods and the
“randomForest” package in R.
K-nearest neighbors
K-nearest neighbors (KNN) models use local geographic information in the predictor space to
predict the outcome of a new sample.4,5 For example, a KNN model utilizing five neighbors uses
the classes of the five closest observations in multidimensional space, based on a distance
measure, to predict the outcome of a new observation. In this study, Euclidian distance was used
to determine the nearest neighbors for each observation, and ten-fold cross-validation was used
to determine the number of neighbors for the final model that maximized the training AUC. Of
note, because distance is scale dependent, all predictors were rescaled to have a mean of zero
and a standard deviation of one when fitting this model. The R package “caret” with the “knn”
method was used to fit the K-nearest neighbors model.
Support vector machines
Support vector machines are a generalization of the maximal margin classifier and the support
vector classifier, which fit a separating hyperplane in high-dimensional space to the data in order
to separate the two classes.5 Specifically, support vector machines generalize the idea of creating
a hyperplane that separates the two classes of interest (e.g. event vs. no event) with the farthest
minimum distance to the training observations to the situation where the two classes are not
completely separable and the hyperplane is allowed to be non-linear. This class of methods is
unique in that they ignore data points from each outcome class that are “easy” to predict when
determining the structure of the boundary and instead focus on the points that are hardest to
predictor (i.e. closest to the boundary) and those points that are misclassified.4, 5 In order to allow
the boundary separately the classes to be non-linear, the radial basis function, a commonly used
flexible kernel function, was used for the support vector machine model in this study. The main
tuning parameter for support vector machines is the cost penalty, with higher values penalizing
misclassified observations to a greater degree. The optimal value was determined using ten-fold
cross-validation and the scale parameter was determined computationally using the method of
Caputo et al.4 Similar to K-nearest neighbors, this method is scale dependent so all predictors
were rescaled to have a mean of zero and a standard deviation of one. The R package “caret”
with the “svmRadial” method was used to fit the support vector machine model.
Neural networks
Neural networks are non-linear models that involve creating a set of linear combinations of the
original predictor variables and then using those as inputs into a hidden layer (or layers) of units,
which then create new combinations of these inputs to finally output the probability of the event of
interest (after a suitable transformation).4, 5 A feed-forward multi-layer perceptron neural network
was fit for this study. A penalty term, known as weight decay, and the number of hidden units in
the model were determined using ten-fold cross-validation in order to maximize the training set
AUC. The “caret” package in R with the “nnet” method was used to fit the neural network model.
REFERENCES
1. Harrell FE. Regression modeling strategies : with applications to linear models, logistic
regression, and survival analysis. New York: Springer; 2001.
2. Singer JD, Willett JB. Its About Time - Using Discrete-Time Survival Analysis to Study
Duration and the Timing of Events. Journal of Educational Statistics. Sum
1993;18(2):155-195.
3. Efron B. Logistic Regression, Survival Analysis, and the Kaplan-Meier Curve. Journal of
the American Statistical Association. Jun 1988;83(402):414.
4. Kuhn M, Johnson K. Applied predictive modeling. New York, NY: Springer; 2013.
5. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Second Edition (Springer Series in Statistics). 2nd Ed. New
York, NY: Springer; 2009
6. Boulesteix AL, Janitza S, Kruppa J, Konig IR. Overview of random forest methodology
and practical guidance with emphasis on computational biology and bioinformatics. Wiley
Interdisciplinary Reviews-Data Mining and Knowledge Discovery. Nov-Dec
2012;2(6):493-507.
7. Friedman JH. Greedy function approximation: A gradient boosting machine. Annals of
Statistics. Oct 2001;29(5):1189-1232.
eTables
eTable 1. Hospital and patient characteristics
Hosp A
(n=83,948)
Hosp B
(n=85,787)
Hosp C
(n=52,610)
Hosp D
(n=33,066)
Hosp E
(n=14,588)
Total
(n=269,999)
Dates included 11/2008-
1/2013
1/2006-
11/2011
1/2006-
11/2011
1/2006-
11/2011
11/2009-
11/2011
Number of beds 500 350 150 120 135
Teaching status Major
teaching
Major
teaching
Major
teaching
Minor
teaching
Non-
teaching
Age, mean (SD),
years
54 (18) 56 (21) 71 (17) 67 (18) 71 (16) 60 (20)
Race
Black, % 39,461
(47)
7,744
(9)
803
(2)
1,098
(3)
564
(4)
49,670
(18)
White, % 31,208
(37)
42,076
(49)
35,359
(67)
21,840
(66)
9,947
(68)
140,430
(52)
Other/
Unknown, %
13,279
(16)
35,967
(42)
16,448
(31)
10,128
(31)
4,077
(28)
79,899
(30)
Female sex, n
(%)
48,021
(57)
57,640
(67)
30, 218
(57)
18,200
(55)
8,202
(56)
162,281
(60)
eTable 2. Model calibration in the validation dataset*
*A well-calibrated model should have a non-significant H-L test, a slope
Model HL Chi-
squared
HL
p-value
Intercept Slope
Random forest 1503.8 <0.001 2.2 1.42
GBM 7.44 0.68 -0.1 0.98
Bagged trees 82.9 <0.001 -0.35 0.92
SVM 30.4 <0.001 -0.32 0.94
Neural network 120.3 <0.001 -0.55 0.89
Logistic regression
(splines)
987.5 <0.001 -1.8 0.65
KNN 3489.4 <0.001 2.3 1.32
Logistic regression (linear) 25.0 0.005 -0.56 0.89
Decision tree 114.3 <0.001 -0.17 0.97
near 1 and an intercept near 0.
Abbreviations: HL = Hosmer-Lemeshow goodness of fit test
eTable 3. Association between the number of variables allowed to be considered at each split in the random forest model with model calibration*
*A well-calibrated model will have an intercept near 0 and a slope near 1
Number of variables allowed Intercept Slope
3 0.037 1.27
5 0.023 1.20
9 0.011 1.12
15 0.009 1.05
20 0.006 1.03
25 0.003 1.01
eFigures
eFigure 1. Discrete-time data format for model development.*
*As shown, time is separated into discrete intervals (i.e., every eight hours
for this study) and the predictor values closest to the beginning of each
time block are used to predict whether the outcome occurs during that
interval.
Abbreviations: SBP = systolic blood pressure, RR = respiratory rate, HR =
heart rate
eFigure 2. Area under the receiver operator characteristic curves of the compared methods for cardiac arrest in the validation cohort.*
MEWS
Decision tree
Logistic (spline)
Neural network
Bagged trees
Support vector machine
Gradient boosted machine
K-nearest neighbors
Logistic (linear)
Random forest
0.64 0.66 0.68 0.7 0.72 0.74 0.76 0.78 0.8 0.82 0.84
0.714
0.739
0.776
0.790
0.795
0.806
0.808
0.809
0.814
0.825
AUC
*Error bars indicate the upper 95% confidence intervals. Abbreviations: MEWS: Modified Early Warning Score, AUC: Area under the receiver operating characteristic curve
eFigure 3. Area under the receiver operator characteristic curves of the compared methods for intensive care unit transfer in the validation cohort.*
MEWS
Logistic (linear)
Decision tree
K-nearest neighbors
Logistic (spline)
Neural network
Support vector machine
Bagged trees
Gradient boosted machine
Random forest
0.6 0.65 0.7 0.75 0.8
0.678
0.712
0.719
0.733
0.754
0.767
0.770
0.775
0.779
0.785
AUC
*Error bars indicate the upper 95% confidence intervals. Abbreviations: MEWS: Modified Early Warning Score, AUC: Area under the receiver operating characteristic curve
eFigure 4. Area under the receiver operator characteristic curves of the compared methods for death on the wards in the validation cohort.*
Decision tree
MEWS
Logistic (spline)
Bagged trees
Neural network
Support vector machine
Logistic (linear)
Gradient boosted machine
K-nearest neighbors
Random forest
0.65 0.7 0.75 0.8 0.85 0.9 0.95
0.870
0.880
0.909
0.916
0.917
0.920
0.921
0.926
0.931
0.936
AUC
*Error bars indicate the upper 95% confidence intervals. Abbreviations: MEWS: Modified Early Warning Score, AUC: Area under the receiver operating characteristic curve
eFigure 5. Partial plot of the interaction between heart rate, systolic blood pressure, and risk of an adverse event.*
*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk. Abbreviations: SBP, systolic blood pressure (in mm Hg)
eFigure 6. Partial plot of the interaction between heart rate, respiratory rate, and risk of an adverse event.*
*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk.
eFigure 7. Partial plot of the interaction between heart rate, age, and risk of an adverse event.*
*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk.
eFigure 8. Partial plot of the interaction between respiratory rate, systolic blood pressure, and risk of an adverse event.*
*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk. Abbreviations: SBP, systolic blood pressure (in mm Hg)
eFigure 9. Partial plot of the interaction between respiratory rate, age, and risk of an adverse event.*
*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk.
eFigure 10. Partial plot of the interaction between age, systolic blood pressure, and risk of an adverse event.*
*Risk is shown on the z-axis and is also illustrated using dark blue (low) to green (medium) to dark red (high) risk. Abbreviations: SBP, systolic blood pressure (in mm Hg)
eFigure 11. Calibration plot across deciles of risk for selected models in the validation dataset