Classification Part 4: Tree-Based Methods (BMTRY 726, 4/18/14)
TRANSCRIPT
So Far: Linear Classifiers
-Logistic regression: most commonly used
-LDA: uses more information (prior probabilities) and is therefore more efficient, but its output is less interpretable
-Weaknesses:
  -modeling issues for large p relative to n
  -inflexibility in the decision boundary
So Far: Neural Networks
-Able to handle large p by reducing the number of inputs, X, to a smaller set of derived features, Z, used to model Y
-Applying (non-linear) functions to linear combinations of the X’s and Z’s results in very flexible decision boundaries
-Weaknesses:
  -limited means of identifying important features
  -tendency to over-fit the data
Tree-Based Methods
Non-parametric and semi-parametric methods
Models can be represented graphically, making them very easy to interpret
We will consider two decision tree methods:
-Classification and Regression Trees (CART)
-Logic Regression
Tree-Based Methods
Classification and Regression Trees (CART)
-Used for both classification and regression
-Features can be categorical or continuous
-Binary splits selected for all features
-Features may occur more than once in a model
Logic Regression
-Also used for classification and regression
-Features must be binary for these models
-Features may occur more than once in a model
CART
Main idea: divide the feature space into a disjoint set of rectangles and fit a simple model to each one
-For regression, the predicted output is just the mean of the training data in that region
  -i.e., a constant (or simple linear) function is fit in each region
-For classification, the predicted class is just the most frequent class in the training data in that region
  -Class probabilities are estimated by the class frequencies in the training data in that region
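As a toy sketch of this idea (the rectangular regions and training points below are made up for illustration, not from the lecture), classification in a region amounts to a majority vote over the training points that fall in it:

```python
from collections import Counter

# Hypothetical rectangles: name -> (x1_lo, x1_hi, x2_lo, x2_hi)
regions = {"R1": (0, 5, 0, 5), "R2": (5, 10, 0, 5)}

# Toy training data: (x1, x2) -> class label
train = [((1, 1), "A"), ((2, 3), "A"), ((3, 2), "B"),
         ((6, 1), "B"), ((7, 4), "B"), ((9, 2), "A")]

def region_of(x1, x2):
    """Find which rectangle a point falls in."""
    for name, (lo1, hi1, lo2, hi2) in regions.items():
        if lo1 <= x1 < hi1 and lo2 <= x2 < hi2:
            return name
    return None

def predict(x1, x2):
    """Predict the most frequent training class in the point's region."""
    r = region_of(x1, x2)
    labels = [y for (p1, p2), y in train if region_of(p1, p2) == r]
    return Counter(labels).most_common(1)[0][0]
```

For a point in R1 (two A's, one B) the prediction is "A"; for a point in R2 (two B's, one A) it is "B".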
CART Trees
CART trees: graphical representations of a CART model
Features in a CART tree:
-Nodes/partitions: binary splits in the data based on one feature (e.g. X1 < t1)
-Branches: paths leading down from a split
  -Go left if X satisfies the rule defined at the split
  -Go right if X doesn’t meet the criterion
-Terminal nodes: rectangular regions where splitting ends
[Tree diagram: root split X1 < t1; the left branch splits on X2 < t2 into regions R1 and R2; the right branch splits on X1 < t3 into R3, then on X2 < t4 into R4 and R5]
CART Tree Predictions
Predictions from a tree are made from the top down
For a new observation with feature vector X, follow the path through the tree whose splitting rules hold for the observed vector until reaching a terminal node
The prediction is the one made at that terminal node
- In the case of classification, it is the class with the largest probability at that node
CART Tree Predictions
For example, say our feature vector is X = (6, -1.5)
Start at the top: which direction do we go?
[Tree diagram with numeric cutpoints: root X1 < 4; left branch: X2 < -1 → R1, R2; right branch: X1 < 5.5 → R3, otherwise X2 < -0.2 → R4, R5]
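Tracing that path in code (a minimal sketch; the mapping of branches to the region labels R1-R5 follows the usual left-to-right reading of the diagram and is an assumption):

```python
# Hard-coded version of the example tree with cutpoints 4, -1, 5.5, -0.2
def predict_region(x1, x2):
    if x1 < 4:                         # root split
        return "R1" if x2 < -1 else "R2"
    if x1 < 5.5:                       # right branch, second split
        return "R3"
    return "R4" if x2 < -0.2 else "R5"

print(predict_region(6, -1.5))  # X1 >= 4, X1 >= 5.5, X2 < -0.2 -> "R4"
```

For X = (6, -1.5) the path goes right at both X1 splits and then left at X2 < -0.2, ending in R4 (under the assumed region labeling).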
Building a CART Tree
Given a set of features X and associated outcomes Y how do we fit a tree?
CART uses a Recursive Partitioning (aka Greedy Search) Algorithm to select the “best” feature at each split.
So let’s look at the recursive partitioning algorithm.
Building a CART Tree: Recursive Partitioning (aka Greedy Search) Algorithm
(1) Consider all possible splits for each feature/predictor
  -Think of a split as a cutpoint (when X is binary this is always ½)
(2) Select the predictor, and the split on that predictor, that provides the best separation of the data
(3) Divide the data into 2 subsets based on the selected split
(4) For each subset, consider all possible splits for each feature
(5) Select the predictor/split that provides the best separation for each subset of the data
(6) Repeat this process until some desired level of data purity is reached; this is a terminal node!
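Steps (1)-(2) can be sketched as an exhaustive scan over features and candidate cutpoints, scoring each split by the weighted Gini impurity of the two resulting subsets. This is an illustrative sketch, not the lecture's code; run on the 15-point "Simple Example" data that appears later, it recovers the first split X2 < -1.0:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Scan every feature and candidate cutpoint; return the split with the
    lowest weighted Gini impurity as (score, feature_index, cutpoint)."""
    best, n = None, len(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # candidate cutpoints: midpoints between consecutive observed values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in range(n) if X[i][j] < t]
            right = [y[i] for i in range(n) if X[i][j] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

# The 15-point example data from the lecture: columns X1, X2; classes 1/2
X = [[0.0, 0.8], [8.0, -0.4], [2.9, -1.4], [4.9, 1.0], [2.9, 0.0],
     [9.0, 1.2], [1.0, -0.6], [4.9, 1.6], [2.0, 1.8], [1.0, -1.8],
     [7.0, 2.0], [1.0, -2.0], [1.0, -0.4], [6.0, 1.4], [9.0, -0.1]]
y = [2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2]

score, j, t = best_split(X, y)  # first split: feature X2 at cutpoint -1.0
```

The left subset (X2 < -1.0) is pure, so the weighted impurity drops to 12/15 × 0.375 = 0.3, the best of all candidate splits.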
Impurity of a Node
Need a measure of impurity of a node to help decide on how to split a node, or which node to split
The measure should be at a maximum when a node is equally divided amongst all classes
The impurity should be zero if the node is all one class
Consider the proportion of observations of class k in node m:

p̂_mk = (1/N_m) Σ_{x_i ∈ R_m} I(y_i = k)
Measures of Impurity
Possible measures of impurity:
(1) Misclassification rate:

  (1/N_m) Σ_{i ∈ R_m} I(y_i ≠ k(m)) = 1 − p̂_{m,k(m)}, where k(m) is the most frequent class in node m

  -Situations can occur where no split improves the misclassification rate
  -The misclassification rate can be equal when one option is clearly better for the next step
(2) Deviance/cross-entropy:

  −Σ_{k=1}^{K} p̂_mk log p̂_mk

(3) Gini index:

  Σ_{k ≠ k'} p̂_mk p̂_mk' = Σ_{k=1}^{K} p̂_mk (1 − p̂_mk)
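The three impurity measures just listed are one-liners over a vector of class proportions p_mk; a small sketch for checking values by hand:

```python
from math import log

def misclassification(p):
    return 1.0 - max(p)            # 1 minus the proportion of the majority class

def cross_entropy(p):
    return -sum(pk * log(pk) for pk in p if pk > 0)   # deviance (natural log)

def gini(p):
    return sum(pk * (1.0 - pk) for pk in p)

# All three are maximal for an evenly divided node and zero for a pure node
print(misclassification([0.5, 0.5]), gini([0.5, 0.5]), gini([1.0, 0.0]))  # 0.5 0.5 0.0
```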
Tree Impurity
Define the impurity of a tree to be the sum over all terminal nodes of the impurity of each node multiplied by the proportion of cases that reach that node of the tree
Example: impurity of a tree with a single node, with classes A and B each having 400 cases, using the Gini index:
-Proportion of each class = 0.5
-Therefore Gini index = (0.5)(1-0.5) + (0.5)(1-0.5) = 0.5
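The single-node example can be checked directly, with tree impurity computed as the node-proportion-weighted sum of node Gini indices:

```python
def gini(p):
    return sum(pk * (1.0 - pk) for pk in p)

def tree_impurity(nodes, impurity=gini):
    """nodes: list of (cases_in_node, class_proportions_in_node)."""
    total = sum(n for n, _ in nodes)
    return sum((n / total) * impurity(p) for n, p in nodes)

one_node_tree = [(800, [0.5, 0.5])]   # 400 A + 400 B in a single node
print(tree_impurity(one_node_tree))   # 0.5, matching the slide
```

Splitting that node into two pure children (400 A's and 400 B's) would drive the tree impurity to 0.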
Simple Example

X1    X2    Class
0.0   0.8   2
8.0  -0.4   1
2.9  -1.4   1
4.9   1.0   1
2.9   0.0   2
9.0   1.2   2
1.0  -0.6   2
4.9   1.6   1
2.0   1.8   2
1.0  -1.8   1
7.0   2.0   2
1.0  -2.0   1
1.0  -0.4   2
6.0   1.4   2
9.0  -0.1   2
Split 1: X2 < -1.0 → left terminal node predicts Blue (all class 1)
Split 2: among the points with X2 ≥ -1.0, split on X1 < 3.9 → left terminal node predicts Red (all class 2)
(previous split: X2 < -1.0 → Blue)
Split 3: among the points with X1 ≥ 3.9, split on X1 < 5.45 → left terminal node predicts Blue (all class 1)
(previous splits: X2 < -1.0 → Blue; X1 < 3.9 → Red)
Split 4: among the points with X1 ≥ 5.45, split on X2 < -0.25 → Blue, otherwise Red
(full tree: X2 < -1.0 → Blue; then X1 < 3.9 → Red; then X1 < 5.45 → Blue; then X2 < -0.25 → Blue, else Red)
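Putting the four splits together, the fully grown tree separates the training data perfectly; a quick check (the class-to-color mapping, 1 = Blue and 2 = Red, is inferred from the data):

```python
# The 15 training points from the example: (X1, X2, class)
data = [(0.0, 0.8, 2), (8.0, -0.4, 1), (2.9, -1.4, 1), (4.9, 1.0, 1),
        (2.9, 0.0, 2), (9.0, 1.2, 2), (1.0, -0.6, 2), (4.9, 1.6, 1),
        (2.0, 1.8, 2), (1.0, -1.8, 1), (7.0, 2.0, 2), (1.0, -2.0, 1),
        (1.0, -0.4, 2), (6.0, 1.4, 2), (9.0, -0.1, 2)]

def predict(x1, x2):
    if x2 < -1.0:        # split 1: Blue (class 1)
        return 1
    if x1 < 3.9:         # split 2: Red (class 2)
        return 2
    if x1 < 5.45:        # split 3: Blue
        return 1
    return 1 if x2 < -0.25 else 2   # split 4: Blue / Red

errors = sum(predict(x1, x2) != c for x1, x2, c in data)
print(errors)  # 0: the fully grown tree fits the training data perfectly
```

That perfect training fit is exactly the over-fitting risk the next slides address.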
Pruning the Tree
Over-fitting: CART trees grown to maximum purity often over-fit the data
Error rate will always drop (or at least not increase) with every split
Does not mean error rate on test data will improve
Best method of arriving at a suitable size for the tree is to grow an overly complex one then to prune it back
Pruning (in classification trees) based on misclassification
Pruning the Tree
Pruning parameters include (to name a few):
(1) minimum number of observations in a node to allow a split
(2) minimum number of observations in a terminal node
(3) max depth of any node in the tree
Use cross-validation to determine the appropriate tuning parameters
Possible rules:
-minimize the cross-validation relative error (xerror)
-use the “1-SE rule”: take the largest value of Cp whose xerror is within one standard deviation of the minimum
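The 1-SE rule is easy to sketch given a cross-validation pruning table; the Cp/xerror/xstd values below are made up for illustration (in practice they come from the fitted tree's pruning table):

```python
# Hypothetical pruning table rows: (cp, xerror, xstd), from large cp
# (small tree) down to small cp (large tree)
table = [(0.20,  1.00, 0.08),
         (0.10,  0.80, 0.07),
         (0.05,  0.62, 0.06),
         (0.01,  0.60, 0.06),
         (0.001, 0.61, 0.06)]

xerr_min = min(x for _, x, _ in table)
threshold = xerr_min + [s for _, x, s in table if x == xerr_min][0]

# 1-SE rule: largest cp whose CV error is within one SE of the minimum
cp_1se = max(cp for cp, x, _ in table if x <= threshold)
print(cp_1se)  # 0.05: a smaller tree than the minimum-xerror choice (cp = 0.01)
```

The rule deliberately trades a little CV error (still within one SE of the best) for a simpler, more stable tree.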
Logic Regression
Logic regression is an alternative tree-based method
It requires exclusively binary features although the response can be binary or continuous
Logic regression produces models that represent Boolean (i.e. logical) combinations of binary predictors:
-“AND” = ∧
-“OR” = ∨
-“NOT” = !Xj
This allows logic regression to model complex interactions among binary predictors
Logic Regression
Like CART, logic regression is a non-parametric/semi-parametric, data-driven method
Like CART, it is also able to model continuous and categorical responses
Logic regression can also be used for survival outcomes (though I’ve never found an instance where someone actually did!)
Unlike CART, logic regression requires that all features are binary when added to the model
-Features in a CART model are treated as binary splits, BUT CART identifies the “best” cutpoint for a given feature to improve purity in the data
Logic regression also uses a different fitting algorithm to identify the “best” model
-Simulated Annealing vs. Recursive Partitioning
Comparing Tree Structures
CART
-nodes: define splits on features in the tree
-branches: define relationships between features in the tree
-terminal nodes: defined by a subgroup of the data (the probability of each class in that node defines the final prediction)
-predictions: made by starting at the top of the tree and following the appropriate splits until a terminal node is reached
Logic Regression
-knots (nodes): “AND”/“OR” operators in the tree
-branches: define relationships between features in the tree
-leaves (terminal nodes): define the features (i.e. X’s) used in the tree
-predictions: the tree can be broken down into a set of logical relationships among features; if an observation matches one of the feature combinations defined in the tree, its class is predicted to be 1
Logic Regression
Branches in a CART tree can be thought of as “AND” combinations of the features at each node in the branch
The only “OR” operator in a CART tree occurs at the first split
Inclusion of “OR”, “AND”, and “NOT” operators in logic regression trees provides greater flexibility than is seen in CART models
For data that include predominantly binary predictors (i.e. single nucleotide polymorphisms, yes/no survey data, etc.) logic regression is a better option than CART
Consider an example…
CART vs. Logic Regression Trees
Assume the features are a set of binary predictors and the following combination predicts that an individual has disease (versus being healthy):
(X1 AND X4) OR (X1 AND !X5)
[CART tree: splits on X4 < 0.5, X5 < 0.5, and X1 < 0.5, with terminal nodes predicting C = 1 or C = 0]
[Logic regression tree: an OR at the root joining two ANDs, one over the leaves X1 and X4 and one over the leaves X1 and !X5]
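The Boolean rule above can be evaluated directly; a minimal sketch tabulating its predictions over all eight combinations of X1, X4, X5:

```python
from itertools import product

# Disease rule from the slide: (X1 AND X4) OR (X1 AND !X5)
def diseased(x1, x4, x5):
    return (x1 and x4) or (x1 and not x5)

# Predicted class (1 = disease) for every combination of the three predictors
cases = {(x1, x4, x5): int(bool(diseased(x1, x4, x5)))
         for x1, x4, x5 in product([0, 1], repeat=3)}
```

Any case with X1 = 0 is predicted healthy; with X1 = 1, disease is predicted unless X4 = 0 and X5 = 1.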
Logic Regression Trees Not Unique
Logic regression trees are not necessarily unique. We can represent our same model with two seemingly different trees:
(X1 AND X4) OR (X1 AND !X5)
[Logic regression tree 1: an OR at the root joining AND(X1, X4) and AND(X1, !X5)]
[Logic regression tree 2: an AND at the root joining X1 and OR(X4, !X5)]
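Because the features are binary, "same model, different tree" can be verified exhaustively; reading the second tree as X1 AND (X4 OR !X5) (the factored form suggested by the slide):

```python
from itertools import product

tree1 = lambda x1, x4, x5: (x1 and x4) or (x1 and not x5)
tree2 = lambda x1, x4, x5: x1 and (x4 or not x5)   # factored form of the same model

# The two trees agree on all 8 possible binary inputs
agree = all(bool(tree1(*b)) == bool(tree2(*b))
            for b in product([False, True], repeat=3))
print(agree)  # True
```

This is just the distributive law for Boolean algebra, which is why logic trees are equivalent up to such rewritings.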
Cancer Example of Logic Regression
[Figure: logic model for stage P1 cancer and its equivalent logic tree. The model I(Y = Cancer) combines the markers H-ras, FGFR3, 9q-, p53, p21, and Rb through nested AND/OR operators, with p53 and p21 entering negated (!p53, !p21)]
Mitra, A. P. et al., J Clin Oncol 24:5552-5564, 2006. Copyright © American Society of Clinical Oncology
Fitting a Logic Regression Model
CART: considers one feature at a time
Logic regression: considers all logical combinations of predictors (up to a fixed size)
As a result, the search space can be very large
Recursive partitioning could be used for logic regression, however, since it only considers one predictor at a time, it may fail to identify optimal combinations of predictors
Simulated annealing is a stochastic learning algorithm that is a viable alternative to recursive partitioning
Simulated Annealing
The original idea for the simulated annealing (SA) algorithm came from metallurgy
-With metal mixtures, strength, malleability, etc. are controlled by controlling the temperature at which the metals anneal
SA borrows from this idea to effectively search a large feature space without getting “stuck” at a local max/min
SA uses a combination of a random walk and a Markov chain to conduct its search
The rate of annealing is controlled by:
-the number of random steps the algorithm takes
-the cooling schedule (including the starting and ending annealing “temperatures”)
Simulated Annealing
(0) Select a measure of model fit (scoring function)
(1) Select a random starting point in the search space (i.e. one possible combination of features)
(2) Randomly select one of six possible moves to update the current model
  -Alternate a leaf; alternate an operator; grow a branch; prune a branch; split a leaf; delete a leaf
(3) Compare the new model to the old model using the measure of fit
(4) If the new model is better, accept it unconditionally
(5) If the new model is worse, choose it anyway with some probability
  -This probability is related to the current annealing “temperature”
(6) Repeat for some set number of iterations (number of steps)
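The acceptance rule in steps (4)-(5) is typically Metropolis-style: a worse model is accepted with probability exp(-Δ/temperature), which shrinks as the temperature cools. A generic sketch on a toy minimization problem (the move set, cooling schedule, and scoring function here are illustrative placeholders, not logic regression's actual moves):

```python
import math, random

def anneal(initial, random_move, score, steps=1000, t_start=1.0, t_end=0.01):
    """Generic simulated annealing: minimize score over a search space."""
    current, current_score = initial, score(initial)
    for step in range(steps):
        # geometric cooling schedule from t_start down toward t_end
        temp = t_start * (t_end / t_start) ** (step / steps)
        candidate = random_move(current)
        cand_score = score(candidate)
        delta = cand_score - current_score
        # accept better models unconditionally; worse ones with prob exp(-delta/temp)
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            current, current_score = candidate, cand_score
    return current, current_score

# Toy search space: integers, score = distance from 42, moves = +/-1 steps
random.seed(0)
best, best_score = anneal(0, lambda x: x + random.choice([-1, 1]),
                          lambda x: abs(x - 42))
```

Early (hot) iterations accept many worse moves and roam the space; late (cold) iterations accept almost none, so the search settles near the optimum instead of wandering off it.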
Six Allowable Moves
Ruczinski I, Kooperberg C, LeBlanc M (2003). Logic Regression. JCGS, 12(3), 475-511.
Simulated Annealing: Measures of Model Fit
(1) Misclassification (classification model)
  -Simulated annealing does not have the same issues with misclassification that recursive partitioning does
(2) Deviance/cross-entropy (logistic classification model)
(3) Sums-of-square error (regression model)
(4) Hazard function (survival)
(5) User specified model fit/loss function -You could write your own
Other Fitting Considerations
(1) Choosing annealing parameters (start/end temperatures, number of iterations)
  -Determine an appropriate set of annealing parameters prior to running the model
  -Starting temperature: select to accept 90-95% of “worse” models
  -Ending temperature: select so that fewer than 5% of worse models are accepted
  -The number of iterations is a matter of choice; more iterations = longer run time
(2) Choosing the maximum number of leaves/trees
  -Good to determine the appropriate number of leaves/trees before running the final model
  -Can be done by cross-validation (select the model size with the smallest CV error)
Some Final Notes: The Good
Tree-based methods are very flexible for classification and regression
The models can be presented graphically for easier interpretation
-clinicians like them; they tend to think this way anyway
CART is very useful in settings where features are continuous and binary
Logic regression is useful for modeling data with all binary features (e.g. SNP data)
Some Final Notes: The Bad (and sometimes ugly)
They do have a tendency to over-fit the data
-not uncommon among machine learning methods
They are referred to as weak learners
-small changes in the data result in very different models
There has been extensive research into improving the performance of these methods which is the topic of our next class.