Classification Part 4: Tree-Based Methods (BMTRY 726, 4/18/14)
TRANSCRIPT
So Far: Linear Classifiers
-Logistic regression: most commonly used
-LDA: uses more information (prior probabilities) and is therefore more efficient, but its output is less interpretable
-Weaknesses:
  -modeling issues for large p relative to n
  -inflexibility in the decision boundary
So Far: Neural Networks
-Able to handle large p by reducing the number of inputs, X, to a smaller set of derived features, Z, used to model Y
-Applying (non-linear) functions to linear combinations of the X’s and Z’s results in very flexible decision boundaries
-Weaknesses:
  -limited means of identifying important features
  -tendency to over-fit the data
Tree-Based Methods
Non-parametric and semi-parametric methods
Models can be represented graphically, making them very easy to interpret
We will consider two decision tree methods:
-Classification and Regression Trees (CART)
-Logic Regression
Tree-Based Methods
Classification and Regression Trees (CART)
-Used for both classification and regression
-Features can be categorical or continuous
-Binary splits selected for all features
-Features may occur more than once in a model
Logic Regression
-Also used for classification and regression
-Features must be binary for these models
-Features may occur more than once in a model
CART
Main idea: divide the feature space into a disjoint set of rectangles and fit a simple model to each one
-For regression, the predicted output is just the mean of the training data in that region
  -i.e., a constant (or simple linear) function is fit in each region
-For classification, the predicted class is just the most frequent class in the training data in that region
  -Class probabilities are estimated by the class frequencies in the training data in that region
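As a toy sketch of this idea (the rectangular regions and training points below are made up for illustration, not from the lecture), classification in a region amounts to a majority vote over the training points that fall in it:

```python
from collections import Counter

# Hypothetical rectangles: name -> (x1_lo, x1_hi, x2_lo, x2_hi)
regions = {"R1": (0, 5, 0, 5), "R2": (5, 10, 0, 5)}

# Toy training data: (x1, x2) -> class label
train = [((1, 1), "A"), ((2, 3), "A"), ((3, 2), "B"),
         ((6, 1), "B"), ((7, 4), "B"), ((9, 2), "A")]

def region_of(x1, x2):
    """Find which rectangle a point falls in."""
    for name, (lo1, hi1, lo2, hi2) in regions.items():
        if lo1 <= x1 < hi1 and lo2 <= x2 < hi2:
            return name
    return None

def predict(x1, x2):
    """Predict the most frequent training class in the point's region."""
    r = region_of(x1, x2)
    labels = [y for (p1, p2), y in train if region_of(p1, p2) == r]
    return Counter(labels).most_common(1)[0][0]
```

For a point in R1 (two A's, one B) the prediction is "A"; for a point in R2 (two B's, one A) it is "B".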
CART Trees
CART trees: graphical representations of a CART model
Features in a CART tree:
-Nodes/partitions: binary splits in the data based on one feature (e.g. X1 < t1)
-Branches: paths leading down from a split
  -Go left if X satisfies the rule defined at the split
  -Go right if X doesn’t meet the criterion
-Terminal nodes: rectangular regions where splitting ends
[Tree diagram: root split X1 < t1; the left branch splits on X2 < t2 into regions R1 and R2; the right branch splits on X1 < t3 into R3, then on X2 < t4 into R4 and R5]
CART Tree Predictions
Predictions from a tree are made from the top down
For a new observation with feature vector X, follow the path through the tree whose splitting rules hold for the observed vector until reaching a terminal node
The prediction is the one made at that terminal node
- In the case of classification, it is the class with the largest probability at that node
CART Tree Predictions
For example, say our feature vector is X = (6, -1.5)
Start at the top: which direction do we go?
[Tree diagram with numeric cutpoints: root X1 < 4; left branch: X2 < -1 → R1, R2; right branch: X1 < 5.5 → R3, otherwise X2 < -0.2 → R4, R5]
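Tracing that path in code (a minimal sketch; the mapping of branches to the region labels R1-R5 follows the usual left-to-right reading of the diagram and is an assumption):

```python
# Hard-coded version of the example tree with cutpoints 4, -1, 5.5, -0.2
def predict_region(x1, x2):
    if x1 < 4:                         # root split
        return "R1" if x2 < -1 else "R2"
    if x1 < 5.5:                       # right branch, second split
        return "R3"
    return "R4" if x2 < -0.2 else "R5"

print(predict_region(6, -1.5))  # X1 >= 4, X1 >= 5.5, X2 < -0.2 -> "R4"
```

For X = (6, -1.5) the path goes right at both X1 splits and then left at X2 < -0.2, ending in R4 (under the assumed region labeling).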
Building a CART Tree
Given a set of features X and associated outcomes Y how do we fit a tree?
CART uses a Recursive Partitioning (aka Greedy Search) Algorithm to select the “best” feature at each split.
So let’s look at the recursive partitioning algorithm.
Building a CART Tree: Recursive Partitioning (aka Greedy Search) Algorithm
(1) Consider all possible splits for each feature/predictor
  -Think of a split as a cutpoint (when X is binary this is always ½)
(2) Select the predictor, and the split on that predictor, that provides the best separation of the data
(3) Divide the data into 2 subsets based on the selected split
(4) For each subset, consider all possible splits for each feature
(5) Select the predictor/split that provides the best separation for each subset of the data
(6) Repeat this process until some desired level of data purity is reached; this is a terminal node!
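Steps (1)-(2) can be sketched as an exhaustive scan over features and candidate cutpoints, scoring each split by the weighted Gini impurity of the two resulting subsets. This is an illustrative sketch, not the lecture's code; run on the 15-point "Simple Example" data that appears later, it recovers the first split X2 < -1.0:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Scan every feature and candidate cutpoint; return the split with the
    lowest weighted Gini impurity as (score, feature_index, cutpoint)."""
    best, n = None, len(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # candidate cutpoints: midpoints between consecutive observed values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in range(n) if X[i][j] < t]
            right = [y[i] for i in range(n) if X[i][j] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

# The 15-point example data from the lecture: columns X1, X2; classes 1/2
X = [[0.0, 0.8], [8.0, -0.4], [2.9, -1.4], [4.9, 1.0], [2.9, 0.0],
     [9.0, 1.2], [1.0, -0.6], [4.9, 1.6], [2.0, 1.8], [1.0, -1.8],
     [7.0, 2.0], [1.0, -2.0], [1.0, -0.4], [6.0, 1.4], [9.0, -0.1]]
y = [2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2]

score, j, t = best_split(X, y)  # first split: feature X2 at cutpoint -1.0
```

The left subset (X2 < -1.0) is pure, so the weighted impurity drops to 12/15 × 0.375 = 0.3, the best of all candidate splits.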
Impurity of a Node
Need a measure of impurity of a node to help decide on how to split a node, or which node to split
The measure should be at a maximum when a node is equally divided amongst all classes
The impurity should be zero if the node is all one class
Consider the proportion of observations of class k in node m:

p̂_mk = (1/N_m) Σ_{x_i ∈ R_m} I(y_i = k)
Measures of Impurity
Possible measures of impurity:
(1) Misclassification rate:

  (1/N_m) Σ_{i ∈ R_m} I(y_i ≠ k(m)) = 1 − p̂_{m,k(m)}, where k(m) is the most frequent class in node m

  -Situations can occur where no split improves the misclassification rate
  -The misclassification rate can be equal when one option is clearly better for the next step
(2) Deviance/cross-entropy:

  −Σ_{k=1}^{K} p̂_mk log p̂_mk

(3) Gini index:

  Σ_{k ≠ k'} p̂_mk p̂_mk' = Σ_{k=1}^{K} p̂_mk (1 − p̂_mk)
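The three impurity measures just listed are one-liners over a vector of class proportions p_mk; a small sketch for checking values by hand:

```python
from math import log

def misclassification(p):
    return 1.0 - max(p)            # 1 minus the proportion of the majority class

def cross_entropy(p):
    return -sum(pk * log(pk) for pk in p if pk > 0)   # deviance (natural log)

def gini(p):
    return sum(pk * (1.0 - pk) for pk in p)

# All three are maximal for an evenly divided node and zero for a pure node
print(misclassification([0.5, 0.5]), gini([0.5, 0.5]), gini([1.0, 0.0]))  # 0.5 0.5 0.0
```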
Tree Impurity
Define the impurity of a tree to be the sum over all terminal nodes of the impurity of each node multiplied by the proportion of cases that reach that node of the tree
Example: impurity of a tree with a single node, with classes A and B each having 400 cases, using the Gini index:
-Proportion of each class = 0.5
-Therefore Gini index = (0.5)(1-0.5) + (0.5)(1-0.5) = 0.5
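The single-node example can be checked directly, with tree impurity computed as the node-proportion-weighted sum of node Gini indices:

```python
def gini(p):
    return sum(pk * (1.0 - pk) for pk in p)

def tree_impurity(nodes, impurity=gini):
    """nodes: list of (cases_in_node, class_proportions_in_node)."""
    total = sum(n for n, _ in nodes)
    return sum((n / total) * impurity(p) for n, p in nodes)

one_node_tree = [(800, [0.5, 0.5])]   # 400 A + 400 B in a single node
print(tree_impurity(one_node_tree))   # 0.5, matching the slide
```

Splitting that node into two pure children (400 A's and 400 B's) would drive the tree impurity to 0.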
Simple Example

X1    X2    Class
0.0   0.8   2
8.0  -0.4   1
2.9  -1.4   1
4.9   1.0   1
2.9   0.0   2
9.0   1.2   2
1.0  -0.6   2
4.9   1.6   1
2.0   1.8   2
1.0  -1.8   1
7.0   2.0   2
1.0  -2.0   1
1.0  -0.4   2
6.0   1.4   2
9.0  -0.1   2
Split 1: X2 < -1.0 → left terminal node predicts Blue (all class 1)
Split 2: among the points with X2 ≥ -1.0, split on X1 < 3.9 → left terminal node predicts Red (all class 2)
(previous split: X2 < -1.0 → Blue)
Split 3: among the points with X1 ≥ 3.9, split on X1 < 5.45 → left terminal node predicts Blue (all class 1)
(previous splits: X2 < -1.0 → Blue; X1 < 3.9 → Red)
Split 4: among the points with X1 ≥ 5.45, split on X2 < -0.25 → Blue, otherwise Red
(full tree: X2 < -1.0 → Blue; then X1 < 3.9 → Red; then X1 < 5.45 → Blue; then X2 < -0.25 → Blue, else Red)
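Putting the four splits together, the fully grown tree separates the training data perfectly; a quick check (the class-to-color mapping, 1 = Blue and 2 = Red, is inferred from the data):

```python
# The 15 training points from the example: (X1, X2, class)
data = [(0.0, 0.8, 2), (8.0, -0.4, 1), (2.9, -1.4, 1), (4.9, 1.0, 1),
        (2.9, 0.0, 2), (9.0, 1.2, 2), (1.0, -0.6, 2), (4.9, 1.6, 1),
        (2.0, 1.8, 2), (1.0, -1.8, 1), (7.0, 2.0, 2), (1.0, -2.0, 1),
        (1.0, -0.4, 2), (6.0, 1.4, 2), (9.0, -0.1, 2)]

def predict(x1, x2):
    if x2 < -1.0:        # split 1: Blue (class 1)
        return 1
    if x1 < 3.9:         # split 2: Red (class 2)
        return 2
    if x1 < 5.45:        # split 3: Blue
        return 1
    return 1 if x2 < -0.25 else 2   # split 4: Blue / Red

errors = sum(predict(x1, x2) != c for x1, x2, c in data)
print(errors)  # 0: the fully grown tree fits the training data perfectly
```

That perfect training fit is exactly the over-fitting risk the next slides address.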
Pruning the Tree
Over-fitting: CART trees grown to maximum purity often over-fit the data
Error rate will always drop (or at least not increase) with every split
Does not mean error rate on test data will improve
Best method of arriving at a suitable size for the tree is to grow an overly complex one then to prune it back
Pruning (in classification trees) based on misclassification
Pruning the Tree
Pruning parameters include (to name a few):
(1) minimum number of observations in a node to allow a split
(2) minimum number of observations in a terminal node
(3) max depth of any node in the tree
Use cross-validation to determine the appropriate tuning parameters
Possible rules:
-minimize the cross-validation relative error (xerror)
-use the “1-SE rule”: take the largest value of Cp whose xerror is within one standard deviation of the minimum
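The 1-SE rule is easy to sketch given a cross-validation pruning table; the Cp/xerror/xstd values below are made up for illustration (in practice they come from the fitted tree's pruning table):

```python
# Hypothetical pruning table rows: (cp, xerror, xstd), from large cp
# (small tree) down to small cp (large tree)
table = [(0.20,  1.00, 0.08),
         (0.10,  0.80, 0.07),
         (0.05,  0.62, 0.06),
         (0.01,  0.60, 0.06),
         (0.001, 0.61, 0.06)]

xerr_min = min(x for _, x, _ in table)
threshold = xerr_min + [s for _, x, s in table if x == xerr_min][0]

# 1-SE rule: largest cp whose CV error is within one SE of the minimum
cp_1se = max(cp for cp, x, _ in table if x <= threshold)
print(cp_1se)  # 0.05: a smaller tree than the minimum-xerror choice (cp = 0.01)
```

The rule deliberately trades a little CV error (still within one SE of the best) for a simpler, more stable tree.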
Logic Regression
Logic regression is an alternative tree-based method
It requires exclusively binary features although the response can be binary or continuous
Logic regression produces models that represent Boolean (i.e. logical) combinations of binary predictors:
-“AND” = ∧
-“OR” = ∨
-“NOT” = !Xj
This allows logic regression to model complex interactions among binary predictors
Logic Regression
Like CART, logic regression is a non-parametric/semi-parametric, data-driven method
Like CART, it is also able to model continuous and categorical responses
Logic regression can also be used for survival outcomes (though I’ve never found an instance where someone actually did!)
Unlike CART, logic regression requires that all features are binary when added to the model
-Features in a CART model are treated as binary splits, BUT CART identifies the “best” cutpoint for a given feature to improve purity in the data
Logic regression also uses a different fitting algorithm to identify the “best” model
-Simulated Annealing vs. Recursive Partitioning
Comparing Tree Structures
CART
-nodes: define splits on features in the tree
-branches: define relationships between features in the tree
-terminal nodes: defined by a subgroup of the data (the probability of each class in that node defines the final prediction)
-predictions: made by starting at the top of the tree and following the appropriate splits until a terminal node is reached
Logic Regression
-knots (nodes): “AND”/“OR” operators in the tree
-branches: define relationships between features in the tree
-leaves (terminal nodes): define the features (i.e. X’s) used in the tree
-predictions: the tree can be broken down into a set of logical relationships among features; if an observation matches one of the feature combinations defined in the tree, its class is predicted to be 1
Logic Regression
Branches in a CART tree can be thought of as “AND” combinations of the features at each node in the branch
The only “OR” operator in a CART tree occurs at the first split
Inclusion of “OR”, “AND”, and “NOT” operators in logic regression trees provides greater flexibility than is seen in CART models
For data that include predominantly binary predictors (i.e. single nucleotide polymorphisms, yes/no survey data, etc.) logic regression is a better option than CART
Consider an example…
CART vs. Logic Regression Trees
Assume the features are a set of binary predictors and the following combination predicts that an individual has disease (versus being healthy):
(X1 AND X4) OR (X1 AND !X5)
[CART tree: splits on X4 < 0.5, X5 < 0.5, and X1 < 0.5, with terminal nodes predicting C = 1 or C = 0]
[Logic regression tree: an OR at the root joining two ANDs, one over the leaves X1 and X4 and one over the leaves X1 and !X5]
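The Boolean rule above can be evaluated directly; a minimal sketch tabulating its predictions over all eight combinations of X1, X4, X5:

```python
from itertools import product

# Disease rule from the slide: (X1 AND X4) OR (X1 AND !X5)
def diseased(x1, x4, x5):
    return (x1 and x4) or (x1 and not x5)

# Predicted class (1 = disease) for every combination of the three predictors
cases = {(x1, x4, x5): int(bool(diseased(x1, x4, x5)))
         for x1, x4, x5 in product([0, 1], repeat=3)}
```

Any case with X1 = 0 is predicted healthy; with X1 = 1, disease is predicted unless X4 = 0 and X5 = 1.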
Logic Regression Trees Not Unique
Logic regression trees are not necessarily unique. We can represent our same model with two seemingly different trees:
(X1 AND X4) OR (X1 AND !X5)
[Logic regression tree 1: an OR at the root joining AND(X1, X4) and AND(X1, !X5)]
[Logic regression tree 2: an AND at the root joining X1 and OR(X4, !X5)]
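Because the features are binary, "same model, different tree" can be verified exhaustively; reading the second tree as X1 AND (X4 OR !X5) (the factored form suggested by the slide):

```python
from itertools import product

tree1 = lambda x1, x4, x5: (x1 and x4) or (x1 and not x5)
tree2 = lambda x1, x4, x5: x1 and (x4 or not x5)   # factored form of the same model

# The two trees agree on all 8 possible binary inputs
agree = all(bool(tree1(*b)) == bool(tree2(*b))
            for b in product([False, True], repeat=3))
print(agree)  # True
```

This is just the distributive law for Boolean algebra, which is why logic trees are equivalent up to such rewritings.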
Cancer Example of Logic Regression
[Figure: logic model for stage P1 cancer and its equivalent logic tree. The model I(Y = Cancer) combines the markers H-ras, FGFR3, 9q-, p53, p21, and Rb through nested AND/OR operators, with p53 and p21 entering negated (!p53, !p21)]
Mitra, A. P. et al., J Clin Oncol 24:5552-5564, 2006. Copyright © American Society of Clinical Oncology
Fitting a Logic Regression Model
CART: considers one feature at a time
Logic regression: considers all logical combinations of predictors (up to a fixed size)
As a result, the search space can be very large
Recursive partitioning could be used for logic regression, however, since it only considers one predictor at a time, it may fail to identify optimal combinations of predictors
Simulated annealing is a stochastic learning algorithm that is a viable alternative to recursive partitioning
Simulated Annealing
The original idea for the simulated annealing (SA) algorithm came from metallurgy
-With metal mixtures, strength, malleability, etc. are controlled by controlling the temperature at which the metals anneal
SA borrows from this idea to effectively search a large feature space without getting “stuck” at a local max/min
SA uses a combination of a random walk and a Markov chain to conduct its search
The rate of annealing is controlled by:
-the number of random steps the algorithm takes
-the cooling schedule (including the starting and ending annealing “temperatures”)
Simulated Annealing
(0) Select a measure of model fit (scoring function)
(1) Select a random starting point in the search space (i.e. one possible combination of features)
(2) Randomly select one of six possible moves to update the current model
  -Alternate a leaf; alternate an operator; grow a branch; prune a branch; split a leaf; delete a leaf
(3) Compare the new model to the old model using the measure of fit
(4) If the new model is better, accept it unconditionally
(5) If the new model is worse, choose it anyway with some probability
  -This probability is related to the current annealing “temperature”
(6) Repeat for some set number of iterations (number of steps)
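The acceptance rule in steps (4)-(5) is typically Metropolis-style: a worse model is accepted with probability exp(-Δ/temperature), which shrinks as the temperature cools. A generic sketch on a toy minimization problem (the move set, cooling schedule, and scoring function here are illustrative placeholders, not logic regression's actual moves):

```python
import math, random

def anneal(initial, random_move, score, steps=1000, t_start=1.0, t_end=0.01):
    """Generic simulated annealing: minimize score over a search space."""
    current, current_score = initial, score(initial)
    for step in range(steps):
        # geometric cooling schedule from t_start down toward t_end
        temp = t_start * (t_end / t_start) ** (step / steps)
        candidate = random_move(current)
        cand_score = score(candidate)
        delta = cand_score - current_score
        # accept better models unconditionally; worse ones with prob exp(-delta/temp)
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            current, current_score = candidate, cand_score
    return current, current_score

# Toy search space: integers, score = distance from 42, moves = +/-1 steps
random.seed(0)
best, best_score = anneal(0, lambda x: x + random.choice([-1, 1]),
                          lambda x: abs(x - 42))
```

Early (hot) iterations accept many worse moves and roam the space; late (cold) iterations accept almost none, so the search settles near the optimum instead of wandering off it.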
Six Allowable Moves
Ruczinski I, Kooperberg C, LeBlanc M (2003). Logic Regression. JCGS, 12(3), 475-511.
Simulated Annealing: Measures of Model Fit
(1) Misclassification (classification model)
  -Simulated annealing does not have the same issues with misclassification that recursive partitioning does
(2) Deviance/cross-entropy (logistic classification model)
(3) Sums-of-square error (regression model)
(4) Hazard function (survival)
(5) User specified model fit/loss function -You could write your own
Other Fitting Considerations
(1) Choosing annealing parameters (start/end temperatures, number of iterations)
  -Determine an appropriate set of annealing parameters prior to running the model
  -Starting temperature: select to accept 90-95% of “worse” models
  -Ending temperature: select so that fewer than 5% of worse models are accepted
  -The number of iterations is a matter of choice; more iterations = longer run time
(2) Choosing the maximum number of leaves/trees
  -Good to determine the appropriate number of leaves/trees before running the final model
  -Can be done by cross-validation (select the model size with the smallest CV error)
Some Final Notes: The Good
Tree-based methods are very flexible for classification and regression
The models can be presented graphically for easier interpretation
-clinicians like them; they tend to think this way anyway
CART is very useful in settings where features are continuous and binary
Logic regression is useful for modeling data with all binary features (e.g. SNP data)
Some Final Notes: The Bad (and sometimes ugly)
They do have a tendency to over-fit the data
-not uncommon among machine learning methods
They are referred to as weak learners
-small changes in the data result in very different models
There has been extensive research into improving the performance of these methods which is the topic of our next class.