TRANSCRIPT
From decision trees to random forests
Viet-Trung Tran
Decision tree learning
• Supervised learning
• From a set of measurements:
– learn a model
– to predict and understand a phenomenon
Example 1: wine taste preference
• From physicochemical properties (alcohol, acidity, sulphates, etc.)
• Learn a model
• To predict wine taste preference (from 0 to 10)
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Modeling wine preferences by data mining from physicochemical properties, 2009
Observation
• A decision tree can be interpreted as a set of IF...THEN rules
• Can be applied to noisy data
• One of the most popular inductive learning methods
• Gives good results for real-life applications
Decision tree representation
• An inner node represents an attribute
• An edge represents a test on the attribute of the parent node
• A leaf represents one of the classes
• Construction of a decision tree:
– Based on the training data
– Top-down strategy
Example 2: Sport preference
Example 3: Weather & sport practicing
Classification
• The classification of an unknown input vector is done by traversing the tree from the root node to a leaf node.
• A record enters the tree at the root node.
• At the root, a test is applied to determine which child node the record will encounter next.
• This process is repeated until the record arrives at a leaf node.
• All the records that end up at a given leaf of the tree are classified in the same way.
• There is a unique path from the root to each leaf.
• The path is a rule which is used to classify the records.
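This traversal is a simple loop. A minimal sketch, assuming nodes carry an is_leaf flag, a test function, children, and a label (all names are illustrative, not from the slides):

```python
def classify(record, root):
    """Follow the unique root-to-leaf path; that path is the record's rule."""
    node = root
    while not node.is_leaf:                    # apply the test at each inner node
        node = node.children[node.test(record)]
    return node.label                          # all records at this leaf share a class
```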
• The data set has five attributes.
• There is a special attribute: the attribute class is the class label.
• The attributes temp (temperature) and humidity are numerical attributes.
• The other attributes are categorical, that is, they cannot be ordered.
• Based on the training data set, we want to find a set of rules to know what values of outlook, temperature, humidity and wind determine whether or not to play golf.
• RULE 1 If it is sunny and the humidity is not above 75%, then play.
• RULE 2 If it is sunny and the humidity is above 75%, then do not play.
• RULE 3 If it is overcast, then play.
• RULE 4 If it is rainy and not windy, then play.
• RULE 5 If it is rainy and windy, then don't play.
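These rules map directly onto code. A minimal sketch, assuming the attribute names used here and a boolean return value (True = play):

```python
def play_golf(outlook, humidity, windy):
    """The five rules above, written as code."""
    if outlook == "sunny":
        return humidity <= 75          # RULE 1 / RULE 2
    if outlook == "overcast":
        return True                    # RULE 3
    if outlook == "rainy":
        return not windy               # RULE 4 / RULE 5
```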
Splitting attribute
• At every node there is an attribute associated with the node, called the splitting attribute
• Top-down traversal
– In our example, outlook is the splitting attribute at the root.
– Since for the given record outlook = rain, we move to the rightmost child node of the root.
– At this node, the splitting attribute is windy, and we find that for the record we want to classify, windy = true.
– Hence, we move to the left child node and conclude that the class label is "no play".
Decision tree construction
• Identify the splitting attribute and splitting criterion at every level of the tree
• Algorithm – Iterative Dichotomizer (ID3)
Iterative Dichotomizer (ID3)
• Quinlan (1986)
• Each node corresponds to a splitting attribute
• Each edge is a possible value of that attribute
• At each node the splitting attribute is selected to be the most informative among the attributes not yet considered in the path from the root
• Entropy is used to measure how informative a node is
Splitting attribute selection
• The algorithm uses the criterion of information gain to determine the goodness of a split.
– The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split on all distinct values of that attribute.
• Example: 2 classes: C1, C2, pick A1 or A2
Entropy – General Case
• Impurity/inhomogeneity measurement
• Suppose X takes n values, V1, V2, … Vn, and P(X=V1)=p1, P(X=V2)=p2, … P(X=Vn)=pn
• What is the smallest number of bits, on average, per symbol, needed to transmit symbols drawn from the distribution of X? It is E(X) = −p1 log2 p1 − p2 log2 p2 − … − pn log2 pn
• E(X) = the entropy of X
$$E(X) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$
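As a quick check of this formula, a minimal Python sketch (the function name and the label-list input are assumptions, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """E(X) = -sum(p_i * log2(p_i)) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The 9-Yes / 5-No split used in the information-gain example below:
print(entropy(["Yes"] * 9 + ["No"] * 5))  # ~0.940
```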
Example: 2 classes
Information gain
• Gain(S, Wind)?
• Wind = {Weak, Strong}
• S = {9 Yes & 5 No}
• S_weak = {6 Yes & 2 No | Wind = Weak}
• S_strong = {3 Yes & 3 No | Wind = Strong}
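Worked through with the standard ID3 information-gain formula (values rounded to three decimals):

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
$$\mathrm{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$
$$\mathrm{Entropy}(S_{weak}) \approx 0.811, \qquad \mathrm{Entropy}(S_{strong}) = 1.000$$
$$\mathrm{Gain}(S, \mathrm{Wind}) = 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1.000) \approx 0.048$$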
Example: Decision tree learning
• Choose the splitting attribute for the root among {Outlook, Temperature, Humidity, Wind}:
– Gain(S, Outlook) = ... = 0.246
– Gain(S, Temperature) = ... = 0.029
– Gain(S, Humidity) = ... = 0.151
– Gain(S, Wind) = ... = 0.048
• Gain(S_sunny, Temperature) = 0.570
• Gain(S_sunny, Humidity) = 0.970
• Gain(S_sunny, Wind) = 0.019
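A compact sketch of the full ID3 loop, reusing the entropy helper above. Rows as dicts with a "class" key, and all function names, are assumptions; continuous attributes are not handled:

```python
def info_gain(rows, attr):
    """Gain(S, A): entropy of S minus the weighted entropy of each subset S_v."""
    labels = [r["class"] for r in rows]
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r["class"] for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, attrs):
    """Grow the tree top-down, always splitting on the highest-gain attribute."""
    labels = [r["class"] for r in rows]
    if len(set(labels)) == 1 or not attrs:         # pure node, or no attributes left
        return max(set(labels), key=labels.count)  # leaf: majority class
    best = max(attrs, key=lambda a: info_gain(rows, a))
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attrs if a != best])
                   for v in set(r[best] for r in rows)}}
```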
Over-fitting example
• Consider adding noisy training example #15:
– Sunny, hot, normal, strong, PlayTennis = No
• What effect on the earlier tree?
Over-fitting
Avoid over-fitting
• Stop growing when the data split is not statistically significant
• Grow the full tree, then post-prune
• How to select the best tree:
– Measure performance over the training data
– Measure performance over a separate validation dataset
– MDL: minimize size(tree) + size(misclassifications(tree))
Reduced-error pruning
• Split data into training and validation sets
• Do until further pruning is harmful:
– Evaluate the impact on the validation set of pruning each possible node
– Greedily remove the node that most improves validation set accuracy
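A sketch of this greedy loop; the tree representation and the accuracy, internal_nodes, and replace_with_majority_leaf helpers are hypothetical:

```python
def reduced_error_prune(tree, validation):
    """Greedily turn subtrees into leaves while validation accuracy does not drop."""
    while True:
        best_node, best_acc = None, accuracy(tree, validation)
        for node in internal_nodes(tree):              # each prunable subtree
            candidate = replace_with_majority_leaf(tree, node)
            acc = accuracy(candidate, validation)
            if acc >= best_acc:                        # pruning is not harmful
                best_node, best_acc = node, acc
        if best_node is None:                          # further pruning harmful: stop
            return tree
        tree = replace_with_majority_leaf(tree, best_node)
```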
Rule post-pruning
• Convert the tree to an equivalent set of rules
• Prune each rule independently of the others
• Sort the final rules into the desired sequence for use
Issues in Decision Tree Learning
• How deep to grow?
• How to handle continuous attributes?
• How to choose an appropriate attribute selection measure?
• How to handle data with missing attribute values?
• How to handle attributes with different costs?
• How to improve computational efficiency?
• ID3 has been extended to handle most of these. The resulting system is C4.5 (http://cis-linux1.temple.edu/~ingargio/cis587/readings/id3-c45.html)
Decision tree – When?
References
• Data mining, Nhat-Quang Nguyen, HUST
• http://www.cs.cmu.edu/~awm/10701/slides/DTreesAndOverfitting-9-13-05.pdf
RANDOM FORESTS
Credits: Michal Malohlava, 0xdata
Motivation
• Training sample of points covering the area [0,3] x [0,3]
• Two possible colors of points
• The model should be able to predict the color of a new point
Decision tree
How to grow a decision tree
• Split the rows in a given node into two sets with respect to an impurity measure
– The smaller the impurity, the more skewed the distribution
– Compare the impurity of the parent with the impurity of the children
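A sketch of one such split search on 2-D points, using Gini impurity as a concrete impurity measure (the slides leave the measure generic); all names and the candidate-threshold scheme are assumptions:

```python
def gini(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 when the distribution is fully skewed."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(points, labels):
    """Try each coordinate and threshold; keep the split whose children's
    weighted impurity improves most on the parent's impurity."""
    best_imp, best_rule = gini(labels), None
    for axis in (0, 1):                              # x and y of a 2-D point
        for t in sorted(set(p[axis] for p in points)):
            left = [l for p, l in zip(points, labels) if p[axis] <= t]
            right = [l for p, l in zip(points, labels) if p[axis] > t]
            if not left or not right:
                continue
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if w < best_imp:                         # children purer than parent
                best_imp, best_rule = w, (axis, t)
    return best_rule                                 # None means no useful split
```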
When to stop growing the tree
• Build the full tree, or
• Apply a stopping criterion – a limit on:
– Tree depth, or
– Minimum number of points in a leaf
How to assign a leaf value?
• The leaf value is:
– If the leaf contains only one point, its color represents the leaf value
– Else the majority color is picked, or the color distribution is stored
Decision tree
• The tree covers the whole area with rectangles, each predicting a point color
Decision tree scoring
• The model can predict a point's color based on its coordinates.
Over-fitting
• The tree perfectly represents the training data (0% training error), but it has also learned the noise!
• Hence it predicts new points poorly!
Handle over-fitting
• Pre-pruning via a stopping criterion!
• Post-pruning: decreases the complexity of the model but helps with model generalization
• Randomize tree building and combine the trees together
Randomize #1- Bagging
• Each tree sees only a sample of the training data and captures only a part of the information.
• Build multiple weak trees which vote together to give the resulting prediction
– Voting is based on a majority vote, or a weighted average
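A minimal bagging sketch with numpy bootstrap sampling and sklearn trees; parameter values and integer-coded labels are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=50, seed=0):
    """Fit each tree on its own bootstrap sample (drawn with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # sample of the training data
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict(trees, X):
    """Majority vote across the ensemble."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```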
Bagging - boundary
• Bagging averages many trees, and produces smoother decision boundaries.
Randomize #2 - Feature selection

Random forest
Random forest - properties
• Refinement of bagged trees; quite popular
• At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log2(p), where p is the number of features
• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the "out-of-bag" error rate.
• Random forests try to improve on bagging by "de-correlating" the trees. Each tree has the same expectation, so averaging less correlated trees reduces variance without changing the bias.
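In sklearn terms, a sketch (the parameter values and the X_train/y_train names are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier

# m = sqrt(p) features drawn at each split; OOB error monitored during fitting.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print("OOB accuracy:", rf.oob_score_)      # 1 - out-of-bag error rate
print("importances:", rf.feature_importances_)
```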
Advantages of Random Forest
• Independent trees which can be built in parallel
• The model does not overfit easily
• Produces reasonable accuracy
• Brings more ways to analyze the data: variable importance, proximities, missing value imputation
Out-of-bag points and validation
• Each tree is built over a sample of the training points.
• The remaining points are called "out-of-bag" (OOB).
• These points are used for validation, as a good approximation of the generalization error, almost identical to N-fold cross-validation.
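Following the bagging sketch above, the OOB error can be estimated by scoring each point only with the trees that did not see it. A sketch; the bookkeeping layout and integer-coded labels are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, n_trees=50, seed=0):
    """Vote for each point using only trees whose bootstrap sample missed it."""
    rng = np.random.default_rng(seed)
    votes = [[] for _ in range(len(X))]
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))     # bootstrap sample
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        oob = np.setdiff1d(np.arange(len(X)), idx)     # out-of-bag points
        for i, pred in zip(oob, tree.predict(X[oob])):
            votes[i].append(int(pred))
    wrong = sum(1 for i, v in enumerate(votes)
                if v and np.bincount(v).argmax() != y[i])
    return wrong / len(X)
```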