TRANSCRIPT
From decision trees to random forests
Viet-Trung Tran
Decision tree learning
• Supervised learning
• From a set of measurements:
– learn a model
– to predict and understand a phenomenon
Example 1: wine taste preference
• From physicochemical properties (alcohol, acidity, sulphates, etc.)
• Learn a model
• To predict wine taste preference (from 0 to 10)
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, Modeling wine preferences by data mining from physicochemical properties, 2009
Observation
• A decision tree can be interpreted as a set of IF...THEN rules
• Can be applied to noisy data
• One of the most popular inductive learning methods
• Gives good results for real-life applications
Decision tree representation
• An inner node represents an attribute
• An edge represents a test on the attribute of the parent node
• A leaf represents one of the classes
• Construction of a decision tree:
– Based on the training data
– Top-down strategy
Example 2: Sport preference
Example 3: Weather & sport practicing
Classification
• The classification of an unknown input vector is done by traversing the tree from the root node to a leaf node.
• A record enters the tree at the root node.
• At the root, a test is applied to determine which child node the record will encounter next.
• This process is repeated until the record arrives at a leaf node.
• All the records that end up at a given leaf of the tree are classified in the same way.
• There is a unique path from the root to each leaf.
• The path is a rule which is used to classify the records.
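This traversal is a simple loop. A minimal sketch, assuming nodes carry an is_leaf flag, a test function, children, and a label (all names are illustrative, not from the slides):

```python
def classify(record, root):
    """Follow the unique root-to-leaf path; that path is the record's rule."""
    node = root
    while not node.is_leaf:                    # apply the test at each inner node
        node = node.children[node.test(record)]
    return node.label                          # all records at this leaf share a class
```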
• The data set has five attributes.
• There is a special attribute: the attribute class is the class label.
• The attributes temp (temperature) and humidity are numerical attributes.
• The other attributes are categorical, that is, they cannot be ordered.
• Based on the training data set, we want to find a set of rules to know what values of outlook, temperature, humidity and wind determine whether or not to play golf.
• RULE 1 If it is sunny and the humidity is not above 75%, then play.
• RULE 2 If it is sunny and the humidity is above 75%, then do not play.
• RULE 3 If it is overcast, then play.
• RULE 4 If it is rainy and not windy, then play.
• RULE 5 If it is rainy and windy, then don't play.
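These rules map directly onto code. A minimal sketch, assuming the attribute names used here and a boolean return value (True = play):

```python
def play_golf(outlook, humidity, windy):
    """The five rules above, written as code."""
    if outlook == "sunny":
        return humidity <= 75          # RULE 1 / RULE 2
    if outlook == "overcast":
        return True                    # RULE 3
    if outlook == "rainy":
        return not windy               # RULE 4 / RULE 5
```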
Splitting attribute
• At every node there is an attribute associated with the node, called the splitting attribute
• Top-down traversal
– In our example, outlook is the splitting attribute at the root.
– Since for the given record outlook = rain, we move to the rightmost child node of the root.
– At this node, the splitting attribute is windy, and we find that for the record we want to classify, windy = true.
– Hence, we move to the left child node and conclude that the class label is "no play".
Decision tree construction
• Identify the splitting attribute and splitting criterion at every level of the tree
• Algorithm – Iterative Dichotomizer (ID3)
Iterative Dichotomizer (ID3)
• Quinlan (1986)
• Each node corresponds to a splitting attribute
• Each edge is a possible value of that attribute
• At each node the splitting attribute is selected to be the most informative among the attributes not yet considered in the path from the root
• Entropy is used to measure how informative a node is
Splitting attribute selection
• The algorithm uses the criterion of information gain to determine the goodness of a split.
– The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split on all distinct values of that attribute.
• Example: 2 classes: C1, C2, pick A1 or A2
Entropy – General Case
• Impurity/inhomogeneity measurement
• Suppose X takes n values, V1, V2, … Vn, and P(X=V1)=p1, P(X=V2)=p2, … P(X=Vn)=pn
• What is the smallest number of bits, on average, per symbol, needed to transmit symbols drawn from the distribution of X? It is E(X) = −p1 log2 p1 − p2 log2 p2 − … − pn log2 pn
• E(X) = the entropy of X
$$E(X) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$
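As a quick check of this formula, a minimal Python sketch (the function name and the label-list input are assumptions, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """E(X) = -sum(p_i * log2(p_i)) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The 9-Yes / 5-No split used in the information-gain example below:
print(entropy(["Yes"] * 9 + ["No"] * 5))  # ~0.940
```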
Example: 2 classes
Information gain
• Gain(S, Wind)?
• Wind = {Weak, Strong}
• S = {9 Yes & 5 No}
• S_weak = {6 Yes & 2 No | Wind = Weak}
• S_strong = {3 Yes & 3 No | Wind = Strong}
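Worked through with the standard ID3 information-gain formula (values rounded to three decimals):

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
$$\mathrm{Entropy}(S) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940$$
$$\mathrm{Entropy}(S_{weak}) \approx 0.811, \qquad \mathrm{Entropy}(S_{strong}) = 1.000$$
$$\mathrm{Gain}(S, \mathrm{Wind}) = 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1.000) \approx 0.048$$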
Example: Decision tree learning
• Choose the splitting attribute for the root among {Outlook, Temperature, Humidity, Wind}:
– Gain(S, Outlook) = ... = 0.246
– Gain(S, Temperature) = ... = 0.029
– Gain(S, Humidity) = ... = 0.151
– Gain(S, Wind) = ... = 0.048
• Gain(S_sunny, Temperature) = 0.570
• Gain(S_sunny, Humidity) = 0.970
• Gain(S_sunny, Wind) = 0.019
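A compact sketch of the full ID3 loop, reusing the entropy helper above. Rows as dicts with a "class" key, and all function names, are assumptions; continuous attributes are not handled:

```python
def info_gain(rows, attr):
    """Gain(S, A): entropy of S minus the weighted entropy of each subset S_v."""
    labels = [r["class"] for r in rows]
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r["class"] for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, attrs):
    """Grow the tree top-down, always splitting on the highest-gain attribute."""
    labels = [r["class"] for r in rows]
    if len(set(labels)) == 1 or not attrs:         # pure node, or no attributes left
        return max(set(labels), key=labels.count)  # leaf: majority class
    best = max(attrs, key=lambda a: info_gain(rows, a))
    return {best: {v: id3([r for r in rows if r[best] == v],
                          [a for a in attrs if a != best])
                   for v in set(r[best] for r in rows)}}
```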
Over-fitting example
• Consider adding noisy training example #15:
– Sunny, hot, normal, strong, PlayTennis = No
• What effect on the earlier tree?
Over-fitting
Avoid over-fitting
• Stop growing when the data split is not statistically significant
• Grow the full tree, then post-prune
• How to select the best tree:
– Measure performance over the training data
– Measure performance over a separate validation dataset
– MDL: minimize size(tree) + size(misclassifications(tree))
Reduced-error pruning
• Split data into training and validation sets
• Do until further pruning is harmful:
– Evaluate the impact on the validation set of pruning each possible node
– Greedily remove the node that most improves validation set accuracy
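A sketch of this greedy loop; the tree representation and the accuracy, internal_nodes, and replace_with_majority_leaf helpers are hypothetical:

```python
def reduced_error_prune(tree, validation):
    """Greedily turn subtrees into leaves while validation accuracy does not drop."""
    while True:
        best_node, best_acc = None, accuracy(tree, validation)
        for node in internal_nodes(tree):              # each prunable subtree
            candidate = replace_with_majority_leaf(tree, node)
            acc = accuracy(candidate, validation)
            if acc >= best_acc:                        # pruning is not harmful
                best_node, best_acc = node, acc
        if best_node is None:                          # further pruning harmful: stop
            return tree
        tree = replace_with_majority_leaf(tree, best_node)
```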
Rule post-pruning
• Convert the tree to an equivalent set of rules
• Prune each rule independently of the others
• Sort the final rules into the desired sequence for use
Issues in Decision Tree Learning
• How deep to grow?
• How to handle continuous attributes?
• How to choose an appropriate attribute selection measure?
• How to handle data with missing attribute values?
• How to handle attributes with different costs?
• How to improve computational efficiency?
• ID3 has been extended to handle most of these. The resulting system is C4.5 (http://cis-linux1.temple.edu/~ingargio/cis587/readings/id3-c45.html)
Decision tree – When?
References
• Data mining, Nhat-Quang Nguyen, HUST
• http://www.cs.cmu.edu/~awm/10701/slides/DTreesAndOverfitting-9-13-05.pdf
RANDOM FORESTS
Credits: Michal Malohlava, 0xdata
Motivation
• Training sample of points covering the area [0,3] x [0,3]
• Two possible colors of points
• The model should be able to predict the color of a new point
Decision tree
How to grow a decision tree
• Split the rows in a given node into two sets with respect to an impurity measure
– The smaller the impurity, the more skewed the distribution
– Compare the impurity of the parent with the impurity of the children
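A sketch of one such split search on 2-D points, using Gini impurity as a concrete impurity measure (the slides leave the measure generic); all names and the candidate-threshold scheme are assumptions:

```python
def gini(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 when the distribution is fully skewed."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(points, labels):
    """Try each coordinate and threshold; keep the split whose children's
    weighted impurity improves most on the parent's impurity."""
    best_imp, best_rule = gini(labels), None
    for axis in (0, 1):                              # x and y of a 2-D point
        for t in sorted(set(p[axis] for p in points)):
            left = [l for p, l in zip(points, labels) if p[axis] <= t]
            right = [l for p, l in zip(points, labels) if p[axis] > t]
            if not left or not right:
                continue
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if w < best_imp:                         # children purer than parent
                best_imp, best_rule = w, (axis, t)
    return best_rule                                 # None means no useful split
```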
When to stop growing the tree
• Build the full tree, or
• Apply a stopping criterion – a limit on:
– Tree depth, or
– Minimum number of points in a leaf
How to assign a leaf value?
• The leaf value is:
– If the leaf contains only one point, its color represents the leaf value
– Else the majority color is picked, or the color distribution is stored
Decision tree
• The tree covers the whole area with rectangles, each predicting a point color
Decision tree scoring
• The model can predict a point's color based on its coordinates.
Over-fitting
• The tree perfectly represents the training data (0% training error), but it has also learned the noise!
• Hence it predicts new points poorly!
Handle over-fitting
• Pre-pruning via a stopping criterion!
• Post-pruning: decreases the complexity of the model but helps with model generalization
• Randomize tree building and combine the trees together
Randomize #1- Bagging
• Each tree sees only a sample of the training data and captures only a part of the information.
• Build multiple weak trees which vote together to give the resulting prediction
– Voting is based on a majority vote, or a weighted average
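A minimal bagging sketch with numpy bootstrap sampling and sklearn trees; parameter values and integer-coded labels are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=50, seed=0):
    """Fit each tree on its own bootstrap sample (drawn with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # sample of the training data
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict(trees, X):
    """Majority vote across the ensemble."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```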
Bagging - boundary
• Bagging averages many trees, and produces smoother decision boundaries.
Randomize #2 - Feature selection

Random forest
Random forest - properties
• Refinement of bagged trees; quite popular
• At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √p or log2(p), where p is the number of features
• For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the "out-of-bag" error rate.
• Random forests try to improve on bagging by "de-correlating" the trees. Each tree has the same expectation, so averaging less correlated trees reduces variance without changing the bias.
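In sklearn terms, a sketch (the parameter values and the X_train/y_train names are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier

# m = sqrt(p) features drawn at each split; OOB error monitored during fitting.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print("OOB accuracy:", rf.oob_score_)      # 1 - out-of-bag error rate
print("importances:", rf.feature_importances_)
```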
Advantages of Random Forest
• Independent trees which can be built in parallel
• The model does not overfit easily
• Produces reasonable accuracy
• Brings more ways to analyze the data: variable importance, proximities, missing value imputation
Out-of-bag points and validation
• Each tree is built over a sample of the training points.
• The remaining points are called "out-of-bag" (OOB).
• These points are used for validation, as a good approximation of the generalization error, almost identical to N-fold cross-validation.
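Following the bagging sketch above, the OOB error can be estimated by scoring each point only with the trees that did not see it. A sketch; the bookkeeping layout and integer-coded labels are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, n_trees=50, seed=0):
    """Vote for each point using only trees whose bootstrap sample missed it."""
    rng = np.random.default_rng(seed)
    votes = [[] for _ in range(len(X))]
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))     # bootstrap sample
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        oob = np.setdiff1d(np.arange(len(X)), idx)     # out-of-bag points
        for i, pred in zip(oob, tree.predict(X[oob])):
            votes[i].append(int(pred))
    wrong = sum(1 for i, v in enumerate(votes)
                if v and np.bincount(v).argmax() != y[i])
    return wrong / len(X)
```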