lecture6 - c4.5
Introduction to Machine Learning
Lecture 6
Albert Orriols i Puig
[email protected]
Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull
Recap of Lecture 4
ID3 is a strong system that
  Uses hill-climbing search based on the information gain measure to search through the space of decision trees
  Outputs a single hypothesis.
  Never backtracks. It converges to locally optimal solutions.
  Uses all training examples at each step, contrary to methods that make decisions incrementally.
  Uses statistical properties of all examples: the search is less sensitive to errors in individual training examples.
  Can handle noisy data by modifying its termination criterion to accept hypotheses that imperfectly fit the data.
Slide 2Artificial Intelligence Machine Learning
Recap of Lecture 4
However, ID3 has some drawbacks
It can only deal with nominal data
It is not able to deal with noisy data sets
It may not be robust in the presence of noise
Today’s Agenda
Going from ID3 to C4.5
How C4.5 enhances ID3 to
Be robust in the presence of noise. Avoid overfitting
Deal with continuous attributes
Deal with missing data
Convert trees to rules
What’s Overfitting?
Overfitting = Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h’ ∈ H, such that
1. h has a smaller error than h’ over the training examples, but
2. h’ has a smaller error than h over the entire distribution of instances.
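The two conditions can also be written compactly. The symbols error_train (error over the training examples) and error_D (error over the entire distribution of instances) are notation assumed here for illustration; they do not appear on the slide.

```latex
\[
h \in H \text{ overfits the training data}
\iff
\exists\, h' \in H :\;
\mathrm{error}_{\mathrm{train}}(h) < \mathrm{error}_{\mathrm{train}}(h')
\;\text{ and }\;
\mathrm{error}_{\mathcal{D}}(h') < \mathrm{error}_{\mathcal{D}}(h)
\]
```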
Why May my System Overfit?
In domains with noise or uncertainty, the system may try to decrease the training error by completely fitting all the training examples
The learner overfits to correctly classify the noisy instances
Occam’s razor: Prefer the simplest hypothesis that fits the data with high accuracy
How to Avoid Overfitting?
Ok, my system may overfit… Can I avoid it?
Sure! Do not include branches that fit data too specifically
How?
1. Pre-prune: Stop growing a branch when information becomes unreliable
2. Post-prune: Take a fully-grown decision tree and discard unreliable parts
Pre-pruning
Based on statistical significance test
Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node
Use all available data for training and apply the statistical test to estimate whether expanding/pruning a node is likely to produce an improvement beyond the training set
Most popular test: chi-squared test
ID3 used chi-squared test in addition to information gain
Only statistically significant attributes were allowed to be selected by information gain procedure
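The significance check can be sketched as follows. This is a minimal illustration, not ID3's actual code: the function names `chi_squared` and `should_expand` and the two example tables are hypothetical, and the caller is assumed to supply the critical value (3.841 is the standard chi-squared critical value for 1 degree of freedom at the 5% level).

```python
def chi_squared(table):
    """Chi-squared statistic for a contingency table.

    table[i][j] = number of instances with attribute value i and class j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if attribute and class were independent.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def should_expand(table, critical_value):
    """Pre-pruning rule: expand the node only if the association is significant."""
    return chi_squared(table) > critical_value


# Critical value for 1 degree of freedom at the 5% significance level.
CRITICAL_1DF_5PCT = 3.841

# Hypothetical counts: rows are attribute values, columns are classes.
informative = [[18, 2], [3, 17]]      # class depends strongly on the attribute
uninformative = [[10, 10], [9, 11]]   # class mix is nearly the same in both rows

print(should_expand(informative, CRITICAL_1DF_5PCT))
print(should_expand(uninformative, CRITICAL_1DF_5PCT))
```

With these made-up counts, only the first table passes the test, so only that attribute would remain a candidate for the information gain procedure.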
Pre-pruning
Early stopping: Pre-pruning may stop the growth process prematurely
Classic example: XOR/Parity-problem
No individual attribute exhibits any significant association to the class
Structure is only visible in fully expanded tree
Pre-pruning won’t expand the root node
But: XOR-type problems rare in practice
And: pre-pruning faster than post-pruning
     x1  x2  Class
1    0   0   0
2    0   1   1
3    1   0   1
4    1   1   0
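A short check makes the XOR problem concrete: on the four rows of the table, the information gain of x1 alone and of x2 alone is zero, so a gain-based pre-pruning test sees nothing worth splitting on at the root. The helper names `entropy` and `information_gain` are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute_index):
    """Gain of splitting rows (tuples ending in the class) on one attribute."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in set(row[attribute_index] for row in rows):
        subset = [row[-1] for row in rows if row[attribute_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


# (x1, x2, class) rows from the XOR table.
xor_rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
gain_x1 = information_gain(xor_rows, 0)
gain_x2 = information_gain(xor_rows, 1)
print(gain_x1, gain_x2)
```

Both gains are zero because each value of x1 (or x2) still leaves an even mix of the two classes; the structure only appears once both attributes are used together.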
Post-pruning
First, build the full tree
Then, prune it
Fully-grown tree shows all attribute interactions
Problem: some subtrees might be due to chance effects
Two pruning operations:
1. Subtree replacement
2. Subtree raising
Possible strategies:
  error estimation
  significance testing
  MDL principle
Subtree Replacement
Bottom-up approach
Consider replacing a tree after considering all its subtrees
Ex: labor negotiations
Subtree Replacement
Algorithm:
1. Split the data into training and validation set
2. Do until further pruning is harmful:
a. Evaluate impact on the validation set of pruning each possible node
b. Select the node whose removal most increases the validation set accuracy
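The loop in step 2 can be sketched as below. This is an illustrative sketch under several assumptions: the tree is already trained (step 1 is taken as done), each node stores the majority class of its training instances in `label`, instances are dicts, and pruning is allowed while validation accuracy does not drop. The `Node` class and the tiny example tree are hypothetical.

```python
class Node:
    def __init__(self, label, attribute=None, children=None):
        self.label = label              # majority class at this node
        self.attribute = attribute      # None for a leaf
        self.children = children or {}  # attribute value -> Node

    def is_leaf(self):
        return self.attribute is None


def predict(node, instance):
    while not node.is_leaf():
        child = node.children.get(instance[node.attribute])
        if child is None:
            break                       # unseen value: fall back to this node's label
        node = child
    return node.label


def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)


def internal_nodes(node):
    if node.is_leaf():
        return []
    found = [node]
    for child in node.children.values():
        found.extend(internal_nodes(child))
    return found


def prune(tree, validation):
    """Repeatedly replace by a leaf the node whose removal helps validation accuracy most."""
    current_acc = accuracy(tree, validation)
    while True:
        candidate, candidate_acc = None, None
        for node in internal_nodes(tree):
            saved = (node.attribute, node.children)
            node.attribute, node.children = None, {}   # tentatively prune
            acc = accuracy(tree, validation)
            node.attribute, node.children = saved      # undo
            if acc >= current_acc and (candidate is None or acc > candidate_acc):
                candidate, candidate_acc = node, acc
        if candidate is None:                          # every pruning would be harmful
            return tree
        candidate.attribute, candidate.children = None, {}
        current_acc = candidate_acc


# Hypothetical tree: the split on "b" under a=0 only fits noise.
tree = Node("yes", "a", {
    0: Node("yes", "b", {0: Node("yes"), 1: Node("no")}),
    1: Node("no"),
})
validation = [({"a": 0, "b": 0}, "yes"), ({"a": 0, "b": 1}, "yes"),
              ({"a": 1, "b": 0}, "no"), ({"a": 1, "b": 1}, "no")]
prune(tree, validation)
print(accuracy(tree, validation))
```

In this example the spurious split on `b` is replaced by a leaf, while the root split on `a` survives because removing it would lower validation accuracy.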
Subtree Raising
Delete node
Redistribute instances
Slower than subtree replacement
(Worthwhile?)
Estimating Error Rates
Ok, we can prune. But when?
Prune only if it reduces the estimated error
Error on the training data is NOT a useful estimator
Q: Why would it result in very little pruning?
Use hold-out set for pruning
  Separate a validation set
  Use this validation set to test the improvement
[Figure: the training data set is split into a reduced training set and a validation set]
C4.5’s method
  Derive confidence interval from training data
  Use a heuristic limit, derived from this, for pruning
  Standard Bernoulli-process-based method
  Shaky statistical assumptions (based on training data)
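One way to make the Bernoulli-based limit concrete is the upper confidence bound from the normal approximation, as it is commonly presented in textbook descriptions of C4.5. The slide does not give the formula, so both the formula and the default z = 0.69 (the deviate usually quoted for a 25% confidence level) are assumptions here, and `pessimistic_error` is a hypothetical name.

```python
from math import sqrt


def pessimistic_error(errors, n, z=0.69):
    """Upper confidence limit on the true error rate at a node.

    errors: training instances misclassified at the node
    n:      training instances reaching the node
    z:      normal deviate for the chosen confidence level (assumed default)
    """
    f = errors / n                      # observed training error rate
    z2 = z * z
    numerator = f + z2 / (2 * n) + z * sqrt(f / n - f * f / n + z2 / (4 * n * n))
    return numerator / (1 + z2 / n)


estimate = pessimistic_error(errors=2, n=6)
print(round(estimate, 2))
```

The estimate is always above the observed training error, and the gap grows as fewer instances reach the node. A subtree would then be pruned when the estimate for the replacing leaf is no worse than the combined estimate for its children.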
Deal with continuous attributes
When dealing with nominal data
  We evaluated the gain for each possible value
In continuous data, we have infinite values.
What should we do?
  Continuous-valued attributes may take infinite values, but we have a limited number of values in our instances (at most N if we have N instances)
  Therefore, simulate that you have N nominal values
  Evaluate information gain for every possible split point of the attribute
  Choose the best split point
  The information gain of the attribute is the information gain of the best split
Deal with continuous attributes
Example
Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …
Continuous attributes: Temperature and Humidity
Deal with continuous attributes
Split on temperature attribute:
64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No
E.g.: temperature < 71.5: yes/4, no/2
      temperature ≥ 71.5: yes/5, no/3
Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939 bits
Place split points halfway between values
Can evaluate all split points in one pass!
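The one-pass evaluation can be sketched by sorting the instances once and moving them from the right partition to the left while keeping running class counts. The function names `info` and `split_infos` are illustrative; the data are the temperature values and class labels from the slide.

```python
from collections import Counter
from math import log2

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
plays = ["yes", "no", "yes", "yes", "yes", "no", "no",
         "yes", "yes", "yes", "no", "yes", "yes", "no"]


def info(counts):
    """Entropy in bits of a class-count table."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values() if c)


def split_infos(values, labels):
    """Return {threshold: weighted info} for every halfway split point."""
    pairs = sorted(zip(values, labels))
    left, right = Counter(), Counter(label for _, label in pairs)
    n = len(pairs)
    result = {}
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1            # move one instance from right to left
        right[label] -= 1
        next_value = pairs[i + 1][0]
        if next_value == value:
            continue                # no split point between equal values
        threshold = (value + next_value) / 2
        result[threshold] = ((i + 1) / n) * info(left) + ((n - i - 1) / n) * info(right)
    return result


infos = split_infos(temps, plays)
best = min(infos, key=infos.get)    # lowest weighted info = highest gain
print(round(infos[71.5], 3), best)
```

The value stored for the threshold 71.5 reproduces the 0.939 bits computed on the slide. Since the entropy before the split is the same for every threshold, the split with the lowest weighted info is the one with the highest information gain.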
Deal with continuous attributes
To speed up
Entropy only needs to be evaluated between points of different classes
value  64   65   68   69   70   71   72   72   75   75   80   81   83   85
class  Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No
Potential optimal breakpoints: those between adjacent values of different classes
Breakpoints between values of the same class cannot be optimal
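A possible way to list only the useful breakpoints is shown below. The treatment of repeated values is an assumption of this sketch: a midpoint is kept unless all instances on both adjacent values share a single class, which is a conservative reading of the rule when a value such as 72 occurs with both classes. `candidate_breakpoints` is a hypothetical name.

```python
from itertools import groupby

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
plays = ["yes", "no", "yes", "yes", "yes", "no", "no",
         "yes", "yes", "yes", "no", "yes", "yes", "no"]


def candidate_breakpoints(values, labels):
    """Midpoints between adjacent distinct values that are not all of one class."""
    pairs = sorted(zip(values, labels))
    # One entry per distinct value, with the set of classes observed for it.
    groups = [(value, {label for _, label in group})
              for value, group in groupby(pairs, key=lambda pair: pair[0])]
    candidates = []
    for (low, low_classes), (high, high_classes) in zip(groups, groups[1:]):
        if len(low_classes | high_classes) > 1:
            candidates.append((low + high) / 2)
    return candidates


candidates = candidate_breakpoints(temps, plays)
print(candidates)
# [64.5, 66.5, 70.5, 71.5, 73.5, 77.5, 80.5, 84.0]
```

On this data 8 of the 11 possible midpoints remain, so three entropy evaluations are skipped.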
Deal with Missing Data
Treat missing values as a separate value
Missing value denoted “?” in C4.X
Simple idea: treat missing as a separate value
Q: When is this not appropriate?
A: When values are missing due to different reasons
  Example 1: gene expression could be missing when it is very high or very low
  Example 2: field IsPregnant=missing for a male patient should be treated differently (no) than for a female patient of age 25 (unknown)
Deal with Missing Data
Split instances with missing values into pieces
  A piece going down a branch receives a weight proportional to the popularity of the branch
  weights sum to 1
Info gain works with fractional instances
  Use sums of weights instead of counts
During classification, split the instance into pieces in the same way
  Merge probability distribution using weights
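The classification step can be sketched as follows. The `Leaf` and `Split` classes, the branch fractions and the leaf distributions are all hypothetical values chosen for illustration; in a real tree the fractions would come from the training weights that went down each branch.

```python
class Leaf:
    def __init__(self, distribution):
        self.distribution = distribution    # class -> probability


class Split:
    def __init__(self, attribute, branches):
        self.attribute = attribute
        # value -> (fraction of training weight sent down this branch, subtree)
        self.branches = branches


def classify(node, instance, weight=1.0):
    """Return class -> probability, splitting the instance when a value is missing."""
    if isinstance(node, Leaf):
        return {cls: weight * p for cls, p in node.distribution.items()}
    value = instance.get(node.attribute, "?")
    if value != "?":
        _, subtree = node.branches[value]
        return classify(subtree, instance, weight)
    # Missing value: send a weighted piece down every branch and merge the results.
    merged = {}
    for fraction, subtree in node.branches.values():
        for cls, p in classify(subtree, instance, weight * fraction).items():
            merged[cls] = merged.get(cls, 0.0) + p
    return merged


# Made-up tree: half of the training weight went down each branch.
tree = Split("outlook", {
    "sunny": (0.5, Leaf({"yes": 0.2, "no": 0.8})),
    "rainy": (0.5, Leaf({"yes": 1.0, "no": 0.0})),
})
result = classify(tree, {"outlook": "?"})
print(result)
```

For the instance with a missing outlook, the merged distribution gives "yes" a probability of 0.5 × 0.2 + 0.5 × 1.0 = 0.6 and "no" the remaining 0.4, so the pieces still sum to 1.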
From Trees to Rules
I finally got a tree from domains with
  Noisy instances
  Missing values
  Continuous attributes
But I prefer rules…
  Not context-dependent
Procedure
  Generate a rule for each path from the root to a leaf of the tree
  Get context-independent rules
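The basic procedure amounts to walking every root-to-leaf path and collecting the tests along the way. The sketch below assumes a simple nested representation (a leaf is a class label, an internal node is a pair of attribute and branches) and uses a small weather-style tree with nominal values purely as an illustration.

```python
def tree_to_rules(tree, conditions=()):
    """Yield (conditions, class) pairs, one per root-to-leaf path.

    A tree is either a class label (leaf) or a pair
    (attribute, {value: subtree}).
    """
    if not isinstance(tree, tuple):
        yield list(conditions), tree
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))


# Illustrative tree, not taken from the slides.
tree = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})

rules = list(tree_to_rules(tree))
for conditions, label in rules:
    body = " AND ".join(f"{a} = {v}" for a, v in conditions)
    print(f"IF {body} THEN play = {label}")
```

Each rule carries its full list of conditions, so it can be read and applied on its own, without reference to the position it had in the tree.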
From Trees to Rules
A procedure a little more sophisticated: C4.5Rules
C4.5rules: greedily prune conditions from each rule if this reduces its estimated error
  Can produce duplicate rules
  Check for this at the end
Then
  look at each class in turn
  consider the rules for that class
  find a “good” subset (guided by MDL)
Then rank the subsets to avoid conflicts
Finally, remove rules (greedily) if this decreases error on the training data
Next Class
Instance-based Classifiers