Decision Tree Learning
Chapter 3
• Decision tree representation
• ID3 learning algorithm
• Entropy, information gain
• Overfitting
Review example: Image Categorization (two phases)
[Figure: the two phases of image categorization. Training: training images + training labels → image features → classifier training → trained classifier. Testing: test image → image features → trained classifier → prediction (e.g. "Outdoor").]
Inductive Learning
• Learning a function from examples
Occam’s Razor: prefer the simplest hypothesis consistent with the data.
One of the most widely used inductive learning methods: Decision Tree Learning
Decision Tree Example
[Figure: a small decision tree testing Color (red / green / blue), Size (big / small), and Shape (round / square), with leaves labeled + and -]
+ : filled with blue, - : filled with red
• Each internal node corresponds to a test
• Each branch corresponds to a result of the test
• Each leaf node assigns a classification
PlayTennis: Training Examples
Each row is a sample; columns are the attributes plus the target attribute (PlayTennis)
A Decision Tree for the concept PlayTennis
[Tree: Outlook? → Sunny: Humidity? (High → No, Normal → Yes); Overcast → Yes; Rain: Wind? (Strong → No, Weak → Yes)]
• Finding the most suitable attribute for the root
• Finding redundant attributes, like Temperature
Converting a tree to rules
Decision trees can represent any Boolean function
If (O=Sunny AND H=Normal) OR (O=Overcast) OR (O=Rain AND W=Weak) then YES
• “A disjunction of conjunctions of constraints on attribute values”
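As a rough illustration of this rule conversion (the tree encoding and helper function below are made up for demonstration, not taken from the slides), each root-to-leaf path becomes one if-then rule:

```python
# Illustrative sketch: convert root-to-leaf paths of a decision tree into
# if-then rules. The tree is stored as nested dicts:
# {attribute: {value: subtree_or_leaf}}.

tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def tree_to_rules(node, conditions=()):
    """Recursively collect one rule per root-to-leaf path."""
    if not isinstance(node, dict):              # leaf: emit a rule
        lhs = " AND ".join(f"{a}={v}" for a, v in conditions) or "TRUE"
        return [f"IF {lhs} THEN {node}"]
    (attribute, branches), = node.items()       # exactly one test per node
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

if __name__ == "__main__":
    for rule in tree_to_rules(tree):
        print(rule)
```

Running the sketch prints one conjunction per path, e.g. "IF Outlook=Sunny AND Humidity=Normal THEN Yes", and the disjunction of the Yes-rules matches the rule shown above.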
Decision trees can represent any Boolean function
• In the worst case, they need exponentially many nodes
• XOR, as an extreme case
Decision trees and decision boundaries
[Figure: decision regions]
Decision Trees
• One of the most widely used and practical methods for inductive inference
• Approximates discrete-valued functions – can be extended to continuous valued functions
• Can be used for classification (most common) or regression problems
Decision Trees for Regression (Continuous Values)
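A minimal regression-tree sketch (assuming scikit-learn and NumPy are available; the data here is synthetic and purely illustrative):

```python
# Minimal regression-tree sketch (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)   # one continuous feature
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=80)    # noisy continuous target

# max_depth limits tree size; each leaf predicts the mean target of its region.
reg = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(reg.predict([[1.0], [4.0]]))
```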
Decision tree learning algorithm
• Learning process: finding the tree from the training set
• For a given training set, there are many trees that fit it without any error
• Finding the smallest tree is NP-complete (Quinlan 1986), hence we are forced to use some (local) search algorithm to find reasonable solutions
ID3: The basic decision tree learning algorithm
• Basic idea: A decision tree can be constructed by considering attributes of instances one by one.
– Which attribute should be considered first?
– The height of a decision tree depends on the order in which attributes are considered.
==> Entropy
Which Attribute is ”best”?
Split on A1: [29+, 35-] → True: [21+, 5-], False: [8+, 30-]
Split on A2: [29+, 35-] → True: [18+, 33-], False: [11+, 2-]
Entropy: large entropy => more information
Entropy
Entropy(S) = -p+ log2 p+ - p- log2 p-
• S is the set of training examples
• p+ is the proportion of positive examples
• p- is the proportion of negative examples
• Exercise: calculate the entropy in the following two cases (note that 0·log2 0 = 0):
– p+ = 0.5, p- = 0.5
– p+ = 1, p- = 0
Question: why the minus sign in the equation?
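A small illustrative sketch (not from the slides) that evaluates the entropy formula for the two exercise cases, using the convention 0·log2 0 = 0; a multi-class case is also included since the formula is generalized further below:

```python
# Illustrative entropy sketch, treating 0 * log2(0) as 0.
from math import log2

def entropy(probs):
    """Entropy(S) = -sum_i p_i * log2(p_i), skipping zero probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                # 1.0 -> maximum uncertainty
print(entropy([1.0, 0.0]))                # -0.0, i.e. zero bits: no uncertainty
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 -> multi-class (c = 4) example
```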
Entropy
A measure of uncertainty
Example:
– the probability of heads for a fair coin
– the same for a coin with two heads
Information theory
High entropy = high uncertainty = the phenomenon is more random = more information = more bits are needed to encode it
Entropy
• For multi-class problems with c categories, entropy generalizes to:
• Q: Why log2?
Entropy(S) = - Σ_{i=1}^{c} p_i log2(p_i)
Information Gain
• Gain(S,A): expected reduction in entropy due to sorting S on attribute A
• where Sv is the subset of S having value v for attribute A
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
Split on A1: [29+, 35-] → True: [21+, 5-], False: [8+, 30-]
Split on A2: [29+, 35-] → True: [18+, 33-], False: [11+, 2-]
Entropy(S) = ?, Gain(S,A1)=?, Gain(S,A2)=?
Entropy([29+,35-]) = -29/64 log2 29/64 – 35/64 log2 35/64 = 0.99
Information Gain
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
Entropy(S) = - Σ_{i=1}^{c} p_i log2(p_i)
Information Gain
Split on A1: [29+, 35-] → True: [21+, 5-], False: [8+, 30-]
Entropy([21+, 5-]) = 0.71
Entropy([8+, 30-]) = 0.74
Gain(S, A1) = Entropy(S) - (26/64)·Entropy([21+, 5-]) - (38/64)·Entropy([8+, 30-]) = 0.27

Split on A2: [29+, 35-] → True: [18+, 33-], False: [11+, 2-]
Entropy([18+, 33-]) = 0.94
Entropy([11+, 2-]) = 0.62
Gain(S, A2) = Entropy(S) - (51/64)·Entropy([18+, 33-]) - (13/64)·Entropy([11+, 2-]) = 0.12

Since Gain(S, A1) > Gain(S, A2), A1 is placed higher in the tree.
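A quick numeric check of this worked example (an illustrative sketch; the helper functions are not from the slides):

```python
# Illustrative check of the A1 / A2 worked example above.
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum(p * log2(p) for p in (pos / total, neg / total) if p > 0)

def gain(parent, children):
    """parent and each child are (pos, neg) counts."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q)
                                  for p, q in children)

S = (29, 35)
print(round(entropy(*S), 2))                   # 0.99
print(round(gain(S, [(21, 5), (8, 30)]), 2))   # A1: 0.27
print(round(gain(S, [(18, 33), (11, 2)]), 2))  # A2: 0.12
```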
ID3 for the PlayTennis example
ID3: The Basic Decision Tree Learning Algorithm
• What is the “best” attribute?
• [“best” = with the highest information gain]
• Answer: Outlook
ID3 (Cont’d)
[Figure: the training examples D1–D14 partitioned by Outlook into the Sunny, Overcast, and Rain branches]
What are the “best” next attributes? Humidity and Wind
PlayTennis Decision Tree
[Tree: Outlook? → Sunny: Humidity? (High → No, Normal → Yes); Overcast → Yes; Rain: Wind? (Strong → No, Weak → Yes)]
Stopping criteria
• Each leaf node contains examples of only one class, or
• the algorithm has run out of attributes
ID3
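A compact, illustrative ID3 sketch (assumed function names and a made-up toy dataset, not the slides' pseudocode): pick the attribute with the highest information gain, split on it, and recurse until a stopping criterion is met.

```python
# Illustrative ID3 sketch: pick the attribute with the highest information
# gain, split on it, and recurse until a stopping criterion is met.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                      # all examples of one class
        return labels[0]
    if not attrs:                                  # ran out of attributes
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in set(r[best] for r in rows):       # one branch per value
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attrs if a != best])
    return node

# Made-up toy examples, just to exercise the sketch:
rows = [{"Outlook": "Sunny", "Wind": "Weak"},
        {"Outlook": "Sunny", "Wind": "Strong"},
        {"Outlook": "Overcast", "Wind": "Weak"},
        {"Outlook": "Rain", "Wind": "Strong"}]
labels = ["No", "No", "Yes", "No"]
print(id3(rows, labels, ["Outlook", "Wind"]))      # nested-dict tree
```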
Overfitting in Decision Trees
• Why “over”-fitting?
– A model can become more complex than the true target function (concept) when it tries to satisfy noisy data as well.
[Plot: accuracy vs. hypothesis complexity, with one curve for accuracy on training data and one for accuracy on test data]
Overfitting in Decision Trees
Overfitting Example
[Plot: current (I) vs. voltage (V)]
Testing Ohm's Law: V = IR
Experimentally measure 10 points and fit a curve to the resulting data.
Perfect fit to the training data with a 9th-degree polynomial (n points can be fit exactly with an (n-1)-degree polynomial).
"Ohm was wrong, we have found a more accurate function!"
Overfitting Example
[Plot: current (I) vs. voltage (V)]
Testing Ohm's Law: V = IR
Better generalization with a linear function that fits the training data less accurately.
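An illustrative sketch of this example (the measurements are synthetic and assume NumPy; not the slides' actual data): a 9th-degree polynomial reproduces 10 noisy points exactly but extrapolates poorly, while a linear fit generalizes better.

```python
# Illustrative overfitting demo (made-up measurements, assumes NumPy).
import numpy as np

rng = np.random.default_rng(1)
R = 2.0                                      # "true" resistance
I = np.linspace(0.5, 5.0, 10)                # 10 measured currents
V = R * I + rng.normal(0, 0.3, size=10)      # noisy voltage readings, V = I*R

poly9 = np.polyfit(I, V, deg=9)              # passes through all 10 points
line = np.polyfit(I, V, deg=1)               # linear fit, close to Ohm's law

I_new = np.array([6.0])                      # a current outside the sample
print(np.polyval(poly9, I_new))              # typically far from the true line
print(np.polyval(line, I_new))               # roughly R * 6.0
```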
Avoiding overfitting the data
• How can we avoid overfitting? There are two approaches:
1. Early stopping: stop growing the tree before it perfectly classifies the training data
2. Pruning: grow full tree, then prune
• Reduced error pruning
• Rule post-pruning
– The pruning approach is found to be more useful in practice (a sketch of both approaches follows below).
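A hedged sketch of both approaches using scikit-learn (assumed to be available). Note that the pruning shown is cost-complexity pruning selected on a validation set, a stand-in for the reduced-error and rule post-pruning methods named above:

```python
# Sketch of the two approaches (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# 1. Early stopping: limit depth / leaf size while growing the tree.
early = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_tr, y_tr)

# 2. Pruning: grow a full tree, then prune, choosing the pruning strength
#    (ccp_alpha, cost-complexity pruning) with a validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best = max(path.ccp_alphas,
           key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
                         .fit(X_tr, y_tr).score(X_val, y_val))
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best).fit(X_tr, y_tr)
print(early.score(X_val, y_val), pruned.score(X_val, y_val))
```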
Other issues in Decision tree learning
• Incorporating continuous valued attributes
• Alternative measures for selecting attributes
• Handling training examples with missing attribute values
• Handling attributes with different costs
Strengths and Advantages of Decision Trees
• Rule extraction from trees
– A decision tree can be used for feature extraction (e.g. seeing which features are useful)
• Interpretability: human experts may verify and/or discover patterns
• It is a compact and fast classification method
Your Assignments
• HW1 is uploaded, Due date: 94/08/14
• Proposal: same due date
– One page maximum
– Include the following information:
• Project title
• Data set
• Project idea (approximately two paragraphs).
• Software you will need to write.
• Papers to read. Include 1-3 relevant papers.
• Teammate (if any) and work division. We expect projects done in a group to be more substantial than projects done individually.