
Page 1:


Decision Tree Learning

Chapter 3

• Decision tree representation
• ID3 learning algorithm
• Entropy, information gain
• Overfitting

Page 2:

Review example: Image Categorization (two phases)

[Diagram: Training phase: training images → image features, combined with training labels in classifier training to produce a trained classifier. Testing phase: test image → image features → trained classifier → prediction (e.g., “Outdoor”).]

Page 3:

Inductive Learning

• Learning a function from examples

• Occam’s Razor: prefer the simplest hypothesis consistent with the data
• One of the most widely used inductive learning methods: decision tree learning

Page 4:


Decision Tree Example

[Diagram: a small decision tree. The root tests Color (red / green / blue); lower nodes test Shape (square / round) and Size (big / small); each leaf is labeled + or −. Legend: + means filled with blue, − means filled with red.]

• Each internal node corresponds to a test
• Each branch corresponds to a result of the test
• Each leaf node assigns a classification
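To make the node/branch/leaf roles concrete, here is a minimal Python sketch of such a tree. The class names are our own, and the exact branch layout of the slide's figure is only partially recoverable, so the toy tree below is illustrative:

```python
class Leaf:
    def __init__(self, label):
        self.label = label                  # classification assigned at this leaf

class Node:
    def __init__(self, attribute, branches):
        self.attribute = attribute          # the test performed at this internal node
        self.branches = branches            # dict: test result -> subtree

def classify(tree, example):
    """Follow the branch matching the example's value at each tested attribute."""
    while isinstance(tree, Node):
        tree = tree.branches[example[tree.attribute]]
    return tree.label

# A guess at the slide's toy tree over Color / Shape / Size:
toy = Node("Color", {
    "blue": Leaf("+"),
    "red":  Node("Size", {"big": Leaf("+"), "small": Leaf("-")}),
    "green": Node("Shape", {
        "round":  Leaf("+"),
        "square": Node("Size", {"big": Leaf("-"), "small": Leaf("+")}),
    }),
})
print(classify(toy, {"Color": "red", "Size": "big"}))   # -> +
```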

Page 5:

PlayTennis: Training Examples

Each row is one sample; the columns are the attributes and the target attribute PlayTennis. (The slide shows the standard 14-example table from Mitchell, Table 3.2:)

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Page 6:

A Decision Tree for the concept PlayTennis

[Diagram: the root tests Outlook?. Sunny branch → Humidity? (High → No, Normal → Yes); Overcast branch → Yes; Rain branch → Wind? (Strong → No, Weak → Yes); leaf labels as in the rule on Page 8.]

• Finding the most suitable attribute for the root
• Finding redundant attributes, such as Temperature

Page 7:

Converting a tree to rules

Page 8:

Decision trees can represent any Boolean function

If (O=Sunny AND H=Normal) OR (O=Overcast) OR (O=Rain AND W=Weak) then YES

• “A disjunction of conjunctions of constraints on attribute values”
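A sketch of the tree-to-rules idea from the previous slide: each root-to-leaf path becomes one conjunctive rule, and together the rules form exactly such a disjunction of conjunctions. The nested-tuple encoding and the helper name are our own:

```python
# PlayTennis tree as (attribute, {value: subtree}); a bare string is a leaf.
tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def tree_to_rules(tree, conds=()):
    if isinstance(tree, str):                       # leaf: emit one rule
        lhs = " AND ".join(f"{a}={v}" for a, v in conds)
        return [f"IF {lhs} THEN PlayTennis={tree}"]
    attr, branches = tree                           # internal node: recurse
    return [rule for v, sub in branches.items()
                 for rule in tree_to_rules(sub, conds + ((attr, v),))]

for rule in tree_to_rules(tree):
    print(rule)   # the three YES rules match the disjunction above
```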

Page 9:

Decision trees can represent any Boolean function

• In the worst case it needs exponentially many nodes
• XOR is an extreme case (see the sketch below)
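A quick illustration of why XOR is the extreme case (a sketch; the tuple encoding is ours): splitting the XOR truth table on either input alone leaves every branch with a 50/50 class mix, so the tree must keep testing inputs all the way down, giving 2^n leaves for n-input parity.

```python
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # x1 XOR x2

for i in (0, 1):                     # try splitting on x1, then on x2
    for v in (0, 1):
        labels = [y for x, y in data if x[i] == v]
        print(f"x{i+1}={v}: labels {labels}")   # always one 0 and one 1
```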

Page 10:

Decision tree, decision boundaries

Page 11:

Decision Regions

Page 12:

Decision Trees

• One of the most widely used and practical methods for inductive inference

• Approximates discrete-valued functions – can be extended to continuous-valued functions

• Can be used for classification (most common) or regression problems

Page 13:

Decision Trees for Regression (continuous values)

Page 14:

Decision tree learning algorithm

• Learning process: finding the tree from the training set

• For a given training set, there are many trees that encode it without any error

• Finding the smallest tree is NP-complete (Quinlan 1986), so we are forced to use a (local) search algorithm to find reasonable solutions

Page 15:

ID3: The basic decision tree learning algorithm

• Basic idea: A decision tree can be constructed by considering attributes of instances one by one.

– Which attribute should be considered first?

– The height of a decision tree depends on the order in which attributes are considered.

==> Entropy

Page 16:


Which Attribute is “best”?

[Diagram: two candidate splits of the same 64 examples [29+, 35−]:
A1=? splits them into True → [21+, 5−], False → [8+, 30−]
A2=? splits them into True → [18+, 33−], False → [11+, 2−]]

Entropy: larger entropy ⇒ more information

Page 17:


Entropy

Entropy(S) = -p+ log2 p+ - p- log2 p-

• S is the set of training examples
• p+ is the proportion of positive examples
• p- is the proportion of negative examples

• Exercise: calculate the entropy in two cases (note that 0·log2 0 = 0):
– p+ = 0.5, p- = 0.5
– p+ = 1, p- = 0

Question: Why (-) in the equation?
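A quick check of the exercise (a sketch; it uses the stated convention 0·log2 0 = 0):

```python
from math import log2

def entropy(p_pos):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-, with 0*log2(0) taken as 0."""
    return sum(-p * log2(p) for p in (p_pos, 1 - p_pos) if p > 0)

print(entropy(0.5))  # 1.0 -> maximum uncertainty
print(entropy(1.0))  # 0.0 -> no uncertainty at all
```

The minus signs answer the question above: since log2 p ≤ 0 for 0 < p ≤ 1, they make the entropy non-negative.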

Page 18:

Entropy

A measure of uncertainty

Example:
– probability of heads for a fair coin
– the same for a coin with two heads

Information theory

High entropy = high uncertainty = the phenomenon is more random = more information = more bits are needed to encode it

Page 19:

Entropy

• For multi-class problems with c categories, entropy generalizes to:

• Q: Why log2?

Entropy(S) = − Σ_{i=1}^{c} p_i log2 p_i

Page 20:


Information Gain

• Gain(S,A): expected reduction in entropy due to sorting S on attribute A

• where Sv is the subset of S having value v for attribute A

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
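A direct transcription of this formula as Python (a sketch; the data layout, a list of (attribute-dict, label) pairs, and the helper names are our own):

```python
from collections import Counter
from math import log2

def entropy(examples):
    """Multi-class entropy of a list of (attributes, label) pairs."""
    counts = Counter(label for _, label in examples)
    return sum(-c / len(examples) * log2(c / len(examples))
               for c in counts.values())

def gain(examples, attribute):
    """Gain(S, A): entropy(S) minus the size-weighted entropy of each S_v."""
    remainder = 0.0
    for v in {x[attribute] for x, _ in examples}:
        s_v = [(x, y) for x, y in examples if x[attribute] == v]
        remainder += len(s_v) / len(examples) * entropy(s_v)
    return entropy(examples) - remainder
```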

Page 21:


[Diagram: the same two candidate splits of [29+, 35−]:
A1=? splits them into True → [21+, 5−], False → [8+, 30−]
A2=? splits them into True → [18+, 33−], False → [11+, 2−]]

Entropy(S) = ?, Gain(S,A1)=?, Gain(S,A2)=?

Entropy([29+,35-]) = −29/64 · log2(29/64) − 35/64 · log2(35/64) = 0.99

Information Gain

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

Entropy(S) = − Σ_{i=1}^{c} p_i log2 p_i

Page 22:

Information Gain

[Diagram: A1=? splits [29+, 35−] into True → [21+, 5−] and False → [8+, 30−].]

Entropy([21+,5-]) = 0.71
Entropy([8+,30-]) = 0.74
Gain(S,A1) = Entropy(S) − 26/64 · Entropy([21+,5-]) − 38/64 · Entropy([8+,30-]) = 0.27

[Diagram: A2=? splits [29+, 35−] into True → [18+, 33−] and False → [11+, 2−].]

Entropy([18+,33-]) = 0.94
Entropy([11+,2-]) = 0.62
Gain(S,A2) = Entropy(S) − 51/64 · Entropy([18+,33-]) − 13/64 · Entropy([11+,2-]) = 0.12

A1 yields the larger gain, so A1 goes higher in the tree.

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
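Reproducing the numbers above as a sanity check (a sketch, not from the slides):

```python
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return sum(-p * log2(p) for p in (pos / total, neg / total) if p > 0)

H = entropy(29, 35)                                        # ~0.99
gain_a1 = H - 26/64 * entropy(21, 5) - 38/64 * entropy(8, 30)
gain_a2 = H - 51/64 * entropy(18, 33) - 13/64 * entropy(11, 2)
print(round(gain_a1, 2), round(gain_a2, 2))                # 0.27 0.12
```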

Page 23:

ID3 for the PlayTennis example

Page 24:

ID3: The Basic Decision Tree Learning Algorithm

• What is the “best” attribute?
• [“best” = the attribute with the highest information gain]

• Answer: Outlook
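Checking “best = Outlook” on the PlayTennis table (a sketch; the table is the standard one from Mitchell's Chapter 3, and the helper names are ours):

```python
from collections import Counter
from math import log2

COLS = ["Outlook", "Temperature", "Humidity", "Wind"]
ROWS = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis)
    ("Sunny","Hot","High","Weak","No"),         ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),     ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),      ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),     ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),   ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),   ("Rain","Mild","High","Strong","No"),
]

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return sum(-c / len(rows) * log2(c / len(rows)) for c in counts.values())

def gain(rows, col):
    i = COLS.index(col)
    rem = 0.0
    for v in {r[i] for r in rows}:
        subset = [r for r in rows if r[i] == v]
        rem += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - rem

for col in COLS:
    # Outlook ~0.247 is the highest (Humidity ~0.152, Wind ~0.048, Temp ~0.029)
    print(col, round(gain(ROWS, col), 3))
```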

Page 25:


ID3 (Cont’d)

[Diagram: Outlook at the root partitions the 14 examples. Sunny: {D1, D2, D8, D9, D11}; Overcast: {D3, D7, D12, D13}; Rain: {D4, D5, D6, D10, D14}.]

What are the “best” next attributes? Humidity (for the Sunny branch) and Wind (for the Rain branch).

Page 26:

PlayTennis Decision Tree

[Diagram: the final PlayTennis tree. Outlook? at the root; Sunny → Humidity? (High → No, Normal → Yes); Overcast → Yes; Rain → Wind? (Strong → No, Weak → Yes).]

Page 27:

Stopping criteria

• each leaf node contains examples of one type, or

• the algorithm has run out of attributes

Page 28:

ID3
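The slide shows the full ID3 pseudocode (Mitchell, Table 3.1; not recoverable from this transcript). Below is a minimal self-contained Python sketch of the same loop: pick the highest-gain attribute, split, recurse, and stop by the criteria on the previous slide. The data layout and names are our own:

```python
from collections import Counter
from math import log2

def entropy(examples):
    counts = Counter(y for _, y in examples)
    return sum(-c / len(examples) * log2(c / len(examples))
               for c in counts.values())

def gain(examples, a):
    rem = 0.0
    for v in {x[a] for x, _ in examples}:
        s_v = [(x, y) for x, y in examples if x[a] == v]
        rem += len(s_v) / len(examples) * entropy(s_v)
    return entropy(examples) - rem

def id3(examples, attributes):
    """examples: list of (attribute-dict, label); returns a leaf label
    or a (best_attribute, {value: subtree}) node."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:            # stop: all examples of one type
        return labels[0]
    if not attributes:                   # stop: ran out of attributes
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))
    rest = [a for a in attributes if a != best]
    return (best, {v: id3([(x, y) for x, y in examples if x[best] == v], rest)
                   for v in {x[best] for x, _ in examples}})
```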

Page 29:

Overfitting in Decision Trees

• Why “over”-fitting?
– A model can become more complex than the true target function (concept) when it tries to fit noisy data as well.

[Plot: accuracy vs. hypothesis complexity. Accuracy on the training data keeps increasing with complexity, while accuracy on the test data peaks and then declines.]

Page 30:

Overfitting in Decision Trees

Page 31:

Overfitting Example


[Plot: current (I) vs. voltage (V); ten measured points, fit exactly by a 9th-degree polynomial.]

Testing Ohm's Law: V = IR

• Experimentally measure 10 points.
• Fit a curve to the resulting data.
• A 9th-degree polynomial fits the training data perfectly (a degree n−1 polynomial can fit n points exactly).
• "Ohm was wrong, we have found a more accurate function!"

Page 32:

Overfitting Example


[Plot: current (I) vs. voltage (V); the same ten points with a straight-line fit.]

Testing Ohm's Law: V = IR

Better generalization is obtained with a linear function that fits the training data less accurately.
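A runnable sketch of the Ohm's-law story (the resistance value, noise level, and ranges below are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
R = 2.0                                        # the true "law": V = I * R
I_train = np.linspace(0.1, 1.0, 10)            # experimentally measure 10 points
V_train = R * I_train + rng.normal(0, 0.05, 10)

line  = np.polyfit(I_train, V_train, 1)        # linear fit: small training error
poly9 = np.polyfit(I_train, V_train, 9)        # degree 9: near-zero training error
                                               # (numpy may warn about conditioning)
I_test = np.linspace(0.05, 1.05, 50)           # fresh points, slight extrapolation
V_test = R * I_test
for name, coeffs in (("degree 1", line), ("degree 9", poly9)):
    mse = np.mean((np.polyval(coeffs, I_test) - V_test) ** 2)
    print(name, "test MSE:", mse)              # degree 9 generalizes far worse
```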

Page 33:

Avoiding overfitting the data

• How can we avoid overfitting? There are two approaches:

1. Early stopping: stop growing the tree before it perfectly classifies the training data

2. Pruning: grow full tree, then prune

• Reduced error pruning

• Rule post-pruning

– The pruning approach has been found more useful in practice (a sketch follows).
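A minimal sketch of reduced-error pruning, the first method listed above (the tree encoding and helper names are ours): bottom-up, replace a subtree by its majority-class leaf whenever the validation examples reaching that node are classified at least as well.

```python
from collections import Counter

def classify(tree, x):
    while isinstance(tree, tuple):             # (attribute, {value: subtree})
        attr, branches = tree
        tree = branches.get(x[attr])
    return tree                                # a leaf label (or None if unseen)

def accuracy(tree, examples):
    return sum(classify(tree, x) == y for x, y in examples) / len(examples)

def prune(tree, train, val):
    if not isinstance(tree, tuple):
        return tree
    attr, branches = tree
    pruned = (attr, {v: prune(sub,
                              [(x, y) for x, y in train if x[attr] == v],
                              [(x, y) for x, y in val if x[attr] == v])
                     for v, sub in branches.items()})
    if not val:                                # no validation evidence here
        return pruned
    leaf = Counter(y for _, y in train).most_common(1)[0][0]
    # keep the majority leaf if it does at least as well on validation data
    return leaf if accuracy(leaf, val) >= accuracy(pruned, val) else pruned
```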

Page 34:

Other Issues in Decision Tree Learning

• Incorporating continuous valued attributes

• Alternative measures for selecting attributes

• Handling training examples with missing attribute values

• Handling attributes with different costs

Page 35:

Strengths and Advantages of Decision Trees

• Rule extraction from trees
– A decision tree can be used for feature extraction (e.g., seeing which features are useful)

• Interpretability: human experts may verify and/or discover patterns

• It is a compact and fast classification method

Page 36:

Your Assignments

• HW1 is uploaded, Due date: 94/08/14

• Proposal: same due date
– One page maximum
– Include the following information:
• Project title
• Data set
• Project idea (approximately two paragraphs)
• Software you will need to write
• Papers to read (include 1-3 relevant papers)
• Teammate (if any) and work division. We expect projects done in a group to be more substantial than projects done individually.