1 ch 3. decision tree learning decision trees basic learning algorithm (id3) entropy, information...

1

Ch 3. Decision Tree Learning

Decision trees Basic learning algorithm (ID3)

Entropy, information gain Hypothesis space Inductive bias

Occam’s razor in general Overfitting problem & extensions

post-pruning real values, missing values, attribute costs, …

2

Decision Tree for PlayTennis

Outlook

Sunny Overcast Rain

Humidity

High Normal

Wind

Strong Weak

No Yes

Yes

YesNo

3

Training Examples

Day Outlook Temp. Humidity Wind Play TennisD1 Sunny Hot High Weak NoD2 Sunny Hot High Strong NoD3 Overcast Hot High Weak YesD4 Rain Mild High Weak YesD5 Rain Cool Normal Weak YesD6 Rain Cool Normal Strong NoD7 Overcast Cool Normal Weak YesD8 Sunny Mild High Weak NoD9 Sunny Cold Normal Weak YesD10 Rain Mild Normal Strong YesD11 Sunny Mild Normal Strong YesD12 Overcast Mild High Strong YesD13 Overcast Hot Normal Weak YesD14 Rain Mild High Strong No

4

Decision Tree Learning

Classify instances sorting them down the tree to a leaf node

containing the class (value) based on attributes of instances branch for each value

Widely used, practical method of approximating discrete-valued functions

Robust to noisy data !! Capable of learning disjunctive expressions Typical bias: prefer smaller trees

5


Outlook

Sunny Overcast Rain

Humidity

High Normal

No Yes

Each internal node tests an attribute

Each branch corresponds to anattribute value node

Each leaf node assigns a classification

6

No


Outlook

Sunny Overcast Rain

Humidity

High Normal

Wind

Strong Weak

No Yes

Yes

YesNo

Outlook Temperature Humidity Wind PlayTennis Sunny Hot High Weak ?

7

Decision Tree for Conjunction

Outlook

Sunny Overcast Rain

Wind

Strong Weak

No Yes

No

Outlook=Sunny Wind=Weak

No

8

Decision Tree for Disjunction

Outlook

Sunny Overcast Rain

Yes

Outlook=Sunny Wind=Weak

Wind

Strong Weak

No Yes

Wind

Strong Weak

No Yes

9

Decision Tree for XOR

Outlook

Sunny Overcast Rain

Wind

Strong Weak

Yes No

Outlook=Sunny XOR Wind=Weak

Wind

Strong Weak

No Yes

Wind

Strong Weak

No Yes

10

Decision Tree

Outlook

Sunny Overcast Rain

Humidity

High Normal

Wind

Strong Weak

No Yes

Yes

YesNo

• decision trees represent disjunctions of conjunctions

(Outlook=Sunny Humidity=Normal) (Outlook=Overcast) (Outlook=Rain Wind=Weak)

11

When to use Decision Trees?

Instances presented as attribute-value pairs Target function has discrete values

classification problems Disjunctive descriptions required

disjunction of conjunctions of constraints on attribute values of instances

Training data may contain errors missing attribute values

Examples: Medical diagnosis Credit risk analysis Object classification for robot manipulator

12

Top-Down Induction of

Decision Trees - ID3

1. A the “best” decision attribute for next node

2. Assign A as decision attribute for node

3. For each value of A create new descendant

4. Sort training examples to leaf nodes according to the attribute value of the branch

5. If all training examples are perfectly classified (same value of target attribute) stop, else iterate over new leaf nodes.

13

Which Attribute is ”best”?

A1=?

True False

[21+, 5-] [8+, 30-]

[29+,35-] A2=?

True False

[18+, 33-] [11+, 2-]

[29+,35-]

14

Entropy

S is a sample of training examples p+ is the proportion of positive examples

p- is the proportion of negative examples

Entropy measures the impurity of S

Entropy(S) = -p+ log2 p+ - p- log2 p-

15

Entropy

Entropy(S)= expected number of bits needed to encode class (+ or -) of randomly drawn members of S (under the optimal, shortest length-code)

Why? Information theory optimal length code assign

–log2 p bits to messages having probability p. So the expected number of bits to encode (+ or -) of random member of S:

-p+ log2 p+ - p- log2 p-

16

Information Gain

Gain(S,A): expected reduction in entropy due to sorting S on attribute A

A1=?

True False

[21+, 5-] [8+, 30-]

[29+,35-] A2=?

True False

[18+, 33-] [11+, 2-]

[29+,35-]

Gain(S,A)=Entropy(S) - vvalues(A) |Sv|/|S| Entropy(Sv)

Entropy([29+,35-]) = -29/64 log2 29/64 – 35/64 log2 35/64 = 0.99

17

Information Gain

A1=?

True False

[21+, 5-] [8+, 30-]

[29+,35-]

Entropy([21+,5-]) = 0.71Entropy([8+,30-]) = 0.74Gain(S,A1)=Entropy(S) -26/64*Entropy([21+,5-])

-38/64*Entropy([8+,30-])

=0.27

Entropy([18+,33-]) = 0.94Entropy([8+,30-]) = 0.62Gain(S,A2)=Entropy(S) -51/64*Entropy([18+,33-]) -13/64*Entropy([11+,2-])

=0.12

A2=?

True False

[18+, 33-] [11+, 2-]

[29+,35-]

18

Training Examples

Day Outlook Temp. Humidity Wind Play TennisD1 Sunny Hot High Weak NoD2 Sunny Hot High Strong NoD3 Overcast Hot High Weak YesD4 Rain Mild High Weak YesD5 Rain Cool Normal Weak YesD6 Rain Cool Normal Strong NoD7 Overcast Cool Normal Weak YesD8 Sunny Mild High Weak NoD9 Sunny Cold Normal Weak YesD10 Rain Mild Normal Strong YesD11 Sunny Mild Normal Strong YesD12 Overcast Mild High Strong YesD13 Overcast Hot Normal Weak YesD14 Rain Mild High Strong No

19

Selecting the Best Attribute

Humidity

High Normal

[3+, 4-] [6+, 1-]

S=[9+,5-]E=0.940

Gain(S,Humidity)=0.940-(7/14)*0.985 – (7/14)*0.592=0.151

E=0.985 E=0.592

Wind

Weak Strong

[6+, 2-] [3+, 3-]

S=[9+,5-]E=0.940

E=0.811 E=1.0

Gain(S,Wind)=0.940-(8/14)*0.811 – (6/14)*1.0=0.048

20

Selecting the Best Attribute

Outlook

Sunny Rain

[2+, 3-] [3+, 2-]

S=[9+,5-]E=0.940

Gain(S,Outlook)=0.940-(5/14)*0.971 -(4/14)*0.0 – (5/14)*0.971=0.247

E=0.971

E=0.971

Overcast

[4+, 0]

E=0.0

21

ID3 Algorithm

Outlook

Sunny Overcast Rain

Yes

[D1,D2,…,D14] [9+,5-]

Ssunny=[D1,D2,D8,D9,D11] [2+,3-]

? ?

[D3,D7,D12,D13] [4+,0-]

[D4,D5,D6,D10,D14] [3+,2-]

Gain(Ssunny , Humidity)=0.970-(3/5)0.0 – 2/5(0.0) = 0.970Gain(Ssunny , Temp.)=0.970-(2/5)0.0 –2/5(1.0)-(1/5)0.0 = 0.570Gain(Ssunny , Wind)=0.970= -(2/5)1.0 – 3/5(0.918) = 0.019

22

ID3 Algorithm

Outlook

Sunny Overcast Rain

Humidity

High Normal

Wind

Strong Weak

No Yes

Yes

YesNo

[D3,D7,D12,D13]

[D8,D9,D11] [D6,D14]

[D1,D2] [D4,D5,D10]

23

Hypothesis Space Search ID3

+ - +

+ - +

A1

- - ++ - +

A2

+ - -

+ - +

A2

-

A4+ -

A2

-

A3- +

24

Hypothesis Space Search ID3

Hypothesis space is complete! Target function surely in there…

Outputs a single hypothesis No backtracking on selected attributes (greedy

search) Local minimal (suboptimal splits)

Statistically-based search choices Robust to noisy data

Inductive bias (search bias) Prefer shorter trees over longer ones Place high info. gain attributes close to the root

25

Inductive Bias in ID3

H is the power set of instances X Unbiased ?

Preference for short trees, and for those with high information gain attributes near the root

Bias is a preference for some hypotheses, rather than a restriction of the hypothesis space H

Compare bias to C-E ID3: complete space, incomplete search --> bias from

search strategy C-E: incomplete space, complete search --> bias from

expressive power of H

26

Occam’s Razor

Why prefer short hypotheses? Argument in favor:

Fewer short hypotheses than long hypotheses A short hypothesis that fits the data is unlikely to be a

coincidence A long hypothesis that fits the data might be a

coincidence Argument against:

There are many ways to define small sets of hypotheses

E.g. All trees with a prime number of nodes that use attributes beginning with ”Z”

What is so special about small sets based on size of hypothesis???

27

Overfitting - definition

Consider error of hypothesis h over Training data: errortrain(h) Entire distribution D of data: errorD(h)

Hypothesis hH overfits training data if there is an alternative hypothesis h’H such that

errortrain(h) < errortrain(h’)and

errorD(h) > errorD(h’) Possible reasons: noise and small samples

coincidences possible with small samples noisy data creates a large tree h h’ not fitting it is likely to work better

28

Overfitting in Decision Trees

error

#nodes

validation set

training set

optimal#nodes

29

Avoid Overfitting

How can we avoid overfitting? Stop growing when data split not

statistically significant Grow full tree then post-prune

has been found more successful

Minimum description length (MDL):

Minimize:

size(tree) + size(misclassifications(tree))

30

How to decide tree size?

What criterion to use separate test set to evaluate the use of

pruning use all data, apply statistical test to estimate if

expanding/pruning is likely to produce improvement

use an explicit complexity measure (coding length of data & tree), stop growth when minimized

31

Training/validation sets

Available data split training set: apply learning to this validation set: evaluate result

accuracy impact of pruning ‘safety check’ against overfit

common strategy: 2/3 for training

32

Reduced error pruning

Pruning make an inner node a leaf node assign it the most common class

ProcedureSplit data into training and validation setDo until further pruning is harmful:1. Evaluate impact on validation set of pruning each

possible node (plus those below it)2. Greedily remove the one that most improves the

validation set accuracy Produces smallest version of most accurate

subtreee

33

Rule Post-Pruning

Procedure (C4.5 uses a variant) infer DT as usual (allow overfit) convert tree to rules (one per path) prune each rule independently

remove preconditions if result is more accurate sort rules by estimated accuracy into a

desired sequence to use apply rules in this order in classification

34

Rule post-pruning…

Estimating the accuracy separate validation set training data & pessimistic estimates

data is too favorable for the rules compute accuracy & standard deviation take lower bound from given confidence level

(e.g. 95%) as the measure very close to observed one for large sets not statistically valid but works

35

Why convert to rules?

Distinguishing different contexts in which a node is used separate pruning decision for each path

No difference for root/inner no bookkeeping on how to reorganize tree if

root node is pruned

Improves readability

36

Converting a Tree to Rules

Outlook

Sunny Overcast Rain

Humidity

High Normal

Wind

Strong Weak

No Yes

Yes

YesNo

R1: If (Outlook=Sunny) (Humidity=High) Then PlayTennis=No R2: If (Outlook=Sunny) (Humidity=Normal) Then PlayTennis=YesR3: If (Outlook=Overcast) Then PlayTennis=Yes R4: If (Outlook=Rain) (Wind=Strong) Then PlayTennis=NoR5: If (Outlook=Rain) (Wind=Weak) Then PlayTennis=Yes

37

Continuous values

Define new discrete-valued attr partition the continuous value into a discrete set of

intervals Ac = true iff A < c How to select best c? (inf. gain)

Example case sort examples by continuous value identify borderlines

Temperature 150C 180C 190C 220C 240C 270C

PlayTennis No No Yes Yes Yes No

38

Continuous values…

Fact value maximizing inf. gain lies on such

boundary

Evaluation compute gain for each boundary

Extensions multiple values LTUs based on many attributes

39

Alternative selection measures

Information gain measure favors attributes with many values separates data into small subsets high gain, poor prediction E.g.: Imagine using Date=xx.yy.zz as attribute

perfectly splits the data into subsets of size 1

Use gain ratio instead of information gain!! penalize gain with split information sensitive to how broadly & uniformly attribute splits

data Entropy of S with respect to values of A

earlier: entropy of S w.r.t target values

40

Split information

GainRatio(S,A) = Gain(S,A) / SplitInfo (S,A) SplitInfo(S,A) = -i=1..c |Si|/|S| log2 |Si|/|S|

Where Si is the subset for which attribute A has the value vi

Discourages selection of attr with many uniformly distr. Values n values: log n, boolean: 1

Better to apply heuristics to select attributes compute Gain first compute GR only when Gain large enough (above

average)

41

Another alternative

Distance-based measure define a metric between partitions of the data evaluate attributes: distance between created

& perfect partition choose the attribute with closest one

Shown not biased towards attr. with large value sets

Many alternatives like this in the literature.

42

Missing values

Estimate value other examples with known value

Compute Gain(S,A), A(x) unknown assign most common value in S most common with class c(x) assign probability for each value, distribute

fractionals of x down

Similar techniques in classification

43

Attributes with differing costs

Measuring attribute costs something prefer cheap ones if possible use costly ones only if good gain introduce cost term in selection measure no guarantee in finding optimum, but give

bias towards cheapest Replace Gain by : Gain2(S,A)/Cost(A) Example applications

robot & sonar: time required to position medical diagnosis: cost of a laboratory test

44

Summary

Decision-tree induction is a popular approach to Decision-tree induction is a popular approach to classification that enables us to interpret the output classification that enables us to interpret the output hypothesishypothesis

Practical learning method discrete-valued functions ID3: top-down greedy algorithm

Complete hypothesis space Preference bias - shorter trees preferred.shorter trees preferred. Overfitting is an important issue Overfitting is an important issue

Reduced error pruningReduced error pruning Rule post-pruningRule post-pruning

Techniques exist to deal with continuous attributes and Techniques exist to deal with continuous attributes and missing attribute values.missing attribute values.

1 ch 3. decision tree learning decision trees basic learning algorithm (id3) entropy, information...

Documents