Chapter 6 Decision Trees


Page 1

Chapter 6: Decision Trees

Page 2

An Example

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

[Decision tree for the table above: the root tests Outlook. sunny → test Humidity (high: N, normal: P); overcast → P; rain → test Windy (true: N, false: P).]

Page 3

Another Example - Grades

Percent >= 90%?
  Yes → Grade = A
  No → 89% >= Percent >= 80%?
    Yes → Grade = B
    No → 79% >= Percent >= 70%?
      Yes → Grade = C
      No → etc.
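
Read top to bottom, this chain of threshold tests is just a nested conditional. A minimal sketch in Python (the function name and the behaviour below the slide's "Etc..." are illustrative assumptions):

```python
def grade(percent: float) -> str:
    """Walk the decision chain from the slide, top to bottom."""
    if percent >= 90:
        return "A"
    elif percent >= 80:
        return "B"
    elif percent >= 70:
        return "C"
    else:
        return "..."  # the slide stops at "Etc."; further cutoffs would go here

print(grade(85))  # -> "B"
```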

Page 4

Yet Another Example (1 of 2)

Page 5

Yet Another Example (2 of 2)

• English Rules (for example):

If tear production rate = reduced then recommendation = none.
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft.
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft.
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none.
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft.
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard.
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard.
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none.
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none.

Page 6

Decision Tree Template

• Drawn top-to-bottom or left-to-right

• Top (or left-most) node = Root Node

• Descendent node(s) = Child Node(s)

• Bottom (or right-most) node(s) = Leaf Node(s)

• Unique path from root to each leaf = Rule

[Diagram: a root node branching into child nodes, which branch further until ending in leaf nodes.]

Page 7

Introduction

• Decision Trees
  – Powerful/popular for classification & prediction
  – Represent rules

• Rules can be expressed in English
  – IF Age <= 43 & Sex = Male & Credit Card Insurance = No
    THEN Life Insurance Promotion = No

• Rules can be expressed using SQL for queries
  – Useful to explore data to gain insight into the relationships of a large number of candidate input variables to a target (output) variable

• You use mental decision trees often!

• Game: “I’m thinking of…” “Is it …?”

Page 8

Decision Tree – What is it?

• A structure that can be used to divide up a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules

• A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller, more homogeneous groups with respect to a particular target variable

Page 9

Decision Tree Types

• Binary trees – only two choices in each split. Can be non-uniform (uneven) in depth

• N-way trees or ternary trees – three or more choices in at least one of its splits (3-way, 4-way, etc.)

Page 10

A binary decision tree classification example

Potential catalog recipients as likely (1) or unlikely (0) to place an order if sent a new catalog

Each node is labeled with a node number in the upper-right corner and the predicted class in the center. The decision rules to split each node are printed on the lines connecting each node to its children.

Any record that reaches leaf nodes 19, 14, 16, 17, or 18 is classified as likely to respond, because the predicted class in this case is 1. The paths to these leaf nodes describe the rules in the tree. For example, the rule for leaf 19 is: if the customer has made more than 6.5 orders and it has been fewer than 765 days since the last order, the customer is likely to respond.
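
Read as code, the rule for leaf 19 is simply the conjunction of the split conditions on the path from the root to that leaf. A sketch with hypothetical field names (num_orders, days_since_last_order), since the figure itself is not reproduced in this transcript:

```python
def likely_to_respond(num_orders: float, days_since_last_order: float) -> bool:
    """Rule for leaf 19: more than 6.5 orders and fewer than 765 days since the last order."""
    return num_orders > 6.5 and days_since_last_order < 765

print(likely_to_respond(num_orders=8, days_since_last_order=120))  # True
```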

Page 11

Scoring example

• Often it is useful to show the proportion of the data in each of the desired classes

• Clarify Fig 6.2

Page 12

Scoring example

Page 13

A ternary decision tree example

Page 14

Decision Tree Splits (Growth)

• The best split at root or child nodes is defined as one that does the best job of separating the data into groups where a single class predominates in each group

  – Example: US Population data; input categorical variables/attributes include:
    • Zip code
    • Gender
    • Age

  – Split the data according to the “best split” rule above

Page 15

Split Criteria

• The best split is defined as one that does the best job of separating the data into groups where a single class predominates in each group

• The measure used to evaluate a potential split is purity
  – The best split is one that increases the purity of the subsets by the greatest amount
  – A good split also creates nodes of similar size, or at least does not create very small nodes

Page 16

Example: Good & Poor Splits

The final split is a good one because:
  – it leads to children of roughly the same size, and
  – the children have much higher purity than the parent.

Page 17

Tests for Choosing Best Split

• Purity (Diversity) Measures:

  For categorical targets:
  – Gini (population diversity)
  – Entropy (information gain)
  – Information Gain Ratio
  – Chi-square Test

  For numeric targets:
  – Reduction in variance
  – F test

Page 18

Gini (Population Diversity)

• The Gini measure of a node is the sum of the squares of the proportions of the classes.

Root Node: 0.5^2 + 0.5^2 = 0.5 (even balance)

Leaf Nodes: 0.1^2 + 0.9^2 = 0.82 (close to pure)
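
A minimal sketch of this calculation, using the convention on this slide (sum of squared class proportions, so 1.0 is a pure node; note that many texts instead report the complementary Gini impurity):

```python
def gini_purity(counts):
    """Sum of squared class proportions: 1.0 = pure node, 0.5 = even two-class balance."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

print(gini_purity([10, 10]))  # root node: 0.5
print(gini_purity([1, 9]))    # leaf node: about 0.82
```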

Page 19

Entropy (Information Gain)

• Entropy = a measure of how disorganized a system is.

Entropy of a two-class node: -1 * ( P(dark) log2 P(dark) + P(light) log2 P(light) )

Root node: -1 * (0.5 log2(0.5) + 0.5 log2(0.5)) = 1

The entropy of a node is the sum, over all the classes represented in the node, of the proportion of records belonging to a particular class multiplied by the base-two logarithm of that proportion (usually multiplied by –1 in order to obtain a positive number). The entropy of a split is the sum of the entropies of all the nodes resulting from the split, weighted by each node’s proportion of the records.

Each child node after the split: -1 * (0.1 log2(0.1) + 0.9 log2(0.9)) = 0.33 + 0.14 = 0.47

Weighted average entropy of the split = 0.47

Total entropy reduction (information gain) for the split = 1 – 0.47 = 0.53
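
The same arithmetic as a short sketch; the node sizes (20 records at the root, 10 in each child) follow the example above:

```python
from math import log2

def entropy(counts):
    """Entropy of a node: -sum(p * log2(p)) over the classes present."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def split_entropy(children):
    """Entropy of a split: child entropies weighted by each child's share of the records."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)

root = [10, 10]                  # 10 dark, 10 light
children = [[9, 1], [1, 9]]      # the two child nodes after the split

print(entropy(root))                          # 1.0
print(split_entropy(children))                # about 0.47
print(entropy(root) - split_entropy(children))  # information gain, about 0.53
```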

Page 20

Information Gain Ratio

• Problems with entropy:

– By breaking the larger data set into many small subsets, the number of classes represented in each node tends to go down, and with it, the entropy

– Decision trees tend to be quite bushy

• In reaction to this problem, C5 and other descendants of ID3 that once used information gain now use the ratio of the total information gain due to a proposed split to the intrinsic information attributable solely to the number of branches created as the criterion for evaluating proposed splits.
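
A sketch of the ratio described here: the information gain of a split divided by the intrinsic (split) information, i.e. the entropy of the branch sizes alone, which penalizes splits that fan out into many small branches:

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain_ratio(parent, children):
    """Information gain divided by the intrinsic (split) information."""
    total = sum(parent)
    weighted = sum(sum(c) / total * entropy(c) for c in children)
    gain = entropy(parent) - weighted
    intrinsic = entropy([sum(c) for c in children])  # entropy of the branch sizes alone
    return gain / intrinsic if intrinsic > 0 else 0.0

# Two-way split from the earlier example: gain about 0.53, equal branch sizes -> intrinsic info 1.0
print(gain_ratio([10, 10], [[9, 1], [1, 9]]))  # about 0.53
```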

Page 21

Chi-square

• The chi-square test gives its name to CHAID, a well-known decision tree algorithm = Chi-square Automatic Interaction Detector.

• CHAID makes use of the Chi-square test in several ways:

– to merge classes that do not have significantly different effects on the target variable;

– to choose a best split;

– to decide whether it is worth performing any additional splits on a node.
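
As an illustration of the "choose a best split" use, a proposed split can be scored by testing whether the child-node class counts are independent of the target. A sketch using scipy.stats.chi2_contingency with the two-child example from the earlier slides:

```python
from scipy.stats import chi2_contingency

# Rows = child nodes of a proposed split, columns = target classes (dark, light)
observed = [[9, 1],
            [1, 9]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)  # a small p-value suggests the split really separates the classes
```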

Page 22

Reduction in Variance

• The mean value in the parent node is 0.5 (the classes are coded 0/1).

• Every one of the 20 observations differs from the mean by 0.5, so the variance is (20 * 0.5^2) / 20 = 0.25.

• After the split, the left child has 9 dark spots and one light spot, so the node mean is 0.9.

• Nine of the observations differ from the mean value by 0.1 and one observation differs from the mean value by 0.9, so the variance is (0.9^2 + 9 * 0.1^2) / 10 = 0.09.

• Since both nodes resulting from the split have variance 0.09, the total variance after the split is also 0.09.

• The reduction in variance due to the split is 0.25 – 0.09 = 0.16.
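
The same worked example as a short sketch (classes coded 0/1, population variance):

```python
def variance(values):
    """Population variance: mean squared difference from the node mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

parent = [1] * 10 + [0] * 10   # 10 dark (1) and 10 light (0) spots
left   = [1] * 9 + [0] * 1     # child node: 9 dark, 1 light
right  = [1] * 1 + [0] * 9     # child node: 1 dark, 9 light

after = (len(left) * variance(left) + len(right) * variance(right)) / len(parent)
print(variance(parent))          # 0.25
print(after)                     # about 0.09
print(variance(parent) - after)  # reduction in variance, about 0.16
```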

Page 23

F Test

• The F score is calculated by dividing the between-sample estimate of the variance by the pooled (within-sample) estimate of the variance.

– The larger the score, the less likely it is that the samples are all randomly drawn from the same population.

• In the decision tree context, a large F-score indicates that a proposed split has successfully split the population into subpopulations with significantly different distributions.
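
A sketch of the F test applied to the children of a proposed split, using scipy.stats.f_oneway (one-way ANOVA) on the 0/1-coded target values from the earlier example; a large F score and small p-value suggest the subpopulations really differ:

```python
from scipy.stats import f_oneway

# Target values (coded 0/1) for the records landing in each child of a proposed split
left_child  = [1] * 9 + [0] * 1
right_child = [1] * 1 + [0] * 9

f_score, p_value = f_oneway(left_child, right_child)
print(f_score, p_value)
```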

Page 24

Pruning

• The decision tree keeps growing as long as new splits can be found that improve the ability of the tree to separate the records of the training set into increasingly pure subsets.

– Such a tree has been optimized for the training set, so eliminating any leaves would only increase the error rate of the tree on the training set.

• Does this imply that the full tree will also do the best job of classifying new datasets?

Page 25

Pruning

• A decision tree algorithm makes its best split first, at the root node where there is a large population of records.

• As the nodes get smaller, idiosyncrasies of the particular training records at a node come to dominate the process.

• That is, the tree finds general patterns at the big nodes and patterns specific to the training set in the smaller nodes.

• In other words, the tree over-fits the training set.

• The result is an unstable tree that will not make good predictions.

Page 26

Pruning

• The cure is to eliminate the unstable splits by merging smaller leaves through a process called pruning.

• Three general approaches to pruning:

– CART

– C5

– Stability-based

Page 27

CART Pruning

Creating the Candidate Subtrees

• The goal is to prune first those branches providing the least additional predictive power per leaf.

• In order to identify these least useful branches, CART relies on a concept called the adjusted error rate.

• This is a measure that increases each node’s misclassification rate on the training set by imposing a complexity penalty based on the number of leaves in the tree.

• The adjusted error rate is used to identify weak branches (those whose misclassification rate is not low enough to overcome the penalty) and mark them for pruning.

Page 28

CART Pruning

Picking the Best Subtree

• Each of the candidate subtrees is used to classify the records in the validation set.

• The tree that performs this task with the lowest overall error rate is declared the winner.

• The winning subtree has been pruned sufficiently to remove the effects of overtraining, but not so much as to lose valuable information.
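
One concrete way to reproduce this procedure is scikit-learn's cost-complexity pruning, which implements essentially the same adjusted-error-rate (cost-complexity) idea: enumerate the candidate subtrees (one per complexity penalty alpha), then keep the one with the lowest error rate on a held-out validation set. A sketch; the built-in breast-cancer dataset is just a stand-in for any labelled data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Enumerate the candidate subtrees: each ccp_alpha is a complexity penalty;
# larger alphas correspond to more aggressively pruned trees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
candidates = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    for alpha in path.ccp_alphas
]

# Pick the winning subtree: the candidate with the lowest error rate on the validation set.
best = max(candidates, key=lambda tree: tree.score(X_valid, y_valid))
print(best.get_n_leaves(), best.score(X_valid, y_valid))
```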

Page 29

CART Pruning

Using the Test Set to Evaluate the Final Tree

• Do not evaluate the performance of a model by its lift or error rate on the validation set.
  – Like the training set, it has had a hand in creating the model and so will overstate the model’s accuracy.

• Always measure the model’s accuracy on a test set that is drawn from the same population as the training and validation sets, but has not been used in any way to create the model.

Page 30

The C5 Pruning Algorithm

• Like CART, the C5 algorithm first grows an overfit tree and then prunes it back to create a more stable model.

• The pruning strategy is different:

– C5 does not make use of a validation set to choose from among candidate subtrees; the same data used to grow the tree is also used to decide how the tree should be pruned.

– C5 prunes the tree by examining the error rate at each node and assuming that the true error rate is actually substantially worse.
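
C5 itself is proprietary, so the following is only a sketch of the kind of pessimistic estimate its predecessor C4.5 uses: treat the errors observed at a node as a sample and take the upper limit of a binomial confidence interval (by default at 25% confidence) as the assumed "true" error rate:

```python
from math import sqrt
from statistics import NormalDist

def pessimistic_error(errors, n, confidence=0.25):
    """Upper limit of the binomial confidence interval for a node's error rate
    (a C4.5-style pessimistic estimate; C5's exact method may differ)."""
    f = errors / n                            # observed error rate at the node
    z = NormalDist().inv_cdf(1 - confidence)  # one-sided z value, about 0.67 for 25%
    num = f + z * z / (2 * n) + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    return num / (1 + z * z / n)

# A node with 2 errors out of 14 training records looks like a 14% error rate,
# but the pessimistic estimate is noticeably worse (about 22%):
print(pessimistic_error(2, 14))
```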

Page 31

Stability-Based Pruning

• CART and C5 fail to prune some nodes that are clearly unstable.

• One of the main purposes of a model is to make consistent predictions on previously unseen records.

– Any rule that cannot achieve that goal should be eliminated from the model.

Page 32

Stability-Based Pruning

• Small nodes cause big problems.

– A common cause of unstable decision tree models is allowing nodes with too few records.

• Most decision tree tools allow the user to set a minimum node size.

– As a rule of thumb, nodes that receive fewer than about 100 training set records are likely to be unstable.
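
In scikit-learn terms this is the min_samples_leaf parameter; a sketch applying the rule of thumb above:

```python
from sklearn.tree import DecisionTreeClassifier

# Refuse to create leaves with fewer than ~100 training records, per the rule of thumb above.
tree = DecisionTreeClassifier(min_samples_leaf=100, random_state=0)
# tree.fit(X_train, y_train)  # fit as usual; small, unstable leaves are never created
```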

Page 33

Extracting Rules from Trees

[Figure: a tree whose branches read directly as rules, e.g. if “watch the game” and “out with friends”, then “beer”.]

Page 34

Further Refinements

Using More Than One Field at a Time

Page 35

Further Refinements

Tilting the Hyperplane

Page 36

Alternate Representations for Decision Trees

Box Diagrams

Shading is proportional to the purity of the box; size is proportional to the number of records that land there.

Page 37

Alternate Representations for Decision Trees

Tree Ring Diagrams

• The circle at the center of the diagram represents the root node, before any splits have been made.

• Moving out from the center, each concentric ring represents a new level in the tree.

• The ring closest to the center represents the root node split.

• The arc length is proportional to the number of records taking each of the two paths, and the shading represents the node’s purity.

Page 38

Decision Tree Advantages

1. Easy to understand

2. Map nicely to a set of business rules

3. Have been applied successfully to real problems

4. Make no prior assumptions about the data

5. Able to process both numerical and categorical data

Page 39

Decision Tree Disadvantages

1. Output attribute must be categorical

2. Limited to one output attribute

3. Decision tree algorithms are unstable

4. Trees created from numeric datasets can be complex

Page 40

RapidMiner Practice

• To see:

– Training Videos\01 - Ralf Klinkenberg –RapidMinerResources\5 - Modelling -Classification -1- Decision trees - Basic.mp4

• To practice:

– Do the exercises presented in the movie using the files “Iris.ioo” and “Sonar.ioo”.

Page 41

RapidMiner Practice

• To see:

– Training Videos\02 - Tom Ott - Neural Market Trends\06 - Creating a Decision Tree with Rapidminer 5.0

• To practice:

– Do the exercises presented in the movie.

Page 42

RapidMiner Practice

• Data Preprocessing
  – GermanCredit.xls → GermanCredit.ioo

• Process design
  – Take a look at the .ioo file and its attributes / variables

• Use a decision tree to predict the response (credit rating)
  – Try different criteria and options

• Validate your model
  – Use the validation operator
  – Inside this, put the Decision Tree learner (left side) and Apply Model and Performance (right side)

• Model
  – Read and interpret the results