TRANSCRIPT
Classification trees
Hi
What should you get out of this?
• Obtain an introduction to classification tree modelling.
• Be able to follow our steps and reproduce them at home.
• Understand the usefulness of classification trees.
• A "manual" explaining our steps will be posted online.
Schedule
• Part 1 – Introduction to Classification Trees
• Part 2 – Hands-on example of a classification tree
• Part 3 – Examples of research using classification trees
• Part 4 – CHAID and Gini Index models
• Part 5 – Asymmetric payoff analysis
Let’s Begin!
• Part 1 – Michel van Dijck
• Part 2 – Timo van Dockum
• Part 3 & 4 – Stanisław Guner
• Part 5 – Ruud Moers
Part 1
Introduction to classification trees - Theory
Decision tree
• Root Node
• Internal Node
• Leaf Node
Binary vs. Multiway split
Binary attributes
Nominal attributes
Ordinal Attributes
Continuous attributes
Measures for selecting best split
Criteria for growing a tree
• Maximize Information gain
• Maximize Gain Ratio
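The two growing criteria above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the talk: the function names and the toy split are my own, using base-2 entropy as defined later in Part 2.

```python
# Sketch: information gain and gain ratio for a candidate split.
# The toy data below (3 "yes" / 7 "no", split into two branches) is
# hypothetical, chosen to mirror the borrower example later on.
from math import log2
from collections import Counter

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

def gain_ratio(parent, children):
    """Information gain normalised by the split information, which
    penalises splits that produce many small branches."""
    n = len(parent)
    split_info = sum((len(ch) / n) * log2(n / len(ch)) for ch in children)
    return information_gain(parent, children) / split_info

parent = ["yes"] * 3 + ["no"] * 7
children = [["yes"] * 3 + ["no"] * 3, ["no"] * 4]  # a binary split
print(round(information_gain(parent, children), 2))  # prints: 0.28
print(round(gain_ratio(parent, children), 2))        # prints: 0.29
```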
Problems with decision trees
• Underfitting
– Not enough data
– Algorithm is "clueless"
• Overfitting
– Too many nodes
– Tree pruning
Part 2
How does it work in practice?
# Home Owner Marital Status Annual Income Defaulted Borrower
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
Node 0: 10 people (Defaulted Borrower: Yes 3, No 7)

Candidate split on home ownership:
• No home owner: 7 people (Yes 3, No 4)
• Home owner: 3 people (Yes 0, No 3)

Candidate split on marital status:
• Single or divorced: 6 people (Yes 3, No 3)
• Married: 4 people (Yes 0, No 4)

Candidate split on income:
• Income < 100: 6 people (Yes 3, No 3)
• Income ≥ 100: 4 people (Yes 0, No 4)
Selecting the best split
Impurity:
• When all people belong to the same class, the node has zero impurity.
• When all people are equally split between the two classes, the node has the highest impurity.
Goal: minimize impurity.
Node 0: 10 people (Yes 3, No 7)
• No home owner: 7 people (Yes 3, No 4); p(yes) = 3/7, p(no) = 4/7 → high impurity
• Home owner: 3 people (Yes 0, No 3); p(yes) = 0, p(no) = 1 → zero impurity
Degree of impurity
• Self-information is the amount of information you receive when an event occurs.
• The smaller the probability of an event, the greater its self-information.
• You want to minimize the expected amount of self-information to make more accurate predictions.
• Definition: log2(1/p) (as p increases, log2(1/p) decreases).
• Entropy is the expected value of the self-information.
• The entropy of Node 1 is: 3/7 · log2(1/(3/7)) + 4/7 · log2(1/(4/7)) ≈ 0.99
Node 1 (no home owner): 7 people (Yes 3, No 4); p(yes) = 3/7, p(no) = 4/7 → high impurity
Similarly, the entropy of Node 2 is: 0 · log2(1/0) + 1 · log2(1/1) = 0 (taking 0 · log2(1/0) = 0 by convention).

Node 2 (home owner): 3 people (Yes 0, No 3); p(yes) = 0, p(no) = 1 → zero impurity
Node 0: 10 people (Yes 3, No 7), impurity = 0.88
• No home owner: 7 people (Yes 3, No 4); p(yes) = 3/7, p(no) = 4/7 → high impurity = 0.99
• Home owner: 3 people (Yes 0, No 3); p(yes) = 0, p(no) = 1 → zero impurity = 0
• A split is an improvement if the weighted average of the impurities of the subnodes is smaller than the impurity of the root node.
Weighted impurity = 7/10 · 0.99 + 3/10 · 0 = 0.69 < 0.88 → improvement
Split based on weighted impurity:
• No split: 0.88
• Home owner yes/no: 0.69
• Marital status: 0.60
• Annual income below/above 100: 0.60
Let’s go for the split based on income.
Node 0: 10 people (Yes 3, No 7), impurity = 0.88
• Income ≥ 100: 4 people (Yes 0, No 4), impurity = 0 → no further split
• Income < 100: 6 people (Yes 3, No 3), impurity = 1.00 → split further on marital status:
– Married: 2 people (Yes 0, No 2), impurity = 0 → no further split
– Single or divorced: 4 people (Yes 3, No 1), impurity = 0.81
Weighted impurity of the marital split = 4/6 · 0.81 + 2/6 · 0 = 0.54 < 1.00 → improvement
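The entropy figures in this worked example can be reproduced in a few lines. This is a minimal sketch, assuming base-2 entropy as defined above; the function name is my own.

```python
# Reproducing the Part 2 impurity numbers for the ten borrower records.
from math import log2

def entropy(yes, no):
    """Base-2 entropy of a node with `yes`/`no` class counts."""
    total = yes + no
    h = 0.0
    for c in (yes, no):
        if c:  # 0 * log2(1/0) is taken as 0 by convention
            p = c / total
            h += p * log2(1 / p)
    return h

root = entropy(3, 7)                                     # Node 0
home = 7 / 10 * entropy(3, 4) + 3 / 10 * entropy(0, 3)   # home-owner split
income = 6 / 10 * entropy(3, 3) + 4 / 10 * entropy(0, 4) # income (or marital) split
print(round(root, 2), round(home, 2), round(income, 2))  # prints: 0.88 0.69 0.6
```

Any split whose weighted child entropy falls below the parent's entropy counts as an improvement.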
Part 3
Case Studies
Case 1 – Classifying an Iris
Setosa, Versicolor, Virginica
• Not enough knowledge to distinguish them by looking at them?
• Use the petal's width and its length.
• Follow the decision tree.
• The iris data set was introduced in 1936 by the British statistician Fisher.
• Different trees have been built over the last 80 years.
• Accuracy > 95%
Case 2 – Classification of osteoarthritis
List of characteristics of idiopathic and non-idiopathic OA.
M.D.s were surveyed to determine the classification attributes from the list of potential characteristics.
• X-ray as a first exam
• Physical examination as a first step
• Repeated synovial fluid test
Conclusions
Case 3 – Crash vs. non-crash unit-independent classification
Non-crash-specific factors.
They used the Gini impurity measure.
This equation finds the gain (in terms of impurity) from creating a new node.
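The equation itself was on the slide image; as a sketch, the Gini impurity and the gain from a split can be written as follows. The function names and the toy class counts are my own, for illustration only.

```python
# Sketch: Gini impurity of a node and the impurity gain from splitting it.
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent, children):
    """Parent impurity minus the weighted impurity of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted

parent = [3, 7]                 # hypothetical class counts
children = [[3, 3], [0, 4]]     # a candidate binary split
print(round(gini_gain(parent, children), 2))  # prints: 0.12
```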
VIM assesses the performance of a variable in producing splits.
VIM performance: the speed limit seems to be crucial in producing purer nodes.
Lift chart – Generic and baseline model
Lift chart – Generic and crash type models
Conclusions
Part 4
Modelling
Data Transformation
• Labels: "not donated" and "donated".
• Attributes must characterize donors in ways other than their label, so DONAMT is left out.
• SPSS: aggregate the test and train sheets into one data set; distinguish between the two with a new 0/1 attribute "train".
• Create standardized values of the attributes (for PCA).
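The standardization step can be sketched as a z-score computation; this is not the SPSS procedure itself, and the sample values below are made up for illustration.

```python
# Sketch: z-score standardisation of one attribute's values.
from math import sqrt

def zscores(values):
    """Standardise values to mean 0, standard deviation 1,
    using the sample standard deviation (n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

donations = [10.0, 12.5, 7.0, 15.0, 9.0]  # hypothetical AVGDON-like values
z = zscores(donations)
print([round(v, 2) for v in z])  # prints: [-0.22, 0.58, -1.19, 1.38, -0.55]
```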
DATA REDUCTION
Correlations
Pearson correlations (N = 8137 for every pair):

          TIMELR   TIMECL   FRQRES   MEDTOR   AVGDON   LSTDON   ANNDON
TIMELR     1       -.063**  -.736**  -.109**  -.333**  -.048**  -.293**
TIMECL    -.063**   1        .056**   .083**   .057**   .031**  -.168**
FRQRES    -.736**   .056**   1       -.025*    .412**  -.025*    .353**
MEDTOR    -.109**   .083**  -.025*    1        .030**   .091**   .014
AVGDON    -.333**   .057**   .412**   .030**   1        .642**   .878**
LSTDON    -.048**   .031**  -.025*    .091**   .642**   1        .618**
ANNDON    -.293**  -.168**   .353**   .014     .878**   .618**   1

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
Correlation matrix
• We observe high correlations (absolute value above 0.5) between:
– FRQRES and TIMELR
– ANNDON and AVGDON
– LSTDON and AVGDON
– ANNDON and LSTDON
Principal Component Analysis – Scree plot
PCA – Cumulative variance explained

Total Variance Explained
Factor   Initial Eigenvalues              Extraction Sums of Squared Loadings
         Total   % of Var.  Cumul. %      Total   % of Var.  Cumul. %
1        1.959   39.172     39.172        1.631   32.619     32.619
2        1.153   23.063     62.235         .532   10.642     43.261
3         .965   19.306     81.542
4         .674   13.480     95.021
5         .249    4.979    100.000
Extraction Method: Principal Axis Factoring.
Communalities
                 Initial   Extraction
Zscore(TIMELR)    .559      .675
Zscore(TIMECL)    .052      .443
Zscore(FRQRES)    .575      .804
Zscore(MEDTOR)    .043      .015
Zscore(ANNDON)    .164      .225
Extraction Method: Principal Axis Factoring.
Factor loadings – 2 factors found

Factor Matrix(a)
                 Factor 1   Factor 2
Zscore(TIMELR)    -.818      -.074
Zscore(TIMECL)     .034       .665
Zscore(FRQRES)     .896       .018
Zscore(MEDTOR)     .047       .114
Zscore(ANNDON)     .393      -.267
Extraction Method: Principal Axis Factoring.
a. Attempted to extract 2 factors. More than 100 iterations required (convergence = .001). Extraction was terminated.
PCA - Conclusions
• Helps us understand the dimensionality of the data.
• Informs us about the uniqueness of attributes.
• Tells us what to expect in terms of the complexity of a decision tree.
MODEL 1 – SPSS CHAID
SPSS CHAID model
Notice:
1) 2–3 levels
2) Attributes can be split over a range
SPSS – Right side
SPSS – Left side
Confusion matrix (Classification)

Sample     Observed             Predicted                      Percent correct
                                donated      not donated
Training   donated                899            510               63.8%
           not donated            523           2125               80.2%
           Overall percentage    35.1%          64.9%              74.5%
Test       donated                837            569               59.5%
           not donated            574           2100               78.5%
           Overall percentage    34.6%          65.4%              72.0%

Growing method: CHAID
Dependent variable: label
Ratios
• Overall error rate = (574 + 569) / 4080 = 28.0%
• Overall accuracy = (837 + 2100) / 4080 = 72.0%
• Sensitivity = 837 / (837 + 569) = 59.5%
• Specificity = 2100 / (2100 + 574) = 78.5%: the ability to rule out non-donors correctly.
• False positive rate = 574 / (574 + 2100) = 21.5% (1 − specificity)
• False negative rate = 569 / (569 + 837) = 40.5% (1 − sensitivity)
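The test-set ratios can be recomputed directly from the confusion matrix. A minimal sketch; the variable names are my own.

```python
# Recomputing the test-set ratios from the confusion matrix above.
tp, fn = 837, 569    # donated:     predicted donated / predicted not donated
fp, tn = 574, 2100   # not donated: predicted donated / predicted not donated
total = tp + fn + fp + tn  # 4080 test records

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # recall on actual donors
specificity = tn / (tn + fp)   # ability to rule out non-donors
print(round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
# prints: 0.72 0.595 0.785
```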
Lift chart
[Figure: model response vs. naive response as a function of the number of mailings; y-axis 0–1600 responses, x-axis 0–4000 mailings]
Why build this model?
• Pros:
– Intuitive
– Easy to read
– Pruned
– Provides a good knowledge overview
MODEL 2 – RAPIDMINER GINI INDEX
Criteria: based on the Gini criterion
RM Gini index model
Performance
The overall accuracy of the different models was very similar.
Lift chart
[Figure: expected number of donors (y-axis, 0–1600) vs. cumulative number of letters sent (x-axis, 0–4000); number of responses from the model vs. the naive classifier]
Part 5 – Payoff analysis
Asymmetric payoff analysis – different approaches – a challenge
• Method 1: Focus on probabilities and select records based on the chance of being a donor.
• Method 2: Include in the cut-off calculation the amount each client group generates.
Method 1
• Select all records with p > 0.5 – classified as donors.
• Add records with p between 0.3 and 0.5 – captures more donors.
• Or select a random sub-sample from the non-donors (20% of records).
– SIMPLE!
Method 2: Asymmetric payoff analysis
• The previous model has high accuracy.
• However, should accuracy be your goal?
Asymmetric payoff analysis
• No: you should aim to get as much profit as possible out of your actions.
• Send a letter whenever its expected revenue is positive ⇒ a letter only costs 50 cents.
• E.g., in one of the groups only one in ten people is expected to donate, but when that person is willing to donate 10 euros, the expected revenue still exceeds the cost.
Solution: Expected Revenue
• Calculated as: expected average donation per person × probability that somebody donates − cost of a letter.
• Take the expected donation from the overall average donation ⇒ €7.01.
Expected Revenues
• Expected revenue of the group at node 10: 0.068 × 7.01 − 0.5 = −0.0235 ⇒ negative, so do not send a letter.
• Expected revenue of the group at node 9: 0.205 × 7.01 − 0.5 = 0.9366 ⇒ even though only a low percentage donates, you should send them a letter.
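The decision rule above is easy to sketch in code. The constants come from the slides; the function and variable names are my own, and the per-node probabilities are the rounded slide values.

```python
# Sketch: mail a node only if p(donate) * expected donation - letter cost > 0.
AVG_DONATION = 7.01  # overall average donation (euros)
LETTER_COST = 0.50   # cost per letter (euros)

def expected_revenue(p_donate):
    """Expected revenue (euros) of mailing one person in a node."""
    return p_donate * AVG_DONATION - LETTER_COST

# Node 10 (p ≈ 0.068) is negative, node 9 (p ≈ 0.205) is positive.
for node, p in [(10, 0.068), (9, 0.205)]:
    rev = expected_revenue(p)
    print(node, round(rev, 2), "send" if rev > 0 else "skip")
```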
Expected Revenues
• Expected Revenues for every group at every node
• Conclusion: if the cost of sending a letter were to increase by more than the amount given in the table, you would stop sending letters to the group at that node.
• Consequences for the classification tree:
Node Number Expected Revenue
1 0.0676
9 0.9366
10 -0.0235
11 1.6444
12 0.5652
13 0.3900
14 1.1749
16 1.4272
17 2.4083
18 3.2353
19 2.4644
20 0.7825
21 3.2913
22 1.6655
23 3.4035
24 4.5107
25 2.8708
26 5.3096
27 1.4692
28 3.1372
Old Classification Tree
Change in Classification Tree
• Tremendous simplification of the tree
• However, (0.182, 0.234] is a strange interval
Further analyses
• Now take into account that every group will probably donate a different average amount.
• Calculate this amount for every specific group and determine the expected revenue per person per group.
• Now the group at node 1 will also not receive a letter.
Node Number   Expected Revenue
1                 -0.3089
9                  0.2981
10                -0.2733
11                 1.3297
12                 0.0796
13                 0.2141
14                 0.7444
16                 1.4505
17                 2.6719
18                 3.7164
19                 2.3553
20                 0.4315
21                 5.1068
22                 2.6946
23                 5.5672
24                 7.9259
25                 6.7790
26                11.4221
27                 1.4036
28                 3.7041
New Classification Tree
• Still, there are some remarks.
Analysis of the new tree
• Consider whether you are satisfied with forecasting using only the frequency of response and the time since the last response.
• The tree does make sense → if there has been no response for a long time, most likely no donation.
• Results were obtained with out-of-sample forecasting using the other data set.
Out-of-sample forecasting* (error)
• If we apply the last model, we would in the end have a total revenue of €22,373 and send out 3,046 letters.
• Compared to the model in which we only send letters to those we expect to donate: that would have a total revenue of €15,544 and send 1,477 letters, which is indeed lower.
• The model in which we just use the average donation of all (ex-)donors gives us a total revenue of €22,460, and we would have to send 3,270 letters.
• Even better would be sending a letter to everybody, which would lead to a total revenue of €23,257.50.
Discussion
• Unfortunately, 'our' model has been beaten by the most simplistic model of sending a letter to everybody, at least for the data we forecasted.
• However, we did deliver some insight into whom you might not want to send letters if the cost per letter were to increase.