
Page 1: Implementation In Tree

Stat 6601
November 24, 2004
Bin Hu, Philip Wong, Yu Ye

Page 2: Data Background

• We use the credit scoring example from the SPSS Answer Tree program.
• There are 323 data points.
• The target variable is credit ranking (good [48%], bad [52%]).
• The four predictor variables are:
  – age categorical (young [58%], middle [24%], old [18%])
  – has AMEX card (yes [48%], no [52%])
  – paid weekly/monthly (weekly pay [51%], monthly salary [49%])
  – social class (management [12%], professional [49%], clerical [15%], skilled [13%], unskilled [12%])

Page 3: Data Background

• It is useful to see how the target variable is distributed by each of the predictor variables:

CLASS            Bad   Good   % Good
management         1     38    97.4%
professional      56    102    64.6%
clerical          40      7    14.9%
skilled manual    40      1     2.4%
unskilled         31      7    18.4%

PAY_WEEK
weekly pay       143     22    13.3%
monthly salary    25    133    84.2%

AGE
young            151     36    19.3%
middle            16     63    79.7%
old                1     56    98.2%

AMEX
no                89     78    46.7%
yes               79     77    49.4%
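A breakdown like this is straightforward to produce in R with table() and prop.table(). A minimal sketch, assuming the full 323-case data set sits in a hypothetical credit.csv with the column names used in the R session on Page 9:

> credit_all <- read.csv(file="credit.csv")    # hypothetical file with all 323 cases
> tab <- table(credit_all$CLASS, credit_all$CREDIT_R)
> tab                                          # bad/good counts by social class
> prop.table(tab, margin=1)                    # row proportions give the "% Good" column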

Page 4: Data Background

• Correlation Matrix:

Pearson Correlation Coefficients, N = 323
Prob > |r| under H0: Rho = 0

           CREDIT_R   PAY_WEEK   AGE        AMEX
CREDIT_R    1.00000    0.70885    0.66273    0.02653
                        <.0001     <.0001     0.6348
PAY_WEEK    0.70885    1.00000    0.51930    0.08292
             <.0001                <.0001     0.1370
AGE         0.66273    0.51930    1.00000   -0.00172
             <.0001     <.0001                0.9755
AMEX        0.02653    0.08292   -0.00172    1.00000
             0.6348     0.1370     0.9755
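The Prob > |r| layout indicates this matrix came from SAS (PROC CORR). A rough R equivalent, assuming the four variables are coded numerically (0/1 for the binary variables, ordered codes for AGE) in a hypothetical data frame credit_num:

> vars <- c("CREDIT_R", "PAY_WEEK", "AGE", "AMEX")
> round(cor(credit_num[, vars]), 5)                            # Pearson correlations
> cor.test(credit_num$CREDIT_R, credit_num$PAY_WEEK)$p.value   # Prob > |r| for one pair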

Page 5: Objective

• To create a predictive model of good credit risks.
• To assess the performance of the model, we randomly split the data into two parts: a training set (60%) to develop the model and the rest (40%) to validate it; one way to do this in R is sketched below.
• This is done to avoid possible "overfitting", since the validation set was not involved in deriving the model.
• Using the same data, we compare the results using R's Tree, Answer Tree's CART, and SAS's Proc Logistic.
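The deck does not show how the split was made; a standard approach in R (a sketch, with credit_all as the hypothetical full data frame and an arbitrary seed):

> set.seed(6601)                                # arbitrary seed, for reproducibility
> n <- nrow(credit_all)
> train_idx <- sample(n, size=round(0.60 * n))  # 60% of cases for training
> training   <- credit_all[train_idx, ]
> validation <- credit_all[-train_idx, ]        # remaining 40% for validation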

Page 6: Logistic Regression

• Let x be a vector of explanatory variables.
• Let y be a binary target variable (0 or 1).
• p = Pr(Y=1|x) is the target probability.
• The linear logistic model has the form

      logit(p) = log(p / (1 - p)) = α + β'x

• Predicted probability: phat = 1/(1+exp(-(α+β'x)))
• Note that the range for p is (0,1), but logit(p) is the whole real line.
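A small sketch of the link function and its inverse as defined above; plugging in the fitted values from Page 7 reproduces the two predicted probabilities quoted on Page 8:

> logit     <- function(p) log(p / (1 - p))
> inv_logit <- function(eta) 1 / (1 + exp(-eta))   # phat = 1/(1+exp(-(α+β'x)))
> inv_logit(1.5856)                                # monthly salary (α only): ~0.830
> inv_logit(1.5856 - 3.6066)                       # weekly pay (α + β): ~0.117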

Page 7: Logistic Results

• Using the training set, the maximum likelihood estimation failed to converge when the social class and age variables were included (likely quasi-complete separation: some class and age categories are nearly all good or all bad).
• Only the paid weekly/monthly and has AMEX card variables could be estimated.
• The AMEX variable was highly insignificant and so was dropped.
• Apparently, the tree algorithm does a better job of handling all the variables.
• SAS output of model:

                            Standard     Wald
Parameter   DF   Estimate   Error        Chi-Square   Pr > ChiSq
Intercept    1     1.5856   0.2662          35.4756       <.0001
PAY_WEEK 1   1    -3.6066   0.4169          74.8285       <.0001

• So the odds of a weekly-pay person being a good risk, relative to a monthly-salary person, are exp(-3.6) ≈ 0.027 to 1, or about 36 to 1 against.
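The SAS fit can be reproduced in R with glm(); a sketch, assuming CREDIT_R is coded 1 = good and PAY_WEEK is coded 1 = weekly pay (the deck does not spell out the coding, so signs may differ):

> fit <- glm(CREDIT_R ~ PAY_WEEK, family=binomial, data=training)
> summary(fit)                  # estimates should roughly match the SAS output above
> exp(coef(fit)["PAY_WEEK"])    # odds ratio for weekly vs monthly pay, ~0.027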

Page 8: Validation Results

• With only one variable in our predicted model, there are only two possible predicted probabilities: 0.117 and 0.830.
• Taking the higher probability as predicting a "good" account, our results are below:

                     Validation Set               Training Set
                     Actual Bad   Actual Good     Actual Bad   Actual Good
Predicted Bad            60           11              83           11
Predicted Good            8           50              17           83
Percent Agreement           85.3%                        85.6%

• The better measure is the validation set result. Note that the results are very similar, so overfitting does not appear to be a problem.
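A sketch of how these agreement figures can be computed, continuing the hypothetical fit and validation objects from the earlier sketches (and assuming CREDIT_R uses good/bad labels):

> phat <- predict(fit, newdata=validation, type="response")   # 0.117 or 0.830
> pred <- ifelse(phat > 0.5, "good", "bad")
> conf <- table(Predicted=pred, Actual=validation$CREDIT_R)
> conf
> sum(diag(conf)) / sum(conf)   # percent agreement, ~0.853 on the validation set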

Page 9: Growing a Tree in R (based on training data)

> credit_data <- read.csv(file="training.csv")
> library(tree)
> # CREDIT_R must be a factor for tree() to grow a classification tree
> credit_tree <- tree(CREDIT_R ~ CLASS + PAY_METHOD + AGE + AMEX,
+                     data=credit_data, split=c("gini"))
> tree.pr <- prune.tree(credit_tree)
> plot(tree.pr)                                              # figure 1
> plot(credit_tree, type="u"); text(credit_tree, pretty=0)   # figure 2
> tree.1 <- prune.tree(credit_tree, best=5)
> plot(tree.1, type="u"); text(tree.1, pretty=0)             # figures 3, 4, 5
> summary(tree.1)
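To produce the tree-side agreement numbers on Page 20, the pruned tree can score the validation set with predict(); a sketch, with validation.csv as a hypothetical file name:

> valid_data <- read.csv(file="validation.csv")                    # hypothetical file name
> tree.pred <- predict(tree.1, newdata=valid_data, type="class")   # predicted classes
> table(Predicted=tree.pred, Actual=valid_data$CREDIT_R)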

Page 10: Figure 1

[Figure 1: the pruning sequence from plot(tree.pr): deviance (y-axis, roughly 100–250) versus tree size (x-axis, 2–10), with the cost-complexity parameter k on the top axis (110.00, 1.40, 0.78, 0.13).]

Page 11: Figure 2

[Figure 2: the unpruned tree. The root split is on PAY_METHOD (pay by month); deeper splits use AGE, AMEX, and CLASS, and the leaves are labeled good or bad.]

Page 12: Figure 3

[Figure 3: the pruned tree (best=5). Root split on PAY_METHOD: pay by month, then splits on AGE: young and on CLASS: clerical,skilled,unskilled (with a further CLASS: clerical split); leaves are labeled good or bad.]

Page 13: Figure 4

[Figure 4: a four-leaf pruned tree. Root split on PAY_METHOD: pay by month, then splits on AGE: young and CLASS: clerical,skilled,unskilled; leaves: bad, good, bad, bad.]

Page 14: Figure 5

[Figure 5: a three-leaf pruned tree. Root split on PAY_METHOD: pay by month, then AGE: young; leaves: bad, good, bad.]

Page 15: Tree Based on Validation Data

[Tree grown on the validation data: root split on PAY_METHOD: pay by month, then AGE: middle,young; leaves are labeled good and bad.]

Page 16: Implementing using SPSS ANSWER TREE
Training sample – C&RT (min impurity change .01)

Credit ranking (1=default)

Node 0: Good 48.45% (94), Bad 51.55% (100), Total 194 (100.00%)
Split on Paid Weekly/Monthly (Improvement = 0.2539):

  Weekly pay → Node 2: Good 11.70% (11), Bad 88.30% (83), Total 94 (48.45%)
  Split on Social Class (Improvement = 0.0172):
    Professional; Management → Node 6: Good 32.14% (9), Bad 67.86% (19), Total 28 (14.43%)
    Clerical; Unskilled; Skilled Manual → Node 5: Good 3.03% (2), Bad 96.97% (64), Total 66 (34.02%)

  Monthly salary → Node 1: Good 83.00% (83), Bad 17.00% (17), Total 100 (51.55%)
  Split on Age Categorical (Improvement = 0.0633):
    Middle (25-35); Old (> 35) → Node 4: Good 100.00% (68), Bad 0.00% (0), Total 68 (35.05%)
    Young (< 25) → Node 3: Good 46.88% (15), Bad 53.13% (17), Total 32 (16.49%)
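Answer Tree's C&RT is closely related to CART as implemented in R's rpart package; a roughly comparable fit might look like the sketch below (the control settings are assumptions, not Answer Tree's exact defaults):

> library(rpart)
> cart_fit <- rpart(CREDIT_R ~ CLASS + PAY_METHOD + AGE + AMEX,
+                   data=credit_data, method="class",
+                   parms=list(split="gini"),          # Gini impurity, as in C&RT
+                   control=rpart.control(cp=0.01))    # cp plays the role of the
+                                                      # minimum impurity change
> plot(cart_fit); text(cart_fit, pretty=0)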

Page 17: Implementing using SPSS ANSWER TREE
Training sample – CHAID (Pearson Chi2, p=.05)

Credit ranking (1=default)

Node 0: Good 48.45% (94), Bad 51.55% (100), Total 194 (100.00%)
Split on Paid Weekly/Monthly (Adj. P-value = 0.0000, Chi-square = 98.6177, df = 1):

  Weekly pay → Node 2: Good 11.70% (11), Bad 88.30% (83), Total 94 (48.45%)
  Split on Social Class (Adj. P-value = 0.0009, Chi-square = 16.1257, df = 1):
    Clerical; Unskilled; Skilled Manual → Node 6: Good 3.03% (2), Bad 96.97% (64), Total 66 (34.02%)
    Professional; Management → Node 5: Good 32.14% (9), Bad 67.86% (19), Total 28 (14.43%)

  Monthly salary → Node 1: Good 83.00% (83), Bad 17.00% (17), Total 100 (51.55%)
  Split on Age Categorical (Adj. P-value = 0.0000, Chi-square = 43.5241, df = 1):
    Young (< 25) → Node 4: Good 46.88% (15), Bad 53.13% (17), Total 32 (16.49%)
    Middle (25-35); Old (> 35) → Node 3: Good 100.00% (68), Bad 0.00% (0), Total 68 (35.05%)

Page 18: Summary classification for training data

Level 1        Level 2         Predicted   % error
Pay-Monthly    Age-med,old     Good          0%
Pay-Monthly    Age-Young       Bad          47%
Pay-Weekly     Class-P,M       Bad          32%
Pay-Weekly     Class-C,S,U     Bad           3%

Page 19: Summary of validation data, grouped by training data classification

Level 1        Level 2         Predicted   % error
Pay-Monthly    Age-med,old     Good          2%
Pay-Monthly    Age-Young       Bad          59%
Pay-Weekly     Class-P,M       Bad          23%
Pay-Weekly     Class-C,S,U     Bad          13%

Page 20: Crosstabulation of predicted and actual classification

                           Training sample            Validation sample
                           Actual good  Actual bad    Actual good  Actual bad
Predict good                   68            0            67           21
Predict bad                    26          100             1           40
Agreement                        86.6%                      82.9%
Agreement in regression          85.6%                      85.3%

Page 21: Summary of Results

• Similar trees were generated by R and SPSS ANSWER TREE.
• Similar results were derived using different tree generation methods (C&RT and CHAID).
• The classification tree has a higher percentage of agreement between predicted and actual values than logistic regression on the training data.
• Using the grouping criteria derived from the training data, logistic regression has a higher percentage of agreement than the classification tree on the validation data.

Page 22: Conclusion

• The classification tree is a non-parametric method that selects predictive variables sequentially and groups cases into homogeneous clusters to derive the highest predictive probability.
• Classification trees can be implemented in different software packages and with different tree-growing methodologies.
• The classification tree normally performs better than parametric models, with a higher percentage of agreement between predicted and actual values.
• Classification trees have special advantages in industries like credit cards and marketing research by 1) grouping individuals into homogeneous clusters and 2) assigning not only the predicted values but also the probability of prediction error.

Page 23: Conclusion – cont'd

• As a non-parametric method, no functional form is specified and no parameters are estimated or tested.
• As shown in this small study, the lower percentage of agreement for the validation data suggests that "overfitting" might be a potential problem for classification trees.