
An R tutorial on statistical, naïve and intuitive predictors in credit risk classification

Rodolfo Vanzini — Bologna.

September 15, 2015

By quoting extensively the extraordinary work of Kahneman (2011) I'd like to explain the rationale of this work on credit risk classification during training sessions in traditional classroom settings – like the ones that I have conducted in the last three years – for loan officers employed by banks.

Why are experts so inferior to algorithms? One reason [. . . ] is that experts try to be clever, think outside the box, and consider combinations of features in making their predictions. [. . . ] Complexity may work in the odd case, but more often than not it reduces validity. (Kahneman 2011, page 224)

Another reason for the inferiority of expert judgment is that humans are incorrigibly inconsistent in making summary judgments of complex information. When asked to evaluate the same information twice, they frequently give different answers. (Kahneman 2011, page 224)

The research suggests a surprising conclusion: to maximize predictive accuracy, final decisions should be left to formulas, especially in low-validity environments. (Kahneman 2011, page 225)

[Dawes] observed that the complex statistical algorithm adds little or no value. One can do just as well by selecting a set of scores that have some validity for predicting the outcome and adjusting the values to make them comparable [. . . ] it is possible to develop useful algorithms without any prior statistical research. Simple equally weighted formulas based on existing statistics or on common sense are often very good predictors of significant outcomes. [. . . ] The important conclusion of this research is that an algorithm that is constructed on the back of an envelope is often good enough to compete with an optimally weighted formula, and certainly good enough to outdo expert judgment. This logic can be applied to many domains, ranging from the selection of stocks by portfolio managers to the choices of medical treatments by doctors or patients.


(Kahneman 2011, page 226)

Whenever we can replace human judgment by a formula, we should at least consider it. (Kahneman 2011, page 233)

Data are generated according to the desired features the sample data set must have in terms of default frequency, degree of overlap between non-defaulted and defaulted companies, and key financial ratios as predictors. The code below generates two predictors: the debt-to-book-value ratio (DMP, from the original Italian debito-mezzi-propri) and the EBIT-to-interest-payments ratio (EBITOF, from the original Italian EBIT-oneri-finanziari).

n <- 1000

p <- 0.82 #proportion of non defaulters

set.seed(321) #for reproducibility

DMP.no <- rnorm(n = n * p, mean = 1.5, sd = 0.75)

DMP.si <- rnorm(n = n * (1 - p), mean = 3.0, sd = 0.75)

EBITOF.no <- rnorm(n = n * p, mean = 2.0, sd = 0.75)

EBITOF.si <- rnorm(n = n * (1 - p), mean = 0.75, sd = 0.75)

df <- data.frame(Default = c(rep('No', p * n),
                             rep('Si', (1 - p) * n)),
                 DMP = c(DMP.no, DMP.si),
                 EBITOF = c(EBITOF.no, EBITOF.si))

str(df)

## 'data.frame': 1000 obs. of 3 variables:

## $ Default: Factor w/ 2 levels "No","Si": 1 1 1 1 1 1 1 1 1 1 ...

## $ DMP : num 2.779 0.966 1.292 1.41 1.407 ...

## $ EBITOF : num 0.878 1.036 3.052 2.021 3.426 ...

#Adjust DMP for negative values

d <- df$DMP

d[df$DMP < 0] <- 0

df$DMP <- d

head(df)

## Default DMP EBITOF

## 1 No 2.7786774 0.8777016

## 2 No 0.9659711 1.0358616

## 3 No 1.2915113 3.0520025

## 4 No 1.4102632 2.0214981

## 5 No 1.4070295 3.4255801

## 6 No 1.7011378 1.1667068
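The simulation above is a two-class scheme: each ratio is drawn from a class-conditional Gaussian, and negative leverage values are clamped at zero. The same idea can be mirrored outside R; a minimal Python sketch of the DMP draws (same means, standard deviations and class shares as above, with the zero-clamping folded into the draw):

```python
import random

random.seed(321)
n, p = 1000, 0.82  # sample size and share of non-defaulters

# class-conditional Gaussian draws for the leverage ratio DMP,
# clamped at zero as in the adjustment above
dmp_no = [max(0.0, random.gauss(1.5, 0.75)) for _ in range(round(n * p))]
dmp_si = [max(0.0, random.gauss(3.0, 0.75)) for _ in range(round(n * (1 - p)))]

print(len(dmp_no), len(dmp_si), min(dmp_no + dmp_si) >= 0.0)
```

Defaulters are simulated with higher leverage on average (mean 3.0 versus 1.5), which is what creates the partially overlapping clouds inspected below.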

RColorBrewer is loaded to set a palette with four colors used to enhance the graphics.

require(RColorBrewer)

pal <- brewer.pal(4, "Set1")

pal

## [1] "#E41A1C" "#377EB8" "#4DAF4A" "#984EA3"


After setting up the palette, it's necessary to plot the sample to inspect it for possible irregularities with respect to the desired features. To do so the ggplot2 package is required and the scatter plot options are adjusted so as to get a plot similar to base R.

require(ggplot2)

pl <- ggplot(df, aes(DMP, EBITOF, color = Default, shape = Default)) +

geom_point(size = 3, alpha = 0.5) +

scale_shape(solid = FALSE) +

scale_color_manual(values = c(pal[2], pal[1])) +

scale_shape_manual(values = c(1,3))

pl

[Figure: scatter plot of EBITOF against DMP, with non-defaulted (No) and defaulted (Si) companies distinguished by color and shape]

Overlapping histograms of non-defaulted/defaulted companies for both key financial ratios.

pl.1 <- ggplot(df, aes(DMP, fill = Default)) +

scale_fill_manual(values = c(pal[2], pal[1]))

pl.2<-ggplot(df, aes(EBITOF, fill = Default)) +

scale_fill_manual(values = c(pal[2], pal[1]))

pl.1 + geom_histogram(data = subset(df, Default == "Si"),
                      binwidth = 0.1,
                      alpha = 0.5,
                      position = 'identity') +
  geom_histogram(data = subset(df, Default == "No"),
                 binwidth = 0.1,
                 alpha = 0.5,
                 position = 'identity')


[Figure: overlapping histograms of DMP counts for Default = No/Si]

pl.2 + geom_histogram(data = subset(df, Default == "Si"),
                      binwidth = 0.1, alpha = 0.5, position = 'identity') +
  geom_histogram(data = subset(df, Default == "No"),
                 binwidth = 0.1, alpha = 0.5, position = 'identity')

[Figure: overlapping histograms of EBITOF counts for Default = No/Si]

Let's generate some false predictors and sample a reasonable subset to be handed to students:

# subset data

set.seed(666)

s.s <- 200

require(dplyr)


df.sm <- sample_n(df, size = s.s, replace = FALSE)

df.sm$SaleChg <- rnorm(n = s.s,

mean = 0.0,

sd = 10.0)

df.sm$ClienteStorico<- sample(c('Si', 'No'),

size = s.s,

replace = TRUE)

df.sm$Settore <- sample(c("Meccanica industriale",

"Servizi",

"Meccanica automotive",

"Dettaglio",

"Edile"),

size = s.s,

replace = TRUE)

df.sm$Outlook <- sample(c('Positive', 'Stable', 'Negative'),

size = s.s,

replace = TRUE)

set.seed(1)

z <- sort(sample(nrow(df.sm), nrow(df.sm) * 0.5))

train <- df.sm[z,]

test <- df.sm[-z,]
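The 50/50 split above simply samples half of the row indices without replacement and takes the complement as the test set. The same logic sketched in Python (stdlib only; the row count matches the subsampled data frame):

```python
import random

random.seed(1)
n_rows = 200  # size of the subsampled data frame df.sm

# sample half the row indices for train; the complement becomes test
train_idx = sorted(random.sample(range(n_rows), n_rows // 2))
train_set = set(train_idx)
test_idx = [i for i in range(n_rows) if i not in train_set]

print(len(train_idx), len(test_idx))
```

Sorting the sampled indices is cosmetic (it preserves the original row order in the train set), as `sort()` does in the R code.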

Both train and test samples will be duplicated in a second data frame, trainn and testt, just in case they are needed (seeding will be set again to new values to generate new random samples).

trainn <- train

testt <- test

Before continuing, the data will be saved to produce hand-outs for students.

write.table(train, file = "train_100.csv",

sep = ";",

dec = ",")

write.table(test, file = "test_100.csv",

sep = ";",

dec = ",")

Perform some exploratory data analysis on the df.sm data frame:

par(mfrow = c(2,3))

T1 <- table(df.sm$ClienteStorico, df.sm$Default,

dnn = c('Cliente storico', 'Default'))

mosaicplot(T1,

main = 'Cliente storico per default')

T2 <- table(df.sm$Settore, df.sm$Default,

dnn = c('Settore', 'Default'))

mosaicplot(T2,

main = 'Settore per default')

T3 <- table(df.sm$Outlook, df.sm$Default,
            dnn = c('Outlook', 'Default'))

mosaicplot(T3,

main = 'Outlook per default')

boxplot(EBITOF ~ Default, data = df.sm,

main = 'EBITOF per default')

boxplot(DMP ~ Default, data = df.sm,

main = 'DMP per default')

boxplot(SaleChg ~ Default, data = df.sm,

main = 'Sales chg. per default')

[Figure: 2×3 panel of exploratory plots – mosaic plots 'Cliente storico per default', 'Settore per default', 'Outlook per default' and boxplots 'EBITOF per default', 'DMP per default', 'Sales chg. per default']

# barplot(table(df.sm$ClienteStorico, df.sm$Default)/nrow(df.sm),
#         names.arg = c('Default No', 'Default Si'),
#         main = 'Cliente storico per default')

par(mfrow = c(1,1))

s <- chisq.test(T1)

print(s)

##

## Pearson's Chi-squared test with Yates' continuity correction

##

## data: T1

## X-squared = 3.8975, df = 1, p-value = 0.04836
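The Yates-corrected statistic R reports is Σ(|O − E|−0.5)²/E over the four cells of the 2×2 table. Since the tutorial only prints the statistic, here is the computation sketched in Python on hypothetical counts (assuming every |O − E| exceeds 0.5, so the correction never goes negative):

```python
# Yates-corrected chi-squared for a 2x2 table (counts are hypothetical,
# chosen only to illustrate the formula)
table = [[60, 24], [30, 6]]  # rows: ClienteStorico No/Si, cols: Default No/Si
row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
n = sum(row)

x2 = 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / n          # expected count under independence
        x2 += (abs(table[i][j] - e) - 0.5) ** 2 / e

print(round(x2, 4))
```

Comparing the statistic to the chi-squared distribution with 1 degree of freedom then yields the p-value printed by `chisq.test`.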

Pairs plot for the numeric variables:


pairs(df.sm[, 2:4])

[Figure: pairs plot of DMP, EBITOF and SaleChg]

pl.sm <- ggplot(df.sm, aes(DMP, EBITOF,

color = Default,

shape = Default)) +

geom_point(size = 3, alpha = 1.0) +

scale_shape(solid = FALSE) +

scale_color_manual(values = c(pal[2], pal[1])) +

scale_shape_manual(values = c(1,3))

pl.sm + stat_smooth(aes(group = 1), method = 'lm')


[Figure: scatter plot of EBITOF against DMP with a linear smooth fitted across both classes]

require(MASS)

pr.dmp.fit <- lda(Default ~ DMP, data = df.sm)

pr.ebitof.fit <- lda(Default ~ EBITOF, data = df.sm)

# function to compute the prediction rule (cutoff) implied by a univariate LDA fit
dec.rule.ebit <- function(lda, df){
  A <- mean(lda$means)
  B <- log(lda$prior[2]) - log(lda$prior[1])
  s2.k <- t(tapply(df$EBITOF, df$Default, var)) %*% lda$prior
  C <- s2.k/(lda$means[1] - lda$means[2])
  dr <- A + B * C
  dr
}

dec.rule.dmp <- function(lda, df){
  A <- mean(lda$means)
  B <- log(lda$prior[2]) - log(lda$prior[1])
  s2.k <- t(tapply(df$DMP, df$Default, var)) %*% lda$prior
  C <- s2.k/(lda$means[1] - lda$means[2])
  dr <- A + B * C
  dr
}

dr.dmp <- dec.rule.dmp(pr.dmp.fit, df.sm)

dr.ebitof <- dec.rule.ebit(pr.ebitof.fit, df.sm)
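The helper functions implement the two-class univariate LDA cutoff: the midpoint of the class means, shifted by the log-prior-odds scaled by the pooled variance over the mean difference. A numeric check of that formula in Python (the class statistics are hypothetical, loosely echoing the EBITOF simulation parameters):

```python
import math

def lda_cutoff(mu_no, mu_si, var_pooled, prior_no, prior_si):
    # midpoint of the class means, shifted by the log-prior-odds term
    a = (mu_no + mu_si) / 2
    b = math.log(prior_si) - math.log(prior_no)
    c = var_pooled / (mu_no - mu_si)
    return a + b * c

# hypothetical EBITOF-like inputs: non-defaulters center at 2.0, defaulters at 0.75
cut = lda_cutoff(mu_no=2.0, mu_si=0.75, var_pooled=0.75**2,
                 prior_no=0.82, prior_si=0.18)
print(round(cut, 4))
```

Because defaulters are the rarer class, the log-prior term pulls the cutoff below the midpoint of the two means, shrinking the region classified as 'Si'.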

pl.1.sm <- ggplot(df.sm, aes(DMP, fill = Default)) +

scale_fill_manual(values = c(pal[2], pal[1]))

pl.2.sm <- ggplot(df.sm, aes(EBITOF, fill = Default)) +

scale_fill_manual(values = c(pal[2], pal[1]))


pl.1.sm + geom_histogram(data = subset(df.sm, Default == "Si"),
                         binwidth = 0.25,
                         alpha = 0.5,
                         position = 'identity') +
  geom_histogram(data = subset(df.sm, Default == "No"),
                 binwidth = 0.25,
                 alpha = 0.5,
                 position = 'identity') +
  geom_vline(xintercept = dr.dmp,
             linetype = 'dashed')

pl.2.sm + geom_histogram(data = subset(df.sm, Default == "Si"),
                         binwidth = 0.25, alpha = 0.5, position = 'identity') +
  geom_histogram(data = subset(df.sm, Default == "No"),
                 binwidth = 0.25, alpha = 0.5, position = 'identity') +
  geom_vline(xintercept = dr.ebitof,
             linetype = 'dashed')

[Figure: overlapping histograms of DMP and EBITOF for Default = No/Si, with the LDA cutoffs marked by dashed vertical lines]

1 Statistical & naïve predictors

Run a logistic regression to show that the false predictors aren't significant:

lgt.null <- glm(Default ~ .,

data = df.sm,

family = 'binomial')

summary(lgt.null)

##

## Call:

## glm(formula = Default ~ ., family = "binomial", data = df.sm)

##

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.05817 -0.13653 -0.02901 -0.00372 2.02415


##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -6.89273 2.12989 -3.236 0.00121 **

## DMP 3.27608 0.73848 4.436 9.15e-06 ***

## EBITOF -2.96387 0.74027 -4.004 6.23e-05 ***

## SaleChg -0.08523 0.05387 -1.582 0.11360

## ClienteStoricoSi 1.06459 0.81142 1.312 0.18951

## SettoreEdile 0.43600 1.18740 0.367 0.71348

## SettoreMeccanica automotive -1.00954 1.43137 -0.705 0.48062

## SettoreMeccanica industriale 1.22437 1.18360 1.034 0.30093

## SettoreServizi -0.82071 1.30597 -0.628 0.52972

## OutlookPositive 2.26890 1.08911 2.083 0.03723 *

## OutlookStable 1.21965 1.16211 1.050 0.29394

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 185.491 on 199 degrees of freedom

## Residual deviance: 49.379 on 189 degrees of freedom

## AIC: 71.379

##

## Number of Fisher Scoring iterations: 8

plot(data = df.sm, EBITOF ~ DMP,

main = 'Sample train + test',

cex = 1.5)

plot(data = df.sm, EBITOF ~ DMP,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Sample train + test (defaulters displayed)',

cex = 1.5)

[Figure: base R scatter plots of EBITOF against DMP – 'Sample train + test' and 'Sample train + test (defaulters displayed)']


par(mfrow = c(1,2))

plot(data = train, EBITOF ~ DMP,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Train',

cex = 1.5)

plot(data = test, EBITOF ~ DMP,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Test',

cex = 1.5)

[Figure: side-by-side scatter plots of EBITOF against DMP for the Train and Test samples]

par(mfrow=c(1,1))

plot(data = train, EBITOF ~ DMP,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Train sample data set',

cex = 1.5)

[Figure: scatter plot 'Train sample data set' of EBITOF against DMP]

Display train data sample:

plot(data = train, EBITOF ~ DMP,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

cex = 1.5)

par(mfrow = c (1,2))

boxplot(data = train, EBITOF ~ Default,

col = c(pal[2], pal[1]), ylab='EBIT/OF',

xlab = 'Default')

boxplot(data = train, DMP ~ Default,

col = c(pal[2], pal[1]), ylab='D/MP',

xlab = 'Default')

par(mfrow = c(1,1))

[Figure: scatter plot of EBITOF against DMP for the train sample, plus boxplots of EBIT/OF and D/MP by Default]

Fit a logit model to the train data sample and check that R has coded the response Default correctly: contrasts(train$Default) shows that a dummy variable has been created, with 1 corresponding to the default status.

lgt.fit <- glm(Default ~ DMP + EBITOF,

data = train,

family = 'binomial')

contrasts(train$Default)

## Si

## No 0

## Si 1

summary(lgt.fit)

##

## Call:

## glm(formula = Default ~ DMP + EBITOF, family = "binomial", data = train)

##

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.30735 -0.22448 -0.06741 -0.02830 2.78986

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -5.8695 2.3823 -2.464 0.013746 *

## DMP 3.0051 0.8837 3.400 0.000673 ***

## EBITOF -1.8921 0.8158 -2.319 0.020383 *

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 77.277 on 99 degrees of freedom

## Residual deviance: 25.021 on 97 degrees of freedom


## AIC: 31.021

##

## Number of Fisher Scoring iterations: 7
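Given the fitted coefficients, the predicted default probability is the inverse logit of the linear predictor, p = 1/(1 + exp(−(β₀ + β_DMP·DMP + β_EBITOF·EBITOF))). A quick Python check using the coefficients from the summary above, on two hypothetical firms:

```python
import math

# logit coefficients from the summary above
b0, b_dmp, b_ebitof = -5.8695, 3.0051, -1.8921

def p_default(dmp, ebitof):
    # inverse-logit of the linear predictor
    eta = b0 + b_dmp * dmp + b_ebitof * ebitof
    return 1 / (1 + math.exp(-eta))

# a highly leveraged firm with weak interest coverage (hypothetical values)
print(round(p_default(3.0, 0.5), 4))
# a low-leverage firm with strong coverage (hypothetical values)
print(round(p_default(1.0, 2.5), 4))
```

The signs match intuition: higher leverage (DMP) raises the default probability, stronger interest coverage (EBITOF) lowers it.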

Perform prediction on test data sample:

lgt.probs <- predict(lgt.fit, newdata = test, type = 'response')

Prepare a canvas grid to highlight the accept/reject areas on the chart, based on the ranges of the x and y axes in the df.sm data frame. expand.grid() generates a set of x-y coordinates to grid the canvas.

xlim <- range(df.sm$DMP)

xlim

## [1] 0.000000 4.815215

ylim <- range(df.sm$EBITOF)

ylim

## [1] -0.8916338 3.9687379

x <- seq(xlim[1], xlim[2], length = s.s/4)

y <- seq(ylim[1], ylim[2], length = s.s/4)

grid <- expand.grid(x = x,y = y)

names(grid) <- c('DMP', 'EBITOF')

g <- predict(lgt.fit,newdata = grid, type = 'response')

head(g)

## 1 2 3 4 5 6

## 0.01503178 0.02009202 0.02680935 0.03569068 0.04737098 0.06262557
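expand.grid() returns every x-y combination with the first variable varying fastest. The equivalent Cartesian product in Python (with tiny hypothetical axes in place of the 50-point sequences used above):

```python
from itertools import product

x = [0.0, 2.4, 4.8]    # DMP axis values (hypothetical)
y = [-0.9, 1.5, 3.9]   # EBITOF axis values (hypothetical)

# iterate y as the outer loop so that x varies fastest, as expand.grid does
grid = [(xi, yi) for yi, xi in product(y, x)]

print(len(grid), grid[:3])
```

Classifying every point of this lattice and coloring it by predicted class is what paints the accept/reject regions in the charts below.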

plot(EBITOF ~ DMP, data = test,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Campione di test (LGT)',

cex = 1.5,

xlim = xlim,

ylim = ylim)

z <- outer(x, y, function(x,y)predict(lgt.fit,

newdata = data.frame(DMP = x,

EBITOF = y),

type = 'response'))

contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)

points(grid, pch = '.', lwd = 1.25,

col=ifelse(g>=0.2, pal[1],pal[2]))

[Figure: 'Campione di test (LGT)' – test sample scatter with the logit decision region and the 0.2 probability contour]

Confusion matrix to present prediction results:

lgt.pred = rep("No", s.s/2)

lgt.pred[lgt.probs >= 0.20] <- "Si"

tab <- table(lgt.pred, test$Default,

dnn = c('Class. prevista',

'Class. effettiva'))

addmargins(tab)

## Class. effettiva

## Class. prevista No Si Sum

## No 72 1 73

## Si 6 21 27

## Sum 78 22 100

#error rate er

er <- mean(lgt.pred != test$Default); names(er) <- 'Error rate'

# sensitivity

sen <- tab[2,2]/(tab[1,2]+tab[2,2])

names(sen) <- 'Sensitivity'

# specificity

sp <- tab[1,1]/(tab[1,1]+tab[2,1]); names(sp) <- 'Specificity'

er; sen; sp

## Error rate

## 0.07

## Sensitivity

## 0.9545455

## Specificity

## 0.9230769
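The three summary measures follow directly from the confusion-matrix counts printed above; reproduced in Python as a cross-check:

```python
# counts from the confusion matrix above (rows: predicted class)
tn, fn = 72, 1   # predicted 'No': actually No / actually Si
fp, tp = 6, 21   # predicted 'Si': actually No / actually Si
n = tn + fn + fp + tp

error_rate  = (fp + fn) / n    # share of misclassified test firms
sensitivity = tp / (tp + fn)   # defaulters correctly flagged
specificity = tn / (tn + fp)   # non-defaulters correctly cleared

print(error_rate, round(sensitivity, 4), round(specificity, 4))
```

Lowering the cutoff from 0.50 to 0.20 trades some specificity (false alarms on sound borrowers) for higher sensitivity, which is usually the right trade in lending, where missing a defaulter is the costlier error.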


Load the required package MASS for LDA analysis and predict the response on the test sample:

require(MASS)

lda.fit <- lda(data = train, Default ~ DMP + EBITOF)

lda.probs <- predict(lda.fit,

newdata = test, type = 'response')

g.lda <- predict(lda.fit, newdata = grid)

plot(EBITOF ~ DMP, data = test,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Campione di test (LDA)',

cex = 1.5,

xlim = xlim,

ylim = ylim)

z <- outer(x, y, function(x,y)predict(lda.fit,

newdata = data.frame(DMP = x,

EBITOF = y))$posterior[,2])

contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)

points(grid, pch = ".", lwd = 1.25,

col=ifelse(g.lda$posterior[,2]>=0.2, pal[1],pal[2]))

[Figure: 'Campione di test (LDA)' – test sample scatter with the LDA decision region and the 0.2 posterior contour]

Confusion matrix with LDA results and test error rate:

lda.pred <- rep('No', s.s/2)

lda.pred[lda.probs$posterior[,2]>=0.2] <- 'Si'

lda.tab <- table(lda.pred, test$Default,
                 dnn = c('Class. prevista',
                         'Class. effettiva'))

addmargins(lda.tab)

## Class. effettiva

## Class. prevista No Si Sum

## No 72 1 73

## Si 6 21 27

## Sum 78 22 100

lda.er <- mean(lda.pred != test$Default)

lda.er

## [1] 0.07

qda.fit <- qda(data = train, Default ~ DMP + EBITOF)

qda.probs <- predict(qda.fit,

newdata = test, type = "response")

g.qda <- predict(qda.fit,newdata = grid)

Confusion matrix with QDA results:

qda.pred <- rep('No', s.s/2)

qda.pred[qda.probs$posterior[,2] >= 0.2] <- 'Si'

qda.tab <- table(qda.pred, test$Default,
                 dnn = c('Class. prevista',
                         'Class. effettiva'))

addmargins(qda.tab)

## Class. effettiva

## Class. prevista No Si Sum

## No 72 2 74

## Si 6 20 26

## Sum 78 22 100

qda.er <- mean(qda.pred != test$Default)

qda.er

## [1] 0.08

plot(EBITOF ~ DMP, data = test,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Campione di test (QDA)',

cex = 1.5,

xlim = xlim,

ylim = ylim)

z <- outer(x, y, function(x,y)predict(qda.fit,

newdata = data.frame(DMP = x,

EBITOF = y))$posterior[,2])

contour(x, y, z, add = TRUE,

level = 0.20, lwd = 1)


points(grid,

col = ifelse(g.qda$posterior[,2]>=0.2,

pal[1], pal[2]),

pch= '.', cex = 0.5)

[Figure: 'Campione di test (QDA)' – test sample scatter with the QDA decision region and the 0.2 posterior contour]

library(class)

train.X <- cbind(train$DMP, train$EBITOF)

test.X <- cbind(test$DMP, test$EBITOF)

train.Default <- train$Default

set.seed(1)

kk = 15

knn.pred <- knn(train = train.X,

test = test.X,

cl = train.Default,

k = kk,

prob = TRUE)

summary(knn.pred)

## No Si

## 87 13

Confusion matrix and test error rate for the KNN classifier, based on a 0.2 probability cutoff (as opposed to the default 0.50):

knn.pred.prob <- attr(knn.pred, 'prob')

knn.probs <- ifelse(knn.pred == 'No',

1 - knn.pred.prob,knn.pred.prob)

knn.pred.cl <- rep("No", s.s/2)

knn.pred.cl[knn.probs >= 0.2] <- "Si"


knn.tab <- table(knn.pred.cl,

test$Default,

dnn = c('Class. prevista',

'Class. effettiva'))

addmargins(knn.tab)

## Class. effettiva

## Class. prevista No Si Sum

## No 74 2 76

## Si 4 20 24

## Sum 78 22 100

knn.er <- mean(knn.pred.cl != test$Default)

knn.er

## [1] 0.06

Access the knn estimated probabilities via the attr function and transform them into probabilities of default accordingly. Prepare the grid matrix for plotting the decision area (matrix(knn.probs, ...)).

knn.probs <- attr(knn.pred, "prob")

head(knn.probs)

## [1] 1.0000000 0.8000000 1.0000000 0.6000000 0.5333333 1.0000000

knn.probs <- ifelse(knn.pred == 'No',

1 - knn.probs,knn.probs)

head(knn.probs)

## [1] 0.0000000 0.8000000 0.0000000 0.4000000 0.4666667 0.0000000
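The prob attribute returned by knn is the vote share of the *winning* class, so it must be flipped whenever the predicted class is 'No' to obtain a probability of default, exactly as the ifelse above does. A Python sketch of the same transformation (the example vote shares are hypothetical):

```python
def to_default_prob(pred_class, winner_share):
    # winner_share is the vote share of the predicted class;
    # convert it to the share of 'Si' (default) votes
    return winner_share if pred_class == 'Si' else 1 - winner_share

preds = [('No', 1.0), ('Si', 0.8), ('No', 0.6), ('Si', 0.5333333)]
probs = [to_default_prob(c, s) for c, s in preds]
print([round(p, 4) for p in probs])
```

After this flip, all probabilities are on the same scale and a single 0.2 cutoff can be applied, as for the other classifiers.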

knn.probs.kk <- matrix(knn.probs,

length(x), length(y))

z.knn <- knn(train = train.X,

test = grid,

cl = train.Default,

k = kk, prob = TRUE)

z.knn.probs <- attr(z.knn, "prob")

z.knn.probs <- ifelse(z.knn == 'No',

1 - z.knn.probs,

z.knn.probs)

z.knn.probs.kk <- matrix(z.knn.probs,

length(x),

length(y))

g.knn <- knn(train.X,

grid,

train.Default,

k = kk,

prob = TRUE)

g.knn.probs <- attr(g.knn, "prob")

g.knn.probs <- ifelse(g.knn == 'No',
                      1 - g.knn.probs,
                      g.knn.probs)

g.knn.probs.kk <- matrix(g.knn.probs,

length(x),

length(y))

Chart KNN decision boundary and points:

# chart KNN

plot(grid, col = ifelse(g.knn.probs.kk>=0.2,

pal[1], pal[2]),

cex = 0.25,

pch = ".",

main = 'KNN = 15',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP, data = test,

col = ifelse(Default == 'No',

pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

cex = 1.5)

z.knn <- knn(train = train.X,

test = grid,

cl = train.Default,

k = kk, prob = TRUE)

z.knn.probs <- attr(z.knn, "prob")

z.knn.probs <- ifelse(z.knn == 'No',

1 - z.knn.probs,

z.knn.probs)

z.knn.probs.kk <- matrix(z.knn.probs,

length(x),

length(y))

contour(x, y, z.knn.probs.kk,

levels = 0.20,

add = TRUE)

[Figure: 'KNN = 15' – test sample scatter of EBITOF against DMP with the KNN decision region and 0.2 contour]

Plot four aligned charts on the train data set:

par(mfrow=c(2,2))

plot(grid, pch = ".", lwd = 0.25,

col=ifelse(g>=0.2, pal[1],pal[2]),

cex = 0.25, main = 'LGT',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP, data = train,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3), cex = 1.5)

z <- outer(x, y,

function(x,y)predict(lgt.fit,

newdata = data.frame(DMP = x,

EBITOF = y),

type = 'response'))

contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)

plot(grid, pch = ".", lwd = 0.25,

col=ifelse(g.lda$posterior[,2]>=0.2,

pal[1],pal[2]),

main = 'LDA',

cex = 0.25,

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP, data = train,

col = ifelse(Default == 'No',

pal[2], pal[1]),


pch = ifelse(Default == 'No', 1, 3),

cex = 1.5)

z <- outer(x, y,

function(x,y)predict(lda.fit,

newdata = data.frame(DMP = x,

EBITOF = y))$posterior[,2])

contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)

plot(grid,

col = ifelse(g.qda$posterior[,2]>=0.2, pal[1], pal[2]),

pch= '.', cex = 0.25,

main = 'QDA',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP,

data = train,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3), cex = 1.5)

z <- outer(x, y,

function(x,y)predict(qda.fit,

newdata = data.frame(DMP = x,

EBITOF = y))$posterior[,2])

contour(x, y, z,

add = TRUE,

level = 0.20, lwd = 1)

plot(grid,

col = ifelse(g.knn.probs.kk>=0.2, pal[1], pal[2]),

cex = 0.25, pch = ".", main = 'KNN = 15',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP,

data = train,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3), cex = 1.5)

z.knn <- knn(train = train.X,

test = grid,

cl = train.Default,

k = kk,

prob = TRUE)

z.knn.probs <- attr(z.knn, "prob")

z.knn.probs <- ifelse(z.knn == 'No',

1 - z.knn.probs,

z.knn.probs)

z.knn.probs.kk <- matrix(z.knn.probs,
                         length(x),
                         length(y))

contour(x, y, z.knn.probs.kk,

levels = 0.20,

add = TRUE)

[Figure: 2×2 panel of decision regions on the train sample – LGT, LDA, QDA and KNN = 15, each with its 0.2 contour]

Plot four aligned charts of the decision rules on the test data set:

par(mfrow=c(2,2))

plot(grid, pch = ".", lwd = 0.25,

col=ifelse(g>=0.2, pal[1],pal[2]),

cex = 0.25, main = 'LGT',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP, data = test,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3), cex = 1.5)

z <- outer(x, y,
           function(x,y)predict(lgt.fit,
                                newdata = data.frame(DMP = x,
                                                     EBITOF = y),
                                type = 'response'))

contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)

plot(grid, pch = ".", lwd = 0.25,

col=ifelse(g.lda$posterior[,2]>=0.2,

pal[1],pal[2]),

main = 'LDA',

cex = 0.25,

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP, data = test,

col = ifelse(Default == 'No',

pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

cex = 1.5)

z <- outer(x, y,

function(x,y)predict(lda.fit,

newdata = data.frame(DMP = x,

EBITOF = y))$posterior[,2])

contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)

plot(grid,

col = ifelse(g.qda$posterior[,2]>=0.2, pal[1], pal[2]),

pch= '.', cex = 0.25,

main = 'QDA',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP,

data = test,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3), cex = 1.5)

z <- outer(x, y,

function(x,y)predict(qda.fit,

newdata = data.frame(DMP = x,

EBITOF = y))$posterior[,2])

contour(x, y, z,

add = TRUE,

level = 0.20, lwd = 1)

plot(grid,

col = ifelse(g.knn.probs.kk>=0.2, pal[1], pal[2]),

cex = 0.25, pch = ".", main = 'KNN = 15',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP,
       data = test,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3), cex = 1.5)

z.knn <- knn(train = train.X,

test = grid,

cl = train.Default,

k = kk,

prob = TRUE)

z.knn.probs <- attr(z.knn, "prob")

z.knn.probs <- ifelse(z.knn == 'No',

1 - z.knn.probs,

z.knn.probs)

z.knn.probs.kk <- matrix(z.knn.probs,

length(x),

length(y))

contour(x, y, z.knn.probs.kk,

levels = 0.20,

add = TRUE)

[Figure: 2×2 panel of decision regions on the test sample – LGT, LDA, QDA and KNN = 15, each with its 0.2 contour]

Plot naïve predictors based on financial ratios (EBITOF and DMP) on the train data sample:

par(mfrow = c(1, 2))
plot(EBITOF ~ DMP, data = train,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3), cex = 1.5,
     xlim = xlim, ylim = ylim)
points(grid, pch = ".", lwd = 0.25,
       col = ifelse(grid$EBITOF <= 1.2, pal[1], pal[2]))
abline(h = 1.2, lty = 2, lwd = 1)
plot(EBITOF ~ DMP, data = train,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3), cex = 1.5,
     xlim = xlim, ylim = ylim)
points(grid, pch = ".", lwd = 0.25,
       col = ifelse(grid$DMP >= 2.0, pal[1], pal[2]))
abline(v = 2.0, lty = 2, lwd = 1)

[Figure: two EBITOF vs. DMP scatter plots of the train sample, with the naïve cut-offs EBITOF = 1.2 (dashed horizontal line) and DMP = 2.0 (dashed vertical line).]

par(mfrow=c(1,1))

Plot naïve predictors based on financial ratios (EBITOF and DMP) on the test data sample:

par(mfrow = c(1, 2))
plot(EBITOF ~ DMP, data = test,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3), cex = 1.5,
     xlim = xlim, ylim = ylim)
points(grid, pch = ".", lwd = 0.25,
       col = ifelse(grid$EBITOF <= 1.2, pal[1], pal[2]))
abline(h = 1.2, lty = 2, lwd = 1)
plot(EBITOF ~ DMP, data = test,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3), cex = 1.5,
     xlim = xlim, ylim = ylim)
points(grid, pch = ".", lwd = 0.25,
       col = ifelse(grid$DMP >= 2.0, pal[1], pal[2]))
abline(v = 2.0, lty = 2, lwd = 1)


[Figure: two EBITOF vs. DMP scatter plots of the test sample, with the naïve cut-offs EBITOF = 1.2 and DMP = 2.0 as dashed lines.]

par(mfrow = c(1, 1))

Use the naïve predictors, compounding them in a logical AND decision rule (EBITOF ≤ 1.2 AND DMP ≥ 2.0):

plot(EBITOF ~ DMP, data = train,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3), cex = 1.5,
     xlim = xlim, ylim = ylim, main = 'Train data set')
points(grid, pch = ".", lwd = 0.25,
       col = ifelse((grid$EBITOF <= 1.2) & (grid$DMP >= 2.0),
                    pal[1], pal[2]))
abline(h = 1.2, lty = 2, lwd = 2)
abline(v = 2.0, lty = 2, lwd = 2)
plot(EBITOF ~ DMP, data = test,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3), cex = 1.5,
     xlim = xlim, ylim = ylim, main = 'Test data set')
points(grid, pch = ".", lwd = 0.25,
       col = ifelse((grid$EBITOF <= 1.2) & (grid$DMP >= 2.0),
                    pal[1], pal[2]))
abline(h = 1.2, lty = 2, lwd = 2)
abline(v = 2.0, lty = 2, lwd = 2)


[Figure: EBITOF vs. DMP scatter plots ("Train data set" and "Test data set") with the combined rule region (EBITOF ≤ 1.2 AND DMP ≥ 2.0) shaded and both dashed cut-off lines drawn.]

edmp.pred <- rep('No', s.s/2)
edmp.pred[(test$EBITOF <= 1.2) & (test$DMP >= 2.0)] <- 'Si'
edmp.tab <- table(edmp.pred,
                  test$Default,
                  dnn = c('Class. prevista', 'Class. effettiva'))
addmargins(edmp.tab)

##                Class. effettiva
## Class. prevista  No  Si Sum
##             No   77   4  81
##             Si    1  18  19
##             Sum  78  22 100

edmp.er <- mean(edmp.pred != test$Default)
edmp.er

## [1] 0.05

Prepare data frame for ROC curves:

require(pROC)
res <- data.frame(Default = test$Default,
                  LGT = lgt.probs,
                  LDA = lda.probs$posterior[,2],
                  QDA = qda.probs$posterior[,2],
                  KNN = knn.probs,
                  EBITOF = test$EBITOF,
                  DMP = test$DMP)
head(res)

##     Default         LGT         LDA         QDA       KNN     EBITOF       DMP
## 198      No 0.019904605 0.012715513 0.009984894 0.0000000 2.05021166 1.9473854
## 978      Si 0.997576429 0.999267410 0.999048875 0.8000000 0.02209348 3.9704303
## 740      No 0.006655441 0.003060701 0.002081429 0.0000000 3.31898715 2.3772296
## 974      Si 0.562134689 0.619997321 0.585406157 0.4000000 1.17433535 2.7757477
## 14       No 0.766882711 0.818991764 0.813543896 0.4666667 1.56117928 3.3324449
## 258      No 0.002403967 0.001199350 0.001596928 0.0000000 1.63575939 0.9771178

pal1 <- brewer.pal(7, "Dark2")
par(mfrow = c(1, 2))
plot.roc(Default ~ LGT, data = res,
         main = "Curve ROC machine learning su test",
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 70,
         col = pal1[1], grid = TRUE)
plot.roc(Default ~ LDA, data = res, add = TRUE,
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 60,
         col = pal1[2])
plot.roc(Default ~ QDA, data = res, add = TRUE,
         thresholds = "best", print.thres = "best",
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 50,
         col = pal1[3])
plot.roc(Default ~ KNN, data = res, add = TRUE,
         thresholds = "best", print.thres = "best",
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 40,
         col = pal1[4])
legend("bottomright", legend = c("LGT", "LDA", "QDA", "KNN"),
       col = c(pal1[1], pal1[2], pal1[3], pal1[4]),
       lwd = 2)
# second plot: naive classifiers
plot.roc(Default ~ EBITOF, data = res,
         thresholds = "best",
         main = "Curve ROC class. naive su test",
         print.thres = "best",
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 70,
         col = pal1[5], grid = TRUE)
plot.roc(Default ~ DMP, data = res, add = TRUE,
         thresholds = "best", print.thres = "best",
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 60,
         col = pal1[6])
legend("bottomright", legend = c("EBITOF", "DMP"),
       col = c(pal1[5], pal1[6]),
       lwd = 2)

[Figure: ROC curves on the test set. Left panel, "Curve ROC machine learning su test": LGT (AUC 96.8%), LDA (AUC 97.0%), QDA (AUC 97.0%) and KNN (AUC 94.4%), with best thresholds 0.2 printed at (91.0%, 95.5%) and (94.9%, 90.9%) specificity/sensitivity. Right panel, "Curve ROC class. naive su test": EBITOF (AUC 91.7%, threshold 1.2 at 91.0%, 81.8%) and DMP (AUC 93.1%, threshold 2.0 at 82.1%, 100.0%).]

par(mfrow = c(1,1))

2 Cross validation

n.iter <- 100
lgt.er <- rep(0, n.iter)
lda.er <- rep(0, n.iter)
qda.er <- rep(0, n.iter)
knn.er <- rep(0, n.iter)


for (i in 1:n.iter) {
  set.seed(i)
  z <- sort(sample(nrow(df.sm), nrow(df.sm) * 0.5))
  train <- df.sm[z, ]
  test <- df.sm[-z, ]
  # logistic regression
  lgt.fit <- glm(Default ~ DMP + EBITOF,
                 data = train,
                 family = 'binomial')
  lgt.probs <- predict(lgt.fit, newdata = test, type = 'response')
  lgt.pred <- rep("No", s.s/2)
  lgt.pred[lgt.probs >= 0.20] <- "Si"
  lgt.er[i] <- mean(lgt.pred != test$Default)
  # LDA
  lda.fit <- lda(data = train, Default ~ DMP + EBITOF)
  lda.probs <- predict(lda.fit, newdata = test, type = 'response')
  lda.pred <- rep('No', s.s/2)
  lda.pred[lda.probs$posterior[,2] >= 0.2] <- 'Si'
  lda.er[i] <- mean(lda.pred != test$Default)
  # QDA
  qda.fit <- qda(data = train, Default ~ DMP + EBITOF)
  qda.probs <- predict(qda.fit, newdata = test, type = "response")
  qda.pred <- rep('No', s.s/2)
  qda.pred[qda.probs$posterior[,2] >= 0.2] <- 'Si'
  qda.er[i] <- mean(qda.pred != test$Default)
  # KNN
  train.X <- cbind(train$DMP, train$EBITOF)
  test.X <- cbind(test$DMP, test$EBITOF)
  train.Default <- train$Default
  kk <- 15
  knn.pred <- knn(train = train.X,
                  test = test.X,
                  cl = train.Default,
                  k = kk, prob = TRUE)
  knn.pred.prob <- attr(knn.pred, 'prob')
  knn.probs <- ifelse(knn.pred == 'No',
                      1 - knn.pred.prob,
                      knn.pred.prob)
  knn.pred.cl <- rep('No', s.s/2)
  knn.pred.cl[knn.probs >= 0.2] <- 'Si'
  knn.er[i] <- mean(knn.pred.cl != test$Default)
}
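A detail worth isolating from the loop above: class::knn() stores in attr(·, 'prob') the vote share of the winning class, not of a fixed class, so the ifelse() converts it into the probability of 'Si'. A minimal sketch of that conversion, on toy values and without the class package:

```r
# Toy predicted labels and winning-class vote shares (hypothetical values).
pred  <- factor(c('No', 'Si', 'No'))
wprob <- c(0.80, 0.60, 0.55)
# Convert the winning-class share into P(Si), as done in the loop above.
p.si <- ifelse(pred == 'No', 1 - wprob, wprob)
p.si  # 0.20 0.60 0.45
```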


df.res <- data.frame(LGT = lgt.er,
                     LDA = lda.er,
                     QDA = qda.er,
                     KNN = knn.er)
head(df.res)

##    LGT  LDA  QDA  KNN
## 1 0.07 0.07 0.08 0.06
## 2 0.07 0.07 0.07 0.08
## 3 0.09 0.08 0.11 0.09
## 4 0.04 0.05 0.05 0.04
## 5 0.06 0.09 0.07 0.08
## 6 0.07 0.07 0.06 0.06

require(tidyr)
df.res.n <- gather(df.res, "Model", "Error", 1:4)
head(df.res.n)

##   Model Error
## 1   LGT  0.07
## 2   LGT  0.07
## 3   LGT  0.09
## 4   LGT  0.04
## 5   LGT  0.06
## 6   LGT  0.07
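The same wide-to-long reshaping can be reproduced in base R with stack(), a handy cross-check when tidyr is not available. A sketch on toy values shaped like df.res:

```r
# Toy wide data frame with two model columns (hypothetical values).
df.err <- data.frame(LGT = c(0.07, 0.07, 0.09),
                     LDA = c(0.07, 0.07, 0.08))
long <- stack(df.err)               # columns: values, ind
names(long) <- c("Error", "Model")  # match gather()'s column names
head(long)
```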

require(RColorBrewer)
pal2 <- brewer.pal(5, 'Dark2')
boxplot(data = df.res.n,
        Error ~ Model,
        col = pal2,
        main = "Test error rate su validation set (100 iteraz.)")


[Figure: boxplots of the validation-set test error rates for LGT, LDA, QDA and KNN over 100 iterations ("Test error rate su validation set (100 iteraz.)"); errors range roughly from 0.02 to 0.14.]
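Beyond the boxplot, mean errors per model can be compared directly with colMeans(). A sketch re-entering by hand only the six iterations printed in head(df.res) above, so illustrative rather than the full 100-iteration average:

```r
# The six CV iterations reported above, re-entered for illustration.
df.res6 <- data.frame(LGT = c(0.07, 0.07, 0.09, 0.04, 0.06, 0.07),
                      LDA = c(0.07, 0.07, 0.08, 0.05, 0.09, 0.07),
                      QDA = c(0.08, 0.07, 0.11, 0.05, 0.07, 0.06),
                      KNN = c(0.06, 0.08, 0.09, 0.04, 0.08, 0.06))
round(colMeans(df.res6), 4)  # LGT 0.0667, LDA 0.0717, QDA 0.0733, KNN 0.0683
```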

3 Intuitive predictors

Again, considering the paragraphs by Kahneman quoted at the beginning of this essay, I find it appropriate to quote the sensible contribution of Prof. Tagliavini in Biffis et al. (2014, page 144):

There are solutions that are elegant and precise and solutions that are rough and approximate: the former are not necessarily better than the latter.

Intuitive predictors are what loan managers need to make a diagnosis at a glance. Consider an intuitive predictor like:

DMP − EBITOF ≥ C (1)

In other words, when the difference between the level of indebtedness (DMP) and the margin over financial charges (EBITOF) rises above a certain level (−1 in the figure below, since EBITOF is on the vertical axis), we enter an area of excessive risk.

The data frame res must be used here, because train and test were re-seeded during the validation-set phase.
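One way to choose the cut-off C in (1) is to scan candidate values and keep the one with the lowest error rate. A sketch on synthetic data — the variable names mimic the real columns, and the toy ground truth is built with C = 1 so the scan recovers it by construction (the document itself fixes C = 1 by inspecting the plots):

```r
# Synthetic stand-ins for the real DMP/EBITOF/Default columns.
set.seed(1)
n      <- 200
DMP    <- runif(n, 0, 5)
EBITOF <- runif(n, -1, 4)
truth  <- ifelse(DMP - EBITOF >= 1, 'Si', 'No')  # toy ground truth, C = 1
cands  <- seq(-2, 2, by = 0.5)
errs   <- sapply(cands, function(C)
  mean(ifelse(DMP - EBITOF >= C, 'Si', 'No') != truth))
cands[which.min(errs)]  # recovers C = 1 by construction
```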

plot(EBITOF ~ DMP, data = res,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3),
     main = 'Previsore intuitivo (DMP - EBITOF)',
     cex = 1.5, xlim = xlim, ylim = ylim)
int <- outer(x, y, function(x, y) y - x)
contour(x, y, int, add = TRUE, level = c(-2.0, -1.0, 0.0),
        lwd = 2, lty = 2, col = pal[4])
plot(EBITOF ~ DMP, data = res,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3),
     main = 'LDA e previsore intuitivo (DMP - EBITOF)',
     cex = 1.5, xlim = xlim, ylim = ylim)
z <- outer(x, y, function(x, y) predict(lda.fit,
                                        newdata = data.frame(DMP = x,
                                                             EBITOF = y))$posterior[,2])
contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)
int <- outer(x, y, function(x, y) y - x)
contour(x, y, int, add = TRUE, level = c(-1.0),
        lwd = 2, lty = 2, col = pal[4])

[Figure: two EBITOF vs. DMP scatter plots. Left, "Previsore intuitivo (DMP − EBITOF)": dashed level curves at −2, −1 and 0. Right, "LDA e previsore intuitivo (DMP − EBITOF)": the LDA 0.2 boundary together with the dashed −1 level curve.]

Thus we have:

DMP − EBITOF ≥ 1 (2)

or equivalently, on the x-y chart where EBITOF is on the y axis:

EBITOF ≤ DMP − 1 (3)
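The equivalence of (2) and (3) is just a rearrangement, but it can also be checked numerically, here on toy values rather than the data set:

```r
# Any point classified by rule (2) is classified identically by rule (3).
DMP    <- c(0.5, 2.0, 3.5)
EBITOF <- c(1.0, 0.5, 3.0)
identical(DMP - EBITOF >= 1, EBITOF <= DMP - 1)  # TRUE
```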

The classifier in (3) gives the following outcome:


int <- res$EBITOF - res$DMP
int.pred <- rep('No', s.s/2)
int.pred[int <= -1.0] <- 'Si'
int.tab <- table(int.pred, res$Default,
                 dnn = c('Class. prevista', 'Class. effettiva'))
addmargins(int.tab)

##                Class. effettiva
## Class. prevista  No  Si Sum
##             No   73   2  75
##             Si    5  20  25
##             Sum  78  22 100

# error rate
int.er <- mean(int.pred != res$Default)
names(int.er) <- 'Error rate'
# sensitivity
int.sen <- int.tab[2,2] / (int.tab[1,2] + int.tab[2,2])
names(int.sen) <- 'Sensitivity'
# specificity
int.sp <- int.tab[1,1] / (int.tab[1,1] + int.tab[2,1])
names(int.sp) <- 'Specificity'
int.er; int.sen; int.sp

## Error rate
##       0.07
## Sensitivity
##   0.9090909
## Specificity
##   0.9358974
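The three metrics can be wrapped in a small helper; the function name here is mine, not the document's, and the sketch is checked against the confusion-matrix counts reported above:

```r
# Metrics from a 2x2 confusion table: rows = predicted, cols = actual,
# both in the order ('No', 'Si') used throughout the document.
class_metrics <- function(tab) {
  c(error       = (tab[1, 2] + tab[2, 1]) / sum(tab),
    sensitivity = tab[2, 2] / sum(tab[, 2]),
    specificity = tab[1, 1] / sum(tab[, 1]))
}
tab <- matrix(c(73, 5, 2, 20), nrow = 2)  # counts from addmargins(int.tab)
round(class_metrics(tab), 4)  # 0.0700, 0.9091, 0.9359
```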

res$INTUIT <- res$DMP - res$EBITOF
plot.roc(Default ~ LDA, data = res,
         grid = TRUE,
         main = "Curve ROC class. LDA e intuitivo",
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 70,
         col = pal2[2])
plot.roc(Default ~ INTUIT, data = res, add = TRUE,
         thresholds = c(1.0), print.thres = c(1.0),
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 60,
         col = pal2[5])
legend("bottomright", legend = c("LDA", "DMP - EBITOF"),
       col = c(pal2[2], pal2[5]),
       lwd = 2)


[Figure: ROC curves "Curve ROC class. LDA e intuitivo": LDA (AUC 97.0%) and the intuitive predictor DMP − EBITOF (AUC 97.4%, threshold 1.0 at 93.6% specificity, 90.9% sensitivity).]

Let’s test the intuitive predictor in a validation set:

n.iter <- 100
lgt.er <- rep(0, n.iter)
lda.er <- rep(0, n.iter)
qda.er <- rep(0, n.iter)
knn.er <- rep(0, n.iter)
int.er <- rep(0, n.iter)
eBit.er <- rep(0, n.iter)
dMp.er <- rep(0, n.iter)

for (i in 1:n.iter) {
  set.seed(i)
  z <- sort(sample(nrow(df.sm), nrow(df.sm) * 0.5))
  train <- df.sm[z, ]
  test <- df.sm[-z, ]
  # logistic regression
  lgt.fit <- glm(Default ~ DMP + EBITOF,
                 data = train,
                 family = 'binomial')
  lgt.probs <- predict(lgt.fit,
                       newdata = test,
                       type = 'response')
  lgt.pred <- rep("No", s.s/2)
  lgt.pred[lgt.probs >= 0.20] <- "Si"
  lgt.er[i] <- mean(lgt.pred != test$Default)
  # LDA
  lda.fit <- lda(data = train, Default ~ DMP + EBITOF)
  lda.probs <- predict(lda.fit,
                       newdata = test,
                       type = 'response')
  lda.pred <- rep('No', s.s/2)
  lda.pred[lda.probs$posterior[,2] >= 0.2] <- 'Si'
  lda.er[i] <- mean(lda.pred != test$Default)
  # QDA
  qda.fit <- qda(data = train, Default ~ DMP + EBITOF)
  qda.probs <- predict(qda.fit,
                       newdata = test,
                       type = "response")
  qda.pred <- rep('No', s.s/2)
  qda.pred[qda.probs$posterior[,2] >= 0.2] <- 'Si'
  qda.er[i] <- mean(qda.pred != test$Default)
  # KNN
  train.X <- cbind(train$DMP, train$EBITOF)
  test.X <- cbind(test$DMP, test$EBITOF)
  train.Default <- train$Default
  kk <- 15
  knn.pred <- knn(train = train.X,
                  test = test.X,
                  cl = train.Default,
                  k = kk, prob = TRUE)
  knn.pred.prob <- attr(knn.pred, 'prob')
  knn.probs <- ifelse(knn.pred == 'No',
                      1 - knn.pred.prob,
                      knn.pred.prob)
  knn.pred.cl <- rep('No', s.s/2)
  knn.pred.cl[knn.probs >= 0.2] <- 'Si'
  knn.er[i] <- mean(knn.pred.cl != test$Default)
  # INT
  int <- test$EBITOF - test$DMP
  int.pred <- rep('No', s.s/2)
  int.pred[int <= -1.0] <- 'Si'
  int.er[i] <- mean(int.pred != test$Default)
  # EBITOF
  eBit <- test$EBITOF
  eBit.pred <- rep('No', s.s/2)
  eBit.pred[eBit <= 1.2] <- 'Si'
  eBit.er[i] <- mean(eBit.pred != test$Default)
  # DMP
  dMp <- test$DMP
  dMp.pred <- rep('No', s.s/2)
  dMp.pred[dMp >= 2.0] <- 'Si'
  dMp.er[i] <- mean(dMp.pred != test$Default)
}

Plot CV test error rates of predictors:

df.res <- data.frame(LGT = lgt.er,
                     LDA = lda.er,
                     QDA = qda.er,
                     KNN = knn.er,
                     INT = int.er,
                     EOF = eBit.er,
                     DMP = dMp.er)
head(df.res)

##    LGT  LDA  QDA  KNN  INT  EOF  DMP
## 1 0.07 0.07 0.08 0.06 0.07 0.11 0.16
## 2 0.07 0.07 0.07 0.08 0.06 0.17 0.25
## 3 0.09 0.08 0.11 0.09 0.08 0.16 0.24
## 4 0.04 0.05 0.05 0.04 0.05 0.16 0.19
## 5 0.06 0.09 0.07 0.08 0.04 0.12 0.14
## 6 0.07 0.07 0.06 0.06 0.04 0.10 0.20

require(tidyr)
df.res.n <- gather(df.res, "Model", "Error", 1:7)
head(df.res.n)

##   Model Error
## 1   LGT  0.07
## 2   LGT  0.07
## 3   LGT  0.09
## 4   LGT  0.04
## 5   LGT  0.06
## 6   LGT  0.07

# compute index of ordered 'cost factor' and reassign
# oind <- order(as.numeric(by(DF$cost, DF$type, median)))
oind <- order(as.numeric(by(df.res.n$Error,
                            df.res.n$Model,
                            median)))
# DF$type <- ordered(DF$type, levels = levels(DF$type)[oind])
df.res.n$Model <- ordered(df.res.n$Model,
                          levels = levels(df.res.n$Model)[oind])
# boxplot(cost ~ type, data = DF)
require(RColorBrewer)
pal2 <- brewer.pal(7, 'Dark2')
boxplot(data = df.res.n,
        Error ~ Model,
        col = pal2,
        main = "Test error rate CV (valid. set 100 iteraz.)")
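The reordering trick used here can be seen in isolation on toy data: by() computes the per-group medians in level order, and order() then gives the permutation of the levels from lowest to highest median:

```r
# Toy error rates for three models A, B, C (hypothetical values).
err   <- c(0.05, 0.07, 0.02, 0.03, 0.09, 0.08)
model <- factor(rep(c('A', 'B', 'C'), each = 2))
oind  <- order(as.numeric(by(err, model, median)))
levels(model)[oind]  # "B" "A" "C": lowest to highest median error
```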

[Figure: boxplots of the CV test error rates ("Test error rate CV (valid. set 100 iteraz.)"), with models ordered by median error: INT, LGT, LDA, QDA, KNN, EOF, DMP; errors range roughly from 0.05 to 0.25.]

Plot the intuitive predictor on the train data sample:

plot(EBITOF ~ DMP, data = trainn,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3),
     main = '(DMP - EBITOF = 1) | train sample',
     cex = 1.5, xlim = xlim, ylim = ylim)
points(grid, pch = ".", lwd = 0.25,
       col = ifelse(grid$DMP - grid$EBITOF >= 1, pal[1], pal[2]))
int <- outer(x, y, function(x, y) y - x)
contour(x, y, int, add = TRUE, level = c(-1.0), lwd = 1.5, lty = 2)


[Figure: "(DMP − EBITOF = 1) | train sample": EBITOF vs. DMP on the train sample with the dashed decision boundary DMP − EBITOF = 1.]

Plot the intuitive predictor on the test data sample:

plot(EBITOF ~ DMP, data = testt,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3),
     main = '(DMP - EBITOF = 1) | test sample',
     cex = 1.5, xlim = xlim, ylim = ylim)
points(grid, pch = ".", lwd = 0.25,
       col = ifelse(grid$DMP - grid$EBITOF >= 1, pal[1], pal[2]))
int <- outer(x, y, function(x, y) y - x)
contour(x, y, int, add = TRUE, level = c(-1.0), lwd = 1.5, lty = 2)


[Figure: "(DMP − EBITOF = 1) | test sample": EBITOF vs. DMP on the test sample with the dashed decision boundary DMP − EBITOF = 1.]

4 Rating class clustering

Save the firm's id number, now in the row names, in a separate column: it'll be needed later, as row names will be dropped by applying a function.

df.sm$Firm <- row.names(df.sm)
df.sm.dat <- df.sm[ , 2:3]
# set seed and sample a smaller set too
set.seed(123)
df.sm1 <- sample_n(df.sm, 20, replace = FALSE)
df.sm1.dat <- df.sm1[, 2:3]
# first clustering
hc.comp <- hclust(dist(df.sm.dat), method = 'complete')
cl <- cutree(hc.comp, k = 4)
df.sm$Cluster <- as.factor(cl)
# second clustering, on the small sample
hc.sm <- hclust(dist(df.sm1.dat), method = 'complete')
cl.sm <- cutree(hc.sm, k = 4)
df.sm1$Cluster <- as.factor(cl.sm)
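The hclust()/cutree() pair used above can be sanity-checked on toy data with two well-separated groups; a sketch (the real data are of course messier):

```r
# Two 10-point Gaussian clouds centred at 0 and at 10.
set.seed(42)
toy <- rbind(matrix(rnorm(20, mean = 0), 10, 2),
             matrix(rnorm(20, mean = 10), 10, 2))
hc.toy <- hclust(dist(toy), method = 'complete')
table(cutree(hc.toy, k = 2))  # two clusters of 10 points each
```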

# contingency table of defaults/non-defaults per cluster
tab.rat <- table(df.sm$Default, df.sm$Cluster)
# number of firms per cluster
tab.df.rat <- table(df.sm$Cluster); tab.df.rat

##
##  1  2  3  4
## 98 23 68 11

tab.df.fr <- prop.table(table(df.sm$Default, df.sm$Cluster),
                        margin = 2)
tab.df.fr[2,]

##          1          2          3          4
## 0.08163265 0.86956522 0.00000000 0.63636364

# same tables on the small sample
tab.rat1 <- table(df.sm1$Default, df.sm1$Cluster)
tab.df1.rat <- table(df.sm1$Cluster); tab.df1.rat

##
## 1  2  3  4
## 4 10  3  3

tab.df1.fr <- prop.table(table(df.sm1$Default, df.sm1$Cluster),
                         margin = 2)
tab.df1.fr[2,]

##    1    2    3    4
## 0.25 0.10 0.00 1.00
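The key step is prop.table(·, margin = 2), which normalizes each column (cluster) to sum to one, so that the second row is the default share per cluster. A toy check:

```r
# Five toy firms: default flag and cluster label (hypothetical values).
m <- table(def = c('No', 'Si', 'No', 'Si', 'Si'),
           cluster = c(1, 1, 1, 2, 2))
prop.table(m, margin = 2)[2, ]  # default share: 1/3 in cluster 1, 1 in cluster 2
```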

Compute probabilities of default on the mid and small samples, and write the table to disk as a handout for students.

require(dplyr)
df.sm <- df.sm %>%
  group_by(Cluster) %>%
  mutate(PD = prop.table(table(Default))[2])
head(df.sm$PD)

## [1] 0.08163265 0.86956522 0.08163265 0.08163265 0.00000000 0.86956522

df.sm <- df.sm[order(df.sm$PD), ]
df.sm$PD <- as.factor(round(df.sm$PD, digits = 2))
df.sm1 <- df.sm1 %>%
  group_by(Cluster) %>%
  mutate(PD = prop.table(table(Default))[2])
head(df.sm1$PD)

## [1] 0.25 0.10 0.00 0.10 0.10 0.10

df.sm1 <- df.sm1[order(df.sm1$PD), ]
df.sm1$PD <- as.factor(round(df.sm1$PD, digits = 2))
write.table(df.sm1, file = "train_rat_20.csv",
            sep = ";",
            dec = ",")

Chart rating classes on scatterplot:

xx <- range(df.sm$DMP); yy <- range(df.sm$EBITOF)
pl.rat <- ggplot(df.sm, aes(DMP, EBITOF,
                            color = PD,
                            shape = Default)) +
  geom_point(size = 3, alpha = 0.5) +
  scale_color_manual(values = c(pal[2], pal[3], pal[4], pal[1])) +
  scale_shape_manual(values = c(1, 3)) +
  xlim(xx) + ylim(yy)
pl.rat +
  geom_abline(intercept = -1, slope = 1, linetype = 'dashed')
pl.rat1 <- ggplot(df.sm1, aes(DMP, EBITOF,
                              color = PD,
                              shape = Default,
                              label = Firm)) +
  geom_point(size = 3, alpha = 1.0) +
  scale_color_manual(values = c(pal[2], pal[3], pal[4], pal[1])) +
  scale_shape_manual(values = c(1, 3)) +
  xlim(xx) + ylim(yy) +
  geom_text(hjust = 1.25, vjust = 1.25, size = I(3))
pl.rat1 +
  geom_abline(intercept = -1, slope = 1, linetype = 'dashed')


[Figure: two EBITOF vs. DMP scatter plots colored by rating-class PD (levels 0, 0.08, 0.64, 0.87 on the full sample; 0, 0.1, 0.25, 1 on the 20-firm sample, where points are labeled with firm ids), both with the dashed line EBITOF = DMP − 1.]

Finally, plot a plain dendrogram of the hierarchical clustering with no aesthetic attributes.

plot(hc.comp, cex = 0.2,
     xlab = "", sub = "", main = "")
rect.hclust(hc.comp, k = 4, border = pal[1])
# second plot
plot(hc.sm, cex = 1.0,
     xlab = '', sub = '', main = '')
rect.hclust(hc.sm, k = 4, border = pal[1])

[Figure: complete-linkage dendrograms of the full sample and of the 20-firm sample, each with the four rating-class clusters boxed.]

The excellent package dendextend by Galili (2015) is now required to adjust the colors, nodes and branches of the rating-class clusters on the dendrogram.

require(dendextend)
require(colorspace)
pl.rat1
hc <- as.dendrogram(hc.sm) %>%
  rotate(c(4, 1, 2, 3))
hc %>% set("labels_cex", 1.0) %>%
  set("labels_col", value = c(pal[1], pal[4], pal[3], pal[2]), k = 4) %>%
  set("branches_k_color", value = c(pal[1], pal[4], pal[3], pal[2]), k = 4) %>%
  plot(main = "")
hc %>% rect.dendrogram(k = 4,
                       border = 8, lty = 5, lwd = 1.0)
# abline(h = 2.25, lty = 2, lwd = 2.0)

[Figure: the 20-firm scatter plot colored by PD alongside the rotated, colored dendrogram of the same sample with the four clusters boxed.]

hc %>% set("labels_cex", 1.0) %>%
  set("labels_col", value = c(pal[1], pal[4], pal[3], pal[2]), k = 4) %>%
  set("branches_k_color", value = c(pal[1], pal[4], pal[3], pal[2]), k = 4) %>%
  plot(main = "")
hc %>% rect.dendrogram(k = 4,
                       border = 8, lty = 5, lwd = 1.0)

hc.all <- as.dendrogram(hc.comp)
hc.all %>%
  set("labels_cex", 0.1) %>%
  set("labels_col", value = c(pal[1], pal[4], pal[2], pal[3]), k = 4) %>%
  set("branches_k_color", value = c(pal[1], pal[4], pal[2], pal[3]), k = 4) %>%
  plot(main = "")
hc.all %>% rect.dendrogram(k = 4,
                           border = 8, lty = 5, lwd = 1.0)

[Figure: the colored dendrograms with boxed clusters, for the 20-firm sample and for the full sample.]


References

Biffis, Paolo, ed. (2014), with contributions by M. S. Avi, G. Tagliavini and F. Zen, Analisi del Merito di Credito, EIF e-Book.

Galili, Tal (2015), "dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering", Bioinformatics.

Kahneman, Daniel (2011), Thinking, Fast and Slow, Penguin.
