
An R tutorial on statistical, naïve and intuitive predictors in credit risk classification

Rodolfo Vanzini — Bologna.

September 15, 2015

By quoting extensively the extraordinary work of Kahneman (2011) I'd like to explain the rationale of this work on credit risk classification during training sessions in traditional classroom settings – like the ones that I have conducted in the last three years – for loan officers employed by banks.

Why are experts so inferior to algorithms? One reason [. . . ] is that experts try to be clever, think outside the box, and consider combinations of features in making their predictions. [. . . ] Complexity may work in the odd case, but more often than not it reduces validity. (Kahneman 2011, page 224)

Another reason for the inferiority of expert judgment is that humans are incorrigibly inconsistent in making summary judgments of complex information. When asked to evaluate the same information twice, they frequently give different answers. (Kahneman 2011, page 224)

The research suggests a surprising conclusion: to maximize predictive accuracy, final decisions should be left to formulas, especially in low-validity environments. (Kahneman 2011, page 225)

[Dawes] observed that the complex statistical algorithm adds little or no value. One can do just as well by selecting a set of scores that have some validity for predicting the outcome and adjusting the values to make them comparable [. . . ] it is possible to develop useful algorithms without any prior statistical research. Simple equally weighted formulas based on existing statistics or on common sense are often very good predictors of significant outcomes. [. . . ] The important conclusion of this research is that an algorithm that is constructed on the back of an envelope is often good enough to compete with an optimally weighted formula, and certainly good enough to outdo expert judgment. This logic can be applied to many domains, ranging from the selection of stocks by portfolio managers to the choices of medical treatments by doctors or patients.


(Kahneman 2011, page 226)

Whenever we can replace human judgment by a formula, we should at least consider it. (Kahneman 2011, page 233)

Data are generated according to the desired features the sample data set must have in terms of default frequency, degree of overlap between non-defaulted and defaulted companies, and key financial ratios as predictors. The code below generates two predictors: the debt-to-book-value ratio (DMP, from the original Italian debito-mezzi-propri) and the EBIT-to-interest-payments ratio (EBITOF, from the original Italian EBIT-oneri-finanziari).

n <- 1000

p <- 0.82 #proportion of non defaulters

set.seed(321) #for reproducibility

DMP.no <- rnorm(n = n * p, mean = 1.5, sd = 0.75)

DMP.si <- rnorm(n = n * (1 - p), mean = 3.0, sd = 0.75)

EBITOF.no <- rnorm(n = n * p, mean = 2.0, sd = 0.75)

EBITOF.si <- rnorm(n = n * (1 - p), mean = 0.75, sd = 0.75)

df <- data.frame(Default = c(rep('No', p * n),
                             rep('Si', (1 - p) * n)),
                 DMP = c(DMP.no, DMP.si),
                 EBITOF = c(EBITOF.no, EBITOF.si))

str(df)

## 'data.frame': 1000 obs. of 3 variables:

## $ Default: Factor w/ 2 levels "No","Si": 1 1 1 1 1 1 1 1 1 1 ...

## $ DMP : num 2.779 0.966 1.292 1.41 1.407 ...

## $ EBITOF : num 0.878 1.036 3.052 2.021 3.426 ...

#Adjust DMP for negative values

d <- df$DMP

d[df$DMP < 0] <- 0

df$DMP <- d

head(df)

## Default DMP EBITOF

## 1 No 2.7786774 0.8777016

## 2 No 0.9659711 1.0358616

## 3 No 1.2915113 3.0520025

## 4 No 1.4102632 2.0214981

## 5 No 1.4070295 3.4255801

## 6 No 1.7011378 1.1667068
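The simulation above is a two-class scheme: each ratio is drawn from a class-conditional Gaussian, and negative leverage values are clamped at zero. The same idea can be mirrored outside R; a minimal Python sketch of the DMP draws (same means, standard deviations and class shares as above, with the zero-clamping folded into the draw):

```python
import random

random.seed(321)
n, p = 1000, 0.82  # sample size and share of non-defaulters

# class-conditional Gaussian draws for the leverage ratio DMP,
# clamped at zero as in the adjustment above
dmp_no = [max(0.0, random.gauss(1.5, 0.75)) for _ in range(round(n * p))]
dmp_si = [max(0.0, random.gauss(3.0, 0.75)) for _ in range(round(n * (1 - p)))]

print(len(dmp_no), len(dmp_si), min(dmp_no + dmp_si) >= 0.0)
```

Defaulters are simulated with higher leverage on average (mean 3.0 versus 1.5), which is what creates the partially overlapping clouds inspected below.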

RColorBrewer is loaded to set a palette with four colors used to enhance the graphics.

require(RColorBrewer)

pal <- brewer.pal(4, "Set1")

pal

## [1] "#E41A1C" "#377EB8" "#4DAF4A" "#984EA3"


After setting up the palette, it's necessary to plot the sample to inspect it for possible irregularities with respect to the desired features. To do so the ggplot2 package is required and the scatter plot options are adjusted so as to get a plot similar to base R.

require(ggplot2)

pl <- ggplot(df, aes(DMP, EBITOF, color = Default, shape = Default)) +

geom_point(size = 3, alpha = 0.5) +

scale_shape(solid = FALSE) +

scale_color_manual(values = c(pal[2], pal[1])) +

scale_shape_manual(values = c(1,3))

pl

[Figure: scatter plot of EBITOF against DMP, with non-defaulted (No) and defaulted (Si) companies distinguished by color and shape]

Overlapping histograms of non-defaulted/defaulted companies for both key financial ratios.

pl.1 <- ggplot(df, aes(DMP, fill = Default)) +

scale_fill_manual(values = c(pal[2], pal[1]))

pl.2<-ggplot(df, aes(EBITOF, fill = Default)) +

scale_fill_manual(values = c(pal[2], pal[1]))

pl.1 + geom_histogram(data = subset(df, Default == "Si"),
                      binwidth = 0.1,
                      alpha = 0.5,
                      position = 'identity') +
  geom_histogram(data = subset(df, Default == "No"),
                 binwidth = 0.1,
                 alpha = 0.5,
                 position = 'identity')


[Figure: overlapping histograms of DMP counts for Default = No/Si]

pl.2 + geom_histogram(data = subset(df, Default == "Si"),
                      binwidth = 0.1, alpha = 0.5, position = 'identity') +
  geom_histogram(data = subset(df, Default == "No"),
                 binwidth = 0.1, alpha = 0.5, position = 'identity')

[Figure: overlapping histograms of EBITOF counts for Default = No/Si]

Let's generate some false predictors and sample a reasonable subset to be handed to students:

# subset data

set.seed(666)

s.s <- 200

require(dplyr)


df.sm <- sample_n(df, size = s.s, replace = FALSE)

df.sm$SaleChg <- rnorm(n = s.s,

mean = 0.0,

sd = 10.0)

df.sm$ClienteStorico<- sample(c('Si', 'No'),

size = s.s,

replace = TRUE)

df.sm$Settore <- sample(c("Meccanica industriale",

"Servizi",

"Meccanica automotive",

"Dettaglio",

"Edile"),

size = s.s,

replace = TRUE)

df.sm$Outlook <- sample(c('Positive', 'Stable', 'Negative'),

size = s.s,

replace = TRUE)

set.seed(1)

z <- sort(sample(nrow(df.sm), nrow(df.sm) * 0.5))

train <- df.sm[z,]

test <- df.sm[-z,]
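The 50/50 split above simply samples half of the row indices without replacement and takes the complement as the test set. The same logic sketched in Python (stdlib only; the row count matches the subsampled data frame):

```python
import random

random.seed(1)
n_rows = 200  # size of the subsampled data frame df.sm

# sample half the row indices for train; the complement becomes test
train_idx = sorted(random.sample(range(n_rows), n_rows // 2))
train_set = set(train_idx)
test_idx = [i for i in range(n_rows) if i not in train_set]

print(len(train_idx), len(test_idx))
```

Sorting the sampled indices is cosmetic (it preserves the original row order in the train set), as `sort()` does in the R code.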

Both train and test samples will be duplicated in a second data frame, trainn and testt, just in case they are needed (seeding will be set again to new values to generate new random samples).

trainn <- train

testt <- test

Before continuing, the data will be saved to produce hand-outs for students.

write.table(train, file = "train_100.csv",

sep = ";",

dec = ",")

write.table(test, file = "test_100.csv",

sep = ";",

dec = ",")

Perform some exploratory data analysis on the df.sm data frame:

par(mfrow = c(2,3))

T1 <- table(df.sm$ClienteStorico, df.sm$Default,

dnn = c('Cliente storico', 'Default'))

mosaicplot(T1,

main = 'Cliente storico per default')

T2 <- table(df.sm$Settore, df.sm$Default,

dnn = c('Settore', 'Default'))

mosaicplot(T2,

main = 'Settore per default')

T3 <- table(df.sm$Outlook, df.sm$Default,
            dnn = c('Outlook', 'Default'))

mosaicplot(T3,

main = 'Outlook per default')

boxplot(EBITOF ~ Default, data = df.sm,

main = 'EBITOF per default')

boxplot(DMP ~ Default, data = df.sm,

main = 'DMP per default')

boxplot(SaleChg ~ Default, data = df.sm,

main = 'Sales chg. per default')

[Figure: 2×3 panel of exploratory plots – mosaic plots 'Cliente storico per default', 'Settore per default', 'Outlook per default' and boxplots 'EBITOF per default', 'DMP per default', 'Sales chg. per default']

# barplot(table(df.sm$ClienteStorico, df.sm$Default)/nrow(df.sm),
#         names.arg = c('Default No', 'Default Si'),
#         main = 'Cliente storico per default')

par(mfrow = c(1,1))

s <- chisq.test(T1)

print(s)

##

## Pearson's Chi-squared test with Yates' continuity correction

##

## data: T1

## X-squared = 3.8975, df = 1, p-value = 0.04836
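The Yates-corrected statistic R reports is Σ(|O − E|−0.5)²/E over the four cells of the 2×2 table. Since the tutorial only prints the statistic, here is the computation sketched in Python on hypothetical counts (assuming every |O − E| exceeds 0.5, so the correction never goes negative):

```python
# Yates-corrected chi-squared for a 2x2 table (counts are hypothetical,
# chosen only to illustrate the formula)
table = [[60, 24], [30, 6]]  # rows: ClienteStorico No/Si, cols: Default No/Si
row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
n = sum(row)

x2 = 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / n          # expected count under independence
        x2 += (abs(table[i][j] - e) - 0.5) ** 2 / e

print(round(x2, 4))
```

Comparing the statistic to the chi-squared distribution with 1 degree of freedom then yields the p-value printed by `chisq.test`.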

Pairs plot for the numeric variables:


pairs(df.sm[, 2:4])

[Figure: pairs plot of DMP, EBITOF and SaleChg]

pl.sm <- ggplot(df.sm, aes(DMP, EBITOF,

color = Default,

shape = Default)) +

geom_point(size = 3, alpha = 1.0) +

scale_shape(solid = FALSE) +

scale_color_manual(values = c(pal[2], pal[1])) +

scale_shape_manual(values = c(1,3))

pl.sm + stat_smooth(aes(group = 1), method = 'lm')


[Figure: scatter plot of EBITOF against DMP with a linear smooth fitted across both classes]

require(MASS)

pr.dmp.fit <- lda(Default ~ DMP, data = df.sm)

pr.ebitof.fit <- lda(Default ~ EBITOF, data = df.sm)

# function to compute the prediction rule (cutoff) implied by a univariate LDA fit
dec.rule.ebit <- function(lda, df){
  A <- mean(lda$means)
  B <- log(lda$prior[2]) - log(lda$prior[1])
  s2.k <- t(tapply(df$EBITOF, df$Default, var)) %*% lda$prior
  C <- s2.k/(lda$means[1] - lda$means[2])
  dr <- A + B * C
  dr
}

dec.rule.dmp <- function(lda, df){
  A <- mean(lda$means)
  B <- log(lda$prior[2]) - log(lda$prior[1])
  s2.k <- t(tapply(df$DMP, df$Default, var)) %*% lda$prior
  C <- s2.k/(lda$means[1] - lda$means[2])
  dr <- A + B * C
  dr
}

dr.dmp <- dec.rule.dmp(pr.dmp.fit, df.sm)

dr.ebitof <- dec.rule.ebit(pr.ebitof.fit, df.sm)
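The helper functions implement the two-class univariate LDA cutoff: the midpoint of the class means, shifted by the log-prior-odds scaled by the pooled variance over the mean difference. A numeric check of that formula in Python (the class statistics are hypothetical, loosely echoing the EBITOF simulation parameters):

```python
import math

def lda_cutoff(mu_no, mu_si, var_pooled, prior_no, prior_si):
    # midpoint of the class means, shifted by the log-prior-odds term
    a = (mu_no + mu_si) / 2
    b = math.log(prior_si) - math.log(prior_no)
    c = var_pooled / (mu_no - mu_si)
    return a + b * c

# hypothetical EBITOF-like inputs: non-defaulters center at 2.0, defaulters at 0.75
cut = lda_cutoff(mu_no=2.0, mu_si=0.75, var_pooled=0.75**2,
                 prior_no=0.82, prior_si=0.18)
print(round(cut, 4))
```

Because defaulters are the rarer class, the log-prior term pulls the cutoff below the midpoint of the two means, shrinking the region classified as 'Si'.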

pl.1.sm <- ggplot(df.sm, aes(DMP, fill = Default)) +

scale_fill_manual(values = c(pal[2], pal[1]))

pl.2.sm <- ggplot(df.sm, aes(EBITOF, fill = Default)) +

scale_fill_manual(values = c(pal[2], pal[1]))


pl.1.sm + geom_histogram(data = subset(df.sm, Default == "Si"),
                         binwidth = 0.25,
                         alpha = 0.5,
                         position = 'identity') +
  geom_histogram(data = subset(df.sm, Default == "No"),
                 binwidth = 0.25,
                 alpha = 0.5,
                 position = 'identity') +
  geom_vline(xintercept = dr.dmp,
             linetype = 'dashed')

pl.2.sm + geom_histogram(data = subset(df.sm, Default == "Si"),
                         binwidth = 0.25, alpha = 0.5, position = 'identity') +
  geom_histogram(data = subset(df.sm, Default == "No"),
                 binwidth = 0.25, alpha = 0.5, position = 'identity') +
  geom_vline(xintercept = dr.ebitof,
             linetype = 'dashed')

[Figure: overlapping histograms of DMP and EBITOF for Default = No/Si, with the LDA cutoffs marked by dashed vertical lines]

1 Statistical & naïve predictors

Run a logistic regression to show that the false predictors aren't significant:

lgt.null <- glm(Default ~ .,

data = df.sm,

family = 'binomial')

summary(lgt.null)

##

## Call:

## glm(formula = Default ~ ., family = "binomial", data = df.sm)

##

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.05817 -0.13653 -0.02901 -0.00372 2.02415


##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -6.89273 2.12989 -3.236 0.00121 **

## DMP 3.27608 0.73848 4.436 9.15e-06 ***

## EBITOF -2.96387 0.74027 -4.004 6.23e-05 ***

## SaleChg -0.08523 0.05387 -1.582 0.11360

## ClienteStoricoSi 1.06459 0.81142 1.312 0.18951

## SettoreEdile 0.43600 1.18740 0.367 0.71348

## SettoreMeccanica automotive -1.00954 1.43137 -0.705 0.48062

## SettoreMeccanica industriale 1.22437 1.18360 1.034 0.30093

## SettoreServizi -0.82071 1.30597 -0.628 0.52972

## OutlookPositive 2.26890 1.08911 2.083 0.03723 *

## OutlookStable 1.21965 1.16211 1.050 0.29394

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 185.491 on 199 degrees of freedom

## Residual deviance: 49.379 on 189 degrees of freedom

## AIC: 71.379

##

## Number of Fisher Scoring iterations: 8

plot(data = df.sm, EBITOF ~ DMP,

main = 'Sample train + test',

cex = 1.5)

plot(data = df.sm, EBITOF ~ DMP,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Sample train + test (defaulters displayed)',

cex = 1.5)

[Figure: base R scatter plots of EBITOF against DMP – 'Sample train + test' and 'Sample train + test (defaulters displayed)']


par(mfrow = c(1,2))

plot(data = train, EBITOF ~ DMP,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Train',

cex = 1.5)

plot(data = test, EBITOF ~ DMP,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Test',

cex = 1.5)

[Figure: side-by-side scatter plots of EBITOF against DMP for the Train and Test samples]

par(mfrow=c(1,1))

plot(data = train, EBITOF ~ DMP,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Train sample data set',

cex = 1.5)

[Figure: scatter plot 'Train sample data set' of EBITOF against DMP]

Display train data sample:

plot(data = train, EBITOF ~ DMP,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

cex = 1.5)

par(mfrow = c (1,2))

boxplot(data = train, EBITOF ~ Default,

col = c(pal[2], pal[1]), ylab='EBIT/OF',

xlab = 'Default')

boxplot(data = train, DMP ~ Default,

col = c(pal[2], pal[1]), ylab='D/MP',

xlab = 'Default')

par(mfrow = c(1,1))

[Figure: scatter plot of EBITOF against DMP for the train sample, plus boxplots of EBIT/OF and D/MP by Default]

Fit a logit model to the train data sample and check that R has coded the response Default correctly: contrasts(train$Default) shows that a dummy variable has been created, with 1 corresponding to the default status.

lgt.fit <- glm(Default ~ DMP + EBITOF,

data = train,

family = 'binomial')

contrasts(train$Default)

## Si

## No 0

## Si 1

summary(lgt.fit)

##

## Call:

## glm(formula = Default ~ DMP + EBITOF, family = "binomial", data = train)

##

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.30735 -0.22448 -0.06741 -0.02830 2.78986

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -5.8695 2.3823 -2.464 0.013746 *

## DMP 3.0051 0.8837 3.400 0.000673 ***

## EBITOF -1.8921 0.8158 -2.319 0.020383 *

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## (Dispersion parameter for binomial family taken to be 1)

##

## Null deviance: 77.277 on 99 degrees of freedom

## Residual deviance: 25.021 on 97 degrees of freedom


## AIC: 31.021

##

## Number of Fisher Scoring iterations: 7
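Given the fitted coefficients, the predicted default probability is the inverse logit of the linear predictor, p = 1/(1 + exp(−(β₀ + β_DMP·DMP + β_EBITOF·EBITOF))). A quick Python check using the coefficients from the summary above, on two hypothetical firms:

```python
import math

# logit coefficients from the summary above
b0, b_dmp, b_ebitof = -5.8695, 3.0051, -1.8921

def p_default(dmp, ebitof):
    # inverse-logit of the linear predictor
    eta = b0 + b_dmp * dmp + b_ebitof * ebitof
    return 1 / (1 + math.exp(-eta))

# a highly leveraged firm with weak interest coverage (hypothetical values)
print(round(p_default(3.0, 0.5), 4))
# a low-leverage firm with strong coverage (hypothetical values)
print(round(p_default(1.0, 2.5), 4))
```

The signs match intuition: higher leverage (DMP) raises the default probability, stronger interest coverage (EBITOF) lowers it.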

Perform prediction on test data sample:

lgt.probs <- predict(lgt.fit, newdata = test, type = 'response')

Prepare a canvas grid to highlight the accept/reject areas on the chart, based on the ranges of the x and y axes in the df.sm data frame. expand.grid() generates a set of x-y coordinates to grid the canvas.

xlim <- range(df.sm$DMP)

xlim

## [1] 0.000000 4.815215

ylim <- range(df.sm$EBITOF)

ylim

## [1] -0.8916338 3.9687379

x <- seq(xlim[1], xlim[2], length = s.s/4)

y <- seq(ylim[1], ylim[2], length = s.s/4)

grid <- expand.grid(x = x,y = y)

names(grid) <- c('DMP', 'EBITOF')

g <- predict(lgt.fit,newdata = grid, type = 'response')

head(g)

## 1 2 3 4 5 6

## 0.01503178 0.02009202 0.02680935 0.03569068 0.04737098 0.06262557
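expand.grid() returns every x-y combination with the first variable varying fastest. The equivalent Cartesian product in Python (with tiny hypothetical axes in place of the 50-point sequences used above):

```python
from itertools import product

x = [0.0, 2.4, 4.8]    # DMP axis values (hypothetical)
y = [-0.9, 1.5, 3.9]   # EBITOF axis values (hypothetical)

# iterate y as the outer loop so that x varies fastest, as expand.grid does
grid = [(xi, yi) for yi, xi in product(y, x)]

print(len(grid), grid[:3])
```

Classifying every point of this lattice and coloring it by predicted class is what paints the accept/reject regions in the charts below.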

plot(EBITOF ~ DMP, data = test,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Campione di test (LGT)',

cex = 1.5,

xlim = xlim,

ylim = ylim)

z <- outer(x, y, function(x,y)predict(lgt.fit,

newdata = data.frame(DMP = x,

EBITOF = y),

type = 'response'))

contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)

points(grid, pch = '.', lwd = 1.25,

col=ifelse(g>=0.2, pal[1],pal[2]))

[Figure: 'Campione di test (LGT)' – test sample scatter with the logit decision region and the 0.2 probability contour]

Confusion matrix to present prediction results:

lgt.pred = rep("No", s.s/2)

lgt.pred[lgt.probs >= 0.20] <- "Si"

tab <- table(lgt.pred, test$Default,

dnn = c('Class. prevista',

'Class. effettiva'))

addmargins(tab)

## Class. effettiva

## Class. prevista No Si Sum

## No 72 1 73

## Si 6 21 27

## Sum 78 22 100

#error rate er

er <- mean(lgt.pred != test$Default); names(er) <- 'Error rate'

# sensitivity

sen <- tab[2,2]/(tab[1,2]+tab[2,2])

names(sen) <- 'Sensitivity'

# specificity

sp <- tab[1,1]/(tab[1,1]+tab[2,1]); names(sp) <- 'Specificity'

er; sen; sp

## Error rate

## 0.07

## Sensitivity

## 0.9545455

## Specificity

## 0.9230769
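The three summary measures follow directly from the confusion-matrix counts printed above; reproduced in Python as a cross-check:

```python
# counts from the confusion matrix above (rows: predicted class)
tn, fn = 72, 1   # predicted 'No': actually No / actually Si
fp, tp = 6, 21   # predicted 'Si': actually No / actually Si
n = tn + fn + fp + tp

error_rate  = (fp + fn) / n    # share of misclassified test firms
sensitivity = tp / (tp + fn)   # defaulters correctly flagged
specificity = tn / (tn + fp)   # non-defaulters correctly cleared

print(error_rate, round(sensitivity, 4), round(specificity, 4))
```

Lowering the cutoff from 0.50 to 0.20 trades some specificity (false alarms on sound borrowers) for higher sensitivity, which is usually the right trade in lending, where missing a defaulter is the costlier error.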


Load the required package MASS for LDA analysis and predict the response on the test sample:

require(MASS)

lda.fit <- lda(data = train, Default ~ DMP + EBITOF)

lda.probs <- predict(lda.fit,

newdata = test, type = 'response')

g.lda <- predict(lda.fit, newdata = grid)

plot(EBITOF ~ DMP, data = test,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Campione di test (LDA)',

cex = 1.5,

xlim = xlim,

ylim = ylim)

z <- outer(x, y, function(x,y)predict(lda.fit,

newdata = data.frame(DMP = x,

EBITOF = y))$posterior[,2])

contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)

points(grid, pch = ".", lwd = 1.25,

col=ifelse(g.lda$posterior[,2]>=0.2, pal[1],pal[2]))

[Figure: 'Campione di test (LDA)' – test sample scatter with the LDA decision region and the 0.2 posterior contour]

Confusion matrix with LDA results and test error rate:

lda.pred <- rep('No', s.s/2)

lda.pred[lda.probs$posterior[,2]>=0.2] <- 'Si'

lda.tab <- table(lda.pred, test$Default,
                 dnn = c('Class. prevista',
                         'Class. effettiva'))

addmargins(lda.tab)

## Class. effettiva

## Class. prevista No Si Sum

## No 72 1 73

## Si 6 21 27

## Sum 78 22 100

lda.er <- mean(lda.pred != test$Default)

lda.er

## [1] 0.07

qda.fit <- qda(data = train, Default ~ DMP + EBITOF)

qda.probs <- predict(qda.fit,

newdata = test, type = "response")

g.qda <- predict(qda.fit,newdata = grid)

Confusion matrix with QDA results:

qda.pred <- rep('No', s.s/2)

qda.pred[qda.probs$posterior[,2] >= 0.2] <- 'Si'

qda.tab <- table(qda.pred, test$Default,
                 dnn = c('Class. prevista',
                         'Class. effettiva'))

addmargins(qda.tab)

## Class. effettiva

## Class. prevista No Si Sum

## No 72 2 74

## Si 6 20 26

## Sum 78 22 100

qda.er <- mean(qda.pred != test$Default)

qda.er

## [1] 0.08

plot(EBITOF ~ DMP, data = test,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

main = 'Campione di test (QDA)',

cex = 1.5,

xlim = xlim,

ylim = ylim)

z <- outer(x, y, function(x,y)predict(qda.fit,

newdata = data.frame(DMP = x,

EBITOF = y))$posterior[,2])

contour(x, y, z, add = TRUE,

level = 0.20, lwd = 1)


points(grid,

col = ifelse(g.qda$posterior[,2]>=0.2,

pal[1], pal[2]),

pch= '.', cex = 0.5)

[Figure: 'Campione di test (QDA)' – test sample scatter with the QDA decision region and the 0.2 posterior contour]

library(class)

train.X <- cbind(train$DMP, train$EBITOF)

test.X <- cbind(test$DMP, test$EBITOF)

train.Default <- train$Default

set.seed(1)

kk = 15

knn.pred <- knn(train = train.X,

test = test.X,

cl = train.Default,

k = kk,

prob = TRUE)

summary(knn.pred)

## No Si

## 87 13

Confusion matrix and test error rate for the KNN classifier, based on a 0.2 probability cutoff (as opposed to the default 0.50):

knn.pred.prob <- attr(knn.pred, 'prob')

knn.probs <- ifelse(knn.pred == 'No',

1 - knn.pred.prob,knn.pred.prob)

knn.pred.cl <- rep("No", s.s/2)

knn.pred.cl[knn.probs >= 0.2] <- "Si"


knn.tab <- table(knn.pred.cl,

test$Default,

dnn = c('Class. prevista',

'Class. effettiva'))

addmargins(knn.tab)

## Class. effettiva

## Class. prevista No Si Sum

## No 74 2 76

## Si 4 20 24

## Sum 78 22 100

knn.er <- mean(knn.pred.cl != test$Default)

knn.er

## [1] 0.06

Access the knn estimated probabilities via the attr function and transform them into probabilities of default accordingly. Prepare the grid matrix for plotting the decision area (matrix(knn.probs, ...)).

knn.probs <- attr(knn.pred, "prob")

head(knn.probs)

## [1] 1.0000000 0.8000000 1.0000000 0.6000000 0.5333333 1.0000000

knn.probs <- ifelse(knn.pred == 'No',

1 - knn.probs,knn.probs)

head(knn.probs)

## [1] 0.0000000 0.8000000 0.0000000 0.4000000 0.4666667 0.0000000
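The prob attribute returned by knn is the vote share of the *winning* class, so it must be flipped whenever the predicted class is 'No' to obtain a probability of default, exactly as the ifelse above does. A Python sketch of the same transformation (the example vote shares are hypothetical):

```python
def to_default_prob(pred_class, winner_share):
    # winner_share is the vote share of the predicted class;
    # convert it to the share of 'Si' (default) votes
    return winner_share if pred_class == 'Si' else 1 - winner_share

preds = [('No', 1.0), ('Si', 0.8), ('No', 0.6), ('Si', 0.5333333)]
probs = [to_default_prob(c, s) for c, s in preds]
print([round(p, 4) for p in probs])
```

After this flip, all probabilities are on the same scale and a single 0.2 cutoff can be applied, as for the other classifiers.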

knn.probs.kk <- matrix(knn.probs,

length(x), length(y))

z.knn <- knn(train = train.X,

test = grid,

cl = train.Default,

k = kk, prob = TRUE)

z.knn.probs <- attr(z.knn, "prob")

z.knn.probs <- ifelse(z.knn == 'No',

1 - z.knn.probs,

z.knn.probs)

z.knn.probs.kk <- matrix(z.knn.probs,

length(x),

length(y))

g.knn <- knn(train.X,

grid,

train.Default,

k = kk,

prob = TRUE)

g.knn.probs <- attr(g.knn, "prob")

g.knn.probs <- ifelse(g.knn == 'No',
                      1 - g.knn.probs,
                      g.knn.probs)

g.knn.probs.kk <- matrix(g.knn.probs,

length(x),

length(y))

Chart KNN decision boundary and points:

# chart KNN

plot(grid, col = ifelse(g.knn.probs.kk>=0.2,

pal[1], pal[2]),

cex = 0.25,

pch = ".",

main = 'KNN = 15',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP, data = test,

col = ifelse(Default == 'No',

pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

cex = 1.5)

z.knn <- knn(train = train.X,

test = grid,

cl = train.Default,

k = kk, prob = TRUE)

z.knn.probs <- attr(z.knn, "prob")

z.knn.probs <- ifelse(z.knn == 'No',

1 - z.knn.probs,

z.knn.probs)

z.knn.probs.kk <- matrix(z.knn.probs,

length(x),

length(y))

contour(x, y, z.knn.probs.kk,

levels = 0.20,

add = TRUE)

[Figure: 'KNN = 15' – test sample scatter of EBITOF against DMP with the KNN decision region and 0.2 contour]

Plot four aligned charts on the train data set:

par(mfrow=c(2,2))

plot(grid, pch = ".", lwd = 0.25,

col=ifelse(g>=0.2, pal[1],pal[2]),

cex = 0.25, main = 'LGT',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP, data = train,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3), cex = 1.5)

z <- outer(x, y,

function(x,y)predict(lgt.fit,

newdata = data.frame(DMP = x,

EBITOF = y),

type = 'response'))

contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)

plot(grid, pch = ".", lwd = 0.25,

col=ifelse(g.lda$posterior[,2]>=0.2,

pal[1],pal[2]),

main = 'LDA',

cex = 0.25,

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP, data = train,

col = ifelse(Default == 'No',

pal[2], pal[1]),


pch = ifelse(Default == 'No', 1, 3),

cex = 1.5)

z <- outer(x, y,

function(x,y)predict(lda.fit,

newdata = data.frame(DMP = x,

EBITOF = y))$posterior[,2])

contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)

plot(grid,

col = ifelse(g.qda$posterior[,2]>=0.2, pal[1], pal[2]),

pch= '.', cex = 0.25,

main = 'QDA',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP,

data = train,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3), cex = 1.5)

z <- outer(x, y,

function(x,y)predict(qda.fit,

newdata = data.frame(DMP = x,

EBITOF = y))$posterior[,2])

contour(x, y, z,

add = TRUE,

level = 0.20, lwd = 1)

plot(grid,

col = ifelse(g.knn.probs.kk>=0.2, pal[1], pal[2]),

cex = 0.25, pch = ".", main = 'KNN = 15',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP,

data = train,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3), cex = 1.5)

z.knn <- knn(train = train.X,

test = grid,

cl = train.Default,

k = kk,

prob = TRUE)

z.knn.probs <- attr(z.knn, "prob")

z.knn.probs <- ifelse(z.knn == 'No',

1 - z.knn.probs,

z.knn.probs)

z.knn.probs.kk <- matrix(z.knn.probs,
                         length(x),
                         length(y))

contour(x, y, z.knn.probs.kk,

levels = 0.20,

add = TRUE)

[Figure: 2×2 panel of decision regions on the train sample – LGT, LDA, QDA and KNN = 15, each with its 0.2 contour]

Plot four aligned charts of the decision rules on the test data set:

par(mfrow=c(2,2))

plot(grid, pch = ".", lwd = 0.25,

col=ifelse(g>=0.2, pal[1],pal[2]),

cex = 0.25, main = 'LGT',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP, data = test,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3), cex = 1.5)

z <- outer(x, y,
           function(x,y)predict(lgt.fit,
                                newdata = data.frame(DMP = x,
                                                     EBITOF = y),
                                type = 'response'))

contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)

plot(grid, pch = ".", lwd = 0.25,

col=ifelse(g.lda$posterior[,2]>=0.2,

pal[1],pal[2]),

main = 'LDA',

cex = 0.25,

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP, data = test,

col = ifelse(Default == 'No',

pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3),

cex = 1.5)

z <- outer(x, y,

function(x,y)predict(lda.fit,

newdata = data.frame(DMP = x,

EBITOF = y))$posterior[,2])

contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)

plot(grid,

col = ifelse(g.qda$posterior[,2]>=0.2, pal[1], pal[2]),

pch= '.', cex = 0.25,

main = 'QDA',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP,

data = test,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3), cex = 1.5)

z <- outer(x, y,

function(x,y)predict(qda.fit,

newdata = data.frame(DMP = x,

EBITOF = y))$posterior[,2])

contour(x, y, z,

add = TRUE,

level = 0.20, lwd = 1)

plot(grid,

col = ifelse(g.knn.probs.kk>=0.2, pal[1], pal[2]),

cex = 0.25, pch = ".", main = 'KNN = 15',

xlim = xlim,

ylim = ylim)

points(EBITOF ~ DMP,
       data = test,

col = ifelse(Default == 'No', pal[2], pal[1]),

pch = ifelse(Default == 'No', 1, 3), cex = 1.5)

z.knn <- knn(train = train.X,

test = grid,

cl = train.Default,

k = kk,

prob = TRUE)

z.knn.probs <- attr(z.knn, "prob")

z.knn.probs <- ifelse(z.knn == 'No',

1 - z.knn.probs,

z.knn.probs)

z.knn.probs.kk <- matrix(z.knn.probs,

length(x),

length(y))

contour(x, y, z.knn.probs.kk,

levels = 0.20,

add = TRUE)

[Figure: 2×2 panel of decision regions on the test sample – LGT, LDA, QDA and KNN = 15, each with its 0.2 contour]

Plot naïve predictors based on financial ratios (EBITOF and DMP) on the train data sample:

par(mfrow = c(1, 2))
plot(EBITOF ~ DMP, data = train,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3), cex = 1.5,
     xlim = xlim, ylim = ylim)
points(grid, pch = ".", lwd = 0.25,
       col = ifelse(grid$EBITOF <= 1.2, pal[1], pal[2]))
abline(h = 1.2, lty = 2, lwd = 1)
plot(EBITOF ~ DMP, data = train,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3), cex = 1.5,
     xlim = xlim, ylim = ylim)
points(grid, pch = ".", lwd = 0.25,
       col = ifelse(grid$DMP >= 2.0, pal[1], pal[2]))
abline(v = 2.0, lty = 2, lwd = 1)

[Figure: two EBITOF vs. DMP scatter plots of the train sample, with the naïve cut-offs EBITOF = 1.2 (dashed horizontal line) and DMP = 2.0 (dashed vertical line).]

par(mfrow=c(1,1))

Plot naïve predictors based on financial ratios (EBITOF and DMP) on the test data sample:

par(mfrow = c(1, 2))
plot(EBITOF ~ DMP, data = test,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3), cex = 1.5,
     xlim = xlim, ylim = ylim)
points(grid, pch = ".", lwd = 0.25,
       col = ifelse(grid$EBITOF <= 1.2, pal[1], pal[2]))
abline(h = 1.2, lty = 2, lwd = 1)
plot(EBITOF ~ DMP, data = test,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3), cex = 1.5,
     xlim = xlim, ylim = ylim)
points(grid, pch = ".", lwd = 0.25,
       col = ifelse(grid$DMP >= 2.0, pal[1], pal[2]))
abline(v = 2.0, lty = 2, lwd = 1)


[Figure: two EBITOF vs. DMP scatter plots of the test sample, with the naïve cut-offs EBITOF = 1.2 and DMP = 2.0 as dashed lines.]

par(mfrow = c(1, 1))

Use the naïve predictors, compounding them in a logical AND decision rule (EBITOF ≤ 1.2 AND DMP ≥ 2.0):

plot(EBITOF ~ DMP, data = train,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3), cex = 1.5,
     xlim = xlim, ylim = ylim, main = 'Train data set')
points(grid, pch = ".", lwd = 0.25,
       col = ifelse((grid$EBITOF <= 1.2) & (grid$DMP >= 2.0),
                    pal[1], pal[2]))
abline(h = 1.2, lty = 2, lwd = 2)
abline(v = 2.0, lty = 2, lwd = 2)
plot(EBITOF ~ DMP, data = test,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3), cex = 1.5,
     xlim = xlim, ylim = ylim, main = 'Test data set')
points(grid, pch = ".", lwd = 0.25,
       col = ifelse((grid$EBITOF <= 1.2) & (grid$DMP >= 2.0),
                    pal[1], pal[2]))
abline(h = 1.2, lty = 2, lwd = 2)
abline(v = 2.0, lty = 2, lwd = 2)


[Figure: EBITOF vs. DMP scatter plots ("Train data set" and "Test data set") with the combined rule region (EBITOF ≤ 1.2 AND DMP ≥ 2.0) shaded and both dashed cut-off lines drawn.]

edmp.pred <- rep('No', s.s/2)
edmp.pred[(test$EBITOF <= 1.2) & (test$DMP >= 2.0)] <- 'Si'
edmp.tab <- table(edmp.pred,
                  test$Default,
                  dnn = c('Class. prevista', 'Class. effettiva'))
addmargins(edmp.tab)

##                Class. effettiva
## Class. prevista  No  Si Sum
##             No   77   4  81
##             Si    1  18  19
##             Sum  78  22 100

edmp.er <- mean(edmp.pred != test$Default)
edmp.er

## [1] 0.05

Prepare data frame for ROC curves:

require(pROC)
res <- data.frame(Default = test$Default,
                  LGT = lgt.probs,
                  LDA = lda.probs$posterior[,2],
                  QDA = qda.probs$posterior[,2],
                  KNN = knn.probs,
                  EBITOF = test$EBITOF,
                  DMP = test$DMP)
head(res)

##     Default         LGT         LDA         QDA       KNN     EBITOF       DMP
## 198      No 0.019904605 0.012715513 0.009984894 0.0000000 2.05021166 1.9473854
## 978      Si 0.997576429 0.999267410 0.999048875 0.8000000 0.02209348 3.9704303
## 740      No 0.006655441 0.003060701 0.002081429 0.0000000 3.31898715 2.3772296
## 974      Si 0.562134689 0.619997321 0.585406157 0.4000000 1.17433535 2.7757477
## 14       No 0.766882711 0.818991764 0.813543896 0.4666667 1.56117928 3.3324449
## 258      No 0.002403967 0.001199350 0.001596928 0.0000000 1.63575939 0.9771178

pal1 <- brewer.pal(7, "Dark2")
par(mfrow = c(1, 2))
plot.roc(Default ~ LGT, data = res,
         main = "Curve ROC machine learning su test",
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 70,
         col = pal1[1], grid = TRUE)
plot.roc(Default ~ LDA, data = res, add = TRUE,
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 60,
         col = pal1[2])
plot.roc(Default ~ QDA, data = res, add = TRUE,
         thresholds = "best", print.thres = "best",
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 50,
         col = pal1[3])
plot.roc(Default ~ KNN, data = res, add = TRUE,
         thresholds = "best", print.thres = "best",
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 40,
         col = pal1[4])
legend("bottomright", legend = c("LGT", "LDA", "QDA", "KNN"),
       col = c(pal1[1], pal1[2], pal1[3], pal1[4]),
       lwd = 2)
# second plot: naive classifiers
plot.roc(Default ~ EBITOF, data = res,
         thresholds = "best",
         main = "Curve ROC class. naive su test",
         print.thres = "best",
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 70,
         col = pal1[5], grid = TRUE)
plot.roc(Default ~ DMP, data = res, add = TRUE,
         thresholds = "best", print.thres = "best",
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 60,
         col = pal1[6])
legend("bottomright", legend = c("EBITOF", "DMP"),
       col = c(pal1[5], pal1[6]),
       lwd = 2)

[Figure: ROC curves on the test set. Left panel, "Curve ROC machine learning su test": LGT (AUC 96.8%), LDA (AUC 97.0%), QDA (AUC 97.0%) and KNN (AUC 94.4%), with best thresholds 0.2 printed at (91.0%, 95.5%) and (94.9%, 90.9%) specificity/sensitivity. Right panel, "Curve ROC class. naive su test": EBITOF (AUC 91.7%, threshold 1.2 at 91.0%, 81.8%) and DMP (AUC 93.1%, threshold 2.0 at 82.1%, 100.0%).]

par(mfrow = c(1,1))

2 Cross validation

n.iter <- 100
lgt.er <- rep(0, n.iter)
lda.er <- rep(0, n.iter)
qda.er <- rep(0, n.iter)
knn.er <- rep(0, n.iter)


for (i in 1:n.iter) {
  set.seed(i)
  z <- sort(sample(nrow(df.sm), nrow(df.sm) * 0.5))
  train <- df.sm[z, ]
  test <- df.sm[-z, ]
  # logistic regression
  lgt.fit <- glm(Default ~ DMP + EBITOF,
                 data = train,
                 family = 'binomial')
  lgt.probs <- predict(lgt.fit, newdata = test, type = 'response')
  lgt.pred <- rep("No", s.s/2)
  lgt.pred[lgt.probs >= 0.20] <- "Si"
  lgt.er[i] <- mean(lgt.pred != test$Default)
  # LDA
  lda.fit <- lda(data = train, Default ~ DMP + EBITOF)
  lda.probs <- predict(lda.fit, newdata = test, type = 'response')
  lda.pred <- rep('No', s.s/2)
  lda.pred[lda.probs$posterior[,2] >= 0.2] <- 'Si'
  lda.er[i] <- mean(lda.pred != test$Default)
  # QDA
  qda.fit <- qda(data = train, Default ~ DMP + EBITOF)
  qda.probs <- predict(qda.fit, newdata = test, type = "response")
  qda.pred <- rep('No', s.s/2)
  qda.pred[qda.probs$posterior[,2] >= 0.2] <- 'Si'
  qda.er[i] <- mean(qda.pred != test$Default)
  # KNN
  train.X <- cbind(train$DMP, train$EBITOF)
  test.X <- cbind(test$DMP, test$EBITOF)
  train.Default <- train$Default
  kk <- 15
  knn.pred <- knn(train = train.X,
                  test = test.X,
                  cl = train.Default,
                  k = kk, prob = TRUE)
  knn.pred.prob <- attr(knn.pred, 'prob')
  knn.probs <- ifelse(knn.pred == 'No',
                      1 - knn.pred.prob,
                      knn.pred.prob)
  knn.pred.cl <- rep('No', s.s/2)
  knn.pred.cl[knn.probs >= 0.2] <- 'Si'
  knn.er[i] <- mean(knn.pred.cl != test$Default)
}
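A detail worth isolating from the loop above: class::knn() stores in attr(·, 'prob') the vote share of the winning class, not of a fixed class, so the ifelse() converts it into the probability of 'Si'. A minimal sketch of that conversion, on toy values and without the class package:

```r
# Toy predicted labels and winning-class vote shares (hypothetical values).
pred  <- factor(c('No', 'Si', 'No'))
wprob <- c(0.80, 0.60, 0.55)
# Convert the winning-class share into P(Si), as done in the loop above.
p.si <- ifelse(pred == 'No', 1 - wprob, wprob)
p.si  # 0.20 0.60 0.45
```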


df.res <- data.frame(LGT = lgt.er,
                     LDA = lda.er,
                     QDA = qda.er,
                     KNN = knn.er)
head(df.res)

##    LGT  LDA  QDA  KNN
## 1 0.07 0.07 0.08 0.06
## 2 0.07 0.07 0.07 0.08
## 3 0.09 0.08 0.11 0.09
## 4 0.04 0.05 0.05 0.04
## 5 0.06 0.09 0.07 0.08
## 6 0.07 0.07 0.06 0.06

require(tidyr)
df.res.n <- gather(df.res, "Model", "Error", 1:4)
head(df.res.n)

##   Model Error
## 1   LGT  0.07
## 2   LGT  0.07
## 3   LGT  0.09
## 4   LGT  0.04
## 5   LGT  0.06
## 6   LGT  0.07
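The same wide-to-long reshaping can be reproduced in base R with stack(), a handy cross-check when tidyr is not available. A sketch on toy values shaped like df.res:

```r
# Toy wide data frame with two model columns (hypothetical values).
df.err <- data.frame(LGT = c(0.07, 0.07, 0.09),
                     LDA = c(0.07, 0.07, 0.08))
long <- stack(df.err)               # columns: values, ind
names(long) <- c("Error", "Model")  # match gather()'s column names
head(long)
```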

require(RColorBrewer)
pal2 <- brewer.pal(5, 'Dark2')
boxplot(data = df.res.n,
        Error ~ Model,
        col = pal2,
        main = "Test error rate su validation set (100 iteraz.)")


[Figure: boxplots of the validation-set test error rates for LGT, LDA, QDA and KNN over 100 iterations ("Test error rate su validation set (100 iteraz.)"); errors range roughly from 0.02 to 0.14.]
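Beyond the boxplot, mean errors per model can be compared directly with colMeans(). A sketch re-entering by hand only the six iterations printed in head(df.res) above, so illustrative rather than the full 100-iteration average:

```r
# The six CV iterations reported above, re-entered for illustration.
df.res6 <- data.frame(LGT = c(0.07, 0.07, 0.09, 0.04, 0.06, 0.07),
                      LDA = c(0.07, 0.07, 0.08, 0.05, 0.09, 0.07),
                      QDA = c(0.08, 0.07, 0.11, 0.05, 0.07, 0.06),
                      KNN = c(0.06, 0.08, 0.09, 0.04, 0.08, 0.06))
round(colMeans(df.res6), 4)  # LGT 0.0667, LDA 0.0717, QDA 0.0733, KNN 0.0683
```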

3 Intuitive predictors

Again, considering the paragraphs by Kahneman quoted at the beginning of this essay, I find it appropriate to quote the sensible contribution of Prof. Tagliavini in Biffis et al. (2014, page 144):

There are solutions that are elegant and precise and solutions that are rough and approximate: the former are not necessarily better than the latter.

Intuitive predictors are what loan managers need to make a diagnosis at a glance. Consider an intuitive predictor like:

DMP − EBITOF ≥ C (1)

In other words, when the difference between the level of indebtedness (DMP) and the margin over financial charges (EBITOF) rises above a certain level (−1 in the figure below, since EBITOF is on the vertical axis), we enter an area of excessive risk.

The data frame res must be used here, because train and test were re-seeded during the validation-set phase.
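One way to choose the cut-off C in (1) is to scan candidate values and keep the one with the lowest error rate. A sketch on synthetic data — the variable names mimic the real columns, and the toy ground truth is built with C = 1 so the scan recovers it by construction (the document itself fixes C = 1 by inspecting the plots):

```r
# Synthetic stand-ins for the real DMP/EBITOF/Default columns.
set.seed(1)
n      <- 200
DMP    <- runif(n, 0, 5)
EBITOF <- runif(n, -1, 4)
truth  <- ifelse(DMP - EBITOF >= 1, 'Si', 'No')  # toy ground truth, C = 1
cands  <- seq(-2, 2, by = 0.5)
errs   <- sapply(cands, function(C)
  mean(ifelse(DMP - EBITOF >= C, 'Si', 'No') != truth))
cands[which.min(errs)]  # recovers C = 1 by construction
```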

plot(EBITOF ~ DMP, data = res,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3),
     main = 'Previsore intuitivo (DMP - EBITOF)',
     cex = 1.5, xlim = xlim, ylim = ylim)
int <- outer(x, y, function(x, y) y - x)
contour(x, y, int, add = TRUE, level = c(-2.0, -1.0, 0.0),
        lwd = 2, lty = 2, col = pal[4])
plot(EBITOF ~ DMP, data = res,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3),
     main = 'LDA e previsore intuitivo (DMP - EBITOF)',
     cex = 1.5, xlim = xlim, ylim = ylim)
z <- outer(x, y, function(x, y) predict(lda.fit,
                                        newdata = data.frame(DMP = x,
                                                             EBITOF = y))$posterior[,2])
contour(x, y, z, add = TRUE, level = 0.20, lwd = 1)
int <- outer(x, y, function(x, y) y - x)
contour(x, y, int, add = TRUE, level = c(-1.0),
        lwd = 2, lty = 2, col = pal[4])

[Figure: two EBITOF vs. DMP scatter plots. Left, "Previsore intuitivo (DMP − EBITOF)": dashed level curves at −2, −1 and 0. Right, "LDA e previsore intuitivo (DMP − EBITOF)": the LDA 0.2 boundary together with the dashed −1 level curve.]

Thus we have:

DMP − EBITOF ≥ 1 (2)

or equivalently, on the x-y chart where EBITOF is on the y axis:

EBITOF ≤ DMP − 1 (3)
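The equivalence of (2) and (3) is just a rearrangement, but it can also be checked numerically, here on toy values rather than the data set:

```r
# Any point classified by rule (2) is classified identically by rule (3).
DMP    <- c(0.5, 2.0, 3.5)
EBITOF <- c(1.0, 0.5, 3.0)
identical(DMP - EBITOF >= 1, EBITOF <= DMP - 1)  # TRUE
```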

The classifier in (3) gives the following outcome:


int <- res$EBITOF - res$DMP
int.pred <- rep('No', s.s/2)
int.pred[int <= -1.0] <- 'Si'
int.tab <- table(int.pred, res$Default,
                 dnn = c('Class. prevista', 'Class. effettiva'))
addmargins(int.tab)

##                Class. effettiva
## Class. prevista  No  Si Sum
##             No   73   2  75
##             Si    5  20  25
##             Sum  78  22 100

# error rate
int.er <- mean(int.pred != res$Default)
names(int.er) <- 'Error rate'
# sensitivity
int.sen <- int.tab[2,2] / (int.tab[1,2] + int.tab[2,2])
names(int.sen) <- 'Sensitivity'
# specificity
int.sp <- int.tab[1,1] / (int.tab[1,1] + int.tab[2,1])
names(int.sp) <- 'Specificity'
int.er; int.sen; int.sp

## Error rate
##       0.07
## Sensitivity
##   0.9090909
## Specificity
##   0.9358974
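The three metrics can be wrapped in a small helper; the function name here is mine, not the document's, and the sketch is checked against the confusion-matrix counts reported above:

```r
# Metrics from a 2x2 confusion table: rows = predicted, cols = actual,
# both in the order ('No', 'Si') used throughout the document.
class_metrics <- function(tab) {
  c(error       = (tab[1, 2] + tab[2, 1]) / sum(tab),
    sensitivity = tab[2, 2] / sum(tab[, 2]),
    specificity = tab[1, 1] / sum(tab[, 1]))
}
tab <- matrix(c(73, 5, 2, 20), nrow = 2)  # counts from addmargins(int.tab)
round(class_metrics(tab), 4)  # 0.0700, 0.9091, 0.9359
```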

res$INTUIT <- res$DMP - res$EBITOF
plot.roc(Default ~ LDA, data = res,
         grid = TRUE,
         main = "Curve ROC class. LDA e intuitivo",
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 70,
         col = pal2[2])
plot.roc(Default ~ INTUIT, data = res, add = TRUE,
         thresholds = c(1.0), print.thres = c(1.0),
         print.auc = T, percent = T,
         print.auc.x = 30, print.auc.y = 60,
         col = pal2[5])
legend("bottomright", legend = c("LDA", "DMP - EBITOF"),
       col = c(pal2[2], pal2[5]),
       lwd = 2)


[Figure: ROC curves "Curve ROC class. LDA e intuitivo": LDA (AUC 97.0%) and the intuitive predictor DMP − EBITOF (AUC 97.4%, threshold 1.0 at 93.6% specificity, 90.9% sensitivity).]

Let’s test the intuitive predictor in a validation set:

n.iter <- 100
lgt.er <- rep(0, n.iter)
lda.er <- rep(0, n.iter)
qda.er <- rep(0, n.iter)
knn.er <- rep(0, n.iter)
int.er <- rep(0, n.iter)
eBit.er <- rep(0, n.iter)
dMp.er <- rep(0, n.iter)

for (i in 1:n.iter) {
  set.seed(i)
  z <- sort(sample(nrow(df.sm), nrow(df.sm) * 0.5))
  train <- df.sm[z, ]
  test <- df.sm[-z, ]
  # logistic regression
  lgt.fit <- glm(Default ~ DMP + EBITOF,
                 data = train,
                 family = 'binomial')
  lgt.probs <- predict(lgt.fit,
                       newdata = test,
                       type = 'response')
  lgt.pred <- rep("No", s.s/2)
  lgt.pred[lgt.probs >= 0.20] <- "Si"
  lgt.er[i] <- mean(lgt.pred != test$Default)
  # LDA
  lda.fit <- lda(data = train, Default ~ DMP + EBITOF)
  lda.probs <- predict(lda.fit,
                       newdata = test,
                       type = 'response')
  lda.pred <- rep('No', s.s/2)
  lda.pred[lda.probs$posterior[,2] >= 0.2] <- 'Si'
  lda.er[i] <- mean(lda.pred != test$Default)
  # QDA
  qda.fit <- qda(data = train, Default ~ DMP + EBITOF)
  qda.probs <- predict(qda.fit,
                       newdata = test,
                       type = "response")
  qda.pred <- rep('No', s.s/2)
  qda.pred[qda.probs$posterior[,2] >= 0.2] <- 'Si'
  qda.er[i] <- mean(qda.pred != test$Default)
  # KNN
  train.X <- cbind(train$DMP, train$EBITOF)
  test.X <- cbind(test$DMP, test$EBITOF)
  train.Default <- train$Default
  kk <- 15
  knn.pred <- knn(train = train.X,
                  test = test.X,
                  cl = train.Default,
                  k = kk, prob = TRUE)
  knn.pred.prob <- attr(knn.pred, 'prob')
  knn.probs <- ifelse(knn.pred == 'No',
                      1 - knn.pred.prob,
                      knn.pred.prob)
  knn.pred.cl <- rep('No', s.s/2)
  knn.pred.cl[knn.probs >= 0.2] <- 'Si'
  knn.er[i] <- mean(knn.pred.cl != test$Default)
  # INT
  int <- test$EBITOF - test$DMP
  int.pred <- rep('No', s.s/2)
  int.pred[int <= -1.0] <- 'Si'
  int.er[i] <- mean(int.pred != test$Default)
  # EBITOF
  eBit <- test$EBITOF
  eBit.pred <- rep('No', s.s/2)
  eBit.pred[eBit <= 1.2] <- 'Si'
  eBit.er[i] <- mean(eBit.pred != test$Default)
  # DMP
  dMp <- test$DMP
  dMp.pred <- rep('No', s.s/2)
  dMp.pred[dMp >= 2.0] <- 'Si'
  dMp.er[i] <- mean(dMp.pred != test$Default)
}

Plot CV test error rates of predictors:

df.res <- data.frame(LGT = lgt.er,
                     LDA = lda.er,
                     QDA = qda.er,
                     KNN = knn.er,
                     INT = int.er,
                     EOF = eBit.er,
                     DMP = dMp.er)
head(df.res)

##    LGT  LDA  QDA  KNN  INT  EOF  DMP
## 1 0.07 0.07 0.08 0.06 0.07 0.11 0.16
## 2 0.07 0.07 0.07 0.08 0.06 0.17 0.25
## 3 0.09 0.08 0.11 0.09 0.08 0.16 0.24
## 4 0.04 0.05 0.05 0.04 0.05 0.16 0.19
## 5 0.06 0.09 0.07 0.08 0.04 0.12 0.14
## 6 0.07 0.07 0.06 0.06 0.04 0.10 0.20

require(tidyr)
df.res.n <- gather(df.res, "Model", "Error", 1:7)
head(df.res.n)

##   Model Error
## 1   LGT  0.07
## 2   LGT  0.07
## 3   LGT  0.09
## 4   LGT  0.04
## 5   LGT  0.06
## 6   LGT  0.07

# compute index of ordered 'cost factor' and reassign
# oind <- order(as.numeric(by(DF$cost, DF$type, median)))
oind <- order(as.numeric(by(df.res.n$Error,
                            df.res.n$Model,
                            median)))
# DF$type <- ordered(DF$type, levels = levels(DF$type)[oind])
df.res.n$Model <- ordered(df.res.n$Model,
                          levels = levels(df.res.n$Model)[oind])
# boxplot(cost ~ type, data = DF)
require(RColorBrewer)
pal2 <- brewer.pal(7, 'Dark2')
boxplot(data = df.res.n,
        Error ~ Model,
        col = pal2,
        main = "Test error rate CV (valid. set 100 iteraz.)")
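The reordering trick used here can be seen in isolation on toy data: by() computes the per-group medians in level order, and order() then gives the permutation of the levels from lowest to highest median:

```r
# Toy error rates for three models A, B, C (hypothetical values).
err   <- c(0.05, 0.07, 0.02, 0.03, 0.09, 0.08)
model <- factor(rep(c('A', 'B', 'C'), each = 2))
oind  <- order(as.numeric(by(err, model, median)))
levels(model)[oind]  # "B" "A" "C": lowest to highest median error
```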

[Figure: boxplots of the CV test error rates ("Test error rate CV (valid. set 100 iteraz.)"), with models ordered by median error: INT, LGT, LDA, QDA, KNN, EOF, DMP; errors range roughly from 0.05 to 0.25.]

Plot the intuitive predictor on the train data sample:

plot(EBITOF ~ DMP, data = trainn,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3),
     main = '(DMP - EBITOF = 1) | train sample',
     cex = 1.5, xlim = xlim, ylim = ylim)
points(grid, pch = ".", lwd = 0.25,
       col = ifelse(grid$DMP - grid$EBITOF >= 1, pal[1], pal[2]))
int <- outer(x, y, function(x, y) y - x)
contour(x, y, int, add = TRUE, level = c(-1.0), lwd = 1.5, lty = 2)


[Figure: "(DMP − EBITOF = 1) | train sample": EBITOF vs. DMP on the train sample with the dashed decision boundary DMP − EBITOF = 1.]

Plot the intuitive predictor on the test data sample:

plot(EBITOF ~ DMP, data = testt,
     col = ifelse(Default == 'No', pal[2], pal[1]),
     pch = ifelse(Default == 'No', 1, 3),
     main = '(DMP - EBITOF = 1) | test sample',
     cex = 1.5, xlim = xlim, ylim = ylim)
points(grid, pch = ".", lwd = 0.25,
       col = ifelse(grid$DMP - grid$EBITOF >= 1, pal[1], pal[2]))
int <- outer(x, y, function(x, y) y - x)
contour(x, y, int, add = TRUE, level = c(-1.0), lwd = 1.5, lty = 2)


[Figure: "(DMP − EBITOF = 1) | test sample": EBITOF vs. DMP on the test sample with the dashed decision boundary DMP − EBITOF = 1.]

4 Rating class clustering

Save the firm's id number, now in the row names, in a separate column: it'll be needed later, as row names will be dropped by applying a function.

df.sm$Firm <- row.names(df.sm)
df.sm.dat <- df.sm[ , 2:3]
# set seed and sample a smaller set too
set.seed(123)
df.sm1 <- sample_n(df.sm, 20, replace = FALSE)
df.sm1.dat <- df.sm1[, 2:3]
# first clustering
hc.comp <- hclust(dist(df.sm.dat), method = 'complete')
cl <- cutree(hc.comp, k = 4)
df.sm$Cluster <- as.factor(cl)
# second clustering, on the small sample
hc.sm <- hclust(dist(df.sm1.dat), method = 'complete')
cl.sm <- cutree(hc.sm, k = 4)
df.sm1$Cluster <- as.factor(cl.sm)
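The hclust()/cutree() pair used above can be sanity-checked on toy data with two well-separated groups; a sketch (the real data are of course messier):

```r
# Two 10-point Gaussian clouds centred at 0 and at 10.
set.seed(42)
toy <- rbind(matrix(rnorm(20, mean = 0), 10, 2),
             matrix(rnorm(20, mean = 10), 10, 2))
hc.toy <- hclust(dist(toy), method = 'complete')
table(cutree(hc.toy, k = 2))  # two clusters of 10 points each
```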

# contingency table of defaults/non-defaults per cluster
tab.rat <- table(df.sm$Default, df.sm$Cluster)
# number of firms per cluster
tab.df.rat <- table(df.sm$Cluster); tab.df.rat

##
##  1  2  3  4
## 98 23 68 11

tab.df.fr <- prop.table(table(df.sm$Default, df.sm$Cluster),
                        margin = 2)
tab.df.fr[2,]

##          1          2          3          4
## 0.08163265 0.86956522 0.00000000 0.63636364

# same tables on the small sample
tab.rat1 <- table(df.sm1$Default, df.sm1$Cluster)
tab.df1.rat <- table(df.sm1$Cluster); tab.df1.rat

##
## 1  2  3  4
## 4 10  3  3

tab.df1.fr <- prop.table(table(df.sm1$Default, df.sm1$Cluster),
                         margin = 2)
tab.df1.fr[2,]

##    1    2    3    4
## 0.25 0.10 0.00 1.00
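The key step is prop.table(·, margin = 2), which normalizes each column (cluster) to sum to one, so that the second row is the default share per cluster. A toy check:

```r
# Five toy firms: default flag and cluster label (hypothetical values).
m <- table(def = c('No', 'Si', 'No', 'Si', 'Si'),
           cluster = c(1, 1, 1, 2, 2))
prop.table(m, margin = 2)[2, ]  # default share: 1/3 in cluster 1, 1 in cluster 2
```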

Compute probabilities of default on the mid and small samples, and write the table to disk as a handout for students.

require(dplyr)
df.sm <- df.sm %>%
  group_by(Cluster) %>%
  mutate(PD = prop.table(table(Default))[2])
head(df.sm$PD)

## [1] 0.08163265 0.86956522 0.08163265 0.08163265 0.00000000 0.86956522

df.sm <- df.sm[order(df.sm$PD), ]
df.sm$PD <- as.factor(round(df.sm$PD, digits = 2))
df.sm1 <- df.sm1 %>%
  group_by(Cluster) %>%
  mutate(PD = prop.table(table(Default))[2])
head(df.sm1$PD)

## [1] 0.25 0.10 0.00 0.10 0.10 0.10

df.sm1 <- df.sm1[order(df.sm1$PD), ]
df.sm1$PD <- as.factor(round(df.sm1$PD, digits = 2))
write.table(df.sm1, file = "train_rat_20.csv",
            sep = ";",
            dec = ",")

Chart rating classes on scatterplot:

xx <- range(df.sm$DMP); yy <- range(df.sm$EBITOF)
pl.rat <- ggplot(df.sm, aes(DMP, EBITOF,
                            color = PD,
                            shape = Default)) +
  geom_point(size = 3, alpha = 0.5) +
  scale_color_manual(values = c(pal[2], pal[3], pal[4], pal[1])) +
  scale_shape_manual(values = c(1, 3)) +
  xlim(xx) + ylim(yy)
pl.rat +
  geom_abline(intercept = -1, slope = 1, linetype = 'dashed')
pl.rat1 <- ggplot(df.sm1, aes(DMP, EBITOF,
                              color = PD,
                              shape = Default,
                              label = Firm)) +
  geom_point(size = 3, alpha = 1.0) +
  scale_color_manual(values = c(pal[2], pal[3], pal[4], pal[1])) +
  scale_shape_manual(values = c(1, 3)) +
  xlim(xx) + ylim(yy) +
  geom_text(hjust = 1.25, vjust = 1.25, size = I(3))
pl.rat1 +
  geom_abline(intercept = -1, slope = 1, linetype = 'dashed')


[Figure: two EBITOF vs. DMP scatter plots colored by rating-class PD (levels 0, 0.08, 0.64, 0.87 on the full sample; 0, 0.1, 0.25, 1 on the 20-firm sample, where points are labeled with firm ids), both with the dashed line EBITOF = DMP − 1.]

Finally, plot a plain dendrogram of the hierarchical clustering with no aesthetic attributes.

plot(hc.comp, cex = 0.2,
     xlab = "", sub = "", main = "")
rect.hclust(hc.comp, k = 4, border = pal[1])
# second plot
plot(hc.sm, cex = 1.0,
     xlab = '', sub = '', main = '')
rect.hclust(hc.sm, k = 4, border = pal[1])

[Figure: complete-linkage dendrograms of the full sample and of the 20-firm sample, each with the four rating-class clusters boxed.]

The excellent package dendextend by Galili (2015) is now required to adjust the colors, nodes and branches of the rating-class clusters on the dendrogram.

require(dendextend)
require(colorspace)
pl.rat1
hc <- as.dendrogram(hc.sm) %>%
  rotate(c(4, 1, 2, 3))
hc %>% set("labels_cex", 1.0) %>%
  set("labels_col", value = c(pal[1], pal[4], pal[3], pal[2]), k = 4) %>%
  set("branches_k_color", value = c(pal[1], pal[4], pal[3], pal[2]), k = 4) %>%
  plot(main = "")
hc %>% rect.dendrogram(k = 4,
                       border = 8, lty = 5, lwd = 1.0)
# abline(h = 2.25, lty = 2, lwd = 2.0)

[Figure: the 20-firm scatter plot colored by PD alongside the rotated, colored dendrogram of the same sample with the four clusters boxed.]

hc %>% set("labels_cex", 1.0) %>%
  set("labels_col", value = c(pal[1], pal[4], pal[3], pal[2]), k = 4) %>%
  set("branches_k_color", value = c(pal[1], pal[4], pal[3], pal[2]), k = 4) %>%
  plot(main = "")
hc %>% rect.dendrogram(k = 4,
                       border = 8, lty = 5, lwd = 1.0)

hc.all <- as.dendrogram(hc.comp)
hc.all %>%
  set("labels_cex", 0.1) %>%
  set("labels_col", value = c(pal[1], pal[4], pal[2], pal[3]), k = 4) %>%
  set("branches_k_color", value = c(pal[1], pal[4], pal[2], pal[3]), k = 4) %>%
  plot(main = "")
hc.all %>% rect.dendrogram(k = 4,
                           border = 8, lty = 5, lwd = 1.0)

[Figure: the colored dendrograms with boxed clusters, for the 20-firm sample and for the full sample.]


References

Biffis, Paolo, ed. (2014), with contributions by M. S. Avi, G. Tagliavini and F. Zen, Analisi del Merito di Credito, EIF e-Book.

Galili, Tal (2015), "dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering", Bioinformatics.

Kahneman, Daniel (2011), Thinking, Fast and Slow, Penguin.
