Classification Trees

Arthur Charpentier

http://freakonometrics.hypotheses.org/

April 2014

Seminars in quantitative and qualitative analysis methods


The problem

Myocardial infarction patients, observed on admission to the emergency room. For each patient we observe:

◦ heart rate (FRCAR)
◦ cardiac index (INCAR)
◦ systolic index (INSYS)
◦ diastolic pressure (PRDIA)
◦ pulmonary arterial pressure (PAPUL)
◦ ventricular pressure (PVENT)
◦ pulmonary resistance (REPUL)
◦ whether the patient died or survived

Source: Saporta, G. (2006).


Outline

• Introduction
  ◦ The classification problem
  ◦ A reminder on logistic regression
• Assessing a classifier
  ◦ Errors, false positives, false negatives
  ◦ ROC curve and other curves
• Classification trees
  ◦ Splitting criteria: Gini and entropy
  ◦ The CART method
  ◦ Robustification through bootstrap and random forests


Classification: modelling a 0/1 variable

[Figure: six panels, one per covariate (FRCAR, INCAR, INSYS, PAPUL, PVENT, REPUL), showing the observations in each outcome group, DECES (death) and SURVIE (survival).]


Classification: modelling a 0/1 variable

[Figure: box plots of FRCAR, INCAR, INSYS, PAPUL, PVENT and REPUL for each outcome group, DECES (death) and SURVIE (survival).]


Simple linear regression

$$E(Y \mid X = x) = \beta_0 + \beta_1 x$$

[Figure: six panels, one per covariate (FRCAR, INCAR, INSYS, PAPUL, PVENT, REPUL), with the observations in each outcome group, DECES (death) and SURVIE (survival).]


Simple logistic regression

$$E(Y \mid X = x) = P(Y = 1 \mid X = x) = \frac{\exp[\beta_0 + \beta_1 x]}{1 + \exp[\beta_0 + \beta_1 x]}$$

[Figure: six panels, one per covariate (FRCAR, INCAR, INSYS, PAPUL, PVENT, REPUL), with the observations in each outcome group, DECES (death) and SURVIE (survival).]
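To make the simple logistic formula concrete, here is a minimal Python sketch that evaluates P(Y = 1 | X = x) for given coefficients. The function name and the coefficient values are hypothetical, chosen only for illustration; they are not estimates from the myocardial infarction data.

```python
import math


def logistic_probability(beta0, beta1, x):
    """P(Y = 1 | X = x) = exp(b0 + b1 x) / (1 + exp(b0 + b1 x))."""
    eta = beta0 + beta1 * x
    # Two algebraically equivalent forms, chosen to avoid overflow in exp().
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    return math.exp(eta) / (1.0 + math.exp(eta))


# Hypothetical coefficients (not fitted on the data of the slides).
beta0, beta1 = -4.0, 0.05
for frcar in (60, 80, 100):
    print(frcar, round(logistic_probability(beta0, beta1, frcar), 3))
```

With a positive slope the probability increases with x and always stays between 0 and 1, which is what the linear model of the previous slide cannot guarantee.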


Multiple logistic regression

$$E(Y \mid X = x) = P(Y = 1 \mid X = x) = \frac{\exp[\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k]}{1 + \exp[\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k]}$$

[Figure: scatter plot of ventricular pressure (PVENT) against pulmonary resistance (REPUL).]



$$E(Y \mid X = x) = P(Y = 1 \mid X = x) = \frac{\exp[\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k]}{1 + \exp[\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k]}$$

[Figure: scatter plot of ventricular pressure (PVENT) against pulmonary resistance (REPUL), and the probability (of survival), between 0 and 1, plotted over REPUL and PVENT.]



$$E(Y \mid X = x) = P(Y = 1 \mid X = x) = \frac{\exp[s(x_1, \dots, x_k)]}{1 + \exp[s(x_1, \dots, x_k)]}$$

[Figure: scatter plot of ventricular pressure (PVENT) against pulmonary resistance (REPUL), and the probability (of survival), between 0 and 1, plotted over REPUL and PVENT.]


How do we judge the quality of our model?

Classify with a threshold at 50%:
◦ if P(Y = 1|x) < 50%, predict death
◦ if P(Y = 1|x) > 50%, predict survival

Rows give the observed outcome Yi, columns the predicted outcome Ŷi.

          Ŷi = 0   Ŷi = 1   Total
Yi = 0      24        5       29
Yi = 1       4       38       42
Total       28       43       71

[Figure: scatter plot of ventricular pressure (PVENT) against pulmonary resistance (REPUL).]
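The thresholding rule can be sketched in a few lines of Python. The function below is a hypothetical helper, and the observations and fitted probabilities are made-up toy values; the layout (rows for the observed outcome, columns for the predicted one) is the one assumed for the table of the slide.

```python
def confusion_matrix(y_true, p_hat, threshold=0.5):
    """Cross-tabulate observed outcomes against thresholded predictions.

    Returns table[y][y_hat]: rows are the observed Y (0 or 1),
    columns the predicted outcome (1 when p_hat > threshold).
    """
    table = [[0, 0], [0, 0]]
    for y, p in zip(y_true, p_hat):
        y_hat = 1 if p > threshold else 0
        table[y][y_hat] += 1
    return table


# Toy data (hypothetical): observed outcomes and fitted P(Y = 1 | x).
y_true = [0, 0, 1, 1, 1]
p_hat = [0.2, 0.7, 0.4, 0.8, 0.9]

print(confusion_matrix(y_true, p_hat, threshold=0.5))
print(confusion_matrix(y_true, p_hat, threshold=0.0))  # everyone predicted 1
print(confusion_matrix(y_true, p_hat, threshold=1.0))  # everyone predicted 0
```

Moving the threshold shifts observations between the two columns while the row totals stay fixed, which is why the choice of threshold matters when comparing models.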



          Ŷi = 0   Ŷi = 1   Total
Yi = 0      26        3       29
Yi = 1      12       30       42
Total       38       33       71

[Figure: scatter plot of ventricular pressure (PVENT) against pulmonary resistance (REPUL).]



          Ŷi = 0   Ŷi = 1   Total
Yi = 0      18       11       29
Yi = 1       2       40       42
Total       20       51       71

[Figure: scatter plot of ventricular pressure (PVENT) against pulmonary resistance (REPUL).]


How do we judge the quality of our model?

Impact of the threshold, two extreme cases:

          Ŷi = 0   Ŷi = 1   Total
Yi = 0       0       29       29
Yi = 1       0       42       42
Total        0       71       71

          Ŷi = 0   Ŷi = 1   Total
Yi = 0      29        0       29
Yi = 1      42        0       42
Total       71        0       71

[Figure: number of false negatives (FN) and number of false positives (FP) as functions of the threshold (%), from 0 to 100.]



                      Yi = 0                 Yi = 1
Ŷi = 0 (negative)     TN (true negative)     FN (false negative)
Ŷi = 1 (positive)     FP (false positive)    TP (true positive)     precision, PPV (positive predictive value)
                      specificity, TNR       sensitivity, TPR
                      (true negative rate)   (true positive rate)

$$\mathrm{TPR} = \frac{TP}{TP + FN} \quad \text{and} \quad \mathrm{FPR} = \frac{FP}{FP + TN}$$

with

$$\mathrm{TPR}(s) = P(\hat{Y}(s) = 1 \mid Y = 1) \quad \text{and} \quad \mathrm{FPR}(s) = P(\hat{Y}(s) = 1 \mid Y = 0)$$



True positive rate

$$\mathrm{TPR}(s) = P(\hat{Y}(s) = 1 \mid Y = 1) = \frac{n_{\hat{Y}=1, Y=1}}{n_{Y=1}}$$

False positive rate

$$\mathrm{FPR}(s) = P(\hat{Y}(s) = 1 \mid Y = 0) = \frac{n_{\hat{Y}=1, Y=0}}{n_{Y=0}}$$

→ sensitivity/specificity curve, also called the ROC curve (Receiver Operating Characteristic)

$$\{(\mathrm{FPR}(s), \mathrm{TPR}(s)),\ s \in (0, 1)\}$$
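As a worked example, the two rates can be computed from the 50% table shown earlier (24, 5, 4, 38), assuming its rows are observed outcomes and its columns predictions. The sketch below also traces ROC points over a grid of thresholds on made-up toy data; the function names are hypothetical, and the data must contain both outcomes for the rates to be defined.

```python
def rates(tn, fp, fn, tp):
    """Return (TPR, FPR) from the four cells of a confusion table."""
    return tp / (tp + fn), fp / (fp + tn)


def roc_points(y_true, p_hat, thresholds):
    """List of (FPR(s), TPR(s)) for each threshold s."""
    points = []
    for s in thresholds:
        tp = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p > s)
        fn = sum(1 for y, p in zip(y_true, p_hat) if y == 1 and p <= s)
        fp = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p > s)
        tn = sum(1 for y, p in zip(y_true, p_hat) if y == 0 and p <= s)
        tpr, fpr = rates(tn, fp, fn, tp)
        points.append((fpr, tpr))
    return points


# 50% table of the slides: 38 of 42 survivors and 24 of 29 deaths well classified.
tpr, fpr = rates(tn=24, fp=5, fn=4, tp=38)
print(round(tpr, 3), round(fpr, 3))

# Toy data (hypothetical) to trace a few points of a ROC curve.
print(roc_points([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9], [0.0, 0.5, 1.0]))
```

The two extreme thresholds give the corners (1, 1) and (0, 0) of the ROC curve, matching the two extreme tables above.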



The ROC curve is

$$\{(\mathrm{FPR}(s), \mathrm{TPR}(s)),\ s \in (0, 1)\}$$

→ one curve per model.

[Figure: ROC curves. One panel plots sensitivity (%) against specificity (%); the other plots the true positive rate against the false positive rate, with points labelled by threshold (0.1 to 0.9).]


Hierarchical Construction of Classification Trees

Classification = grouping individuals into a limited number of classes.
These classes are built step by step
→ group similar individuals, separate individuals whose characteristics differ.

History: the approach dates back to the 1960s, with renewed interest following the publication of Breiman et al. (1984). The tool has since become popular in machine learning.


Hierarchical Construction of Classification Trees

Top-down classification:
among the explanatory variables, we select the one most strongly related to the response variable Y,
→ this gives a first split of the sample;
we then repeat within each class
→ each class should be as homogeneous as possible with respect to Y.

Differences from logistic regression:
the explanatory variables are used sequentially;
the output is presented as a decision tree, i.e. a sequence of nodes.


Hierarchical Construction of Classification Trees

The algorithm is built from
◦ a criterion to choose the 'best' split of a branch
◦ a rule to decide whether a node is terminal, and becomes a leaf
◦ a method to assign a value to each leaf.

[Figure: decision tree. Node 1 splits on REPUL (p < 0.001) at 1093. On the ≤ 1093 side, node 2 splits on PVENT (p = 0.179) at 11.5 into node 3 (n = 31) and node 4 (n = 8). On the > 1093 side, node 5 splits on REPUL (p = 0.341) at 1583 into node 6 (n = 16) and node 7 (n = 16). Each leaf shows the proportions of SURVIE and DECES.]


Partitioning the space: one explanatory variable

Y ∈ {0, 1} and X ∈ R: we split at a threshold s,

X = A if X ≤ s
X = B if X > s

          X = A        X = B
          X ≤ s        X > s
Y = 0     n_{A,0}      n_{B,0}      n_{·,0}
Y = 1     n_{A,1}      n_{B,1}      n_{·,1}
          n_{A,·}      n_{B,·}      n

Gini

$$\mathrm{gini}(Y \mid X) = -\sum_{x \in \{A, B\}} \frac{n_{x,\cdot}}{n} \sum_{y \in \{0, 1\}} \frac{n_{x,y}}{n_{x,\cdot}} \left(1 - \frac{n_{x,y}}{n_{x,\cdot}}\right)$$


Partitioning the space: one explanatory variable

Y ∈ {0, 1} and X ∈ R: we split at a threshold s,

X = A if X ≤ s
X = B if X > s

          X = A        X = B
          X ≤ s        X > s
Y = 0     n_{A,0}      n_{B,0}      n_{·,0}
Y = 1     n_{A,1}      n_{B,1}      n_{·,1}
          n_{A,·}      n_{B,·}      n

Entropy

$$\mathrm{entropy}(Y \mid X) = -\sum_{x \in \{A, B\}} \frac{n_{x,\cdot}}{n} \sum_{y \in \{0, 1\}} \frac{n_{x,y}}{n_{x,\cdot}} \log\left(\frac{n_{x,y}}{n_{x,\cdot}}\right)$$
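The two criteria can be evaluated directly from the contingency table. The sketch below follows the formulas as written on the slides, so the Gini index is at most 0 and the entropy is at least 0, and both equal 0 for a perfectly homogeneous split. The function names and the counts are hypothetical; the convention 0 · log 0 = 0 is an assumption added to handle empty cells.

```python
import math


def split_gini(counts):
    """Gini index of a split, with the sign convention of the slides.

    counts maps each class x (e.g. "A", "B") to the pair (n_{x,0}, n_{x,1}).
    """
    n = sum(n0 + n1 for n0, n1 in counts.values())
    total = 0.0
    for n0, n1 in counts.values():
        nx = n0 + n1
        if nx == 0:
            continue  # an empty class contributes nothing
        total += (nx / n) * sum((nxy / nx) * (1 - nxy / nx) for nxy in (n0, n1))
    return -total


def split_entropy(counts):
    """Entropy of a split (natural logarithm), with 0 * log(0) taken as 0."""
    n = sum(n0 + n1 for n0, n1 in counts.values())
    total = 0.0
    for n0, n1 in counts.values():
        nx = n0 + n1
        for nxy in (n0, n1):
            if nxy > 0:
                total += (nx / n) * (nxy / nx) * math.log(nxy / nx)
    return -total


pure = {"A": (10, 0), "B": (0, 10)}   # each class contains a single outcome
mixed = {"A": (5, 5), "B": (5, 5)}    # the split carries no information
print(split_gini(pure), split_entropy(pure))
print(split_gini(mixed), split_entropy(mixed))
```

With these conventions, a good split is one whose Gini index is as close to 0 as possible (maximal) or whose entropy is as small as possible.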


Partitioning the space: one explanatory variable

Splitting and the Gini index

$$-\sum_{x \in \{A, B\}} \frac{n_{x,\cdot}}{n} \sum_{y \in \{0, 1\}} \frac{n_{x,y}}{n_{x,\cdot}} \left(1 - \frac{n_{x,y}}{n_{x,\cdot}}\right)$$


Partitioning the space: one explanatory variable

With s fixed, we look for a second split,

A = (−∞, s2], B = (s2, s], C = (s, ∞)

          X = A        X = B           X = C
          X ≤ s2       X ∈ (s2, s]     X > s
Y = 0     n_{A,0}      n_{B,0}         n_{C,0}      n_{·,0}
Y = 1     n_{A,1}      n_{B,1}         n_{C,1}      n_{·,1}
          n_{A,·}      n_{B,·}         n_{C,·}      n

$$-\sum_{x \in \{A, B, C\}} \frac{n_{x,\cdot}}{n} \sum_{y \in \{0, 1\}} \frac{n_{x,y}}{n_{x,\cdot}} \left(1 - \frac{n_{x,y}}{n_{x,\cdot}}\right)$$

[Figure: INSYS observations for DECES and SURVIE (10 to 50), and the Gini index (between −0.200 and −0.170) as a function of the second threshold.]


Partitioning the space: one explanatory variable

With s fixed, we look for a second split,

A = (−∞, s], B = (s, s2], C = (s2, ∞)

          X = A        X = B           X = C
          X ≤ s        X ∈ (s, s2]     X > s2
Y = 0     n_{A,0}      n_{B,0}         n_{C,0}      n_{·,0}
Y = 1     n_{A,1}      n_{B,1}         n_{C,1}      n_{·,1}
          n_{A,·}      n_{B,·}         n_{C,·}      n

$$-\sum_{x \in \{A, B, C\}} \frac{n_{x,\cdot}}{n} \sum_{y \in \{0, 1\}} \frac{n_{x,y}}{n_{x,\cdot}} \left(1 - \frac{n_{x,y}}{n_{x,\cdot}}\right)$$

[Figure: INSYS observations for DECES and SURVIE (10 to 50), and the Gini index (between −0.200 and −0.170) as a function of the second threshold.]
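The search for a threshold on a single variable can be sketched as a scan over candidate cut points, keeping the one with the largest Gini index (the index is negative with the sign convention of the slides). This is a minimal illustration: the data are made up, the function names are hypothetical, and taking midpoints between consecutive observed values as candidates is an assumption rather than something stated on the slides.

```python
def gini_index(x, y, s):
    """Gini index (slide convention, values <= 0) of the split X <= s / X > s."""
    n = len(x)
    total = 0.0
    for left_side in (True, False):
        labels = [yi for xi, yi in zip(x, y) if (xi <= s) == left_side]
        if not labels:
            continue
        p1 = sum(labels) / len(labels)
        # sum over y in {0, 1} of p_y (1 - p_y) = 2 p1 (1 - p1)
        total += (len(labels) / n) * 2 * p1 * (1 - p1)
    return -total


def best_threshold(x, y):
    """Candidate thresholds are midpoints between consecutive distinct values."""
    values = sorted(set(x))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda s: gini_index(x, y, s))


# Toy data (hypothetical), with Y = 1 more frequent for large x.
x = [10, 12, 15, 18, 22, 25, 30, 35]
y = [0, 0, 0, 0, 1, 0, 1, 1]
s = best_threshold(x, y)
print(s, gini_index(x, y, s))
```

A second split is obtained in the same way by applying `best_threshold` within one of the two classes created by the first cut, which is the sequential logic of the two slides above.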


Sequential splitting, computational aspects

At each step, we fix a node and split one of the classes in two

→ tree structure

Important computational advantage: if s can take n values,

$$\{s_1, s_2, \dots, s_n\}, \quad \text{e.g.} \quad \left\{\frac{1}{n+1}, \frac{2}{n+1}, \dots, \frac{n}{n+1}\right\},$$

there are n!/(n − k)! = n(n − 1)(n − 2) · · · (n − k + 1) partitions into k classes.

With the proposed method, only n + (n − 1) + · · · + (n − k) partitions are considered.

Example: n = 100 and k = 5: 9,034,502,400 possible partitions vs. 400 trees.
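The orders of magnitude in this example can be checked numerically. The sketch below computes n!/(n − k)! with `math.perm` (available in Python 3.8 and later) and the number of thresholds examined when cutting sequentially. The number of successive cuts is left as a parameter; using four cuts to reach a figure close to the 400 quoted on the slide is an assumption about how the slide counts them.

```python
import math


def exhaustive_count(n, k):
    """n! / (n - k)! = n (n - 1) ... (n - k + 1)."""
    return math.perm(n, k)


def sequential_count(n, cuts):
    """Thresholds examined when cutting sequentially: n + (n - 1) + ..."""
    return sum(n - j for j in range(cuts))


print(exhaustive_count(100, 5))   # the 9,034,502,400 of the slide
print(sequential_count(100, 4))   # a few hundred candidates only
```

Whatever the exact convention, the sequential search grows linearly in n for a fixed number of cuts, whereas the exhaustive count grows like n to the power k.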


Pruning the tree

Step 1: build the tree by a recursive process of binary splits.

Step 2: prune the tree, removing branches that are too sparse or not representative enough → this requires a pruning criterion (entropy gain).

[Figure: two trees built on a single variable, MYOCARDE[, nom]. The first splits at 18.85, 21.55, 19.75 and 28.25; the second at 18.85, 19.75, 28.25 and 31.6. Leaves are labelled DECES or SURVIE.]


The case of several quantitative variables

Y ∈ {0, 1} and X1, X2 ∈ R: we split X1 at a threshold s,

X = A if X1 ≤ s
X = B if X1 > s

          X = A        X = B
          X1 ≤ s       X1 > s
Y = 0     n_{A,0}      n_{B,0}      n_{·,0}
Y = 1     n_{A,1}      n_{B,1}      n_{·,1}
          n_{A,·}      n_{B,·}      n

$$-\sum_{x \in \{A, B\}} \frac{n_{x,\cdot}}{n} \sum_{y \in \{0, 1\}} \frac{n_{x,y}}{n_{x,\cdot}} \left(1 - \frac{n_{x,y}}{n_{x,\cdot}}\right)$$

[Figure: scatter plot of ventricular pressure (PVENT) against the variable on the horizontal axis (500 to 3000), and the Gini index (between −0.45 and −0.25) as a function of the threshold s on that axis.]


The case of several quantitative variables

Y ∈ {0, 1} and X1, X2 ∈ R: we split X2 at a threshold s,

X = A if X2 ≤ s
X = B if X2 > s

          X = A        X = B
          X2 ≤ s       X2 > s
Y = 0     n_{A,0}      n_{B,0}      n_{·,0}
Y = 1     n_{A,1}      n_{B,1}      n_{·,1}
          n_{A,·}      n_{B,·}      n

$$-\sum_{x \in \{A, B\}} \frac{n_{x,\cdot}}{n} \sum_{y \in \{0, 1\}} \frac{n_{x,y}}{n_{x,\cdot}} \left(1 - \frac{n_{x,y}}{n_{x,\cdot}}\right)$$

[Figure: scatter plot of ventricular pressure (PVENT) against the variable on the horizontal axis (500 to 3000), and the Gini index (between −0.45 and −0.25) as a function of the threshold s, for s between 5 and 20.]
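With several quantitative variables, the same scan is simply repeated for each variable, and the pair (variable, threshold) with the best Gini index is kept. The sketch below illustrates this on made-up data with two hypothetical variables named X1 and X2; as before, midpoints between consecutive observed values are an assumed choice of candidate thresholds.

```python
def gini_index(values, labels, s):
    """Gini index (slide convention, values <= 0) of the split value <= s / > s."""
    n = len(values)
    total = 0.0
    for left_side in (True, False):
        group = [l for v, l in zip(values, labels) if (v <= s) == left_side]
        if not group:
            continue
        p1 = sum(group) / len(group)
        total += (len(group) / n) * 2 * p1 * (1 - p1)
    return -total


def best_split(X, y):
    """X maps a variable name to its list of values.

    Returns (name, threshold, gini) for the best single split.
    """
    best = None
    for name, values in X.items():
        distinct = sorted(set(values))
        for a, b in zip(distinct, distinct[1:]):
            s = (a + b) / 2
            g = gini_index(values, y, s)
            if best is None or g > best[2]:
                best = (name, s, g)
    return best


# Toy data (hypothetical): Y is perfectly separated by X1, not by X2.
X = {"X1": [1, 2, 3, 4, 5, 6], "X2": [5, 1, 4, 2, 6, 3]}
y = [0, 0, 0, 1, 1, 1]
print(best_split(X, y))
```

Applying `best_split` again inside each of the two resulting classes gives the recursive, axis-aligned partition that the tree represents.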


The case of qualitative variables

X takes values in {a, b, c, d}.

Instead of building a tree by successive splits, one can build a tree by successive groupings.

{a, b, c, d}
{(a, b), c, d} {(a, c), b, d} {(a, d), b, c} {(b, c), a, d} {(b, d), a, c} {(c, d), a, b}
{(b, c, a), d} {(b, c, d), a} {(b, c), (a, d)}

[Figure: bar chart over the categories A, B, C, D (vertical axis from 0 to 100).]
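The successive groupings can be enumerated mechanically: at each level, two of the current groups are merged. The sketch below lists the candidate partitions at one level; the function name is hypothetical, and the choice of which merge to keep (here (b, c), as in the last line of the slide) would in practice come from a criterion such as the Gini index.

```python
from itertools import combinations


def merge_one_pair(groups):
    """All partitions obtained by merging exactly two of the given groups.

    Each group is a tuple of category labels; each partition is returned
    as a sorted list of groups.
    """
    partitions = []
    for g1, g2 in combinations(groups, 2):
        merged = tuple(sorted(g1 + g2))
        rest = [g for g in groups if g not in (g1, g2)]
        partitions.append(sorted([merged] + rest))
    return partitions


start = [("a",), ("b",), ("c",), ("d",)]
level1 = merge_one_pair(start)
print(len(level1), level1)

# Continue from the grouping {(b, c), a, d}.
level2 = merge_one_pair([("a",), ("b", "c"), ("d",)])
print(len(level2), level2)
```

With four categories there are six possible first merges, and three possible merges once a pair has been grouped, which matches the lists on the slide.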










{(b, c, a), d} {(b, c, d), a} {(b, c), (a, d)}

[Figure: decision tree on the qualitative variable X. Node 1 (p < 0.001) separates A from {B, C, D}; node 3 (p < 0.001) separates C from {B, D}; node 5 (p = 0.066) separates B from D. The leaves are node 2, node 4, node 6 and node 7, each with n = 100.]


The CART method and extensions

We restrict ourselves to cuts along X1 or X2.

[Figure: two scatter plots on the unit square (both axes from 0 to 1).]


The CART method and extensions

but this does not work well in the presence of nonlinearities or rotations.

[Figure: two scatter plots on the unit square (both axes from 0 to 1).]


The CART method and extensions

One can also try trees built on X1 + X2.

[Figure: two scatter plots on the unit square (both axes from 0 to 1).]
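The effect of a rotation can be illustrated with a single split. In the made-up example below, the outcome depends on whether X1 + X2 exceeds a constant: a cut along X1 alone necessarily misclassifies some points, whereas one cut on the combined variable X1 + X2 separates the two classes. This is a one-split simplification, not the full CART procedure, and the grid of points is a hypothetical construction.

```python
def stump_error(feature, labels):
    """Smallest number of misclassified points for a one-split rule."""
    n = len(labels)
    best = min(sum(labels), n - sum(labels))  # no split: predict one class
    values = sorted(set(feature))
    for a, b in zip(values, values[1:]):
        s = (a + b) / 2
        # rule "predict 1 above s"; the reversed rule makes n - errors mistakes
        errors = sum((f > s) != bool(l) for f, l in zip(feature, labels))
        best = min(best, errors, n - errors)
    return best


# Integer grid, with the points on the diagonal boundary removed.
points = [(x1, x2) for x1 in range(11) for x2 in range(11) if x1 + x2 != 10]
labels = [int(x1 + x2 > 10) for x1, x2 in points]

x1_only = [x1 for x1, _ in points]
x1_plus_x2 = [x1 + x2 for x1, x2 in points]

print(stump_error(x1_only, labels))      # some errors remain
print(stump_error(x1_plus_x2, labels))   # perfect separation
```

A tree restricted to X1 and X2 would need many small rectangular cuts to approximate the diagonal boundary, which is the limitation pointed out on the previous slide.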


Robustness of trees

Trees are appealing, but not very robust,

cf. Classification and regression trees, bagging, and boosting
http://mason.gmu.edu/~csutton/vt6.pdf

A possible idea: bootstrap (resample), then aggregate the predictions.


Resampling and regression

In a linear model, e.g. modelling a person's weight (Y) as a function of their height (X), the Gaussian linear model is

$$Y \mid X = x \sim \mathcal{N}(\beta_0 + \beta_1 x, \sigma^2)$$

$$E(Y \mid X = x) = \beta_0 + \beta_1 x = \hat{Y}(x)$$

$$Y \in \left[\hat{Y}(x) \pm \underbrace{u_{1-\alpha/2}}_{1.96} \cdot \sigma\right]$$


Resampling and regression

Sample (X1, Y1), · · · , (Xn, Yn).
We resample, i.e. draw n observations with replacement,
estimate a model on this sample,
store the prediction,
and repeat this resampling step.
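The resampling loop described above can be sketched for a simple linear regression. The heights and weights below are made-up values, the function names are hypothetical, and the ordinary least squares fit is written out by hand to keep the example self-contained. Reading a 95% interval from the empirical quantiles of the stored predictions is one common choice, not necessarily the one used in the talk.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def bootstrap_predictions(xs, ys, x_new, n_boot=200, seed=0):
    """Predictions at x_new from models fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: the slope is not defined
        b0, b1 = fit_line(bx, by)
        predictions.append(b0 + b1 * x_new)
    return predictions


# Made-up data: heights (cm) and weights (kg).
heights = [155, 160, 165, 170, 175, 180, 185, 190]
weights = [52, 58, 61, 68, 70, 77, 83, 88]
preds = sorted(bootstrap_predictions(heights, weights, x_new=172))
lower = preds[int(0.025 * len(preds))]
upper = preds[int(0.975 * len(preds))]
print(round(lower, 1), round(upper, 1))
```

The spread of the stored predictions reflects the uncertainty of the fitted line at the chosen height, without relying on the Gaussian assumption.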


Resampling and classification trees

We resample, i.e. draw n observations with replacement,
build a tree on this sample;
by repeating, we generate a forest: a random forest.


Resampling and classification trees

We resample, i.e. draw n observations with replacement,
build a tree on this sample;
by repeating, we generate a forest: a random forest.

[Figure: scatter plot of the observations (horizontal axis from 500 to 3000, vertical axis from 5 to 20).]
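The resample-and-aggregate idea can be sketched with very small trees. In the example below each "tree" is a single split (a stump) chosen to minimise the misclassification count on a bootstrap sample, and the predictions are aggregated by averaging the votes. This is plain bagging on made-up data: a random forest in the usual sense also draws a random subset of variables at each split, which is not done here. All names are hypothetical.

```python
import random


def fit_stump(X, y):
    """Best single split as (j, s, above): predict `above` when row[j] > s."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for a, b in zip(values, values[1:]):
            s = (a + b) / 2
            for above in (0, 1):
                errors = sum(
                    (above if row[j] > s else 1 - above) != yi
                    for row, yi in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, s, above)
    return best[1:] if best else None


def bagged_predict(X, y, x_new, n_trees=51, seed=0):
    """Share of bootstrap stumps voting for class 1 at x_new."""
    rng = random.Random(seed)
    n = len(X)
    votes = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        if stump is None:
            continue  # resample with a single distinct point: no split possible
        j, s, above = stump
        votes.append(above if x_new[j] > s else 1 - above)
    return sum(votes) / len(votes) if votes else 0.5


# Made-up data: the outcome is 1 when the first variable is large.
X = [[1, 5], [2, 3], [3, 8], [4, 1], [6, 7], [7, 2], [8, 6], [9, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(bagged_predict(X, y, [9.5, 5]))
print(bagged_predict(X, y, [0.5, 5]))
```

Averaging over many resampled trees smooths out the instability of any single tree, which is the motivation given on the robustness slide.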


References

Trevor Hastie, Robert Tibshirani & Jerome Friedman (2013). Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag. http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf

Leo Breiman, Jerome Friedman, Charles J. Stone & R.A. Olshen (1984). Classification and Regression Trees. CRC.

Kevin P. Murphy (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
