Random Forest
Applied Multivariate Statistics – Spring 2013
Overview
Intuition of Random Forest
The Random Forest Algorithm
De-correlation gives better accuracy
Out-of-bag error (OOB-error)
Variable importance
Intuition of Random Forest
[Figure: three decision trees (Tree 1, Tree 2, Tree 3) grown on the same data, splitting on variables such as age (young/old), height (short/tall), sex (female/male), and employment (working/retired); each leaf is labeled healthy or diseased.]

New sample: old, retired, male, short
Tree predictions: diseased, healthy, diseased
Majority rule: diseased
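The majority-rule step above can be sketched as follows. The three trees are hypothetical stand-ins (the slide's exact split rules are not fully recoverable), but the voting logic is the point:

```python
from collections import Counter

# Three hypothetical trees standing in for the slide's Tree 1-3
# (illustrative rules only).
def tree1(sample):
    return "diseased" if sample["age"] == "old" else "healthy"

def tree2(sample):
    return "healthy" if sample["height"] == "short" else "diseased"

def tree3(sample):
    return "diseased" if sample["employment"] == "retired" else "healthy"

sample = {"age": "old", "height": "short", "sex": "male", "employment": "retired"}
votes = [t(sample) for t in (tree1, tree2, tree3)]
prediction = Counter(votes).most_common(1)[0][0]  # majority rule
print(votes, "->", prediction)  # ['diseased', 'healthy', 'diseased'] -> diseased
```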
The Random Forest Algorithm
Differences to a standard tree
• Train each tree on a bootstrap resample of the data. (Bootstrap resample of a data set with N samples: build a new data set by drawing N samples with replacement; some samples will likely occur multiple times in the new set.)
• At each split, consider only m randomly selected variables.
• Do not prune.
• Fit B trees in this way and aggregate their results by averaging (regression) or majority vote (classification).
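The bootstrap resampling step can be sketched directly (sample names and the seed are illustrative):

```python
import random

random.seed(0)  # for a reproducible illustration

data = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]  # a data set with N = 7 samples

# Bootstrap resample: draw N samples with replacement, so some samples
# typically occur multiple times while others are left out entirely.
boot = [random.choice(data) for _ in range(len(data))]

# The left-out samples are the "out-of-bag" samples for this tree.
oob = [s for s in data if s not in boot]
print(boot)
print(oob)
```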
Why Random Forest works 1/2
Mean squared error = variance + bias²

If trees are sufficiently deep, they have very small bias. How can we improve on the variance of a single tree?
Why Random Forest works 2/2
For B identically distributed trees, each with variance σ² and pairwise correlation ρ, the variance of the forest average is

  Var(average) = ρσ² + ((1 − ρ) / B) σ²

The second term decreases as the number of trees B increases (irrespective of ρ). The first term decreases if ρ decreases, i.e., if m decreases. De-correlation gives better accuracy.
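The variance formula Var = ρσ² + (1 − ρ)σ²/B can be checked numerically (a sketch; `forest_variance` is just that expression, not a library function). With ρ = 0.5 and σ² = 1, adding trees drives the variance down only to the correlation floor ρσ² = 0.5:

```python
def forest_variance(rho, sigma2, B):
    """Variance of the average of B identically distributed trees with
    variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / B

print(forest_variance(0.5, 1.0, 1))      # 1.0 (a single tree)
print(forest_variance(0.5, 1.0, 100))    # ~0.505
print(forest_variance(0.5, 1.0, 10**6))  # ~0.5: the floor rho * sigma2 remains
print(forest_variance(0.1, 1.0, 100))    # ~0.109: smaller rho, smaller floor
```

This is why de-correlating the trees (choosing a smaller m) helps beyond simply growing more of them.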
Estimating generalization error: out-of-bag (OOB) error

Similar to leave-one-out cross-validation, but with almost no additional computational burden.

Data:
old, tall – healthy
old, short – diseased
young, tall – healthy
young, short – healthy
young, short – diseased
young, tall – healthy
old, short – diseased

Resampled data (bootstrap sample used to grow the tree):
old, tall – healthy
old, tall – healthy
old, short – diseased
old, short – diseased
young, tall – healthy
young, tall – healthy
young, short – healthy

[Figure: tree grown on the resampled data, splitting on young/old and then short/tall, with leaves labeled healthy or diseased.]

Out-of-bag samples (never drawn into the resample):
young, short – diseased
young, tall – healthy
old, short – diseased

Out-of-bag (OOB) error rate: 1/3 ≈ 0.33
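The slide's example can be reproduced in a short sketch. The tree's rule is reconstructed from the figure, so treat it as an assumption:

```python
# Tree grown on the bootstrap resample (rule reconstructed from the slide:
# split on young/old, then on short/tall within the old branch).
def tree(age, height):
    if age == "old":
        return "diseased" if height == "short" else "healthy"
    return "healthy"  # all young samples in the resample were healthy

# Out-of-bag samples: never drawn into the resample, so they serve as a
# small built-in test set for this tree.
oob = [
    ("young", "short", "diseased"),
    ("young", "tall", "healthy"),
    ("old", "short", "diseased"),
]

errors = sum(tree(age, height) != label for age, height, label in oob)
oob_error = errors / len(oob)
print(oob_error)  # 1/3, as on the slide
```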
Variable importance for variable i using permutations

For each tree j = 1, …, m:
• Grow tree j on resampled dataset j and compute its OOB error e_j on OOB data j.
• Permute the values of variable i in OOB data set j and recompute the OOB error, p_j.
• Set d_j = p_j − e_j.

Aggregate over the m trees:

  d̄ = (1/m) Σ_{j=1..m} d_j
  s_d² = (1/(m − 1)) Σ_{j=1..m} (d_j − d̄)²
  importance: v_i = d̄ / s_d
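The procedure can be sketched end to end. Everything here is illustrative (toy data, a stand-in "tree", and a forest that is just m copies of it scored on the same OOB rows): the informative variable x0 should get a positive d̄, while permuting the irrelevant x1 changes nothing:

```python
import random
from statistics import mean, stdev

random.seed(1)

# Toy OOB data: the label equals x0, and x1 is pure noise.
data = [({"x0": i % 2, "x1": random.random()}, i % 2) for i in range(40)]

def tree(sample):
    # Stand-in for one fitted tree: predicts the label from x0 alone.
    return sample["x0"]

def oob_error(predict, rows):
    return mean(predict(x) != y for x, y in rows)

def permuted_error(predict, rows, var):
    # Permute the values of one variable across the OOB rows, keep the rest.
    values = [x[var] for x, _ in rows]
    random.shuffle(values)
    return oob_error(predict, [({**x, var: v}, y) for (x, y), v in zip(rows, values)])

# One d_j = p_j - e_j per tree.
m = 5
d = [permuted_error(tree, data, "x0") - oob_error(tree, data) for _ in range(m)]
d_bar, s_d = mean(d), stdev(d)   # importance v_i = d_bar / s_d (when s_d > 0)
print(round(d_bar, 3))
```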
Trees vs. Random Forest

Trees:
+ yield insight into decision rules
+ rather fast
+ easy to tune parameters
− predictions tend to have high variance

Random Forest:
+ smaller prediction variance, and therefore usually better overall performance
+ easy to tune parameters
− rather slow
− "black box": rather hard to get insight into the decision rules
Comparing runtime (just for illustration)

[Figure: runtime comparison of a single tree vs. Random Forest (RF); in the RF timing, the first predictor is cut into 15 levels.]

• Handles up to "thousands" of variables
• Problematic if there are categorical predictors with many levels (max: 32 levels)
RF vs. LDA

LDA:
+ very fast
+ discriminants for visualizing group separation
+ decision rule can be read off
− can model only linear class boundaries
− mediocre performance
− no variable selection
− only for categorical responses
− needs CV for estimating the prediction error

Random Forest:
+ can model nonlinear class boundaries
+ OOB error "for free" (no CV needed)
+ works on continuous and categorical responses (regression / classification)
+ gives variable importance
+ very good performance
− "black box"
− slow
Concepts to know
Idea of Random Forest and how it reduces the prediction
variance of trees
OOB error
Variable Importance based on Permutation
R functions to know
Functions “randomForest” and “varImpPlot” from package “randomForest”