Random Forests for Regression and Classification
Adele Cutler Utah State University
Ovronnaz, Switzerland September 15-17, 2010 1
Leo Breiman, 1928 - 2005
1954: PhD, Berkeley (mathematics)
1960-1967: UCLA (mathematics)
1969-1982: Consultant
1982-1993: Berkeley (statistics)
1984: "Classification and Regression Trees" (with Friedman, Olshen, Stone)
1996: "Bagging"
2001: "Random Forests"
Outline
• Background.
• Trees.
• Bagging predictors.
• Random Forests algorithm.
• Variable importance.
• Proximity measures.
• Visualization.
• Partial plots and interpretation of effects.
What is Regression?
Given data on predictor variables (inputs, X) and a continuous response variable (output, Y), build a model for:
– Predicting the value of the response from the predictors.
– Understanding the relationship between the predictors and the response.
e.g. predict a person's systolic blood pressure based on their age, height, weight, etc.
Regression Examples
• Y: income; X: age, education, sex, occupation, …
• Y: crop yield; X: rainfall, temperature, humidity, …
• Y: test scores; X: teaching method, age, sex, ability, …
• Y: selling price of homes; X: size, age, location, quality, …
Regression Background
• Linear regression
• Multiple linear regression
• Nonlinear regression (parametric)
• Nonparametric regression (smoothing)
  – Kernel smoothing
  – B-splines
  – Smoothing splines
  – Wavelets
Regression Picture

(Slides 8-11: figures only.)
What is Classification?
Given data on predictor variables (inputs, X) and a categorical response variable (output, Y), build a model for:
– Predicting the value of the response from the predictors.
– Understanding the relationship between the predictors and the response.
e.g. predict a person's 5-year survival (yes/no) based on their age, height, weight, etc.
Classification Examples
• Y: presence/absence of disease; X: diagnostic measurements
• Y: land cover (grass, trees, water, roads, …); X: satellite image data (frequency bands)
• Y: loan defaults (yes/no); X: credit score, own or rent, age, marital status, …
• Y: dementia status; X: scores on a battery of psychological tests
Classification Background
• Linear discriminant analysis (1930s)
• Logistic regression (1944)
• Nearest neighbor classifiers (1951)
Classification Picture

(Slides 15-23: figures only.)
Regression and Classification
Given data D = {(x_i, y_i), i = 1, …, n}, where x_i = (x_i1, …, x_ip), build a model f̂ so that Ŷ = f̂(X) for random variables X = (X_1, …, X_p) and Y. Then f̂ will be used for:
– Predicting the value of the response from the predictors: ŷ_0 = f̂(x_0), where x_0 = (x_01, …, x_0p).
– Understanding the relationship between the predictors and the response.
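As a concrete (if simplistic) illustration of this notation, the sketch below builds f̂ by ordinary least squares in scikit-learn; the simulated data and the choice of linear model are invented for the example and are not part of the lecture:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data D = {(x_i, y_i)}: n = 50 observations, p = 2 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

f_hat = LinearRegression().fit(X, y)   # build the model f-hat from D

x0 = np.array([[1.0, -1.0]])           # a new observation x0
y0_hat = f_hat.predict(x0)             # predicted response y0-hat
```

Any other regression method (including the trees and forests that follow) fills the same role of f̂.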
Assumptions
• Independent observations
  – Not autocorrelated over time or space
  – Not usually from a designed experiment
  – Not matched case-control
• Goal is prediction and (sometimes) understanding
  – Which predictors are useful? How? Where?
  – Is there "interesting" structure?
Predictive Accuracy
• Regression
  – Expected mean squared error
• Classification
  – Expected (classwise) error rate
Estimates of Predictive Accuracy
• Resubstitution
  – Use the accuracy on the training set as an estimate of generalization error.
• AIC, etc.
  – Use assumptions about the model.
• Crossvalidation
  – Randomly select a training set, use the rest as the test set.
  – 10-fold crossvalidation.
10-Fold Crossvalidation
Divide the data at random into 10 pieces, D1, …, D10.
• Fit the predictor to D2, …, D10; predict D1.
• Fit the predictor to D1, D3, …, D10; predict D2.
• Fit the predictor to D1, D2, D4, …, D10; predict D3.
• …
• Fit the predictor to D1, D2, …, D9; predict D10.
Compute the estimate using the assembled predictions and their observed values.
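The procedure above can be sketched in Python; scikit-learn's KFold is one way to form the 10 pieces. The simulated data, the tree depth, and the use of a regression tree as the predictor are all arbitrary choices for this illustration:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Simulated data: noisy sine curve on one predictor.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=1)
pred = np.empty_like(y)
for train_idx, test_idx in kf.split(X):
    # Fit to the other nine pieces; predict the held-out piece.
    model = DecisionTreeRegressor(max_depth=3).fit(X[train_idx], y[train_idx])
    pred[test_idx] = model.predict(X[test_idx])

# Estimate assembled from all 10 folds' predictions and observed values.
cv_mse = np.mean((y - pred) ** 2)
```

Every observation is predicted exactly once, by a model that never saw it during fitting.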
Estimates of Predictive Accuracy
Typically, resubstitution estimates are optimistic compared to crossvalidation estimates.

Crossvalidation estimates tend to be pessimistic because they are based on smaller samples.

Random Forests has its own way of estimating predictive accuracy ("out-of-bag" estimates).
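As a preview of the out-of-bag idea, scikit-learn's random forest will report an OOB accuracy estimate when asked: each tree is evaluated only on the observations left out of its bootstrap sample. The synthetic dataset and parameter values here are illustrative, not from the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class problem.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# oob_score=True requests the out-of-bag accuracy estimate.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
oob_error = 1 - rf.oob_score_   # OOB estimate of the error rate
```

No separate test set or crossvalidation loop is needed to get this estimate.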
Case Study: Cavity Nesting birds in the Uintah Mountains, Utah
• Red-naped sapsucker (Sphyrapicus nuchalis) (n = 42 nest sites)
• Mountain chickadee (Parus gambeli) (n = 42 nest sites)
• Northern flicker (Colaptes auratus) (n = 23 nest sites)
• n = 106 non-nest sites
Case Study: Cavity Nesting birds in the Uintah Mountains, Utah

• Response variable is the presence (coded 1) or absence (coded 0) of a nest.
• Predictor variables (measured on 0.04 ha plots around the sites) are:
  – Numbers of trees in various size classes, from less than 1 inch in diameter at breast height to greater than 15 inches in diameter.
  – Number of snags and number of downed snags.
  – Percent shrub cover.
  – Number of conifers.
  – Stand type, coded as 0 for pure aspen and 1 for mixed aspen and conifer.
Assessing Accuracy in Classification
| Actual Class | Predicted: Absence (0) | Predicted: Presence (1) | Total |
|---|---|---|---|
| Absence (0) | a | b | a + b |
| Presence (1) | c | d | c + d |
| Total | a + c | b + d | n |
Assessing Accuracy in Classification
Error rate = (c + b) / n

| Actual Class | Predicted: Absence (0) | Predicted: Presence (1) | Total |
|---|---|---|---|
| Absence (0) | a | b | a + b |
| Presence (1) | c | d | c + d |
| Total | a + c | b + d | n |
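As a quick numeric check of the error-rate formula, here is a minimal sketch; the cell counts a, b, c, d below are hypothetical, chosen only to make the arithmetic visible:

```python
# Hypothetical confusion-matrix cells (rows = actual, columns = predicted).
a, b = 40, 10   # actual absence:  a classified correctly, b as presence
c, d = 5, 45    # actual presence: c classified as absence, d correctly

n = a + b + c + d
error_rate = (c + b) / n   # off-diagonal counts over the total
accuracy = (a + d) / n     # diagonal counts over the total
print(error_rate)          # 0.15
```

The two quantities always sum to 1, since every observation is either on or off the diagonal.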
Resubstitution Accuracy (fully grown tree)
Error rate = (0 + 1) / 213 ≈ 0.005, or 0.5%

| Actual Class | Predicted: Absence (0) | Predicted: Presence (1) | Total |
|---|---|---|---|
| Absence (0) | 105 | 1 | 106 |
| Presence (1) | 0 | 107 | 107 |
| Total | 105 | 108 | 213 |
Crossvalidation Accuracy (fully grown tree)
| Actual Class | Predicted: Absence (0) | Predicted: Presence (1) | Total |
|---|---|---|---|
| Absence (0) | 83 | 23 | 106 |
| Presence (1) | 22 | 85 | 107 |
| Total | 105 | 108 | 213 |

Error rate = (22 + 23) / 213 ≈ 0.21, or 21%
Outline
• Background.
• Trees.
• Bagging predictors.
• Random Forests algorithm.
• Variable importance.
• Proximity measures.
• Visualization.
• Partial plots and interpretation of effects.
Classification and Regression Trees

Pioneers:
• Morgan and Sonquist (1963).
• Breiman, Friedman, Olshen, Stone (1984): CART.
• Quinlan (1993): C4.5.
Classification and Regression Trees
• Grow a binary tree.
• At each node, "split" the data into two "daughter" nodes.
• Splits are chosen using a splitting criterion.
• Bottom nodes are "terminal" nodes.
• For regression, the predicted value at a node is the average response for all observations in the node.
• For classification, the predicted class is the most common class in the node (majority vote).
• For classification trees, we can also get the estimated probability of membership in each of the classes.
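A minimal sketch of these two kinds of classification output, using scikit-learn's CART-style trees; the iris data, Gini criterion, and depth limit are arbitrary choices for illustration, not the lecture's example:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow a binary tree; splits are chosen by the Gini criterion.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2,
                              random_state=0).fit(X, y)

pred_class = tree.predict(X[:1])        # majority vote in the terminal node
pred_proba = tree.predict_proba(X[:1])  # class proportions in that node
```

The probability vector is just the class composition of the terminal node the observation falls into, so its entries sum to 1.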
A Classification Tree
Predict hepatitis (0 = absent, 1 = present) using protein and alkaline phosphate. "Yes" goes left.

(Tree diagram: splits protein < 45.43; protein >= 26; alkphos < 171; protein < 38.59; alkphos < 129.40; terminal nodes labeled with class counts.)
Splitting criteria
• Regression: residual sum of squares
  RSS = Σ_left (y_i − ȳ_L)² + Σ_right (y_i − ȳ_R)²
  where ȳ_L = mean y-value for the left node and ȳ_R = mean y-value for the right node.
• Classification: Gini criterion
  Gini = N_L Σ_{k=1,…,K} p_kL (1 − p_kL) + N_R Σ_{k=1,…,K} p_kR (1 − p_kR)
  where p_kL = proportion of class k in the left node and p_kR = proportion of class k in the right node.
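A brute-force search for the best split on a single predictor under the RSS criterion might look like the sketch below; the toy data are invented, and this is not the deck's (or CART's) actual implementation, just the criterion spelled out:

```python
import numpy as np

def best_split(x, y):
    """Choose the split point on one predictor minimizing
    RSS = sum_left (y_i - ybar_L)^2 + sum_right (y_i - ybar_R)^2."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                     # cannot split between tied x values
        left, right = ys[:i], ys[i:]
        rss = (((left - left.mean()) ** 2).sum()
               + ((right - right.mean()) ** 2).sum())
        if rss < best[1]:
            best = ((xs[i - 1] + xs[i]) / 2, rss)   # split at the midpoint
    return best

# Toy data with an obvious gap between two groups.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
split, rss = best_split(x, y)   # split falls in the gap, at 6.5
```

A tree-growing routine repeats this search over every predictor at every node and takes the overall winner.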
Choosing the best horizontal split
Best horizontal split is at 3.67 with RSS = 68.09.
Choosing the best vertical split
Best vertical split is at 1.05 with RSS = 61.76.
Regression tree (prostate cancer)
Choosing the best split in the left node
Best horizontal split is at 3.66 with RSS = 16.11.
Choosing the best split in the left node
Best vertical split is at -0.48 with RSS = 13.61.
Regression tree (prostate cancer)
Choosing the best split in the right node
Best horizontal split is at 3.07 with RSS = 27.15.
Choosing the best split in the right node
Best vertical split is at 2.79 with RSS = 25.11.
Regression tree (prostate cancer)
Choosing the best split in the third node
Best horizontal split is at 3.07 with RSS = 14.42, but this is too close to the edge. Use 3.46 with RSS = 16.14.
Choosing the best split in the third node
Best vertical split is at 2.46 with RSS = 18.97.
Regression tree (prostate cancer)
Regression tree (prostate cancer)
(Figure: fitted regression tree shown as a surface over lcavol and lweight, with response lpsa.)
Classification tree (hepatitis)
(Tree diagram: splits protein < 45.43; protein >= 26; alkphos < 171; protein < 38.59; alkphos < 129.40; terminal nodes labeled with class counts.)
Pruning
• If the tree is too big, the lower “branches” are modeling noise in the data (“overfitting”).
• The usual paradigm is to grow the trees large and “prune” back unnecessary splits.
• Methods for pruning trees have been developed. Most use some form of crossvalidation. Tuning may be necessary.
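One concrete grow-then-prune recipe, sketched with scikit-learn's minimal cost-complexity pruning; its ccp_alpha parameter plays a role analogous to the cp tuning parameter used later in the slides (which comes from R's rpart). The dataset and the choice of 10-fold crossvalidation for tuning are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow large, then prune: the pruning path gives the candidate alphas
# at which successive subtrees of the full tree become optimal.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Tune by crossvalidation: pick the alpha with the best CV accuracy.
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=10).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```

The pruned tree can never have more nodes than the fully grown one; crossvalidation decides how many of the lower branches were modeling noise.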
Case Study: Cavity Nesting birds in the Uintah Mountains, Utah
Choose cp = 0.035.
Crossvalidation Accuracy (cp = .035)
| Actual Class | Predicted: Absence (0) | Predicted: Presence (1) | Total |
|---|---|---|---|
| Absence (0) | 85 | 21 | 106 |
| Presence (1) | 19 | 88 | 107 |
| Total | 104 | 109 | 213 |

Error rate = (19 + 21) / 213 ≈ 0.19, or 19%
Classification and Regression Trees: Advantages
• Applicable to both regression and classification problems.
• Handle categorical predictors naturally.
• Computationally simple and quick to fit, even for large problems.
• No formal distributional assumptions (non-parametric).
• Can handle highly non-linear interactions and classification boundaries.
• Automatic variable selection.
• Handle missing values through surrogate variables.
• Very easy to interpret if the tree is small.
Classification and Regression Trees
Advantages (continued)
• The picture of the tree can give valuable insights into which variables are important and where.
• The terminal nodes suggest a natural clustering of data into homogeneous groups.
(Tree diagram: splits on protein, fatigue, alkphos, age, albumin, varices, and firm; terminal nodes labeled with class counts, suggesting homogeneous groups.)
September 15-17, 2010 64
![Page 65: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/65.jpg)
Classification and Regression Trees
Disadvantages
• Accuracy – current methods, such as support vector machines and ensemble classifiers, often have error rates 30% lower than CART.
• Instability – if we change the data a little, the tree picture can change a lot, so the interpretation is not as straightforward as it appears.
Today, we can do better! Random Forests
![Page 66: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/66.jpg)
Outline
• Background.
• Trees.
• Bagging predictors.
• Random Forests algorithm.
• Variable importance.
• Proximity measures.
• Visualization.
• Partial plots and interpretation of effects.
![Page 67: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/67.jpg)
Data and Underlying Function
(Figure: the simulated data with the underlying function; x ranges over [−3, 3], y over [−1, 1].)
![Page 68: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/68.jpg)
Single Regression Tree
(Figure: the fit of a single regression tree; x ranges over [−3, 3], y over [−1, 1].)
![Page 69: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/69.jpg)
10 Regression Trees
(Figure: the fits of 10 regression trees; x ranges over [−3, 3], y over [−1, 1].)
![Page 70: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/70.jpg)
Average of 100 Regression Trees
(Figure: the average of 100 regression trees; x ranges over [−3, 3], y over [−1, 1].)
![Page 71: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/71.jpg)
Hard problem for a single tree:
(Figure: two-class training data on the unit square; the 0 and 1 labels are heavily intermixed, so the boundary is hard for a single tree.)
![Page 72: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/72.jpg)
Single tree:
(Figure: the single-tree fit to these data on the unit square.)
![Page 73: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/73.jpg)
25 Averaged Trees:
(Figure: the fit from averaging 25 trees on the unit square.)
![Page 74: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/74.jpg)
25 Voted Trees:
(Figure: the fit from 25 voted trees on the unit square.)
![Page 75: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/75.jpg)
Bagging (Bootstrap Aggregating)
Breiman, “Bagging Predictors”, Machine Learning, 1996.
Fit classification or regression models to bootstrap samples from the data and combine by voting (classification) or averaging (regression).
Bootstrap sample ➾ f1(x)
Bootstrap sample ➾ f2(x)
Bootstrap sample ➾ f3(x)
…
Bootstrap sample ➾ fM(x)

Combine f1(x), …, fM(x) ➾ f(x). The fi(x)’s are “base learners”.
MODEL AVERAGING
![Page 76: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/76.jpg)
Bagging (Bootstrap Aggregating)
• A bootstrap sample is chosen at random with replacement from the data. Some observations end up in the bootstrap sample more than once, while others are not included (“out of bag”).
• Bagging reduces the variance of the base learner but has limited effect on the bias.
• It’s most effective if we use strong base learners that have very little bias but high variance (unstable). E.g. trees.
• Both bagging and boosting are examples of “ensemble learners” that were popular in machine learning in the ‘90s.
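The roughly one-third out-of-bag fraction can be checked with a quick simulation, since a bootstrap sample of size N leaves a given case out with probability (1 − 1/N)^N ≈ 1/e ≈ 0.368. A stdlib Python sketch (the sizes here are arbitrary):

```python
import random

random.seed(0)
N, trials = 1000, 200
oob_fracs = []
for _ in range(trials):
    # Bootstrap sample: N draws with replacement from {0, ..., N-1}.
    drawn = {random.randrange(N) for _ in range(N)}
    # Cases never drawn are "out of bag" for this sample.
    oob_fracs.append(1 - len(drawn) / N)

mean_oob = sum(oob_fracs) / trials  # close to 1/e ≈ 0.368
```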
![Page 77: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/77.jpg)
Bagging CART

| Dataset | # cases | # vars | # classes | CART | Bagged CART | Decrease % |
|---|---|---|---|---|---|---|
| Waveform | 300 | 21 | 3 | 29.1 | 19.3 | 34 |
| Heart | 1395 | 16 | 2 | 4.9 | 2.8 | 43 |
| Breast Cancer | 699 | 9 | 2 | 5.9 | 3.7 | 37 |
| Ionosphere | 351 | 34 | 2 | 11.2 | 7.9 | 29 |
| Diabetes | 768 | 8 | 2 | 25.3 | 23.9 | 6 |
| Glass | 214 | 9 | 6 | 30.4 | 23.6 | 22 |
| Soybean | 683 | 35 | 19 | 8.6 | 6.8 | 21 |
Leo Breiman (1996) “Bagging Predictors”, Machine Learning, 24, 123-140.
![Page 78: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/78.jpg)
Outline
• Background.
• Trees.
• Bagging predictors.
• Random Forests algorithm.
• Variable importance.
• Proximity measures.
• Visualization.
• Partial plots and interpretation of effects.
![Page 79: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/79.jpg)
Random Forests

| Dataset | # cases | # vars | # classes | CART | Bagged CART | Random Forests |
|---|---|---|---|---|---|---|
| Waveform | 300 | 21 | 3 | 29.1 | 19.3 | 17.2 |
| Breast Cancer | 699 | 9 | 2 | 5.9 | 3.7 | 2.9 |
| Ionosphere | 351 | 34 | 2 | 11.2 | 7.9 | 7.1 |
| Diabetes | 768 | 8 | 2 | 25.3 | 23.9 | 24.2 |
| Glass | 214 | 9 | 6 | 30.4 | 23.6 | 20.6 |
Leo Breiman (2001) “Random Forests”, Machine Learning, 45, 5-32.
![Page 80: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/80.jpg)
Random Forests

Grow a forest of many trees (the R default is 500). Grow each tree on an independent bootstrap sample* from the training data.

At each node:
1. Select m variables at random out of all M possible variables (independently for each node).
2. Find the best split on the selected m variables.

Grow the trees to maximum depth (classification). Vote/average the trees to get predictions for new data.

*Sample N cases at random with replacement.
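As a concrete illustration of the algorithm, here is a minimal stdlib-Python sketch using depth-one trees (stumps), so the per-node and per-tree draws of the m variables coincide. Real implementations grow full trees and re-draw the m variables at every node; the toy data at the bottom are made up.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_stump(X, y, feats):
    # Best single split over the m randomly selected features,
    # scored by weighted Gini impurity (as in CART's split search).
    best = None
    for j in feats:
        for t in sorted({x[j] for x in X}):
            left = [yi for xi, yi in zip(X, y) if xi[j] < t]
            right = [yi for xi, yi in zip(X, y) if xi[j] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # bootstrap sample was pure: always predict its class
        maj = Counter(y).most_common(1)[0][0]
        best = (0.0, 0, float("-inf"), maj, maj)
    return best

def grow_forest(X, y, n_trees=50, m=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        feats = rng.sample(range(len(X[0])), m)               # m of the M variables
        forest.append(best_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, x):
    # Vote the trees to classify a new case.
    votes = Counter(left if x[j] < t else right for _, j, t, left, right in forest)
    return votes.most_common(1)[0][0]

# Made-up, linearly separable toy data.
X = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8)]
y = [0, 0, 1, 1]
forest = grow_forest(X, y)
```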
![Page 81: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/81.jpg)
Random Forests Inherit many of the advantages of CART:
• Applicable to both regression and classification problems. Yes.
• Handle categorical predictors naturally. Yes.
• Computationally simple and quick to fit, even for large problems. Yes.
• No formal distributional assumptions (non-parametric). Yes.
• Can handle highly non-linear interactions and classification boundaries. Yes.
• Automatic variable selection. Yes. But need variable importance too.
• Handles missing values through surrogate variables. Using proximities.
• Very easy to interpret if the tree is small. NO!
![Page 82: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/82.jpg)
Random Forests
But do not inherit:
• The picture of the tree can give valuable insights into which variables are important and where.
• The terminal nodes suggest a natural clustering of data into homogeneous groups.
(Figure: six CART trees, each grown on a different bootstrap sample of the same data. The splitting variables (protein, albumin, alkphos, bilirubin, fatigue, varices, firm, prog, sgot, age) and the tree shapes vary markedly from sample to sample; the panels are annotated “NO!”.)
![Page 83: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/83.jpg)
Random Forests
Improve on CART with respect to:
• Accuracy – Random Forests is competitive with the best known machine learning methods (but note the “no free lunch” theorem).
• Instability – if we change the data a little, the individual trees may change, but the forest is relatively stable because it is a combination of many trees.
![Page 84: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/84.jpg)
Two Natural Questions
1. Why bootstrap? (Why subsample?) Bootstrapping → out-of-bag data →
   – Estimated error rate and confusion matrix
   – Variable importance
2. Why trees? Trees → proximities →
   – Missing value fill-in
   – Outlier detection
   – Illuminating pictures of the data (clusters, structure, outliers)
![Page 85: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/85.jpg)
The RF Predictor

• A case in the training data is not in the bootstrap sample for about one third of the trees (we say the case is “out of bag” or “oob”). Vote (or average) the predictions of these trees to give the RF predictor.
• The oob error rate is the error rate of the RF predictor.
• The oob confusion matrix is obtained from the RF predictor.
• For new cases, vote (or average) all the trees to get the RF predictor.
![Page 86: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/86.jpg)
The RF Predictor
For example, suppose we fit 1000 trees, and a case is out-of-bag in 339 of them, of which:
– 283 say “class 1”
– 56 say “class 2”

The RF predictor for this case is class 1. The “oob” error gives an estimate of test-set error (generalization error) as trees are added to the ensemble.
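In code, this oob prediction is a plain majority count over the trees in which the case is out of bag (a sketch using the numbers from the example above):

```python
from collections import Counter

# Votes for one training case, over the 339 (of 1000) trees
# in which the case is out of bag.
oob_votes = Counter({"class 1": 283, "class 2": 56})

# The RF (oob) predictor for this case is the majority class.
rf_oob_prediction = oob_votes.most_common(1)[0][0]
```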
![Page 87: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/87.jpg)
RFs do not overfit as we fit more trees
(Figure: error rate versus number of trees, with oob and test curves; neither rises as more trees are added.)
![Page 88: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/88.jpg)
RF handles thousands of predictors

Ramón Díaz-Uriarte and Sara Alvarez de Andrés, Bioinformatics Unit, Spanish National Cancer Center, March 2005. http://ligarto.org/rdiaz

Compared:
• SVM, linear kernel
• KNN/crossvalidation (Dudoit et al., JASA 2002)
• DLDA
• Shrunken Centroids (Tibshirani et al., PNAS 2002)
• Random forests
“Given its performance, random forest and variable selection using random forest should probably become part of the standard tool-box of methods for the analysis of microarray data.”
![Page 89: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/89.jpg)
Microarray Datasets

| Data | P | N | # Classes |
|---|---|---|---|
| Leukemia | 3051 | 38 | 2 |
| Breast 2 | 4869 | 78 | 2 |
| Breast 3 | 4869 | 96 | 3 |
| NCI60 | 5244 | 61 | 8 |
| Adenocar | 9868 | 76 | 2 |
| Brain | 5597 | 42 | 5 |
| Colon | 2000 | 62 | 2 |
| Lymphoma | 4026 | 62 | 3 |
| Prostate | 6033 | 102 | 2 |
| Srbct | 2308 | 63 | 4 |
![Page 90: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/90.jpg)
Microarray Error Rates

| Data | SVM | KNN | DLDA | SC | RF | RF rank |
|---|---|---|---|---|---|---|
| Leukemia | .014 | .029 | .020 | .025 | .051 | 5 |
| Breast 2 | .325 | .337 | .331 | .324 | .342 | 5 |
| Breast 3 | .380 | .449 | .370 | .396 | .351 | 1 |
| NCI60 | .256 | .317 | .286 | .256 | .252 | 1 |
| Adenocar | .203 | .174 | .194 | .177 | .125 | 1 |
| Brain | .138 | .174 | .183 | .163 | .154 | 2 |
| Colon | .147 | .152 | .137 | .123 | .127 | 2 |
| Lymphoma | .010 | .008 | .021 | .028 | .009 | 2 |
| Prostate | .064 | .100 | .149 | .088 | .077 | 2 |
| Srbct | .017 | .023 | .011 | .012 | .021 | 4 |
| Mean | .155 | .176 | .170 | .159 | .151 | |
![Page 91: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/91.jpg)
RF handles thousands of predictors

Add noise to some standard datasets and see how well Random Forests:
– predicts
– detects the important variables
![Page 92: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/92.jpg)
RF error rates (%)

| Dataset | No noise added | +10 noise vars | Ratio | +100 noise vars | Ratio |
|---|---|---|---|---|---|
| breast | 3.1 | 2.9 | 0.93 | 2.8 | 0.91 |
| diabetes | 23.5 | 23.8 | 1.01 | 25.8 | 1.10 |
| ecoli | 11.8 | 13.5 | 1.14 | 21.2 | 1.80 |
| german | 23.5 | 25.3 | 1.07 | 28.8 | 1.22 |
| glass | 20.4 | 25.9 | 1.27 | 37.0 | 1.81 |
| image | 1.9 | 2.1 | 1.14 | 4.1 | 2.22 |
| iono | 6.6 | 6.5 | 0.99 | 7.1 | 1.07 |
| liver | 25.7 | 31.0 | 1.21 | 40.8 | 1.59 |
| sonar | 15.2 | 17.1 | 1.12 | 21.3 | 1.40 |
| soy | 5.3 | 5.5 | 1.06 | 7.0 | 1.33 |
| vehicle | 25.5 | 25.0 | 0.98 | 28.7 | 1.12 |
| votes | 4.1 | 4.6 | 1.12 | 5.4 | 1.33 |
| vowel | 2.6 | 4.2 | 1.59 | 17.9 | 6.77 |
![Page 93: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/93.jpg)
RF error rates (%), by number of added noise variables

| Dataset | No noise added | 10 | 100 | 1,000 | 10,000 |
|---|---|---|---|---|---|
| breast | 3.1 | 2.9 | 2.8 | 3.6 | 8.9 |
| glass | 20.4 | 25.9 | 37.0 | 51.4 | 61.7 |
| votes | 4.1 | 4.6 | 5.4 | 7.8 | 17.7 |
![Page 94: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/94.jpg)
Outline
• Background.
• Trees.
• Bagging predictors.
• Random Forests algorithm.
• Variable importance.
• Proximity measures.
• Visualization.
• Partial plots and interpretation of effects.
![Page 95: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/95.jpg)
Variable Importance

RF computes two measures of variable importance: one based on a rough-and-ready impurity measure (Gini, for classification), the other based on permutations. To understand how permutation importance is computed, we need to understand local variable importance. But first…
![Page 96: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/96.jpg)
RF variable importance

| Dataset | m | Number in top m (10 noise vars) | Percent | Number in top m (100 noise vars) | Percent |
|---|---|---|---|---|---|
| breast | 9 | 9.0 | 100.0 | 9.0 | 100.0 |
| diabetes | 8 | 7.6 | 95.0 | 7.3 | 91.2 |
| ecoli | 7 | 6.0 | 85.7 | 6.0 | 85.7 |
| german | 24 | 20.0 | 83.3 | 10.1 | 42.1 |
| glass | 9 | 8.7 | 96.7 | 8.1 | 90.0 |
| image | 19 | 18.0 | 94.7 | 18.0 | 94.7 |
| ionosphere | 34 | 33.0 | 97.1 | 33.0 | 97.1 |
| liver | 6 | 5.6 | 93.3 | 3.1 | 51.7 |
| sonar | 60 | 57.5 | 95.8 | 48.0 | 80.0 |
| soy | 35 | 35.0 | 100.0 | 35.0 | 100.0 |
| vehicle | 18 | 18.0 | 100.0 | 18.0 | 100.0 |
| votes | 16 | 14.3 | 89.4 | 13.7 | 85.6 |
| vowel | 10 | 10.0 | 100.0 | 10.0 | 100.0 |
![Page 97: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/97.jpg)
RF variable importance: number in top m, by number of added noise variables

| Dataset | m | 10 | 100 | 1,000 | 10,000 |
|---|---|---|---|---|---|
| breast | 9 | 9.0 | 9.0 | 9 | 9 |
| glass | 9 | 8.7 | 8.1 | 7 | 6 |
| votes | 16 | 14.3 | 13.7 | 13 | 13 |
![Page 98: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/98.jpg)
Local Variable Importance

We usually think about variable importance as an overall measure. In part, this is probably because we fit models with global structure (linear regression, logistic regression). In CART, variable importance is local.
![Page 99: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/99.jpg)
Local Variable Importance
Different variables are important in different regions of the data.
If protein is high, we don’t care about alkaline phosphate. Similarly if protein is low. But for intermediate values of protein, alkaline phosphate is important.
(Tree diagram: splits on protein < 45.43, protein ≥ 26, alkphos < 171, protein < 38.59, and alkphos < 129.40; terminal nodes show the class counts.)
![Page 100: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/100.jpg)
Local Variable Importance

For each tree, look at the out-of-bag data:
• randomly permute the values of variable j
• pass these perturbed data down the tree and save the predicted classes.
For case i and variable j, compute

(error rate with variable j permuted) − (error rate with no permutation),

where the error rates are taken over all trees for which case i is out-of-bag.
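A sketch of this computation in Python. The `trees` and `oob` arguments are hypothetical stand-ins for the bookkeeping a forest implementation records; only the permute-and-compare logic follows the description above.

```python
import random

def local_importance(trees, oob, X, y, j, rng):
    """Case-wise permutation importance of variable j.
    trees[t] is a predict(x) callable; oob[t][i] is True when case i is
    out of bag for tree t. Returns, for each case, the error rate with
    variable j permuted minus the error rate with no permutation, taken
    over the trees for which the case is out of bag."""
    n = len(X)
    perm = list(range(n))
    rng.shuffle(perm)  # permute the values of variable j across cases
    imp = []
    for i in range(n):
        my_trees = [t for t in range(len(trees)) if oob[t][i]]
        if not my_trees:
            imp.append(0.0)
            continue
        plain = sum(trees[t](X[i]) != y[i] for t in my_trees) / len(my_trees)
        xi = list(X[i])
        xi[j] = X[perm[i]][j]  # case i with variable j's value permuted
        permuted = sum(trees[t](tuple(xi)) != y[i] for t in my_trees) / len(my_trees)
        imp.append(permuted - plain)
    return imp
```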
![Page 101: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/101.jpg)
Local importance for a class 2 case
| Tree | No permutation | Permute variable 1 | … | Permute variable m |
|---|---|---|---|---|
| 1 | 2 | 2 | … | 1 |
| 3 | 2 | 2 | … | 2 |
| 4 | 1 | 1 | … | 1 |
| 9 | 2 | 2 | … | 1 |
| … | … | … | … | … |
| 992 | 2 | 2 | … | 2 |
| % Error | 10% | 11% | … | 35% |
![Page 102: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/102.jpg)
Outline
• Background.
• Trees.
• Bagging predictors.
• Random Forests algorithm.
• Variable importance.
• Proximity measures.
• Visualization.
• Partial plots and interpretation of effects.
![Page 103: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/103.jpg)
Proximities
Proximity of two cases is the proportion of the time that they end up in the same node.
The proximities don’t just measure similarity of the variables - they also take into account the importance of the variables.
Two cases that have quite different predictor variables might have large proximity if they differ only on variables that are not important.
Two cases that have quite similar values of the predictor variables might have small proximity if they differ on inputs that are important.
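As a sketch, the proximity matrix can be accumulated straight from the terminal-node assignments; the `leaf_ids` structure is a hypothetical stand-in for the per-tree bookkeeping a forest keeps:

```python
def proximities(leaf_ids):
    """leaf_ids[t][i] = terminal node reached by case i in tree t.
    Proximity of cases i and j = fraction of trees in which they
    land in the same terminal node."""
    n_trees, n = len(leaf_ids), len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1 / n_trees
    return prox
```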
![Page 104: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/104.jpg)
Visualizing using Proximities
To “look” at the data we use classical multidimensional scaling (MDS) on the proximities to get a picture in 2-D or 3-D:

proximities → MDS → scaling variables

We might see clusters, outliers, or unusual structure. Nonmetric MDS can also be used.
![Page 105: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/105.jpg)
Visualizing using Proximities
• at-a-glance information about which classes are overlapping, which classes differ
• find clusters within classes
• find easy/hard/unusual cases

With a good tool we can also:
• identify characteristics of unusual points
• see which variables are locally important
• see how clusters or unusual points differ
![Page 106: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/106.jpg)
Visualizing using Proximities
Synthetic data: 600 cases, 2 meaningful variables, 48 “noise” variables, 3 classes.
![Page 107: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/107.jpg)
The Problem with Proximities
Proximities based on all the data overfit! For example, two cases from different classes must have proximity zero if the trees are grown deep.
(Figure: left panel “Data”, X1 versus X2; right panel “MDS”, dimension 1 versus dimension 2 of the scaled proximities.)
![Page 108: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/108.jpg)
Proximity-weighted Nearest Neighbors
RF is like a nearest-neighbor classifier:
• Use the proximities as weights for nearest-neighbors.
• Classify the training data.
• Compute the error rate.
Want the error rate to be close to the RF oob error rate.
BAD NEWS! If we compute proximities from trees in which both cases are OOB, we don’t get good accuracy when we use the proximities for prediction!
![Page 109: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/109.jpg)
Proximity-weighted Nearest Neighbors
| Dataset | RF | OOB |
|---|---|---|
| breast | 2.6 | 2.9 |
| diabetes | 24.2 | 23.7 |
| ecoli | 11.6 | 12.5 |
| german | 23.6 | 24.1 |
| glass | 20.6 | 23.8 |
| image | 1.9 | 2.1 |
| iono | 6.8 | 6.8 |
| liver | 26.4 | 26.7 |
| sonar | 13.9 | 21.6 |
| soy | 5.1 | 5.4 |
| vehicle | 24.8 | 27.4 |
| votes | 3.9 | 3.7 |
| vowel | 2.6 | 4.5 |
![Page 110: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/110.jpg)
Proximity-weighted Nearest Neighbors
| Dataset | RF | OOB |
|---|---|---|
| Waveform | 15.5 | 16.1 |
| Twonorm | 3.7 | 4.6 |
| Threenorm | 14.5 | 15.7 |
| Ringnorm | 5.6 | 5.9 |
![Page 111: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/111.jpg)
New Proximity Method

Start with P = I, the identity matrix.

For each observation i, and for each tree in which case i is oob:
– Pass case i down the tree and note which terminal node it falls into.
– Increase the proximity between observation i and the k in-bag cases that are in the same terminal node, by the amount 1/k.

One can show that, except for ties, this gives the same error rate as RF when used as a proximity-weighted nearest-neighbor classifier.
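A sketch of this accumulation, again with hypothetical `leaf_ids`/`inbag` bookkeeping standing in for what the forest records:

```python
def new_proximities(leaf_ids, inbag):
    """leaf_ids[t][i]: terminal node of case i in tree t (oob cases are
    passed down the tree too); inbag[t][i]: True if case i is in tree t's
    bootstrap sample. Start from the identity matrix; for each oob case i,
    add 1/k to prox[i][j] for each of the k in-bag cases j sharing its
    terminal node. The result is generally asymmetric."""
    n_trees, n = len(leaf_ids), len(leaf_ids[0])
    prox = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for t in range(n_trees):
        for i in range(n):
            if inbag[t][i]:
                continue  # accumulate only for out-of-bag cases
            mates = [j for j in range(n)
                     if inbag[t][j] and leaf_ids[t][j] == leaf_ids[t][i]]
            for j in mates:
                prox[i][j] += 1.0 / len(mates)
    return prox
```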
![Page 112: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/112.jpg)
New Method
| Dataset | RF | OOB | New |
|---|---|---|---|
| breast | 2.6 | 2.9 | 2.6 |
| diabetes | 24.2 | 23.7 | 24.4 |
| ecoli | 11.6 | 12.5 | 11.9 |
| german | 23.6 | 24.1 | 23.4 |
| glass | 20.6 | 23.8 | 20.6 |
| image | 1.9 | 2.1 | 1.9 |
| iono | 6.8 | 6.8 | 6.8 |
| liver | 26.4 | 26.7 | 26.4 |
| sonar | 13.9 | 21.6 | 13.9 |
| soy | 5.1 | 5.4 | 5.3 |
| vehicle | 24.8 | 27.4 | 24.8 |
| votes | 3.9 | 3.7 | 3.7 |
| vowel | 2.6 | 4.5 | 2.6 |
![Page 113: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/113.jpg)
New Method
| Dataset | RF | OOB | New |
|---|---|---|---|
| Waveform | 15.5 | 16.1 | 15.5 |
| Twonorm | 3.7 | 4.6 | 3.7 |
| Threenorm | 14.5 | 15.7 | 14.5 |
| Ringnorm | 5.6 | 5.9 | 5.6 |
![Page 114: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/114.jpg)
But…

The new “proximity” matrix is not symmetric!
→ Methods for doing multidimensional scaling on asymmetric matrices.
![Page 115: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/115.jpg)
Other Uses for Random Forests
• Missing data imputation.
• Feature selection (before using a method that cannot handle high dimensionality).
• Unsupervised learning (cluster analysis).
• Survival analysis without making the proportional hazards assumption.
![Page 116: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/116.jpg)
Missing Data Imputation

Fast way: replace missing values for a given variable by the median of the non-missing values (or the most frequent value, if categorical).

Better way (using proximities):
1. Start with the fast way.
2. Get proximities.
3. Replace missing values in case i by a weighted average of the non-missing values, with weights proportional to the proximity between case i and the cases with non-missing values.
Repeat steps 2 and 3 a few times (5 or 6).
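The proximity-weighted imputation loop can be sketched as follows for a numeric matrix. The `proximity` argument is an assumption: a callable standing in for a random-forest run that returns the n-by-n proximity matrix of the completed data.

```python
import numpy as np

def impute(X, proximity, n_iter=5):
    """Proximity-weighted imputation sketch (np.nan marks missing).

    proximity : callable X -> (n, n) proximity matrix; a stand-in
    for refitting the forest on the completed data each pass.
    """
    X = X.copy()
    miss = np.isnan(X)
    # Step 1 (fast way): fill with column medians of non-missing values.
    med = np.nanmedian(X, axis=0)
    X[miss] = np.take(med, np.where(miss)[1])
    for _ in range(n_iter):                  # repeat steps 2-3 a few times
        P = proximity(X)                     # step 2: get proximities
        for i, j in zip(*np.where(miss)):    # step 3: weighted average
            donors = ~miss[:, j]             # cases observed on variable j
            w = P[i, donors]
            if w.sum() > 0:
                X[i, j] = np.average(X[donors, j], weights=w)
    return X
```

Categorical variables would instead take the proximity-weighted most frequent category, as the slide notes.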
![Page 117: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/117.jpg)
Feature Selection

• Ramón Díaz-Uriarte: varSelRF R package.
• In the 2003 NIPS feature-selection competition, several of the top entries used RF for feature selection.
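A minimal backward-elimination sketch in the spirit of varSelRF: fit a forest, rank variables by importance, drop the least important fraction, and repeat. scikit-learn's `RandomForestClassifier` is used here as a stand-in for the R package, and the fractions and stopping rule are illustrative choices, not varSelRF's exact defaults.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_select(X, y, keep_frac=0.8, min_features=2, random_state=0):
    """Iteratively drop the least important features by RF importance."""
    idx = np.arange(X.shape[1])              # surviving feature indices
    while len(idx) > min_features:
        rf = RandomForestClassifier(n_estimators=100,
                                    random_state=random_state)
        rf.fit(X[:, idx], y)
        order = np.argsort(rf.feature_importances_)[::-1]
        n_keep = max(min_features, int(keep_frac * len(idx)))
        idx = idx[order[:n_keep]]            # keep the top keep_frac
    return idx
```

varSelRF additionally tracks oob error across the shrinking models and returns the smallest set whose error is within a tolerance of the best; the loop above shows only the elimination mechanics.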
![Page 118: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/118.jpg)
Unsupervised Learning

“Global histone modification patterns predict risk of prostate cancer recurrence”
David B. Seligson, Steve Horvath, Tao Shi, Hong Yu, Sheila Tze, Michael Grunstein and Siavash K. Kurdistani (all at UCLA).

Used RF clustering of 183 tissue microarrays to find two disease subgroups with distinct risks of tumor recurrence.

http://www.nature.com/nature/journal/v435/n7046/full/nature03672.html
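RF clustering works by turning the unsupervised problem into a supervised one: label the real data class 1, generate a synthetic class 0 by sampling each variable's marginal independently (destroying the dependence structure), fit a forest, and cluster the real cases using the resulting proximities. A sketch, with scikit-learn standing in for the original FORTRAN code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_proximity_unsupervised(X, n_estimators=200, random_state=0):
    """Proximities for unlabeled data via the synthetic-class trick."""
    rng = np.random.RandomState(random_state)
    # Synthetic class: permute each column independently, keeping the
    # marginals but breaking the joint dependence.
    X_syn = np.column_stack([rng.permutation(X[:, j])
                             for j in range(X.shape[1])])
    X_all = np.vstack([X, X_syn])
    y_all = np.r_[np.ones(len(X)), np.zeros(len(X))]
    rf = RandomForestClassifier(n_estimators=n_estimators,
                                random_state=random_state).fit(X_all, y_all)
    # Standard RF proximity of the real cases: fraction of trees in
    # which two cases fall in the same terminal node.
    leaves = rf.apply(X)                     # (n_samples, n_trees)
    n = len(X)
    prox = np.array([[np.mean(leaves[a] == leaves[b])
                      for b in range(n)] for a in range(n)])
    return prox
```

The proximity matrix can then be fed to multidimensional scaling or a clustering method (as in the prostate-cancer study above) to look for subgroups.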
![Page 119: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/119.jpg)
Survival Analysis
• Hemant Ishwaran and Udaya B. Kogalur: randomSurvivalForest R package.
![Page 120: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/120.jpg)
Outline
• Background.
• Trees.
• Bagging predictors.
• Random Forests algorithm.
• Variable importance.
• Proximity measures.
• Visualization.
![Page 121: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/121.jpg)
Case Study: Cavity-Nesting Birds in the Uintah Mountains, Utah

• Red-naped sapsucker (Sphyrapicus nuchalis) (n = 42 nest sites)
• Mountain chickadee (Parus gambeli) (n = 42 nest sites)
• Northern flicker (Colaptes auratus) (n = 23 nest sites)
• n = 106 non-nest sites
![Page 122: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/122.jpg)
Case Study: Cavity-Nesting Birds in the Uintah Mountains, Utah

• Response variable is the presence (coded 1) or absence (coded 0) of a nest.
• Predictor variables (measured on 0.04 ha plots around the sites) are:
– Numbers of trees in various size classes, from less than 1 inch in diameter at breast height to greater than 15 inches in diameter.
– Number of snags and number of downed snags.
– Percent shrub cover.
– Number of conifers.
– Stand type, coded as 0 for pure aspen and 1 for mixed aspen and conifer.
![Page 123: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/123.jpg)
Autism

Data courtesy of J. D. Odell and R. Torres, USU.
154 subjects (308 chromosomes), 7 variables, all categorical (up to 30 categories).
2 classes:
– Normal, blue (69 subjects)
– Autistic, red (85 subjects)
![Page 124: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/124.jpg)
Brain Cancer Microarrays

Pomeroy et al., Nature, 2002. Dettling and Bühlmann, Genome Biology, 2002.
42 cases, 5,597 genes, 5 tumor types:
• 10 medulloblastomas BLUE
• 10 malignant gliomas PALE BLUE
• 10 atypical teratoid/rhabdoid tumors (AT/RTs) GREEN
• 4 human cerebella ORANGE
• 8 PNETs RED
![Page 125: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/125.jpg)
Dementia

Data courtesy of J. T. Tschanz, USU.
516 subjects, 28 variables.
2 classes:
– No cognitive impairment, blue (372 people)
– Alzheimer’s, red (144 people)
![Page 126: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/126.jpg)
Metabolomics

ALS (Lou Gehrig’s disease) data courtesy of Metabolon (Chris Beecham).
63 subjects, 317 variables.
3 classes:
– Blue (22 subjects): ALS (no meds)
– Green (9 subjects): ALS (on meds)
– Red (32 subjects): healthy
![Page 127: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/127.jpg)
Random Forests Software

• Free, open-source code (FORTRAN, Java): www.math.usu.edu/~adele/forests
• Commercial version (academic discounts): www.salford-systems.com
• R interface, developed independently by Andy Liaw and Matthew Wiener.
![Page 128: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/128.jpg)
Java Software

Raft uses VisAD (www.ssec.wisc.edu/~billh/visad.html) and ImageJ (http://rsb.info.nih.gov/ij/).
These are both open-source projects with great mailing lists and helpful developers.
![Page 129: Random Forests for Classification and Regression - · PDF fileRandom Forests for Regression and ... – Expected mean squared error • Classification ... – Randomly select a training](https://reader030.vdocument.in/reader030/viewer/2022021504/5a930bee7f8b9a8b5d8c1eb2/html5/thumbnails/129.jpg)
References

• Leo Breiman, Jerome Friedman, Richard Olshen, Charles Stone (1984) “Classification and Regression Trees” (Wadsworth).
• Leo Breiman (1996) “Bagging Predictors” Machine Learning, 24, 123-140.
• Leo Breiman (2001) “Random Forests” Machine Learning, 45, 5-32.
• Trevor Hastie, Robert Tibshirani, Jerome Friedman (2009) “The Elements of Statistical Learning” (Springer).