Tree Depth in a Forest
NUS / IMS Workshop on Classification and Regression Trees
Mark SegalCenter for Bioinformatics & Molecular Biostatistics
Division of BioinformaticsDepartment of Epidemiology and Biostatistics
UCSF
• Breiman, Friedman, Olshen, Stone (1984)
• Popularized tree-structured techniques
• Primary distinction from earlier approaches?
• Means for determining tree size
• Grow large / maximal initial tree
• capture all potential action
• Cost-complexity pruning
• Cross-validation based selection
• Size determination critical consideration
• Why??
CART
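The grow-large-then-prune recipe above is easy to sketch in R with rpart (a CART implementation); this is an illustrative example using the kyphosis data that ships with rpart, with control settings chosen only to force a maximal initial tree, not values from the talk.

```r
library(rpart)

# Grow a large / maximal initial tree: cp = 0 disables the complexity
# penalty while growing, minsplit = 2 allows splitting tiny nodes,
# xval = 10 requests 10-fold cross-validation of the cp sequence.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(cp = 0, minsplit = 2, xval = 10))

# Cost-complexity pruning with cross-validation based selection:
# pick the cp value with smallest cross-validated error (xerror),
# then prune back to that subtree.
cp.tab  <- fit$cptable
best.cp <- cp.tab[which.min(cp.tab[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best.cp)

printcp(fit)   # cp table: tree size versus cross-validated error
plotcp(fit)    # visualize the size-selection step
```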
[Figure 2.11: Test and training error as a function of model complexity. Prediction error (y-axis) is plotted against model complexity (x-axis) for a training sample and a test sample; low complexity corresponds to high bias / low variance, high complexity to low bias / high variance.]
be close to f(x0). As k grows, the neighbors are further away, and then anything can happen.
The variance term is simply the variance of an average here, and decreases as the inverse of k. So as k varies, there is a bias–variance tradeoff.
More generally, as the model complexity of our procedure is increased, the variance tends to increase and the squared bias tends to decrease. The opposite behavior occurs as the model complexity is decreased. For k-nearest neighbors, the model complexity is controlled by k.
Typically we would like to choose our model complexity to trade bias off with variance in such a way as to minimize the test error. An obvious estimate of test error is the training error $\frac{1}{N}\sum_i (y_i - \hat{y}_i)^2$. Unfortunately training error is not a good estimate of test error, as it does not properly account for model complexity.
Figure 2.11 shows the typical behavior of the test and training error, as model complexity is varied. The training error tends to decrease whenever we increase the model complexity, that is, whenever we fit the data harder. However with too much fitting, the model adapts itself too closely to the training data, and will not generalize well (i.e., have large test error). In that case the predictions $\hat{f}(x_0)$ will have large variance, as reflected in the last term of expression (2.46). In contrast, if the model is not complex enough, it will underfit and may have large bias, again resulting in poor generalization. In Chapter 7 we discuss methods for estimating the test error of a prediction method, and hence estimating the optimal amount of model complexity for a given prediction method and training set.
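The Figure 2.11 story can be reproduced with a small self-contained simulation (my own construction, not from the excerpt): k-nearest-neighbor regression on synthetic data, where decreasing k increases model complexity, training error falls monotonically, and test error is U-shaped.

```r
set.seed(1)
n <- 200
x.train <- runif(n); y.train <- sin(2 * pi * x.train) + rnorm(n, sd = 0.3)
x.test  <- runif(n); y.test  <- sin(2 * pi * x.test)  + rnorm(n, sd = 0.3)

# Simple 1-d kNN regression: average the y's of the k nearest training x's.
knn_avg <- function(x0, x, y, k) {
  sapply(x0, function(z) mean(y[order(abs(x - z))[1:k]]))
}

ks  <- c(1, 2, 5, 10, 25, 50, 100)
err <- t(sapply(ks, function(k) c(
  train = mean((y.train - knn_avg(x.train, x.train, y.train, k))^2),
  test  = mean((y.test  - knn_avg(x.test,  x.train, y.train, k))^2))))
rownames(err) <- ks
round(err, 3)   # training error shrinks as k -> 1; test error is U-shaped
```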
Predictive Performance
CART Monograph
• CART lived happily ever after
• widespread uptake in diverse fields
• many methodological refinements
• this workshop (thanks Wei-Yin!)
• But, what about predictive performance??
• The better the model fits, the more sound the inference
• Conventional models and CART tend to fit very poorly
• Fit measured by prediction error (PE)
• Substantial gains in PE can be achieved by using ensembles of (weak) predictors
• in particular, individual trees
Breiman Mantra
• Breiman (2001a,b)
• Have become a forefront prediction technique
• Notable gains in prediction performance over individual trees
• PE variance reduced by averaging over the randomness-injected ensemble
• Individual trees grown to large / maximal depth
• Major departure from CART paradigm
• Seemingly, averaging over the ensemble more than compensates for increased individual tree variability
Random Forests
A RF is a collection of tree predictors $h(x; \theta_t),\ t = 1, \ldots, T$, with the $\theta_t$ iid random vectors.
For regression, the forest prediction is the unweighted average over the collection: $\bar{h}(x) = \frac{1}{T}\sum_{t=1}^{T} h(x; \theta_t)$.
As $T \to \infty$ the Law of Large Numbers ensures
$E_{X,Y}\,(Y - \bar{h}(X))^2 \;\to\; E_{X,Y}\,(Y - E_{\theta}\,h(X;\theta))^2 \;\equiv\; PE^*_f$, the forest prediction error.
Convergence implies forests don't overfit.
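This convergence can be eyeballed empirically. A rough sketch (assuming the randomForest and mlbench packages) tracks out-of-bag error as the number of trees grows on a simulated Friedman #1 regression problem; the error stabilizes rather than turning upward.

```r
library(randomForest)
library(mlbench)

set.seed(1)
sim <- mlbench.friedman1(500, sd = 1)   # list with $x (predictors) and $y

# Near-maximal-depth trees are the regression default (nodesize = 5);
# rf$mse[t] is the out-of-bag MSE using the first t trees, so it traces
# prediction error as the ensemble grows.
rf <- randomForest(sim$x, sim$y, ntree = 1000)

plot(rf$mse, type = "l", xlab = "Number of trees",
     ylab = "Out-of-bag MSE")   # flattens out: no overfitting in T
```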
• Growing trees to maximal depth minimizes bias
• But potentially incurs prediction variance cost
• Averaging over ensemble putatively handles this
• But how was it established that such averaging (more than) compensates for increased individual tree variability??
• Hard to address theoretically (will try later)
• Breiman (2001a,b) addressed empirically using
• UCI Irvine machine learning benchmark datasets
• Includes classification and regression problems
• Simulated and (predominantly) real data
• Exported to R mlbench library
Table 1. Data set descriptions

Data set     Training sample size   Test sample size   Variables   Classes
Cancer              699                    —               9           2
Ionosphere          351                    —              34           2
Diabetes            768                    —               8           2
Glass               214                    —               9           6
Soybean             683                    —              35          19
Letters          15,000                 5,000             16          26
Satellite         4,435                 2,000             36           6
Shuttle          43,500                14,500              9           7
DNA               2,000                 1,186             60           3
Digit             7,291                 2,007            256          10
that in many states, the trials were anything but speedy. It funded a study of the causes of the delay. I visited many states and decided to do the analysis in Colorado, which had an excellent computerized court data system. A wealth of information was extracted and processed.
The dependent variable for each criminal case was the time from arraignment to the time of sentencing. All of the other information in the trial history were the predictor variables. A large decision tree was grown, and I showed it on an overhead and explained it to the assembled Colorado judges. One of the splits was on District N which had a larger delay time than the other districts. I refrained from commenting on this. But as I walked out I heard one judge say to another, “I knew those guys in District N were dragging their feet.”
While trees rate an A+ on interpretability, they are good, but not great, predictors. Give them, say, a B on prediction.
9.1 Growing Forests for Prediction
Instead of a single tree predictor, grow a forest of trees on the same data, say 50 or 100. If we are classifying, put the new x down each tree in the forest and get a vote for the predicted class. Let the forest prediction be the class that gets the most votes. There has been a lot of work in the last five years on ways to grow the forest. All of the well-known methods grow the forest by perturbing the training set, growing a tree on the perturbed training set, perturbing the training set again, growing another tree, etc. Some familiar methods are bagging (Breiman, 1996b), boosting (Freund and Schapire, 1996), arcing (Breiman, 1998), and additive logistic regression (Friedman, Hastie and Tibshirani, 1998).
My preferred method to date is random forests. In this approach successive decision trees are grown by introducing a random element into their construction. For example, suppose there are 20 predictor variables. At each node choose several of the 20 at random to use to split the node. Or use a random combination of a random selection of a few variables. This idea appears in Ho (1998), in Amit and Geman (1997) and is developed in Breiman (1999).
9.2 Forests Compared to Trees
We compare the performance of single trees (CART) to random forests on a number of small and large data sets, mostly from the UCI repository (ftp.ics.uci.edu/pub/MachineLearningDatabases). A summary of the data sets is given in Table 1.
Table 2 compares the test set error of a single tree to that of the forest. For the five smaller data sets above the line, the test set error was estimated by leaving out a random 10% of the data, then running CART and the forest on the other 90%. The left-out 10% was run down the tree and the forest and the error on this 10% computed for both. This was repeated 100 times and the errors averaged. The larger data sets below the line came with a separate test set. People who have been in the classification field for a while find these increases in accuracy startling. Some errors are halved. Others are reduced by one-third. In regression, where the
Table 2. Test set misclassification error (%)

Data set          Forest   Single tree
Breast cancer       2.9        5.9
Ionosphere          5.5       11.2
Diabetes           24.2       25.3
Glass              22.0       30.4
Soybean             5.7        8.6
Letters             3.4       12.4
Satellite           8.6       14.8
Shuttle (×10⁻³)     7.0       62.0
DNA                 3.9        6.2
Digit               6.2       17.1
Breiman (2001a,b)
Some classification results from UCI Irvine machine learning benchmark datasets:
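The evaluation protocol Breiman describes for the smaller data sets (100 repeats of holding out a random 10%, fitting a single tree and a forest on the remaining 90%, averaging the held-out misclassification errors) can be reconstructed roughly as follows; this is a sketch of my own, assuming the rpart, randomForest and mlbench packages, shown on the mlbench PimaIndiansDiabetes data.

```r
library(rpart); library(randomForest); library(mlbench)

data(PimaIndiansDiabetes)          # the "Diabetes" benchmark in mlbench
dat <- PimaIndiansDiabetes

set.seed(1)
errs <- replicate(100, {
  test  <- sample(nrow(dat), round(0.1 * nrow(dat)))   # random 10% held out
  tree  <- rpart(diabetes ~ ., data = dat[-test, ], method = "class")
  fores <- randomForest(diabetes ~ ., data = dat[-test, ])
  c(tree   = mean(predict(tree,  dat[test, ], type = "class") != dat$diabetes[test]),
    forest = mean(predict(fores, dat[test, ])                 != dat$diabetes[test]))
})
rowMeans(errs)   # averaged held-out misclassification error, tree vs forest
```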
• Many further comparisons using the UCI Irvine / mlbench repository datasets:
• several modeling / prediction frameworks:
• CART, ANNs, LDA, QDA, kNNs...
• regression and classification problems
• Conclusion: “Random Forests are A+ predictors”
• Discussion (Efron): Lots of knobs (tuning parameters)
• Rejoinder (Breiman): Essentially only one (mtry)
• Random Forests have lived happily ever after
• But, let's take a closer look at the UCI Irvine / mlbench repository datasets
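The error profiles in the figures that follow plot cross-validated error against the number of splits. One plausible way to produce such a profile (a sketch under my own assumptions about how the figures were made, using rpart's built-in cross-validation) is to grow a maximal tree and read the cross-validated relative error off its cp table, indexed by nsplit.

```r
library(rpart); library(mlbench)

data(BostonHousing)
fit <- rpart(medv ~ ., data = BostonHousing,
             control = rpart.control(cp = 0, minsplit = 2, xval = 10))

# cptable gives, for each pruning level, the number of splits and the
# 10-fold cross-validated error relative to the root node (xerror).
profile <- fit$cptable[, c("nsplit", "xerror")]

plot(profile[, "nsplit"], profile[, "xerror"], type = "l",
     xlab = "Number of Splits", ylab = "Cross-validated Error",
     ylim = c(0, 1.2))
```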
[Six panels plotting cross-validated error (0 to 1.2) against number of splits for: Augmented Friedman #1, Boston Housing, Servo, Friedman #1, Friedman #2, Friedman #3.]
Figure 2: UCI Repository: Regression tree prediction error profiles. Note that the upper left plot corresponds to modification of a synthetic repository dataset in order to achieve a non-monotone error profile; see Section 4.
[Six panels plotting cross-validated error (0 to 1.2) against number of splits for: Breast Cancer, Bupa Liver, Diabetes, Glass, Image, Ionosphere.]
Figure 3: UCI Repository: Classification tree prediction error profiles, I.
[Six panels plotting cross-validated error (0 to 1.2) against number of splits for: Letter Recognition, Promoters, Ringnorm, Satellite, Sonar, Soybean.]
Figure 4: UCI Repository: Classification tree prediction error profiles, II.
[Six panels plotting cross-validated error (0 to 1.2) against number of splits for: Threenorm, Twonorm, Vehicle, House Votes 84, Vowel, Waveform.]
Figure 5: UCI Repository: Classification tree prediction error profiles, III.
Of course the error profiles depend on the class of model being fitted. While it is appropriate to utilize tree-structured models in dissecting the random forest mechanism, it is also purposeful to assess whether the datasets can't be overfit under other model classes. To that end we investigate error profiles corresponding to least angle regression (LARS). LARS represents a recently devised (Efron et al., 2004) technique that
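A sketch of the corresponding LARS error profile using the lars package (Efron et al., 2004); the Boston Housing data and fold count are my illustrative choices, not necessarily those used for the figures.

```r
library(lars); library(mlbench)

data(BostonHousing)
# Numeric predictors only (the factor chas is dropped for this sketch).
x <- as.matrix(BostonHousing[, sapply(BostonHousing, is.numeric) &
                               names(BostonHousing) != "medv"])
y <- BostonHousing$medv

# cv.lars cross-validates prediction error along the LAR path, so the
# resulting curve plays the role of the tree "number of splits" profiles.
cvfit <- cv.lars(x, y, K = 10, type = "lar", plot.it = TRUE)
```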
• Almost all UCI Irvine machine learning benchmark datasets exhibit this behaviour:
• they are hard to overfit {not just with trees}
• This will make the Random Forest strategy of growing trees to maximal depth look good
• “Benchmarks” are not representative of what is at least thought to be prototypic
• Will next showcase such an example
• Then offer some theory and characterizations
[Plot: cross-validated error (0 to 1.2) against number of splits (0 to 5000).]
Basal Splicing Signals
• Pre-messenger RNA splicing - responsible for precise removal of introns - is an essential step in expression of most genes
• Exons defined by short, degenerate splice site sequences at intron/exon boundaries: 5’ splice site (5’ss, donor); 3’ss, acceptor
• Each ss has a consensus sequence motif: essential nucleotides plus base usage preferences in flanking positions
• Despite requirement for accurate splicing, human ss only moderately conserved
• Implies an abundance of decoy ss
• Further, strong and complex dependencies between ss nucleotides exist
• Improved understanding of basal ss is important for exon recognition and, ultimately, disease impact of splicing defects
• Approach as a classification problem -- real vs decoy ss -- using large database
• Objective: predict 3’ splice site sequences
• Large n, small p datasets:
• training 8465 real; 180957 decoy
• test 4233 real; 90494 decoy
• example aligned sequences:
ATTCTTACAAGTCCAATAAGGTT real
GAATCGCTTGAACCTGGGAGGTG real
CTGAAATGTCTCATCTGCAGTAC decoy
ATTTTATTTTTAAATTGCAGGTA decoy
• each (non-degenerate, aligned) position constitutes an unordered covariate (p = 21)
• data generation: Yeo and Burge (2003).
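A toy sketch of the encoding implied above: each aligned sequence position becomes an unordered (factor) covariate, and a random forest classifies real versus decoy sites. It reuses only the four example sequences shown on the slide, so it is purely to show the shapes; the actual Yeo and Burge data pipeline is not reproduced here.

```r
library(randomForest)

seqs  <- c("ATTCTTACAAGTCCAATAAGGTT", "GAATCGCTTGAACCTGGGAGGTG",
           "CTGAAATGTCTCATCTGCAGTAC", "ATTTTATTTTTAAATTGCAGGTA")
label <- factor(c("real", "real", "decoy", "decoy"))

# One factor covariate per aligned position (A/C/G/T), as on the slide
# (the talk keeps p = 21 non-degenerate positions; here all are used).
pos <- as.data.frame(do.call(rbind, strsplit(seqs, "")),
                     stringsAsFactors = TRUE)
names(pos) <- paste0("pos", seq_along(pos))

# With the real data (thousands of sequences) this is an ordinary
# classification forest; four sequences only illustrate the data layout.
rf <- randomForest(x = pos, y = label, ntree = 50)
```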
[Plot: cross-validated error (0 to 1.2) against number of splits (0 to 5000).]
3'ss: CV error for a single tree
[Plot: true positive rate (sensitivity) against false positive rate (1 − specificity), FPR 0 to 0.20.]
Random Forest ROC Curves: Test 3'ss Data
Legend: Split Control; Node Size Control
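A sketch of the two tree-size controls compared in the ROC figure, using randomForest's maxnodes (a cap on terminal nodes, i.e. split control) versus nodesize (a floor on terminal node size). The data here are a simulated two-class mlbench problem rather than the 3'ss data, and the ROC curves are computed with the ROCR package; the specific values 32 and 50 are illustrative, not from the talk.

```r
library(randomForest); library(mlbench); library(ROCR)

set.seed(1)
train <- mlbench.threenorm(1000, d = 10)   # $x matrix, $classes factor
test  <- mlbench.threenorm(1000, d = 10)

# Split control: limit each tree to at most 32 terminal nodes.
rf.split <- randomForest(train$x, train$classes, maxnodes = 32)
# Node size control: require at least 50 cases per terminal node.
rf.node  <- randomForest(train$x, train$classes, nodesize = 50)

roc <- function(rf) {
  p <- predict(rf, test$x, type = "prob")[, 2]   # predicted P(class 2)
  performance(prediction(p, test$classes), "tpr", "fpr")
}
plot(roc(rf.split), col = "blue")
plot(roc(rf.node),  col = "red", add = TRUE)
```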
[Plot: true positive rate (sensitivity) against false positive rate (1 − specificity), FPR 0 to 0.20.]
Legend: Random Forests; Support Vector Machines; Maximum Entropy Models
ROC Curves: Test 3'ss Data
{Aside: comparisons}
• Individual tree size determined by inter-related tuning parameters that govern (terminal) node size, number of splits, depth, split improvement
• A priori regulation via node size specifications problematic in large n situations
• Guidelines, rules-of-thumb as function of n are lacking (cf defaults for m)
• Leekasso
Tree Depth in a Forest
• Lin and Jeon (2006, JASA)
• Develop construct of k-PNNs
• Establish connections between Random Forests and k-PNNs where k is terminal node size
• k = 1 for trees grown to maximal depth
• Enables analysis of role of tree depth
Potential Nearest Neighbours
Under simplifying assumptions Lin and Jeon show that a lower bound on the rate of convergence of RF MSE is $k^{-1}(\log n)^{-(p-1)}$.
Much inferior to the standard rate $n^{-2d/(2d+p)}$ (where d is the degree of target smoothness) attained by many nonparametric methods.
To achieve competitiveness, terminal node size k should increase with sample size n.
Intuitively: largest trees use 1-PNNs at $x_0$; #1-PNNs $\sim O_p[(\log n)^{p-1}]$, which is too small.
Lin and Jeon: “growing large trees (k small) does not always give the best performance”
But, asymptotics require $n \gg p$ and even when seemingly applicable may not pertain.
Consider p = 10, d = 2, n = 100,000. Then $(\log n)^{p-1}/(p-1)! = 9793 \gg 27 = n^{2d/(2d+p)}$.
Even more so the case for larger p, smaller n.
So, for high dimensional problems growing largest individual trees is often desirable.
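The back-of-the-envelope comparison above is easy to reproduce; a two-line check of the quantities quoted on the slide:

```r
n <- 1e5; p <- 10; d <- 2
(log(n))^(p - 1) / factorial(p - 1)   # ~ 9793: the (log n)^{p-1}/(p-1)! term
n^(2 * d / (2 * d + p))               # ~ 26.8: n^{2d/(2d+p)}, i.e. about 27
```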
• UCI / mlbench data repositories are inadequate as representative testbeds
• k-PNNs provide a theoretic framework for (crudely) evaluating tree depth considerations
• In large sample settings (Big Data) growing the individual tree components of a Random Forest ensemble to maximal depth can be undesirable
• Approaches to developing guidelines, defaults, parameterizations, tuning strategies to address tree depth are yet to be developed
Conclusions / Future Work
• Eugene Yeo
• Leo Breiman
Acknowledgements