
BMC Bioinformatics
Methodology article (Open Access)

Bias in random forest variable importance measures: Illustrations, sources and a solution

Carolin Strobl*1, Anne-Laure Boulesteix2, Achim Zeileis3 and Torsten Hothorn4

Address: 1 Institut für Statistik, Ludwig-Maximilians-Universität München, Ludwigstr. 33, 80539 München, Germany; 2 Institut für medizinische Statistik und Epidemiologie, Technische Universität München, Ismaningerstr. 22, 81675 München, Germany; 3 Department für Statistik und Mathematik, Wirtschaftsuniversität Wien, Augasse 2-6, 1090 Wien, Austria; 4 Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg, Waldstr. 6, D-91054 Erlangen, Germany

Email: Carolin Strobl* - [email protected]; Anne-Laure Boulesteix - [email protected]; Achim Zeileis - [email protected]; Torsten Hothorn - [email protected]

* Corresponding author

Abstract

Background: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.

Results: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand.

Conclusion: We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.

Published: 25 January 2007

BMC Bioinformatics 2007, 8:25 doi:10.1186/1471-2105-8-25

Received: 18 September 2006
Accepted: 25 January 2007

This article is available from: http://www.biomedcentral.com/1471-2105/8/25

© 2007 Strobl et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Background

In bioinformatics and related scientific fields, such as statistical genomics and genetic epidemiology, an important task is the prediction of a categorical response variable (such as the disease status of a patient or the properties of a molecule) based on a large number of predictors. The aim of this research is on one hand to predict the value of the response variable from the values of the predictors, i.e. to create a diagnostic tool, and on the other hand to reliably identify relevant predictors from a large set of candidate variables. From a statistical point of view, one of the challenges in identifying these relevant predictor variables is the so-called "small n large p" problem: Usual data sets in genomics often contain hundreds or thousands of genes or markers that serve as predictor variables X1, ..., Xp, but only for a comparatively small number n of subjects or tissue types.

Traditional statistical models used in clinical case control studies for predicting the disease status from selected predictor variables, such as logistic regression, are not suitable for "small n large p" problems [1,2]. A more appropriate approach from machine learning, which has been proposed recently for prediction and variable selection in various fields related to bioinformatics and computational biology, is the nonlinear and nonparametric random forest method [3]. It also provides variable importance measures for variable selection purposes.

Random forests have been successfully applied to various problems in, e.g., genetic epidemiology and microbiology within the last five years. Within a very short period of time, random forests have become a major data analysis tool that performs well in comparison with many standard methods [2,4]. What has greatly contributed to the popularity of random forests is the fact that they can be applied to a wide range of prediction problems, even if they are nonlinear and involve complex high-order interaction effects, and that random forests produce variable importance measures for each predictor variable.

Applications of random forests in bioinformatics include large-scale association studies for complex genetic diseases: e.g., Lunetta et al. [5] and Bureau et al. [1] detect SNP-SNP interactions in the case-control context by means of computing a random forest variable importance measure for each polymorphism. A comparison of the performance of random forests and other classification methods for the analysis of gene expression data is presented by Díaz-Uriarte and Alvarez de Andrés [4], who propose a new gene selection method based on random forests for sample classification with microarray data. We refer to [6-8] for other applications of the random forest methodology to microarray data.

Prediction of phenotypes based on amino acid or DNA sequence is another important area of application of random forests, since such predictions potentially involve many interactions. For example, Segal et al. [9] use random forests to predict the replication capacity of viruses, such as HIV-1, based on amino acid sequences from reverse transcriptase and protease. Cummings and Segal [10] link rifampin resistance in Mycobacterium tuberculosis to a few amino acid positions in rpoB, whereas Cummings and Myers [11] predict C-to-U edited sites in plant mitochondrial RNA based on sequence regions flanking edited sites and a few other (continuous) parameters.

The random forest approach was shown to outperform six other methods in the prediction of protein interactions based on various biological features such as gene expression, gene ontology (GO) features and sequence data [12]. Other applications of random forests can be found in fields as different as quantitative structure-activity relationship (QSAR) modeling [13,14], nuclear magnetic resonance spectroscopy [15], landscape epidemiology [16] and medicine in general [17].

The scope of this paper is to show that the variable importance measures of Breiman's original random forest method [3], based on CART classification trees [18], are a sensible means for variable selection in many of these applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This occurs, e.g., when both genetic and environmental variables, individually and in interactions, are considered as potential predictors, or when predictor variables of the same type vary in the number of categories present in a certain sample, as is often the case in genomics, bioinformatics and related disciplines.

Simulation studies are presented illustrating that variable selection with the variable importance measure of the original random forest method bears the risk that suboptimal predictor variables are artificially preferred in such scenarios.

In a separate section, we give further details and explanations of the statistical sources underlying the deficiency of the variable importance measures of the original random forest method, namely biased variable selection in the individual classification trees used to build the random forest, and effects induced by bootstrap sampling with replacement.

We propose an alternative random forest method whose variable importance measure can be employed to reliably select relevant predictor variables in any data set. The performance of this method is compared to that of the original random forest method in simulation studies, and is illustrated by an application to the prediction of C-to-U edited sites in plant mitochondrial RNA, re-analyzing the data of [11] that were previously analyzed with the original random forest method.

Methods

Here we focus on the use of random forests for classification tasks, rather than regression tasks, for instance for predicting the disease status from a set of selected genetic and environmental risk factors, or for predicting whether a site of interest is edited by means of neighboring sites and other predictor variables as in our application example.

Random forests are an ensemble method that combines several individual classification trees in the following way: From the original sample several bootstrap samples are drawn, and an unpruned classification tree is fit to each bootstrap sample. The variable selection for each split in the classification tree is conducted only from a small random subset of predictor variables, so that the "small n large p" problem is avoided. From the complete forest the status of the response variable is predicted as an average or majority vote of the predictions of all trees.
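To make the aggregation step concrete, here is a minimal R sketch of majority voting over the class predictions of an ensemble (our own illustration, not code from the paper):

    ## Sketch of majority-vote aggregation: 'pred' is a matrix of class
    ## predictions with one row per observation and one column per tree.
    majority_vote <- function(pred) {
      apply(pred, 1, function(votes) names(which.max(table(votes))))
    }

    ## Toy example: 3 observations, 5 trees
    pred <- matrix(c("edited", "edited", "not",    "edited", "not",
                     "not",    "not",    "not",    "edited", "not",
                     "edited", "not",    "edited", "edited", "edited"),
                   nrow = 3, byrow = TRUE)
    majority_vote(pred)  # one predicted class per observation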

Random forests can greatly increase the prediction accuracy as compared to individual classification trees, because the ensemble adjusts for the instability of the individual trees induced by small changes in the learning sample, which impairs the prediction accuracy in test samples. However, the interpretability of a random forest is not as straightforward as that of an individual classification tree, where the influence of a predictor variable directly corresponds to its position in the tree. Thus, alternative measures for variable importance are required for the interpretation of random forests.

Random forest variable importance measures

A naive variable importance measure to use in tree-based ensemble methods is to merely count the number of times each variable is selected by all individual trees in the ensemble.

More elaborate variable importance measures incorporate a (weighted) mean of the individual trees' improvement in the splitting criterion produced by each variable [19]. An example for such a measure in classification is the "Gini importance" available in random forest implementations. The "Gini importance" describes the improvement in the "Gini gain" splitting criterion.
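For completeness, the Gini index of a node t with class proportions p_k, and the Gini gain of a binary split of t into children t_L and t_R holding n_L and n_R of the node's n observations, are (standard definitions, stated here for reference):

    \mathrm{Gini}(t) = 1 - \sum_{k} p_k^2 , \qquad
    \Delta\mathrm{Gini} = \mathrm{Gini}(t)
      - \frac{n_L}{n}\,\mathrm{Gini}(t_L)
      - \frac{n_R}{n}\,\mathrm{Gini}(t_R)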

The most advanced variable importance measure available in random forests is the "permutation accuracy importance" measure. Its rationale is the following: By randomly permuting the predictor variable Xj, its original association with the response Y is broken. When the permuted variable Xj, together with the remaining unpermuted predictor variables, is used to predict the response, the prediction accuracy (i.e. the number of observations classified correctly) decreases substantially if the original variable Xj was associated with the response. Thus, a reasonable measure for variable importance is the difference in prediction accuracy before and after permuting Xj.
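A minimal R sketch of this rationale (our own illustration with hypothetical arguments; in random forests the measure is actually computed tree-wise on the out-of-bag observations rather than on a separate hold-out set):

    ## Sketch: accuracy drop after permuting one predictor.
    ## 'fit' is any fitted classifier with a predict() method,
    ## 'test' a hold-out data frame (names are hypothetical).
    permutation_importance <- function(fit, test, response, variable) {
      acc_before <- mean(predict(fit, newdata = test) == test[[response]])
      test[[variable]] <- sample(test[[variable]])  # break association with Y
      acc_after  <- mean(predict(fit, newdata = test) == test[[response]])
      acc_before - acc_after  # large positive value = informative variable
    }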

For variable selection purposes the advantage of the random forest permutation accuracy importance measure, as compared to univariate screening methods, is that it covers the impact of each predictor variable individually as well as in multivariate interactions with other predictor variables. For example, Lunetta et al. [5] find that genetic markers relevant in interactions with other markers or environmental variables can be detected more efficiently by means of random forests than by means of univariate screening methods like Fisher's exact test.

The Gini importance and the permutation accuracy importance measures are employed as variable selection criteria in many recent studies in various disciplines related to bioinformatics, as outlined in the background section. Therefore we want to investigate their reliability as variable importance measures in different scenarios.

In the simulation studies presented in the next section, we compare the behavior of all three random forest variable importance measures, namely the number of times each variable is selected by all individual trees in the ensemble (termed "selection frequency" in the following), the "Gini importance", and the permutation accuracy importance measure (termed "permutation importance" in the following).

Simulation studies

The reference implementation of Breiman's original random forest method [3] is available in the R system for statistical computing [20] via the randomForest add-on package by Liaw and Wiener [21,22]. The behavior of the selection frequency, the Gini importance and the permutation importance of the randomForest function is explored in a simulation design where potential predictor variables vary in their scale of measurement and number of categories.
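For orientation, fitting the reference implementation and extracting all three measures looks as follows (a minimal sketch; the simulated data frame dat with response y and predictors X1 to X5 is constructed in the sketch after Table 2 below):

    library(randomForest)
    rf <- randomForest(y ~ ., data = dat, ntree = 500, importance = TRUE)
    importance(rf, type = 1)  # permutation accuracy importance
    importance(rf, type = 2)  # Gini importance
    varUsed(rf)               # selection frequency of each variable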

As an alternative, we propose to use the new random forest function cforest available in the R add-on package party [23] in such scenarios. In contrast to randomForest, the cforest function creates random forests not from CART classification trees based on the Gini split criterion [18], which are known to prefer variables with, e.g., more categories in variable selection [18,24-28], but from unbiased classification trees based on a conditional inference framework [29]. The problem of biased variable selection in classification trees is covered more thoroughly in a separate section below.

Since the cforest function does not employ the Gini criterion, we investigate the behavior of the Gini importance for the randomForest function only. The selection frequency and the permutation importance are studied for both functions randomForest and cforest in two ways: either the individual trees are built on bootstrap samples of the original sample size n drawn with replacement, as suggested in [3], or on subsamples drawn without replacement.
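The four configurations compared in the following can be requested as sketched below (a sketch using the sampling-related arguments provided by the randomForest and party packages at the time of writing; dat as above):

    library(randomForest)
    library(party)

    ## Bootstrap sampling with replacement (the classical proposal [3])
    rf_boot <- randomForest(y ~ ., data = dat, replace = TRUE)
    cf_boot <- cforest(y ~ ., data = dat,
                       controls = cforest_control(replace = TRUE))

    ## Subsampling without replacement (0.632 * n observations per tree)
    rf_sub <- randomForest(y ~ ., data = dat, replace = FALSE,
                           sampsize = ceiling(0.632 * nrow(dat)))
    cf_sub <- cforest(y ~ ., data = dat,
                      controls = cforest_control(replace = FALSE,
                                                 fraction = 0.632))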

For sampling without replacement the subsample size here is set to 0.632 times the original sample size n, because in bootstrap sampling with replacement about 63.2% of the data end up in the bootstrap sample. Other fractions for the subsample size are possible, for instance 0.5 as suggested by Friedman and Hall [30]. Subsampling as an alternative to bootstrap sampling in aggregating, e.g., individual classification trees is investigated further by Bühlmann and Yu [31], who also coin the term "subagging" as an abbreviation for "subsample aggregating", as opposed to "bagging" for "bootstrap aggregating". Politis, Romano and Wolf [32] show that, for statistical inference in general, subsampling works under weaker assumptions than bootstrap sampling, and even in situations where bootstrap sampling fails.
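The 0.632 figure follows from the fact that the expected proportion of distinct observations in a bootstrap sample of size n is 1 - (1 - 1/n)^n, which tends to 1 - e^(-1), i.e. about 0.632; a quick check in R:

    ## Expected share of distinct observations in a bootstrap sample of size n
    n <- 120
    1 - (1 - 1/n)^n          # analytic: about 0.632
    mean(replicate(10000,    # simulated check
      length(unique(sample(n, n, replace = TRUE))) / n))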

The simulation design used throughout this paper represents a scenario where a binary response variable Y is supposed to be predicted from a set of potential predictor variables that vary in their scale of measurement and number of categories. The first predictor variable X1 is continuous, while the other predictor variables X2, ..., X5 are categorical (on a nominal scale of measurement) with between two and twenty categories. The simulation designs of both studies are summarized in Tables 1 and 2. The sample size for all simulation studies was set to n = 120.

In the first simulation study, the so-called null case, none of the predictor variables is informative for the response, i.e. all predictor variables and the response are sampled independently. In this situation a sensible variable importance measure should not prefer any one predictor variable over any other.

In the second simulation study, the so-called power case, the predictor variable X2 is informative for the response, i.e. the distribution of the response depends on the value of this predictor variable. The degree of dependence between the informative predictor variable X2 and the response Y is regulated by the relevance parameter of the conditional distribution of Y given X2 (cf. Table 2). We will later display results for different values of the relevance parameter indicating different degrees of dependence between X2 and Y. In the power case, a sensible variable importance measure should be able to distinguish the informative predictor variable from its uninformative competitors, and even more so with increasing degree of dependence.

Results and discussion

Our simulation studies show that for the randomForest function all three variable importance measures are unreliable, and the Gini importance is most strongly biased. For the cforest function reliable results can be achieved both with the selection frequency and the permutation importance if the function is used together with subsampling without replacement. Otherwise the measures are biased as well.

Table 1: Simulation design for the simulation studies – predictor variables

    X1 ~ N(0, 1)
    X2 ~ M(2)
    X3 ~ M(4)
    X4 ~ M(10)
    X5 ~ M(20)

The predictor variables are sampled independently from the given distributions. N(0, 1) stands for the standard normal distribution, M(k) for the multinomial distribution with values in {0, ..., k - 1} and equal probabilities (discrete uniform distribution on {0, ..., k - 1}), and B(p) for the binomial (Bernoulli) distribution with success probability p; thus M(2) equals B(0.5).

Table 2: Simulation design for the simulation studies – response variable

    null case:   Y ~ B(0.5)
    power case:  Y | X2 = 1 ~ B(0.5 - relevance)
                 Y | X2 = 2 ~ B(0.5 + relevance)

The response variable is sampled from binomial (Bernoulli) distributions. The degree of dependence between the response Y and X2 is regulated by the probability p of the binomial distribution B(p) of Y conditional on X2, with the relevance parameter taking values in {0.05, 0.1, 0.15, 0.2} to model different degrees of dependence.
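A minimal sketch of this data-generating process in R (our own reconstruction of Tables 1 and 2; X2's category labels 1 and 2 follow Table 2):

    ## Generate one simulated data set (null case or power case)
    simulate_data <- function(n = 120, relevance = 0) {
      X2 <- sample(1:2, n, replace = TRUE)            # M(2), labels as in Table 2
      p  <- ifelse(X2 == 1, 0.5 - relevance, 0.5 + relevance)
      data.frame(
        y  = factor(rbinom(n, 1, p)),                 # relevance = 0: null case
        X1 = rnorm(n),                                # N(0, 1)
        X2 = factor(X2),
        X3 = factor(sample(0:3,  n, replace = TRUE)), # M(4)
        X4 = factor(sample(0:9,  n, replace = TRUE)), # M(10)
        X5 = factor(sample(0:19, n, replace = TRUE))  # M(20)
      )
    }
    dat <- simulate_data(relevance = 0.2)             # power case example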


Results of the null case simulation study

In the null case, when all predictor variables are equally uninformative, the selection frequencies as well as the Gini importance and the permutation importance of all predictor variables are supposed to be equal. However, as presented in Figure 1, the mean selection frequencies (over 1000 simulation runs) of the predictor variables differ substantially when the randomForest function (cf. top row in Figure 1) or the cforest function with bootstrap sampling (cf. bottom row, left plot in Figure 1) are used. Variables with more categories are obviously preferred. Only when the cforest function is used together with subsampling without replacement (cf. bottom row, right plot in Figure 1) are the variable selection frequencies for the uninformative predictor variables equally low, as desired.

It is obvious that variable importance cannot be represented reliably by the selection frequencies, which can be considered very basic variable importance measures, when the potential predictor variables vary in their scale of measurement or number of categories and the randomForest function or the cforest function with bootstrap sampling is used.

The mean Gini importance (over 1000 simulation runs), displayed in Figure 2, is even more strongly biased. Like the selection frequencies for the randomForest function (cf. top row in Figure 1), the Gini importance shows a strong preference for variables with many categories and the continuous variable; the statistical sources of this preference are explained in the section on variable selection bias in classification trees below. We conclude that the Gini importance cannot be used to reliably measure variable importance in this situation either.

We now consider the more advanced permutation importance measure. We find that here an effect of the scale of measurement or number of categories of the potential predictor variables is less obvious, but still severely affects the reliability and interpretability of the variable importance measure.

Figure 3 shows boxplots of the distributions (over 1000 simulation runs) of the permutation importance measures of both functions for the null case. The plots in the top row again display the distribution when the randomForest function is used, the bottom row when the cforest function is used. The left column of plots displays the distributions when bootstrap sampling is conducted with replacement, while the right column displays the distributions when subsampling is conducted without replacement.

Figure 4 shows boxplots of the distributions of the scaled version of the permutation importance measures of both functions, incorporating the standard deviation of the measures.

The scaled variable importance is the default output of the randomForest function. However, it has been noted, e.g., by Díaz-Uriarte and Alvarez de Andrés [4] in their supplementary material, that the scaled variable importance of the randomForest function depends on the number of trees grown in the random forest. (In the cforest function this is not the case.) Therefore we suggest not to interpret the magnitude of the scaled variable importance of the randomForest function.
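With the randomForest package, both versions can be extracted from a fitted forest rf (as in the sketch above) via the scale argument of importance():

    importance(rf, type = 1, scale = TRUE)   # scaled permutation importance (default)
    importance(rf, type = 1, scale = FALSE)  # unscaled permutation importance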

The plots show that for the randomForest function (cf. top row in Figures 3 and 4) and, less pronounced, for the cforest function with bootstrap sampling (cf. bottom row, left plot in Figures 3 and 4), the deviation of the permutation importance measure over the simulation runs is highest for the variable X5 with the highest number of categories, and decreases for the variables with fewer categories and the continuous variable. This effect is weakened but not substantially altered by scaling the measure (cf. Figure 3 vs. Figure 4).

As opposed to the obvious effect in the selection frequencies and the Gini importance, there is no effect in the mean values of the distributions of the permutation importance measures, which are on average close to zero, as expected for uninformative variables. However, the notable differences in the variance of the distributions for predictor variables with different scales of measurement or numbers of categories seriously affect the expressiveness of the variable importance measure.

In a single trial this effect may lead to a severe over- or underestimation of the variable importance of variables that have more categories as an artefact of the method, even though they are no more or less informative than the other variables.

Only when the cforest function is used together with subsampling without replacement (cf. bottom row, right plot in Figures 3 and 4) does the deviation of the permutation importance measure over the simulation runs not increase substantially with the number of categories or scale of measurement of the predictor variables.

Thus, only the variable importance measure available in cforest, and only when used together with sampling without replacement, reliably reflects the true importance of potential predictor variables in a scenario where the potential predictor variables vary in their scale of measurement or number of categories.


Results of the power case simulation study

In the power case, where only the predictor variable X2 is informative, a sensible variable importance measure should be able to distinguish the informative predictor variable.

The following figures display the results of the power case with the highest value 0.2 of the relevance parameter, indicating a high degree of dependence between X2 and the response. In this setting, each of the variable importance measures should clearly prefer X2, while the respective values for the remaining predictor variables should be equally low.

Figure 1: Results of the null case study – variable selection frequency. Mean variable selection frequencies for the null case, where none of the predictor variables is informative. The plots in the top row display the frequencies when the randomForest function is used, the bottom row when the cforest function is used. The left column corresponds to bootstrap sampling with replacement, the right column to subsampling without replacement. [Four bar plots of selection frequency (0 to 250) for X1 to X5; panels: randomForest/cforest with replace=TRUE/FALSE.]


Figure 5 shows that the mean selection frequencies (again over 1000 simulation runs) of the predictor variables again differ substantially when the randomForest function (cf. top row in Figure 5) is used, and the relevant predictor variable X2 cannot be identified. With the cforest function with bootstrap sampling (cf. bottom row, left plot in Figure 5) there is still an obvious bias in the selection frequencies of the categorical predictor variables with many categories. Only when the cforest function is used together with subsampling without replacement (cf. bottom row, right plot in Figure 5) are the variable selection frequencies for the uninformative predictor variables equally low, as desired, and the value for the relevant predictor variable X2 sticks out.

The mean Gini importance, displayed in Figure 6, again shows a strong bias towards variables with many categories and the continuous variable. It completely fails to identify the relevant predictor variable: the mean value for the relevant variable X2 is only slightly higher than in the null case.

Figures 7 and 8 show boxplots of the distributions of the unscaled and scaled permutation importance measures of both functions. Again, for the randomForest function (cf. top row in Figures 7 and 8) and, less pronounced, for the cforest function with bootstrap sampling (cf. bottom row, left plot in Figures 7 and 8), the deviation of the permutation importance measure over the simulation runs is highest for the variable X5 with the highest number of categories, and decreases for the variables with fewer categories and the continuous variable. This effect is weakened but not substantially altered by scaling the measure (cf. Figure 7 vs. Figure 8).

As expected the mean value of the permutation importance measure for the informative predictor variable X2 is higher than for the uninformative variables. However, the deviation of the variable importance measure for the uninformative variables with many categories, X4 and X5, is so high that in a single trial these uninformative variables may outperform the informative variable as an artefact of the method. Thus, only the variable importance measure computed with the cforest function, and only when used together with sampling without replacement, is able to reliably detect the informative variable out of a set of uninformative competitors, even if the degree of dependence between X2 and the response is high. The rate at which the informative predictor variable is correctly identified (by producing the highest value of the permutation importance measure) increases with the degree of dependence between X2 and the response. In Table 3 the rates of correct identifications (over 1000 simulation runs) for four different degrees of dependence between X2 and the response are summarized for the randomForest and cforest function with different options.

Figure 2: Results of the null case study – Gini importance. Mean Gini importance for the null case, where none of the predictor variables is informative. The left plot corresponds to bootstrap sampling with replacement, the right plot to subsampling without replacement. [Two bar plots of Gini importance (0 to 25) for X1 to X5 with the randomForest function, replace=TRUE and replace=FALSE.]


For all degrees of dependence between X2 and the response Y the cforest function detects the informative variable more reliably than the randomForest function, and the cforest function used with subsampling without replacement outperforms the cforest function with bootstrap sampling with replacement.

Figure 3: Results of the null case study – unscaled permutation importance. Distributions of the unscaled permutation importance measures for the null case, where none of the predictor variables is informative. The plots in the top row display the distributions when the randomForest function is used, the bottom row when the cforest function is used. The left column corresponds to bootstrap sampling with replacement, the right column to subsampling without replacement. [Four boxplot panels of unscaled permutation importance for X1 to X5.]


For the randomForest function scaling the permutation importance measure can slightly increase the rates of correct identifications because, as shown in Figures 4 and 8, scaling weakens the differences in variance of the permutation importance measure for variables of different scale of measurement and number of categories. For the cforest function, which is not affected by the scale of measurement and number of categories of the predictor variables, both the unscaled and the scaled permutation importance perform equally well.

Figure 4: Results of the null case study – scaled permutation importance. Distributions of the scaled permutation importance measures for the null case, where none of the predictor variables is informative. The plots in the top row display the distributions when the randomForest function is used, the bottom row when the cforest function is used. The left column corresponds to bootstrap sampling with replacement, the right column to subsampling without replacement. [Four boxplot panels of scaled permutation importance for X1 to X5.]


So far we have seen that for the assessment of variable importance and variable selection purposes it is important to use a reliable method that is not affected by other characteristics of the predictor variables. Statistical explanations of our findings are given in a later section.

Figure 5: Results of the power case study – variable selection frequency. Mean variable selection frequencies for the power case, where only the second predictor variable is informative. The plots in the top row display the frequencies when the randomForest function is used, the bottom row when the cforest function is used. The left column corresponds to bootstrap sampling with replacement, the right column to subsampling without replacement. [Four bar plots of selection frequency (0 to 250) for X1 to X5.]


In addition to its superiority in the assessment of variable importance, the cforest method, especially when used together with subsampling without replacement, can also be superior to the randomForest method with respect to classification accuracy in situations like that of the power case simulation study, where uninformative predictor variables with many categories "fool" the randomForest function.

Due to its artificial preference for uninformative predictor variables with many categories, the randomForest function can produce a higher mean misclassification rate than the cforest function. The mean misclassification rates (again over 1000 simulation runs) for the randomForest and cforest functions, again for four different degrees of dependence and used with sampling with and without replacement, are displayed in Table 4.

Each method was applied to the same simulated test set in each simulation run. The test sets were generated from the same data generating process as the learning sets. We find that for all degrees of dependence between X2 and the response Y the cforest function, especially with sampling without replacement, outperforms the other methods. A similar result is obtained in the application to C-to-U conversion data presented in the next section.

The differences in classification accuracy are moderate in the latter case; however, one could think of more extreme situations that would produce even greater differences. This shows that the same mechanisms underlying the variable importance bias can also affect the classification accuracy, e.g. when suboptimal predictor variables, which do not add to the classification accuracy, are artificially preferred in variable selection merely because they have more categories.

Application to C-to-U conversion data

RNA editing is the process whereby RNA is modified from the sequence of the corresponding DNA template [11]. For instance, cytidine-to-uridine conversion (abbreviated C-to-U conversion) is common in plant mitochondria. The mechanisms of this conversion remain largely unknown, although the role of neighboring nucleotides is emphasized. Cummings and Myers [11] suggest to use information from sequence regions flanking the sites of interest to predict editing in Arabidopsis thaliana, Brassica napus and Oryza sativa based on random forests. The Arabidopsis thaliana data of [11] can be loaded from the journal's homepage.

Figure 6: Results of the power case study – Gini importance. Mean Gini importance for the power case, where only the second predictor variable is informative. The left plot corresponds to bootstrap sampling with replacement, the right plot to subsampling without replacement. [Two bar plots of Gini importance (0 to 25) for X1 to X5 with the randomForest function.]

For each of the 876 observations, the data set gives

• the response at the site of interest (binary: edited/not edited) and, as potential predictor variables,

• the 40 nucleotides at positions -20 to 20, relative to theedited site (4 categories),

• the codon position (4 categories),

Figure 7: Results of the power case study – unscaled permutation importance. Distributions of the unscaled permutation importance measures for the power case, where only the second predictor variable is informative. The plots in the top row display the distributions when the randomForest function is used, the bottom row when the cforest function is used. The left column corresponds to bootstrap sampling with replacement, the right column to subsampling without replacement. [Four boxplot panels of unscaled permutation importance for X1 to X5.]


• the estimated folding energy (continuous) and

• the difference in estimated folding energy between pre-edited and edited sequences (continuous).
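Assuming the data have been downloaded from the journal's homepage and saved locally (the file and column names below are hypothetical), the cforest analysis takes the following form:

    ## Sketch: permutation importance for the C-to-U data
    library(party)
    ctou <- read.csv("editing_arabidopsis.csv")  # 876 observations, 43 predictors
    cf <- cforest(edit ~ ., data = ctou,
                  controls = cforest_control(replace = FALSE, fraction = 0.632))
    vi <- varimp(cf)                             # permutation importance
    barplot(vi, las = 2)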

We first derive the permutation importance measure for each of the 43 potential predictor variables with each method. As can be seen from the barplot in Figure 9, the (scaled) variable importance measures largely reflect the results of [11] based on the Gini importance measure, but differ slightly for the randomForest and cforest function and the different resampling schemes. In particular, the variable importance measure of the randomForest function seems to produce more "noise" than that of the cforest function: the contrast of amplitudes between irrelevant and relevant predictors is more pronounced when the cforest function is used.

Figure 8: Results of the power case study – scaled permutation importance. Distributions of the scaled permutation importance measures for the power case, where only the second predictor variable is informative. The plots in the top row display the distributions when the randomForest function is used, the bottom row when the cforest function is used. The left column corresponds to bootstrap sampling with replacement, the right column to subsampling without replacement. [Four boxplot panels of scaled permutation importance for X1 to X5.]


Note, however, that the permutation importance values for one predictor variable can vary between two computations, because each computation is based on a different random permutation of the variable. Therefore, before interpreting random forest permutation importance values, the analysis should be repeated (with several different random seeds) to test the stability of the results.

Similarly to the simulation study, we also compared the prediction accuracy of the four approaches for this data set. To do so, we split the original data set into learning and test sets with size ratio 2:1 in a standard split-sample validation scheme. A random forest is grown based on the learning set and subsequently used to predict the observations in the test set. This procedure is repeated 100 times, and the mean misclassification rates over the 100 runs are reported in Table 5. Again we find a slight superiority of the cforest function, especially when sampling is conducted without replacement. (Differences to the accuracy values reported in [11] are most likely due to their use of a different validation scheme, which is not reported in detail in [11].)
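A sketch of this validation scheme for the cforest variant, reusing the hypothetical ctou data frame from above:

    ## Split-sample validation: 100 random 2:1 learning/test splits
    set.seed(290875)                         # any seed; chosen arbitrarily
    err <- replicate(100, {
      learn_id <- sample(nrow(ctou), size = round(2/3 * nrow(ctou)))
      cf <- cforest(edit ~ ., data = ctou[learn_id, ],
                    controls = cforest_control(replace = FALSE, fraction = 0.632))
      pred <- predict(cf, newdata = ctou[-learn_id, ])
      mean(pred != ctou$edit[-learn_id])     # misclassification rate
    })
    mean(err)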

All function calls and all important options of the randomForest and cforest functions used in the simulation studies and the application to C-to-U conversion data are documented in the supplement [see Additional file 1].

Sources of variable importance bias

The main difference between the randomForest function, based on CART trees [18], and the cforest function, based on conditional inference trees [29], is that in randomForest the variable selection in the individual CART trees is biased, so that e.g. variables with more categories are preferred. This is illustrated in the next section on variable selection bias in individual classification trees.

However, even if the individual trees select variables in an unbiased way, as in the cforest function, we find that the variable importance measures, as well as the selection frequencies of the variables, are affected by bootstrap sampling with replacement. This is explained in the section on effects induced by bootstrapping.

Variable selection bias in the individual classification trees of a random forest

Let us again consider the null case simulation study design, where none of the variables is informative, and thus each should be selected with equally low probability in a classification tree.

In traditional classification tree algorithms, like CART, for each variable a split criterion like the "Gini index" is computed for all possible cutpoints within the range of that variable. The variable selected for the next split is the one that produced the highest criterion value overall, i.e. in its best cutpoint.

Obviously variables with more potential cutpoints are more likely to produce a good criterion value by chance, as in a multiple testing situation. Therefore, if we compare the highest criterion value of a variable with two categories, say, that provides only one cutpoint from which the criterion was computed, with a variable with four categories, that provides seven cutpoints from which the best criterion value is used, the latter is often preferred. Because the number of cutpoints grows exponentially with the number of categories of unordered categorical predictors, we find a preference for variables with more categories in CART-like classification trees. For further reading on variable selection bias in classification trees see, e.g., the corresponding sections in [24,25,28,29,33-35].
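This count is easy to verify: a nominal predictor with k categories admits 2^(k-1) - 1 distinct binary partitions, so that, e.g., two categories yield one cutpoint and four categories yield seven:

    ## Number of distinct binary splits of a nominal variable with k categories
    cutpoints <- function(k) 2^(k - 1) - 1
    sapply(c(2, 4, 10, 20), cutpoints)  # 1, 7, 511, 524287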

Table 3: Rates of correct identifications of the informative variable for the power case

                                          Degree of dependence
                  Method        Replacement  0.05    0.1     0.15    0.2
    Scaled        randomForest  true         0.234   0.497   0.770   0.956
                                false        0.237   0.489   0.760   0.949
                  cforest       true         0.338   0.672   0.923   0.991
                                false        0.365   0.728   0.943   0.994
    Unscaled      randomForest  true         0.194   0.413   0.701   0.928
                                false        0.186   0.400   0.710   0.919
                  cforest       true         0.324   0.648   0.910   0.989
                                false        0.370   0.729   0.943   0.994

Rates of correct identifications of the informative variable with the scaled and unscaled permutation importance of the randomForest method, applied with sampling with and without replacement, as compared to those of the cforest method, applied with sampling with and without replacement, as a function of the degree of dependence (indicated by the relevance parameter, cf. Table 2) between the informative variable X2 and the response. (Standard errors of the rates of correct identifications r over 1000 iterations can easily be computed by se = sqrt(r(1 - r)/1000).)


Figure 9: Results for the C-to-U conversion data – scaled permutation importance. Scaled variable importance measures for the C-to-U conversion data. The plots in the top row display the measures when the randomForest function is used, the bottom row when the cforest function is used. The left column corresponds to bootstrap sampling with replacement, the right column to subsampling without replacement. [Four bar plots; in each, positions -20 through 20 indicate the nucleotides flanking the site of interest, and the last three bars on the right refer to the codon position (cp), the estimated folding energy (fe) and the difference in estimated folding energy (dfe).]


Since the Gini importance measure in randomForest is directly derived from the Gini index split criterion used in the underlying individual classification trees, it carries forward the same bias, as was shown in Figures 2 and 6.

Conditional inference trees [29], which are used to construct the classification trees in cforest, are unbiased in variable selection. Here, variable selection is conducted by minimizing the p value of a conditional inference independence test, comparable, e.g., to the χ2 test, which incorporates the number of categories of each variable in the degrees of freedom.

The mean selection frequencies (again over 1000 simulation runs) of the five predictor variables of the null case simulation study design for both CART classification trees (as implemented in the rpart function [36]) and conditional inference trees (function ctree) are displayed in Figure 10. We find that the variable selection with the rpart function is highly biased, while for the ctree function it is unbiased.
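The rpart half of this comparison can be reproduced in a few lines (a minimal sketch reusing the simulate_data() helper defined after Table 2; stumps are fitted so that only the root split matters):

    ## Sketch: root-split selection frequencies of CART trees in the null case
    library(rpart)
    root_split <- replicate(1000, {
      d   <- simulate_data(relevance = 0)                 # null case
      fit <- rpart(y ~ ., data = d,
                   control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
      as.character(fit$frame$var[1])                      # variable of the root split
    })
    table(root_split)  # X5 (20 categories) is chosen far more often than X1 or X2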

The variable selection bias that occurs in every individual tree in the randomForest function also has a direct effect on the variable importance measures of this function. Predictor variables with more categories are artificially preferred in variable selection in each splitting decision. Thus, they are selected in more individual classification trees and tend to be situated closer to the root node in each tree.

The variable selection bias affects the variable importance measures in two respects. Firstly, the variable selection frequencies over all trees are directly affected by the variable selection bias in each individual tree. Secondly, the effect on the permutation importance is less obvious but just as severe.

When permuting the variables to compute their permutation importance measure, the variables that appear in more trees and are situated closer to the root node can affect the prediction accuracy of a larger set of observations, while variables that appear in fewer trees and are situated closer to the bottom nodes affect only small subsets of observations. Thus, the range of possible changes in prediction accuracy in the random forest, i.e. the deviation of the variable importance measure, is higher for variables that are preferred by the individual trees due to variable selection bias.
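As a quick check of this mechanism, one can inspect the importance measures reported by randomForest on null-case data. This is a hedged sketch reusing the nullcase_data helper defined above; the number of trees and the random seed are arbitrary choices:

    library(randomForest)

    set.seed(42)
    dat <- nullcase_data()
    rf <- randomForest(y ~ ., data = dat, ntree = 1000, importance = TRUE)
    ## The returned matrix includes MeanDecreaseAccuracy (permutation
    ## importance) and MeanDecreaseGini; under variable selection bias
    ## both tend to grow with the number of categories, although no
    ## predictor is informative.
    importance(rf)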

We found in Figures 1 through 9 that the effects induced by the differences in scale of measurement of the predictor variables were more pronounced for the randomForest function, where variable selection in the individual trees is biased, than for the cforest function, where the individual trees are unbiased. However, we also found that when the cforest function is used with bootstrap sampling, the variable selection frequencies of the categorical predictors still depend on their number of categories (cf., e.g., bottom row, left plot in Figure 1), and the deviation of the permutation importance measure is also still affected by the number of categories (cf., e.g., bottom row, left plot in Figures 3 and 4).

Table 4: Mean misclassification rates for the power case

                            Degree of dependence
Method         Replacement  0.05             0.1              0.15             0.2
randomForest   true         0.4945 (0.0014)  0.4819 (0.0015)  0.4510 (0.0016)  0.4028 (0.0017)
randomForest   false        0.4942 (0.0014)  0.4814 (0.0015)  0.4496 (0.0016)  0.4026 (0.0017)
cforest        true         0.4910 (0.0014)  0.4660 (0.0016)  0.4169 (0.0019)  0.3491 (0.0019)
cforest        false        0.4879 (0.0014)  0.4581 (0.0017)  0.4022 (0.0019)  0.3384 (0.0019)

Mean misclassification rates of the randomForest method, applied with sampling with and without replacement, as compared to those of the cforest method, applied with sampling with and without replacement, as a function of the degree of dependence (indicated by the relevance parameter, cf. Table 2) between the informative variable X2 and the response. (Standard errors of the mean misclassification rates are given in parentheses.)

Table 5: Mean misclassification rates for application to C-to-U conversion data

Method         Replacement  Misclassification rate
randomForest   true         0.2896 (0.0022)
randomForest   false        0.2879 (0.0026)
cforest        true         0.2807 (0.0024)
cforest        false        0.2788 (0.0025)

Mean misclassification rates of the randomForest method applied with sampling with and without replacement as compared to those of the cforest method applied with sampling with and without replacement. (Standard errors of the mean misclassification rates are given in parentheses.)



Thus, there must be another source of bias, besides the variable selection bias in the individual trees, that affects the selection frequencies and the deviation of the permutation importance measure.

We show in the next section that this additional effect is due to bootstrap sampling with replacement, which is traditionally employed in random forests.

Effects induced by bootstrapping

From the comparison of left and right columns (representing sampling with and without replacement) in Figures 1 and 5 we learned that the variable selection frequencies in random forest functions are affected by the resampling scheme.

We found that, even when the cforest function based on unbiased classification trees is used, variables with more categories are preferred when bootstrap sampling is conducted with replacement, while no bias occurs when subsampling is conducted without replacement, as displayed in the bottom right plot in Figures 1 and 5. Thus, the bootstrap sampling induces an effect that is more pronounced for predictor variables with more categories.

For a better understanding of the underlying mechanism let us consider only the categorical predictor variables X2 through X5 with different numbers of categories from the null case simulation study design.

Rather than trying to explain the effect of bootstrap sampling in the complex framework of random forests, we use a much simpler independence test for the explanation.

We consider the p values of χ² tests (computed from 1000 simulated data sets). In each simulation run, a χ² test is computed for each predictor variable and the binary response Y. Remember that the variables in the null case are not informative, i.e. the response is independent of all variables.

For independent variables the p values of the χ² test should follow a uniform distribution.

The left plot in Figure 11 displays the distribution of the p values of χ² tests from each predictor variable and the response Y as boxplots. We find that the boxplots range from 0 to 1 with median 0.5 as expected, because the p values of the χ² test form a uniform distribution when computed before bootstrapping. However, if in each simulation run we draw a bootstrap sample from the original sample and then again compute the p values based on the bootstrap sample, we find that the distribution of the p values is shifted towards zero, as displayed in the right plot in Figure 11.

Figure 10. Variable selection bias in individual trees. Relative selection frequencies for the rpart (left) and the ctree (right) classification tree methods. All variables are uninformative as in the null case simulation study.

[Figure 10: two bar plots – rpart (left) and ctree (right). y-axis: relative variable selection frequency (0.0–1.0); x-axis: X1 through X5.]



Obviously, the bootstrap sampling artificially induces an association between the variables. This effect is always present when statistical inference, such as an association test, is carried out on bootstrap samples: Bickel and Ren [37] point out that bootstrap hypothesis testing fails whenever the distribution of any statistic in the bootstrap sample, rather than the distribution of the statistic under the null hypothesis, is used for statistical inference. We found that this issue directly affects variable selection in random forests, because the deviation from the null hypothesis is more pronounced for variables that have more categories. The reason for the shift in the distribution of the p values displayed in Figure 11 is that each original sample, even if sampled from theoretically independent distributions, may show some minor variations from the null hypothesis of independence. These minor variations are aggravated by bootstrap sampling with replacement, because the cell counts in the contingency table are affected by observations that are either not included or are doubled or tripled in the bootstrap sample. Therefore the bootstrap sample deviates notably from the null hypothesis – even if the original sample was generated under the null hypothesis.

This effect is more pronounced for variables with more categories, because in larger tables (such as the 4 × 2 table from the cross-classification of X3 and the binary response Y) the absolute cell counts are smaller than in smaller tables (such as the 2 × 2 table from the cross-classification of X2 and the binary response Y). Relative to these smaller absolute cell counts, excluding or duplicating an observation produces more severe deviations from the null hypothesis.

This effect is not eliminated if the sample size is increased, because in bootstrap sampling the size n of the original sample and the bootstrap sample size increase simultaneously. However, if subsamples are drawn without replacement the effect disappears.
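A minimal R sketch of this χ² experiment (the sample size is illustrative, and 0.632 is an assumed subsample fraction) reproduces all three claims: roughly uniform p values before resampling, a shift towards zero after bootstrap sampling with replacement, and no such shift for subsampling without replacement:

    set.seed(1)
    n <- 120; reps <- 1000
    p_orig <- p_boot <- p_sub <- numeric(reps)
    for (i in seq_len(reps)) {
      x <- factor(sample(1:4, n, replace = TRUE))  # 4-category predictor
      y <- factor(sample(0:1, n, replace = TRUE))  # independent response
      p_orig[i] <- chisq.test(table(x, y))$p.value
      b <- sample(n, replace = TRUE)               # bootstrap, with replacement
      p_boot[i] <- chisq.test(table(x[b], y[b]))$p.value
      s <- sample(n, size = floor(0.632 * n))      # subsample, w/o replacement
      p_sub[i]  <- chisq.test(table(x[s], y[s]))$p.value
    }
    ## Bootstrapping shifts the p values towards zero; plain subsampling
    ## does not induce this artificial association.
    c(orig = median(p_orig), boot = median(p_boot), sub = median(p_sub))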

The apparent association that is induced by bootstrap sampling, and that is more pronounced for predictor variables with many categories, affects both variable importance measures: the selection frequency is again directly affected, and the permutation importance is affected because variables with many categories are selected more often and gain positions closer to the root node in the individual trees. Together with the mechanisms described in the previous section, this explains our findings.

Figure 11. Effects induced by bootstrapping. Distribution of the p values of χ² tests of each categorical variable X2,..., X5 and the binary response for the null case simulation study, where none of the predictor variables is informative. The left plot shows the distribution of the p values computed from the original sample before bootstrapping; the right plot shows the distribution of the p values computed for each variable from the bootstrap sample drawn with replacement.

[Figure 11: two boxplot panels – before bootstrapping (left) and after bootstrapping (right). y-axis: p value (0.0–1.0); x-axis: X2 through X5.]


From our simulation results we can see, however, that the effect of bootstrap sampling is mostly superposed by the much stronger effect of variable selection bias when comparing the conditions of sampling with and without replacement for the randomForest function only (cf. Figures 1 through 9, top row). Only when variable selection bias is removed by the cforest function do the differences between the conditions of sampling with and without replacement become obvious (cf. Figures 1 through 9, bottom row). We therefore conclude that, in order to be able to reliably interpret the variable importance measures of a random forest, the forest must be built from unbiased classification trees, and sampling must be conducted without replacement.

Conclusion

Random forests are a powerful statistical tool that has found many applications in various scientific areas. It has been applied to such a wide variety of problems as large-scale association studies for complex genetic diseases, the prediction of phenotypes based on amino acid or DNA sequences, QSAR modeling and clinical medicine, to name just a few.

Features that have added to the popularity of random forests, especially in bioinformatics and related fields where identifying a subset of relevant predictor variables from very large sets of candidates is the major challenge, include the method's ability to deal with critical "small n large p" data sets and the variable importance measures it provides for variable selection purposes.

However, when a method is used for variable selection rather than prediction only, it is particularly important that the value and interpretation of the variable importance measure actually depict the importance of the variable, and are not affected by any other characteristics.

We found that for the original random forest method the variable importance measures are affected by the number of categories and the scale of measurement of the predictor variables, which are no direct indicators of the true importance of a variable.

As long as, e.g., only continuous predictor variables, as in most gene expression studies, or only variables with the same number of categories are considered in the sample, variable selection with random forest variable importance measures is not affected by our findings. However, in studies where continuous variables, such as the folding energy, are used in combination with categorical information from the neighboring nucleotides, or where categorical predictors, as in amino acid sequence data, vary in their number of categories present in the sample, variable selection with random forest variable importance measures is unreliable and may even be misleading.

Information on clinical and environmental variables, especially, is often gathered by means of questionnaires, where the number of categories can vary between questions. The number of categories is typically determined by many different factors, but is not necessarily an indicator of variable importance. Similarly, the number of different categories of a predictor actually available in a certain sample is not an indicator of its relevance for predicting the response. Hence, the number of categories of a variable should not influence its estimated importance – otherwise the results of a study could easily be distorted when an irrelevant variable with many categories is included in the study design.

We showed that, due to variable selection bias in the individual classification trees and effects induced by bootstrap sampling, the variable importance measures of the randomForest function are not reliable in many scenarios relevant in applied research.

As an alternative random forest method we propose to use the cforest function, which provides unbiased variable selection in the individual classification trees. When this method is applied with subsampling without replacement, the resulting variable importance measure can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
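A minimal usage sketch of the proposed approach follows; option values such as ntree and the 0.632 subsample fraction are illustrative choices here, and 'dat' stands in for the user's data set (see the supplementary R file for the settings actually used in our analyses):

    library(party)

    set.seed(290875)
    cf <- cforest(y ~ ., data = dat,
                  controls = cforest_control(ntree = 500,
                                             replace  = FALSE,  # subsampling
                                             fraction = 0.632)) # subsample size
    vi <- varimp(cf)             # permutation importance per predictor
    sort(vi, decreasing = TRUE)  # a ranking that is safe to interpret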

With respect to computation time the cforest function is more expensive than the randomForest function, because, in order to be unbiased, split decisions and stopping rely on time-consuming conditional inference. To give an impression: for the application to the C-to-U conversion data, with 876 observations and 44 predictor variables, the computation times stated in the supplementary file for the cforest function used with bootstrap sampling with replacement are in the range of 8.38 sec., while subsampling without replacement is computationally less expensive and in the range of 4.82 sec.

Since we saw that only subsampling without replacement guarantees reliable variable selection and produces unbiased variable importance measures, the faster version without replacement should be preferred anyway. The computation time for the randomForest function is in the range of 0.24 sec. with and 0.18 sec. without replacement. However, we saw that the randomForest function should not be used when the potential predictor variables vary in their scale of measurement or their number of categories.
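Readers who wish to gauge these costs on their own hardware can time both fits directly; this is a hedged sketch in which 'dat' again stands in for the data set, and timings will vary with data size and options:

    library(randomForest)
    library(party)

    ## Elapsed time for each forest, both with subsampling (replace = FALSE).
    system.time(randomForest(y ~ ., data = dat, replace = FALSE))
    system.time(cforest(y ~ ., data = dat,
                        controls = cforest_control(replace = FALSE)))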


The aim of this paper was to explore the limits of the empirical measures of variable importance provided for random forests, to understand the underlying mechanisms, and to use that understanding to guarantee unbiased and reliable variable selection in random forests.

In a more theoretical work, van der Laan [38] gives a fundamental definition of variable importance, as well as a statistical inference framework for estimating and testing variable importance. Inspired by this approach, future research on variable importance measures for variable selection with random forests aims at providing further means of statistical inference that can be used to guide the decision on which and how many predictor variables to select in a certain problem.

Authors' contributions

CS first observed the variable selection bias in random forests, set up and performed the simulation experiments, studied the sources of the selection bias and drafted the manuscript. ALB, AZ and TH contributed to the design of the simulation experiments, to theoretical investigations of the problem, and to the manuscript. TH implemented the cforest, ctree and varimp functions. ALB analyzed the C-to-U conversion data. All authors read and approved the final manuscript.

Additional material

Additional File 1 – R source code. The exemplary R source code includes all function calls and comments on all important options of the randomForest and cforest functions that were used in the simulation studies and the application to C-to-U conversion data. Please install the latest versions of the packages randomForest and party before use. [http://www.biomedcentral.com/content/supplementary/1471-2105-8-25-S1.R]

Acknowledgements

CS was supported by the German Research Foundation (DFG), collaborative research center 386 "Statistical Analysis of Discrete Structures". TH received financial support from DFG grant HO 3242/1-3. The authors would like to thank Thomas Augustin, Friedrich Leisch and Gerhard Tutz for fruitful discussions and for supporting our interest in this field of research, and Peter Bühlmann, an anonymous referee and a semi-anonymous referee for their helpful suggestions.

References

1. Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, Eerdewegh PV: Identifying SNPs Predictive of Phenotype Using Random Forests. Genetic Epidemiology 2005, 28:171-182.
2. Heidema AG, Boer JMA, Nagelkerke N, Mariman ECM, van der A DL, Feskens EJM: The Challenge for Genetic Epidemiologists: How to Analyze Large Numbers of SNPs in Relation to Complex Diseases. BMC Genetics 2006, 7:23.
3. Breiman L: Random Forests. Machine Learning 2001, 45:5-32.
4. Díaz-Uriarte R, Alvarez de Andrés S: Gene Selection and Classification of Microarray Data Using Random Forest. BMC Bioinformatics 2006, 7:3.
5. Lunetta KL, Hayward LB, Segal J, Eerdewegh PV: Screening Large-Scale Association Study Data: Exploiting Interactions Using Random Forests. BMC Genetics 2004, 5:32.
6. Gunther EC, Stone DJ, Gerwien RW, Bento P, Heyes MP: Prediction of Clinical Drug Efficacy by Classification of Drug-induced Genomic Expression Profiles in vitro. Proceedings of the National Academy of Sciences 2003, 100:9608-9613.
7. Huang X, Pan W, Grindle S, Han X, Chen Y, Park SJ, Miller LW, Hall J: A Comparative Study of Discriminating Human Heart Failure Etiology Using Gene Expression Profiles. BMC Bioinformatics 2005, 6:205.
8. Shih Y: Tumor Classification by Tissue Microarray Profiling: Random Forest Clustering Applied to Renal Cell Carcinoma. Modern Pathology 2005, 18:547-557.
9. Segal MR, Barbour JD, Grant RM: Relating HIV-1 Sequence Variation to Replication Capacity via Trees and Forests. Statistical Applications in Genetics and Molecular Biology 2004, 3:2.
10. Cummings MP, Segal MR: Few Amino Acid Positions in rpoB are Associated with Most of the Rifampin Resistance in Mycobacterium Tuberculosis. BMC Bioinformatics 2004, 5:137.
11. Cummings MP, Myers DS: Simple Statistical Models Predict C-to-U Edited Sites in Plant Mitochondrial RNA. BMC Bioinformatics 2004, 5:132.
12. Qi Y, Bar-Joseph Z, Klein-Seetharaman J: Evaluation of Different Biological Data and Computational Classification Methods for Use in Protein Interaction Prediction. Proteins 2006, 63:490-500.
13. Guha R, Jurs PC: Development of Linear, Ensemble, and Nonlinear Models for the Prediction and Interpretation of the Biological Activity of a Set of PDGFR Inhibitors. Journal of Chemical Information and Computer Sciences 2003, 44:2179-2189.
14. Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP: Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. Journal of Chemical Information and Computer Sciences 2003, 43:1947-1958.
15. Arun K, Langmead CJ: Structure Based Chemical Shift Prediction Using Random Forests Non-linear Regression. Proceedings of the Fourth Asia-Pacific Bioinformatics Conference, Taipei, Taiwan 2006:317-326.
16. Furlanello C, Neteler M, Merler S, Menegon S, Fontanari S, Donini D, Rizzoli A, Chemini C: GIS and the Random Forest Predictor: Integration in R for Tick-Borne Disease Risk Assessment. Proceedings of the 3rd International Workshop on Distributed Statistical Computing, Vienna, Austria 2003 [http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/].
17. Ward MM, Pajevic S, Dreyfuss J, Malley JD: Short-Term Prediction of Mortality in Patients with Systemic Lupus Erythematosus: Classification of Outcomes Using Random Forests. Arthritis and Rheumatism 2006, 55:74-80.
18. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. New York: Chapman and Hall; 1984.
19. Friedman J: Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics 2001, 29:1189-1232.
20. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria 2006 [http://www.R-project.org/].
21. Breiman L, Cutler A, Liaw A, Wiener M: Breiman and Cutler's Random Forests for Classification and Regression 2006 [http://CRAN.R-project.org/]. [R package version 4.5-16].
22. Liaw A, Wiener M: Classification and Regression by randomForest. R News 2002, 2:18-22 [http://CRAN.R-project.org/doc/Rnews/].
23. Hothorn T, Hornik K, Zeileis A: party: A Laboratory for Recursive Part(y)itioning 2006 [http://CRAN.R-project.org/]. [R package version 0.9-0].
24. Kononenko I: On Biases in Estimating Multi-Valued Attributes. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montréal, Canada 1995:1034-1040.
25. Kim H, Loh W: Classification Trees with Unbiased Multiway Splits. Journal of the American Statistical Association 2001, 96:589-604.
26. Boulesteix AL: Maximally Selected Chi-square Statistics for Ordinal Variables. Biometrical Journal 2006, 48:451-462.
27. Boulesteix AL: Maximally Selected Chi-square Statistics and Binary Splits of Nominal Variables. Biometrical Journal 2006, 48:838-848.



28. Strobl C, Boulesteix AL, Augustin T: Unbiased Split Selection for Classification Trees Based on the Gini Index. Computational Statistics & Data Analysis 2006 [http://dx.doi.org/10.1016/j.csda.2006.12.030].
29. Hothorn T, Hornik K, Zeileis A: Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics 2006, 15:651-674.
30. Friedman J, Hall P: On Bagging and Nonlinear Estimation. Preprint 1999 [http://www-stat.stanford.edu/~jhf/].
31. Bühlmann P, Yu B: Analyzing Bagging. The Annals of Statistics 2002, 30:927-961.
32. Politis DN, Romano JP, Wolf M: Subsampling. New York: Springer; 1999.
33. Dobra A, Gehrke J: Bias Correction in Classification Tree Construction. Proceedings of the Seventeenth International Conference on Machine Learning, Williams College, Williamstown, MA, USA 2001:90-97.
34. Strobl C: Statistical Sources of Variable Selection Bias in Classification Tree Algorithms Based on the Gini Index. Discussion Paper 420, SFB "Statistical Analysis of Discrete Structures", Munich, Germany 2005 [http://www.stat.uni-muenchen.de/sfb386/papers/dsp/paper420.ps].
35. Strobl C: Variable Selection in Classification Trees Based on Imprecise Probabilities. Proceedings of the Fourth International Symposium on Imprecise Probabilities and Their Applications, Carnegie Mellon University, Pittsburgh, PA, USA 2005:340-348.
36. Therneau TM, Atkinson B, Ripley BD: rpart: Recursive Partitioning 2006 [http://CRAN.R-project.org/]. [R package version 3.1-30].
37. Bickel PJ, Ren JJ: The Bootstrap in Hypothesis Testing. State of the Art in Probability and Statistics, Festschrift for Willem R. van Zwet, IMS Lecture Notes Monograph Series, Beachwood, OH, USA 2001, 36:91-112.
38. van der Laan M: Statistical Inference for Variable Importance. International Journal of Biostatistics 2006, 2:1008-1008.
