
Statistical Science 2003, Vol. 18, No. 1, 71-103. © Institute of Mathematical Statistics, 2003

Multiple Hypothesis Testing in Microarray Experiments

Sandrine Dudoit, Juliet Popper Shaffer and Jennifer C. Boldrick

Sandrine Dudoit is Assistant Professor, Division of Biostatistics, School of Public Health, University of California, Berkeley, California 94720-7360 (e-mail: [email protected]). Juliet Popper Shaffer is Senior Lecturer Emerita, Department of Statistics, University of California, Berkeley, California 94720-3860 (e-mail: [email protected]). Jennifer C. Boldrick is a graduate student, Department of Microbiology and Immunology, Stanford University, Stanford, California 94305-5124 (e-mail: [email protected]).

Abstract. DNA microarrays are part of a new and promising class of biotechnologies that allow the monitoring of expression levels in cells for thousands of genes simultaneously. An important and common question in DNA microarray experiments is the identification of differentially expressed genes, that is, genes whose expression levels are associated with a response or covariate of interest. The biological question of differential expression can be restated as a problem in multiple hypothesis testing: the simultaneous test for each gene of the null hypothesis of no association between the expression levels and the responses or covariates. As a typical microarray experiment measures expression levels for thousands of genes simultaneously, large multiplicity problems are generated. This article discusses different approaches to multiple hypothesis testing in the context of DNA microarray experiments and compares the procedures on microarray and simulated data sets.

Key words and phrases: Multiple hypothesis testing, adjusted p-value, family-wise Type I error rate, false discovery rate, permutation, DNA microarray.

    1. INTRODUCTION

The burgeoning field of genomics has revived interest in multiple testing procedures by raising new methodological and computational challenges. For example, DNA microarray experiments generate large multiplicity problems in which thousands of hypotheses are tested simultaneously. DNA microarrays are high-throughput biological assays that can measure DNA or RNA abundance in cells for thousands of genes simultaneously. Microarrays are being applied increasingly in biological and medical research to address a wide range of problems, such as the classification of tumors or the study of host genomic responses to bacterial infections (Alizadeh et al., 2000; Alon et al., 1999; Boldrick et al., 2002; Golub et al., 1999; Perou et al., 1999; Pollack et al., 1999; Ross et al., 2000). An important and common question in DNA microarray experiments is the identification of differentially expressed genes, that is, genes whose expression levels are associated with a response or covariate of interest. The covariates could be either polytomous (e.g., treatment/control status, cell type, drug type) or continuous (e.g., dose of a drug, time), and the responses could be, for example, censored survival times or other clinical outcomes. The biological question of differential expression can be restated as a problem in multiple hypothesis testing: the simultaneous test for each gene of the null hypothesis of no association between the expression levels and the responses or covariates. As a typical microarray experiment measures expression levels for thousands of genes simultaneously, large multiplicity problems are generated. In any testing situation, two types of errors can be committed: a false positive, or Type I error, is committed by declaring that a gene is differentially expressed when


it is not, and a false negative, or Type II error, is committed when the test fails to identify a truly differentially expressed gene. When many hypotheses are tested and each test has a specified Type I error probability, the chance of committing some Type I errors increases, often sharply, with the number of hypotheses. In particular, a p-value of 0.01 for one gene among a list of several thousands no longer corresponds to a significant finding, as it is very likely that such a small p-value will occur by chance under the null hypothesis when considering a large enough set of genes. Special problems that arise from the multiplicity aspect include defining an appropriate Type I error rate and devising powerful multiple testing procedures that control this error rate and account for the joint distribution of the test statistics. A number of recent articles have addressed the question of multiple testing in DNA

microarray experiments. However, the proposed solutions have not always been cast in the standard statistical framework (Dudoit et al., 2002; Efron et al., 2000; Golub et al., 1999; Kerr, Martin and Churchill, 2000; Manduchi et al., 2000; Pollard and van der Laan, 2003; Tusher, Tibshirani and Chu, 2001; Westfall, Zaykin and Young, 2001).

The present article discusses different approaches to multiple hypothesis testing in the context of DNA microarray experiments and compares the procedures on microarray and simulated data sets. Section 2 reviews basic notions and procedures for multiple testing, and discusses the recent proposals of Golub et al. (1999) and Tusher, Tibshirani and Chu (2001) within this framework. The microarray data sets and simulation models which are used to evaluate the different multiple testing procedures are described in Section 3, and the results of the comparison study are presented in Section 4. Finally, Section 5 summarizes our findings and outlines open questions. Although the focus is on the identification of differentially expressed genes in DNA microarray experiments, some of the methods described in this article are applicable to any large-scale multiple testing problem.

    2. METHODS

2.1 Multiple Testing in DNA Microarray Experiments

Consider a DNA microarray experiment which produces expression data on m genes (i.e., variables or features) for n mRNA samples (i.e., observations), and further suppose that a response or covariate of interest is recorded for each sample. Such data may arise, for example, from a study of gene expression in tumor biopsy specimens from leukemia patients (Golub et al., 1999): in this case, the response is the tumor type and the goal is to identify genes that are differentially expressed in the different types of tumors. The data for sample i consist of a response or covariate y_i and a gene expression profile x_i = (x_{1i}, ..., x_{mi}), where x_{ji} denotes the expression measure of gene j in sample i, i = 1, ..., n, j = 1, ..., m. The expression levels x_{ji} might be either absolute [e.g., Affymetrix oligonucleotide chips discussed in Lipshutz et al. (1999)] or relative with respect to the expression levels of a suitably defined common reference sample [e.g., Stanford two-color spotted cDNA microarrays discussed in Brown and Botstein (1999)]. Note that the expression measures x_{ji} are in general highly processed data. The raw data in a microarray experiment consist of image files, and important preprocessing steps include image analysis of these scanned images and normalization (Yang et al., 2001, 2002). The gene expression data are conventionally stored in an m × n matrix X = (x_{ji}), with rows corresponding to genes and columns corresponding to individual mRNA samples. In a typical experiment, the total number n of samples is anywhere between around 10 and a few hundred, and the number m of genes is several thousand. The gene expression measures, x, are generally continuous variables, while the responses or covariates, y, can be either polytomous or continuous, and possibly censored, as described above.

The pairs {(x_i, y_i)}_{i=1,...,n}, formed by the expression profiles x_i and responses or covariates y_i, are viewed as a random sample from a population of interest. The population and sampling mechanism will depend on the particular application (e.g., designed factorial experiment in yeast, retrospective study of human tumor gene expression). Let X_j and Y denote, respectively, the random variables that correspond to the expression measure for gene j, j = 1, ..., m, and the response or covariate. The goal is to use the sample data {(x_i, y_i)}_{i=1,...,n} to make inference about the population of interest, specifically, to test hypotheses concerning the joint distribution of the expression measures X = (X_1, ..., X_m) and the response or covariate Y. The biological question of differential expression can be restated as a problem in multiple hypothesis testing: the simultaneous test for each gene j of the null hypothesis H_j of no association between the expression measure X_j and the response or covariate Y. (In some cases, more specific null hypotheses may be


of interest, for example, the null hypothesis of equal mean expression levels in two populations of cells as opposed to identical distributions.) A standard approach to the multiple testing problem consists of two aspects:

(1) computing a test statistic T_j for each gene j, and
(2) applying a multiple testing procedure to determine which hypotheses to reject while controlling a suitably defined Type I error rate (Dudoit et al., 2002; Efron et al., 2000; Golub et al., 1999; Kerr, Martin and Churchill, 2000; Manduchi et al., 2000; Pollard and van der Laan, 2003; Tusher, Tibshirani and Chu, 2001; Westfall, Zaykin and Young, 2001).

The univariate problem (1) has been studied extensively in the statistical literature (Lehmann, 1986). In general, the appropriate test statistic will depend on the experimental design and the type of response or covariate. For example, for binary covariates, one might consider a t-statistic or a Mann-Whitney statistic; for polytomous responses, one might use an F-statistic; and for survival data one might rely on the score statistic for the Cox proportional hazards model. We will not discuss the choice of statistic any further here, except to say that for each gene j, the null hypothesis H_j is tested based on a statistic T_j which is a function of X_j and Y. The lower case t_j denotes a realization of

the random variable T_j. To simplify matters, and unless specified otherwise, we further assume that the null H_j is rejected for large values of |T_j| (two-sided hypotheses). Question (2) is the subject of the present article. Although multiple testing is by no means a new subject in the statistical literature, DNA microarray experiments present a new and challenging area of application for multiple testing procedures because of the sheer number of tests. In the remainder of this section, we review basic notions and approaches to multiple testing and discuss recent proposals for dealing with the multiplicity problem in microarray experiments.

    2.2 Type I and Type II Error Rates

Set-up. Consider the problem of testing simultaneously m null hypotheses H_j, j = 1, ..., m, and denote by R the number of rejected hypotheses. In the frequentist setting, the situation can be summarized by Table 1 (Benjamini and Hochberg, 1995). The specific m hypotheses are assumed to be known in advance; the numbers m_0 and m_1 = m − m_0 of true and false null hypotheses are unknown parameters, R is an observable random variable, and S, T, U and V are unobservable random variables. In the microarray context, there is a null hypothesis H_j for each gene j and rejection of H_j corresponds to declaring that gene j is differentially expressed. Ideally, one would like to minimize the number V of false positives, or Type I errors, and the number T of false negatives, or Type II errors. A standard approach in the univariate setting is to prespecify an acceptable level α for the Type I error rate and seek tests which minimize the Type II error rate, that is, maximize power, within the class of tests with Type I error rate at most α.

TABLE 1

                            Number not rejected   Number rejected   Total
True null hypotheses                U                   V            m_0
Non-true null hypotheses            T                   S            m_1
Total                             m − R                 R             m

Type I error rates. When testing a single hypothesis, H_1, say, the probability of a Type I error, that is, of rejecting the null hypothesis when it is true, is usually controlled at some designated level α. This can be achieved by choosing a critical value c_α such that Pr(|T_1| ≥ c_α | H_1) ≤ α and rejecting H_1 when |T_1| ≥ c_α. A variety of generalizations to the multiple testing situation are possible; the Type I error rates described next are the most standard (Shaffer, 1995).

The per-comparison error rate (PCER) is defined as the expected value of the number of Type I errors divided by the number of hypotheses, that is, PCER = E(V)/m.

The per-family error rate (PFER) is defined as the expected number of Type I errors, that is, PFER = E(V).

The family-wise error rate (FWER) is defined as the probability of at least one Type I error, that is, FWER = Pr(V ≥ 1).

The false discovery rate (FDR) of Benjamini and Hochberg (1995) is the expected proportion of Type I errors among the rejected hypotheses, that is, FDR = E(Q), where, by definition, Q = V/R if R > 0 and 0 if R = 0.

A multiple testing procedure is said to control a particular Type I error rate at level α if this error rate is less than or equal to α when the given procedure is applied to produce a list of R rejected hypotheses. For instance, the FWER is controlled at level α by


a particular multiple testing procedure if FWER ≤ α (and similarly for the other definitions of Type I error rates).
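These definitions are easy to evaluate empirically when the truth of each null hypothesis is known, as in a simulation study. The short Python sketch below is not part of the original article; the function name, the NumPy dependency and the array layout are our own assumptions. It computes Monte Carlo estimates of the PCER, PFER, FWER and FDR from a matrix of rejection indicators across simulation replicates.

```python
import numpy as np

def empirical_error_rates(rejected, is_true_null):
    """Monte Carlo estimates of the Type I error rates of Section 2.2.

    rejected: (n_rep, m) boolean array; rejected[b, j] indicates rejection of H_j
              in simulation replicate b.
    is_true_null: (m,) boolean array marking the hypotheses that are truly null.
    """
    rejected = np.asarray(rejected, dtype=bool)
    V = rejected[:, np.asarray(is_true_null, dtype=bool)].sum(axis=1)  # false positives
    R = rejected.sum(axis=1)                                           # total rejections
    m = rejected.shape[1]
    Q = np.where(R > 0, V / np.maximum(R, 1), 0.0)  # V/R with the convention 0/0 = 0
    return {"PCER": V.mean() / m,
            "PFER": V.mean(),
            "FWER": (V >= 1).mean(),
            "FDR": Q.mean()}
```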

Strong versus weak control. It is important to note that the error rates above are defined under the true and typically unknown data generating distribution for X = (X_1, ..., X_m) and Y. In particular, they depend upon which specific subset M_0 ⊆ {1, ..., m} of null hypotheses is true for this distribution. That is, the family-wise error rate is FWER = Pr(V ≥ 1 | ∩_{j ∈ M_0} H_j) = Pr(reject at least one H_j, j ∈ M_0 | ∩_{j ∈ M_0} H_j), where ∩_{j ∈ M_0} H_j refers to the subset of true null hypotheses for the data generating joint distribution. A fundamental, yet often ignored, distinction is that between strong and weak control of a Type I error rate (Westfall and Young, 1993, page 10). Strong control refers to control of the Type I error rate under any combination of true and false null hypotheses, that is, for any subset M_0 ⊆ {1, ..., m} of true null hypotheses. In contrast, weak control refers to control of the Type I error rate only when all the null hypotheses are true, that is, under a null distribution satisfying the complete null hypothesis H_0^C = ∩_{j=1}^m H_j with m_0 = m. In general, the complete null hypothesis H_0^C is not realistic and weak control is unsatisfactory. In reality, some null hypotheses may be true and others false, but the subset M_0 is unknown. Strong control ensures that the Type I error rate is controlled under the true and unknown data generating distribution. In the microarray setting, where it is very unlikely that no genes are differentially expressed, it seems particularly important to have strong control of the Type I error rate. Note that the concept of strong and weak control applies to each of the Type I error rates defined above: PCER, PFER, FWER and FDR. The reader is referred to Pollard and van der Laan (2003) for a discussion of multivariate null distributions and proposals for specifying such joint distributions based on projections of the data generating distribution, or of the joint distribution of the test statistics, on submodels satisfying the null hypotheses. In the remainder of this article, unless specified otherwise, probabilities and expectations are computed for the combination of true and false null hypotheses corresponding to the true data generating distribution, that is, under the composite null hypothesis ∩_{j ∈ M_0} H_j corresponding to the data generating distribution, where M_0 ⊆ {1, ..., m} is of size m_0.

Power. Within the class of multiple testing procedures that control a given Type I error rate at an acceptable level α, one seeks procedures that maximize power, that is, minimize a suitably defined Type II error rate. As with Type I error rates, the concept of power can be generalized in various ways when moving from single to multiple hypothesis testing. Three common definitions of power are (1) the probability of rejecting at least one false null hypothesis, Pr(S ≥ 1) = Pr(T ≤ m_1 − 1); (2) the average probability of rejecting the false null hypotheses, E(S)/m_1, or average power; and (3) the probability of rejecting all false null hypotheses, Pr(S = m_1) = Pr(T = 0) (Shaffer, 1995). When the family of tests consists of pairwise mean comparisons, these quantities have been called any-pair power, per-pair power and all-pairs power (Ramsey, 1978). In a spirit analogous to the FDR, one could also define power as E(S/R | R > 0) Pr(R > 0) = Pr(R > 0) − FDR; when m = m_1, this is the any-pair power Pr(S ≥ 1). One should note again that these probabilities depend upon which particular subset M_0 ⊆ {1, ..., m} of null hypotheses is true.

Comparison of Type I error rates. In general, for a given multiple testing procedure, PCER ≤ FWER ≤ PFER. Thus, for a fixed criterion α for controlling the Type I error rates, the order reverses for the number of rejections R: procedures that control the PFER are generally more conservative, that is, lead to fewer rejections, than those that control either the FWER or the PCER, and procedures that control the FWER are more conservative than those that control the PCER. To illustrate the properties of the different Type I error rates, suppose each hypothesis H_j is tested individually at level α_j and the decision to reject or not reject this hypothesis is based solely on that test. Under the complete null hypothesis, the PCER is simply the average of the α_j and the PFER is the sum of the α_j. In contrast, the FWER is a function not of the test sizes α_j alone, but also involves the joint distribution of the test statistics T_j:

$$\mathrm{PCER} = \frac{\alpha_1 + \cdots + \alpha_m}{m} \le \max(\alpha_1, \ldots, \alpha_m) \le \mathrm{FWER} \le \mathrm{PFER} = \alpha_1 + \cdots + \alpha_m.$$

The FDR also depends on the joint distribution of the test statistics and, for a fixed procedure, FDR ≤ FWER, with FDR = FWER under the complete null (Benjamini and Hochberg, 1995). The classical approach to multiple testing calls for strong control of the FWER (cf. the Bonferroni procedure). The recent proposal of Benjamini and Hochberg (1995) controls the FWER in the weak sense (since FDR = FWER under the complete null) and can be less conservative than FWER-controlling procedures


otherwise. Procedures that control the PCER are generally less conservative than those that control either the FDR or the FWER, but tend to ignore the multiplicity problem altogether. The following simple example describes the behavior of the various Type I error rates as the total number of hypotheses m and the proportion of true null hypotheses m_0/m vary.

A simple example. Consider Gaussian random m-vectors, with mean μ = (μ_1, ..., μ_m) and identity covariance matrix I_m. Suppose we wish to test simultaneously the m null hypotheses H_j: μ_j = 0 against the two-sided alternatives H_j': μ_j ≠ 0. Given a random sample of n m-vectors from this distribution, a simple multiple testing procedure would be to reject H_j if |X̄_j| ≥ z_{α/2}/√n, where X̄_j is the average of the jth coordinate for the n m-vectors, z_{α/2} is such that Φ(z_{α/2}) = 1 − α/2 and Φ(·) is the standard normal cumulative distribution function. Let R_j = I(|X̄_j| ≥ z_{α/2}/√n), where I(·) is the indicator function, which equals 1 if the condition in parentheses is true and 0 otherwise. Assume without loss of generality that the m_0 true null hypotheses are H_1, ..., H_{m_0}, that is, M_0 = {1, ..., m_0}. Then V = Σ_{j=1}^{m_0} R_j and R = Σ_{j=1}^{m} R_j. Analytical formulae for the Type I error rates can easily be derived as PFER = Σ_{j=1}^{m_0} ρ_j, PCER = Σ_{j=1}^{m_0} ρ_j/m, FWER = 1 − Π_{j=1}^{m_0}(1 − ρ_j) and

$$\mathrm{FDR} = \sum_{r_1=0}^{1} \cdots \sum_{r_m=0}^{1} \frac{\sum_{j=1}^{m_0} r_j}{\sum_{j=1}^{m} r_j} \prod_{j=1}^{m} \rho_j^{r_j} (1 - \rho_j)^{1 - r_j},$$

with the FDR convention that 0/0 = 0 and with ρ_j = E(R_j) = Pr(R_j = 1) = 1 − Φ(z_{α/2} − μ_j √n) + Φ(−z_{α/2} − μ_j √n) denoting the chance of rejecting hypothesis H_j. In our simple example, ρ_j = α for j = 1, ..., m_0 and, if we further assume that μ_j = d/√n for j = m_0 + 1, ..., m, then the expressions for the error rates simplify to PFER = m_0 α, PCER = m_0 α/m, FWER = 1 − (1 − α)^{m_0} and

$$\mathrm{FDR} = \sum_{s=0}^{m_1} \sum_{v=1}^{m_0} \frac{v}{v+s} \binom{m_0}{v} \alpha^{v} (1-\alpha)^{m_0 - v} \binom{m_1}{s} \pi^{s} (1-\pi)^{m_1 - s},$$

where π = 1 − Φ(z_{α/2} − d) + Φ(−z_{α/2} − d). Note that, unlike the PCER, PFER or FWER, the FDR depends on the distribution of the test statistics under the alternative hypotheses H_j', for j = m_0 + 1, ..., m, through the random variable S (here, the FDR is a function of π, the rejection probability under the

alternative hypotheses). In general, the FDR is thus more difficult to work with than the other three error rates discussed so far. Figure 1 displays plots of the FWER, PCER and FDR versus the number of hypotheses m, for different proportions m_0/m = 1, 0.9, 0.5, 0.1 of true null hypotheses and for α = 0.05 and d = 1. In general, the FWER and PFER increase sharply with the number of hypotheses m, while the PCER remains constant (the PFER is not shown in the figure because it is on a different scale, that is, it is not restricted to belong to the interval [0, 1]). Under the complete null (m = m_0), the FDR is equal to the FWER and both increase sharply with m. However, as the proportion of true null hypotheses m_0/m decreases, the FDR remains relatively stable as a function of m and approaches the PCER. We plotted the error rates for values of m between 1 and 100 only, to provide more detail in regions where there are sharp changes in these error rates. For larger m's, in the thousands as in DNA microarray experiments, the error rates tend to reach a plateau. Figure 2 displays plots of the FWER, PCER and FDR versus individual test size α, for different proportions m_0/m of true null hypotheses and for m = 100 and d = 1. The FWER is generally much larger than the PCER, the largest difference being under the complete null (m = m_0). As the proportion of true null hypotheses decreases, the FDR again becomes closer to the PCER. The error rates display similar behavior for larger values of the

number of hypotheses m, with a sharper increase of the FWER as α increases.
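The simplified expressions above are straightforward to evaluate numerically. The Python sketch below is our own illustration (the function name and the use of the standard library statistics and math modules are assumptions, not part of the article); it returns the PFER, PCER, FWER and FDR for given m, m_0, α and d, the settings varied in Figures 1 and 2.

```python
from math import comb
from statistics import NormalDist

def simple_example_error_rates(m, m0, alpha, d):
    """Analytic Type I error rates for the Gaussian example of Section 2.2,
    assuming mu_j = 0 for the m0 true nulls and mu_j = d / sqrt(n) otherwise."""
    m1 = m - m0
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # pi: rejection probability under the alternative hypotheses
    pi = 1 - NormalDist().cdf(z - d) + NormalDist().cdf(-z - d)
    pfer = m0 * alpha
    pcer = m0 * alpha / m
    fwer = 1 - (1 - alpha) ** m0
    fdr = 0.0
    for v in range(1, m0 + 1):          # v false positives among the m0 true nulls
        for s in range(0, m1 + 1):      # s rejections among the m1 false nulls
            fdr += (v / (v + s)) * comb(m0, v) * alpha**v * (1 - alpha)**(m0 - v) \
                   * comb(m1, s) * pi**s * (1 - pi)**(m1 - s)
    return {"PFER": pfer, "PCER": pcer, "FWER": fwer, "FDR": fdr}

# For instance, simple_example_error_rates(100, 50, 0.05, 1) should correspond to
# one point on the m0/m = 0.5 curves of Figure 2.
```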

2.3 p-Values

Unadjusted p-values. Consider first a single hypothesis H_1, say, and a family of tests of H_1 with level-α nested rejection regions S_α such that (1) Pr(T_1 ∈ S_α | H_1) = α for all α ∈ [0, 1] which are achievable under the distribution of T_1 and (2) S_α ⊆ S_{α'} for α ≤ α', for all α and α' for which these regions are defined in (1). Rather than simply reporting rejection or nonrejection of the hypothesis H_1, a p-value connected

with the test can be defined as p_1 = inf{α : t_1 ∈ S_α} (adapted from Lehmann, 1986, page 170, to include discrete test statistics). The p-value can be thought of as the level of the test at which the hypothesis H_1 would just be rejected, given t_1. The smaller the p-value p_1, the stronger the evidence against the null hypothesis H_1. Rejecting H_1 when p_1 ≤ α provides control of the Type I error rate at level α. In our context, the p-value can be restated as the probability of observing a test statistic as extreme or more extreme in


FIG. 1. Type I error rates, simple example. Plot of Type I error rates versus number of hypotheses m for different proportions of true null hypotheses, m_0/m = 1, 0.9, 0.5, 0.1. The model and multiple testing procedures are described in Section 2.2. The individual test size is α = 0.05 and the parameter d was set to 1. The nonsmooth behavior for small m is due to the fact that it is not always possible to have exactly 90, 50 or 10% of true null hypotheses and rounding to the nearest integer is necessary. FWER: red curve; FDR: blue curve; PCER: green curve.

the direction of rejection as the observed one, that is, p_1 = Pr(|T_1| ≥ |t_1| | H_1). Extending the concept of p-value to the multiple testing situation leads to the very useful definition of adjusted p-value.

Adjusted p-values. Let t_j and p_j = Pr(|T_j| ≥ |t_j| | H_j) denote, respectively, the test statistic and unadjusted or raw p-value for hypothesis H_j (gene j), j = 1, ..., m. Just as in the single hypothesis case, a multiple testing procedure may be defined in terms of

critical values for the test statistics or p-values of individual hypotheses: for example, reject H_j if |t_j| ≥ c_j or if p_j ≤ α_j, where the critical values c_j and α_j are chosen to control a given Type I error rate (FWER, PCER, PFER or FDR) at a prespecified level α. Alternatively, the multiple testing procedure may be described in terms of adjusted p-values. Given any test procedure, the adjusted p-value that corresponds to the test of a single hypothesis H_j can be defined as the nominal level of the entire test procedure at which H_j


FIG. 2. Type I error rates, simple example. Plot of Type I error rates versus individual test size α, for different proportions of true null hypotheses, m_0/m = 1, 0.9, 0.5, 0.1. The model and multiple testing procedures are described in Section 2.2. The number of hypotheses is m = 100 and the parameter d was set to 1. FWER: red curve; FDR: blue curve; PCER: green curve.

would just be rejected, given the values of all test statistics involved (Hommel and Bernhard, 1999; Shaffer, 1995; Westfall and Young, 1993; Wright, 1992; Yekutieli and Benjamini, 1999). If interest is in controlling the FWER, the adjusted p-value for hypothesis H_j, given a specified multiple testing procedure, is p̃_j = inf{α ∈ [0, 1] : H_j is rejected at nominal FWER = α}, where the nominal FWER is the α-level at which the specified procedure is performed. The corresponding random variables for unadjusted and adjusted p-values are denoted by P_j and P̃_j, respectively. Hypothesis H_j is then rejected, that is, gene j is declared differentially expressed, at nominal FWER α if p̃_j ≤ α. Note

that for many procedures, such as the Bonferroni procedure described in Section 2.4.1, the nominal level is usually larger than the actual level, thus resulting in a conservative test. Adjusted p-values for procedures controlling other types of error rates are defined similarly; that is, for FDR controlling procedures, p̃_j = inf{α ∈ [0, 1] : H_j is rejected at nominal FDR = α} (Yekutieli and Benjamini, 1999). As in the single hypothesis case, an advantage of reporting adjusted p-values, as opposed to only rejection or not of the hypotheses, is that the level of the test does not need to be determined in advance. Some multiple testing procedures are also most conveniently described in terms


of their adjusted p-values, and these can in turn be estimated by resampling methods (Westfall and Young, 1993).

Stepwise procedures. One usually distinguishes among three types of multiple testing procedures: single-step, step-down and step-up procedures. In single-step procedures, equivalent multiplicity adjustments are performed for all hypotheses, regardless of the ordering of the test statistics or unadjusted p-values; that is, each hypothesis is evaluated using a critical value that is independent of the results of tests of other hypotheses. Improvement in power, while preserving Type I error rate control, may be achieved by stepwise procedures, in which rejection of a particular hypothesis is based not only on the total number of hypotheses, but also on the outcome of the tests of other hypotheses. In step-down procedures, the hypotheses that correspond to the most significant test statistics (i.e., smallest unadjusted p-values or largest absolute test statistics) are considered successively, with further tests dependent on the outcomes of earlier ones. As soon as one fails to reject a null hypothesis, no further hypotheses are rejected. In contrast, for step-up procedures, the hypotheses that correspond to the least significant test statistics are considered successively, again with further tests dependent on the outcomes of earlier ones. As soon as one hypothesis is rejected, all remaining hypotheses are rejected. The next section discusses single-step and stepwise procedures

for control of the FWER.

2.4 Control of the Family-wise Error Rate

2.4.1 Single-step procedures. For strong control of the FWER at level α, the Bonferroni procedure, perhaps the best known in multiple testing, rejects any hypothesis H_j with unadjusted p-value less than or equal to α/m. The corresponding single-step Bonferroni adjusted p-values are thus given by

$$\tilde{p}_j = \min(m\, p_j,\ 1). \qquad (1)$$

Control of the FWER in the strong sense follows from

Boole's inequality. Assume without loss of generality that the true null hypotheses are H_j, for j = 1, ..., m_0. Then

$$\mathrm{FWER} = \Pr(V \ge 1) = \Pr\Biggl(\,\bigcup_{j=1}^{m_0} \{\tilde{P}_j \le \alpha\}\Biggr) \le \sum_{j=1}^{m_0} \Pr(\tilde{P}_j \le \alpha) \le \sum_{j=1}^{m_0} \Pr\Bigl(P_j \le \frac{\alpha}{m}\Bigr) \le m_0 \frac{\alpha}{m} \le \alpha,$$

where the last inequality follows from Pr(P_j ≤ x | H_j) ≤ x, for any x ∈ [0, 1]. Closely related to the Bonferroni procedure is the Šidák procedure. It is exact for protecting the FWER under the complete null, when the unadjusted p-values are independently distributed as U[0, 1]. The single-step Šidák adjusted p-values are given by

$$\tilde{p}_j = 1 - (1 - p_j)^m. \qquad (2)$$

However, in many situations, the test statistics, and hence the unadjusted p-values, are dependent. This is the case in DNA microarray experiments, where groups of genes tend to have highly correlated expression measures due, for example, to co-regulation. Westfall and Young (1993) proposed adjusted p-values for less conservative multiple testing procedures which take into account the dependence structure among the test statistics. The single-step minP adjusted p-values are defined by

$$\tilde{p}_j = \Pr\Bigl(\min_{1 \le l \le m} P_l \le p_j \,\Big|\, H_0^C\Bigr), \qquad (3)$$

where H_0^C denotes the complete null hypothesis and P_l denotes the random variable for the unadjusted p-value of the lth hypothesis. Alternatively, one may consider procedures based on the single-step maxT adjusted p-values, which are defined in terms of the test statistics T_j themselves:

$$\tilde{p}_j = \Pr\Bigl(\max_{1 \le l \le m} |T_l| \ge |t_j| \,\Big|\, H_0^C\Bigr). \qquad (4)$$

The following points should be noted regarding the four procedures introduced above.

1. If the unadjusted p-values (P_1, ..., P_m) are independent and P_j has a U[0, 1] distribution under H_j, the minP adjusted p-values are the same as the Šidák adjusted p-values.

2. The Šidák procedure does not guarantee control of the FWER for arbitrary distributions of the test statistics. However, it controls the FWER for test statistics that satisfy an inequality known as Šidák's inequality: Pr(|T_1| ≤ c_1, ..., |T_m| ≤ c_m) ≥ Π_{j=1}^{m} Pr(|T_j| ≤ c_j). This inequality, also known as the positive orthant dependence property, was initially derived by Dunn (1958) for (T_1, ..., T_m) that have a multivariate normal distribution with mean zero and certain types of covariance matrices. Šidák (1967) extended the result to arbitrary covariance matrices, and Jogdeo (1977) showed that the inequality holds for a larger class of distributions, including some multivariate t- and F-distributions.


When the Šidák inequality holds, the minP adjusted p-values are less than or equal to the Šidák adjusted p-values.

3. Computing the quantities in (3) using the upper bound provided by Boole's inequality yields the Bonferroni p-values. In other words, procedures based on the minP adjusted p-values are less conservative than the Bonferroni or Šidák (under the Šidák inequality) procedures. In the case of independent test statistics, the Šidák and minP adjustments are equivalent, as discussed in item 1 above.

4. Procedures based on the maxT and minP adjusted p-values provide weak control of the FWER. Strong control of the FWER holds under the assumption of subset pivotality (Westfall and Young, 1993, page 42). The distribution of unadjusted p-values (P_1, ..., P_m) is said to have the subset pivotality property if the joint distribution of the random vector {P_j : j ∈ M_0} is identical for distributions satisfying the composite null hypothesis ∩_{j ∈ M_0} H_j and the complete null H_0^C = ∩_{j=1}^m H_j, for all subsets M_0 of {1, ..., m}. Here, composite hypotheses of the form ∩_{j ∈ M_0} H_j refer to the joint distribution of test statistics T_j or p-values P_j for testing hypotheses H_j, j ∈ M_0. Without subset pivotality, multiplicity adjustment is more complex, as one would need to consider the distribution of the test statistics under partial null hypotheses ∩_{j ∈ M_0} H_j, rather than the complete null hypothesis H_0^C. In the microarray context considered in this article, each null hypothesis refers to a single gene j and each test statistic T_j is a function of the response/covariate Y and the expression measure X_j only. The composite hypothesis ∩_{j ∈ M_0} H_j refers to the joint distribution of the variables Y and {X_j : j ∈ M_0} and specifies that the random subvector of expression measures {X_j : j ∈ M_0} is independent of the response/covariate Y, that is, that the joint distribution of {X_j : j ∈ M_0} is identical for all levels of Y. The pivotality property holds given the assumption that test statistics for genes in the null subset M_0 have the same joint distribution regardless of the truth or falsity of the hypotheses in the complement of M_0. For a discussion of subset pivotality and examples of testing problems in which the condition holds and does not hold, see Westfall and Young (1993).

5. The maxT adjusted p-values are easier to compute than the minP p-values and are equal to the minP p-values when the test statistics T_j are identically distributed. However, the two procedures generally produce different adjusted p-values, and considerations of balance, power and computational feasibility should dictate the choice between the two approaches. In the case of non-identically distributed test statistics T_j (e.g., t-statistics with different degrees of freedom), not all tests contribute equally to the maxT adjusted p-values and this can lead to unbalanced adjustments (Beran, 1988; Westfall and Young, 1993, page 50). When adjusted p-values are estimated by permutation (Section 2.6) and a large number of hypotheses are tested, procedures based on the minP p-values tend to be more sensitive to the number of permutations and more conservative than those based on the maxT p-values. Also, the minP procedure requires more computations than the maxT procedure, because the unadjusted p-values must be estimated before considering the distribution of their successive minima (Ge, Dudoit and Speed, 2003).

2.4.2 Step-down procedures. While single-step procedures are simple to implement, they tend to be conservative for control of the FWER. Improvement in power, while preserving strong control of the FWER, may be achieved by step-down procedures. Below are the step-down analogs, in terms of their adjusted p-values, of the four procedures described in the previous section. Let p_{r_1} ≤ p_{r_2} ≤ ... ≤ p_{r_m} denote the observed ordered unadjusted p-values and let H_{r_1}, H_{r_2}, ..., H_{r_m} denote the corresponding null hypotheses. For strong control of the FWER at level α, the Holm (1979) procedure proceeds as follows. Define j* = min{j : p_{r_j} > α/(m − j + 1)} and reject hypotheses H_{r_j}, for j = 1, ..., j* − 1. If no such j* exists, reject all hypotheses. The step-down Holm adjusted p-values are thus given by

$$\tilde{p}_{r_j} = \max_{k=1,\ldots,j} \Bigl\{\min\bigl((m - k + 1)\, p_{r_k},\ 1\bigr)\Bigr\}. \qquad (5)$$

Holm's procedure is less conservative than the standard Bonferroni procedure, which would multiply the unadjusted p-values by m at each step. Note that taking successive maxima of the quantities min((m − k + 1) p_{r_k}, 1) enforces monotonicity of the adjusted p-values. That is, p̃_{r_1} ≤ p̃_{r_2} ≤ ... ≤ p̃_{r_m}, and one can reject a particular hypothesis only if all hypotheses with smaller unadjusted p-values were rejected beforehand. Similarly, the step-down Šidák adjusted p-values are defined as

$$\tilde{p}_{r_j} = \max_{k=1,\ldots,j} \Bigl\{1 - (1 - p_{r_k})^{m - k + 1}\Bigr\}. \qquad (6)$$


The Westfall and Young (1993) step-down minP adjusted p-values are defined by

$$\tilde{p}_{r_j} = \max_{k=1,\ldots,j} \Pr\Bigl(\min_{l \in \{r_k, \ldots, r_m\}} P_l \le p_{r_k} \,\Big|\, H_0^C\Bigr) \qquad (7)$$

and the step-down maxT adjusted p-values are defined by

$$\tilde{p}_{r_j} = \max_{k=1,\ldots,j} \Pr\Bigl(\max_{l \in \{r_k, \ldots, r_m\}} |T_l| \ge |t_{r_k}| \,\Big|\, H_0^C\Bigr), \qquad (8)$$

where |t_{r_1}| ≥ |t_{r_2}| ≥ ... ≥ |t_{r_m}| denote the observed ordered test statistics. Note that applying Boole's inequality to the quantities in (7) yields Holm's p-values. A procedure based on the step-down minP adjusted p-values is thus less conservative than Holm's procedure. For a proof of the strong control of the FWER for the maxT and minP procedures the reader is referred

to Westfall and Young (1993, Section 2.8). Step-down procedures such as the Holm procedure may be further improved by taking into account logically related hypotheses, as described in Shaffer (1986).
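The step-down adjustments of Equations (5) and (6) involve only sorting and running maxima. The sketch below is our own Python illustration (function names and the NumPy dependency are assumptions); the successive maxima enforce the monotonicity of the adjusted p-values discussed above.

```python
import numpy as np

def holm_adjusted(p):
    """Step-down Holm adjusted p-values, Equation (5)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)                                 # indices r_1, ..., r_m
    adj = np.minimum((m - np.arange(m)) * p[order], 1.0)  # min((m - k + 1) p_{r_k}, 1)
    adj = np.maximum.accumulate(adj)                      # successive maxima over k <= j
    out = np.empty(m)
    out[order] = adj
    return out

def sidak_stepdown_adjusted(p):
    """Step-down Sidak adjusted p-values, Equation (6)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = 1.0 - (1.0 - p[order]) ** (m - np.arange(m))    # 1 - (1 - p_{r_k})^(m - k + 1)
    adj = np.maximum.accumulate(adj)
    out = np.empty(m)
    out[order] = adj
    return out
```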

2.4.3 Step-up procedures. In contrast to step-down procedures, step-up procedures begin with the least significant p-value, p_{r_m}, and are usually based on the following probability result of Simes (1986). Under the complete null hypothesis H_0^C and for independent test statistics, the ordered unadjusted p-values P_{(1)} ≤ P_{(2)} ≤ ... ≤ P_{(m)} satisfy

$$\Pr\Bigl(P_{(j)} > \tfrac{j}{m}\alpha,\ j = 1, \ldots, m \,\Big|\, H_0^C\Bigr) \ge 1 - \alpha,$$

with equality in the continuous case. This inequality is known as the Simes inequality. In important cases of dependent test statistics, Simes showed that the probability was larger than 1 − α; however, this does not hold generally for all joint distributions.

Hochberg (1988) used the Simes inequality to derive the following FWER controlling procedure. For control of the FWER at level α, let j* = max{j : p_{r_j} ≤ α/(m − j + 1)} and reject hypotheses H_{r_j}, for j = 1, ..., j*. If no such j* exists, reject no hypothesis. The step-up Hochberg adjusted p-values are thus given by

$$\tilde{p}_{r_j} = \min_{k=j,\ldots,m} \Bigl\{\min\bigl((m - k + 1)\, p_{r_k},\ 1\bigr)\Bigr\}. \qquad (9)$$

The Hochberg (1988) procedure can be viewed as the step-up analog of Holm's step-down procedure, since the ordered unadjusted p-values are compared to the same critical values in both procedures, namely, α/(m − j + 1). Related procedures include those of Hommel (1988) and Rom (1990). Step-up procedures often have been found to be more powerful than their step-down counterparts; however, it is important to keep in mind that all procedures based on the Simes inequality rely on the assumption that the result proved under independence yields a conservative procedure for dependent tests. More research is needed to determine circumstances in which such methods are applicable and, in particular, whether they are applicable for the types of correlation structures encountered in DNA microarray experiments. Troendle (1996) proposed a permutation-based step-up multiple testing procedure which takes into account the dependence structure among the test statistics and is related to the Westfall and Young (1993) step-down maxT procedure.
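A minimal Python sketch of the Hochberg adjustment in Equation (9) follows (our own illustration; the function name and the NumPy dependency are assumptions). The only change relative to Holm's procedure is that successive minima are taken from the least significant p-value downwards.

```python
import numpy as np

def hochberg_adjusted(p):
    """Step-up Hochberg adjusted p-values, Equation (9)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.minimum((m - np.arange(m)) * p[order], 1.0)  # min((m - k + 1) p_{r_k}, 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]          # successive minima over k >= j
    out = np.empty(m)
    out[order] = adj
    return out
```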

2.5 Control of the False Discovery Rate

A different approach to multiple testing was proposed in 1995 by Benjamini and Hochberg. These authors argued that, in many situations, control of the FWER can lead to unduly conservative procedures and one may be prepared to tolerate some Type I errors, provided their number is small in comparison to the number of rejected hypotheses. These considerations led to a less conservative approach which calls for controlling the expected proportion of Type I errors among the rejected hypotheses, the false discovery rate, FDR. Specifically, the FDR is defined as FDR = E(Q), where Q = V/R if R > 0 and 0 if R = 0; that is, FDR = E(V/R | R > 0) Pr(R > 0). Under the complete null, given the definition of 0/0 = 0 when R = 0, the FDR is equal to the FWER; procedures that control the FDR thus also control the FWER in the weak sense. Note that earlier references to the FDR can be found in Seeger (1968) and Soric (1989).

Benjamini and Hochberg (1995) derived the following step-up procedure for (strong) control of the FDR for independent test statistics. Let p_{r_1} ≤ p_{r_2} ≤ ... ≤ p_{r_m} denote the observed ordered unadjusted p-values. For control of the FDR at level α, define j* = max{j : p_{r_j} ≤ (j/m)α} and reject hypotheses H_{r_j} for j = 1, ..., j*. If no such j* exists, reject no hypothesis. Corresponding adjusted p-values are

$$\tilde{p}_{r_j} = \min_{k=j,\ldots,m} \Bigl\{\min\Bigl(\frac{m}{k}\, p_{r_k},\ 1\Bigr)\Bigr\}. \qquad (10)$$

Benjamini and Yekutieli (2001) proved that this procedure controls the FDR under certain dependence structures (for example, positive regression dependence). They also proposed a simple conservative modification


of the procedure which controls the false discovery rate for arbitrary dependence structures. Adjusted p-values for the modified step-up procedure are

$$\tilde{p}_{r_j} = \min_{k=j,\ldots,m} \Biggl\{\min\Biggl(\frac{m \sum_{j=1}^{m} 1/j}{k}\, p_{r_k},\ 1\Biggr)\Biggr\}. \qquad (11)$$

The above two step-up procedures differ only in their penalty for multiplicity, that is, in the multiplier applied to the unadjusted p-values. For the standard Benjamini and Hochberg (1995) procedure, the penalty is m/k [Equation (10)], while for the conservative Benjamini and Yekutieli (2001) procedure it is (m Σ_{j=1}^m 1/j)/k [Equation (11)]. For a large number m of hypotheses, the penalties differ by a factor of about log m. Note that the Benjamini and Hochberg procedure can be conservative even in the independence case, as it was shown that for this step-up procedure E(Q) ≤ (m_0/m)α ≤ α. Until recently, most FDR controlling procedures were either designed for independent test statistics or did not make use of the dependence structure among the test statistics. In the spirit of the Westfall and Young (1993) resampling procedures for FWER control, Yekutieli and Benjamini (1999) proposed new FDR controlling procedures that use resampling-based adjusted p-values to incorporate certain types of dependence structures among the test statistics (the procedures assume, among other things, that the unadjusted p-values for the true null hypotheses are independent of the p-values for the false null hypotheses). Other recent work on FDR controlling procedures can be found in Genovese and Wasserman (2001), Storey (2002), and Storey and Tibshirani (2001).
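For concreteness, here is a Python sketch of the adjusted p-values in Equations (10) and (11) (our own illustration; function names and the NumPy dependency are assumptions). The two functions differ only in the multiplicity penalty applied before the successive minima are taken.

```python
import numpy as np

def bh_adjusted(p):
    """Benjamini-Hochberg step-up adjusted p-values, Equation (10); penalty m/k."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.minimum(m / np.arange(1, m + 1) * p[order], 1.0)
    adj = np.minimum.accumulate(adj[::-1])[::-1]   # successive minima over k >= j
    out = np.empty(m)
    out[order] = adj
    return out

def by_adjusted(p):
    """Benjamini-Yekutieli adjusted p-values, Equation (11); penalty (m sum 1/j)/k."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    c_m = (1.0 / np.arange(1, m + 1)).sum()        # sum_{j=1}^m 1/j, about log m
    order = np.argsort(p)
    adj = np.minimum(c_m * m / np.arange(1, m + 1) * p[order], 1.0)
    adj = np.minimum.accumulate(adj[::-1])[::-1]
    out = np.empty(m)
    out[order] = adj
    return out
```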

In the microarray setting, where thousands of tests are performed simultaneously and a fairly large number of genes are expected to be differentially expressed, FDR controlling procedures present a promising alternative to FWER approaches. In this context, one may be willing to bear a few false positives as long as their number is small in comparison to the number of rejected hypotheses. The problematic definition of 0/0 = 0 is also not as important in this case.

2.6 Resampling

In many situations, the joint (and even marginal) distribution of the test statistics is unknown. Resampling methods (e.g., bootstrap, permutation) can be used to estimate unadjusted and adjusted p-values while avoiding parametric assumptions about the joint distribution of the test statistics. Here, we consider null hypotheses H_j of no association between variable X_j and a response or covariate Y, j = 1, ..., m. In the microarray setting and for this type of null hypothesis, the joint distribution of the test statistics (T_1, ..., T_m) under the complete null hypothesis can be estimated by permuting the columns of the gene expression data matrix X (see Box 1). Permuting entire columns of this matrix creates a situation in which the response or covariate Y is independent of the gene expression measures, while attempting to preserve the correlation structure and distributional characteristics of the gene expression measures. Depending on the sample size n, it may be infeasible to consider all possible permutations; for large n, a random subset of B permutations (including the observed one) may be considered. The manner in which the responses/covariates are permuted should reflect the experimental design. For example, for a two-factor design, one should permute the levels of the factor of interest within the levels of the other factor [see Section 9.3 in Scheffé (1959) and Section 3.1.2 in the present article].

Note that while permutation is appropriate for the types of null hypotheses considered in this article, permutation procedures are not advisable for certain other types of hypotheses. For instance, consider the simple case of a binary variable Y ∈ {1, 2} and suppose that the null hypothesis H_j is that the conditional distributions of X_j given Y = 1 and of X_j given Y = 2 have equal means, but possibly different variances. A permutation null distribution enforces equal distributions in the two groups, which is clearly stronger than simply equal means. As a result, a null hypothesis H_j may be rejected for reasons other than a difference in means (e.g., a difference in a nuisance parameter). Bootstrap resampling is more appropriate for this type of hypothesis, as it preserves the covariance structure present in the original data. The reader is referred to Pollard and van der Laan (2003) for a discussion of resampling-based methods in multiple testing.

Box 1. Permutation algorithm for unadjusted p-values.

For the bth permutation, b = 1, ..., B:
1. Permute the n columns of the data matrix X.
2. Compute test statistics t_{1,b}, ..., t_{m,b} for each hypothesis (i.e., gene).

The permutation distribution of the test statistic T_j for hypothesis H_j, j = 1, ..., m, is given by the empirical distribution of t_{j,1}, ..., t_{j,B}. For two-sided alternative hypotheses, the permutation p-value for hypothesis H_j is

$$p^*_j = \frac{\sum_{b=1}^{B} I\bigl(|t_{j,b}| \ge |t_j|\bigr)}{B},$$

where I(·) is the indicator function, which equals 1 if the condition in parentheses is true and 0 otherwise.
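A direct Python transcription of Box 1 might look as follows (our own sketch; the function name, the stat_fn argument and the NumPy dependency are assumptions). Permuting the response vector y is equivalent to permuting the columns of X for statistics that depend on the data only through the pairs (x_i, y_i).

```python
import numpy as np

def permutation_unadjusted_pvalues(X, y, stat_fn, B=1000, seed=None):
    """Permutation unadjusted p-values for two-sided alternatives (Box 1).

    X: (m, n) expression matrix (genes by samples); y: (n,) responses/covariates.
    stat_fn(X, y) must return the m-vector of test statistics t_1, ..., t_m.
    """
    rng = np.random.default_rng(seed)
    abs_t_obs = np.abs(stat_fn(X, y))
    count = np.zeros_like(abs_t_obs)
    for _ in range(B):
        t_b = stat_fn(X, rng.permutation(y))   # statistics for the bth permutation
        count += (np.abs(t_b) >= abs_t_obs)
    return count / B
```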

Permutation adjusted p-values for the Bonferroni, Šidák, Holm and Hochberg procedures can be obtained by replacing p_j by p*_j (see Box 1) in Equations (1), (2), (5), (6) and (9). The permutation unadjusted p-values can also be used for the FDR controlling procedures described in Section 2.5. For the step-down maxT adjusted p-values (see Box 2) of Westfall and Young (1993), the complete null distribution of the successive maxima max_{l ∈ {r_j, ..., r_m}} |T_l| of the test statistics needs to be estimated. (The single-step case is simpler and is omitted here; in that case, one needs only the distribution of the maximum max_{1 ≤ l ≤ m} |T_l|.) The reader is referred to Ge, Dudoit and Speed (2003) for a fast permutation algorithm for estimating minP adjusted p-values.

Box 2. Permutation algorithm for step-down maxT adjusted p-values, based on Algorithms 2.8 and 4.1 in Westfall and Young (1993).

For the bth permutation, b = 1, ..., B:
1. Permute the n columns of the data matrix X.
2. Compute test statistics t_{1,b}, ..., t_{m,b} for each hypothesis (i.e., gene).
3. Next, compute successive maxima of the test statistics
   u_{m,b} = |t_{r_m,b}|,
   u_{j,b} = max(u_{j+1,b}, |t_{r_j,b}|) for j = m − 1, ..., 1,
   where the r_j are such that |t_{r_1}| ≥ |t_{r_2}| ≥ ... ≥ |t_{r_m}| for the original data.

The permutation adjusted p-values are

$$\tilde{p}^*_{r_j} = \frac{\sum_{b=1}^{B} I\bigl(u_{j,b} \ge |t_{r_j}|\bigr)}{B},$$

with the monotonicity constraints enforced by setting

$$\tilde{p}^*_{r_1} \leftarrow \tilde{p}^*_{r_1}, \qquad \tilde{p}^*_{r_j} \leftarrow \max\bigl(\tilde{p}^*_{r_j},\ \tilde{p}^*_{r_{j-1}}\bigr) \quad \text{for } j = 2, \ldots, m.$$
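The following Python sketch implements Box 2 (our own illustration, with the same assumed stat_fn interface as above). The successive maxima are computed with a reversed cumulative maximum, and the final cumulative maximum enforces the monotonicity constraints.

```python
import numpy as np

def maxT_stepdown_adjusted(X, y, stat_fn, B=1000, seed=None):
    """Permutation step-down maxT adjusted p-values (Box 2)."""
    rng = np.random.default_rng(seed)
    abs_t_obs = np.abs(stat_fn(X, y))
    m = len(abs_t_obs)
    order = np.argsort(-abs_t_obs)        # r_1, ..., r_m: genes by decreasing |t_j|
    count = np.zeros(m)
    for _ in range(B):
        t_b = np.abs(stat_fn(X, rng.permutation(y)))[order]
        u = np.maximum.accumulate(t_b[::-1])[::-1]   # u_{j,b} = max_{k >= j} |t_{r_k,b}|
        count += (u >= abs_t_obs[order])
    adj = np.maximum.accumulate(count / B)           # monotonicity: successive maxima in j
    out = np.empty(m)
    out[order] = adj
    return out
```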

2.7 Recent Proposals for DNA Microarray Experiments

Efron et al. (2000), Golub et al. (1999), and Tusher, Tibshirani and Chu (2001) have recently proposed resampling algorithms for multiple testing in DNA microarray experiments. However, these oft-cited procedures were not presented within the standard statistical framework for multiple testing. In particular, the Type I error rates considered were rather loosely defined, thus making it difficult to assess the properties of the multiple testing procedures. These recent proposals are reviewed next, within the framework introduced in Sections 2.2 and 2.3.

2.7.1 Neighborhood analysis of Golub et al. Golub et al. (1999) were interested in identifying genes that are differentially expressed in patients with two types of leukemia: acute lymphoblastic leukemia (ALL, class 1) and acute myeloid leukemia (AML, class 2). (The study is described in greater detail in Section 3.1.3.) In their so-called neighborhood analysis, the authors computed a test statistic t_j for each gene [P(g, c) in their notation],

$$t_j = \frac{\bar{x}_{1j} - \bar{x}_{2j}}{s_{1j} + s_{2j}},$$

where x̄_{kj} and s_{kj} denote, respectively, the average and standard deviation of the expression measures of gene j in the class k = 1, 2 samples. The Golub et al. statistic is based on an ad hoc definition of correlation and resembles a t-statistic with an unusual standard error calculation (note 16 in Golub et al., 1999). It is not pivotal, even in the Gaussian case or asymptotically, and a standard two-sample t-statistic should be preferred. (Note that this definition of pivotality is different from subset pivotality in Section 2.4: here, the statistic is said to be pivotal if its null distribution does not depend on parameters of the distribution which generated the data.) Statistics such as the Golub et al. statistic above have been used in meta-analysis to measure effect size (National Reading Panel, 1999).

Golub et al. used the term neighborhood to refer to sets of genes with test statistics T_j greater in absolute value than a given critical value c > 0, that is, sets of rejected hypotheses {j : T_j ≥ c} or {j : T_j ≤ −c} [these sets are denoted by N_1(c, r) and N_2(c, r), respectively, in note 16 of Golub et al., 1999]. The ALL/AML labels were permuted B = 400 times to estimate the complete null distribution of the numbers R(c) = V(c) = Σ_{j=1}^m I(T_j ≥ c) of false positives for different critical values c (similarly for the other tail, with T_j ≤ −c). Figure 2 in Golub et al. (1999) contains plots


of the observed R(c) = r(c) and permutation quantiles of R(c) against critical values c for one-sided tests. [We are aware that our notation can lead to confusion when compared with that of Golub et al. We chose to follow the notation of Sections 2.2 and 2.3 to allow easy comparison with other multiple testing procedures described in the present article. For the Golub et al. method, note that we use T_j to denote P(g, c), c to denote r and r(c) to denote a realization of R(c), that is, |N_1(c, r)| + |N_2(c, r)|.] A critical value c is then chosen so that the chance of exceeding the observed r(c) under the complete null is equal to a prespecified level α, that is, G(c) = Pr(R(c) ≥ r(c) | H_0^C) = α. Golub et al. provided no further guidelines for selecting the critical value c or discussion of the Type I error control of their procedure. Like some PFER, PCER or FWER controlling procedures, the neighborhood analysis considers the complete null distribution of the number of Type I errors V(c) = R(c). However, instead of controlling E(V(c)), E(V(c))/m or Pr(V(c) ≥ 1), it seeks to control a different quantity, G(c) = Pr(R(c) ≥ r(c) | H_0^C). G(c) can be thought of as a p-value under H_0^C for the number of rejected hypotheses R(c) and is thus a random variable. Dudoit, Shaffer and Boldrick (2002) show that, conditional on the observed ordered absolute test statistics, |t|_{(1)} ≥ ... ≥ |t|_{(m)}, the function G(c) is left-continuous with discontinuities at |t|_{(j)}, j = 1, ..., m. Although G(c) is decreasing in c within intervals (|t|_{(j+1)}, |t|_{(j)}], it is not, in general, decreasing overall and there may be several values of c with G(c) = α. Hence, one must decide on an appropriate procedure for selecting the critical value c. Two natural choices are given by the step-down and step-up procedures described in Dudoit, Shaffer and Boldrick (2002). It turns out that neither version provides strong control of any Type I error rate. The step-down version does, however, control the FWER weakly. Finally, note that the number of permutations B = 400 used in Golub et al. (1999) is probably not large enough for reporting 99th quantiles in Figure 2. A better plot for Figure 2 of Golub et al. might be of the error rate G(c) = Pr(R(c) ≥ r(c) | H_0^C) versus the critical values c, because this does not require a prespecified level α. A more detailed discussion of the statistical properties of neighborhood analysis and related figures can be found in Dudoit, Shaffer and Boldrick (2002).

2.7.2 Significance analysis of microarrays. We consider the significance analysis of microarrays (SAM) multiple testing procedure described in Tusher, Tibshirani and Chu (2001) and Chu et al. (2000). An earlier version of SAM (Efron et al., 2000) is discussed in detail in the technical report by Dudoit, Shaffer and Boldrick (2002). Note that the SAM articles also address the question of choosing appropriate test statistics for different types of responses and covariates. Here, we focus only on the proposed methods for dealing with the multiple testing problem and assume that a suitable test statistic is computed for each gene.

SAM procedure from Tusher, Tibshirani and Chu (2001).

1. Compute a test statistic t_j for each gene j and define order statistics t_{(j)} such that t_{(1)} ≥ t_{(2)} ≥ ... ≥ t_{(m)}. [The notation for the ordered test statistics is different here than in Tusher, Tibshirani and Chu (2001), to be consistent with previous notation, whereby we set t_{(1)} ≥ t_{(2)} ≥ ... ≥ t_{(m)} and p_{(1)} ≤ p_{(2)} ≤ ... ≤ p_{(m)}.]

2. Perform B permutations of the responses/covariates y_1, ..., y_n. For each permutation b compute the test statistics t_{j,b} and the corresponding order statistics t_{(1),b} ≥ t_{(2),b} ≥ ... ≥ t_{(m),b}. Note that t_{(j),b} may correspond to a different gene than the observed t_{(j)}.

3. From the B permutations, estimate the expected value (under the complete null) of the order statistics by t̄_{(j)} = (1/B) Σ_b t_{(j),b}.

4. Form a quantile-quantile (QQ) plot (a so-called SAM plot) of the observed t_{(j)} versus the expected t̄_{(j)}.

5. For a fixed threshold Δ, let j_0 = max{j : t̄_{(j)} ≥ 0}, j_1 = max{j ≤ j_0 : t_{(j)} − t̄_{(j)} ≥ Δ} and j_2 = min{j > j_0 : t_{(j)} − t̄_{(j)} ≤ −Δ}. [This is our interpretation of the description in the SAM manual (Chu et al., 2000): "For a fixed threshold Δ, starting at the origin, and moving up to the right, find the first i = i_1 such that d_{(i)} − d̄_{(i)} ≥ Δ." That is, we take the origin to be given by the index j_0.] All genes with j ≤ j_1 are called significant positive and all genes with j ≥ j_2 are called significant negative. Define the upper cut point, cutup(Δ) = min{t_{(j)} : j ≤ j_1} = t_{(j_1)}, and the lower cut point, cutlow(Δ) = max{t_{(j)} : j ≥ j_2} = t_{(j_2)}. If no such j_1 (j_2) exists, set cutup(Δ) = ∞ (cutlow(Δ) = −∞).

6. For a given threshold Δ, the expected number of false positives, PFER, is estimated by computing, for each of the B permutations, the number of genes with t_{j,b} above cutup(Δ) or below cutlow(Δ), and averaging this number over permutations.

7. A threshold Δ is chosen to control the expected number of false positives, PFER, under the complete null, at an acceptable nominal level.

(See the Note Added in Proof on page 101.)

Thus, the SAM procedure in Tusher, Tibshirani and Chu (2001) uses ordered test statistics from the original data only for the purpose of obtaining global cutoffs for the test statistics. In the permutations, the cutoffs are kept fixed and the PFER is estimated by counting the number of genes with test statistics above/below these global cutoffs. Note that the cutoffs are actually random variables, because they depend on the observed test statistics.
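Our reading of steps 1-6 can be summarized in the short Python sketch below (an illustration under our interpretation, not the authors' code; all names are assumptions). For convenience it sorts the statistics in increasing order, as in Tusher, Tibshirani and Chu (2001), rather than in the decreasing convention used above; the procedure is the same.

```python
import numpy as np

def sam_cutoffs(t_obs, t_perm, delta):
    """SAM-style cutoffs and PFER estimate for a fixed threshold delta.

    t_obs: (m,) observed statistics; t_perm: (B, m) permutation statistics.
    """
    m = len(t_obs)
    t_sorted = np.sort(t_obs)                            # t_(1) <= ... <= t_(m)
    t_expected = np.sort(t_perm, axis=1).mean(axis=0)    # expected order statistics (step 3)
    diff = t_sorted - t_expected
    below = np.where(t_expected <= 0)[0]
    j0 = below[-1] if below.size else -1                 # "origin" of the SAM plot (step 5)
    pos = np.where((np.arange(m) > j0) & (diff >= delta))[0]    # significant positive side
    neg = np.where((np.arange(m) <= j0) & (diff <= -delta))[0]  # significant negative side
    cut_up = t_sorted[pos[0]] if pos.size else np.inf
    cut_low = t_sorted[neg[-1]] if neg.size else -np.inf
    # Step 6: average, over permutations, of the number of statistics beyond the cutoffs.
    pfer_hat = np.mean(np.sum((t_perm >= cut_up) | (t_perm <= cut_low), axis=1))
    return cut_up, cut_low, pfer_hat
```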

The reader is referred to Dudoit, Shaffer and Boldrick (2002) for a more detailed discussion and comparison of the statistical properties of the SAM procedures in Efron et al. (2000) and Tusher, Tibshirani and Chu (2001), including a derivation of the corresponding adjusted p-values and an extension which accounts for differences in variances among the order statistics. There, it is shown that both SAM procedures aim to control the PFER (or PCER), but the Efron et al. (2000) procedure controls this error rate only in the weak sense. The only difference between the Tusher, Tibshirani and Chu (2001) version of SAM and standard procedures which reject the null H_j for |t_j| ≥ c is in the use of asymmetric critical values chosen from the quantile-quantile plot (a discussion of asymmetric critical values is found in Braver, 1975). Otherwise, SAM does not provide any new definition of Type I error rate nor any new procedure for controlling this error rate. In summary, the SAM procedure in Efron et al. (2000) amounts to rejecting H_{(j)} whenever |t_{(j)} − t̄_{(j)}| ≥ Δ, where Δ is chosen to control the PFER weakly at a given level. By contrast, the SAM procedure in Tusher, Tibshirani and Chu (2001) rejects H_j whenever t_j ≥ cutup(Δ) or t_j ≤ cutlow(Δ), where cutlow(Δ) and cutup(Δ) are chosen from the permutation quantile-quantile plot and such that the PFER is controlled strongly at a given level.

SAM control of the FDR. The term "false discovery rate" is misleading, as the definition in SAM is different than the standard definition of Benjamini and Hochberg (1995): the SAM FDR is estimating E(V | H_0^C)/R and not E(V/R) as in Benjamini and Hochberg. Furthermore, the FDR in SAM can be greater than 1 (cf. Table 3 in Chu et al., 2000, page 16). The issue of strong versus weak control is only mentioned briefly in Tusher, Tibshirani and Chu (2001) and the authors claim that SAM provides a reasonably accurate estimate for the true FDR.

2.8 Reporting the Results of Multiple Testing Procedures

We have described a number of multiple testing procedures for controlling different Type I error rates, including the FWER and the FDR. Table 2 summarizes these methods in terms of their main properties: definition of the Type I error rate, type of control of the error rate (strong versus weak), stepwise nature of the procedure and distributional assumptions.

For each procedure, adjusted p-values were derived as convenient and flexible summaries of the strength of the evidence against each null hypothesis. The following types of plots of adjusted p-values are particularly useful in summarizing the results of different multiple testing procedures applied to a large number of genes. The plots allow biologists to examine various false positive rates (FWER, FDR or PCER) associated with different gene lists. They do not require researchers to preselect a particular definition of Type I error rate or α-level, but rather provide them with tools to decide on an appropriate combination of number of genes and tolerable false positive rate for a particular experiment and available resources.

Plot of ordered adjusted p-values (p̃_(j) versus j). For a given number of genes r, say, this representation provides the nominal Type I error rate for a given procedure when the r genes with the smallest adjusted p-values for that procedure are declared to be differentially expressed [see panels (a) and (b) in Figures 3–5]. Therefore, rather than choosing a specific type of error control and α-level, biologists might first select a number r of genes which they feel comfortable following up. The nominal false positive rates (or adjusted p-values, p̃_(r)) corresponding to this number under various types of error control and procedures can then be read from the plot. For instance, for r = 10 genes, the nominal FWER from Holm's step-down procedure might be 0.1 and the nominal FDR from the Benjamini and Hochberg (1995) step-up procedure might be 0.07.
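As a concrete illustration (a sketch with made-up numbers, not data from the article), the quantity read off such a plot at position r is simply the r-th smallest adjusted p-value for the chosen procedure:

```python
import numpy as np

def nominal_rate_for_top_r(adjusted_p, r):
    """Nominal Type I error rate incurred when the r genes with the
    smallest adjusted p-values are declared differentially expressed."""
    return np.sort(adjusted_p)[r - 1]

# Hypothetical adjusted p-values for m = 500 genes under some procedure.
adjusted_p = np.sort(np.random.default_rng(0).uniform(size=500)) ** 0.5
print(nominal_rate_for_top_r(adjusted_p, r=10))  # value read off the plot at j = 10
```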

Plot of number of genes declared to be differentially expressed versus nominal Type I error rate (r versus α). This type of plot is the transpose of the previous plot and can be used as follows. For a given nominal level α, find the number r of genes that would be declared to be differentially expressed under one procedure, and read the level required to achieve that number under other methods. Alternatively, find the number of genes that would be identified using a procedure controlling the FWER at a fixed nominal level α, and then identify how many others would be identified using procedures controlling the FDR and PCER at that level.


TABLE 2
Properties of multiple testing procedures

Procedure                                  Type I error rate          Strong or weak control   Stepwise structure   Dependence structure
Bonferroni                                 FWER                       Strong                   Single               General/ignore
Šidák                                      FWER                       Strong                   Single               Positive orthant dependence
minP                                       FWER                       Strong                   Single               Subset pivotality
maxT                                       FWER                       Strong                   Single               Subset pivotality
Holm (1979)                                FWER                       Strong                   Down                 General/ignore
Step-down Šidák                            FWER                       Strong                   Down                 Positive orthant dependence
Step-down minP                             FWER                       Strong                   Down                 Subset pivotality
Step-down maxT                             FWER                       Strong                   Down                 Subset pivotality
Hochberg (1988)                            FWER                       Strong                   Up                   Some dependence (Simes)
Troendle (1996)                            FWER                       Strong                   Up                   Some dependence
Benjamini and Hochberg (1995)              FDR                        Strong                   Up                   Positive regression dependence
Benjamini and Yekutieli (2001)             FDR                        Strong                   Up                   General/ignore
Yekutieli and Benjamini (1999)             FDR                        Strong                   Up                   Some dependence
Unadjusted p-values                        PCER                       Strong                   Single               General/ignore
SAM, Tusher, Tibshirani and Chu (2001)     PFER (PCER)                Strong                   Single               General/hybrid
SAM, Efron et al. (2000)                   PFER (PCER)                Weak                     Single               General
Golub et al. (1999), step-down             Pr(R ≥ r | H_0^C) (FWER)   Weak                     Down                 General
Golub et al. (1999), step-up               Pr(R ≥ r | H_0^C)          Weak                     Up                   General

Notes. By "General/ignore," we mean that a procedure controls the claimed Type I error rate for general dependence structures, but does not explicitly take into account the joint distribution of the test statistics. For the Tusher, Tibshirani and Chu (2001) SAM version, "General/hybrid" refers to the fact that only the marginal distribution of the test statistics is considered when computing the PFER; the test statistics are considered jointly only to determine the cutoffs cutup(Δ) and cutlow(Δ) from the quantile–quantile plot.

The multiple testing procedures considered in this article can be divided into the following two broad categories: those for which adjusted p-values are monotone in the test statistics, t_j, and those for which adjusted p-values are monotone in the unadjusted p-values, p_j. In general, the ordering of genes based on test statistics t_j will differ from that based on unadjusted p-values p_j, as the test statistics of different genes may have different distributions. Within each of these two classes, procedures will, however, produce the same orderings of genes, whether they are designed to control the FWER, FDR or PCER. They will differ only in the cutoffs for significance. That is, for a given nominal level α, an FWER controlling procedure such as Bonferroni's might identify only the first 20 genes with the smallest unadjusted p-values, while an FDR controlling procedure such as Benjamini and Hochberg's (1995) might retain an additional 15 genes with the next 15 smallest unadjusted p-values.
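To make this last point concrete, the sketch below applies the standard Bonferroni and Benjamini–Hochberg (1995) adjustments to the same vector of unadjusted p-values and counts the rejections at a fixed level; the simulated p-values and the implementation are ours, for illustration only.

```python
import numpy as np

def bonferroni_adjust(p):
    # Bonferroni single-step adjustment: p_tilde_j = min(m * p_j, 1).
    return np.minimum(len(p) * p, 1.0)

def bh_adjust(p):
    # Benjamini-Hochberg step-up adjustment:
    # p_tilde_(i) = min_{k >= i} min(m * p_(k) / k, 1), computed on sorted p-values.
    m = len(p)
    order = np.argsort(p)
    adj = p[order] * m / np.arange(1, m + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]   # enforce monotonicity from the top
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out

rng = np.random.default_rng(1)
# Illustrative mixture: 50 "differentially expressed" genes, 450 nulls.
p = np.concatenate([rng.beta(0.05, 1, size=50), rng.uniform(size=450)])

alpha = 0.05
print("Bonferroni (FWER) rejections:", np.sum(bonferroni_adjust(p) <= alpha))
print("Benjamini-Hochberg (FDR) rejections:", np.sum(bh_adjust(p) <= alpha))
```

Both procedures rank the genes identically (by unadjusted p-value); the FDR controlling procedure simply uses a less stringent cutoff and therefore retains more genes at the same nominal level.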

    3. DATA

    3.1 Microarray Data

3.1.1 Apolipoprotein AI experiment of Callow et al. The apolipoprotein AI (apo AI) experiment was carried out as part of a study of lipid metabolism and atherosclerosis susceptibility in mice (Callow et al., 2000). Apolipoprotein AI is a gene known to play a pivotal role in high density lipoprotein (HDL) metabolism, and mice with the apo AI gene knocked out have very low HDL cholesterol levels. The goal of the experiment was to identify genes with altered expression in the livers of apo AI knock-out mice compared to inbred control mice. The treatment group consisted of eight inbred C57Bl/6 mice with the apo AI gene knocked out and the control group consisted of eight control C57Bl/6 mice. For each of these 16 mice, target cDNA was obtained from mRNA by reverse transcription and labeled using a red-fluorescent dye, Cy5. The reference sample used in all hybridizations was prepared by pooling cDNA from the eight control mice and was labeled with a green-fluorescent dye, Cy3. Target cDNA was hybridized to microarrays containing 6,356 cDNA probes, including 257 related to lipid metabolism.


FIG. 3. Apo AI experiment. (a) and (b) Plot of sorted permutation adjusted p-values, p̃_(j) versus j. Panel (b) is an enlargement of panel (a) for the 50 genes with the largest absolute t-statistics |t_j|. Adjusted p-values for procedures controlling the FWER, FDR and PCER are plotted using solid, dashed and dotted lines, respectively. (c) Plot of adjusted p-values, log_10 p̃_j versus t-statistics t_j. (d) Quantile–quantile plot of t-statistics; the dotted line is the identity line and the dashed line passes through the first and third quartiles. Adjusted p-values were estimated based on all B_perm = (16 choose 8) = 12,870 permutations of the treatment/control labels, except for the SAM procedure for which B_sam = 1,000 random permutations were used. Note that the results for the Bonferroni, Holm and Hochberg procedures are virtually identical; similarly for the unadjusted p-value (raw p) and SAM Tusher procedures.


FIG. 4. Bacteria experiment. (a) and (b) Plot of sorted permutation adjusted p-values, p̃_(j) versus j. Panel (b) is an enlargement of panel (a) for the 100 genes with the largest absolute t-statistics |t_j|. Adjusted p-values for procedures controlling the FWER, FDR and PCER are plotted using solid, dashed and dotted lines, respectively. (c) Plot of adjusted p-values, log_10 p̃_j versus t-statistics t_j. (d) Quantile–quantile plot of t-statistics; the dotted line is the identity line and the dashed line passes through the first and third quartiles. Adjusted p-values were estimated based on all B_perm = 2^22 permutations of the Gram-positive/Gram-negative labels within the 22 dose × time blocks, except for the SAM procedure for which B_sam = 1,000 random permutations were used. Note that the results for the Bonferroni, Holm and Hochberg procedures are virtually identical; similarly for the unadjusted p-value (raw p) and SAM Tusher procedures.


FIG. 5. Leukemia study. (a) and (b) Plot of sorted permutation adjusted p-values, p̃_(j) versus j. Panel (b) is an enlargement of panel (a) for the 100 genes with the largest absolute t-statistics |t_j|. Adjusted p-values for procedures controlling the FWER, FDR and PCER are plotted using solid, dashed and dotted lines, respectively. (c) Plot of adjusted p-values, log_10 p̃_j versus t-statistics t_j. (d) Quantile–quantile plot of t-statistics; the dotted line is the identity line and the dashed line passes through the first and third quartiles. Adjusted p-values were estimated based on B_perm = 500,000 random permutations of the ALL/AML labels, except for the SAM procedure for which B_sam = 1,000 random permutations were used. Note that the results for the Bonferroni, Holm and Hochberg procedures are virtually identical; similarly for the unadjusted p-value (raw p) and SAM Tusher procedures.


Each of the 16 hybridizations produced a pair of 16-bit images, which were processed using the software package Spot (Buckley, 2000). The resulting fluorescence intensities were normalized as described in Dudoit et al. (2002). Let x_ji denote the base 2 logarithm of the Cy5/Cy3 fluorescence intensity ratio for probe j in array i, i = 1, ..., 16, j = 1, ..., 6,356. Then x_ji represents the expression response of gene j in either a control (i = 1, ..., 8) or a treatment (i = 9, ..., 16) mouse.

Differentially expressed genes were identified by computing two-sample Welch t-statistics for each gene j,

    t_j = (x̄_2j − x̄_1j) / √(s²_1j/n_1 + s²_2j/n_2),

where x̄_1j and x̄_2j denote the average expression measure of gene j in the n_1 = 8 control and n_2 = 8 treatment hybridizations, respectively. Similarly, s²_1j and s²_2j denote the variances of gene j's expression measure in the control and treatment hybridizations, respectively. Large absolute t-statistics suggest that the corresponding genes have different expression levels in the control and treatment groups. To assess the statistical significance of the results, we considered the multiple testing procedures of Section 2 and estimated unadjusted and adjusted p-values based on all possible (16 choose 8) = 12,870 permutations of the treatment/control labels.
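A minimal sketch of this computation (our own illustration, not the authors' code) is given below; it assumes the 16 columns of the log-ratio matrix are ordered as eight controls followed by eight treatments, and it enumerates all (16 choose 8) relabelings, which is feasible but slow in this naive form. Adjusted p-values would then be obtained by applying the procedures of Section 2 to these statistics.

```python
import numpy as np
from itertools import combinations

def welch_t(x, treat_idx):
    """Two-sample Welch t-statistics for every gene (row) of x,
    comparing columns in treat_idx against the remaining columns."""
    mask = np.zeros(x.shape[1], dtype=bool)
    mask[list(treat_idx)] = True
    x1, x2 = x[:, ~mask], x[:, mask]                       # control, treatment
    n1, n2 = x1.shape[1], x2.shape[1]
    num = x2.mean(axis=1) - x1.mean(axis=1)
    den = np.sqrt(x1.var(axis=1, ddof=1) / n1 + x2.var(axis=1, ddof=1) / n2)
    return num / den

# x: genes-by-arrays matrix of log2 ratios (random numbers as a stand-in here).
x = np.random.default_rng(0).normal(size=(6356, 16))
observed = welch_t(x, treat_idx=range(8, 16))

# Enumerate all C(16, 8) = 12,870 relabelings (exhaustive; for illustration only).
perms = np.array([welch_t(x, idx) for idx in combinations(range(16), 8)])

# Unadjusted permutation p-values based on absolute t-statistics.
raw_p = (np.abs(perms) >= np.abs(observed)).mean(axis=0)
```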

3.1.2 Bacteria experiment of Boldrick et al. Boldrick et al. (2002) performed an in vitro study of the gene expression response of human peripheral blood mononuclear cells (PBMCs) to treatment with pathogenic bacterial components. The ability of an organism to combat microbial infection is crucial to survival. Humans, along with other higher organisms, possess two parallel, yet interacting, systems of defense against microbial invasion, referred to as the innate and adaptive immune systems. It has recently been discovered that cells of the innate immune system possess receptors which enable them to differentiate between the cellular components of different pathogens, including Gram-positive and Gram-negative bacteria (which differ in their cell wall structure), fungi and viruses. Boldrick et al. sought to determine if, given the presence of specific receptors, cells of the innate immune system would have differing genomic responses to diverse pathogenic components. One important question that was addressed involved the response of these innate immune cells (PBMCs) to different doses of bacterial components. Although one can experimentally treat cells with the same amount of two types of bacteria, because the bacteria may differ in size or composition one cannot be sure that this nominally equivalent treatment is truly equivalent, in the sense that the strength of the stimulation is equivalent. To make the statement that the response of PBMCs to a certain bacterium is truly specific to that bacterium, one must therefore perform a dose–response analysis to ensure that one is not simply sampling from two different points on the same dose–response curve.

Boldrick et al. performed a set of experiments (the dose–response data set) that monitored the effect of three factors on the expression response of PBMCs: bacteria type, dose of the bacterial component and time after treatment. Two types of bacteria were considered: the Gram-negative B. pertussis and the Gram-positive S. aureus. Four doses of the pathogenic component were administered based on a standard dose: 1X, 10X, 100X, 1000X, where X represents the number of bacterial particles per human cell (X = 0.002 for the Gram-positive and X = 0.004 for the Gram-negative). The gene expression response was measured at five time points after treatment: 0.5, 2, 4, 6 and 12 hours (extra time points at 1 and 24 hours were recorded for dose 100X). A total of 44 hybridizations (2 × 4 × 5, plus the 1 and 24 hour measurements for dose 100X) were performed using the Lymphochip, a specialized microarray comprising 18,432 elements enriched in genes that are preferentially expressed in immune cells or which are of known immunologic importance. In each hybridization, fluorescent cDNA targets were prepared from PBMC mRNA (red-fluorescent dye, Cy5) and a reference sample derived from a pool of mRNA from six immune cell lines (green-fluorescent dye, Cy3). The microarray scanned images were analyzed using the GenePix package and the resulting intensities were preprocessed as described in Boldrick et al. (2002). For each microarray i, i = 1, ..., 44, the base 2 logarithm of the Cy5/Cy3 fluorescence intensity ratio for gene j represents the expression response x_ji of that gene in PBMCs infected by the Gram-positive or Gram-negative bacteria for one of the 22 dose × time combinations (4 doses × 5 time points plus 2 extra time points for dose 100X). The analysis below is based on a subset of 2,562 genes that were well measured in both the dose–response and the diversity data sets (see Boldrick et al., 2002, for details on the preselection of the genes).

One of the goals of this experiment was to identify genes that have a different expression response to treatment by Gram-positive and Gram-negative bacterial components.


As there are clearly dose and time effects on the expression response, the null hypothesis of no bacteria effect was tested for each gene based on a paired t-statistic. For any given gene, let d_h denote the difference in the expression response to infection by the Gram-negative and Gram-positive bacteria for the h-th dose × time block, h = 1, ..., 22. The paired t-statistic is defined as t = d̄ / √(s²_d/n_d), where d̄ is the average of the n_d = 22 differences d_h and s²_d is the variance of these 22 differences. To assess the statistical significance of the results, we considered the multiple testing procedures of Section 2 and estimated unadjusted and adjusted p-values based on all possible 2^22 = 4,194,304 permutations of the expression profiles within the 22 dose × time blocks (i.e., all permutations of the Gram-positive and Gram-negative labels within dose × time blocks).
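Because permuting the two labels within a block simply flips the sign of that block's difference d_h, the 2^22 within-block permutations correspond to the 2^22 sign patterns. The sketch below (our own, with random stand-in values for a single gene) uses this fact; for speed it samples sign patterns rather than enumerating all of them as in the analysis above.

```python
import numpy as np

def paired_t(d):
    """Paired t-statistic t = dbar / sqrt(s_d^2 / n_d) for block differences d."""
    n = len(d)
    return d.mean() / np.sqrt(d.var(ddof=1) / n)

rng = np.random.default_rng(0)
# d_h: Gram-negative minus Gram-positive expression difference in each of the
# 22 dose x time blocks for one gene (random stand-in values).
d = rng.normal(size=22)
observed = paired_t(d)

# The article enumerates all 2^22 sign patterns; here we sample B of them.
B = 10000
signs = rng.choice([-1.0, 1.0], size=(B, d.size))
sd = signs * d
perm_stats = sd.mean(axis=1) / np.sqrt(sd.var(axis=1, ddof=1) / d.size)

# Unadjusted two-sided permutation p-value for this gene.
raw_p = np.mean(np.abs(perm_stats) >= np.abs(observed))
```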

3.1.3 Leukemia study of Golub et al. Correct diagnosis of neoplasia is necessary for proper treatment. The traditional means of identification and classification of malignancies has been based upon histology and immunohistologic staining of pathologic specimens. However, it is apparent, based upon the variability of response to treatment and length of survival after therapy, that there is variability within the current system of classification. Genomic technologies may provide the means by which neoplasms can be more accurately characterized and classified, thus leading to more effective diagnosis and treatment.

As a demonstration of such capabilities, Golub et al. (1999) studied two hematologic malignancies: acute lymphoblastic leukemia (ALL, class 1) and acute myeloid leukemia (AML, class 2), which are readily classifiable by traditional pathologic methods. They sought to show that these two malignant entities could be identified and distinguished based on microarray gene expression measures alone. Therefore, one of the goals of the statistical analysis was to identify genes that differed most significantly between the two diseases. Gene expression levels were measured using Affymetrix high-density oligonucleotide chips containing p = 6,817 human genes. The learning set comprises n = 38 samples, 27 ALL cases and 11 AML cases (data available at www.genome.wi.mit.edu/MPR). Following Golub et al. (Pablo Tamayo, personal communication), three preprocessing steps were applied to the normalized matrix of intensity values available on the website: (1) thresholding: floor of 100 and ceiling of 16,000; (2) filtering: exclusion of genes with max/min ≤ 5 or (max − min) ≤ 500, where max and min refer, respectively, to the maximum and minimum intensities for a particular gene across mRNA samples; (3) base 10 logarithmic transformation. Box plots of the expression measures for each of the 38 samples revealed the need to standardize the expression measures within arrays before combining data across samples. The data were then summarized by a 3,051 × 38 matrix X = (x_ji), where x_ji denotes the expression measure for gene j in mRNA sample i.
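A small sketch of these three preprocessing steps, with our own function and variable names (the thresholds are those quoted above; the intensity matrix is a random stand-in):

```python
import numpy as np

def preprocess_golub(intensity, floor=100.0, ceiling=16000.0,
                     fold=5.0, spread=500.0):
    """Thresholding, filtering and log-transformation of a genes-by-samples
    intensity matrix, following the three steps described above."""
    # (1) Thresholding: floor of 100 and ceiling of 16,000.
    x = np.clip(intensity, floor, ceiling)

    # (2) Filtering: exclude genes with max/min <= 5 or (max - min) <= 500.
    gene_max, gene_min = x.max(axis=1), x.min(axis=1)
    keep = (gene_max / gene_min > fold) & (gene_max - gene_min > spread)

    # (3) Base 10 logarithmic transformation.
    return np.log10(x[keep]), keep

# Example with a random stand-in intensity matrix (6,817 genes x 38 samples).
intensity = np.random.default_rng(0).uniform(20, 20000, size=(6817, 38))
x, keep = preprocess_golub(intensity)
print(x.shape)   # filtered genes x 38 samples
```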

Differentially expressed genes in ALL and AML patients were identified by computing two-sample Welch t-statistics for each gene j as in Section 3.1.1. To assess the statistical significance of the results, we considered the multiple testing procedures of Section 2 and estimated unadjusted and adjusted p-values based on 500,000 random permutations of the ALL/AML labels.

3.2 Simulated Data

While it is informative to examine the behavior of different multiple testing procedures on the microarray data sets described above, the genes cannot be unambiguously classified as differentially expressed or not. As a result, there is no gold standard for assessing Type I and Type II errors. Simulation studies are thus needed to evaluate the Type I error rate and power properties of each of the procedures. Artificial gene expression profiles x and binary responses y were generated as in Box 3 for m = 500 genes.

The 17 multiple testing procedures described in Table 3 were applied to each of the simulated data sets. Unadjusted p-values for each of the genes were computed in two ways: by permutation of the n = n_1 + n_2 responses and from the t-distribution with n_1 + n_2 − 2 degrees of freedom. Table 4 lists the different parameters used in the simulation.

    4. RESULTS

    4.1 Microarray Data

The multiple testing procedures of Section 2 were applied to the three microarray data sets described in Section 3.1, using permutation to estimate unadjusted and adjusted p-values. For a given procedure, genes with permutation adjusted p-values p̃_j ≤ α were declared to be differentially expressed at nominal level α for the Type I error rate controlled by the procedure under consideration. For each data set, ordered adjusted p-values were plotted for each procedure in panels (a) and (b) of Figures 3–5 (see Section 2.8 for guidelines on interpreting the figures).


Box 3. Type I error rate and power calculations for simulated data.

1. For the i-th response group, i = 1, 2, generate n_i independent m-vectors, or artificial gene expression profiles, x from the Gaussian distribution with mean μ_i and covariance matrix Σ. The m_0 genes for which μ_1 = μ_2 are not differentially expressed and correspond to the true null hypotheses. Model parameters used in the simulation are listed in Table 4.

2. For each of the m genes, compute a two-sample t-statistic (with equal variances in the two response groups) comparing the gene expression measures in the two response groups. Apply the multiple testing procedures of Section 2 to determine which genes are differentially expressed for prespecified Type I error rates α. A summary of the multiple testing procedures applied in the simulation study is given in Table 3.

3. For each procedure, record the number R_b of genes declared to be differentially expressed, the numbers V_b and T_b of Type I and Type II errors, respectively, and the false discovery rate Q_b, where Q_b = V_b/R_b if R_b > 0 and Q_b = 0 if R_b = 0.

Repeat Steps 1–3 B times and estimate the Type I error rates and average power for each of the procedures as follows:

    PCER = Σ_b (V_b/m) / B,
    FWER = Σ_b I(V_b ≥ 1) / B,
    FDR = Σ_b Q_b / B,
    Average power = 1 − Σ_b [T_b/(m − m_0)] / B.
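The following sketch (our own; the parameter values and the diagonal-covariance setup are illustrative and are not those of Table 4) carries out this loop for a single procedure, Bonferroni at level α:

```python
import numpy as np
from scipy import stats

def simulate_error_rates(m=500, m0=450, n1=10, n2=10, effect=1.0,
                         alpha=0.05, B=100, seed=0):
    """Estimate PCER, FWER, FDR and average power for the Bonferroni
    procedure by repeating the Box 3 simulation B times."""
    rng = np.random.default_rng(seed)
    V = np.zeros(B); T = np.zeros(B); Q = np.zeros(B)
    mu2 = np.zeros(m); mu2[m0:] = effect            # last m - m0 genes are non-null

    for b in range(B):
        x1 = rng.normal(0.0, 1.0, size=(n1, m))     # group 1 profiles
        x2 = rng.normal(mu2, 1.0, size=(n2, m))     # group 2 profiles
        t, p = stats.ttest_ind(x2, x1, axis=0)      # equal-variance t-test
        reject = np.minimum(m * p, 1.0) <= alpha    # Bonferroni adjustment

        V[b] = reject[:m0].sum()                    # false positives (Type I)
        T[b] = (~reject[m0:]).sum()                 # false negatives (Type II)
        R = reject.sum()
        Q[b] = V[b] / R if R > 0 else 0.0

    return {"PCER": V.mean() / m,
            "FWER": np.mean(V >= 1),
            "FDR": Q.mean(),
            "average power": 1.0 - T.mean() / (m - m0)}

print(simulate_error_rates())
```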

Different line types are used for different Type I error rate definitions: solid, dashed and dotted lines are used for FWER, FDR and PCER controlling procedures, respectively. Panels (c) of Figures 3–5 display plots of adjusted p-values (on a log scale) versus t-statistics. Finally, panels (d) of Figures 3–5 display permutation quantile–quantile plots of the t-statistics. Results for the Golub et al. (1999) neighborhood analysis were not plotted in these figures, because it led to rejection of virtually all hypotheses for two-sided alternatives (i.e., for tests based on absolute t-statistics). As expected, for a given nominal α-level, the number of genes declared to be differentially expressed was the greatest for procedures controlling the PCER and the smallest for procedures that control the FWER. Indeed, adjusted