Mathematical Institute
Master Thesis
Statistical Science for the Life and Behavioural Sciences
The Power of the Benjamini-Hochberg Procedure
Author: Wouter van Loon
First Supervisor: Prof. dr. J.J. Goeman
Leiden University Medical Center
Second Supervisor: Dr. M. van Iterson
Leiden University Medical Center
Third Supervisor: Dr. M. Fiocco
Leiden Mathematical Institute
May 2017
Abstract
Background: The Benjamini-Hochberg (BH) procedure is a popular method
for controlling the False Discovery Rate (FDR) in multiple testing experiments.
Currently available software for sample size calculation in the FDR context is based
on asymptotic behavior of the BH procedure, assuming independent test statistics
and possibly a common effect size. In this study, we investigate how this asymptotic
behavior, in terms of the proportion of rejected hypotheses and average power,
relates to performance of the BH procedure when these assumptions are not met.
Furthermore, we decompose the asymptotic expression for the average power, and
propose a number of alternative choices for its components, including the possibility
of controlling the False Discovery Proportion (FDP) exceedance probability rather
than the FDR.
Results: We performed a number of simulation experiments to assess the effects
of the number of tested hypotheses, sample size, basic dependence, and presence of
an effect size distribution, on the performance of the BH procedure. Our results show
that testing fewer hypotheses is associated with an increase in the variance of the
power distribution. A conservative power estimator based on the law of the iterated
logarithm can ensure a high probability of the power exceeding the desired level,
and automatically scales with the number of tested hypotheses. We find this bound
remains effective even under low to moderate equicorrelation (ρ ≤ 0.5), or if only
the true null hypotheses are correlated. In the presence of an effect size distribution,
sample sizes calculated assuming a common effect size are under-estimated for higher
power levels, and over-estimated for lower power levels.
Conclusions: Sample size calculations based on the asymptotic average power
are quite robust to violations of the assumptions of infinite and independent tests.
Related, but more conservative, estimators allow a researcher to ensure a high prob-
ability of exceeding a desired power level and/or make confidence statements about
the FDP in the rejection set. Basing sample size calculations on the assumption
of a common effect size should be carefully considered, however, since depending
on the shape of the effect size distribution and desired power level, the resulting
sample size estimates may not lead to adequate power. All estimators described in
this paper have been incorporated into a Shiny application, which is freely available
at: https://wsvanloon.shinyapps.io/bhpower.
Contents
1 Introduction
1.1 Multiple Testing and the False Discovery Rate
1.2 The Benjamini-Hochberg Procedure
1.3 Asymptotic Behavior of the BH Procedure: The Criticality Phenomenon
1.4 Power in the FDR Context
1.5 Confidence Bounds for False Discovery Proportions
1.6 Aims of the Current Study
2 Methods
2.1 Computation of α∗, u∗ and p∗
2.2 Different Approaches to Average Power
2.2.1 Average Power of the BH Procedure
2.2.2 FDP Exceedance Control
2.2.3 The Proportion of Rejected Hypotheses
2.2.4 The Proportions of True and False Null Hypotheses
2.2.5 Power Estimators for Sample Size Calculation
2.3 Simulation Studies
2.3.1 Simulation Study 1: Sample Size and Number of Tests
2.3.2 Simulation Study 2: Dependence
2.3.3 Simulation Study 3: Simulations Based on Real Data
2.3.4 Simulation Study 4: Comparison of Sample Size Estimates
3 Results
3.1 Comparison of Power Estimators
3.2 Sample Size and Number of Tests
3.3 Dependence
3.4 Simulations Based on Real Data
3.5 Comparison of Sample Size Estimates
4 Discussion
4.1 Performance of the BH Procedure in a Non-Asymptotic Setting
4.2 Performance of the BH Procedure Under Basic Forms of Dependence
4.3 Performance of the BH Procedure With an Effect Size Distribution
4.4 Performance of Power Estimators
4.5 Sample Size Calculation in Practice
4.6 Limitations and Future Research
4.7 Concluding Remarks
References
A Additional Tables
B Selected R Code
B.1 Functions for Sample Size and Power Calculations
B.2 Code for Simulation Study 2: Equicorrelation
1 Introduction
1.1 Multiple Testing and the False Discovery Rate
Multiple testing refers to a situation where multiple hypotheses are tested in the context
of a single experiment. A classical hypothesis test is performed by calculating a test-
statistic and comparing it to the appropriate distribution to obtain a p-value. This
p-value is subsequently compared to some predefined significance level α, typically 0.05,
and if p ≤ α, the null hypothesis is rejected. For a single hypothesis test, this ensures
that the probability of rejecting the null hypothesis given that it is true is at most α.
However, a typical experiment leads to more than a single hypothesis test. In fact,
studies in such fields as genomics or neuroimaging may lead to thousands of hypothesis
tests. If we were to test m null hypotheses at significance level α, then we would expect
to find αm significant tests even if all the null hypotheses are true. So if we perform a
thousand hypothesis tests with α = 0.05, we would expect to reject fifty null hypotheses
even when none of the null hypotheses are false. Clearly, there is a need to correct for
multiple testing.
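The expectation αm above is easy to verify by simulation. The following sketch is purely illustrative (the thesis's own code, in Appendix B, is in R): it counts how many Uniform[0, 1] p-values, as produced by true nulls, fall below α.

```python
import random

def expected_false_positives(m, alpha):
    """Expected number of rejections at level alpha when all m nulls are true."""
    return alpha * m

def simulate_false_positives(m, alpha, seed=0):
    """Under a true null the p-value is Uniform[0, 1], so each test is
    rejected with probability alpha; the total count is Binomial(m, alpha)."""
    rng = random.Random(seed)
    return sum(rng.random() <= alpha for _ in range(m))
```

With m = 1000 and α = 0.05 the expectation is 50, matching the example above, and a simulated count will typically fall close to it.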
Perhaps the best-known method to correct for multiple testing is the Bonferroni
correction (Dunn, 1961), which constitutes testing each individual hypothesis at a signif-
icance level of α/m. Applying the Bonferroni correction controls the Family-Wise Error
Rate (FWER), that is, the probability of at least one incorrect rejection. However, it
should be noted that as m gets very large, the test-wise significance level α/m gets very
small, leading to a loss in power. Although more powerful methods for controlling the
FWER exist, such as those by Holm (1979), Hommel (1986), and Hochberg (1988), all
FWER controlling procedures suffer from the same drawback, namely that they do not
scale well with m. In fact, it has been shown for these methods (e.g. Meijer, Krebs,
Solari, & Goeman, 2016) that as the number of tested hypotheses m goes to infinity, the
proportion of rejected hypotheses approaches zero. If we intend to test many hypotheses,
this is clearly an undesirable property.
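In code, the Bonferroni correction is a one-liner; the sketch below (Python, for illustration only) simply compares every p-value to α/m.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Bonferroni correction: reject hypothesis i iff p_i <= alpha / m.
    This controls the FWER at level alpha under any dependence structure."""
    m = len(pvals)
    return [i for i, p in enumerate(pvals) if p <= alpha / m]
```

Note how the per-test threshold α/m shrinks as m grows, which is exactly the loss of power discussed above.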
Instead of looking solely at the behavior of FWER controlling procedures, we can
also reconsider if it is really necessary to control the probability of at least one incorrect
rejection. If we were to perform a study where the main goal is to identify a set of
candidate genes to be used later in a validation experiment, we may decide that including
a few false positives in this set of candidate genes is not much of a problem, as long as
the number of false positives is not too large compared to the number of true positives.
This justification, in combination with the behavior of FWER controlling methods for
large m, may lead us to control a different error rate.
One such error rate is the False Discovery Rate (FDR) introduced by Benjamini
and Hochberg (1995), which is defined as the expected proportion of falsely rejected
hypotheses among the set of rejected hypotheses if there is at least one rejection, and
zero otherwise. Unlike FWER controlling procedures, FDR controlling procedures often
scale well with m, although this does require some conditions to hold which we will later
discuss.
1.2 The Benjamini-Hochberg Procedure
In their 1995 paper, Benjamini and Hochberg introduced not only the false discovery rate,
but also a procedure to control it, now commonly known as the “Benjamini-Hochberg
procedure” or “BH procedure”. The BH procedure is the best-known method for FDR
control and is, for example, included in the R function p.adjust under both the names
‘BH’ and ‘fdr’. To apply the BH procedure, we first test each of the m hypotheses
under consideration by calculating a test-statistic, and comparing it to the appropriate
distribution to obtain a p-value. Let p1 ≤ p2 ≤ ... ≤ pm be the ordered p-values, and
Hi the null hypothesis corresponding to pi. Then, to obtain an FDR control level α, we
reject all Hi for i = 1, 2, ..., k, with
k = max{i : pi ≤ (i/m)α}, (1)
and reject no hypotheses if this maximum does not exist. This procedure controls the
FDR at α for any configuration of false null hypotheses, assuming independent test
statistics (Benjamini & Hochberg, 1995). Actually, the BH procedure controls the FDR
at a level lower than α, namely at (m0/m)α, with m0 the number of true null hypotheses.
As m0 is typically unknown, the BH procedure provides a conservative approach that is
valid for any m0. However, if m0 is known, or more realistically, if some estimate of m0
is available, this can be incorporated in the BH procedure to make it less conservative,
as in e.g. Benjamini and Hochberg (2000). Benjamini and Yekutieli (2001) showed
that the BH procedure also controls the FDR for certain types of positive dependency
among the statistics. Since then, further theoretical work and simulation studies have
shown the BH procedure is quite robust, and remains valid for a wide variety of common
dependency structures (Goeman & Solari, 2014).
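The step-up rule in (1) can be sketched in a few lines. The thesis's own code is in R (see Appendix B); this Python version is only an illustrative reimplementation.

```python
def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure, equation (1): find the largest
    rank k with p_(k) <= (k / m) * alpha and reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank  # step-up: remember the *last* rank under the rejection line
    return sorted(order[:k])  # indices of rejected hypotheses
```

For example, with p-values (0.04, 0.05) at α = 0.05, rank 1 fails its threshold 0.025 but rank 2 meets its threshold 0.05, so both hypotheses are rejected: the step-up character of the procedure.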
Recall that the FDR is defined as the expected proportion of falsely rejected hy-
potheses among the set of rejected hypotheses if there is at least one rejection, and zero
otherwise. There exists a related quantity, the positive FDR (pFDR), which is defined
as the expected proportion of falsely rejected hypotheses among the set of rejected hypotheses given that there is at least one rejection (Storey, 2003). It should be noted
that while the BH-procedure guarantees that the FDR is controlled at α, the same is
not necessarily true for the pFDR. However, if the proportion of false null hypotheses
is positive, then for very large m the probability of no rejections is effectively zero, in
which case the pFDR equals the FDR, and so both are controlled at level α.
To gain more insight into the workings of the BH procedure, consider how p-values
are obtained. We typically assume p-values are sampled from some mixture model, with
a fixed proportion of false nulls among all the nulls, which we will denote π as in Chi
(2007). So a randomly sampled p-value belongs to a false null hypothesis with probability
π, and to a true null hypothesis with probability 1− π. The distribution of the p-values
that belong to a true null hypothesis is uniform on u ∈ [0, 1], such that for any p-value
pj , P(pj ≤ u|Hj = true) = u. The distribution of the p-values that belong to a false
null hypothesis is non-uniform, such that P(pj ≤ u|Hj = false) = G(u). The common
distribution function of the p-values is then:
F(u) = (1 − π)u + πG(u). (2)
The BH procedure approximates the inverse of this distribution function, F−1(u), by
the sequence of ordered p-values. This empirical inverse distribution function, F−1m (u), is
then compared to a rejection line with intercept equal to zero, and slope α. All p-values
up to and including the highest p-value under this line are then rejected. Note that
this implies the BH procedure is a so-called step-up procedure, that is, if the empirical
inverse distribution function crosses the rejection line more than once, all p-values before
the last intersection are rejected, as can be observed in Figure 1.
[Figure: ordered p-values pi plotted against i/m, with the rejection line αi/m; rejected p-values are marked.]
Figure 1: The Benjamini-Hochberg procedure applied to a set of m = 100 ordered p-values, using an FDR control level of α = 0.5. Note that some of the rejected p-values are actually above the rejection line: this is a typical feature of step-up procedures, where all p-values before the last intersection are rejected.
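The mixture model in (2) is straightforward to simulate from. The sketch below (Python, illustrative only) assumes one-sided Z-tests with an arbitrary effect size µ for the false nulls; true nulls yield Uniform[0, 1] p-values.

```python
import random
from statistics import NormalDist

_N = NormalDist()  # standard normal

def sample_pvalues(m, pi, mu, seed=1):
    """Draw m independent p-values from the mixture in (2): with probability
    pi a test is a false null (one-sided Z-test with shift mu), otherwise a
    true null whose p-value is Uniform[0, 1]."""
    rng = random.Random(seed)
    pvals, is_false = [], []
    for _ in range(m):
        f = rng.random() < pi
        z = rng.gauss(mu if f else 0.0, 1.0)
        pvals.append(1.0 - _N.cdf(z))  # one-sided p-value P(Z >= z)
        is_false.append(f)
    return pvals, is_false
```

As expected, p-values drawn from the false-null component are stochastically smaller than those from the uniform true-null component.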
1.3 Asymptotic Behavior of the BH Procedure: The Criticality Phe-
nomenon
We stated previously that FDR controlling procedures scale well with m. This includes
the BH procedure: Genovese and Wasserman (2002) showed that under some conditions,
the proportion of rejected hypotheses converges in probability to some positive value,
rather than to zero like FWER controlling procedures. Chi (2007) characterized this
convergence in an even stronger sense, by providing asymptotic bounds on the deviation
of the proportion of rejected hypotheses from this value. In this section, we will briefly
describe the conditions under which this convergence occurs.
The conditions for convergence of the proportion of rejected hypotheses to some
positive value were perhaps most elegantly explained in Chi (2007), where they are
characterized in terms of a criticality phenomenon. We will denote by Rm the number
of hypotheses rejected by the BH-procedure, so that Rm/m is the proportion of rejected
hypotheses. As in Chi (2007), we assume independent p-values with a common distribu-
tion function as defined in (2). The criticality phenomenon is defined as follows: there
can be some critical value α∗ > 0 for which the asymptotic behavior of the BH proce-
dure is different when α < α∗ compared to when α > α∗ (Chi, 2007). In particular, if
α > α∗, then as m → ∞, Rm/m converges to some positive value, but if α < α∗, Rm/m is
asymptotically zero. The critical value α∗ is a property of the distribution function F
of the p-values. In particular, if F is strictly concave (Chi, 2007),
α∗ = 1/F′(0). (3)
Thus, α∗ represents the slope of F−1 at zero. To explain why this is the critical value,
we consider first the case where F′(0) = ∞, and note that by the law of large numbers
as m → ∞, the empirical distribution function Fm → F in probability. If F′(0) = ∞,
then α∗ = 0. If the slope of F−1 at zero is zero, then surely for any rejection line with
slope α > 0, there will be some positive proportion of F−1 that is below this rejection
line. So as m → ∞, Rm/m will converge to a positive proportion for all α > 0. Therefore,
the criticality phenomenon does not occur.
Next, consider the case where F ′(0) < ∞. Now α∗ > 0, and so it is possible to
choose an α such that 0 < α < α∗. In this case, the slope of F−1 at zero is greater
than the slope of the rejection line at zero, which implies the entirety of F−1 is above
the rejection line. In other words, the proportion of F−1 that lies below the rejection
line is zero, and thus Rm/m → 0 as m → ∞. But if we choose an α such that α > α∗,
some positive proportion of F−1 will be below the rejection line again, and so Rm/m will
converge to this positive proportion. A p-value distribution for which α > α∗ has been
referred to as being Simes detectable at α (Meijer et al., 2016). The borderline case
α = α∗ is complex and has little practical relevance, so we will not consider it in this
study.
[Figure: two panels plotting F−1(u) against u on [0, 0.5], with the rejection line (slope α) and the tangent at zero (slope α∗); the bottom panel marks u∗ and p∗.]
Figure 2: Top: Application of the BH procedure with α < α∗. It can be observed that the whole of F−1(u) is above the rejection line, and therefore u∗ = p∗ = 0. Bottom: Application of the BH procedure with α > α∗. The values u∗ and p∗ are, respectively, the coordinates u and F(u) at the intersection of F−1(u) and the rejection line.
As in Chi (2007), we will denote the proportion to which Rm/m converges by p∗, which
serves as the limit (over m) of the proportion of rejected p-values. Then u∗ = F−1(p∗)
is the limit of the largest rejected p-value. We can calculate u∗ by (Chi, 2007)
u∗ = max{u ∈ [0, 1] : u/α ≤ F(u)}, (4)
and obtain p∗ by
p∗ = F(u∗) = u∗/α. (5)
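Once F is specified, equations (4) and (5) can be evaluated numerically. The sketch below (Python, illustrative only) assumes one-sided Z-tests with effect size µ for the false nulls, and approximates u∗ by a simple grid search.

```python
from statistics import NormalDist

_N = NormalDist()

def mixture_cdf(u, pi, mu):
    """F(u) = (1 - pi) u + pi G(u) for one-sided Z-tests with shift mu,
    where G(u) = 1 - Phi(Phi^{-1}(1 - u) - mu)."""
    g = 1.0 - _N.cdf(_N.inv_cdf(1.0 - u) - mu)
    return (1.0 - pi) * u + pi * g

def u_star(alpha, pi, mu, grid=10000):
    """Largest u with u / alpha <= F(u): a grid approximation of equation (4)."""
    best = 0.0
    for i in range(1, grid):
        u = i / grid
        if u / alpha <= mixture_cdf(u, pi, mu):
            best = u
    return best
```

By (5), p∗ then follows as u∗/α. When µ = 0 there is no signal, the condition in (4) holds only at u = 0, and the function returns u∗ = 0.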
As α∗ is a function of F , its value depends on the distributions of the chosen test-
statistics. If the distribution of the test-statistics under a null hypothesis is standard
normal, and the distribution under an alternative hypothesis is normal with σ > 1, or
σ = 1 and µ > 0, then the criticality phenomenon does not occur (Chi, 2007). If the test-
statistics follow t- or F-distributions, the situation is more complex. In fact, assuming
π > 0, the criticality phenomenon always occurs when using t-statistics, implying that
for a given set of distributions, the FDR control level always needs to be chosen such
that α > α∗ if we want Rm/m to be asymptotically non-zero. However, α is typically
fixed, whereas the t-distributions depend on both a sample size n and some effect size
θ. We can thus reverse the problem and state that, for a given α and θ, there is a
minimum sample size required to obtain α∗ < α. Whether or not α∗ is smaller than α
says something about the capacity of the BH procedure to asymptotically find signal in
the data, and can therefore be interpreted as a measure of power. In the next section, we
will discuss the meaning of power in a broader sense for methods controlling the FDR,
and discuss sample size calculation.
1.4 Power in the FDR Context
Sample size calculations are an important preliminary step in almost any study. Such
calculations are typically based on some measure of power, that is, some measure that
quantifies the ability of the chosen statistical methods to detect the effect of interest for
a given sample size. The aim is then to arrive at a sample size high enough to obtain a
desired power level, but not so high as to be unnecessarily expensive. In the context of
traditional hypothesis testing, the concept of power is well-defined: it is the probability
of rejecting a null hypothesis, given that it is false. Typically some desired power level is
formulated, e.g. 0.80 or 0.90, and this value, together with the desired significance level
and some effect size measure, is entered into a formula, and the result is some minimum
value of n. In the context of multiple testing, the situation is generally not so simple.
The classical definition of power is something that applies to a single hypothesis.
In the context of multiple testing, the individual power level for a single hypothesis is
sometimes referred to as the per-pair power (Horn & Dunnett, 2004; Keselman, Cribbie,
& Holland, 2002). It is, however, much more convenient to specify the desired power of
a multiple testing experiment in terms of a single quantity. The concept of power can
be extended to the entire set of hypotheses in multiple ways. The most commonly used
extension is the concept of average power, that is, the expected proportion of rejected
false null hypotheses among the set of false null hypotheses (e.g. Benjamini & Hochberg,
1995; Ferreira & Zwinderman, 2006; Jouve, Maucort-Boulch, Ducoroy, & Roy, 2009),
which has also been referred to as the True Positive Rate (TPR) (Kvam, Liu, & Si,
2012). The average power is the most natural extension to multiple testing, since if each
hypothesis has the same individual power level, the average power equals the per-pair
power. Although the average power is the most commonly used measure, some research
has focused on other approaches to power. For example, Lee and Whitmore (2002) used
what they termed the family power, which is defined as the probability of rejecting all of
the false null hypotheses. This measure of power has also been studied under the names
all-pairs power (Horn & Dunnett, 2004; Keselman et al., 2002) and collective power
(Jouve et al., 2009). Horn and Dunnett (2004) considered the probability of rejecting
at least one false null hypothesis, which they termed the any-pair power, and which has
also been studied in Yang, Cui, Chazaro, Cupples, and Demissie (2005), and in Jouve et
al. (2009) under the name relaxed power. This measure of power can be extended to the
probability of rejecting at least a certain number or proportion of false null hypotheses,
which has been termed the overall power (Shao & Tseng, 2007; Wang & Chen, 2004).
A quantity related to power is the False Nondiscovery Rate (FNR), which is defined
as the expected proportion of incorrect nonrejections among the nonrejections (Genovese
& Wasserman, 2002). Sarkar (2004) defined the power of a multiple testing procedure as
1− (FDR + FNR). This is a measure of power in a broader sense, as it strikes a balance
between the proportion of correctly accepted null hypotheses and the proportion of
falsely rejected null hypotheses.
Which of these power measures is preferable will probably depend on the situation
at hand. Controlling the family power level is typically unfeasible since the probability of
rejecting all false nulls quickly approaches zero as the number of false nulls increases (Lee
& Whitmore, 2002). The average power is the most obvious and common measure to
control, and is the most natural extension of the concept of power to the multiple testing
setting. However, we may decide that we do not necessarily require an average power
of at least, say, 0.80, but that we are already satisfied if we expect to find something,
i.e. expect at least one true discovery. In this case the relaxed power may be a suitable
measure to use. Alternatively, we could decide to base sample size calculations on
the criticality phenomenon, and calculate the minimum n for which α∗ < α, since this
ensures that the size of the rejection set does not vanish as m→∞. Obviously, this leads
to much lower sample size estimates than those obtained from controlling the average
or family power level.
In order to base sample size calculations on a measure of power, we need to be
able to calculate this power before the experiment takes place. If one does not wish
to resort to simulations, some expression for the power measure of choice needs to be
available. Such expressions are typically based on asymptotic properties, that is, based
on how the method behaves for an infinite number of hypotheses. Both Jung (2005)
and Liu and Hwang (2007) developed methods for sample size calculation in the context
of FDR control, based on such expressions for the average power. These methods are
closely related, and in identical settings they produce identical results (Liu & Hwang,
2007). These methods can be used for Z-, t-, and F-tests, and the results of Liu and
Hwang (2007) have been incorporated into the R package ssize.fdr (Orr & Liu, 2015).
Users of this package can select the type of test that is going to be performed, whether
tests are one- or two-sided, and can specify either a fixed effect size, or an effect size
distribution. The specification of an effect size distribution is typically only used when
pilot data is available from which such a distribution can be estimated. Estimating effect
size distributions is not a trivial affair; see, for example, Van Iterson, Van de Wiel, Boer,
and De Menezes (2013), who provide a method for estimating effect size densities from
pilot data that is applicable in a wide variety of settings. In the current study, we will
assume a fixed effect size.
It should be noted that ssize.fdr is not the only R package available for sample
size calculations in the FDR context. Other packages include FDRsampsize (Pounds,
2016) and ssizeRNA (Bi & Liu, 2017). Both of these packages are also based on average
power.
In addition to being based on asymptotic properties, power calculations typically
make further assumptions. For example, we will give an expression for the average
power of the BH procedure in section 2.2 which additionally assumes independent test
statistics. In a real-life setting, however, test statistics are typically not independent,
and the number of tested hypotheses is certainly not infinite. From this discrepancy, a
question arises: How accurate are power calculations based on such formulas compared
to the average power we observe when their assumptions are not met? Furthermore, if
these calculations are not as accurate as we would hope, can we offer a more conservative
power estimate?
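The observed average power referred to here is easy to estimate by simulation. The following self-contained Python sketch is purely illustrative (the thesis's own simulation code is in R, and one-sided Z-tests with an arbitrary effect size µ are an assumption): it repeatedly generates data, applies the BH procedure, and averages the proportion of rejected false nulls among the false nulls.

```python
import random
from statistics import NormalDist

_N = NormalDist()

def bh_reject(pvals, alpha):
    """Indices rejected by the BH step-up rule (1)."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m, k = len(pvals), 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return set(order[:k])

def average_power(m, pi, mu, alpha, reps=50, seed=42):
    """Monte Carlo estimate of the average power of the BH procedure: the
    expected proportion of rejected false nulls among the false nulls."""
    rng = random.Random(seed)
    total, used = 0.0, 0
    for _ in range(reps):
        false = [rng.random() < pi for _ in range(m)]
        pvals = [1.0 - _N.cdf(rng.gauss(mu if f else 0.0, 1.0)) for f in false]
        m1 = sum(false)
        if m1 == 0:
            continue  # skip the (rare) replication with no false nulls
        rejected = bh_reject(pvals, alpha)
        total += sum(false[i] for i in rejected) / m1
        used += 1
    return total / used
```

Comparing such Monte Carlo estimates with the asymptotic formulas is exactly the kind of check pursued in the simulation studies of this thesis.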
Simulation studies concerning the power of FDR controlling procedures under dif-
ferent conditions have been performed before, but their aim is typically to show the
difference in power between FWER and FDR controlling procedures (e.g. Benjamini &
Hochberg, 1995; Keselman et al., 2002), or between different FDR controlling procedures
(e.g. Benjamini & Hochberg, 2000; Sarkar, 2004; Storey, 2002). Other studies have fo-
cused on comparing the performance of multiple testing methods when dealing with very
specific types of data (e.g. Groppe, Urbach, & Kutas, 2011; Jouve et al., 2009; Kvam et
al., 2012; Yang et al., 2005). In a study by Neuvial (2010), power calculations based on
the asymptotic results of Chi (2007) are compared to a situation where the number of
hypotheses is finite. Although they investigate the average power of the BH procedure
for different values of π and α, they only consider a single value for the number of hy-
potheses (m = 1000), and do not take into account dependency or different values of n.
They observe that if α > α∗, the power for m = 1000 is very similar to the asymptotic
power, whereas if α < α∗, the power for m = 1000 is slightly higher than the asymptotic
power. They conclude that the effect of the criticality phenomenon is less dichotomous
in real data analysis than is suggested by theory (Neuvial, 2010). They also investigate
the effect of sample size in a real data set, but consider only the proportion of rejected
hypotheses and not the power, since the data is unedited and the identity of the true
and false nulls is unknown (Neuvial, 2010).
The performance of the BH procedure under dependency has also been a topic of
previous research, although primarily to show that the BH procedure manages to control
the FDR under various common forms of dependency (e.g. Benjamini & Yekutieli, 2001;
Kim & van de Wiel, 2008). Kim and van de Wiel (2008) additionally investigated the
FNR and concluded that the BH procedure can attain a low FNR level even under
dependence. However, they utilized constrained random correlation matrices in which
only the number of correlated variables and the variance of the pairwise correlations
were controllable parameters (Kim & van de Wiel, 2008), which makes it difficult to
draw conclusions about the effect of correlation magnitude. Hemmelmann, Horn, Susse,
Vollandt, and Weiss (2005) investigate the power of the BH procedure when the features
are equicorrelated. For a fixed m (40) and n (8) they compare the average power
under two levels for the pair-wise correlations (0.2 and 0.8) and conclude that a higher
correlation leads to a reduction in power (Hemmelmann et al., 2005). However, the
observed differences in power between pair-wise correlations of 0.2 and 0.8 appear very
small. They additionally observe that larger pair-wise correlations lead to an increase in
the variance of the observed False Discovery Proportion (FDP), although they conclude
that if the BH procedure is applied with α = 0.05, it is uncommon for the observed FDP
to exceed 0.1 (Hemmelmann et al., 2005). Jung (2005) evaluates their method for sample
size calculation, which assumes independent test statistics, in a simulation experiment
with a compound symmetry correlation structure. They conclude that their approach
works well under weak dependency, but when the features are strongly correlated (ρ =
0.6), the interquartile range (IQR) of the observed proportion of rejected hypotheses
almost doubles, although the median observed proportion remains close to the proportion
predicted under independence (Jung, 2005). Shao and Tseng (2007) observe a similar
increase in the variance of the average power with a 2-block correlation structure for the
false nulls, and assuming independence or some low constant correlation (ρ = 0.1) for the
true nulls. They suggest a method for dependence adjustment of sample size calculations,
but this method requires the correlation structure in the data to be known or estimated
(Shao & Tseng, 2007).
From these previous studies, it appears that the asymptotic proportion of rejected
hypotheses calculated under the assumption of independent test statistics remains a de-
cent estimator for the median observed proportion of rejected hypotheses, even when
there are strong correlations in the data. The same applies to the average power. How-
ever, an increase in correlation magnitude does lead to an increase in variance for the
distributions of these quantities. Although large increases have been observed for high
correlations like ρ = 0.6 (Jung, 2005) and ρ = 0.8 (Shao & Tseng, 2007), such situations
are not necessarily realistic. In fact, a simulation study based on real data by Shao and
Tseng (2007) showed far less dramatic differences. An increase in the variance of the
average power, especially if this increase is small, does not necessarily mean that sample
size calculations based on an independence assumption cannot be used.
In the current study, we will further investigate the effects of the number of tested
hypotheses, sample size, and dependency, on the proportion of rejected hypotheses and
average power of the BH procedure. We will compare the observed values of these quan-
tities with the values predicted using the results of Chi (2007). We will investigate the
effects of various magnitudes of correlation between the features, and aim to provide
more insight regarding the level of dependency that is acceptable when power calcu-
lations are based on an independence assumption. We assume an equal effect size for
all hypotheses, but we will also evaluate how power calculations based on this assump-
tion perform when the effect sizes are not equal. Furthermore, we will propose several
conservative estimators for the power of the BH procedure.
1.5 Confidence Bounds for False Discovery Proportions
The FDR is defined as the expected proportion of falsely rejected hypotheses among the
set of rejected hypotheses. As such, FDR control means controlling an expectation, and
the realized False Discovery Proportion (FDP) may be higher or lower depending on the
data at hand. Various methods have been proposed for controlling the probability of
the FDP exceeding a pre-specified level, which has been referred to as FDP exceedance
control (Genovese & Wasserman, 2006). Such methods are generally either based on
permutations (e.g. Korn, Troendle, McShane, & Simon, 2004), or modeling of the FDP
distribution (e.g. Oura, Matsui, & Kawakami, 2009), but many have been found to be
too computationally expensive or too conservative for practical use (Hemmelmann et al.,
2005; Shang, Zhou, Liu, & Shao, 2012). The use of exceedance control instead of FDR
control is analogous to using a confidence interval instead of a point estimator (Genovese
& Wasserman, 2006), and can thus also be formulated in terms of a confidence interval
for the FDP as in Shang, Liu, and Shao (2012). Goeman and Solari (2011) showed how
to calculate confidence bounds on the number of false discoveries simultaneously over
all possible rejection sets, a finding that can also be formulated in terms of confidence
bounds on the FDP (Meijer et al., 2016). This method actually allows researchers to
pick the rejection set after having seen the data and guarantees valid confidence bounds
can be formulated for any such set, but it can also be used for more classical FDP
control (Meijer et al., 2016). One could, for example, choose the largest rejection set so
that the upper 95% confidence bound is smaller than some maximum acceptable FDP.
Alternatively, one could use the 50% confidence bound to control the median FDP.
Of primary interest to the current study is the relation that exists between the FDP
confidence bounds from Meijer et al. (2016) and the BH procedure. We will denote by
qγ(S) the 1− γ upper confidence bound for the FDP in rejection set S, and by SBH(α)
the rejection set of the BH procedure at FDR control level α. Meijer et al. (2016) showed
that for all 0 < α ≤ 1:
qγ(SBH(αγ)) < α. (6)
This means that if we want the FDP of a BH rejection set to be smaller than α with
probability at least 1 − γ, we can guarantee this by simply applying the BH procedure
with FDR control level αγ. This is an attractive method for FDP exceedance control
with the BH procedure, since it is very easy to apply.
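To illustrate how little machinery this requires, the rule can be sketched as follows (a minimal Python sketch rather than the R used elsewhere in this thesis; the function name bh_reject is ours):

```python
import numpy as np

def bh_reject(pvals, level):
    """Benjamini-Hochberg step-up: 0-based indices of hypotheses rejected
    at FDR control level `level`."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    crit = level * np.arange(1, m + 1) / m        # i * level / m, i = 1..m
    below = np.nonzero(p[order] <= crit)[0]
    if below.size == 0:
        return np.array([], dtype=int)
    k = below[-1]                                  # largest i with p_(i) <= i * level / m
    return np.sort(order[:k + 1])

# FDP exceedance control via (6): to get P(FDP >= alpha) <= gamma,
# simply run the BH procedure at FDR control level alpha * gamma.
alpha, gamma = 0.1, 0.05
pvals = [0.0001, 0.0003, 0.004, 0.03, 0.5, 0.9]
rejected = bh_reject(pvals, alpha * gamma)
```

With these numbers the BH cutoff at level αγ = 0.005 retains the two smallest p-values.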
In addition to the result in (6), Meijer et al. (2016) showed that if α∗ < αγ, there is
a set Sm for which |Sm|/m converges to a positive proportion, such that qγ(Sm) converges
in probability to some α′ < α. Furthermore, they show that α′ ≤ hα, with

$h = \min_{0 \le u < F(\gamma)} \frac{u\gamma - \gamma}{F^{-1}(u) - \gamma}.$ (7)
This implies that, for the BH procedure at FDR control level αγ with α∗ < αγ, as
m→∞, the upper confidence bound on the FDP converges in probability to a quantity
that is not just smaller than α, but at most hα, with h < 1 since α∗ < αγ ≤ γ.
So what is the interpretation of h? For a sequence of ordered p-values, a confidence
bound for the number of true null hypotheses can be obtained by (Meijer et al., 2016):
$h(\gamma) = \max\left\{\, i \in \{0, \dots, m\} : i\, p_{(m-i+j)} > j\gamma \ \text{for } j = 1, \dots, i \,\right\}.$ (8)

This is a confidence bound in the sense that P(# true nulls ≤ h(γ)) ≥ 1 − γ (Meijer
et al., 2016). As m → ∞, the quantity h(γ)/m converges in probability to h (Meijer et al.,
2016). This means that h can be interpreted as an asymptotic confidence bound on the
proportion of true nulls.
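The bound in (8) can be computed directly from the sorted p-values. A small Python sketch (the function name hommel_h is ours, and the direct O(m²) scan is for illustration only; faster algorithms exist):

```python
import numpy as np

def hommel_h(pvals, gamma):
    """Confidence bound (8): the largest i in {0, ..., m} such that
    i * p_(m-i+j) > j * gamma for all j = 1..i, with p_(1) <= ... <= p_(m).
    Then P(# true nulls <= h(gamma)) >= 1 - gamma."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = p.size
    for i in range(m, 0, -1):                      # try i = m, m-1, ..., 1
        j = np.arange(1, i + 1)
        if np.all(i * p[m - i + j - 1] > j * gamma):
            return i
    return 0

# Three very small p-values leave at most three plausible true nulls:
h_bound = hommel_h([1e-6, 1e-6, 1e-6, 0.5, 0.5, 0.5], 0.1)   # h(0.1) = 3
```

Dividing h(γ) by m then gives the finite-sample analogue of the asymptotic bound h on the proportion of true nulls.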
Although h no longer depends on the number of tests, it still depends on sample
size. For low n, h is close to 1, whereas for high n, h approaches 1 − π. An intuitive
explanation for this behavior is that if n is low, then even if the number of tests is infinite,
it will still be hard to deduce the proportion of true nulls from the p-value distribution.
We cannot be sure of the exact proportion of true null hypotheses, but can only say
that it is likely to fall in a certain range. As n increases, the contrast between the
true and false nulls becomes increasingly more pronounced, so that this range becomes
increasingly narrow and the 1− γ confidence bound, h, approaches the true proportion
of true nulls 1− π.
The quantity h thus plays an important role in the FDP confidence bound. In a
broader sense, using h at an appropriate confidence level can be considered a conservative
alternative to using a point estimate for the proportion of true null hypotheses. As
mentioned in section 1.2, such estimates can be incorporated into the BH procedure.
It can be anticipated that using more conservative methods, i.e. using FDP exceedance
control instead of FDR control or h instead of a point estimate for the proportion
of true nulls, leads to a reduction in predicted average power. We argue that such
lower predictions can serve as conservative power estimates, which can in turn be used
in sample size calculation. In section 2.2, we will attempt to formulate conservative
estimators for the average power of the BH procedure by combining these results of
Meijer et al. (2016) with those of Chi (2007).
1.6 Aims of the Current Study
In the current study, we will further investigate how the asymptotic behavior of the BH
procedure as described by Chi (2007) relates to observed behavior in a setting where
the underlying assumptions are not met. Such a setting involves a finite number of
hypotheses and may have correlated features. We will assess the effect of sample size,
the number of tested hypotheses, and basic forms of dependence, on the behavior of the
BH procedure using simulations. We will assess this behavior primarily in terms of the
proportion of rejected hypotheses and the average power. We will restrict ourselves to
two-sample t-tests and focus on the situation where such tests are performed two-sided.
We will describe how violated assumptions affect the proportion of rejected hypotheses and average power, and aim to provide some general guidelines regarding the use
of sample size calculation methods based on these quantities. All simulations will be
performed in R (R Development Core Team, 2008).
We also aim to formulate several conservative power estimators based on the results
of both Chi (2007) and Meijer et al. (2016). These estimators are all related to the
asymptotic average power of the BH procedure, as we will discuss in section 2.2. In
addition to more conservative power estimators, we also suggest a more liberal approach
to power by allowing researchers to calculate the minimum sample size required so that
α∗ < α. Obviously, calculations based on the criticality phenomenon lead to much lower
sample size estimates than those based on average power.
Providing methods for sample size calculation is of little use if they are unavailable
in practice. We will therefore write R functions for sample size calculation based on the
various methods discussed in the current paper, and aim to present these in the form of
an easy-to-use Shiny (Chang, Cheng, Allaire, Xie, & McPherson, 2017) application.
2 Methods
2.1 Computation of α∗, u∗ and p∗
In this section we will discuss the computation of the critical value and associated quan-
tities introduced in section 1.3. As can be observed in (3), computing the critical value
α∗ requires computing the derivative of F at zero. We defined F in (2) using the func-
tion G, which we will now discuss in more detail. As in Chi (2007), we will assume that
test statistics corresponding to a true null are randomly sampled from some distribution
with distribution function Ψ0 and density ψ0. Test statistics corresponding to a false
null then have distribution function Ψ1 and density ψ1. The joint distribution function
of the test statistics is then:
Ψ = (1− π)Ψ0 + πΨ1. (9)
From inverse probability sampling we know that if some test statistic X has distribution
function Ψ, then for U ∼ Uniform(0, 1), the random variable Ψ⁻¹(U) has the same
distribution as X. The right-tail p-value, 1 − Ψ0(X), then has the same distribution as
1 − Ψ0(Ψ⁻¹(U)), with distribution function (Chi, 2007):

$F(u) = 1 - \Psi(\Psi_0^{-1}(1-u)).$ (10)
Since Ψ is a mixture of two distributions, we can split F into two parts, each multiplied
by their corresponding weight. The p-values of the test statistics corresponding to a
true null hypothesis then have distribution function 1 − Ψ0(Ψ0⁻¹(1 − u)) = u, and the
p-values of the test statistics corresponding to a false null have distribution function
1 − Ψ1(Ψ0⁻¹(1 − u)) = G(u). The joint distribution function of the p-values can then be
written as (Chi, 2007):

$F(u) = (1-\pi)u + \pi\left[1 - \Psi_1(\Psi_0^{-1}(1-u))\right].$ (11)
The first derivative of this function is (Chi, 2007):

$F'(u) = 1 - \pi + \pi\,\frac{\psi_1(x)}{\psi_0(x)},$ (12)

with x = Ψ0⁻¹(1 − u). To evaluate the derivative of F at zero we then need to calculate
(Chi, 2007):

$F'(0) = 1 - \pi + \pi \lim_{x\to\infty} \frac{\psi_1(x)}{\psi_0(x)}.$ (13)
The densities ψ0 and ψ1 obviously depend on the chosen test statistic. We will first
consider one-sample t-tests, where each feature Yj is considered a sample from a normal
distribution with mean µj and standard deviation σj , and where each null hypothesis
H0j : µj = 0 is tested against the one-sided alternative HAj : µj > 0. We will assume
µj = 0 in the case of a true null hypothesis, and µj = c > 0 in the case of a false null
hypothesis, and assume all σj = σ, i.e. we assume equal (standardized) effect sizes for all
features. Then for the true null hypotheses, the sample t-statistic follows a t-distribution
with n−1 degrees of freedom. For the false null hypotheses, the sample t-statistic follows
a noncentral t-distribution with n − 1 degrees of freedom and noncentrality parameter
δ. This noncentrality parameter can be factorized into δ = √n θ, with θ = c/σ the
(standardized) effect size (Chi, 2007), which implies sample size and effect size can be
considered as two variables which independently affect the noncentrality parameter. The
densities ψ0 and ψ1 are then, respectively, tn−1 and tn−1,δ. The limit in (13) can then
be computed by (Chi, 2007):
$\lim_{x\to\infty} \frac{t_{n-1,\delta}(x)}{t_{n-1}(x)} = e^{-\delta^2/2} \sum_{k=0}^{\infty} \frac{\Gamma\!\left(\frac{n+k}{2}\right)\left(\sqrt{2}\,\delta\right)^k}{k!\,\Gamma\!\left(\frac{n}{2}\right)},$ (14)
where Γ denotes the Gamma function. This limit does not diverge whenever δ > 0,
implying the criticality phenomenon always occurs with one-sample t-tests (Chi, 2007).
Furthermore, Chi (2007) showed that the distribution function of the p-values is strictly
concave, which means α∗ can be calculated as in (3).
Now, we will consider two-sample t-tests. For each feature Yj , we now observe
two samples: Y1(j) of size n1, and Y2(j) of size n2, with n = n1 + n2 the total sample
size. We will assume Y1(j) ∼ N(µ1(j), σj) and Y2(j) ∼ N(µ2(j), σj). For the true null
hypotheses, the two-sample t-statistic then follows a t-distribution with n − 2 degrees
of freedom. Assuming equal effect sizes, the two-sample t-statistic follows a noncentral
t-distribution with n − 2 degrees of freedom and noncentrality parameter δ =√nθ for
the false null hypotheses. It should be noted that the interpretation of the effect size is
different for two-sample t-tests, compared to one-sample t-tests. In particular, the effect
size for the two-sample t-tests is $\theta = \sqrt{p_1 p_2}\,\frac{\mu_{1(j)} - \mu_{2(j)}}{\sigma_j}$, with p1 and p2 the proportions of
samples in groups 1 and 2 relative to the total sample size (Van Iterson et al., 2013).
For (one-sided) two-sample t-tests, the limit in (13) can be computed by:
$\lim_{x\to\infty} \frac{t_{n-2,\delta}(x)}{t_{n-2}(x)} = e^{-\delta^2/2} \sum_{k=0}^{\infty} \frac{\Gamma\!\left(\frac{n+k-1}{2}\right)\left(\sqrt{2}\,\delta\right)^k}{k!\,\Gamma\!\left(\frac{n-1}{2}\right)}.$ (15)
It should be noted that the previous specifications are in terms of right-sided p-values.
The calculations are equivalent for left-sided p-values if we assume θ to be an absolute
measure of effect size. If, for some θ, the right-sided p-values 1−Ψ0(X) have distribution
function F , then for −θ, the left-sided p-values Ψ0(X) have the same distribution.
In practice, two-sample t-tests are typically performed two-sided. For two-sided
tests, the distribution of the test statistics corresponding to the false null hypotheses
is assumed to be a mixture of two noncentral t-distributions, one with noncentrality
parameter δ = √n θ, and the other with δ = −√n θ. Then the two-sided p-values,
2(1 − Ψ0(|X|)), do not have the same distribution as those obtained from one-sided
tests. To obtain an expression for the distribution of the two-sided p-values, we can
use the more general F -test. It is well-known that if an F -test is used to compare two
means, this is equivalent to performing a two-sided two-sample t-test. In particular,
the F -statistic is the square of the t-statistic. We will assume that for the true null
hypotheses, the sample F -statistic follows an F -distribution with 1 and n − 2 degrees
of freedom. For the false null hypotheses, the sample F -statistic follows a noncentral
F -distribution with 1 and n − 2 degrees of freedom, and noncentrality parameter δ2.
Then the one-sided p-values of the F -test have the same distribution as the two-sided
p-values of the two-sample t-test. For two-sided two-sample t-tests, the limit in (13) can
then be computed by (Chi, 2007):
$\lim_{x\to\infty} \frac{f_{1,n-2,\delta^2}(x)}{f_{1,n-2}(x)} = e^{-\delta^2/2}\, B\!\left(\tfrac{1}{2}, \tfrac{n-2}{2}\right) \sum_{k=0}^{\infty} \frac{\left(\tfrac{\delta^2}{2}\right)^k}{k!\, B\!\left(\tfrac{1}{2} + k,\, \tfrac{n-2}{2}\right)},$ (16)
where B denotes the Beta function. The limits in (14), (15) and (16) can be easily
approximated by repeatedly summing the terms of the series until some convergence or
divergence criterion is met. We utilized a convergence tolerance of 10⁻¹⁰. For increased
numerical precision, we calculated the natural logarithm of each term of the series before
exponentiating and adding to the total. For computation of u∗ we used the approximation
algorithm described by Chi (2007), namely to set u1 = 1 and iterate ui+1 = αF(ui)
until convergence. Then p∗ = u∗/α. To evaluate F⁻¹, as in the computation of h, we utilized
a general root-finding algorithm (Brent, 1973) with a tolerance of approximately 10⁻⁴.
As α∗, u∗ and p∗ can be quickly computed, it is feasible and practical to perform sample
size calculations by simply starting at some minimal value of n, and increasing n until
F has the desired properties.
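These computations can be sketched as follows (in Python rather than the R used for the actual implementation; log-scale summation of the series in (14), followed by the fixed-point iteration for u∗; the function names are ours):

```python
from math import exp, lgamma, log, sqrt

def limit_ratio_one_sample(n, delta, tol=1e-10, max_terms=100000):
    """Approximate the limit (14) by summing terms of the series until a term
    falls below `tol`; each term is computed on the log scale for precision."""
    total = 0.0
    for k in range(max_terms):
        log_term = (-delta ** 2 / 2.0
                    + lgamma((n + k) / 2.0)
                    + k * log(sqrt(2.0) * delta)
                    - lgamma(k + 1.0)
                    - lgamma(n / 2.0))
        term = exp(log_term)
        total += term
        if k > 0 and term < tol:
            break
    return total

def p_star(alpha, F, tol=1e-12, max_iter=100000):
    """Chi's (2007) approximation: set u1 = 1, iterate u_{i+1} = alpha * F(u_i)
    until convergence; then p* = u*/alpha."""
    u = 1.0
    for _ in range(max_iter):
        u_new = alpha * F(u)
        if abs(u_new - u) < tol:
            break
        u = u_new
    return u_new / alpha
```

For F(u) = u (no signal, π = 0) the iteration collapses to u_{i+1} = αu_i, so u∗ = 0 and p∗ = 0, as it should.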
2.2 Different Approaches to Average Power
2.2.1 Average Power of the BH Procedure
In this section, we will further explore the average power of the BH procedure, give an
expression for the asymptotic average power of the BH procedure based on the results
of Chi (2007), and discuss the relationship with an existing R package for sample size
calculation in the FDR context. We will then suggest different power estimators of the
same form, and discuss similarities and differences with the expression for the asymptotic
average power.
We defined the concept of average power in section 1.4 as the expected proportion
of rejected false null hypotheses among the set of false null hypotheses. We can express
this as:

$E\left(\frac{R_m \times \mathrm{TDP}}{m_1}\right),$ (17)
with Rm the number of rejected hypotheses, TDP the true discovery proportion, and
m1 the number of false null hypotheses. We can equivalently specify this in terms of
the proportion of rejected hypotheses Rm/m and the proportion of false null hypotheses
m1/m. We know from the results of Chi (2007) that as m → ∞, the proportion of rejected
hypotheses converges to p∗. We also know that the BH procedure controls the FDR at a
level (1− π)α, with π the proportion of false null hypotheses. We can therefore express
the asymptotic average power of the BH procedure as:
$\frac{p^*(\alpha, F)\,(1 - (1-\pi)\alpha)}{\pi},$ (18)
where p∗(α, F ) denotes the value of p∗ corresponding to the BH procedure applied to
p-value distribution F , with FDR control level α. This is an asymptotic result in the
sense that as m → ∞, the average power converges in probability to the quantity in
(18), as was shown previously by Ferreira and Zwinderman (2006). It is this asymptotic
behavior on which sample size calculations are typically based, that is, one calculates the
minimum n for which the quantity in (18) is larger than the desired power level. For ex-
ample, although the ssize.fdr package uses a different method to arrive at its power
specification, and the quantity in (18) is not expressed directly anywhere in the related
documentation (Liu & Hwang, 2007), their method for power calculation is essentially
identical, although there are some differences in interpretation of the parameters α and
θ. In particular, the ssize.oneSamp and ssize.twoSamp functions of ssize.fdr take
as their FDR control parameter the quantity (1− π)α rather than α, and the effect size
that is specified for ssize.twoSamp is actually the quantity 2θ rather than θ. If one
takes these differences in interpretation into account, the same estimates for the power
and sample size will be obtained.
In a real setting, where the number of hypotheses is finite and features are possibly
correlated, the asymptotic average power can be seen as an approximation or estimator
of the actual power. This estimator can be further decomposed into an estimator of the
proportion of rejected hypotheses, an estimator of the TDP, and an estimator of the
proportion of false null hypotheses. We propose a general formula for a power estimator
of this form, namely:

$\frac{p(\alpha\gamma, F)\,(1 - \pi_0\alpha)}{\pi}.$ (19)
Here α is again the FDR control parameter, γ the exceedance control parameter, and π
an estimate of the proportion of false null hypotheses. If we do not wish to control the
FDP exceedance probability, we set γ = 1. The function p is an estimator for the size
of the rejection set relative to the number of hypotheses. In (18), p = p∗. An estimator
of the TDP is given by (1− π0α), with π0 an estimator for the proportion of true nulls
in the data. In the remainder of this section, we will propose a number of choices for
these estimators, describe different combinations, and discuss possible advantages and
drawbacks. Note that by decomposing the estimator for the average power in this way,
we are essentially assuming the proportion of rejected hypotheses and the TDP are
independent.
We previously mentioned an alternative power specification where one could choose
to calculate the minimum n such that α∗ < α. This method can actually be considered
as a special case of controlling the asymptotic average power in (18). After all, if α∗ < α,
p∗ > 0, but if α∗ > α, p∗ = 0. Since we always assume 0 < α < 1 and 0 < π < 1,
calculating the minimum n such that α∗ < α is the same as calculating the minimum n
such that the asymptotic average power is non-zero.
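All estimators of the form (19) share one computational skeleton. A small Python helper (ours) makes the decomposition explicit; with γ = 1 and π0 = 1 − π it reduces to the asymptotic average power (18):

```python
def power_estimate(p_fn, alpha, pi, gamma=1.0, pi0=None):
    """General power estimator (19): p(alpha * gamma) * (1 - pi0 * alpha) / pi.

    p_fn  : estimator of the proportion of rejected hypotheses (e.g. p* or p_l)
            as a function of the FDR control level
    gamma : FDP exceedance parameter; gamma = 1 means plain FDR control
    pi0   : estimate of the proportion of true nulls; defaults to 1 - pi as in (18)
    """
    if pi0 is None:
        pi0 = 1.0 - pi
    return p_fn(alpha * gamma) * (1.0 - pi0 * alpha) / pi

# With a hypothetical constant p* = 0.08, pi = 0.1 and alpha = 0.05:
est = power_estimate(lambda level: 0.08, alpha=0.05, pi=0.1)
# 0.08 * (1 - 0.9 * 0.05) / 0.1 = 0.764
```

Choosing the most conservative TDP estimator, π0 = 1, can only lower the estimate, here to 0.08 · (1 − 0.05)/0.1 = 0.76.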
2.2.2 FDP Exceedance Control
As discussed in section 1.5, we know from Meijer et al. (2016) that if we want the
probability that the FDP exceeds α to be at most γ, we can ensure this by applying
the BH procedure at FDR control level αγ. In this case, the asymptotic average power
is simply given by (18), with α replaced with αγ. However, this may not be the most
desirable measure of power to use in this case, since it does not take into account the
fact that α is now a confidence bound.
From section 1.5 we know that, asymptotically, the 1− γ upper confidence bound
on the FDP in the BH rejection set at FDR control level αγ, i.e. qγ(SBH(αγ)), is at most
hα. Equivalently, we can formulate this as a 1−γ lower confidence bound, dγ(SBH(αγ)),
on the TDP. We obtain:
dγ(SBH(αγ)) = 1− qγ(SBH(αγ)) ≥ 1− hα. (20)
We can incorporate 1−hα as a conservative estimator for the TDP in our power estimator
(as opposed to 1− (1− π)αγ). We can arrive at this by writing (20) as a bound for the
average power itself. We obtain:
$\frac{\frac{R_m(\alpha\gamma)}{m}\, d_\gamma(S_{BH}(\alpha\gamma))}{m_1/m} \;\ge\; \frac{\frac{R_m(\alpha\gamma)}{m}\,(1 - h\alpha)}{m_1/m}.$ (21)
This implies that, in order to ensure that the average power is greater than some desired
level with probability at least 1 − γ, we could choose an n such that the right hand
side of (21) is greater than this desired level. This is not trivial, however, as both the
quantities m1/m and Rm(αγ)/m are unknown beforehand. For large m, we expect that replacing
these quantities with π and p∗(αγ) will work well as a conservative power estimator.
However, since there can still be variation around these quantities, this will no longer
be a true confidence bound for the power.
2.2.3 The Proportion of Rejected Hypotheses
So far, we have assumed an infinite number of hypotheses. For example, we estimate the
proportion of rejected hypotheses by p∗, the quantity to which it converges as m→∞.
There are two main problems with this approach when the number of hypotheses is
finite, which is always the case in practice. The first is that E(Rm/m) does not necessarily
equal p∗. The second is that there may be considerable variation around E(Rm/m), and
the smaller the number of hypotheses tested, the higher we expect this variation to be.
The first problem will be further studied using simulations. Here, we will propose a
conservative estimator that takes the second problem into account.
Chi (2007) characterized the convergence of Rm/m to p∗ using the law of the iterated
logarithm (LIL), showing that:
$\limsup_{m} \pm\frac{R_m - m p^*}{\sqrt{m \log\log m}} = \frac{\sqrt{2 p^*(1 - p^*)}}{1 - \alpha F'(u^*)}, \quad \text{a.s.},$ (22)
assuming α∗ < α < 1 and 1− αF ′(u∗) > 0. We define:
$p_b = \frac{\sqrt{2 p^*(1 - p^*)\, m \log\log m}}{m\,(1 - \alpha F'(u^*))}.$ (23)
Then pl = p∗ − pb is a lower bound on the proportion of rejected hypotheses. Note that
this is still an asymptotic result: we know that from some m onward, P(Rm/m > pl) = 1.
For finite m, this probability may not be 1, but we do expect it to be high. Some
preliminary simulations (results not shown) indicated that for m between 10 and 1000,
P(Rm/m > pl) was typically in the 0.97 to 1 range, even when m1/m is allowed to vary around
π. This implies that by using pl in sample size calculations, a researcher can be sure
that, from some m onward, the proportion of rejected hypotheses is higher than the
desired level, whereas for a smaller number of hypotheses, he can still be very confident
that this will be the case. A nice property of this method is that as m grows, pl will
grow closer and closer to p∗, so that for an infinite number of tests, using pl essentially
coincides with the “traditional” use of p∗.
A possible drawback is that the LIL bounds are very wide for low m. Although
this means that the probability of Rm/m > pl is high even for low m, it also means that for
some m (e.g. m = 10), the lower bound pl will be so low that it is unusable in practice. In
a high-dimensional setting, however, one typically considers a large number of features.
If one applies this method with, for example, a thousand features, we do not expect this
to be a problem.
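Once p∗ and F′(u∗) are available, the bound pl is cheap to evaluate. A Python sketch of (23) (function name ours; the inputs used below are illustrative values, not results from this thesis):

```python
import math

def lil_lower_bound(p_star, m, alpha, f_prime_u_star):
    """Conservative bound p_l = p* - p_b on the proportion of rejected hypotheses,
    with p_b as in (23); requires 1 - alpha * F'(u*) > 0 and m > e (so that
    log(log(m)) is positive)."""
    denom = m * (1.0 - alpha * f_prime_u_star)
    if denom <= 0 or m <= math.e:
        raise ValueError("requires 1 - alpha * F'(u*) > 0 and m > e")
    p_b = math.sqrt(2.0 * p_star * (1.0 - p_star) * m * math.log(math.log(m))) / denom
    return p_star - p_b
```

Since pb shrinks at rate √(log log m / m), the bound automatically tightens toward p∗ as the number of tested hypotheses grows.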
2.2.4 The Proportions of True and False Null Hypotheses
If we want to calculate a power estimate as in (19), we require estimates for the propor-
tion of true and false null hypotheses. The true population values of these parameters
are of course complementary: π0 = 1 − π. In order to calculate an estimate of the
average power using this specification, one always needs to provide an estimate of π;
in what follows, π denotes this estimate. The estimator for the TDP then requires an estimate of π0,
the most obvious choice for which is π0 = 1 − π, as in (18). This does, however, place
a lot of confidence in π. The estimate may be derived from a small pilot sample, or
perhaps made up by a researcher based on previous experience or beliefs. If one wants
to incorporate some uncertainty about π in the power calculations, one could opt to
use a more conservative estimator of π0 in the estimator for the TDP. For example, by
setting π0 = h. Note that h still depends on π, but for low n, h > 1− π, so that h is a
more conservative estimator in the sense that it assumes the actual proportion of signal
in the data may be lower than believed. For low n, this is more similar to specifying the
general shape of the p-value distribution, rather than the exact proportion of noise. One
could also choose any other estimator of π0. The most conservative option is to simply
set π0 = 1.
We do not treat the estimates of π and π0 in (19) as complementary per se. That
is, if we use π0 = h, we do not set π = 1 − h. This would not really make sense, since
h is a function of π. Doing so would actually make the resulting power estimate much
less conservative since it assumes there is less signal in the data to find, such that far
fewer hypotheses need to be rejected to discover a large proportion of it, and could lead
to power estimates greater than one. Furthermore, setting π0 = 1 would then lead to
π = 0, in which case the power cannot be calculated at all. In other words, we use a
conservative estimate of π0 only to obtain a more conservative estimate of the TDP.
There can be differences in interpretation of π. A researcher may give an estimate
of the population proportion of false nulls π, or an estimate of the proportion of false
nulls in the data, m1/m. In the latter case, it can be assumed that there is no variation in
m1/m, and one could argue that π0 = 1 − π is then always the most sensible choice.
It should be noted that the choice of π0 is generally not nearly as impactful as
the choice of α, γ or p. This is because, in a practical high-dimensional setting, the
proportion of signal in the data is typically assumed to be low. Therefore, the proportion
of noise in the data is close to one, and so including any other estimate of π0 can at
most make the power estimate a little less conservative.
2.2.5 Power Estimators for Sample Size Calculation
We have proposed different choices for γ, p and π0. All possible combinations of these
choices can be observed in Table 1.
Table 1: Different estimators of power which arise from the various combinations of γ, p and π0.

                        γ = 1                                           γ < 1
              p = p∗                p = pl               p = p∗                 p = pl
π0 = 1 − π    p∗(α)(1−(1−π)α)/π    pl(α)(1−(1−π)α)/π    p∗(αγ)(1−(1−π)α)/π     pl(αγ)(1−(1−π)α)/π
π0 = h        p∗(α)(1−hα)/π        pl(α)(1−hα)/π        p∗(αγ)(1−hα)/π         pl(αγ)(1−hα)/π
π0 = 1        p∗(α)(1−α)/π         pl(α)(1−α)/π         p∗(αγ)(1−α)/π          pl(αγ)(1−α)/π
As can be observed in Table 1, we now have twelve different power estimators, all
of the same form as the asymptotic average power (top left cell in Table 1). It is clear
that this table could be extended even further: For example, it does not include the
possibility to use FDP exceedance control with the normal average power, and it could
probably be extended with any other possible estimators of Rmm and TDP. In this study,
we will only consider these twelve estimators. Which of these estimators is preferable
depends on the situation at hand. The estimator to use automatically follows from the
choices for its three different components.
The first choice to be made is which estimator to use for the proportion of true
null hypotheses. Some estimate of the amount of signal in the data is always required,
since this quantity is used as the denominator of each estimator. Such an estimate could
be derived from data, but it is also possible this is simply the belief of the researcher.
This estimate, π, can additionally be incorporated into the estimator for the TDP to
make the procedure less conservative, and obtain smaller sample size estimates. This
does, however, put a lot of confidence in π. If there are doubts surrounding the accuracy
of π, some conservativeness can be incorporated by not including an estimate of the
proportion of true null hypotheses, i.e. by setting π0 = 1. However, if the proportion
of signal in the data is large, this may be considered too conservative. Setting π0 = h
provides a middle-ground, since this quantity is close to 1 for small n, but close to 1 − π for large n. Additionally, h may be the most natural estimator to use when using FDP
exceedance control, due to its interpretation as an asymptotic confidence bound. When
not using FDP exceedance control, a separate confidence level for h needs to be specified.
The second choice to be made is whether to use an expectation for the TDP es-
timator (γ = 1), or a lower confidence bound (γ < 1). The first guarantees that,
asymptotically, the expected proportion of true discoveries among the discoveries is at
least 1−α. The second guarantees that, asymptotically, the proportion of true discover-
ies among the discoveries is at least 1−α with probability at least 1−γ. This essentially
comes down to the strength of the statements one wishes to make about the rejection
set. The second is a stronger statement, but requires a larger sample size for a fixed α.
The third choice to be made is whether to use p∗ or pl as an estimator of Rm/m. Here,
using pl provides a kind of finite-sample correction for the increased variance of Rm/m that
comes with a smaller number of hypotheses. Use of p∗ assumes an infinite number of
hypotheses. Use of pl guarantees that Rm/m > pl from some m onward, and results in a
high probability of Rm/m > pl for lower m.
The different components of these power estimators allow researchers to incorporate
different sources of conservativeness. We do expect, however, that incorporating several
sources of conservativeness simultaneously may have too much of an impact on the
resulting sample size. For example, although using both pl and γ = 0.1 guarantees
that, asymptotically, the power is at least the desired level with probability at least
1 − γ, we suspect that the resulting sample sizes may be too large for practical use.
In section 3.1, we will visually compare the different estimators of power from Table 1.
A subset of these estimators and their resulting sample size recommendations will be
further compared using simulations, as described in section 2.3.4.
2.3 Simulation Studies
2.3.1 Simulation Study 1: Sample Size and Number of Tests
As stated in section 1.6, we want to investigate how the asymptotic behavior of the
BH procedure relates to its performance in a more realistic setting. Asymptotic results
describe the behavior of the BH procedure when the number of tests is infinite. In
reality, however, the number of tests performed is always finite. The goal of this first
simulation experiment is to investigate how different values of m affect the behavior of
the BH procedure. Additionally, we will investigate if this behavior differs for different
levels of n, π and α.
We simulate independent normally distributed features and perform two-sided two-
sample t-tests. We assume a common fixed effect size of θ = 0.436. This value of θ is
based on a simulation study performed by Chi (2007), where the false nulls are assumed
to come from a t-distribution with 20 degrees of freedom, and noncentrality parameter
δ = 2. This corresponds to an effect size of roughly 0.436. We investigate the behavior
of the BH procedure for the following levels of m: 10, 50, 100, 1000, and 10000. We
consider 20 different sample sizes n between 4 and 100, the chosen levels being closer
together for lower n, and more spaced out for higher n. In particular, we consider
sample sizes between 4 and 20 in increments of 2, between 20 and 40 in increments of
4, and between 40 and 100 in increments of 10. For the proportion of false nulls π, we
consider the levels 0.1, 0.5 and 0.9. Note that π in this context refers to the population
proportion of false nulls as in (2). The realized proportion of false nulls may vary from
data set to data set. For the FDR control level α, we again consider these same levels
0.1, 0.5 and 0.9. For each combination of the experimental factors, we perform a fixed
number of simulations. In principle, the number of simulations is set to 1000, however,
for m = 10000, we set the number of simulations to 100 to allow the experiment to be
completed within a reasonable time.
In terms of outcome measures, we are primarily interested in the quantity E(Rm/m),
the expected proportion of rejected hypotheses. For each combination of the experimental
factors we can calculate a simulation estimate of this quantity, which we will denote p̄,
and which can be compared to p∗. We know that Rm/m → p∗ as m → ∞, but we do not
know E(Rm/m) for finite m. Secondary outcome measures include observed average power,
observed FDR, and observed pFDR. Note that for low π and low m, it is possible for a
simulated data set to not include any false nulls, in which case the power is undefined.
We will therefore look at the expected proportion of discovered false nulls, given that
there was at least one false null to discover, which we will denote the positive power.
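A single replicate of this design can be sketched as follows (Python with NumPy/SciPy rather than the R used for the actual study; for brevity we fix σ = 1 and equal group sizes, so the mean shift θσ/√(p1p2) equals 2θ):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_replicate(m, n, pi, alpha, theta=0.436):
    """Simulate m independent features, perform two-sided two-sample t-tests,
    apply BH at FDR level alpha, and return the proportion rejected, R_m / m."""
    n1 = n // 2
    false_null = rng.random(m) < pi               # realized false nulls vary around pi
    shift = 2.0 * theta                           # theta * sigma / sqrt(p1 * p2), sigma = 1
    y1 = rng.standard_normal((m, n1))
    y2 = rng.standard_normal((m, n - n1)) + np.where(false_null, shift, 0.0)[:, None]
    pvals = stats.ttest_ind(y1, y2, axis=1).pvalue
    # BH step-up on the m p-values
    sp = np.sort(pvals)
    crit = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(sp <= crit)[0]
    r = below[-1] + 1 if below.size else 0
    return r / m
```

Averaging this quantity over many replicates, for each cell of the design, yields the simulation estimate of E(Rm/m).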
2.3.2 Simulation Study 2: Dependence
In addition to an infinite number of tests, the quantities α∗, p∗ and u∗ assume indepen-
dent test-statistics (Chi, 2007). Although the BH procedure controls the FDR under
various common forms of positive dependence (Goeman & Solari, 2014), it does not
necessarily reject the same number of hypotheses. The goal of this second simulation
experiment is to investigate how different dependency structures influence the behavior
of the BH procedure, and how this behavior compares to the asymptotic results of Chi
(2007). We will consider roughly two different scenarios. In the first scenario the features
are equicorrelated, that is, given the experimental factor, the population correlation ρ
between all features is the same. In the second scenario we specify three different cor-
relations: ρ0 is the correlation between all true nulls, ρ1 is the correlation between all
false nulls, and ρ01 is the correlation between each true and false null. In both scenarios
we will consider only positive correlations. For the dependency structures used in this
simulation study, the FDR is assumed to be controlled (Goeman & Solari, 2014).
For the first scenario we will consider all possible values of ρ from 0 to 0.9, in
increments of 0.1. We perform two-sided two-sample t-tests and apply the BH procedure
with α = 0.1. We set π = 0.1, θ = 0.436 and m = 1000. We consider the same values
of n as in the first simulation experiment, and for each combination of the experimental
factors we perform 1000 simulations. Correlated features are simulated using a simple
linear regression model. For every simulation i ∈ {1, . . . , 1000}, each feature is generated by:
Yi,j = √ρ zi + εi,j, (24)
with zi a vector of length n drawn from a standard normal distribution and εi,j ∼ N(0, 1 − ρ),
so that each feature has unit variance and pairwise correlation ρ. Then, the features are
split into two groups, each of size n/2, and for each false null, the mean of the second
group is shifted by the quantity θσ/√(p1p2) or −θσ/√(p1p2), each with probability 0.5.
We record the following outcome measures: observed E(Rm/m),
observed FDR, observed pFDR and observed positive power.
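The generation step in (24) can be sketched as follows. This is our own minimal version, assuming the idiosyncratic noise has variance 1 − ρ so that each feature has unit variance and common pairwise correlation ρ; the function name is illustrative.

```python
import numpy as np

def equicorrelated_features(n, m, rho, rng):
    """n-by-m matrix of unit-variance features with common pairwise
    correlation rho, via the shared-factor model in (24)."""
    z = rng.standard_normal((n, 1))        # shared factor: one value per observation
    eps = rng.standard_normal((n, m))      # idiosyncratic noise
    return np.sqrt(rho) * z + np.sqrt(1.0 - rho) * eps

rng = np.random.default_rng(0)
Y = equicorrelated_features(2000, 5, rho=0.5, rng=rng)
# Empirical pairwise correlations between columns cluster around 0.5.
```

The group-mean shift for the false nulls is then applied to the second half of the rows exactly as in the independent case.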
For the second scenario we will consider the following correlation structures: Only
the true nulls are correlated (ρ0 = 0.9, ρ1 = 0, ρ01 = 0), only the false nulls are correlated
(ρ0 = 0, ρ1 = 0.9, ρ01 = 0), or both the true and false nulls are independently correlated
(ρ0 = 0.9, ρ1 = 0.9, ρ01 = 0). For reference, we will additionally consider independence
(ρ0 = 0, ρ1 = 0, ρ01 = 0) and equicorrelation (ρ0 = 0.9, ρ1 = 0.9, ρ01 = 0.9). For every
simulation i ∈ {1, . . . , 1000}, the features are generated by:
Yi = ZiCi, (25)
with Zi an n by m matrix of standard normally-distributed values, and Ci the Cholesky
decomposition of the correlation matrix. Again, the features are split into two groups,
each of size n/2, and for each false null, the mean of the second group is shifted. We
again record observed E(Rm/m), observed FDR, observed pFDR and observed positive
power.
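The second scenario can be sketched as below. The thesis multiplies Z by the Cholesky decomposition of the correlation matrix as in (25); here we use the lower-triangular factor L with C = LLᵀ and form Z Lᵀ, which yields the same covariance. Function names are ours.

```python
import numpy as np

def block_correlation(m0, m1, rho0, rho1, rho01):
    """Correlation matrix with rho0 among the m0 true nulls, rho1 among
    the m1 false nulls, and rho01 between the two blocks."""
    m = m0 + m1
    C = np.full((m, m), rho01)
    C[:m0, :m0] = rho0
    C[m0:, m0:] = rho1
    np.fill_diagonal(C, 1.0)
    return C

def correlated_features(n, C, rng):
    """Draw n observations with correlation matrix C, as in (25)."""
    L = np.linalg.cholesky(C)                 # C = L L^T
    Z = rng.standard_normal((n, C.shape[0]))  # i.i.d. standard normals
    return Z @ L.T                            # rows have covariance C

rng = np.random.default_rng(0)
C = block_correlation(8, 2, rho0=0.9, rho1=0.9, rho01=0.0)
Y = correlated_features(5000, C, rng)
```

For the structures used here (equicorrelation blocks with ρ < 1 and zero cross-correlation) the matrix C is positive definite, so the Cholesky factor exists.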
2.3.3 Simulation Study 3: Simulations Based on Real Data
In this simulation experiment, we investigate how the BH procedure behaves in a setting
derived from real data. We base our setting on the Geuvadis RNA sequencing data set
of 465 human lymphoblastoid cell line samples (Lappalainen et al., 2013). The data
consists of 633 runs, which we will treat as independent observations. We obtained the
data through the recount package (Collado-Torres et al., 2016), and transformed the
data to normalized (Robinson & Oshlack, 2010) log counts per million using the edgeR
package (Robinson, McCarthy, & Smyth, 2010). The full data contains 58037 features,
but no experimental factor.
We first create an experimental factor with two levels from the data by using
two genes, Xist (X-inactive specific transcript) and RPS4Y1 (40S ribosomal protein
S4, Y isoform 1), to determine gender. After creating the factor, we remove the Xist
and RPS4Y1 features from the data, as well as any features with fewer than 633 unique
observations. The resulting data set now consists of 327 male and 336 female observations
on 17966 features.
In order to measure quantities like observed power and FDR, we need to know
which features are the true nulls, and which are the false nulls. However, in a real
data set such as the one we are using here, the identity of the true and false nulls is
unknown. To enable us to distinguish between true and false nulls we will generate a
new experimental factor and add corresponding effects to the data. We perform two-sided
two-sample t-tests on all 17966 features, and select the 10% lowest p-values. The
corresponding features have a mean standardized absolute effect size θ of 0.158, with a
standard deviation of 0.134. These features form the basis for the set of false nulls.
We sample as follows: First, we take a random sample of n rows and m columns
from the complete data set. We independently generate a two-class factor, with n/2
observations of each class. Since the experimental factor is generated independently,
any observed differences between the two groups are caused solely by random sampling.
Next, we draw a random effect size θ from a normal distribution with mean µθ and
standard deviation σθ. For each feature that belongs to the set of false nulls, the mean
of the second group is shifted by the quantity θσ/√(p1p2) or −θσ/√(p1p2), each with
probability 0.5.
This sampling method keeps the correlation structure within each group intact, but the
identity of the true and false nulls is now known. In addition to the correlation structure,
the samples inherit other qualities characteristic of real data from the original source.
For example, there is some unaccounted-for dependence between the observations, and
the features are not necessarily truly normally distributed.
Where we previously used a fixed effect size, we now generate effect sizes from a
distribution. We will assess the performance of the BH procedure with α = 0.1 under
three different distributions, all with mean µθ = 0.158. We consider the following values
for the standard deviation: σθ = 0.134, σθ = 0.067, and σθ = 0 (i.e. a fixed effect size).
We consider a grid of values for n between 4 and 600. For each value of n, we draw 100
samples of size m = 1000, and record observed E(Rm/m), observed FDR, observed pFDR
and observed positive power.
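The sampling scheme can be sketched as below. We do not reproduce the Geuvadis matrix here; `X` stands in for any preprocessed data matrix, and the function name and arguments are our own illustration of the steps described above.

```python
import numpy as np

def resample_with_effects(X, n, m, pi, mu_theta, sigma_theta, rng):
    """Draw n rows and m columns from data matrix X, assign an independent
    two-class factor, and shift a random pi-fraction of features (the false
    nulls) in the second class by +/- theta * sigma / sqrt(p1 * p2)."""
    rows = rng.choice(X.shape[0], size=n, replace=False)
    cols = rng.choice(X.shape[1], size=m, replace=False)
    Y = X[np.ix_(rows, cols)].astype(float)
    group = np.zeros(n, dtype=bool)
    group[rng.choice(n, size=n // 2, replace=False)] = True  # second class
    false_null = rng.random(m) < pi
    p1 = p2 = 0.5                          # equal class sizes
    for j in np.nonzero(false_null)[0]:
        theta = rng.normal(mu_theta, sigma_theta)   # random effect size
        sign = rng.choice([-1.0, 1.0])              # shift direction
        Y[group, j] += sign * theta * Y[:, j].std() / np.sqrt(p1 * p2)
    return Y, group, false_null
```

Because the factor is generated independently of the data, every feature outside the shifted set is a true null by construction, while the correlation structure of the original data is preserved within each group.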
2.3.4 Simulation Study 4: Comparison of Sample Size Estimates
In this simulation experiment, we will investigate how some of the power estimators
described in section 2.2 perform in terms of sample size calculation. We first specify
some desired power level ω. Next, we calculate the minimum n such that the value of
the estimator at hand is greater than, or equal to, this desired level. We then perform
a number of simulations for this value of n and compare the observed power with our
desired level.
We will not consider all twelve estimators shown in Table 1. Instead, we select four
of them for use in this experiment. The first estimator is the asymptotic expression for
the average power: p∗(α)(1 − (1 − π)α)/π. The second estimator uses the lower LIL
bound pl instead of p∗: pl(α)(1 − (1 − π)α)/π. The third estimator uses a confidence
bound for the TDP based on FDP exceedance control: p∗(αγ)(1 − hα)/π. The fourth
estimator combines FDP exceedance control with the LIL bound: pl(αγ)(1 − hα)/π. In
addition, we will evaluate a fifth sample size, namely the minimum n such that α∗ < α.
As in simulation study 1, we simulate independent normally distributed features
and perform two-sided two-sample t-tests. For each combination of n and the other
experimental factors, we generate 100 samples of size m = 5000. Each feature has
probability π = 0.1 of corresponding to a false null hypothesis. Effect sizes are sampled
from a normal distribution with mean µθ = 0.436 or µθ = 0.158, and standard deviation
σθ = 0 or σθ = 0.134. We consider three possible levels of ω: 0.8, 0.5 and 0.1. We
set α = 0.1 and γ = 0.1. In the calculations we assume the true parameter values are
known, setting the assumed proportion of false nulls equal to π and the assumed effect
size equal to µθ. We record the following outcome measures: observed E(Rm × TDP/m1),
observed P(Rm × TDP/m1 > 0), and observed P(Rm × TDP/m1 ≥ ω).
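The search for the minimum n can be sketched as follows. The power function used here is an illustrative normal-approximation power for a single two-sided two-sample test (our own simplification, standing in for the four estimators above, which are likewise nondecreasing in n); the function names are ours.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(n, theta=0.436):
    """Normal-approximation power of a two-sided two-sample test at level
    0.05, with standardized effect theta and two equal groups of size n/2."""
    delta = theta * math.sqrt(n) / 2.0   # approximate noncentrality
    z = 1.959964                         # z quantile at 0.975
    return normal_cdf(delta - z) + normal_cdf(-delta - z)

def minimum_sample_size(power_fn, omega, n_max=10_000):
    """Smallest even n >= 4 with power_fn(n) >= omega, assuming power_fn is
    nondecreasing in n; None if omega is not reached by n_max."""
    for n in range(4, n_max + 1, 2):     # even n: two groups of size n/2
        if power_fn(n) >= omega:
            return n
    return None

n_req = minimum_sample_size(approx_power, 0.8)
```

The observed power at the resulting n can then be compared with the desired level ω, as described above.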
3 Results
3.1 Comparison of Power Estimators
In section 2.2, we described several power estimators based on the asymptotic expression
for the average power. Figure 3 shows the resulting power estimates for different values
of n, m and π.
Figure 3: Power estimates for the BH procedure with FDR control parameter α = 0.1,
plotted against n for the four combinations of π ∈ {0.1, 0.5} and m ∈ {5000, 100000}.
Hue indicates whether γ = 1 or γ = 0.1. Tone indicates whether p = p∗ or p = pl. Line
type indicates the value of π0 (1 − π, h, or 1). The confidence level used for π0 = h is 90%.
Each of the power estimators is a combination of its three components. The largest
contrast seen in Figure 3 is between the situation where no FDP exceedance control
is applied (green lines), and the situation where the FDP exceedance probability is
controlled at 10% (red lines). Desiring P(FDP ≥ 0.1) ≤ 0.1 is a stronger form of false
discovery control than desiring FDR ≤ 0.1, and thus requires a larger sample size.
When m = 5000, there is a clear difference between use of p∗ (light tone) and pl (dark
tone). Since pl is a lower bound, the resulting power estimates are more conservative.
For m = 100000, the difference is far less noticeable. This is because, as m → ∞, pl
approaches p∗.
When π = 0.5, the difference between the estimators of π0 can be observed. The
largest difference is between the use of some estimate of π0 (h or 1 − π), and simply
setting π0 = 1. When π = 0.1, the observed difference is smaller. This is due to the fact
that, when the proportion of signal in the data is small, the proportion of noise is close
to one, so there is less benefit to incorporating an estimate of π0.
3.2 Sample Size and Number of Tests
The goal of this simulation experiment was to investigate the performance of the BH
procedure for finite m. The primary results in terms of E(Rm/m) can be observed in
Figure 4.
Figure 4: Primary results of the first simulation experiment. Each panel shows, for a
combination of π (0.1, 0.5 or 0.9) and α (0.1, 0.5 or 0.9), the expected proportion of
rejected hypotheses as a function of n. In each panel, the black line indicates p∗: the
proportion of hypotheses that we would asymptotically reject for the given sample size.
The colored crosses indicate, for each m (10, 50, 100, 1000 or 10000), the quantity p: the
observed proportion of rejected hypotheses, averaged across the (1000 or 100) simulations.
Figure 4 clearly shows the convergence of p to p∗ as m increases, since we observe
that for higher m, the observed p’s are closer to the line representing p∗. The rate of
convergence appears to vary for the different levels of π, α, and n. For example, when
π = 0.5 and α = 0.1, we see that for lower n there is a clear difference in the expected
proportion of rejected hypotheses for the different levels of m, but for higher n this
difference nearly vanishes. Conversely, when π = 0.1 and α = 0.9, the differences in p
remain large even for high n.
It can be observed in Figure 4 that convergence of p to p∗ appears to always happen
from above, that is, when m is smaller, we reject on average more hypotheses than we
would asymptotically reject. This is especially noticeable for the combinations of high
α and low π, but is also visible to a lesser extent in the other panels for lower n. For a
combination of high m and high n, p approximates p∗. Another interesting observation
is that for some conditions, particularly when π = 0.1 and α = 0.9, as n→∞, p appears
to converge to something other than p∗ for finite m. Although Figure 4 only shows values
of n up to 100, the observed differences appear to remain constant even for extremely
large sample sizes (e.g. n = 106).
It should be noted that FDR control levels of 0.5 or 0.9 are never used in practice.
Therefore, the left column of Figure 4 is the most interesting from a practical perspective.
In particular, the situation with π = 0.1 and α = 0.1 (top left) is the most realistic of
the experimental conditions. We will therefore examine this condition in more detail.
Figure 5 shows the primary and secondary outcome measures for the condition
with π = 0.1 and α = 0.1. It can again be observed that as m increases, p approaches
p∗ from above. The observed FDR (i.e. the observed proportion of false discoveries
among the discoveries, averaged over the simulations) fluctuates closely around (1−π)α.
There is less variability in the FDR estimates for large n as compared to small n. For a
combination of large n and large m, the variability of the FDR estimates around (1−π)α
is very small. Although the BH procedure controls the FDR regardless of sample size or
the number of tests, this is not the case for the pFDR, since for small m and/or n, the
pFDR does not equal the FDR. It can be observed in the bottom-right panel of Figure
5 that for large n, the BH procedure controls the pFDR at level (1− π)α for all values
of m except m = 10. If m is large, a smaller sample size is required to obtain pFDR
control as compared to when m is small.
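The distinction between observed FDR and observed pFDR comes down to how simulations with zero rejections are handled, which can be made explicit in a short helper (our own naming): the FDR averages the FDP with the convention 0/0 = 0, while the pFDR averages only over simulations with at least one rejection.

```python
import numpy as np

def observed_fdr_and_pfdr(V, R):
    """V: false discoveries per simulation; R: total rejections per simulation.
    Returns (observed FDR, observed pFDR)."""
    V, R = np.asarray(V, float), np.asarray(R, float)
    fdp = np.where(R > 0, V / np.maximum(R, 1.0), 0.0)  # FDP with 0/0 = 0
    fdr = fdp.mean()
    pfdr = fdp[R > 0].mean() if (R > 0).any() else float("nan")
    return fdr, pfdr

# Example: two of four simulations reject nothing.
fdr, pfdr = observed_fdr_and_pfdr([0, 1, 0, 2], [0, 5, 0, 4])
# fdr = 0.175, pfdr = 0.35
```

When m and n are small, many simulations have R = 0, which is exactly why the observed pFDR can exceed the observed FDR in Figure 5.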
It can be observed in the bottom-left panel of Figure 5 that for lower m, the observed
average positive power is higher than the asymptotic power when n is small, but lower
than the asymptotic power when n is large. This is especially noticeable when m = 10,
and to a lesser extent for the larger values of m. When m = 1000 or m = 10000 the
observed average positive power is very close to the asymptotic power.
Figure 5: Clockwise, from the top left: observed average proportion of rejected hypotheses,
observed FDR, observed pFDR, and observed average positive power when π = 0.1 and
α = 0.1. Note that the y-axis does not have the same scale in each panel.
As described in section 2.3.2, π denotes the population fraction of false nulls, that
is, the probability of a hypothesis to be a false null. The realized fraction of false
nulls will vary from sample to sample. Instead of fixing the population fraction, we
could decide to fix the sample fraction m1/m, so that each sample has exactly 10% signal.
Although this does not really correspond with the mixture model in (2), it is interesting
to know what the effect is of fixing the sample fraction rather than the population
fraction, since we may want researchers to make statements about properties of their
data when conducting sample size calculations. Figure 6 shows the results under the
same conditions as in Figure 5, but with the sample fraction of false nulls fixed at 0.1.
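The two sampling schemes differ only in how the set of false nulls is drawn, which can be made explicit in a small helper (our own naming):

```python
import numpy as np

def false_null_mask(m, pi, fix_sample_fraction, rng):
    """Population fraction: each hypothesis is independently a false null
    with probability pi, so m1 is Binomial(m, pi) and varies per data set.
    Sample fraction: exactly round(pi * m) false nulls in every data set."""
    if fix_sample_fraction:
        mask = np.zeros(m, dtype=bool)
        mask[rng.choice(m, size=round(pi * m), replace=False)] = True
        return mask
    return rng.random(m) < pi
```

Under the fixed sample fraction, the randomness in m1 is removed, which is one source of the reduced variance seen in Figures 6 and 7.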
Figure 6: Clockwise, from the top left: observed average proportion of rejected hypotheses,
observed FDR, observed pFDR, and observed average positive power when m1/m = 0.1
and α = 0.1. Note that the y-axis does not have the same scale in each panel.
In terms of p and observed FDR the results are comparable. However, we see that
for a fixed sample fraction, the pFDR converges to (1 − π)α more quickly than when we
specify the population proportion. The largest difference occurs in the observed power.
If we specify π, then for small m, the observed power is higher than the asymptotic power
if n is low, but lower than the asymptotic power if n is high. If we specify m1/m itself,
then for small m, the power is always higher than the asymptotic power.
In both cases, the average proportion of rejected hypotheses is higher than p∗ for
small m; however, a smaller m is associated with an increase in the variance of Rm/m.
The top row of Figure 7 shows an example of this, with α = 0.1 and n = 50. For low
m, the distributions of Rm/m are wide and skewed, whereas for high m, they are narrow
and approach symmetry. This implies that although the expected proportion of rejected
hypotheses is larger for low m, there is also a greater chance of rejecting a proportion of
hypotheses much smaller than p∗. For the parameters used in Figure 7, p∗ = 0.066,
which means we asymptotically reject 6.6% of all hypotheses. If π = 0.1, then for the
different levels of m, the observed probabilities of rejecting less than 5% of all hypotheses
(about 0.75p∗) are 0.461, 0.373, 0.263, 0.062 and 0 respectively. It can be observed in
the figure that fixing the sample fraction of false nulls decreases the variance of the
proportion of rejected hypotheses. If m1/m = 0.1, then for the different levels of m, the
observed probabilities of rejecting less than 5% of all hypotheses are 0.319, 0.253, 0.161,
0.018 and 0 respectively.
Figure 7 also shows the quantity pl which, as suggested in section 2.2, can be used
as a conservative estimate of the proportion of rejected hypotheses. If π = 0.1, then
for the different levels of m, the observed probabilities of rejecting more than pl of the
hypotheses are 1, 1, 0.986, 0.98 and 0.97 respectively. For a fixed m1/m, these probabilities
are, respectively, 1, 1, 0.996, 0.998 and 1. It should be noted that for m = 10 and
m = 50, pl is negative, meaning it cannot be used in practice. For m = 100 it is positive
but very small (0.004). However, as m grows, pl approaches p∗, so for moderate to high
values of m we may base sample size calculations on pl. The bottom row of Figure 7 shows
the distributions of the positive power. Fixing the sample fraction of false nulls decreases
the variance in the observed power. It can be observed that the probability of the power
being at least pl(1 − (1 − π)α)/π is high in both settings. If π = 0.1, then for the different
levels of m, the observed probabilities of the power being at least this high are 1, 1, 0.985,
0.996 and 1 respectively. For a fixed m1/m, these probabilities are, respectively, 1, 1, 0.996,
0.997 and 1.
Figure 7: Top: violin plots of the distributions of Rm/m for each m, with α = 0.1 and
n = 50. Bottom: violin plots of the distributions of the positive power for each m, with
α = 0.1 and n = 50. In the left column, the population proportion of false nulls is set
to 0.1. In the right column, the sample fraction of false nulls is fixed at 0.1. Each violin
consists of a box plot with a (vertical) kernel density on each side. Note that the kernel
densities are scaled to have equal maximum width. The quantity pl is equal to p∗ minus
the LIL bound given in (23). The plot was created using the vioplot package (Adler, 2005).
3.3 Dependence
The goal of this simulation experiment was to investigate the performance of the BH
procedure when the features are correlated. The results for the equicorrelated scenario
can be observed in Figure 8.

Figure 8: Clockwise, from the top left: observed average proportion of rejected hypotheses,
observed FDR, observed pFDR, and observed average positive power for values of ρ from
0 to 0.9, with π = 0.1, α = 0.1, θ = 0.436, and m = 1000. Note that the y-axis does not
have the same scale in each panel.
As can be observed in Figure 8, a higher correlation between the features is asso-
ciated with, on average, a larger number of rejected hypotheses. A higher correlation
is also associated with a lower observed FDR. When the variables are highly correlated,
the pFDR converges to a value below (1 − π)α, but it converges more slowly than when
the variables are uncorrelated; that is, a higher sample size is required for the pFDR to
drop below α. In terms of positive power we see that, for low n, a high correlation is
associated with higher power, whereas for higher n, a high correlation is associated with
lower power. For very high n, the differences are very small. For lower n, the observed
differences in power are larger, but even when the features are very highly correlated
(ρ = 0.9), the observed decrease in power is small compared to the increase in corre-
lation. For example, if n = 36, then under independence we observe an average power
of 0.328, whereas if the features are equicorrelated with ρ = 0.9, we observe an average
power of 0.275. For higher n, these differences are even smaller.
Figure 9 shows the results of the scenario where there is a different correlation
between the true nulls (ρ0), between the false nulls (ρ1), and between each true and false
null (ρ01).
Figure 9: Clockwise, from the top left: observed average proportion of rejected hypotheses,
observed FDR, observed pFDR, and observed average positive power for the different
combinations of ρ0, ρ1, and ρ01. Note that if the legend states only ρ0 = 0.9, this implies
the other correlation parameters are zero. Under independence all correlation parameters
are zero, and under equicorrelation they are all 0.9. The values used for the other
parameters are: π = 0.1, α = 0.1, θ = 0.436, and m = 1000. Note that the y-axis does
not have the same scale in each panel.
In terms of the proportion of rejected hypotheses there is a clear contrast between
the three situations where the true nulls are correlated (ρ0 = 0.9), and the two situations
where they are not. If only the false nulls are correlated, the average proportion of
rejected hypotheses is somewhat larger than under independence for low n, and slightly
smaller for high n, but is in general fairly close to p∗. The three other correlation
structures all lead to a proportion of rejected hypotheses that is higher than p∗ for all
n. For the FDR we again see this same contrast: If only the false nulls are correlated
the observed FDR is similar to independence, but when the true nulls are correlated,
the observed FDR is much lower. If the true nulls are highly correlated, either many of
the true nulls are rejected, or none at all (results not shown). However, the observed
probability of rejecting many true nulls is far lower than the probability of rejecting
none of them. For example, with ρ0 = 0.9, ρ1 = 0 and ρ01 = 0, we observe a 92.3%
probability of rejecting none of the true nulls.
In the bottom row of Figure 9, we again see a contrast between the situation where
only the true nulls are correlated (ρ0 = 0.9, ρ1 = 0, and ρ01 = 0), and the situation
where only the false nulls are correlated (ρ0 = 0, ρ1 = 0.9, and ρ01 = 0). If only
the true nulls are correlated, the pFDR converges to some value lower than (1 − π)α,
similar to the equicorrelation situation. However, this convergence occurs much more
quickly, even more quickly than under independence; that is, a lower n is required to
obtain pFDR control compared to when all nulls are independent.
This is in contrast to the situation where only the false nulls are correlated. In this case,
the pFDR converges to (1 − π)α, but it requires the highest n out of all conditions to
obtain pFDR control.
In terms of power we see that, when only the true nulls are correlated, the average
power is very close to the asymptotic power, but when only the false nulls are correlated,
the behavior of the BH method in terms of power is more similar to equicorrelation than
independence. This is the opposite of what we see in terms of the proportion of rejected
hypotheses, where if the true nulls are correlated behavior is similar to independence,
but if the false nulls are correlated, behavior is similar to equicorrelation. The observed
differences between the situation where both the false and true nulls are independently
correlated (ρ0 = 0.9, ρ1 = 0.9, and ρ01 = 0), and the situation with equicorrelation
appear small in terms of all outcome measures.
The previously described contrasts between the situations where only the true or
false nulls are correlated also appear in the distributions of the proportion of rejected
hypotheses and power, as can be observed in Figure 10. The figure additionally shows
that under equicorrelation, an increase in ρ leads to an increase in the variance of the
power distribution. The interquartile ranges for the values of ρ from 0 to 0.9 are,
respectively, 0.093, 0.096, 0.096, 0.118, 0.129, 0.142, 0.171, 0.210, 0.220, and 0.246.
The observed probabilities of the power being at least as high as the power estimator
incorporating pl for the different values of ρ are, respectively, 0.994, 0.994, 0.998, 0.983,
0.973, 0.954, 0.937, 0.906, 0.877, and 0.869. So for ρ ≤ 0.5 we observe that the power
is higher than this bound more than 95% of the time. When only the true nulls are
correlated with ρ = 0.9, this observed probability is 0.992, whereas if only the false nulls
are correlated, it is 0.894. The corresponding interquartile ranges are, respectively, 0.093
and 0.252.
Figure 10: Top: violin plots of the distributions of Rm/m for each correlation structure,
with α = 0.1, π = 0.1, m = 1000, and n = 50. Bottom: violin plots of the distributions
of the positive power. In the left column, all features share the same correlation. In
the right column, the correlations between the true nulls (ρ0), between the false nulls
(ρ1), and between each true and false null (ρ01), are specified separately. Note that the
kernel densities are scaled to have equal maximum width. The quantity pl is equal to p∗
minus the LIL bound given in (23). The plot was created using the vioplot package
(Adler, 2005).
3.4 Simulations Based on Real Data
The goal of this simulation experiment was to investigate the performance of the BH
procedure in a setting derived from real data. The results of this experiment can be
observed in Figure 11.
[Figure 11 appears here: four scatter panels plotting, against n, the observed proportion rejected, observed FDR, observed average positive power, and observed pFDR, each for σθ = 0, σθ = .067, and σθ = .134, with reference lines at p∗(µθ), p∗(µθ)(1−(1−π)α)/π, (1 − π)α, and α.]
Figure 11: Clockwise, from the top left: Observed average proportion of rejected hypotheses, observed FDR, observed pFDR, and observed average positive power, with α = 0.1, π = 0.1, and m = 1000. In the left-hand panels, the dashed lines indicate p∗ calculated using the 45th and 55th percentiles of the effect size distribution with σθ = 0.134, rather than the mean. The dotted lines indicate p∗ calculated using π = 0.08 and π = 0.12. Note that the y-axis does not have the same scale in each panel.
Figure 11 shows that, for a fixed effect size, the proportion of rejected hypotheses
and observed power are very close to the expected values. When the effect size is
not fixed, it appears both the proportion of rejected hypotheses and power are higher
than expected for low n, but lower than expected for high n. This effect appears more
pronounced as the variance of the effect size distribution increases. It appears that
underestimating the proportion of signal or mean effect size in the data (shown in the
left panels of Figure 11 as the lower dotted and dashed lines respectively) can guard
against the possibility of lower power to some extent, although this is limited by the
shape of the power curve. That is, under an effect size distribution, the power as a
function of n has a different shape than the power curve based on the assumption of
a fixed effect size. Underestimating π or θ can provide some conservativeness, but the
resulting theoretical power curve does not accurately reflect the shape of the true power
function.
Figure 12 shows the distributions of the observed proportion of rejected hypotheses
and power for a fixed value of n. A higher value of σθ is associated with a slightly smaller
variance of both the proportion of rejected hypotheses and power distributions, but with
more extreme outliers in the proportion of rejected hypotheses.
[Figure 12 appears here: violin plots of the observed proportion rejected (reference lines p∗(µθ) and pl(µθ)) and of the observed positive power (reference lines p∗(µθ)(1−(1−π)α)/π and pl(µθ)(1−(1−π)α)/π), for σθ = 0, σθ = .067, and σθ = .134.]
Figure 12: Top: Violin plots of the distributions of Rm/m for each effect size distribution with n = 300, α = 0.1, π = 0.1, and m = 1000. Bottom: Violin plots of the distributions of the positive power. Note that the kernel densities are scaled to have equal maximum width. The quantity pl is equal to p∗ minus the LIL bound given in (23). The plot was created using the vioplot package (Adler, 2005).
3.5 Comparison of Sample Size Estimates
The goal of this simulation experiment was to compare several of the power estimators
from section 2.2 in the context of sample size calculation. Given estimates for π, π0, and
θ, we calculated the minimum n such that the selected power estimate was greater than
the desired level ω. We then performed simulations for this value of n, and recorded
observed average power, the probability that the power is greater than zero, and the
probability that the power is equal to or greater than ω. Results for ω = 0.8 and a fixed
effect size can be observed in Table 2.
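The search for the minimum n can be sketched as a simple scan over sample sizes. The sketch below is illustrative only: power_stand_in is a hypothetical one-sided two-sample z-test power at a fixed per-test p-value threshold of 0.001, used here as a stand-in for the estimators of section 2.2 (which this code does not implement), and the parameter values are assumptions.

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def norm_ppf(q):
    # invert the standard normal CDF by bisection
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if norm_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def power_stand_in(n, theta=0.436, threshold=0.001):
    # one-sided two-sample z-test power at a fixed p-value threshold,
    # with n observations per group (illustrative stand-in only)
    return norm_cdf(theta * math.sqrt(n / 2) - norm_ppf(1 - threshold))

def min_sample_size(power_fn, omega, n_max=10000):
    # smallest n whose estimated power reaches the desired level omega
    for n in range(2, n_max + 1):
        if power_fn(n) >= omega:
            return n
    return None

n_req = min_sample_size(power_stand_in, omega=0.8)
```

Substituting one of the power estimators from section 2.2 for power_stand_in yields the kind of sample size calculation underlying Table 2.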
Table 2: Sample size estimates and observed power for a number of estimators, using ω = 0.8 and σθ = 0 (i.e. a fixed effect size). Note that for sample size calculations based on the criticality phenomenon (α∗ < α), ω is not a determinant of the sample size. Observed average power, probability of at least one correct rejection, and probability that the power reaches or exceeds the desired level are based on 100 simulations per combination of the experimental factors.
θ       method                n      E(Rm×TDP/m1)   P(Rm×TDP/m1 > 0)   P(Rm×TDP/m1 ≥ 0.8)
0.436   α∗ < α                16     0.0047         0.60               0
        p∗(α)(1−(1−π)α)/π     68     0.81           1                  0.72
        pl(α)(1−(1−π)α)/π     82     0.90           1                  1
        p∗(αγ)(1−hα)/π        112    0.98           1                  1
        pl(αγ)(1−hα)/π        142    0.99           1                  1
0.158   α∗ < α                36     0.0004         0.15               0
        p∗(α)(1−(1−π)α)/π     484    0.80           1                  0.60
        pl(α)(1−(1−π)α)/π     592    0.90           1                  1
        p∗(αγ)(1−hα)/π        806    0.97           1                  1
        pl(αγ)(1−hα)/π        1032   0.99           1                  1
As can be observed in the table, sample sizes calculated using the asymptotic
expression for the average power do indeed lead to an observed average power at least
as high as the desired level (0.8). Using any of the more conservative estimators leads
to an observed P(Rm × TDP/m1 ≥ 0.8) = 1. Sample size calculations based on the criticality
phenomenon indeed lead to a positive average power, but this power can be very small
(e.g. 0.0004) and the probability that the power is greater than zero need not be large
(e.g. 0.15). Table 3 shows the results of using the same sample size estimates, but now
the effect sizes follow a distribution with σθ = 0.134.
Table 3: Sample size estimates and observed power for a number of estimators, using ω = 0.8 and σθ = 0.134. Note that for sample size calculations based on the criticality phenomenon (α∗ < α), ω is not a determinant of the sample size. Observed average power, probability of at least one correct rejection, and probability that the power reaches or exceeds the desired level are based on 100 simulations per combination of the experimental factors.
θ       method                n      E(Rm×TDP/m1)   P(Rm×TDP/m1 > 0)   P(Rm×TDP/m1 ≥ 0.8)
0.436   α∗ < α                16     0.0165         0.88               0
        p∗(α)(1−(1−π)α)/π     68     0.72           1                  0
        pl(α)(1−(1−π)α)/π     82     0.79           1                  0.30
        p∗(αγ)(1−hα)/π        112    0.87           1                  1
        pl(αγ)(1−hα)/π        142    0.92           1                  1
0.158   α∗ < α                36     0.0082         0.83               0
        p∗(α)(1−(1−π)α)/π     484    0.62           1                  0
        pl(α)(1−(1−π)α)/π     592    0.66           1                  0
        p∗(αγ)(1−hα)/π        806    0.71           1                  0
        pl(αγ)(1−hα)/π        1032   0.75           1                  0
For sample sizes based on the criticality phenomenon we observe an increase in
average power, as well as an increase in the probability of at least one correct rejection.
For all other sample sizes, performance is worse compared to the situation with a fixed
effect size. When the variance of the effect size distribution is not so large compared to
its mean (θ = 0.436), sample size calculations based on the asymptotic average power or
the LIL bound lead to an observed average power which is lower than the desired 0.8.
Sample size calculations that incorporate FDP exceedance control, however, still lead to
an observed P(Rm × TDP/m1 ≥ 0.8) = 1. When the variance of the effect size distribution is
large compared to its mean (θ = 0.158), this is no longer the case and we observe that
even if FDP exceedance control is combined with the LIL bound, the observed average
power is only 0.75, and the observed P(Rm × TDP/m1 ≥ 0.8) = 0.
For a fixed effect size, if we set the desired power level ω = 0.5, we observe little
difference in terms of P(Rm × TDP/m1 > 0) and P(Rm × TDP/m1 ≥ ω) compared to when ω = 0.8
(see Appendix A). For a variable effect size, however, differences between ω = 0.8 and
ω = 0.5 can be very large, as can be observed in Table 4.
Table 4: Sample size estimates and observed power for a number of estimators, using ω = 0.5 and σθ = 0.134. Note that for sample size calculations based on the criticality phenomenon (α∗ < α), ω is not a determinant of the sample size. Observed average power, probability of at least one correct rejection, and probability that the power reaches or exceeds the desired level are based on 100 simulations per combination of the experimental factors.
θ       method                n      E(Rm×TDP/m1)   P(Rm×TDP/m1 > 0)   P(Rm×TDP/m1 ≥ 0.5)
0.436   α∗ < α                16     0.0165         0.88               0
        p∗(α)(1−(1−π)α)/π     46     0.52           1                  0.75
        pl(α)(1−(1−π)α)/π     50     0.57           1                  0.99
        p∗(αγ)(1−hα)/π        74     0.75           1                  1
        pl(αγ)(1−hα)/π        82     0.79           1                  1
0.158   α∗ < α                36     0.0082         0.83               0
        p∗(α)(1−(1−π)α)/π     314    0.52           1                  0.72
        pl(α)(1−(1−π)α)/π     354    0.54           1                  0.96
        p∗(αγ)(1−hα)/π        518    0.63           1                  1
        pl(αγ)(1−hα)/π        572    0.65           1                  1
We now observe that E(Rm × TDP/m1) ≥ ω for all sample sizes, except those based on
the criticality phenomenon. Furthermore, when using the LIL bound for the proportion
of rejected hypotheses, the observed P(Rm × TDP/m1 ≥ 0.5) is at least 0.96. If we set the
desired power level ω = 0.1, we obtain the results shown in Table 5.
Table 5: Sample size estimates and observed power for a number of estimators, using ω = 0.1 and σθ = 0.134. Note that for sample size calculations based on the criticality phenomenon (α∗ < α), ω is not a determinant of the sample size. Observed average power, probability of at least one correct rejection, and probability that the power reaches or exceeds the desired level are based on 100 simulations per combination of the experimental factors.
θ       method                n      E(Rm×TDP/m1)   P(Rm×TDP/m1 > 0)   P(Rm×TDP/m1 ≥ 0.1)
0.436   α∗ < α                16     0.0165         0.88               0
        p∗(α)(1−(1−π)α)/π     26     0.18           1                  1
        pl(α)(1−(1−π)α)/π     30     0.26           1                  1
        p∗(αγ)(1−hα)/π        44     0.49           1                  1
        pl(αγ)(1−hα)/π        50     0.57           1                  1
0.158   α∗ < α                36     0.0082         0.83               0
        p∗(α)(1−(1−π)α)/π     164    0.33           1                  1
        pl(α)(1−(1−π)α)/π     192    0.38           1                  1
        p∗(αγ)(1−hα)/π        288    0.49           1                  1
        pl(αγ)(1−hα)/π        324    0.53           1                  1
We again observe that E(Rm × TDP/m1) ≥ ω for all sample sizes, except those based
on the criticality phenomenon. The observed average power is now generally much
larger than the desired level. For example, when θ = 0.158, a sample size based on
the asymptotic average power now leads to an observed average power of 0.33, which is
much larger than the desired 0.1. In fact, we now observe for both effect sizes that the
probability of the power being greater than, or equal to, the desired level, is 1 for all
sample sizes except those based on the criticality phenomenon. Results for ω = 0.1 in
combination with a fixed effect size are included in Appendix A.
4 Discussion
4.1 Performance of the BH Procedure in a Non-Asymptotic Setting
In our first simulation experiment, we investigated how a finite number of tests affects
the behavior of the BH procedure. We find that for a finite number of tests, the average
proportion of rejected hypotheses is larger than the asymptotic proportion of rejected
hypotheses p∗. We applied the BH procedure for different combinations of the proportion
of false nulls π and the FDR control parameter α, and observe this behavior in all cases,
although it is much more pronounced when π is low and α is high. If we fix the proportion
of false nulls among the nulls in the data, we observe that the positive power likewise
converges to its asymptotic value from above. Benjamini and Hochberg (1995) already
observed that the power of their procedure decreases as the number of tests increases,
a phenomenon they referred to as “the cost of multiplicity control” (p. 296). We
additionally observe that, for low π and high α, as n → ∞, the proportion of rejected
hypotheses converges to some value larger than p∗. We suspect there is some constant
or ratio due to the discontinuous nature of the empirical distribution functions which
can explain this difference, but deriving an expression for such a quantity was deemed to
be outside the scope of this study, since it has only marginal practical relevance. After
all, FDR control levels of 0.9 or 0.5 are never used in practice. In the most realistic
setting, where π = 0.1 and α = 0.1, the observed differences between different values of
m are much smaller. Furthermore, since the observed proportion of rejected hypotheses
is larger than p∗ for finite m, calculations based on p∗ are at most conservative.
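The asymptotic proportion of rejected hypotheses p∗ can be computed numerically as the largest fixed point of u = G(αu), where G is the mixture CDF of the p-values. The sketch below is an illustration under assumptions: it approximates the two-sample t-test by a one-sided z-test with noncentrality θ√(n/2) (n per group), so its numbers will not exactly match those reported in this thesis.

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def norm_ppf(q):
    # invert the standard normal CDF by bisection
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if norm_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def p_star(theta, n, pi, alpha):
    """Asymptotic BH rejection proportion: largest fixed point of u = G(alpha * u)."""
    mu = theta * math.sqrt(n / 2)  # noncentrality of a one-sided two-sample z-test
    def G(t):
        # p-value mixture CDF: (1 - pi) uniform nulls + pi alternatives
        return (1 - pi) * t + pi * norm_cdf(mu - norm_ppf(1 - t))
    u = 1.0
    for _ in range(5000):
        u_next = G(alpha * u)
        if abs(u_next - u) < 1e-12:
            break
        u = u_next
    return u

def asymptotic_avg_power(theta, n, pi, alpha):
    # the expression p*(1 - (1 - pi) * alpha) / pi used throughout this thesis
    return p_star(theta, n, pi, alpha) * (1 - (1 - pi) * alpha) / pi
```

Starting the iteration at u = 1 and iterating downward converges to the largest fixed point, which is the supercritical solution when α > α∗.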
We observe an interaction between the number of tests and sample size: the ob-
served differences between the actual proportion of rejected hypotheses and p∗ are gen-
erally larger for smaller n. This is consistent with the results of Neuvial (2010), who
observed that when α < α∗, i.e. when p∗ = 0, the differences between the observed
(m = 1000) and asymptotic power were larger than when α > α∗, i.e. when p∗ > 0.
We see similar behavior, and additionally observe that this behavior becomes more pro-
nounced for smaller m.
We do note, however, that if the proportion of false null hypotheses in the data is
not fixed, but is instead allowed to vary due to hypotheses being sampled from a mixture
model with population fraction of false nulls π, the behavior of the positive power is not
analogous to the behavior of the proportion of rejected hypotheses. In this setting, the
observed power is higher than the asymptotic power for small n, but lower for large n.
However, with the exception of m = 10, the differences between observed power and
asymptotic power for large n are fairly small.
We also observe that, for a fixed n, a lower number of tests leads to larger variance
of the distributions of the proportion of rejected hypotheses and positive power. If the
sample proportion of false nulls is fixed this variance is reduced, but in both settings
the realized power can vary wildly for m ≤ 100. We observe that pl is a reasonably
effective lower bound on the proportion of rejected hypotheses: For a fixed value of n,
the observed probability of rejecting at least a proportion pl of the hypotheses was 0.97
or higher. However, since the value of pl is very low for low m, we only expect this bound
to be useful for high m, e.g. m ≥ 1000. Fortunately, such a number of hypotheses is
typical for many multiple testing experiments.
4.2 Performance of the BH Procedure Under Basic Forms of Dependence
In our second simulation experiment, we investigated how certain basic forms of depen-
dence affect the behavior of the BH procedure. We first considered an equicorrelated
setting, with a fixed population correlation coefficient between all the features. We find
that, for population correlations between 0 and 0.9, the FDR remains controlled, which
is in agreement with existing research regarding the performance of the BH procedure
under dependency (see e.g. Goeman & Solari, 2014). We find that an increase in correla-
tion between the features is associated with a decrease in observed FDR, but an increase
in the level of n required to obtain pFDR control. We observe that an increase in corre-
lation is associated with a larger proportion of rejected hypotheses. In terms of positive
power, we find that for low n, power is increased with higher correlation, but for high
n, power is reduced with higher correlation. Observed differences between correlated
features and independent features are small, however, even for very high correlations.
This is consistent with the results of Jung (2005), who observed the asymptotic power
remains a good estimator for the median power under dependence, and with those of
Hemmelmann et al. (2005), who observed only a small reduction in power when the
pair-wise correlations were increased from 0.2 to 0.8.
We do observe an increase in the variance of both the proportion of rejected hy-
potheses and the positive power as correlation increases. This is consistent with the
results of Jung (2005), Hemmelmann et al. (2005), and Shao and Tseng (2007), who
all observed a similar effect. In particular, Jung (2005) observed that for ρ = 0.6, the
interquartile range of the observed proportion of rejected hypotheses almost doubled
compared to independence. We observe the same effect for the interquartile range of the
power: For a fixed n, we observe an IQR of 0.09 under independence, and an IQR of 0.17
when ρ = 0.6. Even though the IQR is almost doubled, we observe that the same is not
true for the entire range of the distribution, so that 94% of the time, the observed power
was still higher than the lower LIL bound. Since we consider all possible values of ρ from
0 to 0.9 in increments of 0.1, we can assess the effect of correlation magnitude. For a
fixed n, we find that when the features are weakly correlated (ρ ≤ 0.2), the distribution
of the power is very similar to the independence setting, with the IQR only increased
from 0.093 to 0.096. From ρ = 0.2 onward, outliers start to appear in the distribution of
the proportion of rejected hypotheses, but it is not until ρ > 0.5 that more than
5% of the distribution lies below the LIL bound.
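The equicorrelated setting can be reproduced with a one-factor construction, in which every test statistic shares a common standard normal component with weight √ρ. The sketch below is a simplified illustration with assumed parameter values (a z-test stand-in for the two-sample t-tests used in our experiments): it applies the BH procedure and averages the FDP over replicates.

```python
import math
import random

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def bh_reject(pvals, alpha):
    # Benjamini-Hochberg: reject the k smallest p-values, where k is the
    # largest rank with p_(k) <= k * alpha / m
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    return set(order[:k])

def average_fdp(rho, m=500, pi=0.1, mu=3.0, alpha=0.1, reps=50, seed=1):
    rng = random.Random(seed)
    m1 = int(pi * m)  # the first m1 features are false nulls
    total = 0.0
    for _ in range(reps):
        z = rng.gauss(0, 1)  # common factor inducing equicorrelation rho
        stats = [math.sqrt(rho) * z + math.sqrt(1 - rho) * rng.gauss(0, 1)
                 + (mu if i < m1 else 0.0) for i in range(m)]
        pvals = [1 - norm_cdf(s) for s in stats]
        rej = bh_reject(pvals, alpha)
        false_rej = sum(1 for i in rej if i >= m1)
        total += false_rej / max(len(rej), 1)
    return total / reps
```

Under these assumptions, the average FDP at ρ = 0 lands near (1 − π)α = 0.09; increasing ρ leaves the mean controlled but inflates the spread of the FDP across replicates, consistent with the behavior described above.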
We also investigated the effect of correlating the true and false nulls separately.
We again find that under all experimental conditions, the FDR remains controlled. We
observe that if only the true nulls are correlated, a lower n is required to obtain pFDR
control compared to both equicorrelation and independence, whereas if only the false
nulls are correlated, pFDR control requires a higher n than both equicorrelation and
independence. In terms of power, we observe that average power is very close to the
asymptotic power when only the true nulls are correlated, but follows a similar pattern to
equicorrelation when only the false nulls are correlated. The largest difference between
correlating only the true or only the false nulls can be observed in the variance of
the power distribution. Although we observe more outliers in the distribution of the
proportion of rejected hypotheses when only the true nulls are correlated, the power
distribution is quite similar to independence, with the same interquartile range, and the
lower LIL bound performs well. When only the false nulls are correlated, however, the
observed power varies wildly between 0 and 1 in a manner similar to the equicorrelation
scenario. Although the correlation magnitudes used for this experiment are extreme (0
or 0.9), the results do clearly illustrate that correlations between the false nulls can have
a large effect on the power of the BH procedure compared to correlations between the
true nulls. Correlations between the true nulls do affect the FDP, however, since when
the true nulls are strongly correlated, this implies the associated p-values are either all
high, or all low, and so we expect to either reject many of them, or none at all. We
observed a probability of rejecting no true nulls of 0.923, leading to an observed FDR
much lower than the desired level. Thus, if only the true nulls are correlated, there is
a reduction in FDR as in equicorrelation, but without the increase in variance of the
power distribution. There is, however, still a probability of obtaining an FDP far greater
than the desired FDR level.
4.3 Performance of the BH Procedure With an Effect Size Distribution
In our third simulation experiment, we investigated how the BH procedure performs in
simulations based on real data. We added signal to the data, either with a fixed effect
size, or with effect sizes drawn from a normal distribution. For a fixed effect size, we
observe that the average power is very close to the asymptotic power. For a fixed sample
size, we observe that 99% of the time, the power was higher than the lower LIL bound.
This combination of results indicates that the effect of the dependency in the data on
the power of the BH procedure is very limited.
When effect sizes are drawn from a normal distribution, however, the observed
power is only close to the asymptotic power when the power is near 0 or near 0.5. For
an asymptotic power between 0 and 0.5, i.e. for low n, the observed power is greater
than predicted, whereas for an asymptotic power between 0.5 and 1, i.e. for high n, the
observed power is lower than predicted. This effect becomes more pronounced as the
variance of the effect size distribution increases compared to its mean.
We believe this effect is caused by the shape of the power curve as a function of
θ. For low n, an increase in effect size of, for example, 0.5σθ, leads to an increase in
power. The amount by which the power increases is larger than the amount by which
it decreases if the effect size is decreased by 0.5σθ. For high n, the amount by which
the power increases is smaller than the amount by which it decreases if the effect size
is decreased by 0.5σθ. This effect is illustrated in Figure 13.
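This asymmetry can be checked directly on a stylized power curve. The sketch below assumes a one-sided z-test at a fixed per-test p-value threshold of 0.05 (an illustrative stand-in for the data-dependent BH threshold) and uses θ = 0.158 with shifts of 0.5σθ = 0.067.

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(theta, n):
    # one-sided z-test power at p-value threshold 0.05, n per group;
    # 1.6449 is the 0.95 quantile of the standard normal distribution
    return norm_cdf(theta * math.sqrt(n / 2) - 1.6449)

theta, delta = 0.158, 0.067  # delta = 0.5 * sigma_theta with sigma_theta = 0.134
gain_200 = power(theta + delta, 200) - power(theta, 200)
loss_200 = power(theta, 200) - power(theta - delta, 200)
gain_400 = power(theta + delta, 400) - power(theta, 400)
loss_400 = power(theta, 400) - power(theta - delta, 400)
```

On the rising, convex part of the curve the gain from a higher effect size exceeds the loss from a lower one (gain_200 > loss_200), while past the inflection point the loss dominates (loss_400 > gain_400), which is the mechanism behind Figure 13.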
[Figure 13 appears here: power plotted as a function of θ, top panel n = 200, bottom panel n = 400, with a cross marking θ = 0.158.]
Figure 13: Top: Power as a function of θ for n = 200. Bottom: Power as a function of θ for n = 400. In both panels, the blue cross indicates the power for θ = 0.158. In the top panel, the increase in power associated with an increase of θ by 0.067 is larger than the decrease in power associated with a decrease of θ by the same amount. In the bottom panel, the opposite is true.
4.4 Performance of Power Estimators
In our fourth simulation experiment, we compared several power estimators in their
ability to provide adequate sample size estimates. If the false nulls of interest have a
fixed effect size, we observe that sample size estimates based on the asymptotic average
power indeed lead to an average power at least as high as the desired level. For m = 5000,
we observe that incorporating the lower LIL bound in the power estimator leads to a
98% or higher probability of the power being at least as high as the desired level. When
incorporating FDP exceedance control at the γ = 0.1 level, or both FDP exceedance
control and the lower LIL bound, the observed probability of the power being at least
the desired level is 1. Sample size calculations based on the criticality phenomenon lead
to a positive average power, but the probability of making no discoveries can be large.
If the false nulls of interest have a variable effect size that follows a normal distri-
bution, we observe the same behavior as described in section 4.3: if the desired power
level is low we observe a higher power than desired, whereas if the power level is high,
we observe a lower power than desired. For ω ≤ 0.5, any of the tested power estimators
can be safely used, although they can be too conservative. For ω = 0.8, however, even
applying both FDP tail control and the lower LIL bound, that is, using a sample size
more than twice as high as the one suggested by using the asymptotic average power,
cannot ensure that the power reaches the desired level.
4.5 Sample Size Calculation in Practice
Existing sample size calculation methods for the BH procedure as implemented in R
assume an infinite number of independent tests. In our simulation experiments we observe that,
under independent tests and a fixed effect size, the asymptotic average power is an ac-
curate estimator for the observed positive power for a number of hypotheses typical for
high-dimensional experiments. In fact, it remains a decent estimator for all but the
lowest number of tests we considered (m = 10), assuming the proportion of signal in the
data and the FDR control parameter are not very high, which is generally the case in
practice. We additionally observe that if the proportion of signal in the data is fixed, the
asymptotic average power is a conservative power estimator. These observations indicate
that if independent tests and a fixed effect size are reasonable assumptions, the asymp-
totic average power can be safely used for sample size calculation in high-dimensional
experiments. However, the fewer hypotheses are tested, the larger the variance of the
power distribution becomes. One can correct for this by incorporating the lower LIL
bound in the power estimator. This ensures a high probability of the power of the
experiment being at least as high as desired.
In general, independent tests will not be a reasonable assumption. In our simulation
experiments we observe that even if the features are very highly correlated (ρ = 0.9), the
observed average power is close to the asymptotic average power. However, the variance
of the power distribution is very large: it is not unlikely a single experiment will lead to
no rejections at all, even if the expected power is large. If features are expected to be
highly correlated, basing sample size calculations on the asymptotic average power, or
any of the related estimators described in this paper, may not be desirable. In this case,
a researcher may consider incorporating an adjustment for dependence, as suggested by
Shao and Tseng (2007).
We observe, however, that if the data are weakly or moderately correlated (ρ ≤ 0.5),
the range of the power distribution is only mildly affected, and the lower LIL bound remains
effective. Additionally, only correlations between the false nulls (the features that
carry signal) have a negative effect on the average power of the BH procedure. Even if strong correlations between the
true nulls exist, sample size calculations can be safely based on the asymptotic average
power, since these correlations will not have a negative effect on the power of the BH
procedure. They can, however, lead to a realized FDP far higher than the desired FDR
control level.
If the effect sizes of interest cannot be assumed fixed, one should be careful when
basing sample size calculations on the assumption of a fixed effect size. In our simulation
experiments, we observe that if a high power level is desired, sample sizes calculated
based on the asymptotic average power can be insufficient to ensure the experiment
has the desired power. This behavior is more severe if the variance of the effect size
distribution is high compared to its mean. Using a more conservative power estimator,
or a more conservative estimate of π or θ, will lead to more conservative estimates of n,
but this will only help up to a point. In section 3.5, we observe a situation where even
using a very conservative power estimator, which leads to an estimate of n twice as large
as the one obtained using the asymptotic average power, does not ensure the average
power reaches the desired level. In fact, in this scenario the desired power is reached for
none of the 100 simulated data sets. If the effects of interest are suspected to have a
large variance compared to their mean, a researcher should consider estimating the effect
size distribution beforehand. If a lower power level is desired, for example if one wishes
to discover at least 50% of all features with high probability, sample size calculations
remain fairly accurate, whereas for even lower levels (e.g. 10%) they are conservative.
It should be noted that the average power calculated in these experiments consid-
ers all false nulls, but a researcher need not be interested in all of them. He or she
could be interested only in false nulls with an effect size “in the range of θ”, and may
consider false nulls with small effect sizes to be irrelevant. In this case, even though
the distribution of all effect sizes potentially has high variance, the same need not be
true for the distribution of the effect sizes of interest. In our simulations we observed
that if the effect sizes of interest have a mean of 0.436 with a standard deviation of
0.134 (interquartile range: 0.35 to 0.53), applying FDP exceedance control at the 90%
level still ensured the desired power level of 0.80 was reached or exceeded 100% of the
time. Applying the lower LIL bound led to an average power of 0.79, only slightly
below the desired level. If the variance of the effect size distribution of interest is low
enough, sample size calculations based on the assumption of a fixed effect size may still
perform well, although using the more conservative power estimators suggested in this
paper may be preferable to using the traditional expression for the asymptotic average
power. If a researcher is interested in effect sizes of “at least θ”, this implies the effect
size distribution of interest is skewed. In this case, a higher variance of this distribution
leads to more conservative power estimates. Nevertheless, an overly conservative sample
size estimate can also be undesirable, as this will lead to unnecessary costs. This may
justify the costs associated with performing a pilot experiment and estimating the effect
size distribution.
In this study we have discussed twelve different power estimators based on the
asymptotic expression for the average power. If the effects of interest are assumed to
follow an effect size distribution with high variance compared to its mean, one should
be careful in applying any of these estimators since they are all based on an assumption
of equal effect sizes. In the event of equal effect sizes, however, even if the number
of tests is finite and the features are weakly or moderately correlated, the asymptotic
expression for the average power is an accurate estimator. To ensure a high probability
that the power will be at least as high as the desired level, one can incorporate the
lower LIL bound into the power estimator. With 5000 tests, applying FDP exceedance
control at the 90% level likewise ensures a high probability that the power will be at least
as high as the desired level, and additionally allows the researcher to make confidence
statements about the FDP and TDP of the rejection set. However, the required sample
size is higher than when using the lower LIL bound. Using both FDP exceedance control
and the lower LIL bound simultaneously appears to be overly conservative. Sample size
calculations based on the criticality phenomenon ensure a positive average power, even
for an infinite number of tests, but the probability of no discoveries can be high even for
5000 tests. These sample sizes therefore act as a sort of “absolute minimum” to ensure
that the average power will always remain non-zero.
We have additionally discussed several possible estimators for the proportion of
true null hypotheses. If, in the opinion of the researcher, an adequate estimate of π is
available, setting π0 = 1−π is the most natural choice. Alternatively, one can incorporate
some uncertainty about the proportion of true nulls by using π0 = 1. The quantity h
provides a middle ground, and is the most natural choice when applying FDP exceedance
control, due to its interpretation as a confidence bound. If the proportion of signal in
the data is small, however, the choice of π0 is far less influential than the choice of the
other components of the power estimator.
4.6 Limitations and Future Research
We studied the effect of sample size, number of tests, and basic dependence, on the
power of the BH procedure using simulated data. Due to practical constraints, it is not
possible to perform simulations for every combination of every level of the experimental
factors. If an experimental factor was fixed in a simulation experiment, we did so at a
value we believe is reflective of the kind of settings in which the BH procedure is most
commonly applied, e.g. with at least a thousand hypothesis tests, and with small α and
π. Regardless, our simulated data may not be reflective of real data, so we additionally
performed simulations based on a real data set. However, even this experiment contains
some artificial elements, since the identity of the true and false nulls is required to be
known in order to calculate quantities like the FDP and average power. Nevertheless,
we find that the effect of the correlation structure present in the real data set is much
smaller in magnitude than the effects we have observed in some of our highly correlated
artificial data, indicating effects of correlation in a practical setting may be milder than
those seen in our simulations. It should be noted, however, that due to removal of
features from the data, the correlation structure may not be completely representative
of the correlation structure in the complete data set.
Although our simulations based on real data still contain a sizable artificial compo-
nent, they do show that parameters of the effect size distribution have considerable effect
on the power as a function of sample size. We observe that, if the mean of the effect size
distribution is high enough and/or its variance is low enough, sample size calculations
based on the assumption of a fixed effect size may still perform well, although we con-
sider only two values for the mean, and two values for the variance parameter, and so we
cannot very accurately quantify what “high enough” or “low enough” means in this con-
text. Such accurate quantifications may not be very useful in practice anyway, however,
since if detailed information about the effect size distribution was known beforehand, a
researcher could incorporate this information into the sample size calculations. We do
urge researchers to consider carefully if sample size calculations should be based on the
assumption of a fixed effect size, since depending on the desired power level, this may
lead to under- or overestimation of the required sample size.
In this study, we have proposed several different power estimators for use in sample
size calculation. We have focused on two-sample t-tests, or equivalently, an F-test with
two groups. The methodology described can be readily extended to F-tests with more
than two groups. In terms of future research, it would be practical to extend this
methodology to other commonly used tests, such as χ2-tests. Additionally, it would be
interesting to further investigate whether additional “corrections” for violated assumptions can
be incorporated, such as a correction for dependency as suggested by Shao and Tseng
(2007). Likewise, a next step would be to incorporate the estimation of an effect size
distribution into the methodology described in this study.
4.7 Concluding Remarks
In summary, our experiments suggest that power calculations based on the asymptotic
average power are quite robust to violations of assumptions. The asymptotic average power
remains a fairly accurate estimator of the observed average power for all but the lowest
number of hypotheses (m = 10), and even if the variables are very highly correlated.
However, a decrease in the number of hypotheses, or an increase in correlation between
the features, leads to an increase in the variance of the power distribution. Using a
more conservative power estimator, based on the LIL bound or FDP exceedance con-
trol, can provide a high probability of reaching at least the desired power level for a
moderate to large number of hypotheses (e.g. m ≥ 1000). We observed at least a 95%
probability of exceeding the desired power level when incorporating the LIL bound in
the power estimator, even when the features were weakly to moderately equicorrelated
(ρ ≤ 0.5), or when only the false nulls were correlated. Sample size estimation based
on the assumption of a fixed effect size should be performed with care, however, since
depending on the effect size distribution of interest and the desired power level, even
the most conservative power estimators described in this paper cannot ensure adequate
sample sizes.
All power estimators discussed in this study have been incorporated into a Shiny
application. This application allows researchers to perform power and sample size cal-
culations for one- and two-sided two-sample t-tests, using a point-and-click interface.
Additionally, it enables researchers to calculate the quantities α∗, u∗ and p∗, and to
visualize the associated p-value distribution. This application is freely available at:
https://wsvanloon.shinyapps.io/bhpower.
References
Adler, D. (2005). vioplot: Violin plot [Computer software manual]. Retrieved from
http://wsopuppenkiste.wiso.uni-goettingen.de/~dadler (R package ver-
sion 0.2)
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical
and powerful approach to multiple testing. Journal of the Royal Statistical Society:
Series B , 57 (1), 289-300.
Benjamini, Y., & Hochberg, Y. (2000). On the adaptive control of the false discovery
rate in multiple testing with independent statistics. Journal of Educational and
Behavioral Statistics, 25 (1), 60-83.
Benjamini, Y., & Yekutieli, D. (2001). The control of the False Discovery Rate in
multiple testing under dependency. The Annals of Statistics, 29 (4), 1165-1188.
Bi, R., & Liu, P. (2017). ssizeRNA: Sample size calculation for RNA-seq experimental
design [Computer software manual]. Retrieved from https://CRAN.R-project
.org/package=ssizeRNA (R package version 1.2.9)
Brent, R. (1973). Algorithms for minimization without derivatives. Englewood Cliffs,
N.J.: Prentice-Hall.
Chang, W., Cheng, J., Allaire, J., Xie, Y., & McPherson, J. (2017). shiny: Web ap-
plication framework for R [Computer software manual]. Retrieved from https://
CRAN.R-project.org/package=shiny (R package version 1.0.0)
Chi, Z. (2007). On the performance of FDR control: Constraints and a partial solution.
The Annals of Statistics, 35 (4), 1409-1431.
Collado-Torres, L., Nellore, A., Kammers, K., Ellis, S. E., Taub, M. A., Hansen, K. D.,
. . . Leek, J. T. (2016). recount: A large-scale resource of analysis-ready RNA-seq
expression data. bioRxiv . Retrieved from http://biorxiv.org/content/early/
2016/08/08/068478 doi: 10.1101/068478
Dunn, O. (1961). Multiple comparisons among means. Journal of the American Statis-
tical Association, 56 (293), 52-64.
Ferreira, J., & Zwinderman, A. (2006). Approximate power and sample size calculations
with the Benjamini-Hochberg method. The International Journal of Biostatistics,
2 (1), Article 8.
Genovese, C., & Wasserman, L. (2002). Operating characteristics and extensions of the
false discovery rate procedure. Journal of the Royal Statistical Society: Series B ,
64 (3), 499-517.
Genovese, C., & Wasserman, L. (2006). Exceedance control of the false discovery
proportion. Journal of the American Statistical Association, 101 (476), 1408-1417.
Goeman, J., & Solari, A. (2011). Multiple testing for exploratory research. Statistical
Science, 26 (4), 980-987.
Goeman, J., & Solari, A. (2014). Multiple hypothesis testing in genomics. Statistics in
Medicine, 33 (11), 1946-1978.
Groppe, D., Urbach, T., & Kutas, M. (2011). Mass univariate analysis of event-related
brain potentials/fields II: Simulation studies. Psychophysiology , 48 , 1726-1737.
Hemmelmann, C., Horn, M., Susse, T., Vollandt, R., & Weiss, S. (2005). New concepts
of multiple tests and their use for evaluating high-dimensional EEG data. Journal
of Neuroscience Methods, 142 , 209-217.
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance.
Biometrika, 75 (4), 800-802.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian
Journal of Statistics, 6 (2), 65-70.
Hommel, G. (1986). Multiple test procedures for arbitrary dependence structures.
Metrika, 33 , 321-336.
Horn, M., & Dunnett, C. (2004). Power and sample size comparisons of stepwise FWE
and FDR controlling test procedures in the normal many-one case. Lecture Notes-
Monograph Series (Recent Developments in Multiple Comparison Procedures), 47 ,
48-64.
Jouve, T., Maucort-Boulch, D., Ducoroy, P., & Roy, P. (2009). Statistical power in
mass-spectrometry proteomic studies.
(submitted manuscript)
Jung, S. (2005). Sample size for FDR-control in microarray data analysis. Bioinformat-
ics, 21 (14), 3097-3104.
Keselman, H., Cribbie, R., & Holland, B. (2002). Controlling the rate of Type I error
over a large set of statistical tests. British Journal of Mathematical and Statistical
Psychology , 55 , 27-39.
Kim, K., & van de Wiel, M. (2008). Effects of dependence in high-dimensional multiple
testing problems. BMC Bioinformatics, 9 (114). doi: 10.1186/1471-2105-9-114
Korn, E., Troendle, J., McShane, L., & Simon, R. (2004). Controlling the number of false
discoveries: Application to high-dimensional genomic data. Journal of Statistical
Planning and Inference, 124 (2), 379-398.
Kvam, V., Liu, P., & Si, Y. (2012). A comparison of statistical methods for detecting
differentially expressed genes from RNA-seq data. American Journal of Botany ,
99 (2), 248-256.
Lappalainen, T., Sammeth, M., Friedländer, M. R., 't Hoen, P. A. C., Monlong, J., Rivas,
M. A., . . . Dermitzakis, E. T. (2013). Transcriptome and genome sequencing
uncovers functional variation in humans. Nature, 501, 506-511.
Lee, M., & Whitmore, G. (2002). Power and sample size for DNA microarray studies.
Statistics in Medicine, 21 , 3543-3570.
Liu, P., & Hwang, J. (2007). Quick calculation for sample size while controlling false
discovery rate with application to microarray analysis. Bioinformatics, 23 (6),
739-746.
Meijer, R., Krebs, T., Solari, A., & Goeman, J. (2016). Simultaneous control of all false
discovery proportions by an extension of Hommel’s method. arXiv:1611.06739v1 .
Neuvial, P. (2010). Intrinsic bounds and false discovery rate control in multiple testing
problems. arXiv:1003.0747v1 .
Orr, M., & Liu, P. (2015). ssize.fdr: Sample size calculations for microarray experiments
[Computer software manual]. Retrieved from https://CRAN.R-project.org/
package=ssize.fdr (R package version 1.2)
Oura, T., Matsui, S., & Kawakami, K. (2009). Sample size calculations for controlling the
distribution of false discovery proportion in microarray experiments. Biostatistics,
10 (4), 694-705.
Pounds, S. (2016). FDRsampsize: Compute sample size that meets requirements for
average power and fdr [Computer software manual]. Retrieved from https://
CRAN.R-project.org/package=FDRsampsize (R package version 1.0)
R Development Core Team. (2008). R: A language and environment for statistical
computing [Computer software manual]. Vienna, Austria. Retrieved from http://
www.R-project.org (ISBN 3-900051-07-0)
Robinson, M., McCarthy, D., & Smyth, G. (2010). edgeR: a Bioconductor package for
differential expression analysis of digital gene expression data. Bioinformatics, 26 ,
139-140.
Robinson, M., & Oshlack, A. (2010). A scaling normalization method for differential
expression analysis of RNA-seq data. Genome Biology , 11 , R25.
Sarkar, S. (2004). FDR-controlling stepwise procedures and their false negative rates.
Journal of Statistical Planning and Inference, 119-137.
Shang, S., Liu, M., & Shao, Y. (2012). A tight prediction interval for false discovery
proportion under dependence. Open Journal of Statistics, 2 , 163-171.
Shang, S., Zhou, Q., Liu, M., & Shao, Y. (2012). Sample size calculation for controlling
false discovery proportion. Journal of Probability and Statistics. doi: 10.1155/
2012/817948
Shao, Y., & Tseng, C. (2007). Sample size calculation with dependence adjustment for
FDR-control in microarray studies. Statistics in Medicine, 26 , 4219-4237.
Storey, J. (2002). A direct approach to false discovery rates. Journal of the Royal
Statistical Society: Series B , 64 (3), 479-498.
Storey, J. (2003). The positive false discovery rate: A Bayesian interpretation and the
q-value. The Annals of Statistics, 31 (6), 2013-2035.
Van Iterson, M., Van de Wiel, M., Boer, J., & De Menezes, R. (2013). General power and
sample size calculations for high-dimensional genomic data. Statistical Applications
in Genetics and Molecular Biology , 12 (4), 449-467.
Wang, S., & Chen, J. (2004). Sample size for identifying differentially expressed genes
in microarray experiments. Journal of Computational Biology , 11 (4), 714-726.
Yang, Q., Cui, J., Chazaro, I., Cupples, L., & Demissie, S. (2005). Power and type I
error rate of false discovery rate approaches in genome-wide association studies.
BMC genetics, 6 (1). doi: 10.1186/1471-2156-6-S1-S134
A Additional Tables
Table 6: Sample size estimates and observed power for a number of estimators, using
ω = 0.5 and σθ = 0 (i.e., the standard deviation of the effect size distribution is zero).
Note that for sample size calculations based on the criticality phenomenon (α∗ < α),
ω is not a determinant of the sample size. Observed average power, probability of at
least one correct rejection, and probability that the power equals or exceeds the desired
level are based on 100 simulations per combination of the experimental factors.

θ      method               n    E[(Rm×TDP)/m1]  P[(Rm×TDP)/m1 > 0]  P[(Rm×TDP)/m1 ≥ 0.5]
0.436  α∗ < α               16   0.0047          0.60                0
       p∗(α)(1−(1−π)α)/π    46   0.53            1                   0.83
       pl(α)(1−(1−π)α)/π    50   0.59            1                   1
       p∗(αγ)(1−h̄α)/π       74   0.86            1                   1
       pl(αγ)(1−h̄α)/π       82   0.90            1                   1
0.158  α∗ < α               36   0.0004          0.15                0
       p∗(α)(1−(1−π)α)/π    314  0.50            1                   0.56
       pl(α)(1−(1−π)α)/π    354  0.59            1                   1
       p∗(αγ)(1−h̄α)/π       518  0.83            1                   1
       pl(αγ)(1−h̄α)/π       572  0.88            1                   1
Table 7: Sample size estimates and observed power for a number of estimators, using
ω = 0.1 and σθ = 0 (i.e., the standard deviation of the effect size distribution is zero).
Note that for sample size calculations based on the criticality phenomenon (α∗ < α),
ω is not a determinant of the sample size. Observed average power, probability of at
least one correct rejection, and probability that the power equals or exceeds the desired
level are based on 100 simulations per combination of the experimental factors.

θ      method               n    E[(Rm×TDP)/m1]  P[(Rm×TDP)/m1 > 0]  P[(Rm×TDP)/m1 ≥ 0.1]
0.436  α∗ < α               16   0.0047          0.60                0
       p∗(α)(1−(1−π)α)/π    26   0.11            1                   0.55
       pl(α)(1−(1−π)α)/π    30   0.19            1                   0.98
       p∗(αγ)(1−h̄α)/π       44   0.49            1                   1
       pl(αγ)(1−h̄α)/π       50   0.59            1                   1
0.158  α∗ < α               36   0.0004          0.15                0
       p∗(α)(1−(1−π)α)/π    164  0.10            1                   0.49
       pl(α)(1−(1−π)α)/π    192  0.17            1                   0.99
       p∗(αγ)(1−h̄α)/π       288  0.44            1                   1
       pl(αγ)(1−h̄α)/π       324  0.53            1                   1
B Selected R Code
B.1 Functions for Sample Size and Power Calculations
Here, we provide the R functions used for calculating the quantities described in sec-
tion 2.1, the power estimators described in section 2.2, and functions used for sam-
ple size calculation. These functions form the back-end of the Shiny application at
https://wsvanloon.shinyapps.io/bhpower.
B.1.1 Log-factorial function
logfactorial <- function(x) {
  if (x == 0) {
    y <- 0
  } else {
    y <- sum(log(1:x))
  }
  return(y)
}
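As a quick standalone check (a sketch added here, not part of the application back-end), logfactorial(x) should agree with base R's lgamma(x + 1), since log(x!) = lgamma(x + 1). The function is repeated so the chunk runs on its own.

```r
# Sketch: verify logfactorial() against lgamma(x + 1), using log(x!) = lgamma(x + 1).
# (Function repeated from above so this chunk is self-contained.)
logfactorial <- function(x) {
  if (x == 0) {
    y <- 0
  } else {
    y <- sum(log(1:x))
  }
  return(y)
}

stopifnot(all.equal(logfactorial(0), lgamma(1)))      # log(0!) = 0
stopifnot(all.equal(logfactorial(10), lgamma(11)))    # log(10!)
stopifnot(all.equal(logfactorial(500), lgamma(501)))  # stable where 500! itself overflows
```

In practice one could simply call lgamma(x + 1); the explicit sum is kept for transparency.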
B.1.2 Function for calculating limits to evaluate F′(0)
rho.limit <- function(n, c=NULL, s=NULL, tol=1e-10, delta=NULL,
                      maxit=10000, type=c("t1", "t2", "F"), ngroups=2) {
  if (length(type) != 1) {
    stop("Invalid type. Must be one of: t1, t2, F.")
  }
  if (!(type %in% c("t1", "t2", "F"))) {
    stop("Invalid type. Must be one of: t1, t2, F.")
  }
  if (is.null(delta)) {
    delta <- sqrt(n)*c/s
  }
  s0 <- -1000
  s1 <- 0
  k <- 0
  # One-sided t-tests (one- or two-sample)
  if (type == "t1" | type == "t2") {
    if (type == "t2") {
      n <- n - 1
    }
    while (abs(s0 - s1) > tol & k <= maxit) {
      s0 <- s1
      s <- exp(lgamma((n + k)/2) + k*log(sqrt(2)*delta) -
               logfactorial(k) - lgamma(n/2))
      s1 <- s0 + s
      k <- k + 1
      if (s1 == Inf) {
        message("Warning: Limit too large, returning Inf.")
        break
      }
    }
    if (k == maxit) {
      message("Warning: Maximum number of iterations reached. Series did not converge.")
    }
    y <- exp(-delta^2/2)*s1
  }
  # F-tests (or two-sided, two-sample t-tests)
  if (type == "F") {
    df1 <- ngroups - 1
    df2 <- n - ngroups
    delta <- delta^2
    while (abs(s0 - s1) > tol & k <= maxit) {
      s0 <- s1
      s <- exp(k*log(delta) - k*log(2) - logfactorial(k) -
               lbeta(df1/2 + k, df2/2))
      s1 <- s0 + s
      k <- k + 1
      if (s1 == Inf) {
        message("Warning: Limit too large, returning Inf.")
        break
      }
    }
    if (k == maxit) {
      message("Warning: Maximum number of iterations reached. Series did not converge.")
    }
    y <- exp(-delta/2)*beta(df1/2, df2/2)*s1
  }
  return(y)
}
B.1.3 Function for calculating α∗
alpha.star <- function(n, c=NULL, s=NULL, tol=1e-10, delta=NULL, p,
                       type=c("t1", "t2", "F"), ngroups=2) {
  a <- 1 / (1 - p + p*rho.limit(n, c=c, s=s, tol=tol, delta=delta,
                                type=type, ngroups=ngroups))
  return(a)
}
B.1.4 Function for evaluating F(u)
cumulative.p <- function(u, n, c=NULL, s=NULL, delta=NULL, p,
                         type=c("t1", "t2", "F"), ngroups=2) {
  if (length(type) != 1) {
    stop("Invalid type. Must be one of: t1, t2, F.")
  }
  if (!(type %in% c("t1", "t2", "F"))) {
    stop("Invalid type. Must be one of: t1, t2, F.")
  }
  if (is.null(delta)) {
    delta <- sqrt(n)*c/s
  }
  # One-sided t-tests (one- or two-sample)
  if (type == "t1" | type == "t2") {
    if (type == "t2") {
      n <- n - 1
    }
    x <- qt(p=(1 - u), df=(n - 1))
    Fu <- (1 - p)*u + p*(1 - pt(q=x, df=(n - 1), ncp=delta))
  }
  if (type == "F") {
    x <- qf(p=(1 - u), df1=(ngroups - 1), df2=(n - ngroups))
    Fu <- (1 - p)*u + p*(1 - pf(q=x, df1=(ngroups - 1),
                                df2=(n - ngroups), ncp=(delta^2)))
  }
  return(Fu)
}
B.1.5 Functions for evaluating F′(u)
rho <- function(x, n, c, s, type=c("t1", "t2", "F"), ngroups=2) {
  if (type == "t1") {
    dfs <- n - 1
    delta <- sqrt(n)*c/s
    y <- exp(dt(x, dfs, delta, log=TRUE) - dt(x, dfs, log=TRUE))
  } else if (type == "t2") {
    dfs <- n - 2
    delta <- sqrt(n)*c/s
    y <- exp(dt(x, dfs, delta, log=TRUE) - dt(x, dfs, log=TRUE))
  } else if (type == "F") {
    dfs <- c(ngroups - 1, n - ngroups)
    delta <- (sqrt(n)*c/s)^2
    y <- exp(df(x, df1=dfs[1], df2=dfs[2], ncp=delta, log=TRUE) -
             df(x, df1=dfs[1], df2=dfs[2], log=TRUE))
  }
  return(y)
}

Fprime <- function(u, n, c, s, p, type=c("t1", "t2", "F"), ngroups=2) {
  if (type == "t1") {
    x <- qt(1 - u, df=n - 1)
  }
  if (type == "t2") {
    x <- qt(1 - u, df=n - 2)
  }
  if (type == "F") {
    x <- qf(1 - u, df1=ngroups - 1, df2=n - ngroups)
  }
  y <- 1 - p + p*rho(x=x, n=n, c=c, s=s, type=type, ngroups=ngroups)
  return(y)
}
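The density ratio computed by rho() is exactly the derivative term of the mixture CDF: with F(u) = (1 − π)u + π(1 − G(Q(1 − u))), where Q is the central quantile function and f, G, g the central density and noncentral CDF and density, one gets F′(u) = 1 − π + π·g(x)/f(x) at x = Q(1 − u). The following standalone sketch verifies this against a numerical derivative for the one-sided one-sample t case; the values n = 30, θ = 0.4, and π = 0.1 are illustrative, not taken from the simulations.

```r
# Sketch: verify the analytic F'(u) (density ratio form) against a numerical
# derivative of F(u). Values for n, theta, and pi are illustrative.
n <- 30; theta <- 0.4; p <- 0.1
delta <- sqrt(n)*theta  # noncentrality parameter

# Mixture CDF of the p-values for a one-sided one-sample t-test
F.mix <- function(u) {
  x <- qt(1 - u, df = n - 1)
  (1 - p)*u + p*(1 - pt(x, df = n - 1, ncp = delta))
}

# Analytic derivative: 1 - pi + pi * g(x)/f(x), as in rho()/Fprime() above
Fprime.analytic <- function(u) {
  x <- qt(1 - u, df = n - 1)
  1 - p + p*exp(dt(x, n - 1, delta, log = TRUE) - dt(x, n - 1, log = TRUE))
}

u0 <- 0.05
h <- 1e-4
num.deriv <- (F.mix(u0 + h) - F.mix(u0 - h))/(2*h)  # central difference
# num.deriv and Fprime.analytic(u0) should agree closely
```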
B.1.6 Function for evaluating the inverse of F(u)
cumulative.p.inverse <- function(x, n, c=NULL, s=NULL, delta=NULL, p,
                                 type=c("t1", "t2", "F"), ngroups=2) {
  inv <- function(u, x, n, c, s, delta, p, type, ngroups) {
    abs(cumulative.p(u, n, c, s, delta, p, type, ngroups) - x)
  }
  optimize(inv, c(0, 1), x=x, n=n, c=c, s=s, delta=delta, p=p,
           type=type, ngroups=ngroups, tol=0.001)$minimum
}
B.1.7 Function for calculating u∗
u.star <- function(alpha, precision=1e-10, n, c=NULL, s=NULL,
                   delta=NULL, p, type=c("t1", "t2", "F"), ngroups=2) {
  u <- 1
  for (i in 1:10000) {
    u[i + 1] <- alpha*cumulative.p(u[i], n=n, c=c, s=s, delta=delta,
                                   p=p, type=type, ngroups=ngroups)
    if (abs(u[i] - u[i + 1]) < precision) {
      break
    }
  }
  if (length(u) == 10001) {
    message("Warning: maximum number of iterations reached.")
  }
  ustar <- tail(u, 1)
  return(ustar)
}
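The loop in u.star() is the fixed-point iteration u ← αF(u), started at u = 1 so that it converges to the largest fixed point u∗. A minimal self-contained sketch for the one-sided one-sample t case (illustrative values α = 0.1, n = 50, θ = 0.436, π = 0.1; the mixture CDF is written inline rather than calling cumulative.p()):

```r
# Minimal sketch of the fixed-point iteration u <- alpha * F(u) for the
# one-sided one-sample t case; all parameter values are illustrative.
F.mix <- function(u, n, delta, p) {
  x <- qt(1 - u, df = n - 1)                 # rejection threshold on the t scale
  (1 - p)*u + p*(1 - pt(x, df = n - 1, ncp = delta))
}

alpha <- 0.1; n <- 50; p <- 0.1
delta <- sqrt(n)*0.436                       # noncentrality parameter

u <- 1                                       # start above the largest fixed point
repeat {
  u.new <- alpha*F.mix(u, n, delta, p)
  if (abs(u - u.new) < 1e-10) break
  u <- u.new
}
ustar <- u.new                               # largest solution of u = alpha * F(u)
pstar <- ustar/alpha                         # asymptotic proportion rejected
```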
B.1.8 Function for calculating p∗
p.star <- function(alpha, n, m, theta, pi1, type=c("t1", "t2", "F"),
                   ngroups=2) {
  pstar <- u.star(alpha=alpha, n=n, c=theta, s=1, p=pi1,
                  type=type, ngroups=ngroups)/alpha
  return(pstar)
}
B.1.9 Function for calculating pl
p.l <- function(alpha, n, m, theta, pi1, type=c("t1", "t2", "F"),
                ngroups=2) {
  ustar <- u.star(alpha=alpha, n=n, c=theta, s=1, p=pi1,
                  type=type, ngroups=ngroups)
  pstar <- ustar/alpha
  Fp <- Fprime(u=ustar, n=n, c=theta, s=1, p=pi1, type=type,
               ngroups=ngroups)
  pb <- sqrt(2*pstar*(1 - pstar))*sqrt(m*log(log(m))) /
        (m*(1 - alpha*Fp))
  pl <- pstar - pb
  return(pl)
}
B.1.10 Function for calculating h
h.bar <- function(alpha, n, theta, pi1, type=c("t1", "t2", "F"),
                  ngroups=2) {
  x.seq <- seq(0, cumulative.p(u=alpha, n=n, c=theta, s=1, p=pi1,
                               type=type, ngroups=ngroups) - 0.001, 0.001)
  p.inv <- sapply(x.seq, function(x) cumulative.p.inverse(x, n=n, c=theta,
                  s=1, p=pi1, type=type, ngroups=ngroups))
  h.seq <- (x.seq*alpha - alpha) / (p.inv - alpha)
  return(min(h.seq))
}
B.1.11 Function for calculating power estimates
pow <- function(alpha, gamma=1, ec=FALSE, n, m, theta, pi1, pi0,
                p_func, type=c("t1", "t2", "F"), ngroups=2, adaptive) {
  if (class(pi0) == "function") {
    pi0 <- pi0(alpha=gamma, n=n, theta=theta, pi1=pi1, type=type,
               ngroups=ngroups)
  }
  if (adaptive) {
    alpha <- alpha/pi0
  }
  if (ec) {
    p <- p_func(alpha=alpha*gamma, n=n, m=m, theta=theta, pi1=pi1,
                type=type, ngroups=ngroups)
  } else {
    p <- p_func(alpha=alpha, n=n, m=m, theta=theta, pi1=pi1,
                type=type, ngroups=ngroups)
  }
  pwr <- p*(1 - pi0*alpha)/pi1
  return(pwr)
}
B.1.12 Function for calculating sample sizes based on the criticality phenomenon
calc.crit <- function(alpha, gamma=1, ec=FALSE, maxn, theta, pi1,
                      type=c("t1", "t2", "F"), ngroups=2, adaptive) {
  if (type == "t1") {
    fc <- 2
  } else {
    fc <- 1
  }
  n <- 4/fc
  if (adaptive) {
    alpha <- alpha/(1 - pi1)
  }
  if (ec) {
    alpha <- alpha*gamma
  }
  astar <- alpha.star(n=n, c=theta, s=1, delta=NULL, p=pi1,
                      type=type, ngroups=ngroups)
  nseq <- n
  aseq <- astar
  while (astar >= alpha & n < maxn) {
    n <- n + 2/fc
    astar <- alpha.star(n=n, c=theta, s=1, delta=NULL, p=pi1,
                        type=type, ngroups=ngroups)
    nseq <- c(nseq, n)
    aseq <- c(aseq, astar)
  }
  pstar <- p.star(alpha=alpha, n=n, m=NULL, theta=theta, pi1=pi1,
                  type=type, ngroups=ngroups)
  out <- list(n=n, pstar=pstar, nseq=nseq, aseq=aseq)
  return(out)
}
B.1.13 Function for calculating sample sizes based on the average power
calc.n <- function(lvl, alpha, gamma=1, ec=FALSE, maxn, m, theta,
                   pi1, pi0, p_func, type=c("t1", "t2", "F"),
                   ngroups=2, adaptive) {
  if (type == "t1") {
    fc <- 2
  } else {
    fc <- 1
  }
  n <- 4/fc
  pwr <- pow(alpha=alpha, gamma=gamma, ec=ec, n=n, m=m, theta=theta,
             pi1=pi1, pi0=pi0, p_func=p_func, type=type,
             ngroups=ngroups, adaptive=adaptive)
  nseq <- n
  pseq <- pwr
  while (pwr < lvl & n < maxn) {
    n <- n + 2/fc
    pwr <- pow(alpha=alpha, gamma=gamma, ec=ec, n=n, m=m, theta=theta,
               pi1=pi1, pi0=pi0, p_func=p_func, type=type,
               ngroups=ngroups, adaptive=adaptive)
    nseq <- c(nseq, n)
    pseq <- c(pseq, pwr)
  }
  out <- list(n=n, power=pwr, nseq=nseq, pseq=pseq)
  return(out)
}
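calc.n() steps the sample size upward until the estimated power first reaches the desired level. The same search pattern can be illustrated in a self-contained way for a single two-sample t-test using base R's power.t.test(), and cross-checked against power.t.test's own sample size solver. The function find.n and all values below are illustrative, not part of the thesis code.

```r
# Sketch: the same step-up sample size search for a single two-sample t-test,
# using base R's power.t.test(). find.n and all values are illustrative.
find.n <- function(lvl, delta, sig.level, maxn = 10000) {
  n <- 2  # per-group sample size
  pwr <- power.t.test(n = n, delta = delta, sig.level = sig.level)$power
  while (pwr < lvl && n < maxn) {
    n <- n + 1
    pwr <- power.t.test(n = n, delta = delta, sig.level = sig.level)$power
  }
  list(n = n, power = pwr)
}

res <- find.n(lvl = 0.8, delta = 0.5, sig.level = 0.05)
# Cross-check against power.t.test's own solver, which returns a fractional n:
n.exact <- power.t.test(power = 0.8, delta = 0.5, sig.level = 0.05)$n
```

The stepping search should stop at the smallest integer n whose power meets the target, i.e. at ceiling(n.exact).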
B.2 Code for Simulation Study 2: Equicorrelation
Here, as an example, we provide the code used for assessing the effect of equicorrelation.
The other simulation studies performed are of a similar form.
B.2.1 Function for generating correlated features
equicorrelate <- function(n, m, r) {
  r <- sqrt(r)
  x <- rnorm(n)
  x <- matrix(rep(x, m), nrow=n)
  e <- matrix(rnorm(n*m, mean=0, sd=sqrt(1 - r^2)), n, m)
  features <- r*x + e
  return(features)
}
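Because each feature is √r times a shared standard normal factor plus independent noise with variance 1 − r, every feature has unit variance and every pair of features has correlation r. A standalone sketch checking this empirically (the function is repeated so the chunk runs on its own; n = 5000, m = 20, and r = 0.5 are illustrative):

```r
# Sketch: features from equicorrelate() should have pairwise correlation close
# to r. (Function repeated from above so this chunk is self-contained.)
equicorrelate <- function(n, m, r) {
  r <- sqrt(r)
  x <- rnorm(n)
  x <- matrix(rep(x, m), nrow = n)
  e <- matrix(rnorm(n*m, mean = 0, sd = sqrt(1 - r^2)), n, m)
  r*x + e
}

set.seed(1)
feat <- equicorrelate(n = 5000, m = 20, r = 0.5)
cors <- cor(feat)
mean.offdiag <- mean(cors[upper.tri(cors)])  # should be close to r = 0.5
```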
B.2.2 Function for simulating two-sample t-tests
sim.twosamp <- function(n, m, p, theta, equicor=0, side) {
  # Create correlated, normally distributed features
  x <- equicorrelate(n=n, m=m, r=equicor)
  # Split the data into two parts
  x1 <- x[1:floor(n/2), ]
  x2 <- x[(floor(n/2) + 1):n, ]
  n1 <- nrow(x1)
  n2 <- nrow(x2)
  # Take the sample means for each group and calculate the mean differences
  xbars1 <- colMeans(x1)
  xbars2 <- colMeans(x2)
  xdiff <- (xbars1 - xbars2)
  # Shift the mean differences for the false nulls
  mean.shift <- theta*sqrt(n)*1*sqrt(1/n1 + 1/n2)
  if (side == "left") {
    u <- runif(m)
    mean.diff <- xdiff*(u > p) + (xdiff - mean.shift)*(u < p)
  } else if (side == "right") {
    u <- runif(m)
    mean.diff <- xdiff*(u > p) + (xdiff + mean.shift)*(u < p)
  } else if (side == "both") {
    u <- runif(m)
    u2 <- runif(m)
    mean.diff <- xdiff*(u > p) + (xdiff + mean.shift)*(u < p)*(u2 < 0.5) +
      (xdiff - mean.shift)*(u < p)*(u2 > 0.5)
  }
  # Calculate the pooled standard deviation
  xs1 <- apply(x1, 2, var)
  xs2 <- apply(x2, 2, var)
  pooled.sd <- sqrt(((n1 - 1)*xs1 + (n2 - 1)*xs2)/(n - 2))
  # Transform to two-sample t-statistics
  t.stats <- mean.diff/(pooled.sd*sqrt(1/n1 + 1/n2))
  # Calculate p-values
  if (side == "left") {
    p.vals <- pt(t.stats, df=n - 2)
  } else if (side == "right") {
    p.vals <- 1 - pt(t.stats, df=n - 2)
  } else if (side == "both") {
    p.vals <- 2*(1 - pt(abs(t.stats), df=n - 2))
  }
  # True feature indicator
  f.true <- as.numeric(u < p)
  # Return a list
  out <- list(pvals=p.vals, ftrue=f.true)
  return(out)
}
B.2.3 Function for power simulations
simpower.cor <- function(n, m, p, theta, alpha, equicor=0, side) {
  dat <- sim.twosamp(n=n, m=m, p=p, theta=theta, equicor=equicor, side=side)
  p.vals <- dat$pvals
  sorted.pvals <- sort(p.vals)
  ftrue <- dat$ftrue
  indices <- which(sorted.pvals <= alpha*(1:m)/m)
  if (length(indices) > 0) {
    max.index <- max(indices)
  } else {
    max.index <- 0
  }
  p.hat <- max.index/m
  n.rej <- max.index
  if (n.rej == 0) {
    c.rej <- 0
    c.rej.prop <- 0
    u.hat <- 0
  } else {
    u.hat <- sorted.pvals[max.index]
    c.rej <- sum(ftrue[order(p.vals)] == 1 & sorted.pvals <= u.hat)
    c.rej.prop <- c.rej/sum(ftrue)
  }
  out <- c(u.hat, p.hat, n.rej, c.rej, c.rej.prop)
  return(out)
}
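The rejection step inside simpower.cor() is the Benjamini-Hochberg step-up rule: reject the k smallest p-values, where k is the largest index with p(k) ≤ αk/m. As a standalone cross-check, this count should always equal the number of BH-adjusted p-values at or below α as returned by base R's p.adjust(); the simulated p-values below are illustrative (900 uniform nulls plus 100 Beta(1, 20)-distributed "signal" p-values).

```r
# Sketch: cross-check the step-up rejection count against p.adjust(..., "BH").
# The simulated p-values are illustrative only.
set.seed(123)
m <- 1000; alpha <- 0.1
p.vals <- c(runif(m - 100), rbeta(100, 1, 20))

# Step-up rule as used in simpower.cor()
sorted.pvals <- sort(p.vals)
indices <- which(sorted.pvals <= alpha*(1:m)/m)
n.rej.stepup <- if (length(indices) > 0) max(indices) else 0

# Equivalent formulation via BH-adjusted p-values
n.rej.padjust <- sum(p.adjust(p.vals, method = "BH") <= alpha)
# n.rej.stepup and n.rej.padjust are equal by construction of the BH adjustment
```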
B.2.4 Simulation script
## Preliminaries:
rm(list=ls())
options(warn=1)
set.seed(1402)

## Defining experimental factors:
theta <- 0.436
m <- 1000
p <- 0.1
alpha <- 0.1
n.seq <- c(seq(4, 20, 2), seq(24, 40, 4), seq(50, 100, 10))
cor.seq <- seq(0, 0.9, 0.1)
niter <- 1000

## Initializing objects
results.cor <- list()

## Start timer
timer0 <- proc.time()[3]

## Performing simulations
for (j in 1:length(cor.seq)) {
  phat <- c()
  pow <- c()
  ppow <- c()
  pfdr <- c()
  fdr <- c()
  pstar <- c()
  psd <- c()
  for (i in 1:length(n.seq)) {
    # Progress notifier
    cat("Computing rho =", cor.seq[j], "and n =", n.seq[i], "\n")
    # Simulate results
    raw.results <- replicate(niter, simpower.cor(n=n.seq[i], m=m, p=p,
      theta=theta, alpha=alpha, equicor=cor.seq[j], side="both"))
    # Calculate mean proportion rejected E(Rm/m) and its standard deviation
    phat[i] <- mean(raw.results[2, ])
    psd[i] <- sd(raw.results[2, ])
    # Calculate power/positive power
    pow.vec <- raw.results[5, ]
    ppow[i] <- mean(pow.vec, na.rm=TRUE)
    pow.vec[is.nan(pow.vec)] <- 0
    pow[i] <- mean(pow.vec)
    # Calculate fdr/positive fdr
    fdr.vec <- 1 - raw.results[4, ]/raw.results[3, ]
    pfdr[i] <- mean(fdr.vec, na.rm=TRUE)
    fdr.vec[is.nan(fdr.vec)] <- 0
    fdr[i] <- mean(fdr.vec)
    # Calculate the asymptotic proportion rejected p*
    pstar[i] <- u.star(alpha=alpha, n=n.seq[i], c=theta, s=1, p=p,
      type="F")/alpha
  }
  results.cor[[j]] <- list(phat=phat, pow=pow, ppow=ppow, pfdr=pfdr,
    fdr=fdr, pstar=pstar, psd=psd)
}

## Stop timer
timer1 <- proc.time()[3]
timer <- round(timer1 - timer0)
seconds <- timer %% 60
minutes <- (timer %/% 60) %% 60
hours <- (timer %/% 60) %/% 60
cat("The simulation took", hours, "hour(s),", minutes,
    "minute(s), and", seconds, "second(s) to complete.")