Mathematical Institute
Master Thesis
Statistical Science for the Life and Behavioural Sciences
The Power of the Benjamini-Hochberg Procedure
Author: Wouter van Loon
First Supervisor: Prof. dr. J.J. Goeman
Leiden University Medical Center
Second Supervisor: Dr. M. van Iterson
Leiden University Medical Center
Third Supervisor: Dr. M. Fiocco
Leiden Mathematical Institute
May 2017
Abstract
Background: The Benjamini-Hochberg (BH) procedure is a popular method
for controlling the False Discovery Rate (FDR) in multiple testing experiments.
Currently available software for sample size calculation in the FDR context is based
on asymptotic behavior of the BH procedure, assuming independent test statistics
and possibly a common effect size. In this study, we investigate how this asymptotic
behavior, in terms of the proportion of rejected hypotheses and average power,
relates to performance of the BH procedure when these assumptions are not met.
Furthermore, we decompose the asymptotic expression for the average power, and
propose a number of alternative choices for its components, including the possibility
of controlling the False Discovery Proportion (FDP) exceedance probability rather
than the FDR.
Results: We performed a number of simulation experiments to assess the effects
of the number of tested hypotheses, sample size, basic dependence, and presence of
an effect size distribution, on the performance of the BH procedure. Our results show
that testing fewer hypotheses is associated with an increase in the variance of the
power distribution. A conservative power estimator based on the law of the iterated
logarithm can ensure a high probability of the power exceeding the desired level,
and automatically scales with the number of tested hypotheses. We find this bound
remains effective even under low to moderate equicorrelation (ρ ≤ 0.5), or if only
the true null hypotheses are correlated. In the presence of an effect size distribution,
sample sizes calculated assuming a common effect size are under-estimated for higher
power levels, and over-estimated for lower power levels.
Conclusions: Sample size calculations based on the asymptotic average power
are quite robust to violations of the assumptions of infinite and independent tests.
Related, but more conservative, estimators allow a researcher to ensure a high prob-
ability of exceeding a desired power level and/or make confidence statements about
the FDP in the rejection set. Basing sample size calculations on the assumption
of a common effect size should be carefully considered, however, since depending
on the shape of the effect size distribution and desired power level, the resulting
sample size estimates may not lead to adequate power. All estimators described in
this paper have been incorporated into a Shiny application, which is freely available
at: https://wsvanloon.shinyapps.io/bhpower.
Contents
1 Introduction
1.1 Multiple Testing and the False Discovery Rate
1.2 The Benjamini-Hochberg Procedure
1.3 Asymptotic Behavior of the BH Procedure: The Criticality Phenomenon
1.4 Power in the FDR Context
1.5 Confidence Bounds for False Discovery Proportions
1.6 Aims of the Current Study
2 Methods
2.1 Computation of α∗, u∗ and p∗
2.2 Different Approaches to Average Power
2.2.1 Average Power of the BH Procedure
2.2.2 FDP Exceedance Control
2.2.3 The Proportion of Rejected Hypotheses
2.2.4 The Proportions of True and False Null Hypotheses
2.2.5 Power Estimators for Sample Size Calculation
2.3 Simulation Studies
2.3.1 Simulation Study 1: Sample Size and Number of Tests
2.3.2 Simulation Study 2: Dependence
2.3.3 Simulation Study 3: Simulations Based on Real Data
2.3.4 Simulation Study 4: Comparison of Sample Size Estimates
3 Results
3.1 Comparison of Power Estimators
3.2 Sample Size and Number of Tests
3.3 Dependence
3.4 Simulations Based on Real Data
3.5 Comparison of Sample Size Estimates
4 Discussion
4.1 Performance of the BH Procedure in a Non-Asymptotic Setting
4.2 Performance of the BH Procedure Under Basic Forms of Dependence
4.3 Performance of the BH Procedure With an Effect Size Distribution
4.4 Performance of Power Estimators
4.5 Sample Size Calculation in Practice
4.6 Limitations and Future Research
4.7 Concluding Remarks
References
A Additional Tables
B Selected R Code
B.1 Functions for Sample Size and Power Calculations
B.2 Code for Simulation Study 2: Equicorrelation
1 Introduction
1.1 Multiple Testing and the False Discovery Rate
Multiple testing refers to a situation where multiple hypotheses are tested in the context
of a single experiment. A classical hypothesis test is performed by calculating a test-
statistic and comparing it to the appropriate distribution to obtain a p-value. This
p-value is subsequently compared to some predefined significance level α, typically 0.05,
and if p ≤ α, the null hypothesis is rejected. For a single hypothesis test, this ensures
that the probability of rejecting the null hypothesis given that it is true is at most α.
However, a typical experiment leads to more than a single hypothesis test. In fact,
studies in such fields as genomics or neuroimaging may lead to thousands of hypothesis
tests. If we were to test m null hypotheses at significance level α, then we would expect
to find αm significant tests even if all the null hypotheses are true. So if we perform a
thousand hypothesis tests with α = 0.05, we would expect to reject fifty null hypotheses
even when none of the null hypotheses are false. Clearly, there is a need to correct for
multiple testing.
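The expectation αm above is easy to verify by simulation. The following sketch is purely illustrative (the thesis's own code, in Appendix B, is in R): it counts how many Uniform[0, 1] p-values, as produced by true nulls, fall below α.

```python
import random

def expected_false_positives(m, alpha):
    """Expected number of rejections at level alpha when all m nulls are true."""
    return alpha * m

def simulate_false_positives(m, alpha, seed=0):
    """Under a true null the p-value is Uniform[0, 1], so each test is
    rejected with probability alpha; the total count is Binomial(m, alpha)."""
    rng = random.Random(seed)
    return sum(rng.random() <= alpha for _ in range(m))
```

With m = 1000 and α = 0.05 the expectation is 50, matching the example above, and a simulated count will typically fall close to it.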
Perhaps the best-known method to correct for multiple testing is the Bonferroni
correction (Dunn, 1961), which constitutes testing each individual hypothesis at a signif-
icance level of α/m. Applying the Bonferroni correction controls the Family-Wise Error
Rate (FWER), that is, the probability of at least one incorrect rejection. However, it
should be noted that as m gets very large, the test-wise significance level α/m gets very
small, leading to a loss in power. Although more powerful methods for controlling the
FWER exist, such as those by Holm (1979), Hommel (1986), and Hochberg (1988), all
FWER controlling procedures suffer from the same drawback, namely that they do not
scale well with m. In fact, it has been shown for these methods (e.g. Meijer, Krebs,
Solari, & Goeman, 2016) that as the number of tested hypotheses m goes to infinity, the
proportion of rejected hypotheses approaches zero. If we intend to test many hypotheses,
this is clearly an undesirable property.
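In code, the Bonferroni correction is a one-liner; the sketch below (Python, for illustration only) simply compares every p-value to α/m.

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Bonferroni correction: reject hypothesis i iff p_i <= alpha / m.
    This controls the FWER at level alpha under any dependence structure."""
    m = len(pvals)
    return [i for i, p in enumerate(pvals) if p <= alpha / m]
```

Note how the per-test threshold α/m shrinks as m grows, which is exactly the loss of power discussed above.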
Instead of looking solely at the behavior of FWER controlling procedures, we can
also reconsider if it is really necessary to control the probability of at least one incorrect
rejection. If we were to perform a study where the main goal is to identify a set of
candidate genes to be used later in a validation experiment, we may decide that including
a few false positives in this set of candidate genes is not much of a problem, as long as
the number of false positives is not too large compared to the number of true positives.
This justification, in combination with the behavior of FWER controlling methods for
large m, may lead us to control a different error rate.
One such error rate is the False Discovery Rate (FDR) introduced by Benjamini
and Hochberg (1995), which is defined as the expected proportion of falsely rejected
hypotheses among the set of rejected hypotheses if there is at least one rejection, and
zero otherwise. Unlike FWER controlling procedures, FDR controlling procedures often
scale well with m, although this does require some conditions to hold which we will later
discuss.
1.2 The Benjamini-Hochberg Procedure
In their 1995 paper, Benjamini and Hochberg introduced not only the false discovery rate,
but also a procedure to control it, now commonly known as the “Benjamini-Hochberg
procedure” or “BH procedure”. The BH procedure is the best-known method for FDR
control and is, for example, included in the R function p.adjust under both the names
‘BH’ and ‘fdr’. To apply the BH procedure, we first test each of the m hypotheses
under consideration by calculating a test-statistic, and comparing it to the appropriate
distribution to obtain a p-value. Let p1 ≤ p2 ≤ ... ≤ pm be the ordered p-values, and
Hi the null hypothesis corresponding to pi. Then, to obtain an FDR control level α, we
reject all Hi for i = 1, 2, ..., k, with
k = max{i : pi ≤ (i/m)α}, (1)
and reject no hypotheses if this maximum does not exist. This procedure controls the
FDR at α for any configuration of false null hypotheses, assuming independent test
statistics (Benjamini & Hochberg, 1995). Actually, the BH procedure controls the FDR
at a level lower than α, namely at (m0/m)α, with m0 the number of true null hypotheses.
As m0 is typically unknown, the BH procedure provides a conservative approach that is
valid for any m0. However, if m0 is known, or more realistically, if some estimate of m0
is available, this can be incorporated in the BH procedure to make it less conservative,
as in e.g. Benjamini and Hochberg (2000). Benjamini and Yekutieli (2001) showed
that the BH procedure also controls the FDR for certain types of positive dependency
among the statistics. Since then, further theoretical work and simulation studies have
shown the BH procedure is quite robust, and remains valid for a wide variety of common
dependency structures (Goeman & Solari, 2014).
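The step-up rule in (1) can be sketched in a few lines. The thesis's own code is in R (see Appendix B); this Python version is only an illustrative reimplementation.

```python
def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure, equation (1): find the largest
    rank k with p_(k) <= (k / m) * alpha and reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank  # step-up: remember the *last* rank under the rejection line
    return sorted(order[:k])  # indices of rejected hypotheses
```

For example, with p-values (0.04, 0.05) at α = 0.05, rank 1 fails its threshold 0.025 but rank 2 meets its threshold 0.05, so both hypotheses are rejected: the step-up character of the procedure.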
Recall that the FDR is defined as the expected proportion of falsely rejected hy-
potheses among the set of rejected hypotheses if there is at least one rejection, and zero
otherwise. There exists a related quantity, the positive FDR (pFDR), which is defined
as the expected proportion of falsely rejected hypotheses among the set of rejected hypotheses given that there is at least one rejection (Storey, 2003). It should be noted
that while the BH-procedure guarantees that the FDR is controlled at α, the same is
not necessarily true for the pFDR. However, if the proportion of false null hypotheses
is positive, then for very large m the probability of no rejections is effectively zero, in
which case the pFDR equals the FDR, and so both are controlled at level α.
To gain more insight into the workings of the BH procedure, consider how p-values
are obtained. We typically assume p-values are sampled from some mixture model, with
a fixed proportion of false nulls among all the nulls, which we will denote π as in Chi
(2007). So a randomly sampled p-value belongs to a false null hypothesis with probability
π, and to a true null hypothesis with probability 1− π. The distribution of the p-values
that belong to a true null hypothesis is uniform on u ∈ [0, 1], such that for any p-value
pj , P(pj ≤ u|Hj = true) = u. The distribution of the p-values that belong to a false
null hypothesis is non-uniform, such that P(pj ≤ u|Hj = false) = G(u). The common
distribution function of the p-values is then:
F(u) = (1 − π)u + πG(u). (2)
The BH procedure approximates the inverse of this distribution function, F−1(u), by
the sequence of ordered p-values. This empirical inverse distribution function, F−1m (u), is
then compared to a rejection line with intercept equal to zero, and slope α. All p-values
up to and including the highest p-value under this line are then rejected. Note that
this implies the BH procedure is a so-called step-up procedure, that is, if the empirical
inverse distribution function crosses the rejection line more than once, all p-values before
the last intersection are rejected, as can be observed in Figure 1.
[Figure: ordered p-values pi plotted against i/m, with the rejection line αi/m; rejected p-values are marked.]
Figure 1: The Benjamini-Hochberg procedure applied to a set of m = 100 ordered p-values, using an FDR control level of α = 0.5. Note that some of the rejected p-values are actually above the rejection line: this is a typical feature of step-up procedures, where all p-values before the last intersection are rejected.
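The mixture model in (2) is straightforward to simulate from. The sketch below (Python, illustrative only) assumes one-sided Z-tests with an arbitrary effect size µ for the false nulls; true nulls yield Uniform[0, 1] p-values.

```python
import random
from statistics import NormalDist

_N = NormalDist()  # standard normal

def sample_pvalues(m, pi, mu, seed=1):
    """Draw m independent p-values from the mixture in (2): with probability
    pi a test is a false null (one-sided Z-test with shift mu), otherwise a
    true null whose p-value is Uniform[0, 1]."""
    rng = random.Random(seed)
    pvals, is_false = [], []
    for _ in range(m):
        f = rng.random() < pi
        z = rng.gauss(mu if f else 0.0, 1.0)
        pvals.append(1.0 - _N.cdf(z))  # one-sided p-value P(Z >= z)
        is_false.append(f)
    return pvals, is_false
```

As expected, p-values drawn from the false-null component are stochastically smaller than those from the uniform true-null component.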
1.3 Asymptotic Behavior of the BH Procedure: The Criticality Phe-
nomenon
We stated previously that FDR controlling procedures scale well with m. This includes
the BH procedure: Genovese and Wasserman (2002) showed that under some conditions,
the proportion of rejected hypotheses converges in probability to some positive value,
rather than to zero like FWER controlling procedures. Chi (2007) characterized this
convergence in an even stronger sense, by providing asymptotic bounds on the deviation
of the proportion of rejected hypotheses from this value. In this section, we will briefly
describe the conditions under which this convergence occurs.
The conditions for convergence of the proportion of rejected hypotheses to some
positive value were perhaps most elegantly explained in Chi (2007), where they are
characterized in terms of a criticality phenomenon. We will denote by Rm the number
of hypotheses rejected by the BH-procedure, so that Rm/m is the proportion of rejected
hypotheses. As in Chi (2007), we assume independent p-values with a common distribu-
tion function as defined in (2). The criticality phenomenon is defined as follows: there
can be some critical value α∗ > 0 for which the asymptotic behavior of the BH proce-
dure is different when α < α∗ compared to when α > α∗ (Chi, 2007). In particular, if
α > α∗, then as m → ∞, Rm/m converges to some positive value, but if α < α∗, Rm/m is
asymptotically zero. The critical value α∗ is a property of the distribution function F
of the p-values. In particular, if F is strictly concave (Chi, 2007),
α∗ = 1/F′(0). (3)
Thus, α∗ represents the slope of F−1 at zero. To explain why this is the critical value,
we consider first the case where F′(0) = ∞, and note that by the law of large numbers
as m → ∞, the empirical distribution function Fm → F in probability. If F′(0) = ∞,
then α∗ = 0. If the slope of F−1 at zero is zero, then surely for any rejection line with
slope α > 0, there will be some positive proportion of F−1 that is below this rejection
line. So as m → ∞, Rm/m will converge to a positive proportion for all α > 0. Therefore,
the criticality phenomenon does not occur.
Next, consider the case where F ′(0) < ∞. Now α∗ > 0, and so it is possible to
choose an α such that 0 < α < α∗. In this case, the slope of F−1 at zero is greater
than the slope of the rejection line at zero, which implies the entirety of F−1 is above
the rejection line. In other words, the proportion of F−1 that lies below the rejection
line is zero, and thus Rm/m → 0 as m → ∞. But if we choose an α such that α > α∗,
some positive proportion of F−1 will be below the rejection line again, and so Rm/m will
converge to this positive proportion. A p-value distribution for which α > α∗ has been
referred to as being Simes detectable at α (Meijer et al., 2016). The borderline case
α = α∗ is complex and has little practical relevance, so we will not consider it in this
study.
[Figure: two panels plotting F−1(u) against u on [0, 0.5], with the rejection line (slope α) and the tangent at zero (slope α∗); the bottom panel marks u∗ and p∗.]
Figure 2: Top: Application of the BH procedure with α < α∗. It can be observed that the whole of F−1(u) is above the rejection line, and therefore u∗ = p∗ = 0. Bottom: Application of the BH procedure with α > α∗. The values u∗ and p∗ are, respectively, the coordinates u and F(u) at the intersection of F−1(u) and the rejection line.
As in Chi (2007), we will denote the proportion to which Rm/m converges by p∗, which
serves as the limit (over m) of the proportion of rejected p-values. Then u∗ = F−1(p∗)
is the limit of the largest rejected p-value. We can calculate u∗ by (Chi, 2007)
u∗ = max{u ∈ [0, 1] : u/α ≤ F(u)}, (4)
and obtain p∗ by
p∗ = F(u∗) = u∗/α. (5)
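Once F is specified, equations (4) and (5) can be evaluated numerically. The sketch below (Python, illustrative only) assumes one-sided Z-tests with effect size µ for the false nulls, and approximates u∗ by a simple grid search.

```python
from statistics import NormalDist

_N = NormalDist()

def mixture_cdf(u, pi, mu):
    """F(u) = (1 - pi) u + pi G(u) for one-sided Z-tests with shift mu,
    where G(u) = 1 - Phi(Phi^{-1}(1 - u) - mu)."""
    g = 1.0 - _N.cdf(_N.inv_cdf(1.0 - u) - mu)
    return (1.0 - pi) * u + pi * g

def u_star(alpha, pi, mu, grid=10000):
    """Largest u with u / alpha <= F(u): a grid approximation of equation (4)."""
    best = 0.0
    for i in range(1, grid):
        u = i / grid
        if u / alpha <= mixture_cdf(u, pi, mu):
            best = u
    return best
```

By (5), p∗ then follows as u∗/α. When µ = 0 there is no signal, the condition in (4) holds only at u = 0, and the function returns u∗ = 0.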
As α∗ is a function of F , its value depends on the distributions of the chosen test-
statistics. If the distribution of the test-statistics under a null hypothesis is standard
normal, and the distribution under an alternative hypothesis is normal with σ > 1, or
σ = 1 and µ > 0, then the criticality phenomenon does not occur (Chi, 2007). If the test-
statistics follow t- or F-distributions, the situation is more complex. In fact, assuming
π > 0, the criticality phenomenon always occurs when using t-statistics, implying that
for a given set of distributions, the FDR control level always needs to be chosen such
that α > α∗ if we want Rm/m to be asymptotically non-zero. However, α is typically
fixed, whereas the t-distributions depend on both a sample size n and some effect size
θ. We can thus reverse the problem and state that, for a given α and θ, there is a
minimum sample size required to obtain α∗ < α. Whether or not α∗ is smaller than α
says something about the capacity of the BH procedure to asymptotically find signal in
the data, and can therefore be interpreted as a measure of power. In the next section, we
will discuss the meaning of power in a broader sense for methods controlling the FDR,
and discuss sample size calculation.
1.4 Power in the FDR Context
Sample size calculations are an important preliminary step in almost any study. Such
calculations are typically based on some measure of power, that is, some measure that
quantifies the ability of the chosen statistical methods to detect the effect of interest for
a given sample size. The aim is then to arrive at a sample size high enough to obtain a
desired power level, but not so high as to be unnecessarily expensive. In the context of
traditional hypothesis testing, the concept of power is well-defined: it is the probability
of rejecting a null hypothesis, given that it is false. Typically some desired power level is
formulated, e.g. 0.80 or 0.90, and this value, together with the desired significance level
and some effect size measure, is entered into a formula, and the result is some minimum
value of n. In the context of multiple testing, the situation is generally not so simple.
The classical definition of power is something that applies to a single hypothesis.
In the context of multiple testing, the individual power level for a single hypothesis is
sometimes referred to as the per-pair power (Horn & Dunnett, 2004; Keselman, Cribbie,
& Holland, 2002). It is, however, much more convenient to specify the desired power of
a multiple testing experiment in terms of a single quantity. The concept of power can
be extended to the entire set of hypotheses in multiple ways. The most commonly used
extension is the concept of average power, that is, the expected proportion of rejected
false null hypotheses among the set of false null hypotheses (e.g. Benjamini & Hochberg,
1995; Ferreira & Zwinderman, 2006; Jouve, Maucort-Boulch, Ducoroy, & Roy, 2009),
which has also been referred to as the True Positive Rate (TPR) (Kvam, Liu, & Si,
2012). The average power is the most natural extension to multiple testing, since if each
hypothesis has the same individual power level, the average power equals the per-pair
power. Although the average power is the most commonly used measure, some research
has focused on other approaches to power. For example, Lee and Whitmore (2002) used
what they termed the family power, which is defined as the probability of rejecting all of
the false null hypotheses. This measure of power has also been studied under the names
all-pairs power (Horn & Dunnett, 2004; Keselman et al., 2002) and collective power
(Jouve et al., 2009). Horn and Dunnett (2004) considered the probability of rejecting
at least one false null hypothesis, which they termed the any-pair power, and which has
also been studied in Yang, Cui, Chazaro, Cupples, and Demissie (2005), and in Jouve et
al. (2009) under the name relaxed power. This measure of power can be extended to the
probability of rejecting at least a certain number or proportion of false null hypotheses,
which has been termed the overall power (Shao & Tseng, 2007; Wang & Chen, 2004).
A quantity related to power is the False Nondiscovery Rate (FNR), which is defined
as the expected proportion of incorrect nonrejections among the nonrejections (Genovese
& Wasserman, 2002). Sarkar (2004) defined the power of a multiple testing procedure as
1− (FDR + FNR). This is a measure of power in a broader sense, as it strikes a balance
between the proportion of correctly accepted null hypotheses and the proportion of
falsely rejected null hypotheses.
Which of these power measures is preferable will probably depend on the situation
at hand. Controlling the family power level is typically unfeasible since the probability of
rejecting all false nulls quickly approaches zero as the number of false nulls increases (Lee
& Whitmore, 2002). The average power is the most obvious and common measure to
control, and is the most natural extension of the concept of power to the multiple testing
setting. However, we may decide that we do not necessarily require an average power
of at least, say, 0.80, but that we are already satisfied if we expect to find something,
i.e. expect at least one true discovery. In this case the relaxed power may be a suitable
measure to use. Alternatively, we could decide to base sample size calculations on
the criticality phenomenon, and calculate the minimum n for which α∗ < α, since this
ensures that the size of the rejection set does not vanish as m→∞. Obviously, this leads
to much lower sample size estimates than those obtained from controlling the average
or family power level.
In order to base sample size calculations on a measure of power, we need to be
able to calculate this power before the experiment takes place. If one does not wish
to resort to simulations, some expression for the power measure of choice needs to be
available. Such expressions are typically based on asymptotic properties, that is, based
on how the method behaves for an infinite number of hypotheses. Both Jung (2005)
and Liu and Hwang (2007) developed methods for sample size calculation in the context
of FDR control, based on such expressions for the average power. These methods are
closely related, and in identical settings they produce identical results (Liu & Hwang,
2007). These methods can be used for Z-, t-, and F-tests, and the results of Liu and
Hwang (2007) have been incorporated into the R package ssize.fdr (Orr & Liu, 2015).
Users of this package can select the type of test that is going to be performed, whether
tests are one- or two-sided, and can specify either a fixed effect size, or an effect size
distribution. The specification of an effect size distribution is typically only used when
pilot data is available from which such a distribution can be estimated. Estimating effect
size distributions is not a trivial affair; see, for example, Van Iterson, Van de Wiel, Boer,
and De Menezes (2013), who provide a method for estimating effect size densities from
pilot data that is applicable in a wide variety of settings. In the current study, we will
assume a fixed effect size.
It should be noted that ssize.fdr is not the only R package available for sample
size calculations in the FDR context. Other packages include FDRsampsize (Pounds,
2016) and ssizeRNA (Bi & Liu, 2017). Both of these packages are also based on average
power.
In addition to being based on asymptotic properties, power calculations typically
make further assumptions. For example, we will give an expression for the average
power of the BH procedure in section 2.2 which additionally assumes independent test
statistics. In a real-life setting, however, test statistics are typically not independent,
and the number of tested hypotheses is certainly not infinite. From this discrepancy, a
question arises: How accurate are power calculations based on such formulas compared
to the average power we observe when their assumptions are not met? Furthermore, if
these calculations are not as accurate as we would hope, can we offer a more conservative
power estimate?
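The observed average power referred to here is easy to estimate by simulation. The following self-contained Python sketch is purely illustrative (the thesis's own simulation code is in R, and one-sided Z-tests with an arbitrary effect size µ are an assumption): it repeatedly generates data, applies the BH procedure, and averages the proportion of rejected false nulls among the false nulls.

```python
import random
from statistics import NormalDist

_N = NormalDist()

def bh_reject(pvals, alpha):
    """Indices rejected by the BH step-up rule (1)."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    m, k = len(pvals), 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return set(order[:k])

def average_power(m, pi, mu, alpha, reps=50, seed=42):
    """Monte Carlo estimate of the average power of the BH procedure: the
    expected proportion of rejected false nulls among the false nulls."""
    rng = random.Random(seed)
    total, used = 0.0, 0
    for _ in range(reps):
        false = [rng.random() < pi for _ in range(m)]
        pvals = [1.0 - _N.cdf(rng.gauss(mu if f else 0.0, 1.0)) for f in false]
        m1 = sum(false)
        if m1 == 0:
            continue  # skip the (rare) replication with no false nulls
        rejected = bh_reject(pvals, alpha)
        total += sum(false[i] for i in rejected) / m1
        used += 1
    return total / used
```

Comparing such Monte Carlo estimates with the asymptotic formulas is exactly the kind of check pursued in the simulation studies of this thesis.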
Simulation studies concerning the power of FDR controlling procedures under dif-
ferent conditions have been performed before, but their aim is typically to show the
difference in power between FWER and FDR controlling procedures (e.g. Benjamini &
Hochberg, 1995; Keselman et al., 2002), or between different FDR controlling procedures
(e.g. Benjamini & Hochberg, 2000; Sarkar, 2004; Storey, 2002). Other studies have fo-
cused on comparing the performance of multiple testing methods when dealing with very
specific types of data (e.g. Groppe, Urbach, & Kutas, 2011; Jouve et al., 2009; Kvam et
al., 2012; Yang et al., 2005). In a study by Neuvial (2010), power calculations based on
the asymptotic results of Chi (2007) are compared to a situation where the number of
hypotheses is finite. Although they investigate the average power of the BH procedure
for different values of π and α, they only consider a single value for the number of hy-
potheses (m = 1000), and do not take into account dependency or different values of n.
They observe that if α > α∗, the power for m = 1000 is very similar to the asymptotic
power, whereas if α < α∗, the power for m = 1000 is slightly higher than the asymptotic
power. They conclude that the effect of the criticality phenomenon is less dichotomous
in real data analysis than is suggested by theory (Neuvial, 2010). They also investigate
the effect of sample size in a real data set, but consider only the proportion of rejected
hypotheses and not the power, since the data is unedited and the identity of the true
and false nulls is unknown (Neuvial, 2010).
The performance of the BH procedure under dependency has also been a topic of
previous research, although primarily to show that the BH procedure manages to control
the FDR under various common forms of dependency (e.g. Benjamini & Yekutieli, 2001;
Kim & van de Wiel, 2008). Kim and van de Wiel (2008) additionally investigated the
FNR and concluded that the BH procedure can attain a low FNR level even under
dependence. However, they utilized constrained random correlation matrices in which
only the number of correlated variables and the variance of the pairwise correlations
were controllable parameters (Kim & van de Wiel, 2008), which makes it difficult to
draw conclusions about the effect of correlation magnitude. Hemmelmann, Horn, Susse,
Vollandt, and Weiss (2005) investigate the power of the BH procedure when the features
are equicorrelated. For a fixed m (40) and n (8) they compare the average power
under two levels for the pair-wise correlations (0.2 and 0.8) and conclude that a higher
correlation leads to a reduction in power (Hemmelmann et al., 2005). However, the
observed differences in power between pair-wise correlations of 0.2 and 0.8 appear very
small. They additionally observe that larger pair-wise correlations lead to an increase in
the variance of the observed False Discovery Proportion (FDP), although they conclude
that if the BH procedure is applied with α = 0.05, it is uncommon for the observed FDP
to exceed 0.1 (Hemmelmann et al., 2005). Jung (2005) evaluates their method for sample
size calculation, which assumes independent test statistics, in a simulation experiment
with a compound symmetry correlation structure. They conclude that their approach
works well under weak dependency, but when the features are strongly correlated (ρ =
0.6), the interquartile range (IQR) of the observed proportion of rejected hypotheses
almost doubles, although the median observed proportion remains close to the proportion
predicted under independence (Jung, 2005). Shao and Tseng (2007) observe a similar
increase in the variance of the average power with a 2-block correlation structure for the
false nulls, and assuming independence or some low constant correlation (ρ = 0.1) for the
true nulls. They suggest a method for dependence adjustment of sample size calculations,
but this method requires the correlation structure in the data to be known or estimated
(Shao & Tseng, 2007).
From these previous studies, it appears that the asymptotic proportion of rejected
hypotheses calculated under the assumption of independent test statistics remains a de-
cent estimator for the median observed proportion of rejected hypotheses, even when
there are strong correlations in the data. The same applies to the average power. How-
ever, an increase in correlation magnitude does lead to an increase in variance for the
distributions of these quantities. Although large increases have been observed for high
correlations like ρ = 0.6 (Jung, 2005) and ρ = 0.8 (Shao & Tseng, 2007), such situations
are not necessarily realistic. In fact, a simulation study based on real data by Shao and
Tseng (2007) showed far less dramatic differences. An increase in the variance of the
average power, especially if this increase is small, does not necessarily mean that sample
size calculations based on an independence assumption cannot be used.
In the current study, we will further investigate the effects of the number of tested
hypotheses, sample size, and dependency, on the proportion of rejected hypotheses and
average power of the BH procedure. We will compare the observed values of these quan-
tities with the values predicted using the results of Chi (2007). We will investigate the
effects of various magnitudes of correlation between the features, and aim to provide
more insight regarding the level of dependency that is acceptable when power calcu-
lations are based on an independence assumption. We assume an equal effect size for
all hypotheses, but we will also evaluate how power calculations based on this assump-
tion perform when the effect sizes are not equal. Furthermore, we will propose several
conservative estimators for the power of the BH procedure.
1.5 Confidence Bounds for False Discovery Proportions
The FDR is defined as the expected proportion of falsely rejected hypotheses among the
set of rejected hypotheses. As such, FDR control means controlling an expectation, and
the realized False Discovery Proportion (FDP) may be higher or lower depending on the
data at hand. Various methods have been proposed for controlling the probability of
the FDP exceeding a pre-specified level, which has been referred to as FDP exceedance
control (Genovese & Wasserman, 2006). Such methods are generally either based on
permutations (e.g. Korn, Troendle, McShane, & Simon, 2004), or modeling of the FDP
distribution (e.g. Oura, Matsui, & Kawakami, 2009), but many have been found to be
too computationally expensive or too conservative for practical use (Hemmelmann et al.,
2005; Shang, Zhou, Liu, & Shao, 2012). The use of exceedance control instead of FDR
control is analogous to using a confidence interval instead of a point estimator (Genovese
& Wasserman, 2006), and can thus also be formulated in terms of a confidence interval
for the FDP as in Shang, Liu, and Shao (2012). Goeman and Solari (2011) showed how
to calculate confidence bounds on the number of false discoveries simultaneously over
all possible rejection sets, a finding that can also be formulated in terms of confidence
bounds on the FDP (Meijer et al., 2016). This method actually allows researchers to
pick the rejection set after having seen the data and guarantees valid confidence bounds
can be formulated for any such set, but it can also be used for more classical FDP
control (Meijer et al., 2016). One could, for example, choose the largest rejection set so
that the upper 95% confidence bound is smaller than some maximum acceptable FDP.
Alternatively, one could use the 50% confidence bound to control the median FDP.
Of primary interest to the current study is the relation that exists between the FDP
confidence bounds from Meijer et al. (2016) and the BH procedure. We will denote by
qγ(S) the 1− γ upper confidence bound for the FDP in rejection set S, and by SBH(α)
the rejection set of the BH procedure at FDR control level α. Meijer et al. (2016) showed
that for all 0 < α ≤ 1:
qγ(SBH(αγ)) < α. (6)
This means that if we want the FDP of a BH rejection set to be smaller than α with
probability at least 1 − γ, we can guarantee this by simply applying the BH procedure
with FDR control level αγ. This is an attractive method for FDP exceedance control
with the BH procedure, since it is very easy to apply.
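To illustrate how little machinery this requires, the rule can be sketched as follows (a minimal Python sketch rather than the R used elsewhere in this thesis; the function name bh_reject is ours):

```python
import numpy as np

def bh_reject(pvals, level):
    """Benjamini-Hochberg step-up: 0-based indices of hypotheses rejected
    at FDR control level `level`."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    crit = level * np.arange(1, m + 1) / m        # i * level / m, i = 1..m
    below = np.nonzero(p[order] <= crit)[0]
    if below.size == 0:
        return np.array([], dtype=int)
    k = below[-1]                                  # largest i with p_(i) <= i * level / m
    return np.sort(order[:k + 1])

# FDP exceedance control via (6): to get P(FDP >= alpha) <= gamma,
# simply run the BH procedure at FDR control level alpha * gamma.
alpha, gamma = 0.1, 0.05
pvals = [0.0001, 0.0003, 0.004, 0.03, 0.5, 0.9]
rejected = bh_reject(pvals, alpha * gamma)
```

With these numbers the BH cutoff at level αγ = 0.005 retains the two smallest p-values.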
In addition to the result in (6), Meijer et al. (2016) showed that if α∗ < αγ, there is
a set Sm for which |Sm|/m converges to a positive proportion, such that qγ(Sm) converges
in probability to some α′ < α. Furthermore, they show that α′ ≤ hα, with

$h = \min_{0 \le u < F(\gamma)} \frac{u\gamma - \gamma}{F^{-1}(u) - \gamma}.$ (7)
This implies that, for the BH procedure at FDR control level αγ with α∗ < αγ, as
m→∞, the upper confidence bound on the FDP converges in probability to a quantity
that is not just smaller than α, but at most hα, with h < 1 since α∗ < αγ ≤ γ.
So what is the interpretation of h? For a sequence of ordered p-values, a confidence
bound for the number of true null hypotheses can be obtained by (Meijer et al., 2016):
$h(\gamma) = \max\left\{\, i \in \{0, \dots, m\} : i\, p_{(m-i+j)} > j\gamma \ \text{for } j = 1, \dots, i \,\right\}.$ (8)

This is a confidence bound in the sense that P(# true nulls ≤ h(γ)) ≥ 1 − γ (Meijer
et al., 2016). As m → ∞, the quantity h(γ)/m converges in probability to h (Meijer et al.,
2016). This means that h can be interpreted as an asymptotic confidence bound on the
proportion of true nulls.
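The bound in (8) can be computed directly from the sorted p-values. A small Python sketch (the function name hommel_h is ours, and the direct O(m²) scan is for illustration only; faster algorithms exist):

```python
import numpy as np

def hommel_h(pvals, gamma):
    """Confidence bound (8): the largest i in {0, ..., m} such that
    i * p_(m-i+j) > j * gamma for all j = 1..i, with p_(1) <= ... <= p_(m).
    Then P(# true nulls <= h(gamma)) >= 1 - gamma."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = p.size
    for i in range(m, 0, -1):                      # try i = m, m-1, ..., 1
        j = np.arange(1, i + 1)
        if np.all(i * p[m - i + j - 1] > j * gamma):
            return i
    return 0

# Three very small p-values leave at most three plausible true nulls:
h_bound = hommel_h([1e-6, 1e-6, 1e-6, 0.5, 0.5, 0.5], 0.1)   # h(0.1) = 3
```

Dividing h(γ) by m then gives the finite-sample analogue of the asymptotic bound h on the proportion of true nulls.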
Although h no longer depends on the number of tests, it still depends on sample
size. For low n, h is close to 1, whereas for high n, h approaches 1 − π. An intuitive
explanation for this behavior is that if n is low, then even if the number of tests is infinite,
it will still be hard to deduce the proportion of true nulls from the p-value distribution.
We cannot be sure of the exact proportion of true null hypotheses, but can only say
that it is likely to fall in a certain range. As n increases, the contrast between the
true and false nulls becomes increasingly more pronounced, so that this range becomes
increasingly narrow and the 1− γ confidence bound, h, approaches the true proportion
of true nulls 1− π.
The quantity h thus plays an important role in the FDP confidence bound. In a
broader sense, using h at an appropriate confidence level can be considered a conservative
alternative to using a point estimate for the proportion of true null hypotheses. As
mentioned in section 1.2, such estimates can be incorporated into the BH procedure.
It can be anticipated that using more conservative methods, i.e. using FDP exceedance
control instead of FDR control or h instead of a point estimate for the proportion
of true nulls, leads to a reduction in predicted average power. We argue that such
lower predictions can serve as conservative power estimates, which can in turn be used
in sample size calculation. In section 2.2, we will attempt to formulate conservative
estimators for the average power of the BH procedure by combining these results of
Meijer et al. (2016) with those of Chi (2007).
1.6 Aims of the Current Study
In the current study, we will further investigate how the asymptotic behavior of the BH
procedure as described by Chi (2007) relates to observed behavior in a setting where
the underlying assumptions are not met. Such a setting involves a finite number of
hypotheses and may have correlated features. We will assess the effect of sample size,
the number of tested hypotheses, and basic forms of dependence, on the behavior of the
BH procedure using simulations. We will assess this behavior primarily in terms of the
proportion of rejected hypotheses and the average power. We will restrict ourselves to
two-sample t-tests and focus on the situation where such tests are performed two-sided.
We will describe how violated assumptions affect the proportion of rejected hypotheses and average power, and aim to provide some general guidelines regarding the use
of sample size calculation methods based on these quantities. All simulations will be
performed in R (R Development Core Team, 2008).
We also aim to formulate several conservative power estimators based on the results
of both Chi (2007) and Meijer et al. (2016). These estimators are all related to the
asymptotic average power of the BH procedure, as we will discuss in section 2.2. In
addition to more conservative power estimators, we also suggest a more liberal approach
to power by allowing researchers to calculate the minimum sample size required so that
α∗ < α. Obviously, calculations based on the criticality phenomenon lead to much lower
sample size estimates than those based on average power.
Providing methods for sample size calculation is of little use if they are unavailable
in practice. We will therefore write R functions for sample size calculation based on the
various methods discussed in the current paper, and aim to present these in the form of
an easy-to-use Shiny (Chang, Cheng, Allaire, Xie, & McPherson, 2017) application.
2 Methods
2.1 Computation of α∗, u∗ and p∗
In this section we will discuss the computation of the critical value and associated quan-
tities introduced in section 1.3. As can be observed in (3), computing the critical value
α∗ requires computing the derivative of F at zero. We defined F in (2) using the func-
tion G, which we will now discuss in more detail. As in Chi (2007), we will assume that
test statistics corresponding to a true null are randomly sampled from some distribution
with distribution function Ψ0 and density ψ0. Test statistics corresponding to a false
null then have distribution function Ψ1 and density ψ1. The joint distribution function
of the test statistics is then:
Ψ = (1− π)Ψ0 + πΨ1. (9)
From inverse probability sampling we know that if some test statistic X has distribution
function Ψ, then for U ∼ Uniform(0, 1), the random variable Ψ⁻¹(U) has the same
distribution as X. The right-tail p-value, 1 − Ψ0(X), then has the same distribution as
1 − Ψ0(Ψ⁻¹(U)), with distribution function (Chi, 2007):

$F(u) = 1 - \Psi(\Psi_0^{-1}(1-u)).$ (10)
Since Ψ is a mixture of two distributions, we can split F into two parts, each multiplied
by their corresponding weight. The p-values of the test statistics corresponding to a
true null hypothesis then have distribution function 1 − Ψ0(Ψ0⁻¹(1 − u)) = u, and the
p-values of the test statistics corresponding to a false null have distribution function
1 − Ψ1(Ψ0⁻¹(1 − u)) = G(u). The joint distribution function of the p-values can then be
written as (Chi, 2007):

$F(u) = (1-\pi)u + \pi\left[1 - \Psi_1(\Psi_0^{-1}(1-u))\right].$ (11)
The first derivative of this function is (Chi, 2007):

$F'(u) = 1 - \pi + \pi\,\frac{\psi_1(x)}{\psi_0(x)},$ (12)

with x = Ψ0⁻¹(1 − u). To evaluate the derivative of F at zero we then need to calculate
(Chi, 2007):

$F'(0) = 1 - \pi + \pi \lim_{x\to\infty} \frac{\psi_1(x)}{\psi_0(x)}.$ (13)
The densities ψ0 and ψ1 obviously depend on the chosen test statistic. We will first
consider one-sample t-tests, where each feature Yj is considered a sample from a normal
distribution with mean µj and standard deviation σj , and where each null hypothesis
H0j : µj = 0 is tested against the one-sided alternative HAj : µj > 0. We will assume
µj = 0 in the case of a true null hypothesis, and µj = c > 0 in the case of a false null
hypothesis, and assume all σj = σ, i.e. we assume equal (standardized) effect sizes for all
features. Then for the true null hypotheses, the sample t-statistic follows a t-distribution
with n−1 degrees of freedom. For the false null hypotheses, the sample t-statistic follows
a noncentral t-distribution with n − 1 degrees of freedom and noncentrality parameter
δ. This noncentrality parameter can be factorized into δ = √n θ, with θ = c/σ the
(standardized) effect size (Chi, 2007), which implies sample size and effect size can be
considered as two variables which independently affect the noncentrality parameter. The
densities ψ0 and ψ1 are then, respectively, tn−1 and tn−1,δ. The limit in (13) can then
be computed by (Chi, 2007):
$\lim_{x\to\infty} \frac{t_{n-1,\delta}(x)}{t_{n-1}(x)} = e^{-\delta^2/2} \sum_{k=0}^{\infty} \frac{\Gamma\!\left(\frac{n+k}{2}\right)\left(\sqrt{2}\,\delta\right)^k}{k!\,\Gamma\!\left(\frac{n}{2}\right)},$ (14)
where Γ denotes the Gamma function. This limit does not diverge whenever δ > 0,
implying the criticality phenomenon always occurs with one-sample t-tests (Chi, 2007).
Furthermore, Chi (2007) showed that the distribution function of the p-values is strictly
concave, which means α∗ can be calculated as in (3).
Now, we will consider two-sample t-tests. For each feature Yj , we now observe
two samples: Y1(j) of size n1, and Y2(j) of size n2, with n = n1 + n2 the total sample
size. We will assume Y1(j) ∼ N(µ1(j), σj) and Y2(j) ∼ N(µ2(j), σj). For the true null
hypotheses, the two-sample t-statistic then follows a t-distribution with n − 2 degrees
of freedom. Assuming equal effect sizes, the two-sample t-statistic follows a noncentral
t-distribution with n − 2 degrees of freedom and noncentrality parameter δ =√nθ for
the false null hypotheses. It should be noted that the interpretation of the effect size is
different for two-sample t-tests, compared to one-sample t-tests. In particular, the effect
size for the two-sample t-tests is $\theta = \sqrt{p_1 p_2}\,\frac{\mu_{1(j)} - \mu_{2(j)}}{\sigma_j}$, with p1 and p2 the proportions of
samples in groups 1 and 2 relative to the total sample size (Van Iterson et al., 2013).
For (one-sided) two-sample t-tests, the limit in (13) can be computed by:
$\lim_{x\to\infty} \frac{t_{n-2,\delta}(x)}{t_{n-2}(x)} = e^{-\delta^2/2} \sum_{k=0}^{\infty} \frac{\Gamma\!\left(\frac{n+k-1}{2}\right)\left(\sqrt{2}\,\delta\right)^k}{k!\,\Gamma\!\left(\frac{n-1}{2}\right)}.$ (15)
It should be noted that the previous specifications are in terms of right-sided p-values.
The calculations are equivalent for left-sided p-values if we assume θ to be an absolute
measure of effect size. If, for some θ, the right-sided p-values 1−Ψ0(X) have distribution
function F , then for −θ, the left-sided p-values Ψ0(X) have the same distribution.
In practice, two-sample t-tests are typically performed two-sided. For two-sided
tests, the distribution of the test statistics corresponding to the false null hypotheses
is assumed to be a mixture of two noncentral t-distributions, one with noncentrality
parameter δ = √n θ, and the other with δ = −√n θ. Then the two-sided p-values,
2(1 − Ψ0(|X|)), do not have the same distribution as those obtained from one-sided
tests. To obtain an expression for the distribution of the two-sided p-values, we can
use the more general F -test. It is well-known that if an F -test is used to compare two
means, this is equivalent to performing a two-sided two-sample t-test. In particular,
the F -statistic is the square of the t-statistic. We will assume that for the true null
hypotheses, the sample F -statistic follows an F -distribution with 1 and n − 2 degrees
of freedom. For the false null hypotheses, the sample F -statistic follows a noncentral
F -distribution with 1 and n − 2 degrees of freedom, and noncentrality parameter δ2.
Then the one-sided p-values of the F -test have the same distribution as the two-sided
p-values of the two-sample t-test. For two-sided two-sample t-tests, the limit in (13) can
then be computed by (Chi, 2007):
$\lim_{x\to\infty} \frac{f_{1,n-2,\delta^2}(x)}{f_{1,n-2}(x)} = e^{-\delta^2/2}\, B\!\left(\tfrac{1}{2}, \tfrac{n-2}{2}\right) \sum_{k=0}^{\infty} \frac{\left(\tfrac{\delta^2}{2}\right)^k}{k!\, B\!\left(\tfrac{1}{2} + k,\, \tfrac{n-2}{2}\right)},$ (16)
where B denotes the Beta function. The limits in (14), (15) and (16) can be easily
approximated by repeatedly summing the terms of the series until some convergence or
divergence criterion is met. We utilized a convergence tolerance of 10⁻¹⁰. For increased
numerical precision, we calculated the natural logarithm of each term of the series before
exponentiating and adding to the total. For computation of u∗ we used the approximation
algorithm described by Chi (2007), namely to set u1 = 1 and iterate ui+1 = αF(ui)
until convergence. Then p∗ = u∗/α. To evaluate F⁻¹, as in the computation of h, we utilized
a general root-finding algorithm (Brent, 1973) with a tolerance of approximately 10⁻⁴.
As α∗, u∗ and p∗ can be quickly computed, it is feasible and practical to perform sample
size calculations by simply starting at some minimal value of n, and increasing n until
F has the desired properties.
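These computations can be sketched as follows (in Python rather than the R used for the actual implementation; log-scale summation of the series in (14), followed by the fixed-point iteration for u∗; the function names are ours):

```python
from math import exp, lgamma, log, sqrt

def limit_ratio_one_sample(n, delta, tol=1e-10, max_terms=100000):
    """Approximate the limit (14) by summing terms of the series until a term
    falls below `tol`; each term is computed on the log scale for precision."""
    total = 0.0
    for k in range(max_terms):
        log_term = (-delta ** 2 / 2.0
                    + lgamma((n + k) / 2.0)
                    + k * log(sqrt(2.0) * delta)
                    - lgamma(k + 1.0)
                    - lgamma(n / 2.0))
        term = exp(log_term)
        total += term
        if k > 0 and term < tol:
            break
    return total

def p_star(alpha, F, tol=1e-12, max_iter=100000):
    """Chi's (2007) approximation: set u1 = 1, iterate u_{i+1} = alpha * F(u_i)
    until convergence; then p* = u*/alpha."""
    u = 1.0
    for _ in range(max_iter):
        u_new = alpha * F(u)
        if abs(u_new - u) < tol:
            break
        u = u_new
    return u_new / alpha
```

For F(u) = u (no signal, π = 0) the iteration collapses to u_{i+1} = αu_i, so u∗ = 0 and p∗ = 0, as it should.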
2.2 Different Approaches to Average Power
2.2.1 Average Power of the BH Procedure
In this section, we will further explore the average power of the BH procedure, give an
expression for the asymptotic average power of the BH procedure based on the results
of Chi (2007), and discuss the relationship with an existing R package for sample size
calculation in the FDR context. We will then suggest different power estimators of the
same form, and discuss similarities and differences with the expression for the asymptotic
average power.
We defined the concept of average power in section 1.4 as the expected proportion
of rejected false null hypotheses among the set of false null hypotheses. We can express
this as:

$E\left(\frac{R_m \times \mathrm{TDP}}{m_1}\right),$ (17)
with Rm the number of rejected hypotheses, TDP the true discovery proportion, and
m1 the number of false null hypotheses. We can equivalently specify this in terms of
the proportion of rejected hypotheses Rm/m and the proportion of false null hypotheses
m1/m. We know from the results of Chi (2007) that as m → ∞, the proportion of rejected
hypotheses converges to p∗. We also know that the BH procedure controls the FDR at a
level (1− π)α, with π the proportion of false null hypotheses. We can therefore express
the asymptotic average power of the BH procedure as:
$\frac{p^*(\alpha, F)\,(1 - (1-\pi)\alpha)}{\pi},$ (18)
where p∗(α, F ) denotes the value of p∗ corresponding to the BH procedure applied to
p-value distribution F , with FDR control level α. This is an asymptotic result in the
sense that as m → ∞, the average power converges in probability to the quantity in
(18), as was shown previously by Ferreira and Zwinderman (2006). It is this asymptotic
behavior on which sample size calculations are typically based, that is, one calculates the
minimum n for which the quantity in (18) is larger than the desired power level. For ex-
ample, although the ssize.fdr package uses a different method to arrive at its power
specification, and the quantity in (18) is not expressed directly anywhere in the related
documentation (Liu & Hwang, 2007), their method for power calculation is essentially
identical, although there are some differences in interpretation of the parameters α and
θ. In particular, the ssize.oneSamp and ssize.twoSamp functions of ssize.fdr take
as their FDR control parameter the quantity (1− π)α rather than α, and the effect size
that is specified for ssize.twoSamp is actually the quantity 2θ rather than θ. If one
takes these differences in interpretation into account, the same estimates for the power
and sample size will be obtained.
In a real setting, where the number of hypotheses is finite and features are possibly
correlated, the asymptotic average power can be seen as an approximation or estimator
of the actual power. This estimator can be further decomposed into an estimator of the
proportion of rejected hypotheses, an estimator of the TDP, and an estimator of the
proportion of false null hypotheses. We propose a general formula for a power estimator
of this form, namely:

$\frac{p(\alpha\gamma, F)\,(1 - \pi_0\alpha)}{\pi}.$ (19)
Here α is again the FDR control parameter, γ the exceedance control parameter, and π
an estimate of the proportion of false null hypotheses. If we do not wish to control the
FDP exceedance probability, we set γ = 1. The function p is an estimator for the size
of the rejection set relative to the number of hypotheses. In (18), p = p∗. An estimator
of the TDP is given by (1− π0α), with π0 an estimator for the proportion of true nulls
in the data. In the remainder of this section, we will propose a number of choices for
these estimators, describe different combinations, and discuss possible advantages and
drawbacks. Note that by decomposing the estimator for the average power in this way,
we are essentially assuming the proportion of rejected hypotheses and the TDP are
independent.
We previously mentioned an alternative power specification where one could choose
to calculate the minimum n such that α∗ < α. This method can actually be considered
as a special case of controlling the asymptotic average power in (18). After all, if α∗ < α,
p∗ > 0, but if α∗ > α, p∗ = 0. Since we always assume 0 < α < 1 and 0 < π < 1,
calculating the minimum n such that α∗ < α is the same as calculating the minimum n
such that the asymptotic average power is non-zero.
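All estimators of the form (19) share one computational skeleton. A small Python helper (ours) makes the decomposition explicit; with γ = 1 and π0 = 1 − π it reduces to the asymptotic average power (18):

```python
def power_estimate(p_fn, alpha, pi, gamma=1.0, pi0=None):
    """General power estimator (19): p(alpha * gamma) * (1 - pi0 * alpha) / pi.

    p_fn  : estimator of the proportion of rejected hypotheses (e.g. p* or p_l)
            as a function of the FDR control level
    gamma : FDP exceedance parameter; gamma = 1 means plain FDR control
    pi0   : estimate of the proportion of true nulls; defaults to 1 - pi as in (18)
    """
    if pi0 is None:
        pi0 = 1.0 - pi
    return p_fn(alpha * gamma) * (1.0 - pi0 * alpha) / pi

# With a hypothetical constant p* = 0.08, pi = 0.1 and alpha = 0.05:
est = power_estimate(lambda level: 0.08, alpha=0.05, pi=0.1)
# 0.08 * (1 - 0.9 * 0.05) / 0.1 = 0.764
```

Choosing the most conservative TDP estimator, π0 = 1, can only lower the estimate, here to 0.08 · (1 − 0.05)/0.1 = 0.76.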
2.2.2 FDP Exceedance Control
As discussed in section 1.5, we know from Meijer et al. (2016) that if we want the
probability that the FDP exceeds α to be at most γ, we can ensure this by applying
the BH procedure at FDR control level αγ. In this case, the asymptotic average power
is simply given by (18), with α replaced with αγ. However, this may not be the most
desirable measure of power to use in this case, since it does not take into account the
fact that α is now a confidence bound.
From section 1.5 we know that, asymptotically, the 1− γ upper confidence bound
on the FDP in the BH rejection set at FDR control level αγ, i.e. qγ(SBH(αγ)), is at most
hα. Equivalently, we can formulate this as a 1−γ lower confidence bound, dγ(SBH(αγ)),
on the TDP. We obtain:
dγ(SBH(αγ)) = 1− qγ(SBH(αγ)) ≥ 1− hα. (20)
We can incorporate 1−hα as a conservative estimator for the TDP in our power estimator
(as opposed to 1− (1− π)αγ). We can arrive at this by writing (20) as a bound for the
average power itself. We obtain:
$\frac{\frac{R_m(\alpha\gamma)}{m}\, d_\gamma(S_{BH}(\alpha\gamma))}{m_1/m} \;\ge\; \frac{\frac{R_m(\alpha\gamma)}{m}\,(1 - h\alpha)}{m_1/m}.$ (21)
This implies that, in order to ensure that the average power is greater than some desired
level with probability at least 1 − γ, we could choose an n such that the right hand
side of (21) is greater than this desired level. This is not trivial, however, as both the
quantities m1/m and Rm(αγ)/m are unknown beforehand. For large m, we expect that replacing
these quantities with π and p∗(αγ) will work well as a conservative power estimator.
However, since there can still be variation around these quantities, this will no longer
be a true confidence bound for the power.
2.2.3 The Proportion of Rejected Hypotheses
So far, we have assumed an infinite number of hypotheses. For example, we estimate the
proportion of rejected hypotheses by p∗, the quantity to which it converges as m→∞.
There are two main problems with this approach when the number of hypotheses is
finite, which is always the case in practice. The first is that E(Rm/m) does not necessarily
equal p∗. The second is that there may be considerable variation around E(Rm/m), and
the smaller the number of hypotheses tested, the higher we expect this variation to be.
The first problem will be further studied using simulations. Here, we will propose a
conservative estimator that takes the second problem into account.
Chi (2007) characterized the convergence of Rm/m to p∗ using the law of the iterated
logarithm (LIL), showing that:
$\limsup_{m} \pm\frac{R_m - m p^*}{\sqrt{m \log\log m}} = \frac{\sqrt{2 p^*(1 - p^*)}}{1 - \alpha F'(u^*)}, \quad \text{a.s.},$ (22)
assuming α∗ < α < 1 and 1− αF ′(u∗) > 0. We define:
$p_b = \frac{\sqrt{2 p^*(1 - p^*)\, m \log\log m}}{m\,(1 - \alpha F'(u^*))}.$ (23)
Then pl = p∗ − pb is a lower bound on the proportion of rejected hypotheses. Note that
this is still an asymptotic result: we know that from some m onward, P(Rm/m > pl) = 1.
For finite m, this probability may not be 1, but we do expect it to be high. Some
preliminary simulations (results not shown) indicated that for m between 10 and 1000,
P(Rm/m > pl) was typically in the 0.97 to 1 range, even when m1/m is allowed to vary around
π. This implies that by using pl in sample size calculations, a researcher can be sure
that, from some m onward, the proportion of rejected hypotheses is higher than the
desired level, whereas for a smaller number of hypotheses, he can still be very confident
that this will be the case. A nice property of this method is that as m grows, pl will
grow closer and closer to p∗, so that for an infinite number of tests, using pl essentially
coincides with the “traditional” use of p∗.
A possible drawback is that the LIL bounds are very wide for low m. Although
this means that the probability of Rm/m > pl is high even for low m, it also means that for
some m (e.g. m = 10), the lower bound pl will be so low that it is unusable in practice. In
a high-dimensional setting, however, one typically considers a large number of features.
If one applies this method with, for example, a thousand features, we do not expect this
to be a problem.
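Once p∗ and F′(u∗) are available, the bound pl is cheap to evaluate. A Python sketch of (23) (function name ours; the inputs used below are illustrative values, not results from this thesis):

```python
import math

def lil_lower_bound(p_star, m, alpha, f_prime_u_star):
    """Conservative bound p_l = p* - p_b on the proportion of rejected hypotheses,
    with p_b as in (23); requires 1 - alpha * F'(u*) > 0 and m > e (so that
    log(log(m)) is positive)."""
    denom = m * (1.0 - alpha * f_prime_u_star)
    if denom <= 0 or m <= math.e:
        raise ValueError("requires 1 - alpha * F'(u*) > 0 and m > e")
    p_b = math.sqrt(2.0 * p_star * (1.0 - p_star) * m * math.log(math.log(m))) / denom
    return p_star - p_b
```

Since pb shrinks at rate √(log log m / m), the bound automatically tightens toward p∗ as the number of tested hypotheses grows.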
2.2.4 The Proportions of True and False Null Hypotheses
If we want to calculate a power estimate as in (19), we require estimates for the propor-
tion of true and false null hypotheses. The true population values of these parameters
are of course complementary: π0 = 1 − π. In order to calculate an estimate of the
average power using this specification, one always needs to provide an estimate of π;
in what follows, π denotes this estimate. The estimator for the TDP then requires an estimate of π0,
the most obvious choice for which is π0 = 1 − π, as in (18). This does, however, place
a lot of confidence in π. The estimate may be derived from a small pilot sample, or
perhaps made up by a researcher based on previous experience or beliefs. If one wants
to incorporate some uncertainty about π in the power calculations, one could opt to
use a more conservative estimator of π0 in the estimator for the TDP. For example, by
setting π0 = h. Note that h still depends on π, but for low n, h > 1− π, so that h is a
more conservative estimator in the sense that it assumes the actual proportion of signal
in the data may be lower than believed. For low n, this is more similar to specifying the
general shape of the p-value distribution, rather than the exact proportion of noise. One
could also choose any other estimator of π0. The most conservative option is to simply
set π0 = 1.
We do not treat the estimates of π and π0 in (19) as complementary per se. That
is, if we use π0 = h, we do not set π = 1 − h. This would not really make sense, since
h is a function of π. Doing so would actually make the resulting power estimate much
less conservative since it assumes there is less signal in the data to find, such that far
fewer hypotheses need to be rejected to discover a large proportion of it, and could lead
to power estimates greater than one. Furthermore, setting π0 = 1 would then lead to
π = 0, in which case the power cannot be calculated at all. In other words, we use a
conservative estimate of π0 only to obtain a more conservative estimate of the TDP.
There can be differences in interpretation of π. A researcher may give an estimate
of the population proportion of false nulls π, or an estimate of the proportion of false
nulls in the data, m1/m. In the latter case, it can be assumed that there is no variation in
m1/m, and one could argue that π0 = 1 − π is then always the most sensible choice.
It should be noted that the choice of π0 is generally not nearly as impactful as
the choice of α, γ or p. This is because, in a practical high-dimensional setting, the
proportion of signal in the data is typically assumed to be low. Therefore, the proportion
of noise in the data is close to one, and so including any other estimate of π0 can at
most make the power estimate a little less conservative.
2.2.5 Power Estimators for Sample Size Calculation
We have proposed different choices for γ, p and π0. All possible combinations of these
choices can be observed in Table 1.
Table 1: Different estimators of power which arise from the various combinations of γ, p and π0.

                        γ = 1                                           γ < 1
              p = p∗                p = pl               p = p∗                 p = pl
π0 = 1 − π    p∗(α)(1−(1−π)α)/π    pl(α)(1−(1−π)α)/π    p∗(αγ)(1−(1−π)α)/π     pl(αγ)(1−(1−π)α)/π
π0 = h        p∗(α)(1−hα)/π        pl(α)(1−hα)/π        p∗(αγ)(1−hα)/π         pl(αγ)(1−hα)/π
π0 = 1        p∗(α)(1−α)/π         pl(α)(1−α)/π         p∗(αγ)(1−α)/π          pl(αγ)(1−α)/π
As can be observed in Table 1, we now have twelve different power estimators, all
of the same form as the asymptotic average power (top left cell in Table 1). It is clear
that this table could be extended even further: For example, it does not include the
possibility to use FDP exceedance control with the normal average power, and it could
probably be extended with any other possible estimators of Rmm and TDP. In this study,
we will only consider these twelve estimators. Which of these estimators is preferable
depends on the situation at hand. The estimator to use automatically follows from the
choices for its three different components.
The first choice to be made is which estimator to use for the proportion of true
null hypotheses. Some estimate of the amount of signal in the data is always required,
since this quantity is used as the denominator of each estimator. Such an estimate could
be derived from data, but it is also possible this is simply the belief of the researcher.
This estimate, π, can additionally be incorporated into the estimator for the TDP to
make the procedure less conservative, and obtain smaller sample size estimates. This
does, however, put a lot of confidence in π. If there are doubts surrounding the accuracy
of π, some conservativeness can be incorporated by not including an estimate of the
proportion of true null hypotheses, i.e. by setting π0 = 1. However, if the proportion
of signal in the data is large, this may be considered too conservative. Setting π0 = h
provides a middle-ground, since this quantity is close to 1 for small n, but close to 1 − π for large n. Additionally, h may be the most natural estimator to use when using FDP
exceedance control, due to its interpretation as an asymptotic confidence bound. When
not using FDP exceedance control, a separate confidence level for h needs to be specified.
The second choice to be made is whether to use an expectation for the TDP es-
timator (γ = 1), or a lower confidence bound (γ < 1). The first guarantees that,
asymptotically, the expected proportion of true discoveries among the discoveries is at
least 1−α. The second guarantees that, asymptotically, the proportion of true discover-
ies among the discoveries is at least 1−α with probability at least 1−γ. This essentially
comes down to the strength of the statements one wishes to make about the rejection
set. The second is a stronger statement, but requires a larger sample size for a fixed α.
The third choice to be made is whether to use p∗ or pl as an estimator of Rm/m. Here,
using pl provides a kind of finite-sample correction for the increased variance of Rm/m that
comes with a smaller number of hypotheses. Use of p∗ assumes an infinite number of
hypotheses. Use of pl guarantees that Rm/m > pl from some m onward, and results in a
high probability of Rm/m > pl for lower m.
The different components of these power estimators allow researchers to incorporate
different sources of conservativeness. We do expect, however, that incorporating several
sources of conservativeness simultaneously may have too much of an impact on the
resulting sample size. For example, although using both pl and γ = 0.1 guarantees
that, asymptotically, the power is at least the desired level with probability at least
1 − γ, we suspect that the resulting sample sizes may be too large for practical use.
In section 3.1, we will visually compare the different estimators of power from Table 1.
A subset of these estimators and their resulting sample size recommendations will be
further compared using simulations, as described in section 2.3.4.
2.3 Simulation Studies
2.3.1 Simulation Study 1: Sample Size and Number of Tests
As stated in section 1.6, we want to investigate how the asymptotic behavior of the
BH procedure relates to its performance in a more realistic setting. Asymptotic results
describe the behavior of the BH procedure when the number of tests is infinite. In
reality, however, the number of tests performed is always finite. The goal of this first
simulation experiment is to investigate how different values of m affect the behavior of
the BH procedure. Additionally, we will investigate if this behavior differs for different
levels of n, π and α.
We simulate independent normally distributed features and perform two-sided two-
sample t-tests. We assume a common fixed effect size of θ = 0.436. This value of θ is
based on a simulation study performed by Chi (2007), where the false nulls are assumed
to come from a t-distribution with 20 degrees of freedom, and noncentrality parameter
δ = 2. This corresponds to an effect size of roughly 0.436. We investigate the behavior
of the BH procedure for the following levels of m: 10, 50, 100, 1000, and 10000. We
consider 20 different sample sizes n between 4 and 100, the chosen levels being closer
together for lower n, and more spaced out for higher n. In particular, we consider
sample sizes between 4 and 20 in increments of 2, between 20 and 40 in increments of
4, and between 40 and 100 in increments of 10. For the proportion of false nulls π, we
consider the levels 0.1, 0.5 and 0.9. Note that π in this context refers to the population
proportion of false nulls as in (2). The realized proportion of false nulls may vary from
data set to data set. For the FDR control level α, we again consider these same levels
0.1, 0.5 and 0.9. For each combination of the experimental factors, we perform a fixed
number of simulations. In principle, the number of simulations is set to 1000, however,
for m = 10000, we set the number of simulations to 100 to allow the experiment to be
completed within a reasonable time.
In terms of outcome measures, we are primarily interested in the quantity E(Rm/m),
the expected proportion of rejected hypotheses. For each combination of the experimental
factors we can calculate a simulation estimate of this quantity, which we will denote p̄,
and which can be compared to p∗. We know that Rm/m → p∗ as m → ∞, but we do not
know E(Rm/m) for finite m. Secondary outcome measures include observed average power,
observed FDR, and observed pFDR. Note that for low π and low m, it is possible for a
simulated data set to not include any false nulls, in which case the power is undefined.
We will therefore look at the expected proportion of discovered false nulls, given that
there was at least one false null to discover, which we will denote the positive power.
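A single replicate of this design can be sketched as follows (Python with NumPy/SciPy rather than the R used for the actual study; for brevity we fix σ = 1 and equal group sizes, so the mean shift θσ/√(p1p2) equals 2θ):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_replicate(m, n, pi, alpha, theta=0.436):
    """Simulate m independent features, perform two-sided two-sample t-tests,
    apply BH at FDR level alpha, and return the proportion rejected, R_m / m."""
    n1 = n // 2
    false_null = rng.random(m) < pi               # realized false nulls vary around pi
    shift = 2.0 * theta                           # theta * sigma / sqrt(p1 * p2), sigma = 1
    y1 = rng.standard_normal((m, n1))
    y2 = rng.standard_normal((m, n - n1)) + np.where(false_null, shift, 0.0)[:, None]
    pvals = stats.ttest_ind(y1, y2, axis=1).pvalue
    # BH step-up on the m p-values
    sp = np.sort(pvals)
    crit = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(sp <= crit)[0]
    r = below[-1] + 1 if below.size else 0
    return r / m
```

Averaging this quantity over many replicates, for each cell of the design, yields the simulation estimate of E(Rm/m).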
2.3.2 Simulation Study 2: Dependence
In addition to an infinite number of tests, the quantities α∗, p∗ and u∗ assume indepen-
dent test-statistics (Chi, 2007). Although the BH procedure controls the FDR under
various common forms of positive dependence (Goeman & Solari, 2014), it does not
necessarily reject the same number of hypotheses. The goal of this second simulation
experiment is to investigate how different dependency structures influence the behavior
of the BH procedure, and how this behavior compares to the asymptotic results of Chi
(2007). We will consider roughly two different scenarios. In the first scenario the features
are equicorrelated, that is, given the experimental factor, the population correlation ρ
between all features is the same. In the second scenario we specify three different cor-
relations: ρ0 is the correlation between all true nulls, ρ1 is the correlation between all
false nulls, and ρ01 is the correlation between each true and false null. In both scenarios
we will consider only positive correlations. For the dependency structures used in this
simulation study, the FDR is assumed to be controlled (Goeman & Solari, 2014).
For the first scenario we will consider all possible values of ρ from 0 to 0.9, in
increments of 0.1. We perform two-sided two-sample t-tests and apply the BH procedure
with α = 0.1. We set π = 0.1, θ = 0.436 and m = 1000. We consider the same values
of n as in the first simulation experiment, and for each combination of the experimental
factors we perform 1000 simulations. Correlated features are simulated using a simple
linear regression model. For every simulation i ∈ {1, . . . , 1000}, each feature is generated by:
Yi,j = √ρ zi + εi,j, (24)
with zi a vector of length n drawn from a standard normal distribution and εi,j ∼ N(0, 1 − ρ),
so that each feature has unit variance and pairwise correlation ρ. Then, the features are
split into two groups, each of size n/2, and for each false null, the mean of the second
group is shifted by the quantity θσ/√(p1p2) or −θσ/√(p1p2), each with probability 0.5.
We record the following outcome measures: observed E(Rm/m),
observed FDR, observed pFDR and observed positive power.
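The generation step in (24) can be sketched as follows. This is our own minimal version, assuming the idiosyncratic noise has variance 1 − ρ so that each feature has unit variance and common pairwise correlation ρ; the function name is illustrative.

```python
import numpy as np

def equicorrelated_features(n, m, rho, rng):
    """n-by-m matrix of unit-variance features with common pairwise
    correlation rho, via the shared-factor model in (24)."""
    z = rng.standard_normal((n, 1))        # shared factor: one value per observation
    eps = rng.standard_normal((n, m))      # idiosyncratic noise
    return np.sqrt(rho) * z + np.sqrt(1.0 - rho) * eps

rng = np.random.default_rng(0)
Y = equicorrelated_features(2000, 5, rho=0.5, rng=rng)
# Empirical pairwise correlations between columns cluster around 0.5.
```

The group-mean shift for the false nulls is then applied to the second half of the rows exactly as in the independent case.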
For the second scenario we will consider the following correlation structures: Only
the true nulls are correlated (ρ0 = 0.9, ρ1 = 0, ρ01 = 0), only the false nulls are correlated
(ρ0 = 0, ρ1 = 0.9, ρ01 = 0), or both the true and false nulls are independently correlated
(ρ0 = 0.9, ρ1 = 0.9, ρ01 = 0). For reference, we will additionally consider independence
(ρ0 = 0, ρ1 = 0, ρ01 = 0) and equicorrelation (ρ0 = 0.9, ρ1 = 0.9, ρ01 = 0.9). For every
simulation i ∈ {1, . . . , 1000}, the features are generated by:
Yi = ZiCi, (25)
with Zi an n by m matrix of standard normally-distributed values, and Ci the Cholesky
decomposition of the correlation matrix. Again, the features are split into two groups,
each of size n/2, and for each false null, the mean of the second group is shifted. We
again record observed E(Rm/m), observed FDR, observed pFDR and observed positive
power.
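The second scenario can be sketched as below. The thesis multiplies Z by the Cholesky decomposition of the correlation matrix as in (25); here we use the lower-triangular factor L with C = LLᵀ and form Z Lᵀ, which yields the same covariance. Function names are ours.

```python
import numpy as np

def block_correlation(m0, m1, rho0, rho1, rho01):
    """Correlation matrix with rho0 among the m0 true nulls, rho1 among
    the m1 false nulls, and rho01 between the two blocks."""
    m = m0 + m1
    C = np.full((m, m), rho01)
    C[:m0, :m0] = rho0
    C[m0:, m0:] = rho1
    np.fill_diagonal(C, 1.0)
    return C

def correlated_features(n, C, rng):
    """Draw n observations with correlation matrix C, as in (25)."""
    L = np.linalg.cholesky(C)                 # C = L L^T
    Z = rng.standard_normal((n, C.shape[0]))  # i.i.d. standard normals
    return Z @ L.T                            # rows have covariance C

rng = np.random.default_rng(0)
C = block_correlation(8, 2, rho0=0.9, rho1=0.9, rho01=0.0)
Y = correlated_features(5000, C, rng)
```

For the structures used here (equicorrelation blocks with ρ < 1 and zero cross-correlation) the matrix C is positive definite, so the Cholesky factor exists.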
2.3.3 Simulation Study 3: Simulations Based on Real Data
In this simulation experiment, we investigate how the BH procedure behaves in a setting
derived from real data. We base our setting on the Geuvadis RNA sequencing data set
of 465 human lymphoblastoid cell line samples (Lappalainen et al., 2013). The data
consists of 633 runs, which we will treat as independent observations. We obtained the
data through the recount package (Collado-Torres et al., 2016), and transformed the
data to normalized (Robinson & Oshlack, 2010) log counts per million using the edgeR
package (Robinson, McCarthy, & Smyth, 2010). The full data contains 58037 features,
but no experimental factor.
We first create an experimental factor with two levels from the data by using
two genes, Xist (X-inactive specific transcript) and RPS4Y1 (40S ribosomal protein
S4, Y isoform 1), to determine gender. After creating the factor, we remove the Xist
and RPS4Y1 features from the data, as well as any features with fewer than 633 unique
observations. The resulting data set now consists of 327 male and 336 female observations
on 17966 features.
In order to measure quantities like observed power and FDR, we need to know
which features are the true nulls, and which are the false nulls. However, in a real
data set such as the one we are using here, the identity of the true and false nulls is
unknown. To enable us to distinguish between true and false nulls we will generate a
new experimental factor and add corresponding effects to the data. We perform two-sided
two-sample t-tests on all 17966 features, and select the 10% lowest p-values. The
corresponding features have a mean standardized absolute effect size θ of 0.158, with a
standard deviation of 0.134. These features form the basis for the set of false nulls.
We sample as follows: First, we take a random sample of n rows and m columns
from the complete data set. We independently generate a two-class factor, with n/2
observations of each class. Since the experimental factor is generated independently,
any observed differences between the two groups are caused solely by random sampling.
Next, we draw a random effect size θ from a normal distribution with mean µθ and
standard deviation σθ. For each feature that belongs to the set of false nulls, the mean
of the second group is shifted by the quantity θσ/√(p1p2) or −θσ/√(p1p2), each with
probability 0.5.
This sampling method keeps the correlation structure within each group intact, but the
identity of the true and false nulls is now known. In addition to the correlation structure,
the samples inherit other qualities characteristic of real data from the original source.
For example, there is some unaccounted-for dependence between the observations, and
the features are not necessarily truly normally distributed.
Where we previously used a fixed effect size, we now generate effect sizes from a
distribution. We will assess the performance of the BH procedure with α = 0.1 under
three different distributions, all with mean µθ = 0.158. We consider the following values
for the standard deviation: σθ = 0.134, σθ = 0.067, and σθ = 0 (i.e. a fixed effect size).
We consider a grid of values for n between 4 and 600. For each value of n, we draw 100
samples of size m = 1000, and record observed E(Rm/m), observed FDR, observed pFDR
and observed positive power.
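The sampling scheme can be sketched as below. We do not reproduce the Geuvadis matrix here; `X` stands in for any preprocessed data matrix, and the function name and arguments are our own illustration of the steps described above.

```python
import numpy as np

def resample_with_effects(X, n, m, pi, mu_theta, sigma_theta, rng):
    """Draw n rows and m columns from data matrix X, assign an independent
    two-class factor, and shift a random pi-fraction of features (the false
    nulls) in the second class by +/- theta * sigma / sqrt(p1 * p2)."""
    rows = rng.choice(X.shape[0], size=n, replace=False)
    cols = rng.choice(X.shape[1], size=m, replace=False)
    Y = X[np.ix_(rows, cols)].astype(float)
    group = np.zeros(n, dtype=bool)
    group[rng.choice(n, size=n // 2, replace=False)] = True  # second class
    false_null = rng.random(m) < pi
    p1 = p2 = 0.5                          # equal class sizes
    for j in np.nonzero(false_null)[0]:
        theta = rng.normal(mu_theta, sigma_theta)   # random effect size
        sign = rng.choice([-1.0, 1.0])              # shift direction
        Y[group, j] += sign * theta * Y[:, j].std() / np.sqrt(p1 * p2)
    return Y, group, false_null
```

Because the factor is generated independently of the data, every feature outside the shifted set is a true null by construction, while the correlation structure of the original data is preserved within each group.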
2.3.4 Simulation Study 4: Comparison of Sample Size Estimates
In this simulation experiment, we will investigate how some of the power estimators
described in section 2.2 perform in terms of sample size calculation. We first specify
some desired power level ω. Next, we calculate the minimum n such that the value of
the estimator at hand is greater than, or equal to, this desired level. We then perform
a number of simulations for this value of n and compare the observed power with our
desired level.
We will not consider all twelve estimators shown in Table 1. Instead, we select four
of them for use in this experiment. The first estimator is the asymptotic expression for
the average power: p∗(α)(1 − (1 − π)α)/π. The second estimator uses the lower LIL
bound pl instead of p∗: pl(α)(1 − (1 − π)α)/π. The third estimator uses a confidence
bound for the TDP based on FDP exceedance control: p∗(αγ)(1 − hα)/π. The fourth
estimator combines FDP exceedance control with the LIL bound: pl(αγ)(1 − hα)/π. In
addition, we will evaluate a fifth sample size, namely the minimum n such that α∗ < α.
As in simulation study 1, we simulate independent normally distributed features
and perform two-sided two-sample t-tests. For each combination of n and the other
experimental factors, we generate 100 samples of size m = 5000. Each feature has
probability π = 0.1 of corresponding to a false null hypothesis. Effect sizes are sampled
from a normal distribution with mean µθ = 0.436 or µθ = 0.158, and standard deviation
σθ = 0 or σθ = 0.134. We consider three possible levels of ω: 0.8, 0.5 and 0.1. We
set α = 0.1 and γ = 0.1. In the calculations we assume the true parameter values are
known, setting the assumed proportion of false nulls equal to π and the assumed effect
size equal to µθ. We record the following outcome measures: observed E(Rm × TDP/m1),
observed P(Rm × TDP/m1 > 0), and observed P(Rm × TDP/m1 ≥ ω).
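The search for the minimum n can be sketched as follows. The power function used here is an illustrative normal-approximation power for a single two-sided two-sample test (our own simplification, standing in for the four estimators above, which are likewise nondecreasing in n); the function names are ours.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_power(n, theta=0.436):
    """Normal-approximation power of a two-sided two-sample test at level
    0.05, with standardized effect theta and two equal groups of size n/2."""
    delta = theta * math.sqrt(n) / 2.0   # approximate noncentrality
    z = 1.959964                         # z quantile at 0.975
    return normal_cdf(delta - z) + normal_cdf(-delta - z)

def minimum_sample_size(power_fn, omega, n_max=10_000):
    """Smallest even n >= 4 with power_fn(n) >= omega, assuming power_fn is
    nondecreasing in n; None if omega is not reached by n_max."""
    for n in range(4, n_max + 1, 2):     # even n: two groups of size n/2
        if power_fn(n) >= omega:
            return n
    return None

n_req = minimum_sample_size(approx_power, 0.8)
```

The observed power at the resulting n can then be compared with the desired level ω, as described above.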
3 Results
3.1 Comparison of Power Estimators
In section 2.2, we described several power estimators based on the asymptotic expression
for the average power. Figure 3 shows the resulting power estimates for different values
of n, m and π.
Figure 3: Power estimates for the BH procedure with FDR control parameter α = 0.1,
plotted against n for the four combinations of π ∈ {0.1, 0.5} and m ∈ {5000, 100000}.
Hue indicates whether γ = 1 or γ = 0.1. Tone indicates whether p = p∗ or p = pl. Line
type indicates the value of π0 (1 − π, h, or 1). The confidence level used for π0 = h is 90%.
Each of the power estimators is a combination of its three components. The largest
contrast seen in Figure 3 is between the situation where no FDP exceedance control
is applied (green lines), and the situation where the FDP exceedance probability is
controlled at 10% (red lines). Desiring P(FDP ≥ 0.1) ≤ 0.1 is a stronger form of false
discovery control than desiring FDR ≤ 0.1, and thus requires a larger sample size.
When m = 5000, there is a clear difference between use of p∗ (light tone) and pl (dark
tone). Since pl is a lower bound, the resulting power estimates are more conservative.
For m = 100000, the difference is far less noticeable. This is because, as m → ∞, pl
approaches p∗.
When π = 0.5, the difference between the estimators of π0 can be observed. The
largest difference is between the use of some estimate of π0 (h or 1 − π), and simply
setting π0 = 1. When π = 0.1, the observed difference is smaller. This is due to the fact
that, when the proportion of signal in the data is small, the proportion of noise is close
to one, so there is less benefit to incorporating an estimate of π0.
3.2 Sample Size and Number of Tests
The goal of this simulation experiment was to investigate the performance of the BH
procedure for finite m. The primary results in terms of E(Rm/m) can be observed in
Figure 4.
Figure 4: Primary results of the first simulation experiment. Each panel shows, for a
combination of π (0.1, 0.5 or 0.9) and α (0.1, 0.5 or 0.9), the expected proportion of
rejected hypotheses as a function of n. In each panel, the black line indicates p∗: the
proportion of hypotheses that we would asymptotically reject for the given sample size.
The colored crosses indicate, for each m (10, 50, 100, 1000 or 10000), the quantity p: the
observed proportion of rejected hypotheses, averaged across the (1000 or 100) simulations.
Figure 4 clearly shows the convergence of p to p∗ as m increases, since we observe
that for higher m, the observed p’s are closer to the line representing p∗. The rate of
convergence appears to vary for the different levels of π, α, and n. For example, when
π = 0.5 and α = 0.1, we see that for lower n there is a clear difference in the expected
proportion of rejected hypotheses for the different levels of m, but for higher n this
difference nearly vanishes. Conversely, when π = 0.1 and α = 0.9, the differences in p
remain large even for high n.
It can be observed in Figure 4 that convergence of p to p∗ appears to always happen
from above, that is, when m is smaller, we reject on average more hypotheses than we
would asymptotically reject. This is especially noticeable for the combinations of high
α and low π, but is also visible to a lesser extent in the other panels for lower n. For a
combination of high m and high n, p approximates p∗. Another interesting observation
is that for some conditions, particularly when π = 0.1 and α = 0.9, as n→∞, p appears
to converge to something other than p∗ for finite m. Although Figure 4 only shows values
of n up to 100, the observed differences appear to remain constant even for extremely
large sample sizes (e.g. n = 106).
It should be noted that FDR control levels of 0.5 or 0.9 are never used in practice.
Therefore, the left column of Figure 4 is the most interesting from a practical perspective.
In particular, the situation with π = 0.1 and α = 0.1 (top left) is the most realistic of
the experimental conditions. We will therefore examine this condition in more detail.
Figure 5 shows the primary and secondary outcome measures for the condition
with π = 0.1 and α = 0.1. It can again be observed that as m increases, p approaches
p∗ from above. The observed FDR (i.e. the observed proportion of false discoveries
among the discoveries, averaged over the simulations) fluctuates closely around (1−π)α.
There is less variability in the FDR estimates for large n as compared to small n. For a
combination of large n and large m, the variability of the FDR estimates around (1−π)α
is very small. Although the BH procedure controls the FDR regardless of sample size or
the number of tests, this is not the case for the pFDR, since for small m and/or n, the
pFDR does not equal the FDR. It can be observed in the bottom-right panel of Figure
5 that for large n, the BH procedure controls the pFDR at level (1− π)α for all values
of m except m = 10. If m is large, a smaller sample size is required to obtain pFDR
control as compared to when m is small.
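The distinction between observed FDR and observed pFDR comes down to how simulations with zero rejections are handled, which can be made explicit in a short helper (our own naming): the FDR averages the FDP with the convention 0/0 = 0, while the pFDR averages only over simulations with at least one rejection.

```python
import numpy as np

def observed_fdr_and_pfdr(V, R):
    """V: false discoveries per simulation; R: total rejections per simulation.
    Returns (observed FDR, observed pFDR)."""
    V, R = np.asarray(V, float), np.asarray(R, float)
    fdp = np.where(R > 0, V / np.maximum(R, 1.0), 0.0)  # FDP with 0/0 = 0
    fdr = fdp.mean()
    pfdr = fdp[R > 0].mean() if (R > 0).any() else float("nan")
    return fdr, pfdr

# Example: two of four simulations reject nothing.
fdr, pfdr = observed_fdr_and_pfdr([0, 1, 0, 2], [0, 5, 0, 4])
# fdr = 0.175, pfdr = 0.35
```

When m and n are small, many simulations have R = 0, which is exactly why the observed pFDR can exceed the observed FDR in Figure 5.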
It can be observed in the bottom-left panel of Figure 5 that for lower m, the observed
average positive power is higher than the asymptotic power when n is small, but lower
than the asymptotic power when n is large. This is especially noticeable when m = 10,
and to a lesser extent for the larger values of m. When m = 1000 or m = 10000 the
observed average positive power is very close to the asymptotic power.
Figure 5: Clockwise, from the top left: observed average proportion of rejected hypotheses,
observed FDR, observed pFDR, and observed average positive power when π = 0.1 and
α = 0.1. Note that the y-axis does not have the same scale in each panel.
As described in section 2.3.2, π denotes the population fraction of false nulls, that
is, the probability of a hypothesis to be a false null. The realized fraction of false
nulls will vary from sample to sample. Instead of fixing the population fraction, we
could decide to fix the sample fraction m1/m, so that each sample has exactly 10% signal.
Although this does not really correspond with the mixture model in (2), it is interesting
to know what the effect is of fixing the sample fraction rather than the population
fraction, since we may want researchers to make statements about properties of their
data when conducting sample size calculations. Figure 6 shows the results under the
same conditions as in Figure 5, but with the sample fraction of false nulls fixed at 0.1.
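The two sampling schemes differ only in how the set of false nulls is drawn, which can be made explicit in a small helper (our own naming):

```python
import numpy as np

def false_null_mask(m, pi, fix_sample_fraction, rng):
    """Population fraction: each hypothesis is independently a false null
    with probability pi, so m1 is Binomial(m, pi) and varies per data set.
    Sample fraction: exactly round(pi * m) false nulls in every data set."""
    if fix_sample_fraction:
        mask = np.zeros(m, dtype=bool)
        mask[rng.choice(m, size=round(pi * m), replace=False)] = True
        return mask
    return rng.random(m) < pi
```

Under the fixed sample fraction, the randomness in m1 is removed, which is one source of the reduced variance seen in Figures 6 and 7.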
Figure 6: Clockwise, from the top left: observed average proportion of rejected hypotheses,
observed FDR, observed pFDR, and observed average positive power when m1/m = 0.1
and α = 0.1. Note that the y-axis does not have the same scale in each panel.
In terms of p and observed FDR the results are comparable. However, we see that
for a fixed sample fraction, the pFDR converges to (1 − π)α more quickly than when we
specify the population proportion. The largest difference occurs in the observed power.
If we specify π, then for small m, the observed power is higher than the asymptotic power
if n is low, but lower than the asymptotic power if n is high. If we specify m1/m itself,
then for small m, the power is always higher than the asymptotic power.
In both cases, the average proportion of rejected hypotheses is higher than p∗ for
small m; however, a smaller m is associated with an increase in the variance of Rm/m.
The top row of Figure 7 shows an example of this, with α = 0.1 and n = 50. For low
m, the distributions of Rm/m are wide and skewed, whereas for high m, they are narrow
and approach symmetry. This implies that although the expected proportion of rejected
hypotheses is larger for low m, there is also a greater chance of rejecting a proportion of
hypotheses much smaller than p∗. For the parameters used in Figure 7, p∗ = 0.066,
which means we asymptotically reject 6.6% of all hypotheses. If π = 0.1, then for the
different levels of m, the observed probabilities of rejecting less than 5% of all hypotheses
(about 0.75p∗) are 0.461, 0.373, 0.263, 0.062 and 0 respectively. It can be observed in
the figure that fixing the sample fraction of false nulls decreases the variance of the
proportion of rejected hypotheses. If m1/m = 0.1, then for the different levels of m, the
observed probabilities of rejecting less than 5% of all hypotheses are 0.319, 0.253, 0.161,
0.018 and 0 respectively.
Figure 7 also shows the quantity pl which, as suggested in section 2.2, can be used
as a conservative estimate of the proportion of rejected hypotheses. If π = 0.1, then
for the different levels of m, the observed probabilities of rejecting more than pl of the
hypotheses are 1, 1, 0.986, 0.98 and 0.97 respectively. For a fixed m1/m, these probabilities
are, respectively, 1, 1, 0.996, 0.998 and 1. It should be noted that for m = 10 and
m = 50, pl is negative, meaning it cannot be used in practice. For m = 100 it is positive
but very small (0.004). However, as m grows, pl approaches p∗, so for moderate to high
values of m we may base sample size calculations on pl. The bottom row of Figure 7 shows
the distributions of the positive power. Fixing the sample fraction of false nulls decreases
the variance in the observed power. It can be observed that the probability of the power
being at least pl(1 − (1 − π)α)/π is high in both settings. If π = 0.1, then for the different
levels of m, the observed probabilities of the power being at least this high are 1, 1, 0.985,
0.996 and 1 respectively. For a fixed m1/m, these probabilities are, respectively, 1, 1, 0.996,
0.997 and 1.
Figure 7: Top: violin plots of the distributions of Rm/m for each m, with α = 0.1 and
n = 50. Bottom: violin plots of the distributions of the positive power for each m, with
α = 0.1 and n = 50. In the left column, the population proportion of false nulls is set
to 0.1. In the right column, the sample fraction of false nulls is fixed at 0.1. Each violin
consists of a box plot with a (vertical) kernel density on each side. Note that the kernel
densities are scaled to have equal maximum width. The quantity pl is equal to p∗ minus
the LIL bound given in (23). The plot was created using the vioplot package (Adler, 2005).
3.3 Dependence
The goal of this simulation experiment was to investigate the performance of the BH
procedure when the features are correlated. The results for the equicorrelated scenario
can be observed in Figure 8.

Figure 8: Clockwise, from the top left: observed average proportion of rejected hypotheses,
observed FDR, observed pFDR, and observed average positive power for values of ρ from
0 to 0.9, with π = 0.1, α = 0.1, θ = 0.436, and m = 1000. Note that the y-axis does not
have the same scale in each panel.
As can be observed in Figure 8, a higher correlation between the features is asso-
ciated with, on average, a larger number of rejected hypotheses. A higher correlation
is also associated with a lower observed FDR. When the variables are highly correlated,
the pFDR converges to a value below (1 − π)α, but it converges more slowly than when
the variables are uncorrelated; that is, a higher sample size is required for the pFDR to
drop below α. In terms of positive power we see that, for low n, a high correlation is
associated with higher power, whereas for higher n, a high correlation is associated with
lower power. For very high n, the differences are very small. For lower n, the observed
differences in power are larger, but even when the features are very highly correlated
(ρ = 0.9), the observed decrease in power is small compared to the increase in corre-
lation. For example, if n = 36, then under independence we observe an average power
of 0.328, whereas if the features are equicorrelated with ρ = 0.9, we observe an average
power of 0.275. For higher n, these differences are even smaller.
Figure 9 shows the results of the scenario where there is a different correlation
between the true nulls (ρ0), between the false nulls (ρ1), and between each true and false
null (ρ01).
Figure 9: Clockwise, from the top left: observed average proportion of rejected hypotheses,
observed FDR, observed pFDR, and observed average positive power for the different
combinations of ρ0, ρ1, and ρ01. Note that if the legend states only ρ0 = 0.9, this implies
the other correlation parameters are zero. Under independence all correlation parameters
are zero, and under equicorrelation they are all 0.9. The values used for the other
parameters are: π = 0.1, α = 0.1, θ = 0.436, and m = 1000. Note that the y-axis does
not have the same scale in each panel.
In terms of the proportion of rejected hypotheses there is a clear contrast between
the three situations where the true nulls are correlated (ρ0 = 0.9), and the two situations
where they are not. If only the false nulls are correlated, the average proportion of
rejected hypotheses is somewhat larger than under independence for low n, and slightly
smaller for high n, but is in general fairly close to p∗. The three other correlation
structures all lead to a proportion of rejected hypotheses that is higher than p∗ for all
n. For the FDR we again see this same contrast: If only the false nulls are correlated
the observed FDR is similar to independence, but when the true nulls are correlated,
the observed FDR is much lower. If the true nulls are highly correlated, either many of
the true nulls are rejected, or none at all (results not shown). However, the observed
probability of rejecting many true nulls is far lower than the probability of rejecting
none of them. For example, with ρ0 = 0.9, ρ1 = 0 and ρ01 = 0, we observe a 92.3%
probability of rejecting none of the true nulls.
In the bottom row of Figure 9, we again see a contrast between the situation where
only the true nulls are correlated (ρ0 = 0.9, ρ1 = 0, and ρ01 = 0), and the situation
where only the false nulls are correlated (ρ0 = 0, ρ1 = 0.9, and ρ01 = 0). If only
the true nulls are correlated, the pFDR converges to some value lower than (1 − π)α,
similar to the equicorrelation situation. However, this convergence occurs much more
quickly, even more quickly than under independence; that is, a lower n is required to
obtain pFDR control compared to when all nulls are independent.
This is in contrast to the situation where only the false nulls are correlated. In this case,
the pFDR converges to (1 − π)α, but it requires the highest n out of all conditions to
obtain pFDR control.
In terms of power we see that, when only the true nulls are correlated, the average
power is very close to the asymptotic power, but when only the false nulls are correlated,
the behavior of the BH method in terms of power is more similar to equicorrelation than
independence. This is the opposite of what we see in terms of the proportion of rejected
hypotheses, where if the true nulls are correlated behavior is similar to independence,
but if the false nulls are correlated, behavior is similar to equicorrelation. The observed
differences between the situation where both the false and true nulls are independently
correlated (ρ0 = 0.9, ρ1 = 0.9, and ρ01 = 0), and the situation with equicorrelation
appear small in terms of all outcome measures.
The previously described contrasts between the situations where only the true or
false nulls are correlated also appear in the distributions of the proportion of rejected
hypotheses and power, as can be observed in Figure 10. The figure additionally shows
that under equicorrelation, an increase in ρ leads to an increase in the variance of the
power distribution. The interquartile ranges for the values of ρ from 0 to 0.9 are,
respectively, 0.093, 0.096, 0.096, 0.118, 0.129, 0.142, 0.171, 0.210, 0.220, and 0.246.
The observed probabilities of the power being at least as high as the power estimator
incorporating pl for the different values of ρ are, respectively, 0.994, 0.994, 0.998, 0.983,
0.973, 0.954, 0.937, 0.906, 0.877, and 0.869. So for ρ ≤ 0.5 we observe that the power
is higher than this bound more than 95% of the time. When only the true nulls are
correlated with ρ = 0.9, this observed probability is 0.992, whereas if only the false nulls
are correlated, it is 0.894. The corresponding interquartile ranges are, respectively, 0.093
and 0.252.
Figure 10: Top: violin plots of the distributions of Rm/m for each correlation structure,
with α = 0.1, π = 0.1, m = 1000, and n = 50. Bottom: violin plots of the distributions
of the positive power. In the left column, all features share the same correlation. In
the right column, the correlations between the true nulls (ρ0), between the false nulls
(ρ1), and between each true and false null (ρ01), are specified separately. Note that the
kernel densities are scaled to have equal maximum width. The quantity pl is equal to p∗
minus the LIL bound given in (23). The plot was created using the vioplot package
(Adler, 2005).
3.4 Simulations Based on Real Data
The goal of this simulation experiment was to investigate the performance of the BH
procedure in a setting derived from real data. The results of this experiment can be
observed in Figure 11.
[Figure 11 appears here: four scatter panels plotting, against n, the observed proportion rejected, observed FDR, observed average positive power, and observed pFDR, each for σθ = 0, σθ = .067, and σθ = .134, with reference lines at p∗(µθ), p∗(µθ)(1−(1−π)α)/π, (1 − π)α, and α.]
Figure 11: Clockwise, from the top left: Observed average proportion of rejected hypotheses, observed FDR, observed pFDR, and observed average positive power, with α = 0.1, π = 0.1, and m = 1000. In the left-hand panels, the dashed lines indicate p∗ calculated using the 45th and 55th percentiles of the effect size distribution with σθ = 0.134, rather than the mean. The dotted lines indicate p∗ calculated using π = 0.08 and π = 0.12. Note that the y-axis does not have the same scale in each panel.
Figure 11 shows that, for a fixed effect size, the proportion of rejected hypotheses
and observed power are very close to the expected values. When the effect size is
not fixed, it appears both the proportion of rejected hypotheses and power are higher
than expected for low n, but lower than expected for high n. This effect appears more
pronounced as the variance of the effect size distribution increases. It appears that
underestimating the proportion of signal or mean effect size in the data (shown in the
left panels of Figure 11 as the lower dotted and dashed lines respectively) can guard
against the possibility of lower power to some extent, although this is limited by the
shape of the power curve. That is, under an effect size distribution, the power as a
function of n has a different shape than the power curve based on the assumption of
a fixed effect size. Underestimating π or θ can provide some conservativeness, but the
resulting theoretical power curve does not accurately reflect the shape of the true power
function.
Figure 12 shows the distributions of the observed proportion of rejected hypotheses
and power for a fixed value of n. A higher value of σθ is associated with a slightly smaller
variance of both the proportion of rejected hypotheses and power distributions, but with
more extreme outliers in the proportion of rejected hypotheses.
[Figure 12 appears here: violin plots of the observed proportion rejected (reference lines p∗(µθ) and pl(µθ)) and of the observed positive power (reference lines p∗(µθ)(1−(1−π)α)/π and pl(µθ)(1−(1−π)α)/π), for σθ = 0, σθ = .067, and σθ = .134.]
Figure 12: Top: Violin plots of the distributions of Rm/m for each effect size distribution with n = 300, α = 0.1, π = 0.1, and m = 1000. Bottom: Violin plots of the distributions of the positive power. Note that the kernel densities are scaled to have equal maximum width. The quantity pl is equal to p∗ minus the LIL bound given in (23). The plot was created using the vioplot package (Adler, 2005).
3.5 Comparison of Sample Size Estimates
The goal of this simulation experiment was to compare several of the power estimators
from section 2.2 in the context of sample size calculation. Given estimates for π, π0, and
θ, we calculated the minimum n such that the selected power estimate was greater than
the desired level ω. We then performed simulations for this value of n, and recorded
observed average power, the probability that the power is greater than zero, and the
probability that the power is equal to or greater than ω. Results for ω = 0.8 and a fixed
effect size can be observed in Table 2.
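The search for the minimum n can be sketched as a simple scan over sample sizes. The sketch below is illustrative only: power_stand_in is a hypothetical one-sided two-sample z-test power at a fixed per-test p-value threshold of 0.001, used here as a stand-in for the estimators of section 2.2 (which this code does not implement), and the parameter values are assumptions.

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def norm_ppf(q):
    # invert the standard normal CDF by bisection
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if norm_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def power_stand_in(n, theta=0.436, threshold=0.001):
    # one-sided two-sample z-test power at a fixed p-value threshold,
    # with n observations per group (illustrative stand-in only)
    return norm_cdf(theta * math.sqrt(n / 2) - norm_ppf(1 - threshold))

def min_sample_size(power_fn, omega, n_max=10000):
    # smallest n whose estimated power reaches the desired level omega
    for n in range(2, n_max + 1):
        if power_fn(n) >= omega:
            return n
    return None

n_req = min_sample_size(power_stand_in, omega=0.8)
```

Substituting one of the power estimators from section 2.2 for power_stand_in yields the kind of sample size calculation underlying Table 2.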
Table 2: Sample size estimates and observed power for a number of estimators, using ω = 0.8 and σθ = 0 (i.e. a fixed effect size). Note that for sample size calculations based on the criticality phenomenon (α∗ < α), ω is not a determinant of the sample size. Observed average power, probability of at least one correct rejection, and probability that the power reaches or exceeds the desired level are based on 100 simulations per combination of the experimental factors.
θ       method                n      E(Rm×TDP/m1)   P(Rm×TDP/m1 > 0)   P(Rm×TDP/m1 ≥ 0.8)
0.436   α∗ < α                16     0.0047         0.60               0
        p∗(α)(1−(1−π)α)/π     68     0.81           1                  0.72
        pl(α)(1−(1−π)α)/π     82     0.90           1                  1
        p∗(αγ)(1−hα)/π        112    0.98           1                  1
        pl(αγ)(1−hα)/π        142    0.99           1                  1
0.158   α∗ < α                36     0.0004         0.15               0
        p∗(α)(1−(1−π)α)/π     484    0.80           1                  0.60
        pl(α)(1−(1−π)α)/π     592    0.90           1                  1
        p∗(αγ)(1−hα)/π        806    0.97           1                  1
        pl(αγ)(1−hα)/π        1032   0.99           1                  1
As can be observed in the table, sample sizes calculated using the asymptotic
expression for the average power do indeed lead to an observed average power at least
as high as the desired level (0.8). Using any of the more conservative estimators leads
to an observed P(Rm × TDP/m1 ≥ 0.8) = 1. Sample size calculations based on the criticality
phenomenon indeed lead to a positive average power, but this power can be very small
(e.g. 0.0004) and the probability that the power is greater than zero need not be large
(e.g. 0.15). Table 3 shows the results of using the same sample size estimates, but now
the effect sizes follow a distribution with σθ = 0.134.
Table 3: Sample size estimates and observed power for a number of estimators, using ω = 0.8 and σθ = 0.134. Note that for sample size calculations based on the criticality phenomenon (α∗ < α), ω is not a determinant of the sample size. Observed average power, probability of at least one correct rejection, and probability that the power reaches or exceeds the desired level are based on 100 simulations per combination of the experimental factors.
θ       method                n      E(Rm×TDP/m1)   P(Rm×TDP/m1 > 0)   P(Rm×TDP/m1 ≥ 0.8)
0.436   α∗ < α                16     0.0165         0.88               0
        p∗(α)(1−(1−π)α)/π     68     0.72           1                  0
        pl(α)(1−(1−π)α)/π     82     0.79           1                  0.30
        p∗(αγ)(1−hα)/π        112    0.87           1                  1
        pl(αγ)(1−hα)/π        142    0.92           1                  1
0.158   α∗ < α                36     0.0082         0.83               0
        p∗(α)(1−(1−π)α)/π     484    0.62           1                  0
        pl(α)(1−(1−π)α)/π     592    0.66           1                  0
        p∗(αγ)(1−hα)/π        806    0.71           1                  0
        pl(αγ)(1−hα)/π        1032   0.75           1                  0
For sample sizes based on the criticality phenomenon we observe an increase in
average power, as well as an increase in the probability of at least one correct rejection.
For all other sample sizes, performance is worse compared to the situation with a fixed
effect size. When the variance of the effect size distribution is not so large compared to
its mean (θ = 0.436), sample size calculations based on the asymptotic average power or
the LIL bound lead to an observed average power which is lower than the desired 0.8.
Sample size calculations that incorporate FDP exceedance control, however, still lead to
an observed P(Rm × TDP/m1 ≥ 0.8) = 1. When the variance of the effect size distribution is
large compared to its mean (θ = 0.158), this is no longer the case and we observe that
even if FDP exceedance control is combined with the LIL bound, the observed average
power is only 0.75, and the observed P(Rm × TDP/m1 ≥ 0.8) = 0.
For a fixed effect size, if we set the desired power level ω = 0.5, we observe little
difference in terms of P(Rm × TDP/m1 > 0) and P(Rm × TDP/m1 ≥ ω) compared to when ω = 0.8
(see Appendix A). For a variable effect size, however, differences between ω = 0.8 and
ω = 0.5 can be very large, as can be observed in Table 4.
Table 4: Sample size estimates and observed power for a number of estimators, using ω = 0.5 and σθ = 0.134. Note that for sample size calculations based on the criticality phenomenon (α∗ < α), ω is not a determinant of the sample size. Observed average power, probability of at least one correct rejection, and probability that the power reaches or exceeds the desired level are based on 100 simulations per combination of the experimental factors.
θ       method                n      E(Rm×TDP/m1)   P(Rm×TDP/m1 > 0)   P(Rm×TDP/m1 ≥ 0.5)
0.436   α∗ < α                16     0.0165         0.88               0
        p∗(α)(1−(1−π)α)/π     46     0.52           1                  0.75
        pl(α)(1−(1−π)α)/π     50     0.57           1                  0.99
        p∗(αγ)(1−hα)/π        74     0.75           1                  1
        pl(αγ)(1−hα)/π        82     0.79           1                  1
0.158   α∗ < α                36     0.0082         0.83               0
        p∗(α)(1−(1−π)α)/π     314    0.52           1                  0.72
        pl(α)(1−(1−π)α)/π     354    0.54           1                  0.96
        p∗(αγ)(1−hα)/π        518    0.63           1                  1
        pl(αγ)(1−hα)/π        572    0.65           1                  1
We now observe that E(Rm × TDP/m1) ≥ ω for all sample sizes, except those based on
the criticality phenomenon. Furthermore, when using the LIL bound for the proportion
of rejected hypotheses, the observed P(Rm × TDP/m1 ≥ 0.5) is at least 0.96. If we set the
desired power level ω = 0.1, we obtain the results shown in Table 5.
Table 5: Sample size estimates and observed power for a number of estimators, using ω = 0.1 and σθ = 0.134. Note that for sample size calculations based on the criticality phenomenon (α∗ < α), ω is not a determinant of the sample size. Observed average power, probability of at least one correct rejection, and probability that the power reaches or exceeds the desired level are based on 100 simulations per combination of the experimental factors.
θ       method                n      E(Rm×TDP/m1)   P(Rm×TDP/m1 > 0)   P(Rm×TDP/m1 ≥ 0.1)
0.436   α∗ < α                16     0.0165         0.88               0
        p∗(α)(1−(1−π)α)/π     26     0.18           1                  1
        pl(α)(1−(1−π)α)/π     30     0.26           1                  1
        p∗(αγ)(1−hα)/π        44     0.49           1                  1
        pl(αγ)(1−hα)/π        50     0.57           1                  1
0.158   α∗ < α                36     0.0082         0.83               0
        p∗(α)(1−(1−π)α)/π     164    0.33           1                  1
        pl(α)(1−(1−π)α)/π     192    0.38           1                  1
        p∗(αγ)(1−hα)/π        288    0.49           1                  1
        pl(αγ)(1−hα)/π        324    0.53           1                  1
We again observe that E(Rm × TDP/m1) ≥ ω for all sample sizes, except those based
on the criticality phenomenon. The observed average power is now generally much
larger than the desired level. For example, when θ = 0.158, a sample size based on
the asymptotic average power now leads to an observed average power of 0.33, which is
much larger than the desired 0.1. In fact, we now observe for both effect sizes that the
probability of the power being greater than, or equal to, the desired level, is 1 for all
sample sizes except those based on the criticality phenomenon. Results for ω = 0.1 in
combination with a fixed effect size are included in Appendix A.
4 Discussion
4.1 Performance of the BH Procedure in a Non-Asymptotic Setting
In our first simulation experiment, we investigated how a finite number of tests affects
the behavior of the BH procedure. We find that for a finite number of tests, the average
proportion of rejected hypotheses is larger than the asymptotic proportion of rejected
hypotheses p∗. We applied the BH procedure for different combinations of the proportion
of false nulls π and the FDR control parameter α, and observe this behavior in all cases,
although it is much more pronounced when π is low and α is high. If we fix the proportion
of false nulls among the nulls in the data, we observe that the positive power likewise
converges to its asymptotic value from above. Benjamini and Hochberg (1995) already
observed that the power of their procedure decreases as the number of tests increases,
a phenomenon they referred to as “the cost of multiplicity control” (p. 296). We
additionally observe that, for low π and high α, as n → ∞, the proportion of rejected
hypotheses converges to some value larger than p∗. We suspect there is some constant
or ratio due to the discontinuous nature of the empirical distribution functions which
can explain this difference, but deriving an expression for such a quantity was deemed to
be outside the scope of this study, since it has only marginal practical relevance. After
all, FDR control levels of 0.9 or 0.5 are never used in practice. In the most realistic
setting, where π = 0.1 and α = 0.1, the observed differences between different values of
m are much smaller. Furthermore, since the observed proportion of rejected hypotheses
is larger than p∗ for finite m, calculations based on p∗ are at most conservative.
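The asymptotic proportion of rejected hypotheses p∗ can be computed numerically as the largest fixed point of u = G(αu), where G is the mixture CDF of the p-values. The sketch below is an illustration under assumptions: it approximates the two-sample t-test by a one-sided z-test with noncentrality θ√(n/2) (n per group), so its numbers will not exactly match those reported in this thesis.

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def norm_ppf(q):
    # invert the standard normal CDF by bisection
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if norm_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def p_star(theta, n, pi, alpha):
    """Asymptotic BH rejection proportion: largest fixed point of u = G(alpha * u)."""
    mu = theta * math.sqrt(n / 2)  # noncentrality of a one-sided two-sample z-test
    def G(t):
        # p-value mixture CDF: (1 - pi) uniform nulls + pi alternatives
        return (1 - pi) * t + pi * norm_cdf(mu - norm_ppf(1 - t))
    u = 1.0
    for _ in range(5000):
        u_next = G(alpha * u)
        if abs(u_next - u) < 1e-12:
            break
        u = u_next
    return u

def asymptotic_avg_power(theta, n, pi, alpha):
    # the expression p*(1 - (1 - pi) * alpha) / pi used throughout this thesis
    return p_star(theta, n, pi, alpha) * (1 - (1 - pi) * alpha) / pi
```

Starting the iteration at u = 1 and iterating downward converges to the largest fixed point, which is the supercritical solution when α > α∗.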
We observe an interaction between the number of tests and sample size: the ob-
served differences between the actual proportion of rejected hypotheses and p∗ are gen-
erally larger for smaller n. This is consistent with the results of Neuvial (2010), who
observed that when α < α∗, i.e. when p∗ = 0, the differences between the observed
(m = 1000) and asymptotic power were larger than when α > α∗, i.e. when p∗ > 0.
We see similar behavior, and additionally observe that this behavior becomes more pro-
nounced for smaller m.
We do note, however, that if the proportion of false null hypotheses in the data is
not fixed, but is instead allowed to vary due to hypotheses being sampled from a mixture
model with population fraction of false nulls π, the behavior of the positive power is not
analogous to the behavior of the proportion of rejected hypotheses. In this setting, the
observed power is higher than the asymptotic power for small n, but lower for large n.
However, with the exception of m = 10, the differences between observed power and
asymptotic power for large n are fairly small.
We also observe that, for a fixed n, a lower number of tests leads to larger variance
of the distributions of the proportion of rejected hypotheses and positive power. If the
sample proportion of false nulls is fixed this variance is reduced, but in both settings
the realized power can vary wildly for m ≤ 100. We observe that pl is a reasonably
effective lower bound on the proportion of rejected hypotheses: For a fixed value of n,
the observed probability of rejecting at least a proportion pl of the hypotheses was 0.97
or higher. However, since the value of pl is very low for low m, we only expect this bound
to be useful for high m, e.g. m ≥ 1000. Fortunately, such a number of hypotheses is
typical for many multiple testing experiments.
4.2 Performance of the BH Procedure Under Basic Forms of Dependence
In our second simulation experiment, we investigated how certain basic forms of depen-
dence affect the behavior of the BH procedure. We first considered an equicorrelated
setting, with a fixed population correlation coefficient between all the features. We find
that, for population correlations between 0 and 0.9, the FDR remains controlled, which
is in agreement with existing research regarding the performance of the BH procedure
under dependency (see e.g. Goeman & Solari, 2014). We find that an increase in correla-
tion between the features is associated with a decrease in observed FDR, but an increase
in the level of n required to obtain pFDR control. We observe that an increase in corre-
lation is associated with a larger proportion of rejected hypotheses. In terms of positive
power, we find that for low n, power is increased with higher correlation, but for high
n, power is reduced with higher correlation. Observed differences between correlated
features and independent features are small, however, even for very high correlations.
This is consistent with the results of Jung (2005), who observed the asymptotic power
remains a good estimator for the median power under dependence, and with those of
Hemmelmann et al. (2005), who observed only a small reduction in power when the
pair-wise correlations were increased from 0.2 to 0.8.
We do observe an increase in the variance of both the proportion of rejected hy-
potheses and the positive power as correlation increases. This is consistent with the
results of Jung (2005), Hemmelmann et al. (2005), and Shao and Tseng (2007), who
all observed a similar effect. In particular, Jung (2005) observed that for ρ = 0.6, the
interquartile range of the observed proportion of rejected hypotheses almost doubled
compared to independence. We observe the same effect for the interquartile range of the
power: For a fixed n, we observe an IQR of 0.09 under independence, and an IQR of 0.17
when ρ = 0.6. Even though the IQR is almost doubled, we observe that the same is not
true for the entire range of the distribution, so that 94% of the time, the observed power
was still higher than the lower LIL bound. Since we consider all possible values of ρ from
0 to 0.9 in increments of 0.1, we can assess the effect of correlation magnitude. For a
fixed n, we find that when the features are weakly correlated (ρ ≤ 0.2), the distribution
of the power is very similar to the independence setting, with the IQR only increased
from 0.093 to 0.096. From ρ = 0.2 onward, outliers start to appear in the distribution of
the proportion of rejected hypotheses, but it is not until ρ > 0.5 that more than
5% of the distribution lies below the LIL bound.
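The equicorrelated setting can be reproduced with a one-factor construction, in which every test statistic shares a common standard normal component with weight √ρ. The sketch below is a simplified illustration with assumed parameter values (a z-test stand-in for the two-sample t-tests used in our experiments): it applies the BH procedure and averages the FDP over replicates.

```python
import math
import random

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def bh_reject(pvals, alpha):
    # Benjamini-Hochberg: reject the k smallest p-values, where k is the
    # largest rank with p_(k) <= k * alpha / m
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    return set(order[:k])

def average_fdp(rho, m=500, pi=0.1, mu=3.0, alpha=0.1, reps=50, seed=1):
    rng = random.Random(seed)
    m1 = int(pi * m)  # the first m1 features are false nulls
    total = 0.0
    for _ in range(reps):
        z = rng.gauss(0, 1)  # common factor inducing equicorrelation rho
        stats = [math.sqrt(rho) * z + math.sqrt(1 - rho) * rng.gauss(0, 1)
                 + (mu if i < m1 else 0.0) for i in range(m)]
        pvals = [1 - norm_cdf(s) for s in stats]
        rej = bh_reject(pvals, alpha)
        false_rej = sum(1 for i in rej if i >= m1)
        total += false_rej / max(len(rej), 1)
    return total / reps
```

Under these assumptions, the average FDP at ρ = 0 lands near (1 − π)α = 0.09; increasing ρ leaves the mean controlled but inflates the spread of the FDP across replicates, consistent with the behavior described above.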
We also investigated the effect of correlating the true and false nulls separately.
We again find that under all experimental conditions, the FDR remains controlled. We
observe that if only the true nulls are correlated, a lower n is required to obtain pFDR
control compared to both equicorrelation and independence, whereas if only the false
nulls are correlated, pFDR control requires a higher n than both equicorrelation and
independence. In terms of power, we observe that average power is very close to the
asymptotic power when only the true nulls are correlated, but follows a similar pattern to
equicorrelation when only the false nulls are correlated. The largest difference between
correlating only the true or only the false nulls can be observed in the variance of
the power distribution. Although we observe more outliers in the distribution of the
proportion of rejected hypotheses when only the true nulls are correlated, the power
distribution is quite similar to independence, with the same interquartile range, and the
lower LIL bound performs well. When only the false nulls are correlated, however, the
observed power varies wildly between 0 and 1 in a manner similar to the equicorrelation
scenario. Although the correlation magnitudes used for this experiment are extreme (0
or 0.9), the results do clearly illustrate that correlations between the false nulls can have
a large effect on the power of the BH procedure compared to correlations between the
true nulls. Correlations between the true nulls do affect the FDP, however, since when
the true nulls are strongly correlated, this implies the associated p-values are either all
high, or all low, and so we expect to either reject many of them, or none at all. We
observed a probability of rejecting no true nulls of 0.923, leading to an observed FDR
much lower than the desired level. Thus, if only the true nulls are correlated, there is
a reduction in FDR as in equicorrelation, but without the increase in variance of the
power distribution. There is, however, still a probability of obtaining an FDP far greater
than the desired FDR level.
4.3 Performance of the BH Procedure With an Effect Size Distribution
In our third simulation experiment, we investigated how the BH procedure performs in
simulations based on real data. We added signal to the data, either with a fixed effect
size, or with effect sizes drawn from a normal distribution. For a fixed effect size, we
observe that the average power is very close to the asymptotic power. For a fixed sample
size, we observe that 99% of the time, the power was higher than the lower LIL bound.
This combination of results indicates that the effect of the dependency in the data on
the power of the BH procedure is very limited.
When effect sizes are drawn from a normal distribution, however, the observed
power is only close to the asymptotic power when the power is near 0 or near 0.5. For
an asymptotic power between 0 and 0.5, i.e. for low n, the observed power is greater
than predicted, whereas for an asymptotic power between 0.5 and 1, i.e. for high n, the
observed power is lower than predicted. This effect becomes more pronounced as the
variance of the effect size distribution increases compared to its mean.
We believe this effect is caused by the shape of the power curve as a function of
θ. For low n, an increase in effect size of, for example, 0.5σθ, leads to an increase in
power. The amount by which the power increases is larger than the amount by which
it decreases if the effect size is decreased by 0.5σθ. For high n, the amount by which
the power increases is smaller than the amount by which it decreases if the effect size
is decreased by 0.5σθ. This effect is illustrated in Figure 13.
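This asymmetry can be checked directly on a stylized power curve. The sketch below assumes a one-sided z-test at a fixed per-test p-value threshold of 0.05 (an illustrative stand-in for the data-dependent BH threshold) and uses θ = 0.158 with shifts of 0.5σθ = 0.067.

```python
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(theta, n):
    # one-sided z-test power at p-value threshold 0.05, n per group;
    # 1.6449 is the 0.95 quantile of the standard normal distribution
    return norm_cdf(theta * math.sqrt(n / 2) - 1.6449)

theta, delta = 0.158, 0.067  # delta = 0.5 * sigma_theta with sigma_theta = 0.134
gain_200 = power(theta + delta, 200) - power(theta, 200)
loss_200 = power(theta, 200) - power(theta - delta, 200)
gain_400 = power(theta + delta, 400) - power(theta, 400)
loss_400 = power(theta, 400) - power(theta - delta, 400)
```

On the rising, convex part of the curve the gain from a higher effect size exceeds the loss from a lower one (gain_200 > loss_200), while past the inflection point the loss dominates (loss_400 > gain_400), which is the mechanism behind Figure 13.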
[Figure 13 appears here: power plotted as a function of θ, top panel n = 200, bottom panel n = 400, with a cross marking θ = 0.158.]
Figure 13: Top: Power as a function of θ for n = 200. Bottom: Power as a function of θ for n = 400. In both panels, the blue cross indicates the power for θ = 0.158. In the top panel, the increase in power associated with an increase of θ by 0.067 is larger than the decrease in power associated with a decrease of θ by the same amount. In the bottom panel, the opposite is true.
4.4 Performance of Power Estimators
In our fourth simulation experiment, we compared several power estimators in their
ability to provide adequate sample size estimates. If the false nulls of interest have a
fixed effect size, we observe that sample size estimates based on the asymptotic average
power indeed lead to an average power at least as high as the desired level. For m = 5000,
we observe that incorporating the lower LIL bound in the power estimator leads to a
98% or higher probability of the power being at least as high as the desired level. When
incorporating FDP exceedance control at the γ = 0.1 level, or both FDP exceedance
control and the lower LIL bound, the observed probability of the power being at least
the desired level is 1. Sample size calculations based on the criticality phenomenon lead
to a positive average power, but the probability of making no discoveries can be large.
If the false nulls of interest have a variable effect size that follows a normal distri-
bution, we observe the same behavior as described in section 4.3: if the desired power
level is low we observe a higher power than desired, whereas if the power level is high,
we observe a lower power than desired. For ω ≤ 0.5, any of the tested power estimators
can be safely used, although they can be too conservative. For ω = 0.8, however, even
applying both FDP tail control and the lower LIL bound, that is, using a sample size
more than twice as high as the one suggested by using the asymptotic average power,
cannot ensure that the power reaches the desired level.
4.5 Sample Size Calculation in Practice
Existing sample size calculation methods for the BH procedure as implemented in R
assume an infinite number of independent tests. In our simulation experiments we observe that,
under independent tests and a fixed effect size, the asymptotic average power is an ac-
curate estimator for the observed positive power for a number of hypotheses typical for
high-dimensional experiments. In fact, it remains a decent estimator for all but the
lowest number of tests we considered (m = 10), assuming the proportion of signal in the
data and the FDR control parameter are not very high, which is generally the case in
practice. We additionally observe that if the proportion of signal in the data is fixed, the
asymptotic average power is a conservative power estimator. These observations indicate
that if independent tests and a fixed effect size are reasonable assumptions, the asymp-
totic average power can be safely used for sample size calculation in high-dimensional
experiments. However, the fewer hypotheses are tested, the larger the variance of the
power distribution becomes. One can correct for this by incorporating the lower LIL
bound in the power estimator. This ensures a high probability of the power of the
experiment being at least as high as desired.
In general, independent tests will not be a reasonable assumption. In our simulation
experiments we observe that even if the features are very highly correlated (ρ = 0.9), the
observed average power is close to the asymptotic average power. However, the variance
of the power distribution is very large: it is not unlikely a single experiment will lead to
no rejections at all, even if the expected power is large. If features are expected to be
highly correlated, basing sample size calculations on the asymptotic average power, or
any of the related estimators described in this paper, may not be desirable. In this case,
a researcher may consider incorporating an adjustment for dependence, as suggested by
Shao and Tseng (2007).
We observe, however, that if the data are weakly or moderately correlated (ρ ≤ 0.5),
the range of the power distribution is only mildly affected, and the lower LIL bound remains
effective. Additionally, only correlations between the false nulls (the features that
carry signal) have a negative effect on the average power of the BH procedure. Even if strong correlations between the
true nulls exist, sample size calculations can be safely based on the asymptotic average
power, since these correlations will not have a negative effect on the power of the BH
procedure. They can, however, lead to a realized FDP far higher than the desired FDR
control level.
If the effect sizes of interest cannot be assumed fixed, one should be careful when
basing sample size calculations on the assumption of a fixed effect size. In our simulation
experiments, we observe that if a high power level is desired, sample sizes calculated
based on the asymptotic average power can be insufficient to ensure the experiment
has the desired power. This behavior is more severe if the variance of the effect size
distribution is high compared to its mean. Using a more conservative power estimator,
or a more conservative estimate of π or θ, will lead to more conservative estimates of n,
but this will only help up to a point. In section 3.5, we observe a situation where even
using a very conservative power estimator, which leads to an estimate of n twice as large
as the one obtained using the asymptotic average power, does not ensure the average
power reaches the desired level. In fact, in this scenario the desired power is reached for
none of the 100 simulated data sets. If the effects of interest are suspected to have a
large variance compared to their mean, a researcher should consider estimating the effect
size distribution beforehand. If a lower power level is desired, for example if one wishes
to discover at least 50% of all features with high probability, sample size calculations
remain fairly accurate, whereas for even lower levels (e.g. 10%) they are conservative.
It should be noted that the average power calculated in these experiments consid-
ers all false nulls, but a researcher need not be interested in all of them. He or she
could be interested only in false nulls with an effect size “in the range of θ”, and may
consider false nulls with small effect sizes to be irrelevant. In this case, even though
the distribution of all effect sizes potentially has high variance, the same need not be
true for the distribution of the effect sizes of interest. In our simulations we observed
that if the effect sizes of interest have a mean of 0.436 with a standard deviation of
0.134 (interquartile range: 0.35 to 0.53), applying FDP exceedance control at the 90%
level still ensured the desired power level of 0.80 was reached or exceeded 100% of the
time. Applying the lower LIL bound led to an average power of 0.79, only slightly
below the desired level. If the variance of the effect size distribution of interest is low
enough, sample size calculations based on the assumption of a fixed effect size may still
perform well, although using the more conservative power estimators suggested in this
paper may be preferable to using the traditional expression for the asymptotic average
power. If a researcher is interested in effect sizes of “at least θ”, this implies the effect
size distribution of interest is skewed. In this case, a higher variance of this distribution
leads to more conservative power estimates. Nevertheless, an overly conservative sample
size estimate can also be undesirable, as this will lead to unnecessary costs. This may
justify the costs associated with performing a pilot experiment and estimating the effect
size distribution.
In this study we have discussed twelve different power estimators based on the
asymptotic expression for the average power. If the effects of interest are assumed to
follow an effect size distribution with high variance compared to its mean, one should
be careful in applying any of these estimators since they are all based on an assumption
of equal effect sizes. In the event of equal effect sizes, however, even if the number
of tests is finite and the features are weakly or moderately correlated, the asymptotic
expression for the average power is an accurate estimator. To ensure a high probability
that the power will be at least as high as the desired level, one can incorporate the
lower LIL bound into the power estimator. With 5000 tests, applying FDP exceedance
control at the 90% level likewise ensures a high probability that the power will be at least
as high as the desired level, and additionally allows the researcher to make confidence
statements about the FDP and TDP of the rejection set. However, the required sample
size is higher than when using the lower LIL bound. Using both FDP exceedance control
and the lower LIL bound simultaneously appears to be overly conservative. Sample size
calculations based on the criticality phenomenon ensure a positive average power, even
for an infinite number of tests, but the probability of no discoveries can be high even for
5000 tests. These sample sizes therefore act as a sort of “absolute minimum” to ensure
that the average power will always remain non-zero.
We have additionally discussed several possible estimators for the proportion of
true null hypotheses. If, in the opinion of the researcher, an adequate estimate of π is
available, setting π0 = 1−π is the most natural choice. Alternatively, one can incorporate
some uncertainty about the proportion of true nulls by using π0 = 1. The quantity h
provides a middle ground, and is the most natural choice when applying FDP exceedance
control, due to its interpretation as a confidence bound. If the proportion of signal in
the data is small, however, the choice of π0 is far less influential than the choice of the
other components of the power estimator.
4.6 Limitations and Future Research
We studied the effect of sample size, number of tests, and basic dependence, on the
power of the BH procedure using simulated data. Due to practical constraints, it is not
possible to perform simulations for every combination of every level of the experimental
factors. If an experimental factor was fixed in a simulation experiment, we did so at a
value we believe is reflective of the kind of settings in which the BH procedure is most
commonly applied, e.g. with at least a thousand hypothesis tests, and with small α and
π. Regardless, our simulated data may not be reflective of real data, so we additionally
performed simulations based on a real data set. However, even this experiment contains
some artificial elements, since the identity of the true and false nulls is required to be
known in order to calculate quantities like the FDP and average power. Nevertheless,
we find that the effect of the correlation structure present in the real data set is much
smaller in magnitude than the effects we have observed in some of our highly correlated
artificial data, indicating effects of correlation in a practical setting may be milder than
those seen in our simulations. It should be noted, however, that due to removal of
features from the data, the correlation structure may not be completely representative
of the correlation structure in the complete data set.
Although our simulations based on real data still contain a sizable artificial compo-
nent, they do show that parameters of the effect size distribution have considerable effect
on the power as a function of sample size. We observe that, if the mean of the effect size
distribution is high enough and/or its variance is low enough, sample size calculations
based on the assumption of a fixed effect size may still perform well, although we con-
sider only two values for the mean, and two values for the variance parameter, and so we
cannot very accurately quantify what “high enough” or “low enough” means in this con-
text. Such accurate quantifications may not be very useful in practice anyway, however,
since if detailed information about the effect size distribution was known beforehand, a
researcher could incorporate this information into the sample size calculations. We do
urge researchers to consider carefully if sample size calculations should be based on the
assumption of a fixed effect size, since depending on the desired power level, this may
lead to under- or overestimation of the required sample size.
In this study, we have proposed several different power estimators for use in sample
size calculation. We have focused on two-sample t-tests, or equivalently, an F-test with
two groups. The methodology described can be readily extended to F-tests with more
than two groups. In terms of future research, it would be practical to extend this
methodology to other commonly used tests, such as χ2-tests. Additionally, it would be
interesting to further investigate whether additional “corrections” for violated assumptions can
be incorporated, such as a correction for dependency as suggested by Shao and Tseng
(2007). Likewise, a next step would be to incorporate the estimation of an effect size
distribution into the methodology described in this study.
4.7 Concluding Remarks
In summary, our experiments suggest that power calculations based on the asymptotic
average power are quite robust to violations of assumptions. The asymptotic average power
remains a fairly accurate estimator of the observed average power for all but the lowest
number of hypotheses (m = 10), and even if the variables are very highly correlated.
However, a decrease in the number of hypotheses, or an increase in correlation between
the features, leads to an increase in the variance of the power distribution. Using a
more conservative power estimator, based on the LIL bound or FDP exceedance con-
trol, can provide a high probability of reaching at least the desired power level for a
moderate to large number of hypotheses (e.g. m ≥ 1000). We observed at least a 95%
probability of exceeding the desired power level when incorporating the LIL bound in
the power estimator, even when the features were weakly to moderately equicorrelated
(ρ ≤ 0.5), or when only the false nulls were correlated. Sample size estimation based
on the assumption of a fixed effect size should be performed with care, however, since
depending on the effect size distribution of interest and the desired power level, even
the most conservative power estimators described in this paper cannot ensure adequate
sample sizes.
All power estimators discussed in this study have been incorporated into a Shiny
application. This application allows researchers to perform power and sample size cal-
culations for one- and two-sided two-sample t-tests, using a point-and-click interface.
Additionally, it enables researchers to calculate the quantities α∗, u∗ and p∗, and to
visualize the associated p-value distribution. This application is freely available at:
https://wsvanloon.shinyapps.io/bhpower.
References
Adler, D. (2005). vioplot: Violin plot [Computer software manual]. Retrieved from
http://wsopuppenkiste.wiso.uni-goettingen.de/~dadler (R package ver-
sion 0.2)
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical
and powerful approach to multiple testing. Journal of the Royal Statistical Society:
Series B , 57 (1), 289-300.
Benjamini, Y., & Hochberg, Y. (2000). On the adaptive control of the false discovery
rate in multiple testing with independent statistics. Journal of Educational and
Behavioral Statistics, 25 (1), 60-83.
Benjamini, Y., & Yekutieli, D. (2001). The control of the False Discovery Rate in
multiple testing under dependency. The Annals of Statistics, 29 (4), 1165-1188.
Bi, R., & Liu, P. (2017). ssizeRNA: Sample size calculation for RNA-seq experimental
design [Computer software manual]. Retrieved from https://CRAN.R-project
.org/package=ssizeRNA (R package version 1.2.9)
Brent, R. (1973). Algorithms for minimization without derivatives. Englewood Cliffs,
N.J.: Prentice-Hall.
Chang, W., Cheng, J., Allaire, J., Xie, Y., & McPherson, J. (2017). shiny: Web ap-
plication framework for R [Computer software manual]. Retrieved from https://
CRAN.R-project.org/package=shiny (R package version 1.0.0)
Chi, Z. (2007). On the performance of FDR control: Constraints and a partial solution.
The Annals of Statistics, 35 (4), 1409-1431.
Collado-Torres, L., Nellore, A., Kammers, K., Ellis, S. E., Taub, M. A., Hansen, K. D.,
. . . Leek, J. T. (2016). recount: A large-scale resource of analysis-ready RNA-seq
expression data. bioRxiv . Retrieved from http://biorxiv.org/content/early/
2016/08/08/068478 doi: 10.1101/068478
Dunn, O. (1961). Multiple comparisons among means. Journal of the American Statis-
tical Association, 56 (293), 52-64.
Ferreira, J., & Zwinderman, A. (2006). Approximate power and sample size calculations
with the Benjamini-Hochberg method. The International Journal of Biostatistics,
2 (1), Article 8.
Genovese, C., & Wasserman, L. (2002). Operating characteristics and extensions of the
false discovery rate procedure. Journal of the Royal Statistical Society: Series B ,
64 (3), 499-517.
Genovese, C., & Wasserman, L. (2006). Exceedance control of the false discovery
proportion. Journal of the American Statistical Association, 101 (476), 1408-1417.
Goeman, J., & Solari, A. (2011). Multiple testing for exploratory research. Statistical
Science, 26 (4), 980-987.
Goeman, J., & Solari, A. (2014). Multiple hypothesis testing in genomics. Statistics in
Medicine, 33 (11), 1946-1978.
Groppe, D., Urbach, T., & Kutas, M. (2011). Mass univariate analysis of event-related
brain potentials/fields II: Simulation studies. Psychophysiology , 48 , 1726-1737.
Hemmelmann, C., Horn, M., Susse, T., Vollandt, R., & Weiss, S. (2005). New concepts
of multiple tests and their use for evaluating high-dimensional EEG data. Journal
of Neuroscience Methods, 142 , 209-217.
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance.
Biometrika, 75 (4), 800-802.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian
Journal of Statistics, 6 (2), 65-70.
Hommel, G. (1986). Multiple test procedures for arbitrary dependence structures.
Metrika, 33 , 321-336.
Horn, M., & Dunnett, C. (2004). Power and sample size comparisons of stepwise FWE
and FDR controlling test procedures in the normal many-one case. Lecture Notes-
Monograph Series (Recent Developments in Multiple Comparison Procedures), 47 ,
48-64.
Jouve, T., Maucort-Boulch, D., Ducoroy, P., & Roy, P. (2009). Statistical power in
mass-spectrometry proteomic studies.
(submitted manuscript)
Jung, S. (2005). Sample size for FDR-control in microarray data analysis. Bioinformat-
ics, 21 (14), 3097-3104.
Keselman, H., Cribbie, R., & Holland, B. (2002). Controlling the rate of Type I error
over a large set of statistical tests. British Journal of Mathematical and Statistical
Psychology , 55 , 27-39.
Kim, K., & van de Wiel, M. (2008). Effects of dependence in high-dimensional multiple
testing problems. BMC Bioinformatics, 9 (114). doi: 10.1186/1471-2105-9-114
Korn, E., Troendle, J., McShane, L., & Simon, R. (2004). Controlling the number of false
discoveries: Application to high-dimensional genomic data. Journal of Statistical
Planning and Inference, 124 (2), 379-398.
Kvam, V., Liu, P., & Si, Y. (2012). A comparison of statistical methods for detecting
differentially expressed genes from RNA-seq data. American Journal of Botany ,
99 (2), 248-256.
Lappalainen, T., Sammeth, M., Friedländer, M. R., 't Hoen, P. A. C., Monlong, J., Rivas,
M. A., . . . Dermitzakis, E. T. (2013). Transcriptome and genome sequencing
uncovers functional variation in humans. Nature, 501, 506-511.
Lee, M., & Whitmore, G. (2002). Power and sample size for DNA microarray studies.
Statistics in Medicine, 21 , 3543-3570.
Liu, P., & Hwang, J. (2007). Quick calculation for sample size while controlling false
discovery rate with application to microarray analysis. Bioinformatics, 23 (6),
739-746.
Meijer, R., Krebs, T., Solari, A., & Goeman, J. (2016). Simultaneous control of all false
discovery proportions by an extension of Hommel’s method. arXiv:1611.06739v1 .
Neuvial, P. (2010). Intrinsic bounds and false discovery rate control in multiple testing
problems. arXiv:1003.0747v1 .
Orr, M., & Liu, P. (2015). ssize.fdr: Sample size calculations for microarray experiments
[Computer software manual]. Retrieved from https://CRAN.R-project.org/
package=ssize.fdr (R package version 1.2)
Oura, T., Matsui, S., & Kawakami, K. (2009). Sample size calculations for controlling the
distribution of false discovery proportion in microarray experiments. Biostatistics,
10 (4), 694-705.
Pounds, S. (2016). FDRsampsize: Compute sample size that meets requirements for
average power and fdr [Computer software manual]. Retrieved from https://
CRAN.R-project.org/package=FDRsampsize (R package version 1.0)
R Development Core Team. (2008). R: A language and environment for statistical
computing [Computer software manual]. Vienna, Austria. Retrieved from http://
www.R-project.org (ISBN 3-900051-07-0)
Robinson, M., McCarthy, D., & Smyth, G. (2010). edgeR: a Bioconductor package for
differential expression analysis of digital gene expression data. Bioinformatics, 26 ,
139-140.
Robinson, M., & Oshlack, A. (2010). A scaling normalization method for differential
expression analysis of RNA-seq data. Genome Biology , 11 , R25.
Sarkar, S. (2004). FDR-controlling stepwise procedures and their false negative rates.
Journal of Statistical Planning and Inference, 119-137.
Shang, S., Liu, M., & Shao, Y. (2012). A tight prediction interval for false discovery
proportion under dependence. Open Journal of Statistics, 2 , 163-171.
Shang, S., Zhou, Q., Liu, M., & Shao, Y. (2012). Sample size calculation for controlling
false discovery proportion. Journal of Probability and Statistics. doi: 10.1155/
2012/817948
Shao, Y., & Tseng, C. (2007). Sample size calculation with dependence adjustment for
FDR-control in microarray studies. Statistics in Medicine, 26 , 4219-4237.
Storey, J. (2002). A direct approach to false discovery rates. Journal of the Royal
Statistical Society: Series B , 64 (3), 479-498.
Storey, J. (2003). The positive false discovery rate: A Bayesian interpretation and the
q-value. The Annals of Statistics, 31 (6), 2013-2035.
Van Iterson, M., Van de Wiel, M., Boer, J., & De Menezes, R. (2013). General power and
sample size calculations for high-dimensional genomic data. Statistical Applications
in Genetics and Molecular Biology , 12 (4), 449-467.
Wang, S., & Chen, J. (2004). Sample size for identifying differentially expressed genes
in microarray experiments. Journal of Computational Biology , 11 (4), 714-726.
Yang, Q., Cui, J., Chazaro, I., Cupples, L., & Demissie, S. (2005). Power and type I
error rate of false discovery rate approaches in genome-wide association studies.
BMC genetics, 6 (1). doi: 10.1186/1471-2156-6-S1-S134
A Additional Tables
Table 6: Sample size estimates and observed power for a number of estimators, using
ω = 0.5 and σθ = 0 (i.e., the standard deviation of the effect size distribution is zero).
Note that for sample size calculations based on the criticality phenomenon (α∗ < α),
ω is not a determinant of the sample size. Observed average power, probability of at
least one correct rejection, and probability that the power equals or exceeds the desired
level are based on 100 simulations per combination of the experimental factors.

θ      method               n    E[(Rm×TDP)/m1]  P[(Rm×TDP)/m1 > 0]  P[(Rm×TDP)/m1 ≥ 0.5]
0.436  α∗ < α               16   0.0047          0.60                0
       p∗(α)(1−(1−π)α)/π    46   0.53            1                   0.83
       pl(α)(1−(1−π)α)/π    50   0.59            1                   1
       p∗(αγ)(1−h̄α)/π       74   0.86            1                   1
       pl(αγ)(1−h̄α)/π       82   0.90            1                   1
0.158  α∗ < α               36   0.0004          0.15                0
       p∗(α)(1−(1−π)α)/π    314  0.50            1                   0.56
       pl(α)(1−(1−π)α)/π    354  0.59            1                   1
       p∗(αγ)(1−h̄α)/π       518  0.83            1                   1
       pl(αγ)(1−h̄α)/π       572  0.88            1                   1
Table 7: Sample size estimates and observed power for a number of estimators, using
ω = 0.1 and σθ = 0 (i.e., the standard deviation of the effect size distribution is zero).
Note that for sample size calculations based on the criticality phenomenon (α∗ < α),
ω is not a determinant of the sample size. Observed average power, probability of at
least one correct rejection, and probability that the power equals or exceeds the desired
level are based on 100 simulations per combination of the experimental factors.

θ      method               n    E[(Rm×TDP)/m1]  P[(Rm×TDP)/m1 > 0]  P[(Rm×TDP)/m1 ≥ 0.1]
0.436  α∗ < α               16   0.0047          0.60                0
       p∗(α)(1−(1−π)α)/π    26   0.11            1                   0.55
       pl(α)(1−(1−π)α)/π    30   0.19            1                   0.98
       p∗(αγ)(1−h̄α)/π       44   0.49            1                   1
       pl(αγ)(1−h̄α)/π       50   0.59            1                   1
0.158  α∗ < α               36   0.0004          0.15                0
       p∗(α)(1−(1−π)α)/π    164  0.10            1                   0.49
       pl(α)(1−(1−π)α)/π    192  0.17            1                   0.99
       p∗(αγ)(1−h̄α)/π       288  0.44            1                   1
       pl(αγ)(1−h̄α)/π       324  0.53            1                   1
B Selected R Code
B.1 Functions for Sample Size and Power Calculations
Here, we provide the R functions used for calculating the quantities described in sec-
tion 2.1, the power estimators described in section 2.2, and functions used for sam-
ple size calculation. These functions form the back-end of the Shiny application at
https://wsvanloon.shinyapps.io/bhpower.
B.1.1 Log-factorial function
logfactorial <- function(x) {
  if (x == 0) {
    y <- 0
  } else {
    y <- sum(log(1:x))
  }
  return(y)
}
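As a quick standalone check (a sketch added here, not part of the application back-end), logfactorial(x) should agree with base R's lgamma(x + 1), since log(x!) = lgamma(x + 1). The function is repeated so the chunk runs on its own.

```r
# Sketch: verify logfactorial() against lgamma(x + 1), using log(x!) = lgamma(x + 1).
# (Function repeated from above so this chunk is self-contained.)
logfactorial <- function(x) {
  if (x == 0) {
    y <- 0
  } else {
    y <- sum(log(1:x))
  }
  return(y)
}

stopifnot(all.equal(logfactorial(0), lgamma(1)))      # log(0!) = 0
stopifnot(all.equal(logfactorial(10), lgamma(11)))    # log(10!)
stopifnot(all.equal(logfactorial(500), lgamma(501)))  # stable where 500! itself overflows
```

In practice one could simply call lgamma(x + 1); the explicit sum is kept for transparency.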
B.1.2 Function for calculating limits to evaluate F′(0)
rho.limit <- function(n, c=NULL, s=NULL, tol=1e-10, delta=NULL,
                      maxit=10000, type=c("t1", "t2", "F"), ngroups=2) {
  if (length(type) != 1) {
    stop("Invalid type. Must be one of: t1, t2, F.")
  }
  if (!(type %in% c("t1", "t2", "F"))) {
    stop("Invalid type. Must be one of: t1, t2, F.")
  }
  if (is.null(delta)) {
    delta <- sqrt(n)*c/s
  }
  s0 <- -1000
  s1 <- 0
  k <- 0
  # One-sided t-tests (one- or two-sample)
  if (type == "t1" | type == "t2") {
    if (type == "t2") {
      n <- n - 1
    }
    while (abs(s0 - s1) > tol & k <= maxit) {
      s0 <- s1
      s <- exp(lgamma((n + k)/2) + k*log(sqrt(2)*delta) -
               logfactorial(k) - lgamma(n/2))
      s1 <- s0 + s
      k <- k + 1
      if (s1 == Inf) {
        message("Warning: Limit too large, returning Inf.")
        break
      }
    }
    if (k == maxit) {
      message("Warning: Maximum number of iterations reached. Series did not converge.")
    }
    y <- exp(-delta^2/2)*s1
  }
  # F-tests (or two-sided, two-sample t-tests)
  if (type == "F") {
    df1 <- ngroups - 1
    df2 <- n - ngroups
    delta <- delta^2
    while (abs(s0 - s1) > tol & k <= maxit) {
      s0 <- s1
      s <- exp(k*log(delta) - k*log(2) - logfactorial(k) -
               lbeta(df1/2 + k, df2/2))
      s1 <- s0 + s
      k <- k + 1
      if (s1 == Inf) {
        message("Warning: Limit too large, returning Inf.")
        break
      }
    }
    if (k == maxit) {
      message("Warning: Maximum number of iterations reached. Series did not converge.")
    }
    y <- exp(-delta/2)*beta(df1/2, df2/2)*s1
  }
  return(y)
}
B.1.3 Function for calculating α∗
alpha.star <- function(n, c=NULL, s=NULL, tol=1e-10, delta=NULL, p,
                       type=c("t1", "t2", "F"), ngroups=2) {
  a <- 1 / (1 - p + p*rho.limit(n, c=c, s=s, tol=tol, delta=delta,
                                type=type, ngroups=ngroups))
  return(a)
}
B.1.4 Function for evaluating F(u)
cumulative.p <- function(u, n, c=NULL, s=NULL, delta=NULL, p,
                         type=c("t1", "t2", "F"), ngroups=2) {
  if (length(type) != 1) {
    stop("Invalid type. Must be one of: t1, t2, F.")
  }
  if (!(type %in% c("t1", "t2", "F"))) {
    stop("Invalid type. Must be one of: t1, t2, F.")
  }
  if (is.null(delta)) {
    delta <- sqrt(n)*c/s
  }
  # One-sided t-tests (one- or two-sample)
  if (type == "t1" | type == "t2") {
    if (type == "t2") {
      n <- n - 1
    }
    x <- qt(p=(1 - u), df=(n - 1))
    Fu <- (1 - p)*u + p*(1 - pt(q=x, df=(n - 1), ncp=delta))
  }
  if (type == "F") {
    x <- qf(p=(1 - u), df1=(ngroups - 1), df2=(n - ngroups))
    Fu <- (1 - p)*u + p*(1 - pf(q=x, df1=(ngroups - 1),
                                df2=(n - ngroups), ncp=(delta^2)))
  }
  return(Fu)
}
B.1.5 Functions for evaluating F′(u)
rho <- function(x, n, c, s, type=c("t1", "t2", "F"), ngroups=2) {
  if (type == "t1") {
    dfs <- n - 1
    delta <- sqrt(n)*c/s
    y <- exp(dt(x, dfs, delta, log=TRUE) - dt(x, dfs, log=TRUE))
  } else if (type == "t2") {
    dfs <- n - 2
    delta <- sqrt(n)*c/s
    y <- exp(dt(x, dfs, delta, log=TRUE) - dt(x, dfs, log=TRUE))
  } else if (type == "F") {
    dfs <- c(ngroups - 1, n - ngroups)
    delta <- (sqrt(n)*c/s)^2
    y <- exp(df(x, df1=dfs[1], df2=dfs[2], ncp=delta, log=TRUE) -
             df(x, df1=dfs[1], df2=dfs[2], log=TRUE))
  }
  return(y)
}

Fprime <- function(u, n, c, s, p, type=c("t1", "t2", "F"), ngroups=2) {
  if (type == "t1") {
    x <- qt(1 - u, df=n - 1)
  }
  if (type == "t2") {
    x <- qt(1 - u, df=n - 2)
  }
  if (type == "F") {
    x <- qf(1 - u, df1=ngroups - 1, df2=n - ngroups)
  }
  y <- 1 - p + p*rho(x=x, n=n, c=c, s=s, type=type, ngroups=ngroups)
  return(y)
}
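The density ratio computed by rho() is exactly the derivative term of the mixture CDF: with F(u) = (1 − π)u + π(1 − G(Q(1 − u))), where Q is the central quantile function and f, G, g the central density and noncentral CDF and density, one gets F′(u) = 1 − π + π·g(x)/f(x) at x = Q(1 − u). The following standalone sketch verifies this against a numerical derivative for the one-sided one-sample t case; the values n = 30, θ = 0.4, and π = 0.1 are illustrative, not taken from the simulations.

```r
# Sketch: verify the analytic F'(u) (density ratio form) against a numerical
# derivative of F(u). Values for n, theta, and pi are illustrative.
n <- 30; theta <- 0.4; p <- 0.1
delta <- sqrt(n)*theta  # noncentrality parameter

# Mixture CDF of the p-values for a one-sided one-sample t-test
F.mix <- function(u) {
  x <- qt(1 - u, df = n - 1)
  (1 - p)*u + p*(1 - pt(x, df = n - 1, ncp = delta))
}

# Analytic derivative: 1 - pi + pi * g(x)/f(x), as in rho()/Fprime() above
Fprime.analytic <- function(u) {
  x <- qt(1 - u, df = n - 1)
  1 - p + p*exp(dt(x, n - 1, delta, log = TRUE) - dt(x, n - 1, log = TRUE))
}

u0 <- 0.05
h <- 1e-4
num.deriv <- (F.mix(u0 + h) - F.mix(u0 - h))/(2*h)  # central difference
# num.deriv and Fprime.analytic(u0) should agree closely
```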
B.1.6 Function for evaluating the inverse of F(u)
cumulative.p.inverse <- function(x, n, c=NULL, s=NULL, delta=NULL, p,
                                 type=c("t1", "t2", "F"), ngroups=2) {
  inv <- function(u, x, n, c, s, delta, p, type, ngroups) {
    abs(cumulative.p(u, n, c, s, delta, p, type, ngroups) - x)
  }
  optimize(inv, c(0, 1), x=x, n=n, c=c, s=s, delta=delta, p=p,
           type=type, ngroups=ngroups, tol=0.001)$minimum
}
B.1.7 Function for calculating u∗
u.star <- function(alpha, precision=1e-10, n, c=NULL, s=NULL,
                   delta=NULL, p, type=c("t1", "t2", "F"), ngroups=2) {
  u <- 1
  for (i in 1:10000) {
    u[i + 1] <- alpha*cumulative.p(u[i], n=n, c=c, s=s, delta=delta,
                                   p=p, type=type, ngroups=ngroups)
    if (abs(u[i] - u[i + 1]) < precision) {
      break
    }
  }
  if (length(u) == 10001) {
    message("Warning: maximum number of iterations reached.")
  }
  ustar <- tail(u, 1)
  return(ustar)
}
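The loop in u.star() is the fixed-point iteration u ← αF(u), started at u = 1 so that it converges to the largest fixed point u∗. A minimal self-contained sketch for the one-sided one-sample t case (illustrative values α = 0.1, n = 50, θ = 0.436, π = 0.1; the mixture CDF is written inline rather than calling cumulative.p()):

```r
# Minimal sketch of the fixed-point iteration u <- alpha * F(u) for the
# one-sided one-sample t case; all parameter values are illustrative.
F.mix <- function(u, n, delta, p) {
  x <- qt(1 - u, df = n - 1)                 # rejection threshold on the t scale
  (1 - p)*u + p*(1 - pt(x, df = n - 1, ncp = delta))
}

alpha <- 0.1; n <- 50; p <- 0.1
delta <- sqrt(n)*0.436                       # noncentrality parameter

u <- 1                                       # start above the largest fixed point
repeat {
  u.new <- alpha*F.mix(u, n, delta, p)
  if (abs(u - u.new) < 1e-10) break
  u <- u.new
}
ustar <- u.new                               # largest solution of u = alpha * F(u)
pstar <- ustar/alpha                         # asymptotic proportion rejected
```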
B.1.8 Function for calculating p∗
p.star <- function(alpha, n, m, theta, pi1, type=c("t1", "t2", "F"),
                   ngroups=2) {
  pstar <- u.star(alpha=alpha, n=n, c=theta, s=1, p=pi1,
                  type=type, ngroups=ngroups)/alpha
  return(pstar)
}
B.1.9 Function for calculating pl
p.l <- function(alpha, n, m, theta, pi1, type=c("t1", "t2", "F"),
                ngroups=2) {
  ustar <- u.star(alpha=alpha, n=n, c=theta, s=1, p=pi1,
                  type=type, ngroups=ngroups)
  pstar <- ustar/alpha
  Fp <- Fprime(u=ustar, n=n, c=theta, s=1, p=pi1, type=type,
               ngroups=ngroups)
  pb <- sqrt(2*pstar*(1 - pstar))*sqrt(m*log(log(m))) /
        (m*(1 - alpha*Fp))
  pl <- pstar - pb
  return(pl)
}
B.1.10 Function for calculating h
h.bar <- function(alpha, n, theta, pi1, type=c("t1", "t2", "F"),
                  ngroups=2) {
  x.seq <- seq(0, cumulative.p(u=alpha, n=n, c=theta, s=1, p=pi1,
                               type=type, ngroups=ngroups) - 0.001, 0.001)
  p.inv <- sapply(x.seq, function(x) cumulative.p.inverse(x, n=n, c=theta,
                  s=1, p=pi1, type=type, ngroups=ngroups))
  h.seq <- (x.seq*alpha - alpha) / (p.inv - alpha)
  return(min(h.seq))
}
B.1.11 Function for calculating power estimates
pow <- function(alpha, gamma=1, ec=FALSE, n, m, theta, pi1, pi0,
                p_func, type=c("t1", "t2", "F"), ngroups=2, adaptive) {
  if (class(pi0) == "function") {
    pi0 <- pi0(alpha=gamma, n=n, theta=theta, pi1=pi1, type=type,
               ngroups=ngroups)
  }
  if (adaptive) {
    alpha <- alpha/pi0
  }
  if (ec) {
    p <- p_func(alpha=alpha*gamma, n=n, m=m, theta=theta, pi1=pi1,
                type=type, ngroups=ngroups)
  } else {
    p <- p_func(alpha=alpha, n=n, m=m, theta=theta, pi1=pi1,
                type=type, ngroups=ngroups)
  }
  pwr <- p*(1 - pi0*alpha)/pi1
  return(pwr)
}
B.1.12 Function for calculating sample sizes based on the criticality phenomenon
calc.crit <- function(alpha, gamma=1, ec=FALSE, maxn, theta, pi1,
                      type=c("t1", "t2", "F"), ngroups=2, adaptive) {
  if (type == "t1") {
    fc <- 2
  } else {
    fc <- 1
  }
  n <- 4/fc
  if (adaptive) {
    alpha <- alpha/(1 - pi1)
  }
  if (ec) {
    alpha <- alpha*gamma
  }
  astar <- alpha.star(n=n, c=theta, s=1, delta=NULL, p=pi1,
                      type=type, ngroups=ngroups)
  nseq <- n
  aseq <- astar
  while (astar >= alpha & n < maxn) {
    n <- n + 2/fc
    astar <- alpha.star(n=n, c=theta, s=1, delta=NULL, p=pi1,
                        type=type, ngroups=ngroups)
    nseq <- c(nseq, n)
    aseq <- c(aseq, astar)
  }
  pstar <- p.star(alpha=alpha, n=n, m=NULL, theta=theta, pi1=pi1,
                  type=type, ngroups=ngroups)
  out <- list(n=n, pstar=pstar, nseq=nseq, aseq=aseq)
  return(out)
}
B.1.13 Function for calculating sample sizes based on the average power
calc.n <- function(lvl, alpha, gamma=1, ec=FALSE, maxn, m, theta,
                   pi1, pi0, p_func, type=c("t1", "t2", "F"),
                   ngroups=2, adaptive) {
  if (type == "t1") {
    fc <- 2
  } else {
    fc <- 1
  }
  n <- 4/fc
  pwr <- pow(alpha=alpha, gamma=gamma, ec=ec, n=n, m=m, theta=theta,
             pi1=pi1, pi0=pi0, p_func=p_func, type=type,
             ngroups=ngroups, adaptive=adaptive)
  nseq <- n
  pseq <- pwr
  while (pwr < lvl & n < maxn) {
    n <- n + 2/fc
    pwr <- pow(alpha=alpha, gamma=gamma, ec=ec, n=n, m=m, theta=theta,
               pi1=pi1, pi0=pi0, p_func=p_func, type=type,
               ngroups=ngroups, adaptive=adaptive)
    nseq <- c(nseq, n)
    pseq <- c(pseq, pwr)
  }
  out <- list(n=n, power=pwr, nseq=nseq, pseq=pseq)
  return(out)
}
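calc.n() steps the sample size upward until the estimated power first reaches the desired level. The same search pattern can be illustrated in a self-contained way for a single two-sample t-test using base R's power.t.test(), and cross-checked against power.t.test's own sample size solver. The function find.n and all values below are illustrative, not part of the thesis code.

```r
# Sketch: the same step-up sample size search for a single two-sample t-test,
# using base R's power.t.test(). find.n and all values are illustrative.
find.n <- function(lvl, delta, sig.level, maxn = 10000) {
  n <- 2  # per-group sample size
  pwr <- power.t.test(n = n, delta = delta, sig.level = sig.level)$power
  while (pwr < lvl && n < maxn) {
    n <- n + 1
    pwr <- power.t.test(n = n, delta = delta, sig.level = sig.level)$power
  }
  list(n = n, power = pwr)
}

res <- find.n(lvl = 0.8, delta = 0.5, sig.level = 0.05)
# Cross-check against power.t.test's own solver, which returns a fractional n:
n.exact <- power.t.test(power = 0.8, delta = 0.5, sig.level = 0.05)$n
```

The stepping search should stop at the smallest integer n whose power meets the target, i.e. at ceiling(n.exact).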
B.2 Code for Simulation Study 2: Equicorrelation
Here, as an example, we provide the code used for assessing the effect of equicorrelation.
The other simulation studies performed are of a similar form.
B.2.1 Function for generating correlated features
equicorrelate <- function(n, m, r) {
  r <- sqrt(r)
  x <- rnorm(n)
  x <- matrix(rep(x, m), nrow=n)
  e <- matrix(rnorm(n*m, mean=0, sd=sqrt(1 - r^2)), n, m)
  features <- r*x + e
  return(features)
}
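Because each feature is √r times a shared standard normal factor plus independent noise with variance 1 − r, every feature has unit variance and every pair of features has correlation r. A standalone sketch checking this empirically (the function is repeated so the chunk runs on its own; n = 5000, m = 20, and r = 0.5 are illustrative):

```r
# Sketch: features from equicorrelate() should have pairwise correlation close
# to r. (Function repeated from above so this chunk is self-contained.)
equicorrelate <- function(n, m, r) {
  r <- sqrt(r)
  x <- rnorm(n)
  x <- matrix(rep(x, m), nrow = n)
  e <- matrix(rnorm(n*m, mean = 0, sd = sqrt(1 - r^2)), n, m)
  r*x + e
}

set.seed(1)
feat <- equicorrelate(n = 5000, m = 20, r = 0.5)
cors <- cor(feat)
mean.offdiag <- mean(cors[upper.tri(cors)])  # should be close to r = 0.5
```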
B.2.2 Function for simulating two-sample t-tests
sim.twosamp <- function(n, m, p, theta, equicor=0, side) {
  # Create correlated, normally distributed features
  x <- equicorrelate(n=n, m=m, r=equicor)
  # Split the data into two parts
  x1 <- x[1:floor(n/2), ]
  x2 <- x[(floor(n/2) + 1):n, ]
  n1 <- nrow(x1)
  n2 <- nrow(x2)
  # Take the sample means for each group and calculate the mean differences
  xbars1 <- colMeans(x1)
  xbars2 <- colMeans(x2)
  xdiff <- (xbars1 - xbars2)
  # Shift the mean differences for the false nulls
  mean.shift <- theta*sqrt(n)*1*sqrt(1/n1 + 1/n2)
  if (side == "left") {
    u <- runif(m)
    mean.diff <- xdiff*(u > p) + (xdiff - mean.shift)*(u < p)
  } else if (side == "right") {
    u <- runif(m)
    mean.diff <- xdiff*(u > p) + (xdiff + mean.shift)*(u < p)
  } else if (side == "both") {
    u <- runif(m)
    u2 <- runif(m)
    mean.diff <- xdiff*(u > p) + (xdiff + mean.shift)*(u < p)*(u2 < 0.5) +
      (xdiff - mean.shift)*(u < p)*(u2 > 0.5)
  }
  # Calculate the pooled standard deviation
  xs1 <- apply(x1, 2, var)
  xs2 <- apply(x2, 2, var)
  pooled.sd <- sqrt(((n1 - 1)*xs1 + (n2 - 1)*xs2)/(n - 2))
  # Transform to two-sample t-statistics
  t.stats <- mean.diff/(pooled.sd*sqrt(1/n1 + 1/n2))
  # Calculate p-values
  if (side == "left") {
    p.vals <- pt(t.stats, df=n - 2)
  } else if (side == "right") {
    p.vals <- 1 - pt(t.stats, df=n - 2)
  } else if (side == "both") {
    p.vals <- 2*(1 - pt(abs(t.stats), df=n - 2))
  }
  # True feature indicator
  f.true <- as.numeric(u < p)
  # Return a list
  out <- list(pvals=p.vals, ftrue=f.true)
  return(out)
}
B.2.3 Function for power simulations
simpower.cor <- function(n, m, p, theta, alpha, equicor=0, side) {
  dat <- sim.twosamp(n=n, m=m, p=p, theta=theta, equicor=equicor, side=side)
  p.vals <- dat$pvals
  sorted.pvals <- sort(p.vals)
  ftrue <- dat$ftrue
  indices <- which(sorted.pvals <= alpha*(1:m)/m)
  if (length(indices) > 0) {
    max.index <- max(indices)
  } else {
    max.index <- 0
  }
  p.hat <- max.index/m
  n.rej <- max.index
  if (n.rej == 0) {
    c.rej <- 0
    c.rej.prop <- 0
    u.hat <- 0
  } else {
    u.hat <- sorted.pvals[max.index]
    c.rej <- sum(ftrue[order(p.vals)] == 1 & sorted.pvals <= u.hat)
    c.rej.prop <- c.rej/sum(ftrue)
  }
  out <- c(u.hat, p.hat, n.rej, c.rej, c.rej.prop)
  return(out)
}
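The rejection step inside simpower.cor() is the Benjamini-Hochberg step-up rule: reject the k smallest p-values, where k is the largest index with p(k) ≤ αk/m. As a standalone cross-check, this count should always equal the number of BH-adjusted p-values at or below α as returned by base R's p.adjust(); the simulated p-values below are illustrative (900 uniform nulls plus 100 Beta(1, 20)-distributed "signal" p-values).

```r
# Sketch: cross-check the step-up rejection count against p.adjust(..., "BH").
# The simulated p-values are illustrative only.
set.seed(123)
m <- 1000; alpha <- 0.1
p.vals <- c(runif(m - 100), rbeta(100, 1, 20))

# Step-up rule as used in simpower.cor()
sorted.pvals <- sort(p.vals)
indices <- which(sorted.pvals <= alpha*(1:m)/m)
n.rej.stepup <- if (length(indices) > 0) max(indices) else 0

# Equivalent formulation via BH-adjusted p-values
n.rej.padjust <- sum(p.adjust(p.vals, method = "BH") <= alpha)
# n.rej.stepup and n.rej.padjust are equal by construction of the BH adjustment
```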
B.2.4 Simulation script
## Preliminaries:
rm(list=ls())
options(warn=1)
set.seed(1402)

## Defining experimental factors:
theta <- 0.436
m <- 1000
p <- 0.1
alpha <- 0.1
n.seq <- c(seq(4, 20, 2), seq(24, 40, 4), seq(50, 100, 10))
cor.seq <- seq(0, 0.9, 0.1)
niter <- 1000

## Initializing objects
results.cor <- list()

## Start timer
timer0 <- proc.time()[3]

## Performing simulations
for (j in 1:length(cor.seq)) {
  phat <- c()
  pow <- c()
  ppow <- c()
  pfdr <- c()
  fdr <- c()
  pstar <- c()
  psd <- c()
  for (i in 1:length(n.seq)) {
    # Progress notifier
    cat("Computing rho =", cor.seq[j], "and n =", n.seq[i], "\n")
    # Simulate results
    raw.results <- replicate(niter, simpower.cor(n=n.seq[i], m=m, p=p,
      theta=theta, alpha=alpha, equicor=cor.seq[j], side="both"))
    # Calculate mean proportion rejected E(Rm/m) and its standard deviation
    phat[i] <- mean(raw.results[2, ])
    psd[i] <- sd(raw.results[2, ])
    # Calculate power/positive power
    pow.vec <- raw.results[5, ]
    ppow[i] <- mean(pow.vec, na.rm=TRUE)
    pow.vec[is.nan(pow.vec)] <- 0
    pow[i] <- mean(pow.vec)
    # Calculate fdr/positive fdr
    fdr.vec <- 1 - raw.results[4, ]/raw.results[3, ]
    pfdr[i] <- mean(fdr.vec, na.rm=TRUE)
    fdr.vec[is.nan(fdr.vec)] <- 0
    fdr[i] <- mean(fdr.vec)
    # Calculate the asymptotic proportion rejected p*
    pstar[i] <- u.star(alpha=alpha, n=n.seq[i], c=theta, s=1, p=p,
      type="F")/alpha
  }
  results.cor[[j]] <- list(phat=phat, pow=pow, ppow=ppow, pfdr=pfdr,
    fdr=fdr, pstar=pstar, psd=psd)
}

## Stop timer
timer1 <- proc.time()[3]
timer <- round(timer1 - timer0)
seconds <- timer %% 60
minutes <- (timer %/% 60) %% 60
hours <- (timer %/% 60) %/% 60
cat("The simulation took", hours, "hour(s),", minutes,
    "minute(s), and", seconds, "second(s) to complete.")