beyond the traditional simulation design

49
Beyond the traditional simulation design for evaluating type 1 error: from ‘theoretical’ to ‘empirical’ null by Ting Zhang A thesis submitted in conformity with the requirements for the degree of Master of Science Department of Public Health Sciences University of Toronto c Copyright 2018 by Ting Zhang

Upload: others

Post on 02-Feb-2022

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Beyond the traditional simulation design

Beyond the traditional simulation designfor evaluating type 1 error: from ‘theoretical’ to ‘empirical’ null

by

Ting Zhang

A thesis submitted in conformity with the requirements for the degree of Master of Science

Department of Public Health Sciences University of Toronto

c© Copyright 2018 by Ting Zhang

Page 2: Beyond the traditional simulation design

Abstract

Beyond the traditional simulation design

for evaluating type 1 error: from ‘theoretical’ to ‘empirical’ null

Ting Zhang

Master of Science

Department of Public Health Sciences

University of Toronto

2018

When evaluating a new statistical test, its important to check the type 1 error (T1E) control, often

achieved by the ‘theoretical’ design S0. In whole-genome association analyses, people scan through

large numbers of genes (Gs) for the ones associated with an outcome(Y). Y comes from an unknown

alternative and is not associated with the majority of Gs, This reality can be represented by two

‘empirical designs, where S1.1 simulates Y from G then evaluates its association with independent

Gnew; while S1.2 evaluates the association between permutated Yperm and G. Using scale tests,

location tests with single and multiple Gs as examples, we show that not all designs are equal. For

certain statistics, T1E inflation can only be revealed under 1empirical designs and doesnt diminish

as sample size increases. This important observation calls for new practices for methods evaluation

and interpretation of T1E control.

ii

Page 3: Beyond the traditional simulation design

Acknowledgements

The author have no conflict of interest to declare.

The authors would like to thank Dr. Lei Sun for suggesting the topic of this thesis and her kind

supervision.

The author would like to thank Dr. David Soave and Dr. Jerry Lawless for helpful discussions;

and the PhD candidates in Biostatistics, University of Toronto, Bo Chen, Kuan Liu and Kaviul

Anam for assistance and support in this research.

The author would like to thank my dearest friend Xi Luo and my family. Without their love and

caring, I could never have gone this far.

This research was funded by the Natural Sciences and Engineering Research Council of Canada

(NSERC, 250053-2013), and the Canadian Institutes of Health Research (CIHR, 201309MOP-

310732-G-CEAA-117978) to LS.

iii

Page 4: Beyond the traditional simulation design

Contents

Preliminary

Acknowledgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

1 Introduction 1

2 Methods and Materials 6

2.1 Three GxE interaction scenarios and corresponding statistical tests . . . . . . . . . . 6

2.1.1 Scenario 1: single G, and E missing . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.2 Scenario 2: single SNP, E is known . . . . . . . . . . . . . . . . . . . . . . . . 8

2.1.3 Scenario 3: multiple Gs, and E known . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Three T1E simulation designs: the ‘theoretical’ null S0, and the ‘empirical’ null S1.1

and S1.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.1 Data-generating models for the three T1E simulation designs and under the

three interaction scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Simulation Studies 14

3.1 Summary of Simulation models and parameter values . . . . . . . . . . . . . . . . . 14

3.2 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4 Discussion 24

Bibliography 26

iv

Page 5: Beyond the traditional simulation design

5 Appendix 31

5.1 Conditional and Marginal Distribution of Y under Alternative Model . . . . . . . . . 31

5.2 Asymptotic Distribution of Likelihood Ratio Statistics for Variance under Non-normality 32

5.3 Score Test on GxE Interaction Scan . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.3.1 Dependency of Test Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.3.2 Under Non-Normality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.4 Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

v

Page 6: Beyond the traditional simulation design

List of Tables

3.1 Summary of the two data-generating models for indirect and direct GxE interaction

testing, and evaluating the ‘theoretical’ null simulation designs S0 vs. the two the

‘empirical’ null simulation designs S1.1 and S1.2, as described in Sections 2.3.1 and

2.3.2. Assume E was available for direct GxE testing, the Aschard et al. (2013) model

coincides with Model I of Rao and Province (2016), except E was Gnon−repeating. T1E

rate is first estimated from nrep.in simulation replicates in an inner loop in which

E is fixed (similar to one whole-genome GxE interaction scan), then averaged over

nrep.out simulation replicates in an outer loop in which E varies. . . . . . . . . . . 15

3.2 Summary of the data-generating models for multivariate GxE interaction testing,

and evaluating the ‘theoretical’ null simulation designs S0 vs. the two the ‘empirical’

null simulation designs S1.1 and S1.2, as described in Section 2.3.3. T1E rate is first

estimated from nrep.in simulation replicates in an inner loop in which E is fixed

(similar to one whole-genome gene-based GxE interaction scan), then averaged over

nrep.out simulation replicates in an outer loop in which E varies. . . . . . . . . . . 16

3.3 Simulation results of interaction scenario 1: single G, and E missing. Empirical T1E

rates of LRTm and Scorem location tests for mean difference in Y across the three

G groups, and of LRTv and Levene scale tests for variance difference in Y , based on

the ‘theoretical’ null design of S0 and the alternative ‘empirical’ null designs of S1.1

and S1.2. The alternative Y1 data were generated using the Aschard’s genetic model

as described in Table 3.1. Empirical T1E rates outside α ± 3√α× (1− α)/nrep.in

are bolded in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

vi

Page 7: Beyond the traditional simulation design

3.4 Simulation results of interaction scenario 1: single G, and E missing; effect of the

nominal α level. Empirical T1E rates of LRTm and Scorem location tests for mean

difference in Y across the three G groups, and of LRTv and Levene scale tests for

variance difference in Y , based on the ‘theoretical’ null design of S0 and the al-

ternative ‘empirical’ null designs of S1.1 and S1.2. The alternative Y1 data were

generated using the Aschard’s genetic model as described in Table 3.1, focusing on

the extreme case of large interaction effect, βGE = 1. Empirical T1E rates outside

α± 3√α× (1− α)/nrep.in are boded in red. . . . . . . . . . . . . . . . . . . . . . 18

3.5 Simulation results of interaction scenario 1: single G, and E missing; effect of sample

size n. Empirical T1E rates of LRTm and Scorem location tests for mean difference

in Y across the three G groups, and of LRTv and Levene scale tests for variance

difference in Y , based on the ‘theoretical’ null design of S0 and the alternative ‘em-

pirical’ null designs of S1.1 and S1.2. The alternative Y1 data were generated using

the Cao’s genetic model as described in Table 3.1, and using two difference sample

sizes of n = 103 and 104. Empirical T1E rates outside α ± 3√α× (1− α)/nrep.in

are bolded in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.6 Simulation results of interaction scenario 2: single G, and E known. Empirical T1E

rates of the LRTGE and ScoreGE tests, based on the ‘theoretical’ null design of S0 and

the alternative ‘empirical’ null designs of S1.1 and S1.2. The alternative Y1 data were

generated using the Aschard’s genetic model as described in Table 3.1 when βGE = 1,

but E was assumed to be known in this case and direction interaction testing was

possible. Empirical T1E rates outside α± 3√α× (1− α)/nrep.in are bolded in red. 22

3.7 Simulation results of interaction scenario 3: multiple Gs, and E known. Empirical

T1E rates of the FGE , SKATGE and BurdenGE tests for jointly testing for multiple

interaction effects, based on the ‘theoretical’ null design of S0 and the alternative

‘empirical’ null designs of S1.1 and S1.2. Models and parameters values are given in

Table 3.2. Empirical T1E rates outside α± 3√α× (1− α)/nrep.in are bolded in red. 23

vii

Page 8: Beyond the traditional simulation design

List of Figures

S1 Comparison of the asymptotic distribution (black solid) and finite-sample distribution

(red dashed) of LRTv under the ‘empirical’ null S1.1 or S1.2, with the asymptotic

distribution (χ22, blue dot-dashed) of LRTv under the ‘theoretical’ null S0. Vertical

lines correspond the 99.9% quantile cutoffs for α = 0.001. . . . . . . . . . . . . . . . 20

S1 Marginal and conditional histogram (Top), and normal Q-Q plot (Bottom) of the

empirical phenotype data Y1 when βGE = 1 based on the Aschard’s model as described

in Table 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

S2 Marginal and conditional histogram (Top), and normal Q-Q plot (Bottom) of the em-

pirical phenotype data Y1 when βGE = 0.2 based on the Aschard’s model as described

in Table 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

S3 Histogram of p-values of the LRTv test under S0 (Left column), S1.1 (Middle col-

umn) and S1.2 (Right column) null simulation designs. Simulation of the empirical

phenotype data Y1 was based on the Aschard’s genetic model as described in Table 1

with βGE = 0 (Top row), βGE = 0.2 (Top row), and βGE = 0.6 (Top row). Red line

represents the Uniform (0, 1) distribution. . . . . . . . . . . . . . . . . . . . . . . . . 37

S4 Histograms with theoretical normal distribution (Top row), and normal QQ-plots

(Bottom row) of phenotype Y1 simulated based on the Aschard’s genetic model as

described in Table 2 with βGE = 1 for before transformation (Left column), after

square root (Middle column), and after log transformations (Right column). . . . . . 38

viii

Page 9: Beyond the traditional simulation design

S5 Histogram of p-values of the LRTv test under the S0 (Left column), S1.1 (Middle

column) and S1.2 (Right column) null simulation designs, when phenotype Y1 was

simulated based on the Aschard’s genetic model as described in Table 1 with βGE = 1

for before transformation (Top row), after square root transformation (Middle row),

and after log transformation (Bottom row). . . . . . . . . . . . . . . . . . . . . . . . 39

S6 Scatter plot of nrep.out = 100 estimated T1E rates for whole-genome interaction

GxE scans, each estimated from nrep.in = 105 simulated replicates (or SNPs) with

a fixed E, under the ‘theoretical’ null design S0 as described in Table 1, for sample

size n = 104 (Left) and 105 (Right). . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

ix

Page 10: Beyond the traditional simulation design

Chapter 1

Introduction

Type 1 error (T1E) control evaluation using simulations is always the first step in understanding

the performance of any newly developed statistical test. A test statistics is considered as robust

when it is applicable to a wide range of probability distributions and maintains a correct type 1

error. To formulate the problem more precisely, let us consider the current large-scale genome-wide

association studies (GWAS) or next-generation sequencing (NGS) studies of complex and heritable

traits. A typical GWAS compares two large groups of individuals, one healthy control group and one

case group affected by a disease. And it then scan through millions or more genetic markers (Gs)

across the genome, one at a time, for the ones associated with the disease trait (Y ), while accounting

for covariates. And for Y -G association tests, type 1 error is made when a test claims statistical

significant association between independent Y -G pair, or more technically, p-value being greater

than a threshold value α and null hypothesis of independence being rejected. And if the proportion

of type 1 errors in a large numbers of tests on simulated Y -G pairs equals to the nominal level α,

we say the test maintain a correct T1E control. Many Y -G association tests have been developed

and they often require the assumption of (approximately) normally distributed errors. However,

independence between Y -G does not restrict the distribution of Y to be normal, and among all these

tests, some are more robust to non-normality than others. For example, Bartlett test for variance

heterogeneity has been shown to have large inflated T1E rates when the error term e follows a t-

or χ2-distribution (Struchalin et al. 2010), and the likelihood ratio test (LRT) is similarly sensitive

(Cao et al. 2014), while Levene’s test appears to be more robust (Soave, Corvol, et al. 2015; Soave

and L. Sun 2017).

1

Page 11: Beyond the traditional simulation design

Chapter 1. Introduction 2

Standard T1E simulation design, denoted as S0, generates phenotype data Y0 = e under the ‘the-

oretical’ null model of no association, then independently generates genotype data G and estimates

the empirical T1E of a certain test statistics for H0 : βG = 0 from a fitted model Y0 = bGG; for

notation simplicity and without loss of generality, intercept and additional covariates Zs are omitted

from the conceptual expression of the regression model. A method is generally considered sound

if T1E is well controlled under the e ∼ N(0, σ2) assumption, and robustness is then evaluated by

assuming other distribution forms for e, typically t or χ2 distribution. Given statistical accuracy of

T1E control, statistical efficiency in terms of power will be studied by generating phenotype under

an alternative, Y1 = G + e and e ∼ N(0, σ2). The sensitivity problems mentioned before can be

revealed by altering error term distribution in T1E evaluation. But it would not match the power

estimation settings, where a normal error is often assumed.

In practice, GWAS and NGS receive one empirical Y that is truly associated with one or more

Gs among all candidates. And the rest, large numbers of Gs are under the null hypothesis of no

association. When examining these independent Y -G pairs, the marginal distribution of Y can be

viewed as identical to some distribution under alternative hypothesis. Now consider two alternative

simulation designs to evaluate T1E control that generate Y similar to the real case. Suppose we are

evaluating a test statistic T for Y -G association, design S1.1 simulates Y1 = G+e from an alternative

hypothesis, then it independently generates Gnew and applies the association test T to fitted model

Y1 = bGGnew. T1E rates is calculated from proportion of significant T (bG) at a nominal level α.

Design S1.2 permutes the simulated Y1 and evaluates T1E rate from fitted model Y perm1 = bGG. The

two designs both achieves the goal that (1) Y is from an ‘empirical’ distribution, identical to some

alternative hypothesis; and (2) a null relationship between Y -G pair is maintained in evaluating

T1E rate. Note that Y1 is in contrast to Y0 in distribution both marginally and conditionally , and

the residual in fitted models may or may be normally distributed.

An important question can then be asked as to whether the ‘empirical’ S1.1 and S1.2 designs

lead to similar T1E conclusion as the S0 design for one specific test statistics. In particular, even if

the e ∼ N(0, σ2) assumption is true and the test appears to be accurate based on the S0 evaluation,

do we expect it to perform well in real data which are better represented by the S1.1 and S1.2

simulation designs? The answers would depend on the type of test statistics used.

Efron (2004) has brought up the discussion of the ‘theoretical’ vs. ‘empirical’ null more than a

decade ago. Focusing on controlling the false discovery rate (FDR), Efron (2004) outlined several

Page 12: Beyond the traditional simulation design

Chapter 1. Introduction 3

possible sources of non-normality including unobserved covariates and hidden correlation, that may

cause the FDR under ‘theoretical’ and ‘empirical’ null to be different. He also proposed an em-

pirical Bayes approach to solve the problem and address the importance of choosing a correct null

hypothesis. In this thesis, we study the practical implications of T1E evaluation based on the the

commonly used ‘theoretical’ null simulation design S0 in the context of whole-genome association

scans. We show that while a method may appear to be accurate under S0 assuming normality, it

can have incorrect T1E rates under the ‘empirical’ null of S1.1 or S1.2, also ‘assuming normality’.

The fundamental cause of the discrepancy is that, in evaluating Y1 = bGGnew (or Y perm1 = bGG),

the marginal distribution of Y1 may not be normal even if it was generated with a normal error

term, and the Y1 condition on the true causal G and other covariates is normal. Here, discrepancy

firstly leads to a conclusion that the test statistic is not robust to non-normality, and thus can have

much more false discoveries than expected in real data application. However, instead of focusing on

test statistics level, we hope to address the problem in standard T1E evaluation design. In the big-

ger picture, that many statistics show discrepancy between ‘theoretical’ and ‘empirical’ null designs

shows the flaws of S0 design since its not reflecting the real distribution of Y in data generating

process.

As a proof-of-principle, we will focus on testing the gene-environment(GxE) and gene-gene (GxG)

effects first. Traditional association tests focus on location parameters bGE in the interaction model

Y = bGG+ bEE + bGEGxE(or bGG for GxG interaction). Such interaction effects are expected for

complex traits and we studied three scenarios for the interaction tests.

The first scenario assumes that data on E are not available in practice. Thus, direct GxE inter-

action analysis is not possible. In that case, because the un-modelled interaction induces variance

heterogeneity in Y when conditional only on G, scale tests such as Levene’s test, originally devel-

oped for model diagnostics, can be used to indirectly test for the interaction effect. In contrast to

traditional location tests, which focus on mean difference between groups, scale tests refer to those

focusing on variance differences. We will investigate the several scale tests recently proposed for this

purpose (Pare et al. 2010; Aschard et al. 2013; Cao et al. 2014; Soave, Corvol, et al. 2015). It is

worth-noting that the causes of variance heterogeneity are multifaceted beyond potential interactions

(X. Sun et al. 2013; Dudbridge and Fletcher 2014; Wood et al. 2014). We show that, depending on

the robustness of a test, T1E conclusion may differ between the “theoretical’ and ‘empirical’ null.

The second scenario assumes E was available for direct modelling of the GxE interaction effect.

Page 13: Beyond the traditional simulation design

Chapter 1. Introduction 4

Previously, Voorman et al. (2011) and Rao and Province (2016) showed that T1E rate of testing GxG

in a whole-genome interaction scan can be sensitive to reasons beyond the traditional causes of mean

or variance model mis-specifications. The authors focused on the SNPnon−repeatingxSNPrepeating

analysis, where SNPrepeating represents a fixed SNP of primary interest and one is interested in

studying its interaction with all other SNPs. Statistically, this is similar to GxE that we will be

studying here, because E does not vary between SNPs in a genome-wide GxE interaction scan.

Focusing on inflated or deflated genomic inflation factor λGC (Devlin and Roeder 1999), Rao and

Province (2016) demonstrated a larger variation in λGC (similar to a larger variation in T1E rates

between different whole-genome association scans), when testing the interaction effect as compared

to the main effect under the ‘theoretical’ null. They attributed this to dependence between the

interaction test statistics, because SNPrepeating (or the fixed E in our setting) is not changing

between tests. They also noted that increasing sample size mitigates the problem. Here, we use this

opportunity to revisit direct GxE interaction testing in a GWAS setting. We show that, while T1E

rates are indeed more variable between simulation replicates (i.e. between genome-wide interaction

scans) under the conventional ‘theoretical’ null due to dependency between the tests as in Rao and

Province (2016), the average T1E rate is correct regardless of the sample size. However, under the

‘empirical’ null, a different picture emerges as in the scale test setting.

The third scenario extends the above to a multivariate setting, where the fixed E interacts with

multiple different Gs as in gene-based interaction studies. We will examine SKAT-type of variance

component test (Wu et al. 2011), together with burden-type of sum test (Madsen and Browning

2009) and the classical F-test, that jointly evaluate multiple interaction effects. Generally speaking,

departure from normality is of a less concern when an outcome Y is influenced by multiple genetic

and environmental factors (Falconer 1960, Mackay 2009). However, we will show that the distinction

between the ‘theoretical’ and ‘empirical’ null remains relevant for multivariate models.

In Chapter 2, we first describe three genetic settings in Section 2.1 for testing interaction effect,

including when environmental factor E is unknown; when E is known and interacts with one SNP;

and when E is know and interacts with multiple SNPs. Under each scenario, we briefly review all

the tests being investigated. And in Section 2.2 we introduced three simulation designs S0, S1.1 and

S1.2e, and how they will be applied to each scenario. Following detailed simulation settings and

parameters in Section 3.1, We then provide numerical results from extensive simulation studies in

Section 3.2. Importantly, we note that even if the departure from normality is generally minor in

Page 14: Beyond the traditional simulation design

Chapter 1. Introduction 5

practice and appears to pass standard diagnostic tests for non-normality, the different null simulation

designs can still noticeably affect conclusion regarding T1E control for some tests. Finally, in

Chapter 4, we remark that future T1E evaluation and interpretation should go beyond the traditional

‘theoretical’ null and adopt the alternative ‘empirical’ null simulation designs. All the figures and

tables are included in Chapter 5 and some analytical explanations of the simulation results are

included in Chapter 6.

Page 15: Beyond the traditional simulation design

Chapter 2

Methods and Materials

2.1 Three GxE interaction scenarios and corresponding sta-

tistical tests

For association study of a complex trait Y using a sample of size n, we first define genotype data Gi

for individual i at each SNP under the study. As in tradition, Gi denotes the number of copies of

the minor allele, coded additively as Gi = 0, 1 and 2. And under Hardy-Weinberg equilibrium, Gi

is assumed to come from a binomial distribution, Gi ∼ Binomial(2, 1 − f), where f is minor allele

frequency(MAF).

Consider the simplest case where only one SNP is causal, the true generating model is:

Y = βGG+ βEE + βGEG× E + e, e ∼ N(0, σ2) (2.1)

where βG is the main effect of a causal SNP, βE is the environmental effect, and βGE is the gene-

environment interaction effect. Note that the error term in the true data generating model is

denoted as e and assumed to be normal. We wish to identify the SNP whose genotype G has

non-zero interaction effect βGE .

6

Page 16: Beyond the traditional simulation design

Chapter 2. Methods and Materials 7

2.1.1 Scenario 1: single G, and E missing

Suppose information regarding E was not collected, then the fitted model in real application can

only account for the main effect of G:

Y = bGG+ ε (2.2)

However, it is straightforward to show that variances of Y stratified by the three genotype groups

of G differ if βGE 6= 0:

V ar(Y |G) = (βE + βGEG)2V ar(E) + σ2 = σ2G = V ar(ε). (2.3)

That is, ε in the fitted model of (2.2) can behave quite differently from e, the error term in the

generating model of (2.1). Thus, when E is missing and direct interaction modelling is not feasible,

scale tests for heteroscedasticity can be utilized to identify G associated with variance of Y (Pare

et al. 2010). A joint location-scale testing framework can provide robustness against either βG = 0 or

βGE = 0, and it can improve power if both main and interaction effects are present (Soave, Corvol,

et al. 2015). Here we focus on studying the more sensitive scale tests, because power of the joint

test depends on the individual components.

Different scale tests have been studied in this context, and chief among them are the Levene’s

test (Levene et al. 1960) considered by Pare et al. (2010) and Soave, Corvol, et al. (2015), and

the likelihood ratio test considered by Cao et al. (2014). Levene’s test for variance heterogeneity

between k groups is an ANOVA of the absolute deviation of each observation yi from its group mean

or median. The resulting test statistic Levene follows a F(k−1, n−k) distribution under normality,

and it is asymptotically χ2k−1/(k − 1) distributed; k = 3 in our case. Using median instead of

mean to measure the spread within each group has been proved to be more robust to non-normality,

particularly for t-distributed or skewed data (Brown and Forsythe 1974; Soave and L. Sun 2017).

And we will be using the median version of Levene in this thesis.

The variance likelihood ratio test (LRTv) considered by Cao et al. (2014) contrasts the null model

of no variance difference with the alternative model,

Y = βGG+ e, e ∼ N(0, σ2) vs. Y = βGG+ e, e ∼ N(0, σ2G), (2.4)

and conduct the corresponding LRT for H0 : σ2G=0 = σ2

G=1 = σ2G=2. The corresponding test statistic

Page 17: Beyond the traditional simulation design

Chapter 2. Methods and Materials 8

LRTv is asymptotically χ2(2) distributed. For comparison purpose, we will also examine the likelihood

ratio test(LRTm) and score test (Scorem) for testing the main effect, H0 : bG = 0.

Cao et al. (2014) has pointed out that LRTv is sensitive to the normality assumption, but under

normality they have demonstrated that LRTv has good T1E control. However, they implicitly

assumed that the test would work well as long as the error term e in the phenotype-generating

model is normally distributed, regardless of the ‘theoretical’ null or ‘empirical’ null. The work here

is to show why LRTv may have T1E issue in practice. Indeed, Soave, Corvol, et al. (2015) applied

LRTv to a GWAS of lung function measures in cystic fibrosis subjects. Despite the fact that the lung

measures were approximately normally distributed and permuted prior to the variance association

analysis, the histogram of GWAS p-values clearly showed an increased T1E rate (Figure S2.G of

Soave, Corvol, et al. 2015); the actual application was a joint LRTm and LRTv test but the T1E

issue was due to the LRTv component.

2.1.2 Scenario 2: single SNP, E is known

Assume that E was known, model (2.1) also allows us to evaluate T1E control for our second study

of directly testing for the interaction effect bGE . Continuing from the location tests from the first

study, likelihood ratio test and the score test were examined again. The likelihood ratio test, denoted

as LRT (bGE) considered testing:

Y = bGG+ bEE + ε, vs. Y = bGG+ bEE + bGE(G× E) + ε, (2.5)

where ε ∼ N(0, σ2). In the rest of this thesis, we will use the full model Y = bG+bE+bGE(G×E)+ε

as ‘fitted model’ to refer to both, for tests for nested models like LRT (bGE) here. The corresponding

statistics LRT (bGE) and the classic score test statistics Score(bGE) are both asymptotically χ2(1)

distributed. That is, under the ‘theoretical’ null that βGE = 0 in the true phenotype-generating

model (2.1), the density function f0(T ) = χ2(1), where the test statistic T is either LRT (bGE) or

Score(bGE). This is an important note for our study here. We will show that if βGE 6= 0 but the

association test is conducted for Gnew under the ‘empirical’ null, continuing using χ2(1) to obtain

p-value can lead to T1E result that is quite different from that obtained under the ‘theoretical’ null.

More test statistics could be examined under scenario 2. We use these two mainly to demonstrate

the T1E control problem in standard design, not to list all sensitive test statistics.

Page 18: Beyond the traditional simulation design

Chapter 2. Methods and Materials 9

2.1.3 Scenario 3: multiple Gs, and E known

A complex trait is influenced by multiple factors. For example, intelligence has been found to be

associated with more than a hundred of SNPs (Hill et al. 2018, Dadaev et al. 2018). Without loss

of generality, the simple phenotype-generating model (2.1) can be extended to include J SNPs,

Y =

J∑j=1

βGjGj + βEE +

J∑j=1

βGEj(Gj × E) + e, e ∼ N(0, σ2), (2.6)

where βGj is the main effect of SNP j, βE is the environmental effect, and βGEj is the interaction

effect of Gj × E.

To detect any of the J interaction effects, the classical F-test, denoted as FGE , can be applied

to the following fitted model,

Y =

J∑j=1

bGjGj + bEE +

J∑j=1

bGEj(Gj × E) + ε, (2.7)

and test H0 : bGE1= . . . = bGEJ

= 0 simultaneously. Under the ‘theoretical’ null that βGEj= 0 ∀j

in the phenotype-generating model (2.6), the FGE test statistic is F(J, n− 2J − 1) distributed and

is known to have good T1E control. However, it is not clear how the test would behave in practice

under the ‘empirical’ null. In that case, a set of Gnewj ’s under the study do not interact with E to

influence Y , but βGEj6= 0 and e ∼ N(0, σ2) in the true phenotype-generating model (2.6).

For rare variants where MAFs are low, multivariate testing is common. In addition, different

joint testing methods have been proposed, such as the burden-type sum test (B. Li and Leal 2008;

Madsen and Browning 2009; Price et al. 2010), and SKAT-type variance component test (Wu et al.

2011); see Derkach, Lawless, L. Sun, et al. 2014 for a review. Although these tests were originally

developed for main effects of rare variants, they can be applied to common variants (Xinyi Lin, Lee,

Christiani, et al. 2013) and used for interaction effects (Xinyi Lin, Lee, Wu, et al. 2016 Section 3).

Briefly, the BurdenGE interaction test first aggregates the allele counts across the J SNPs to

obtain G∗ =∑Jj=1Gj . It then tests H0 : bG∗E = 0 using the fitted model of

Y = bG∗G∗ + bEE + bG∗E(G∗ × E) + ε. (2.8)

However, the BurdenGE test has T1E issue even under the ‘theoretical’ null; this was studied in

Page 19: Beyond the traditional simulation design

Chapter 2. Methods and Materials 10

Section 3 of Xinyi Lin, Lee, Wu, et al. (2016). For completeness of method evaluation, we include

BurdenGE in our study of ‘theoretical’ vs. ‘empirical’ null.

Extending the earlier SKAT work for main effect, Xinyi Lin, Lee, Wu, et al. 2016 then used

it to study interaction effect. Without going to the technical details, the main component of the

SKATGE interaction test is∑Jj=1 wjScore

2(bGEj), where Score(bGEj

) is the score test statistic for

each bGEj in the fitted model (2.7), and wj depends on the MAF of SNP j. We will study the

SKATGE , as well as the BurdenGE and FGE for gene-based interaction studies of both common

and rare variants.

2.2 Three T1E simulation designs: the ‘theoretical’ null S0,

and the ‘empirical’ null S1.1 and S1.2

The three simulation designs for evaluate T1E can be conceptualized as follows.

The ‘theoretical’ null’ S0: simulates Y0 under the null,

independently generates Gnew,

and evaluates T1E from the Y0 −Gnew null relationship.

The ‘empirical’ null’ S1.1: simulates Y1 under an alternative based on G,

independently generates Gnew,

and evaluates T1E from the Y1 −Gnew null relationship.

The ‘empirical’ null’ S1.2: permutes the Y1,

and evaluates T1E from the Y perm1 −G null relationship.

(2.9)

The exact implementation depends on the test to be evaluated. Thus, we describe below, in details,

how data are being generated for the three T1E simulation designs (S0, S1.1, and S1.2), and under

the different interaction testing scenarios (E missing or not, and single or multiple Gs).

Page 20: Beyond the traditional simulation design

Chapter 2. Methods and Materials 11

2.2.1 Data-generating models for the three T1E simulation designs and

under the three interaction scenarios

Scenario 1: single G, and E missing

Consider the true phenotype-generating model (2.1), the ‘theoretical’ null simulation design S0

assumes βGE = 0; for simplicity we also assume βG = 0. Thus, S0 simulates phenotype data using

Y0 = βEE + e, e ∼ N(0, σ2). (2.10)

It then independently simulates genotype data for an non-associated SNP,

Gnew ∼ Binomial(2, 1− f), (2.11)

where f is the MAF. Finally, because E was assumed missing in practice, S0 uses the following fitted

model,

Y0 = bGGnew + ε, ε ∼ N(0, σ2G), (2.12)

to detect the variance heterogeneity present in ε, based on the Levene and LRTv tests as described

in Section 2.1.1. Theoretical design S0 can be formulated in another way where we generate true

phenotype from Y0 = βEE+βGG+e, e ∼ N(0, σ2), with main effect of some genotype G (Model II of

Rao and Province (2016)), and apply variance test using fitted model Y0 = bGGnew+ε, ε ∼ N(0, σ2G).

The two type of ‘theoretical’ null designs are not essentially different due to independently generated

Gnew. Similar conclusion that main effect in data generating model does not affect our study of

interaction effect can be drawn in the other two scenarios for the same reason.

The ‘empirical’ null simulation design S1.1, however, first simulates phenotype data under an

alternative hypothesis. That is,

Y1 = βGG+ βEE + βGE(G× E) + e, e ∼ N(0, σ2). (2.13)

Note that G is a truly associated SNP, and G ∼ Binomial(2, 1 − f ′). S1.1 then independently

simulates genotype data for a non-associated SNP Gnew as in (2.11). Note that the MAF f does

not have to be the same as f ′ for G. Similarly, because E was assumed missing in practice, S1.1

Page 21: Beyond the traditional simulation design

Chapter 2. Methods and Materials 12

uses the fitted model of

Y1 = bGGnew + ε, ε ∼ N(0, σ2G) (2.14)

to obtain the Levene and LRTv variance test statistic.

The ‘empirical’ null simulation design S1.2 first permutes the Y1 generated above. Because Y perm1

is no longer associated with the Y1-generating G, it then uses the fitted model of

Y perm1 = bGG+ ε, ε ∼ N(0, σ2G) (2.15)

to assess T1E control of a test.

In summary, the true phenotype-generating models are Y0 = βEE + e or Y1 = βGG + βEE +

βGEG×E + e, where e ∼ N(0, σ2) in both cases. Gnew is genotype data for a non-associated SNP.

The three T1E simulation designs estimate T1E rate of a heteroscedasticity test for V ar(ε) using,

respectively, the fitted models of

S0: Y0 = bGGnew + ε,

S1.1: Y1 = bGGnew + ε,

S1.2: Y perm1 = bGG+ ε.

(2.16)

It’s apparent that both ‘empirical’ null designs simulated phenotype data Y under some alter-

native hypothesis, which is as expected in real data applications. And the tested Y -G pairs are

independent as the null hypothesis claims so that we can evaluate T1E rate control. The facts

explained why we define S1.1 and S1.2 as ‘empirical’ null designs.

Scenario 2: single G, and E known

In this case, the data generating model is the same as above, namely, Y0 = βEE + e, e ∼ N(0, σ2),

for S0 design, and Y1 = βGG + βEE + βGE(G × E) + e, e ∼ N(0, σ2) for S1.1 and S1.2 designs.

Because E is known in this second interaction scenario, the three T1E simulation designs estimate

the T1E rate by testing H0 : bGE = 0 using, respectively, the fitted models of

S0: Y0 = bGGnew + bEE + bGE(Gnew × E) + ε,

S1.1: Y1 = bGGnew + bEE + bGE(Gnew × E) + ε,

S1.2: Y perm1 = bGG+ bEEperm + bGE(G× Eperm) + ε.

(2.17)

Page 22: Beyond the traditional simulation design

Chapter 2. Methods and Materials 13

Note that Eperm represents the fact that the permutation must be performed jointly for Y1 and

E. This is to maintain the Y1 − E association while breaking the Y1 − G association. We can

easily conclude that E(bGE) = 0 in all three fitted models in (2.17), even the Y -GxE may not be

independent. Details can be found in Appendix.

For the ‘theoretical’ null S0, we assumed Y0 = βEE + e without the main G effect in the true

generating model (similar to Model I of Rao and Province (2016)). Alternatively, we could consider

Y0 = βGG+ βEE + e with the main effect (Model II of Rao and Province (2016)). In that case, the

fitted model would be Y0 = bGG+ bEE + bGE(G× E) + ε, and bGE is expected to be zero.

The work of Rao and Province (2016) studied the effect of dependency in SNPnon−repeatingxSNPrepeating

interaction analysis on T1E control. Similarly, we can assume E is fixed to represent the fact that,

in a real genome-wide GxE interaction scan, the E does not change from SNP to SNP. Thus the

test statistics T (bGE)s are dependent in a scan. This dependency can lead to larger variation in

genomic inflation factors. However, we show that this dependency is not the source of the T1E issue

addressed here in Appendix.

Scenario 3: multiple Gs, and E known

By now, it should be clear how the three T1E simulation designs would be implemented in this

setting. The true phenotype-generating models are Y0 = βEE + e, e ∼ N(0, σ2) for S0, and Y1 =∑Jj=1 βGj

Gj +βEE+∑Jj=1 βGEj

(Gj×E)+e, e ∼ N(0, σ2) for S1.1 and S1.2. For the SKATGE and

FGE tests, the three T1E simulation designs estimate the T1E rate of jointly testing H0 : bGEj=

0, j = 1, . . . J , using, respectively, the fitted models of

S0: Y0 =

∑Jj=1 bGj

Gnewj+ bEE +

∑Jj=1 bGEj

(Gnewj× E) + ε,

S1.1: Y1 =∑Jj=1 bGjGnewj + bEE +

∑Jj=1 bGEj (Gnewj × E) + ε,

S1.2: Y perm1 =∑Jj=1 bGj

Gj + bEEperm +

∑Jj=1 bGEj

(Gj × Eperm) + ε.

(2.18)

For the BurdenGE test based on G∗ =∑Jj=1Gj , the three T1E simulation designs estimate the

T1E rate of testing H0 : bG∗E = 0, using, respectively, the fitted models of

S0: Y0 = bG∗G∗new + bEE + bG∗E(G∗new × E) + ε,

S1.1: Y1 = bG∗G∗new + bEE + bG∗E(G∗new × E) + ε,

S1.2: Y perm1 = bG∗G∗ + bEE

perm + bG∗E(G∗ × Eperm) + ε.

(2.19)

Page 23: Beyond the traditional simulation design

Chapter 3

Simulation Studies

3.1 Summary of Simulation models and parameter values

For indirect or direct GxE interaction study of a single SNP (i.e. interaction scenarios 1 and 2), Table

3.1 provides the details of the data-generating models and parameter values used. To evaluate the

Levene and LRTv tests for variance heterogeneity, besides the data generating model as described in

Section 2.3.1 and as used by Aschard et al. (2013), we also considered the model adopted by Cao et

al. (2014) for a more extensive comparison. Cao et al. (2014) used model (2.4) to directly simulate

variance homogeneity or heterogeneity in Y stratified by G. In contrast, Aschard et al. (2013)

used model (2.1) to indirectly simulate variance heterogeneity that has better genetic epidemiology

interpretation, because the size of βGE corresponds to power of scale tests under alternatives. For

direct testing of the interaction effect, conveniently, the data-generating model of Aschard et al.

(2013) in Table 3.1 is conceptually the same as the simulation model I of Rao and Province (2016),

except SNPrepeating is a fixed E here.

To best mimic real data condition, we also used a ‘double-loop’ simulation design. Consider the

‘theoretical’ null of Y0 = βEE + e for GxE interaction analysis. Within each of nrep.out replicates

(e.g. 100) in an outer simulation loop, we simulate Y0 based on a fixed E, Y0 = βEEfixed + e.

We then simulate nrep.in replicates (e.g. 104) of Gnew, test bGE based on the fitted model of

Y0 = bGGnew + bEEfixed + bGE(Gnew × Efixed) + ε, and estimate the T1E rate using the nrep.in

replicates. This is similar to one single whole-genome GxE interaction scan. Finally, we can average

the T1E rate over nrep.out replicates to account for sampling variation inherent in the simulation of

14

Page 24: Beyond the traditional simulation design

Chapter 3. Simulation Studies 15

a Efixed for one single whole-genome interaction scan. This can be done similarly for the ‘empirical’

null.

Table 3.1: Summary of the two data-generating models for indirect and direct GxE interactiontesting, and evaluating the ‘theoretical’ null simulation designs S0 vs. the two the ‘empirical’ nullsimulation designs S1.1 and S1.2, as described in Sections 2.3.1 and 2.3.2. Assume E was availablefor direct GxE testing, the Aschard et al. (2013) model coincides with Model I of Rao and Province(2016), except E was Gnon−repeating. T1E rate is first estimated from nrep.in simulation replicatesin an inner loop in which E is fixed (similar to one whole-genome GxE interaction scan), thenaveraged over nrep.out simulation replicates in an outer loop in which E varies.

Introduce Introducevariance heterogeneity by σ2

G variance heterogeneity by G× E(Cao et al. 2014) (Aschard et al. 2013)

Or, Directly test βGE(assuming E was available)

Null Model Y0 = e, Y0 = βEE + e,for S0 e ∼ N(0, σ2) e ∼ N(0, σ2)

Alternative Models Y1 = βGG+ e, Y1 = βGG+ βEE + βGEG× E + e,for S1.1 and S1.2 e ∼ N(0, σ2

G) e ∼ N(0, σ2)

ParametersMAF=0.4 for G and Gnew MAF=0.4 for G and Gnew; P(E=1)=0.3

βG=0.3 βG=0.01, βE=0.3, βGE=0.1, 0.2,· · · ,1σ20=0.23, σ2

1=0.25, σ22=0.29 σ2 = 1

Sample size n=103 or 104 n=103 or 104

Nominal T1E α=0.05 α=0.05, 0.01, 0.001, 10−5

Replications nrep.in= 105 nrep.in=105, or 107 for α=10−5

nrep.out= 100 nrep.out= 100

For each combination of parameter values in Table 3.1 that generates Y1 under an alternative,

instead of studying power, we focused on evaluating T1E control, contrasting the proposed ‘empirical’

null with the previously considered ‘theoretical’ null, as described in Section 2.

Table 3.2 provides the parameter values for gene-based interaction analysis (i.e interaction sce-

nario 3). The number of SNPs was chosen to be J = 11 as in Xinyi Lin, Lee, Wu, et al. (2016),

among which only seven interact with E. That is, βGEj6= 0 only for some j = 1, . . . , J as detailed in

Page 25: Beyond the traditional simulation design

Chapter 3. Simulation Studies 16

Table 3.2. Two sets of MAF levels were considered, with one presents gene-based interaction studies

of common variants and the other of rare variants.

Table 3.2: Summary of the data-generating models for multivariate GxE interaction testing, andevaluating the ‘theoretical’ null simulation designs S0 vs. the two the ‘empirical’ null simulationdesigns S1.1 and S1.2, as described in Section 2.3.3. T1E rate is first estimated from nrep.in

simulation replicates in an inner loop in which E is fixed (similar to one whole-genome gene-basedGxE interaction scan), then averaged over nrep.out simulation replicates in an outer loop in whichE varies.

Null ModelY0 =

∑Jj=1 βGj

Gj + βEE + e, e ∼ N(0, σ2)for S0

Alternative ModelsY1 =

∑Jj=1 βGjGj + βEE +

∑Jj=1 βGEj (Gj × E) + e, e ∼ N(0, σ2)

for S1.1 and S1.2

MAF for ~Gj and ~Gnewj

small: (2.15, 2.58, 2.58, 4.16, 2.57, 2.61, 4.95, 2.58, 2.57, 2.58, 3.68)× 10−2

large: (2.15, 2.58, 2.58, 4.16, 2.57, 2.61, 4.95, 2.58, 2.57, 2.58, 3.68)× 10−1

Parameters

J=11 ; P(E=1)=0.3~βG = (−0.218, 0, 0,−0.476, 0, 0,−0.151,−0.845, 0.0945, 0,−0.133)

βE = 0.3~βGE = (0.1,−0.1, 0, 0, 0.1, 0.1, 0,−0.1, 0,−0.1, 0)

σ2=0.27

Sample size n=103

Nominal T1E α=0.05, 0.01, 0.001

Replications nrep.in=105; nrep.out=100

3.2 Simulation results

For each of the three GxE interaction scenarios, for each of the statistical tests under the study,

and for each of the three T1E simulation designs, we recorded the corresponding T1E rate. We bold

in red colour the ones that exceed the α ± 3√α× (1− α)/nrep.in range, where α is the nominal

T1E rate, and rep.in is the number of simulation replicates used to estimate the empirical T1E

rate for each of the nrep.out replicates; each replicate in an outer loop represents a whole-genome

association scan. Thus, α± 3√α× (1− α)/nrep.in is a conservative interval.

Page 26: Beyond the traditional simulation design

Chapter 3. Simulation Studies 17

Scenario 1: single G, and E missing

To understand the table results below, each cell indicates an averaged estimated T1E rate from a

full loop (nrep.out×nrep.in), evaluated at a nominal level α. If nominal level α = 0.05, we write

α = 5× 10−2 and each cell is in the same 10−2 scale and we can directly compare the results with 5.

Table 3.3: Simulation results of interaction scenario 1: single G, and E missing. Empirical T1Erates of LRTm and Scorem location tests for mean difference in Y across the three G groups, andof LRTv and Levene scale tests for variance difference in Y , based on the ‘theoretical’ null designof S0 and the alternative ‘empirical’ null designs of S1.1 and S1.2. The alternative Y1 data weregenerated using the Aschard’s genetic model as described in Table 3.1. Empirical T1E rates outsideα± 3

√α× (1− α)/nrep.in are bolded in red.

α = 5× 10−2, n = 103

βGE 0.0 0.2 0.4 0.6 0.8 1

Location

LRTm

S0 5.029 5.027 5.026 5.027 5.027 5.026S1.1 5.039 5.021 5.021 5.019 5.014 5.010S1.2 4.997 5.023 5.022 5.020 5.017 5.011

Scorem

S0 5.002 4.998 4.997 4.998 4.998 4.997S1.1 5.014 4.992 4.993 4.991 4.985 4.981S1.2 4.974 4.994 4.993 4.990 4.988 4.983

Scale

LRTv

S0 5.035 5.083 5.081 5.081 5.081 5.079S1.1 5.029 5.188 5.262 5.507 5.979 6.757S1.2 5.031 5.198 5.274 5.519 5.994 6.756

LeveneS0 4.956 4.898 4.898 4.898 4.898 4.896

S1.1 4.989 4.901 4.904 4.905 4.909 4.907S1.2 4.906 4.922 4.912 4.911 4.915 4.908

Table 3.3 above shows that location tests for phenotypical mean differences (LRTm and Scorem)

are generally robust to the choice of ‘theoretical’ null S0 vs. ‘empirical’ null S1.1 or S1.2. That is, for

a location test with correct T1E control under S0, it has correct T1E control under ‘empirical’ designs

as well. Similarly, if it has inflated T1E rates under S0, it is supposed to have the same inflation

problem showing under S1.1 and S1.2. On the contrary, LRTv test for variance heterogeneity is not

as ‘robust’. While S0 design concludes that LRTv has correct T1E rate control across the parameter

values considered, S1.1 and S1.2 reveals its problem in T1E control with ‘empirical’ null phenotype

Page 27: Beyond the traditional simulation design

Chapter 3. Simulation Studies 18

data. For example, in some settings, the T1E rate of LRTv is 0.07 for the nominal α = 0.05 level.

The empirical T1E rates of the Levene test were slightly deflated but not significantly.

While the increased T1E rates under the S1.1 and S1.2 ‘empirical’ null designs appear to be mild

and occur in extreme models (i.e. large GxE interaction effect), results in Table 3.4 demonstrate

that the T1E issue of LRTv, revealed under the ‘empirical’ null simulation designs of S1.1 and S1.2,

can be more severe at the tail. For example, for the nominal α = 1× 10−5 level, the empirical T1E

rate of LRTv can be as high as 11.5× 10−5. Because the genome-wide significance level for GWAS

is α = 5 × 10−8 (Dudbridge and Gusnanto 2008), an inflation of false positive findings can be of a

real problem in practice. Further, results in Table 3.5 confirm that increasing sample size n (from

103 to 104) does not mitigate the discrepancy in T1E conclusion drawn from the ‘theoretical’ vs.

‘empirical’ null.

Table 3.4: Simulation results of interaction scenario 1: single G, and E missing; effect of the nominalα level. Empirical T1E rates of LRTm and Scorem location tests for mean difference in Y acrossthe three G groups, and of LRTv and Levene scale tests for variance difference in Y , based on the‘theoretical’ null design of S0 and the alternative ‘empirical’ null designs of S1.1 and S1.2. Thealternative Y1 data were generated using the Aschard’s genetic model as described in Table 3.1,focusing on the extreme case of large interaction effect, βGE = 1. Empirical T1E rates outsideα± 3

√α× (1− α)/nrep.in are boded in red.

n = 103 α 5× 10−2 1× 10−2 1× 10−3 1× 10−5

Location

LRTm

S0 5.016 0.999 0.985 0.988S1.1 5.009 1.007 0.988 0.990S1.2 5.011 1.009 1.033 1.013

Scorem

S0 5.008 0.998 0.989 0.990S1.1 4.982 0.998 0.982 0.991S1.2 4.982 0.999 0.983 0.997

Scale

LRTv

S0 5.009 1.002 1.024 1.033S1.1 6.923 1.636 2.059 11.599S1.2 6.920 1.639 2.042 11.624

LeveneS0 4.955 0.961 0.938 0.971

S1.1 4.964 0.978 0.932 0.952S1.2 4.962 0.970 0.953 0.958

Page 28: Beyond the traditional simulation design

Chapter 3. Simulation Studies 19

Table 3.5: Simulation results of interaction scenario 1: single G, and E missing; effect of sample sizen. Empirical T1E rates of LRTm and Scorem location tests for mean difference in Y across the threeG groups, and of LRTv and Levene scale tests for variance difference in Y , based on the ‘theoretical’null design of S0 and the alternative ‘empirical’ null designs of S1.1 and S1.2. The alternative Y1 datawere generated using the Cao’s genetic model as described in Table 3.1, and using two differencesample sizes of n = 103 and 104. Empirical T1E rates outside α ± 3

√α× (1− α)/nrep.in are

bolded in red.

α = 5× 10−2

n = 103 n = 104

Location

LRTm

S0 5.011 5.012S1.1 4.992 5.003S1.2 4.934 4.989

Scorem

S0 5.011 5.012S1.1 4.993 5.003S1.2 4.934 4.989

Scale

LRTv

S0 5.103 5.165S1.1 7.034 7.125S1.2 7.007 7.020

LeveneS0 4.924 4.945

S1.1 4.905 4.965S1.2 4.874 4.825

The root cause is that, under the S1.1 and S1.2 designs, Y1 marginally is not normally distributed,

even if it was generated (conditional on the true G) using a normally distributed error e term. Using

LRTv as an example for the test statistic T of interest. Cao et al. (2014) has shown that LRTv is

accurate based on f0(T ), which is χ2(2) derived for Y0−Gnew under the ‘theoretical’ null S0 condition

and assuming e ∼ N(0, σ2). However, when LRTv is applied to Y1 − Gnew, even if e ∼ N(0, σ2)

for generating Y1 conditional on the true G, the correct asymptotic distribution of LRTv under

the ‘empirical’ null is a weighted sum of independent χ2(1) (Appendix and Theorem 3.4.1(1) of

Yanagihara, Tonda, and Matsumoto (2005)), we call it asymptotic distribution under ‘empirical’

null. Thus, assessing LRTv using χ2(2) (asymptotic distribution under ‘theoretical’ null) can have

T1E issue if the data were under the ‘empirical’ null S1.1 condition; similarly for S1.2.

Figure S1 compares the asymptotic distribution (black solid curve) with the finite-sample dis-

tribution from simulation results (red dashed curve) of LRTv under the ‘empirical’ null, as well as

Page 29: Beyond the traditional simulation design

Chapter 3. Simulation Studies 20

with the asymptotic distribution (χ2(2), blue dot-dashed curve) of LRTv under the ‘theoretical’ null.

While the the finite-sample distribution approximates well the asymptotic distribution derived un-

der the ‘empirical’ null, it is clear that the distributions of LRTv differ between the ‘empirical’ and

‘theoretical’ null; the difference is more visible on the scale of critical value for statistical significance

(the vertical lines). The true significance threshold for LRTv under the ‘empirical ’ null is further

away to the tail, as compared with the threshold of χ2(2) under the ‘theoretical’ null. Thus, applying

LRTv to empirical GWAS or NGS while using the significance threshold of χ2(2) can lead to inflated

T1E rate.

Figure S1: Comparison of the asymptotic distribution (black solid) and finite-sample distribution(red dashed) of LRTv under the ‘empirical’ null S1.1 or S1.2, with the asymptotic distribution(χ2

2, blue dot-dashed) of LRTv under the ‘theoretical’ null S0. Vertical lines correspond the 99.9%quantile cutoffs for α = 0.001.

0 5 10 15

0.0

0.1

0.2

0.3

0.4

0.5

Density of LRT variance test statistics

N = 1000000 Bandwidth = 0.08798

Density

Asymptotic dist. | `empirical' nullFinite-sample dist. | `empirical' nullAsymptotic dist. | `theoretical' null

In practice, it is routine (and recommended) to display and examine the empirical distribution

of a trait under the study. However, Figure S1 shows that even under the most extreme setting

where βGE = 1, the marginal histogram of Y1 appears to be approximately normal visually, unless a

formal diagnostic test for normality was conducted. The slightly right-skewed empirical distribution

of Y1 is the result of mixing six conditional distributions of Y1, each perfectly normally distributed

conditional on the causal G and E. This is the key difference between the ‘theoretical’ and ‘empirical’

Page 30: Beyond the traditional simulation design

Chapter 3. Simulation Studies 21

null simulation designs, regardless of the sample size. For a less extreme case where βGE = 0.2,

although both the histogram and Q-Q plot (Figure S2) suggest that normal distribution is a good

fit (passing the Shapiro-Wilk normality test), the T1E discrepancy between the ‘theoretical’ and

‘empirical’ null remains, albeit less severe, as shown in Table 3.3 (column βGE = 0.2) and Figure S3

(middle row).

Tables 3.3, 3.4 and 3.5 also provide T1E results for testing phenotypic mean (as opposed to

variance) difference across the genotype groups. Although location testing for main effect is generally

quite robust to the assumption of normality, problem can arise when testing for interaction effects,

beyond any apparent model mis-specifications (Rao and Province 2016).

Scenario 2: single G, and E known

In direct testing for the GxE interaction effect (SNPrepeating×SNPnon−repeating to be more precise),

Rao and Province (2016) used the classical ‘theoretical’ null simulation design S0, with or without

the main G effect. Regardless, Figures 1B-1C of Rao and Province (2016) showed that the variation

in the resulting λGC was substantially bigger when testing bGE than testing bG. Their Figures 1D

and 1E also demonstrated that the variation diminishes as sample size increases. However, we note

that this observation was made before averaging across the 414 simulated interaction scans/data

sets; each scan contained 20,000 SNPs from which a λGC value was estimated.

The results of Rao and Province (2016) are consistent with ours shown in Figure S6. Figure

S6 showed that scan-specific estimated T1E rates are indeed variable due to test dependency, and

become less so as sample size increases; 100 GxE whole-genome interaction scans with 105 SNPs in

each scan and E was fixed within a scan. However, it is important to note that the average T1E rate

across nrep.out simulated scans reflects better the long-run behaviour of a method. Indeed, results

in Table 3.6 show that the T1E rate of testing bGE , estimated from 105×100 (nrep.in× nrep.out)

simulated replicates, is well controlled, under the conventional ‘theoretical’ null simulation design

of S0. However, this is not the case under the ‘empirical’ null simulation designs of S1.1 or S1.2.

Similar to the LRTv scale test for variance heterogeneity, the discrepancy in T1E control of LRTGE

and ScoreGE , between the two types of null simulation designs, becomes more prominent at the tail

and persists as sample increases (Table 3.6).

Page 31: Beyond the traditional simulation design

Chapter 3. Simulation Studies 22

Table 3.6: Simulation results of interaction scenario 2: single G, and E known. Empirical T1Erates of the LRTGE and ScoreGE tests, based on the ‘theoretical’ null design of S0 and the al-ternative ‘empirical’ null designs of S1.1 and S1.2. The alternative Y1 data were generated usingthe Aschard’s genetic model as described in Table 3.1 when βGE = 1, but E was assumed to beknown in this case and direction interaction testing was possible. Empirical T1E rates outsideα± 3

√α× (1− α)/nrep.in are bolded in red.

n = 103 n = 104

α = 5× 10−2 1× 10−2 1× 10−3 5× 10−2 1× 10−2 1× 10−3

LRTGE

S0 5.034 1.017 1.029 5.033 1.007 0.979S1.1 7.091 1.763 2.422 6.886 1.709 2.410S1.2 7.100 1.771 2.389 6.972 1.707 2.339

ScoreGE

S0 4.982 1.003 1.002 5.021 1.004 0.972S1.1 7.028 1.738 2.363 6.874 1.705 2.400S1.2 7.040 1.747 2.339 6.960 1.703 2.326

Scenario 3: multiple Gs, and E known

Table 3.7 contains T1E results for gene-based interaction studies of both common and rare variants.

When the MAFs are between 0.2 and 0.5 (common), we observed that the T1E rates of all tests

considered are significantly different between the ‘theoretical’ null and ‘empirical’ null. This result is

consistent with that of the interaction scenarios 1 and 2 above. Our simulation results also confirmed

that the BurdenGE interaction test has T1E issue even under the ‘theoretical’ null, as previously

noted in Xinyi Lin, Lee, Wu, et al. (2016). When the MAFs are reduced ten-fold to be between 0.02

and 0.05 (rare), although SKATGE appears to be robust, the BurdenGE and FGE tests continue

to display significant differences between the ‘theoretical’ and ‘empirical’ null simulation designs for

evaluating their T1E control.

Page 32: Beyond the traditional simulation design

Chapter 3. Simulation Studies 23

Table 3.7: Simulation results of interaction scenario 3: multiple Gs, and E known. EmpiricalT1E rates of the FGE , SKATGE and BurdenGE tests for jointly testing for multiple interactioneffects, based on the ‘theoretical’ null design of S0 and the alternative ‘empirical’ null designs ofS1.1 and S1.2. Models and parameters values are given in Table 3.2. Empirical T1E rates outsideα± 3

√α× (1− α)/nrep.in are bolded in red.

small MAF (rare) large MAF (common)

α = 5× 10−2 1× 10−2 1× 10−3 5× 10−2 1× 10−2 1× 10−3

FGE

S0 4.991 0.982 0.985 4.983 1.001 0.992S1.1 5.448 1.130 2.363 6.905 1.591 2.043S1.2 5.448 1.130 1.238 6.818 1.596 1.945

SKATGE

S0 4.904 0.949 0.895 4.934 0.976 0.917S1.1 5.038 1.074 1.059 6.460 1.411 1.640S1.2 5.034 1.065 1.110 6.378 1.394 1.651

BurdenGE

S0 6.169 3.751 15.049 13.709 4.280 7.440S1.1 8.432 2.213 3.144 5.666 1.232 1.410S1.2 8.408 2.208 3.148 5.712 1.237 1.420

Page 33: Beyond the traditional simulation design

Chapter 4

Discussion

In this work, we highlight the importance of distinguishing the ‘theoretical’ and ‘empirical’ null

distributions, first noted by Efron (2004), in a different application context. Starting with scale

tests for variance heterogeneity and through simulation studies, we showed that conclusions of type

1 error control of a statistical test could differ depending on the choice of the null simulation designs.

For example, the LRTv variance test appears to be accurate under the ‘theoretical’ null but invalid

under the ‘empirical’ null (Tables 3.3, 3.4 and 3.5, and Figure S3). Such differences indicates firstly

the test statistics we are examining does not have correct T1E rate control under ‘empirical’ null

hypothesis, which is expected from real data sets. And more importantly, such problem cannot be

revealed by traditional ‘theoretical’ null designs.

Cao et al. (2014) has pointed out the sensitivity and limitation of LRTv when the error term e is

not normally distributed. However, they implicitly assumed that the test would work well as long as

e ∼ N(0, σ2), regardless of ‘theoretical’ null or ‘empirical’ null. In our simulation study, although all

the e’s for generating the phenotype data were normal, the increased T1E rates under the ‘empirical’

null are, fundamentally, attributed to the sensitivity of LRTv to departure from normality. This is

because the marginal distribution of the empirical outcome data are not normal (Figures S1 and

S2). Thus, tests shown to be extremely sensitive to the assumption of normality are particularly

vulnerable when applied to real data, which are better represented by the ‘empirical’ null than the

‘theoretical’ null.

Conversely, power calculation for LRTv using the significance threshold of the ‘theoretical’ null

distribution can be too optimistic. Indeed, this study was motivated by the observation made in

24

Page 34: Beyond the traditional simulation design

Chapter 4. Discussion 25

Table S7 of Soave, Corvol, et al. (2015) that, “further analysis using permutation estimation of

p-values showed that power of the LRT under asymptotic analysis was greatly inflated. The work

here provides insights on why this is the case. The asymptotic power was obtained using the χ2(2)

distribution derived under the ‘theoretical’ null while controlling the T1E rate at α, as in Cao et

al. (2014). The permutation-based power was obtained by using the empirical 1− α quantile cutoff

of the LRTv statistic applied to permuted phenotype. This is equivalent to controlling the T1E

rate at α under the ‘empirical’ null of S1.2. Under the ‘empirical’ null, however, we showed that

LRTv follows a different distribution and the corresponding significance quantile cutoff is further

to the tail, as compared with the f0(T ) = χ2(2) distribution under the ‘theoretical’ null (Figure

S1). This leads to (correct) smaller permutation-based power. The (incorrect) higher asymptotic-

based power is a result of increased T1E rate, when the data were generated under the ‘empirical’

null condition but the test statistic was evaluated using χ2(2) derived under the ‘theoretical’ null

condition. Permutation-based S1.2 null design can estimate the true distribution of a test statistic

T when applied to a real data set, and identify the correct significance threshold for T under the

‘empirical’ null. But, permutation must be carried out carefully in practice, for example, in the

presence of sample correlation (Abney 2015).

In practice, investigators often rely on visual inspection of histograms of outcome data as illus-

trated in Figures S1 and S2. We have noted that the departure from normality does not have to

be severe to have an effect on tests such as LRTv, as observed in a LRTv scan of lung function in

cystic fibrosis subjects by Soave, Corvol, et al. (2015). Furthermore, for data appear to deviate from

normal such as those displayed in Figure S1, even if investigators chose to perform some standard

normal transformations, the T1E issue can persist. For example, let us consider the phenotype data

simulated based on Aschard’s genetic model, as described in Table 3.1 where βGE = 1 (Figure S1).

After square-root or log transformations (Goh and Yap 2009), although the empirical marginal dis-

tribution of the phenotype improved as expected (Figure S4), the severity of T1E inflation of LRTv

in fact worsened under the ‘empirical’ S1.1 and S1.2 null (Figure S5).

Voorman et al. (2011) showed that spurious false positives can occur in genome-wide scans for

GxE interactions, particularly in the presence of model mis-specification. Rao and Province (2016)

also presented inflated or deflated genomic inflation factors in a GxG interaction scan when one SNP

is anchored (i.e. SNPrepeatingxSNPnon−repeating), using the conventional ‘theoretical’ null simula-

tion design without any apparent model mis-specifications. Based on our GxE simulation studies

Page 35: Beyond the traditional simulation design

Chapter 4. Discussion 26

where E was fixed within each scan, we note that the large variation in λGC estimate demonstrated

by Rao and Province (2016) corresponds to the sampling variation inherent in estimating T1E rate

from nrep.in replicates (or SNPs) across nrep.out replicates (or scans). This, however, does not

translate to T1E issue based on the classical frequentist interpretation. Results in Table 3.6 show

that, under the ‘theoretical’ null of S0, there is no T1E issue in GxE interaction studies even if E

was fixed within a genome-wide scan. But, similar to scale test of variance, T1E results differ using

the ‘empirical’ S1.1 or S1.2 null simulation designs. Additional theoretical insights are provided in

Appendix Section 3.

Departure from normality is generally weakened under multivariate assumptions. However, the

topic addressed here remain relevant. To demonstrate this, in addition to studying interaction

between E and G of a single SNP, we also examined testing for interaction effects between E

and multiple Gs as in gene-based interaction studies. We reached the same conclusion that T1E

conclusions could differ between the ‘theoretical’ and ‘empirical’ null.

In practice, with real data available, ‘empirical’ null designs can be implemented as well to

evaluate T1E rate control for a test statistics. S1.1 generates new genotype data and fits models

using real phenotype Y and environmental factors E. S1.2 permutes the real Y -E pair jointly and

fit a model with Y perm, G, and Eperm. From the simulation results, S1.1 and S1.2 seem to reach

the same conclusion in all cases. However, with real data, S1.1 can hardly generates Gs while

maintaining the same linkage relationship. Permutation approach S1.2 keeps the genotype data,

but it could not maintain correlation structures in phenotype. The ‘empirical’ null designs S1.1 and

S1.2 are two basic designs to mimic real data and obtain true T1E rate evaluation. However, due

to the difficulties in reproducing real data together with possible relationships, new designs may be

developed to improve the current ones.

To conclude, although we only presented three examples (i.e. scale tests for variance heterogene-

ity, and direct tests for one or multiple interaction effects jointly), the findings here have important

implications for future evaluation of type 1 error control and interpretation. The newer tests be-

ing developed are increasingly complex, and they often go beyond the first moment such as the

scale tests studied here. Conventional simulation design S0 under the ‘theoretical’ null can lead

to misleading conclusion regarding the accuracy of a test, which in turn affects power study. The

alternative simulation designs S1.1 and S1.2 under the ‘empirical’ null, can reveal the true behaviour

of a test when applied to real data.

Page 36: Beyond the traditional simulation design

Bibliography

[1] Mark Abney. “Permutation Testing in the Presence of Polygenic Variation”. In: Genetic Epi-

demiology 39.4 (2015), pp. 249–258.

[2] Hugues Aschard et al. “A nonparametric test to detect quantitative trait loci where the phe-

notypic distribution differs by genotypes”. In: Genetic epidemiology 37.4 (2013), pp. 323–333.

[3] Morton B Brown and Alan B Forsythe. “Robust tests for the equality of variances”. In: Journal

of the American Statistical Association 69.346 (1974), pp. 364–367.

[4] Ying Cao et al. “A versatile omnibus test for detecting mean and variance heterogeneity”. In:

Genetic epidemiology 38.1 (2014), pp. 51–59.

[5] Tokhir Dadaev et al. “Fine-mapping of prostate cancer susceptibility loci in a large meta-

analysis identifies candidate causal variants”. In: Nature communications 9.1 (2018), p. 2256.

[6] Andriy Derkach, Jerry F Lawless, Lei Sun, et al. “Pooled association tests for rare genetic

variants: a review and some new results”. In: Statistical Science 29.2 (2014), pp. 302–321.

[7] Bernie Devlin and Kathryn Roeder. “Genomic control for association studies.” In: Biometrics

55.4 (1999), pp. 997–1004.

[8] Frank Dudbridge and Olivia Fletcher. “Gene-environment dependence creates spurious gene-

environment interaction”. In: The American Journal of Human Genetics 95.3 (2014), pp. 301–

307.

[9] Frank Dudbridge and Arief Gusnanto. “Estimation of significance thresholds for genomewide

association scans”. In: Genetic epidemiology 32.3 (2008), pp. 227–234.

[10] Bradley Efron. “Large-scale simultaneous hypothesis testing: the choice of a null hypothesis”.

In: Journal of the American Statistical Association 99.465 (2004), pp. 96–104.

27

Page 37: Beyond the traditional simulation design

BIBLIOGRAPHY 28

[11] Friedhelm Eicker. “Limit theorems for regressions with unequal and dependent errors”. In:

Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. Vol. 1.

1. 1967, pp. 59–82.

[12] Douglas Scott Falconer. Introduction to quantitative genetics. Oliver and Boyd; Edinburgh;

London, 1960.

[13] Liang Goh and Von Bing Yap. “Effects of normalization on quantitative traits in association

test”. In: BMC bioinformatics 10.1 (2009), p. 415.

[14] E Gomez-Sanchez-Manzano, MA Gomez-Villegas, and JM Marfffdfffdn. “Sequences of elliptical

distributions and mixtures of normal distributions”. In: Journal of multivariate analysis 97.2

(2006), pp. 295–310.

[15] WD Hill et al. “A combined analysis of genetically correlated traits identifies 187 loci and a

role for neurogenesis and myelination in intelligence”. In: Molecular psychiatry (2018), p. 1.

[16] Azmeri Khan and Glen D Rayner. “Robustness to non-normality of common tests for the

many-sample location problem”. In: Journal of Applied Mathematics & Decision Sciences 7.4

(2003), pp. 187–206.

[17] Julia Kozlitina and William R Schucany. “A robust distribution-free test for genetic association

studies of quantitative traits”. In: Statistical applications in genetics and molecular biology 14.5

(2015), pp. 443–464.

[18] Seunggeun Lee, Michael C Wu, and Xihong Lin. “Optimal tests for rare variant effects in

sequencing association studies”. In: Biostatistics 13.4 (2012), pp. 762–775.

[19] Howard Levene et al. “Robust tests for equality of variances”. In: Contributions to probability

and statistics 1 (1960), pp. 278–292.

[20] Bingshan Li and Suzanne M Leal. “Methods for detecting associations with rare variants for

common diseases: application to analysis of sequence data”. In: The American Journal of

Human Genetics 83.3 (2008), pp. 311–321.

[21] Xinyi Lin, Seunggeun Lee, David C Christiani, et al. “Test for interactions between a ge-

netic marker set and environment in generalized linear models”. In: Biostatistics 14.4 (2013),

pp. 667–681.

Page 38: Beyond the traditional simulation design

BIBLIOGRAPHY 29

[22] Xinyi Lin, Seunggeun Lee, Michael C Wu, et al. “Test for rare variants by environment inter-

actions in sequencing association studies”. In: Biometrics 72.1 (2016), pp. 156–164.

[23] Trudy FC Mackay. “Q&A: Genetic analysis of quantitative traits”. In: Journal of biology 8.3

(2009), p. 23.

[24] Bo Eskerod Madsen and Sharon R Browning. “A groupwise association test for rare mutations

using a weighted sum statistic”. In: PLoS genetics 5.2 (2009), e1000384.

[25] Benjamin M Neale et al. “Testing for an unusual distribution of rare variants”. In: PLoS

genetics 7.3 (2011), e1001322.

[26] Guillaume Pare et al. “On the use of variance per genotype as a tool to identify quantitative

trait interaction effects: a report from the Women’s Genome Health Study”. In: PLoS genetics

6.6 (2010), e1000981.

[27] Alkes L Price et al. “Pooled association tests for rare variants in exon-resequencing studies”.

In: The American Journal of Human Genetics 86.6 (2010), pp. 832–838.

[28] Tara J Rao and Michael A Province. “A Framework for Interpreting Type I Error Rates from a

Product-Term Model of Interaction Applied to Quantitative Traits”. In: Genetic epidemiology

40.2 (2016), pp. 144–153.

[29] David Soave, Harriet Corvol, et al. “A joint location-scale test improves power to detect as-

sociated SNPs, gene sets, and pathways”. In: The American Journal of Human Genetics 97.1

(2015), pp. 125–138.

[30] David Soave and Lei Sun. “A generalized Levene’s scale test for variance heterogeneity in the

presence of sample correlation and group uncertainty”. In: Biometrics (2017).

[31] Maksim V Struchalin et al. “Variance heterogeneity analysis for detection of potentially inter-

acting genetic loci: method and its limitations”. In: BMC genetics 11.1 (2010), p. 92.

[32] Xiangqing Sun et al. “What is the significance of difference in phenotypic variability across

SNP genotypes?” In: The American Journal of Human Genetics 93.2 (2013), pp. 390–397.

[33] Tetsuji Tonda, Hirofumi Wakaki, et al. “Asymptotic expansion of the null distribution of the

likelihood ratio statistic for testing the equality of variances in a nonnormal one-way ANOVA

model”. In: Hiroshima Mathematical Journal 33.1 (2003), pp. 113–126.

Page 39: Beyond the traditional simulation design

BIBLIOGRAPHY 30

[34] Arend Voorman et al. “Behavior of QQ-plots and genomic control in studies of gene-environment

interaction”. In: PloS one 6.5 (2011), e19416.

[35] Andrew R Wood et al. “Another explanation for apparent epistasis”. In: Nature 514.7520

(2014), E3.

[36] Michael C Wu et al. “Rare-variant association testing for sequencing data with the sequence

kernel association test”. In: The American Journal of Human Genetics 89.1 (2011), pp. 82–93.

[37] Hirokazu Yanagihara, Tetsuji Tonda, and Chieko Matsumoto. “The effects of nonnormality

on asymptotic distributions of some likelihood ratio criteria for testing covariance structures

under normal assumption”. In: Journal of Multivariate Analysis 96.2 (2005), pp. 237–264.

Page 40: Beyond the traditional simulation design

Chapter 5

Appendix

5.1 Conditional and Marginal Distribution of Y under Alter-

native Model

The conditional and marginal mean/variance of phenotype Y:

µij = E[Y |E = i, G = j] = jβG + iβE + ijβGE

σij = Var[Y |E = i, G = j] = σ2

E[Y ] =

1∑i=0

2∑j=0

ωij · µij

Var[Y ] =

1∑i=0

2∑j=0

ωij · ((µij − E[Y ])2 + σ2)

(5.1)

Where ωij = 21(j=1)f2−j(1−f)jP[E = 1]i(1−P[E = 1])1−i, where f is the minor allele frequency

(MAF) of the SNP. And it can be easily concluded that Y ∼∑1i=0

∑2j=0 ωijN (µij , σ

2).

31

Page 41: Beyond the traditional simulation design

Chapter 5. Appendix 32

5.2 Asymptotic Distribution of Likelihood Ratio Statistics

for Variance under Non-normality

The null hypothesis for testing equal variance in 3 genotype groups are:

H0 : σ20 = σ2

1 = σ22

Likelihood Ratio test statistic equals to:

LRTv = −1

2

2∑j=0

log σ2j − log σ2

It is known that under normality, LRTvL−→ χ2

(2). When Y is not normal but with E[||Y ||4] <∞ and

nj/n = O(1) for each group G = j, denote ρj = nj/n, ρ = (ρ0, ρ1, ρ2)′. And uj = (√

nj

2 ·σ2i−σ

2

σ2 ),

u = (u0, u1, u2)′. The covariance matrix of u can be written as I3⊗Ω for some 3x3 matrix Ω. Then

the characteristic function of LRTv becomes:

C(t) =

2∏j=0

(1− 2itλj)−1 + o(1)

where λj are the eigenvalues of Ω. And the test statistics converges to:

LRTvL−→

2∑j=0

λjχ2(1)

When Y is normally distributed the eigenvalues will reduce to a zero and two 1s and the asymptotic

distribution will reduce to∑2j=1 χ

2(1) = χ2

(2).

5.3 Score Test on GxE Interaction Scan

5.3.1 Dependency of Test Statistics

The working model in the whole-genome interaction scan is:

E[Y 0] = β1G+ β2E + β3G× E

Page 42: Beyond the traditional simulation design

Chapter 5. Appendix 33

Score statistics:

T =S(0)− S(β3)√V ar(S(β3))

=

n∑i=1

1

σ2(GiEi) · (yi − βGGi − βEEi)

/√n

σ2E[G2E2]

Since Y ∼ N (0, 1), σ2 is the variance of normal error term, T becomes:

T =

( n∑i=1

1

σGiEi · εi

)/√n · E[G2E2] ∼ N

(0,

1n

∑ni=1G

2iE

2i

E[G2E2]

)

For a p-SNP scan, the distributions of T1, · · · , Tp have dependent variances. When∑ni=1 1[Gi =

0] (1−maf)2 · n, or∑ni=1 1[Ei = 1] P[E = 1] · n, the genomic inflation factor λ is expected to

be lower than 1 and vice versa.

Small-sized sample can be largely deviated from expectation by chance. As n→∞, 1n

∑ni=1(GiEi)

2 →

E[(GE)2], and thus λ becomes stabled around 1 (Figure S5). Averaged over outer loop for type 1

error control, location tests will not reveal any problem under S0 design (Table 6).

5.3.2 Under Non-Normality

The phenotype data Y 1 under alternative ‘empirical’ null is generated from:

Y 1 = βGG+ βEE + βGEGE + e, e ∼ N (0, 1)

With conditional expectation and variance:

E[Y 1|E] = βGE[G] + (βE + βGEE[G]) · E

Var[Y 1|E] = β2GVar[G] + β2

GEVar[G] · E2

For example, in S1.1, Gnew |= Y,E in the working model:

E[Y 1] = β1Gnew + β2E + β3Gnew × E

E has linear mean effect on Y with increased variance. Thus error term ε has variance depending

on E.

Page 43: Beyond the traditional simulation design

Chapter 5. Appendix 34

Consequently, the asymptotic distribution of score test statistics is expected to have variance larger

than 1. Thus empirical type 1 error is larger than nominal level in general regardless sample size.

Page 44: Beyond the traditional simulation design

Chapter 5. Appendix 35

5.4 Figures

Figure S1: Marginal and conditional histogram (Top), and normal Q-Q plot (Bottom) of the empir-ical phenotype data Y1 when βGE = 1 based on the Aschard’s model as described in Table 1.

Page 45: Beyond the traditional simulation design

Chapter 5. Appendix 36

Figure S2: Marginal and conditional histogram (Top), and normal Q-Q plot (Bottom) of the em-pirical phenotype data Y1 when βGE = 0.2 based on the Aschard’s model as described in Table1.

Page 46: Beyond the traditional simulation design

Chapter 5. Appendix 37

Figure S3: Histogram of p-values of the LRTv test under S0 (Left column), S1.1 (Middle column)and S1.2 (Right column) null simulation designs. Simulation of the empirical phenotype data Y1 wasbased on the Aschard’s genetic model as described in Table 1 with βGE = 0 (Top row), βGE = 0.2(Top row), and βGE = 0.6 (Top row). Red line represents the Uniform (0, 1) distribution.

Page 47: Beyond the traditional simulation design

Chapter 5. Appendix 38

Figure S4: Histograms with theoretical normal distribution (Top row), and normal QQ-plots (Bot-tom row) of phenotype Y1 simulated based on the Aschard’s genetic model as described in Table 2with βGE = 1 for before transformation (Left column), after square root (Middle column), and afterlog transformations (Right column).

Histogram of Y

Y

Den

sity

−2 0 2 4

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

Histogram of rtY

rtY

Den

sity

−4 −2 0 2

0.0

0.1

0.2

0.3

0.4

Histogram of logY

logY

Den

sity

−4 −3 −2 −1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

−3 −2 −1 0 1 2 3

−2

−1

01

23

4

Normal Q−Q Plot

Theoretical Quantiles

Sam

ple

Qua

ntile

s

−3 −2 −1 0 1 2 3

−4

−2

02

Normal Q−Q Plot

Theoretical Quantiles

Sam

ple

Qua

ntile

s

−3 −2 −1 0 1 2 3

−3

−2

−1

01

23

Normal Q−Q Plot

Theoretical Quantiles

Sam

ple

Qua

ntile

s

Page 48: Beyond the traditional simulation design

Chapter 5. Appendix 39

Figure S5: Histogram of p-values of the LRTv test under the S0 (Left column), S1.1 (Middle column)and S1.2 (Right column) null simulation designs, when phenotype Y1 was simulated based on theAschard’s genetic model as described in Table 1 with βGE = 1 for before transformation (Top row),after square root transformation (Middle row), and after log transformation (Bottom row).

Page 49: Beyond the traditional simulation design

Chapter 5. Appendix 40

Figure S6: Scatter plot of nrep.out = 100 estimated T1E rates for whole-genome interaction GxEscans, each estimated from nrep.in = 105 simulated replicates (or SNPs) with a fixed E, under the‘theoretical’ null design S0 as described in Table 1, for sample size n = 104 (Left) and 105 (Right).

0 20 40 60 80 100

0.03

0.04

0.05

0.06

0.07

sample size n=1000

Est

imat

ed ty

pe 1

err

or

0 20 40 60 80 100

0.03

0.04

0.05

0.06

0.07

sample size n=10000

Est

imat

ed ty

pe 1

err

or