Comparison of methods for longitudinal analysis of quantitative traits in genome-wide association studies
by
Mengdan Xu
A thesis submitted in conformity with the requirements for the degree of Master of Science
Graduate Department of Public Health Sciences, University of Toronto
© Copyright by Mengdan Xu 2019
Abstract
Longitudinal genome-wide association studies provide more information on the relationship
between repeatedly measured traits and genetic variants (SNPs) than cross-sectional ones. The
linear mixed model (LMM) has been a popular tool for such analysis but is computationally
slow. Inspired by the fast methods proposed by Sikorska et al., which approximate the LMM,
we extended previous simulation scenarios and compared the fast methods with the LMM on
inference for cross-sectional SNP effects, longitudinal SNP effects and the joint test.
We also applied the fast methods to real data for each of two renal outcomes. Using the fast
methods, we detected several SNPs with longitudinal effects on the two outcomes.
The fast methods are effective and much faster than the LMM; however, they differ in
computational efficiency and speed. We also discuss limitations of our simulation study
and of the application to real data, along with future directions.
Acknowledgments
First, I want to thank my supervisor, Dr. Andrew Paterson, for providing me with so many valuable
suggestions. Through this opportunity, I learned the importance of reading and writing for
carrying out a complete analysis. I have also learned many new concepts and techniques in
biostatistics. This was also a challenge for me as it is my first serious thesis, and there is no way I
could have completed it without the guidance of Dr. Paterson.
I want to express my gratitude to my advisory committee members, Dr. Wei Xu and Dr. Shelley
Bull, for assessing my thesis during the whole procedure and providing advice from different
perspectives. I also want to thank Dr. Lei Sun for sparing her time and offering help along with
my advisory committee.
Sincere thanks go to many people in SickKids Genetics and Genome Biology for answering my
questions on the data and troubleshooting problems I encountered in programming.
Lastly, I want to thank my family, especially my mother, for understanding my decision to
accept the challenge and do something I had never done before. I genuinely thank my
roommates, Han and Yuzhu, my friends, Cassie, Fan, Steven, Vicky, Thaison and many others
for supporting and helping me in many ways when I felt unsure of myself and worried
about the thesis.
Table of Contents
Acknowledgments
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Introduction
1.1 Genetic Basis of Diabetes
1.2 T1D Complications
1.3 Motivation
1.3.1 Genome-wide Association Study
1.3.2 Repeated Measurements from a Longitudinal Study
1.3.3 GWAS on Longitudinal Study
1.4 Linear Mixed Effects Models
1.4.1 Introduction to Linear Mixed Effects Models
1.4.2 General Form
1.4.3 Model Assumptions
1.5 Fast Methods for Longitudinal Data
1.5.1 Study Background
1.5.2 Slope as Outcome Method (SAO)
1.5.3 Two-Step Method (TS)
1.5.4 Conditional Two-Step (CTS) Method
1.5.5 Genome-wide Analysis of Large-scale Longitudinal Outcomes using Penalization (GALLOP)
1.5.6 Methods Performance
1.5.7 Concerns
1.6 Goals
Methods
2.1 Data Background: DCCT and EDIC
2.1.1 Diabetes Control and Complications Trial (DCCT)
2.1.2 Epidemiology of Diabetes Interventions and Complications (EDIC)
2.1.3 Renal Outcome Measures
2.1.4 Importance of DCCT/EDIC Study
2.2 Linear Mixed Effects Model for DCCT/EDIC Renal Outcomes
2.2.1 Phenotypic Covariate Selection
2.2.2 Within-individual Correlation Structure Selection
2.2.3 Genetic Data
2.3 Weighted Slope as Outcome (WSAO)
2.4 Simulation Study
2.4.1 Experimental Designs
2.4.2 Simulation of SNPs
2.4.3 Simulation of Outcome Trait
2.5 Methods for DCCT/EDIC Data Analysis
Results
3.1 Data Description
3.1.1 Number of visits
3.1.2 Distribution of logAER
3.1.3 Distribution of eGFR
3.2 Simulation Study Results
3.2.1 Set up
3.2.2 Type 1 Error
3.2.3 Power
3.2.4 Parameter Estimation
3.2.5 Speed
3.3 DCCT/EDIC Data Analysis Results
3.3.1 Set up
3.3.2 GWAS of logAER
3.3.3 GWAS of eGFR
3.3.4 Speed Comparison
Discussion
4.1 Simulation Study
4.1.1 Type 1 Error
4.1.2 Power
4.1.3 Speed
4.1.4 Implementation
4.2 DCCT/EDIC Data Analysis
4.2.1 GWAS of logAER
4.2.2 GWAS of eGFR
4.3 Limitations and Future Study
4.3.1 Simulation Study Settings
4.3.2 Real Data Model Specification
4.3.3 Efficiency Measures
4.3.4 Heteroscedasticity
4.3.5 Weighted Slope as Outcome (WSAO)
4.3.6 Missing Not at Random
4.3.7 Empirical T1E and Theoretical T1E
4.3.8 Multivariate Model
Summary
References
List of Tables
Table 1-1 Rotterdam study: number of individuals with K non-missing responses.
Table 1-2 Rotterdam study: Type 1 error comparison.
Table 1-3 Rotterdam study: Power comparison.
Table 1-4 Rotterdam study: Speed comparison.
Table 2-1 Spatial correlation structures in nlme.
Table 3-1 Descriptive table for DCCT/EDIC.
Table 3-2 Missing rate for repeated measurements. Missing rate of logAER in EDIC and DCCT/EDIC is calculated by combining alternate years.
Table 3-3 Time comparison in simulation study for 1000 SNPs under null with MAF=0.3, N=2000, 0 cross-sectional or longitudinal SNP effect and unchanged error structure.
Table 3-4 DCCT/EDIC data analysis settings.
Table 3-5 Statistical comparison between fast methods and LMM on random 19570 SNPs.
Table 3-6 Summary information of significant SNPs ($P < 5 \times 10^{-8}$) for outcome logAER.
Table 3-7 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs3817222 on logAER.
Table 3-8 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs74155187 on logAER.
Table 3-9 Measures of efficiency of fast methods on logAER.
Table 3-10 Statistical comparison between fast methods and LMM on random 19570 SNPs.
Table 3-11 Summary information of significant SNPs ($P < 5 \times 10^{-8}$) for outcome eGFR.
Table 3-12 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs12713270 on chromosome 2 for eGFR.
Table 3-13 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs74155187 on eGFR.
Table 3-14 Measures of efficiency of fast methods on real eGFR data.
List of Figures
Figure 1-1 Random intercept or slope model examples.
Figure 2-1 Flow chart for Methods section.
Figure 2-2 Flow chart of simulation study.
Figure 3-1 Number of DCCT participants by randomization year.
Figure 3-2 Number of visit counts per subject in DCCT/EDIC years.
Figure 3-3 Expected and actual numbers of subjects in each DCCT/EDIC year including DCCT baseline and close out visits.
Figure 3-4 Proportion of missing subjects in each DCCT/EDIC year including DCCT baseline and close out visits.
Figure 3-5 Barplot of numbers of logAER measurements per subject in DCCT/EDIC study.
Figure 3-6 Distribution of logAER in DCCT years.
Figure 3-7 Distribution of logAER in EDIC years.
Figure 3-8 Spaghetti plots of logAER from 20 subjects from DCCT/EDIC in each grid.
Figure 3-9 Barplot of numbers of eGFR measurements per subject in DCCT/EDIC study.
Figure 3-10 Distribution of eGFR in DCCT years.
Figure 3-11 Distribution of eGFR in EDIC years.
Figure 3-12 Spaghetti plots of eGFR from 20 subjects from DCCT/EDIC in each grid.
Figure 3-13 Type 1 error rates calculated in reference scenario, 4 missing scenarios and 2 within-subject error correlation scenarios.
Figure 3-14 Example of sample size changing by time in missing scenarios.
Figure 3-15 Power of cross-sectional SNP effect calculated in 7 scenarios.
Figure 3-16 Power calculated in 7 scenarios.
Figure 3-17 Parameter estimates of cross-sectional SNP effect calculated in 7 scenarios.
Figure 3-18 Parameter estimates of longitudinal SNP effect by MAF calculated in 7 scenarios.
Figure 3-19 Parameter estimates of longitudinal SNP effect by N calculated in 7 scenarios.
Figure 3-20 Number of participants in EDIC starting years.
Figure 3-21 Slopes for two-stage fast methods on outcome logAER.
Figure 3-22 Histograms of MAFs of all SNPs (n=8,979,131) and randomly selected SNPs (n=19,570).
Figure 3-23 P-P plots on outcome logAER on random subset of SNPs.
Figure 3-24 E-E plots on outcome logAER on random subset of SNPs.
Figure 3-25 Histograms of p-values (logAER).
Figure 3-26 Q-Q plots of p-values (logAER).
Figure 3-27 Q-Q plots of p-values stratified by MAF (logAER).
Figure 3-28 Manhattan plots (logAER).
Figure 3-29 Estimates along with 95% CI for cross-sectional SNP effects on logAER by combined and separate DCCT/EDIC years.
Figure 3-30 Locus plot on logAER with reference SNP rs3817222.
Figure 3-31 Locus plot on logAER with reference SNP rs74155187.
Figure 3-32 Slopes for two-stage fast methods on outcome eGFR.
Figure 3-33 P-P plots on outcome eGFR on random 19570 SNPs.
Figure 3-34 E-E plots on outcome eGFR on random 19570 SNPs.
Figure 3-35 Histograms of p-values (eGFR).
Figure 3-36 Q-Q plots of p-values (eGFR).
Figure 3-37 Q-Q plots of p-values stratified by MAF (eGFR).
Figure 3-38 Manhattan plots (eGFR).
Figure 3-39 Estimate along with 95% CI for cross-sectional SNP effects on eGFR by combined and separate DCCT/EDIC years.
Figure 3-40 Locus plot with reference SNP rs12713270.
Figure 3-41 Locus plot with reference SNP rs74155187.
Figure 3-42 Running time for GWAS.
List of Abbreviations
(e)GFR (Estimated) Glomerular Filtration Rate
(log)AER (Logarithm-transformed) Albumin Excretion Rate
(RM-)ANOVA (Repeated Measures) Analysis of Variance
2df Two Degree-of-freedom
AIC Akaike Information Criterion
AR Autoregressive
ARMA Autoregressive Moving Average
BLUP Best Linear Unbiased Predictor
BMD Bone Mineral Density
bp Base Pair
CAR Continuous Autoregression
CKD Chronic Kidney Disease
CS Compound Symmetry
CTS Conditional Two Step
DCCT Diabetes Control and Complications Trial
EDIC Epidemiology of Diabetes Interventions and Complications
GALLOP Genome-wide Analysis of Large-scale Longitudinal Outcomes using Penalization
GC Genomic Control
GEE Generalized Estimating Equation
GWAS Genome-Wide Association Study
HbA1c Hemoglobin A1c
HLA Human Leukocyte Antigen
HWE Hardy–Weinberg Equilibrium
ICC Intra-class Correlation Coefficient
IDDM Insulin-Dependent Diabetes Mellitus
IQR Inter-Quartile Range
LMM Linear Mixed Model
M/Mb Million/Million base pair
MAF Minor Allele Frequency
MAR Missing at Random
MCAR Missing Completely at Random
MLE Maximum Likelihood Estimation
MNAR Missing Not at Random
SAO Slope as Outcome
SE Standard Error
SNP Single Nucleotide Polymorphism
T1D/T2D Type 1/2 Diabetes
T1E Type 1 Error
TS Two Step
VC Variance Component
WSAO Weighted Slope as Outcome
Introduction
1.1 Genetic Basis of Diabetes
Diabetes is a chronic disease caused by impaired insulin secretion or an improper
response to insulin. People with diabetes usually have blood and tissue glucose concentrations that
are too high, resulting in acute and long-term complications with social, emotional and
economic impacts on their quality of life. Hyperglycemia produces a variety of symptoms and
signs, while the pathogenesis differs by type of
diabetes. The majority of cases can be classified as either type 1 diabetes (T1D) or type 2
diabetes (T2D), which are distinguished by the involvement of autoimmunity (van de Bunt et al., 2014). Type 1
diabetes, previously called insulin-dependent diabetes, is caused by autoimmune destruction of
the insulin-producing β-cells of the pancreatic islets, so patients cannot produce sufficient insulin.
Type 2 diabetes, previously named non-insulin-dependent diabetes, is due to defective insulin
action: although patients do secrete insulin, there is resistance to its actions and the
response to insulin is diminished (Boland et al., 2017).
According to the World Health Organization, the number of people with diabetes roughly
quadrupled, from around 100 million in 1980 to around 400 million in 2014, with a more rapid rise in
prevalence in middle- and low-income countries (World Health Organization, 2018). It is thought
that environmental changes in food, exercise, climate, sleep and other factors contribute
to the increase in the prevalence of T2D or T1D (Greenbaum et al., 2008). Genetic studies have
contributed to understanding the pathogenesis of diabetes so that the disease can be better
prevented and treated; meanwhile, a healthy lifestyle can help people avoid or
reduce exposure to diabetogenic environmental agents.
The American Diabetes Association's 2019 classification of diabetes includes other types,
such as gestational diabetes mellitus, which account for a small proportion of the whole
patient population (American Diabetes Association, 2019). For example, ~1% of patients have
monogenic diabetes. People with this type of diabetes might be misdiagnosed as having T1D or T2D, but
it is caused by single-gene defects such as rare coding variants in HNF1A (Hepatocyte Nuclear
Factor 1 Alpha) or HNF4A (Hepatocyte Nuclear Factor 4 Alpha), among other genes (Radha and
Mohan, 2017). Classification is important for determining therapy.
Compared to other types of diabetes, T1D and T2D are more common but heterogeneous, arising from a
combination of genetic and environmental factors. Genes related to the risk
of T1D have been identified over the past 30 years (Todd et al., 1987). In the early to mid-1990s,
optimists believed that the pathogenesis of T1D, along with correspondingly precise
prevention and treatment, would be found very soon. In recent years the complexity of the
genetics of T1D has been recognized. The most important genes for T1D are located within the MHC
(Major Histocompatibility Complex) HLA (Human Leukocyte Antigen) class II region on
chromosome 6p21 (previously termed IDDM1), and account for around 45% of genetic
susceptibility to T1D. However, their exact function in pathogenesis is still
obscure (Buzzetti et al., 1998). With regard to the environment, no significant evidence has been found
that any environmental agent triggers the onset of T1D, despite much investigation
of viral infections, early infant diet and toxins (Atkinson and Eisenbarth, 2001; Afonso and
Mallone, 2013).
Numerous novel loci related to T1D have been identified. For instance, in 2009 Barrett et al.
detected over 40 loci affecting the risk of T1D using case-control data from the Type 1 Diabetes
Genetics Consortium, meta-analyzed together with the Wellcome Trust Case Control
Consortium study and the Genetics of Kidneys in Diabetes study. Apart from the long-known HLA
region on chromosome 6p21, the susceptibility loci detected in this study
not only supported previous discoveries of 4 non-HLA loci, INS, CTLA4, PTPN22 and IL2RA,
but also included many new candidate genes, such as IL10, IL19 and CD69 (Barrett et al.,
2009). Afterwards, techniques like targeted resequencing helped to pinpoint potential causal
variants within loci initially detected by such studies (Nejentsev et al., 2009).
As to T2D, researchers are also investigating the polygenic and environmental factors underlying it and
related traits. By 2018, more than 100 variants had been identified as associated with T2D (Zheng
et al., 2018). However, the effect sizes are usually so small that the associated variants in total
explain only ~10% of the heritability of T2D. In addition, related traits such as glycemia or obesity have
also been studied in populations without T2D, to determine whether loci identified for these
related traits are also associated with T2D risk and so to better understand the underlying
genes and biological mechanisms of T2D (Mohlke and Lindgren, 2014).
1.2 T1D Complications
The fundamental cause of most diabetes complications is the increased blood or tissue glucose
level resulting from insufficient secretion of insulin. Maintaining good control of blood glucose
is the most essential way to prevent or slow the progression of all complications. The most
common measure of glycemia is HbA1c, glycated hemoglobin, which reflects the
average blood glucose concentration over the preceding 2-3 months. HbA1c is
one of the top risk factors for diabetes complications and is also recommended as a screening
test for T2D (Allan et al., 2013). Many studies have been conducted to better understand the biological basis of
between-person variation in HbA1c, as there are sometimes discrepancies between HbA1c and other
glycemia measurements (Paterson et al., 2010).
Most chronic complications of T1D can be classified as eye disease (retinopathy), nerve
disease (neuropathy), renal/kidney disease (nephropathy) and cardiovascular disease
(cardiopathy). Eye, nerve and renal diseases are microvascular complications, and cardiovascular
disease is a macrovascular complication. Diabetic retinopathy is the main cause of
vision loss in people with diabetes. It was estimated that, among the 246 million people with diabetes
in 2010, one third had signs of retinopathy (Cheung et al., 2010). Diabetic neuropathy is the most
common complication, and about half of people with diabetes in 2015 had developed some form of
nerve disorder (Razmaria, 2015). Nerve disorders present as numbness or pain in the
limbs and can also affect internal organs such as the heart. Usually the nerves of the feet are the first to
be affected, so examination of foot nerves can detect early signs of neuropathy. Renal
disease is one of the underlying causes of morbidity and mortality in T1D among related
complications. The kidney is one of the organs targeted by high blood
glucose and, moreover, the appearance and progression of kidney disease are closely related to
other complications such as cardiovascular disease (de Boer and DCCT/EDIC Research Group,
2014). It is reported that about 30%-35% of patients with T1D and T2D develop renal
complications over their lifetime (Thomas and Karalliedde, 2019). Although advanced
renal disease or renal failure might occur many years after the onset of diabetes, diabetic renal
disease is the largest cause of end-stage renal disease worldwide (Ghaderian et al., 2015). Diabetic
cardiovascular disease includes coronary heart disease, cerebrovascular disease and peripheral
artery disease; among these, heart disease is the most common cause of death. The prevalence
of cardiovascular disease is not as high as that of other chronic complications (Task Force on Diabetes
and Cardiovascular Diseases of the European Society of Cardiology and European Association for
the Study of Diabetes, 2007). However, as concluded by a large population-based cohort study of
the causes of mortality in people with T1D, cardiovascular disease becomes the leading cause of
death about 10 years after onset, eventually accounting for about 40% of deaths after 20 years
of duration (Secrest et al., 2010).
1.3 Motivation
1.3.1 Genome-wide Association Study
Genome-wide association studies (GWAS) assess the association between genetic variations and
traits. With the development of the Human Genome Project and cost-effective genotyping
techniques for DNA, the amount of available information from DNA, usually in the form of single
nucleotide polymorphisms (SNPs), is rapidly increasing. This provides more opportunities to find
associations between genetic variations and human traits under the conventional GWAS
significance threshold, $p < 5 \times 10^{-8}$. A lot of effort is being devoted to both common
complex diseases and quantitative traits, which can be affected by both genes and
environment.
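As a worked illustration of where the conventional threshold comes from, it is commonly motivated as a Bonferroni correction of a 0.05 family-wise error rate. The figure of roughly one million independent tests is an assumption taken from common practice rather than from the text above:

```latex
\alpha_{\text{GWAS}} \;=\; \frac{\alpha_{\text{family-wise}}}{\text{number of independent tests}}
\;\approx\; \frac{0.05}{10^{6}} \;=\; 5 \times 10^{-8}
```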
There are hundreds to thousands of genes on each of 23 pairs of chromosomes, containing
information from around 3 billion base pairs. The typical genotyping technique usually provides
about 0.5M to 2.5M SNPs. Many tools, such as PLINK v1.9, have been developed to
conduct GWAS between genetic markers and cross-sectional outcomes at a notable speed despite
the large number of SNPs (Purcell et al., 2007). The usual statistical methods used by PLINK are,
for example, a genetic association test for case/control data (logistic regression) or standard
linear regression of quantitative traits assuming an additive model.
1.3.2 Repeated Measurements from a Longitudinal Study
Repeated measurements from a longitudinal study add complexity to GWAS. GWAS has been
popular in cross-sectional studies such as case-control studies of cancer. However, clinical trials
and some observational studies on chronic diseases like diabetes are often longitudinal.
The change or trajectory in measurements reveals more information than a single measurement
at one time point. The simplest way to deal with repeated measurements from a longitudinal study
is to use one summary statistic, such as the mean, the median or the measurement at a predetermined
time point, to reduce the dataset to a cross-sectional one. However, much information is
wasted by directly collapsing records that contain important repeated measures from participants.
One of the earliest proposed approaches is repeated measures analysis of variance (RM-ANOVA),
an analysis of variance (ANOVA) that takes multiple correlated responses for each subject.
A traditional method for longitudinal studies in many fields such as anesthesiology and physical
education, RM-ANOVA differs from a standard ANOVA in that it takes into account the correlation
between repeated measurements. However, it has undesirable characteristics. First, the outcome
has to be quantitative and covariates can only be discrete. Second, it assumes that the correlation
between any two time points is constant, which might not be appropriate for traits
where, for example, the within-subject correlation decreases as the time interval increases. Finally,
it can only handle the same number of repeated measurements for every subject: subjects with even
one missing response will be excluded (Ma et al., 2012).
1.3.3 GWAS on Longitudinal Study
Conventional GWAS focuses on SNP effects from a cross-sectional perspective, such as how genetic
factors affect people’s susceptibility to a certain disease. With repeatedly measured data, GWAS
can be conducted to investigate longitudinal SNP effects. The SNPs stay the same over an
individual’s lifetime, but the trait distribution and effects of SNPs might change over time.
Cross-sectional and longitudinal SNP effects both provide important information on how genetic
variants and disease traits are associated. To respect the nature of chronic diseases and
make use of as much of the available information as possible, efficient ways to conduct GWAS on
longitudinal studies for both cross-sectional and longitudinal SNP effects are in demand.
1.4 Linear Mixed Effects Models
1.4.1 Introduction to Linear Mixed Effects Models
The linear mixed effects model, or linear mixed model (LMM), is a popular method to deal with
repeated measurements and outperforms RM-ANOVA in many respects. It allows different types
of covariates, continuous or categorical. It also allows the within-subject correlation to follow a
specified pattern, which produces models that better fit the data. Lastly, subjects with missing
responses and different numbers of visits can be included as long as the time intervals are correctly
specified in the model. Because LMM uses maximum likelihood estimation (MLE), it is robust
to data that are missing at random (MAR) (Sikorska et al., 2013b). The biggest challenge of fitting
LMMs for GWAS lies in computation.
The other popular method is generalized estimating equations (GEE). It shares the advantages
of LMM in that it allows different correlation structures via a working correlation matrix and it can
take different types of covariates. However, the main difference is that GEE only provides
population-level estimates without individual-level information from random effects. One of its
disadvantages is that it requires complete data or data missing completely at random (MCAR), as it
is not likelihood-based (Little and Rubin, 1987). This method is not adopted in this thesis
because it is not an outstandingly fast method and it is less robust than LMM in missing data
scenarios (Sikorska et al., 2013b).
1.4.2 General Form
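In general, a linear mixed model has the form (Verbeke and Molenberghs, 2000):
$$\begin{cases} Y_i = X_i\beta + Z_i b_i + \varepsilon_i \\ b_i \sim N(0, D) \\ \varepsilon_i \sim N(0, \Sigma_i) \\ b_1, \dots, b_N, \varepsilon_1, \dots, \varepsilon_N \text{ independent of each other} \end{cases} \qquad i = 1, \dots, N$$ (1.1)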
Assuming there are 𝑁 individuals, 𝑌𝑖 denotes the vector of responses for individual 𝑖. It has 𝑛𝑖
elements, one for each of the 𝑛𝑖 response measurements of this individual, and individuals do
not need to have the same number of measurements.
For covariates, 𝑋𝑖 is a known 𝑛𝑖 × 𝑝 matrix with 𝑝 columns representing 𝑝 covariates having
fixed effects on the response. The 𝑛𝑖 rows correspond to the measurements. The 𝑝-dimensional
vector 𝛽 contains the unknown fixed-effect parameters.
Similarly, 𝑍𝑖 is a known 𝑛𝑖 × 𝑞 matrix containing values for 𝑞 random-effect covariates and 𝑏𝑖 is
the 𝑞-dimensional vector of unknown random effects. 𝜀𝑖 is an 𝑛𝑖-dimensional vector of errors, the
deviations of the observed responses of individual 𝑖 from the model.
1.4.3 Model Assumptions
This form (Eq. 1.1) shows the basic assumptions for general use of LMM. At the same time, our
constructed LMMs are also based on all of the following assumptions.
First of all, the vector of response 𝑌𝑖 is linearly related to covariates. Some covariates have fixed
effects, meaning that they have a population-average effect which is the same for all individuals.
Some covariates have random effects, also called individual-specific effects. Models with random
effects allow different regressions to model the relationships between response and covariates for
different individuals. With random effects on the intercept or slope, a different
conditional model is obtained for each subject who has repeated measures over time. The example
individual-level models in Figure 1-1 are generated using a subset of the BodyWeight dataset in the
nlme package (Pinheiro et al., 2019). The figure shows four types of models fitted to the
relationship between weight (grams) and time (days) for 5 rats.
Figure 1-1 Random intercept or slope model examples. Same subset of 5 rats with id’s 1-5 selected, each
line representing one subject. Weight is in grams and time is in days.
Second, the random-effect vector 𝑏𝑖 contains random effects for 𝑞 covariates, and
it is usually assumed that 𝑏𝑖 follows a multivariate normal distribution with a mean of 0 and a
common 𝑞 × 𝑞 covariance matrix 𝐷 for every subject. The 𝑑𝑘𝑙 (𝑘 ≠ 𝑙) element in this matrix is the
covariance between the random effects for the 𝑘th and 𝑙th covariates, and thus 𝑑𝑘𝑙 = 𝑑𝑙𝑘.
Third, the error vector 𝜀𝑖 follows a multivariate normal distribution with a mean of 0 and an
individual-specific 𝑛𝑖 × 𝑛𝑖 covariance matrix Σ𝑖, which accounts for any deviation between the
model and the response. The covariance matrix depends on 𝑖 only through its dimension 𝑛𝑖; its
elements follow the same structure for all subjects. In other words, if the number of measurements
is the same for all subjects, this covariance matrix can be written as Σ.
In addition, the random effects and the errors are assumed to be independent of each other and
across individuals.
Under all these assumptions, the marginal distribution of the response vector is multivariate
normal:
𝑌𝑖~𝑁(𝜇𝑖 = 𝑋𝑖𝛽, 𝑉𝑖 = 𝑍𝑖𝐷𝑍’𝑖 + 𝛴𝑖) (1.2)
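To make Eq. 1.2 concrete, the following minimal Python sketch builds 𝑉𝑖 for a random intercept and random slope model with independent errors. The visit times, the matrix D and the error variance are made-up illustrative values, and the helper names are hypothetical.

```python
# Sketch: marginal covariance V_i = Z_i D Z_i' + Sigma_i (Eq. 1.2) for a model
# with a random intercept and a random slope of time.
# All numbers below are illustrative assumptions, not values from any study.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def marginal_covariance(times, d, sigma2):
    """Return V_i given visit times, a 2x2 random-effect covariance d and
    independent errors with variance sigma2 (Sigma_i = sigma2 * I)."""
    z = [[1.0, t] for t in times]            # columns of Z_i: intercept, time
    v = matmul(matmul(z, d), transpose(z))   # Z_i D Z_i'
    for k in range(len(times)):
        v[k][k] += sigma2                    # add Sigma_i on the diagonal
    return v


V = marginal_covariance([0.0, 1.0, 2.0], [[4.0, 1.0], [1.0, 2.0]], 1.0)
for row in V:
    print(row)
```

With a random slope, both the variances on the diagonal and the covariances between visits change with time, which is the kind of pattern RM-ANOVA cannot represent.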
1.5 Fast Methods for Longitudinal Data
This thesis is inspired by several papers published since 2012 by Dr. Karolina Sikorska and her
colleagues on fast methods for analyzing longitudinal data for GWAS as a replacement for LMM
(Sikorska et al., 2013b; Sikorska et al., 2013a; Sikorska et al., 2015; Sikorska et al., 2018). LMM is a
relatively mature way to deal with repeated measurements, so it is treated as the gold standard
model in this series of papers. Several fast methods were proposed, with implementable code
provided in appendices. Here I review the fast methods in the order in which they were proposed.
1.5.1 Study Background
The papers used data from the Rotterdam study, a prospective cohort study initiated in 1990
which focuses on a series of diseases frequent in elderly people. By the end of 2008, the cohort
comprised 14,926 participants aged 45 years or over who lived in the study district of Rotterdam,
and an extension started in 2016 to include participants aged 40 years and over (Ikram et al., 2017).
The data for Sikorska et al. (2013b) included 4,987 participants who were scheduled to have
visits at baseline and after 2, 6 and 12 years to take femoral neck bone mineral density (BMD)
measurements. Of these, 4,933 had at least one visit. The number of individuals at each visit
decreased and the missing rate increased over time. The numbers of individuals with 1, 2, 3 or 4
visits were relatively even, corresponding to a missing response rate of 34.5% among these 4,933
individuals, as shown in Table 1-1.
Table 1-1 Rotterdam study: number of individuals with K non-missing responses.
K    Women    Men    Combined
4    679      554    1233
3    833      659    1492
2    759      552    1311
1    543      354    897
Note: Adapted from Table II in “Fast linear mixed model computations for genome‐wide
association studies with longitudinal data” by K. Sikorska et al., 2013, Statistics in
Medicine, 32, 165-180, Copyright 2012 John Wiley & Sons, Ltd (Sikorska et al., 2013b).
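The 34.5% missing response rate can be reproduced from the Combined column of Table 1-1, assuming four planned visits per individual as described above. A short Python check (variable names are illustrative):

```python
# Individuals with K non-missing responses (Combined column of Table 1-1).
counts = {4: 1233, 3: 1492, 2: 1311, 1: 897}
planned_visits = 4  # baseline and follow-up after 2, 6 and 12 years

n_individuals = sum(counts.values())
observed = sum(k * n for k, n in counts.items())   # responses actually recorded
possible = planned_visits * n_individuals          # responses if no visit were missed
missing_rate = 1 - observed / possible

print(f"{n_individuals} individuals, missing rate = {missing_rate:.1%}")
# prints "4933 individuals, missing rate = 34.5%"
```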
In the first paper (Sikorska et al., 2013b), the real longitudinal BMD data and simulated data
generated based on the Rotterdam study were analyzed by slope as outcome
(SAO), the two-step method (TS), the conditional two-step method (CTS), standard LMM and two
GEEs with different working correlation matrices. The simulation study had a full factorial design
over sample size (N=500, 1000, 3000), cross-sectional SNP effect (b=0, 0.005), longitudinal
SNP effect (b=0, 0.008) and missing patterns (complete, missing completely at random and
missing at random). An additive genetic model was assumed for SNPs in the simulation study, with
dosage generated from a uniform distribution (MAF=0.5). Without any additional covariates in their
simulation study model, the standard LMM has the form:
𝑌𝑖𝑗 = 𝛽0 + 𝛽1𝑆𝑖 + 𝛽2𝑡𝑖𝑗 + 𝛽3𝑆𝑖𝑡𝑖𝑗 + 𝑏0𝑖 + 𝑏1𝑖𝑡𝑖𝑗 + 𝜖𝑖𝑗, 𝑗 = 1,⋯ , 𝑛𝑖 , 𝑖 = 1,⋯ ,𝑁 (1.3)
where β1 and β3 are the cross-sectional effect at baseline (time 0) and the longitudinal effect of a
SNP, respectively, β2 and 𝑏1𝑖 are the fixed and random slopes of time, and 𝑏0𝑖 is the random
intercept. 𝑆𝑖 is the SNP coded as 0, 1 or 2, representing the number
of copies of the minor allele, and is incorporated as a continuous variable in the model.
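As an illustration of how data can be generated from Eq. 1.3, the following Python sketch simulates the responses of one subject. It is not the simulation code of Sikorska et al.: the function name, the intercept and time slope values and the standard deviations are hypothetical, and the random intercept and slope are drawn independently for simplicity, whereas the general model allows them to be correlated through D.

```python
import random


def simulate_subject(snp, times, beta, re_sd, sigma, rng):
    """Simulate one subject's responses from Eq. 1.3.

    beta  = (beta0, beta1, beta2, beta3) as in Eq. 1.3
    re_sd = (SD of random intercept, SD of random slope), drawn independently
            here (a simplifying assumption)
    sigma = SD of the independent residual errors
    """
    b0i = rng.gauss(0.0, re_sd[0])
    b1i = rng.gauss(0.0, re_sd[1])
    return [
        beta[0] + beta[1] * snp + beta[2] * t + beta[3] * snp * t
        + b0i + b1i * t + rng.gauss(0.0, sigma)
        for t in times
    ]


rng = random.Random(2019)
visit_times = [0, 2, 6, 12]             # years, as in the Rotterdam design
beta = (1.0, 0.005, -0.01, 0.008)       # beta0 and beta2 are made-up values
y = simulate_subject(snp=1, times=visit_times, beta=beta,
                     re_sd=(0.1, 0.01), sigma=0.05, rng=rng)
print(y)
```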
With the LMM as the standard, three two-stage methods, slope as outcome (SAO), the two-step
method (TS) and the conditional two-step method (CTS), were presented as fast approximations to
find genetic associations with the evolution of traits over time (Sikorska et al., 2013b). In 2018,
another fast method for LMM called GALLOP (Genome-wide Analysis of Large-scale Longitudinal
Outcomes using Penalization) was proposed and evaluated in a similarly designed simulation study
with application to the Rotterdam data (Sikorska et al., 2018).
1.5.2 Slope as Outcome Method (SAO)
Slope as outcome is one of the simplest ways to deal with longitudinal data. The main idea of a
two-stage analysis is to use summary statistics to replace the multiple observations for each
individual (Verbeke and Molenberghs, 2000). For example, the mean of the response variable can
be used as the outcome, which makes a single linear regression per SNP possible and efficient.
Here the per-individual slope over time is the statistic used.
Step 1:
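$$Y_{ij} = \beta_{0i}^{\Delta} + \beta_{1i}^{\Delta}\,\mathrm{time}_{ij} + \beta_{2i}^{\Delta}\,\mathrm{Cov}_{ij} + \cdots + e_{ij}^{\Delta}, \quad i = 1, 2, \dots, N, \; j = 1, 2, \dots, n_i$$ (1.4)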
In the first step, we fit a linear regression for each individual by ordinary least squares
to find the slope of time $\beta_{1i}^{\Delta}$. Time-varying covariates can be added in this step to adjust the
association between time and outcome. Time-constant covariates have no role in this model because,
within an individual, they are absorbed into the intercept and leave the slope $\beta_{1i}^{\Delta}$ unaffected.
Step 2:
$$\beta_{1i}^{\Delta} = \beta_{0}^{\Delta\Delta} + \beta_{1}^{\Delta\Delta}\,\mathrm{snp}_i + e_{i}^{\Delta\Delta}, \quad i = 1, 2, \dots, N$$ (1.5)
In the second step, SNPs are tested one at a time for association with the slope.
The idea is that the slope obtained in the first step contains the information on how the outcome
changed over time. To be explicit, with each unit increase in time, the outcome increases by
$\beta_{1i}^{\Delta}$ units (Eq. 1.4). It is then reasonable to believe that this change over time might be partly
explained by the SNP effect, which is the coefficient $\beta_{1}^{\Delta\Delta}$ in the second linear regression model
(Eq. 1.5). This regression is also fitted by ordinary least squares. Now, our null
hypothesis on the longitudinal effect of SNPs becomes:
$$H_0: \beta_{1}^{\Delta\Delta} = 0.$$
By testing this $H_0$ at a specified significance level, the p-value indicating the significance
of the association between the SNP and the outcome trait can be easily calculated from the selected
test statistic. Here the Wald test is the default in the nlme package (Pinheiro et al., 2019). With
estimate $\hat{\beta}_{1}^{\Delta\Delta}$ and its estimated standard error $\widehat{SE}$, a Z test statistic assumed to follow a
standard normal distribution can be calculated as $Z = (\hat{\beta}_{1}^{\Delta\Delta} - 0)/\widehat{SE}$, and a p-value can be obtained.
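The two SAO steps can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (no covariates other than time in step 1, every subject has at least two visits at distinct times, and the step 2 residuals are not all zero), not the implementation used in the cited papers; the function names are hypothetical.

```python
import math


def ols_slope(x, y):
    """Slope and its standard error from a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx) if n > 2 else float("nan")
    return slope, se


def slope_as_outcome(times, responses, snps):
    """times[i], responses[i]: visit times and outcomes of subject i;
    snps[i]: SNP value (0, 1, 2 or a dosage) of subject i."""
    # Step 1 (Eq. 1.4): one OLS fit per subject, keeping only the slope of time.
    slopes = [ols_slope(t, y)[0] for t, y in zip(times, responses)]
    # Step 2 (Eq. 1.5): regress the per-subject slopes on the SNP.
    beta, se = ols_slope(snps, slopes)
    z = beta / se                              # Wald statistic for H0: beta = 0
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided p-value under N(0, 1)
    return beta, se, z, p
```

In a GWAS, step 1 is run once and only step 2 is repeated for every SNP, which is where the speed gain over fitting one LMM per SNP comes from.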
1.5.3 Two-Step Method (TS)
The TS method works under the same idea as SAO, except that the slopes of
time are obtained differently.
Step 1:
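$$Y_{ij} = \beta_{0}^{*} + \beta_{1}^{*}\,\mathrm{time}_{ij} + \beta_{2}^{*}\,\mathrm{Cov}_{i(j)} + \cdots + b_{0i}^{*} + b_{1i}^{*}\,\mathrm{time}_{ij} + b_{2i}^{*}\,\mathrm{Cov}'_{ij} + \cdots + e_{ij}^{*}, \quad i = 1, 2, \dots, N, \; j = 1, 2, \dots, n_i$$ (1.6)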
In the first step, one LMM is fit without any SNP terms. It is the model in Eq. 1.3 with the SNP
terms omitted. In TS the model can be adjusted for all other covariates with fixed or random effects
on the outcome.
Next, the subject-specific random slopes of time are used in the same way as the separately
estimated slopes in SAO; they may contain information on how genetic predictors have changing
effects over time. So the best linear unbiased predictions (BLUPs) of the random slopes are used as
responses in a linear regression on SNPs.
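For reference, in the notation of Eq. 1.1 and Eq. 1.2 the predicted random effects have the standard closed form given by Verbeke and Molenberghs (2000), with the unknown parameters replaced by their estimates. The random slope used as the response in TS is the element of this vector that corresponds to time:

```latex
\hat{b}_i = \hat{D} Z_i' \hat{V}_i^{-1} \left( y_i - X_i \hat{\beta} \right),
\qquad
\hat{V}_i = Z_i \hat{D} Z_i' + \hat{\Sigma}_i
```

Because each predicted slope is shrunk towards zero by an amount that depends on the number of measurements of the subject, the predicted slopes are less variable than the individually fitted slopes of SAO.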
Step 2:
$$b_{1i}^{*} = \beta_{0}^{**} + \beta_{1}^{**}\,\mathrm{snp}_i + e_{i}^{**}, \quad i = 1, 2, \dots, N$$ (1.7)
Now, the null hypothesis on the longitudinal effect of SNPs becomes:
$$H_0: \beta_{1}^{**} = 0.$$
If there is no SNP effect and no SNP-time interaction effect, the model in the first step is
correctly specified. Additionally, a selected covariance structure for the errors can be used in the
first-step model to improve its fit.
1.5.4 Conditional Two-Step (CTS) Method
1.5.4.1 Conditional Linear Mixed Model
In longitudinal studies, researchers are usually interested in whether the progression of a
quantitative trait is caused by some changing factors over time. Apart from the longitudinal effects
from changing factors, the difference between individuals may also lie in some characteristics
which are constant since baseline, which are cross-sectional effects. Although in such studies
longitudinal effects are of more interest, omitting cross-sectional effects would be a serious
misspecification of the model. The estimate and inference on longitudinal effects can be biased
when cross-sectional effects are relatively large and mis-specified in the model (Verbeke and
Molenberghs, 2000).
By applying a conditional linear mixed effects model, we can remove all cross-sectional effects from
the model by conditioning on their sufficient statistics. The advantage of this method is that we
can obtain inference on the parameters of interest without loss of information and we do not have
to deal with nuisance parameters. The disadvantage, however, is that all information on
cross-sectional effects is lost, including subject-specific effects. But if it is justified that
longitudinal effects should be the main focus in longitudinal studies, the benefits can outweigh the
costs.
In the context of this thesis, the conditional LMM serves to obtain random slopes
as the first step in CTS. The random slopes generated from the conditional LMM are then
processed in the same way as in the SAO/TS methods: they are used to investigate whether the
random change over time can be explained by genetic information.
1.5.4.2 Data Transformation for Conditional Inference
A background on conditional inference is provided here in order to apply the conditional linear
mixed model. This approach is an alternative to classical MLE. When an LMM is given as in
(Eq. 1.1), the $b_i$ are usually not of primary interest, so they are treated as nuisance parameters.
The conditional approach makes MLE conditional on sufficient statistics for the nuisance
parameters $b_i$, and the sufficient statistic is selected as $Z_i' y_i$.
Conditioning on $Z_i' y_i$, the conditional density can be used to obtain estimates for the relevant
parameters such as $\beta$ and $\sigma$ by maximizing the conditional likelihood
$\prod_{i=1}^{N} f_i(y_i \mid Z_i' y_i, \beta, \sigma^2)$. It was then shown that, for any full rank
$n_i \times (n_i - q)$ matrix $A_i$ which satisfies $A_i' Z_i = 0$, the conditional
approach is equivalent to transforming the outcome vectors $y_i$ by this matrix $A_i$. In addition, it
is more convenient if $A_i$ is selected to satisfy $A_i' A_i = I_{n_i - q}$ so that the transformed $y_i$
follows a normal distribution, $A_i' y_i \sim N(A_i' X_i \beta, \sigma^2 I_{n_i - q})$.
Because more emphasis is put here on time-varying effects among both fixed and random effect
variables, the model is rewritten in this form:
$$y_i = X_i^{(1)}\beta^{(1)} + X_i^{(2)}\beta^{(2)} + Z_i^{(1)} b_i^{(1)} + Z_i^{(2)} b_i^{(2)} + e_i$$ (1.8)
The design matrices $X_i$ and $Z_i$ have each been split into a cross-sectional part and a time-varying
part. A superscript (1) indicates the cross-sectional part and a superscript (2) indicates the
longitudinal part.
Because $X_i^{(1)}$ contains cross-sectional covariates, it can be expressed as $X_i^{(1)} = 1_{n_i} x_i'$. Similarly,
$Z_i^{(1)} = 1_{n_i}$ because $b_i^{(1)}$ is the random intercept for the $i$th subject. Lastly, $b_i^{(2)}$ are random
slopes for longitudinal covariates and $Z_i^{(2)}$ is time here.
In this case, the only nuisance parameter is $b_i^{(1)}$, the subject-specific intercept, because we are
counting on the random slopes $b_i^{(2)}$ to provide longitudinal effects of SNPs. Analogously to $Z_i' y_i$,
the sufficient statistic for $b_i^{(1)}$ is $Z_i^{(1)\prime} y_i = \sum_j y_{ij}$, or equivalently $\bar{y}_i = \sum_j y_{ij} / n_i$.
Furthermore, because only one parameter per subject is treated as a nuisance parameter, a full rank
$n_i \times (n_i - 1)$ matrix $A_i$ which satisfies $A_i' Z_i^{(1)} = A_i' 1_{n_i} = 0$ is needed to transform the data
to make conditional inference on the other parameters.
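By premultiplying both sides of (Eq. 1.8) by $A_i'$, we have that
$$A_i' y_i = A_i' 1_{n_i} x_i' \beta^{(1)} + A_i' X_i^{(2)} \beta^{(2)} + A_i' 1_{n_i} b_i^{(1)} + A_i' Z_i^{(2)} b_i^{(2)} + A_i' e_i$$ (1.9)
Because $A_i' 1_{n_i} = 0$, the first and third terms vanish, so this is equivalent to
$$y_i^{*} \equiv A_i' y_i = X_i^{*} \beta^{(2)} + Z_i^{*} b_i^{(2)} + e_i^{*}$$ (1.10)
where $X_i^{*} = A_i' X_i^{(2)}$, $Z_i^{*} = A_i' Z_i^{(2)}$ and $e_i^{*} = A_i' e_i$.
In addition, $A_i$ is selected to satisfy $A_i' A_i = I_{n_i - 1}$ so that the variance of $e_i^{*}$ is $\sigma^2 I_{n_i - 1}$.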
Now all cross-sectional parts, including the random intercepts for each individual, are removed
from the model. The remaining parameters can then be estimated by fitting an
LMM to the transformed data. The R code for finding the 𝐴𝑖 matrix to transform the data is taken
from the paper by Sikorska et al. (2013b).
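One matrix with the two required properties can be built from normalized Helmert contrasts. The Python sketch below is an illustrative construction only and is not the R code from Sikorska et al.; any other full rank matrix with orthonormal columns that are orthogonal to the vector of ones would serve equally well. The function names are hypothetical.

```python
import math


def helmert_contrasts(n):
    """Return an n x (n-1) matrix A (list of rows) with A'1 = 0 and A'A = I."""
    a = [[0.0] * (n - 1) for _ in range(n)]
    for k in range(1, n):
        # Column k-1 contrasts the mean of the first k entries with entry k.
        norm = math.sqrt(k * (k + 1))
        for r in range(k):
            a[r][k - 1] = 1.0 / norm
        a[k][k - 1] = -k / norm
    return a


def transform(a, y):
    """Compute A'y for a vector y with one element per visit."""
    return [sum(a[r][c] * y[r] for r in range(len(y))) for c in range(len(a[0]))]


A = helmert_contrasts(4)
y = [0.91, 0.90, 0.86, 0.84]                  # made-up responses at 4 visits
shifted = [v + 10.0 for v in y]               # same subject with a different intercept
print(transform(A, y))
print(transform(A, shifted))                  # equal up to rounding error
```

The second transformed vector equals the first up to rounding error, which illustrates why anything that is constant within a subject, including the random intercept and all cross-sectional effects, drops out of the transformed model.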
1.5.4.3 Steps after Data Transformation
Step 1:
$$y_{i}^{*} = \beta^{(2)}\,\mathrm{time}_{ij}^{*} + b_{i}^{(2)}\,\mathrm{time}_{ij}^{*} + e_{i}^{*}, \quad i = 1, 2, \dots, N$$ (1.11)
No time-varying covariates were used in our model, therefore the only covariate left in the
transformed data is time. By specifying a model without any fixed or random intercepts, an LMM
is fit to the transformed data in order to obtain random slopes for each subject. As in the previous
fast methods, longitudinal information on SNPs is assumed to be contained in the estimates of these
random slopes $b_i^{(2)}$.
Step 2:
$$b_{i}^{(2)} = \beta_{0}^{***} + \beta_{1}^{***}\,\mathrm{snp}_i + e_{i}^{***}, \quad i = 1, 2, \dots, N$$ (1.12)
Now, by fitting a least squares linear regression on the random slopes, the null hypothesis on the
longitudinal effect of SNPs becomes:
$$H_0: \beta_{1}^{***} = 0.$$
Compared with SAO and TS, the major advantage of CTS is that misspecification of the
cross-sectional part no longer affects the estimation of the remaining parameters.
Shortly afterwards the authors reported additional work on linear and logistic regression to
access SNP data faster and to speed up fitting many regressions (Sikorska et al., 2013a). In
2015, Sikorska et al. combined their semi-parallel fast linear regression with CTS, showing that
the computation time for GWAS was reduced from several weeks to a few minutes on a desktop
while the accuracy remained well controlled (Sikorska et al., 2015).
1.5.5 Genome-wide Analysis of Large-scale Longitudinal Outcomes using
Penalization (GALLOP)
In 2018 Sikorska and her colleagues developed a new algorithm named GALLOP as a fast
replacement for LMM. A common feature of the SAO, TS and CTS methods is that they cannot
provide any inference on the cross-sectional SNP effect. This is because all of them reduce the
dimension of the data by taking one summary statistic, the slope, from each individual using
different methods. While the longitudinal SNP effect is the main focus, the loss of cross-sectional
information is still a defect not to be neglected, and additional LMMs need to be run if the main
effect is still required. With GALLOP, both cross-sectional and longitudinal SNP effects can be
obtained at the same time for a comparison with LMM results. In addition, its speed is similar to
that of the other fast methods.
Currently the usual way of getting the MLE of parameters in an LMM, including fixed effects,
random effects and their variances, is via iterative algorithms like Newton-Raphson. It is claimed
that, with variances known, both random and fixed effects can be estimated at the same time by
solving a penalized least squares problem in the form of a system of equations. This system is
Henderson’s system of equations for LMM. To get Henderson’s system of equations, the BLUPs of
random and fixed effects in LMM are obtained by setting the partial derivatives of the
log-likelihood function to 0 with respect to random effects first and fixed effects second.
Therefore estimating BLUPs is equivalent to solving Henderson’s system of equations
(Henderson, 1950):
$$\begin{cases} X'\Sigma^{-1}X\hat{\beta} + X'\Sigma^{-1}Z\hat{b} = X'\Sigma^{-1}y \\ Z'\Sigma^{-1}X\hat{\beta} + (Z'\Sigma^{-1}Z + D^{-1})\hat{b} = Z'\Sigma^{-1}y \end{cases}$$ (1.13)
The penalized least squares problem in GALLOP is based on this form by setting the error variance
to $\Sigma_i = \sigma^2 I_{n_i}$, which gives the simplified system of equations (Sikorska et al., 2018):
$$\begin{cases} X'X\hat{\beta} + X'Z\hat{b} = X'y \\ Z'X\hat{\beta} + (Z'Z + P)\hat{b} = Z'y \end{cases}$$ (1.14)
where $P = \mathrm{diag}(P_i)$ and $P_i = (D/\sigma^2)^{-1}$.
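To show what solving Eq. 1.14 involves, the following Python sketch sets up and solves the system for the simplest possible case: one fixed intercept, one random intercept per subject and a known variance ratio. This is a toy illustration, not the GALLOP algorithm, which adds a random slope and SNP terms and reuses parts of the matrix inversion; the function names and the data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def random_intercept_pls(groups, lam):
    """Solve Eq. 1.14 for a fixed intercept plus one random intercept per subject.

    groups: one list of responses per subject
    lam:    sigma^2 / d, i.e. P_i = (D / sigma^2)^{-1} when D is the scalar d
    """
    n_sub = len(groups)
    n_obs = sum(len(g) for g in groups)
    # First row: X'X and X'Z on the left, X'y on the right.
    lhs = [[float(n_obs)] + [float(len(g)) for g in groups]]
    rhs = [float(sum(sum(g) for g in groups))]
    # Remaining rows: Z'X and Z'Z + P on the left, Z'y on the right.
    for i, g in enumerate(groups):
        row = [float(len(g))] + [0.0] * n_sub
        row[1 + i] = len(g) + lam            # Z'Z is diagonal; P adds lam to it
        lhs.append(row)
        rhs.append(float(sum(g)))
    sol = solve(lhs, rhs)
    return sol[0], sol[1:]                   # (beta_hat, [b_hat_1, ..., b_hat_N])


beta_hat, b_hat = random_intercept_pls([[1.0, 3.0], [5.0, 7.0]], lam=2.0)
print(beta_hat, b_hat)
```

In this example the subject means are 2 and 6 and the overall mean is 4, so the unpenalized deviations would be -2 and 2; the penalty shrinks them to -1 and 1.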
Under the assumptions of known variance σ² and the above form of error variance, this method
starts with a mis-specified LMM that omits the SNP terms. This step provides the estimated
variances needed to calculate an estimated penalty component 𝑃, which is necessary to solve the
penalized least squares problem. It is additionally assumed that SNP effects are usually so small
that adding them to the model later will not change the variances or 𝑃 by much.
In the next step, a SNP is added to the design matrix 𝑋 in the form of both cross-sectional and
longitudinal effects, adding two more parameters to be solved and two equations to the previous
system of equations. However, parts of the large matrix inversion necessary for the
solution of the current system can be taken from the solution of the previous system of equations
without SNPs. By calculating these components in advance for repeated use,
considerable computation time can be saved.
Finally, the outcomes of regression parameter estimation, standard error and Wald test p-value can
all be calculated by matrix operation. The detailed algorithm of GALLOP and implementation in
R code are provided in the supplementary information of the original paper (Sikorska et al., 2018).
1.5.6 Methods Performance
Through both simulation and application data analysis it was found that, among SAO, TS and CTS,
CTS best resembled the LMM inference on the longitudinal SNP effect in two respects: the highest
accuracy, in terms of power nearest to LMM under a controlled type 1 error (T1E, α = 5%), and the
shortest processing time (Sikorska et al., 2013b; Sikorska et al., 2018). Tables 1-2, 1-3 and 1-4
cite the T1E, power and time comparisons from the paper (Sikorska et al., 2013b).
Table 1-2 Rotterdam study: Type 1 error comparison.
Note: Adapted from Table IX in “Fast linear mixed model computations for genome-wide association
studies with longitudinal data” by K. Sikorska et al., 2013, Statistics in Medicine, 32, 165-180, Copyright
2012 John Wiley & Sons, Ltd (Sikorska et al., 2013b). β1: cross-sectional SNP effect. β3: longitudinal SNP
effect.
Table 1-3 Rotterdam study: Power comparison.
Note: Adapted from Table XI in “Fast linear mixed model computations for genome-wide association
studies with longitudinal data” by K. Sikorska et al., 2013, Statistics in Medicine, 32, 165-180, Copyright
2012 John Wiley & Sons, Ltd (Sikorska et al., 2013b). β1: cross-sectional SNP effect. β3: longitudinal
SNP effect.
Table 1-4 Rotterdam study: Speed comparison.
Note: Adapted from Table XII in “Fast linear mixed model computations for genome-wide association
studies with longitudinal data” by K. Sikorska et al., 2013, Statistics in Medicine, 32, 165-180, Copyright
2012 John Wiley & Sons, Ltd (Sikorska et al., 2013b).
GALLOP was compared with LMM and CTS in a similar simulation study
based on the same Rotterdam study data. CTS was 15 times faster than GALLOP at producing
p-values for the longitudinal SNP effect, but for CTS most of the analysis time was spent on data
access (Sikorska et al., 2018). It was concluded that GALLOP is a very efficient method providing
practically the same estimates and p-values as LMM for both cross-sectional and longitudinal SNP
effects.
1.5.7 Concerns
The Sikorska et al. papers provide great inspiration for potentially useful tools to efficiently
conduct GWAS for longitudinal studies. However, there are some concerns about their simulation
study and motivating Rotterdam study that might prevent the widespread application of their
proposed fast methods. Here we study various aspects of real data types that require additional
evaluation across methods.
First, the SNP data in the Rotterdam simulation study were generated randomly from a uniform
distribution from 0 to 2, so that a SNP is equally likely to take any value between 0 and
2. Under the additive model, a genotyped SNP usually has only three values: 0 (homozygous for the
major allele), 1 (heterozygous) and 2 (homozygous for the minor allele). For
imputed SNPs we have probabilities for the 3 possible genotypes. SNPs in most simulation studies
follow Hardy-Weinberg Equilibrium (HWE), which states that under five conditions (no mutation,
no gene flow, large population size, random mating, and no natural selection), the proportions of
0, 1 and 2 based on Minor Allele Frequency MAF = p (𝑝 ≤ 50%) follow:
$$\Pr(SNP = 0) = (1 - p)^2, \quad \Pr(SNP = 1) = 2p(1 - p), \quad \Pr(SNP = 2) = p^2$$
This results in the SNP matrix for subjects usually being sparse with most elements being 0,
especially at a low MAF. However, an imputed SNP can have a “dosage” value between 0 and 2
because it is a weighted average of all three possible genotypes. A SNP generated from a uniform
distribution from 0 to 2 can be seen as an imputed SNP with MAF = 0.5. (There are also
other ways to decide the genotype from imputed SNPs, such as taking the genotype with the largest
probability if it is larger than a pre-specified threshold and calling a missing genotype otherwise.
Typically, dosage is used.) This motivates us to determine how SNPs generated from a
range of MAFs influence the T1E and power in a simulation study.
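The difference between the two ways of generating SNPs can be sketched in Python as follows. The function names are hypothetical, and the genotype generator simply draws two independent alleles, which is one common way to obtain HWE proportions.

```python
import random


def hwe_probabilities(maf):
    """Genotype probabilities for 0, 1 and 2 copies of the minor allele under HWE."""
    return ((1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2)


def simulate_genotypes(n, maf, rng):
    """Draw n genotypes as the sum of two independent alleles, each minor with probability maf."""
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]


def simulate_uniform_dosage(n, rng):
    """Draw n dosages uniformly on [0, 2], as described for the Rotterdam-based simulation."""
    return [rng.uniform(0.0, 2.0) for _ in range(n)]


rng = random.Random(1)
print(hwe_probabilities(0.05))                 # most subjects carry no minor allele
genotypes = simulate_genotypes(10000, 0.05, rng)
print([genotypes.count(g) / len(genotypes) for g in (0, 1, 2)])
```

At MAF = 0.05 about 90% of the simulated genotypes are 0, which is the sparsity that a uniform dosage cannot reproduce.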
Variation in MAF leads to the second concern, T1E. Research (Ma et al., 2013) on
GWAS testing association between low count variants and case-control outcomes, using both MAF
and expected minor allele count (E[MAC]), found that for low count variants (E[MAC] <
400, MAF < 0.01 for N = 20,000) the Wald test is very conservative at a nominal T1E rate of
α = 5 × 10⁻⁸. We were therefore also concerned that inflated T1E might occur for quantitative
outcomes when MAF is low. An inflated T1E can in turn inflate the power of the
test.
The third concern is that the Rotterdam simulation study did not account for any within-subject
error correlation, as the errors were generated independently for each visit of each subject. This is
partly because of their selection of the lme4 package (Bates et al., 2015). The lme4 package does
not provide an option for error covariance structure, nor does it generate p-values for association
testing. We would like to see how performance changes when a different covariance structure for
the errors is applied. We selected the nlme package, which provides many options (Pinheiro et al.,
2019). In addition, it has been shown that there is no difference between the default p-values
generated by the nlme package and Wald-test p-values calculated from lme4 output, although lme4
was found to be faster than nlme (Sikorska et al., 2013b).
The fourth aspect to be further explored is the missing pattern. According to a classification of
missing data originating from Rubin and colleagues (Little and Rubin, 1987), assume there are
(more than one) X and Y variables for one subject, but only Y is subject to
nonresponse/missingness. Then missing mechanisms are classified by whether the probability of
response in Y (1) depends on Y and possibly X as well, (2) depends on X but not on Y, or (3) is
independent of X and Y. Today, these three mechanisms are usually summarized as:
1) Missing Not at Random (MNAR): The data are neither missing at random nor observed at
random;
2) Missing at Random (MAR): The observed values of Y are not necessarily a random
subsample of the sampled values, but they are a random sample of the sampled values
within subclasses defined by values of X;
3) Missing Completely at Random (MCAR): The observed values of Y form a random
subsample of the sampled values of Y.
The missing data mechanisms for MAR and MCAR are ignorable for likelihood-based inference,
but the MNAR mechanism is non-ignorable. The missing data scenarios in the simulation study
based on the Rotterdam study included dropout based on MCAR and MAR mechanisms, with
the overall missing rate controlled to be similar to that of the real data (~35%). We would like to
simulate both intermittent missingness at some time points and dropout (loss to follow-up). For
missingness at some time points, responses will be set to missing according to the three
mechanisms MCAR, MAR and MNAR. We chose this design because, apart from dropping out,
participants might also come back after missing a visit.
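One simple way to generate intermittent missingness under the three mechanisms is a logistic model for the probability that a response is missing. The Python sketch below is an assumption made for illustration and not necessarily the mechanism used in the cited simulation studies or later in this thesis; the function name, the `strength` parameter and the choice of a logistic link are all hypothetical.

```python
import math
import random


def missing_indicator(y, x, mechanism, base_rate, rng, strength=1.0):
    """Return True if the response y should be set to missing.

    MCAR: constant probability base_rate (must lie strictly between 0 and 1)
    MAR:  probability depends on the always-observed covariate x
    MNAR: probability depends on the response y itself
    The logistic form and the 'strength' coefficient are illustrative assumptions.
    """
    logit = math.log(base_rate / (1 - base_rate))
    if mechanism == "MAR":
        logit += strength * x
    elif mechanism == "MNAR":
        logit += strength * y
    elif mechanism != "MCAR":
        raise ValueError(f"unknown mechanism: {mechanism}")
    return rng.random() < 1 / (1 + math.exp(-logit))


rng = random.Random(35)
flags = [missing_indicator(0.0, 0.0, "MCAR", 0.35, rng) for _ in range(20000)]
print(sum(flags) / len(flags))    # close to the ~35% missing rate of the real data
```

Under MAR and MNAR the realized missing rate also depends on the distribution of x or y, so in practice the intercept would need to be tuned to reach a target overall rate.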
The last concern is that the fast methods SAO, TS and CTS only provide inference on the
longitudinal SNP effect and ignore the cross-sectional effect. Moreover, even with LMM and
GALLOP, the two SNP effects are tested in separate tests. One
alternative, inspired by Kraft et al. (Kraft et al., 2007), is to test both the marginal effect and the
interaction effect of a genetic variant and an environmental factor in one test. Driven by the need
to test genetic associations with disease under different environmental exposures, it is claimed that
a joint test combining tests on marginal and interaction effects is an optimal approach in most
situations and is more powerful than single tests when both effects exist. The benefit of applying a
2df (two degrees-of-freedom) test is that it avoids multiple-testing complications and difficulties in
interpreting results. We therefore provide Wald joint test results for GALLOP and LMM.
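A Wald joint test of the cross-sectional and longitudinal SNP effects can be computed from the two estimates and their 2 × 2 covariance matrix. The sketch below is a generic illustration, with hypothetical function and argument names; it uses the fact that the upper tail probability of a chi-square distribution with 2 degrees of freedom at w is exp(-w/2).

```python
import math


def wald_2df(beta1, beta3, v11, v13, v33):
    """Joint Wald test of H0: beta1 = beta3 = 0.

    beta1, beta3: estimated cross-sectional and longitudinal SNP effects
    v11, v13, v33: entries of the 2x2 covariance matrix of the two estimates
    """
    det = v11 * v33 - v13 ** 2
    # Quadratic form beta' V^{-1} beta, using the closed-form inverse of a 2x2 matrix.
    w = (v33 * beta1 ** 2 - 2 * v13 * beta1 * beta3 + v11 * beta3 ** 2) / det
    p = math.exp(-w / 2)          # chi-square survival function with 2 df
    return w, p


w, p = wald_2df(beta1=1.0, beta3=2.0, v11=1.0, v13=0.0, v33=1.0)
print(w, p)
```

When the two estimates are uncorrelated, as in this example, the statistic reduces to the sum of the two squared single-parameter Z statistics.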
1.6 Goals
The primary goal is to compare the fast methods (SAO, TS, CTS and GALLOP) with LMM in terms of
T1E, power, parameter estimates and computational speed under specific settings. T1E and power are
compared in simulation studies to assess accuracy. To support wider future application, it is also
useful to identify the specific conditions under which each fast method performs best.
A secondary goal is to detect associations in the DCCT/EDIC application with the help of the fast
methods. These fast methods work as an approximation of LMM and narrow the pool of potential
SNPs. The subset of SNPs reaching a given threshold in the fast methods is then fit in LMM to see
whether the results are close to the ‘truth’ that would be obtained by testing all SNPs with LMM.
By realizing these two goals, we want to test the hypothesis of this thesis: the performance of the
above fast methods as replacements for LMM in longitudinal GWAS is affected by MAF, sample size,
within-subject error structure and missing data patterns related to real study settings.
Methods
The procedures for applying the methods to the simulation study and the real data are summarized
in the following flowchart:
Figure 2-1 Flow chart for Methods section.
2.1 Data Background: DCCT and EDIC
Our motivating data come from the DCCT/EDIC study.
2.1.1 Diabetes Control and Complications Trial (DCCT)
The DCCT was an unblinded randomized controlled trial designed for patients with insulin-dependent
diabetes mellitus (IDDM), namely type 1 diabetes (T1D). The trial began in 1983 and ended
prematurely in 1993 due to the beneficial effects of intensive therapy. A total of 1441 patients
with T1D were recruited from 29 centers between 1983 and 1989 and were randomly assigned to one
of two treatment groups: conventional or intensive therapy (Shamoon et al., 1993). The mean
follow-up time was 6.5 years, with a minimum of around 3 years and a maximum of 9 years. The
primary outcome was retinopathy, but the trial was also used to test whether other T1D-related
complications could be delayed, or their rate of progression slowed, by treatment intervention.
2.1.1.1 DCCT cohorts
The DCCT has two important cohorts. The primary prevention cohort and the secondary intervention
cohort differ in terms of retinopathy at baseline, i.e. patients without and with retinopathy when
recruited into the trial (DCCT Research Group, 1986). In addition, the primary cohort had
diabetes duration of 1-5 years and no microalbuminuria (<40 mg albumin/24h on a 4-h urine
collection). The secondary cohort had diabetes duration of 1-15 years and possible
microalbuminuria (≤200 mg albumin/24h on a 4-h urine collection). The primary outcomes of the
DCCT differ between the two cohorts. For the primary prevention cohort, the primary outcome of
interest was the initial appearance of retinopathy. For the secondary intervention cohort, the
principal outcome was the progression or improvement of pre-existing minimal
retinopathy (DCCT Research Group, 1986). The final count of patients was 726 in the primary
prevention cohort and 715 in the secondary intervention cohort.
2.1.1.2 DCCT treatment
The intensive treatment plan consisted of three or more insulin injections per day, or treatment
with an external pump, adjusted according to the results of self-monitoring of glucose levels four
or more times per day, diet and exercise. The conventional treatment consisted of one or two daily
insulin injections along with self-monitoring of urine or blood glucose and education about
lifestyle (Paterson et al., 2010). The goal of the conventional group was to keep patients free
from hyper/hypoglycemia and ketonuria and to maintain body weight. The goal of the intensive group
was to maintain glycemia in a normal range (HbA1c<6.05%) with multiple daily blood glucose
measures and monthly HbA1c measures (Shamoon et al., 1993).
The DCCT was terminated about 10 years after its launch due to the beneficial effects of
intensive therapy, and the original conventional group was then taught intensive therapy (Nathan,
2014). The experience obtained from the trial is of great value and many insights were provided by
the DCCT research team. The definitive trial results were published in 1993 by the DCCT
team (Shamoon et al., 1993). The important findings include that intensive therapy delayed the
onset and slowed the progression of retinopathy and reduced the occurrence of microalbuminuria,
albuminuria and neurologic complications, but caused a higher risk of hypoglycemia and adverse
weight gain. They believed that the benefits of intensive treatment outweighed the risks but that
intensive therapy should be carried out with extra caution (Shamoon et al., 1993).
2.1.2 Epidemiology of Diabetes Interventions and Complications (EDIC)
After the termination of the DCCT, a follow-up observational study began tracking the subsequent
traits of the same cohort of participants. This longitudinal study is called Epidemiology of Diabetes
Interventions and Complications (EDIC) and has provided annual evaluations of different
diabetes-related traits for participants at multiple centers from 1994 until now. The purpose of
EDIC is to observe the long-term durability of the DCCT treatment effects on diabetes
complications, despite the fact that many participants changed their treatment therapy during the
EDIC period (EDIC Research Group, 1999).
EDIC has also investigated other related chronic diseases. For instance, the effects of HbA1c and
diabetes duration on the incidence of cardiovascular disease have been studied using EDIC
data (Lachin and DCCT/EDIC Research Group, 2016). Because randomized treatment is an important
classification of DCCT participants, its effect on diabetic retinopathy and other ocular diseases
has been a focus (DCCT/EDIC Research Group, 2014a). In addition, studies of the development and
progression of neuropathy and nephropathy contribute to our understanding of T1D (Martin et al.,
2014; de Boer and DCCT/EDIC Research Group, 2014). Apart from physical function, cognitive
function has also been explored with EDIC data, and it was found that the original treatment group
assignment is not associated with any decline in cognitive function (DCCT/EDIC Research Group,
2007). EDIC provides new insights into the mechanisms of long-term development of these
diseases.
Clinic visits included a medical history and physical examination in both studies, but the
schedule for collecting some measurements differed between DCCT and EDIC. For example,
glomerular filtration rate (GFR) was estimated every year in both DCCT and EDIC, while albumin
excretion rate (AER) was measured every year in DCCT and every other year in EDIC.
2.1.3 Renal Outcome Measures
We chose to focus on renal complications in this thesis. This section introduces two renal
traits, albuminuria and estimated glomerular filtration rate, including their definitions and
collection methods, along with some previous findings.
2.1.3.1 Albuminuria
Urinary albumin excretion rate (AER) is an important measure used to screen people with diabetes
for nephropathy. The DCCT baseline inclusion criteria for AER were ≤ 40 mg/24h for the primary
prevention cohort and ≤ 200 mg/24h for the secondary intervention cohort. Microalbuminuria was
defined as AER ≥ 40 mg/24h and macroalbuminuria as AER ≥ 300 mg/24h in earlier DCCT and EDIC
reports (Younes et al., 2010). However, to be consistent with modern American Diabetes
Association guidelines, microalbuminuria was defined in EDIC as AER ≥ 30 mg/24h (DCCT/EDIC
Research Group, 2014b). The measurement of AER required a 4-hour urine collection, which was
performed every year in DCCT and every other year in EDIC, with high precision confirmed for
data quality in both studies (DCCT Research Group, 1986; EDIC Research Group, 1999).
The existence and extent of albuminuria, as reflected by AER, can help evaluate the possibility of
kidney disease so that irreparable damage can be prevented or postponed. In terms of genetic
association, only a very limited number of loci have been discovered and validated for albuminuria
in T1D at the traditional GWAS significance level. One finding is rs1564939 in GLRA3, identified
in people with T1D using 24h AER data from a Finnish study and validated in a
meta-analysis (Sandholm et al., 2018). More new albuminuria loci are being discovered in general
population cohorts, predominantly without diabetes (Haas et al., 2018).
At the end of the DCCT period, the findings showed that, compared with conventional treatment,
intensive treatment reduced the incidence of microalbuminuria by 39% (95% CI 21%-52%, p≤0.002)
and the incidence of macroalbuminuria by 54% (19%-74%, p<0.004) in the combined
cohort (Shamoon et al., 1993). After the termination of DCCT, the effect of intensive therapy on
albuminuria has been shown to persist for over ten years, even though many people switched to
intensive treatment later. During EDIC years 1-18, the risk reduction of microalbuminuria between
the original intensive and conventional groups was 45% (26%-59%, p<0.0001), and the risk
reduction of macroalbuminuria was 61% (41%-74%, p<0.0001) (DCCT/EDIC Research Group, 2014b).
Although the definition of microalbuminuria changed, the risk reduction over time shows that the
original intensive therapy group has a lower risk of albuminuria decades after the end of DCCT.
2.1.3.2 Glomerular Filtration Rate
Glomerular filtration rate (GFR) is another measure of kidney function and is used to stage
kidney disease. GFR can be measured precisely by inulin clearance or other methods. GFR was
measured by iothalamate clearance in DCCT, but such direct measurements were not preferred in
DCCT/EDIC because the procedure is cumbersome. Usually the reported value is the estimated GFR
(eGFR), calculated from a function of several patient measurements. Iothalamate GFR measurements
obtained in subsets of participants showed changes in the same direction as the changes in the
estimated GFR, but of larger magnitude (DCCT/EDIC Research Group, 2011). The most commonly used
parameters for eGFR include serum creatinine concentration, age, sex and ethnicity. Some previous
versions also included body size (EDIC Research Group, 1999). Over time the formula has been
updated and validated by researchers, and currently the most credible one is the Chronic Kidney
Disease Epidemiology Collaboration (CKD-EPI) creatinine equation, owing to its improved accuracy
compared with other equations (Levey et al., 2009):
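$$\mathrm{eGFR} = 141 \times \min(\mathrm{Scr}/\kappa, 1)^{\alpha} \times \max(\mathrm{Scr}/\kappa, 1)^{-1.209} \times 0.993^{\mathrm{Age}} \times 1.018\ [\text{if female}] \times 1.159\ [\text{if black}].$$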
The equation takes serum creatinine, age, gender and ethnicity as parameters. Serum creatinine
level (Scr) is in mg/dL, age is in years, κ is 0.7 for females and 0.9 for males, α is -0.329 for
females and -0.411 for males, min indicates the minimum of Scr/κ or 1, and max indicates the
maximum of Scr/κ or 1.
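As a minimal sketch, the CKD-EPI equation above can be written as a small Python function. The function name and argument names are illustrative assumptions and are not taken from the DCCT/EDIC analysis code; only the constants come from the equation as stated.

```python
def ckd_epi_egfr(scr_mg_dl, age_years, female, black):
    """Estimated GFR (mL/min/1.73 m^2) from the CKD-EPI creatinine equation quoted above.

    Argument names are hypothetical; the constants follow the text.
    """
    kappa = 0.7 if female else 0.9          # sex-specific creatinine scaling
    alpha = -0.329 if female else -0.411    # sex-specific exponent below kappa
    ratio = scr_mg_dl / kappa
    egfr = 141.0 * min(ratio, 1.0) ** alpha * max(ratio, 1.0) ** -1.209 * 0.993 ** age_years
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr


# Illustrative call: a 30-year-old woman with serum creatinine of 0.7 mg/dL.
example_egfr = ckd_epi_egfr(0.7, 30, female=True, black=False)
```

Because the min and max terms split at Scr/κ = 1, the estimate falls more steeply with creatinine once Scr exceeds κ.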
eGFR is widely accepted as an overall measure of kidney function. Because there was no standard
classification of chronic kidney disease (CKD) stages for a long time, clinical practice guidelines
by the National Kidney Foundation proposed classifying CKD severity by the level of eGFR. Usually,
patients with eGFR<60 mL/min/1.73 m² should be made aware and assessed for impairment of renal
function and well-being (National Kidney Foundation, 2002).
In DCCT, GFR was directly measured only a few times for each individual. Serum creatinine was
obtained annually in both DCCT and EDIC to calculate eGFR (DCCT Research Group, 1986). The
development of albuminuria usually precedes impaired GFR, and a slightly increased excretion of
albumin can be a sensitive predictor of decline in GFR (de Boer and DCCT/EDIC Research Group,
2014). Impairment of GFR was defined in DCCT/EDIC as eGFR below 60 mL/min/1.73 m² at two
consecutive study visits, i.e. CKD stage 3. In the EDIC years, participants are considered to have
reached the renal stop point when their eGFR is ≤10 mL/min/1.73 m². After first reaching the
renal stop point, participants are censored for all remaining albuminuria and GFR measurements
because they have developed end-stage renal disease.
Impaired eGFR increases the probability of developing end-stage renal disease, which might cause
other complications such as cardiovascular disease, and the risk of death can be very high for
patients with T1D. With around 70 participants having developed impaired eGFR in the DCCT/EDIC
study by 2011 (EDIC year 18), it has been found that intensive therapy significantly lowered the
risk of impaired eGFR (DCCT/EDIC Research Group, 2011). During EDIC years 1-18, the risk
reduction of impaired eGFR between the original intensive and conventional groups was 44% (95% CI
12%-64%, p=0.011) (DCCT/EDIC Research Group, 2014b).
2.1.4 Importance of DCCT/EDIC Study
Up to 36 years after the launch of DCCT/EDIC, 94% of the surviving cohort are still being
followed (Bebu et al., 2019). One of the strengths of DCCT/EDIC is that the data were collected
from a well-documented cohort that has been followed for over 30 years. This ongoing cohort study
remains valuable for many kinds of research, so that better treatment regimens can be adopted to
reduce the risk of long-term complications of T1D.
2.2 Linear Mixed Effects Model for DCCT/EDIC Renal Outcomes
Based on the motivating dataset, our target outcomes are logAER and eGFR, both of which are
repeatedly measured quantitative traits reflecting T1D renal complications. Since different
responses might have different patterns of evolution or of residuals, applying one uniform model
or one common correlation structure would be inappropriate; separate LMMs are therefore fit for
the two outcomes, each adjusted for potentially related characteristics. In general, the first
step in conducting GWAS with LMMs is to construct a marginal model that best predicts the
population-average outcomes for individuals using the phenotype data only. The second step is to
include SNPs one by one and test for associations between SNPs and traits.
2.2.1 Phenotypic Covariate Selection
The clinically meaningful predictors selected for the renal traits logAER and eGFR are sex
(female versus male), cohort (primary prevention versus secondary intervention), randomized
treatment (conventional versus intensive therapy) and the interaction between cohort and
treatment. These variables were included as fixed-effect covariates in the LMM.
A time variable is needed in longitudinal data to define the interval between repeated
measurements. The initial time variable used for DCCT/EDIC is the time in years since
randomization into DCCT. (A time variable in months is also created as a more precise time
variable for some of the within-subject correlation structures.) Time enters the model as both a
fixed and a random effect: to allow individual intercepts and time slopes for all patients, a
random intercept and slope model was specified.
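Taking logAER with all potential adjusting covariates as an example, the final model in this step,
which contains only phenotypic covariates, has the form:
$$\mathrm{logAER}_{ij} = \beta_0 + \beta_3\,\mathrm{time}_{ij} + \beta_4\,\mathrm{cohort}_i + \beta_5\,\mathrm{gender}_i + \beta_6\,\mathrm{treatment}_i + \beta_7\,\mathrm{cohort}_i \times \mathrm{treatment}_i + b_{0i} + b_{1i}\,\mathrm{time}_{ij} + e_{ij}, \tag{2.1}$$
where $(b_{0i}, b_{1i}) \sim N(0, D)$ and $e_{ij} \sim N(0, \Sigma_i)$.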
2.2.2 Within-individual Correlation Structure Selection
Although errors from different individuals are often assumed to be independent, consecutive
measures within a person are typically correlated. Such correlation patterns or structures can be
selected from the options in the nlme package (Pinheiro et al., 2019).
From now on we refer to the covariance structure $\Sigma_i$ in place of the correlation structure.
Both describe how measurements at different time points are associated. With $n_i$ measurements
for individual $i$, the covariance structure for the distribution of measurement errors has the
form:
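$$\Sigma_i = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1 n_i} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2 n_i} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n_i 1} & \sigma_{n_i 2} & \cdots & \sigma_{n_i}^2 \end{bmatrix}$$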
The default option in the nlme package is the Variance Component (VC) structure, which has a
diagonal form and assumes there is no within-individual correlation. This is not the ideal structure
for repeated measures, but it is a convenient way to get a sense of the scale of residuals for
individuals and of the effect of fitting other structures. The simplest form allowing correlation is
the compound symmetry (CS) structure. In this form the covariance between within-person errors is
assumed to be the same no matter how far apart the two repeated measurements are, so the covariance
matrix contains only two distinct values: a common variance and a common covariance. The most
complex form is a general structure with no additional constraints, also known as the unstructured
form, in which the covariance between each pair of time points is allowed to be different. This
structure requires much more computation to fit, and for a large number of repeated measurements
the resulting covariance matrix is very large and complicated. It also gives no summary rule for
how the correlation between repeated measurement errors depends on the time interval. This
structure is therefore not adopted in this thesis, as it takes much more time than other structures
given the number of measurements in DCCT/EDIC.
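To make the difference between the VC and CS structures concrete, the following sketch builds both covariance matrices for a subject with a given number of visits. It is an illustration only: the function names and the parameterization (a common variance `sigma2` and a common correlation `rho`) are assumptions and do not reproduce nlme internals.

```python
def vc_cov(n_visits, sigma2):
    """Variance Component structure: diagonal, no within-individual correlation."""
    return [[sigma2 if j == k else 0.0 for k in range(n_visits)] for j in range(n_visits)]


def cs_cov(n_visits, sigma2, rho):
    """Compound symmetry: one common variance and one common covariance rho * sigma2."""
    return [[sigma2 if j == k else rho * sigma2 for k in range(n_visits)]
            for j in range(n_visits)]


# Hypothetical values for illustration: 3 visits, variance 2.0, correlation 0.5.
vc_example = vc_cov(3, 2.0)
cs_example = cs_cov(3, 2.0, 0.5)
```

An unstructured matrix for the same subject would instead need n(n+1)/2 free parameters, which is what makes it costly for long follow-up.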
Apart from the previous three forms, the remaining structures can be divided into two types:
time-series based correlations and spatial correlations. The nlme package provides three
time-series based correlations: first order autoregressive (AR(1)), autoregressive moving average
process with arbitrary orders for the autoregressive and moving average components (ARMA(p, q)),
and continuous first order autoregressive process (CAR(1)). The difference between CAR(1) and the
other two is that CAR(1) allows a continuous time variable and can handle the precise time
intervals, while the other two only take discrete time. If the discrete time variable is not
specified in the model, the AR(1) and ARMA(p, q) structures by default treat the order of the
repeated measures within a subject as the time points, which generates unreliable results for
unsorted data.
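The distinction between AR(1) on visit order and CAR(1) on actual time can be sketched as follows. This is an illustrative Python version under the usual definitions (correlation φ raised to the lag in visit order for AR(1), and to the elapsed time for CAR(1)); the function names are assumptions and the code does not call nlme.

```python
def ar1_corr(n_visits, phi):
    """AR(1) on visit order: correlation is phi ** |j - k| for the j-th and k-th records."""
    return [[phi ** abs(j - k) for k in range(n_visits)] for j in range(n_visits)]


def car1_corr(times, phi):
    """CAR(1) on continuous time: correlation is phi ** |t_j - t_k|."""
    return [[phi ** abs(tj - tk) for tk in times] for tj in times]


# Hypothetical subject seen at years 0, 1 and 3 (the year-2 visit was missed).
by_order = ar1_corr(3, 0.5)           # treats years 1 and 3 as adjacent
by_time = car1_corr([0, 1, 3], 0.5)   # uses the actual two-year gap
```

With a missed visit, the order-based version assigns the last two records a correlation of φ, whereas the time-based version assigns φ squared, which is why irregular or unsorted records matter for the discrete-time structures.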
For spatial correlations, there are five options in total: exponential, Gaussian, linear,
rational quadratic and spherical spatial correlation. More than one spatial covariate can be
specified for all these structures. To make the spatial distance more precise, the spatial covariate
used in this dataset is time in months. With $d$ denoting the whole range, the correlation between
two observations separated by a distance $r < d$ (here, the difference in months) takes a
different form for each structure. Table 2-1 summarizes the correlations.
Table 2-1 Spatial correlation structures in nlme.
Spatial correlation structure    Correlation (r < d)
Exponential                      exp(−r/d)
Gaussian                         exp(−(r/d)^2)
Linear                           1 − (r/d)
Rational Quadratic               1/(1 + (r/d)^2)
Spherical                        1 − 1.5(r/d) + 0.5(r/d)^3
d: whole range of longitudinal data in months.
r: time difference between two measurements in months.
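The correlation functions in Table 2-1 can be sketched as a single Python helper. The function name and the string labels are illustrative assumptions. Returning 0 for the linear and spherical forms when r ≥ d is also an assumption, chosen to match the usual definition of these two structures; Table 2-1 itself only covers r < d.

```python
import math


def spatial_corr(kind, r, d):
    """Correlation between two measurements r months apart, for range d (Table 2-1)."""
    x = r / d
    if kind == "exponential":
        return math.exp(-x)
    if kind == "gaussian":
        return math.exp(-x * x)
    if kind == "rational_quadratic":
        return 1.0 / (1.0 + x * x)
    if kind == "linear":
        return 1.0 - x if x < 1.0 else 0.0  # assumed to be 0 beyond the range
    if kind == "spherical":
        return 1.0 - 1.5 * x + 0.5 * x ** 3 if x < 1.0 else 0.0  # assumed 0 beyond the range
    raise ValueError("unknown spatial correlation structure: " + kind)


# Hypothetical example: measurements 6 months apart with a range of 12 months.
half_range = {k: spatial_corr(k, 6, 12) for k in
              ("exponential", "gaussian", "linear", "rational_quadratic", "spherical")}
```

All five functions equal 1 at r = 0 and decrease with distance, but at different rates, which is what the AIC comparison between structures exploits.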
In the DCCT/EDIC dataset, time in months is used for the CAR(1) and spatial correlation error
structures, and time in years is used for the remaining error structures. The choice of covariance
structure can produce different results for traits with different patterns. All candidate
structures are compared in (Eq. 2.1) to see which one returns the best-fitting model, defined here
as the model with the smallest Akaike Information Criterion (AIC).
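The selection rule can be illustrated with a short sketch using the standard definition AIC = 2k − 2 log L, where k is the number of estimated parameters. The structure names and the log-likelihood values below are made up purely for illustration and are not results from DCCT/EDIC.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 log L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood


def select_by_aic(fits):
    """fits maps a structure name to (log_likelihood, n_params); returns the smallest-AIC name."""
    return min(fits, key=lambda name: aic(*fits[name]))


# Hypothetical fitted models: values invented for illustration only.
hypothetical_fits = {
    "CS": (-1520.4, 9),
    "AR(1)": (-1498.7, 9),
    "Exponential": (-1495.2, 9),
}
best_structure = select_by_aic(hypothetical_fits)
```

When candidate structures have the same number of parameters, the comparison reduces to the log-likelihood; the penalty term only matters between structures of different complexity.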
The best covariance structure might differ between DCCT and EDIC. However, a model in which the
errors are generated from different covariance matrices before and after DCCT closeout would be
too complex; therefore all available data are used to select the error covariance structure that
produces the best-fitting model.
2.2.3 Genetic Data
Genetic data were obtained by SNP array genotyping of DNA from blood samples collected in DCCT.
Genotyping was performed using Illumina 1M BeadArrays (San Diego, CA, USA) (Roshandel et
al., 2018). Ungenotyped autosomal SNPs were imputed using 1000 Genomes data (phase 3
v5) (1000 Genomes Project Consortium, 2015). Genotype dosage data from Illumina 1M
BeadArrays were used to analyze logAER and eGFR separately, with approximately 8M SNPs
imputed with high imputation quality (INFO>0.8). A total of 8,979,131 SNPs with MAF>1%
were subsequently analyzed statistically using genotype probabilities with additive coding of
genotype (Paterson et al., 2010).
After the LMM without SNP terms has been determined, genetic data are added to the model to give
LMM1. This model contains only the cross-sectional SNP effect, i.e. the population-average effect
without the influence of an interaction term. This SNP effect is assumed to be the same at all
time points; in other words, it is the average effect over time. The LMM with only the
cross-sectional SNP effect has the form:
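$$\mathrm{logAER}_{ij} = \beta_0 + \beta_1\,\mathrm{snp}_i + \beta_3\,\mathrm{time}_{ij} + \beta_4\,\mathrm{cohort}_i + \beta_5\,\mathrm{gender}_i + \beta_6\,\mathrm{treatment}_i + \beta_7\,\mathrm{cohort}_i \times \mathrm{treatment}_i + b_{0i} + b_{1i}\,\mathrm{time}_{ij} + e_{ij},$$
$$(b_{0i}, b_{1i}) \sim N(0, D), \quad e_{ij} \sim N(0, \Sigma_i^{*}). \tag{2.2}$$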
Then, genetic data are incorporated into the model as both a cross-sectional fixed effect and a
longitudinal fixed effect on the outcome trait, giving LMM2. Unlike in LMM1, the cross-sectional
SNP effect is now interpreted as the effect at time 0, rather than an average effect over time, if
a longitudinal effect exists. The complete LMM with both SNP effects is as follows:
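$$\mathrm{logAER}_{ij} = \beta_0 + \beta_1\,\mathrm{snp}_i + \beta_2\,\mathrm{snp}_i \times \mathrm{time}_{ij} + \beta_3\,\mathrm{time}_{ij} + \beta_4\,\mathrm{cohort}_i + \beta_5\,\mathrm{gender}_i + \beta_6\,\mathrm{treatment}_i + \beta_7\,\mathrm{cohort}_i \times \mathrm{treatment}_i + b_{0i} + b_{1i}\,\mathrm{time}_{ij} + e_{ij},$$
$$(b_{0i}, b_{1i}) \sim N(0, D), \quad e_{ij} \sim N(0, \Sigma_i^{*}), \tag{2.3}$$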
where $\Sigma_i^{*}$ is the selected covariance structure for logAER in DCCT/EDIC data. Similarly,
another $\Sigma_i^{*}$ will be selected independently for the outcome eGFR.
To conduct hypothesis testing separately on the two effects of SNPs, the null hypotheses for LMM
from (Eq. 2.2) and (Eq. 2.3) on GWAS are:
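$$H_0^{(1)}: \beta_1^{LMM1} = 0 \quad \text{for } \mathbf{LMM1};$$
$$H_0^{(1)}: \beta_1^{LMM2} = 0 \quad \text{for } \mathbf{LMM2};$$
$$H_0^{(2)}: \beta_2 = 0.$$
For a 2df test on both SNP effects, the null hypothesis is:
$$H_0: \beta_1^{LMM2} = \beta_2 = 0.$$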
All fast methods can provide results on $H_0^{(2)}$, while only GALLOP can test $H_0^{(1)}$ as LMM
does. As mentioned above, the interpretation of the two tests of $H_0^{(1)}$ might differ: for LMM1
the cross-sectional effect is the average effect over time, while for LMM2 it represents only the
effect at time 0. GALLOP can also provide inference for the joint test through an additional
implementation added to the original code.
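A 2df Wald joint test of this kind can be sketched as follows. The statistic is the quadratic form of the two SNP effect estimates in the inverse of their estimated covariance matrix, compared with a chi-square distribution with 2 degrees of freedom, whose upper tail probability is exp(−W/2). The function and argument names are assumptions for illustration; this is not the implementation added to GALLOP in the thesis.

```python
import math


def wald_2df(b1, b2, v11, v12, v22):
    """Joint Wald test of H0: beta1 = beta2 = 0.

    b1, b2: estimated cross-sectional and longitudinal SNP effects.
    v11, v12, v22: entries of their estimated 2x2 covariance matrix.
    Returns (W, p-value) with W referred to a chi-square with 2 df.
    """
    det = v11 * v22 - v12 * v12
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # b' V^{-1} b using the closed-form inverse of a 2x2 matrix
    w = (b1 * b1 * v22 - 2.0 * b1 * b2 * v12 + b2 * b2 * v11) / det
    p_value = math.exp(-w / 2.0)  # chi-square(2) survival function
    return w, p_value


# Hypothetical estimates: uncorrelated effects with unit variances.
w_stat, p_joint = wald_2df(2.0, 0.0, 1.0, 0.0, 1.0)
```

Using the covariance between the two estimates, rather than combining two separate p-values, is what lets a single test account for their dependence.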
2.3 Weighted Slope as Outcome (WSAO)
In addition to applying the existing fast methods, this thesis proposes one simple modification of
the SAO method, WSAO. The motivation is that some people have fewer data points, by study design
or because of missingness, which makes the data unbalanced. The first step is the same as in SAO:
the per-individual slope is extracted as the summary statistic, with longitudinal covariates
allowed to adjust the slope in this step. In the second step, each individual is assigned a weight
equal to their number of visits when fitting the linear relationship between time slopes and SNPs.
This modification is easy to implement and essentially puts more weight on people with more
visits when the data are highly unbalanced.
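The two WSAO steps can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the per-individual slope is taken from an ordinary least-squares fit of the outcome on time with no covariates, each subject has at least two visits at distinct times, and the function names are hypothetical rather than taken from the thesis code.

```python
def ols_slope(times, values):
    """Step 1: ordinary least-squares slope of one individual's outcome on time."""
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    return sxy / sxx


def wsao_snp_effect(subjects):
    """Step 2: weighted regression of slopes on SNP, weights = number of visits.

    subjects: list of (snp, times, values) tuples, one per individual.
    Returns the estimated longitudinal SNP effect.
    """
    snps = [snp for snp, _, _ in subjects]
    slopes = [ols_slope(t, y) for _, t, y in subjects]
    weights = [len(t) for _, t, _ in subjects]
    w_sum = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, snps)) / w_sum
    s_bar = sum(w * s for w, s in zip(weights, slopes)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, snps))
    sxs = sum(w * (x - x_bar) * (s - s_bar) for w, x, s in zip(weights, snps, slopes))
    return sxs / sxx


def _toy_subject(snp, times):
    # Hypothetical noise-free trajectory with slope 0.1 + 0.05 * snp.
    return snp, times, [2.0 + (0.1 + 0.05 * snp) * t for t in times]


toy = [_toy_subject(0, [0, 1, 2, 3, 4, 5, 6]), _toy_subject(1, [0, 1, 2]),
       _toy_subject(2, [0, 2, 4, 6]), _toy_subject(1, [0, 3])]
toy_effect = wsao_snp_effect(toy)
```

Setting all weights to 1 in the second step recovers the unweighted SAO estimate, so the two variants differ only in how much subjects with few visits contribute.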
2.4 Simulation Study
We compare the methods in different scenarios to gain a better understanding of their properties
and performance. A factorial design was used for the simulation study. We set k=7 equally spaced
time points for each individual, assuming that the data are complete. Each scenario has 5000
replicates to obtain stable statistics.
The simulation study is based on the association between simulated SNPs and the pattern of logAER
in DCCT. No covariates are included in the simulation models, because the types of covariates
allowed differ between fast methods and we want to make sure that differences in the methods’
performance come only from the simulation settings. The main procedure is first to simulate SNPs
and the response variable logAER according to the assumed true model (Eq. 2.4), and then to
analyze the data with the different methods.
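$$\mathrm{logAER}_{ij} = \beta_0 + \beta_1\,\mathrm{snp}_i + \beta_2\,\mathrm{snp}_i \times \mathrm{time}_{ij} + \beta_3\,\mathrm{time}_{ij} + b_{0i} + b_{1i}\,\mathrm{time}_{ij} + e_{ij},$$
$$(b_{0i}, b_{1i}) \sim N(0, D), \quad e_{ij} \sim N(0, \Sigma_i^{*}), \tag{2.4}$$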
where $\Sigma_i^{*}$ is the selected structure for logAER in DCCT data.
A flow chart of the simulation study is as follows:
Figure 2-2 Flow chart of simulation study.
2.4.1 Experimental Designs
There are 6 factors with different levels to be adjusted in this simulation study with 7 visits for
each subject and 5000 replicates for each scenario. Here are the factors:
1) Minor allele frequency (MAF): 0.01, 0.05, 0.1, 0.3, 0.5;
2) Sample size (N): 500, 1000, 1500, 2000, 3000;
3) Cross-sectional SNP effect (β1): 0, 0.08;
4) Longitudinal SNP effect (β2): 0, 0.016;
5) Missing pattern for the response variable: Complete, MCAR, MAR, MNAR, dropout;
6) Within subject error correlation (error variance Σ𝑖).
1) – 4): Minor allele frequency, sample size and SNP effects have different values to be tested.
When sample size is varying, MAF is set as 0.3; when MAF is varying the sample size is set to
2000.
In addition, we define the Reference scenario as the one in which the simulated data are not
modified with respect to missingness or error correlation structure. In other words, the reference
scenario has no missing data and is generated under the selected structure $\Sigma_i^{*}$
(medium/reference error correlation) for logAER in DCCT data.
5): There are many different ways to simulate these mechanisms of missingness. Here some
easy-to-construct approaches are implemented to simulate the three missing mechanisms and to
demonstrate their effect on the different methods. All missing mechanisms return a missing rate of
~40% of the records in the long-format simulated dataset. In addition, the first observation for
all subjects is treated as the baseline observation and is complete for everyone.
(1) MCAR: A missing probability of 47% is applied to all outcome values except the first
measurement of each subject, which gives the same overall missing rate of ~40% as the other two
missing scenarios (with 7 visits per subject, 0.47 × 6/7 ≈ 0.40).
(2) MAR: A missing at random scenario usually depends on covariates, which are not included in
this simulation study. Instead, the probability of a missing outcome is assumed to depend on the
baseline value of the outcome. The missing probability $P_{ij}$ for subject $i$ and measurement
$j$ is:
$$\log\left(\frac{P_{ij}}{1 - P_{ij}}\right) = -3.75 + 1.5 \times \mathrm{logAER}_{i,1}, \quad i = 1, \dots, N, \; j \neq 1 \tag{2.5}$$
(3) MNAR: The probability of a missing outcome is assumed to depend on the current value of the
outcome. The missing probability $P_{ij}$ for subject $i$ and measurement $j$ is:
$$\log\left(\frac{P_{ij}}{1 - P_{ij}}\right) = -3.75 + 1.45 \times \mathrm{logAER}_{i,j}, \quad i = 1, \dots, N, \; j \neq 1 \tag{2.6}$$
(4) Dropout: Censoring or dropout usually occurs in longitudinal studies when participants miss
one visit and all later ones. The first missing visit can arise from any of the previous three
missing mechanisms; here we adopt a design in which, once an observation exceeds a certain
threshold, all later observations are censored. For the outcome logAER, this dropout threshold is
set to the 70th percentile of logAER values in the simulated dataset. The missing probability
$P_{ij}$ for subject $i$ and measurement $j$ is:
$$P_{ij} = \begin{cases} 1, & \text{if } \mathrm{logAER}_{i,j-1} > \mathrm{logAER}_{Q_{0.7}} \text{ or } P_{i,j-1} = 1 \\ 0, & \text{otherwise} \end{cases} \tag{2.7}$$
6): For within-subject error covariance structure, the default structure is the one selected for
logAER in DCCT data. Other than that, error structures with no, medium/reference and strong
correlations will be used to generate the data.
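The four missingness designs described under factor 5) can be sketched for a single subject as follows. The coefficients come from (Eq. 2.5) to (Eq. 2.7) and the MCAR probability of 47%; the function names and the example trajectory are assumptions for illustration, and the dropout threshold must be supplied by the caller (in the text it is the 70th percentile of the simulated logAER values).

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def missing_probabilities(y, mechanism):
    """Per-visit missing probabilities for one subject; y[0] is the baseline visit."""
    probs = [0.0]  # the baseline observation is always kept
    for j in range(1, len(y)):
        if mechanism == "MCAR":
            probs.append(0.47)
        elif mechanism == "MAR":       # depends on the baseline value (Eq. 2.5)
            probs.append(logistic(-3.75 + 1.5 * y[0]))
        elif mechanism == "MNAR":      # depends on the current value (Eq. 2.6)
            probs.append(logistic(-3.75 + 1.45 * y[j]))
        else:
            raise ValueError("unknown mechanism: " + mechanism)
    return probs


def dropout_mask(y, threshold):
    """True marks a censored visit: all visits after the first value above threshold (Eq. 2.7)."""
    mask = [False]
    dropped = False
    for j in range(1, len(y)):
        dropped = dropped or y[j - 1] > threshold
        mask.append(dropped)
    return mask


def draw_mask(probs, rng):
    """Turn missing probabilities into a random True/False missingness mask."""
    return [rng.random() < p for p in probs]


# Hypothetical subject with 7 visits.
y_example = [2.5, 2.4, 2.9, 3.6, 3.1, 2.8, 3.0]
mcar_mask = draw_mask(missing_probabilities(y_example, "MCAR"), random.Random(2019))
censored = dropout_mask(y_example, threshold=3.5)
```

The MCAR probabilities average to 0.47 × 6/7 ≈ 0.40 over the 7 visits, which is how the 47% per-visit probability yields the ~40% overall missing rate.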
2.4.2 Simulation of SNPs
SNPs are simulated under Hardy-Weinberg equilibrium, so that for MAF = 𝑝 the minor allele count
of each individual follows the distribution:
𝑃𝑟(𝑆𝑁𝑃 = 0) = (1 − 𝑝)2, 𝑃𝑟(𝑆𝑁𝑃 = 1) = 2𝑝(1 − 𝑝), 𝑃𝑟(𝑆𝑁𝑃 = 2) = 𝑝2
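A minimal sketch of this genotype simulation is shown below. Drawing two independent alleles, each being the minor allele with probability p, gives exactly the three probabilities above. The function name and the seed are illustrative assumptions.

```python
import random


def simulate_snps(n_subjects, maf, rng):
    """Draw minor-allele counts (0, 1 or 2) under Hardy-Weinberg equilibrium."""
    # Sum of two independent Bernoulli(maf) alleles: Pr(0)=(1-p)^2, Pr(1)=2p(1-p), Pr(2)=p^2.
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n_subjects)]


# Hypothetical run: 2000 subjects at MAF = 0.3.
genotypes = simulate_snps(2000, 0.3, random.Random(7))
observed_maf = sum(genotypes) / (2 * len(genotypes))
```

At low MAF and small sample sizes, the minor homozygote can be absent from a replicate altogether, which is one reason performance is examined across both factors.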
2.4.3 Simulation of Outcome Trait
It is assumed that adding SNP terms will not change fixed and random intercept or slope effects
by much, therefore the coefficients 𝛽0 and 𝛽3 are directly taken from results of model without SNP
terms (Eq. 2.1) using observed DCCT data. The covariance matrix for random effects 𝐷 is also
assumed to be the same as estimated from the model without SNP terms.
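$$\mathrm{logAER}_{ij} = \beta_0 + \beta_3\,\mathrm{time}_{ij} + b_{0i} + b_{1i}\,\mathrm{time}_{ij} + e_{ij},$$
$$(b_{0i}, b_{1i}) \sim N(0, D), \quad e_{ij} \sim N(0, \Sigma_i^{*}), \tag{2.8}$$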
After the simulation of all necessary variables and parameters, the outcome values logAER𝑖𝑗 are
simply calculated by applying the assumed true model equation as (Eq. 2.4). This can also be done
by generating logAER𝑖𝑗 following the inferred multivariate normal distribution as in (Eq. 1.2).
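Generating one subject's outcomes from (Eq. 2.4) can be sketched as follows. Several simplifications are assumptions made only for this illustration: the within-subject errors are drawn independently with a common standard deviation instead of from the selected structure, the visit times are taken as 0 to 6, and the values used for β0, β3, D and the error standard deviation are invented rather than estimated from DCCT. Only the SNP effects 0.08 and 0.016 and the 7 visits come from the design described above.

```python
import math
import random


def simulate_subject(snp, times, beta, d, sigma, rng):
    """Simulate logAER for one subject under the random intercept and slope model.

    beta: (beta0, beta1, beta2, beta3) as in (Eq. 2.4).
    d: (d11, d12, d22), the entries of the random-effects covariance matrix D.
    sigma: error standard deviation (independent errors, a simplifying assumption).
    """
    beta0, beta1, beta2, beta3 = beta
    d11, d12, d22 = d
    # Cholesky factor of the 2x2 matrix D, used to draw correlated (b0, b1).
    l11 = math.sqrt(d11)
    l21 = d12 / l11
    l22 = math.sqrt(d22 - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    b0 = l11 * z1
    b1 = l21 * z1 + l22 * z2
    return [beta0 + beta1 * snp + beta2 * snp * t + beta3 * t + b0 + b1 * t
            + rng.gauss(0.0, sigma) for t in times]


visit_times = [0, 1, 2, 3, 4, 5, 6]         # k = 7 equally spaced visits (assumed units)
hypothetical_beta = (2.0, 0.08, 0.016, 0.05)  # beta0 and beta3 are made-up values
hypothetical_d = (0.5, 0.01, 0.02)            # made-up random-effects covariance
y_sim = simulate_subject(1, visit_times, hypothetical_beta, hypothetical_d, 0.6,
                         random.Random(3))
```

Replacing the independent error draw with a draw from a multivariate normal with the selected covariance structure would bring the sketch closer to the design described in the text.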
2.5 Methods for DCCT/EDIC Data Analysis
Unlike in the simulation study, where covariates are ignored, covariates are needed in the full
LMMs to adjust the relationship between outcomes and SNP effects. However, owing to the
limitations of the fast methods, the covariates that can be added to each fast algorithm are not
the same. We therefore specify here the covariates used for the LMMs and the fast methods on real
data.
As described in Section 2.2.1 Phenotypic Covariate Selection, the selected covariates for the full
LMM on logAER and eGFR are sex, treatment and cohort, with an interaction effect of treatment and
cohort. All covariates are time-static characteristics. The full model with both SNP effects is
(Eq. 2.3).
SAO allows only longitudinal covariates to be added in the first step, so no covariates are
added for the real data analysis. However, considering the unbalanced data, weighted SAO with
weights equal to the number of visits is also applied along with the original SAO. The TS method
fits an LMM without SNP effects in the first step, so it allows all additional covariates to adjust
the relationship between outcome and time and has the same form as (Eq. 2.1) in the first step.
The CTS method uses only longitudinal data in the data transformation of its first step;
therefore, as with SAO, no covariates can be added for the real data analysis. Finally, the GALLOP
method can take all types of covariates into its design matrix and thus allows all of the
covariates, cross-sectional or longitudinal, for the real data analysis. In addition, TS is the only
fast method that can apply the same error correlation structure as LMM; all other fast methods
assume an independent within-subject error structure.
When applying the fast methods to the real DCCT/EDIC data on the outcomes logAER and eGFR, we
first set a loose significance level of p < 10^−5. Any SNPs significant at this level for single or
joint effects are extracted and fit in LMM for final testing. Unlike in the simulation study, in
which we know the exact power and T1E of a method, here we obtain a direct comparison by defining
the efficiency of the fast methods loosely ($E_l$) or stringently ($E_s$) as:
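$$E_l = \frac{\#\,\text{SNPs with } P^{*} < 5 \times 10^{-8} \text{ by LMM}}{\#\,\text{SNPs with } P < 10^{-5} \text{ by fast method}} \tag{2.9}$$
$$E_s = \frac{\#\,\text{SNPs with } P^{*} < 5 \times 10^{-8} \text{ by LMM}}{\#\,\text{SNPs with } P < 5 \times 10^{-8} \text{ by fast method}} \tag{2.10}$$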
𝑃* can be any P value for single SNP effect or joint effect under the significance level detected by
LMM.
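The two efficiency measures can be computed as in the sketch below. It assumes, as one reading of the procedure above, that the numerator counts LMM-significant SNPs among those passed on from the fast method at the loose level; the function name, the SNP labels and all p-values are hypothetical.

```python
def efficiencies(fast_p, lmm_p, loose=1e-5, strict=5e-8):
    """Return (E_l, E_s) as in (Eq. 2.9) and (Eq. 2.10).

    fast_p: SNP -> p-value from a fast method (all tested SNPs).
    lmm_p: SNP -> p-value from LMM, needed only for SNPs with fast p < loose.
    A measure is None when its denominator is zero.
    """
    screened = [snp for snp, p in fast_p.items() if p < loose]
    n_lmm = sum(1 for snp in screened if lmm_p[snp] < strict)
    n_fast_strict = sum(1 for p in fast_p.values() if p < strict)
    e_l = n_lmm / len(screened) if screened else None
    e_s = n_lmm / n_fast_strict if n_fast_strict else None
    return e_l, e_s


# Invented p-values for four hypothetical SNPs.
fast_example = {"snpA": 1e-9, "snpB": 1e-6, "snpC": 1e-3, "snpD": 2e-8}
lmm_example = {"snpA": 1e-9, "snpB": 1e-8, "snpD": 1e-6}
e_loose, e_strict = efficiencies(fast_example, lmm_example)
```

Under this reading, the stringent measure can exceed 1 when LMM confirms SNPs that the fast method only detected at the loose level.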
In addition, applying CTS to the real DCCT/EDIC data on the eGFR outcome is potentially
problematic: because eGFR was measured every year in DCCT/EDIC, the number of eGFR values per
subject can exceed 22, in which case CTS fails to find the orthogonal polynomial required to
transform the data. Therefore only the eGFR measures from alternate years are used for each
subject in the CTS method.
Results
3.1 Data Description
A general description of the data is provided in Table 3-1 which contains summary statistics for
the number of visits, covariates and outcomes in DCCT/EDIC.
Table 3-1 Descriptive table for DCCT/EDIC.
Characteristics                             DCCT            EDIC            DCCT/EDIC
Calendar year                               1983-1993       1994-2011       1983-2011
Total sample count                          1441            1401            1441
Number of visits (range)                    (1 - 11)        (1 - 18)        (1 - 29)
  mean                                      8.2             16.2            24
  median                                    8               18              25
Time-static                                 %       N       %       N       %       N
Gender (Female)                             47.2%   680     47.8%   669     47.2%   680
Cohort (Primary)                            50.4%   726     50.3%   705     50.4%   726
Treatment (Conventional)                    50.7%   730     50.2%   703     50.7%   730
                                            mean    sd      mean    sd      mean    sd
Age at DCCT baseline (years)                26.9    7.1     26.9    7.1     26.9    7.1
Duration of T1D at DCCT baseline (years)    5.6     4.2     5.6     4.1     5.6     4.2
Time-changing                               mean    sd      mean    sd      mean    sd
Mean logAER (mg/24h)                        2.5     0.8     2.7     1.2     2.6     0.8
Mean eGFR (mL/min/1.73 m²)                  120.0   10.7    103.0   13.4    109.0   11.4
%: proportion.
N: count.
sd: standard deviation.
The baseline time-static covariates have no missing data; the missing rates for the repeated
measurements are presented in Table 3-2. As mentioned in Section 2.1.3.1 Albuminuria, AER was
measured every other year in EDIC, and the calculation of the EDIC missing rate takes this study
design into account. The missing rate is generally higher in EDIC than in DCCT, which could be
caused by a higher dropout rate in the later study. In a simple linear mixed model with logAER as
the response, eGFR as the independent variable and observations grouped by subject, the marginal
effect of eGFR on logAER is significant and negative (β=−0.013, P<0.001).
Table 3-2 Missing rate for repeated measurements. Missing rate of logAER in EDIC and DCCT/EDIC is
calculated by combining alternate years.
Missing Rate DCCT EDIC DCCT/EDIC
logAER 1.71% 10.32% 7.43%
eGFR 1.60% 10.43% 7.49%
3.1.1 Number of visits
Randomization in DCCT took place from 1983 to 1989; the number of participants entering the
clinical trial each year is shown in Figure 3-1. The analysis was based on study time, defined as
the number of years the participant had been enrolled in DCCT since baseline, rather than on
calendar time.
Figure 3-1 Number of DCCT participants by randomization year.
Due to the staggered entry into DCCT, the total number of regular visits in DCCT (Figure 3-2A)
peaks at 7 visits, corresponding to the peak in randomization year 1987 in Figure 3-1. The
number of visits in EDIC (Figure 3-2B) is relatively stable, given that all continuing participants
should have started in the same year. To show the proportion of missing participants across study
years, barplots of the expected and actual numbers of participants (Figure 3-3) and of the missing
rate (Figure 3-4) in every DCCT/EDIC year were generated. The proportion of missing participants
generally increased by DCCT year, except for year 9, and was 1.8% at the closeout visit. The
proportion of missing participants in EDIC also rose by year, from 5.4% to 14.8%.
Figure 3-2 Number of visit counts per subject in DCCT/EDIC years.
Figure 3-3 Expected and actual numbers of subjects in each DCCT/EDIC year including DCCT baseline
and close out visits. The counts by solid bars represent actual numbers which are no larger than expected
numbers in the same year. The transparent part of the bar represents the difference between expected and
actual counts in that year.
Figure 3-4 Proportion of missing subjects in each DCCT/EDIC year including DCCT baseline and close
out visits. This proportion is calculated as the proportion of transparent bar size out of total bar size in the
same year in Figure 3-3.
3.1.2 Distribution of logAER
Figure 3-5 presents a barplot of the number of logAER measurements per subject. This number is
also used as the weight in the WSAO method. The barplot peaks at around 15 measurements, and both
the mean and the median count are 15.
Figure 3-5 Barplot of numbers of logAER measurements per subject in DCCT/EDIC study.
To present the distribution of logAER in DCCT years, violin plots combined with boxplots were
generated (Figure 3-6A). After baseline, the boxplots generally show a right-skewed distribution,
with a larger gap between the median and the third quartile (Q3) than between the median and the
first quartile (Q1). Consistent with this, the outliers mostly have larger logAER values and the
violin plots have longer right tails. The violin plots gradually flatten by year, indicating more
spread in logAER values in later DCCT years, when there were fewer participants because of the
staggered entry (Figure 3-6B). Very few logAER values are missing in DCCT years. The two treatment
groups are compared in Figure 3-6C using violin plots and boxplots summarizing their distribution
each year.
Similar plots are generated for EDIC years in Figure 3-7. The study was designed to measure
logAER every other year, with around half of the participants (n=693) measured in odd EDIC years
and the others (n=704) in even years. The first EDIC logAER measurement differs significantly
between the odd-year and even-year groups (Wilcoxon rank-sum test, p=0.003), with even-year
logAER higher than odd-year. Among the 1397 individuals who are in EDIC and have logAER
measurements, 14.46% deviated from their odd or even year assignment at least once. The violin
plots and boxplots are thus generated with about half of the total sample size for every year.
The missing rate for logAER shown in Figure 3-7B increases monotonically over time, from 5.64% for
years 1 and 2 to 15.99% for years 17 and 18.
Figure 3-8 shows the paths of individual subjects’ logAER values separately in DCCT and EDIC,
selected according to individual variance over time within each period. The subjects plotted in
each grid are not necessarily the same. Subjects with the smallest variance (Figure 3-8B and E)
usually have a shorter duration in the study, because they either started late or did not
participate after some time point. The subjects with the largest variance (Figure 3-8C and F)
showed the same increasing trend in logAER over study years, with red lines mostly above green
ones. Most grid plots showed an equal mixture of subjects from the DCCT treatment groups, except
for Figure 3-8C, in which most subjects are from the conventional group.
Figure 3-6 Distribution of logAER in DCCT years. A: Violin plot combined with boxplot on distribution
of logAER in DCCT years. B: Expected and actual numbers of participants in DCCT years. Solid bars
represent actual numbers and transparent part of the bar represents the difference between expected and
actual counts in that year. C: Violin plot combined with boxplot on distribution of logAER by treatment
groups in DCCT years. Outliers are identified by 1.5×IQR rule.
Figure 3-7 Distribution of logAER in EDIC years. A: Violin plot combined with boxplot on distribution of
logAER in EDIC years. B: Expected and actual numbers of participants calculated for every two years in
EDIC years due to the study design. Solid bars represent actual numbers and transparent part of the bar
represents the difference between expected and actual counts in that year. C: Violin plot combined with
boxplot on distribution of logAER by original treatment groups in EDIC years. Outliers are identified by
1.5×IQR rule.
Figure 3-8 Spaghetti plots of logAER from 20 subjects from DCCT/EDIC in each grid. A and D were
plotted for 20 random subjects selected from DCCT/EDIC. B and E were plotted for 20 subjects with least
variance over DCCT/EDIC. C and F were plotted for 20 subjects with largest variance over DCCT/EDIC.
DCCT treatment groups were denoted by different line colors for these selected subjects.
3.1.3 Distribution of eGFR
Figure 3-9 presents a barplot of the number of eGFR measurements per subject. This number is
also used as the weight in the WSAO method. The barplot peaks at around 24 measurements; the mean
is 23 and the median is 24.
Figure 3-9 Barplot of numbers of eGFR measurements per subject in DCCT/EDIC study.
Similar descriptive figures are generated for the eGFR outcome: Figure 3-10 for DCCT years and
Figure 3-11 for EDIC years. In Figure 3-10A and Figure 3-11A, eGFR values are less skewed than the
logAER data, with whiskers of similar length in the boxplots. In EDIC years, however, the
distribution of eGFR has the opposite feature from logAER: it is left skewed each year and
decreases over time. This may reflect the opposite trends and different ranges of logAER and eGFR
values. Figure 3-10B shows very few missing values in DCCT, and Figure 3-11B shows that the
missing rate in EDIC increases almost monotonically from 5.35% to 16.27%.
Figure 3-12 shows the paths of individual subjects’ eGFR values separately in DCCT and EDIC,
selected according to individual variance over time within each period. The subjects plotted in
each grid are not necessarily the same. As for logAER, subjects with the smallest variance usually
have a shorter duration in the study, because they started late or did not attend after some time
point. The subjects with the largest variance (Figure 3-12C and F) showed a decreasing trend in
eGFR over time. All grid plots showed an equal mixture of subjects from the DCCT treatment groups.
Figure 3-10 Distribution of eGFR in DCCT years. A: Violin plot combined with boxplot on distribution of
eGFR in DCCT years. B: Expected and actual numbers of participants in DCCT years. Solid bars represent
actual numbers and transparent part of the bar represents the difference between expected and actual counts
in that year. C: Violin plot combined with boxplot on distribution of eGFR by treatment groups in DCCT
years. Outliers are identified by 1.5×IQR rule.
Figure 3-11 Distribution of eGFR in EDIC years. A: Violin plot combined with boxplot on distribution of
eGFR in EDIC years. B: Expected and actual numbers of participants in EDIC years. Solid bars represent
actual numbers and transparent part of the bar represents the difference between expected and actual counts
in that year. C: Violin plot combined with boxplot on distribution of eGFR by original treatment groups in
EDIC years. Outliers are identified by 1.5×IQR rule.
Figure 3-12 Spaghetti plots of eGFR from 20 subjects from DCCT/EDIC in each grid. A and D were plotted
for 20 random subjects selected from DCCT/EDIC. B and E were plotted for 20 subjects with least variance
over DCCT/EDIC. C and F were plotted for 20 subjects with largest variance over DCCT/EDIC. DCCT
treatment groups were denoted by different line colors for these selected subjects.
3.2 Simulation Study Results
3.2.1 Set up
When selecting the correlation structure for the logAER model with DCCT data, ARMA(1, 1) was the
structure Σ_i* that produced the model fit with the lowest AIC. For subject i with n_i
consecutive visits, the covariance matrix of the error term has the form:
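$$e_{ij} \sim \Sigma_i^{*} = \sigma^2 \begin{bmatrix} 1 & \gamma & \gamma\rho & \cdots & \gamma\rho^{n_i} \\ \gamma & 1 & \gamma & \cdots & \gamma\rho^{n_i-1} \\ \gamma\rho & \gamma & 1 & \cdots & \gamma\rho^{n_i-2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \gamma\rho^{n_i} & \gamma\rho^{n_i-1} & \gamma\rho^{n_i-2} & \cdots & 1 \end{bmatrix}, \quad \sigma = 0.4220,\ \gamma = 0.2944,\ \rho = 0.7337.$$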
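As an illustration of this structure, the following Python sketch builds an ARMA(1,1) covariance matrix from σ, γ and ρ. It assumes the usual lag convention, in which two errors k visits apart (k ≥ 1) have correlation γρ^(k−1), matching the leading entries 1, γ, γρ of the matrix above; the function name `arma11_cov` and the n-by-n shape for n visits are assumptions made for this example.

```python
def arma11_cov(n, sigma, gamma, rho):
    """n x n ARMA(1,1) covariance matrix as nested lists.

    Assumed convention: sigma^2 on the diagonal and
    sigma^2 * gamma * rho**(lag - 1) for visits `lag` >= 1 apart.
    """
    var = sigma ** 2
    return [
        [var if i == j else var * gamma * rho ** (abs(i - j) - 1) for j in range(n)]
        for i in range(n)
    ]


# Estimates reported for the logAER model with DCCT data.
cov = arma11_cov(4, sigma=0.4220, gamma=0.2944, rho=0.7337)
for row in cov:
    print([round(value, 4) for value in row])

# gamma = 0 removes all off-diagonal terms, i.e. independent errors.
independent = arma11_cov(4, sigma=0.4220, gamma=0.0, rho=0.7337)
```

Under this convention, γ controls the overall strength of the within-subject correlation and ρ controls how quickly it decays with the distance between visits.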
Because of this choice of error structure, we adopt only time in years as the time variable and
abandon time in months.
Based on this error covariance structure, in addition to the default setting γ = 0.2944, we define
two further settings for experimental design 6): γ = 0, which produces an independent error
correlation structure, and γ = 0.87, which produces a stronger error correlation structure. We
refer to the scenarios with these two settings as the Independent (correlation) scenario and the
Strong (correlation) scenario.
The covariance matrix for the bivariate normal distribution of the random intercept and slope is:
$$(b_{0i}, b_{1i}) \sim D = \begin{bmatrix} 0.1634 & 0.0217 \\ 0.0217 & 0.0010 \end{bmatrix}.$$
Parameter estimates for standard LMM (Eq. 2.3) to conduct simulation study on logAER are:
Intercept = β0 = 2.4335, slope of time = β3 = 0.0122.
3.2.2 Type 1 Error
A well-controlled T1E is the basis for power calculation. T1E is calculated with both the
cross-sectional and longitudinal SNP effects set to 0, at significance level α = 0.05. As
suggested by a previous simulation study on T1E (Zhang and Sun, 2019), we adopt a wider CI than
the usual 95%: T1E is considered accurate if it falls within an approximate 99% CI based on 5000
replicates,
$$\alpha - 3\sqrt{\alpha(1-\alpha)/5000} \le \text{T1E} \le \alpha + 3\sqrt{\alpha(1-\alpha)/5000},$$
which here is (0.0408, 0.0592).
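The bounds can be checked numerically with a short Python sketch; the function name `t1e_bounds` and its parameter names are introduced here only for illustration.

```python
import math


def t1e_bounds(alpha=0.05, replicates=5000, z=3.0):
    """Acceptance interval alpha +/- z * sqrt(alpha * (1 - alpha) / replicates)."""
    half_width = z * math.sqrt(alpha * (1 - alpha) / replicates)
    return alpha - half_width, alpha + half_width


lower, upper = t1e_bounds()
print(f"({lower:.4f}, {upper:.4f})")  # (0.0408, 0.0592)
```

An empirical T1E, computed as the proportion of the 5000 null replicates with p-value below α, is then judged well controlled if it lies between `lower` and `upper`.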
Figure 3-13 plots T1E for all methods in the different scenarios, by MAF or by sample size. T1E
is well controlled, within the 99% CI, in most scenarios for all methods.
The most notable violation occurs for WSAO in the dropout scenario, where T1E is significantly
deflated across the MAF and sample size ranges, making WSAO a very conservative method. The MAR
and MNAR scenarios also produce a rather low T1E for WSAO at some MAFs and sample sizes, but the
deflation is less severe. Dropout differs from the other missingness mechanisms in that it
produces more unbalanced sample sizes over time. An example is shown in Figure 3-14, which plots
the sample size per time unit (7 in total, 1 is baseline) for one replicate with MAF=0.3, N=2000
and cross-sectional and longitudinal SNP effects of 0. As expected, missingness in the dropout
scenario increases over time.
Other than that, the 2df test by LMM produces inflated T1E in reference and strong error
correlation scenarios, but no specific pattern is yet found for the inflation.
Figure 3-13 Type 1 error rates calculated in reference scenario, 4 missing scenarios and 2 within-subject
error correlation scenarios. T1E is plotted by MAF in upper plot with fixed N=2000; and by N in lower
plot at fixed MAF=0.3. Dashed lines represent criterion of accurate T1E which is (0.0408, 0.0592).
Figure 3-14 Example of sample size changing by time in missing scenarios. X-axis is the time point in
simulation study, there are 7 time units in total with assumed same distance, and 1 is baseline when all
individuals have complete data. The setting for this example is: MAF=0.3, N=2000, cross-sectional and
longitudinal SNP effects are 0 and unchanged error structure.
3.2.3 Power
Power results are presented in a grid similar to that for T1E, but as barplots because the power
curves of some methods overlap. Because there are two SNP effects, bars are plotted only for the
methods that provide power for the corresponding SNP effect. For the cross-sectional SNP effect,
the barplots include power from GALLOP and LMM. For the longitudinal SNP effect, they include
power from SAO, WSAO, TS, CTS, GALLOP and LMM. In addition, GALLOP and LMM provide power for 2df
tests. To be more rigorous, plots for one SNP effect are stratified by the value of the other SNP
effect, since one might affect inference on the other.
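To make the idea of a 2df joint test concrete, the Python sketch below computes a Wald-type statistic for testing β1 = β2 = 0 from the two estimates and their covariance. This section does not state how the 2df statistic is obtained by LMM or GALLOP, so the Wald form, the function name `wald_2df` and the input values are assumptions for illustration only.

```python
import math


def wald_2df(b1, b2, var1, var2, cov12):
    """Wald statistic b' V^{-1} b and its chi-square(2) p-value.

    V = [[var1, cov12], [cov12, var2]] is the covariance of (b1, b2).
    """
    det = var1 * var2 - cov12 ** 2
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    stat = (var2 * b1 ** 2 - 2 * cov12 * b1 * b2 + var1 * b2 ** 2) / det
    # The survival function of a chi-square with 2 df is exp(-x / 2).
    return stat, math.exp(-stat / 2)


# Hypothetical estimates: each effect is 4 standard errors from 0, uncorrelated.
stat, p_joint = wald_2df(b1=0.08, b2=0.016, var1=0.02 ** 2, var2=0.004 ** 2, cov12=0.0)
print(stat, p_joint)
```

With uncorrelated estimates the statistic is simply the sum of the two squared z-scores, which is one way to see why a joint test can gain power when both effects are present.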
3.2.3.1 Analysis on Cross-sectional SNP Effect (β1)
For the cross-sectional SNP effect (Figure 3-15), there is not much difference among scenarios.
When only the cross-sectional effect exists and β2 = 0, the cross-sectional effect is in fact the
average SNP effect, assumed to be the same at all time points. Compared to the reference scenario,
where data are complete and the error correlation is medium/reference, all missing scenarios and
the strong error correlation scenario cause a drop in power, and the independent error structure
causes an increase in power. Among these scenarios, the largest difference between LMM and GALLOP
appears in the strong error correlation scenario, because GALLOP cannot adjust for different error
structures. The most powerful test in the strong error correlation scenario is the LMM
cross-sectional SNP effect test, followed by the LMM joint test.
When β2 = 0.016, the cross-sectional effect is now the SNP effect at baseline. The joint tests by
both LMM and GALLOP can produce very similar power that is much higher than cross-sectional
effect tests. Power by 2df tests can reach 100% very quickly by increasing MAF or sample size so
that there is not much difference in 2df test powers across different simulation scenarios.
Figure 3-15 Power of cross-sectional SNP effect calculated in 7 scenarios. Power is plotted by order of
MAF in upper plot with fixed N=2000; and by order of N in lower plot at fixed MAF=0.3. Each scenario
is stratified by longitudinal SNP effect b2=0 or b2=0.016 (right hand legend).
3.2.3.2 Analysis on Longitudinal SNP Effect (β2)
When the cross-sectional SNP effect β1 = 0, the grid plots in Figure 3-16 show that all missing
scenarios produce a drop in power, which is more severe in the MNAR and dropout scenarios. The
largest power reached in these two scenarios is at least 12.5% lower than the largest power in the
other scenarios.
When cross-sectional SNP effect β1 = 0.08, 2df tests by GALLOP and LMM again gain power
quickly with increasing MAF and sample size which makes them the most powerful tests across
all scenarios. Besides, TS method produces much higher power than the other longitudinal SNP
effect tests when both SNP effects are present in all scenarios. Except for these three tests, other
tests basically show a similar pattern as β1 = 0.
By comparing grid plots vertically, other than the increase in power for 2df tests and TS method,
other methods are not much affected by the value of the cross-sectional SNP effect in most
scenarios except for the MNAR and dropout scenarios. In the MNAR scenario, the longitudinal
SNP effect tests except TS have much lower power when the cross-sectional SNP effect is present,
with a difference of around 25% in highest power. In the dropout scenario, WSAO, SAO and CTS
gain much more power from the existence of a main SNP effect and become more powerful than
GALLOP and LMM for the longitudinal SNP effect tests.
Figure 3-16 Power calculated in 7 scenarios. Power is plotted by order of MAF in upper plot with fixed
N=2000; and by order of N in lower plot at fixed MAF=0.3. Each scenario is stratified by cross-sectional
SNP effect b1=0 or b1=0.08 (right hand legend).
3.2.4 Parameter Estimation
Parameter estimates are plotted along with 95% CI to see whether estimates by different methods
can capture the true effect sizes. Across all scenarios, the larger the MAF and sample size, the less
the estimates vary.
For the main (cross-sectional) SNP effect (Figure 3-17), GALLOP and LMM perform well in every
scenario across the MAF and sample size ranges, yielding a mean estimate for β1 of around 0.08 and
a 95% CI covering this value, unaffected by β2.
Figure 3-17 Parameter estimates of cross-sectional SNP effect calculated in 7 scenarios. Estimates are
plotted as points with bars as 95% confidence intervals. Ranks are ordered by MAF in upper plot with fixed
N=2000; and by N in lower plot at fixed MAF=0.3. Results for each scenario are stratified by longitudinal
SNP effect b2=0 or b2=0.016 (right hand legend). Solid lines represent the true parameter value for cross-
sectional SNP effect b1=0.08. Dashed lines represent b1=0.
For the longitudinal SNP effect (Figure 3-18 and Figure 3-19), WSAO, SAO, GALLOP and LMM obtain
an accurate estimate in most scenarios, with exceptions in the MNAR and dropout scenarios. These
four methods give a slightly lower estimate in MNAR, and WSAO and SAO give a slightly higher
estimate with larger variation in the dropout scenario, but their 95% CIs still cover the true
effect size of 0.016. In contrast, the TS and CTS methods always produce a lower mean estimate
with a relatively smaller SE than the other methods. These two methods tend to give estimates
substantially lower than the true effect size when MAF or sample size is large in most scenarios.
Comparing vertically between a cross-sectional effect of 0 and 0.08, LMM, GALLOP and CTS are
almost never affected by the cross-sectional SNP effect, as was the case for the cross-sectional
estimates. For WSAO and SAO, the cross-sectional effect changes the mean and CI of the estimate,
but the CI essentially still covers the true longitudinal SNP effect size. Lastly, for TS, in most
scenarios the presence of β1 tends to raise the mean estimate of β2 with an unchanged CI length,
making it more likely to capture the true effect size.
Figure 3-18 Parameter estimates of longitudinal SNP effect by MAF calculated in 7 scenarios. Estimates
are plotted as points with bars as 95% confidence intervals. Ranks are ordered by MAF with fixed N=2000.
Results for each scenario are stratified by cross-sectional SNP effect b1=0 or b1=0.08 (right hand legend).
Solid lines represent the true parameter value for longitudinal SNP effect b2=0.016. Dashed lines represent
b1=0.
Figure 3-19 Parameter estimates of longitudinal SNP effect by N calculated in 7 scenarios. Estimates are
plotted as points with bars as 95% confidence intervals. Ranks are ordered by N at fixed MAF=0.3. Results
for each scenario are stratified by cross-sectional SNP effect b1=0 or b1=0.08 (right hand legend). Solid
lines represent the true parameter value for longitudinal SNP effect b2=0.016. Dashed lines represent b1=0.
3.2.5 Speed
The last aspect to be compared is the computational efficiency of the methods. In the main part
of the simulation study, which assesses accuracy, outcomes are re-generated in each replicate so
that the replicates represent the distribution under each scenario. The speed comparison instead
has to be done on a dataset in which multiple SNPs are generated together with only one set of
outcomes per subject. From the speeds in Table 3-3, the time needed for real data analysis can be
roughly estimated. Limited by the speed of LMM, we ran the comparison on 1000 simulated SNPs with
a MAF of 0.3, a sample size of 2000, cross-sectional and longitudinal SNP effects of 0, and the
medium/reference error structure. The fast methods take about 20 seconds on average to run the
1000 SNPs, while LMM needs more than 6 hours. Simulations were run on one node of a
high-performance computing server with 144 nodes, each consisting of 2x Intel(R) Xeon(R) Gold 6140
CPU @ 2.30GHz with 192GB RAM for general use. The total time was summed for each method rather
than accounting for parallel computation, because different nodes or servers may differ in their
capacity for parallel running. The resulting running times are compared as follows:
Table 3-3 Time comparison in simulation study for 1000 SNPs under null with MAF=0.3, N=2000, 0 cross-
sectional or longitudinal SNP effect and unchanged error structure.
LMM SAO(WSAO) TS CTS GALLOP
Time 6.23h 18.92s 31.88s 11.67s 17.04s
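As a rough illustration of how these timings might be extrapolated to a genome-wide analysis, the Python sketch below scales the Table 3-3 running times linearly with the number of SNPs. Linear scaling on a single node is an assumption, and the real data differ from the simulation in sample size and number of visits, so the projected values are only indicative; the names `SECONDS_PER_1000_SNPS` and `projected_hours` are introduced for this example.

```python
# Running times from Table 3-3, converted to seconds per 1000 SNPs.
SECONDS_PER_1000_SNPS = {
    "LMM": 6.23 * 3600,
    "SAO(WSAO)": 18.92,
    "TS": 31.88,
    "CTS": 11.67,
    "GALLOP": 17.04,
}


def projected_hours(seconds_per_1000, n_snps):
    """Single-node running time in hours, assuming time grows linearly with SNP count."""
    return seconds_per_1000 * n_snps / 1000 / 3600


N_SNPS = 8_979_131  # number of SNPs in the DCCT/EDIC GWAS
projection = {m: projected_hours(s, N_SNPS) for m, s in SECONDS_PER_1000_SNPS.items()}
for method, hours in projection.items():
    print(f"{method}: {hours:,.1f} h")
```

Under these assumptions the fast methods would need on the order of days of single-node time for about 9M SNPs, while LMM would need tens of thousands of hours, which is why LMM is run only on candidate SNPs.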
3.3 DCCT/EDIC Data Analysis Results
3.3.1 Set up
By fitting LMMs without SNPs to the DCCT/EDIC data, ARMA(1,1) was selected as the best error
structure for both logAER and eGFR, in models with covariates gender and treatment × cohort.
The estimated error correlations differ between the two outcomes, but the same structure is
applied to both.
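$$\text{logAER}: \quad \Sigma_i^{*} = \sigma^2 \begin{bmatrix} 1 & \gamma & \cdots & \gamma\rho^{n_i} \\ \gamma & 1 & \cdots & \gamma\rho^{n_i-1} \\ \vdots & \vdots & \ddots & \vdots \\ \gamma\rho^{n_i} & \gamma\rho^{n_i-1} & \cdots & 1 \end{bmatrix}, \quad \sigma = 0.8576,\ \gamma = 0.5680,\ \rho = 0.8745.$$

$$\text{eGFR}: \quad \Sigma_i^{*} = \sigma^2 \begin{bmatrix} 1 & \gamma & \cdots & \gamma\rho^{n_i} \\ \gamma & 1 & \cdots & \gamma\rho^{n_i-1} \\ \vdots & \vdots & \ddots & \vdots \\ \gamma\rho^{n_i} & \gamma\rho^{n_i-1} & \cdots & 1 \end{bmatrix}, \quad \sigma = 12.45,\ \gamma = 0.6540,\ \rho = 0.9391.$$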
LMM and fast methods are applied to combined DCCT/EDIC data with ~ 9M SNPs for GWAS.
We again specify the model covariates and error structures for LMM and fast methods in Table
3-4.
Table 3-4 DCCT/EDIC data analysis settings.
| | With covariates | No covariates |
|---|---|---|
| ARMA(1,1) covariance structure | LMM, TS | |
| Independent covariance structure | GALLOP | SAO, WSAO, CTS |
Covariates: gender, treatment × cohort.
The timeline in this dataset runs from DCCT baseline to EDIC year 18 (calendar year 2011). To
combine DCCT and EDIC, EDIC year 1 is treated as the follow-up year after the last regular DCCT
visit for those who continued in the EDIC study, because the closeout visits were relatively
irregular in time. About 97% of DCCT participants continued in EDIC; 95% of them started right
after DCCT closeout, in EDIC year 1, while the others started in later EDIC years, as shown in
Figure 3-20.
Figure 3-20 Number of participants in EDIC starting years.
3.3.2 GWAS of logAER
3.3.2.1 Slopes for Two-stage Methods
First, we compare the two-stage methods SAO (WSAO), TS and CTS to see whether there are
systematic differences in the slopes obtained in the first step. The slopes are generated for each
individual in a different way by each method, as described in the Methods section. The upper plots
of Figure 3-21 are histograms of the slopes. The slopes represent the effect of time on the
longitudinal trait logAER, with absolute effect sizes from 0 to 1. All three methods have their
highest frequency at around 0 or at a slightly negative slope. The distribution of WSAO slopes is
more symmetric, while the TS and CTS slopes are more right-skewed. We also made Bland-Altman plots
(Altman and Bland, 1986; Bland and Altman, 1999) for pairwise comparison, shown in the lower plots
of Figure 3-21. In addition, paired t-tests were conducted to test whether the mean difference is
0 at a significance level of 5%. There is a significant difference between SAO and TS (p<0.001,
t=14.229, ΔSAO-TS=0.026) and between SAO and CTS (p<0.001, t=21.413, ΔSAO-CTS=0.026), but TS and
CTS are not significantly different (p=1, t=3.85e-13, ΔTS-CTS=3.85e-16). Both the t-tests and the
Bland-Altman plots show that SAO slopes are on average larger than TS and CTS slopes, meaning a
larger positive time effect on the outcome.
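The pairwise comparison of slopes can be sketched in Python as follows. The slopes are simulated and hypothetical (the list names `ts_slopes` and `sao_slopes`, the seed and the spreads are chosen only for illustration), and the 1.96 multiplier is a large-sample normal approximation rather than an exact t quantile. The function returns the paired t statistic, a 95% CI for the mean difference, and the classical Bland-Altman 95% limits of agreement.

```python
import math
import random
import statistics


def bland_altman(a, b):
    """Paired comparison of two sets of per-subject slopes."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)
    se = sd_diff / math.sqrt(len(diffs))
    return {
        "mean_diff": mean_diff,
        "t": mean_diff / se,  # paired t statistic for H0: mean difference = 0
        "ci95": (mean_diff - 1.96 * se, mean_diff + 1.96 * se),
        "loa95": (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff),
    }


# Simulated slopes for 500 hypothetical subjects, shifted by 0.026 on average.
rng = random.Random(2019)
ts_slopes = [rng.gauss(0.0, 0.1) for _ in range(500)]
sao_slopes = [s + 0.026 + rng.gauss(0.0, 0.04) for s in ts_slopes]

result = bland_altman(sao_slopes, ts_slopes)
print(result)
```

The CI for the mean difference narrows as the number of subjects grows, whereas the limits of agreement describe the spread of individual differences and do not.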
Figure 3-21 Slopes for two-stage fast methods on outcome logAER. Upper: Histograms of slopes for two-
stage fast methods (logAER). Lower: Bland-Altman plot comparing slopes between two fast methods. In
Bland-Altman plots, blue(middle) dashed line is mean difference. Green(top) and red(bottom) dashed lines
are upper and lower limits of 95% CI of mean difference.
3.3.2.2 Methods Comparison on Random Selection
Before conducting the GWAS, we randomly selected a subset of around 20k SNPs (n=19,570, 0.22%
of total SNPs) from across all autosomes. We ran all fast methods and LMM on this random
selection of SNPs to compare the p-values and parameter estimates. A subset was used because of
the limited computational speed of LMM, and it was selected at random, rather than from a specific
region, to give a reasonable comparison between the fast methods and LMM. The histograms of MAFs
from all 8,979,131 SNPs and from the 19,570 selected SNPs are plotted in Figure 3-22. The
distribution of MAFs in the random subset resembles that of all SNPs.
Figure 3-22 Histograms of MAFs of all SNPs (n=8,979,131) and randomly selected SNPs (n=19,570).
P-P plots comparing p-values on the −log10 scale are shown in Figure 3-23. The x axis extends to
at most 4 and the y axis to at most 5, showing that no SNP in the random selection was significant
at the conventional GWAS threshold. Relative to the red diagonal line, SAO and TS show greater
dispersion than the other methods.
E-E plots comparing parameter estimates are shown in Figure 3-24. The dashed lines x=0 and y=0
indicate where the SNP effect estimated by LMM or by the fast method is 0. The center of the
cluster in each subplot lies at the intersection of x=0 and y=0, showing that most SNPs in the
random selection are under the null hypothesis. As in the P-P plots, SAO and TS show more
variability than the other methods and appear slightly asymmetric around the diagonal line.
In Table 3-5, we calculated the intraclass correlation coefficient (ICC) to assess the intra-SNP
correlation between the compared p-values or estimates. For p-values on the −log10 scale, SAO and
TS have moderate correlation with LMM (ICC 0.73 and 0.70), and the remaining fast methods have
high correlation with LMM (ICC 0.86 to 0.95). For parameter estimates, SAO and TS are again the
only two methods with moderate correlation. We also conducted paired Wilcoxon signed-rank tests on
the p-values and paired t-tests on the estimates, at a significance level of 5%. On the −log10
scale, the p-values generated by each fast method differ significantly from those of LMM. All fast
methods except WSAO give smaller p-values than LMM on average, leaving WSAO as the only
conservative method. For the parameter estimates, none of the fast methods differs significantly
from LMM.
Figure 3-23 P-P plots on outcome logAER on random subset of SNPs. X and y axis are in -log10 scale. Red
solid line: diagonal line. b1: cross-sectional SNP effect. b2: longitudinal SNP effect. LMM(1): cross-
sectional SNP effect by LMM1. LMM(2): cross-sectional SNP effect by LMM2.
Figure 3-24 E-E plots on outcome logAER on random subset of SNPs. Red solid line: diagonal line. b1:
cross-sectional SNP effect. b2: longitudinal SNP effect. LMM(1): cross-sectional SNP effect by LMM1.
LMM(2): cross-sectional SNP effect by LMM2. Black dashed lines: x=0 or y=0.
Table 3-5 Statistical comparison between fast methods and LMM on random 19570 SNPs.
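| | LMM | β2 | β2 | β2 | β2 | β1 (LMM1) | β1 (LMM2) | β2 | 2df |
|---|---|---|---|---|---|---|---|---|---|
| | Fast | SAO | WSAO | TS | CTS | GA (β1) | GA (β1) | GA (β2) | GA (2df) |
| P* | ICC | 0.73 | 0.91 | 0.70 | 0.94 | 0.86 | 0.94 | 0.94 | 0.95 |
| | P_w | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| | Δ̄ | -0.022 | 0.032 | -0.017 | -0.017 | -0.031 | -0.041 | -0.017 | -0.034 |
| β | ICC | 0.81 | 0.95 | 0.77 | 0.97 | 0.93 | 0.97 | 0.96 | -- |
| | t_pt | -0.18 | -0.02 | 0.89 | -0.99 | 1.32 | 0.79 | -0.39 | -- |
| | P_pt | 0.855 | 0.983 | 0.371 | 0.321 | 0.188 | 0.429 | 0.695 | -- |
| | Δ̄ | -5.82e-6 | -2.92e-7 | 2.17e-5 | -9.64e-6 | 2.10e-4 | 7.98e-5 | -4.77e-6 | -- |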
GA: GALLOP. P*: −log10(P) of SNPs. β: parameter estimate. ICC: intraclass correlation coefficient
(ICC3 calculated by the R package psych using the lme4 option (Revelle, 2018)). P_w: p-value of
paired Wilcoxon signed-rank test. t_pt: t statistic of paired t-test. P_pt: p-value of paired
t-test. Δ̄: mean difference of −log10(P) or β between LMM and fast method. Bolded are p-values
under 5%. --: not applicable.
3.3.2.3 Methods Comparison on Detecting Significant SNPs
We conducted GWAS with the fast methods on all SNPs (n=8,979,131). For the logAER outcome,
Figure 3-25 provides histograms of p-values, Figure 3-26 provides Q-Q plots and Figure 3-28
provides Manhattan plots for each method. Apart from WSAO, all methods show an approximately
uniform distribution of p-values in the histograms (Figure 3-25). WSAO has a higher density of
large p-values and a deflated genomic control value in the Q-Q plot (Figure 3-26). In the
Manhattan plots (Figure 3-28), SAO detected many more SNPs than the other fast methods at the
conventional GWAS threshold. To check for potential false discoveries caused by low MAF, the Q-Q
plots in Figure 3-27 are stratified by 1%<MAF<5% (n=2,238,985) and 5%≤MAF≤50% (n=6,740,146) to see
whether the distributions of p-values differ between these two MAF ranges. In Figure 3-27, SAO
separates the two groups of points the most, with the largest difference (0.0415) between the
genomic control values for low and high MAF.
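For reference, a genomic control value can be computed from a set of p-values as sketched below in Python. The sketch assumes the common definition for 1-df tests, the median of the chi-square statistics implied by the p-values divided by the median of a chi-square distribution with 1 df (about 0.455); whether the genomic control values reported here were computed in exactly this way is not stated in this section. The function name `genomic_control` and the example p-values are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 df (about 0.4549).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_control(p_values):
    """Genomic control value for two-sided p-values from 1-df tests (0 < p <= 1)."""
    # A two-sided p-value corresponds to the squared normal quantile at p / 2.
    chi2 = [_STD_NORMAL.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a uniform null distribution.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
# Shifting p-values toward 1 mimics a conservative (deflated) method.
conservative_p = [p ** 0.5 for p in uniform_p]

gc_uniform = genomic_control(uniform_p)
gc_conservative = genomic_control(conservative_p)
print(round(gc_uniform, 3), round(gc_conservative, 3))
```

A value near 1 indicates p-values consistent with the null distribution, while a value below 1, as described for WSAO, indicates an excess of large p-values. Applying the function separately to the low and high MAF groups would give stratified values comparable to gc_L and gc_H.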
Figure 3-25 Histograms of p-values (logAER).
Figure 3-26 Q-Q plots of p-values (logAER). gc: genomic control value.
Figure 3-27 Q-Q plots of p-values stratified by MAF (logAER). Red dots: SNPs with 1%<MAF<5% (gc_L
for genomic control value). Blue dots: 5%≤MAF≤50% (gc_H for genomic control value).
Figure 3-28 Manhattan plots (logAER). Red line: 𝑃 = 5 × 10−8. Blue line: 𝑃 = 10−5.
As mentioned in Section 2.5 Methods for DCCT/EDIC Data Analysis, we applied a loose threshold of
P < 10⁻⁵ to select candidate SNPs from the fast method results. A total of 1089 SNPs were
selected by the fast methods. Running the full LMM on these candidate SNPs, we found that for the
cross-sectional SNP effect, neither GALLOP nor LMM found SNPs associated with logAER at
P < 5 × 10⁻⁸. For the longitudinal SNP effect, LMM detected 4 SNPs on chromosome 1, which were
also found by all fast methods. For the joint test of both SNP effects, GALLOP and LMM both
discovered the same 4 SNPs on chromosome 1 plus one additional SNP on chromosome 10.
We present summary information on these 5 significant SNPs in Table 3-6. We also compare
parameter estimates, standard error and p-values on these SNPs between fast methods and LMM
in Table 3-7 and Table 3-8. Results for the 4 SNPs on chromosome 1 are the same so we only
display rs3817222 as representative.
Table 3-6 Summary information of significant SNPs (P < 5 × 10⁻⁸) for outcome logAER.

| Chr | SNP | BP | MAF | INFO | A1:A2 | β1 (LMM1) | SE | P | β1 (LMM2) | SE | P | β2 | SE | P | P (2df) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | rs3817222 | 202464760 | 0.26 | 1.00 | C:T* | 0.015 | 0.403 | 0.971 | -0.056 | 0.404 | 0.891 | 0.026 | 0.004 | 5.02e-11 | 4.30e-11 |
| 1 | rs12734338 | 202469723 | 0.26 | 1.00 | T:C* | 0.015 | 0.403 | 0.971 | -0.056 | 0.404 | 0.891 | 0.026 | 0.004 | 5.02e-11 | 4.30e-11 |
| 1 | rs12743401 | 202476648 | 0.26 | 1.00 | T:C* | 0.015 | 0.403 | 0.971 | -0.056 | 0.404 | 0.891 | 0.026 | 0.004 | 5.02e-11 | 4.30e-11 |
| 1 | rs3881953 | 202528021 | 0.26 | 1.00 | G:A* | 0.015 | 0.403 | 0.971 | -0.056 | 0.404 | 0.891 | 0.026 | 0.004 | 5.02e-11 | 4.30e-11 |
| 10 | rs74155187 | 107418625 | 0.01 | 0.99 | T*:C | -3.766 | 2.532 | 0.137 | -0.469 | 0.176 | 0.008 | -0.077 | 0.017 | 1.07e-05 | 1.47e-08 |
The significance is determined by LMM on any kind of effect of the SNP under the conventional GWAS threshold. Chr: chromosome. BP: position in
base pairs on the chromosome (1000 Genomes phase 3 v5). INFO: quality of imputation. A1 and A2 are alleles 1 and 2, with the minor allele indicated
with *. Parameter estimates (β), standard errors (SE) and p-values (P) from LMMs are presented. β1(LMM1) and β1(LMM2): estimates of cross-sectional SNP
effects from LMM1 and LMM2. β2: estimate of longitudinal SNP effect. P(2df): p-value for 2df test. Bolded SNPs are selected for following analysis.
Bolded p-values are under 5 × 10⁻⁸.
Table 3-7 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs3817222 on logAER.
Method Hypothesis test BETA SE P
LMM 𝛽1𝐿𝑀𝑀1 0.015 0.403 0.971
LMM 𝛽1𝐿𝑀𝑀2 -0.056 0.404 0.891
LMM β2 0.026 0.004 5.02e-12
LMM 2df -- -- 4.30e-11
SAO β2 0.030 0.005 2.09e-08
WSAO β2 0.028 0.005 6.76e-10
TS β2 0.014 0.002 2.07e-08
CTS β2 0.025 0.004 8.83e-12
GALLOP β1 -0.037 0.394 0.924
GALLOP β2 0.028 0.004 2.47e-11
GALLOP 2df -- -- 2.13e-10
𝛽1𝐿𝑀𝑀1 and 𝛽1𝐿𝑀𝑀2: estimates of cross-sectional SNP effects from LMM1 and LMM2. 𝛽1: estimate of
cross-sectional effect. 𝛽2: estimate of longitudinal effect. 2df: 2df test. “: same as above. --: not
applicable. Bolded p-values are under 5 × 10⁻⁸.
Table 3-8 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs74155187 on logAER.
Method Hypothesis test BETA SE P
LMM 𝛽1𝐿𝑀𝑀1 -0.689 0.169 4.87e-05
LMM 𝛽1𝐿𝑀𝑀2 -0.469 0.176 0.008
LMM β2 -0.076 0.017 1.07e-05
LMM 2df -- -- 1.47e-08
SAO β2 -0.101 0.023 1.42e-05
WSAO β2 -0.093 0.020 5.65e-06
TS β2 -0.063 0.011 5.13e-09
CTS β2 -0.080 0.016 6.00e-07
GALLOP β1 -0.367 0.170 0.031
GALLOP β2 -0.093 0.019 8.19e-07
GALLOP 2df -- -- 3.12e-08
𝛽1𝐿𝑀𝑀1 and 𝛽1𝐿𝑀𝑀2: estimates of cross-sectional SNP effects from LMM1 and LMM2. 𝛽1: estimate of
cross-sectional effect. 𝛽2: estimate of longitudinal effect. 2df: 2df test. “: same as above. --: not
applicable. Bolded p-values are under 5 × 10⁻⁸.
Using the efficiency measures defined in Section 2.5 Methods for DCCT/EDIC Data Analysis,
we calculated the discovery proportions E_l and E_s for the fast methods (Table 3-9). All fast
methods detected at least the 4 SNPs on chromosome 1. For E_l, in the subset of SNPs with P <
10⁻⁵ by fast methods, the discovery proportions are all below 3% except for WSAO (7.94%). WSAO
therefore shows the best ability to narrow down the range of potentially associated SNPs. SAO has very low
efficiency because it yields the most candidate SNPs yet does not include all 5 SNPs. For E_s, only TS
and the GALLOP 2df test detect all 5 SNPs at P < 5 × 10⁻⁸.
Table 3-9 Measures of efficiency of fast methods on logAER.
logAER SAO WSAO TS CTS GALLOP (β2) GALLOP (2df)
E_l 0.70% (4) 7.94% (5) 1.79% (5) 2.66% (5) 2.24% (5) 1.43% (5)
E_s 8.89% (4) 80.00% (4) 38.46% (5) 66.67% (4) 80% (4) 100% (5)
E_l: % of SNPs detected by LMM under P < 5 × 10⁻⁸ out of SNPs detected by fast methods under P < 10⁻⁵. E_s:
% of SNPs detected by LMM under P < 5 × 10⁻⁸ out of SNPs detected by fast methods under P < 5 × 10⁻⁸.
Count in brackets is the count of SNPs detected by LMM under P < 5 × 10⁻⁸. The maximum count
should be 5.
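The two discovery proportions defined in the table note can be sketched as below. This is an illustration under stated assumptions: the function name, the dictionary inputs and the toy SNP identifiers are hypothetical, and SNPs without an LMM p-value are treated as not significant.

```python
GWAS_THRESHOLD = 5e-8    # conventional GWAS threshold
LOOSE_THRESHOLD = 1e-5   # loose threshold used to select candidates

def discovery_proportion(fast_p, lmm_p, fast_threshold):
    """Return (proportion, count) of LMM-significant SNPs among the SNPs
    that a fast method selects at `fast_threshold`."""
    selected = [snp for snp, p in fast_p.items() if p < fast_threshold]
    if not selected:
        return float("nan"), 0
    hits = sum(1 for snp in selected if lmm_p.get(snp, 1.0) < GWAS_THRESHOLD)
    return hits / len(selected), hits

# Toy p-values (made up for illustration only).
fast_p = {"snp1": 1e-9, "snp2": 3e-6, "snp3": 2e-8, "snp4": 0.2}
lmm_p = {"snp1": 4e-10, "snp2": 1e-4, "snp3": 6e-8, "snp4": 0.3}

e_l = discovery_proportion(fast_p, lmm_p, LOOSE_THRESHOLD)  # E_l
e_s = discovery_proportion(fast_p, lmm_p, GWAS_THRESHOLD)   # E_s
```

The count returned alongside each proportion plays the role of the bracketed counts in the table.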
3.3.2.4 Significant SNPs
To examine whether the significant SNPs, rs3817222 on chromosome 1 and rs74155187
on chromosome 10, truly have longitudinal effects on the outcome,
we broke the longitudinal dataset down into cross-sectional subsets stratified by study year. For
each year, we ran a linear regression of the outcome on the SNP, adjusting for the same covariates.
Because of the staggered entry in DCCT, we conducted the cross-sectional analysis in DCCT and
EDIC separately and combined (Figure 3-29). The SNP effect estimates, with 95% CIs
constructed from the SEs, are shown in Figure 3-29 to display the change over time. When DCCT and
EDIC are combined, the last 4 years for rs3817222 yield missing estimates of the SNP effect
because of singularity: some covariates, such as sex, are not linearly independent of
SNP dosage in those years, a consequence of the much smaller number of subjects
enrolled in later DCCT years. The last year for rs74155187 produces a very large estimate, which
is caused by the sparse distribution of the SNP in that year. In DCCT years, rs3817222 has no estimates
significantly different from 0. In EDIC years, the estimates differ between
odd and even years, with even years always producing a negative estimate of the SNP effect except
for the last year. The difference between odd and even years is significant at the beginning and
narrows over the years. For rs74155187, the DCCT effects are always negative, with
absolute effect sizes of 0~2. In EDIC years, the difference between alternate years is less obvious,
especially in early years, although a pattern emerges in the last few years.
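The year-by-year analysis described above can be sketched as follows. This is a simplified illustration, assuming a single predictor (SNP dosage) and a normal-approximation 95% CI; the covariate adjustment used in the thesis is omitted, and the function names and the record layout are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def slope_with_ci(x, y, z=1.96):
    """OLS slope of y on x with an approximate 95% CI (beta +/- z * SE)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if n < 3 or sxx == 0:
        return None  # slope not estimable, e.g. no variation in dosage that year
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - beta * mx
    rss = sum((yi - intercept - beta * xi) ** 2 for xi, yi in zip(x, y))
    se = sqrt(rss / (n - 2) / sxx)
    return beta, beta - z * se, beta + z * se

def yearly_snp_effects(records):
    """records: iterable of (year, dosage, outcome); returns {year: result}."""
    by_year = defaultdict(lambda: ([], []))
    for year, dosage, outcome in records:
        by_year[year][0].append(dosage)
        by_year[year][1].append(outcome)
    return {year: slope_with_ci(x, y) for year, (x, y) in sorted(by_year.items())}
```

A year in which the dosage does not vary returns `None`, which loosely mirrors the missing estimates that arise from singularity in the sparse later years.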
Figure 3-29 Estimates along with 95% CI for cross-sectional SNP effects on logAER by combined and
separate DCCT/EDIC years. Only rs3817222 on chromosome 1 is plotted as rs12734338, rs12743401 and
rs3881953 have the same results. Dashed lines represent 0 which is no SNP effect.
We made LocusZoom plots (Pruim et al., 2010) of the 2df p-values from LMMs for two selected
regions that contain all 5 SNPs (Figure 3-30 and Figure 3-31). The 4 SNPs on
chromosome 1 are close to each other and all give the same results under this model. The
other SNPs around them have large p-values, indicating little evidence of association with the outcome
logAER. For rs74155187, many SNPs have relatively high correlation with the reference SNP,
as measured by r², but no other SNP reaches or comes close to the conventional GWAS threshold.
Figure 3-30 Locus plot on logAER with reference SNP rs3817222. The display range is from 202.06 –
202.86 Mb on chromosome 1 with 2024 SNPs plotted in total. P-values on left y-axis are 2df p-values by
LMM. SNPs with the same p-values are (from left to right) rs12734338, rs12743401 and rs3881953. There
is no usable linkage disequilibrium for these 4 SNPs. Reference panel: human genome 19/1000 Genomes
Nov 2014 European.
Figure 3-31 Locus plot on logAER with reference SNP rs74155187. The display range is from 107.02 –
107.82 Mb on chromosome 10 with 2445 SNPs plotted in total. P-values on left y-axis are 2df p-values by
LMM. Correlation between reference SNP and other SNPs are displayed in color shown in legend.
Reference panel: human genome 19/1000 Genomes Nov 2014 European.
3.3.3 GWAS of eGFR
3.3.3.1 Slopes for Two-stage Methods
First, we compare the two-stage methods SAO (WSAO), TS and CTS to see whether
the slopes from the first step differ systematically. The figure is generated in the
same way as Figure 3-21. In the upper plots of Figure 3-32, the slopes represent the effect of time
on the longitudinal trait eGFR, with absolute effect sizes from 0 to 10. All three distributions have their
highest frequency at around 0 and are left-skewed, which is the opposite of logAER. We also
made Bland-Altman plots (Altman and Bland, 1986; Bland and Altman, 1999) for pairwise
comparison in the lower plots of Figure 3-32. In addition, paired t-tests were conducted to test whether the
mean difference is 0 at a significance level of 5%. The results again show a significant
difference between SAO and TS (p<0.001, t=-55.426, ΔSAO-TS=-1.496) and between SAO and CTS (p<0.001,
t=-67.310, ΔSAO-CTS=-1.496), whereas TS and CTS are not significantly different (p=1, t=4.63e-12,
ΔTS-CTS=5.33e-14). Both the t-tests and the Bland-Altman plots show that SAO always gives a larger
negative slope than TS and CTS, meaning a larger negative time effect on the outcome, which is the
opposite direction to logAER.
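The pairwise slope comparison above can be sketched as follows. This is a minimal illustration of a paired t statistic and the quantities plotted in a Bland-Altman plot; the function name and the returned keys are hypothetical. It returns both the 95% CI of the mean difference (as described in the caption of Figure 3-32) and the conventional Bland-Altman limits of agreement (mean difference ± 1.96 SD).

```python
from math import sqrt
from statistics import mean, stdev

def paired_comparison(a, b):
    """Paired t statistic and Bland-Altman summaries for two slope vectors."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    d_bar = mean(diffs)
    sd = stdev(diffs)          # undefined comparison if all differences are equal
    se = sd / sqrt(n)
    return {
        "mean_diff": d_bar,
        "t": d_bar / se,       # compare with a t distribution on n - 1 df
        "ci_mean_diff": (d_bar - 1.96 * se, d_bar + 1.96 * se),
        "limits_of_agreement": (d_bar - 1.96 * sd, d_bar + 1.96 * sd),
        # Bland-Altman points: (mean of the pair, difference of the pair)
        "points": [((x + y) / 2, x - y) for x, y in zip(a, b)],
    }
```

A negative mean difference for the pair (SAO, TS) would correspond to SAO giving more negative slopes on average, as reported in the text.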
Figure 3-32 Slopes for two-stage fast methods on outcome eGFR. Upper: Histograms of slopes for two-
stage fast methods (eGFR). Lower: Bland-Altman plot comparing slopes between two fast methods. In
Bland-Altman plots, blue(middle) dashed line is mean difference. Green(top) and red(bottom) dashed lines
are upper and lower limits of 95% CI of mean difference.
3.3.3.2 Methods Comparison on Random Selection
Before conducting GWAS, we adopted the same random selection of 19,570 SNPs. We ran models
with all fast methods and LMM on this random selection of SNPs to compare the p-values and
parameter estimates.
Figure 3-33 shows P-P plots comparing p-values on the −log10 scale for eGFR. The maximum
values on the x and y axes show that no SNP in this random
selection is significant for eGFR under the conventional GWAS threshold. SAO, TS and GALLOP (LMM(1) vs
GALLOP(b1)) have more widely scattered points than the others. The WSAO points are less scattered, but they
are not symmetric around the diagonal line.
Figure 3-34 shows the corresponding E-E plots comparing parameter estimates for eGFR. The center
of the cluster in each subplot lies at the intersection of x=0 and y=0, showing that most SNPs in the random
selection are under the null hypothesis. As in the P-P plots, the same three methods have more
widely scattered clusters than the others. In addition, as in the E-E plots for logAER (Figure 3-24), SAO and
TS appear slightly asymmetric around the diagonal line.
In Table 3-10, as in Table 3-5, we first calculated the ICC. SAO and TS are moderately correlated
with LMM in both p-values (−log10 scale) and parameter estimates. GALLOP has
a good but slightly lower correlation with LMM1 for the cross-sectional SNP effect than in Table
3-5. We also performed paired Wilcoxon signed-rank tests on p-values and paired t-tests on parameter
estimates at a significance level of 5%. On the −log10 scale, the p-values
generated by the fast methods and LMM are not significantly different except for WSAO.
As for logAER, WSAO is conservative and produces larger p-values than LMM. For
the parameter estimates, none of the comparisons between fast methods and LMM was
significantly different.
Figure 3-33 P-P plots on outcome eGFR on random 19570 SNPs. X and y axis are in -log10 scale. Red
solid line: diagonal line. b1: cross-sectional SNP effect. b2: longitudinal SNP effect. LMM(1): cross-
sectional SNP effect by LMM1. LMM(2): cross-sectional SNP effect by LMM2.
Figure 3-34 E-E plots on outcome eGFR on random 19570 SNPs. Red solid line: diagonal line. b1: cross-
sectional SNP effect. b2: longitudinal SNP effect. LMM(1): cross-sectional SNP effect by LMM1.
LMM(2): cross-sectional SNP effect by LMM2. Black dashed lines: x=0 or y=0.
Table 3-10 Statistical comparison between fast methods and LMM on random 19570 SNPs.
LMM β2 β2 β2 β2 β1(LMM1) β1(LMM2) β2 2df
Fast SAO WSAO TS CTS GA (β1) GA (β1) GA (β2) GA (2df)
P* ICC 0.72 0.91 0.82 0.90 0.74 0.93 0.93 0.94
P* P_w 0.642 <0.001 0.392 0.085 0.759 0.592 0.261 0.602
P* Δ̅ -0.002 0.085 0.001 -0.002 0.001 8.78e-5 -0.003 -0.001
β ICC 0.78 0.97 0.80 0.95 0.87 0.97 0.95 --
β t_pt -0.18 0.09 -1.00 1.94 -1.30 -1.37 0.45 --
β P_pt 0.857 0.931 0.316 0.052 0.195 0.171 0.654 --
β Δ̅ -8.44e-5 1.13e-5 -2.83e-4 3.31e-4 -0.004 -0.002 8.04e-5 --
GA: GALLOP. P*: −log10(P) of SNPs. β: parameter estimate. ICC: intraclass correlation coefficient
(ICC3 calculated by R package psych using lme4 option (Revelle, 2018)). P_w: p-value of paired Wilcoxon
signed-rank test. t_pt: t statistic of paired t-test. P_pt: p-value of paired t-test. Δ̅: mean difference of −log10(P)
or β between LMM and fast method. Bolded are p-values under 5%. --: not applicable.
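For readers unfamiliar with ICC3, the sketch below computes it for two methods using the classical ANOVA form (two-way mixed effects, consistency, single measurement). This is an assumption made for illustration: the thesis used the psych package with the lme4 option, which estimates the variance components with a mixed model, so values need not match exactly. The function name is hypothetical.

```python
def icc3(x, y):
    """ICC(3,1) for two methods (x, y) measured on the same SNPs."""
    n, k = len(x), 2
    rows = list(zip(x, y))
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / k for a, b in rows]
    col_means = [sum(x) / n, sum(y) / n]
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between SNPs
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between methods
    ss_err = ss_total - ss_rows - ss_cols                    # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
```

Because ICC3 measures consistency rather than absolute agreement, a constant offset between two methods does not lower it, which is why the mean differences are tested separately in the table.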
3.3.3.3 Methods Comparison on Detecting Significant SNPs
The same GWAS process was carried out for eGFR, with histograms of p-values (Figure 3-35),
Q-Q plots (Figure 3-36), stratified Q-Q plots (Figure 3-37) and Manhattan plots (Figure 3-38) provided
for each method. We observed several features common to logAER and eGFR:
a. WSAO has a larger density at large p-values in the histograms (Figure 3-35).
b. WSAO has a deflated genomic control value in Figure 3-36, and this value is the farthest
from 1 among all fast methods.
c. SAO has the largest difference (0.0345) between genomic control values for low MAF and
high MAF.
d. SAO detected many more SNPs than the other fast methods under the conventional GWAS
threshold in Figure 3-38.
Figure 3-35 Histograms of p-values (eGFR).
Figure 3-36 Q-Q plots of p-values (eGFR). gc: genomic control value.
Figure 3-37 Q-Q plots of p-values stratified by MAF (eGFR). Red dots: SNPs with 1%<MAF<5% (gc_L
for genomic control value). Blue dots: 5%≤MAF≤50% (gc_H for genomic control value).
Figure 3-38 Manhattan plots (eGFR). Red line: P = 5 × 10⁻⁸. Blue line: P = 10⁻⁵.
As a result, a total of 2347 SNPs were selected across all fast methods to run the full LMM, which
yielded 22 significant findings. In addition, the same SNP on chromosome 10 as for logAER (Section 3.3.2.1),
rs74155187, was found again by both GALLOP and LMM at the level of 5 × 10⁻⁸,
with significant longitudinal and joint effects.
Summary information on the 22 SNPs is given in Table 3-11. Ten of the SNPs lie close to each other on
chromosome 2 and have similar statistics. All of the
remaining SNPs have MAF under 2%. In the stratified Q-Q plots by fast methods in Figure 3-37,
the genomic control values differ little between low and high MAF. However,
when MAF is in the range of 1%-5%, the dots in the Q-Q plots deviate more from the
diagonal line than for 5%-50%, suggesting more potential signals, especially for longitudinal SNP
effects. Therefore we chose to filter out the SNPs with low MAF and provide inference on
rs12713270 to represent the 10 SNPs on chromosome 2. We also kept rs74155187 as it is the only
common SNP for both logAER and eGFR which has the same positive direction for longitudinal
SNP effect.
We also compare parameter estimates, standard errors and p-values for these SNPs between the fast
methods and LMM in Table 3-12 and Table 3-13. Results for the 10 SNPs on chromosome 2 are
similar, so we only display rs12713270 as a representative.
Table 3-11 Summary information of significant SNPs (P < 5 × 10⁻⁸) for outcome eGFR.
Chr SNP BP MAF INFO A1:A2 β1(LMM1) SE P β1(LMM2) SE P β2 SE P P(2df)
1 rs74409324 34832080 0.01 0.83 A*:G 2.459 2.078 0.237 -2.097 2.229 0.347 1.024 0.183 2.38e-08 8.68e-08
1 rs10538156 75791208 0.01 0.99 A*:AGTT 0.390 10.389 0.970 -32.935 11.333 0.004 7.542 1.023 1.69e-13 1.54e-12
1 rs1707184 75794925 0.01 0.99 A*:C 0.390 10.389 0.970 -32.935 11.333 0.004 7.542 1.023 1.69e-13 1.54e-12
2 rs12713270 54832751 0.20 1.00 A*:C -0.475 0.507 0.349 -1.517 0.542 0.005 0.236 0.043 3.96e-08 1.82e-07
2 rs67244067 54835845 0.20 1.00 A*:G -0.451 0.508 0.375 -1.506 0.543 0.006 0.239 0.043 2.81e-08 1.36e-07
2 rs6724136 54837377 0.22 1.00 G*:A -0.621 0.490 0.205 -1.630 0.543 0.002 0.229 0.041 3.29e-08 1.06e-07
2 rs10183043 54838859 0.22 1.00 G*:T -0.622 0.490 0.205 -1.630 0.543 0.002 0.229 0.041 3.27e-08 1.05e-07
2 rs12713272 54842302 0.20 1.00 T*:C -0.371 0.505 0.463 -1.405 0.539 0.009 0.234 0.043 4.55e-08 2.46e-07
2 rs11902659 54842317 0.20 1.00 G*:A -0.362 0.504 0.473 -1.402 0.539 0.009 0.235 0.043 3.91e-08 2.14e-07
2 rs7569127 54844926 0.20 1.00 G*:T -0.390 0.504 0.438 -1.432 0.538 0.008 0.236 0.043 3.33e-08 1.76e-07
2 rs10176359 54845698 0.20 1.00 A*:G -0.387 0.504 0.443 -1.429 0.538 0.008 0.236 0.043 3.34e-08 1.77e-07
2 rs13408295 54854905 0.20 1.00 T*:C -0.380 0.507 0.454 -1.428 0.542 0.009 0.237 0.043 3.51e-08 1.89e-07
2 rs72806653 54892663 0.21 1.00 A*:C -0.687 0.499 0.169 -1.704 0.533 0.001 0.231 0.042 4.71e-08 1.30e-07
7 rs189015886 105075505 0.01 0.88 A*:G 1.807 1.973 0.360 -2.735 2.116 0.196 1.003 0.169 2.74e-09 1.36e-08
10 rs2339623 52688098 0.01 0.99 G:C* 3.134 2.934 0.286 9.216 3.139 0.003 -1.369 0.250 4.45e-08 1.79e-07
10 rs2339622 52688177 0.01 0.99 T:G* 3.130 2.934 0.286 9.214 3.140 0.003 -1.370 0.250 4.42e-08 1.78e-07
10 rs2339621 52689642 0.01 0.99 T:G* 3.059 2.941 0.298 9.180 3.148 0.004 -1.378 0.251 4.05e-08 1.68e-07
10 rs10821959 52691497 0.01 0.99 C:G* 2.552 2.987 0.393 8.890 3.199 0.006 -1.424 0.255 2.39e-08 1.20e-07
10 rs74155187 107418625 0.01 0.99 T*:C -3.766 2.532 0.137 -9.642 2.711 3.89e-04 1.339 0.220 1.09e-09 2.85e-09
10 rs1674926 127851203 0.01 0.97 A*:T -0.447 7.256 0.951 -18.015 7.768 0.021 3.887 0.612 2.24e-10 1.80e-09
14 rs72712117 99147183 0.01 0.84 C*:A 2.218 1.928 0.250 -1.983 2.067 0.338 0.946 0.168 1.81e-08 6.75e-08
21 rs117099726 32194470 0.01 0.86 A*:G 3.960 2.189 0.071 -0.788 2.350 0.738 1.052 0.191 3.44e-08 4.81e-08
The significance is determined by LMM on any kind of effect of the SNP under the conventional GWAS threshold. Chr: chromosome. BP: position in base pairs on
the chromosome (1000 Genomes phase 3 v5). INFO: quality of imputation. A1 and A2 are alleles 1 and 2, with the minor allele indicated with *. Parameter estimates
(β), standard errors (SE) and p-values (P) from LMMs are presented. β1(LMM1) and β1(LMM2): estimates of main effects from LMM1 and LMM2. β2: estimate of SNP-time
interaction effect. P(2df): p-value for 2df test. Bolded SNPs are selected for following analysis. Bolded p-values are under 5 × 10⁻⁸.
Table 3-12 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs12713270 on
chromosome 2 for eGFR.
Method Hypothesis test BETA SE P
LMM 𝛽1𝐿𝑀𝑀1 -0.475 0.507 0.349
LMM 𝛽1𝐿𝑀𝑀2 -1.517 0.542 0.005
LMM β2 0.236 0.043 3.96e-08
LMM 2df -- -- 1.82e-07
SAO β2 0.225 0.066 7.44e-04
WSAO β2 0.197 0.051 1.13e-04
TS β2 0.121 0.026 4.47e-06
CTS β2 0.198 0.039 4.03e-07
GALLOP β1 -1.652 0.546 2.49e-03
GALLOP β2 0.238 0.051 3.09e-06
GALLOP 2df -- -- 8.99e-06
𝛽1𝐿𝑀𝑀1 and 𝛽1𝐿𝑀𝑀2: estimates of cross-sectional SNP effects from LMM1 and LMM2. 𝛽1: estimate of
cross-sectional effect. 𝛽2: estimate of longitudinal effect. 2df: 2df test. “: same as above. --: not
applicable. Bolded p-values are under 5 × 10⁻⁸.
Table 3-13 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs74155187 on eGFR.
Method Hypothesis test BETA SE P
LMM 𝛽1𝐿𝑀𝑀1 -3.766 2.532 0.137
LMM 𝛽1𝐿𝑀𝑀2 -9.642 2.711 3.89e-04
LMM β2 1.339 0.220 1.09e-09
LMM 2df -- -- 2.85e-09
SAO β2 1.391 0.332 2.94e-05
WSAO β2 1.293 0.261 7.88e-07
TS β2 0.631 0.131 1.67e-06
CTS β2 1.158 0.194 3.11e-09
GALLOP β1 -11.958 2.737 1.25e-05
GALLOP β2 1.431 0.257 2.52e-08
GALLOP 2df -- -- 1.41e-08
𝛽1𝐿𝑀𝑀1 and 𝛽1𝐿𝑀𝑀2: estimates of cross-sectional SNP effects from LMM1 and LMM2. 𝛽1: estimate of
cross-sectional effect. 𝛽2: estimate of longitudinal effect. 2df: 2df test. “: same as above. --: not
applicable. Bolded p-values are under 5 × 10⁻⁸.
Using the efficiency measures defined in Section 2.5 Methods for DCCT/EDIC Data Analysis,
we calculated the discovery proportions E_l and E_s for the fast methods (Table 3-14). With
more findings by LMM, 22 with a significant longitudinal effect and 6 of these with a significant joint effect,
the efficiency of the fast methods is better discriminated. WSAO still has high values for both
E_l and E_s, but the total number of SNPs it detects is small. For E_l, TS becomes more efficient on
this outcome, with a discovery proportion of 6.42% among its candidate SNPs. CTS and GALLOP have
lower E_l than TS but include all 22 SNPs among their candidate SNPs.
Table 3-14 Measures of efficiency of fast methods on real eGFR data.
eGFR SAO WSAO TS CTS GALLOP (β2) GALLOP (2df)
E_l 0.51% (7) 40.91% (9) 6.42% (21) 4.44% (22) 3.70% (22) 2.33% (16)
E_s 1.85% (3) 100% (3) 71.43% (5) 46.15% (6) 24.24% (8) 30.44% (7)
E_l: % of SNPs detected by LMM under P < 5 × 10⁻⁸ out of SNPs detected by fast methods under P < 10⁻⁵. E_s:
% of SNPs detected by LMM under P < 5 × 10⁻⁸ out of SNPs detected by fast methods under P < 5 × 10⁻⁸.
Count in brackets is the count of SNPs detected by LMM under P < 5 × 10⁻⁸. The maximum count
should be 22.
3.3.3.4 Significant SNPs
Similarly, cross-sectional analyses stratified by study year were conducted for rs12713270 on
chromosome 2 and rs74155187 on chromosome 10, with results shown in Figure 3-39. Both
SNPs show the same pattern in DCCT/EDIC: the cross-sectional SNP effect increases slowly
by study year, changing direction over time. However,
rs74155187 appears to have a larger effect size than rs12713270 overall.
Figure 3-39 Estimate along with 95% CI for cross-sectional SNP effects on eGFR by combined and separate
DCCT/EDIC years. Only rs12713270 on chromosome 2 is plotted as the other findings on chromosome 2
have the similar results. Dashed lines represent 0 which is no SNP effect.
We made LocusZoom plots (Pruim et al., 2010) of the 2df p-values from LMMs for two selected
regions, one containing all SNPs on chromosome 2 (Figure 3-40) and one containing the SNP on chromosome 10
(Figure 3-41). For rs12713270, many SNPs on chromosome 2 are very highly correlated with this
SNP, as measured by r². For rs74155187, the locus plot looks similar to Figure 3-31: many SNPs
have relatively high correlation with the reference SNP, but no other SNP reaches or comes close to
the conventional GWAS threshold.
Figure 3-40 Locus plot with reference SNP rs12713270. The display range is from 54.433 – 55.233 Mb on
chromosome 2 with 2846 SNPs plotted in total. P-values on left y-axis are 2df p-values by LMM.
Correlation between the reference SNP and other SNPs is displayed in the colors shown in the legend. Reference
panel: human genome 19/1000 Genomes Nov 2014 European.
Figure 3-41 Locus plot with reference SNP rs74155187. The display range is from 107.02 – 107.82 Mb on
chromosome 10 with 2445 SNPs plotted in total. P-values on left y-axis are 2df p-values by LMM.
Correlation between reference SNP and other SNPs are displayed in color shown in legend. Reference
panel: human genome 19/1000 Genomes Nov 2014 European.
3.3.4 Speed Comparison
The speed comparison of GWAS is presented in Figure 3-42. As before, GWAS was run on one node
of a high-performance computing server, which includes 144 nodes, each consisting of 2x Intel(R)
Xeon(R) Gold 6140 CPU @ 2.30GHz with 192 GB RAM. The total time was summed for each
method rather than accounting for parallel computation, because different nodes or servers may differ
in their capacity for parallel running. As expected, the two-stage methods SAO, WSAO, TS
and CTS run at about the same speed, as they differ only in the time for the first step of calculating
slopes, which is negligible compared to the GWAS analysis time. GALLOP requires more time
than the other fast methods because of its different algorithm, but it provides more inference, including
both single and joint SNP effects. Moreover, the longer running time of GALLOP for eGFR than for logAER
indicates that the speed of GALLOP is also affected by the number of measurements, whereas the
two-stage methods are not affected because the data are transformed into cross-sectional data. Because
LMM was expected to be slow, the whole genome was not analyzed with it, and only the random
subset of 19,570 SNPs was timed for the speed comparison. LMM took 169.13 hours on these SNPs
for logAER and 191.62 hours on the same set of SNPs for eGFR. With the same number of SNPs,
the speed of LMM is slightly affected by the number of measurements, as it took 22.49
hours longer for eGFR than for logAER. Extrapolating from the time for 19,570 SNPs to
8,979,131 SNPs, running GWAS with LMMs would take about 78k hours for logAER and 88k hours for eGFR
without parallel running.
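The extrapolation above is a simple proportional scaling, which can be checked with the short sketch below. It assumes that LMM running time grows linearly with the number of SNPs; the function name is hypothetical, and the inputs are the figures quoted in the text.

```python
def extrapolate_hours(hours_on_subset, n_subset, n_total):
    """Scale a measured running time linearly with the number of SNPs."""
    return hours_on_subset * n_total / n_subset

N_SUBSET, N_TOTAL = 19_570, 8_979_131

for outcome, hours in (("logAER", 169.13), ("eGFR", 191.62)):
    est = extrapolate_hours(hours, N_SUBSET, N_TOTAL)
    print(f"{outcome}: about {est / 1000:.0f}k hours")
# prints "logAER: about 78k hours" and "eGFR: about 88k hours"
```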
Figure 3-42 Running time for GWAS. For fast methods, the count of SNPs analyzed is 8,979,131. For
LMM*, time is recorded by running a random selection of 19,570 SNPs (~0.2% of SNPs).
Discussion
4.1 Simulation Study
Simulation studies help us to have a general idea of how methods perform on longitudinal data
with different features. The limitations of the previous Rotterdam simulation studies (Sikorska et
al., 2013b; Sikorska et al., 2018) include:
a. MAF of SNPs was 0.5 for all scenarios.
b. The study did not account for any within subject error correlation and the errors were
generated independently for each visit of each subject.
c. The missing data scenarios in the simulation study only included dropout based on MCAR
and MAR mechanisms with an overall missing rate of ~35%.
d. All methods only provide inference on either cross-sectional or longitudinal SNP effect,
without 2df test.
Compared to previous studies, we extend simulation aspects in MAF, missingness and within-
subject error structure.
4.1.1 Type 1 Error
Using the approximate 99% CI, we want to make sure that these methods produce controlled T1E
in any scenario so that it is fair to compare their powers in the next step.
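One common way to build such an approximate interval is a normal approximation to the binomial proportion of rejections, sketched below. This is an assumption for illustration only: the nominal level of 0.05 and the number of replicates are hypothetical, the function names are invented, and the interval used in the thesis is the one defined in its methods section.

```python
from math import sqrt

def t1e_interval(alpha, n_replicates, z=2.576):
    """Approximate 99% interval for the empirical T1E when the true rate is alpha."""
    half_width = z * sqrt(alpha * (1 - alpha) / n_replicates)
    return alpha - half_width, alpha + half_width

def t1e_controlled(n_rejections, n_replicates, alpha=0.05):
    """True if the empirical rejection rate lies inside the interval."""
    low, high = t1e_interval(alpha, n_replicates)
    return low <= n_rejections / n_replicates <= high
```

An empirical rate above the upper bound would be read as inflated T1E, and one below the lower bound as a conservative test.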
As described in Section 3.2.2, Figure 3-13 showed that most fast methods are relatively stable in
controlling T1E in the current 7 scenarios, except for WSAO, which is significantly conservative when
data are MAR, MNAR or subject to dropout. We also observed inflated T1E for the LMM 2df test
when MAF=0.3 and N=500 in the strong correlation scenario, but T1E stays within the 99% CI for other
combinations of MAF and N. For all the other methods, which are currently well controlled, we mentioned
in Section 1.5.7 Concerns that other papers discuss the possibility of inflated T1E for low MAF
or low minor allele count (Ma et al., 2013). In our study we held one of MAF and N constant (MAF=0.3;
N=2000) while varying the other across a wide range, so the lowest MAF in our settings is 0.01
and the corresponding sample size is 2000. This design is convenient to
conduct but limits the findings; for instance, it does not show how T1E would behave when MAF=0.01 and N=1000.
We do not know whether T1E is inflated in many such specific situations, but the settings can always
be adjusted to conduct simulation studies based on specific real data.
4.1.2 Power
4.1.2.1 Cross-sectional Effect Test
The power for cross-sectional effects by GALLOP and LMM is usually similar
across all scenarios except for the strong correlation scenario (Figure 3-15). Both methods lose little
power in the missing data scenarios. Both lose power when the error correlation is set to strong, but
GALLOP loses more than LMM because it always assumes independent within-subject errors. In
addition, the power for the cross-sectional effect is not affected by whether the longitudinal SNP effect is 0.
4.1.2.2 Longitudinal Effect Test
4.1.2.2.1 Missing Data Scenarios
We observe obvious power loss for the longitudinal SNP effect in the missing data scenarios,
especially MNAR and dropout.
In general, no method is robust against MNAR, which is expected. However, it is counterintuitive
to see the power for the longitudinal effect drop under MNAR when both main and interaction effects
exist (Figure 3-16).
Power loss in the dropout scenario is less severe than under MNAR, perhaps because the
probability of dropout is based on the previous observation, so it can be seen as a type of
MAR. This design is inspired by the eGFR data: once a participant reached an eGFR value under
10 ml/min/1.73m2, the value was recoded as 10 and set to missing for the following visits because of the
development of end-stage renal disease. Data generated under dropout resemble our motivating
data more than the other missing data scenarios in that the number of measurements decreases over the years.
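The eGFR recoding rule described above can be sketched as follows. This is a minimal illustration of the rule as stated in the text (first value under 10 is recoded as 10, later visits are set to missing); the function name is hypothetical, `None` stands for a missing visit, and the simulated dropout mechanism in the thesis, which is probabilistic, is not reproduced here.

```python
def censor_after_floor(values, floor=10.0):
    """Recode the first value below `floor` as `floor` and drop all later visits."""
    out, dropped = [], False
    for v in values:
        if dropped or v is None:
            out.append(None)       # missing visit (after dropout or already missing)
        elif v < floor:
            out.append(floor)      # recode the first value under the floor
            dropped = True         # every following visit becomes missing
        else:
            out.append(v)
    return out

# Example trajectory in ml/min/1.73m2 (made-up values).
trajectory = censor_after_floor([95.0, 60.0, 8.5, 30.0, 12.0])
# trajectory is [95.0, 60.0, 10.0, None, None]
```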
The reasons for the change in performance in these two missing data scenarios are yet to be explained, but they
might be related to our current simulation settings, including the relative sizes of
the cross-sectional and longitudinal SNP effects or the approach used to simulate missingness.
4.1.2.2.2 Error Correlation Scenarios
We observe some power loss in the strong error correlation scenario in Figure 3-16. The strong
error structure affects all fast methods except TS, because they cannot adjust
for correlated within-subject errors. However, the difference between methods is not obvious in the
reference scenario, which allows a medium correlation. When the errors are generated
independently, i.e. with zero error correlation, the fast methods do not lose power and perform
similarly to the reference scenario (Figure 3-16).
4.1.2.3 Two-degree-of-freedom Test
In the power comparisons in Figure 3-15 and Figure 3-16, GALLOP may be the best
fast method across all scenarios so far, in that its 2df test has a clear advantage in all scenarios when both SNP
effects exist and remains powerful when only the cross-sectional or only the longitudinal SNP effect
exists. This agrees with the findings of Kraft et al. in a different setting, who claimed that a joint
test combining tests of the marginal and interaction effects is an optimal approach in most situations
and is more powerful than the test of either the main effect or the interaction effect when both effects exist (Kraft et
al., 2007). Their conclusions were based on cross-sectional GWAS with both outcome and
environmental factor simulated as binary variables (Kraft et al., 2007). Other studies
applying the 2df test to quantitative traits have also shown a gain in power to detect new loci compared to using
main effects alone (Young et al., 2018; Persad et al., 2017). From this simulation study
we conclude that the 2df test for SNP×time is a powerful and efficient approach for longitudinal GWAS with
quantitative traits.
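A generic Wald-type version of a 2df joint test is sketched below to make the idea concrete. This is an assumption for illustration: the 2df tests in LMM and GALLOP may be computed differently (for example as likelihood ratio tests), and the function name and arguments are hypothetical. The sketch takes the two estimates and the elements of their 2×2 covariance matrix and uses the fact that the survival function of a chi-square distribution with 2 degrees of freedom is exp(−x/2).

```python
from math import exp

def wald_2df(b1, b2, var1, var2, cov12):
    """Joint Wald test of H0: b1 = b2 = 0; returns (statistic, p-value)."""
    det = var1 * var2 - cov12 ** 2
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Quadratic form b' V^{-1} b for V = [[var1, cov12], [cov12, var2]].
    stat = (b1 * b1 * var2 - 2.0 * b1 * b2 * cov12 + b2 * b2 * var1) / det
    # Chi-square with 2 df: P(X > x) = exp(-x / 2).
    return stat, exp(-stat / 2.0)
```

When the two estimates are uncorrelated, the statistic reduces to the sum of the two squared z-scores, which shows why the joint test gains power when both effects are present.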
4.1.3 Speed
We simulated and analyzed 1000 SNPs for the speed comparison. The comparison among fast
methods may be unstable; in theory SAO, WSAO, TS and CTS should have very similar
speed on large data because they differ only in the first step of slope calculation. The simulation is
limited by the speed of LMM, and the results show that the fast methods run about 1000 times faster than
LMM (Table 3-3). However, this is only an estimate, because real data may have a different
sample size, MAF, missingness and error structure, all of which can affect the speed of LMM. File
organization may also affect speed, for example whether the GWAS data are stored in one file, separated by
chromosome, or split into chunks. The time can be shortened by parallel processing, but we simply summed
the time for each model because parallel computing systems differ in capability.
4.1.4 Implementation
Ease of implementation is also important to consider. SAO/WSAO are the easiest to conduct
in that the whole algorithm requires only linear regression. Among the others, CTS and GALLOP
need additional code developed from the original papers, which may require further
adjustment for specific data. In addition, CTS cannot transform data with > 22 repeated
measurements because it is infeasible to calculate the orthogonal polynomials needed. This might
require thinning the data from individuals with more than 22 measurements, which could result
in loss of information.
4.2 DCCT/EDIC Data Analysis
4.2.1 GWAS of logAER
The cross-sectional effects of the discovered SNPs for logAER alternated between odd and even
years (Figure 3-29). This may be because during EDIC logAER
is measured every other year for each subject, and the assignment of participants to EDIC years is based
on their entry year in DCCT. Predominantly adolescents (age 13-17 years) were recruited in the early
phase of DCCT, and the other subjects recruited were adults aged 18-39 years (DCCT Research
Group, 1986). This might explain the significant difference (p=0.003) described in Section 3.1.2,
where the first EDIC logAER measure is higher in even years than in odd years, and the populations
measured in odd and even years of the EDIC study may differ. However, it is difficult to separate
participants cleanly into odd-year and even-year groups because, as mentioned in Section 3.1.2, 14.46% of
participants in EDIC deviated from the assignment at least once and switched between odd and even
years of measurement.
4.2.2 GWAS of eGFR
The eGFR outcome behaved differently from logAER in that p-values from most fast
methods are not significantly different from LMM, except for WSAO and GALLOP on
longitudinal SNP effect. LMM detected many more SNPs for eGFR than for logAER
under the conventional GWAS threshold. The difference between the two
outcomes might be explained by a different error structure (Section 3.3.1). eGFR also has more measurements
than logAER. T1E and power may behave differently for a combination of low
MAF, more visits and such an error structure, which is not assessed in our simulation study.
4.3 Limitations and Future Study
4.3.1 Simulation Study Settings
We are aware that our simulation designs do not cover all possible scenarios in real data. Most
importantly, the study is based only on the logAER outcome in the DCCT data, so it cannot be
generalized to other traits and other studies. The simulation results are not definitive and the
designs can always be extended to more specific scenarios, but the current results may provide
some insight into other longitudinal data that resemble one of our scenarios.
4.3.1.1 MAF, Sample Size and Effect Size
As mentioned in 4.1.1, we fixed MAF at 0.3 or sample size at 2000 while varying the other factor
across its range. This shows the change in performance caused by a single factor and is easier
to conduct than a full factorial design. However, without a full factorial design it is impossible
to learn the performance in all combinations, for example a low MAF with a small sample size.
The effect sizes for the cross-sectional and longitudinal SNP effects were selected to produce a
reasonable power plot under the reference scenario. A limitation is that the ratio of the
cross-sectional to the longitudinal effect is fixed. This ratio may differ between studies or vary
over time.
Another option is to include multiple SNPs with different MAFs (e.g. 0.01 – 0.5) or different
effect size coefficients in the simulation model (Eq. 2.4). Applying the fast methods to data
generated from such a model can reveal their performance in identifying the true SNPs, in
association testing and in effect size estimation, all in one model.
4.3.1.2 Missing Pattern
Although we simulated four missing scenarios, our approach to simulating missingness is just one
of many possible ways and perhaps the simplest one. We used a missing rate of 40% to exaggerate
the difference between methods. However, our data have an average missing rate of around 7.5%
for these two outcomes in DCCT/EDIC, and the missing patterns differ between the two studies
(Table 3-2).
In our observed DCCT/EDIC data for renal measurements, one reason for missingness is the study
design. When a participant develops end-stage renal disease by reaching the eGFR threshold
value, he/she is censored afterwards for renal measures (DCCT/EDIC Research Group, 2011).
Another possible reason is death: by 2012, around 107 deaths had occurred in the EDIC cohort
(Orchard and DCCT/EDIC Research Group, 2015). The missingness in our data might therefore be a
mixture of several mechanisms, and a limitation of the simulation study is that we considered
only one reason for the missingness.
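A mixture of missingness mechanisms like the one described above could be simulated along the following lines. This is only a minimal sketch: the function names, the per-visit `rate` and `hazard` parameters and the rule that baseline is always observed are illustrative assumptions, not the four scenarios used in our simulation study.

```python
import random


def mcar_mask(n_visits, rate, rng):
    """True = observed. Baseline is kept; later visits go missing independently."""
    return [True] + [rng.random() >= rate for _ in range(n_visits - 1)]


def dropout_mask(n_visits, hazard, rng):
    """True = observed. After the first missed visit, all later visits are missing."""
    mask, active = [True], True
    for _ in range(n_visits - 1):
        # 'hazard' is the (assumed) per-visit probability of leaving the study
        active = active and rng.random() >= hazard
        mask.append(active)
    return mask


def mixed_mask(n_visits, rate, hazard, rng):
    """A visit is observed only if it survives both mechanisms."""
    sporadic = mcar_mask(n_visits, rate, rng)
    monotone = dropout_mask(n_visits, hazard, rng)
    return [a and b for a, b in zip(sporadic, monotone)]


rng = random.Random(2019)
masks = [mixed_mask(15, rate=0.05, hazard=0.03, rng=rng) for _ in range(1000)]
missing = sum(not seen for m in masks for seen in m)
print("overall missing rate:", missing / (1000 * 15))
```

Combining the two masks keeps the sporadic and the monotone components separately tunable, so the overall rate can be moved toward the level observed in the real data rather than the exaggerated 40%.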
4.3.1.3 Within-subject Error Correlation
We extended the Rotterdam study by allowing within-subject error correlation in LMM. Among the
error structures introduced in 2.2.2 Within-individual Correlation Structure Selection, we used
ARMA(1,1) because model selection identified it as the structure that best describes the errors
for both outcomes. However, we varied only one parameter, γ, to manually adjust the strength of
correlation between time points, as mentioned in Set up. For this structure, the parameter ρ can
also be adjusted to change the correlation, and the study can be extended to different
combinations of γ and ρ. In addition, other types of error structure could be used to generate
data.
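To illustrate how γ and ρ jointly control the correlation, the sketch below builds an ARMA(1,1) correlation matrix for equally spaced visits. It assumes the common parameterization in which the correlation at lag h ≥ 1 is γρ^(h-1); the definition in Section 2.2.2 takes precedence if it differs, and the function name is hypothetical.

```python
def arma11_corr(n_times, gamma, rho):
    """Return an n_times x n_times correlation matrix as nested lists.

    Assumed form: 1 on the diagonal and gamma * rho**(h - 1) at lag h >= 1.
    Not every (gamma, rho) pair gives a valid (positive definite) matrix.
    """
    corr = []
    for i in range(n_times):
        row = []
        for j in range(n_times):
            lag = abs(i - j)
            row.append(1.0 if lag == 0 else gamma * rho ** (lag - 1))
        corr.append(row)
    return corr


# Varying gamma alone (as in our simulation) versus also varying rho.
for gamma, rho in [(0.3, 0.8), (0.6, 0.8), (0.6, 0.5)]:
    first_row = arma11_corr(4, gamma, rho)[0]
    print(gamma, rho, [round(v, 3) for v in first_row])
```

Under this assumed form, γ sets the correlation between adjacent visits while ρ sets how quickly the correlation decays with lag, which is why varying both gives a wider range of scenarios than varying γ alone.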
4.3.2 Real Data Model Specification
In Section 2.5 we included treatment as one of the fixed-effect covariates. By doing so we assume
that the effect of treatment is the same at all time points, including baseline. However,
subjects had not yet been randomized at DCCT baseline, so there should be no effect of the
randomly assigned treatment at that time point. This model misspecification might influence the
results. One potential solution is to include an interaction term between treatment and a binary
indicator for baseline. The treatment effect at baseline is then estimated as 0, which is more
accurate than in the current model.
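One way to code such a term is to multiply the treatment indicator by a post-baseline indicator, so that the treatment covariate is 0 for every subject at baseline. The sketch below assumes a long-format record per visit; the field names (`year`, `treatment`, `treat_post`) and the 0/1 coding of the two arms are hypothetical.

```python
def treatment_covariate(treatment, is_baseline):
    """Treatment term entering the model for one visit.

    treatment: 1 for one arm, 0 for the other (assumed coding).
    is_baseline: True for the DCCT baseline visit.
    """
    post_baseline = 0 if is_baseline else 1
    return treatment * post_baseline


# Hypothetical long-format records: one row per subject and visit.
visits = [
    {"id": 1, "year": 0, "treatment": 1},
    {"id": 1, "year": 1, "treatment": 1},
    {"id": 2, "year": 0, "treatment": 0},
    {"id": 2, "year": 1, "treatment": 0},
]
for v in visits:
    v["treat_post"] = treatment_covariate(v["treatment"], v["year"] == 0)
    print(v)
```

Using `treat_post` in place of `treatment` as the fixed-effect covariate forces both arms to share the same expected baseline level, while the treatment effect applies only to follow-up visits.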
4.3.3 Efficiency Measures
In Section 2.5 Methods for DCCT/EDIC Data Analysis, we defined two measures 𝐸𝑙 and 𝐸𝑠 for the
efficiency or accuracy of fast methods in capturing the same SNPs as LMM under the conventional
GWAS threshold. The definition of these two measures has limitations, especially since we found
in the GWAS of logAER and eGFR that there are clusters of correlated SNPs within a region (for
example, the SNPs on chromosome 2 in Table 3-11). The measures could be modified to reflect the
ability to localize a region instead of requiring exactly the same SNPs as LMM. Starting from
the SNPs that pass the pre-specified threshold in the fast methods, we can find correlated SNPs
in high linkage disequilibrium with the targeted SNP. This cluster of SNPs would then be examined
further to investigate their cross-sectional or longitudinal effects on the trait.
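A region-based version of such a measure might look like the sketch below, which counts an LMM hit as captured when the fast method reports any significant SNP nearby on the same chromosome. The physical window (`window_bp`) is used here only as a crude stand-in for linkage disequilibrium, and the function name and the (chromosome, position) input format are assumptions rather than the definitions of 𝐸𝑙 and 𝐸𝑠.

```python
def region_capture_rate(lmm_hits, fast_hits, window_bp=250_000):
    """Fraction of LMM hits with at least one fast-method hit within window_bp.

    lmm_hits, fast_hits: lists of (chromosome, base-pair position) tuples for
    SNPs passing the significance threshold in each method.
    """
    if not lmm_hits:
        return float("nan")  # undefined when LMM finds nothing
    captured = 0
    for chrom, pos in lmm_hits:
        if any(c == chrom and abs(p - pos) <= window_bp for c, p in fast_hits):
            captured += 1
    return captured / len(lmm_hits)


# Hypothetical positions: two clustered LMM hits on chromosome 2, one on chromosome 7.
lmm = [(2, 1_000_000), (2, 1_100_000), (7, 5_000_000)]
fast = [(2, 1_050_000)]
print(region_capture_rate(lmm, fast))
```

In practice the window would be replaced by an LD threshold (for example r² computed from the genotype data), so that a fast-method hit counts only when it tags the same signal as the LMM hit.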
4.3.4 Heteroscedasticity
In the current analysis with a specified error structure, errors are generated independently from
an identical multivariate distribution for every subject across multiple time points. However, we
did not investigate heteroscedasticity in either the simulation study or the DCCT/EDIC data.
Error variances might differ between subgroups of subjects, for example between treatment
groups (Yamaguchi et al., 2019). We can extend our study to simulate heteroscedastic data and
examine how the performance of the methods is affected by this potential issue.
4.3.5 Weighted Slope as Outcome (WSAO)
WSAO is a new method that we proposed based on the SAO method. It applies weights in the second
step to put more weight on subjects with more observations. The distributions of weights for
logAER and eGFR are presented in Figure 3-5 and Figure 3-9. Both traits showed a left-skewed
distribution of the number of measurements, in which most subjects have around 15 observations
for logAER and over 20 observations for eGFR.
In the simulation study, the T1E results (Figure 3-13) suggest that WSAO is significantly
conservative in the missing data scenarios, especially dropout. In the DCCT/EDIC data, the
missing rates of both traits increase by year in EDIC, suggesting a dropout pattern. Consistent
with this, we again observed conservative performance of WSAO in its p-value histograms
(Figure 3-25, Figure 3-35). The power results for WSAO cannot be considered convincing under such
a missing pattern given the questionable T1E, which might also affect the performance of WSAO in
the DCCT/EDIC data analysis.
We suspect this abnormal performance of WSAO might be caused by the small group of subjects with
few observations, which produces a left-skewed distribution of weights for WSAO. In a future
study, we can repeat the analysis restricted to the subjects with the most records to determine
whether WSAO is affected by evenly distributed or clustered missingness. A different weighting
strategy could also be designed to address the limitation of the current WSAO method.
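The two-step idea behind SAO and WSAO discussed in this section can be sketched as follows. This is a simplified illustration, not our implementation: it ignores covariates and the test statistic, and it assumes that the WSAO weight is simply the number of observations per subject. The function names and the input format are hypothetical.

```python
def wls_slope(x, y, w=None):
    """Slope of a (weighted) least-squares regression of y on x with intercept."""
    if w is None:
        w = [1.0] * len(x)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    return sxy / sxx  # fails if x has no variation (e.g. monomorphic SNP)


def slope_as_outcome(subjects, weighted=False):
    """Estimate the longitudinal SNP effect.

    Step 1: per-subject slope of the trait on time.
    Step 2: regress the slopes on genotype; if weighted, weight each subject
    by its number of observations (assumed WSAO weight).
    """
    slopes, genotypes, weights = [], [], []
    for s in subjects:
        if len(s["times"]) < 2:
            continue  # a slope needs at least two observations
        slopes.append(wls_slope(s["times"], s["values"]))
        genotypes.append(s["genotype"])
        weights.append(float(len(s["times"])) if weighted else 1.0)
    return wls_slope(genotypes, slopes, weights)


# Toy data in which each copy of the allele adds 0.5 to the slope.
subjects = [
    {"genotype": 0, "times": [0, 1, 2], "values": [1.0, 1.0, 1.0]},
    {"genotype": 1, "times": [0, 1, 2], "values": [0.0, 0.5, 1.0]},
    {"genotype": 2, "times": [0, 1, 2, 3], "values": [0.0, 1.0, 2.0, 3.0]},
]
print(slope_as_outcome(subjects), slope_as_outcome(subjects, weighted=True))
```

Because the weights enter only in the second step, alternative weighting strategies can be tried by changing a single line, for example by using the inverse variance of each subject's estimated slope instead of the observation count.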
4.3.6 Missing Not at Random
In real data, MNAR is very hard to detect because this type of missingness depends on unobserved
data (Little and Rubin, 1987). Moreover, none of the methods in our study, including LMM, is
robust against MNAR in theory. The classic ways of dealing with MNAR for longitudinal data
include the selection model and the pattern mixture model (Enders, 2011; Fiero et al., 2017).
Building on these two models, some newly proposed methods such as the pattern submodel are
computationally efficient tools for handling missing data, including MNAR (Fletcher Mercaldo and
Blume, 2018). Although the missing rate in our data is at most around 15% in later years,
ignoring MNAR might produce biased results, so techniques for MNAR could be applied to see
whether the results from LMM are biased by MNAR data.
4.3.7 Empirical T1E and Theoretical T1E
The simulation study for T1E is limited to the calculation of theoretical T1E. The null hypothesis
for GWAS is that no SNP is associated with the trait of interest, and the multiple testing
problem requires correction of the T1E rate. Theoretical T1E is obtained by setting one single
SNP to be unassociated with the outcome and analyzing the effect of this SNP only. However, the
empirical T1E might be different for the same method. One example of a simulation study for
empirical T1E is given by Zhang and Sun (2019). The outcome is generated under an alternative in
which it is associated with one SNP, and then a large number of null SNPs are regenerated, or the
outcome is permuted, to analyze and calculate T1E. They also calculate the T1E rate at much more
stringent significance levels with a large number of replicates, which resembles a true GWAS.
One limitation of our simulation study is that we did not address stringent thresholds for T1E
control. Compared to empirical T1E, theoretical T1E is less realistic and might produce misleading
conclusions on T1E rate control, which affects the interpretation of power comparisons.
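The empirical-null idea can be illustrated with a deliberately simplified, cross-sectional sketch: the outcome is generated under an alternative with one causal SNP, many independent null SNPs are then regenerated and tested, and the T1E rate is the fraction of null SNPs with p below the threshold. The function names, the default parameter values and the normal approximation to the test statistic are all assumptions made for illustration; this is not the design of Zhang and Sun (2019) nor of our simulation study.

```python
import math
import random
from statistics import NormalDist


def slope_p_value(x, y):
    """Two-sided p-value for the slope of y on x (large-sample normal approximation)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 1.0  # monomorphic SNP or constant outcome: no test possible
    r = max(min(sxy / math.sqrt(sxx * syy), 0.999999), -0.999999)
    z = r * math.sqrt((n - 2) / (1 - r * r))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def empirical_t1e(n=500, n_null_snps=2000, maf=0.3, beta=0.5, alpha=0.05, seed=1):
    """Fraction of regenerated null SNPs declared significant at level alpha."""
    rng = random.Random(seed)

    def draw_genotypes():
        # Additive coding: sum of two independent allele draws per subject.
        return [int(rng.random() < maf) + int(rng.random() < maf) for _ in range(n)]

    causal = draw_genotypes()
    y = [beta * g + rng.gauss(0, 1) for g in causal]  # outcome under one alternative
    hits = sum(slope_p_value(draw_genotypes(), y) < alpha for _ in range(n_null_snps))
    return hits / n_null_snps


print(empirical_t1e(n=200, n_null_snps=300))
```

Evaluating T1E at stringent GWAS-level thresholds in this way requires far more null SNPs than the few hundred used here, since the expected number of false positives at such thresholds is otherwise close to zero.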
4.3.8 Multivariate Model
The current analysis considers one phenotype and one SNP at a time. Efforts have been devoted to
multivariate models that combine multiple phenotypes to increase genome-wide discovery.
Conducting GWAS repeatedly for several traits may be less efficient when the number of available
variants or the sample size is large. Other benefits of multivariate LMMs are their ability to
account for sample relatedness and population stratification and their increased power compared to
standard univariate analysis (Zhou and Stephens, 2014). Joint outcome analysis can be useful when
the interest is in longitudinal trends between outcomes or the evolution of association between
outcomes, but multivariate LMMs are currently limited by computational complexity, as the usual
LMM tools such as nlme and lme4 cannot fit them (Verbeke et al., 2014). Tools have recently been
proposed to extend LMM to carry out GWAS of correlated traits efficiently, including GMA (Ning et
al., 2019), the joineRML package in R (Hickey et al., 2018) and MTG2 (Lee and van der Werf, 2016).
As described in Section 3.1, the renal outcomes AER and eGFR are weakly negatively correlated.
A joint analysis of these outcomes with multivariate LMMs for GWAS may be another future
direction to identify novel genetic variants associated with renal traits in this cohort.
Summary
In summary, we based our study on diabetic renal complications because they are one of the four
major complications and the leading cause of end-stage renal failure. Measurements of renal
traits can be affected by environmental factors, and the trajectory across repeated measurements
can reveal more about the status of a patient with diabetes than a single measurement. This
motivated us to find a fast and effective way to conduct GWAS with longitudinal data.
We extended the previous Rotterdam simulation study by Sikorska et al. with our motivating data
from the DCCT/EDIC study. The results show that GALLOP stands out as the most accurate
method because 1) the 2df test by GALLOP always has very high power (Figure 3-15, 3-16);
2) parameter estimation by GALLOP always captures the true effect sizes for the cross-sectional
and longitudinal SNP effects (Figure 3-17, 3-18, 3-19). The 2-stage methods SAO, WSAO, TS and CTS
require similar computation time and are faster than GALLOP when the number of SNPs is high, but
GALLOP is still much faster than LMM (Figure 3-42).
Finally, we recognize that there are limitations in our current simulation study designs and in
the model for the DCCT/EDIC real data. We want to extend our simulation study to assess these
methods in more realistic scenarios, and to correct the specification of the real data model in
order to find the true genetic associations in our data.
References
1000 Genomes Project Consortium (2015) 'A global reference for human genetic variation',
Nature, 526, pp. 68-74.
Afonso, G. and Mallone, R. (2013) 'Infectious triggers in type 1 diabetes: is there a case for
epitope mimicry?', Diabetes, Obesity & Metabolism, 15(Suppl 3), pp. 82-88.
Allan, G.M., Mannarino, M. and Tonelli, M. (2013) 'Tools for Practice Screening and diagnosis
of type 2 diabetes with HbA1c', Canadian Family Physician, 59(1), pp. 42.
Altman, D.G. and Bland, J.M. (1986) 'Comparison of methods of measuring blood pressure', J
Epidemiol Community Health, 40(3), pp. 274-7.
American Diabetes Association (2019) 'Classification and Diagnosis of Diabetes: Standards of
Medical Care in Diabetes—2019', Diabetes Care, 42(Supplement 1), pp. S13-S28.
Atkinson, M.A. and Eisenbarth, G.S. (2001) 'Type 1 diabetes: new perspectives on disease
pathogenesis and treatment', The Lancet, 358(9277), pp. 221-229.
Barrett, J.C., Clayton, D.G., Concannon, P., Akolkar, B., Cooper, J.D., Erlich, H.A., Julier, C.,
Morahan, G., Nerup, J., Nierras, C., Plagnol, V., Pociot, F., Schuilenburg, H., Smyth, D.J.,
Stevens, H., Todd, J.A., Walker, N.M. and Rich, S.S. (2009) 'Genome-wide association study
and meta-analysis find that over 40 loci affect risk of type 1 diabetes', Nature genetics, 41(6), pp.
703-707.
Bates, D., Machler, M., Bolker, B. and Walker, S. (2015) 'Fitting Linear Mixed-Effects Models
Using lme4', Journal of Statistical Software, 67(1), pp. 1-48.
Bebu, I., Braffett, B.H., Orchard, T.J., Lorenzi, G.M. and Lachin, J.M. (2019) 'Mediation of the
Effect of Glycemia on the Risk of CVD Outcomes in Type 1 Diabetes: The DCCT/EDIC Study',
Diabetes Care, doi: 10.2337/dc18-1613.
Bland, J.M. and Altman, D.G. (1999) 'Measuring agreement in method comparison studies', Stat
Methods Med Res, 8(2), pp. 135-60.
Boland, B.B., Rhodes, C.J. and Grimsby, J.S. (2017) 'The dynamic plasticity of insulin
production in β-cells', Molecular Metabolism, 6(9), pp. 958-973.
Buzzetti, R., Quattrocchi, C.C. and Nisticò, L. (1998) 'Dissecting the genetics of Type 1
diabetes: relevance for familial clustering and differences in incidence', Diabetes/Metabolism
Reviews, 14(2), pp. 111-128.
Cheung, N., Mitchell, P. and Wong, T.Y. (2010) 'Diabetic retinopathy', The Lancet, 376(9735),
pp. 124-136.
DCCT Research Group (1986) 'The Diabetes Control and Complications Trial (DCCT). Design
and Methodologic Considerations for the Feasibility Phase.', Diabetes, 35(5), pp. 530-545.
DCCT/EDIC Research Group (2007) 'Long-term effect of diabetes and its treatment on cognitive
function', New England Journal of Medicine, 356(18), pp. 1842-1852.
DCCT/EDIC Research Group (2011) 'Intensive Diabetes Therapy and Glomerular Filtration Rate
in Type 1 Diabetes', New England Journal of Medicine, 365, pp. 2366-2376.
DCCT/EDIC Research Group (2014a) 'Diabetic retinopathy and other ocular findings in the
diabetes control and complications trial/epidemiology of diabetes interventions and
complications study.', Diabetes Care, 37(1), pp. 17-23.
DCCT/EDIC Research Group (2014b) 'Effect of intensive diabetes treatment on albuminuria in
type 1 diabetes: long-term follow-up of the Diabetes Control and Complications Trial and
Epidemiology of Diabetes Interventions and Complications study', Lancet Diabetes &
Endocrinology, 2(10), pp. 793-800.
de Boer, I.H. and DCCT/EDIC Research Group (2014) 'Kidney disease and related findings in
the diabetes control and complications trial/epidemiology of diabetes interventions and
complications study.', Diabetes Care, 37(1), pp. 24-30.
EDIC Research Group (1999) 'Design, implementation, and preliminary results of a long-term
follow-up of the Diabetes Control and Complications Trial cohort.', Diabetes Care, 22(1), pp.
99-111.
Enders, C.K. (2011) 'Missing Not at Random Models for Latent Growth Curve Analyses',
American Psychological Association, 16(1), pp. 1-16.
Fiero, M.H., Hsu, C.H. and Bell, M.L. (2017) 'A pattern-mixture model approach for handling
missing continuous outcome data in longitudinal cluster randomized trials', Stat Med, 36(26), pp.
4094-4105.
Fletcher Mercaldo, S. and Blume, J.D. (2018) 'Missing data and prediction: the pattern
submodel', Biostatistics.
Ghaderian, S.B., Hayati, F., Shayanpour, S. and Beladi Mousavi, S.S. (2015) 'Diabetes and end-
stage renal disease; a review article on new concepts', Journal of renal injury prevention, 4(2),
pp. 28-33.
Greenbaum, C.J., Harrison, L.C., Wentworth, J.M., Elkassaby, S. and Fourlanos, S. (2008)
'Reappraising the Stereotypes of Diabetes', Diabetes: Translating Research into Practice. New
York: Informa Healthcare USA.
Haas, M.E., Aragam, K.G., Emdin, C.A., Bick, A.G., Hemani, G., Davey Smith, G. and
Kathiresan, S. (2018) 'Genetic Association of Albuminuria with Cardiometabolic Disease and
Blood Pressure', American Journal of Human Genetics, 103(4), pp. 461-473.
Henderson, C.R. (1950) 'Estimation of Genetic Parameters', Annals of Mathematical Statistics,
21, pp. 309-310.
Hickey, G.L., Philipson, P., Jorgensen, A. and Kolamunnage-Dona, R. (2018) 'joineRML: a joint
model and software package for time-to-event and multivariate longitudinal outcomes', BMC
Med Res Methodol, 18(1), pp. 50.
Ikram, M.A., Brusselle, G., Murad, S.D., van Duijn, C.M., Franco, O.H., Goedegebure, A.,
Klaver, C.C.W., Nijsten, T., Peeters, R.P., Stricker, B.H., Tiemeier, H., Uitterlinden, A.G.,
Vernooij, M.W. and Hofman, A. (2017) 'The Rotterdam Study: 2018 update on
objectives, design and main results', European Journal of Epidemiology, 32(9), pp. 807-850.
Kraft, P., Yen, Y.C., Stram, O.D., Morrison, J. and Gauderman, W.J. (2007) 'Exploiting Gene-
Environment Interaction to Detect Genetic Associations', Human Heredity, 63, pp. 111-119.
Lachin, J.M. and DCCT/EDIC Research Group (2016) 'Risk Factors for Cardiovascular Disease
in Type 1 Diabetes', Diabetes, 65(5), pp. 1370-1379.
Lee, S.H. and van der Werf, J.H.J. (2016) 'MTG2: An efficient algorithm for multivariate linear
mixed model analysis based on genomic information', Bioinformatics, 32(9), pp. 1420-1422.
Levey, A.S., Stevens, L.A., Schmid, C.H., Zhang, Y.L., Castro, A.F., Feldman, H.I., Kusek,
J.W., Eggers, P., Lente, F.V., Greene, T. and Coresh, J. (2009) 'A new equation to estimate
glomerular filtration rate', Annals of Internal Medicine, 150(9), pp. 604-612.
Little, R.J.A. and Rubin, D.B. (1987) Statistical analysis with missing data. New York: John
Wiley & Sons.
Ma, C., Blackwell, T., Boehnke, M., Scott, L.J. and GoT2D investigators (2013) 'Recommended joint and
meta-analysis strategies for case-control association testing of single low-count variants', Genet
Epidemiol, 37(6), pp. 539-50.
Ma, Y., Mazumdar, M. and Memtsoudis, S.G. (2012) 'Beyond Repeated-Measures Analysis of
Variance: Advanced Statistical Methods for the Analysis of Longitudinal Data in Anesthesia
Research', Regional Anesthesia and Pain Medicine, 37(1), pp. 99-105.
Martin, C.L., Albers, J.W., Pop-Busui, R. and DCCT/EDIC Research Group (2014) 'Neuropathy
and related findings in the diabetes control and complications trial/epidemiology of diabetes
interventions and complications study.', Diabetes Care, 37(1), pp. 31-38.
Mohlke, K.L. and Lindgren, C.M. (2014) 'Genome-Wide Association Studies of Obesity and
Related Traits', Type 2 Diabetes and Related Traits. Oxford: Karger, pp. 58-70.
Nathan, D.M. (2014) 'The Diabetes Control and Complications Trial/Epidemiology of Diabetes
Interventions and Complications Study at 30 Years: Overview', Diabetes Care, 37(1), pp. 9-16.
National Kidney Foundation (2002) 'K/DOQI clinical practice guidelines for chronic kidney
disease: Evaluation, classification, and stratification', American Journal of Kidney Diseases,
39(suppl 1), pp. S1-S266.
Nejentsev, S., Walker, N., Riches, D., Egholm, M. and Todd, J.A. (2009) 'Rare variants of
IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes', Science,
324(5925), pp. 387–389.
Ning, C., Wang, D., Zhou, L., Wei, J., Liu, Y., Kang, H., Zhang, S., Zhou, X., Xu, S. and Liu,
J.F. (2019) 'Efficient Multivariate Analysis Algorithms for Longitudinal Genome-wide
Association Studies', Bioinformatics.
Orchard, T.J. and DCCT/EDIC Research Group (2015) 'Association between seven years of
intensive treatment of type 1 diabetes and long time mortality', Journal of the American Medical
Association, 313(1), pp. 45-53.
Paterson, A.D., Waggott, D., Boright, A.P., Hosseini, S.M., Shen, E., Sylvestre, M.P., Wong, I.,
Bharaj, B., Cleary, P.A., Lachin, J.M., Below, J.E., Nicolae, D., Cox, N.J., Canty, A.J., Sun, L.
and Bull, S.B. (2010) 'A Genome-Wide Association Study Identifies a Novel Major Locus for
Glycemic Control in Type 1 Diabetes, as Measured by Both A1C and Glucose', Diabetes, 59(2),
pp. 539-549.
Persad, P.J., Heid, I.M., Weeks, D.E., Baird, P.N., de Jong, E.K., Haines, J.L., Pericak-Vance,
M.A., Scott, W.K. and International Age-Related Macular Degeneration Genomics Consortium (2017)
'Joint Analysis of Nuclear and Mitochondrial Variants in Age-Related Macular Degeneration
Identifies Novel Loci TRPM1 and ABHD2/RLBP1', Invest Ophthalmol Vis Sci, 58(10), pp.
4027-4038.
Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D. and R Core Team (2019) 'nlme: Linear and
Nonlinear Mixed Effects Models'.
Pruim, R.J., Welch, R.P., Sanna, S., Teslovich, T.M., Chines, P.S., Gliedt, T.P., Boehnke, M.,
Abecasis, G.R. and Willer, C.J. (2010) 'LocusZoom: Regional visualization of genome-wide
association scan results', Bioinformatics, 26(18), pp. 2336-2337.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.R., Bender, D., Maller, J.,
Sklar, P., de Bakker, P.I.W., Daly, M.J. and Sham, P.C. (2007) 'PLINK: a toolset for whole-
genome association and population-based linkage analysis.', American Journal of Human
Genetics, 81(3), pp. 559-575.
Radha, V. and Mohan, V. (2017) 'Genetic Basis of Monogenic Diabetes', Current Science,
113(07), pp. 1277-1286.
Razmaria, A.A. (2015) 'Diabetes Neuropathy', Journal of the American Medical Association,
314(20), pp. 2202.
Revelle, W. (2018) 'psych: Procedures for Psychological, Psychometric, and Personality
Research'.
Roshandel, D., Gubitosi-Klug, R., Bull, S.B., Canty, A.J., Pezzolesi, M.G., King, G.L., Keenan,
H.A., Snell-Bergeon, J.K., Maahs, D.M., Klein, R., Klein, B.E.K., Orchard, T.J., Costacou, T.,
Weedon, M.N., Oram, R.A. and Paterson, A.D. (2018) 'Meta-genome-wide association studies
identify a locus on chromosome 1 and multiple variants in the MHC region for serum C-peptide
in type 1 diabetes', Diabetologia, 61(5), pp. 1098-1111.
Sandholm, N., Haukka, J.K., Toppila, I., Valo, E., Harjutsalo, V., Forsblom, C. and Groop, P.H.
(2018) 'Confirmation of GLRA3 as a susceptibility locus for albuminuria in Finnish patients with
type 1 diabetes', Scientific Reports, 8(1), doi: 10.1038/s41598-018-29211-1.
Secrest, A.M., Becker, D.J., Kelsey, S.F., LaPorte, R.E. and Orchard, T.J. (2010) 'Cause-Specific
Mortality Trends in a Large Population-Based Cohort With Long-Standing Childhood-Onset
Type 1 Diabetes', Diabetes, 59(12), pp. 3216-3222.
Shamoon, H., Duffy, H., Fleischer, N., Engel, S., Saenger, P., Strelzyn, M., Litwak, M., Wylie-
Rosett, J., Farkash, A., Geiger, D., Engel, H., Fleischman, J., Pompi, D., Ginsberg, N., Glover,
M., Brisman, M., Walker, E., Thomashunis, A. and Gonzalez, J. (1993) 'The effect of intensive
treatment of diabetes on the development and progression of long-term complications in insulin-
dependent diabetes mellitus.', The New England Journal of Medicine, 329(14), pp. 977-986.
Sikorska, K., Lesaffre, E., Groenen, P.F.J. and Eilers, P.H.C. (2013a) 'GWAS on your notebook:
fast semi-parallel linear and logistic regression for genome-wide association studies', BMC
Bioinformatics, 14(1), pp. 166-166.
Sikorska, K., Lesaffre, E., Groenen, P.J.F., Rivadeneira, F. and Eilers, P.H.C. (2018) 'Genome-
wide Analysis of Large-scale Longitudinal Outcomes using Penalization —GALLOP algorithm',
Scientific Reports, 8, pp. 6518.
Sikorska, K., Montazeri, N.M., Uitterlinden, A.G., Rivadeneira, F., Eilers, P.H.C. and Lesaffre,
E. (2015) 'GWAS with longitudinal phenotypes: performance of approximate procedures',
European Journal of Human Genetics, 23(10), pp. 1384-1391.
Sikorska, K., Rivadeneira, F., Groenen, P.J.F., Hofman, A., Uitterlinden, A.G., Eilers, P.H.C.
and Lesaffre, E. (2013b) 'Fast linear mixed model computations for genome-wide association
studies with longitudinal data', Statistics in Medicine, 32(1), pp. 165-180.
Task Force on Diabetes and Cardiovascular Diseases of the European Society of Cardiology and
European Association for the Study of Diabetes (2007) 'Guidelines on diabetes, pre-diabetes and
cardiovascular diseases', European Heart Journal, 60(4), pp. 88-136.
Thomas, S. and Karalliedde, J. (2019) 'Diabetic nephropathy', Medicine, 47(2), pp. 86-91.
Todd, J.A., Bell, J.I. and McDevitt, H.O. (1987) 'HLA-DQ beta gene contributes to susceptibility
and resistance to insulin-dependent diabetes mellitus', Nature, 329(6140), pp. 599-604.
van de Bunt, M., Moran, I., Ferrer, J. and McCarthy, M. (2014) 'Insights into β-cell biology and
type 2 diabetes pathogenesis from studies of the islet transcriptome', Genetics in Diabetes, pp.
111-121.
Verbeke, G., Fieuws, S., Molenberghs, G. and Davidian, M. (2014) 'The analysis of multivariate
longitudinal data: A review', Statistical Methods in Medical Research, 23(1), pp. 42-59.
Verbeke, G. and Molenberghs, G. (2000) Linear Mixed Models for Longitudinal Data. New
York: Springer.
World Health Organization (2018) Diabetes.
Yamaguchi, Y., Ueno, M., Maruo, K. and Gosho, M. (2019) 'Multiple imputation for
longitudinal data in the presence of heteroscedasticity between treatment groups', J Biopharm
Stat, pp. 1-19.
Younes, N., Cleary, P.A., Steffes, M.W., de Boer, I.H., Molitch, M.E., Rutledge, B.N., Lachin,
J.M. and Dahms, W. (2010) 'Comparison of urinary albumin-creatinine ratio and albumin
excretion rate in the diabetes control and complications trial/epidemiology of diabetes
interventions and complications study', Clinical Journal of The American Society of Nephrology,
5(7), pp. 1235-1242.
Young, A.I., Wauthier, F.L. and Donnelly, P. (2018) 'Identifying loci affecting trait variability
and detecting interactions in genome-wide association studies', Nat Genet, 50(11), pp. 1608-
1614.
Zhang, T. and Sun, L. (2019) 'Beyond the traditional simulation design for evaluating type 1
error control: From the "theoretical" null to "empirical" null', Genetic Epidemiology, 43, pp. 166-
179.
Zheng, Z.L., Zhu, Z.H., Wu, Y.D., Kemper, K.E., Lloyd-Jones, L.R., McRae, A.F., Xue, A.,
Sidorenko, J., Visscher, P.M., Zhang, F.T. and Zeng, J. (2018) 'Genome-wide association
analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes',
Nature Communications, 9, pp. 2941.
Zhou, X. and Stephens, M. (2014) 'Efficient multivariate linear mixed model algorithms for
genome-wide association studies', Nature Methods, 11(4), pp. 407-409.