
Comparison of methods for longitudinal analysis of quantitative traits in genome-wide association studies

Mengdan Xu

A thesis submitted in conformity with the requirements for the degree of Master of Science

Graduate Department of Public Health Sciences University of Toronto

Abstract

Longitudinal genome-wide association studies provide more information on the relationship between repeatedly measured traits and genetic variants (SNPs) than cross-sectional studies. The linear mixed model (LMM) is a popular tool for such analyses but is computationally slow. Inspired by the fast methods proposed by Sikorska et al., which approximate the LMM, we extended previous simulation scenarios and compared the fast methods with the LMM on inference for cross-sectional SNP effects, longitudinal SNP effects and the joint test. We also applied the fast methods to real data for each of two renal outcomes. Using the fast methods, we detected several SNPs with longitudinal effects on the two outcomes. The fast methods are effective and much faster than the LMM, although they differ from one another in computational efficiency and speed. We also discuss limitations of our simulation study and real-data application, along with future directions.

Acknowledgments

First, I want to thank my supervisor, Dr. Andrew Paterson, for providing me with so many valuable suggestions. Through this opportunity, I learned the importance of reading and writing in carrying out a complete analysis. I also gained a great deal of new knowledge and many new techniques in biostatistics. This was a challenge for me, as it is my first serious thesis, and I could not have completed it without the guidance of Dr. Paterson.

I want to express my gratitude to my advisory committee members, Dr. Wei Xu and Dr. Shelley Bull, for assessing my thesis throughout the whole process and providing advice from different perspectives. I also want to thank Dr. Lei Sun for taking the time to offer help alongside my advisory committee.

Sincere thanks go to the many people in SickKids Genetics and Genome Biology who answered my questions about the data and helped troubleshoot problems I encountered in programming.

Lastly, I want to thank my family, especially my mother, for understanding my decision to accept the challenge and do something I had never done before. I genuinely thank my roommates, Han and Yuzhu, my friends, Cassie, Fan, Steven, Vicky and Thaison, and many others for supporting and helping me in many ways during the times when I felt unconfident and worried about the thesis.

Table of Contents

Acknowledgments
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Introduction
  1.1 Genetic Basis of Diabetes
  1.2 T1D Complications
  1.3 Motivation
    1.3.1 Genome-wide Association Study
    1.3.2 Repeated Measurements from a Longitudinal Study
    1.3.3 GWAS on Longitudinal Study
  1.4 Linear Mixed Effects Models
    1.4.1 Introduction to Linear Mixed Effects Models
    1.4.2 General Form
    1.4.3 Model Assumptions
  1.5 Fast Methods for Longitudinal Data
    1.5.1 Study Background
    1.5.2 Slope as Outcome Method (SAO)
    1.5.3 Two-Step Method (TS)
    1.5.4 Conditional Two-Step (CTS) Method
    1.5.5 Genome-wide Analysis of Large-scale Longitudinal Outcomes using Penalization (GALLOP)
    1.5.6 Methods Performance
    1.5.7 Concerns
  1.6 Goals
Methods
  2.1 Data Background: DCCT and EDIC
    2.1.1 Diabetes Control and Complications Trial (DCCT)
    2.1.2 Epidemiology of Diabetes Interventions and Complications (EDIC)
    2.1.3 Renal Outcome Measures
    2.1.4 Importance of DCCT/EDIC Study
  2.2 Linear Mixed Effects Model for DCCT/EDIC Renal Outcomes
    2.2.1 Phenotypic Covariate Selection
    2.2.2 Within-individual Correlation Structure Selection
    2.2.3 Genetic Data
  2.3 Weighted Slope as Outcome (WSAO)
  2.4 Simulation Study
    2.4.1 Experimental Designs
    2.4.2 Simulation of SNPs
    2.4.3 Simulation of Outcome Trait
  2.5 Methods for DCCT/EDIC Data Analysis
Results
  3.1 Data Description
    3.1.1 Number of visits
    3.1.2 Distribution of logAER
    3.1.3 Distribution of eGFR
  3.2 Simulation Study Results
    3.2.1 Set up
    3.2.2 Type 1 Error
    3.2.3 Power
    3.2.4 Parameter Estimation
    3.2.5 Speed
  3.3 DCCT/EDIC Data Analysis Results
    3.3.1 Set up
    3.3.2 GWAS of logAER
    3.3.3 GWAS of eGFR
    3.3.4 Speed Comparison
Discussion
  4.1 Simulation Study
    4.1.1 Type 1 Error
    4.1.2 Power
    4.1.3 Speed
    4.1.4 Implementation
  4.2 DCCT/EDIC Data Analysis
    4.2.1 GWAS of logAER
    4.2.2 GWAS of eGFR
  4.3 Limitations and Future Study
    4.3.1 Simulation Study Settings
    4.3.2 Real Data Model Specification
    4.3.3 Efficiency Measures
    4.3.4 Heteroscedasticity
    4.3.5 Weighted Slope as Outcome (WSAO)
    4.3.6 Missing Not at Random
    4.3.7 Empirical T1E and Theoretical T1E
    4.3.8 Multivariate Model
Summary
References

List of Tables

Table 1-1 Rotterdam study: number of individuals with K non-missing responses.

Table 1-2 Rotterdam study: Type 1 error comparison.

Table 1-3 Rotterdam study: Power comparison.

Table 1-4 Rotterdam study: Speed comparison.

Table 2-1 Spatial correlation structures in nlme.

Table 3-1 Descriptive table for DCCT/EDIC.

Table 3-2 Missing rate for repeated measurements. Missing rate of logAER in EDIC and DCCT/EDIC is calculated by combining alternate years.

Table 3-3 Time comparison in simulation study for 1000 SNPs under null with MAF=0.3, N=2000, 0 cross-sectional or longitudinal SNP effect and unchanged error structure.

Table 3-4 DCCT/EDIC data analysis settings.

Table 3-5 Statistical comparison between fast methods and LMM on random 19570 SNPs.

Table 3-6 Summary information of significant SNPs (P < 5 × 10⁻⁸) for outcome logAER.

Table 3-7 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs3817222 on logAER.

Table 3-8 … logAER.

Table 3-9 Measures of efficiency of fast methods on logAER.

Table 3-10 Statistical comparison between fast methods and LMM on random 19570 SNPs.

Table 3-11 Summary information of significant SNPs (P < 5 × 10⁻⁸) for outcome eGFR.

Table 3-12 … chromosome 2 for eGFR.

Table 3-13 … eGFR.

Table 3-14 Measures of efficiency of fast methods on real eGFR data.

List of Figures

Figure 1-1 Random intercept or slope model examples.

Figure 2-1 Flow chart for Methods section.

Figure 2-2 Flow chart of simulation study.

Figure 3-1 Number of DCCT participants by randomization year.

Figure 3-2 Number of visit counts per subject in DCCT/EDIC years.

Figure 3-3 Expected and actual numbers of subjects in each DCCT/EDIC year including DCCT baseline and close out visits.

Figure 3-4 Proportion of missing subjects in each DCCT/EDIC year including DCCT baseline and close out visits.

Figure 3-5 Barplot of numbers of logAER measurements per subject in DCCT/EDIC study.

Figure 3-6 Distribution of logAER in DCCT years.

Figure 3-7 Distribution of logAER in EDIC years.

Figure 3-8 Spaghetti plots of logAER from 20 subjects from DCCT/EDIC in each grid.

Figure 3-9 Barplot of numbers of eGFR measurements per subject in DCCT/EDIC study.

Figure 3-10 Distribution of eGFR in DCCT years.

Figure 3-11 Distribution of eGFR in EDIC years.

Figure 3-12 Spaghetti plots of eGFR from 20 subjects from DCCT/EDIC in each grid.

Figure 3-13 Type 1 error rates calculated in reference scenario, 4 missing scenarios and 2 within-subject error correlation scenarios.

Figure 3-14 Example of sample size changing by time in missing scenarios.

Figure 3-15 Power of cross-sectional SNP effect calculated in 7 scenarios.

Figure 3-16 Power calculated in 7 scenarios.

Figure 3-17 Parameter estimates of cross-sectional SNP effect calculated in 7 scenarios.

Figure 3-18 Parameter estimates of longitudinal SNP effect by MAF calculated in 7 scenarios.

Figure 3-19 Parameter estimates of longitudinal SNP effect by N calculated in 7 scenarios.

Figure 3-20 Number of participants in EDIC starting years.

Figure 3-21 Slopes for two-stage fast methods on outcome logAER.

Figure 3-22 Histograms of MAFs of all SNPs (n=8,979,131) and randomly selected SNPs (n=19,570).

Figure 3-23 P-P plots on outcome logAER on random subset of SNPs.

Figure 3-24 E-E plots on outcome logAER on random subset of SNPs.

Figure 3-25 Histograms of p-values (logAER).

Figure 3-26 Q-Q plots of p-values (logAER).

Figure 3-27 Q-Q plots of p-values stratified by MAF (logAER).

Figure 3-28 Manhattan plots (logAER).

Figure 3-29 Estimates along with 95% CI for cross-sectional SNP effects on logAER by combined and separate DCCT/EDIC years.

Figure 3-30 Locus plot on logAER with reference SNP rs3817222.

Figure 3-31 Locus plot on logAER with reference SNP rs74155187.

Figure 3-32 Slopes for two-stage fast methods on outcome eGFR.

Figure 3-33 P-P plots on outcome eGFR on random 19570 SNPs.

Figure 3-34 E-E plots on outcome eGFR on random 19570 SNPs.

Figure 3-35 Histograms of p-values (eGFR).

Figure 3-36 Q-Q plots of p-values (eGFR).

Figure 3-37 Q-Q plots of p-values stratified by MAF (eGFR).

Figure 3-38 Manhattan plots (eGFR).

Figure 3-39 Estimate along with 95% CI for cross-sectional SNP effects on eGFR by combined and separate DCCT/EDIC years.

Figure 3-40 Locus plot with reference SNP rs12713270.

Figure 3-41 Locus plot with reference SNP rs74155187.

Figure 3-42 Running time for GWAS.

List of Abbreviations

(e)GFR (Estimated) Glomerular Filtration Rate

(log)AER (Logarithm-transformed) Albumin Excretion Rate

(RM-)ANOVA (Repeated Measures) Analysis of Variance

2df Two Degree-of-freedom

AIC Akaike Information Criterion

AR Autoregressive

ARMA Autoregressive Moving Average

BLUP Best Linear Unbiased Predictor

BMD Bone Mineral Density

bp Base Pair

CAR Continuous Autoregression

CKD Chronic Kidney Disease

CS Compound Symmetry

CTS Conditional Two Step

DCCT Diabetes Control and Complications Trial

EDIC Epidemiology of Diabetes Interventions and Complications

GALLOP Genome-wide Analysis of Large-scale Longitudinal Outcomes using Penalization

GC Genomic Control

GEE Generalized Estimating Equation

GWAS Genome-Wide Association Study

HbA1c Hemoglobin A1c

HLA Human Leukocyte Antigen

HWE Hardy–Weinberg Equilibrium

ICC Intra-class Correlation Coefficient

IDDM Insulin-Dependent Diabetes Mellitus

IQR Inter-Quartile Range

LMM Linear Mixed Model

M/Mb Million/Million base pairs

MAF Minor Allele Frequency

MAR Missing at Random

MCAR Missing Completely at Random

MLE Maximum Likelihood Estimation

MNAR Missing Not at Random

SAO Slope as Outcome

SE Standard Error

SNP Single Nucleotide Polymorphism

T1D/T2D Type 1/2 Diabetes

T1E Type 1 Error

TS Two Step

VC Variance Component

WSAO Weighted Slope as Outcome

Introduction

1.1 Genetic Basis of Diabetes

Diabetes is a chronic disease caused by defective insulin secretion or an impaired response to insulin. People with diabetes usually have blood and tissue glucose concentrations that are too high, resulting in acute and long-term complications whose social, emotional and economic impacts affect their quality of life. Hyperglycemia produces a variety of symptoms and signs, while the pathogenesis differs by type of diabetes. Most cases can be classified as either type 1 diabetes (T1D) or type 2 diabetes (T2D), which are distinguished by the involvement of autoimmunity (van de Bunt et al., 2014). Type 1 diabetes, previously called insulin-dependent diabetes, is caused by autoimmune destruction of the insulin-producing β-cells of the pancreatic islets, so patients cannot produce sufficient insulin. Type 2 diabetes, previously called non-insulin-dependent diabetes, is due to defective insulin action: although patients do secrete insulin, there is resistance to its actions and the response to insulin is reduced (Boland et al., 2017).

According to the World Health Organization, the number of people with diabetes roughly quadrupled from around 100 million in 1980 to around 400 million in 2014, with a more rapid rise in prevalence in low- and middle-income countries (World Health Organization, 2018). Environmental changes in diet, exercise, climate, sleep and other factors are thought to contribute to the increasing prevalence of T2D and T1D (Greenbaum et al., 2008). Genetic studies have contributed to understanding the pathogenesis of diabetes so that the disease can be better prevented and treated, while a healthy lifestyle can also help people avoid or reduce exposure to diabetogenic environmental agents.

The American Diabetes Association's 2019 classification also includes other types of diabetes, such as gestational diabetes mellitus, which account for a small proportion of the whole patient population (American Diabetes Association, 2019). For example, ~1% of patients have monogenic diabetes. People with this type of diabetes might be misdiagnosed as having T1D or T2D, but it is caused by single-gene defects such as rare coding variants in HNF1A (Hepatocyte Nuclear Factor 1 Alpha) or HNF4A (Hepatocyte Nuclear Factor 4 Alpha), among other genes (Radha and Mohan, 2017). Correct classification is important for determining therapy.

Compared with other types of diabetes, T1D and T2D are more common and are heterogeneous, arising from a combination of genetic and environmental factors. Genes related to the risk of T1D have been identified over the past 30 years (Todd et al., 1987). In the early to mid-1990s, optimists believed that the pathogenesis of T1D, along with correspondingly precise prevention and treatment, would be found very soon. In recent years the complexity of the genetics of T1D has been recognized. The most important genes for T1D are located within the MHC (Major Histocompatibility Complex) HLA (Human Leukocyte Antigen) class II region on chromosome 6p21 (previously termed IDDM1) and account for around 45% of genetic susceptibility to T1D. However, their exact function in pathogenesis is still obscure (Buzzetti et al., 1998). With regard to the environment, no significant evidence has been found that any environmental agent triggers the onset of T1D, despite much investigation of viral infections, early infant diet and toxins (Atkinson and Eisenbarth, 2001; Afonso and Mallone, 2013).

Numerous novel loci related to T1D have been identified. For instance, in 2009 Barrett et al. detected over 40 loci affecting the risk of T1D using case-control data from the Type 1 Diabetes Genetics Consortium, combined by meta-analysis with the Wellcome Trust Case Control Consortium study and the Genetics of Kidneys in Diabetes study. Apart from the long-known HLA region on chromosome 6p21, the susceptibility loci detected in this study not only supported previous discoveries of four non-HLA loci, INS, CTLA4, PTPN22 and IL2RA, but also included many new candidate genes such as IL10, IL19 and CD69 (Barrett et al., 2009). Subsequently, techniques such as targeted resequencing helped to pinpoint potential causal variants at loci initially detected by these studies (Nejentsev et al., 2009).

For T2D, researchers are likewise investigating polygenic and environmental factors in the disease and in related traits. By 2018, more than 100 variants had been identified as associated with T2D (Zheng et al., 2018). However, the effect sizes are usually so small that the associated variants together explain only ~10% of the heritability of T2D. In addition, related traits such as glycemia and obesity have been studied in populations without T2D to determine whether loci identified for these traits are also associated with T2D risk, and thus to better understand the underlying genes and biological mechanisms of T2D (Mohlke and Lindgren, 2014).

1.2 T1D Complications

The fundamental cause of most diabetes complications is the elevated blood or tissue glucose level resulting from insufficient secretion of insulin. Maintaining good control of blood glucose is the most essential way to prevent or slow the progression of all complications. The most common measure of glycemic level is HbA1c, or glycated hemoglobin, which reflects average blood glucose concentration over the preceding 2-3 months. HbA1c is one of the top risk factors for diabetes complications and is also recommended as a screening test for T2D (Allan et al., 2013). Many studies have been conducted to better understand the biological basis of between-person variation in HbA1c, as there are sometimes discrepancies between HbA1c and other measures of glycemia (Paterson et al., 2010).

Most chronic complications of T1D can be classified as eye disease (retinopathy), nerve disease (neuropathy), renal/kidney disease (nephropathy) and cardiovascular disease (cardiopathy). Eye, nerve and renal diseases are microvascular complications, whereas cardiovascular disease is a macrovascular complication.

Diabetic retinopathy is the main cause of vision loss in people with diabetes. It was estimated that, among the 246 million people with diabetes in 2010, one third had signs of retinopathy (Cheung et al., 2010). Diabetic neuropathy is the most common complication: as of 2015, about half of people with diabetes had developed some form of nerve disorder (Razmaria, 2015). These disorders present as numbness or pain in the limbs and can also affect internal organs such as the heart. The nerves of the feet are usually the first to be affected, so examination of the foot nerves can detect early signs of neuropathy.

Renal disease is one of the underlying causes of morbidity and mortality in T1D. The kidney is one of the organs targeted by high blood glucose, and the appearance and progression of kidney disease are strongly related to other complications such as cardiovascular disease (de Boer and DCCT/EDIC Research Group, 2014). About 30%-35% of patients with T1D or T2D are reported to develop renal complications over their lifetime (Thomas and Karalliedde, 2019). Although advanced renal disease or renal failure may occur many years after the onset of diabetes, diabetic renal disease is the leading cause of end-stage renal disease worldwide (Ghaderian et al., 2015).

Diabetic cardiovascular disease includes coronary heart disease, cerebrovascular disease and peripheral artery disease; among these, heart disease is the most common cause of death. The prevalence of cardiovascular disease is not as high as that of other chronic complications (Task Force on Diabetes and Cardiovascular Diseases of the European Society of Cardiology and European Association for the Study of Diabetes, 2007). However, a large population-based cohort study of the causes of mortality in people with T1D concluded that cardiovascular disease becomes the leading cause of death about 10 years after onset, eventually accounting for about 40% of deaths after 20 years of duration (Secrest et al., 2010).

1.3 Motivation

1.3.1 Genome-wide Association Study

Genome-wide association studies (GWAS) assess the association between genetic variation and traits. With the Human Genome Project and the development of cost-effective DNA genotyping techniques, the amount of available genetic information, usually in the form of single nucleotide polymorphisms (SNPs), is rapidly increasing. This provides more opportunities to find associations between genetic variants and human traits at the conventional GWAS significance threshold, p < 5 × 10⁻⁸. Much effort is being devoted to both common complex diseases and quantitative traits, which can be affected by both genes and environment.

There are hundreds to thousands of genes on each of the 23 pairs of chromosomes, which together contain around 3 billion base pairs. A typical genotyping technique provides about 0.5M to 2.5M SNPs. Many tools, such as PLINK v1.9, have been developed to conduct GWAS between genetic markers and cross-sectional outcomes at a notable speed despite the large number of SNPs (Purcell et al., 2007). The usual statistical methods used by PLINK are, for example, a genetic association test for case/control data (logistic regression) or standard linear regression for quantitative traits, assuming an additive model.

1.3.2 Repeated Measurements from a Longitudinal Study

Repeated measurements from a longitudinal study add complexity to GWAS. GWAS has been

popular in cross-sectional studies such as case-control studies of cancer. However, clinical trials

and some observational studies on chronic diseases like diabetes are often longitudinal studies.

The change or trajectory in measurements reveals more information than one-time measurement

at a single point. The simplest way to deal with repeated measurements from a longitudinal study is to use one summary statistic, such as the mean, the median or the measurement at a predetermined time point, to reduce the dataset to a cross-sectional one. However, much information is wasted by collapsing records that contain important repeated measures from participants.

One of the earliest proposed approaches is repeated measures analysis of variance (RM-ANOVA).

It is analysis of variance (ANOVA) which takes multiple correlated responses for each subject.

Being a traditional method for longitudinal study in many fields like anesthesiology and physical

education, RM-ANOVA takes into account the correlation between repeated measurements, unlike a standard ANOVA. However, it has undesirable characteristics. First, the outcome

has to be quantitative and covariates can only be discrete. Second, it assumes that the correlation

between any two time points is constant which might not be appropriate for some types of traits

where for example the within subject correlation decreases as time interval increases. Finally, it

can only handle the same number of repeated measurements: subjects with even one missing

response will be excluded (Ma et al., 2012).

1.3.3 GWAS on Longitudinal Study

Conventional GWAS focuses on SNP effect from a cross-sectional aspect, such as how genetic

factors affect people’s susceptibility to a certain disease. With repeatedly measured data, GWAS

can be conducted to investigate longitudinal SNP effects. The SNPs stay the same over an

individual’s lifetime, but the trait distribution and effects of SNPs might change over time. Cross-

sectional and longitudinal SNP effects both provide important information in how genetic variants

and disease traits are associated. From the aspect of respecting the nature of chronic diseases and

making use of available information as much as possible, ways to efficiently conduct GWAS on

longitudinal study for cross-sectional and longitudinal SNP effects are in demand.

1.4 Linear Mixed Effects Models

1.4.1 Introduction to Linear Mixed Effects Models

The linear mixed effects model or linear mixed model (LMM), is a popular method to deal with

repeated measurements and outperforms RM-ANOVA in many aspects. It allows different types

of covariates, continuous or categorical. It also allows for correlation within subject to vary by a

specific pattern which produces models that better fit the data. Lastly, subjects with missing

responses and different numbers of visits can be included as long as the time intervals are correctly

specified in the model. Because LMM uses maximum likelihood estimation (MLE), it is robust

against missing at random (MAR) data (Sikorska et al., 2013b). The biggest challenge of fitting

LMMs for GWAS lies in computation.

The other popular method is generalized estimating equations (GEE). It has the same advantages

as LMM in that it allows different correlation structures via working correlation matrix and it can

take different types of covariates. However, the main difference is that GEE only provides

population level estimates without individual level information for random effects. One of the

disadvantages is that it requires complete data or missing completely at random (MCAR) as it is

not likelihood based (Little and Rubin, 1987). This method will not be adopted in this thesis

because it is not an outstandingly fast method and it is less robust than LMM in missing data

scenarios (Sikorska et al., 2013b).

1.4.2 General Form

In general, a linear mixed model has the form (Verbeke and Molenberghs, 2000):

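$$Y_i = X_i\beta + Z_i b_i + \varepsilon_i, \qquad b_i \sim N(0, D), \qquad \varepsilon_i \sim N(0, \Sigma_i) \qquad (1.1)$$

with $b_1, \ldots, b_N, \varepsilon_1, \ldots, \varepsilon_N$ independent of each other, $i = 1, \ldots, N$.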

Assuming there are 𝑁 individuals, 𝑌𝑖 denotes the vector of responses for individual 𝑖 with 𝑛𝑖

elements, indicating this individual had 𝑛𝑖 measurements for response and each individual does

not have to have same number of measures.

For covariates, 𝑋𝑖 is a known 𝑛𝑖 × 𝑝 matrix whose 𝑝 columns represent the 𝑝 covariates having fixed effects on the response and whose 𝑛𝑖 rows correspond to the measurements. The 𝑝-dimensional vector 𝛽 contains the unknown fixed-effect parameters.

Similarly, 𝑍𝑖 is a known 𝑛𝑖 × 𝑞 matrix containing values for the 𝑞 random-effect covariates and 𝑏𝑖 is the 𝑞-dimensional vector of unknown random effects. 𝜀𝑖 is an 𝑛𝑖-dimensional vector of errors between the model and the observed response values for individual 𝑖.

1.4.3 Model Assumptions

This form (Eq. 1.1) shows the basic assumptions for general use of LMM. At the same time, our

constructed LMMs are also based on all of the following assumptions.

First of all, the vector of response 𝑌𝑖 is linearly related to covariates. Some covariates have fixed

effects which means they have population-average effect and are the same for all individuals. Some

covariates have random effects, also called individual-specific effects. Models with random effects

allow different specific regressions to model the relationships between response and covariates for

different individuals. With random effects on the intercept or slope, it allows different final

conditional models on each subject who has repeated measures over time. The example individual

level models in Figure 1-1 are generated using a subset of the BodyWeight dataset in the nlme package (Pinheiro et al., 2019). The figure shows four types of models fitted to the relationship between weight (grams) and time (days) for 5 rats.

Figure 1-1 Random intercept or slope model examples. Same subset of 5 rats with id’s 1-5 selected, each

line representing one subject. Weight is in grams and time is in days.

Second, the random-effect parameter vector 𝑏𝑖 contains random effects for 𝑞 covariates, and

usually it is assumed 𝑏𝑖 follows a multivariate normal distribution with a mean of 0 and a common

𝑞 × 𝑞 covariance matrix 𝐷 for every subject. The 𝑑𝑖𝑗(𝑖 ≠ 𝑗) element in this matrix is the

covariance between random effects for 𝑖th covariate and 𝑗th covariate and thus 𝑑𝑖𝑗 = 𝑑𝑗𝑖.

Third, the error vector 𝜀𝑖 follows a multivariate normal distribution with a mean of 0 and

individual-specific 𝑛𝑖 × 𝑛𝑖 covariance matrix Σ𝑖 to explain any deviance between the model and

response. The covariance matrix depends on 𝑖 only through the dimension 𝑛𝑖 and thus elements in

this matrix are set. In other words, if the number of measurements is the same for all subjects, this

covariance matrix can be written as Σ.

In addition, all the random effects between individuals including random effects from covariates

and errors are assumed to be independent.

Under all these assumptions, we would be able to obtain inference for the distribution of response

vector as a multivariate normal distribution:

𝑌𝑖~𝑁(𝜇𝑖 = 𝑋𝑖𝛽, 𝑉𝑖 = 𝑍𝑖𝐷𝑍’𝑖 + 𝛴𝑖) (1.2)
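As an illustration of Eq. 1.2, the sketch below builds the marginal covariance matrix of one subject for a model with a random intercept and a random slope of time. It assumes independent errors with a common variance (so that the error covariance is σ² times the identity); the function name `marginal_covariance`, the visit times and all variance values are made-up illustrative choices, not quantities from this thesis.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def marginal_covariance(times, D, sigma2):
    """V_i = Z_i D Z_i' + sigma2 * I for a random intercept and random slope."""
    Z = [[1.0, t] for t in times]          # n_i x 2 design matrix for random effects
    ZDZt = matmul(matmul(Z, D), transpose(Z))
    n = len(times)
    return [[ZDZt[r][c] + (sigma2 if r == c else 0.0) for c in range(n)]
            for r in range(n)]


# Illustrative (assumed) values: three visits, 2 x 2 random-effects covariance D
times = [0.0, 2.0, 6.0]
D = [[1.0, 0.1],
     [0.1, 0.04]]
V = marginal_covariance(times, D, sigma2=0.5)
for row in V:
    print([round(v, 2) for v in row])
```

Each entry equals d00 + d01(t_r + t_c) + d11 t_r t_c, plus σ² on the diagonal, so the variance and the within-subject correlation both change with time even though the errors themselves are independent.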

1.5 Fast Methods for Longitudinal Data

This thesis is inspired by Dr. Karolina Sikorska and her colleagues’ several papers since 2012 on

fast methods for analyzing longitudinal data for GWAS in replacement of LMM (Sikorska et al.,

2013b; Sikorska et al., 2013a; Sikorska et al., 2015; Sikorska et al., 2018). LMM has been a

relatively mature way to deal with repeated measurements, thus it is treated as the gold standard

model in this series of papers. Several fast methods were proposed, along with implementable code

provided in appendices. Here I review the fast methods in the order in which they were proposed.

1.5.1 Study Background

The papers used data from the Rotterdam study, a prospective cohort study initiated in 1990 that focuses on a series of diseases frequent in elderly people. By the end of 2008, it comprised 14,926 participants aged 45 years or over who lived in the study district of Rotterdam, and an extension started in 2016 to include participants aged 40 years and over (Ikram et al., 2017). The data for

Sikorska et al. (Sikorska et al., 2013b) included 4,987 participants who were scheduled to have

visits at baseline and after 2, 6 and 12 years to take femoral neck bone mineral density (BMD)

measurements. 4,933 had at least one visit. The number of individuals for each visit decreased and

the missing rate increased over time. The numbers of individuals with 1, 2, 3 or 4 visits were

relatively even, indicating a missing response rate of 34.5% among these 4,933 individuals as

shown in Table 1-1.

Table 1-1 Rotterdam study: number of individuals with K non-missing responses.

K    Women    Men    Combined
4    679      554    1233
3    833      659    1492
2    759      552    1311
1    543      354    897

Note: Adapted from Table II in “Fast linear mixed model computations for genome-wide association studies with longitudinal data” by K. Sikorska et al., 2013, Statistics in Medicine, 32, 165-180, Copyright 2012 John Wiley & Sons, Ltd (Sikorska et al., 2013b).
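The missing response rate of 34.5% quoted above can be recovered from the combined column of Table 1-1. The short check below is only an arithmetic sketch; it assumes four planned visits per individual, as described for the study design, and the variable names are illustrative.

```python
# Combined counts from Table 1-1: K non-missing responses -> number of individuals
counts = {4: 1233, 3: 1492, 2: 1311, 1: 897}
planned_visits = 4  # baseline and visits after 2, 6 and 12 years

n_individuals = sum(counts.values())
observed = sum(k * n for k, n in counts.items())   # responses actually recorded
planned = planned_visits * n_individuals           # responses if nobody missed a visit
missing_rate = 1 - observed / planned

print(n_individuals, observed, planned, round(100 * missing_rate, 1))
```

That is, 12,927 of the 19,732 planned responses among the 4,933 individuals were observed, which leaves about 34.5% missing.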

In the first paper (Sikorska et al., 2013b), simulated data were generated based on the Rotterdam

study for the longitudinal BMD responses along with the real data analyzed by slope as outcome

(SAO), two-step method (TS), conditional two-step method (CTS), standard LMM and two GEEs

with different working correlation matrices. The simulation study had a full factorial design in

aspects of sample size (N=500, 1000, 3000), cross-sectional SNP effect (b=0, 0.005), longitudinal

SNP effect (b=0, 0.008) and missing patterns (complete, missing completely at random and

missing at random). An additive genetic model was assumed for SNPs in the simulation study to

generate dosage from a uniform distribution (MAF=0.5). Without any additional covariates in their simulation study model, the standard LMM has the form:

𝑌𝑖𝑗 = 𝛽0 + 𝛽1𝑆𝑖 + 𝛽2𝑡𝑖𝑗 + 𝛽3𝑆𝑖𝑡𝑖𝑗 + 𝑏0𝑖 + 𝑏1𝑖𝑡𝑖𝑗 + 𝜖𝑖𝑗, 𝑗 = 1,⋯ , 𝑛𝑖 , 𝑖 = 1,⋯ ,𝑁 (1.3)

where β1 and β3 are the cross-sectional effect at baseline (time 0) and the longitudinal effect of a SNP, respectively, β2 and 𝑏1𝑖 are the fixed and random slopes of time, and 𝑏0𝑖 is the random intercept. 𝑆𝑖 is the SNP coded as 0, 1 or 2, representing the number of copies of the minor allele, and is incorporated as a continuous variable in the model.

With the LMM as standard, several two-stage methods of slope as outcome (SAO), two-step

method (TS) and conditional two-step method (CTS) were presented as fast approximations to find

genetic associations with the evolution of traits over time (Sikorska et al., 2013b). In 2018, another

fast method for LMM called GALLOP (Genome-wide Analysis of Large-scale Longitudinal

Outcomes using Penalization) was proposed and evaluated in a similarly designed simulation study

with application on the Rotterdam data (Sikorska et al., 2018).

1.5.2 Slope as Outcome Method (SAO)

Slope as outcome is one of the simplest ways to deal with longitudinal data. The main idea of two

stage analysis is to use summary statistics to replace the multiple observations for each

individual (Verbeke and Molenberghs, 2000). For example, the mean of the response variable can

be calculated as the outcome, making one linear regression for one SNP possible and efficient.

Here the per individual slope over time is the statistic to be used.

Step 1:

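$$Y_{ij} = \beta_{0i}^{\Delta} + \beta_{1i}^{\Delta}\,\mathrm{time}_{ij} + \beta_{2i}^{\Delta}\,\mathrm{Cov}_{ij} + \cdots + e_{ij}^{\Delta}, \quad i = 1, 2, \ldots, N, \; j = 1, 2, \ldots, n_i \qquad (1.4)$$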

In the first step, we fit a linear regression for each individual with ordinary least squares approach

to find the slope of time 𝛽1𝑖∆. Time-changing covariates can be added in this step to adjust for time

and outcome association. Also, time-static covariates are meaningless in the model as they will

simply be included as part of the intercept, therefore the slope 𝛽1𝑖∆ is unaffected.
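The first step of SAO can be sketched as follows for the simplest case with time as the only covariate. The helper `ols_slope` and the toy subjects are hypothetical; they are not the code or data used in this thesis.

```python
def ols_slope(times, values):
    """Ordinary least squares slope of values on times for one individual."""
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    return sxy / sxx  # requires at least two distinct time points


# Toy (made-up) data: visit times in years and outcome values per subject;
# subjects may have different numbers of visits.
subjects = {
    "id1": ([0, 2, 6, 12], [1.00, 0.98, 0.94, 0.88]),
    "id2": ([0, 2, 6], [0.90, 0.90, 0.90]),
    "id3": ([0, 12], [0.80, 0.92]),
}
slopes = {sid: ols_slope(t, y) for sid, (t, y) in subjects.items()}

# Adding a constant to every measurement of a subject (the effect of a
# time-static covariate) changes the intercept but not the slope.
t1, y1 = subjects["id1"]
shifted_slope = ols_slope(t1, [y + 5.0 for y in y1])
print(slopes, shifted_slope)
```

An individual with a single measurement has no estimable slope, so such subjects cannot contribute to this step.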

Step 2:

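$$\beta_{1i}^{\Delta} = \beta_0^{\Delta\Delta} + \beta_1^{\Delta\Delta}\,\mathrm{snp}_i + e_i^{\Delta\Delta}, \quad i = 1, 2, \ldots, N \qquad (1.5)$$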

In the second step, each SNP is tested, one at a time, for its association with the slope.

The idea is that the slope obtained in the first step contains the information on how the outcome

changed over time. To be explicit, with one unit forward in time, the outcome will increase 𝛽1𝑖∆

units (Eq. 1.4). And then, it is reasonable to believe that this change over time might be partly

explained by the SNP effect, which is the coefficient 𝛽1∆∆

in the second linear regression model

(Eq. 1.5). Similarly, the regression is fitted using an ordinary least squares approach. Now, our null

hypothesis on longitudinal effect of SNPs becomes:

𝐻0: 𝛽1∆∆ = 0.

Through the test on this 𝐻0 with a specified significance level, the p-value indicating significance

of association between the SNP and outcome trait can be easily calculated according to the selected

test statistic. Here the Wald test is the default in the nlme package (Pinheiro et al., 2019). With estimate $\hat{\beta}_1^{\Delta\Delta}$ and standard error $\widehat{SE}$, a Z test statistic assumed to follow a standard normal distribution can be calculated as $Z = (\hat{\beta}_1^{\Delta\Delta} - 0)/\widehat{SE}$, and a p-value can be obtained.
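The second step of SAO, together with the Wald Z test just described, can be sketched as below. The function name `snp_slope_wald` and the toy genotypes and slopes are assumptions for illustration only; the two-sided p-value uses the standard normal reference distribution, as in the text.

```python
import math


def snp_slope_wald(snp, slopes):
    """Regress per-individual slopes on SNP dosage and return the Wald Z test."""
    n = len(snp)
    s_bar = sum(snp) / n
    b_bar = sum(slopes) / n
    sxx = sum((s - s_bar) ** 2 for s in snp)
    sxy = sum((s - s_bar) * (b - b_bar) for s, b in zip(snp, slopes))
    beta1 = sxy / sxx                      # estimated SNP effect on the slope
    beta0 = b_bar - beta1 * s_bar
    rss = sum((b - beta0 - beta1 * s) ** 2 for s, b in zip(snp, slopes))
    se = math.sqrt(rss / (n - 2) / sxx)    # standard error of beta1
    z = (beta1 - 0.0) / se
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value under N(0, 1)
    return beta1, se, z, p


# Toy (made-up) data: additive genotype coding and first-step slopes
snp = [0, 0, 1, 1, 2, 2]
slopes = [0.9, 1.1, 1.9, 2.1, 2.9, 3.1]
beta1_hat, se_hat, z_stat, p_value = snp_slope_wald(snp, slopes)
print(beta1_hat, se_hat, z_stat, p_value)
```

In a GWAS this regression is repeated for every SNP with the same vector of slopes, which is what makes the second step cheap compared with refitting a mixed model per SNP.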

1.5.3 Two-Step Method (TS)

The TS method follows the same idea as SAO, except that it uses a different approach to obtain the slopes of time.

Step 1:

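$$Y_{ij} = \beta_0^{*} + \beta_1^{*}\,\mathrm{time}_{ij} + \beta_2^{*}\,\mathrm{Cov}_{i(j)} + \cdots + b_{0i}^{*} + b_{1i}^{*}\,\mathrm{time}_{ij} + b_{2i}^{*}\,\mathrm{Cov}'_{ij} + \cdots + e_{ij}^{*}, \quad i = 1, 2, \ldots, N, \; j = 1, 2, \ldots, n_i \qquad (1.6)$$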

In the first step, one LMM is fit without any SNP terms; it is the model in Eq. 1.3 with the SNP main effect and the SNP-by-time interaction removed. In TS the model can be adjusted for any other covariates with fixed or random effects on the outcome.

Next, the subject-specific random slopes of time play the role of the separately estimated slopes in SAO; they may contain information on how genetic predictors have changing effects over time. The best linear unbiased predictions (BLUPs) of the random slopes are therefore used as responses in a linear regression on each SNP.

Step 2:

$$b_{1i}^{*} = \beta_0^{**} + \beta_1^{**}\,\mathrm{snp}_i + e_i^{**}, \quad i = 1, 2, \ldots, N \qquad (1.7)$$

Now, the null hypothesis on longitudinal effect of SNPs becomes:

𝐻0: 𝛽1∗∗ = 0.

If there is no SNP effect and no SNP-time interaction effect, the model in the first step is correctly specified. Additionally, a selected covariance structure for the errors can be used in the first-step model to improve it.

1.5.4 Conditional Two-Step (CTS) Method

1.5.4.1 Conditional Linear Mixed Model

In longitudinal studies, researchers are usually interested in whether the progression of a

quantitative trait is caused by some changing factors over time. Apart from the longitudinal effects

from changing factors, the difference between individuals may also lie in some characteristics

which are constant since baseline, which are cross-sectional effects. Although in such studies

longitudinal effects are of more interest, omitting cross-sectional effects would be a serious

misspecification of the model. The estimate and inference on longitudinal effects can be biased

when cross-sectional effects are relatively large and mis-specified in the model (Verbeke and

Molenberghs, 2000).

By applying conditional linear mixed effect model, we can remove all cross-sectional effects from

the model by conditioning on their sufficient statistics. The advantage of this method is that we

can obtain inference on the parameters of interest without loss of information and we do not have

to deal with nuisance parameters. The disadvantage is, however, all information on cross-sectional

effects are lost including subject-specific effects. But if it is justified that longitudinal effects

should be the main focus in longitudinal studies, the benefits can outweigh the costs.

In the context of this thesis, the conditional LMM serves for the purpose of getting random slopes

as the first step in CTS. The random slopes generated from conditional LMM will then be

processed the same way as in SAO/TS method. They are used to investigate whether the random

changing effect over time can be explained by genetic information.

1.5.4.2 Data Transformation for Conditional Inference

A background on conditional inference is provided here in order to apply the conditional linear

mixed model. This approach is an alternative to the classical MLE. When a LMM is present as

(Eq. 1.1), usually 𝑏𝑖 are not of primary interest so they will be treated as nuisance. The conditional

approach is to make MLE conditional on sufficient statistics for the nuisance parameters 𝑏𝑖, and

the sufficient statistic is selected as 𝑍𝑖′𝑦𝑖.

Given the sufficient statistic $Z_i' y_i$, the conditional density can be used to obtain estimates of the relevant parameters, such as $\beta$ and $\sigma$, by maximizing the conditional likelihood $\prod_{i=1}^{N} f_i(y_i \mid Z_i' y_i, \beta, \sigma^2)$. It was then found that, for any full-rank $n_i \times (n_i - q)$ matrix $A_i$ which satisfies $A_i' Z_i = 0$, the conditional approach is equivalent to transforming the outcome vectors $y_i$ by this matrix $A_i$. In addition, it is more convenient if $A_i$ is selected to satisfy $A_i' A_i = I_{n_i - q}$, so that the transformed $y_i$ follows the normal distribution $A_i' y_i \sim N(A_i' X_i \beta, \sigma^2 I_{n_i - q})$.

Because our emphasis is on time-varying effects among both fixed- and random-effect variables, the model is rewritten in this form:

$$y_i = X_i^{(1)}\beta^{(1)} + X_i^{(2)}\beta^{(2)} + Z_i^{(1)} b_i^{(1)} + Z_i^{(2)} b_i^{(2)} + e_i \qquad (1.8)$$

Design matrices for both $X_i$ and $Z_i$ have been split into a cross-sectional part and a time-varying part. A superscript $(1)$ indicates the cross-sectional part and $(2)$ indicates the longitudinal part.

Because $X_i^{(1)}$ contains cross-sectional covariates, it can be expressed as $X_i^{(1)} = 1_{n_i} x_i'$. Similarly, $Z_i^{(1)} = 1_{n_i}$ because $b_i^{(1)}$ is the random intercept for the $i$th subject. Lastly, $b_i^{(2)}$ are random slopes for longitudinal covariates and $Z_i^{(2)}$ is time here.

In this case, the only nuisance parameter is $b_i^{(1)}$, the subject-specific intercept, because we are counting on the random slopes $b_i^{(2)}$ to provide longitudinal effects of SNPs. Analogously to $Z_i' y_i$, the sufficient statistic for $b_i^{(1)}$ is $Z_i^{(1)\prime} y_i = \sum_j y_{ij}$ or $\bar{y}_i = \sum_j y_{ij} / n_i$. Furthermore, here only 1 parameter for each subject is set as a nuisance parameter, so a full-rank $n_i \times (n_i - 1)$ matrix $A_i$ which satisfies $A_i' Z_i^{(1)} = A_i' 1_{n_i} = 0$ is needed to transform the data to make conditional inference on the other parameters.

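By multiplying both sides of (Eq. 1.8) by $A_i'$, we have that

$$A_i' y_i = A_i' 1_{n_i} x_i' \beta^{(1)} + A_i' X_i^{(2)} \beta^{(2)} + A_i' 1_{n_i} b_i^{(1)} + A_i' Z_i^{(2)} b_i^{(2)} + A_i' e_i \qquad (1.9)$$

which, because $A_i' 1_{n_i} = 0$ removes the first and third terms, is equivalent to

$$y_i^{*} \equiv A_i' y_i = X_i^{*} \beta^{(2)} + Z_i^{*} b_i^{(2)} + e_i^{*} \qquad (1.10)$$

where $X_i^{*} = A_i' X_i^{(2)}$, $Z_i^{*} = A_i' Z_i^{(2)}$ and $e_i^{*} = A_i' e_i$. In addition, $A_i$ will be selected to satisfy $A_i' A_i = I_{n_i - 1}$ so that the variance of $e_i^{*}$ is $\sigma^2 I_{n_i - 1}$.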

Now all cross-sectional parts, including the random intercepts for each individual, are removed

from the model. The rest of parameters remaining in the model can then be estimated by fitting a

LMM on the transformed data. The R code for finding the 𝐴𝑖 matrix to transform data is obtained

from Sikorska’s paper (Sikorska et al., 2013b).
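One matrix that meets both requirements (orthogonal to the vector of ones and with orthonormal columns) can be built from normalized Helmert contrasts. The sketch below uses that construction as an assumption for illustration; it is not necessarily the matrix produced by the R code of Sikorska et al., and the function names are hypothetical.

```python
import math


def helmert_transform(n):
    """Return an n x (n-1) matrix A with A'1 = 0 and A'A = I (Helmert contrasts)."""
    cols = []
    for k in range(1, n):
        scale = math.sqrt(k * (k + 1))
        # k equal positive entries, one negative entry, then zeros
        cols.append([1.0 / scale] * k + [-k / scale] + [0.0] * (n - k - 1))
    return [[cols[c][r] for c in range(n - 1)] for r in range(n)]


def apply_transpose(A, v):
    """Compute A'v for a matrix A stored as a list of rows."""
    return [sum(A[r][c] * v[r] for r in range(len(v))) for c in range(len(A[0]))]


# Toy (made-up) outcome vector for one subject with four visits
y = [1.00, 0.98, 0.94, 0.88]
A = helmert_transform(len(y))
y_star = apply_transpose(A, y)

# Shifting every measurement by a constant (a subject-specific intercept)
# leaves the transformed data unchanged.
y_star_shifted = apply_transpose(A, [v + 5.0 for v in y])
print(y_star, y_star_shifted)
```

The transformed vector has one element fewer than the original, which reflects the single degree of freedom spent on conditioning out the subject-specific intercept.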

1.5.4.3 Steps after Data Transformation

Step 1:

$$y_i^{*} = \beta^{(2)}\,\mathrm{time}_{ij}^{*} + b_i^{(2)}\,\mathrm{time}_{ij}^{*} + e_i^{*}, \quad i = 1, 2, \ldots, N \qquad (1.11)$$

No time-varying covariates were used in our model, therefore the only covariate left in the

transformed data is time. By specifying a model without any fixed or random intercepts, a LMM

is fit as the second step in order to obtain random slopes for each subject. Similar to previous fast

methods, longitudinal information on SNPs is assumed to be contained in the estimates of these random slopes $b_i^{(2)}$.

Step 2:

$$b_i^{(2)} = \beta_0^{***} + \beta_1^{***}\,\mathrm{snp}_i + e_i^{***}, \quad i = 1, 2, \ldots, N \qquad (1.12)$$

Now by fitting a least squares linear regression on the random slopes, the null hypothesis on the

longitudinal effect of SNPs becomes this:

𝐻0: 𝛽1∗∗∗ = 0.

Compared with SAO and TS, the major advantage of CTS is that misspecification of the cross-sectional part no longer affects the estimation of the remaining parameters.

Shortly afterwards the authors reported additional work on linear and logistic regression to speed up access to SNP data and the fitting of many regressions (Sikorska et al., 2013a). In 2015, Sikorska et al. combined their semi-parallel fast linear regression with CTS and showed that the computation time for GWAS was reduced from several weeks to a few minutes on a desktop while the accuracy remained well controlled (Sikorska et al., 2015).

1.5.5 Genome-wide Analysis of Large-scale Longitudinal Outcomes using

Penalization (GALLOP)

In 2018 Sikorska and her colleagues developed a new algorithm named GALLOP as a fast

replacement for LMM. A common feature of the SAO, TS and CTS methods is that they cannot

provide any inference on cross-sectional SNP effect. It is because all of them reduce the dimension

of data by taking one summary statistic, slope, from each individual using different methods. While

longitudinal SNP effect is the main focus, loss of cross-sectional information is still a defect not

to be neglected, and additional LMMs need to be run if the main effect is still required. With the

implementation of GALLOP both cross-sectional and longitudinal effects for SNP can be obtained

at the same time for a comparison with LMM results. In addition, its speed is comparable to that of the other fast methods.

Currently the usual way of getting the MLE of parameters including fixed, random effects and

their variances in LMM is via iterative algorithms like Newton-Raphson. It is claimed that, with the variances known, both random and fixed effects can be estimated at the same time by solving a penalized least squares problem in the form of a system of equations. This system is Henderson’s

system of equations for LMM. To get Henderson’s system of equations, the BLUPs of random and

fixed effects in LMM are obtained by letting the partial derivative of log-likelihood function be 0

with respect to random effects first and fixed effects second. Therefore estimating BLUPs is

equivalent to solving the Henderson’s system of equations (Henderson, 1950):

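$$\begin{cases} X'\Sigma^{-1}X\hat{\beta} + X'\Sigma^{-1}Z\hat{b} = X'\Sigma^{-1}y \\ Z'\Sigma^{-1}X\hat{\beta} + (Z'\Sigma^{-1}Z + D^{-1})\hat{b} = Z'\Sigma^{-1}y \end{cases} \qquad (1.13)$$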

The penalized least squares problem in GALLOP is based on this form by setting the error variance

as 𝛴𝑖 = 𝜎2𝐼𝑛𝑖, generating the simplified system of equations (Sikorska et al., 2018):

$$\begin{cases} X'X\hat{\beta} + X'Z\hat{b} = X'y \\ Z'X\hat{\beta} + (Z'Z + P)\hat{b} = Z'y \end{cases} \qquad (1.14)$$

where $P = \mathrm{diag}(P_i)$ and $P_i = (D/\sigma^2)^{-1}$.
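To make the system in Eq. 1.14 concrete, the sketch below solves it for the simplest possible case: a model with a fixed intercept and a random intercept only, with the variances treated as known. This is a toy illustration of the penalized least squares idea, not the GALLOP algorithm itself; the function names, the data and the variance values are all assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def henderson_random_intercept(groups, sigma2, d):
    """Solve Eq. 1.14 for y_ij = beta0 + b_i + e_ij with known sigma2 and Var(b_i) = d."""
    lam = sigma2 / d                      # P_i = (D / sigma^2)^(-1) for scalar D = d
    m = len(groups)
    sizes = [len(g) for g in groups]
    sums = [sum(g) for g in groups]
    # First row: X'X beta + X'Z b = X'y, with X a column of ones
    A = [[float(sum(sizes))] + [float(n) for n in sizes]]
    # Remaining rows: Z'X beta + (Z'Z + P) b = Z'y, one row per subject
    for i in range(m):
        row = [float(sizes[i])] + [0.0] * m
        row[1 + i] = sizes[i] + lam
        A.append(row)
    rhs = [float(sum(sums))] + [float(s) for s in sums]
    sol = solve(A, rhs)
    return sol[0], sol[1:]


# Toy (made-up) repeated measurements for three subjects
groups = [[1.0, 1.2, 0.8], [2.0, 2.2], [3.1]]
beta0_hat, b_hat = henderson_random_intercept(groups, sigma2=0.5, d=1.0)
print(beta0_hat, b_hat)
```

In this special case each predicted random intercept is the subject's mean deviation from the fixed intercept, shrunk by the factor n_i / (n_i + σ²/d), so subjects with fewer measurements are shrunk more strongly towards zero.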

Under the assumptions of a known variance σ² and this form of the error variance, this method starts with a

mis-specified LMM by omitting SNP terms first. This step provides the estimated variances to

calculate an estimated penalized component 𝑃 which is necessary to solve penalized least squares

problem. It was additionally assumed that SNP effects are usually so small that adding them into the model later will not change the variances or 𝑃 by much.

In the next step, a SNP is added in forms of both cross-sectional and longitudinal effects in the

design matrix 𝑋, adding two more parameters to be solved and two equations to the previous

system of equations. However, parts of the large matrix inversion needed to solve the current system can be reused from the solution of the previous system of equations without SNPs. By calculating these components once in advance for repeated use, much computation time can be saved.

Finally, the outcomes of regression parameter estimation, standard error and Wald test p-value can

all be calculated by matrix operation. The detailed algorithm of GALLOP and implementation in

R code are provided in the supplementary information of the original paper (Sikorska et al., 2018).

1.5.6 Methods Performance

Through both simulation and application data analysis it was found that, among SAO, TS and CTS, CTS best reproduced the LMM inference on the longitudinal SNP effect in two respects: the highest accuracy, in terms of power nearest to that of LMM under a controlled type 1 error (T1E; α = 5%), and the shortest processing time (Sikorska et al., 2013b; Sikorska et al., 2018). The comparisons of T1E, power and computation time in Tables 1-2, 1-3 and 1-4 are cited from the paper (Sikorska et al., 2013b).

Table 1-2 Rotterdam study: Type 1 error comparison.

Note: Adapted from Table IX in “Fast linear mixed model computations for genome-wide association

studies with longitudinal data” by K. Sikorska et al., 2013, Statistics in Medicine, 32, 165-180, Copyright

2012 John Wiley & Sons, Ltd (Sikorska et al., 2013b). β1: cross-sectional SNP effect. β3: longitudinal SNP

effect.

Table 1-3 Rotterdam study: Power comparison.

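Note: Adapted from Table XI in “Fast linear mixed model computations for genome-wide association studies with longitudinal data” by K. Sikorska et al., 2013, Statistics in Medicine, 32, 165-180, Copyright 2012 John Wiley & Sons, Ltd (Sikorska et al., 2013b). β1: cross-sectional SNP effect. β3: longitudinal SNP effect.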

Table 1-4 Rotterdam study: Speed comparison.

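Note: Adapted from Table XII in “Fast linear mixed model computations for genome-wide association studies with longitudinal data” by K. Sikorska et al., 2013, Statistics in Medicine, 32, 165-180, Copyright 2012 John Wiley & Sons, Ltd (Sikorska et al., 2013b).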

GALLOP was compared with LMM and CTS in a similar simulation study based on the same Rotterdam study data. CTS is 15 times faster than GALLOP in producing p-values for the longitudinal SNP effect, but for CTS most of the analysis time was spent on data access (Sikorska et al., 2018). It was concluded that GALLOP is a very efficient method that provides estimates and p-values practically identical to those of LMM for both cross-sectional and longitudinal SNP effects.

1.5.7 Concerns

The Sikorska et al. papers provide great inspiration for potentially useful tools to efficiently

conduct GWAS for longitudinal studies. However, there are some concerns about their simulation

study and motivating Rotterdam study that might prevent the widespread application of their

proposed fast methods. Here we study various aspects of real data types that require additional

evaluation across methods.

First, the SNP data in the Rotterdam simulation study were generated randomly from a uniform

distribution from 0 to 2. This results in a SNP being equally likely to take any value between 0 and 2. Under the additive model, a genotyped SNP usually takes only three values: 0 (homozygous for the major allele), 1 (heterozygous) and 2 (homozygous for the minor allele). For

imputed SNPs we have probabilities for the 3 possible genotypes. SNPs in most simulation studies

follow Hardy-Weinberg Equilibrium (HWE) which states under five conditions (no mutation, no

gene flow, large population size, random mating, and no natural selection), the proportions of 0, 1

and 2 based on Minor Allele Frequency MAF = p (𝑝 ≤ 50%) follow:

$$\Pr(\mathrm{SNP} = 0) = (1 - p)^2, \quad \Pr(\mathrm{SNP} = 1) = 2p(1 - p), \quad \Pr(\mathrm{SNP} = 2) = p^2$$

This results in the SNP matrix for subjects usually being sparse with most elements being 0,

especially at a low MAF. However, an imputed SNP can have a “dosage” value between 0 and 2

if it is a weighted average of all three possible genotypes. A SNP generated from a uniform

distribution from 0 to 2 can be seen as an imputed SNP following MAF = 0.5. (There are also

other ways to decide the genotype from imputed SNPs, such as taking the genotype with largest

probability if it is larger than pre-specified threshold and calling a missing genotype otherwise.

Typically, dosage is used.) Given this, it motivates us to determine how SNPs generated from a

range of MAFs have influence on the T1E and power in a simulation study.
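The effect of MAF on the genotype distribution under HWE can be seen with a few lines of arithmetic. The helper `hwe_genotype_probs` below is a hypothetical name used only for this illustration.

```python
def hwe_genotype_probs(maf):
    """Genotype proportions (0, 1, 2 copies of the minor allele) under HWE."""
    p = maf
    return ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)


for maf in (0.01, 0.05, 0.5):
    p0, p1, p2 = hwe_genotype_probs(maf)
    expected_dosage = p1 + 2 * p2   # mean genotype, equal to 2 * MAF
    print(maf, round(p0, 4), round(p1, 4), round(p2, 4), round(expected_dosage, 4))
```

At MAF = 0.01 about 98% of subjects carry genotype 0, which is the sparsity noted above, whereas a dosage drawn uniformly from 0 to 2 has mean 1 and therefore matches the expected dosage 2p only at MAF = 0.5.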

Variation in MAF leads to the second concern, T1E. According to research (Ma et al., 2013) on

GWAS testing association between low count variants and case-control outcome using both MAF

and expected minor allele count (E[MAC]), it was found that for low count variants (E[MAC] <

400, MAF < 0.01 for N = 20,000) the Wald test is very conservative at a nominal T1E rate of

α = 5 × 10⁻⁸. We were therefore also concerned that inflated T1E might occur for quantitative

outcomes when MAF is low. Consequently, an inflated T1E can cause inflation of power of the

The third concern is that the Rotterdam simulation study did not account for any within subject

error correlation as the errors were generated independently for each visit of each subject. This is

partly because of their selection of the lme4 package (Bates et al., 2015). The lme4 package does

not provide an option for error covariance structure, nor does it generate p-values for association

testing. We would like to see how performance changes when a different covariance structure for

error is applied. We selected the nlme package which provides many options (Pinheiro et al.,

2019). In addition, it has been verified that there is no difference between the default p-values generated by the nlme package and Wald-test p-values calculated from the lme4 package, although lme4 was reported to use a faster algorithm than nlme (Sikorska et al., 2013b).

The fourth aspect to be further explored is the missing data pattern. According to a classification of missing data originating from Rubin et al. (Little and Rubin, 1987), assume there are (more than

one) X and Y variables for one subject, but only Y is subject to nonresponse/missingness. Then

missing mechanisms are classified by telling whether probability of response in Y (1) depends on

Y and possibly X as well, (2) depends on X but not on Y, or (3) is independent of X and Y. Today,

people usually summarize these three mechanisms as:

1) Missing Not at Random (MNAR): The data are neither missing at random nor observed at

random;

2) Missing at Random (MAR): The observed values of Y are not necessarily a random

subsample of the sampled values, but they are a random sample of the sampled values

within subclasses defined by values of X;

3) Missing Completely at Random (MCAR): The observed values of Y form a random

subsample of the sampled values of Y.

The missing data mechanisms for MAR and MCAR are ignorable for likelihood-based inference,

but the MNAR mechanism is non-ignorable. The missing data scenarios in the simulation study based on the Rotterdam study included dropout under MCAR and MAR mechanisms, with the overall missing rate controlled to be similar to that of the real data (~35%). We would like to simulate

both missing at some time points and dropout (lost to follow-up). For missing at some time points,

subjects will be missing according to the three mechanisms MCAR, MAR and MNAR. We are

making such a design because apart from dropout, participants might also come back after missing

a visit.
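The three mechanisms differ only in what the probability of a missing response is allowed to depend on. The sketch below expresses this with a logistic model; the logistic form, the function `prob_missing` and the parameter values are illustrative assumptions and not the design used in the simulation study of this thesis.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_missing(mechanism, x, y, base=-0.6, effect=1.0):
    """Probability that the response Y is missing at a visit.

    x is a fully observed covariate and y is the (possibly unobserved)
    response itself. base and effect are made-up illustrative values.
    """
    if mechanism == "MCAR":
        return logistic(base)                # independent of X and Y
    if mechanism == "MAR":
        return logistic(base + effect * x)   # depends on X but not on Y
    if mechanism == "MNAR":
        return logistic(base + effect * y)   # depends on Y itself
    raise ValueError("unknown mechanism: " + mechanism)


for mech in ("MCAR", "MAR", "MNAR"):
    print(mech, round(prob_missing(mech, x=0.5, y=-0.5), 3))
```

With base = -0.6 the MCAR probability is about 0.35, chosen here only to echo the roughly 35% missing rate mentioned above. Applying such a probability independently at each visit yields intermittent missingness, while removing all visits after the first missing one yields dropout.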

The last concern for fast methods of SAO, TS and CTS is that they only provide inference on

longitudinal SNP effect and ignore the cross-sectional effect. However, even using LMM and

GALLOP we are limited to testing the effect of SNP on both aspects as separate tests. One

alternative is testing both marginal and interaction effect of a genetic variant and an environment

factor in one test inspired by Kraft et al. (Kraft et al., 2007). Driven by the need to test genetic associations with disease under different environmental exposures, it is claimed that a joint test combining tests of the marginal and interaction effects is an optimal approach in most situations and is more powerful than single tests when both effects exist. The benefit of applying a 2df (two

degrees-of-freedom) test is that it avoids multiple-testing complications and difficulties in

interpreting results. We therefore provide Wald joint test results for GALLOP and LMM.
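A 2df Wald joint test of the cross-sectional and longitudinal SNP effects can be sketched as follows. The statistic is the quadratic form of the two estimates in the inverse of their covariance matrix, referred to a chi-square distribution with 2 degrees of freedom, whose upper tail probability is exp(-W/2). The function name and the input values are hypothetical.

```python
import math


def wald_joint_2df(beta1, beta3, var1, var3, cov13):
    """Joint Wald test of H0: beta1 = beta3 = 0.

    var1, var3 and cov13 are the estimated variances and covariance of the
    two estimates (the relevant 2 x 2 block of the covariance matrix).
    """
    det = var1 * var3 - cov13 ** 2
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # W = b' V^{-1} b, using the explicit inverse of a 2 x 2 matrix
    w = (var3 * beta1 ** 2 - 2 * cov13 * beta1 * beta3 + var1 * beta3 ** 2) / det
    p = math.exp(-w / 2.0)   # survival function of chi-square with 2 df
    return w, p


# Made-up estimates and covariance entries for one SNP
w_stat, p_joint = wald_joint_2df(beta1=0.02, beta3=0.004,
                                 var1=1e-4, var3=4e-6, cov13=-5e-6)
print(w_stat, p_joint)
```

When the two estimates are uncorrelated the statistic reduces to the sum of the two squared single-parameter Z statistics.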

1.6 Goals

The primary goal is to compare fast methods including SAO, TS, CTS and GALLOP with LMM

in terms of T1E, power, parameter estimate and computational speed under specific settings. T1E

and power are compared in simulation studies to assess accuracy. To support wider future application, it would be meaningful to identify the specific conditions under which each fast method performs best.

A secondary goal is to detect associations in the DCCT/EDIC application with the help of fast methods. These fast methods work as approximations of LMM and narrow the pool of candidate SNPs. The subset of SNPs reaching a given threshold under the fast methods is then fit with LMM to see whether the results are close to the 'truth' that would be obtained by testing all SNPs with LMM.

By realizing these two goals, we want to test the hypothesis of this thesis: the performance of the above fast methods as replacements for LMM in longitudinal GWAS is affected by MAF, sample size, within-subject error structure and missing data patterns related to real study settings.

Methods

The procedures for applying the methods in the simulation study and to the real data are summarized in the flowchart:

Figure 2-1 Flow chart for Methods section.

2.1 Data Background: DCCT and EDIC

Our motivating data come from the DCCT/EDIC study.

2.1.1 Diabetes Control and Complications Trial (DCCT)

The DCCT was an unblinded randomized controlled trial designed for patients with insulin-dependent diabetes mellitus (IDDM), namely type 1 diabetes (T1D). The trial began in 1983 and ended prematurely in 1993 due to the beneficial effects of intensive therapy. A total of 1441 patients with T1D were recruited from 29 centers between 1983 and 1989 and were randomly assigned to one of two treatment groups: conventional or intensive therapy (Shamoon et al., 1993). The mean follow-up time was 6.5 years, with a minimum of around 3 years and a maximum of 9 years. The primary outcome was retinopathy, but the trial was also used to test whether other T1D-related complications could be delayed, or their rate of progression slowed, by treatment intervention.

2.1.1.1 DCCT cohorts

The DCCT has two important cohorts. The primary prevention cohort and the secondary intervention cohort differ in terms of retinopathy at baseline, i.e. patients without and with retinopathy when recruited into the trial (DCCT Research Group, 1986). Accordingly, the primary cohort had diabetes duration of 1-5 years and no microalbuminuria (<40 mg albumin/24h on a 4-h urine collection). The secondary cohort had diabetes duration of 1-15 years and possible microalbuminuria (≤200 mg albumin/24h on a 4-h urine collection). The primary outcomes of the DCCT differ between the two cohorts. For the primary prevention cohort, the primary outcome of interest was the initial appearance of retinopathy. For the secondary intervention cohort, the principal outcome was the progression or improvement of pre-existing minimal retinopathy (DCCT Research Group, 1986). The final numbers of patients were 726 in the primary prevention cohort and 715 in the secondary intervention cohort.

2.1.1.2 DCCT treatment

The intensive treatment plan was three or more insulin injections or treatment with the help of an

external pump every day, adjusted by results of self-monitoring glucose level four times or more

per day, diet and exercise. The conventional treatment was one or two daily insulin injections along

with self-monitoring urine or blood glucose and education about lifestyle (Paterson et al., 2010).

The goal of conventional therapy was to keep patients free from hyper/hypoglycemia and ketonuria and to maintain body weight. The goal of intensive therapy was to keep glycemia in the normal range (HbA1c<6.05%), with multiple blood glucose measures daily and monthly HbA1c measures (Shamoon et al., 1993).

The DCCT was terminated after about 10 years from its launch due to the beneficial effects of

intensive therapy and the original conventional group was taught intensive therapy (Nathan, 2014).

The experience obtained from the trial is of great value and many insights were provided by the

DCCT research team. The definitive trial results were published in 1993 by DCCT team (Shamoon

et al., 1993). The important findings were that intensive therapy delays the onset and slows the progression of retinopathy and reduces the occurrence of microalbuminuria, albuminuria and neurologic complications, but carries a higher risk of hypoglycemia and adverse weight gain. The DCCT team concluded that the benefits of intensive treatment outweighed the risks, but that intensive therapy should be carried out with extra caution (Shamoon et al., 1993).

2.1.2 Epidemiology of Diabetes Interventions and Complications (EDIC)

After the termination of DCCT, a follow-up observational study began tracking the subsequent traits of the same cohort of participants. This longitudinal study, the Epidemiology of Diabetes Interventions and Complications (EDIC), has provided annual evaluations of diabetes-related traits for participants at multiple centers from 1994 to the present. The purpose of EDIC is to observe the long-term durability of DCCT treatment effects on diabetes complications, even though many participants changed their treatment during the EDIC period (EDIC Research Group, 1999).

EDIC has also investigated other related chronic diseases. For instance, EDIC data were used to study the effects of HbA1c and diabetes duration on the incidence of cardiovascular disease (Lachin and DCCT/EDIC Research Group, 2016). The effect of randomized DCCT treatment on diabetic retinopathy and other ocular diseases has been a focus (DCCT/EDIC Research Group, 2014a). In addition, studies of the development and progression of neuropathy and nephropathy contribute to our understanding of T1D (Martin et al., 2014; de Boer and DCCT/EDIC Research Group, 2014). Apart from physical function, cognitive function has also been explored with EDIC data, and original treatment group assignment was found not to be associated with any decline in cognitive function (DCCT/EDIC Research Group, 2007). EDIC provides new insights into the mechanisms of long-term development of these diseases.

Visits to clinics included a medical history and physical examination in both studies, but the schedule of some measurements differs slightly between DCCT and EDIC. For example, glomerular filtration rate (GFR) was estimated every year in both DCCT and EDIC, while albumin excretion rate (AER) was measured every year in DCCT and every other year in EDIC.

2.1.3 Renal Outcome Measures

We chose to focus on renal complications in this thesis. This section introduces two renal traits, albuminuria and estimated glomerular filtration rate, including their definitions and collection methods, along with some previous findings.

2.1.3.1 Albuminuria

Urinary albumin excretion rate (AER) is an important measure to screen people with diabetes for nephropathy. The DCCT baseline inclusion criteria for AER were ≤ 40 mg/24h for the primary prevention cohort and ≤ 200 mg/24h for the secondary intervention cohort, with microalbuminuria defined as AER ≥ 40 mg/24h and macroalbuminuria as AER ≥ 300 mg/24h in earlier DCCT and EDIC reports (Younes et al., 2010). However, to be consistent with modern American Diabetes Association guidelines, in EDIC microalbuminuria was defined as AER ≥ 30 mg/24h (DCCT/EDIC Research Group, 2014b). AER was measured from a 4-hour urine collection, performed every year in DCCT and every other year in EDIC, and high precision was confirmed for data quality in both studies (DCCT Research Group, 1986; EDIC Research Group, 1999).

The existence and extent of albuminuria reflected by AER can help evaluate the likelihood of kidney disease so that irreparable damage can be prevented or postponed. In terms of genetic association, very few loci have been discovered and validated for albuminuria in T1D at the traditional GWAS significance level. One finding is rs1564939 in GLRA3, identified in people with T1D using 24h AER data from a Finnish study and validated by a meta-analysis (Sandholm et al., 2018). More new albuminuria loci are being discovered in general population cohorts, predominantly without diabetes (Haas et al., 2018).

At the end of the DCCT period, compared with the conventional treatment group, intensive treatment had reduced the incidence of microalbuminuria by 39% (95% CI 21%-52%, p≤0.002) and the incidence of macroalbuminuria by 54% (19%-74%, p<0.004) in the combined cohort (Shamoon et al., 1993). After the termination of DCCT, the effect of intensive therapy on albuminuria has been shown to persist for over ten years, even though many participants later switched to intensive treatment. During EDIC years 1-18, the risk reduction for microalbuminuria in the original intensive group relative to the conventional group was 45% (26%-59%, p<0.0001), and the risk reduction for macroalbuminuria was 61% (41%-74%, p<0.0001) (DCCT/EDIC Research Group, 2014b). Although the definition of microalbuminuria changed, the risk reduction over time shows that the original intensive therapy group has a lower risk of albuminuria decades after the end of DCCT.

2.1.3.2 Glomerular Filtration Rate

Glomerular filtration rate (GFR) is another measurement of kidney function and is used to stage kidney disease. GFR can be measured precisely by inulin clearance or other methods. Iothalamate clearance was measured in DCCT, but such direct measurements were not preferred in DCCT/EDIC because the procedure is cumbersome. Usually the value used is an estimated GFR (eGFR), calculated from a function of several patient measurements. Iothalamate GFR measurements obtained in subsets of participants showed changes in the same direction as the changes in the estimated GFR, but of larger magnitude (DCCT/EDIC Research Group, 2011). The most commonly used inputs for eGFR are serum creatinine concentration, age, sex and ethnicity. Some previous versions also included body size (EDIC Research Group, 1999). Over time the formula has been updated and validated by researchers, and currently the most credible one is the Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) creatinine equation, owing to its improved accuracy compared with other equations (Levey et al., 2009):

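$$\mathrm{eGFR} = 141 \times \min(\mathrm{Scr}/\kappa, 1)^{\alpha} \times \max(\mathrm{Scr}/\kappa, 1)^{-1.209} \times 0.993^{\mathrm{Age}} \times 1.018\ [\text{if female}] \times 1.159\ [\text{if black}].$$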

The equation takes serum creatinine, age, gender and ethnicity as parameters. Serum creatinine

level (Scr) is in mg/dL, age is in years, κ is 0.7 for females and 0.9 for males, α is -0.329 for

females and -0.411 for males, min indicates the minimum of Scr/κ or 1, and max indicates the

maximum of Scr/κ or 1.
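The equation and constants above can be written as a short Python sketch. The function and argument names are illustrative assumptions and do not come from the DCCT/EDIC analysis code; the sketch only restates the formula as described in the text.

```python
def egfr_ckd_epi(scr_mg_dl, age_years, female, black=False):
    """CKD-EPI creatinine equation as stated in the text (mL/min/1.73 m^2).

    scr_mg_dl: serum creatinine in mg/dL; age_years: age in years.
    Names are illustrative, not taken from the original analysis.
    """
    kappa = 0.7 if female else 0.9          # sex-specific creatinine scale
    alpha = -0.329 if female else -0.411    # sex-specific exponent
    ratio = scr_mg_dl / kappa
    egfr = (141.0
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** (-1.209)
            * 0.993 ** age_years)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr


# Example call: a 30-year-old woman with serum creatinine 0.7 mg/dL.
example_egfr = egfr_ckd_epi(0.7, 30, female=True)
```

Because Scr/κ equals 1 in the example call, both the min and max terms reduce to 1 and only the age and sex factors remain.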

eGFR is widely accepted as an overall measurement of kidney function. Because there was for a long time no standard classification of chronic kidney disease (CKD) stages, clinical practice guidelines by the National Kidney Foundation proposed classifying CKD severity by the level of eGFR. Usually patients with eGFR < 60 mL/min/1.73 m² should be assessed for renal function impairment and well-being (National Kidney Foundation, 2002). In DCCT, GFR was directly measured only a few times for each individual. Serum creatinine was obtained annually in both DCCT and EDIC to calculate eGFR (DCCT Research Group, 1986). The development of albuminuria usually precedes impaired GFR, and a slightly increased excretion of albumin can be a sensitive predictor of decline in GFR (de Boer and DCCT/EDIC Research Group, 2014). Impairment of GFR was defined in DCCT/EDIC as eGFR below 60 mL/min/1.73 m² at two consecutive study visits, i.e. CKD stage 3. During EDIC, participants are considered to have reached the renal stop point when their eGFR is ≤ 10 mL/min/1.73 m². After first reaching the renal stop point, participants are censored for all subsequent albuminuria and GFR measurements because they have developed end-stage renal disease.

Impaired eGFR increases the probability of developing end-stage renal disease, which may lead to other complications such as cardiovascular disease, and the risk of death can be very high for T1D patients. Around 70 participants had developed impaired eGFR in the DCCT/EDIC study by 2011 (EDIC year 18), and intensive therapy was found to significantly lower the risk of impaired eGFR (DCCT/EDIC Research Group, 2011). During EDIC years 1-18, the risk reduction for impaired eGFR in the original intensive group relative to the conventional group was 44% (95% CI 12%-64%, p=0.011) (DCCT/EDIC Research Group, 2014b).

2.1.4 Importance of DCCT/EDIC Study

In the (up to) 36 years since the launch of DCCT/EDIC, 94% of the surviving cohort are still being followed (Bebu et al., 2019). One of the strengths of DCCT/EDIC is that the data were collected from a well-documented cohort that has been followed for over 30 years. This ongoing cohort study remains valuable for many kinds of research, so that better treatment regimens can be adopted to reduce the risk of long-term complications of T1D.

2.2 Linear Mixed Effects Model for DCCT/EDIC Renal Outcomes

Based on the motivating dataset, our targeted outcomes are logAER and eGFR, which are both repeatedly measured quantitative traits reflecting T1D renal complications. Since different responses might have different patterns in evolution or in residuals, applying one uniform model or one common correlation structure would be inappropriate, so separate LMMs are fit for the two outcomes, adjusting for potentially related characteristics. In general, to conduct GWAS with LMMs, the first step is to construct a marginal model that best predicts the population-average outcomes using the phenotype data only. The second step is to include SNPs one by one and test for associations between SNPs and traits.

2.2.1 Phenotypic Covariate Selection

Based on clinically meaningful predictors, the demographic characteristics associated with the renal traits logAER and eGFR include sex (female versus male), cohort (primary prevention versus secondary intervention), randomized treatment (conventional versus intensive therapy) and the interaction between cohort and treatment. These variables were selected as the fixed-effect covariates in the LMM.

A time variable is needed in longitudinal data to define the interval between repeated measurements. The time variable used for DCCT/EDIC is the time in years since randomization into DCCT. (In addition, time in months was created as a more precise time variable for some of the within-subject correlation structures.) Time enters the model as both a fixed and a random effect. To allow for individual intercepts and slopes over time for all patients, a random intercept and slope model was specified.

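Taking logAER as an example, with all potential adjusting covariates, the final model in this step, containing only phenotypic covariates, has the form:

$$\mathrm{logAER}_{ij} = \beta_0 + \beta_3\,\mathrm{time}_{ij} + \beta_4\,\mathrm{cohort}_i + \beta_5\,\mathrm{gender}_i + \beta_6\,\mathrm{treatment}_i + \beta_7\,\mathrm{cohort}_i \times \mathrm{treatment}_i + b_{0i} + b_{1i}\,\mathrm{time}_{ij} + e_{ij}, \tag{2.1}$$

where $(b_{0i}, b_{1i}) \sim N(0, D)$ and $e_{ij} \sim N(0, \Sigma_i)$.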

2.2.2 Within-individual Correlation Structure Selection

Although errors from different individuals are often assumed to be independent, consecutive measures within a person are typically correlated. Such correlation patterns or structures can be selected from the options in the nlme package (Pinheiro et al., 2019).

From now on we refer to the covariance structure $\Sigma_i$ in place of the correlation structure. Both describe how measurements at different time points are associated. With $n_i$ measurements for individual $i$, the covariance matrix of the measurement errors has the form:

$$\Sigma_i = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n_i} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n_i} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n_i1} & \sigma_{n_i2} & \cdots & \sigma_{n_i}^2 \end{bmatrix}$$

The default option in the nlme package is the Variance Component (VC) structure, which is diagonal and assumes no within-individual correlation. This is not the ideal structure for repeated measures, but it is a convenient way to get a sense of the scale of the residuals and of the effect of fitting other structures. The simplest form allowing correlation is the compound symmetry (CS) structure. In this form the covariance between within-person errors is assumed to be the same no matter how far apart the two repeated measurements are, so there are only two distinct values in the covariance matrix. The most complex form is the general, or unstructured, form, in which the covariance between each pair of time points is allowed to differ. The covariance matrix then becomes very large and complicated for a large number of repeated measurements, no rule relating the correlation of errors to the time interval can be summarized, and fitting requires much more computation. This structure is therefore not adopted in this thesis, as it takes much more time than other structures given the number of measurements in DCCT/EDIC.

Apart from the previous three forms, the remaining structures can be divided into two types: time-series based correlations and spatial correlations. The nlme package provides three time-series based correlations: first-order autoregressive (AR(1)), autoregressive moving average with arbitrary orders for the autoregressive and moving average components (ARMA(p, q)), and continuous first-order autoregressive (CAR(1)). CAR(1) differs from the other two in that it allows a continuous time variable and can handle the precise time intervals, while the other two only take discrete time. If the discrete time variable is not specified in the model, the AR(1) and ARMA(p, q) structures by default use the order of the repeated measures within a subject as the time points, which gives unreliable results for unsorted data.
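To make the difference between these structures concrete, the following Python sketch builds VC, CS and CAR(1) covariance matrices for one subject. It is an illustration only: the function names and numeric values are assumptions, and a common variance across time points is assumed. It relies on the standard CAR(1) form in which the correlation between two errors is φ raised to the absolute time difference; AR(1) corresponds to the same form with integer visit positions as times.

```python
def vc_cov(n, sigma2):
    """Variance component: diagonal matrix, no within-subject correlation."""
    return [[sigma2 if i == j else 0.0 for j in range(n)] for i in range(n)]


def cs_cov(n, sigma2, rho):
    """Compound symmetry: one variance, one common covariance."""
    return [[sigma2 if i == j else sigma2 * rho for j in range(n)]
            for i in range(n)]


def car1_cov(times, sigma2, phi):
    """Continuous AR(1): correlation phi ** |t_i - t_j| for actual times."""
    return [[sigma2 * phi ** abs(s - t) for t in times] for s in times]


# Hypothetical subject with visits at years 0, 1 and 3 (one visit missed).
visit_years = [0.0, 1.0, 3.0]
sigma_cs = cs_cov(len(visit_years), sigma2=1.0, rho=0.5)
sigma_car1 = car1_cov(visit_years, sigma2=1.0, phi=0.5)
# AR(1) on visit order would treat the times as 0, 1, 2 instead.
sigma_ar1 = car1_cov([0, 1, 2], sigma2=1.0, phi=0.5)
```

In this example the covariance between the first and last visit is 0.125 under CAR(1) but 0.25 under AR(1) on visit order, which shows why the actual time spacing matters when visits are missed.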

For spatial correlations, there are five options in total: exponential, Gaussian, linear, rational quadratic and spherical. More than one spatial covariate can be specified for all these structures. To make the spatial distance more precise, the spatial covariate used in this dataset is time in months. With $d$ denoting the whole range, the correlation between two observations separated by a distance $r < d$ (here, the difference in months) takes a different form for each structure. Table 2-1 summarizes the correlations.

Table 2-1 Spatial correlation structures in nlme.

| Spatial correlation structure | Correlation (r < d) |
|---|---|
| Exponential | exp(−r/d) |
| Gaussian | exp(−(r/d)²) |
| Linear | 1 − (r/d) |
| Rational Quadratic | 1/(1 + (r/d)²) |
| Spherical | 1 − 1.5(r/d) + 0.5(r/d)³ |

d: whole range of longitudinal data in months.

r: time difference between two measurements in months.
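The formulas in Table 2-1 can be evaluated directly, as in the Python sketch below. The function name is hypothetical. The table only defines the correlations for r < d; setting the linear and spherical correlations to zero for r ≥ d is an assumption made here so that the function never returns a negative value.

```python
import math


def spatial_corr(kind, r, d):
    """Correlation between two errors r months apart, with range d (Table 2-1)."""
    u = r / d
    if kind == "exponential":
        return math.exp(-u)
    if kind == "gaussian":
        return math.exp(-u ** 2)
    if kind == "rational_quadratic":
        return 1.0 / (1.0 + u ** 2)
    if kind == "linear":
        # Assumption: zero correlation once the distance reaches the range.
        return 1.0 - u if u < 1.0 else 0.0
    if kind == "spherical":
        # Assumption: zero correlation once the distance reaches the range.
        return 1.0 - 1.5 * u + 0.5 * u ** 3 if u < 1.0 else 0.0
    raise ValueError("unknown spatial correlation structure: " + kind)


# Hypothetical example: two measurements 12 months apart, range 24 months.
kinds = ["exponential", "gaussian", "linear", "rational_quadratic", "spherical"]
corr_at_12_months = {k: spatial_corr(k, 12.0, 24.0) for k in kinds}
```

All five structures give a correlation of 1 at r = 0 and decrease as r grows; they differ in how quickly the correlation decays.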

In the DCCT/EDIC dataset, time in months is used for the CAR(1) and spatial correlation error structures. For the remaining error structures, time in years is used. The choice of covariance structure can produce different results for traits with different patterns. All these candidate structures are compared in (Eq. 2.1) to see which gives the best-fitting model, defined here as the model with the smallest Akaike Information Criterion (AIC).

The best covariance structure might differ between DCCT and EDIC. However, a model in which the errors are generated from different covariance matrices before and after DCCT closeout would be too complex; therefore all available data are used to select the error covariance structure producing the best-fitting model.
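The selection rule can be sketched in a few lines of Python, using the usual definition AIC = 2k − 2 log L, where k is the number of estimated parameters and L the maximized likelihood. The log-likelihoods and parameter counts below are made-up placeholders, not results from the DCCT/EDIC data, and the function names are hypothetical.

```python
def aic(log_lik, n_params):
    """Akaike Information Criterion: smaller values indicate a better fit."""
    return 2 * n_params - 2 * log_lik


def best_by_aic(fits):
    """fits maps a structure name to (log-likelihood, number of parameters)."""
    return min(fits, key=lambda name: aic(*fits[name]))


# Placeholder values for illustration only.
candidate_fits = {
    "VC": (-1200.0, 4),
    "CS": (-1150.0, 5),
    "CAR1": (-1149.5, 5),
}
selected_structure = best_by_aic(candidate_fits)
```

The comparison is meaningful only when all candidate models share the same fixed effects and are fit to the same data, as is the case when only the error structure in (Eq. 2.1) is varied.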

2.2.3 Genetic Data

Genetic data were obtained by SNP array genotyping of blood DNA samples collected in DCCT. Genotyping was performed using Illumina 1M BeadArrays (San Diego, CA, USA) (Roshandel et al., 2018). Ungenotyped autosomal SNPs were imputed using 1000 Genomes data (phase 3 v5) (1000 Genomes Project Consortium, 2015). Genotype dosage data from Illumina 1M BeadArrays were used to analyze logAER and eGFR separately, with approximately 8M SNPs imputed with high imputation quality (INFO>0.8). A total of 8,979,131 SNPs with MAF>1% were subsequently analyzed using genotype probabilities with additive coding of genotype (Paterson et al., 2010).

Once the LMM without SNP terms has been determined, genetic data are added to the model, giving LMM1. This model contains only a cross-sectional SNP effect, i.e. the population-average effect without the influence of an interaction term. This SNP effect is assumed to be the same at all time points; in other words, it is the average effect over time. The LMM with only a cross-sectional SNP effect has the form:

$$\mathrm{logAER}_{ij} = \beta_0 + \beta_1\,\mathrm{snp}_i + \beta_3\,\mathrm{time}_{ij} + \beta_4\,\mathrm{cohort}_i + \beta_5\,\mathrm{gender}_i + \beta_6\,\mathrm{treatment}_i + \beta_7\,\mathrm{cohort}_i \times \mathrm{treatment}_i + b_{0i} + b_{1i}\,\mathrm{time}_{ij} + e_{ij},$$

$$(b_{0i}, b_{1i}) \sim N(0, D), \quad e_{ij} \sim N(0, \Sigma_i^*). \tag{2.2}$$

Then, genetic data are incorporated as both a cross-sectional fixed effect and a longitudinal fixed effect on the outcome, giving LMM2. Unlike in LMM1, the cross-sectional SNP effect is now interpreted as the effect at time 0, rather than an average effect over time, if a longitudinal effect exists. The complete LMM with both SNP effects is as follows:

$$\mathrm{logAER}_{ij} = \beta_0 + \beta_1\,\mathrm{snp}_i + \beta_2\,\mathrm{snp}_i \times \mathrm{time}_{ij} + \beta_3\,\mathrm{time}_{ij} + \beta_4\,\mathrm{cohort}_i + \beta_5\,\mathrm{gender}_i + \beta_6\,\mathrm{treatment}_i + \beta_7\,\mathrm{cohort}_i \times \mathrm{treatment}_i + b_{0i} + b_{1i}\,\mathrm{time}_{ij} + e_{ij},$$

$$(b_{0i}, b_{1i}) \sim N(0, D), \quad e_{ij} \sim N(0, \Sigma_i^*), \tag{2.3}$$

where $\Sigma_i^*$ is the selected covariance structure for logAER in the DCCT/EDIC data. Similarly, another $\Sigma_i^*$ is selected independently for the outcome eGFR.

To test the two SNP effects separately, the null hypotheses for the LMMs in (Eq. 2.2) and (Eq. 2.3) are:

$$H_0^{(1)}: \beta_1^{LMM1} = 0 \ \text{ for } \mathbf{LMM1};$$

$$H_0^{(1)}: \beta_1^{LMM2} = 0 \ \text{ for } \mathbf{LMM2};$$

$$H_0^{(2)}: \beta_2 = 0.$$

For a 2df test on both SNP effects, the null hypothesis is:

$$H_0: \beta_1^{LMM2} = \beta_2 = 0.$$

All fast methods can provide results on $H_0^{(2)}$, while only GALLOP can test $H_0^{(1)}$ as LMM does. As mentioned, the interpretation of the two tests of $H_0^{(1)}$ differs: for LMM1, the cross-sectional effect is the average effect over time, while for LMM2 it represents only the effect at time 0. GALLOP can also provide inference for the joint test with an additional implementation on top of the original code.
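As an illustration of the 2df joint test, the sketch below computes a Wald statistic for the pair (β1, β2) from their estimates and estimated covariance matrix. It is a generic textbook Wald test under stated assumptions, not the GALLOP or nlme implementation, and the function name and input values are hypothetical. Under the joint null the statistic is referred to a chi-square distribution with 2 degrees of freedom, whose upper tail probability is exp(−W/2).

```python
import math


def wald_2df(beta1, beta2, var1, var2, cov12):
    """Wald statistic W = b' V^{-1} b for b = (beta1, beta2) and its p-value."""
    det = var1 * var2 - cov12 ** 2
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Closed-form quadratic form using the inverse of the 2x2 matrix V.
    w = (beta1 ** 2 * var2 - 2 * beta1 * beta2 * cov12 + beta2 ** 2 * var1) / det
    p_value = math.exp(-w / 2.0)  # chi-square survival function with 2 df
    return w, p_value


# Hypothetical estimates: both effects are 4 standard errors from zero and
# the two estimates are taken to be uncorrelated.
w_stat, p_joint = wald_2df(0.08, 0.016, 0.02 ** 2, 0.004 ** 2, 0.0)
```

When the two estimates are uncorrelated, W is simply the sum of the two squared z statistics, so the joint test pools evidence from the cross-sectional and longitudinal effects.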

2.3 Weighted Slope as Outcome (WSAO)

In addition to applying all the fast methods, this thesis proposes WSAO, a simple modification of the SAO method. The motivation is that some people have fewer observations, by study design or through missingness, which makes the data unbalanced. The first step is the same as in SAO: a per-individual slope is extracted as the summary statistic, and longitudinal covariates can be adjusted for in this step. In the second step, the slopes are regressed linearly on the SNP, with each individual weighted by their number of visits. This modification is easy to implement and essentially puts more weight on people with more visits when the data are highly unbalanced.
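The two steps can be sketched in plain Python as follows. This is a minimal illustration under simplifying assumptions (no covariates in the first step, ordinary least squares per subject, only the point estimate returned), not the code used in the thesis; all names and the toy data are hypothetical.

```python
def ols_slope(times, values):
    """Step 1: least-squares slope of one subject's outcome on time."""
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, values))
    return sxy / sxx


def wsao_estimate(subjects):
    """Step 2: weighted regression of slopes on SNP, weights = number of visits.

    subjects: list of (snp, times, values) tuples, one per individual.
    Returns the weighted least-squares estimate of the longitudinal SNP effect.
    """
    # Subjects with fewer than two visits have no slope and are skipped.
    rows = [(snp, ols_slope(t, y), len(t)) for snp, t, y in subjects if len(t) >= 2]
    w_sum = sum(w for _, _, w in rows)
    g_bar = sum(w * g for g, _, w in rows) / w_sum
    s_bar = sum(w * s for _, s, w in rows) / w_sum
    num = sum(w * (g - g_bar) * (s - s_bar) for g, s, w in rows)
    den = sum(w * (g - g_bar) ** 2 for g, _, w in rows)
    return num / den


# Toy data: slopes are exactly 0.10 + 0.02 * snp, with unequal visit counts.
toy_subjects = [
    (0, [0, 1, 2], [1.0, 1.1, 1.2]),
    (1, [0, 1, 2, 3], [2.0, 2.12, 2.24, 2.36]),
    (2, [0, 2], [0.0, 0.28]),
]
toy_effect = wsao_estimate(toy_subjects)
```

Setting all weights to 1 in the second step recovers the unweighted SAO estimate, so the two variants differ only when visit counts vary across subjects.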

2.4 Simulation Study

We compare the methods in different scenarios to better understand their properties and performance. A factorial simulation design was used. We set k=7 equally spaced time points for each individual, assuming that the data are complete. Each scenario is replicated 5000 times to obtain stable statistics.

The simulation study is based on the association between simulated SNPs and the pattern of logAER from DCCT. No covariates are included in the simulation models, because the types of covariates allowed differ between fast methods and we want to ensure that differences in the methods' performance come only from the simulation settings. The main procedure is first to simulate SNPs and the response variable logAER according to the assumed true model (Eq. 2.4), and then to analyze the data with the different methods.

$$\mathrm{logAER}_{ij} = \beta_0 + \beta_1\,\mathrm{snp}_i + \beta_2\,\mathrm{snp}_i \times \mathrm{time}_{ij} + \beta_3\,\mathrm{time}_{ij} + b_{0i} + b_{1i}\,\mathrm{time}_{ij} + e_{ij},$$

$$(b_{0i}, b_{1i}) \sim N(0, D), \quad e_{ij} \sim N(0, \Sigma_i^*), \tag{2.4}$$

where $\Sigma_i^*$ is the structure selected for logAER in the DCCT data.

A flow chart of the simulation study is as follows:

Figure 2-2 Flow chart of simulation study.

2.4.1 Experimental Designs

There are 6 factors with different levels to be adjusted in this simulation study with 7 visits for

each subject and 5000 replicates for each scenario. Here are the factors:

1) Minor allele frequency (MAF): 0.01, 0.05, 0.1, 0.3, 0.5;

2) Sample size (N): 500, 1000, 1500, 2000, 3000;

3) Cross-sectional SNP effect (β1): 0, 0.08;

4) Longitudinal SNP effect (β2): 0, 0.016;

5) Missing pattern for the response variable: Complete, MCAR, MAR, MNAR, dropout;

6) Within subject error correlation (error variance Σ𝑖).

1) – 4): Minor allele frequency, sample size and SNP effects have different values to be tested.

When sample size is varying, MAF is set as 0.3; when MAF is varying the sample size is set to

In addition, we refer to the Reference scenario when no modifications regarding missingness or error correlation structure are applied to the simulated data. In other words, the reference scenario has no missing data and is generated under the selected structure $\Sigma_i^*$ (medium/reference error correlation) for logAER in the DCCT data.

5): There are many ways to simulate these missingness mechanisms. Here some easy-to-construct approaches are used to simulate the three missing mechanisms and dropout and to demonstrate their effect on the different methods. All missing mechanisms yield a missing rate of ~40% of the records in the long-format simulated dataset. In addition, the first observation for every subject is treated as the baseline observation and is always observed.

(1) MCAR: A missing probability of 47% is applied to all outcome values except the first measurement of each subject, which gives the same overall missing rate of about 40% as the other two missing scenarios (0.47 × 6/7 ≈ 0.40 with 7 visits per subject).

(2) MAR: A missing at random scenario is usually defined through covariates, which are not included in this simulation study. Instead, we assume that the probability of a missing outcome depends on the baseline value of the outcome. The missing probability $P_{ij}$ for subject $i$ and measurement $j$ is:

$$\log\left(\frac{P_{ij}}{1 - P_{ij}}\right) = -3.75 + 1.5 \times \mathrm{logAER}_{i,1}, \quad i = 1, \cdots, N,\ j \neq 1 \tag{2.5}$$

(3) MNAR: Assuming that the probability of a missing outcome depends on the current value of the outcome, the missing probability $P_{ij}$ for subject $i$ and measurement $j$ is:

$$\log\left(\frac{P_{ij}}{1 - P_{ij}}\right) = -3.75 + 1.45 \times \mathrm{logAER}_{i,j}, \quad i = 1, \cdots, N,\ j \neq 1 \tag{2.6}$$

(4) Dropout: Censored or dropout data usually occur in longitudinal studies when participants miss one visit and all later ones. The first missing visit can arise from any of the previous three missing mechanisms; here we adopt a design in which, once an observation exceeds a certain threshold, all later observations are censored. For the outcome logAER, the dropout threshold is set to the 70th percentile of logAER values in the simulated dataset. The missing probability $P_{ij}$ for subject $i$ and measurement $j$ is:

$$P_{ij} = \begin{cases} 1, & \text{if } \mathrm{logAER}_{i,j-1} > \mathrm{logAER}_{Q_{0.7}} \text{ or } P_{i,j-1} = 1 \\ 0, & \text{otherwise} \end{cases} \tag{2.7}$$

6): For the within-subject error covariance structure, the default is the structure selected for logAER in the DCCT data. In addition, error structures with no, medium/reference and strong correlation are used to generate the data.
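The four missingness mechanisms described in 5) can be sketched as a single Python function that returns a missingness indicator for each visit of one subject. The coefficients come from the text; the function names, the random number handling and the toy values are assumptions for illustration only.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def missing_mask(y, mechanism, rng, threshold=None):
    """Return a list of booleans (True = missing) for one subject's outcomes y.

    y is in visit order; the first (baseline) value is always observed.
    threshold is the dropout cut-off (70th percentile of simulated logAER).
    """
    mask = [False]
    for j in range(1, len(y)):
        if mechanism == "MCAR":
            p = 0.47
        elif mechanism == "MAR":       # depends on the baseline value
            p = logistic(-3.75 + 1.5 * y[0])
        elif mechanism == "MNAR":      # depends on the current value
            p = logistic(-3.75 + 1.45 * y[j])
        elif mechanism == "dropout":   # censored once the threshold is passed
            p = 1.0 if (mask[j - 1] or y[j - 1] > threshold) else 0.0
        else:
            raise ValueError("unknown mechanism: " + mechanism)
        mask.append(rng.random() < p)
    return mask


rng = random.Random(0)
# Toy subject: the third value exceeds the threshold, so visits 4 and 5 drop.
dropout_example = missing_mask([1.0, 2.0, 5.0, 1.0, 1.0], "dropout", rng, threshold=3.0)

# Check of the overall MCAR rate with 7 visits: 0.47 * 6 / 7 is about 0.40.
masks = [missing_mask([0.0] * 7, "MCAR", rng) for _ in range(2000)]
mcar_rate = sum(sum(m) for m in masks) / (2000 * 7)
```

In the dropout case the mask is deterministic given the outcome values, whereas the other three mechanisms draw each visit's missingness at random with the stated probability.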

2.4.2 Simulation of SNPs

SNP genotypes are simulated under Hardy-Weinberg equilibrium. With MAF = $p$, the minor allele count follows the distribution:

$$\Pr(\mathrm{SNP} = 0) = (1 - p)^2, \quad \Pr(\mathrm{SNP} = 1) = 2p(1 - p), \quad \Pr(\mathrm{SNP} = 2) = p^2$$
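One simple way to draw genotypes with these probabilities is to sum two independent allele draws, each carrying the minor allele with probability p, which yields a Binomial(2, p) count. The Python sketch below does this; the function name and seed are illustrative assumptions.

```python
import random


def simulate_snps(n, maf, rng):
    """Draw n genotypes (minor allele counts 0, 1 or 2) under HWE."""
    # Each genotype is the sum of two independent Bernoulli(maf) alleles.
    return [int(rng.random() < maf) + int(rng.random() < maf) for _ in range(n)]


rng = random.Random(1)
genotypes = simulate_snps(20000, 0.3, rng)
observed_maf = sum(genotypes) / (2 * len(genotypes))
```

The observed minor allele frequency in a simulated sample fluctuates around the target MAF, with smaller fluctuations for larger sample sizes.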

2.4.3 Simulation of Outcome Trait

It is assumed that adding SNP terms does not change the fixed and random intercept or slope effects by much; therefore the coefficients $\beta_0$ and $\beta_3$ are taken directly from the model without SNP terms (Eq. 2.1) fitted to the observed DCCT data. The covariance matrix $D$ of the random effects is also assumed to be the same as that estimated from the model without SNP terms.

$$\mathrm{logAER}_{ij} = \beta_0 + \beta_3\,\mathrm{time}_{ij} + b_{0i} + b_{1i}\,\mathrm{time}_{ij} + e_{ij},$$

$$(b_{0i}, b_{1i}) \sim N(0, D), \quad e_{ij} \sim N(0, \Sigma_i^*), \tag{2.8}$$

After all necessary variables and parameters have been simulated, the outcome values $\mathrm{logAER}_{ij}$ are calculated by applying the assumed true model (Eq. 2.4). Alternatively, $\mathrm{logAER}_{ij}$ can be generated from the implied multivariate normal distribution as in (Eq. 1.2).
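The outcome simulation for one subject can be sketched as follows. The sketch makes two simplifying assumptions that should be kept in mind: the within-subject errors are drawn independently (the VC structure) rather than from the selected structure, and the values used for β0, β3, D and the error standard deviation are placeholders, not the DCCT estimates. Only β1 = 0.08 and β2 = 0.016 are taken from the simulation design.

```python
import math
import random


def simulate_subject(snp, times, beta, D, sigma_e, rng):
    """Simulate logAER for one subject from the model in (Eq. 2.4).

    beta = (beta0, beta1, beta2, beta3); D = (d00, d01, d11) holds the
    variances and covariance of the random intercept and slope.
    Assumption: independent errors with standard deviation sigma_e.
    """
    beta0, beta1, beta2, beta3 = beta
    d00, d01, d11 = D
    # Cholesky factor of the 2x2 matrix D, used to draw correlated effects.
    l00 = math.sqrt(d00)
    l10 = d01 / l00
    l11 = math.sqrt(d11 - l10 ** 2)
    z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    b0i = l00 * z0
    b1i = l10 * z0 + l11 * z1
    return [beta0 + beta1 * snp + beta2 * snp * t + beta3 * t
            + b0i + b1i * t + rng.gauss(0.0, sigma_e)
            for t in times]


rng = random.Random(2)
placeholder_beta = (2.0, 0.08, 0.016, 0.01)   # beta0 and beta3 are made up
placeholder_D = (0.25, 0.0, 0.0004)           # made-up random-effect variances
simulated = [simulate_subject(1, range(7), placeholder_beta, placeholder_D, 0.5, rng)
             for _ in range(4000)]
baseline_mean = sum(y[0] for y in simulated) / len(simulated)
```

For carriers of one minor allele, the expected baseline value is β0 + β1, so the simulated baseline mean should be close to 2.08 with these placeholder values.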

2.5 Methods for DCCT/EDIC Data Analysis

Unlike the simulation study, where covariates are ignored, the full LMMs for the real data need covariates to adjust the relationship between outcomes and SNP effects. However, the fast methods differ in which covariates they can accommodate. We therefore specify here the covariates used for the LMMs and for each fast method on the real data.

As described in Section 2.2.1 (Phenotypic Covariate Selection), the selected covariates for the full LMM on logAER and eGFR are sex, treatment and cohort, with an interaction effect of treatment and cohort. All covariates are time-static characteristics. The full model with both SNP effects is (Eq. 2.3).

SAO allows only longitudinal covariates to be added in the first step, so no covariates are added in the real data analysis. However, given the unbalanced data, weighted SAO with the number of visits as weights is applied alongside the original SAO. The TS method fits an LMM without SNP effects in the first step, so it can include all additional covariates to adjust the relationship between outcome and time, and its first step has the same form as (Eq. 2.1). The CTS method uses only longitudinal data in the data transformation of its first step; therefore, as with SAO, no covariates can be added in the real data analysis. Finally, the GALLOP method can take all types of covariates into its design matrix and thus allows all of the covariates, cross-sectional or longitudinal, in the real data analysis. In addition, TS is the only fast method that can use the same error correlation structure as LMM; all other fast methods assume an independent within-subject error structure.

When applying the fast methods to the real DCCT/EDIC data for the outcomes logAER and eGFR, we first set a loose significance level of p < 10⁻⁵. Any SNPs significant at this level for single or joint effects are extracted and fit in LMM for final testing. Unlike in the simulation study, where the exact power and T1E of a method are known, here we obtain a direct comparison by defining the efficiency of a fast method, loosely or stringently, as:

$$E_l = \frac{\#\text{SNPs with } P^* < 5 \times 10^{-8} \text{ by LMM}}{\#\text{SNPs with } P < 10^{-5} \text{ by fast method}} \tag{2.9}$$

$$E_s = \frac{\#\text{SNPs with } P^* < 5 \times 10^{-8} \text{ by LMM}}{\#\text{SNPs with } P < 5 \times 10^{-8} \text{ by fast method}} \tag{2.10}$$

$P^*$ can be any P value for a single SNP effect or the joint effect under the significance level detected by LMM.
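The two efficiency measures can be computed as in the Python sketch below, which follows the ratios in (Eq. 2.9) and (Eq. 2.10) literally. The function name, SNP labels and P values are hypothetical, and it is assumed that the LMM results are available only for the SNPs that passed the loose threshold under the fast method.

```python
def efficiency(fast_p, lmm_p, fast_threshold, lmm_threshold=5e-8):
    """Ratio of LMM-significant SNPs to SNPs selected by the fast method.

    fast_p: P values from the fast method for all SNPs.
    lmm_p: P values from LMM for the SNPs carried forward (assumption).
    Returns None when the fast method selects no SNPs.
    """
    n_fast = sum(1 for p in fast_p.values() if p < fast_threshold)
    if n_fast == 0:
        return None
    n_lmm = sum(1 for p in lmm_p.values() if p < lmm_threshold)
    return n_lmm / n_fast


# Made-up P values for four hypothetical SNPs.
fast_results = {"snp_a": 1e-9, "snp_b": 2e-6, "snp_c": 3e-7, "snp_d": 0.2}
lmm_results = {"snp_a": 1e-8, "snp_b": 1e-4, "snp_c": 1e-6}

loose_efficiency = efficiency(fast_results, lmm_results, fast_threshold=1e-5)
stringent_efficiency = efficiency(fast_results, lmm_results, fast_threshold=5e-8)
```

Because the numerator is the same in both measures while the stringent denominator is never larger than the loose one, the stringent efficiency is always at least as large as the loose efficiency.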

In addition, applying CTS to the real DCCT/EDIC eGFR data is potentially problematic because the number of eGFR values per subject can exceed 22, as the measurement was taken every year in DCCT/EDIC, and CTS then fails to find the orthogonal polynomial required to transform the data. Therefore only eGFR measures from every alternate year are used for each subject in the CTS method.

Results

3.1 Data Description

A general description of the data is provided in Table 3-1 which contains summary statistics for

the number of visits, covariates and outcomes in DCCT/EDIC.

Table 3-1 Descriptive table for DCCT/EDIC.

Characteristics DCCT EDIC DCCT/EDIC

Calendar year 1983-1993 1994-2011 1983-2011

Total sample count 1441 1401 1441

Number of visits (range) (1 - 11) (1- 18) (1 - 29)

mean 8.2 16.2 24

median 8 18 25

Time-static % N % N % N

Gender (Female) 47.2% 680 47.8% 669 47.2% 680

Cohort (Primary) 50.4% 726 50.3% 705 50.4% 726

Treatment (Conventional) 50.7% 730 50.2% 703 50.7% 730

mean sd mean sd mean sd

Age at DCCT baseline (years) 26.9 7.1 26.9 7.1 26.9 7.1

Duration of T1D at DCCT baseline (years) 5.6 4.2 5.6 4.1 5.6 4.2

Time-changing mean sd mean sd mean sd

Mean logAER(mg/24h) 2.5 0.8 2.7 1.2 2.6 0.8

Mean eGFR(mL/min/1.73𝑚2) 120.0 10.7 103.0 13.4 109.0 11.4

%: proportion.

N: count.

sd: standard deviation.

The baseline time-static covariates have no missing data; the missing rates for the repeated measurements are presented in Table 3-2. As mentioned in Section 2.1.3.1 (Albuminuria), AER was measured every other year in EDIC, and the missing rate in EDIC is calculated with this study design taken into account. The missing rate is generally higher in EDIC than in DCCT, which could be caused by a higher dropout rate in the later study. In a simple linear mixed model with logAER as the response, eGFR as the independent variable and observations grouped by subject, the marginal effect of eGFR on logAER is significant and negative (β=−0.013, P<0.001).

Table 3-2 Missing rate for repeated measurements. Missing rate of logAER in EDIC and DCCT/EDIC is

calculated by combining alternate years.

Missing Rate DCCT EDIC DCCT/EDIC

logAER 1.71% 10.32% 7.43%

eGFR 1.60% 10.43% 7.49%

3.1.1 Number of visits

Randomization in DCCT took place from 1983 to 1989; the number of participants entering the clinical trial in each year is shown in Figure 3-1. The analysis was based on study time, defined as the number of years since a participant's DCCT baseline, rather than on calendar time.

Figure 3-1 Number of DCCT participants by randomization year.

Due to the staggered entry into DCCT, the total number of regular visits in DCCT (Figure 3-2A) peaks at 7 visits, corresponding to the peak randomization year, 1987, in Figure 3-1. The number of visits in EDIC (Figure 3-2B) is relatively stable, given that all continuing participants should have started in the same year. To show the proportion of missing participants across study years, barplots of the expected and actual numbers of participants (Figure 3-3) and of the missing rate (Figure 3-4) in every DCCT/EDIC year are provided. The proportion of missing participants generally increased with DCCT year, except for year 9, and was 1.8% at the closeout visit. The proportion of missing participants in EDIC also rose over time, from 5.4% to 14.8%.

Figure 3-2 Number of visit counts per subject in DCCT/EDIC years.

Figure 3-3 Expected and actual numbers of subjects in each DCCT/EDIC year including DCCT baseline

and close out visits. The counts by solid bars represent actual numbers which are no larger than expected

numbers in the same year. The transparent part of the bar represents the difference between expected and

actual counts in that year.

Figure 3-4 Proportion of missing subjects in each DCCT/EDIC year including DCCT baseline and close

out visits. This proportion is calculated as the proportion of transparent bar size out of total bar size in the

same year in Figure 3-3.

3.1.2 Distribution of logAER

We first present a barplot of the number of logAER measurements per subject in Figure 3-5. The number of measurements per subject is also used as the weight in the WSAO method. The barplot peaks at around 15 measurements, and both the mean and the median of the counts are 15.

Figure 3-5 Barplot of numbers of logAER measurements per subject in DCCT/EDIC study.

To present the distribution of logAER in each DCCT year, violin plots overlaid with boxplots are shown in Figure 3-6A. After baseline, the boxplots generally show a right-skewed distribution, with a larger gap between the median and the third quartile (Q3) than between the median and the first quartile (Q1). This is due to outliers, most of which have large logAER values and produce longer right tails in the violin plots. The violin plots flatten gradually over the years, indicating a wider spread of logAER values in later DCCT years, when fewer participants were seen because of the staggered entry (Figure 3-6B). Very few logAER values are missing in the DCCT years. The two treatment groups are compared in Figure 3-6C using violin plots and boxplots of their distributions in each year.

Similar plots are shown for the EDIC years in Figure 3-7. The study was designed to measure logAER every other year, with around half of the participants (n=693) measured in odd EDIC years and the others (n=704) in even years. The first EDIC logAER measurement differs significantly between the odd-year and even-year groups (Wilcoxon rank-sum test, p=0.003), with logAER higher in the even-year group. Among the 1397 individuals who are in EDIC and have logAER measurements, 14.46% deviated from their assigned odd- or even-year schedule at least once. The violin plots and boxplots are therefore generated from about half of the total sample in each year. The missing rate for logAER shown in Figure 3-7B increases monotonically over time, from 5.64% for years 1 and 2 to 15.99% for years 17 and 18.

Figure 3-8 shows the individual logAER trajectories of selected subjects, separately for DCCT and EDIC, grouped by their individual variance over time within each period. Subjects plotted in each grid are not necessarily the same ones. Subjects with the smallest variance (Figure 3-8B and E) usually spent less time in the study, either because they started late or because they did not participate after some time point. The subjects with the largest variance (Figure 3-8C and F) showed the same increasing trend in logAER over the study years, with red lines mostly above green ones. Most grid plots showed an equal mixture of subjects from the DCCT treatment groups, except Figure 3-8C, in which most subjects are from the conventional group.

Figure 3-6 Distribution of logAER in DCCT years. A: Violin plot combined with boxplot on distribution

of logAER in DCCT years. B: Expected and actual numbers of participants in DCCT years. Solid bars

represent actual numbers and transparent part of the bar represents the difference between expected and

actual counts in that year. C: Violin plot combined with boxplot on distribution of logAER by treatment

groups in DCCT years. Outliers are identified by 1.5×IQR rule.

Figure 3-7 Distribution of logAER in EDIC years. A: Violin plot combined with boxplot on distribution of

logAER in EDIC years. B: Expected and actual numbers of participants calculated for every two years in

EDIC years due to the study design. Solid bars represent actual numbers and transparent part of the bar

represents the difference between expected and actual counts in that year. C: Violin plot combined with

boxplot on distribution of logAER by original treatment groups in EDIC years. Outliers are identified by

1.5×IQR rule.

Figure 3-8 Spaghetti plots of logAER from 20 subjects from DCCT/EDIC in each grid. A and D were

plotted for 20 random subjects selected from DCCT/EDIC. B and E were plotted for 20 subjects with least

variance over DCCT/EDIC. C and F were plotted for 20 subjects with largest variance over DCCT/EDIC.

DCCT treatment groups were denoted by different line colors for these selected subjects.

3.1.3 Distribution of eGFR

Again, we first present a barplot of the number of eGFR measurements per subject in Figure 3-9. The number of measurements per subject is also used as the weight in the WSAO method. The barplot peaks at around 24 measurements; the mean is 23 and the median is 24.

Figure 3-9 Barplot of numbers of eGFR measurements per subject in DCCT/EDIC study.

Similar descriptive figures are shown for the eGFR outcome in Figure 3-10 (DCCT years) and Figure 3-11 (EDIC years). In Figure 3-10A and Figure 3-11A, eGFR values are less skewed than logAER values, as the boxplot whiskers are of similar length. In the EDIC years, however, the distribution of eGFR is the opposite of that of logAER: it is left skewed in each year and decreases over time. This might be caused by the contrasting trends or ranges of logAER and eGFR values. Figure 3-10B and Figure 3-11B show very few missing values in DCCT and an increasing missing rate in EDIC; in Figure 3-11B the missing rate increases almost monotonically from 5.35% to 16.27%.

Figure 3-12 shows the individual eGFR trajectories of selected subjects, separately for DCCT and EDIC, grouped by their individual variance over time within each period. Subjects plotted in each grid are not necessarily the same ones. As with logAER, subjects with the smallest variance usually spent less time in the study, because they started late or did not return after some time point. The subjects with the largest variance showed a decreasing trend in eGFR over time. All grid plots showed an equal mixture of subjects from the DCCT treatment groups.

Figure 3-10 Distribution of eGFR in DCCT years. A: Violin plot combined with boxplot on distribution of

eGFR in DCCT years. B: Expected and actual numbers of participants in DCCT years. Solid bars represent

actual numbers and transparent part of the bar represents the difference between expected and actual counts

in that year. C: Violin plot combined with boxplot on distribution of eGFR by treatment groups in DCCT

years. Outliers are identified by 1.5×IQR rule.

Figure 3-11 Distribution of eGFR in EDIC years. A: Violin plot combined with boxplot on distribution of

eGFR in EDIC years. B: Expected and actual numbers of participants in EDIC years. Solid bars represent

actual numbers and transparent part of the bar represents the difference between expected and actual counts

in that year. C: Violin plot combined with boxplot on distribution of eGFR by original treatment groups in

EDIC years. Outliers are identified by 1.5×IQR rule.

Figure 3-12 Spaghetti plots of eGFR from 20 subjects from DCCT/EDIC in each grid. A and D were plotted

for 20 random subjects selected from DCCT/EDIC. B and E were plotted for 20 subjects with least variance

over DCCT/EDIC. C and F were plotted for 20 subjects with largest variance over DCCT/EDIC. DCCT

treatment groups were denoted by different line colors for these selected subjects.

3.2 Simulation Study Results

3.2.1 Set up

When the correlation structure for the logAER model was selected using the DCCT data, ARMA(1, 1) was the structure 𝛴𝑖∗ that produced the model fit with the lowest AIC. For subject 𝑖 with 𝑛𝑖 consecutive visits, the covariance matrix of the error term has the form:

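$$
e_{ij} \sim \Sigma_i^* = \sigma^2
\begin{bmatrix}
1 & \gamma & \gamma\rho & \cdots & \gamma\rho^{n_i} \\
\gamma & 1 & \gamma & \cdots & \gamma\rho^{n_i-1} \\
\gamma\rho & \gamma & 1 & \cdots & \gamma\rho^{n_i-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\gamma\rho^{n_i} & \gamma\rho^{n_i-1} & \gamma\rho^{n_i-2} & \cdots & 1
\end{bmatrix},
\quad \sigma = 0.4220,\ \gamma = 0.2944,\ \rho = 0.7337.
$$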

Because of this choice of error structure, we use only time in years as the time variable and do not use time in months.

Based on this error covariance structure, in addition to the default structure with 𝛾 = 0.2944, we define two further settings for experimental design 6): an independent error correlation structure with 𝛾 = 0 and a stronger error correlation structure with 𝛾 = 0.87. We refer to scenarios with these two settings as the Independent (correlation) scenario and the Strong (correlation) scenario.
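To make the three error correlation settings concrete, the following is a minimal sketch that builds a small within-subject correlation matrix for each value of 𝛾. It assumes the standard ARMA(1,1) form in which the correlation at lag k is 𝛾𝜌^(k−1), which matches the leading entries (𝛾, 𝛾𝜌) of the matrix given in the text. It also assumes that 𝜌 stays at 0.7337 in all three scenarios and uses 4 visits purely for display. The function name `arma11_correlation` is hypothetical.

```python
def arma11_correlation(n_visits, gamma, rho):
    """Correlation matrix with 1 on the diagonal and gamma * rho**(lag - 1) elsewhere."""
    return [
        [1.0 if j == k else gamma * rho ** (abs(j - k) - 1) for k in range(n_visits)]
        for j in range(n_visits)
    ]


RHO = 0.7337  # assumed to be unchanged across scenarios
scenarios = {"Independent": 0.0, "Reference": 0.2944, "Strong": 0.87}

# One 4 x 4 matrix per scenario; multiply by sigma**2 to obtain a covariance matrix.
matrices = {name: arma11_correlation(4, gamma, RHO) for name, gamma in scenarios.items()}

for name, matrix in matrices.items():
    print(name)
    for row in matrix:
        print("  " + "  ".join(f"{value:.4f}" for value in row))
```

With 𝛾 = 0 every off-diagonal entry vanishes, which is why that setting corresponds to an independent within-subject error structure.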

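The covariance matrix of the bivariate normal distribution of the random intercept and slope is:

$$
(\text{random intercept},\ \text{random slope}) \sim D =
\begin{bmatrix}
0.1634 & 0.0217 \\
0.0217 & 0.0010
\end{bmatrix}
$$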

The parameter estimates of the standard LMM (Eq. 2.3) used for the simulation study on logAER are: intercept β0 = 2.4335 and slope of time β3 = 0.0122.

3.2.2 Type 1 Error

A well-controlled T1E is the basis for power calculation. T1E is calculated with both the cross-sectional and the longitudinal SNP effects set to 0, at a significance level of α = 0.05. As suggested by a previous simulation study on T1E (Zhang and Sun, 2019), we adopt a wider CI than the usual 95%: T1E is considered accurate if it falls within an approximate 99% CI based on 5000 replicates,

$$\alpha - 3\sqrt{\alpha(1-\alpha)/5000} \;\le\; \text{T1E} \;\le\; \alpha + 3\sqrt{\alpha(1-\alpha)/5000},$$

which here is (0.0408, 0.0592).
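The acceptance interval can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function names `t1e_bounds` and `is_well_controlled` are assumptions made for illustration.

```python
import math


def t1e_bounds(alpha, n_replicates, multiplier=3.0):
    """Interval alpha +/- multiplier * sqrt(alpha * (1 - alpha) / n_replicates)."""
    half_width = multiplier * math.sqrt(alpha * (1 - alpha) / n_replicates)
    return alpha - half_width, alpha + half_width


def is_well_controlled(empirical_t1e, alpha=0.05, n_replicates=5000):
    """True when an empirical type 1 error rate falls inside the interval."""
    lower, upper = t1e_bounds(alpha, n_replicates)
    return lower <= empirical_t1e <= upper


lower, upper = t1e_bounds(alpha=0.05, n_replicates=5000)
print(f"({lower:.4f}, {upper:.4f})")  # prints (0.0408, 0.0592)
```

An empirical rate is simply the fraction of the 5000 null replicates with a p-value below α, so a method rejecting in 200 of 5000 replicates (0.04) would fall just below the lower bound.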

Figure 3-13 shows T1E for all methods in the different scenarios, across the ranges of MAF and sample size. T1E is well controlled within the 99% CI in most scenarios for all methods.

The most notable violation occurs for WSAO in the dropout scenario, where T1E is significantly deflated across the ranges of MAF and sample size, making WSAO a very conservative method. The MAR and MNAR scenarios also produce a rather low T1E for WSAO at some MAF values and sample sizes, but the deflation is less severe. The difference between dropout and the other missingness mechanisms is that dropout produces more unbalanced sample sizes over time. An example is shown in Figure 3-14, which plots the sample size at each time unit (7 in total, with 1 as baseline) for one replicate with MAF=0.3, N=2000 and no cross-sectional or longitudinal SNP effect. As expected, missingness in the dropout scenario increases over time.

Other than that, the 2df test by LMM produces inflated T1E in reference and strong error

correlation scenarios, but no specific pattern is yet found for the inflation.

Figure 3-13 Type 1 error rates calculated in reference scenario, 4 missing scenarios and 2 within-subject

error correlation scenarios. T1E is plotted by MAF in upper plot with fixed N=2000; and by N in lower

plot at fixed MAF=0.3. Dashed lines represent criterion of accurate T1E which is (0.0408, 0.0592).

Figure 3-14 Example of sample size changing by time in missing scenarios. X-axis is the time point in

simulation study, there are 7 time units in total with assumed same distance, and 1 is baseline when all

individuals have complete data. The setting for this example is: MAF=0.3, N=2000, cross-sectional and

longitudinal SNP effects are 0 and unchanged error structure.

3.2.3 Power

Power results are presented in a grid similar to that for T1E, but as barplots, because the power curves of some methods overlap. Since there are two SNP effects, bars are plotted only for the methods that can provide power for the corresponding SNP effect. For the cross-sectional SNP effect, the barplots include power for GALLOP and LMM. For the longitudinal SNP effect, they include power for SAO, WSAO, TS, CTS, GALLOP and LMM. In addition, GALLOP and LMM provide power for the 2df tests. To be more rigorous, plots for one SNP effect are stratified by the value of the other SNP effect, since one might affect inference on the other.

3.2.3.1 Analysis on Cross-sectional SNP Effect (β1)

For the cross-sectional SNP effect (Figure 3-15), there is not much difference among scenarios. When only the cross-sectional effect exists and β2 = 0, the cross-sectional effect is in fact the average SNP effect, assumed to be the same at all time points. Compared to the reference scenario, where the data are complete and the error correlation is medium/reference, all missing scenarios and the strong error correlation scenario cause a drop in power, while the independent error structure causes an increase in power. Among these scenarios, the largest difference between LMM and GALLOP appears in the strong error correlation scenario, because GALLOP cannot adjust for different error structures. The most powerful test in the strong error correlation scenario is the LMM cross-sectional SNP effect test, followed by the LMM joint test.

When β2 = 0.016, the cross-sectional effect is the SNP effect at baseline. The joint tests by LMM and GALLOP produce very similar power, which is much higher than that of the cross-sectional effect tests. Power of the 2df tests reaches 100% very quickly as MAF or sample size increases, so there is little difference in 2df test power across simulation scenarios.

Figure 3-15 Power of cross-sectional SNP effect calculated in 7 scenarios. Power is plotted by order of

MAF in upper plot with fixed N=2000; and by order of N in lower plot at fixed MAF=0.3. Each scenario

is stratified by longitudinal SNP effect b2=0 or b2=0.016 (right hand legend).

3.2.3.2 Analysis on Longitudinal SNP Effect (β2)

When the cross-sectional SNP effect β1 = 0, by looking at grid plots in Figure 3-16, the missing

scenarios all produce a drop in power, more severe in MNAR and dropout scenarios. The largest

power reached in these two scenarios is at least 12.5% lower than the largest power in other

scenarios.

When cross-sectional SNP effect β1 = 0.08, 2df tests by GALLOP and LMM again gain power

quickly with increasing MAF and sample size which makes them the most powerful tests across

all scenarios. Besides, TS method produces much higher power than the other longitudinal SNP

effect tests when both SNP effects are present in all scenarios. Except for these three tests, other

tests basically show a similar pattern as β1 = 0.

By comparing grid plots vertically, other than the increase in power for 2df tests and TS method,

other methods are not much affected by the value of the cross-sectional SNP effect in most

scenarios except for the MNAR and dropout scenarios. In the MNAR scenario, the longitudinal

SNP effect tests except TS have much lower power when the cross-sectional SNP effect is present,

with a difference of around 25% in highest power. In the dropout scenario, WSAO, SAO and CTS

gain much more power from the existence of a main SNP effect and become more powerful than

GALLOP and LMM for the longitudinal SNP effect tests.

Figure 3-16 Power calculated in 7 scenarios. Power is plotted by order of MAF in upper plot with fixed

N=2000; and by order of N in lower plot at fixed MAF=0.3. Each scenario is stratified by cross-sectional

SNP effect b1=0 or b1=0.08 (right hand legend).

3.2.4 Parameter Estimation

Parameter estimates are plotted with their 95% CIs to see whether the estimates from the different methods capture the true effect sizes. Across all scenarios, the larger the MAF and sample size, the less the estimates vary.

For the main SNP effect (Figure 3-17), GALLOP and LMM perform well in every scenario across the MAF and sample size ranges, yielding a mean estimate of β1 around 0.08 with a 95% CI covering this value, unaffected by β2.

Figure 3-17 Parameter estimates of cross-sectional SNP effect calculated in 7 scenarios. Estimates are

plotted as points with bars as 95% confidence intervals. Ranks are ordered by MAF in upper plot with fixed

N=2000; and by N in lower plot at fixed MAF=0.3. Results for each scenario are stratified by longitudinal

SNP effect b2=0 or b2=0.016 (right hand legend). Solid lines represent the true parameter value for cross-

sectional SNP effect b1=0.08. Dashed lines represent b1=0.

For the longitudinal SNP effect (Figure 3-18 and Figure 3-19), WSAO, SAO, GALLOP and LMM obtain an accurate estimate in most scenarios, with exceptions in the MNAR and dropout scenarios. Although these four methods give a slightly lower estimate in the MNAR scenario, and WSAO and SAO give a slightly higher estimate with larger variation in the dropout scenario, their 95% CIs still cover the true effect size of 0.016. However, the TS and CTS methods always produce a lower mean estimate with a relatively smaller SE than the other methods. These two methods tend to give estimates substantially lower than the true effect size when MAF or sample size is large in most scenarios.

Comparing vertically, between a cross-sectional effect of 0 and of 0.08, LMM, GALLOP and CTS are almost never affected by the cross-sectional SNP effect, as was the case for the cross-sectional SNP effect estimates. For WSAO and SAO, the cross-sectional effect changes the mean and CI of the estimate, but the CI essentially still covers the true longitudinal SNP effect size. Lastly, for TS, in most scenarios the presence of β1 tends to raise the mean estimate of β2 while leaving the length of the CI unchanged, which makes it more likely to capture the true effect size.

Figure 3-18 Parameter estimates of longitudinal SNP effect by MAF calculated in 7 scenarios. Estimates

are plotted as points with bars as 95% confidence intervals. Ranks are ordered by MAF with fixed N=2000.

Results for each scenario are stratified by cross-sectional SNP effect b1=0 or b1=0.08 (right hand legend).

Solid lines represent the true parameter value for longitudinal SNP effect b2=0.016. Dashed lines represent

Figure 3-19 Parameter estimates of longitudinal SNP effect by N calculated in 7 scenarios. Estimates are

plotted as points with bars as 95% confidence intervals. Ranks are ordered by N at fixed MAF=0.3. Results

for each scenario are stratified by cross-sectional SNP effect b1=0 or b1=0.08 (right hand legend). Solid

lines represent the true parameter value for longitudinal SNP effect b2=0.016. Dashed lines represent b1=0.

3.2.5 Speed

The last aspect to be compared is the computational efficiency of the methods. In the main part of the simulation study, which assesses accuracy, outcomes are re-generated in each replicate so that they are representative of the distribution in each scenario. The speed comparison, in contrast, has to be done on a dataset in which multiple SNPs are generated together with only one set of outcomes per subject. From the simulation timings presented in Table 3-3, it is possible to roughly estimate the time needed for the real data analysis. Limited by the speed of LMM, we ran the comparison on 1000 simulated SNPs with a MAF of 0.3, a sample size of 2000, no cross-sectional or longitudinal SNP effects and the medium/reference error structure. As a result, the fast methods take about 20 seconds on average to run the 1000 SNPs, while LMM needs more than 6 hours. Simulations were run on one node of the high-performance computing server, which includes 144 nodes consisting of 2x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz with 192Gb RAM for general use. The total time was added up for each method rather than accounting for parallel computation, because different nodes or servers might differ in their capacity for parallel running. The resulting running times are compared as follows:

Table 3-3 Time comparison in simulation study for 1000 SNPs under null with MAF=0.3, N=2000, 0 cross-

sectional or longitudinal SNP effect and unchanged error structure.

LMM SAO(WSAO) TS CTS GALLOP

Time 6.23h 18.92s 31.88s 11.67s 17.04s
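As a rough illustration of how the timings in Table 3-3 could be scaled up to a full GWAS, the sketch below extrapolates linearly from 1000 SNPs to the 8,979,131 SNPs analysed in the DCCT/EDIC data. Linear scaling in the number of SNPs is an assumption, and the simulated setting (N=2000, 7 time units) differs from the real data, so the projections are indicative only. The function name `projected_hours` is hypothetical.

```python
SNPS_TIMED = 1000        # number of SNPs in the timing experiment (Table 3-3)
SNPS_GWAS = 8_979_131    # number of SNPs in the DCCT/EDIC GWAS

# Running times from Table 3-3, converted to seconds.
seconds_per_1000 = {
    "LMM": 6.23 * 3600,
    "SAO(WSAO)": 18.92,
    "TS": 31.88,
    "CTS": 11.67,
    "GALLOP": 17.04,
}


def projected_hours(seconds, n_snps=SNPS_GWAS):
    """Linear extrapolation of a 1000-SNP running time to n_snps, in hours."""
    return seconds / SNPS_TIMED * n_snps / 3600


for method, seconds in seconds_per_1000.items():
    speedup = seconds_per_1000["LMM"] / seconds
    print(f"{method:10s} {projected_hours(seconds):10.1f} h   "
          f"speed-up vs LMM: {speedup:6.0f}x")
```

Under these assumptions the projected single-node LMM time runs to tens of thousands of hours, which is why LMM is applied only to candidate SNPs and to a random subset in the real data analysis.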

3.3 DCCT/EDIC Data Analysis Results

3.3.1 Set up

Fitting LMMs without SNPs to the DCCT/EDIC data, with covariates gender and treatment × cohort, selected ARMA(1,1) as the best error structure for both logAER and eGFR. The estimated error correlation parameters differ between the two outcomes, but the same structure is applied to both.

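$$
\text{logAER}: \quad \Sigma_i^* = \sigma^2
\begin{bmatrix}
1 & \gamma & \cdots & \gamma\rho^{n_i} \\
\gamma & 1 & \cdots & \gamma\rho^{n_i-1} \\
\vdots & \vdots & \ddots & \vdots \\
\gamma\rho^{n_i} & \gamma\rho^{n_i-1} & \cdots & 1
\end{bmatrix},
\quad \sigma = 0.8576,\ \gamma = 0.5680,\ \rho = 0.8745.
$$

$$
\text{eGFR}: \quad \Sigma_i^* = \sigma^2
\begin{bmatrix}
1 & \gamma & \cdots & \gamma\rho^{n_i} \\
\gamma & 1 & \cdots & \gamma\rho^{n_i-1} \\
\vdots & \vdots & \ddots & \vdots \\
\gamma\rho^{n_i} & \gamma\rho^{n_i-1} & \cdots & 1
\end{bmatrix},
\quad \sigma = 12.45,\ \gamma = 0.6540,\ \rho = 0.9391.
$$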

LMM and the fast methods are applied to the combined DCCT/EDIC data with ~9M SNPs for GWAS. The model covariates and error structures used for LMM and the fast methods are specified in Table 3-4.

Table 3-4 DCCT/EDIC data analysis settings.

Covariance structure       With covariates       No covariates

ARMA(1,1)                  LMM, TS               --

Independent                GALLOP                SAO, WSAO, CTS

Covariates: gender, treatment × cohort.

The timeline in this dataset runs from DCCT baseline to EDIC year 18 (calendar year 2011). To combine DCCT and EDIC, for those who continued in the EDIC study, EDIC year 1 is treated as the follow-up year after the last regular visit in DCCT, because the close out visits were relatively irregular in time. Of the ~97% of DCCT participants who continued in EDIC, 95% started in EDIC year 1, right after DCCT close out, while the others started in later EDIC years, as shown in Figure 3-20.

Figure 3-20 Number of participants in EDIC starting years.

3.3.2 GWAS of logAER

3.3.2.1 Slopes for Two-stage Methods

We first compare the two-stage methods SAO(WSAO), TS and CTS to see whether the slopes obtained in the first step differ systematically. For each individual, the slopes are generated differently by each method, as described in the Methods section. The upper plots of Figure 3-21 are histograms of the slopes. The slopes represent the effect of time on the longitudinal trait logAER, with absolute effect sizes ranging from 0 to 1. All three have their highest frequency at around 0 or at a slightly negative slope. The WSAO slopes are more symmetrically distributed, while the TS and CTS slopes are more right-skewed. We also made Bland-Altman plots (Altman and Bland, 1986; Bland and Altman, 1999) for pairwise comparison (Figure 3-21, lower plots). In addition, paired t-tests were conducted at a significance level of 5% to test whether the mean difference is 0. There is a significant difference between SAO and TS (p<0.001, t=14.229, ΔSAO-TS=0.026) and between SAO and CTS (p<0.001, t=21.413, ΔSAO-CTS=0.026), but TS and CTS are not significantly different (p=1, t=3.85e-13, ΔTS-CTS=3.85e-16). Both the t-tests and the Bland-Altman plots indicate that SAO slopes are on average larger than TS and CTS slopes, meaning a larger positive time effect on the outcome.
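The pairwise comparison of slopes can be sketched as follows. The slopes here are simulated and purely hypothetical, the function name `paired_summary` is an assumption, and the 95% interval uses a normal approximation rather than the exact t distribution. The sketch returns the quantities plotted in a Bland-Altman plot (pair means against pair differences) together with the paired t statistic.

```python
import math
import random
import statistics


def paired_summary(a, b):
    """Mean difference, paired t statistic, approximate 95% CI and plot points."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return {
        "mean_diff": mean_diff,
        "t": mean_diff / standard_error,
        # Normal approximation to the 95% CI of the mean difference.
        "ci95": (mean_diff - 1.96 * standard_error, mean_diff + 1.96 * standard_error),
        # Bland-Altman coordinates: x = mean of the pair, y = difference of the pair.
        "bland_altman_points": [((x + y) / 2, x - y) for x, y in zip(a, b)],
    }


# Hypothetical per-subject slopes from two methods (not the thesis data).
rng = random.Random(0)
slopes_b = [rng.gauss(0.0, 0.1) for _ in range(500)]
slopes_a = [s + 0.026 + rng.gauss(0.0, 0.03) for s in slopes_b]

summary = paired_summary(slopes_a, slopes_b)
print(f"mean difference = {summary['mean_diff']:.4f}, t = {summary['t']:.2f}")
```

A positive mean difference with an interval that excludes 0 corresponds to the situation described for SAO against TS or CTS, while a mean difference that is numerically zero corresponds to TS against CTS.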

Figure 3-21 Slopes for two-stage fast methods on outcome logAER. Upper: Histograms of slopes for two-

stage fast methods (logAER). Lower: Bland-Altman plot comparing slopes between two fast methods. In

Bland-Altman plots, blue(middle) dashed line is mean difference. Green(top) and red(bottom) dashed lines

are upper and lower limits of 95% CI of mean difference.

3.3.2.2 Methods Comparison on Random Selection

Before conducting the GWAS, we randomly selected a subset of around 20k SNPs (n=19,570, 0.22% of all SNPs) from across all autosomes. Because of the limited computational speed of LMM, we ran all fast methods and LMM on this random selection only, to compare p-values and parameter estimates. We chose a random selection rather than a specific region to obtain a reasonable comparison between the fast methods and LMM. Histograms of the MAFs of all 8,979,131 SNPs and of the 19,570 selected SNPs are plotted in Figure 3-22. The distribution of MAFs among the random SNPs resembles that among all SNPs.

Figure 3-22 Histograms of MAFs of all SNPs (n=8,979,131) and randomly selected SNPs (n=19,570).

P-P plots comparing p-values on the −𝑙𝑜𝑔10 scale are shown in Figure 3-23 for visual comparison. The x axis extends to at most 4 and the y axis to at most 5, showing that no SNP in the random selection is significant at the conventional GWAS threshold. Relative to the red diagonal line, SAO and TS show greater dispersion than the other methods.

E-E plots comparing parameter estimates are shown in Figure 3-24 for visual comparison. Dashed lines at x=0 and y=0 indicate where the SNP effect is 0 by LMM and by the fast methods. The center of the cluster in each subplot lies at the intersection of x=0 and y=0, showing that most SNPs in the random selection are under the null hypothesis. As in the P-P plots, SAO and TS show more variability than the other methods, and they appear slightly asymmetric around the diagonal line.

In Table 3-5, we calculated the intraclass correlation coefficient (ICC) to assess the intra-SNP correlation between the compared p-values or estimates. For p-values on the −𝑙𝑜𝑔10 scale, SAO and TS have moderate correlation with LMM, and the remaining fast methods have very high correlation with LMM. For parameter estimates, SAO and TS are again the only two methods with moderate correlation. We also conducted paired Wilcoxon signed-rank tests on p-values and paired t-tests on estimates, at a significance level of 5%. The results show that, on the −𝑙𝑜𝑔10 scale, the p-values generated by each fast method differ significantly from those of LMM. All fast methods except WSAO have smaller p-values than LMM, leaving WSAO as the only conservative method. As to the parameter estimates, none of the comparisons between the fast methods and LMM showed a significant difference.
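For readers unfamiliar with the ICC3 statistic, the following is a minimal sketch of the standard two-way ANOVA formula for ICC(3,1), with SNPs as rows and the two compared methods as columns. It is not the psych package implementation used in the thesis, the function name `icc3` is hypothetical, and the input values are made up for illustration.

```python
def icc3(ratings):
    """ICC(3,1) for a list of rows: one row per SNP, one column per method."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA decomposition without replication.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)

    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Hypothetical -log10(p) pairs (LMM, fast method) for five SNPs.
pairs = [[0.3, 0.4], [1.2, 1.0], [2.5, 2.8], [0.8, 0.7], [3.1, 3.0]]
print(f"ICC3 = {icc3(pairs):.3f}")
```

Because the column (method) effect is removed before the error term is computed, ICC3 measures consistency rather than absolute agreement: a constant shift between two methods still gives a value of 1, which is why the mean differences are reported separately in Table 3-5.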

Figure 3-23 P-P plots on outcome logAER on random subset of SNPs. X and y axis are in -log10 scale. Red

solid line: diagonal line. b1: cross-sectional SNP effect. b2: longitudinal SNP effect. LMM(1): cross-

sectional SNP effect by LMM1. LMM(2): cross-sectional SNP effect by LMM2.

Figure 3-24 E-E plots on outcome logAER on random subset of SNPs. Red solid line: diagonal line. b1:

cross-sectional SNP effect. b2: longitudinal SNP effect. LMM(1): cross-sectional SNP effect by LMM1.

LMM(2): cross-sectional SNP effect by LMM2. Black dashed lines: x=0 or y=0.

Table 3-5 Statistical comparison between fast methods and LMM on random 19570 SNPs.

LMM 𝜷𝟐 𝜷𝟐 𝜷𝟐 𝜷𝟐 𝜷𝟏^LMM1 𝜷𝟏^LMM2 𝜷𝟐 2df

Fast SAO WSAO TS CTS GA (𝜷𝟏) GA (𝜷𝟏) GA (𝜷𝟐) GA (2df)

P* ICC 0.73 0.91 0.70 0.94 0.86 0.94 0.94 0.95

𝑃𝑤 <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 <0.001 <0.001

Δ̅ -0.022 0.032 -0.017 -0.017 -0.031 -0.041 -0.017 -0.034

𝛽 ICC 0.81 0.95 0.77 0.97 0.93 0.97 0.96 --

𝑡𝑝𝑡 -0.18 -0.02 0.89 -0.99 1.32 0.79 -0.39 --

𝑃𝑝𝑡 0.855 0.983 0.371 0.321 0.188 0.429 0.695 --

Δ̅ -5.82e-6 -2.92e-7 2.17e-5 -9.64e-6 2.10e-4 7.98e-5 -4.77e-6 --

GA: GALLOP. P*: −𝑙𝑜𝑔10(𝑃) of SNPs. β: parameter estimate. ICC: intraclass correlation coefficient (ICC3, calculated with the R package psych using the lme4 option (Revelle, 2018)). 𝑃𝑤: p-value of the paired Wilcoxon signed-rank test. 𝑡𝑝𝑡: t statistic of the paired t-test. 𝑃𝑝𝑡: p-value of the paired t-test. Δ̅: mean difference of −𝑙𝑜𝑔10(𝑃) or β between LMM and the fast method. Bolded are p-values under 5%. --: not applicable.

3.3.2.3 Methods Comparison on Detecting Significant SNPs

We conducted the GWAS with the fast methods on all SNPs (n=8,979,131). For the logAER outcome, Figure 3-25 provides histograms of p-values, Figure 3-26 provides Q-Q plots and Figure 3-28 provides Manhattan plots for each method. Apart from WSAO, all methods show a uniform distribution of p-values in the histograms (Figure 3-25). WSAO has a higher density of large p-values and a deflated genomic control value in the Q-Q plot (Figure 3-26). In the Manhattan plots (Figure 3-28), SAO detected many more SNPs than the other fast methods at the conventional GWAS threshold. To guard against potential false discoveries caused by low MAF, the Q-Q plots in Figure 3-27 are stratified by 1%<MAF<5% (n=2,238,985) and 5%≤MAF≤50% (n=6,740,146) to see whether the distributions of p-values differ between these two MAF ranges. In Figure 3-27, SAO separates the two groups of points the most among all methods, with the largest difference (0.0415) between the genomic control values for low and high MAF.
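The genomic control value reported in the Q-Q plots can be sketched as below. This assumes the common definition, the median of the 1-df chi-square statistics implied by the p-values divided by the median of the 1-df chi-square distribution; the thesis does not spell out its exact formula in this section. The function name `genomic_control` and the example p-values are hypothetical.

```python
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def genomic_control(p_values):
    """Median 1-df chi-square statistic implied by the p-values, over its null median."""
    normal = NormalDist()
    # A two-sided p-value p corresponds to z = inv_cdf(p / 2), and chi-square = z ** 2.
    chi_square = [normal.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi_square) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a perfectly uniform null distribution.
uniform_p = [(i + 0.5) / 10001 for i in range(10001)]
# Shifting mass towards 1 mimics a conservative method with too many large p-values.
conservative_p = [p ** 0.5 for p in uniform_p]

gc_uniform = genomic_control(uniform_p)
gc_conservative = genomic_control(conservative_p)
print(f"uniform: {gc_uniform:.3f}, conservative: {gc_conservative:.3f}")
```

A value close to 1 indicates p-values consistent with the null for most SNPs, while a value below 1 is the kind of deflation described for WSAO. Applying the same function separately to the low-MAF and high-MAF SNPs would give stratified values analogous to gc_L and gc_H.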

Figure 3-25 Histograms of p-values (logAER).

Figure 3-26 Q-Q plots of p-values (logAER). gc: genomic control value.

Figure 3-27 Q-Q plots of p-values stratified by MAF (logAER). Red dots: SNPs with 1%<MAF<5% (gc_L

for genomic control value). Blue dots: 5%≤MAF≤50% (gc_H for genomic control value).

Figure 3-28 Manhattan plots (logAER). Red line: 𝑃 = 5 × 10−8. Blue line: 𝑃 = 10−5.

As mentioned in Section 2.5 (Methods for DCCT/EDIC Data Analysis), we applied a loose threshold of 𝑃 < 10⁻⁵ to select candidate SNPs from the fast method results. A total of 1089 SNPs were selected by the fast methods. After running the full LMM on the candidate SNPs, we found that, for the cross-sectional SNP effect, neither GALLOP nor LMM found SNPs associated with the outcome logAER at 𝑃 < 5 × 10⁻⁸. For the longitudinal SNP effect, LMM detected 4 SNPs on chromosome 1, which were also found by all fast methods. For the joint test of both SNP effects, both GALLOP and LMM detected the same 4 SNPs on chromosome 1, plus one additional SNP on chromosome 10.

Summary information on these 5 significant SNPs is presented in Table 3-6. Parameter estimates, standard errors and p-values for these SNPs are compared between the fast methods and LMM in Table 3-7 and Table 3-8. Results for the 4 SNPs on chromosome 1 are identical, so we display only rs3817222 as a representative.

Table 3-6 Summary information of significant SNPs (P < 5 × 10⁻⁸) for outcome logAER.

Chr SNP BP MAF INFO A1:A2 β1LMM1 SE P β1LMM2 SE P β2 SE P P2df

1 rs3817222 202464760 0.26 1.00 C:T* 0.015 0.403 0.971 -0.056 0.404 0.891 0.026 0.004 5.02e-11 4.30e-11

1 rs12734338 202469723 0.26 1.00 T:C* 0.015 0.403 0.971 -0.056 0.404 0.891 0.026 0.004 5.02e-11 4.30e-11

1 rs12743401 202476648 0.26 1.00 T:C* 0.015 0.403 0.971 -0.056 0.404 0.891 0.026 0.004 5.02e-11 4.30e-11

1 rs3881953 202528021 0.26 1.00 G:A* 0.015 0.403 0.971 -0.056 0.404 0.891 0.026 0.004 5.02e-11 4.30e-11

10 rs74155187 107418625 0.01 0.99 T*:C -3.766 2.532 0.137 -0.469 0.176 0.008 -0.077 0.017 1.07e-05 1.47e-08

Significance is determined by LMM for any type of SNP effect under the conventional GWAS threshold.
Chr: chromosome. BP: position in base pairs on the chromosome (1000 Genomes phase 3 v5). INFO:
quality of imputation. A1 and A2 are alleles 1 and 2, with the minor allele indicated with *.
Parameter estimates (β), standard errors (SE) and p-values (P) from LMMs are presented. β1LMM1 and
β1LMM2: estimates of cross-sectional SNP effects from LMM1 and LMM2. β2: estimate of longitudinal
SNP effect. P2df: p-value for the 2df test. Bolded SNPs are selected for the following analysis.
Bolded p-values are under 5 × 10⁻⁸.

Table 3-7 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs3817222 on logAER.

Method Hypothesis test BETA SE P

LMM 𝛽1𝐿𝑀𝑀1 0.015 0.403 0.971

LMM 𝛽1𝐿𝑀𝑀2 -0.056 0.404 0.891

LMM β2 0.026 0.004 5.02e-12

LMM 2df -- -- 4.30e-11

SAO β2 0.030 0.005 2.09e-08

WSAO β2 0.028 0.005 6.76e-10

TS β2 0.014 0.002 2.07e-08

CTS β2 0.025 0.004 8.83e-12

GALLOP β1 -0.037 0.394 0.924

GALLOP β2 0.028 0.004 2.47e-11

GALLOP 2df -- -- 2.13e-10

𝛽1𝐿𝑀𝑀1 and 𝛽1𝐿𝑀𝑀2: estimates of cross-sectional SNP effects from LMM1 and LMM2. 𝛽1: estimate of
cross-sectional effect. 𝛽2: estimate of longitudinal effect. 2df: 2df test. “: same as above. --: not
applicable. Bolded p-values are under 5 × 10⁻⁸.

Table 3-8 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs74155187 on logAER.

Method Hypothesis test BETA SE P

LMM 𝛽1𝐿𝑀𝑀1 -0.689 0.169 4.87e-05

LMM 𝛽1𝐿𝑀𝑀2 -0.469 0.176 0.008

LMM β2 -0.076 0.017 1.07e-05

LMM 2df -- -- 1.47e-08

SAO β2 -0.101 0.023 1.42e-05

WSAO β2 -0.093 0.020 5.65e-06

TS β2 -0.063 0.011 5.13e-09

CTS β2 -0.080 0.016 6.00e-07

GALLOP β1 -0.367 0.170 0.031

GALLOP β2 -0.093 0.019 8.19e-07

GALLOP 2df -- -- 3.12e-08

Using the efficiency measures defined in Section 2.5 Methods for DCCT/EDIC Data Analysis, we
calculated the discovery proportions 𝐸𝑙 and 𝐸𝑠 for the fast methods (Table 3-9). All fast methods
detected at least the 4 SNPs on chromosome 1. For 𝐸𝑙, within the subset of SNPs with P < 10⁻⁵ by
each fast method, the discovery proportions are all below 3% except for WSAO (7.94%). WSAO
therefore shows the best ability to narrow down the range of potentially associated SNPs. SAO has
very low efficiency because it yields the most candidate SNPs yet does not include all 5 SNPs. For
𝐸𝑠, TS and the GALLOP 2df test detect all 5 SNPs at P < 5 × 10⁻⁸.

Table 3-9 Measures of efficiency of fast methods on logAER.

logAER SAO WSAO TS CTS GALLOP (β2) GALLOP (2df)

𝐸𝑙 0.70% (4) 7.94% (5) 1.79% (5) 2.66% (5) 2.24% (5) 1.43% (5)

𝐸𝑠 8.89% (4) 80.00% (4) 38.46% (5) 66.67% (4) 80% (4) 100% (5)

𝐸𝑙: % of SNPs detected by LMM under P < 5 × 10⁻⁸ out of SNPs by fast methods under P < 10⁻⁵. 𝐸𝑠:
% of SNPs detected by LMM under P < 5 × 10⁻⁸ out of SNPs by fast methods under P < 5 × 10⁻⁸.
Count in brackets is the count of SNPs detected by LMM under P < 5 × 10⁻⁸. The maximum count
should be 5.
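The two discovery proportions described in the table note can be sketched as follows. This is only an illustration under stated assumptions: p-values are held in dictionaries keyed by SNP identifier, a SNP missing from the LMM results is treated as not detected, and the function name `discovery_proportion` is hypothetical rather than taken from the thesis.

```python
GWAS_P = 5e-8   # conventional genome-wide significance threshold
LOOSE_P = 1e-5  # loose threshold used to select candidate SNPs


def discovery_proportion(fast_p, lmm_p, fast_threshold):
    """Share of fast-method hits (P < fast_threshold) that LMM detects at 5e-8.

    fast_p, lmm_p: dicts mapping SNP id -> p-value (assumed data layout).
    Returns (proportion, count of LMM-confirmed SNPs).
    """
    selected = [snp for snp, p in fast_p.items() if p < fast_threshold]
    confirmed = [snp for snp in selected if lmm_p.get(snp, 1.0) < GWAS_P]
    if not selected:
        return float("nan"), 0
    return len(confirmed) / len(selected), len(confirmed)


# Toy p-values, invented purely for illustration.
fast = {"snpA": 1e-9, "snpB": 1e-6, "snpC": 1e-6, "snpD": 0.2}
lmm = {"snpA": 1e-10, "snpB": 1e-9, "snpC": 1e-3, "snpD": 0.5}
e_l = discovery_proportion(fast, lmm, LOOSE_P)  # denominator: fast P < 1e-5
e_s = discovery_proportion(fast, lmm, GWAS_P)   # denominator: fast P < 5e-8
```

A method that proposes many candidates lowers its own proportion even when it captures the same confirmed SNPs, which is the behaviour described for SAO.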

3.3.2.4 Significant SNPs

To examine whether the significant SNPs, rs3817222 on chromosome 1 and rs74155187 on chromosome
10, truly have longitudinal effects on the outcome, we broke the longitudinal dataset down into
cross-sectional subsets stratified by study year. For each year, we ran a linear regression of the
outcome on the SNP, adjusting for the same covariates. Because of the staggered entry in DCCT, we
conducted the cross-sectional analysis in DCCT and EDIC both separately and combined. The SNP
effect estimates, with 95% CIs constructed from the SEs, are shown in Figure 3-29 to illustrate the
longitudinal change. When DCCT and EDIC are combined, the last 4 years for rs3817222 yield missing
estimates of the SNP effect because of a singularity problem: some covariates, such as sex, are not
linearly independent of SNP dosage in those years, which in turn results from the much smaller
number of subjects enrolled in the later DCCT years. The last year for rs74155187 produces a very
large estimate, which is caused by the sparse distribution of the SNP in that year. In the DCCT
years, rs3817222 has no estimates significantly different from 0. In the EDIC years, the estimates
differ between odd and even years, with even years always producing a negative estimate of the SNP
effect except for the last year. The difference between odd and even years is significant at the
beginning and narrows over the years. For rs74155187, the DCCT effects are always negative, with
effect sizes of 0 to 2 in magnitude. In the EDIC years, the difference between alternate years is
less obvious, especially in the early years, although a pattern emerges in the last few years.
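The year-by-year analysis can be sketched in simplified form. The sketch below is not the thesis code: it fits an unadjusted regression of the outcome on SNP dosage within each year (the thesis adjusts for covariates), builds the interval as estimate ± 1.96 × SE, and returns no estimate when the dosage has no variation in a year, which is the simplest analogue of the singularity described above. All function names and the record layout are assumptions.

```python
import math
from collections import defaultdict


def slope_with_ci(x, y, z=1.96):
    """Unadjusted regression of y on x; returns (beta, lower, upper) or None."""
    n = len(x)
    if n < 3:
        return None
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:  # no variation in dosage: the slope is not estimable
        return None
    beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    rss = sum((yi - my - beta * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, beta - z * se, beta + z * se


def effects_by_year(records):
    """records: iterable of (year, dosage, outcome) tuples (assumed layout)."""
    by_year = defaultdict(lambda: ([], []))
    for year, dosage, outcome in records:
        by_year[year][0].append(dosage)
        by_year[year][1].append(outcome)
    return {yr: slope_with_ci(x, y) for yr, (x, y) in sorted(by_year.items())}
```

Years with few subjects give wide intervals or no estimate at all, which mirrors the behaviour reported for the later combined DCCT/EDIC years.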

Figure 3-29 Estimates along with 95% CI for cross-sectional SNP effects on logAER by combined and

separate DCCT/EDIC years. Only rs3817222 on chromosome 1 is plotted as rs12734338, rs12743401 and

rs3881953 have the same results. Dashed lines represent 0 which is no SNP effect.

We made LocusZoom plots (Pruim et al., 2010) of the 2df p-values from LMMs in two selected
regions that contain all 5 SNPs (Figure 3-30 and Figure 3-31). The 4 SNPs on chromosome 1 are
close to each other and all give the same results under this model. The other SNPs around them
have large p-values, indicating little evidence of association with the outcome logAER. For
rs74155187, many SNPs have relatively high correlation with the reference SNP, as measured by r²,
but no other SNP reaches or comes close to the conventional GWAS threshold.

Figure 3-30 Locus plot on logAER with reference SNP rs3817222. The display range is from 202.06 –

202.86 Mb on chromosome 1 with 2024 SNPs plotted in total. P-values on left y-axis are 2df p-values by

LMM. SNPs with the same p-values are (from left to right) rs12734338, rs12743401 and rs3881953. There

is no usable linkage disequilibrium for these 4 SNPs. Reference panel: human genome 19/1000 Genomes

Nov 2014 European.

Figure 3-31 Locus plot on logAER with reference SNP rs74155187. The display range is from 107.02 –

107.82 Mb on chromosome 10 with 2445 SNPs plotted in total. P-values on left y-axis are 2df p-values by

LMM. Correlation between reference SNP and other SNPs are displayed in color shown in legend.

Reference panel: human genome 19/1000 Genomes Nov 2014 European.

3.3.3 GWAS of eGFR

3.3.3.1 Slopes for Two-stage Methods

First, we again compare the two-stage methods SAO (WSAO), TS and CTS to see whether there are
systematic differences in the slopes obtained in the first step. The figure is generated in the
same way as Figure 3-21. In the upper plots of Figure 3-32, the slopes represent the effect of time
on the longitudinal trait eGFR, with absolute effect sizes from 0 to 10. All three methods have
their highest frequency at around 0 and have left-skewed slopes, which is the opposite of logAER.
We also made Bland-Altman plots (Altman and Bland, 1986; Bland and Altman, 1999) for pairwise
comparison in the lower plots of Figure 3-32. In addition, paired t-tests were conducted to test
whether the mean difference is 0 at a significance level of 5%. The results again show a
significant difference between SAO and TS (p<0.001, t=-55.426, ΔSAO-TS=-1.496) and between SAO and
CTS (p<0.001, t=-67.310, ΔSAO-CTS=-1.496), whereas TS and CTS are not significantly different (p=1,
t=4.63e-12, ΔTS-CTS=5.33e-14). Both the t-tests and the Bland-Altman plots show that SAO
consistently gives a more negative slope than TS and CTS, meaning a larger negative time effect on
the outcome, which is the opposite direction to logAER.
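The quantities behind a paired comparison of slopes can be sketched as below. This is a minimal illustration, not the thesis code: the function name and the returned dictionary keys are assumptions, the interval for the mean difference uses a normal quantile of 1.96 rather than a t quantile, and the classical Bland-Altman limits of agreement (mean difference ± 1.96 SD) are included alongside the interval for the mean difference because the two are often confused.

```python
import math
from statistics import mean, stdev


def paired_summary(a, b):
    """Paired comparison of two slope vectors measured on the same subjects."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    d_bar = mean(diffs)
    sd = stdev(diffs)            # sample SD of the differences
    se = sd / math.sqrt(n)       # SE of the mean difference
    t_stat = d_bar / se if se > 0 else float("nan")
    return {
        "mean_diff": d_bar,
        "t": t_stat,             # paired t statistic with n - 1 df
        "ci_mean_diff": (d_bar - 1.96 * se, d_bar + 1.96 * se),
        "limits_of_agreement": (d_bar - 1.96 * sd, d_bar + 1.96 * sd),
        # Bland-Altman coordinates: (pair mean, pair difference)
        "ba_points": [((x + y) / 2, x - y) for x, y in zip(a, b)],
    }
```

With thousands of subjects even a small systematic shift yields a very large t statistic, so the size of the mean difference is the more informative quantity when reading the comparisons above.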

Figure 3-32 Slopes for two-stage fast methods on outcome eGFR. Upper: Histograms of slopes for two-

stage fast methods (eGFR). Lower: Bland-Altman plot comparing slopes between two fast methods. In

Bland-Altman plots, blue(middle) dashed line is mean difference. Green(top) and red(bottom) dashed lines

are upper and lower limits of 95% CI of mean difference.

3.3.3.2 Methods Comparison on Random Selection

Before conducting GWAS, we adopted the same random selection of 19,570 SNPs. We ran models

with all fast methods and LMM on this random selection of SNPs to compare the p-values and

parameter estimates.

Figure 3-33 shows P-P plots for eGFR, comparing p-values on the −log10 scale. The maximum values
on the x and y axes show that no SNP in this random selection is significant under the
conventional GWAS threshold for eGFR. SAO, TS and GALLOP (LMM(1) vs GALLOP(b1)) have more widely
spread points than the others. WSAO is not spread out much, but its points are not symmetric
around the diagonal line.

Figure 3-34 shows E-E plots for eGFR, comparing parameter estimates. The center of the cluster in
each subplot is at the intersection of x=0 and y=0, showing that most SNPs in the random selection
are under the null hypothesis. As in the P-P plots, the same three methods have more widely spread
clusters than the others. In addition, as in the E-E plot for logAER (Figure 3-24), SAO and TS
appear slightly asymmetric around the diagonal line.

In Table 3-10, as in Table 3-5, we first calculated the ICC. SAO and TS are moderately correlated
with LMM in both p-values (−log10 scale) and parameter estimates. GALLOP has a good but slightly
lower correlation with LMM1 for the cross-sectional SNP effect than in Table 3-5. We also performed
paired Wilcoxon rank sum tests on p-values and paired t-tests on parameter estimates at a
significance level of 5%. The results show that, on the −log10 scale, the p-values generated by the
fast methods and LMM are not significantly different except for WSAO. As for logAER, WSAO is
conservative and again produces larger p-values than LMM. For the parameter estimates, none of the
comparisons between the fast methods and LMM were significantly different.
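The ICC used for this comparison is ICC3, which the thesis computes with the R package psych. As a rough illustration of what that statistic measures, the sketch below implements the standard single-measure ICC(3,1) formula, (MSR − MSE) / (MSR + (k − 1) MSE), from a two-way layout with SNPs as rows and methods as columns. It is a hand-written approximation for explanation only and is not claimed to reproduce the psych output when its lme4 option is used.

```python
def icc3(ratings):
    """ICC(3,1) for a list of rows; each row holds k measurements of one SNP."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between SNPs
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between methods
    ss_total = sum((v - grand) ** 2 for r in ratings for v in r)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)


# Two methods that differ only by a constant shift agree perfectly under ICC3,
# because the method (column) effect is treated as fixed and removed.
shifted = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
perfect = icc3(shifted)
```

This property explains why a high ICC can coexist with a systematic difference between a fast method and LMM, which is why the paired tests are reported alongside it.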

Figure 3-33 P-P plots on outcome eGFR on random 19570 SNPs. X and y axis are in -log10 scale. Red

solid line: diagonal line. b1: cross-sectional SNP effect. b2: longitudinal SNP effect. LMM(1): cross-

sectional SNP effect by LMM1. LMM(2): cross-sectional SNP effect by LMM2.

Figure 3-34 E-E plots on outcome eGFR on random 19570 SNPs. Red solid line: diagonal line. b1: cross-

sectional SNP effect. b2: longitudinal SNP effect. LMM(1): cross-sectional SNP effect by LMM1.

LMM(2): cross-sectional SNP effect by LMM2. Black dashed lines: x=0 or y=0.

Table 3-10 Statistical comparison between fast methods and LMM on random 19570 SNPs.

LMM β2 β2 β2 β2 β1LMM1 β1LMM2 β2 2df

Fast SAO WSAO TS CTS GA (𝜷𝟏) GA (𝜷𝟏) GA (𝜷𝟐) GA (2df)

P* ICC 0.72 0.91 0.82 0.90 0.74 0.93 0.93 0.94

𝑃𝑤 0.642 <0.001 0.392 0.085 0.759 0.592 0.261 0.602

Δ̅ -0.002 0.085 0.001 -0.002 0.001 8.78e-5 -0.003 -0.001

𝛽 ICC 0.78 0.97 0.80 0.95 0.87 0.97 0.95 --

𝑡𝑝𝑡 -0.18 0.09 -1.00 1.94 -1.30 -1.37 0.45 --

𝑃𝑝𝑡 0.857 0.931 0.316 0.052 0.195 0.171 0.654 --

Δ̅ -8.44e-5 1.13e-5 -2.83e-4 3.31e-4 -0.004 -0.002 8.04e-5 --

GA: GALLOP. P*: −log10(P) of SNPs. β: parameter estimate. ICC: Intraclass correlation coefficient
(ICC3 calculated by R package psych using lme4 option (Revelle, 2018)). 𝑃𝑤: p-value of paired Wilcoxon
rank sum test. 𝑡𝑝𝑡: t statistic of paired t-test. 𝑃𝑝𝑡: p-value of paired t-test. Δ̅: Mean difference of −log10(P)
or β between LMM and fast method. Bolded are p-values under 5%. --: not applicable.

3.3.3.3 Methods Comparison on Detecting Significant SNPs

The same GWAS process was carried out for the outcome eGFR, with histograms of p-values (Figure
3-35), Q-Q plots (Figure 3-36), stratified Q-Q plots (Figure 3-37) and Manhattan plots (Figure
3-38) provided for each method. We observed several points in common between logAER and eGFR:

a. WSAO has a higher density of large p-values in the histograms (Figure 3-35).

b. WSAO has a deflated genomic control value in Figure 3-36, and this value is the farthest
from 1 among all fast methods.

c. SAO has the largest difference (0.0345) between the genomic control values for low MAF and
high MAF.

d. SAO detected many more SNPs than the other fast methods under the conventional GWAS
threshold in Figure 3-38.

Figure 3-35 Histograms of p-values (eGFR).

Figure 3-36 Q-Q plots of p-values (eGFR). gc: genomic control value.

Figure 3-37 Q-Q plots of p-values stratified by MAF (eGFR). Red dots: SNPs with 1%<MAF<5% (gc_L

for genomic control value). Blue dots: 5%≤MAF≤50% (gc_H for genomic control value).

Figure 3-38 Manhattan plots (eGFR). Red line: P = 5 × 10⁻⁸. Blue line: P = 10⁻⁵.

As a result, a total of 2347 SNPs were selected across all fast methods for the full LMM analysis,
which yielded 22 significant findings. Notably, the same SNP on chromosome 10 as for logAER
(Section 3.3.2.1), rs74155187, was found again by both GALLOP and LMM at the 5 × 10⁻⁸ level, with
a significant longitudinal effect and joint effect.

Summary information for the 22 SNPs is given in Table 3-11. There are 10 SNPs on chromosome 2 that
are close to each other in base-pair position and have similar statistics. All of the remaining
SNPs have MAF under 2%. In the stratified Q-Q plots for the fast methods (Figure 3-37), the genomic
control values change little between low and high MAF. However, when MAF is in the range 1%-5%,
the points in the Q-Q plots deviate more from the diagonal line than for 5%-50%, revealing more
potential signals, especially for longitudinal SNP effects. We therefore chose to filter out the
SNPs with low MAF and provide inference on rs12713270 to represent the 10 SNPs on chromosome 2. We
also kept rs74155187, as it is the only SNP common to both logAER and eGFR, which has the same
positive direction for the longitudinal SNP effect.

We also compare parameter estimates, standard errors and p-values for these SNPs between the fast
methods and LMM in Table 3-12 and Table 3-13. Results for the 10 SNPs on chromosome 2 are similar,
so we display only rs12713270 as a representative.

Table 3-11 Summary information of significant SNPs (P < 5 × 10⁻⁸) for outcome eGFR.

Chr SNP BP MAF INFO A1:A2 β1LMM1 SE P β1LMM2 SE P β2 SE P P2df

1 rs74409324 34832080 0.01 0.83 A*:G 2.459 2.078 0.237 -2.097 2.229 0.347 1.024 0.183 2.38e-08 8.68e-08

1 rs10538156 75791208 0.01 0.99 A*:AGTT 0.390 10.389 0.970 -32.935 11.333 0.004 7.542 1.023 1.69e-13 1.54e-12

1 rs1707184 75794925 0.01 0.99 A*:C 0.390 10.389 0.970 -32.935 11.333 0.004 7.542 1.023 1.69e-13 1.54e-12

2 rs12713270 54832751 0.20 1.00 A*:C -0.475 0.507 0.349 -1.517 0.542 0.005 0.236 0.043 3.96e-08 1.82e-07

2 rs67244067 54835845 0.20 1.00 A*:G -0.451 0.508 0.375 -1.506 0.543 0.006 0.239 0.043 2.81e-08 1.36e-07

2 rs6724136 54837377 0.22 1.00 G*:A -0.621 0.490 0.205 -1.630 0.543 0.002 0.229 0.041 3.29e-08 1.06e-07

2 rs10183043 54838859 0.22 1.00 G*:T -0.622 0.490 0.205 -1.630 0.543 0.002 0.229 0.041 3.27e-08 1.05e-07

2 rs12713272 54842302 0.20 1.00 T*:C -0.371 0.505 0.463 -1.405 0.539 0.009 0.234 0.043 4.55e-08 2.46e-07

2 rs11902659 54842317 0.20 1.00 G*:A -0.362 0.504 0.473 -1.402 0.539 0.009 0.235 0.043 3.91e-08 2.14e-07

2 rs7569127 54844926 0.20 1.00 G*:T -0.390 0.504 0.438 -1.432 0.538 0.008 0.236 0.043 3.33e-08 1.76e-07

2 rs10176359 54845698 0.20 1.00 A*:G -0.387 0.504 0.443 -1.429 0.538 0.008 0.236 0.043 3.34e-08 1.77e-07

2 rs13408295 54854905 0.20 1.00 T*:C -0.380 0.507 0.454 -1.428 0.542 0.009 0.237 0.043 3.51e-08 1.89e-07

2 rs72806653 54892663 0.21 1.00 A*:C -0.687 0.499 0.169 -1.704 0.533 0.001 0.231 0.042 4.71e-08 1.30e-07

7 rs189015886 105075505 0.01 0.88 A*:G 1.807 1.973 0.360 -2.735 2.116 0.196 1.003 0.169 2.74e-09 1.36e-08

10 rs2339623 52688098 0.01 0.99 G:C* 3.134 2.934 0.286 9.216 3.139 0.003 -1.369 0.250 4.45e-08 1.79e-07

10 rs2339622 52688177 0.01 0.99 T:G* 3.130 2.934 0.286 9.214 3.140 0.003 -1.370 0.250 4.42e-08 1.78e-07

10 rs2339621 52689642 0.01 0.99 T:G* 3.059 2.941 0.298 9.180 3.148 0.004 -1.378 0.251 4.05e-08 1.68e-07

10 rs10821959 52691497 0.01 0.99 C:G* 2.552 2.987 0.393 8.890 3.199 0.006 -1.424 0.255 2.39e-08 1.20e-07

10 rs74155187 107418625 0.01 0.99 T*:C -3.766 2.532 0.137 -9.642 2.711 3.89e-04 1.339 0.220 1.09e-09 2.85e-09

10 rs1674926 127851203 0.01 0.97 A*:T -0.447 7.256 0.951 -18.015 7.768 0.021 3.887 0.612 2.24e-10 1.80e-09

14 rs72712117 99147183 0.01 0.84 C*:A 2.218 1.928 0.250 -1.983 2.067 0.338 0.946 0.168 1.81e-08 6.75e-08

21 rs117099726 32194470 0.01 0.86 A*:G 3.960 2.189 0.071 -0.788 2.350 0.738 1.052 0.191 3.44e-08 4.81e-08

Significance is determined by LMM for any type of SNP effect under the conventional GWAS threshold.
Chr: chromosome. BP: position in base pairs on the chromosome (1000 Genomes phase 3 v5). INFO:
quality of imputation. A1 and A2 are alleles 1 and 2, with the minor allele indicated with *.
Parameter estimates (β), standard errors (SE) and p-values (P) from LMMs are presented. β1LMM1 and
β1LMM2: estimates of main effects from LMM1 and LMM2. β2: estimate of SNP-time interaction effect.
P2df: p-value for the 2df test. Bolded SNPs are selected for the following analysis. Bolded
p-values are under 5 × 10⁻⁸.

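Table 3-12 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs12713270 on
chromosome 2 for eGFR.

Method Hypothesis test BETA SE P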

LMM 𝛽1𝐿𝑀𝑀1 -0.475 0.507 0.349

LMM 𝛽1𝐿𝑀𝑀2 -1.517 0.542 0.005

LMM β2 0.236 0.043 3.96e-08

LMM 2df -- -- 1.82e-07

SAO β2 0.225 0.066 7.44e-04

WSAO β2 0.197 0.051 1.13e-04

TS β2 0.121 0.026 4.47e-06

CTS β2 0.198 0.039 4.03e-07

GALLOP β1 -1.652 0.546 2.49e-03

GALLOP β2 0.238 0.051 3.09e-06

GALLOP 2df -- -- 8.99e-06

Table 3-13 Parameter estimates (BETA), standard error (SE) and p-values (P) for rs74155187 on eGFR.

Method Hypothesis test BETA SE P

LMM 𝛽1𝐿𝑀𝑀1 -3.766 2.532 0.137

LMM 𝛽1𝐿𝑀𝑀2 -9.642 2.711 3.89e-04

LMM β2 1.339 0.220 1.09e-09

LMM 2df -- -- 2.85e-09

SAO β2 1.391 0.332 2.94e-05

WSAO β2 1.293 0.261 7.88e-07

TS β2 0.631 0.131 1.67e-06

CTS β2 1.158 0.194 3.11e-09

GALLOP β1 -11.958 2.737 1.25e-05

GALLOP β2 1.431 0.257 2.52e-08

GALLOP 2df -- -- 1.41e-08

Using the efficiency measures defined in Section 2.5 Methods for DCCT/EDIC Data Analysis, we
calculated the discovery proportions 𝐸𝑙 and 𝐸𝑠 for the fast methods (Table 3-14). With more
findings by LMM, 22 for a significant longitudinal effect and 6 of these for the joint effect, the
efficiency of the fast methods is better discriminated. WSAO still has high values for both 𝐸𝑙 and
𝐸𝑠, but the total number of SNPs it detects is small. For 𝐸𝑙, TS becomes more efficient on this
outcome, with a discovery proportion of 6.42% among its candidate SNPs. CTS and GALLOP have lower
𝐸𝑙 than TS but include all 22 SNPs among their candidate SNPs.

Table 3-14 Measures of efficiency of fast methods on real eGFR data.

eGFR SAO WSAO TS CTS GALLOP (β2) GALLOP (2df)

𝐸𝑙 0.51% (7) 40.91% (9) 6.42% (21) 4.44% (22) 3.70% (22) 2.33% (16)

𝐸𝑠 1.85% (3) 100% (3) 71.43% (5) 46.15% (6) 24.24% (8) 30.44% (7)

𝐸𝑙: % of SNPs detected by LMM under P < 5 × 10⁻⁸ out of SNPs by fast methods under P < 10⁻⁵. 𝐸𝑠:
% of SNPs detected by LMM under P < 5 × 10⁻⁸ out of SNPs by fast methods under P < 5 × 10⁻⁸.
Count in brackets is the count of SNPs detected by LMM under P < 5 × 10⁻⁸. The maximum count
should be 22.

3.3.3.4 Significant SNPs

Similarly, cross-sectional analyses stratified by study year were conducted for rs12713270 on
chromosome 2 and rs74155187 on chromosome 10, with results provided in Figure 3-39. The two SNPs
show the same pattern in DCCT/EDIC: the cross-sectional SNP effect increases slowly by study year,
changing direction over time. However, rs74155187 appears to have a larger effect size than
rs12713270 overall.

Figure 3-39 Estimate along with 95% CI for cross-sectional SNP effects on eGFR by combined and separate

DCCT/EDIC years. Only rs12713270 on chromosome 2 is plotted as the other findings on chromosome 2

have the similar results. Dashed lines represent 0 which is no SNP effect.

We made LocusZoom plots (Pruim et al., 2010) of the 2df p-values from LMMs for two selected
regions, one containing all SNPs on chromosome 2 (Figure 3-40) and one containing the SNP on
chromosome 10 (Figure 3-41). For rs12713270, many SNPs on chromosome 2 are very highly correlated
with this SNP, as measured by r². For rs74155187, the locus plot looks similar to Figure 3-31:
many SNPs have relatively high correlation with the reference SNP, but no other SNP reaches or
comes close to the conventional GWAS threshold.

Figure 3-40 Locus plot with reference SNP rs12713270. The display range is from 54.43 – 55.23 Mb on
chromosome 2 with 2846 SNPs plotted in total. P-values on left y-axis are 2df p-values by LMM.

Correlation between reference SNP and other SNPs are displayed in color shown in legend. Reference

panel: human genome 19/1000 Genomes Nov 2014 European.

Figure 3-41 Locus plot with reference SNP rs74155187. The display range is from 107.02 – 107.82 Mb on

chromosome 10 with 2445 SNPs plotted in total. P-values on left y-axis are 2df p-values by LMM.

Correlation between reference SNP and other SNPs are displayed in color shown in legend. Reference

panel: human genome 19/1000 Genomes Nov 2014 European.

3.3.4 Speed Comparison

The speed comparison for GWAS is presented in Figure 3-42. As before, GWAS was run on one node of
a high-performance computing server that includes 144 nodes consisting of 2x Intel(R) Xeon(R)
Gold 6140 CPU @ 2.30GHz with 192 GB RAM. The total time was summed for each method rather than
accounting for parallel computation, because different nodes or servers might have different
capabilities for parallel running. As expected, the two-stage methods SAO, WSAO, TS and CTS run at
about the same speed, since they differ only in the time for the first step of calculating slopes,
which is negligible compared to the GWAS analysis time. GALLOP requires more time than the other
fast methods because of its different algorithm, but it provides more inference, including both
single and joint SNP effects. In addition, the longer time of GALLOP for eGFR than for logAER
indicates that the speed of GALLOP is also affected by the number of measurements, whereas the
two-stage methods are not affected because the data are transformed into cross-sectional data.
Because of the expected slow speed of LMM, the whole genome was not analyzed and only the random
subset of 19,570 SNPs was timed for the speed comparison. LMM took 169.13 hours on these SNPs for
logAER and 191.62 hours on the same set of SNPs for eGFR. With the same number of SNPs, the speed
of LMM is slightly affected by the different number of measurements, as it takes 22.49 hours
longer to run on eGFR than on logAER. Extrapolating from the time for 19,570 SNPs to 8,979,131
SNPs, it would take about 78k hours for logAER and 88k hours for eGFR to run the GWAS with LMMs
without parallel running.
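The extrapolation above is a simple proportional scaling, which can be checked with a few lines. The sketch assumes that LMM run time grows linearly with the number of SNPs, which is the assumption implicit in the figures quoted; the function name is illustrative.

```python
def extrapolate_hours(hours_on_subset, n_subset, n_total):
    """Scale run time linearly from a subset of SNPs to the full set."""
    return hours_on_subset * n_total / n_subset


N_SUBSET = 19_570        # random SNPs actually timed with LMM
N_TOTAL = 8_979_131      # SNPs analyzed by the fast methods

log_aer_hours = extrapolate_hours(169.13, N_SUBSET, N_TOTAL)
egfr_hours = extrapolate_hours(191.62, N_SUBSET, N_TOTAL)

print(f"logAER: about {log_aer_hours / 1000:.0f}k hours")
print(f"eGFR:   about {egfr_hours / 1000:.0f}k hours")
```

The scaling factor is roughly 459, so each hour spent on the random subset corresponds to about 19 days of single-node computation for the whole genome.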

Figure 3-42 Running time for GWAS. For fast methods, the count of SNPs analyzed is 8,979,131. For

LMM*, time is recorded by running a random selection of 19,570 SNPs (~0.2% of SNPs).

Discussion

4.1 Simulation Study

Simulation studies help us to have a general idea of how methods perform on longitudinal data

with different features. The limitations of the previous Rotterdam simulation studies (Sikorska et

al., 2013b; Sikorska et al., 2018) include:

a. MAF of SNPs was 0.5 for all scenarios.

b. The study did not account for any within subject error correlation and the errors were

generated independently for each visit of each subject.

c. The missing data scenarios in the simulation study only included dropout based on MCAR

and MAR mechanisms with an overall missing rate of ~35%.

d. All methods only provide inference on either cross-sectional or longitudinal SNP effect,

without 2df test.

Compared to previous studies, we extend simulation aspects in MAF, missingness and within-

subject error structure.

4.1.1 Type 1 Error

Using the approximate 99% CI, we want to make sure that these methods produce controlled T1E in
every scenario, so that it is fair to compare their power in the next step.

As described in Section 3.2.2, Figure 3-13 showed that most fast methods are relatively stable at
controlling T1E in the current 7 scenarios, except for WSAO, which is significantly conservative
when data are MAR, MNAR or subject to dropout. We also observed inflated T1E for the LMM 2df test
when MAF=0.3 and N=500 in the strong correlation scenario, although T1E stays within the 99% CI for
other combinations of MAF and N. Although the other methods are currently well controlled, we
noted in Section 1.5.7 concerns raised in other papers about possible inflation of T1E for low MAF
or low minor allele count (Ma et al., 2013). In our study we held one of MAF and N constant
(MAF=0.3; N=2000) while varying the other across a wide range, so the lowest MAF in our setting is
0.01 and the corresponding sample size is 2000. This type of design is convenient to conduct but
limits the findings; for instance, we do not know how T1E would behave when MAF=0.01 and N=1000.
The inflation of T1E in many such specific situations remains unknown, but the settings can always
be adjusted to conduct simulation studies based on specific real data.
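One common way to build an approximate 99% CI for an empirical T1E is the normal approximation to the binomial around the nominal level. The sketch below assumes that construction; the nominal level and the number of simulation replicates in the example calls are hypothetical, since neither is restated in this section, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def t1e_interval(alpha, n_replicates, level=0.99):
    """Normal-approximation interval for the rejection rate under the null."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = z * math.sqrt(alpha * (1.0 - alpha) / n_replicates)
    return alpha - half, alpha + half


def classify(observed, alpha, n_replicates):
    """Label an empirical T1E relative to the interval."""
    lo, hi = t1e_interval(alpha, n_replicates)
    if observed < lo:
        return "conservative"
    if observed > hi:
        return "inflated"
    return "controlled"


# Hypothetical example: nominal level 0.05 with 10,000 null replicates.
bounds = t1e_interval(0.05, 10_000)
```

Because the interval narrows with the square root of the number of replicates, a method can be flagged as conservative or inflated in a large simulation even when its empirical T1E is numerically close to the nominal level.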

4.1.2 Power

4.1.2.1 Cross-sectional Effect Test

The power for cross-sectional effects by GALLOP and LMM is usually similar across all scenarios
except the strong correlation scenario (Figure 3-15). Both methods lose little power in the
missing data scenarios. Both lose power when the error correlation is set to strong, but GALLOP
loses more than LMM because it always assumes independent within-subject errors. In addition, the
power for the cross-sectional effect is not affected by whether the longitudinal SNP effect is 0
or not.

4.1.2.2 Longitudinal Effect Test

4.1.2.2.1 Missing Data Scenarios

We observe obvious power loss for the longitudinal SNP effect in the missing data scenarios,
especially MNAR and dropout.

In general, no method is robust against MNAR, which is expected. However, it is counterintuitive
to see the power drop for the longitudinal effect under MNAR when both main and interaction
effects exist (Figure 3-16).

Power loss in the dropout scenario is less severe than under MNAR, perhaps because the
probability of dropout is based on the previous observation, so dropout can be seen as a type of
MAR. This design is inspired by the eGFR data: once a participant reaches an eGFR value under
10 ml/min/1.73m², the value is recoded as 10 and set to missing for the following visits because
of the development of end-stage renal disease. Data generated under dropout resemble our
motivating data more than the other missing data scenarios in that the number of measurements
decreases over the years.

The reasons for the performance change in these two missing data scenarios are yet to be
explained, but they might be related to our current simulation settings, including the relative
sizes of the cross-sectional and longitudinal SNP effects or the approach used to simulate the
missing data scenarios.
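The eGFR recoding rule that motivated the dropout scenario can be written down directly. The sketch below implements only that deterministic rule as described in the text (first value under 10 is recoded to 10, later visits become missing); it is not the probabilistic dropout mechanism used in the simulation, and the function name and the use of `None` for missing values are assumptions.

```python
def apply_floor_dropout(values, floor=10.0):
    """Recode the first value below `floor` to `floor`, then drop later visits.

    values: one participant's measurements in visit order; None marks a visit
    that is already missing.
    """
    out, dropped = [], False
    for v in values:
        if dropped or v is None:
            out.append(None)       # after dropout, or already missing
        elif v < floor:
            out.append(floor)      # recode the triggering visit
            dropped = True
        else:
            out.append(v)
    return out


# Hypothetical trajectory in ml/min/1.73m^2.
example = apply_floor_dropout([60.0, 35.0, 8.0, 20.0, 5.0])
```

Because participants with steeply declining trajectories lose their later visits, the number of observations shrinks over time, which is the feature of the motivating data that the dropout scenario is meant to resemble.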

4.1.2.2.2 Error Correlation Scenarios

We observe some power loss in the strong error correlation scenario in Figure 3-16. The strong
error structure affects all fast methods except TS because they cannot adjust for correlated
within-subject errors. However, the difference between methods is not obvious in the reference
scenario, which allows a medium correlation. When the errors are generated independently, i.e.
with zero error correlation, the fast methods do not lose power and perform similarly to the
reference scenario (Figure 3-16).

4.1.2.3 Two-degree-of-freedom Test

In the power comparisons in Figure 3-15 and Figure 3-16, GALLOP may be the best fast method across
all scenarios, in that the 2df test has a dominant advantage in all scenarios when both SNP effects
exist and is still powerful enough when only one of the cross-sectional or longitudinal SNP
effects exists. This coincides with findings by Kraft et al. in a different setting, where a joint
test combining tests of the marginal and interaction effects is claimed to be an optimal approach
in most situations and more powerful than testing either the main effect or the interaction effect
when both effects exist (Kraft et al., 2007). Their conclusions were based on cross-sectional GWAS
with both the outcome and the environmental factor simulated as binary variables (Kraft et al.,
2007). Other studies applying the 2df test to quantitative traits also show a gain in power to
detect new loci compared to using main effects alone (Young et al., 2018; Persad et al., 2017).
From this simulation study we conclude that the 2df test for SNP×time is a very powerful and
efficient approach for longitudinal GWAS with quantitative traits.
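A joint test of the cross-sectional and longitudinal SNP effects is often formulated as a Wald test with 2 degrees of freedom. The sketch below shows that formulation as an assumption; this section does not state which form of the 2df test the thesis or GALLOP uses. The statistic is the quadratic form of the two estimates in the inverse of their covariance matrix, and for 2 degrees of freedom the chi-square upper tail probability has the closed form exp(−W/2).

```python
import math


def wald_2df(b1, b2, var1, var2, cov12):
    """Joint Wald test of H0: b1 = b2 = 0 (assumed form of the 2df test).

    var1, var2: variances of the two estimates; cov12: their covariance.
    Returns (statistic, p-value).
    """
    det = var1 * var2 - cov12 ** 2
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # b' V^{-1} b written out for a 2 x 2 covariance matrix V
    stat = (b1 * b1 * var2 - 2.0 * b1 * b2 * cov12 + b2 * b2 * var1) / det
    # Upper tail of chi-square with 2 df is exactly exp(-x / 2)
    return stat, math.exp(-stat / 2.0)


# Hypothetical estimates, chosen only to show the mechanics.
stat, p_value = wald_2df(b1=2.0, b2=3.0, var1=1.0, var2=1.0, cov12=0.0)
```

When the two estimates are uncorrelated the statistic is the sum of the two squared z-scores, which shows how moderate evidence for each effect can combine into a small joint p-value.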

4.1.3 Speed

We simulated and analyzed 1000 SNPs for the comparison of speed. The comparison between fast
methods might be unstable; in theory SAO, WSAO, TS and CTS should have very similar speed on large
data because they differ only in the first step of slope calculation. The simulation is limited by
the speed of LMM, and the results show that the fast methods run about 1000 times faster than LMM
(Table 3-3). However, this is only an estimate, because real data may have a different sample
size, MAF, missingness and error structure, all of which can affect the speed of LMM. File
organization may also have speed implications, such as whether all GWAS data are in one file,
separated by chromosome, or split into chunks. The time can be shortened by parallel processing,
but we simply summed the time for each model because parallel computing systems differ in
capability.

4.1.4 Implementation

In addition, ease of implementation is important to consider. SAO/WSAO are the easiest to conduct
in that they require only linear regression for the whole algorithm. CTS and GALLOP need
additional code developed from the original papers, which may have to be adjusted for specific
data. Moreover, CTS cannot transform data with > 22 repeated measurements because it is infeasible
to calculate the orthogonal polynomials needed. This might require thinning the data from
individuals with more than 22 measurements, which could result in loss of information.

4.2 DCCT/EDIC Data Analysis

4.2.1 GWAS of logAER

The cross-sectional effects of the discovered SNPs for logAER showed an alternating pattern
between odd and even years (Figure 3-29). This may be because during EDIC the logAER measurement
is taken every other year for each subject, and the assignment of participants to EDIC years is
based on their entry year in DCCT. Predominantly adolescents (age 13-17 years) were recruited in
the early phase of DCCT, and the other subjects recruited were adults aged 18-39 years (DCCT
Research Group, 1986). This might explain the significant difference (p=0.003) described in
Section 3.1.2, where the first EDIC logAER measure is higher in even years than in odd years, and
the populations measured in odd and even years of the EDIC study may differ. However, it is
difficult to separate participants cleanly into those measured in odd or even years because, as
mentioned in Section 3.1.2, 14.46% of participants in EDIC did not strictly adhere to the
assignment at least once and switched between odd and even measurement years.

4.2.2 GWAS of eGFR

The eGFR outcome behaved differently from logAER in that the p-values from most fast methods do not differ significantly from those of LMM, except for WSAO and GALLOP on the longitudinal SNP effect. LMM also detected many more SNPs for eGFR than for logAER under the conventional GWAS threshold. The difference between the two outcomes might be explained by a different error structure (Section 3.3.1). eGFR also has more measurements than logAER. T1E and power might behave differently under a combination of low MAF, more visits and such an error structure, which is not assessed in our simulation study.

4.3 Limitations and Future Study

4.3.1 Simulation Study Settings

We are aware that our simulation designs do not cover all possible scenarios in real data. Most importantly, the simulation is based only on the logAER outcome in the DCCT data, so it cannot be generalized to other traits and other studies. The simulation results are not definitive and the designs can always be extended to more specific scenarios, but the current results may provide some insight into other longitudinal data that resemble one of our scenarios.

4.3.1.1 MAF, Sample Size and Effect Size

As mentioned in Section 4.1.1, we fixed MAF at 0.3 or sample size at 2000 while varying the other design factor across its range. This isolates the change in performance caused by a single factor and is easier to conduct than a full factorial design. However, without a full factorial design, we cannot learn how the methods perform under every combination, for example a low MAF together with a small sample size.
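
To make the difference between the two designs concrete, the sketch below enumerates the scenarios covered by a one-factor-at-a-time design and by a full factorial design. Only the reference values (MAF 0.3, sample size 2000) come from the text; the grids are hypothetical and are not necessarily the values used in the simulation study.

```python
from itertools import product

# Reference values come from the text; the grids are hypothetical.
REF_MAF, REF_N = 0.3, 2000
maf_grid = [0.05, 0.1, 0.3, 0.5]
n_grid = [500, 1000, 2000, 4000]

# One factor at a time: vary one factor while the other stays at its reference value.
one_at_a_time = sorted({(m, REF_N) for m in maf_grid} | {(REF_MAF, n) for n in n_grid})

# Full factorial: every combination of MAF and sample size.
full_factorial = list(product(maf_grid, n_grid))

# Scenarios that only the full factorial design covers, e.g. low MAF with a small sample.
only_factorial = [s for s in full_factorial if s not in one_at_a_time]
```

With these grids the one-factor-at-a-time design never visits scenarios such as (0.05, 500), which is exactly the kind of combination the limitation refers to.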

The effect sizes for the cross-sectional and longitudinal SNP effects were selected to give a reasonable power plot under the reference scenario. A limitation is that the cross-sectional and longitudinal effects were combined at a fixed ratio. This ratio may differ between studies or not be constant by

Another option to consider is to include multiple SNPs with different MAFs (e.g. 0.01 – 0.5) or effect sizes in the simulation model (Eq. 2.4). Applying the different fast methods to data generated from such a model would reveal, within a single model, how well they identify the true SNPs and how they perform on association tests and effect size estimation.

4.3.1.2 Missing Pattern

Although we simulated four missing-data scenarios, our approach to simulating missingness is only one of many possible ways and perhaps the simplest one. We used a missing rate of 40% to exaggerate the differences between methods, whereas the two outcomes in DCCT/EDIC have an average missing rate of around 7.5%, and the missing patterns differ between the separate studies (Table 3-2).

In the observed DCCT/EDIC renal measurements, one reason for missingness is the study design. When a participant develops end-stage renal disease by reaching the eGFR threshold value, he or she is censored afterwards for renal measures (DCCT/EDIC Research Group, 2011). Another possible reason is death: by 2012, around 107 deaths had occurred in the EDIC cohort (Orchard and DCCT/EDIC Research Group, 2015). The missingness in our data might therefore be a mixture of several mechanisms, and a limitation of the simulation study is that we considered only one reason for missingness.

4.3.1.3 Within-subject Error Correlation

We extended the Rotterdam study by allowing within-subject error correlation in LMM. Among the error structures introduced in Section 2.2.2 (Within-individual Correlation Structure Selection), we used ARMA(1,1) because model selection identified it as the best structure for describing the errors of both outcomes. However, we varied only one parameter, γ, to manually adjust the strength of correlation between time points, as mentioned in Set up. The parameter ρ of this structure can also be adjusted to change the correlation, so the simulation could be extended to different combinations of γ and ρ. In addition, data could be generated under other types of error structure.
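
The roles of γ and ρ can be seen in a small sketch of an ARMA(1,1) correlation matrix. It assumes the common parameterization in which the correlation between two equally spaced time points at lag k ≥ 1 is γρ^(k-1), so that γ is the lag-1 correlation and ρ controls how quickly the correlation decays. The function names and parameter values are hypothetical, and the parameterization should be checked against the software actually used.

```python
import numpy as np

def arma11_corr(n_times, gamma, rho):
    """Correlation matrix with 1 on the diagonal and gamma * rho**(lag - 1) off the diagonal."""
    idx = np.arange(n_times)
    lags = np.abs(np.subtract.outer(idx, idx))
    off_diagonal = gamma * rho ** np.maximum(lags - 1, 0)
    return np.where(lags == 0, 1.0, off_diagonal)

def simulate_errors(n_subjects, n_times, gamma, rho, sigma=1.0, seed=0):
    """Draw within-subject errors independently across subjects (assumed multivariate normal)."""
    rng = np.random.default_rng(seed)
    cov = sigma ** 2 * arma11_corr(n_times, gamma, rho)
    return rng.multivariate_normal(np.zeros(n_times), cov, size=n_subjects)

# Illustrative grid over both parameters (values are hypothetical).
scenarios = {(g, r): arma11_corr(4, g, r) for g in (0.3, 0.5) for r in (0.5, 0.8)}
errors = simulate_errors(n_subjects=10, n_times=4, gamma=0.5, rho=0.8)
```

Varying γ alone rescales every off-diagonal correlation by the same factor, whereas varying ρ changes how correlation falls off with lag, which is why combinations of the two give a richer set of scenarios.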

4.3.2 Real Data Model Specification

In Section 2.5 we selected treatment as one of the fixed-effect covariates. By doing so we assume that the effect of treatment is the same at all time points, including baseline. However, subjects were not randomized at DCCT baseline, so there should be no effect of the randomly assigned treatment at that time point. This model misspecification might influence the results. One potential solution is to include an interaction term between treatment and a binary indicator for baseline. The model would then estimate the treatment effect at baseline as 0, which is more accurate than the current model.
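
One possible coding of this idea is sketched below: the treatment covariate is multiplied by a post-baseline indicator, so it is zero at baseline by construction. This is a simplified illustration under stated assumptions (baseline coded as time 0, ordinary least squares instead of an LMM, simulated data with hypothetical values); it is not the model fitted in the thesis.

```python
import numpy as np

def treatment_after_baseline(treatment, time):
    """Treatment covariate that equals treatment after baseline and 0 at baseline (time == 0)."""
    treatment = np.asarray(treatment, float)
    post_baseline = (np.asarray(time, float) > 0).astype(float)
    return treatment * post_baseline

# Toy long-format data; all numbers are hypothetical.
rng = np.random.default_rng(0)
n_subjects, n_visits = 300, 5
treatment = np.repeat(rng.integers(0, 2, n_subjects), n_visits)
time = np.tile(np.arange(n_visits, dtype=float), n_subjects)
trt_post = treatment_after_baseline(treatment, time)

# Outcome with a treatment effect of 0.5 that is absent at baseline.
y = 2.0 + 0.1 * time + 0.5 * trt_post + rng.normal(0, 0.2, time.size)

# Fixed-effects part only; random effects are ignored in this sketch.
X = np.column_stack([np.ones_like(time), time, trt_post])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
treatment_effect = coef[2]
```

Because `trt_post` is zero for every baseline row, the fitted model cannot attribute any baseline difference to treatment.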

4.3.3 Efficiency Measures

In Section 2.5 (Methods for DCCT/EDIC Data Analysis), we defined two measures, 𝐸𝑙 and 𝐸𝑠, for the efficiency or accuracy with which the fast methods capture the same SNPs as LMM under the conventional GWAS threshold. The definition of these two measures has limitations, especially since the GWAS of logAER and eGFR revealed clusters of correlated SNPs within a region (for example, the SNPs on chromosome 2 in Table 3-11). The measures could be redefined in terms of the ability to localize a region instead of requiring exactly the same SNPs as LMM. Starting from the SNPs that pass the pre-specified threshold in the fast methods, we can find the SNPs in high linkage disequilibrium with each targeted SNP. This cluster of SNPs would then be examined further to investigate their cross-sectional or longitudinal effects on the trait.
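
A region-based measure of this kind could look like the following sketch, which counts an LMM hit as recovered when a fast method reports any significant SNP nearby. It is a hypothetical measure, not the definition of 𝐸𝑙 or 𝐸𝑠, and it uses physical distance on the same chromosome as a crude stand-in for linkage disequilibrium; the window size and the positions are made up for illustration.

```python
def region_recovery(lmm_hits, fast_hits, window=250_000):
    """Fraction of LMM hits with a fast-method hit on the same chromosome within `window` bp.

    Both arguments are lists of (chromosome, position) pairs for SNPs that pass
    the significance threshold.
    """
    if not lmm_hits:
        return float("nan")
    recovered = 0
    for chrom, pos in lmm_hits:
        if any(c == chrom and abs(p - pos) <= window for c, p in fast_hits):
            recovered += 1
    return recovered / len(lmm_hits)

# Hypothetical hits: two clustered LMM SNPs on chromosome 2 and one on chromosome 7.
lmm_hits = [(2, 1_000_000), (2, 1_100_000), (7, 5_000_000)]
fast_hits = [(2, 1_050_000)]
recovery = region_recovery(lmm_hits, fast_hits)
```

Here the fast method reports neither of the exact chromosome 2 SNPs found by LMM, yet it localizes the same region, so two of the three LMM hits count as recovered. A real implementation would replace the distance window with an LD threshold computed from genotype data.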

4.3.4 Heteroscedasticity

In the current analysis with a specified error structure, errors are generated independently from the same multivariate distribution for every subject across the multiple time points. However, we did not investigate heteroscedasticity in either the simulation study or the DCCT/EDIC data. Error variances might differ between subgroups of subjects, for example treatment groups (Yamaguchi et al., 2019). We can extend our study to simulate heteroscedastic data and see how the performance of the methods is affected by this potential issue.
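
A minimal way to generate such data is sketched below, where each subject's errors are scaled by a standard deviation that depends on the subject's group. The group labels, the standard deviations and the use of errors that are independent across time points are assumptions made to keep the example short; a within-subject correlation structure could be layered on top.

```python
import numpy as np

def heteroscedastic_errors(group, n_times, sd_by_group, seed=0):
    """Normal errors whose standard deviation depends on each subject's group."""
    rng = np.random.default_rng(seed)
    labels = list(group)
    sd = np.array([sd_by_group[g] for g in labels], dtype=float)
    # One row per subject, one column per time point; rows are scaled by the group SD.
    return rng.standard_normal((len(labels), n_times)) * sd[:, None]

# Hypothetical setup: group "B" has twice the error standard deviation of group "A".
group = ["A"] * 500 + ["B"] * 500
errors = heteroscedastic_errors(group, n_times=10, sd_by_group={"A": 1.0, "B": 2.0})
sd_a = errors[:500].std()
sd_b = errors[500:].std()
```

Feeding errors like these into the simulation model would show whether methods that assume a common error variance lose T1E control or power.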

4.3.5 Weighted Slope as Outcome (WSAO)

WSAO is a new method that we propose based on the SAO method. It applies weights in the second step to give more weight to subjects with more observations. The distributions of weights for logAER and eGFR were presented in Figure 3-5 and Figure 3-9. Both traits showed a left-skewed distribution of the number of measurements, in which most subjects have around 15 observations for logAER and over 20 observations for eGFR.

In the simulation study, the T1E results in Figure 3-13 suggest that WSAO is significantly conservative in the missing-data scenarios, especially dropout. In the DCCT/EDIC data, the missing rates of both traits increase by year in EDIC, suggesting a dropout pattern. Consistent with this, we again observed conservative performance of WSAO in the histograms of its p-values (Figure 3-25, Figure 3-35). With a questionable T1E under such a missing pattern, the power results for WSAO cannot be considered convincing, and the same issue might affect the performance of WSAO in the DCCT/EDIC data analysis.

We suspect that this abnormal performance of WSAO might be caused by the small group of subjects with few observations, which produces the left-skewed distribution of weights for WSAO. In a future study, we can repeat the analysis restricted to the subjects with the most records to determine whether WSAO is affected by evenly distributed or clustered missingness. Given the limitation of the current WSAO method, a different weighting strategy could also be designed.

4.3.6 Missing Not at Random

For the MNAR scenario, this type of missingness is very hard to detect in real data because it depends on unobserved data (Little and Rubin, 1987). Moreover, none of the methods in our study, including LMM, is robust against MNAR in theory. The classic ways of dealing with MNAR in longitudinal data include the selection model and the pattern mixture model (Enders, 2011; Fiero et al., 2017). Building on these, newly proposed methods such as the pattern submodel are computationally efficient tools for handling missing data, including MNAR (Fletcher Mercaldo and Blume, 2018). Although the missing rate in our data is at most around 15% in later years, ignoring MNAR might produce biased results, so techniques for MNAR could be applied to see whether the results from LMM are biased by MNAR.

4.3.7 Empirical T1E and Theoretical T1E

The simulation study for T1E is limited to the calculation of theoretical T1E. The null hypothesis for GWAS is that no SNP is associated with the trait of interest, and the multiple testing problem requires correction of the T1E rate. Theoretical T1E is obtained by setting a single SNP to be unassociated with the outcome and analyzing the effect of this SNP only. However, the empirical T1E of the same method might be different. One example of a simulation study for empirical T1E is given by Zhang and Sun (2019). The outcome is generated under an alternative in which it is associated with one SNP, and then either a large number of null SNPs are regenerated or the outcome is permuted in order to calculate T1E. They also calculate the T1E rate at much more stringent significance levels with a large number of replicates, which resembles a true GWAS. One of our limitations is that we did not address stringent thresholds for T1E control in our simulation study. Compared to empirical T1E, theoretical T1E is less realistic and might produce misleading conclusions on T1E rate control, which affects the interpretation of power comparisons.
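
The empirical-null idea can be sketched as follows: the outcome is generated once under an alternative with a single causal SNP, many null SNPs are then generated independently of it, and the T1E rate is the fraction of null SNPs rejected. This is a simplified cross-sectional illustration under stated assumptions (simple linear regression per SNP, a normal approximation to the test statistic, hypothetical parameter values); it does not reproduce the design of Zhang and Sun (2019) or the longitudinal models compared in this thesis.

```python
import math
import numpy as np

def empirical_t1e(n=500, n_null=5000, maf=0.3, beta=0.3, alpha=0.01, seed=1):
    """Fraction of null SNPs rejected when the outcome depends on one causal SNP."""
    rng = np.random.default_rng(seed)
    causal = rng.binomial(2, maf, n)
    y = beta * causal + rng.standard_normal(n)   # outcome under the alternative

    # Null SNPs are generated independently of the outcome.
    null_snps = rng.binomial(2, maf, (n_null, n)).astype(float)

    # Simple linear regression per SNP through the sample correlation.
    yc = y - y.mean()
    gc = null_snps - null_snps.mean(axis=1, keepdims=True)
    r = gc @ yc / np.sqrt((gc ** 2).sum(axis=1) * (yc ** 2).sum())
    t = r * np.sqrt((n - 2) / (1 - r ** 2))

    # Two-sided p-values from a normal approximation (reasonable for large n).
    p = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in t])
    return float(np.mean(p < alpha))

rate = empirical_t1e()
```

The significance level here is deliberately lenient so that a few thousand null SNPs suffice; assessing T1E at a genome-wide threshold would require far more null SNPs or replicates.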

4.3.8 Multivariate Model

The current analysis is based on one phenotype and one SNP at a time. Efforts have been devoted to multivariate models that combine multiple phenotypes to increase genome-wide discovery. Conducting GWAS repeatedly for several traits may be less efficient when the number of available variants or the sample size is large. Other benefits of multivariate LMMs are their ability to account for sample relatedness and population stratification and their increased power compared to standard univariate analysis (Zhou and Stephens, 2014). Joint outcome analysis can be useful when the interest is in longitudinal trends between outcomes or in the evolution of the association between outcomes, but multivariate LMMs are currently limited by computational complexity, as the usual LMM tools such as nlme and lme4 cannot fit them (Verbeke et al., 2014). Several tools have recently been proposed to extend LMM so that GWAS of correlated traits can be carried out efficiently, such as GMA (Ning et al., 2019), the joineRML package in R (Hickey et al., 2018) and MTG2 (Lee and van der Werf, 2016).

As described in Section 3.1, the renal outcomes AER and eGFR are weakly negatively correlated. A joint GWAS of these outcomes using multivariate LMMs may be another future direction for identifying novel genetic variants associated with renal traits in this cohort.

Summary

In summary, we based our study on diabetic renal complications because they are one of the four major complications and the leading cause of end-stage renal failure. Measurements of renal traits might be affected by the environment, and the trajectory of repeated measurements can reveal more information on the status of a patient with diabetes than a one-time measurement. This motivated us to find a fast and effective way to conduct GWAS with longitudinal data.

We extended the previous Rotterdam simulation study by Sikorska et al. with our motivating data from the DCCT/EDIC study. The results show that GALLOP stands out as the most accurate method because 1) the 2df test by GALLOP always has very high power (Figure 3-15, 3-16); and 2) parameter estimation by GALLOP always captures the true effect sizes for cross-sectional and longitudinal SNP effects (Figure 3-17, 3-18, 3-19). The 2-stage methods SAO, WSAO, TS and CTS require similar computation time and are faster than GALLOP when the number of SNPs is high, but GALLOP is still much faster than LMM (Figure 3-42).

Finally, we recognize that there are limitations in our current simulation study designs and in the model for the DCCT/EDIC real data. We want to extend our simulation study to assess these methods in more realistic scenarios, and to correct the specification of the real data model to find the true genetic associations in our data.

References

1000 Genomes Project Consortium (2015) 'A global reference for human genetic variation',

Nature, 526, pp. 68-74.

Afonso, G. and Mallone, R. (2013) 'Infectious triggers in type 1 diabetes: is there a case for

epitope mimicry?', Diabetes, Obesity & Metabolism, 15(Suppl 3), pp. 82-88.

Allan, G.M., Mannarino, M. and Tonelli, M. (2013) 'Tools for Practice Screening and diagnosis

of type 2 diabetes with HbA1c', Canadian Family Physician, 59(1), pp. 42.

Altman, D.G. and Bland, J.M. (1986) 'Comparison of methods of measuring blood pressure', J

Epidemiol Community Health, 40(3), pp. 274-7.

American Diabetes Association (2019) 'Classification and Diagnosis of Diabetes: Standards of

Medical Care in Diabetes—2019', Diabetes Care, 42(Supplement 1), pp. S13-S28.

Atkinson, M.A. and Eisenbarth, G.S. (2001) 'Type 1 diabetes: new perspectives on disease

pathogenesis and treatment', The Lancet, 358(9277), pp. 221-229.

Barrett, J.C., Clayton, D.G., Concannon, P., Akolkar, B., Cooper, J.D., Erlich, H.A., Julier, C.,

Morahan, G., Nerup, J., Nierras, C., Plagnol, V., Pociot, F., Schuilenburg, H., Smyth, D.J.,

Stevens, H., Todd, J.A., Walker, N.M. and Rich, S.S. (2009) 'Genome-wide association study

and meta-analysis find that over 40 loci affect risk of type 1 diabetes', Nature genetics, 41(6), pp.

703-707.

Bates, D., Machler, M., Bolker, B. and Walker, S. (2015) 'Fitting Linear Mixed-Effects Models

Using lme4', Journal of Statistical Software, 67(1), pp. 1-48.

Bebu, I., Braffett, B.H., Orchard, T.J., Lorenzi, G.M. and Lachin, J.M. (2019) 'Mediation of the

Effect of Glycemia on the Risk of CVD Outcomes in Type 1 Diabetes: The DCCT/EDIC Study',

Diabetes Care, pp. 10.2337/dc18-1613.

Bland, J.M. and Altman, D.G. (1999) 'Measuring agreement in method comparison studies', Stat

Methods Med Res, 8(2), pp. 135-60.

Boland, B.B., Rhodes, C.J. and Grimsby, J.S. (2017) 'The dynamic plasticity of insulin

production in β-cells', Molecular Metabolism, 6(9), pp. 958-973.

Buzzetti, R., Quattrocchi, C.C. and Nisticò, L. (1998) 'Dissecting the genetics of Type 1

diabetes: relevance for familial clustering and differences in incidence', Diabetes/Metabolism

Reviews, 14(2), pp. 111-128.

Cheung, N., Mitchell, P. and Wong, T.Y. (2010) 'Diabetic retinopathy', The Lancet, 376(9735),

pp. 124-136.

DCCT Research Group (1986) 'The Diabetes Control and Complications Trial (DCCT). Design

and Methodologic Considerations for the Feasibility Phase.', Diabetes, 35(5), pp. 530-545.

DCCT/EDIC Research Group (2007) 'Long-term effect of diabetes and its treatment on cognitive

function', New England Journal of Medicine, 356(18), pp. 1842-1852.

DCCT/EDIC Research Group (2011) 'Intensive Diabetes Therapy and Glomerular Filtration Rate

in Type 1 Diabetes', New England Journal of Medicine, 365, pp. 2366-2376.

DCCT/EDIC Research Group (2014a) 'Diabetic retinopathy and other ocular findings in the

diabetes control and complications trial/epidemiology of diabetes interventions and

complications study.', Diabetes Care, 37(1), pp. 17-23.

DCCT/EDIC Research Group (2014b) 'Effect of intensive diabetes treatment on albuminuria in

type 1 diabetes: long-term follow-up of the Diabetes Control and Complications Trial and

Epidemiology of Diabetes Interventions and Complications study', Lancet Diabetes &

Endocrinology, 2(10), pp. 793-800.

de Boer, I.H. and DCCT/EDIC Research Group (2014) 'Kidney disease and related findings in

the diabetes control and complications trial/epidemiology of diabetes interventions and

complications study.', Diabetes Care, 37(1), pp. 24-30.

EDIC Research Group (1999) 'Design, implementation, and preliminary results of a long-term

follow-up of the Diabetes Control and Complications Trial cohort.', Diabetes Care, 22(1), pp.

99-111.

Enders, C.K. (2011) 'Missing Not at Random Models for Latent Growth Curve Analyses',

American Psychological Association, 16(1), pp. 1-16.

Fiero, M.H., Hsu, C.H. and Bell, M.L. (2017) 'A pattern-mixture model approach for handling

missing continuous outcome data in longitudinal cluster randomized trials', Stat Med, 36(26), pp.

4094-4105.

Fletcher Mercaldo, S. and Blume, J.D. (2018) 'Missing data and prediction: the pattern

submodel', Biostatistics.

Ghaderian, S.B., Hayati, F., Shayanpour, S. and Beladi Mousavi, S.S. (2015) 'Diabetes and end-

stage renal disease; a review article on new concepts', Journal of renal injury prevention, 4(2),

pp. 28-33.

Greenbaum, C.J., Harrison, L.C., Wentworth, J.M., Elkassaby, S. and Fourlanos, S. (2008)

'Reappraising the Stereotypes of Diabetes', Diabetes: Translating Research into Practice. New

York: Informa Healthcare USA.

Haas, M.E., Aragam, K.G., Emdin, C.A., Bick, A.G., Hemani, G., Davey Smith, G. and

Kathiresan, S. (2018) 'Genetic Association of Albuminuria with Cardiometabolic Disease and

Blood Pressure', American Journal of Human Genetics, 103(4), pp. 461-473.

Henderson, C.R. (1950) 'Estimation of Genetic Parameters', Annals of Mathematical Statistics,

21, pp. 309-310.

Hickey, G.L., Philipson, P., Jorgensen, A. and Kolamunnage-Dona, R. (2018) 'joineRML: a joint

model and software package for time-to-event and multivariate longitudinal outcomes', BMC

Med Res Methodol, 18(1), pp. 50.

Ikram, M.A., Brusselle, G., Murad, S.D., van Duijn, C.M., Franco, O.H., Goedegebure, A.,

Klaver, C.C.W., Nijsten, T., Peeters, R.P., Stricker, B.H., Tiemeier, H., Uitterlinden, A.G.,

Vernooij, M.W., Hofman, A. and Hofman, A. (2017) 'The Rotterdam Study: 2018 update on

objectives, design and main results', European Journal of Epidemiology, 32(9), pp. 807-850.

Kraft, P., Yen, Y.C., Stram, O.D., Morrison, J. and Gauderman, W.J. (2007) 'Exploiting Gene-

Environment Interaction to Detect Genetic Associations', Human Heredity, 63, pp. 111-119.

Lachin, J.M. and DCCT/EDIC Research Group (2016) 'Risk Factors for Cardiovascular Disease

in Type 1 Diabetes', Diabetes, 65(5), pp. 1370-1379.

Lee, S.H. and van der Werf, J.H.J. (2016) 'MTG2: An efficient algorithm for multivariate linear

mixed model analysis based on genomic information', Bioinformatics, 32(9), pp. 1420-1422.

Levey, A.S., Stevens, L.A., Schmid, C.H., Zhang, Y.L., Castro, A.F., Feldman, H.I., Kusek,

J.W., Eggers, P., Lente, F.V., Greene, T. and Coresh, J. (2009) 'A new equation to estimate

glomerular filtration rate', Annals of Internal Medicine, 150(9), pp. 604-612.

Little, R.J.A. and Rubin, D.B. (1987) Statistical analysis with missing data. New York: John

Wiley & Sons.

Ma, C., Blackwell, T., Boehnke, M., Scott, L.J. and Go, T.D.i. (2013) 'Recommended joint and

meta-analysis strategies for case-control association testing of single low-count variants', Genet

Epidemiol, 37(6), pp. 539-50.

Ma, Y., Mazumdar, M. and Memtsoudis, S.G. (2012) 'Beyond Repeated-Measures Analysis of

Variance: Advanced Statistical Methods for the Analysis of Longitudinal Data in Anesthesia

Research', Regional Anesthesia and Pain Medicine, 37(1), pp. 99-105.

Martin, C.L., Albers, J.W., Pop-Busui, R. and DCCT/EDIC Research Group (2014) 'Neuropathy

and related findings in the diabetes control and complications trial/epidemiology of diabetes

interventions and complications study.', Diabetes Care, 37(1), pp. 31-38.

Mohlke, K.L. and Lindgren, C.M. (2014) 'Genome-Wide Association Studies of Obesity and

Related Traits', Type 2 Diabetes and Related Traits. Oxford: Karger, pp. 58-70.

Nathan, D.M. (2014) 'The Diabetes Control and Complications Trial/Epidemiology of Diabetes

Interventions and Complications Study at 30 Years: Overview', Diabetes Care, 37(1), pp. 9-16.

National Kidney Foundation (2002) 'K/DOQI clinical practice guidelines for chronic kidney

disease: Evaluation, classification, and stratification', American Journal of Kidney Diseases,

39(suppl 1), pp. S1-S266.

Nejentsev, S., Walker, N., Riches, D., Egholm, M. and Todd, J.A. (2009) 'Rare variants of

IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes', Science,

324(5925), pp. 387–389.

Ning, C., Wang, D., Zhou, L., Wei, J., Liu, Y., Kang, H., Zhang, S., Zhou, X., Xu, S. and Liu,

J.F. (2019) 'Efficient Multivariate Analysis Algorithms for Longitudinal Genome-wide

Association Studies', Bioinformatics.

Orchard, T.J. and DCCT/EDIC Research Group (2015) 'Association between seven years of

intensive treatment of type 1 diabetes and long time mortality', Journal of the American Medical

Association, 313(1), pp. 45-53.

Paterson, A.D., Waggott, D., Boright, A.P., Hosseini, S.M., Shen, E., Sylvestre, M.P., Wong, I.,

Bharaj, B., Cleary, P.A., Lachin, J.M., Below, J.E., Nicolae, D., Cox, N.J., Canty, A.J., Sun, L.

and Bull, S.B. (2010) 'A Genome-Wide Association Study Identifies a Novel Major Locus for

Glycemic Control in Type 1 Diabetes, as Measured by Both A1C and Glucose', Diabetes, 59(2),

pp. 539-549.

Persad, P.J., Heid, I.M., Weeks, D.E., Baird, P.N., de Jong, E.K., Haines, J.L., Pericak-Vance,

M.A., Scott, W.K. and International Age-Related Macular Degeneration Genomics, C. (2017)

'Joint Analysis of Nuclear and Mitochondrial Variants in Age-Related Macular Degeneration

Identifies Novel Loci TRPM1 and ABHD2/RLBP1', Invest Ophthalmol Vis Sci, 58(10), pp.

4027-4038.

Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D. and R Core Team (2019) 'nlme: Linear and

Nonlinear Mixed Effects Models'.

Pruim, R.J., Welch, R.P., Sanna, S., Teslovich, T.M., Chines, P.S., Gliedt, T.P., Boehnke, M.,

Abecasis, G.R. and Willer, C.J. (2010) 'LocusZoom: Regional visualization of genome-wide

association scan results', Bioinformatics, 26(18), pp. 2336-2337.

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.R., Bender, D., Maller, J.,

Sklar, P., de Bakker, P.I.W., Daly, M.J. and Sham, P.C. (2007) 'PLINK: a toolset for whole-

genome association and population-based linkage analysis.', American Journal of Human

Genetics, 81(3), pp. 559-575.

Radha, V. and Mohan, V. (2017) 'Genetic Basis of Monogenic Diabetes', Current Science,

113(07), pp. 1277-1286.

Razmaria, A.A. (2015) 'Diabetes Neuropathy', Journal of the American Medical Association,

314(20), pp. 2202.

Revelle, W. (2018) 'psych: Procedures for Psychological, Psychometric, and Personality

Research'.

Roshandel, D., Gubitosi-Klug, R., Bull, S.B., Canty, A.J., Pezzolesi, M.G., King, G.L., Keenan,

H.A., Snell-Bergeon, J.K., Maahs, D.M., Klein, R., Klein, B.E.K., Orchard, T.J., Costacou, T.,

Weedon, M.N., Oram, R.A. and Paterson, A.D. (2018) 'Meta-genome-wide association studies

identify a locus on chromosome 1 and multiple variants in the MHC region for serum C-peptide

in type 1 diabetes', Diabetologia, 61(5), pp. 1098-1111.

Sandholm, N., Haukka, J.K., Toppila, I., Valo, E., Harjutsalo, V., Forsblom, C. and Groop, P.H.

(2018) 'Confirmation of GLRA3 as a susceptibility locus for albuminuria in Finnish patients with

type 1 diabetes', Scientific Reports, 8(1), pp. doi: 10.1038/s41598-018-29211-1.

Secrest, A.M., Becker, D.J., Kelsey, S.F., LaPorte, R.E. and Orchard, T.J. (2010) 'Cause-Specific

Mortality Trends in a Large Population-Based Cohort With Long-Standing Childhood-Onset

Type 1 Diabetes', Diabetes, 59(12), pp. 3216-3222.

Shamoon, H., Duffy, H., Fleischer, N., Engel, S., Saenger, P., Strelzyn, M., Litwak, M., Wylie-

Rosett, J., Farkash, A., Geiger, D., Engel, H., Fleischman, J., Pompi, D., Ginsberg, N., Glover,

M., Brisman, M., Walker, E., Thomashunis, A. and Gonzalez, J. (1993) 'The effect of intensive

treatment of diabetes on the development and progression of long-term complications in insulin-

dependent diabetes mellitus.', The New England Journal of Medicine, 329(14), pp. 977-986.

Sikorska, K., Lesaffre, E., Groenen, P.F.J. and Eilers, P.H.C. (2013a) 'GWAS on your notebook:

fast semi-parallel linear and logistic regression for genome-wide association studies', BMC

Bioinformatics, 14(1), pp. 166-166.

Sikorska, K., Lesaffre, E., Groenen, P.J.F., Rivadeneira, F. and Eilers, P.H.C. (2018) 'Genome-

wide Analysis of Large-scale Longitudinal Outcomes using Penalization —GALLOP algorithm',

Scientific Reports, 8, pp. 6518.

Sikorska, K., Montazeri, N.M., Uitterlinden, A.G., Rivadeneira, F., Eilers, P.H.C. and Lesaffre,

E. (2015) 'GWAS with longitudinal phenotypes: performance of approximate procedures',

European Journal of Human Genetics, 23(10), pp. 1384-1391.

Sikorska, K., Rivadeneira, F., Groenen, P.J.F., Hofman, A., Uitterlinden, A.G., Eilers, P.H.C.

and Lesaffre, E. (2013b) 'Fast linear mixed model computations for genome-wide association

studies with longitudinal data', Statistics in Medicine, 32(1), pp. 165-180.

Task Force on Diabetes and Cardiovascular Diseases of the European Society of Cardiology and

European Association for the Study of Diabetes (2007) 'Guidelines on diabetes, pre-diabetes and

cardiovascular diseases', European Heart Journal, 60(4), pp. 88-136.

Thomas, S. and Karalliedde, J. (2019) 'Diabetic nephropathy', Medicine, 47(2), pp. 86-91.

Todd, J.A., Bell, J.I. and McDevitt, H.O. (1987) 'HLA-DQ beta gene contributes to susceptibility

and resistance to insulin-dependent diabetes mellitus', Nature, 329(6140), pp. 599-604.

van de Bunt, M., Moran, I., Ferrer, J. and McCarthy, M. (2014) 'Insights into β-cell biology and

type 2 diabetes pathogenesis from studies of the islet transcriptome', Genetics in Diabetes, pp.

111-121.

Verbeke, G., Fieuws, S., Molenberghs, G. and Davidian, M. (2014) 'The analysis of multivariate

longitudinal data: A review', Statistical Methods in Medical Research, 23(1), pp. 42-59.

Verbeke, G. and Molenberghs, G. (2000) Linear Mixed Models for Longitudinal Data. New

York: Springer.

World Health Organization (2018) Diabetes.

Yamaguchi, Y., Ueno, M., Maruo, K. and Gosho, M. (2019) 'Multiple imputation for

longitudinal data in the presence of heteroscedasticity between treatment groups', J Biopharm

Stat, pp. 1-19.

Younes, N., Cleary, P.A., Steffes, M.W., der Boer, I.H., Molitch, M.E., Rutledge, B.N., Lachin,

J.M. and Dahms, W. (2010) 'Comparison of urinary albumin-creatinine ratio and albumin

excretion rate in the diabetes control and complications trial/epidemiology of diabetes

interventions and complications study', Clinical Journal of The American Society of Nephrology,

5(7), pp. 1235-1242.

Young, A.I., Wauthier, F.L. and Donnelly, P. (2018) 'Identifying loci affecting trait variability

and detecting interactions in genome-wide association studies', Nat Genet, 50(11), pp. 1608-

Zhang, T. and Sun, L. (2019) 'Beyond the traditional simulation design for evaluating type 1

error control: From the "theoretical" null to "empirical" null', Genetic Epidemiology, 43, pp. 166-

Zheng, Z.L., Zhu, Z.H., Wu, Y.D., Kemper, K.E., Lloyd-Jones, L.R., McRae, A.F., Xue, A.,

Sidorenko, J., Visscher, P.M., Zhang, F.T. and Zeng, J. (2018) 'Genome-wide association

analyses identify 143 risk variants and putative regulatory mechanisms for type 2 diabetes',

Nature Communications, 9, pp. 2941.

Zhou, X. and Stephens, M. (2014) 'Efficient multivariate linear mixed model algorithms for

genome-wide association studies', Nature Methods, 11(4), pp. 407-409.
