Jian Yang - Mixed linear model analyses of human complex traits using SNP data
DESCRIPTION
Most traits and common diseases in humans, such as height, cognitive ability, psychiatric disorders and obesity, are influenced by many genes and their interplay with environmental factors. These diseases/traits are called “complex” traits to differentiate them from “Mendelian” traits that are caused by single genes. Understanding the genetic architecture of human complex traits is essential to diagnosis, prevention and the discovery of new drug targets: for example, how much of the difference between people’s susceptibility to disease is accounted for by differences in their DNA sequence, how many genes are involved in the etiology of a disease, where those genes are located, and how large their effects on disease risk are. To date, thousands of gene loci, as represented by single nucleotide polymorphisms (SNPs), have been identified as associated with hundreds of human complex traits by the genome-wide association study (GWAS) technique. In this lecture, I will introduce the use of mixed linear models in the analysis of GWAS data: to estimate the proportion of variance for a trait that can be explained by all SNPs (the SNP heritability), to quantify the extent to which two traits (or diseases) share a common genetic basis (the genetic correlation) using all SNPs, and to control for population structure in genome-wide association analyses of individual SNPs. First presented at the 2014 Winter School in Mathematical and Computational Biology http://bioinformatics.org.au/ws14/program/
TRANSCRIPT
Mixed linear model analyses of human complex traits using SNP data
Jian Yang
Queensland Brain Institute
The University of Queensland
Why do we need a mixed linear model?
Linear model
• $y = b_0 + x_1 b_1 + x_2 b_2 + \dots + x_p b_p + e$
  $y$ = phenotype; $x_i$ = independent variable
  $b_0$ = mean term; $b_1, \dots, b_p$ = effect sizes (regression coefficients)
  $e$ = residual, $e \sim N(0, \sigma^2_e)$
  $y \sim N(b_0 + x_1 b_1 + x_2 b_2 + \dots + x_p b_p, \sigma^2_e)$
Linear model
• In matrix form: $y = Xb + e$
  $y = \{y_j\}_{n \times 1}$; $X = \{x_{ij}\}_{n \times p}$; $b = \{b_i\}_{p \times 1}$; $e = \{e_j\}_{n \times 1}$
• Estimation
  $\hat{b} = (X^T X)^{-1} X^T y$
  $\operatorname{var}(\hat{b}) = \sigma^2_e (X^T X)^{-1}$
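The estimation step can be sketched in a few lines of NumPy. This is only an illustration: the function name `ols` and the toy data are made up here, and the residual variance is estimated as RSS / (n - p), which the slide does not spell out.

```python
import numpy as np

def ols(X, y):
    """Least squares: b_hat = (X'X)^-1 X'y and var(b_hat) = s2e * (X'X)^-1."""
    n, p = X.shape
    XtX = X.T @ X
    b_hat = np.linalg.solve(XtX, X.T @ y)   # solve the normal equations
    resid = y - X @ b_hat
    s2e = resid @ resid / (n - p)           # assumed estimator of sigma^2_e
    var_b = s2e * np.linalg.inv(XtX)
    return b_hat, var_b

# Toy data (hypothetical): an intercept column plus one covariate, y = 1 + 2x
x = np.arange(5.0)
X = np.column_stack([np.ones(5), x])
y = 1.0 + 2.0 * x
b_hat, var_b = ols(X, y)
```

Because the toy data lie exactly on a line, `b_hat` recovers the intercept and slope used to build `y`.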
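Special cases
• Simple regression: $y = b_0 + x_1 b_1 + e$
  $\hat{b}_1 = x_1^T y / (x_1^T x_1)$ (with $x_1$ and $y$ mean-centred)
  $E(\hat{b}_1) = b_1 = \operatorname{cov}(x_1, y) / \operatorname{var}(x_1)$
  $\operatorname{var}(\hat{b}_1) = \sigma^2_e / [n \operatorname{var}(x_1)]$
• Conditional analysis: $y \mid b_2, \dots, b_p = b_0 + x_1 b_1 + e$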
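A small numerical check of the simple-regression case, using made-up values. It assumes that x1 and y are mean-centred before the matrix formula is applied, and that var(x1) is the population-style variance (dividing by n).

```python
import numpy as np

# Hypothetical toy data: genotype-like predictor and a trait
x1 = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 2.0])
y = np.array([1.0, 2.1, 2.9, 1.8, 1.2, 3.1])
n = len(y)

# Matrix form on centred variables: x1'y / (x1'x1)
xc = x1 - x1.mean()
yc = y - y.mean()
b1_matrix = (xc @ yc) / (xc @ xc)

# Moment form: cov(x1, y) / var(x1), both computed with divisor n
b1_moment = np.cov(x1, y, ddof=0)[0, 1] / np.var(x1)

# The denominator of var(b1_hat): n * var(x1) equals the centred sum of squares
denom_moment = n * np.var(x1)
denom_matrix = xc @ xc
```

The two slope estimates agree, and so do the two forms of the denominator in the sampling variance.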
Limitations
• n > p: the sample size needs to be larger than the number of parameters
• All the effect sizes are treated as fixed (we have no idea about the variation in effect sizes)
• What if n << p?
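To see why n << p is a problem for the fixed-effects model, the sketch below builds a random design matrix with far fewer rows than columns (the sizes are arbitrary) and checks the rank of X'X. When the rank is below p, X'X has no inverse, so the least-squares formula cannot be applied.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 20                      # hypothetical sizes with n much smaller than p
X = rng.standard_normal((n, p))

XtX = X.T @ X                     # p x p matrix in the normal equations
rank = np.linalg.matrix_rank(XtX)
is_invertible = rank == p         # False here: the rank cannot exceed n
```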
What is a mixed linear model (MLM)?
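• $y = Xb + Zu + e$
  Fixed effects: $b$ (special case: $X = 1$ and $b = b_0$)
  Random effects: $u = \{u_i\}$, $u \sim N(0, \sigma^2_u A)$
  $A$ = correlation matrix between $u_i$ and $u_j$
  $E(y) = Xb$
  $\operatorname{var}(y) = V = Z A Z^T \sigma^2_u + I \sigma^2_e$
Parameter estimation
• Estimation of variance components ($\sigma^2_u$, $\sigma^2_e$) by maximising the restricted log-likelihood
  $\log L = -\tfrac{1}{2}\left(\log|V| + \log|X^T V^{-1} X| + y^T P y\right)$
  $P = V^{-1} - V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1}$
• Prediction of random effects ($u$)
  $\hat{u} = \hat{\sigma}^2_u A Z^T P y$ (which reduces to $\hat{\sigma}^2_u Z^T P y$ when $A = I$)
• Estimation of fixed effects ($b$)
  $\hat{b} = (X^T V^{-1} X)^{-1} X^T V^{-1} y$
  Compare with the linear model: $\hat{b} = (X^T X)^{-1} X^T y$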
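The quantities on this slide can be evaluated directly for given variance components. The sketch below is an assumption-laden illustration, not the lecture's software: `mlm_pieces` is a hypothetical helper, the data are random, and the variance components are simply supplied rather than maximised over. It uses the generalised least squares form b_hat = (X'V^-1 X)^-1 X'V^-1 y and the general predictor u_hat = s2u A Z'Py.

```python
import numpy as np

def mlm_pieces(y, X, Z, A, s2u, s2e):
    """Evaluate the restricted log-likelihood (up to a constant), b_hat and u_hat."""
    n = len(y)
    V = s2u * Z @ A @ Z.T + s2e * np.eye(n)          # var(y)
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    P = Vi - Vi @ X @ np.linalg.solve(XtViX, X.T @ Vi)
    logL = -0.5 * (np.linalg.slogdet(V)[1]
                   + np.linalg.slogdet(XtViX)[1]
                   + y @ P @ y)
    b_hat = np.linalg.solve(XtViX, X.T @ Vi @ y)     # GLS estimate of fixed effects
    u_hat = s2u * A @ Z.T @ P @ y                    # predicted random effects
    return logL, b_hat, u_hat, P

# Hypothetical toy data: intercept-only fixed effect, three random effects
rng = np.random.default_rng(1)
n, q = 8, 3
X = np.ones((n, 1))
Z = rng.standard_normal((n, q))
A = np.eye(q)
y = rng.standard_normal(n)
logL, b_hat, u_hat, P = mlm_pieces(y, X, Z, A, s2u=0.5, s2e=1.0)
```

In practice the variance components would be found by maximising `logL` over (s2u, s2e), for example with an iterative REML algorithm. Setting `s2u=0` reduces `b_hat` to the ordinary least-squares estimate, which for this intercept-only design is the mean of `y`.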
MLM analysis of human complex traits
• Animal and plant breeding
  – predicting breeding values
  – linkage mapping (QTL mapping)
• Human genetics (before 2007)
  – pedigree-based analysis of variance (heritability)
  – linkage mapping
• Human genetics (after 2007)
  – estimating SNP-based heritability
  – association analysis
  – genetic risk prediction
Background
Mendelian traits: Cystic fibrosis
Complex traits: Human height, Schizophrenia, Obesity
Major questions
• Are these traits heritable?
• If so, what is the heritability?
• How many genes are involved and where are they located?
Risk of schizophrenia (%) [figure]: heritability = ~80%
Resemblance between twins for human height [figure]: heritability = ~80%
Resemblance between relatives for body mass index (BMI): heritability = 40%~60%
Relatedness    Correlation
Full-sibs      0.36
Father-son     0.28
Complex traits such as height, BMI and schizophrenia (SCZ) are highly heritable.
Identifying genes underlying complex traits
8 genes for human complex traits before 2002
Glazier et al. 2002 Science
[Figure: values shown are 1700, 30 and 8; labels not recoverable]
Genome-wide Association Study (GWAS)
Manolio 2010 NEJM
Genome-wide threshold: $P = 5 \times 10^{-8}$
Linear model (simple regression)
$y = b_0 + x_1 b_1 + e$
$y$ = trait value; $x_1$ = SNP genotype (0, 1 or 2)
$\hat{b}_1 = x_1^T y / (x_1^T x_1) = \operatorname{cov}(x_1, y) / \operatorname{var}(x_1)$
$SE^2(\hat{b}_1) = \sigma^2_e / [n \operatorname{var}(x_1)]$
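A single-SNP scan of this kind can be sketched as follows. Everything here is illustrative: `gwas_scan` is a hypothetical helper, the genotypes and the one causal effect are simulated, p-values use a normal approximation to the test statistic, and every SNP is assumed to be polymorphic (otherwise var(x1) is zero).

```python
import math
import numpy as np

def gwas_scan(G, y):
    """Regress y on each SNP column of G in turn; return slopes, SEs and p-values."""
    n, m = G.shape
    Gc = G - G.mean(axis=0)                   # centre each SNP
    yc = y - y.mean()
    sxx = (Gc ** 2).sum(axis=0)               # n * var(x1) per SNP
    b = Gc.T @ yc / sxx                       # b1_hat = cov(x1, y) / var(x1)
    rss = (yc ** 2).sum() - b ** 2 * sxx      # residual sum of squares per SNP
    se = np.sqrt(rss / (n - 2) / sxx)         # SE(b1_hat)
    z = b / se
    p = np.array([math.erfc(abs(zi) / math.sqrt(2.0)) for zi in z])  # two-sided
    return b, se, p

# Simulated toy data: 50 SNPs coded 0/1/2, only the first one affects the trait
rng = np.random.default_rng(42)
n, m = 2000, 50
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)
y = 0.5 * G[:, 0] + rng.standard_normal(n)

b, se, p = gwas_scan(G, y)
hits = np.flatnonzero(p < 5e-8)               # genome-wide threshold from the slide
```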
An explosion of gene discoveries
~5000 genetic variants associated with ~650 traits / diseases
Glazier et al. 2002 Science
[Figure: number of SNPs (0 to 6000) by year, 2006 to first half of 2013; labels "Prior to GWAS" and "GWAS"]
The missing heritability problem
Height (Lango Allen et al. 2010 Nature):
• 180 loci
• ~180K samples
• < 10% of variance explained
• heritability = ~80%
BMI (Speliotes et al. 2010 Nat Genet):
• 32 loci
• ~250K samples
• ~1% of variance explained
• heritability = 40% ~ 60%
Schizophrenia (Ripke et al. 2013 Nat Genet):
• 22 loci
• ~21K cases / ~38K controls
• < 3% of variance explained
• heritability = ~80%
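Fitting all SNPs in an MLM
• $y = Wu + e$
  $W = \{w_{ij}\}_{n \times m}$, $w_{ij}$ = standardised SNP genotype
  $u \sim N(0, I\sigma^2_u)$
  $\operatorname{var}(y) = W W^T \sigma^2_u + I\sigma^2_e$
  variance explained = $m\sigma^2_u / (m\sigma^2_u + \sigma^2_e)$
• Equivalently, $\operatorname{var}(y) = (1/m) W W^T (m\sigma^2_u) + I\sigma^2_e$
  $A = W W^T / m$ = genetic relationship matrix
• Let $g = Wu$
  $y = g + e$
  $g \sim N(0, A\sigma^2_g)$, with $\sigma^2_g = m\sigma^2_u$
  $\operatorname{var}(y) = A\sigma^2_g + I\sigma^2_e$
  variance explained = $\sigma^2_g / (\sigma^2_g + \sigma^2_e)$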
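The equivalence between the SNP-effect form and the relationship-matrix form can be checked numerically. The sketch below assumes a common standardisation, w_ij = (x_ij - 2p_i) / sqrt(2p_i(1 - p_i)) with p_i the sample allele frequency, which the slide does not state explicitly. Genotypes, sample sizes and variance components are made up.

```python
import numpy as np

def standardise(G):
    """Standardise 0/1/2 genotypes column by column (assumed scaling, see lead-in)."""
    p = G.mean(axis=0) / 2.0                          # sample allele frequencies
    return (G - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))

# Simulated toy genotypes: n individuals, m SNPs
rng = np.random.default_rng(0)
n, m = 50, 200
freqs = rng.uniform(0.2, 0.5, size=m)
G = rng.binomial(2, freqs, size=(n, m)).astype(float)

W = standardise(G)
A = W @ W.T / m                                       # genetic relationship matrix

s2u, s2e = 0.002, 0.6                                 # hypothetical variance components
V_snp = W @ W.T * s2u + s2e * np.eye(n)               # var(y) in the SNP-effect form
s2g = m * s2u
V_grm = A * s2g + s2e * np.eye(n)                     # var(y) in the GRM form
h2_snp = s2g / (s2g + s2e)                            # variance explained by all SNPs
```

Working with the n x n matrix `A` rather than the n x m matrix `W` is what makes the model tractable when m is much larger than n.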
Reconciling family studies and GWAS
• Family studies: comparing phenotypic similarity to family relatedness
  – Our method: comparing phenotypic similarity to genetic similarity (estimated from SNPs) in unrelated individuals
• GWAS: testing one SNP at a time in unrelated samples
  – Our method: estimating the contribution from all SNPs together
~50% of variation explained by all SNPs for height vs. ~10% from GWAS
GWAS vs all-SNP estimation
[Figure: GWAS vs. our method for height, schizophrenia and obesity (BMI); axis from 0% to 50%]
Yang et al. 2010 Nat Genet; Yang et al. 2011 Nat Genet; Lee et al. 2012 Nat Genet
Many genetic variants, each with a small effect, contribute to the trait variation
Genome partitioning
• Single-component MLM: $y = g + e$ (or $y = Wu + e$)
• Multi-component MLM: $y = g_1 + g_2 + \dots + g_{22} + e$
  $\operatorname{var}(y) = A_1\sigma^2_{g_1} + A_2\sigma^2_{g_2} + \dots + A_{22}\sigma^2_{g_{22}} + I\sigma^2_e$
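The multi-component covariance can be assembled from one relationship matrix per chromosome. The sketch below uses three made-up "chromosomes" instead of 22, simulated genotypes, and hypothetical helper names. It also checks one special case: if each chromosome's variance is proportional to its number of SNPs, the multi-component model collapses to the single-component one.

```python
import numpy as np

def standardise(G):
    """Standardise 0/1/2 genotypes using sample allele frequencies (assumed scaling)."""
    p = G.mean(axis=0) / 2.0
    return (G - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))

def chromosome_grms(W, chrom):
    """One GRM per chromosome label, each scaled by its own SNP count."""
    grms, counts = [], []
    for c in np.unique(chrom):
        Wc = W[:, chrom == c]
        grms.append(Wc @ Wc.T / Wc.shape[1])
        counts.append(Wc.shape[1])
    return grms, np.array(counts)

def partitioned_V(grms, s2g, s2e):
    """var(y) = sum_c A_c * s2g_c + I * s2e."""
    V = s2e * np.eye(grms[0].shape[0])
    for A_c, s2 in zip(grms, s2g):
        V = V + s2 * A_c
    return V

# Simulated toy data: 120 SNPs spread over three hypothetical chromosomes
rng = np.random.default_rng(7)
n, m = 30, 120
G = rng.binomial(2, 0.4, size=(n, m)).astype(float)
chrom = np.repeat([1, 2, 3], [60, 40, 20])

W = standardise(G)
grms, counts = chromosome_grms(W, chrom)

s2g_total, s2e = 0.5, 0.5
s2g_by_chrom = s2g_total * counts / m          # variance proportional to SNP count
V_multi = partitioned_V(grms, s2g_by_chrom, s2e)
V_single = s2g_total * (W @ W.T / m) + s2e * np.eye(n)
```

In a real analysis the per-chromosome variances are estimated jointly rather than fixed in advance, which is what allows them to be compared against chromosome length.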
Partitioning the genetic variance into individual chromosomes
[Figure: heritability attributed to each chromosome (1 to 22) plotted against chromosome length (Mb); panels: Schizophrenia, Height, BMI. Sample sizes shown: ~12,000 individuals; 9000 cases and 12,000 controls; ~25,000 individuals]
Yang et al. 2011 Nat Genet; Lee et al. 2012 Nat Genet; Yang et al. unpublished
Genetic variants distributed across the whole genome
Partitioning the genetic variance based on functional annotation
[Figure: pie charts for Schizophrenia (CNS+ genes, other genes, intergenic), Height and BMI (genic estimate vs. intergenic estimate). Values shown: 30%, 35%, 35%; 83%, 17%; 68%, 32%]
Yang et al. 2011 Nat Genet; Lee et al. 2012 Nat Genet
Genetic signals are enriched in or close to functional genes
More …
• Bivariate analysis
  – estimating the genetic correlation between two traits or two diseases using SNP data (Deary et al. 2012 Nature; Lee et al. 2013 Nat Genet)
• Fitting a mixture distribution rather than a single normal distribution to the random effects
  – e.g. Zhou et al. 2013 PLoS Genet
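For orientation, one common way to write the bivariate model is given below. The notation is an assumption layered on the single-trait model from the earlier slides, not something shown in this deck: subscripts 1 and 2 index the two traits, $A_{12}$ holds the SNP-derived relationships between the individuals measured for trait 1 and those measured for trait 2, and the residual covariance is omitted, which presumes the two traits are measured on different individuals.

```latex
\begin{aligned}
y_1 &= X_1 b_1 + g_1 + e_1, \qquad y_2 = X_2 b_2 + g_2 + e_2, \\
\operatorname{var}\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
&= \begin{pmatrix}
A_{11}\sigma^2_{g_1} + I\sigma^2_{e_1} & A_{12}\sigma_{g_1 g_2} \\
A_{21}\sigma_{g_1 g_2} & A_{22}\sigma^2_{g_2} + I\sigma^2_{e_2}
\end{pmatrix}, \\
r_g &= \frac{\sigma_{g_1 g_2}}{\sqrt{\sigma^2_{g_1}\,\sigma^2_{g_2}}}.
\end{aligned}
```

The genetic correlation $r_g$ is the genetic covariance scaled by the two genetic standard deviations, so it lies between -1 and 1.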
Linear model (simple regression based association test)
$y = b_0 + x_1 b_1 + e$
$y$ = trait value; $x_1$ = SNP genotype (0, 1 or 2)
$\hat{b}_1 = x_1^T y / (x_1^T x_1) = \operatorname{cov}(x_1, y) / \operatorname{var}(x_1)$
$SE^2(\hat{b}_1) = \sigma^2_e / [n \operatorname{var}(x_1)]$
Assumption: e is independent and identically distributed
Issues:
1) Relatedness: there are relatives in the sample, leading to an inflated false positive rate
2) Population stratification: individuals of different ancestries, leading to spurious association; e.g. trait = eating with chopsticks, data = a random sample of the US population.
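The chopsticks example can be reproduced with a tiny made-up data set: the trait depends only on which of two populations a person belongs to, and the SNP simply differs in frequency between the populations. Adding a population covariate, as done below, is a simplification used only to expose the confounding; the slides themselves go on to an MLM-based solution.

```python
import numpy as np

def slope(x, y):
    """Simple-regression slope cov(x, y) / var(x)."""
    xc = x - x.mean()
    return xc @ (y - y.mean()) / (xc @ xc)

# Hypothetical sample: six individuals from each of two populations
pop = np.array([0] * 6 + [1] * 6)
x = np.array([0, 0, 0, 1, 1, 2,      # SNP is rarer in population 0
              0, 1, 1, 2, 2, 2],     # and more common in population 1
             dtype=float)
y = pop.astype(float)                # trait determined by population alone

pooled = slope(x, y)                                     # spurious association
within = [slope(x[pop == k], y[pop == k]) for k in (0, 1)]

# Regress on the SNP with population as a covariate: y = b0 + pop*b_pop + x*b1 + e
X = np.column_stack([np.ones(len(y)), pop, x])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
adjusted = coef[2]
```

The pooled slope is 0.25 even though the SNP has no effect within either population, and it vanishes once population membership is accounted for.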
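Population stratification estimated from SNP data [figure]
Solution: MLM based association analysis
• $y = Xb + Zu + e$ or $y = Xb + g + e$
  $V = \operatorname{var}(y) = A\sigma^2_g + I\sigma^2_e$
• Testing for fixed effects given sample structure
  $\hat{b} = (X^T V^{-1} X)^{-1} X^T V^{-1} y$
  $\operatorname{var}(\hat{b}) = (X^T V^{-1} X)^{-1}$
• Issue: a SNP is fitted twice (once as a fixed effect and once in the genetic relationship matrix).
  Solution: excluding the SNP from calculating the genetic relationship matrix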
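A minimal sketch of the MLM-based test with the candidate SNP left out of the relationship matrix. The helper `mlm_assoc`, the simulated data and the supplied variance components are all assumptions made for illustration; in practice the variance components would be estimated (for example by REML) rather than fixed, and the slides do not prescribe this particular implementation.

```python
import numpy as np

def mlm_assoc(y, x, W, j, s2g, s2e):
    """GLS test of SNP x given V = A*s2g + I*s2e, with A built without column j of W."""
    n, m = W.shape
    W_rest = np.delete(W, j, axis=1)          # drop the tested SNP from the GRM
    A = W_rest @ W_rest.T / (m - 1)
    V = s2g * A + s2e * np.eye(n)
    X = np.column_stack([np.ones(n), x])      # intercept plus the tested SNP
    ViX = np.linalg.solve(V, X)               # V^-1 X
    cov_b = np.linalg.inv(X.T @ ViX)          # var(b_hat) = (X'V^-1 X)^-1
    b_hat = cov_b @ (ViX.T @ y)               # (X'V^-1 X)^-1 X'V^-1 y
    return b_hat[1], np.sqrt(cov_b[1, 1])

# Simulated toy data: 40 individuals, 30 SNPs, trait unrelated to genotype
rng = np.random.default_rng(3)
n, m = 40, 30
G = rng.binomial(2, 0.4, size=(n, m)).astype(float)
W = (G - G.mean(axis=0)) / G.std(axis=0)      # standardised genotypes
y = rng.standard_normal(n)

b1, se1 = mlm_assoc(y, G[:, 0], W, j=0, s2g=0.5, s2e=0.5)
```

With `s2g=0` and `s2e=1` the function reduces to the simple-regression test, which gives a quick sanity check on the algebra.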
Software tool: http://gump.qimr.edu.au/gcta/
Acknowledgements
Complex Traits Genomics Group (UQ)
• Peter Visscher
• Naomi Wray
• Hong Lee
University of Melbourne
• Mike Goddard
QIMR cohort
• Nick Martin
• Grant Montgomery
GENEVA Consortium
• Teri Manolio
• Bruce Weir
dbGaP
The Australian Neurogenetics Conference at the Queensland Brain Institute (QBI), The University of Queensland, on September 11th and 12th, 2014
http://web.qbi.uq.edu.au/anc2014/