genome-wide association studiesadegenet.r-forge.r-project.org/files/glasgow2015... ·...
TRANSCRIPT
Genome-Wide Association Studies
Caitlin Collins, Thibaut Jombart
MRC Centre for Outbreak Analysis and Modelling
Imperial College London
Genetic data analysis using
30-10-2014
Outline
• Introduction to GWAS
• Study design o GWAS design
o Issues and considerations in GWAS
• Testing for association o Univariate methods
o Multivariate methods
• Penalized regression methods
• Factorial methods
2
Genomics & GWAS
3
The genomics revolution
• Sequencing technology o 1977 – Sanger
o 1995 – 1st bacterial genomes
• < 10,000
bases per day per machine
o 2003 – 1st human genome
• > 10,000,000,000,000
bases per day per machine
• GWAS publications o 2005 – 1st GWAS
o Age-related macular degeneration
o 2014 – 1,991 publications
o 14,342 associations
Genomics & GWAS 4
A few GWAS discoveries…
Genomics & GWAS 5
• Genome Wide Association Study o Looking for SNPs…
associated with a phenotype.
• Purpose: o Explain
• Understanding
• Mechanisms
• Therapeutics
o Predict • Intervention
• Prevention
• Understanding not required
So what is GWAS?
Genomics & GWAS 6
• Definition o Any relationship
between two measured quantities that renders them statistically dependent.
• Heritability o The proportion of
variance explained by genetics
o P = G + E + G*E • Heritability > 0
Association
Genomics & GWAS 7
p SNPs
n individuals
Cases
Controls
Genomics & GWAS 8
Why? • Environment, Gene-Environment interactions
• Complex traits, small effects, rare variants
• Gene expression levels
• GWAS methodology?
Genomics & GWAS 9
Study Design
10
• Case-Control o Well-defined “case”
o Known heritability
• Variations o Quantitative phenotypic data
• Eg. Height, biomarker concentrations
o Explicit models
• Eg. Dominant or recessive
GWAS design
Study Design 11
• Data quality o 1% rule
• Controlling for confounding o Sex, age, health profile
o Correlation with other variables
• Population stratification*
• Linkage disequilibrium*
Issues & Considerations
Study Design 12
*
*
• Problem o Violates assumed population homogeneity, independent
observations
• Confounding, spurious associations
o Case population more likely to be related than Control population
• Over-estimation of significance of associations
Population stratification
Study Design 13
• Definition o “Population stratification” =
population structure
o Systematic difference in allele frequencies btw. sub-populations…
• … possibly due to different ancestry
• Solutions
o Visualise
• Phylogenetics
• PCA
o Correct
• Genomic Control
• Regression on Principal
Components of PCA
Population stratification II
Study Design 14
• Definition o Alleles at separate loci are NOT
independent of each other
• Problem? o Too much LD is a problem
• noise >> signal
o Some (predictable) LD can be beneficial
• enables use of “marker” SNPs
Linkage disequilibrium (LD)
Study Design 15
Testing for Association
16
• Standard GWAS
o Univariate methods
• Incorporating interactions
o Multivariate methods
• Penalized regression methods (LASSO)
• Factorial methods (DAPC-based FS)
Methods for association testing
Testing for Association 17
• Variations o Testing
• Fisher’s exact test, Cochran-Armitage trend test, Chi-squared test, ANOVA
• Gold Standard—Fischer’s exact test
o Correcting
• Bonferroni
• Gold Standard—FDR
Univariate methods
Testing for Association 18
p SNPs
n individuals
Cases
Controls
• Approach o Individual test statistics
o Correction for multiple
testing
Strengths Weaknesses
• Straightforward
• Computationally fast
• Conservative
• Easy to interpret
• Multivariate system,
univariate framework
• Effect size of individual
SNPs may be too small
• Marginal effects of
individual SNPs ≠
combined effects
Univariate – Strengths & weaknesses
Testing for Association 19
What about interactions?
Testing for Association 20
• Epistasis o “Deviation from linearity under a
general linear model”
• With p predictors, there are:
• 𝑝𝑘
=𝑝𝑘
𝑘! k-way interactions
• p = 10,000,000 5 x 1011
o That’s 500 BILLION possible pair-wise interactions!
• Need some way to limit the number of pairwise interactions considered…
Interactions
Testing for Association 21
𝑌𝑖 = 𝑤0 + 𝑤1𝐴𝑖 + 𝑤2𝐵𝑖 +𝒘𝟑𝑨𝒊𝑩𝒊
Multivariate methods
Testing for Association 22
LASSO penalized regression
Ridge regression
Neural Networks Penalized Regression
Bayesian Approaches
Factorial Methods
Bayesian Epistasis
Association Mapping
Logic Trees
Modified Logic
Regression-Gene
Expression Programming Genetic Programming for
Association Studies
Logic feature selection Monte Carlo
Logic Regression Logic regression
Supervised-PCA
Sparse-PCA
DAPC-based FS
(snpzip)
Bayesian partitioning
The elastic net
Bayesian Logistic
Regression with
Stochastic Search
Variable Selection
Odds-ratio-
based MDR
Multi-factor
dimensionality reduction
method
Genetic programming
optimized neural
networks
Parametric
decreasing method
Restricted
partitioning method
Combinatorial
partitioning method
Random forests
Set association
approach
Non-parametric Methods
• Penalized regression methods o LASSO penalized regression
• Factorial methods o DAPC-based
feature selection
Multivariate methods (ii)
Testing for Association 23
• Approach o Regression models multivariate association
o Shrinkage estimation feature selection
• Variations o LASSO, Ridge, Elastic net, Logic regression
• Gold Standard—LASSO penalized regression
Penalized regression methods
Testing for Association 24
• Regression o Generalized linear model (“glm”)
• Penalization o L1 norm
o Coefficients 0
o Feature selection!
LASSO penalized regression
Testing for Association 25
Strengths Weaknesses
• Stability
• Interpretability
• Likely to accurately
select the most
influential predictors
• Sparsity
• Multicollinearity
• Not designed for high-p
• Computationally intensive
• Calibration of penalty parameters o User-defined variability
• Sparsity
• NO p-values!
LASSO – Strengths & weaknesses
Testing for Association 26
• Approach o Place all variables (SNPs) in a multivariate space
o Identify discriminant axis best separation
o Select variables with the highest contributions to that axis
• Variations o Supervised-PCA, Sparse-PCA, DA, DAPC-based FS
o Our focus—DAPC with feature selection (snpzip)
Factorial methods
Testing for Association 27
Discriminant axis
De
nsi
ty o
f in
div
idu
als
Discriminant axis
De
nsi
ty o
f in
div
idu
als
DAPC-based feature selection
a
b
c d
e
Alleles
Individuals
0
0.1
0.2
0.3
0.4
0.5
a b c d e
Contribution to
Discriminant Axis
Healthy (“controls”)
Diseased (“cases”)
Testing for Association 28
Discriminant Axis Discriminant Axis
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
a b c d e
Contribution to Discriminant Axis
?
Discriminant axis
De
nsi
ty o
f in
div
idu
als
DAPC-based feature selection • Where should we draw the line?
o Hierarchical clustering
Hierarchical clustering (FS)
Testing for Association 30
0
0.1
0.2
0.3
0.4
0.5
a b c d e
Contribution to
Discriminant Axis
Hooray!
Strengths Weaknesses
• More likely to catch all
relevant SNPs (signal)
• Computationally quick
• Good exploratory tool
• Redundancy > sparsity
• Sensitive to n.pca
• N.snps.selected varies
• No “p-value”
• Redundancy > sparsity
DAPC – Strengths & weaknesses
Testing for Association 31
Conclusions
• Study design o GWAS design
o Issues and considerations in GWAS
• Testing for association o Univariate methods
o Multivariate methods
• Penalized regression methods
• Factorial methods
32
Thanks for listening!
33
Questions?
34