Page 1: Genome-wide association studies (GWAS)

Genome-wide association studies (GWAS)

Thomas Hoffmann

Department of Epidemiology and Biostatistics, and Institute for Human Genetics

Page 2: Genome-wide association studies (GWAS)

Outline

● GWAS Overview
● Study design
● Plink overview
● QC
● Analysis
● If we have time

Page 3: Genome-wide association studies (GWAS)

Main idea

● Correlation between genotype (SNP) and trait of interest (e.g., LDL/HDL, blood pressure)

● Test ~1 million SNPs to see if any are correlated with the phenotype.

● Agnostic “shotgun” approach across the genome

Hirschhorn & Daly, Nat Rev Genet 2005

Page 4: Genome-wide association studies (GWAS)
Page 5: Genome-wide association studies (GWAS)

Size matters

Visscher, AJHG 2012

● Large # SNPs tested, multiple comparison correction
● Very small effect sizes
● TagSNP, rather than actual SNP

Page 7: Genome-wide association studies (GWAS)

Study design with Quanto

http://biostats.usc.edu/software

➢ Disease: case-control (unmatched/matched), continuous
➢ Hypothesis: Gene only (also does GxE)
➢ Outcome model: Baseline Risk
➢ Genetic Effect: many GWAS ORs have actually been very small
➢ Parameters: Power/Sample size
➢ Type I error: 5e-8 (= 5×10⁻⁸, Bonferroni correction for 1 million tests)

What parameters seem reasonable? How would you explain this in a grant?
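
To get a rough feel for why small effects need large samples at a 5e-8 threshold, here is a minimal sketch for a continuous trait. It uses a normal approximation, not Quanto's method, and the function name `approx_power` and the parameter `r2` (fraction of trait variance explained by the SNP) are illustrative assumptions.

```python
from statistics import NormalDist

def approx_power(n, r2, alpha=5e-8):
    """Approximate power of a two-sided 1-df test.

    n     : number of subjects
    r2    : assumed fraction of trait variance explained by the SNP
    alpha : type I error (genome-wide threshold by default)
    """
    nd = NormalDist()
    z_crit = -nd.inv_cdf(alpha / 2)          # two-sided critical value
    shift = (n * r2 / (1 - r2)) ** 0.5       # approximate mean of the z statistic
    # probability that the z statistic falls in either rejection tail
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)

for n in (1_000, 10_000, 100_000):
    print(n, approx_power(n, r2=0.001))
```

With these assumed inputs, power is negligible at 1,000 subjects and close to 1 only around 100,000, which is the kind of argument a grant would need to spell out.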

Page 9: Genome-wide association studies (GWAS)

Plink overview

● Command-line driven code [You have seen this, right?]

● Start > All Programs > Accessories > Command Prompt

● Rock solid

● Graphical interface [You haven't seen this?]

● Pretty, but a little unstable, especially if you kill something

● May not allow you to delete some things until restart system

● Great documentation

http://pngu.mgh.harvard.edu/~purcell/plink/reference.shtml

● Apply an operation to the data, which produces output. Inspecting that output by hand is much harder for GWAS because of its sheer quantity; use other programs (e.g., Haploview [today], Stata, R) to display it better

Page 10: Genome-wide association studies (GWAS)

gPLINK setup

Click “File > Open project”, then click “Browse” in the dialogue that pops up, navigate to where your plink files are located, and click OK. Ignore any “stream error” messages you get; they don't seem to cause any problems.

Page 11: Genome-wide association studies (GWAS)

gPLINK Setup (2)

Click “File > Configure”, then click the two “Browse” buttons, navigating to the haploview and plink executable files, respectively.

Page 13: Genome-wide association studies (GWAS)

QC Steps (1)

● Remove SNPs with a low call rate (e.g., <95%) and individuals with too much missing data
● Call rate: proportion of SNPs actually called by the software
● If it's low, the clusters aren't well defined (artifacts)

PLINK, produce a summary:

– plink --bfile ceu --missing --out ceu-miss

PLINK, filter on the fly or to produce a new dataset:

– plink --bfile ceu --geno 0.05 --mind 0.05 <rest of command, e.g., association>
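
What the missingness filters compute can be sketched on a toy genotype matrix. This is an illustration only, not PLINK's implementation; the data and function names are made up, and the 0.05 cutoff mirrors the `--geno 0.05` example above.

```python
# Genotypes coded as allele counts (0/1/2); None marks a missing call.
# Rows are individuals, columns are SNPs (toy data, purely illustrative).
genotypes = [
    [0, 1, None, 2],
    [0, None, None, 1],
    [1, 1, 0, 2],
    [0, 2, None, 1],
]

def snp_missing_rates(geno):
    """Fraction of individuals with a missing call, per SNP (column)."""
    n = len(geno)
    return [sum(row[j] is None for row in geno) / n for j in range(len(geno[0]))]

def individual_missing_rates(geno):
    """Fraction of SNPs with a missing call, per individual (row)."""
    return [sum(g is None for g in row) / len(row) for row in geno]

snp_miss = snp_missing_rates(genotypes)
ind_miss = individual_missing_rates(genotypes)

# Keep SNPs whose missing rate does not exceed the cutoff
keep_snps = [j for j, m in enumerate(snp_miss) if m <= 0.05]
print(snp_miss, ind_miss, keep_snps)
```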

Page 14: Genome-wide association studies (GWAS)
Page 15: Genome-wide association studies (GWAS)
Page 16: Genome-wide association studies (GWAS)
Page 17: Genome-wide association studies (GWAS)
Page 18: Genome-wide association studies (GWAS)

WARNING: It will look like it is not doing anything!!! In the background a plink.exe process has been launched. You need to wait for the green circle.

Page 19: Genome-wide association studies (GWAS)
Page 20: Genome-wide association studies (GWAS)

This creates a new dataset based on the filter, but this can also be done on the fly with some of the association analysis options as well...

Page 21: Genome-wide association studies (GWAS)

QC Steps (2)

● Remove those with low minor allele frequency?
● Rarer variants more likely to be artifacts / underpowered
● Sometimes the fit is unstable
● Other approaches to deal with rare variants than a single SNP at a time (combine them)

Produce summary: plink --bfile ceu --freq --out ceu-freq

Filter: plink --bfile ceu --maf 0.05 --make-bed --out ceu-0.05
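
The minor allele frequency itself is a simple count. A minimal sketch, with toy genotypes and a hypothetical helper name:

```python
def minor_allele_frequency(calls):
    """calls: per-individual counts (0/1/2) of one allele; None = missing call."""
    observed = [g for g in calls if g is not None]
    freq = sum(observed) / (2 * len(observed))   # each person carries 2 alleles
    return min(freq, 1 - freq)                   # report the rarer allele

snp = [0, 1, 0, 0, 2, None, 1, 0, 0, 0]
maf = minor_allele_frequency(snp)
print(maf, maf >= 0.05)   # the second value says whether the SNP survives a 0.05 cutoff
```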

Page 22: Genome-wide association studies (GWAS)
Page 23: Genome-wide association studies (GWAS)

QC Steps (3)

SNPs that fail Hardy-Weinberg

Suppose a SNP with alleles A and B has an allele frequency of p (for A). Under random mating:
● AA has frequency p*p
● AB has frequency 2*p*(1-p)
● BB has frequency (1-p)*(1-p)

Test for this (e.g., chi-squared test); if a SNP deviates too strongly, it is likely a bad SNP

In practice, do this for homogeneous populations (more later), in controls only
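
The chi-squared version of this test can be sketched as follows. The genotype counts 0/15/45 are the ones PLINK reports for rs3131968 in this deck's example .hwe output; the function name is hypothetical. PLINK reports an exact-test p-value, so with counts this small its p-value will differ from the chi-squared approximation here.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-squared test (1 df) of Hardy-Weinberg proportions.

    Assumes the SNP is polymorphic; a monomorphic SNP would give
    zero expected counts and a division by zero.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)               # frequency of allele A
    expected = [n * p * p, n * 2 * p * (1 - p), n * (1 - p) ** 2]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))      # upper tail of chi-squared, 1 df
    return expected[1] / n, chi2, p_value

exp_het, chi2, p_value = hwe_chi_square(0, 15, 45)
print(exp_het, chi2, p_value)
```

The expected heterozygosity computed this way (0.21875) agrees with the E.HET. column PLINK prints for that SNP.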

Page 24: Genome-wide association studies (GWAS)

> hwe = read.table("hwe.hwe", header=TRUE, stringsAsFactors=FALSE, fill=TRUE)
> head(hwe)
  CHR        SNP  TEST    GENO  O.HET. E.HET.  P_HWD
1   1  rs3131968   ALL 0/15/45  0.2500 0.2188 0.5795
2   1  rs3131968   AFF   0/0/0 -1.0000     NA     NA
3   1  rs3131968 UNAFF   0/0/0 -1.0000     NA     NA
4   1 rs12562034   ALL 0/11/49  0.1833 0.1665 1.0000
5   1 rs12562034   AFF   0/0/0 -1.0000     NA     NA
6   1 rs12562034 UNAFF   0/0/0 -1.0000     NA     NA

Page 25: Genome-wide association studies (GWAS)

QC Steps (4)

• Check genotype gender from X heterozygosity

plink --bfile ceu --check-sex --out ceu-check-sex

● Microarrays often have another gender check

ceu-check-sex.sexcheck:

 FID      IID  PEDSEX  SNPSEX  STATUS         F
1341  NA06985       2       2      OK  -0.06403
1341  NA06993       1       1      OK         1
1340  NA06994       1       1      OK         1
1340  NA07000       2       2      OK  -0.03363
1340  NA07022       1       1      OK         1
1341  NA07034       1       1      OK         1
1341  NA07055       2       2      OK  -0.06594
1340  NA07056       2       2      OK  -0.02845
1345  NA07345       2       2      OK   0.00151

F: The actual X chromosome inbreeding (homozygosity) estimate. Which ones are female? Has this data already been cleaned?
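
How F translates into a sex call can be sketched as below. The records are copied from the .sexcheck output shown above (PEDSEX 1 = male, 2 = female); the 0.2 / 0.8 cutoffs are assumed here to match PLINK's defaults, and the function name is made up.

```python
# (IID, PEDSEX, F) from the .sexcheck output above
records = [
    ("NA06985", 2, -0.06403),
    ("NA06993", 1, 1.0),
    ("NA06994", 1, 1.0),
    ("NA07000", 2, -0.03363),
    ("NA07022", 1, 1.0),
    ("NA07034", 1, 1.0),
    ("NA07055", 2, -0.06594),
    ("NA07056", 2, -0.02845),
    ("NA07345", 2, 0.00151),
]

def infer_sex(f, female_max=0.2, male_min=0.8):
    """Return 2 (female), 1 (male) or 0 (ambiguous) from the X-chromosome F estimate.

    Males have one X, so they look completely homozygous (F near 1);
    females look like any autosome (F near 0).
    """
    if f < female_max:
        return 2
    if f > male_min:
        return 1
    return 0

# Individuals whose recorded sex disagrees with the genotype-based call
problems = [iid for iid, pedsex, f in records if infer_sex(f) != pedsex]
print(problems)
```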

Page 26: Genome-wide association studies (GWAS)

(5) Check for relatedness, e.g., HapMap

Pemberton et al., AJHG 2010

● Take the overall average across all SNPs of how many alleles are shared.
● E.g., a parent-offspring pair never shares zero alleles.
● HapMap was supposed to consist of unrelated individuals (this was a bit of a surprise!)
● “MZ twins” are sometimes a duplicated sample...

Page 27: Genome-wide association studies (GWAS)

PLINK Identity By State (IBS) Calculations

plink --bfile ceu --genome --out ceu-genome

FID1     Family ID for first individual
IID1     Individual ID for first individual
FID2     Family ID for second individual
IID2     Individual ID for second individual
RT       Relationship type given PED file
EZ       Expected IBD sharing given PED file
Z0       P(IBD=0)
Z1       P(IBD=1)
Z2       P(IBD=2)
PI_HAT   P(IBD=2)+0.5*P(IBD=1) ( proportion IBD )
PHE      Pairwise phenotypic code (1,0,-1 = AA, AU and UU pairs)
DST      IBS distance (IBS2 + 0.5*IBS1) / ( N SNP pairs )
PPC      IBS binomial test
RATIO    Of HETHET : IBS 0 SNPs (expected value is 2)
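
As a reading aid for the PI_HAT column, here is a small sketch of the formula applied to the textbook IBD probabilities for a few relationships. The dictionary holds theoretical expectations, not PLINK output; real estimates are noisy around these values.

```python
def pi_hat(z0, z1, z2):
    """Proportion IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    assert abs(z0 + z1 + z2 - 1.0) < 1e-9, "IBD probabilities must sum to 1"
    return z2 + 0.5 * z1

# Theoretical (Z0, Z1, Z2) for some relationships
expected_sharing = {
    "unrelated": (1.0, 0.0, 0.0),
    "half siblings": (0.5, 0.5, 0.0),
    "parent-offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "MZ twins / duplicate sample": (0.0, 0.0, 1.0),
}

for relationship, z in expected_sharing.items():
    print(relationship, pi_hat(*z))
```

Parent-offspring and full siblings both give 0.5, so Z0 and Z1 are needed to tell them apart: a parent-offspring pair has Z0 = 0.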

Page 29: Genome-wide association studies (GWAS)

QC Notes

● Exact thresholds differ based on dataset, how many subjects you have

● GWAS of 1,000 subjects vs. GWAS of 100,000 subjects

● Ad-hoc look at QQ-plots (later) to assess some QC issues

● Looking at cumulative distributions may be slightly easier than histograms (what we saw earlier) for choosing cutoffs in the tails

Page 30: Genome-wide association studies (GWAS)

QC Summary

● Filter SNPs and individuals
● MAF (very sample size dependent)
● Low call rates
● Test HWE
● Relatedness – remove first degree relatives at least, or account for them
● Check genotypic gender (e.g., mislabeled samples)
● Anything extra you could do with family data?
● Mother AA, Father BB, Child AA

Page 32: Genome-wide association studies (GWAS)

GWAS analysis

• Most common approach: look at each SNP one-at-a-time
– Additive coding of SNP most common, e.g., # of A alleles
– Just a covariate in a regression framework
• Dichotomous phenotype: logistic regression
• Continuous phenotype: linear regression
– $\mathrm{BMI} = B_1\,\mathrm{SNP} + B_2\,\mathrm{Age} + \dots$
• Can use residuals/propensity score to speed up analysis
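
The residual shortcut can be sketched in two steps: regress the phenotype on the covariate once, then regress the residuals on each SNP. The data below are constructed without noise and with a genotype that is uncorrelated with age, so the shortcut recovers the SNP effect exactly; when SNP and covariates are correlated, it is only an approximation to the joint model. All names and numbers are illustrative.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

age = [30, 40, 50, 60, 30, 40, 50, 60]
snp = [0, 1, 1, 0, 2, 1, 1, 2]                     # additive coding: # of A alleles
bmi = [20 + 0.1 * a + 1.5 * g for a, g in zip(age, snp)]

# Step 1 (done once): remove the covariate effect from the phenotype
a0, a1 = simple_ols(age, bmi)
residual = [b - (a0 + a1 * a) for a, b in zip(age, bmi)]

# Step 2 (repeated for every SNP): regress the residual on the genotype
_, beta_snp = simple_ols(snp, residual)
print(a1, beta_snp)
```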

Page 33: Genome-wide association studies (GWAS)

What is population stratification?

Balding, Nature Reviews Genetics 2010

● If allele frequency is different in two populations (races/ethnicities), and a disease prevalence is different in the two...
● If not corrected for, what is your test actually telling you?
● Generally inflates all test statistics if not accounted for
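
A toy illustration of the problem: two made-up populations in which the SNP has no effect on the trait at all, but both allele frequency and trait mean differ between them. Pooling them produces an association that reflects population membership only.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Population 1: allele is rare, trait mean is 10
geno_1, trait_1 = [0, 0, 0, 1], [10, 11, 9, 10]
# Population 2: allele is common, trait mean is 15
geno_2, trait_2 = [1, 2, 2, 2], [15, 14, 16, 15]

within_1 = slope(geno_1, trait_1)
within_2 = slope(geno_2, trait_2)
pooled = slope(geno_1 + geno_2, trait_1 + trait_2)
print(within_1, within_2, pooled)
```

Within each population the slope is zero, while the pooled slope is clearly positive.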

Page 34: Genome-wide association studies (GWAS)

Adjusting for PC's

• Li et al., Science 2008

Maximize variance using all SNPs between subjects – pulls out clusters of individuals of different populations/subpopulations.

Page 35: Genome-wide association studies (GWAS)

Adjusting for PC's

• Razib, Current Biology 2008

Page 36: Genome-wide association studies (GWAS)

PCs

● Eigenstrat most common (Linux only)
● Plink can still run something similar, in two steps:

plink --bfile ceu --genome --out ceu-genome

plink --bfile ceu --read-genome ceu-genome.genome --cluster --mds-plot 10 --out ceu-mds

Page 37: Genome-wide association studies (GWAS)

Running the analysis

Continuous phenotype: linear regression

$\mathrm{BMI} = B_1\,\mathrm{SNP} + B_2\,\mathrm{Age} + \dots$

Page 38: Genome-wide association studies (GWAS)
Page 39: Genome-wide association studies (GWAS)

plink --bfile ceu --linear --covar ceu.cov --covar-name PC1,PC2,PC3,PC4 --pheno ceu.phe --pheno-name phe --out ceu-analysis --hide-covar

--hide-covar: suppresses lots of output (every covariate for every regression); unfortunately, this cannot be done in gPLINK

Page 40: Genome-wide association studies (GWAS)

plink --bfile ceu --linear --covar ceu.cov --covar-name PC1,PC2,PC3,PC4 --pheno ceu.phe --pheno-name phe --out ceu-analysis --hide-covar --adjust

--adjust: Get the genomic inflation factor lambda, and adjust the test statistics slightly for it (genomic control divides them by lambda; otherwise this can be done later with other packages, especially if you want to run in parallel)

Do you really want to adjust? More later...

Page 41: Genome-wide association studies (GWAS)

Format of covariate files

Phenotype and covariate files not structurally different from each other:

FID IID LDL HDL BMI

ID1 ID1 120 160 200

FID and IID must match the entries in the .fam file of the plink-formatted files (which the genotype-calling software generally produces)

Page 42: Genome-wide association studies (GWAS)

View analysis results

Page 43: Genome-wide association studies (GWAS)
Page 44: Genome-wide association studies (GWAS)

Manhattan plot

Page 45: Genome-wide association studies (GWAS)

Multiple comparison correction

• If you conduct 20 tests at α=0.05, you expect one to be significant by chance (http://xkcd.com/882/). If you conduct 1 million tests...

• Correct for multiple comparisons

– e.g., Bonferroni, 1 million tests, α=5×10⁻⁸
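
The arithmetic behind these numbers, as a short sketch. The probability of at least one false positive assumes independent tests, which SNPs in LD are not, so treat it as a rough guide.

```python
ALPHA = 0.05

def expected_false_positives(n_tests, alpha=ALPHA):
    """Expected number of null tests that come out significant by chance."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha=ALPHA):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def prob_any_false_positive(n_tests, alpha=ALPHA):
    """P(at least one false positive), assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (20, 1_000_000):
    print(n_tests, expected_false_positives(n_tests), bonferroni_threshold(n_tests))
print(prob_any_false_positive(20))
```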

Page 49: Genome-wide association studies (GWAS)

QQ-plots and PC adjustment

• Wang, BMC Proc 2009

Lambda quantifies how much the test statistics deviate from what we would expect. We want it to be roughly 1, but I don't see too many objections to values such as 1.04 or 1.06 for ~10,000 subjects.
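
Lambda is commonly computed as the median observed 1-df chi-squared statistic divided by the median expected under the null (about 0.455). A minimal sketch starting from p-values, with a hypothetical function name; p-values must lie strictly between 0 and 1 here.

```python
from statistics import NormalDist, median

MEDIAN_CHI2_1DF = 0.4549364   # median of a chi-squared distribution with 1 df

def genomic_inflation(p_values):
    """Lambda = median(observed chi-squared) / median(expected chi-squared)."""
    nd = NormalDist()
    # Convert each two-sided p-value back to a 1-df chi-squared statistic
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / MEDIAN_CHI2_1DF

# Perfectly uniform p-values (no inflation) should give lambda of about 1
uniform_p = [(i - 0.5) / 1001 for i in range(1, 1002)]
lam = genomic_inflation(uniform_p)
print(lam)
```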

Page 51: Genome-wide association studies (GWAS)

Genomic inflation factor

Why might the genomic inflation factor be elevated?

No, there was nothing done wrong with the analysis...

Page 52: Genome-wide association studies (GWAS)

Genomic inflation factor

Why might the genomic inflation factor be elevated?

There are hundreds of loci known for height...

Page 54: Genome-wide association studies (GWAS)

After you found a hit: Check cluster plot (where genotype calls made)

Good calls! Bad calls!

Laird and Lange

Page 56: Genome-wide association studies (GWAS)

Advanced: Conditional analysis

Prostate cancer, Witte, Nat Genet 2007 (http://cgems.cancer.gov)

[Figure: Multiple prostate cancer loci on 8q24. x-axis: position on 8q24 (Mb), 128.10 to 128.70; y-axis: -log(p-value), 0 to 30; Regions 1, 2 and 3 marked, with SNPs rs6983267, rs1447295 and rs16901979 labeled.]

$Y = b_0 + b_1 X_1 + b_2 X_2$

In practice, take the residual of Y on $X_1$, then fit that to $X_2$ (avoids issues of multicollinearity)
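
The two-step version can be sketched on toy data. Here the second SNP is a made-up perfect proxy of the first (complete LD), the extreme case: it looks associated on its own, but nothing is left once the first SNP has been accounted for. The names and numbers are illustrative and are not taken from the 8q24 data.

```python
def ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

x1 = [0, 1, 2, 0, 1, 2]                 # top SNP
x2 = list(x1)                           # second SNP in perfect LD with the first
y = [1.0, 2.1, 2.9, 0.9, 2.0, 3.1]

# Marginal test: the second SNP looks associated on its own
_, marginal = ols(x2, y)

# Conditional test: residual of Y on X1, then fit that to X2
b0, b1 = ols(x1, y)
residual = [yi - (b0 + b1 * xi) for xi, yi in zip(x1, y)]
_, conditional = ols(x2, residual)
print(marginal, conditional)
```

A second SNP that still shows a signal in the residuals would point to an independent locus, which is the question the regions in the figure raise.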

Page 57: Genome-wide association studies (GWAS)

Conditional Analysis (continued)

● Different approaches – what question is being asked?

● Rerun only within region
● Genome-wide

● Hard to coordinate in consortium

Page 58: Genome-wide association studies (GWAS)

Imputation of SNP Genotypes

● Combine data from different platforms (e.g., Affy & Illumina) (for replication / meta-analysis).
● Estimate unmeasured or missing genotypes.
● Based on measured SNPs and external info (e.g., haplotype structure of HapMap).
● Increase GWAS power (impute and analyze all), e.g., for sick sinus syndrome the most significant SNP was one imputed from 1000 Genomes (Holm et al., Nature Genetics, 2011).
● HapMap as reference, now 1000 Genomes Project

Page 59: Genome-wide association studies (GWAS)

Imputation Example

Li et al., Ann Rev Genom Human Genet, 2009

What would you do?

Page 62: Genome-wide association studies (GWAS)

Imputation Application

Marchini, Nature Genetics 2007; http://www.stats.ox.ac.uk/~marchini/#software

TCF7L2 gene region & T2D from the WTCCC data (x-axis: chromosomal position)

Observed genotypes in black, imputed genotypes in red.

Page 63: Genome-wide association studies (GWAS)

Imputation Software

● SHAPE-IT to phase: http://www.shapeit.fr/
● Impute2 to impute (+ 1000 Genomes): http://mathgen.stats.ox.ac.uk/impute/impute_v2.html
● Analyze similarly to typed genotypes, with plink --dosage
● Data subsetting, etc., is not as flexible with dosages (i.e., those options don't work)
● 1000 Genomes: Tens of millions of SNPs...

Page 64: Genome-wide association studies (GWAS)

Replication

• To replicate:
– Association test for replication sample significant at 0.05/{Number of SNPs replicating} alpha level
– Same genetic model (e.g. additive, dominant)
– Same direction
– Sufficient sample size for replication

• A non-replication is not necessarily a false positive
– LD structures, different populations (e.g., flip-flop)
– covariates, phenotype definition, underpowered

Page 65: Genome-wide association studies (GWAS)

Meta-analysis

• Combine multiple studies to increase power
• Either combine p-values (Fisher's method),

– when might you use this? (tricky question)

• or z-scores (better)

• plink --meta-analysis study1.assoc study2.assoc study3.assoc

– Tip: Run previous analysis with --ci 0.95
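
The two combination rules named above can be sketched as follows. PLINK's own --meta-analysis works from effect sizes and standard errors, so this is only an illustration of the ideas; the function names and the square-root-of-N weights are assumptions for the example.

```python
import math
from statistics import NormalDist

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-squared with 2k df under the null."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)    # half of the test statistic
    # Closed-form upper tail of chi-squared with an even number (2k) of df
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

def weighted_z(z_scores, sample_sizes):
    """Sample-size weighted z-score; the sign of each z keeps the effect direction."""
    weights = [math.sqrt(n) for n in sample_sizes]
    z = sum(w * zi for w, zi in zip(weights, z_scores)) / math.sqrt(sum(w * w for w in weights))
    p = 2 * NormalDist().cdf(-abs(z))
    return z, p

# Two studies with p = 0.05 each
print(fisher_combined_p([0.05, 0.05]))

# Same strength of evidence but opposite directions: z-scores cancel out
z_opposite, p_opposite = weighted_z([2.0, -2.0], [100, 100])
print(z_opposite, p_opposite)
```

Fisher's method ignores effect direction, so two studies pointing in opposite directions still combine to a small p-value, whereas the signed z-scores cancel.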

Page 66: Genome-wide association studies (GWAS)

Extras

● If using separate control group (“convenience controls”), need to be extra careful with QC (give example).

● Why?

Page 67: Genome-wide association studies (GWAS)

Summary

● GWAS successful in finding large numbers of variants associated with disease.

● GWAS not as successful as people hoped
● Not as much of the heritability / variance explained.
● Difficult to find true functional SNP?
● What do we do after we find them?