lecture 7 gwas full

Session 7: Genome-wide Association Studies (GWAS)

Upload: lekki-frazier-wood

Post on 04-Nov-2014



Page 1: Lecture 7 gwas full

Session 7: Genome-wide Association Studies (GWAS)

Page 2: Lecture 7 gwas full

But first….

June 13th 2013

Page 3: Lecture 7 gwas full

Structure of this lecture

• Recap some concepts (SAS tutorial later)
• Discuss GWAS
• Look at the steps in running & analyzing GWAS results
• Lab – analyze a GWAS
• SAS tutorial

Page 4: Lecture 7 gwas full

Recap from last session

• At an A/T locus:
  – What are the genotypes?
  – What are the alleles?
• Given the following genotype counts at a locus:
  A/A: N=123   A/C: N=134   C/C: N=52
  – Which is the minor allele?
  – What are the allele frequencies?
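The recap question can be checked by counting alleles: each person carries two, so homozygotes contribute two copies of one allele and heterozygotes one copy of each. A minimal Python sketch follows; the function name is illustrative and not part of the lecture.

```python
def allele_frequencies(n_hom1, n_het, n_hom2):
    """Return (freq of allele 1, freq of allele 2) from genotype counts.

    Homozygotes carry two copies of one allele; heterozygotes carry
    one copy of each allele.
    """
    total_alleles = 2 * (n_hom1 + n_het + n_hom2)
    freq1 = (2 * n_hom1 + n_het) / total_alleles
    freq2 = (2 * n_hom2 + n_het) / total_alleles
    return freq1, freq2


# Genotype counts from the slide: A/A = 123, A/C = 134, C/C = 52
freq_a, freq_c = allele_frequencies(123, 134, 52)
minor = "A" if freq_a < freq_c else "C"
print(f"freq(A) = {freq_a:.3f}, freq(C) = {freq_c:.3f}, minor allele = {minor}")
```

With these counts there are 380 A alleles and 238 C alleles out of 618, so C is the minor allele.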

Page 5: Lecture 7 gwas full

Recap from last session

• Additive model: Each ‘risk’ allele conveys risk for the phenotype in an additive fashion

• Recessive effect of the minor allele: The minor allele conveys risk, but you need 2 copies to have the risk

• Dominant effect of the minor allele: The minor allele conveys risk, and you only need 1 copy to have the risk
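The three models differ only in how a genotype is turned into a number before it enters a regression. The sketch below shows one common way to code them, assuming the genotype is stored as the count of minor alleles (0, 1 or 2); the function name is hypothetical.

```python
def code_genotype(minor_allele_count, model):
    """Code a genotype (0, 1 or 2 copies of the minor allele) for regression."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    if model == "additive":
        return minor_allele_count            # each copy adds the same risk
    if model == "dominant":
        return int(minor_allele_count >= 1)  # one copy is enough
    if model == "recessive":
        return int(minor_allele_count == 2)  # two copies are needed
    raise ValueError(f"unknown model: {model}")


for model in ("additive", "dominant", "recessive"):
    print(model, [code_genotype(g, model) for g in (0, 1, 2)])
```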

Page 6: Lecture 7 gwas full

Recap from last session

• Given these genotype means, what genotype models would you run? (Additive / recessive effect of the minor allele / dominant effect of the minor allele?):

      aa   aA   AA
1     10   15   15
2     23   35   42
3     10   12   14
4      9   11   32
5      5    5   16
6      1    5    7

Page 7: Lecture 7 gwas full

All about GWAS

Page 8: Lecture 7 gwas full

Back in the 90s…

Page 9: Lecture 7 gwas full

Heralding the GWAS era

Page 10: Lecture 7 gwas full
Page 11: Lecture 7 gwas full

What is a GWAS

Basic idea: Genotype individuals for a large number (~1M) of SNPs spread in a generally unspecified way throughout the genome. Look for association.

Advantage?

What does this table show?

Page 12: Lecture 7 gwas full

The era of hypothesis generating research

Page 13: Lecture 7 gwas full

The era of hypothesis generating research

Page 14: Lecture 7 gwas full
Page 15: Lecture 7 gwas full

Why do we question the success of GWAS?

• What is the median income in the US?
  – $50,000
• What is federal tax?
  – 25%
• Median tax/year is
  – ~$12,500
• What is the average cost of an NIH genetics grant?
  – $477,215 / $350,000

• 30-40 working years

Page 16: Lecture 7 gwas full

Huge numbers of published GWAS!

• http://www.genome.gov/gwastudies/
• http://www.ebi.ac.uk/fgpt/gwas/#

Page 17: Lecture 7 gwas full

JAMA, 2007

• T2D: it was suspected that genes “related to insulin resistance would be identified. Interestingly, not one gene known to be associated with insulin resistance has been found in the recent set of genome-wide scans for diabetes, but genes related to insulin secretion, insulin transport, zinc binding to insulin, and pancreatic islet beta cell development have been discovered.”

Page 18: Lecture 7 gwas full

And yet….

• Sir Alec Jeffreys: “One of the great hopes for GWAS was that, in the same way that huge numbers of Mendelian disorders were pinned down at the DNA level and the gene and mutations involved identified, it would be possible to simply extrapolate from single gene disorders to complex multigenic disorders. That really hasn’t happened. Proponents will argue that it has worked and that all sorts of fascinating genes that predispose to or protect against diabetes or breast cancer, for example, have been identified, but the fact remains that the bulk of the heritability in these conditions cannot be ascribed to loci that have emerged from GWAS, which clearly isn’t going to be the answer to everything.”

Page 19: Lecture 7 gwas full

• Cell, 2010: “To date, genome-wide association studies (GWAS) have published hundreds of common variants whose allele frequencies are statistically correlated with various illnesses and traits. However, the vast majority of such variants have no established biological relevance to disease or clinical utility for prognosis or treatment.”

Page 20: Lecture 7 gwas full
Page 21: Lecture 7 gwas full

Running (analyzing) a GWAS

Page 22: Lecture 7 gwas full

• With each step think: “What effect does this have on our ability to find genes for traits and disorders?”

Page 23: Lecture 7 gwas full

Running a GWAS: Getting your genotype data

• Select your chip• Complete your genotyping

Page 24: Lecture 7 gwas full

Running a GWAS: QC-ing SNP data

• Poor quality samples
  – Sample genotype success rate < 95 to 97.5%
  – Greater proportion of heterozygous genotypes than expected
• Related individuals (if independent samples)
  – Based on pair-wise comparisons of similarity of genotypes
• Sample switches
  – Wrong sex

Page 25: Lecture 7 gwas full

Running a GWAS: QC-ing SNP data

• This can get very complex!
• “GOLDN SNPs that were monomorphic (55,530) or had a call rate <96 % (82,462) were removed from the analysis. In addition, SNPs were excluded from the analysis based on the number of families with Mendelian errors as follows: for minor allele frequency (MAF) <20 %, removed if errors were present in 3 families (1,486 SNPs); for MAF < 10 %, removed if errors were present in 2 families (1,338 SNPs); MAF < 5%, removed if errors were present in 1 family (1,767 SNPs); for MAF <5 %, removed if any errors were present (9,592 SNPs). In families with remaining errors, SNPs that exhibited Mendelian error were set to missing (31,595 SNPs). Furthermore, 16 participants with call rates <96 % were also removed from any subsequent analyses. Subsequently, 748 SNPs failing the Hardy–Weinberg equilibrium (HWE) test at P value <10^-6 were excluded.”
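Three of the filters in the quoted passage (monomorphic SNPs, call rate below 96 %, HWE P value below 10^-6) can be sketched per SNP as below. This is only an illustration: the function names are hypothetical, the HWE test is a simple 1-df chi-square rather than whatever test the GOLDN analysts used, and the family-based Mendelian-error filters are not covered.

```python
import math


def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))


def snp_passes_qc(n_aa, n_ab, n_bb, n_missing,
                  min_call_rate=0.96, hwe_threshold=1e-6):
    """Apply the monomorphic, call-rate and HWE filters to one SNP."""
    called = n_aa + n_ab + n_bb
    call_rate = called / (called + n_missing)
    monomorphic = n_aa == called or n_bb == called
    if monomorphic or call_rate < min_call_rate:
        return False
    return hwe_p_value(n_aa, n_ab, n_bb) >= hwe_threshold


print(snp_passes_qc(123, 134, 52, n_missing=5))   # well-behaved SNP
print(snp_passes_qc(300, 0, 0, n_missing=0))      # monomorphic
print(snp_passes_qc(500, 0, 500, n_missing=0))    # no heterozygotes at all
```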

Page 26: Lecture 7 gwas full

Running a GWAS: Imputing more SNPs

Useful for:

• Combining data from different platforms (e.g., Affy & Illumina) (for replication or meta-analysis)

• Estimating unmeasured or missing genotypes

• Based on measured SNPs and external info (e.g., haplotype structure of HapMap)

• Imputation methods use the dense genotype data available from HapMap samples (i.e., CEU) and the LD relationships of the SNPs to impute (predict) genotypes for a large number of SNPs that were not measured experimentally

Page 27: Lecture 7 gwas full

“Short cuts”

[Figure: example haplotypes across neighbouring SNPs]

SNPs 1, 3 and 4 are TagSNPs

Page 28: Lecture 7 gwas full

Running a GWAS: Imputing more SNPs

• Requires large scale computing resources
• Can go from ~500,000 SNPs to ~2.2 million
• Need to assess quality of imputation
  – Compare imputed genotypes to actual genotypes
• Error rates are higher than for genotyped SNPs
• Works less well for rarer alleles (e.g., MAF<3%)
• Best to take account of probabilities assigned to imputed genotypes in the analysis
  – “dosages” = expected allele counts, calculated from the probabilities of the genotypes
• Allows association testing of untyped variation
• Allows for ease of combining data across genotyping platforms

Genotype:     AA    AG    GG
Probability:  80%   20%   0%
Coding:       2     1     0
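For the example in the table, the dosage is the probability-weighted sum of the genotype codings. A small sketch, with a hypothetical helper name:

```python
def dosage(prob_by_genotype, coding):
    """Expected allele count: sum of genotype probability x allele-count coding."""
    total = sum(prob_by_genotype.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    return sum(prob_by_genotype[g] * coding[g] for g in prob_by_genotype)


# Values from the slide: P(AA) = 80%, P(AG) = 20%, P(GG) = 0%
probs = {"AA": 0.80, "AG": 0.20, "GG": 0.00}
coding = {"AA": 2, "AG": 1, "GG": 0}
print(round(dosage(probs, coding), 3))
```

Here the dosage is 0.8 × 2 + 0.2 × 1 + 0 × 0 = 1.8, which is then used in the regression in place of a hard genotype call of 2.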

Page 29: Lecture 7 gwas full

Running a GWAS: Imputing more SNPs

Page 30: Lecture 7 gwas full

Running a GWAS: Imputing more SNPs

MACH, Markov Chain Haplotyping
– Developed by Goncalo Abecasis
– http://www.sph.umich.edu/csg/abecasis/MACH/

IMPUTE
– Developed by Jonathan Marchini
– http://www.stats.ox.ac.uk/~marchini/#software

BIMBAM
– http://quartus.uchicago.edu/~yguan/bimbam/index.html

Page 31: Lecture 7 gwas full

Running a GWAS: QC-ing SNP data

• More QC!
• MAF again
• Imputation quality (RSQ<.3??)

Page 32: Lecture 7 gwas full

Running a GWAS: Computing your p-values

• Commonly SNPs coded as the additive effect of alleles

• Logistic regression or linear regression (much like a candidate gene study)
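For a quantitative trait, the per-SNP test amounts to a simple linear regression of the phenotype on the additively coded genotype (or dosage). The sketch below is a single-SNP illustration with no covariates; the function name and toy data are made up, and the p-value uses a large-sample normal approximation rather than the exact t distribution, so it is too small for a sample this tiny.

```python
import math


def additive_linear_test(genotypes, phenotypes):
    """Regress phenotype on additive genotype; return (beta, se, approx p)."""
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotypes))
    beta = sxy / sxx                      # effect per copy of the coded allele
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2
              for x, y in zip(genotypes, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation
    return beta, se, p


# Toy data: genotype = copies of the effect allele, phenotype = trait value
genotypes = [0, 0, 1, 1, 2, 2]
phenotypes = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0]
beta, se, p = additive_linear_test(genotypes, phenotypes)
print(f"beta = {beta:.3f}, se = {se:.3f}, p = {p:.4f}")
```

A GWAS repeats this test once per SNP, which is what makes the multiple-testing problem so severe.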

Page 33: Lecture 7 gwas full

• PLINK
  – http://pngu.mgh.harvard.edu/~purcell/plink/

Page 34: Lecture 7 gwas full

Running a GWAS: Interpreting your data

• (Really… it is more QC)….
• 2 main issues:
  – Population stratification
  – Multiple testing

Page 35: Lecture 7 gwas full

The Problems of population substructure

• Devlin and Roeder (1999) used theoretical arguments to propose that with population structure, the distribution of Cochran-Armitage trend tests, genome-wide, is inflated by a constant multiplicative factor λ.

• We can estimate the multiplicative inflation factor using the statistic λ = median(X_i^2)/0.456.

• Inflation factor λ > 1 indicates population structure and/or genotyping error.

• We can carry out an adjusted test of association that takes account of any mismatching of cases/controls at any SNP using the statistic X_i^2/λ.

[Figure: plot annotated with “Inflation factor λ = 1.11”, “Population outliers and/or structure?” and “True hits?”]
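The inflation factor and the adjusted statistics can be computed directly from the per-SNP chi-square statistics. The sketch below uses the 0.456 constant from the slide; the function names and the toy statistics are invented, and flooring λ at 1 before adjusting is a common convention assumed here rather than something the slide states.

```python
import statistics

CHI2_1DF_MEDIAN = 0.456  # median of a 1-df chi-square, as given on the slide


def genomic_inflation(chi2_stats):
    """Lambda = median of the observed chi-square statistics / 0.456."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control_adjust(chi2_stats):
    """Divide every statistic by lambda (never by less than 1)."""
    lam = max(genomic_inflation(chi2_stats), 1.0)
    return [x / lam for x in chi2_stats]


# Toy statistics for five SNPs (a real GWAS would have ~1M of these)
chi2_stats = [0.1, 0.3, 0.506, 1.2, 4.0]
lam = genomic_inflation(chi2_stats)
adjusted = genomic_control_adjust(chi2_stats)
print(f"lambda = {lam:.2f}")
print([round(x, 3) for x in adjusted])
```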

Page 36: Lecture 7 gwas full
Page 37: Lecture 7 gwas full

Solving population stratification

• Principal components analysis (PCA) has become a standard tool in genetics to identify subpopulations

• Used to infer continuous axes of genetic variation (eigenvectors) that reduce the data to a small number of dimensions while describing as much of the variability among individuals as possible

• Utilizes a set of “neutral” SNPs (not associated with phenotype, need many)

• Implemented in the EIGENSTRAT software package.

Page 38: Lecture 7 gwas full

Solving population stratification - Eigenstrat

Page 39: Lecture 7 gwas full

Population substructure thought experiment

• You run a GWAS and find a significant hit within a gene. The SNP is genotyped, biologically plausible, in HWE (p>.05) and has a MAF = .13. However, your λ=1.56! So you decide not to use the SNP.

• What would have happened if you had taken a candidate gene approach and only analyzed that SNP?

Page 40: Lecture 7 gwas full

Multiple testing

Page 41: Lecture 7 gwas full

Multiple testing

• When all tests are independent,
  – Probability to observe P<.01 is
  – Probability to observe P<.05 is
  – Probability to observe P<.5 is

Page 42: Lecture 7 gwas full
Page 43: Lecture 7 gwas full

Multiple testing

• At an alpha of .05, a test of a true null hypothesis has a 5% chance of giving a significant result by chance alone.
• How many significant findings do we expect to arise by chance?
• How many tests does a GWAS have?
• So, how many significant findings do we expect to arise by chance?
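The arithmetic behind these questions is short. The sketch below assumes every null hypothesis is true and the tests are independent, and uses 1 million tests as the GWAS example (the figure quoted earlier in the lecture); the function names are illustrative.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of significant results when every null is true."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 20, 1_000_000):
    print(n_tests,
          round(expected_false_positives(n_tests, 0.05), 2),
          round(familywise_error_rate(n_tests, 0.05), 4))
```

At an alpha of .05, a GWAS of 1 million SNPs would be expected to produce about 50,000 significant results by chance alone.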

Page 44: Lecture 7 gwas full

Multiple testing corrections

• Bonferroni:
  – When k independent tests are used, the corrected significance threshold should be: α / k

“Bonferroni adjustments are, at best, unnecessary, and, at worst, deleterious to sound statistical inference.” Perneger, 1998

While it is computationally simple (single step method) and stringent, maybe it is too stringent? High probability of Type II errors.
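As a quick illustration of how stringent this is, the sketch below applies the α / k rule to a handful of p-values and to a GWAS-sized number of tests. The helper names are hypothetical, and 1 million tests is used as a round figure.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold: alpha / k."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that falls below alpha / k."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


print(bonferroni_significant([0.001, 0.02, 0.04]))
print(bonferroni_threshold(0.05, 1_000_000))
```

With 1 million tests the threshold is 5 × 10^-8, the figure usually quoted as "genome-wide significance".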

Page 45: Lecture 7 gwas full

Multiple testing corrections

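• Holm: threshold for each test = target p value / (n − rank + 1), where n is the number of tests and rank is the rank of the test in terms of degree of significance (1 = most significant)
• Assume 3 p-values, and α=.05
  – the smallest p value has to be smaller than .05/3 = .017 to be sig
  – the second smallest p value has to be smaller than .05/2 = .025 to be sig
  – and the third smallest has to be smaller than .05/1 = .05 to be sig.
• Work down the ranked list and stop at the first p value that is not sig; everything after it is also not sig.
• Low probability of Type II errors… but too low?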
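The Holm thresholds can be applied with a short step-down loop. This is a sketch with a hypothetical function name; it uses a strict "smaller than" comparison to match the wording on the slide.

```python
def holm_significant(p_values, alpha=0.05):
    """Holm step-down procedure; returns a True/False flag per input p-value."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # most significant first
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        threshold = alpha / (n - rank + 1)
        if p_values[idx] < threshold:
            significant[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return significant


print(holm_significant([0.010, 0.020, 0.040]))  # all three pass their thresholds
print(holm_significant([0.020, 0.030, 0.040]))  # smallest fails .05/3, so none pass
```

In the first example Bonferroni (threshold .05/3 for every test) would have kept only the smallest p-value, which is why Holm is described as less conservative.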

Page 46: Lecture 7 gwas full

Multiple testing corrections

[Figure with labels: Holm, Bonferroni, FDR]

Page 47: Lecture 7 gwas full

Multiple testing corrections

• FDR procedures are designed to control the expected proportion of incorrectly rejected null hypotheses
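The slide does not name a specific FDR procedure. One widely used option is the Benjamini-Hochberg step-up procedure, sketched below as an assumption about what such a procedure looks like; the function name is illustrative.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure; returns a reject flag per p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    # Find the largest rank whose p-value is at or below rank * fdr / m
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            max_rank = rank
    # Reject every hypothesis ranked at or above that point
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


print(benjamini_hochberg([0.02, 0.03, 0.035, 0.04]))
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042,
                          0.06, 0.074, 0.205, 0.212, 0.216]))
```

In the first example no p-value would survive Bonferroni (.05/4 = .0125), yet all four are rejected at an FDR of 5% because the largest one sits below its own threshold of .05.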

Page 48: Lecture 7 gwas full

Multiple testing corrections

Page 49: Lecture 7 gwas full

Multiple testing corrections

Page 50: Lecture 7 gwas full

Multiple testing corrections

• Storey – q values
• For a given test, we estimate the q-value by calculating the minimum estimated false discovery rate among all thresholds at which the test is called significant

• E.g. Rank 52 has a p-value of 0.01 and a q-value of 0.0141.

• A p-value threshold of 0.01 implies a 1% chance of a false positive for each of the tests in the experiment. Therefore, across all 839 tests, 839 × .01 = 8.39 false positives are expected. A q-value of 0.0141 implies that 1.41% of the tests called significant at that threshold are expected to be false positives. In this experiment 52 tests had a q-value of 0.0141 or less, and so 52 × 0.0141 = 0.7332 false positives are expected.

• False positives according to p-values take all 839 values into account, while q-values take into account only those tests with q-values less than the threshold we choose.

Page 51: Lecture 7 gwas full

Multiple testing example

• *Think* it is a sliding scale, magic p-value of ‘no error’. BUT, repeatedly taken a range of SNPs

• 1.0×10^-5 > p > 1.0×10^-9

• So, what correction would you use? A stringent one? A lenient one?

• What is the purpose of GWAS?

Page 52: Lecture 7 gwas full

Multiple testing thought experiment

• Jane comes to you, she is interested in whether SNPs in LDL1R gene are associated with CVD. She uses all 7 SNPs genotyped in LDL1R, and the P-values are as follows: .0001, .01, .0006, .000006, .000001, .05, .09. How many SNPs are significantly associated with CVD?

• John comes to you, and has run a GWAS on a million SNPs, including Jane’s 7 in LDL1R. He does a Bonferroni correction AND an FDR correction on his GWAS, and none of the SNPs survive the correction for multiple testing. He concludes that the LDL1R gene is not associated with CVD

• Who is right?
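The numbers behind the thought experiment can be laid out as below. The sketch assumes both Jane and John use a Bonferroni threshold at α = .05, differing only in the number of tests they correct for; it shows the arithmetic and does not settle who is right.

```python
# P-values for the 7 SNPs, as given on the slide
jane_p = [0.0001, 0.01, 0.0006, 0.000006, 0.000001, 0.05, 0.09]
alpha = 0.05

jane_threshold = alpha / len(jane_p)   # corrects for 7 tests
john_threshold = alpha / 1_000_000     # corrects for a million tests

jane_hits = [p for p in jane_p if p < jane_threshold]
john_hits = [p for p in jane_p if p < john_threshold]

print(f"Jane: threshold {jane_threshold:.5f}, {len(jane_hits)} SNPs significant")
print(f"John: threshold {john_threshold:.0e}, {len(john_hits)} SNPs significant")
```

The same seven p-values give four significant SNPs under Jane's correction and none under John's, which is the tension the question is asking about.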

Page 53: Lecture 7 gwas full

Multiple testing: a final thought

• Power is a function of
  – N
  – Effect size (subsuming error)
  – Alpha

Page 54: Lecture 7 gwas full

Running a GWAS: Visualize your results

Page 55: Lecture 7 gwas full

Running a GWAS: Visualize your results

Page 56: Lecture 7 gwas full

Running a GWAS: Visualize your results

K Wang et al. Nature 000, 1-6 (2009) doi:10.1038/nature07999

Page 57: Lecture 7 gwas full

Running a GWAS: Interpreting SNPs

• Look at the functionality of your SNP (SNPdoc)
• Literature search – can you give biological plausibility?
• Other tests: pathway analysis / Gene based tests
• Tonnes of free & very expensive tools out there. Conduct your own quality control!!

Page 58: Lecture 7 gwas full

Running a GWAS: Final steps

Page 59: Lecture 7 gwas full

Running a GWAS: Replication

Hirschhorn & Daly, Nat. Rev. Genet. 6: 95, 2005
NCI-NHGRI Working Group on Replication, Nature 447: 655, 2007

Page 60: Lecture 7 gwas full

Winner’s Curse

• ‘Winner's curse’ = the phenomenon whereby winners at competitive auctions are likely to pay in excess of the value of the item.

• In genetic association studies, the winner's curse is the phenomenon whereby the disease risk of a newly identified genetic association is overestimated.

• Occurs when the statistical power of original study is not sufficient.

• The winner's curse implies that the sample size required for confirmatory study will be underestimated, resulting in failure of replication study to corroborate the association.

• The winner's curse is common in genome-wide association (GWA) studies because most single-GWA studies are underpowered to detect small genetic effects at a stringent genome-wide significance level.

• What are the solutions? Large GWAS (or a meta-analysis)
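A small simulation can make the winner's curse concrete. The sketch below assumes a made-up true effect of 0.1 estimated with a standard error of 0.1 (an underpowered study), and treats a study as a "winner" when its estimate is significant in the risk-increasing direction at z > 1.96; all names and numbers are illustrative.

```python
import random


def simulate_winners_curse(true_effect=0.1, standard_error=0.1,
                           z_threshold=1.96, n_studies=10_000, seed=1):
    """Simulate many underpowered studies of the same true effect."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, standard_error) for _ in range(n_studies)]
    # "Winners" are the studies whose estimate reached significance
    winners = [b for b in estimates if b / standard_error > z_threshold]
    mean_all = sum(estimates) / len(estimates)
    mean_winners = sum(winners) / len(winners)
    power = len(winners) / n_studies
    return mean_all, mean_winners, power


mean_all, mean_winners, power = simulate_winners_curse()
print(f"true effect 0.1, mean of all estimates {mean_all:.3f}")
print(f"mean estimate among significant studies {mean_winners:.3f} (power {power:.2f})")
```

Every significant estimate must exceed 1.96 × 0.1 = 0.196, so the published "winning" effects are at least double the true effect, and a replication study sized on them will be too small.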

Page 61: Lecture 7 gwas full

Can you solve the case of the missing heritability?

Hint. What have we discussed about:
- Traits
- Statistics
- Genotype models
- Coverage of the genome
- Power?

Page 62: Lecture 7 gwas full

Lab 7: Analyzing GWAS data

Page 63: Lecture 7 gwas full

Primer on SAS: Libraries

Find your libraries here. Toggle between libraries and results

Work is the default library. SAS will pull files from here if nothing else is specified.

Here are other libraries I have made

Page 64: Lecture 7 gwas full

Primer on SAS: Reading in data

• For SAS datasets:

Tell SAS to make a library

Give SAS a path. SAS will read in all SAS datasets in this folder

Page 65: Lecture 7 gwas full

Primer on SAS: Reading in data

• For non-SAS datasets:

Follow on the import wizard (I have code if you prefer)

Page 66: Lecture 7 gwas full

Primer on SAS: Move your file into ‘work’

‘Data’ command: make a dataset. Call it Goldn. There is no library so put it in ‘work’.

‘set’ command: base the new dataset off goldn in the goldn library

Library

Dataset

Page 67: Lecture 7 gwas full

Primer on SAS: Titles

Page 68: Lecture 7 gwas full

Primer on SAS: Summarizing Data

• Continuous variables:

Command ‘proc univariate’

Tell SAS which variables

Select a subsample

There are other commands. E.g. ‘BY’. Your data must be sorted by the ‘BY’ variable to use this command

Page 69: Lecture 7 gwas full

Primer on SAS: Summarizing Data

• Categorical variables:

Command ‘proc freq’

Tell SAS which variables

Page 70: Lecture 7 gwas full

Primer on SAS: Manipulating data

Make your data… set your data. Think: Libraries.

Options (look up others)

Page 71: Lecture 7 gwas full

Primer on SAS: Sorting your dataset

Command = proc sort

Which dataset to sort

What to call the sorted dataset

What to sort by

Page 72: Lecture 7 gwas full

Primer on SAS. One fundamental rule

• Always check your log!

Page 73: Lecture 7 gwas full

Special commands for this session

Page 74: Lecture 7 gwas full

Special commands: QQ plot

1. Make a new dataset (to preserve yours)

2. Set your data

3. Set to the number of SNPs

4. Select the smallest p-value power (here 1×10^-25)

5. Select your data

NOTE: Your data must be sorted by p-value
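The lab does this in SAS. As a language-neutral illustration of what a QQ plot of p-values contains, the Python sketch below pairs each observed p-value, once sorted, with the p-value expected at that rank under the null. The function name is hypothetical, and (rank − 0.5) / n is one common convention for the expected value, assumed here.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a QQ plot."""
    n = len(p_values)
    observed = sorted(p_values)  # the data must be sorted by p-value
    points = []
    for rank, p in enumerate(observed, start=1):
        expected_p = (rank - 0.5) / n  # uniform quantile under the null
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points


# Toy p-values for four SNPs
points = qq_points([0.5, 0.05, 0.005, 0.9])
for expected, observed in points:
    print(f"expected {expected:.3f}  observed {observed:.3f}")
```

Points that rise above the diagonal at the top end are candidate true hits; a line that lifts away from the diagonal along its whole length suggests inflation.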

Page 75: Lecture 7 gwas full

Special commands: Calculating median P-val

1. Make a new dataset (to preserve yours)

2. Set your data

Page 76: Lecture 7 gwas full

Special commands: Correcting for multiple testing

• 1. Make a dataset with ONLY your p-values in (no other columns)

• 2. Make sure your P-values are called RAW_P

3. Set your data

4. Select your corrections (look them up )

Page 77: Lecture 7 gwas full

Lab 7: The dataset

Page 78: Lecture 7 gwas full

Rs number of SNP

chromosome

Base pair position

Effect allele

Frequency of the effect allele

Imputation quality

P value for Hardy-Weinberg

Page 79: Lecture 7 gwas full

Not sure? No. of people in analysis