Session 7: Genome-wide Association Studies (GWAS)
But first….
June 13th 2013
Structure of this lecture
• Recap some concepts (SAS tutorial later)
• Discuss GWAS
• Look at the steps in running a GWAS & analyzing results
• Lab – analyze a GWAS
• SAS tutorial
Recap from last session
• At an A/T locus:
  - What are the genotypes?
  - What are the alleles?
• Given the following genotype counts at a locus:
  A/A: N=123   A/C: N=134   C/C: N=52
  Which is the minor allele? What are the allele frequencies?
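The allele-frequency question above can be worked through by counting alleles. A minimal sketch using the genotype counts from the slide (the variable names are illustrative, not from the lecture):

```python
# Genotype counts from the recap slide.
counts = {"AA": 123, "AC": 134, "CC": 52}

n_people = sum(counts.values())      # number of genotyped individuals
n_alleles = 2 * n_people             # each person carries two alleles

# Each homozygote contributes two copies of its allele, each heterozygote one.
a_count = 2 * counts["AA"] + counts["AC"]
c_count = 2 * counts["CC"] + counts["AC"]

freq_a = a_count / n_alleles
freq_c = c_count / n_alleles
minor_allele = "A" if freq_a < freq_c else "C"

print(f"freq(A) = {freq_a:.3f}, freq(C) = {freq_c:.3f}, minor allele = {minor_allele}")
```

The minor allele is simply the one with the lower frequency in this sample.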
Recap from last session
• Additive model: Each ‘risk’ allele conveys risk for the phenotype in an additive fashion
• Recessive effect of the minor allele: The minor allele conveys risk, but you need 2 copies to have the risk
• Dominant effect of the minor allele: The minor allele conveys risk, and you only need 1 copy to have the risk
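The three genotype models differ only in how the number of minor-allele copies is turned into a predictor. A small sketch of one common coding (the function name and the 0/1/2 input convention are assumptions for illustration):

```python
def code_genotype(minor_copies, model):
    """Code a genotype, given as 0, 1 or 2 copies of the minor allele, for a model."""
    if minor_copies not in (0, 1, 2):
        raise ValueError("minor_copies must be 0, 1 or 2")
    if model == "additive":
        return minor_copies                      # risk rises with each copy
    if model == "recessive":
        return 1 if minor_copies == 2 else 0     # two copies needed
    if model == "dominant":
        return 1 if minor_copies >= 1 else 0     # one copy is enough
    raise ValueError(f"unknown model: {model}")

for model in ("additive", "recessive", "dominant"):
    print(model, [code_genotype(g, model) for g in (0, 1, 2)])
```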
Recap from last session
• Given these genotype means, what genotype models would you run? (Additive / recessive effect of the minor allele / dominant effect of the minor allele?):
     aa   aA   AA
1    10   15   15
2    23   35   42
3    10   12   14
4     9   11   32
5     5    5   16
6     1    5    7
All about GWAS
Back in the 90s…
Heralding the GWAS era
What is a GWAS?
Basic idea: Genotype individuals for a large number (~1M) of SNPs spread in a generally unspecified way throughout the genome. Look for association.
Advantage?
What does this table show?
The era of hypothesis generating research
Why do we question the success of GWAS?
• What is the median income in the US?
  – $50,000
• What is federal tax?
  – 25%
• Median tax/year is
  – ~$12,500
• What is the average cost of an NIH genetics grant?
  – $477,215 / $350,000
• 30-40 working years
Huge numbers of published GWAS!
• http://www.genome.gov/gwastudies/
• http://www.ebi.ac.uk/fgpt/gwas/#
JAMA, 2007
• T2D: it was suspected that genes related to insulin resistance would be identified. “Interestingly, not one gene known to be associated with insulin resistance has been found in the recent set of genome-wide scans for diabetes, but genes related to insulin secretion, insulin transport, zinc binding to insulin, and pancreatic islet beta cell development have been discovered.”
And yet….
• Sir Alec Jeffreys: “One of the great hopes for GWAS was that, in the same way that huge numbers of Mendelian disorders were pinned down at the DNA level and the gene and mutations involved identified, it would be possible to simply extrapolate from single gene disorders to complex multigenic disorders. That really hasn’t happened. Proponents will argue that it has worked and that all sorts of fascinating genes that predispose to or protect against diabetes or breast cancer, for example, have been identified, but the fact remains that the bulk of the heritability in these conditions cannot be ascribed to loci that have emerged from GWAS, which clearly isn’t going to be the answer to everything.”
• Cell, 2010: “To date, genome-wide association studies (GWAS) have published hundreds of common variants whose allele frequencies are statistically correlated with various illnesses and traits. However, the vast majority of such variants have no established biological relevance to disease or clinical utility for prognosis or treatment.”
Running (analyzing) a GWAS
• With each step think: “What effect does this have on our ability to find genes for traits and disorders?”
Running a GWAS: Getting your genotype data
• Select your chip
• Complete your genotyping
Running a GWAS: QC-ing SNP data
• Poor quality samples
  – Sample genotype success rate < 95 to 97.5%
  – Greater proportion of heterozygous genotypes than expected
• Related individuals (if independent samples)
  – Based on pair-wise comparisons of similarity of genotypes
• Sample switches
  – Wrong sex
Running a GWAS: QC-ing SNP data
• This can get very complex!
• “GOLDN SNPs that were monomorphic (55,530) or had a call rate <96% (82,462) were removed from the analysis. In addition, SNPs were excluded from the analysis based on the number of families with Mendelian errors as follows: for minor allele frequency (MAF) <20%, removed if errors were present in 3 families (1,486 SNPs); for MAF <10%, removed if errors were present in 2 families (1,338 SNPs); MAF <5%, removed if errors were present in 1 family (1,767 SNPs); for MAF <5%, removed if any errors were present (9,592 SNPs). In families with remaining errors, SNPs that exhibited Mendelian error were set to missing (31,595 SNPs). Furthermore, 16 participants with call rates <96% were also removed from any subsequent analyses. Subsequently, 748 SNPs failing the Hardy–Weinberg equilibrium (HWE) test at P value <10^-6 were excluded.”
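Three of the filters in the quoted QC description (monomorphic SNPs, call rate <96%, HWE P <10^-6) can be sketched as a simple per-SNP check. The SNP records, field names and rs identifiers below are made up for illustration; the Mendelian-error rules are left out.

```python
# Hypothetical per-SNP summaries; field names and values are illustrative only.
snps = [
    {"rsid": "rs_example1", "call_rate": 0.99, "maf": 0.21, "hwe_p": 0.40},
    {"rsid": "rs_example2", "call_rate": 0.93, "maf": 0.15, "hwe_p": 0.70},  # low call rate
    {"rsid": "rs_example3", "call_rate": 0.99, "maf": 0.00, "hwe_p": 1.00},  # monomorphic
    {"rsid": "rs_example4", "call_rate": 0.98, "maf": 0.30, "hwe_p": 1e-9},  # fails HWE
]

MIN_CALL_RATE = 0.96   # remove SNPs with call rate below 96%
MIN_HWE_P = 1e-6       # remove SNPs with HWE p-value below 10^-6

def passes_qc(snp):
    """Return True if the SNP is polymorphic, well called and not far from HWE."""
    return (snp["maf"] > 0
            and snp["call_rate"] >= MIN_CALL_RATE
            and snp["hwe_p"] >= MIN_HWE_P)

kept = [snp["rsid"] for snp in snps if passes_qc(snp)]
print(kept)
```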
Running a GWAS: Imputing more SNPs
Useful for:
• Combining data from different platforms (e.g., Affy & Illumina) (for replication or meta-analysis)
• Estimating unmeasured or missing genotypes
• Based on measured SNPs and external info (e.g., haplotype structure of HapMap)
• Imputation methods use the dense genotype data available from HapMap samples (i.e., CEU) and the LD relationships of the SNPs to impute (predict) genotypes for a large number of SNPs that were not measured experimentally
“Short cuts”
[Figure: aligned haplotype sequences illustrating tag SNPs]
SNPs 1, 3 and 4 are TagSNPs
Running a GWAS: Imputing more SNPs
• Requires large scale computing resources
• Can go from ~500,000 SNPs to ~2.2 million
• Need to assess quality of imputation
  – Compare imputed genotypes to actual genotypes
• Error rates are higher than for genotyped SNPs
• Works less well for rarer alleles (e.g., MAF<3%)
• Best to take account of probabilities assigned to imputed genotypes in the analysis
  – “dosages” = expected allele counts, weighted by the probabilities of the genotypes
• Allows association testing of untyped variation
• Allows for ease of combining data across genotyping platforms
Genotype:      AA    AG    GG
Probability:   80%   20%   0%
Coding:        2     1     0
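For the example in the table above, a dosage can be computed as the probability-weighted sum of the genotype codings. A minimal sketch (the function name and argument order are assumptions for illustration):

```python
def dosage(p_two_copies, p_one_copy, p_zero_copies):
    """Expected number of copies of the coded allele, given genotype probabilities."""
    probs = (p_two_copies, p_one_copy, p_zero_copies)
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    # Codings 2, 1 and 0 are weighted by their probabilities.
    return 2 * p_two_copies + 1 * p_one_copy + 0 * p_zero_copies

# Slide example: P(AA) = 80%, P(AG) = 20%, P(GG) = 0%.
d = dosage(0.80, 0.20, 0.00)
print(round(d, 2))
```

A dosage close to 0, 1 or 2 means the imputation is confident; values in between carry the uncertainty into the association test.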
Running a GWAS: Imputing more SNPs
MACH, Markov Chain Haplotyping
  – Developed by Gonçalo Abecasis
  – http://www.sph.umich.edu/csg/abecasis/MACH/
IMPUTE
  – Developed by Jonathan Marchini
  – http://www.stats.ox.ac.uk/~marchini/#software
BIMBAM
  – http://quartus.uchicago.edu/~yguan/bimbam/index.html
Running a GWAS: QC-ing SNP data
• More QC!
• MAF again
• Imputation quality (RSQ<.3??)
Running a GWAS: Computing your p-values
• Commonly SNPs coded as the additive effect of alleles
• Logistic regression or linear regression (much like a candidate gene study)
• PLINK
  – http://pngu.mgh.harvard.edu/~purcell/plink/
Running a GWAS: Interpreting your data
• (Really… it is more QC)….
• 2 main issues:
• Population stratification
• Multiple testing
The Problems of population substructure
• Devlin and Roeder (1999) used theoretical arguments to propose that with population structure, the distribution of Cochran-Armitage trend tests, genome-wide, is inflated by a constant multiplicative factor λ.
• We can estimate the multiplicative inflation factor using the statistic λ = median(X_i^2) / 0.456.
• Inflation factor λ > 1 indicates population structure and/or genotyping error.
• We can carry out an adjusted test of association that takes account of any mismatching of cases/controls at any SNP using the statistic X_i^2 / λ.
Inflation factor λ = 1.11
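The inflation factor and the adjusted statistic can be sketched in a few lines. This assumes 1-df chi-square test statistics; the helper names and the five example statistics are made up for illustration, and 0.456 is the rounded null median used on the slide.

```python
from statistics import NormalDist, median

NULL_MEDIAN = 0.456  # value from the slide; the exact 1-df chi-square median is about 0.455

def chi2_from_p(p):
    """Convert a two-sided p-value (0 < p <= 1) into the matching 1-df chi-square."""
    z = NormalDist().inv_cdf(p / 2)   # lower-tail quantile; squaring removes the sign
    return z * z

def inflation_factor(chi2_stats):
    """lambda = median of the observed statistics / median expected under the null."""
    return median(chi2_stats) / NULL_MEDIAN

def genomic_control(chi2_stats, lam):
    """Divide every statistic by lambda to get the adjusted test statistics."""
    return [x / lam for x in chi2_stats]

# Hypothetical test statistics, for illustration only.
chi2_stats = [0.1, 0.3, 0.5, 0.7, 2.0]
lam = inflation_factor(chi2_stats)
adjusted = genomic_control(chi2_stats, lam)
print(round(lam, 3))
```

If only p-values are available, `chi2_from_p` converts them first; in practice the adjustment is usually applied only when λ is above 1.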
Population outliers and/or structure?
True hits?
Solving population stratification
• Principal components analysis (PCA) has become a standard tool in genetics to identify subpopulations
• Used to infer continuous axes of genetic variation (eigenvectors) that reduce the data to a small number of dimensions while describing as much of the variability among individuals as possible
• Utilizes a set of “neutral” SNPs (not associated with phenotype, need many)
• Implemented in the EIGENSTRAT software package.
Solving population stratification - Eigenstrat
Population substructure thought experiment
• You run a GWAS and find a significant hit within a gene. The SNP is genotyped, biologically plausible, in HWE (p>.05) and has a MAF = .13. However, your λ=1.56! So you decide not to use the SNP.
• What would have happened if you had taken a candidate gene approach and only analyzed that SNP?
Multiple testing
• When all tests are independent,
  – Probability to observe P<.01 is
  – Probability to observe P<.05 is
  – Probability to observe P<.5 is
Multiple testing
• At an alpha of .05, a test of a true null hypothesis has a 5% chance of giving a significant result by chance alone.
• How many significant findings do we expect to arise by chance?
• How many tests does a GWAS have?
• So, how many significant findings do we expect to arise by chance?
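The arithmetic behind these questions, as a rough sketch. It assumes about one million SNPs (the figure quoted earlier in the lecture), that every null hypothesis is true, and that tests are independent:

```python
alpha = 0.05
n_tests = 1_000_000   # roughly the number of SNPs tested in a GWAS

# Expected number of tests that reach significance by chance alone.
expected_false_positives = alpha * n_tests

def familywise_error_rate(alpha, k):
    """Probability of at least one false positive among k independent null tests."""
    return 1 - (1 - alpha) ** k

print(expected_false_positives)
print(round(familywise_error_rate(alpha, 20), 3))
```

Even 20 independent tests make at least one chance finding more likely than not, which is why an uncorrected alpha of .05 is unusable genome-wide.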
Multiple testing corrections
• Bonferroni:
• When k independent tests are used, the corrected significance threshold for each p value should be: α / k
“Bonferroni adjustments are, at best, unnecessary, and, at worst, deleterious to sound statistical inference.” Perneger, 1998
While it is computationally simple (single step method), and stringent, maybe it is too stringent? High probability of Type II errors.
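A minimal sketch of the Bonferroni rule, showing how stringent the threshold becomes at GWAS scale (function names and the three example p-values are illustrative):

```python
def bonferroni_threshold(alpha, k):
    """Per-test significance threshold when k tests are performed."""
    return alpha / k

def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that falls below alpha / number of tests."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]

# One million tests at alpha = .05 gives the familiar 5e-8 threshold.
gwas_threshold = bonferroni_threshold(0.05, 1_000_000)
print(gwas_threshold)

# Three hypothetical p-values: the threshold is .05 / 3.
print(bonferroni_significant([0.001, 0.02, 0.04]))
```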
Multiple testing corrections
• Holm
  Target p value = α / (n − rank number of the pair in terms of degree of significance + 1)
• Assume 3 p-values, and α=.05
• the smallest p value has to be smaller than .05/3 = .017 to be sig
• the second smallest p value has to be smaller than .05/2 = .025 to be sig
• and the third smallest has to be smaller than .05/1 = .05 to be sig.
• Low probability of Type II errors… but too low?
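The Holm steps above can be sketched as a step-down loop. One detail not spelled out on the slide but part of Holm's procedure: once a p-value fails its threshold, all larger p-values are also declared non-significant. The function name and example p-values are illustrative.

```python
def holm_significant(p_values, alpha=0.05):
    """Holm step-down: compare the rank-th smallest p-value with alpha / (n - rank + 1)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])   # indices, smallest p first
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        threshold = alpha / (n - rank + 1)
        if p_values[idx] < threshold:
            significant[idx] = True
        else:
            break   # this and every larger p-value are not significant
    return significant

print(holm_significant([0.01, 0.02, 0.04]))   # thresholds .017, .025, .05
print(holm_significant([0.03, 0.01, 0.04]))   # stops at the second smallest
```

In the first example all three p-values pass, whereas a Bonferroni threshold of .05/3 would keep only the smallest.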
Multiple testing corrections
[Figure; legend: Holm, Bonferroni, FDR]
Multiple testing corrections
• FDR procedures are designed to control the expected proportion of incorrectly rejected null hypotheses
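The slide does not name a specific FDR procedure; as an assumption, the sketch below uses the widely used Benjamini-Hochberg step-up rule, which rejects the k smallest p-values where k is the largest rank with p(k) ≤ (k / m) × FDR. The example p-values are made up.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure; returns a reject flag per p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest p first
    # Find the largest rank whose p-value is under its threshold (rank / m) * fdr.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    # Reject every hypothesis ranked at or below that point.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.2]))
```

In the first example all four tests are rejected, while a Bonferroni threshold of .05/4 would keep only two; that is the sense in which FDR control is more lenient.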
Multiple testing corrections
• Storey – q values
• For a given test, we estimate the q-value by calculating the minimum estimated false discovery rate among all thresholds at which the test is called significant
• E.g. Rank 52 has a p-value of 0.01 and a q-value of 0.0141.
• A p-value threshold of 0.01 implies that 1% of all truly null tests are expected to be false positives. With 839 tests, that is up to 839 * .01 = 8.39 false positives. A q-value threshold of 0.0141 implies that 1.41% of the tests called significant are expected to be false positives. In this experiment 52 tests had a q-value of 0.0141 or less, and so 52 * 0.0141 = 0.7332 false positives.
• False positives according to p-values take all 839 values into account, while q-values take into account only those tests with q-values less than the threshold we choose.
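The two calculations in the example differ only in what the rate is multiplied by: all tests for the p-value, only the tests called significant for the q-value. As a quick check of the slide's numbers:

```python
n_tests = 839          # all tests in the experiment
p_threshold = 0.01
# A p-value threshold applies to every test performed.
expected_fp_by_p = n_tests * p_threshold

n_called = 52          # tests with a q-value at or below the threshold
q_threshold = 0.0141
# A q-value threshold applies only to the tests called significant.
expected_fp_by_q = n_called * q_threshold

print(round(expected_fp_by_p, 2), round(expected_fp_by_q, 4))
```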
Multiple testing example
• *Think*: it is a sliding scale; there is no magic p-value of ‘no error’. BUT, studies have repeatedly taken a range of SNPs:
• 1.0*10^-5 > p > 1.0*10^-9
• So, what correction would you use? A stringent one? A lenient one?
• What is the purpose of GWAS?
Multiple testing thought experiment
• Jane comes to you, she is interested in whether SNPs in LDL1R gene are associated with CVD. She uses all 7 SNPs genotyped in LDL1R, and the P-values are as follows: .0001, .01, .0006, .000006, .000001, .05, .09. How many SNPs are significantly associated with CVD?
• John comes to you, and has run a GWAS on a million SNPs, including Jane’s 7 in LDL1R. He does a Bonferroni correction AND an FDR correction on his GWAS, and none of the SNPs survive the correction for multiple testing. He concludes that the LDL1R gene is not associated with CVD
• Who is right?
Multiple testing: a final thought
• Power is a product of
  - N
  - Effect size (subsuming error)
  - Alpha
Running a GWAS: Visualize your results
K Wang et al. Nature 000, 1-6 (2009) doi:10.1038/nature07999
Running a GWAS: Interpreting SNPs
• Look at the functionality of your SNP (SNPdoc)
• Literature search – can you give biological plausibility?
• Other tests: pathway analysis / Gene based tests
• Tonne of free & very expensive tools out there.
Conduct your own quality control!!
Running a GWAS: Final steps
Running a GWAS: Replication
Replication
Hirschhorn & Daly, Nat. Rev. Genet. 6: 95, 2005
NCI-NHGRI Working Group on Replication, Nature 447: 655, 2007
Winner’s Curse
• ‘Winner's curse’ = the phenomenon whereby winners at competitive auctions are likely to pay in excess of the value of the item.
• In genetic association studies, the winner's curse is the phenomenon whereby the disease risk of a newly identified genetic association is overestimated.
• Occurs when the statistical power of original study is not sufficient.
• The winner's curse implies that the sample size required for confirmatory study will be underestimated, resulting in failure of replication study to corroborate the association.
• The winner's curse is common in genome-wide association (GWA) studies because most single-GWA studies are underpowered to detect small genetic effects at a stringent genome-wide significance level.
• What are the solutions? Large GWAS (or a meta-analysis)
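The winner's curse can be illustrated with a small simulation: many underpowered studies estimate the same true effect, and only those passing a stringent threshold are "published". All numbers below (true effect, standard error, threshold, seed) are hypothetical choices for illustration.

```python
import random
from statistics import mean

TRUE_EFFECT = 0.10    # hypothetical true effect size
STD_ERROR = 0.10      # hypothetical standard error of each (underpowered) study
Z_THRESHOLD = 3.0     # hypothetical stringent significance cut-off
N_STUDIES = 20_000

rng = random.Random(7)   # fixed seed so the sketch is reproducible

# Each study's estimate is the true effect plus sampling noise.
estimates = [rng.gauss(TRUE_EFFECT, STD_ERROR) for _ in range(N_STUDIES)]

# Only estimates whose z-score clears the threshold count as discoveries.
winners = [b for b in estimates if b / STD_ERROR > Z_THRESHOLD]

print(f"mean of all estimates:         {mean(estimates):.3f}")
print(f"mean of significant estimates: {mean(winners):.3f} (n = {len(winners)})")
```

Across all studies the estimate is unbiased, but the significant subset must exceed 3 standard errors to be selected, so its average overstates the true effect. A replication sized on that inflated estimate will be too small.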
Can you solve the case of the missing heritability?
Hint. What have we discussed about:
- Traits
- Statistics
- Genotype models
- Coverage of the genome
- Power?
Lab 7: Analyzing GWAS data
Primer on SAS: Libraries
Find your libraries here. Toggle between libraries and results
Work is the default library. SAS will pull files from here if nothing else is specified.
Here are other libraries I have made
Primer on SAS: Reading in data
• For SAS datasets:
Tell SAS to make a library
Give SAS a path. SAS will read in all SAS datasets in this folder
Primer on SAS: Reading in data
• For non-SAS datasets:
Follow on the import wizard (I have code if you prefer)
Primer on SAS: Move your file into ‘work’
‘Data’ command: make a dataset. Call it Goldn. There is no library so put it in ‘work’.
‘set’ command: base the new dataset off goldn in the goldn library
Library
Dataset
Primer on SAS: Titles
Primer on SAS: Summarizing Data
• Continuous variables:
Command ‘proc univariate’
Tell SAS which variables
Select a subsample
There are other commands. E.g. ‘BY’. Your data must be sorted by the ‘BY’ variable to use this command
Primer on SAS: Summarizing Data
• Categorical variables:
Command ‘proc freq’
Tell SAS which variables
Primer on SAS: Manipulating data
Make your data… set your data. Think: Libraries.
Options (look up others)
Primer on SAS: Sorting your dataset
Command = proc sort
Which dataset to sort
What to call the sorted dataset
What to sort by
Primer on SAS. One fundamental rule
• Always check your log!
Special commands for this session
Special commands: QQ plot
1. Make a new dataset (to preserve yours)
2. Set your data
3. Set to the number of SNPs
4. Select the smallest p-value power (here 1*10^-25)
5. Select your data
NOTE: Your data must be sorted by p-value
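The logic behind a QQ plot, independent of the SAS code used in the lab, is to pair each sorted observed p-value with the p-value expected at that rank under the null. The sketch below is an illustration in Python, not the lab's SAS code; the function name and the rank / (n + 1) convention for expected values are assumptions.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs for a QQ plot."""
    observed = sorted(p_values)        # the data must be ordered by p-value
    n = len(observed)                  # the number of SNPs
    points = []
    for rank, p in enumerate(observed, start=1):
        expected_p = rank / (n + 1)    # expected rank-th smallest of n uniform p-values
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points

# Hypothetical p-values, for illustration only.
points = qq_points([0.5, 0.01, 0.2, 1e-6])
for expected, obs in points:
    print(f"{expected:.3f}  {obs:.3f}")
```

Points lying on the diagonal match the null expectation; early, genome-wide departure suggests inflation, while departure only in the tail suggests true hits.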
Special commands: Calculating median P-val
1. Make a new dataset (to preserve yours)
2. Set your data
Special commands: Correcting for multiple testing
1. Make a dataset with ONLY your p-values in (no other columns)
2. Make sure your P-values are called RAW_P
3. Set your data
4. Select your corrections (look them up)
Lab 7: The dataset
Rs number of SNP
chromosome
Base pair position
Effect allele
Frequency of the effect allele
Imputation quality
P value for Hardy-Weinberg
Not sure? No. of people in analysis