the trivial case of the missing heritability
TRANSCRIPT
Max Moldovan, Bioinformatics Division, WEHI
Bioinformatics Seminar
December 08, 2009
The danger of following traditions: The trivial case of the missing
heritability
Motivation
- It is well known that a number of human traits are highly heritable
  - Human height is 80-90% heritable (Visscher, 2008, Nature Genetics 40:489-490)
  - Autism is more than 90% heritable (Freitag, 2007, Mol. Psychiatr. 12:2-22)
  - Schizophrenia is more than 80% heritable (Sullivan, 2005, PLoS Med. 2:e212)
  - Heroin addiction is up to 60% heritable (Tsuang et al., 1996, Am. J. Med. Gen. 67:473-477)
Looking at genes – ~95% of heritability is missing
Searching for genetic “dark matter”
- G x E and G x G interactions
  - How deep to go?
- Rare variants
  - With larger effect sizes?
- Structural variants
  - Deletions, duplications, inversions
- Epigenetics
  - Heritable?
- Overestimated heritability
- Poorly characterized phenotypes
The trivial case of the missing heritability
Talk outline:
- GWAS and inheritance models
- Traditional inference
- Efficiency robust inference
- Empirical illustration
- Discussion (implications for heritability)
Genome-wide association study (GWAS)
- Genetic information (e.g. SNPs) is collected from two groups of individuals, cases and controls, who are discordant with respect to a specified trait
- Genomes are analysed in order to define regions/markers where causative genetic variants are likely to reside
- One of the main analytical objectives is to detect associations between genotypes and the trait (phenotype)
Inheritance models at a single bi-allelic locus
Model              Genotype groups
A is dominant      {AA, Aa} vs {aa}
A is recessive     {AA} vs {Aa, aa}
A is co-dominant   {AA}, {Aa}, {aa}
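The three models can be written as genotype scores, which is how trend tests usually encode them. The sketch below assumes the conventional 0 / 0.5 / 1 coding; the names `SCORES` and `genotype_groups` are illustrative and do not come from the talk.

```python
# Genotype scores for (AA, Aa, aa) under each inheritance model for allele A.
# Assumption: the conventional 0 / 0.5 / 1 coding used with trend tests.
SCORES = {
    "dominant": (1.0, 1.0, 0.0),   # AA and Aa share the same risk
    "recessive": (1.0, 0.0, 0.0),  # only AA differs from the rest
    "additive": (1.0, 0.5, 0.0),   # co-dominant: Aa sits halfway
}


def genotype_groups(model):
    """Return the genotypes grouped by shared score, highest score first."""
    groups = {}
    for genotype, score in zip(("AA", "Aa", "aa"), SCORES[model]):
        groups.setdefault(score, []).append(genotype)
    return [groups[score] for score in sorted(groups, reverse=True)]


for model in SCORES:
    print(model, genotype_groups(model))
```

Genotypes that share a score are indistinguishable under that model, so a test built on the wrong score vector pools or separates the wrong groups.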
Statistical tests for detecting genotype-phenotype associations
- The Cochran-Armitage trend test (CATT) is shown to be optimal if the inheritance model is known (Lettre et al., 2007, Genetic Epidemiology 31:358-362)
- In practice, the inheritance model is not known
- The co-dominant CATT is the traditional choice (see recently reported GWAS)
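For reference, a minimal sketch of the CATT statistic for a 2x3 genotype table, using the standard textbook form with a normal approximation for the p-value. The function names and the counts are made up for illustration and are not taken from any study discussed here.

```python
import math


def catt_z(cases, controls, scores):
    """Cochran-Armitage trend statistic Z for genotype counts (AA, Aa, aa)."""
    R, S = sum(cases), sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    # Score-weighted difference between case and control genotype counts.
    u = sum(x * (S * r - R * s) for x, r, s in zip(scores, cases, controls))
    spread = (N * sum(x * x * n for x, n in zip(scores, totals))
              - sum(x * n for x, n in zip(scores, totals)) ** 2)
    variance = R * S * spread / N
    return u / math.sqrt(variance) if variance > 0 else 0.0


def two_sided_normal_p(z):
    """Asymptotic (approximate) two-sided p-value for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical counts for (AA, Aa, aa).
cases, controls = (20, 50, 30), (10, 40, 50)
for name, scores in (("additive", (1.0, 0.5, 0.0)),
                     ("dominant", (1.0, 1.0, 0.0)),
                     ("recessive", (1.0, 0.0, 0.0))):
    z = catt_z(cases, controls, scores)
    print(name, round(z, 3), two_sided_normal_p(z))
```

With recessive or dominant scores the table collapses to 2x2 and Z squared equals the Pearson chi-square statistic, which is a handy sanity check.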
Alternatives: Efficiency robust significance tests
- Statistical tests that remain sensitive to genotype-phenotype associations even when the genetic model is unknown or misspecified (Podgor et al., 1996, Stat. Med. 15:2095-2105)
- The MAX test (Freidlin et al., 2002, Human Heredity 53:146-152) is one such efficiency robust testing strategy
MAX efficiency robust testing
- MAX3: additive (co-dominant), dominant and recessive CATTs:
  T_{MAX3} = max(T_A, T_D, T_R), then use T_{MAX3} to compute p-values
- MAX4: additive (co-dominant), dominant and recessive CATTs, plus Pearson's Chi-sq:
  p_{p-min} = min(p_{T-max}, p_{Chi-sq}), then use p_{p-min} as the test statistic to compute p-values
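A minimal sketch of MAX3 with a Monte Carlo permutation p-value, the conventional reference approach. Assumptions: absolute values of the three CATT statistics are used (a two-sided reading), the counts are made up, and the helper names are illustrative. MAX4 is not covered here.

```python
import math
import random

SCORES = {"additive": (1.0, 0.5, 0.0),
          "dominant": (1.0, 1.0, 0.0),
          "recessive": (1.0, 0.0, 0.0)}


def catt_z(cases, controls, scores):
    """Cochran-Armitage trend statistic Z for genotype counts (AA, Aa, aa)."""
    R, S = sum(cases), sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    u = sum(x * (S * r - R * s) for x, r, s in zip(scores, cases, controls))
    spread = (N * sum(x * x * n for x, n in zip(scores, totals))
              - sum(x * n for x, n in zip(scores, totals)) ** 2)
    variance = R * S * spread / N
    return u / math.sqrt(variance) if variance > 0 else 0.0


def max3(cases, controls):
    """Largest absolute CATT statistic over the three inheritance models."""
    return max(abs(catt_z(cases, controls, s)) for s in SCORES.values())


def permutation_p(cases, controls, n_perm=2000, seed=0):
    """Monte Carlo p-value: reshuffle case/control labels, recompute MAX3."""
    rng = random.Random(seed)
    observed = max3(cases, controls)
    totals = [r + s for r, s in zip(cases, controls)]
    genotypes = [g for g in range(3) for _ in range(totals[g])]
    n_cases = sum(cases)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(genotypes)
        perm_cases = [0, 0, 0]
        for g in genotypes[:n_cases]:
            perm_cases[g] += 1
        perm_controls = [totals[g] - perm_cases[g] for g in range(3)]
        if max3(perm_cases, perm_controls) >= observed - 1e-12:
            hits += 1
    # Add-one estimator keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical counts for (AA, Aa, aa).
cases, controls = (20, 50, 30), (10, 40, 50)
print(max3(cases, controls), permutation_p(cases, controls))
```

The cost is one full recomputation per permutation per SNP, which is why this route becomes expensive at genome-wide scale and at small p-values.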
Problems with MAX
- The distribution of the MAX test statistic is either unknown or difficult to obtain (e.g. asymptotic approximations)
- Permutation procedures can be applied, but they are extremely computationally intensive
- The p-values based on permutations or asymptotic approximations are not statistically valid p-values
Statistical validity
Pr(P(Y) ≤ α | H0) ≤ α
- Corresponds to a test of correct size, i.e. the type I error rate does not exceed the nominal level α
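The definition can be checked by brute force in a toy setting. The sketch below swaps in a much simpler problem than the genotype tables of the talk: a two-sided test of a fair coin, where every outcome can be enumerated. It computes the actual size Pr(P ≤ α | H0) for an exact p-value and for a normal-approximation p-value. All function names are illustrative.

```python
from math import comb, erfc, sqrt


def exact_p(k, n):
    """Exact two-sided p-value: null probability of outcomes no likelier than k."""
    probs = [comb(n, j) / 2 ** n for j in range(n + 1)]
    return sum(p for p in probs if p <= probs[k] * (1 + 1e-12))


def approx_p(k, n):
    """Two-sided p-value from the normal approximation to Binomial(n, 1/2)."""
    z = (k - n / 2) / sqrt(n / 4)
    return erfc(abs(z) / sqrt(2))


def actual_size(p_value, n, alpha):
    """Pr(P <= alpha | H0): total null probability of the rejecting outcomes."""
    return sum(comb(n, k) / 2 ** n
               for k in range(n + 1) if p_value(k, n) <= alpha)


alpha = 0.05
for n in (10, 20, 50):
    print(n, actual_size(exact_p, n, alpha), actual_size(approx_p, n, alpha))
```

The exact p-value never exceeds the nominal level by construction, though it can fall below it (conservatism). An approximate p-value carries no such guarantee, so its actual size has to be checked case by case.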
[Figure: Additive approximate vs. Fisher CATT p-values. x-axis: approximate additive CATT p-values; y-axis: exact additive CATT p-values; both axes run from 0.0000 to 0.0014.]
Statistical validity
Liberal test with inflated size
Exact and conservative test
Exact and efficient test
Fisher-MAX p-values
The probability of each possible table (outcome):
The Fisher-MAX p-value is the probability of tables equally or more extreme than the observed one, i.e. with T(X_1, X_2) ≤ t(x_1, x_2):
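A minimal sketch of how an exact conditional p-value of this kind could be computed for one 2x3 table. This is one plausible reading, not necessarily the construction used in the talk. Assumptions: both margins are held fixed as in Fisher's exact test, so the case genotype counts follow a multivariate hypergeometric distribution; tables are ordered by MAX3 with larger values counted as more extreme, which is the mirror image of a T ≤ t ordering on a p-value-like statistic; the names and counts are hypothetical.

```python
from math import comb, sqrt

SCORES = ((1.0, 0.5, 0.0), (1.0, 1.0, 0.0), (1.0, 0.0, 0.0))


def catt_z(cases, controls, scores):
    """Cochran-Armitage trend statistic Z for genotype counts (AA, Aa, aa)."""
    R, S = sum(cases), sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    u = sum(x * (S * r - R * s) for x, r, s in zip(scores, cases, controls))
    spread = (N * sum(x * x * n for x, n in zip(scores, totals))
              - sum(x * n for x, n in zip(scores, totals)) ** 2)
    variance = R * S * spread / N
    return u / sqrt(variance) if variance > 0 else 0.0


def max3(cases, controls):
    """Largest absolute CATT statistic over additive, dominant, recessive."""
    return max(abs(catt_z(cases, controls, s)) for s in SCORES)


def exact_conditional_p(cases, controls):
    """Sum the null probabilities of all tables at least as extreme as observed."""
    n = [r + s for r, s in zip(cases, controls)]
    R, N = sum(cases), sum(n)
    observed = max3(cases, controls)
    denominator = comb(N, R)
    p = 0.0
    # Enumerate every case-count vector compatible with the fixed margins.
    for r0 in range(max(0, R - n[1] - n[2]), min(n[0], R) + 1):
        for r1 in range(max(0, R - r0 - n[2]), min(n[1], R - r0) + 1):
            r2 = R - r0 - r1
            table_cases = (r0, r1, r2)
            table_controls = (n[0] - r0, n[1] - r1, n[2] - r2)
            if max3(table_cases, table_controls) >= observed - 1e-9:
                p += comb(n[0], r0) * comb(n[1], r1) * comb(n[2], r2) / denominator
    return p


# Hypothetical counts for (AA, Aa, aa).
print(exact_conditional_p((8, 10, 2), (2, 10, 8)))
```

Because the p-value is a sum of exact null probabilities over a fixed set of tables, it needs neither asymptotic approximation nor resampling. The cost is the enumeration, which grows with the sample sizes.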
Fisher-MAX p-values
- Valid: can be slightly conservative, but the conservatism can be eliminated, leading to exact p-values
- Computationally feasible: for (n1,n2)=(162,131), each SNP takes between 0.01 and 1.10 seconds to compute (~3-4 h for 300K+ SNPs on a single CPU)
- Efficient: the test is sensitive to association signals even when the model is unknown or misspecified
HCV genotype 1 progression
HCV infection
  Clearance (~20%)
  Chronic HCV (~80%)
    No treatment response (~50%)
    Treatment response (~50%)
Source: Based on NIH information
Genome-wide analysis in Suppiah et al. 2009
- 162 cases (non-responders) vs 131 controls (responders), 311,159 SNPs
- Additive CATT was used with p-value cut-off 0.001
- One SNP was genome-wide significant
- 306 SNPs passed the p-value < 0.001 threshold
Reanalysis: additive CATT, MAX3 and MAX4 p-values for the same cut-off 0.001
max(p_{MAX4}) = 0.0028 > 0.001
Summary
- Looking at genomes, ~95% of heritability is missing
- There are several alternative inheritance models
- Traditional statistical inference procedures miss association signals by not accounting for alternative inheritance models
- Can some heritability be hidden in overlooked association signals?
Acknowledgments
Bioinformatics, WEHI: Melanie Bahlo, Terry Speed
NTNU, Norway: Mette Langaas
AGRF: Rust Turakulov
MBS, UniMelb: Chris Lloyd
Millennium Institute & Westmead Children’s Hospital, Sydney: Vijay Suppiah, David Booth, Jacob George
Funding: ARC Linkage Grant