the trivial case of the missing heritability

Max Moldovan, Bioinformatics Division, WEHI

Bioinformatics Seminar, December 08, 2009

The danger of following traditions: The trivial case of the missing heritability

Motivation

- It is well known that a number of human traits are highly heritable
  - Human height is 80-90% heritable (Visscher, 2008, Nature Genetics 40:489-490)
  - Autism is more than 90% heritable (Sullivan, 2005, PLoS Med. 2:e212)
  - Schizophrenia is more than 80% heritable (Freitag, 2007, Mol. Psychiatr. 12:2-22)
  - Heroin addiction is up to 60% heritable (Tsuang et al., 1996, Am. J. Med. Gen. 67:473-477)

Looking at genes – ~95% of heritability is missing

Searching for genetic “dark matter”

- G x E and G x G interactions
  - How deep to go?
- Rare variants
  - With larger effect sizes?
- Structural variants
  - Deletions, duplications, inversions
- Epigenetics
  - Heritable?
- Overestimated heritability
- Poorly characterized phenotypes

The trivial case of the missing heritability

Talk outline:

- GWAS and inheritance models
- Traditional inference
- Efficiency robust inference
- Empirical illustration
- Discussion (implications for heritability)

Genome-wide association study (GWAS)

- Genetic information (e.g. SNPs) is collected from two groups of individuals, cases and controls, who are discordant with respect to a specified trait
- Genomes are analysed in order to define regions/markers where causative genetic variants are likely to reside
- One of the main analytical objectives is to detect associations between genotypes and the trait (phenotype)

Inheritance models at a single bi-allelic locus

Genotype groups: AA, Aa, aa

Models:

- A is dominant
- A is recessive
- A is co-dominant
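The three models differ only in how the genotypes AA, Aa and aa are grouped. A minimal sketch of that grouping follows, assuming the textbook definitions of dominant, recessive and co-dominant inheritance; the numeric scores and the `groups` helper are hypothetical names introduced here for illustration, not taken from the slides.

```python
# Genotype order follows the slide: AA, Aa, aa.
GENOTYPES = ("AA", "Aa", "aa")

# Hypothetical encoding: score = relative effect of allele A under each model.
SCORES = {
    "dominant":    (1.0, 1.0, 0.0),  # one copy of A suffices: AA and Aa grouped
    "recessive":   (1.0, 0.0, 0.0),  # two copies of A needed: Aa and aa grouped
    "co-dominant": (1.0, 0.5, 0.0),  # Aa lies midway between AA and aa
}

def groups(model):
    """Return the genotypes grouped by equal score under the given model."""
    by_score = {}
    for genotype, score in zip(GENOTYPES, SCORES[model]):
        by_score.setdefault(score, []).append(genotype)
    return [by_score[s] for s in sorted(by_score, reverse=True)]

for model in SCORES:
    print(model, groups(model))
# dominant [['AA', 'Aa'], ['aa']]
# recessive [['AA'], ['Aa', 'aa']]
# co-dominant [['AA'], ['Aa'], ['aa']]
```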

Statistical tests for detecting genotype-phenotype associations

- The Cochran-Armitage trend test (CATT) has been shown to be optimal if the inheritance model is known (Lettre et al., 2007, Genetic Epidemiology 31:358-362)
- In practice, the inheritance model is not known
- The co-dominant CATT is the traditional choice (see recently reported GWAS)

Alternatives: Efficiency robust significance tests

- Statistical tests that remain sensitive to genotype-phenotype associations even when the genetic model is unknown or misspecified (Podgor et al., 1996, Stat. Med. 15:2095-2105)
- The MAX test (Freidlin et al., 2002, Human Heredity 53:146-152) is one efficiency robust testing strategy

MAX efficiency robust testing

- MAX3: additive (co-dominant), dominant and recessive CATTs:
  $T_{\mathrm{MAX3}} = \max(T_A, T_D, T_R)$, then use $T_{\mathrm{MAX3}}$ to compute p-values
- MAX4: additive (co-dominant), dominant and recessive CATTs, plus Pearson's Chi-sq:
  $p_{p\text{-}min} = \min(p_{T\text{-}max}, p_{\text{Chi-sq}})$, then use $p_{p\text{-}min}$ as the test statistic to compute p-values
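As a rough illustration of the MAX3 statistic, the sketch below computes the three CATTs for one 2x3 genotype table and takes their maximum. It assumes the usual large-sample form of the trend statistic, written in chi-square (squared) form so that a plain maximum is two-sided, and it orders genotypes by the number of risk-allele copies (0, 1, 2). The function names and the example counts are hypothetical, not taken from the talk.

```python
import math

# Scores for genotypes carrying 0, 1, 2 copies of the risk allele (assumed ordering).
SCORES = {
    "A": (0.0, 0.5, 1.0),  # additive (co-dominant)
    "D": (0.0, 1.0, 1.0),  # dominant
    "R": (0.0, 0.0, 1.0),  # recessive
}

def catt(cases, controls, scores):
    """Cochran-Armitage trend statistic in chi-square (1 df) form."""
    R, S = sum(cases), sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    u = sum(x * (S * r - R * s) for x, r, s in zip(scores, cases, controls))
    var = R * S * (N * sum(x * x * n for x, n in zip(scores, totals))
                   - sum(x * n for x, n in zip(scores, totals)) ** 2)
    if var == 0:
        return 0.0  # degenerate table: the scores do not vary across subjects
    return N * u * u / var

def max3(cases, controls):
    """Return (T_MAX3, individual statistics) for one SNP."""
    t = {model: catt(cases, controls, x) for model, x in SCORES.items()}
    return max(t.values()), t

# Hypothetical genotype counts for 162 cases and 131 controls.
t_max3, components = max3((40, 70, 52), (50, 60, 21))
print(components)
print("T_MAX3 =", t_max3)
```

Each component is approximately chi-square with one degree of freedom under the null, but their maximum is not, so T_MAX3 cannot simply be referred to a chi-square table.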

Problems with MAX

- The distribution of the MAX test statistics is either unknown or difficult to obtain (e.g. asymptotic approximations)
- Permutation procedures can be applied, but they are extremely computationally intensive
- The p-values based on permutations or asymptotic approximations are not statistically valid p-values

Statistical validity

$\Pr(P(Y) \le \alpha \mid H_0) \le \alpha$

- Corresponds to a test of correct size, i.e. the type I error rate does not exceed the nominal level α
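The validity condition can be checked directly whenever the null distribution is small enough to enumerate. The sketch below does this for a deliberately simple toy case, a one-sided binomial test with 10 trials, which is an assumption made for illustration and not an example from the talk. It compares an exact p-value with a normal approximation; all names are hypothetical.

```python
import math

N_TRIALS = 10  # hypothetical toy sample size
P_NULL = 0.5   # success probability under H0

def null_prob(x):
    """Probability of observing x successes under H0."""
    return math.comb(N_TRIALS, x) * P_NULL ** x * (1 - P_NULL) ** (N_TRIALS - x)

def exact_p(x):
    """Exact one-sided p-value: Pr(X >= x | H0)."""
    return sum(null_prob(k) for k in range(x, N_TRIALS + 1))

def approx_p(x):
    """Normal-approximation p-value without continuity correction."""
    z = (x - N_TRIALS * P_NULL) / math.sqrt(N_TRIALS * P_NULL * (1 - P_NULL))
    return 0.5 * math.erfc(z / math.sqrt(2))

def size(p_value, alpha):
    """Actual type I error rate: Pr(p-value <= alpha | H0)."""
    return sum(null_prob(x) for x in range(N_TRIALS + 1) if p_value(x) <= alpha)

print(size(exact_p, 0.05))   # 0.0107421875 (does not exceed 0.05: valid, conservative)
print(size(approx_p, 0.05))  # 0.0546875 (exceeds 0.05: liberal, inflated size)
```

In this toy case the approximate p-value rejects at 8 or more successes, where the exact tail probability is already above 0.05, which is how an approximation ends up with inflated size.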

Additive approximate vs. Fisher CATT p-vals

[Figure: scatter plot of exact additive CATT p-values (y-axis) against approximate additive CATT p-values (x-axis), both axes from 0.0000 to 0.0014. Annotations on the plot: "Liberal test with inflated size", "Exact and conservative test", "Exact and efficient test".]

Fisher-MAX p-values

The probability of each possible table (outcome):

The Fisher-MAX p-value is the probability of tables equally or more extreme than the observed one, i.e. with $T(X_1, X_2) \le t(x_1, x_2)$:

Fisher-MAX p-values

- Valid (can be slightly conservative, but the conservatism can be eliminated, leading to exact p-values)
- Computationally feasible: for (n1, n2) = (162, 131), p-values take between 0.01 and 1.10 seconds per SNP to compute (~3-4 h for 300K+ SNPs on a single CPU)
- Efficient: the test is sensitive to association signals even when the model is unknown or misspecified
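The idea of an exact p-value for a MAX-type statistic can be sketched by brute-force enumeration. The code below is not the author's implementation. It assumes Fisher-style conditioning on both the genotype totals and the case/control totals, so that each table has a multivariate hypergeometric probability, and it orders tables by a MAX3 statistic in chi-square form. Larger values are more extreme here, so the inequality runs the opposite way to the slide's $T \le t$. All function names are hypothetical.

```python
import math

# Additive, dominant, recessive scores for 0, 1, 2 copies of the risk allele.
SCORES = ((0.0, 0.5, 1.0), (0.0, 1.0, 1.0), (0.0, 0.0, 1.0))

def catt(cases, controls, scores):
    """Cochran-Armitage trend statistic in chi-square (1 df) form."""
    R, S = sum(cases), sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    u = sum(x * (S * r - R * s) for x, r, s in zip(scores, cases, controls))
    var = R * S * (N * sum(x * x * n for x, n in zip(scores, totals))
                   - sum(x * n for x, n in zip(scores, totals)) ** 2)
    return N * u * u / var if var else 0.0

def max3(cases, controls):
    return max(catt(cases, controls, x) for x in SCORES)

def table_prob(cases, totals):
    """Multivariate hypergeometric probability of the case row given all margins."""
    num = 1
    for r, n in zip(cases, totals):
        num *= math.comb(n, r)
    return num / math.comb(sum(totals), sum(cases))

def exact_max3_p(cases, controls):
    """Sum the probabilities of all tables at least as extreme as the observed one."""
    totals = [r + s for r, s in zip(cases, controls)]
    R = sum(cases)
    threshold = max3(cases, controls) * (1 - 1e-9)  # tolerance for floating-point ties
    p = 0.0
    for r0 in range(min(R, totals[0]) + 1):
        for r1 in range(min(R - r0, totals[1]) + 1):
            r2 = R - r0 - r1
            if r2 > totals[2]:
                continue  # not a valid table for these margins
            cand = (r0, r1, r2)
            ctrl = tuple(n - r for n, r in zip(totals, cand))
            if max3(cand, ctrl) >= threshold:
                p += table_prob(cand, totals)
    return p

# Hypothetical counts for one SNP.
print(exact_max3_p((8, 9, 3), (3, 9, 8)))
```

The number of candidate tables grows roughly with the square of the number of cases, so for sample sizes in the low hundreds this enumeration stays in the tens of thousands of tables per SNP.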

HCV genotype 1 progression

- HCV infection
  - Clearance (~20%)
  - Chronic HCV (~80%)
    - No treatment response (~50%)
    - Treatment response (~50%)

Source: Based on NIH information

Genome-wide analysis in Suppiah et al. 2009

- 162 cases (non-responders) vs 131 controls (responders), 311,159 SNPs
- Additive CATT was used with a p-value cut-off of 0.001
- One SNP was genome-wide significant
- 306 SNPs passed the p-value < 0.001 threshold

Reanalysis: additive CATT, MAX3 and MAX4 p-values for the same cut-off 0.001

max(pMAX4) = 0.0028 > 0.001


Summary

- Looking at genomes: ~95% of heritability is missing
- There are several alternative inheritance models
- Traditional statistical inference procedures miss association signals by not accounting for alternative inheritance models
- Can some heritability be hidden in overlooked association signals?

Acknowledgments

- Bioinformatics, WEHI: Melanie Bahlo, Terry Speed
- NTNU, Norway: Mette Langaas
- AGRF: Rust Turakulov
- MBS, UniMelb: Chris Lloyd
- Millennium Institute & Westmead Children’s Hospital, Sydney: Vijay Suppiah, David Booth, Jacob George
- Funding: ARC Linkage Grant
