more powerful genome-wide association methods for case-control data robert c. elston, phd case...

Post on 19-Dec-2015


More Powerful Genome-wide Association Methods for Case-control Data

Robert C. Elston, PhD
Case Western Reserve University
Cleveland, Ohio

SINGLE-MARKER AND TWO-MARKER ASSOCIATION TESTS FOR UNPHASED CASE-CONTROL GENOTYPE DATA, WITH A POWER COMPARISON

Kim S, Morris NJ, Won S, Elston RC
Genetic Epidemiology, in press

Introduction

• A genome-wide association study with case-control data aims to localize disease susceptibility regions in the genome
• Single Nucleotide Polymorphism (SNP) markers, which are usually diallelic, have been used to cover the whole genome
• Two categories of tests have been applied to these data
  • single-marker association tests, which examine association between affection status and the SNP data one SNP at a time
  • multi-marker association tests, which examine association between affection status and multiple SNP data simultaneously

Information for association

[Figure: diagram relating three sources of information for association analysis (Allele, HWD, LD), with regions labeled a to g]

a. Allele frequency trend test
b. HWD trend test
c. LD contrast test
d. genotype frequency test
e. haplotype-based test with HWE
f. ???
g. phase-known genotype-based test

• The allele frequency, HWD and LD contrast tests are typically developed in what has been termed a retrospective context; i.e., case-control status is considered fixed and the genotypes are considered random
• For case-control data, epidemiologists typically take advantage of the properties of the odds ratio and use the prospective logistic regression model, making the case-control status the random variable dependent on the predictors
• Prospective modeling tends to allow for greater flexibility, especially when adjusting for covariates
• It also provides a natural way to adjust for any correlations between the tests or other covariates, and can be extended to quantitative traits

Notation and Assumptions

• We suppose there are two diallelic SNP markers, A and B, having alleles {$A_1$, $A_2$} and {$B_1$, $B_2$}, respectively, where $A_1$ and $B_1$ are the minor alleles

$$X = \begin{cases} 1 & \text{for } A_1A_1 \\ 0 & \text{for } A_1A_2 \\ -1 & \text{for } A_2A_2 \end{cases}, \qquad Y = \begin{cases} 1 & \text{for } B_1B_1 \\ 0 & \text{for } B_1B_2 \\ -1 & \text{for } B_2B_2 \end{cases}$$

• $I_{case}$ and $I_{ctrl}$ denote the sets of cases and controls
• We make minimal assumptions about the general population sampled; in particular, we do not assume HWE in the population
• $\mu_X$, $\sigma_X^2$ and $\sigma_{XY}$ denote the expected value of X, the variance of X and the covariance of X and Y, respectively

• The HWD parameter for marker A is given by

$$d_A = p_{A_1A_1} - p_A^2$$

• The HWD parameter can be expressed as

$$d_A = \tfrac{1}{2}\left(\sigma_X^2 - \sigma_{X|HWE}^2\right)$$

• This means that the HWD parameter, $d_A$, is half the deviation of the variance from the variance expected under HWE
• The composite LD parameter for alleles $A_1$ and $B_1$ of markers A and B is

$$\Delta = 2g_{1,1} + g_{1,0} + g_{0,1} + \tfrac{1}{2}g_{0,0} - 2p_Ap_B = \tfrac{1}{2}\sigma_{XY}$$

Probabilities for unphased genotypes

[Table: probabilities $g_{x,y} = P(X = x, Y = y)$ of the unphased two-marker genotypes, as used in the composite LD parameter $\Delta$]
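The two identities on these slides, that the HWD parameter is half the excess of the variance of X over its HWE value and that the composite LD parameter is half the covariance of X and Y, can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the genotype probabilities in `G` are hypothetical illustration values, and `g[(x, y)]` is read as the probability of the unphased genotype with X = x and Y = y.

```python
# Sketch: allele frequency, HWD and composite LD parameters computed from
# unphased two-marker genotype probabilities g[(x, y)] = P(X = x, Y = y).
# All names and the numbers in G are hypothetical illustration values.

def marginal(g, axis):
    """Marginal genotype distribution of X (axis=0) or Y (axis=1)."""
    out = {1: 0.0, 0: 0.0, -1: 0.0}
    for key, prob in g.items():
        out[key[axis]] += prob
    return out

def allele_freq(m):
    """Frequency of the allele coded 1 (A_1 or B_1): P(1) + P(0) / 2."""
    return m[1] + 0.5 * m[0]

def variance(m):
    """Variance of the genotype code under the marginal distribution m."""
    mean = sum(v * p for v, p in m.items())
    return sum((v - mean) ** 2 * p for v, p in m.items())

def covariance(g):
    """Covariance of X and Y under the joint distribution g."""
    mx = sum(x * p for (x, y), p in g.items())
    my = sum(y * p for (x, y), p in g.items())
    return sum((x - mx) * (y - my) * p for (x, y), p in g.items())

def hwd(m):
    """HWD parameter: homozygote frequency minus squared allele frequency."""
    return m[1] - allele_freq(m) ** 2

def composite_ld(g):
    """Composite LD: 2 g11 + g10 + g01 + g00 / 2 - 2 p_A p_B."""
    p_a = allele_freq(marginal(g, 0))
    p_b = allele_freq(marginal(g, 1))
    return (2 * g[(1, 1)] + g[(1, 0)] + g[(0, 1)]
            + 0.5 * g[(0, 0)] - 2 * p_a * p_b)

# Hypothetical joint genotype probabilities (they sum to 1).
G = {
    (1, 1): 0.06, (1, 0): 0.05, (1, -1): 0.02,
    (0, 1): 0.07, (0, 0): 0.20, (0, -1): 0.15,
    (-1, 1): 0.03, (-1, 0): 0.17, (-1, -1): 0.25,
}

if __name__ == "__main__":
    mx = marginal(G, 0)
    p_a = allele_freq(mx)
    print("d_A:", hwd(mx), "vs", 0.5 * (variance(mx) - 2 * p_a * (1 - p_a)))
    print("Delta:", composite_ld(G), "vs", 0.5 * covariance(G))
```

Because X is the count of $A_1$ alleles minus one, both identities hold for any genotype distribution, with or without HWE, which is why no population assumption is needed in the check.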

• The joint test of allele frequency and HWD contrasts between cases and controls tests the null hypothesis $H_0: (p_{A|case}, d_{A|case}) = (p_{A|ctrl}, d_{A|ctrl})$
• Let $Z_i = (X_i, X_i^2)'$; the sample mean $\bar{Z}$ is a sufficient statistic for $(p_A, d_A)'$
• The Allelic-HWD contrast test can be performed by comparing $\bar{Z}_{case}$ and $\bar{Z}_{ctrl}$. The $T^2$ statistic for this test is

$$T^2 = \frac{n_{case}\, n_{ctrl}}{n_{case} + n_{ctrl}} \left(\bar{Z}_{case} - \bar{Z}_{ctrl}\right)' S^{-1} \left(\bar{Z}_{case} - \bar{Z}_{ctrl}\right)$$
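The Allelic-HWD contrast can be sketched as a two-sample $T^2$ on the vectors $(X_i, X_i^2)'$. The slide does not say how $S$ is estimated, so this sketch assumes the pooled sample covariance matrix; the function names and genotype counts are hypothetical.

```python
# Sketch of the Allelic-HWD contrast (test 1-2) as a two-sample T^2.
# Assumption: S is the pooled sample covariance matrix of Z_i = (X_i, X_i^2)'.
# Genotypes are coded 1, 0, -1 as on the notation slide.

def z_vectors(genotypes):
    """Map each genotype code x to the vector (x, x^2)."""
    return [(x, x * x) for x in genotypes]

def mean_vector(z):
    n = len(z)
    return (sum(v[0] for v in z) / n, sum(v[1] for v in z) / n)

def scatter_matrix(z, m):
    """Sums of squares and cross-products about the mean: (s11, s12, s22)."""
    s11 = sum((v[0] - m[0]) ** 2 for v in z)
    s12 = sum((v[0] - m[0]) * (v[1] - m[1]) for v in z)
    s22 = sum((v[1] - m[1]) ** 2 for v in z)
    return s11, s12, s22

def t2_allelic_hwd(cases, controls):
    """T^2 statistic contrasting mean (X, X^2) between cases and controls."""
    z1, z0 = z_vectors(cases), z_vectors(controls)
    n1, n0 = len(z1), len(z0)
    m1, m0 = mean_vector(z1), mean_vector(z0)
    a, b = scatter_matrix(z1, m1), scatter_matrix(z0, m0)
    # Pooled covariance (an assumption of this sketch).
    s11, s12, s22 = ((a[k] + b[k]) / (n1 + n0 - 2) for k in range(3))
    det = s11 * s22 - s12 * s12
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    d0, d1 = m1[0] - m0[0], m1[1] - m0[1]
    # Quadratic form d' S^{-1} d using the closed-form 2x2 inverse.
    quad = (s22 * d0 * d0 - 2 * s12 * d0 * d1 + s11 * d1 * d1) / det
    return n1 * n0 / (n1 + n0) * quad

# Hypothetical genotype counts for 200 cases and 200 controls.
CASES = [1] * 30 + [0] * 90 + [-1] * 80
CONTROLS = [1] * 18 + [0] * 84 + [-1] * 98

if __name__ == "__main__":
    print("T^2 =", t2_allelic_hwd(CASES, CONTROLS))
```

With two components in $Z_i$, the statistic would be referred to a chi-square distribution with 2 degrees of freedom in large samples; the two-marker tests follow the same pattern with longer vectors.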

• Let $Z_i = (X_i, Y_i, X_iY_i)'$; $\bar{Z}$ is a sufficient statistic for $(p_A, p_B, \Delta)'$
• The Allelic-LD contrast test can be performed using a version of Hotelling's $T^2$
• The additional case-control differences can be captured by the HWD and LD contrast tests, given the allele frequency contrast(s)
• The Allelic-HWD-LD contrast test can be constructed in a similar manner by contrasting the mean vector of $Z_i = (X_i, Y_i, X_i^2, Y_i^2, X_iY_i)'$ between cases and controls

Single-marker and two-marker association tests with corresponding models and hypotheses

Test        Description

Single-marker association
Test 1-1    Allele frequency contrast test (Allelic test)
Test 1-2    Allelic-HWD contrast test (Genotypic test)

Two-marker association
Test 2-2    Joint Allelic contrast test
Test 2-3    Joint Allelic-LD contrast test
Test 2-4    Joint Allelic-HWD contrast test
Test 2-5    Joint Allelic-HWD-LD contrast test

[Table columns "Model" and "Null hypothesis": entries not recoverable]

Multistage Tests

• "Self-replication" if the tests are independent
• Sequential tests

E.g., the HWD contrast test adjusted for the allele frequency information used in the first stage can be performed by the test of

$$H_0: \beta_{X^2|X} = 0$$
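One way to read this hypothesis is through a prospective logistic model with the genotype code X and its square as predictors. This is a sketch under that assumption: the coefficient names $\beta_0$, $\beta_X$ and $\beta_{X^2}$ do not appear on the slide.

```latex
% Assumed prospective model for a single marker (hypothetical notation)
\operatorname{logit} P(\text{affected} \mid X) = \beta_0 + \beta_X X + \beta_{X^2} X^2

% Stage 1: allelic test, fitted without the quadratic term
H_0^{(1)}: \beta_X = 0 \quad \text{in} \quad
\operatorname{logit} P(\text{affected} \mid X) = \beta_0 + \beta_X X

% Stage 2: HWD contrast adjusted for the allelic effect,
% i.e. the coefficient of X^2 given that X is already in the model
H_0^{(2)}: \beta_{X^2 \mid X} = 0
```

Under this reading, the second stage asks only whether the quadratic term adds information beyond the linear term tested in the first stage.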

Penetrance Model and True Marker Association Model

• Let D denote the disease genotype variable coded as

$$D = \begin{cases} 1 & \text{for } D_1D_1 \\ 0 & \text{for } D_1D_2 \\ -1 & \text{for } D_2D_2 \end{cases}$$

• We write the penetrance model as:

$$P(\text{affected} \mid D) = \gamma_0 + \gamma_D D + \gamma_{D^2} D^2$$

Constraints for disease models

Disease Model Constraint

Additive

Dominant or Recessive

Heterozygote (Dis)advantage
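The constraint column can be worked out from the penetrance model $P(\text{affected} \mid D) = \gamma_0 + \gamma_D D + \gamma_{D^2} D^2$ by writing out the three penetrances. The following is a derivation sketch, not the slide's own table: the symbols $f_1$, $f_0$, $f_{-1}$ are introduced here, and heterozygote (dis)advantage is assumed to mean that the two homozygotes have equal penetrance.

```latex
% Penetrances implied by the coding D = 1, 0, -1
f_{1} = \gamma_0 + \gamma_D + \gamma_{D^2}, \qquad
f_{0} = \gamma_0, \qquad
f_{-1} = \gamma_0 - \gamma_D + \gamma_{D^2}

% Additive: equal steps between successive genotypes
f_{1} - f_{0} = f_{0} - f_{-1} \iff \gamma_{D^2} = 0

% Dominant or recessive: the heterozygote matches one homozygote
f_{1} = f_{0} \iff \gamma_D = -\gamma_{D^2}, \qquad
f_{0} = f_{-1} \iff \gamma_D = \gamma_{D^2},
\quad \text{so } |\gamma_D| = |\gamma_{D^2}|

% Heterozygote (dis)advantage (assumed: equal homozygote penetrances)
f_{1} = f_{-1} \iff \gamma_D = 0
```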

• Given the true disease model and the LD structure, we can set up the true single-marker association model between the phenotype and single-marker data X:

$$P(\text{affected} \mid X) = \sum_{D=-1,0,1} P(\text{affected} \mid D)\, P(D \mid X) = \gamma_0 + \gamma_D\, E(D \mid X) + \gamma_{D^2}\, E(D^2 \mid X) = a + bX + cX^2,$$

where $a$, $b$ and $c$ are functions of $p_A$, $p_D$ and $D_{XD}$

• This true association model has the same form as the penetrance model
• When $(1 - 2p_D)\gamma_{D^2} - \gamma_D \neq 0$, the coefficient of the quadratic term generally approaches 0 faster than does that of the linear term
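The behaviour of the linear and quadratic coefficients can be illustrated by computing $a$, $b$ and $c$ directly. This sketch assumes random mating, so that the two haplotypes of a person are independent and $P(D \mid X)$ follows from haplotype frequencies; that assumption, the function names and the penetrance coefficients used are all hypothetical and not taken from the slides.

```python
# Sketch: coefficients a, b, c of P(affected | X) = a + b X + c X^2 implied by
# the penetrance model gamma_0 + gamma_D D + gamma_D2 D^2.
# Assumption (not from the slides): random mating, so each marker allele
# carries the disease allele D_1 independently with a conditional probability.

def association_coefficients(p_a, p_d, ld, g0, gd, gd2):
    """Return (a, b, c) for marker allele frequency p_a, disease allele
    frequency p_d and haplotype LD coefficient ld between A_1 and D_1."""
    q1 = p_d + ld / p_a          # P(D_1 | haplotype carries A_1)
    q2 = p_d - ld / (1.0 - p_a)  # P(D_1 | haplotype carries A_2)

    def risk(u, v):
        """P(affected) when the two haplotypes carry D_1 w.p. u and v."""
        e_d = u + v - 1.0                      # E(D | X)
        e_d2 = u * v + (1.0 - u) * (1.0 - v)   # E(D^2 | X)
        return g0 + gd * e_d + gd2 * e_d2

    f_pos, f_zero, f_neg = risk(q1, q1), risk(q1, q2), risk(q2, q2)
    a = f_zero
    b = 0.5 * (f_pos - f_neg)
    c = 0.5 * (f_pos + f_neg) - f_zero
    return a, b, c

if __name__ == "__main__":
    # p_A, p_D and the LD value follow the power-table footnote;
    # the gamma values are hypothetical.
    for ld in (0.048, 0.024, 0.012):
        print(ld, association_coefficients(0.3, 0.2, ld, 0.05, 0.02, 0.01))
```

Under these assumptions $c = \gamma_{D^2}\,(q_1 - q_2)^2$, which is proportional to the square of the LD coefficient, while $b$ is proportional to the LD coefficient to first order whenever $(1 - 2p_D)\gamma_{D^2} - \gamma_D \neq 0$. Halving the LD therefore roughly halves $b$ but quarters $c$.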

Power Computation

• The $T^2$ test in a retrospective model and the score test and LRT in a prospective logistic model are expected to perform similarly
• The noncentrality parameter of the $T^2$ test for test 2-5 is

$$\lambda = \frac{n_{case}\, n_{ctrl}}{n_{case} + n_{ctrl}} \left(\mu_{case} - \mu_{ctrl}\right)' \Sigma^{-1} \left(\mu_{case} - \mu_{ctrl}\right),$$

where $\Sigma$ is a weighted sum of $\Sigma_{case}$ and $\Sigma_{ctrl}$, each weight being a sample size divided by $n_{case} + n_{ctrl}$

• The noncentrality parameters for the other tests can be obtained by using the corresponding sub-matrices of $(\mu_{case} - \mu_{ctrl})$ and $(\Sigma_{case} + \Sigma_{ctrl})$
• Then

$$1 - \text{Power} = F_{\chi^2(\lambda)}\left(\chi^2_{1-\alpha}\right)$$
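The last step, turning a noncentrality parameter into power, can be sketched with a noncentral chi-square distribution. This is an illustrative standard-library implementation, not the authors' software: it assumes the degrees of freedom equal the number of components contrasted by the test, it writes the noncentral distribution as a Poisson mixture of central chi-squares, and the noncentrality value in the usage example is hypothetical.

```python
import math

# Sketch: power = 1 - F(critical value), where F is the noncentral chi-square
# CDF with noncentrality lam and the critical value is the (1 - alpha)
# quantile of the central chi-square. Degrees of freedom are an assumption.

def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x) by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for n in range(1, 100000):
        term *= x / (a + n)
        total += term
        if term < total * 1e-17:
            break
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def chi2_cdf(x, df):
    """CDF of the central chi-square distribution."""
    return reg_lower_gamma(df / 2.0, x / 2.0)

def chi2_quantile(p, df):
    """Quantile of the central chi-square distribution by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def noncentral_chi2_cdf(x, df, lam):
    """Noncentral chi-square CDF as a Poisson(lam / 2) mixture."""
    if lam <= 0.0:
        return chi2_cdf(x, df)
    half = lam / 2.0
    j_max = int(half + 10.0 * math.sqrt(half) + 50)
    total = 0.0
    for j in range(j_max + 1):
        weight = math.exp(-half + j * math.log(half) - math.lgamma(j + 1))
        total += weight * chi2_cdf(x, df + 2 * j)
    return total

def power(lam, df, alpha):
    """Asymptotic power of a chi-square test with noncentrality lam."""
    crit = chi2_quantile(1.0 - alpha, df)
    return 1.0 - noncentral_chi2_cdf(crit, df, lam)

if __name__ == "__main__":
    # alpha follows the power-table footnote; lam = 35 is hypothetical.
    print(power(35.0, 2, 0.05 / 500000))
```

When the noncentrality is zero the function returns the significance level itself, which gives a quick sanity check on the quantile and CDF routines.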

Comparisons of theoretical and empirical power of test 1-2

                             Theoretical power   Empirical power
Disease model                T² test             T² test   LRT     Score test
Additive                     0.532               0.533     0.527   0.523
Dominant                     0.366               0.366     0.361   0.359
Recessive                    0.734               0.741     0.736   0.708
Heterozygote Disadvantage    0.284               0.283     0.277   0.275

For each of the four disease models, parameters were set as follows: pD = 0.2, pA = 0.3, K = 0.05, DXD = 0.048 (D' = 0.8), n = 2,000 (500 for recessive), α = 0.05/500,000.

Empirical power is the proportion of rejected replicates out of 100,000 replicates.

Power comparisons of two-marker tests

Disease model                Test 2-5  Test 2-4  Test 2-3  Haplotype-based  Test 2-2  LD contrast

(LD structure 1)
Additive                     0.775     0.813     0.851     0.842            0.890     0.000
Dominant                     0.695     0.736     0.774     0.749            0.819     0.000
Recessive                    0.823     0.845     0.746     0.784            0.717     0.001
Heterozygote Disadvantage    0.617     0.653     0.673     0.621            0.711     0.000

(LD structure 2)
Additive                     0.962     0.758     0.970     0.948            0.850     0.007
Dominant                     0.921     0.673     0.926     0.887            0.769     0.003
Recessive                    0.851     0.647     0.910     0.945            0.618     0.206
Heterozygote Disadvantage    0.845     0.584     0.831     0.773            0.656     0.001

Power Comparisons on Real Data

• We estimated LD parameters and marker allele frequencies from the HapMap CEU population
• The data consist of 120 haplotypes estimated from 30 parent-offspring trios
• We split chromosome 11 into mutually exclusive consecutive regions containing 3 SNPs each
• For each region we estimated the LD and allele frequency parameters
• We excluded regions where the minor allele frequencies of three consecutive markers were less than 0.1, leaving 4,648 regions
• We chose the disease SNP to be the one with the smallest allele frequency
• Parameters other than the allele frequency and LD parameters were set to be the same as before
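The region-building steps above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name and the allele frequencies are hypothetical, and the exclusion rule is read here as dropping a region when any of its three SNPs has a minor allele frequency below 0.1, which is one possible reading of the slide.

```python
# Sketch: split a chromosome's SNPs into consecutive, non-overlapping
# 3-SNP regions, filter on minor allele frequency (MAF), and pick the
# disease SNP in each kept region.
# Assumption: a region is excluded if ANY of its SNPs has MAF < min_maf.

def build_regions(mafs, size=3, min_maf=0.1):
    """Return one dict per kept region with the SNP indices involved."""
    regions = []
    for start in range(0, len(mafs) - size + 1, size):
        indices = list(range(start, start + size))
        if any(mafs[i] < min_maf for i in indices):
            continue  # excluded by the MAF filter
        # Disease SNP: the one with the smallest allele frequency.
        disease = min(indices, key=lambda i: mafs[i])
        markers = [i for i in indices if i != disease]
        regions.append({"disease": disease, "markers": markers})
    return regions

# Hypothetical minor allele frequencies along a chromosome.
MAFS = [0.25, 0.12, 0.40, 0.05, 0.30, 0.22, 0.15, 0.45, 0.11, 0.30]

if __name__ == "__main__":
    for region in build_regions(MAFS):
        print(region)
```

In each kept region the two remaining SNPs play the role of markers A and B, so the two-marker tests can be evaluated against an unobserved disease SNP.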

Mean of power over chromosome 11 of CEU HapMap data

                             Single-marker test                  Two-marker test
Disease model                Test 1-2  Test 1-1  HWD contrast    Test 2-5  Test 2-4  Test 2-3  Test 2-2  Haplotype-based  LD contrast
Additive                     0.423     0.457     0.000           0.575     0.586     0.604     0.632     0.625            0.019
Dominant                     0.361     0.347     0.001           0.505     0.513     0.518     0.505     0.488            0.003
Recessive                    0.519     0.415     0.255           0.687     0.677     0.672     0.572     0.624            0.278
Heterozygote Disadvantage    0.423     0.241     0.163           0.587     0.580     0.546     0.367     0.344            0.058

Conclusions

• The best two-marker test always appears to be more powerful than either the best single-marker test or the haplotype-based test
• It should be possible, by examining the LD structure of the markers, to predict which will be the best two-marker test to perform
• We need to study tests of more than two markers

http://darwin.case.edu/ http://darwin.case.edu/sage.html
