

Genetic Epidemiology 34:383–385 (2010)

Score-based Adjustment for Confounding by Population Stratification in Genetic Association Studies

Andrew Allen,1 Michael P. Epstein,2 and Glen A. Satten3

1 Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina
2 Department of Human Genetics, Emory University, Atlanta, Georgia
3 Centers for Disease Control and Prevention, Atlanta, Georgia

Published online 1 February 2010 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/gepi.20487

To the Editor:

We read with some interest the paper by Zhao, Rebbeck and Mitra, “A propensity score approach to correction for bias due to population stratification using genetic and non-genetic factors.” We feel there are a number of issues raised by this paper that are not adequately addressed, which motivates this letter. Further, we take this opportunity to make some general comments that contrast stratification-based and direct (model-based) control of confounding by population stratification.

Zhao, Rebbeck and Mitra (ZRM) give as their stated goal the development of a propensity score-based method for correcting for confounding by population stratification in population-based association studies. We note that ZRM demonstrate their genomic propensity score (GPS) approach using simulation studies of case-control data. However, the propensity score can only be properly used to adjust for confounding in a cohort or case-cohort study [Joffe and Rosenbaum, 1999]. The artificial ratio of case to control participants in a case-control study yields a biased estimate of the propensity score; stratification on a biased estimate of the propensity score can lead to residual bias [Månsson et al., 2007]. With no theoretical basis for believing that adjustment using the propensity score can control bias, the GPS approach applied to case-control data under a general association model is suspect even given the favorable simulation results in ZRM. The suggestion of Månsson et al. [2007] that the propensity score can be estimated using data from control participants only and then making the rare disease approximation should be viewed with caution, as this scoring function corresponds to one of Miettinen’s confounder scores [Miettinen, 1976]. Estimated odds ratios after stratification using Miettinen’s confounder scores are known to have biased variance estimates due to collinearity [Pike et al., 1979].
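A small numeric illustration may make the sampling argument concrete. The sketch below uses made-up probabilities at a single fixed covariate value; none of the numbers come from ZRM or Månsson et al., and the variable names are hypothetical. It assumes genotype is associated with disease at that covariate value, and compares the population propensity score with what pooled 1:1 case-control data would estimate and with a controls-only estimate.

```python
# Hypothetical values at one fixed covariate value X = x (assumptions, not data).
p_case_pop = 0.01            # P(D=1 | x) in the source population (rare disease)
p_case_sample = 0.50         # P(D=1 | x) in a 1:1 case-control sample
p_risk_given_case = 0.30     # P(G=risk | x, D=1)
p_risk_given_control = 0.10  # P(G=risk | x, D=0)

def propensity(p_case):
    """P(G=risk | x) as a mixture over case status with case fraction p_case."""
    return p_case * p_risk_given_case + (1 - p_case) * p_risk_given_control

pop_score = propensity(p_case_pop)           # quantity a cohort study targets
pooled_score = propensity(p_case_sample)     # what pooled case-control data target
controls_only_score = p_risk_given_control   # estimate from controls alone

print(f"population propensity score:   {pop_score:.3f}")
print(f"pooled case-control estimate:  {pooled_score:.3f}")
print(f"controls-only estimate:        {controls_only_score:.3f}")
```

Under these assumed values the pooled estimate is pulled toward the genotype distribution in cases, while the controls-only estimate stays close to the population value because the disease is rare. This only illustrates the bias in the score itself; it says nothing about the variance problem noted for confounder scores.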

The propensity score is typically used to compare the differences in a mean response; in the genetic context, the proportion of cases among participants with the risk genotype(s) is compared to the proportion of cases among participants with the non-risk genotype(s). A comparison of this type seems questionable for a case-control study as the data are sampled conditional on case status. Instead, ZRM claim estimation of the odds ratios with bias close to zero under the alternative hypothesis. This claim should be viewed with caution, as it is known that even for prospective studies where the propensity score is appropriate, estimation of odds ratios may be biased [Austin et al., 2007]. Although the direction of bias found by Austin et al. agrees with that reported by ZRM in their discussion (who note that odds ratios estimated by GPS are consistently underestimated), Austin et al. also rely on Monte Carlo studies and so the direction of bias may in fact be indeterminate. For case-control studies, where there is no theoretical basis for use of GPS, finding a low bias in a simulation study should not be taken as demonstration of the applicability of the method in general. In fact, a logistic model for case-control status that includes genotype and the GPS as a covariate (as proposed by ZRM) is not compatible with either the logistic model that includes genotype and covariates, typically assumed when using direct adjustment for confounders, or with the marginal logistic model for genotype alone, typically assumed in the absence of confounding, in the sense that the parameters estimated by ZRM do not correspond to parameters in either of these models.

In our previous work [Epstein et al., 2007], we developed the stratification score to account for confounding when testing hypotheses in genetic association studies. The Epstein, Allen and Satten (EAS) stratification score approach first uses genomic (or non-genomic) covariates to model case-control status, and then creates a stratification score based on fitting this model. Thus, for each individual, we compute the EAS stratification score, which is the estimated probability that an individual is a case given their genomic (and other) covariates. We then group the data into strata based on the values of the EAS stratification score. Finally, the association between case or control status and the genotype G at the test locus is evaluated using the stratified data. Comparing the EAS stratification score and the genomic propensity score (GPS), we see that the roles of case-control status and genotype are reversed; for the GPS, the first step is to construct a model for a binary coding of G as a function of genomic and non-genomic covariates (ignoring case-control status). This model is then used to adjust for population stratification.
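The three steps just described (model case status, form strata on the fitted probability, test within strata) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Epstein et al. [2007]: the gradient-ascent logistic fit, the choice of five equal-sized strata, the Mantel-extension trend statistic, and all function names are choices made here for the example, and the data are simulated toy values.

```python
import math
import random

def fit_logistic(X, y, steps=400, lr=0.5):
    """Logistic regression of y on X (intercept added) by gradient ascent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(steps):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
            resid = yi - 1.0 / (1.0 + math.exp(-z))
            grad[0] += resid
            for j, v in enumerate(xi):
                grad[j + 1] += resid * v
        beta = [b + lr * gr / n for b, gr in zip(beta, grad)]
    return beta

def stratification_scores(X, beta):
    """Estimated probability of being a case given the covariates."""
    scores = []
    for xi in X:
        z = beta[0] + sum(b * v for b, v in zip(beta[1:], xi))
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores

def assign_strata(scores, n_strata=5):
    """Split individuals into n_strata equal-sized groups by score rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(scores)
    return strata

def stratified_trend_statistic(y, g, strata):
    """Mantel-extension chi-square (1 df) for allele counts g, summed over strata."""
    num, var = 0.0, 0.0
    for s in set(strata):
        idx = [i for i, k in enumerate(strata) if k == s]
        n = len(idx)
        n_case = sum(y[i] for i in idx)
        if n < 2 or n_case in (0, n):
            continue  # a stratum with only cases or only controls is uninformative
        g_mean = sum(g[i] for i in idx) / n
        ss = sum((g[i] - g_mean) ** 2 for i in idx)
        num += sum(g[i] for i in idx if y[i] == 1) - n_case * g_mean
        var += n_case * (n - n_case) * ss / (n * (n - 1))
    return num * num / var if var > 0 else 0.0

# Toy data: two covariates (stand-ins for ancestry summaries), one test locus.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
y = [int(rng.random() < 1 / (1 + math.exp(-1.5 * x[0]))) for x in X]
g = [(rng.random() < 0.2) + (rng.random() < 0.2) for _ in X]  # allele count 0/1/2

beta = fit_logistic(X, y)
scores = stratification_scores(X, beta)
strata = assign_strata(scores)
stat = stratified_trend_statistic(y, g, strata)
print(f"stratified trend statistic: {stat:.3f}")
```

Note that the score and the strata depend only on case status and covariates, so `strata` would be reused unchanged for every test locus; only `g` changes from locus to locus.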

We believe that, among stratification methods, the EAS stratification score has several advantages over GPS. First, GPS calculates the probability of a risk genotype (or set of risk genotypes) while the EAS stratification score calculates the probability of being a case. Because case status is a binary covariate, the EAS stratification score is superior to the GPS in that it does not require an arbitrary dichotomization of genotype into “risk” and “non-risk” genotypes. Second, if multiple loci are tested, a different GPS score must be calculated at each locus; this is a heavy burden when analyzing genome-wide data. When using the EAS stratification score, the same strata can be used to test association at every locus. This distinction is crucial when constructing permutation tests for multi-locus analyses.

© 2010 Wiley-Liss, Inc.

The EAS stratification scheme leads to a simple approach to permutation testing for multi-locus or genome-wide analyses while GPS does not. When conducting a permutation test, it is crucial that the permutation (replicate) data sets be similar to the original data set in every way except for the association between genotype and outcome. In particular, each replicate data set must reproduce the same population stratification as the original data. When using the EAS stratification score, this is easily accomplished by permuting disease and genotype vectors within score-based strata. This scheme preserves population stratification while also preserving linkage disequilibrium in the replicate data sets. Using GPS it is difficult to see how an analogous calculation can be carried out. As each locus requires a different stratification scheme, permutation of either disease or genotype labels within strata would not preserve linkage disequilibrium. Further, permutation of disease labels before analysis does not preserve population stratification.
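The within-stratum permutation scheme can be sketched as below. This is an illustration under assumptions rather than the authors' code: case-control labels are shuffled among individuals in the same stratum while each individual's multi-locus genotype row is left intact, which is equivalent to permuting whole genotype vectors within strata. The unstandardized score statistic, the maximum-over-loci summary, and all names are hypothetical choices made for the example.

```python
import random

def permute_within_strata(y, strata, rng):
    """Shuffle case/control labels among individuals who share a stratum."""
    members = {}
    for i, s in enumerate(strata):
        members.setdefault(s, []).append(i)
    y_perm = list(y)
    for idx in members.values():
        labels = [y[i] for i in idx]
        rng.shuffle(labels)
        for i, label in zip(idx, labels):
            y_perm[i] = label
    return y_perm

def stratified_score(y, g, strata):
    """Absolute value of sum_i (y_i - case fraction in i's stratum) * g_i."""
    total, count = {}, {}
    for yi, s in zip(y, strata):
        total[s] = total.get(s, 0) + yi
        count[s] = count.get(s, 0) + 1
    return abs(sum((yi - total[s] / count[s]) * gi
                   for yi, gi, s in zip(y, g, strata)))

def max_stat_pvalue(y, genotypes, strata, n_perm=200, seed=0):
    """Permutation p-value for the largest single-locus statistic."""
    rng = random.Random(seed)
    loci = list(zip(*genotypes))  # one tuple of allele counts per locus
    observed = max(stratified_score(y, g, strata) for g in loci)
    hits = 0
    for _ in range(n_perm):
        y_perm = permute_within_strata(y, strata, rng)
        if max(stratified_score(y_perm, g, strata) for g in loci) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: three strata with different case fractions, four loci per person.
rng = random.Random(7)
strata = [i % 3 for i in range(90)]
y = [int(rng.random() < (0.2, 0.5, 0.8)[s]) for s in strata]
genotypes = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(4)]
             for _ in strata]

y_perm = permute_within_strata(y, strata, random.Random(1))
p_value = max_stat_pvalue(y, genotypes, strata)
print(f"max-statistic permutation p-value: {p_value:.3f}")
```

Because the same `strata` vector serves every locus, one set of permuted labels is valid for all loci at once. Each replicate keeps the number of cases in every stratum (the stratification) and every individual's genotype row (the linkage disequilibrium).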

Both GPS and the EAS stratification score approach require a model for how genetic covariates affect the score. For genome-wide association studies, we have recently had success calculating the stratification score using principal components (PCs) calculated using a thinned set of marker genotypes [Fellay et al., 2007] and have found that it controls stratification as measured by the variance inflation factor after stratification [Allen and Satten, 2009a,b; Sarasua et al., 2009]. Further, because the EAS stratification score is the same at each locus, we have successfully applied the EAS stratification score to haplotype-based analyses without modification [Allen and Satten, 2009a,b].
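One common way to summarize residual inflation across many loci is the genomic-control inflation factor: the median observed 1-df chi-square statistic divided by the median of the null chi-square distribution. The sketch below assumes this is close in spirit to the variance inflation factor mentioned above; the cited papers should be consulted for the exact definition they used. The input statistics are invented numbers for illustration only.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def inflation_factor(chi2_stats):
    """Median observed 1-df chi-square statistic divided by its null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

# Hypothetical per-locus statistics from a stratified analysis.
example_stats = [0.02, 0.10, 0.45, 0.46, 1.30, 2.70, 3.90]
lam = inflation_factor(example_stats)
print(f"inflation factor: {lam:.3f}")
```

Values near 1 suggest that the test statistics behave as expected under the null at most loci; values well above 1 suggest residual stratification.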

It may seem at first glance that using PCs to construct the stratification score is equivalent to including PCs as covariates in a model such as Eigenstrat. However, this is not the case, as the variance estimators are different. Extensive experience with propensity scores for prospective data [e.g., Lunceford and Davidian, 2004], as well as simulations performed by EAS, attest to the validity of the stratification variance estimators. In contrast, McPeek and Abney [2008] have shown a variety of situations in which Eigenstrat does not preserve size or has diminished power.

These ideas are demonstrated in the following example of data from a population with three strata. The first stratum provided 80% of cases but no controls; the second stratum provided 80% of controls but no cases; the third stratum provided the remaining participants. The minor allele frequency (MAF) at a locus we wish to test for association with case-control status was 0.05 in the first two strata and 0.1 in the third. It should be noted that while there is extreme stratification in this simulation, there is no confounding by stratification as the MAF is the same in the first two strata. We generated 5,000 simulated data sets each having data from 300 cases and 300 controls. Data on 500 ancestry-informative markers having MAFs uniformly distributed between 0.05 and 0.5, with 400 having F_ST = 0.20 and 100 having F_ST = 0.01, were generated using the model of Balding and Nichols [1995]. We used the first 10 PCs based on these 500 markers in a logistic model to calculate the EAS stratification score. We also give results for linear and logistic models for case-control status where both genotype and the first 10 PCs were used as covariates.
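The data-generating design described above can be sketched as follows for a single replicate. This is a reconstruction under assumptions, not the authors' simulation code: it assumes the Balding-Nichols model is applied by drawing each stratum's allele frequency from a Beta distribution with parameters p(1-F)/F and (1-p)(1-F)/F, that genotypes are in Hardy-Weinberg proportions within strata, and that the third stratum contributes the remaining 20% of cases and 20% of controls. The function names are hypothetical, and the principal component step is omitted.

```python
import random

def balding_nichols_freqs(p_anc, fst, n_pops, rng):
    """Draw one allele frequency per stratum around ancestral frequency p_anc."""
    a = p_anc * (1 - fst) / fst
    b = (1 - p_anc) * (1 - fst) / fst
    return [rng.betavariate(a, b) for _ in range(n_pops)]

def draw_genotype(p, rng):
    """Minor-allele count (0, 1 or 2) under Hardy-Weinberg proportions."""
    return (rng.random() < p) + (rng.random() < p)

def simulate_dataset(seed=0):
    rng = random.Random(seed)
    # 400 markers with F_ST = 0.20 and 100 with F_ST = 0.01; ancestral MAF ~ U(0.05, 0.5).
    fsts = [0.20] * 400 + [0.01] * 100
    freqs = [balding_nichols_freqs(rng.uniform(0.05, 0.5), f, 3, rng) for f in fsts]
    test_maf = [0.05, 0.05, 0.10]  # MAF at the test locus in strata 1, 2, 3
    # (stratum, case status, number of participants) for 300 cases and 300 controls.
    design = [(0, 1, 240), (1, 0, 240), (2, 1, 60), (2, 0, 60)]
    rows = []
    for pop, status, count in design:
        for _ in range(count):
            markers = [draw_genotype(f[pop], rng) for f in freqs]
            rows.append((pop, status, draw_genotype(test_maf[pop], rng), markers))
    return rows

rows = simulate_dataset(seed=0)
n_cases = sum(status for _, status, _, _ in rows)
print(f"{len(rows)} participants, {n_cases} cases, {len(rows[0][3])} markers each")
```

Estimating test size would then involve repeating this over many seeds (5,000 in the letter), computing PCs from the 500 markers, and recording how often each test rejects at the nominal level; since the test-locus genotype is drawn independently of case status within each stratum, every replicate is generated under the null.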

At the nominal significance level of 0.05 (0.01), we found the size of the naïve (unstratified) analysis was 0.052 (0.011), which was as anticipated given the lack of confounding due to stratification in the sample. However, the size of the test that includes PCs as covariates in the logistic model was 0.093 (0.028); the size of the test using the analogous linear model was 0.12 (0.039). In contrast, testing after stratification using the EAS stratification score had a size of 0.056 (0.009). These results highlight the differences between stratification-based tests and direct adjustment, and also suggest that the stratification score approach to controlling confounding by population substructure deserves wider use when analyzing genetic association data.

In summary, post-stratification for control of confounding due to population stratification when testing for association is an attractive strategy. Two different scores have been proposed: the EAS stratification score and the GPS of ZRM. Both methods can easily make use of non-genetic covariates. Of the two scores, the EAS stratification score has several advantages over GPS when applied to case-control data: a natural binary outcome to model; the same stratification at each locus; and easy permutation testing for multi-locus or genome-wide analyses. We are unaware of any advantages of the GPS when compared to the EAS stratification score. Finally, score-based methods may have advantages over direct adjustment for population stratification due to a more robust variance calculation.

ACKNOWLEDGMENTS

This work was funded in part by NIH grants HG003618 (to M. P. E.) and R01 MH084680 (to A. S. A.). The findings and opinions expressed in this letter are those of the authors and do not necessarily reflect the official position of the CDC.

REFERENCES

Allen AS, Satten GA. 2009a. Genome-wide association analysis of rheumatoid arthritis data via haplotype sharing. BMC Proceedings 3:S30.

Allen AS, Satten GA. 2009b. A novel haplotype-sharing approach for genome-wide case-control association studies implicates the calpastatin gene in Parkinson’s disease. Genet Epidemiol 33:657–667.

Austin PC, Grootendorst P, Normand S-LT, Anderson GM. 2007. Conditioning on the propensity score can result in biased estimation of common measures of treatment effect: a Monte Carlo study. Stat Med 26:754–768.

Balding DJ, Nichols RA. 1995. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96:3–12.

Epstein MP, Allen AS, Satten GA. 2007. A simple and improved correction for population stratification in case-control studies. Am J Hum Genet 80:921–930.

Fellay J, Shianna KV, Ge D, Colombo S, Ledergerber B, Weale M, Zhang K, Gumbs C, Castagna A, Cossarizza A, Cozzi-Lepri A, De Luca A, Easterbrook P, Francioli P, Mallal S, Martinez-Picado J, Miro JM, Obel N, Smith JP, Wyniger J, Descombes P, Antonarakis SE, Letvin NL, McMichael AJ, Haynes BF, Telenti A, Goldstein DB. 2007. A whole-genome association study of major determinants for host control of HIV-1. Science 317:944–947.

Joffe MM, Rosenbaum PR. 1999. Propensity scores. Am J Epidemiol 150:327–333.

Lunceford JK, Davidian M. 2004. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med 23:2937–2960.

Månsson R, Joffe MM, Sun W, Hennessy S. 2007. On the estimation and use of propensity scores in case-control and case-cohort studies. Am J Epidemiol 166:332–339.

McPeek MS, Abney M. 2008. Association testing with principal-components-based correction for population stratification [Abstract program number 58]. Presented at the annual meeting of The American Society of Human Genetics, November 13, 2008, Philadelphia, PA. Available from: http://www.ashg.org/2008meeting/abstracts/fulltext/

Miettinen O. 1976. Stratification by a multivariate confounder score. Am J Epidemiol 104:609–620.

Pike MC, Anderson J, Day N. 1979. Some insights into Miettinen’s confounder score approach to analysis of case-control data. J Epidemiol Community Health 33:104–106.

Sarasua SM, Collins JS, Williamson DM, Satten GA, Allen AS. 2009. Effect of population stratification on the identification of significant SNPs in genome wide association studies. BMC Proceedings 3:S13.
