naïve bayesian classifier and genetic risk score for genetic risk … · 2015-02-19 · naïve...

Naïve Bayesian Classifier and Genetic Risk Score for Genetic RiskPrediction of a Categorical Trait: Not So Different After All!

(Article begins on next page)

The Harvard community has made this article openly available.Please share how this access benefits you. Your story matters.

Citation Sebastiani, Paola, Nadia Solovieff, and Jenny X. Sun. 2012. NaïveBayesian classifier and genetic risk score for genetic riskprediction of a categorical trait: Not so different after all! Frontiersin Genetics 3:26.

Published Version doi:10.3389/fgene.2012.00026

Accessed February 19, 2015 9:55:41 AM EST

Citable Link http://nrs.harvard.edu/urn-3:HUL.InstRepos:10332596

Terms of Use This article was downloaded from Harvard University's DASHrepository, and is made available under the terms and conditionsapplicable to Other Posted Material, as set forth athttp://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

http://osc.hul.harvard.edu/dash/open-access-feedback?handle=1/10332596&title=Na%C3%AFve+Bayesian+Classifier+and+Genetic+Risk+Score+for+Genetic+Risk+Prediction+of+a+Categorical+Trait%3A+Not+So+Different+After+All%21

http://dx.doi.org/10.3389/fgene.2012.00026

http://nrs.harvard.edu/urn-3:HUL.InstRepos:10332596

http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

ORIGINAL RESEARCH ARTICLEpublished: 29 February 2012

doi: 10.3389/fgene.2012.00026

Naïve Bayesian classifier and genetic risk score forgenetic risk prediction of a categorical trait: not sodifferent after all!Paola Sebastiani 1*, Nadia Solovieff 2,3,4 and Jenny X. Sun1

1 Department of Biostatistics, Boston University School of Public Health, Boston, MA, USA2 Psychiatric and Neurodevelopmental Genetics Unit, Center for Human Genetics Research, Massachusetts General Hospital, Boston, MA, USA3 Stanley Center for Psychiatric Research, The Broad Institute of MIT and Harvard, Cambridge, MA, USA4 Department of Psychiatry, Harvard Medical School, Boston, MA, USA

Edited by:

Albert Tenesa, University ofEdinburgh, UK

Reviewed by:

Lei Zhang, University of Shanghai forScience and Technology, ChinaJohn Whittaker, GlaxoSmithKline, UK

*Correspondence:

Paola Sebastiani , Department ofBiostatistics, Boston UniversitySchool of Public Health, 801Massachusetts Avenue, Boston, MA02118, USA.e-mail: [email protected]

One of the most popular modeling approaches to genetic risk prediction is to use a sum-mary of risk alleles in the form of an unweighted or a weighted genetic risk score, withweights that relate to the odds for the phenotype in carriers of the individual alleles. Recentcontributions have proposed the use of Bayesian classification rules using Naïve Bayes clas-sifiers. We examine the relation between the two approaches for genetic risk predictionand show that the methods are mathematically related. In addition, we study the proper-ties of the two approaches and describe how they can be generalized to include variousmodels of inheritance.

Keywords: genetic risk prediction, genetic score, Naïve Bayes classifier, classification score, classification rule

INTRODUCTIONSeveral statistical methods have been proposed to capture the com-plex genetic bases of common diseases. These approaches includestandard regression models in which the contribution of severalgenetic variants is summarized by a genetic risk score (GRS; Meigset al., 2008; Purcell et al., 2009; Paynter et al., 2010), multivari-ate regression models and “machine learning type” approachessuch as support vector machines (Wei et al., 2009; Wu et al.,2011), Naïve Bayes classifiers (NBC; Okser et al., 2010), classifi-cation and regression trees, random forests (Bureau et al., 2005;McKinney et al., 2006), rule induction (Sebastiani and Perls,2010; Stengard et al., 2010), multifactor dimensionality reduction(Moore et al., 2006), and Bayesian networks (Rodin and Boerwin-kle, 2005; Sebastiani et al., 2005; Jiang et al., 2011; Kang et al.,2011a). NBCs use a simple but surprisingly effective Bayesianrule that classifies a subject at risk of a trait if the posteriorprobability of the trait, given the individual’s genetic profile, ismaximal (Hand, 2009). The classification rule can be built usinga large number of genetic variants, such as single nucleotide poly-morphisms (SNPs), by assuming that the SNPs are conditionallyindependent given the trait (Sebastiani et al., 2012). This hypoth-esis is often mistaken for “marginal independence” but mar-ginal and conditional independence have no relation (Whittaker,1990).

In this manuscript we show that there is a mathematical linkbetween NBCs and logistic regression models that use a GRS tosummarize the contribution of many SNPs to the susceptibilityto a genetic disease. The link between these two approaches alsohighlights their limitations. We discuss how the directed graphicalmodel underlying a NBC can be extended to include interactions

between genes and/or environmental risk factors by maintainingthe computations scalable to genome-wide genotype data and evenwhole genome sequence data.

METHODS AND RESULTSWe describe two approaches – logistic regression and Bayesianclassifier – to define a classification score and a rule to be usedfor genetic risk prediction of a dichotomous trait denoted as Tor “not T.” The classification score for genetic risk prediction is afunction that maps a set of SNPs Σ = {S1, . . ., Sk} into real num-bers. The classification rule links the output of the score functionto the events T or “not T.” Formally, with S denoting the space ofSNPs and R the real numbers:

Classification score: Sc(Σ) : S → R

Classification rule: Sc(Σ) > τ ⇒ Classify as T

LOGISTIC REGRESSION WITH A GENETIC RISK SCOREA logistic regression model that includes the general effectsof k biallelic SNPs Σ = {S1, . . ., Sk} to model the odds for adichotomous trait T is defined by the logit equation:

log

(p(T |Σ)

1 − p(T |Σ)

)= α0 +

k∑j=1

(α1j XjAB + α2j XjBB)

where XjAB ={

1 if Sj genotype = AB

0 otherwise

and XjBB ={

1 if Sj genotype = BB

0 otherwise

www.frontiersin.org February 2012 | Volume 3 | Article 26 | 1

http://www.frontiersin.org/Genetics

http://www.frontiersin.org/Genetics/editorialboard



http://www.frontiersin.org/Genetics/about

http://www.frontiersin.org/Statistical_Genetics_and_Methodology/10.3389/fgene.2012.00026/abstract

http://www.frontiersin.org/Community/WhosWhoDetails.aspx?UID=38742&d=1&sname=PaolaSebastiani&name=Science

mailto:[email protected]

http://www.frontiersin.org

http://www.frontiersin.org/Statistical_Genetics_and_Methodology/archive

Sebastiani et al. Naïve Bayes classifiers and genetic risk

We assume that the alleles of the SNPs are ordered in lexico-graphical order (A < C < G < T), and A represents the first alleleand B the second allele regardless of their frequency. The logitequation is the classification score that can be used to define aclassification rule based on a threshold τ:

Classification score: Sc(Σ) = log

(p (T |Σ)

1 − p (T |Σ)

)= α0

+∑

j(α1j

XjAB + α2jXjBB)

Classification rule: Sc(Σ) > τ ⇒ Classify as T

and τ can be determined to optimize sensitivity and specificity byreceiver operating characteristic (ROC) curve analysis.

The coefficients of the logistic score are typically estimated bymaximum likelihood (McCullagh and Nelder, 1989), or Bayesianmethods using large sample approximations or Gibbs sampling(Balding, 2006). By definition, the intercept α0 represents the log-odds for the trait T for the referent group with all SNPs genotypesequal to AA, while each parameter α1j represents the log-odds ratiofor the trait T between the AB genotype and the AA genotype ofthe jth SNP, and each parameter α2j represents the log-odds ratiofor T between the BB and AA genotype of the jth SNP, assumingthe other SNP genotypes fixed. When α2j = 2α1j for all j = 1, . . .,k, then the logistic regression encodes the additive effects of theSNPs, and each parameter α1j represents the log-odds ratio forT for each additional copy of the B allele relative to the referentgenotype AA.

It is well known that when the data are from a case–controlstudy design, the intercept does not provide the correct estimate ofthe odds for T in the populations and several corrections have beenproposed to limit this problem (Jewell, 2003). Bias of the interceptterm is not a problem when the logistic regression model is meantto be used for classification because different intercepts will sim-ply shift the logistic function and classification scores that differonly by the intercept term lead to equivalent classification rules.We state this property formally because it will be used further.

Property 1: Irrelevance of the intercept term of a logistic regressionmodel for classificationLet Sc1(Σ) and Sc2(Σ) be two classification scores defined as:

Sc1(Σ) = log

(p (T |Σ)

1 − p (T |Σ)

)= α0 +

∑j(α1j XjAB + α2j XjBB)

Sc2(Σ) = log

(p (T |Σ)

1 − p (T |Σ)

)= β0 +

∑j(α1j XjAB + α2j XjBB)

The two classification scores can be used to define equivalentclassification rules by using the relation:

“if Sc1(Σ) > τ ⇒ classify as T” if and only if

“if Sc2(Σ) > τ + β0 − α0 ⇒ classify as T” �

We note however that the correct estimate of the intercept termis necessary to be able to interpret the prediction from the logisticmodel in terms of prevalence of the trait in the population.

One of the limitations of multivariate logistic regression is thatthe number of covariates is bounded above by the sample size. It isexpected that many common genetic complex traits may be deter-mined by hundreds of genetic variants (Kraft and Hunter, 2009),so that the sample size needed to build reliable logistic regressionmodels for risk prediction can be prohibitively large.

A naïve but very popular alternative is to collapse the contri-bution of the k SNPs into a GRS to be used in a univariate logisticmodel. A GRS is typically defined as the weighted sum of thegenotypes:

GRS = GRS(Σ) =k∑

i=1

(wiXiAA + viXiAB + ziXiBB)

with weights that can be appropriately chosen. The variables XiAB

and XiBB are defined as above, and XiAA = 1 if the ith SNP geno-type is AA and 0 otherwise. See Table 1 for a summary of threepossible weighting schemes. The GRS is then used as risk factor todefine a classification score using a univariate logistic regression:

Sc(Σ) = log

(p(T |GRS)

1 − p(T |GRS)

)= γ0 + γ1 GRS

Case 1. Although this is often referred to as the “unweightedgenetic score,” the heterozygote genotype is always assigned aweight 1, while the homozygous genotype for the risk allele isassigned weight 2 and the other genotype is assigned weight 0. Byadopting this weighting scheme, we are simply counting the num-ber of risk alleles each subject carries. The risk allele of each SNPis determined by a “one-SNP-at-a-time” association analysis, typ-ically under an additive genetic model. Using the same notationand lexicographical order of the SNPs that we used earlier, the riskallele of each SNP will be the A allele if the regression coefficientαi of the logistic regression model

log

(p(T |Si)

1 − p(T |Si)

)= α0i + αi(XiAB + 2XiBB)

is negative, and the B allele if αi is positive. In the first case (αi < 0),each copy of the B allele decreases the odds for T, while in the sec-ond case (αi ≥ 0) each copy of the B allele increases the odds forT. With this definition, the GRS is only a function of the differentnumber of risk alleles regardless of their individual genetic effects,and two identical GRS values can represent genetic profiles thatare substantially different. See Figure 1 for an example.

The slope γ1 in the classification score:

Sc(Σ) = log

(p(T |GRS)

1 − p(T |GRS)

)= γ0 + γ1 GRS (1)

measures the association of the GRS with the trait T in terms oflog-odds ratio for T between two GRS that differ by 1, and it isoften estimated to test whether the GRS is significantly associatedwith T. However, the value of γ1 is irrelevant for classificationbecause two classification scores defined as in Eq. 1 that differ bythe slope will produce equivalent classification rules. This is statedin the next property.

Frontiers in Genetics | Statistical Genetics and Methodology February 2012 | Volume 3 | Article 26 | 2


http://www.frontiersin.org/Statistical_Genetics_and_Methodology



Table 1 | Example of choice of weights for the weighted genetic risk score.

Case wi vi zi Comments

1 2δ(A = R) 1 2δ(B = R) R denotes the risk allele and δ(X =Y ) = 1 if X =Y is true and 0 otherwise

2 0 vi = log

p(T |Xi = 1)

1 − p(T |Xi = 1)

p(T |Xi = 0)

1 − p(T |Xi = 0)

2vi Xi = 0 when the i th SNP genotype is AA, and Xi = 1 when the genotype is

AB. This is the standard coding for an additive model

3 0 log

p(T |Si = AB)

1 − p(T |Si = AB)

p(T |Si = AA)

1 − p(T |Si = AA)

log

p(T |Si = BB)

1 − p(T |Si = BB)

p(T |Si = AA)

1 − p(T |Si = AA)

The two weights represent the log-odds ratio relative to the referent genotype

AA. This is the coding for genotypic model.

Case 1 is known as the “unweighted score” and case 2 is typically referred to as the “weighted genetic risk score.” Case 3 is the most general and flexible but it

does not seem to be used.

FIGURE 1 | Example of GRS (case 1 and case 2 inTable 1) based on

three SNPs associated with exceptional longevity. The table on topreports the A/B alleles for the three SNPs, the frequencies of A allele incases and controls, and the p-value for the additive model (ColumnPVAL.AA) and the odds ratio (OR) for exceptional longevity in carriers ofthe B allele. The two bottom panels show the calculations of the GRSwith weights as in case 1 (left), and case 2 (right). Note that the GRS on

the left is only a function of the different number of risk alleles regardlessof their individual genetic effects, so the genetic profiles Σ2, Σ3, and Σ4

have the same score while the case 2 GRS assigns different weights tonon-referent genotypes and the scores are different. The profile ΣR1

denotes the referent group in case 1, while ΣR2 denotes the referentgroup in case 2. The data for this example are taken from Sebastiani et al.(2012).

Property 2: Irrelevance of the slope of a univariate logisticregression model for classificationLet Sc1(Σ) and Sc2(Σ) be two classification scores defined as:

Sc1(Σ) = log

(p (T |GRS)

1 − p (T |GRS)

)= γ0 + γ1 GRS

Sc2(Σ) = log

(p (T |GRS)

1 − p (T |GRS)

)= β0 + β1 GRS

The two classification scores can be used to define equivalentclassification rules by using the relation:

“Sc1(Σ) > τ ⇒ classify as T”, if and only if

“Sc2(Σ) > β0 + β1τ − γ0

γ1⇒ classify as T” �

The GRSs labeled 2 and 3 in Table 1 weight SNP alleles indifferent ways to reflect their individual associations with thetrait T.

Case 2. The GRS can be written as:

GRS =k∑

i=1

vi(XiAB + 2XiBB)

where each weight vi is the maximum likelihood estimate of theregression coefficient in the univariate logistic regression:

log

(p(T |Xi)

1 − p(T |Xi)

)= αi0 + viXi ; Xi =

⎧⎪⎨⎪⎩

1 if Si = AB

2 if Si = BB

0 otherwise





that measures the association between SNP Si and the trait T withan additive genetic model. Therefore, each weight

vi = log

p(T |Xi = 1)

1 − p(T |Xi = 1)

p(T |Xi = 0)

1 − p(T |Xi = 0)

estimates the log-odds ratio for T for each copy of the B allele inan additive genetic model. Note that this formulation of the GRSdoes not require the specification of the risk allele of the SNPs, andthe weighted genetic score will increase by vi for each copy of theB allele of SNP Si, if this is a risk allele, and decrease by vi for eachcopy of the B allele if this is the protective allele. See the examplein Figure 1.

The classification score based on this GRS is computed usingthe logistic regression in Eq. 1, with parameters γ0, γ1 that canbe estimated by maximum likelihood or Bayesian methods. Theslope represents the odds ratio (OR) for T for a unit change ofthe GRS. In general, the OR for T between two genetic profilesΣ1 = {S11, . . ., Sk1} and Σ2 = {S12, . . ., Sk2} associated with GRS1

and GRS2 is

log

(p(T |GRS1)/(1 − p(T |GRS1)

p(T |GRS2)/(1 − p(T |GRS2)

)

= γ1

k∑i=1

log

(p(T |Si1)/(1 − p(T |Si1)

p(T |Si2)/(1 − p(T |Si2)

)

and this equation shows that the log-odds ratio for T between twoweighted GRSs is an average of log-odds ratios of the individualgenetic effects rescaled by the coefficient γ1.

The classification rule

if Sc1(Σ) = log

(p(T |GRS)

1 − p(T |GRS)

)> τ ⇒ classify as T ,

based on the score

Sc1(Σ) = log

(p (T |GRS)

1 − p (T |GRS)

)= γ0 + γ1 GRS

is equivalent to:

ifk∑

i=1

logp(T |Si)/(1 − p(T |Si)

p(T |Si = AA)/(1 − p(T |Si = AA)>

τ − γ0

γ1

⇒ classify as T

So the classification rule that uses the weighted GRS in case 2is essentially based on an average of the individual log-odds ratiofor T of each SNP genotype relative to the referent genotypes.

Case 3. The GRS is:

GRS =k∑

i=1

(viXiAB + ziXiBB)

where vi and zi are the MLE estimate of the regression coefficientsof the univariate logistic regression

log

(p(T |Si)

1 − p(T |Si)

)= αi0 + viXiAB + ziXiBB;

XiAB ={

1 if Si = AB

0 otherwise; XiBB =

{1 if Si = BB

0 otherwise

that measures the genotypic association between SNP Si and thetrait T. Therefore

vi = log

p(T |Si = AB)

1 − p(T |Si = AB)

p(T |Si = AA)

1 − p(T |Si = AA)

; zi = log

p(T |Si = BB)

1 − p(T |Si = BB)

p(T |Si = AA)

1 − p(T |Si = AA)

are the log-odds ratio for T between the AB and AA genotypes, andBB and AA genotypes. See Figure 2 for an example. The classifi-cation score and classification rule are derived as in case 2 and canbe interpreted as average of the log-odds ratios of individual SNPsgenotypes. Compared to case 2, the weights based on genotypeassociations allow for more general model of associations that arenot restricted to linear increase of the log-odds for T. Note also thatwhen the SNPs included in a GRS (case 2 and 3) are independent,the two scores should be approximately equivalent to multivariatelogistic regression with additive (case 2) or genotypic association(case 3). In addition, if the SNPs included in the GRS have simi-lar effects, then the GRS in case 1 and 2 should be approximatelyequivalent.

NAÏVE BAYES CLASSIFIERSThe classification score based on a NBC is the posterior probabilityof the trait T that is calculated using the formula:

Sc(Σ) = p(T |Σ) = p(T )∏k

i=1 p(Si |T )

p(T )∏k

i=1 p(Si |T )

+(1 − p(T ))∏k

i=1 p(Si |notT )

where p(T ) and 1 − p(T ) are the prior probabilities of having thetrait T or not. The conditional probabilities p(Si | T ) and p(Si | notT ) represent the distribution of the ith SNP genotype in subjectswith and without the trait T. They are typically estimated assum-ing genotypic association (Sebastiani et al., 2012), but they couldalso be estimated using an additive genetic model. The formulais derived using Bayes’ theorem and assuming that the SNPs areindependent, conditionally on T (Hand, 2009). The usual Bayesianclassification rule is to classify a subject with the most probableoutcome

if Sc(Σ) > 0.5 ⇒ classify as T .

This rule is based on a 0–1 loss that assigns the same weight tomisclassification errors. A general loss function that weights dif-ferently sensitivity and specificity would lead to the classificationrule:

if Sc(Σ) >λ

1 + λ⇒ classify as T for λ > 0






FIGURE 2 | Example of GRS (case 3 inTable 1). The table on top reports theA/B alleles for the three SNPs, the frequencies of A allele in cases andcontrols, and the odds ratio for exceptional longevity in carriers of the ABallele relative to carriers of the AA allele (OR.AB.AA), and the odds ratio for

exceptional longevity in carriers of the BB allele relative to carriers of the AAallele (OR.BB.AA). The bottom panel shows the calculations of the GRS withweights as in case 3. The profile ΣR denotes the referent group.

that can also be written as:

Sc(Σ) >λ

1 + λ⇔ log

(p(T |Σ)

1 − p(T |Σ)

)> log(λ)

and simple algebra shows that this is equivalent to:

log

(p(T |Σ)

1 − p(T |Σ)

)= log

(p(T )

∏ki=1 p(Si |T )

(1 − p(T ))∏k

i=1 p(Si |not T )

)=

log

( ∏ki=1 p(T )p(Si |T )∏k

i=1(1 − p(T ))p(Si |not T )

)=

log

(∏k

i=1

p(T |Si)

1 − p(T |Si)

)=

k∑i=1

log

(p(T |Si)

1 − p(T |Si)

)> log(λ)

As long as the log-odds ratios are calculated using the samegenetic model, this classification rule is equivalent to the classifi-cation rule based on the GRS (either case 2 or 3)

ifk∑

i=1

logp(T |Si)/(1 − p(T |Si))

p(T |Si = AA)/(1 − p(T |Si = AA))>

τ − γ0

γ1

⇒ classify as T

by setting the threshold

τ = γ0 − γ1

k∑i=1

log

(p(T |Si = AA)

1 − p(T |Si = AA)

)+ γ1 log(λ)

We state this relation formally.

Property 3: Equivalence of classification rules based on the GRS andthe NBCThe classification rules based on a logistic model of a GRS(case 2or 3) and a NBC are equivalent when the same genetic models areused to link individual SNPs to the trait.

The details of the algebraic manipulations are in Section“Appendix.” �

Note that the equivalence between the classification rules basedon a NBC and a logistic regression model with a GRS as in case 2or 3 is a simple consequence of the fact that both models base theprediction on a weighted average of ORs of the individual SNPs.This equivalence is independent of the choice of the prior for Tbecause different prior distributions will lead to equivalent classi-fication rules but with different classification thresholds. Also, theequivalence of classification rules based on GRS and NBC impliesthat when alternative classifiers are compared by the area underthe receiving operator curve they must reach the same value. Thisis shown in the next example.

Example. To demonstrate the connection between the NBCand the GRS in case 3, we performed a simple simulation. Wesimulated a dataset with 3000 cases and 3000 controls, and geno-type data from 75 causal SNP and 500,000 null SNPs. For thenull SNPs, we randomly selected frequencies of the minor allele(p) from a uniform (0.05, 0.5) distribution and genotype fre-quencies were generated assuming Hardy–Weinberg equilibrium[p2,2p(1 − p),(1 − p)2]. The causal SNPs were simulated with ORsof 1.2, 1.3, 1.4, 1.5, and 1.6 and minor allele frequencies (MAFs) of0.1, 0.2, 0.3, 0.4, and 0.5. A causal SNP was simulated for each com-bination of the above ORs and MAFs (25 combinations) under anadditive, recessive and dominant mode of inheritance (25 com-binations × 3 modes of inheritance = 75 SNPs). The genotype





frequencies in controls were generated to follow Hardy–Weinbergequilibrium [p2,2p(1 − p),(1 − p)2]. The genotype frequenciesin cases for the additive, recessive, and dominant models were[p2,2ORp(1 − p),OR2(1 − p)2], [p2,2p(1 − p),OR(1 − p)2] and[p2,2ORp(1 − p),OR(1 − p)2], respectively. For the cases, thegenotype frequencies were divided by the sum of the frequenciesso that the frequencies add up to 1. Using the genotype frequenciesfor each SNP, we simulated a discovery set of 3000 cases and 3000controls and a replication set with the same sample sizes.

The data in the discovery set were analyzed to generate geneticrisk models based on GRS and NBCs in the following way. ABayesian genome-wide association study was performed on thediscovery set and SNPs were ordered according to the posteriorprobability for the genotypic association to build nested NBCswith increasing number of SNPs as in Sebastiani et al. (2012).To obtain the weights for the three GRSs, we ran two logis-tic regression models for each SNP, using an additive mode ofinheritance and a genotypic mode of inheritance. The results ofthese analyses were used to detect the risk alleles of SNPs fornested GRS as in case 1; and to estimate the weights of GRS asin cases 2 and 3. Using SNPs ordered by the posterior proba-bility for the genotypic association, we then built three sets ofclassification models based on logistic regression and the threedifferent GRS, with increasing number of SNPs. The predic-tion models were tested on the replication set to avoid issuesof over-fitting. The simulation described above was repeated fivetimes and the mean AUC across the replicates was used to assessaccuracy.

Figure 3 (left panel) shows the mean AUC across five repli-cates for the NBCs and logistic regression models for differentGRSs, with increasing number of SNPs. As expected based onour mathematical calculations, the AUCs of the genetic risk mod-els based on the NBCs and the GRSs with a genotypic weightsare identical (Figure 3, left panel), and the predicted probabil-ities are almost identical (Figure 3, right panel). The weightedand unweighted GRS using an additive mode of inheritance havelower AUCs demonstrating the loss of accuracy with assumingadditivity when some of the SNPs do not follow an additivemode of inheritance. Of course if all SNPs do in fact followan additive model of the inheritance, the genotypic and addi-tive prediction models would perform similarly. The trend of theAUC shows that accuracy keeps increasing as true positive SNPsare included in the model, and then declines when each classi-fication model starts including false positive SNPs. The declineis more evident for the case 1 GRS, while both weighted GRSbased on additive or genotypic associations appear to be morerobust.

DISCUSSIONOne of the selling points of genome-wide association studies wasto discover genetic variants that are associated with increased sus-ceptibility for disease and could be used for personalized diagnosisand prognosis. Initial results published for example in Meigs et al.(2008) and Paynter et al. (2010) however showed that genetic dataadded limited predicted values to well established risk factors ofType II diabetes and cardiovascular disease. These initial studies

FIGURE 3 | Results of simulation for replication set. The left handplot graphs the mean area under the ROC (AUC) versus the numberof SNPs in the prediction model. The colored lines refer to the AUCof the NBC (black), the unweighted GRS from an additive model(case 1, blue), the weighted GRS from an additive model (case 2,green), and the weighted GRS from a genotypic model (case 3, red).

The maximum AUC occurs at 45 SNPs. The right hand plot graphsthe probability of the trait T given the weighted GRS (genotypicmodel) on the y -axis versus the probability of disease given the SNPprofile estimated by NBC on the x -axis for a model containing 45SNPs for one of the replicates. The right hand plot is similar acrossthe five replicates.






limited the attention to those SNPs that reached genome-widesignificance and their effect was summarized into a GRS. Sincethen, a growing body of literature has shown the increased valueof deeper mining of genome-wide association studies but inclu-sion of large number of SNPs in genetic risk model has continuedto resort on GRSs (Cui, 2009; Goddard et al., 2009; Kooperberget al., 2009; Purcell et al., 2009; Yang et al., 2010; Chen et al., 2011;Chibnik et al., 2011), while machine learning type methods con-tinue to be rare regardless of some successful applications (Weiet al., 2009; Okser et al., 2010; Kang et al., 2011b; Sebastiani et al.,2012).

Our study shows that risk prediction based on a GRS is math-ematically equivalent to risk prediction based on a NBC, whenthe same SNPs with the same mode of inheritance are used inthe models. The equivalence is based on the fact that both modelsessentially base the prediction on a weighted average of ORs of theindividual SNPs. While this equivalence establishes the validity ofmethods based on the NBC for genetic risk prediction and we hopewill contribute to make this approach more popular in this field,it also shows that contrary to what stated in Okser et al. (2010)a NBC does not include interactions of SNPs but only additivegenetic effects. However, the directed graphical model underly-ing a NBC can be extended to more general structures to includeinteractions between genes and/or environmental risk factors bymaintaining the computations scalable to genome-wide genotypedata and even whole genome sequence data (Sebastiani and Perls,2008).

Figure 4 shows some ways to extend NBCs for risk pre-diction to include population ancestry, as well as genetic andnon-genetic effects that may be missed by test for marginal asso-ciations. Figure 4A describes a directed acyclic graph (DAG) withone parent node (T ) and two children nodes (X 1 and X 2) thatmay represent SNPs. The DAG describes the conditional inde-pendence of X 1 and X 2 given T. This type of DAG with oneroot node and multiple conditionally independent children rep-resents a NBC (Sebastiani and Abad-Grau, 2007). The DAG inFigure 4B extends the NBC in Figure 4A with an additional nodeX 3 that is marginally independent of T, but conditionally depen-dent on T given X 2. In the context of genetic risk modeling, thenode X 3 could represent a non-genetic risk factor that is asso-ciated with a trait T only in specific genetic backgrounds (thenode X 2). The DAG in Figure 4C includes an additional nodeX 4 that is conditionally independent of all other nodes givenX 1. This additional node may represent a gene × gene interac-tion that is induced by linkage disequilibrium. Note that bothDAGS in Figures 4B,C would give the same classification scorefor T, because of the independence of T from X 4 given X 1. So,the DAG in Figure 4C would be useful for a better explana-tion of the biology rather than improving genetic risk prediction.Finally, the DAG in Figure 4D extends the DAG in Figure 4B byadding a link from T to X 3. The inclusion of this link makes thenode X 3 marginally dependent of T and interaction between X 2

and X 3 changes the classification score compared to the DAG inFigure 4B.

In addition, and most importantly, the fact that all variables ina DAG are random provides a sound framework for marginal and

FIGURE 4 | Examples of directed acyclic graph (DAG). All nodes arerandom variables and the DAG represents Markov properties of marginaland conditional independence (Lauritzen and Sheehan, 2004). In particular,the global Markov property states that a node is independent of all othernodes in the DAG given its parent nodes, its children nodes and additionalparents of its children (Lauritzen and Sheehan, 2004). In addition, twonodes are marginally independent when they have no directed joining pathsafter their children are dropped. Therefore, the nodes X 1 and X 2 in the DAGin (A) are conditionally independent given T. The DAG in (B) adds the nodeX 3 to the NBC in (A). This additional node is marginally independent of Tbut conditionally dependent on T given X 2. The DAG in (C) includes anadditional node X 4 that is conditionally independent of all other nodes givenX 1. Finally, the DAG in (D) extends the DAG in (B) by adding a link from T toX 3 so that X 3 and T are marginally dependent.

conditional inference. For example, a genetic risk model basedon a DAG can be used for predicting the outcome of a sub-ject by marginalizing out unobserved variables (Solovieff et al.,2011).

Our analysis is limited to binary outcomes, but we expect thatsimilar results hold when the outcome to be predicted is a quan-titative trait that follows a normal distribution. Furthermore, ouranalysis shows that linear transformations of a GRS do not impactpredictive accuracy, and similarly, that the predictive accuracy ofa NBC cannot be changed by a choice of prior for T. Improvingthe accuracy can be accomplished by selection of the most pre-dictive SNP and by choosing alternative weights to calculate theGRS. There is no obvious similar choice for a NBC. However, aclosely related approach that we used in Sebastiani et al. (2012)to improve the predictive accuracy is to use ensemble of nestedNBCs. Finally, the machine learning community has developedmany feature selection algorithms for building classifiers (Hastieet al., 2009) that, by the equivalence proved in this paper, mayprove to be useful to generate better genetic risk models.

ACKNOWLEDGMENTSResearch supported with funds from NIH/NHLBI R01 HL089655-03 and R21HL114237 (Paola Sebastiani). We thank Dr. MariaAbad Grau for stimulating discussion during her visit at BostonUniversity in August 2011 that triggered the writing of thismanuscript.





REFERENCESBalding, D. J. (2006). A tutorial on sta-

tistical methods for population asso-ciation studies. Nat. Rev. Genet. 7,781–791.

Bureau, A., Dupuis, J., Falls, K., Lunetta,K. L., Hayward, B., Keith, T. P., andVan Eerdewegh, P. (2005). Identify-ing SNPs predictive of phenotypeusing random forests. Genet. Epi-demiol. 28, 171–182.

Chen, H., Poon, A., Yeung, C., Helms,C., Pons, J., Bowcock, A. M., Kwok,P. Y., and Liao, W. (2011). Agenetic risk score combining tenpsoriasis risk loci improves diseaseprediction. PLoS ONE 6, e19454.doi:10.1371/journal.pone.0019454

Chibnik,L. B.,Keenan,B. T.,Cui, J.,Liao,K. P., Costenbader, K. H., Plenge,R. M., and Karlson, E. W. (2011).Genetic risk score predicting riskof rheumatoid arthritis phenotypesand age of symptom onset. PLoSONE 6, e24380. doi:10.1371/jour-nal.pone.0024380

Cui, J. (2009). Overview of risk predic-tion models in cardiovascular dis-ease research. Ann Epidemiol 19,711–717.

Goddard, M. E., Wray, N. R., Ver-byla, K., and Visscher, P. M. (2009).Estimating effects and making pre-dictions from genome-wide markerdata. Stat. Sci. 24, 517–529.

Hand, D. J. (2009). “Naive Bayes,” in TheTop Ten Algorithms in Data Mining,eds X. Wu and V. Kumar (London:Chapman and Hall), 163–178.

Hastie, T., Tibshirani, R., and Friedman,J. H. (2009). The Elements of Statisti-cal Learning: Data Mining, Inference,and Prediction. New York: Springer.

Jewell, R. (2003). Statistics for Epidemi-ology. Boca Raton: CRC/Chapmanand Hall.

Jiang, X., Barmada, M. M., Cooper, G. F.,and Becich, M. J. (2011). A Bayesianmethod for evaluating and discov-ering disease loci associations. PLoSONE 6, e22075. doi:10.1371/jour-nal.pone.0022075

Kang, J., Zheng, W., Li, L., Lee, J.,Yan, X., and Zhao, H. (2011a).Use of Bayesian networks to dis-sect the complexity of genetic dis-ease: application to the GeneticAnalysis Workshop 17 simulateddata. BMC Proc. 5(Suppl. 9), S37.doi:10.1186/1753-6561-5-S9-S37

Kang, J., Kugathasan, S., Georges, M.,Zhao, H., and Cho, J. H. (2011b).Improved risk prediction for

Crohn’s disease with a multi-locusapproach. Hum. Mol. Genet. 20,2435–2442.

Kooperberg, C., LeBlanc, M., andObenchain, V. (2009). Risk predic-tion using genome-wide associa-tion studies. Genet. Epidemiol. 34,643–652.

Kraft, P., and Hunter, D. J. (2009).Genetic risk prediction – are wethere yet? N. Engl. J. Med. 360,1701–1703.

Lauritzen, S. L., and Sheehan, N. A.(2004). Graphical models for geneticanalysis. Stat. Sci. 18, 489–514.

McCullagh, P., and Nelder, J. (1989).Generalized Linear Models. London:Chapman and Hall.

McKinney, B. A., Reif, D. M., Ritchie,M. D., and Moore, J. H. (2006).Machine learning for detecting gene-gene interactions: a review. Appl.Bioinformatics 5, 77–88.

Meigs, J. B., Shrader, P., Sullivan, L. M.,McAteer, J. B., Fox, C. S., Dupuis, J.,Manning, A. K., Florez, J. C., Wil-son, P. W., D’Agostino, R. B. Sr.,and Cupples, L. A. (2008). Geno-type score in addition to commonrisk factors for prediction of type2 diabetes. N. Engl. J. Med. 359,2208–2219.

Moore, J. H., Gilbert, J. C., Tsai, C. T.,Chiang, F. T., Holden, T., Barney,N., and White, B. C. (2006). A flex-ible computational framework fordetecting, characterizing, and inter-preting statistical patterns of epista-sis in genetic studies of human dis-ease susceptibility. J. Theor. Biol. 241,252–261.

Okser, S., Lehtimaki, T., Elo, L. L.,Mononen, N., Peltonen, N., Kaho-nen, M., Juonala, M., Fan, Y.M., Hernesniemi, J. A., Laitinen,T., Lyytikainen, L. P., Rontu, R.,Eklund, C., Hutri-Kahonen, N.,Taittonen, L., Hurme, M., Viikari,J. S., Raitakari, O. T., and Ait-tokallio, T. (2010). Genetic vari-ants and their interactions in theprediction of increased pre-clinicalcarotid atherosclerosis: the car-diovascular risk in young Finnsstudy. PLoS Genet. 6, e1001146.doi:10.1371/journal.pgen.1001146

Paynter, N. P., Chasman, D. I., Pare,G., Buring, J. E., Cook, N. R.,Miletich, J. P., and Ridker, P.M. (2010). Association between aliterature-based genetic risk scoreand cardiovascular events in women.JAMA 303, 631–637.

Purcell, S. M., Wray, N. R., Stone, J. L.,Visscher, P. M., O’Donovan, M. C.,Sullivan, P. F., and Sklar, P. (2009).Common polygenic variation con-tributes to risk of schizophreniaand bipolar disorder. Nature 460,748–752.

Rodin, A. S., and Boerwinkle, E. (2005).Mining genetic epidemiology datawith Bayesian networks I: Bayesiannetworks and example application(plasma apoE levels). Bioinformatics21, 3273–3278.

Sebastiani, P., and Abad-Grau, M.(2007). “Bayesian networks forgenetic analysis,” in Bioinformat-ics: An Engineering Case-BasedApproach, eds G. Alterovitz and M.F. Ramoni (Cambridge, MA: ArtechHouse), 205–228.

Sebastiani, P., and Perls, T. T. (2008).“Complex genetic models,” inBayesian Networks, eds O. Pour-ret, P. Naïm, and B. Marcot(Chichester: John Wiley & Sons),53–72.

Sebastiani, P., and Perls, T. T. (2010).Prediction models that includegenetic data. Circ. Cardiovasc. Genet.3, 1–2.

Sebastiani, P., Ramoni, M. F., Nolan, V.,Baldwin, C. T., and Steinberg, M. H.(2005). Genetic dissection and prog-nostic modeling of overt stroke insickle cell anemia. Nat. Genet. 37,435–440.

Sebastiani, P., Solovieff, N., DeWan, A.,Walsh, K., Puca, A., Hartley, S. W.,Melista, E., Andersen, S., Dworkis, D.A., Wilk, J. B., Myers, R. H., Stein-berg, M. H., Montano, M., Bald-win, C. T., Hoh, J., and Perls, T. T.(2012). Genetic signatures of excep-tional longevity in humans. PLoSONE 7, e29848. doi:10.1371/jour-nal.pone.0029848

Solovieff, N., Baldwin, C. T., Steinberg,M. H., Perls, T. T., and Sebas-tiani, P. (2011). “Incorporatinggenetic ancestry into risk predictionmodels,” in The 12th InternationalCongress of Human Genetics andthe American Society of HumanGenetics 61st Annual Meeting,Montreal.

Stengard, J. H., Dyson, G., Frikke-Schmidt, R., Tybjaerg-Hansen, A.,Nordestgaard, B. G., and Sing, C.F. (2010). Context-dependent asso-ciations between variation in riskof ischemic heart disease and vari-ation in the 5′ promoter region ofthe apolipoprotein E gene in Danish

women. Circ. Cardiovasc. Genet. 3,22–30.

Wei, Z., Wang, K., Qu, H. Q., Zhang,H., Bradfield, J., Kim, C., Frackle-ton, E., Hou, C., Glessner, J. T., Chi-avacci, R., Stanley, C., Monos, D.,Grant, S. F., Polychronakos, C., andHakonarson, H. (2009). From dis-ease association to risk assessment:an optimistic view from genome-wide association studies on type 1diabetes. PLoS Genet. 5, e1000678.doi: 10.1371/journal.pgen.1000678

Whittaker, J. (1990). Graphical Models inApplied Multivariate Statistics. NewYork: John Wiley & Sons.

Wu, C., Walsh, K., DeWan, A., Hoh, J.,and Wang, Z. (2011). Disease riskprediction with rare and commonvariants. BMC Proc. 5(Suppl. 9), S61.doi:10.1186/1753-6561-5-S9-S61

Yang, J., Manolio, T. A., Pasquale, L.R., Boerwinkle, E., Caporaso, N.,Cunningham, J. M., de Andrade,M., Feenstra, B., Feingold, E., Hayes,M. G., Hill, W. G., Landi, M. T.,Alonso, A., Lettre, G., Lin, P., Ling,H., Lowe, W., Mathias, R. A., Melbye,M., Pugh, E., Cornelis, M. C.,Weir, B.S., Goddard, M. E., and Visscher, P.M. (2010). Genome partitioning ofgenetic variation for complex traitsusing common SNPs. Nat. Genet. 43,519–525.

Conflict of Interest Statement: Theauthors declare that the research wasconducted in the absence of any com-mercial or financial relationships thatcould be construed as a potential con-flict of interest.

Received: 24 December 2011; accepted:12 February 2012; published online: 29February 2012.Citation: Sebastiani P, Solovieff N andSun JX (2012) Naïve Bayesian classifierand genetic risk score for genetic risk pre-diction of a categorical trait: not so dif-ferent after all! Front. Gene. 3:26. doi:10.3389/fgene.2012.00026This article was submitted to Frontiers inStatistical Genetics and Methodology, aspecialty of Frontiers in Genetics.Copyright © 2012 Sebastiani, Solovi-eff and Sun. This is an open-accessarticle distributed under the terms ofthe Creative Commons Attribution NonCommercial License, which permits non-commercial use, distribution, and repro-duction in other forums, provided theoriginal authors and source are credited.


http://dx.doi.org/10.1371/journal.pone.0019454



http://dx.doi.org/10.1186/1753-6561-5-S9-S37

http://dx.doi.org/10.1371/journal.pgen.1001146


http://dx.doi.org/10.1186/1753-6561-5-S9-S61

http://dx.doi.org/10.3389/fgene.2012.00026

http://creativecommons.org/licenses/by-nc/3.0/









naïve bayesian classifier and genetic risk score for genetic risk … · 2015-02-19 · naïve...

Documents