A Novel Generalized Ridge Regression Method for Quantitative Genetics

Xia Shen,*,1 Moudud Alam, Freddy Fikse, and Lars Rönnegård

*Division of Computational Genetics, Department of Clinical Sciences and Department of Animal Breeding & Genetics, Swedish University of Agricultural Sciences, 75007 Uppsala, Sweden, and Statistics, School of Technology and Business Studies, Dalarna University, 78170 Borlänge, Sweden

GENOMIC SELECTION

Genetics, Vol. 193, 1255–1268 April 2013
Copyright © 2013 by the Genetics Society of America
doi: 10.1534/genetics.112.146720
Manuscript received October 11, 2012; accepted for publication January 9, 2013
1 Corresponding author: Swedish University of Agricultural Sciences, Box 7078, 75007 Uppsala, Sweden. E-mail: [email protected]

ABSTRACT As the molecular marker density grows, there is a strong need in both genome-wide association studies and genomic selection to fit models with a large number of parameters. Here we present a computationally efficient generalized ridge regression (RR) algorithm for situations in which the number of parameters largely exceeds the number of observations. The computationally demanding parts of the method depend mainly on the number of observations and not the number of parameters. The algorithm was implemented in the R package bigRR based on the previously developed package hglm. Using such an approach, a heteroscedastic effects model (HEM) was also developed, implemented, and tested. The efficiency for different data sizes was evaluated via simulation. The method was tested for a bacteria-hypersensitive trait in a publicly available Arabidopsis data set including 84 inbred lines and 216,130 SNPs. The computation of all the SNP effects required <10 sec using a single 2.7-GHz core. The advantage in run time makes a permutation test feasible for such a whole-genome model, so that a genome-wide significance threshold can be obtained. HEM was found to be more robust than ordinary RR (a.k.a. SNP-best linear unbiased prediction) in terms of QTL mapping, because SNP-specific shrinkage was applied instead of a common shrinkage. The proposed algorithm was also assessed for genomic evaluation and was shown to give better predictions than ordinary RR.

HIGH-dimensional problems are increasing in importance in genetics, computational biology, and other fields of research where technological developments have greatly facilitated the collection of data (Hastie et al. 2009). In genome-wide association studies (GWAS) and genomic selection (GS), the number of observations n is generally on the order of hundreds or thousands, whereas the number of marker effects to be fitted p is on the order of hundreds of thousands. This is a rather extreme $p \gg n$ problem, and the methods developed for analyses of the data need to be computationally feasible. At the same time the models fitted should be flexible enough to capture the important genetic effects, which are often quite small (Hayes and Goddard 2001).

Methodologies regarding high-dimensional genomic data focus on both detection and prediction purposes. There is currently a trend that GWAS and GS could potentially apply the same framework of models. Such models fit the whole genome based on penalized likelihood or Bayesian shrinkage estimation (see the review by de los Campos et al. 2013). Ordinary GWAS usually avoids high-dimensional models and turns the problem into multiple testing instead (e.g., the review by Kingsmore et al. 2008). The SNPs (single nucleotide polymorphisms) are tested separately, one at a time. Such a routine sacrifices both detection and prediction power. Using detected QTL (quantitative trait loci that are genome-wide significant), the prediction can be rather poor, which has led to only limited application of marker-assisted selection (MAS) (Dekkers 2004). GS, however, has been practically useful by incorporating a large number of small genetic effects unmappable from GWAS or QTL analysis. There are a number of whole-genome models where not only the individual predictors (breeding values) but also the SNP effects can be estimated, e.g., SNP-best linear unbiased prediction (BLUP) and different kinds of Bayesian models (e.g., Meuwissen et al. 2001; Xu 2003; Yi and Xu 2008; Gianola et al. 2009; Habier et al. 2011). The whole-genome models are powerful; nevertheless, there are problems that limit their wide usage: (1) computation for these models including all the SNPs can be intensive, whereas efficiency is required in practice so that prediction can be obtained in early life of the individuals (de los Campos et al. 2013);



(2) fitting large-p small-n models requires variable selection or shrinkage estimation, and the significance threshold for the shrinkage estimates of SNP effects is difficult to specify, which is an issue that limits the usage of such models in gene mapping; and (3) the fitting of Bayesian models is performed using randomization/simulations, where in application, mixing of the Markov chain Monte Carlo (MCMC) algorithm can become poor in case of high-dimensional models.

Linear mixed models (LMM) have been proposed for GS (SNP-BLUP; Meuwissen et al. 2001) and ridge regression (RR) for GWAS (Malo et al. 2008). LMMs and RR are fundamentally the same since they fit a penalized likelihood using a quadratic penalty function (see the Appendix for more details). It is well established (Hastie et al. 2009) that RR can be fitted for $p \gg n$ in a computationally efficient way using singular-value decomposition (SVD) of the design matrix, which, for instance, has been applied to expression arrays in genetics (Hastie and Tibshirani 2004). However, this approach assumes that the RR shrinkage parameter is constant for all p fitted parameters. In generalized RR the shrinkage parameter may vary between the parameters (Hoerl and Kennard 1970a,b). In both multilocus GWAS and GS, it is not reasonable to assume that shrinkage should be constant for all fitted SNP effects over the entire genome. This is because gene effects are not normally distributed, nor are most markers linked to any functional gene (Meuwissen et al. 2001). To allow SNP-specific shrinkage, the previously mentioned Bayesian methods were developed.
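The SVD route to ordinary RR with a single shrinkage parameter, mentioned above, can be illustrated with a minimal sketch. It is written in Python with NumPy rather than in R, it is not the bigRR implementation, and the function name `ridge_svd`, the toy dimensions, and the value of `lam` are assumptions made only for this example.

```python
import numpy as np

def ridge_svd(Z, y, lam):
    """Ridge solution (Z'Z + lam*I)^{-1} Z'y computed from the thin SVD of Z."""
    # Z = U diag(d) V', with U of size n x k and V of size p x k, k = min(n, p)
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    # Each singular direction is shrunk by the factor d / (d^2 + lam)
    return Vt.T @ ((d / (d**2 + lam)) * (U.T @ y))

rng = np.random.default_rng(0)
n, p, lam = 20, 500, 3.0            # toy sizes with p >> n (illustrative only)
Z = rng.standard_normal((n, p))
y = rng.standard_normal(n)

b_svd = ridge_svd(Z, y, lam)
# Direct p x p solve, used here only as a check on the toy data
b_direct = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
max_abs_diff = float(np.max(np.abs(b_svd - b_direct)))
```

The SVD version only ever factorizes an n × p matrix with n small, which is why the cost is driven by n; the direct solve is included purely to confirm that the two routes agree on small data.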

There is a need for a method that is fast (efficient to perform), testable (can produce a genome-wide significance threshold for association study), deterministic (the same estimates are easy to replicate), and flexible (SNP-specific shrinkage can be easily applied). The aim of this article is to develop such a generalized RR method, which will be referred to as the heteroscedastic effects model (HEM), for $p \gg n$ high-dimensional problems, based on LMM theory. HEM approximates a previously proposed method (Rönnegård and Lee 2010; Shen et al. 2011) that was based on double hierarchical generalized linear models (DHGLM; Lee and Nelder 2006), but with a tremendous increase in computational speed for $p \gg n$ problems. An important contribution of the theory presented is a fast transformation of hat values (leverages) and prediction error variances of the random effects. The method has been implemented in the R (R Development Core Team 2010) package bigRR (available at https://r-forge.r-project.org/R/?group_id=1301).

    Methods and Materials

    Statistical models

Using Henderson's mixed model equation: We start by introducing the normal RR as an LMM. The theoretical basis of the connection between RR and LMM is given in the Appendix. The SNP effects are estimated as random effects, i.e., so-called SNP-BLUP. We use the terms RR and SNP-BLUP interchangeably in this article. Given a phenotype vector y for n individuals, fixed effects data X, and the data for p SNPs along the genome Z, the normal LMM for SNP-BLUP can be written as

$$y = X\beta + Zb + e, \tag{1}$$

where $b \sim N(0, \sigma_b^2 I_p)$, $e \sim N(0, \sigma_e^2 I_n)$, $\beta$ is the vector of fixed effects, and $b$ is the vector of random SNP effects. The matrix Z has p columns for the SNPs where each column is usually coded as 0, 1, and 2, for the homozygote aa, the heterozygote Aa, and the other homozygote AA, respectively. However, here, we standardize the coding for Z based on VanRaden (2008) using the allele frequencies. This is essential in RR problems since the sizes of the estimated effects need to be comparable. Although the models are introduced in the simple normal LMM notation, the method is developed for generalized distributions of phenotypes (see also Fitting algorithm and the Appendix).
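As an illustration of allele-frequency-based coding, the sketch below centers each 0/1/2 column by twice the allele frequency and scales it by the binomial standard deviation. This is one common variant of such standardization and is an assumption here: the text does not state which exact form of the VanRaden (2008) coding is used, and the function name `standardize_genotypes` and the toy matrix are hypothetical.

```python
import numpy as np

def standardize_genotypes(M):
    """Center and scale an n x p matrix of 0/1/2 allele counts by allele frequency."""
    M = np.asarray(M, dtype=float)
    freq = M.mean(axis=0) / 2.0                 # estimated frequency of the counted allele
    scale = np.sqrt(2.0 * freq * (1.0 - freq))  # binomial SD of the allele count
    keep = scale > 0                            # drop monomorphic SNPs (zero variance)
    Z = (M[:, keep] - 2.0 * freq[keep]) / scale[keep]
    return Z, keep

# Toy genotypes: 4 individuals, 4 SNPs; the third SNP is monomorphic
M = np.array([[0, 1, 2, 0],
              [1, 1, 2, 2],
              [2, 0, 2, 1],
              [1, 2, 2, 1]])
Z, keep = standardize_genotypes(M)
```

After this step every retained column has mean zero, so estimated effects for rare and common SNPs are on a comparable scale.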

It is well known that the fixed effects $\beta$ and random effects $b$ can be estimated jointly via Henderson's mixed model equation (MME; Henderson 1953)

$$\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \lambda I \end{pmatrix} \begin{pmatrix} \hat\beta \\ \hat b \end{pmatrix} = \begin{pmatrix} X'y \\ Z'y \end{pmatrix}, \tag{2}$$

where $\lambda = \hat\sigma_e^2 / \hat\sigma_b^2$, determined by the variance component estimators, is the shrinkage parameter for the random SNP effects. $\lambda$ is analogous to the one in the penalized likelihood for RR. In terms of estimating SNP effects for QTL mapping, such an MME for RR is not appropriate because the same magnitude of shrinkage is applied to all the SNPs (Xu 2003). Hence, a priori, no distinction is made between the markers. Since most of the loci in the genome are supposed to contribute little to the observed phenotype, those SNPs should be penalized more in the analysis. This is one of the fundamental ideas from which the current Bayesian methods are developed (e.g., Meuwissen et al. 2001; Xu 2003).

From (2), it is clear that to obtain different shrinkage for different SNPs, the $\lambda I$ part should be replaced so that the p numbers on the diagonal are not identical. An essential question here is how much shrinkage should be given to each SNP. We propose a generalized RR solution to this problem, which is presented as the following HEM. We use the MME and fit a generalized RR after the ordinary RR in (2).

$$\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \operatorname{diag}(\lambda) \end{pmatrix} \begin{pmatrix} \hat\beta \\ \hat b \end{pmatrix} = \begin{pmatrix} X'y \\ Z'y \end{pmatrix}, \tag{3}$$

where $\lambda$ is a vector of p shrinkage parameters with its jth element $\lambda_j = \hat\sigma_e^2 / \hat\sigma_{b_j}^2$. The SNP-specific variance component $\hat\sigma_{b_j}^2$ is calculated as

$$\hat\sigma_{b_j}^2 = \frac{\hat b_j^2}{1 - h_{jj}}, \tag{4}$$

where for the jth SNP, $\hat b_j$ is the BLUP from (2), and $h_{jj}$, known as the hat value, is the (n + j)th diagonal element of the hat matrix $H = T(T'T)^{-1}T'$, where

$$T = \begin{pmatrix} X & Z \\ 0 & \operatorname{diag}(\lambda) \end{pmatrix}. \tag{5}$$

Such a quantity (4) is useful because it (1) directly tells how much shrinkage should be given to each SNP and (2) makes the entire procedure deterministic and repeatable. This simple way of setting the shrinkage parameters is an approximation of previously established theory for double hierarchical generalized linear models (Lee and Nelder 2006) applied in Rönnegård and Lee (2010) and Shen et al. (2011), where $b_j$ depended on a second-layer model including random effects and was estimated iteratively until convergence. Using (4) the shrinkage for each SNP in HEM is computed directly without iteration.
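A small numerical illustration of (4) and of the resulting SNP-specific shrinkage $\lambda_j$ is given below. The BLUPs, hat values, and residual variance are made-up numbers, and the helper name `hem_shrinkage` is hypothetical; the sketch only shows the arithmetic, not how the inputs are obtained.

```python
def hem_shrinkage(b_hat, h, sigma2_e):
    """Return per-SNP variances from eq. (4) and shrinkage lambda_j = sigma2_e / sigma2_bj."""
    sigma2_b = [b * b / (1.0 - hj) for b, hj in zip(b_hat, h)]
    lam = [sigma2_e / s2 for s2 in sigma2_b]
    return sigma2_b, lam

b_hat = [0.02, -0.50, 0.10]   # BLUPs from the common-shrinkage fit (made-up values)
h = [0.90, 0.60, 0.80]        # hat values of the same SNPs (made-up values)
sigma2_b, lam = hem_shrinkage(b_hat, h, sigma2_e=1.0)
# The SNP with the largest squared BLUP relative to 1 - h gets the least shrinkage
```

In this toy case the second SNP, with the largest estimated effect, receives a much smaller $\lambda_j$ than the first, which is the intended behavior: small effects are pushed further toward zero. An effect estimated as exactly zero would give a zero variance and is not handled by this sketch.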

Transformation via the animal model: Strandén and Garrick (2009) showed that the computations for an animal model including a genomic relationship matrix are equivalent to fitting an LMM including random SNP effects. Below we exploit this fact to derive the algorithm for HEM. A major contribution of the theory below is the derivation of a simple equation to compute the hat values for the SNP effects from the hat values for the animal effects (in step 5 of Fitting algorithm below). It should also be noted that the generalized RR part of the HEM algorithm does not use singular-value decomposition as proposed by Hastie and Tibshirani (2004) and Hastie et al. (2009) and described in the Appendix, but rather uses transformation between equivalent models as described below.

From (3) and (4), the generalized RR method, HEM, is quite easy to describe mathematically. However, to obtain estimated effects for all the SNPs, the matrix $Z'Z + \operatorname{diag}(\lambda)$ (size p × p) is too large to invert. So we need to make the equations computationally simple. This can be done by connecting (3) to an animal model. Let us define $G = ZZ'$, which is a matrix representing genomic kinship that indicates the relatedness between individuals. The animal model

$$y = X\beta + a + e \tag{6}$$

contains $a \sim N(0, G\sigma_a^2)$ as random effects for n individuals. The size of the MME for such an animal model is much smaller than (2), i.e.,

$$\begin{pmatrix} X'X & X' \\ X & I + \lambda G^{-1} \end{pmatrix} \begin{pmatrix} \hat\beta \\ \hat a \end{pmatrix} = \begin{pmatrix} X'y \\ y \end{pmatrix}. \tag{7}$$

Because of the mathematical equivalence, the $\lambda$ in (7) is identical to that in (2). In fact, there is no need to directly fit the MME (2) or (3) because all the information that links the estimated individual effects $\hat a$ to the SNP effects $\hat b$ is contained in the single genotype matrix Z. After fitting the animal model via (7), one can show that the following transformation holds (e.g., Pawitan 2001, p. 446),

$$\hat b = Z'G^{-1}\hat a, \tag{8}$$

where only the matrices $Z'$ (size p × n) and $G^{-1}$ (size n × n) are required, given that G is of full rank. Since $n \ll p$, the MME (2) and HEM (3) can be solved a lot faster by the animal model (7) + transformation (8) procedure. For HEM, the hat value for each SNP is required in the MME (3), and we show that, fortunately, a similar transformation can be applied for transforming the hat values of the animal effects to the SNP effects as well (see Fitting algorithm and the Appendix). Besides the efficiency, by avoiding huge matrices, a significant amount of memory can be saved so that almost any large number of SNPs can be loaded simultaneously.
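The equivalence between solving the SNP-level MME (2) directly and solving the animal-model MME (7) followed by the back-transformation (8) can be checked numerically on toy data. The sketch below is an illustration in Python with NumPy under simplifying assumptions: $\lambda$ is treated as known rather than estimated, the only fixed effect is an intercept, and all variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 15, 200, 5.0                    # toy sizes; lam assumed known
X = np.ones((n, 1))                         # intercept as the only fixed effect
Z = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Route 1: SNP model, MME (2), a (1 + p) x (1 + p) system
lhs_snp = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + lam * np.eye(p)]])
rhs_snp = np.concatenate([X.T @ y, Z.T @ y])
sol_snp = np.linalg.solve(lhs_snp, rhs_snp)
beta_snp, b_snp = sol_snp[:1], sol_snp[1:]

# Route 2: animal model, MME (7), a (1 + n) x (1 + n) system, then eq. (8)
G = Z @ Z.T
G_inv = np.linalg.inv(G)                    # requires G to be of full rank
lhs_animal = np.block([[X.T @ X, X.T],
                       [X, np.eye(n) + lam * G_inv]])
rhs_animal = np.concatenate([X.T @ y, y])
sol_animal = np.linalg.solve(lhs_animal, rhs_animal)
beta_animal, a_hat = sol_animal[:1], sol_animal[1:]
b_animal = Z.T @ G_inv @ a_hat              # back-transformation (8)

max_abs_diff = float(np.max(np.abs(b_snp - b_animal)))
```

Route 2 never forms a p × p matrix; the only operations involving p are the products that build G and apply $Z'$, which is the source of the speed and memory advantage described above.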

To evaluate the run time for fitting a linear mixed model or RR using our algorithm, we simulated a standard-normally distributed phenotype with sample sizes varied from 100 to 1000. Marker genotypes were also simulated and the number of markers varied from 10K to 1M.

Fitting algorithm

Below we present the fitting algorithm for HEM. Steps 1–4 fit SNP-BLUP and steps 5–8 fit a generalized RR. The algorithm also includes a Cholesky decomposition of the genomic relationship matrix G to simplify the computations and the transformation of hat values (in step 5).

Given a phenotype vector y (size n × 1) that belongs to any GLM (generalized linear model) family, e.g., binary, Poisson, gamma, etc., fixed effects design matrix X (size n × k), and the SNP genotype matrix Z (size n × p), the SNP-BLUP (RR) and HEM (generalized RR) can be computed as follows:

1. Calculate $G = ZZ'$, its Cholesky decomposition L s.t. $LL' = G$, and its inverse $G^{-1}$.

2. Fit a GLMM with response y, fixed effects X, and random effects design matrix L. Because of mathematical equivalence, this fits the animal model (6) as a GLMM with correlated random effects.

3. From step 2, store the estimated variance components $\hat\sigma_b^2$, $\hat\sigma_e^2$, and the animal effects $\hat a$. Calculate $\lambda = \hat\sigma_e^2 / \hat\sigma_b^2$.

4. Transform $\hat a$ back to the SNP effects $\hat b = Z'G^{-1}\hat a$.

5. Define

$$C_v = \frac{1}{\hat\sigma_e^2} \begin{pmatrix} X'X & X'L \\ L'X & L'L + \lambda I_n \end{pmatrix} \tag{9}$$

and divide the inverse of $C_v$ into blocks

$$C_v^{-1} = \begin{pmatrix} C_v^{11} & C_v^{12} \\ C_v^{21} & C_v^{22} \end{pmatrix}. \tag{10}$$

Define a transformation matrix $M = Z'G^{-1}L$. Calculate the hat value for each random SNP effect as

$$h_{jj} = 1 - M_j \left( I_n - C_v^{22} / \hat\sigma_b^2 \right) M_j', \tag{11}$$

where $M_j$ is the jth row of the transformation matrix M.

6. Define a diagonal matrix W with each diagonal element

$$w_{jj} = \frac{\hat b_j^2}{1 - h_{jj}} \tag{12}$$

and update G to be $G^* = ZWZ'$. Calculate $G^{*-1}$, and $L^*$ s.t. $L^*L^{*\prime} = G^*$.

7. Fit a GLMM with response y, fixed effects X, and random effects design matrix $L^*$.

8. From step 7, transform the updated individual effects $\hat a$ back to the SNP effects $\hat b = Z'G^{*-1}\hat a$.

In this algorithm, GLMMs are estimated based on penalized quasi-likelihood (PQL) for MME (see R package hglm and its algorithm in Rönnegård et al. 2010). To be comparable with the MME for the normal LMM, the notation $\sigma_e^2$ is used in the algorithm even for GLMM to denote the residual dispersion parameter. Theoretical details about the transformations are given in the Appendix.
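The hat-value transformation in step 5 can be checked against a brute-force computation on toy data. The sketch below, in Python with NumPy, covers only steps 1 and 5 for a normal response and makes several assumptions: the variance components are treated as already estimated (the GLMM fit of step 2 is not reproduced), the only fixed effect is an intercept, and the reference value is the random-effect part of the hat-matrix diagonal computed from the full SNP-level system, using $\sqrt{\lambda}$ in the augmented design so that $T'T$ matches the coefficient matrix of (2).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 12, 150
sigma2_e, sigma2_b = 1.0, 0.05              # variance components treated as already estimated
lam = sigma2_e / sigma2_b
X = np.ones((n, 1))                         # intercept only
Z = rng.standard_normal((n, p))
k = X.shape[1]

# Step 1: genomic relationship matrix, its Cholesky factor and inverse
G = Z @ Z.T
L = np.linalg.cholesky(G)                   # lower triangular, L @ L.T equals G
G_inv = np.linalg.inv(G)

# Step 5: C_v as in (9), with the inverse blocked as in (10)
C_v = np.block([[X.T @ X, X.T @ L],
                [L.T @ X, L.T @ L + lam * np.eye(n)]]) / sigma2_e
C22 = np.linalg.inv(C_v)[k:, k:]
M = Z.T @ G_inv @ L                         # p x n transformation matrix
inner = np.eye(n) - C22 / sigma2_b
h_fast = 1.0 - np.einsum("ij,jk,ik->i", M, inner, M)   # eq. (11), one value per SNP

# Brute-force check from the full (k + p) x (k + p) system
C_full = np.block([[X.T @ X, X.T @ Z],
                   [Z.T @ X, Z.T @ Z + lam * np.eye(p)]])
h_direct = lam * np.diag(np.linalg.inv(C_full))[k:]

max_abs_diff = float(np.max(np.abs(h_fast - h_direct)))
```

The fast route inverts only a (k + n) × (k + n) matrix, whereas the check inverts a (k + p) × (k + p) one; on real marker data only the former is practical.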

    Randomization test

Specifying a significance threshold for any whole-genome model has been a challenging problem. The predictors in LMM or GLMM for the random effects (i.e., BLUP) have prediction errors (e.g., Pawitan 2001), which could be used to construct t-like statistics. But this is properly applicable only when the number of random effects predictors is small, namely, when shrinkage does not much affect the test statistic distribution, since the random effects estimates are then not so different from those obtained if all the effects were estimated as fixed. It is no longer proper when the number of explanatory variables or genetic markers is much larger than the number of individuals, because the estimated effects are too heavily biased away from the real genetic effects (Zeng 1993; Rodolphe and Lefort 1993). So when a whole genome of markers is fitted together, the model ends up with too much shrinkage for the t-distribution to hold. Hence, current Bayesian methods (e.g., Xu 2003) just set up an empirical LOD score threshold (e.g., Che and Xu 2012) using the suggestions by Kidd and Ott (1984) and Risch (1991). Nevertheless, a genome-wide significance test can be of practical importance. Here, randomization/permutation is a solution if the computation is not too intensive for fitting all the markers. Since the HEM algorithm proposed in this article is computationally efficient, a genome-wide significance threshold can be determined by a randomization test.

In the analysis of the Arabidopsis thaliana GWAS data using HEM, the permutation test was performed to determine a 5% genome-wide significance threshold for QTL detection, where the phenotype was permuted 1000 times, and the 95% quantile of the maximum effects was calculated as the threshold.
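The permutation procedure described above (permute the phenotype, refit, record the maximum absolute effect, take the 95% quantile) can be sketched as follows. The effect estimator `marginal_effects` is only a cheap stand-in so that the example runs with the Python standard library; in the article the refitted model is HEM or SNP-BLUP. All function names, the toy data, and the reduced number of permutations are assumptions for illustration.

```python
import random

def marginal_effects(y, Z):
    """Stand-in effect estimator: per-SNP covariance with the phenotype."""
    n = len(y)
    y_mean = sum(y) / n
    return [sum((yi - y_mean) * zi[j] for yi, zi in zip(y, Z)) / n
            for j in range(len(Z[0]))]

def permutation_threshold(y, Z, fit, n_perm=1000, level=0.05, seed=0):
    """Upper (1 - level) quantile of max |effect| over phenotype permutations."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)                 # break the genotype-phenotype link
        maxima.append(max(abs(b) for b in fit(y_perm, Z)))
    maxima.sort()
    index = min(n_perm - 1, int((1.0 - level) * n_perm))
    return maxima[index]

rng = random.Random(42)
n, p = 30, 50                               # toy sizes
Z = [[rng.choice([0, 1, 2]) for _ in range(p)] for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
threshold = permutation_threshold(y, Z, marginal_effects, n_perm=200)
observed = [abs(b) for b in marginal_effects(y, Z)]
n_significant = sum(b > threshold for b in observed)
```

Because the maximum over all SNPs is recorded in each permutation, the resulting threshold controls the genome-wide error rate rather than the per-SNP rate; any fitting function with the same `fit(y, Z)` signature could be substituted for the stand-in.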

    Data

We applied HEM on three data sets. Using HEM, we searched for significant SNPs in a publicly available A. thaliana GWAS data set. In the other two data sets the predictive power of HEM in GS was assessed.

A. thaliana GWAS data: Atwell et al. (2010) performed GWA studies for 107 phenotypes of A. thaliana and successfully detected a set of candidate genes. Using the heteroscedastic effects model, we analyzed one defense-related binary trait out of their 107 published phenotypes: hypersensitive response to the bacterial elicitor AvrRpm1. This trait was chosen because it is under regulation of a known candidate gene, RPM1, so that we can validate our HEM method in terms of QTL detection. A total of 84 ecotypes were phenotyped (28 controls and 56 cases). The genotype data are from a 250K SNP chip including 216,130 available SNPs (http://arabidopsis.usc.edu).

GSA common simulated data: To compare different newly developed genomic evaluation methods, the Genetics Society of America (GSA) provides several common data sets for authors to analyze and report their results (Hickey and Gorjanc 2012). We chose the simulated livestock data structure to assess our method.

The total number of segregating sites across the genome was approximately 1.67 million. A random sample of 60,000 segregating sites was selected from the sequence to be used as SNPs on a 60K SNP array. In addition, a set of 9000 segregating sites was randomly selected from the sequence to be used as candidate QTL in two different ways: (1) a randomly sampled set, and (2) a randomly sampled set with the restriction that their minor allele frequencies (MAFs) should not exceed 0.30. Four different traits were simulated, assuming an additive genetic model. The first pair of traits was generated using the 9000 unrestricted QTL. For the first trait (PolyUnres), the allele substitution effect at each QTL was sampled from a standard normal distribution. For the second trait (GammaUnres) a random subset of 900 of the candidate QTL was selected with allele substitution effects sampled from a gamma distribution with a shape parameter of 0.4, scale parameter of 1.66 (Meuwissen et al. 2001), and a 50% chance of being positive or negative. The second pair of traits (PolyRes and GammaRes) was generated in the same way as the first pair except that the candidate QTL have the restriction that their MAF not be >0.30. Phenotypes with a heritability of 0.25 were generated for each trait.
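To make the gamma-effect setting concrete, the sketch below draws effect sizes from a gamma distribution with shape 0.4 and scale 1.66, assigns a random sign, and adds residual noise scaled so that the heritability is 0.25. It is not the GSA simulation pipeline: the genotypes are drawn uniformly from 0/1/2, the sizes are far smaller than in the data set, and the function names are hypothetical.

```python
import random

def simulate_gamma_effects(n_qtl, shape=0.4, scale=1.66, seed=0):
    """Gamma-distributed effect sizes with a 50% chance of either sign."""
    rng = random.Random(seed)
    return [rng.gammavariate(shape, scale) * rng.choice([-1.0, 1.0])
            for _ in range(n_qtl)]

def simulate_phenotypes(genotypes, effects, h2=0.25, seed=0):
    """Additive genetic values plus normal noise scaled to heritability h2."""
    rng = random.Random(seed)
    g = [sum(x * a for x, a in zip(row, effects)) for row in genotypes]
    g_mean = sum(g) / len(g)
    var_g = sum((v - g_mean) ** 2 for v in g) / (len(g) - 1)
    var_e = var_g * (1.0 - h2) / h2         # so that var_g / (var_g + var_e) equals h2
    sd_e = var_e ** 0.5
    return [v + rng.gauss(0.0, sd_e) for v in g], var_g, var_e

rng = random.Random(1)
n_ind, n_qtl = 200, 90                      # toy sizes, far smaller than the GSA data
genotypes = [[rng.choice([0, 1, 2]) for _ in range(n_qtl)] for _ in range(n_ind)]
effects = simulate_gamma_effects(n_qtl)
y, var_g, var_e = simulate_phenotypes(genotypes, effects)
```

With a shape parameter below 1 the gamma distribution is strongly skewed, so most simulated QTL have near-zero effects and a few are large, which is the situation where SNP-specific shrinkage is expected to matter most.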

Training and validation subsets were extracted from the data. The training set comprised the 2000 individuals in generations 4 and 5. The validation set comprised 1500 individuals sampled at random from generations 6, 8, and 10 (500 individuals from each generation). We fitted a whole-genome model using HEM and compared the prediction performance in the validation data set.

QTLMAS data: The third data set used in this article was simulated for the 14th QTLMAS workshop (http://jay.up.poznan.pl/qtlmas2010/; Szydlowski and Paczyńska 2011). A pedigree consisting of 3226 individuals in five generations (F0–F4) was simulated, where F0 contained 5 males and 15 females. Each female mated once and gave birth to around 30 progeny. Two traits were simulated, where one was quantitative (QT), and the other was binary (BT). Young individuals (F4 generation, individuals 2327–3226) had no phenotypic records. The genome was about 500 Mb long, consisting of five chromosomes, each of which contained about 100 Mb. Each individual was genotyped for 10,031 biallelic SNPs that were densely distributed along the genome. Regarding recombination rate, 1 cM was assumed to be 1 Mb; therefore, the size of the genome is about 500 cM.

Thirty-seven QTL were simulated along the genome, and no QTL existed on chromosome 5 (Table 1). All the QTL controlled QT, including 30 additive loci, 2 epistatic pairs, and 3 imprinted loci. Of the 30 additive QTL, 22 also controlled BT, namely that pleiotropic effects existed for the two traits. QT was mainly controlled by additive QTL 14 and 17, as well as the two epistatic pairs. BT was mainly controlled by additive QTL 14. Due to epistasis and imprinting, QT had a more complicated genetic architecture than BT. We have published (Shen et al. 2011) our previous analysis of this data set using a DHGLM (Lee and Nelder 2006). Considering HEM as a simple and efficient substitute for DHGLM, we reanalyzed the data by fitting all the markers and compared the results with the previous results.

    Results

Computational efficiency

On a single Intel Xeon E5520 2.27-GHz CPU, the computation was fast, especially when the number of individuals was small (Figure 1), since the computation-demanding parts in the algorithm depend mainly on the sample size. For a population with 100 individuals, even when there are 1 million markers, estimation of all the effects along the genome takes <2 min.

Table 1 Genetic models of simulated quantitative trait for the QTLMAS 2010 data (Szydlowski and Paczyńska 2011)

| QTL | Chr (a) | SNP (b) | Dist (c) (bp) | Freq (d) | Add (e) | QTL variance | Type |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 152 | 788 | 0.45 | 1.93 | 1.84 | Additive |
| 2 | 1 | 960 | 27540 | 0.64 | −1.56 | 1.13 | Additive |
| 3 | 1 | 1106 | 11564 | 0.67 | −1.56 | 1.09 | Additive |
| 4 | 1 | 1226 | 5083 | 0.30 | −1.68 | 1.19 | Additive |
| 5 | 2 | 2036 | 1852 | 0.84 | −1.92 | 0.97 | Additive |
| 6 | 2 | 2675 | 17414 | 0.38 | 1.02 | 0.48 | Additive |
| 7 | 2 | 3114 | 128504 | 0.14 | 1.69 | 0.69 | Additive |
| 8 | 2 | 3414 | 111127 | 0.47 | −1.32 | 0.87 | Additive |
| 9 | 2 | 3534 | 0 | 0.66 | 0.25 | 0.03 | Additive |
| 10 | 2 | 3553 | 15051 | 0.39 | 0.85 | 0.34 | Additive |
| 11 | 2 | 3946 | 96549 | 0.26 | 1.01 | 0.40 | Additive |
| 12 | 2 | 3959 | 1516 | 0.27 | 1.69 | 1.13 | Additive |
| 13 | 3 | 4318 | 8609 | 0.17 | −1.98 | 1.10 | Additive |
| 14 | 3 | 4483 | 17356 | 0.51 | 3.00 | 4.5 | Additive (major controlled) |
| 15 | 3 | 4615 | 44344 | 0.23 | −1.11 | 0.43 | Additive |
| 16 | 3 | 4980 | 5654 | 0.60 | −0.73 | 0.26 | Additive |
| 17 | 3 | 5488 | 0 | 0.53 | 3.00 | 4.49 | Additive (major controlled) |
| 18 | 3 | 5616 | 2462 | 0.49 | −0.76 | 0.29 | Additive |
| 19 | 3 | 5722 | 184175 | 0.83 | −0.95 | 0.26 | Additive |
| 20 | 3 | 5858 | 28506 | 0.88 | 1.66 | 0.58 | Additive |
| 21 | 3 | 6022 | 0 | 0.57 | 0.65 | 0.21 | Additive |
| 22 | 4 | 6224 | 45783 | 0.24 | 1.25 | 0.57 | Additive |
| 23 | 4 | 6423 | 4137 | 0.47 | 0.96 | 0.46 | Additive |
| 24 | 4 | 6684 | 40188 | 0.37 | 0.56 | 0.14 | Additive |
| 25 | 4 | 6833 | 650 | 0.53 | −0.31 | 0.05 | Additive |
| 26 | 4 | 6870 | 43162 | 0.28 | 1.55 | 0.96 | Additive |
| 27 | 4 | 6982 | 20468 | 0.6 | 0.93 | 0.42 | Additive |
| 28 | 4 | 7013 | 36740 | 0.57 | −0.22 | 0.02 | Additive |
| 29 | 4 | 7446 | 119 | 0.55 | 0.75 | 0.28 | Additive |
| 30 | 4 | 8024 | 27501 | 0.89 | 1.92 | 0.72 | Additive |
| 31 | 1 | 939 | 0 | 0.54 | 0 | 7.01 | Epistatic (pair 1) (f) |
| 32 | 1 | 959 | 0 | 0.56 | 0 | | Epistatic (pair 1) |
| 33 | 2 | 2715 | 0 | 0.5 | 0 | 4.18 | Epistatic (pair 2) (g) |
| 34 | 2 | 2727 | 0 | 0.51 | 0 | | Epistatic (pair 2) |
| 35 | 2 | 3102 | 0 | 0.55 | 0 | 2.16 | Imprinted (h) |
| 36 | 2 | 3623 | 0 | 0.54 | 0 | 2.20 | Imprinted (h) |
| 37 | 2 | 3776 | 0 | 0.56 | 0 | 2.17 | Imprinted (h) |

(a) Chromosome number. No QTL on chromosome 5.
(b) The closest SNP marker index.
(c) Distance from the QTL to the closest SNP marker.
(d) Frequency of allele 1.
(e) Additive effect: half the difference between homozygote means.
(f) Extra effect of each haplotype 1–1: 4.00. Frequency of haplotype 1–1: 0.35.
(g) Extra effect of each haplotype 1–1: 4.00. Frequency of haplotype 1–1: 0.17.
(h) Extra effect of paternal allele 1: 3.0.


Analysis of the Arabidopsis data

For the 84 individuals and 216,130 informative markers on the Arabidopsis trait AvrRpm1, the shrinkage effect was much stronger for HEM than SNP-BLUP (Figure 2). According to Atwell et al. (2010), this defense-related trait is essentially monogenically controlled by the gene RPM1. The analysis via a whole-genome model should validate such a strong monogenic effect in terms of QTL detection. In Figure 2, 5% genome-wide significance thresholds via permutation tests are provided for both SNP-BLUP and HEM. Here, SNP-BLUP is not appropriate for QTL mapping due to constant shrinkage along the genome (Xu 2003). By allowing different weights on different SNPs, HEM has the property that it shrinks the small effects down toward zero, highlights the QTL effects, and produces a reasonable genome-wide significance threshold obtained by permutation testing.

HEM also produces better resolution in mapping the candidate gene. A close-up of the region surrounding the RPM1 gene on chromosome 3 shows that the SNP with the largest $-\log_{10}$ P-value from the Wilcoxon GWAS (Atwell et al. 2010) also has the largest estimate from HEM, whereas the second largest estimate is found ~0.1 Mb away from RPM1 (Figure 3). Hence, a ranking of the top estimates results in a similar ranking as for the $-\log_{10}$ P-values from Atwell et al.'s GWAS, where HEM is better at separating the ranking of SNPs close to each other on the chromosome.

Analysis of the GSA simulated data

HEM is able to fit the entire 60K SNP chip on the GSA simulated data. We analyzed all the 10 replicates of the four simulated traits and performed the prediction using both SNP-BLUP and HEM (Table 2). It is not surprising that HEM is generally better in prediction than SNP-BLUP because of more flexible shrinkage. It is noteworthy that such an advantage in prediction is clearer when the QTL effects are skewed (Gamma) than when they are symmetrically (Normal) distributed. This is because the SNPs that have major genetic effects are highlighted more by the HEM shrinkage compared to SNP-BLUP. This is a good property for HEM since most of the time, one would expect the genes to carry skewed genetic effects (Hayes and Goddard 2001).

Analysis of the QTLMAS data

Here we focus on breeding value estimation for the QTLMAS 2010 data set that was previously analyzed by Shen et al. (2011). Due to the data size, the previous report could not fit all the markers into a DHGLM, but this is possible using HEM. Although HEM is theoretically an approximation of DHGLM, it is not worse than our previous DHGLM method in terms of breeding value estimation where the strongest effects match the simulated true QTL very well (Figure 4).

Figure 1 Run-time efficiency of the R package bigRR for different sizes of data. Each column in the figure was evaluated by 12 replicates, where the dot shows the median, the solid line shows the 25–75% quantile interval, and the dashed lines indicate the range from minimum to maximum.

    Figure 2 Estimated SNP effects for the Arabidopsis bacteria-hypersensitive trait AvrRpm1 (Atwell et al. 2010) using (A) ridge regression (SNPBLUP) and(B) heteroscedastic effects model, which are plotted against each other in C. The horizontal dashed lines in A and B indicate the 5% genome-widesignicance threshold from a randomization test using 1000 permutations.

    Figure 3 The signicant association peak for theArabidopsis bacteria-hypersensitive trait AvrRpm1(Atwell et al. 2010) from (A) heteroscedastic effectsmodel and (B) genome-wide association using Wil-coxon rank-sum test. The window of the candidategene RPM1 is indicated as a vertical line.

    Heteroscedastic Effects Model 1261

  • By taking into account all the markers in the genome, HEMwas able to improve the prediction of the young individualscompared to DHGLM. It did as good as the previous DHGLMfor the binary trait and gave a correlation between the truebreeding values (TBV) and estimated breeding values (EBV)of 0.72, whereas for the quantitative trait, HEM successfullyraised the correlation between TBV and EBV from 0.60 to0.64 compared to our previous report (Shen et al. 2011).

    Discussion

    The presented generalized RR algorithm, HEM, fits models in which the number of parameters p is much greater than the number of observations n. The focus of the article has been on applications in both GS and QTL detection, but the algorithm is expected to be of general use for applications of RR in other fields of research as well. The computational limitations of the bigRR package come mainly from the number of observations, and not the number of parameters. In our implementation of the algorithm, we used the hglm package (Rönnegård et al. 2010) in R for the variance component estimation, which is computationally feasible for data sets having up to 10,000 observations on a single-core laptop computer. On any computer that has a CUDA-supported graphics card, an advanced version of the package can be requested from the authors, which utilizes the GPU for matrix calculation, accelerating the computation even more.

    Compared to LMM and ordinary RR, the estimates from HEM are less sensitive to the assumption that the effects come from a common normal distribution and are therefore more robust. The method is computationally efficient due to its compression–decompression properties. In the Appendix, we show that the SNP-effects model (A7) with p effects to be estimated is compressed into an animal model (A8) whose size depends on n. In the decompression part, the SNP effects can quickly be estimated through a simple transformation from individual effects to SNP effects. These estimates are used to update the matrix G, which is subsequently used in a compressed animal model. The final estimates are then computed by decompressing the animal model once more.

    RR is known to be able to address collinearity (Hoerl and Kennard 1970b). It avoids computational trouble for an ill-conditioned data matrix and also solves the problem due to an ill-conditioned Fisher's information matrix (e.g., in Poisson and binomial GLM) (Hastie et al. 2009). In our analysis of the Arabidopsis data, we found that HEM seems to perform well in terms of correctly fine-mapping functional loci. This suggests that when linkage disequilibrium (LD) exists around a QTL, a clearer signal could be identified, which is a good property of the method, although further investigations are required to verify this property of HEM. The improvement of a stronger feature selection method compared to RR depends on the underlying genetic architecture. For a trait with only a few QTL, an even stronger feature selection method, e.g., least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), may perform much better. However, many complex traits, such as human height, have been shown to be very polygenic (e.g., Yang et al. 2011). At present, it is even more challenging in terms of both QTL mapping and genomic prediction when there are so many small and even undetectable QTL.

    Table 2 Summary of breeding value estimation for the 10 replicates of validation samples of the GSA common simulated data set (Hickey and Gorjanc 2012)

    Trait       QTL restriction  QTL distribution  RR COR (SE)      RR MSE (SE)      HEM COR (SE)     HEM MSE (SE)     Enhancement (%)  OR COR (P-value)  OR MSE (P-value)
    PolyUnres   -                Normal            0.2921 (0.0113)  2.0967 (0.5133)  0.2943 (0.0115)  2.0245 (0.5082)  0.74             8/10 (0.055)      10/10 (0.001)
    GammaUnres  -                Gamma             0.2801 (0.0114)  3.2790 (0.5930)  0.3029 (0.0101)  3.0752 (0.6036)  8.14             9/10 (0.011)      10/10 (0.001)
    PolyRes     MAF ≤ 0.30       Normal            0.2546 (0.0089)  1.1089 (0.1476)  0.2587 (0.0089)  1.0949 (0.1466)  1.61             6/10 (0.172)      9/10 (0.011)
    GammaRes    MAF ≤ 0.30       Gamma             0.2679 (0.0122)  2.1452 (0.4047)  0.2850 (0.0117)  2.0280 (0.3969)  6.40             10/10 (0.001)     10/10 (0.001)

    RR, ridge regression (SNP-BLUP); HEM, heteroscedastic effects model. A 60K microarray was used to genotype the 2000 individuals in the training set and 1500 in the validation set. Each simulated trait has heritability of 25% under regulation of 9000 simulated QTL. COR, average correlation coefficients between TBV and GEBV; MSE, average mean squared errors between TBV and GEBV; SE, standard error. Prediction enhancement was calculated based on the improvement in COR. OR, outdoing rate: the frequency that the heteroscedastic effects model dominates ridge regression.

    The method combines two ideas from our earlier articles. First, in Rönnegård and Lee (2010) and Shen et al. (2011), a DHGLM (Lee and Nelder 2006) was proposed. This model fits SNP-specific variance components using a random effects model also for the second level in Equation A18 (see the Appendix). By introducing HEM as a simplification of the DHGLM, one achieves a dramatic gain in speed. Second, Christensen and Lund (2010) suggest in their discussion that G could be weighted by a diagonal matrix D calculated from SNP effects and note, "However, incorporating uncertainty on such estimated SNP effects into the method seems less straight-forward." Here, we have incorporated this uncertainty by using prediction error variances in the estimation of D through computations of hat values (Equation 12). Calculation of the prediction error variances is an important part of HEM. It should also be noted that HEM is based on fitting MME for an animal model and that GS problems involving both genotyped and nongenotyped individuals can be solved following the method by Christensen and Lund (2010).

    Zhang et al. (2010) proposed a two-stage method, similar to ours, where the SNP-specific shrinkage parameters were calculated from the squared estimated SNP effects from a preliminary RR analysis (the method was referred to as TAP-BLUP by the authors). There are two important differences though. First, they did not include the uncertainty in the estimated SNP effects, which produces biased results. The calculations of prediction error variances in our proposed method are therefore an important contribution. Furthermore, in their preliminary RR analysis, a user-defined shrinkage parameter had to be given, and in their simulation study they used the true simulated values to calculate this shrinkage. In HEM, the shrinkage is estimated from the data.

    The well-known Sherman–Morrison–Woodbury formula (Sherman and Morrison 1950; Golub and Van Loan 1996) can be used to invert a big p × p matrix of low rank (e.g., Rönnegård et al. 2007) and could therefore be a possible alternative to our implementation. However, to obtain all the SNP effects using this formula, the big p × p matrix still needs to be stored, which is avoided in HEM. Furthermore, an important part of HEM is a transformation algorithm to obtain the hat values for all the SNPs, while this is not straightforward using the Sherman–Morrison–Woodbury formula.

    Figure 4 Analysis of the 14th QTLMAS workshop (Szydlowski and Paczynska 2011) common data set using HEM. The results of the quantitative trait (QT) and the binary trait (BT) are shown in the top and bottom, respectively. (A and B) The shrinkage estimates of the SNP effects across the genome. The red, green, and blue vertical bars indicate the simulated epistatic, imprinting, and additive QTL, where the width of each bar is proportional to the corresponding QTL effect. Chromosomes are separated by the dual-colored dots. (C and D) Compare the estimated breeding values (EBV) with the true breeding values (TBV).

    The proposed HEM is intended to be capable of addressing both feature selection and prediction. Certainly, such a universal capacity comes at a price: biased estimates due to shrinkage. Many statistical methods are intended to simultaneously perform feature selection and prediction, such as ridge regression or SNP-BLUP, LASSO, their combination elastic net (Zou and Hastie 2005), our proposed generalized ridge method HEM, and all the series of Bayesian methods in the genomic prediction area (e.g., Meuwissen et al. 2001). Taking SNP-BLUP for instance, especially for p ≫ n cases, so much shrinkage is given to each effect estimate, sacrificing the unbiasedness (BLUE) of the effects, to save degrees of freedom so that the model is estimable. Fortunately, combining all the overshrunk estimates, we are able to obtain a good prediction even when there are many small, undetectable effects (e.g., the results in human height by Yang et al. 2010, 2011).

    HEM can be used to fit all SNP effects in a single model and the estimated effects can be used to rank interesting SNPs for further investigation in GWAS. Furthermore, using the computational advantage of HEM, we are able to calculate genome-wide significance thresholds using permutation testing.
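    The permutation idea can be illustrated with a minimal base-R sketch. It is not the bigRR implementation: the helper names ridge_effects and permutation_threshold are hypothetical, the shrinkage parameter lambda is simply fixed rather than estimated, the data are simulated, and the threshold is taken as the upper 5% quantile of the largest absolute effect over permuted phenotypes.

```r
# Hypothetical helper: ridge estimates of SNP effects for a fixed shrinkage
# parameter lambda, computed through an n x n system instead of a p x p one.
ridge_effects <- function(y, Z, lambda) {
  G <- tcrossprod(Z)                                 # n x n matrix ZZ'
  drop(crossprod(Z, solve(G + diag(lambda, nrow(Z)), y - mean(y))))
}

# Genome-wide threshold: the (1 - alpha) quantile of the largest absolute
# effect obtained after shuffling the phenotype.
permutation_threshold <- function(y, Z, lambda, n_perm = 200, alpha = 0.05) {
  max_abs <- replicate(n_perm, max(abs(ridge_effects(sample(y), Z, lambda))))
  unname(quantile(max_abs, probs = 1 - alpha))
}

set.seed(1)
n <- 40; p <- 500
Z <- scale(matrix(rbinom(n * p, 2, 0.3), n, p))      # simulated genotypes
y <- 2 * Z[, 10] + rnorm(n)                          # one simulated QTL at SNP 10
b_hat <- ridge_effects(y, Z, lambda = 50)
threshold <- permutation_threshold(y, Z, lambda = 50)
significant <- which(abs(b_hat) > threshold)         # SNPs above the threshold
```

    Because every permutation only requires an n × n solve, the cost of the whole procedure grows with the number of permutations and with n, not with the cube of p.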

    A possible extension of the method would be to apply a more general autoregressive smoothing along each chromosome for the shrinkage values using DHGLM (applied in Rönnegård and Lee 2010; Shen et al. 2011). An important development would be to implement a computationally fast full DHGLM algorithm.

    Acknowledgments

    X.S. is funded by a Future Research Leaders grant from the Swedish Foundation for Strategic Research (SSF) to Örjan Carlborg. L.R. is funded by the Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning (FORMAS).

    Literature Cited

    Atwell, S., Y. S. Huang, B. J. Vilhjalmsson, G. Willems, M. Horton et al., 2010 Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465: 627–631.

    Bjørnstad, J. F., 1996 On the generalization of the likelihood function and the likelihood principle. J. Am. Stat. Assoc. 91: 791–806.

    Breslow, N. E., and D. G. Clayton, 1993 Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 88: 9–25.

    Che, X., and S. Xu, 2012 Generalized linear mixed models for mapping multiple quantitative trait loci. Heredity 109: 41–49.

    Christensen, O. F., and M. S. Lund, 2010 Genomic prediction when some animals are not genotyped. Genet. Sel. Evol. 42: 2.

    de los Campos, G., J. M. Hickey, R. Pong-Wong, H. D. Daetwyler, and M. P. L. Calus, 2013 Whole genome regression and prediction methods applied to plant and animal breeding. Genetics 193: 327–345.

    Dekkers, J. C. M., 2004 Commercial application of marker- and gene-assisted selection in livestock: strategies and lessons. J. Anim. Sci. 82: E313–E328.

    Gianola, D., G. de los Campos, W. Hill, E. Manfredi, and R. Fernando, 2009 Additive genetic variability and the bayesian alphabet. Genetics 183: 347–363.

    Golub, G., and C. Van Loan, 1996 Matrix Computations, Ed. 3. Johns Hopkins University Press, Baltimore.

    Habier, D., R. Fernando, K. Kizilkaya, and D. Garrick, 2011 Extension of the bayesian alphabet for genomic selection. BMC Bioinformatics 12: 186.

    Hastie, T., and R. Tibshirani, 2004 Efficient quadratic regularization for expression arrays. Biostatistics 5: 329–340.

    Hastie, T., R. Tibshirani, and J. Friedman, 2009 The Elements of Statistical Learning. Springer-Verlag, Berlin.

    Hayes, B., and M. Goddard, 2001 The distribution of the effects of genes affecting quantitative traits in livestock. Genet. Sel. Evol. 33: 209–229.

    Henderson, C. R., 1953 Estimation of variance and covariance components. Biometrics 9: 226–252.

    Henderson, C. R., 1984 Applications of Linear Models in Animal Breeding. University of Guelph, Guelph, Ontario, Canada.

    Hickey, J. M., and G. Gorjanc, 2012 Simulated data for genomic selection and genome-wide association studies using a combination of coalescent and gene drop methods. G3: Genes, Genomes, Genetics 2: 425–427.

    Hoerl, A., and R. Kennard, 1970a Ridge regression: applications to nonorthogonal problems. Technometrics 12: 69–82.

    Hoerl, A., and R. Kennard, 1970b Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12: 55–67.

    Kidd, K., and J. Ott, 1984 Power and sample size in linkage studies: Human Gene Mapping 7 (1984): Seventh International Workshop on Human Gene Mapping. Cytogenet. Cell Genet. 37: 510–511.

    Kingsmore, S. F., I. E. Lindquist, J. Mudge, D. D. Gessler, and W. D. Beavis, 2008 Genome-wide association studies: progress and potential for drug discovery and development. Nat. Rev. Drug Discov. 7: 221–230.

    Lee, Y., and J. A. Nelder, 2006 Double hierarchical generalized linear models (with discussion). Appl. Stat. 55: 139–185.

    Lee, Y., J. A. Nelder, and M. Noh, 2007 H-likelihood: problems and solutions. Stat. Comput. 17: 49–55.

    Lee, Y., J. A. Nelder, and Y. Pawitan, 2006 Generalized Linear Models with Random Effects - Unified Analysis via h-Likelihood. Chapman & Hall, London.

    Lynch, M., and B. Walsh, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.

    Malo, N., O. Libiger, and N. J. Schork, 2008 Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am. J. Hum. Genet. 82: 375–385.

    Månsson, K., and G. Shukur, 2011 On ridge parameters in logistic regression. Commun. Stat. 40: 3366–3381.

    Meuwissen, T., B. Hayes, and M. Goddard, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.

    Nagamine, Y., 2005 Transformation of QTL genotypic effects to allelic effects. Genet. Sel. Evol. 37: 579–584.

    Pawitan, Y., 2001 In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press, Oxford.

    R Development Core Team, 2010 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

    Risch, N., 1991 A note on multiple testing procedures in linkage analysis. Am. J. Hum. Genet. 48: 1058–1064.

    Rodolphe, F., and M. Lefort, 1993 A multi-marker model for detecting chromosomal segments displaying QTL activity. Genetics 134: 1277–1288.


    Rönnegård, L., and Ö. Carlborg, 2007 Separation of base allele and sampling term effects gives new insights in variance component QTL analysis. BMC Genet. 8. doi:10.1186/1471-2156-8-1.

    Rönnegård, L., and Y. Lee, 2010 Hierarchical generalized linear models have a great potential in genetics and animal breeding. Proceedings of the World Congress on Genetics Applied to Livestock Production, Leipzig, Germany.

    Rönnegård, L., K. Mischenko, S. Holmgren, and Ö. Carlborg, 2007 Increasing the efficiency of variance component quantitative trait loci analysis by using reduced-rank identity-by-descent matrices. Genetics 176: 1935–1938.

    Rönnegård, L., X. Shen, and M. Alam, 2010 hglm: a package for fitting hierarchical generalized linear models. R J. 2: 20–28.

    Shen, X., L. Rönnegård, and Ö. Carlborg, 2011 Hierarchical likelihood opens a new way of estimating genetic values using genome-wide dense marker maps. BMC Proc. 5(Suppl. 3): S14.

    Sherman, J., and W. J. Morrison, 1950 Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann. Math. Stat. 21: 124–127.

    Stranden, I., and D. Garrick, 2009 Technical note: derivation of equivalent computing algorithms for genomic predictions and reliabilities of animal merit. J. Dairy Sci. 92: 2971–2975.

    Szydlowski, M., and P. Paczyńska, 2011 QTLMAS 2010: simulated dataset. BMC Proc. 5(Suppl. 3): S3.

    Tibshirani, R., 1996 Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58: 267–288.

    VanRaden, P. M., 2008 Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423.

    Xu, S., 2003 Estimating polygenic effects using markers of the entire genome. Genetics 163: 789–801.

    Yang, J., B. Benyamin, B. P. McEvoy, S. Gordon, A. K. Henders et al., 2010 Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42: 565–569.

    Yang, J., T. A. Manolio, L. R. Pasquale, E. Boerwinkle, N. Caporaso et al., 2011 Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet. 43: 519–525.

    Yi, N., and S. Xu, 2008 Bayesian LASSO for quantitative trait loci mapping. Genetics 179: 1045–1055.

    Zeng, Z.-B., 1993 Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci. Proc. Natl. Acad. Sci. USA 90: 10972–10976.

    Zhang, Z., J. Liu, X. Ding, P. Bijma, D.-J. de Koning et al., 2010 Best linear unbiased prediction of genomic breeding values using a trait-specific marker-derived relationship matrix. PLoS ONE 5: e12648.

    Zou, H., and T. Hastie, 2005 Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67: 301–320.

    Communicating editor: F. Zou

    Appendix

    An Example in R

    Here we include the R code for fitting SNP-BLUP and the HEM using the Arabidopsis data as an example. The code generates subfigures of Figure 2 in black and white. The code can also be found as an embedded example in the bigRR package.

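    # Install bigRR from R-Forge
    install.packages("bigRR", repos = "http://r-forge.r-project.org")

    require(bigRR)
    data(Arabidopsis)
    X <- matrix(1, length(y), 1)   # intercept as the only fixed effect

    # Ordinary ridge regression (SNP-BLUP) on standardized genotypes
    SNP.BLUP.result <- bigRR(y = y, X = X, Z = scale(Z),
                             family = binomial(link = "logit"))

    # Heteroscedastic effects model, updated from the SNP-BLUP fit
    HEM.result <- bigRR.update(SNP.BLUP.result, scale(Z),
                               family = binomial(link = "logit"))

    dev.new(); plot(SNP.BLUP.result$u)
    dev.new(); plot(HEM.result$u)
    dev.new(); plot(SNP.BLUP.result$u, HEM.result$u)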

    Definitions of the Statistical Terminologies Used

    1. Ridge regression (RR): A shrinkage estimation method often used for fitting more explanatory variables than the number of observations. A common shrinkage is applied to all the effects, and the magnitude of shrinkage is usually determined via cross validation (see also Månsson and Shukur 2011, for a recent review on RR methods for binary data).

    2. Linear mixed model (LMM): A linear (regression) model including fixed and random effects. Treating effects as random, the model can handle more parameters than the number of observations. It provides shrinkage estimates for the random effects with a common magnitude of shrinkage, whereas there is no shrinkage applied on the estimated fixed effects. Furthermore, the covariates for the fixed effects and the random effects are assumed to be independent. The likelihood for LMM is equivalent to the ridge regression penalized likelihood but the magnitude of shrinkage is determined by the variance component estimates in LMM.

    3. Generalized RR: A ridge regression method allowing different magnitudes of shrinkage for different explanatory variables.

    4. Heteroscedastic effects model (HEM): A generalized RR method based on LMM theory, where the magnitudes of shrinkage for different effects are determined by the LMM random effects estimates and model hat values.

    5. Double hierarchical generalized linear model (DHGLM): A double-layer random effects model, which allows fitting any variance component in an LMM using another random effects model, so that the second layer of the model determines the weights in the first layer. The full model is established using the hierarchical likelihood (h-likelihood), and statistical inference can be performed on the basis of the extended likelihood theory (Bjørnstad 1996; Lee et al. 2007). Fitting DHGLM in GWAS was proposed by Rönnegård and Lee (2010) and Shen et al. (2011).

    Reducing the Dimension in Ordinary Ridge Regression Using Singular-Value Decomposition

    Computationally fast methods for fitting RR in p ≫ n problems have been proposed based on SVD. These methods were developed for RR but not generalized RR. To clarify the difference between our approach and algorithms using SVD for RR, we also describe the latter below. An RR model is given by

    $$y = X\beta + Zb + e, \tag{A1}$$

    where $\beta$ are fixed effects estimated without shrinkage and $b$ are effects estimated with shrinkage. When $\beta$ is simply an intercept term, the model can be reformulated by centering $Z$ and the estimates of $b$ are given by

    $$\hat{b} = (Z'Z + \lambda I_p)^{-1} Z'y, \tag{A2}$$

    where $I_p$ is the identity matrix with the subscript $p$ denoting the size, and $\lambda$ is the shrinkage parameter. Let $Z = UDV'$ be the SVD of $Z$ and define $R = UD$; then (Hastie and Tibshirani 2004; Hastie et al. 2009)

    $$\hat{b} = V (R'R + \lambda I_n)^{-1} R'y, \tag{A3}$$

    which reduces the size of the matrix to be inverted from p × p to n × n. Hence the parameter space is rotated to reduce the dimension, under the assumption that $\lambda$ is a constant.
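    As a numerical illustration of Equations A2 and A3, the following base-R sketch checks on simulated data that the p × p and the n × n routes give the same estimates. The dimensions, the value of lambda, and the variable names are arbitrary choices made for this example only.

```r
set.seed(2)
n <- 20; p <- 200; lambda <- 5
Z <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)   # column-centred markers
y <- rnorm(n)

# Direct ridge solution (A2): requires solving a p x p system.
b_direct <- solve(crossprod(Z) + diag(lambda, p), crossprod(Z, y))

# SVD route (A3): Z = U D V', R = U D, only an n x n system is solved.
s <- svd(Z)                          # s$u: n x n, s$d: length n, s$v: p x n
R <- s$u %*% diag(s$d)
b_svd <- s$v %*% solve(crossprod(R) + diag(lambda, n), crossprod(R, y))

max_abs_diff <- max(abs(b_direct - b_svd))   # agreement up to rounding error
```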

    Note that the equivalence between LMM and RR has conditions, especially in terms of the assumptions. In an LMM, covariates are separated into fixed and random effects, where the inference of the fixed effects, based on the marginal likelihood, gives unbiased estimates, while shrinkage estimates are obtained for the random effects. In RR, all the covariates are penalized, without separation of fixed and random effects. Philosophically, an LMM considers the random effects as a sample drawn from an underlying distribution with a dispersion parameter to be estimated, whereas ridge regression is simply a computational method that provides estimates when the model is oversaturated. The two become mathematically the same only when the penalty parameter selected in RR equals the ratio of the variance components in the corresponding LMM.

    Generalized Ridge Regression and Linear Mixed Models

    In the following subsections, methods from LMM theory will be used to develop a fast generalized RR algorithm for $p \gg n$, where $\lambda$ is allowed to be a vector of length $\le p$ (Hoerl and Kennard 1970a,b). Consider the linear mixed model

    $$y = X\beta + Zb + e, \tag{A4}$$

    where $b \sim N(0, \sigma_b^2 I_p)$, $e \sim N(0, \sigma_e^2 I_n)$, $\beta$ is a vector of fixed effects, and $b$ is a vector of random effects. This is equivalent to the above RR model and gives the same estimates for a known $\lambda$.

    The differences between LMM and RR are found in the estimation techniques used. For LMM, $\lambda$ is given by the variance components estimated using restricted maximum likelihood (REML) with $\lambda = \hat{\sigma}_e^2 / \hat{\sigma}_b^2$, whereas for RR $\lambda$ is computed using the generalized cross-validation (GCV) function GCV($\lambda$) = $e'e/(n - \mathrm{d.f.}_e)$, where $\mathrm{d.f.}_e$ is the effective degrees of freedom (Hastie et al. 2009). These two methods tend to give similar estimates of $\lambda$ (see Pawitan 2001, p. 488).

    In LMMs it is possible to include several variance components, which is equivalent to defining $\lambda$ as a vector. This is possible in generalized RR (Hoerl and Kennard 1970b) but the dimension reduction based on SVD assumes a constant $\lambda$. In generalized RR we have

    $$\hat{b} = (Z'Z + K)^{-1} Z'y, \tag{A5}$$

    where $K = \mathrm{diag}(\lambda)$ and $\lambda$ is the vector of shrinkage values. Below, we present how LMM theory can be used to reduce dimension from p to n also for the case of $\lambda$ being a vector of length p, and thereafter propose a method to give suitable values for $\lambda$.
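    That a vector of shrinkage values does not prevent the reduction from p to n can be checked numerically. The base-R sketch below is an illustration on simulated data, with arbitrary shrinkage values and no fixed effects; it relies on the matrix identity (Z'Z + K)^{-1} Z' = K^{-1} Z' (Z K^{-1} Z' + I_n)^{-1}, which holds for any positive diagonal K.

```r
set.seed(3)
n <- 15; p <- 100
Z <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
lambda <- runif(p, 0.5, 5)            # one (arbitrary) shrinkage value per effect

# Equation A5 as written: a p x p solve.
b_p <- solve(crossprod(Z) + diag(lambda), crossprod(Z, y))

# Same estimates through an n x n solve, using
# (Z'Z + K)^{-1} Z' = K^{-1} Z' (Z K^{-1} Z' + I_n)^{-1}.
ZKinv <- sweep(Z, 2, lambda, "/")     # Z K^{-1}: column j divided by lambda_j
b_n <- crossprod(ZKinv, solve(tcrossprod(ZKinv, Z) + diag(n), y))

max_abs_diff <- max(abs(b_p - b_n))   # agreement up to rounding error
```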

    The Linear Mixed Model Approach

    Here, we consider the estimation of an LMM with linear predictor $\eta$ and a diagonal weight matrix $D$ (size p × p) for the random effects

    $$\begin{aligned} y &= \eta + e \\ \eta &= X\beta + Zb \\ e &\sim N(0, \phi S^{-1}) \\ b &\sim N(0, \sigma_b^2 D), \end{aligned} \tag{A6}$$

    where $S$ is a diagonal matrix of weights, $\phi$ is the dispersion parameter equal to $\sigma_e^2$, and $\sigma_b^2 D$ is equivalent to the weight matrix W in the Fitting algorithm. This notation allows for a later extension to a generalized linear mixed model (GLMM). The diagonal matrices $K$ and $D$ are related as $K = D^{-1}\phi/\sigma_b^2$.

    To derive a computationally efficient implementation of the algorithm for p ≫ n, we present equivalent models to model (A6) and show how the estimates of the effects, and their associated prediction error variances, can be transformed between these. Prediction error variances are important to compute since they are the basis for calculations of standard errors and the effective degrees of freedom.

    Three different, but equivalent, specifications of the random effects are used and are referred to as the SNP model, the animal model, and the Cholesky model. For all three models the linear predictor $\eta$ is the same:

    SNP model

    $$\eta = X\beta + Zb, \qquad b \sim N(0, \sigma_b^2 D) \tag{A7}$$

    Animal model

    $$\eta = X\beta + a, \qquad a \sim N(0, \sigma_b^2 G), \qquad G = ZDZ' \tag{A8}$$

    Cholesky model

    $$\eta = X\beta + Lv, \qquad v \sim N(0, \sigma_b^2 I_n), \qquad LL' = G. \tag{A9}$$

    The use of equivalent LMMs in the research field of animal breeding and quantitative genetics is well established (Lynch and Walsh 1998; Rönnegård and Carlborg 2007). The contribution of this article is to present how LMM theory can be used for generalized RR, to show how the prediction error variances can be transformed between models, to implement the theory in a computationally efficient R package bigRR (including GLMM), and to apply it to a novel heteroscedastic effects model presented further below.

    Different Mixed Model Equations for the Equivalent Models

    For LMM, Henderson's MME are used to estimate both the fixed and random effects for given variance components. They can also be used iteratively to estimate variance components as implemented in the R package hglm (Rönnegård et al. 2010). Although the models above are equivalent, the MME are different.

    SNP model

    For the SNP model we have the MME

    $$\begin{pmatrix} X'SX & X'SZ \\ Z'SX & Z'SZ + \frac{\phi}{\sigma_b^2} D^{-1} \end{pmatrix} \begin{pmatrix} \beta \\ b \end{pmatrix} = \begin{pmatrix} X'Sy \\ Z'Sy \end{pmatrix}. \tag{A10}$$

    These MME are of size (k + p) × (k + p), where k is the number of columns in $X$. Hence, the size of the equations is very large for high-dimensional data.

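    Animal model

    Let the random effects $a$ be individual effects for each observation and $G = ZZ'$ the correlation matrix between these. Then $G$ is relatively small (n × n) and the MME are

    $$\begin{pmatrix} X'SX & X'S \\ SX & S + \frac{\phi}{\sigma_b^2} G^{-1} \end{pmatrix} \begin{pmatrix} \beta \\ a \end{pmatrix} = \begin{pmatrix} X'Sy \\ Sy \end{pmatrix} \tag{A11}$$

    of size (k + n) × (k + n). Hence, the size of these MME is much smaller than Equation A10 for p ≫ n.

    Cholesky model

    In a third equivalent model we define $LL' = G$ (where $L$ has size n × n) and the random effects $v$ are individual independent random effects. The MME are

    $$\begin{pmatrix} X'SX & X'SL \\ L'SX & L'SL + \frac{\phi}{\sigma_b^2} I_n \end{pmatrix} \begin{pmatrix} \beta \\ v \end{pmatrix} = \begin{pmatrix} X'Sy \\ L'Sy \end{pmatrix} \tag{A12}$$

    of size (k + n) × (k + n).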
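    The equivalence of the SNP-model and animal-model MME can be verified on a small simulated example. The base-R sketch below assumes S = I, D = I_p, an intercept as the only fixed effect, and a known ratio of the dispersion parameters; all names and values are choices made for this illustration. Both systems should return the same intercept, and the individual effects should equal Z times the SNP effects.

```r
set.seed(4)
n <- 12; p <- 30
X <- matrix(1, n, 1)                      # intercept only, so k = 1
Z <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
ratio <- 2                                # phi / sigma_b^2, treated as known
G <- tcrossprod(Z)                        # ZZ' (D = I_p)

# SNP model (A10): (k + p) x (k + p) system.
lhs_snp <- rbind(cbind(crossprod(X), crossprod(X, Z)),
                 cbind(crossprod(Z, X), crossprod(Z) + diag(ratio, p)))
sol_snp <- solve(lhs_snp, c(crossprod(X, y), crossprod(Z, y)))

# Animal model (A11): (k + n) x (k + n) system.
lhs_animal <- rbind(cbind(crossprod(X), t(X)),
                    cbind(X, diag(n) + ratio * solve(G)))
sol_animal <- solve(lhs_animal, c(crossprod(X, y), y))

beta_snp <- sol_snp[1];       b_hat <- sol_snp[-1]
beta_animal <- sol_animal[1]; a_hat <- sol_animal[-1]
```

    In this toy setting the SNP system has 31 equations and the animal system 13; for real marker data the first grows with p while the second stays at k + n.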

    Transformation of Effects Between Equivalent Models

    For p ≫ n, the size of the MME in models (A11) and (A12) is much smaller than in model (A10). The random effects can be transformed between these equivalent models (Lynch and Walsh 1998; Nagamine 2005) so that the estimated SNP effects $\hat{b}$ can easily be calculated from the individual effects $\hat{a}$ in model (A11)

    $$\hat{b} = Z'G^{-1}\hat{a}. \tag{A13}$$

    Furthermore, we have $\hat{a} = L\hat{v}$ so that

    $$\hat{b} = Z'G^{-1}L\hat{v}. \tag{A14}$$

    The matrix $Z$ is moderately large (n × p) but the transformation is a simple cross-product. Hence, the calculations can be made in parts without reading all of $Z$ into memory. They can also easily be parallelized if necessary.
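    A minimal base-R sketch of this decompression step is given below, assuming S = I, D = I_p, no fixed effects, and a known shrinkage parameter. The block size and all variable names are arbitrary; in a real application each block of columns of Z would be read from disk rather than sliced from a matrix in memory.

```r
set.seed(5)
n <- 25; p <- 1000; lambda <- 10
Z <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)

G <- tcrossprod(Z)                                # n x n
a_hat <- G %*% solve(G + diag(lambda, n), y)      # individual effects, no fixed effects
w <- solve(G, a_hat)                              # G^{-1} a_hat, length n

# Equation A13 evaluated block by block: only `chunk` columns of Z are
# needed at any one time (here they are simply sliced from memory).
chunk <- 250
b_hat <- numeric(p)
for (start in seq(1, p, by = chunk)) {
  cols <- start:min(start + chunk - 1, p)
  b_hat[cols] <- crossprod(Z[, cols, drop = FALSE], w)
}

# Equation A14: the same estimates from the Cholesky-model effects v_hat.
L <- t(chol(G))                                   # lower triangular, LL' = G
v_hat <- forwardsolve(L, a_hat)                   # a_hat = L v_hat
b_from_v <- crossprod(Z, solve(G, L %*% v_hat))

# Reference: ordinary ridge estimates from the p x p system.
b_ridge <- solve(crossprod(Z) + diag(lambda, p), crossprod(Z, y))
```

    Since each block is independent of the others, the loop body is also the natural unit to distribute if the transformation is parallelized.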

    Transformation of Prediction Error Variances Between Equivalent Models

    Not only the estimates, but also the prediction error variances (i.e., the diagonal elements of $\mathrm{Var}(v - \hat{v} \mid v)$), are important to compute to allow for model checking and inference. In the Cholesky model (Equation A12), let $C_v$ be

    $$C_v = \frac{1}{\phi} \begin{pmatrix} X'SX & X'SL \\ L'SX & L'SL + \frac{\phi}{\sigma_b^2} I_n \end{pmatrix}. \tag{A15}$$

    Decompose the inverse of $C_v$ as

    $$C_v^{-1} = \begin{pmatrix} C_v^{11} & C_v^{12} \\ C_v^{21} & C_v^{22} \end{pmatrix}. \tag{A16}$$

    Then the prediction covariance matrix is $\mathrm{Var}(v - \hat{v} \mid v) = \sigma_b^2 I_n - C_v^{22}$ (Henderson 1984). Define the jth diagonal element, $\mathrm{Var}(v - \hat{v} \mid v)$, as $V_{\hat{b}_j}$. Then these elements can be calculated separately as

    $$V_{\hat{b}_j} = \sigma_b^2 - M_j \left(\sigma_b^2 I_n - C_v^{22}\right) M_j', \tag{A17}$$

    where $M_j$ is the jth row of the transformation matrix $M = Z'G^{-1}L$.
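    Equation A17 can be evaluated for all SNPs with n × n matrices only. The base-R sketch below is an illustration under simplifying assumptions (S = I, D = I_p, no fixed effects, known variance components, simulated data); with p kept small, the result is compared with the diagonal of the inverse of the random-effects block of the SNP-model equations, which under these assumptions gives the same values.

```r
set.seed(6)
n <- 10; p <- 40
Z <- matrix(rnorm(n * p), n, p)
sigma2_b <- 0.5; phi <- 2                 # variance components, treated as known

G <- tcrossprod(Z)
L <- t(chol(G))                           # LL' = G

# C_v^{22}: with no fixed effects, C_v is just its random-effects block (A15).
C22_v <- solve((crossprod(L) + diag(phi / sigma2_b, n)) / phi)

# Equation A17 for all SNPs at once; M = Z' G^{-1} L is p x n.
M <- crossprod(Z, solve(G, L))
inner <- diag(sigma2_b, n) - C22_v
V_b <- sigma2_b - rowSums((M %*% inner) * M)

# Reference: the same quantities from the p x p SNP-model equations.
C22_b <- solve((crossprod(Z) + diag(phi / sigma2_b, p)) / phi)
max_abs_diff <- max(abs(V_b - diag(C22_b)))
```

    The row-wise product avoids forming the full p × p matrix; only the p diagonal elements are ever stored.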


    Extension to Penalized GLM Estimation

    For penalized generalized linear models (i.e., GLMM), the expectation of $y$ is connected to the linear predictor $\eta$ through a link function $g(\cdot)$ such that $g(E(y)) = \eta$. PQL estimation uses the same MME as above with a working weight matrix $S$ and $y$ being replaced by an adjusted response $z$, where both $S$ and $z$ are updated iteratively until convergence (Breslow and Clayton 1993). Such a penalized likelihood is similar to the one of ridge regression, where the sum of squared effects is used as a penalty term (Hastie et al. 2009). The penalty parameter is estimated as the ratio of the two dispersion parameters in the mixed model setting. For GLMM, the left-hand side of the MME can be described by the above formulae, e.g., Equations A10–A12, and the same transformations can be applied. The algorithm was implemented in the R (R Development Core Team 2010) package bigRR and uses the hglm (Rönnegård et al. 2010) package to estimate the variance components and individual effects $a$.
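    The iteration between working weights and adjusted response can be sketched for a binary trait with a logit link. The base-R function below, penalized_irls, is a hypothetical helper written for this illustration only: it has no fixed effects, it keeps the penalty parameter lambda fixed instead of estimating it from the dispersion parameters, and it solves the p × p system directly because p is small in the example.

```r
# Hypothetical helper: ridge-penalized logistic regression fitted by iterating
# weighted ridge solves, with the shrinkage parameter lambda held fixed.
penalized_irls <- function(y, Z, lambda, max_iter = 50, tol = 1e-10) {
  b <- numeric(ncol(Z))
  for (iter in seq_len(max_iter)) {
    eta <- drop(Z %*% b)
    mu <- plogis(eta)                       # inverse logit link
    w <- mu * (1 - mu)                      # working weights (diagonal of S)
    z <- eta + (y - mu) / w                 # adjusted response
    b_new <- drop(solve(crossprod(Z, w * Z) + diag(lambda, ncol(Z)),
                        crossprod(Z, w * z)))
    converged <- max(abs(b_new - b)) < tol
    b <- b_new
    if (converged) break
  }
  b
}

set.seed(7)
n <- 60; p <- 8
Z <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(Z[, 1]))
b_hat <- penalized_irls(y, Z, lambda = 2)

# At convergence the penalized score Z'(y - mu) - lambda * b is close to zero.
score <- crossprod(Z, y - plogis(drop(Z %*% b_hat))) - 2 * b_hat
```

    Each pass through the loop is a weighted ridge fit, so the dimension-reducing transformations described above apply to it in the same way as in the Gaussian case.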

    Using Generalized Ridge Regression to Calculate Heteroscedastic SNP Effects

    Here, we consider the estimation of the hierarchical model

    $$\begin{aligned} y &= X\beta + Zb + e \\ e &\sim N(0, \sigma_e^2 I) \\ b &\sim N(0, D) \\ D &= \mathrm{diag}(\sigma_{b_j}^2) \end{aligned} \tag{A18}$$

    having a second-level model

    $$\log(\sigma_{b_j}^2) = u_j, \tag{A19}$$

    where $u_j$ are fixed effects in the linear predictor for the SNP variances, and j is an index for the p different SNPs. The model $\log(\sigma_{b_j}^2) = u_j$ is saturated and $E[\hat{b}_j^2 / (1 - h_{jj})] \approx \sigma_{b_j}^2$, so $\hat{\sigma}_{b_j}^2$ are updated as

    $$\hat{\sigma}_{b_j}^2 = \frac{\hat{b}_j^2}{1 - h_{jj}}, \tag{A20}$$

    where $h_{jj}$ are the hat values for the random effects (Lee et al. 2006). The hat values are related to the prediction error variance as $h_{jj} = V_{\hat{b}_j} / \sigma_b^2$.

    In the current article, we consider estimation where the SNP-specific variance components $\sigma_{b_j}^2$ are updated twice and refer to it as the heteroscedastic effects model, which gives an increased shrinkage for small SNP effects compared to ordinary RR.
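    The update in Equation A20 can be sketched in base R as follows. This is an illustration rather than the bigRR code: the helper fit_generalized_rr is hypothetical, there are no fixed effects, the residual variance and the starting SNP variance are fixed by hand instead of being estimated with hglm, and only a single update of the SNP-specific variances is shown. The hat values are computed as h_jj = 1 − d_j z_j' (ZDZ' + σ_e² I_n)^{-1} z_j, where z_j is the jth column of Z and d_j the current variance of SNP j, so that only n × n matrices are inverted.

```r
# Hypothetical helper: generalized ridge fit with SNP-specific variances d_j,
# residual variance sigma2_e, and no fixed effects; works on n x n matrices.
fit_generalized_rr <- function(y, Z, d, sigma2_e) {
  ZD <- sweep(Z, 2, d, "*")                        # Z D
  A <- solve(tcrossprod(ZD, Z) + diag(sigma2_e, nrow(Z)))  # (Z D Z' + sigma2_e I)^{-1}
  b_hat <- drop(crossprod(ZD, A %*% y))            # D Z' A y
  hat <- 1 - d * colSums(Z * (A %*% Z))            # h_jj = 1 - d_j z_j' A z_j
  list(b = b_hat, hat = hat)
}

set.seed(8)
n <- 50; p <- 300
Z <- matrix(rnorm(n * p), n, p)
y <- 1.5 * Z[, 7] + rnorm(n)                       # a single simulated QTL at SNP 7

# Step 1: ordinary ridge regression (common variance for all SNPs).
rr <- fit_generalized_rr(y, Z, d = rep(0.01, p), sigma2_e = 1)

# Step 2: SNP-specific variances from Equation A20, then refit.
d_new <- rr$b^2 / (1 - rr$hat)
hem <- fit_generalized_rr(y, Z, d = d_new, sigma2_e = 1)
```

    SNPs with small first-step estimates receive small variances and are therefore shrunk harder in the refit, which is the behavior described above for small SNP effects.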

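    Here the transformation of prediction error variances between the Cholesky model (Equation A12) and the SNP model (Equation A10) is derived. Let $C_v$ be the left-hand side of the MME from the Cholesky model (Equation A12)

    $$C_v = \frac{1}{\phi} \begin{pmatrix} X'SX & X'SL \\ L'SX & L'SL + \frac{\phi}{\sigma_b^2} I_n \end{pmatrix}. \tag{A21}$$

    Decompose the inverse of $C_v$ as

    $$C_v^{-1} = \begin{pmatrix} C_v^{11} & C_v^{12} \\ C_v^{21} & C_v^{22} \end{pmatrix}. \tag{A22}$$

    Then the prediction covariance matrix is $\mathrm{Var}(v - \hat{v} \mid v) = \sigma_b^2 I_n - C_v^{22}$ (Henderson 1984). Furthermore, let $C_b$ be the left-hand side of the MME from the SNP model (Equation A10)

    $$C_b = \frac{1}{\phi} \begin{pmatrix} X'SX & X'SZ \\ Z'SX & Z'SZ + \frac{\phi}{\sigma_b^2} D^{-1} \end{pmatrix}. \tag{A23}$$

    Decompose the inverse of $C_b$ as

    $$C_b^{-1} = \begin{pmatrix} C_b^{11} & C_b^{12} \\ C_b^{21} & C_b^{22} \end{pmatrix}. \tag{A24}$$

    Then the prediction covariance matrix is (Henderson 1984)

    $$\mathrm{Var}(b - \hat{b} \mid b) = \sigma_b^2 I_p - C_b^{22}. \tag{A25}$$

    Define $M$ to be the matrix transforming effects $v$ to $b$ in Equation A14 so that $M = Z'G^{-1}L$; then

    $$\mathrm{Var}(b - \hat{b} \mid b) = M \, \mathrm{Var}(v - \hat{v} \mid v) \, M'. \tag{A26}$$

    Combining these two equations, we get

    $$\sigma_b^2 I_p - C_b^{22} = M \, \mathrm{Var}(v - \hat{v} \mid v) \, M', \tag{A27}$$

    i.e.,

    $$\sigma_b^2 I_p - C_b^{22} = M \left(\sigma_b^2 I_n - C_v^{22}\right) M', \tag{A28}$$

    so

    $$C_b^{22} = \sigma_b^2 I_p - M \left(\sigma_b^2 I_n - C_v^{22}\right) M'. \tag{A29}$$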
